Audio Lottery For The Win

Himanshu Dutta
8 min read · Jun 29, 2024


As conversational AI claims the title of the “next big thing” in the field, industry demand makes it ever more important to pursue smaller yet robust automatic speech recognition (ASR) systems.

Running ASR models on-device can be advantageous in terms of computational resources, latency, and user privacy.

The key to improving latency and shrinking the memory footprint is to use smaller models behind such ASR systems. Model size is generally quantified by the number of parameters, and reducing the parameter count comes at a tradeoff with performance. The performance of an ASR system is typically measured in terms of Word Error Rate (WER); lower WER indicates better performance. Ideally, we want models that are smaller, i.e., have fewer parameters, while performing as well as or even better than a larger model.
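
As a quick illustration of how model size relates to parameter count, here is a minimal PyTorch sketch (the LSTM below is a stand-in, not one of the paper's backbones):

```python
import torch.nn as nn

# A stand-in model; any ASR backbone (CNN-LSTM, Conformer, ...) works the same way.
model = nn.LSTM(input_size=80, hidden_size=512, num_layers=3)

num_params = sum(p.numel() for p in model.parameters())
size_mb = num_params * 4 / 1e6  # assuming 32-bit floats, 4 bytes per parameter
print(f"{num_params:,} parameters, ~{size_mb:.1f} MB at fp32")
```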

Popular methods applied to this task include network pruning, knowledge distillation, and parameter quantization, but all of them result in a non-negligible degradation in WER.

The Lottery Ticket Hypothesis (LTH) is a newer avenue being explored for obtaining such smaller models. LTH empirically demonstrates the existence of highly sparse matching subnetworks inside fully dense networks that can be independently trained from scratch to match or even surpass the performance of the larger model. Although LTH has been widely adopted for vision and language models, its study for speech recognition has remained largely unexplored.

We perform a comprehensive study of the application of LTH to the most popular ASR networks in both industry and academia: 1) CNN-LSTM with CTC loss, 2) RNN-Transducer, and 3) Conformer with CTC. Notably, we present results which indicate the following:

  • LTH can successfully be applied to ASR models, shrinking the three models above to 21%, 10.7% and 8.6% remaining weights, respectively.
  • The winning tickets exhibit structured sparsity, a class of model sparsity in which parameters are pruned in regular blocks, allowing hardware such as CPUs and GPUs to exploit the arrangement of the parameters and speed up computation.
  • The winning tickets are robust to noise, and transfer from the dataset they were optimized on to other datasets with varying amounts of noise.

Background

Word Error Rate: WER is a widely used metric for ASR, which counts the minimum number of word insertions, deletions and substitutions needed to convert a predicted transcript of a speech sample into the true transcript, normalized by the length of the true transcript. Lower values indicate better performance.
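
Concretely, WER is a word-level edit distance normalized by the reference length; a minimal sketch (not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```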

Subnetwork: A subnetwork of a model with N parameters is any network with the same architecture but fewer than N parameters, obtained by masking out some of the original weights.

Matching Subnetworks: A subnetwork S of a model M is called matching if it performs at least as well as the original model M on a pre-decided metric (in the paper, the authors consider WER).

Winning Ticket: We define a winning ticket as a subnetwork S (smaller model) of a model M which, when trained with the same algorithm and data, reaches similar or better performance compared to M in at most as many training steps.

Pruning Methods for Subnetwork Searching: We explore Iterative Weight Magnitude Pruning (IMP), the most widely used algorithm in the LTH literature. IMP performs three main steps: (1) Train the original model M on the complete dataset. (2) Eliminate the weights with the smallest magnitudes. (3) Rewind the remaining weights to their earlier values and retrain only those parameters. Multiple rounds of steps (2) and (3) are usually run to find smaller and smaller models.
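
A minimal PyTorch sketch of one IMP round follows; the paper's exact prune fraction and rewinding schedule may differ, and train_fn plus the mask bookkeeping are placeholders (masks start as all-ones tensors, one per named parameter):

```python
import torch

def imp_round(model, rewind_state, train_fn, masks, prune_frac=0.2):
    """One round of Iterative Weight Magnitude Pruning (IMP).

    masks:        dict of parameter name -> 0/1 float tensor, same shape as the weight
    rewind_state: weights to rewind to (e.g. copy.deepcopy(model.state_dict()) at init)
    train_fn:     the usual training loop; assumed to zero masked weights after each step
    """
    # (1) Train the remaining (unmasked) parameters on the complete dataset.
    train_fn(model, masks)

    # (2) Eliminate the prune_frac smallest-magnitude surviving weights, globally.
    alive = torch.cat([(p.abs() * masks[n]).flatten()
                       for n, p in model.named_parameters()])
    alive = alive[alive > 0]
    threshold = alive.sort().values[int(prune_frac * alive.numel())]
    with torch.no_grad():
        for n, p in model.named_parameters():
            masks[n] = masks[n] * (p.abs() > threshold).float()

    # (3) Rewind, so the next round retrains only the surviving parameters.
    model.load_state_dict(rewind_state)
    return masks
```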

Winning Tickets in Speech Recognition

The next question we answer is whether winning tickets exist for speech recognition. To answer this, we design experiments on all three classes of models introduced above. For each of them, we run IMP to extract multiple models at varying sparsity levels, and compare their performance using WER.

Performance of the three backbones at extreme sparsity and at their best performance, on the LibriSpeech test-clean subset.

We compare the extremely sparse models and the best-performing models against the full model. Remaining Weights (RW) indicates the percentage of weights still being trained and used at inference time. The results indicate the presence of winning tickets, which both improve performance and are smaller and lighter than the original model, confirming the Lottery Ticket Hypothesis (LTH) for ASR networks.

We next demonstrate the effectiveness of the Iterative Weight Magnitude Pruning (IMP) algorithm by comparing it against two other pruning algorithms on the same ASR models:

  • Random Pruning: Random pruning identifies subnetworks that keep the predefined (original) initialization values, but whose weights are masked at random.
  • Random Tickets: Random tickets are subnetworks that are initialized at random, but masked using the masks found by the IMP algorithm (see the sketch below).
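
A hedged sketch of how the two baselines differ from IMP in what they keep — masks or weights; the helper names and the re-initialization scheme are assumptions, not the paper's code:

```python
import torch

def random_pruning_masks(model, sparsity):
    # Random Pruning: keep the original (predefined) weight values,
    # but choose which weights to mask uniformly at random.
    return {n: (torch.rand_like(p) > sparsity).float()
            for n, p in model.named_parameters()}

def random_ticket_init(model, imp_masks, std=0.02):
    # Random Tickets: keep the masks found by IMP,
    # but re-initialize the weights at random (std is an assumption).
    with torch.no_grad():
        for p in model.parameters():
            torch.nn.init.normal_(p, std=std)
    return imp_masks
```
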
The WER curves of the best subnetworks produced by different pruning approaches.

We compare the three approaches by pruning a base model with each of them. In the graph above, we observe that pruning with IMP yields models that can be made sparser than with the other two approaches, without much adverse effect on WER.

We further probe a third and crucial question: can better-performing subnetworks be found by initializing the full-sized model with pretrained weights? To answer it, we run IMP on the CNN-LSTM backbone on the TED-LIUM dataset, initializing it with weights pretrained on the LibriSpeech and CommonVoice datasets.
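
In code, this only changes how the full model is initialized before the IMP search; a sketch, where build_cnn_lstm and the checkpoint paths are hypothetical placeholders:

```python
import torch

model = build_cnn_lstm()  # hypothetical constructor for the CNN-LSTM backbone

# theta_0: random initialization -- use the model as constructed above.
# theta_Libri / theta_CV: weights pretrained on LibriSpeech / CommonVoice.
init = "libri"  # one of "random", "libri", "cv"
if init == "libri":
    model.load_state_dict(torch.load("cnn_lstm_librispeech.pt"))  # placeholder path
elif init == "cv":
    model.load_state_dict(torch.load("cnn_lstm_commonvoice.pt"))  # placeholder path

# ...then run the IMP loop from the earlier sketch on TED-LIUM as before.
```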

The WER curves of initialization with pretrained models. We test the CNN-LSTM backbone on the TED-LIUM dataset.

In the graph above, θ0 is the random initialization, θLibri is the initialization from the LibriSpeech pre-trained model, and θCV is the initialization from the CommonVoice pre-trained model. We observe that the model's WER degrades rapidly when it is initialized with random weights, as opposed to initialization with LibriSpeech or CommonVoice pretrained weights, substantiating the claim that initializing ASR models with pretrained weights results in better winning tickets.

Comparison of different initialization strategies

We further analyze the performance of the three initializations on the sparsest model (at 4.4% remaining weights) and on the best sparse model with respect to WER. The pretrained weight initializations perform better than the random initialization in all three cases.

Study of Winning Tickets

Now that we have empirically shown the existence of winning tickets in ASR, we next study their properties. Specifically, we study three properties key to ASR applications: structured sparsity, transferability, and noise robustness.

Structured sparsity

The idea of structured sparsity is that weights in the full model are masked in structured blocks. Since such pruning yields a higher local density of active weights, hardware can exploit the layout to reduce execution time. To verify this, we apply block sparsity with 1×4 blocks to the subnetwork search and then evaluate whether the resulting network can meet the original model's performance.
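
A minimal sketch of constructing a 1×4 block mask by block magnitude (the paper's exact block-scoring rule may differ):

```python
import torch

def block_mask_1x4(weight, sparsity):
    """Mask a 2-D weight matrix in 1x4 blocks, dropping the lowest-magnitude blocks."""
    out_dim, in_dim = weight.shape
    assert in_dim % 4 == 0, "sketch assumes in_dim divisible by the block width"
    # Score each 1x4 block by its mean absolute weight.
    scores = weight.abs().reshape(out_dim, in_dim // 4, 4).mean(dim=-1)
    threshold = scores.flatten().sort().values[int(sparsity * scores.numel())]
    keep = (scores > threshold).float()          # 1 = keep the whole block
    return keep.repeat_interleave(4, dim=-1)     # expand back to the weight's shape

mask = block_mask_1x4(torch.randn(8, 16), sparsity=0.75)  # prune ~75% of blocks
```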

Results of structured sparsity study on TED-LIUM dataset.

The results in the table above indicate that block sparsity works at least as well as unstructured sparsity. The visualization below shows the difference between (a) unstructured sparsity and (b) structured sparsity.

Visualizations of weight matrix pruned with (a) unstructured sparsity and (b) block sparsity

Study of Transferability

In practical scenarios, the test utterances are recorded directly from users in the wild and may have very different distributions from the training utterances. To test adaptation and transferability to utterances that differ from those the model was trained on, we experiment with three test sets, from the TED-LIUM, CommonVoice and LibriSpeech datasets. These datasets differ in recording scenarios, noise levels, and speaker coverage, and hence are a good proxy for in-the-wild data. TED-LIUM is made up of recordings of TED Talks, CommonVoice is a collection of recordings from volunteers, and LibriSpeech is extracted from LibriVox audiobooks.

The WER curves of transferring winning tickets to target datasets: (a) TED-LIUM, (b) CommonVoice, and (c) LibriSpeech (test-clean).

The graphs above suggest that the winning tickets transfer across these varied conditions. As one would expect, the adaptability of all three variations is highest on the LibriSpeech test-clean dataset, as its data is cleaner than the other target test sets. This also demonstrates the adaptability of a model trained on noisy data to cleaner data.

Study of Noise Robustness

The training/adaptation speech utterances are mostly collected from users and are usually recorded in uncontrolled environments with notable background noise. Even standard ASR benchmarks such as LibriSpeech contain significant background noise, including in the “clean” subset.

To test this, we conduct experiments by adding noise generated from the DESED dataset, which consists of various sounds from domestic settings, to the TED-LIUM dataset, and retraining the winning tickets identified from TED-LIUM, CommonVoice, and LibriSpeech. The noise is scaled by a level between 0 and 1; we experiment with three noise levels: 0, 0.2 and 0.5.
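
A minimal sketch of such additive mixing (the study's exact scaling scheme is not spelled out here, so treat the formula as an assumption):

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, level: float) -> np.ndarray:
    """Mix a noise clip into a speech waveform at a level in [0, 1].

    Assumption: level linearly scales the noise amplitude; level 0 leaves
    the speech clean.
    """
    if len(noise) < len(speech):                       # loop short noise clips
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    return speech + level * noise[: len(speech)]

# The three settings used in the study: clean, mild, and heavy noise.
for level in (0.0, 0.2, 0.5):
    noisy = add_noise(np.random.randn(16000), np.random.randn(8000), level)
```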

Results of noise robustness study on the TED-LIUM dataset

As can be seen in the table above, the full model's performance degrades sharply as the noise level increases from 0 to 0.5. Comparing this with the performance of the extremely sparse model and the best pruned model, we see that the sparse models are more robust to noise than the full model.

Conclusion

In conclusion, the application of the Lottery Ticket Hypothesis (LTH) to automatic speech recognition (ASR) models presents a significant advancement in the field, offering a solution to the pressing need for smaller yet robust ASR systems. Through empirical studies, it has been demonstrated that winning tickets, smaller subnetworks of ASR models, exist and can substantially reduce model sizes while maintaining or even improving performance. Furthermore, the effectiveness of the Iterative Weight Magnitude Pruning (IMP) algorithm in producing sparser models without compromising performance highlights its superiority over other pruning methods. Leveraging pretrained weights from datasets like LibriSpeech and CommonVoice further enhances model performance, underscoring the importance of utilizing prior knowledge.

Additionally, properties such as structured sparsity, transferability, and noise robustness of winning tickets validate their adaptability to various scenarios and datasets, making them invaluable assets in real-world ASR applications. Overall, the application of LTH offers a promising avenue for developing more efficient, versatile, and robust ASR systems to meet the demands of the industry and improve user experiences.

Link to the original paper
