The ML-SUPERB 2.0 Challenge tasks participants with developing Automatic Speech Recognition (ASR) models that support over 154 languages. While this may sound daunting, we promise that it is much easier than it seems! Our ESPnet baseline recipe shows that you can get pretty decent performance by just fine-tuning a simple pre-trained wav2vec model. But what if you want to shoot for first place and get even better performance? We'll introduce some ideas you can try in this blog post.
For the most part, any off-the-shelf multilingual ASR model (such as Whisper or OWSM) should lead to fairly strong performance on the ML-SUPERB benchmark. The most difficult part lies in adapting those models to languages that are low-resource or entirely unseen in their pre-training. The easiest method would be to perform continual learning on those languages, but that may require significant GPU resources. An alternative is zero-shot language adaptation, such as through prompting or language embeddings. Another issue is language confusion, where the model does not correctly identify the language being spoken and thus generates text in the wrong writing system. In cases like these, simple text re-scoring methods can be enough to recover both the correct language and transcription.
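For instance, here is a minimal sketch of such a re-scoring step, assuming you already have a handful of candidate transcripts for each utterance (say, from different decoding runs or different models). We score each candidate with a multilingual causal language model and keep the one with the lowest per-token loss; the LM checkpoint named here is just an example, and any multilingual LM that covers your target languages should work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any multilingual causal LM works here; "ai-forever/mGPT" is just one example.
tok = AutoTokenizer.from_pretrained("ai-forever/mGPT")
lm = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
lm.eval()

def rescore(hypotheses: list[str]) -> str:
    """Return the hypothesis with the lowest average negative log-likelihood."""
    scores = []
    for hyp in hypotheses:
        ids = tok(hyp, return_tensors="pt").input_ids
        with torch.no_grad():
            # The causal-LM loss is the mean negative log-likelihood per token,
            # so lower is better.
            scores.append(lm(ids, labels=ids).loss.item())
    return hypotheses[scores.index(min(scores))]

# Usage: pass in the N-best candidates for one utterance. A transcript rendered
# in the wrong script usually gets a much worse LM score, so re-scoring tends to
# resolve language confusion for free.
# best = rescore(["hypothesis from model A", "hypothesis from model B"])
```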
It's expected that different models have different strengths and weaknesses. For example, some models may excel at ASR for Asian languages, while others better support European ones. One way you can achieve strong performance (while avoiding any sort of training) is to implement an agentic solution that chooses the best model for a given language. An easy way to do this is to break down the task of multilingual speech recognition into two subtasks: language identification (LID) and monolingual ASR. We first determine the language being spoken with an LID model. Then, our agent will choose the best ASR model for that language. For example, we can use Whisper for the 97 languages it supports and MMS for the other 56 languages in the test set. More complex solutions are also possible, such as running ASR with multiple models and using some heuristic or language model to select the most likely output.
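Below is a rough sketch of this routing idea, built on an off-the-shelf Whisper checkpoint and the MMS model from Hugging Face. The language split and the ISO-code mapping are placeholders we made up for illustration; in practice you would derive them from your dev-set results and the full list of MMS adapters.

```python
import librosa
import torch
import whisper  # pip install openai-whisper
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Languages we trust Whisper for (placeholder set; build yours from dev-set CER/WER).
WHISPER_LANGS = {"en", "de", "fr", "es", "ja", "zh"}
# Whisper returns ISO 639-1 codes while MMS adapters use ISO 639-3; this mapping is
# intentionally incomplete and only meant to illustrate the conversion step.
TO_ISO3 = {"sw": "swh", "am": "amh", "yo": "yor", "ln": "lin"}

whisper_model = whisper.load_model("large-v3")
mms_processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
mms_model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

def transcribe(path: str) -> str:
    # Step 1: LID + first-pass ASR with Whisper (transcribe() detects the language itself).
    result = whisper_model.transcribe(path)
    lang = result["language"]
    if lang in WHISPER_LANGS:
        return result["text"]

    # Step 2: for everything else, fall back to MMS with the matching language adapter.
    iso3 = TO_ISO3.get(lang, lang)
    mms_processor.tokenizer.set_target_lang(iso3)
    mms_model.load_adapter(iso3)
    audio, _ = librosa.load(path, sr=16_000)
    inputs = mms_processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = mms_model(**inputs).logits
    return mms_processor.decode(torch.argmax(logits, dim=-1)[0])
```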
A very resource-efficient way to participate in this challenge is to create a speech-capable Large Language Model (LLM) with Parameter-Efficient Fine-Tuning (PEFT). PEFT methods such as LoRA allow us to fine-tune foundation models with minimal computational resources, while still leveraging their full capabilities. For example, we can use LoRA to combine a pre-trained speech encoder (such as MMS, XEUS, or Whisper) and a pre-trained LLM (such as T5, Aya, or Llama). We insert trainable LoRA layers to fuse the models together, while the original parameters of the foundation models remain frozen.
Examples of these types of models include SALMONN, LTU-AS, and DiVA.
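As a concrete (but heavily simplified) illustration of this recipe, here is a sketch that bolts a frozen speech encoder onto a frozen LLM with a small trainable projector, and injects LoRA adapters into the LLM via the peft library. The checkpoint names are placeholders, and a real system would also need feature downsampling, prompt formatting, and proper batching and padding.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoints: swap in MMS, XEUS, or Whisper's encoder, and your LLM of choice.
speech_encoder = AutoModel.from_pretrained("facebook/mms-1b")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# A small trainable projector maps speech features into the LLM's embedding space.
projector = nn.Linear(speech_encoder.config.hidden_size, llm.config.hidden_size)

# Freeze the speech encoder entirely.
for p in speech_encoder.parameters():
    p.requires_grad = False

# Wrap the LLM with LoRA: only the low-rank adapters (and the projector above) are trained.
lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                      lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)

def compute_loss(speech_inputs: dict, text_ids: torch.LongTensor) -> torch.Tensor:
    """Prefix the projected speech features to the text and train with the usual LM loss."""
    with torch.no_grad():
        feats = speech_encoder(**speech_inputs).last_hidden_state
    speech_embeds = projector(feats)
    text_embeds = llm.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([speech_embeds, text_embeds], dim=1)
    # Mask out the speech prefix so the loss is only computed on the transcript tokens.
    prefix = torch.full(speech_embeds.shape[:2], -100,
                        dtype=text_ids.dtype, device=text_ids.device)
    labels = torch.cat([prefix, text_ids], dim=1)
    return llm(inputs_embeds=inputs_embeds, labels=labels).loss
```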
While the challenge rules prevent certain LLMs such as GPT-4o or Gemini from being part of the final submission (since they are locked behind APIs that require an internet connection), you can still use them to generate pseudo-labels for a model that you train yourself. For example, you can use them to pseudo-label unlabeled speech from low-resource languages, giving you access to more paired ASR training data. You can also try the opposite approach and generate the speech instead with a dedicated TTS model such as ElevenLabs, targeting accents that rarely occur in standard ASR datasets.
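A pseudo-labeling loop can be as simple as the sketch below. It assumes the OpenAI Python SDK and an audio-capable transcription model (the model name is an assumption on our part, so check the provider's current docs), and writes out (filename, transcript) pairs that you can feed into your own training recipe after some filtering, since pseudo-labels are inevitably noisy.

```python
import csv
from pathlib import Path
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

with open("pseudo_labels.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for wav in sorted(Path("unlabeled_speech").glob("*.wav")):
        with open(wav, "rb") as audio_file:
            # Model name is illustrative; use whichever API transcription model you can access.
            result = client.audio.transcriptions.create(
                model="gpt-4o-transcribe", file=audio_file)
        writer.writerow([wav.name, result.text])
```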
It's important to note that these methods are fairly orthogonal to each other: you can combine most of them to yield an even better solution. The difficult part is determining which methods to use and the best ways to combine them. In the end, it comes down to an empirical process of trial and error. Don't try to exhaustively test every method here, and definitely don't try every possible combination. Instead, use your intuition to determine which ones should work best, and experiment with a select few. Of course, you can also come up with your own ideas and explore techniques that aren't covered here; those might work even better!