We provide the following three datasets to the participants.
Note that the directory names of the language varieties development set do not necessarily correspond to the actual spoken language. Please refer to the "Expected Code" column for the language codes we use during evaluation.
The download links to these datasets:
Note that all the distributed datasets follow their original licenses.
The statistics are shown below (after excluding Norwegian data).
Dataset | Hours | #Utterances |
---|---|---|
ML-SUPERB 2.0 Training | 266.70 | 142,876 |
ML-SUPERB 2.0 Development | 44.26 | 24,019 |
Language Varieties Development | 9.34 | 7,095 |
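As a quick sanity check on the table above, the average utterance length follows directly from the hours and utterance counts. The sketch below hard-codes the three rows; the name `mean_utterance_seconds` is illustrative only and not part of any released tooling.

```python
# Rows copied from the statistics table: (hours, number of utterances).
STATS = {
    "ML-SUPERB 2.0 Training": (266.70, 142_876),
    "ML-SUPERB 2.0 Development": (44.26, 24_019),
    "Language Varieties Development": (9.34, 7_095),
}


def mean_utterance_seconds(hours: float, n_utterances: int) -> float:
    """Average utterance length in seconds."""
    return hours * 3600.0 / n_utterances


if __name__ == "__main__":
    for name, (hours, n_utts) in STATS.items():
        print(f"{name}: {mean_utterance_seconds(hours, n_utts):.2f} s per utterance")
```

This works out to between 6.6 and 6.7 s per utterance for the two ML-SUPERB 2.0 sets and about 4.7 s for the Language Varieties set.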
The directory structure of these datasets is
dataset_dir/
└── dataset_code/
    └── lang_code/
        ├── transcript_1h_train.txt
        ├── transcript_10min_train.txt
        ├── transcript_10min_dev.txt
        ├── transcript_10min_test.txt
        └── wav/
            └── wav_id.wav
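Given this layout, the transcript files for one split can be enumerated with a glob over the two directory levels. This is a minimal sketch; the function name `iter_transcripts` and the `split` argument are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_transcripts(dataset_dir: str, split: str = "10min_dev") -> Iterator[Tuple[str, str, Path]]:
    """Yield (dataset_code, lang_code, transcript_path) for one split."""
    root = Path(dataset_dir)
    # Layout: dataset_dir/dataset_code/lang_code/transcript_<split>.txt
    for path in sorted(root.glob(f"*/*/transcript_{split}.txt")):
        lang_code = path.parent.name
        dataset_code = path.parent.parent.name
        yield dataset_code, lang_code, path
```

Passing `split="1h_train"`, `"10min_train"`, or `"10min_test"` selects the other transcript files shown in the tree.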
The format of the metadata (transcript_*.txt) is
wav_id original_wav_id transcription
You can use the first field, wav_id, to locate the corresponding audio file in wav/. For example, each line can be split as follows:
wav_id, _, transcript = line.strip().split(maxsplit=2)
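The one-liner above can be wrapped into a small reader that also resolves the audio path. This is a sketch under two assumptions: the transcript files are UTF-8 encoded, and a line may have an empty transcription (in which case the three-way unpacking above would raise a `ValueError`). The names `Utterance` and `read_transcript` are hypothetical.

```python
from pathlib import Path
from typing import List, NamedTuple


class Utterance(NamedTuple):
    wav_path: Path
    transcript: str


def read_transcript(transcript_path: str) -> List[Utterance]:
    """Parse one transcript_*.txt file into (wav_path, transcript) records."""
    transcript_path = Path(transcript_path)
    wav_dir = transcript_path.parent / "wav"
    utterances = []
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split(maxsplit=2)
            if len(fields) < 2:
                continue  # blank or malformed line
            wav_id = fields[0]
            # The transcription may be empty, leaving only two fields.
            transcript = fields[2] if len(fields) == 3 else ""
            utterances.append(Utterance(wav_dir / f"{wav_id}.wav", transcript))
    return utterances
```

Using `maxsplit=2` keeps the transcription intact even though it contains spaces.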
All audio is converted to single-channel WAV format with a sampling rate of 16 kHz.
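If you want to confirm the audio format after downloading, the WAV header can be inspected with Python's standard `wave` module. This is a minimal sketch; it assumes the files are PCM-encoded WAV, which is what `wave` can read, and the helper name `is_16k_mono` is illustrative.

```python
import wave


def is_16k_mono(wav_path: str) -> bool:
    """Return True if the WAV header reports 16 kHz and a single channel."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```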
For the ML-SUPERB 2.0 dataset, the transcriptions in Google Drive and Huggingface-zip are kept in their original format. The transcriptions in Huggingface-pre-formatted have been normalized, including punctuation removal and uppercasing.
For the Language Varieties Development, the transcriptions in Google Drive and Huggingface are kept in their original format.
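If you work with the original-format transcriptions and want text that roughly resembles the normalized release, something like the following may help. It is only an approximation: the exact normalization rules are not specified here, so the choice to drop every Unicode punctuation character and to collapse whitespace is an assumption, as is the function name.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Drop Unicode punctuation, uppercase, and collapse whitespace."""
    # Unicode general categories starting with "P" are punctuation.
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return " ".join("".join(kept).upper().split())
```

Note that this also removes apostrophes and hyphens, which may or may not match the released normalization.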
Below we provide detailed descriptions for the development sets (transcript_10min_dev.txt).
The ML-SUPERB 2.0 development set covers 141 languages, sampled from 14 publicly available datasets. You can also check the original ML-SUPERB paper for more information.
The expected language codes are the same as the directory names in most cases, unless specified otherwise.
1. ALFFA (African Languages in the Field: speech Fundamentals and Automation)
License: MIT
Descriptions: This is a collection of Amharic, Swahili and Wolof speech. Each language has 10 min audio.
Languages | Code | #Utterances |
---|---|---|
Amharic | amh | 92 |
Swahili | swa | 171 |
Wolof | wol | 140 |
2. Common Voice: A Massively-Multilingual Speech Corpus
License: Creative Commons CC0
Descriptions: We use 84 languages from Common Voice. Each language has 10 min of speech.
Notes:
⁰ The original directory name lga is incorrect. The language is actually Ganda/Luganda, and we will use [lug] for evaluation.
¹ We remove Norwegian data.
² This is a legacy version of Japanese data. We still expect [jpn] during evaluation.
³ We expect [ori], the macrolanguage code for Oriya.
Languages | Code | #Utterances |
---|---|---|
Abkhazian | abk | 136 |
Arabic | ara | 142 |
Assamese | asm | 128 |
Bashkir | bak | 112 |
Basaa | bas | 145 |
Belarusian | bel | 108 |
Bengali | ben | 92 |
Breton | bre | 179 |
Bulgarian | bul | 108 |
Catalan | cat | 96 |
Czech | ces | 188 |
Chuvash | chv | 120 |
Central Kurdish | ckb | 131 |
Mandarin Chinese | cmn | 108 |
Hakha Chin | cnh | 167 |
Welsh | cym | 105 |
Danish | dan | 165 |
German | deu | 100 |
Dhivehi | div | 109 |
Greek | ell | 140 |
English | eng | 102 |
Esperanto | epo | 112 |
Estonian | est | 81 |
Basque | eus | 103 |
Persian | fas | 116 |
Finnish | fin | 124 |
French | fra | 112 |
Frisian | frr | 107 |
Irish | gle | 141 |
Galician | glg | 117 |
Guarani | grn | 135 |
Hausa | hau | 122 |
Hindi | hin | 109 |
Upper Sorbian | hsb | 97 |
Hungarian | hun | 119 |
Armenian | hye | 99 |
Interlingua | ina | 122 |
Indonesian | ind | 156 |
Italian | ita | 96 |
Japanese | jpn | 126 |
Kabyle | kab | 148 |
Georgian | kat | 169 |
Kazakh | kaz | 105 |
Kinyarwanda | kin | 111 |
Kyrgyz (Kirghiz) | kir | 136 |
Kurmanji Kurdish (Northern Kurdish) | kmr | 130 |
Latvian | lav | 179 |
Ganda (Luganda) | lga (we expect lug)⁰ | 108 |
Lithuanian | lit | 107 |
Malayalam | mal | 154 |
Marathi | mar | 94 |
Meadow Mari (Eastern Mari) | mhr | 110 |
Maltese | mlt | 121 |
Mongolian | mon | 105 |
Hill Mari (Western Mari) | mrj | 114 |
Erzya | myv | 97 |
Min Nan Chinese (Taiwanese Minnan) | nan | 203 |
Dutch | nld | 126 |
Japanese | org_jpn (we expect jpn)² | 106 |
Odia | ory (we expect ori)³ | 116 |
Polish | pol | 120 |
Portuguese | por | 128 |
Romanian | ron | 143 |
Russian | rus | 110 |
Yakut (Sakha) | sah | 91 |
Saraiki | skr | 124 |
Slovak | slk | 139 |
Slovenian | slv | 140 |
Spanish | spa | 106 |
Serbian | srp | 180 |
Swahili (macrolanguage) | swa | 103 |
Swedish | swe | 127 |
Tamil | tam | 104 |
Tatar | tat | 130 |
Thai | tha | 119 |
Toki Pona | tok | 162 |
Turkish | tur | 126 |
Uighur (Uyghur) | uig | 91 |
Ukrainian | ukr | 111 |
Urdu | urd | 144 |
Uzbek | uzb | 111 |
Vietnamese | vie | 158 |
Yue Chinese (Cantonese) | yue | 125 |
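The three directory names in this dataset that differ from the expected code can be handled with a small lookup. The mapping below is taken from the notes and table for this dataset; the names `COMMONVOICE_DIR_TO_EXPECTED` and `expected_code` are hypothetical.

```python
# Directory names in the Common Voice portion that differ from the code
# expected during evaluation (from the notes and table for this dataset).
COMMONVOICE_DIR_TO_EXPECTED = {
    "lga": "lug",      # Ganda/Luganda
    "org_jpn": "jpn",  # legacy version of the Japanese data
    "ory": "ori",      # Odia, evaluated under the Oriya macrolanguage code
}


def expected_code(dir_name: str) -> str:
    """Map a language directory name to the code expected at evaluation."""
    # All other directory names are already the expected code.
    return COMMONVOICE_DIR_TO_EXPECTED.get(dir_name, dir_name)
```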
3. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
License: Creative Commons Attribution 4.0
Descriptions: We use all the 102 languages from FLEURS. All languages have 10 min speech, except for isl, which has 7.2 min, and orm, which has 3.6 min.
Notes:
⁰ It is originally [azj] in FLEURS. Here we expect [aze], the macrolanguage code, for evaluation.
¹ FLEURS also uses [tgl]/Tagalog in the paper. Here we use [fil].
² We remove Norwegian data.
³ FLEURS uses [ory] for the individual language of Oriya. Here we use [ori] for the macrolanguage.
⁴ FLEURS uses [swh] for the individual language of Swahili. Here we use [swa] for the macrolanguage.
Languages | Code | #Utterances |
---|---|---|
Afrikaans | afr | 50 |
Amharic | amh | 58 |
Arabic | ara | 55 |
Assamese | asm | 50 |
Asturian | ast | 67 |
Azerbaijani | aze ⁰ | 49 |
Belarusian | bel | 45 |
Bengali | ben | 46 |
Bosnian | bos | 49 |
Bulgarian | bul | 63 |
Catalan | cat | 53 |
Cebuano | ceb | 38 |
Czech | ces | 50 |
Central Kurdish (Sorani Kurdish) | ckb | 50 |
Mandarin Chinese | cmn | 54 |
Welsh | cym | 41 |
Danish | dan | 51 |
German | deu | 48 |
Modern Greek | ell | 56 |
English | eng | 59 |
Estonian | est | 53 |
Persian | fas | 35 |
Filipino | fil ¹ | 30 |
Finnish | fin | 49 |
French | fra | 60 |
Fula (Fulah) | ful | 47 |
Irish | gle | 41 |
Galician | glg | 59 |
Gujarati | guj | 62 |
Hausa | hau | 32 |
Hebrew | heb | 71 |
Hindi | hin | 57 |
Croatian | hrv | 61 |
Hungarian | hun | 52 |
Armenian | hye | 54 |
Igbo | ibo | 32 |
Indonesian | ind | 48 |
Icelandic | isl | 36 |
Italian | ita | 44 |
Javanese | jav | 38 |
Japanese | jpn | 44 |
Kamba | kam | 39 |
Kannada | kan | 48 |
Georgian | kat | 58 |
Kazakh | kaz | 36 |
Kabuverdianu | kea | 46 |
Khmer | khm | 41 |
Kyrgyz (Kirghiz) | kir | 49 |
Korean | kor | 52 |
Lao | lao | 58 |
Latvian | lav | 49 |
Lingala | lin | 29 |
Lithuanian | lit | 58 |
Luxembourgish | ltz | 55 |
Ganda | lug | 39 |
Luo (Kenya and Tanzania) | luo | 46 |
Malayalam | mal | 37 |
Marathi | mar | 49 |
Macedonian | mkd | 52 |
Maltese | mlt | 49 |
Mongolian | mon | 52 |
Maori | mri | 33 |
Malay | msa | 61 |
Burmese | mya | 40 |
Nepali | nep | 53 |
Dutch | nld | 60 |
Pedi (Northern Sotho) | nso | 32 |
Nyanja | nya | 39 |
Occitan | oci | 42 |
Oriya (macrolanguage) | ori ³ | 50 |
Oromo | orm | 19 |
Punjabi | pan | 55 |
Polish | pol | 67 |
Portuguese | por | 47 |
Pashto (Pushto) | pus | 51 |
Romanian | ron | 57 |
Russian | rus | 55 |
Slovak | slk | 54 |
Slovenian | slv | 66 |
Shona | sna | 43 |
Sindhi | snd | 49 |
Somali | som | 48 |
Spanish | spa | 50 |
Serbian | srp | 58 |
Swahili (macrolanguage) | swa ⁴ | 39 |
Swedish | swe | 52 |
Tamil | tam | 51 |
Telugu | tel | 53 |
Tajik | tgk | 43 |
Thai | tha | 54 |
Turkish | tur | 50 |
Ukrainian | ukr | 56 |
Umbundu | umb | 27 |
Urdu | urd | 55 |
Uzbek | uzb | 53 |
Vietnamese | vie | 48 |
Wolof | wol | 39 |
Xhosa | xho | 46 |
Yoruba | yor | 36 |
Cantonese Chinese (Yue Chinese) | yue | 54 |
Zulu | zul | 39 |
4. googlei18n_asr: Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali
Paper | Javanese | Sundanese | Sinhala | Bengali | Nepali
License: Attribution-ShareAlike 4.0 International
Descriptions: We sample 10-min speech for each of these 5 languages.
Languages | Code | #Utterances |
---|---|---|
Bengali | ben | 171 |
Javanese | jav | 104 |
Nepali | nep | 182 |
Sinhala | sin | 137 |
Sundanese | sun | 110 |
5. googlei18n_tts
This is a collection of the following papers/datasets.
a. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
Descriptions: We use only the Bengali data from this project.
b. Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Paper | Catalan | Basque | Galician
License: Attribution-ShareAlike 4.0 International
Descriptions: We use all three languages.
c. Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Paper | Gujarati | Kannada | Malayalam | Marathi | Tamil
License: Attribution-ShareAlike 4.0 International
Descriptions: We use the Gujarati, Kannada, Malayalam, Marathi, and Tamil data.
d. Rapid development of TTS corpora for four South African languages
License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Descriptions: We use the Sesotho (Southern Sotho), Setswana (Tswana), and isiXhosa (Xhosa) data.
e. Developing an Open-Source Corpus of Yoruba Speech
License: Attribution-ShareAlike 4.0 International
Descriptions: It contains a single language Yoruba.
Summary for the above 5 datasets: There is 10-min speech for each of the 13 languages.
Languages | Code | #Utterances |
---|---|---|
Bengali | ben | 109 |
Catalan | cat | 76 |
Basque | eus | 85 |
Galician | glg | 86 |
Gujarati | guj | 91 |
Kannada | kan | 85 |
Malayalam | mal | 125 |
Marathi | mar | 89 |
Southern Sotho | sot | 107 |
Tamil | tam | 126 |
Tswana | tsn | 114 |
Xhosa | xho | 127 |
Yoruba | yor | 146 |
6. All Together Now: The Living Audio Dataset (LAD)
License: Apache-2.0 license
Descriptions: Each of the 4 languages has 10-min speech.
Languages | Code | #Utterances |
---|---|---|
English | eng | 128 |
Irish | gle | 184 |
Dutch | nld | 133 |
Russian | rus | 189 |
7. The M-AILABS Speech Dataset
License: please refer to this section.
Descriptions: We use all of the 8 languages and each has 10-min speech.
Languages | Code | #Utterances |
---|---|---|
German | deu | 78 |
English | eng | 79 |
French | fra | 77 |
Italian | ita | 112 |
Polish | pol | 81 |
Russian | rus | 64 |
Spanish | spa | 93 |
Ukrainian | ukr | 68 |
8. mexico-el
This is a collection of the following papers/datasets.
a. Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec
License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Descriptions: This is a collection of Yolóxochitl Mixtec Speech.
b. Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation
License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Descriptions: This is a collection of Highland Puebla Nahuatl Speech.
c. Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Descriptions: This is a collection of Totonac Speech.
Summary for the above 3 datasets: We have 10-min speech for each of them.
Languages | Code | #Utterances |
---|---|---|
Highland Puebla Nahuatl | azz | 113 |
Highland Totonac | tos | 171 |
Yoloxochitl Mixtec | xty | 99 |
9. MLS: A Large-Scale Multilingual Dataset for Speech Research
License: CC BY 4.0
Descriptions: This dataset contains read audiobooks in the following 8 languages. The dev set has 10 min of speech for each of them.
Languages | Code | #Utterances |
---|---|---|
German | deu | 39 |
English | eng | 40 |
French | fra | 40 |
Italian | ita | 39 |
Dutch | nld | 40 |
Polish | pol | 42 |
Portuguese | por | 39 |
Spanish | spa | 41 |
10. The NCHLT speech corpus of the South African languages
License: Creative Commons Attribution 3.0 Unported
Descriptions: This is a collection of 11 official languages of South Africa. There is 10-min speech for each of them.
Languages | Code | #Utterances |
---|---|---|
Afrikaans | afr | 191 |
English | eng | 225 |
South Ndebele (isiNdebele) | nbl | 128 |
Northern Sotho (Sepedi, Pedi) | nso | 189 |
Southern Sotho (Sesotho) | sot | 170 |
Swati (Siswati) | ssw | 103 |
Tswana (Setswana) | tsn | 169 |
Tsonga (Xitsonga) | tso | 166 |
Venda (Tshivenda) | ven | 132 |
Xhosa (isiXhosa) | xho | 135 |
Zulu (isiZulu) | zul | 128 |
11. NST ASR Databases
Paper | Norwegian | Danish | Swedish
License: Creative_Commons-ZERO (CC-ZERO)
Descriptions: Each has 10-min of speech.
Notes:
⁰ We remove Norwegian data.
Languages | Code | #Utterances |
---|---|---|
Danish | dan | 116 |
Swedish | swe | 125 |
12. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening
License: CC BY-SA 4.0
Descriptions: This is a collection of spoken Wikipedia articles in English, German, and Dutch. There is 10-min speech for each.
Languages | Code | #Utterances |
---|---|---|
German | deu | 207 |
English | eng | 239 |
Dutch | nld | 210 |
13. VoxForge
License: GNU General Public License
Descriptions: We choose the following 8 languages.
Languages | Code | #Utterances |
---|---|---|
German | deu | 121 |
English | eng | 115 |
French | fra | 103 |
Italian | ita | 90 |
Dutch | nld | 136 |
Portuguese | por | 162 |
Russian | rus | 68 |
Spanish | spa | 70 |
14. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
License: CC0
Descriptions: This is a collection of European Parliament speech. Each language has 10-min speech, except for Estonian/est, which has 7.65 min, and Lithuanian/lit, which has 0.75 min.
Languages | Code | #Utterances |
---|---|---|
Czech | ces | 57 |
German | deu | 64 |
English | eng | 62 |
Estonian | est | 47 |
Finnish | fin | 62 |
French | fra | 59 |
Croatian | hrv | 53 |
Hungarian | hun | 51 |
Italian | ita | 48 |
Lithuanian | lit | 3 |
Dutch | nld | 83 |
Polish | pol | 54 |
Romanian | ron | 49 |
Slovak | slk | 55 |
Slovenian (Slovene) | slv | 49 |
Spanish | spa | 54 |
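The per-language durations quoted for this dataset (for example, the shorter Estonian and Lithuanian sets) can be checked by summing the lengths of the audio files listed in a transcript file. This is a sketch that assumes PCM WAV files readable by the standard `wave` module and UTF-8 transcripts; `split_minutes` is a hypothetical helper name.

```python
import wave
from pathlib import Path


def split_minutes(transcript_path: str) -> float:
    """Total duration, in minutes, of the utterances listed in one transcript file."""
    transcript_path = Path(transcript_path)
    wav_dir = transcript_path.parent / "wav"
    seconds = 0.0
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(maxsplit=1)
            if not fields:
                continue  # skip blank lines
            wav_path = wav_dir / f"{fields[0]}.wav"
            with wave.open(str(wav_path), "rb") as wav:
                # Duration = number of frames / sampling rate.
                seconds += wav.getnframes() / wav.getframerate()
    return seconds / 60.0
```

Reading the wav_id values from the transcript, rather than globbing wav/, restricts the total to a single split.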
Overall, the Language Varieties development set covers 56 accents and dialects, sampled from 10 publicly available datasets. We show the number of utterances, the language code we use in this released dataset, and the language code we expect during evaluation for each language variety.
Note that the language codes used as directory names are not necessarily the ones used for evaluation; they only serve to distinguish the varieties from one another. We either follow the code in the original paper or define a new code if a language variety does not have an ISO 639-3 code.
The "Expected Code" column shows the actual codes used for evaluation. The "Dialects" column shows which dialect is spoken. The "Accents" column shows which accent the speaker has.
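Because the same directory code can mean different things in different datasets (for example, `ven` is Venezuelan Spanish here but Venda in NCHLT), a lookup keyed by source dataset is one way to recover the expected code. The sketch below follows the "Expected Code" columns in the tables of this section. The dataset keys marked hypothetical are placeholders, not confirmed directory names, and should be adjusted to the dataset_code directories actually present on disk.

```python
# Expected code per source dataset, following the tables in this section.
DATASET_TO_EXPECTED = {
    "sada": "ara",        # hypothetical key
    "voxpopuli": "eng",   # hypothetical key (accented English speech)
    "openslr83": "eng",
    "globe": "eng",       # hypothetical key
    "l2arctic": "eng",    # hypothetical key
    "swissdial": "deu",   # hypothetical key
    "cretan": "ell",
    "messenian": "ell",
    "openslr_spa": "spa",
}

# For ms_speech the directory code is already the expected code.
PASS_THROUGH_DATASETS = {"ms_speech"}


def expected_variety_code(dataset_code: str, lang_code: str) -> str:
    """Return the code expected at evaluation for one language variety."""
    if dataset_code in PASS_THROUGH_DATASETS:
        return lang_code
    # Raises KeyError for unknown datasets rather than guessing.
    return DATASET_TO_EXPECTED[dataset_code]
```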
Arabic
1. SADA: Saudi Audio Dataset for Arabic
License: CC BY-NC-SA 4.0.
Descriptions: The dataset contains 4 Arabic dialects. We sample a 10-min dev set for each dialect from the original development split.
We expect [ara] as the predicted language code.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Khaliji | afb | [ara] | 222 |
Najdi | ars | [ara] | 153 |
Modern Standard Arabic | arb | [ara] | 44 (5 min) |
Hijazi | acw | [ara] | 175 |
English
1. VoxPopuli
License: CC0
Descriptions: We use the "Accented speech transcribed data" from VoxPopuli, which is a collection of English speech with different European accents. We sample a 10-min dev set for each accent, except for Lithuanian, Croatian, and Slovene.
We expect [eng] as the predicted language code.
Accents | Code | Expected Code | #Utterances |
---|---|---|---|
Dutch/nl | nld | [eng] | 66 |
German/de | deu | [eng] | 66 |
Czech/cs | ces | [eng] | 56 |
Polish/pl | pol | [eng] | 60 |
French/fr | fra | [eng] | 60 |
Hungarian/hu | hun | [eng] | 78 |
Finnish/fi | fin | [eng] | 61 |
Romanian/ro | ron | [eng] | 67 |
Slovak/sk | slk | [eng] | 65 |
Spanish/es | spa | [eng] | 72 |
Italian/it | ita | [eng] | 61 |
Estonian/et | est | [eng] | 54 |
2. Crowdsourced high-quality UK and Ireland English Dialect speech data set (openslr83)
License: Attribution-ShareAlike 4.0 International
Descriptions: The dataset contains English dialects spoken in the UK and Ireland. We sample a 10-min dev set for each dialect. More specifically, we sample 5 min from female speakers and 5 min from male speakers, except for Irish English, where we sample 10 min solely from male speakers.
We expect [eng] as the predicted language code.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Irish | gle | [eng] | 96 |
Midlands | mid | [eng] | 90 |
Northern | nor | [eng] | 93 |
Scottish | sco | [eng] | 98 |
Southern | sou | [eng] | 91 |
Welsh | wel | [eng] | 93 |
3. GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech
License: cc0-1.0
Descriptions: The dataset contains English speech with accents from all over the world. We sample a 10-min dev set for each of the following 9 accents from the val split.
We expect [eng] as the predicted language code.
Accents | Code | Expected Code | #Utterances |
---|---|---|---|
Canadian English | can | [eng] | 174 |
Filipino | fil | [eng] | 167 |
Australian English | aus | [eng] | 166 |
United States English | use | [eng] | 177 |
England English | bre | [eng] | 173 |
New Zealand English | nze | [eng] | 167 |
India and South Asia (India, Pakistan, Sri Lanka) | sae | [eng] | 171 |
Irish English | gle | [eng] | 181 |
Scottish English | sco | [eng] | 153 |
4. L2-ARCTIC: a non-native English speech corpus
License: CC BY-NC 4.0
Descriptions: The dataset contains English speech with 6 different accents. For each accent, we use 1 male speaker and 1 female speaker as the dev set, sampling 5 min from each speaker for a total of 10 min per accent.
We expect [eng] as the predicted language code.
Accents | Code | Expected Code | #Utterances |
---|---|---|---|
Arabic | ara | [eng] | 170 |
Mandarin | cmn | [eng] | 147 |
Hindi | hin | [eng] | 208 |
Korean | kor | [eng] | 176 |
Spanish | spa | [eng] | 153 |
Vietnamese | vie | [eng] | 164 |
Swiss German
1. SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German
License: Creative Commons Attribution-NonCommercial 4.0 International License
Descriptions: The dataset contains Swiss German spoken in 8 regions. For each of the 8 dialects, we sample a 10-min dev set. The language codes for directory names follow the original paper.
We expect [deu] as the predicted language code.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Aargau | ag | [deu] | 164 |
Bern | be | [deu] | 131 |
Basel | bs | [deu] | 146 |
Graubünden | gr | [deu] | 143 |
Luzern | lu | [deu] | 174 |
St. Gallen | sg | [deu] | 114 |
Wallis | vs | [deu] | 140 |
Zürich | zh | [deu] | 142 |
Greek
1. Speech Recognition for Greek Dialects: A Challenging Benchmark
Paper | Cretan Data | Messenian Data
Descriptions: We sample a 10-min dev set from each of the Cretan and Messenian datasets. The Cretan data is under "cretan/" and Messenian data is under "messenian/". The language codes for directory names are the first 3 letters of the dialect names.
We expect [ell] as the predicted language code.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Cretan | cre | [ell] | 290 |
Messenian | mes | [ell] | 139 |
Hindi
1. Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages (ms_speech)
License: Here is a quote from the dataset website:
Data provided in this dataset shall not be used for commercial purposes. You may use the data solely for research purposes. If you publish your findings, you must provide the following attribution: “Data provided by Microsoft and SpeechOcean.com”.
Descriptions: We sample a 10-min dev set from the Test splits of the 3 dialects.
We expect different language codes for them.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Tamil | tam | [tam] | 109 |
Telugu | tel | [tel] | 101 |
Gujarati | guj | [guj] | 95 |
Spanish
1. Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech (openslr_spa)
Paper | SLR61 | SLR71 | SLR72 | SLR73 | SLR74 | SLR75
License: Attribution-ShareAlike 4.0 International
Descriptions: This is a collection of 6 Latin American dialects of Spanish speech. For each of the 6 dialects except SLR74 (Puerto Rico Spanish), we sample 5 min from male speakers and 5 min from female speakers. For SLR74, we sample a 10-min dev set from female speakers only. The language codes for directory names are the first 3 letters of the dialect names.
We expect [spa] as the predicted language code.
Dialects | Code | Expected Code | #Utterances |
---|---|---|---|
Argentinian | arg | [spa] | 127 |
Chilean | chi | [spa] | 107 |
Colombian | col | [spa] | 121 |
Peruvian | per | [spa] | 106 |
Puerto Rico | pue | [spa] | 100 |
Venezuelan | ven | [spa] | 125 |