Data

We provide the following three datasets to the participants.

  • A training set from the ML-SUPERB 2.0 public set. This is the baseline training dataset and covers 141 languages.
  • A development set from ML-SUPERB 2.0 that covers the same 141 languages as the training set.
  • A development set focusing on various language varieties and covering 56 dialects and accents.

Note that the directory names of the language varieties development set do not necessarily correspond to the actual spoken language. Please refer to the "Expected Code" column for the language codes we use during evaluation.

The download links to these datasets:

Note that all distributed datasets follow their original licenses.

The statistics are shown below (after excluding Norwegian data).

Dataset    Hours    #Utterances
ML-SUPERB 2.0 Training    266.70    142,876
ML-SUPERB 2.0 Development    44.26    24,019
Language Varieties Development    9.34    7,095

The directory structure of these datasets is

dataset_dir/
└── dataset_code/
    └── lang_code/
        ├── transcript_1h_train.txt
        ├── transcript_10min_train.txt
        ├── transcript_10min_dev.txt
        ├── transcript_10min_test.txt
        └── wav/
            └── wav_id.wav

The format of the metadata (transcript_*.txt) is

wav_id    original_wav_id    transcription

You can use the first field, wav_id, to locate the corresponding audio file in wav/. For example, each line can be split as follows:

wav_id, _, transcript = line.strip().split(maxsplit=2)
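As a slightly fuller sketch of the same idea, the helper below reads one metadata file and pairs each line with its audio path. The function name read_transcript, the UTF-8 encoding, and the handling of lines with a missing transcription are assumptions made for illustration, not part of the released tooling.

```python
from pathlib import Path


def read_transcript(transcript_path):
    """Yield (wav_path, original_wav_id, transcription) for each metadata line."""
    transcript_path = Path(transcript_path)
    # Per the directory layout, audio lives in wav/ next to the transcript file.
    wav_dir = transcript_path.parent / "wav"
    with open(transcript_path, encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(maxsplit=2)
            # Pad so that a line without a transcription still unpacks cleanly.
            parts += [""] * (3 - len(parts))
            wav_id, original_wav_id, transcript = parts
            yield wav_dir / f"{wav_id}.wav", original_wav_id, transcript
```

Using maxsplit=2 keeps any whitespace inside the transcription intact, since only the first two separators are consumed.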

Audio

All audio is converted to single-channel WAV format with a sampling rate of 16 kHz.
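If you want to confirm the audio format after downloading, a minimal check with Python's standard wave module could look like the sketch below. The function name check_wav is hypothetical, and the sketch assumes the files are PCM-encoded WAV, which is the only encoding the wave module reads.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file has the expected sampling rate and channel count."""
    with wave.open(str(path), "rb") as w:
        return (
            w.getframerate() == expected_rate
            and w.getnchannels() == expected_channels
        )
```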

Text

For the ML-SUPERB 2.0 dataset, the transcriptions in Google Drive and Huggingface-zip are kept in their original format. The transcriptions in Huggingface-pre-formatted have been normalized, including punctuation removal and uppercasing.

For the Language Varieties Development set, the transcriptions in Google Drive and Huggingface are kept in their original format.
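To bring original-format transcriptions closer to the normalized form described above, something like the following sketch may help. It is only an approximation: the exact rules behind the pre-formatted release are not specified here, so the function name normalize_text, the use of Unicode punctuation categories, and the whitespace collapsing are all assumptions.

```python
import unicodedata


def normalize_text(text):
    """Remove punctuation characters and uppercase the remaining text."""
    # Unicode general categories starting with "P" cover punctuation marks.
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    # Collapse whitespace runs left behind by removed punctuation.
    return " ".join("".join(kept).upper().split())
```

Note that this also strips apostrophes and hyphens, which the official normalization may treat differently.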

Development Dataset Breakdown

Below we provide detailed descriptions for the development sets (transcript_10min_dev.txt).
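The per-language utterance counts reported in this section can be recomputed from a local copy, assuming it follows the dataset_dir/dataset_code/lang_code layout shown earlier. The sketch below counts non-empty lines in each transcript_10min_dev.txt; the function name count_dev_utterances and the UTF-8 encoding are assumptions.

```python
from collections import Counter
from pathlib import Path


def count_dev_utterances(dataset_dir):
    """Count non-empty dev transcript lines per (dataset_code, lang_code)."""
    counts = Counter()
    pattern = "*/*/transcript_10min_dev.txt"
    for path in sorted(Path(dataset_dir).glob(pattern)):
        lang_code = path.parent.name
        dataset_code = path.parent.parent.name
        with open(path, encoding="utf-8") as f:  # encoding is an assumption
            counts[(dataset_code, lang_code)] = sum(1 for line in f if line.strip())
    return counts
```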

ML-SUPERB 2.0 Development Set

This dev set covers 141 languages, sampled from 14 publicly available datasets. You can also check the original ML-SUPERB paper for more information.

The expected language codes are the same as the directory names in most cases, unless specified otherwise.

1. ALFFA (African Languages in the Field: speech Fundamentals and Automation)

Paper | Dataset

License: MIT

Descriptions: This is a collection of Amharic, Swahili and Wolof speech. Each language has 10 min audio.

Languages    Code    #Utterances
Amharic    amh    92
Swahili    swa    171
Wolof    wol    140

2. Common Voice: A Massively-Multilingual Speech Corpus

Paper | Dataset

License: Creative Commons CC0

Descriptions: We use 84 languages from Common Voice. Each language has 10 min of speech.

Notes:

⁰ The original directory name lga is incorrect. The language is actually Ganda/Luganda, and we will use [lug] for evaluation.

¹ We remove Norwegian data.

² This is a legacy version of Japanese data. We still expect [jpn] during evaluation.

³ We expect [ori] the macrolanguage for Oriya.

Languages    Code    #Utterances
Abkhazian    abk    136
Arabic    ara    142
Assamese    asm    128
Bashkir    bak    112
Basaa    bas    145
Belarusian    bel    108
Bengali    ben    92
Breton    bre    179
Bulgarian    bul    108
Catalan    cat    96
Czech    ces    188
Chuvash    chv    120
Central Kurdish    ckb    131
Mandarin Chinese    cmn    108
Hakha Chin    cnh    167
Welsh    cym    105
Danish    dan    165
German    deu    100
Dhivehi    div    109
Greek    ell    140
English    eng    102
Esperanto    epo    112
Estonian    est    81
Basque    eus    103
Persian    fas    116
Finnish    fin    124
French    fra    112
Frisian    frr    107
Irish    gle    141
Galician    glg    117
Guarani    grn    135
Hausa    hau    122
Hindi    hin    109
Upper Sorbian    hsb    97
Hungarian    hun    119
Armenian    hye    99
Interlingua    ina    122
Indonesian    ind    156
Italian    ita    96
Japanese    jpn    126
Kabyle    kab    148
Georgian    kat    169
Kazakh    kaz    105
Kinyarwanda    kin    111
Kyrgyz (Kirghiz)    kir    136
Kurmanji Kurdish (Northern Kurdish)    kmr    130
Latvian    lav    179
Ganda (Luganda)    lga (we expect lug)⁰    108
Lithuanian    lit    107
Malayalam    mal    154
Marathi    mar    94
Meadow Mari (Eastern Mari)    mhr    110
Maltese    mlt    121
Mongolian    mon    105
Hill Mari (Western Mari)    mrj    114
Erzya    myv    97
Min Nan Chinese (Taiwanese Minnan)    nan    203
Dutch    nld    126
Norwegian Nynorsk    nno¹    126
Japanese    org_jpn (we expect jpn)²    106
Odia    ory (we expect ori)³    116
Polish    pol    120
Portuguese    por    128
Romanian    ron    143
Russian    rus    110
Yakut (Sakha)    sah    91
Saraiki    skr    124
Slovak    slk    139
Slovenian    slv    140
Spanish    spa    106
Serbian    srp    180
Swahili (macrolanguage)    swa    103
Swedish    swe    127
Tamil    tam    104
Tatar    tat    130
Thai    tha    119
Toki Pona    tok    162
Turkish    tur    126
Uighur (Uyghur)    uig    91
Ukrainian    ukr    111
Urdu    urd    144
Uzbek    uzb    111
Vietnamese    vie    158
Yue Chinese (Cantonese)    yue    125
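Since three Common Voice directory names differ from the code expected at evaluation (see the notes for this dataset), a small lookup like the sketch below can translate directory names into expected codes. The names COMMONVOICE_EXPECTED and expected_code are hypothetical; the three overrides are taken from the notes, and every other directory name is assumed to be the expected code already.

```python
# Directory name -> expected code, for the Common Voice exceptions noted above.
COMMONVOICE_EXPECTED = {
    "lga": "lug",      # Ganda/Luganda
    "org_jpn": "jpn",  # legacy Japanese data
    "ory": "ori",      # Oriya macrolanguage
}


def expected_code(dir_name, overrides=COMMONVOICE_EXPECTED):
    """Return the code expected at evaluation for a Common Voice directory name."""
    return overrides.get(dir_name, dir_name)
```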

3. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Paper | Dataset

License: Creative Commons Attribution 4.0

Descriptions: We use all the 102 languages from FLEURS. All languages have 10 min speech, except for isl, which has 7.2 min, and orm, which has 3.6 min.

Notes:

⁰ It is originally [azj] in FLEURS. Here we expect [aze] the macrolanguage for evaluation.

¹ FLEURS also uses [tgl]/Tagalog in the paper. Here we use [fil].

² We remove Norwegian data.

³ FLEURS uses [ory] for the individual language of Oriya. Here we use [ori] for the macrolanguage.

⁴ FLEURS uses [swh] for the individual language of Swahili. Here we use [swa] for the macrolanguage.

Languages    Code    #Utterances
Afrikaans    afr    50
Amharic    amh    58
Arabic    ara    55
Assamese    asm    50
Asturian    ast    67
Azerbaijani    aze⁰    49
Belarusian    bel    45
Bengali    ben    46
Bosnian    bos    49
Bulgarian    bul    63
Catalan    cat    53
Cebuano    ceb    38
Czech    ces    50
Central Kurdish (Sorani Kurdish)    ckb    50
Mandarin Chinese    cmn    54
Welsh    cym    41
Danish    dan    51
German    deu    48
Modern Greek    ell    56
English    eng    59
Estonian    est    53
Persian    fas    35
Filipino    fil¹    30
Finnish    fin    49
French    fra    60
Fula (Fulah)    ful    47
Irish    gle    41
Galician    glg    59
Gujarati    guj    62
Hausa    hau    32
Hebrew    heb    71
Hindi    hin    57
Croatian    hrv    61
Hungarian    hun    52
Armenian    hye    54
Igbo    ibo    32
Indonesian    ind    48
Icelandic    isl    36
Italian    ita    44
Javanese    jav    38
Japanese    jpn    44
Kamba    kam    39
Kannada    kan    48
Georgian    kat    58
Kazakh    kaz    36
Kabuverdianu    kea    46
Khmer    khm    41
Kyrgyz (Kirghiz)    kir    49
Korean    kor    52
Lao    lao    58
Latvian    lav    49
Lingala    lin    29
Lithuanian    lit    58
Luxembourgish    ltz    55
Ganda    lug    39
Luo (Kenya and Tanzania)    luo    46
Malayalam    mal    37
Marathi    mar    49
Macedonian    mkd    52
Maltese    mlt    49
Mongolian    mon    52
Maori    mri    33
Malay    msa    61
Burmese    mya    40
Nepali    nep    53
Dutch    nld    60
Norwegian Bokmål    nob²    46
Pedi (Northern Sotho)    nso    32
Nyanja    nya    39
Occitan    oci    42
Oriya (macrolanguage)    ori³    50
Oromo    orm    19
Punjabi    pan    55
Polish    pol    67
Portuguese    por    47
Pashto (Pushto)    pus    51
Romanian    ron    57
Russian    rus    55
Slovak    slk    54
Slovenian    slv    66
Shona    sna    43
Sindhi    snd    49
Somali    som    48
Spanish    spa    50
Serbian    srp    58
Swahili (macrolanguage)    swa⁴    39
Swedish    swe    52
Tamil    tam    51
Telugu    tel    53
Tajik    tgk    43
Thai    tha    54
Turkish    tur    50
Ukrainian    ukr    56
Umbundu    umb    27
Urdu    urd    55
Uzbek    uzb    53
Vietnamese    vie    48
Wolof    wol    39
Xhosa    xho    46
Yoruba    yor    36
Cantonese Chinese (Yue Chinese)    yue    54
Zulu    zul    39

4. googlei18n_asr: Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali

Paper | Javanese | Sundanese | Sinhala | Bengali | Nepali

License: Attribution-ShareAlike 4.0 International

Descriptions: We sample 10-min speech for each of these 5 languages.

Languages    Code    #Utterances
Bengali    ben    171
Javanese    jav    104
Nepali    nep    182
Sinhala    sin    137
Sundanese    sun    110

5. googlei18n_tts

This is a collection of the following papers/datasets.

a. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

Paper | Bengali

License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Descriptions: We use only the Bengali data from this project.

b. Open-Source High Quality Speech Datasets for Basque, Catalan and Galician

Paper | Catalan | Basque | Galician

License: Attribution-ShareAlike 4.0 International

Descriptions: We use all three languages.

c. Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems

Paper | Gujarati | Kannada | Malayalam | Marathi | Tamil

License: Attribution-ShareAlike 4.0 International

Descriptions: We use the Gujarati, Kannada, Malayalam, Marathi, and Tamil data.

d. Rapid development of TTS corpora for four South African languages

Paper | Dataset

License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Descriptions: We use the Sesotho (Southern Sotho), Setswana (Tswana), and isiXhosa (Xhosa) data.

e. Developing an Open-Source Corpus of Yoruba Speech

Paper | Dataset

License: Attribution-ShareAlike 4.0 International

Descriptions: It contains a single language, Yoruba.

Summary for the above 5 datasets: There is 10-min speech for each of the 13 languages.

Languages    Code    #Utterances
Bengali    ben    109
Catalan    cat    76
Basque    eus    85
Galician    glg    86
Gujarati    guj    91
Kannada    kan    85
Malayalam    mal    125
Marathi    mar    89
Southern Sotho    sot    107
Tamil    tam    126
Tswana    tsn    114
Xhosa    xho    127
Yoruba    yor    146

6. All Together Now: The Living Audio Dataset (LAD)

Paper | Dataset

License: Apache-2.0 license

Descriptions: Each of the 4 languages has 10-min speech.

Languages    Code    #Utterances
English    eng    128
Irish    gle    184
Dutch    nld    133
Russian    rus    189

7. The M-AILABS Speech Dataset

Dataset

License: please refer to this section.

Descriptions: We use all of the 8 languages and each has 10-min speech.

Languages    Code    #Utterances
German    deu    78
English    eng    79
French    fra    77
Italian    ita    112
Polish    pol    81
Russian    rus    64
Spanish    spa    93
Ukrainian    ukr    68

8. mexico-el

This is a collection of the following papers/datasets.

a. Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec

Paper | Dataset

License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Descriptions: This is a collection of Yolóxochitl Mixtec Speech.

b. Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation

Paper | Dataset

License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Descriptions: This is a collection of Highland Puebla Nahuatl Speech.

c. Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation

Paper | Dataset

License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Descriptions: This is a collection of Totonac Speech.

Summary for the above 3 datasets: We have 10-min speech for each of them.

Languages    Code    #Utterances
Highland Puebla Nahuatl    azz    113
Highland Totonac    tos    171
Yoloxochitl Mixtec    xty    99

9. MLS: A Large-Scale Multilingual Dataset for Speech Research

Paper | Dataset

License: CC BY 4.0

Descriptions: This dataset contains read audiobooks in the following 8 languages. This dev set has 10 min of speech for each of them.

Languages    Code    #Utterances
German    deu    39
English    eng    40
French    fra    40
Italian    ita    39
Dutch    nld    40
Polish    pol    42
Portuguese    por    39
Spanish    spa    41

10. The NCHLT speech corpus of the South African languages

Paper | Dataset

License: Creative Commons Attribution 3.0 Unported

Descriptions: This is a collection of 11 official languages of South Africa. There is 10-min speech for each of them.

Languages    Code    #Utterances
Afrikaans    afr    191
English    eng    225
South Ndebele (isiNdebele)    nbl    128
Northern Sotho (Sepedi, Pedi)    nso    189
Southern Sotho (Sesotho)    sot    170
Swati (Siswati)    ssw    103
Tswana (Setswana)    tsn    169
Tsonga (Xitsonga)    tso    166
Venda (Tshivenda)    ven    132
Xhosa (isiXhosa)    xho    135
Zulu (isiZulu)    zul    128

11. NST ASR Databases

Paper | Norwegian | Danish | Swedish

License: Creative_Commons-ZERO (CC-ZERO)

Descriptions: Each language has 10 min of speech.

Notes:

We remove Norwegian data.

Languages    Code    #Utterances
Danish    dan    116
Norwegian    nor    114
Swedish    swe    125

12. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening

Paper | Dataset

License: CC BY-SA 4.0

Descriptions: This is a collection of spoken Wikipedia articles in English, German, and Dutch. There is 10-min speech for each.

Languages    Code    #Utterances
German    deu    207
English    eng    239
Dutch    nld    210

13. VoxForge

Dataset

License: GNU General Public License

Descriptions: We choose the following 8 languages.

Languages    Code    #Utterances
German    deu    121
English    eng    115
French    fra    103
Italian    ita    90
Dutch    nld    136
Portuguese    por    162
Russian    rus    68
Spanish    spa    70

14. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Paper | Dataset

License: CC0

Descriptions: This is a collection of European Parliament speech. Each language has 10-min speech, except for Estonian/est, which has 7.65 min, and Lithuanian/lit, which has 0.75 min.

Languages    Code    #Utterances
Czech    ces    57
German    deu    64
English    eng    62
Estonian    est    47
Finnish    fin    62
French    fra    59
Croatian    hrv    53
Hungarian    hun    51
Italian    ita    48
Lithuanian    lit    3
Dutch    nld    83
Polish    pol    54
Romanian    ron    49
Slovak    slk    55
Slovenian (Slovene)    slv    49
Spanish    spa    54

Language Varieties Development Set

Overall, this dev set covers 56 accents and dialects, sampled from 10 publicly available datasets. We show the number of utterances, the language code we use in this released dataset, and the language code we expect during evaluation for each language variety.

Note that the language codes used as directory names are not necessarily the ones used for evaluation; they only serve to distinguish the varieties from each other. We either follow the code used in the original paper or define a new code when a language variety has no ISO 639-3 code.

The "Expected Code" column shows the actual codes used for evaluation. The "Dialects" column shows which dialect is spoken. The "Accents" column shows which accent the speaker has.
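To illustrate how directory codes relate to expected codes, the sketch below scores language-identification predictions against the "Expected Code" column. It is not the official scoring script. The mapping is a small excerpt of the tables in this section, and it assumes that the names given in parentheses in the dataset titles (openslr83, openslr_spa, ms_speech) are the dataset_code directory names; EXPECTED and lid_accuracy are hypothetical names, and codes are written without the surrounding brackets.

```python
# (dataset_code, directory code) -> expected code. Excerpt only; the
# dataset_code values are assumed to match the names shown in parentheses.
EXPECTED = {
    ("openslr83", "gle"): "eng",    # Irish English dialect
    ("openslr_spa", "arg"): "spa",  # Argentinian Spanish
    ("ms_speech", "tam"): "tam",    # Tamil
}


def lid_accuracy(predictions, expected=EXPECTED):
    """Fraction of (dataset_code, dir_code, predicted_code) triples that are correct."""
    total = correct = 0
    for dataset_code, dir_code, predicted in predictions:
        total += 1
        if predicted == expected[(dataset_code, dir_code)]:
            correct += 1
    return correct / total if total else 0.0
```

Keying on both the dataset and the directory code matters because the same directory code (for example gle or sco) appears under more than one dataset.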

Arabic

1. SADA: Saudi Audio Dataset for Arabic

Paper | Dataset

License: CC BY-NC-SA 4.0.

Descriptions: The dataset contains 4 Arabic dialects. We sample a 10-min dev set for each dialect from the original development split.

We expect [ara] as the predicted language code.

Dialects    Code    Expected Code    #Utterances
Khaliji    afb    [ara]    222
Najdi    ars    [ara]    153
Modern Standard Arabic    arb    [ara]    44 (5 min)
Hijazi    acw    [ara]    175

English

1. VoxPopuli

Paper | Dataset

License: CC0

Descriptions: We use the "Accented speech transcribed data" from VoxPopuli, which is a collection of English speech with different European accents. We sample a 10-min dev set for each accent, except for Lithuanian, Croatian, and Slovene.

We expect [eng] as the predicted language code.

Accents    Code    Expected Code    #Utterances
Dutch/nl    nld    [eng]    66
German/de    deu    [eng]    66
Czech/cs    ces    [eng]    56
Polish/pl    pol    [eng]    60
French/fr    fra    [eng]    60
Hungarian/hu    hun    [eng]    78
Finnish/fi    fin    [eng]    61
Romanian/ro    ron    [eng]    67
Slovak/sk    slk    [eng]    65
Spanish/es    spa    [eng]    72
Italian/it    ita    [eng]    61
Estonian/et    est    [eng]    54

2. Crowdsourced high-quality UK and Ireland English Dialect speech data set (openslr83)

Paper | Dataset

License: Attribution-ShareAlike 4.0 International

Descriptions: The dataset contains English dialects spoken in the UK and Ireland. We sample a 10-min dev set for each dialect. More specifically, we sample 5 min from female speakers and 5 min from male speakers, except for Irish English, where we sample 10 min solely from male speakers.

We expect [eng] as the predicted language code.

Dialects    Code    Expected Code    #Utterances
Irish    gle    [eng]    96
Midlands    mid    [eng]    90
Northern    nor    [eng]    93
Scottish    sco    [eng]    98
Southern    sou    [eng]    91
Welsh    wel    [eng]    93

3. GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech

Paper | Dataset

License: cc0-1.0

Descriptions: The dataset contains English speech with accents from all over the world. We sample a 10-min dev set for each of the following 9 accents from the val split.

We expect [eng] as the predicted language code.

Accents    Code    Expected Code    #Utterances
Canadian English    can    [eng]    174
Filipino    fil    [eng]    167
Australian English    aus    [eng]    166
United States English    use    [eng]    177
England English    bre    [eng]    173
New Zealand English    nze    [eng]    167
India and South Asia (India, Pakistan, Sri Lanka)    sae    [eng]    171
Irish English    gle    [eng]    181
Scottish English    sco    [eng]    153

4. L2-ARCTIC: a non-native English speech corpus

Paper | Dataset

License: CC BY-NC 4.0

Descriptions: The dataset contains English speech with 6 different accents. For each accent, we use 1 male speaker and 1 female speaker as the dev set and sample 5 min from each speaker, which adds up to 10 min per accent.

We expect [eng] as the predicted language code.

Accents    Code    Expected Code    #Utterances
Arabic    ara    [eng]    170
Mandarin    cmn    [eng]    147
Hindi    hin    [eng]    208
Korean    kor    [eng]    176
Spanish    spa    [eng]    153
Vietnamese    vie    [eng]    164

Swiss German

1. SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German

Paper | Dataset

License: Creative Commons Attribution-NonCommercial 4.0 International License

Descriptions: The dataset contains Swiss German spoken in 8 regions. For each of the 8 dialects, we sample a 10-min dev set. The language codes for directory names follow the original paper.

We expect [deu] as the predicted language code.

Dialects    Code    Expected Code    #Utterances
Aargau    ag    [deu]    164
Bern    be    [deu]    131
Basel    bs    [deu]    146
Graubünden    gr    [deu]    143
Luzern    lu    [deu]    174
St. Gallen    sg    [deu]    114
Wallis    vs    [deu]    140
Zürich    zh    [deu]    142

Greek

1. Speech Recognition for Greek Dialects: A Challenging Benchmark

Paper | Cretan Data | Messenian Data

Descriptions: We sample a 10-min dev set from each of the Cretan and Messenian datasets. The Cretan data is under "cretan/" and Messenian data is under "messenian/". The language codes for directory names are the first 3 letters of the dialect names.

We expect [ell] as the predicted language code.

Dialects    Code    Expected Code    #Utterances
Cretan    cre    [ell]    290
Messenian    mes    [ell]    139

Hindi

1. Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages (ms_speech)

Paper | Dataset

License: Here is a quote from the dataset website:

Data provided in this dataset shall not be used for commercial purposes. You may use the data solely for research purposes. If you publish your findings, you must provide the following attribution: “Data provided by Microsoft and SpeechOcean.com”.

Descriptions: We sample a 10-min dev set from the Test splits of the 3 dialects.

We expect different language codes for them.

Dialects    Code    Expected Code    #Utterances
Tamil    tam    [tam]    109
Telugu    tel    [tel]    101
Gujarati    guj    [guj]    95

Spanish

1. Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech (openslr_spa)

Paper | SLR61 | SLR71 | SLR72 | SLR73 | SLR74 | SLR75

License: Attribution-ShareAlike 4.0 International

Descriptions: This is a collection of 6 Latin American dialects of Spanish speech. For each of the 6 dialects except for SLR74 (Puerto Rico Spanish), we sample 5 min from male speakers and 5 min from female speakers. For SLR74, we sample a 10-min dev set from female speakers only. The language codes for directory names are the first 3 letters of the dialect names.

We expect [spa] as the predicted language code.

Dialects    Code    Expected Code    #Utterances
Argentinian    arg    [spa]    127
Chilean    chi    [spa]    107
Colombian    col    [spa]    121
Peruvian    per    [spa]    106
Puerto Rico    pue    [spa]    100
Venezuelan    ven    [spa]    125