const data_description = String.raw`
# Data
We provide the following three datasets to the participants.
- A training set from the ML-SUPERB 2.0 public set. This is the baseline training dataset and covers 141 languages.
- A development set from ML-SUPERB 2.0 that covers the same 141 languages as the training set.
- A development set focusing on various language varieties and covering 56 dialects and accents.

Note that the directory names of the language varieties development set do not necessarily correspond to the actual spoken language. 
Please refer to the "Expected Code" column for the language codes we use during evaluation.


Download links for these datasets:
- ML-SUPERB 2.0: [Google Drive](https://drive.google.com/file/d/1vQ5NksmGl-lY7I4mlU4Kde3EhrEYGii2/view?usp=drive_link), [Huggingface-zip](https://huggingface.co/datasets/ftshijt/mlsuperb_8th) and [Huggingface-pre-formatted](https://huggingface.co/datasets/espnet/ml_superb_hf).
- Language varieties development set: [Google Drive](https://drive.google.com/file/d/1E1fBJv6b1rmdHojQVaZ0yHwdw6ETcAw2/view?usp=drive_link) and [Huggingface](https://huggingface.co/datasets/mengct00/mlsuperb_lang_varieties).

Note that all distributed datasets follow their original licenses.

The statistics are shown below (after excluding Norwegian data).

| Dataset                        | Hours  | #Utterances |
|--------------------------------|--------|-------------|
| ML-SUPERB 2.0 Training         | 266.70 | 142,876     |
| ML-SUPERB 2.0 Development      | 44.26  | 24,019      |
| Language Varieties Development | 9.34   | 7,095       |


The directory structure of these datasets is:
~~~
dataset_dir/
└── dataset_code/
    └── lang_code/
        ├── transcript_1h_train.txt
        ├── transcript_10min_train.txt
        ├── transcript_10min_dev.txt
        ├── transcript_10min_test.txt
        └── wav/
            └── wav_id.wav
~~~
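
The layout above can be traversed with a short script. The following is a minimal sketch, not an official loader: the function name iter_transcripts and its split argument are hypothetical, and it assumes every language directory sits exactly two levels below dataset_dir.

~~~python
from pathlib import Path
from typing import Iterator, Tuple


def iter_transcripts(dataset_dir: str, split: str = "10min_dev") -> Iterator[Tuple[str, str, Path]]:
    """Yield (dataset_code, lang_code, transcript_path) for one split.

    Assumes the layout dataset_dir/dataset_code/lang_code/transcript_<split>.txt;
    language directories without that file are simply not matched.
    """
    pattern = "*/*/transcript_" + split + ".txt"
    for path in sorted(Path(dataset_dir).glob(pattern)):
        lang_dir = path.parent
        yield lang_dir.parent.name, lang_dir.name, path
~~~

Passing split="1h_train" would select transcript_1h_train.txt instead of the development transcripts.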

The format of the metadata (transcript_*.txt) is:
~~~
wav_id    original_wav_id    transcription
~~~

You can use the first field, wav_id, to locate the corresponding audio file in wav/.
An example of splitting each line:
~~~
wav_id, _, transcript = line.strip().split(maxsplit=2)
~~~
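
Building on the one-line split above, a loader for a single metadata file could look like the sketch below. The function name read_transcript is hypothetical. The sketch assumes the files are UTF-8 encoded and that each audio file is stored as wav/wav_id.wav next to the transcript, as in the directory structure shown earlier.

~~~python
from pathlib import Path
from typing import List, Tuple


def read_transcript(path: str) -> List[Tuple[str, Path, str]]:
    """Return (wav_id, wav_path, transcription) for each metadata line."""
    transcript_path = Path(path)
    wav_dir = transcript_path.parent / "wav"
    rows = []
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            # maxsplit=2 keeps spaces inside the transcription intact.
            fields = line.strip().split(maxsplit=2)
            if len(fields) < 2:
                continue  # blank or malformed line
            wav_id = fields[0]
            # Keep lines whose transcription is empty, using "" as the text.
            text = fields[2] if len(fields) == 3 else ""
            rows.append((wav_id, wav_dir / (wav_id + ".wav"), text))
    return rows
~~~

The middle field, original_wav_id, is discarded here because only wav_id is needed to find the audio.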

## Audio
All audio is converted to single-channel WAV with a sampling rate of 16 kHz.
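
If you want to confirm the format of a downloaded file, Python's standard wave module can read the header. This is a minimal sketch under the assumption that the files are PCM-encoded WAV, which is what the wave module supports; check_wav is a hypothetical helper name.

~~~python
import wave


def check_wav(path: str) -> float:
    """Return the duration in seconds; raise if the file is not 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != 16000 or channels != 1:
            raise ValueError("unexpected format: %d Hz, %d channel(s)" % (rate, channels))
        # Duration is the number of frames divided by frames per second.
        return wav.getnframes() / rate
~~~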


## Text
For the ML-SUPERB 2.0 dataset, the transcriptions in Google Drive and Huggingface-zip are kept in their original format.
The transcriptions in Huggingface-pre-formatted have been normalized, including punctuation removal and uppercasing.
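
To bring the original-format transcriptions closer to the normalized ones, a similar normalization can be applied locally. The sketch below is only an approximation: the exact rules used for the pre-formatted release are not specified here, so treating every Unicode punctuation character as removable and collapsing whitespace are assumptions, and normalize_text is a hypothetical name.

~~~python
import unicodedata


def normalize_text(text: str) -> str:
    """Drop Unicode punctuation, uppercase, and collapse whitespace."""
    # Unicode general categories starting with "P" cover punctuation.
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.upper().split())


print(normalize_text("Hello, world!"))  # prints HELLO WORLD
~~~

Note that this also removes apostrophes, which may or may not match the released normalization.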

For the Language Varieties Development set, the transcriptions in Google Drive and Huggingface are kept in their original format.


## Development Dataset Breakdown

Below we provide detailed descriptions for the development sets (transcript_10min_dev.txt).

### ML-SUPERB 2.0 Development Set

This dev set covers 141 languages, sampled from 14 publicly available datasets.
You can also check [the original ML-SUPERB paper](https://www.isca-archive.org/interspeech_2023/shi23g_interspeech.pdf) for more information.

The expected language codes are the **same** as the directory names in **most** cases, unless specified otherwise.

**1. ALFFA (African Languages in the Field: speech Fundamentals and Automation)**

[Paper](https://aclanthology.org/L16-1611/) | [Dataset](https://openslr.org/25/)

**License**: MIT

**Descriptions**:
This is a collection of Amharic, Swahili and Wolof speech. 
Each language has 10 min of audio.

| Languages | Code | #Utterances |
|-----------|------|-------------|
| Amharic   | amh  | 92          |
| Swahili   | swa  | 171         |
| Wolof     | wol  | 140         |

**2. Common Voice: A Massively-Multilingual Speech Corpus**

[Paper](https://aclanthology.org/2020.lrec-1.520/) | [Dataset](https://commonvoice.mozilla.org/en/datasets)

**License**: [Creative Commons CC0](https://creativecommons.org/public-domain/cc0/)

**Descriptions**:
We use 84 languages from Common Voice. Each language has 10 min of speech.

**Notes**:

⁰ The original directory name lga is incorrect. 
The language is actually Ganda/Luganda, and we will use [lug] for evaluation.

¹ **We remove Norwegian data.**

² This is a legacy version of Japanese data. We still expect [jpn] during evaluation.

³ We expect [ori], the macrolanguage code for Oriya.

| Languages                           | Code                              | #Utterances |
|-------------------------------------|-----------------------------------|-------------|
| Abkhazian                           | abk                               | 136         |
| Arabic                              | ara                               | 142         |
| Assamese                            | asm                               | 128         |
| Bashkir                             | bak                               | 112         |
| Basaa                               | bas                               | 145         |
| Belarusian                          | bel                               | 108         |
| Bengali                             | ben                               | 92          |
| Breton                              | bre                               | 179         |
| Bulgarian                           | bul                               | 108         |
| Catalan                             | cat                               | 96          |
| Czech                               | ces                               | 188         |
| Chuvash                             | chv                               | 120         |
| Central Kurdish                     | ckb                               | 131         |
| Mandarin Chinese                    | cmn                               | 108         |
| Hakha Chin                          | cnh                               | 167         |
| Welsh                               | cym                               | 105         |
| Danish                              | dan                               | 165         |
| German                              | deu                               | 100         |
| Dhivehi                             | div                               | 109         |
| Greek                               | ell                               | 140         |
| English                             | eng                               | 102         |
| Esperanto                           | epo                               | 112         |
| Estonian                            | est                               | 81          |
| Basque                              | eus                               | 103         |
| Persian                             | fas                               | 116         |
| Finnish                             | fin                               | 124         |
| French                              | fra                               | 112         |
| Frisian                             | frr                               | 107         |
| Irish                               | gle                               | 141         |
| Galician                            | glg                               | 117         |
| Guarani                             | grn                               | 135         |
| Hausa                               | hau                               | 122         |
| Hindi                               | hin                               | 109         |
| Upper Sorbian                       | hsb                               | 97          |
| Hungarian                           | hun                               | 119         |
| Armenian                            | hye                               | 99          |
| Interlingua                         | ina                               | 122         |
| Indonesian                          | ind                               | 156         |
| Italian                             | ita                               | 96          |
| Japanese                            | jpn                               | 126         |
| Kabyle                              | kab                               | 148         |
| Georgian                            | kat                               | 169         |
| Kazakh                              | kaz                               | 105         |
| Kinyarwanda                         | kin                               | 111         |
| Kyrgyz (Kirghiz)                    | kir                               | 136         |
| Kurmanji Kurdish (Northern Kurdish) | kmr                               | 130         |
| Latvian                             | lav                               | 179         |
| Ganda (Luganda)                     | lga  (**we expect lug**)⁰         | 108         |
| Lithuanian                          | lit                               | 107         |
| Malayalam                           | mal                               | 154         |
| Marathi                             | mar                               | 94          |
| Meadow Mari (Eastern Mari)          | mhr                               | 110         |
| Maltese                             | mlt                               | 121         |
| Mongolian                           | mon                               | 105         |
| Hill Mari (Western Mari)            | mrj                               | 114         |
| Erzya                               | myv                               | 97          |
| Min Nan Chinese (Taiwanese Minnan)  | nan                               | 203         |
| Dutch                               | nld                               | 126         |
| ~~Norwegian Nynorsk~~               | ~~nno~~¹                          | ~~126~~     |
| Japanese                            | org_jpn (**we expect jpn**)²      | 106         |
| Odia                                | ory (**we expect ori**)³          | 116         |
| Polish                              | pol                               | 120         |
| Portuguese                          | por                               | 128         |
| Romanian                            | ron                               | 143         |
| Russian                             | rus                               | 110         |
| Yakut (Sakha)                       | sah                               | 91          |
| Saraiki                             | skr                               | 124         |
| Slovak                              | slk                               | 139         |
| Slovenian                           | slv                               | 140         |
| Spanish                             | spa                               | 106         |
| Serbian                             | srp                               | 180         |
| Swahili (macrolanguage)             | swa                               | 103         |
| Swedish                             | swe                               | 127         |
| Tamil                               | tam                               | 104         |
| Tatar                               | tat                               | 130         |
| Thai                                | tha                               | 119         |
| Toki Pona                           | tok                               | 162         |
| Turkish                             | tur                               | 126         |
| Uighur (Uyghur)                     | uig                               | 91          |
| Ukrainian                           | ukr                               | 111         |
| Urdu                                | urd                               | 144         |
| Uzbek                               | uzb                               | 111         |
| Vietnamese                          | vie                               | 158         |
| Yue Chinese (Cantonese)             | yue                               | 125         |
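
The three directory names whose expected code differs can be handled with a small lookup. This is a minimal sketch based only on the notes above; the names COMMON_VOICE_OVERRIDES and expected_code are hypothetical.

~~~python
# Common Voice directory names whose expected code differs, taken from
# the notes above. All other directory names map to themselves.
COMMON_VOICE_OVERRIDES = {
    "lga": "lug",      # Ganda/Luganda, the directory name is incorrect
    "org_jpn": "jpn",  # legacy version of the Japanese data
    "ory": "ori",      # Oriya macrolanguage
}


def expected_code(dir_name: str) -> str:
    """Map a directory name to the code expected during evaluation."""
    return COMMON_VOICE_OVERRIDES.get(dir_name, dir_name)
~~~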


**3. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech**

[Paper](https://ieeexplore.ieee.org/document/10023141) | [Dataset](https://huggingface.co/datasets/google/xtreme_s)

**License**: [Creative Commons Attribution 4.0](https://choosealicense.com/licenses/cc-by-4.0/)

**Descriptions**:
We use all the 102 languages from FLEURS.
All languages have 10 min of speech, except for isl, which has 7.2 min, and orm, which has 3.6 min.

**Notes**:

⁰ The code is originally [azj] in FLEURS. Here we expect [aze], the macrolanguage code, for evaluation.

¹ FLEURS also uses [tgl]/Tagalog in the paper. Here we use [fil].

² **We remove Norwegian data.**

³ FLEURS uses [ory] for the individual language of Oriya. Here we use [ori] for the macrolanguage.

⁴ FLEURS uses [swh] for the individual language of Swahili. Here we use [swa] for the macrolanguage.

| Languages                        | Code      | #Utterances |
|----------------------------------|-----------|-------------|
| Afrikaans                        | afr       | 50          |
| Amharic                          | amh       | 58          |
| Arabic                           | ara       | 55          |
| Assamese                         | asm       | 50          |
| Asturian                         | ast       | 67          |
| Azerbaijani                      | aze ⁰     | 49          |
| Belarusian                       | bel       | 45          |
| Bengali                          | ben       | 46          |
| Bosnian                          | bos       | 49          |
| Bulgarian                        | bul       | 63          |
| Catalan                          | cat       | 53          |
| Cebuano                          | ceb       | 38          |
| Czech                            | ces       | 50          |
| Central Kurdish (Sorani Kurdish) | ckb       | 50          |
| Mandarin Chinese                 | cmn       | 54          |
| Welsh                            | cym       | 41          |
| Danish                           | dan       | 51          |
| German                           | deu       | 48          |
| Modern Greek                     | ell       | 56          |
| English                          | eng       | 59          |
| Estonian                         | est       | 53          |
| Persian                          | fas       | 35          |
| Filipino                         | fil ¹     | 30          |
| Finnish                          | fin       | 49          |
| French                           | fra       | 60          |
| Fula (Fulah)                     | ful       | 47          |
| Irish                            | gle       | 41          |
| Galician                         | glg       | 59          |
| Gujarati                         | guj       | 62          |
| Hausa                            | hau       | 32          |
| Hebrew                           | heb       | 71          |
| Hindi                            | hin       | 57          |
| Croatian                         | hrv       | 61          |
| Hungarian                        | hun       | 52          |
| Armenian                         | hye       | 54          |
| Igbo                             | ibo       | 32          |
| Indonesian                       | ind       | 48          |
| Icelandic                        | isl       | 36          |
| Italian                          | ita       | 44          |
| Javanese                         | jav       | 38          |
| Japanese                         | jpn       | 44          |
| Kamba                            | kam       | 39          |
| Kannada                          | kan       | 48          |
| Georgian                         | kat       | 58          |
| Kazakh                           | kaz       | 36          |
| Kabuverdianu                     | kea       | 46          |
| Khmer                            | khm       | 41          |
| Kyrgyz (Kirghiz)                 | kir       | 49          |
| Korean                           | kor       | 52          |
| Lao                              | lao       | 58          |
| Latvian                          | lav       | 49          |
| Lingala                          | lin       | 29          |
| Lithuanian                       | lit       | 58          |
| Luxembourgish                    | ltz       | 55          |
| Ganda                            | lug       | 39          |
| Luo (Kenya and Tanzania)         | luo       | 46          |
| Malayalam                        | mal       | 37          |
| Marathi                          | mar       | 49          |
| Macedonian                       | mkd       | 52          |
| Maltese                          | mlt       | 49          |
| Mongolian                        | mon       | 52          |
| Maori                            | mri       | 33          |
| Malay                            | msa       | 61          |
| Burmese                          | mya       | 40          |
| Nepali                           | nep       | 53          |
| Dutch                            | nld       | 60          |
| ~~Norwegian Bokmål~~             | ~~nob~~ ² | ~~46~~      |
| Pedi (Northern Sotho)            | nso       | 32          |
| Nyanja                           | nya       | 39          |
| Occitan                          | oci       | 42          |
| Oriya (macrolanguage)            | ori ³     | 50          |
| Oromo                            | orm       | 19          |
| Punjabi                          | pan       | 55          |
| Polish                           | pol       | 67          |
| Portuguese                       | por       | 47          |
| Pashto (Pushto)                  | pus       | 51          |
| Romanian                         | ron       | 57          |
| Russian                          | rus       | 55          |
| Slovak                           | slk       | 54          |
| Slovenian                        | slv       | 66          |
| Shona                            | sna       | 43          |
| Sindhi                           | snd       | 49          |
| Somali                           | som       | 48          |
| Spanish                          | spa       | 50          |
| Serbian                          | srp       | 58          |
| Swahili (macrolanguage)          | swa ⁴     | 39          |
| Swedish                          | swe       | 52          |
| Tamil                            | tam       | 51          |
| Telugu                           | tel       | 53          |
| Tajik                            | tgk       | 43          |
| Thai                             | tha       | 54          |
| Turkish                          | tur       | 50          |
| Ukrainian                        | ukr       | 56          |
| Umbundu                          | umb       | 27          |
| Urdu                             | urd       | 55          |
| Uzbek                            | uzb       | 53          |
| Vietnamese                       | vie       | 48          |
| Wolof                            | wol       | 39          |
| Xhosa                            | xho       | 46          |
| Yoruba                           | yor       | 36          |
| Cantonese Chinese (Yue Chinese)  | yue       | 54          |
| Zulu                             | zul       | 39          |


**4. googlei18n_asr: Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali**

[Paper](https://research.google/pubs/crowd-sourced-speech-corpora-for-javanese-sundanese-sinhala-nepali-and-bangladeshi-bengali/) 
| [Javanese](https://www.openslr.org/35/)
| [Sundanese](https://www.openslr.org/36/)
| [Sinhala](https://www.openslr.org/52/)
| [Bengali](https://www.openslr.org/53/)
| [Nepali](https://www.openslr.org/54/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/35/LICENSE)

**Descriptions**:
We sample 10 min of speech for each of these 5 languages.

| Languages | Code | #Utterances |
|-----------|------|-------------|
| Bengali   | ben  | 171         |
| Javanese  | jav  | 104         |
| Nepali    | nep  | 182         |
| Sinhala   | sin  | 137         |
| Sundanese | sun  | 110         |

**5. googlei18n_tts**

This is a collection of the following papers/datasets.

**a. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese**

[Paper](https://research.google/pubs/a-step-by-step-process-for-building-tts-voices-using-open-source-data-and-framework-for-bangla-javanese-khmer-nepali-sinhala-and-sundanese/)
| [Bengali](https://www.openslr.org/37/)

**License**: [Attribution-ShareAlike 4.0 (CC BY-SA 4.0)](https://openslr.elda.org/resources/37/LICENSE.txt)

**Descriptions**: We use only the Bengali data from this project.

**b. Open-Source High Quality Speech Datasets for Basque, Catalan and Galician**

[Paper](https://aclanthology.org/2020.sltu-1.3/)
| [Catalan](https://www.openslr.org/69/)
| [Basque](https://www.openslr.org/76/)
| [Galician](https://www.openslr.org/77/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/69/LICENSE)

**Descriptions**: We use all three languages.

**c. Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems**

[Paper](https://aclanthology.org/2020.lrec-1.800/)
| [Gujarati](https://www.openslr.org/78/)
| [Kannada](https://www.openslr.org/79/)
| [Malayalam](https://www.openslr.org/63/)
| [Marathi](https://www.openslr.org/64/)
| [Tamil](https://www.openslr.org/65/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/78/LICENSE)

**Descriptions**: We use the Gujarati, Kannada, Malayalam, Marathi, and Tamil data.

**d. Rapid development of TTS corpora for four South African languages**

[Paper](https://research.google/pubs/rapid-development-of-tts-corpora-for-four-south-african-languages/)
| [Dataset](https://www.openslr.org/32/)

**License**: [Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/deed.en)

**Descriptions**: We use the Sesotho (Southern Sotho), Setswana (Tswana), and isiXhosa (Xhosa) data.

**e. Developing an Open-Source Corpus of Yoruba Speech**

[Paper](https://research.google/pubs/developing-an-open-source-corpus-of-yoruba-speech/)
| [Dataset](https://www.openslr.org/86/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/86/LICENSE)

**Descriptions**: It contains a single language, Yoruba.

**Summary for the above 5 datasets**:
There is 10 min of speech for each of the 13 languages.

| Languages      | Code | #Utterances |
|----------------|------|-------------|
| Bengali        | ben  | 109         |
| Catalan        | cat  | 76          |
| Basque         | eus  | 85          |
| Galician       | glg  | 86          |
| Gujarati       | guj  | 91          |
| Kannada        | kan  | 85          |
| Malayalam      | mal  | 125         |
| Marathi        | mar  | 89          |
| Southern Sotho | sot  | 107         |
| Tamil          | tam  | 126         |
| Tswana         | tsn  | 114         |
| Xhosa          | xho  | 127         |
| Yoruba         | yor  | 146         |


**6. All Together Now: The Living Audio Dataset (LAD)**

[Paper](https://www.isca-archive.org/interspeech_2019/braude19_interspeech.pdf)
| [Dataset](https://github.com/Idlak/Living-Audio-Dataset)

**License**: [Apache-2.0 license](https://github.com/Idlak/Living-Audio-Dataset?tab=Apache-2.0-1-ov-file#readme)

**Descriptions**: 
Each of the 4 languages has 10 min of speech.

| Languages | Code | #Utterances |
|-----------|------|-------------|
| English   | eng  | 128         |
| Irish     | gle  | 184         |
| Dutch     | nld  | 133         |
| Russian   | rus  | 189         |


**7. The M-AILABS Speech Dataset**

[Dataset](https://github.com/imdatceleste/m-ailabs-dataset)

**License**: please refer to [this section](https://github.com/imdatceleste/m-ailabs-dataset?tab=readme-ov-file#license).

**Descriptions**: 
We use all 8 languages, each with 10 min of speech.

| Languages | Code | #Utterances |
|-----------|------|-------------|
| German    | deu  | 78          |
| English   | eng  | 79          |
| French    | fra  | 77          |
| Italian   | ita  | 112         |
| Polish    | pol  | 81          |
| Russian   | rus  | 64          |
| Spanish   | spa  | 93          |
| Ukrainian | ukr  | 68          |

**8. mexico-el**

This is a collection of the following papers/datasets.

**a. Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec**

[Paper](https://aclanthology.org/2021.eacl-main.96/)
| [Dataset](https://www.openslr.org/89/)

**License**: [Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)](https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en)

**Descriptions**: This is a collection of Yolóxochitl Mixtec Speech.

**b. Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation**

[Paper](https://aclanthology.org/2021.americasnlp-1.7/)
| [Dataset](https://www.openslr.org/92)

**License**: [Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)](https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en)

**Descriptions**: This is a collection of Highland Puebla Nahuatl Speech.

**c. Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation**

[Paper](https://www.isca-archive.org/interspeech_2022/berrebbi22_interspeech.html)
| [Dataset](https://www.openslr.org/107/)

**License**: [Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)](https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en)

**Descriptions**: This is a collection of Totonac Speech.

**Summary for the above 3 datasets**:
We have 10 min of speech for each language.

| Languages               | Code | #Utterances |
|-------------------------|------|-------------|
| Highland Puebla Nahuatl | azz  | 113         |
| Highland Totonac        | tos  | 171         |
| Yoloxochitl Mixtec      | xty  | 99          |

**9. MLS: A Large-Scale Multilingual Dataset for Speech Research**

[Paper](https://www.isca-archive.org/interspeech_2020/pratap20_interspeech.html)
| [Dataset](https://www.openslr.org/94/)

**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en)

**Descriptions**: 
This dataset contains read audiobooks in the following 8 languages.
This dev set has 10 min of speech for each of them.

| Languages  | Code | #Utterances |
|------------|------|-------------|
| German     | deu  | 39          |
| English    | eng  | 40          |
| French     | fra  | 40          |
| Italian    | ita  | 39          |
| Dutch      | nld  | 40          |
| Polish     | pol  | 42          |
| Portuguese | por  | 39          |
| Spanish    | spa  | 41          |

**10. The NCHLT speech corpus of the South African languages**

[Paper](https://www.isca-archive.org/sltu_2014/barnard14_sltu.html)
| [Dataset](https://sites.google.com/site/nchltspeechcorpus)

**License**: [Creative Commons Attribution 3.0 Unported](https://creativecommons.org/licenses/by/3.0/deed.en)

**Descriptions**: 
This is a collection of 11 official languages of South Africa.
There is 10 min of speech for each of them.

| Languages                     | Code | #Utterances |
|-------------------------------|------|-------------|
| Afrikaans                     | afr  | 191         |
| English                       | eng  | 225         |
| South Ndebele (isiNdebele)    | nbl  | 128         |
| Northern Sotho (Sepedi, Pedi) | nso  | 189         |
| Southern Sotho (Sesotho)      | sot  | 170         |
| Swati (Siswati)               | ssw  | 103         |
| Tswana (Setswana)             | tsn  | 169         |
| Tsonga (Xitsonga)             | tso  | 166         |
| Venda (Tshivenda)             | ven  | 132         |
| Xhosa (isiXhosa)              | xho  | 135         |
| Zulu (isiZulu)                | zul  | 128         |

**11. NST ASR Databases**

[Paper](https://www.researchgate.net/publication/315418650_Language_Technology_Support_for_Norwegian)
| [Norwegian](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/)
| [Danish](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/)
| [Swedish](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/)

**License**: [Creative_Commons-ZERO (CC-ZERO)](https://creativecommons.org/publicdomain/zero/1.0/)

**Descriptions**: 
Each language has 10 min of speech.

**Notes**:

⁰ **We remove Norwegian data.**

| Languages     | Code      | #Utterances |
|---------------|-----------|-------------|
| Danish        | dan       | 116         |
| ~~Norwegian~~ | ~~nor~~ ⁰ | ~~114~~     |
| Swedish       | swe       | 125         |


**12. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening**

[Paper](https://link.springer.com/article/10.1007/s10579-017-9410-y) | [Dataset](https://nats.gitlab.io/swc/)

**License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

**Descriptions**:
This is a collection of spoken Wikipedia articles in English, German, and Dutch.
There is 10 min of speech for each.

| Languages | Code | #Utterances |
|-----------|------|-------------|
| German    | deu  | 207         |
| English   | eng  | 239         |
| Dutch     | nld  | 210         |

**13. VoxForge**

[Dataset](https://www.voxforge.org/)

**License**: [GNU General Public License](https://www.voxforge.org/home/docs/faq/faq/what-is-gpl)

**Descriptions**:
We choose the following 8 languages.

| Languages  | Code | #Utterances |
|------------|------|-------------|
| German     | deu  | 121         |
| English    | eng  | 115         |
| French     | fra  | 103         |
| Italian    | ita  | 90          |
| Dutch      | nld  | 136         |
| Portuguese | por  | 162         |
| Russian    | rus  | 68          |
| Spanish    | spa  | 70          |


**14. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation**

[Paper](https://aclanthology.org/2021.acl-long.80/)
| [Dataset](https://github.com/facebookresearch/voxpopuli)

**License**: [CC0](https://creativecommons.org/public-domain/cc0/)

**Descriptions**:
This is a collection of European Parliament speech.
Each language has 10 min of speech, except for Estonian/est, which has 7.65 min, and Lithuanian/lit, which has 0.75 min.

| Languages           | Code | #Utterances |
|---------------------|------|-------------|
| Czech               | ces  | 57          |
| German              | deu  | 64          |
| English             | eng  | 62          |
| Estonian            | est  | 47          |
| Finnish             | fin  | 62          |
| French              | fra  | 59          |
| Croatian            | hrv  | 53          |
| Hungarian           | hun  | 51          |
| Italian             | ita  | 48          |
| Lithuanian          | lit  | 3           |
| Dutch               | nld  | 83          |
| Polish              | pol  | 54          |
| Romanian            | ron  | 49          |
| Slovak              | slk  | 55          |
| Slovenian (Slovene) | slv  | 49          |
| Spanish             | spa  | 54          |


### Language Varieties Development Set

Overall, this dev set covers 56 accents and dialects, sampled from 10 publicly available datasets.
We show the number of utterances, the language code we use in this released dataset, and the language code we expect during evaluation for each language variety.

Note that the language codes used as directory names are not necessarily the ones used for evaluation.
They only serve to distinguish the varieties from one another.
We either follow the code used in the original paper or define a new code when a language variety has no ISO 639-3 code.

The "Expected Code" column shows the actual codes used for evaluation.
The "Dialects" column shows which dialect is spoken.
The "Accents" column shows which accent the speaker has.

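As a rough illustration of the difference between directory codes and expected codes, the sketch below looks up the evaluation label for a (dataset, directory code) pair. The dataset keys and the names EXPECTED_CODE and expected_code are hypothetical placeholders, not part of the released data; only the code pairs are copied from the tables in this section, and only a few rows are shown.

~~~python
# Hypothetical lookup: (dataset key, directory code) -> expected code.
# The dataset keys are placeholders; the code pairs come from the tables below.
EXPECTED_CODE = {
    ("sada", "afb"): "ara",         # Khaliji Arabic
    ("sada", "ars"): "ara",         # Najdi Arabic
    ("l2arctic", "ara"): "eng",     # Arabic-accented English
    ("l2arctic", "spa"): "eng",     # Spanish-accented English
    ("swissdial", "zh"): "deu",     # Zürich Swiss German
    ("greek", "cre"): "ell",        # Cretan Greek
    ("ms_speech", "tam"): "tam",    # Tamil keeps its own code
    ("openslr_spa", "arg"): "spa",  # Argentinian Spanish
}


def expected_code(dataset: str, dir_code: str) -> str:
    """Return the evaluation label for a directory, in the bracketed table form."""
    return "[" + EXPECTED_CODE[(dataset, dir_code)] + "]"


print(expected_code("l2arctic", "ara"))  # prints [eng]
~~~

Keying on the dataset as well as the directory code matters because the same directory code can recur across datasets, for example gle and sco appear in both openslr83 and GLOBE.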

**Arabic**

**1. SADA: Saudi Audio Dataset for Arabic**

[Paper](https://ieeexplore.ieee.org/document/10446243) | [Dataset](https://www.kaggle.com/datasets/sdaiancai/sada2022)

**License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

**Descriptions**:
The dataset contains 4 Arabic dialects.
We sample a 10-min dev set for each dialect from the original development split, except for Modern Standard Arabic, for which the dev set is 5 min.

We expect [ara] as the predicted language code.

| Dialects             | Code | Expected Code | #Utterances |
|----------------------|------|---------------|-------------|
| Khaliji              | afb  | [ara]         | 222         |
| Najdi                | ars  | [ara]         | 153         |
| ModernStandardArabic | arb  | [ara]         | 44  (5min)  |
| Hijazi               | acw  | [ara]         | 175         |

**English**

**1. VoxPopuli**

[Paper](https://aclanthology.org/2021.acl-long.80/) | [Dataset](https://github.com/facebookresearch/voxpopuli)

**License**: [CC0](https://creativecommons.org/public-domain/cc0/)

**Descriptions:**
We use the "Accented speech transcribed data" from VoxPopuli, which is a collection of English speech with different European accents.
We sample a 10-min dev set for each accent except Lithuanian, Croatian, and Slovene, which are not included in this set.

We expect [eng] as the predicted language code.

| Accents      | Code | Expected Code | #Utterances |
|--------------|------|---------------|-------------|
| Dutch/nl     | nld  | [eng]         | 66          |
| German/de    | deu  | [eng]         | 66          |
| Czech/cs     | ces  | [eng]         | 56          |
| Polish/pl    | pol  | [eng]         | 60          |
| French/fr    | fra  | [eng]         | 60          |
| Hungarian/hu | hun  | [eng]         | 78          |
| Finnish/fi   | fin  | [eng]         | 61          |
| Romanian/ro  | ron  | [eng]         | 67          |
| Slovak/sk    | slk  | [eng]         | 65          |
| Spanish/es   | spa  | [eng]         | 72          |
| Italian/it   | ita  | [eng]         | 61          |
| Estonian/et  | est  | [eng]         | 54          |


**2. Crowdsourced high-quality UK and Ireland English Dialect speech data set (openslr83)**

[Paper](https://aclanthology.org/2020.lrec-1.804.pdf) | [Dataset](https://www.openslr.org/83/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/83/LICENSE)

**Descriptions:**
The dataset contains English dialects spoken in the UK and Ireland.
We sample a 10-min dev set for each dialect.
More specifically, we sample 5 min from female speakers and 5 min from male speakers, except for Irish English, where we sample 10 min solely from male speakers.

We expect [eng] as the predicted language code.

| Dialects  | Code | Expected Code | #Utterances |
|-----------|------|---------------|-------------|
| Irish     | gle  | [eng]         | 96          |
| Midlands  | mid  | [eng]         | 90          |
| Northern  | nor  | [eng]         | 93          |
| Scottish  | sco  | [eng]         | 98          |
| Southern  | sou  | [eng]         | 91          |
| Welsh     | wel  | [eng]         | 93          |

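The per-gender sampling described above for openslr83 could look roughly like the following sketch. It is an assumption, not the organizers' actual procedure: the tuple layout (wav_id, gender, duration_sec), the function names, and the greedy fill strategy are all hypothetical and chosen only for illustration.

~~~python
import random


def sample_by_budget(utts, budget_sec, seed=0):
    """Greedily add shuffled utterances that still fit in the duration budget.

    utts: list of (wav_id, gender, duration_sec) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    pool = list(utts)
    rng.shuffle(pool)
    picked, total = [], 0.0
    for utt in pool:
        # Skip any utterance that would push the total past the budget.
        if total + utt[2] <= budget_sec:
            picked.append(utt)
            total += utt[2]
    return picked


def sample_dev_set(utts, per_gender_sec=300.0, seed=0):
    """Sample up to per_gender_sec seconds per gender (5 min + 5 min by default)."""
    dev = []
    for gender in ("female", "male"):
        subset = [u for u in utts if u[1] == gender]
        dev.extend(sample_by_budget(subset, per_gender_sec, seed))
    return dev
~~~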
**3. GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech**

[Paper](https://www.isca-archive.org/interspeech_2024/wang24b_interspeech.pdf) | [Dataset](https://huggingface.co/datasets/MushanW/GLOBE)

**License**: [cc0-1.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/cc0-1.0.md)

**Descriptions:**
The dataset contains English speech with accents from all over the world.
We sample a 10-min dev set for each of the following 9 accents from the val split.

We expect [eng] as the predicted language code.

| Accents                                           | Code | Expected Code | #Utterances |
|---------------------------------------------------|------|---------------|-------------|
| Canadian English                                  | can  | [eng]         | 174         |
| Filipino                                          | fil  | [eng]         | 167         |
| Australian English                                | aus  | [eng]         | 166         |
| United States English                             | use  | [eng]         | 177         |
| England English                                   | bre  | [eng]         | 173         |
| New Zealand English                               | nze  | [eng]         | 167         |
| India and South Asia (India, Pakistan, Sri Lanka) | sae  | [eng]         | 171         |
| Irish English                                     | gle  | [eng]         | 181         |
| Scottish English                                  | sco  | [eng]         | 153         |


**4. L2-ARCTIC: a non-native English speech corpus**

[Paper](https://www.isca-archive.org/interspeech_2018/zhao18b_interspeech.pdf) | [Dataset](https://psi.engr.tamu.edu/l2-arctic-corpus/)

**License**: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

**Descriptions:**
The dataset contains English speech with 6 different accents.
For each accent, we use 1 male speaker and 1 female speaker as the dev set.
We sample 5 min from each speaker, for a total of 10 min per accent.

We expect [eng] as the predicted language code.

| Accents    | Code | Expected Code | #Utterances |
|------------|------|---------------|-------------|
| Arabic     | ara  | [eng]         | 170         |
| Mandarin   | cmn  | [eng]         | 147         |
| Hindi      | hin  | [eng]         | 208         |
| Korean     | kor  | [eng]         | 176         |
| Spanish    | spa  | [eng]         | 153         |
| Vietnamese | vie  | [eng]         | 164         |

**Swiss German**

**1. SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German**

[Paper](https://arxiv.org/pdf/2103.11401) | [Dataset](https://mtc.ethz.ch/publications/open-source/swiss-dial.html)

**License**: [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/)

**Descriptions:**
The dataset contains Swiss German spoken in 8 regions.
For each of the 8 dialects, we sample a 10-min dev set.
The language codes for directory names follow the original paper.

We expect [deu] as the predicted language code.

| Dialects   | Code | Expected Code | #Utterances |
|------------|------|---------------|-------------|
| Aargau     | ag   | [deu]         | 164         |
| Bern       | be   | [deu]         | 131         |
| Basel      | bs   | [deu]         | 146         |
| Graubünden | gr   | [deu]         | 143         |
| Luzern     | lu   | [deu]         | 174         |
| St. Gallen | sg   | [deu]         | 114         |
| Wallis     | vs   | [deu]         | 140         |
| Zürich     | zh   | [deu]         | 142         |


**Greek**

**1. Speech Recognition for Greek Dialects: A Challenging Benchmark**

[Paper](https://www.isca-archive.org/interspeech_2024/vakirtzian24_interspeech.pdf) | 
[Cretan Data](https://huggingface.co/datasets/ilsp/cretan-speech-corpus) | 
[Messenian Data](https://huggingface.co/datasets/ilsp/messenian-speech-corpus)

**Descriptions**:
We sample a 10-min dev set from each of the Cretan and Messenian datasets.
The Cretan data is under "cretan/" and Messenian data is under "messenian/".
The language codes for directory names are the first 3 letters of the dialect names.

We expect [ell] as the predicted language code.

| Dialects  | Code | Expected Code | #Utterances |
|-----------|------|---------------|-------------|
| Cretan    | cre  | [ell]         | 290         |
| Messenian | mes  | [ell]         | 139         |


**Hindi**

**1. Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages (ms_speech)**

[Paper](https://www.isca-archive.org/sltu_2018/srivastava18_sltu.pdf) |
[Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=105292)

**License**: Here is a quote from the dataset website:
> Data provided in this dataset shall not be used for commercial purposes.
> You may use the data solely for research purposes. 
> If you publish your findings, you must provide the following attribution: “Data provided by Microsoft and SpeechOcean.com”.

**Descriptions**:
We sample a 10-min dev set from the Test split of each of the 3 language varieties.

Unlike the sets above, each variety here has its own expected language code, which matches its directory code.

| Dialects | Code | Expected Code | #Utterances |
|----------|------|---------------|-------------|
| Tamil    | tam  | [tam]         | 109         |
| Telugu   | tel  | [tel]         | 101         |
| Gujarati | guj  | [guj]         | 95          |


**Spanish**

**1. Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech (openslr_spa)**

[Paper](https://aclanthology.org/2020.lrec-1.801.pdf) |
[SLR61](http://www.openslr.org/61/) |
[SLR71](http://www.openslr.org/71/) |
[SLR72](http://www.openslr.org/72/) |
[SLR73](https://www.openslr.org/73/) |
[SLR74](http://www.openslr.org/74/) |
[SLR75](http://www.openslr.org/75/)

**License**: [Attribution-ShareAlike 4.0 International](https://openslr.elda.org/resources/75/LICENSE)

**Descriptions**:
This is a collection of Spanish speech in 6 Latin American dialects.
For each dialect except SLR74 (Puerto Rico Spanish), we sample 5 min from male speakers and 5 min from female speakers.
For SLR74, we sample a 10-min dev set from female speakers only.
The language codes for directory names are the first 3 letters of the dialect names.

We expect [spa] as the predicted language code.

| Dialects    | Code | Expected Code | #Utterances |
|-------------|------|---------------|-------------|
| Argentinian | arg  | [spa]         | 127         |
| Chilean     | chi  | [spa]         | 107         |
| Colombian   | col  | [spa]         | 121         |
| Peruvian    | per  | [spa]         | 106         |
| Puerto Rico | pue  | [spa]         | 100         |
| Venezuelan  | ven  | [spa]         | 125         |
`

export { data_description }
