| | --- |
| | language: multilingual |
| | thumbnail: |
| | tags: |
| | - audio-classification |
| | - speechbrain |
| | - embeddings |
| | - Language |
| | - Identification |
| | - pytorch |
| | - ECAPA-TDNN |
| | - TDNN |
| | - VoxLingua107 |
| | license: "apache-2.0" |
| | datasets: |
| | - VoxLingua107 |
| | metrics: |
| | - Accuracy |
| | widget: |
| | - example_title: English Sample |
| | src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac |
| | --- |
| | |
| | # VoxLingua107 ECAPA-TDNN Spoken Language Identification Model |
| |
|
| | ## Model description |
| |
|
| | This is a spoken language recognition model trained on the [VoxLingua107 dataset](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) using SpeechBrain. |
| | The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. |
| |
|
| | The model can classify a speech utterance according to the language spoken. |
| | It covers 107 different languages ( |
| | Abkhazian, |
| | Afrikaans, |
| | Amharic, |
| | Arabic, |
| | Assamese, |
| | Azerbaijani, |
| | Bashkir, |
| | Belarusian, |
| | Bulgarian, |
| | Bengali, |
| | Tibetan, |
| | Breton, |
| | Bosnian, |
| | Catalan, |
| | Cebuano, |
| | Czech, |
| | Welsh, |
| | Danish, |
| | German, |
| | Greek, |
| | English, |
| | Esperanto, |
| | Spanish, |
| | Estonian, |
| | Basque, |
| | Persian, |
| | Finnish, |
| | Faroese, |
| | French, |
| | Galician, |
| | Guarani, |
| | Gujarati, |
| | Manx, |
| | Hausa, |
| | Hawaiian, |
| | Hindi, |
| | Croatian, |
| | Haitian, |
| | Hungarian, |
| | Armenian, |
| | Interlingua, |
| | Indonesian, |
| | Icelandic, |
| | Italian, |
| | Hebrew, |
| | Japanese, |
| | Javanese, |
| | Georgian, |
| | Kazakh, |
| | Central Khmer, |
| | Kannada, |
| | Korean, |
| | Latin, |
| | Luxembourgish, |
| | Lingala, |
| | Lao, |
| | Lithuanian, |
| | Latvian, |
| | Malagasy, |
| | Maori, |
| | Macedonian, |
| | Malayalam, |
| | Mongolian, |
| | Marathi, |
| | Malay, |
| | Maltese, |
| | Burmese, |
| | Nepali, |
| | Dutch, |
| | Norwegian Nynorsk, |
| | Norwegian, |
| | Occitan, |
| | Panjabi, |
| | Polish, |
| | Pushto, |
| | Portuguese, |
| | Romanian, |
| | Russian, |
| | Sanskrit, |
| | Scots, |
| | Sindhi, |
| | Sinhala, |
| | Slovak, |
| | Slovenian, |
| | Shona, |
| | Somali, |
| | Albanian, |
| | Serbian, |
| | Sundanese, |
| | Swedish, |
| | Swahili, |
| | Tamil, |
| | Telugu, |
| | Tajik, |
| | Thai, |
| | Turkmen, |
| | Tagalog, |
| | Turkish, |
| | Tatar, |
| | Ukrainian, |
| | Urdu, |
| | Uzbek, |
| | Vietnamese, |
| | Waray, |
| | Yiddish, |
| | Yoruba, |
| | Mandarin Chinese). |
| |
|
| | ## Intended uses & limitations |
| |
|
| | The model has two uses: |
| |
|
| | - use 'as is' for spoken language recognition |
| | - use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data |
| | |
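The second use, building a dedicated language ID model on top of the embeddings, can be sketched with a very simple classifier. The example below is a minimal nearest-centroid sketch under stated assumptions, not the authors' method: it assumes each utterance has already been turned into a plain list of floats (in practice a 256-dimensional vector from `encode_batch`, for example via `emb.squeeze().tolist()`), and the helper names `l2_normalize`, `train_centroids`, and `predict`, as well as the 3-dimensional toy vectors, are hypothetical and made up for illustration.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine scores."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def train_centroids(embeddings, labels):
    """Average the unit-length embeddings of each language into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for emb, label in zip(embeddings, labels):
        unit = l2_normalize(emb)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def predict(centroids, emb):
    """Return the label whose centroid has the highest cosine score."""
    unit = l2_normalize(emb)
    return max(
        centroids,
        key=lambda label: sum(c * u for c, u in zip(centroids[label], unit)),
    )


# Toy 3-dimensional vectors standing in for real 256-dimensional embeddings.
train_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.8, 0.1],
]
train_labels = ["et", "et", "fi", "fi"]

centroids = train_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.8, 0.3, 0.0]))  # et
```

A stronger classifier (for example logistic regression) would likely do better on real data; the centroid approach is used here only because it needs no extra dependencies.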
| | The model is trained on automatically collected YouTube data. For more |
| | information about the dataset, see [here](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/). |
| |
|
| |
|
| | #### How to use |
| |
|
| | ```python |
| | import torchaudio |
| | from speechbrain.pretrained import EncoderClassifier |
| | language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp") |
| | # Download Thai language sample from Omniglot and convert to suitable form |
| | signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_th.mp3") |
| | prediction = language_id.classify_batch(signal) |
| | print(prediction) |
| | # (tensor([[0.3210, 0.3751, 0.3680, 0.3939, 0.4026, 0.3644, 0.3689, 0.3597, 0.3508, |
| | # 0.3666, 0.3895, 0.3978, 0.3848, 0.3957, 0.3949, 0.3586, 0.4360, 0.3997, |
| | # 0.4106, 0.3886, 0.4177, 0.3870, 0.3764, 0.3763, 0.3672, 0.4000, 0.4256, |
| | # 0.4091, 0.3563, 0.3695, 0.3320, 0.3838, 0.3850, 0.3867, 0.3878, 0.3944, |
| | # 0.3924, 0.4063, 0.3803, 0.3830, 0.2996, 0.4187, 0.3976, 0.3651, 0.3950, |
| | # 0.3744, 0.4295, 0.3807, 0.3613, 0.4710, 0.3530, 0.4156, 0.3651, 0.3777, |
| | # 0.3813, 0.6063, 0.3708, 0.3886, 0.3766, 0.4023, 0.3785, 0.3612, 0.4193, |
| | # 0.3720, 0.4406, 0.3243, 0.3866, 0.3866, 0.4104, 0.4294, 0.4175, 0.3364, |
| | # 0.3595, 0.3443, 0.3565, 0.3776, 0.3985, 0.3778, 0.2382, 0.4115, 0.4017, |
| | # 0.4070, 0.3266, 0.3648, 0.3888, 0.3907, 0.3755, 0.3631, 0.4460, 0.3464, |
| | # 0.3898, 0.3661, 0.3883, 0.3772, 0.9289, 0.3687, 0.4298, 0.4211, 0.3838, |
| | # 0.3521, 0.3515, 0.3465, 0.4772, 0.4043, 0.3844, 0.3973, 0.4343]]), tensor([0.9289]), tensor([94]), ['th']) |
| | # The scores in the prediction[0] tensor can be interpreted as cosine scores between |
| | # the languages and the given utterance (i.e., the larger the better) |
| | # The identified language ISO code is given in prediction[3] |
| | print(prediction[3]) |
| | # ['th'] |
| | |
| | # Alternatively, use the utterance embedding extractor: |
| | emb = language_id.encode_batch(signal) |
| | print(emb.shape) |
| | # torch.Size([1, 1, 256]) |
| | ``` |
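Since the scores are "the larger the better", it can be useful to look at the few best candidates rather than only the top one. The following is a minimal sketch, assuming the scores have been converted to a plain Python list (for example with `prediction[0][0].tolist()`) and that a list of language codes in the same order is available; how to obtain that label list from the model is not shown here. The function name `top_k_languages` and the toy labels and scores are made up for illustration.

```python
def top_k_languages(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values: three hypothetical labels with made-up cosine scores.
labels = ["en", "et", "th"]
scores = [0.41, 0.38, 0.93]

print(top_k_languages(scores, labels, k=2))  # [('th', 0.93), ('en', 0.41)]
```

A small gap between the first and second score may be a hint that the prediction is uncertain, although the source does not specify any threshold for this.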
| |
|
| | #### Limitations and bias |
| |
|
| | Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are: |
| |
|
| | - Its accuracy on smaller languages is probably quite limited |
| | - It probably works worse on female speech than on male speech (because the YouTube data includes much more male speech) |
| | - Based on subjective experiments, it doesn't work well on speech with a foreign accent |
| | - It probably doesn't work well on children's speech or on speech from persons with speech disorders |
| |
|
| |
|
| | ## Training data |
| |
|
| | The model is trained on [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/). |
| |
|
| | VoxLingua107 is a speech dataset for training spoken language identification models. |
| | The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. |
| |
|
| | VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours. |
| | The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language. |
| |
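The averages quoted above follow directly from the totals. As a quick check, using only the figures stated in this section:

```python
total_hours = 6628       # total training speech, in hours
num_languages = 107      # languages in the training set
dev_segments = 1609      # segments in the development set
dev_languages = 33       # languages covered by the development set

avg_hours = total_hours / num_languages
avg_dev_segments = dev_segments / dev_languages

print(f"{avg_hours:.1f} hours per language on average")          # 61.9
print(f"{avg_dev_segments:.1f} dev segments per language on average")  # 48.8
```

These are means only; as noted, the actual amount per language varies a lot, so individual languages can be far from these values.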
|
| | ## Training procedure |
| |
|
| | We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model. |
| | The training recipe will be published soon. |
| |
|
| | ## Evaluation results |
| |
|
| | Error rate: 7% on the development dataset |
| |
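For readers who want to compute a comparable figure on their own labeled data, the sketch below shows one straightforward way to do it. It assumes the error rate is the fraction of segments whose predicted language differs from the reference label; the source does not spell out the exact definition, and the function name `error_rate` and the toy labels are hypothetical.

```python
def error_rate(references, predictions):
    """Fraction of items whose predicted label differs from the reference."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("cannot compute an error rate on an empty set")
    errors = sum(ref != pred for ref, pred in zip(references, predictions))
    return errors / len(references)


# Toy labels, made up for illustration: one of five segments is misclassified.
refs = ["et", "fi", "th", "en", "en"]
preds = ["et", "fi", "th", "en", "de"]
print(error_rate(refs, preds))  # 0.2

# Under the same assumption, the rounded 7% figure on 1609 development
# segments would correspond to roughly this many misclassified segments.
approx_errors = round(0.07 * 1609)
print(approx_errors)  # 113
```

Because the reported 7% is itself rounded, the segment count is only an approximation.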
|
| |
|
| | ### BibTeX entry and citation info |
| |
|
| | ```bibtex |
| | @inproceedings{valk2021slt, |
| | title={{VoxLingua107}: a Dataset for Spoken Language Recognition}, |
| | author={J{\"o}rgen Valk and Tanel Alum{\"a}e}, |
| | booktitle={Proc. IEEE SLT Workshop}, |
| | year={2021}, |
| | } |
| | ``` |
| |
|