<p align="center">
<img src="flores_logo.png" width="500">
</p>

# Flores101: Large-Scale Multilingual Machine Translation

## Introduction

Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/
## Pretrained models

Model | Num layers | Embed dimension | FFN dimension | Vocab size | # params | Download
---|---|---|---|---|---|---
`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz

These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125), with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of supported languages is at the bottom of this page.
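
The parameter counts in the table can be roughly reproduced from its other columns. The sketch below is a back-of-the-envelope estimate, not the authors' accounting. It assumes a standard Transformer with as many encoder layers as decoder layers (the `Num layers` value for each), one embedding matrix shared by the encoder and decoder, and it ignores biases and layer norms. The function name `estimate_params` is made up for illustration.

```bash
# Rough parameter estimate for an encoder-decoder Transformer.
# Assumptions (not stated in this README): shared embeddings, equal encoder
# and decoder depth, biases and layer norms ignored.
estimate_params() {
    local layers=$1 embed=$2 ffn=$3 vocab=$4
    local embedding=$(( vocab * embed ))      # shared embedding matrix
    local attn=$(( 4 * embed * embed ))       # Q, K, V and output projections
    local ff=$(( 2 * embed * ffn ))           # two feed-forward projections
    local enc_layer=$(( attn + ff ))          # self-attention + FFN
    local dec_layer=$(( 2 * attn + ff ))      # self- and cross-attention + FFN
    echo $(( embedding + layers * (enc_layer + dec_layer) ))
}

estimate_params 12 1024 4096 256000   # prints 614465536 (table: 615M)
estimate_params 6 512 2048 256000     # prints 175112192 (table: 175M)
```

Under these assumptions the embedding matrix alone accounts for about 262M of the 615M parameters, and about 131M of the 175M.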
## Example Generation code

### Download model, sentencepiece vocab

```bash
fairseq=/path/to/fairseq
cd $fairseq

# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

# Extract
tar -xvzf flores101_mm100_615M.tar.gz
```
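
Before moving on, it can help to confirm that the extracted directory holds the files the later steps refer to. The following is a minimal sketch: the file names are the ones used by the commands in this README, while the helper name `check_model_dir` is hypothetical.

```bash
# Report any file the later steps expect but the extracted directory lacks.
# Returns 0 when all files are present, 1 otherwise.
check_model_dir() {
    local dir=$1 missing=0 f
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, after extracting the archive:
#   check_model_dir flores101_mm100_615M && echo "model directory looks complete"
```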
### Encode using our SentencePiece Model

Note: install SentencePiece from the [google/sentencepiece](https://github.com/google/sentencepiece) repository.

```bash
fairseq=/path/to/fairseq
cd $fairseq

# Download an example dataset, German to French
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr

# Encode both sides with the model's SentencePiece vocabulary
for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
```
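
Encoding should keep one output line per input line, so a quick line-count comparison catches truncated or misaligned files before binarization. This is a minimal sketch, and the helper name `same_line_count` is hypothetical.

```bash
# Succeed only if both files have the same number of lines.
# Returns 2 if either file cannot be read.
same_line_count() {
    local a b
    a=$(wc -l < "$1") || return 2
    b=$(wc -l < "$2") || return 2
    [ "$a" -eq "$b" ]
}

# Usage, after running spm_encode.py:
#   for lang in de fr; do
#       same_line_count raw_input.de-fr.$lang spm.de-fr.$lang \
#           || echo "line count mismatch for $lang" >&2
#   done
```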
### Binarization

```bash
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
```
### Generation

```bash
fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
```
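
`fairseq-generate` writes source, reference and hypothesis lines interleaved, each prefixed with a tag and sample id (hypotheses look like `H-<id>`, then a score, then the text, separated by tabs). Assuming the output of the command above was redirected to a file, the sketch below pulls out the hypotheses in sample order. The helper name `extract_hypotheses` and the file names in the usage comment are hypothetical.

```bash
# Print hypothesis text from captured fairseq-generate output, ordered by sample id.
# $1: file containing the stdout of fairseq-generate
extract_hypotheses() {
    grep '^H-' "$1" \
        | sed 's/^H-//' \
        | sort -t "$(printf '\t')" -k1,1n \
        | cut -f3-
}

# Usage, assuming the generation output was saved to gen_out:
#   extract_hypotheses gen_out > hyp.fr
#   sacrebleu raw_input.de-fr.fr < hyp.fr
```

Sorting numerically on the id restores the original sentence order, which generation does not necessarily preserve, so the hypotheses line up with `raw_input.de-fr.fr` for scoring.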
### Supported Languages and lang code

| Language | lang code | |
| ---|--- | |
Afrikaans | af
| Amharic | am | |
| Arabic | ar | |
| Assamese | as | |
| Asturian | ast | |
| Aymara | ay | |
| Azerbaijani | az | |
| Bashkir | ba | |
| Belarusian | be | |
| Bulgarian | bg | |
| Bengali | bn | |
| Breton | br | |
| Bosnian | bs | |
| Catalan | ca | |
| Cebuano | ceb | |
| Chokwe | cjk | |
| Czech | cs | |
| Welsh | cy | |
| Danish | da | |
| German | de | |
Dyula | dyu
| Greek | el | |
| English | en | |
| Spanish | es | |
| Estonian | et | |
| Persian | fa | |
| Fulah | ff | |
| Finnish | fi | |
| French | fr | |
| Western Frisian | fy | |
| Irish | ga | |
| Scottish Gaelic | gd | |
| Galician | gl | |
| Gujarati | gu | |
| Hausa | ha | |
| Hebrew | he | |
| Hindi | hi | |
| Croatian | hr | |
| Haitian Creole | ht | |
| Hungarian | hu | |
| Armenian | hy | |
| Indonesian | id | |
| Igbo | ig | |
| Iloko | ilo | |
| Icelandic | is | |
| Italian | it | |
| Japanese | ja | |
| Javanese | jv | |
| Georgian | ka | |
| Kachin | kac | |
| Kamba | kam | |
| Kabuverdianu | kea | |
| Kongo | kg | |
| Kazakh | kk | |
| Central Khmer | km | |
| Kimbundu | kmb | |
| Northern Kurdish | kmr | |
| Kannada | kn | |
| Korean | ko | |
| Kurdish | ku | |
| Kyrgyz | ky | |
| Luxembourgish | lb | |
| Ganda | lg | |
| Lingala | ln | |
| Lao | lo | |
| Lithuanian | lt | |
| Luo | luo | |
| Latvian | lv | |
| Malagasy | mg | |
| Maori | mi | |
| Macedonian | mk | |
| Malayalam | ml | |
| Mongolian | mn | |
| Marathi | mr | |
| Malay | ms | |
| Maltese | mt | |
| Burmese | my | |
| Nepali | ne | |
| Dutch | nl | |
| Norwegian | no | |
| Northern Sotho | ns | |
| Nyanja | ny | |
| Occitan | oc | |
| Oromo | om | |
| Oriya | or | |
| Punjabi | pa | |
| Polish | pl | |
| Pashto | ps | |
| Portuguese | pt | |
| Quechua | qu | |
| Romanian | ro | |
| Russian | ru | |
| Sindhi | sd | |
| Shan | shn | |
| Sinhala | si | |
| Slovak | sk | |
| Slovenian | sl | |
| Shona | sn | |
| Somali | so | |
| Albanian | sq | |
| Serbian | sr | |
| Swati | ss | |
| Sundanese | su | |
| Swedish | sv | |
| Swahili | sw | |
| Tamil | ta | |
| Telugu | te | |
| Tajik | tg | |
| Thai | th | |
| Tigrinya | ti | |
| Tagalog | tl | |
| Tswana | tn | |
| Turkish | tr | |
| Ukrainian | uk | |
| Umbundu | umb | |
| Urdu | ur | |
| Uzbek | uz | |
| Vietnamese | vi | |
| Wolof | wo | |
| Xhosa | xh | |
| Yiddish | yi | |
| Yoruba | yo | |
Chinese | zh
| Zulu | zu | |