Translation
Transformers
PyTorch
ONNX
Safetensors
m2m_100
text2text-generation
small100
flores101
gsarti/flores_101
tico19
gmnlp/tico19
tatoeba
Instructions to use alirezamsh/small100 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use alirezamsh/small100 with Transformers:
# Use a pipeline as a high-level helper # Warning: Pipeline type "translation" is no longer supported in transformers v5. # You must load the model directly (see below) or downgrade to v4.x with: # 'pip install "transformers<5.0.0' from transformers import pipeline pipe = pipeline("translation", model="alirezamsh/small100")# Load model directly from transformers import AutoTokenizer, AutoModelForSeq2SeqLM tokenizer = AutoTokenizer.from_pretrained("alirezamsh/small100") model = AutoModelForSeq2SeqLM.from_pretrained("alirezamsh/small100") - Inference
- Notebooks
- Google Colab
- Kaggle
Commit ·
3e1147d
1
Parent(s): efd7721
Update README.md
Browse files
README.md
CHANGED
|
@@ -115,13 +115,13 @@ datasets:
|
|
| 115 |
|
| 116 |
# SMALL-100 Model
|
| 117 |
|
| 118 |
-
SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs, that achieves competitive results with M2M-100 while being much smaller and faster. It is introduced in [this paper](https://arxiv.org/abs/2210.11621), and initially released in [this repository](https://github.com/alirezamshi/small100).
|
| 119 |
|
| 120 |
The model architecture and config are the same as [M2M-100](https://huggingface.co/facebook/m2m100_418M/tree/main) implementation, but the tokenizer is modified to adjust language codes. So, you should load the tokenizer locally from [tokenization_small100.py](https://huggingface.co/alirezamsh/small100/blob/main/tokenization_small100.py) file for the moment.
|
| 121 |
|
| 122 |
**Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it by ```pip install sentencepiece```
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
SMaLL-100 is a seq-to-seq model for the translation task. The input to the model is ```source:[tgt_lang_code] + src_tokens + [EOS]``` and ```target: tgt_tokens + [EOS]```. An example of supervised training is shown below:
|
| 127 |
|
|
@@ -142,7 +142,7 @@ loss = model(**model_inputs).loss # forward pass
|
|
| 142 |
|
| 143 |
Training data can be provided upon request.
|
| 144 |
|
| 145 |
-
|
| 146 |
|
| 147 |
```
|
| 148 |
from transformers import M2M100ForConditionalGeneration
|
|
@@ -169,11 +169,11 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
| 169 |
# => "Life is like a box of chocolate."
|
| 170 |
```
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
Please refer to [original repository](https://github.com/alirezamshi/small100) for spBLEU computation.
|
| 175 |
|
| 176 |
-
|
| 177 |
|
| 178 |
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greeek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
|
| 179 |
|
|
|
|
| 115 |
|
| 116 |
# SMALL-100 Model
|
| 117 |
|
| 118 |
+
SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs, that achieves competitive results with M2M-100 while being much smaller and faster. It is introduced in [this paper](https://arxiv.org/abs/2210.11621)(accepted to EMNLP2022), and initially released in [this repository](https://github.com/alirezamshi/small100).
|
| 119 |
|
| 120 |
The model architecture and config are the same as [M2M-100](https://huggingface.co/facebook/m2m100_418M/tree/main) implementation, but the tokenizer is modified to adjust language codes. So, you should load the tokenizer locally from [tokenization_small100.py](https://huggingface.co/alirezamsh/small100/blob/main/tokenization_small100.py) file for the moment.
|
| 121 |
|
| 122 |
**Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it by ```pip install sentencepiece```
|
| 123 |
|
| 124 |
+
- **Supervised Training**
|
| 125 |
|
| 126 |
SMaLL-100 is a seq-to-seq model for the translation task. The input to the model is ```source:[tgt_lang_code] + src_tokens + [EOS]``` and ```target: tgt_tokens + [EOS]```. An example of supervised training is shown below:
|
| 127 |
|
|
|
|
| 142 |
|
| 143 |
Training data can be provided upon request.
|
| 144 |
|
| 145 |
+
- **Generation**
|
| 146 |
|
| 147 |
```
|
| 148 |
from transformers import M2M100ForConditionalGeneration
|
|
|
|
| 169 |
# => "Life is like a box of chocolate."
|
| 170 |
```
|
| 171 |
|
| 172 |
+
- **Evaluation**
|
| 173 |
|
| 174 |
Please refer to [original repository](https://github.com/alirezamshi/small100) for spBLEU computation.
|
| 175 |
|
| 176 |
+
- **Languages Covered**
|
| 177 |
|
| 178 |
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greeek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
|
| 179 |
|