---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
---

# XLM-V (Base-sized model)
|
XLM-V is a multilingual language model with a one million token vocabulary, trained on 2.5TB of data from Common Crawl (the same data as XLM-R).
It was introduced in the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472)
by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.

**Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team. [This repository](https://github.com/stefan-it/xlm-v-experiments) documents all necessary integration steps.
|
## Model description

From the abstract of the XLM-V paper:
|
> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
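
The shorter, more semantically meaningful tokenizations mentioned in the abstract are easy to inspect yourself. Below is a minimal sketch (not from the paper) that compares the segmentation produced by the XLM-R and XLM-V tokenizers, assuming the `xlm-roberta-base` and `facebook/xlm-v-base` checkpoints from the Hugging Face Hub; exact token counts will vary by sentence and language:

```python
from transformers import AutoTokenizer

# Load both tokenizers from the Hub
xlmr_tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
xlmv_tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-v-base')

sentence = "Paris is the capital of France."

# Compare the subword segmentation produced by the two vocabularies
for name, tokenizer in [('XLM-R', xlmr_tokenizer), ('XLM-V', xlmv_tokenizer)]:
    tokens = tokenizer.tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```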
|
## Usage

You can use this model directly with a pipeline for masked language modeling:
|
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='facebook/xlm-v-base')
>>> unmasker("Paris is the <mask> of France.")

[{'score': 0.9286897778511047,
  'token': 133852,
  'token_str': 'capital',
  'sequence': 'Paris is the capital of France.'},
 {'score': 0.018073994666337967,
  'token': 46562,
  'token_str': 'Capital',
  'sequence': 'Paris is the Capital of France.'},
 {'score': 0.013238662853837013,
  'token': 8696,
  'token_str': 'centre',
  'sequence': 'Paris is the centre of France.'},
 {'score': 0.010450296103954315,
  'token': 550136,
  'token_str': 'heart',
  'sequence': 'Paris is the heart of France.'},
 {'score': 0.005028395913541317,
  'token': 60041,
  'token_str': 'center',
  'sequence': 'Paris is the center of France.'}]
```
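
You can also load the tokenizer and model directly to obtain hidden-state features for a piece of text. A minimal PyTorch sketch using the generic `AutoTokenizer`/`AutoModel` classes:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-v-base')
model = AutoModel.from_pretrained('facebook/xlm-v-base')

text = "Paris is the capital of France."
encoded_input = tokenizer(text, return_tensors='pt')

# Forward pass; last_hidden_state has shape (batch_size, sequence_length, hidden_size)
output = model(**encoded_input)
features = output.last_hidden_state
```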
|
## Bias, Risks, and Limitations

Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), as XLM-V has a similar architecture
and has been trained on similar data.
|
### BibTeX entry and citation info

```bibtex
@ARTICLE{2023arXiv230110472L,
       author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian},
        title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
         year = 2023,
        month = jan,
          eid = {arXiv:2301.10472},
        pages = {arXiv:2301.10472},
          doi = {10.48550/arXiv.2301.10472},
archivePrefix = {arXiv},
       eprint = {2301.10472},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
```