| # LASER Language-Agnostic SEntence Representations | |
| LASER is a library to calculate and use multilingual sentence embeddings. | |
| **NEWS** | |
| * 2023/11/30 Released [**P-xSIM**](tasks/pxsim), a dual approach extension to multilingual similarity search (xSIM) | |
| * 2023/11/16 Released [**laser_encoders**](laser_encoders), a pip-installable package supporting LASER-2 and LASER-3 models | |
| * 2023/06/26 [**xSIM++**](https://arxiv.org/abs/2306.12907) evaluation pipeline and data [**released**](tasks/xsimplusplus/README.md) | |
| * 2022/07/06 Updated LASER models with support for over 200 languages are [**now available**](nllb/README.md) | |
| * 2022/07/06 Multilingual similarity search (**xSIM**) evaluation pipeline [**released**](tasks/xsim/README.md) | |
| * 2022/05/03 [**Librivox S2S is available**](tasks/librivox-s2s): Speech-to-Speech translations automatically mined in Librivox [9] | |
| * 2019/11/08 [**CCMatrix is available**](tasks/CCMatrix): Mining billions of high-quality parallel sentences on the WEB [8] | |
| * 2019/07/31 Gilles Bodard and Jérémy Rapin provided a [**Docker environment**](docker) to use LASER | |
| * 2019/07/11 [**WikiMatrix is available**](tasks/WikiMatrix): bitext extraction for 1620 language pairs in Wikipedia [7] | |
| * 2019/03/18 Switched to the BSD license | |
| * 2019/02/13 The code to perform bitext mining is [**now available**](tasks/bucc) | |
| **CURRENT VERSION:** | |
| * We now provide updated LASER models which support over 200 languages. Please see [here](nllb/README.md) for more details including how to download the models and perform inference. | |
| In our experience, the sentence encoder also supports code-switching, i.e. | |
| a single sentence can contain words in several different languages. | |
| We also have some evidence that the encoder can generalize to | |
| languages which were not seen during training, but which belong to | |
| a language family that is covered by the training languages. | |
| A detailed description of how the multilingual sentence embeddings are trained can | |
| be found [here](https://arxiv.org/abs/2205.12654), together with an experimental evaluation. | |
| ## The core sentence embedding package: `laser_encoders` | |
| We provide a package `laser_encoders` with minimal dependencies. | |
| It supports LASER-2 (a single encoder for the languages listed [below](#supported-languages)) | |
| and LASER-3 (147 language-specific encoders described [here](nllb/README.md)). | |
| The package can be installed simply with `pip install laser_encoders` and used as below: | |
| ```python | |
| from laser_encoders import LaserEncoderPipeline | |
| encoder = LaserEncoderPipeline(lang="eng_Latn") | |
| embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."]) | |
| print(embeddings.shape) # (2, 1024) | |
| ``` | |
| The laser_encoders [readme file](laser_encoders) provides more examples of its installation and usage. | |
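Since the embeddings are multilingual, a sentence and its translation are expected to lie close together. A minimal sketch of comparing two embeddings by cosine similarity is given below. The helpers `cosine_similarity` and `compare_translation_pair` are hypothetical names and not part of `laser_encoders`; the French example sentence and the `fra_Latn` language code are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_translation_pair() -> float:
    """Hypothetical usage: embed an English/French pair and compare them.

    Requires `pip install laser_encoders`; imported lazily so that
    cosine_similarity stays usable without the package installed.
    """
    from laser_encoders import LaserEncoderPipeline

    english = LaserEncoderPipeline(lang="eng_Latn")
    french = LaserEncoderPipeline(lang="fra_Latn")
    en_vec = english.encode_sentences(["This is a sentence encoder."])[0]
    fr_vec = french.encode_sentences(["Ceci est un encodeur de phrases."])[0]
    return cosine_similarity(en_vec, fr_vec)
```

`cosine_similarity` accepts any sequence of floats, so it can be applied to a single row of the array returned by `encode_sentences`.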
| ## The full LASER kit | |
| Apart from `laser_encoders`, we provide support for LASER-1 (the original multilingual encoder) | |
| and for various LASER applications listed below. | |
| ### Dependencies | |
| * Python >= 3.7 | |
| * [PyTorch 1.0](http://pytorch.org/) | |
| * [NumPy](http://www.numpy.org/), tested with 1.15.4 | |
| * [Cython](https://pypi.org/project/Cython/), needed by Python wrapper of FastBPE, tested with 0.29.6 | |
| * [Faiss](https://github.com/facebookresearch/faiss), for fast similarity search and bitext mining | |
| * [transliterate 1.10.2](https://pypi.org/project/transliterate) (`pip install transliterate`) | |
| * [jieba 0.39](https://pypi.org/project/jieba/), Chinese segmenter (`pip install jieba`) | |
| * [mecab 0.996](https://pypi.org/project/JapaneseTokenizer/), Japanese segmenter | |
| * tokenization from the Moses encoder (installed automatically) | |
| * [FastBPE](https://github.com/glample/fastBPE), fast C++ implementation of byte-pair encoding (installed automatically) | |
| * [Fairseq](https://github.com/pytorch/fairseq), sequence modeling toolkit (`pip install fairseq==0.12.1`) | |
| * [tabulate](https://pypi.org/project/tabulate), pretty-print tabular data (`pip install tabulate`) | |
| * [pandas](https://pypi.org/project/pandas), data analysis toolkit (`pip install pandas`) | |
| * [Sentencepiece](https://github.com/google/sentencepiece), subword tokenization (installed automatically) | |
| ### Installation | |
| * install the `laser_encoders` package, e.g. with `pip install -e .` to install it in editable mode | |
| * set the environment variable 'LASER' to the root of the installation, e.g. | |
| `export LASER="${HOME}/projects/laser"` | |
| * download the encoders from Amazon S3, e.g. with `bash ./nllb/download_models.sh` | |
| * download third-party software with `bash ./install_external_tools.sh` | |
| * download the data used in the example tasks (see description for each task) | |
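The steps above depend on the `LASER` environment variable pointing at the root of the installation. As a small sanity check before running any task, something like the following could be used. The function `resolve_laser_root` is a hypothetical helper written for illustration and is not shipped with LASER.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_laser_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory named by the LASER variable, or raise if unusable.

    `env` defaults to the process environment; passing a plain dict makes
    the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    value = env.get("LASER")
    if not value:
        raise RuntimeError(
            'LASER is not set; e.g. export LASER="${HOME}/projects/laser"'
        )
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"LASER points to {root}, which is not a directory")
    return root
```

Calling `resolve_laser_root()` at the start of a script fails early with a readable message instead of a missing-file error later on.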
| ## Applications | |
| We showcase several applications of multilingual sentence embeddings | |
| with code to reproduce our results (in the directory "tasks"). | |
| * [**Cross-lingual document classification**](tasks/mldoc) using the | |
| [*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6] | |
| * [**WikiMatrix**](tasks/WikiMatrix) | |
| Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7] | |
| * [**Bitext mining**](tasks/bucc) using the | |
| [*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5] | |
| * [**Cross-lingual NLI**](tasks/xnli) | |
| using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6] | |
| * [**Multilingual similarity search**](tasks/similarity) [1,6] | |
| * [**Sentence embedding of text files**](tasks/embed) | |
| an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages. | |
| **For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.** | |
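To illustrate the idea behind multilingual similarity search, the sketch below takes sentence embeddings for two languages, finds the nearest target embedding for each source embedding by cosine similarity, and reports how often the nearest neighbor is not the aligned translation. It is a simplified illustration under stated assumptions, not the code in `tasks/similarity` or the xSIM pipeline: the function names are hypothetical, the i-th source sentence is assumed to be the translation of the i-th target sentence, and the search is brute force rather than Faiss-based.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def nearest_neighbors(sources: Sequence[Vector], targets: Sequence[Vector]) -> List[int]:
    """For each source vector, return the index of the most similar target."""
    if len(targets) == 0:
        raise ValueError("targets must not be empty")
    src = [_normalize(v) for v in sources]
    tgt = [_normalize(v) for v in targets]
    result = []
    for s in src:
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt]
        result.append(max(range(len(tgt)), key=scores.__getitem__))
    return result


def similarity_error_rate(sources: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Fraction of source sentences whose nearest target is not the aligned one."""
    if len(sources) == 0 or len(sources) != len(targets):
        raise ValueError("sources and targets must be non-empty and aligned")
    nearest = nearest_neighbors(sources, targets)
    errors = sum(1 for i, j in enumerate(nearest) if i != j)
    return errors / len(nearest)
```

The inputs can be the arrays returned by `encode_sentences` for two aligned lists of sentences. For large collections a Faiss index would be the practical choice instead of the quadratic loop used here.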
| ## License | |
| LASER is BSD-licensed, as found in the [`LICENSE`](LICENSE) file in the root directory of this source tree. | |
| ## Supported languages | |
| The original LASER model was trained on the following languages: | |
| Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, | |
| Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, | |
| Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, | |
| Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, | |
| Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, | |
| Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, | |
| Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, | |
| Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, | |
| Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, | |
| Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese. | |
| We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g. | |
| Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, | |
| Swiss German or Western Frisian. | |
| ### LASER3 | |
| Updated LASER models referred to as *[LASER3](nllb/README.md)* supplement the above list with support for 147 languages. The full list of supported languages can be seen [here](nllb/README.md#list-of-available-laser3-encoders). | |
| ## References | |
| [1] Holger Schwenk and Matthijs Douze, | |
| [*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619), | |
| ACL workshop on Representation Learning for NLP, 2017 | |
| [2] Holger Schwenk and Xian Li, | |
| [*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf), | |
| LREC, pages 3548-3551, 2018. | |
| [3] Holger Schwenk, | |
| [*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037), | |
| ACL, July 2018 | |
| [4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, | |
| [*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269), | |
| EMNLP, 2018. | |
| [5] Mikel Artetxe and Holger Schwenk, | |
| [*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136), | |
| arXiv, Nov 3 2018. | |
| [6] Mikel Artetxe and Holger Schwenk, | |
| [*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464), | |
| arXiv, Dec 26 2018. | |
| [7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, | |
| [*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791), | |
| arXiv, July 11 2019. | |
| [8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, | |
| [*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944) | |
| [9] Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk, | |
| [*Multimodal and Multilingual Embeddings for Large-Scale Speech Mining*](https://papers.nips.cc/paper/2021/hash/8466f9ace6a9acbe71f75762ffc890f1-Abstract.html), NeurIPS 2021, pages 15748-15761. | |
| [10] Kevin Heffernan, Onur Celebi, and Holger Schwenk, | |
| [*Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages*](https://arxiv.org/abs/2205.12654) | |