# LASER Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).

This folder contains source code for training LASER embeddings.

## Prepare data and configuration file

Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).

Create a JSON config file with this format:
```
{
  "src_vocab": "/path/to/spm.src.cvocab",
  "tgt_vocab": "/path/to/spm.tgt.cvocab",
  "train": [
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang1-tgtlang0/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang1-tgtlang1/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
    },
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang2-tgtlang0/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang2-tgtlang1/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
    },
    ...
  ],
  "valid": [
    {
      "type": "translation",
      "id": 0,
      "src": "/unused",
      "tgt": "/unused"
    }
  ]
}
```
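where each `src` and `tgt` path points to a binarized, indexed fairseq dataset.
`id` is the target language id: entries that share a target language share the same `id`.
The `valid` entry in this example uses placeholder paths, which matches the `--disable-validation` flag in the training command below.
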
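
Writing the `train` list by hand gets repetitive as language pairs multiply. The following is a minimal Python sketch for generating the config, assuming one directory per language pair named `<src>-<tgt>` with a `train.<lang>` file prefix, as in the example paths above. The helper `build_laser_config` and its arguments are hypothetical names, not part of fairseq or LASER.

```python
import json
from itertools import product


def build_laser_config(data_root, src_langs, tgt_langs, src_vocab, tgt_vocab):
    """Build a config dict in the format shown above.

    Assumption: each pair lives in <data_root>/<src>-<tgt>/ and the
    binarized files use the prefix "train.<lang>".
    """
    # The target language id is the position of the language in tgt_langs.
    tgt_ids = {lang: i for i, lang in enumerate(tgt_langs)}
    train = [
        {
            "type": "translation",
            "id": tgt_ids[tgt],
            "src": f"{data_root}/{src}-{tgt}/train.{src}",
            "tgt": f"{data_root}/{src}-{tgt}/train.{tgt}",
        }
        for src, tgt in product(src_langs, tgt_langs)
    ]
    # Placeholder validation entry, mirroring the example config.
    valid = [{"type": "translation", "id": 0, "src": "/unused", "tgt": "/unused"}]
    return {
        "src_vocab": src_vocab,
        "tgt_vocab": tgt_vocab,
        "train": train,
        "valid": valid,
    }


config = build_laser_config(
    "/path/to",
    src_langs=["srclang1", "srclang2"],
    tgt_langs=["tgtlang0", "tgtlang1"],
    src_vocab="/path/to/spm.src.cvocab",
    tgt_vocab="/path/to/spm.tgt.cvocab",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Write `config_json` to a file and pass that file's path as the first argument to `fairseq-train`.
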
## Training Command Line Example

```
fairseq-train \
  /path/to/configfile_described_above.json \
  --user-dir examples/laser/laser_src \
  --log-interval 100 --log-format simple \
  --task laser --arch laser_lstm \
  --save-dir . \
  --optimizer adam \
  --lr 0.001 \
  --lr-scheduler inverse_sqrt \
  --clip-norm 5 \
  --warmup-updates 90000 \
  --update-freq 2 \
  --dropout 0.0 \
  --encoder-dropout-out 0.1 \
  --max-tokens 2000 \
  --max-epoch 50 \
  --encoder-bidirectional \
  --encoder-layers 5 \
  --encoder-hidden-size 512 \
  --decoder-layers 1 \
  --decoder-hidden-size 2048 \
  --encoder-embed-dim 320 \
  --decoder-embed-dim 320 \
  --decoder-lang-embed-dim 32 \
  --warmup-init-lr 0.001 \
  --disable-validation
```
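
To see what `--lr`, `--warmup-init-lr` and `--warmup-updates` imply together, here is a small Python sketch of an inverse square root schedule. It assumes the scheduler ramps linearly from `--warmup-init-lr` to `--lr` over the warmup updates and then decays in proportion to the inverse square root of the update number. `inverse_sqrt_lr` is an illustrative helper, not fairseq's implementation, so check the fairseq source for the exact behavior.

```python
import math


def inverse_sqrt_lr(num_updates, lr=0.001, warmup_init_lr=0.001, warmup_updates=90000):
    """Learning rate at a given update under the assumed schedule.

    Defaults mirror the flags in the training command above.
    """
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        lr_step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * lr_step
    # After warmup: decay proportional to 1 / sqrt(num_updates),
    # scaled so the value equals lr exactly at the end of warmup.
    return lr * math.sqrt(warmup_updates / num_updates)


for step in (0, 45000, 90000, 360000):
    print(step, inverse_sqrt_lr(step))
```

Under that assumption, because `--warmup-init-lr` equals `--lr` in the command above, the learning rate stays flat at 0.001 for the first 90000 updates and only then starts to decay.
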
## Applications

We showcase several applications of multilingual sentence embeddings,
with code to reproduce our results (in the `tasks` directory of the LASER repository).

* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the
  [*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix):
  Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the
  [*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli)
  using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed):
  an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages.

**For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.**

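
As a rough illustration of what similarity search over sentence embeddings involves, the Python sketch below pairs each query vector with its closest candidate by cosine similarity. The toy 3-dimensional vectors and the helper names are made up for this example. Real embeddings would come from the trained encoder, and the code in the linked task directories may use a different search method.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def nearest_neighbors(queries, candidates):
    """For each query, return (index, cosine similarity) of the closest candidate."""
    cand_unit = [l2_normalize(c) for c in candidates]
    results = []
    for q in queries:
        q_unit = l2_normalize(q)
        # The dot product of unit vectors is the cosine similarity.
        sims = [sum(a * b for a, b in zip(q_unit, c)) for c in cand_unit]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


# Toy vectors standing in for sentence embeddings in two languages.
queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
candidates = [[0.1, 0.9, 0.3], [0.9, 0.2, 0.1], [0.0, 0.0, 1.0]]

matches = nearest_neighbors(queries, candidates)
for i, (j, sim) in enumerate(matches):
    print(f"query {i} -> candidate {j} (cosine {sim:.3f})")
```

Because sentences from all languages share one embedding space, the same comparison applies whether the queries and candidates are in the same language or in different ones.
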
## References

[1] Holger Schwenk and Matthijs Douze,
    [*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619),
    ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li,
    [*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf),
    LREC, pages 3548-3551, 2018.

[3] Holger Schwenk,
    [*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037),
    ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov,
    [*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269),
    EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk,
    [*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136),
    arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk,
    [*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464),
    arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
    [*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791),
    arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin,
    [*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944)