Update README.md

README.md CHANGED
@@ -75,7 +75,7 @@ The 9,033,998 sentence pairs of synthetic parallel data were created by translat
 
 #### Preprocessing
 
-After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer)
+After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/), which identifies repetitions and fixes encoding problems, and LaBSE embeddings to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score below 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.
 
 #### Tokenization
 All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This model is included.
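The LaBSE filtering step added in this change can be sketched as below. This is a minimal illustration, not code from the repository: it assumes the LaBSE embeddings have already been computed (e.g. with the `sentence-transformers` LaBSE model), and the function name and toy vectors are hypothetical.

```python
import numpy as np

def filter_by_labse_similarity(pairs, src_emb, tgt_emb, threshold=0.5):
    """Keep sentence pairs whose embedding cosine similarity >= threshold.

    pairs: list of (src, tgt) strings; src_emb / tgt_emb: (n, d) arrays of
    precomputed LaBSE embeddings for the source and target sides.
    """
    # Normalize rows so the dot product of each pair is its cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = (src * tgt).sum(axis=1)  # one similarity score per pair
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]

# Toy example with 2-d stand-in embeddings (real LaBSE vectors are 768-d):
pairs = [("guten Tag", "good day"), ("guten Tag", "the stock market fell")]
src = np.array([[1.0, 0.0], [1.0, 0.0]])
tgt = np.array([[0.9, 0.1], [0.0, 1.0]])
print(filter_by_labse_similarity(pairs, src, tgt))
# The misaligned second pair (similarity 0.0 < 0.5) is dropped.
```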
|
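The tokenization step could be reproduced with a sentencepiece training run along these lines. This is a sketch under assumptions: the input path `train.txt` and the model prefix `spm` are placeholders, not names from this repository.

```shell
# Learn a 32,000-piece sentencepiece model from the filtered training data.
# (train.txt and the "spm" prefix are hypothetical placeholders.)
spm_train --input=train.txt --model_prefix=spm --vocab_size=32000
```

This produces `spm.model` and `spm.vocab`; the README states that the resulting model file is shipped with the repository.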