Update README.md
README.md
CHANGED
@@ -62,8 +62,7 @@ and other sources.
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
 The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences and before training
-the punctuation is normalized using a modified version of the join-single-file.py script from
-[SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
+the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
 #### Tokenization
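
For readers of this change, the filtering step described in the README text (deduplicate, then drop pairs whose embedding cosine similarity is below 0.75) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names `cosine_similarity` and `filter_pairs` and the `embed` callback are hypothetical. In a real run, `embed` would presumably wrap a LaBSE model (for example via `SentenceTransformer("sentence-transformers/LaBSE").encode`), whereas here it is left as a plain callable so the sketch stays dependency-free.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],  # hypothetical: maps a sentence to its embedding
    threshold: float = 0.75,         # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then keep pairs with similarity >= threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append(key)
    return kept
```

With a toy embedding table in place of a real model, a call such as `filter_pairs(pairs, toy.__getitem__)` keeps only the pairs whose two sides point in roughly the same direction. Note that deduplication here is by exact string match; the README does not say which deduplication criterion was actually used.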