Update README.md
README.md
CHANGED
@@ -18,9 +18,9 @@ I undertook the following preprocessing steps:
 - Retain only sentences with a Latin vocabulary ratio of > 85%.
 - Exclude all lines containing '^' --> hints at the presence of OCR errors.
 
-The result is a corpus of ~
+The result is a corpus of ~100 million tokens.
 
-The dataset used to train this
+The dataset used to train this will be available on Hugging Face later [HERE (does not work yet)]().
 
 ### Contact
 
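
For readers of this change, the two preprocessing steps visible in the diff context (the Latin vocabulary ratio threshold and the `^` exclusion) could be sketched roughly as follows. This is not the author's actual pipeline: the tokenizer (`WORD_RE`), the `latin_vocab` set of known Latin word forms, and the function names `latin_ratio`, `keep_line`, and `filter_corpus` are all hypothetical and stand in for whatever the real preprocessing uses.

```python
import re

# Assumption: a "word" is a run of Unicode letters; the README does not say
# how sentences were tokenized.
WORD_RE = re.compile(r"[^\W\d_]+")


def latin_ratio(sentence, latin_vocab):
    """Return the fraction of word tokens found in latin_vocab (0.0 if none)."""
    tokens = [t.lower() for t in WORD_RE.findall(sentence)]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in latin_vocab)
    return known / len(tokens)


def keep_line(line, latin_vocab, min_ratio=0.85):
    """Apply both filters: drop lines with '^', then require ratio > min_ratio."""
    if "^" in line:  # treated as a hint of OCR errors, per the README
        return False
    return latin_ratio(line, latin_vocab) > min_ratio


def filter_corpus(lines, latin_vocab, min_ratio=0.85):
    """Return only the lines that pass both filters, in their original order."""
    return [line for line in lines if keep_line(line, latin_vocab, min_ratio)]
```

A small usage example with a toy vocabulary (hypothetical, far smaller than any real word list):

```python
vocab = {"gallia", "est", "omnis", "divisa", "in", "partes", "tres"}
lines = [
    "Gallia est omnis divisa in partes tres",
    "Gallia est om^nis divisa",
    "Gallia est the whole country divided",
]
kept = filter_corpus(lines, vocab)
```

Here only the first line would be kept: the second contains `^`, and the third falls below the 85% threshold.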