pstroe committed on
Commit 61489ed · 1 Parent(s): 503d71b

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -18,9 +18,9 @@ I undertook the following preprocessing steps:
   - Retain only sentences with a Latin vocabulary ratio of > 85%.
   - Exclude all lines containing '^' --> hints at the presence of OCR errors.
 
- The result is a corpus of ~390 million tokens.
+ The result is a corpus of ~100 million tokens.
 
- The dataset used to train this model is available [HERE](https://huggingface.co/datasets/pstroe/cc100-latin).
+ The dataset used to train this will be available on Hugging Face later [HERE (does not work yet)]().
 
   ### Contact
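
The two filtering steps visible in the diff context (keep sentences whose Latin vocabulary ratio exceeds 85%, drop lines containing '^') could be sketched roughly as below. This is only an illustration, not the author's actual preprocessing code: the `LATIN_VOCAB` set, the letter-run tokenization, and the names `latin_ratio` and `keep_line` are all assumptions, and each line is treated as a single sentence.

```python
import re

# Hypothetical stand-in for a real Latin word list (assumption, not from the README).
LATIN_VOCAB = {"gallia", "est", "omnis", "divisa", "in", "partes", "tres"}

# Assumed tokenization: runs of letters, ignoring digits and punctuation.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def latin_ratio(sentence, vocab):
    """Return the fraction of tokens in `sentence` that appear in `vocab`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


def keep_line(line, vocab=LATIN_VOCAB, min_ratio=0.85):
    """Apply both filters: no '^' in the line, and Latin ratio above `min_ratio`."""
    if "^" in line:  # '^' is taken as a hint of OCR errors
        return False
    return latin_ratio(line, vocab) > min_ratio


if __name__ == "__main__":
    sample = [
        "Gallia est omnis divisa in partes tres.",
        "Gallia est ^ omnis",
        "Gallia est the best",
    ]
    for line in sample:
        print(keep_line(line), line)
```

With a real word list, `min_ratio` would correspond to the 85% threshold named in the README; how tokens were actually counted there is not stated.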