pstroe/roberta-base-latin-cased2

pstroe committed on Jul 29, 2022

Commit 61489ed · 1 Parent(s): 503d71b

Update README.md

Files changed (1)

README.md +2 -2

@@ -18,9 +18,9 @@ I undertook the following preprocessing steps:
   - Retain only sentences with a Latin vocabulary ratio of > 85%.
   - Exclude all lines containing '^' --> hints at the presence of OCR errors.
-The result is a corpus of ~390 million tokens.
-The dataset used to train this model is available [HERE](https://huggingface.co/datasets/pstroe/cc100-latin).
+The result is a corpus of ~100 million tokens.
+The dataset used to train this will be available on Hugging Face later [HERE (does not work yet)]().
 ### Contact
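
For readers checking the token count changed in this commit, the two filtering rules shown as context in the hunk could look roughly like the sketch below. It is not the author's script. The word list `LATIN_VOCAB`, the helper names `latin_ratio` and `keep_line`, the regex tokenisation, and the choice to treat each line as one sentence are all assumptions made for illustration; the README does not say which lexicon or tokeniser was used.

```python
import re

# Hypothetical stand-in for a Latin word list; the README does not name
# the lexicon that was actually used.
LATIN_VOCAB = {"gallia", "est", "omnis", "divisa", "in", "partes", "tres"}


def latin_ratio(sentence, vocab):
    """Return the fraction of alphabetic tokens that appear in vocab."""
    # Assumption: tokens are runs of letters, compared in lower case.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


def keep_line(line, vocab=LATIN_VOCAB, threshold=0.85):
    """Apply both rules from the README to a single line."""
    # Rule 2: a '^' hints at OCR errors, so the whole line is dropped.
    if "^" in line:
        return False
    # Rule 1: keep only lines whose Latin vocabulary ratio exceeds 85%.
    return latin_ratio(line, vocab) > threshold


if __name__ == "__main__":
    for sample in ["Gallia est omnis divisa in partes tres",
                   "Gallia est the cat",
                   "Gallia est omnis^ divisa"]:
        print(keep_line(sample), sample)
```

Because the threshold is a strict inequality, a line at exactly 85% would be dropped, which matches the "> 85%" wording in the README.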