## Model description

Introducing Polyglot Tagger 60L, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences by language.
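The multi-label, extraction-style behavior described above can be sketched in plain Python. This is only an illustration: the `(token, lang)` pairs and the `extract_by_language` helper below are invented placeholders, not the model's actual output format or API.

```python
# Sketch: grouping hypothetical per-token language tags into per-language
# text spans, the way a token-classification tagger enables sentence
# extraction by language. Tags here are invented for illustration.
def extract_by_language(tagged_tokens):
    """Merge consecutive tokens that share a language tag into spans."""
    spans = []
    for token, lang in tagged_tokens:
        if spans and spans[-1][0] == lang:
            spans[-1][1].append(token)      # extend the current span
        else:
            spans.append((lang, [token]))   # start a new span
    return [(lang, " ".join(toks)) for lang, toks in spans]

tagged = [
    ("Hello", "en"), ("world", "en"), (".", "en"),
    ("Bonjour", "fr"), ("le", "fr"), ("monde", "fr"), (".", "fr"),
]
print(extract_by_language(tagged))
# → [('en', 'Hello world .'), ('fr', 'Bonjour le monde .')]
```

In practice the model's own label set and tokenization replace the hand-written pairs above; the grouping step is what turns token-level tags into extracted sentences.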

## Intended uses & limitations

This model can be treated as a base model for further fine-tuning on specific language identification and extraction tasks.
Note that, as a general language tagging model, it can potentially be confused by shared language families or by short texts.

The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short and ambiguous statements.

### Training and Evaluation Data

The model was trained on a synthetic dataset of roughly **2.5 million samples**, covering 60 languages across diverse script families (Latin, Cyrillic, Indic, Arabic, Han, etc.). It draws on `wikimedia/wikipedia` (up to 200,000 individual sentences plus a 120,000-sentence reserve, from up to 100,000 unique articles, taking the first half of Wikipedia after filtering out stubs), `google/smol` (up to 1,000 individual sentences), and `HuggingFaceFW/finetranslations` (up to 50,000 sentences plus a 30,000-sentence reserve, from up to 50,000 unique rows). Each source is split into a reserve set for pure documents and a main set for everything else.

A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources.
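The row construction just described can be sketched as follows. The sentence pools and the `make_synthetic_row` helper are hypothetical stand-ins for the real pipeline; only the 1-4-sentences-per-row shape comes from the description above.

```python
import random

# Sketch of synthetic-row construction: each training row concatenates
# 1-4 mostly independent, language-labeled sentences. The pool contents
# below are invented placeholders, not the actual training data.
main_pool = {
    "en": ["The cat sat on the mat.", "Rain is expected tomorrow."],
    "fr": ["Le chat dort sur le tapis.", "Il pleuvra demain."],
    "de": ["Die Katze schläft.", "Morgen regnet es."],
}

def make_synthetic_row(pool, rng):
    """Sample 1-4 labeled sentences, each from a randomly chosen language."""
    n = rng.randint(1, 4)
    row = []
    for _ in range(n):
        lang = rng.choice(sorted(pool))        # pick a language
        row.append((lang, rng.choice(pool[lang])))  # pick one of its sentences
    return row

rng = random.Random(0)
print(make_synthetic_row(main_pool, rng))
```

Because languages are sampled independently per sentence, a single row can mix several languages, which is what makes token-level (rather than document-level) supervision necessary.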

The data composition follows a strategic curriculum: