## Model description

Introducing Polyglot Tagger 60L, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences by language.
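The multi-label, extraction-style behavior described above can be sketched in plain Python. This is only an illustration: the `(token, lang)` pairs and the `extract_by_language` helper below are invented placeholders, not the model's actual output format or API.

```python
# Sketch: grouping hypothetical per-token language tags into per-language
# text spans, the way a token-classification tagger enables sentence
# extraction by language. Tags here are invented for illustration.
def extract_by_language(tagged_tokens):
    """Merge consecutive tokens that share a language tag into spans."""
    spans = []
    for token, lang in tagged_tokens:
        if spans and spans[-1][0] == lang:
            spans[-1][1].append(token)      # extend the current span
        else:
            spans.append((lang, [token]))   # start a new span
    return [(lang, " ".join(toks)) for lang, toks in spans]

tagged = [
    ("Hello", "en"), ("world", "en"), (".", "en"),
    ("Bonjour", "fr"), ("le", "fr"), ("monde", "fr"), (".", "fr"),
]
print(extract_by_language(tagged))
# → [('en', 'Hello world .'), ('fr', 'Bonjour le monde .')]
```

In practice the model's own label set and tokenization replace the hand-written pairs above; the grouping step is what turns token-level tags into extracted sentences.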

## Intended uses & limitations

This model can be treated as a base model for further fine-tuning on specific language identification and extraction tasks.
Note that, as a general language tagging model, it can potentially be confused by shared language families or by short texts.

The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short and ambiguous statements.

### Training and Evaluation Data

The model was trained on a synthetic dataset of roughly **2.5 million samples**, covering 60 languages across diverse script families (Latin, Cyrillic, Indic, Arabic, Han, etc.). It draws on `wikimedia/wikipedia` (up to 200,000 individual sentences plus a 120,000-sentence reserve, from up to 100,000 unique articles, taking the first half of Wikipedia after filtering out stubs), `google/smol` (up to 1,000 individual sentences), and `HuggingFaceFW/finetranslations` (up to 50,000 sentences plus a 30,000-sentence reserve, from up to 50,000 unique rows). Each source is split into a reserve set for pure documents and a main set for everything else.

A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources.
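The row construction just described can be sketched as follows. The sentence pools and the `make_synthetic_row` helper are hypothetical stand-ins for the real pipeline; only the 1-4-sentences-per-row shape comes from the description above.

```python
import random

# Sketch of synthetic-row construction: each training row concatenates
# 1-4 mostly independent, language-labeled sentences. The pool contents
# below are invented placeholders, not the actual training data.
main_pool = {
    "en": ["The cat sat on the mat.", "Rain is expected tomorrow."],
    "fr": ["Le chat dort sur le tapis.", "Il pleuvra demain."],
    "de": ["Die Katze schläft.", "Morgen regnet es."],
}

def make_synthetic_row(pool, rng):
    """Sample 1-4 labeled sentences, each from a randomly chosen language."""
    n = rng.randint(1, 4)
    row = []
    for _ in range(n):
        lang = rng.choice(sorted(pool))        # pick a language
        row.append((lang, rng.choice(pool[lang])))  # pick one of its sentences
    return row

rng = random.Random(0)
print(make_synthetic_row(main_pool, rng))
```

Because languages are sampled independently per sentence, a single row can mix several languages, which is what makes token-level (rather than document-level) supervision necessary.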

The data composition follows a strategic curriculum: