DerivedFunction committed
Commit 8ce3cd1 · verified · 1 parent: 56f4476

Update README.md

Files changed (1): README.md (+7, -2)
README.md CHANGED

```diff
@@ -97,18 +97,23 @@ It achieves the following results on the evaluation set:
 
 ## Model description
 
-More information needed
+Introducing Polyglot Tagger 60L, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model generalizes well
+across a variety of languages, while also behaving as a multi-label classifier that extracts sentences based on their language.
 
 ## Intended uses & limitations
 This model can be treated as a base model for further fine-tuning on specific language identification extraction tasks.
 Note that as a general language tagging model, it can potentially get confused by shared language families or by short texts.
 
+The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short and ambiguous statements.
+
 ### Training and Evaluation Data
 The model was trained on a synthetic dataset of roughly **2.5 million samples**, covering 60 languages across diverse script families
 (Latin, Cyrillic, Indic, Arabic, Han, etc.), from `wikimedia/wikipedia` (up to 200,000 individual sentences, 120,000 reserve from up to 100,000 unique articles,
 taking the first half of Wikipedia after filtering for stubs), `google/smol` (up to 1,000 individual sentences),
 and `HuggingFaceFW/finetranslations` (up to 50,000 sentences, 30,000 reserve from up to 50,000 unique rows),
-which is split into a reserve set for pure documents and a main set for everything else.
+which is split into a reserve set for pure documents and a main set for everything else.
+
+A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources.
 
 The data composition follows a strategic curriculum:
```
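The README describes the tagger as a multi-label classifier that extracts sentences by language. A minimal sketch of that idea at the post-processing level is shown below; the `extract_by_language` helper and the example per-token labels are hypothetical illustrations, not the model's actual output format:

```python
from itertools import groupby

def extract_by_language(tagged_tokens):
    # Merge contiguous runs of same-language tokens into text spans,
    # mimicking how a token-level language tagger can double as a
    # multi-label classifier and per-language sentence extractor.
    return [(lang, " ".join(tok for tok, _ in run))
            for lang, run in groupby(tagged_tokens, key=lambda t: t[1])]

# Hypothetical per-token labels such a tagger might emit for a mixed text:
tokens = [("Hello", "en"), ("world.", "en"),
          ("Bonjour", "fr"), ("le", "fr"), ("monde.", "fr")]
print(extract_by_language(tokens))
# → [('en', 'Hello world.'), ('fr', 'Bonjour le monde.')]
```

With token-level predictions in this shape, document-level labels fall out as the set of distinct languages, and per-language extraction is just the grouped spans.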