DerivedFunction committed on
Commit a7cd38e · 1 Parent(s): 1ca65bf

Update README.md

Files changed (1)
  1. README.md +140 -16
README.md CHANGED
@@ -9,35 +9,159 @@ metrics:
  - recall
  - f1
  - accuracy
  model-index:
- - name: lang-ner-xlmr
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # lang-ner-xlmr

- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0427
- - Precision: 0.8949
- - Recall: 0.9144
- - F1: 0.9046
- - Accuracy: 0.9892

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure

@@ -88,4 +212,4 @@ The following hyperparameters were used during training:
  - Transformers 5.0.0
  - Pytorch 2.10.0+cu128
  - Datasets 4.0.0
- - Tokenizers 0.22.2

  - recall
  - f1
  - accuracy
+ language:
+ - multilingual
+ - af
+ - am
+ - ar
+ - as
+ - ba
+ - be
+ - bg
+ - bn
+ - bo
+ - br
+ - bs
+ - ca
+ - ce
+ - ckb
+ - cs
+ - cy
+ - da
+ - de
+ - dv
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - ga
+ - gd
+ - gl
+ - gu
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lb
+ - lo
+ - lt
+ - lv
+ - mg
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - mt
+ - my
+ - ne
+ - nl
+ - 'no'
+ - ny
+ - oc
+ - om
+ - or
+ - pa
+ - pl
+ - ps
+ - pt
+ - rm
+ - ro
+ - ru
+ - sd
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - tg
+ - th
+ - ti
+ - tl
+ - tr
+ - tt
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - yo
+ - zh
+ - zu
  model-index:
+ - name: polyglot-tagger
  results: []
+ datasets:
+ - wikimedia/wikipedia
+ - HuggingFaceFW/finetranslations
+ - google/smol
+ - DerivedFunction/nlp-noise-snippets
+ - DerivedFunction/wikipedia-language-snippets-filtered
+ - DerivedFunction/finetranslations-filtered
+ - DerivedFunction/lang-ner-v2
+ pipeline_tag: token-classification
  ---

+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67ee3f0a66388136438834cc/OnfV_fN2br5c4cPnOn6O0.png)
+
+ Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 languages.
+ The model predicts BIO-style language tags over tokens, which makes it useful for
+ language identification, code-switch detection, and multilingual document analysis.
+
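As a quick orientation, here is a minimal inference sketch with the Transformers token-classification pipeline. The repository id `DerivedFunction/polyglot-tagger` and the exact BIO-style tag names (e.g. `B-en` / `I-en`) are assumptions inferred from this card, not verified outputs.

```python
# Minimal sketch: token-level language tagging with the Transformers pipeline.
# Assumptions: the checkpoint is published as "DerivedFunction/polyglot-tagger"
# and emits BIO-style language tags such as "B-en" / "I-en" (not verified here).
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="DerivedFunction/polyglot-tagger",
    aggregation_strategy="simple",  # merge subword predictions into contiguous spans
)

text = "Bonjour tout le monde. This second sentence is in English."
for span in tagger(text):
    # Each aggregated span carries the predicted language label, a confidence
    # score, and character offsets back into the input text.
    print(span["entity_group"], round(float(span["score"]), 3), text[span["start"]:span["end"]])
```

With `aggregation_strategy="simple"`, subword predictions sharing a label are merged into contiguous spans, which makes code-switch boundaries easy to read off.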

  ## Model description

+ Introducing Polyglot Tagger, a new way to classify multilingual documents. Because it is trained specifically for token classification over individual sentences, the model
+ generalizes well across a variety of languages, also behaves as a multi-label classifier, and can extract sentences by their language.

  ## Intended uses & limitations
+ This model can be treated as a base model for further fine-tuning on specific language identification or extraction tasks.
+ Note that, as a general language tagging model, it can get confused by related languages from the same family or by short texts: for example, English and German, Spanish and Portuguese, or Russian and Ukrainian.
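Since the card positions this checkpoint as a base for further fine-tuning, the sketch below shows one way to re-head it for a narrower tagger. The repository id and the reduced label set are illustrative assumptions, not part of the original training setup.

```python
# Sketch: reuse the checkpoint as a base for a narrower tagging task.
# The repo id and the two-language label set below are illustrative assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-en", "I-en", "B-fr", "I-fr"]  # hypothetical reduced label set
model = AutoModelForTokenClassification.from_pretrained(
    "DerivedFunction/polyglot-tagger",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
    ignore_mismatched_sizes=True,  # the original classification head has a different shape
)
tokenizer = AutoTokenizer.from_pretrained("DerivedFunction/polyglot-tagger")
# From here, fine-tune with the usual token-classification recipe (e.g. Trainer).
```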

+ The model was trained on sentences with a minimum of four tokens, so it may not accurately classify very short or ambiguous statements. Note that this model is experimental
+ and may produce unexpected results compared to generic text classifiers. It was also trained on cleaned text, so "messy" text may produce unexpectedly different results.

+ > Note that Romanized versions of languages, such as Romanized Russian or Hindi, are not included in the training set.

+ ### Training and Evaluation Data
+ A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as the language coverage,
+ can be found in `DerivedFunction/lang-ner-v2`.
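For a quick look at that data, here is a minimal sketch with the `datasets` library; split and column names are not assumed and are discovered by printing.

```python
# Sketch: peek at the dataset referenced by the card.
# Split names and column layout are not assumed; printing reveals them.
from datasets import load_dataset

ds = load_dataset("DerivedFunction/lang-ner-v2")
print(ds)                # available splits and features

split = next(iter(ds))   # take whichever split comes first
print(ds[split][0])      # first row: shows how sentences and tags are stored
```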
+
+ The model achieves the following results on the evaluation set:
+ - Loss: 0.0427
+ - Precision: 0.8949
+ - Recall: 0.9144
+ - F1: 0.9046
+ - Accuracy: 0.9892

  ## Training procedure

  - Transformers 5.0.0
  - Pytorch 2.10.0+cu128
  - Datasets 4.0.0
+ - Tokenizers 0.22.2