Update README.md
Browse files

README.md CHANGED

@@ -127,24 +127,19 @@ datasets:
 pipeline_tag: token-classification
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 
-
 
-This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
-It achieves the following results on the evaluation set:
-- Loss: 0.0452
-- Precision: 0.8626
-- Recall: 0.8916
-- F1: 0.8769
-- Accuracy: 0.9892
 
 
 ## Model description
 
 Introducing Polyglot Tagger, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model
-generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences based on their language.
 
 ## Intended uses & limitations
 This model can be treated as a base model for further fine-tuning on specific language-identification and extraction tasks.

@@ -167,6 +162,7 @@ The data composition follows a strategic curriculum:
 
 
 
 ## Training procedure
 
 ### Training hyperparameters

@@ -185,6 +181,13 @@ The following hyperparameters were used during training:
 
 ### Training results
 
 | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
 |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
 | 0.0730        | 0.0905 | 2500  | 0.1081          | 0.7241    | 0.8260 | 0.7717 | 0.9760   |
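The card reports precision 0.8626, recall 0.8916, and F1 0.8769 on the evaluation set. As a quick consistency check (not part of the card itself), F1 should be the harmonic mean of precision and recall:

```python
# Sanity-check the reported F1 against the reported precision and recall.
# F1 = 2 * P * R / (P + R), the harmonic mean of the two.
precision = 0.8626
recall = 0.8916

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8769, matching the value reported in the card
```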
@@ -127,24 +127,19 @@ datasets:
 pipeline_tag: token-classification
 ---
 
 
+[image]
+
+Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 languages.
+The model predicts BIO-style language tags over tokens, which makes it useful for
+language identification, code-switch detection, and multilingual document analysis.
 
 
 
 ## Model description
 
 Introducing Polyglot Tagger, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model
+generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences based on their language.
 
 ## Intended uses & limitations
 This model can be treated as a base model for further fine-tuning on specific language-identification and extraction tasks.

@@ -167,6 +162,7 @@ The data composition follows a strategic curriculum:
 
 
 
+
 ## Training procedure
 
 ### Training hyperparameters

@@ -185,6 +181,13 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+It achieves the following results on the evaluation set:
+- Loss: 0.0452
+- Precision: 0.8626
+- Recall: 0.8916
+- F1: 0.8769
+- Accuracy: 0.9892
+
 | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
 |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
 | 0.0730        | 0.0905 | 2500  | 0.1081          | 0.7241    | 0.8260 | 0.7717 | 0.9760   |
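The updated card says the model emits BIO-style language tags over tokens, which supports code-switch detection. A minimal sketch of turning such token-level tags into contiguous language spans; the tag names (`B-en`, `I-fr`, `O`) and the decoder below are illustrative assumptions, not the model's published label set:

```python
def decode_language_spans(tokens, tags):
    """Group BIO-style language tags into (language, [tokens]) spans.

    Illustrative decoder: a B- tag opens a span, matching I- tags extend it,
    and "O" or a non-matching I- tag closes it.
    """
    spans = []
    current_lang, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_lang is not None:
                spans.append((current_lang, current_tokens))
            current_lang, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_lang == tag[2:]:
            # I- tag continues the currently open span of the same language.
            current_tokens.append(token)
        else:
            # "O" or an I- tag that does not continue the open span.
            if current_lang is not None:
                spans.append((current_lang, current_tokens))
            current_lang, current_tokens = None, []
    if current_lang is not None:
        spans.append((current_lang, current_tokens))
    return spans

tokens = ["Hello", "world", "bonjour", "le", "monde"]
tags = ["B-en", "I-en", "B-fr", "I-fr", "I-fr"]
print(decode_language_spans(tokens, tags))
# [('en', ['Hello', 'world']), ('fr', ['bonjour', 'le', 'monde'])]
```

A code-switched sentence then decodes into one span per language run, which is the sentence-level extraction behavior the card describes.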
|