Update README.md
Browse files

README.md CHANGED

@@ -127,24 +127,19 @@ datasets:
 pipeline_tag: token-classification
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 
-
 
-This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
-It achieves the following results on the evaluation set:
-- Loss: 0.0452
-- Precision: 0.8626
-- Recall: 0.8916
-- F1: 0.8769
-- Accuracy: 0.9892
 
 
 ## Model description
 
 Introducing Polyglot Tagger, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model
-generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences based on their language.
 
 ## Intended uses & limitations
 This model can be treated as a base model for further fine-tuning on specific language-identification and extraction tasks.

@@ -167,6 +162,7 @@ The data composition follows a strategic curriculum:
 
 
 
 ## Training procedure
 
 ### Training hyperparameters

@@ -185,6 +181,13 @@ The following hyperparameters were used during training:
 
 ### Training results
 
 | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
 |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
 | 0.0730        | 0.0905 | 2500  | 0.1081          | 0.7241    | 0.8260 | 0.7717 | 0.9760   |
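The card reports precision 0.8626, recall 0.8916, and F1 0.8769 on the evaluation set. As a quick consistency check (not part of the card itself), F1 should be the harmonic mean of precision and recall:

```python
# Sanity-check the reported F1 against the reported precision and recall.
# F1 = 2 * P * R / (P + R), the harmonic mean of the two.
precision = 0.8626
recall = 0.8916

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8769, matching the value reported in the card
```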
@@ -127,24 +127,19 @@ datasets:
 pipeline_tag: token-classification
 ---
 
 
+[image]
+
+Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 languages.
+The model predicts BIO-style language tags over tokens, which makes it useful for
+language identification, code-switch detection, and multilingual document analysis.
 
 
 
 ## Model description
 
 Introducing Polyglot Tagger, a new way to classify multilingual documents. By training specifically on token classification over individual sentences, the model
+generalizes well across a variety of languages, while also behaving as a multi-label classifier that extracts sentences based on their language.
 
 ## Intended uses & limitations
 This model can be treated as a base model for further fine-tuning on specific language-identification and extraction tasks.

@@ -167,6 +162,7 @@ The data composition follows a strategic curriculum:
 
 
 
+
 ## Training procedure
 
 ### Training hyperparameters

@@ -185,6 +181,13 @@ The following hyperparameters were used during training:
 
 ### Training results
 
+It achieves the following results on the evaluation set:
+- Loss: 0.0452
+- Precision: 0.8626
+- Recall: 0.8916
+- F1: 0.8769
+- Accuracy: 0.9892
+
 | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
 |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
 | 0.0730        | 0.0905 | 2500  | 0.1081          | 0.7241    | 0.8260 | 0.7717 | 0.9760   |
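The updated card says the model emits BIO-style language tags over tokens, which supports code-switch detection. A minimal sketch of turning such token-level tags into contiguous language spans; the tag names (`B-en`, `I-fr`, `O`) and the decoder below are illustrative assumptions, not the model's published label set:

```python
def decode_language_spans(tokens, tags):
    """Group BIO-style language tags into (language, [tokens]) spans.

    Illustrative decoder: a B- tag opens a span, matching I- tags extend it,
    and "O" or a non-matching I- tag closes it.
    """
    spans = []
    current_lang, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_lang is not None:
                spans.append((current_lang, current_tokens))
            current_lang, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_lang == tag[2:]:
            # I- tag continues the currently open span of the same language.
            current_tokens.append(token)
        else:
            # "O" or an I- tag that does not continue the open span.
            if current_lang is not None:
                spans.append((current_lang, current_tokens))
            current_lang, current_tokens = None, []
    if current_lang is not None:
        spans.append((current_lang, current_tokens))
    return spans

tokens = ["Hello", "world", "bonjour", "le", "monde"]
tags = ["B-en", "I-en", "B-fr", "I-fr", "I-fr"]
print(decode_language_spans(tokens, tags))
# [('en', ['Hello', 'world']), ('fr', ['bonjour', 'le', 'monde'])]
```

A code-switched sentence then decodes into one span per language run, which is the sentence-level extraction behavior the card describes.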
|