Token Classification
Transformers
TensorBoard
Safetensors
xlm-roberta
Generated from Trainer
language-identification
codeswitching
- Libraries
- Transformers
How to use DerivedFunction/polyglot-tagger-v2.2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="DerivedFunction/polyglot-tagger-v2.2")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
model = AutoModelForTokenClassification.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
```
- Notebooks
- Google Colab
- Kaggle
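Because the model tags tokens rather than whole documents, the pipeline output usually needs to be grouped into per-language spans before it is useful. The sketch below shows one way to do that, assuming the pipeline was created with `aggregation_strategy="simple"` so that each prediction carries `entity_group`, `start`, and `end` keys. The helper name `merge_language_spans` and the label names `en` and `es` are hypothetical, and the input is a mocked prediction list rather than real model output.

```python
from typing import Dict, List


def merge_language_spans(text: str, entities: List[Dict]) -> List[Dict]:
    """Merge adjacent predictions that share a label into contiguous spans."""
    spans: List[Dict] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        label = ent["entity_group"]
        if spans and spans[-1]["label"] == label:
            # Same language as the running span: extend it.
            spans[-1]["end"] = ent["end"]
        else:
            spans.append({"label": label, "start": ent["start"], "end": ent["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Mocked predictions in the aggregated pipeline format.
# The labels "en" and "es" are hypothetical, not taken from the model config.
text = "I went to the store. Luego fui a casa."
mock_entities = [
    {"entity_group": "en", "score": 0.99, "word": "I went to", "start": 0, "end": 9},
    {"entity_group": "en", "score": 0.98, "word": "the store.", "start": 10, "end": 20},
    {"entity_group": "es", "score": 0.97, "word": "Luego fui a casa.", "start": 21, "end": 38},
]
spans = merge_language_spans(text, mock_entities)
for span in spans:
    print(span["label"], repr(span["text"]))
```

With a real pipeline, `mock_entities` would be replaced by the list returned from calling the pipeline on `text`.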
Update README.md
README.md
CHANGED
|
@@ -153,10 +153,22 @@ and may produce unexpected results compared to generic text classifiers. It is t
|
|
| 153 |
> Note that Romanized versions of a language, such as Romanized Russian and Hindi, may have only minor representation in the training set.
|
| 154 |
|
| 155 |
### Training and Evaluation Data
|
| 156 |
-
A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage
|
| 157 |
-
is found in `DerivedFunction/lang-ner-v2`.
|
| 158 |
|
| 159 |
-
|
| 160 |
It achieves the following results on the evaluation set:
|
| 161 |
- Loss: 0.0345
|
| 162 |
- Precision: 0.9508
|
| 153 |
> Note that Romanized versions of a language, such as Romanized Russian and Hindi, may have only minor representation in the training set.
|
| 154 |
|
| 155 |
### Training and Evaluation Data
|
| 156 |
|
| 157 |
+
A synthetic training row consists of 1-6 individual and mostly independent sentences extracted from various sources. To generalize well across multiple languages, several
|
| 158 |
+
augmentations were applied to simulate messy text and to reduce single-character bias in certain languages:
|
| 159 |
+
- Low chance of deliberate accent stripping for languages such as Spanish and Portuguese
|
| 160 |
+
- Random chance to add, replace, or delete punctuation, numeric, and delimiter artifacts
|
| 161 |
+
- Insert same-script characters into languages of the same script family. For example, randomly injecting Arabic characters into Arabic-script languages
|
| 162 |
+
- Random chance to change the casing of compatible scripts, such as Latin and Cyrillic.
|
| 163 |
+
- Low chance of simulating OCR and messy text with character mutation.
|
| 164 |
+
|
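The noise factors listed above could be sketched roughly as follows. This is a minimal illustration and not the author's training code: the function names, probabilities, and noise characters are assumptions chosen for the example, and only accent stripping, case changes, and punctuation noise are covered.

```python
import random
import string
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (e.g. 'ã' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def random_case(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen casing to the whole string."""
    return rng.choice([str.lower, str.upper, str.title])(text)


def punctuation_noise(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Randomly delete ASCII punctuation and insert stray delimiters after spaces."""
    out = []
    for ch in text:
        if ch in string.punctuation and rng.random() < rate:
            continue  # delete this punctuation mark
        out.append(ch)
        if ch == " " and rng.random() < rate:
            out.append(rng.choice(",.;:-|"))  # insert a stray delimiter
    return "".join(out)


def augment(
    text: str,
    rng: random.Random,
    p_accent: float = 0.05,  # hypothetical "low chance"
    p_case: float = 0.2,     # hypothetical "random chance"
    p_punct: float = 0.2,    # hypothetical "random chance"
) -> str:
    """Apply each perturbation independently with its own probability."""
    if rng.random() < p_accent:
        text = strip_accents(text)
    if rng.random() < p_case:
        text = random_case(text, rng)
    if rng.random() < p_punct:
        text = punctuation_noise(text, rng)
    return text


rng = random.Random(0)
print(augment("Olá, mundo! Está tudo bem?", rng, p_accent=1.0, p_case=0.0, p_punct=0.0))
```

Note that case changes only affect scripts with a case distinction, so applying `random_case` to text in a script without case leaves it unchanged.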
| 165 |
+
To generalize well on both the target language and code switching, a curriculum is provided:
|
| 166 |
+
- Pure documents 55%: Single language to learn its vocabulary
|
| 167 |
+
- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching
|
| 168 |
+
- Spliced 10%: A foreign sentence is placed between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
|
| 169 |
+
- Mixed 10%: Generic mix of any languages.
|
| 170 |
+
|
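The row-type mix above can be read as a weighted sampling scheme. The sketch below is one possible interpretation under stated assumptions: the names `CURRICULUM`, `pick_row_type`, and `build_spliced` are hypothetical, the percentages are taken from the list, and punctuation stripping is limited to ASCII punctuation for brevity.

```python
import random
import string

# Row types and their shares, as listed in the README.
CURRICULUM = [
    ("pure", 0.55),
    ("homogeneous", 0.25),
    ("spliced", 0.10),
    ("mixed", 0.10),
]


def pick_row_type(rng: random.Random) -> str:
    """Sample one row type according to the curriculum weights."""
    names, weights = zip(*CURRICULUM)
    return rng.choices(names, weights=weights, k=1)[0]


def build_spliced(first: str, foreign: str, second: str) -> str:
    """Place a foreign sentence between two same-language sentences.

    The first sentence loses its (ASCII) punctuation and the second is
    lowercased, so sentence boundaries are harder to spot from surface cues.
    """
    first_clean = first.translate(str.maketrans("", "", string.punctuation)).strip()
    return " ".join([first_clean, foreign, second.lower()])


rng = random.Random(0)
print(pick_row_type(rng))
print(build_spliced("Hello, world.", "¿Cómo estás?", "See You Later."))
```

Stripping the punctuation and casing around the splice presumably forces the tagger to rely on vocabulary rather than sentence delimiters when locating the language boundary.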
| 171 |
+
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
|
| 172 |
It achieves the following results on the evaluation set:
|
| 173 |
- Loss: 0.0345
|
| 174 |
- Precision: 0.9508
|