DerivedFunction/polyglot-tagger-v2.2

Token Classification

Generated from Trainer

language-identification

DerivedFunction committed on Apr 18

Commit 59264e7 · verified · 1 Parent(s): 630accb

Update README.md

Files changed (1)

README.md +15 -3

@@ -153,10 +153,22 @@ and may produce unexpected results compared to generic text classifiers. It is t
 > Note that Romanized versions of any language may only have minor representation in the training set, such as Romanized Russian, and Hindi.
 ### Training and Evaluation Data
-A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage
-is found in `DerivedFunction/lang-ner-v2`.
-This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0345
 - Precision: 0.9508

 > Note that Romanized versions of any language may only have minor representation in the training set, such as Romanized Russian, and Hindi.
 ### Training and Evaluation Data
+A synthetic training row consists of 1-6 individual and mostly independent sentences extracted from various sources. To generalize well across multiple languages, several
+augmentations were used to simulate messy text and to reduce single-character bias in certain languages:
+- Low chance of deliberate accent stripping for languages such as Spanish and Portuguese.
+- Random chance to add, replace, or delete punctuation, numeric, and delimiter artifacts.
+- Insert characters from same-script alphabets within a language family. For example, randomly injecting Arabic characters in Arabic languages.
+- Random chance to change the casing of compatible scripts, such as Latin and Cyrillic.
+- Low chance of simulating OCR and messy text with character mutation.
+To generalize well on both the target language and code switching, a curriculum is provided:
+- Pure documents 55%: Single language to learn its vocabulary.
+- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching.
+- Spliced 10%: A foreign sentence is placed between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
+- Mixed 10%: Generic mix of any languages.
+This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
 It achieves the following results on the evaluation set:
 - Loss: 0.0345
 - Precision: 0.9508
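
The row types and accent stripping described in the added README lines could be sketched roughly as follows. This is a minimal illustration, not the author's data pipeline: the function names, the `CURRICULUM` dictionary, and the reading of "punctuation stripped" as trailing punctuation only are all assumptions made for the example. Only the four percentages come from the README text.

```python
import random
import unicodedata

# Mix ratios taken from the README text; the key names are illustrative labels.
CURRICULUM = {"pure": 0.55, "homogeneous": 0.25, "spliced": 0.10, "mixed": 0.10}


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sample_row_type(rng: random.Random) -> str:
    """Pick a row type with probability proportional to its curriculum weight."""
    names = list(CURRICULUM)
    weights = list(CURRICULUM.values())
    return rng.choices(names, weights=weights, k=1)[0]


def splice(first: str, foreign: str, second: str) -> str:
    """Build a 'spliced' row: a foreign sentence between two same-language ones.

    Assumption: only trailing punctuation is stripped from the first sentence;
    the second sentence is lowercased in full.
    """
    return " ".join([first.rstrip(".!?"), foreign, second.lower()])


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_row_type(rng))
    print(strip_accents("canção"))
    print(splice("Hola mundo.", "Good morning.", "Hasta Luego."))
```

Passing a seeded `random.Random` keeps the sampled mix reproducible, which makes it easier to check that a generated dataset roughly matches the stated 55/25/10/10 split.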