Token Classification
Transformers
TensorBoard
Safetensors
xlm-roberta
Generated from Trainer
language-identification
codeswitching
Instructions to use DerivedFunction/polyglot-tagger-v2.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use DerivedFunction/polyglot-tagger-v2.2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="DerivedFunction/polyglot-tagger-v2.2")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
model = AutoModelForTokenClassification.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
```

A quick usage sketch follows the notebook links below.

- Notebooks
- Google Colab
- Kaggle
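As a quick sanity check of the pipeline snippet above, the sketch below tags a code-switched sentence. The example sentence and the `aggregation_strategy` setting are illustrative assumptions; the printed label names depend on this model's own label set (check `model.config.id2label`). Only the model id and the token-classification task come from the card.

```python
# Hedged usage sketch: tag a code-switched English/Spanish sentence.
# The label names printed depend on the model's own id2label mapping.
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="DerivedFunction/polyglot-tagger-v2.2",
    aggregation_strategy="simple",  # merge sub-word tokens into word-level spans
)

for span in pipe("I told her que nos vemos mañana after the meeting."):
    print(span["entity_group"], round(span["score"], 3), repr(span["word"]))
```

With `aggregation_strategy="simple"`, sub-word pieces are merged so each printed span covers a run of words sharing one predicted language label.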
Update README.md
README.md CHANGED

```diff
@@ -158,13 +158,13 @@ A synthetic training row consists of 1-6 individual and mostly independent sente
 factors were used to simulate messy text, and to reduce single-character bias on certain languages:
 - Low chance of deliberate accent stripping for languages such as Spanish and Portuguese
 - Random chance to add, replace, or delete punctuation, numeric, and delimiter artifacts
-- Insert same-script alphabets into a family language. For example, randomly injecting Arabic characters in
+- Insert same-script alphabets into a family language. For example, randomly injecting Arabic characters into Arabic languages
 - Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
 - Low chance of simulating OCR and messy text with character mutation.
 
 To generalize well on both the target language and code switching, a curriculum is provided:
-- Pure documents 55%: Single language to learn its vocabulary
-- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching
+- Pure documents 55%: Single language to learn its vocabulary, simulating a short paragraph of a single language.
+- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching.
 - Spliced 10%: A foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped and the second forced to be lowercase.
 - Mixed 10%: Generic mix of any languages.
 
```
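To make the noise factors in the hunk above concrete, here is a minimal sketch of two of them: accent stripping and punctuation perturbation. The model card does not publish its generation code, so the function names and probabilities below are assumptions; only the described behaviors come from the diff.

```python
# Illustrative sketch of the "messy text" factors above; names and
# probabilities are assumptions, not the card's actual generation code.
import random
import string
import unicodedata

def strip_accents(text: str) -> str:
    # Deliberate accent stripping, e.g. "mañana" -> "manana".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def perturb_punctuation(text: str, p: float = 0.05) -> str:
    # Randomly insert, replace, or delete punctuation/numeric artifacts.
    out = []
    for ch in text:
        r = random.random()
        if r < p:
            continue  # delete this character
        if r < 2 * p:
            out.append(random.choice(string.punctuation + string.digits))
            continue  # replace it with a random artifact
        out.append(ch)
        if random.random() < p:
            out.append(random.choice(string.punctuation))  # insert after it
    return "".join(out)

print(perturb_punctuation(strip_accents("¿Nos vemos mañana, à bientôt?")))
```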
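And a sketch of the 55/25/10/10 curriculum itself. Only the ratios and the four document types come from the diff; the sentence pools, helper logic, and sentence counts are illustrative assumptions.

```python
# Illustrative sketch of the 55/25/10/10 training curriculum; the sentence
# pools and helper logic are assumptions, only the ratios come from the card.
import random

DOC_TYPES = ["pure", "homogeneous", "spliced", "mixed"]
WEIGHTS = [0.55, 0.25, 0.10, 0.10]

def build_row(sents_by_lang: dict) -> tuple:
    """Assemble one synthetic row; assumes at least two languages in the pool."""
    kind = random.choices(DOC_TYPES, weights=WEIGHTS, k=1)[0]
    langs = list(sents_by_lang)
    main = random.choice(langs)

    def pick(lang):
        return random.choice(sents_by_lang[lang])

    if kind == "pure":
        # Single language: a short paragraph to learn its vocabulary.
        doc = [pick(main) for _ in range(random.randint(1, 6))]
    elif kind == "homogeneous":
        # Single language plus one foreign sentence (simple code switching).
        other = random.choice([l for l in langs if l != main])
        doc = [pick(main), pick(other), pick(main)]
    elif kind == "spliced":
        # Foreign sentence centered between two same-language sentences:
        # first sentence's punctuation stripped, second forced to lowercase.
        other = random.choice([l for l in langs if l != main])
        doc = [pick(main).rstrip(".!?"), pick(other), pick(main).lower()]
    else:
        # Mixed: generic mix of any languages.
        doc = [pick(random.choice(langs)) for _ in range(random.randint(2, 6))]
    return kind, " ".join(doc)
```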