DerivedFunction/polyglot-tagger-100L-4M

Token Classification

Generated from Trainer

DerivedFunction committed about 21 hours ago

Commit 81b0f9e · verified · 1 Parent(s): 4e8ff75

Update README.md

Files changed (1)

README.md +0 -8

@@ -154,14 +154,6 @@ and may produce unexpected results compared to generic text classifiers. It is t
 A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage
 is found in `DerivedFunction/language-ner`.
-The data composition follows a strategic curriculum:
-* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
-* **30% Homogenous Mixed:** Documents containing one main language, and clear transitions between two or more languages to train boundary detection.
-* **10% Mixed with Noise:** Integration of "neutral" spans including code snippets, mathematical notation, emojis, symbols, and `rot_13` text tagged as `O` or their respective source to reduce hallucination.
 ## Training procedure
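
For reference, the data composition that this commit removes from the README (60% pure, 30% mixed, 10% mixed with noise, with 1-4 sentences per row) can be illustrated with a small sampling sketch. This is not the author's generation code: the category names, `sample_row_plan`, and `plan_rows` are hypothetical, and the sketch assumes the sentence count is drawn uniformly and independently of the category, which the README does not state.

```python
import random

# Proportions quoted from the removed README passage.
# The keys are hypothetical labels chosen for this sketch.
COMPOSITION = {
    "pure": 0.60,         # single-language rows
    "mixed": 0.30,        # one main language with transitions to others
    "mixed_noise": 0.10,  # mixed rows plus neutral spans (code, math, emojis, ...)
}


def sample_row_plan(rng):
    """Pick a row category and a sentence count for one synthetic row."""
    categories = list(COMPOSITION)
    weights = list(COMPOSITION.values())
    category = rng.choices(categories, weights=weights, k=1)[0]
    # Assumption: 1-4 sentences, uniform; randint is inclusive on both ends.
    n_sentences = rng.randint(1, 4)
    return category, n_sentences


def plan_rows(n_rows, seed=0):
    """Return a reproducible list of (category, n_sentences) plans."""
    rng = random.Random(seed)
    return [sample_row_plan(rng) for _ in range(n_rows)]
```

Calling `plan_rows(10_000)` yields a list of plans whose category frequencies approach the stated 60/30/10 split as the number of rows grows; the actual sentences and language tags would still have to come from the `DerivedFunction/language-ner` data.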