DerivedFunction committed on
Commit
59264e7
·
verified ·
1 Parent(s): 630accb

Update README.md

Files changed (1)
  1. README.md +15 -3
README.md CHANGED
@@ -153,10 +153,22 @@ and may produce unexpected results compared to generic text classifiers. It is t
153
  > Note that Romanized versions of languages, such as Romanized Russian and Hindi, may have only minor representation in the training set.
154
 
155
  ### Training and Evaluation Data
156
- A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage,
157
- are found in `DerivedFunction/lang-ner-v2`.
158
 
159
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
160
  It achieves the following results on the evaluation set:
161
  - Loss: 0.0345
162
  - Precision: 0.9508
 
153
  > Note that Romanized versions of languages, such as Romanized Russian and Hindi, may have only minor representation in the training set.
154
 
155
  ### Training and Evaluation Data
 
 
156
 
157
+ A synthetic training row consists of 1-6 individual and mostly independent sentences extracted from various sources. To generalize well across multiple languages, several
158
+ techniques were used to simulate messy text and to reduce single-character bias in certain languages:
159
+ - Low chance of deliberate accent stripping for languages such as Spanish and Portuguese.
160
+ - Random chance to add, replace, or delete punctuation, numeric, and delimiter artifacts.
161
+ - Insert characters from same-script alphabets of related languages. For example, randomly injecting Arabic characters into Arabic-script languages.
162
+ - Random chance to change the casing in scripts that have letter case, such as Latin and Cyrillic.
163
+ - Low chance of simulating OCR errors and messy text through character mutation.
164
+
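The noise steps listed above could be sketched roughly as follows. This is a minimal illustration, not the training code used for this model: the function names, the probabilities, the `ARTIFACTS` list, and the adjacent-swap mutation are all assumptions chosen for the example, and same-script character injection is left out.

```python
import random
import unicodedata

# Hypothetical artifact pool and target set; the real pipeline's values are not documented.
ARTIFACTS = [".", ",", ";", "|", "-", "1", "7"]
PUNCT = set(".,;:!?|-/") | set("0123456789")


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. 'canción' -> 'cancion'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def perturb_punctuation(text: str, rng: random.Random, p: float) -> str:
    """With probability p, delete or replace a punctuation/digit character,
    and with probability p insert an artifact after a space."""
    out = []
    for ch in text:
        if ch in PUNCT and rng.random() < p:
            if rng.choice(("delete", "replace")) == "replace":
                out.append(rng.choice(ARTIFACTS))
            continue  # the original character is dropped in both cases
        out.append(ch)
        if ch == " " and rng.random() < p:
            out.append(rng.choice(ARTIFACTS))
    return "".join(out)


def random_case(text: str, rng: random.Random) -> str:
    """Apply one casing transform; this is a no-op for scripts without letter case."""
    return rng.choice((str.lower, str.upper, str.title))(text)


def mutate_chars(text: str, rng: random.Random, p: float) -> str:
    """Swap adjacent letters with probability p as a crude stand-in for OCR noise."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def augment(text, rng, p_accent=0.05, p_punct=0.10, p_case=0.10, p_mutate=0.02):
    """Chain the noise steps; all default probabilities are illustrative guesses."""
    if rng.random() < p_accent:
        text = strip_accents(text)
    text = perturb_punctuation(text, rng, p_punct)
    if rng.random() < p_case:
        text = random_case(text, rng)
    if rng.random() < p_mutate:
        text = mutate_chars(text, rng, 0.05)
    return text
```

Passing a seeded `random.Random` keeps the sketch reproducible, for example `augment("Hola, mundo.", random.Random(0))`.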
165
+ To generalize well on both the target language and code switching, a curriculum is used:
166
+ - Pure documents 55%: A single language, to learn its vocabulary.
167
+ - Homogeneous 25%: A single language plus one foreign sentence, to learn simple code switching.
168
+ - Spliced 10%: A foreign sentence is placed between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
169
+ - Mixed 10%: A generic mix of any languages.
170
+
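As a rough illustration of the curriculum above, the sketch below samples a row type using the stated 55/25/10/10 split and assembles a row of `(language, sentence)` pairs. The `corpus` layout, the `build_row` and `strip_punct` names, and the sentence counts within each row type are assumptions made for this example; only the percentages and the 1-6 sentence range come from the description above.

```python
import random
import unicodedata

# Row-type weights taken from the curriculum description.
CURRICULUM = {"pure": 0.55, "homogeneous": 0.25, "spliced": 0.10, "mixed": 0.10}


def strip_punct(sentence: str) -> str:
    """Remove every character in a Unicode punctuation category (P*)."""
    return "".join(
        ch for ch in sentence if not unicodedata.category(ch).startswith("P")
    )


def build_row(corpus: dict, rng: random.Random):
    """Return (kind, row), where row is a list of (language, sentence) pairs.

    `corpus` is a hypothetical mapping of language code -> list of sentences
    and needs at least two languages.
    """
    kind = rng.choices(list(CURRICULUM), weights=list(CURRICULUM.values()))[0]
    langs = list(corpus)
    main = rng.choice(langs)
    others = [lang for lang in langs if lang != main]

    if kind == "pure":
        # 1-6 sentences, all from one language.
        row = [(main, rng.choice(corpus[main])) for _ in range(rng.randint(1, 6))]
    elif kind == "homogeneous":
        # 1-5 main-language sentences plus one foreign sentence at a random position.
        row = [(main, rng.choice(corpus[main])) for _ in range(rng.randint(1, 5))]
        foreign = rng.choice(others)
        row.insert(rng.randint(0, len(row)), (foreign, rng.choice(corpus[foreign])))
    elif kind == "spliced":
        # Foreign sentence between two main-language sentences: the first loses
        # its punctuation and the second is lowercased.
        foreign = rng.choice(others)
        first = strip_punct(rng.choice(corpus[main]))
        last = rng.choice(corpus[main]).lower()
        row = [(main, first), (foreign, rng.choice(corpus[foreign])), (main, last)]
    else:
        # Mixed: each sentence drawn from any language (2-6 is an assumed range).
        row = []
        for _ in range(rng.randint(2, 6)):
            lang = rng.choice(langs)
            row.append((lang, rng.choice(corpus[lang])))
    return kind, row
```

Keeping the language tag next to each sentence is a design choice for the sketch: it lets per-sentence labels be projected onto tokens later without re-detecting boundaries.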
171
+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
172
  It achieves the following results on the evaluation set:
173
  - Loss: 0.0345
174
  - Precision: 0.9508