DerivedFunction committed on
Commit cf4d2a7 · verified · 1 Parent(s): e5b4fb6

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -158,13 +158,13 @@ A synthetic training row consists of 1-6 individual and mostly independent sente
  factors were used to simulate messy text, and to reduce single character bias on certain languages:
  - Low chance of deliberate accent stripping for languages such as Spanish and Portuguese
  - Random chance to add in, replace or delete punctuation, numeric, and delimiter artifacts
- - Insert same-script alphabets to family language. For example, randomly injecting Arabic characters in Araabic languages
+ - Insert same-script alphabets to family language. For example, randomly injecting Arabic characters in Arabic languages
  - Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
  - Low chance of simulating OCR and messy text with character mutation.
 
  To generalize well on both the target language and code switching a curriculum is provided:
- - Pure documents 55%: Single language to learn its vocabulary
- - Homogenous 25%: Single language + one foreign sentence to learn simple code switching
+ - Pure documents 55%: Single language to learn its vocabulary, simulating a short paragraph of a single language.
+ - Homogenous 25%: Single language + one foreign sentence to learn simple code switching.
  - Spliced 10%: A foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped, and the second sentence forced to be lowercased.
  - Mixed 10%: Generic mix of any languages.
 
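
As a rough illustration of the curriculum in the diff above, here is a minimal sketch of how a row type might be drawn with the stated 55/25/10/10 weights, and how a "Spliced" row might be assembled. This is not the repository's code: the names `CURRICULUM`, `pick_row_type`, `strip_punctuation`, and `build_spliced` are hypothetical, and reading "punctuation stripped" as removing every Unicode punctuation character is an assumption.

```python
import random
import unicodedata

# Mixture weights copied from the curriculum bullets; the key names are illustrative.
CURRICULUM = {"pure": 0.55, "homogenous": 0.25, "spliced": 0.10, "mixed": 0.10}


def pick_row_type(rng: random.Random) -> str:
    """Draw one row type according to the curriculum weights."""
    kinds = list(CURRICULUM)
    weights = list(CURRICULUM.values())
    return rng.choices(kinds, weights=weights, k=1)[0]


def strip_punctuation(text: str) -> str:
    """Remove every character whose Unicode category is punctuation (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def build_spliced(first: str, foreign: str, last: str) -> str:
    """Assumed reading of 'Spliced': a foreign sentence between two
    same-language sentences, with the first sentence's punctuation removed
    and the closing sentence lowercased."""
    return " ".join([strip_punctuation(first), foreign, last.lower()])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(pick_row_type(rng))
print(build_spliced("Hello, world.", "Hola mundo.", "Good Bye!"))
# second line prints: Hello world Hola mundo. good bye!
```

Drawing the row type first and then building the row keeps the mixture ratios independent of how each row type is constructed, so the weights can be tuned without touching the builders.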