Update README.md
README.md
CHANGED
@@ -154,14 +154,6 @@ and may produce unexpected results compared to generic text classifiers. It is t
 A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage
 is found in `DerivedFunction/language-ner`.
 
-The data composition follows a strategic curriculum:
-
-* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
-* **30% Homogenous Mixed:** Documents containing one main language, and clear transitions between two or more languages to train boundary detection.
-* **10% Mixed with Noise:** Integration of "neutral" spans including code snippets, mathematical notation, emojis, symbols, and `rot_13` text tagged as `O` or their respective source to reduce hallucination.
-
-
-
 
 ## Training procedure
 