Token Classification
Transformers
TensorBoard
Safetensors
xlm-roberta
Generated from Trainer
language-identification
codeswitching
Instructions to use DerivedFunction/polyglot-tagger-v2.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use DerivedFunction/polyglot-tagger-v2.2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="DerivedFunction/polyglot-tagger-v2.2")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
model = AutoModelForTokenClassification.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
```

A quick usage sketch follows the notebook links below.

- Notebooks
- Google Colab
- Kaggle
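As a quick sanity check of the pipeline snippet above, the sketch below tags a code-switched sentence. The example sentence and the `aggregation_strategy` setting are illustrative assumptions; the printed label names depend on this model's own label set (check `model.config.id2label`). Only the model id and the token-classification task come from the card.

```python
# Hedged usage sketch: tag a code-switched English/Spanish sentence.
# The label names printed depend on the model's own id2label mapping.
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="DerivedFunction/polyglot-tagger-v2.2",
    aggregation_strategy="simple",  # merge sub-word tokens into word-level spans
)

for span in pipe("I told her que nos vemos mañana after the meeting."):
    print(span["entity_group"], round(span["score"], 3), repr(span["word"]))
```

With `aggregation_strategy="simple"`, sub-word pieces are merged so each printed span covers a run of words sharing one predicted language label.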
Update README.md
README.md CHANGED

```diff
@@ -158,13 +158,13 @@ A synthetic training row consists of 1-6 individual and mostly independent sente
 factors were used to simulate messy text, and to reduce single-character bias on certain languages:
 - Low chance of deliberate accent stripping for languages such as Spanish and Portuguese
 - Random chance to add, replace, or delete punctuation, numeric, and delimiter artifacts
-- Insert same-script alphabets into a family language. For example, randomly injecting Arabic characters in
+- Insert same-script alphabets into a family language. For example, randomly injecting Arabic characters into Arabic languages
 - Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
 - Low chance of simulating OCR and messy text with character mutation.
 
 To generalize well on both the target language and code switching, a curriculum is provided:
-- Pure documents 55%: Single language to learn its vocabulary
-- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching
+- Pure documents 55%: Single language to learn its vocabulary, simulating a short paragraph of a single language.
+- Homogeneous 25%: Single language + one foreign sentence to learn simple code switching.
 - Spliced 10%: A foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped and the second forced to be lowercase.
 - Mixed 10%: Generic mix of any languages.
 
```
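To make the noise factors in the hunk above concrete, here is a minimal sketch of two of them: accent stripping and punctuation perturbation. The model card does not publish its generation code, so the function names and probabilities below are assumptions; only the described behaviors come from the diff.

```python
# Illustrative sketch of the "messy text" factors above; names and
# probabilities are assumptions, not the card's actual generation code.
import random
import string
import unicodedata

def strip_accents(text: str) -> str:
    # Deliberate accent stripping, e.g. "mañana" -> "manana".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def perturb_punctuation(text: str, p: float = 0.05) -> str:
    # Randomly insert, replace, or delete punctuation/numeric artifacts.
    out = []
    for ch in text:
        r = random.random()
        if r < p:
            continue  # delete this character
        if r < 2 * p:
            out.append(random.choice(string.punctuation + string.digits))
            continue  # replace it with a random artifact
        out.append(ch)
        if random.random() < p:
            out.append(random.choice(string.punctuation))  # insert after it
    return "".join(out)

print(perturb_punctuation(strip_accents("¿Nos vemos mañana, à bientôt?")))
```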
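And a sketch of the 55/25/10/10 curriculum itself. Only the ratios and the four document types come from the diff; the sentence pools, helper logic, and sentence counts are illustrative assumptions.

```python
# Illustrative sketch of the 55/25/10/10 training curriculum; the sentence
# pools and helper logic are assumptions, only the ratios come from the card.
import random

DOC_TYPES = ["pure", "homogeneous", "spliced", "mixed"]
WEIGHTS = [0.55, 0.25, 0.10, 0.10]

def build_row(sents_by_lang: dict) -> tuple:
    """Assemble one synthetic row; assumes at least two languages in the pool."""
    kind = random.choices(DOC_TYPES, weights=WEIGHTS, k=1)[0]
    langs = list(sents_by_lang)
    main = random.choice(langs)

    def pick(lang):
        return random.choice(sents_by_lang[lang])

    if kind == "pure":
        # Single language: a short paragraph to learn its vocabulary.
        doc = [pick(main) for _ in range(random.randint(1, 6))]
    elif kind == "homogeneous":
        # Single language plus one foreign sentence (simple code switching).
        other = random.choice([l for l in langs if l != main])
        doc = [pick(main), pick(other), pick(main)]
    elif kind == "spliced":
        # Foreign sentence centered between two same-language sentences:
        # first sentence's punctuation stripped, second forced to lowercase.
        other = random.choice([l for l in langs if l != main])
        doc = [pick(main).rstrip(".!?"), pick(other), pick(main).lower()]
    else:
        # Mixed: generic mix of any languages.
        doc = [pick(random.choice(langs)) for _ in range(random.randint(2, 6))]
    return kind, " ".join(doc)
```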