token correction
README.md
CHANGED

@@ -134,7 +134,7 @@ We optimized the tokenizer specifically for Uzbek, achieving significantly bette

 1. **Tokenizer Surgery**: Extended vocabulary with 40,000 Uzbek-optimized tokens
 2. **Embedding Initialization**: Semantic initialization using subword composition
-3. **Continual Pretraining**: Trained on
+3. **Continual Pretraining**: Trained on 2B tokens of Uzbek and English text corpus
 4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets

 ### Training Data
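The "Embedding Initialization: Semantic initialization using subword composition" step in the diff above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the idea is that each newly added token's embedding row is initialized to the mean of the embeddings of the subword pieces the *old* tokenizer splits it into, falling back to a small random vector when no pieces are known. All function and variable names here (`init_new_embeddings`, `old_tokenize`, the toy vocabulary) are hypothetical.

```python
import numpy as np

def init_new_embeddings(old_embeddings, old_vocab, old_tokenize, new_tokens, rng=None):
    """Semantically initialize rows for `new_tokens` by subword composition.

    old_embeddings: (V, d) array, one row per token in the old vocabulary.
    old_vocab:      dict mapping old-tokenizer piece -> row index.
    old_tokenize:   callable splitting a string into old-tokenizer pieces
                    (hypothetical stand-in for the real tokenizer).
    Returns a (V + len(new_tokens), d) array with the new rows appended.
    """
    rng = rng or np.random.default_rng(0)
    dim = old_embeddings.shape[1]
    new_rows = []
    for tok in new_tokens:
        piece_ids = [old_vocab[p] for p in old_tokenize(tok) if p in old_vocab]
        if piece_ids:
            # Mean of the constituent subword embeddings.
            new_rows.append(old_embeddings[piece_ids].mean(axis=0))
        else:
            # No known pieces: fall back to a small random initialization.
            new_rows.append(rng.normal(scale=0.02, size=dim))
    return np.vstack([old_embeddings, np.asarray(new_rows)])

# Toy example: a 2-token "old" vocabulary with 2-dimensional embeddings.
old_vocab = {"uz": 0, "bek": 1}
old_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
tokenize = lambda t: [t[:2], t[2:]]        # "uzbek" -> ["uz", "bek"]
emb = init_new_embeddings(old_emb, old_vocab, tokenize, ["uzbek"])
# emb[2] is the mean of rows 0 and 1: [0.5, 0.5]
```

Mean-of-subwords initialization keeps new rows inside the distribution the pretrained model already understands, which tends to make the subsequent continual-pretraining step (item 3 in the diff) converge faster than random initialization.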