kmamaroziqov committed on
Commit e538114 · verified · 1 Parent(s): 54f3f39

token correction

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -134,7 +134,7 @@ We optimized the tokenizer specifically for Uzbek, achieving significantly bette
 
 1. **Tokenizer Surgery**: Extended vocabulary with 40,000 Uzbek-optimized tokens
 2. **Embedding Initialization**: Semantic initialization using subword composition
-3. **Continual Pretraining**: Trained on 22GB Uzbek text corpus
+3. **Continual Pretraining**: Trained on 2B tokens of Uzbek and English text corpus
 4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
 
 ### Training Data