---
language:
- uz
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- language-model
- text-generation
- nlp
- central-asia
- low-resource
- tokenizer-optimization
datasets:
- behbudiy/alpaca-cleaned-uz
- NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
- name: NeuronAI-Uzbek
  results:
  - task:
      type: text-generation
      name: Uzbek Language Understanding
    dataset:
      name: UzLiB Benchmark
      type: uzlib
    metrics:
    - type: accuracy
      value: 0.662
      name: Overall Accuracy
---
<div align="center">
# 🇺🇿 NeuronAI-Uzbek
### The Most Advanced Open-Source Language Model for Uzbek
[Model on Hugging Face](https://huggingface.co/NeuronUz/NeuronAI-Uzbek)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base model: Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
**4th Place Globally | 1st Place in Uzbekistan on UzLiB Benchmark**
*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*
</div>
---
## Key Results
<div align="center">
| Achievement | Value |
|-------------|-------|
| **UzLiB Overall Score** | **0.662** |
| **Global Ranking** | **#4** |
| **Regional Ranking** | **#1 in Uzbekistan** |
| **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B |
</div>
---
## UzLiB Benchmark Performance
NeuronAI-Uzbek scores 0.662 overall on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), an evaluation suite for Uzbek language understanding, which places it 4th on the public leaderboard.
### Leaderboard Position
Current standings: [UzLiB leaderboard](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)
> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, while competing against models with 100B+ parameters.
### Performance Comparison vs Original Qwen3-4B
| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|--------|:-------------------:|:--------------:|:-----------:|
| **Overall (All)** | 0.345 | **0.662** | **+91.9%** |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |
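
The Improvement column is the relative change over the base score, `(new - base) / base`. A minimal sketch that recomputes it from the table values; the helper name `relative_improvement` is illustrative and not part of any released code:

```python
def relative_improvement(base: float, new: float) -> float:
    """Return the relative change from base to new, in percent."""
    return (new - base) / base * 100.0


# Scores copied from the table above: (Qwen3-4B, NeuronAI-Uzbek).
scores = {
    "Overall (All)": (0.345, 0.662),
    "Correct Word": (0.351, 0.718),
    "Meaning": (0.309, 0.466),
    "Meaning in Context": (0.347, 0.333),
    "Fill-in": (0.327, 0.385),
}

for metric, (base, new) in scores.items():
    # A negative value means the fine-tuned model scored below the base model.
    print(f"{metric}: {relative_improvement(base, new):+.1f}%")
```
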
---
## Tokenizer Efficiency
We optimized the tokenizer specifically for Uzbek to lower its fertility rate, the average number of tokens per word. Fewer tokens per word means shorter sequences, faster inference, and lower serving costs.
### Fertility Rate Comparison
| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|-------|:--------------:|:-------:|:----------:|:--------------------:|
| **NeuronAI-Uzbek (Ours)** | **2.67** | 0.15 | 180,000 | **+22.5%** |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |
> **Fertility Rate**: Average number of tokens per word. Lower is better for efficiency.
<div align="center">
<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
</div>
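
As a rough illustration of the metric, the sketch below computes fertility as total tokens divided by total words. It assumes words are split on whitespace, which may differ from how the figures in the table were measured, and it uses a toy tokenizer (`toy_tokenize`, hypothetical) so that it runs without downloading a model:

```python
from typing import Callable, Iterable, List


def fertility_rate(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in for a real tokenizer: cut each word into chunks of up to 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["Toshkent O'zbekistonning poytaxti"]
print(f"fertility: {fertility_rate(sample, toy_tokenize):.2f}")
```

To measure a real tokenizer, pass `tokenizer.tokenize` from a loaded Hugging Face tokenizer in place of `toy_tokenize`.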
### What This Means
- **22.5% fewer tokens** than Qwen3-4B to represent the same Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization** - fit more content in the same context window
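
To make the context-window point concrete, here is a back-of-the-envelope sketch using the rounded fertility rates from the table and the 32,768-token context length listed under Model Details. The function names are illustrative, and the word counts are estimates that ignore chat-template and special tokens:

```python
CONTEXT_TOKENS = 32_768  # context length reported in this model card


def words_that_fit(context_tokens: int, fertility: float) -> int:
    """Estimate how many words fit in a context window at a given tokens-per-word rate."""
    return int(context_tokens / fertility)


def token_reduction(base_fertility: float, new_fertility: float) -> float:
    """Fraction of tokens saved when moving from base_fertility to new_fertility."""
    return 1.0 - new_fertility / base_fertility


qwen_words = words_that_fit(CONTEXT_TOKENS, 3.44)    # Qwen3-4B tokenizer
neuron_words = words_that_fit(CONTEXT_TOKENS, 2.67)  # NeuronAI-Uzbek tokenizer

print(f"Qwen3-4B: ~{qwen_words} words, NeuronAI-Uzbek: ~{neuron_words} words")
print(f"Token reduction from rounded rates: {token_reduction(3.44, 2.67):.1%}")
```
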
---
## Model Details
### Architecture
| Property | Value |
|----------|-------|
| **Base Model** | Qwen3-4B |
| **Parameters** | 4 Billion |
| **Vocabulary Size** | 180,000 tokens |
| **Context Length** | 32,768 tokens |
| **Architecture** | Transformer (Decoder-only) |
| **Precision** | BFloat16 |
### Training Methodology
1. **Tokenizer Surgery**: Extended vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding Initialization**: Semantic initialization using subword composition
3. **Continual Pretraining**: Trained on a 2B-token corpus of Uzbek and English text
4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
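
This card does not spell out how the subword-composition initialization in step 2 was implemented. One common approach, shown here purely as an assumption, is to start each new token's embedding at the mean of the embeddings of the old-vocabulary pieces that spell it. The pieces and 3-dimensional vectors below are made up for illustration:

```python
from typing import Dict, List


def init_new_embedding(pieces: List[str], old_embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of the old-vocabulary pieces that make up a new token."""
    if not pieces:
        raise ValueError("a new token must decompose into at least one piece")
    vectors = [old_embeddings[piece] for piece in pieces]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical embeddings for two subword pieces from the original vocabulary.
old_embeddings = {
    "kit": [1.0, 0.0, 2.0],
    "ob": [3.0, 2.0, 0.0],
}

# A new token "kitob" that the original tokenizer would have split into "kit" + "ob".
new_vector = init_new_embedding(["kit", "ob"], old_embeddings)
print(new_vector)
```

Starting from the mean keeps a new token close to where the model already represented that character sequence, so continual pretraining has less to relearn than it would from a random initialization.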
### Training Data
| Dataset | Type | Purpose |
|---------|------|---------|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |
---
## Quick Start
### Installation
```bash
# accelerate is required by device_map="auto" in the example below
pip install transformers torch accelerate
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

# Load the tokenizer and model; device_map="auto" places weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Uzbek prompt: "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### With Thinking Mode (Chain-of-Thought)
```python
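# Reuses `tokenizer` from the Basic Usage example.
# Uzbek prompt: "Find 5 natural numbers below 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Enable step-by-step reasoning
)
# Tokenize `text` and call model.generate() as in Basic Usage.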
```
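Qwen3 models wrap their reasoning in `<think>...</think>` before the final answer. Assuming this model keeps that format and that the tags survive decoding, a small helper like the one below can separate the two parts. `split_thinking` is an illustrative name, not part of any library:

```python
from typing import Tuple


def split_thinking(decoded: str) -> Tuple[str, str]:
    """Split decoded model output into (reasoning, answer) around the </think> tag."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = decoded.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", decoded.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Hand-written example string standing in for real model output.
example = "<think>3, 6, 9, 12, 15 are all below 100.</think>\n3, 6, 9, 12, 15"
reasoning, answer = split_thinking(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Splitting on the tag text rather than on a fixed token ID avoids depending on the extended vocabulary's ID layout.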
---
## Use Cases
NeuronAI-Uzbek excels at:
- **Text Generation**: Creative writing, content creation in Uzbek
- **Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **Reading Comprehension**: Understanding and analyzing Uzbek texts
- **Grammar & Spelling**: Uzbek language correctness tasks
- **Translation Assistance**: Uzbek-English language tasks
- **Conversational AI**: Building Uzbek chatbots and assistants
---
## Limitations
- **Knowledge Cutoff**: The model has no knowledge of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight
---
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
---
## Acknowledgments
- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources
---
## Citation
```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```
---
<div align="center">
**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**
</div>