---
language:
- uz
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- language-model
- text-generation
- nlp
- central-asia
- low-resource
- tokenizer-optimization
datasets:
- behbudiy/alpaca-cleaned-uz
- NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
- name: NeuronAI-Uzbek
  results:
  - task:
      type: text-generation
      name: Uzbek Language Understanding
    dataset:
      name: UzLiB Benchmark
      type: uzlib
    metrics:
    - type: accuracy
      value: 0.662
      name: Overall Accuracy
---
# 🇺🇿 NeuronAI-Uzbek
### The Most Advanced Open-Source Language Model for Uzbek
**4th Place Globally | 1st Place in Uzbekistan on UzLiB Benchmark**
*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*
---
## Key Results
| Achievement | Value |
|-------------|-------|
| **UzLiB Overall Score** | **0.662** |
| **Global Ranking** | **#4** |
| **Regional Ranking** | **#1 in Uzbekistan** |
| **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B |
---
## UzLiB Benchmark Performance
NeuronAI-Uzbek scores 0.662 overall on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), a comprehensive evaluation suite for Uzbek language understanding.
### Leaderboard Position
See the [UzLiB leaderboard](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md) for the current rankings.
> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, while competing against models with 100B+ parameters.
### Performance Comparison vs Original Qwen3-4B
| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|--------|:-------------------:|:--------------:|:-----------:|
| **Overall (All)** | 0.345 | **0.662** | **+91.9%** |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |
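
The Improvement column is the relative change over the base model's score. The short check below recomputes it from the scores in the table; the helper name `relative_change` is illustrative and not part of any released code.

```python
# Scores copied from the comparison table: (Qwen3-4B, NeuronAI-Uzbek).
scores = {
    "Overall (All)": (0.345, 0.662),
    "Correct Word": (0.351, 0.718),
    "Meaning": (0.309, 0.466),
    "Meaning in Context": (0.347, 0.333),
    "Fill-in": (0.327, 0.385),
}


def relative_change(base: float, new: float) -> float:
    """Return the change from `base` to `new` as a percentage of `base`."""
    return (new - base) / base * 100.0


for metric, (base, new) in scores.items():
    print(f"{metric}: {relative_change(base, new):+.1f}%")
```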
---
## Tokenizer Efficiency
We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency (lower fertility rate = fewer tokens per word = faster inference and lower costs).
### Fertility Rate Comparison
| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|-------|:--------------:|:-------:|:----------:|:--------------------:|
| **NeuronAI-Uzbek (Ours)** | **2.67** | 0.15 | 180,000 | **+22.5%** |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |
> **Fertility Rate**: Average number of tokens per word. Lower is better for efficiency.
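
A minimal sketch of how a fertility rate can be measured, assuming words are split on whitespace and the statistic is averaged per text sample (the card does not state the exact procedure or corpus used). `toy_tokenize` is a hypothetical stand-in so the example runs without downloading anything; the commented lines show how the real tokenizer could be plugged in.

```python
from statistics import mean, pstdev
from typing import Callable, Iterable, List, Tuple


def fertility_rate(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[float, float]:
    """Return (mean, std dev) of tokens per whitespace-separated word, per text."""
    ratios = []
    for text in texts:
        words = text.split()
        if not words:
            continue  # skip empty samples
        ratios.append(len(tokenize(text)) / len(words))
    if not ratios:
        raise ValueError("no non-empty texts to measure")
    return mean(ratios), pstdev(ratios)


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): chunks of up to 4 characters per word."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


sample = ["Salom dunyo", "O'zbekiston"]
avg, std = fertility_rate(sample, toy_tokenize)
print(f"fertility: {avg:.2f} (std {std:.2f})")

# With the real tokenizer (needs `transformers` and network access):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("NeuronUz/NeuronAI-Uzbek")
#   fertility_rate(uzbek_texts, tok.tokenize)
```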
### What This Means
- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization** - fit more content in the same context window
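
To make the context-utilization point concrete, the sketch below estimates how many Uzbek words fit in the 32,768-token context window at each fertility rate from the table. It is a back-of-the-envelope calculation that assumes the average fertility holds for the text in question; `words_in_context` is an illustrative helper.

```python
CONTEXT_TOKENS = 32_768  # context length listed under Model Details

# Fertility rates (tokens per word) copied from the comparison table.
fertility = {
    "NeuronAI-Uzbek": 2.67,
    "Qwen3-4B (Original)": 3.44,
}


def words_in_context(tokens_per_word: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Approximate number of words that fit in the context window."""
    return int(context_tokens / tokens_per_word)


for name, rate in fertility.items():
    print(f"{name}: ~{words_in_context(rate):,} words")
```

Under these assumptions the same window holds roughly 29% more Uzbek words (3.44 / 2.67 ≈ 1.29).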
---
## 🛠️ Model Details
### Architecture
| Property | Value |
|----------|-------|
| **Base Model** | Qwen3-4B |
| **Parameters** | 4 Billion |
| **Vocabulary Size** | 180,000 tokens |
| **Context Length** | 32,768 tokens |
| **Architecture** | Transformer (Decoder-only) |
| **Precision** | BFloat16 |
### Training Methodology
1. **Tokenizer Surgery**: Extended vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding Initialization**: Semantic initialization using subword composition
3. **Continual Pretraining**: Trained on a 2B-token corpus of Uzbek and English text
4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
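
The card does not spell out how step 2 was implemented. One common reading of "subword composition" is to initialize each new token's embedding as the mean of the embeddings of the pieces the original tokenizer would have split it into. The sketch below illustrates that idea in plain Python with toy two-dimensional vectors; the function name, the toy split table, and the averaging rule are all assumptions, not the released training code.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_new_embeddings(
    new_tokens: List[str],
    old_tokenize: Callable[[str], List[str]],
    old_embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Initialize each new token as the mean of its old subword embeddings."""
    new_embeddings: Dict[str, Vector] = {}
    for token in new_tokens:
        pieces = old_tokenize(token)  # how the old vocabulary splits the token
        vectors = [old_embeddings[p] for p in pieces]
        dim = len(vectors[0])
        new_embeddings[token] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return new_embeddings


# Toy data (hypothetical): the old tokenizer splits "kitob" into two pieces.
old_splits = {"kitob": ["kit", "ob"]}
old_vectors = {"kit": [1.0, 0.0], "ob": [0.0, 1.0]}

print(init_new_embeddings(["kitob"], lambda t: old_splits[t], old_vectors))
```

In a real setup the same averaging would be applied to rows of the model's input (and, if untied, output) embedding matrix after resizing it to the new vocabulary.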
### Training Data
| Dataset | Type | Purpose |
|---------|------|---------|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |
---
## Quick Start
### Installation
```bash
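# Qwen3 architectures need transformers >= 4.51.0; accelerate is required for device_map="auto"
pip install "transformers>=4.51.0" torch accelerate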
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Prompt (Uzbek): "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### With Thinking Mode (Chain-of-Thought)
```python
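# Reuses `tokenizer` and `model` from the Basic Usage example.
# Prompt (Uzbek): "Find 5 natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Enable step-by-step reasoning
)

# Generate as in Basic Usage; reasoning tokens count toward max_new_tokens
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))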
```
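
With thinking mode on, the decoded output usually contains the reasoning followed by the final answer. The helper below is a small sketch for separating the two, assuming the model keeps the base Qwen3 convention of wrapping reasoning in `<think>...</think>` (the card does not confirm this, so check a sample of real output first). `split_thinking` is an illustrative name.

```python
from typing import Tuple


def split_thinking(decoded: str, end_tag: str = "</think>") -> Tuple[str, str]:
    """Split decoded output into (reasoning, final_answer)."""
    if end_tag not in decoded:
        # No reasoning block found: treat everything as the answer
        return "", decoded.strip()
    reasoning, answer = decoded.split(end_tag, 1)
    # Drop the opening tag if present (some templates put it in the prompt instead)
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


example = "<think>\n3, 6, 9\n</think>\n\nJavob: 3, 6, 9, 12, 15"
reasoning, answer = split_thinking(example)
print("reasoning:", reasoning)
print("answer:", answer)
```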
---
## Use Cases
NeuronAI-Uzbek excels at:
- **Text Generation**: Creative writing, content creation in Uzbek
- **Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **Reading Comprehension**: Understanding and analyzing Uzbek texts
- **Grammar & Spelling**: Uzbek language correctness tasks
- **Translation Assistance**: Uzbek-English language tasks
- **Conversational AI**: Building Uzbek chatbots and assistants
---
## ⚠️ Limitations
- **Knowledge Cutoff**: The model has no knowledge of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight
---
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
---
## Acknowledgments
- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources
---
## Citation
```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```
---
**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**