---
language:
  - uz
  - en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
  - uzbek
  - qwen3
  - language-model
  - text-generation
  - nlp
  - central-asia
  - low-resource
  - tokenizer-optimization
datasets:
  - behbudiy/alpaca-cleaned-uz
  - NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
  - name: NeuronAI-Uzbek
    results:
      - task:
          type: text-generation
          name: Uzbek Language Understanding
        dataset:
          name: UzLiB Benchmark
          type: uzlib
        metrics:
          - type: accuracy
            value: 0.662
            name: Overall Accuracy
---

<div align="center">

# 🇺🇿 NeuronAI-Uzbek

### The Most Advanced Open-Source Language Model for Uzbek

[![Model](https://img.shields.io/badge/🤗_Model-NeuronAI--Uzbek-blue)](https://huggingface.co/NeuronUz/NeuronAI-Uzbek)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3--4B-purple)](https://huggingface.co/Qwen/Qwen3-4B)

**πŸ† 4th Place Globally | πŸ₯‡ 1st Place in Uzbekistan on UzLiB Benchmark**

*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*

</div>

---

## 📊 Key Results

<div align="center">

| Achievement | Value |
|-------------|-------|
| **UzLiB Overall Score** | **0.662** |
| **Global Ranking** | **#4** |
| **Regional Ranking** | **#1 in Uzbekistan** |
| **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B |

</div>

---

## πŸ† UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), a comprehensive evaluation suite for Uzbek language understanding.

### Leaderboard Position

[![image](https://cdn-uploads.huggingface.co/production/uploads/65fc70cbaeca3946b8753017/2xJ9BjS6rPNoRoBAzvW7w.png)](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)


> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, while competing against models with 100B+ parameters.

### Performance Comparison vs Original Qwen3-4B

| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|--------|:-------------------:|:--------------:|:-----------:|
| **Overall (All)** | 0.345 | **0.662** | **+91.9%** |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |

---

## 🔀 Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, substantially improving tokenization efficiency: a lower fertility rate means fewer tokens per word, which translates into faster inference and lower serving costs.

### Fertility Rate Comparison

| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|-------|:--------------:|:-------:|:----------:|:--------------------:|
| **NeuronAI-Uzbek (Ours)** 🏆 | **2.67** | 0.15 | 180,000 | **+22.5%** |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |

> **Fertility Rate**: Average number of tokens per word. Lower is better for efficiency.
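
As a quick illustration, fertility can be estimated by tokenizing a text sample and dividing the token count by the word count. The snippet below is a minimal sketch under simplifying assumptions: whitespace word-splitting and a toy one-sentence corpus stand in for the proper segmentation and large corpus behind the numbers above.

```python
from transformers import AutoTokenizer

def fertility(model_name: str, corpus: list[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    n_tokens = sum(len(tok.encode(s, add_special_tokens=False)) for s in corpus)
    n_words = sum(len(s.split()) for s in corpus)
    return n_tokens / n_words

# Toy corpus for illustration; real measurements need a large, representative sample.
sample = ["O'zbekiston Markaziy Osiyoda joylashgan davlatdir."]
for name in ["NeuronUz/NeuronAI-Uzbek", "Qwen/Qwen3-4B"]:
    print(f"{name}: {fertility(name, sample):.2f}")
```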

<div align="center">
<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
</div>

### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization**: fit more content into the same context window

---

## 🛠️ Model Details

### Architecture

| Property | Value |
|----------|-------|
| **Base Model** | Qwen3-4B |
| **Parameters** | 4 Billion |
| **Vocabulary Size** | 180,000 tokens |
| **Context Length** | 32,768 tokens |
| **Architecture** | Transformer (Decoder-only) |
| **Precision** | BFloat16 |
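
These numbers can be read straight from the published configuration. A small sketch follows; it assumes the repository follows the standard Qwen3 config schema, so the field names are an assumption.

```python
from transformers import AutoConfig

# Field names assume the standard Qwen3 config schema.
cfg = AutoConfig.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)
print(cfg.vocab_size)               # expected: 180000
print(cfg.max_position_embeddings)  # expected: 32768
```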

### Training Methodology

1. **Tokenizer Surgery**: Extended the vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding Initialization**: Semantic initialization of the new embeddings from each token's subword composition (see the sketch after this list)
3. **Continual Pretraining**: Trained on a 2B-token corpus of Uzbek and English text
4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
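
To make steps 1 and 2 concrete, here is a minimal sketch of one common way to do vocabulary extension with semantic initialization: each new token's embedding starts as the mean of the embeddings of its subword decomposition under the original vocabulary. The two example tokens are hypothetical, and this illustrates the general technique rather than the project's actual training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype="auto")

new_tokens = ["o'zbekcha", "davlatchilik"]  # hypothetical examples

# Record each new token's subword decomposition under the ORIGINAL
# vocabulary before the tokenizer is modified.
decompositions = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for token, sub_ids in decompositions.items():
        # Semantic initialization: mean of the old subword embeddings.
        emb[tokenizer.convert_tokens_to_ids(token)] = emb[sub_ids].mean(dim=0)
# Qwen3-4B ties input and output embeddings, so the LM head follows along;
# an untied model would need the same treatment on its output matrix.
```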

### Training Data

| Dataset | Type | Purpose |
|---------|------|---------|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |

---

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

prompt = "O'zbekiston haqida qisqacha ma'lumot bering."  # "Give brief information about Uzbekistan."

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

### With Thinking Mode (Chain-of-Thought)

```python
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable step-by-step reasoning
)
```
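
Assuming the model keeps Qwen3's chat template, the reasoning is emitted between `<think>` and `</think>` tags before the final answer, so the two parts can be separated with a string split. The sketch below reuses `model` and `tokenizer` from the basic-usage example together with the `text` built above.

```python
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Qwen3-style templates wrap the reasoning in <think>...</think>;
# everything after the closing tag is the final answer.
if "</think>" in generated:
    thinking, answer = generated.split("</think>", 1)
else:
    thinking, answer = "", generated
print(answer.strip())
```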

---

## 📈 Use Cases

NeuronAI-Uzbek excels at:

- **πŸ“ Text Generation**: Creative writing, content creation in Uzbek
- **❓ Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **πŸ“š Reading Comprehension**: Understanding and analyzing Uzbek texts
- **πŸ”€ Grammar & Spelling**: Uzbek language correctness tasks
- **🌐 Translation Assistance**: Uzbek-English language tasks
- **πŸ’¬ Conversational AI**: Building Uzbek chatbots and assistants

---

## ⚠️ Limitations

- **Knowledge Cutoff**: The model is unaware of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight

---

## 📜 License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🙏 Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources

---

## 📖 Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```

---

<div align="center">

**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**

</div>