---
language:
- uz
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- language-model
- text-generation
- nlp
- central-asia
- low-resource
- tokenizer-optimization
datasets:
- behbudiy/alpaca-cleaned-uz
- NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
- name: NeuronAI-Uzbek
  results:
  - task:
      type: text-generation
      name: Uzbek Language Understanding
    dataset:
      name: UzLiB Benchmark
      type: uzlib
    metrics:
    - type: accuracy
      value: 0.662
      name: Overall Accuracy
---

<div align="center">

# 🇺🇿 NeuronAI-Uzbek

### The Most Advanced Open-Source Language Model for Uzbek

[Model on Hugging Face](https://huggingface.co/NeuronUz/NeuronAI-Uzbek)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base model: Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)

**🏆 4th Place Globally | 🥇 1st Place in Uzbekistan on UzLiB Benchmark**

*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*

</div>

---

## 📊 Key Results

<div align="center">

| Achievement | Value |
|-------------|-------|
| **UzLiB Overall Score** | **0.662** |
| **Global Ranking** | **#4** |
| **Regional Ranking** | **#1 in Uzbekistan** |
| **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B |

</div>

---

## 📈 UzLiB Benchmark Performance

NeuronAI-Uzbek achieves exceptional performance on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), a comprehensive evaluation suite for Uzbek language understanding.

### Leaderboard Position

[View the full leaderboard](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)

> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, competing against models with 100B+ parameters.

### Performance Comparison vs Original Qwen3-4B

| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
|--------|:-------------------:|:--------------:|:-----------:|
| **Overall (All)** | 0.345 | **0.662** | **+91.9%** |
| Correct Word | 0.351 | 0.718 | +104.6% |
| Meaning | 0.309 | 0.466 | +50.8% |
| Meaning in Context | 0.347 | 0.333 | -4.0% |
| Fill-in | 0.327 | 0.385 | +17.7% |

---

## 🔤 Tokenizer Efficiency

We optimized the tokenizer specifically for Uzbek, achieving significantly better tokenization efficiency: a lower fertility rate means fewer tokens per word, which means faster inference and lower costs.

### Fertility Rate Comparison

| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
|-------|:--------------:|:-------:|:----------:|:--------------------:|
| **NeuronAI-Uzbek (Ours)** 🏆 | **2.67** | 0.15 | 180,000 | **+22.5%** |
| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |

> **Fertility rate**: average number of tokens per word. Lower is better for efficiency.

<div align="center">
<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
</div>

### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization**: fit more content in the same context window

---

## 🛠️ Model Details

### Architecture

| Property | Value |
|----------|-------|
| **Base Model** | Qwen3-4B |
| **Parameters** | 4 billion |
| **Vocabulary Size** | 180,000 tokens |
| **Context Length** | 32,768 tokens |
| **Architecture** | Transformer (decoder-only) |
| **Precision** | BFloat16 |

### Training Methodology

1. **Tokenizer Surgery**: Extended the vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding Initialization**: Semantic initialization of the new tokens' embeddings from their subword composition (sketched below)
3. **Continual Pretraining**: Trained on a 2B-token corpus of Uzbek and English text
4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
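
Step 2 follows the common mean-subword recipe: each new token's embedding starts as the average of the embeddings its surface form receives under the original tokenizer. Below is a minimal sketch of that general technique, not the project's exact training code; it assumes the new token IDs are appended after the original vocabulary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
old_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
new_tok = AutoTokenizer.from_pretrained("NeuronUz/NeuronAI-Uzbek", trust_remote_code=True)

# Grow the embedding matrix to the extended vocabulary. The new rows start
# out randomly initialized; we overwrite them below.
base.resize_token_embeddings(len(new_tok))
emb = base.get_input_embeddings().weight

with torch.no_grad():
    for token_id in range(len(old_tok), len(new_tok)):
        text = new_tok.decode([token_id])
        sub_ids = old_tok(text, add_special_tokens=False)["input_ids"]
        if sub_ids:
            # Mean of the subword embeddings under the original tokenizer.
            emb[token_id] = emb[sub_ids].mean(dim=0)
```

If the model's input and output embeddings are untied, the output projection rows would need the same treatment.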

### Training Data

| Dataset | Type | Purpose |
|---------|------|---------|
| Uzbek Web Corpus | Pretraining | Language modeling |
| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
| vicgalle/alpaca-gpt4 | SFT | English capability retention |

---

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# "Give brief information about Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### With Thinking Mode (Chain-of-Thought)

```python
# "Find 5 natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable step-by-step reasoning
)
```
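
With thinking enabled, Qwen3-based models emit their reasoning between `<think>` and `</think>` tags before the final answer. Continuing from the snippets above, here is one way to separate the two; this sketch assumes the closing tag survives decoding, so special tokens are kept:

```python
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)

# Keep special tokens so the </think> marker is visible in the decoded text.
decoded = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)

# Everything before </think> is the chain-of-thought; the answer follows it.
thinking, sep, answer = decoded.partition("</think>")
print(answer.strip() if sep else decoded.strip())
```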

---

## 🎯 Use Cases

NeuronAI-Uzbek excels at:

- **📝 Text Generation**: Creative writing, content creation in Uzbek
- **❓ Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **📖 Reading Comprehension**: Understanding and analyzing Uzbek texts
- **🔤 Grammar & Spelling**: Uzbek language correctness tasks
- **🌐 Translation Assistance**: Uzbek-English language tasks
- **💬 Conversational AI**: Building Uzbek chatbots and assistants

---

## ⚠️ Limitations

- **Knowledge Cutoff**: The model is unaware of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in the training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight

---

## 📄 License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🙏 Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources

---

## 📚 Citation

```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```

---

<div align="center">

**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**

</div>