---
language:
- uz
- en
tags:
- uzbek
- english
- sft
- chat
- transformers
pipeline_tag: text-generation
library_name: transformers
license: other
---

# NeuronAI-Uzbek

NeuronAI-Uzbek is a Qwen3-family causal language model fine-tuned to be helpful for **Uzbek** (primary) and **English**. This repository contains the model weights (`safetensors` shards), tokenizer files, and a chat template.

## Model summary

- **Architecture**: `Qwen3ForCausalLM` (decoder-only)
- **Dtype**: `bfloat16`
- **Layers**: 36
- **Hidden size**: 2560
- **Attention heads**: 32 (KV heads: 8)
- **Vocab size**: 180,000
- **Max position embeddings**: 40,960 (model config)
- **Generation defaults** (from `generation_config.json`; see the sketch after this section):
  - `temperature=0.6`
  - `top_p=0.95`
  - `top_k=20`

Note: the original base checkpoint name was not saved in `config.json` (`_name_or_path` is `null`). This model is from the **Qwen3** family and is intended to be used with a recent version of `transformers`.
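
If you want to verify the generation defaults programmatically, here is a minimal sketch using the standard `transformers` API (nothing in it is specific to this model beyond the repository id):

```python
from transformers import GenerationConfig

# Load the sampling defaults shipped in this repo's generation_config.json.
gen_config = GenerationConfig.from_pretrained("NeuronUz/NeuronAI-Uzbek")

# Expected: 0.6, 0.95, 20 (see "Generation defaults" above).
print(gen_config.temperature, gen_config.top_p, gen_config.top_k)
```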

## Training data (token counts)

This model was trained on a mixture of:

- **Uzbek**: **1.2B** tokens
- **English**: **0.8B** tokens

Total: **2.0B tokens** (a 60/40 Uzbek/English split).

## Training process (high-level)

We trained NeuronAI-Uzbek in stages:

1. **Data preparation**
   - Collected Uzbek- and English-language text.
   - Cleaned and normalized the text (deduplication and format normalization).
   - Tokenized it into a mixed Uzbek/English stream.

2. **Model training / adaptation**
   - Continued training on the mixed corpus (2.0B tokens total) to improve Uzbek capability while retaining English.

3. **Supervised fine-tuning (SFT)**
   - The final fine-tuning checkpoint was written to `runs/honest_sft/final` during training and uploaded here.
   - Key hyperparameters recovered from `training_args.bin` (see the configuration sketch after this list):
     - **Epochs**: 1
     - **Learning rate**: 5e-6
     - **Scheduler**: cosine, **warmup ratio**: 0.03
     - **Optimizer**: `paged_adamw_8bit`
     - **Per-device train batch size**: 2
     - **Gradient accumulation**: 4
     - **Gradient checkpointing**: enabled
     - **Seed**: 42
     - **bf16**: enabled

4. **Export**
   - Exported the weights to `safetensors` shards plus an index.
   - Uploaded to Hugging Face.
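
For reference, here is a minimal sketch of how the hyperparameters above map onto `transformers.TrainingArguments`. This is a reconstruction, not the actual training script: the dataset wiring and `Trainer` setup are omitted, and the `output_dir` is assumed from the checkpoint path in step 3.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="runs/honest_sft",   # assumed from the checkpoint path above
    num_train_epochs=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",       # 8-bit paged AdamW; requires bitsandbytes
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    seed=42,
    bf16=True,
)
```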

## Intended use

- **Primary**: a chat assistant for Uzbek, including general Q&A, drafting, summarization, translation (Uzbek↔English), and instruction following.
- **Secondary**: English chat and general text generation.

## Limitations and risks

- The model can generate incorrect or hallucinated information.
- It may reflect biases present in the training data.
- It has not been vetted for medical, legal, or financial advice.
- Coverage of Uzbek dialects and domain-specific jargon may be weaker than for standard Uzbek.

## How to use

### Requirements

- `transformers` (a recent version)
- `torch`

### Text generation (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# "Explain briefly and clearly in Uzbek what artificial intelligence is."
prompt = "Uzbek tilida qisqa va aniq qilib sun'iy intellekt nima ekanligini tushuntir."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The sampling settings below match the repository's generation defaults.
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```
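
For interactive use, streamed output is often nicer. A minimal sketch using `transformers.TextStreamer`, reusing `model`, `tokenizer`, and `inputs` from the block above:

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are generated; skip_prompt avoids echoing the input.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256, do_sample=True, streamer=streamer)
```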

### Chat formatting

This repository includes a `chat_template.jinja`. Some environments may not automatically load it into the tokenizer; if `tokenizer.chat_template` is empty, you can set it manually:

```python
from pathlib import Path
from transformers import AutoTokenizer

repo_id = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

# Fall back to the local template file if the tokenizer config did not carry one.
if not getattr(tokenizer, "chat_template", None):
    tokenizer.chat_template = Path("chat_template.jinja").read_text(encoding="utf-8")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Uzbek tilida menga salom ber."},  # "Greet me in Uzbek."
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```

If you are running in a notebook or environment where the template file is not present locally, download it from the repo first (or copy the template content directly).
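
A minimal sketch of that download step with `huggingface_hub` (this assumes the template lives at the repo root as `chat_template.jinja`, and reuses `tokenizer` from the block above):

```python
from pathlib import Path
from huggingface_hub import hf_hub_download

# Fetch the template file from the Hub, then load it into the tokenizer.
template_path = hf_hub_download(repo_id="NeuronUz/NeuronAI-Uzbek", filename="chat_template.jinja")
tokenizer.chat_template = Path(template_path).read_text(encoding="utf-8")
```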

## Example prompts

- Uzbek:
  - "Quyidagi matnni xulosa qil: ..." ("Summarize the following text: ...")
  - "Menga Python'da fayl o'qish misolini ko'rsat." ("Show me an example of reading a file in Python.")
  - "Inglizchadan o'zbekchaga tarjima qil: ..." ("Translate from English to Uzbek: ...")

- English:
  - "Explain gradient checkpointing in simple terms."
  - "Summarize this document in bullet points: ..."

## License

The license for this release is currently marked as `other` because the licensing of the upstream base model and the training datasets is not fully specified in this repository. Review those upstream terms before redistributing or deploying this model.

## Citation

If you use this model, please cite the repository:

```bibtex
@misc{neuronai_uzbek,
  title        = {NeuronAI-Uzbek},
  author       = {NeuronUz},
  howpublished = {\url{https://huggingface.co/NeuronUz/NeuronAI-Uzbek}},
  year         = {2025}
}
```