kmamaroziqov committed
Commit 0322d7f · verified · 1 Parent(s): e98e247

Add detailed model card

Files changed (1): README.md (added, +177 -0)
---
language:
- uz
- en
tags:
- uzbek
- english
- sft
- chat
- transformers
pipeline_tag: text-generation
library_name: transformers
license: other
---

# NeuronAI-Uzbek

NeuronAI-Uzbek is a Qwen3-family causal language model fine-tuned for **Uzbek** (primary) and **English**. This repository contains the model weights (`safetensors` shards), tokenizer files, and a chat template.

## Model summary

- **Architecture**: `Qwen3ForCausalLM` (decoder-only)
- **Dtype**: `bfloat16`
- **Layers**: 36
- **Hidden size**: 2560
- **Attention heads**: 32 (KV heads: 8)
- **Vocab size**: 180,000
- **Max position embeddings**: 40,960 (model config)
- **Generation defaults** (from `generation_config.json`)
  - `temperature=0.6`
  - `top_p=0.95`
  - `top_k=20`

Note: the original base checkpoint name was not saved in `config.json` (`_name_or_path` is `null`). The model belongs to the **Qwen3** family and is intended for use with a recent `transformers` release.
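
The attention figures in the summary imply grouped-query attention: 32 query heads share 8 KV heads. As a rough illustration of what that means for memory, the sketch below estimates the KV-cache size from the listed values. The per-head dimension is not stated in this card, so `HEAD_DIM = 128` is an assumption; read the real value from `config.json` before relying on the numbers.

```python
# Values taken from the model summary.
LAYERS = 36
ATTENTION_HEADS = 32
KV_HEADS = 8
MAX_POSITIONS = 40_960
BYTES_PER_VALUE = 2  # bfloat16

# ASSUMPTION: head_dim is not listed in this card; check config.json.
HEAD_DIM = 128


def kv_cache_bytes(num_tokens, head_dim=HEAD_DIM):
    """Approximate KV-cache size for one sequence of `num_tokens` tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * LAYERS * KV_HEADS * head_dim * BYTES_PER_VALUE * num_tokens


queries_per_kv_head = ATTENTION_HEADS // KV_HEADS
per_token_bytes = kv_cache_bytes(1)
full_context_gib = kv_cache_bytes(MAX_POSITIONS) / 2**30

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"KV cache per token: {per_token_bytes / 1024:.0f} KiB")
print(f"KV cache at max positions: {full_context_gib:.3f} GiB")
```

The estimate covers a single sequence and ignores weights and activations, so it is a lower bound on inference memory rather than a sizing guide.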

## Training data (token counts)

This model was trained on a mixture of:

- **Uzbek**: **1.2B** tokens
- **English**: **0.8B** tokens

Total: **2.0B tokens**.
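
For readers who want the mixture as proportions, a small sketch of the arithmetic follows. The dictionary keys are illustrative labels, not names taken from the training pipeline.

```python
# Token counts from the list above (keys are illustrative labels).
TOKEN_COUNTS = {"uzbek": 1_200_000_000, "english": 800_000_000}

total_tokens = sum(TOKEN_COUNTS.values())
shares = {lang: count / total_tokens for lang, count in TOKEN_COUNTS.items()}

for lang, share in shares.items():
    print(f"{lang}: {share:.0%} of {total_tokens / 1e9:.1f}B tokens")
```

The split works out to 60% Uzbek and 40% English.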

## Training process (high-level)

We trained NeuronAI-Uzbek in stages:

1. **Data preparation**
   - Collected Uzbek- and English-language text.
   - Cleaned and normalized the text (deduplication, format normalization).
   - Tokenized it into a mixed Uzbek/English stream.

2. **Model training / adaptation**
   - Continued training on the mixed corpus (2.0B tokens total) to improve Uzbek capability while retaining English.

3. **Supervised fine-tuning (SFT)**
   - The final fine-tuning checkpoint was written to `runs/honest_sft/final` during training and is the checkpoint uploaded here.
   - Key hyperparameters recovered from `training_args.bin`:
     - **Epochs**: 1
     - **Learning rate**: 5e-6
     - **Scheduler**: cosine, **warmup_ratio**: 0.03
     - **Optimizer**: `paged_adamw_8bit`
     - **Per-device train batch size**: 2
     - **Gradient accumulation**: 4
     - **Gradient checkpointing**: enabled
     - **Seed**: 42
     - **bf16**: enabled

4. **Export**
   - Exported weights to `safetensors` shards plus an index file.
   - Uploaded to Hugging Face.
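
To make the SFT hyperparameters more concrete, here is a minimal sketch of two quantities they determine: the effective batch size and the shape of a cosine schedule with linear warmup. It uses a common formulation of that schedule and is not the training code itself. The number of devices and the total step count are not given in this card, so `num_devices` and `TOTAL_STEPS` are assumptions chosen only for illustration.

```python
import math

# Hyperparameters listed under the SFT stage.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
PEAK_LR = 5e-6
WARMUP_RATIO = 0.03

# ASSUMPTION: the total number of optimizer steps is not stated in this card.
TOTAL_STEPS = 1000


def effective_batch_size(num_devices):
    """Sequences per optimizer step; num_devices is an assumed input."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * num_devices


def lr_at(step, total_steps=TOTAL_STEPS):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size on 1 device:", effective_batch_size(1))
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

With a single device the effective batch size is 8 sequences per optimizer step; it scales linearly with the number of devices.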

## Intended use

- **Primary**: chat assistant for Uzbek, including general Q&A, drafting, summarization, translation (Uzbek↔English), and instruction following.
- **Secondary**: English chat and general text generation.

## Limitations and risks

- The model can generate incorrect or hallucinated information.
- It may reflect biases present in the training data.
- Its output is not guaranteed to be safe or accurate for medical, legal, or financial advice.
- Coverage of Uzbek dialects and variants, and of domain-specific jargon, may be weaker.

## How to use

### Requirements

- `transformers` (a recent version; Qwen3 support was added in 4.51.0)
- `torch`
- `accelerate` (needed for `device_map="auto"` in the example below)
93
+ ### Text generation (Transformers)
94
+
95
+ ```python
96
+ import torch
97
+ from transformers import AutoModelForCausalLM, AutoTokenizer
98
+
99
+ repo_id = "NeuronUz/NeuronAI-Uzbek"
100
+
101
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
102
+ model = AutoModelForCausalLM.from_pretrained(
103
+ repo_id,
104
+ torch_dtype=torch.bfloat16,
105
+ device_map="auto",
106
+ trust_remote_code=True,
107
+ )
108
+
109
+ prompt = "Uzbek tilida qisqa va aniq qilib sun'iy intellekt nima ekanligini tushuntir."
110
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
111
+
112
+ with torch.no_grad():
113
+ out = model.generate(
114
+ **inputs,
115
+ max_new_tokens=256,
116
+ do_sample=True,
117
+ temperature=0.6,
118
+ top_p=0.95,
119
+ top_k=20,
120
+ )
121
+
122
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
123
+ ```
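
For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `out[0]` as in the example above repeats the prompt. One way to keep only the completion is to slice off the prompt length first. The helper name `completion_ids` below is hypothetical, and the toy token IDs stand in for a real `generate` result.

```python
def completion_ids(output_ids, prompt_length):
    """Return only the token IDs generated after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the output length")
    return output_ids[prompt_length:]


# Toy IDs standing in for a real generate() result.
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [900, 901, 2]

new_ids = completion_ids(output_ids, len(prompt_ids))
print(new_ids)
```

With the generation example, the equivalent call would be `tokenizer.decode(completion_ids(out[0], inputs["input_ids"].shape[1]), skip_special_tokens=True)`.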

### Chat formatting

This repository includes a `chat_template.jinja` file. Some environments do not load it into the tokenizer automatically; if `tokenizer.chat_template` is empty, set it manually:

```python
from pathlib import Path
from transformers import AutoTokenizer

repo_id = "NeuronUz/NeuronAI-Uzbek"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

# Fall back to a local copy of the template if the tokenizer did not load one.
if not getattr(tokenizer, "chat_template", None):
    tokenizer.chat_template = Path("chat_template.jinja").read_text(encoding="utf-8")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # English: "Greet me in Uzbek."
    {"role": "user", "content": "Uzbek tilida menga salom ber."},
]

# Render the conversation as a prompt string ending with the assistant turn opener.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```

If the template file is not present locally (for example, in a notebook), download it from the repository first or paste the template content directly.
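
One possible way to handle the missing-file case is sketched below. It uses `hf_hub_download` from `huggingface_hub` to fetch `chat_template.jinja` when no local copy exists. The helper name `ensure_chat_template` is hypothetical, and the sketch assumes the file sits at the repository root under that name.

```python
from pathlib import Path


def ensure_chat_template(tokenizer, repo_id, local_path="chat_template.jinja"):
    """Set tokenizer.chat_template from a local file, downloading it if needed."""
    if getattr(tokenizer, "chat_template", None):
        return tokenizer  # a template is already loaded; nothing to do

    path = Path(local_path)
    if not path.is_file():
        # Imported lazily so the helper also works offline with a local file.
        from huggingface_hub import hf_hub_download

        # ASSUMPTION: the template is stored at the repo root under this name.
        path = Path(hf_hub_download(repo_id=repo_id, filename="chat_template.jinja"))

    tokenizer.chat_template = path.read_text(encoding="utf-8")
    return tokenizer
```

Calling `ensure_chat_template(tokenizer, "NeuronUz/NeuronAI-Uzbek")` before `apply_chat_template` would then cover both the local and the remote case.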

## Example prompts

- Uzbek:
  - "Quyidagi matnni xulosa qil: ..." (Summarize the following text: ...)
  - "Menga Python'da fayl o'qish misolini ko'rsat." (Show me an example of reading a file in Python.)
  - "Inglizchadan o'zbekchaga tarjima qil: ..." (Translate from English to Uzbek: ...)

- English:
  - "Explain gradient checkpointing in simple terms."
  - "Summarize this document in bullet points: ..."

## License

This release is currently marked as `other` because the licensing details of the upstream base model and the datasets are not fully specified in this repository.

## Citation

If you use this model, please cite the repository:

```bibtex
@misc{neuronai_uzbek,
  title = {NeuronAI-Uzbek},
  author = {NeuronUz},
  howpublished = {\url{https://huggingface.co/NeuronUz/NeuronAI-Uzbek}},
  year = {2025}
}
```