Commit a4001aa by toroe (verified) · 1 Parent(s): 380119e

Create README.md

  1. README.md +283 -0
README.md ADDED
@@ -0,0 +1,283 @@
+ ---
+ language:
+ - es
+ license: other
+ base_model: HuggingFaceTB/SmolLM3-3B
+ tags:
+ - sft
+ - instruction-tuning
+ - reasoning
+ - long-context
+ - spanish
+ - fsdp
+ - transformers
+ - liger-kernel
+ datasets:
+ - DGurgurov/Nemotron-Multilingual-Reasoning
+ metrics:
+ - token_accuracy
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # SmolLM3-3B: Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)
+
+ ## Model Description
+
+ This model is a **supervised fine-tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, trained on the **Spanish (`es`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.
+
+ The goal of this training run was to improve:
+
+ - Spanish instruction following
+ - multi-step reasoning
+ - conversational behavior
+ - long-context understanding
+
+ Training used structured chat conversations and **completion-only loss**: the loss was computed only on assistant-response tokens.
+
+ ### Key Characteristics
+
+ - Base model: SmolLM3-3B
+ - Language specialization: Spanish
+ - Context length during training: **16,384 tokens**
+ - Chat-format training
+ - Packed sequences
+ - Long-context reasoning tuning
+
+ ---
+
+ ## Intended Uses
+
+ ### Suitable
+ - Spanish conversational assistants
+ - tutoring or educational assistants
+ - reasoning and explanation tasks
+ - document question answering
+ - research on efficient small LLMs
+
+ ### Not Suitable
+ - legal or medical advice
+ - autonomous decision making
+ - safety-critical systems
+ - high-risk financial use
+
+ ---
+
+ ## Training Data
+
+ Dataset:
+
+ `DGurgurov/Nemotron-Multilingual-Reasoning`
+
+ Processing configuration:
+
+ - Language filter: **Spanish only**
+ - Converted to chat messages (`prepare_messages=True`)
+ - Assistant-only optimization (`completion_only_loss=True`)
+
+ User and system messages were masked out of the loss during training, so they served as context but contributed no gradient.
+
+ Consult the dataset card for data sources and limitations.
+
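The masking described above can be illustrated with a minimal sketch. It assumes the common convention of setting masked label positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The helper name `mask_non_assistant` and the toy token IDs are hypothetical, and the actual trainer may build its mask differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens keep their token ID.

    token_ids: list of int, one ID per token in the conversation.
    roles: list of str, the role each token belongs to.
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy conversation; the token IDs are made up for illustration.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]

labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32]
```

Only the last two positions would contribute to the loss here; the system and user tokens still appear in the input as context.
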
+ ---
+
+ ## Training Procedure
+
+ Training was performed using **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.
+
+ ### Core Setup
+
+ - Method: Supervised fine-tuning (SFT)
+ - Epochs: **3**
+ - Maximum sequence length: **16,384 tokens**
+ - Sequence packing: enabled
+ - Precision: **bfloat16**
+ - Gradient checkpointing: enabled
+ - Liger kernel: enabled
+ - Distributed training: FSDP
+
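Sequence packing groups several shorter examples into one training sequence so that fewer of the 16,384 positions are wasted on padding. The sketch below shows the general idea with a simple greedy strategy over sequence lengths. The function `pack_sequences`, the example lengths, and the truncation of over-long examples are all assumptions for illustration; the packing strategy used by the training library may differ.

```python
def pack_sequences(lengths, max_length=16384):
    """Greedily group sequence lengths into bins of at most max_length tokens."""
    bins = []  # each bin is a list of sequence lengths packed together
    current, used = [], 0
    for n in lengths:
        n = min(n, max_length)  # assumption: over-long examples are truncated
        if current and used + n > max_length:
            # The next example does not fit, so close the current bin.
            bins.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        bins.append(current)
    return bins


# Hypothetical example lengths in tokens.
example_lengths = [9000, 5000, 4000, 12000, 3000, 20000]
packed = pack_sequences(example_lengths)
print(packed)  # [[9000, 5000], [4000, 12000], [3000], [16384]]
```
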
+ ---
+
+ ### Optimization
+
+ - Optimizer: `adamw_torch_fused`
+ - Batch size per device: 4
+ - Gradient accumulation steps: 4
+ - Effective batch size per GPU: 16 packed sequences per optimizer step (4 per device × 4 accumulation steps)
+ - Weight decay: 0.05
+
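The batch-size figures above can be combined with the process count from the launch command to estimate the global batch. This is a sketch of the arithmetic only, assuming each of the 8 processes acts as a data-parallel rank that consumes its own batches; the variable names are illustrative.

```python
num_processes = 8          # --num_processes 8
per_device_batch_size = 4  # --per_device_train_batch_size 4
grad_accum_steps = 4       # --gradient_accumulation_steps 4
max_length = 16384         # --max_length 16384

# Sequences seen by one GPU before each optimizer step.
per_gpu_effective = per_device_batch_size * grad_accum_steps

# Sequences seen across all ranks per optimizer step (data-parallel assumption).
global_effective = per_gpu_effective * num_processes

# Upper bound on tokens per optimizer step if every packed sequence is full.
max_tokens_per_step = global_effective * max_length

print(per_gpu_effective, global_effective, max_tokens_per_step)
# 16 128 2097152
```
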
+ Learning rate schedule:
+
+ - Scheduler: `cosine_with_min_lr`
+ - Warmup ratio: 0.05
+ - Minimum LR: 5e-6
+
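The shape of this schedule can be sketched as a linear warmup followed by a cosine decay that bottoms out at the minimum learning rate. The function below is an illustrative reimplementation, not the library's code. The peak learning rate (`2e-5`) and the total step count (`1000`) are hypothetical values chosen for the example, since the card does not state them.

```python
import math


def cosine_with_min_lr(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay down to min_lr."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 1000 optimizer steps with a peak LR of 2e-5.
TOTAL_STEPS = 1000
PEAK_LR = 2e-5
schedule = [cosine_with_min_lr(s, TOTAL_STEPS, PEAK_LR) for s in range(TOTAL_STEPS + 1)]

print(schedule[0], max(schedule), schedule[-1])
```

With these values the rate starts at zero, reaches the peak at the end of warmup, and ends at the 5e-6 floor instead of decaying to zero.
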
+ ---
+
+ ### Logging & Checkpoints
+
+ - Logging every 5 steps
+ - Checkpoint every 450 steps
+ - Weights & Biases tracking
+ - Token accuracy logged during training
+
+ ---
+
+ ### Data Processing
+
+ - Dataset preprocessing workers: 16
+ - Chat formatting enabled
+ - Dataset preparation enabled
+ - Language split: `es`
+
+ ---
+
+ ## Usage
+
+ ### Transformers Example
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ # Placeholder: replace with the Hub repository ID of this model.
+ model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,  # matches the bf16 training precision
+ )
+
+ messages = [
+     {"role": "system", "content": "Eres un asistente útil."},
+     {"role": "user", "content": "¿Por qué el cielo es azul?"},
+ ]
+
+ # Render the conversation with the chat template the model was trained on.
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.9,
+     do_sample=True,
+ )
+
+ # The decoded text contains the prompt followed by the generated reply.
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ **Important:**
+ Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and output quality is likely to degrade without it.
+
+ ---
+
+ ## Evaluation
+
+ During training, **token accuracy** was logged as a diagnostic metric.
+
+ Token accuracy:
+ - monitors training stability
+ - is **not** a benchmark
+ - does not measure reasoning ability
+
+ For meaningful evaluation, use:
+ - instruction-following benchmarks
+ - reasoning datasets
+ - long-context tasks
+
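As a rough illustration of why token accuracy is only a training diagnostic, the sketch below computes it as the fraction of supervised positions where the predicted token matches the target. It assumes masked positions are marked with `-100`; the function name and toy values are hypothetical, and the trainer's exact computation (for example, how it shifts and averages across batches) may differ.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def token_accuracy(predicted_ids, labels):
    """Fraction of non-masked positions where the prediction equals the label."""
    pairs = [(p, y) for p, y in zip(predicted_ids, labels) if y != IGNORE_INDEX]
    if not pairs:
        return 0.0  # nothing to score
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy example: three supervised positions, two predicted correctly.
predicted = [5, 9, 7, 31, 40, 33]
targets = [-100, -100, -100, 31, 32, 33]

accuracy = token_accuracy(predicted, targets)
print(round(accuracy, 3))  # 0.667
```

The metric rewards matching the reference token at each position under teacher forcing, so a high value says little about whether a freely generated answer is correct.
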
+ ---
+
+ ## Limitations
+
+ - May hallucinate incorrect information
+ - Reasoning chains may contain logical errors
+ - Performance near 16k tokens depends heavily on prompt structure
+ - As a smaller model, it has weaker world knowledge than larger LLMs
+ - Not suitable for safety-critical deployment
+
+ ---
+
+ ## Bias & Safety
+
+ The model inherits biases from:
+ - the base model
+ - the training dataset
+
+ Recommended mitigations:
+ - moderation filtering
+ - safety-oriented system prompts
+ - human review for sensitive applications
+
+ ---
+
+ ## License
+
+ This is a derivative model of:
+
+ `HuggingFaceTB/SmolLM3-3B`
+
+ The base model's license and restrictions apply, along with the dataset's terms.
+
+ Verify compatibility before commercial use.
+
+ ---
+
+ ## Reproducibility (Training Arguments)
+
+ ```bash
+ accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
+     --model_name HuggingFaceTB/SmolLM3-3B \
+     --tokenizer_name HuggingFaceTB/SmolLM3-3B \
+     --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
+     --skip_prepare_dataset False \
+     --lang_split es \
+     --prepare_messages True \
+     --completion_only_loss True \
+     --max_length 16384 \
+     --dataset_num_proc 16 \
+     --packing True \
+     --use_liger_kernel True \
+     --bf16 True \
+     --log_token_accuracy True \
+     --optim adamw_torch_fused \
+     --gradient_checkpointing True \
+     --per_device_train_batch_size 4 \
+     --gradient_accumulation_steps 4 \
+     --ddp_find_unused_parameters False \
+     --lr_scheduler_type cosine_with_min_lr \
+     --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
+     --warmup_ratio 0.05 \
+     --weight_decay 0.05 \
+     --report_to wandb \
+     --run_name smol_3b_3epochs_lns_es \
+     --num_train_epochs 3 \
+     --save_strategy steps \
+     --logging_steps 5 \
+     --save_steps 450
+ ```
+ ---
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ - `HuggingFaceTB/SmolLM3-3B`
+ - `DGurgurov/Nemotron-Multilingual-Reasoning`
+
+ ---
+
+ ## Acknowledgements
+
+ - HuggingFaceTB: SmolLM3 base model
+ - Nemotron Multilingual Reasoning dataset authors
+ - Hugging Face Accelerate and Transformers libraries
+
+