---
language:
  - es
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
  - sft
  - instruction-tuning
  - reasoning
  - long-context
  - spanish
  - fsdp
  - transformers
  - liger-kernel
datasets:
  - DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
  - token_accuracy
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of:

`HuggingFaceTB/SmolLM3-3B`

Fine-tuned on the **Spanish (`es`) split** of:

`DGurgurov/Nemotron-Multilingual-Reasoning`

The goal of this training run was to improve:

- Spanish instruction following
- multi-step reasoning
- conversational behavior
- long-context understanding

Training used structured chat conversations with **completion-only loss**: the loss was computed only on assistant-response tokens.

### Key Characteristics

- Base model: SmolLM3-3B
- Language specialization: Spanish
- Context length during training: **16,384 tokens**
- Chat-format training
- Packed sequences
- Long-context reasoning tuning

---

## Intended Uses

### Suitable
- Spanish conversational assistants
- tutoring or educational assistants
- reasoning and explanation tasks
- document question answering
- research on efficient small LLMs

### Not Suitable
- legal or medical advice
- autonomous decision making
- safety-critical systems
- high-risk financial use

---

## Training Data

Dataset:

`DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **Spanish only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)

User and system messages were masked during training.

Consult the dataset card for data sources and limitations.
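
To make the masking concrete, here is a minimal sketch of what completion-only loss does (an illustration, not TRL's exact implementation): tokens outside assistant turns get the label `-100`, which PyTorch's cross-entropy loss ignores.

```python
import torch

# Illustrative sketch of completion-only loss masking (not TRL's exact code).
# Positions outside assistant turns are labeled -100, which
# torch.nn.CrossEntropyLoss ignores, so only assistant tokens carry gradient.
def mask_non_assistant_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """assistant_mask is True where a token belongs to an assistant turn."""
    labels = input_ids.clone()
    labels[~assistant_mask] = -100
    return labels

# Toy example: the last three tokens form the assistant response.
input_ids = torch.tensor([5, 17, 42, 9, 23, 2])
assistant_mask = torch.tensor([False, False, False, True, True, True])
print(mask_non_assistant_labels(input_ids, assistant_mask))
# tensor([-100, -100, -100,    9,   23,    2])
```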

---

## Training Procedure

Training was performed with **Hugging Face Accelerate and Fully Sharded Data Parallel (FSDP)** across 8 processes.

### Core Setup

- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384 tokens**
- Sequence packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
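
As a hedged sketch, this setup maps onto TRL's `SFTConfig` (a subclass of `transformers.TrainingArguments`) roughly as follows. The field names match the flags shown under "Reproducibility" below but may vary across TRL versions (older releases use `max_seq_length` instead of `max_length`), and the output path is illustrative.

```python
from trl import SFTConfig

# Sketch of the core setup; values mirror the list above, and the field
# names follow the flags listed under "Reproducibility" below.
config = SFTConfig(
    output_dir="smollm3-3b-es-sft",   # illustrative output path
    max_length=16384,                 # training context length
    packing=True,                     # pack short examples into full sequences
    completion_only_loss=True,        # loss on assistant tokens only
    bf16=True,                        # bfloat16 precision
    gradient_checkpointing=True,      # trade recompute for activation memory
    use_liger_kernel=True,            # fused Liger Triton kernels
    num_train_epochs=3,
)
```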

---

### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size: 16 sequences per device per optimizer step (128 globally across 8 processes)
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6
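
Since the base learning rate is not listed in this card, here is only a shape sketch of what `cosine_with_min_lr` with warmup does: a linear ramp over the warmup steps, then a cosine decay that floors at `min_lr` instead of 0 (the exact `transformers` implementation may differ in details).

```python
import math

# Shape sketch of cosine-with-min-lr plus linear warmup; base_lr is a free
# parameter because the run's base learning rate is not stated in this card.
def lr_at(step: int, total_steps: int, base_lr: float,
          min_lr: float = 5e-6, warmup_ratio: float = 0.05) -> float:
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)       # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # decays 1 -> 0
    return min_lr + (base_lr - min_lr) * cosine            # floors at min_lr
```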

---

### Logging & Checkpoints

- Logging every 5 steps
- Checkpoint every 450 steps
- Weights & Biases tracking
- Token accuracy logged during training

---

### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting enabled
- Dataset preparation enabled
- Language split: `es`

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Load the tokenizer and model; bfloat16 matches the training precision.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"}
]

# Render the conversation with the model's chat template and append the
# generation prompt so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Important:**  
Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations and performance will degrade without it.

---

## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:
- monitors training stability
- is **not** a benchmark
- does not measure reasoning ability
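
For reference, a minimal sketch of the metric (not TRL's exact code): the fraction of non-masked target tokens whose greedy prediction matches the label.

```python
import torch

# Sketch: share of supervised target tokens predicted correctly.
# logits: (batch, seq, vocab); labels: (batch, seq), -100 on masked positions.
def token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    preds = logits[:, :-1, :].argmax(dim=-1)   # position t predicts token t+1
    targets = labels[:, 1:]
    mask = targets != -100                     # skip masked (non-assistant) tokens
    correct = (preds == targets) & mask
    return (correct.sum() / mask.sum().clamp(min=1)).item()
```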

For meaningful evaluation, use:
- instruction-following benchmarks
- reasoning datasets
- long-context tasks
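
As one possible starting point, EleutherAI's `lm-evaluation-harness` can run benchmarks directly against a Hub checkpoint; the task name below is a placeholder to replace with a Spanish-language task of your choice:

```text
lm_eval --model hf \
  --model_args pretrained=YOUR_USERNAME/YOUR_MODEL_REPO,dtype=bfloat16 \
  --tasks <spanish_task_of_choice> \
  --batch_size 8
```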

---

## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical errors
- Performance near 16k tokens depends heavily on prompt structure
- As a smaller model, it has weaker world knowledge than larger LLMs
- Not suitable for safety-critical deployment

---

## Bias & Safety

The model inherits biases from:
- the base model
- the training dataset

Recommended mitigations:
- moderation filtering
- safety-oriented system prompts
- human review for sensitive applications

---

## License

This is a derivative model of:

`HuggingFaceTB/SmolLM3-3B`

The original base model license and restrictions apply, along with dataset terms.

Verify compatibility before commercial use.

---

## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py

--model_name HuggingFaceTB/SmolLM3-3B
--tokenizer_name HuggingFaceTB/SmolLM3-3B
--dataset_path DGurgurov/Nemotron-Multilingual-Reasoning
--skip_prepare_dataset False
--lang_split es
--prepare_messages True
--completion_only_loss True
--max_length 16384
--dataset_num_proc 16
--packing True
--use_liger_kernel True
--bf16 True
--log_token_accuracy True
--optim adamw_torch_fused
--gradient_checkpointing True
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--ddp_find_unused_parameters False
--lr_scheduler_type cosine_with_min_lr
--lr_scheduler_kwargs '{"min_lr": 5.0e-6}'
--warmup_ratio 0.05
--weight_decay 0.05
--report_to wandb
--run_name smol_3b_3epochs_lns_es
--num_train_epochs 3
--save_strategy steps
--logging_steps 5
--save_steps 450
```
---

## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries