---
language:
- es
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- spanish
- fsdp
- transformers
- liger-kernel
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
- token_accuracy
library_name: transformers
pipeline_tag: text-generation
---
# SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)
## Model Description
This model is a **supervised fine-tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, trained on the **Spanish (`es`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.
The goal of this training run was to improve:
- Spanish instruction following
- multi-step reasoning
- conversational behavior
- long-context understanding
Training used structured chat conversations and **completion-only loss**, meaning only the assistant responses were optimized.
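Completion-only loss is commonly implemented by replacing the labels of non-assistant tokens with an ignore value, so they contribute nothing to the cross-entropy loss. The following is a minimal sketch of that idea, not the trainer's actual code: `mask_non_assistant` and the per-token `roles` list are hypothetical names used for illustration, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens keep their ids.

    token_ids: list[int] with the token ids of one conversation
    roles: list[str] of the same length, naming the role behind each token
           (hypothetical input; real trainers derive this from the chat template)
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy conversation: 2 system tokens, 3 user tokens, 4 assistant tokens.
token_ids = [11, 12, 21, 22, 23, 31, 32, 33, 34]
roles = ["system"] * 2 + ["user"] * 3 + ["assistant"] * 4
labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33, 34]
```

Only the last four positions carry a training signal here, which mirrors how system and user turns are excluded from optimization.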
### Key Characteristics
- Base model: SmolLM3-3B
- Language specialization: Spanish
- Context length during training: **16,384 tokens**
- Chat-format training
- Packed sequences
- Long-context reasoning tuning
---
## Intended Uses
### Suitable
- Spanish conversational assistants
- tutoring or educational assistants
- reasoning and explanation tasks
- document question answering
- research on efficient small LLMs
### Not Suitable
- legal or medical advice
- autonomous decision making
- safety-critical systems
- high-risk financial use
---
## Training Data
Dataset:
`DGurgurov/Nemotron-Multilingual-Reasoning`
Processing configuration:
- Language filter: **Spanish only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)
User and system messages were masked during training.
Consult the dataset card for data sources and limitations.
---
## Training Procedure
Training was performed using **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.
### Core Setup
- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384 tokens**
- Sequence packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
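Sequence packing places several shorter examples in one 16,384-token training sequence to reduce padding. The sketch below shows one simple strategy, greedy first-fit, purely as an illustration. The packing algorithm the trainer actually used may differ, and `pack_first_fit` is a hypothetical helper.

```python
def pack_first_fit(lengths, max_length=16_384):
    """Group example indices into bins whose total token count fits max_length.

    lengths: list[int], token count of each example
    Returns (bins, totals): the indices in each bin and the tokens used per bin.
    """
    bins = []    # each bin is a list of example indices
    totals = []  # running token count for each bin
    for idx, n in enumerate(lengths):
        if n > max_length:
            raise ValueError(f"example {idx} has {n} tokens, more than {max_length}")
        for b, used in enumerate(totals):
            if used + n <= max_length:
                # The example fits into an existing bin.
                bins[b].append(idx)
                totals[b] += n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([idx])
            totals.append(n)
    return bins, totals


lengths = [9000, 7000, 5000, 12000, 384, 4000]
bins, totals = pack_first_fit(lengths)
print(bins)    # [[0, 1, 4], [2, 5], [3]]
print(totals)  # [16384, 9000, 12000]
```

In this toy case six examples fit into three packed sequences instead of six padded ones.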
---
### Optimization
- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size per GPU: 16 packed sequences per optimizer step (4 per device × 4 accumulation steps)
- Weight decay: 0.05
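The batch-size figures above can be worked through as simple arithmetic. This sketch assumes one process per GPU and that each of the 8 processes sees different data, which is the usual data-parallel setup but is not stated explicitly in this card. The token count is an upper bound, since packed sequences are not always completely full.

```python
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_processes = 8          # from the Accelerate launch settings
max_length = 16_384        # maximum packed sequence length in tokens

# Sequences contributing to one optimizer step on a single process.
per_process_batch = per_device_batch_size * gradient_accumulation_steps

# Sequences per optimizer step across all processes (assumes pure data parallelism).
global_batch = per_process_batch * num_processes

# Upper bound on tokens per optimizer step, if every packed sequence is full.
max_tokens_per_step = global_batch * max_length

print(per_process_batch)    # 16
print(global_batch)         # 128
print(max_tokens_per_step)  # 2097152
```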
Learning rate schedule:
- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6
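The general shape of a cosine schedule with linear warmup and a learning-rate floor can be sketched as follows. This illustrates the idea and is not the library's implementation. The card does not state the peak learning rate or the total number of steps, so `peak_lr` and `total_steps` below are placeholder values chosen for the example.

```python
import math


def cosine_with_min_lr(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay down to min_lr."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between the peak (cosine = 1) and the floor (cosine = 0).
    return min_lr + (peak_lr - min_lr) * cosine


# Placeholder values: neither is taken from the actual training run.
total_steps = 1000
peak_lr = 5e-5

start_lr = cosine_with_min_lr(0, total_steps, peak_lr)
top_lr = cosine_with_min_lr(math.ceil(total_steps * 0.05), total_steps, peak_lr)
final_lr = cosine_with_min_lr(total_steps, total_steps, peak_lr)
print(start_lr, top_lr, final_lr)
```

With these settings the rate starts at zero, reaches the peak when warmup ends, and finishes at the 5e-6 floor instead of decaying to zero.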
---
### Logging & Checkpoints
- Logging every 5 steps
- Checkpoint every 450 steps
- Weights & Biases tracking
- Token accuracy logged during training
---
### Data Processing
- Dataset preprocessing workers: 16
- Chat formatting enabled
- Dataset preparation enabled
- Language split: `es`
---
## Usage
### Transformers Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Replace with the Hub repository ID of this model.
model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"},
]

# Render the conversation with the model's chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
**Important:**
Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, so output quality is likely to degrade with raw, unformatted prompts.
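Because fine-tuning used a maximum sequence length of 16,384 tokens, it may help to check that the prompt plus the generation budget stays within that length. The helper below is a hypothetical sketch. The limit reflects the fine-tuning setting only and is not necessarily a hard limit of the model.

```python
TRAIN_MAX_LENGTH = 16_384  # maximum sequence length used during fine-tuning


def fits_training_context(prompt_tokens, max_new_tokens, limit=TRAIN_MAX_LENGTH):
    """Return True if prompt plus generation stays within the fine-tuning length."""
    return prompt_tokens + max_new_tokens <= limit


# Example with made-up prompt sizes and the 512-token generation budget used above.
print(fits_training_context(1_200, 512))   # True
print(fits_training_context(16_000, 512))  # False
```

In the Transformers example, the prompt size could be read from `inputs["input_ids"].shape[-1]`.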
---
## Evaluation
During training, **token accuracy** was logged as a diagnostic metric.
Token accuracy:
- monitors training stability
- is **not** a benchmark
- does not measure reasoning ability
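To make concrete what this metric captures, token accuracy can be thought of as the fraction of non-masked label positions where the model's top prediction equals the target token. The sketch below assumes predictions are already aligned with labels and that masked positions use `-100`. `token_accuracy` is a hypothetical helper, not the trainer's logging code.

```python
IGNORE_INDEX = -100  # positions excluded from the loss (e.g. user and system tokens)


def token_accuracy(predicted_ids, labels):
    """Fraction of non-masked positions where the prediction matches the label."""
    pairs = [(p, y) for p, y in zip(predicted_ids, labels) if y != IGNORE_INDEX]
    if not pairs:
        return 0.0  # nothing to score
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy example: two masked positions, then four scored positions with one miss.
predicted = [5, 9, 31, 32, 99, 34]
labels = [-100, -100, 31, 32, 33, 34]
acc = token_accuracy(predicted, labels)
print(acc)  # 0.75
```

A high value only says the model reproduces the reference tokens one step at a time. It does not say whether a freely generated answer is correct.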
For meaningful evaluation, use:
- instruction-following benchmarks
- reasoning datasets
- long-context tasks
---
## Limitations
- May hallucinate incorrect information
- Reasoning chains may contain logical errors
- Performance near 16k tokens depends heavily on prompt structure
- As a 3B-parameter model, it has weaker world knowledge than larger LLMs
- Not suitable for safety-critical deployment
---
## Bias & Safety
The model inherits biases from:
- the base model
- the training dataset
Recommended mitigations:
- moderation filtering
- safety-oriented system prompts
- human review for sensitive applications
---
## License
This is a derivative model of:
`HuggingFaceTB/SmolLM3-3B`
The original base model license and restrictions apply, along with dataset terms.
Verify compatibility before commercial use.
---
## Reproducibility (Training Arguments)
```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split es \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384 \
  --dataset_num_proc 16 \
  --packing True \
  --use_liger_kernel True \
  --bf16 True \
  --log_token_accuracy True \
  --optim adamw_torch_fused \
  --gradient_checkpointing True \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --ddp_find_unused_parameters False \
  --lr_scheduler_type cosine_with_min_lr \
  --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
  --warmup_ratio 0.05 \
  --weight_decay 0.05 \
  --report_to wandb \
  --run_name smol_3b_3epochs_lns_es \
  --num_train_epochs 3 \
  --save_strategy steps \
  --logging_steps 5 \
  --save_steps 450
```
---
## Citation
If you use this model, please cite:
- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`
---
## Acknowledgements
- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries