---
language:
- en
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- fsdp
- transformers
- liger-kernel
- english
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
pipeline_tag: text-generation
---
# SmolLM3-3B — English Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)
## Model Description
This model is a **Supervised Fine-Tuned (SFT)** version of:
`HuggingFaceTB/SmolLM3-3B`
It was trained on the **English (`en`) split** of:
`DGurgurov/Nemotron-Multilingual-Reasoning`
The purpose of this fine-tune is to improve:
- English instruction following
- multi-step reasoning
- long-context chat behavior
The dataset was converted into structured chat conversations, and the model was trained with **completion-only loss**, meaning only the assistant’s responses contributed to the training objective.
### Key Characteristics
- Base model: SmolLM3-3B
- Language: English specialization
- Context length during training: **16,384 tokens**
- Chat-formatted conversations
- Packed sequences
- Long-context reasoning tuning
---
## Intended Uses
### Suitable
- Conversational assistants
- Instruction-following agents
- Reasoning tasks
- Educational tutoring
- Long-document Q&A
- Research on small long-context LLMs
### Not Suitable
- Medical or legal advice
- Autonomous decision making
- Safety-critical systems
- Financial decision automation
---
## Training Data
Dataset:
`DGurgurov/Nemotron-Multilingual-Reasoning`
Processing configuration:
- Language filter: **English only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only loss masking (`completion_only_loss=True`)
User and system prompts were masked during training; only assistant tokens produced gradients.
Please consult the dataset card for data provenance and limitations.
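The assistant-only masking described above can be illustrated with a minimal sketch. It assumes each token already carries a role tag; the token ids, the `roles` list, and the helper name `mask_non_assistant` are hypothetical and are not taken from the training code. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute no gradient.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(token_ids, roles):
    """Return labels in which only assistant tokens keep their token id."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy conversation: ids and per-token roles are made up for illustration.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]

labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32]
```

In a real pipeline the role boundaries would come from the chat template rather than a hand-written list.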
---
## Training Procedure
Training used **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.
### Core Setup
- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Max sequence length: **16,384**
- Packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
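To give a feel for what packing means at a 16,384-token limit, here is a small sketch that uses first-fit-decreasing bin packing. This card does not state which packing strategy the trainer used, so the algorithm, the function name `pack_first_fit_decreasing`, and the example lengths are all assumptions chosen for illustration.

```python
def pack_first_fit_decreasing(lengths, max_length=16_384):
    """Place each example (longest first) into the first bin that still has room."""
    bins = []  # each bin is a list of example lengths summing to <= max_length
    for n in sorted(lengths, reverse=True):
        if n > max_length:
            # A real pipeline would likely truncate or split; this sketch refuses.
            raise ValueError(f"example of {n} tokens exceeds max_length")
        for b in bins:
            if sum(b) + n <= max_length:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing bin had room, so open a new one
    return bins


example_lengths = [9000, 7000, 6000, 5000, 4000, 1000]  # made-up token counts
packed = pack_first_fit_decreasing(example_lengths)
print(packed)  # [[9000, 7000], [6000, 5000, 4000, 1000]]
```

Six examples fit into two packed sequences of 16,000 tokens each, which is why packing reduces padding waste when example lengths vary widely.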
---
### Optimization
- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation: 4
- Effective batch size per GPU: 16 sequences per optimizer step (4 × 4)
- Weight decay: 0.05
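The batch-size figures above combine as shown in the following arithmetic sketch. It assumes one GPU per process and pure data parallelism across the 8 processes mentioned earlier; the token count is an upper bound that is reached only if every packed sequence is full.

```python
per_device_batch = 4   # from this card
grad_accum_steps = 4   # from this card
num_processes = 8      # from this card; assumed to be one GPU per process
max_length = 16_384    # from this card

per_gpu_effective = per_device_batch * grad_accum_steps
global_effective = per_gpu_effective * num_processes
max_tokens_per_step = global_effective * max_length

print(per_gpu_effective)    # 16
print(global_effective)     # 128
print(max_tokens_per_step)  # 2097152
```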
Learning rate schedule:
- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum learning rate: 5e-6
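The shape of this schedule can be sketched in plain Python: linear warmup followed by a cosine decay that bottoms out at the minimum learning rate. This card does not state the peak learning rate or the total step count, so `peak_lr=1e-4` and `total_steps=1000` below are placeholders, and the rounding used for the warmup step count may differ from what the trainer does.

```python
import math


def lr_at(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


TOTAL_STEPS = 1000  # placeholder, not from this card
PEAK_LR = 1e-4      # placeholder, not from this card

for step in (0, 25, 50, 525, 1000):
    print(step, lr_at(step, TOTAL_STEPS, PEAK_LR))
```

Under these placeholder values the rate climbs from zero to the peak over the first 50 steps and ends at 5e-6 rather than at zero.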
---
### Logging & Checkpoints
- Logging: every 5 steps
- Checkpoint: every 450 steps
- Tracking: Weights & Biases
- Token accuracy logged during training
---
### Data Processing
- Dataset preprocessing workers: 16
- Chat formatting: enabled
- Dataset preparation: enabled
- Language split: `en`
---
## Usage
### Transformers Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain why the sky is blue."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Important:**
Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations and performance will degrade without it.
---
## Evaluation
During training, **token accuracy** was logged as a diagnostic metric.
Token accuracy:
- helps monitor training stability
- is **not** a benchmark score
- does not measure reasoning quality
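As a rough illustration of what the token accuracy diagnostic measures, the sketch below counts how often the predicted token matches the label on non-masked positions. It assumes predictions are already aligned with labels and that masked positions carry `-100`; the exact computation in the training code is not given in this card, and the ids are made up.

```python
IGNORE_INDEX = -100  # masked (non-assistant) positions are skipped


def token_accuracy(predicted_ids, labels):
    """Fraction of non-masked positions where the prediction equals the label."""
    scored = [(p, y) for p, y in zip(predicted_ids, labels) if y != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


predicted = [5, 9, 31, 40, 33]     # hypothetical argmax token ids
labels = [-100, -100, 31, 32, 33]  # first two positions are masked prompt tokens

acc = token_accuracy(predicted, labels)
print(f"{acc:.3f}")  # 0.667
```

A high value here only says the model reproduces the training targets token by token; it says nothing about whether a full generated answer is correct.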
For meaningful evaluation, use:
- instruction-following benchmarks
- reasoning datasets
- long-context tasks
---
## Limitations
- May hallucinate incorrect information
- Reasoning chains may contain logical mistakes
- Performance near 16k tokens depends heavily on prompt structure
- As a 3B-parameter model, it has less world knowledge than larger LLMs
- Not suitable for safety-critical deployment
---
## Bias & Safety
The model inherits biases from:
- the base model
- the training dataset
Recommended mitigations:
- moderation filtering
- safety-oriented system prompts
- human oversight in sensitive use cases
---
## License
This is a derivative model of:
`HuggingFaceTB/SmolLM3-3B`
The original base model license and restrictions apply, along with dataset terms.
Verify compatibility before commercial usage.
---
## Reproducibility (Training Arguments)
```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split en \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384
```
---
## Citation
If you use this model, please cite:
- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`
---
## Acknowledgements
- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries