---
language:
- en
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- fsdp
- transformers
- liger-kernel
- english
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
pipeline_tag: text-generation
---

# SmolLM3-3B — English Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of:

`HuggingFaceTB/SmolLM3-3B`

It was trained on the **English (`en`) split** of:

`DGurgurov/Nemotron-Multilingual-Reasoning`

The purpose of this fine-tune is to improve:

- English instruction following
- multi-step reasoning
- long-context chat behavior

The dataset was converted into structured chat conversations and optimized with **completion-only loss**, meaning only the assistant's responses contributed to the training objective.

### Key Characteristics

- Base model: SmolLM3-3B
- Language: English specialization
- Context length during training: **16,384 tokens**
- Chat-formatted conversations
- Packed sequences
- Long-context reasoning tuning

---

## Intended Uses

### Suitable
- Conversational assistants
- Instruction-following agents
- Reasoning tasks
- Educational tutoring
- Long-document Q&A
- Research on small long-context LLMs

### Not Suitable
- Medical or legal advice
- Autonomous decision making
- Safety-critical systems
- Financial decision automation

---

## Training Data

Dataset:

`DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **English only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only loss masking (`completion_only_loss=True`)

User and system prompts were masked during training; only assistant tokens produced gradients.
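
The masking described above can be sketched as follows. This is an illustrative re-implementation, not the project's actual training code: non-assistant tokens receive the label `-100`, which PyTorch's cross-entropy loss ignores, so only assistant tokens produce gradients. The helper name and span format are hypothetical.

```python
# Illustrative sketch of completion-only loss masking (not the actual
# trainer code). Tokens outside the assistant's replies get label -100,
# the index that PyTorch's CrossEntropyLoss ignores by default.
IGNORE_INDEX = -100

def mask_non_assistant(token_ids, assistant_spans):
    """Return labels where only assistant tokens contribute to the loss.

    token_ids: list[int] for the full conversation
    assistant_spans: list of (start, end) index pairs covering assistant replies
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Example: a 10-token conversation where tokens 6..9 are the assistant reply.
tokens = list(range(100, 110))
labels = mask_non_assistant(tokens, [(6, 10)])
assert labels[:6] == [IGNORE_INDEX] * 6
assert labels[6:] == tokens[6:]
```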

Please consult the dataset card for data provenance and limitations.

---

## Training Procedure

Training used **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.

### Core Setup

- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Max sequence length: **16,384**
- Packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
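
Packing (enabled above) concatenates multiple short samples into one fixed-length sequence so little of the 16,384-token window is spent on padding. A simplified greedy sketch, with a hypothetical helper name; real packers also track sample boundaries for attention masking:

```python
def pack_greedy(samples, max_len=16384):
    """Greedily pack tokenized samples into sequences of at most max_len tokens.

    samples: list of token-id lists. Returns a list of packed sequences.
    Illustrative only, not the actual trainer's packing implementation.
    """
    packed, current = [], []
    for sample in samples:
        # Start a new packed sequence when the next sample would overflow.
        if current and len(current) + len(sample) > max_len:
            packed.append(current)
            current = []
        current.extend(sample)
    if current:
        packed.append(current)
    return packed

# Three short samples fit into a single packed sequence of length 9.
seqs = pack_greedy([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=16)
assert len(seqs) == 1 and len(seqs[0]) == 9
```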

---

### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation: 4
- Effective batch size per GPU: 16 sequences/step
- Weight decay: 0.05
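
The effective batch size follows from per-device batch × gradient accumulation; multiplying by the 8 FSDP processes gives the global batch per optimizer step:

```python
# Batch-size arithmetic for the configuration above.
per_device_batch = 4
grad_accum = 4
num_processes = 8

per_gpu_effective = per_device_batch * grad_accum      # 16 sequences/step per GPU
global_effective = per_gpu_effective * num_processes   # 128 sequences/step globally
assert per_gpu_effective == 16
assert global_effective == 128
```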

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum learning rate: 5e-6
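
A `cosine_with_min_lr` schedule decays toward a floor rather than zero. The sketch below is an illustrative re-implementation, not the `transformers` scheduler itself; the peak learning rate is not listed on this card, so it appears as a free parameter:

```python
import math

def cosine_with_min_lr(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    """Cosine decay with linear warmup and a learning-rate floor (sketch).

    Illustrative only; the run used the transformers 'cosine_with_min_lr'
    scheduler. peak_lr is hypothetical (not stated on this card).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from peak_lr toward min_lr instead of zero.
    return min_lr + (peak_lr - min_lr) * cosine

# At the end of training the rate bottoms out at min_lr, not zero.
assert abs(cosine_with_min_lr(1000, 1000, 2e-5) - 5e-6) < 1e-12
```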

---

### Logging & Checkpoints

- Logging: every 5 steps
- Checkpoint: every 450 steps
- Tracking: Weights & Biases
- Token accuracy logged during training

---

### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting: enabled
- Dataset preparation: enabled
- Language split: `en`

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain why the sky is blue."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:**
Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.

---

## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:
- helps monitor training stability
- is **not** a benchmark score
- does not measure reasoning quality
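
The diagnostic can be pictured as a simple ratio: of the tokens that are supervised (i.e. not masked out), what fraction did the model predict exactly? A minimal sketch with a hypothetical helper, mirroring the metric's usual definition:

```python
def token_accuracy(pred_ids, labels, ignore_index=-100):
    """Fraction of supervised tokens whose predicted id matches the label.

    Positions labeled ignore_index (the masked user/system tokens) are
    excluded. Illustrative sketch, not the trainer's exact implementation.
    """
    pairs = [(p, l) for p, l in zip(pred_ids, labels) if l != ignore_index]
    correct = sum(p == l for p, l in pairs)
    return correct / len(pairs)

# Toy example: 3 of 4 supervised tokens match (the -100 position is skipped).
assert token_accuracy([1, 0, 1, 0, 1], [1, 0, 1, 1, -100]) == 0.75
```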

For meaningful evaluation, use:
- instruction-following benchmarks
- reasoning datasets
- long-context tasks

---

## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical mistakes
- Performance near 16k tokens depends heavily on prompt structure
- Smaller model, so less world knowledge than large LLMs
- Not suitable for safety-critical deployment

---

## Bias & Safety

The model inherits biases from:
- the base model
- the training dataset

Recommended mitigations:
- moderation filtering
- safety-oriented system prompts
- human oversight in sensitive use cases

---

## License

This is a derivative model of:

`HuggingFaceTB/SmolLM3-3B`

The original base model license and restrictions apply, along with the dataset's terms.

Verify compatibility before commercial use.

---

## Reproducibility (Training Arguments)

```shell
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split en \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384
```

---

## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries