---
language:
- en
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- fsdp
- transformers
- liger-kernel
- english
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
pipeline_tag: text-generation
---
# SmolLM3-3B — English Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, trained on the **English (`en`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.
This fine-tune aims to improve:

- English instruction following
- multi-step reasoning
- long-context chat behavior

The dataset was converted into structured chat conversations and trained with **completion-only loss**, meaning only the assistant's responses contributed to the training objective.
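Completion-only loss is typically implemented by replacing the label of every non-assistant token with `-100`, the index that PyTorch's cross-entropy loss ignores. A minimal sketch of the idea (illustrative only, not the trainer's actual internals):

```python
# Completion-only loss masking, sketched: labels for system/user tokens
# are set to IGNORE_INDEX so they produce no gradient; assistant tokens
# keep their ids and drive the training objective.
IGNORE_INDEX = -100

def mask_labels(token_ids, roles):
    """roles[i] names the role that produced token_ids[i]."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

tokens = [101, 7592, 2088, 102, 2023, 2003, 1037, 3437]
roles = ["system", "user", "user", "user",
         "assistant", "assistant", "assistant", "assistant"]

labels = mask_labels(tokens, roles)
# Only the four assistant tokens keep their ids; the rest become -100.
```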
|
### Key Characteristics

- Base model: SmolLM3-3B
- Language: English specialization
- Context length during training: **16,384 tokens**
- Chat-formatted conversations
- Packed sequences
- Long-context reasoning tuning

---
|
## Intended Uses

### Suitable
- Conversational assistants
- Instruction-following agents
- Reasoning tasks
- Educational tutoring
- Long-document Q&A
- Research on small long-context LLMs

### Not Suitable
- Medical or legal advice
- Autonomous decision making
- Safety-critical systems
- Financial decision automation

---
|
## Training Data

Dataset: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **English only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only loss masking (`completion_only_loss=True`)

User and system prompts were masked during training; only assistant tokens produced gradients.

Please consult the dataset card for data provenance and limitations.

---
|
## Training Procedure

Training used **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.

### Core Setup

- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Max sequence length: **16,384**
- Packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
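Packing concatenates short tokenized examples into fixed-length rows so that little compute is wasted on padding at a 16,384-token context. A simplified greedy sketch of the idea (the actual packing logic in the training stack may differ, e.g. in how it handles example boundaries and attention masks):

```python
# Greedy sequence packing, sketched: fill a buffer with concatenated
# examples until adding the next one would exceed max_len, then start
# a new row.
MAX_LEN = 16  # 16,384 in the actual run; tiny here for illustration

def pack(examples, max_len=MAX_LEN):
    """Concatenate tokenized examples into rows of at most max_len tokens."""
    packed, buf = [], []
    for ex in examples:
        if buf and len(buf) + len(ex) > max_len:
            packed.append(buf)
            buf = []
        buf.extend(ex)
    if buf:
        packed.append(buf)
    return packed
```

Note that a single example longer than `max_len` would still overflow this naive version; real pipelines truncate examples to the maximum length first.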
---

### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation: 4
- Effective batch size per GPU: 16 sequences/step (global: 8 processes × 4 × 4 = 128)
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum learning rate: 5e-6
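The `cosine_with_min_lr` schedule warms up linearly, then decays along a cosine curve that bottoms out at the minimum learning rate instead of zero. A sketch with the reported settings; the peak learning rate is not stated in this card, so it is a free parameter here:

```python
import math

# Cosine-with-minimum-LR schedule, sketched: linear warmup over the
# first warmup_ratio fraction of steps, then cosine decay from peak_lr
# down to min_lr (never below it).
def lr_at(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine
```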
---

### Logging & Checkpoints

- Logging: every 5 steps
- Checkpointing: every 450 steps
- Tracking: Weights & Biases
- Token accuracy logged during training

---
### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting: enabled
- Dataset preparation: enabled
- Language split: `en`
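A hypothetical sketch of these steps: filter records to English, then convert flat prompt/response pairs into chat-format messages. The field names (`language`, `prompt`, `response`) are assumptions for illustration, not the dataset's documented schema:

```python
# Hypothetical preprocessing sketch: keep only English rows and wrap
# each prompt/response pair as a two-turn chat conversation.
def to_chat(example):
    """Convert a flat prompt/response record into chat-format messages."""
    return {"messages": [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]}

rows = [
    {"language": "en", "prompt": "Why is the sky blue?", "response": "Rayleigh scattering."},
    {"language": "de", "prompt": "Warum ist der Himmel blau?", "response": "Rayleigh-Streuung."},
]
chats = [to_chat(r) for r in rows if r["language"] == "en"]
```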
---
## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain why the sky is blue."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.
|
---

## Evaluation

During training, **token accuracy** was logged as a diagnostic metric. Token accuracy:

- helps monitor training stability
- is **not** a benchmark score
- does not measure reasoning quality
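As a rough illustration, token accuracy of this kind is simply the fraction of unmasked label positions where the argmax prediction matches the label (a sketch, not the trainer's exact implementation):

```python
# Token accuracy, sketched: compare predicted token ids to labels,
# skipping masked positions (user/system tokens labeled -100).
IGNORE_INDEX = -100

def token_accuracy(pred_ids, labels):
    """Fraction of unmasked positions where the predicted id matches the label."""
    pairs = [(p, l) for p, l in zip(pred_ids, labels) if l != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)
```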
For meaningful evaluation, use:

- instruction-following benchmarks
- reasoning datasets
- long-context tasks

---
## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical mistakes
- Performance near 16k tokens depends heavily on prompt structure
- Smaller model, so less world knowledge than larger LLMs
- Not suitable for safety-critical deployment

---
## Bias & Safety

The model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- moderation filtering
- safety-oriented system prompts
- human oversight in sensitive use cases

---
## License

This is a derivative model of `HuggingFaceTB/SmolLM3-3B`. The base model's license and restrictions apply, along with the dataset's terms. Verify compatibility before commercial use.

---
## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split en \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384
```

---
|
## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries
|