---
language:
- en
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- fsdp
- transformers
- liger-kernel
- english
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
pipeline_tag: text-generation
---

# SmolLM3-3B — English Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, trained on the **English (`en`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.

The purpose of this fine-tune is to improve:

- English instruction following
- multi-step reasoning
- long-context chat behavior

The dataset was converted into structured chat conversations, and the model was trained with a **completion-only loss**, meaning only the assistant's responses contributed to the training objective.

### Key Characteristics

- Base model: SmolLM3-3B
- Language: English specialization
- Context length during training: **16,384 tokens**
- Chat-formatted conversations
- Packed sequences
- Long-context reasoning tuning

---

## Intended Uses

### Suitable

- Conversational assistants
- Instruction-following agents
- Reasoning tasks
- Educational tutoring
- Long-document Q&A
- Research on small long-context LLMs

### Not Suitable

- Medical or legal advice
- Autonomous decision making
- Safety-critical systems
- Financial decision automation

---

## Training Data

Dataset: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **English only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only loss masking (`completion_only_loss=True`)

User and system prompts were masked during training; only assistant tokens produced gradients.

Please consult the dataset card for data provenance and limitations.

---

## Training Procedure

Training used **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes. The subsections below list the exact configuration.
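For orientation only, here is a minimal sketch of how an equivalent run could be expressed with TRL's `SFTTrainer`. This is **not** the actual `sft/sft_trainer.py` script: the argument names assume a recent TRL/Transformers release (`completion_only_loss`, `max_length`, `use_liger_kernel`), the `output_dir`, dataset split, and data-loading details are placeholders, and the peak learning rate is not stated in this card.

```python
# Hedged sketch of an SFT run mirroring the hyperparameters listed in this card.
# NOT the actual sft/sft_trainer.py; placeholder names are flagged in the comments.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The real script handles --lang_split en and --prepare_messages itself; here we
# assume the data has already been filtered to English and converted to
# {"messages": [...]} chat records (the split name is an assumption).
train_dataset = load_dataset("DGurgurov/Nemotron-Multilingual-Reasoning", split="train")

config = SFTConfig(
    output_dir="smollm3-3b-en-reasoning-sft",  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # 16 sequences per GPU per optimizer step
    max_length=16384,                     # training context length
    packing=True,
    completion_only_loss=True,            # loss on assistant tokens only
    bf16=True,
    gradient_checkpointing=True,
    use_liger_kernel=True,
    optim="adamw_torch_fused",
    weight_decay=0.05,
    # peak learning rate is not stated in this card; the library default would apply
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 5e-6},
    warmup_ratio=0.05,
    dataset_num_proc=16,                  # dataset preprocessing workers
    logging_steps=5,
    save_steps=450,
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

The actual run was launched with `accelerate launch --use_fsdp --num_processes 8` (see the Reproducibility section below).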
### Core Setup

- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Max sequence length: **16,384**
- Packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP

---

### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size per GPU: 16 sequences per optimizer step
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum learning rate: 5e-6

---

### Logging & Checkpoints

- Logging: every 5 steps
- Checkpoints: every 450 steps
- Tracking: Weights & Biases
- Token accuracy logged during training

---

### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting: enabled
- Dataset preparation: enabled
- Language split: `en`

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain why the sky is blue."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.

---

## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:

- helps monitor training stability
- is **not** a benchmark score
- does not measure reasoning quality

For meaningful evaluation, use:

- instruction-following benchmarks
- reasoning datasets
- long-context tasks

---

## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical mistakes
- Performance near 16k tokens depends heavily on prompt structure
- Smaller model → less world knowledge than large LLMs
- Not suitable for safety-critical deployment

---

## Bias & Safety

The model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- moderation filtering
- safety-oriented system prompts
- human oversight in sensitive use cases

---

## License

This is a derivative model of `HuggingFaceTB/SmolLM3-3B`.

The original base model license and restrictions apply, along with the dataset terms. Verify compatibility before commercial use.

---

## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split en \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384
```

---

## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries