---
language:
- es
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- spanish
- fsdp
- transformers
- liger-kernel
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
- token_accuracy
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

## Model Description

This model is a **supervised fine-tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, fine-tuned on the **Spanish (`es`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.

The goal of this training run was to improve:

- Spanish instruction following
- multi-step reasoning
- conversational behavior
- long-context understanding

Training used structured chat conversations with a **completion-only loss**, meaning only the assistant responses were optimized.

### Key Characteristics

- Base model: SmolLM3-3B
- Language specialization: Spanish
- Context length during training: **16,384 tokens**
- Chat-format training
- Packed sequences
- Long-context reasoning tuning

---

## Intended Uses

### Suitable

- Spanish conversational assistants
- tutoring or educational assistants
- reasoning and explanation tasks
- document question answering
- research on efficient small LLMs

### Not Suitable

- legal or medical advice
- autonomous decision-making
- safety-critical systems
- high-risk financial use

---

## Training Data

Dataset: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **Spanish only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)

User and system messages were masked during training. Consult the dataset card for data sources and limitations.

---

## Training Procedure

Training was performed using **Hugging Face Accelerate with Fully Sharded Data Parallel (FSDP)** across 8 processes.
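The completion-only loss described under Training Data works by masking the labels of every non-assistant token so that cross-entropy ignores them. A minimal, library-agnostic sketch of that idea (the function name and the per-token `assistant_mask` flag are illustrative, not the trainer's actual API):

```python
IGNORE_INDEX = -100  # label value that PyTorch-style cross-entropy skips

def mask_non_assistant(token_ids, assistant_mask):
    """Copy token IDs into labels, replacing every token that is NOT part
    of an assistant response with IGNORE_INDEX so it adds no loss."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(token_ids, assistant_mask)]

# Toy example: the first two tokens belong to system/user turns.
tokens = [10, 11, 12, 13, 14]
mask = [False, False, True, True, True]
labels = mask_non_assistant(tokens, mask)
# labels == [-100, -100, 12, 13, 14]
```

In a real TRL/Transformers pipeline this masking is handled for you when `completion_only_loss=True`; the sketch only shows the effect on the label sequence.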
### Core Setup

- Method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384 tokens**
- Sequence packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP

---

### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size per GPU: 16 sequences per optimizer step (128 globally across 8 GPUs)
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6

---

### Logging & Checkpoints

- Logging every 5 steps
- Checkpoint every 450 steps
- Weights & Biases tracking
- Token accuracy logged during training

---

### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting enabled
- Dataset preparation enabled
- Language split: `es`

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.

---

## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.
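Token accuracy here means the fraction of next-token predictions (under teacher forcing) that match the reference labels, skipping masked positions. A minimal sketch of the computation, assuming argmax-decoded predictions and `-100`-masked labels (the function name is illustrative, not the trainer's API):

```python
def token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of scored positions where the predicted token ID
    equals the label; positions labeled ignore_index are skipped."""
    scored = [(p, l) for p, l in zip(predicted_ids, label_ids)
              if l != ignore_index]
    if not scored:
        return 0.0
    return sum(p == l for p, l in scored) / len(scored)

# Toy example: 3 scored positions, 2 correct, 1 masked position skipped.
acc = token_accuracy([5, 7, 9, 4], [-100, 7, 9, 8])
# acc == 2/3
```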
Token accuracy:

- monitors training stability
- is **not** a benchmark
- does not measure reasoning ability

For meaningful evaluation, use:

- instruction-following benchmarks
- reasoning datasets
- long-context tasks

---

## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical errors
- Performance near 16k tokens depends heavily on prompt structure
- Smaller model → weaker world knowledge than larger LLMs
- Not suitable for safety-critical deployment

---

## Bias & Safety

The model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- moderation filtering
- safety-oriented system prompts
- human review for sensitive applications

---

## License

This is a derivative model of `HuggingFaceTB/SmolLM3-3B`. The original base model license and restrictions apply, along with the dataset terms. Verify compatibility before commercial use.

---

## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 \
    --config_file sft/my_config.yaml sft/sft_trainer.py \
    --model_name HuggingFaceTB/SmolLM3-3B \
    --tokenizer_name HuggingFaceTB/SmolLM3-3B \
    --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
    --skip_prepare_dataset False \
    --lang_split es \
    --prepare_messages True \
    --completion_only_loss True \
    --max_length 16384 \
    --dataset_num_proc 16 \
    --packing True \
    --use_liger_kernel True \
    --bf16 True \
    --log_token_accuracy True \
    --optim adamw_torch_fused \
    --gradient_checkpointing True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_find_unused_parameters False \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --report_to wandb \
    --run_name smol_3b_3epochs_lns_es \
    --num_train_epochs 3 \
    --save_strategy steps \
    --logging_steps 5 \
    --save_steps 450
```

---

## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries