---
language:
- es
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- long-context
- spanish
- fsdp
- transformers
- liger-kernel
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
- token_accuracy
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)
## Model Description

This model is a **supervised fine-tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`, trained on the **Spanish (`es`) split** of `DGurgurov/Nemotron-Multilingual-Reasoning`.

The goal of this training run was to improve:

- Spanish instruction following
- multi-step reasoning
- conversational behavior
- long-context understanding
Training used structured chat conversations with **completion-only loss**, meaning the loss was computed only on assistant responses.
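Completion-only loss can be illustrated with a small sketch (a hypothetical helper, not the trainer's actual code): label positions belonging to system and user turns are replaced with the ignore index `-100`, so cross-entropy skips them and only assistant tokens contribute gradient.

```python
# Illustrative sketch of completion-only loss masking (not the actual trainer
# implementation): non-assistant label positions are set to -100 so that
# cross-entropy loss ignores them and only assistant tokens are optimized.
IGNORE_INDEX = -100

def mask_non_assistant_labels(token_ids, roles):
    """token_ids: flat list of token ids; roles: parallel list giving the
    role ('system'/'user'/'assistant') that produced each token."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

tokens = [11, 12, 13, 14, 15, 16]
roles = ["system", "user", "user", "assistant", "assistant", "assistant"]
labels = mask_non_assistant_labels(tokens, roles)
print(labels)  # [-100, -100, -100, 14, 15, 16]
```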
### Key Characteristics

- Base model: SmolLM3-3B
- Language specialization: Spanish
- Context length during training: **16,384 tokens**
- Chat-format training
- Packed sequences
- Long-context reasoning tuning

---
## Intended Uses

### Suitable

- Spanish conversational assistants
- tutoring or educational assistants
- reasoning and explanation tasks
- document question answering
- research on efficient small LLMs

### Not Suitable

- legal or medical advice
- autonomous decision making
- safety-critical systems
- high-risk financial use

---
## Training Data

Dataset: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filter: **Spanish only**
- Converted to chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)

User and system messages were masked out of the loss during training.

Consult the dataset card for data sources and limitations.

---
## Training Procedure

Training was performed with **Hugging Face Accelerate using Fully Sharded Data Parallel (FSDP)** across 8 processes.

### Core Setup

- Method: supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384 tokens**
- Sequence packing: enabled
- Precision: **bfloat16**
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
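The sequence-packing setting above concatenates tokenized examples and slices them into fixed-length blocks, so no compute is wasted on padding. A minimal sketch of the idea (the real packing is handled inside the trainer):

```python
# Illustrative sketch of sequence packing: tokenized examples are concatenated
# into one token stream and cut into fixed-length blocks, so every training
# sequence is exactly max_length tokens with no padding.
def pack_sequences(examples, max_length):
    """examples: list of token-id lists. Returns only full blocks; the
    trailing remainder shorter than max_length is dropped."""
    stream = [tok for ex in examples for tok in ex]  # concatenate everything
    return [
        stream[i : i + max_length]
        for i in range(0, len(stream) - max_length + 1, max_length)
    ]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(examples, max_length=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```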
---
### Optimization

- Optimizer: `adamw_torch_fused`
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size: 16 sequences per device per optimizer step (128 globally across 8 processes)
- Weight decay: 0.05
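The effective batch size follows directly from these settings; a quick check of the arithmetic:

```python
# Effective batch size arithmetic, using the values from this card:
# per-device batch x gradient-accumulation steps x number of processes.
per_device_batch = 4
grad_accum_steps = 4
num_processes = 8

per_device_effective = per_device_batch * grad_accum_steps
global_effective = per_device_effective * num_processes

print(per_device_effective)  # 16 sequences per device per optimizer step
print(global_effective)      # 128 sequences globally per optimizer step
```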
Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6
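A sketch of how this schedule behaves, assuming the standard shape of the `cosine_with_min_lr` scheduler in `transformers`: linear warmup over the first 5% of steps, then cosine decay from the peak learning rate down to `min_lr` instead of zero. The peak learning rate below is a placeholder, since the card does not state it.

```python
import math

# Assumed shape of a cosine-with-min-lr schedule (illustrative sketch only):
# linear warmup to peak_lr, then cosine decay that bottoms out at min_lr.
# NOTE: peak_lr is a placeholder value; the card does not report the peak LR.
def lr_at_step(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

total = 1000
print(lr_at_step(50, total, peak_lr=2e-5))    # peak LR at end of warmup
print(lr_at_step(total, total, peak_lr=2e-5)) # decays to min_lr, not zero
```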
---
### Logging & Checkpoints

- Logging every 5 steps
- Checkpoint every 450 steps
- Weights & Biases tracking
- Token accuracy logged during training

---
### Data Processing

- Dataset preprocessing workers: 16
- Chat formatting enabled
- Dataset preparation enabled
- Language split: `es`

---
## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.

---
## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:

- monitors training stability
- is **not** a benchmark
- does not measure reasoning ability

For meaningful evaluation, use:

- instruction-following benchmarks
- reasoning datasets
- long-context tasks
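The token-accuracy diagnostic above can be sketched as the fraction of unmasked label positions where the model's argmax prediction matches the reference token (illustrative only, not the trainer's exact implementation):

```python
# Illustrative token-accuracy computation: compare argmax predictions against
# labels, skipping masked positions (-100), and report the match fraction.
IGNORE_INDEX = -100

def token_accuracy(predictions, labels):
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)

preds = [7, 2, 9, 4]
labels = [-100, 2, 3, 4]  # first position is a masked prompt token
print(token_accuracy(preds, labels))  # 2 of 3 unmasked tokens match -> 0.666...
```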
---
## Limitations

- May hallucinate incorrect information
- Reasoning chains may contain logical errors
- Performance near 16k tokens depends heavily on prompt structure
- As a smaller model, it has weaker world knowledge than larger LLMs
- Not suitable for safety-critical deployment

---
## Bias & Safety

The model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- moderation filtering
- safety-oriented system prompts
- human review for sensitive applications

---
## License

This is a derivative model of `HuggingFaceTB/SmolLM3-3B`. The base model's license and restrictions apply, along with the dataset's terms of use.

Verify license compatibility before commercial use.

---
## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py

--model_name HuggingFaceTB/SmolLM3-3B
--tokenizer_name HuggingFaceTB/SmolLM3-3B
--dataset_path DGurgurov/Nemotron-Multilingual-Reasoning
--skip_prepare_dataset False
--lang_split es
--prepare_messages True
--completion_only_loss True
--max_length 16384
--dataset_num_proc 16
--packing True
--use_liger_kernel True
--bf16 True
--log_token_accuracy True
--optim adamw_torch_fused
--gradient_checkpointing True
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--ddp_find_unused_parameters False
--lr_scheduler_type cosine_with_min_lr
--lr_scheduler_kwargs {"min_lr": 5.0e-6}
--warmup_ratio 0.05
--weight_decay 0.05
--report_to wandb
--run_name smol_3b_3epochs_lns_es
--num_train_epochs 3
--save_strategy steps
--logging_steps 5
--save_steps 450
```

---
## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---

## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries