SmolLM3-3B — German Reasoning Instruction SFT (Nemotron Multilingual Reasoning)

Model Description

This model is a Supervised Fine-Tuned (SFT) version of:

HuggingFaceTB/SmolLM3-3B

It was fine-tuned on the German (de) split of the dataset:

DGurgurov/Nemotron-Multilingual-Reasoning

The goal of the training was to improve:

  • German instruction following
  • Step-by-step reasoning
  • Long-context conversation behavior

The model was trained on chat-formatted conversations with completion-only loss, meaning that only assistant responses contributed to the training loss.

Key properties:

  • Base model: SmolLM3-3B
  • Language specialization: German
  • Context length during training: 16,384 tokens
  • Chat formatted dataset
  • Long-context packing enabled

Intended Uses

Suitable For

  • German conversational assistants
  • Educational tutoring
  • Reasoning and structured explanation tasks
  • Long-document Q&A in German
  • Research experiments with long-context small LLMs

Not Suitable For

  • Medical or legal advice without human review
  • Autonomous decision-making
  • Safety-critical systems
  • High-stakes financial decisions

Training Data

Dataset used:

DGurgurov/Nemotron-Multilingual-Reasoning

Processing configuration:

  • Language filtering: German only
  • Converted into chat messages (prepare_messages=True)
  • Assistant-only optimization (completion_only_loss=True)

Only the assistant responses were used to compute loss; user and system messages were masked.
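As an illustration, completion-only loss is typically implemented by setting the labels of all non-assistant tokens to the ignore index (-100), so cross-entropy skips them. This is a minimal sketch of the idea, not the exact TRL implementation:

```python
# Completion-only loss masking (sketch): positions whose label is -100
# are ignored by PyTorch's cross-entropy loss.
IGNORE_INDEX = -100

def mask_non_assistant(token_ids, assistant_mask):
    """Return labels in which only assistant tokens contribute to the loss.

    token_ids: list[int] -- the full chat-formatted sequence
    assistant_mask: list[bool] -- True where the token belongs to an
        assistant response, False for user/system tokens
    """
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]

# Toy example: the first three tokens are a user turn, the last three
# an assistant turn; only the assistant tokens keep their labels.
labels = mask_non_assistant(
    [11, 12, 13, 21, 22, 23],
    [False, False, False, True, True, True],
)
# labels == [-100, -100, -100, 21, 22, 23]
```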

Please review the dataset card for provenance and limitations.


Training Procedure

Training was performed using HuggingFace Accelerate with FSDP (Fully Sharded Data Parallel) across 8 processes.

Core Setup

  • Training method: Supervised fine-tuning (SFT)
  • Epochs: 3
  • Maximum sequence length: 16,384
  • Sequence packing: enabled
  • Precision: bfloat16
  • Kernel optimization: Liger kernel enabled
  • Gradient checkpointing: enabled
  • Distributed: FSDP (8 processes)

Optimization

  • Optimizer: adamw_torch_fused
  • Per-device batch size: 4
  • Gradient accumulation: 4
  • Effective batch size (per GPU): 16 sequences per step
  • Weight decay: 0.05
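
The effective batch size follows directly from the values above; with 8 FSDP processes, the global batch works out to 128 sequences per optimizer step:

```python
# Derivation of the effective batch size from the training configuration.
per_device_batch = 4   # --per_device_train_batch_size
grad_accum_steps = 4   # --gradient_accumulation_steps
num_processes = 8      # FSDP world size

per_gpu_effective = per_device_batch * grad_accum_steps  # 16 sequences
global_effective = per_gpu_effective * num_processes     # 128 sequences
```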

Learning rate schedule:

  • Scheduler: cosine_with_min_lr
  • Warmup ratio: 0.05
  • Minimum LR: 5e-6
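
Conceptually, this schedule warms up linearly and then decays along a cosine curve that bottoms out at the minimum LR instead of zero. The sketch below mirrors that behavior; the exact HuggingFace `cosine_with_min_lr` formula may differ in detail, and the peak LR (3e-4 here) is a hypothetical value not stated in this card:

```python
import math

def lr_at_step(step, total_steps, peak_lr, min_lr, warmup_ratio=0.05):
    """Cosine schedule with linear warmup and a floor at min_lr (sketch)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

lr_start = lr_at_step(0, 1000, 3e-4, 5e-6)     # 0.0 (warmup begins at zero)
lr_peak = lr_at_step(50, 1000, 3e-4, 5e-6)     # ~3e-4 (end of warmup)
lr_end = lr_at_step(1000, 1000, 3e-4, 5e-6)    # ~5e-6 (floors at min LR)
```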

Logging & Checkpoints

  • Logging every 5 steps
  • Checkpoint every 450 steps
  • Weights & Biases tracking enabled
  • Token accuracy logged during training

Data Processing

  • Dataset workers: 16
  • Dataset preparation: enabled
  • Chat message preparation: enabled
  • German split: enabled
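
The filtering-and-conversion step above can be sketched as follows. The field names (`language`, `prompt`, `response`) are hypothetical; consult the dataset card for the actual schema:

```python
# Sketch of the preprocessing pipeline: keep only German rows and convert
# each into a chat-message list suitable for apply_chat_template().
def prepare_german_messages(records):
    """Filter records to German and emit user/assistant message pairs."""
    out = []
    for rec in records:
        if rec.get("language") != "de":
            continue
        out.append([
            {"role": "user", "content": rec["prompt"]},
            {"role": "assistant", "content": rec["response"]},
        ])
    return out

rows = [
    {"language": "de", "prompt": "Warum ist der Himmel blau?",
     "response": "Wegen der Rayleigh-Streuung ..."},
    {"language": "en", "prompt": "Why is the sky blue?",
     "response": "Because of Rayleigh scattering ..."},
]
messages = prepare_german_messages(rows)  # keeps only the German row
```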

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "toroe/SmolLM-3B-Science-DE"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Warum ist der Himmel blau?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Important:
You should use apply_chat_template() when prompting. The model was trained on chat-formatted conversations and performance will degrade without it.


Evaluation

During training, token accuracy was logged as a diagnostic metric.

Token accuracy:

  • is useful for monitoring training stability
  • is NOT a benchmark score
  • does not represent real reasoning performance
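
The logged metric amounts to the fraction of unmasked label positions where the argmax prediction matches the label. A minimal sketch:

```python
# Token accuracy as a training diagnostic: matches / unmasked positions.
IGNORE_INDEX = -100

def token_accuracy(predictions, labels):
    """predictions, labels: flat lists of token ids; positions where
    label == IGNORE_INDEX (masked user/system tokens) are skipped."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)

acc = token_accuracy([5, 7, 9, 2], [-100, 7, 9, 3])  # 2 of 3 unmasked match
```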

For proper evaluation, use:

  • German instruction-following benchmarks
  • reasoning datasets
  • long-context evaluation tasks

Limitations

  • May hallucinate facts
  • Reasoning chains can still contain logical errors
  • Performance near 16k context depends heavily on prompt structure
  • Improvements mainly apply to German
  • Smaller model size means weaker world knowledge than large LLMs
  • Not aligned for safety-critical deployment

Bias & Safety

This model inherits biases from:

  • the base model
  • the training dataset

Recommended mitigations:

  • add moderation filters
  • use system prompts enforcing safe behavior
  • include human review for sensitive deployments
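
One lightweight mitigation is a safety-oriented system prompt prepended to every conversation. The German prompt text below is a hypothetical example, not a prompt shipped with the model:

```python
# Example of the "system prompts enforcing safe behavior" mitigation:
# prepend a safety instruction to every conversation before templating.
SAFETY_SYSTEM_PROMPT = (
    "Du bist ein hilfreicher Assistent. Verweigere Anfragen nach "
    "medizinischer, rechtlicher oder finanzieller Beratung und verweise "
    "stattdessen auf Fachpersonal."
)

def build_messages(user_input):
    """Wrap a user turn with the safety system prompt."""
    return [
        {"role": "system", "content": SAFETY_SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("Welche Medikamente soll ich nehmen?")
```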

License

This model is a derivative of:

HuggingFaceTB/SmolLM3-3B

Therefore, the original base model license and usage restrictions apply, along with any dataset terms.

Verify compatibility before commercial deployment.


Reproducibility (Training Arguments)

accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
    --model_name HuggingFaceTB/SmolLM3-3B \
    --tokenizer_name HuggingFaceTB/SmolLM3-3B \
    --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
    --skip_prepare_dataset False \
    --lang_split de \
    --prepare_messages True \
    --completion_only_loss True \
    --max_length 16384 \
    --dataset_num_proc 16 \
    --packing True \
    --use_liger_kernel True \
    --bf16 True \
    --log_token_accuracy True \
    --optim adamw_torch_fused \
    --gradient_checkpointing True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_find_unused_parameters False \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --report_to wandb \
    --run_name smol_3b_3epochs_lns_de \
    --num_train_epochs 3 \
    --save_strategy steps \
    --logging_steps 5 \
    --save_steps 450

Citation

If you use this model, please cite:

  • HuggingFaceTB/SmolLM3-3B
  • DGurgurov/Nemotron-Multilingual-Reasoning

Acknowledgements

  • HuggingFaceTB — SmolLM3 base model
  • Nemotron Multilingual Reasoning dataset authors
  • HuggingFace Accelerate and Transformers libraries