# SmolLM3-3B-SFT-FR
SmolLM3-3B fine-tuned on the French split of ReasonXL via supervised fine-tuning (SFT), targeting in-language reasoning adaptation.
## Model Description
This model is the result of Stage 1 of a two-stage reasoning adaptation pipeline. It is trained to shift the base model's reasoning language from English to French by exposing it to a large corpus of French-language reasoning traces spanning math, science, code, and general domains.
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Training stage | SFT (Stage 1 only) |
| Target language | French (fr) |
| Training data | ReasonXL — FR split |
| Training tokens | ~8.8B |
| Avg. sequence length | 3,872 tokens |
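The reported token count is consistent with the corpus size and average sequence length, as a quick sanity check shows:

```python
# Sanity check: training tokens ≈ samples × average sequence length.
# Numbers are taken from this card's tables.
samples = 2_282_204      # total ReasonXL-FR samples
avg_seq_len = 3_872      # average tokens per sample
total_tokens = samples * avg_seq_len
print(f"{total_tokens / 1e9:.2f}B tokens")  # 8.84B, matching the ~8.8B reported
```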
## Training Data
The model is trained on the French split of ReasonXL, a multilingual cross-domain reasoning corpus of 2M samples per language (9B tokens). English source samples were translated with Qwen3-32B, using a dedicated system prompt that preserves technical terminology, mathematical notation, and reasoning structure.
Each sample consists of three independently translated components: the user input, the model's reasoning trace (within <think> tags), and the final output.
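Given the `<think>` convention above, an assistant turn can be split back into its reasoning trace and final output. A minimal sketch (the helper name and regex are illustrative, not part of the dataset tooling):

```python
import re

# Split an assistant turn into (reasoning trace, final output),
# assuming the <think>...</think> convention used in ReasonXL samples.
def split_reasoning(assistant_text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", assistant_text, flags=re.DOTALL)
    if match is None:
        # No trace present: the whole turn is the final output.
        return "", assistant_text.strip()
    reasoning = match.group(1).strip()
    output = assistant_text[match.end():].strip()
    return reasoning, output

trace, answer = split_reasoning("<think>17 × 24 = 408</think>La réponse est 408.")
# trace  -> "17 × 24 = 408"
# answer -> "La réponse est 408."
```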
The English source data was annotated and filtered using Propella-1-4B across 18 properties (safety, quality, information density, educational value, domain), followed by class-aware downsampling for domain balance.
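Class-aware downsampling can be sketched as capping each domain at a fixed budget; this is an illustrative implementation, not the authors' exact procedure:

```python
import random
from collections import defaultdict

# Illustrative class-aware downsampling: cap each domain at a fixed
# per-domain budget so no single domain dominates the mixture.
def downsample_by_domain(samples, cap_per_domain, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["domain"]].append(s)
    balanced = []
    for domain, group in by_domain.items():
        if len(group) > cap_per_domain:
            group = rng.sample(group, cap_per_domain)
        balanced.extend(group)
    return balanced

data = [{"domain": "math"}] * 10 + [{"domain": "code"}] * 3
print(len(downsample_by_domain(data, cap_per_domain=5)))  # 8 (5 math + 3 code)
```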
Source datasets included in ReasonXL:
| Dataset | Config | Samples |
|---|---|---|
| Cascade-SFT-Stage-2 | general / math | 768,615 |
| Dolci-Think-SFT-7B | science | 347,453 |
| Cascade-SFT-Stage-1 | general / code / math / science | 711,812 |
| Llama-Nemotron-PTD | science | 267,147 |
| Nemotron-Science-v1 | — | 97,026 |
| Nemotron-IF-Chat-v1 | — | 91,151 |
| **Total** | | **2,282,204** |
## Training Details
Training uses a completion-only loss: only the assistant's reasoning trace and final output contribute to the objective, while user and system tokens are masked. Sequences are chat-formatted and packed to 16,384 tokens.
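The masking can be sketched with the standard convention of setting masked labels to −100, which PyTorch's cross-entropy loss ignores (the helper below is illustrative):

```python
# Completion-only label masking: user/system tokens get label -100
# (PyTorch's CrossEntropyLoss ignore_index), so only assistant tokens
# contribute to the training objective.
IGNORE_INDEX = -100

def mask_labels(token_ids, is_assistant):
    """is_assistant[i] is True iff token i belongs to an assistant turn."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(token_ids, is_assistant)]

labels = mask_labels([101, 7, 8, 9, 102], [False, False, True, True, False])
print(labels)  # [-100, -100, 8, 9, -100]
```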
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Max sequence length | 16,384 |
| Packing | Enabled |
| Precision | bfloat16 |
| Optimizer | adamw_torch_fused |
| Per-device batch size | 4 |
| Gradient accumulation | 4 steps |
| Weight decay | 0.05 |
| LR scheduler | Cosine with min LR |
| Min LR | 5×10⁻⁶ |
| Warmup ratio | 0.05 |
| Distributed strategy | FSDP (8 GPUs) |
| Gradient checkpointing | Enabled |
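The "cosine with min LR" schedule decays the learning rate along a cosine curve to a floor of 5×10⁻⁶ after a linear warmup over the first 5% of steps. A minimal sketch (the peak LR here is illustrative; the card does not state it):

```python
import math

# Cosine decay to a minimum LR, with linear warmup.
# peak_lr is an assumed value for illustration; min_lr and warmup_ratio
# match the hyperparameter table above.
def lr_at(step, total_steps, peak_lr, min_lr=5e-6, warmup_ratio=0.05):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)  # linear warmup
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

# At the final step, the LR has decayed to the floor:
print(lr_at(1000, 1000, peak_lr=2e-5))  # 5e-06
```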
## Intended Use
This is a research checkpoint demonstrating that rewiring a model's reasoning language via SFT is feasible at the 3B scale. It is intended as a starting point for Stage 2 RL fine-tuning (see DGurgurov/SmolLM3-3B-SFT-GRPO-FR), and may exhibit reduced reasoning quality compared to the base model on non-French benchmarks.
## Citation
```bibtex
@misc{reasonxl2026,
  title = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus},
  author = {Daniil Gurgurov and Tom Röhr},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}}
}
```
Paper citation will be added upon publication.