---
language:
- de
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- german
- multilingual
- long-context
- fsdp
- transformers
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
- token_accuracy
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM3-3B — German Reasoning Instruction SFT (Nemotron Multilingual Reasoning)
|
## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`.

It was fine-tuned on the **German (`de`) split** of the dataset `DGurgurov/Nemotron-Multilingual-Reasoning`.

The goal of training was to improve:

- German instruction following
- Step-by-step reasoning
- Long-context conversation behavior

The model was trained on chat-formatted conversations with **completion-only loss**, meaning only assistant responses contributed to optimization.
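Completion-only loss amounts to label masking: every token outside an assistant turn is assigned the ignore index (-100 for PyTorch cross-entropy), so it contributes nothing to the loss. A minimal sketch with made-up token IDs and spans; the real trainer derives the assistant spans from the chat template:

```python
# Illustrative label masking for completion-only loss.
# Token IDs and span positions below are hypothetical.
def mask_non_assistant(labels, assistant_spans, ignore_index=-100):
    """Keep labels inside assistant spans; set everything else to
    ignore_index so cross-entropy skips those positions."""
    masked = [ignore_index] * len(labels)
    for start, end in assistant_spans:  # half-open [start, end) token ranges
        masked[start:end] = labels[start:end]
    return masked

token_labels = [10, 11, 12, 13, 14, 15, 16, 17]
# Suppose positions 3..5 hold the assistant's reply.
print(mask_non_assistant(token_labels, [(3, 6)]))
# [-100, -100, -100, 13, 14, 15, -100, -100]
```

Only the three assistant tokens survive; the system and user tokens are ignored by the loss.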
|
Key properties:

- Base model: SmolLM3-3B
- Language specialization: German
- Context length during training: **16,384 tokens**
- Chat-formatted dataset
- Long-context sequence packing enabled

---
|
## Intended Uses

### Suitable For

- German conversational assistants
- Educational tutoring
- Reasoning and structured-explanation tasks
- Long-document Q&A in German
- Research experiments with long-context small LLMs

### Not Suitable For

- Medical or legal advice without human review
- Autonomous decision-making
- Safety-critical systems
- High-stakes financial decisions

---
|
## Training Data

Dataset used: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filtering: **German only**
- Conversion into chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)

Only assistant responses were used to compute the loss; user and system messages were masked out.

Please review the dataset card for provenance and limitations.

---
|
## Training Procedure

Training was performed with **Hugging Face Accelerate and FSDP (Fully Sharded Data Parallel)** across 8 processes.

### Core Setup

- Training method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384**
- Sequence packing: enabled
- Precision: **bfloat16**
- Kernel optimization: Liger kernel enabled
- Gradient checkpointing: enabled
- Distributed: FSDP (8 processes)
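Sequence packing concatenates several short tokenized examples into one sequence up to the 16,384-token limit, so long-context training steps waste little compute on padding. A simplified greedy sketch; real packers also insert separator tokens and track example boundaries for attention masking:

```python
# Greedy sequence packing sketch (simplified illustration, not the
# trainer's actual packer).
def pack_sequences(token_lists, max_length):
    """Concatenate tokenized examples into bins of at most max_length tokens."""
    bins, current = [], []
    for tokens in token_lists:
        # Start a new bin when the next example would overflow this one.
        if current and len(current) + len(tokens) > max_length:
            bins.append(current)
            current = []
        current.extend(tokens)
    if current:
        bins.append(current)
    return bins

examples = [[1] * 6000, [2] * 7000, [3] * 5000]
packed = pack_sequences(examples, max_length=16384)
print([len(b) for b in packed])  # [13000, 5000]
```

The first two examples fit together under the 16,384-token budget; the third starts a new packed sequence.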
|
---
|
### Optimization

- Optimizer: `adamw_torch_fused`
- Per-device batch size: 4
- Gradient accumulation steps: 4
- Effective batch size: 16 sequences per GPU per optimizer step (4 × 4), i.e. 128 sequences globally across 8 processes
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6

---
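The `cosine_with_min_lr` schedule warms up linearly over the first 5% of steps, then decays along a cosine curve to the 5e-6 floor rather than to zero. A rough sketch of that shape; the peak LR and step counts here are illustrative, not values stated in this card:

```python
import math

# Sketch of cosine decay with linear warmup and a floor LR.
# peak_lr and total_steps below are assumed for illustration.
def lr_at_step(step, total_steps, peak_lr, min_lr, warmup_ratio=0.05):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at_step(0, 1000, 2e-5, 5e-6))     # 0.0 (start of warmup)
print(lr_at_step(1000, 1000, 2e-5, 5e-6))  # 5e-06 (the floor)
```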
|
### Logging & Checkpoints

- Logging every 5 steps
- Checkpointing every 450 steps
- Weights & Biases tracking enabled
- Token accuracy logged during training

---
|
### Data Processing

- Dataset workers: 16
- Dataset preparation: enabled
- Chat-message preparation: enabled
- German split: enabled

---
|
## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Warum ist der Himmel blau?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.
|
---
|
## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:

- is useful for monitoring training stability
- is **not** a benchmark score
- does not represent real reasoning performance
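Token accuracy is typically the fraction of unmasked label positions where the model's argmax prediction matches the reference token. A minimal sketch of that definition; the trainer's exact implementation may differ:

```python
# Sketch of next-token accuracy over unmasked labels.
def token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of positions where the predicted token ID equals the label,
    counting only positions whose label is not ignore_index."""
    correct = total = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_index:
            continue  # masked position (e.g. user/system tokens)
        total += 1
        correct += pred == label
    return correct / total if total else 0.0

preds = [5, 9, 2, 7, 3]
labels = [5, 9, 4, -100, 3]
print(token_accuracy(preds, labels))  # 0.75
```

Three of the four unmasked positions match, hence 0.75; the masked position is excluded entirely.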
|
For proper evaluation, use:

- German instruction-following benchmarks
- reasoning datasets
- long-context evaluation tasks

---
|
## Limitations

- May hallucinate facts
- Reasoning chains can still contain logical errors
- Performance near the 16k context limit depends heavily on prompt structure
- Improvements apply mainly to German
- The smaller model size means weaker world knowledge than large LLMs
- Not aligned for safety-critical deployment

---
|
## Bias & Safety

This model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- add moderation filters
- use system prompts that enforce safe behavior
- include human review for sensitive deployments

---
|
## License

This model is a derivative of `HuggingFaceTB/SmolLM3-3B`. The original base-model license and usage restrictions therefore apply, along with any dataset terms.

Verify license compatibility before commercial deployment.

---
|
## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split de \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384 \
  --dataset_num_proc 16 \
  --packing True \
  --use_liger_kernel True \
  --bf16 True \
  --log_token_accuracy True \
  --optim adamw_torch_fused \
  --gradient_checkpointing True \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --ddp_find_unused_parameters False \
  --lr_scheduler_type cosine_with_min_lr \
  --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
  --warmup_ratio 0.05 \
  --weight_decay 0.05 \
  --report_to wandb \
  --run_name smol_3b_3epochs_lns_de \
  --num_train_epochs 3 \
  --save_strategy steps \
  --logging_steps 5 \
  --save_steps 450
```

---
|
## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---
|
## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries