Qwen3-4B-Instruct-2507 — German Reasoning SFT (Nemotron Multilingual Reasoning)

Model description

This model is a supervised fine-tuned (SFT) version of Qwen/Qwen3-4B-Instruct-2507, trained on the German (de) split of DGurgurov/Nemotron-Multilingual-Reasoning.

The objective of this training run was to improve:

  • German instruction following
  • Step-by-step reasoning
  • Long-context conversational performance

Key characteristics:

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Tokenizer: Qwen/Qwen3-4B-Instruct-2507
  • Training data: DGurgurov/Nemotron-Multilingual-Reasoning (de)
  • Loss: completion-only loss (only assistant tokens are optimized)
  • Context length during training: 16,384 tokens
  • Chat-formatted data: yes (examples converted to chat messages during preprocessing)

Intended uses

Suitable for

  • German assistants and chatbots
  • German reasoning tasks (logic, math, structured explanations)
  • Long-context document QA in German
  • Instruction following

Not suitable for

  • Medical or legal advice without professional oversight
  • Safety-critical decisions
  • Autonomous decision-making systems

Training data

Dataset used:

DGurgurov/Nemotron-Multilingual-Reasoning

Configuration:

  • Language filter: German only (de)
  • Converted to chat messages (prepare_messages=True)
  • Loss masking: completion_only_loss=True

Only assistant responses contributed to training loss.
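
The masking idea can be illustrated with a minimal, framework-agnostic sketch (the actual masking is handled by the training framework; the token IDs and prompt boundary below are illustrative):

```python
# Sketch of completion-only loss masking: prompt tokens get a label of
# -100 (the index PyTorch's cross-entropy ignores), so only the
# assistant's response tokens contribute to the training loss.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, ignoring every position before prompt_len."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: 5 prompt tokens followed by 3 assistant-response tokens.
labels = mask_prompt_labels([11, 12, 13, 14, 15, 21, 22, 23], prompt_len=5)
```

In practice the boundary is derived from the chat template, not passed in by hand, but the effect on the labels is the same.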

Please review the dataset card for provenance and potential limitations.


Training procedure

General

  • Method: Supervised fine-tuning (SFT)
  • Epochs: 3
  • Max sequence length: 16,384
  • Packing: enabled
  • Precision: bfloat16
  • Gradient checkpointing: enabled
  • Kernel optimization: Liger kernel enabled
  • Distributed training: DDP
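
Packing concatenates multiple short training examples into each 16,384-token sequence so compute is not wasted on padding. A naive greedy sketch of the idea (real implementations also chunk overlong examples and handle cross-example attention, which is omitted here):

```python
def pack_sequences(tokenized_examples, max_length):
    """Greedy sketch: concatenate tokenized examples into buffers of at
    most max_length tokens, starting a new buffer when the next example
    would overflow. Illustrative only, not the trainer's actual packing."""
    buffers, current = [], []
    for tokens in tokenized_examples:
        if current and len(current) + len(tokens) > max_length:
            buffers.append(current)
            current = []
        current.extend(tokens)
    if current:
        buffers.append(current)
    return buffers

# Three short "documents" packed into buffers of at most 7 tokens.
buffers = pack_sequences([[1, 1, 1], [2, 2, 2], [3, 3, 3]], max_length=7)
```

With a 16k budget, many short chat examples fit into a single packed sequence, which is why packing is paired with the long context length here.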

Optimization

  • Optimizer: adamw_torch_fused
  • Batch size per device: 4
  • Gradient accumulation: 4
  • Effective batch size (per GPU): 4 × 4 = 16 sequences per optimizer step
  • Weight decay: 0.05

Learning rate:

  • Scheduler: cosine_with_min_lr
  • Warmup ratio: 0.05
  • Minimum LR: 5e-6
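
The cosine_with_min_lr scheduler decays the learning rate along a cosine curve from the peak down to the configured floor instead of to zero. A minimal sketch of the decay phase (warmup is omitted, the peak LR is illustrative since the card does not state it, and the exact framework implementation may differ slightly):

```python
import math

def cosine_with_min_lr(step, total_steps, peak_lr, min_lr):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# peak_lr below is an assumed value for illustration only.
start = cosine_with_min_lr(0, 1000, peak_lr=2e-5, min_lr=5e-6)
end = cosine_with_min_lr(1000, 1000, peak_lr=2e-5, min_lr=5e-6)
```

The floor keeps late-training updates non-trivial, which can help on long, packed sequences where the final steps still see substantial data.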

Logging & checkpoints

  • Logging steps: 5
  • Save steps: 900
  • Tracking: Weights & Biases
  • Token accuracy logged during training

Data processing

  • Dataset workers: 16 processes
  • Dataset preparation: enabled
  • Language split: de

Usage

Transformers example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Erkläre mir kurz den Unterschied zwischen erneuerbaren und fossilen Energien."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(out[0], skip_special_tokens=True))

Important:
Use the tokenizer's apply_chat_template(); the model was trained on chat-formatted data, and output quality degrades without it.


Evaluation

Training logged token accuracy as a diagnostic metric.

Token accuracy is not a real benchmark score and should not be interpreted as model quality.
For proper evaluation, use German instruction-following and reasoning benchmarks.


Limitations

  • May hallucinate facts
  • Reasoning is not guaranteed correct
  • Performance near 16k context depends on prompt structure
  • Improvements mainly apply to German (other languages may not improve)
  • Not aligned for safety-critical deployments

Bias & Safety

This model inherits:

  • biases from the base model
  • biases from training data

Recommended mitigations:

  • add moderation layer
  • add safety prompts
  • human review for sensitive applications

License

This is a derivative model of:

Qwen/Qwen3-4B-Instruct-2507

Therefore the base model license and usage restrictions apply in addition to any dataset terms.

Please verify compatibility before commercial use.


Reproducibility (Training Arguments)

--model_name Qwen/Qwen3-4B-Instruct-2507
--tokenizer_name Qwen/Qwen3-4B-Instruct-2507
--dataset_path DGurgurov/Nemotron-Multilingual-Reasoning
--skip_prepare_dataset False
--lang_split de
--prepare_messages True
--completion_only_loss True
--max_length 16384
--dataset_num_proc 16
--packing True
--use_liger_kernel True
--bf16 True
--log_token_accuracy True
--optim adamw_torch_fused
--gradient_checkpointing True
--per_device_train_batch_size 4
--gradient_accumulation_steps 4
--ddp_find_unused_parameters False
--lr_scheduler_type cosine_with_min_lr
--lr_scheduler_kwargs '{"min_lr": 5.0e-6}'
--warmup_ratio 0.05
--weight_decay 0.05
--report_to wandb
--run_name qwen3_4b_instruct_lns_de_3_epochs
--num_train_epochs 3
--save_strategy steps
--logging_steps 5
--save_steps 900
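
For readers reproducing this in code rather than via CLI flags, the arguments above map roughly onto a TRL-style configuration object. This is an assumption about the training framework (the card does not name it), and field names follow recent TRL releases (older versions use e.g. max_seq_length instead of max_length):

```python
# Hypothetical in-code equivalent of the CLI arguments above.
# Assumes a recent trl release; not the card author's actual script.
from trl import SFTConfig

config = SFTConfig(
    output_dir="qwen3_4b_instruct_de_sft",   # illustrative path
    max_length=16384,
    packing=True,
    completion_only_loss=True,
    dataset_num_proc=16,
    bf16=True,
    use_liger_kernel=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 5.0e-6},
    warmup_ratio=0.05,
    weight_decay=0.05,
    num_train_epochs=3,
    save_strategy="steps",
    save_steps=900,
    logging_steps=5,
    report_to="wandb",
    run_name="qwen3_4b_instruct_lns_de_3_epochs",
)
```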

Citation

If you use this model, please cite:

  • Qwen/Qwen3-4B-Instruct-2507
  • DGurgurov/Nemotron-Multilingual-Reasoning

Acknowledgements

  • Qwen Team — base model
  • Nemotron Multilingual Reasoning dataset authors
  • Hugging Face Transformers ecosystem