---
language:
- de
license: other
base_model: HuggingFaceTB/SmolLM3-3B
tags:
- sft
- instruction-tuning
- reasoning
- german
- multilingual
- long-context
- fsdp
- transformers
datasets:
- DGurgurov/Nemotron-Multilingual-Reasoning
metrics:
- token_accuracy
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM3-3B — German Reasoning Instruction SFT (Nemotron Multilingual Reasoning)
|
## Model Description

This model is a **Supervised Fine-Tuned (SFT)** version of `HuggingFaceTB/SmolLM3-3B`.

It was fine-tuned on the **German (`de`) split** of the dataset `DGurgurov/Nemotron-Multilingual-Reasoning`.

The goal of training was to improve:

- German instruction following
- Step-by-step reasoning
- Long-context conversation behavior

The model was trained on chat-formatted conversations with **completion-only loss**, meaning only assistant responses contributed to optimization.
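Completion-only loss amounts to label masking: every token outside an assistant turn is assigned the ignore index (-100 for PyTorch cross-entropy), so it contributes nothing to the loss. A minimal sketch with made-up token IDs and spans; the real trainer derives the assistant spans from the chat template:

```python
# Illustrative label masking for completion-only loss.
# Token IDs and span positions below are hypothetical.
def mask_non_assistant(labels, assistant_spans, ignore_index=-100):
    """Keep labels inside assistant spans; set everything else to
    ignore_index so cross-entropy skips those positions."""
    masked = [ignore_index] * len(labels)
    for start, end in assistant_spans:  # half-open [start, end) token ranges
        masked[start:end] = labels[start:end]
    return masked

token_labels = [10, 11, 12, 13, 14, 15, 16, 17]
# Suppose positions 3..5 hold the assistant's reply.
print(mask_non_assistant(token_labels, [(3, 6)]))
# [-100, -100, -100, 13, 14, 15, -100, -100]
```

Only the three assistant tokens survive; the system and user tokens are ignored by the loss.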
|
Key properties:

- Base model: SmolLM3-3B
- Language specialization: German
- Context length during training: **16,384 tokens**
- Chat-formatted dataset
- Long-context sequence packing enabled

---
|
## Intended Uses

### Suitable For

- German conversational assistants
- Educational tutoring
- Reasoning and structured-explanation tasks
- Long-document Q&A in German
- Research experiments with long-context small LLMs

### Not Suitable For

- Medical or legal advice without human review
- Autonomous decision-making
- Safety-critical systems
- High-stakes financial decisions

---
|
## Training Data

Dataset used: `DGurgurov/Nemotron-Multilingual-Reasoning`

Processing configuration:

- Language filtering: **German only**
- Conversion into chat messages (`prepare_messages=True`)
- Assistant-only optimization (`completion_only_loss=True`)

Only assistant responses were used to compute the loss; user and system messages were masked out.

Please review the dataset card for provenance and limitations.

---
|
## Training Procedure

Training was performed with **Hugging Face Accelerate and FSDP (Fully Sharded Data Parallel)** across 8 processes.

### Core Setup

- Training method: Supervised fine-tuning (SFT)
- Epochs: **3**
- Maximum sequence length: **16,384**
- Sequence packing: enabled
- Precision: **bfloat16**
- Kernel optimization: Liger kernel enabled
- Gradient checkpointing: enabled
- Distributed: FSDP (8 processes)
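Sequence packing concatenates several short tokenized examples into one sequence up to the 16,384-token limit, so long-context training steps waste little compute on padding. A simplified greedy sketch; real packers also insert separator tokens and track example boundaries for attention masking:

```python
# Greedy sequence packing sketch (simplified illustration, not the
# trainer's actual packer).
def pack_sequences(token_lists, max_length):
    """Concatenate tokenized examples into bins of at most max_length tokens."""
    bins, current = [], []
    for tokens in token_lists:
        # Start a new bin when the next example would overflow this one.
        if current and len(current) + len(tokens) > max_length:
            bins.append(current)
            current = []
        current.extend(tokens)
    if current:
        bins.append(current)
    return bins

examples = [[1] * 6000, [2] * 7000, [3] * 5000]
packed = pack_sequences(examples, max_length=16384)
print([len(b) for b in packed])  # [13000, 5000]
```

The first two examples fit together under the 16,384-token budget; the third starts a new packed sequence.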
|
---
|
### Optimization

- Optimizer: `adamw_torch_fused`
- Per-device batch size: 4
- Gradient accumulation steps: 4
- Effective batch size: 16 sequences per GPU per optimizer step (4 × 4), i.e. 128 sequences globally across 8 processes
- Weight decay: 0.05

Learning rate schedule:

- Scheduler: `cosine_with_min_lr`
- Warmup ratio: 0.05
- Minimum LR: 5e-6

---
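The `cosine_with_min_lr` schedule warms up linearly over the first 5% of steps, then decays along a cosine curve to the 5e-6 floor rather than to zero. A rough sketch of that shape; the peak LR and step counts here are illustrative, not values stated in this card:

```python
import math

# Sketch of cosine decay with linear warmup and a floor LR.
# peak_lr and total_steps below are assumed for illustration.
def lr_at_step(step, total_steps, peak_lr, min_lr, warmup_ratio=0.05):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at_step(0, 1000, 2e-5, 5e-6))     # 0.0 (start of warmup)
print(lr_at_step(1000, 1000, 2e-5, 5e-6))  # 5e-06 (the floor)
```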
|
### Logging & Checkpoints

- Logging every 5 steps
- Checkpointing every 450 steps
- Weights & Biases tracking enabled
- Token accuracy logged during training

---
|
### Data Processing

- Dataset workers: 16
- Dataset preparation: enabled
- Chat-message preparation: enabled
- German split: enabled

---
|
## Usage

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_NAME"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Warum ist der Himmel blau?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important:** Use `apply_chat_template()` when prompting. The model was trained on chat-formatted conversations, and performance will degrade without it.
|
---
|
## Evaluation

During training, **token accuracy** was logged as a diagnostic metric.

Token accuracy:

- is useful for monitoring training stability
- is **not** a benchmark score
- does not represent real reasoning performance
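Token accuracy is typically the fraction of unmasked label positions where the model's argmax prediction matches the reference token. A minimal sketch of that definition; the trainer's exact implementation may differ:

```python
# Sketch of next-token accuracy over unmasked labels.
def token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of positions where the predicted token ID equals the label,
    counting only positions whose label is not ignore_index."""
    correct = total = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_index:
            continue  # masked position (e.g. user/system tokens)
        total += 1
        correct += pred == label
    return correct / total if total else 0.0

preds = [5, 9, 2, 7, 3]
labels = [5, 9, 4, -100, 3]
print(token_accuracy(preds, labels))  # 0.75
```

Three of the four unmasked positions match, hence 0.75; the masked position is excluded entirely.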
|
For proper evaluation, use:

- German instruction-following benchmarks
- reasoning datasets
- long-context evaluation tasks

---
|
## Limitations

- May hallucinate facts
- Reasoning chains can still contain logical errors
- Performance near the 16k context limit depends heavily on prompt structure
- Improvements apply mainly to German
- The smaller model size means weaker world knowledge than large LLMs
- Not aligned for safety-critical deployment

---
|
## Bias & Safety

This model inherits biases from:

- the base model
- the training dataset

Recommended mitigations:

- add moderation filters
- use system prompts that enforce safe behavior
- include human review for sensitive deployments

---
|
## License

This model is a derivative of `HuggingFaceTB/SmolLM3-3B`. The original base-model license and usage restrictions therefore apply, along with any dataset terms.

Verify license compatibility before commercial deployment.

---
|
## Reproducibility (Training Arguments)

```text
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split de \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384 \
  --dataset_num_proc 16 \
  --packing True \
  --use_liger_kernel True \
  --bf16 True \
  --log_token_accuracy True \
  --optim adamw_torch_fused \
  --gradient_checkpointing True \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --ddp_find_unused_parameters False \
  --lr_scheduler_type cosine_with_min_lr \
  --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
  --warmup_ratio 0.05 \
  --weight_decay 0.05 \
  --report_to wandb \
  --run_name smol_3b_3epochs_lns_de \
  --num_train_epochs 3 \
  --save_strategy steps \
  --logging_steps 5 \
  --save_steps 450
```

---
|
## Citation

If you use this model, please cite:

- `HuggingFaceTB/SmolLM3-3B`
- `DGurgurov/Nemotron-Multilingual-Reasoning`

---
|
## Acknowledgements

- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- Hugging Face Accelerate and Transformers libraries