---
library_name: transformers
tags:
- small-lm
- math
- reasoning
- slm
license: apache-2.0
datasets:
- openai/gsm8k
base_model:
- Qwen/Qwen3-0.6B
---
# Qwen3-0.6B-Math
This model was obtained by fine-tuning [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) on the train split of [gsm8k](https://huggingface.co/datasets/openai/gsm8k).
The model is used in the experiments described in https://bknyaz.github.io/blog/2026/meta-merge/.
A single A100 GPU was used for both fine-tuning and evaluation.
The following versions were used for train/eval:
- python >= 3.10
- torch : 2.9.0+cu128
- lm_eval : 0.4.9.1
- vllm : 0.11.1
- transformers : 4.57.6
- datasets : 3.2.0
- numpy : 2.2.6
## Training
The [TRL](https://github.com/huggingface/trl) library was used for full-rank (non-LoRA) SFT:
```bash
python trl/scripts/sft.py --model_name_or_path Qwen/Qwen3-0.6B --dataset_name openai/gsm8k --dataset_config main --learning_rate 2e-5 \
--num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --gradient_checkpointing --eos_token '<|im_end|>' --eval_strategy steps \
--eval_steps 100 --completion_only_loss True --report_to wandb --output_dir /path/to/the/finetuned/model
```
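On a single GPU, the flags above give an effective batch size of `per_device_train_batch_size × gradient_accumulation_steps`:

```python
# Effective batch size implied by the CLI flags above (single-GPU run).
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # a single A100 was used

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 16
```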
This setup is far from the most compute- or performance-efficient way to fine-tune, but it can serve as a reasonable baseline.
The dataset was preprocessed to the conversational format:
```python
# trl/scripts/sft.py
from datasets import load_dataset

dataset = load_dataset(...)  # openai/gsm8k with the "main" config, as in the CLI above

def preprocess_function(example):
    # Convert each GSM8K record to TRL's conversational prompt/completion format.
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "completion": [
            {"role": "assistant", "content": example["answer"]}
        ],
    }

dataset = dataset.map(preprocess_function)
```
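The `--completion_only_loss True` flag means the loss is computed only on the assistant (completion) tokens. A minimal sketch of that masking, assuming token-level labels (this is an illustration, not TRL's actual code):

```python
# Hedged sketch of completion-only loss masking: prompt tokens get label -100,
# which cross-entropy losses in transformers/TRL ignore by convention.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    # Copy the token ids as labels, then blank out the prompt portion.
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

print(mask_prompt_labels([10, 11, 12, 13, 14], prompt_len=2))  # [-100, -100, 12, 13, 14]
```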
## Evaluation
Evaluation was done with lm_eval on the test split of gsm8k:
```bash
python -m lm_eval --model vllm --model_args pretrained=${model},tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=1 \
--tasks gsm8k --batch_size 1 --apply_chat_template=True --confirm_run_unsafe_code --trust_remote_code
```
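GSM8K solutions end with a `#### <answer>` marker, and lm_eval scores the task by comparing the extracted final number against the reference. A rough sketch of that extraction with a hypothetical helper (not lm_eval's actual code):

```python
import re

# Hypothetical helper: pull the final numeric answer after the "####" marker,
# stripping thousands separators, roughly mirroring lm_eval's strict matching.
def extract_final_answer(text):
    m = re.search(r"####\s*(-?[0-9.,]+)", text)
    return m.group(1).replace(",", "") if m else None

print(extract_final_answer("3 + 4 = 7\n#### 7"))  # 7
```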
### Results
| Model           | gsm8k (test) |
|-----------------|--------------|
| Qwen3-0.6B      | 21.0         |
| Qwen3-0.6B-Math | 46.3         |
## License
Please refer to the license of the original model [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) and dataset [gsm8k](https://huggingface.co/datasets/openai/gsm8k).