---
library_name: transformers
tags:
- small-lm
- math
- reasoning
- slm
license: apache-2.0
datasets:
- openai/gsm8k
base_model:
- Qwen/Qwen3-0.6B
---

# Qwen3-0.6B-Math

This model was obtained by fine-tuning Qwen/Qwen3-0.6B on the train split of [gsm8k](https://huggingface.co/datasets/openai/gsm8k). 
It is used in the experiments described in https://bknyaz.github.io/blog/2026/meta-merge/. 
A single A100 GPU was used for both fine-tuning and evaluation.

The following package versions were used for training and evaluation:

- python >= 3.10
- torch               : 2.9.0+cu128
- lm_eval             : 0.4.9.1
- vllm                : 0.11.1
- transformers        : 4.57.6
- datasets            : 3.2.0
- numpy               : 2.2.6

## Training

The [TRL](https://github.com/huggingface/trl) library was used with SFT/full-rank options:

```bash
python trl/scripts/sft.py --model_name_or_path Qwen/Qwen3-0.6B --dataset_name openai/gsm8k --dataset_config main --learning_rate 2e-5 \
--num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --gradient_checkpointing --eos_token '<|im_end|>' --eval_strategy steps \
--eval_steps 100 --completion_only_loss True --report_to wandb --output_dir /path/to/the/finetuned/model
```
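Qwen3's chat template follows the ChatML convention, where each turn is closed by `<|im_end|>`; passing `--eos_token '<|im_end|>'` tells TRL to treat that token as end-of-sequence. As a rough illustration (the authoritative template is defined by the Qwen3 tokenizer and applied via `apply_chat_template`, and may include additional special tokens), one training example is rendered roughly like this:

```python
# Illustrative sketch only: the real template lives in the Qwen3 tokenizer
# config; this shows the basic ChatML turn structure with <|im_end|> as EOS.
def render_chatml(prompt: str, completion: str) -> str:
    """Render one SFT example in ChatML form, each turn closed by <|im_end|>."""
    return (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{completion}<|im_end|>\n"
    )

text = render_chatml("What is 2 + 3?", "2 + 3 = 5\n#### 5")
print(text)
```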

This is far from the most compute- or performance-efficient fine-tuning setup, but it can serve as a reasonable baseline. Note that with `--per_device_train_batch_size 2` and `--gradient_accumulation_steps 8` on a single GPU, the effective batch size is 16.

The dataset was preprocessed to the conversational format:

```python
# trl/scripts/sft.py

dataset = load_dataset(...)

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "completion": [
            {"role": "assistant", "content": example["answer"]}
        ],
    }

dataset = dataset.map(preprocess_function)
```
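Applied to a single record, the mapping produces one user turn and one assistant turn. The record below is a made-up GSM8K-style example (calculator annotations in `<<...>>` and a final `#### <number>` line), not taken from the dataset:

```python
# Same mapping as above, applied to one in-memory record for illustration.
def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "completion": [{"role": "assistant", "content": example["answer"]}],
    }

record = {
    "question": "Tom has 3 apples and buys 4 more. How many apples does he have?",
    "answer": "Tom has 3 + 4 = <<3+4=7>>7 apples.\n#### 7",
}
mapped = preprocess_function(record)
print(mapped["prompt"][0]["role"], "->", mapped["completion"][0]["role"])  # user -> assistant
```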

## Evaluation

Evaluation was done with lm_eval on the test split of gsm8k:

```bash
python -m lm_eval --model vllm --model_args pretrained=${model},tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=1 \
 --tasks gsm8k --batch_size 1 --apply_chat_template=True --confirm_run_unsafe_code --trust_remote_code
```
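GSM8K reference answers end with a `#### <number>` line, and lm_eval scores the gsm8k task by extracting the final number from the model's completion and comparing it against the reference. A simplified sketch of that extraction step (a rough regex for illustration, not lm_eval's exact filter):

```python
import re

def extract_final_answer(completion: str):
    """Pull the number after '####' (simplified version of lm_eval's gsm8k filter)."""
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    return match.group(1).replace(",", "") if match else None

pred = "3 + 4 = 7 apples in total.\n#### 7"
print(extract_final_answer(pred))  # 7
```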

### Results

| Model           | gsm8k (exact match, %) |
|-----------------|------------------------|
| Qwen3-0.6B      | 21.0                   |
| Qwen3-0.6B-Math | 46.3                   |

## License

Please refer to the licenses of the original model [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) and the dataset [gsm8k](https://huggingface.co/datasets/openai/gsm8k).