---
library_name: transformers
tags:
- small-lm
- math
- reasoning
- slm
license: apache-2.0
datasets:
- openai/gsm8k
base_model:
- Qwen/Qwen3-4B-Instruct-2507
---

# Qwen3-4B-Instruct-2507-Math

This model was obtained by fine-tuning Qwen/Qwen3-4B-Instruct-2507 on the train split of [gsm8k](https://huggingface.co/datasets/openai/gsm8k).
It is used in the experiments described in https://bknyaz.github.io/blog/2026/meta-merge/.
A single A100 GPU was used for both fine-tuning and evaluation.

The following versions were used for training and evaluation:

- python >= 3.10
- torch               : 2.9.0+cu128
- lm_eval             : 0.4.9.1
- vllm                : 0.11.1
- transformers        : 4.57.6
- datasets            : 3.2.0
- numpy               : 2.2.6

## Training

The [TRL](https://github.com/huggingface/trl) library was used with full-rank SFT (no parameter-efficient adapters):

```bash
python trl/scripts/sft.py --model_name_or_path Qwen/Qwen3-4B-Instruct-2507 --dataset_name openai/gsm8k --dataset_config main --learning_rate 2e-5 \
--num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --gradient_checkpointing --eos_token '<|im_end|>' --eval_strategy steps \
--eval_steps 100 --completion_only_loss True --report_to wandb --output_dir /path/to/the/finetuned/model
```

This is far from the most compute- or performance-efficient fine-tuning recipe, but it can serve as a reasonable baseline.

The dataset was preprocessed into the conversational format:

```python
# trl/scripts/sft.py

dataset = load_dataset(...)

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "completion": [
            {"role": "assistant", "content": example["answer"]}
        ],
    }

dataset = dataset.map(preprocess_function)
```
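As a sanity check, the mapping can be run standalone on a toy example (the question/answer pair below is a made-up placeholder, not a real gsm8k record):

```python
# Standalone sanity check of the preprocessing step above.
# The sample record is a made-up placeholder, not taken from gsm8k.

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "completion": [
            {"role": "assistant", "content": example["answer"]}
        ],
    }

sample = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}
out = preprocess_function(sample)

# Each example becomes a single-turn user prompt plus an assistant completion,
# which is the conversational schema TRL's SFT script expects.
print(out["prompt"][0]["role"])      # user
print(out["completion"][0]["role"])  # assistant
```

With `--completion_only_loss True`, the loss is then computed only on the assistant completion, not on the prompt tokens.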

## Evaluation

Evaluation was done with lm_eval on the test split of gsm8k:

```bash
python -m lm_eval --model vllm --model_args pretrained=${model},tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=1 \
 --tasks gsm8k --batch_size 1 --apply_chat_template=True --confirm_run_unsafe_code --trust_remote_code
```
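For context, gold gsm8k answers end with a line of the form `#### <number>`, and lm_eval scores exact match against that final number. The helper below is a minimal illustration of this convention, not lm_eval's exact extraction filter:

```python
import re

def extract_gsm8k_answer(text: str) -> str:
    """Extract the final numeric answer from a gsm8k-style solution.

    Gold gsm8k answers end with a line like '#### 18'. This mirrors
    that convention for illustration; lm_eval's actual gsm8k task
    uses its own regex filters.
    """
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match is None:
        return ""
    # Strip thousands separators so '1,234' compares equal to '1234'.
    return match.group(1).replace(",", "")

solution = "She sells 16 - 3 - 4 = 9 eggs. 9 * 2 = 18.\n#### 18"
print(extract_gsm8k_answer(solution))  # 18
```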

### Results

| Model                       | gsm8k|
|-----------------------------|------|
| Qwen3-4B-Instruct-2507      | 80.4 |
| Qwen3-4B-Instruct-2507-Math | 76.8 |

## License

Please refer to the license of the original model [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and dataset [gsm8k](https://huggingface.co/datasets/openai/gsm8k).