File size: 15,817 Bytes
f677bb1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
# Complete Multimodal GRPO Experiment Workflow

This document explains how to use SWIFT GRPO for training multimodal models and tasks. The goal is to train on multiple multimodal tasks to improve task accuracy. Task definitions, training parameters, etc., refer to [R1-V](https://github.com/Deep-Agent/R1-V.git) and [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal.git).

---

## **ClevrCount Task**

### **Task and Dataset Definition**

This task is based on the `clevr_cogen_a_train` dataset. The model's goal is to output the number of objects in the image. Therefore, we define the dataset as follows:

```python
class ClevrPreprocessor(ResponsePreprocessor):

    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        query = row.get('query', '')
        query = f"""{query} Output the thinking process in <think> </think> and
 final answer (number) in <answer> </answer> tags."""
        row.update({'query': query})
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='okwinds/clevr_cogen_a_train',
        subsets=[
            SubsetDataset(
                name='default',
                subset='default',
                split=['train'],
            ),
        ],
        preprocess_func=ClevrPreprocessor(),
        tags=['qa', 'math']))
```

The purpose of redefining the dataset preprocessor here is to modify the query. A sample dataset entry is as follows, including `messages`, `images`, and `solution` fields. The `solution` is used in the reward function, while `messages` and `images` serve as model input.
- Note: `{'role': 'assistant', 'content': '<answer> 3 </answer>'}` will be removed in GRPOTrainer and can be ignored. The 'solution' field will be passed directly into the ORM. When creating a custom dataset, the 'images' field should be organized as `["image_path1", "image_path2"]`.

```json
{
    "images": ["image_path1", "image_path2"],
    "messages": [
        {
            "role": "user",
            "content": "How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags."
        }
    ],
    "solution": "<answer> 3 </answer>"
}

```

---

## **Reward Function Definition**

This task uses two reward functions: one is the format reward function mentioned in `Deepseek-R1`, and the other is the accuracy reward function for ClevrCount. The former is built into SWIFT and can be used directly with `--reward_funcs format`. The latter needs to be custom-defined. Here, we use the `external_plugin` method to define the accuracy reward function by placing the code in `swift/examples/train/grpo/plugin/plugin.py`.

The reward function's input includes `completions` and `solution` fields, representing the model-generated text and ground truth, respectively. Each is a list, allowing the computation of multiple completions simultaneously. Note that the `solution` field is passed through directly from the dataset definition. If there are task changes, corresponding modifications can be made to the dataset and reward function.

```python
class MultiModalAccuracyORM(ORM):

    def __call__(self, completions, solution, **kwargs) -> List[float]:
        """
        Reward function that checks if the completion is correct.
        Args:
            completions (list[str]): Generated outputs
            solution (list[str]): Ground Truths.

        Returns:
            list[float]: Reward scores
        """
        rewards = []
        from math_verify import parse, verify
        for content, sol in zip(completions, solution):
            reward = 0.0
            # Try symbolic verification first
            try:
                answer = parse(content)
                if float(verify(answer, parse(sol))) > 0:
                    reward = 1.0
            except Exception:
                pass  # Continue to next verification method if this fails

            # If symbolic verification failed, try string matching
            if reward == 0.0:
                try:
                    # Extract answer from solution if it has think/answer tags
                    sol_match = re.search(r'<answer>(.*?)</answer>', sol)
                    ground_truth = sol_match.group(1).strip() if sol_match else sol.strip()

                    # Extract answer from content if it has think/answer tags
                    content_match = re.search(r'<answer>(.*?)</answer>', content)
                    student_answer = content_match.group(1).strip() if content_match else content.strip()

                    # Compare the extracted answers
                    if student_answer == ground_truth:
                        reward = 1.0
                except Exception:
                    pass  # Keep reward as 0.0 if both methods fail
            rewards.append(reward)
        return rewards
orms['external_r1v_acc'] = MultiModalAccuracyORM
```

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

We selected `Qwen2.5-VL-3B-Instruct` as the base model for training. The main reason for choosing the `Instruct` model over the base model is to rapidly achieve format rewards. Experiments were conducted on 8 GPUs. SWIFT GRPO training supports multi-GPU deployment to accelerate rollouts, so we set `num_infer_workers` to 2 and processes to 6 (2 GPUs for deployment, 6 GPUs for training). If you encounter deployment errors for `qwen2.5-vl` on `vllm`, refer to [this issue](https://github.com/vllm-project/vllm/issues/13285).

Since the task is simple, we set `max_completion_length` to 1024 and selected `external_r1v_acc` and `format` as reward functions. The learning rate and beta are set to `1e-6` and `0.001`, respectively. Other configurations are as follows. The settings for `batch_size` and `num_generations` can be referenced from [GRPO Full Workflow](./GRPO完整流程.md).

```shell
WANDB_API_KEY=your_wandb_api_key \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'okwinds/clevr_cogen_a_train' \
    --vllm_max_model_len 8192 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --save_strategy 'steps' \
    --eval_strategy 'steps' \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/GRPO_CLEVR_COUNTDOWN \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --num_generations 24 \
    --temperature 1.0 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --report_to wandb \
    --num_iterations 1 \
    --num_infer_workers 2 \
    --async_generate false \
    --beta 0.001 \
```

#### **Experimental Observations**

[image.png](../../resources/grpo_clevr_count.png)

- Given the simplicity of the dataset and task, the model converged after 500 epochs. Key observations:
  1. The custom `ClevrORM` reward steadily increased, proving the model learned how to complete the task. The task success rate climbed from an initial 0.4 to nearly 1.
  2. The `Format Reward` remained stable at 1, likely due to the consistent query format across all dataset samples.
  3. The `reward_std` stabilized below 0.1.
  4. The `completion length` eventually stabilized between 60-80 tokens, with the model learning a fixed output pattern for item-by-item counting.

---
For additional tasks like Geometric QA and Open R1 Multimodal datasets, refer to their respective sections in the full experiment documentation.

## **Geometric QA Task**

### **Task and Dataset Definition**

This task is a Geometric QA task, where the task description is: given a geometric figure, answer mathematical questions related to the figure. The original data comes from [this paper](https://arxiv.org/pdf/2312.11370), and [R1-V](https://github.com/Deep-Agent/R1-V.git) has preprocessed the data into a `problem-solution` format while retaining the images in the `image` field. Therefore, we do not need to redefine the dataset and can directly use `--dataset AI-ModelScope/GEOQA_R1V_Train_8K`.

---

### **Reward Function**

As this is also a mathematical problem, and the answers are already processed into final results, we directly use the previously defined `MultiModalAccuracyORM` reward function.

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

The selected model and most hyperparameters are similar to the previous experiment, with two main differences:
1. **SWIFT now supports the `--num_iteration` parameter**, allowing multiple updates during a single rollout. We set it to 2.
2. During the experiment, we found that training might become unstable in mathematical problems, causing the model to collapse. This is characterized by a sharp drop in all rewards, a rapid increase in loss, `grad_norm`, and KL divergence, with no subsequent recovery. To prevent this, we set `--max_grad_norm 0.5` to ensure stable training. Note that this instability can have some randomness.

```shell
WANDB_API_KEY=your_wandb_api_key \
MAX_PIXELS=401408 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'AI-ModelScope/GEOQA_R1V_Train_8K' \
    --vllm_max_model_len 8192 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --save_strategy 'steps' \
    --eval_strategy 'steps' \
    --eval_steps 400 \
    --save_steps 400 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/GRPO_GEOQA \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --temperature 1.0 \
    --repetition_penalty 1.1 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --report_to wandb \
    --num_iterations 2 \
    --num_infer_workers 2 \
    --async_generate false \
    --beta 0.001 \
    --max_grad_norm 0.5 \
```

---

#### **Experimental Observations**

[image.png](../../resources/grpo_geoqa.png)

The training curve is shown above.

1. Compared to the **Count task**, the **Geometric QA task** is significantly more challenging. As a result, the **Accuracy Reward** converges much more slowly and has larger oscillations.
2. The `completion_length` exhibits a clear trend, eventually stabilizing at around 250 tokens.
3. The reward converges, with the standard deviation stabilizing at around 0.2 after a total of 1300 steps.

Below is an example of the model's completion after training:

```plaintext
User:
"If AB is the tangent to circle O with the tangent point A, and BO intersects circle O at point C, and point D is located on circle O, what is the measure of angle ADC if the degree of angle ABO is 32.0?"
Assistant:
"<think>\nTo find the measure of angle ADC, we first need to understand the properties of the tangent and the circle. Since AB is a tangent to the circle at point A, the line segment OA (which is perpendicular to the tangent AB) is perpendicular to AB. This means that angle OAB is 90 degrees.\n\nGiven that angle ABO is 32 degrees, we can find angle BAO by subtracting angle ABO from 90 degrees:\n∠BAO = 90° - ∠ABO = 90° - 32° = 58°\n\nSince angle BAO is an angle in the minor arc AD, the angle ADC, which is an angle in the minor arc AD, is half the angle BAO. Therefore, angle ADC is:\n∠ADC = 1/2 × ∠BAO = 1/2 × 58° = 29°\n\nSo, the measure of angle ADC is 29 degrees.\n</think>\n<answer>\nThe measure of angle ADC is 29 degrees.\n</answer>"
```

---

## **Multimodal Open R1 Dataset Experiment**

### **Task and Dataset Definition**

This task is based on the experiments in [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal.git), using the dataset: [lmms-lab/multimodal-open-r1-8k-verified](https://www.modelscope.cn/datasets/lmms-lab/multimodal-open-r1-8k-verified). This dataset focuses on multimodal mathematical reasoning tasks, with data generated by GPT4o based on the `Math360K` and `Geo170K` datasets. It includes reasoning paths and verifiable answers. The dataset already contains `image`, `problem`, and `solution` fields, so no additional prompt modifications are required, and there is no need to redefine the dataset.

---

### **Reward Function**

We directly use the previously defined `MultiModalAccuracyORM` reward function.

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

The selected model and most hyperparameters are similar to the previous experiment. Due to an **OOM (Out of Memory) issue**, we set `MAX_PIXELS=262144` to reduce memory usage.

```shell
WANDB_API_KEY=your_wandb_api_key \
MAX_PIXELS=262144 \
MASTER_PORT=29600 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'lmms-lab/multimodal-open-r1-8k-verified' \
    --vllm_max_model_len 8192 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --save_strategy 'steps' \
    --eval_strategy 'steps' \
    --eval_steps 400 \
    --save_steps 400 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/GRPO_GEOQA \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --temperature 1.0 \
    --repetition_penalty 1.1 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --report_to wandb \
    --num_iterations 2 \
    --num_infer_workers 2 \
    --async_generate false \
    --beta 0.001 \
    --max_grad_norm 0.5 \
```

---

#### **Experimental Observations**

[image.png](../../resources/grpo_openr1_multimodal.png)

The training curve is shown above.

- With a total of 8k data samples, the training reached **1268 steps**, with the **Accuracy Reward** converging around 0.5.
- The `completion_length` stabilized around 200 tokens.
- The `reward_std` eventually oscillated around 0.2.

Below is an example of the model's completion after training:

```plaintext
User:
"Based on the map, which state falls into the lowest percentage range of lots, and what is that percentage range?"
Assistant:
"<think>\nThe image provided shows a color-coded map of the USA indicating different ranges of lot percentages across various states. According to the legend at the bottom, the lowest percentage range (6.0% - 6.5%) is represented by white. In the image, Alabama (AL) is shaded in white, which corresponds to the 6.0% - 6.5% category. Therefore, based on the map, the state that falls into the lowest percentage range of lots is Alabama, with the percentage range of 6.0% - 6.5%.\nTherefore, the answer is 6.0% - 6.5%.\n</think>\n<answer>Alabama</answer>"
```