File size: 9,087 Bytes
7de483e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
# Ground-R1 `rl-projects` β€” Ablation Branch Notes

Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch
(`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the
baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what
hypothesis it isolates, and any bugs/risks found.

---

## At a glance

| Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds |
|---|---|---|---|---|
| `r1-v` (baseline) | yes (`<box>` + crop + 2nd image) | adaptive: stop on `<answer>` or iter==4 | original + each crop | full conversation |
| `r1-v-no-hist` | yes (still crops) | adaptive (same cap) | **only the latest crop** | **reset each round** |
| `r1-v-no123-match-round` | **no** | adaptive (same cap) | original only | full conversation |
| `r1-v-no123` | **no** | **fixed: exactly 2 rounds** | original only | full conversation |
| `r1-v-no23-1round` | `<box>`+`<answer>` in one turn | adaptive (same cap), retry only on missing `<answer>` | original re-sent every retry | full conversation |

The naming reads as: **no-hist** = remove cross-round memory; **no123** = remove the three
grounding pieces (bbox-step, crop, second image); **match-round** = keep r1-v's adaptive
iteration cap; **1round** = intended single fused turn.

---

## `r1-v-no-hist` β€” drop conversation history between grounding rounds

Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws
away prior turns and shows the model only the cropped image + the next-round prompt.

- **Trainer `_prepare_for_stage2`** is the only behavior change:
  - before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`,
    `combined_images = [original, crop]`
  - after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]`
- **`grpo.py`** resume hardening (not algorithmic): replaces the dead hardcoded
  `/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a
  `configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`,
  registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+),
  and fixes `trainer.save_state(output_dir)` β†’ `trainer.save_state()`.

**Hypothesis:** does iterative grounding still work when the model can't see what it
grounded before?

**Risk / note (by design, easy to overlook):** because the prompt is reset every round,
the final `prompt_completion_ids` sequence contains only the **last** round's
user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` … `151645`) therefore
catches **only the final round's** assistant tokens. Intermediate grounding rounds
produce **no gradient** β€” they only decide which crop the final round sees. So a 4-round
rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
"no-history" ablation, but it sharply reduces training signal per rollout; expect slower
learning than baseline at equal steps.

---

## `r1-v-no123-match-round` β€” pure CoT, no bbox/crop, keep r1-v's adaptive cap

Strips all bounding-box / crop logic; the model just thinks β†’ maybe answers β†’ if not, is
re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`).

- **Prompt** (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no
  further thinking is needed, provide `<answer>`." Format: `<think>…</think>` or
  `<think>…</think><answer>…</answer>`.
- **Format reward:** `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no
  `<answer>`); `pattern_stage2` keeps `<think>…</think><answer>…</answer>`.
- **Trainer surgery:** deletes `bbox_adjust`, `cal_bbox_for_iou`,
  `_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2`
  β†’ `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh
  text-only user turn; no image added β€” original image stays in history once);
  `_generate_for_stage2_batch` passes `images=None` when there are none.
- **`grpo.py`:** deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`,
  `bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries
  trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes.
- **`prepare_data.py`:** `item.pop('bboxs', None)`; checked-in jsonl already has bboxs
  stripped.

**Hypothesis:** is the "think β†’ maybe answer, else re-ask" outer loop alone (no grounding)
competitive with the full pipeline? Clean; no bug found.

---

## `r1-v-no123` β€” pure CoT, forced exactly 2 rounds

Same bbox/crop removal as `match-round`, but **does not** mimic r1-v's adaptive loop.
Always two passes.

- **Two distinct prompts:** `STAGE_ONE_TEMPLATE` ("think only … **do not provide the
  final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer").
- **Loop:** round 1 generates `<think>…</think>` only; round 2 appends that as assistant +
  a `STAGE_TWO_TEMPLATE` user turn β†’ one final `generate` for all. Exactly 2 forward
  passes, no `<answer>`/iter early-exit.
- Same deletions, registries, resume + `save_state` fixes as `match-round`.

**Hypothesis:** is a fixed think→rethink+answer schedule enough (vs adaptive)? Mechanically
the simplest of the set; hard to break. No bug found.

---

## `r1-v-no23-1round` β€” fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer

Asks for thinking, one bbox, and the final answer in a **single** response. No image
cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is
missing.

- **Single fused prompt:** "...provide one bounding box [x1,y1,x2,y2] inside `<box>` …
  Then directly provide the final answer inside `<answer>`." Format example:
  `<think>…</think> <box>[…]</box> <answer>…</answer>`.
- **Format reward:** `pattern_stage1` (intermediate) accepts `<think>…</think><box>[…]</box>`
  with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full
  `<think>…</think><box>[…]</box><answer>…</answer>`. Both add single-occurrence
  anti-duplication lookaheads.
- **Loop:** stage-1 generate; if `<answer>` present β†’ finalize; else
  `_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh
  user turn that **re-includes the original image** and the same prompt; loop to
  `<answer>` or iter==4.
- All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
  same resume + `save_state` fixes.

**Hypothesis:** does emitting the box and answer in one shot (no separate crop trajectory)
match the multi-round pipeline?

**Risk found β€” wasteful image re-injection:** every retry appends **another copy of the
same original image** (both an `{"type":"image"}` placeholder and the image in the per-
rollout image list). With `max_pixels=401408`, each image β‰ˆ 250 tokens, so after 4 retries
the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting
`max_prompt_length=8192` (left-truncation could drop the leading image position) and
extra memory/time on long retries. **Suggested fix:** pass the image only on the first
turn (`include_image=False` on retries) β€” the chat already shows it once. Not a
correctness bug, but a real perf/robustness footgun.

---

## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)

`r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`,
`accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over
`zip(contents, solution)` but route the dataset handler via **`dataset[0]`** and log
**`problem_id[0]`** β€” the batch-wide kwargs collapsed to the first sample. For any batch
spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts),
every sample gets the **first sample's** handler β†’ silently wrong rewards (no crash). The
shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt.

All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and
defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth
upstreaming to baseline `r1-v`.

---

## Shared mechanical fixes in every fork

- Dead hardcoded resume path (`/home/meng/GRPO/...`) β†’ `get_last_checkpoint(output_dir)`
  (baseline would never actually resume on a fresh box).
- `trainer.save_state(output_dir)` β†’ `trainer.save_state()` (kwarg-less is the supported
  Trainer API).
- `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround
  for DeepSpeed resume β€” the other forks should copy it if they need to resume under
  torch β‰₯ 2.6.

---

## Bottom line

- **Clean:** `r1-v-no123`, `r1-v-no123-match-round`.
- **By-design caveat:** `r1-v-no-hist` β€” only the final round contributes gradient.
- **Perf footgun:** `r1-v-no23-1round` β€” duplicate original image re-sent on every retry.
- **Real bug (baseline):** per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`;
  fixed in the forks, should go upstream.