File size: 9,087 Bytes
7de483e | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 | # Ground-R1 `rl-projects` β Ablation Branch Notes
Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch
(`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the
baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what
hypothesis it isolates, and any bugs/risks found.
---
## At a glance
| Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds |
|---|---|---|---|---|
| `r1-v` (baseline) | yes (`<box>` + crop + 2nd image) | adaptive: stop on `<answer>` or iter==4 | original + each crop | full conversation |
| `r1-v-no-hist` | yes (still crops) | adaptive (same cap) | **only the latest crop** | **reset each round** |
| `r1-v-no123-match-round` | **no** | adaptive (same cap) | original only | full conversation |
| `r1-v-no123` | **no** | **fixed: exactly 2 rounds** | original only | full conversation |
| `r1-v-no23-1round` | `<box>`+`<answer>` in one turn | adaptive (same cap), retry only on missing `<answer>` | original re-sent every retry | full conversation |
The naming reads as: **no-hist** = remove cross-round memory; **no123** = remove the three
grounding pieces (bbox-step, crop, second image); **match-round** = keep r1-v's adaptive
iteration cap; **1round** = intended single fused turn.
---
## `r1-v-no-hist` β drop conversation history between grounding rounds
Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws
away prior turns and shows the model only the cropped image + the next-round prompt.
- **Trainer `_prepare_for_stage2`** is the only behavior change:
- before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`,
`combined_images = [original, crop]`
- after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]`
- **`grpo.py`** resume hardening (not algorithmic): replaces the dead hardcoded
`/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a
`configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`,
registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+),
and fixes `trainer.save_state(output_dir)` β `trainer.save_state()`.
**Hypothesis:** does iterative grounding still work when the model can't see what it
grounded before?
**Risk / note (by design, easy to overlook):** because the prompt is reset every round,
the final `prompt_completion_ids` sequence contains only the **last** round's
user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` β¦ `151645`) therefore
catches **only the final round's** assistant tokens. Intermediate grounding rounds
produce **no gradient** β they only decide which crop the final round sees. So a 4-round
rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
"no-history" ablation, but it sharply reduces training signal per rollout; expect slower
learning than baseline at equal steps.
---
## `r1-v-no123-match-round` β pure CoT, no bbox/crop, keep r1-v's adaptive cap
Strips all bounding-box / crop logic; the model just thinks β maybe answers β if not, is
re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`).
- **Prompt** (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no
further thinking is needed, provide `<answer>`." Format: `<think>β¦</think>` or
`<think>β¦</think><answer>β¦</answer>`.
- **Format reward:** `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no
`<answer>`); `pattern_stage2` keeps `<think>β¦</think><answer>β¦</answer>`.
- **Trainer surgery:** deletes `bbox_adjust`, `cal_bbox_for_iou`,
`_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2`
β `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh
text-only user turn; no image added β original image stays in history once);
`_generate_for_stage2_batch` passes `images=None` when there are none.
- **`grpo.py`:** deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`,
`bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries
trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes.
- **`prepare_data.py`:** `item.pop('bboxs', None)`; checked-in jsonl already has bboxs
stripped.
**Hypothesis:** is the "think β maybe answer, else re-ask" outer loop alone (no grounding)
competitive with the full pipeline? Clean; no bug found.
---
## `r1-v-no123` β pure CoT, forced exactly 2 rounds
Same bbox/crop removal as `match-round`, but **does not** mimic r1-v's adaptive loop.
Always two passes.
- **Two distinct prompts:** `STAGE_ONE_TEMPLATE` ("think only β¦ **do not provide the
final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer").
- **Loop:** round 1 generates `<think>β¦</think>` only; round 2 appends that as assistant +
a `STAGE_TWO_TEMPLATE` user turn β one final `generate` for all. Exactly 2 forward
passes, no `<answer>`/iter early-exit.
- Same deletions, registries, resume + `save_state` fixes as `match-round`.
**Hypothesis:** is a fixed thinkβrethink+answer schedule enough (vs adaptive)? Mechanically
the simplest of the set; hard to break. No bug found.
---
## `r1-v-no23-1round` β fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer
Asks for thinking, one bbox, and the final answer in a **single** response. No image
cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is
missing.
- **Single fused prompt:** "...provide one bounding box [x1,y1,x2,y2] inside `<box>` β¦
Then directly provide the final answer inside `<answer>`." Format example:
`<think>β¦</think> <box>[β¦]</box> <answer>β¦</answer>`.
- **Format reward:** `pattern_stage1` (intermediate) accepts `<think>β¦</think><box>[β¦]</box>`
with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full
`<think>β¦</think><box>[β¦]</box><answer>β¦</answer>`. Both add single-occurrence
anti-duplication lookaheads.
- **Loop:** stage-1 generate; if `<answer>` present β finalize; else
`_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh
user turn that **re-includes the original image** and the same prompt; loop to
`<answer>` or iter==4.
- All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
same resume + `save_state` fixes.
**Hypothesis:** does emitting the box and answer in one shot (no separate crop trajectory)
match the multi-round pipeline?
**Risk found β wasteful image re-injection:** every retry appends **another copy of the
same original image** (both an `{"type":"image"}` placeholder and the image in the per-
rollout image list). With `max_pixels=401408`, each image β 250 tokens, so after 4 retries
the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting
`max_prompt_length=8192` (left-truncation could drop the leading image position) and
extra memory/time on long retries. **Suggested fix:** pass the image only on the first
turn (`include_image=False` on retries) β the chat already shows it once. Not a
correctness bug, but a real perf/robustness footgun.
---
## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)
`r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`,
`accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over
`zip(contents, solution)` but route the dataset handler via **`dataset[0]`** and log
**`problem_id[0]`** β the batch-wide kwargs collapsed to the first sample. For any batch
spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts),
every sample gets the **first sample's** handler β silently wrong rewards (no crash). The
shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt.
All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and
defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth
upstreaming to baseline `r1-v`.
---
## Shared mechanical fixes in every fork
- Dead hardcoded resume path (`/home/meng/GRPO/...`) β `get_last_checkpoint(output_dir)`
(baseline would never actually resume on a fresh box).
- `trainer.save_state(output_dir)` β `trainer.save_state()` (kwarg-less is the supported
Trainer API).
- `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround
for DeepSpeed resume β the other forks should copy it if they need to resume under
torch β₯ 2.6.
---
## Bottom line
- **Clean:** `r1-v-no123`, `r1-v-no123-match-round`.
- **By-design caveat:** `r1-v-no-hist` β only the final round contributes gradient.
- **Perf footgun:** `r1-v-no23-1round` β duplicate original image re-sent on every retry.
- **Real bug (baseline):** per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`;
fixed in the forks, should go upstream.
|