Upload ABLATION_NOTES.md with huggingface_hub
Browse files- ABLATION_NOTES.md +166 -0
ABLATION_NOTES.md
ADDED
|
@@ -0,0 +1,166 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Ground-R1 `rl-projects` β Ablation Branch Notes
|
| 2 |
+
|
| 3 |
+
Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch
|
| 4 |
+
(`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the
|
| 5 |
+
baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what
|
| 6 |
+
hypothesis it isolates, and any bugs/risks found.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## At a glance
|
| 11 |
+
|
| 12 |
+
| Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds |
|
| 13 |
+
|---|---|---|---|---|
|
| 14 |
+
| `r1-v` (baseline) | yes (`<box>` + crop + 2nd image) | adaptive: stop on `<answer>` or iter==4 | original + each crop | full conversation |
|
| 15 |
+
| `r1-v-no-hist` | yes (still crops) | adaptive (same cap) | **only the latest crop** | **reset each round** |
|
| 16 |
+
| `r1-v-no123-match-round` | **no** | adaptive (same cap) | original only | full conversation |
|
| 17 |
+
| `r1-v-no123` | **no** | **fixed: exactly 2 rounds** | original only | full conversation |
|
| 18 |
+
| `r1-v-no23-1round` | `<box>`+`<answer>` in one turn | adaptive (same cap), retry only on missing `<answer>` | original re-sent every retry | full conversation |
|
| 19 |
+
|
| 20 |
+
The naming reads as: **no-hist** = remove cross-round memory; **no123** = remove the three
|
| 21 |
+
grounding pieces (bbox-step, crop, second image); **match-round** = keep r1-v's adaptive
|
| 22 |
+
iteration cap; **1round** = intended single fused turn.
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## `r1-v-no-hist` β drop conversation history between grounding rounds
|
| 27 |
+
|
| 28 |
+
Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws
|
| 29 |
+
away prior turns and shows the model only the cropped image + the next-round prompt.
|
| 30 |
+
|
| 31 |
+
- **Trainer `_prepare_for_stage2`** is the only behavior change:
|
| 32 |
+
- before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`,
|
| 33 |
+
`combined_images = [original, crop]`
|
| 34 |
+
- after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]`
|
| 35 |
+
- **`grpo.py`** resume hardening (not algorithmic): replaces the dead hardcoded
|
| 36 |
+
`/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a
|
| 37 |
+
`configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`,
|
| 38 |
+
registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+),
|
| 39 |
+
and fixes `trainer.save_state(output_dir)` β `trainer.save_state()`.
|
| 40 |
+
|
| 41 |
+
**Hypothesis:** does iterative grounding still work when the model can't see what it
|
| 42 |
+
grounded before?
|
| 43 |
+
|
| 44 |
+
**Risk / note (by design, easy to overlook):** because the prompt is reset every round,
|
| 45 |
+
the final `prompt_completion_ids` sequence contains only the **last** round's
|
| 46 |
+
user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` β¦ `151645`) therefore
|
| 47 |
+
catches **only the final round's** assistant tokens. Intermediate grounding rounds
|
| 48 |
+
produce **no gradient** β they only decide which crop the final round sees. So a 4-round
|
| 49 |
+
rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
|
| 50 |
+
"no-history" ablation, but it sharply reduces training signal per rollout; expect slower
|
| 51 |
+
learning than baseline at equal steps.
|
| 52 |
+
|
| 53 |
+
---
|
| 54 |
+
|
| 55 |
+
## `r1-v-no123-match-round` β pure CoT, no bbox/crop, keep r1-v's adaptive cap
|
| 56 |
+
|
| 57 |
+
Strips all bounding-box / crop logic; the model just thinks β maybe answers β if not, is
|
| 58 |
+
re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`).
|
| 59 |
+
|
| 60 |
+
- **Prompt** (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no
|
| 61 |
+
further thinking is needed, provide `<answer>`." Format: `<think>β¦</think>` or
|
| 62 |
+
`<think>β¦</think><answer>β¦</answer>`.
|
| 63 |
+
- **Format reward:** `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no
|
| 64 |
+
`<answer>`); `pattern_stage2` keeps `<think>β¦</think><answer>β¦</answer>`.
|
| 65 |
+
- **Trainer surgery:** deletes `bbox_adjust`, `cal_bbox_for_iou`,
|
| 66 |
+
`_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2`
|
| 67 |
+
β `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh
|
| 68 |
+
text-only user turn; no image added β original image stays in history once);
|
| 69 |
+
`_generate_for_stage2_batch` passes `images=None` when there are none.
|
| 70 |
+
- **`grpo.py`:** deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`,
|
| 71 |
+
`bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries
|
| 72 |
+
trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes.
|
| 73 |
+
- **`prepare_data.py`:** `item.pop('bboxs', None)`; checked-in jsonl already has bboxs
|
| 74 |
+
stripped.
|
| 75 |
+
|
| 76 |
+
**Hypothesis:** is the "think β maybe answer, else re-ask" outer loop alone (no grounding)
|
| 77 |
+
competitive with the full pipeline? Clean; no bug found.
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
## `r1-v-no123` β pure CoT, forced exactly 2 rounds
|
| 82 |
+
|
| 83 |
+
Same bbox/crop removal as `match-round`, but **does not** mimic r1-v's adaptive loop.
|
| 84 |
+
Always two passes.
|
| 85 |
+
|
| 86 |
+
- **Two distinct prompts:** `STAGE_ONE_TEMPLATE` ("think only β¦ **do not provide the
|
| 87 |
+
final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer").
|
| 88 |
+
- **Loop:** round 1 generates `<think>β¦</think>` only; round 2 appends that as assistant +
|
| 89 |
+
a `STAGE_TWO_TEMPLATE` user turn β one final `generate` for all. Exactly 2 forward
|
| 90 |
+
passes, no `<answer>`/iter early-exit.
|
| 91 |
+
- Same deletions, registries, resume + `save_state` fixes as `match-round`.
|
| 92 |
+
|
| 93 |
+
**Hypothesis:** is a fixed thinkβrethink+answer schedule enough (vs adaptive)? Mechanically
|
| 94 |
+
the simplest of the set; hard to break. No bug found.
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
+
|
| 98 |
+
## `r1-v-no23-1round` β fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer
|
| 99 |
+
|
| 100 |
+
Asks for thinking, one bbox, and the final answer in a **single** response. No image
|
| 101 |
+
cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is
|
| 102 |
+
missing.
|
| 103 |
+
|
| 104 |
+
- **Single fused prompt:** "...provide one bounding box [x1,y1,x2,y2] inside `<box>` β¦
|
| 105 |
+
Then directly provide the final answer inside `<answer>`." Format example:
|
| 106 |
+
`<think>β¦</think> <box>[β¦]</box> <answer>β¦</answer>`.
|
| 107 |
+
- **Format reward:** `pattern_stage1` (intermediate) accepts `<think>β¦</think><box>[β¦]</box>`
|
| 108 |
+
with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full
|
| 109 |
+
`<think>β¦</think><box>[β¦]</box><answer>β¦</answer>`. Both add single-occurrence
|
| 110 |
+
anti-duplication lookaheads.
|
| 111 |
+
- **Loop:** stage-1 generate; if `<answer>` present β finalize; else
|
| 112 |
+
`_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh
|
| 113 |
+
user turn that **re-includes the original image** and the same prompt; loop to
|
| 114 |
+
`<answer>` or iter==4.
|
| 115 |
+
- All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
|
| 116 |
+
same resume + `save_state` fixes.
|
| 117 |
+
|
| 118 |
+
**Hypothesis:** does emitting the box and answer in one shot (no separate crop trajectory)
|
| 119 |
+
match the multi-round pipeline?
|
| 120 |
+
|
| 121 |
+
**Risk found β wasteful image re-injection:** every retry appends **another copy of the
|
| 122 |
+
same original image** (both an `{"type":"image"}` placeholder and the image in the per-
|
| 123 |
+
rollout image list). With `max_pixels=401408`, each image β 250 tokens, so after 4 retries
|
| 124 |
+
the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting
|
| 125 |
+
`max_prompt_length=8192` (left-truncation could drop the leading image position) and
|
| 126 |
+
extra memory/time on long retries. **Suggested fix:** pass the image only on the first
|
| 127 |
+
turn (`include_image=False` on retries) β the chat already shows it once. Not a
|
| 128 |
+
correctness bug, but a real perf/robustness footgun.
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
+
## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)
|
| 133 |
+
|
| 134 |
+
`r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`,
|
| 135 |
+
`accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over
|
| 136 |
+
`zip(contents, solution)` but route the dataset handler via **`dataset[0]`** and log
|
| 137 |
+
**`problem_id[0]`** β the batch-wide kwargs collapsed to the first sample. For any batch
|
| 138 |
+
spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts),
|
| 139 |
+
every sample gets the **first sample's** handler β silently wrong rewards (no crash). The
|
| 140 |
+
shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt.
|
| 141 |
+
|
| 142 |
+
All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and
|
| 143 |
+
defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth
|
| 144 |
+
upstreaming to baseline `r1-v`.
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## Shared mechanical fixes in every fork
|
| 149 |
+
|
| 150 |
+
- Dead hardcoded resume path (`/home/meng/GRPO/...`) β `get_last_checkpoint(output_dir)`
|
| 151 |
+
(baseline would never actually resume on a fresh box).
|
| 152 |
+
- `trainer.save_state(output_dir)` β `trainer.save_state()` (kwarg-less is the supported
|
| 153 |
+
Trainer API).
|
| 154 |
+
- `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround
|
| 155 |
+
for DeepSpeed resume β the other forks should copy it if they need to resume under
|
| 156 |
+
torch β₯ 2.6.
|
| 157 |
+
|
| 158 |
+
---
|
| 159 |
+
|
| 160 |
+
## Bottom line
|
| 161 |
+
|
| 162 |
+
- **Clean:** `r1-v-no123`, `r1-v-no123-match-round`.
|
| 163 |
+
- **By-design caveat:** `r1-v-no-hist` β only the final round contributes gradient.
|
| 164 |
+
- **Perf footgun:** `r1-v-no23-1round` β duplicate original image re-sent on every retry.
|
| 165 |
+
- **Real bug (baseline):** per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`;
|
| 166 |
+
fixed in the forks, should go upstream.
|