| # Ground-R1 `rl-projects` β Ablation Branch Notes |
|
|
| Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch |
| (`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the |
| baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what |
| hypothesis it isolates, and any bugs/risks found. |
|
|
| --- |
|
|
| ## At a glance |
|
|
| | Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds | |
| |---|---|---|---|---| |
| | `r1-v` (baseline) | yes (`<box>` + crop + 2nd image) | adaptive: stop on `<answer>` or iter==4 | original + each crop | full conversation | |
| | `r1-v-no-hist` | yes (still crops) | adaptive (same cap) | **only the latest crop** | **reset each round** | |
| | `r1-v-no123-match-round` | **no** | adaptive (same cap) | original only | full conversation | |
| | `r1-v-no123` | **no** | **fixed: exactly 2 rounds** | original only | full conversation | |
| | `r1-v-no23-1round` | `<box>`+`<answer>` in one turn | adaptive (same cap), retry only on missing `<answer>` | original re-sent every retry | full conversation | |
|
|
| The naming reads as: **no-hist** = remove cross-round memory; **no123** = remove the three |
| grounding pieces (bbox-step, crop, second image); **match-round** = keep r1-v's adaptive |
| iteration cap; **1round** = intended single fused turn. |
|
|
| --- |
|
|
| ## `r1-v-no-hist` β drop conversation history between grounding rounds |
|
|
| Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws |
| away prior turns and shows the model only the cropped image + the next-round prompt. |
|
|
| - **Trainer `_prepare_for_stage2`** is the only behavior change: |
| - before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`, |
| `combined_images = [original, crop]` |
| - after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]` |
| - **`grpo.py`** resume hardening (not algorithmic): replaces the dead hardcoded |
| `/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a |
| `configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`, |
| registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+), |
| and fixes `trainer.save_state(output_dir)` β `trainer.save_state()`. |
| |
| **Hypothesis:** does iterative grounding still work when the model can't see what it |
| grounded before? |
| |
| **Risk / note (by design, easy to overlook):** because the prompt is reset every round, |
| the final `prompt_completion_ids` sequence contains only the **last** round's |
| user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` β¦ `151645`) therefore |
| catches **only the final round's** assistant tokens. Intermediate grounding rounds |
| produce **no gradient** β they only decide which crop the final round sees. So a 4-round |
| rollout yields the same per-rollout RL signal as a 1-round one. Intended for the |
| "no-history" ablation, but it sharply reduces training signal per rollout; expect slower |
| learning than baseline at equal steps. |
| |
| --- |
| |
| ## `r1-v-no123-match-round` β pure CoT, no bbox/crop, keep r1-v's adaptive cap |
| |
| Strips all bounding-box / crop logic; the model just thinks β maybe answers β if not, is |
| re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`). |
| |
| - **Prompt** (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no |
| further thinking is needed, provide `<answer>`." Format: `<think>β¦</think>` or |
| `<think>β¦</think><answer>β¦</answer>`. |
| - **Format reward:** `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no |
| `<answer>`); `pattern_stage2` keeps `<think>β¦</think><answer>β¦</answer>`. |
| - **Trainer surgery:** deletes `bbox_adjust`, `cal_bbox_for_iou`, |
| `_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2` |
| β `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh |
| text-only user turn; no image added β original image stays in history once); |
| `_generate_for_stage2_batch` passes `images=None` when there are none. |
| - **`grpo.py`:** deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`, |
| `bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries |
| trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes. |
| - **`prepare_data.py`:** `item.pop('bboxs', None)`; checked-in jsonl already has bboxs |
| stripped. |
|
|
| **Hypothesis:** is the "think β maybe answer, else re-ask" outer loop alone (no grounding) |
| competitive with the full pipeline? Clean; no bug found. |
|
|
| --- |
|
|
| ## `r1-v-no123` β pure CoT, forced exactly 2 rounds |
|
|
| Same bbox/crop removal as `match-round`, but **does not** mimic r1-v's adaptive loop. |
| Always two passes. |
|
|
| - **Two distinct prompts:** `STAGE_ONE_TEMPLATE` ("think only β¦ **do not provide the |
| final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer"). |
| - **Loop:** round 1 generates `<think>β¦</think>` only; round 2 appends that as assistant + |
| a `STAGE_TWO_TEMPLATE` user turn β one final `generate` for all. Exactly 2 forward |
| passes, no `<answer>`/iter early-exit. |
| - Same deletions, registries, resume + `save_state` fixes as `match-round`. |
|
|
| **Hypothesis:** is a fixed thinkβrethink+answer schedule enough (vs adaptive)? Mechanically |
| the simplest of the set; hard to break. No bug found. |
|
|
| --- |
|
|
| ## `r1-v-no23-1round` β fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer |
|
|
| Asks for thinking, one bbox, and the final answer in a **single** response. No image |
| cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is |
| missing. |
|
|
| - **Single fused prompt:** "...provide one bounding box [x1,y1,x2,y2] inside `<box>` β¦ |
| Then directly provide the final answer inside `<answer>`." Format example: |
| `<think>β¦</think> <box>[β¦]</box> <answer>β¦</answer>`. |
| - **Format reward:** `pattern_stage1` (intermediate) accepts `<think>β¦</think><box>[β¦]</box>` |
| with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full |
| `<think>β¦</think><box>[β¦]</box><answer>β¦</answer>`. Both add single-occurrence |
| anti-duplication lookaheads. |
| - **Loop:** stage-1 generate; if `<answer>` present β finalize; else |
| `_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh |
| user turn that **re-includes the original image** and the same prompt; loop to |
| `<answer>` or iter==4. |
| - All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped; |
| same resume + `save_state` fixes. |
|
|
| **Hypothesis:** does emitting the box and answer in one shot (no separate crop trajectory) |
| match the multi-round pipeline? |
|
|
| **Risk found β wasteful image re-injection:** every retry appends **another copy of the |
| same original image** (both an `{"type":"image"}` placeholder and the image in the per- |
| rollout image list). With `max_pixels=401408`, each image β 250 tokens, so after 4 retries |
| the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting |
| `max_prompt_length=8192` (left-truncation could drop the leading image position) and |
| extra memory/time on long retries. **Suggested fix:** pass the image only on the first |
| turn (`include_image=False` on retries) β the chat already shows it once. Not a |
| correctness bug, but a real perf/robustness footgun. |
|
|
| --- |
|
|
| ## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks) |
|
|
| `r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`, |
| `accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over |
| `zip(contents, solution)` but route the dataset handler via **`dataset[0]`** and log |
| **`problem_id[0]`** β the batch-wide kwargs collapsed to the first sample. For any batch |
| spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts), |
| every sample gets the **first sample's** handler β silently wrong rewards (no crash). The |
| shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt. |
| |
| All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and |
| defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth |
| upstreaming to baseline `r1-v`. |
| |
| --- |
| |
| ## Shared mechanical fixes in every fork |
| |
| - Dead hardcoded resume path (`/home/meng/GRPO/...`) β `get_last_checkpoint(output_dir)` |
| (baseline would never actually resume on a fresh box). |
| - `trainer.save_state(output_dir)` β `trainer.save_state()` (kwarg-less is the supported |
| Trainer API). |
| - `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround |
| for DeepSpeed resume β the other forks should copy it if they need to resume under |
| torch β₯ 2.6. |
| |
| --- |
| |
| ## Bottom line |
| |
| - **Clean:** `r1-v-no123`, `r1-v-no123-match-round`. |
| - **By-design caveat:** `r1-v-no-hist` β only the final round contributes gradient. |
| - **Perf footgun:** `r1-v-no23-1round` β duplicate original image re-sent on every retry. |
| - **Real bug (baseline):** per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`; |
| fixed in the forks, should go upstream. |
|
|