Ground-R1 `rl-projects` — Ablation Branch Notes

Notes from a code read of the experimental forks of r1-v/ on the rl-projects branch (Irisicy4/Ground-R1-project). Each branch keeps a sibling copy of the trainer so the baseline r1-v/ stays untouched. Below: what each variant changes vs r1-v, what hypothesis it isolates, and any bugs/risks found.

At a glance

Branch	Bbox / crop	Round structure	Image(s) carried across rounds	History across rounds
`r1-v` (baseline)	yes (`<box>` + crop + 2nd image)	adaptive: stop on `<answer>` or iter==4	original + each crop	full conversation
`r1-v-no-hist`	yes (still crops)	adaptive (same cap)	only the latest crop	reset each round
`r1-v-no123-match-round`	no	adaptive (same cap)	original only	full conversation
`r1-v-no123`	no	fixed: exactly 2 rounds	original only	full conversation
`r1-v-no23-1round`	`<box>`+`<answer>` in one turn	adaptive (same cap), retry only on missing `<answer>`	original re-sent every retry	full conversation

The naming reads as: no-hist = remove cross-round memory; no123 = remove the three grounding pieces (bbox-step, crop, second image); match-round = keep r1-v's adaptive iteration cap; 1round = intended single fused turn.

`r1-v-no-hist` — drop conversation history between grounding rounds

Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws away prior turns and shows the model only the cropped image + the next-round prompt.

Trainer _prepare_for_stage2 is the only behavior change:
- before: origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop]), combined_images = [original, crop]
- after: origin_prompt = [next_stage_entry], combined_images = [crop]
grpo.py resume hardening (not algorithmic): replaces the dead hardcoded /home/meng/GRPO/... resume path with get_last_checkpoint(output_dir), adds a configure_torch_checkpoint_resume() helper (sets TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1, registers ZeroStageEnum as a torch.serialization safe global for PyTorch 2.6+), and fixes trainer.save_state(output_dir) → trainer.save_state().

Hypothesis: does iterative grounding still work when the model can't see what it grounded before?

Risk / note (by design, easy to overlook): because the prompt is reset every round, the final prompt_completion_ids sequence contains only the last round's user+assistant turns. The loss-mask scan ([151644, 77091, 198] … 151645) therefore catches only the final round's assistant tokens. Intermediate grounding rounds produce no gradient — they only decide which crop the final round sees. So a 4-round rollout yields the same per-rollout RL signal as a 1-round one. Intended for the "no-history" ablation, but it sharply reduces training signal per rollout; expect slower learning than baseline at equal steps.

`r1-v-no123-match-round` — pure CoT, no bbox/crop, keep r1-v's adaptive cap

Strips all bounding-box / crop logic; the model just thinks → maybe answers → if not, is re-asked the same prompt. Up to 4 extra rounds (same cap as r1-v).

Prompt (grpo.py + trainer STAGE_PROMPT_TEMPLATE): no <box> language; "if no further thinking is needed, provide <answer>." Format: <think>…</think> or <think>…</think><answer>…</answer>.
Format reward: pattern_stage1 = ^<think>(.+?)</think>$ (no <box>, no <answer>); pattern_stage2 keeps <think>…</think><answer>…</answer>.
Trainer surgery: deletes bbox_adjust, cal_bbox_for_iou, _crop_image_for_next_stage, _get_bbox_for_last_stage; renames _prepare_for_stage2 → _prepare_for_next_round (appends {assistant: previous_response} + a fresh text-only user turn; no image added — original image stays in history once); _generate_for_stage2_batch passes images=None when there are none.
grpo.py: deletes all bbox/IoU reward+score fns (compute_iou, compute_giou, bbox_reward_stage2, bbox_score_stage{1,2}, bbox_iou_stage{1,2,3}); registries trimmed to {accuracy, format} / {refine_times}. Same resume + save_state fixes.
prepare_data.py: item.pop('bboxs', None); checked-in jsonl already has bboxs stripped.

Hypothesis: is the "think → maybe answer, else re-ask" outer loop alone (no grounding) competitive with the full pipeline? Clean; no bug found.

`r1-v-no123` — pure CoT, forced exactly 2 rounds

Same bbox/crop removal as match-round, but does not mimic r1-v's adaptive loop. Always two passes.

Two distinct prompts: STAGE_ONE_TEMPLATE ("think only … do not provide the final answer yet"); STAGE_TWO_TEMPLATE ("rethink using image + history, then answer").
Loop: round 1 generates <think>…</think> only; round 2 appends that as assistant + a STAGE_TWO_TEMPLATE user turn → one final generate for all. Exactly 2 forward passes, no <answer>/iter early-exit.
Same deletions, registries, resume + save_state fixes as match-round.

Hypothesis: is a fixed think→rethink+answer schedule enough (vs adaptive)? Mechanically the simplest of the set; hard to break. No bug found.

`r1-v-no23-1round` — fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer

Asks for thinking, one bbox, and the final answer in a single response. No image cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is missing.

Single fused prompt: "...provide one bounding box [x1,y1,x2,y2] inside <box> … Then directly provide the final answer inside <answer>." Format example: <think>…</think> <box>[…]</box> <answer>…</answer>.
Format reward: pattern_stage1 (intermediate) accepts <think>…</think><box>[…]</box> with (?!.*<answer>); pattern_stage2 (final) requires the full <think>…</think><box>[…]</box><answer>…</answer>. Both add single-occurrence anti-duplication lookaheads.
Loop: stage-1 generate; if <answer> present → finalize; else _prepare_for_next_round(..., include_image=True) appends the failed response + a fresh user turn that re-includes the original image and the same prompt; loop to <answer> or iter==4.
All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped; same resume + save_state fixes.

Hypothesis: does emitting the box and answer in one shot (no separate crop trajectory) match the multi-round pipeline?

Risk found — wasteful image re-injection: every retry appends another copy of the same original image (both an {"type":"image"} placeholder and the image in the per- rollout image list). With max_pixels=401408, each image ≈ 250 tokens, so after 4 retries the prompt carries ~~5 identical images (~~1.25k tokens of duplicate). Risks: hitting max_prompt_length=8192 (left-truncation could drop the leading image position) and extra memory/time on long retries. Suggested fix: pass the image only on the first turn (include_image=False on retries) — the chat already shows it once. Not a correctness bug, but a real perf/robustness footgun.

Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)

r1-v/src/open_r1/grpo.py reward/score fns (accuracy_reward_stage2, accuracy_score_stage1, accuracy_score_stage2) iterate per-sample over zip(contents, solution) but route the dataset handler via dataset[0] and log problem_id[0] — the batch-wide kwargs collapsed to the first sample. For any batch spanning >1 dataset (per_device_train_batch_size>1, or grad-accum across prompts), every sample gets the first sample's handler → silently wrong rewards (no crash). The shipped config (per_device=1, accum=1) masks it because a GRPO group is one prompt.

All four no-* forks fix it by iterating cur_dataset/cur_problem_id from the zip and defaulting reward, student_answer = 0.0, "" in the missing-handler branch. Worth upstreaming to baseline r1-v.

Shared mechanical fixes in every fork

Dead hardcoded resume path (/home/meng/GRPO/...) → get_last_checkpoint(output_dir) (baseline would never actually resume on a fresh box).
trainer.save_state(output_dir) → trainer.save_state() (kwarg-less is the supported Trainer API).
r1-v-no-hist additionally adds the PyTorch-2.6+ torch.load weights-only workaround for DeepSpeed resume — the other forks should copy it if they need to resume under torch ≥ 2.6.

Bottom line

Clean: r1-v-no123, r1-v-no123-match-round.
By-design caveat: r1-v-no-hist — only the final round contributes gradient.
Perf footgun: r1-v-no23-1round — duplicate original image re-sent on every retry.
Real bug (baseline): per-batch reward mis-routing via dataset[0]/problem_id[0]; fixed in the forks, should go upstream.

Ground-R1 rl-projects — Ablation Branch Notes

At a glance

r1-v-no-hist — drop conversation history between grounding rounds

r1-v-no123-match-round — pure CoT, no bbox/crop, keep r1-v's adaptive cap

r1-v-no123 — pure CoT, forced exactly 2 rounds

r1-v-no23-1round — fused single turn (<think>+<box>+<answer>), retry on missing answer

Cross-cutting: a real pre-existing bug in baseline r1-v (fixed in all four forks)