Ground-R1 rl-projects β Ablation Branch Notes
Notes from a code read of the experimental forks of r1-v/ on the rl-projects branch
(Irisicy4/Ground-R1-project). Each branch keeps a sibling copy of the trainer so the
baseline r1-v/ stays untouched. Below: what each variant changes vs r1-v, what
hypothesis it isolates, and any bugs/risks found.
At a glance
| Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds |
|---|---|---|---|---|
r1-v (baseline) |
yes (<box> + crop + 2nd image) |
adaptive: stop on <answer> or iter==4 |
original + each crop | full conversation |
r1-v-no-hist |
yes (still crops) | adaptive (same cap) | only the latest crop | reset each round |
r1-v-no123-match-round |
no | adaptive (same cap) | original only | full conversation |
r1-v-no123 |
no | fixed: exactly 2 rounds | original only | full conversation |
r1-v-no23-1round |
<box>+<answer> in one turn |
adaptive (same cap), retry only on missing <answer> |
original re-sent every retry | full conversation |
The naming reads as: no-hist = remove cross-round memory; no123 = remove the three grounding pieces (bbox-step, crop, second image); match-round = keep r1-v's adaptive iteration cap; 1round = intended single fused turn.
r1-v-no-hist β drop conversation history between grounding rounds
Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws away prior turns and shows the model only the cropped image + the next-round prompt.
- Trainer
_prepare_for_stage2is the only behavior change:- before:
origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop]),combined_images = [original, crop] - after:
origin_prompt = [next_stage_entry],combined_images = [crop]
- before:
grpo.pyresume hardening (not algorithmic): replaces the dead hardcoded/home/meng/GRPO/...resume path withget_last_checkpoint(output_dir), adds aconfigure_torch_checkpoint_resume()helper (setsTORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1, registersZeroStageEnumas atorch.serializationsafe global for PyTorch 2.6+), and fixestrainer.save_state(output_dir)βtrainer.save_state().
Hypothesis: does iterative grounding still work when the model can't see what it grounded before?
Risk / note (by design, easy to overlook): because the prompt is reset every round,
the final prompt_completion_ids sequence contains only the last round's
user+assistant turns. The loss-mask scan ([151644, 77091, 198] β¦ 151645) therefore
catches only the final round's assistant tokens. Intermediate grounding rounds
produce no gradient β they only decide which crop the final round sees. So a 4-round
rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
"no-history" ablation, but it sharply reduces training signal per rollout; expect slower
learning than baseline at equal steps.
r1-v-no123-match-round β pure CoT, no bbox/crop, keep r1-v's adaptive cap
Strips all bounding-box / crop logic; the model just thinks β maybe answers β if not, is
re-asked the same prompt. Up to 4 extra rounds (same cap as r1-v).
- Prompt (
grpo.py+ trainerSTAGE_PROMPT_TEMPLATE): no<box>language; "if no further thinking is needed, provide<answer>." Format:<think>β¦</think>or<think>β¦</think><answer>β¦</answer>. - Format reward:
pattern_stage1=^<think>(.+?)</think>$(no<box>, no<answer>);pattern_stage2keeps<think>β¦</think><answer>β¦</answer>. - Trainer surgery: deletes
bbox_adjust,cal_bbox_for_iou,_crop_image_for_next_stage,_get_bbox_for_last_stage; renames_prepare_for_stage2β_prepare_for_next_round(appends{assistant: previous_response}+ a fresh text-only user turn; no image added β original image stays in history once);_generate_for_stage2_batchpassesimages=Nonewhen there are none. grpo.py: deletes all bbox/IoU reward+score fns (compute_iou,compute_giou,bbox_reward_stage2,bbox_score_stage{1,2},bbox_iou_stage{1,2,3}); registries trimmed to{accuracy, format}/{refine_times}. Same resume +save_statefixes.prepare_data.py:item.pop('bboxs', None); checked-in jsonl already has bboxs stripped.
Hypothesis: is the "think β maybe answer, else re-ask" outer loop alone (no grounding) competitive with the full pipeline? Clean; no bug found.
r1-v-no123 β pure CoT, forced exactly 2 rounds
Same bbox/crop removal as match-round, but does not mimic r1-v's adaptive loop.
Always two passes.
- Two distinct prompts:
STAGE_ONE_TEMPLATE("think only β¦ do not provide the final answer yet");STAGE_TWO_TEMPLATE("rethink using image + history, then answer"). - Loop: round 1 generates
<think>β¦</think>only; round 2 appends that as assistant + aSTAGE_TWO_TEMPLATEuser turn β one finalgeneratefor all. Exactly 2 forward passes, no<answer>/iter early-exit. - Same deletions, registries, resume +
save_statefixes asmatch-round.
Hypothesis: is a fixed thinkβrethink+answer schedule enough (vs adaptive)? Mechanically the simplest of the set; hard to break. No bug found.
r1-v-no23-1round β fused single turn (<think>+<box>+<answer>), retry on missing answer
Asks for thinking, one bbox, and the final answer in a single response. No image cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is missing.
- Single fused prompt: "...provide one bounding box [x1,y1,x2,y2] inside
<box>β¦ Then directly provide the final answer inside<answer>." Format example:<think>β¦</think> <box>[β¦]</box> <answer>β¦</answer>. - Format reward:
pattern_stage1(intermediate) accepts<think>β¦</think><box>[β¦]</box>with(?!.*<answer>);pattern_stage2(final) requires the full<think>β¦</think><box>[β¦]</box><answer>β¦</answer>. Both add single-occurrence anti-duplication lookaheads. - Loop: stage-1 generate; if
<answer>present β finalize; else_prepare_for_next_round(..., include_image=True)appends the failed response + a fresh user turn that re-includes the original image and the same prompt; loop to<answer>or iter==4. - All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
same resume +
save_statefixes.
Hypothesis: does emitting the box and answer in one shot (no separate crop trajectory) match the multi-round pipeline?
Risk found β wasteful image re-injection: every retry appends another copy of the
same original image (both an {"type":"image"} placeholder and the image in the per-
rollout image list). With max_pixels=401408, each image β 250 tokens, so after 4 retries
the prompt carries 5 identical images (1.25k tokens of duplicate). Risks: hitting
max_prompt_length=8192 (left-truncation could drop the leading image position) and
extra memory/time on long retries. Suggested fix: pass the image only on the first
turn (include_image=False on retries) β the chat already shows it once. Not a
correctness bug, but a real perf/robustness footgun.
Cross-cutting: a real pre-existing bug in baseline r1-v (fixed in all four forks)
r1-v/src/open_r1/grpo.py reward/score fns (accuracy_reward_stage2,
accuracy_score_stage1, accuracy_score_stage2) iterate per-sample over
zip(contents, solution) but route the dataset handler via dataset[0] and log
problem_id[0] β the batch-wide kwargs collapsed to the first sample. For any batch
spanning >1 dataset (per_device_train_batch_size>1, or grad-accum across prompts),
every sample gets the first sample's handler β silently wrong rewards (no crash). The
shipped config (per_device=1, accum=1) masks it because a GRPO group is one prompt.
All four no-* forks fix it by iterating cur_dataset/cur_problem_id from the zip and
defaulting reward, student_answer = 0.0, "" in the missing-handler branch. Worth
upstreaming to baseline r1-v.
Shared mechanical fixes in every fork
- Dead hardcoded resume path (
/home/meng/GRPO/...) βget_last_checkpoint(output_dir)(baseline would never actually resume on a fresh box). trainer.save_state(output_dir)βtrainer.save_state()(kwarg-less is the supported Trainer API).r1-v-no-histadditionally adds the PyTorch-2.6+torch.loadweights-only workaround for DeepSpeed resume β the other forks should copy it if they need to resume under torch β₯ 2.6.
Bottom line
- Clean:
r1-v-no123,r1-v-no123-match-round. - By-design caveat:
r1-v-no-histβ only the final round contributes gradient. - Perf footgun:
r1-v-no23-1roundβ duplicate original image re-sent on every retry. - Real bug (baseline): per-batch reward mis-routing via
dataset[0]/problem_id[0]; fixed in the forks, should go upstream.