tmp-score / ABLATION_NOTES.md
Icey444's picture
Upload ABLATION_NOTES.md with huggingface_hub
7de483e verified
|
Raw
History Blame Contribute Delete
9.09 kB

Ground-R1 rl-projects β€” Ablation Branch Notes

Notes from a code read of the experimental forks of r1-v/ on the rl-projects branch (Irisicy4/Ground-R1-project). Each branch keeps a sibling copy of the trainer so the baseline r1-v/ stays untouched. Below: what each variant changes vs r1-v, what hypothesis it isolates, and any bugs/risks found.


At a glance

Branch Bbox / crop Round structure Image(s) carried across rounds History across rounds
r1-v (baseline) yes (<box> + crop + 2nd image) adaptive: stop on <answer> or iter==4 original + each crop full conversation
r1-v-no-hist yes (still crops) adaptive (same cap) only the latest crop reset each round
r1-v-no123-match-round no adaptive (same cap) original only full conversation
r1-v-no123 no fixed: exactly 2 rounds original only full conversation
r1-v-no23-1round <box>+<answer> in one turn adaptive (same cap), retry only on missing <answer> original re-sent every retry full conversation

The naming reads as: no-hist = remove cross-round memory; no123 = remove the three grounding pieces (bbox-step, crop, second image); match-round = keep r1-v's adaptive iteration cap; 1round = intended single fused turn.


r1-v-no-hist β€” drop conversation history between grounding rounds

Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws away prior turns and shows the model only the cropped image + the next-round prompt.

  • Trainer _prepare_for_stage2 is the only behavior change:
    • before: origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop]), combined_images = [original, crop]
    • after: origin_prompt = [next_stage_entry], combined_images = [crop]
  • grpo.py resume hardening (not algorithmic): replaces the dead hardcoded /home/meng/GRPO/... resume path with get_last_checkpoint(output_dir), adds a configure_torch_checkpoint_resume() helper (sets TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1, registers ZeroStageEnum as a torch.serialization safe global for PyTorch 2.6+), and fixes trainer.save_state(output_dir) β†’ trainer.save_state().

Hypothesis: does iterative grounding still work when the model can't see what it grounded before?

Risk / note (by design, easy to overlook): because the prompt is reset every round, the final prompt_completion_ids sequence contains only the last round's user+assistant turns. The loss-mask scan ([151644, 77091, 198] … 151645) therefore catches only the final round's assistant tokens. Intermediate grounding rounds produce no gradient β€” they only decide which crop the final round sees. So a 4-round rollout yields the same per-rollout RL signal as a 1-round one. Intended for the "no-history" ablation, but it sharply reduces training signal per rollout; expect slower learning than baseline at equal steps.


r1-v-no123-match-round β€” pure CoT, no bbox/crop, keep r1-v's adaptive cap

Strips all bounding-box / crop logic; the model just thinks β†’ maybe answers β†’ if not, is re-asked the same prompt. Up to 4 extra rounds (same cap as r1-v).

  • Prompt (grpo.py + trainer STAGE_PROMPT_TEMPLATE): no <box> language; "if no further thinking is needed, provide <answer>." Format: <think>…</think> or <think>…</think><answer>…</answer>.
  • Format reward: pattern_stage1 = ^<think>(.+?)</think>$ (no <box>, no <answer>); pattern_stage2 keeps <think>…</think><answer>…</answer>.
  • Trainer surgery: deletes bbox_adjust, cal_bbox_for_iou, _crop_image_for_next_stage, _get_bbox_for_last_stage; renames _prepare_for_stage2 β†’ _prepare_for_next_round (appends {assistant: previous_response} + a fresh text-only user turn; no image added β€” original image stays in history once); _generate_for_stage2_batch passes images=None when there are none.
  • grpo.py: deletes all bbox/IoU reward+score fns (compute_iou, compute_giou, bbox_reward_stage2, bbox_score_stage{1,2}, bbox_iou_stage{1,2,3}); registries trimmed to {accuracy, format} / {refine_times}. Same resume + save_state fixes.
  • prepare_data.py: item.pop('bboxs', None); checked-in jsonl already has bboxs stripped.

Hypothesis: is the "think β†’ maybe answer, else re-ask" outer loop alone (no grounding) competitive with the full pipeline? Clean; no bug found.


r1-v-no123 β€” pure CoT, forced exactly 2 rounds

Same bbox/crop removal as match-round, but does not mimic r1-v's adaptive loop. Always two passes.

  • Two distinct prompts: STAGE_ONE_TEMPLATE ("think only … do not provide the final answer yet"); STAGE_TWO_TEMPLATE ("rethink using image + history, then answer").
  • Loop: round 1 generates <think>…</think> only; round 2 appends that as assistant + a STAGE_TWO_TEMPLATE user turn β†’ one final generate for all. Exactly 2 forward passes, no <answer>/iter early-exit.
  • Same deletions, registries, resume + save_state fixes as match-round.

Hypothesis: is a fixed think→rethink+answer schedule enough (vs adaptive)? Mechanically the simplest of the set; hard to break. No bug found.


r1-v-no23-1round β€” fused single turn (<think>+<box>+<answer>), retry on missing answer

Asks for thinking, one bbox, and the final answer in a single response. No image cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is missing.

  • Single fused prompt: "...provide one bounding box [x1,y1,x2,y2] inside <box> … Then directly provide the final answer inside <answer>." Format example: <think>…</think> <box>[…]</box> <answer>…</answer>.
  • Format reward: pattern_stage1 (intermediate) accepts <think>…</think><box>[…]</box> with (?!.*<answer>); pattern_stage2 (final) requires the full <think>…</think><box>[…]</box><answer>…</answer>. Both add single-occurrence anti-duplication lookaheads.
  • Loop: stage-1 generate; if <answer> present β†’ finalize; else _prepare_for_next_round(..., include_image=True) appends the failed response + a fresh user turn that re-includes the original image and the same prompt; loop to <answer> or iter==4.
  • All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped; same resume + save_state fixes.

Hypothesis: does emitting the box and answer in one shot (no separate crop trajectory) match the multi-round pipeline?

Risk found β€” wasteful image re-injection: every retry appends another copy of the same original image (both an {"type":"image"} placeholder and the image in the per- rollout image list). With max_pixels=401408, each image β‰ˆ 250 tokens, so after 4 retries the prompt carries 5 identical images (1.25k tokens of duplicate). Risks: hitting max_prompt_length=8192 (left-truncation could drop the leading image position) and extra memory/time on long retries. Suggested fix: pass the image only on the first turn (include_image=False on retries) β€” the chat already shows it once. Not a correctness bug, but a real perf/robustness footgun.


Cross-cutting: a real pre-existing bug in baseline r1-v (fixed in all four forks)

r1-v/src/open_r1/grpo.py reward/score fns (accuracy_reward_stage2, accuracy_score_stage1, accuracy_score_stage2) iterate per-sample over zip(contents, solution) but route the dataset handler via dataset[0] and log problem_id[0] β€” the batch-wide kwargs collapsed to the first sample. For any batch spanning >1 dataset (per_device_train_batch_size>1, or grad-accum across prompts), every sample gets the first sample's handler β†’ silently wrong rewards (no crash). The shipped config (per_device=1, accum=1) masks it because a GRPO group is one prompt.

All four no-* forks fix it by iterating cur_dataset/cur_problem_id from the zip and defaulting reward, student_answer = 0.0, "" in the missing-handler branch. Worth upstreaming to baseline r1-v.


Shared mechanical fixes in every fork

  • Dead hardcoded resume path (/home/meng/GRPO/...) β†’ get_last_checkpoint(output_dir) (baseline would never actually resume on a fresh box).
  • trainer.save_state(output_dir) β†’ trainer.save_state() (kwarg-less is the supported Trainer API).
  • r1-v-no-hist additionally adds the PyTorch-2.6+ torch.load weights-only workaround for DeepSpeed resume β€” the other forks should copy it if they need to resume under torch β‰₯ 2.6.

Bottom line

  • Clean: r1-v-no123, r1-v-no123-match-round.
  • By-design caveat: r1-v-no-hist β€” only the final round contributes gradient.
  • Perf footgun: r1-v-no23-1round β€” duplicate original image re-sent on every retry.
  • Real bug (baseline): per-batch reward mis-routing via dataset[0]/problem_id[0]; fixed in the forks, should go upstream.