tmp-score / ABLATION_NOTES.md

Upload ABLATION_NOTES.md with huggingface_hub

7de483e verified about 1 month ago

9.09 kB

	# Ground-R1 `rl-projects` — Ablation Branch Notes

	Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch
	(`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the
	baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what
	hypothesis it isolates, and any bugs/risks found.

	---

	## At a glance

	\| Branch \| Bbox / crop \| Round structure \| Image(s) carried across rounds \| History across rounds \|
	\|---\|---\|---\|---\|---\|
	\| `r1-v` (baseline) \| yes (`<box>` + crop + 2nd image) \| adaptive: stop on `<answer>` or iter==4 \| original + each crop \| full conversation \|
	\| `r1-v-no-hist` \| yes (still crops) \| adaptive (same cap) \| only the latest crop \| reset each round \|
	\| `r1-v-no123-match-round` \| no \| adaptive (same cap) \| original only \| full conversation \|
	\| `r1-v-no123` \| no \| fixed: exactly 2 rounds \| original only \| full conversation \|
	\| `r1-v-no23-1round` \| `<box>`+`<answer>` in one turn \| adaptive (same cap), retry only on missing `<answer>` \| original re-sent every retry \| full conversation \|

	The naming reads as: no-hist = remove cross-round memory; no123 = remove the three
	grounding pieces (bbox-step, crop, second image); match-round = keep r1-v's adaptive
	iteration cap; 1round = intended single fused turn.

	---

	## `r1-v-no-hist` — drop conversation history between grounding rounds

	Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws
	away prior turns and shows the model only the cropped image + the next-round prompt.

	- Trainer `_prepare_for_stage2` is the only behavior change:
	- before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`,
	`combined_images = [original, crop]`
	- after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]`
	- `grpo.py` resume hardening (not algorithmic): replaces the dead hardcoded
	`/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a
	`configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`,
	registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+),
	and fixes `trainer.save_state(output_dir)` → `trainer.save_state()`.

	Hypothesis: does iterative grounding still work when the model can't see what it
	grounded before?

	Risk / note (by design, easy to overlook): because the prompt is reset every round,
	the final `prompt_completion_ids` sequence contains only the last round's
	user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` … `151645`) therefore
	catches only the final round's assistant tokens. Intermediate grounding rounds
	produce no gradient — they only decide which crop the final round sees. So a 4-round
	rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
	"no-history" ablation, but it sharply reduces training signal per rollout; expect slower
	learning than baseline at equal steps.

	---

	## `r1-v-no123-match-round` — pure CoT, no bbox/crop, keep r1-v's adaptive cap

	Strips all bounding-box / crop logic; the model just thinks → maybe answers → if not, is
	re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`).

	- Prompt (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no
	further thinking is needed, provide `<answer>`." Format: `<think>…</think>` or
	`<think>…</think><answer>…</answer>`.
	- Format reward: `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no
	`<answer>`); `pattern_stage2` keeps `<think>…</think><answer>…</answer>`.
	- Trainer surgery: deletes `bbox_adjust`, `cal_bbox_for_iou`,
	`_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2`
	→ `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh
	text-only user turn; no image added — original image stays in history once);
	`_generate_for_stage2_batch` passes `images=None` when there are none.
	- `grpo.py`: deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`,
	`bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries
	trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes.
	- `prepare_data.py`: `item.pop('bboxs', None)`; checked-in jsonl already has bboxs
	stripped.

	Hypothesis: is the "think → maybe answer, else re-ask" outer loop alone (no grounding)
	competitive with the full pipeline? Clean; no bug found.

	---

	## `r1-v-no123` — pure CoT, forced exactly 2 rounds

	Same bbox/crop removal as `match-round`, but does not mimic r1-v's adaptive loop.
	Always two passes.

	- Two distinct prompts: `STAGE_ONE_TEMPLATE` ("think only … **do not provide the
	final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer").
	- Loop: round 1 generates `<think>…</think>` only; round 2 appends that as assistant +
	a `STAGE_TWO_TEMPLATE` user turn → one final `generate` for all. Exactly 2 forward
	passes, no `<answer>`/iter early-exit.
	- Same deletions, registries, resume + `save_state` fixes as `match-round`.

	Hypothesis: is a fixed think→rethink+answer schedule enough (vs adaptive)? Mechanically
	the simplest of the set; hard to break. No bug found.

	---

	## `r1-v-no23-1round` — fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer

	Asks for thinking, one bbox, and the final answer in a single response. No image
	cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is
	missing.

	- Single fused prompt: "...provide one bounding box [x1,y1,x2,y2] inside `<box>` …
	Then directly provide the final answer inside `<answer>`." Format example:
	`<think>…</think> <box>[…]</box> <answer>…</answer>`.
	- Format reward: `pattern_stage1` (intermediate) accepts `<think>…</think><box>[…]</box>`
	with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full
	`<think>…</think><box>[…]</box><answer>…</answer>`. Both add single-occurrence
	anti-duplication lookaheads.
	- Loop: stage-1 generate; if `<answer>` present → finalize; else
	`_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh
	user turn that re-includes the original image and the same prompt; loop to
	`<answer>` or iter==4.
	- All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
	same resume + `save_state` fixes.

	Hypothesis: does emitting the box and answer in one shot (no separate crop trajectory)
	match the multi-round pipeline?

	Risk found — wasteful image re-injection: every retry appends **another copy of the
	same original image** (both an `{"type":"image"}` placeholder and the image in the per-
	rollout image list). With `max_pixels=401408`, each image ≈ 250 tokens, so after 4 retries
	the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting
	`max_prompt_length=8192` (left-truncation could drop the leading image position) and
	extra memory/time on long retries. Suggested fix: pass the image only on the first
	turn (`include_image=False` on retries) — the chat already shows it once. Not a
	correctness bug, but a real perf/robustness footgun.

	---

	## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)

	`r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`,
	`accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over
	`zip(contents, solution)` but route the dataset handler via `dataset[0]` and log
	`problem_id[0]` — the batch-wide kwargs collapsed to the first sample. For any batch
	spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts),
	every sample gets the first sample's handler → silently wrong rewards (no crash). The
	shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt.

	All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and
	defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth
	upstreaming to baseline `r1-v`.

	---

	## Shared mechanical fixes in every fork

	- Dead hardcoded resume path (`/home/meng/GRPO/...`) → `get_last_checkpoint(output_dir)`
	(baseline would never actually resume on a fresh box).
	- `trainer.save_state(output_dir)` → `trainer.save_state()` (kwarg-less is the supported
	Trainer API).
	- `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround
	for DeepSpeed resume — the other forks should copy it if they need to resume under
	torch ≥ 2.6.

	---

	## Bottom line

	- Clean: `r1-v-no123`, `r1-v-no123-match-round`.
	- By-design caveat: `r1-v-no-hist` — only the final round contributes gradient.
	- Perf footgun: `r1-v-no23-1round` — duplicate original image re-sent on every retry.
	- Real bug (baseline): per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`;
	fixed in the forks, should go upstream.