Icey444 commited on
Commit
7de483e
Β·
verified Β·
1 Parent(s): 499f6c8

Upload ABLATION_NOTES.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. ABLATION_NOTES.md +166 -0
ABLATION_NOTES.md ADDED
@@ -0,0 +1,166 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Ground-R1 `rl-projects` β€” Ablation Branch Notes
2
+
3
+ Notes from a code read of the experimental forks of `r1-v/` on the `rl-projects` branch
4
+ (`Irisicy4/Ground-R1-project`). Each branch keeps a sibling copy of the trainer so the
5
+ baseline `r1-v/` stays untouched. Below: what each variant changes vs `r1-v`, what
6
+ hypothesis it isolates, and any bugs/risks found.
7
+
8
+ ---
9
+
10
+ ## At a glance
11
+
12
+ | Branch | Bbox / crop | Round structure | Image(s) carried across rounds | History across rounds |
13
+ |---|---|---|---|---|
14
+ | `r1-v` (baseline) | yes (`<box>` + crop + 2nd image) | adaptive: stop on `<answer>` or iter==4 | original + each crop | full conversation |
15
+ | `r1-v-no-hist` | yes (still crops) | adaptive (same cap) | **only the latest crop** | **reset each round** |
16
+ | `r1-v-no123-match-round` | **no** | adaptive (same cap) | original only | full conversation |
17
+ | `r1-v-no123` | **no** | **fixed: exactly 2 rounds** | original only | full conversation |
18
+ | `r1-v-no23-1round` | `<box>`+`<answer>` in one turn | adaptive (same cap), retry only on missing `<answer>` | original re-sent every retry | full conversation |
19
+
20
+ The naming reads as: **no-hist** = remove cross-round memory; **no123** = remove the three
21
+ grounding pieces (bbox-step, crop, second image); **match-round** = keep r1-v's adaptive
22
+ iteration cap; **1round** = intended single fused turn.
23
+
24
+ ---
25
+
26
+ ## `r1-v-no-hist` β€” drop conversation history between grounding rounds
27
+
28
+ Keeps the full Ground-R1 bbox/crop machinery, but at the start of each new round throws
29
+ away prior turns and shows the model only the cropped image + the next-round prompt.
30
+
31
+ - **Trainer `_prepare_for_stage2`** is the only behavior change:
32
+ - before: `origin_prompt.extend([{assistant: bbox_str}, next_user_with_crop])`,
33
+ `combined_images = [original, crop]`
34
+ - after: `origin_prompt = [next_stage_entry]`, `combined_images = [crop]`
35
+ - **`grpo.py`** resume hardening (not algorithmic): replaces the dead hardcoded
36
+ `/home/meng/GRPO/...` resume path with `get_last_checkpoint(output_dir)`, adds a
37
+ `configure_torch_checkpoint_resume()` helper (sets `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`,
38
+ registers `ZeroStageEnum` as a `torch.serialization` safe global for PyTorch 2.6+),
39
+ and fixes `trainer.save_state(output_dir)` β†’ `trainer.save_state()`.
40
+
41
+ **Hypothesis:** does iterative grounding still work when the model can't see what it
42
+ grounded before?
43
+
44
+ **Risk / note (by design, easy to overlook):** because the prompt is reset every round,
45
+ the final `prompt_completion_ids` sequence contains only the **last** round's
46
+ user+assistant turns. The loss-mask scan (`[151644, 77091, 198]` … `151645`) therefore
47
+ catches **only the final round's** assistant tokens. Intermediate grounding rounds
48
+ produce **no gradient** β€” they only decide which crop the final round sees. So a 4-round
49
+ rollout yields the same per-rollout RL signal as a 1-round one. Intended for the
50
+ "no-history" ablation, but it sharply reduces training signal per rollout; expect slower
51
+ learning than baseline at equal steps.
52
+
53
+ ---
54
+
55
+ ## `r1-v-no123-match-round` β€” pure CoT, no bbox/crop, keep r1-v's adaptive cap
56
+
57
+ Strips all bounding-box / crop logic; the model just thinks β†’ maybe answers β†’ if not, is
58
+ re-asked the same prompt. Up to 4 extra rounds (same cap as `r1-v`).
59
+
60
+ - **Prompt** (`grpo.py` + trainer `STAGE_PROMPT_TEMPLATE`): no `<box>` language; "if no
61
+ further thinking is needed, provide `<answer>`." Format: `<think>…</think>` or
62
+ `<think>…</think><answer>…</answer>`.
63
+ - **Format reward:** `pattern_stage1` = `^<think>(.+?)</think>$` (no `<box>`, no
64
+ `<answer>`); `pattern_stage2` keeps `<think>…</think><answer>…</answer>`.
65
+ - **Trainer surgery:** deletes `bbox_adjust`, `cal_bbox_for_iou`,
66
+ `_crop_image_for_next_stage`, `_get_bbox_for_last_stage`; renames `_prepare_for_stage2`
67
+ β†’ `_prepare_for_next_round` (appends `{assistant: previous_response}` + a fresh
68
+ text-only user turn; no image added β€” original image stays in history once);
69
+ `_generate_for_stage2_batch` passes `images=None` when there are none.
70
+ - **`grpo.py`:** deletes all bbox/IoU reward+score fns (`compute_iou`, `compute_giou`,
71
+ `bbox_reward_stage2`, `bbox_score_stage{1,2}`, `bbox_iou_stage{1,2,3}`); registries
72
+ trimmed to `{accuracy, format}` / `{refine_times}`. Same resume + `save_state` fixes.
73
+ - **`prepare_data.py`:** `item.pop('bboxs', None)`; checked-in jsonl already has bboxs
74
+ stripped.
75
+
76
+ **Hypothesis:** is the "think β†’ maybe answer, else re-ask" outer loop alone (no grounding)
77
+ competitive with the full pipeline? Clean; no bug found.
78
+
79
+ ---
80
+
81
+ ## `r1-v-no123` β€” pure CoT, forced exactly 2 rounds
82
+
83
+ Same bbox/crop removal as `match-round`, but **does not** mimic r1-v's adaptive loop.
84
+ Always two passes.
85
+
86
+ - **Two distinct prompts:** `STAGE_ONE_TEMPLATE` ("think only … **do not provide the
87
+ final answer yet**"); `STAGE_TWO_TEMPLATE` ("rethink using image + history, then answer").
88
+ - **Loop:** round 1 generates `<think>…</think>` only; round 2 appends that as assistant +
89
+ a `STAGE_TWO_TEMPLATE` user turn β†’ one final `generate` for all. Exactly 2 forward
90
+ passes, no `<answer>`/iter early-exit.
91
+ - Same deletions, registries, resume + `save_state` fixes as `match-round`.
92
+
93
+ **Hypothesis:** is a fixed think→rethink+answer schedule enough (vs adaptive)? Mechanically
94
+ the simplest of the set; hard to break. No bug found.
95
+
96
+ ---
97
+
98
+ ## `r1-v-no23-1round` β€” fused single turn (`<think>`+`<box>`+`<answer>`), retry on missing answer
99
+
100
+ Asks for thinking, one bbox, and the final answer in a **single** response. No image
101
+ cropping; r1-v's adaptive retry cap is reused only as a fallback when the answer is
102
+ missing.
103
+
104
+ - **Single fused prompt:** "...provide one bounding box [x1,y1,x2,y2] inside `<box>` …
105
+ Then directly provide the final answer inside `<answer>`." Format example:
106
+ `<think>…</think> <box>[…]</box> <answer>…</answer>`.
107
+ - **Format reward:** `pattern_stage1` (intermediate) accepts `<think>…</think><box>[…]</box>`
108
+ with `(?!.*<answer>)`; `pattern_stage2` (final) requires the full
109
+ `<think>…</think><box>[…]</box><answer>…</answer>`. Both add single-occurrence
110
+ anti-duplication lookaheads.
111
+ - **Loop:** stage-1 generate; if `<answer>` present β†’ finalize; else
112
+ `_prepare_for_next_round(..., include_image=True)` appends the failed response + a fresh
113
+ user turn that **re-includes the original image** and the same prompt; loop to
114
+ `<answer>` or iter==4.
115
+ - All bbox/crop infra + IoU rewards deleted; registries trimmed; jsonl bboxs stripped;
116
+ same resume + `save_state` fixes.
117
+
118
+ **Hypothesis:** does emitting the box and answer in one shot (no separate crop trajectory)
119
+ match the multi-round pipeline?
120
+
121
+ **Risk found β€” wasteful image re-injection:** every retry appends **another copy of the
122
+ same original image** (both an `{"type":"image"}` placeholder and the image in the per-
123
+ rollout image list). With `max_pixels=401408`, each image β‰ˆ 250 tokens, so after 4 retries
124
+ the prompt carries ~5 identical images (~1.25k tokens of duplicate). Risks: hitting
125
+ `max_prompt_length=8192` (left-truncation could drop the leading image position) and
126
+ extra memory/time on long retries. **Suggested fix:** pass the image only on the first
127
+ turn (`include_image=False` on retries) β€” the chat already shows it once. Not a
128
+ correctness bug, but a real perf/robustness footgun.
129
+
130
+ ---
131
+
132
+ ## Cross-cutting: a real pre-existing bug in baseline `r1-v` (fixed in all four forks)
133
+
134
+ `r1-v/src/open_r1/grpo.py` reward/score fns (`accuracy_reward_stage2`,
135
+ `accuracy_score_stage1`, `accuracy_score_stage2`) iterate per-sample over
136
+ `zip(contents, solution)` but route the dataset handler via **`dataset[0]`** and log
137
+ **`problem_id[0]`** β€” the batch-wide kwargs collapsed to the first sample. For any batch
138
+ spanning >1 dataset (`per_device_train_batch_size>1`, or grad-accum across prompts),
139
+ every sample gets the **first sample's** handler β†’ silently wrong rewards (no crash). The
140
+ shipped config (`per_device=1`, `accum=1`) masks it because a GRPO group is one prompt.
141
+
142
+ All four `no-*` forks fix it by iterating `cur_dataset`/`cur_problem_id` from the zip and
143
+ defaulting `reward, student_answer = 0.0, ""` in the missing-handler branch. Worth
144
+ upstreaming to baseline `r1-v`.
145
+
146
+ ---
147
+
148
+ ## Shared mechanical fixes in every fork
149
+
150
+ - Dead hardcoded resume path (`/home/meng/GRPO/...`) β†’ `get_last_checkpoint(output_dir)`
151
+ (baseline would never actually resume on a fresh box).
152
+ - `trainer.save_state(output_dir)` β†’ `trainer.save_state()` (kwarg-less is the supported
153
+ Trainer API).
154
+ - `r1-v-no-hist` additionally adds the PyTorch-2.6+ `torch.load` weights-only workaround
155
+ for DeepSpeed resume β€” the other forks should copy it if they need to resume under
156
+ torch β‰₯ 2.6.
157
+
158
+ ---
159
+
160
+ ## Bottom line
161
+
162
+ - **Clean:** `r1-v-no123`, `r1-v-no123-match-round`.
163
+ - **By-design caveat:** `r1-v-no-hist` β€” only the final round contributes gradient.
164
+ - **Perf footgun:** `r1-v-no23-1round` β€” duplicate original image re-sent on every retry.
165
+ - **Real bug (baseline):** per-batch reward mis-routing via `dataset[0]`/`problem_id[0]`;
166
+ fixed in the forks, should go upstream.