Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
# Deep-Read: SDPO / OPSD — Critical Audit of Cluster 2
**Date**: 2026-06-09
**Sources fetched from primary HTML**:
- SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford)
— note `reinforcement-learning-via-self-distillation-2` (148 K body)
- OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026)
— note `self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2` (52 K body)
- OPSD code repo: github.com/siyan-zhao/OPSD (README + key args)
- SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body)
**Repo files audited**: `composer_replication/opsd.py`, `composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss`, `composer_replication/hint_generator.py`, `research/07-sdpo-hint-generator.md`, `research/11-sdpo-alignment-indices.md`, `docs/adrs/ADR-007`, `ADR-008`, `ADR-009`, `ADR-011`.
---
## 1. What the primary sources actually say
### 1.1 SDPO (arXiv:2601.20802v2) — core method
**Exact loss (Eq. 1)**:
```
L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})) )
```
KL direction: the student is in the first argument, `KL(π_θ || q_θ)`, i.e.
**KL(student ‖ teacher)**. In standard distillation terminology this is the
**reverse KL** (mode-seeking for the student); the forward KL would be
KL(teacher ‖ student). The paper writes it `KL(π_θ ‖ stopgrad(q_θ))`.
**Stability improvements (§2.3)**:
1. Regularized teacher: EMA of student params OR interpolation between current teacher
and initial teacher `q_{θ_ref}`. The paper calls these "trust-region" and "EMA"
teachers (Table 4). Non-regularized teacher (`q_θ`, the live student) diverges.
2. Symmetric JSD: the paper adopts JSD as the distillation loss for stability —
citing Agarwal et al. 2024 on-policy distillation. The pseudocode (Fig 14) calls
this `divergence(logprobs_student, logprobs_teacher)` with no fixed default — the
paper reports using JSD.
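The EMA teacher from item 1 can be pictured with a minimal, dependency-free sketch. The helper name `ema_update`, the dictionary-of-floats parameter layout, and the decay values are illustrative assumptions, not settings taken from the SDPO paper; a real trainer would apply the same update to model tensors.

```python
def ema_update(teacher, student, decay=0.99):
    """Return EMA teacher parameters after one student step.

    `teacher` and `student` map parameter names to floats. The default
    `decay` is a hypothetical value, not one reported in the SDPO paper.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    # A small decay is used here only so the lag is easy to see by hand.
    teacher = ema_update(teacher, student, decay=0.5)
# The teacher trails the student: 0.5, 0.75, 0.875 after three updates.
```

The point of the lag is that the teacher's distribution changes more slowly than the student's, which is the regularizing effect the paper attributes to this variant.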
**Top-K approximation (Appendix A.3)**:
The paper approximates the full-vocab KL with top-K tokens of the **student** distribution:
```
L_SDPO ≈ Σ_t [ Σ_{ŷ_t ∈ topK(π_θ)} π_θ(ŷ_t|x,y_{<t}) · log(π_θ(ŷ_t) / q_θ(ŷ_t))
+ tail_term ]
```
The tail term aggregates the remaining probability mass. Default K=100.
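To make the top-K idea concrete, here is a small pure-Python sketch. It assumes the tail term treats all non-top-K tokens as one merged outcome, `p_tail · log(p_tail / q_tail)`; that reading of "aggregates the remaining probability mass" is an assumption of this note, and the function names and toy distributions are hypothetical.

```python
import math


def full_kl(p, q):
    """KL(p || q) over the whole vocabulary; p and q are probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def topk_kl_with_tail(p, q, k):
    """Approximate KL(p || q) from the top-k tokens of p plus one tail bucket.

    Assumption: the tail is a single merged outcome holding the leftover mass.
    """
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0.0)
    p_tail = 1.0 - sum(p[i] for i in top)
    q_tail = 1.0 - sum(q[i] for i in top)
    if p_tail > 1e-12 and q_tail > 1e-12:
        return head + p_tail * math.log(p_tail / q_tail)
    return head


student = [0.50, 0.30, 0.10, 0.06, 0.04]   # toy student distribution
teacher = [0.40, 0.20, 0.20, 0.10, 0.10]   # toy teacher distribution
exact = full_kl(student, teacher)
approx = topk_kl_with_tail(student, teacher, k=2)
```

Under this bucketing assumption the approximation never exceeds the exact KL, because merging outcomes can only lose information, and it becomes exact once K covers the whole vocabulary.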
**Teacher context construction (Table 2, verbatim)**:
```
User: {prompt}
Correct solution: {successful_previous_rollout} (skipped if unavailable)
The following is feedback from your unsuccessful earlier attempt:
{environment_output} (skipped if no env output or if solved)
Correctly solve the original question.
Assistant: {original_response} (the student's original attempt, for log-prob re-eval)
```
Critical nuance: the `original_response` is placed in the teacher context so the
model can re-evaluate log-probs of `y` under the teacher. **The student's original
attempt is always appended as the response the teacher evaluates** — this is how both
student and teacher evaluate log-probs of the same token sequence `y`.
Token alignment: **there is no shift / hint-insertion at an "error turn."** The teacher
sees `[prompt, feedback, original_response]` and both student and teacher evaluate
log-probs of the SAME `original_response` tokens. No additional tokens are inserted
into the response; the prefix is longer for the teacher (it has feedback), but the
response token sequence evaluated is identical.
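The template and its skipping rules can be sketched as plain string assembly. This is an illustration only: the function names are hypothetical, and the student-side format (`User:` / `Assistant:` with no feedback) is an assumption of this note rather than something quoted from the paper.

```python
def build_teacher_prefix(prompt, env_output=None, sibling_solution=None):
    """Render the teacher's conditioning prefix in the Table 2 layout.

    Optional parts are skipped when unavailable, as the template specifies.
    """
    parts = [f"User: {prompt}"]
    if sibling_solution:
        parts.append(f"Correct solution: {sibling_solution}")
    if env_output:
        parts.append("The following is feedback from your unsuccessful earlier attempt:")
        parts.append(env_output)
    parts.append("Correctly solve the original question.")
    return "\n".join(parts)


def build_pair(prompt, response, env_output=None, sibling_solution=None):
    """Return (student_text, teacher_text); both end in the same response."""
    student = f"User: {prompt}\nAssistant: {response}"
    prefix = build_teacher_prefix(prompt, env_output, sibling_solution)
    teacher = f"{prefix}\nAssistant: {response}"
    return student, teacher


student_text, teacher_text = build_pair(
    prompt="Sum the list.",
    response="return sum(xs)",
    env_output="TypeError: unsupported operand",
)
```

Only the prefix differs between the two texts; the response that gets scored is appended unchanged to both.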
**Feedback types ablated (§4.6, Table 6)**: three individual types and their combination.
1. **Sample solution** (`f = own solution`): a successful sibling rollout from the GRPO
group; always student-generated (no expert model). Teacher accuracy: 42.4%.
2. **Environment output** (`f = output`): runtime errors, failing unit tests, etc.
Teacher accuracy: 32.5%.
3. **Student's original attempt** (`f = y`): the paper reports that including the
original attempt in the feedback (not just in the response slot) **reduces teacher
diversity** (biases teacher toward student; "Same output": 30% vs ~10–13%).
4. **Combined** (`f = output + own solution`): best trained student accuracy (48.3%).
Excluding `f = y` (the original attempt as part of conditioning) is key.
**Failure modes reported**:
- Non-regularized teacher (`q_θ`) diverges / training collapses (Table 4: 36.1% vs
50.6% for trust-region teacher).
- Performance depends on model in-context learning ability: SDPO underperforms GRPO on
weaker models (Qwen3-0.6B); hybrid SDPO+GRPO (λ=0.9 GRPO + 0.1 SDPO) is more
robust (§4.5).
- Uninformative or misleading environment feedback: SDPO cannot learn from it.
- SDPO adds small compute overhead (additional forward for log-prob re-computation of
teacher context); minor for large models, non-negligible for small models.
- Including the student's own attempt in the teacher conditioning (not just as the
response to re-evaluate) reduces diversity; the correct template excludes it from
the conditioning prefix.
**SDPO operates over the full rollout, not at isolated "error turns"**. The loss
sums over all tokens `t` in the response `y`. There is no error-site detection
step in the SDPO paper.
### 1.2 OPSD (arXiv:2601.18734v3) — core method
**Exact loss (Eq. 6–8)**:
```
L_OPSD(θ) = E_{(x,y*)~S} [ E_{ŷ~p_S(·|x)} [ D(p_T ‖ p_S)(ŷ|x) ] ]
where
D(p_T ‖ p_S)(ŷ|x) := (1/|ŷ|) Σ_{n=1}^{|ŷ|} D( p_T(·|x, y*, ŷ_{<n}) ‖ p_S(·|x, ŷ_{<n}) )
```
**Divergence D** can be: forward KL, reverse KL, or JSD_β. The paper defines:
```
JSD_β(p_T ‖ p_S) = β·KL(p_T ‖ m) + (1-β)·KL(p_S ‖ m)
m = β·p_T + (1-β)·p_S
```
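**Direction convention**: `D(p_T ‖ p_S)`, with the teacher in the first argument and the student in the second. For the JSD the endpoints are degenerate (at β=0, `m = p_S` and the value is identically zero), so the direction is read from the rescaled limits:
- β→0: `m → p_S` and `JSD_β / β → KL(p_T ‖ p_S)` (forward KL; mode-covering for the student)
- β→1: `m → p_T` and `JSD_β / (1-β) → KL(p_S ‖ p_T)` (reverse KL; mode-seeking for the student)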
The GKD paper (arXiv:2306.13649) that OPSD cites defines JSD_β with the **same
convention**: `JSD_β(p ‖ q) = β·KL(p||M) + (1-β)·KL(q||M)`.
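The definition above can be checked numerically. The sketch below is a dependency-free illustration; the helper names `kl` and `jsd_beta` and the example distributions are made up for this note. It evaluates JSD_β at β=0.5 and probes the two limits by dividing by the vanishing weight.

```python
import math


def kl(p, q):
    """KL(p || q) for two probability lists with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jsd_beta(p_t, p_s, beta):
    """JSD_beta(p_T || p_S) with mixture m = beta*p_T + (1-beta)*p_S."""
    m = [beta * t + (1.0 - beta) * s for t, s in zip(p_t, p_s)]
    return beta * kl(p_t, m) + (1.0 - beta) * kl(p_s, m)


p_teacher = [0.7, 0.2, 0.1]
p_student = [0.4, 0.4, 0.2]
eps = 1e-4

symmetric = jsd_beta(p_teacher, p_student, 0.5)
near_zero = jsd_beta(p_teacher, p_student, eps) / eps        # close to KL(p_T || p_S)
near_one = jsd_beta(p_teacher, p_student, 1.0 - eps) / eps   # close to KL(p_S || p_T)
```

Because the endpoints themselves are degenerate (the mixture collapses onto one argument and the value is zero), the rescaled limits are what identify the direction: small β recovers KL(p_T ‖ p_S) and β near 1 recovers KL(p_S ‖ p_T).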
**Per-token pointwise clipping**: OPSD introduces this explicitly:
```
D_clip^(f)(p_T ‖ p_S) = (1/|ŷ|) Σ_n Σ_v min(l_{n,v}^(f), τ)
where l_{n,v}^(f) = p_T(v|·) · f( p_S(v|·) / p_T(v|·) )
```
This clips per vocab-entry contribution. Default τ=0.05 (from README: `--jsd_token_clip 0.05`).
Non-thinking mode results in README use 1e-7 (Qwen3-8B) and 1e-6 (Qwen3-4B, 1.7B).
**Teacher context**: `p_T(·|x, y*, ŷ_{<n})` — teacher sees the ground-truth answer `y*`
(a reference CoT / verified reasoning trace from the dataset) prepended to the problem,
then evaluates the student's prefix `ŷ_{<n}`. Same token sequence for both distributions
evaluated at each step `n`.
**`--reason_first` flag (from GitHub README)**: Prepend an explicit rationalization to the
teacher context before distillation. This is OPSD's self-introspection lever: the teacher
is first asked to rationalize why `y*` is correct, then that rationalization is folded into
the conditioning. Not the main results configuration; requires `--use_peft`.
**Results**: On Qwen3-1.7B (AIME24/25/HMMT25), adding OPSD lifts the base model from
37.1% to 43.4% (Avg@12), outperforming GRPO (37.7%) and SFT (35.8%). Token-efficient:
generation is capped at 1024 tokens vs. GRPO's 16K.
---
## 2. Audit: `composer_replication/opsd.py` — "byte-for-byte OPSD parity" claim
### 2.1 JSD formula — CORRECT, with a subtle direction note
The code implements:
```python
# Pseudocode summary of the kernel (not literal source):
# log M = logsumexp([ log p_student + log(1-β), log p_teacher + log(β) ])
# JSD   = β·KL(teacher||M) + (1-β)·KL(student||M)
```
The OPSD paper (Eq. 7) defines:
```
JSD_β(p_T ‖ p_S) = β·KL(p_T‖m) + (1-β)·KL(p_S‖m)
```
where `m = β·p_T + (1-β)·p_S`.
The code `kl_teacher = F.kl_div(mixture_log_probs, teacher_log_probs, ...)` uses
PyTorch semantics where `F.kl_div(input=log_q, target=log_p, log_target=True)`
computes `KL(p||q) = Σ p(x)·(log p(x) - log q(x))`. So `kl_teacher` computes
`KL(teacher||mixture)` and `kl_student` computes `KL(student||mixture)`.
The final JSD: `β·kl_teacher + (1-β)·kl_student` = `β·KL(teacher||M) + (1-β)·KL(student||M)`.
This matches the OPSD paper's `JSD_β(p_T ‖ p_S)` exactly. **CORRECT.**
### 2.2 β convention docstring — INVERTED vs. both papers
The `opsd.py` docstring says:
```
β = 0 → KL(teacher || student) (reverse KL — mode-covering for student)
β = 1 → KL(student || teacher) (forward KL — mode-seeking for student)
```
From the OPSD GitHub README:
> `--beta`: Interpolation weight for the JSD mixture distribution.
> **Beta=0 means forward KL and 1 means reverse KL.**
The repo docstring has the β=0 and β=1 names **swapped** relative to the OPSD upstream.
With `m = β·p_T + (1-β)·p_S`, β→0 gives `M → p_student`: the `KL(student||M)` term vanishes
to second order in β, and `JSD_β / β → KL(teacher||student)`, which is the **forward KL**
(teacher in the first argument, mode-covering for the student). The README's "Beta=0 means
forward KL" matches this analysis; symmetrically, β→1 recovers the reverse KL, `KL(student||teacher)`.
The code *implementation* is correct (the formula computes the right mixture). The docstring's
KL expressions and mode-covering / mode-seeking notes agree with these limits, but it names
β=0 "reverse KL" and β=1 "forward KL", which contradicts both the upstream README and the
standard convention. This is a documentation error, not a numerical error.
**VERDICT**: Implementation is numerically correct. Docstring direction labels are inverted.
### 2.3 `reduction="batchmean"` behavior — MINOR DIVERGENCE from the paper (matches upstream code)
The repo `opsd.py` comment says:
> "batchmean" matches upstream OPSD: divides by `mask.sum()` when labels are given,
> else by the leading dim of jsd (= batch size).
The OPSD paper (Algorithm 1) normalizes by `|ŷ|` (sequence length, token-mean):
```
ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...)
L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*)
```
The repo divides by `mask.sum()` (the number of valid, unmasked tokens in the batch), which
is equivalent to the paper's normalization only when every example contributes the same
number of valid tokens. When per-example token counts vary (the usual case in real training),
longer sequences get more weight than under the paper's per-sequence average followed by
batch average. In practice this difference is negligible for stability, but it is technically
not byte-for-byte parity with the paper on the reduction.
**VERDICT**: The `reduction="batchmean"` logic is borrowed from the OPSD upstream code
(which uses the same `mask.sum()` convention). The docstring's "matches upstream" claim
is accurate for the code, but the code diverges from the paper's stated per-sequence
normalization. Not a material issue.
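A worked example makes the two reductions easy to compare. The sketch below uses plain Python lists in place of masked tensors; the function names and loss values are made up for illustration.

```python
def token_mean(per_token_losses):
    """Divide the summed loss by the total valid-token count (the mask.sum() form)."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


def sequence_then_batch_mean(per_token_losses):
    """Average within each sequence first, then across the batch (the paper's form)."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


# Two sequences with 1 and 3 valid tokens.
batch = [[0.8], [0.2, 0.2, 0.2]]
flat = token_mean(batch)                   # (0.8 + 0.6) / 4, about 0.35
nested = sequence_then_batch_mean(batch)   # (0.8 + 0.2) / 2, about 0.5

# With equal lengths the two reductions coincide.
equal = [[0.8, 0.4], [0.2, 0.2]]
```

The token-mean form weights the long sequence three times as heavily as the short one, which is why the two numbers diverge here and agree for `equal`.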
### 2.4 `token_clip` parameter — CORRECT semantics, but per-token vs. per-(token,vocab) distinction
The repo implements `token_clip` as an elementwise clamp on the un-reduced JSD tensor:
```python
jsd = jsd.clamp(max=token_clip) # jsd shape is (B, T, V) or (n_valid, V)
```
The OPSD paper's pointwise clipping (Section 3.3) clips **per-(position, vocab-entry)**:
`min(l_{n,v}^(f), τ)` for each vocab entry v at each position n.
The upstream OPSD code (`--jsd_token_clip`) appears to apply the same per-(position,vocab)
clip. Because the repo clamps the `jsd` tensor while it still has a vocab axis (shape
(B,T,V), before masking and reduction), each vocab entry's contribution is capped
individually at every position, which is per-(position,vocab) clipping and therefore correct.
**VERDICT**: Implementation appears correct. The parameter name (`token_clip`) is slightly
misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match.
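The granularity distinction is easiest to see on a toy tensor. In this sketch, rows stand for positions and columns for vocab entries; the function names and numbers are hypothetical and chosen only to make the two behaviours differ.

```python
def clip_elementwise(contrib, tau):
    """Clamp each (position, vocab) contribution at tau, then sum over vocab."""
    return [sum(min(c, tau) for c in row) for row in contrib]


def clip_per_position(contrib, tau):
    """Sum over vocab first, then clamp the per-position total at tau."""
    return [min(sum(row), tau) for row in contrib]


contrib = [
    [0.08, 0.01, 0.01],   # one dominant vocab entry
    [0.03, 0.03, 0.03],   # mass spread evenly, no entry above tau
]
tau = 0.05
elementwise = clip_elementwise(contrib, tau)     # about [0.07, 0.09]
per_position = clip_per_position(contrib, tau)   # [0.05, 0.05]
```

Elementwise clipping only touches the single outlier entry in the first row, whereas a per-position clip would also flatten the second row even though none of its entries is large.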
---
## 3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics
### 3.1 ERROR-TURN MASKING — NOT IN SDPO
The repo implements SDPO as an error-turn-masked loss:
- `_compute_sdpo_loss` applies JSD only at `error-turn tokens` (via `sdpo_loss_mask`).
- The data collator detects error sites in a trace and constructs a teacher context
with a hint inserted at the error turn (`ctx_teacher = ctx_student + hint`).
- The hint shifts teacher response tokens right, requiring explicit alignment indices
(ADR-011).
**The SDPO paper has no error-turn masking.** SDPO applies the KL loss to ALL tokens `t`
in the rollout response:
> "L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})))"
The SDPO teacher context includes the full feedback; both student and teacher evaluate
log-probs of the **same response tokens** `y`. There is no "hint inserted into the
response" — the feedback is in the conditioning prefix, not intercalated into the
response sequence. Therefore the teacher response tokens are **not shifted** and token
alignment is trivially preserved: both contexts evaluate the same sequence `y`.
**The repo's architecture (hint at error turn → response token shift → alignment indices)**
is an interpretation of Composer 2.5's "hint" mechanism, not a feature of SDPO. SDPO's
feedback is in the prompt/conditioning context; it does not intercalate text into the
middle of a response.
**VERDICT**: The repo's error-turn-masking design is a reasonable extension of the
Composer blog's described mechanism ("insert hint at error turn") but is **NOT**
SDPO as described in the paper. The Composer blog's mechanism is itself not fully
described and may or may not match SDPO mechanics.
### 3.2 TEACHER CONTEXT — CRITICAL DIFFERENCE
**SDPO teacher context** (Table 2):
```
[prompt, feedback_f, original_response_y]
```
The teacher evaluates log-probs of `original_response_y` given `[prompt, feedback_f]`.
Teacher prefix = `[prompt, feedback_f]`. Response = `y` (same as student). No hint is
inserted *into* `y`.
**Repo teacher context**:
```
ctx_teacher = ctx_student + hint_at_error_turn
```
The hint is *intercalated* into the response sequence at an error turn. Teacher prefix
= student prefix. Response = `y_before_error + hint + y_after_error`. Teacher response
tokens are LONGER than student response tokens and SHIFTED.
This is architecturally different from SDPO. The alignment problem (ADR-008, ADR-011)
arises precisely because the repo's teacher context design inserts hint text into the
response, which SDPO does not do.
**VERDICT**: The repo's teacher context construction is a novel design inspired by the
Composer blog. It is not what SDPO does. The ADR-008 "trust-gap" and the entire
ADR-011 alignment index complexity are artifacts of this departure from SDPO, not
corrections to SDPO.
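The difference between the two layouts can be shown with token positions alone. This is a toy sketch: the token names, the splice point `error_at`, and the helper `response_positions` are all hypothetical, and the hint-insertion layout follows the description in this section rather than the repo's actual collator.

```python
def response_positions(prefix_len, response_len):
    """Positions of response tokens when they follow a prefix contiguously."""
    return list(range(prefix_len, prefix_len + response_len))


prompt = ["p0", "p1"]
feedback = ["f0", "f1", "f2"]
response = ["y0", "y1", "y2", "y3"]

# SDPO-style: feedback lives in the prefix, so the response stays contiguous
# and teacher positions equal student positions plus one constant offset.
student_idx = response_positions(len(prompt), len(response))
teacher_idx = response_positions(len(prompt) + len(feedback), len(response))
offsets = {t - s for s, t in zip(student_idx, teacher_idx)}

# Hint-insertion style: the hint is spliced in after response token y1,
# so only the tokens after the splice point shift.
hint = ["h0", "h1"]
error_at = 2
teacher_seq = prompt + response[:error_at] + hint + response[error_at:]
spliced_idx = [i for i, tok in enumerate(teacher_seq) if tok.startswith("y")]
spliced_offsets = {t - s for s, t in zip(student_idx, spliced_idx)}
```

A single offset (`{3}` here) means a plain slice aligns the two sequences; two different offsets (`{0, 2}`) are what force explicit per-token alignment indices.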
### 3.3 OPSD vs. SDPO as the loss source
The repo header in `opsd.py` says the loss is:
> "lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)"
And:
> "SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's
> 'Targeted RL with Textual Feedback.'"
These are TWO DIFFERENT methods with related but not identical losses:
- **OPSD loss** (Eq. 7–8): `JSD_β(p_T ‖ p_S)` with teacher having `y*` (ground truth).
Normalization: per-sequence average then batch average. Pointwise vocab clipping.
Training runs ~100 steps. Fixed teacher (initial checkpoint, not live).
- **SDPO loss** (Eq. 1): `KL(π_θ(·|x,y_{<t}) ‖ stopgrad(π_θ(·|x,f,y_{<t})))` where
KL is applied per-position over the full response. The paper adopts JSD as a stability
improvement (§2.3) but the base formulation is reverse KL (student ‖ teacher).
Teacher is regularized via EMA or trust-region. No per-vocab clipping in the paper
(top-K approximation instead).
The repo correctly implements the OPSD JSD formula (which SDPO also uses for stability).
The claim "verified port of siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss" is
accurate for the loss kernel. The claim "Composer 2.5's 'Targeted RL with Textual Feedback'"
is an assertion that Composer uses the same loss — this is not confirmed anywhere in the
Cursor blog or Composer 2 tech report.
---
## 4. Audit: `_compute_sdpo_loss` in `composer_trainer.py`
### 4.1 Gradient flow — CORRECT
```python
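# Student forward pass: gradients flow through these logits.
student_logits = model(input_ids=inputs["input_ids"]).logits
# Teacher forward pass on the hint-augmented context, with gradients disabled.
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits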
```
Teacher is `no_grad` — matches SDPO's `stopgrad(π_θ(·|x,f,y_{<t}))`. Student has
gradient. Correct.
### 4.2 Alignment index machinery — NECESSARY GIVEN THE DESIGN, BUT NOT FROM SDPO
The `student_response_idx` / `teacher_response_idx` machinery (ADR-011) is needed
because the hint is inserted into the teacher response sequence. This complexity does
not exist in SDPO or OPSD because those methods never insert text into the response.
The repo's `strict_sdpo_alignment` guard is a correct defense against the problem it
has created for itself.
### 4.3 Batch-level masking — CORRECT for the repo's error-turn interpretation
The loss is masked to error-turn tokens only (`aligned_labels` with -100 elsewhere).
This means the SDPO channel only trains on error recovery tokens, not the full rollout.
SDPO trains on the full rollout. For Composer's intent (correcting error turns), the
masking is reasonable, but it produces a loss that is more like a targeted distillation
at error sites than SDPO's full-rollout advantage assignment.
---
## 5. Audit: `research/07-sdpo-hint-generator.md` — Accuracy check
### 5.1 Three feedback types from SDPO paper — CORRECTLY REPORTED
research/07 correctly identifies the three types:
1. Sample solution (successful sibling rollout)
2. Environment output (runtime errors)
3. Student's original attempt
The paper (Table 6 results, which research/07 did NOT have access to) shows:
- Best configuration: `f = output + own solution` (48.3% accuracy)
- Including `f = y` (original attempt as conditioning, not as response) **hurts diversity**
and slightly reduces final accuracy (44.5% vs 48.3%)
Research/07 correctly notes the sibling rollout is "always generated by the student, not
an expert model" — confirmed in the paper: "We emphasize that these sample solutions are
always generated by the student, as in GRPO, and do not require an expert model."
### 5.2 "Successful sibling rollout as implicit feedback" claim — CORRECTLY REPORTED
The abstract: "SDPO also outperforms baselines in standard RLVR environments that only
return scalar feedback by using successful rollouts as implicit feedback for failed attempts."
Research/07 cites this correctly and uses it as the basis for the `SiblingBootstrapGenerator`.
### 5.3 OPSD `--reason_first` flag — CORRECTLY DESCRIBED
The OPSD README confirms: `--reason_first False: Prepend an explicit rationalization to
the teacher context before distillation.` Research/07 correctly calls this "OPSD's own
knob for same-model introspection."
### 5.4 `--jsd_token_clip default 0.05` — CORRECTLY CITED
Confirmed from OPSD README: `--jsd_token_clip 0.05` is the default.
---
## 6. Audit: `SiblingBootstrapGenerator` — Is it supported by the papers?
The repo's `hint_generator.py` sketch (lines 319–331) and `research/07` §6.3:
```python
class SiblingBootstrapGenerator:
    def generate(self, ctx):
        sibs = ctx.get("sibling_rollouts") or []
        winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
        if not winners:
            return None
        best = max(winners, key=lambda s: s["reward"])
        snippet = (best.get("solution_excerpt") or "")[:200]
        return ("Reminder: a working approach for this task looks like:\n"
                f"{snippet}\nAdapt this to the current step.")
```
**What the SDPO paper actually does** (Table 2 template):
```
Correct solution: {successful_previous_rollout}
```
The successful rollout is passed as the full solution (or relevant excerpt) in the
teacher context. The teacher then evaluates log-probs of the student's original
response given this context.
**Key difference**: In SDPO, the sibling solution goes into the teacher's conditioning
prefix. The teacher does not generate a new hint; it just re-evaluates the student's
response log-probs with the solution visible. In the repo, the sibling solution is
used to *generate a hint string* that gets inserted into the response sequence.
This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout
as implicit feedback" mechanism does NOT:
1. Generate a "Reminder: a working approach..." hint string.
2. Insert text into the student's response sequence.
3. Require error-turn detection.
The SDPO sibling mechanism IS:
1. Condition the teacher on the full successful solution.
2. Re-evaluate ALL student response token log-probs under that teacher.
3. Apply the KL loss across the entire response.
**VERDICT**: The `SiblingBootstrapGenerator` as sketched is an extrapolation from SDPO's
mechanism, not a faithful implementation of it. The paper supports using a sibling rollout
as teacher conditioning context; it does not support generating a textual hint from it
to splice into the response. The Composer blog's "hint" framing is the source of this
architectural decision; SDPO is cited as inspiration but is not the mechanism.
Research/07 acknowledges this at several points ("A working approach looks like: …" in
the class comment vs the actual SDPO template) but does not flag it as a divergence — it
presents the sibling-bootstrap hint approach as if it naturally follows from SDPO.
---
## 7. Audit: `research/11-sdpo-alignment-indices.md`
### 7.1 Problem correctly identified
ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher
response tokens right. The alignment indices machinery (`_mask_to_padded_indices`,
`student_response_idx`, `teacher_response_idx`, sentinel handling) is a sound engineering
solution to the problem the repo's design creates.
### 7.2 Root cause attribution — MISLEADING
ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the
papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate.
The alignment problem does not exist in SDPO or OPSD because those methods never insert
text into the response sequence. The alignment problem is entirely self-created by the
repo's decision to implement the Composer blog's "hint at error turn" as a text insertion
into the teacher's response sequence.
---
## 8. Audit: ADR-007, ADR-008, ADR-009 — Key claims
### ADR-007 — JSD as "the kernel of SDPO arXiv:2601.20802"
The ADR says `generalized_jsd_loss` is "verified port of siyan-zhao/OPSD, the kernel of
SDPO arXiv:2601.20802." This conflates two papers. The JSD is the kernel of OPSD
(the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a
**stability improvement** over the base KL loss; the primary SDPO loss is the KL. Both
papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not
consequential for the loss code but creates confusion in documentation.
### ADR-008 — "SDPO needs full vocabulary logits"
Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively
computing the KL divergence between student and teacher requires holding full logits of
both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's
log-probs-only interface is correctly identified as incompatible with the SDPO channel.
### ADR-008 — Dr. GRPO as the Composer algorithm
This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out
of scope for this cluster).
### ADR-009 — "How Cursor generates that hint is unstated"
Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in
research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question.
---
## 9. Summary of findings
| Claim | Source | Verdict |
|---|---|---|
| JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT |
| β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label |
| "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code |
| Error-turn masking is from SDPO | SDPO paper | FALSE — SDPO applies loss to full rollout, no error-turn detection |
| Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE — SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated |
| SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION — SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence |
| Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING — problem is self-created by hint-insertion design; does not exist in SDPO |
| SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT |
| Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED |
| --jsd_token_clip default 0.05 | OPSD README | CORRECT |
| --reason_first flag | OPSD README | CORRECTLY DESCRIBED |
| "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED |
| Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py |
---
## 10. Recommendations
1. **Fix the β docstring** in `opsd.py` to match the OPSD upstream convention:
β=0 → forward KL (KL(teacher‖student)), β=1 → reverse KL (KL(student‖teacher)).
2. **Clarify the architectural departure from SDPO** in `composer_trainer.py` docstring
and `research/07`: the repo implements a Composer-blog-inspired error-turn hint
injection, which is an extension beyond SDPO. SDPO uses the feedback in the prompt
prefix and evaluates the full response; the repo intercalates text into the response.
3. **Reconsider framing of `SiblingBootstrapGenerator`**: it is an original design choice,
not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like:
build a teacher context `[prompt, successful_sibling_rollout, original_response]` and
apply KL over the whole original response — without generating a hint string or
error-turn detection. This would be simpler and more faithful to SDPO.
4. **Teacher regularization is not implemented**: the SDPO paper shows a non-regularized
teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model
weights at each step with no EMA or trust-region regularization. For production SDPO
runs this is a gap. (The `sdpo_jsd_beta` default of 0.5 uses symmetric JSD which is
one of SDPO's stability improvements, but the teacher regularization is absent.)
5. **SDPO's original attempt placement**: the paper includes the student's original
response as the sequence being log-prob-evaluated (i.e., the "response" slot in the
teacher context). The repo's collator instead masks specific error-turn tokens within
a modified response. These are architecturally different. The paper-accurate approach
would re-evaluate log-probs of the entire original response under the hint-conditioned
teacher, not just the tokens after the error.
6. **Failure mode from SDPO paper**: the strongest limitation is model capability
dependence — SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with
λ=0.9 is recommended for weaker base models. This is not documented in the repo's
SDPO usage guidance.
---
## 11. What the papers do NOT say (repo-claimed but unconfirmed in sources)
- That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the
Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but
the connection is an inference, not a citation).
- That error-turn masking is part of SDPO.
- That the repo's hint-at-error-turn teacher context is the SDPO mechanism.
- That the alignment index problem (ADR-011) is an issue in SDPO.
- How Cursor generates the hint (confirmed absent in all Cursor artifacts).