Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

	# Deep-Read: SDPO / OPSD — Critical Audit of Cluster 2

	Date: 2026-06-09
	Sources fetched from primary HTML:
	- SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford)
	— note `reinforcement-learning-via-self-distillation-2` (148 K body)
	- OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026)
	— note `self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2` (52 K body)
	- OPSD code repo: github.com/siyan-zhao/OPSD (README + key args)
	- SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body)

	Repo files audited: `composer_replication/opsd.py`, `composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss`, `composer_replication/hint_generator.py`, `research/07-sdpo-hint-generator.md`, `research/11-sdpo-alignment-indices.md`, `docs/adrs/ADR-007`, `ADR-008`, `ADR-009`, `ADR-011`.

	---

	## 1. What the primary sources actually say

	### 1.1 SDPO (arXiv:2601.20802v2) — core method

	Exact loss (Eq. 1):

	```
	L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})) )
	```

	KL direction: KL(student ‖ teacher), i.e. `KL(π_θ || q_θ)`, with the student in the
	first argument. Under the usual distillation convention (forward KL =
	KL(teacher ‖ student)), this is the reverse, mode-seeking KL, which is also how §3.3
	below labels it. The paper writes it `KL(π_θ ‖ stopgrad(q_θ))`.
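
As a rough illustration of Eq. 1, the per-position KL(student ‖ teacher) can be summed over the response with plain probability lists. This is a minimal sketch, not the paper's code: the helper names `kl` and `sdpo_loss` and all numbers are made up for the example, and every teacher probability is assumed to be nonzero.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sdpo_loss(student_dists, teacher_dists):
    """Sum over response positions t of KL(student_t || teacher_t).

    The teacher distributions are treated as constants, mirroring the
    stop-gradient in Eq. 1 (plain floats carry no gradient anyway).
    """
    return sum(kl(s, t) for s, t in zip(student_dists, teacher_dists))

# Two response positions over a 3-token vocabulary (illustrative numbers).
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = sdpo_loss(student, teacher)
```

Only the first position contributes here, because the student and teacher agree exactly at the second one.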

	Stability improvements (§2.3):
	1. Regularized teacher: EMA of student params OR interpolation between current teacher
	and initial teacher `q_{θ_ref}`. The paper calls these "trust-region" and "EMA"
	teachers (Table 4). Non-regularized teacher (`q_θ`, the live student) diverges.
	2. Symmetric JSD: the paper adopts JSD as the distillation loss for stability —
	citing Agarwal et al. 2024 on-policy distillation. The pseudocode (Fig 14) calls
	this `divergence(logprobs_student, logprobs_teacher)` with no fixed default — the
	paper reports using JSD.
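
The EMA teacher in item 1 can be sketched as an exponential moving average over parameters. This is only an illustration under assumptions: parameters are represented as a dict of floats, and the `decay` value is hypothetical rather than taken from the paper.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Return new teacher params: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }

# Toy parameters; a real model would hold tensors instead of floats.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher trails the student slowly, which is the regularizing effect the paper relies on.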

	Top-K approximation (Appendix A.3):
	The paper approximates the full-vocab KL with top-K tokens of the student distribution:
	```
	L_SDPO ≈ Σ_t [ Σ_{ŷ_t ∈ topK(π_θ)} π_θ(ŷ_t|x,y_{<t}) · log(π_θ(ŷ_t) / q_θ(ŷ_t))
	+ tail_term ]
	```
	The tail term aggregates the remaining probability mass. Default K=100.
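
A minimal sketch of a top-K approximation with a tail term follows. The exact form of the paper's tail term is not reproduced here; this version assumes the remaining mass of each distribution is lumped into a single bucket, and `topk_kl` is a hypothetical helper name.

```python
import math

def full_kl(p, q):
    """Exact KL(p || q) over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def topk_kl(student, teacher, k):
    """Approximate KL(student || teacher) from the student's top-k tokens.

    Assumption for this sketch: all remaining probability mass is merged
    into one tail bucket per distribution.
    """
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    head = sum(student[i] * math.log(student[i] / teacher[i]) for i in top)
    s_tail = 1.0 - sum(student[i] for i in top)
    t_tail = 1.0 - sum(teacher[i] for i in top)
    tail = s_tail * math.log(s_tail / t_tail) if s_tail > 0.0 and t_tail > 0.0 else 0.0
    return head + tail

student = [0.5, 0.3, 0.1, 0.06, 0.04]
teacher = [0.4, 0.3, 0.2, 0.05, 0.05]
approx = topk_kl(student, teacher, k=3)
exact = full_kl(student, teacher)
```

Merging tokens into one bucket can only lose information, so under this assumption the approximation never exceeds the exact KL.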

	Teacher context construction (Table 2, verbatim):
	```
	User: {prompt}
	Correct solution: {successful_previous_rollout} (skipped if unavailable)
	The following is feedback from your unsuccessful earlier attempt:
	{environment_output} (skipped if no env output or if solved)
	Correctly solve the original question.
	Assistant: {original_response} (the student's original attempt, for log-prob re-eval)
	```
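
The template above could be assembled roughly as follows. This is a sketch under assumptions: the function names are hypothetical, plain string concatenation stands in for a real chat template, and the caller is expected to pass `env_output=None` when the attempt was already solved.

```python
def build_teacher_prompt(prompt, successful_rollout=None, env_output=None):
    """Assemble the teacher's conditioning prefix, skipping unavailable parts."""
    parts = [f"User: {prompt}"]
    if successful_rollout:
        parts.append(f"Correct solution: {successful_rollout}")
    if env_output:
        parts.append("The following is feedback from your unsuccessful earlier attempt:")
        parts.append(env_output)
    parts.append("Correctly solve the original question.")
    return "\n".join(parts)

def build_teacher_sequence(prompt, original_response, **feedback):
    """Append the student's original response, unchanged, as the assistant turn."""
    return build_teacher_prompt(prompt, **feedback) + f"\nAssistant: {original_response}"

with_solution = build_teacher_sequence("Q", "resp", successful_rollout="sol")
without_feedback = build_teacher_sequence("Q", "resp")
```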

	Critical nuance: the `original_response` is placed in the teacher context so the
	model can re-evaluate log-probs of `y` under the teacher. **The student's original
	attempt is always appended as the response the teacher evaluates** — this is how both
	student and teacher evaluate log-probs of the same token sequence `y`.

	Token alignment: there is no shift / hint-insertion at an "error turn." The teacher
	sees `[prompt, feedback, original_response]` and both student and teacher evaluate
	log-probs of the SAME `original_response` tokens. No additional tokens are inserted
	into the response; the prefix is longer for the teacher (it has feedback), but the
	response token sequence evaluated is identical.
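
The alignment argument can be checked on toy token IDs: the teacher's prefix is longer, so its response starts at a later offset, but the response tokens themselves are the same. The IDs and the helper `response_slice` are made up for this sketch.

```python
def response_slice(prefix_ids, response_ids):
    """Return the full sequence and the slice that covers the response tokens."""
    full = prefix_ids + response_ids
    return full, slice(len(prefix_ids), len(full))

prompt = [11, 12, 13]
feedback = [21, 22]
response = [31, 32, 33, 34]

# Student context: prompt only. Teacher context: prompt plus feedback.
student_seq, student_sl = response_slice(prompt, response)
teacher_seq, teacher_sl = response_slice(prompt + feedback, response)

# The only difference is a constant offset equal to the feedback length.
offset = teacher_sl.start - student_sl.start
```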

	Feedback types ablated (§4.6, Table 6), three single types plus their combination:
	1. Sample solution (`f = own solution`): a successful sibling rollout from the GRPO
	group; always student-generated (no expert model). Teacher accuracy: 42.4%.
	2. Environment output (`f = output`): runtime errors, failing unit tests, etc.
	Teacher accuracy: 32.5%.
	3. Student's original attempt (`f = y`): the paper reports that including the
	original attempt in the feedback (not just in the response slot) **reduces teacher
	diversity** (biases the teacher toward the student; "Same output": 30% vs ~10–13%).
	4. Combined (`f = output + own solution`): best trained student accuracy (48.3%).
	Excluding `f = y` (the original attempt as part of conditioning) is key.

	Failure modes reported:
	- Non-regularized teacher (`q_θ`) diverges / training collapses (Table 4: 36.1% vs
	50.6% for trust-region teacher).
	- Performance depends on model in-context learning ability: SDPO underperforms GRPO on
	weaker models (Qwen3-0.6B); hybrid SDPO+GRPO (λ=0.9 GRPO + 0.1 SDPO) is more
	robust (§4.5).
	- Uninformative or misleading environment feedback: SDPO cannot learn from it.
	- SDPO adds small compute overhead (additional forward for log-prob re-computation of
	teacher context); minor for large models, non-negligible for small models.
	- Including the student's own attempt in the teacher conditioning (not just as the
	response to re-evaluate) reduces diversity; the correct template excludes it from
	the conditioning prefix.

	SDPO operates over the full rollout, not at isolated "error turns". The loss
	sums over all tokens `t` in the response `y`. There is no error-site detection
	step in the SDPO paper.

	### 1.2 OPSD (arXiv:2601.18734v3) — core method

	Exact loss (Eq. 6–8):
	```
	L_OPSD(θ) = E_{(x,y*)~S} [ E_{ŷ~p_S(·|x)} [ D(p_T ‖ p_S)(ŷ|x) ] ]
	where
	D(p_T ‖ p_S)(ŷ|x) := (1/|ŷ|) Σ_{n=1}^{|ŷ|} D( p_T(·|x, y*, ŷ_{<n}) ‖ p_S(·|x, ŷ_{<n}) )
	```

	Divergence D can be: forward KL, reverse KL, or JSD_β. The paper defines:
	```
	JSD_β(p_T ‖ p_S) = β·KL(p_T ‖ m) + (1-β)·KL(p_S ‖ m)
	m = β·p_T + (1-β)·p_S
	```

	Direction convention: `D(p_T ‖ p_S)`, teacher in the first argument, student in the second. With `m = β·p_T + (1-β)·p_S`, the JSD vanishes at both endpoints, so the KL limits hold only after rescaling:
	- β→0: m→p_S, and JSD_β/β → KL(p_T ‖ p_S) (forward KL, mode-covering for the student)
	- β→1: m→p_T, and JSD_β/(1-β) → KL(p_S ‖ p_T) (reverse KL, mode-seeking for the student)

	The GKD paper (arXiv:2306.13649) that OPSD cites defines JSD_β with the **same
	convention**: `JSD_β(p ‖ q) = β·KL(p||M) + (1-β)·KL(q||M)`.
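
The behavior of JSD_β near the endpoints can be checked numerically straight from the definition above. This is a small sketch with made-up distributions; `kl` and `jsd_beta` are illustrative helpers, not the upstream implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd_beta(p_t, p_s, beta):
    """JSD_beta(p_T || p_S) = beta*KL(p_T || m) + (1-beta)*KL(p_S || m)."""
    m = [beta * t + (1.0 - beta) * s for t, s in zip(p_t, p_s)]
    return beta * kl(p_t, m) + (1.0 - beta) * kl(p_s, m)

p_t = [0.6, 0.3, 0.1]  # teacher
p_s = [0.2, 0.5, 0.3]  # student

eps = 1e-4
near_zero = jsd_beta(p_t, p_s, eps) / eps        # compare with KL(p_T || p_S)
near_one = jsd_beta(p_t, p_s, 1.0 - eps) / eps   # compare with KL(p_S || p_T)
at_zero = jsd_beta(p_t, p_s, 0.0)                # mixture equals p_S exactly
```

The rescaled value near β=0 tracks KL(p_T ‖ p_S) and the one near β=1 tracks KL(p_S ‖ p_T), while the unscaled JSD at exactly β=0 is zero because the mixture collapses onto the student.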

	Per-token pointwise clipping: OPSD introduces this explicitly:
	```
	D_clip^(f)(p_T ‖ p_S) = (1/|ŷ|) Σ_n Σ_v min(l_{n,v}^(f), τ)
	where l_{n,v}^(f) = p_T(v|·) · f( p_S(v|·) / p_T(v|·) )
	```
	This clips per vocab-entry contribution. Default τ=0.05 (from README: `--jsd_token_clip 0.05`).
	Non-thinking mode results in README use 1e-7 (Qwen3-8B) and 1e-6 (Qwen3-4B, 1.7B).

	Teacher context: `p_T(·|x, y*, ŷ_{<n})`. The teacher sees the ground-truth answer `y*`
	(a reference CoT / verified reasoning trace from the dataset) prepended to the problem,
	then evaluates the student's prefix `ŷ_{<n}`. Both distributions are evaluated on the same
	token sequence at each step `n`.

	`--reason_first` flag (from GitHub README): Prepend an explicit rationalization to the
	teacher context before distillation. This is OPSD's self-introspection lever: the teacher
	is first asked to rationalize why `y*` is correct, then that rationalization is folded into
	the conditioning. Not the main results configuration; requires `--use_peft`.

	Results: On Qwen3-1.7B (AIME24/25/HMMT25), +OPSD vs. base: 37.1% → 43.4%
	(Avg@12). Outperforms GRPO (37.7%) and SFT (35.8%). Token-efficient: generation capped
	at 1024 tokens vs. GRPO's 16K.

	---

	## 2. Audit: `composer_replication/opsd.py` — "byte-for-byte OPSD parity" claim

	### 2.1 JSD formula — CORRECT, with a subtle direction note

	The code implements:
	```python
	JSD = β·KL(teacher||M) + (1-β)·KL(student||M)
	log M = logsumexp([ log p_student + log(1-β), log p_teacher + log(β) ])
	```

	The OPSD paper (Eq. 7) defines:
	```
	JSD_β(p_T ‖ p_S) = β·KL(p_T‖m) + (1-β)·KL(p_S‖m)
	```
	where `m = β·p_T + (1-β)·p_S`.

	The code `kl_teacher = F.kl_div(mixture_log_probs, teacher_log_probs, ...)` uses
	PyTorch semantics where `F.kl_div(input=log_q, target=log_p, log_target=True)`
	computes `KL(p||q) = Σ p(x)·(log p(x) - log q(x))`. So `kl_teacher` computes
	`KL(teacher||mixture)` and `kl_student` computes `KL(student||mixture)`.

	The final JSD: `β·kl_teacher + (1-β)·kl_student` = `β·KL(teacher||M) + (1-β)·KL(student||M)`.

	This matches the OPSD paper's `JSD_β(p_T ‖ p_S)` exactly. CORRECT.

	### 2.2 β convention docstring — INVERTED vs. both papers

	The `opsd.py` docstring says:
	```
	β = 0 → KL(teacher || student) (reverse KL — mode-covering for student)
	β = 1 → KL(student || teacher) (forward KL — mode-seeking for student)
	```

	From the OPSD GitHub README:
	> `--beta`: Interpolation weight for the JSD mixture distribution.
	> Beta=0 means forward KL and 1 means reverse KL.

	The repo docstring has the "forward"/"reverse" names swapped relative to the OPSD upstream.
	With `m = β·p_T + (1-β)·p_S`, β→0 gives M→p_student, so the `(1-β)·KL(student||M)` term
	vanishes to second order in β and `JSD_β/β → KL(teacher||student)`. That is forward KL in the
	standard convention (mode-covering for the student), which matches the README's "Beta=0 means
	forward KL". Symmetrically, β→1 gives `JSD_β/(1-β) → KL(student||teacher)`, the reverse,
	mode-seeking KL. At exactly β=0 or β=1 the unscaled formula evaluates to 0, so those
	endpoints are meaningful only if the code special-cases them.

	The code implementation is correct (the formula computes the right mixture). The docstring's
	KL expressions and mode-covering/mode-seeking notes agree with these limits, but it names
	β=0 "reverse KL" and β=1 "forward KL", which contradicts the upstream README and the
	standard naming. This is a documentation error, not a numerical error.

	VERDICT: Implementation is numerically correct. Docstring direction labels are inverted.

	### 2.3 `reduction="batchmean"` behavior — MATCHES upstream code, MINOR DIVERGENCE from the paper

	The repo `opsd.py` comment says:
	> "batchmean" matches upstream OPSD: divides by `mask.sum()` when labels are given,
	> else by the leading dim of jsd (= batch size).

	The OPSD paper (Algorithm 1) normalizes by `|ŷ|` (sequence length, token-mean):
	```
	ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...)
	L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*)
	```

	The repo divides by `mask.sum()` (the number of valid, unmasked tokens in the batch), which
	is equivalent to the paper's normalization only when every example has the same number of
	valid (error-turn) tokens. When per-example token counts vary (real training), pooling
	weights long sequences more heavily than the paper's per-sequence average followed by
	batch average. In practice this difference is negligible for stability, but it is
	technically not byte-for-byte OPSD parity on the reduction.

	VERDICT: The `reduction="batchmean"` logic is borrowed from the OPSD upstream code
	(which uses the same `mask.sum()` convention). The docstring's "matches upstream" claim
	is accurate for the code, but the code diverges from the paper's stated per-sequence
	normalization. Not a material issue.
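
The gap between the two reductions shows up as soon as sequences differ in length. The sketch below uses made-up per-token losses and hypothetical helper names; it is not the repo's or upstream's reduction code.

```python
def token_mean(per_token_losses):
    """Pool all valid tokens in the batch, then divide by their count (mask.sum())."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)

def sequence_then_batch_mean(per_token_losses):
    """Average within each sequence first, then across sequences (paper's form)."""
    seq_means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(seq_means) / len(seq_means)

# One short sequence with low loss, one long sequence with high loss.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
pooled = token_mean(batch)                 # (2*1 + 6*4) / 8 = 3.25
per_seq = sequence_then_batch_mean(batch)  # (1 + 4) / 2 = 2.5
```

When every sequence has the same number of valid tokens, the two reductions coincide.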

	### 2.4 `token_clip` parameter — CORRECT semantics; clips per (token, vocab) entry, not per token

	The repo implements `token_clip` as a clamp on the unreduced JSD tensor:
	```python
	jsd = jsd.clamp(max=token_clip)  # jsd shape is (B, T, V) or (n_valid, V)
	```

	The OPSD paper's pointwise clipping (Section 3.3) clips per-(position, vocab-entry):
	`min(l_{n,v}^(f), τ)` for each vocab entry v at each position n.

	The upstream OPSD code (`--jsd_token_clip`) appears to apply the same per-(position,vocab)
	clip. Because the repo clamps the jsd tensor before the vocabulary dimension is reduced
	(shape (B,T,V)), the clamp is elementwise: each (position, vocab) contribution is capped
	at `token_clip`, which matches the paper's pointwise clipping.

	VERDICT: Implementation appears correct. The parameter name (`token_clip`) is slightly
	misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match.
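
The difference between clipping each (position, vocab) entry and clipping each position's total can be seen on a toy example. The contribution values and helper names are invented for illustration.

```python
def clip_elementwise(contribs, tau):
    """Clip every (position, vocab) entry at tau, then sum over the vocabulary."""
    return [sum(min(c, tau) for c in row) for row in contribs]

def clip_per_position(contribs, tau):
    """Sum over the vocabulary first, then clip each position's total at tau."""
    return [min(sum(row), tau) for row in contribs]

# Rows are positions, columns are vocab entries (illustrative numbers).
contribs = [[0.20, 0.01, 0.01], [0.03, 0.03, 0.03]]
tau = 0.05
elementwise = clip_elementwise(contribs, tau)    # about [0.07, 0.09]
per_position = clip_per_position(contribs, tau)  # [0.05, 0.05]
```

Elementwise clipping only caps the single dominant entry in the first row, whereas per-position clipping would flatten both rows to τ.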

	---

	## 3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics

	### 3.1 ERROR-TURN MASKING — NOT IN SDPO

	The repo implements SDPO as an error-turn-masked loss:
	- `_compute_sdpo_loss` applies JSD only at `error-turn tokens` (via `sdpo_loss_mask`).
	- The data collator detects error sites in a trace and constructs a teacher context
	with a hint inserted at the error turn (`ctx_teacher = ctx_student + hint`).
	- The hint shifts teacher response tokens right, requiring explicit alignment indices
	(ADR-011).

	The SDPO paper has no error-turn masking. SDPO applies the KL loss to ALL tokens `t`
	in the rollout response:
	> "L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})))"

	The SDPO teacher context includes the full feedback; both student and teacher evaluate
	log-probs of the same response tokens `y`. There is no "hint inserted into the
	response" — the feedback is in the conditioning prefix, not intercalated into the
	response sequence. Therefore the teacher response tokens are not shifted and token
	alignment is trivially preserved: both contexts evaluate the same sequence `y`.

	The repo's architecture (hint at error turn → response token shift → alignment indices)
	is an interpretation of Composer 2.5's "hint" mechanism, not a feature of SDPO. SDPO's
	feedback is in the prompt/conditioning context; it does not intercalate text into the
	middle of a response.

	VERDICT: The repo's error-turn-masking design is a reasonable extension of the
	Composer blog's described mechanism ("insert hint at error turn") but is NOT
	SDPO as described in the paper. The Composer blog's mechanism is itself not fully
	described and may or may not match SDPO mechanics.

	### 3.2 TEACHER CONTEXT — CRITICAL DIFFERENCE

	SDPO teacher context (Table 2):
	```
	[prompt, feedback_f, original_response_y]
	```
	The teacher evaluates log-probs of `original_response_y` given `[prompt, feedback_f]`.
	Teacher prefix = `[prompt, feedback_f]`. Response = `y` (same as student). No hint is
	inserted into `y`.

	Repo teacher context:
	```
	ctx_teacher = ctx_student + hint_at_error_turn
	```
	The hint is intercalated into the response sequence at an error turn. Teacher prefix
	= student prefix. Response = `y_before_error + hint + y_after_error`. Teacher response
	tokens are LONGER than student response tokens and SHIFTED.

	This is architecturally different from SDPO. The alignment problem (ADR-008, ADR-011)
	arises precisely because the repo's teacher context design inserts hint text into the
	response, which SDPO does not do.

	VERDICT: The repo's teacher context construction is a novel design inspired by the
	Composer blog. It is not what SDPO does. The ADR-008 "trust-gap" and the entire
	ADR-011 alignment index complexity are artifacts of this departure from SDPO, not
	corrections to SDPO.

	### 3.3 OPSD vs. SDPO as the loss source

	The repo header in `opsd.py` says the loss is:
	> "lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)"

	And:
	> "SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's
	> 'Targeted RL with Textual Feedback.'"

	These are TWO DIFFERENT methods with related but not identical losses:

	- OPSD loss (Eq. 7–8): `JSD_β(p_T ‖ p_S)` with teacher having `y*` (ground truth).
	Normalization: per-sequence average then batch average. Pointwise vocab clipping.
	Training runs ~100 steps. Fixed teacher (initial checkpoint, not live).

	- SDPO loss (Eq. 1): `KL(π_θ(·|x,y_{<t}) ‖ stopgrad(π_θ(·|x,f,y_{<t})))` where
	KL is applied per-position over the full response. The paper adopts JSD as a stability
	improvement (§2.3) but the base formulation is reverse KL (student ‖ teacher).
	Teacher is regularized via EMA or trust-region. No per-vocab clipping in the paper
	(top-K approximation instead).

	The repo correctly implements the OPSD JSD formula (which SDPO also uses for stability).
	The claim "verified port of siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss" is
	accurate for the loss kernel. The second header claim, that SDPO formalizes the same loss
	as Composer 2.5's "Targeted RL with Textual Feedback", asserts that Composer uses this
	loss; that is not confirmed anywhere in the Cursor blog or Composer 2 tech report.

	---

	## 4. Audit: `_compute_sdpo_loss` in `composer_trainer.py`

	### 4.1 Gradient flow — CORRECT

	```python
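	# Student forward pass: gradients flow through these logits.
	student_logits = model(input_ids=inputs["input_ids"]).logits
	# Teacher forward pass on the hint-augmented context: no gradient (stop-grad).
	with torch.no_grad():
	    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits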
	```

	Teacher is `no_grad`, matching SDPO's `stopgrad(π_θ(·|x,f,y_{<t}))`. The student forward
	pass carries gradient. Correct.

	### 4.2 Alignment index machinery — NECESSARY GIVEN THE DESIGN, BUT NOT FROM SDPO

	The `student_response_idx` / `teacher_response_idx` machinery (ADR-011) is needed
	because the hint is inserted into the teacher response sequence. This complexity does
	not exist in SDPO or OPSD because those methods never insert text into the response.
	The repo's `strict_sdpo_alignment` guard is a correct defense against the problem it
	has created for itself.
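
The shift that the alignment indices compensate for can be sketched on token lists. This is a simplified illustration: `aligned_indices` is a hypothetical helper, not the repo's `_mask_to_padded_indices`, and it ignores padding, sentinels, and next-token offsets.

```python
def aligned_indices(student_len, hint_pos, hint_len):
    """Map each student response position to its teacher position.

    The teacher response is the student response with hint_len tokens
    inserted at hint_pos, so positions at or after hint_pos shift right.
    """
    student_idx = list(range(student_len))
    teacher_idx = [i if i < hint_pos else i + hint_len for i in student_idx]
    return student_idx, teacher_idx

student_resp = ["a", "b", "ERR", "c", "d"]
hint = ["<h1>", "<h2>"]
hint_pos = 3  # hint inserted right after the error token (illustrative)
teacher_resp = student_resp[:hint_pos] + hint + student_resp[hint_pos:]

s_idx, t_idx = aligned_indices(len(student_resp), hint_pos, len(hint))
```

In the SDPO layout the feedback sits entirely in the prefix, so the equivalent mapping would be a single constant offset rather than a position-dependent one.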

	### 4.3 Batch-level masking — CORRECT for the repo's error-turn interpretation

	The loss is masked to error-turn tokens only (`aligned_labels` with -100 elsewhere).
	This means the SDPO channel only trains on error recovery tokens, not the full rollout.
	SDPO trains on the full rollout. For Composer's intent (correcting error turns), the
	masking is reasonable, but it produces a loss that is more like a targeted distillation
	at error sites than SDPO's full-rollout advantage assignment.

	---

	## 5. Audit: `research/07-sdpo-hint-generator.md` — Accuracy check

	### 5.1 Three feedback types from SDPO paper — CORRECTLY REPORTED

	research/07 correctly identifies the three types:
	1. Sample solution (successful sibling rollout)
	2. Environment output (runtime errors)
	3. Student's original attempt

	The paper (Table 6 results, which research/07 did NOT have access to) shows:
	- Best configuration: `f = output + own solution` (48.3% accuracy)
	- Including `f = y` (original attempt as conditioning, not as response) hurts diversity
	and slightly reduces final accuracy (44.5% vs 48.3%)

	Research/07 correctly notes the sibling rollout is "always generated by the student, not
	an expert model" — confirmed in the paper: "We emphasize that these sample solutions are
	always generated by the student, as in GRPO, and do not require an expert model."

	### 5.2 "Successful sibling rollout as implicit feedback" claim — CORRECTLY REPORTED

	The abstract: "SDPO also outperforms baselines in standard RLVR environments that only
	return scalar feedback by using successful rollouts as implicit feedback for failed attempts."

	Research/07 cites this correctly and uses it as the basis for the `SiblingBootstrapGenerator`.

	### 5.3 OPSD `--reason_first` flag — CORRECTLY DESCRIBED

	The OPSD README confirms: `--reason_first False: Prepend an explicit rationalization to
	the teacher context before distillation.` Research/07 correctly calls this "OPSD's own
	knob for same-model introspection."

	### 5.4 `--jsd_token_clip default 0.05` — CORRECTLY CITED

	Confirmed from OPSD README: `--jsd_token_clip 0.05` is the default.

	---

	## 6. Audit: `SiblingBootstrapGenerator` — Is it supported by the papers?

	The repo's `hint_generator.py` sketch (lines 319–331) and `research/07` §6.3:
	```python
	class SiblingBootstrapGenerator:
	    def generate(self, ctx):
	        sibs = ctx.get("sibling_rollouts") or []
	        winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
	        if not winners:
	            return None
	        best = max(winners, key=lambda s: s["reward"])
	        snippet = (best.get("solution_excerpt") or "")[:200]
	        return ("Reminder: a working approach for this task looks like:\n"
	                f"{snippet}\nAdapt this to the current step.")
	```

	What the SDPO paper actually does (Table 2 template):
	```
	Correct solution: {successful_previous_rollout}
	```
	The successful rollout is passed as the full solution (or relevant excerpt) in the
	teacher context. The teacher then evaluates log-probs of the student's original
	response given this context.

	Key difference: In SDPO, the sibling solution goes into the teacher's conditioning
	prefix. The teacher does not generate a new hint; it just re-evaluates the student's
	response log-probs with the solution visible. In the repo, the sibling solution is
	used to generate a hint string that gets inserted into the response sequence.

	This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout
	as implicit feedback" mechanism does NOT:
	1. Generate a "Reminder: a working approach..." hint string.
	2. Insert text into the student's response sequence.
	3. Require error-turn detection.

	The SDPO sibling mechanism IS:
	1. Condition the teacher on the full successful solution.
	2. Re-evaluate ALL student response token log-probs under that teacher.
	3. Apply the KL loss across the entire response.

	VERDICT: The `SiblingBootstrapGenerator` as sketched is an extrapolation from SDPO's
	mechanism, not a faithful implementation of it. The paper supports using a sibling rollout
	as teacher conditioning context; it does not support generating a textual hint from it
	to splice into the response. The Composer blog's "hint" framing is the source of this
	architectural decision; SDPO is cited as inspiration but is not the mechanism.

	Research/07 acknowledges this at several points ("A working approach looks like: …" in
	the class comment vs the actual SDPO template) but does not flag it as a divergence — it
	presents the sibling-bootstrap hint approach as if it naturally follows from SDPO.

	---

	## 7. Audit: `research/11-sdpo-alignment-indices.md`

	### 7.1 Problem correctly identified

	ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher
	response tokens right. The alignment indices machinery (`_mask_to_padded_indices`,
	`student_response_idx`, `teacher_response_idx`, sentinel handling) is a sound engineering
	solution to the problem the repo's design creates.

	### 7.2 Root cause attribution — MISLEADING

	ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the
	papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate.
	The alignment problem does not exist in SDPO or OPSD because those methods never insert
	text into the response sequence. The alignment problem is entirely self-created by the
	repo's decision to implement the Composer blog's "hint at error turn" as a text insertion
	into the teacher's response sequence.

	---

	## 8. Audit: ADR-007, ADR-008, ADR-009 — Key claims

	### ADR-007 — JSD as "the kernel of SDPO arXiv:2601.20802"

	The ADR says `generalized_jsd_loss` is "verified port of siyan-zhao/OPSD, the kernel of
	SDPO arXiv:2601.20802." This telescopes two papers. The JSD is the kernel of OPSD
	(the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a
	stability improvement over the base KL loss; the primary SDPO loss is the KL. Both
	papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not
	consequential for the loss code but creates confusion in documentation.

	### ADR-008 — "SDPO needs full vocabulary logits"

	Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively
	computing the KL divergence between student and teacher requires holding full logits of
	both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's
	log-probs-only interface is correctly identified as incompatible with the SDPO channel.

	### ADR-008 — Dr. GRPO as the Composer algorithm

	This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out
	of scope for this cluster).

	### ADR-009 — "How Cursor generates that hint is unstated"

	Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in
	research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question.

	---

	## 9. Summary of findings

	| Claim | Source | Verdict |
	|---|---|---|
	| JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT |
	| β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label |
	| "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code |
	| Error-turn masking is from SDPO | SDPO paper | FALSE — SDPO applies loss to full rollout, no error-turn detection |
	| Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE — SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated |
	| SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION — SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence |
	| Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING — problem is self-created by hint-insertion design; does not exist in SDPO |
	| SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT |
	| Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED |
	| --jsd_token_clip default 0.05 | OPSD README | CORRECT |
	| --reason_first flag | OPSD README | CORRECTLY DESCRIBED |
	| "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED |
	| Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py |

	---

	## 10. Recommendations

	1. Fix the β docstring in `opsd.py` to match the OPSD upstream naming:
	β=0 → forward KL (KL(teacher‖student), mode-covering), β=1 → reverse KL (KL(student‖teacher), mode-seeking).

	2. Clarify the architectural departure from SDPO in `composer_trainer.py` docstring
	and `research/07`: the repo implements a Composer-blog-inspired error-turn hint
	injection, which is an extension beyond SDPO. SDPO uses the feedback in the prompt
	prefix and evaluates the full response; the repo intercalates text into the response.

	3. Reconsider framing of `SiblingBootstrapGenerator`: it is an original design choice,
	not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like:
	build a teacher context `[prompt, successful_sibling_rollout, original_response]` and
	apply KL over the whole original response — without generating a hint string or
	error-turn detection. This would be simpler and more faithful to SDPO.

	4. Teacher regularization is not implemented: the SDPO paper shows a non-regularized
	teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model
	weights at each step with no EMA or trust-region regularization. For production SDPO
	runs this is a gap. (The `sdpo_jsd_beta` default of 0.5 uses symmetric JSD which is
	one of SDPO's stability improvements, but the teacher regularization is absent.)

	5. SDPO's original attempt placement: the paper includes the student's original
	response as the sequence being log-prob-evaluated (i.e., the "response" slot in the
	teacher context). The repo's collator instead masks specific error-turn tokens within
	a modified response. These are architecturally different. The paper-accurate approach
	would re-evaluate log-probs of the entire original response under the hint-conditioned
	teacher, not just the tokens after the error.

	6. Failure mode from SDPO paper: the strongest limitation is model capability
	dependence — SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with
	λ=0.9 is recommended for weaker base models. This is not documented in the repo's
	SDPO usage guidance.

	---

	## 11. What the papers do NOT say (repo-claimed but unconfirmed in sources)

	- That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the
	Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but
	the connection is an inference, not a citation).
	- That error-turn masking is part of SDPO.
	- That the repo's hint-at-error-turn teacher context is the SDPO mechanism.
	- That the alignment index problem (ADR-011) is an issue in SDPO.
	- How Cursor generates the hint (confirmed absent in all Cursor artifacts).