# Deep-Read: SDPO / OPSD - Critical Audit of Cluster 2

**Date**: 2026-06-09

**Sources fetched from primary HTML**:
- SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford), note `reinforcement-learning-via-self-distillation-2` (148 K body)
- OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026), note `self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2` (52 K body)
- OPSD code repo: github.com/siyan-zhao/OPSD (README + key args)
- SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body)

**Repo files audited**: `composer_replication/opsd.py`, `composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss`, `composer_replication/hint_generator.py`, `research/07-sdpo-hint-generator.md`, `research/11-sdpo-alignment-indices.md`, `docs/adrs/ADR-007`, `ADR-008`, `ADR-009`, `ADR-011`.

---

## 1. What the primary sources actually say

### 1.1 SDPO (arXiv:2601.20802v2) - core method

**Exact loss (Eq. 1)**:

```
L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{ …
```

> `--beta`: Interpolation weight for the JSD mixture distribution.
> **Beta=0 means forward KL and 1 means reverse KL.**

The repo docstring has the β=0 and β=1 labels **swapped** relative to the OPSD upstream.

When β=0: `JSD_0 = 0·KL(teacher||M) + 1·KL(student||M)`. In the limit (degenerate β=0), M approaches p_teacher, so this approaches `KL(student||teacher)`, which is **forward KL** (student → teacher), **mode-seeking for student**. The README says "Beta=0 means forward KL", which matches this analysis.

The code *implementation* is correct (the formula computes the right mixture). The *docstring* labels β=0 as "reverse KL" and β=1 as "forward KL", which contradicts both the upstream README and the mathematical analysis. This is a documentation error, not a numerical error.

**VERDICT**: Implementation is numerically correct. Docstring direction labels are inverted.
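
As a small numeric illustration of the quantities discussed in this subsection, the sketch below evaluates both KL directions and the β=0.5 generalized JSD on a toy three-token vocabulary. It is a sketch under stated assumptions, not the repo's `generalized_jsd_loss`: the distributions and the helper names `kl` and `jsd_half` are made up for illustration. Because the "forward" and "reverse" labels depend on convention, each divergence is named by its argument order instead.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_half(p_teacher, p_student):
    """Generalized JSD at beta = 0.5, where the mixture M is the plain average."""
    m = [0.5 * t + 0.5 * s for t, s in zip(p_teacher, p_student)]
    return 0.5 * kl(p_teacher, m) + 0.5 * kl(p_student, m)


# Toy distributions over a 3-token vocabulary (hypothetical, not from either paper).
p_teacher = [0.7, 0.2, 0.1]
p_student = [0.1, 0.3, 0.6]

kl_teacher_student = kl(p_teacher, p_student)  # KL(teacher || student)
kl_student_teacher = kl(p_student, p_teacher)  # KL(student || teacher)
jsd = jsd_half(p_teacher, p_student)

print(f"KL(teacher||student) = {kl_teacher_student:.4f}")
print(f"KL(student||teacher) = {kl_student_teacher:.4f}")
print(f"JSD at beta=0.5      = {jsd:.4f}")
```

At β=0.5 the loss is symmetric in teacher and student and bounded above by log 2, whereas the two KL directions generally differ and are unbounded. A docstring that states the argument order explicitly avoids the labeling ambiguity altogether.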

### 2.3 `reduction="batchmean"` behavior - MINOR DIVERGENCE from upstream

The repo `opsd.py` comment says:

> "batchmean" matches upstream OPSD: divides by `mask.sum()` when labels are given,
> else by the leading dim of jsd (= batch size).

The OPSD paper (Algorithm 1) normalizes by `|ŷ|` (sequence length, token-mean):

```
ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...)
L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*)
```

The repo divides by `mask.sum()` (the number of valid/masked tokens in the batch), which is equivalent to OPSD's normalization only when every example has the same number of error-turn tokens. When token counts vary across examples (as they do in real training), this differs from the paper's per-sequence average followed by a batch average: longer sequences receive more weight. In practice this difference is negligible for stability, but it is technically not byte-for-byte OPSD parity on the reduction.

**VERDICT**: The `reduction="batchmean"` logic is borrowed from the OPSD upstream code (which uses the same `mask.sum()` convention). The docstring's "matches upstream" claim is accurate for the code, but the code diverges from the paper's stated per-sequence normalization. Not a material issue.

### 2.4 `token_clip` parameter - CORRECT semantics, but per-token vs. per-(token,vocab) distinction

The repo implements `token_clip` as a **per-position** JSD clip:

```python
jsd = jsd.clamp(max=token_clip)  # jsd shape is (B, T, V) or (n_valid, V)
```

The OPSD paper's pointwise clipping (Section 3.3) clips **per-(position, vocab-entry)**: `min(l_{n,v}^(f), τ)` for each vocab entry v at each position n. The upstream OPSD code (`--jsd_token_clip`) appears to apply the same per-(position,vocab) clip. The repo's clamp on the jsd tensor before reduction clips each vocab entry's contribution at each position (since jsd has shape (B,T,V) before masking). This is equivalent to per-(position,vocab) clipping, which is correct.

**VERDICT**: Implementation appears correct.
The parameter name (`token_clip`) is slightly misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match.

---

## 3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics

### 3.1 ERROR-TURN MASKING - NOT IN SDPO

The repo implements SDPO as an error-turn-masked loss:

- `_compute_sdpo_loss` applies JSD only at `error-turn tokens` (via `sdpo_loss_mask`).
- The data collator detects error sites in a trace and constructs a teacher context with a hint inserted at the error turn (`ctx_teacher = ctx_student + hint`).
- The hint shifts teacher response tokens right, requiring explicit alignment indices (ADR-011).

**The SDPO paper has no error-turn masking.** SDPO applies the KL loss to ALL tokens `t` in the rollout response:

> "L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{ …

> "lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)"

And:

> "SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's
> 'Targeted RL with Textual Feedback.'"

These are TWO DIFFERENT methods with related but not identical losses:

- **OPSD loss** (Eq. 7–8): `JSD_β(p_T ‖ p_S)` with teacher having `y*` (ground truth). Normalization: per-sequence average then batch average. Pointwise vocab clipping. Training runs ~100 steps. Fixed teacher (initial checkpoint, not live).
- **SDPO loss** (Eq. 1): `KL(π_θ(·|x,y_{ …`

```python
        … 0.0]
        if not winners:
            return None
        best = max(winners, key=lambda s: s["reward"])
        snippet = (best.get("solution_excerpt") or "")[:200]
        return ("Reminder: a working approach for this task looks like:\n"
                f"{snippet}\nAdapt this to the current step.")
```

**What the SDPO paper actually does** (Table 2 template):

```
Correct solution: {successful_previous_rollout}
```

The successful rollout is passed as the full solution (or relevant excerpt) in the teacher context. The teacher then evaluates log-probs of the student's original response given this context.
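
To make the template concrete, here is a minimal sketch, assuming token sequences are plain Python lists and that the teacher context is the prompt, then the feedback, then the unchanged student response. The function name `build_teacher_context` and the toy tokens are hypothetical; neither paper nor the repo defines them.

```python
def build_teacher_context(prompt, feedback, response):
    """Place feedback in the conditioning prefix; return (context, response_offset)."""
    prefix = prompt + feedback
    return prefix + response, len(prefix)


# Toy token sequences (hypothetical).
prompt = ["solve", "x"]
feedback = ["Correct", "solution:", "x=2"]
response = ["try", "x=3", "done"]

student_ctx = prompt + response
student_offset = len(prompt)
teacher_ctx, teacher_offset = build_teacher_context(prompt, feedback, response)

# Response token i sits at student_offset + i and at teacher_offset + i.
shift = teacher_offset - student_offset
pairs = [(student_offset + i, teacher_offset + i) for i in range(len(response))]
print("constant shift:", shift)
print("index pairs:", pairs)
```

Because the feedback sits entirely before the response, every response position shifts by the same constant, so student and teacher log-probs line up with a single offset and the whole response can be scored.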
**Key difference**: In SDPO, the sibling solution goes into the teacher's conditioning prefix. The teacher does not generate a new hint; it just re-evaluates the student's response log-probs with the solution visible. In the repo, the sibling solution is used to *generate a hint string* that gets inserted into the response sequence.

This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout as implicit feedback" mechanism does NOT:

1. Generate a "Reminder: a working approach..." hint string.
2. Insert text into the student's response sequence.
3. Require error-turn detection.

The SDPO sibling mechanism IS:

1. Condition the teacher on the full successful solution.
2. Re-evaluate ALL student response token log-probs under that teacher.
3. Apply the KL loss across the entire response.

**VERDICT**: The `SiblingBootstrapGenerator` as sketched is an extrapolation from SDPO's mechanism, not a faithful implementation of it. The paper supports using a sibling rollout as teacher conditioning context; it does not support generating a textual hint from it to splice into the response. The Composer blog's "hint" framing is the source of this architectural decision; SDPO is cited as inspiration but is not the mechanism.

Research/07 acknowledges this at several points ("A working approach looks like: …" in the class comment vs the actual SDPO template) but does not flag it as a divergence; it presents the sibling-bootstrap hint approach as if it naturally follows from SDPO.

---

## 7. Audit: `research/11-sdpo-alignment-indices.md`

### 7.1 Problem correctly identified

ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher response tokens right. The alignment indices machinery (`_mask_to_padded_indices`, `student_response_idx`, `teacher_response_idx`, sentinel handling) is a sound engineering solution to the problem the repo's design creates.
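
The shift that ADR-011 handles can be illustrated with a toy example. This is a sketch under assumptions: the helper `aligned_response_indices` is hypothetical and is not the repo's `_mask_to_padded_indices`, and the hint position and tokens are invented. It only shows why a hint spliced into the middle of a sequence calls for per-token index pairs.

```python
def aligned_response_indices(n_student_tokens, hint_pos, hint_len):
    """Map each student position to its teacher position after a hint is spliced in at hint_pos."""
    pairs = []
    for s in range(n_student_tokens):
        # Tokens before the hint keep their index; later tokens move right by hint_len.
        t = s if s < hint_pos else s + hint_len
        pairs.append((s, t))
    return pairs


student = ["a", "b", "ERR", "c", "d"]
hint = ["<hint>", "fix", "</hint>"]
hint_pos = 2  # hypothetical error-turn position
teacher = student[:hint_pos] + hint + student[hint_pos:]

pairs = aligned_response_indices(len(student), hint_pos, len(hint))
print(pairs)
```

Positions before the hint keep their index, while positions at or after it shift by the hint length. That non-uniform offset is what the alignment machinery has to track.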

### 7.2 Root cause attribution - MISLEADING

ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate. The alignment problem does not exist in SDPO or OPSD because those methods never insert text into the response sequence. The alignment problem is entirely self-created by the repo's decision to implement the Composer blog's "hint at error turn" as a text insertion into the teacher's response sequence.

---

## 8. Audit: ADR-007, ADR-008, ADR-009 - Key claims

### ADR-007 - JSD as "the kernel of SDPO arXiv:2601.20802"

The ADR says `generalized_jsd_loss` is "verified port of siyan-zhao/OPSD, the kernel of SDPO arXiv:2601.20802." This telescopes two papers. The JSD is the kernel of OPSD (the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a **stability improvement** over the base KL loss; the primary SDPO loss is the KL. Both papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not consequential for the loss code but creates confusion in documentation.

### ADR-008 - "SDPO needs full vocabulary logits"

Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively computing the KL divergence between student and teacher requires holding full logits of both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's log-probs-only interface is correctly identified as incompatible with the SDPO channel.

### ADR-008 - Dr. GRPO as the Composer algorithm

This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out of scope for this cluster).

### ADR-009 - "How Cursor generates that hint is unstated"

Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question.

---

## 9. Summary of findings

| Claim | Source | Verdict |
|---|---|---|
| JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT |
| β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label |
| "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code |
| Error-turn masking is from SDPO | SDPO paper | FALSE: SDPO applies loss to full rollout, no error-turn detection |
| Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE: SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated |
| SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION: SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence |
| Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING: problem is self-created by hint-insertion design; does not exist in SDPO |
| SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT |
| Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED |
| --jsd_token_clip default 0.05 | OPSD README | CORRECT |
| --reason_first flag | OPSD README | CORRECTLY DESCRIBED |
| "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED |
| Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py |

---

## 10. Recommendations

1. **Fix the β docstring** in `opsd.py` to match the OPSD upstream convention: β=0 → forward KL (KL(student‖teacher)), β=1 → reverse KL (KL(teacher‖student)).
2. **Clarify the architectural departure from SDPO** in `composer_trainer.py` docstring and `research/07`: the repo implements a Composer-blog-inspired error-turn hint injection, which is an extension beyond SDPO.
   SDPO uses the feedback in the prompt prefix and evaluates the full response; the repo intercalates text into the response.
3. **Reconsider framing of `SiblingBootstrapGenerator`**: it is an original design choice, not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like: build a teacher context `[prompt, successful_sibling_rollout, original_response]` and apply KL over the whole original response, without generating a hint string or error-turn detection. This would be simpler and more faithful to SDPO.
4. **Teacher regularization is not implemented**: the SDPO paper shows a non-regularized teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model weights at each step with no EMA or trust-region regularization. For production SDPO runs this is a gap. (The `sdpo_jsd_beta` default of 0.5 uses symmetric JSD, which is one of SDPO's stability improvements, but the teacher regularization is absent.)
5. **SDPO's original attempt placement**: the paper includes the student's original response as the sequence being log-prob-evaluated (i.e., the "response" slot in the teacher context). The repo's collator instead masks specific error-turn tokens within a modified response. These are architecturally different. The paper-accurate approach would re-evaluate log-probs of the entire original response under the hint-conditioned teacher, not just the tokens after the error.
6. **Failure mode from SDPO paper**: the strongest limitation is model capability dependence. SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with λ=0.9 is recommended for weaker base models. This is not documented in the repo's SDPO usage guidance.

---

## 11. What the papers do NOT say (repo-claimed but unconfirmed in sources)

- That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but the connection is an inference, not a citation).
- That error-turn masking is part of SDPO.
- That the repo's hint-at-error-turn teacher context is the SDPO mechanism.
- That the alignment index problem (ADR-011) is an issue in SDPO.
- How Cursor generates the hint (confirmed absent in all Cursor artifacts).