Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

Deep-Read: SDPO / OPSD — Critical Audit of Cluster 2

Date: 2026-06-09

Sources fetched from primary HTML:

  • SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford) — note reinforcement-learning-via-self-distillation-2 (148 K body)
  • OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026) — note self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2 (52 K body)
  • OPSD code repo: github.com/siyan-zhao/OPSD (README + key args)
  • SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body)

Repo files audited: composer_replication/opsd.py, composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss, composer_replication/hint_generator.py, research/07-sdpo-hint-generator.md, research/11-sdpo-alignment-indices.md, docs/adrs/ADR-007, ADR-008, ADR-009, ADR-011.


1. What the primary sources actually say

1.1 SDPO (arXiv:2601.20802v2) — core method

Exact loss (Eq. 1):

L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})) )

KL direction: KL(student ‖ teacher), i.e. KL(π_θ ‖ q_θ), with the student in the first argument. The expectation is taken under the student's own distribution, so in the usual distillation terminology this is the reverse KL (mode-seeking for the student), consistent with how §3.3 of this audit labels it. The paper writes it KL(π_θ ‖ stopgrad(q_θ)).

Stability improvements (§2.3):

  1. Regularized teacher: EMA of student params OR interpolation between current teacher and initial teacher q_{θ_ref}. The paper calls these "trust-region" and "EMA" teachers (Table 4). Non-regularized teacher (q_θ, the live student) diverges.
  2. Symmetric JSD: the paper adopts JSD as the distillation loss for stability — citing Agarwal et al. 2024 on-policy distillation. The pseudocode (Fig 14) calls this divergence(logprobs_student, logprobs_teacher) with no fixed default — the paper reports using JSD.

Top-K approximation (Appendix A.3): The paper approximates the full-vocab KL with top-K tokens of the student distribution:

L_SDPO ≈ Σ_t [ Σ_{ŷ_t ∈ topK(π_θ)} π_θ(ŷ_t|x,y_{<t}) · log(π_θ(ŷ_t) / q_θ(ŷ_t))
              + tail_term ]

The tail term aggregates the remaining probability mass. Default K=100.
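To make the approximation concrete, here is a minimal pure-Python sketch of a top-K KL with one aggregated tail bucket. The exact form of the tail term is an assumption (leftover student mass against leftover teacher mass); the function names and the toy distributions are hypothetical and not taken from the paper or the repo.

```python
import math

def topk_kl(student, teacher, k):
    """KL(student || teacher) over the student's top-k tokens plus one tail bucket."""
    # Indices of the k most probable tokens under the student distribution.
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    kl = sum(student[i] * math.log(student[i] / teacher[i]) for i in top if student[i] > 0)
    # Assumed tail term: all remaining mass is merged into a single bucket.
    tail_s = max(0.0, 1.0 - sum(student[i] for i in top))
    tail_t = max(0.0, 1.0 - sum(teacher[i] for i in top))
    if tail_s > 1e-12 and tail_t > 1e-12:
        kl += tail_s * math.log(tail_s / tail_t)
    return kl

def full_kl(student, teacher):
    """Exact KL(student || teacher) over the whole vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

student = [0.5, 0.3, 0.1, 0.06, 0.04]
teacher = [0.2, 0.4, 0.2, 0.1, 0.1]
approx = topk_kl(student, teacher, k=2)
exact = full_kl(student, teacher)
```

Under this assumed tail form the approximation can only underestimate the exact KL, because merging tokens into one bucket never increases a KL divergence.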

Teacher context construction (Table 2, verbatim):

User:  {prompt}
Correct solution: {successful_previous_rollout}    (skipped if unavailable)
The following is feedback from your unsuccessful earlier attempt:
{environment_output}                               (skipped if no env output or if solved)
Correctly solve the original question.
Assistant: {original_response}                     (the student's original attempt, for log-prob re-eval)

Critical nuance: the original_response is placed in the teacher context so the model can re-evaluate log-probs of y under the teacher. The student's original attempt is always appended as the response the teacher evaluates — this is how both student and teacher evaluate log-probs of the same token sequence y.

Token alignment: there is no shift / hint-insertion at an "error turn." The teacher sees [prompt, feedback, original_response] and both student and teacher evaluate log-probs of the SAME original_response tokens. No additional tokens are inserted into the response; the prefix is longer for the teacher (it has feedback), but the response token sequence evaluated is identical.
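The alignment property described above can be illustrated with a small sketch. The token IDs and the helper name `build_contexts` are hypothetical; the only thing taken from the source is the layout, where the teacher prefix is the prompt plus feedback and the response is identical on both sides.

```python
def build_contexts(prompt_ids, feedback_ids, response_ids):
    """Return student/teacher sequences and the start index of the response in each."""
    student = prompt_ids + response_ids
    teacher = prompt_ids + feedback_ids + response_ids  # feedback lives in the prefix only
    s_start = len(prompt_ids)
    t_start = len(prompt_ids) + len(feedback_ids)
    return student, teacher, s_start, t_start

# Made-up token IDs for illustration.
prompt_ids = [11, 12, 13]
feedback_ids = [90, 91]
response_ids = [21, 22, 23, 24]

student, teacher, s_start, t_start = build_contexts(prompt_ids, feedback_ids, response_ids)
n = len(response_ids)
student_resp = student[s_start:s_start + n]
teacher_resp = teacher[t_start:t_start + n]
# The response occupies one contiguous block in both sequences, so a single
# constant offset maps every student response position to its teacher position.
offset = t_start - s_start
```

Because the offset is constant, no per-token index map is needed in this layout; slicing both logit tensors at their respective response start is enough.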

Feedback types ablated (§4.6, Table 6): three individual types plus their combination:

  1. Sample solution (f = own solution): a successful sibling rollout from the GRPO group; always student-generated (no expert model). Teacher accuracy: 42.4%.
  2. Environment output (f = output): runtime errors, failing unit tests, etc. Teacher accuracy: 32.5%.
  3. Student's original attempt (f = y): the paper reports that including the original attempt in the feedback (not just in the response slot) reduces teacher diversity (biases teacher toward student; "Same output": 30% vs ~10–13%).
  4. Combined (f = output + own solution): best trained student accuracy (48.3%). Excluding f = y (the original attempt as part of conditioning) is key.

Failure modes reported:

  • Non-regularized teacher (q_θ) diverges / training collapses (Table 4: 36.1% vs 50.6% for trust-region teacher).
  • Performance depends on model in-context learning ability: SDPO underperforms GRPO on weaker models (Qwen3-0.6B); hybrid SDPO+GRPO (λ=0.9 GRPO + 0.1 SDPO) is more robust (§4.5).
  • Uninformative or misleading environment feedback: SDPO cannot learn from it.
  • SDPO adds small compute overhead (additional forward for log-prob re-computation of teacher context); minor for large models, non-negligible for small models.
  • Including the student's own attempt in the teacher conditioning (not just as the response to re-evaluate) reduces diversity; the correct template excludes it from the conditioning prefix.

SDPO operates over the full rollout, not at isolated "error turns". The loss sums over all tokens t in the response y. There is no error-site detection step in the SDPO paper.

1.2 OPSD (arXiv:2601.18734v3) — core method

Exact loss (Eq. 6–8):

L_OPSD(θ) = E_{(x,y*)~S} [ E_{ŷ~p_S(·|x)} [ D(p_T ‖ p_S)(ŷ|x) ] ]
where
D(p_T ‖ p_S)(ŷ|x) := (1/|ŷ|) Σ_{n=1}^{|ŷ|} D( p_T(·|x, y*, ŷ_{<n}) ‖ p_S(·|x, ŷ_{<n}) )

Divergence D can be: forward KL, reverse KL, or JSD_β. The paper defines:

JSD_β(p_T ‖ p_S) = β·KL(p_T ‖ m) + (1-β)·KL(p_S ‖ m)
m = β·p_T + (1-β)·p_S

Direction convention: D(p_T ‖ p_S) — teacher in first arg, student in second. For the JSD:

  • β→0: m → p_S and JSD_β/β → KL(p_T ‖ p_S), the forward KL (mode-covering for the student). At exactly β=0 the mixture formula is identically zero, since m = p_S.
  • β→1: m → p_T and JSD_β/(1-β) → KL(p_S ‖ p_T), the reverse KL (mode-seeking for the student).

The GKD paper (arXiv:2306.13649) that OPSD cites defines JSD_β with the same convention: JSD_β(p ‖ q) = β·KL(p||M) + (1-β)·KL(q||M).
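The endpoint behaviour of JSD_β can be read off numerically from the definition alone. The sketch below assumes nothing beyond the formula for JSD_β and m given above; the helper names and the two toy distributions are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd_beta(p_t, p_s, beta):
    """JSD_beta(p_T || p_S) = beta*KL(p_T||m) + (1-beta)*KL(p_S||m), m = beta*p_T + (1-beta)*p_S."""
    m = [beta * a + (1 - beta) * b for a, b in zip(p_t, p_s)]
    return beta * kl(p_t, m) + (1 - beta) * kl(p_s, m)

p_t = [0.9, 0.05, 0.05]   # toy teacher distribution
p_s = [0.3, 0.3, 0.4]     # toy student distribution
eps = 1e-5

near_zero = jsd_beta(p_t, p_s, eps) / eps        # rescaled value as beta -> 0
near_one = jsd_beta(p_t, p_s, 1 - eps) / eps     # rescaled value as beta -> 1
at_zero = jsd_beta(p_t, p_s, 0.0)                # exactly beta = 0

forward = kl(p_t, p_s)    # KL(teacher || student)
reverse = kl(p_s, p_t)    # KL(student || teacher)
```

Comparing `near_zero` with `forward` and `near_one` with `reverse` shows which KL each endpoint approaches after rescaling, and `at_zero` shows that the unscaled mixture formula carries no signal at exactly β=0.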

Per-token pointwise clipping: OPSD introduces this explicitly:

D_clip^(f)(p_T ‖ p_S) = (1/|ŷ|) Σ_n Σ_v min(l_{n,v}^(f), τ)
where l_{n,v}^(f) = p_T(v|·) · f( p_S(v|·) / p_T(v|·) )

This clips per vocab-entry contribution. Default τ=0.05 (from README: --jsd_token_clip 0.05). Non-thinking mode results in README use 1e-7 (Qwen3-8B) and 1e-6 (Qwen3-4B, 1.7B).
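As a rough illustration of the clipped divergence, the sketch below evaluates the formula as transcribed above with the forward-KL generator f(u) = -log u. The function names and toy distributions are hypothetical, and the sketch follows the transcription in this audit rather than the upstream code.

```python
import math

def clipped_divergence(p_t_seq, p_s_seq, f, tau):
    """Mean over positions of sum_v min(p_T(v) * f(p_S(v) / p_T(v)), tau)."""
    total = 0.0
    for p_t, p_s in zip(p_t_seq, p_s_seq):
        for t, s in zip(p_t, p_s):
            if t > 0.0:
                # Each (position, vocab-entry) term is clipped from above only;
                # negative terms (where p_S > p_T) pass through unchanged.
                total += min(t * f(s / t), tau)
    return total / len(p_t_seq)

def forward_kl_generator(u):
    """f(u) = -log(u), so sum_v p_T * f(p_S / p_T) = KL(p_T || p_S)."""
    return -math.log(u)

p_t_seq = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
p_s_seq = [[0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]

unclipped = clipped_divergence(p_t_seq, p_s_seq, forward_kl_generator, math.inf)
clipped = clipped_divergence(p_t_seq, p_s_seq, forward_kl_generator, 0.05)
```

With τ set to infinity the function reduces to the length-normalized forward KL, which gives a quick sanity check on any port of the clipping logic.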

Teacher context: p_T(·|x, y*, ŷ_{<n}) — teacher sees the ground-truth answer y* (a reference CoT / verified reasoning trace from the dataset) prepended to the problem, then evaluates the student's prefix ŷ_{<n}. Same token sequence for both distributions evaluated at each step n.

--reason_first flag (from GitHub README): Prepend an explicit rationalization to the teacher context before distillation. This is OPSD's self-introspection lever: the teacher is first asked to rationalize why y* is correct, then that rationalization is folded into the conditioning. Not the main results configuration; requires --use_peft.

Results: On Qwen3-1.7B (AIME24/25/HMMT25), OPSD improves on the base model from 37.1% to 43.4% (Avg@12) and outperforms GRPO (37.7%) and SFT (35.8%). Token-efficient: generation capped at 1024 tokens vs. GRPO's 16K.


2. Audit: composer_replication/opsd.py — "byte-for-byte OPSD parity" claim

2.1 JSD formula — CORRECT, with a subtle direction note

The code implements:

JSD = β·KL(teacher||M) + (1-β)·KL(student||M)
log M = logsumexp([ log p_student + log(1-β), log p_teacher + log(β) ])

The OPSD paper (Eq. 7) defines:

JSD_β(p_T ‖ p_S) = β·KL(p_T‖m) + (1-β)·KL(p_S‖m)

where m = β·p_T + (1-β)·p_S.

The code kl_teacher = F.kl_div(mixture_log_probs, teacher_log_probs, ...) uses PyTorch semantics where F.kl_div(input=log_q, target=log_p, log_target=True) computes KL(p||q) = Σ p(x)·(log p(x) - log q(x)). So kl_teacher computes KL(teacher||mixture) and kl_student computes KL(student||mixture).

The final JSD: β·kl_teacher + (1-β)·kl_student = β·KL(teacher||M) + (1-β)·KL(student||M).

This matches the OPSD paper's JSD_β(p_T ‖ p_S) exactly. CORRECT.

2.2 β convention docstring — INVERTED vs. both papers

The opsd.py docstring says:

β = 0  → KL(teacher || student)  (reverse KL — mode-covering for student)
β = 1  → KL(student || teacher)  (forward KL — mode-seeking for student)

From the OPSD GitHub README:

--beta: Interpolation weight for the JSD mixture distribution. Beta=0 means forward KL and 1 means reverse KL.

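The repo docstring has the β=0 and β=1 names swapped relative to the OPSD upstream. With m = β·p_T + (1-β)·p_S, setting β=0 exactly gives m = p_S, so both terms of JSD_0 vanish; the meaningful statement is the limit JSD_β/β → KL(teacher||student) as β→0, which is forward KL (mode-covering for the student). Symmetrically, JSD_β/(1-β) → KL(student||teacher) as β→1, which is reverse KL (mode-seeking for the student). The README's "Beta=0 means forward KL and 1 means reverse KL" matches this analysis. The docstring's KL expressions and its mode-covering / mode-seeking annotations are therefore right; only the names "reverse KL" and "forward KL" are attached to the wrong endpoints.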

The code implementation is correct (the formula computes the right mixture). The docstring labels β=0 as "reverse KL" and β=1 as "forward KL", which contradicts both the upstream README and the mathematical analysis. This is a documentation error, not a numerical error.

VERDICT: Implementation is numerically correct. Docstring direction labels are inverted.

2.3 reduction="batchmean" behavior — MINOR DIVERGENCE from upstream

The repo opsd.py comment says:

"batchmean" matches upstream OPSD: divides by mask.sum() when labels are given, else by the leading dim of jsd (= batch size).

The OPSD paper (Algorithm 1) normalizes by |ŷ| (sequence length, token-mean):

ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...)
L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*)

The repo divides by mask.sum() (the number of unmasked tokens in the batch), which is equivalent to the paper's normalization only when every example contributes the same number of unmasked tokens. When per-example token counts vary (as in real training), longer examples get more weight than under the paper's per-sequence average followed by batch average. In practice this difference is negligible for stability, but it is technically not byte-for-byte OPSD parity on the reduction.

VERDICT: The reduction="batchmean" logic is borrowed from the OPSD upstream code (which uses the same mask.sum() convention). The docstring's "matches upstream" claim is accurate for the code, but the code diverges from the paper's stated per-sequence normalization. Not a material issue.
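The gap between the two reductions is easy to see on a toy batch. This is a minimal sketch with hypothetical function names and made-up per-token losses; it does not reproduce the repo's masking code.

```python
def token_mean(per_token_losses):
    """Divide the summed loss by the total number of tokens (mask.sum() style)."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)

def sequence_then_batch_mean(per_token_losses):
    """Average within each sequence first, then across the batch (paper style)."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)

# Two sequences of different length: the long one dominates the token mean.
ragged = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
ragged_token = token_mean(ragged)                  # (2 + 24) / 8
ragged_seq = sequence_then_batch_mean(ragged)      # (1 + 4) / 2

# Equal lengths: the two reductions coincide.
even = [[1.0, 1.0], [4.0, 4.0]]
even_token = token_mean(even)
even_seq = sequence_then_batch_mean(even)
```

The two reductions agree whenever every sequence has the same number of unmasked tokens and diverge as soon as lengths differ.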

2.4 token_clip parameter — CORRECT semantics, but per-token vs. per-(token,vocab) distinction

The repo implements token_clip as an elementwise clamp on the unreduced JSD tensor:

jsd = jsd.clamp(max=token_clip)  # jsd shape is (B, T, V) or (n_valid, V)

The OPSD paper's pointwise clipping (Section 3.3) clips per-(position, vocab-entry): min(l_{n,v}^(f), τ) for each vocab entry v at each position n.

The upstream OPSD code (--jsd_token_clip) appears to apply the same per-(position, vocab) clip. Because the repo's clamp runs on the jsd tensor before reduction, while it still has shape (B,T,V), each (position, vocab-entry) contribution is clipped individually rather than the per-position sum. That is per-(position, vocab) clipping, which is correct.

VERDICT: Implementation appears correct. The parameter name (token_clip) is slightly misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match.
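The distinction between the two clip granularities can be shown on a toy tensor. The sketch below uses nested lists in place of a (T, V) tensor; the function names and the numbers are hypothetical.

```python
def clip_then_sum(contrib, tau):
    """Per-(position, vocab) clip: clamp each entry, then sum over the vocabulary."""
    return [sum(min(c, tau) for c in row) for row in contrib]

def sum_then_clip(contrib, tau):
    """Per-position clip: sum over the vocabulary first, then clamp the total."""
    return [min(sum(row), tau) for row in contrib]

# Rows are positions, columns are vocabulary entries (made-up contributions).
contrib = [
    [0.20, 0.01, 0.01],   # one dominant entry
    [0.03, 0.03, 0.03],   # evenly spread, every entry already below tau
]
tau = 0.05

per_entry = clip_then_sum(contrib, tau)
per_position = sum_then_clip(contrib, tau)
```

In the second row no single entry exceeds τ, so the per-entry clip leaves it untouched while a per-position clip would cut it; this is the case that separates the two readings of the parameter name.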


3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics

3.1 ERROR-TURN MASKING — NOT IN SDPO

The repo implements SDPO as an error-turn-masked loss:

  • _compute_sdpo_loss applies JSD only at error-turn tokens (via sdpo_loss_mask).
  • The data collator detects error sites in a trace and constructs a teacher context with a hint inserted at the error turn (ctx_teacher = ctx_student + hint).
  • The hint shifts teacher response tokens right, requiring explicit alignment indices (ADR-011).

The SDPO paper has no error-turn masking. SDPO applies the KL loss to ALL tokens t in the rollout response:

"L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})))"

The SDPO teacher context includes the full feedback; both student and teacher evaluate log-probs of the same response tokens y. There is no "hint inserted into the response" — the feedback is in the conditioning prefix, not intercalated into the response sequence. Therefore the teacher response tokens are not shifted and token alignment is trivially preserved: both contexts evaluate the same sequence y.

The repo's architecture (hint at error turn → response token shift → alignment indices) is an interpretation of Composer 2.5's "hint" mechanism, not a feature of SDPO. SDPO's feedback is in the prompt/conditioning context; it does not intercalate text into the middle of a response.

VERDICT: The repo's error-turn-masking design is a reasonable extension of the Composer blog's described mechanism ("insert hint at error turn") but is NOT SDPO as described in the paper. The Composer blog's mechanism is itself not fully described and may or may not match SDPO mechanics.

3.2 TEACHER CONTEXT — CRITICAL DIFFERENCE

SDPO teacher context (Table 2):

[prompt, feedback_f, original_response_y]

The teacher evaluates log-probs of original_response_y given [prompt, feedback_f]. Teacher prefix = [prompt, feedback_f]. Response = y (same as student). No hint is inserted into y.

Repo teacher context:

ctx_teacher = ctx_student + hint_at_error_turn

The hint is intercalated into the response sequence at an error turn. Teacher prefix = student prefix. Response = y_before_error + hint + y_after_error. Teacher response tokens are LONGER than student response tokens and SHIFTED.

This is architecturally different from SDPO. The alignment problem (ADR-008, ADR-011) arises precisely because the repo's teacher context design inserts hint text into the response, which SDPO does not do.

VERDICT: The repo's teacher context construction is a novel design inspired by the Composer blog. It is not what SDPO does. The ADR-008 "trust-gap" and the entire ADR-011 alignment index complexity are artifacts of this departure from SDPO, not corrections to SDPO.
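To show why the hint-insertion layout needs an explicit index map, here is a small sketch. It is not the repo's `_mask_to_padded_indices` logic; the helper name `splice_hint`, the token IDs, and the error position are all hypothetical.

```python
def splice_hint(prefix_ids, response_ids, hint_ids, error_pos):
    """Build student/teacher sequences with a hint spliced into the teacher's response,
    and return the index of each original response token in both sequences."""
    student = prefix_ids + response_ids
    teacher = prefix_ids + response_ids[:error_pos] + hint_ids + response_ids[error_pos:]
    base = len(prefix_ids)
    student_idx = [base + i for i in range(len(response_ids))]
    # Tokens at or after the error position are shifted right by the hint length.
    teacher_idx = [base + i + (len(hint_ids) if i >= error_pos else 0)
                   for i in range(len(response_ids))]
    return student, teacher, student_idx, teacher_idx

prefix_ids = [1, 2]
response_ids = [10, 11, 12, 13]
hint_ids = [99, 98]

student, teacher, student_idx, teacher_idx = splice_hint(
    prefix_ids, response_ids, hint_ids, error_pos=2
)
student_tokens = [student[i] for i in student_idx]
teacher_tokens = [teacher[j] for j in teacher_idx]
```

Unlike the prefix-only layout, the shift here is not constant across the response, so a per-token map (or an equivalent mask) is required to pair student and teacher positions.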

3.3 OPSD vs. SDPO as the loss source

The repo header in opsd.py says the loss is:

"lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)"

And:

"SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's 'Targeted RL with Textual Feedback.'"

These are TWO DIFFERENT methods with related but not identical losses:

  • OPSD loss (Eq. 7–8): JSD_β(p_T ‖ p_S) with teacher having y* (ground truth). Normalization: per-sequence average then batch average. Pointwise vocab clipping. Training runs ~100 steps. Fixed teacher (initial checkpoint, not live).

  • SDPO loss (Eq. 1): KL(π_θ(·|x,y_{<t}) ‖ stopgrad(π_θ(·|x,f,y_{<t}))) where KL is applied per-position over the full response. The paper adopts JSD as a stability improvement (§2.3) but the base formulation is reverse KL (student ‖ teacher). Teacher is regularized via EMA or trust-region. No per-vocab clipping in the paper (top-K approximation instead).

The repo correctly implements the OPSD JSD formula (which SDPO also uses for stability). The claim "verified port of siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss" is accurate for the loss kernel. The claim "Composer 2.5's 'Targeted RL with Textual Feedback'" is an assertion that Composer uses the same loss — this is not confirmed anywhere in the Cursor blog or Composer 2 tech report.


4. Audit: _compute_sdpo_loss in composer_trainer.py

4.1 Gradient flow — CORRECT

student_logits = model(input_ids=inputs["input_ids"]).logits
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits

Teacher is no_grad — matches SDPO's stopgrad(π_θ(·|x,f,y_{<t})). Student has gradient. Correct.

4.2 Alignment index machinery — NECESSARY GIVEN THE DESIGN, BUT NOT FROM SDPO

The student_response_idx / teacher_response_idx machinery (ADR-011) is needed because the hint is inserted into the teacher response sequence. This complexity does not exist in SDPO or OPSD because those methods never insert text into the response. The repo's strict_sdpo_alignment guard is a correct defense against the problem it has created for itself.

4.3 Batch-level masking — CORRECT for the repo's error-turn interpretation

The loss is masked to error-turn tokens only (aligned_labels with -100 elsewhere). This means the SDPO channel only trains on error recovery tokens, not the full rollout. SDPO trains on the full rollout. For Composer's intent (correcting error turns), the masking is reasonable, but it produces a loss that is more like a targeted distillation at error sites than SDPO's full-rollout advantage assignment.


5. Audit: research/07-sdpo-hint-generator.md — Accuracy check

5.1 Three feedback types from SDPO paper — CORRECTLY REPORTED

research/07 correctly identifies the three types:

  1. Sample solution (successful sibling rollout)
  2. Environment output (runtime errors)
  3. Student's original attempt

The paper (Table 6 results, which research/07 did NOT have access to) shows:

  • Best configuration: f = output + own solution (48.3% accuracy)
  • Including f = y (original attempt as conditioning, not as response) hurts diversity and slightly reduces final accuracy (44.5% vs 48.3%)

Research/07 correctly notes the sibling rollout is "always generated by the student, not an expert model" — confirmed in the paper: "We emphasize that these sample solutions are always generated by the student, as in GRPO, and do not require an expert model."

5.2 "Successful sibling rollout as implicit feedback" claim — CORRECTLY REPORTED

The abstract: "SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts."

Research/07 cites this correctly and uses it as the basis for the SiblingBootstrapGenerator.

5.3 OPSD --reason_first flag — CORRECTLY DESCRIBED

The OPSD README confirms: --reason_first False: Prepend an explicit rationalization to the teacher context before distillation. Research/07 correctly calls this "OPSD's own knob for same-model introspection."

5.4 --jsd_token_clip default 0.05 — CORRECTLY CITED

Confirmed from OPSD README: --jsd_token_clip 0.05 is the default.


6. Audit: SiblingBootstrapGenerator — Is it supported by the papers?

The repo's hint_generator.py sketch (lines 319–331) and research/07 §6.3:

class SiblingBootstrapGenerator:
    def generate(self, ctx):
        sibs = ctx.get("sibling_rollouts") or []
        winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
        if not winners:
            return None
        best = max(winners, key=lambda s: s["reward"])
        snippet = (best.get("solution_excerpt") or "")[:200]
        return ("Reminder: a working approach for this task looks like:\n"
                f"{snippet}\nAdapt this to the current step.")

What the SDPO paper actually does (Table 2 template):

Correct solution: {successful_previous_rollout}

The successful rollout is passed as the full solution (or relevant excerpt) in the teacher context. The teacher then evaluates log-probs of the student's original response given this context.

Key difference: In SDPO, the sibling solution goes into the teacher's conditioning prefix. The teacher does not generate a new hint; it just re-evaluates the student's response log-probs with the solution visible. In the repo, the sibling solution is used to generate a hint string that gets inserted into the response sequence.

This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout as implicit feedback" mechanism does NOT:

  1. Generate a "Reminder: a working approach..." hint string.
  2. Insert text into the student's response sequence.
  3. Require error-turn detection.

The SDPO sibling mechanism IS:

  1. Condition the teacher on the full successful solution.
  2. Re-evaluate ALL student response token log-probs under that teacher.
  3. Apply the KL loss across the entire response.
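For contrast with the hint-string generator, the three steps above could start from a prefix builder like the following. This is a sketch under assumptions: the template wording follows the Table 2 text quoted in §1.1, while the function name and the dictionary keys `reward` and `solution` are hypothetical.

```python
def build_sdpo_teacher_prefix(prompt, siblings, env_output=None):
    """Assemble the teacher conditioning prefix; the student's original response
    is appended afterwards as the sequence whose log-probs are re-evaluated."""
    lines = [f"User: {prompt}"]
    # Use the best successful sibling rollout, if any, as the sample solution.
    winners = [s for s in siblings if s.get("reward", 0.0) > 0.0]
    if winners:
        best = max(winners, key=lambda s: s["reward"])
        lines.append(f"Correct solution: {best['solution']}")
    # Environment output is skipped when unavailable.
    if env_output:
        lines.append("The following is feedback from your unsuccessful earlier attempt:")
        lines.append(env_output)
    lines.append("Correctly solve the original question.")
    return "\n".join(lines)

siblings = [
    {"reward": 0.0, "solution": "return a - b"},
    {"reward": 1.0, "solution": "return a + b"},
]
with_sibling = build_sdpo_teacher_prefix("Add two numbers.", siblings, "AssertionError: 3 != -1")
without_sibling = build_sdpo_teacher_prefix("Add two numbers.", [], "AssertionError: 3 != -1")
```

Nothing is spliced into the response here, so the resulting teacher context keeps the response tokens contiguous and no error-turn detection is involved.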

VERDICT: The SiblingBootstrapGenerator as sketched is an extrapolation from SDPO's mechanism, not a faithful implementation of it. The paper supports using a sibling rollout as teacher conditioning context; it does not support generating a textual hint from it to splice into the response. The Composer blog's "hint" framing is the source of this architectural decision; SDPO is cited as inspiration but is not the mechanism.

Research/07 acknowledges this at several points ("A working approach looks like: …" in the class comment vs the actual SDPO template) but does not flag it as a divergence — it presents the sibling-bootstrap hint approach as if it naturally follows from SDPO.


7. Audit: research/11-sdpo-alignment-indices.md

7.1 Problem correctly identified

ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher response tokens right. The alignment indices machinery (_mask_to_padded_indices, student_response_idx, teacher_response_idx, sentinel handling) is a sound engineering solution to the problem the repo's design creates.

7.2 Root cause attribution — MISLEADING

ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate. The alignment problem does not exist in SDPO or OPSD because those methods never insert text into the response sequence. The alignment problem is entirely self-created by the repo's decision to implement the Composer blog's "hint at error turn" as a text insertion into the teacher's response sequence.


8. Audit: ADR-007, ADR-008, ADR-009 — Key claims

ADR-007 — JSD as "the kernel of SDPO arXiv:2601.20802"

The ADR says generalized_jsd_loss is "verified port of siyan-zhao/OPSD, the kernel of SDPO arXiv:2601.20802." This telescopes two papers. The JSD is the kernel of OPSD (the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a stability improvement over the base KL loss; the primary SDPO loss is the KL. Both papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not consequential for the loss code but creates confusion in documentation.

ADR-008 — "SDPO needs full vocabulary logits"

Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively computing the KL divergence between student and teacher requires holding full logits of both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's log-probs-only interface is correctly identified as incompatible with the SDPO channel.

ADR-008 — Dr. GRPO as the Composer algorithm

This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out of scope for this cluster).

ADR-009 — "How Cursor generates that hint is unstated"

Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question.


9. Summary of findings

| Claim | Source | Verdict |
|---|---|---|
| JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT |
| β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label |
| "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code |
| Error-turn masking is from SDPO | SDPO paper | FALSE: SDPO applies loss to full rollout, no error-turn detection |
| Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE: SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated |
| SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION: SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence |
| Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING: problem is self-created by hint-insertion design; does not exist in SDPO |
| SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT |
| Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED |
| --jsd_token_clip default 0.05 | OPSD README | CORRECT |
| --reason_first flag | OPSD README | CORRECTLY DESCRIBED |
| "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED |
| Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py |

10. Recommendations

  1. Fix the β docstring in opsd.py to match the OPSD upstream convention: β=0 → forward KL (KL(teacher‖student)), β=1 → reverse KL (KL(student‖teacher)).

  2. Clarify the architectural departure from SDPO in composer_trainer.py docstring and research/07: the repo implements a Composer-blog-inspired error-turn hint injection, which is an extension beyond SDPO. SDPO uses the feedback in the prompt prefix and evaluates the full response; the repo intercalates text into the response.

  3. Reconsider framing of SiblingBootstrapGenerator: it is an original design choice, not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like: build a teacher context [prompt, successful_sibling_rollout, original_response] and apply KL over the whole original response — without generating a hint string or error-turn detection. This would be simpler and more faithful to SDPO.

  4. Teacher regularization is not implemented: the SDPO paper shows a non-regularized teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model weights at each step with no EMA or trust-region regularization. For production SDPO runs this is a gap. (The sdpo_jsd_beta default of 0.5 uses symmetric JSD which is one of SDPO's stability improvements, but the teacher regularization is absent.)

  5. SDPO's original attempt placement: the paper includes the student's original response as the sequence being log-prob-evaluated (i.e., the "response" slot in the teacher context). The repo's collator instead masks specific error-turn tokens within a modified response. These are architecturally different. The paper-accurate approach would re-evaluate log-probs of the entire original response under the hint-conditioned teacher, not just the tokens after the error.

  6. Failure mode from SDPO paper: the strongest limitation is model capability dependence — SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with λ=0.9 is recommended for weaker base models. This is not documented in the repo's SDPO usage guidance.
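For recommendation 4, an EMA teacher could be maintained along the following lines. This is only a sketch: parameters are plain floats in a dict rather than tensors, the function name is hypothetical, and the decay value is an arbitrary placeholder, not a setting taken from the SDPO paper.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """In-place EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    for name, s_val in student_params.items():
        teacher_params[name] = decay * teacher_params[name] + (1.0 - decay) * s_val
    return teacher_params

# Toy single-parameter "models".
teacher_params = {"w": 1.0}
student_params = {"w": 2.0}

ema_update(teacher_params, student_params, decay=0.9)
after_one_step = teacher_params["w"]   # 0.9 * 1.0 + 0.1 * 2.0
```

The update would run once per optimizer step after the student update, with the teacher forward pass reading from the EMA copy instead of the live weights.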


11. What the papers do NOT say (repo-claimed but unconfirmed in sources)

  • That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but the connection is an inference, not a citation).
  • That error-turn masking is part of SDPO.
  • That the repo's hint-at-error-turn teacher context is the SDPO mechanism.
  • That the alignment index problem (ADR-011) is an issue in SDPO.
  • How Cursor generates the hint (confirmed absent in all Cursor artifacts).