Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
Commit 2a16b30
| # Deep-Read: SDPO / OPSD — Critical Audit of Cluster 2 | |
| **Date**: 2026-06-09 | |
| **Sources fetched from primary HTML**: | |
| - SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford) | |
| — note `reinforcement-learning-via-self-distillation-2` (148 K body) | |
| - OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026) | |
| — note `self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2` (52 K body) | |
| - OPSD code repo: github.com/siyan-zhao/OPSD (README + key args) | |
| - SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body) | |
| **Repo files audited**: `composer_replication/opsd.py`, `composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss`, `composer_replication/hint_generator.py`, `research/07-sdpo-hint-generator.md`, `research/11-sdpo-alignment-indices.md`, `docs/adrs/ADR-007`, `ADR-008`, `ADR-009`, `ADR-011`. | |
| --- | |
| ## 1. What the primary sources actually say | |
| ### 1.1 SDPO (arXiv:2601.20802v2) — core method | |
| **Exact loss (Eq. 1)**: | |
| ``` | |
| L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})) ) | |
| ``` | |
| KL direction: the student is in the first argument, i.e. `KL(π_θ || q_θ)` = | |
| **KL(student ‖ teacher)**. In the usual distillation convention, where forward KL is | |
| KL(teacher ‖ student), this is the **reverse KL**, which is mode-seeking for the student. | |
| The paper writes it `KL(π_θ ‖ stopgrad(q_θ))`. | |
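To make the shape of Eq. 1 concrete, here is a minimal pure-Python sketch of a per-position KL(student ‖ teacher) summed over the response. The helper names (`kl`, `sdpo_loss`) and the toy distributions are illustrative assumptions, not code from the paper or the repo, and plain Python has no gradients, so the stopgrad is only noted in a comment.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sdpo_loss(student_dists, teacher_dists):
    """Sum over response positions t of KL(student_t || teacher_t).

    The teacher distributions play the role of stopgrad(pi_theta(.|x, f, y_<t)):
    they are treated as constants. Both lists describe the SAME response tokens.
    """
    assert len(student_dists) == len(teacher_dists)
    return sum(kl(s, t) for s, t in zip(student_dists, teacher_dists))

# Two response positions over a 3-token vocabulary (made-up numbers).
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = sdpo_loss(student, teacher)
print(loss)
```

Only the first position contributes here, because the two distributions agree at the second one.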
| **Stability improvements (§2.3)**: | |
| 1. Regularized teacher: EMA of student params OR interpolation between current teacher | |
| and initial teacher `q_{θ_ref}`. The paper calls these "trust-region" and "EMA" | |
| teachers (Table 4). Non-regularized teacher (`q_θ`, the live student) diverges. | |
| 2. Symmetric JSD: the paper adopts JSD as the distillation loss for stability — | |
| citing Agarwal et al. 2024 on-policy distillation. The pseudocode (Fig 14) calls | |
| this `divergence(logprobs_student, logprobs_teacher)` with no fixed default — the | |
| paper reports using JSD. | |
| **Top-K approximation (Appendix A.3)**: | |
| The paper approximates the full-vocab KL with top-K tokens of the **student** distribution: | |
| ``` | |
| L_SDPO ≈ Σ_t [ Σ_{ŷ_t ∈ topK(π_θ)} π_θ(ŷ_t|x,y_{<t}) · log(π_θ(ŷ_t) / q_θ(ŷ_t)) | |
| + tail_term ] | |
| ``` | |
| The tail term aggregates the remaining probability mass. Default K=100. | |
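The top-K approximation can be sketched as follows. This is a hedged illustration: merging everything outside the student's top-K into a single tail bucket on both sides is an assumption about how the tail term aggregates the leftover mass, and `topk_kl_with_tail` is a hypothetical name.

```python
import math

def full_kl(p, q):
    """Exact KL(p || q) over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def topk_kl_with_tail(student, teacher, k):
    """Approximate KL(student || teacher) from the student's top-k tokens.

    Tokens outside the top-k are merged into one tail bucket (assumed form
    of the tail term), so the result is a KL between coarse-grained
    distributions and never exceeds the exact KL.
    """
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    total = sum(student[i] * math.log(student[i] / teacher[i])
                for i in top if student[i] > 0.0)
    s_tail = 1.0 - sum(student[i] for i in top)
    t_tail = 1.0 - sum(teacher[i] for i in top)
    if s_tail > 1e-12 and t_tail > 1e-12:
        total += s_tail * math.log(s_tail / t_tail)
    return total

student = [0.5, 0.25, 0.15, 0.07, 0.03]
teacher = [0.4, 0.3, 0.1, 0.1, 0.1]
approx = topk_kl_with_tail(student, teacher, k=2)
exact = full_kl(student, teacher)
print(approx, exact)
```

With `k` equal to the vocabulary size the tail bucket is empty and the approximation coincides with the exact KL.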
| **Teacher context construction (Table 2, verbatim)**: | |
| ``` | |
| User: {prompt} | |
| Correct solution: {successful_previous_rollout} (skipped if unavailable) | |
| The following is feedback from your unsuccessful earlier attempt: | |
| {environment_output} (skipped if no env output or if solved) | |
| Correctly solve the original question. | |
| Assistant: {original_response} (the student's original attempt, for log-prob re-eval) | |
| ``` | |
| Critical nuance: the `original_response` is placed in the teacher context so the | |
| model can re-evaluate log-probs of `y` under the teacher. **The student's original | |
| attempt is always appended as the response the teacher evaluates** — this is how both | |
| student and teacher evaluate log-probs of the same token sequence `y`. | |
| Token alignment: **there is no shift / hint-insertion at an "error turn."** The teacher | |
| sees `[prompt, feedback, original_response]` and both student and teacher evaluate | |
| log-probs of the SAME `original_response` tokens. No additional tokens are inserted | |
| into the response; the prefix is longer for the teacher (it has feedback), but the | |
| response token sequence evaluated is identical. | |
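A small sketch of the Table 2 template as a string builder, to show that feedback only lengthens the prefix while the response stays untouched. `build_teacher_context` is a hypothetical helper, and dropping the "feedback" header line together with an empty environment output is an assumption about how "skipped" is applied.

```python
def build_teacher_context(prompt, original_response,
                          successful_rollout=None, environment_output=None):
    """Return (teacher_prefix, teacher_prefix + original_response).

    Optional sections are skipped when unavailable. The student's original
    response is appended unchanged, so student and teacher score the same tokens.
    """
    parts = [f"User: {prompt}"]
    if successful_rollout:
        parts.append(f"Correct solution: {successful_rollout}")
    if environment_output:
        parts.append("The following is feedback from your unsuccessful earlier attempt:")
        parts.append(environment_output)
    parts.append("Correctly solve the original question.")
    prefix = "\n".join(parts) + "\nAssistant: "
    return prefix, prefix + original_response

prefix, full = build_teacher_context(
    prompt="Fix the failing test.",
    original_response="def add(a, b): return a - b",
    environment_output="AssertionError: add(1, 2) == 3",
)
print(full)
```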
| **Three feedback types ablated, plus their combination (§4.6, Table 6)**: | |
| 1. **Sample solution** (`f = own solution`): a successful sibling rollout from the GRPO | |
| group; always student-generated (no expert model). Teacher accuracy: 42.4%. | |
| 2. **Environment output** (`f = output`): runtime errors, failing unit tests, etc. | |
| Teacher accuracy: 32.5%. | |
| 3. **Student's original attempt** (`f = y`): the paper reports that including the | |
| original attempt in the feedback (not just in the response slot) **reduces teacher | |
| diversity** (biases teacher toward student; "Same output": 30% vs ~10–13%). | |
| 4. **Combined** (`f = output + own solution`): best trained student accuracy (48.3%). | |
| Excluding `f = y` (the original attempt as part of conditioning) is key. | |
| **Failure modes reported**: | |
| - Non-regularized teacher (`q_θ`) diverges / training collapses (Table 4: 36.1% vs | |
| 50.6% for trust-region teacher). | |
| - Performance depends on model in-context learning ability: SDPO underperforms GRPO on | |
| weaker models (Qwen3-0.6B); hybrid SDPO+GRPO (λ=0.9 GRPO + 0.1 SDPO) is more | |
| robust (§4.5). | |
| - Uninformative or misleading environment feedback: SDPO cannot learn from it. | |
| - SDPO adds small compute overhead (additional forward for log-prob re-computation of | |
| teacher context); minor for large models, non-negligible for small models. | |
| - Including the student's own attempt in the teacher conditioning (not just as the | |
| response to re-evaluate) reduces diversity; the correct template excludes it from | |
| the conditioning prefix. | |
| **SDPO operates over the full rollout, not at isolated "error turns"**. The loss | |
| sums over all tokens `t` in the response `y`. There is no error-site detection | |
| step in the SDPO paper. | |
| ### 1.2 OPSD (arXiv:2601.18734v3) — core method | |
| **Exact loss (Eq. 6–8)**: | |
| ``` | |
| L_OPSD(θ) = E_{(x,y*)~S} [ E_{ŷ~p_S(·|x)} [ D(p_T ‖ p_S)(ŷ|x) ] ] | |
| where | |
| D(p_T ‖ p_S)(ŷ|x) := (1/|ŷ|) Σ_{n=1}^{|ŷ|} D( p_T(·|x, y*, ŷ_{<n}) ‖ p_S(·|x, ŷ_{<n}) ) | |
| ``` | |
| **Divergence D** can be: forward KL, reverse KL, or JSD_β. The paper defines: | |
| ``` | |
| JSD_β(p_T ‖ p_S) = β·KL(p_T ‖ m) + (1-β)·KL(p_S ‖ m) | |
| m = β·p_T + (1-β)·p_S | |
| ``` | |
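The definition above translates directly into a few lines of pure Python. This is a sketch on toy distributions; `kl` and `jsd_beta` are illustrative names, not the upstream API.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd_beta(p_t, p_s, beta):
    """JSD_beta(p_T || p_S) = beta*KL(p_T || m) + (1-beta)*KL(p_S || m),
    with the mixture m = beta*p_T + (1-beta)*p_S."""
    m = [beta * t + (1.0 - beta) * s for t, s in zip(p_t, p_s)]
    return beta * kl(p_t, m) + (1.0 - beta) * kl(p_s, m)

p_t = [0.7, 0.2, 0.1]   # teacher distribution at one position (made up)
p_s = [0.3, 0.4, 0.3]   # student distribution at the same position (made up)
value = jsd_beta(p_t, p_s, beta=0.5)
print(value)
```

At β = 0.5 the expression is symmetric in its two arguments and bounded by log 2, which is the familiar Jensen-Shannon divergence.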
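| **Direction convention**: `D(p_T ‖ p_S)`, teacher in first arg, student in second. For the JSD, | |
| with `m = β·p_T + (1-β)·p_S`, both endpoints are degenerate (the loss is identically zero), so the | |
| direction is read off the limit: | |
| - β→0: `m → p_S` and `JSD_β/β → KL(p_T ‖ p_S)` (forward KL, teacher first; mode-covering for the student) | |
| - β→1: `m → p_T` and `JSD_β/(1-β) → KL(p_S ‖ p_T)` (reverse KL, student first; mode-seeking for the student) | |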
| The GKD paper (arXiv:2306.13649) that OPSD cites defines JSD_β with the **same | |
| convention**: `JSD_β(p ‖ q) = β·KL(p||M) + (1-β)·KL(q||M)`. | |
| **Per-token pointwise clipping**: OPSD introduces this explicitly: | |
| ``` | |
| D_clip^(f)(p_T ‖ p_S) = (1/|ŷ|) Σ_n Σ_v min(l_{n,v}^(f), τ) | |
| where l_{n,v}^(f) = p_T(v|·) · f( p_S(v|·) / p_T(v|·) ) | |
| ``` | |
| This clips per vocab-entry contribution. Default τ=0.05 (from README: `--jsd_token_clip 0.05`). | |
| Non-thinking mode results in README use 1e-7 (Qwen3-8B) and 1e-6 (Qwen3-4B, 1.7B). | |
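The pointwise clip can be sketched in pure Python as below. The generator `f(u) = -log u` (which makes the unclipped sum a forward KL) and the toy numbers are assumptions chosen for illustration; `clipped_divergence` is a hypothetical name.

```python
import math

def clipped_divergence(teacher, student, tau, f=lambda u: -math.log(u)):
    """(1/|y|) * sum_n sum_v min(l_{n,v}, tau), with l_{n,v} = p_T * f(p_S / p_T).

    The min is taken per (position, vocab entry), before any summation.
    All probabilities are assumed strictly positive.
    """
    total = 0.0
    for p_t, p_s in zip(teacher, student):      # positions n
        for t, s in zip(p_t, p_s):              # vocab entries v
            total += min(t * f(s / t), tau)
    return total / len(teacher)

teacher = [[0.9, 0.05, 0.05], [0.4, 0.4, 0.2]]
student = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
unclipped = clipped_divergence(teacher, student, tau=float("inf"))
clipped = clipped_divergence(teacher, student, tau=0.05)
print(unclipped, clipped)
```

Only entries whose individual contribution exceeds τ are affected; negative pointwise terms pass through unchanged, so a small τ can push the clipped sum well below the unclipped divergence.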
| **Teacher context**: `p_T(·|x, y*, ŷ_{<n})` — teacher sees the ground-truth answer `y*` | |
| (a reference CoT / verified reasoning trace from the dataset) prepended to the problem, | |
| then evaluates the student's prefix `ŷ_{<n}`. Same token sequence for both distributions | |
| evaluated at each step `n`. | |
| **`--reason_first` flag (from GitHub README)**: Prepend an explicit rationalization to the | |
| teacher context before distillation. This is OPSD's self-introspection lever: the teacher | |
| is first asked to rationalize why `y*` is correct, then that rationalization is folded into | |
| the conditioning. Not the main results configuration; requires `--use_peft`. | |
| **Results**: On Qwen3-1.7B (AIME24/25/HMMT25), OPSD vs. base: 37.1% → 43.4% | |
| (Avg@12). Outperforms GRPO (37.7%) and SFT (35.8%). Token-efficient: generation capped | |
| at 1024 tokens vs. GRPO's 16K. | |
| --- | |
| ## 2. Audit: `composer_replication/opsd.py` — "byte-for-byte OPSD parity" claim | |
| ### 2.1 JSD formula — CORRECT, with a subtle direction note | |
| The code implements: | |
| ```python | |
| JSD = β·KL(teacher||M) + (1-β)·KL(student||M) | |
| log M = logsumexp([ log p_student + log(1-β), log p_teacher + log(β) ]) | |
| ``` | |
| The OPSD paper (Eq. 7) defines: | |
| ``` | |
| JSD_β(p_T ‖ p_S) = β·KL(p_T‖m) + (1-β)·KL(p_S‖m) | |
| ``` | |
| where `m = β·p_T + (1-β)·p_S`. | |
| The code `kl_teacher = F.kl_div(mixture_log_probs, teacher_log_probs, ...)` uses | |
| PyTorch semantics where `F.kl_div(input=log_q, target=log_p, log_target=True)` | |
| computes `KL(p||q) = Σ p(x)·(log p(x) - log q(x))`. So `kl_teacher` computes | |
| `KL(teacher||mixture)` and `kl_student` computes `KL(student||mixture)`. | |
| The final JSD: `β·kl_teacher + (1-β)·kl_student` = `β·KL(teacher||M) + (1-β)·KL(student||M)`. | |
| This matches the OPSD paper's `JSD_β(p_T ‖ p_S)` exactly. **CORRECT.** | |
| ### 2.2 β convention docstring — INVERTED vs. both papers | |
| The `opsd.py` docstring says: | |
| ``` | |
| β = 0 → KL(teacher || student) (reverse KL — mode-covering for student) | |
| β = 1 → KL(student || teacher) (forward KL — mode-seeking for student) | |
| ``` | |
| From the OPSD GitHub README: | |
| > `--beta`: Interpolation weight for the JSD mixture distribution. | |
| > **Beta=0 means forward KL and 1 means reverse KL.** | |
| The repo docstring has the β=0 and β=1 *names* **swapped** relative to the OPSD upstream. | |
| With `M = β·p_teacher + (1-β)·p_student`, β=0 gives `M = p_student`, so `KL(student||M) = 0` | |
| and `JSD_0` is identically zero. The meaningful statement is the limit: as β→0, | |
| `JSD_β/β → KL(teacher||student)`, which is the **forward KL** in the usual distillation | |
| convention and is **mode-covering for the student**. Symmetrically, as β→1, | |
| `JSD_β/(1-β) → KL(student||teacher)`, the reverse KL, which is mode-seeking. The README's | |
| "Beta=0 means forward KL" matches this analysis. | |
| The code *implementation* is correct (the formula computes the right mixture). The docstring's | |
| KL expressions and its mode-covering / mode-seeking notes agree with these limits; only the | |
| names are wrong: it calls β=0 "reverse KL" and β=1 "forward KL", which contradicts the | |
| upstream README. This is a documentation error, not a numerical error. | |
| **VERDICT**: Implementation is numerically correct. Docstring direction labels are inverted. | |
| ### 2.3 `reduction="batchmean"` behavior — MINOR DIVERGENCE from the paper | |
| The repo `opsd.py` comment says: | |
| > "batchmean" matches upstream OPSD: divides by `mask.sum()` when labels are given, | |
| > else by the leading dim of jsd (= batch size). | |
| The OPSD paper (Algorithm 1) normalizes by `|ŷ|` (sequence length, token-mean): | |
| ``` | |
| ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...) | |
| L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*) | |
| ``` | |
| The repo divides by `mask.sum()` (number of valid, unmasked tokens in the batch), which is | |
| equivalent to OPSD's normalization only when every example has the same number of | |
| error-turn tokens. When the number of valid tokens varies across sequences (as in real | |
| training), this differs from the paper's per-sequence average followed by batch average: | |
| sequences with more valid tokens get more weight. In practice this difference is negligible | |
| for stability, but it is technically not byte-for-byte OPSD parity on the reduction. | |
| **VERDICT**: The `reduction="batchmean"` logic is borrowed from the OPSD upstream code | |
| (which uses the same `mask.sum()` convention). The docstring's "matches upstream" claim | |
| is accurate for the code, but the code diverges from the paper's stated per-sequence | |
| normalization. Not a material issue. | |
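A tiny worked example of the two reductions, on made-up per-token losses. The function names are illustrative; `token_mean` mirrors the `mask.sum()` convention and `sequence_then_batch_mean` mirrors the paper's Algorithm 1 as summarized above.

```python
def token_mean(per_token_losses):
    """Sum over all valid tokens in the batch divided by their count (mask.sum())."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)

def sequence_then_batch_mean(per_token_losses):
    """Average within each sequence first (1/|y|), then across the batch (1/|B|)."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)

# One short sequence with high loss, one longer sequence with zero loss.
ragged = [[1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
equal = [[1.0, 1.0], [0.0, 0.0]]

print(token_mean(ragged), sequence_then_batch_mean(ragged))
print(token_mean(equal), sequence_then_batch_mean(equal))
```

On the ragged batch the token-level mean is 2/6 while the per-sequence mean is 0.5; on the equal-length batch both reductions agree.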
| ### 2.4 `token_clip` parameter — CORRECT semantics, but per-token vs. per-(token,vocab) distinction | |
| The repo applies `token_clip` as an elementwise clamp on the unreduced JSD tensor: | |
| ```python | |
| jsd = jsd.clamp(max=token_clip) # jsd shape is (B, T, V) or (n_valid, V) | |
| ``` | |
| The OPSD paper's pointwise clipping (Section 3.3) clips **per-(position, vocab-entry)**: | |
| `min(l_{n,v}^(f), τ)` for each vocab entry v at each position n. | |
| The upstream OPSD code (`--jsd_token_clip`) appears to apply the same per-(position,vocab) | |
| clip. Because the repo clamps the jsd tensor before any reduction, while it still has | |
| shape (B,T,V), each vocab-entry contribution is clipped separately at each position. | |
| This is per-(position,vocab) clipping, which is correct. | |
| **VERDICT**: Implementation appears correct. The parameter name (`token_clip`) is slightly | |
| misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match. | |
| --- | |
| ## 3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics | |
| ### 3.1 ERROR-TURN MASKING — NOT IN SDPO | |
| The repo implements SDPO as an error-turn-masked loss: | |
| - `_compute_sdpo_loss` applies JSD only at `error-turn tokens` (via `sdpo_loss_mask`). | |
| - The data collator detects error sites in a trace and constructs a teacher context | |
| with a hint inserted at the error turn (`ctx_teacher = ctx_student + hint`). | |
| - The hint shifts teacher response tokens right, requiring explicit alignment indices | |
| (ADR-011). | |
| **The SDPO paper has no error-turn masking.** SDPO applies the KL loss to ALL tokens `t` | |
| in the rollout response: | |
| > "L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})))" | |
| The SDPO teacher context includes the full feedback; both student and teacher evaluate | |
| log-probs of the **same response tokens** `y`. There is no "hint inserted into the | |
| response" — the feedback is in the conditioning prefix, not intercalated into the | |
| response sequence. Therefore the teacher response tokens are **not shifted** and token | |
| alignment is trivially preserved: both contexts evaluate the same sequence `y`. | |
| **The repo's architecture (hint at error turn → response token shift → alignment indices)** | |
| is an interpretation of Composer 2.5's "hint" mechanism, not a feature of SDPO. SDPO's | |
| feedback is in the prompt/conditioning context; it does not intercalate text into the | |
| middle of a response. | |
| **VERDICT**: The repo's error-turn-masking design is a reasonable extension of the | |
| Composer blog's described mechanism ("insert hint at error turn") but is **NOT** | |
| SDPO as described in the paper. The Composer blog's mechanism is itself not fully | |
| described and may or may not match SDPO mechanics. | |
| ### 3.2 TEACHER CONTEXT — CRITICAL DIFFERENCE | |
| **SDPO teacher context** (Table 2): | |
| ``` | |
| [prompt, feedback_f, original_response_y] | |
| ``` | |
| The teacher evaluates log-probs of `original_response_y` given `[prompt, feedback_f]`. | |
| Teacher prefix = `[prompt, feedback_f]`. Response = `y` (same as student). No hint is | |
| inserted *into* `y`. | |
| **Repo teacher context**: | |
| ``` | |
| ctx_teacher = ctx_student + hint_at_error_turn | |
| ``` | |
| The hint is *intercalated* into the response sequence at an error turn. Teacher prefix | |
| = student prefix. Response = `y_before_error + hint + y_after_error`. Teacher response | |
| tokens are LONGER than student response tokens and SHIFTED. | |
| This is architecturally different from SDPO. The alignment problem (ADR-008, ADR-011) | |
| arises precisely because the repo's teacher context design inserts hint text into the | |
| response, which SDPO does not do. | |
| **VERDICT**: The repo's teacher context construction is a novel design inspired by the | |
| Composer blog. It is not what SDPO does. The ADR-008 "trust-gap" and the entire | |
| ADR-011 alignment index complexity are artifacts of this departure from SDPO, not | |
| corrections to SDPO. | |
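The difference between the two context designs can be shown on toy token lists. This is an illustrative sketch only: the function names, the string "tokens", and the index maps are assumptions, and the real collator works on token ids with masks rather than on lists like these.

```python
def sdpo_style(prompt, feedback, response):
    """Feedback goes into the prefix; the response tokens are untouched."""
    student = prompt + response
    teacher = prompt + feedback + response
    s_idx = list(range(len(prompt), len(student)))
    t_idx = list(range(len(prompt) + len(feedback), len(teacher)))
    return student, teacher, s_idx, t_idx

def hint_insert_style(prompt, response, hint, error_pos):
    """Hint is spliced into the response at error_pos (the design audited here)."""
    student = prompt + response
    teacher = prompt + response[:error_pos] + hint + response[error_pos:]
    s_idx = list(range(len(prompt), len(student)))
    t_idx = [len(prompt) + i + (len(hint) if i >= error_pos else 0)
             for i in range(len(response))]
    return student, teacher, s_idx, t_idx

prompt = ["<p1>", "<p2>"]
response = ["a", "b", "c", "d"]
extra = ["<h1>", "<h2>", "<h3>"]      # feedback or hint tokens

_, _, s1, t1 = sdpo_style(prompt, extra, response)
_, teacher2, s2, t2 = hint_insert_style(prompt, response, extra, error_pos=2)

offsets_sdpo = {t - s for s, t in zip(s1, t1)}
offsets_hint = {t - s for s, t in zip(s2, t2)}
print(offsets_sdpo, offsets_hint)
```

With feedback in the prefix, every response position is shifted by the same constant, so a single slice aligns the two sequences. With a hint spliced into the response, the shift changes at the insertion point, which is why explicit per-token index maps are needed.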
| ### 3.3 OPSD vs. SDPO as the loss source | |
| The repo header in `opsd.py` says the loss is: | |
| > "lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)" | |
| And: | |
| > "SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's | |
| > 'Targeted RL with Textual Feedback.'" | |
| These are TWO DIFFERENT methods with related but not identical losses: | |
| - **OPSD loss** (Eq. 7–8): `JSD_β(p_T ‖ p_S)` with teacher having `y*` (ground truth). | |
| Normalization: per-sequence average then batch average. Pointwise vocab clipping. | |
| Training runs ~100 steps. Fixed teacher (initial checkpoint, not live). | |
| - **SDPO loss** (Eq. 1): `KL(π_θ(·|x,y_{<t}) ‖ stopgrad(π_θ(·|x,f,y_{<t})))` where | |
| KL is applied per-position over the full response. The paper adopts JSD as a stability | |
| improvement (§2.3) but the base formulation is reverse KL (student ‖ teacher). | |
| Teacher is regularized via EMA or trust-region. No per-vocab clipping in the paper | |
| (top-K approximation instead). | |
| The repo correctly implements the OPSD JSD formula (which SDPO also uses for stability). | |
| The claim "verified port of siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss" is | |
| accurate for the loss kernel. The claim "Composer 2.5's 'Targeted RL with Textual Feedback'" | |
| is an assertion that Composer uses the same loss — this is not confirmed anywhere in the | |
| Cursor blog or Composer 2 tech report. | |
| --- | |
| ## 4. Audit: `_compute_sdpo_loss` in `composer_trainer.py` | |
| ### 4.1 Gradient flow — CORRECT | |
| ```python | |
| student_logits = model(input_ids=inputs["input_ids"]).logits | |
| with torch.no_grad(): | |
| teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits | |
| ``` | |
| Teacher is `no_grad` — matches SDPO's `stopgrad(π_θ(·|x,f,y_{<t}))`. Student has | |
| gradient. Correct. | |
| ### 4.2 Alignment index machinery — NECESSARY GIVEN THE DESIGN, BUT NOT FROM SDPO | |
| The `student_response_idx` / `teacher_response_idx` machinery (ADR-011) is needed | |
| because the hint is inserted into the teacher response sequence. This complexity does | |
| not exist in SDPO or OPSD because those methods never insert text into the response. | |
| The repo's `strict_sdpo_alignment` guard is a correct defense against the problem it | |
| has created for itself. | |
| ### 4.3 Batch-level masking — CORRECT for the repo's error-turn interpretation | |
| The loss is masked to error-turn tokens only (`aligned_labels` with -100 elsewhere). | |
| This means the SDPO channel only trains on error recovery tokens, not the full rollout. | |
| SDPO trains on the full rollout. For Composer's intent (correcting error turns), the | |
| masking is reasonable, but it produces a loss that is closer to targeted distillation | |
| at error sites than to SDPO's full-rollout credit assignment. | |
| --- | |
| ## 5. Audit: `research/07-sdpo-hint-generator.md` — Accuracy check | |
| ### 5.1 Three feedback types from SDPO paper — CORRECTLY REPORTED | |
| research/07 correctly identifies the three types: | |
| 1. Sample solution (successful sibling rollout) | |
| 2. Environment output (runtime errors) | |
| 3. Student's original attempt | |
| The paper (Table 6 results, which research/07 did NOT have access to) shows: | |
| - Best configuration: `f = output + own solution` (48.3% accuracy) | |
| - Including `f = y` (original attempt as conditioning, not as response) **hurts diversity** | |
| and slightly reduces final accuracy (44.5% vs 48.3%) | |
| Research/07 correctly notes the sibling rollout is "always generated by the student, not | |
| an expert model" — confirmed in the paper: "We emphasize that these sample solutions are | |
| always generated by the student, as in GRPO, and do not require an expert model." | |
| ### 5.2 "Successful sibling rollout as implicit feedback" claim — CORRECTLY REPORTED | |
| The abstract: "SDPO also outperforms baselines in standard RLVR environments that only | |
| return scalar feedback by using successful rollouts as implicit feedback for failed attempts." | |
| Research/07 cites this correctly and uses it as the basis for the `SiblingBootstrapGenerator`. | |
| ### 5.3 OPSD `--reason_first` flag — CORRECTLY DESCRIBED | |
| The OPSD README confirms: `--reason_first False: Prepend an explicit rationalization to | |
| the teacher context before distillation.` Research/07 correctly calls this "OPSD's own | |
| knob for same-model introspection." | |
| ### 5.4 `--jsd_token_clip default 0.05` — CORRECTLY CITED | |
| Confirmed from OPSD README: `--jsd_token_clip 0.05` is the default. | |
| --- | |
| ## 6. Audit: `SiblingBootstrapGenerator` — Is it supported by the papers? | |
| The repo's `hint_generator.py` sketch (lines 319–331) and `research/07` §6.3: | |
| ```python | |
| class SiblingBootstrapGenerator: | |
| def generate(self, ctx): | |
| sibs = ctx.get("sibling_rollouts") or [] | |
| winners = [s for s in sibs if s.get("reward", 0.0) > 0.0] | |
| if not winners: | |
| return None | |
| best = max(winners, key=lambda s: s["reward"]) | |
| snippet = (best.get("solution_excerpt") or "")[:200] | |
| return ("Reminder: a working approach for this task looks like:\n" | |
| f"{snippet}\nAdapt this to the current step.") | |
| ``` | |
| **What the SDPO paper actually does** (Table 2 template): | |
| ``` | |
| Correct solution: {successful_previous_rollout} | |
| ``` | |
| The successful rollout is passed as the full solution (or relevant excerpt) in the | |
| teacher context. The teacher then evaluates log-probs of the student's original | |
| response given this context. | |
| **Key difference**: In SDPO, the sibling solution goes into the teacher's conditioning | |
| prefix. The teacher does not generate a new hint; it just re-evaluates the student's | |
| response log-probs with the solution visible. In the repo, the sibling solution is | |
| used to *generate a hint string* that gets inserted into the response sequence. | |
| This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout | |
| as implicit feedback" mechanism does NOT: | |
| 1. Generate a "Reminder: a working approach..." hint string. | |
| 2. Insert text into the student's response sequence. | |
| 3. Require error-turn detection. | |
| The SDPO sibling mechanism IS: | |
| 1. Condition the teacher on the full successful solution. | |
| 2. Re-evaluate ALL student response token log-probs under that teacher. | |
| 3. Apply the KL loss across the entire response. | |
| **VERDICT**: The `SiblingBootstrapGenerator` as sketched is an extrapolation from SDPO's | |
| mechanism, not a faithful implementation of it. The paper supports using a sibling rollout | |
| as teacher conditioning context; it does not support generating a textual hint from it | |
| to splice into the response. The Composer blog's "hint" framing is the source of this | |
| architectural decision; SDPO is cited as inspiration but is not the mechanism. | |
| Research/07 acknowledges this at several points ("A working approach looks like: …" in | |
| the class comment vs the actual SDPO template) but does not flag it as a divergence — it | |
| presents the sibling-bootstrap hint approach as if it naturally follows from SDPO. | |
| --- | |
| ## 7. Audit: `research/11-sdpo-alignment-indices.md` | |
| ### 7.1 Problem correctly identified | |
| ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher | |
| response tokens right. The alignment indices machinery (`_mask_to_padded_indices`, | |
| `student_response_idx`, `teacher_response_idx`, sentinel handling) is a sound engineering | |
| solution to the problem the repo's design creates. | |
| ### 7.2 Root cause attribution — MISLEADING | |
| ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the | |
| papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate. | |
| The alignment problem does not exist in SDPO or OPSD because those methods never insert | |
| text into the response sequence. The alignment problem is entirely self-created by the | |
| repo's decision to implement the Composer blog's "hint at error turn" as a text insertion | |
| into the teacher's response sequence. | |
| --- | |
| ## 8. Audit: ADR-007, ADR-008, ADR-009 — Key claims | |
| ### ADR-007 — JSD as "the kernel of SDPO arXiv:2601.20802" | |
| The ADR says `generalized_jsd_loss` is "verified port of siyan-zhao/OPSD, the kernel of | |
| SDPO arXiv:2601.20802." This telescopes two papers. The JSD is the kernel of OPSD | |
| (the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a | |
| **stability improvement** over the base KL loss; the primary SDPO loss is the KL. Both | |
| papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not | |
| consequential for the loss code but creates confusion in documentation. | |
| ### ADR-008 — "SDPO needs full vocabulary logits" | |
| Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively | |
| computing the KL divergence between student and teacher requires holding full logits of | |
| both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's | |
| log-probs-only interface is correctly identified as incompatible with the SDPO channel. | |
| ### ADR-008 — Dr. GRPO as the Composer algorithm | |
| This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out | |
| of scope for this cluster). | |
| ### ADR-009 — "How Cursor generates that hint is unstated" | |
| Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in | |
| research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question. | |
| --- | |
| ## 9. Summary of findings | |
| | Claim | Source | Verdict | | |
| |---|---|---| | |
| | JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT | | |
| | β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label | | |
| | "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code | | |
| | Error-turn masking is from SDPO | SDPO paper | FALSE — SDPO applies loss to full rollout, no error-turn detection | | |
| | Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE — SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated | | |
| | SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION — SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence | | |
| | Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING — problem is self-created by hint-insertion design; does not exist in SDPO | | |
| | SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT | | |
| | Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED | | |
| | --jsd_token_clip default 0.05 | OPSD README | CORRECT | | |
| | --reason_first flag | OPSD README | CORRECTLY DESCRIBED | | |
| | "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED | | |
| | Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py | | |
| --- | |
| ## 10. Recommendations | |
| 1. **Fix the β docstring** in `opsd.py` to match the OPSD upstream convention: | |
| β=0 → forward KL (KL(teacher‖student), mode-covering), β=1 → reverse KL | |
| (KL(student‖teacher), mode-seeking). Only the "forward"/"reverse" names need swapping. | |
| 2. **Clarify the architectural departure from SDPO** in `composer_trainer.py` docstring | |
| and `research/07`: the repo implements a Composer-blog-inspired error-turn hint | |
| injection, which is an extension beyond SDPO. SDPO uses the feedback in the prompt | |
| prefix and evaluates the full response; the repo intercalates text into the response. | |
| 3. **Reconsider framing of `SiblingBootstrapGenerator`**: it is an original design choice, | |
| not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like: | |
| build a teacher context `[prompt, successful_sibling_rollout, original_response]` and | |
| apply KL over the whole original response — without generating a hint string or | |
| error-turn detection. This would be simpler and more faithful to SDPO. | |
| 4. **Teacher regularization is not implemented**: the SDPO paper shows a non-regularized | |
| teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model | |
| weights at each step with no EMA or trust-region regularization. For production SDPO | |
| runs this is a gap. (The `sdpo_jsd_beta` default of 0.5 uses symmetric JSD which is | |
| one of SDPO's stability improvements, but the teacher regularization is absent.) | |
| 5. **SDPO's original attempt placement**: the paper includes the student's original | |
| response as the sequence being log-prob-evaluated (i.e., the "response" slot in the | |
| teacher context). The repo's collator instead masks specific error-turn tokens within | |
| a modified response. These are architecturally different. The paper-accurate approach | |
| would re-evaluate log-probs of the entire original response under the hint-conditioned | |
| teacher, not just the tokens after the error. | |
| 6. **Failure mode from SDPO paper**: the strongest limitation is model capability | |
| dependence — SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with | |
| λ=0.9 is recommended for weaker base models. This is not documented in the repo's | |
| SDPO usage guidance. | |
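For recommendation 4, an EMA teacher can be sketched in a few lines. This is a generic exponential moving average over parameters stored in a plain dict; the `decay` value and the `ema_update` name are assumptions for illustration, not settings taken from the SDPO paper or the repo. The trust-region variant is not sketched here.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher params: decay * teacher + (1 - decay) * student.

    The teacher lags behind the student, so it changes slowly from step to
    step instead of being the live student weights.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}    # held fixed here; in training it changes every step
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

In a real trainer the same update would run over the model's parameter tensors after each optimizer step, and the teacher forward pass would use the averaged weights.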
| --- | |
| ## 11. What the papers do NOT say (repo-claimed but unconfirmed in sources) | |
| - That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the | |
| Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but | |
| the connection is an inference, not a citation). | |
| - That error-turn masking is part of SDPO. | |
| - That the repo's hint-at-error-turn teacher context is the SDPO mechanism. | |
| - That the alignment index problem (ADR-011) is an issue in SDPO. | |
| - How Cursor generates the hint (confirmed absent in all Cursor artifacts). | |