amarsaikhan/spark-code-A-3b

@@ -80,15 +80,18 @@ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ## Comparison to Other Conditions
-All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.
-| Condition | aux_loss_scale | kl_coeff | HumanEval pass@1 (it 3) | MBPP-held pass@5 (it 3) | GRPO KL (it 3) |
-|---|---:|---:|---:|---:|---:|
-| **A (exec-only)** — this card           | 0.00 | 0.01 | **0.805** | 0.690 | **0.0011** |
-| [C-light (naive co-evolve)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
-| [C-reg (regularized co-evolve)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |
-Condition A delivers the highest HumanEval pass@1 and the lowest reference-policy drift; C-reg is the only condition that beats it on MBPP pass@5 (+3 pp), and C-light demonstrates the policy-drift failure mode.
 ## Findings Summary
@@ -98,7 +101,7 @@ Condition A delivers the highest HumanEval pass@1 and the lowest reference-polic
 ## Related Artifacts
-- Sibling adapters: [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b)
 - GitHub repository: https://github.com/amarsaikhanb/spark-code
 - Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_A/eval/` in the repository
 - Interactive demo Space: [SPACES_URL]

 ## Comparison to Other Conditions
+All five adapters share the same base model and seed. The original three (A, C-light, C-reg) trained on a 200-problem MBPP pool for 3 iterations; the two full-pool adapters (A-v2, C-reg2) trained on the 311-problem pool for 6 iterations. Each adapter row reports its **published checkpoint**: the iteration-4 peak for A-v2, and the final (last completed) iteration for the others. The _Base_ row is the untrained model (iteration 0, identical across all conditions). Rows are sorted by HumanEval pass@1 in descending order, so conditions above _Base_ beat the baseline on that metric and those below it regress. Bold marks the best value in each metric column (for GRPO KL, lower means less policy drift).
+| Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 | GRPO KL |
+|---|---|---:|---:|---:|---:|---:|
+| [A-v2 (exec-only, full)](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) | 311 / it 4 | 0.00 | 0.02 | **0.816** | 0.710 | 0.0023 |
+| **A (exec-only)** — this card | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 | **0.0011** |
+| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |
+| _Base (untrained Qwen2.5-Coder-3B)_ | — / it 0 | — | — | 0.796 | 0.680 | — |
+| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 | 0.0957 |
+| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
+The exec-only conditions (A, A-v2) hold the lowest GRPO KL (0.0011 and 0.0023) and the top two HumanEval pass@1 scores; A's full-pool rerun ([A-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2)) has the highest HumanEval pass@1 in the study (0.816). The co-evolve runs either fail outright (C-light, below _Base_ on HumanEval with KL 0.0941) or drift over a long schedule (C-reg2, KL 0.0957 at iteration 6); the short regularized run (C-reg) keeps the best MBPP pass@5 (0.720).
 ## Findings Summary
 ## Related Artifacts
+- Sibling adapters: [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-A-3b-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
 - GitHub repository: https://github.com/amarsaikhanb/spark-code
 - Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_A/eval/` in the repository
 - Interactive demo Space: [SPACES_URL]
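
The comparison table's ordering and bold markers can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of the repository: the numbers are copied by hand from the updated table, and the names `ROWS`, `BASE`, `delta_pp`, `summarize`, and `best` are hypothetical helpers introduced only for this illustration.

```python
# Values copied by hand from the comparison table (published checkpoints).
ROWS = {
    "A-v2":    {"humaneval": 0.816, "mbpp": 0.710, "kl": 0.0023},
    "A":       {"humaneval": 0.805, "mbpp": 0.690, "kl": 0.0011},
    "C-reg":   {"humaneval": 0.800, "mbpp": 0.720, "kl": 0.0136},
    "C-reg2":  {"humaneval": 0.774, "mbpp": 0.680, "kl": 0.0957},
    "C-light": {"humaneval": 0.773, "mbpp": 0.680, "kl": 0.0941},
}
# Untrained base model (iteration 0); it has no GRPO KL entry.
BASE = {"humaneval": 0.796, "mbpp": 0.680}


def delta_pp(value, baseline):
    """Difference from the baseline in percentage points, one decimal."""
    return round((value - baseline) * 100, 1)


def summarize(rows=ROWS, base=BASE):
    """Per-condition deltas against the base row, plus the raw KL."""
    return {
        name: {
            "humaneval_pp": delta_pp(r["humaneval"], base["humaneval"]),
            "mbpp_pp": delta_pp(r["mbpp"], base["mbpp"]),
            "kl": r["kl"],
        }
        for name, r in rows.items()
    }


def best(rows, metric, lower_is_better=False):
    """Name of the condition with the best value for one metric."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda name: rows[name][metric])


if __name__ == "__main__":
    for name, s in summarize().items():
        print(f"{name:8s} HumanEval {s['humaneval_pp']:+.1f} pp  "
              f"MBPP {s['mbpp_pp']:+.1f} pp  KL {s['kl']:.4f}")
    print("best HumanEval:", best(ROWS, "humaneval"))
    print("best MBPP:", best(ROWS, "mbpp"))
    print("lowest KL:", best(ROWS, "kl", lower_is_better=True))
```

Computing deltas against the _Base_ row makes the "above beats, below regresses" reading of the table explicit. For a real check, the per-iteration JSONs under `condition_A/eval/` would be the authoritative source rather than hand-copied values.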