amarsaikhan committed on
Commit bb8ca75 · verified · 1 Parent(s): 4c32d62

Update model card

Files changed (1)
  1. README.md +11 -8
README.md CHANGED
@@ -80,15 +80,18 @@ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
 
  ## Comparison to Other Conditions
 
- All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.
+ All five adapters share the same base model and seed. The original three (A, C-light, C-reg) used a 200-problem MBPP pool over 3 iterations; the two full-pool adapters (A-v2, C-reg2) used the 311-problem pool over 6 iterations. Each adapter row reports its **published checkpoint** — for A-v2 the iteration-4 peak, for the others the final / last completed iteration — and the _Base_ row is the untrained model (iteration 0, identical across all conditions). Rows are sorted by HumanEval pass@1, so conditions above _Base_ beat the baseline and those below regress. Bold marks the best value in each metric column (for GRPO KL, lower = less policy drift).
 
- | Condition | aux_loss_scale | kl_coeff | HumanEval pass@1 (it 3) | MBPP-held pass@5 (it 3) | GRPO KL (it 3) |
- |---|---:|---:|---:|---:|---:|
- | **A (exec-only)** this card | 0.00 | 0.01 | **0.805** | 0.690 | **0.0011** |
- | [C-light (naive co-evolve)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
- | [C-reg (regularized co-evolve)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |
+ | Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 | GRPO KL |
+ |---|---|---:|---:|---:|---:|---:|
+ | [A-v2 (exec-only, full)](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) | 311 / it 4 | 0.00 | 0.02 | **0.816** | 0.710 | 0.0023 |
+ | **A (exec-only)** — this card | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 | **0.0011** |
+ | [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |
+ | _Base (untrained Qwen2.5-Coder-3B)_ | — / it 0 | — | — | 0.796 | 0.680 | — |
+ | [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 | 0.0957 |
+ | [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
 
- Condition A delivers the highest HumanEval pass@1 and the lowest reference-policy drift; C-reg is the only condition that beats it on MBPP pass@5 (+3 pp), and C-light demonstrates the policy-drift failure mode.
+ The exec-only conditions (A, A-v2) hold the lowest KL and the top HumanEval pass@1; A's full-pool rerun ([A-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2)) is the strongest in the study. The co-evolve runs either fail outright (C-light) or drift over a long schedule (C-reg2); the short regularized run (C-reg) keeps the best MBPP pass@5.
 
  ## Findings Summary
 
@@ -98,7 +101,7 @@ Condition A delivers the highest HumanEval pass@1 and the lowest reference-polic
 
  ## Related Artifacts
 
- - Sibling adapters: [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b)
+ - Sibling adapters: [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-A-3b-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
  - GitHub repository: https://github.com/amarsaikhanb/spark-code
  - Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_A/eval/` in the repository
  - Interactive demo Space: [SPACES_URL]
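
The updated card states two presentation rules for the comparison table: rows are sorted by HumanEval pass@1 so that conditions above _Base_ beat the baseline, and bold marks the best value per metric column (lowest for GRPO KL). A minimal sketch of how those rules could be checked is below. The numbers are hand-copied from the new table in this commit; the names `ROWS`, `sort_by_humaneval`, `split_around_base`, and `best_per_metric` are hypothetical and do not come from the spark-code repository.

```python
# Rows hand-copied from the updated comparison table (assumption: no typos in transcription).
# Tuple layout: (condition, HumanEval pass@1, MBPP-held pass@5, GRPO KL or None for the base model).
ROWS = [
    ("C-light", 0.773, 0.680, 0.0941),
    ("A", 0.805, 0.690, 0.0011),
    ("Base", 0.796, 0.680, None),
    ("C-reg2", 0.774, 0.680, 0.0957),
    ("A-v2", 0.816, 0.710, 0.0023),
    ("C-reg", 0.800, 0.720, 0.0136),
]


def sort_by_humaneval(rows):
    """Return rows ordered by HumanEval pass@1, highest first."""
    return sorted(rows, key=lambda r: r[1], reverse=True)


def split_around_base(rows):
    """Return (conditions above Base, conditions below Base) in sorted order."""
    ordered = [r[0] for r in sort_by_humaneval(rows)]
    i = ordered.index("Base")
    return ordered[:i], ordered[i + 1:]


def best_per_metric(rows):
    """Return the condition that would be bolded in each metric column."""
    # The base model has no GRPO KL, so it is excluded from that column.
    trained = [r for r in rows if r[3] is not None]
    return {
        "humaneval_pass1": max(rows, key=lambda r: r[1])[0],
        "mbpp_pass5": max(rows, key=lambda r: r[2])[0],
        "grpo_kl": min(trained, key=lambda r: r[3])[0],  # lower means less drift
    }


if __name__ == "__main__":
    above, below = split_around_base(ROWS)
    print("beat baseline:", above)
    print("regress:", below)
    print("bold cells:", best_per_metric(ROWS))
```

Keeping the base model as an ordinary row (rather than a separate constant) means the "above / below baseline" split falls out of the same sort that orders the table.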