---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---

# SPARK-Code · Condition A (Exec-Only GRPO) · Qwen2.5-Coder-3B QLoRA

**QLoRA adapter trained with execution-grounded GRPO. It shows the lowest policy drift in the SPARK-Code study and the best HumanEval pass@1 among the original 200-problem runs.**

## TL;DR

`spark-code-A-3b` is a LoRA adapter for `Qwen/Qwen2.5-Coder-3B-Instruct` produced by 3 iterations of Group Relative Policy Optimization (GRPO) on 200 MBPP problems, using partial per-test execution feedback as the only reward signal. It moves HumanEval pass@1 from 0.796 → 0.805 (+0.85 pp) monotonically while keeping the KL to the frozen reference at or below 1.1e-3, and it holds its ground on held-out MBPP (0.634 → 0.636 pass@1; 0.68 → 0.69 pass@5 with an intermediate peak at 0.71). In the original three-arm comparison (A, C-light, C-reg), Condition A is the only run that improves on both benchmarks without policy drift.

## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. For each MBPP problem we generate a group of rollouts, score each rollout by the fraction of unit tests it passes (with explicit penalties for syntax/runtime/timeout errors), normalize rewards within the group, and apply a clipped PPO-style policy-gradient update. No auxiliary SFT objective is used in this condition; this is the exec-only baseline.
- **Training data:** MBPP-sanitized, 200 problems, 3 iterations, K=4 adaptive rollouts (up to 8 when the group has zero advantage variance), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 with double quantization, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.01` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.
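
The reward and group-normalization steps listed above can be sketched as follows. This is a minimal illustration rather than the code in the training script: the function names, the `status` strings, the rule that an execution error replaces the pass fraction for the whole rollout, and the `eps` term are all assumptions. Only the penalty values and the rollout cap of 8 come from the setup above.

```python
from statistics import mean, pstdev

# Penalty values copied from the training setup above.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0


def rollout_reward(status: str, passed: int, total: int) -> float:
    """Hypothetical rule: a fixed penalty for a failed execution, else the pass fraction."""
    if status == "syntax_error":
        return SYNTAX_PENALTY
    if status == "runtime_error":
        return RUNTIME_PENALTY
    if status == "timeout":
        return TIMEOUT_PENALTY
    # Partial credit: fraction of unit tests passed, never below the floor.
    return max(WRONG_ANSWER_FLOOR, passed / total)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def needs_more_rollouts(rewards: list[float], max_rollouts: int = 8) -> bool:
    """True when all rewards are identical (zero advantage variance) and budget remains."""
    return len(rewards) < max_rollouts and max(rewards) == min(rewards)


# One group of K=4 rollouts for a problem with 3 unit tests.
rewards = [
    rollout_reward("ok", 3, 3),
    rollout_reward("ok", 1, 3),
    rollout_reward("syntax_error", 0, 3),
    rollout_reward("timeout", 0, 3),
]
advantages = group_advantages(rewards)
```

A group in which every rollout earns the same reward yields all-zero advantages and therefore no gradient, which is the motivation for sampling extra rollouts in that case.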

Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
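
For readers unfamiliar with the objective, the per-token loss implied by `clip_ratio=0.2` and `kl_coeff=0.01` can be sketched in plain Python. Two assumptions are made here that the card does not confirm: that "k=3 estimator" refers to the commonly used k3 KL estimator, `exp(d) - 1 - d` with `d = logp_ref - logp_new`, and that the KL term is added to the loss rather than folded into the reward. All function names are hypothetical.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_ratio: float = 0.2) -> float:
    """PPO-style clipped objective for one token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    # Taking the minimum keeps the update pessimistic on both sides of the clip.
    return min(ratio * advantage, clipped * advantage)


def k3_kl(logp_new: float, logp_ref: float) -> float:
    """k3 estimate of KL(new || ref) from a token sampled under the new policy."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - 1.0 - log_ratio


def token_loss(logp_new: float, logp_old: float, logp_ref: float, advantage: float,
               clip_ratio: float = 0.2, kl_coeff: float = 0.01) -> float:
    """Loss to minimize: negative clipped surrogate plus the weighted KL penalty."""
    surrogate = clipped_surrogate(logp_new, logp_old, advantage, clip_ratio)
    return -surrogate + kl_coeff * k3_kl(logp_new, logp_ref)
```

The k3 form is non-negative for every sample and is zero when the two log-probabilities match, which makes it a convenient low-variance penalty.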

## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts.

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
|    0 |            0.796 |            0.854 |            0.634 |            0.680 |             n/a |     n/a |
|    1 |            0.798 |            0.860 |            0.624 |            0.690 |           0.603 |  0.0002 |
|    2 |            0.799 |            0.848 |            0.632 |            0.710 |           0.640 |  0.0005 |
|    3 |        **0.805** |            0.854 |        **0.636** |            0.690 |           0.639 |  0.0011 |

**Trajectory.** HumanEval pass@1 climbs monotonically across all three iterations (+0.85 pp end-to-end), and KL stays at or below 1.1e-3, indicating that the policy improves without drifting far from the base distribution. MBPP held-out pass@5 peaks at iter 2 (0.71) and settles to 0.69 at iter 3, while pass@1 dips at iter 1 (0.624) and ends slightly above baseline (+0.2 pp). Train pass rate rises from 0.603 to 0.639, consistent with the eval gains. Mean tokens per GRPO sequence stays in the 177–182 range across iterations, so there is no completion-length collapse.
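
The pass@1 and pass@5 columns are computed from 5 samples per problem. The card does not say which estimator the evaluation code uses; the sketch below assumes the standard unbiased pass@k estimator introduced alongside HumanEval, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n is the number of samples drawn, c the number that pass all tests.
    """
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With `n=5`, pass@1 reduces to the mean fraction of passing samples and pass@5 to the share of problems with at least one passing sample. On a 100-problem set such as held-out MBPP, one problem moves pass@5 by 0.01.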

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build the prompt with the model's chat template.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p used for evaluation.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Comparison to Other Conditions

All five adapters share the same base model and seed. The original three (A, C-light, C-reg) used a 200-problem MBPP pool over 3 iterations; the two full-pool adapters (A-v2, C-reg2) used the 311-problem pool over 6 iterations. Each adapter row reports its **published checkpoint**: the iteration-4 peak for A-v2, and the final (last completed) iteration for the others. The _Base_ row is the untrained model (iteration 0, identical across all conditions). Rows are sorted by HumanEval pass@1, so conditions above _Base_ beat the baseline and those below regress. Bold marks the best value in each metric column (for GRPO KL, lower = less policy drift).

| Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 | GRPO KL |
|---|---|---:|---:|---:|---:|---:|
| [A-v2 (exec-only, full)](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) | 311 / it 4 | 0.00 | 0.02 | **0.816** | 0.710 | 0.0023 |
| **A (exec-only)** (this card) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 | **0.0011** |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |
| _Base (untrained Qwen2.5-Coder-3B)_ | n/a / it 0 | n/a | n/a | 0.796 | 0.680 | n/a |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 | 0.0957 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |

The exec-only conditions (A, A-v2) hold the lowest KL and the top HumanEval pass@1; A's full-pool rerun ([A-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2)) is the strongest in the study. The co-evolve runs either fail outright (C-light) or drift over a long schedule (C-reg2); the short regularized run (C-reg) keeps the best MBPP pass@5.

## Findings Summary

- **Simplest method wins on the primary cross-benchmark metric.** Exec-only GRPO produced the largest, most stable HumanEval pass@1 gain in the study; no auxiliary SFT was required.
- **Drift control is essentially free here.** With `kl_coeff=0.01` and no auxiliary loss pulling the policy off-distribution, KL stays ≤1.1e-3 and completion lengths stay flat across iterations.
- **Sample efficiency is modest but real.** 200 MBPP problems × 3 iterations on a single 3B-parameter base was enough to produce a small but monotonic HumanEval improvement and a peaked MBPP pass@5 gain.

## Related Artifacts

- Sibling adapters: [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-A-3b-v2](https://huggingface.co/amarsaikhan/spark-code-A-3b-v2) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_A/eval/` in the repository
- Interactive demo Space: [SPACES_URL]

## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}
```

## License

The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Tongyi Qianwen LICENSE](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.