---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---
# SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA
**QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint, which gives the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).**
## TL;DR
`spark-code-A-3b-v2` is the scaled-up rerun of the exec-only GRPO baseline: the same recipe as [`spark-code-A-3b`](https://huggingface.co/amarsaikhan/spark-code-A-3b), but on the full 311-problem MBPP training pool, with a longer 6-iteration schedule and `kl_coeff=0.02` (versus 0.01 for the original). HumanEval pass@1 peaks at **0.816 at iteration 4**, the best score across all five adapters in the study, while the KL to the frozen reference stays at or below 2.4e-3 throughout. **The run terminated at iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no `final/` adapter was auto-saved; the published weights are the iteration-4 checkpoint**, chosen as the peak of the eval trajectory.
## Training Setup
- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. Per problem, sample a group of rollouts, score each by the fraction of unit tests it passes (penalties for syntax/runtime/timeout), normalize rewards within the group, apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
- **Training data:** MBPP-sanitized, **311 problems** (full pool), **6 iterations intended** (5 completed + eval; crash during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 + double quant, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.02` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.
- **Published checkpoint:** `condition_A/checkpoints/iter4` (the run crashed before a `final/` was written; see Limitations).
Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
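
The reward and update rule described above can be sketched in a few lines. This is a minimal illustration, not the code in `run_experiment_with_mbpp_heldout.py`: the penalty values and `clip_ratio` come from this card, while the function names, the `status` labels, and the way penalties combine with partial credit are assumptions made for the example.

```python
from statistics import mean, pstdev

# Penalty values and clip ratio copied from this card; the rest is illustrative.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0
CLIP_RATIO = 0.2


def rollout_reward(status: str, passed: int, total: int) -> float:
    """Map one rollout's execution outcome to a scalar reward.

    `status` is a hypothetical label: "syntax_error", "runtime_error",
    "timeout", or "ok" (the code ran and each test passed or failed).
    """
    if status == "syntax_error":
        return SYNTAX_PENALTY
    if status == "runtime_error":
        return RUNTIME_PENALTY
    if status == "timeout":
        return TIMEOUT_PENALTY
    if total == 0:
        return WRONG_ANSWER_FLOOR
    # Partial credit: fraction of unit tests passed, never below the floor.
    return max(WRONG_ANSWER_FLOOR, passed / total)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one problem's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip: float = CLIP_RATIO) -> float:
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# One group of K=4 rollouts for a problem with 3 unit tests (made-up outcomes).
group = [("ok", 3, 3), ("ok", 1, 3), ("syntax_error", 0, 3), ("timeout", 0, 3)]
rewards = [rollout_reward(*g) for g in group]
advantages = group_advantages(rewards)
```

Because advantages are centered within each group, a group in which every rollout earns the same reward contributes no gradient signal. This is one plausible motivation for the adaptive rollout count (K=4, up to 8), though the card does not state the reason.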
## Evaluation Results
HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iterations 0–5 completed; iteration 6 crashed during the GRPO step before its eval.
| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | n/a | n/a |
| 1 | 0.806 | 0.872 | 0.628 | 0.680 | 0.593 | 0.0003 |
| 2 | 0.801 | 0.860 | 0.642 | 0.690 | 0.620 | 0.0007 |
| 3 | 0.793 | 0.872 | 0.618 | 0.680 | 0.633 | 0.0013 |
| **4** | **0.816** | **0.872** | **0.638** | **0.710** | 0.649 | 0.0023 |
| 5 | 0.796 | 0.854 | 0.636 | 0.690 | 0.672 | 0.0024 |
| 6 | n/a | n/a | n/a | n/a | 0.696 | n/a |
**Trajectory.** HumanEval pass@1 oscillates in a narrow band (0.793 to 0.816) and peaks at **0.816 at iteration 4** (+2.0 pp over the iteration-0 baseline), the highest of any adapter in the study; pass@5 reaches 0.872 at iterations 1, 3, and 4, dipping to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). Crucially, GRPO KL stays at or below 2.4e-3 across all five completed GRPO iterations: exec-only GRPO shows **negligible policy drift on the full pool**, in sharp contrast to the regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)), whose KL climbed to ~0.096 over the same schedule. Mean tokens per GRPO sequence stay in the 179–186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.
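
For readers comparing these numbers with other reports, the sketch below shows how pass@1 and pass@5 could be computed from 5 samples per problem. It assumes the standard unbiased pass@k estimator; the card does not say which estimator the evaluation script uses, and the function names and example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n=5 samples for four problems (not real eval data).
counts = [5, 3, 0, 1]
p1 = benchmark_pass_at_k(counts, n=5, k=1)  # per-problem value is c / 5
p5 = benchmark_pass_at_k(counts, n=5, k=5)  # per-problem value is 1 if any sample passed
```

With n = k = 5, pass@5 reduces to the fraction of problems solved by at least one sample. Under that assumption, a 0.01 change in held-out MBPP pass@5 corresponds to a single problem out of 100, so small differences between iterations should be read with care.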
## Limitations
The training run hit `torch.OutOfMemoryError` during iteration 6's GRPO backward pass. The GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No `final/` adapter was written. The weights published here are the **iteration-4 checkpoint**, selected because it is both the eval peak and a fully consistent post-iteration snapshot. Iteration 5 (pass@1 0.796) is also available in the source repo if a more-trained-but-lower checkpoint is preferred. Iteration 6 has no eval (the crash preceded it).
## Usage
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build the prompt with the model's chat template.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
        {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sampling settings match the evaluation setup (temperature=0.2, top_p=0.95).
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
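
If GPU memory is tight, the base model can instead be loaded in 4-bit with the same quantization settings listed under Training Setup. The following is a sketch under assumptions: it requires `bitsandbytes` and a CUDA GPU, the helper name `load_adapter_4bit` is made up for this example, and the card does not state which precision was used for the reported evaluations, so results may differ slightly from the table.

```python
# Quantization settings copied from the Training Setup section of this card.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}
BASE_ID = "Qwen/Qwen2.5-Coder-3B-Instruct"
ADAPTER_ID = "amarsaikhan/spark-code-A-3b-v2"


def load_adapter_4bit():
    """Load the base model in 4-bit NF4 and attach the adapter (hypothetical helper)."""
    # Heavy imports stay local so the settings above can be inspected without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute, as in training
        **QUANT_SETTINGS,
    )
    base = AutoModelForCausalLM.from_pretrained(
        BASE_ID,
        quantization_config=quant_config,
        device_map="auto",
    )
    return PeftModel.from_pretrained(base, ADAPTER_ID)
```

Calling `model = load_adapter_4bit()` yields a model that can be used with the tokenizer and `generate` call shown above.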
## Comparison to Other Conditions
All five adapters share the same base model and seed. The original three used a 200-problem pool over 3 iterations; the two full-pool adapters (A-v2 and C-reg2) use the 311-problem pool with a 6-iteration schedule. Each row reports that adapter's **published checkpoint**, so the "Pool / iters" column gives the pool size and the iteration of the published weights.
| Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---:|---:|---:|---:|
| **A-v2 (exec-only, full)** (this card) | 311 / it 4 | 0.00 | 0.02 | **0.816** | 0.710 |
| [A (exec-only)](https://huggingface.co/amarsaikhan/spark-code-A-3b) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |
A-v2 has the strongest HumanEval pass@1 in the study and the second-best held-out MBPP pass@5 (0.710, behind C-reg's 0.720).
## Findings Summary
- **The simplest method, scaled up, is still the strongest.** Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter, with no auxiliary recycling required.
- **Exec-only does not drift, even over six iterations.** KL stays at or below 2.4e-3 throughout. The matched-schedule regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)) drifted to KL ~0.096 and regressed on HumanEval over the same six iterations. With one seed per condition, this is evidence that the auxiliary objective, not the longer schedule, is what destabilizes the policy.
- **Published checkpoint is the iteration-4 peak.** The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."
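
The KL values quoted in these findings depend on how the per-token KL is estimated. Training Setup mentions a "k=3 estimator"; assuming this refers to the commonly used low-variance k3 estimator (an assumption, since the card does not spell it out), a mean per-token estimate could be computed as in the sketch below. The function name and the example log-probabilities are hypothetical.

```python
import math


def k3_kl(logp_policy: list[float], logp_ref: list[float]) -> float:
    """Mean per-token k3 estimate of KL(policy || ref) on tokens sampled from the policy."""
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp  # log(p_ref / p_policy) for the sampled token
        total += math.exp(log_ratio) - 1.0 - log_ratio  # non-negative for every token
    return total / len(logp_policy)


# Made-up log-probabilities for two sampled tokens.
same = k3_kl([-1.0, -2.0], [-1.0, -2.0])   # identical policies give exactly 0.0
close = k3_kl([-1.0, -2.0], [-1.1, -1.9])  # small log-prob gaps give a small positive value
```

Each term is non-negative, so the estimate cannot go below zero even on a small batch, which makes a slowly rising value such as 0.0003 to 0.0024 easier to interpret than a noisy signed estimate.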
## Related Artifacts
- Sibling adapters: [spark-code-A-3b](https://huggingface.co/amarsaikhan/spark-code-A-3b) · [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0–5) lives under `condition_A/eval/` in the repository
- Interactive demo Space: [SPACES_URL]
## Citation
```bibtex
@misc{batjargal2026sparkcode,
title = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
author = {Amarsaikhan Batjargal},
year = {2026},
}
```
## License
The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.