---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---

# SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA

**QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint — the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).**

## TL;DR

`spark-code-A-3b-v2` is the scaled-up rerun of the exec-only GRPO baseline. It follows the recipe of [`spark-code-A-3b`](https://huggingface.co/amarsaikhan/spark-code-A-3b) but trains on the full 311-problem MBPP pool, over a longer 6-iteration schedule, with `kl_coeff=0.02` (the original A run used 0.01). HumanEval pass@1 peaks at **0.816 at iteration 4**, the best score across all five adapters in the study, while the KL to the frozen reference stays at or below 2.4e-3 in every completed iteration. **The run terminated at iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no `final/` adapter was auto-saved; the published weights are the iteration-4 checkpoint**, chosen as the peak of the eval trajectory.

## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. Per problem, sample a group of rollouts, score each by the fraction of unit tests it passes (penalties for syntax/runtime/timeout), normalize rewards within the group, apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
- **Training data:** MBPP-sanitized, **311 problems** (full pool), **6 iterations intended** (5 completed + eval; crash during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 + double quant, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.02` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.
- **Published checkpoint:** `condition_A/checkpoints/iter4` (the run crashed before a `final/` was written; see Limitations).
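
The reward shaping and within-group normalization described in the list above can be sketched as follows. This is a minimal illustration, not the repository's code: the status labels, the function names, and the rule that an error status replaces partial credit entirely are assumptions; only the penalty values and the floor come from this card.

```python
from statistics import mean, pstdev

# Penalty values copied from the training setup in this card.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0


def rollout_reward(status: str, passed: int, total: int) -> float:
    """Score one rollout.

    `status` is a hypothetical label: 'syntax_error', 'runtime_error',
    'timeout', or 'ok'. Assumption: an error status overrides partial credit.
    """
    if status == "syntax_error":
        return SYNTAX_PENALTY
    if status == "runtime_error":
        return RUNTIME_PENALTY
    if status == "timeout":
        return TIMEOUT_PENALTY
    # The code ran to completion: reward is the fraction of unit tests passed.
    return max(WRONG_ANSWER_FLOOR, passed / total)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one problem's rollout group (zero mean, unit std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of K=4 rollouts for a problem with 4 unit tests.
rewards = [
    rollout_reward("ok", 4, 4),
    rollout_reward("ok", 2, 4),
    rollout_reward("syntax_error", 0, 4),
    rollout_reward("ok", 0, 4),
]
advantages = group_advantages(rewards)
```

A group in which every rollout earns the same reward yields all-zero advantages and therefore no gradient signal, which is one plausible motivation for the adaptive rollout count (K=4, up to 8).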

Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
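
For readers unfamiliar with the objective, here is a per-token sketch of a clipped surrogate with a KL penalty, using the `clip_ratio` and `kl_coeff` values from the setup above. It assumes that "k=3 estimator" refers to the commonly used k3 estimator of KL(policy || reference); the function names and the per-token (rather than per-sequence) formulation are illustrative assumptions, not the training script's implementation.

```python
import math

CLIP_RATIO = 0.2  # clip_ratio from this card
KL_COEFF = 0.02   # kl_coeff from this card


def k3_kl(logp_policy: float, logp_ref: float) -> float:
    """k3 estimate of KL(policy || ref) for one token sampled from the policy.

    With r = p_ref / p_policy, k3 = (r - 1) - log(r), which is always >= 0.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - 1.0 - log_ratio


def token_loss(logp_new: float, logp_old: float, logp_ref: float, advantage: float) -> float:
    """Clipped PPO-style surrogate plus KL penalty for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - CLIP_RATIO, min(1.0 + CLIP_RATIO, ratio))
    # Pessimistic bound: take the smaller of the clipped and unclipped objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + KL_COEFF * k3_kl(logp_new, logp_ref)


# The policy raised this token's probability by 1.5x; the update is clipped at 1.2x.
loss = token_loss(
    logp_new=math.log(0.3),
    logp_old=math.log(0.2),
    logp_ref=math.log(0.3),
    advantage=1.0,
)
```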

## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iteration 0 is the baseline before any GRPO update. Iterations 1–5 completed in full; iteration 6 crashed during the GRPO step, before its eval, so only its rollout pass rate is recorded.

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
|    0 |            0.796 |            0.854 |            0.634 |            0.680 |               — |       — |
|    1 |            0.806 |            0.872 |            0.628 |            0.680 |           0.593 |  0.0003 |
|    2 |            0.801 |            0.860 |            0.642 |            0.690 |           0.620 |  0.0007 |
|    3 |            0.793 |            0.872 |            0.618 |            0.680 |           0.633 |  0.0013 |
| **4** |        **0.816** |        **0.872** |        **0.638** |        **0.710** |           0.649 |  0.0023 |
|    5 |            0.796 |            0.854 |            0.636 |            0.690 |           0.672 |  0.0024 |
|    6 |              n/a |              n/a |              n/a |              n/a |           0.696 |     n/a |
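
The card does not state how pass@k is computed from the 5 samples. The sketch below assumes the standard unbiased estimator popularized by the HumanEval benchmark, where `n` samples are drawn per problem and `c` of them pass; `pass_at_k` is an illustrative name. The benchmark score would then be the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples passes), given c of n passed."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=5 samples per problem, as in this card:
p1 = pass_at_k(n=5, c=2, k=1)  # reduces to c / n
p5 = pass_at_k(n=5, c=2, k=5)  # 1.0 whenever at least one sample passes
```

With `n = k = 5`, pass@5 is simply the fraction of problems solved by at least one of the five samples, and pass@1 is the average per-sample success rate.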

**Trajectory.** HumanEval pass@1 oscillates in a narrow band (0.793 to 0.816) and peaks at **0.816 at iteration 4** (+2.0 pp over baseline), the highest of any adapter in the study; pass@5 reaches 0.872 at iterations 1, 3, and 4, dipping to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). Crucially, GRPO KL stays at or below 2.4e-3 in all five completed iterations: exec-only GRPO shows **no policy drift on the full pool**, in sharp contrast to the regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)), whose KL climbed to ~0.096 on the same 6-iteration schedule. Mean tokens per GRPO sequence stay in the 179–186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.

## Limitations

The training run hit `torch.OutOfMemoryError` during iteration 6's GRPO backward pass. The GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No `final/` adapter was written. The weights published here are the **iteration-4 checkpoint**, selected because it is both the eval peak and a fully consistent post-iteration snapshot. The iteration-5 checkpoint (pass@1 0.796) is also available in the source repo for anyone who prefers a later checkpoint despite its lower score. Iteration 6 has no eval results because the crash preceded its evaluation.
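
The selection rule amounts to an argmax over the evaluated iterations. The sketch below uses the scores from the evaluation table in this card; the tie-break on held-out MBPP pass@5 and the function name are assumptions added for illustration.

```python
# (iteration, HumanEval pass@1, held-out MBPP pass@5), copied from the results table.
EVALS = [
    (0, 0.796, 0.680),
    (1, 0.806, 0.680),
    (2, 0.801, 0.690),
    (3, 0.793, 0.680),
    (4, 0.816, 0.710),
    (5, 0.796, 0.690),
]


def select_checkpoint(evals):
    """Pick the iteration with the best HumanEval pass@1.

    Ties are broken by held-out MBPP pass@5 (an assumed rule, not from the card).
    """
    best_row = max(evals, key=lambda row: (row[1], row[2]))
    return best_row[0]


best_iter = select_checkpoint(EVALS)
baseline_pass1 = EVALS[0][1]
best_pass1 = dict((it, p1) for it, p1, _ in EVALS)[best_iter]
gain_pp = round((best_pass1 - baseline_pass1) * 100, 1)  # percentage points over iteration 0
```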

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-Coder-3B-Instruct"

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
model.eval()  # inference only: disables LoRA dropout
tok = AutoTokenizer.from_pretrained(BASE_ID)

# Build a chat-formatted prompt that ends with the assistant turn marker.
messages = [
    {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
    {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p used for the evaluations in this card.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
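
To load the base model with the quantization settings listed under Training Setup (4-bit NF4, double quantization, bf16 compute) instead of full bf16, something like the following should work. This is a sketch that assumes `bitsandbytes` is installed and a CUDA GPU is available; the card does not say which precision was used for the reported evaluations, so scores under 4-bit loading may differ.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Mirror the training-time quantization described in this card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
model.eval()
```

Tokenization and generation then proceed exactly as in the bf16 example.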

## Comparison to Other Conditions

All five adapters share the same base model and seed. The original three (A, C-light, C-reg) used a 200-problem pool over 3 iterations; the two reruns (A-v2 and C-reg2) use the full 311-problem pool on a 6-iteration schedule. Each row reports that adapter's **published checkpoint**, which for A-v2 is iteration 4 rather than the end of the schedule.

| Condition | Pool / checkpoint iter | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---:|---:|---:|---:|
| **A-v2 (exec-only, full)** — this card                                          | 311 / it 4 | 0.00 | 0.02 | **0.816** | **0.710** |
| [A (exec-only)](https://huggingface.co/amarsaikhan/spark-code-A-3b)             | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b)   | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b)     | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |

A-v2 has the strongest HumanEval pass@1 in the study and the second-best held-out MBPP pass@5 (0.710, behind C-reg at 0.720).

## Findings Summary

- **The simplest method, scaled up, is still the strongest.** Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter — no auxiliary recycling required.
- **Exec-only does not drift on the longer schedule.** KL stays at or below 2.4e-3 across all five completed iterations. The matched-schedule regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)) drifted to KL ~0.096 and regressed on HumanEval over its six iterations, which points to the auxiliary objective, rather than the longer schedule, as what destabilizes the policy.
- **Published checkpoint is the iteration-4 peak.** The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."

## Related Artifacts

- Sibling adapters: [spark-code-A-3b](https://huggingface.co/amarsaikhan/spark-code-A-3b) · [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0–5) lives under `condition_A/eval/` in the repository

## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}
```

## License

The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.