---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---

# SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA

**QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint, which has the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).**

## TL;DR

`spark-code-A-3b-v2` is the scaled-up rerun of the exec-only GRPO baseline: same recipe as [`spark-code-A-3b`](https://huggingface.co/amarsaikhan/spark-code-A-3b) but on the full 311-problem MBPP training pool and a longer 6-iteration schedule (`kl_coeff=0.02`). HumanEval pass@1 peaks at **0.816 at iteration 4**, the best score across all five adapters in the study, while the KL to the frozen reference stays at or below 2.4e-3 throughout. **The run terminated at iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no `final/` adapter was auto-saved; the published weights are the iteration-4 checkpoint**, chosen as the peak of the eval trajectory.

## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. Per problem, sample a group of rollouts, score each by the fraction of unit tests it passes (with penalties for syntax errors, runtime errors, and timeouts), normalize rewards within the group, and apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
- **Training data:** MBPP-sanitized, **311 problems** (full pool), **6 iterations intended** (5 completed + eval; crash during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 + double quant, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.02` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.
- **Published checkpoint:** `condition_A/checkpoints/iter4` (the run crashed before a `final/` was written; see Limitations).

Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.

## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iterations 0–5 completed; iteration 6 crashed during the GRPO step before its eval.

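
The card does not say how pass@k is computed from the 5 samples per problem. One common choice is the unbiased estimator `1 - C(n-c, k) / C(n, k)`, where `n` is the number of samples and `c` the number that pass every test. The sketch below assumes that estimator. The function names and the example counts are hypothetical and are not taken from the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample; math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of passing samples (out of n=5) for four problems.
counts = [5, 3, 0, 1]
print(benchmark_pass_at_k(counts, n=5, k=1))
print(benchmark_pass_at_k(counts, n=5, k=5))
```

With `n=5`, this estimator reduces pass@1 to the mean fraction of passing samples, and pass@5 to the share of problems with at least one passing sample.
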

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | n/a | n/a |
| 1 | 0.806 | 0.872 | 0.628 | 0.680 | 0.593 | 0.0003 |
| 2 | 0.801 | 0.860 | 0.642 | 0.690 | 0.620 | 0.0007 |
| 3 | 0.793 | 0.872 | 0.618 | 0.680 | 0.633 | 0.0013 |
| **4** | **0.816** | **0.872** | **0.638** | **0.710** | 0.649 | 0.0023 |
| 5 | 0.796 | 0.854 | 0.636 | 0.690 | 0.672 | 0.0024 |
| 6 | n/a | n/a | n/a | n/a | 0.696 | n/a |

**Trajectory.** HumanEval pass@1 oscillates in a narrow band (0.793 to 0.816) and peaks at **0.816 at iteration 4** (+2.0 pp over the iteration-0 baseline), the highest of any adapter in the study. HumanEval pass@5 reaches 0.872 at iterations 1, 3, and 4, with a dip to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). Crucially, GRPO KL stays at or below 2.4e-3 for the entire run: exec-only GRPO shows **no meaningful policy drift across the completed iterations on the full pool**, in sharp contrast to the regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)), whose KL climbed to ~0.096 over the same schedule. Mean tokens per GRPO sequence stay in the 179–186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.

## Limitations

The training run hit `torch.OutOfMemoryError` during iteration 6's GRPO backward pass. The GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No `final/` adapter was written. The weights published here are the **iteration-4 checkpoint**, selected because it is both the eval peak and a fully consistent post-iteration snapshot. Iteration 5 (pass@1 0.796) is also available in the source repo if a more-trained but lower-scoring checkpoint is preferred. Iteration 6 has no eval (the crash preceded it).

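
The checkpoint choice described above amounts to taking the iteration with the highest HumanEval pass@1. A minimal sketch of that rule follows, using the pass@1 values from the results table. The function name and the tie-breaking rule (earliest iteration wins) are assumptions for illustration, not something the training script is known to implement.

```python
# HumanEval pass@1 by iteration, copied from the results table in this card.
humaneval_pass1 = {0: 0.796, 1: 0.806, 2: 0.801, 3: 0.793, 4: 0.816, 5: 0.796}


def select_peak_iteration(scores: dict[int, float]) -> int:
    """Return the iteration with the highest score; ties go to the earliest one."""
    # max() keeps the first maximal element, so iterating in ascending order
    # makes the earliest iteration win a tie (an assumed tie-break rule).
    return max(sorted(scores), key=lambda it: scores[it])


best = select_peak_iteration(humaneval_pass1)
print(f"iter{best}")  # prints "iter4"
```
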

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Render the chat messages into a single prompt string.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
        {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sampling settings match those used for the evaluation runs.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Comparison to Other Conditions

All five adapters share the same base model and seed. The original three used a 200-problem pool over 3 iterations; the two full-pool adapters (`A-v2` and `C-reg2`) use the 311-problem pool on a 6-iteration schedule. Each row reports that adapter's **published checkpoint**.


| Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---:|---:|---:|---:|
| **A-v2 (exec-only, full)** (this card) | 311 / it 4 | 0.00 | 0.02 | **0.816** | **0.710** |
| [A (exec-only)](https://huggingface.co/amarsaikhan/spark-code-A-3b) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |

A-v2 has the strongest HumanEval pass@1 in the study and the second-best held-out MBPP pass@5 (0.710, behind C-reg at 0.720).

## Findings Summary

- **The simplest method, scaled up, is still the strongest.** Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter, with no auxiliary recycling required.
- **Exec-only does not drift, even on the six-iteration schedule.** KL stays at or below 2.4e-3 throughout. The matched-schedule regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)) drifted to KL ~0.096 and regressed on HumanEval over the same six iterations, which is direct evidence that the auxiliary objective, not the longer schedule, is what destabilizes the policy.
- **Published checkpoint is the iteration-4 peak.** The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."

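
For readers new to the exec-only signal these findings refer to, the reward and within-group normalization listed under Training Setup can be sketched roughly as below. The penalty constants are the ones quoted in this card. Everything else is an assumption for illustration: the `status` labels, how a penalty replaces the partial score, the use of population standard deviation, and the `eps` value. The actual logic lives in the training script and may differ.

```python
from statistics import mean, pstdev

# Penalty values quoted in the Training Setup section of this card.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0


def rollout_reward(status: str, passed: int, total: int) -> float:
    """Score one rollout; `status` is a hypothetical label from a test runner."""
    # Assumption: a failure to run replaces the partial score with its penalty.
    if status == "syntax_error":
        return SYNTAX_PENALTY
    if status == "timeout":
        return TIMEOUT_PENALTY
    if status == "runtime_error":
        return RUNTIME_PENALTY
    # The code ran: partial credit is the fraction of unit tests passed.
    return max(WRONG_ANSWER_FLOOR, passed / total)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one problem's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of K=4 rollouts on a problem with 3 unit tests.
group = [("ok", 3, 3), ("ok", 1, 3), ("syntax_error", 0, 3), ("timeout", 0, 3)]
rewards = [rollout_reward(*g) for g in group]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Under this normalization, a group in which every rollout earns the same reward yields all-zero advantages and so contributes no policy-gradient signal for that problem.
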

## Related Artifacts

- Sibling adapters: [spark-code-A-3b](https://huggingface.co/amarsaikhan/spark-code-A-3b) · [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0–5) lives under `condition_A/eval/` in the repository
- Interactive demo Space: [SPACES_URL]

## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}
```

## License

The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Tongyi Qianwen LICENSE](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.