	---
	base_model: Qwen/Qwen2.5-Coder-3B-Instruct
	library_name: peft
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- code-generation
	- grpo
	- lora
	- qlora
	- spark
	- co-evolution
	- python
	datasets:
	- google-research-datasets/mbpp
	- openai/openai_humaneval
	---

	# SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA

	QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint — the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).

	## TL;DR

`spark-code-A-3b-v2` is the scaled-up rerun of the exec-only GRPO baseline: same recipe as [`spark-code-A-3b`](https://huggingface.co/amarsaikhan/spark-code-A-3b) but on the full 311-problem MBPP training pool and a longer 6-iteration schedule (`kl_coeff=0.02`). HumanEval pass@1 peaks at 0.816 at iteration 4, the best score across all five adapters in the study, with the KL to the frozen reference staying at or below 2.4e-3 at every measured iteration. The run terminated during iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no `final/` adapter was auto-saved; the published weights are the iteration-4 checkpoint, chosen as the peak of the eval trajectory.

	## Training Setup

	- Base model: `Qwen/Qwen2.5-Coder-3B-Instruct`
	- Method: Execution-grounded GRPO. Per problem, sample a group of rollouts, score each by the fraction of unit tests it passes (penalties for syntax/runtime/timeout), normalize rewards within the group, apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
	- Training data: MBPP-sanitized, 311 problems (full pool), 6 iterations intended (5 completed + eval; crash during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
	- LoRA: `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
	- Quantization: 4-bit NF4 + double quant, bf16 compute.
	- Optimizer: AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
	- KL regularization: `kl_coeff=0.02` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
	- Auxiliary objective: none (this is Condition A).
	- Seed: 42.
	- Published checkpoint: `condition_A/checkpoints/iter4` (the run crashed before a `final/` was written; see Limitations).
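
The reward shaping and within-group normalization listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names (`rollout_reward`, `group_advantages`), the outcome labels, and the rule that a failure status replaces partial credit with a flat penalty are assumptions. Only the penalty values and `wrong_answer_floor` come from the settings above.

```python
from statistics import mean, pstdev

# Penalty values come from the training setup; how they combine with
# partial credit is an assumption made for this sketch.
PENALTIES = {"syntax_error": -0.2, "runtime_error": -0.1, "timeout": -0.3}
WRONG_ANSWER_FLOOR = 0.0


def rollout_reward(status: str, tests_passed: int, tests_total: int) -> float:
    """Score one rollout: a flat penalty on failure, else the fraction of tests passed."""
    if status in PENALTIES:
        return PENALTIES[status]
    fraction = tests_passed / tests_total if tests_total else 0.0
    return max(fraction, WRONG_ANSWER_FLOOR)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one problem's rollout group (zero mean, unit std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of K=4 rollouts for a problem with 3 unit tests.
group = [
    rollout_reward("ok", 3, 3),
    rollout_reward("ok", 1, 3),
    rollout_reward("syntax_error", 0, 3),
    rollout_reward("timeout", 0, 3),
]
advantages = group_advantages(group)
```

A group in which every rollout earns the same reward yields all-zero advantages and therefore no gradient signal. That may be one motivation for expanding from 4 to up to 8 rollouts, although the expansion rule itself is not described here.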

	Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
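
For readers who want to mirror the adapter and quantization settings, the LoRA and 4-bit values listed above map onto `peft` and `transformers` configuration objects roughly as below. This is a sketch under assumptions: the variable names and `task_type="CAUSAL_LM"` are not taken from the training script, which may construct these objects differently.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 with double quantization and bf16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA hyperparameters from the list above; task_type is an assumption.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

`bnb_config` would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to `peft.get_peft_model`.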

	## Evaluation Results

	HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iterations 0–5 completed; iteration 6 crashed during the GRPO step before its eval.

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | n/a | n/a |
| 1 | 0.806 | 0.872 | 0.628 | 0.680 | 0.593 | 0.0003 |
| 2 | 0.801 | 0.860 | 0.642 | 0.690 | 0.620 | 0.0007 |
| 3 | 0.793 | 0.872 | 0.618 | 0.680 | 0.633 | 0.0013 |
| 4 | 0.816 | 0.872 | 0.638 | 0.710 | 0.649 | 0.0023 |
| 5 | 0.796 | 0.854 | 0.636 | 0.690 | 0.672 | 0.0024 |
| 6 | n/a | n/a | n/a | n/a | 0.696 | n/a |
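
For reference, pass@k from n samples per problem is commonly computed with the unbiased estimator introduced alongside HumanEval. This card does not state which estimator the evaluation script uses, so the sketch below is an assumption; `pass_at_k`, `benchmark_pass_at_k`, and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts for four problems, n=5 samples each.
counts = [5, 2, 0, 1]
p1 = benchmark_pass_at_k(counts, n=5, k=1)
p5 = benchmark_pass_at_k(counts, n=5, k=5)
```

With n=5, pass@1 reduces to the mean fraction of correct samples per problem, and pass@5 to the fraction of problems solved by at least one of the five samples.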

Trajectory. HumanEval pass@1 oscillates in a narrow band (0.793 to 0.816) and peaks at 0.816 at iteration 4 (+2.0 pp over the iteration-0 baseline), the highest of any adapter in the study. HumanEval pass@5 reaches 0.872 at iterations 1, 3, and 4, dipping to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). GRPO KL stays at or below 2.4e-3 at every iteration with a recorded value (1–5): exec-only GRPO shows negligible policy drift on the full pool, in sharp contrast to the regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)), whose KL climbed to ~0.096 over the same schedule. Mean tokens per GRPO sequence stay in the 179–186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.
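
The per-token KL discussed above can be illustrated with a small estimator. This sketch assumes that the "k=3 estimator" named in the training setup is the low-variance `k3` form, `(r - 1) - log r` with `r = p_ref / p_policy` evaluated on tokens sampled from the policy. The function names and the example log-probabilities are hypothetical.

```python
import math


def k3_kl(policy_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Mean per-token k3 estimate of KL(policy || reference) on sampled tokens."""
    total = 0.0
    for lp, lp_ref in zip(policy_logprobs, ref_logprobs):
        log_ratio = lp_ref - lp  # log r, with r = p_ref / p_policy
        total += math.exp(log_ratio) - 1.0 - log_ratio  # (r - 1) - log r, never negative
    return total / len(policy_logprobs)


def kl_penalty(policy_logprobs: list[float], ref_logprobs: list[float],
               kl_coeff: float = 0.02) -> float:
    """KL term as it might enter the GRPO loss (kl_coeff value from this card)."""
    return kl_coeff * k3_kl(policy_logprobs, ref_logprobs)


# Hypothetical log-probs for a 4-token completion.
policy_lp = [-0.10, -1.20, -0.05, -2.30]
ref_lp = [-0.12, -1.15, -0.05, -2.40]
kl = k3_kl(policy_lp, ref_lp)
```

Each per-token term is non-negative and is exactly zero when the policy and reference assign the same log-probability, so a mean in the 1e-3 range corresponds to a policy that has moved very little from the reference.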

	## Limitations

The training run hit `torch.OutOfMemoryError` during iteration 6's GRPO backward pass. The GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No `final/` adapter was written. The weights published here are the iteration-4 checkpoint, selected because it is both the eval peak and a fully consistent post-iteration snapshot. Iteration 5 (pass@1 0.796) is also available in the source repo if a more-trained-but-lower checkpoint is preferred. Iteration 6 has no eval (the crash preceded it).

	## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat-formatted prompt; add_generation_prompt opens the assistant turn.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
        {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p used for evaluation.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

	## Comparison to Other Conditions

All five adapters share the same base model and seed. The original three (A, C-reg, C-light) used a 200-problem pool over 3 iterations; the two full-pool adapters (A-v2 and C-reg2) use the 311-problem pool with a 6-iteration schedule. Each row reports that adapter's published checkpoint.

| Condition | Pool size / checkpoint iter | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---:|---:|---:|---:|
| A-v2 (exec-only, full; this card) | 311 / iter 4 | 0.00 | 0.02 | 0.816 | 0.710 |
| [A (exec-only)](https://huggingface.co/amarsaikhan/spark-code-A-3b) | 200 / iter 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / iter 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / iter 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / iter 6 | 0.02 | 0.03 | 0.774 | 0.680 |

A-v2 has the strongest HumanEval pass@1 in the study and the second-best held-out MBPP pass@5 (0.710, behind C-reg at 0.720).

	## Findings Summary

	- The simplest method, scaled up, is still the strongest. Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter — no auxiliary recycling required.
- Exec-only does not drift over the full-pool schedule. KL stays at or below 2.4e-3 at every measured iteration. The matched-schedule regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)) drifted to KL ~0.096 and regressed on HumanEval over the same schedule, which points to the auxiliary objective, not the longer schedule, as what destabilizes the policy (single-seed comparison).
	- Published checkpoint is the iteration-4 peak. The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."

	## Related Artifacts

	- Sibling adapters: [spark-code-A-3b](https://huggingface.co/amarsaikhan/spark-code-A-3b) · [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
	- GitHub repository: https://github.com/amarsaikhanb/spark-code
	- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0–5) lives under `condition_A/eval/` in the repository
	- Interactive demo Space: [SPACES_URL]

	## Citation

	```bibtex
	@misc{batjargal2026sparkcode,
	title = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
	author = {Amarsaikhan Batjargal},
	year = {2026},
	}
	```

	## License

	The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Tongyi Qianwen LICENSE](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.