---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---

# SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA

**QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint, which has the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).**

## TL;DR

`spark-code-A-3b-v2` is the scaled-up rerun of the exec-only GRPO baseline: the same recipe as [`spark-code-A-3b`](https://huggingface.co/amarsaikhan/spark-code-A-3b), but on the full 311-problem MBPP training pool, over a longer 6-iteration schedule, and with `kl_coeff=0.02` (the original run used 0.01). HumanEval pass@1 peaks at **0.816 at iteration 4**, the best score across all five adapters in the study, while the KL to the frozen reference never exceeds about 2.4e-3. **The run terminated at iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no `final/` adapter was auto-saved; the published weights are the iteration-4 checkpoint**, chosen as the peak of the eval trajectory.

## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. For each problem, sample a group of rollouts, score each by the fraction of unit tests it passes (with penalties for syntax errors, runtime errors, and timeouts), normalize rewards within the group, and apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
- **Training data:** MBPP-sanitized, **311 problems** (full pool), **6 iterations intended** (5 completed with eval; the run crashed during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 + double quant, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.02` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.
- **Published checkpoint:** `condition_A/checkpoints/iter4` (the run crashed before a `final/` was written; see Limitations).
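
The reward and group-normalization steps listed above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names, the `status` strings, and the choice of population standard deviation are assumptions, and only the penalty values are taken from this card.

```python
from statistics import mean, pstdev

# Penalty values copied from the Training Setup list in this card.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0


def execution_reward(status: str, passed: int, total: int) -> float:
    """Hypothetical reward: a fixed penalty for a failed execution,
    otherwise the fraction of unit tests passed."""
    if status == "syntax_error":
        return SYNTAX_PENALTY
    if status == "runtime_error":
        return RUNTIME_PENALTY
    if status == "timeout":
        return TIMEOUT_PENALTY
    # status == "ok": the code ran; score by partial per-test credit.
    fraction = passed / total if total else 0.0
    return max(WRONG_ANSWER_FLOOR, fraction)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one problem's rollout group
    (zero mean, roughly unit standard deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of K=4 rollouts for a problem with three unit tests.
rollouts = [("ok", 3, 3), ("ok", 1, 3), ("syntax_error", 0, 3), ("timeout", 0, 3)]
rewards = [execution_reward(*r) for r in rollouts]
advantages = group_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

When every rollout in a group receives the same reward, the advantages are all zero and that group contributes no gradient signal. That may be one motivation for drawing additional rollouts adaptively, although the card does not describe the adaptive rule.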

Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
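
For readers unfamiliar with the clipped update and the KL term in the setup above, here is a per-token sketch. It assumes that "k=3 estimator" refers to the commonly used k3 estimator of KL divergence, `exp(x) - 1 - x` with `x = logp_ref - logp_policy`. The function names are hypothetical, and the actual training script may combine and average these terms differently.

```python
import math


def k3_kl(logp_policy: float, logp_ref: float) -> float:
    """k3 estimate of KL(policy || ref) for one token sampled from the policy.
    Always non-negative, and zero when the two log-probs agree."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - 1.0 - log_ratio


def token_loss(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_ratio: float = 0.2,   # value from this card
    kl_coeff: float = 0.02,    # value from this card
) -> float:
    """Clipped PPO-style surrogate plus a KL penalty, for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    # Pessimistic choice between the clipped and unclipped objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + kl_coeff * k3_kl(logp_new, logp_ref)


# The ratio exp(0.5) exceeds 1 + clip_ratio, so a positive advantage is clipped.
print(token_loss(logp_new=-0.5, logp_old=-1.0, logp_ref=-0.5, advantage=1.0))
```

Averaging `k3_kl` over the tokens of the training rollouts would give a quantity comparable to the "GRPO KL" column reported in the evaluation table, under the same assumption about the estimator.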

## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iterations 0-5 completed, with iteration 0 serving as the baseline; iteration 6 crashed during the GRPO step, before its eval.

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | n/a | n/a |
| 1 | 0.806 | 0.872 | 0.628 | 0.680 | 0.593 | 0.0003 |
| 2 | 0.801 | 0.860 | 0.642 | 0.690 | 0.620 | 0.0007 |
| 3 | 0.793 | 0.872 | 0.618 | 0.680 | 0.633 | 0.0013 |
| **4** | **0.816** | **0.872** | **0.638** | **0.710** | 0.649 | 0.0023 |
| 5 | 0.796 | 0.854 | 0.636 | 0.690 | 0.672 | 0.0024 |
| 6 | n/a | n/a | n/a | n/a | 0.696 | n/a |

**Trajectory.** HumanEval pass@1 oscillates in a narrow band and peaks at **0.816 at iteration 4** (+2.0 pp over baseline), the highest of any adapter in the study; pass@5 reaches 0.872 at iterations 1, 3, and 4, dipping to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). Crucially, GRPO KL never exceeds about 2.4e-3 during the run: exec-only GRPO shows **no policy drift even over six iterations on the full pool**, in sharp contrast to the regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)), whose KL climbed to ~0.096 over the same schedule. Mean tokens per GRPO sequence stay in the 179-186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.
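
The card does not say which pass@k estimator produced the figures in the table. A minimal sketch of the standard unbiased estimator introduced with HumanEval, assuming that is the one used, looks like this (the function names are illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy benchmark of four problems, 5 samples each (as in this card's eval setup).
counts = [5, 0, 2, 1]
print(benchmark_pass_at_k(counts, n=5, k=1))
print(benchmark_pass_at_k(counts, n=5, k=5))
```

With 5 samples per problem, pass@1 reduces to the mean fraction of correct samples and pass@5 to the fraction of problems with at least one correct sample.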

## Limitations

The training run hit `torch.OutOfMemoryError` during iteration 6's GRPO backward pass. The GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No `final/` adapter was written. The weights published here are the **iteration-4 checkpoint**, selected because it is both the eval peak and a fully consistent post-iteration snapshot. Iteration 5 (pass@1 0.796) is also available in the source repo if a more-trained-but-lower checkpoint is preferred. Iteration 6 has no eval (the crash preceded it).

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat-formatted prompt that ends with the assistant turn marker.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
        {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sampling settings match the evaluation setup (temperature=0.2, top_p=0.95).
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Comparison to Other Conditions

All five adapters share the same base model and seed. The original three used a 200-problem pool over 3 iterations; the two full-pool adapters (A-v2 and C-reg2) use the 311-problem pool over a 6-iteration schedule. Each row reports that adapter's **published checkpoint**.

| Condition | Pool / checkpoint iter | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---:|---:|---:|---:|
| **A-v2 (exec-only, full)** (this card) | 311 / it 4 | 0.00 | 0.02 | **0.816** | **0.710** |
| [A (exec-only)](https://huggingface.co/amarsaikhan/spark-code-A-3b) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| [C-reg (regularized)](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| [C-light (naive)](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| [C-reg2 (regularized, full)](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |

A-v2 has the strongest HumanEval pass@1 in the study and the second-best held-out MBPP pass@5 (0.710, behind C-reg at 0.720).

## Findings Summary

- **The simplest method, scaled up, is still the strongest.** Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter, with no auxiliary recycling required.
- **Exec-only does not drift, even over six iterations.** KL never exceeds about 2.4e-3. The matched-schedule regularized co-evolve run ([C-reg2](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)) drifted to KL ~0.096 and regressed on HumanEval over the same six iterations, which is direct evidence that the auxiliary objective, not the longer schedule, is what destabilizes the policy.
- **Published checkpoint is the iteration-4 peak.** The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."

## Related Artifacts

- Sibling adapters: [spark-code-A-3b](https://huggingface.co/amarsaikhan/spark-code-A-3b) · [spark-code-C-light-3b](https://huggingface.co/amarsaikhan/spark-code-C-light-3b) · [spark-code-C-reg-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg-3b) · [spark-code-C-reg2-3b](https://huggingface.co/amarsaikhan/spark-code-C-reg2-3b)
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0-5) lives under `condition_A/eval/` in the repository
- Interactive demo Space: [SPACES_URL]

## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}
```

## License

The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Qwen Research License](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.