# CodeK v3: Qwen2.5-Coder-7B LoRA
A LoRA adapter fine-tuned on CodeK, a synthetic dataset of Python programming tasks written in the style of Andrej Karpathy's open-source code. The model is trained to reason carefully about code: explaining implementations, diagnosing bugs, contrasting correct vs. incorrect versions, and generating multi-hypothesis debugging chains.
Best checkpoint: checkpoint-800 (eval loss: 0.5888)
## Model Details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Adapter type | LoRA (rank 16, alpha 32, RSLoRA) |
| Target modules | q/k/v/o proj, gate/up/down proj |
| Training loss | response tokens only (prompt tokens masked) |
| Best checkpoint | checkpoint-800 |
| Eval loss | 0.5888 |
| Training hardware | NVIDIA A100 80GB SXM4 |
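
For reference, the adapter settings in the table above correspond roughly to the following PEFT `LoraConfig`. This is a sketch rather than the exact training configuration; `lora_dropout`, `bias`, and `task_type` are assumptions not stated in this card.

```python
from peft import LoraConfig

# Approximate adapter configuration implied by the table above.
# lora_dropout, bias, and task_type are assumptions, not values from this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_rslora=True,  # rank-stabilized LoRA: scales updates by alpha / sqrt(r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```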
## Training Data
The CodeK v3 dataset combines v2 (398 seeds) and v3 (161 seeds) augmentation pipelines for a total of 559 unique Python tasks across 9 categories:
- Data structures, algorithms, graphs, dynamic programming
- Numerical methods, parsing, concurrency, bit manipulation, compression
Each seed is augmented across up to 5 passes:
| Pass | Type | Description |
|---|---|---|
| Pass 1 | Reasoning | Step-by-step explanation of the correct implementation |
| Pass 2 | Debugging | Single-line surgical bug + model diagnosis (via Codex, 100% coverage) |
| Pass 3 | Contrast | Correct vs. incorrect comparison with explanation |
| Pass 4 | Research loop | Multi-turn investigation of the implementation |
| Pass 5 | Multi-hypothesis | Competing bug hypotheses, ranked by plausibility |
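
To make the pass types concrete, here is a hypothetical Pass 2/Pass 3-style contrast pair in the spirit of the Usage example below; it is illustrative only and not an actual dataset sample:

```python
# Correct implementation: the search interval shrinks on every iteration.
def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Single-line surgical bug (Pass 2 style): "lo = mid" instead of "lo = mid + 1"
# can leave the interval unchanged, so the loop may never terminate.
def binary_search_buggy(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid  # the injected bug
        else:
            hi = mid - 1
    return -1
```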
Training split: 6,757 pairs (504 seed-level train tasks). Validation split: 728 pairs (55 seed-level held-out tasks, zero task overlap with train).
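
A minimal sketch of how such a seed-level split can be built (the `seed_id` field name is an assumption about the data schema, not the dataset's actual format):

```python
import random

def seed_level_split(pairs, val_fraction=0.1, rng_seed=0):
    """Split augmented pairs so every pair from a given seed task lands
    entirely in train or entirely in validation (zero task overlap)."""
    seed_ids = sorted({p["seed_id"] for p in pairs})
    random.Random(rng_seed).shuffle(seed_ids)
    n_val = max(1, int(len(seed_ids) * val_fraction))
    val_ids = set(seed_ids[:n_val])
    train = [p for p in pairs if p["seed_id"] not in val_ids]
    val = [p for p in pairs if p["seed_id"] in val_ids]
    return train, val
```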
## Key improvements over the v2 model
- Seed-level val split: the validation set has no task overlap with training, so eval loss is meaningful
- Response-only loss: prompt tokens are masked, so the model is trained only on assistant responses (see the sketch after this list)
- Pass 5: multi-hypothesis bug-reasoning signal (new in v3)
- Pass 2 via Codex: 100% Pass 2 coverage with sharper `change_token` annotations
- `change_token` field: targets the `change_hit` failure mode from the v1/v2 evals
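
A minimal sketch of the response-only masking mechanism, assuming a plain prompt/response pair (the actual training code is not shown in this card). Positions labeled -100 are ignored by the Hugging Face cross-entropy loss, so only assistant tokens contribute to the gradient:

```python
def build_masked_example(tokenizer, prompt_text, response_text):
    # Tokenize the prompt alone and the full prompt + response sequence.
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(prompt_text + response_text, add_special_tokens=False)["input_ids"]

    # Copy the inputs as labels, then mask every prompt position with -100
    # so the loss is computed on response tokens only.
    labels = list(full_ids)
    labels[: len(prompt_ids)] = [-100] * len(prompt_ids)
    return {"input_ids": full_ids, "labels": labels}
```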
## Evaluation
Ground-truth Pass 2 eval on 50 held-out v1 seeds (same seeds used across all versions for apples-to-apples comparison). A prediction passes if it correctly identifies both the function containing the bug and the nature of the change.
| Version | Dataset | LoRA Pass@1 | Base Pass@1 |
|---|---|---|---|
| v0 | 201 seeds, 4 passes | 58% | 64% |
| v1 | 398 seeds, 4 passes | 60% | 62% |
| v3 | 559 seeds, 5 passes | pending | pending |
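
The pass criterion can be read as a two-part check over the eval metadata. The sketch below is illustrative only; the `buggy_function` and `change_token` arguments are assumptions about how the ground truth is stored, not a published grader:

```python
def pass_at_1(prediction: str, buggy_function: str, change_token: str) -> bool:
    """Illustrative check: the answer must name the function containing the
    bug (function hit) and mention the changed token (change hit)."""
    text = prediction.lower()
    function_hit = buggy_function.lower() in text
    change_hit = change_token.lower() in text
    return function_hit and change_hit
```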
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter = "mechramc/codek-qwen2.5-coder-7b-lora-v3"

# Load the base model in bf16 and attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

messages = [
    {"role": "system", "content": "You are a Python debugging expert. When shown code with a bug, identify the exact location and nature of the bug. Be precise and concise."},
    {"role": "user", "content": "The following Python code has a subtle bug. Find it.\n\n```python\ndef binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n    while lo <= hi:\n        mid = (lo + hi) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            lo = mid\n        else:\n            hi = mid - 1\n    return -1\n```"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
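
If you prefer adapter-free inference, the LoRA weights can optionally be merged into the base model with the standard PEFT API (the output directory name below is just an example):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("codek-qwen2.5-coder-7b-v3-merged")
tokenizer.save_pretrained("codek-qwen2.5-coder-7b-v3-merged")
```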
## Framework Versions
- PEFT: 0.18.1
- TRL: 0.24.0
- Transformers: 5.5.0
- PyTorch: 2.6.0
- Unsloth: 2026.4.1
- CUDA: 12.4