# codeq-qwen2.5-coder-7b-dpo-r2
LoRA adapter for Qwen/Qwen2.5-Coder-7B-Instruct, trained with DPO on
self-generated debugging preference pairs (Round 2 of the CodeQ iterative
DPO pipeline).
## Architecture

- Base model: Qwen/Qwen2.5-Coder-7B-Instruct
- Adapter type: LoRA (PEFT)
- LoRA rank (r): 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Task type: CAUSAL_LM
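
For reference, this corresponds to a PEFT `LoraConfig` along the lines of the sketch below, reconstructed from the hyperparameters listed above (the exact training script is not published here):

```python
from peft import LoraConfig, TaskType

# LoRA configuration matching the hyperparameters in the list above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)
```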
## Training Details
- Objective: Direct Preference Optimization (DPO)
- DPO beta: 0.1
- Precision: fp32
- Learning rate: 2e-6
- Epochs: 1
- Round: 2 (initialized from Round 1 adapter; Round 2 pairs resampled with the Round 1 policy as reference)
- Preference data: filtered DebugBench trajectories collected via MCTS rollouts; see tathadn/codeq-debugbench-dpo-pairs
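
A minimal TRL sketch of this setup is shown below. It assumes the preference dataset exposes the standard `prompt`/`chosen`/`rejected` columns and uses a placeholder path for the Round 1 adapter; neither detail is confirmed by this card:

```python
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)

# Round 2 is initialized from the Round 1 adapter (path is a placeholder).
model = PeftModel.from_pretrained(model, "path/to/round1-adapter", is_trainable=True)

dataset = load_dataset("tathadn/codeq-debugbench-dpo-pairs", split="train")

args = DPOConfig(
    output_dir="codeq-dpo-r2",
    beta=0.1,
    learning_rate=2e-6,
    num_train_epochs=1,
)

# With a PEFT model, ref_model is left as None and TRL computes reference
# log-probs with the adapter disabled. (Per the card, the actual pipeline
# resampled Round 2 pairs with the Round 1 policy as reference.)
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```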
## Results (DebugBench)
| Setting | Accuracy |
|---|---|
| MCTS (search at inference) | 92.0% (46/50) |
| Single-pass full rewrite | 55.6% (40/72) |
The large gap between MCTS and single-pass accuracy reflects the benefit of inference-time search: the policy proposes candidate fixes that are verified and refined across a search tree, rather than committed to in one shot.
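
As a rough illustration of the verify step that search provides (this is a simple sample-and-verify loop, not the CodeQ MCTS implementation; `generate_fix` and `run_tests` are hypothetical helpers):

```python
def best_of_n_fix(generate_fix, run_tests, buggy_code, n=8):
    """Sample up to n candidate fixes and return the first that passes tests.

    generate_fix(code) -> one sampled model completion (a candidate fix)
    run_tests(code)    -> True if all unit tests pass

    The real pipeline searches a tree of partial fixes with MCTS; this
    best-of-n loop only illustrates why verification beats one-shot output.
    """
    for _ in range(n):
        candidate = generate_fix(buggy_code)
        if run_tests(candidate):
            return candidate
    return None  # no verified fix found within the budget
```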
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "tathadn/codeq-qwen2.5-coder-7b-dpo-r2"

# Load the base model, then attach the DPO-trained LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert Python debugger."},
    {"role": "user", "content": "Fix the following buggy function...\n\n```python\n...\n```"},
]

# Apply the Qwen chat template and generate greedily.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
To merge the adapter into the base weights:
```python
# Merging folds the LoRA weights into the base model, so PEFT is no longer
# needed at load time.
merged = model.merge_and_unload()
merged.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")
```
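
The merged checkpoint then loads like any standalone model. Saving the tokenizer alongside it, as sketched here, is an assumption for convenience rather than a step from the original card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Save the tokenizer next to the merged weights so the directory is self-contained.
tokenizer.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")

# The merged directory now loads without PEFT.
model = AutoModelForCausalLM.from_pretrained(
    "codeq-qwen2.5-coder-7b-dpo-r2-merged", torch_dtype="auto", device_map="auto"
)
```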
## Intended Use

Research on iterative preference optimization for code debugging, and as a stronger single-pass or MCTS-driven policy than the base Qwen2.5-Coder-7B-Instruct model on Python bug-fixing tasks.
## Limitations

- Trained and evaluated primarily on DebugBench-style Python bugs; generalization to other languages or bug distributions is not verified.
- Single-pass accuracy is substantially below MCTS accuracy; for best results, pair the policy with a verifier or search loop at inference time.
## Framework versions
- PEFT 0.18.1