codeq-qwen2.5-coder-7b-dpo-r2

LoRA adapter for Qwen/Qwen2.5-Coder-7B-Instruct, trained with DPO on self-generated debugging preference pairs (Round 2 of the CodeQ iterative DPO pipeline).

Architecture

  • Base model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Adapter type: LoRA (PEFT)
  • LoRA rank (r): 32
  • LoRA alpha: 64
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Task type: CAUSAL_LM
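
For reference, this configuration corresponds to the following PEFT LoraConfig (a minimal sketch built only from the values listed above; the variable name is illustrative):

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,                 # LoRA rank from the card
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type=TaskType.CAUSAL_LM,
)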

Training Details

  • Objective: Direct Preference Optimization (DPO)
  • DPO beta: 0.1
  • Precision: fp32
  • Learning rate: 2e-6
  • Epochs: 1
  • Round: 2 (initialized from Round 1 adapter; Round 2 pairs resampled with the Round 1 policy as reference)
  • Preference data: filtered DebugBench trajectories collected via MCTS rollouts; see tathadn/codeq-debugbench-dpo-pairs
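
For orientation, a minimal sketch of this setup with TRL's DPOTrainer follows. The actual training script is not part of this card; the dataset column layout (prompt/chosen/rejected), the TRL version, and how the Round 1 reference policy is wired in are assumptions, and `policy` / `tokenizer` are assumed to be loaded beforehand:

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# preference pairs named on the card
pairs = load_dataset("tathadn/codeq-debugbench-dpo-pairs", split="train")

config = DPOConfig(
    beta=0.1,             # DPO beta from the card
    learning_rate=2e-6,   # learning rate from the card
    num_train_epochs=1,   # epochs from the card
    output_dir="codeq-qwen2.5-coder-7b-dpo-r2",
)

# `policy` is assumed to be the base model with the Round 1 adapter attached
trainer = DPOTrainer(model=policy, args=config,
                     train_dataset=pairs, processing_class=tokenizer)
trainer.train()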

Results (DebugBench)

Setting                       Accuracy
MCTS (search at inference)    92.0% (46/50)
Single-pass full rewrite      55.6% (40/72)

The large gap between MCTS and single-pass accuracy reflects the benefit of inference-time search: the policy proposes candidate fixes that are verified and refined across a search tree, rather than committed to in one shot. (Note that the two settings were evaluated on different problem counts, as the raw tallies above show.)
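
As an illustration only (not the CodeQ implementation), the verify-and-refine idea looks roughly like the greedy beam-style sketch below, where propose_fix and pass_rate are hypothetical helpers that sample a candidate rewrite and score it against unit tests:

def search_fix(buggy_code, tests, width=4, depth=3):
    # greedy beam-style stand-in for the MCTS rollout described above
    frontier = [buggy_code]
    best = buggy_code
    for _ in range(depth):
        # propose several candidate rewrites per frontier node
        candidates = [propose_fix(c) for c in frontier for _ in range(width)]
        # rank candidates by the fraction of tests they pass
        candidates.sort(key=lambda c: pass_rate(c, tests), reverse=True)
        best = candidates[0]
        if pass_rate(best, tests) == 1.0:
            return best                    # fully verified fix: stop early
        frontier = candidates[:width]      # refine only the most promising
    return best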

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "tathadn/codeq-qwen2.5-coder-7b-dpo-r2"

tokenizer = AutoTokenizer.from_pretrained(BASE)
# load the base model, then attach the DPO LoRA adapter on top
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert Python debugger."},
    {"role": "user", "content": "Fix the following buggy function...\n\n```python\n...\n```"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

To merge the adapter into the base weights:

# fold the LoRA weights into the base model and drop the PEFT wrapper
merged = model.merge_and_unload()
merged.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")
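
Saving the tokenizer alongside the merged weights (a small convenience not shown on the original card) makes the directory loadable as a standalone checkpoint:

tokenizer.save_pretrained("codeq-qwen2.5-coder-7b-dpo-r2-merged")

# the merged checkpoint then loads without PEFT
merged_model = AutoModelForCausalLM.from_pretrained(
    "codeq-qwen2.5-coder-7b-dpo-r2-merged", torch_dtype="auto", device_map="auto"
)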

Intended Use

Research on iterative preference optimization for code debugging, and as a stronger single-pass or MCTS-driven policy over the base Qwen2.5-Coder-7B-Instruct model on Python bug-fixing tasks.

Limitations

  • Trained and evaluated primarily on DebugBench-style Python bugs; generalization to other languages or bug distributions is not verified.
  • Single-pass accuracy is substantially below MCTS accuracy — for best results, pair the policy with a verifier / search loop at inference time.

Framework versions

  • PEFT 0.18.1