🏛️ Qwen2.5-Coder-7B-Mythos

A coding model that targets Claude Code-level capability, built by fine-tuning Qwen2.5-Coder-7B-Instruct with training recipes drawn from published research.

🎯 Training Recipe

Based on an exhaustive literature review of SOTA code LLM papers:

Component	Details
Base Model	Qwen/Qwen2.5-Coder-7B-Instruct (88.4% HumanEval baseline)
Method	QLoRA (4-bit NF4 + LoRA r=64, all-linear layers)
Optimizer	Paged AdamW 8-bit, LR=2e-4, cosine schedule
Context	4096 tokens with packing
Epochs	2
Effective Batch	16 (1 × 16 grad accum)
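
The table above can be written down as configuration objects roughly as follows. This is a minimal sketch, not the project's train.py: `lora_alpha`, `lora_dropout`, the float16 compute dtype, and double quantization are assumptions that the table does not specify, and the keyword names assume recent `transformers` and `peft` releases.

```python
# Hyperparameters taken from the recipe table; entries marked "assumption"
# are illustrative choices that the table does not state.
RECIPE = {
    "lora_r": 64,
    "lora_alpha": 128,            # assumption: alpha = 2 x r
    "lora_dropout": 0.05,         # assumption
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_8bit",
    "max_length": 4096,
    "packing": True,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
}


def effective_batch_size(recipe, num_gpus=1):
    """Sequences consumed per optimizer step."""
    return (recipe["per_device_train_batch_size"]
            * recipe["gradient_accumulation_steps"] * num_gpus)


def build_qlora_configs(recipe):
    """Build the 4-bit quantization and LoRA configs.

    Imports are lazy so the hyperparameters above can be inspected
    without the GPU libraries installed.
    """
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",   # assumption; bfloat16 suits Ampere or newer
        bnb_4bit_use_double_quant=True,     # assumption
    )
    lora_config = LoraConfig(
        r=recipe["lora_r"],
        lora_alpha=recipe["lora_alpha"],
        lora_dropout=recipe["lora_dropout"],
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    return quant_config, lora_config
```

`effective_batch_size(RECIPE)` gives 16 on a single GPU, matching the table. The quantization config would be passed to `from_pretrained` via `quantization_config`, and the LoRA config to the trainer or to `peft.get_peft_model`.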

📊 Training Data (~350K+ samples)

Dataset	Samples	Purpose	Reference
KodCode-V1-SFT-R1	~100K+ (r1_correctness=True)	Verified competitive programming with R1-style chain-of-thought reasoning	arxiv:2503.02951
Code-Feedback	66K	Multi-turn code dialogue (ChatML)	m-a-p
Magicoder-OSS-Instruct-75K	75K	Diverse code generation from real code seeds	arxiv:2312.02120
Magicoder-Evol-Instruct-110K	110K	Evolved code instructions (increasing complexity)	arxiv:2312.02120

Data Quality Controls

KodCode filtered to only r1_correctness=True solutions (execution-verified)
All datasets converted to ChatML messages format with expert system prompt
Quality filter: minimum 50 chars in assistant response
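
The three controls above could be implemented along these lines. This is a hedged sketch rather than the actual preprocessing script: the field names `question` and `r1_solution` are assumptions for illustration, the system prompt is the truncated placeholder shown in the usage example, and applying the length check to the last assistant turn is one possible reading of the filter.

```python
SYSTEM_PROMPT = "You are an elite software engineer..."  # placeholder; full prompt not given here
MIN_RESPONSE_CHARS = 50


def to_chatml(question, answer, system_prompt=SYSTEM_PROMPT):
    """Wrap one instruction/response pair in the ChatML messages format."""
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}


def passes_quality_filter(example, min_chars=MIN_RESPONSE_CHARS):
    """Keep examples whose last assistant turn has at least min_chars characters."""
    assistant_turns = [m["content"] for m in example["messages"]
                       if m["role"] == "assistant"]
    return bool(assistant_turns) and len(assistant_turns[-1]) >= min_chars


def prepare_kodcode(rows):
    """Keep execution-verified rows, convert to ChatML, apply the length filter."""
    prepared = []
    for row in rows:
        if not row.get("r1_correctness"):   # drop unverified or failing solutions
            continue
        # "question" and "r1_solution" are hypothetical column names
        example = to_chatml(row["question"], row["r1_solution"])
        if passes_quality_filter(example):
            prepared.append(example)
    return prepared
```

With the `datasets` library the same functions could be applied through `Dataset.filter` and `Dataset.map` instead of a Python loop.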

🔬 Research Foundation

This model's recipe is derived from an analysis of leading code LLM papers:

Key Papers & Results

Paper	Key Finding	Benchmark
rStar-Coder (2505.21297)	Qwen2.5-Coder-7B → 57.3% LiveCodeBench (from 17.4%) using verified competitive programming data	LiveCodeBench
KodCode (2503.02951)	Verified R1-style reasoning traces improve coding by +15% on BigCodeBench	BigCodeBench
Qwen2.5-Coder (2409.12186)	7:2:1 code:text:math ratio; coarse→fine SFT; 92.7% HumanEval at 32B	HumanEval
LoRA Without Regret	r=64+ all-linear matches full fine-tuning quality; alpha=2×r	LoRA theory
SWE-RL (2502.18449)	GRPO on 273K PRs → 41.0% SWE-bench Verified with Llama3-SWE-RL-70B (comparable to GPT-4o)	SWE-bench

Why This Recipe Works

KodCode R1-style reasoning: Long chain-of-thought traces teach the model to think before coding, mimicking Claude's reasoning approach
Execution-verified KodCode data: Every retained KodCode solution passed test execution (r1_correctness=True), so that subset contains no solutions that fail their tests
Diverse instruction sources: Magicoder (OSS-Instruct and Evol-Instruct) and Code-Feedback (multi-turn dialogue) complement KodCode, so the mix spans competitive programming, general code generation, and debugging
QLoRA + all-linear: Per "LoRA Without Regret" research, targeting all linear layers with sufficient rank matches full fine-tuning quality

🚀 Usage

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "ashhhhhh26/qwen25-coder-32b-mythos")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Generate
messages = [
    {"role": "system", "content": "You are an elite software engineer..."},
    {"role": "user", "content": "Implement a red-black tree in Python with insert, delete, and search operations."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.1, do_sample=True)
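# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))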

🏋️ Training

Requirements

pip install torch transformers trl peft datasets bitsandbytes accelerate trackio flash-attn

Launch Training

# Single GPU (T4/L4/A10G with 16-24GB VRAM)
python train.py

# Multi-GPU with DeepSpeed ZeRO-2
accelerate launch --config_file deepspeed_zero2.yaml --num_processes 4 train.py

HF Jobs (recommended)

# Via the HF Jobs CLI (runs train.py as a uv script)
hf jobs uv run \
    --flavor t4-small \
    --timeout 8h \
    --with torch --with transformers --with trl --with peft --with datasets \
    --with bitsandbytes --with accelerate --with trackio --with flash-attn \
    train.py

Hardware Requirements

Hardware	VRAM	Feasibility
T4 (16GB)	16GB	✅ QLoRA 4-bit (max_length=4096)
L4 (24GB)	24GB	✅ QLoRA 4-bit (max_length=8192)
A10G (24GB)	24GB	✅ QLoRA 4-bit (max_length=8192)
A100 (80GB)	80GB	✅ 16-bit LoRA without quantization; full fine-tuning only with memory-saving optimizers, sharding, or offloading
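
A back-of-the-envelope estimate shows where these feasibility ratings come from. The sketch below counts weight and optimizer memory only; the parameter count of about 7.6 billion and the bytes-per-parameter figures are assumptions, and activations, LoRA adapter states, and framework overhead are ignored.

```python
PARAMS = 7.6e9  # assumption: approximate parameter count of the 7B model


def weights_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def full_finetune_gb(num_params, bytes_per_param=16):
    """Rough mixed-precision AdamW footprint per parameter:
    16-bit weights and gradients plus 32-bit master weights and two
    optimizer moments (2 + 2 + 4 + 4 + 4 = 16 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"4-bit NF4 weights : {weights_gb(PARAMS, 4):.1f} GB")
    print(f"16-bit weights    : {weights_gb(PARAMS, 16):.1f} GB")
    print(f"full fine-tune    : {full_finetune_gb(PARAMS):.1f} GB")
```

Under these assumptions the 4-bit weights take roughly 3.8 GB, which leaves room for activations on a 16 GB T4, whereas 16-bit weights alone (about 15 GB) nearly fill it. The full fine-tuning estimate is the reason that option generally relies on memory-saving techniques such as 8-bit optimizers, sharding, or offloading.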

📈 Next Steps (Future Training Stages)

Stage 2: GRPO with Execution Rewards

Based on SWE-RL and DeepSeek-Coder-V2 research:

Use KodCode/KodCode-Light-RL-10K for GRPO training
Binary reward: pass all unit tests = 1.0, fail = 0.0
Expected improvement: +5-10% on competitive programming benchmarks
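
The binary reward described above could look like the following. This is a minimal sketch under stated assumptions: the function name `execution_reward`, the convention that tests are plain `assert` statements appended to the candidate solution, and the 10-second timeout are all illustrative, and real training would need proper sandboxing of untrusted code.

```python
import os
import subprocess
import sys
import tempfile


def execution_reward(solution_code, test_code, timeout_s=10.0):
    """Return 1.0 if the solution passes every test, otherwise 0.0.

    The solution and its tests are written to a temporary file and run in a
    separate Python process; a non-zero exit code or a timeout counts as failure.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script_path = os.path.join(workdir, "candidate.py")
        with open(script_path, "w", encoding="utf-8") as handle:
            handle.write(solution_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, script_path],
                capture_output=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

A function of this shape can be wrapped as a GRPO reward by scoring each sampled completion against the unit tests stored with its prompt.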

Stage 3: SWE-RL for Agent-Level Performance

Fine-tune on 273K GitHub PR data with edit-similarity reward
Target: 40%+ SWE-bench Verified
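
An edit-similarity reward can be sketched with the standard library. The version below follows the general idea reported for SWE-RL (a sequence-similarity score between the predicted and reference patch, with a penalty for unusable output), but the function name, the -1.0 penalty value, and the empty-patch check are assumptions for illustration rather than the paper's exact implementation.

```python
import difflib


def edit_similarity_reward(predicted_patch, oracle_patch):
    """Score a predicted patch against the reference patch.

    Returns -1.0 when no usable patch was produced, otherwise a similarity
    ratio in [0.0, 1.0] computed by difflib.SequenceMatcher.
    """
    if predicted_patch is None or not predicted_patch.strip():
        return -1.0
    matcher = difflib.SequenceMatcher(None, predicted_patch, oracle_patch,
                                      autojunk=False)
    return matcher.ratio()
```

Unlike the binary execution reward of Stage 2, this signal is continuous and needs no test execution, which is what makes it practical for large collections of pull-request data.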

Scale Up Options

32B model: Use Qwen/Qwen2.5-Coder-32B-Instruct on A100-80GB with same recipe
rStar-Coder data: Add microsoft/rStar-Coder seed_sft split for even stronger competitive programming

📝 Citation

If you use this model, please cite the foundational works:

@article{qwen2.5-coder,
  title={Qwen2.5-Coder Technical Report},
  author={Hui, Binyuan and Yang, Jian and others},
  journal={arXiv:2409.12186},
  year={2024}
}

@article{kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Xu, Zhangchen and others},
  journal={arXiv:2503.02951},
  year={2025}
}

@article{rstar-coder,
  title={rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset},
  author={Liu, Yifei and Zhang, Li Lyna and others},
  journal={arXiv:2505.21297},
  year={2025}
}