# Devstral Small 2507: Fine-tuned on AI Coding Conversations
QLoRA fine-tune of Devstral Small 2507 (24B) on 2,100 real AI coding assistant conversations extracted from Claude Code, Cursor, Codex CLI, and OpenCode.
## Training Details
| Parameter | Value |
|---|---|
| Base model | mistralai/Devstral-Small-2507 (24B) |
| Method | QLoRA (4-bit NF4, rank 32, alpha 32) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 184.8M / 23.8B (0.78%) |
| Epochs | 3 |
| Batch size | 2 per device × 4 gradient accumulation steps = 8 effective |
| Learning rate | 2e-4 (cosine schedule) |
| Optimizer | AdamW 8-bit |
| Precision | bfloat16 |
| Hardware | 1x NVIDIA L4 24GB (GCP g2-standard-8) |
| Training time | 10.9 hours (39,402s) |
| Final loss | 0.3618 |
| Framework | Unsloth 2026.2.1 + TRL 0.22.2 |
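The adapter settings in the table map onto a PEFT `LoraConfig` along these lines. This is a sketch reconstructed from the table, not the actual training script; the dropout value is an assumption, as it is not stated above.

```python
from peft import LoraConfig

# Hyperparameters taken from the Training Details table; other fields are
# illustrative defaults, not confirmed by the model card.
lora_config = LoraConfig(
    r=32,                                         # rank 32
    lora_alpha=32,                                # alpha 32
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.0,                             # assumed, not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
)
```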
## Training Data
2,100 multi-turn coding conversations (175K+ messages total before filtering). Raw per-source conversation counts before filtering:
| Source | Conversations |
|---|---|
| Cursor (AI Service) | 2,073 |
| Cursor (Global Composer) | 1,104 |
| Codex CLI | 555 |
| Claude Code | 289 |
| OpenCode CLI | 284 |
Preprocessing:
- Filtered out conversations with fewer than 2 messages
- Removed tool-call-only assistant turns (<20 chars)
- Removed tool_result user messages
- Merged consecutive same-role messages
- Truncated messages longer than 8,000 chars
- Kept only conversations that start with a user message and contain at least one assistant response
- Secrets redacted (4,208 redaction markers across 91 unique secrets)
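The filtering rules above can be sketched as a small pipeline. This is illustrative, not the actual preprocessing script: the function name, rule ordering, and the `type` key used to mark tool results are assumptions.

```python
def preprocess(conversations):
    """Apply the filtering rules listed above to a list of conversations.

    Each conversation is a list of {"role": ..., "content": ...} dicts;
    tool-result user messages are assumed to carry "type": "tool_result".
    """
    kept = []
    for conv in conversations:
        # Drop tool_result user messages and tool-call-only assistant turns.
        msgs = [
            m for m in conv
            if not (m["role"] == "user" and m.get("type") == "tool_result")
            and not (m["role"] == "assistant" and len(m["content"]) < 20)
        ]
        # Merge consecutive same-role messages.
        merged = []
        for m in msgs:
            if merged and merged[-1]["role"] == m["role"]:
                merged[-1]["content"] += "\n" + m["content"]
            else:
                merged.append(dict(m))
        # Truncate overly long messages.
        for m in merged:
            m["content"] = m["content"][:8000]
        # Keep only well-formed conversations.
        if (
            len(merged) >= 2
            and merged[0]["role"] == "user"
            and any(m["role"] == "assistant" for m in merged)
        ):
            kept.append(merged)
    return kept
```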
## Usage

### With Unsloth (recommended for inference)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/devstral-finetuned-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python LRU cache from scratch"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### With PEFT + Transformers
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/devstral-finetuned-lora")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora")
```
### Convert to MLX (Apple Silicon)
```bash
# First merge LoRA into full model, then convert
pip install mlx-lm
python -m mlx_lm.convert --hf-path devstral-finetuned-16bit --mlx-path devstral-mlx -q --q-bits 4
python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..."
```
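The merge step mentioned in the comment above can be done with PEFT's `merge_and_unload`. A sketch, assuming the adapter repo name from the earlier examples and enough memory to hold the 16-bit weights; the output directory matches the `--hf-path` used in the conversion command:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model unquantized; merging requires full-precision weights.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "YOUR_USERNAME/devstral-finetuned-lora")

# Fold the LoRA deltas into the base weights and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("devstral-finetuned-16bit")
AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora").save_pretrained(
    "devstral-finetuned-16bit"
)
```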
## Evaluation
| Benchmark | Metric | Score | Notes |
|---|---|---|---|
| HumanEval | pass@1 | 3.0% (5/164) | Low score expected: the model was fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
### Why the low HumanEval score?
This model was trained on real AI coding conversations with:
- Multi-turn dialog context
- Tool calls and results
- Natural language explanations
- User-assistant interaction patterns
HumanEval tests bare function completion without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.
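One way to probe the model closer to its training distribution is to wrap a bare HumanEval-style stub in a single dialog turn before applying the chat template. A sketch; the wrapper function and its instruction wording are illustrative, not the evaluation harness used for the score above:

```python
def wrap_humaneval_prompt(prompt: str) -> list[dict]:
    """Turn a bare function stub into a single-turn chat request.

    The instruction text is illustrative; adjust it to taste.
    """
    return [{
        "role": "user",
        "content": (
            "Complete the following Python function. "
            "Return only the finished code.\n\n" + prompt
        ),
    }]

# The resulting messages list feeds directly into tokenizer.apply_chat_template.
messages = wrap_humaneval_prompt("def add(a: int, b: int) -> int:\n    ...")
```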
## Limitations
- Fine-tuned on a specific user's coding style and preferences
- Training data is English-only, primarily TypeScript/Python/Rust
- Not a general-purpose improvement; reflects patterns from specific coding workflows
- LoRA adapters only; requires the base Devstral Small 2507 model
## License
Apache 2.0 (same as base model)
## Compute Cost
~$12 total on GCP (L4 GPU @ ~$1.10/hr for ~10.9 hours)
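The estimate follows directly from the training time and the quoted hourly rate:

```python
seconds = 39_402            # training time from the Training Details table
hours = seconds / 3600      # 10.945 hours, i.e. the ~10.9 h quoted above
cost = hours * 1.10         # L4 on-demand rate of ~$1.10/hr
print(f"~${cost:.0f}")      # roughly $12
```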