Devstral Small 2507 – Fine-tuned on AI Coding Conversations

QLoRA fine-tune of Devstral Small 2507 (24B) on 2,100 real AI coding assistant conversations extracted from Claude Code, Cursor, Codex CLI, and OpenCode.

Training Details

| Parameter | Value |
|---|---|
| Base model | mistralai/Devstral-Small-2507 (24B) |
| Method | QLoRA (4-bit NF4, rank 32, alpha 32) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 184.8M / 23.8B (0.78%) |
| Epochs | 3 |
| Batch size | 2 × 4 grad accumulation = 8 effective |
| Learning rate | 2e-4 (cosine schedule) |
| Optimizer | AdamW 8-bit |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA L4 24 GB (GCP g2-standard-8) |
| Training time | 10.9 hours (39,402 s) |
| Final loss | 0.3618 |
| Framework | Unsloth 2026.2.1 + TRL 0.22.2 |
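The adapter settings in the table correspond to a PEFT config along these lines. This is a sketch, not the actual training script (which used Unsloth's wrapper around the same settings); `lora_dropout` and `bias` are assumptions not stated in the table.

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the table above.
# Module names follow the standard Mistral/Llama-style projection naming.
lora_config = LoraConfig(
    r=32,                     # LoRA rank
    lora_alpha=32,            # scaling alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.0,         # assumption: not stated in the table
    bias="none",              # assumption: not stated in the table
    task_type="CAUSAL_LM",
)
```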

Training Data

2,100 multi-turn coding conversations were kept after filtering 4,305 raw extracted conversations (175K+ messages total) from:

| Source | Conversations |
|---|---|
| Cursor (AI Service) | 2,073 |
| Cursor (Global Composer) | 1,104 |
| Codex CLI | 555 |
| Claude Code | 289 |
| OpenCode CLI | 284 |

Preprocessing:

  • Filtered conversations with <2 messages
  • Removed tool-call-only assistant turns (<20 chars)
  • Removed tool_result user messages
  • Merged consecutive same-role messages
  • Truncated messages >8000 chars
  • Kept only conversations that start with a user message and contain at least one assistant response
  • Secrets redacted (4,208 redaction markers across 91 unique secrets)
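The steps above can be sketched as a single pass over each conversation. This is illustrative, not the actual preprocessing script: the message schema, helper name, and the redaction regex are assumptions; the thresholds come from the list.

```python
# Sketch of the preprocessing pipeline described above (illustrative only).
import re

MAX_CHARS = 8000
# Hypothetical secret pattern; the real redaction covered 91 unique secrets.
SECRET_RE = re.compile(r"(?:sk|ghp|AKIA)[A-Za-z0-9_-]{16,}")

def clean_conversation(messages):
    """Return a filtered message list, or None if the conversation is dropped."""
    kept = []
    for m in messages:
        if m["role"] == "assistant" and len(m["content"].strip()) < 20:
            continue  # tool-call-only assistant turn
        if m["role"] == "user" and m.get("type") == "tool_result":
            continue  # tool_result user message
        content = SECRET_RE.sub("[REDACTED]", m["content"])[:MAX_CHARS]
        if kept and kept[-1]["role"] == m["role"]:
            kept[-1]["content"] += "\n\n" + content  # merge same-role runs
        else:
            kept.append({"role": m["role"], "content": content})
    # Must have >=2 messages, start with user, and contain an assistant turn.
    if len(kept) < 2 or kept[0]["role"] != "user":
        return None
    if not any(m["role"] == "assistant" for m in kept):
        return None
    return kept
```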

Usage

With Unsloth (recommended for inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/devstral-finetuned-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python LRU cache from scratch"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

With PEFT + Transformers

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/devstral-finetuned-lora")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora")

Convert to MLX (Apple Silicon)

# First merge LoRA into full model, then convert
pip install mlx-lm
python -m mlx_lm.convert --hf-path devstral-finetuned-16bit --mlx-path devstral-mlx -q --q-bits 4
python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..."

Evaluation

| Benchmark | Metric | Score | Notes |
|---|---|---|---|
| HumanEval | pass@1 | 3.0% (5/164) | Low score expected: model fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
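With a single greedy sample per problem, the reported pass@1 is simply solved over total:

```python
# pass@1 with one sample per problem is just solved / total
solved, total = 5, 164
pass_at_1 = solved / total
print(f"{pass_at_1:.1%}")  # → 3.0%
```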

Why the low HumanEval score?

This model was trained on real AI coding conversations with:

  • Multi-turn dialog context
  • Tool calls and results
  • Natural language explanations
  • User-assistant interaction patterns

HumanEval tests bare function completion without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.

Limitations

  • Fine-tuned on a specific user's coding style and preferences
  • Training data is English-only, primarily TypeScript/Python/Rust
  • Not a general-purpose improvement โ€” reflects patterns from specific coding workflows
  • LoRA adapters only; requires the base Devstral Small 2507 model

License

Apache 2.0 (same as base model)

Compute Cost

~$12 total on GCP (L4 GPU @ ~$1.10/hr for ~10.9 hours)
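The figure is back-of-the-envelope: the on-demand hourly rate times the wall-clock training time from the table above.

```python
# Approximate GCP cost: on-demand L4 rate times wall-clock hours.
hourly_rate = 1.10        # USD/hr, approximate g2-standard-8 rate
hours = 39_402 / 3600     # training time in seconds, from the table
cost = hourly_rate * hours
print(f"${cost:.0f}")     # → $12
```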
