Devstral-Small-2-24B Opus Reasoning

A LoRA fine-tune of Devstral-Small-2-24B distilled on Claude 4.6 Opus <think>...</think> reasoning traces. The goal: give Devstral's strong coding foundation explicit chain-of-thought reasoning before it writes code.

Model Details

Base model mistralai/Devstral-Small-2-24B-Instruct-2512
Fine-tune type QLoRA (4-bit NF4 base + BF16 LoRA adapters)
LoRA rank r=16, alpha=16
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training data nohurry/Opus-4.6-Reasoning-3000x-filtered (2,322 samples)
Checkpoint used checkpoint-1200 (just past epoch 2 — best generalisation)
Hardware RTX 3090 24GB VRAM
Framework Unsloth 2026.3.10 + TRL SFTTrainer
Sequence length 2048

Files

File Description
adapter_model.safetensors LoRA adapter weights (~400MB)
adapter_config.json LoRA config (rank, target modules, base model path)
Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf Quantised GGUF — ready for llama.cpp / Ollama / llama-swap
Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf Higher quality GGUF — recommended for local use

Training Data

nohurry/Opus-4.6-Reasoning-3000x-filtered — 2,322 problems with Claude 4.6 Opus <think> reasoning traces and solutions, filtered to < 20,000 characters combined length.
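The length cap works out to a simple per-sample predicate. A minimal sketch — the field names ("problem", "thinking", "solution") are illustrative here, not necessarily the dataset's actual column names:

```python
def under_length_cap(sample, max_chars=20_000):
    # Keep a sample only if problem + reasoning trace + solution
    # together stay under the combined character limit.
    combined = (len(sample["problem"])
                + len(sample["thinking"])
                + len(sample["solution"]))
    return combined < max_chars
```

With Hugging Face datasets, this would typically be passed to `dataset.filter(under_length_cap)`.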

Each sample was formatted as:

[INST] {problem} [/INST]<think>
{thinking}
</think>

{solution}

Loss was computed on the assistant turn only (train_on_responses_only).

Training Loss

Step Epoch Loss
5 0.01 0.7949
100 0.17 0.5708
300 0.52 0.5800
600 1.03 0.3559
900 1.55 0.3858
1100 1.89 0.3469
1160 2.00 0.3752
1200 2.07 0.1493

Checkpoint 1200 (just past epoch 2 — see the loss table above) was selected over the full epoch-3 run: for reasoning-distillation tasks, epoch 3 typically overfits to the trace style, while stopping around epoch 2 gives the best generalisation.

Usage

GGUF (llama.cpp / Ollama / llama-swap)

Download Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf for best quality, or Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf if VRAM is tight.

# llama.cpp
./llama-cli -m Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf \
  --chat-template mistral \
  -p "[INST] Write a Python function to find all prime numbers up to n using a sieve. [/INST]"

LoRA Adapter (Python)

Requires the base model. Because Devstral is a VLM (Pixtral vision encoder), the easiest path is the text-only extracted weights — see the technical notes below.

import torch
from unsloth import FastLanguageModel
from peft import PeftModel

base_model_path = "path/to/Devstral-Small-2-24B-textonly"  # see notes
adapter_path    = "adamjen/Devstral-Small-2-24B-Opus-Reasoning"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = base_model_path,
    max_seq_length = 2048,
    dtype          = torch.bfloat16,
    load_in_4bit   = True,
)
model = PeftModel.from_pretrained(model, adapter_path)

messages = [{"role": "user", "content": "Write a binary search in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Template

This model uses Mistral's [INST]...[/INST] format. The model will produce a <think>...</think> block before its response.

[INST] Your question here [/INST]<think>
... reasoning ...
</think>

... answer ...
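Since the model emits its reasoning inline, downstream code usually wants to separate the trace from the answer. A minimal helper, assuming a single well-formed <think> block as shown above:

```python
import re

def split_reasoning(text):
    """Split generated text into (reasoning, answer).

    Returns (None, text) when no <think> block is present.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), text[m.end():].strip()
```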

Technical Notes: The Devstral Extraction Problem

Devstral-Small-2-24B ships as a Mistral3ForConditionalGeneration (VLM) with a Pixtral vision encoder. Training it as a text-only model on a single 24GB GPU hits several problems:

  • FP8 weights: The official instruct release uses FP8 quantisation, which requires compute capability ≥ 8.9. RTX 3090 is 8.6 — incompatible. Requires dequantising to BF16 first.
  • Vision encoder VRAM: The Pixtral encoder consumes ~4GB VRAM, leaving insufficient headroom for 4-bit QLoRA + gradients.
  • Device map splitting: With the VLM loaded via device_map="auto", accelerate splits layers across GPU and CPU, which breaks training.
  • transformers 5.x concurrent loader: The async tensor loader materialises all BF16 tensors simultaneously before quantisation → OOM. Fix: HF_DEACTIVATE_ASYNC_LOAD=1.
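The async-loader fix is just an environment variable, but it must be set before transformers loads the checkpoint:

```python
import os

# Set before the model is loaded, so BF16 tensors are materialised
# sequentially rather than all at once ahead of 4-bit quantisation.
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"
```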

Solution: Extract the language-model layers into a standalone text-only MistralForCausalLM directory (stripping vision_tower.* and multi_modal_projector.*, and renaming language_model.model.* → model.*). This produces a clean ~23B causal LM loadable by FastLanguageModel.
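The key-rewriting step amounts to a filter-and-rename pass over the checkpoint's state dict. A simplified sketch — real checkpoints are sharded safetensors, and the exact key layout depends on the transformers version:

```python
def extract_text_only(state_dict):
    out = {}
    for name, tensor in state_dict.items():
        # Drop the Pixtral vision encoder and the projector weights.
        if name.startswith(("vision_tower.", "multi_modal_projector.")):
            continue
        # language_model.model.* -> model.* (and similarly for any other
        # keys nested under the language_model prefix).
        if name.startswith("language_model."):
            name = name[len("language_model."):]
        out[name] = tensor
    return out
```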

Full write-up with all fixes: Fine-tuning Devstral on an RTX 3090

Hardware Requirements

Format Min VRAM
Q4_K_M GGUF ~16GB
Q5_K_M GGUF ~18GB
LoRA inference (4-bit) ~20GB
LoRA training (QLoRA) 24GB
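The GGUF figures above roughly follow from bits-per-weight; the remainder is KV cache and runtime overhead. A back-of-envelope sketch (the bits-per-weight averages are approximations, not official numbers):

```python
def approx_gguf_gb(params_billions, bits_per_weight):
    # Approximate weight-file size in GB: params * bits / 8 bits-per-byte.
    return params_billions * bits_per_weight / 8

# Q4_K_M averages roughly 4.8 bits/weight, Q5_K_M roughly 5.5,
# giving ~14.4GB and ~16.5GB of weights for a 24B model — consistent
# with the ~16GB / ~18GB minimums once cache and overhead are added.
```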

Limitations

  • Trained on 2,322 samples — a small dataset. Performance gains on reasoning are real but limited in breadth.
  • Max sequence length 2048 tokens (training constraint). Longer contexts may degrade quality.
  • The <think> block reasoning style is inherited from Claude Opus traces — the model may produce verbose reasoning.
  • Not evaluated on formal benchmarks.

Author

Adam Jenner — adamjenner.com.au
