---
base_model: unsloth/devstral-small-2507-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- lora
- sft
- transformers
- trl
- unsloth
- code
- devstral
- mistral
datasets:
- custom
model-index:
- name: devstral-finetuned-lora
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval
      type: openai_humaneval
    metrics:
    - type: pass@1
      value: 3.0
      name: pass@1
---

# Devstral Small 2507 — Fine-tuned on AI Coding Conversations

A QLoRA fine-tune of [Devstral Small 2507](https://huggingface.co/mistralai/Devstral-Small-2507) (24B) on 2,100 real AI coding assistant conversations extracted from Claude Code, Cursor, Codex CLI, and OpenCode.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Devstral-Small-2507` (24B) |
| Method | QLoRA (4-bit NF4, rank 32, alpha 32) |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | 184.8M / 23.8B (0.78%) |
| Epochs | 3 |
| Effective batch size | 2 × 4 (gradient accumulation) = 8 |
| Learning rate | 2e-4, cosine schedule |
| Optimizer | AdamW 8-bit |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA L4 24 GB (GCP g2-standard-8) |
| Training time | 10.9 hours (39,402 s) |
| Final loss | 0.3618 |
| Framework | Unsloth 2026.2.1 + TRL 0.22.2 |

## Training Data

2,100 multi-turn coding conversations (175K+ messages before filtering), drawn from:

| Source | Conversations |
|--------|--------------|
| Cursor (AI Service) | 2,073 |
| Cursor (Global Composer) | 1,104 |
| Codex CLI | 555 |
| Claude Code | 289 |
| OpenCode CLI | 284 |

**Preprocessing:**

- Removed conversations with fewer than 2 messages
- Removed tool-call-only assistant turns (<20 chars)
- Removed `tool_result` user messages
- Merged consecutive same-role messages
- Truncated messages longer than 8,000 characters
- Kept only conversations that start with a user message and contain at least one assistant response
- Redacted secrets (4,208 redaction markers across 91 unique secrets)

## Usage

### With Unsloth
Unsloth is the recommended path for inference:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/devstral-finetuned-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python LRU cache from scratch"}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### With PEFT + Transformers

Recent Transformers releases deprecate passing `load_in_4bit` directly to `from_pretrained`; use a `BitsAndBytesConfig` instead:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/devstral-finetuned-lora")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora")
```

### Convert to MLX (Apple Silicon)

```bash
# First merge the LoRA adapters into a full 16-bit model, then convert
pip install mlx-lm
python -m mlx_lm.convert --hf-path devstral-finetuned-16bit --mlx-path devstral-mlx -q --q-bits 4
python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..."
```

## Evaluation

| Benchmark | Metric | Score | Notes |
|-----------|--------|-------|-------|
| HumanEval | pass@1 | 3.0% (5/164) | Low score expected; the model was tuned for conversational coding (multi-turn dialogs with tool use), not bare function completion |

**Why the low HumanEval score?** This model was trained on real AI coding conversations featuring:

- Multi-turn dialog context
- Tool calls and results
- Natural language explanations
- User-assistant interaction patterns

HumanEval tests **bare function completion** without dialog context, which is a different task.
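For reference, pass@k is conventionally computed with the unbiased estimator over per-problem samples. A minimal sketch (the actual evaluation harness is not included in this repo):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one greedy sample per problem, pass@1 reduces to the raw pass rate:
results = [1] * 5 + [0] * 159  # 5 of 164 HumanEval problems solved
score = sum(pass_at_k(1, c, 1) for c in results) / len(results)  # ~0.0305, i.e. 3.0%
```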
The model is optimized for conversational coding assistance, not standalone code generation.

## Limitations

- Fine-tuned on a specific user's coding style and preferences
- Training data is English-only, primarily TypeScript/Python/Rust
- Not a general-purpose improvement; reflects patterns from specific coding workflows
- LoRA adapters only; requires the base Devstral Small 2507 model

## License

Apache 2.0 (same as the base model).

## Compute Cost

~$12 total on GCP (L4 GPU at ~$1.10/hr for ~10.9 hours).
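As a footnote, the message-level preprocessing filters described under Training Data can be sketched roughly as follows. This is an illustrative reconstruction, not the actual pipeline; field names such as `type` are assumptions:

```python
def clean_conversation(messages):
    """Sketch of the filters described under Training Data.
    Each message is a dict like {"role": "user"|"assistant", "content": str}."""
    # Drop tool_result user messages and near-empty (tool-call-only) assistant turns.
    kept = [
        m for m in messages
        if not (m["role"] == "user" and m.get("type") == "tool_result")
        and not (m["role"] == "assistant" and len(m["content"].strip()) < 20)
    ]
    # Merge consecutive same-role messages into one turn.
    merged = []
    for m in kept:
        if merged and merged[-1]["role"] == m["role"]:
            merged[-1]["content"] += "\n" + m["content"]
        else:
            merged.append({"role": m["role"], "content": m["content"]})
    # Truncate overlong messages.
    for m in merged:
        m["content"] = m["content"][:8000]
    # Keep only conversations with >= 2 messages that start with a user turn
    # and contain at least one assistant response.
    ok = (
        len(merged) >= 2
        and merged[0]["role"] == "user"
        and any(m["role"] == "assistant" for m in merged)
    )
    return merged if ok else None
```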