---
base_model: unsloth/devstral-small-2507-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- lora
- sft
- transformers
- trl
- unsloth
- code
- devstral
- mistral
datasets:
- custom
model-index:
- name: devstral-finetuned-lora
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval
      type: openai_humaneval
    metrics:
    - type: pass@1
      value: 3.0
      name: pass@1
---

# Devstral Small 2507 — Fine-tuned on AI Coding Conversations

A QLoRA fine-tune of [Devstral Small 2507](https://huggingface.co/mistralai/Devstral-Small-2507) (24B) on 2,100 real AI coding assistant conversations extracted from Claude Code, Cursor, Codex CLI, and OpenCode.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Devstral-Small-2507` (24B) |
| Method | QLoRA (4-bit NF4, rank 32, alpha 32) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 184.8M / 23.8B (0.78%) |
| Epochs | 3 |
| Batch size | 2 × 4 grad_accum = 8 effective |
| Learning rate | 2e-4 (cosine schedule) |
| Optimizer | AdamW 8-bit |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA L4 24 GB (GCP g2-standard-8) |
| Training time | 10.9 hours (39,402 s) |
| Final loss | 0.3618 |
| Framework | Unsloth 2026.2.1 + TRL 0.22.2 |
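
The adapter settings in the table correspond roughly to a PEFT `LoraConfig` along these lines. This is a sketch, not the actual training script; the dropout and bias values are assumptions, since the table does not state them.

```python
from peft import LoraConfig

# Reconstruction of the adapter config described in the table above.
# lora_dropout and bias are assumptions, not taken from the training run.
lora_config = LoraConfig(
    r=32,                      # LoRA rank
    lora_alpha=32,             # LoRA scaling factor (alpha)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```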

## Training Data

2,100 multi-turn coding conversations (175K+ messages before filtering) were retained after preprocessing, drawn from the following sources (raw conversation counts):

| Source | Raw conversations |
|--------|-------------------|
| Cursor (AI Service) | 2,073 |
| Cursor (Global Composer) | 1,104 |
| Codex CLI | 555 |
| Claude Code | 289 |
| OpenCode CLI | 284 |

**Preprocessing:**
- Filtered out conversations with fewer than 2 messages
- Removed tool-call-only assistant turns (<20 chars)
- Removed tool_result user messages
- Merged consecutive same-role messages
- Truncated messages longer than 8,000 characters
- Kept only conversations that start with a user message and contain at least one assistant response
- Redacted secrets (4,208 redaction markers across 91 unique secrets)
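
The filtering rules above can be sketched as a simple pipeline over `{"role", "content"}` message dicts. This is an illustration of the stated rules, not the actual preprocessing script; in particular, how tool_result messages are detected here is an assumption.

```python
def preprocess(conversation, max_chars=8000, min_assistant_chars=20):
    """Apply the filtering rules described above to one conversation.

    `conversation` is a list of {"role": str, "content": str} dicts.
    Returns the cleaned message list, or None if the conversation is dropped.
    """
    msgs = []
    for m in conversation:
        role, content = m["role"], m["content"]
        # Drop tool-call-only assistant turns (very short assistant messages).
        if role == "assistant" and len(content) < min_assistant_chars:
            continue
        # Drop tool_result user messages (marker-based detection is an assumption).
        if role == "user" and content.startswith("tool_result"):
            continue
        # Truncate overly long messages.
        content = content[:max_chars]
        # Merge consecutive same-role messages.
        if msgs and msgs[-1]["role"] == role:
            msgs[-1]["content"] += "\n" + content
        else:
            msgs.append({"role": role, "content": content})
    # Keep only conversations with >=2 messages that start with a user
    # message and contain at least one assistant response.
    if len(msgs) < 2 or msgs[0]["role"] != "user":
        return None
    if not any(m["role"] == "assistant" for m in msgs):
        return None
    return msgs
```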

## Usage

### With Unsloth (recommended for inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YOUR_USERNAME/devstral-finetuned-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python LRU cache from scratch"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### With PEFT + Transformers

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/devstral-finetuned-lora")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora")
```

### Convert to MLX (Apple Silicon)

```bash
# First merge the LoRA adapters into a full 16-bit model, then convert
pip install mlx-lm
python -m mlx_lm.convert --hf-path devstral-finetuned-16bit --mlx-path devstral-mlx -q --q-bits 4
python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..."
```
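
The merge step mentioned in the comment above is not shown. A minimal sketch using PEFT's `merge_and_unload`, assuming the merged checkpoint should land in the `devstral-finetuned-16bit` directory that the convert command expects:

```python
# Sketch of the merge step: fold the LoRA deltas into the base weights
# and save a full 16-bit checkpoint for MLX conversion.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "YOUR_USERNAME/devstral-finetuned-lora")
merged = model.merge_and_unload()  # returns the base model with adapters merged in
merged.save_pretrained("devstral-finetuned-16bit")
AutoTokenizer.from_pretrained("YOUR_USERNAME/devstral-finetuned-lora").save_pretrained(
    "devstral-finetuned-16bit"
)
```

Note that merging a 24B model in bfloat16 needs roughly 48 GB of RAM, so this is typically done on a larger machine than the one used for training.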

## Evaluation

| Benchmark | Metric | Score | Notes |
|-----------|--------|-------|-------|
| HumanEval | pass@1 | 3.0% (5/164) | Low score expected — the model was fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
|
**Why the low HumanEval score?**

This model was trained on real AI coding conversations with:
- Multi-turn dialog context
- Tool calls and results
- Natural language explanations
- User-assistant interaction patterns

HumanEval tests **bare function completion** without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.
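
For reference, with one generated sample per problem, pass@1 is simply the fraction of problems solved: 5/164 ≈ 3.05%, reported above as 3.0%. The unbiased pass@k estimator from the HumanEval paper generalizes this, as a quick check in Python shows:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were generated and c of them passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=1, per-problem pass@1 is 1.0 if solved else 0.0, so the
# benchmark score is just the solve rate: 5 of 164 problems.
results = [1] * 5 + [0] * 159                              # 5 solved, 159 unsolved
score = sum(pass_at_k(1, c, 1) for c in results) / len(results)
print(f"pass@1 = {score:.2%}")                             # pass@1 = 3.05%
```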
|
## Limitations

- Fine-tuned on a specific user's coding style and preferences
- Training data is English-only, primarily TypeScript/Python/Rust
- Not a general-purpose improvement — reflects patterns from specific coding workflows
- LoRA adapters only; requires the base Devstral Small 2507 model

|
## License

Apache 2.0 (same as the base model).

## Compute Cost

~$12 total on GCP (L4 GPU at ~$1.10/hr for ~10.9 hours).