---
license: apache-2.0
---
**NEW:** I exported and added `mmproj-BF16.gguf` so vision works properly with llama.cpp, ollama, and LM Studio.
# Devstral-Vision-Small-2507 GGUF
Quantized GGUF versions of [cognitivecomputations/Devstral-Vision-Small-2507](https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507) - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.
## Model Description
This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:
- Converting UI screenshots to code
- Debugging visual rendering issues
- Implementing designs from mockups
- Understanding codebases with visual context
## Quantization Selection Guide
| Quantization | Size | Min RAM | Recommended For | Quality | Notes |
|--------------|------|---------|-----------------|---------|-------|
| **Q8_0** | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ⭐⭐⭐⭐⭐ | Near-lossless, best for production use |
| **Q6_K** | 18GB | 20GB | High-end GPUs with focus on quality | ⭐⭐⭐⭐☆ | Excellent quality/size balance |
| **Q5_K_M** | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ⭐⭐⭐⭐☆ | Great balance of quality and performance |
| **Q4_K_M** | 13GB | 16GB | **Most users** - RTX 3060 12GB/3070/4060 | ⭐⭐⭐☆☆ | The sweet spot, minimal quality loss |
| **IQ4_XS** | 12GB | 14GB | Experimental - newer compression method | ⭐⭐⭐☆☆ | Good alternative to Q4_K_M |
| **Q3_K_M** | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ⭐⭐☆☆☆ | Noticeable quality drop for complex code |
### Choosing the Right Quantization
**For coding with vision tasks, I recommend:**
- **Production/Professional use**: Q8_0 or Q6_K
- **General development**: Q4_K_M (best balance)
- **Limited VRAM**: Q5_K_M if you can fit it, otherwise Q4_K_M
- **Experimental**: Try IQ4_XS for potentially better quality at similar size to Q4_K_M
**Avoid Q3_K_M unless you're VRAM-constrained** - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
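The guide above can be condensed into a small helper. `pick_quant` is a hypothetical function (not part of this repo) that maps free VRAM in GB to the suggested quantization from the table:

```shell
# Hypothetical helper: suggest a quantization from available VRAM (GB),
# following the Min RAM column of the table above.
pick_quant() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 24 ]; then echo "Q8_0"
  elif [ "$vram_gb" -ge 20 ]; then echo "Q6_K"
  elif [ "$vram_gb" -ge 18 ]; then echo "Q5_K_M"
  elif [ "$vram_gb" -ge 16 ]; then echo "Q4_K_M"
  elif [ "$vram_gb" -ge 14 ]; then echo "IQ4_XS"
  else echo "Q3_K_M"
  fi
}

pick_quant 16   # Q4_K_M
```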
## Usage Examples
### With llama.cpp
```bash
# Download the model
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
Devstral-Small-Vision-2507-Q4_K_M.gguf \
--local-dir .
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
mmproj-BF16.gguf \
--local-dir .
# Run with llama.cpp - vision models need the multimodal CLI
# and the mmproj file alongside the main weights
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -p "Analyze this UI and generate React code" \
  --image screenshot.png \
  -c 8192
```
### With LM Studio
1. Download your chosen quantization
2. Load in LM Studio
3. Enable multimodal/vision mode in settings
4. Drag and drop images into the chat
### With ollama
```bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF
# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
```
### With koboldcpp
```bash
python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
    --contextsize 8192 \
    --gpulayers 999 \
    --mmproj mmproj-BF16.gguf
```
## Performance Tips
1. **Context Size**: This model supports up to 128k context, but start with 8k-16k for better performance
2. **GPU Layers**: Offload all layers to GPU if possible (`--gpulayers 999` or `-ngl 999`)
3. **Batch Size**: Increase batch size for better throughput if you have VRAM headroom
4. **Temperature**: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks
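To see why starting at 8k-16k context matters, here is a back-of-envelope KV-cache estimate. The architecture numbers are assumptions for illustration (40 layers, 8 KV heads, head dim 128, fp16 cache), not published specs for this model:

```shell
# Rough KV-cache size estimate. Assumed illustrative values:
# 40 layers, 8 KV heads (GQA), head dim 128, fp16 cache (2 bytes/value).
ctx=8192
bytes_per_token=$((2 * 40 * 8 * 128 * 2))   # K and V, across all layers
total_mib=$((bytes_per_token * ctx / 1024 / 1024))
echo "${total_mib} MiB"   # 1280 MiB at 8k context
```

Under these assumptions the cache grows linearly with context, so the same model at 128k context would need roughly 16x this amount, which is why long contexts are expensive even when the weights fit.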
## Hardware Requirements
| Quantization | Single GPU | Partial Offload | CPU Only |
|-------------|------------|-----------------|----------|
| Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM |
| Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM |
| Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM |
| Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM |
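For partial offload, a crude way to pick a `-ngl` value is to divide the VRAM you can spare by the approximate per-layer weight size. `layers_that_fit` is a hypothetical sketch assuming the ~13GB Q4_K_M file spreads evenly over an assumed 40 layers:

```shell
# Hypothetical back-of-envelope: how many layers fit in a VRAM budget (MiB),
# assuming Q4_K_M weights (~13 GB) spread evenly over an assumed 40 layers.
layers_that_fit() {
  local vram_mib=$1
  local model_mib=13312     # ~13 GB Q4_K_M
  local n_layers=40         # assumed layer count
  local per_layer=$((model_mib / n_layers))
  local fit=$((vram_mib / per_layer))
  [ "$fit" -gt "$n_layers" ] && fit=$n_layers
  echo "$fit"
}

layers_that_fit 8192   # suggested -ngl value for an 8 GB GPU
```

This ignores the KV cache and the vision projector, so leave headroom and reduce the value if you see out-of-memory errors.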
## Model Capabilities
✅ **Strengths:**
- Exceptional at converting visual designs to code
- Strong debugging abilities with visual context
- Maintains Devstral's 53.6% SWE-Bench performance
- Handles multiple programming languages
- 128k token context window
⚠️ **Limitations:**
- Not specifically fine-tuned for vision-to-code tasks
- Vision performance bounded by Mistral-Small-3.2's capabilities
- Requires decent hardware for optimal performance
- Quantization impacts both vision and coding quality
## License
Apache 2.0 (inherited from base models)

## Acknowledgments
- Original model by [Eric Hartford](https://erichartford.com/) at [Cognitive Computations](https://cognitivecomputations.ai/)
- Built on [Mistral AI](https://mistral.ai/)'s Devstral and Mistral-Small models
- Quantized using llama.cpp
## Links
- [Original Model](https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507)
- [Devstral Base](https://huggingface.co/mistralai/Devstral-Small-2507)
- [Mistral-Small Vision](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506)
---
*For issues or questions about these quantizations, please open an issue in the repository.*