QuixiAI
/

Devstral-Vision-Small-2507

Safetensors

mistral3

Model card Files Files and versions

xet

Community

ehartford commited on Jul 11, 2025

Commit

7b9a627

verified ·

1 Parent(s): 8adcf73

Create README.md

Browse files

Files changed (1) hide show

README.md +176 -0

README.md ADDED Viewed

	@@ -0,0 +1,176 @@

+# Devstral-Vision-Small-2507
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/wLHwLZti9Na0O-UOVh-Nh.png)
+# Devstral-Vision-Small-2507
+Created by [Eric Hartford](https://erichartford.com/) at [Cognitive Computations](https://erichartford.com/)
+## Model Description
+Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).
+This model enables vision-augmented software engineering tasks, allowing developers to:
+- Analyze screenshots and UI mockups to generate code
+- Debug visual rendering issues with actual screenshots
+- Convert designs and wireframes directly into implementation
+- Understand and modify codebases with visual context
+### Model Details
+- **Base Architecture**: Mistral Small 3.2 with vision encoder
+- **Parameters**: 24B (language model) + vision components
+- **Context Window**: 128k tokens
+- **License**: Apache 2.0
+- **Language Model**: Fine-tuned Devstral weights for superior coding performance
+- **Vision Model**: Mistral-Small vision encoder and multimodal projector
+## How It Was Created
+This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:
+1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
+2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
+3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
+4. Kept Mistral's tokenizer to maintain proper image token handling
+The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
+## Intended Use
+### Primary Use Cases
+- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
+- **Code Review with Visual Context**: Review code changes alongside their visual output
+- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
+- **Design-to-Code**: Convert visual designs directly into code
+- **Documentation with Visual Examples**: Generate documentation that references visual elements
+### Example Applications
+- Building UI components from screenshots
+- Debugging CSS/styling issues with visual feedback
+- Converting Figma/design mockups to code
+- Analyzing and reproducing visual bugs
+- Creating visual test cases
+## Usage
+### With OpenHands
+The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:
+```bash
+# Using vLLM
+vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
+    --tokenizer_mode mistral \
+    --config_format mistral \
+    --load_format mistral \
+    --tensor-parallel-size 2
+# Configure OpenHands to use the model
+# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
+# Set Base URL: http://localhost:8000/v1
+```
+### With Transformers
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor
+from PIL import Image
+model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+# Load an image
+image = Image.open("screenshot.png")
+# Create a prompt
+prompt = "Analyze this UI screenshot and generate React code to reproduce it."
+# Process inputs
+inputs = processor(
+    text=prompt,
+    images=image,
+    return_tensors="pt"
+).to(model.device)
+# Generate
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=2000,
+    temperature=0.7
+)
+response = processor.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+## Performance Expectations
+### Coding Performance
+Inherits Devstral's exceptional performance on coding tasks:
+- 53.6% on SWE-Bench Verified (when used with OpenHands)
+- Superior performance on multi-file editing and codebase exploration
+- Excellent tool use and agentic behavior
+### Vision Performance
+Maintains Mistral-Small's vision capabilities:
+- Strong understanding of UI elements and layouts
+- Accurate interpretation of charts, diagrams, and visual documentation
+- Reliable screenshot analysis for debugging
+## Hardware Requirements
+- **GPU Memory**: ~48GB for full precision, ~24GB with 4-bit quantization
+- **Recommended**: 2x RTX 4090 or better for optimal performance
+- **Minimum**: Single GPU with 24GB VRAM using quantization
+## Limitations
+- Vision capabilities are limited to what Mistral-Small-3.2 supports
+- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
+- Large model size may be prohibitive for some deployment scenarios
+- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)
+## Ethical Considerations
+This model inherits both the capabilities and limitations of its parent models. Users should:
+- Review generated code for security vulnerabilities
+- Verify visual interpretations are accurate
+- Be aware of potential biases in code generation
+- Use appropriate safety measures in production deployments
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{devstral-vision-2507,
+  author = {Hartford, Eric},
+  title = {Devstral-Vision-Small-2507},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
+}
+```
+## Acknowledgments
+This model builds upon the excellent work by:
+- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
+- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
+- The open-source community for testing and feedback
+## License
+Apache 2.0 - Same as the base models
+---
+*Created with dolphin passion 🐬 by Cognitive Computations*