# Devstral-Vision-Small-2507

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/wLHwLZti9Na0O-UOVh-Nh.png)

Created by [Eric Hartford](https://erichartford.com/) at [Cognitive Computations](https://erichartford.com/)

## Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).

This model enables vision-augmented software engineering tasks, allowing developers to:
- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementations
- Understand and modify codebases with visual context

### Model Details

- **Base Architecture**: Mistral Small 3.2 with vision encoder
- **Parameters**: 24B (language model) + vision components
- **Context Window**: 128k tokens
- **License**: Apache 2.0
- **Language Model**: Fine-tuned Devstral weights for superior coding performance
- **Vision Model**: Mistral-Small vision encoder and multimodal projector

## How It Was Created

This model was created by surgically transplanting the language-model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

1. Started with Mistral-Small-3.2-24B-Instruct-2506 (the complete multimodal model)
2. Replaced only the core language-model weights with Devstral-Small-2507's fine-tuned weights
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
4. Kept Mistral's tokenizer to maintain proper image-token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
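
The transplant amounts to a selective state-dict merge. A minimal sketch of the idea in plain Python; the `language_model.` prefix and the `embed_tokens`/`lm_head` key names are assumptions modeled on typical Hugging Face multimodal layouts, not the exact checkpoint layout:

```python
# Illustrative sketch only: real checkpoints are loaded with torch/safetensors,
# and actual key names may differ from the prefixes assumed here.
def transplant_language_weights(multimodal_sd, coder_sd, lm_prefix="language_model."):
    """Copy language-model weights from a text-only coder checkpoint into a
    multimodal state dict, preserving vision components and token embeddings."""
    merged = dict(multimodal_sd)  # start from the full multimodal model
    for key in multimodal_sd:
        if not key.startswith(lm_prefix):
            continue  # vision encoder / projector / adapter weights stay untouched
        if "embed_tokens" in key or "lm_head" in key:
            continue  # keep the multimodal embeddings for image-token handling
        donor_key = key[len(lm_prefix):]  # the text-only checkpoint has no prefix
        if donor_key in coder_sd:
            merged[key] = coder_sd[donor_key]
    return merged
```

The two `continue` branches correspond to steps 3 and 4 above: everything outside the language model, plus the embeddings tied to image-token handling, is left exactly as Mistral shipped it.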
38
+
39
+ ## Intended Use
40
+
41
+ ### Primary Use Cases
42
+ - **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
43
+ - **Code Review with Visual Context**: Review code changes alongside their visual output
44
+ - **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
45
+ - **Design-to-Code**: Convert visual designs directly into code
46
+ - **Documentation with Visual Examples**: Generate documentation that references visual elements
47
+
48
+ ### Example Applications
49
+ - Building UI components from screenshots
50
+ - Debugging CSS/styling issues with visual feedback
51
+ - Converting Figma/design mockups to code
52
+ - Analyzing and reproducing visual bugs
53
+ - Creating visual test cases
54
+
55
+ ## Usage
56
+
57
+ ### With OpenHands
58
+
59
+ The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:
60
+
61
+ ```bash
62
+ # Using vLLM
63
+ vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
64
+ --tokenizer_mode mistral \
65
+ --config_format mistral \
66
+ --load_format mistral \
67
+ --tensor-parallel-size 2
68
+
69
+ # Configure OpenHands to use the model
70
+ # Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
71
+ # Set Base URL: http://localhost:8000/v1
72
+ ```

### With Transformers

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# AutoModelForImageTextToText loads the vision tower alongside the language model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat-formatted prompt that includes the image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
).to(model.device)

# Generate (do_sample=True is required for temperature to take effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens
response = processor.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```

## Performance Expectations

### Coding Performance
Inherits Devstral's exceptional performance on coding tasks:
- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior

### Vision Performance
Maintains Mistral-Small's vision capabilities:
- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging

## Hardware Requirements

- **GPU Memory**: ~48GB in half precision (bfloat16), ~24GB with 4-bit quantization
- **Recommended**: 2x RTX 4090 or better for optimal performance
- **Minimum**: A single GPU with 24GB VRAM, using quantization
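
The quantized single-GPU path can be sketched as follows (an assumption-laden example, not an official recipe: it presumes `bitsandbytes` is installed and that your `transformers` version maps this checkpoint to `AutoModelForImageTextToText`):

```python
# Sketch: 4-bit NF4 loading via bitsandbytes for ~24GB-VRAM GPUs.
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "cognitivecomputations/Devstral-Vision-Small-2507",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Expect some quality loss relative to bf16; for serving, vLLM with tensor parallelism (as shown above) is usually the better fit.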

## Limitations

- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance is achieved with appropriate scaffolding (OpenHands, Cline, etc.)

## Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:
- Review generated code for security vulnerabilities
- Verify that visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments

## Citation

If you use this model, please cite:

```bibtex
@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
```

## Acknowledgments

This model builds upon the excellent work of:
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
- The open-source community for testing and feedback

## License

Apache 2.0, the same as the base models.

---

*Created with dolphin passion 🐬 by Cognitive Computations*