---
license: apache-2.0
base_model:
- mistralai/Magistral-Small-2506
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
library_name: vllm
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
---

# Magistral-Small-2506-Vision

Inspired by https://huggingface.co/ngxson/Devstral-Small-Vision-2505-GGUF, a similar vision experiment for Devstral, this is an experimental checkpoint of Magistral-Small-2506 with vision support.

Magistral Small is a GRPO-trained reasoning fine-tune of Mistral Small 3.1, which is a vision-capable LLM.

In its technical report, Mistral states that Magistral was fine-tuned on text-only data, yet the authors report results on the multimodal MMMU, MMMU-Pro, and MathVista benchmarks that show modest improvements despite the text-only training.
This suggests that Magistral successfully generalized its reasoning capabilities to multimodal inputs.

Mistral removed Magistral's vision encoder in their official release, possibly because of the performance gap between text-only and multimodal inputs.

In this model, I grafted Mistral Small 3.1's vision encoder onto Magistral Small. No further training was done, so the text-only performance of this model should match Mistral's official release.

The model was tested with vLLM and should work with any toolkit that supports Mistral Small 3.1. The Transformers implementation of Mistral 3 does not work well.

Make sure to use the system prompt provided in the `SYSTEM_PROMPT.txt` file (taken from Mistral's docs) and the sampling parameters `temperature=0.7, top_p=0.95`.

The code used to create this model can be found here: https://colab.research.google.com/drive/1UuMo4VSgVoD4GfLrFgHUJvCv0cdALR7m?usp=sharing.
It requires ~150 GB of system RAM (no VRAM is needed) since it loads three 24B models in BF16.
4-bit bitsandbytes quantization could reduce the memory requirement to roughly a quarter of that.

There may still be configuration errors in this model that reduce performance. Let me know if you encounter any weird behavior!