Update README.md
# Magistral-Small-2506-Vision

Inspired by https://huggingface.co/ngxson/Devstral-Small-Vision-2505-GGUF, a similar vision experiment for Devstral, this is an experimental checkpoint of Magistral-Small-2506 with vision support.

Magistral Small is a GRPO-trained reasoning fine-tune of Mistral Small 3.1, which is a vision-capable LLM.
In its technical report, Mistral states that Magistral was fine-tuned on text-only data, yet the authors report modest improvements on the MMMU, MMMU-Pro, and MathVista benchmarks despite the text-only training. This suggests that Magistral successfully generalized its reasoning capabilities to multimodal data.
Mistral removed Magistral's vision encoder in their official release. This may be because of the performance gap between text-only and multimodal inputs.
In this model, I grafted Mistral Small 3.1's vision encoder onto Magistral Small. No further training was performed, so the text-only performance of this model should match Mistral's official release.
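The graft amounts to copying the vision-encoder tensors from one checkpoint into the other while leaving the language-model weights untouched. A minimal sketch of that idea, using plain dicts in place of real tensors — the `vision_tower.` prefix is an assumption for illustration, not necessarily the naming used in the actual checkpoints:

```python
# Sketch of the graft: copy every vision-encoder tensor from a donor state
# dict (standing in for Mistral Small 3.1) into the target (standing in for
# Magistral Small). The "vision_tower." prefix is an assumed naming scheme.

def graft_vision_encoder(target, donor, prefix="vision_tower."):
    """Return a new state dict: target weights plus the donor's vision weights."""
    merged = dict(target)  # keep all language-model weights from the target
    for name, tensor in donor.items():
        if name.startswith(prefix):
            merged[name] = tensor  # bring over vision-encoder weights only
    return merged

# Toy demonstration with floats standing in for tensors:
magistral = {"model.layers.0.weight": 1.0}                    # text-only model
mistral31 = {"model.layers.0.weight": 2.0,
             "vision_tower.patch_embed.weight": 3.0}          # has vision
merged = graft_vision_encoder(magistral, mistral31)
```

Note that the target's own language-model weights win whenever a key exists in both checkpoints; only keys under the vision prefix come from the donor.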
The model was tested with vLLM and should work with any toolkit supporting Mistral Small 3.1. The Transformers implementation of Mistral 3 does not work well.
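When served behind vLLM's OpenAI-compatible API, image inputs use the standard OpenAI multimodal message shape. A small sketch of building such a message — the image URL is a placeholder, not from this model card:

```python
# Sketch: the multimodal chat-message shape accepted by an OpenAI-compatible
# endpoint (such as one served by vLLM). The URL below is a placeholder.

def build_image_message(text, image_url):
    """Build one user message mixing text and an image, OpenAI-style."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_image_message("Describe this image.", "https://example.com/cat.png")
```

A message built this way can be passed in the `messages` list of a chat-completions request to the running server.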
I will soon benchmark the model on several vision benchmarks, since there may still be configuration errors in this model that reduce performance.