Update README.md
README.md

language:
- vi
- hi
- bn
---

# Magistral-Small-2506-Vision

This is an experimental checkpoint of Magistral Small 2506 with vision support.

Magistral Small is a GRPO-powered reasoning fine-tune of Mistral Small 3.1, which is a vision-capable LLM.

In its technical report, Mistral states that Magistral was fine-tuned on text-only data, yet the report also includes results on the MMMU, MMMU-Pro, and MathVista benchmarks, which show modest improvements despite the text-only training. This suggests that Magistral successfully generalized its reasoning capabilities to multimodal data.

Mistral removed Magistral's vision encoder in their official release, possibly because of the performance gap between text-only and multimodal inputs.

In this model, I grafted Mistral Small 3.1's vision encoder onto Magistral Small. No further training was done, so the text-only performance of this model should match Mistral's official release.
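
The graft is pure weight surgery: keep Magistral's language-model weights and copy the vision tower and multimodal projector over from Mistral Small 3.1. Here is a minimal sketch of the idea, assuming both checkpoints ship a single consolidated safetensors file and that the vision weights sit under the prefixes shown (file names and key prefixes are my assumptions, not confirmed details of these checkpoints):

```python
# Sketch: graft Mistral Small 3.1's vision weights onto Magistral Small.
# Paths and key prefixes are illustrative; check the actual checkpoints.
from safetensors.torch import load_file, save_file

donor = load_file("Mistral-Small-3.1/consolidated.safetensors")    # vision donor
base = load_file("Magistral-Small-2506/consolidated.safetensors")  # text/reasoning base

# Assumed prefixes for the vision tower and the vision-to-text projector.
VISION_PREFIXES = ("vision_encoder.", "vision_language_adapter.")

merged = dict(base)  # start from Magistral's weights, unchanged
for name, tensor in donor.items():
    if name.startswith(VISION_PREFIXES):
        merged[name] = tensor  # copy the vision weights over

save_file(merged, "Magistral-Small-2506-Vision/consolidated.safetensors")
```

Because the language-model weights are never touched, text-only behavior is unaffected by construction.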

The model was tested with vLLM and should work with any toolkit that supports Mistral Small 3.1. The Transformers implementation of Mistral 3 does not work well.
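
For reference, a minimal offline-inference sketch with vLLM might look like the following. The model path and image URL are placeholders, and the three `"mistral"` format flags assume the checkpoint is distributed in Mistral's native consolidated format rather than the HF Transformers layout:

```python
# Sketch: offline multimodal inference with vLLM (paths/URLs are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Magistral-Small-2506-Vision",  # placeholder repo id or local path
    tokenizer_mode="mistral",   # load Mistral-native tokenizer, config, and weights
    config_format="mistral",
    load_format="mistral",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image, reasoning step by step."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=1024))
print(outputs[0].outputs[0].text)
```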

I will soon benchmark the model on several vision benchmarks.