MJ1: Multimodal Judgment via Grounded Verification
Paper • 2603.07990 • Published
Haize Labs 2026
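MJ1 achieves the best average score (77.0) with only 3B active parameters, surpassing every API-based and open-source model listed. It leads on T2I and Editing, while Gemini 3 Pro remains ahead on Interleaved and Reasoning. Best results in bold, second-best in italics.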
| Judge | T2I | Editing | Interleaved | Reasoning | Avg. |
|---|---|---|---|---|---|
| Open-source multimodal LLMs | | | | | |
| Gemma 3 4B | 51.7 | 51.0 | 51.3 | 48.8 | 50.7 |
| Gemma 3 12B | 56.0 | 58.0 | 58.0 | 49.3 | 55.3 |
| Gemma 3 27B | 58.3 | 60.2 | 61.1 | 49.4 | 57.3 |
| Qwen2.5-VL-7B | 50.4 | 57.1 | 48.4 | 47.5 | 50.9 |
| Qwen2.5-VL-72B | 59.1 | 64.6 | 62.3 | 50.0 | 59.0 |
| Qwen3-VL-8B | 59.4 | 61.7 | 61.5 | 54.6 | 59.3 |
| Qwen3-VL-32B | 64.1 | 67.3 | 70.5 | 56.6 | 64.6 |
| Qwen3-VL-30B-A3B | 60.0 | 59.5 | 57.3 | 57.3 | 58.5 |
| Qwen3-VL-235B-A22B | 62.0 | 64.8 | 69.0 | 55.9 | 62.9 |
| API-based models | | | | | |
| GPT-4o | 60.3 | 65.0 | 61.5 | 51.9 | 59.7 |
| GPT-4.1 | 65.8 | 68.2 | 67.0 | 53.0 | 63.5 |
| GPT-5 | 70.5 | 73.8 | 74.4 | 70.2 | 72.2 |
| Gemini 2.5 Flash | 63.1 | 66.5 | 69.4 | 57.5 | 64.1 |
| Gemini 2.5 Pro | 70.5 | 71.3 | *75.1* | 66.6 | 70.9 |
| Gemini 3 Pro | *74.4* | *74.9* | **76.4** | **79.5** | *76.3* |
| MJ1 (Qwen3-VL-30B-A3B + LoRA) | **80.2** | **78.1** | 73.5 | *76.4* | **77.0** |
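
The table can be sanity-checked with a few lines of Python. The sketch below copies four rows from the table and assumes, as an illustration only, that the Avg. column is the unweighted mean of the four task columns. The source does not state how Avg. is computed, and small rounding differences are expected. The helper names (`mean_of_tasks`, `best_per_task`, `gain_over`) are hypothetical and not part of any MJ1 release.

```python
# Scores copied from the table: (T2I, Editing, Interleaved, Reasoning, reported Avg.)
SCORES = {
    "Qwen3-VL-30B-A3B": (60.0, 59.5, 57.3, 57.3, 58.5),
    "GPT-5": (70.5, 73.8, 74.4, 70.2, 72.2),
    "Gemini 3 Pro": (74.4, 74.9, 76.4, 79.5, 76.3),
    "MJ1": (80.2, 78.1, 73.5, 76.4, 77.0),
}
TASKS = ("T2I", "Editing", "Interleaved", "Reasoning")


def mean_of_tasks(row):
    """Unweighted mean of the four task scores (assumed definition of Avg.)."""
    return sum(row[:4]) / 4


def best_per_task(scores):
    """Name of the top-scoring judge in each task column."""
    return {
        task: max(scores, key=lambda name: scores[name][i])
        for i, task in enumerate(TASKS)
    }


def gain_over(scores, model, baseline):
    """Per-task score difference between two judges, rounded to one decimal."""
    return {
        task: round(scores[model][i] - scores[baseline][i], 1)
        for i, task in enumerate(TASKS)
    }


if __name__ == "__main__":
    for name, row in SCORES.items():
        print(f"{name}: mean={mean_of_tasks(row):.3f} reported={row[4]}")
    print(best_per_task(SCORES))
    print(gain_over(SCORES, "MJ1", "Qwen3-VL-30B-A3B"))
```

Comparing MJ1 against its Qwen3-VL-30B-A3B row isolates the change on the same base model, which is the comparison the "+ LoRA" label in the table invites.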
@misc{kumar2026mj1multimodaljudgmentgrounded,
title={MJ1: Multimodal Judgment via Grounded Verification},
author={Bhavesh Kumar and Dylan Feng and Leonard Tang},
year={2026},
eprint={2603.07990},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.07990},
}