๐Ÿช SenseNova-MARS

Overview

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-32B scores 74.3 on MMSearch and 54.4 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Pro and GPT-5.2. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities.
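The overview names BN-GSPO but does not define it. As rough orientation only, the sketch below shows the sequence-level importance ratio and clipped surrogate associated with GSPO-style training, with rewards standardized within each group of sampled responses. The second, batch-wide standardization step is an assumption about what "batch-normalized" could mean, not the paper's definition, and all function names and the `clip_eps` value are illustrative.

```python
import math
from statistics import mean, pstdev


def standardize(values, eps=1e-6):
    """Shift to zero mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def advantages(group_rewards):
    """group_rewards holds one list of scalar rewards per prompt (one group each).

    Step 1: standardize rewards within each group.
    Step 2 (ASSUMPTION, not taken from the source): standardize the
    resulting values once more across the whole batch.
    """
    per_group = [standardize(g) for g in group_rewards]
    flat = standardize([a for g in per_group for a in g])
    # Restore the original group structure.
    out, i = [], 0
    for g in group_rewards:
        out.append(flat[i:i + len(g)])
        i += len(g)
    return out


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio from token log-probs."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate applied to a whole sequence (clip_eps is illustrative)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

In this sketch every token of a response shares one ratio and one advantage, which is the sequence-level part; how the actual method normalizes advantages should be taken from the paper.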

Overall performance of SenseNova-MARS-32B compared with other models across six benchmarks. SenseNova-MARS-32B surpasses proprietary models such as Gemini-3-Pro and GPT-5.2 on search-oriented benchmarks such as MMSearch and HR-MMSearch.

Demo example: SenseNova-MARS tackles a challenging visual task by using an integrated suite of text search, image search, and image crop tools within its reasoning process.
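The interleaving of reasoning and tool calls described above can be pictured as a simple loop. The following is a minimal sketch under stated assumptions: the tool functions are stubs, and `Step`, `run_agent`, `scripted_policy`, and the argument names are hypothetical and do not reflect the actual SenseNova-MARS tool interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical tool stubs. Real tools would query a search backend or crop pixels.
def text_search(query: str) -> str:
    return f"[text results for: {query}]"


def image_search(image_id: str) -> str:
    return f"[pages with images similar to: {image_id}]"


def image_crop(image_id: str, box: Tuple[int, int, int, int]) -> str:
    return f"[crop of {image_id} at {box}]"


TOOLS: Dict[str, Callable[..., str]] = {
    "text_search": text_search,
    "image_search": image_search,
    "image_crop": image_crop,
}


@dataclass
class Step:
    thought: str
    tool: str = ""                      # empty string means "answer now"
    args: dict = field(default_factory=dict)
    answer: str = ""


def run_agent(policy: Callable[[List[str]], Step], question: str,
              image_id: str, max_turns: int = 4) -> Tuple[str, List[str]]:
    """Alternate between model reasoning and tool observations."""
    context = [f"question: {question}", f"image: {image_id}"]
    for _ in range(max_turns):
        step = policy(context)
        context.append(f"thought: {step.thought}")
        if not step.tool:
            return step.answer, context
        observation = TOOLS[step.tool](**step.args)
        context.append(f"observation: {observation}")
    return "", context                  # turn budget exhausted without an answer


def scripted_policy(context: List[str]) -> Step:
    """Stand-in for the VLM: crop first, then search, then answer."""
    seen = sum(line.startswith("observation:") for line in context)
    if seen == 0:
        return Step("The detail is small, zoom in.", "image_crop",
                    {"image_id": "img0", "box": (10, 10, 60, 60)})
    if seen == 1:
        return Step("Look up what the crop shows.", "text_search",
                    {"query": "object in cropped region"})
    return Step("Enough evidence gathered.", answer="(answer from observations)")


answer, trace = run_agent(scripted_policy, "What is shown in the corner?", "img0")
```

Each observation is appended to the context before the next reasoning step, so later thoughts can build on earlier tool results.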

Benchmark Performance

Search-oriented benchmarks

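| Type | Model | Average | MMSearch | HR-MMSearch | FVQA-test | InfoSeek | SimpleVQA | LiveVQA | MAT-Search |
|---|---|---|---|---|---|---|---|---|---|
| Direct Answer | | | | | | | | | |
| Open-source | Qwen2.5-VL-7B-Instruct | 27.70 | 7.60 | 0.58 | 26.28 | 31.95 | 47.88 | 19.63 | 60.00 |
| Open-source | Qwen3-VL-8B-Instruct | 29.24 | 11.70 | 12.13 | 24.22 | 23.15 | 42.94 | 23.18 | 67.33 |
| Open-source | Qwen2.5-VL-32B-Instruct | 32.01 | 11.70 | 3.93 | 30.50 | 36.65 | 48.57 | 21.40 | 71.33 |
| Open-source | Qwen3-VL-32B-Instruct | 35.22 | 16.96 | 19.02 | 32.17 | 28.95 | 45.90 | 31.59 | 72.67 |
| Proprietary | GPT-4o-mini | 33.08 | 15.79 | 1.31 | 36.83 | 35.95 | 44.42 | 24.63 | 72.66 |
| Proprietary | Gemini-2.5-Flash | 40.87 | 21.64 | 7.54 | 43.78 | 44.10 | 55.48 | 31.57 | 82.00 |
| Proprietary | GPT-4o | 42.38 | 23.39 | 13.11 | 48.00 | 52.90 | 51.73 | 28.18 | 79.33 |
| Proprietary | GPT-5 | 50.24 | 35.09 | 22.62 | 54.39 | 54.15 | 61.70 | 44.39 | 79.33 |
| Proprietary | GPT-5.2 | 50.92 | 43.27 | 24.92 | 50.94 | 50.40 | 59.92 | 47.00 | 80.00 |
| Proprietary | Gemini-3-Flash | 53.68 | 57.31 | 21.97 | 56.50 | 54.85 | 63.57 | 38.90 | 82.67 |
| Proprietary | Gemini-3-Pro | 55.87 | 62.57 | 26.89 | 59.22 | 56.30 | 64.07 | 40.06 | 82.00 |
| Agentic Model (zero-shot) | | | | | | | | | |
| Open-source | Qwen2.5-VL-7B-Instruct | 35.50 | 32.16 | 19.34 | 36.00 | 28.80 | 42.35 | 22.52 | 67.33 |
| Open-source | Qwen3-VL-8B-Instruct | 51.52 | 47.37 | 27.87 | 53.61 | 46.15 | 62.29 | 39.37 | 84.00 |
| Open-source | Qwen2.5-VL-32B-Instruct | 53.45 | 49.71 | 33.44 | 52.22 | 50.10 | 65.15 | 42.17 | 81.33 |
| Open-source | Qwen3-VL-32B-Instruct | 53.82 | 49.12 | 34.43 | 54.28 | 49.85 | 64.17 | 42.87 | 82.00 |
| Proprietary | GPT-4o-mini | 45.65 | 38.60 | 26.23 | 50.00 | 42.35 | 50.84 | 31.54 | 80.00 |
| Proprietary | GPT-4o | 55.09 | 49.12 | 30.16 | 66.34 | 59.55 | 63.67 | 40.09 | 76.67 |
| Proprietary | Gemini-2.5-Flash | 58.05 | 59.06 | 40.00 | 61.72 | 53.70 | 68.81 | 47.75 | 75.33 |
| Proprietary | GPT-5 | 60.12 | 52.63 | 38.36 | 62.61 | 55.95 | 70.58 | 56.02 | 84.67 |
| Proprietary | Gemini-3-Flash | 61.26 | 62.57 | 41.64 | 64.89 | 61.10 | 67.92 | 48.06 | 82.67 |
| Proprietary | GPT-5.2 | 67.64 | 66.08 | 48.20 | 68.78 | 65.55 | 78.18 | 65.99 | 80.67 |
| Proprietary | Gemini-3-Pro | 69.06 | 74.27 | 48.52 | 72.61 | 66.45 | 75.91 | 59.69 | 86.00 |
| Agentic Model | | | | | | | | | |
| Open-source | Visual-ARFT | 40.13 | 34.50 | 24.92 | 41.72 | 37.95 | 42.45 | 25.40 | 74.00 |
| Open-source | DeepMMSearch-R1 | - | - | - | - | 47.51 | 55.87 | - | - |
| Open-source | MMSearch-R1 | 52.49 | 53.80 | 20.33 | 58.40 | 55.10 | 57.40 | 48.40 | 74.00 |
| Open-source | DeepEyesV2 | - | 63.70 | - | 60.60 | 51.10 | 59.40 | - | - |
| Open-source | SenseNova-MARS-8B | 64.20 | 67.84 | 41.64 | 67.11 | 61.70 | 70.19 | 56.22 | 84.67 |
| Open-source | SenseNova-MARS-32B | 69.74 | 74.27 | 54.43 | 72.61 | 65.25 | 74.14 | 60.83 | 86.67 |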
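The Average column appears to be the unweighted mean of the seven benchmark scores in each row. The short check below, using four rows copied from the table, is a sketch to make that reading explicit; the dictionary layout and the `macro_average` helper are illustrative names, not part of any released evaluation code.

```python
# Order of scores: MMSearch, HR-MMSearch, FVQA-test, InfoSeek,
# SimpleVQA, LiveVQA, MAT-Search (values copied from the table above).
ROWS = {
    "SenseNova-MARS-8B": (64.20, [67.84, 41.64, 67.11, 61.70, 70.19, 56.22, 84.67]),
    "SenseNova-MARS-32B": (69.74, [74.27, 54.43, 72.61, 65.25, 74.14, 60.83, 86.67]),
    "Gemini-3-Pro (agentic, zero-shot)": (69.06, [74.27, 48.52, 72.61, 66.45, 75.91, 59.69, 86.00]),
    "GPT-5.2 (agentic, zero-shot)": (67.64, [66.08, 48.20, 68.78, 65.55, 78.18, 65.99, 80.67]),
}


def macro_average(scores):
    """Unweighted mean over benchmarks."""
    return sum(scores) / len(scores)


for model, (reported, scores) in ROWS.items():
    computed = macro_average(scores)
    # Reported values are rounded to two decimals, so allow a small tolerance.
    assert abs(computed - reported) < 0.006, model
    print(f"{model}: reported {reported:.2f}, computed {computed:.2f}")
```

Rows with missing entries (shown as "-") have no reported average, which is consistent with the mean being taken only when all seven scores are available.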

High-resolution benchmarks

| Model | V* Bench | HR-Bench 4K | HR-Bench 8K | MME RealWorld | Avg. |
|---|---|---|---|---|---|
| Direct Answer | | | | | |
| Gemini-2.5-Pro | 83.8 | 87.3 | 85.4 | - | - |
| GPT-4o | 67.5 | 65.0 | 59.6 | 62.8 | 63.7 |
| LLaVA-OneVision | 75.4 | 63.0 | 59.8 | 57.4 | 63.9 |
| Qwen2.5-VL-7B-Instruct | 75.3 | 65.5 | 62.1 | 56.8 | 64.9 |
| Qwen2.5-VL-32B-Instruct | 80.6 | 69.3 | 63.6 | 59.1 | 68.2 |
| Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 61.9 | 75.5 |
| Agentic Model | | | | | |
| SEAL | 74.8 | - | - | - | - |
| Qwen3-VL-32B-Instruct | 91.1 | 84.6 | 81.6 | - | - |
| Qwen3-VL-235B-A22B-Instruct | 93.7 | 85.4 | 82.4 | - | - |
| Monet | 83.3 | 71.0 | 68.0 | - | - |
| Pixel-Reasoner | 84.3 | 72.6 | 66.1 | 64.4 | 71.9 |
| DeepEyes | 83.3 | 73.2 | 69.5 | 64.1 | 72.5 |
| Thyme | 82.2 | 77.0 | 72.0 | 64.8 | 74.0 |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 64.9 | 74.6 |
| Mini-o3 | 88.2 | 77.5 | 73.3 | 65.5 | 76.1 |
| Skywork-R1V4 | 88.0 | 82.8 | 79.8 | 71.4 | 80.5 |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | 67.9 | 80.4 |
| SenseNova-MARS-32B | 94.2 | 90.2 | 86.6 | 72.7 | 85.9 |

Citation

@article{SenseNova-MARS,
  title={SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning},
  author={Yong Xien Chng and Tao Hu and Wenwen Tong and Xueheng Li and Jiandong Chen and Haojia Yu and Jiefan Lu and Hewei Guo and Hanming Deng and Chengjun Xie and Gao Huang and Dahua Lin and Lewei Lu},
  journal={arXiv preprint arXiv:2512.24330},
  year={2025}
}