InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Paper • 2512.18745
Note Can your AI agent truly "think with images"? Test it out on O3-Bench!
Note This is the vSearcher model introduced in our work.
Note In-loop RL training data for vSearcher.
Note Out-of-loop RL training data for vSearcher.