Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
Paper
•
2505.15436
•
Published
•
2
This is a supervised fine-tuned (SFT) vision-language model based on Qwen/Qwen2.5-VL-7B-Instruct. It is trained on the CoF-SFT-Data-5.4k dataset, which contains 5.4k image-text reasoning examples.