---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- AdaReasoner/AdaReasoner-TC-Randomized
- AdaReasoner/AdaReasoner-TG-Data
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
arxiv: 2601.18631
tags:
- agent
---

<div align="center">
<img src="logo.png" alt="Logo" width="300">
<h1 align="center">Dynamic Tool Orchestration for Iterative Visual Reasoning</h1>

<a href="https://huggingface.co/papers/2601.18631">
<img src="https://img.shields.io/badge/Paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" alt="Paper">
</a>
<a href="https://github.com/ssmisya/AdaReasoner/tree/main/docs">
<img src="https://img.shields.io/badge/Docs-1f6feb?style=for-the-badge&logo=readthedocs&logoColor=white" alt="Docs">
</a>
<a href="https://huggingface.co/collections/hitsmy/adareasoner">
<img src="https://img.shields.io/badge/Data%20%26%20Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="Data & Model">
</a>
<a href="https://adareasoner.github.io">
<img src="https://img.shields.io/badge/Homepage-2ea44f?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Homepage">
</a>
<a href="https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo">
<img src="https://img.shields.io/badge/Demo-FF7C00?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo">
</a>
<a href="https://www.youtube.com/watch?v=AtBoJYW_yDA">
<img src="https://img.shields.io/badge/Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Video">
</a>

</div>

---
## 📝 Model Description

**AdaReasoner-7B** is a vision-language model trained with dynamic tool orchestration capabilities for iterative visual reasoning. This card describes the **AdaReasoner-7B-Non-Randomized** variant.

We provide three variants of AdaReasoner-7B, each optimized for a different use case:

| Model | Description | Hugging Face |
|------|-------------|--------------|
| **AdaReasoner-7B-Randomized** | Trained with the *adaptive learning* method, enabling strong generalization to **unseen tools and tasks**. Designed for open-ended and evolving tool environments where adaptability is required. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Randomized) |
| **AdaReasoner-7B-Non-Randomized** | Trained **without adaptive learning**, providing **more stable and reliable performance on known tools and tasks** but limited generalization to unseen tools or task settings. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Non-Randomized) |
| **AdaReasoner-VSP-7B** | Task-specialized model trained **exclusively on the Visual Spatial Planning (VSP) task**, achieving strong performance on VSP benchmarks but not intended for cross-task generalization. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-VSP-7B) |

**Key Differences:**
- **Randomized**: Trained with the adaptive learning method, enabling zero-shot generalization to novel tools and task configurations
- **Non-Randomized**: Trained without adaptive learning, offering more predictable behavior on familiar tools but lacking generalization
- **VSP-7B**: Task-specific model fine-tuned exclusively on Visual Spatial Planning (VSP) benchmarks for optimal performance on navigation tasks

## 🚀 Quick Start

AdaReasoner-7B can be deployed for single-turn inference using standard frameworks such as vLLM. However, AdaReasoner is a tool-planning model, and its full capabilities require interaction with an external tool environment. To evaluate or exercise its tool-planning behavior end to end, we recommend [AdaEval](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval), provided in our repository, for batch inference and evaluation, or the [Demo](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo) interface for interactive, single-instance GUI-based reasoning.
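For a quick single-turn smoke test without the tool environment, a minimal `transformers` sketch along the following lines should work. It assumes the standard Qwen2.5-VL chat interface inherited from the base model; `build_messages`, `generate_answer`, and the prompt layout are illustrative helpers, not part of the repository.

```python
# Minimal single-turn inference sketch (illustrative, not the AdaEval pipeline).
# Assumes the Qwen2.5-VL chat format of the base model.

MODEL_ID = "AdaReasoner/AdaReasoner-7B-Non-Randomized"


def build_messages(image_path: str, question: str) -> list:
    # One user turn with an image reference and a text question, in the
    # chat-message layout expected by the Qwen2.5-VL processor.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def generate_answer(image_path: str, question: str, max_new_tokens: int = 512) -> str:
    # Heavyweight imports are kept inside the function so the message helper
    # above can be inspected without the model dependencies installed.
    import torch
    from qwen_vl_utils import process_vision_info
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = build_messages(image_path, question)
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[prompt], images=image_inputs, return_tensors="pt").to(
        model.device
    )

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens before decoding the model's answer.
    trimmed = output_ids[:, inputs["input_ids"].shape[1] :]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

For serving, something like `vllm serve AdaReasoner/AdaReasoner-7B-Non-Randomized` should expose an OpenAI-compatible endpoint; keep in mind that single-turn inference exercises only part of the model's tool-planning behavior.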

## 🎯 Capabilities

The model supports a diverse set of visual reasoning tasks, covering both structured reasoning and open-ended visual understanding:
- **Visual Spatial Planning**
  Navigation and verification tasks based on grid-world environments (VSPO and VSP), evaluating fine-grained spatial perception, multi-step path planning, and safety verification under out-of-distribution map configurations.
- **Compositional Visual Reasoning (Jigsaw)**
  Image reconstruction from shuffled patches (Jigsaw-COCO and BLINK-J), testing local–global consistency, part–whole reasoning, and visual compositional understanding.
- **GUI Question Answering (GUIQA)**
  Fine-grained reasoning over GUI screenshots, including interactive webpage understanding (GUIChat) and agent-centric UI reasoning from WebMMU (Agentic Action subset), emphasizing element grounding, action planning, and multi-step inference.
- **General Visual Question Answering (General VQA)**
  Open-ended visual reasoning beyond structured settings, evaluated on V* and HRBench, focusing on fine-grained visual search, attribute recognition, spatial relationship reasoning, and robustness to high-resolution, complex real-world scenes.

## 🛠️ Tool Integration

For full tool-augmented inference capabilities, please refer to the [AdaReasoner repository](https://github.com/ssmisya/AdaReasoner), which includes:

- Tool Server deployment
- The AdaEval evaluation framework
- The complete inference pipeline

## 📊 Performance

Please refer to our paper for detailed benchmark results across multiple visual reasoning tasks.

## 🔧 Technical Details

- **Base Architecture**: Qwen2.5-VL-7B-Instruct
- **Training Method**: Tool Cold Start (SFT) + Tool GRPO (RL) + Adaptive Learning
- **Context Length**: Supports extended context with multiple tool interactions
- **Modalities**: Text + Vision

## 📄 Citation

If you use this model in your research, please cite:

```bibtex
@article{song2026adareasoner,
  title={AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning},
  author={Song, Mingyang and Sun, Haoyu and Gu, Jiawei and Li, Linjie and Xu, Luxin and Krishna, Ranjay and Cheng, Yu},
  journal={arXiv preprint arXiv:2601.18631},
  year={2026}
}
```

## 📜 License

Apache 2.0

## 🤝 Acknowledgments

This model is part of the AdaReasoner project. For more information, visit our [GitHub repository](https://github.com/ssmisya/AdaReasoner).

## 📧 Contact

For questions and feedback, please open an issue in our [GitHub repository](https://github.com/ssmisya/AdaReasoner).