---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- hitsmy/AdaReasoner-TC-Randomized
- hitsmy/AdaReasoner-TG-Data-Randomized
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
arxiv: 2601.18631
---

<div align="center">
  <img src="docs/logo.png" alt="Logo" width="300">
  <h1 align="center">Dynamic Tool Orchestration for Iterative Visual Reasoning</h1>

  <a href="https://arxiv.org/abs/2601.18631">
    <img src="https://img.shields.io/badge/Paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" alt="Paper">
  </a>
  <a href="https://github.com/ssmisya/AdaReasoner/tree/main/docs">
    <img src="https://img.shields.io/badge/Docs-1f6feb?style=for-the-badge&logo=readthedocs&logoColor=white" alt="Docs">
  </a>
  <a href="https://huggingface.co/collections/hitsmy/adareasoner">
    <img src="https://img.shields.io/badge/Data%20%26%20Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="Data & Model">
  </a>
  <a href="https://adareasoner.github.io">
    <img src="https://img.shields.io/badge/Homepage-2ea44f?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Homepage">
  </a>

  <a href="https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo">
  <img src="https://img.shields.io/badge/Demo-FF7C00?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo">
  </a>
  <a href="https://www.youtube.com/watch?v=AtBoJYW_yDA">
    <img src="https://img.shields.io/badge/Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Video">
  </a>
    
</div>


---

## 📋 Model Description

**AdaReasoner-7B** is a vision-language model trained with dynamic tool-orchestration capabilities for iterative visual reasoning. This repository hosts the **AdaReasoner-7B-Randomized** variant. The model was introduced in the paper [AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning](https://arxiv.org/abs/2601.18631).

We provide three variants of AdaReasoner-7B, each optimized for different use cases:

| Model | Description | Hugging Face |
|------|-------------|--------------|
| **AdaReasoner-7B-Randomized** | Trained with the *adaptive learning* method, enabling strong generalization to **unseen tools and tasks**. Designed for open-ended and evolving tool environments where adaptability is required. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Randomized/) |
| **AdaReasoner-7B-Non-Randomized** | Trained **without adaptive learning**, providing **more stable and reliable performance on known tools and tasks**, but limited generalization to unseen tools or task settings. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Non-Randomized) |
| **AdaReasoner-VSP-7B** | Task-specialized model trained **exclusively on the Visual Spatial Planning (VSP) task**, achieving strong performance on VSP benchmarks but not intended for cross-task generalization. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-VSP-7B) |



**Key Differences:**
- **Randomized**: Trained with adaptive learning method, enabling zero-shot generalization to novel tools and task configurations
- **Non-Randomized**: Trained without adaptive learning, offering more predictable behavior on familiar tools but lacking generalization
- **VSP-7B**: Task-specific model fine-tuned exclusively on Visual Spatial Planning (VSP) benchmarks for optimal performance on navigation tasks

## 🚀 Quick Start

AdaReasoner-7B can be deployed for single-turn inference using standard inference frameworks such as vLLM or the `transformers` library.

However, AdaReasoner is a tool-planning model whose full capabilities require interaction with an external tool environment.
To fully evaluate or utilize its tool-planning behavior, we recommend using [AdaEval](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval) provided in our repository for batch inference and evaluation, or trying the [Demo](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo) interface for interactive, single-instance GUI-based reasoning.
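As a sketch, single-turn inference follows the standard Qwen2.5-VL recipe in `transformers` (assuming a version with Qwen2.5-VL support and the `qwen_vl_utils` helper package published alongside Qwen2.5-VL). The image path, question, and generation settings below are illustrative placeholders:

```python
def build_messages(image_path: str, question: str) -> list:
    """Build a single-turn Qwen2.5-VL chat message with one image and one question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]


def run(image_path: str, question: str, max_new_tokens: int = 512) -> str:
    """Single-turn generation with AdaReasoner-7B-Randomized (no tool environment)."""
    # Heavy dependencies are imported lazily so build_messages stays lightweight.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info  # image-loading helper for Qwen2.5-VL

    model_id = "AdaReasoner/AdaReasoner-7B-Randomized"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

Note that this covers plain single-turn generation only; tool-augmented, multi-turn reasoning requires the AdaEval harness described above.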

## 🎯 Capabilities

The model supports a diverse set of visual reasoning tasks, covering both structured reasoning and open-ended visual understanding:
- **Visual Spatial Planning**: navigation and verification tasks in grid-world environments (VSPO and VSP), evaluating fine-grained spatial perception, multi-step path planning, and safety verification under out-of-distribution map configurations.
- **Compositional Visual Reasoning (Jigsaw)**: image reconstruction from shuffled patches (Jigsaw-COCO and BLINK-J), testing local–global consistency, part–whole reasoning, and visual compositional understanding.
- **GUI Question Answering (GUIQA)**: fine-grained reasoning over GUI screenshots, including interactive webpage understanding (GUIChat) and agent-centric UI reasoning from WebMMU (Agentic Action subset), emphasizing element grounding, action planning, and multi-step inference.
- **General Visual Question Answering (General VQA)**: open-ended visual reasoning beyond structured settings, evaluated on V* and HRBench, focusing on fine-grained visual search, attribute recognition, spatial-relationship reasoning, and robustness to high-resolution, complex real-world scenes.

## 🛠️ Tool Integration

For full tool-augmented inference capabilities, please refer to the [AdaReasoner repository](https://github.com/ssmisya/AdaReasoner) which includes:

- Tool Server deployment
- AdaEval evaluation framework
- Complete inference pipeline

## 📊 Performance

Please refer to our paper for detailed benchmark results across multiple visual reasoning tasks.

## 🔧 Technical Details

- **Base Architecture**: Qwen2.5-VL-7B-Instruct
- **Training Method**: Tool Cold Start (SFT) + Tool GRPO (RL) + Adaptive Learning
- **Context Length**: supports extended contexts spanning multiple tool-interaction turns
- **Modalities**: Text + Vision

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@article{song2026adareasoner,
  title={AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning},
  author={Song, Mingyang and Sun, Haoyu and Gu, Jiawei and Li, Linjie and Xu, Luxin and Krishna, Ranjay and Cheng, Yu},
  journal={arXiv preprint arXiv:2601.18631},
  year={2026}
}
```

## 📄 License

Apache 2.0

## 🤝 Acknowledgments

This model is part of the AdaReasoner project. For more information, visit our [GitHub repository](https://github.com/ssmisya/AdaReasoner).

## 📧 Contact

For questions and feedback, please open an issue in our [GitHub repository](https://github.com/ssmisya/AdaReasoner).