---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- hitsmy/AdaReasoner-TC-Randomized
- hitsmy/AdaReasoner-TG-Data-Randomized
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
arxiv: 2601.18631
---
<div align="center">
<img src="docs/logo.png" alt="Logo" width="300">
<h1 align="center">Dynamic Tool Orchestration for Iterative Visual Reasoning</h1>

<a href="https://arxiv.org/abs/2601.18631">
<img src="https://img.shields.io/badge/Paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" alt="Paper">
</a>
<a href="https://github.com/ssmisya/AdaReasoner/tree/main/docs">
<img src="https://img.shields.io/badge/Docs-1f6feb?style=for-the-badge&logo=readthedocs&logoColor=white" alt="Docs">
</a>
<a href="https://huggingface.co/collections/hitsmy/adareasoner">
<img src="https://img.shields.io/badge/Data%20%26%20Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="Data & Model">
</a>
<a href="https://adareasoner.github.io">
<img src="https://img.shields.io/badge/Homepage-2ea44f?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Homepage">
</a>

<a href="https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo">
<img src="https://img.shields.io/badge/Demo-FF7C00?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo">
</a>
<a href="https://www.youtube.com/watch?v=AtBoJYW_yDA">
<img src="https://img.shields.io/badge/Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Video">
</a>

</div>

---
## 📋 Model Description

**AdaReasoner-7B** is a vision-language model trained with dynamic tool orchestration capabilities for iterative visual reasoning. This repository hosts the **AdaReasoner-7B-Randomized** variant. The model was introduced in the paper [AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning](https://arxiv.org/abs/2601.18631).

We provide three variants of AdaReasoner-7B, each optimized for different use cases:

| Model | Description | Hugging Face |
|-------|-------------|--------------|
| **AdaReasoner-7B-Randomized** | Trained with the *adaptive learning* method, enabling strong generalization to **unseen tools and tasks**. Designed for open-ended and evolving tool environments where adaptability is required. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Randomized/) |
| **AdaReasoner-7B-Non-Randomized** | Trained **without adaptive learning**, providing **more stable and reliable performance on known tools and tasks**, but limited generalization to unseen tools or task settings. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Non-Randomized) |
| **AdaReasoner-VSP-7B** | Task-specialized model trained **exclusively on the Visual Spatial Planning (VSP) task**, achieving strong performance on VSP benchmarks but not intended for cross-task generalization. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-VSP-7B) |
**Key Differences:**
- **Randomized**: Trained with the adaptive learning method, enabling zero-shot generalization to novel tools and task configurations
- **Non-Randomized**: Trained without adaptive learning, offering more predictable behavior on familiar tools but limited generalization
- **VSP-7B**: Task-specific model fine-tuned exclusively on Visual Spatial Planning (VSP) benchmarks for optimal performance on navigation tasks
## 🚀 Quick Start

AdaReasoner-7B can be deployed for single-turn inference using standard inference frameworks such as vLLM or the `transformers` library.
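As a starting point, here is a minimal single-turn sketch using `transformers`. It assumes the checkpoint exposes the standard Qwen2.5-VL interface (the model is built on Qwen/Qwen2.5-VL-7B-Instruct); the image path and question are placeholders, and this plain single-turn call does not exercise the model's tool-planning loop:

```python
# Minimal single-turn inference sketch. Assumes a recent `transformers`
# release with Qwen2.5-VL support; `grid.png` and the question below are
# placeholders, not part of the AdaReasoner release.

MODEL_ID = "AdaReasoner/AdaReasoner-7B-Randomized"


def build_messages(image_path: str, question: str) -> list:
    """Assemble a single-turn chat message in the Qwen2.5-VL format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_single_turn(image_path: str, question: str, max_new_tokens: int = 512) -> str:
    """Lazily load the model and processor, then run one generation step."""
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    messages = build_messages(image_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens before decoding the answer.
    trimmed = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

For example, `run_single_turn("grid.png", "Is the planned path collision-free?")` returns the model's textual answer for one image-question pair.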
However, AdaReasoner is a tool-planning model whose full capabilities require interaction with an external tool environment. To fully evaluate or utilize its tool-planning behavior, we recommend using [AdaEval](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval), provided in our repository, for batch inference and evaluation, or trying the [Demo](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo) interface for interactive, single-instance GUI-based reasoning.
## 🎯 Capabilities

The model supports a diverse set of visual reasoning tasks, covering both structured reasoning and open-ended visual understanding:

- **Visual Spatial Planning**
  Navigation and verification tasks based on grid-world environments (VSPO and VSP), evaluating fine-grained spatial perception, multi-step path planning, and safety verification under out-of-distribution map configurations.
- **Compositional Visual Reasoning (Jigsaw)**
  Image reconstruction from shuffled patches (Jigsaw-COCO and BLINK-J), testing local–global consistency, part–whole reasoning, and visual compositional understanding.
- **GUI Question Answering (GUIQA)**
  Fine-grained reasoning over GUI screenshots, including interactive webpage understanding (GUIChat) and agent-centric UI reasoning from WebMMU (Agentic Action subset), emphasizing element grounding, action planning, and multi-step inference.
- **General Visual Question Answering (General VQA)**
  Open-ended visual reasoning beyond structured settings, evaluated on V* and HRBench, focusing on fine-grained visual search, attribute recognition, spatial relationship reasoning, and robustness to high-resolution, complex real-world scenes.
## 🛠️ Tool Integration

For full tool-augmented inference capabilities, please refer to the [AdaReasoner repository](https://github.com/ssmisya/AdaReasoner), which includes:

- Tool Server deployment
- The AdaEval evaluation framework
- The complete inference pipeline
## 📊 Performance

Please refer to our paper for detailed benchmark results across multiple visual reasoning tasks.
## 🔧 Technical Details

- **Base Architecture**: Qwen2.5-VL-7B-Instruct
- **Training Method**: Tool Cold Start (SFT) + Tool GRPO (RL) + Adaptive Learning
- **Context Length**: Extended context supporting multiple tool interactions
- **Modalities**: Text + Vision
## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@article{song2026adareasoner,
  title={AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning},
  author={Song, Mingyang and Sun, Haoyu and Gu, Jiawei and Li, Linjie and Xu, Luxin and Krishna, Ranjay and Cheng, Yu},
  journal={arXiv preprint arXiv:2601.18631},
  year={2026}
}
```
## 📄 License

This model is released under the Apache 2.0 license.
## 🤝 Acknowledgments

This model is part of the AdaReasoner project. For more information, visit our [GitHub repository](https://github.com/ssmisya/AdaReasoner).
## 📧 Contact

For questions and feedback, please open an issue in our [GitHub repository](https://github.com/ssmisya/AdaReasoner).