---
license: apache-2.0
datasets:
- hitsmy/AdaReasoner-TC-Randomized
- hitsmy/AdaReasoner-TG-Data-Randomized
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- agent
---
## 📖 Model Description

**AdaReasoner-7B** is a vision-language model trained with dynamic tool orchestration capabilities for iterative visual reasoning. This model is **AdaReasoner-7B-Randomized**.

We provide three variants of AdaReasoner-7B, each optimized for different use cases:

| Model | Description | Hugging Face |
|-------|-------------|--------------|
| **AdaReasoner-7B-Randomized** | Trained with the *adaptive learning* method, enabling strong generalization to **unseen tools and tasks**. Designed for open-ended and evolving tool environments where adaptability is required. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Randomized/) |
| **AdaReasoner-7B-Non-Randomized** | Trained **without adaptive learning**, providing **more stable and reliable performance on known tools and tasks**, but limited generalization to unseen tools or task settings. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-7B-Non-Randomized) |
| **AdaReasoner-VSP-7B** | Task-specialized model trained **exclusively on the Visual Spatial Planning (VSP) task**, achieving strong performance on VSP benchmarks but not intended for cross-task generalization. | [🤗 Link](https://huggingface.co/AdaReasoner/AdaReasoner-VSP-7B) |

**Key Differences:**

- **Randomized**: Trained with the adaptive learning method, enabling zero-shot generalization to novel tools and task configurations
- **Non-Randomized**: Trained without adaptive learning, offering more predictable behavior on familiar tools but lacking generalization
- **VSP-7B**: Task-specific model fine-tuned exclusively on Visual Spatial Planning (VSP) benchmarks for optimal performance on navigation tasks

## 🚀 Quick Start

AdaReasoner-7B can be deployed for single-turn inference using standard inference frameworks such as vLLM. However, AdaReasoner is a tool-planning model whose full capabilities require interaction with an external tool environment.
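For plain single-turn use, the model can be queried like any other vision-language model. Below is a minimal sketch using the OpenAI-style multimodal chat schema accepted by vLLM's OpenAI-compatible server; the image URL, question, and server address are placeholders, not values from this repository.

```python
# Minimal single-turn request sketch. The message layout follows the
# OpenAI-style multimodal chat schema that vLLM's server accepts; the
# concrete URL and question below are illustrative placeholders.

def build_messages(image_url: str, question: str) -> list[dict]:
    """Assemble a single-turn vision-language chat request."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]

# With a vLLM server running locally, the request could be sent roughly
# like this (commented out so the sketch stays self-contained):
#
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# resp = client.chat.completions.create(
#     model="AdaReasoner/AdaReasoner-7B-Randomized",
#     messages=build_messages("https://example.com/map.png",
#                             "Is the planned path safe?"),
# )
# print(resp.choices[0].message.content)
```

Note that a single-turn call exercises only the model's base VQA ability; the tool-planning behavior requires the multi-turn environment described below.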
To fully evaluate or utilize its tool-planning behavior, we recommend using [AdaEval](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval) provided in our repository for batch inference and evaluation, or trying the [Demo](https://github.com/ssmisya/AdaReasoner/tree/main/tool_server/tf_eval/demo) interface for interactive, single-instance GUI-based reasoning.

## 🎯 Capabilities

The model supports a diverse set of visual reasoning tasks, covering both structured reasoning and open-ended visual understanding:

- **Visual Spatial Planning**: Navigation and verification tasks based on grid-world environments (VSPO and VSP), evaluating fine-grained spatial perception, multi-step path planning, and safety verification under out-of-distribution map configurations.
- **Compositional Visual Reasoning (Jigsaw)**: Image reconstruction from shuffled patches (Jigsaw-COCO and BLINK-J), testing local–global consistency, part–whole reasoning, and visual compositional understanding.
- **GUI Question Answering (GUIQA)**: Fine-grained reasoning over GUI screenshots, including interactive webpage understanding (GUIChat) and agent-centric UI reasoning from WebMMU (Agentic Action subset), emphasizing element grounding, action planning, and multi-step inference.
- **General Visual Question Answering (General VQA)**: Open-ended visual reasoning beyond structured settings, evaluated on V* and HRBench, focusing on fine-grained visual search, attribute recognition, spatial relationship reasoning, and robustness to high-resolution, complex real-world scenes.

## 🛠️ Tool Integration

For full tool-augmented inference capabilities, please refer to the [AdaReasoner repository](https://github.com/ssmisya/AdaReasoner), which includes:

- Tool Server deployment
- The AdaEval evaluation framework
- The complete inference pipeline

## 📊 Performance

Please refer to our paper for detailed benchmark results across multiple visual reasoning tasks.
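At a high level, tool-augmented inference alternates between model generation and tool execution until the model commits to an answer. The sketch below illustrates that loop only; `generate`, `run_tool`, and the `<tool_call>`/`<answer>` tags are hypothetical stand-ins, not the actual interface of the Tool Server or AdaEval.

```python
# Hypothetical sketch of an iterative tool-orchestration loop: the model
# proposes a tool call, the environment executes it, and the observation
# is appended to the conversation until a final answer is produced.
# The tag names and callbacks are illustrative assumptions.
import re


def tool_loop(generate, run_tool, messages, max_turns: int = 5) -> str:
    """Alternate between model generation and tool execution."""
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})

        # Stop once the model emits a final answer.
        answer = re.search(r"<answer>(.*?)</answer>", reply, re.S)
        if answer:
            return answer.group(1).strip()

        # Otherwise execute the requested tool and feed back the result.
        call = re.search(r"<tool_call>(.*?)</tool_call>", reply, re.S)
        if call:
            observation = run_tool(call.group(1).strip())
            messages.append({"role": "tool", "content": observation})
    return ""  # turn budget exhausted without an answer
```

In practice, AdaEval and the Tool Server handle this orchestration (including batching and tool dispatch) for you; the loop above is only meant to convey the interaction pattern.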
## 🔧 Technical Details

- **Base Architecture**: Qwen 2.5 VL 7B Instruct
- **Training Method**: Tool Cold Start (SFT) + Tool GRPO (RL) + Adaptive Learning
- **Context Length**: Supports extended context with multiple tool interactions
- **Modalities**: Text + Vision

## 📄 Citation

If you use this model in your research, please cite:

```bibtex
@article{adareasoner2024,
  title={Dynamic Tool Orchestration for Iterative Visual Reasoning},
  author={AdaReasoner Team},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```

## 📜 License

Apache 2.0

## 🤝 Acknowledgments

This model is part of the AdaReasoner project. For more information, visit our [GitHub repository](https://github.com/ssmisya/AdaReasoner).

## 📧 Contact

For questions and feedback, please open an issue in our [GitHub repository](https://github.com/ssmisya/AdaReasoner).