Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Abstract
Q-Zoom enhances MLLM performance by adaptively focusing computational resources on relevant visual regions through dynamic gating and self-distilled region proposal networks, achieving faster inference without sacrificing accuracy.
MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
Community
Say goodbye to "brute-force high resolution" in MLLMs. Introducing Q-Zoom: On-demand visual token allocation!
Excited to share our latest work, Q-Zoom, tackling the classic "computation explosion" bottleneck when Multimodal Large Language Models (MLLMs) process high-resolution images.
The Problem:
To read dense documents or spot tiny objects, current MLLMs rely on global dynamic resolution (generating thousands of visual tokens). This exhaustive approach has two fatal flaws:
Query-level Redundancy: Asking "Is it day or night?" still encodes the whole image in 4K, wasting massive compute.
Spatial Redundancy: Asking about tiny text in a corner still forces large, uninformative background regions (white walls, sky) through the heavy Transformer self-attention mechanism.
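To see why redundant tokens hurt so much, recall that self-attention cost grows quadratically with sequence length, so trimming visual tokens pays off superlinearly. A back-of-envelope sketch (token counts and hidden size chosen for illustration, not taken from the paper):

```python
# Back-of-envelope: per-layer self-attention FLOPs scale as O(n^2 * d)
# in sequence length n. The numbers below are illustrative only.
def attn_cost(n_tokens: int, d_model: int = 4096) -> float:
    """Approximate per-layer attention FLOPs (QK^T plus AV matmuls)."""
    return 2.0 * (n_tokens ** 2) * d_model

full_res = attn_cost(4096)   # encoding the whole image at high resolution
roi_only = attn_cost(1024)   # encoding only a query-relevant crop

# 4x fewer tokens -> 16x cheaper attention, because cost is quadratic.
print(f"relative attention cost: {full_res / roi_only:.0f}x")
```

This quadratic scaling is why query-aware token allocation, rather than uniform resolution scaling, is the lever Q-Zoom pulls.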
Our Solution: Q-Zoom
We propose a query-aware adaptive high-resolution perception framework. The core idea: decide whether high resolution is needed, and where, directly from the model's intermediate layers.
Three Core Designs:
Lightweight Dynamic Gating: For simple questions, it outputs answers directly from coarse features, safely bypassing high-res processing to massively boost throughput.
Self-Distilled RPN (SD-RPN): For complex questions, it leverages the LLM's internal cross-modal attention to predict a precise Region-of-Interest (RoI) heatmap, enabling local high-res cropping.
Spatio-Temporal Post-SFT: Seamlessly fuses the high-res local features with low-res global context, fixing the loss of spatial awareness caused by cropping.
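The coarse-to-fine flow above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `gate_score`, `roi_from_heatmap`, and `qzoom_step` are hypothetical stand-ins, and the heatmap-to-box rule here is a deliberately simple thresholding scheme.

```python
import numpy as np

def roi_from_heatmap(heat: np.ndarray, thresh: float = 0.5):
    """Turn a cross-modal attention heatmap into one bounding box
    covering all cells above a fraction of the peak value
    (a simple stand-in for the SD-RPN's localization)."""
    ys, xs = np.where(heat >= thresh * heat.max())
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
    return int(x0), int(y0), int(x1), int(y1)

def qzoom_step(gate_score: float, heat: np.ndarray, tau: float = 0.5):
    """Route one query: answer from coarse features, or zoom into an RoI."""
    if gate_score < tau:                    # gating: coarse features suffice
        return "coarse", None
    return "zoom", roi_from_heatmap(heat)   # else: localize, then crop

# Toy attention map with a hot spot near the top-right corner.
heat = np.zeros((8, 8))
heat[2:4, 5:7] = 1.0
print(qzoom_step(0.9, heat))  # -> ('zoom', (5, 2, 7, 4))
print(qzoom_step(0.1, heat))  # -> ('coarse', None)
```

The returned box would then drive a local high-res crop, which Post-SFT fuses back with the coarse global layout.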
Key Results & Highlights:
Breaking the Accuracy-Efficiency Trade-off (see Pareto curve): On Qwen2.5-VL 7B, Q-Zoom surpasses the peak accuracy of a native 4096-token baseline using at most 1024 visual tokens!
Massive Speedups: 2.52x faster (53.0% fewer tokens) on Doc/OCR tasks, and up to 4.39x faster (73.2% fewer tokens) on extreme high-res/dense vision tasks!
Orthogonal to "Slow Thinking" & SOTA Architectures: Achieves consistent, significant gains on LLaVA, Qwen2.5-VL, and Qwen3-VL. Crucially, Q-Zoom seamlessly integrates with the newest RL-trained "Thinking-with-Image" models (e.g., ZwZ), delivering a further performance leap on top of powerful visual slow-thinking capabilities!
Friendly Training Cost:
No expensive human bounding box annotations. No memory-hungry RL. The entire framework relies on self-supervised distillation and consistency-aware sample generation. Total training takes < 8 hours on just 4x A6000 GPUs.
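The consistency-aware label generation can be illustrated with a toy sketch. Caveat: the exact criterion Q-Zoom uses may differ; `routing_label` and its agreement rule are assumptions for illustration, where "run the model at a given resolution" is abstracted into precomputed answer strings.

```python
from collections import Counter

def routing_label(answers_coarse: list[str], answer_highres: str,
                  min_agree: float = 1.0) -> str:
    """Derive a deterministic gate-training label for one query.

    Label it 'coarse' only if the coarse-resolution answers are
    self-consistent AND agree with the high-resolution answer;
    otherwise route it to high-res processing ('zoom').
    Hypothetical rule, not the paper's exact criterion.
    """
    top_answer, count = Counter(answers_coarse).most_common(1)[0]
    consistent = count / len(answers_coarse) >= min_agree
    return "coarse" if consistent and top_answer == answer_highres else "zoom"

# Coarse samples unanimously match the high-res answer -> gate can skip.
print(routing_label(["day", "day", "day"], "day"))   # -> coarse
# Inconsistent coarse answers -> the query needs high-res perception.
print(routing_label(["cat", "dog", "cat"], "cat"))   # -> zoom
```

Because the labels come from the model's own agreement behavior, no human bounding boxes or RL rollouts are needed, which is what keeps the reported training budget small.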
Resources:
Project Page (w/ more visual results): https://yuhengsss.github.io/Q-Zoom/
Paper: https://arxiv.org/pdf/2604.06912