ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Zhenyang Liu1,2, Yongchong Gu1, Yikai Wang3,*,
Xiangyang Xue1,†, Yanwei Fu1,2,†

1Fudan University, 2Shanghai Innovation Institute, 3Nanyang Technological University

*Corresponding Author, †Co-corresponding Authors

Paper | Project Page | Video


πŸ“’ News & Roadmap

This repository is the official implementation of ActiveVLA. We are currently preparing the code and data for release. Please stay tuned!

  • Release the Code (Training & Inference scripts).
  • Release Pre-trained Models.
  • Release Evaluation Scripts (RLBench, COLOSSEUM, GemBench).
  • Release Real-Robot Control Code.

πŸ“– Abstract

Most existing Vision-Language-Action (VLA) models rely on static, wrist-mounted cameras that provide a fixed, end-effector-centric viewpoint. This setup limits perceptual flexibility: the agent cannot adaptively adjust its viewpoint or camera resolution according to the task context, leading to failures in long-horizon tasks or fine-grained manipulation due to occlusion and lack of detail.

We propose ActiveVLA, a novel vision-language-action framework that explicitly integrates active perception into robotic manipulation. Unlike passive perception methods, ActiveVLA empowers robots to:

  1. Actively Select Viewpoints: Autonomously determine optimal camera perspectives to maximize visibility and task relevance while minimizing occlusions.
  2. Active 3D Zoom-in: Selectively acquire high-resolution views of task-critical regions within the 3D scene.

By dynamically refining its perceptual input, ActiveVLA achieves superior adaptability and performance in complex scenarios. Experiments show that ActiveVLA outperforms state-of-the-art baselines on RLBench, COLOSSEUM, and GemBench, and transfers seamlessly to real-world robots.


πŸš€ Method: ActiveVLA

We propose a coarse-to-fine active perception framework that integrates 3D spatial reasoning with vision-language understanding.

The pipeline consists of two main stages:

  1. Critical Region Localization (Coarse Stage): Projects the 3D input onto multi-view 2D images and identifies critical 3D regions via predicted heatmaps (see the first sketch below).
  2. Active Perception Optimization (Fine Stage):
    • Active Viewpoint Selection: Uses a hypothesis-testing strategy to choose optimal viewpoints that maximize amodal relevance and diversity (second sketch below).
    • Active 3D Zoom-in: Applies a virtual optical zoom to raise the resolution of key areas for precise manipulation (third sketch below).
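
A minimal sketch of the coarse stage on toy data: a point cloud is projected onto a few virtual views, each view is scored by a 2D heatmap, and the per-view scores are back-projected onto the points to pick out a critical 3D region. Here `predict_heatmap` is a hypothetical stand-in for the learned, language-conditioned heatmap head, and the camera poses and image size are illustrative assumptions, not the released implementation.

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation/translation for a camera at cam_pos looking at target."""
    z = target - cam_pos
    z = z / np.linalg.norm(z)
    x = np.cross(z, up)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])            # rows are the camera axes in the world frame
    return R, -R @ cam_pos

def project(points, R, t, f=128.0, cx=64.0, cy=64.0):
    """Pinhole projection of Nx3 world points; returns pixel coordinates and depth."""
    pc = points @ R.T + t
    z = pc[:, 2]
    safe_z = np.maximum(z, 1e-6)
    uv = np.stack([f * pc[:, 0] / safe_z + cx, f * pc[:, 1] / safe_z + cy], axis=1)
    return uv, z

def predict_heatmap(hw=(128, 128)):
    """Hypothetical stand-in for the learned 2D heatmap predictor:
    here it simply marks pixels near the image centre as task-relevant."""
    yy, xx = np.mgrid[0:hw[0], 0:hw[1]]
    return np.exp(-((xx - hw[1] / 2) ** 2 + (yy - hw[0] / 2) ** 2) / (2 * 20.0 ** 2))

def localize_critical_region(points, cam_positions, top_ratio=0.05):
    """Accumulate per-view heatmap scores on every point and keep the top fraction."""
    scores = np.zeros(len(points))
    for cam_pos in cam_positions:
        R, t = look_at(np.asarray(cam_pos, dtype=float))
        uv, depth = project(points, R, t)
        heat = predict_heatmap()
        in_view = ((depth > 0)
                   & (uv[:, 0] >= 0) & (uv[:, 0] < heat.shape[1])
                   & (uv[:, 1] >= 0) & (uv[:, 1] < heat.shape[0]))
        u = uv[in_view, 0].astype(int)
        v = uv[in_view, 1].astype(int)
        scores[in_view] += heat[v, u]
    k = max(1, int(top_ratio * len(points)))
    return points[np.argsort(scores)[-k:]], scores

if __name__ == "__main__":
    pts = np.random.uniform(-0.5, 0.5, size=(4096, 3))            # toy scene
    cams = [(1.0, 0.0, 0.8), (0.0, 1.0, 0.8), (-1.0, 0.0, 0.8)]   # virtual views
    region, _ = localize_critical_region(pts, cams)
    print("critical region:", region.shape)
```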

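A minimal sketch of active viewpoint selection, assuming the coarse stage above has already produced a critical region: candidate camera positions are sampled on a hemisphere and scored by a relevance term (fraction of critical points that pass a crude occlusion check) plus a diversity term (distance to already-chosen views), then kept greedily. The weights and the occlusion proxy are illustrative assumptions, not the paper's exact hypothesis-testing formulation.

```python
import numpy as np

def sample_candidate_viewpoints(center, radius=0.8, n=32, seed=0):
    """Sample candidate camera positions on the upper hemisphere around the region."""
    rng = np.random.default_rng(seed)
    az = rng.uniform(0, 2 * np.pi, n)
    el = rng.uniform(np.deg2rad(15), np.deg2rad(75), n)
    dirs = np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=1)
    return center + radius * dirs

def visibility(cam, region, scene, eps=0.02):
    """Fraction of critical points not blocked by a closer scene point lying
    near the camera-to-point ray (a crude occlusion proxy)."""
    vis = 0
    for p in region:
        ray = p - cam
        dist = np.linalg.norm(ray)
        ray = ray / dist
        t = (scene - cam) @ ray                      # projection onto the ray
        closest = cam + np.outer(t, ray)
        blocked = ((np.linalg.norm(scene - closest, axis=1) < eps)
                   & (t > 0) & (t < dist - eps))
        vis += not blocked.any()
    return vis / len(region)

def select_viewpoints(region, scene, k=2, w_div=0.3):
    """Greedily keep the candidate that maximizes relevance + diversity each round."""
    center = region.mean(axis=0)
    cand_list = list(sample_candidate_viewpoints(center))
    chosen = []
    for _ in range(k):
        scores = []
        for c in cand_list:
            rel = visibility(c, region, scene)
            div = min((np.linalg.norm(c - s) for s in chosen), default=1.0)
            scores.append(rel + w_div * div)
        chosen.append(cand_list.pop(int(np.argmax(scores))))
    return np.stack(chosen)

if __name__ == "__main__":
    scene = np.random.uniform(-0.5, 0.5, size=(2048, 3))
    region = scene[:64]                               # pretend these are critical
    views = select_viewpoints(region, scene)
    print("selected viewpoints:\n", views)
```
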
Note: For more visualizations and real-world robot demos, please visit our Project Page.
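
A minimal sketch of the virtual optical zoom idea: keep the image resolution fixed but narrow the virtual camera's field of view around the critical region, so the same pixel budget covers a much smaller, more detailed part of the scene. The point-splat renderer, field-of-view values, and zoom factor below are toy assumptions standing in for the actual rendering used by ActiveVLA.

```python
import numpy as np

def splat(points, cam, target, fov_deg, hw=(128, 128)):
    """Render a crude occupancy image of `points` from a pinhole camera at `cam`
    looking at `target` with the given field of view."""
    z = target - cam
    z = z / np.linalg.norm(z)
    x = np.cross(z, np.array([0.0, 0.0, 1.0]))
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    pc = (points - cam) @ np.stack([x, y, z]).T       # camera-frame coordinates
    f = 0.5 * hw[1] / np.tan(np.deg2rad(fov_deg) / 2) # focal length in pixels
    u = (f * pc[:, 0] / pc[:, 2] + hw[1] / 2).astype(int)
    v = (f * pc[:, 1] / pc[:, 2] + hw[0] / 2).astype(int)
    img = np.zeros(hw)
    ok = (pc[:, 2] > 0) & (u >= 0) & (u < hw[1]) & (v >= 0) & (v < hw[0])
    img[v[ok], u[ok]] = 1.0
    return img

def zoom_in(region, cam, wide_fov=60.0, zoom_factor=4.0):
    """Re-aim the virtual camera at the region centre and narrow its FOV."""
    target = region.mean(axis=0)
    wide = splat(region, cam, target, wide_fov)
    tele = splat(region, cam, target, wide_fov / zoom_factor)
    return wide, tele

if __name__ == "__main__":
    # a tight cluster of points standing in for the critical region
    region = np.array([0.2, 0.0, 0.0]) + np.random.normal(scale=0.03, size=(512, 3))
    cam = np.array([1.0, 0.2, 0.8])
    wide, tele = zoom_in(region, cam)
    print("pixels covered by the region: wide %d -> zoomed %d"
          % (int(wide.sum()), int(tele.sum())))
```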


πŸ“Š Results

ActiveVLA achieves state-of-the-art performance across multiple benchmarks:

  • RLBench: Achieves an average success rate of 91.8%, ranking 1st in 10 tasks.
  • COLOSSEUM: Demonstrates superior robustness with a 65.9% success rate in challenging generalization scenarios.
  • GemBench: Outperforms all baselines with strong adaptability across diverse tasks.
  • Real World: High success rates in occlusion-heavy tasks (e.g., retrieving items from drawers, handling occluded objects).

πŸ“ Citation

If you find our work useful in your research, please consider citing:

@misc{liu2026activevlainjectingactiveperception,
      title={ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation}, 
      author={Zhenyang Liu and Yongchong Gu and Yikai Wang and Xiangyang Xue and Yanwei Fu},
      year={2026},
      eprint={2601.08325},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.08325}, 
}