<div align="center"> 
  
**ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation**

**Zhenyang Liu**<sup>1,2</sup>, **Yongchong Gu**<sup>1</sup>, **Yikai Wang**<sup>3</sup>,  
**Xiangyang Xue**<sup>1,†</sup>, **Yanwei Fu**<sup>1,2,†</sup>

<sup>1</sup>Fudan University, <sup>2</sup>Shanghai Innovation Institute, <sup>3</sup>Nanyang Technological University

<sup>†</sup>Corresponding Authors

[![Paper](https://img.shields.io/badge/Paper-Arxiv-b31b1b.svg)](https://arxiv.org/abs/2601.08325v1)
[![Project Page](https://img.shields.io/badge/Project-Website-blue.svg)](https://zhenyangliu.github.io/ActiveVLA/)
[![Video](https://img.shields.io/badge/Video-YouTube-red.svg)](https://zhenyangliu.github.io/ActiveVLA)

</div>

---

## 📒 News & Roadmap

This repository is the official implementation of **ActiveVLA**. We are currently preparing the code and data for release. Please stay tuned!

- [ ] **Release the Code** (Training & Inference scripts).
- [ ] **Release Pre-trained Models**.
- [ ] **Release Evaluation Scripts** (RLBench, COLOSSEUM, GemBench).
- [ ] **Release Real-Robot Control Code**.

---

## 📖 Abstract

Most existing Vision-Language-Action (VLA) models rely on static, wrist-mounted cameras that provide a fixed, end-effector-centric viewpoint. This setup limits perceptual flexibility: the agent cannot adaptively adjust its viewpoint or camera resolution according to the task context, leading to failures in long-horizon tasks or fine-grained manipulation due to occlusion and lack of detail.

We propose **ActiveVLA**, a novel vision-language-action framework that explicitly integrates **active perception** into robotic manipulation. Unlike passive perception methods, ActiveVLA empowers robots to:
1.  **Actively Select Viewpoints:** Autonomously determine optimal camera perspectives to maximize visibility and task relevance while minimizing occlusions.
2.  **Active 3D Zoom-in:** Selectively render task-critical regions of the 3D scene at higher resolution.

By dynamically refining its perceptual input, ActiveVLA achieves superior adaptability and performance in complex scenarios. Experiments show that ActiveVLA outperforms state-of-the-art baselines on **RLBench**, **COLOSSEUM**, and **GemBench**, and transfers seamlessly to real-world robots.

---

## 🚀 Method: ActiveVLA

We propose a coarse-to-fine active perception framework that integrates 3D spatial reasoning with vision-language understanding.

The pipeline consists of two main stages, sketched in the code snippet below:
1.  **Critical Region Localization (Coarse Stage):** Projects 3D inputs onto multi-view 2D projections to identify critical 3D regions via heatmaps.
2.  **Active Perception Optimization (Fine Stage):**
    *   **Active Viewpoint Selection:** Uses a hypothesis testing strategy to choose optimal viewpoints that maximize amodal relevance and diversity.
    *   **Active 3D Zoom-in:** Applies a virtual optical zoom effect to improve resolution in key areas for precise manipulation.
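
Since the official code is still being prepared for release, the snippet below is only a minimal, self-contained sketch of what this coarse-to-fine loop could look like. All function names, the orthographic projections, and the scoring weights are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of ActiveVLA's coarse-to-fine active perception loop.
# Everything here (names, projections, weights, toy data) is a placeholder.
import numpy as np

def orthographic_project(points, axis, res=64):
    """Drop one coordinate and quantize the remaining two onto a res x res grid."""
    keep = [i for i in range(3) if i != axis]
    uv = points[:, keep]
    uv = (uv - uv.min(0)) / (uv.max(0) - uv.min(0) + 1e-8)   # normalize to [0, 1]
    return np.clip((uv * (res - 1)).astype(int), 0, res - 1)

def localize_critical_region(points, heatmaps, res=64):
    """Coarse stage: lift multi-view 2D heatmaps back onto the 3D points and
    return the highest-scoring point as the critical-region center."""
    scores = np.zeros(len(points))
    for axis, heat in enumerate(heatmaps):                    # one heatmap per 2D projection
        uv = orthographic_project(points, axis, res)
        scores += heat[uv[:, 0], uv[:, 1]]
    return points[scores.argmax()]

def select_viewpoint(candidates, critical_point, chosen, w_div=0.5):
    """Fine stage (a): score candidate camera positions by relevance
    (closeness to the critical region) and diversity (angular distance to
    already-selected views), then keep the best hypothesis."""
    def score(cam):
        relevance = 1.0 / (1.0 + np.linalg.norm(critical_point - cam))
        if chosen:
            cos_sims = [np.dot(cam, c) / (np.linalg.norm(cam) * np.linalg.norm(c) + 1e-8)
                        for c in chosen]
            diversity = 1.0 - max(cos_sims)                   # prefer directions not yet covered
        else:
            diversity = 1.0
        return relevance + w_div * diversity
    return max(candidates, key=score)

def zoom_in_3d(points, critical_point, radius=0.1):
    """Fine stage (b): stand-in for the virtual optical zoom -- keep only the
    part of the scene inside a small ball around the critical region, which
    would then be re-rendered at higher resolution."""
    mask = np.linalg.norm(points - critical_point, axis=1) < radius
    return points[mask]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.uniform(-0.5, 0.5, size=(2048, 3))            # toy point cloud
    heatmaps = [rng.random((64, 64)) for _ in range(3)]       # stand-in task heatmaps
    target = localize_critical_region(scene, heatmaps)
    views = [rng.uniform(-1.0, 1.0, size=3) for _ in range(16)]
    best_view = select_viewpoint(views, target, chosen=[])
    crop = zoom_in_3d(scene, target)
    print("critical point:", target, "| best viewpoint:", best_view,
          "| points in zoomed crop:", len(crop))
```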

> **Note:** For more visualizations and real-world robot demos, please visit our [**Project Page**](https://zhenyangliu.github.io/ActiveVLA).

---

## 📊 Results

ActiveVLA achieves state-of-the-art performance across multiple benchmarks:

- **RLBench:** Achieves an average success rate of **91.8%**, ranking first on 10 tasks.
- **COLOSSEUM:** Demonstrates superior robustness with a **65.9%** success rate in challenging generalization scenarios.
- **GemBench:** Outperforms all baselines with strong adaptability across diverse tasks.
- **Real World:** High success rates in occlusion-heavy tasks (e.g., retrieving items from drawers, handling occluded objects).

---

## πŸ“ Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{liu2026activevlainjectingactiveperception,
      title={ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation}, 
      author={Zhenyang Liu and Yongchong Gu and Yikai Wang and Xiangyang Xue and Yanwei Fu},
      year={2026},
      eprint={2601.08325},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.08325}, 
}
```