<div align="center">

**ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation**

**Zhenyang Liu**<sup>1,2</sup>, **Yongchong Gu**<sup>1</sup>, **Yikai Wang**<sup>3,*</sup>,
**Xiangyang Xue**<sup>1,†</sup>, **Yanwei Fu**<sup>1,2,†</sup>

<sup>1</sup>Fudan University, <sup>2</sup>Shanghai Innovation Institute, <sup>3</sup>Nanyang Technological University

<sup>*</sup>Corresponding Author, <sup>†</sup>Co-corresponding Authors

[![Paper](https://img.shields.io/badge/Paper-Arxiv-b31b1b.svg)](https://arxiv.org/abs/2601.08325v1)
[![Project Page](https://img.shields.io/badge/Project-Website-blue.svg)](https://zhenyangliu.github.io/ActiveVLA/)
[![Video](https://img.shields.io/badge/Video-YouTube-red.svg)](https://zhenyangliu.github.io/ActiveVLA)

</div>

---

## 📢 News & Roadmap

This repository is the official implementation of **ActiveVLA**. We are currently preparing the code and data for release. Please stay tuned!

- [ ] **Release the Code** (training & inference scripts).
- [ ] **Release Pre-trained Models**.
- [ ] **Release Evaluation Scripts** (RLBench, COLOSSEUM, GemBench).
- [ ] **Release Real-Robot Control Code**.

---

## 📖 Abstract

Most existing Vision-Language-Action (VLA) models rely on static, wrist-mounted cameras that provide a fixed, end-effector-centric viewpoint. This setup limits perceptual flexibility: the agent cannot adapt its viewpoint or camera resolution to the task context, so occlusion and lack of detail lead to failures in long-horizon or fine-grained manipulation tasks.

We propose **ActiveVLA**, a vision-language-action framework that explicitly integrates **active perception** into robotic manipulation. Unlike passive perception methods, ActiveVLA empowers robots to:
1. **Actively Select Viewpoints:** Autonomously determine optimal camera perspectives that maximize visibility and task relevance while minimizing occlusions.
2. **Actively Zoom In (3D):** Selectively obtain high-resolution views of task-critical regions within the 3D scene.

By dynamically refining its perceptual input, ActiveVLA achieves superior adaptability and performance in complex scenarios. Experiments show that ActiveVLA outperforms state-of-the-art baselines on **RLBench**, **COLOSSEUM**, and **GemBench**, and transfers seamlessly to real-world robots.

---

## 🚀 Method: ActiveVLA

We propose a coarse-to-fine active perception framework that integrates 3D spatial reasoning with vision-language understanding.

The pipeline consists of two main stages (an illustrative sketch follows the list):
1. **Critical Region Localization (Coarse Stage):** Projects the 3D input onto multi-view 2D projections and identifies critical 3D regions via predicted heatmaps.
2. **Active Perception Optimization (Fine Stage):**
   * **Active Viewpoint Selection:** Uses a hypothesis-testing strategy to choose optimal viewpoints that maximize amodal relevance and diversity.
   * **Active 3D Zoom-in:** Applies a virtual optical zoom effect to improve resolution in key areas for precise manipulation.
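
Since the implementation is still pending release (see the roadmap above), the snippet below is only a minimal, hypothetical NumPy sketch of what such a coarse-to-fine loop could look like. Every name in it (`project_points`, `coarse_localize`, `select_viewpoint`, `zoom_in`, the `heatmap_fn` / `relevance_fn` callbacks, and the `R` / `t` / `K` view entries) is our own illustration rather than the paper's API, and the simple relevance-plus-diversity score merely stands in for the hypothesis-testing strategy described above.

```python
"""Illustrative coarse-to-fine active-perception sketch (not the ActiveVLA code)."""
import numpy as np


def project_points(points, view):
    """Project Nx3 world points into a virtual pinhole camera (hypothetical layout)."""
    R, t, K = view["R"], view["t"], view["K"]      # extrinsics and intrinsics
    cam = R @ points.T + t[:, None]                # world frame -> camera frame
    uv = (K @ cam)[:2] / cam[2:3]                  # perspective projection
    return uv.T, cam[2]                            # pixel coordinates, depths


def coarse_localize(points, views, heatmap_fn):
    """Coarse stage: accumulate per-view 2D heatmap scores back onto the 3D points."""
    scores = np.zeros(len(points))
    for view in views:
        uv, depth = project_points(points, view)
        visible = depth > 0
        scores[visible] += heatmap_fn(uv[visible], view)  # model-predicted relevance
    return points[np.argmax(scores)], scores              # critical 3D region centre


def select_viewpoint(candidates, center, chosen, relevance_fn, lam=0.5):
    """Fine stage (a): pick the candidate view that balances relevance to the
    critical region against redundancy with already-selected viewpoints."""
    def diversity(view):
        if not chosen:
            return 1.0
        return min(np.linalg.norm(view["t"] - c["t"]) for c in chosen)
    return max(candidates, key=lambda v: relevance_fn(v, center) + lam * diversity(v))


def zoom_in(view, center, zoom=2.0):
    """Fine stage (b): virtual optical zoom, i.e. re-aim at the critical region and
    scale the focal lengths so it is rendered at higher effective resolution."""
    zoomed = dict(view)
    zoomed["K"] = view["K"].copy()
    zoomed["K"][0, 0] *= zoom        # fx
    zoomed["K"][1, 1] *= zoom        # fy
    zoomed["look_at"] = center       # target for re-rendering the zoomed view
    return zoomed
```

In the actual system the heatmap prediction and viewpoint scoring are learned components; the stubs above only indicate where such modules would plug into the loop.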

> **Note:** For more visualizations and real-world robot demos, please visit our [**Project Page**](https://zhenyangliu.github.io/ActiveVLA).

---

## 📊 Results

ActiveVLA achieves state-of-the-art performance across multiple benchmarks:

- **RLBench:** Achieves an average success rate of **91.8%**, ranking first on 10 tasks.
- **COLOSSEUM:** Demonstrates superior robustness with a **65.9%** success rate in challenging generalization scenarios.
- **GemBench:** Outperforms all baselines with strong adaptability across diverse tasks.
- **Real World:** Achieves high success rates in occlusion-heavy tasks (e.g., retrieving items from drawers, handling occluded objects).

---

## 📝 Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{liu2026activevlainjectingactiveperception,
      title={ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation},
      author={Zhenyang Liu and Yongchong Gu and Yikai Wang and Xiangyang Xue and Yanwei Fu},
      year={2026},
      eprint={2601.08325},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.08325},
}
```