<div align="center">

**ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation**

**Zhenyang Liu**<sup>1,2</sup>, **Yongchong Gu**<sup>1</sup>, **Yikai Wang**<sup>3,*</sup>,
**Xiangyang Xue**<sup>1,†</sup>, **Yanwei Fu**<sup>1,2,†</sup>

<sup>1</sup>Fudan University, <sup>2</sup>Shanghai Innovation Institute, <sup>3</sup>Nanyang Technological University

<sup>*</sup>Corresponding Author, <sup>†</sup>Co-corresponding Authors

[arXiv](https://arxiv.org/abs/2601.08325v1) | [Project Page](https://zhenyangliu.github.io/ActiveVLA/)

</div>

---

## News & Roadmap

This repository is the official implementation of **ActiveVLA**. We are currently preparing the code and data for release. Please stay tuned!

- [ ] **Release the Code** (training & inference scripts).
- [ ] **Release Pre-trained Models**.
- [ ] **Release Evaluation Scripts** (RLBench, COLOSSEUM, GemBench).
- [ ] **Release Real-Robot Control Code**.

---

## Abstract

Most existing Vision-Language-Action (VLA) models rely on static, wrist-mounted cameras that provide a fixed, end-effector-centric viewpoint. This setup limits perceptual flexibility: the agent cannot adaptively adjust its viewpoint or camera resolution according to the task context, leading to failures in long-horizon tasks or fine-grained manipulation due to occlusion and lack of detail.

We propose **ActiveVLA**, a novel vision-language-action framework that explicitly integrates **active perception** into robotic manipulation. Unlike passive perception methods, ActiveVLA empowers robots to:
1. **Actively Select Viewpoints:** Autonomously determine optimal camera perspectives to maximize visibility and task relevance while minimizing occlusions.
2. **Active 3D Zoom-in:** Selectively obtain high-resolution views of task-critical regions within the 3D scene.

By dynamically refining its perceptual input, ActiveVLA achieves superior adaptability and performance in complex scenarios. Experiments show that ActiveVLA outperforms state-of-the-art baselines on **RLBench**, **COLOSSEUM**, and **GemBench**, and transfers seamlessly to real-world robots.

---

## Method: ActiveVLA

We propose a coarse-to-fine active perception framework that integrates 3D spatial reasoning with vision-language understanding.

The pipeline consists of two main stages:
1. **Critical Region Localization (Coarse Stage):** Projects the 3D input into multi-view 2D views and identifies critical 3D regions via heatmaps.
2. **Active Perception Optimization (Fine Stage):**
   * **Active Viewpoint Selection:** Uses a hypothesis-testing strategy to choose optimal viewpoints that maximize amodal relevance and diversity (see the first sketch below).
   * **Active 3D Zoom-in:** Applies a virtual optical zoom to improve resolution in key areas for precise manipulation (see the second sketch below).
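
The fine stage is easiest to picture as a scoring-and-selection loop. The sketch below is illustrative only and is **not** the released ActiveVLA code: it assumes a set of candidate camera positions, a point cloud carrying per-point relevance scores (e.g., lifted from the coarse-stage heatmaps), and a simple relevance-plus-diversity score in place of the paper's hypothesis-testing criterion; all names and shapes are hypothetical.

```python
# Illustrative sketch (not the released ActiveVLA code): greedily pick viewpoints
# that see a lot of task-relevant geometry while looking at the critical region
# from mutually different directions. Names, shapes, and the score are assumptions.
import numpy as np

def view_direction(cam_pos, target):
    """Unit vector from a candidate camera position toward the critical 3D region."""
    d = target - cam_pos
    return d / (np.linalg.norm(d) + 1e-8)

def select_viewpoints(cam_positions, points, relevance, target, k=2, lam=0.5):
    """Greedy selection of k viewpoints scored by relevance coverage + diversity."""
    dirs = np.stack([view_direction(c, target) for c in cam_positions])

    # Relevance term: total relevance of points lying in front of each camera
    # (a crude visibility proxy standing in for a real occlusion test).
    rel = np.array([
        relevance[((points - c) @ d) > 0.0].sum()
        for c, d in zip(cam_positions, dirs)
    ])
    rel = rel / (rel.max() + 1e-8)

    chosen = []
    for _ in range(k):
        best_i, best_score = -1, -np.inf
        for i in range(len(cam_positions)):
            if i in chosen:
                continue
            # Diversity term: penalize directions similar to already-chosen views.
            div = 1.0 if not chosen else 1.0 - max(float(dirs[i] @ dirs[j]) for j in chosen)
            score = rel[i] + lam * div
            if score > best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
    return chosen

# Toy usage: 20 candidate cameras on a sphere around a critical region at the origin.
rng = np.random.default_rng(0)
cams = rng.normal(size=(20, 3))
cams /= np.linalg.norm(cams, axis=1, keepdims=True)
pts = rng.normal(scale=0.2, size=(500, 3))
heat = rng.random(500)  # stand-in for heatmap values lifted onto the point cloud
print(select_viewpoints(cams, pts, heat, target=np.zeros(3), k=3))
```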
> **Note:** For more visualizations and real-world robot demos, please visit our [**Project Page**](https://zhenyangliu.github.io/ActiveVLA).
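
The active 3D zoom-in can similarly be pictured as a virtual change of camera intrinsics: scaling the focal length around the critical region so that a fixed-size render spends its pixels on a tighter field of view. The second sketch below is a minimal illustration under that assumption, not the paper's implementation; the intrinsics, points, and zoom factor are made up for the demo.

```python
# Illustrative sketch (not the released ActiveVLA code): a "virtual optical zoom"
# realized by scaling pinhole intrinsics around a pixel of interest while keeping
# the output image size fixed. All numbers and names are assumptions.
import numpy as np

def project(K, points_cam):
    """Pinhole projection of camera-frame 3D points (N, 3) to pixel coordinates."""
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def zoom_intrinsics(K, zoom, center_uv, image_size):
    """Intrinsics that magnify the view `zoom`x around pixel `center_uv`,
    re-centering the principal point so the region of interest stays mid-image."""
    w, h = image_size
    Kz = K.astype(float).copy()
    Kz[0, 0] *= zoom  # fx
    Kz[1, 1] *= zoom  # fy
    Kz[0, 2] = w / 2.0 - zoom * (center_uv[0] - K[0, 2])
    Kz[1, 2] = h / 2.0 - zoom * (center_uv[1] - K[1, 2])
    return Kz

# Toy usage: a few points near a critical region ~0.6 m in front of the camera.
K = np.array([[256.0, 0.0, 128.0],
              [0.0, 256.0, 128.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.10, 0.02, 0.6], [0.12, 0.03, 0.6], [0.11, -0.01, 0.6]])
center = project(K, np.array([[0.11, 0.01, 0.6]]))[0]             # pixel of the critical region
print(project(K, pts))                                             # coarse view: points a few pixels apart
print(project(zoom_intrinsics(K, 4.0, center, (256, 256)), pts))   # zoomed view: ~4x more pixel separation
```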
---

## Results

ActiveVLA achieves state-of-the-art performance across multiple benchmarks:

- **RLBench:** Achieves an average success rate of **91.8%**, ranking 1st on 10 tasks.
- **COLOSSEUM:** Demonstrates superior robustness with a **65.9%** success rate in challenging generalization scenarios.
- **GemBench:** Outperforms all baselines with strong adaptability across diverse tasks.
- **Real World:** High success rates in occlusion-heavy tasks (e.g., retrieving items from drawers, handling occluded objects).

---

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{liu2026activevlainjectingactiveperception,
      title={ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation},
      author={Zhenyang Liu and Yongchong Gu and Yikai Wang and Xiangyang Xue and Yanwei Fu},
      year={2026},
      eprint={2601.08325},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.08325},
}
```