|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: robotics |
|
|
library_name: transformers |
|
|
tags: |
|
|
- multimodal |
|
|
- robotics |
|
|
- vision-language-action |
|
|
- univla |
|
|
- action-decoder |
|
|
- continuous-control |
|
|
datasets: |
|
|
- VLA-Arena/VLA_Arena_L0_L_rlds |
|
|
--- |
|
|
|
|
|
# UniVLA - Action Decoder (Deployment Head) |
|
|
|
|
|
## About VLA-Arena |
|
|
|
|
|
**VLA-Arena** is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes: |
|
|
|
|
|
1. **Task Structure**: 170+ tasks grouped into four key dimensions: |
|
|
* **Safety**: Operating reliably under strict constraints. |
|
|
* **Distractor**: Handling environmental unpredictability. |
|
|
* **Extrapolation**: Generalizing to unseen scenarios. |
|
|
* **Long Horizon**: Executing complex, multi-step tasks. |
|
|
2. **Language Command**: Variations in instruction complexity. |
|
|
3. **Visual Observation**: Perturbations in visual input. |
|
|
|
|
|
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on **L0** tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints. |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
This model is the **Action Decoder Head** for **UniVLA**. Unlike the Latent Action Model (LAM), which tokenizes video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment. |
|
|
|
|
|
Its specific role is **Detokenization**: it takes the sequence of **Latent Action Tokens** (predicted by the VLM backbone) and **Visual Embeddings**, and decodes them into precise, continuous **Action Chunks** (7-DoF trajectories) executable by the robot. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes **Multi-Head Attention Pooling** to extract context-specific features from both latent actions and visual observations. |
|
|
|
|
|
| Component | Description | |
|
|
| :--- | :--- | |
|
|
| **Input** | Latent Action Embeddings + Visual Embeddings (VLM Last Layer) | |
|
|
| **Context Mechanism** | **Attention Pooling** (Visual tokens query Action tokens) | |
|
|
| **Output** | **Action Chunks** (Sequence of continuous poses) | |
|
|
| **Parameter Count** | **12.6M** (Lightweight Adapter) | |
|
|
|
|
|
### Architecture Configuration |
|
|
The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer. |
|
|
|
|
|
| Parameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Attention Heads** | 8 | |
|
|
| **Head Dimension** | 64 | |
|
|
| **Hidden Size** | 512 | |
|
|
| **MLP Ratio** | 4 | |
|
|
| **Proprioception Projection** | 2 Layers (Hidden Size 512) | |
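
The actual implementation lives in the UniVLA codebase; the following is only a minimal PyTorch sketch of how an attention-pooling head with these dimensions (8 heads of size 64, hidden size 512, MLP ratio 4, a 2-layer proprioceptive projection) could be wired. All class and argument names are illustrative, not the real API.

```python
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    """Illustrative attention-pooling head: visual tokens query latent action
    tokens, and an MLP regresses a chunk of continuous 7-DoF actions."""

    def __init__(self, hidden_size=512, num_heads=8, mlp_ratio=4,
                 chunk_size=12, action_dim=7, proprio_dim=7):
        super().__init__()
        # Project VLM last-layer embeddings down to the decoder width.
        self.visual_proj = nn.LazyLinear(hidden_size)
        self.latent_proj = nn.LazyLinear(hidden_size)
        # Visual tokens act as queries over the predicted latent action tokens.
        self.attn_pool = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Optional 2-layer proprioceptive projection for real-world deployment.
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # MLP head that regresses the whole action chunk at once.
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * mlp_ratio), nn.GELU(),
            nn.Linear(hidden_size * mlp_ratio, chunk_size * action_dim),
        )
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, latent_tokens, visual_tokens, proprio=None):
        q = self.visual_proj(visual_tokens)       # (B, V, 512)
        kv = self.latent_proj(latent_tokens)      # (B, A, 512)
        pooled, _ = self.attn_pool(q, kv, kv)     # (B, V, 512)
        feat = pooled.mean(dim=1)                 # pool over visual queries
        if proprio is not None:
            feat = feat + self.proprio_proj(proprio)
        actions = self.head(feat)                 # (B, chunk_size * action_dim)
        return actions.view(-1, self.chunk_size, self.action_dim)
```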
|
|
|
|
|
### Key Feature: Action Chunking |
|
|
Unlike OpenVLA, which predicts actions step by step, this decoder outputs **Action Chunks** (default size $N=12$ for real-world tasks). This allows for significantly smoother control and a higher inference frequency ($\sim$10 Hz). |
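
As an illustration of what chunked execution means in practice, below is a hedged sketch of a control loop that executes one predicted chunk open-loop before re-querying the policy. The `policy.predict_action_chunk` and `env.step` interfaces are placeholders, not the benchmark API.

```python
import time

CHUNK_SIZE = 12   # actions decoded per forward pass
CONTROL_HZ = 10   # approximate target control frequency

def run_episode(policy, env, max_steps=600):
    """Execute predicted action chunks open-loop, then re-plan."""
    obs = env.reset()
    step = 0
    while step < max_steps:
        # One VLM + decoder forward pass yields a (CHUNK_SIZE, 7) trajectory.
        chunk = policy.predict_action_chunk(obs)
        for action in chunk[:CHUNK_SIZE]:
            obs, done = env.step(action)   # 7-DoF: EE pose delta + gripper
            step += 1
            time.sleep(1.0 / CONTROL_HZ)
            if done or step >= max_steps:
                return obs
    return obs
```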
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
This model was fine-tuned on the **[VLA-Arena/VLA_Arena_L0_L_rlds](https://huggingface.co/datasets/VLA-Arena/VLA_Arena_L0_L_rlds)** dataset. |
|
|
|
|
|
### Training Strategy |
|
|
This decoder is trained **end-to-end** with the UniVLA backbone (via LoRA). While the backbone learns to predict the correct *discrete* latent token, this decoder simultaneously learns to map that token to the correct *continuous* physical action. |
|
|
|
|
|
| Parameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Loss Function** | **L1 Loss** (Ground Truth vs. Predicted Action) | |
|
|
| **Optimization** | Joint optimization with VLM Next-Token Prediction | |
|
|
| **Visual Conditioning** | **Enabled** (Visual embeddings used as queries) | |
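
A minimal sketch of how such a joint objective could be expressed in training code, assuming the backbone returns a standard next-token cross-entropy loss and the decoder regresses continuous chunks (all variable names are illustrative):

```python
import torch.nn.functional as F

def joint_loss(vlm_outputs, decoder, latent_embeds, visual_embeds, gt_actions,
               action_loss_weight=1.0):
    """Next-token prediction on latent action tokens + L1 regression on actions."""
    # Cross-entropy over the discrete latent action vocabulary (from the VLM).
    token_loss = vlm_outputs.loss
    # Continuous action chunk decoded from latent + visual embeddings.
    pred_actions = decoder(latent_embeds, visual_embeds)   # (B, N, 7)
    action_loss = F.l1_loss(pred_actions, gt_actions)
    return token_loss + action_loss_weight * action_loss
```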
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation & Usage |
|
|
|
|
|
This model must be used in conjunction with the **UniVLA** backbone. |
|
|
1. **Backbone Phase**: The VLM predicts a sequence of discrete latent tokens (e.g., `<ACT_1>`, `<ACT_2>`). |
|
|
2. **Decoder Phase (This Model)**: These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7\times N$ action chunk (End-effector pose + Gripper), as sketched below. |
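
Putting both phases together, here is a hedged end-to-end sketch of deployment-time inference. `extract_latent_action_embeddings` and `extract_visual_embeddings` are hypothetical helpers standing in for the actual token/feature extraction; refer to the VLA-Arena repository for the real deployment scripts.

```python
import torch

@torch.no_grad()
def predict_actions(backbone, action_decoder, processor, image, instruction):
    """Phase 1: predict discrete latent action tokens; Phase 2: decode a 7xN chunk."""
    inputs = processor(images=image, text=instruction, return_tensors="pt")

    # Phase 1: the UniVLA backbone autoregressively emits <ACT_*> tokens and
    # exposes its last-layer hidden states for the visual patches.
    generation = backbone.generate(**inputs, max_new_tokens=16,
                                   output_hidden_states=True,
                                   return_dict_in_generate=True)
    latent_embeds = extract_latent_action_embeddings(generation)  # placeholder helper
    visual_embeds = extract_visual_embeddings(generation)         # placeholder helper

    # Phase 2: this decoder maps the tokens + visual context to a (N, 7) chunk.
    action_chunk = action_decoder(latent_embeds, visual_embeds)
    return action_chunk.squeeze(0)   # (N, 7): end-effector pose + gripper
```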
|
|
|
|
|
Ablation studies show that this **Visual-Attention Decoder** outperforms standard auto-regressive decoding by **42.1%** on long-horizon tasks (LIBERO-Long), indicating that it reduces ambiguity and improves action precision. |
|
|
|
|
|
For detailed evaluation instructions, metrics, and scripts, please refer to the [VLA-Arena repository](https://github.com/PKU-Alignment/VLA-Arena). |