---
license: mit
language:
- en
pipeline_tag: robotics
library_name: transformers
tags:
- multimodal
- robotics
- vision-language-action
- univla
- action-decoder
- continuous-control
datasets:
- VLA-Arena/VLA_Arena_L0_L_rlds
---

# UniVLA - Action Decoder (Deployment Head)

## About VLA-Arena

**VLA-Arena** is a comprehensive benchmark designed to quantitatively characterize the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this with a structured task-design framework that quantifies difficulty across three orthogonal axes:

1. **Task Structure**: 170+ tasks grouped into four key dimensions:
   * **Safety**: Operating reliably under strict constraints.
   * **Distractor**: Handling environmental unpredictability.
   * **Extrapolation**: Generalizing to unseen scenarios.
   * **Long Horizon**: Executing complex, multi-step tasks.
2. **Language Command**: Variations in instruction complexity.
3. **Visual Observation**: Perturbations in visual input.

Tasks are organized into hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on **L0** tasks to assess how well the model generalizes to higher difficulty levels and adheres to safety constraints.

## Model Overview

This model is the **Action Decoder Head** for **UniVLA**. Unlike the Latent Action Model (LAM), which tokenizes video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment.

Its specific role is **detokenization**: it takes the sequence of **latent action tokens** predicted by the VLM backbone, together with **visual embeddings**, and decodes them into precise, continuous **action chunks** (7-DoF trajectories) executable by the robot.

---

## Model Architecture

The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes **Multi-Head Attention Pooling** to extract context-specific features from both latent actions and visual observations.

| Component | Description |
| :--- | :--- |
| **Input** | Latent Action Embeddings + Visual Embeddings (VLM last layer) |
| **Context Mechanism** | **Attention Pooling** (visual tokens query action tokens) |
| **Output** | **Action Chunks** (sequence of continuous poses) |
| **Parameter Count** | **12.6M** (lightweight adapter) |

### Architecture Configuration

The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer.

| Parameter | Value |
| :--- | :--- |
| **Attention Heads** | 8 |
| **Head Dimension** | 64 |
| **Hidden Size** | 512 |
| **MLP Ratio** | 4 |
| **Proprioception Projection** | 2 layers (hidden size 512) |

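The exact implementation lives in the UniVLA codebase; as a rough illustration, the configuration above corresponds to a module along the lines of the sketch below. This is a minimal sketch only: the class name, argument names, and wiring details (normalization placement, pooling reduction, proprioceptive fusion) are assumptions, not the reference code.

```python
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    """Illustrative attention-pooling action decoder (not the reference code).

    Hyperparameters follow the table above: 8 heads x 64-dim = 512 hidden,
    MLP ratio 4, chunk size 12, 7-DoF actions.
    """

    def __init__(self, hidden=512, heads=8, chunk=12, action_dim=7,
                 mlp_ratio=4, proprio_dim=7):
        super().__init__()
        # Visual tokens act as queries over the latent action tokens.
        self.pool = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        # Projection MLP maps the pooled feature to a flat action chunk.
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden * mlp_ratio),
            nn.GELU(),
            nn.Linear(hidden * mlp_ratio, chunk * action_dim),
        )
        # 2-layer proprioceptive projection (real-world deployment only).
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, visual_emb, latent_action_emb, proprio=None):
        # visual_emb: (B, V, 512); latent_action_emb: (B, A, 512)
        pooled, _ = self.pool(visual_emb, latent_action_emb, latent_action_emb)
        feat = self.norm(pooled).mean(dim=1)      # (B, 512)
        if proprio is not None:                   # proprio: (B, proprio_dim)
            feat = feat + self.proprio_proj(proprio)
        return self.mlp(feat).view(-1, self.chunk, self.action_dim)

# e.g. ActionDecoderSketch()(torch.randn(2, 256, 512), torch.randn(2, 4, 512))
# returns a (2, 12, 7) action chunk.
```
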
### Key Feature: Action Chunking

Unlike OpenVLA, which predicts actions step by step, this decoder outputs **action chunks** (default size $N=12$ for real-world tasks). This allows for significantly smoother control and a higher inference frequency ($\sim$10 Hz).

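In deployment terms, chunking means the heavy VLM forward pass runs once per chunk rather than once per control step. A hedged sketch of such a receding-horizon loop, where `predict_chunk` and the `robot` interface are hypothetical stand-ins for the actual inference and control code:

```python
CHUNK_SIZE = 12  # default N for real-world tasks (see above)

def control_loop(robot, predict_chunk, max_steps=600):
    """Receding-horizon execution: one policy call per CHUNK_SIZE steps."""
    obs = robot.get_observation()
    for _ in range(max_steps // CHUNK_SIZE):
        chunk = predict_chunk(obs)       # (CHUNK_SIZE, 7) array of actions
        for action in chunk:             # execute the whole chunk open-loop
            robot.step(action)           # 7-DoF end-effector pose + gripper
        obs = robot.get_observation()    # re-observe, then predict again
```
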
---

## Training Details

### Dataset

This model was fine-tuned on the **[VLA-Arena/VLA_Arena_L0_L_rlds](https://huggingface.co/datasets/VLA-Arena/VLA_Arena_L0_L_rlds)** dataset.

### Training Strategy

This decoder is trained **end-to-end** with the UniVLA backbone (which is fine-tuned via LoRA). While the backbone learns to predict the correct *discrete* latent tokens, the decoder simultaneously learns to map those tokens to the correct *continuous* physical actions.

| Parameter | Value |
| :--- | :--- |
| **Loss Function** | **L1 Loss** (ground truth vs. predicted action) |
| **Optimization** | Joint optimization with VLM next-token prediction |
| **Visual Conditioning** | **Enabled** (visual embeddings used as queries) |

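Conceptually, the joint objective sums the backbone's next-token cross-entropy and the decoder's L1 regression term. A minimal sketch, assuming an illustrative weighting `lambda_l1` between the two terms (the actual weighting is not stated here):

```python
import torch.nn.functional as F

def joint_loss(token_logits, target_tokens, pred_actions, gt_actions,
               lambda_l1=1.0):
    # Backbone term: next-token prediction over discrete latent action tokens.
    # token_logits: (B, T, vocab); target_tokens: (B, T)
    ce = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    # Decoder term: L1 between predicted and ground-truth action chunks.
    l1 = F.l1_loss(pred_actions, gt_actions)
    return ce + lambda_l1 * l1  # gradients flow to decoder and LoRA adapters
```
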
---

## Evaluation & Usage

This model must be used in conjunction with the **UniVLA** backbone. Inference proceeds in two phases (a usage sketch follows the list):

1. **Backbone Phase**: The VLM predicts a sequence of discrete latent tokens (e.g., `<ACT_1>`, `<ACT_2>`).
2. **Decoder Phase (This Model)**: These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7 \times N$ action chunk (end-effector pose + gripper).

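A hedged end-to-end sketch of these two phases; `generate_latents`, `embed_tokens`, and the `robot` interface are illustrative names rather than the actual API (see the VLA-Arena repository for the real inference scripts):

```python
def run_episode(backbone, decoder, robot, instruction, chunks=50):
    """Two-phase UniVLA inference: VLM backbone -> action decoder -> robot."""
    for _ in range(chunks):
        image = robot.get_observation()
        # Phase 1: the VLM predicts discrete latent action tokens (<ACT_i>)
        # and exposes its last-layer visual embeddings for conditioning.
        latent_tokens, visual_emb = backbone.generate_latents(image, instruction)
        # Phase 2: this decoder maps tokens + visual context to a 7 x N chunk.
        latent_emb = backbone.embed_tokens(latent_tokens)
        action_chunk = decoder(visual_emb, latent_emb)   # (1, 12, 7)
        for action in action_chunk[0]:
            robot.step(action)                           # EE pose + gripper
```
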
Ablation studies show that this **visual-attention decoder** outperforms standard autoregressive decoding by **42.1%** on long-horizon tasks (LIBERO-Long), demonstrating its effectiveness at reducing ambiguity and improving precision.

For detailed evaluation instructions, metrics, and scripts, please refer to the [VLA-Arena repository](https://github.com/PKU-Alignment/VLA-Arena).