---
license: mit
language:
- en
pipeline_tag: robotics
library_name: transformers
tags:
- multimodal
- robotics
- vision-language-action
- univla
- action-decoder
- continuous-control
datasets:
- VLA-Arena/VLA_Arena_L0_L_rlds
---

# UniVLA - Action Decoder (Deployment Head)

## About VLA-Arena

**VLA-Arena** is a comprehensive benchmark designed to quantitatively characterize the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this with a structured task design framework that quantifies difficulty across three orthogonal axes:

1. **Task Structure**: 170+ tasks grouped into four key dimensions:
   * **Safety**: Operating reliably under strict constraints.
   * **Distractor**: Handling environmental unpredictability.
   * **Extrapolation**: Generalizing to unseen scenarios.
   * **Long Horizon**: Executing complex, multi-step tasks.
2. **Language Command**: Variations in instruction complexity.
3. **Visual Observation**: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on **L0** tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

## Model Overview

This model is the **Action Decoder Head** for **UniVLA**. Unlike the Latent Action Model (LAM), which tokenizes video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment.

Its specific role is **Detokenization**: it takes the sequence of **Latent Action Tokens** predicted by the VLM backbone, together with **Visual Embeddings**, and decodes them into precise, continuous **Action Chunks** (7-DoF trajectories) executable by the robot.
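
As a shape-level illustration of that flow (all dimensions are assumptions; `decoder` stands in for this model, with inputs already projected to its hidden size):

```python
import torch

B, K, V, N = 1, 4, 256, 12               # batch, latent tokens, visual tokens, chunk size
latent_actions = torch.randn(B, K, 512)  # embeddings of the predicted latent action tokens
visual_emb = torch.randn(B, V, 512)      # VLM last-layer visual embeddings
# decoder(visual_emb, latent_actions) -> (B, N, 7) continuous action chunk
```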

---

## Model Architecture

The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes **Multi-Head Attention Pooling** to extract context-specific features from both latent actions and visual observations.

| Component | Description |
| :--- | :--- |
| **Input** | Latent Action Embeddings + Visual Embeddings (VLM Last Layer) |
| **Context Mechanism** | **Attention Pooling** (Visual tokens query Action tokens) |
| **Output** | **Action Chunks** (Sequence of continuous poses) |
| **Parameter Count** | **12.6M** (Lightweight Adapter) |

### Architecture Configuration
The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer.

| Parameter | Value |
| :--- | :--- |
| **Attention Heads** | 8 |
| **Head Dimension** | 64 |
| **Hidden Size** | 512 |
| **MLP Ratio** | 4 |
| **Proprioception Projection** | 2 Layers (Hidden Size 512) |
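
To make these numbers concrete, below is a minimal PyTorch sketch that wires the table's hyperparameters together. Class, argument, and method names are illustrative assumptions, not the released UniVLA implementation:

```python
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    def __init__(self, num_heads=8, head_dim=64, mlp_ratio=4,
                 chunk_size=12, action_dim=7, proprio_dim=7):
        super().__init__()
        hidden = num_heads * head_dim  # 8 * 64 = 512, matching the table
        # Attention pooling: visual tokens act as queries over latent action tokens.
        self.pool = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # 2-layer proprioceptive projection (real-world deployment).
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        # Projection MLP (ratio 4) regressing a flat chunk of N x 7 actions.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden * mlp_ratio), nn.GELU(),
            nn.Linear(hidden * mlp_ratio, chunk_size * action_dim))
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, visual_emb, latent_action_emb, proprio=None):
        # visual_emb: (B, V, 512); latent_action_emb: (B, A, 512); proprio: (B, 7)
        pooled, _ = self.pool(visual_emb, latent_action_emb, latent_action_emb)
        feat = pooled.mean(dim=1)  # aggregate the pooled queries -> (B, 512)
        if proprio is not None:
            feat = feat + self.proprio_proj(proprio)
        return self.head(feat).view(-1, self.chunk_size, self.action_dim)

# e.g. ActionDecoderSketch()(torch.randn(2, 256, 512), torch.randn(2, 4, 512))
# returns a (2, 12, 7) action chunk.
```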

### Key Feature: Action Chunking
Unlike OpenVLA, which predicts actions step by step, this decoder outputs **Action Chunks** (default size $N=12$ for real-world tasks). This allows significantly smoother control and a higher inference frequency ($\sim$10 Hz).
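
As a usage illustration, here is a hypothetical control loop that consumes one chunk per decoder call; `policy` and `robot` are placeholder objects, not a real API:

```python
import time

CHUNK_SIZE = 12   # N, the default chunk size for real-world tasks
CONTROL_HZ = 10   # approximate control frequency noted above

def run_episode(policy, robot, max_steps=300):
    steps_taken = 0
    while steps_taken < max_steps:
        obs = robot.get_observation()
        chunk = policy.predict_chunk(obs)  # one decoder call -> (12, 7) actions
        for action in chunk:               # stream the whole chunk before re-planning
            robot.apply_action(action)
            time.sleep(1.0 / CONTROL_HZ)
        steps_taken += CHUNK_SIZE
```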

---

## Training Details

### Dataset
This model was fine-tuned on the **[VLA-Arena/VLA_Arena_L0_L_rlds](https://huggingface.co/datasets/VLA-Arena/VLA_Arena_L0_L_rlds)** dataset.

### Training Strategy
This decoder is trained **end-to-end** with the UniVLA backbone (via LoRA). While the backbone learns to predict the correct *discrete* latent token, this decoder simultaneously learns to map that token to the correct *continuous* physical action.

| Parameter | Value |
| :--- | :--- |
| **Loss Function** | **L1 Loss** (Ground Truth vs. Predicted Action) |
| **Optimization** | Joint optimization with VLM Next-Token Prediction |
| **Visual Conditioning** | **Enabled** (Visual embeddings used as queries) |
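
A sketch of this joint objective under stated assumptions (the `l1_weight` balance between the two terms is hypothetical; the source does not specify how they are weighted):

```python
import torch.nn.functional as F

def joint_loss(token_logits, token_targets, pred_actions, gt_actions, l1_weight=1.0):
    # Backbone term: next-token cross-entropy over the latent action vocabulary.
    # token_logits: (B, T, vocab); token_targets: (B, T)
    ce = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    # Decoder term: L1 between ground-truth and predicted action chunks, (B, N, 7).
    l1 = F.l1_loss(pred_actions, gt_actions)
    return ce + l1_weight * l1
```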

---

## Evaluation & Usage

This model must be used in conjunction with the **UniVLA** backbone, in two phases (a minimal sketch follows the list):
1. **Backbone Phase**: The VLM predicts a sequence of discrete latent tokens (e.g., `<ACT_1>`, `<ACT_2>`).
2. **Decoder Phase (This Model)**: These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7 \times N$ action vector (End-effector pose + Gripper).
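
The following end-to-end sketch ties the two phases together; `backbone`, `decoder`, and all method names are placeholders for illustration, not the released API:

```python
def predict_actions(backbone, decoder, image, instruction, proprio=None):
    # Phase 1: the VLM autoregressively emits discrete latent action tokens
    # (e.g., <ACT_1> ... <ACT_K>) and exposes its last-layer visual embeddings.
    latent_tokens, visual_emb = backbone.generate_latent_actions(image, instruction)
    # Phase 2: this decoder detokenizes them into a continuous chunk of shape
    # (N, 7): end-effector pose (6-DoF) + gripper (1).
    return decoder(visual_emb, latent_tokens, proprio)
```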

Ablation studies show that this specific **Visual-Attention Decoder** outperforms standard auto-regressive decoding by **42.1%** on long-horizon tasks (LIBERO-Long), proving its efficacy in reducing ambiguity and improving precision.

For detailed evaluation instructions, metrics, and scripts, please refer to the [VLA-Arena repository](https://github.com/PKU-Alignment/VLA-Arena).