|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: robotics |
|
|
library_name: transformers |
|
|
tags: |
|
|
- multimodal |
|
|
- robotics |
|
|
- vision-language-action |
|
|
- univla |
|
|
- action-decoder |
|
|
- continuous-control |
|
|
datasets: |
|
|
- VLA-Arena/VLA_Arena_L0_L_rlds |
|
|
--- |
|
|
|
|
|
# UniVLA - Action Decoder (Deployment Head) |
|
|
|
|
|
## About VLA-Arena |
|
|
|
|
|
**VLA-Arena** is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes: |
|
|
|
|
|
1. **Task Structure**: 170+ tasks grouped into four key dimensions: |
|
|
* **Safety**: Operating reliably under strict constraints. |
|
|
* **Distractor**: Handling environmental unpredictability. |
|
|
* **Extrapolation**: Generalizing to unseen scenarios. |
|
|
* **Long Horizon**: Executing complex, multi-step tasks. |
|
|
2. **Language Command**: Variations in instruction complexity. |
|
|
3. **Visual Observation**: Perturbations in visual input. |
|
|
|
|
|
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on **L0** tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints. |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
This model is the **Action Decoder Head** for **UniVLA**. Unlike the Latent Action Model (LAM), which tokenizes video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment. |
|
|
|
|
|
Its specific role is **Detokenization**: it takes the sequence of **Latent Action Tokens** (predicted by the VLM backbone) and **Visual Embeddings**, and decodes them into precise, continuous **Action Chunks** (7-DoF trajectories) executable by the robot. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes **Multi-Head Attention Pooling** to extract context-specific features from both latent actions and visual observations. |
|
|
|
|
|
| Component | Description | |
|
|
| :--- | :--- | |
|
|
| **Input** | Latent Action Embeddings + Visual Embeddings (VLM Last Layer) | |
|
|
| **Context Mechanism** | **Attention Pooling** (Visual tokens query Action tokens) | |
|
|
| **Output** | **Action Chunks** (Sequence of continuous poses) | |
|
|
| **Parameter Count** | **12.6M** (Lightweight Adapter) | |
|
|
|
|
|
### Architecture Configuration |
|
|
The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer. |
|
|
|
|
|
| Parameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Attention Heads** | 8 | |
|
|
| **Head Dimension** | 64 | |
|
|
| **Hidden Size** | 512 | |
|
|
| **MLP Ratio** | 4 | |
|
|
| **Proprioception Projection** | 2 Layers (Hidden Size 512) | |
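
The actual implementation lives in the UniVLA codebase; the following is only a minimal PyTorch sketch of how an attention-pooling head with these dimensions (8 heads of size 64, hidden size 512, MLP ratio 4, a 2-layer proprioceptive projection) could be wired. All class and argument names are illustrative, not the real API.

```python
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    """Illustrative attention-pooling head: visual tokens query latent action
    tokens, and an MLP regresses a chunk of continuous 7-DoF actions."""

    def __init__(self, hidden_size=512, num_heads=8, mlp_ratio=4,
                 chunk_size=12, action_dim=7, proprio_dim=7):
        super().__init__()
        # Project VLM last-layer embeddings down to the decoder width.
        self.visual_proj = nn.LazyLinear(hidden_size)
        self.latent_proj = nn.LazyLinear(hidden_size)
        # Visual tokens act as queries over the predicted latent action tokens.
        self.attn_pool = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Optional 2-layer proprioceptive projection for real-world deployment.
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # MLP head that regresses the whole action chunk at once.
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * mlp_ratio), nn.GELU(),
            nn.Linear(hidden_size * mlp_ratio, chunk_size * action_dim),
        )
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, latent_tokens, visual_tokens, proprio=None):
        q = self.visual_proj(visual_tokens)       # (B, V, 512)
        kv = self.latent_proj(latent_tokens)      # (B, A, 512)
        pooled, _ = self.attn_pool(q, kv, kv)     # (B, V, 512)
        feat = pooled.mean(dim=1)                 # pool over visual queries
        if proprio is not None:
            feat = feat + self.proprio_proj(proprio)
        actions = self.head(feat)                 # (B, chunk_size * action_dim)
        return actions.view(-1, self.chunk_size, self.action_dim)
```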
|
|
|
|
|
### Key Feature: Action Chunking |
|
|
Unlike OpenVLA, which predicts actions step by step, this decoder outputs **Action Chunks** (default size $N=12$ for real-world tasks). This allows for significantly smoother control and a higher inference frequency ($\sim$10 Hz). |
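
As an illustration of what chunked execution means in practice, below is a hedged sketch of a control loop that executes one predicted chunk open-loop before re-querying the policy. The `policy.predict_action_chunk` and `env.step` interfaces are placeholders, not the benchmark API.

```python
import time

CHUNK_SIZE = 12   # actions decoded per forward pass
CONTROL_HZ = 10   # approximate target control frequency

def run_episode(policy, env, max_steps=600):
    """Execute predicted action chunks open-loop, then re-plan."""
    obs = env.reset()
    step = 0
    while step < max_steps:
        # One VLM + decoder forward pass yields a (CHUNK_SIZE, 7) trajectory.
        chunk = policy.predict_action_chunk(obs)
        for action in chunk[:CHUNK_SIZE]:
            obs, done = env.step(action)   # 7-DoF: EE pose delta + gripper
            step += 1
            time.sleep(1.0 / CONTROL_HZ)
            if done or step >= max_steps:
                return obs
    return obs
```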
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
This model was fine-tuned on the **[VLA-Arena/VLA_Arena_L0_L_rlds](https://huggingface.co/datasets/VLA-Arena/VLA_Arena_L0_L_rlds)** dataset. |
|
|
|
|
|
### Training Strategy |
|
|
This decoder is trained **end-to-end** with the UniVLA backbone (via LoRA). While the backbone learns to predict the correct *discrete* latent token, this decoder simultaneously learns to map that token to the correct *continuous* physical action. |
|
|
|
|
|
| Parameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Loss Function** | **L1 Loss** (Ground Truth vs. Predicted Action) | |
|
|
| **Optimization** | Joint optimization with VLM Next-Token Prediction | |
|
|
| **Visual Conditioning** | **Enabled** (Visual embeddings used as queries) | |
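
A minimal sketch of how such a joint objective could be expressed in training code, assuming the backbone returns a standard next-token cross-entropy loss and the decoder regresses continuous chunks (all variable names are illustrative):

```python
import torch.nn.functional as F

def joint_loss(vlm_outputs, decoder, latent_embeds, visual_embeds, gt_actions,
               action_loss_weight=1.0):
    """Next-token prediction on latent action tokens + L1 regression on actions."""
    # Cross-entropy over the discrete latent action vocabulary (from the VLM).
    token_loss = vlm_outputs.loss
    # Continuous action chunk decoded from latent + visual embeddings.
    pred_actions = decoder(latent_embeds, visual_embeds)   # (B, N, 7)
    action_loss = F.l1_loss(pred_actions, gt_actions)
    return token_loss + action_loss_weight * action_loss
```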
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation & Usage |
|
|
|
|
|
This model must be used in conjunction with the **UniVLA** backbone. |
|
|
1. **Backbone Phase**: The VLM predicts a sequence of discrete latent tokens (e.g., `<ACT_1>`, `<ACT_2>`). |
|
|
2. **Decoder Phase (This Model)**: These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7\times N$ action chunk (End-effector pose + Gripper), as sketched below. |
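
Putting both phases together, here is a hedged end-to-end sketch of deployment-time inference. `extract_latent_action_embeddings` and `extract_visual_embeddings` are hypothetical helpers standing in for the actual token/feature extraction; refer to the VLA-Arena repository for the real deployment scripts.

```python
import torch

@torch.no_grad()
def predict_actions(backbone, action_decoder, processor, image, instruction):
    """Phase 1: predict discrete latent action tokens; Phase 2: decode a 7xN chunk."""
    inputs = processor(images=image, text=instruction, return_tensors="pt")

    # Phase 1: the UniVLA backbone autoregressively emits <ACT_*> tokens and
    # exposes its last-layer hidden states for the visual patches.
    generation = backbone.generate(**inputs, max_new_tokens=16,
                                   output_hidden_states=True,
                                   return_dict_in_generate=True)
    latent_embeds = extract_latent_action_embeddings(generation)  # placeholder helper
    visual_embeds = extract_visual_embeddings(generation)         # placeholder helper

    # Phase 2: this decoder maps the tokens + visual context to a (N, 7) chunk.
    action_chunk = action_decoder(latent_embeds, visual_embeds)
    return action_chunk.squeeze(0)   # (N, 7): end-effector pose + gripper
```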
|
|
|
|
|
Ablation studies show that this **Visual-Attention Decoder** outperforms standard auto-regressive decoding by **42.1%** on long-horizon tasks (LIBERO-Long), indicating that it reduces ambiguity and improves action precision. |
|
|
|
|
|
For detailed evaluation instructions, metrics, and scripts, please refer to the [VLA-Arena repository](https://github.com/PKU-Alignment/VLA-Arena). |