Commit 4bd47d5
Parent: 643312c
update the tag
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
 - Diffusion
 language:
 - en
-pipeline_tag:
+pipeline_tag: robotics
 ---
 # VITRA-VLA-3B
 VITRA is a novel approach for pretraining Vision-Language-Action (VLA) models for robotic manipulation using large-scale, unscripted, real-world videos of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that in-the-wild egocentric human videos without any annotations can be transformed into a data format fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. We create a human-hand V-L-A dataset containing over 1 million episodes. We further develop a VLA model with a causal action transformer trained on this dataset. It demonstrates strong zero-shot human-hand action prediction in entirely new scenes and serves as a cornerstone for few-shot finetuning and adaptation to real-world robotic manipulation.
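
For reference, the change fills in the previously empty pipeline_tag field with the value robotics, matching the commit message "update the tag"; on the Hugging Face Hub this front-matter field determines which task category the model is listed under. Reconstructed from the hunk alone (any metadata fields outside this hunk are not visible in the diff), the affected portion of the README front matter after the commit reads:

tags:
- Diffusion
language:
- en
pipeline_tag: robotics
---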