---
license: mit
tags:
- robotics
- vjepa2
- dm_control
- world-model
- teach-by-showing
---

# V-JEPA 2 Robot Multi-Task Dataset & Models

Vision-based robot control data using **V-JEPA 2** (ViT-L) latent representations
from DeepMind Control Suite environments.

## 📊 Dataset

| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|------|----------|-------------|------------|------------|--------------|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |

Each `.npz` file contains:
- `z_t` – V-JEPA 2 latent state embeddings (N × 1024)
- `a_t` – actions taken (N × action_dim)
- `z_next` – next-state latent embeddings (N × 1024)
- `rewards` – per-step rewards (N,)

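The schema above can be sketched with NumPy. This is a minimal illustration using synthetic arrays (the filename and the small `N` are placeholders; real files hold 200,000 transitions per task):

```python
import numpy as np

# Synthetic data in the .npz layout described above (shapes only).
N, latent_dim, action_dim = 8, 1024, 2  # reacher_easy uses action_dim=2

rng = np.random.default_rng(0)
np.savez(
    "reacher_easy_demo.npz",  # hypothetical filename
    z_t=rng.standard_normal((N, latent_dim)).astype(np.float32),
    a_t=rng.standard_normal((N, action_dim)).astype(np.float32),
    z_next=rng.standard_normal((N, latent_dim)).astype(np.float32),
    rewards=rng.standard_normal(N).astype(np.float32),
)

data = np.load("reacher_easy_demo.npz")
print(data["z_t"].shape, data["a_t"].shape, data["rewards"].shape)
# (8, 1024) (8, 2) (8,)
```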
## 🤖 Models

For each task, we provide:
- **5× Dynamics Ensemble** – `dyn_0.pt` to `dyn_4.pt` (MLP: z + a → z_next, ~1.58M params each)
- **1× Reward Model** – `reward.pt` (MLP: z + a → reward, ~329K params)

### Architecture
- Dynamics: `Linear(1024+a_dim, 512) → LN → ReLU → ×3 → Linear(512, 1024)` + residual connection
- Reward: `Linear(1024+a_dim, 256) → ReLU → ×2 → Linear(256, 1)`
- Ensemble diversity (weight cosine sim): ~0.60

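A PyTorch sketch of the dynamics architecture, assuming "×3" means three `Linear → LayerNorm → ReLU` blocks before the output projection (this reading is consistent with the ~1.58M parameter figure above; the class name is illustrative):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Sketch: three Linear -> LayerNorm -> ReLU blocks, a projection
    back to the 1024-d latent space, and a residual connection from z."""

    def __init__(self, latent_dim=1024, action_dim=2, hidden=512):
        super().__init__()
        layers, in_dim = [], latent_dim + action_dim
        for _ in range(3):
            layers += [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(hidden, latent_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, z, a):
        # Predict a latent delta and add it back to z (residual connection).
        return z + self.net(torch.cat([z, a], dim=-1))

model = DynamicsModel()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")  # 1,579,520 parameters, i.e. ~1.58M as stated above
```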
## 🏗️ How It Was Built

1. Expert policies collect episodes in dm_control environments
2. Each frame rendered at 224×224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
3. Dynamics ensemble trained with random data splits + different seeds
4. Reward model trained to predict per-step rewards from z_t + a_t

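Step 2 above can be sketched as follows. The encoder call is mocked (a per-window mean stands in for V-JEPA 2 ViT-L, which is too heavy for an inline example); the point is the 8-frame sliding window over rendered frames:

```python
import numpy as np

# Each latent z_t comes from the 8-frame window ending at frame t.
T, H, W, C = 12, 224, 224, 3
window = 8
frames = np.zeros((T, H, W, C), dtype=np.float32)  # stand-in for rendered frames

def encode(clip):
    # Placeholder for the V-JEPA 2 ViT-L encoder; real latents are 1024-d.
    return clip.mean(axis=(0, 1, 2))

latents = [encode(frames[t - window + 1 : t + 1]) for t in range(window - 1, T)]
print(len(latents))  # T - window + 1 = 5 windows for this toy T
```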
## 📈 Training Details

- **GPU:** NVIDIA A100-SXM4-80GB (Prime Intellect)
- **Total time:** 5.4 hours
- **Total cost:** ~$7
- **Dynamics val loss:** ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- **Temporal coherence:** >0.998 for all tasks

## 🎯 Purpose

These world models are designed for **"teach-by-showing"**: demonstrating a task via video,
then using the learned dynamics + CEM planning to reproduce the shown behavior.
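A minimal sketch of CEM planning over a learned dynamics model, assuming the cost is distance to a goal latent taken from the demonstration. The `dynamics` function here is a toy stand-in for the ensemble above, and all hyperparameters are illustrative:

```python
import numpy as np

def dynamics(z, a):
    # Toy stand-in for the learned dynamics ensemble.
    return z + 0.1 * np.tanh(a).mean() * np.ones_like(z)

def cem_plan(z0, z_goal, horizon=5, action_dim=2,
             pop=64, elites=8, iters=5, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences, roll them out in latent space.
        cand = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = dynamics(z, cand[i, t])
            costs[i] = np.linalg.norm(z - z_goal)  # match the shown behavior
        # Refit the sampling distribution to the elite sequences.
        elite = cand[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute the first action, then replan (MPC-style)

a0 = cem_plan(np.zeros(1024), np.full(1024, 0.5))
print(a0.shape)  # (2,)
```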