---
license: mit
tags:
- robotics
- vjepa2
- dm_control
- world-model
- teach-by-showing
---

# V-JEPA 2 Robot Multi-Task Dataset & Models
Vision-based robot control data using **V-JEPA 2** (ViT-L) latent representations
from DeepMind Control Suite environments.
## 📊 Dataset
| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|------|----------|-------------|------------|------------|--------------|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |
Each `.npz` file contains:
- `z_t` – V-JEPA 2 latent state embeddings (N × 1024)
- `a_t` – actions taken (N × action_dim)
- `z_next` – next-state latent embeddings (N × 1024)
- `rewards` – per-step rewards (N,)
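The arrays can be loaded with plain NumPy. The helper below is a minimal sketch for sanity-checking the schema described above (the function name `load_transitions` and the per-file path are illustrative, not part of the release):

```python
import numpy as np

def load_transitions(path):
    """Load one task's transition arrays from an .npz file in this dataset."""
    data = np.load(path)
    z_t, a_t = data["z_t"], data["a_t"]
    z_next, rewards = data["z_next"], data["rewards"]
    # Shapes as described above: latents are N x 1024, rewards are (N,).
    assert z_t.shape == z_next.shape and z_t.shape[1] == 1024
    assert z_t.shape[0] == a_t.shape[0] == rewards.shape[0]
    return z_t, a_t, z_next, rewards
```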
## 🤖 Models
For each task, we provide:
- **5× Dynamics Ensemble** – `dyn_0.pt` to `dyn_4.pt` (MLP: z + a → z_next, ~1.58M params each)
- **1× Reward Model** – `reward.pt` (MLP: z + a → reward, ~329K params)

### Architecture
- Dynamics: `Linear(1024+a_dim, 512) → LayerNorm → ReLU → ×3 → Linear(512, 1024)` + residual connection
- Reward: `Linear(1024+a_dim, 256) → ReLU → ×2 → Linear(256, 1)`
- Ensemble diversity (weight cosine similarity): ~0.60
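The layer lists above map onto two small PyTorch modules. The sketch below is reconstructed from that description (class names, the `hidden` arguments, and the default `a_dim=2` are mine, not the checkpoints' exact definitions); with `a_dim=2` it reproduces the ~1.58M / ~329K parameter counts:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Residual latent-dynamics MLP: (z, a) -> z_next."""
    def __init__(self, z_dim=1024, a_dim=2, hidden=512):
        super().__init__()
        layers, in_dim = [], z_dim + a_dim
        for _ in range(3):  # 3x (Linear -> LayerNorm -> ReLU)
            layers += [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, z, a):
        # Residual connection: the network predicts the latent delta.
        return z + self.head(self.trunk(torch.cat([z, a], dim=-1)))

class RewardModel(nn.Module):
    """Reward MLP: (z, a) -> scalar per-step reward."""
    def __init__(self, z_dim=1024, a_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)
```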
## 🏗️ How It Was Built
1. Expert policies collect episodes in dm_control environments
2. Each frame rendered at 224×224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
3. Dynamics ensemble trained with random data splits + different seeds
4. Reward model trained to predict per-step rewards from `z_t` + `a_t`
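Step 3 relies only on per-member randomness. A minimal sketch of the split logic (the helper name and the 90/10 fraction are assumptions; the card does not state the actual split ratio):

```python
import numpy as np

def make_split(n, frac=0.9, seed=0):
    """Random train/val index split; each ensemble member uses its own seed."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(frac * n)
    return idx[:cut], idx[cut:]

# Five members, each with a different shuffle and init seed, which is what
# produces the ensemble's ~0.60 weight cosine similarity noted above.
for member in range(5):
    train_idx, val_idx = make_split(200_000, seed=member)
    # ... init the dynamics MLP with seed `member`, train on train_idx,
    #     validate on val_idx, save as f"dyn_{member}.pt"
```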
## 📈 Training Details

- **GPU:** NVIDIA A100-SXM4-80GB (Prime Intellect)
- **Total time:** 5.4 hours
- **Total cost:** ~$7
- **Dynamics val loss:** ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- **Temporal coherence:** >0.998 for all tasks
## 🎯 Purpose
These world models are designed for **"teach-by-showing"**: demonstrating a task via video,
then using the learned dynamics + CEM planning to reproduce the shown behavior.
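For concreteness, here is a generic cross-entropy-method (CEM) planner over latent rollouts. This is a sketch, not the released planner: the `cem_plan` interface and all hyperparameters (`pop`, `elites`, `horizon`, `iters`) are illustrative, with `dynamics_fns` standing in for the 5-model ensemble and `reward_fn` for the reward model:

```python
import numpy as np

def cem_plan(z0, dynamics_fns, reward_fn, a_dim, horizon=10,
             pop=64, elites=8, iters=5, seed=0):
    """Cross-entropy method over action sequences, scored in latent space."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, a_dim))
    sigma = np.ones((horizon, a_dim))
    for _ in range(iters):
        # Sample candidate action sequences, clipped to dm_control's [-1, 1] range.
        acts = np.clip(mu + sigma * rng.standard_normal((pop, horizon, a_dim)), -1.0, 1.0)
        scores = np.zeros(pop)
        for f in dynamics_fns:            # average predicted return over the ensemble
            z = np.repeat(z0[None], pop, axis=0)
            for t in range(horizon):
                scores += reward_fn(z, acts[:, t]) / len(dynamics_fns)
                z = f(z, acts[:, t])
        elite = acts[np.argsort(scores)[-elites:]]   # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute the first action of the refined mean sequence
```

In a teach-by-showing loop, `z0` would be the current V-JEPA 2 latent and `reward_fn` could instead score similarity to the demonstrated latent trajectory.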