---
license: mit
tags:
  - robotics
  - vjepa2
  - dm_control
  - world-model
  - teach-by-showing
---

# V-JEPA 2 Robot Multi-Task Dataset & Models

Vision-based robot control datasets and world models built on **V-JEPA 2** (ViT-L)
latent representations of DeepMind Control Suite environments.

## πŸ“Š Dataset

| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|------|----------|-------------|------------|------------|-------------|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |

Each `.npz` file contains:
- `z_t` β€” V-JEPA 2 latent state embeddings (N Γ— 1024)
- `a_t` β€” actions taken (N Γ— action_dim)
- `z_next` β€” next-state latent embeddings (N Γ— 1024)
- `rewards` β€” per-step rewards (N,)
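As a minimal sketch, the files can be read with NumPy. The filename and sizes below are illustrative (the snippet writes a tiny synthetic file first so it is self-contained); real files have N = 200,000 transitions and the per-task `action_dim` from the table.

```python
import numpy as np

# Write a tiny synthetic file with the documented layout (illustrative
# shapes; real files use N = 200,000 and the per-task action_dim).
N, action_dim = 4, 2
np.savez(
    "demo_transitions.npz",
    z_t=np.zeros((N, 1024), dtype=np.float32),
    a_t=np.zeros((N, action_dim), dtype=np.float32),
    z_next=np.zeros((N, 1024), dtype=np.float32),
    rewards=np.zeros(N, dtype=np.float32),
)

# Loading works the same way for the real files.
data = np.load("demo_transitions.npz")
z_t, a_t = data["z_t"], data["a_t"]
z_next, rewards = data["z_next"], data["rewards"]
```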

## πŸ€– Models

For each task, we provide:
- **5Γ— Dynamics Ensemble** β€” `dyn_0.pt` to `dyn_4.pt` (MLP: z + a β†’ z_next, ~1.58M params each)
- **1Γ— Reward Model** β€” `reward.pt` (MLP: z + a β†’ reward, ~329K params)

### Architecture
- Dynamics: 3 blocks of `Linear β†’ LayerNorm β†’ ReLU` (1024+a_dim β†’ 512 β†’ 512 β†’ 512), then `Linear(512, 1024)`, with a residual connection from `z_t`
- Reward: 2 blocks of `Linear β†’ ReLU` (1024+a_dim β†’ 256 β†’ 256), then `Linear(256, 1)`
- Ensemble diversity (pairwise weight cosine similarity): ~0.60
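A minimal PyTorch sketch of the dynamics architecture as described above; anything beyond the listed layer sizes (class name, default `a_dim`) is an assumption, not the released training code. The layer sizes reproduce the stated ~1.58M parameter count.

```python
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Sketch of the described dynamics MLP: three Linear -> LayerNorm -> ReLU
    blocks, a final Linear back to the latent size, and a residual from z_t."""

    def __init__(self, z_dim: int = 1024, a_dim: int = 2, hidden: int = 512):
        super().__init__()
        layers = []
        for d_in in (z_dim + a_dim, hidden, hidden):
            layers += [nn.Linear(d_in, hidden), nn.LayerNorm(hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, z_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # Predict z_next as a residual update on the current latent.
        return z + self.net(torch.cat([z, a], dim=-1))
```

With `z_dim=1024`, `a_dim=2`, `hidden=512` this comes to 1,579,520 parameters, i.e. ~1.58M per ensemble member as stated.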

## πŸ—οΈ How It Was Built

1. Expert policies collect episodes in dm_control environments
2. Each frame rendered at 224Γ—224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
3. Dynamics ensemble trained with random data splits + different seeds
4. Reward model trained to predict per-step rewards from z_t + a_t
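Steps 3 and 4 above can be sketched as follows. The model width, data, and hyperparameters here are synthetic stand-ins to keep the snippet runnable, not the actual training setup; the point is that each ensemble member gets its own seed, which varies both its weight initialization and its data ordering.

```python
import torch
import torch.nn as nn


def train_member(z_t, a_t, z_next, seed, epochs=3, batch=32, lr=1e-3):
    """Train one dynamics-ensemble member (sketch). The per-member seed
    changes both weight init and the random data split/ordering."""
    torch.manual_seed(seed)
    z_dim, a_dim = z_t.shape[1], a_t.shape[1]
    model = nn.Sequential(  # small stand-in for the real dynamics MLP
        nn.Linear(z_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, z_dim)
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        idx = torch.randperm(len(z_t))  # seed-dependent shuffle
        for s in range(0, len(idx), batch):
            b = idx[s:s + batch]
            pred = model(torch.cat([z_t[b], a_t[b]], dim=-1))
            loss = nn.functional.mse_loss(pred, z_next[b])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


# Synthetic transitions just to exercise the loop.
z = torch.randn(256, 16)
a = torch.randn(256, 2)
z_next = z + 0.05 * torch.randn_like(z)
ensemble = [train_member(z, a, z_next, seed=k) for k in range(5)]
```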

## πŸ“ˆ Training Details

- **GPU:** NVIDIA A100-SXM4-80GB (Prime Intellect)
- **Total time:** 5.4 hours
- **Total cost:** ~$7
- **Dynamics val loss:** ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- **Temporal coherence:** >0.998 for all tasks

## 🎯 Purpose

These world models are designed for **"teach-by-showing"**: demonstrate a task via video,
then use the learned dynamics with cross-entropy method (CEM) planning to reproduce the shown behavior.
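A hedged sketch of what CEM planning over the learned latent dynamics might look like; the planner hyperparameters and the toy `dynamics`/`reward_fn` in the demo call are illustrative, not the released planner.

```python
import torch


def cem_plan(dynamics, reward_fn, z0, a_dim, horizon=10, pop=64,
             elites=8, iters=5):
    """Cross-entropy-method planner over open-loop action sequences.
    dynamics: (z, a) -> z_next; reward_fn: (z, a) -> per-step reward."""
    mu = torch.zeros(horizon, a_dim)
    std = torch.ones(horizon, a_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, a_dim)  # candidates
        z = z0.expand(pop, -1)
        ret = torch.zeros(pop)
        for t in range(horizon):
            ret = ret + reward_fn(z, acts[:, t])  # accumulate predicted reward
            z = dynamics(z, acts[:, t])           # roll the latent forward
        elite = acts[ret.topk(elites).indices]    # keep the best sequences
        mu, std = elite.mean(0), elite.std(0)     # refit sampling distribution
    return mu[0]  # execute the first action of the refined plan


# Toy check: with reward -||a||^2 the planner should favor small actions.
torch.manual_seed(0)
action = cem_plan(
    dynamics=lambda z, a: z,
    reward_fn=lambda z, a: -(a ** 2).sum(-1),
    z0=torch.zeros(8),
    a_dim=2,
)
```

In the teach-by-showing setting, `reward_fn` would instead score similarity to the demonstrated trajectory's latents, and `dynamics` would be the trained ensemble.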