---
license: mit
tags:
- robotics
- vjepa2
- dm_control
- world-model
- teach-by-showing
---

# V-JEPA 2 Robot Multi-Task Dataset & Models
Vision-based robot control data using **V-JEPA 2** (ViT-L) latent representations
from DeepMind Control Suite environments.
## 📊 Dataset
| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|------|----------|-------------|------------|------------|--------------|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |
Each `.npz` file contains:
- `z_t` – V-JEPA 2 latent state embeddings (N × 1024)
- `a_t` – actions taken (N × action_dim)
- `z_next` – next-state latent embeddings (N × 1024)
- `rewards` – per-step rewards (N,)
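The arrays can be loaded with plain NumPy. The helper below is a minimal sketch for sanity-checking the schema described above (the function name `load_transitions` and the per-file path are illustrative, not part of the release):

```python
import numpy as np

def load_transitions(path):
    """Load one task's transition arrays from an .npz file in this dataset."""
    data = np.load(path)
    z_t, a_t = data["z_t"], data["a_t"]
    z_next, rewards = data["z_next"], data["rewards"]
    # Shapes as described above: latents are N x 1024, rewards are (N,).
    assert z_t.shape == z_next.shape and z_t.shape[1] == 1024
    assert z_t.shape[0] == a_t.shape[0] == rewards.shape[0]
    return z_t, a_t, z_next, rewards
```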
## 🤖 Models
For each task, we provide:
- **5× Dynamics Ensemble** – `dyn_0.pt` to `dyn_4.pt` (MLP: z + a → z_next, ~1.58M params each)
- **1× Reward Model** – `reward.pt` (MLP: z + a → reward, ~329K params)

### Architecture
- Dynamics: `Linear(1024+a_dim, 512) → LayerNorm → ReLU → ×3 → Linear(512, 1024)` + residual connection
- Reward: `Linear(1024+a_dim, 256) → ReLU → ×2 → Linear(256, 1)`
- Ensemble diversity (weight cosine similarity): ~0.60
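The layer lists above map onto two small PyTorch modules. The sketch below is reconstructed from that description (class names, the `hidden` arguments, and the default `a_dim=2` are mine, not the checkpoints' exact definitions); with `a_dim=2` it reproduces the ~1.58M / ~329K parameter counts:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Residual latent-dynamics MLP: (z, a) -> z_next."""
    def __init__(self, z_dim=1024, a_dim=2, hidden=512):
        super().__init__()
        layers, in_dim = [], z_dim + a_dim
        for _ in range(3):  # 3x (Linear -> LayerNorm -> ReLU)
            layers += [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, z, a):
        # Residual connection: the network predicts the latent delta.
        return z + self.head(self.trunk(torch.cat([z, a], dim=-1)))

class RewardModel(nn.Module):
    """Reward MLP: (z, a) -> scalar per-step reward."""
    def __init__(self, z_dim=1024, a_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)
```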
## 🏗️ How It Was Built
1. Expert policies collect episodes in dm_control environments
2. Each frame rendered at 224×224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
3. Dynamics ensemble trained with random data splits + different seeds
4. Reward model trained to predict per-step rewards from `z_t` + `a_t`
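Step 3 relies only on per-member randomness. A minimal sketch of the split logic (the helper name and the 90/10 fraction are assumptions; the card does not state the actual split ratio):

```python
import numpy as np

def make_split(n, frac=0.9, seed=0):
    """Random train/val index split; each ensemble member uses its own seed."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(frac * n)
    return idx[:cut], idx[cut:]

# Five members, each with a different shuffle and init seed, which is what
# produces the ensemble's ~0.60 weight cosine similarity noted above.
for member in range(5):
    train_idx, val_idx = make_split(200_000, seed=member)
    # ... init the dynamics MLP with seed `member`, train on train_idx,
    #     validate on val_idx, save as f"dyn_{member}.pt"
```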
## 📈 Training Details

- **GPU:** NVIDIA A100-SXM4-80GB (Prime Intellect)
- **Total time:** 5.4 hours
- **Total cost:** ~$7
- **Dynamics val loss:** ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- **Temporal coherence:** >0.998 for all tasks
## 🎯 Purpose
These world models are designed for **"teach-by-showing"**: demonstrating a task via video,
then using the learned dynamics + CEM planning to reproduce the shown behavior.
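For concreteness, here is a generic cross-entropy-method (CEM) planner over latent rollouts. This is a sketch, not the released planner: the `cem_plan` interface and all hyperparameters (`pop`, `elites`, `horizon`, `iters`) are illustrative, with `dynamics_fns` standing in for the 5-model ensemble and `reward_fn` for the reward model:

```python
import numpy as np

def cem_plan(z0, dynamics_fns, reward_fn, a_dim, horizon=10,
             pop=64, elites=8, iters=5, seed=0):
    """Cross-entropy method over action sequences, scored in latent space."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, a_dim))
    sigma = np.ones((horizon, a_dim))
    for _ in range(iters):
        # Sample candidate action sequences, clipped to dm_control's [-1, 1] range.
        acts = np.clip(mu + sigma * rng.standard_normal((pop, horizon, a_dim)), -1.0, 1.0)
        scores = np.zeros(pop)
        for f in dynamics_fns:            # average predicted return over the ensemble
            z = np.repeat(z0[None], pop, axis=0)
            for t in range(horizon):
                scores += reward_fn(z, acts[:, t]) / len(dynamics_fns)
                z = f(z, acts[:, t])
        elite = acts[np.argsort(scores)[-elites:]]   # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute the first action of the refined mean sequence
```

In a teach-by-showing loop, `z0` would be the current V-JEPA 2 latent and `reward_fn` could instead score similarity to the demonstrated latent trajectory.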