V-JEPA2-AC ViT-g (MLX) — action-conditioned world-model

Apple MLX fp16 port of Meta's V-JEPA2-AC (vjepa2-ac-vitg). It pairs a ViT-g video encoder with an action-conditioned predictor: given encoder context tokens and per-frame 7-DoF robot poses (action and state), the predictor outputs future latent states. This is the world model used for robot planning. License: MIT.
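
To make the "world model for planning" idea concrete, here is a minimal sketch of planning in latent space with an action-conditioned predictor. It assumes a simple random-shooting planner and an L1 distance to a goal latent; this card does not say which planner or cost is used with the model. `toy_predict` is a hypothetical stand-in for the real predictor (it just adds the action to the latent), and latents are short lists of floats rather than encoder tokens.

```python
import random

def toy_predict(latent, action):
    """Hypothetical stand-in for the AC predictor: next latent = latent + action."""
    return [z + a for z, a in zip(latent, action)]

def l1_distance(a, b):
    """Sum of absolute differences between two latents."""
    return sum(abs(x - y) for x, y in zip(a, b))

def plan(start, goal, predict, horizon=3, n_candidates=256, seed=0):
    """Random-shooting planner: sample action sequences, roll each one out
    through the predictor, and keep the sequence whose final latent is
    closest to the goal latent."""
    rng = random.Random(seed)
    dim = len(start)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                   for _ in range(horizon)]
        latent = start
        for action in actions:
            latent = predict(latent, action)
        cost = l1_distance(latent, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start = [0.0, 0.0]
goal = [1.0, -1.0]
actions, cost = plan(start, goal, toy_predict)
baseline = l1_distance(start, goal)  # cost of taking no action at all
print(f"best cost {cost:.3f} vs. no-action baseline {baseline:.3f}")
```

With the real model, `predict` would wrap the predictor call and `start` and `goal` would come from encoding the current and goal frames; the exact call signature should be taken from the package itself, not from this sketch.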

from vjepa2_mlx.utils.weights import build_ac_encoder, build_ac_predictor
enc = build_ac_encoder()          # ViT-g encoder (hidden 1408, 40 layers)
pred = build_ac_predictor()       # AC predictor (frame-causal, 3D-RoPE)
# tokens = enc(video); future = pred(tokens, actions, states)
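
The "frame-causal" note on the predictor can be illustrated with a small mask builder. This is a sketch under two assumptions that the card does not confirm: tokens are laid out frame by frame, and every frame contributes the same number of tokens. Under a frame-causal mask, a token may attend to every token in its own frame and in earlier frames, but to none in later frames. `frame_causal_mask` is an illustrative helper, not a function from `vjepa2_mlx`.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Return a boolean matrix where mask[q][k] is True if query token q
    may attend to key token k. Tokens are assumed to be frame-major."""
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Unlike a token-level causal mask, the blocks on the diagonal are fully open, so tokens within one frame see each other in both directions.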
  • Arch: ViT-g encoder (hidden 1408, 40 layers, 22 heads; GELU, 3D-RoPE) + AC predictor (hidden 1024, 24 layers, 16 heads; fused-qkv ACRoPEAttention, frame-causal mask, 7-DoF action and state encoders).
  • Parity vs upstream torch (CPU, fp32), relative error: encoder 3.5e-5 · AC predictor 8.8e-6 (structural 8.9e-8). In fp16: encoder 9.1e-3 · AC predictor 4.8e-4.
  • Precision: fp16 (~2.6 GB).
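
The parity figures above are relative errors, but the card does not define the metric. The sketch below assumes one common choice, the L2 norm of the difference divided by the L2 norm of the reference, and applies it to a synthetic vector rounded to fp16 with the standard library's half-precision `struct` format. The data is made up for illustration and is not model output.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def relative_error(candidate, reference):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    ref = math.sqrt(sum(r * r for r in reference))
    return diff / ref

# Synthetic stand-in activations in the range [0.5, 2.5].
reference = [math.sin(0.37 * i) + 1.5 for i in range(1000)]
fp16_copy = [to_fp16(v) for v in reference]

rel = relative_error(fp16_copy, reference)
print(f"fp16 round-trip relative error: {rel:.1e}")
```

This measures only the rounding of stored values. Error in a real forward pass also accumulates across layers, so a full-model parity check compares the outputs of the two implementations on the same input.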

License: MIT (© Meta Platforms). Weights converted from vjepa2-ac-vitg.pt.

Safetensors · Model size: 1B params · Tensor type: F16 · MLX