# Cosmos-Policy 2B 480p – Romoya Bimanual Crack-Egg
Single-task fine-tune of nvidia/Cosmos-Predict2-2B-Video2World
(model-480p-16fps.pt) on the romoya bimanual lebai-follower crack-egg dataset
(romoya/B3_Station_crack_egg, 55 episodes / 118,224 frames).
Task language: "pick-up an egg and crack into the bowl."
## Training
- Checkpoint exported at iteration 30,000 (current `model.pt`); this is the plateau (see the eval table below).
- Recipe inherits the bimanual ALOHA Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras: 1 third-person base + 2 wrist).
- Batch size 4, `num_workers=4`, 1× A100 80 GB, ~1.2 s/iter steady-state with a 56 GB/worker decoded-video cache.
- Trained on the LeRobot v2.1 conversion of the source v3 dataset.
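The settings above can be summarized in one place. This is a hypothetical plain-dict sketch for reference only; the key names are illustrative and do not reflect the actual Cosmos-Policy config schema.

```python
# Illustrative summary of the fine-tune settings listed above.
# Key names are hypothetical, not the real Cosmos-Policy config schema.
finetune_config = {
    "base_checkpoint": "model-480p-16fps.pt",
    "dataset": "romoya/B3_Station_crack_egg",  # LeRobot v2.1 conversion
    "state_t": 11,
    "chunk_duration": 41,
    "cameras": ["base", "left_wrist", "right_wrist"],
    "batch_size": 4,
    "num_workers": 4,
    "export_iteration": 30_000,
}

assert len(finetune_config["cameras"]) == 3
```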
## Files
| file | purpose |
|---|---|
| `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
| `dataset_statistics.json` | action / proprio normalization stats used at training time |
| `dataset_statistics_post_norm.json` | post-normalization stats (auxiliary) |
| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands; only "pick-up an egg and crack into the bowl" is used here |
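`dataset_statistics.json` drives the mapping between raw robot units and the normalized space the policy is trained in. The exact schema of that file is not shown here, so the following is a minimal sketch assuming simple per-dimension mean/std normalization with hypothetical field names:

```python
import json

# Hypothetical per-dimension stats in the spirit of dataset_statistics.json;
# the real file's schema and normalization scheme may differ.
stats = json.loads(
    '{"action": {"mean": [0.1, -0.2, 0.0], "std": [0.5, 0.25, 1.0]}}'
)

def normalize(vec, mean, std):
    # Map raw units into the normalized space used at training time.
    return [(v - m) / s for v, m, s in zip(vec, mean, std)]

def unnormalize(vec, mean, std):
    # Invert the mapping to recover raw joint-pos / effort / velocity units.
    return [v * s + m for v, m, s in zip(vec, mean, std)]

a = stats["action"]
raw = [0.6, 0.05, -1.0]
norm = normalize(raw, a["mean"], a["std"])
back = unnormalize(norm, a["mean"], a["std"])
print([round(x, 6) for x in back])  # -> [0.6, 0.05, -1.0]
```

Unnormalizing predictions with these stats is what puts the offline L1 metric below into raw robot units.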
## Offline evaluation
On romoya/eval_pi05_bimanual_crack_egg (5 episodes × 5 query points each, 5 denoising steps, action-chunk L1 in unnormalized units):
| Checkpoint | Mean L1 ↓ | Cross-step Corr ↑ |
|---|---|---|
| iter 6,500 | 0.7575 | 0.130 |
| iter 7,000 | 0.7333 | 0.114 |
| iter 10,000 | 0.6975 | 0.111 |
| iter 15,000 | 0.6830 | 0.154 |
| iter 20,000 | 0.6654 | 0.144 |
| iter 25,000 | 0.6563 | 0.152 |
| iter 30,000 | 0.6545 | 0.142 |
Action L1 is computed in the unnormalized space, i.e., raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower. The action vector is 94-D (12 joint pos + 12 effort + 12 vel + 4 DO + ... per arm pair); the proprioceptive state is 166-D.
The improvement curve flattens after iter 25,000 (Δ = −0.002 over the last 5,000 iters), i.e., 55 demos hit a plateau at this resolution. Bigger gains likely require either more demonstrations or task-mixture training.
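The quoted plateau delta follows directly from the last two rows of the eval table:

```python
# Mean-L1 values from the eval table above; checks the quoted plateau delta
# of about -0.002 over the last 5,000 iterations.
mean_l1 = {25_000: 0.6563, 30_000: 0.6545}
delta = round(mean_l1[30_000] - mean_l1[25_000], 3)
print(delta)  # -> -0.002
```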
## Usage
Cameras: 1 third-person (base) + 2 wrist (left_wrist, right_wrist).
The model expects the ALOHA-style obs dict with keys `primary_image`, `left_wrist_image`, `right_wrist_image`,
`future_primary_image`, `future_left_wrist_image`, `future_right_wrist_image`, and `proprio`.
See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (the `suite="aloha"` branch) for the full contract.
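The key contract can be sketched with placeholder data. The nested lists below stand in for camera frames (real usage would pass image arrays at the model's input resolution, which is not restated here); only the key names and the 166-D proprio length come from this card.

```python
# Stand-in obs dict matching the key contract above. The 2x2 "images" are
# placeholders only; real usage supplies camera frames (e.g. numpy arrays)
# and a 166-D proprioceptive vector.
dummy_image = [[0, 0], [0, 0]]
obs = {
    "primary_image": dummy_image,
    "left_wrist_image": dummy_image,
    "right_wrist_image": dummy_image,
    "future_primary_image": dummy_image,
    "future_left_wrist_image": dummy_image,
    "future_right_wrist_image": dummy_image,
    "proprio": [0.0] * 166,  # PROPRIO_DIM = 166
}

assert len(obs["proprio"]) == 166
print(sorted(obs))  # the seven keys in the contract
```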
Action / proprio dimensions deviate from the ALOHA defaults (`ACTION_DIM=94`, `PROPRIO_DIM=166`, `NUM_ACTIONS_CHUNK=25`);
patch `cosmos_policy.constants` at runtime before importing `cosmos_utils`:
```python
import cosmos_policy.constants as _C

# Override the ALOHA defaults before cosmos_utils reads them at import time.
_C.NUM_ACTIONS_CHUNK = 25
_C.ACTION_DIM = 94
_C.PROPRIO_DIM = 166

from cosmos_policy.experiments.robot.cosmos_utils import get_action, get_model
```
A worked offline-eval example lives in `cosmos_policy/experiments/robot/romoya/eval_offline.py`.
## Model tree for romoya/cosmos_predict2_2b_480p_crack_egg

Base model: nvidia/Cosmos-Predict2-2B-Video2World