Cosmos-Policy 2B 480p: Romoya Bimanual Crack-Egg

Single-task fine-tune of nvidia/Cosmos-Predict2-2B-Video2World (model-480p-16fps.pt) on the romoya bimanual lebai-follower crack-egg dataset (romoya/B3_Station_crack_egg, 55 episodes / 118,224 frames).

Task language: "pick-up an egg and crack into the bowl."

Training

  • Checkpoint exported at iteration 30,000 (current model.pt); this is the plateau checkpoint (see the eval table below).
  • Recipe inherits the bimanual ALOHA Cosmos-Policy training schedule (state_t=11, chunk_duration=41; 3 cameras: 1 third-person base + 2 wrist).
  • Batch size 4, num_workers 4, 1Γ— A100 80 GB, ~1.2 s/iter steady-state with a 56 GB/worker decoded-video cache.
  • Trained on the LeRobot v2.1 conversion of the source v3 dataset.

Files

| File | Purpose |
|---|---|
| model.pt | Consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via torch.distributed.checkpoint.format_utils.dcp_to_torch_save) |
| dataset_statistics.json | Action / proprio normalization stats used at training time |
| dataset_statistics_post_norm.json | Post-normalization stats (auxiliary) |
| t5_embeddings.pkl | Precomputed T5 embeddings for the 4 romoya task commands; only "pick-up an egg and crack into the bowl" is used here |

Offline evaluation

On romoya/eval_pi05_bimanual_crack_egg (5 episodes Γ— 5 query points each, 5 denoising steps, action-chunk L1 in unnormalized units):

| Checkpoint | Mean L1 ↓ | Cross-step Corr ↑ |
|---|---|---|
| iter 6,500 | 0.7575 | 0.130 |
| iter 7,000 | 0.7333 | 0.114 |
| iter 10,000 | 0.6975 | 0.111 |
| iter 15,000 | 0.6830 | 0.154 |
| iter 20,000 | 0.6654 | 0.144 |
| iter 25,000 | 0.6563 | 0.152 |
| iter 30,000 | 0.6545 | 0.142 |

Action L1 is computed in the unnormalized space, i.e., raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower. The action vector is 94-D (12 joint pos + 12 effort + 12 vel + 4 DO + ... per arm pair); the proprioceptive state is 166-D.
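A minimal sketch of this Mean L1 metric, assuming a mean/std normalization scheme in the style of dataset_statistics.json (the exact key layout and helper names here are illustrative; the real evaluation lives in eval_offline.py): predicted chunks are mapped back to raw units with the training-time action stats, then compared elementwise to ground truth.

```python
import numpy as np

ACTION_DIM = 94  # per-step action dimensionality (from this card)
CHUNK = 25       # NUM_ACTIONS_CHUNK

# Hypothetical normalization stats; dataset_statistics.json's actual
# structure is an assumption here.
stats = {"action": {"mean": np.zeros(ACTION_DIM), "std": np.ones(ACTION_DIM)}}

def unnormalize(actions, stats):
    """Map normalized actions back to raw joint-pos/effort/velocity/DO units."""
    return actions * stats["action"]["std"] + stats["action"]["mean"]

def mean_l1(pred_norm, gt_raw, stats):
    """Mean absolute error over the chunk, in the unnormalized action space."""
    return float(np.abs(unnormalize(pred_norm, stats) - gt_raw).mean())

rng = np.random.default_rng(0)
pred = rng.normal(size=(CHUNK, ACTION_DIM))  # stand-in for a predicted chunk
gt = rng.normal(size=(CHUNK, ACTION_DIM))    # stand-in for ground truth
print(mean_l1(pred, gt, stats))
```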

The improvement curve flattens after iter 25,000 (Δ = −0.002 over the last 5,000 iters); in other words, 55 demos hit a plateau at this resolution. Larger gains likely require either more demonstrations or task-mixture training.

Usage

Cameras: 1 third-person (base) + 2 wrist (left_wrist, right_wrist).

The model expects the ALOHA-style obs dict with keys primary_image, left_wrist_image, right_wrist_image, future_primary_image, future_left_wrist_image, future_right_wrist_image, proprio. See cosmos_policy/experiments/robot/cosmos_utils.py:get_action (suite="aloha" branch) for the full contract.
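For illustration, that observation dict can be mocked with dummy arrays. The shapes and dtypes below are assumptions (480p inputs, uint8 HWC images), and get_action remains the authoritative contract:

```python
import numpy as np

H, W = 480, 640     # resolution is an assumption; the checkpoint is a 480p model
PROPRIO_DIM = 166   # 166-D proprioceptive state (from this card)

def dummy_image():
    """Placeholder HWC uint8 camera frame."""
    return np.zeros((H, W, 3), dtype=np.uint8)

# ALOHA-style observation dict with the keys listed above; the future_*
# entries are placeholders here (see cosmos_utils.get_action for how they
# are actually populated).
obs = {
    "primary_image": dummy_image(),
    "left_wrist_image": dummy_image(),
    "right_wrist_image": dummy_image(),
    "future_primary_image": dummy_image(),
    "future_left_wrist_image": dummy_image(),
    "future_right_wrist_image": dummy_image(),
    "proprio": np.zeros(PROPRIO_DIM, dtype=np.float32),
}
```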

Action / proprio dimensions deviate from ALOHA defaults (ACTION_DIM=94, PROPRIO_DIM=166, NUM_ACTIONS_CHUNK=25); patch cosmos_policy.constants at runtime before importing cosmos_utils:

```python
# Patch the ALOHA defaults before cosmos_utils is imported, so the
# overridden values are picked up at import time.
import cosmos_policy.constants as _C

_C.NUM_ACTIONS_CHUNK = 25  # action-chunk length
_C.ACTION_DIM = 94         # 94-D action vector
_C.PROPRIO_DIM = 166       # 166-D proprioceptive state

from cosmos_policy.experiments.robot.cosmos_utils import get_action, get_model
```

A worked offline-eval example lives in cosmos_policy/experiments/robot/romoya/eval_offline.py.

