# Cosmos-Policy 2B 480p – Romoya Bimanual Crack-Egg
Single-task fine-tune of nvidia/Cosmos-Predict2-2B-Video2World
(model-480p-16fps.pt) on the romoya bimanual lebai-follower crack-egg dataset
(romoya/B3_Station_crack_egg, 55 episodes / 118,224 frames).
Task language: "pick-up an egg and crack into the bowl."
## Training
- Checkpoint exported at iteration 30,000 (current `model.pt`); this is the plateau (see the eval table below).
- Recipe inherits the bimanual ALOHA Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras: 1 third-person base + 2 wrist).
- Batch size 4, `num_workers=4`, 1× A100 80 GB, ~1.2 s/iter steady-state with a 56 GB/worker decoded-video cache.
- Trained on the LeRobot v2.1 conversion of the source v3 dataset.
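The settings above can be summarized in one place. This is a hypothetical plain-dict sketch for reference only; the key names are illustrative and do not reflect the actual Cosmos-Policy config schema.

```python
# Illustrative summary of the fine-tune settings listed above.
# Key names are hypothetical, not the real Cosmos-Policy config schema.
finetune_config = {
    "base_checkpoint": "model-480p-16fps.pt",
    "dataset": "romoya/B3_Station_crack_egg",  # LeRobot v2.1 conversion
    "state_t": 11,
    "chunk_duration": 41,
    "cameras": ["base", "left_wrist", "right_wrist"],
    "batch_size": 4,
    "num_workers": 4,
    "export_iteration": 30_000,
}

assert len(finetune_config["cameras"]) == 3
```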
## Files
| file | purpose |
|---|---|
| `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
| `dataset_statistics.json` | action / proprio normalization stats used at training time |
| `dataset_statistics_post_norm.json` | post-normalization stats (auxiliary) |
| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands; only "pick-up an egg and crack into the bowl" is used here |
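`dataset_statistics.json` drives the mapping between raw robot units and the normalized space the policy is trained in. The exact schema of that file is not shown here, so the following is a minimal sketch assuming simple per-dimension mean/std normalization with hypothetical field names:

```python
import json

# Hypothetical per-dimension stats in the spirit of dataset_statistics.json;
# the real file's schema and normalization scheme may differ.
stats = json.loads(
    '{"action": {"mean": [0.1, -0.2, 0.0], "std": [0.5, 0.25, 1.0]}}'
)

def normalize(vec, mean, std):
    # Map raw units into the normalized space used at training time.
    return [(v - m) / s for v, m, s in zip(vec, mean, std)]

def unnormalize(vec, mean, std):
    # Invert the mapping to recover raw joint-pos / effort / velocity units.
    return [v * s + m for v, m, s in zip(vec, mean, std)]

a = stats["action"]
raw = [0.6, 0.05, -1.0]
norm = normalize(raw, a["mean"], a["std"])
back = unnormalize(norm, a["mean"], a["std"])
print([round(x, 6) for x in back])  # -> [0.6, 0.05, -1.0]
```

Unnormalizing predictions with these stats is what puts the offline L1 metric below into raw robot units.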
## Offline evaluation
On romoya/eval_pi05_bimanual_crack_egg (5 episodes × 5 query points each, 5 denoising steps, action-chunk L1 in unnormalized units):
| Checkpoint | Mean L1 ↓ | Cross-step Corr ↑ |
|---|---|---|
| iter 6,500 | 0.7575 | 0.130 |
| iter 7,000 | 0.7333 | 0.114 |
| iter 10,000 | 0.6975 | 0.111 |
| iter 15,000 | 0.6830 | 0.154 |
| iter 20,000 | 0.6654 | 0.144 |
| iter 25,000 | 0.6563 | 0.152 |
| iter 30,000 | 0.6545 | 0.142 |
Action L1 is computed in the unnormalized space, i.e., raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower. The action vector is 94-D (12 joint pos + 12 effort + 12 vel + 4 DO + ... per arm pair); the proprioceptive state is 166-D.
The improvement curve flattens after iter 25,000 (Δ = −0.002 over the last 5,000 iters), i.e., 55 demos hit a plateau at this resolution. Bigger gains likely require either more demonstrations or task-mixture training.
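The quoted plateau delta follows directly from the last two rows of the eval table:

```python
# Mean-L1 values from the eval table above; checks the quoted plateau delta
# of about -0.002 over the last 5,000 iterations.
mean_l1 = {25_000: 0.6563, 30_000: 0.6545}
delta = round(mean_l1[30_000] - mean_l1[25_000], 3)
print(delta)  # -> -0.002
```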
## Usage
Cameras: 1 third-person (base) + 2 wrist (left_wrist, right_wrist).
The model expects the ALOHA-style obs dict with keys `primary_image`, `left_wrist_image`, `right_wrist_image`,
`future_primary_image`, `future_left_wrist_image`, `future_right_wrist_image`, and `proprio`.
See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (the `suite="aloha"` branch) for the full contract.
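The key contract can be sketched with placeholder data. The nested lists below stand in for camera frames (real usage would pass image arrays at the model's input resolution, which is not restated here); only the key names and the 166-D proprio length come from this card.

```python
# Stand-in obs dict matching the key contract above. The 2x2 "images" are
# placeholders only; real usage supplies camera frames (e.g. numpy arrays)
# and a 166-D proprioceptive vector.
dummy_image = [[0, 0], [0, 0]]
obs = {
    "primary_image": dummy_image,
    "left_wrist_image": dummy_image,
    "right_wrist_image": dummy_image,
    "future_primary_image": dummy_image,
    "future_left_wrist_image": dummy_image,
    "future_right_wrist_image": dummy_image,
    "proprio": [0.0] * 166,  # PROPRIO_DIM = 166
}

assert len(obs["proprio"]) == 166
print(sorted(obs))  # the seven keys in the contract
```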
Action / proprio dimensions deviate from the ALOHA defaults (`ACTION_DIM=94`, `PROPRIO_DIM=166`, `NUM_ACTIONS_CHUNK=25`);
patch `cosmos_policy.constants` at runtime before importing `cosmos_utils`:
```python
import cosmos_policy.constants as _C

# Override the ALOHA defaults before cosmos_utils reads them at import time.
_C.NUM_ACTIONS_CHUNK = 25
_C.ACTION_DIM = 94
_C.PROPRIO_DIM = 166

from cosmos_policy.experiments.robot.cosmos_utils import get_action, get_model
```
A worked offline-eval example lives in `cosmos_policy/experiments/robot/romoya/eval_offline.py`.
## Model tree for romoya/cosmos_predict2_2b_480p_crack_egg

Base model: nvidia/Cosmos-Predict2-2B-Video2World