romoya/cosmos_predict2_2b_480p_crack_egg

@@ -16,16 +16,16 @@ library_name: cosmos-policy
 # Cosmos-Policy 2B 480p — Romoya Bimanual Crack-Egg
 Single-task fine-tune of [`nvidia/Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
-(model-480p-16fps.pt) on the **romoya** bimanual lebai-follower **crack-egg** dataset
 (`romoya/B3_Station_crack_egg`, 55 episodes / 118,224 frames).
 Task language: `pick-up an egg and crack into the bowl`.
 ## Training
-- Checkpoint exported at iteration **20,000** (best of {6500, 7000, 10000, 15000, 20000} on offline eval).
-- Recipe inherits the ALOHA bimanual ALOHA-Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras).
-- Batch size 4, num_workers 4, 1× A100 80 GB, ~1.4 s/iter steady-state.
 - Trained on the LeRobot v2.1 conversion of the source v3 dataset.
 ## Files
@@ -33,23 +33,27 @@ Task language: `pick-up an egg and crack into the bowl`.
 | file | purpose |
 |---|---|
 | `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
-| `dataset_statistics.json` | action/proprio normalisation stats used at training |
-| `dataset_statistics_post_norm.json` | post-normalisation stats (auxiliary) |
-| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands (only `pick-up an egg and crack into the bowl` is used here) |
 ## Offline evaluation
-On `romoya/eval_pi05_bimanual_crack_egg` (5 episodes × 5 query points, 5 denoising steps):
-| Checkpoint | Mean L1 (action units) ↓ | Cross-step Corr ↑ |
 |---|---|---|
-| iter 6500  | 0.7575 | 0.130 |
-| iter 7000  | 0.7333 | 0.114 |
-| iter 10000 | 0.6975 | 0.111 |
-| iter 15000 | 0.6830 | 0.154 |
-| **iter 20000** | **0.6654** | **0.144** |
-Action L1 is computed in the unnormalised space — i.e., raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower (94-dim action, 166-dim proprio).
 ## Usage
@@ -60,7 +64,7 @@ The model expects the ALOHA-style `obs` dict with keys `primary_image`, `left_wr
 See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (suite="aloha" branch) for the full contract.
 Action / proprio dimensions deviate from ALOHA defaults (ACTION_DIM=94, PROPRIO_DIM=166, NUM_ACTIONS_CHUNK=25);
-patch `cosmos_policy.constants` at runtime before importing `cosmos_utils`.
 ```python
 import cosmos_policy.constants as _C

 # Cosmos-Policy 2B 480p — Romoya Bimanual Crack-Egg
 Single-task fine-tune of [`nvidia/Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
+(`model-480p-16fps.pt`) on the **romoya** bimanual lebai-follower **crack-egg** dataset
 (`romoya/B3_Station_crack_egg`, 55 episodes / 118,224 frames).
 Task language: `pick-up an egg and crack into the bowl`.
 ## Training
+- Checkpoint exported at iteration **30,000** (current `model.pt`); offline Mean L1 has plateaued by this point (see eval table below).
+- Recipe inherits the ALOHA bimanual ALOHA-Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras: 1 third-person `base` + 2 wrist).
+- Batch size 4, num_workers 4, 1× A100 80 GB, ~1.2 s/iter steady-state with a 56 GB/worker decoded-video cache.
 - Trained on the LeRobot v2.1 conversion of the source v3 dataset.
 ## Files
 | file | purpose |
 |---|---|
 | `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
+| `dataset_statistics.json` | action / proprio normalization stats used at training time |
+| `dataset_statistics_post_norm.json` | post-normalization stats (auxiliary) |
+| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands; only `pick-up an egg and crack into the bowl` is used here |
 ## Offline evaluation
+On `romoya/eval_pi05_bimanual_crack_egg` (5 episodes × 5 query points each, 5 denoising steps, action-chunk L1 in unnormalized units):
+| Checkpoint | Mean L1 ↓ | Cross-step Corr ↑ |
 |---|---|---|
+| iter 6,500  | 0.7575 | 0.130 |
+| iter 7,000  | 0.7333 | 0.114 |
+| iter 10,000 | 0.6975 | 0.111 |
+| iter 15,000 | 0.6830 | 0.154 |
+| iter 20,000 | 0.6654 | 0.144 |
+| iter 25,000 | 0.6563 | 0.152 |
+| **iter 30,000** | **0.6545** | 0.142 |
+Action L1 is computed in the unnormalized space — i.e., raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower. The action vector is 94-D (12 joint pos + 12 effort + 12 vel + 4 DO + ... per arm pair); the proprioceptive state is 166-D.
+Mean L1 flattens after iter 25,000 (0.6563 → 0.6545, Δ ≈ −0.002 over the last 5,000 iterations), which suggests that the 55 demonstrations have reached a plateau at this resolution. Larger gains likely require either more demonstrations or task-mixture training.
 ## Usage
 See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (suite="aloha" branch) for the full contract.
 Action / proprio dimensions deviate from ALOHA defaults (ACTION_DIM=94, PROPRIO_DIM=166, NUM_ACTIONS_CHUNK=25);
+patch `cosmos_policy.constants` at runtime **before** importing `cosmos_utils`:
 ```python
 import cosmos_policy.constants as _C

model.pt

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5c0a470f4a54d6479c51db66552cabd11327b21aab5e18151e8c6bb0cc42c3d3
 size 3913008759

 version https://git-lfs.github.com/spec/v1
+oid sha256:b1624cf4ed9208527f4e28fb1fadb1fe54bdb0caf822fbc71db63e2c01c0c0ad
 size 3913008759
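
The pointer diff above changes only the `oid`; the `size` is identical before and after, so file size alone cannot tell the iteration-20,000 checkpoint from the iteration-30,000 one. A minimal sketch of checking a downloaded `model.pt` against the new pointer follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical (not part of any library), and the sketch assumes the pointer follows the standard Git LFS v1 layout shown in the diff, where the `oid` is the SHA-256 of the file contents.

```python
import hashlib
from pathlib import Path

# New-side pointer, copied from the diff above.
NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b1624cf4ed9208527f4e28fb1fadb1fe54bdb0caf822fbc71db63e2c01c0c0ad
size 3913008759
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS v1 pointer into {'version', 'sha256', 'size'} (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo!r}")
    return {
        "version": fields["version"],
        "sha256": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and SHA-256 recorded in `pointer`."""
    path = Path(path)
    # Cheap check first: a size mismatch avoids hashing several GB.
    if path.stat().st_size != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in chunks so the ~3.9 GB checkpoint is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["sha256"]
```

With a local copy of the checkpoint, `matches_pointer("model.pt", parse_lfs_pointer(NEW_POINTER))` should return `True` only for the iteration-30,000 file; a stale copy of the earlier checkpoint would pass the size check but fail the hash comparison.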