Replace iter 20000 with iter 30000 (eval L1 0.6654 → 0.6545)
README.md
CHANGED
@@ -16,16 +16,16 @@ library_name: cosmos-policy
# Cosmos-Policy 2B 480p — Romoya Bimanual Crack-Egg

 Single-task fine-tune of [`nvidia/Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
-(model-480p-16fps.pt) on the **romoya** bimanual lebai-follower **crack-egg** dataset
 (`romoya/B3_Station_crack_egg`, 55 episodes / 118,224 frames).

 Task language: `pick-up an egg and crack into the bowl`.

 ## Training

-- Checkpoint exported at iteration **
-- Recipe inherits the ALOHA bimanual ALOHA-Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras).
-- Batch size 4, num_workers 4, 1× A100 80 GB, ~1.
 - Trained on the LeRobot v2.1 conversion of the source v3 dataset.

 ## Files
@@ -33,23 +33,27 @@ Task language: `pick-up an egg and crack into the bowl`.
 | file | purpose |
 |---|---|
 | `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
-| `dataset_statistics.json` | action/proprio
-| `dataset_statistics_post_norm.json` | post-
-| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands

 ## Offline evaluation

-On `romoya/eval_pi05_bimanual_crack_egg` (5 episodes × 5 query points, 5 denoising steps):

-| Checkpoint | Mean L1
 |---|---|---|
-| iter
-| iter
-| iter
-| iter
-|

-Action L1 is computed in the

 ## Usage

@@ -60,7 +64,7 @@ The model expects the ALOHA-style `obs` dict with keys `primary_image`, `left_wr
 See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (suite="aloha" branch) for the full contract.

 Action / proprio dimensions deviate from ALOHA defaults (ACTION_DIM=94, PROPRIO_DIM=166, NUM_ACTIONS_CHUNK=25);
-patch `cosmos_policy.constants` at runtime before importing `cosmos_utils`

 ```python
 import cosmos_policy.constants as _C
@@ -16,16 +16,16 @@ library_name: cosmos-policy
# Cosmos-Policy 2B 480p — Romoya Bimanual Crack-Egg

 Single-task fine-tune of [`nvidia/Cosmos-Predict2-2B-Video2World`](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
+(`model-480p-16fps.pt`) on the **romoya** bimanual lebai-follower **crack-egg** dataset
 (`romoya/B3_Station_crack_egg`, 55 episodes / 118,224 frames).

 Task language: `pick-up an egg and crack into the bowl`.

 ## Training

+- Checkpoint exported at iteration **30,000** (current `model.pt`; this is the plateau, see the eval table below).
+- Recipe inherits the ALOHA bimanual ALOHA-Cosmos-Policy schedule (`state_t=11`, `chunk_duration=41`, 3 cameras: 1 third-person `base` + 2 wrist).
+- Batch size 4, num_workers 4, 1× A100 80 GB, ~1.2 s/iter steady-state with a 56 GB/worker decoded-video cache.
 - Trained on the LeRobot v2.1 conversion of the source v3 dataset.

 ## Files
@@ -33,23 +33,27 @@ Task language: `pick-up an egg and crack into the bowl`.
 | file | purpose |
 |---|---|
 | `model.pt` | consolidated PyTorch checkpoint (~3.91 GB; converted from FSDP/DCP shards via `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`) |
+| `dataset_statistics.json` | action / proprio normalization stats used at training time |
+| `dataset_statistics_post_norm.json` | post-normalization stats (auxiliary) |
+| `t5_embeddings.pkl` | precomputed T5 embeddings for the 4 romoya task commands; only `pick-up an egg and crack into the bowl` is used here |
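
The `model.pt` row names `dcp_to_torch_save` as the conversion step. A minimal sketch of that consolidation, assuming PyTorch 2.2 or newer is installed; the wrapper name and any paths passed to it are hypothetical, and this is not necessarily the script used for this repository:

```python
from pathlib import Path


def consolidate_dcp_checkpoint(dcp_dir: str, out_path: str) -> Path:
    """Convert a sharded DCP checkpoint directory into a single torch.save file."""
    src = Path(dcp_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"DCP checkpoint directory not found: {src}")

    # Imported lazily so the path check above also works without PyTorch installed.
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

    dst = Path(out_path)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Reads every shard in `src` and writes one consolidated file to `dst`.
    dcp_to_torch_save(str(src), str(dst))
    return dst
```

The resulting file can then be loaded with `torch.load` like any other single-file checkpoint.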

 ## Offline evaluation

+On `romoya/eval_pi05_bimanual_crack_egg` (5 episodes × 5 query points each, 5 denoising steps, action-chunk L1 in unnormalized units):

+| Checkpoint | Mean L1 ↓ | Cross-step Corr ↑ |
 |---|---|---|
+| iter 6,500 | 0.7575 | 0.130 |
+| iter 7,000 | 0.7333 | 0.114 |
+| iter 10,000 | 0.6975 | 0.111 |
+| iter 15,000 | 0.6830 | 0.154 |
+| iter 20,000 | 0.6654 | 0.144 |
+| iter 25,000 | 0.6563 | 0.152 |
+| **iter 30,000** | **0.6545** | 0.142 |

+Action L1 is computed in the unnormalized space, i.e., in the raw joint-pos / effort / velocity / DO units of the romoya bi-lebai-follower. The action vector is 94-D (12 joint pos + 12 effort + 12 vel + 4 DO + ... per arm pair); the proprioceptive state is 166-D.
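
As a rough illustration of the metric described above, the sketch below un-normalizes a predicted action chunk and averages the absolute error against the ground truth. The min-max scaling to [-1, 1], the function names, and the toy 3-D actions are assumptions for illustration only; the actual normalization scheme and evaluation script are not shown in this diff.

```python
from typing import List, Sequence

Chunk = List[List[float]]  # shape: (timesteps, action_dim)


def unnormalize(chunk: Chunk, lo: Sequence[float], hi: Sequence[float]) -> Chunk:
    """Map values from [-1, 1] back to raw units (assumed min-max scheme)."""
    return [
        [(v + 1.0) / 2.0 * (h - l) + l for v, l, h in zip(step, lo, hi)]
        for step in chunk
    ]


def mean_l1(pred: Chunk, target: Chunk) -> float:
    """Mean absolute error over every timestep and action dimension."""
    total, count = 0.0, 0
    for p_step, t_step in zip(pred, target):
        for p, t in zip(p_step, t_step):
            total += abs(p - t)
            count += 1
    return total / count


# Toy example: 2 timesteps, 3 action dimensions with per-dimension ranges.
lo, hi = [0.0, 0.0, -10.0], [2.0, 4.0, 10.0]
pred_raw = unnormalize([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]], lo, hi)
target_raw = [[1.0, 2.5, 0.0], [2.0, 0.0, 4.0]]
print(mean_l1(pred_raw, target_raw))
```

Because the error is taken in raw units, dimensions with large physical ranges dominate the average, which is worth keeping in mind when comparing L1 values across robots or action layouts.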
+
+The improvement curve flattens after iter 25,000 (Δ = −0.002 over the last 5,000 iters), i.e., 55 demos hit a plateau at this resolution. Bigger gains likely require either more demonstrations or task-mixture training.
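
The Δ quoted above can be checked directly from the evaluation table. The sketch below copies the Mean L1 column and prints the change between consecutive checkpoints; the helper name is illustrative.

```python
# Mean L1 per checkpoint, copied from the evaluation table.
mean_l1_by_iter = {
    6500: 0.7575,
    7000: 0.7333,
    10000: 0.6975,
    15000: 0.6830,
    20000: 0.6654,
    25000: 0.6563,
    30000: 0.6545,
}


def deltas(table):
    """Return (start_iter, end_iter, change in Mean L1) for consecutive checkpoints."""
    iters = sorted(table)
    return [(a, b, round(table[b] - table[a], 4)) for a, b in zip(iters, iters[1:])]


for start, end, delta in deltas(mean_l1_by_iter):
    print(f"iter {start:>6,} -> {end:>6,}: {delta:+.4f}")
```

The last interval comes out at −0.0018, which rounds to the −0.002 quoted above, against −0.0091 for the preceding 5,000 iterations. Note that the earlier checkpoints are not evenly spaced, so only the last four intervals are directly comparable.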

 ## Usage

@@ -60,7 +64,7 @@ The model expects the ALOHA-style `obs` dict with keys `primary_image`, `left_wr
 See `cosmos_policy/experiments/robot/cosmos_utils.py:get_action` (suite="aloha" branch) for the full contract.

 Action / proprio dimensions deviate from ALOHA defaults (ACTION_DIM=94, PROPRIO_DIM=166, NUM_ACTIONS_CHUNK=25);
+patch `cosmos_policy.constants` at runtime **before** importing `cosmos_utils`:

 ```python
 import cosmos_policy.constants as _C
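
The hunk ends before the rest of the README snippet is shown. One plausible reason the order matters: a module that does `from <constants> import ACTION_DIM` copies the value at import time, so a later patch is not seen. The sketch below demonstrates this with stand-in modules rather than the real `cosmos_policy` package; the module names and the default value 14 are hypothetical.

```python
import sys
import types

# Stand-in for a constants module (hypothetical name and default value).
constants = types.ModuleType("fake_constants")
constants.ACTION_DIM = 14
sys.modules["fake_constants"] = constants


def import_consumer(name: str) -> types.ModuleType:
    """Create a module that binds ACTION_DIM at import time, like `from ... import`."""
    module = types.ModuleType(name)
    exec("from fake_constants import ACTION_DIM", module.__dict__)
    return module


early = import_consumer("early_consumer")  # "imported" before the patch
constants.ACTION_DIM = 94                  # runtime patch of the constants module
late = import_consumer("late_consumer")    # "imported" after the patch

print(early.ACTION_DIM, late.ACTION_DIM)  # prints "14 94"
```

Only the consumer created after the patch sees the new value, which is why the patch has to run first.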
model.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b1624cf4ed9208527f4e28fb1fadb1fe54bdb0caf822fbc71db63e2c01c0c0ad
 size 3913008759
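
The pointer above is all that Git itself stores for `model.pt`; the payload lives in LFS. A small sketch for checking a downloaded file against such a pointer, assuming the standard three-line layout (`version`, `oid sha256:...`, `size`). The function names are illustrative.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check file size first (cheap), then the streamed hash (expensive)."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest() == pointer["digest"]


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b1624cf4ed9208527f4e28fb1fadb1fe54bdb0caf822fbc71db63e2c01c0c0ad
size 3913008759
"""
pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['size'] / 1e9:.2f} GB")  # 3.91 GB
```

The size is identical on both sides of the diff (3,913,008,759 bytes, matching the README's "~3.91 GB"), so only the `oid` distinguishes the iteration-30,000 checkpoint from the one it replaces.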