moojink-nv committed (verified) · Commit 1c62363 · 1 Parent(s): c7a7043

Upload folder using huggingface_hub

Files changed (2):
1. Cosmos-Policy-RoboCasa-Predict2-2B.pt (+3 -0)
2. README.md (+115 -0)
Cosmos-Policy-RoboCasa-Predict2-2B.pt ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:ed4b991612b0d75e5e633fb2ca821e54c66a3b89973eb361e6ca50ba4984506f
size 3913017345
README.md ADDED

# Cosmos-Policy-RoboCasa-Predict2-2B
## Model Description

Cosmos-Policy-RoboCasa-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). This model achieves state-of-the-art performance on the RoboCasa simulation benchmark with a 67.1% average success rate across 24 kitchen manipulation tasks, using only ~50 human teleoperation demonstrations per task.

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 67.1% average success rate on RoboCasa with only 50 demos per task (vs. 300+ for most other methods)
- **Data efficiency**: Trained on significantly fewer demonstrations than prior state-of-the-art methods

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.

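A minimal sketch of this adaptation is shown below. It assumes hypothetical projection layers and latent shapes (the exact encoders and dimensions are not specified in this card): low-dimensional actions, proprioception, and the scalar value are each mapped to a latent "frame" with the same channel and spatial shape as the video latents, then concatenated along the temporal axis so the diffusion transformer denoises all modalities jointly.

```python
import torch
import torch.nn as nn


class LatentFrameInjector(nn.Module):
    """Hypothetical sketch: embed actions, proprioception, and value as extra latent frames.

    Assumes video latents of shape (B, C, T, H, W) with H == W, as produced by a
    latent video diffusion model; the exact shapes and encoders are assumptions.
    """

    def __init__(self, latent_channels: int, latent_hw: int,
                 action_dim: int = 7, chunk_len: int = 32, proprio_dim: int = 9):
        super().__init__()
        frame_numel = latent_channels * latent_hw * latent_hw
        # One learned projection per extra modality; each maps a flat vector
        # to a full latent frame that can sit in the diffusion sequence.
        self.action_proj = nn.Linear(action_dim * chunk_len, frame_numel)
        self.proprio_proj = nn.Linear(proprio_dim, frame_numel)
        self.value_proj = nn.Linear(1, frame_numel)
        self.c, self.hw = latent_channels, latent_hw

    def forward(self, video_latents, actions, proprio, value):
        # video_latents: (B, C, T, H, W); actions: (B, 32, 7); proprio: (B, 9); value: (B, 1)
        b = video_latents.shape[0]

        def to_frame(x):
            return x.view(b, self.c, 1, self.hw, self.hw)

        extra_frames = [
            to_frame(self.action_proj(actions.flatten(1))),
            to_frame(self.proprio_proj(proprio)),
            to_frame(self.value_proj(value)),
        ]
        # Concatenate along the temporal axis: the model now denoises image
        # frames and action/state/value "frames" as one latent sequence.
        return torch.cat([video_latents, *extra_frames], dim=2)
```

At inference time, the positions corresponding to these extra frames in the denoised sequence would be decoded back into the action chunk, future proprioception, and value described under Outputs below.
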
## Model Details

### Inputs

- **Current state images**:
  - Left third-person camera (agentview_left): Resized to 224x224 RGB
  - Right third-person camera (agentview_right): Resized to 224x224 RGB
  - Wrist-mounted camera (eye_in_hand): Resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: Natural language text (e.g., "open the right drawer")

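As a rough illustration of this input interface (the resize call, concatenation order, and function names below are assumptions, not the released preprocessing code):

```python
import numpy as np
from PIL import Image

CAMERA_KEYS = ["agentview_left", "agentview_right", "eye_in_hand"]


def preprocess_observation(raw_images, gripper_qpos, eef_pos, eef_quat):
    """Resize each camera view to 224x224 RGB and build the 9-D proprioception vector.

    `raw_images` maps camera names to HxWx3 uint8 arrays; the proprio layout
    (2 gripper joints + 3 EEF position + 4 EEF quaternion) follows the list above,
    but the exact ordering is an assumption.
    """
    images = {
        key: np.asarray(Image.fromarray(raw_images[key]).resize((224, 224)))
        for key in CAMERA_KEYS
    }
    proprio = np.concatenate([gripper_qpos, eef_pos, eef_quat]).astype(np.float32)
    assert proprio.shape == (9,)
    return images, proprio
```
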
### Outputs

- **Action chunk**: 32-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+32
- **Future state images**:
  - Wrist camera prediction at timestep t+32
  - Left third-person camera prediction at timestep t+32
  - Right third-person camera prediction at timestep t+32
- **Future state value**: Expected cumulative reward from future state

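For reference, these outputs can be pictured as a single prediction bundle per policy query; the container and field names below are illustrative, not the checkpoint's actual API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CosmosPolicyPrediction:
    """Illustrative container for one policy query; shapes follow the model card."""

    action_chunk: np.ndarray    # (32, 7): 6-DoF end-effector control + 1 gripper, per timestep
    future_proprio: np.ndarray  # (9,): predicted robot state at timestep t+32
    future_images: dict         # camera name -> (224, 224, 3) predicted RGB frame at t+32
    future_value: float         # expected cumulative reward from the predicted future state
```
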
### Training Details

**Training Data**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy) dataset

- 24 kitchen manipulation tasks across 7 categories
- 50 human teleoperation demonstrations per task
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training

**Training Configuration**:

- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 45,000 gradient steps
- **Batch size**: 800 (global)
- **GPUs**: 32 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 32 timesteps (prediction horizon)
- **Execution horizon**: 16 timesteps (recommended; can be varied)
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 for policy, world model, and value function objectives, respectively.

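A minimal sketch of these two training-time choices is given below; all constants (log-normal parameters, the uniform range, and the mixing probability) are assumptions rather than values from the paper:

```python
import torch


def sample_hybrid_noise_levels(batch_size, p_uniform=0.5, logn_mean=0.0, logn_std=1.0,
                               sigma_min=0.02, sigma_max=80.0):
    """Mix a log-normal sigma distribution with a uniform one (all parameters assumed)."""
    lognormal = torch.exp(torch.randn(batch_size) * logn_std + logn_mean)
    uniform = sigma_min + (sigma_max - sigma_min) * torch.rand(batch_size)
    use_uniform = torch.rand(batch_size) < p_uniform
    return torch.where(use_uniform, uniform, lognormal)


def split_batch_objectives(sample_indices):
    """Assign samples 50/25/25 to the policy, world-model, and value-function objectives."""
    n = len(sample_indices)
    policy = sample_indices[: n // 2]
    world_model = sample_indices[n // 2 : 3 * n // 4]
    value = sample_indices[3 * n // 4 :]
    return policy, world_model, value
```
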
**Inference Settings**:

- Denoising steps: 5 (can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Parallel (action, future state, and value generated simultaneously)
- Execution: First 16 timesteps of the 32-timestep action chunk are executed before requerying the policy

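Putting these settings together, a hedged sketch of the receding-horizon control loop (the `policy.predict` call and the environment API are placeholders, not the released interface):

```python
CHUNK_LEN = 32      # prediction horizon from the model card
EXEC_HORIZON = 16   # recommended number of timesteps to execute before requerying


def run_episode(policy, env, max_steps=500):
    """Receding-horizon execution: predict a 32-step action chunk, execute the first 16, repeat.

    `policy.predict(obs)` is a hypothetical call that would run the 5 denoising steps
    (sigma in [4.0, 80.0]) and return an object like `CosmosPolicyPrediction` above.
    """
    obs = env.reset()
    for _ in range(0, max_steps, EXEC_HORIZON):
        pred = policy.predict(obs)  # actions, future state, and value generated in parallel
        assert len(pred.action_chunk) == CHUNK_LEN
        for action in pred.action_chunk[:EXEC_HORIZON]:
            obs, reward, done, info = env.step(action)
            if done:
                return info
    return {"success": False}
```
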
## Performance

### RoboCasa Benchmark Results

| Method | # Training Demos per Task | Average Success Rate |
|--------|---------------------------|----------------------|
| GR00T-N1 | 300 | 49.6% |
| UVA | 50 | 50.0% |
| DP-VLA | 3,000 | 57.3% |
| π0 | 300 | 62.5% |
| GR00T-N1.5 | 300 | 64.1% |
| Video Policy | 300 | 66.0% |
| FLARE | 300 | 66.4% |
| **Cosmos Policy (ours)** | **50** | **67.1%** |

Success rates are averaged over 50 trials per task (5 evaluation scenes × 10 trials each) and 3 random seeds, for 3,600 trials in total across the 24 tasks.

### Key Achievements

- Achieves state-of-the-art performance with 6× fewer demonstrations than most baselines
- Generalizes to unseen object instances and kitchen styles
- Demonstrates strong multi-task manipulation capabilities across diverse kitchen environments

## Notes

- **Simulation only**: This checkpoint is trained and evaluated exclusively in RoboCasa simulation environments
- **Single robot platform**: Trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: Requires a specific camera configuration (two third-person views + one wrist view)
- **Kitchen tasks**: Designed specifically for kitchen manipulation scenarios

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.
<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*