harrim-nv committed · Commit 0af3fb3 (verified) · 1 parent: 537391c

Update README.md

Files changed (1): README.md (+186 −51)
---
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# **Cosmos-Policy-LIBERO-Predict2-2B**

[**Cosmos Policy**](https://huggingface.co/collections/nvidia/cosmos-policy) | [**Code**](http://github.com/NVlabs/cosmos-policy) | [**White Paper**]() | [**Website**](https://research.nvidia.com/labs/dir/cosmos-policy/)

# Model Overview

## Description:

Cosmos-Policy-LIBERO-Predict2-2B is a 2B-parameter robot manipulation policy model fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model. This model achieves state-of-the-art performance on the LIBERO simulation benchmark with a 98.5% average success rate across four task suites.
Key features:

* **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
* **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
* **High performance**: 98.5% average success rate on LIBERO (Spatial: 98.1%, Object: 100.0%, Goal: 98.2%, Long: 97.6%)

Use cases:

* Robotic manipulation and control in simulation environments
* Imitation learning and policy learning for table-top manipulation tasks
* Vision-based robot learning with multiple camera viewpoints
* Long-horizon task planning and execution
* Lifelong learning and transfer learning in robotics

This model is for research and development only.

**Model Developer**: NVIDIA

## Model Versions

Cosmos Policy models include the following:

- [Cosmos-Policy-LIBERO-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
- [Cosmos-Policy-RoboCasa-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-RoboCasa-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
- [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
- [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B): Given current state observations, a task description, and action sequences, generates future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)

### License:

This model is released under the [NVIDIA One-Way Noncommercial License (NSCLv1)](https://github.com/NVlabs/HMAR/blob/main/LICENSE). For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).

Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:

* Models are not for commercial use.
* NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

### Deployment Geography:

Global

### Use Case:

Physical AI: Robot manipulation and control, encompassing tabletop manipulation and imitation learning in simulation environments.

### Release Date:

GitHub [01/12/2026] via [https://github.com/nvlabs/cosmos-policy](https://github.com/nvlabs/cosmos-policy)

Hugging Face [01/12/2026] via [https://huggingface.co/collections/nvidia/cosmos-policy](https://huggingface.co/collections/nvidia/cosmos-policy)

## Model Architecture:

Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Predict2-2B-Video2World.

Network Architecture: The model uses the same architecture as the base [Cosmos-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) model (a diffusion transformer with latent video diffusion).

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.

**Number of model parameters:**

2B (inherited from base model)

## Input

**Input Type(s)**: Text + Multi-view Images + Proprioceptive State

**Input Format(s)**:

* Text: String (natural language task description)
* Images: RGB images from multiple camera views
* Proprioception: Numerical array

**Input Parameters**:

* Text: One-dimensional (1D) - Task description (e.g., "put the black bowl on top of the cabinet")
* Images: Two-dimensional (2D) - Third-person camera (agentview): 224×224 RGB; Wrist-mounted camera (eye-in-hand): 224×224 RGB
* Proprioception: One-dimensional (1D) - 9-dimensional state (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)

**Other Properties Related to Input**:

* Requires a specific camera configuration (third-person + wrist views)
* Images are resized to 224×224 pixels from their original resolution
* Trained exclusively for the Franka Emika Panda robot arm in LIBERO simulation environments
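The input spec above can be sanity-checked with a small preprocessing sketch. Everything here is illustrative: the actual preprocessing lives in the cosmos-policy repository, the function and key names (`build_observation`, `agentview_rgb`, etc.) are hypothetical, and nearest-neighbor resizing merely stands in for whatever interpolation the repo uses.

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, 3) image to (size, size, 3)."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols[None, :]]

def build_observation(agentview, wrist, gripper_qpos, eef_pos, eef_quat, task):
    """Pack one policy input: two 224x224 RGB views, a 9-D proprio vector, and the task string."""
    proprio = np.concatenate([gripper_qpos, eef_pos, eef_quat]).astype(np.float32)
    assert proprio.shape == (9,)  # 2 gripper + 3 position + 4 quaternion
    return {
        "agentview_rgb": resize_nearest(agentview),
        "wrist_rgb": resize_nearest(wrist),
        "proprio": proprio,
        "task": task,
    }

# Example with dummy frames (the original sim resolution here is an assumption).
obs = build_observation(
    agentview=np.zeros((256, 256, 3), dtype=np.uint8),
    wrist=np.zeros((256, 256, 3), dtype=np.uint8),
    gripper_qpos=np.zeros(2),
    eef_pos=np.zeros(3),
    eef_quat=np.array([0.0, 0.0, 0.0, 1.0]),
    task="put the black bowl on top of the cabinet",
)
print(obs["agentview_rgb"].shape)  # (224, 224, 3)
```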

## Output

**Output Type(s)**: Action Sequence + Future State Predictions + Value Estimate

**Output Format**:

* Actions: Numerical array
* Future states: Images + Proprioception
* Value: Scalar

**Output Parameters**:

* Action chunk: 16-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
* Future robot proprioception: 9-dimensional state at timestep t+16
* Future state images: Third-person camera prediction (224×224 RGB) and wrist camera prediction (224×224 RGB) at timestep t+16
* Future state value: Expected cumulative reward from the future state (scalar)
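At deployment time an action chunk like this is commonly executed step by step before re-querying the policy with fresh observations. A minimal rollout-loop sketch using a dummy chunk (the real chunk comes from the model's sampler, and `env.step` is a placeholder, not an API from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Dummy stand-in for one sampled action chunk: 16 timesteps x 7 dims.
action_chunk = rng.uniform(-1.0, 1.0, size=(16, 7)).astype(np.float32)
assert action_chunk.shape == (16, 7)

for t, action in enumerate(action_chunk):
    arm_cmd = action[:6]     # 6-DoF end-effector command
    gripper_cmd = action[6]  # gripper open/close command
    # env.step(np.concatenate([arm_cmd, [gripper_cmd]]))  # placeholder simulator step
```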

**Other Properties Related to Output**:

* Action chunk size: 16 timesteps
* Denoising steps: 5 (configurable without retraining)
* Noise level range: σ_min = 4.0, σ_max = 80.0
* Generation mode: Parallel (action, future state, and value generated simultaneously)
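The exact sampler and schedule are defined in the cosmos-policy code; purely as an illustration, a log-spaced schedule over the stated noise range with 5 denoising steps would look like:

```python
import numpy as np

sigma_max, sigma_min, steps = 80.0, 4.0, 5
# Log-spaced noise levels from sigma_max down to sigma_min.
sigmas = np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), steps))
print(np.round(sigmas, 2))  # ≈ [80, 37.83, 17.89, 8.46, 4]
```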
 
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration

**Runtime Engine(s):**

* [Transformers](https://github.com/huggingface/transformers)

**Supported Hardware Microarchitecture Compatibility:**

* NVIDIA Hopper (e.g., H100)

**Note**: Inference has been tested only with BF16 precision.

**Operating System(s):**

* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

# Usage

See the [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.

## Training and Evaluation:

### Training Datasets:

**Data Collection Method**:

* LIBERO-Cosmos-Policy: Hybrid: Human - Human-teleoperated demonstrations recorded in the simulation environment

**Labeling Method**:

* LIBERO-Cosmos-Policy: Automated - Success/failure labels automatically determined by simulation environment evaluation; task descriptions from the benchmark specification

#### Properties:

**Training Data**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy) dataset

- 4 task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long
- 500 demonstrations per suite (50 demos × 10 tasks)
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training
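The value targets implied by "expected cumulative reward" can be illustrated with a generic discounted Monte-Carlo return; the discount factor and sparse-reward convention below are illustrative assumptions, not taken from the paper. Under this convention, a failed demonstration yields informative (all-zero) targets, which is one reason failures remain useful for value training.

```python
import numpy as np

def cumulative_return(rewards, gamma=0.99):
    """Discounted return G_t = sum_k gamma^k * r_{t+k}, computed for every timestep t."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

# Sparse success reward: 1 at the final step of a successful demo, 0 elsewhere.
rewards = np.zeros(100)
rewards[-1] = 1.0
targets = cumulative_return(rewards)
# A failed demo (all-zero rewards) produces all-zero value targets.
```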

**Training Configuration**:

- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 40,000 gradient steps
- **Batch size**: 1,920 (global)
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 16 timesteps
- **Image resolution**: 224×224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 for policy, world model, and value function objectives, respectively.
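The 50/25/25 split can be sketched as a per-sample objective assignment over the 1,920-sample global batch. Whether the actual training code samples per example or partitions the batch deterministically is not specified here, so this is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 1920  # global batch size from the training configuration

# Assign each sample to one of the three objectives with the 50/25/25 split.
objectives = rng.choice(
    ["policy", "world_model", "value"], size=batch_size, p=[0.5, 0.25, 0.25]
)
counts = {k: int((objectives == k).sum()) for k in ("policy", "world_model", "value")}
```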

### Evaluation Datasets:

Data Collection Method: Not Applicable

Labeling Method: Not Applicable

Properties: Not Applicable - we use the LIBERO simulation environments for direct evaluation.

## Inference:

**Test Hardware:** H100, A100

See the [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.

#### System Requirements and Performance

Inference with the base Cosmos Policy only (i.e., no model-based planning):

* 1 GPU with 6.8 GB VRAM for LIBERO sim benchmark tasks
* 1 GPU with 8.9 GB VRAM for RoboCasa sim benchmark tasks
* 1 GPU with 6.0 GB VRAM for ALOHA robot tasks

#### Quality Benchmarks

##### LIBERO Benchmark Results

| Task Suite     | Success Rate |
| -------------- | ------------ |
| LIBERO-Spatial | 98.1%        |
| LIBERO-Object  | 100.0%       |
| LIBERO-Goal    | 98.2%        |
| LIBERO-Long    | 97.6%        |
| **Average**    | **98.5%**    |

Success rates are averaged over 500 trials per suite (10 tasks × 50 episodes) across 3 random seeds (6,000 trials total).
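A quick arithmetic check of the quoted aggregate numbers:

```python
# Reproduce the quoted average and total trial count from the per-suite figures.
rates = {"Spatial": 98.1, "Object": 100.0, "Goal": 98.2, "Long": 97.6}
average = sum(rates.values()) / len(rates)
print(round(average, 1))  # 98.5

trials_per_suite = 10 * 50                  # 10 tasks x 50 episodes
total = trials_per_suite * len(rates) * 3   # 4 suites x 3 seeds
print(total)  # 6000
```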

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
- **Original LIBERO**: [LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning](https://arxiv.org/abs/2306.03310)

## Citation

If you use this model, please cite the Cosmos Policy paper:

(Cosmos Policy BibTeX citation coming soon!)