harrim-nv committed on
Commit d97a4c9 · verified · 1 Parent(s): 3ab5d79

Update README.md

Files changed (1): README.md (+201 -72)

README.md CHANGED
@@ -1,123 +1,252 @@
- # Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B

- ## Model Description

- Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a refined world model and value function checkpoint fine-tuned from [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving improved performance on challenging manipulation tasks. This checkpoint should NOT be deployed on its own.

- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

- ### Key Features

- - **Refined predictions**: Fine-tuned on policy rollout data for more accurate world model and value function predictions
- - **Dual deployment**: Used alongside base Cosmos Policy checkpoint for model-based planning
- - **Improved performance**: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

- ### Model Architecture

- This model uses the same architecture as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B). Please refer to that model card and the [base Cosmos-Predict2-2B model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

- ## Model Details

- ### Inputs

- Same as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B):
- - **Current state images**: Top-down camera, left wrist camera, right wrist camera (all 224x224 RGB)
- - **Robot proprioception**: 14-dimensional (7 joint angles per arm)
- - **Action chunk**: 50-timestep sequence of 14-dimensional actions (for world model and value prediction)

- ### Outputs

- - **Future robot proprioception**: 14-dimensional state at timestep t+50
- - **Future state images**:
- - Top-down third-person camera prediction at timestep t+50
- - Left wrist camera prediction at timestep t+50
- - Right wrist camera prediction at timestep t+50
- - **Future state value**: Expected cumulative reward from future state (V(s'))
  **Note**: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).
- ### Training Details

- **Base Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B)

- **Fine-tuning Data**: 648 policy rollout episodes collected from various methods (see paper for details)
  - Includes both successful and failed episodes
  - Covers diverse initial conditions and execution trajectories
  - Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

  **Training Configuration**:
- - **Training steps**: Details in paper
  - **Batch split**: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  - **GPUs**: 8 H100 GPUs

  **Training Objective**: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
- ## Usage: Dual Deployment for Model-Based Planning
-
- This checkpoint is designed for **dual deployment** with the base Cosmos Policy checkpoint:
-
- 1. **Policy Model** ([Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B)): Generates N candidate action chunks
- 2. **Planning Model** (this checkpoint): For each candidate action:
- - Predicts future state (world model)
- - Predicts future state value (value function)
- - Averages across ensemble predictions (3 future state predictions × 5 value predictions = 15 total value estimates per action)
- 3. **Selection**: Execute the action chunk with the highest predicted value

- See the paper for complete implementation details of the best-of-N planning algorithm.

- ## Performance

- ### Planning Performance on ALOHA Tasks

- When used for model-based planning with the base policy checkpoint:

- | Task | Base Policy Score | With Planning (this checkpoint) | Improvement |
- |------|------------------|--------------------------------|-------------|
- | put candies in bowl | 49.0 | 60.0 | +11.0 |
- | put candy in ziploc bag | 70.0 | 84.0 | +14.0 |
- | **Average** | **60.0** | **72.0** | **+12.5** |

- Results are on challenging initial conditions for these two tasks. Planning with this checkpoint enables the policy to be more likely to avoid errors (e.g., losing grasp of objects) by selecting higher-quality actions.
- ## Important Usage Notes

- **Inference Latency**: Model-based planning with dual deployment has significantly higher inference latency:
- - **Planning mode (dual deployment)**: ~4.9 seconds per action chunk using 8 parallel H100 GPUs
- - **Direct policy mode (base checkpoint only)**: ~0.95 seconds per action chunk using 1 H100 GPU

- For applications requiring faster inference, we recommend using the base [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) checkpoint alone without planning.

- **Hardware Requirements**: Model-based planning requires:
- - Multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8)
- - Sufficient compute for ensemble predictions (3 world model queries × 5 value function queries per action)

- **When to Use Planning**: Planning is most beneficial for:
- - Challenging tasks with high precision requirements
- - Situations where avoiding errors is critical
- - Scenarios where additional compute time is acceptable

- **Same warnings as base checkpoint apply**: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) model card for details.

- ## Notes

- - **Specialized checkpoint**: Optimized specifically for world model and value function predictions, not action generation
- - **Requires base policy**: Must be used in conjunction with Cosmos-Policy-ALOHA-Predict2-2B for planning
- - **Compute-intensive**: Significantly higher computational requirements than direct policy execution
- - **Real-world tested**: Evaluated on real ALOHA 2 hardware in challenging manipulation scenarios

- ## Citation

- If you use this model, please cite the Cosmos Policy paper by Kim et al.
- <!-- ```bibtex
- # TODO: Add Cosmos Policy BibTeX
- ``` -->

- ## License

- Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

  ## Related Resources

  - **Base Policy Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) (required for planning)
  - **Base Video Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
  - **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
+ ---
+ base_model:
+ - nvidia/Cosmos-Policy-ALOHA-Predict2-2B
+ - nvidia/Cosmos-Predict2-2B-Video2World
+ ---
+ # **Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B**

+ [**Cosmos Policy**](https://huggingface.co/collections/nvidia/cosmos-policy) | [**Code**](http://github.com/NVlabs/cosmos-policy) | [**White Paper**]() | [**Website**](https://research.nvidia.com/labs/dir/cosmos-policy/)

+ # Model Overview

+ ## Description:

+ Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a 2B-parameter refined world model and value function checkpoint fine-tuned from [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving a 12.5 percentage point average score increase on challenging ALOHA manipulation tasks.

+ Key features:

+ * **Refined predictions**: Fine-tuned on policy rollout data for more accurate world model and value function predictions
+ * **Dual deployment**: Used alongside base Cosmos Policy checkpoint for model-based planning
+ * **Improved performance**: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

+ Use cases:

+ * Model-based planning for bimanual robot manipulation
+ * Best-of-N action selection via value-based search
+ * Improving policy robustness on high-precision tasks
+ * Error avoidance in contact-rich manipulation

+ This model is for research and development only.

+ **Model Developer**: NVIDIA
+ ## Model Versions

+ Cosmos Policy models include the following:
+
+ - [Cosmos-Policy-LIBERO-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
+ - [Cosmos-Policy-RoboCasa-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-RoboCasa-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
+ - [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
+ - [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B): Given current state observations, a task description, and action sequences, generate future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)
+
+ ### License:
+
+ This model is released under the [NVIDIA One-Way Noncommercial License (NSCLv1)](https://github.com/NVlabs/HMAR/blob/main/LICENSE). For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
+
+ Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:
+
+ * Models are not for commercial use.
+ * NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.
+
+ ### Deployment Geography:
+
+ Global
+
+ ### Use Case:
+
+ Physical AI: Model-based planning for bimanual robot manipulation in real-world environments, encompassing world modeling and value function prediction for best-of-N action selection.
+
+ ### Release Date:
+
+ GitHub [01/06/2026] via [https://github.com/nvlabs/cosmos-policy](https://github.com/nvlabs/cosmos-policy)
+
+ Hugging Face [01/06/2026] via [https://huggingface.co/collections/nvidia/cosmos-policy](https://huggingface.co/collections/nvidia/cosmos-policy)
+
+ ## Model Architecture:
+
+ Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B.
+
+ Network Architecture: The model uses the same architecture as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).
+
+ **Key adaptation**: This checkpoint is specifically optimized for world model and value function predictions through fine-tuning on policy rollout data with emphasis on future state and value prediction accuracy.
+
+ **Number of model parameters:**
+
+ 2B (inherited from base model)
+
+ ## Input
+
+ **Input Type(s)**: Text + Multi-view Images + Proprioceptive State + Action Sequence
+
+ **Input Format(s)**:
+
+ * Text: String (natural language task description)
+ * Images: RGB images from multiple camera views
+ * Proprioception: Numerical array
+ * Actions: Numerical array
+
+ **Input Parameters**:
+
+ * Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
+ * Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
+ * Proprioception: One-dimensional (1D) - 14-dimensional state (7 joint angles per arm)
+ * Actions: Two-dimensional (2D) - 50-timestep sequence of 14-dimensional actions (for world model and value prediction)
+
+ **Other Properties Related to Input**:
+
+ * Requires specific camera configuration (top-down + two wrist views)
+ * Images resized to 224×224 pixels from original resolution
+ * Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
+ * Control frequency: 25 Hz
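The input spec above can be encoded as simple shape assertions for a sanity check before querying the model. This is an illustrative sketch only: the function and field names below are hypothetical, not part of the released API.

```python
import numpy as np

# Shapes implied by the model card (names are illustrative, not the real API)
IMG_SHAPE = (224, 224, 3)    # each RGB camera view
PROPRIO_DIM = 14             # 7 joint angles per arm
CHUNK_LEN, ACTION_DIM = 50, 14

def validate_planner_inputs(task, images, proprio, actions):
    """Check that one planning query matches the documented input spec."""
    assert isinstance(task, str) and task, "task description must be a non-empty string"
    assert set(images) == {"top", "left_wrist", "right_wrist"}, "three camera views required"
    for name, img in images.items():
        assert img.shape == IMG_SHAPE, f"{name} view must be 224x224 RGB"
    assert proprio.shape == (PROPRIO_DIM,), "proprioception is 14-dimensional"
    assert actions.shape == (CHUNK_LEN, ACTION_DIM), "action chunk is 50 steps of 14-dim actions"
    return True

ok = validate_planner_inputs(
    "put candy in ziploc bag",
    {v: np.zeros(IMG_SHAPE, dtype=np.uint8) for v in ("top", "left_wrist", "right_wrist")},
    np.zeros(PROPRIO_DIM),
    np.zeros((CHUNK_LEN, ACTION_DIM)),
)
```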

+ ## Output
+
+ **Output Type(s)**: Future State Predictions + Value Estimate
+
+ **Output Format**:
+
+ * Future states: Images + Proprioception
+ * Value: Scalar
+
+ **Output Parameters**:
+
+ * Future robot proprioception: 14-dimensional state at timestep t+50
+ * Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
+ * Future state value: Expected cumulative reward from future state (scalar)
+
+ **Other Properties Related to Output**:
+
+ * Denoising steps: 10 (configurable without retraining)
+ * Noise level range: σ_min = 4.0, σ_max = 80.0
+ * Ensemble predictions: 3 world model queries × 5 value function queries per action (15 total value estimates)
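The ensemble scheme above (3 world-model queries × 5 value queries = 15 estimates per action chunk) can be sketched as a nested loop with averaging. `predict_future_state` and `predict_value` below are placeholder stand-ins for the actual model calls, not the released API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two model queries described above:
# the real checkpoint predicts a future state, then a scalar value for it.
def predict_future_state(state, action_chunk, seed):
    return {"seed": seed}  # placeholder future-state prediction

def predict_value(future_state, seed):
    return float(rng.normal())  # placeholder scalar value estimate

def ensemble_value(state, action_chunk, n_world=3, n_value=5):
    """Average n_world x n_value value estimates (3 x 5 = 15 per the card)."""
    estimates = []
    for w in range(n_world):
        future = predict_future_state(state, action_chunk, seed=w)
        for v in range(n_value):
            estimates.append(predict_value(future, seed=v))
    assert len(estimates) == n_world * n_value  # 15 estimates per action chunk
    return float(np.mean(estimates))

score = ensemble_value(state=None, action_chunk=None)
```

Averaging over both sources of stochasticity (world-model samples and value-function samples) reduces the variance of the per-action value estimate used for selection.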

  **Note**: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).

+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+
+ ## Software Integration
+
+ **Runtime Engine(s):**
+
+ * [Transformers](https://github.com/huggingface/transformers)
+
+ **Supported Hardware Microarchitecture Compatibility:**
+
+ * NVIDIA Hopper (e.g., H100)
+
+ **Note**: We have only tested inference with BF16 precision.
+
+ **Operating System(s):**
+
+ * Linux

+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

+ **Dual Deployment**: This checkpoint is designed for dual deployment with [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B):
+
+ 1. **Policy Model** (Cosmos-Policy-ALOHA-Predict2-2B): Generates N candidate action chunks
+ 2. **Planning Model** (this checkpoint): For each candidate action, predicts future state (world model) and future state value (value function), averaging across ensemble predictions
+ 3. **Selection**: Execute the action chunk with the highest predicted value
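The three steps above can be sketched as a minimal best-of-N loop. The sampler and scorer below are hypothetical placeholders for the two checkpoints; only the selection logic mirrors the described procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
N, CHUNK_LEN, ACTION_DIM = 8, 50, 14  # N=8 candidates, 50-step chunks of 14-dim actions

# Placeholder policy and planner; the real calls would go to the two checkpoints.
def sample_action_chunk(_state):
    # step 1: policy model proposes one candidate action chunk
    return rng.normal(size=(CHUNK_LEN, ACTION_DIM))

def score_action_chunk(_state, chunk):
    # step 2: stand-in for the ensemble world-model + value-function query
    return float(-np.abs(chunk).mean())

def best_of_n(state, n=N):
    candidates = [sample_action_chunk(state) for _ in range(n)]
    values = [score_action_chunk(state, c) for c in candidates]
    best = int(np.argmax(values))  # step 3: pick the highest predicted value
    return candidates[best], values[best]

chunk, value = best_of_n(state=None)
```

In the real system the N candidates are scored in parallel across GPUs, which is why 8 GPUs are recommended for N=8.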

+ **Inference Latency**: Model-based planning with dual deployment has significantly higher inference latency:
+
+ - **Planning mode (dual deployment)**: ~4.9 seconds per action chunk using 8 parallel H100 GPUs
+ - **Direct policy mode (base checkpoint only)**: ~0.95 seconds per action chunk using 1 H100 GPU
+
+ **Hardware Requirements**: Model-based planning requires multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8) and sufficient compute for ensemble predictions.
+
+ **When to Use Planning**: Planning is most beneficial for challenging tasks with high precision requirements, situations where avoiding errors is critical, and scenarios where additional compute time is acceptable.
+
+ **Same warnings as base checkpoint apply**: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) model card for details.
+
+ # Usage
+
+ See [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.
+
+ ## Training and Evaluation:
+
+ ### Training Datasets:
+
+ **Data Collection Method**:
+
+ * ALOHA-Planning-Rollouts: Automated - Policy rollout episodes collected from various methods
+
+ **Labeling Method**:
+
+ * ALOHA-Planning-Rollouts: Automated - Success/failure labels automatically determined through policy execution and environment evaluation
+
+ ##### **Properties:**
+
+ **Training Data**: Policy rollout data
+
+ - 648 policy rollout episodes
  - Includes both successful and failed episodes
  - Covers diverse initial conditions and execution trajectories
  - Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

  **Training Configuration**:
+
+ - **Base model**: Cosmos-Policy-ALOHA-Predict2-2B
+ - **Training steps**: See paper for details
  - **Batch split**: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  - **GPUs**: 8 H100 GPUs
+ - **Optimization**: Full model fine-tuning (all weights updated)
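The 10/45/45 batch split can be read as a per-batch sampling distribution over the three training objectives. A minimal sketch, assuming each batch's objective is drawn independently (the actual training scheduler may differ):

```python
import random

random.seed(0)

# 10/45/45 split over training objectives, as stated in the configuration above
BATCH_SPLIT = {"policy": 0.10, "world_model": 0.45, "value_function": 0.45}

def sample_batch_type(split=BATCH_SPLIT):
    """Draw which objective the next training batch is used for."""
    return random.choices(list(split), weights=list(split.values()), k=1)[0]

# Empirically, world model + value function together get ~90% of batches
counts = {k: 0 for k in BATCH_SPLIT}
for _ in range(10_000):
    counts[sample_batch_type()] += 1
assert (counts["world_model"] + counts["value_function"]) / 10_000 > 0.85
```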
194
 
195
  **Training Objective**: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
196
 
197
+ ### Evaluation Datasets:
 
 
 
 
 
 
 
 
 
198
 
199
+ Data Collection Method: Not Applicable
200
 
201
+ Labeling Method: Not Applicable
202
 
203
+ Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.
204
 
205
+ ## Inference:
206
 
207
+ **Test Hardware:** H100
 
 
 
 
208
 
209
+ See [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.
210
 
211
+ #### System Requirements and Performance
212
 
213
+ Inference with model-based planning (dual deployment):
 
 
214
 
215
+ * 8 H100 GPUs (recommended for N=8 best-of-N search)
216
+ * ~4.9 seconds per action chunk
217
+ * Ensemble predictions: 3 world model × 5 value function queries per action
218
 
219
+ #### Quality Benchmarks
 
 
220
 
221
+ ### Planning Performance on ALOHA Tasks
 
 
 
222
 
223
+ When used for model-based planning with the base policy checkpoint:

+ | Task | Base Policy Score | With Planning (this checkpoint) | Improvement |
+ | ----------------------- | ----------------- | ------------------------------- | --------------- |
+ | put candies in bowl | 49.0 | 60.0 | +11.0 |
+ | put candy in ziploc bag | 70.0 | 84.0 | +14.0 |
+ | **Average** | **59.5** | **72.0** | **+12.5** |

+ Results are on challenging initial conditions for these two tasks. Planning with this checkpoint makes the policy more likely to avoid errors (e.g., losing grasp of objects) by selecting higher-quality actions.

+ ## Ethical Considerations

+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

+ Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

  ## Related Resources

  - **Base Policy Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) (required for planning)
  - **Base Video Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
  - **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
+ - **Original ALOHA**: [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705)
+
+ ## Citation
+
+ If you use this model, please cite the Cosmos Policy paper:
+
+ (Cosmos Policy BibTeX citation coming soon!)