---
datasets:
- nvidia/ALOHA-Cosmos-Policy
base_model:
- nvidia/Cosmos-Predict2-2B-Video2World
---
# **Cosmos-Policy-ALOHA-Predict2-2B**

[**Cosmos Policy**](https://huggingface.co/collections/nvidia/cosmos-policy) | [**Code**](http://github.com/NVlabs/cosmos-policy) | **White Paper Coming Soon** | [**Website**](https://research.nvidia.com/labs/dir/cosmos-policy/)

# Model Overview

## Description:

Cosmos-Policy-ALOHA-Predict2-2B is a 2B-parameter bimanual robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model. It achieves a 93.6% average completion rate across four challenging real-world bimanual manipulation tasks on the ALOHA 2 robot platform.

Key features:

* **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
* **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
* **Real-world performance**: 93.6% average score on challenging bimanual manipulation tasks

Use cases:

* Bimanual robotic manipulation and control in real-world environments
* Imitation learning from human teleoperation demonstrations
* Vision-based robot learning with multiple camera viewpoints
* Contact-rich and high-precision manipulation tasks
* Long-horizon task planning and execution

This model is for research and development only.

**Model Developer**: NVIDIA

## Model Versions

Cosmos Policy models include the following:

- [Cosmos-Policy-LIBERO-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
- [Cosmos-Policy-RoboCasa-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-RoboCasa-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
- [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B): Given current state observations and a task description, generates action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
- [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B): Given current state observations, a task description, and action sequences, generates future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)

### License:

This model is released under the [NVIDIA One-Way Noncommercial License (NSCLv1)](https://github.com/NVlabs/HMAR/blob/main/LICENSE). For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).

Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:

* Models are not for commercial use.
* NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

### Deployment Geography:

Global

### Use Case:

Physical AI: Bimanual robot manipulation and control in real-world environments, encompassing contact-rich manipulation and imitation learning from human demonstrations.

### Release Date:

GitHub [01/06/2026] via [https://github.com/nvlabs/cosmos-policy](https://github.com/nvlabs/cosmos-policy)

Hugging Face [01/06/2026] via [https://huggingface.co/collections/nvidia/cosmos-policy](https://huggingface.co/collections/nvidia/cosmos-policy)

## Model Architecture:

Architecture Type: Diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Predict2-2B-Video2World.

Network Architecture: The model uses the same architecture as the base [Cosmos-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) model; please refer to the base model card for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.

**Number of model parameters:**

2B (inherited from base model)

## Input

**Input Type(s)**: Text + Multi-view Images + Proprioceptive State

**Input Format(s)**:

* Text: String (natural language task description)
* Images: RGB images from multiple camera views
* Proprioception: Numerical array

**Input Parameters**:

* Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
* Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
* Proprioception: One-dimensional (1D) - 14-dimensional state (7 joint angles per arm)

**Other Properties Related to Input**:

* Requires specific camera configuration (top-down + two wrist views)
* Images resized to 224×224 pixels from original resolution
* Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
* Control frequency: 25 Hz (reduced from original 50 Hz)
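
To make the expected input shapes concrete, here is a minimal sketch of assembling one observation. The dictionary keys and the resize step are illustrative assumptions, not the repository's actual API; see the Cosmos Policy GitHub for the real interface.

```python
import numpy as np
from PIL import Image

def build_observation(top_img: Image.Image,
                      left_wrist_img: Image.Image,
                      right_wrist_img: Image.Image,
                      joint_angles: np.ndarray,
                      task: str) -> dict:
    """Assemble one policy input. Keys are illustrative, not the official API."""
    assert joint_angles.shape == (14,)  # 7 joint angles per arm

    def resize(im: Image.Image) -> np.ndarray:
        # Resize from original camera resolution to the 224x224 RGB the model expects.
        return np.asarray(im.convert("RGB").resize((224, 224)))  # (224, 224, 3) uint8

    return {
        "top_camera": resize(top_img),
        "left_wrist_camera": resize(left_wrist_img),
        "right_wrist_camera": resize(right_wrist_img),
        "proprio": joint_angles.astype(np.float32),  # 14-D joint state
        "task_description": task,  # e.g., "put candy in ziploc bag"
    }
```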

## Output

**Output Type(s)**: Action Sequence + Future State Predictions + Value Estimate

**Output Format**:

* Actions: Numerical array
* Future states: Images + Proprioception
* Value: Scalar

**Output Parameters**:

* Action chunk: 50-timestep sequence of 14-dimensional actions (7 per arm: joint positions for 6 joints + 1 gripper)
* Future robot proprioception: 14-dimensional state at timestep t+50
* Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
* Future state value: Expected cumulative reward from future state (scalar)
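
The 14 action dimensions pack both arms into one vector. Below is a small sketch of unpacking a chunk; the left-arm-first ordering is an assumption made for illustration and should be verified against the official repository.

```python
import numpy as np

def split_action_chunk(chunk: np.ndarray) -> dict:
    """Split a (50, 14) action chunk into per-arm joint and gripper commands.

    Assumes dims 0-6 are the left arm and dims 7-13 the right arm
    (6 joint positions + 1 gripper each); verify before deploying.
    """
    assert chunk.shape == (50, 14)
    left, right = chunk[:, :7], chunk[:, 7:]
    return {
        "left_joints": left[:, :6], "left_gripper": left[:, 6],
        "right_joints": right[:, :6], "right_gripper": right[:, 6],
    }
```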

**Other Properties Related to Output**:

* Action chunk size: 50 timesteps (spanning 2 seconds given 25 Hz control frequency)
* Execution horizon: 50 timesteps (full chunk; recommended, though can be varied)
* Denoising steps: 10 (configurable without retraining)
* Noise level range: σ_min = 4.0, σ_max = 80.0
* Generation mode: Either parallel (action, future state, and value generated simultaneously) or autoregressive (using this checkpoint as the policy and the separate planning model checkpoint as the world model and value function)
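
For intuition on how the denoising settings fit together, below is a sketch of a 10-step Karras-style (EDM) noise schedule spanning the stated range. The spacing formula and the ρ = 7 exponent are assumptions borrowed from standard EDM samplers, not details confirmed here.

```python
import numpy as np

def karras_sigmas(n_steps: int = 10, sigma_min: float = 4.0,
                  sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """EDM-style noise schedule from sigma_max down to sigma_min.

    Assumption: Cosmos Policy inherits EDM conventions from its
    video-diffusion base model; the actual schedule may differ.
    """
    ramp = np.linspace(0.0, 1.0, n_steps)  # 0 -> sigma_max, 1 -> sigma_min
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
```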

**Note on future predictions**: The future state images and value predictions generated by this base policy checkpoint are primarily for visualization and interpretability purposes. For model-based planning with these predictions, please additionally use the separate [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B) checkpoint as the world model and value function. That checkpoint has been fine-tuned on policy rollout data to refine the world model and value function for more accurate planning.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration

**Runtime Engine(s):**

* [Transformers](https://github.com/huggingface/transformers)

**Supported Hardware Microarchitecture Compatibility:**

* NVIDIA Hopper (e.g., H100)

**Note**: We have only tested inference with BF16 precision.

**Operating System(s):**

* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

**Hardware Compatibility Warning**: This model was trained on a specific ALOHA 2 robot setup with particular hardware characteristics. Differences between our robot setup and downstream users' hardware setups (including calibration, joint limits, camera positioning, gripper mechanics, etc.) may significantly impact performance. Users must exercise caution during deployment.

**Control Frequency**: This policy must be used with a **25 Hz controller** for satisfactory performance (not the original 50 Hz ALOHA control frequency). The reduced frequency was used during data collection and training.

**Real-World Deployment**: This model operates real robotic hardware. Always ensure that proper safety measures are in place. On the first deployment of this checkpoint, we highly recommend measuring the difference between the current robot state and the next commanded robot state (e.g., the difference between current joint angles and predicted actions, which represent target joint angles) and aborting policy execution if the difference is large. A sketch of such a check appears below.
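
A minimal sketch of that first-deployment safety check, folded into a 25 Hz chunked execution loop. The `policy`/`robot` interfaces and the 0.5 rad threshold are placeholders for illustration, not the repository's API; tune the threshold to your hardware.

```python
import time
import numpy as np

MAX_JOINT_DELTA = 0.5  # radians; placeholder threshold, tune for your setup
CONTROL_HZ = 25        # trained at 25 Hz, not ALOHA's native 50 Hz

def run_episode(policy, robot, task: str) -> None:
    """Execute full 50-step chunks, re-querying the policy between chunks."""
    while not robot.done():
        obs = robot.get_observation()      # images + 14-D joint state
        chunk = policy.predict(obs, task)  # (50, 14) target joint angles
        # Recommended safety check: abort if the first commanded state is far
        # from the current state (possible miscalibration or a bad prediction).
        delta = float(np.abs(chunk[0] - obs["proprio"]).max())
        if delta > MAX_JOINT_DELTA:
            robot.stop()
            raise RuntimeError(f"Aborting: commanded jump of {delta:.2f} rad")
        for action in chunk:               # 50 steps at 25 Hz = 2 seconds
            robot.send_joint_targets(action)
            time.sleep(1.0 / CONTROL_HZ)
```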

# Usage

See the [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.

## Training and Evaluation:

### Training Datasets:

**Data Collection Method**:

* ALOHA-Cosmos-Policy: Human - Human-teleoperated demonstrations recorded in a real-world environment

**Labeling Method**:

* ALOHA-Cosmos-Policy: Human - Success/failure labels and completion scores manually determined; task descriptions provided

#### Properties:

**Training Data**: [ALOHA-Cosmos-Policy](https://huggingface.co/datasets/nvidia/ALOHA-Cosmos-Policy) dataset

- 4 bimanual manipulation tasks
- 185 total real-world human teleoperation demonstrations
  - put X on plate: 80 demos
  - fold shirt: 15 demos
  - put candies in bowl: 45 demos
  - put candy in ziploc bag: 45 demos
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training

**Training Configuration**:

- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 50,000 gradient steps
- **Batch size**: 200 (global)
- **GPUs**: 8 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 50 timesteps
- **Image resolution**: 224×224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 for policy, world model, and value function objectives, respectively. A sketch of this recipe follows.
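
Here is a sketch of how a training example might be routed and noised under this recipe. The mixture weight, log-normal parameters, and sigma bounds below are illustrative placeholders (the paper specifies the actual distribution), and `numpy.random` stands in for the real data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_objective() -> str:
    """Route each batch element per the 50/25/25 objective split."""
    return str(rng.choice(["policy", "world_model", "value"],
                          p=[0.5, 0.25, 0.25]))

def sample_sigma(p_uniform: float = 0.2,
                 sigma_min: float = 0.02, sigma_max: float = 80.0,
                 log_mean: float = 0.0, log_std: float = 1.0) -> float:
    """Hybrid log-normal-uniform noise level.

    All parameters here are placeholder assumptions; consult the
    paper for the distribution actually used in training.
    """
    if rng.random() < p_uniform:
        return float(rng.uniform(sigma_min, sigma_max))   # uniform branch
    return float(np.exp(rng.normal(log_mean, log_std)))   # log-normal branch
```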

### Evaluation Datasets:

Data Collection Method: Not Applicable

Labeling Method: Not Applicable

Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.

## Inference:

**Test Hardware:** H100, A100

See the [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.

### System Requirements and Performance

Inference with the base Cosmos Policy only (i.e., no model-based planning):

* 1 GPU with 6.8 GB VRAM for LIBERO sim benchmark tasks
* 1 GPU with 8.9 GB VRAM for RoboCasa sim benchmark tasks
* 1 GPU with 6.0 GB VRAM for ALOHA robot tasks

### Quality Benchmarks

#### ALOHA Real-World Benchmark Results

| Task                    | Score    |
| ----------------------- | -------- |
| put X on plate          | 100.0    |
| fold shirt              | 99.5     |
| put candies in bowl     | 89.6     |
| put candy in ziploc bag | 85.4     |
| **Average**             | **93.6** |

Scores represent average percent completion across 101 trials total (including both in-distribution and out-of-distribution test conditions).

**Comparison with baselines**:

- Diffusion Policy: 33.6
- OpenVLA-OFT+: 62.0
- π0: 77.9
- π0.5: 88.6
- **Cosmos Policy (ours)**: **93.6**

#### Task Characteristics

- **put X on plate**: Language-conditioned object placement (tests language following)
- **fold shirt**: Multi-step contact-rich manipulation (tests long-horizon planning)
- **put candies in bowl**: Handling scattered objects (tests multimodal grasp sequences)
- **put candy in ziploc bag**: High-precision millimeter-tolerance manipulation

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [ALOHA-Cosmos-Policy](https://huggingface.co/datasets/nvidia/ALOHA-Cosmos-Policy)
- **Planning Model Checkpoint**: [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
- **Original ALOHA**: [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705)

## Citation

If you use this model, please cite the Cosmos Policy paper:

(Cosmos Policy BibTeX citation coming soon!)