cosmos-policy
moojink-nv committed
Commit f04a4f9 · verified · 1 Parent(s): f273618

Upload folder using huggingface_hub

Files changed (2)
  1. Cosmos-Policy-ALOHA-Predict2-2B.pt +3 -0
  2. README.md +128 -0
Cosmos-Policy-ALOHA-Predict2-2B.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ffd6c761f11a1deb60c87ff185fc93684c24521126d6f635a091cd567a7c79a1
size 3913017345
README.md ADDED
@@ -0,0 +1,128 @@
# Cosmos-Policy-ALOHA-Predict2-2B

## Model Description

Cosmos-Policy-ALOHA-Predict2-2B is a bimanual robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). It is trained on real-world human teleoperation data collected on the ALOHA 2 robot platform and achieves a 93.6% average completion rate across four challenging bimanual manipulation tasks.

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

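To fetch the checkpoint programmatically, a minimal sketch using `huggingface_hub` (the library used to upload this folder); the `repo_id` below is an assumption based on this page and may differ:

```python
# Minimal sketch: download the policy checkpoint with huggingface_hub.
# NOTE: the repo_id is an assumption inferred from this page; substitute
# the actual repository id if it differs.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="moojink-nv/cosmos-policy",  # assumed repo id
    filename="Cosmos-Policy-ALOHA-Predict2-2B.pt",
)
print(ckpt_path)  # local path to the ~3.9 GB checkpoint
```
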
### Key Features

- **Single-stage fine-tuning**: Adapted from a pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **Real-world performance**: 93.6% average score on challenging bimanual manipulation tasks

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.

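Conceptually, the injection looks like the following sketch (the shapes, tensor names, and number of latent frames are hypothetical illustrations, not the model's actual dimensions):

```python
import torch

# Hypothetical latent shapes for illustration only.
B, C, H, W = 1, 16, 28, 28                    # batch, latent channels, latent H/W
video_latents = torch.randn(B, 3, C, H, W)    # 3 camera views of the current state

# Actions, proprioception, and value are each packed into "latent frames"
# shaped like a video latent, then appended to the sequence.
action_latent = torch.randn(B, 1, C, H, W)    # encodes the 50 x 14 action chunk
state_latent  = torch.randn(B, 1, C, H, W)    # encodes the 14-dim future state
value_latent  = torch.randn(B, 1, C, H, W)    # encodes the scalar value estimate

# One unified sequence: the diffusion transformer denoises all frames jointly,
# so actions / future states / values emerge alongside predicted future images.
latent_seq = torch.cat(
    [video_latents, action_latent, state_latent, value_latent], dim=1
)
print(latent_seq.shape)  # torch.Size([1, 6, 16, 28, 28])
```
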
## Model Details

### Inputs

- **Current state images**:
  - Top-down third-person camera: Resized to 224x224 RGB
  - Left wrist-mounted camera: Resized to 224x224 RGB
  - Right wrist-mounted camera: Resized to 224x224 RGB
- **Robot proprioception**: 14-dimensional (7 joint angles per arm)
- **Task description**: Natural language text (e.g., "put candy in ziploc bag")

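A minimal preprocessing sketch under these specs (the function and dictionary keys are hypothetical; released inference code may organize inputs differently):

```python
import numpy as np
from PIL import Image

def preprocess_inputs(top_img, left_img, right_img, joint_angles, task):
    """Assemble one observation; all names here are illustrative."""
    def to_rgb224(img: Image.Image) -> np.ndarray:
        return np.asarray(img.convert("RGB").resize((224, 224)), dtype=np.uint8)

    obs = {
        "images": np.stack(
            [to_rgb224(top_img), to_rgb224(left_img), to_rgb224(right_img)]
        ),  # (3, 224, 224, 3): top-down + two wrist views
        "state": np.asarray(joint_angles, dtype=np.float32),  # (14,): 7 per arm
        "prompt": task,  # e.g., "put candy in ziploc bag"
    }
    assert obs["state"].shape == (14,)
    return obs
```
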
### Outputs

- **Action chunk**: 50-timestep sequence of 14-dimensional actions (7 per arm: joint positions for 6 joints + 1 gripper)
- **Future robot proprioception**: 14-dimensional state at timestep t+50
- **Future state images**:
  - Top-down third-person camera prediction at timestep t+50
  - Left wrist camera prediction at timestep t+50
  - Right wrist camera prediction at timestep t+50
- **Future state value**: Expected cumulative reward from the future state

**Note on future predictions**: The future state images and value predictions generated by this base policy checkpoint are primarily for visualization and interpretability. For model-based planning with these predictions, additionally use the separate [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](#) checkpoint as the world model and value function; that checkpoint has been fine-tuned on policy rollout data to refine the world model and value function for more accurate planning.

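For orientation, one query's decoded outputs might be organized as below (a sketch; the container and field names are hypothetical, and values are assumed to be decoded and denormalized):

```python
import numpy as np

# Hypothetical container for one policy query's decoded outputs.
outputs = {
    "actions": np.zeros((50, 14), dtype=np.float32),        # target joint positions
    "future_state": np.zeros((14,), dtype=np.float32),      # proprioception at t+50
    "future_images": np.zeros((3, 224, 224, 3), np.uint8),  # 3 camera views at t+50
    "value": 0.0,                                           # expected cumulative reward
}

# Each row of `actions` is one 25 Hz control step, so the chunk spans 2 seconds.
first_command = outputs["actions"][0]  # 14-dim: 6 joints + 1 gripper per arm
```
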
### Training Details

**Training Data**: [ALOHA-Cosmos-Policy](https://huggingface.co/datasets/nvidia/ALOHA-Cosmos-Policy) dataset
- 4 bimanual manipulation tasks
- 185 total real-world human teleoperation demonstrations
  - put X on plate: 80 demos
  - fold shirt: 15 demos
  - put candies in bowl: 45 demos
  - put candy in ziploc bag: 45 demos

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 50,000 gradient steps
- **Batch size**: 200 (global)
- **GPUs**: 8 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 50 timesteps (spanning 2 seconds at the 25 Hz control frequency)
- **Execution horizon**: 50 timesteps (full chunk; recommended, though it can be varied)
- **Image resolution**: 224x224 pixels
- **Control frequency**: 25 Hz (reduced from the original 50 Hz)

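For quick reference, the same hyperparameters as a plain config object (field names are illustrative, not from any released training code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values transcribed from the list above; field names are hypothetical.
    base_checkpoint: str = "model-480p-16fps.pt"
    train_steps: int = 50_000
    global_batch_size: int = 200
    num_gpus: int = 8              # H100
    action_chunk_size: int = 50    # timesteps (2 s at 25 Hz)
    execution_horizon: int = 50    # execute the full chunk
    image_size: int = 224
    control_hz: int = 25
```
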
**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 across the policy, world model, and value function objectives, respectively.

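A sketch of such a hybrid sampler (the mixture weight and the log-normal and uniform parameters below are placeholders, not the paper's actual values):

```python
import torch

def sample_sigma(batch_size: int, p_uniform: float = 0.5,
                 ln_mean: float = 0.0, ln_std: float = 1.0,
                 sigma_min: float = 4.0, sigma_max: float = 80.0) -> torch.Tensor:
    """Hybrid log-normal-uniform noise levels; all parameters are assumptions."""
    lognormal = torch.exp(ln_mean + ln_std * torch.randn(batch_size))
    uniform = sigma_min + (sigma_max - sigma_min) * torch.rand(batch_size)
    use_uniform = torch.rand(batch_size) < p_uniform
    return torch.where(use_uniform, uniform, lognormal)
```
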
**Inference Settings**:
- Denoising steps: 10 (this can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Either parallel (action, future state, and value generated simultaneously) or autoregressive (using this checkpoint as the policy and the separate planning model checkpoint mentioned above as the world model and value function; see the paper for details)
- Execution: The full 50-timestep action chunk (2 seconds) is executed before requerying the policy

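Taken together, a deployment loop might look like this sketch (the `robot` and `policy` interfaces are hypothetical placeholders, not a released API):

```python
import time

CONTROL_HZ = 25   # must match the 25 Hz training frequency
CHUNK_LEN = 50    # full action chunk = 2 seconds of motion

def run_episode(robot, policy, prompt: str, max_chunks: int = 60) -> None:
    """Hypothetical deployment loop; robot/policy interfaces are placeholders."""
    for _ in range(max_chunks):
        obs = robot.observe(prompt)                      # assumed observation API
        out = policy.predict(obs, num_denoise_steps=10)  # assumed policy API
        for t in range(CHUNK_LEN):                       # execute the whole chunk...
            robot.command_joint_positions(out["actions"][t])
            time.sleep(1.0 / CONTROL_HZ)  # naive pacing; use a fixed-rate
                                          # scheduler on real hardware
        # ...then requery the policy with a fresh observation
```
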
## Performance

### ALOHA Real-World Benchmark Results

| Task | Score (%) |
|------|-----------|
| put X on plate | 100.0 |
| fold shirt | 99.5 |
| put candies in bowl | 89.6 |
| put candy in ziploc bag | 85.4 |
| **Average** | **93.6** |

Scores are average percent completion across 101 total trials (including both in-distribution and out-of-distribution test conditions). The model outperforms baseline policies, including Diffusion Policy (33.6), OpenVLA-OFT+ (62.0), π0 (77.9), and π0.5 (88.6).

### Task Characteristics

- **put X on plate**: Language-conditioned object placement (tests language following)
- **fold shirt**: Multi-step, contact-rich manipulation (tests long-horizon planning)
- **put candies in bowl**: Handling scattered objects (tests multimodal grasp sequences)
- **put candy in ziploc bag**: High-precision, millimeter-tolerance manipulation

## Important Usage Notes

**Hardware Compatibility Warning**: This model was trained on a specific ALOHA 2 robot setup with particular hardware characteristics. Differences between our robot setup and downstream users' hardware (including calibration, joint limits, camera positioning, gripper mechanics, etc.) may significantly impact performance. Exercise caution during deployment.

**Control Frequency**: This policy must be used with a **25 Hz controller** for satisfactory performance (not the original 50 Hz ALOHA control frequency). The reduced frequency was used during both data collection and training.

**Real-World Deployment**: This model operates real robotic hardware; always ensure that proper safety measures are in place. On first deployment of this checkpoint, we highly recommend measuring the difference between the current robot state and the next commanded state (e.g., between current joint angles and predicted actions, which represent target joint angles) and aborting policy execution if the difference is large.

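A minimal version of that abort check (the threshold is an arbitrary placeholder; tune it for your hardware and validate before use):

```python
import numpy as np

MAX_JOINT_DELTA = 0.3  # radians; placeholder threshold, tune per robot

def safe_to_execute(current_joints: np.ndarray, target_joints: np.ndarray) -> bool:
    """Abort if any commanded joint jumps too far from the current reading."""
    delta = np.abs(np.asarray(target_joints) - np.asarray(current_joints))
    if np.any(delta > MAX_JOINT_DELTA):
        print(f"Aborting: max joint delta {delta.max():.3f} rad exceeds threshold")
        return False
    return True
```
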
## Notes

- **Real-world data**: This checkpoint is trained on real-world teleoperation data from the ALOHA 2 robot
- **Bimanual platform**: Designed for dual-arm manipulation with two ViperX 300 S robot arms
- **Fixed camera setup**: Requires a specific camera configuration (top-down + two wrist views)
- **Task-specific**: Trained on four specific bimanual manipulation tasks
- **Hardware sensitivity**: Performance may vary with different robot configurations or hardware setups

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.
<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [ALOHA-Cosmos-Policy](https://huggingface.co/datasets/nvidia/ALOHA-Cosmos-Policy)
- **Planning Model Checkpoint**: Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B (for model-based planning)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*