Upload folder using huggingface_hub
- Cosmos-Policy-RoboCasa-Predict2-2B.pt +3 -0
- README.md +115 -0
Cosmos-Policy-RoboCasa-Predict2-2B.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ed4b991612b0d75e5e633fb2ca821e54c66a3b89973eb361e6ca50ba4984506f
size 3913017345
README.md
ADDED
@@ -0,0 +1,115 @@
# Cosmos-Policy-RoboCasa-Predict2-2B

## Model Description

Cosmos-Policy-RoboCasa-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). It achieves state-of-the-art performance on the RoboCasa simulation benchmark, with a 67.1% average success rate across 24 kitchen manipulation tasks, using only ~50 human teleoperation demonstrations per task.
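
A minimal sketch (not an official loader) of fetching the checkpoint in this repository with `huggingface_hub` and peeking at its contents; the `repo_id` below is assumed from the file name, and the real inference pipeline lives in the Cosmos codebase.

```python
# Minimal sketch: download the ~3.9 GB checkpoint and inspect it.
# Assumptions: the repo id matches the file name, and the .pt file is a
# standard PyTorch checkpoint dict; use the Cosmos codebase for actual inference.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="nvidia/Cosmos-Policy-RoboCasa-Predict2-2B",  # assumed repo id
    filename="Cosmos-Policy-RoboCasa-Predict2-2B.pt",
)

state = torch.load(ckpt_path, map_location="cpu")
print(type(state), list(state)[:5] if isinstance(state, dict) else None)
```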

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 67.1% average success rate on RoboCasa with only 50 demos per task (vs. 300+ for most other methods)
- **Data efficiency**: Trained on significantly fewer demonstrations than prior state-of-the-art methods

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.
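
A conceptual illustration of that idea only, not the paper's actual implementation: low-dimensional action, proprioception, and value tensors are padded to the size of one video latent frame and concatenated along the temporal axis of the latent sequence, so the diffusion transformer denoises them jointly with the image latents. All shapes and the helper below are hypothetical.

```python
# Conceptual sketch only: pack non-image modalities as extra "latent frames"
# appended to the video latent sequence. Shapes are illustrative, not the
# model's real latent dimensions.
import torch

def pack_as_latent_frame(x: torch.Tensor, c: int, h: int, w: int) -> torch.Tensor:
    """Zero-pad a flat tensor and reshape it into one latent frame (c, h, w)."""
    flat = x.reshape(-1)
    padded = torch.zeros(c * h * w)
    padded[: flat.numel()] = flat
    return padded.reshape(c, h, w)

c, h, w = 16, 28, 28                      # illustrative latent-frame shape
video_latents = torch.randn(4, c, h, w)   # e.g. 4 image latent frames
action_chunk = torch.randn(32, 7)         # 32 timesteps x 7-dim actions
proprio = torch.randn(9)                  # 9-D proprioception
value = torch.randn(1)                    # scalar value

extra_frames = torch.stack([
    pack_as_latent_frame(action_chunk, c, h, w),
    pack_as_latent_frame(proprio, c, h, w),
    pack_as_latent_frame(value, c, h, w),
])
latent_sequence = torch.cat([video_latents, extra_frames], dim=0)  # (7, c, h, w)
```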

## Model Details

### Inputs

- **Current state images**:
  - Left third-person camera (`agentview_left`): resized to 224x224 RGB
  - Right third-person camera (`agentview_right`): resized to 224x224 RGB
  - Wrist-mounted camera (`eye_in_hand`): resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: natural language text (e.g., "open the right drawer")
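
For concreteness, a hedged sketch of assembling one observation with the inputs listed above; the dictionary keys and preprocessing details are assumptions, since the official inference code defines the real interface.

```python
# Hedged sketch: build one observation matching the inputs above.
# Dict keys and preprocessing details are assumptions, not the official API.
import numpy as np
from PIL import Image

def preprocess(img: Image.Image) -> np.ndarray:
    """Resize an RGB frame to 224x224, as the model card specifies."""
    return np.asarray(img.convert("RGB").resize((224, 224)), dtype=np.uint8)

# Dummy 640x480 frame stands in for the three RoboCasa camera streams.
dummy = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))

observation = {
    "agentview_left": preprocess(dummy),
    "agentview_right": preprocess(dummy),
    "eye_in_hand": preprocess(dummy),
    # 2 gripper joints + 3 end-effector position + 4 end-effector quaternion
    "proprioception": np.zeros(9, dtype=np.float32),
    "task": "open the right drawer",
}
```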

### Outputs

- **Action chunk**: 32-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+32
- **Future state images**:
  - Wrist camera prediction at timestep t+32
  - Left third-person camera prediction at timestep t+32
  - Right third-person camera prediction at timestep t+32
- **Future state value**: expected cumulative reward from the predicted future state
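
The per-query outputs, sketched as a typed container so the shapes above are explicit; the field names are illustrative, not the policy's actual return type.

```python
# Illustrative container for one policy query's outputs; field names are
# assumptions, but the shapes follow the list above.
from dataclasses import dataclass
import numpy as np

@dataclass
class PolicyOutput:
    action_chunk: np.ndarray    # (32, 7): 6-DoF end-effector control + gripper
    future_proprio: np.ndarray  # (9,): predicted proprioception at t+32
    future_images: dict         # three (224, 224, 3) camera predictions at t+32
    future_value: float         # expected cumulative reward from the future state
```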

### Training Details

**Training Data**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy) dataset
- 24 kitchen manipulation tasks across 7 categories
- 50 human teleoperation demonstrations per task
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 45,000 gradient steps
- **Batch size**: 800 (global)
- **GPUs**: 32 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: full model fine-tuning (all weights updated)
- **Action chunk size**: 32 timesteps (prediction horizon)
- **Execution horizon**: 16 timesteps (recommended; can be varied)
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 among the policy, world model, and value function objectives, respectively.
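
As a rough illustration of that batch split (the noise-schedule details are in the paper and omitted here), each training example can be assigned one of the three objectives with probabilities 0.5/0.25/0.25:

```python
# Rough illustration of the 50/25/25 objective split per training example.
# The hybrid log-normal-uniform noise schedule itself is described in the paper.
import random

def sample_objective(rng: random.Random) -> str:
    return rng.choices(
        ["policy", "world_model", "value"], weights=[0.50, 0.25, 0.25]
    )[0]

rng = random.Random(0)
counts = {"policy": 0, "world_model": 0, "value": 0}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
print(counts)  # roughly 5000 / 2500 / 2500
```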

**Inference Settings**:
- Denoising steps: 5 (can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: parallel (action, future state, and value are generated simultaneously)
- Execution: the first 16 timesteps of the 32-timestep action chunk are executed before re-querying the policy (see the sketch below)
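
A hedged sketch of the receding-horizon execution loop implied by these settings; `policy.predict` and `env.step` are stand-ins for the real Cosmos Policy inference call and the RoboCasa environment API.

```python
# Hedged sketch of receding-horizon control: predict a 32-step action chunk,
# execute the first 16 steps, then re-query. `policy` and `env` are stand-ins
# for the real Cosmos Policy inference wrapper and the RoboCasa environment.
ACTION_CHUNK = 32      # prediction horizon
EXECUTE_HORIZON = 16   # recommended execution horizon

def rollout(policy, env, task: str, max_steps: int = 500) -> bool:
    obs = env.reset()
    for _ in range(0, max_steps, EXECUTE_HORIZON):
        output = policy.predict(obs, task)      # hypothetical interface
        chunk = output.action_chunk             # shape (32, 7)
        for action in chunk[:EXECUTE_HORIZON]:  # execute first 16 timesteps
            obs, reward, done, info = env.step(action)
            if done:
                return bool(info.get("success", False))
    return False
```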

## Performance

### RoboCasa Benchmark Results

| Method | # Training Demos per Task | Average Success Rate |
|--------|---------------------------|----------------------|
| GR00T-N1 | 300 | 49.6% |
| UVA | 50 | 50.0% |
| DP-VLA | 3,000 | 57.3% |
| π0 | 300 | 62.5% |
| GR00T-N1.5 | 300 | 64.1% |
| Video Policy | 300 | 66.0% |
| FLARE | 300 | 66.4% |
| **Cosmos Policy (ours)** | **50** | **67.1%** |

Success rates are averaged over 50 trials per task (5 evaluation scenes with 10 trials each) and 3 random seeds (50 trials × 24 tasks × 3 seeds = 3,600 trials total).

### Key Achievements

- Achieves state-of-the-art performance with 6× fewer demonstrations than most baselines
- Generalizes to unseen object instances and kitchen styles
- Demonstrates strong multi-task manipulation capabilities across diverse kitchen environments

## Notes

- **Simulation only**: this checkpoint is trained and evaluated exclusively in RoboCasa simulation environments
- **Single robot platform**: trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: requires the specific camera configuration above (two third-person views + one wrist view)
- **Kitchen tasks**: designed specifically for kitchen manipulation scenarios

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.

<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*