Upload folder using huggingface_hub
- Cosmos-Policy-RoboCasa-Predict2-2B.pt +3 -0
- README.md +115 -0
Cosmos-Policy-RoboCasa-Predict2-2B.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ed4b991612b0d75e5e633fb2ca821e54c66a3b89973eb361e6ca50ba4984506f
size 3913017345
README.md
ADDED
@@ -0,0 +1,115 @@
# Cosmos-Policy-RoboCasa-Predict2-2B

## Model Description

Cosmos-Policy-RoboCasa-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). It achieves state-of-the-art performance on the RoboCasa simulation benchmark, with a 67.1% average success rate across 24 kitchen manipulation tasks, using only ~50 human teleoperation demonstrations per task.
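
A minimal sketch (not an official loader) of fetching the checkpoint in this repository with `huggingface_hub` and peeking at its contents; the `repo_id` below is assumed from the file name, and the real inference pipeline lives in the Cosmos codebase.

```python
# Minimal sketch: download the ~3.9 GB checkpoint and inspect it.
# Assumptions: the repo id matches the file name, and the .pt file is a
# standard PyTorch checkpoint dict; use the Cosmos codebase for actual inference.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="nvidia/Cosmos-Policy-RoboCasa-Predict2-2B",  # assumed repo id
    filename="Cosmos-Policy-RoboCasa-Predict2-2B.pt",
)

state = torch.load(ckpt_path, map_location="cpu")
print(type(state), list(state)[:5] if isinstance(state, dict) else None)
```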

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 67.1% average success rate on RoboCasa with only 50 demos per task (vs. 300+ for most other methods)
- **Data efficiency**: Trained on significantly fewer demonstrations than prior state-of-the-art methods

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.
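
A conceptual illustration of that idea only, not the paper's actual implementation: low-dimensional action, proprioception, and value tensors are padded to the size of one video latent frame and concatenated along the temporal axis of the latent sequence, so the diffusion transformer denoises them jointly with the image latents. All shapes and the helper below are hypothetical.

```python
# Conceptual sketch only: pack non-image modalities as extra "latent frames"
# appended to the video latent sequence. Shapes are illustrative, not the
# model's real latent dimensions.
import torch

def pack_as_latent_frame(x: torch.Tensor, c: int, h: int, w: int) -> torch.Tensor:
    """Zero-pad a flat tensor and reshape it into one latent frame (c, h, w)."""
    flat = x.reshape(-1)
    padded = torch.zeros(c * h * w)
    padded[: flat.numel()] = flat
    return padded.reshape(c, h, w)

c, h, w = 16, 28, 28                      # illustrative latent-frame shape
video_latents = torch.randn(4, c, h, w)   # e.g. 4 image latent frames
action_chunk = torch.randn(32, 7)         # 32 timesteps x 7-dim actions
proprio = torch.randn(9)                  # 9-D proprioception
value = torch.randn(1)                    # scalar value

extra_frames = torch.stack([
    pack_as_latent_frame(action_chunk, c, h, w),
    pack_as_latent_frame(proprio, c, h, w),
    pack_as_latent_frame(value, c, h, w),
])
latent_sequence = torch.cat([video_latents, extra_frames], dim=0)  # (7, c, h, w)
```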

## Model Details

### Inputs

- **Current state images**:
  - Left third-person camera (`agentview_left`): resized to 224x224 RGB
  - Right third-person camera (`agentview_right`): resized to 224x224 RGB
  - Wrist-mounted camera (`eye_in_hand`): resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: natural language text (e.g., "open the right drawer")
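
For concreteness, a hedged sketch of assembling one observation with the inputs listed above; the dictionary keys and preprocessing details are assumptions, since the official inference code defines the real interface.

```python
# Hedged sketch: build one observation matching the inputs above.
# Dict keys and preprocessing details are assumptions, not the official API.
import numpy as np
from PIL import Image

def preprocess(img: Image.Image) -> np.ndarray:
    """Resize an RGB frame to 224x224, as the model card specifies."""
    return np.asarray(img.convert("RGB").resize((224, 224)), dtype=np.uint8)

# Dummy 640x480 frame stands in for the three RoboCasa camera streams.
dummy = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))

observation = {
    "agentview_left": preprocess(dummy),
    "agentview_right": preprocess(dummy),
    "eye_in_hand": preprocess(dummy),
    # 2 gripper joints + 3 end-effector position + 4 end-effector quaternion
    "proprioception": np.zeros(9, dtype=np.float32),
    "task": "open the right drawer",
}
```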

### Outputs

- **Action chunk**: 32-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+32
- **Future state images**:
  - Wrist camera prediction at timestep t+32
  - Left third-person camera prediction at timestep t+32
  - Right third-person camera prediction at timestep t+32
- **Future state value**: expected cumulative reward from the predicted future state
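
The per-query outputs, sketched as a typed container so the shapes above are explicit; the field names are illustrative, not the policy's actual return type.

```python
# Illustrative container for one policy query's outputs; field names are
# assumptions, but the shapes follow the list above.
from dataclasses import dataclass
import numpy as np

@dataclass
class PolicyOutput:
    action_chunk: np.ndarray    # (32, 7): 6-DoF end-effector control + gripper
    future_proprio: np.ndarray  # (9,): predicted proprioception at t+32
    future_images: dict         # three (224, 224, 3) camera predictions at t+32
    future_value: float         # expected cumulative reward from the future state
```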

### Training Details

**Training Data**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy) dataset
- 24 kitchen manipulation tasks across 7 categories
- 50 human teleoperation demonstrations per task
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 45,000 gradient steps
- **Batch size**: 800 (global)
- **GPUs**: 32 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: full model fine-tuning (all weights updated)
- **Action chunk size**: 32 timesteps (prediction horizon)
- **Execution horizon**: 16 timesteps (recommended; can be varied)
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 among the policy, world model, and value function objectives, respectively.
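
As a rough illustration of that batch split (the noise-schedule details are in the paper and omitted here), each training example can be assigned one of the three objectives with probabilities 0.5/0.25/0.25:

```python
# Rough illustration of the 50/25/25 objective split per training example.
# The hybrid log-normal-uniform noise schedule itself is described in the paper.
import random

def sample_objective(rng: random.Random) -> str:
    return rng.choices(
        ["policy", "world_model", "value"], weights=[0.50, 0.25, 0.25]
    )[0]

rng = random.Random(0)
counts = {"policy": 0, "world_model": 0, "value": 0}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
print(counts)  # roughly 5000 / 2500 / 2500
```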

**Inference Settings**:
- Denoising steps: 5 (can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: parallel (action, future state, and value are generated simultaneously)
- Execution: the first 16 timesteps of the 32-timestep action chunk are executed before re-querying the policy (see the sketch below)
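
A hedged sketch of the receding-horizon execution loop implied by these settings; `policy.predict` and `env.step` are stand-ins for the real Cosmos Policy inference call and the RoboCasa environment API.

```python
# Hedged sketch of receding-horizon control: predict a 32-step action chunk,
# execute the first 16 steps, then re-query. `policy` and `env` are stand-ins
# for the real Cosmos Policy inference wrapper and the RoboCasa environment.
ACTION_CHUNK = 32      # prediction horizon
EXECUTE_HORIZON = 16   # recommended execution horizon

def rollout(policy, env, task: str, max_steps: int = 500) -> bool:
    obs = env.reset()
    for _ in range(0, max_steps, EXECUTE_HORIZON):
        output = policy.predict(obs, task)      # hypothetical interface
        chunk = output.action_chunk             # shape (32, 7)
        for action in chunk[:EXECUTE_HORIZON]:  # execute first 16 timesteps
            obs, reward, done, info = env.step(action)
            if done:
                return bool(info.get("success", False))
    return False
```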

## Performance

### RoboCasa Benchmark Results

| Method | # Training Demos per Task | Average Success Rate |
|--------|---------------------------|----------------------|
| GR00T-N1 | 300 | 49.6% |
| UVA | 50 | 50.0% |
| DP-VLA | 3,000 | 57.3% |
| π0 | 300 | 62.5% |
| GR00T-N1.5 | 300 | 64.1% |
| Video Policy | 300 | 66.0% |
| FLARE | 300 | 66.4% |
| **Cosmos Policy (ours)** | **50** | **67.1%** |

Success rates are averaged over 50 trials per task (5 evaluation scenes with 10 trials each) and 3 random seeds (50 trials × 24 tasks × 3 seeds = 3,600 trials total).

### Key Achievements

- Achieves state-of-the-art performance with 6× fewer demonstrations than most baselines
- Generalizes to unseen object instances and kitchen styles
- Demonstrates strong multi-task manipulation capabilities across diverse kitchen environments

## Notes

- **Simulation only**: this checkpoint is trained and evaluated exclusively in RoboCasa simulation environments
- **Single robot platform**: trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: requires the specific camera configuration above (two third-person views + one wrist view)
- **Kitchen tasks**: designed specifically for kitchen manipulation scenarios

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.

<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [RoboCasa-Cosmos-Policy](https://huggingface.co/datasets/nvidia/RoboCasa-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*