# Cosmos-Policy-LIBERO-Predict2-2B

## Model Description

Cosmos-Policy-LIBERO-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). This model achieves state-of-the-art performance on the LIBERO simulation benchmark with a 98.5% average success rate across four task suites.

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 98.5% average success rate on LIBERO (Spatial: 98.1%, Object: 100.0%, Goal: 98.2%, Long: 97.6%)

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.

## Model Details

### Inputs

- **Current state images**:
  - Third-person camera (agentview): resized to 224x224 RGB
  - Wrist-mounted camera (eye-in-hand): resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: Natural language text (e.g., "put the black bowl on top of the cabinet")

### Outputs

- **Action chunk**: 16-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+16
- **Future state images**:
  - Third-person camera prediction at timestep t+16
  - Wrist camera prediction at timestep t+16
- **Future state value**: Expected cumulative reward from the predicted future state
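
As a concrete reference, the shapes above can be sketched with plain NumPy arrays (variable names and array layouts are illustrative, not taken from the released code):

```python
import numpy as np

# --- Inputs (illustrative names; see the input list above) ---
agentview_img = np.zeros((224, 224, 3), dtype=np.uint8)  # third-person RGB
wrist_img = np.zeros((224, 224, 3), dtype=np.uint8)      # eye-in-hand RGB
proprio = np.zeros(9, dtype=np.float32)  # 2 gripper + 3 EE position + 4 EE quaternion
task = "put the black bowl on top of the cabinet"

# --- Outputs for the current timestep t ---
action_chunk = np.zeros((16, 7), dtype=np.float32)  # 16 steps x (6-DoF EE + 1 gripper)
future_proprio = np.zeros(9, dtype=np.float32)      # predicted state at t+16
future_images = np.zeros((2, 224, 224, 3), dtype=np.uint8)  # both camera views at t+16
future_value = 0.0  # scalar expected cumulative reward
```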

### Training Details

**Training Data**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy) dataset
- 4 task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long
- 500 demonstrations per suite (50 demos × 10 tasks)
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 40,000 gradient steps
- **Batch size**: 1,920 (global)
- **GPUs**: 64 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 16 timesteps
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 across the policy, world model, and value function objectives, respectively.
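
The 50/25/25 split can be made concrete with a little arithmetic on the 1,920-sample global batch; the noise sampler below is a hypothetical sketch of a log-normal-uniform mixture (the actual mixture weights and log-normal parameters come from the paper and are not reproduced here):

```python
import math
import random

# 50/25/25 split of the global batch across the three objectives.
global_batch = 1920
policy_n = global_batch // 2   # 960 samples for the policy objective
world_n = global_batch // 4    # 480 samples for the world model objective
value_n = global_batch // 4    # 480 samples for the value objective
assert policy_n + world_n + value_n == global_batch

def sample_sigma(p_uniform=0.5, sigma_min=4.0, sigma_max=80.0,
                 ln_mean=0.0, ln_std=1.0):
    """Hypothetical log-normal-uniform mixture: with probability p_uniform
    draw the noise level uniformly, otherwise from a log-normal."""
    if random.random() < p_uniform:
        return random.uniform(sigma_min, sigma_max)
    return math.exp(random.gauss(ln_mean, ln_std))
```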

**Inference Settings**:
- Denoising steps: 5 (can be changed at inference time without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Parallel (actions, future state, and value are generated simultaneously)
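
How the 5 denoising levels are spaced between σ_max = 80 and σ_min = 4 is not specified here; one common convention in diffusion samplers is the Karras et al. (EDM) spacing, sketched below as an assumption rather than the exact schedule this model uses:

```python
def edm_sigmas(n=5, sigma_min=4.0, sigma_max=80.0, rho=7.0):
    # Karras et al. spacing: interpolate linearly in sigma**(1/rho) space,
    # then raise back to the rho-th power. Denser steps near sigma_min.
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

sigmas = edm_sigmas()
# Starts at sigma_max, ends at sigma_min, strictly decreasing in between.
```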

## Performance

### LIBERO Benchmark Results

| Task Suite | Success Rate |
|----------------|--------------|
| LIBERO-Spatial | 98.1% |
| LIBERO-Object | 100.0% |
| LIBERO-Goal | 98.2% |
| LIBERO-Long | 97.6% |
| **Average** | **98.5%** |

Success rates are averaged over 500 trials per suite (10 tasks × 50 episodes) across 3 random seeds (6,000 trials total).
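
The reported average is consistent with the per-suite numbers, as a quick check shows:

```python
rates = {"Spatial": 98.1, "Object": 100.0, "Goal": 98.2, "Long": 97.6}
mean = sum(rates.values()) / len(rates)  # 98.475, reported as 98.5
assert round(mean, 1) == 98.5
```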

## Notes

- **Simulation only**: This checkpoint is trained and evaluated exclusively in LIBERO simulation environments
- **Single robot platform**: Trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: Requires a specific camera configuration (third-person + wrist views)

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.

<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*