moojink-nv committed
Commit 336b538 · verified · 1 Parent(s): 80d313c

Upload folder using huggingface_hub

Files changed (2)
  1. Cosmos-Policy-LIBERO-Predict2-2B.pt +3 -0
  2. README.md +101 -0
Cosmos-Policy-LIBERO-Predict2-2B.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8818528d8c9150cda0ddf8c711b0f221b21dac8ac379bd26d5690235954d33e2
size 3913017345

README.md ADDED

# Cosmos-Policy-LIBERO-Predict2-2B

## Model Description

Cosmos-Policy-LIBERO-Predict2-2B is a robot manipulation policy fine-tuned from the [NVIDIA Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video foundation model (checkpoint: `model-480p-16fps.pt`). This model achieves state-of-the-art performance on the LIBERO simulation benchmark with a 98.5% average success rate across four task suites.
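
The checkpoint ships as a single PyTorch `.pt` file (~3.9 GB). A minimal loading sketch follows; the repository id below is an assumption, and actually running the policy requires the Cosmos Policy inference code, which is not shown here.

```python
import torch
from huggingface_hub import hf_hub_download

# Assumed repo id; adjust to the actual Hugging Face repository.
ckpt_path = hf_hub_download(
    repo_id="nvidia/Cosmos-Policy-LIBERO-Predict2-2B",
    filename="Cosmos-Policy-LIBERO-Predict2-2B.pt",
)
state = torch.load(ckpt_path, map_location="cpu")  # raw checkpoint contents
print(type(state))  # typically a dict of model weights / training state
```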

**Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

### Key Features

- **Single-stage fine-tuning**: Adapted from the pretrained video model with no architectural modifications
- **Multimodal outputs**: Jointly predicts actions, future states, and values through unified video diffusion
- **High performance**: 98.5% average success rate on LIBERO (Spatial: 98.1%, Object: 100.0%, Goal: 98.2%, Long: 97.6%)

### Model Architecture

This model uses the same architecture as the base Cosmos-Predict2-2B model (a diffusion transformer with latent video diffusion). Please refer to the [base model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

**Key adaptation**: Actions, proprioceptive states, and values are encoded as latent frames and injected directly into the video model's latent diffusion sequence, enabling the model to generate these modalities alongside predicted future images.
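
A conceptual sketch of this adaptation, with made-up tensor shapes (the actual latent dimensions, tokenizer, and frame layout are defined in the paper and codebase): each low-dimensional modality is projected to the shape of one latent video frame and concatenated onto the sequence that the diffusion transformer denoises.

```python
import torch
import torch.nn as nn

# Hypothetical latent sizes, for illustration only.
C, H, W = 16, 28, 28  # latent frame channels and spatial dims (assumed)

class LatentPacker(nn.Module):
    """Embed actions, proprioception, and value as extra latent frames."""

    def __init__(self):
        super().__init__()
        self.action_proj = nn.Linear(16 * 7, C * H * W)  # 16-step, 7-dim chunk
        self.state_proj = nn.Linear(9, C * H * W)        # 9-dim proprioception
        self.value_proj = nn.Linear(1, C * H * W)        # scalar value

    def forward(self, video_latents, actions, proprio, value):
        # video_latents: (B, T, C, H, W); actions: (B, 16, 7);
        # proprio: (B, 9); value: (B, 1)
        B = video_latents.shape[0]
        extra = [
            self.action_proj(actions.flatten(1)).view(B, 1, C, H, W),
            self.state_proj(proprio).view(B, 1, C, H, W),
            self.value_proj(value).view(B, 1, C, H, W),
        ]
        # One unified latent sequence for the diffusion transformer.
        return torch.cat([video_latents, *extra], dim=1)
```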

## Model Details

### Inputs

- **Current state images**:
  - Third-person camera (agentview): Resized to 224x224 RGB
  - Wrist-mounted camera (eye-in-hand): Resized to 224x224 RGB
- **Robot proprioception**: 9-dimensional (2 gripper joints + 3 end-effector position + 4 end-effector quaternion)
- **Task description**: Natural language text (e.g., "put the black bowl on top of the cabinet")

### Outputs

- **Action chunk**: 16-timestep sequence of 7-dimensional actions (6-DoF end-effector control + 1 gripper)
- **Future robot proprioception**: 9-dimensional state at timestep t+16
- **Future state images**:
  - Third-person camera prediction at timestep t+16
  - Wrist camera prediction at timestep t+16
- **Future state value**: Expected cumulative reward from the predicted future state
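
A sketch of this input/output interface, using a hypothetical `policy` object in place of the real inference stack; the names and call signature are illustrative, but the shapes follow the lists above.

```python
import numpy as np
from PIL import Image

def preprocess(agentview: Image.Image, wrist: Image.Image,
               proprio: np.ndarray, task: str) -> dict:
    """Resize both camera views to 224x224 RGB and bundle the policy inputs."""
    assert proprio.shape == (9,)  # 2 gripper + 3 position + 4 quaternion
    return {
        "agentview": np.asarray(agentview.convert("RGB").resize((224, 224))),
        "wrist": np.asarray(wrist.convert("RGB").resize((224, 224))),
        "proprio": proprio,
        "task": task,  # e.g., "put the black bowl on top of the cabinet"
    }

# Hypothetical call; the real entry point comes from the Cosmos Policy code.
# out = policy.generate(**preprocess(agentview_img, wrist_img, proprio, task))
# out["actions"]        -> (16, 7) action chunk (6-DoF end-effector + gripper)
# out["future_proprio"] -> (9,) proprioception at t+16
# out["future_images"]  -> two 224x224 camera predictions at t+16
# out["value"]          -> scalar expected cumulative reward
```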

### Training Details

**Training Data**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy) dataset
- 4 task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long
- 500 demonstrations per suite (50 demos × 10 tasks)
- Successful demonstrations used for policy training
- All demonstrations (including failures) used for world model and value function training (see the sketch below)
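
One way this split could be implemented when assembling batches; the `success` flag and record layout below are assumptions about the dataset format, not its documented schema.

```python
# Hypothetical episode records; the real dataset schema may differ.
episodes = [
    {"suite": "libero_spatial", "task_id": 0, "success": True},
    {"suite": "libero_spatial", "task_id": 0, "success": False},
]

policy_episodes = [e for e in episodes if e["success"]]  # policy objective only
world_and_value_episodes = episodes  # world model + value function objectives
```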

**Training Configuration**:
- **Base model**: NVIDIA Cosmos-Predict2-2B-Video2World (`model-480p-16fps.pt`)
- **Training steps**: 40,000 gradient steps
- **Batch size**: 1,920 (global)
- **GPUs**: 64 H100 GPUs
- **Training time**: ~48 hours
- **Optimization**: Full model fine-tuning (all weights updated)
- **Action chunk size**: 16 timesteps
- **Image resolution**: 224x224 pixels

**Training Objective**: The model is trained with a hybrid log-normal-uniform noise distribution (modified from the base model's log-normal distribution; see the paper for details) to improve action prediction accuracy. Training batches are split 50/25/25 across the policy, world model, and value function objectives, respectively.
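
A sketch of one way to realize this sampling scheme; the mixture weight and log-normal parameters below are placeholders (the paper specifies the actual values).

```python
import random
import torch

def sample_sigma(batch: int, p_uniform: float = 0.5, p_mean: float = 0.0,
                 p_std: float = 1.0, s_min: float = 4.0,
                 s_max: float = 80.0) -> torch.Tensor:
    """Hybrid noise levels: log-normal draws mixed with uniform draws."""
    lognormal = torch.exp(p_mean + p_std * torch.randn(batch))
    uniform = s_min + (s_max - s_min) * torch.rand(batch)
    return torch.where(torch.rand(batch) < p_uniform, uniform, lognormal)

# 50/25/25 batch split across the three training objectives.
objective = random.choices(
    ["policy", "world_model", "value"], weights=[0.50, 0.25, 0.25]
)[0]
```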

**Inference Settings**:
- Denoising steps: 5 (can be changed without retraining)
- Noise level range: σ_min = 4.0, σ_max = 80.0
- Generation mode: Parallel (action, future state, and value generated simultaneously)
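
The card does not name the sampler, so as an assumption, here is the common Karras (EDM-style) spacing of 5 noise levels between σ_max = 80.0 and σ_min = 4.0:

```python
import torch

def karras_sigmas(n: int = 5, s_min: float = 4.0, s_max: float = 80.0,
                  rho: float = 7.0) -> torch.Tensor:
    """EDM-style noise schedule from s_max down to s_min (assumed sampler)."""
    ramp = torch.linspace(0.0, 1.0, n)
    return (s_max ** (1 / rho)
            + ramp * (s_min ** (1 / rho) - s_max ** (1 / rho))) ** rho

print(karras_sigmas())  # 5 decreasing noise levels, 80.0 -> 4.0
```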

## Performance

### LIBERO Benchmark Results

| Task Suite     | Success Rate |
|----------------|--------------|
| LIBERO-Spatial | 98.1%        |
| LIBERO-Object  | 100.0%       |
| LIBERO-Goal    | 98.2%        |
| LIBERO-Long    | 97.6%        |
| **Average**    | **98.5%**    |

Success rates are averaged over 500 trials per suite (10 tasks × 50 episodes) across 3 random seeds (6,000 trials total).
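
A quick check of the reported aggregates:

```python
rates = {"Spatial": 98.1, "Object": 100.0, "Goal": 98.2, "Long": 97.6}
print(sum(rates.values()) / len(rates))  # 98.475 -> reported as 98.5%
print(4 * 10 * 50 * 3)  # 4 suites x 10 tasks x 50 episodes x 3 seeds = 6000
```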

## Notes

- **Simulation only**: This checkpoint is trained and evaluated exclusively in LIBERO simulation environments
- **Single robot platform**: Trained only for the Franka Emika Panda robot arm
- **Fixed camera setup**: Requires the specific camera configuration described above (third-person + wrist views)

## Citation

If you use this model, please cite the Cosmos Policy paper by Kim et al.
<!-- ```bibtex
# TODO: Add Cosmos Policy BibTeX
``` -->

## License

Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

## Related Resources

- **Base Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
- **Training Dataset**: [LIBERO-Cosmos-Policy](https://huggingface.co/datasets/nvidia/LIBERO-Cosmos-Policy)
- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*