harrim-nv committed on
Commit d97a4c9 · verified · 1 Parent(s): 3ab5d79

Update README.md

Files changed (1): README.md (+201 -72)

README.md CHANGED
@@ -1,123 +1,252 @@
- # Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B

- ## Model Description

- Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a refined world model and value function checkpoint fine-tuned from [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving improved performance on challenging manipulation tasks. This checkpoint should NOT be deployed on its own.

- **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*

- ### Key Features

- - **Refined predictions**: Fine-tuned on policy rollout data for more accurate world model and value function predictions
- - **Dual deployment**: Used alongside base Cosmos Policy checkpoint for model-based planning
- - **Improved performance**: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

- ### Model Architecture

- This model uses the same architecture as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B). Please refer to that model card and the [base Cosmos-Predict2-2B model card](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for detailed architecture specifications.

- ## Model Details

- ### Inputs

- Same as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B):
- - **Current state images**: Top-down camera, left wrist camera, right wrist camera (all 224x224 RGB)
- - **Robot proprioception**: 14-dimensional (7 joint angles per arm)
- - **Action chunk**: 50-timestep sequence of 14-dimensional actions (for world model and value prediction)

- ### Outputs

- - **Future robot proprioception**: 14-dimensional state at timestep t+50
- - **Future state images**:
- - Top-down third-person camera prediction at timestep t+50
- - Left wrist camera prediction at timestep t+50
- - Right wrist camera prediction at timestep t+50
- - **Future state value**: Expected cumulative reward from future state (V(s'))
  **Note**: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).
- ### Training Details

- **Base Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B)

- **Fine-tuning Data**: 648 policy rollout episodes collected from various methods (see paper for details)
  - Includes both successful and failed episodes
  - Covers diverse initial conditions and execution trajectories
  - Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

  **Training Configuration**:
- - **Training steps**: Details in paper
  - **Batch split**: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  - **GPUs**: 8 H100 GPUs

  **Training Objective**: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
- ## Usage: Dual Deployment for Model-Based Planning
-
- This checkpoint is designed for **dual deployment** with the base Cosmos Policy checkpoint:
-
- 1. **Policy Model** ([Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B)): Generates N candidate action chunks
- 2. **Planning Model** (this checkpoint): For each candidate action:
- - Predicts future state (world model)
- - Predicts future state value (value function)
- - Averages across ensemble predictions (3 future state predictions × 5 value predictions = 15 total value estimates per action)
- 3. **Selection**: Execute the action chunk with the highest predicted value

- See the paper for complete implementation details of the best-of-N planning algorithm.

- ## Performance

- ### Planning Performance on ALOHA Tasks

- When used for model-based planning with the base policy checkpoint:

- | Task | Base Policy Score | With Planning (this checkpoint) | Improvement |
- |------|------------------|--------------------------------|-------------|
- | put candies in bowl | 49.0 | 60.0 | +11.0 |
- | put candy in ziploc bag | 70.0 | 84.0 | +14.0 |
- | **Average** | **60.0** | **72.0** | **+12.5** |

- Results are on challenging initial conditions for these two tasks. Planning with this checkpoint enables the policy to be more likely to avoid errors (e.g., losing grasp of objects) by selecting higher-quality actions.
- ## Important Usage Notes

- **Inference Latency**: Model-based planning with dual deployment has significantly higher inference latency:
- - **Planning mode (dual deployment)**: ~4.9 seconds per action chunk using 8 parallel H100 GPUs
- - **Direct policy mode (base checkpoint only)**: ~0.95 seconds per action chunk using 1 H100 GPU

- For applications requiring faster inference, we recommend using the base [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) checkpoint alone without planning.

- **Hardware Requirements**: Model-based planning requires:
- - Multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8)
- - Sufficient compute for ensemble predictions (3 world model queries × 5 value function queries per action)

- **When to Use Planning**: Planning is most beneficial for:
- - Challenging tasks with high precision requirements
- - Situations where avoiding errors is critical
- - Scenarios where additional compute time is acceptable

- **Same warnings as base checkpoint apply**: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) model card for details.

- ## Notes

- - **Specialized checkpoint**: Optimized specifically for world model and value function predictions, not action generation
- - **Requires base policy**: Must be used in conjunction with Cosmos-Policy-ALOHA-Predict2-2B for planning
- - **Compute-intensive**: Significantly higher computational requirements than direct policy execution
- - **Real-world tested**: Evaluated on real ALOHA 2 hardware in challenging manipulation scenarios

- ## Citation

- If you use this model, please cite the Cosmos Policy paper by Kim et al.
- <!-- ```bibtex
- # TODO: Add Cosmos Policy BibTeX
- ``` -->

- ## License

- Please refer to the [base model license](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) for licensing information.

  ## Related Resources

  - **Base Policy Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) (required for planning)
  - **Base Video Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
  - **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
+ ---
+ base_model:
+ - nvidia/Cosmos-Policy-ALOHA-Predict2-2B
+ - nvidia/Cosmos-Predict2-2B-Video2World
+ ---
+ # **Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B**

+ [**Cosmos Policy**](https://huggingface.co/collections/nvidia/cosmos-policy) | [**Code**](http://github.com/NVlabs/cosmos-policy) | [**White Paper**]() | [**Website**](https://research.nvidia.com/labs/dir/cosmos-policy/)

+ # Model Overview

+ ## Description:

+ Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a 2B-parameter refined world model and value function checkpoint fine-tuned from [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving a 12.5 percentage point average score increase on challenging ALOHA manipulation tasks.

+ Key features:

+ * **Refined predictions**: Fine-tuned on policy rollout data for more accurate world model and value function predictions
+ * **Dual deployment**: Used alongside base Cosmos Policy checkpoint for model-based planning
+ * **Improved performance**: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

+ Use cases:

+ * Model-based planning for bimanual robot manipulation
+ * Best-of-N action selection via value-based search
+ * Improving policy robustness on high-precision tasks
+ * Error avoidance in contact-rich manipulation

+ This model is for research and development only.

+ **Model Developer**: NVIDIA
+ ## Model Versions

+ Cosmos Policy models include the following:
+
+ - [Cosmos-Policy-LIBERO-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-LIBERO-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
+ - [Cosmos-Policy-RoboCasa-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-RoboCasa-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
+ - [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B): Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
+ - [Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B): Given current state observations, a task description, and action sequences, generate future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)
+
+ ### License:
+
+ This model is released under the [NVIDIA One-Way Noncommercial License (NSCLv1)](https://github.com/NVlabs/HMAR/blob/main/LICENSE). For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
+
+ Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:
+
+ * Models are not for commercial use.
+ * NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.
+
+ ### Deployment Geography:
+
+ Global
+
+ ### Use Case:
+
+ Physical AI: Model-based planning for bimanual robot manipulation in real-world environments, encompassing world modeling and value function prediction for best-of-N action selection.
+
+ ### Release Date:
+
+ GitHub [01/06/2026] via [https://github.com/nvlabs/cosmos-policy](https://github.com/nvlabs/cosmos-policy)
+
+ Hugging Face [01/06/2026] via [https://huggingface.co/collections/nvidia/cosmos-policy](https://huggingface.co/collections/nvidia/cosmos-policy)
+
+ ## Model Architecture:
+
+ Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B.
+
+ Network Architecture: The model uses the same architecture as [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).
+
+ **Key adaptation**: This checkpoint is specifically optimized for world model and value function predictions through fine-tuning on policy rollout data with emphasis on future state and value prediction accuracy.
+
+ **Number of model parameters:**
+
+ 2B (inherited from base model)
+
+ ## Input
+
+ **Input Type(s)**: Text + Multi-view Images + Proprioceptive State + Action Sequence
+
+ **Input Format(s)**:
+
+ * Text: String (natural language task description)
+ * Images: RGB images from multiple camera views
+ * Proprioception: Numerical array
+ * Actions: Numerical array
+
+ **Input Parameters**:
+
+ * Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
+ * Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
+ * Proprioception: One-dimensional (1D) - 14-dimensional state (7 joint angles per arm)
+ * Actions: Two-dimensional (2D) - 50-timestep sequence of 14-dimensional actions (for world model and value prediction)
+
+ **Other Properties Related to Input**:
+
+ * Requires specific camera configuration (top-down + two wrist views)
+ * Images resized to 224×224 pixels from original resolution
+ * Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
+ * Control frequency: 25 Hz
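The input spec above can be encoded as simple shape assertions for a sanity check before querying the model. This is an illustrative sketch only: the function and field names below are hypothetical, not part of the released API.

```python
import numpy as np

# Shapes implied by the model card (names are illustrative, not the real API)
IMG_SHAPE = (224, 224, 3)    # each RGB camera view
PROPRIO_DIM = 14             # 7 joint angles per arm
CHUNK_LEN, ACTION_DIM = 50, 14

def validate_planner_inputs(task, images, proprio, actions):
    """Check that one planning query matches the documented input spec."""
    assert isinstance(task, str) and task, "task description must be a non-empty string"
    assert set(images) == {"top", "left_wrist", "right_wrist"}, "three camera views required"
    for name, img in images.items():
        assert img.shape == IMG_SHAPE, f"{name} view must be 224x224 RGB"
    assert proprio.shape == (PROPRIO_DIM,), "proprioception is 14-dimensional"
    assert actions.shape == (CHUNK_LEN, ACTION_DIM), "action chunk is 50 steps of 14-dim actions"
    return True

ok = validate_planner_inputs(
    "put candy in ziploc bag",
    {v: np.zeros(IMG_SHAPE, dtype=np.uint8) for v in ("top", "left_wrist", "right_wrist")},
    np.zeros(PROPRIO_DIM),
    np.zeros((CHUNK_LEN, ACTION_DIM)),
)
```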

+ ## Output
+
+ **Output Type(s)**: Future State Predictions + Value Estimate
+
+ **Output Format**:
+
+ * Future states: Images + Proprioception
+ * Value: Scalar
+
+ **Output Parameters**:
+
+ * Future robot proprioception: 14-dimensional state at timestep t+50
+ * Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
+ * Future state value: Expected cumulative reward from future state (scalar)
+
+ **Other Properties Related to Output**:
+
+ * Denoising steps: 10 (configurable without retraining)
+ * Noise level range: σ_min = 4.0, σ_max = 80.0
+ * Ensemble predictions: 3 world model queries × 5 value function queries per action (15 total value estimates)
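The ensemble scheme above (3 world-model queries × 5 value queries = 15 estimates per action chunk) can be sketched as a nested loop with averaging. `predict_future_state` and `predict_value` below are placeholder stand-ins for the actual model calls, not the released API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two model queries described above:
# the real checkpoint predicts a future state, then a scalar value for it.
def predict_future_state(state, action_chunk, seed):
    return {"seed": seed}  # placeholder future-state prediction

def predict_value(future_state, seed):
    return float(rng.normal())  # placeholder scalar value estimate

def ensemble_value(state, action_chunk, n_world=3, n_value=5):
    """Average n_world x n_value value estimates (3 x 5 = 15 per the card)."""
    estimates = []
    for w in range(n_world):
        future = predict_future_state(state, action_chunk, seed=w)
        for v in range(n_value):
            estimates.append(predict_value(future, seed=v))
    assert len(estimates) == n_world * n_value  # 15 estimates per action chunk
    return float(np.mean(estimates))

score = ensemble_value(state=None, action_chunk=None)
```

Averaging over both sources of stochasticity (world-model samples and value-function samples) reduces the variance of the per-action value estimate used for selection.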

  **Note**: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B).

+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+
+ ## Software Integration
+
+ **Runtime Engine(s):**
+
+ * [Transformers](https://github.com/huggingface/transformers)
+
+ **Supported Hardware Microarchitecture Compatibility:**
+
+ * NVIDIA Hopper (e.g., H100)
+
+ **Note**: We have only tested inference with BF16 precision.
+
+ **Operating System(s):**
+
+ * Linux

+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

+ **Dual Deployment**: This checkpoint is designed for dual deployment with [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B):
+
+ 1. **Policy Model** (Cosmos-Policy-ALOHA-Predict2-2B): Generates N candidate action chunks
+ 2. **Planning Model** (this checkpoint): For each candidate action, predicts future state (world model) and future state value (value function), averaging across ensemble predictions
+ 3. **Selection**: Execute the action chunk with the highest predicted value
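The three steps above can be sketched as a minimal best-of-N loop. The sampler and scorer below are hypothetical placeholders for the two checkpoints; only the selection logic mirrors the described procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
N, CHUNK_LEN, ACTION_DIM = 8, 50, 14  # N=8 candidates, 50-step chunks of 14-dim actions

# Placeholder policy and planner; the real calls would go to the two checkpoints.
def sample_action_chunk(_state):
    # step 1: policy model proposes one candidate action chunk
    return rng.normal(size=(CHUNK_LEN, ACTION_DIM))

def score_action_chunk(_state, chunk):
    # step 2: stand-in for the ensemble world-model + value-function query
    return float(-np.abs(chunk).mean())

def best_of_n(state, n=N):
    candidates = [sample_action_chunk(state) for _ in range(n)]
    values = [score_action_chunk(state, c) for c in candidates]
    best = int(np.argmax(values))  # step 3: pick the highest predicted value
    return candidates[best], values[best]

chunk, value = best_of_n(state=None)
```

In the real system the N candidates are scored in parallel across GPUs, which is why 8 GPUs are recommended for N=8.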

+ **Inference Latency**: Model-based planning with dual deployment has significantly higher inference latency:
+
+ - **Planning mode (dual deployment)**: ~4.9 seconds per action chunk using 8 parallel H100 GPUs
+ - **Direct policy mode (base checkpoint only)**: ~0.95 seconds per action chunk using 1 H100 GPU
+
+ **Hardware Requirements**: Model-based planning requires multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8) and sufficient compute for ensemble predictions.
+
+ **When to Use Planning**: Planning is most beneficial for challenging tasks with high precision requirements, situations where avoiding errors is critical, and scenarios where additional compute time is acceptable.
+
+ **Same warnings as base checkpoint apply**: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) model card for details.
+
+ # Usage
+
+ See [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.
+
+ ## Training and Evaluation:
+
+ ### Training Datasets:
+
+ **Data Collection Method**:
+
+ * ALOHA-Planning-Rollouts: Automated - Policy rollout episodes collected from various methods
+
+ **Labeling Method**:
+
+ * ALOHA-Planning-Rollouts: Automated - Success/failure labels automatically determined through policy execution and environment evaluation
+
+ ##### **Properties:**
+
+ **Training Data**: Policy rollout data
+
+ - 648 policy rollout episodes
  - Includes both successful and failed episodes
  - Covers diverse initial conditions and execution trajectories
  - Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

  **Training Configuration**:
+
+ - **Base model**: Cosmos-Policy-ALOHA-Predict2-2B
+ - **Training steps**: See paper for details
  - **Batch split**: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  - **GPUs**: 8 H100 GPUs
+ - **Optimization**: Full model fine-tuning (all weights updated)
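The 10/45/45 batch split can be read as a per-batch sampling distribution over the three training objectives. A minimal sketch, assuming each batch's objective is drawn independently (the actual training scheduler may differ):

```python
import random

random.seed(0)

# 10/45/45 split over training objectives, as stated in the configuration above
BATCH_SPLIT = {"policy": 0.10, "world_model": 0.45, "value_function": 0.45}

def sample_batch_type(split=BATCH_SPLIT):
    """Draw which objective the next training batch is used for."""
    return random.choices(list(split), weights=list(split.values()), k=1)[0]

# Empirically, world model + value function together get ~90% of batches
counts = {k: 0 for k in BATCH_SPLIT}
for _ in range(10_000):
    counts[sample_batch_type()] += 1
assert (counts["world_model"] + counts["value_function"]) / 10_000 > 0.85
```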
194
 
195
  **Training Objective**: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
196
 
197
+ ### Evaluation Datasets:
 
 
 
 
 
 
 
 
 
198
 
199
+ Data Collection Method: Not Applicable
200
 
201
+ Labeling Method: Not Applicable
202
 
203
+ Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.
204
 
205
+ ## Inference:
206
 
207
+ **Test Hardware:** H100
 
 
 
 
208
 
209
+ See [Cosmos Policy GitHub](http://github.com/NVlabs/cosmos-policy) for details.
210
 
211
+ #### System Requirements and Performance
212
 
213
+ Inference with model-based planning (dual deployment):
 
 
214
 
215
+ * 8 H100 GPUs (recommended for N=8 best-of-N search)
216
+ * ~4.9 seconds per action chunk
217
+ * Ensemble predictions: 3 world model × 5 value function queries per action
218
 
219
+ #### Quality Benchmarks
 
 
220
 
221
+ ### Planning Performance on ALOHA Tasks
 
 
 
222
 
223
+ When used for model-based planning with the base policy checkpoint:

+ | Task | Base Policy Score | With Planning (this checkpoint) | Improvement |
+ | ----------------------- | ----------------- | ------------------------------- | --------------- |
+ | put candies in bowl | 49.0 | 60.0 | +11.0 |
+ | put candy in ziploc bag | 70.0 | 84.0 | +14.0 |
+ | **Average** | **59.5** | **72.0** | **+12.5** |

+ Results are on challenging initial conditions for these two tasks. Planning with this checkpoint makes the policy more likely to avoid errors (e.g., losing grasp of objects) by selecting higher-quality actions.

+ ## Ethical Considerations

+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

+ Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

  ## Related Resources

  - **Base Policy Checkpoint**: [Cosmos-Policy-ALOHA-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Policy-ALOHA-Predict2-2B) (required for planning)
  - **Base Video Model**: [Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
  - **Paper**: *Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning*
+ - **Original ALOHA**: [Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware](https://arxiv.org/abs/2304.13705)
+
+ ## Citation
+
+ If you use this model, please cite the Cosmos Policy paper:
+
+ (Cosmos Policy BibTeX citation coming soon!)