Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B

Cosmos Policy | Code | White Paper | Website

Model Overview

Description:

Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a 2B-parameter refined world model and value function checkpoint fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving a 12.5 percentage point average score increase on challenging ALOHA manipulation tasks.

Key features:

  • Refined predictions: Fine-tuned on policy rollout data for more accurate world model and value function predictions
  • Dual deployment: Used alongside base Cosmos Policy checkpoint for model-based planning
  • Improved performance: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

Use cases:

  • Model-based planning for bimanual robot manipulation
  • Best-of-N action selection via value-based search
  • Improving policy robustness on high-precision tasks
  • Error avoidance in contact-rich manipulation

This model is for research and development only.

Model Developer: NVIDIA

Model Versions

Cosmos Policy models include the following:

  • Cosmos-Policy-LIBERO-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
  • Cosmos-Policy-RoboCasa-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
  • Cosmos-Policy-ALOHA-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
  • Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B: Given current state observations, a task description, and action sequences, generate future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)

License:

This model is released under the NVIDIA One-Way Noncommercial License (NSCLv1). For a custom license, please contact cosmos-license@nvidia.com.

Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:

  • Models are not for commercial use.
  • NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

Deployment Geography:

Global

Use Case:

Physical AI: Model-based planning for bimanual robot manipulation in real-world environments, encompassing world modeling and value function prediction for best-of-N action selection.

Release Date:

GitHub [01/22/2026] via https://github.com/nvlabs/cosmos-policy

Hugging Face [01/22/2026] via https://huggingface.co/collections/nvidia/cosmos-policy

Model Architecture:

Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B.

Network Architecture: The model uses the same architecture as Cosmos-Policy-ALOHA-Predict2-2B.

Key adaptation: This checkpoint is specifically optimized for world model and value function predictions through fine-tuning on policy rollout data with emphasis on future state and value prediction accuracy.

Number of model parameters:

2B (inherited from base model)

Input

Input Type(s): Text + Multi-view Images + Proprioceptive State + Action Sequence

Input Format(s):

  • Text: String (natural language task description)
  • Images: RGB images from multiple camera views
  • Proprioception: Numerical array
  • Actions: Numerical array

Input Parameters:

  • Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
  • Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
  • Proprioception: One-dimensional (1D) - 14-dimensional state (7 joint angles per arm)
  • Actions: One-dimensional (1D) - 50-timestep sequence of 14-dimensional actions (for world model and value prediction)

Other Properties Related to Input:

  • Requires specific camera configuration (top-down + two wrist views)
  • Images resized to 224×224 pixels from original resolution
  • Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
  • Control frequency: 25 Hz
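As a rough illustration of these input conventions, a planning query might be assembled as follows. The variable names are hypothetical; only the shapes and dtypes come from the specification above:

```python
import numpy as np

# Hypothetical input bundle matching the documented shapes (illustrative only).
task = "put candy in ziploc bag"  # natural-language task description
images = {
    "top": np.zeros((224, 224, 3), dtype=np.uint8),          # top-down third-person view
    "left_wrist": np.zeros((224, 224, 3), dtype=np.uint8),   # left wrist-mounted camera
    "right_wrist": np.zeros((224, 224, 3), dtype=np.uint8),  # right wrist-mounted camera
}
proprio = np.zeros(14, dtype=np.float32)         # 14-dim state (7 joint angles per arm)
actions = np.zeros((50, 14), dtype=np.float32)   # 50-timestep candidate action chunk

assert all(v.shape == (224, 224, 3) for v in images.values())
assert proprio.shape == (14,) and actions.shape == (50, 14)
```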

Output

Output Type(s): Future State Predictions + Value Estimate

Output Format:

  • Future states: Images + Proprioception
  • Value: Scalar

Output Parameters:

  • Future robot proprioception: 14-dimensional state at timestep t+50
  • Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
  • Future state value: Expected cumulative reward from future state (scalar)

Other Properties Related to Output:

  • Denoising steps: 10 (configurable without retraining)
  • Noise level range: σ_min = 4.0, σ_max = 80.0
  • Ensemble predictions: 3 world model queries × 5 value function queries per action (15 total value estimates)
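One way to realize the 3 × 5 ensemble above is to average 15 value samples per candidate action. The `world_model` and `value_fn` callables below are placeholders for illustration, not the released API:

```python
import numpy as np

def ensemble_value(world_model, value_fn, obs, action_chunk, n_wm=3, n_vf=5):
    """Average n_wm * n_vf value estimates for one candidate action chunk."""
    values = []
    for _ in range(n_wm):                      # 3 world-model queries
        future_state = world_model(obs, action_chunk)
        for _ in range(n_vf):                  # 5 value queries per predicted state
            values.append(value_fn(future_state))
    return float(np.mean(values))              # 15 estimates total

# Toy stand-ins to show the call pattern:
v = ensemble_value(lambda o, a: o, lambda s: 1.0, obs=0.0, action_chunk=None)
# v averages 15 identical samples, so v == 1.0
```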

Note: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use Cosmos-Policy-ALOHA-Predict2-2B.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Hopper (e.g., H100)

Note: Inference has only been tested with BF16 precision.

Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Dual Deployment: This checkpoint is designed for dual deployment with Cosmos-Policy-ALOHA-Predict2-2B:

  1. Policy Model (Cosmos-Policy-ALOHA-Predict2-2B): Generates N candidate action chunks
  2. Planning Model (this checkpoint): For each candidate action, predicts future state (world model) and future state value (value function), averaging across ensemble predictions
  3. Selection: Execute the action chunk with the highest predicted value
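The three steps above amount to a best-of-N loop. A minimal sketch, with `sample_actions` and `predict_value` as illustrative names rather than the released API (in practice the N candidates are evaluated in parallel across GPUs):

```python
def plan_best_of_n(policy, planner, obs, n=8):
    """Best-of-N action selection: sample N candidates, return the highest-value one."""
    candidates = [policy.sample_actions(obs) for _ in range(n)]   # step 1: policy model
    values = [planner.predict_value(obs, a) for a in candidates]  # step 2: planning model
    best = max(range(n), key=lambda i: values[i])                 # step 3: selection
    return candidates[best]
```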

Inference Latency: Model-based planning with dual deployment has significantly higher inference latency:

  • Planning mode (dual deployment): ~4.9 seconds per action chunk using 8 parallel H100 GPUs
  • Direct policy mode (base checkpoint only): ~0.95 seconds per action chunk using 1 H100 GPU

Hardware Requirements: Model-based planning requires multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8) and sufficient compute for ensemble predictions.

When to Use Planning: Planning is most beneficial for challenging tasks with high precision requirements, situations where avoiding errors is critical, and scenarios where additional compute time is acceptable.

Same warnings as base checkpoint apply: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See Cosmos-Policy-ALOHA-Predict2-2B model card for details.

Usage

See Cosmos Policy GitHub for details.

Training and Evaluation:

Training Datasets:

Data Collection Method:

  • ALOHA-Planning-Rollouts: Automated - Policy rollout episodes collected from various methods

Labeling Method:

  • ALOHA-Planning-Rollouts: Automated - Success/failure labels automatically determined through policy execution and environment evaluation

Properties:

Training Data: Policy rollout data

  • 648 policy rollout episodes
  • Includes both successful and failed episodes
  • Covers diverse initial conditions and execution trajectories
  • Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

Training Configuration:

  • Base model: Cosmos-Policy-ALOHA-Predict2-2B
  • Training steps: See paper for details
  • Batch split: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  • GPUs: 8 H100 GPUs
  • Optimization: Full model fine-tuning (all weights updated)

Training Objective: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
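The 10/45/45 batch split can be realized as a per-batch objective draw. A hedged sketch: the proportions come from the training configuration above, but the sampling mechanism is assumed for illustration:

```python
import random

OBJECTIVES = ["policy", "world_model", "value_function"]
WEIGHTS = [0.10, 0.45, 0.45]  # 10/45/45 batch split from the training config

def sample_objective(rng: random.Random) -> str:
    """Pick the training objective for the next batch."""
    return rng.choices(OBJECTIVES, weights=WEIGHTS, k=1)[0]

# Sanity check: empirical frequencies track the configured split.
rng = random.Random(0)
counts = {o: 0 for o in OBJECTIVES}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
# counts are roughly proportional to 1000 / 4500 / 4500
```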

Evaluation Datasets:

Data Collection Method: Not Applicable

Labeling Method: Not Applicable

Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.

Inference:

Test Hardware: H100

See Cosmos Policy GitHub for details.

System Requirements and Performance

Inference with model-based planning (dual deployment):

  • 8 H100 GPUs (recommended for N=8 best-of-N search)
  • ~4.9 seconds per action chunk
  • Ensemble predictions: 3 world model × 5 value function queries per action

Quality Benchmarks

Planning Performance on ALOHA Tasks

When used for model-based planning with the base policy checkpoint:

Task                      Base Policy Score   With Planning (this checkpoint)   Improvement
put candies in bowl       49.0                60.0                              +11.0
put candy in ziploc bag   70.0                84.0                              +14.0
Average                   59.5                72.0                              +12.5

Results are reported on challenging initial conditions for these two tasks. Planning with this checkpoint makes the policy more likely to avoid errors (e.g., losing its grasp of an object) by selecting higher-quality actions.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Related Resources

Citation

If you use this model, please cite the Cosmos Policy paper:

(Cosmos Policy BibTeX citation coming soon!)
