Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B

Cosmos Policy | Code | White Paper | Website

Model Overview

Description:

Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B is a 2B-parameter refined world model and value function checkpoint fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B on policy rollout data. This checkpoint is designed to be used in conjunction with the base Cosmos Policy checkpoint for model-based planning via best-of-N search, achieving a 12.5 percentage point average score increase on challenging ALOHA manipulation tasks.

Key features:

  • Refined predictions: Fine-tuned on policy rollout data for more accurate world model and value function predictions
  • Dual deployment: Used alongside base Cosmos Policy checkpoint for model-based planning
  • Improved performance: Achieves 12.5 percentage point average score increase on challenging ALOHA tasks when used for planning

Use cases:

  • Model-based planning for bimanual robot manipulation
  • Best-of-N action selection via value-based search
  • Improving policy robustness on high-precision tasks
  • Error avoidance in contact-rich manipulation

This model is for research and development only.

Model Developer: NVIDIA

Model Versions

Cosmos Policy models include the following:

  • Cosmos-Policy-LIBERO-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated LIBERO environments.
  • Cosmos-Policy-RoboCasa-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in simulated RoboCasa environments.
  • Cosmos-Policy-ALOHA-Predict2-2B: Given current state observations and a task description, generate action sequences, future state predictions, and value estimates for robot manipulation in real-world ALOHA robot environments.
  • Cosmos-Policy-ALOHA-Planning-Model-Predict2-2B: Given current state observations, a task description, and action sequences, generate future state predictions and value estimates for robot manipulation in real-world ALOHA robot environments. (This checkpoint is meant to be deployed alongside Cosmos-Policy-ALOHA-Predict2-2B, not independently.)

License:

This model is released under the NVIDIA One-Way Noncommercial License (NSCLv1). For a custom license, please contact cosmos-license@nvidia.com.

Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:

  • Models are not for commercial use.
  • NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

Deployment Geography:

Global

Use Case:

Physical AI: Model-based planning for bimanual robot manipulation in real-world environments, encompassing world modeling and value function prediction for best-of-N action selection.

Release Date:

GitHub [01/22/2026] via https://github.com/nvlabs/cosmos-policy

Hugging Face [01/22/2026] via https://huggingface.co/collections/nvidia/cosmos-policy

Model Architecture:

Architecture Type: A diffusion transformer with latent video diffusion, fine-tuned from Cosmos-Policy-ALOHA-Predict2-2B.

Network Architecture: The model uses the same architecture as Cosmos-Policy-ALOHA-Predict2-2B.

Key adaptation: This checkpoint is specifically optimized for world model and value function predictions through fine-tuning on policy rollout data with emphasis on future state and value prediction accuracy.

Number of model parameters:

2B (inherited from base model)

Input

Input Type(s): Text + Multi-view Images + Proprioceptive State + Action Sequence

Input Format(s):

  • Text: String (natural language task description)
  • Images: RGB images from multiple camera views
  • Proprioception: Numerical array
  • Actions: Numerical array

Input Parameters:

  • Text: One-dimensional (1D) - Task description (e.g., "put candy in ziploc bag")
  • Images: Two-dimensional (2D) - Top-down third-person camera: 224×224 RGB; Left wrist-mounted camera: 224×224 RGB; Right wrist-mounted camera: 224×224 RGB
  • Proprioception: One-dimensional (1D) - 14-dimensional state (7 joint angles per arm)
  • Actions: One-dimensional (1D) - 50-timestep sequence of 14-dimensional actions (for world model and value prediction)

Other Properties Related to Input:

  • Requires specific camera configuration (top-down + two wrist views)
  • Images resized to 224×224 pixels from original resolution
  • Trained exclusively for ALOHA 2 robot platform with two ViperX 300 S robot arms
  • Control frequency: 25 Hz
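As a rough illustration of these input conventions, a planning query might be assembled as follows. The variable names are hypothetical; only the shapes and dtypes come from the specification above:

```python
import numpy as np

# Hypothetical input bundle matching the documented shapes (illustrative only).
task = "put candy in ziploc bag"  # natural-language task description
images = {
    "top": np.zeros((224, 224, 3), dtype=np.uint8),          # top-down third-person view
    "left_wrist": np.zeros((224, 224, 3), dtype=np.uint8),   # left wrist-mounted camera
    "right_wrist": np.zeros((224, 224, 3), dtype=np.uint8),  # right wrist-mounted camera
}
proprio = np.zeros(14, dtype=np.float32)         # 14-dim state (7 joint angles per arm)
actions = np.zeros((50, 14), dtype=np.float32)   # 50-timestep candidate action chunk

assert all(v.shape == (224, 224, 3) for v in images.values())
assert proprio.shape == (14,) and actions.shape == (50, 14)
```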

Output

Output Type(s): Future State Predictions + Value Estimate

Output Format:

  • Future states: Images + Proprioception
  • Value: Scalar

Output Parameters:

  • Future robot proprioception: 14-dimensional state at timestep t+50
  • Future state images: Top-down third-person camera prediction (224×224 RGB), left wrist camera prediction (224×224 RGB), and right wrist camera prediction (224×224 RGB) at timestep t+50
  • Future state value: Expected cumulative reward from future state (scalar)

Other Properties Related to Output:

  • Denoising steps: 10 (configurable without retraining)
  • Noise level range: σ_min = 4.0, σ_max = 80.0
  • Ensemble predictions: 3 world model queries × 5 value function queries per action (15 total value estimates)
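One way to realize the 3 × 5 ensemble above is to average 15 value samples per candidate action. The `world_model` and `value_fn` callables below are placeholders for illustration, not the released API:

```python
import numpy as np

def ensemble_value(world_model, value_fn, obs, action_chunk, n_wm=3, n_vf=5):
    """Average n_wm * n_vf value estimates for one candidate action chunk."""
    values = []
    for _ in range(n_wm):                      # 3 world-model queries
        future_state = world_model(obs, action_chunk)
        for _ in range(n_vf):                  # 5 value queries per predicted state
            values.append(value_fn(future_state))
    return float(np.mean(values))              # 15 estimates total

# Toy stand-ins to show the call pattern:
v = ensemble_value(lambda o, a: o, lambda s: 1.0, obs=0.0, action_chunk=None)
# v averages 15 identical samples, so v == 1.0
```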

Note: While this checkpoint can technically generate actions like the base policy, it is specifically designed and optimized for world model and value function predictions. For action generation, please use Cosmos-Policy-ALOHA-Predict2-2B.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Hopper (e.g., H100)

Note: Inference has only been tested with BF16 precision.

Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Dual Deployment: This checkpoint is designed for dual deployment with Cosmos-Policy-ALOHA-Predict2-2B:

  1. Policy Model (Cosmos-Policy-ALOHA-Predict2-2B): Generates N candidate action chunks
  2. Planning Model (this checkpoint): For each candidate action, predicts future state (world model) and future state value (value function), averaging across ensemble predictions
  3. Selection: Execute the action chunk with the highest predicted value
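The three steps above amount to a best-of-N loop. A minimal sketch, with `sample_actions` and `predict_value` as illustrative names rather than the released API (in practice the N candidates are evaluated in parallel across GPUs):

```python
def plan_best_of_n(policy, planner, obs, n=8):
    """Best-of-N action selection: sample N candidates, return the highest-value one."""
    candidates = [policy.sample_actions(obs) for _ in range(n)]   # step 1: policy model
    values = [planner.predict_value(obs, a) for a in candidates]  # step 2: planning model
    best = max(range(n), key=lambda i: values[i])                 # step 3: selection
    return candidates[best]
```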

Inference Latency: Model-based planning with dual deployment has significantly higher inference latency:

  • Planning mode (dual deployment): ~4.9 seconds per action chunk using 8 parallel H100 GPUs
  • Direct policy mode (base checkpoint only): ~0.95 seconds per action chunk using 1 H100 GPU

Hardware Requirements: Model-based planning requires multiple GPUs for parallelized best-of-N search (8 GPUs recommended for N=8) and sufficient compute for ensemble predictions.

When to Use Planning: Planning is most beneficial for challenging tasks with high precision requirements, situations where avoiding errors is critical, and scenarios where additional compute time is acceptable.

Same warnings as base checkpoint apply: Hardware compatibility, 25 Hz control frequency requirement, and real-world deployment safety considerations. See Cosmos-Policy-ALOHA-Predict2-2B model card for details.

Usage

See Cosmos Policy GitHub for details.

Training and Evaluation:

Training Datasets:

Data Collection Method:

  • ALOHA-Planning-Rollouts: Automated - Policy rollout episodes collected from various methods

Labeling Method:

  • ALOHA-Planning-Rollouts: Automated - Success/failure labels automatically determined through policy execution and environment evaluation

Properties:

Training Data: Policy rollout data

  • 648 policy rollout episodes
  • Includes both successful and failed episodes
  • Covers diverse initial conditions and execution trajectories
  • Enables more accurate modeling of state transitions and value predictions beyond the demonstration distribution

Training Configuration:

  • Base model: Cosmos-Policy-ALOHA-Predict2-2B
  • Training steps: See paper for details
  • Batch split: 10/45/45 for policy/world model/value function (emphasis on world model and value function refinement)
  • GPUs: 8 H100 GPUs
  • Optimization: Full model fine-tuning (all weights updated)

Training Objective: Fine-tuned with increased emphasis on world model and value function training (90% of training batches) to improve future state and value prediction accuracy for more effective planning.
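The 10/45/45 batch split can be realized as a per-batch objective draw. A hedged sketch: the proportions come from the training configuration above, but the sampling mechanism is assumed for illustration:

```python
import random

OBJECTIVES = ["policy", "world_model", "value_function"]
WEIGHTS = [0.10, 0.45, 0.45]  # 10/45/45 batch split from the training config

def sample_objective(rng: random.Random) -> str:
    """Pick the training objective for the next batch."""
    return rng.choices(OBJECTIVES, weights=WEIGHTS, k=1)[0]

# Sanity check: empirical frequencies track the configured split.
rng = random.Random(0)
counts = {o: 0 for o in OBJECTIVES}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
# counts are roughly proportional to 1000 / 4500 / 4500
```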

Evaluation Datasets:

Data Collection Method: Not Applicable

Labeling Method: Not Applicable

Properties: Not Applicable - We use the real-world ALOHA 2 robot platform for direct evaluations.

Inference:

Test Hardware: H100

See Cosmos Policy GitHub for details.

System Requirements and Performance

Inference with model-based planning (dual deployment):

  • 8 H100 GPUs (recommended for N=8 best-of-N search)
  • ~4.9 seconds per action chunk
  • Ensemble predictions: 3 world model × 5 value function queries per action

Quality Benchmarks

Planning Performance on ALOHA Tasks

When used for model-based planning with the base policy checkpoint:

Task                      Base Policy Score   With Planning (this checkpoint)   Improvement
put candies in bowl       49.0                60.0                              +11.0
put candy in ziploc bag   70.0                84.0                              +14.0
Average                   59.5                72.0                              +12.5

Results are reported on challenging initial conditions for these two tasks. Planning with this checkpoint makes the policy more likely to avoid errors (e.g., losing its grasp of an object) by selecting higher-quality actions.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Related Resources

Citation

If you use this model, please cite the Cosmos Policy paper:

(Cosmos Policy BibTeX citation coming soon!)
