# Cosmos 2B Action-Conditioned World Model - LIBERO Spatial
This repository contains a Cosmos 2B DiT model trained on the LIBERO-Spatial dataset for robot manipulation world modeling. The model generates future video predictions conditioned on robot actions.
## Model Details
- Architecture: Cosmos 2B DiT (2048 hidden dim, 28 blocks, 16 heads)
- Training Data: LIBERO-Spatial dataset (500 demonstrations, 10 spatial reasoning tasks)
- Training Iterations: 19,000
- Resolution: 256×320
- Action Conditioning: 8 action steps with stride 3
- Base Model: Cosmos-1.0-Diffusion-7B-Text2World
- Training Duration: ~6.3 hours on A100 80GB
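The 8-step, stride-3 conditioning window can be illustrated with a short index-arithmetic sketch (the array and variable names here are illustrative, not part of the Cosmos API):

```python
import numpy as np

# Illustrative only: how an 8-action, stride-3 conditioning window is
# carved out of a full episode's action sequence.
actions = np.zeros((98, 7))          # ~98 frames per demo, 7D actions
t = 0                                # starting frame index
chunk = actions[t : t + 8 * 3 : 3]   # every 3rd action, 8 in total
assert chunk.shape == (8, 7)         # the window spans 24 frames
```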
## Checkpoint

- Filename: `libero-spatial-2b-19k.pt`
- Size: 11.89 GB
- Format: PyTorch state dict (`.pt`)
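A minimal sketch of the state-dict format (dummy tensors and a hypothetical parameter name, not the real 11.89 GB checkpoint, which loads the same way):

```python
import torch

# Save and reload a tiny dummy state dict to show the .pt format.
# The key name below is hypothetical; real keys follow the Cosmos 2B
# DiT module layout (2048 hidden dim).
dummy = {"blocks.0.attn.qkv.weight": torch.zeros(3 * 2048, 2048)}
torch.save(dummy, "dummy_checkpoint.pt")

state_dict = torch.load("dummy_checkpoint.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```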
## Quick Start

### Installation

```bash
# Clone the Cosmos Predict repository
git clone https://github.com/NVIDIA/Cosmos.git
cd Cosmos/cosmos-predict2.5

# Install dependencies
pip install -e ".[cu128]"
```
### Download Checkpoint

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="tayalmanan/cosmos-robotics",
    filename="libero-spatial-2b-19k.pt"
)
print(f"Checkpoint downloaded to: {checkpoint_path}")
```
### Run Inference

```bash
# Create inference parameters JSON
cat > inference_params.json << EOF
{
  "prompts": [
    {
      "video_path": "path/to/your/input_video.mp4",
      "json_path": "path/to/your/action_annotations.json"
    }
  ],
  "num_input_frames": 1,
  "num_output_frames": 12,
  "height": 256,
  "width": 320,
  "guidance_scale": 7
}
EOF

# Run inference
uv run --extra=cu128 python examples/action_conditioned.py \
  -i inference_params.json \
  --checkpoint-path $(python -c "from huggingface_hub import hf_hub_download; print(hf_hub_download('tayalmanan/cosmos-robotics', 'libero-spatial-2b-19k.pt'))") \
  --experiment ac_libero_reason_embeddings_rectified_flow_2b_256_320 \
  --config-file cosmos_predict2/_src/predict2/action/configs/action_conditioned/config.py \
  -o outputs/predictions
```
## Action Annotation Format

The model expects actions in the following JSON format (the 7-element rows are shown schematically):

```json
{
  "videos": ["path/to/video.mp4"],
  "action_label": [
    [x, y, z, roll, pitch, yaw, gripper],
    [x, y, z, roll, pitch, yaw, gripper],
    ...
  ],
  "episode_metadata": {
    "num_frames": 98,
    "task": "task_description"
  }
}
```
- Action dimensions: 7D (x, y, z, roll, pitch, yaw, gripper)
- Action sequence: 8 consecutive actions with stride 3
- Gripper: Binary value (0 = open, 1 = closed)
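A minimal sketch of producing an annotation file in this format (all values are placeholders, not real robot data):

```python
import json

annotation = {
    "videos": ["path/to/video.mp4"],
    "action_label": [
        # x, y, z, roll, pitch, yaw, gripper (0 = open, 1 = closed)
        [0.01, 0.00, -0.02, 0.0, 0.0, 0.1, 0.0],
        [0.02, 0.00, -0.01, 0.0, 0.0, 0.0, 1.0],
    ],
    "episode_metadata": {"num_frames": 98, "task": "task_description"},
}

with open("action_annotations.json", "w") as f:
    json.dump(annotation, f, indent=2)
```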
## Training Details

### Dataset
- LIBERO-Spatial: 10 spatial reasoning manipulation tasks
- Train: 400 demonstrations
- Val: 100 demonstrations
- Average frames per demo: ~98 frames
### Training Configuration

```python
ExperimentConfig(
    train_dataset="libero_13frame_480_640_train",
    val_dataset="libero_13frame_480_640_val",
    model="action_conditioned_reason_embeddings_rectified_flow_2b_256_320",
    max_iter=19_000,
    save_iter=2_000,
    batch_size=2,
    guidance=7,
    num_workers=16,
)
```
### Scaling from Bridge Dataset

The iteration count was scaled down from the Bridge dataset baseline in proportion to the number of training samples:

```
LIBERO:  500 demos    → 39,200 samples
Bridge:  25,460 demos → 814,720 samples
Scaling factor: 39,200 / 814,720 ≈ 0.048
Target iterations: 400,000 × 0.048 ≈ 19,000
```
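The scaling arithmetic can be reproduced in a few lines (sample counts taken from this card):

```python
# Scale Bridge's iteration budget down by the ratio of training samples.
libero_samples = 39_200    # 500 LIBERO-Spatial demos
bridge_samples = 814_720   # 25,460 Bridge demos
bridge_iters = 400_000

scale = libero_samples / bridge_samples   # ratio of dataset sizes
target = round(bridge_iters * scale, -3)  # round to the nearest 1,000
print(scale, target)                      # → ~0.0481, 19000.0
```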
## Model Configuration

To use the model, add the following configuration entries.

File: `cosmos_predict2/_src/predict2/action/configs/action_conditioned/data.py`

```python
libero_13frame_480_640_train: DatasetConfig = field(
    default_factory=lambda: DatasetConfig(
        name="libero_13frame_480_640_train",
        data_type="cosmos",
        video_folder=f"{libero_base_path}/videos",
        json_folder=f"{libero_base_path}/annotation/train",
        num_video_repeat=1,
        in_len=1,
        future_len=12,
        action_len=8,
        action_stride=3,
        task_token=False,
    )
)
```
File: `cosmos_predict2/experiments/base/action.py`

```python
if name == "ac_libero_reason_embeddings_rectified_flow_2b_256_320":
    config = ExperimentConfig(
        train_dataset="libero_13frame_480_640_train",
        val_dataset="libero_13frame_480_640_val",
        model="action_conditioned_reason_embeddings_rectified_flow_2b_256_320",
        checkpoint="models/Cosmos-1.0-Diffusion-7B-Text2World",
        max_iter=19_000,
        save_iter=2_000,
        every_n_sample=2_500,
        log_iter=25,
        val_iter=100,
        batch_size=2,
        guidance=7,
        num_workers=16,
    )
    return config
```
## Performance

- Memory: ~50 GB GPU memory during inference
- Inference speed: ~5-10 seconds per prediction (1 input + 12 generated frames)
- Training memory: ~70 GB GPU memory with batch_size=2
## Use Cases
This model is designed for:
- Robot manipulation planning: Predict future outcomes given actions
- World model learning: Learn environment dynamics from demonstrations
- Action-conditioned video prediction: Generate videos conditioned on robot actions
- Spatial reasoning tasks: Manipulation tasks requiring spatial understanding
## Limitations
- Trained specifically on LIBERO-Spatial dataset (10 tasks)
- Resolution limited to 256×320
- Requires 7D action space (SE(3) + gripper)
- Single camera viewpoint
- Limited to 12-frame predictions
## Citation
If you use this model, please cite:
```bibtex
@misc{cosmos-libero-2b,
  author = {Tayal, Manan},
  title = {Cosmos 2B Action-Conditioned Model - LIBERO Spatial},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tayalmanan/cosmos-robotics}}
}

@inproceedings{liu2023libero,
  title = {LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author = {Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter},
  booktitle = {NeurIPS 2023 Datasets and Benchmarks Track},
  year = {2023}
}
```
## License
This model is released under the NVIDIA Open Model License.
## Contact
For issues or questions:
- Open an issue on the Cosmos GitHub repository
- Model-specific questions: Create an issue in this repository
## Acknowledgments
This model was trained using NVIDIA's Cosmos framework and the LIBERO-Spatial dataset. Thanks to the LIBERO team for providing the dataset and the NVIDIA team for the Cosmos infrastructure.