# Cosmos 2B Action-Conditioned World Model - LIBERO Spatial
This repository contains a Cosmos 2B DiT model trained on the LIBERO-Spatial dataset for robot manipulation world modeling. The model generates future video predictions conditioned on robot actions.
## Model Details
- Architecture: Cosmos 2B DiT (2048 hidden dim, 28 blocks, 16 heads)
- Training Data: LIBERO-Spatial dataset (500 demonstrations, 10 spatial reasoning tasks)
- Training Iterations: 19,000
- Resolution: 256×320
- Action Conditioning: 8 action steps with stride 3
- Base Model: Cosmos-1.0-Diffusion-7B-Text2World
- Training Duration: ~6.3 hours on A100 80GB
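The 8-step, stride-3 conditioning window can be illustrated with a short index-arithmetic sketch (the array and variable names here are illustrative, not part of the Cosmos API):

```python
import numpy as np

# Illustrative only: how an 8-action, stride-3 conditioning window is
# carved out of a full episode's action sequence.
actions = np.zeros((98, 7))          # ~98 frames per demo, 7D actions
t = 0                                # starting frame index
chunk = actions[t : t + 8 * 3 : 3]   # every 3rd action, 8 in total
assert chunk.shape == (8, 7)         # the window spans 24 frames
```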
## Checkpoint

- Filename: `libero-spatial-2b-19k.pt`
- Size: 11.89 GB
- Format: PyTorch state dict (`.pt`)
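A minimal sketch of the state-dict format (dummy tensors and a hypothetical parameter name, not the real 11.89 GB checkpoint, which loads the same way):

```python
import torch

# Save and reload a tiny dummy state dict to show the .pt format.
# The key name below is hypothetical; real keys follow the Cosmos 2B
# DiT module layout (2048 hidden dim).
dummy = {"blocks.0.attn.qkv.weight": torch.zeros(3 * 2048, 2048)}
torch.save(dummy, "dummy_checkpoint.pt")

state_dict = torch.load("dummy_checkpoint.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```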
## Quick Start

### Installation

```bash
# Clone the Cosmos Predict repository
git clone https://github.com/NVIDIA/Cosmos.git
cd Cosmos/cosmos-predict2.5

# Install dependencies
pip install -e ".[cu128]"
```
### Download Checkpoint

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="tayalmanan/cosmos-robotics",
    filename="libero-spatial-2b-19k.pt"
)
print(f"Checkpoint downloaded to: {checkpoint_path}")
```
### Run Inference

```bash
# Create inference parameters JSON
cat > inference_params.json << EOF
{
  "prompts": [
    {
      "video_path": "path/to/your/input_video.mp4",
      "json_path": "path/to/your/action_annotations.json"
    }
  ],
  "num_input_frames": 1,
  "num_output_frames": 12,
  "height": 256,
  "width": 320,
  "guidance_scale": 7
}
EOF

# Run inference
uv run --extra=cu128 python examples/action_conditioned.py \
  -i inference_params.json \
  --checkpoint-path $(python -c "from huggingface_hub import hf_hub_download; print(hf_hub_download('tayalmanan/cosmos-robotics', 'libero-spatial-2b-19k.pt'))") \
  --experiment ac_libero_reason_embeddings_rectified_flow_2b_256_320 \
  --config-file cosmos_predict2/_src/predict2/action/configs/action_conditioned/config.py \
  -o outputs/predictions
```
## Action Annotation Format

The model expects actions in the following JSON format (the 7-element rows are shown schematically):

```json
{
  "videos": ["path/to/video.mp4"],
  "action_label": [
    [x, y, z, roll, pitch, yaw, gripper],
    [x, y, z, roll, pitch, yaw, gripper],
    ...
  ],
  "episode_metadata": {
    "num_frames": 98,
    "task": "task_description"
  }
}
```
- Action dimensions: 7D (x, y, z, roll, pitch, yaw, gripper)
- Action sequence: 8 consecutive actions with stride 3
- Gripper: Binary value (0 = open, 1 = closed)
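A minimal sketch of producing an annotation file in this format (all values are placeholders, not real robot data):

```python
import json

annotation = {
    "videos": ["path/to/video.mp4"],
    "action_label": [
        # x, y, z, roll, pitch, yaw, gripper (0 = open, 1 = closed)
        [0.01, 0.00, -0.02, 0.0, 0.0, 0.1, 0.0],
        [0.02, 0.00, -0.01, 0.0, 0.0, 0.0, 1.0],
    ],
    "episode_metadata": {"num_frames": 98, "task": "task_description"},
}

with open("action_annotations.json", "w") as f:
    json.dump(annotation, f, indent=2)
```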
## Training Details

### Dataset
- LIBERO-Spatial: 10 spatial reasoning manipulation tasks
- Train: 400 demonstrations
- Val: 100 demonstrations
- Average frames per demo: ~98 frames
### Training Configuration

```python
ExperimentConfig(
    train_dataset="libero_13frame_480_640_train",
    val_dataset="libero_13frame_480_640_val",
    model="action_conditioned_reason_embeddings_rectified_flow_2b_256_320",
    max_iter=19_000,
    save_iter=2_000,
    batch_size=2,
    guidance=7,
    num_workers=16,
)
```
### Scaling from Bridge Dataset

The iteration count was scaled down from the Bridge dataset baseline in proportion to the number of training samples:

```
LIBERO:  500 demos    → 39,200 samples
Bridge:  25,460 demos → 814,720 samples
Scaling factor: 39,200 / 814,720 ≈ 0.048
Target iterations: 400,000 × 0.048 ≈ 19,000
```
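The scaling arithmetic can be reproduced in a few lines (sample counts taken from this card):

```python
# Scale Bridge's iteration budget down by the ratio of training samples.
libero_samples = 39_200    # 500 LIBERO-Spatial demos
bridge_samples = 814_720   # 25,460 Bridge demos
bridge_iters = 400_000

scale = libero_samples / bridge_samples   # ratio of dataset sizes
target = round(bridge_iters * scale, -3)  # round to the nearest 1,000
print(scale, target)                      # → ~0.0481, 19000.0
```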
## Model Configuration

To use the model, add the following configuration entries.

File: `cosmos_predict2/_src/predict2/action/configs/action_conditioned/data.py`

```python
libero_13frame_480_640_train: DatasetConfig = field(
    default_factory=lambda: DatasetConfig(
        name="libero_13frame_480_640_train",
        data_type="cosmos",
        video_folder=f"{libero_base_path}/videos",
        json_folder=f"{libero_base_path}/annotation/train",
        num_video_repeat=1,
        in_len=1,
        future_len=12,
        action_len=8,
        action_stride=3,
        task_token=False,
    )
)
```
File: `cosmos_predict2/experiments/base/action.py`

```python
if name == "ac_libero_reason_embeddings_rectified_flow_2b_256_320":
    config = ExperimentConfig(
        train_dataset="libero_13frame_480_640_train",
        val_dataset="libero_13frame_480_640_val",
        model="action_conditioned_reason_embeddings_rectified_flow_2b_256_320",
        checkpoint="models/Cosmos-1.0-Diffusion-7B-Text2World",
        max_iter=19_000,
        save_iter=2_000,
        every_n_sample=2_500,
        log_iter=25,
        val_iter=100,
        batch_size=2,
        guidance=7,
        num_workers=16,
    )
    return config
```
## Performance

- Memory: ~50 GB GPU memory during inference
- Inference speed: ~5-10 seconds per prediction (1 input + 12 generated frames)
- Training memory: ~70 GB GPU memory with batch_size=2
## Use Cases
This model is designed for:
- Robot manipulation planning: Predict future outcomes given actions
- World model learning: Learn environment dynamics from demonstrations
- Action-conditioned video prediction: Generate videos conditioned on robot actions
- Spatial reasoning tasks: Manipulation tasks requiring spatial understanding
## Limitations
- Trained specifically on LIBERO-Spatial dataset (10 tasks)
- Resolution limited to 256×320
- Requires 7D action space (SE(3) + gripper)
- Single camera viewpoint
- Limited to 12-frame predictions
## Citation
If you use this model, please cite:
```bibtex
@misc{cosmos-libero-2b,
  author = {Tayal, Manan},
  title = {Cosmos 2B Action-Conditioned Model - LIBERO Spatial},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tayalmanan/cosmos-robotics}}
}

@inproceedings{liu2023libero,
  title = {LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author = {Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter},
  booktitle = {NeurIPS 2023 Datasets and Benchmarks Track},
  year = {2023}
}
```
## License
This model is released under the NVIDIA Open Model License.
## Contact
For issues or questions:
- Open an issue on the Cosmos GitHub repository
- Model-specific questions: Create an issue in this repository
## Acknowledgments
This model was trained using NVIDIA's Cosmos framework and the LIBERO-Spatial dataset. Thanks to the LIBERO team for providing the dataset and the NVIDIA team for the Cosmos infrastructure.