Ctrl-World World Model Checkpoint

πŸ“‹ Model Overview

This directory contains pre-trained checkpoints based on the Ctrl-World architecture, trained on the AgiBotWorld-Alpha task_327 dataset. The model is built on the Stable Video Diffusion (SVD) architecture, extended with support for action conditioning and text-instruction conditioning.

The files checkpoint-*.pt correspond to model checkpoints saved at different training steps, where * denotes the step number (e.g., checkpoint-15000.pt was saved at step 15,000).
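Since the step number is embedded in each filename, the latest checkpoint can be selected programmatically. A minimal sketch (the filenames below are illustrative, not an actual directory listing):

```python
import re

def checkpoint_step(path: str) -> int:
    """Extract the training step from a filename like 'checkpoint-15000.pt'."""
    m = re.search(r"checkpoint-(\d+)\.pt$", path)
    if m is None:
        raise ValueError(f"not a checkpoint filename: {path}")
    return int(m.group(1))

# Pick the most recent checkpoint from a list of filenames.
files = ["checkpoint-5000.pt", "checkpoint-15000.pt", "checkpoint-21500.pt"]
latest = max(files, key=checkpoint_step)
print(latest)  # checkpoint-21500.pt
```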

The samples folder contains prediction results on the validation set.

For more technical details, please visit our blog post.

πŸ“¦ Datasets

Please visit AgiBotWorld-Alpha-CtrlWorld-327 to see more details about the datasets.

πŸ—οΈ Model Architecture

Core Components

  • Base Model: Stable Video Diffusion (SVD) - A foundational diffusion model for video generation
  • UNet: Spatio-temporal conditional UNet - Supports frame-level action conditioning
  • Action Encoder: 3-layer fully connected network (1024-dimensional) - Encodes action sequences into feature representations
  • Text Encoder: CLIP Text Encoder - Supports text instruction conditioning
  • VAE: Used for image encoding and decoding

Model Parameters

  • Action Dimension (action_dim): 18

    • Left arm Cartesian position: 7 dimensions
    • Right arm Cartesian position: 7 dimensions
    • Left gripper state: 1 dimension
    • Right gripper state: 1 dimension
    • Left gripper action: 1 dimension
    • Right gripper action: 1 dimension
  • History Frames (num_history): 6

  • Prediction Frames (num_frames): 10

  • Text Conditioning (text_cond): True

  • Frame-level Conditioning (frame_level_cond): True

  • History Condition Zeroing (his_cond_zero): False
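The 18 action dimensions decompose into the components listed above. A minimal sketch of one flat action vector; the exact ordering is an assumption for illustration, not confirmed by the released code:

```python
# Hypothetical flat layout of a single 18-dim action vector.
left_arm_pose  = [0.0] * 7  # left arm Cartesian position (7)
right_arm_pose = [0.0] * 7  # right arm Cartesian position (7)
gripper_fields = [0.0] * 4  # left/right gripper state + left/right gripper action

action = left_arm_pose + right_arm_pose + gripper_fields
print(len(action))  # 18, matching action_dim
```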

Input/Output Specifications

  • Input Image Size: 320 Γ— 192 (single view)
  • Multi-view Support: 3 views (concatenated: 320 Γ— 576)
  • Latent Space Dimension: (4, 72, 40) - where 72 = 24 Γ— 3 (3 views)
  • Frame Rate: 7 FPS
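The latent dimensions follow from the image geometry: the three 320 Γ— 192 views are stacked along the height axis, and SVD's VAE downsamples both spatial dimensions by a factor of 8:

```python
# Sanity-check the stated geometry of the multi-view latent.
view_w, view_h, n_views = 320, 192, 3
frame_w, frame_h = view_w, view_h * n_views      # concatenated frame: 320 x 576
latent_h, latent_w = frame_h // 8, frame_w // 8  # VAE downsamples by 8
print((latent_h, latent_w))  # (72, 40) -> latent shape (4, 72, 40)
```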

βš™οΈ Inference Configuration

This model can be used with our inference code, which is available for download on GitHub.

Inference Hyperparameters

  • Inference Steps (num_inference_steps): 50
  • Guidance Scale (guidance_scale): 2.0
  • Motion Bucket ID (motion_bucket_id): 127
  • Frame Rate (fps): 7
  • Decode Chunk Size (decode_chunk_size): 7
  • Data Type: bfloat16 (recommended for inference to accelerate computation and save memory)

Usage Example

from models.ctrl_world import CtrlWorld
import torch

# Initialize model
model = CtrlWorld(
    svd_model_path="/path/to/stable-video-diffusion-img2vid",
    clip_model_path="/path/to/clip-vit-base-patch32",
    action_dim=18,
    num_history=6,
    num_frames=10,
    text_cond=True,
    motion_bucket_id=127,
    fps=7,
    his_cond_zero=False,
    frame_level_cond=True
)

# Load checkpoint (the released file is a raw state_dict)
checkpoint_path = "model_ckpt/task_327/checkpoint-21500.pt"
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)
model = model.to(torch.bfloat16)  # bfloat16 recommended for inference
model.eval()

# Inference
with torch.no_grad():
    latents = model.generate(
        image=image_cond,      # Conditional image (1, 4, 72, 40)
        action=action_cond,    # Action sequence (1, 16, 18)
        text=["instruction"],  # Text instruction (optional)
        history=his_cond,      # History frames (1, 6, 4, 72, 40)
        num_frames=10,
        num_inference_steps=50,
        guidance_scale=2.0,
        fps=7,
        motion_bucket_id=127
    )

πŸ’Ύ Checkpoint Structure

The checkpoint file is a PyTorch state_dict containing approximately 2,525 parameter tensors, primarily:

  • unet.*: Parameters of the UNet diffusion model
  • action_encoder.*: Parameters of the action encoder

Note: Parameters of the VAE and CLIP encoder are not saved in the checkpoint, as they use frozen pretrained weights.
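The key prefixes above can be verified by grouping the state_dict keys by their top-level module name. A sketch using a toy dict (a real checkpoint has ~2,525 entries; these keys are illustrative only):

```python
from collections import Counter

def prefix_counts(state_dict):
    """Count state_dict entries by top-level module name."""
    return Counter(key.split(".", 1)[0] for key in state_dict)

# Illustrative keys only -- not the actual checkpoint contents.
fake_keys = {
    "unet.conv_in.weight": None,
    "unet.conv_in.bias": None,
    "action_encoder.fc1.weight": None,
}
print(prefix_counts(fake_keys))  # Counter({'unet': 2, 'action_encoder': 1})
```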

πŸ”§ Dependencies

Required Dependencies

  • PyTorch >= 1.12.0
  • diffusers (Stable Video Diffusion)
  • transformers (CLIP)
  • accelerate
  • einops
  • decord (video reading)
  • mediapy (video saving)

Pretrained Models

Using this checkpoint depends on the structure of the following pretrained models:

  1. Stable Video Diffusion:

    • Path: stable-video-diffusion-img2vid-config-path
    • Or download from HuggingFace: stabilityai/stable-video-diffusion-img2vid
  2. CLIP Text Encoder:

    • Path: clip-vit-base-patch32-config-path
    • Or download from HuggingFace: openai/clip-vit-base-patch32

πŸ“ˆ Performance Metrics

The model was trained on the task_327 dataset and can predict multi-view robotic manipulation videos. The model supports:

  • βœ… Multi-view video prediction (3 views)
  • βœ… Action-conditioned control
  • βœ… Text instruction conditioning
  • βœ… Long-horizon prediction (via rolling prediction)
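Rolling prediction extends the horizon by feeding the most recent predicted frames back in as history for the next call. A minimal sketch of the bookkeeping, with a dummy stand-in for model.generate (real frames are latent tensors, and the actual rollout API is not specified here):

```python
from collections import deque

num_history, num_frames = 6, 10

def generate(history):
    # Dummy stand-in for model.generate: returns the next num_frames frame indices.
    start = history[-1] + 1
    return list(range(start, start + num_frames))

# The deque keeps only the most recent num_history frames as conditioning.
history = deque(range(num_history), maxlen=num_history)  # frames 0..5
for _ in range(3):
    predicted = generate(list(history))
    history.extend(predicted)  # oldest frames fall out automatically

print(list(history))  # [30, 31, 32, 33, 34, 35]
```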

πŸ€— Acknowledgements

We would like to express our gratitude to the following projects and teams:

  • Stable Video Diffusion (SVD): This model is built upon the Stable Video Diffusion architecture developed by Stability AI. We thank the Stability AI team for their excellent work on video generation with diffusion models.

  • Ctrl-World: We acknowledge the Ctrl-World team for their pioneering work on controllable generative world models for robot manipulation.


πŸ“„ License

MIT License

Copyright (c) 2026 Pyromind Dynamics

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

πŸ’¬ Contact

For questions or suggestions, please open an issue on the project's Issues page.
