Update README.md

README.md
CHANGED

@@ -1,146 +1,203 @@

# AgiBotWorld-Alpha-CtrlWorld-327

## Table of Contents

- [Get Started](#get-started)
- [Download the Dataset](#download-the-dataset)
- [Data Structure](#data-structure)
- [Explanation of Proprioceptive State](#explanation-of-proprioceptive-state)
- [State and Action](#state-and-action)
- [Common Fields](#common-fields)
- [Value Shapes and Ranges](#value-shapes-and-ranges)
- [Acknowledgements](#acknowledgements)
- [License](#license)
- [Contact](#contact)

## Get Started

### Download the Dataset
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# When prompted for a password, use an access token with write permissions.
# Generate one from your settings: https://huggingface.co/settings/tokens
git clone https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327

# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327
```
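
Alternatively, the same repository can be fetched from Python with the `huggingface_hub` client. This is a minimal sketch, assuming `huggingface_hub` is installed and that you are logged in if the dataset requires authentication; the local directory name is just an example.

```python
from huggingface_hub import snapshot_download

# Download (or resume downloading) the dataset repository into a local folder
snapshot_download(
    repo_id="pyromind/AgiBotWorld-Alpha-CtrlWorld-327",
    repo_type="dataset",
    local_dir="AgiBotWorld-Alpha-CtrlWorld-327",  # example target directory
)
```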

## Data Structure
```
task_327/
├── annotation/
│   ├── train/            # Training episode annotations (207 episodes)
│   │   ├── 0.json
│   │   ├── 1.json
│   │   └── ...
│   └── val/              # Validation episode annotations (2 episodes)
│       ├── 99.json
│       └── 199.json
├── latent_videos/
│   ├── train/            # Pre-encoded latent video representations
│   │   ├── 0/
│   │   │   ├── 0.pt
│   │   │   ├── 1.pt
│   │   │   └── 2.pt
│   │   └── ...
│   └── val/
│       └── ...
├── videos/
│   ├── train/            # Original video files
│   └── val/
└── stat.json             # Dataset statistics
```
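
To make the layout concrete, here is a small sketch (not part of the dataset tooling) that walks the latent and video files of one training episode. The dataset root and episode id are placeholders, and it assumes each `.pt` file stores a single tensor, as described in the shape notes further down.

```python
from pathlib import Path

import torch

root = Path("task_327")  # placeholder: path to your local copy of the dataset
episode_id = 0           # placeholder: any episode id from the train split

# Latent segments for one episode are stored as .pt tensors
latent_dir = root / "latent_videos" / "train" / str(episode_id)
for latent_file in sorted(latent_dir.glob("*.pt")):
    latent = torch.load(latent_file, map_location="cpu")
    print(latent_file.name, tuple(latent.shape))

# The corresponding raw video segments live under videos/<split>/<episode_id>/
print(sorted((root / "videos" / "train" / str(episode_id)).glob("*.mp4")))
```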

## Explanation of Proprioceptive State

## State and Action

### State

- **End-effector orientation**: quaternion (4 values) describing the end-effector orientation
- **End-effector position**: 3D Cartesian coordinates (x, y, z) of the end-effector
- **Effector position**: additional 3D Cartesian coordinates (x, y, z) for the effector

### Action

- Actions control the effector position in the environment

### Common Fields

Each annotation JSON file contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `texts` | `List[str]` | Task description and initial scene description |
| `episode_id` | `int` | Unique identifier for the episode |
| `success` | `int` | Binary indicator (0 or 1) of whether the episode was successful |
| `video_length` | `int` | Number of frames in the processed video |
| `raw_length` | `int` | Number of frames in the original raw video |
| `state_columns` | `List[str]` | Column names for state components: `['observation.states.end.orientation', 'observation.states.end.position', 'observation.states.effector.position']` |
| `action_columns` | `List[str]` | Column names for action components: `['actions.effector.position']` |
| `states` | `List[List[float]]` | Array of state vectors, one per timestep. Each state is a 16-dimensional vector |
| `actions` | `List[List[float]]` | Array of action vectors, one per timestep. Each action is a 2-dimensional vector |
| `videos` | `List[Dict]` | List of video file paths, e.g., `[{'video_path': 'videos/train/{episode_id}/{segment_id}.mp4'}]` |
| `latent_videos` | `List[str]` | List of paths to pre-encoded latent video representations |
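
As an illustration of these fields, a short sketch that prints a few of them from one annotation file; the file path is a placeholder.

```python
import json

# Placeholder path to one episode annotation; adjust to your local copy
with open("task_327/annotation/train/0.json") as f:
    episode = json.load(f)

print(episode["texts"])                                      # task and scene descriptions
print(episode["episode_id"], episode["success"])             # id and success flag
print(len(episode["states"]), len(episode["states"][0]))     # T x 16 state array
print(len(episode["actions"]), len(episode["actions"][0]))   # T x 2 action array
print(episode["videos"][0]["video_path"])                    # relative path to an mp4 segment
```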

### Value Shapes and Ranges

**States**:

- **Shape**: `[T, 16]`, where T is the number of timesteps
- Orientation (quaternion): 4 values, typically in range [-1.0, 1.0]
- Positions: 6 values (3D end position + 3D effector position), typically in range [-1.0, 1.0] for normalized coordinates, or larger ranges for absolute positions (e.g., up to ~97.0)
- **Overall range**: approximately [-0.86, 97.29] (values may vary depending on normalization)

**Actions**:

- **Shape**: `[T, 2]`, where T is the number of timesteps
- **Range**: [0.0, 1.0] (normalized control values)

**Latent videos**:

- **Format**: PyTorch tensor files (`.pt`)
- **Shape**: `[T, 4, 24, 40]`, where:
  - `T`: number of frames (matches `video_length`)
  - `4`: number of channels
  - `24`: height dimension
  - `40`: width dimension

The `stat.json` file contains sample state and action values at the 1st and 99th percentiles, which can be used for normalization or data analysis.
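
For example, here is a hedged sketch of percentile-based normalization driven by `stat.json`. The key names inside `stat.json` (`states`, `p1`, `p99`) are assumptions for illustration; check the file for the actual field names.

```python
import json

import numpy as np

with open("task_327/stat.json") as f:  # placeholder path
    stat = json.load(f)

# Assumed layout: per-dimension 1st and 99th percentile values for states
low = np.asarray(stat["states"]["p1"], dtype=np.float32)
high = np.asarray(stat["states"]["p99"], dtype=np.float32)

def normalize_states(states: np.ndarray) -> np.ndarray:
    """Scale raw states to roughly [-1, 1] using the 1st/99th percentiles."""
    return 2.0 * (states - low) / np.maximum(high - low, 1e-8) - 1.0
```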

## Acknowledgements

We would like to express our gratitude to the following projects and teams:

- **[

---

## License

## Contact

# Ctrl-World World Model Checkpoint



## Model Overview

This directory contains pre-trained checkpoints based on the **Ctrl-World** architecture, trained on task 327 of the **AgiBotWorld-Alpha** dataset. The model is built upon the Stable Video Diffusion (SVD) architecture, with added support for action conditioning and text-instruction conditioning.

The files `checkpoint-*.pt` correspond to model checkpoints saved at different training steps, where `*` denotes the step number (e.g., `checkpoint-15000.pt` was saved at step 15,000).

The `samples` folder contains validation results on the validation set.

For more technical details, please [visit our blog post](https://pyromind.ai/blog/world-model/build-ctrl-world).

## Table of Contents

- [Model Overview](#model-overview)
- [Datasets](#datasets)
- [Model Architecture](#model-architecture)
- [Core Components](#core-components)
- [Model Parameters](#model-parameters)
- [Input/Output Specifications](#inputoutput-specifications)
- [Inference Configuration](#inference-configuration)
- [Inference Hyperparameters](#inference-hyperparameters)
- [Usage Example](#usage-example)
- [Checkpoint Structure](#checkpoint-structure)
- [Dependencies](#dependencies)
- [Performance Metrics](#performance-metrics)
- [Acknowledgements](#acknowledgements)
- [License](#license)
- [Contact](#contact)

## Datasets

Please visit [AgiBotWorld-Alpha-CtrlWorld-327](https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327/tree/main) for more details about the dataset.

## Model Architecture

### Core Components

- **Base Model**: Stable Video Diffusion (SVD), a foundational diffusion model for video generation
- **UNet**: spatio-temporal conditional UNet with support for frame-level action conditioning
- **Action Encoder**: 3-layer fully connected network (1024-dimensional) that encodes action sequences into feature representations
- **Text Encoder**: CLIP text encoder for text instruction conditioning
- **VAE**: used for image encoding and decoding

### Model Parameters

- **Action Dimension (action_dim)**: 18 (see the layout sketch after this list)
  - Left arm Cartesian position: 7 dimensions
  - Right arm Cartesian position: 7 dimensions
  - Left gripper state: 1 dimension
  - Right gripper state: 1 dimension
  - Left gripper action: 1 dimension
  - Right gripper action: 1 dimension
- **History Frames (num_history)**: 6
- **Prediction Frames (num_frames)**: 10
- **Text Conditioning (text_cond)**: True
- **Frame-level Conditioning (frame_level_cond)**: True
- **History Condition Zeroing (his_cond_zero)**: False
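
A sketch of how an 18-dimensional per-frame action could be assembled from this breakdown. The ordering of the components and the helper name are assumptions for illustration, not taken from the training code.

```python
import torch

def pack_action(left_arm_pose, right_arm_pose,
                left_grip_state, right_grip_state,
                left_grip_action, right_grip_action):
    """Concatenate the documented components into one 18-dim action vector.

    left_arm_pose / right_arm_pose: 7-dim Cartesian poses;
    the four gripper terms are scalars (1 dimension each).
    NOTE: the component order here is an assumption for illustration only.
    """
    return torch.cat([
        left_arm_pose.reshape(7),
        right_arm_pose.reshape(7),
        left_grip_state.reshape(1),
        right_grip_state.reshape(1),
        left_grip_action.reshape(1),
        right_grip_action.reshape(1),
    ])

# Example call with zero placeholders (shapes only)
a = pack_action(torch.zeros(7), torch.zeros(7), torch.zeros(1),
                torch.zeros(1), torch.zeros(1), torch.zeros(1))
assert a.shape == (18,)
```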

### Input/Output Specifications

- **Input Image Size**: 320 × 192 (single view)
- **Multi-view Support**: 3 views (concatenated: 320 × 576)
- **Latent Space Dimension**: (4, 72, 40), where 72 = 24 × 3 (3 views)
- **Frame Rate**: 7 FPS

## Inference Configuration

This model can be used with our inference code, which is available for [download on GitHub](https://github.com/PyroMind-Dynamics/WorldModelInference).

### Inference Hyperparameters

- **Inference Steps (num_inference_steps)**: 50
- **Guidance Scale (guidance_scale)**: 2.0
- **Motion Bucket ID (motion_bucket_id)**: 127
- **Frame Rate (fps)**: 7
- **Decode Chunk Size (decode_chunk_size)**: 7
- **Data Type**: bfloat16 (recommended for inference to accelerate computation and save memory)

### Usage Example
```python
from models.ctrl_world import CtrlWorld
import torch

# Initialize model
model = CtrlWorld(
    svd_model_path="/path/to/stable-video-diffusion-img2vid",
    clip_model_path="/path/to/clip-vit-base-patch32",
    action_dim=18,
    num_history=6,
    num_frames=10,
    text_cond=True,
    motion_bucket_id=127,
    fps=7,
    his_cond_zero=False,
    frame_level_cond=True
)

# Load checkpoint
checkpoint_path = "model_ckpt/task_327/checkpoint-21500.pt"
state_dict = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(state_dict)
model.eval()

# Inference
with torch.no_grad():
    latents = model.generate(
        image=image_cond,       # Conditional image (1, 4, 72, 40)
        action=action_cond,     # Action sequence (1, 16, 18)
        text=["instruction"],   # Text instruction (optional)
        history=his_cond,       # History frames (1, 6, 4, 72, 40)
        num_frames=10,
        num_inference_steps=50,
        guidance_scale=2.0,
        fps=7,
        motion_bucket_id=127
    )
```
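
The conditioning tensors `image_cond`, `action_cond`, and `his_cond` above are not constructed in this snippet; in practice they come from VAE-encoded camera views and the robot's action trajectory. As a shape-level smoke test only, dummy tensors with the documented shapes could look like the following (the recommendation above is to cast model and inputs to bfloat16 for faster inference):

```python
import torch

# Dummy conditioning inputs with the documented shapes (illustration only)
image_cond = torch.zeros(1, 4, 72, 40)      # current latent frame
his_cond = torch.zeros(1, 6, 4, 72, 40)     # 6 history latent frames
action_cond = torch.zeros(1, 16, 18)        # 16 timesteps x 18-dim actions
```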

## Checkpoint Structure

The checkpoint file is a PyTorch state_dict containing approximately 2,525 parameter groups, primarily including:

- `unet.*`: Parameters of the UNet diffusion model
- `action_encoder.*`: Parameters of the action encoder

**Note**: Parameters of the VAE and CLIP encoder are not saved in the checkpoint, as they use frozen pretrained weights.
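
A small sketch for inspecting a checkpoint before loading it, assuming the file stores a flat state_dict of tensors as described above; the path is a placeholder.

```python
import torch

state_dict = torch.load("model_ckpt/task_327/checkpoint-21500.pt", map_location="cpu")

print(len(state_dict), "parameter groups")  # roughly 2,525 per the note above
unet_keys = [k for k in state_dict if k.startswith("unet.")]
enc_keys = [k for k in state_dict if k.startswith("action_encoder.")]
print(len(unet_keys), "unet.* tensors,", len(enc_keys), "action_encoder.* tensors")
print(f"{sum(v.numel() for v in state_dict.values()) / 1e6:.1f}M parameters")
```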

## Dependencies

### Required Dependencies

- PyTorch >= 1.12.0
- diffusers (Stable Video Diffusion)
- transformers (CLIP)
- accelerate
- einops
- decord (video reading)
- mediapy (video saving)

### Pretrained Models

Using this checkpoint also requires the following pretrained models, whose frozen weights and configurations are loaded separately:

1. **Stable Video Diffusion**:
   - Path: `stable-video-diffusion-img2vid-config-path`
   - Or download from HuggingFace: `stabilityai/stable-video-diffusion-img2vid`

2. **CLIP Text Encoder**:
   - Path: `clip-vit-base-patch32-config-path`
   - Or download from HuggingFace: `openai/clip-vit-base-patch32`
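
If local copies are not available, here is a hedged sketch of fetching the two backbones with standard `diffusers` and `transformers` APIs; the SVD weights may require accepting the model terms on Hugging Face first.

```python
from diffusers import StableVideoDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer

# Downloads (or reuses a cached copy of) the SVD backbone referenced above
svd = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid"
)

# CLIP text encoder and tokenizer used for instruction conditioning
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
```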

## Performance Metrics

The model was trained on the task_327 dataset and can predict multi-view robotic manipulation videos. The model supports:

- Multi-view video prediction (3 views)
- Action-conditioned control
- Text instruction conditioning
- Long-horizon prediction (via rolling prediction; see the sketch below)
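
A hedged sketch of such a rolling (autoregressive) prediction loop, reusing `model.generate` from the usage example. The return shape of `generate`, the way history is re-fed, and the `action_chunks` list of per-window action tensors are assumptions here; the official inference code may handle them differently.

```python
import torch

predicted = []
history = his_cond      # (1, 6, 4, 72, 40), as in the usage example
current = image_cond    # (1, 4, 72, 40)

with torch.no_grad():
    for step in range(4):  # e.g., 4 windows of 10 frames = 40 predicted frames
        latents = model.generate(
            image=current,
            action=action_chunks[step],  # (1, 16, 18) actions for this window (placeholder)
            text=["instruction"],
            history=history,
            num_frames=10,
            num_inference_steps=50,
            guidance_scale=2.0,
            fps=7,
            motion_bucket_id=127,
        )
        predicted.append(latents)
        # Assumes latents has shape (1, 10, 4, 72, 40): keep the last 6 frames
        # as the new history and the final frame as the next conditioning image.
        history = latents[:, -6:]
        current = latents[:, -1]
```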

## Acknowledgements

We would like to express our gratitude to the following projects and teams:

- **[Stable Video Diffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid)**: This model is built upon the Stable Video Diffusion architecture developed by Stability AI. We thank the Stability AI team for their excellent work on video generation with diffusion models.

- **[Ctrl-World](https://github.com/Robert-gyj/Ctrl-World)**: We acknowledge the Ctrl-World team for their pioneering work on controllable generative world models for robot manipulation.

---

## License

```
MIT License

Copyright (c) 2026 Pyromind Dynamics

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

## Contact