niqi-lyu committed on
Commit 52c3724 · verified · 1 Parent(s): 1c718fe

Update README.md

Files changed (1): README.md +167 -110
README.md CHANGED
@@ -1,146 +1,203 @@
- # AgiBotWorld-Alpha-CtrlWorld-327 Dataset
-
- > **⚠️ Important**: This dataset is processed from task_327 in AgiBot-World-Alpha. For more details, please refer to the [Acknowledgements](#-acknowledgements) section.
-
- ## 📑 Table of Contents
-
- - [🚀 Get Started](#🚀-get-started)
-   - [Download the Dataset](#download-the-dataset)
- - [📁 Data Structure](#📁-data-structure)
- - [📊 Explanation of Proprioceptive State](#📊-explanation-of-proprioceptive-state)
-   - [State and Action](#state-and-action)
-   - [Common Fields](#common-fields)
-   - [Value Shapes and Ranges](#value-shapes-and-ranges)
- - [🤗 Acknowledgements](#🤗-acknowledgements)
- - [📄 License](#📄-license)
- - [💬 Contact](#💬-contact)
-
- ## 🚀 Get Started
-
- ### Download the Dataset
-
- To download the full dataset, use the following commands. If you encounter any issues, please refer to the [official Hugging Face documentation](https://huggingface.co/docs/hub/datasets-downloading).
-
- ```
- # Make sure you have git-lfs installed (https://git-lfs.com)
- git lfs install
-
- # When prompted for a password, use an access token with write permissions.
- # Generate one from your settings: https://huggingface.co/settings/tokens
- git clone https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327
-
- # To clone without large files (just their pointers):
- GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327
- ```
-
- The data has already been pre-processed and can be used directly with our [inference code](https://github.com/PyroMind-Dynamics/WorldModelInference).
-
- ## 📁 Data Structure
-
- The dataset is organized as follows:
  ```
- task_327/
- ├── annotation/
- │   ├── train/            # Training episode annotations (207 episodes)
- │   │   ├── 0.json
- │   │   ├── 1.json
- │   │   └── ...
- │   └── val/              # Validation episode annotations (2 episodes)
- │       ├── 99.json
- │       └── 199.json
- ├── latent_videos/
- │   ├── train/            # Pre-encoded latent video representations
- │   │   ├── 0/
- │   │   │   ├── 0.pt
- │   │   │   ├── 1.pt
- │   │   │   └── 2.pt
- │   │   └── ...
- │   └── val/
- │       └── ...
- ├── videos/
- │   ├── train/            # Original video files
- │   └── val/
- └── stat.json             # Dataset statistics
- ```
-
- ## 📊 Explanation of Proprioceptive State
-
- ### State and Action
-
- #### State
-
- The **state** represents the proprioceptive observations of the robot at each timestep. It includes:
-
- - **End-effector orientation**: 4D quaternion (w, x, y, z) describing the orientation of the robot's end-effector
- - **End-effector position**: 3D Cartesian coordinates (x, y, z) of the end-effector
- - **Effector position**: 3D Cartesian coordinates (x, y, z) providing additional effector position information
-
- The state vector has a dimension of 16, combining all these proprioceptive measurements.
-
- #### Action
-
- The **action** represents the control commands sent to the robot. In this dataset:
-
- - Actions are 2-dimensional vectors
- - Actions control the effector position in the environment
-
- ### Common Fields
-
- Each annotation JSON file (`{episode_id}.json`) contains the following fields:
-
- | Field | Type | Description |
- |-------|------|-------------|
- | `texts` | `List[str]` | Task description and initial scene description |
- | `episode_id` | `int` | Unique identifier for the episode |
- | `success` | `int` | Binary indicator (0 or 1) of whether the episode was successful |
- | `video_length` | `int` | Number of frames in the processed video |
- | `raw_length` | `int` | Number of frames in the original raw video |
- | `state_columns` | `List[str]` | Column names for state components: `['observation.states.end.orientation', 'observation.states.end.position', 'observation.states.effector.position']` |
- | `action_columns` | `List[str]` | Column names for action components: `['actions.effector.position']` |
- | `states` | `List[List[float]]` | Array of state vectors, one per timestep. Each state is a 16-dimensional vector |
- | `actions` | `List[List[float]]` | Array of action vectors, one per timestep. Each action is a 2-dimensional vector |
- | `videos` | `List[Dict]` | List of video file paths, e.g., `[{'video_path': 'videos/train/{episode_id}/{segment_id}.mp4'}]` |
- | `latent_videos` | `List[str]` | List of paths to pre-encoded latent video representations |
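For illustration, the per-episode consistency checks implied by these fields can be sketched as follows; the inline dictionary is a stand-in for an actual annotation file such as `task_327/annotation/train/0.json`, with placeholder values.

```python
# Stand-in for: record = json.load(open("task_327/annotation/train/0.json"))
# Field layout follows the table above; the values here are illustrative only.
record = {
    "episode_id": 0,
    "success": 1,
    "video_length": 3,
    "raw_length": 3,
    "state_columns": [
        "observation.states.end.orientation",
        "observation.states.end.position",
        "observation.states.effector.position",
    ],
    "action_columns": ["actions.effector.position"],
    "states": [[0.0] * 16 for _ in range(3)],   # one 16-dim state per timestep
    "actions": [[0.0, 1.0] for _ in range(3)],  # one 2-dim action per timestep
}

assert len(record["states"]) == record["video_length"]
assert all(len(s) == 16 for s in record["states"])
assert all(len(a) == 2 for a in record["actions"])
```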
 
-
- ### Value Shapes and Ranges
-
- #### State Values
- - **Shape**: `[T, 16]`, where T is the number of timesteps (`video_length`)
- - **Components**:
-   - Orientation (quaternion): 4 values, typically in the range [-1.0, 1.0]
-   - Positions: 6 values (3D end position + 3D effector position), typically in [-1.0, 1.0] for normalized coordinates, or larger ranges for absolute positions (e.g., up to ~97.0)
- - **Overall range**: approximately [-0.86, 97.29] (values may vary depending on normalization)
-
- #### Action Values
- - **Shape**: `[T, 2]`, where T is the number of timesteps
- - **Range**: [0.0, 1.0] (normalized control values)
-
- #### Latent Videos
- - **Format**: PyTorch tensor files (`.pt`)
- - **Shape**: `[T, 4, 24, 40]`, where:
-   - `T`: number of frames (matches `video_length`)
-   - `4`: number of channels
-   - `24`: height dimension
-   - `40`: width dimension
-
- #### Dataset Statistics
- - **Training episodes**: 207
- - **Validation episodes**: 2
- - **Total episodes**: 209
-
- The `stat.json` file contains sample state and action values at the 1st and 99th percentiles, which can be used for normalization or data analysis purposes.
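A minimal sketch of percentile-based normalization using such 1st/99th-percentile bounds; the bounds below are placeholders, and the exact key names inside `stat.json` may differ.

```python
def percentile_normalize(value, p01, p99):
    """Scale `value` to [0, 1] using 1st/99th-percentile bounds, clipping outliers."""
    scaled = (value - p01) / (p99 - p01)
    return min(1.0, max(0.0, scaled))

# Placeholder bounds; in practice, read them from task_327/stat.json.
assert percentile_normalize(-5.0, 0.0, 1.0) == 0.0   # outlier below p01 is clipped
assert percentile_normalize(0.25, 0.0, 1.0) == 0.25  # in-range value scales linearly
```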

  ## 🤗 Acknowledgements

  We would like to express our gratitude to the following projects and teams:

- - **[AgiBot-World](https://github.com/OpenDriveLab/AgiBot-World)**: This dataset is processed from task_327 in AgiBot-World-Alpha. We acknowledge the OpenDriveLab team for their excellent work on the large-scale manipulation platform for scalable and intelligent embodied systems.

  ---

  ## 📄 License

- All the data and code within this repository are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

  ## 💬 Contact
 
 
+ # Ctrl-World World Model Checkpoint
+
+ ![](https://github.com/PyroMind-Dynamics/WorldModelInference/blob/master/video/show_case0.gif)
+
+ ## 📋 Model Overview
+
+ This directory contains pre-trained checkpoints based on the **Ctrl-World** architecture, trained on task_327 of the **AgiBot-World-Alpha** dataset. The model is built upon the Stable Video Diffusion (SVD) architecture, with added support for action conditioning and text-instruction conditioning.
+
+ The files `checkpoint-*.pt` are model checkpoints saved at different training steps, where `*` denotes the step number (e.g., `checkpoint-15000.pt` was saved at step 15,000).
+
+ The folder `samples` contains qualitative results on the validation split.
+
+ For more technical details, please [visit our blog post](https://pyromind.ai/blog/world-model/build-ctrl-world).
+
+ ## 📑 Table of Contents
+
+ - [📋 Model Overview](#📋-model-overview)
+ - [📦 Datasets](#📦-datasets)
+ - [🏗️ Model Architecture](#🏗️-model-architecture)
+   - [Core Components](#core-components)
+   - [Model Parameters](#model-parameters)
+   - [Input/Output Specifications](#inputoutput-specifications)
+ - [⚙️ Inference Configuration](#⚙️-inference-configuration)
+   - [Inference Hyperparameters](#inference-hyperparameters)
+   - [Usage Example](#usage-example)
+ - [💾 Checkpoint Structure](#💾-checkpoint-structure)
+ - [🔧 Dependencies](#🔧-dependencies)
+ - [📈 Performance Metrics](#📈-performance-metrics)
+ - [🤗 Acknowledgements](#🤗-acknowledgements)
+ - [📄 License](#📄-license)
+ - [💬 Contact](#💬-contact)
+
+ ## 📦 Datasets
+
+ Please visit [AgiBotWorld-Alpha-CtrlWorld-327](https://huggingface.co/datasets/pyromind/AgiBotWorld-Alpha-CtrlWorld-327/tree/main) for more details about the dataset.
+
+ ## 🏗️ Model Architecture
+
+ ### Core Components
+
+ - **Base Model**: Stable Video Diffusion (SVD), a foundation diffusion model for video generation
+ - **UNet**: Spatio-temporal conditional UNet supporting frame-level action conditioning
+ - **Action Encoder**: 3-layer fully connected network (1024-dimensional) that encodes action sequences into feature representations
+ - **Text Encoder**: CLIP text encoder for text-instruction conditioning
+ - **VAE**: Used for image encoding and decoding
+
+ ### Model Parameters
+
+ - **Action Dimension (action_dim)**: 18
+   - Left arm Cartesian position: 7 dimensions
+   - Right arm Cartesian position: 7 dimensions
+   - Left gripper state: 1 dimension
+   - Right gripper state: 1 dimension
+   - Left gripper action: 1 dimension
+   - Right gripper action: 1 dimension
+ - **History Frames (num_history)**: 6
+ - **Prediction Frames (num_frames)**: 10
+ - **Text Conditioning (text_cond)**: True
+ - **Frame-level Conditioning (frame_level_cond)**: True
+ - **History Condition Zeroing (his_cond_zero)**: False
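As a sanity check, the component breakdown above sums to `action_dim = 18`. A hypothetical slicing of the action vector can be sketched as follows; only the component sizes come from the list above, while the ordering is an assumption for illustration.

```python
# Hypothetical component order; only the sizes are taken from the list above.
ACTION_LAYOUT = [
    ("left_arm_cartesian", 7),
    ("right_arm_cartesian", 7),
    ("left_gripper_state", 1),
    ("right_gripper_state", 1),
    ("left_gripper_action", 1),
    ("right_gripper_action", 1),
]

# Build index slices for each named component of the 18-dim action vector.
slices, offset = {}, 0
for name, size in ACTION_LAYOUT:
    slices[name] = slice(offset, offset + size)
    offset += size

assert offset == 18  # matches action_dim
```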
+
+ ### Input/Output Specifications
+
+ - **Input Image Size**: 320 × 192 (single view)
+ - **Multi-view Support**: 3 views (concatenated: 320 × 576)
+ - **Latent Space Dimension**: (4, 72, 40), where 72 = 24 × 3 (3 views)
+ - **Frame Rate**: 7 FPS
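The three camera views appear to be stacked along the height axis, which is consistent with both the pixel size (192 × 3 = 576) and the latent height (24 × 3 = 72). A small helper illustrating that arithmetic (the stacking axis is inferred from the numbers above, not stated explicitly):

```python
def concat_views(height, width, n_views=3):
    """Shape after stacking `n_views` frames of (height, width) along the height axis."""
    return (height * n_views, width)

assert concat_views(192, 320) == (576, 320)  # pixel space
assert concat_views(24, 40) == (72, 40)      # latent space (per channel)
```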
+
+ ## ⚙️ Inference Configuration
+
+ This model can be used with our inference code, available on [GitHub](https://github.com/PyroMind-Dynamics/WorldModelInference).
+
+ ### Inference Hyperparameters
+
+ - **Inference Steps (num_inference_steps)**: 50
+ - **Guidance Scale (guidance_scale)**: 2.0
+ - **Motion Bucket ID (motion_bucket_id)**: 127
+ - **Frame Rate (fps)**: 7
+ - **Decode Chunk Size (decode_chunk_size)**: 7
+ - **Data Type**: bfloat16 (recommended for inference to speed up computation and save memory)
+
+ ### Usage Example
+
+ ```python
+ import torch
+
+ from models.ctrl_world import CtrlWorld
+
+ # Initialize model
+ model = CtrlWorld(
+     svd_model_path="/path/to/stable-video-diffusion-img2vid",
+     clip_model_path="/path/to/clip-vit-base-patch32",
+     action_dim=18,
+     num_history=6,
+     num_frames=10,
+     text_cond=True,
+     motion_bucket_id=127,
+     fps=7,
+     his_cond_zero=False,
+     frame_level_cond=True,
+ )
+
+ # Load checkpoint
+ checkpoint_path = "model_ckpt/task_327/checkpoint-21500.pt"
+ state_dict = torch.load(checkpoint_path, map_location="cpu")
+ model.load_state_dict(state_dict)
+ model.eval()
+
+ # Inference
+ with torch.no_grad():
+     latents = model.generate(
+         image=image_cond,      # conditioning image, shape (1, 4, 72, 40)
+         action=action_cond,    # action sequence, shape (1, 16, 18)
+         text=["instruction"],  # text instruction (optional)
+         history=his_cond,      # history frames, shape (1, 6, 4, 72, 40)
+         num_frames=10,
+         num_inference_steps=50,
+         guidance_scale=2.0,
+         fps=7,
+         motion_bucket_id=127,
+     )
  ```
+
+ ## 💾 Checkpoint Structure
+
+ The checkpoint file is a PyTorch state_dict containing approximately 2,525 parameter groups, primarily:
+
+ - `unet.*`: parameters of the UNet diffusion model
+ - `action_encoder.*`: parameters of the action encoder
+
+ **Note**: Parameters of the VAE and CLIP encoder are not saved in the checkpoint, as they use frozen pretrained weights.
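To verify which modules a checkpoint actually contains, the state_dict keys can be grouped by their top-level prefix. A generic sketch; the key list below is illustrative, and in practice it would come from `torch.load(checkpoint_path, map_location="cpu").keys()`:

```python
from collections import Counter

def count_params_by_module(state_dict_keys):
    """Count state_dict entries per top-level module prefix (e.g. 'unet')."""
    return Counter(key.split(".", 1)[0] for key in state_dict_keys)

# Illustrative keys only; real checkpoints hold ~2,525 entries.
keys = ["unet.conv_in.weight", "unet.conv_in.bias", "action_encoder.fc.0.weight"]
assert count_params_by_module(keys) == {"unet": 2, "action_encoder": 1}
```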
 
+ ## 🔧 Dependencies
+
+ ### Required Dependencies
+
+ - PyTorch >= 1.12.0
+ - diffusers (Stable Video Diffusion)
+ - transformers (CLIP)
+ - accelerate
+ - einops
+ - decord (video reading)
+ - mediapy (video saving)
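Assuming a standard Python environment, the list above roughly corresponds to the following install command; the package names are the usual PyPI ones, and versions are left unpinned.

```shell
# Hypothetical one-liner; pin versions as needed for your setup.
pip install "torch>=1.12.0" diffusers transformers accelerate einops decord mediapy
```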
 
+ ### Pretrained Models
+
+ Using this checkpoint requires the following pretrained models:
+
+ 1. **Stable Video Diffusion**:
+    - Path: `stable-video-diffusion-img2vid-config-path`
+    - Or download from Hugging Face: `stabilityai/stable-video-diffusion-img2vid`
+
+ 2. **CLIP Text Encoder**:
+    - Path: `clip-vit-base-patch32-config-path`
+    - Or download from Hugging Face: `openai/clip-vit-base-patch32`
+
 
+ ## 📈 Performance Metrics
+
+ The model was trained on the task_327 dataset and predicts multi-view robotic manipulation videos. It supports:
+
+ - ✅ Multi-view video prediction (3 views)
+ - ✅ Action-conditioned control
+ - ✅ Text-instruction conditioning
+ - ✅ Long-horizon prediction (via rolling prediction)
+
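The "rolling prediction" mentioned above presumably conditions on the last `num_history` frames, predicts `num_frames` new ones, appends them, and repeats. The bookkeeping can be sketched with a stand-in predictor; `predict` below replaces the real `model.generate` call and returns dummy frame indices.

```python
NUM_HISTORY, NUM_FRAMES = 6, 10  # matches num_history / num_frames above

def predict(history):
    """Stand-in for model.generate(...): returns NUM_FRAMES dummy frames."""
    last = history[-1]
    return [last + i + 1 for i in range(NUM_FRAMES)]

frames = list(range(NUM_HISTORY))  # seed with 6 observed frames
for _ in range(3):                 # three autoregressive rollout steps
    frames += predict(frames[-NUM_HISTORY:])

assert len(frames) == NUM_HISTORY + 3 * NUM_FRAMES  # 36 frames total
```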
  ## 🤗 Acknowledgements

  We would like to express our gratitude to the following projects and teams:

+ - **[Stable Video Diffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid)**: This model is built upon the Stable Video Diffusion architecture developed by Stability AI. We thank the Stability AI team for their excellent work on video generation with diffusion models.
+
+ - **[Ctrl-World](https://github.com/Robert-gyj/Ctrl-World)**: We acknowledge the Ctrl-World team for their pioneering work on controllable generative world models for robot manipulation.

  ---

  ## 📄 License

+ ```
+ MIT License
+
+ Copyright (c) 2026 Pyromind Dynamics
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+ ```

  ## 💬 Contact