Commit af597e9 (verified) by ZhenweiWang, parent 4e9c82a: Update DOCUMENTATION.md
# HunyuanWorld 2.0 — Documentation
This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.

## Table of Contents
- [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction)
  - [Overview](#overview)
  - [Gradio App](#gradio-app)
- [Panorama Generation](#panorama-generation)
- [World Generation](#world-generation)

---
## WorldMirror 2.0 (World Reconstruction)
### Overview
Key improvements over WorldMirror 1.0:
- **Normalized RoPE** for flexible resolution inference
- **Depth mask prediction** for robust invalid pixel handling
- **Sequence Parallel + FSDP + BF16** for efficient multi-GPU inference

---
### Python API
#### `WorldMirrorPipeline.from_pretrained`
Factory method to load the model and create a pipeline instance.

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    ...,
    disable_heads=None,
)
```
 
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pretrained_model_name_or_path` | `str` | `"tencent/HY-World-2.0"` | HuggingFace repo ID or local path |
| `enable_bf16` | `bool` | `False` | Use bfloat16 precision (except numerically critical layers) |
| `fsdp_cpu_offload` | `bool` | `False` | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) |
| `disable_heads` | `list[str]` | `None` | Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` |

**Notes:**
- Distributed mode is auto-detected from the `WORLD_SIZE` environment variable (set by `torchrun`).
- When using multi-GPU, each rank must call `from_pretrained` — the method handles `dist.init_process_group` internally.

---
#### `WorldMirrorPipeline.__call__`
Run inference on a set of images or a video.

```python
result = pipeline(
    input_path,
    ...,
    **kwargs,
)
```
 
Returns the output directory path (`str`), or `None` if the input was skipped.

**Inference Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | `str` | *(required)* | Directory of images or path to a video file |
| `video_strategy` | `str` | `"new"` | Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) |
| `video_min_frames` | `int` | `1` | Minimum number of frames to extract from video |
| `video_max_frames` | `int` | `32` | Maximum number of frames to extract from video |

**Save Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_depth` | `bool` | `True` | Save per-view depth maps (PNG visualization + NPY raw values) |
| `save_colmap` | `bool` | `False` | Save COLMAP-format sparse reconstruction (`sparse/0/`) |
| `save_conf` | `bool` | `False` | Save depth confidence maps |
| `save_sky_mask` | `bool` | `False` | Save sky segmentation masks |

**Mask Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `apply_sky_mask` | `bool` | `True` | Filter out sky regions from point clouds and Gaussians |
| `confidence_percentile` | `float` | `10.0` | Percentile threshold for confidence filtering (bottom N% removed) |
| `edge_normal_threshold` | `float` | `1.0` | Normal edge detection tolerance |
| `edge_depth_threshold` | `float` | `0.03` | Depth edge detection relative tolerance |
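
The `confidence_percentile` filter can be pictured as a simple bottom-N% cut on per-point confidence. The sketch below is an illustrative reimplementation of that idea in plain Python, not the pipeline's actual code:

```python
def confidence_mask(conf, percentile=10.0):
    """Return a keep-mask that drops the bottom `percentile` percent
    of confidence values (illustrates `confidence_percentile`)."""
    ranked = sorted(conf)
    k = int(len(ranked) * percentile / 100.0)  # index of the percentile threshold
    threshold = ranked[k]
    return [c >= threshold for c in conf]

conf = [0.9, 0.1, 0.5, 0.8, 0.05, 0.7, 0.6, 0.95, 0.3, 0.2]
mask = confidence_mask(conf, percentile=10.0)
print(sum(mask))  # 9 of 10 points survive; only 0.05 is in the bottom 10%
```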
 
**Compression Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `compress_pts` | `bool` | `True` | Compress point clouds via voxel merging + random sampling |
| `compress_pts_voxel_size` | `float` | `0.002` | Voxel size for point cloud merging |
| `max_resolution` | `int` | `1920` | Maximum resolution for saved output images |
| `compress_gs_max_points` | `int` | `5,000,000` | Maximum number of Gaussians after voxel pruning |

**Prior Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prior_cam_path` | `str` | `None` | Path to camera parameters JSON file |
| `prior_depth_path` | `str` | `None` | Path to directory containing depth map files |

**Rendered Video Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_rendered` | `bool` | `False` | Render interpolated fly-through video from Gaussian splats |
| `render_interp_per_pair` | `int` | `15` | Number of interpolated frames between each camera pair |
| `render_depth` | `bool` | `False` | Also render a depth visualization video |

**Misc Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `log_time` | `bool` | `True` | Print timing report and save `pipeline_timing.json` |
| `strict_output_path` | `str` | `None` | If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories |
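
The `strict_output_path` option contrasts with the default `<case_name>/<timestamp>` nesting. A small sketch of that layout logic (the function name and timestamp format here are illustrative assumptions, not the pipeline's internals):

```python
import os
from datetime import datetime

def resolve_output_dir(output_root, case_name, strict_output_path=None):
    """Illustrative: mirror the documented save layout."""
    if strict_output_path is not None:
        # Save directly to the given path, no <case_name>/<timestamp> nesting.
        return strict_output_path
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # format is an assumption
    return os.path.join(output_root, case_name, timestamp)

print(resolve_output_dir("inference_output", "scene1", strict_output_path="out/fixed"))
# out/fixed
```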
 
---
### CLI Reference
All `__call__` parameters are exposed as CLI arguments:

```bash
python -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --prior_cam_path path/to/camera_params.json \
    --prior_depth_path path/to/depth_dir/
```

**Boolean flag conventions:**

| Enable | Disable |
|--------|---------|
| `--save_colmap` | *(omit)* |
| *(default on)* `save_points` | `--no_save_points` |
| `--save_rendered` | *(omit)* |
| `--render_depth` | *(omit)* |
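
These conventions correspond to the standard `argparse` pattern: default-off flags use `store_true`, while default-on flags get a paired `--no_*` flag with `store_false`. A minimal self-contained sketch (not the pipeline's actual parser):

```python
import argparse

parser = argparse.ArgumentParser()
# Default-off: passing the flag enables it; omitting it keeps False.
parser.add_argument("--save_colmap", action="store_true")
parser.add_argument("--save_rendered", action="store_true")
# Default-on: only an explicit --no_* flag disables it.
parser.add_argument("--no_save_points", dest="save_points", action="store_false")

args = parser.parse_args(["--save_colmap", "--no_save_points"])
print(args.save_colmap, args.save_points, args.save_rendered)
# True False False
```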
 
**Additional CLI-only arguments:**

| Argument | Description |
|----------|-------------|
| `--config_path` | Training config YAML for custom checkpoint loading |
| `--fsdp_cpu_offload` | Offload FSDP params to CPU |
| `--disable_heads` | Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) |
| `--no_interactive` | Exit after first inference (skip interactive prompt loop) |

---
### Output Format
#### File Structure

```
inference_output/
└── <case_name>/
    │   └── rendered_depth.mp4    # (if --render_depth)
    └── pipeline_timing.json      # Performance timing report
```
 
#### Prediction Dictionary
When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys:

```python
# Geometry
predictions["depth"]               # [B, S, H, W, 1] — Z-depth in camera frame
predictions["splats"]["opacities"] # [B, N] — Gaussian opacities
predictions["splats"]["sh"]        # [B, N, 1, 3] — Spherical harmonics (degree 0)
predictions["splats"]["weights"]   # [B, N] — Per-Gaussian confidence weights
```

Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, `N` = total Gaussians (`S × H × W`).
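
Since `N = S × H × W`, there is one Gaussian (or point) per pixel per view, each obtained by unprojecting a Z-depth pixel. For reference, the standard pinhole unprojection is shown below (a generic formula, not code from the pipeline):

```python
def unproject(u, v, z, fx, fy, cx, cy):
    """Pinhole unprojection: pixel (u, v) with Z-depth z -> camera-frame XYZ."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# A pixel at the principal point stays on the optical axis:
print(unproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
# (0.0, 0.0, 2.0)
```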
 
---
### Prior Injection
WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.

| Prior Type | Condition | Input Format |
|------------|-----------|--------------|
| Camera Pose | `cond_flags[0]` | c2w 4×4 matrix (OpenCV convention) |
| Depth Map | `cond_flags[1]` | Per-view float depth maps |
| Intrinsics | `cond_flags[2]` | 3×3 intrinsic matrix |

#### Camera Parameters (JSON)
The camera parameter file follows the same format as the `camera_params.json` output by the pipeline:

```json
{
  "num_cameras": 2,
  ...
  ]
}
```
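
A minimal script that writes a file in this schema using only the standard library. The exact nesting of the `extrinsics`/`intrinsics` entries is an assumption based on the documented fields; verify against a `camera_params.json` actually produced by the pipeline:

```python
import json

# One camera: identity c2w pose, simple pinhole intrinsics (fx=fy=500, cx=320, cy=240).
params = {
    "num_cameras": 1,
    "extrinsics": [
        {"camera_id": 0,
         "matrix": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]},
    ],
    "intrinsics": [
        {"camera_id": 0,
         "matrix": [[500, 0, 320], [0, 500, 240], [0, 0, 1]]},
    ],
}

with open("camera_params.json", "w") as f:
    json.dump(params, f, indent=2)
```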
 
**Field descriptions:**

| Field | Description |
|-------|-------------|
| `camera_id` | Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) |
| `extrinsics.matrix` | 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention |
| `intrinsics.matrix` | 3×3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) |
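
Because `fx, fy, cx, cy` are expressed in pixels, they transform predictably under resize and center-crop. The pipeline handles this adjustment internally; the sketch below only illustrates the arithmetic:

```python
def adjust_intrinsics(fx, fy, cx, cy, orig_wh, resized_wh, crop_wh):
    """Scale intrinsics for a resize, then shift the principal point for a center-crop."""
    ow, oh = orig_wh
    rw, rh = resized_wh
    cw, ch = crop_wh
    sx, sy = rw / ow, rh / oh       # resize scales focal lengths and principal point
    fx, fy = fx * sx, fy * sy
    cx, cy = cx * sx, cy * sy
    cx -= (rw - cw) / 2             # center-crop shifts the principal point
    cy -= (rh - ch) / 2
    return fx, fy, cx, cy

result = adjust_intrinsics(1000.0, 1000.0, 960.0, 540.0,
                           orig_wh=(1920, 1080), resized_wh=(960, 540), crop_wh=(960, 512))
print(result)
# (500.0, 500.0, 480.0, 256.0)
```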
 
**Important notes:**
- `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or missing key means that prior is unavailable.
- **Intrinsics resolution:** Values should correspond to the **original image resolution**. The pipeline automatically adjusts for inference-time resize + center-crop.
- **Extrinsics alignment:** The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
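
First-view normalization amounts to left-multiplying every c2w pose by the inverse of the first pose, so view 0 becomes the identity. An illustrative sketch with plain 4×4 nested lists (the pipeline's own normalization may differ in detail):

```python
def matmul4(a, b):
    """4x4 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def invert_rigid(m):
    """Invert a rigid c2w transform [R | t]: the inverse is [R^T | -R^T t]."""
    r = [[m[j][i] for j in range(3)] for i in range(3)]                 # R^T
    t = [-sum(r[i][j] * m[j][3] for j in range(3)) for i in range(3)]   # -R^T t
    return [r[0] + [t[0]], r[1] + [t[1]], r[2] + [t[2]], [0, 0, 0, 1]]

def normalize_to_first(poses):
    """Express all c2w poses relative to the first view."""
    inv0 = invert_rigid(poses[0])
    return [matmul4(inv0, p) for p in poses]

# Two translation-only poses; after normalization the first becomes the identity.
p0 = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]]
p1 = [[1, 0, 0, 2], [0, 1, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]]
norm = normalize_to_first([p0, p1])
```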

#### Depth Maps (Folder)
Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit).

```
prior_depth/
├── image_0001.npy   # float32, shape [H, W]
├── image_0002.npy
└── ...
```
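
Because matching is by filename stem, it can help to check that every image has a depth file before launching a run. An illustrative stdlib-only helper (the pipeline performs its own matching):

```python
import tempfile
from pathlib import Path

def match_depth_files(image_dir, depth_dir, exts=(".npy", ".exr", ".png")):
    """Map each image filename stem to its depth file; report unmatched stems."""
    pairs, missing = {}, []
    depth_dir = Path(depth_dir)
    for img in sorted(Path(image_dir).iterdir()):
        candidates = [depth_dir / (img.stem + ext) for ext in exts]
        hit = next((c for c in candidates if c.exists()), None)
        if hit is None:
            missing.append(img.stem)
        else:
            pairs[img.stem] = hit
    return pairs, missing

# Tiny fixture: two images, only one matching depth file.
root = Path(tempfile.mkdtemp())
(root / "images").mkdir()
(root / "depth").mkdir()
for name in ["image_0001.jpg", "image_0002.jpg"]:
    (root / "images" / name).touch()
(root / "depth" / "image_0001.npy").touch()

pairs, missing = match_depth_files(root / "images", root / "depth")
print(missing)  # ['image_0002']
```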
 
#### Combining Priors
Priors can be freely combined. Examples:

```bash
# Only intrinsics
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    ...

python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json \
    --prior_depth_path depth_maps/
```

---
### Multi-GPU Inference
WorldMirror 2.0 supports **Sequence Parallel (SP)** inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel.

```python
pipeline = WorldMirrorPipeline.from_pretrained(
    ...
)
pipeline('path/to/images')
```

**What happens under the hood:**
1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`.
2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`.
3. FSDP shards parameters across the SP process group.
4. DPT prediction heads split frames across ranks and `AllGather` results.
5. Post-processing (mask computation, saving) runs on rank 0 only.
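
Step 1's auto-detection boils down to reading the environment variables that `torchrun` sets for each process. A minimal sketch of that check (illustrative; `from_pretrained` performs the real `torch.distributed` initialization):

```python
import os

def detect_distributed(env=os.environ):
    """Mirror the documented detection: WORLD_SIZE > 1 means multi-GPU mode."""
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    return world_size > 1, world_size, rank

# torchrun --nproc_per_node=2 sets WORLD_SIZE=2 and a per-process RANK.
print(detect_distributed({"WORLD_SIZE": "2", "RANK": "1"}))
# (True, 2, 1)
```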
 
---
### Advanced Options
#### Disabling Prediction Heads
To save memory when you only need specific outputs:

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    ...,
    disable_heads=["normal", "points"],  # free ~200M params
)
```

Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`.

#### Mask Filtering
The pipeline supports three types of output filtering to improve point cloud and Gaussian quality.

When `compress_pts=True` (default), the depth-derived point cloud undergoes:
1. **Voxel merging**: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging.
2. **Random subsampling**: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled.

Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`.
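
Voxel merging can be sketched as bucketing points by quantized coordinates and weight-averaging each bucket. This is an illustrative reimplementation of the idea, not the pipeline's code:

```python
from collections import defaultdict

def voxel_merge(points, weights, voxel_size=0.002):
    """Merge points that fall in the same voxel via weighted averaging."""
    buckets = defaultdict(list)
    for p, w in zip(points, weights):
        key = tuple(int(c // voxel_size) for c in p)  # integer voxel index per axis
        buckets[key].append((p, w))
    merged = []
    for group in buckets.values():
        total_w = sum(w for _, w in group)
        merged.append(tuple(sum(p[i] * w for p, w in group) / total_w
                            for i in range(3)))
    return merged

pts = [(0.0001, 0.0, 0.0), (0.0003, 0.0, 0.0), (0.011, 0.0, 0.0)]
out = voxel_merge(pts, weights=[1.0, 3.0, 1.0])
print(len(out))  # first two points share a voxel -> 2 merged points
```

The same bucketing applies to Gaussians, except that scales, quaternions, colors, and opacities are averaged alongside the means.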
 
---
### Gradio App
An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser.

**Quick start:**

```bash
# Single GPU
python -m hyworld2.worldrecon.gradio_app

# Multi-GPU
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
    --use_fsdp --enable_bf16
```

**With a local checkpoint:**

```bash
python -m hyworld2.worldrecon.gradio_app \
    --config_path /path/to/config.yaml \
    --ckpt_path /path/to/checkpoint.safetensors
```

**With a public link (e.g., for Colab or remote servers):**

```bash
python -m hyworld2.worldrecon.gradio_app --share
```

**Arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--port` | `8081` | Server port |
| `--fsdp_cpu_offload` | `False` | Offload FSDP params to CPU (saves GPU memory) |

> **Important:** In multi-GPU mode, the number of input images must be **>= the number of GPUs**.

---
## Panorama Generation
*Coming soon.*

This section will document the panorama generation model, including:
- Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
- Configuration parameters
- Output formats

---
## World Generation
*Coming soon.*
 