Update DOCUMENTATION.md
# HunyuanWorld 2.0 – Documentation

This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.

## Table of Contents

- [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction)
  - [Overview](#overview)
  - [Python API](#python-api)
  - [CLI Reference](#cli-reference)
  - [Output Format](#output-format)
  - [Prior Injection](#prior-injection)
  - [Multi-GPU Inference](#multi-gpu-inference)
  - [Advanced Options](#advanced-options)
  - [Gradio App](#gradio-app)
- [Panorama Generation](#panorama-generation)
- [World Generation](#world-generation)

---

## WorldMirror 2.0 (World Reconstruction)

### Overview

Key improvements over WorldMirror 1.0:
- **Normalized RoPE** for flexible resolution inference
- **Depth mask prediction** for robust invalid pixel handling
- **Sequence Parallel + FSDP + BF16** for efficient multi-GPU inference

---

### Python API

#### `WorldMirrorPipeline.from_pretrained`

Factory method to load the model and create a pipeline instance.

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    disable_heads=None,
)
```
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pretrained_model_name_or_path` | `str` | `"tencent/HY-World-2.0"` | HuggingFace repo ID or local path |
| `enable_bf16` | `bool` | `False` | Use bfloat16 precision (except numerically critical layers) |
| `fsdp_cpu_offload` | `bool` | `False` | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) |
| `disable_heads` | `list[str]` | `None` | Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` |

**Notes:**

- Distributed mode is auto-detected from the `WORLD_SIZE` environment variable (set by `torchrun`).
- When using multi-GPU, each rank must call `from_pretrained` – the method handles `dist.init_process_group` internally.
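The auto-detection convention in the first note can be sketched as follows (an illustration of the `WORLD_SIZE` convention, not the pipeline's actual code):

```python
import os

# torchrun sets WORLD_SIZE for every rank; distributed mode is assumed
# whenever it is greater than 1.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
distributed = world_size > 1
```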
---

#### `WorldMirrorPipeline.__call__`

Run inference on a set of images or a video.

```python
result = pipeline(
    input_path,
    **kwargs,
)
```

Returns the output directory path (`str`), or `None` if the input was skipped.
**Inference Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | `str` | *(required)* | Directory of images or path to a video file |
| `video_strategy` | `str` | `"new"` | Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) |
| `video_min_frames` | `int` | `1` | Minimum number of frames to extract from a video |
| `video_max_frames` | `int` | `32` | Maximum number of frames to extract from a video |
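To make the roles of `video_min_frames` / `video_max_frames` concrete, here is a rough sketch of a uniform (`"old"`-style) selection rule. This is a hypothetical illustration, not the pipeline's actual sampler (the default `"new"` strategy is motion-aware):

```python
# Clamp the extracted frame count to [min_frames, max_frames], then pick
# evenly spaced frame indices across the video.
def uniform_frame_indices(total_frames: int, min_frames: int = 1,
                          max_frames: int = 32) -> list[int]:
    n = max(min_frames, min(max_frames, total_frames))
    if n == 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]
```

For a 100-frame video with the defaults, this yields 32 indices spanning the first to the last frame.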
**Save Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_depth` | `bool` | `True` | Save per-view depth maps (PNG visualization + NPY raw values) |
| `save_colmap` | `bool` | `False` | Save COLMAP-format sparse reconstruction (`sparse/0/`) |
| `save_conf` | `bool` | `False` | Save depth confidence maps |
| `save_sky_mask` | `bool` | `False` | Save sky segmentation masks |
**Mask Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `apply_sky_mask` | `bool` | `True` | Filter out sky regions from point clouds and Gaussians |
| `confidence_percentile` | `float` | `10.0` | Percentile threshold for confidence filtering (the bottom N% is removed) |
| `edge_normal_threshold` | `float` | `1.0` | Normal edge detection tolerance |
| `edge_depth_threshold` | `float` | `0.03` | Depth edge detection relative tolerance |
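The confidence filter above can be pictured with a short sketch (illustrative only; `confidence_mask` is a hypothetical helper, not part of the API):

```python
import numpy as np

# Keep only points whose confidence is at or above the given percentile;
# with the default of 10.0, the bottom 10% of points is dropped.
def confidence_mask(conf: np.ndarray, confidence_percentile: float = 10.0) -> np.ndarray:
    threshold = np.percentile(conf, confidence_percentile)
    return conf >= threshold

conf = np.array([0.1, 0.5, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.4])
mask = confidence_mask(conf, 10.0)  # drops the lowest-confidence entries
```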
**Compression Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `compress_pts` | `bool` | `True` | Compress point clouds via voxel merging + random sampling |
| `compress_pts_voxel_size` | `float` | `0.002` | Voxel size for point cloud merging |
| `max_resolution` | `int` | `1920` | Maximum resolution for saved output images |
| `compress_gs_max_points` | `int` | `5,000,000` | Maximum number of Gaussians after voxel pruning |
**Prior Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prior_cam_path` | `str` | `None` | Path to a camera parameters JSON file |
| `prior_depth_path` | `str` | `None` | Path to a directory containing depth map files |
**Rendered Video Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_rendered` | `bool` | `False` | Render an interpolated fly-through video from the Gaussian splats |
| `render_interp_per_pair` | `int` | `15` | Number of interpolated frames between each camera pair |
| `render_depth` | `bool` | `False` | Also render a depth visualization video |
**Misc Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `log_time` | `bool` | `True` | Print a timing report and save `pipeline_timing.json` |
| `strict_output_path` | `str` | `None` | If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories |
---

### CLI Reference

All `__call__` parameters are exposed as CLI arguments:

```bash
python -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --prior_cam_path path/to/camera_params.json \
    --prior_depth_path path/to/depth_dir/
```
**Boolean flag conventions:**

| Enable | Disable |
|--------|---------|
| `--save_colmap` | *(omit)* |
| *(default on)* `save_points` | `--no_save_points` |
| `--save_rendered` | *(omit)* |
| `--render_depth` | *(omit)* |
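This convention maps onto standard `argparse` idioms; the snippet below is a hypothetical sketch of the pattern, not the project's actual parser:

```python
import argparse

# Default-off options are plain store_true switches; default-on options such
# as save_points expose an explicit --no_<name> switch to turn them off.
parser = argparse.ArgumentParser()
parser.add_argument("--save_rendered", action="store_true")     # default off
parser.add_argument("--no_save_points", dest="save_points",
                    action="store_false", default=True)         # default on
args = parser.parse_args(["--no_save_points"])
```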
**Additional CLI-only arguments:**

| Argument | Description |
|----------|-------------|
| `--config_path` | Training config YAML for custom checkpoint loading |
| `--fsdp_cpu_offload` | Offload FSDP params to CPU |
| `--disable_heads` | Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) |
| `--no_interactive` | Exit after the first inference (skip the interactive prompt loop) |
---

### Output Format

#### File Structure

```
inference_output/
├── <case_name>/
│   └── rendered_depth.mp4   # (if --render_depth)
└── pipeline_timing.json     # Performance timing report
```
#### Prediction Dictionary

When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys:

```python
# Geometry
predictions["depth"]                # [B, S, H, W, 1] – Z-depth in camera frame

# Gaussian splats
predictions["splats"]["opacities"]  # [B, N] – Gaussian opacities
predictions["splats"]["sh"]         # [B, N, 1, 3] – Spherical harmonics (degree 0)
predictions["splats"]["weights"]    # [B, N] – Per-Gaussian confidence weights
```

Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, and `N` = total Gaussians (`S × H × W`).
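A quick way to internalize these layouts is to build dummy arrays with the documented shapes (numpy stands in here for the pipeline's torch tensors):

```python
import numpy as np

# B = 1 batch, S views of H×W pixels, N = S*H*W Gaussians.
B, S, H, W = 1, 2, 4, 6
N = S * H * W
predictions = {
    "depth": np.zeros((B, S, H, W, 1), dtype=np.float32),
    "splats": {
        "opacities": np.zeros((B, N), dtype=np.float32),
        "sh": np.zeros((B, N, 1, 3), dtype=np.float32),
        "weights": np.ones((B, N), dtype=np.float32),
    },
}
```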
---

### Prior Injection

WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.

| Prior Type | Condition | Input Format |
|------------|-----------|--------------|
| Camera Pose | `cond_flags[0]` | c2w 4×4 matrix (OpenCV convention) |
| Depth Map | `cond_flags[1]` | Per-view float depth maps |
| Intrinsics | `cond_flags[2]` | 3×3 intrinsic matrix |
#### Camera Parameters (JSON)

The camera parameter file follows the same format as the `camera_params.json` output by the pipeline:

```json
{
  "num_cameras": 2,
  "extrinsics": [
  ],
  "intrinsics": [
  ]
}
```
**Field descriptions:**

| Field | Description |
|-------|-------------|
| `camera_id` | Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) |
| `extrinsics.matrix` | 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention |
| `intrinsics.matrix` | 3×3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) |
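Based on the field table above, a two-camera prior file could be assembled like this (the exact nesting is an assumption, since part of the JSON example is elided here; the matrix values are placeholders):

```python
import json

# Identity poses and a placeholder 3×3 intrinsic matrix at the original
# image resolution, written in the documented camera_params.json layout.
identity_c2w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
K = [[1000.0, 0.0, 960.0],   # fx,  0, cx (pixels)
     [0.0, 1000.0, 540.0],   #  0, fy, cy
     [0.0, 0.0, 1.0]]
params = {
    "num_cameras": 2,
    "extrinsics": [{"camera_id": i, "matrix": identity_c2w} for i in range(2)],
    "intrinsics": [{"camera_id": i, "matrix": K} for i in range(2)],
}
with open("camera_params.json", "w") as f:
    json.dump(params, f, indent=2)
```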
**Important notes:**

- The `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or a missing key means that prior is unavailable.
- **Intrinsics resolution:** Values should correspond to the **original image resolution**. The pipeline automatically adjusts for the inference-time resize + center-crop.
- **Extrinsics alignment:** The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
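The first-view normalization in the last note amounts to re-expressing every camera-to-world pose in the coordinate frame of view 0 (a sketch; the pipeline's internal normalization may differ in detail):

```python
import numpy as np

# After normalization the first view becomes the identity pose and all other
# poses are relative to it.
def normalize_to_first_view(c2w: np.ndarray) -> np.ndarray:
    """c2w: [S, 4, 4] camera-to-world matrices."""
    ref_inv = np.linalg.inv(c2w[0])
    return ref_inv @ c2w  # matmul broadcasts over the S dimension
```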
#### Depth Maps (Folder)

Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit).

```
prior_depth/
├── image_0001.npy   # float32, shape [H, W]
├── image_0002.npy
└── ...
```
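Producing a compatible `.npy` prior is straightforward; for instance, the sketch below writes a constant float32 depth map (the resolution and depth value are placeholders):

```python
import numpy as np

# One float32 [H, W] array per view, named after the corresponding input image.
H, W = 540, 960
depth = np.full((H, W), 2.5, dtype=np.float32)  # e.g. a flat plane 2.5 units away
np.save("image_0001.npy", depth)
```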
#### Combining Priors

Priors can be freely combined. Examples:

```bash
# Only intrinsics
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json

# Camera parameters + depth maps
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json \
    --prior_depth_path depth_maps/
```
---

### Multi-GPU Inference

WorldMirror 2.0 supports **Sequence Parallel (SP)** inference across multiple GPUs: token sequences are sharded across ranks in the ViT backbone, and the DPT heads process frames in parallel.

```python
pipeline = WorldMirrorPipeline.from_pretrained(
    use_fsdp=True,
    enable_bf16=True,
)
pipeline('path/to/images')
```

**What happens under the hood:**

1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`.
2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`.
3. FSDP shards parameters across the SP process group.
4. DPT prediction heads split frames across ranks and `AllGather` the results.
5. Post-processing (mask computation, saving) runs on rank 0 only.
---

### Advanced Options

#### Disabling Prediction Heads

To save memory when you only need specific outputs:

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    disable_heads=["normal", "points"],  # frees ~200M params
)
```

Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`.

#### Mask Filtering

The pipeline supports three types of output filtering to improve point cloud and Gaussian quality.

When `compress_pts=True` (default), the depth-derived point cloud undergoes:

1. **Voxel merging**: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging.
2. **Random subsampling**: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled.

Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`.
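The two-stage compression can be sketched as follows, assuming simple unweighted per-voxel averaging (the pipeline itself uses per-point weights):

```python
import numpy as np

# 1. Voxel merging: average all points that land in the same voxel.
# 2. Random subsampling: drop down to max_points if still over budget.
def compress_points(xyz: np.ndarray, voxel_size: float = 0.002,
                    max_points: int = 1_000_000, seed: int = 0) -> np.ndarray:
    keys = np.floor(xyz / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    merged = np.zeros((len(counts), 3))
    np.add.at(merged, inverse, xyz)    # sum points per voxel
    merged /= counts[:, None]          # unweighted average per voxel
    if len(merged) > max_points:
        idx = np.random.default_rng(seed).choice(len(merged), max_points,
                                                 replace=False)
        merged = merged[idx]
    return merged
```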
---

### Gradio App

An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser.

**Quick start:**

```bash
# Single GPU
python -m hyworld2.worldrecon.gradio_app

# Multi-GPU
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
    --use_fsdp --enable_bf16
```

**With a local checkpoint:**

```bash
python -m hyworld2.worldrecon.gradio_app \
    --config_path /path/to/config.yaml \
    --ckpt_path /path/to/checkpoint.safetensors
```

**With a public link (e.g., for Colab or remote servers):**

```bash
python -m hyworld2.worldrecon.gradio_app --share
```

**Arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--port` | `8081` | Server port |
| `--fsdp_cpu_offload` | `False` | Offload FSDP params to CPU (saves GPU memory) |

> **Important:** In multi-GPU mode, the number of input images must be **>= the number of GPUs**.
---

## Panorama Generation

*Coming soon.*

This section will document the panorama generation model, including:

- Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
- Configuration parameters
- Output formats

---

## World Generation

*Coming soon.*