# HunyuanWorld 2.0 Documentation
This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.
## Table of Contents
- [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction)
- [Overview](#overview)
- [Python API](#python-api)
- [`WorldMirrorPipeline.from_pretrained`](#worldmirrorpipelinefrom_pretrained)
- [`WorldMirrorPipeline.__call__`](#worldmirrorpipelinecall)
- [CLI Reference](#cli-reference)
- [Output Format](#output-format)
- [File Structure](#file-structure)
- [Prediction Dictionary](#prediction-dictionary)
- [Prior Injection](#prior-injection)
- [Camera Parameters (JSON)](#camera-parameters-json)
- [Depth Maps (Folder)](#depth-maps-folder)
- [Combining Priors](#combining-priors)
- [Multi-GPU Inference](#multi-gpu-inference)
- [Advanced Options](#advanced-options)
- [Disabling Prediction Heads](#disabling-prediction-heads)
- [Mask Filtering](#mask-filtering)
- [Point Cloud Compression](#point-cloud-compression)
- [Gradio App](#gradio-app)
- [Panorama Generation](#panorama-generation)
- [World Generation](#world-generation)
---
## WorldMirror 2.0 (World Reconstruction)
### Overview
WorldMirror 2.0 is a unified feed-forward model for comprehensive 3D geometric prediction from multi-view images or video. It simultaneously generates:
- **3D point clouds** in world coordinates
- **Per-view depth maps** in camera frame
- **Surface normals** in camera coordinates
- **Camera poses** (c2w) and **intrinsics**
- **3D Gaussian Splatting** attributes (means, scales, rotations, opacities, SH coefficients)
Key improvements over WorldMirror 1.0:
- **Normalized RoPE** for flexible resolution inference
- **Depth mask prediction** for robust invalid pixel handling
- **Sequence Parallel + FSDP + BF16** for efficient multi-GPU inference
---
### Python API
#### `WorldMirrorPipeline.from_pretrained`
Factory method to load the model and create a pipeline instance.
```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    pretrained_model_name_or_path="tencent/HY-World-2.0",
    subfolder="HY-WorldMirror-2.0",
    config_path=None,
    ckpt_path=None,
    use_fsdp=False,
    enable_bf16=False,
    fsdp_cpu_offload=False,
    disable_heads=None,
)
```
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pretrained_model_name_or_path` | `str` | `"tencent/HY-World-2.0"` | HuggingFace repo ID or local path |
| `subfolder` | `str` | `"HY-WorldMirror-2.0"` | Subfolder inside the repo containing WorldMirror checkpoint (`model.safetensors` + config) |
| `config_path` | `str` | `None` | Training config YAML (used with `ckpt_path` for custom checkpoints) |
| `ckpt_path` | `str` | `None` | Checkpoint file (`.ckpt` / `.safetensors`). When provided with `config_path`, loads model from local checkpoint instead of HuggingFace |
| `use_fsdp` | `bool` | `False` | Shard parameters across GPUs via Fully Sharded Data Parallel |
| `enable_bf16` | `bool` | `False` | Use bfloat16 precision (except numerically critical layers) |
| `fsdp_cpu_offload` | `bool` | `False` | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) |
| `disable_heads` | `list[str]` | `None` | Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` |
**Notes:**
- Distributed mode is auto-detected from `WORLD_SIZE` environment variable (set by `torchrun`).
- When using multi-GPU, each rank must call `from_pretrained`; the method handles `dist.init_process_group` internally.
---
#### `WorldMirrorPipeline.__call__`
Run inference on a set of images or a video.
```python
result = pipeline(
    input_path,
    output_path="inference_output",
    **kwargs,
)
```
Returns the output directory path (`str`), or `None` if the input was skipped.
**Inference Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | `str` | *(required)* | Directory of images or path to a video file |
| `output_path` | `str` | `"inference_output"` | Root output directory |
| `target_size` | `int` | `952` | Maximum inference resolution (longest edge). Images are resized + center-cropped to the nearest multiple of 14 |
| `fps` | `int` | `1` | FPS for extracting frames from video input |
| `video_strategy` | `str` | `"new"` | Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) |
| `video_min_frames` | `int` | `1` | Minimum number of frames to extract from video |
| `video_max_frames` | `int` | `32` | Maximum number of frames to extract from video |
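For reference, the documented resize rule (longest edge capped at `target_size`, each side snapped to a multiple of 14) can be sketched as a small helper; this is an illustration, and the pipeline's exact rounding may differ:

```python
def infer_resolution(h: int, w: int, target_size: int = 952, patch: int = 14):
    """Estimate the inference resolution: scale so the longest edge is at
    most target_size (never upscaling), then snap each side to a multiple
    of patch by rounding toward zero."""
    long_edge = max(h, w)
    scale = min(target_size, long_edge) / long_edge
    new_h = max(round(h * scale) // patch, 1) * patch
    new_w = max(round(w * scale) // patch, 1) * patch
    return new_h, new_w

print(infer_resolution(1080, 1920))  # a 1080p frame maps to (532, 952)
```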
**Save Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_depth` | `bool` | `True` | Save per-view depth maps (PNG visualization + NPY raw values) |
| `save_normal` | `bool` | `True` | Save per-view surface normal maps (PNG) |
| `save_gs` | `bool` | `True` | Save 3D Gaussian Splatting as `gaussians.ply` |
| `save_camera` | `bool` | `True` | Save camera parameters as `camera_params.json` |
| `save_points` | `bool` | `True` | Save depth-derived point cloud as `points.ply` |
| `save_colmap` | `bool` | `False` | Save COLMAP-format sparse reconstruction (`sparse/0/`) |
| `save_conf` | `bool` | `False` | Save depth confidence maps |
| `save_sky_mask` | `bool` | `False` | Save sky segmentation masks |
**Mask Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `apply_sky_mask` | `bool` | `True` | Filter out sky regions from point clouds and Gaussians |
| `apply_edge_mask` | `bool` | `True` | Filter out edge/discontinuity regions |
| `apply_confidence_mask` | `bool` | `False` | Filter out low-confidence predictions |
| `sky_mask_source` | `str` | `"auto"` | Sky mask method: `"auto"` (ONNX + model fusion), `"model"` (model predictions only), `"onnx"` (external segmentation only) |
| `model_sky_threshold` | `float` | `0.45` | Threshold for model-based sky detection |
| `confidence_percentile` | `float` | `10.0` | Percentile threshold for confidence filtering (bottom N% removed) |
| `edge_normal_threshold` | `float` | `1.0` | Normal edge detection tolerance |
| `edge_depth_threshold` | `float` | `0.03` | Depth edge detection relative tolerance |
**Compression Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `compress_pts` | `bool` | `True` | Compress point clouds via voxel merging + random sampling |
| `compress_pts_max_points` | `int` | `2,000,000` | Maximum number of points after compression |
| `compress_pts_voxel_size` | `float` | `0.002` | Voxel size for point cloud merging |
| `max_resolution` | `int` | `1920` | Maximum resolution for saved output images |
| `compress_gs_max_points` | `int` | `5,000,000` | Maximum number of Gaussians after voxel pruning |
**Prior Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prior_cam_path` | `str` | `None` | Path to camera parameters JSON file |
| `prior_depth_path` | `str` | `None` | Path to directory containing depth map files |
**Rendered Video Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_rendered` | `bool` | `False` | Render interpolated fly-through video from Gaussian splats |
| `render_interp_per_pair` | `int` | `15` | Number of interpolated frames between each camera pair |
| `render_depth` | `bool` | `False` | Also render a depth visualization video |
**Misc Parameters:**
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `log_time` | `bool` | `True` | Print timing report and save `pipeline_timing.json` |
| `strict_output_path` | `str` | `None` | If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories |
---
### CLI Reference
All `__call__` parameters are exposed as CLI arguments:
```bash
python -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --output_path inference_output \
    --target_size 952 \
    --prior_cam_path path/to/camera_params.json \
    --prior_depth_path path/to/depth_dir/
```
**Boolean flag conventions:**
| Enable | Disable |
|--------|---------|
| `--save_colmap` | *(omit)* |
| `--save_conf` | *(omit)* |
| `--save_sky_mask` | *(omit)* |
| `--apply_sky_mask` (default on) | `--no_sky_mask` |
| `--apply_edge_mask` (default on) | `--no_edge_mask` |
| `--apply_confidence_mask` | *(omit)* |
| `--compress_pts` (default on) | `--no_compress_pts` |
| `--log_time` (default on) | `--no_log_time` |
| `--save_depth` (default on) | `--no_save_depth` |
| `--save_normal` (default on) | `--no_save_normal` |
| `--save_gs` (default on) | `--no_save_gs` |
| `--save_camera` (default on) | `--no_save_camera` |
| `--save_points` (default on) | `--no_save_points` |
| `--save_rendered` | *(omit)* |
| `--render_depth` | *(omit)* |
**Additional CLI-only arguments:**
| Argument | Description |
|----------|-------------|
| `--config_path` | Training config YAML for custom checkpoint loading |
| `--ckpt_path` | Local checkpoint file path |
| `--use_fsdp` | Enable FSDP multi-GPU sharding |
| `--enable_bf16` | Enable bfloat16 mixed precision |
| `--fsdp_cpu_offload` | Offload FSDP params to CPU |
| `--disable_heads` | Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) |
| `--no_interactive` | Exit after first inference (skip interactive prompt loop) |
---
### Output Format
#### File Structure
```
inference_output/
└── <case_name>/
    └── <timestamp>/
        ├── depth/
        │   ├── depth_0000.png     # Normalized depth visualization
        │   ├── depth_0000.npy     # Raw float32 depth values [H, W]
        │   └── ...
        ├── normal/
        │   ├── normal_0000.png    # Normal map visualization (RGB)
        │   └── ...
        ├── camera_params.json     # Camera extrinsics & intrinsics
        ├── gaussians.ply          # 3D Gaussian Splatting (standard format)
        ├── points.ply             # Colored point cloud
        ├── sparse/                # COLMAP format (if --save_colmap)
        │   └── 0/
        │       ├── cameras.bin
        │       ├── images.bin
        │       └── points3D.bin
        ├── rendered/              # Rendered video (if --save_rendered)
        │   ├── rendered_rgb.mp4
        │   └── rendered_depth.mp4 # (if --render_depth)
        └── pipeline_timing.json   # Performance timing report
```
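A minimal sketch for consuming these files downstream (it assumes the layout above and requires `numpy`; the `load_results` helper is ours, not part of the package):

```python
import json
from pathlib import Path

import numpy as np

def load_results(result_dir):
    """Load raw depth maps and camera parameters from a WorldMirror
    output directory following the documented layout."""
    result_dir = Path(result_dir)
    # Raw float32 depth values, one [H, W] array per view.
    depths = [np.load(p) for p in sorted(result_dir.glob("depth/depth_*.npy"))]
    # Camera extrinsics (4x4 c2w) and intrinsics (3x3), as saved in JSON.
    cams = json.loads((result_dir / "camera_params.json").read_text())
    extrinsics = [np.array(c["matrix"]) for c in cams.get("extrinsics", [])]
    intrinsics = [np.array(c["matrix"]) for c in cams.get("intrinsics", [])]
    return depths, extrinsics, intrinsics
```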
#### Prediction Dictionary
When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys:
```python
# Geometry
predictions["depth"]         # [B, S, H, W, 1]: Z-depth in camera frame
predictions["depth_conf"]    # [B, S, H, W]: depth confidence
predictions["normals"]       # [B, S, H, W, 3]: surface normals in camera coords
predictions["normals_conf"]  # [B, S, H, W]: normal confidence
predictions["pts3d"]         # [B, S, H, W, 3]: 3D point maps in world coords
predictions["pts3d_conf"]    # [B, S, H, W]: point cloud confidence

# Camera
predictions["camera_poses"]  # [B, S, 4, 4]: camera-to-world (c2w), OpenCV convention
predictions["camera_intrs"]  # [B, S, 3, 3]: camera intrinsic matrices
predictions["camera_params"] # [B, S, 9]: compact camera vector (translation, quaternion, fov_v, fov_u)

# 3D Gaussian Splatting
predictions["splats"]["means"]      # [B, N, 3]: Gaussian centers
predictions["splats"]["scales"]     # [B, N, 3]: Gaussian scales
predictions["splats"]["quats"]      # [B, N, 4]: Gaussian rotations (quaternions)
predictions["splats"]["opacities"]  # [B, N]: Gaussian opacities
predictions["splats"]["sh"]         # [B, N, 1, 3]: spherical harmonics (degree 0)
predictions["splats"]["weights"]    # [B, N]: per-Gaussian confidence weights
```
Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, `N` = total Gaussians (`S × H × W`).
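Since `camera_poses` are camera-to-world, downstream tools expecting world-to-camera matrices (COLMAP-style) need the inverse; for a rigid transform this has a cheap closed form (a sketch with `numpy`):

```python
import numpy as np

def c2w_to_w2c(c2w: np.ndarray) -> np.ndarray:
    """Invert a 4x4 camera-to-world pose into world-to-camera.
    For a rigid transform, [R | t]^-1 = [R^T | -R^T t]."""
    R, t = c2w[:3, :3], c2w[:3, 3]
    w2c = np.eye(4)
    w2c[:3, :3] = R.T
    w2c[:3, 3] = -R.T @ t
    return w2c
```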
---
### Prior Injection
WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.
| Prior Type | Condition | Input Format |
|------------|-----------|--------------|
| Camera Pose | `cond_flags[0]` | 4×4 c2w matrix (OpenCV convention) |
| Depth Map | `cond_flags[1]` | Per-view float depth maps |
| Intrinsics | `cond_flags[2]` | 3×3 intrinsic matrix |
#### Camera Parameters (JSON)
The camera parameter file follows the same format as the `camera_params.json` output by the pipeline:
```json
{
"num_cameras": 2,
"extrinsics": [
{
"camera_id": 0,
"matrix": [
[0.98, 0.01, -0.17, 0.52],
[-0.01, 0.99, 0.01, -0.03],
[0.17, -0.01, 0.98, 1.20],
[0.0, 0.0, 0.0, 1.0]
]
}
],
"intrinsics": [
{
"camera_id": 0,
"matrix": [
[525.0, 0.0, 320.0],
[0.0, 525.0, 240.0],
[0.0, 0.0, 1.0]
]
}
]
}
```
**Field descriptions:**
| Field | Description |
|-------|-------------|
| `camera_id` | Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) |
| `extrinsics.matrix` | 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention |
| `intrinsics.matrix` | 3×3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) |
**Important notes:**
- `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or missing key means that prior is unavailable.
- **Intrinsics resolution:** Values should correspond to the **original image resolution**. The pipeline automatically adjusts for inference-time resize + center-crop.
- **Extrinsics alignment:** The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
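To generate such a file programmatically, a sketch like the following produces the documented schema (`write_camera_priors` is a hypothetical helper, not part of the package):

```python
import json

def write_camera_priors(path, extrinsics=(), intrinsics=()):
    """Hypothetical helper: write a prior camera file in the schema above.
    An empty list marks that prior as unavailable."""
    data = {
        "num_cameras": max(len(extrinsics), len(intrinsics)),
        "extrinsics": [{"camera_id": i, "matrix": m} for i, m in enumerate(extrinsics)],
        "intrinsics": [{"camera_id": i, "matrix": m} for i, m in enumerate(intrinsics)],
    }
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

# Intrinsics-only prior; extrinsics stay empty (i.e., unavailable).
K = [[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]]
write_camera_priors("camera_intrinsics_only.json", intrinsics=[K])
```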
#### Depth Maps (Folder)
Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit).
```
prior_depth/
├── image_0001.npy   # float32, shape [H, W]
├── image_0002.npy
└── ...
```
#### Combining Priors
Priors can be freely combined. Examples:
```bash
# Only intrinsics
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_intrinsics_only.json

# Only depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_depth_path depth_maps/

# Camera poses + intrinsics + depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json \
    --prior_depth_path depth_maps/
```
---
### Multi-GPU Inference
WorldMirror 2.0 supports **Sequence Parallel (SP)** inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel.
> **Requirement:** The number of input images must be **>= the number of GPUs** (`nproc_per_node`). For example, with 8 GPUs you need at least 8 input images. The pipeline will raise an error if this condition is not met.
```bash
# 2 GPUs with FSDP + bf16
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16

# 4 GPUs
torchrun --nproc_per_node=4 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16
```

```python
# Python API (launched via torchrun)
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    "tencent/HY-World-2.0",
    use_fsdp=True,
    enable_bf16=True,
)
pipeline("path/to/images")
```
**What happens under the hood:**
1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`.
2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`.
3. FSDP shards parameters across the SP process group.
4. DPT prediction heads split frames across ranks and `AllGather` results.
5. Post-processing (mask computation, saving) runs on rank 0 only.
---
### Advanced Options
#### Disabling Prediction Heads
To save memory when you only need specific outputs:
```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    "tencent/HY-World-2.0",
    disable_heads=["normal", "points"],  # free ~200M params
)
```
Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`.
#### Mask Filtering
The pipeline supports three types of output filtering to improve point cloud and Gaussian quality:
1. **Sky mask** (`apply_sky_mask=True`): Removes sky regions using an ONNX-based segmentation model, optionally fused with model-predicted depth masks.
2. **Edge mask** (`apply_edge_mask=True`): Removes points at depth/normal discontinuities (object boundaries).
3. **Confidence mask** (`apply_confidence_mask=False`): Removes the bottom N% of points by prediction confidence.
These masks are applied independently to both the `points.ply` (depth-based) and `gaussians.ply` (GS-based) outputs. The GS output uses its own depth predictions for edge detection when available.
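As an illustration of the depth-edge criterion (a relative depth change beyond `edge_depth_threshold` marks a discontinuity), here is a minimal sketch; the pipeline's actual mask computation may differ in detail:

```python
import numpy as np

def depth_edge_mask(depth: np.ndarray, rel_tol: float = 0.03) -> np.ndarray:
    """Flag pixels whose depth differs from a horizontal or vertical
    neighbor by more than rel_tol (relative). Flagged pixels would be
    dropped from the point cloud."""
    edge = np.zeros_like(depth, dtype=bool)
    # Relative difference to the right and bottom neighbors.
    dx = np.abs(np.diff(depth, axis=1)) / np.maximum(depth[:, :-1], 1e-6)
    dy = np.abs(np.diff(depth, axis=0)) / np.maximum(depth[:-1, :], 1e-6)
    # Mark both sides of each discontinuity.
    edge[:, :-1] |= dx > rel_tol
    edge[:, 1:] |= dx > rel_tol
    edge[:-1, :] |= dy > rel_tol
    edge[1:, :] |= dy > rel_tol
    return edge
```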
#### Point Cloud Compression
When `compress_pts=True` (default), the depth-derived point cloud undergoes:
1. **Voxel merging**: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging.
2. **Random subsampling**: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled.
Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`.
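The two compression steps can be sketched as follows (unweighted voxel averaging for brevity, whereas the pipeline uses weighted averaging; the helper name is ours):

```python
import numpy as np

def compress_points(pts, colors, voxel_size=0.002, max_points=2_000_000, seed=0):
    """Sketch of the documented compression: average points (and colors)
    that fall in the same voxel, then randomly subsample if still too many."""
    # 1. Voxel merging: bucket points by integer voxel coordinates.
    keys = np.floor(pts / voxel_size).astype(np.int64)
    _, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inv = inv.ravel()
    merged_pts = np.zeros((len(counts), 3))
    merged_cols = np.zeros((len(counts), 3))
    np.add.at(merged_pts, inv, pts)
    np.add.at(merged_cols, inv, colors)
    merged_pts /= counts[:, None]
    merged_cols /= counts[:, None]
    # 2. Random subsampling down to the point budget.
    if len(merged_pts) > max_points:
        idx = np.random.default_rng(seed).choice(len(merged_pts), max_points, replace=False)
        merged_pts, merged_cols = merged_pts[idx], merged_cols[idx]
    return merged_pts, merged_cols
```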
---
### Gradio App
An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser.
**Quick start:**
```bash
# Single GPU
python -m hyworld2.worldrecon.gradio_app

# Multi-GPU
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
    --use_fsdp --enable_bf16
```
**With a local checkpoint:**
```bash
python -m hyworld2.worldrecon.gradio_app \
    --config_path /path/to/config.yaml \
    --ckpt_path /path/to/checkpoint.safetensors
```
**With a public link (e.g., for Colab or remote servers):**
```bash
python -m hyworld2.worldrecon.gradio_app --share
```
**Arguments:**
| Argument | Default | Description |
|----------|---------|-------------|
| `--port` | `8081` | Server port |
| `--host` | `0.0.0.0` | Server host |
| `--share` | `False` | Create a public Gradio link |
| `--examples_dir` | `./examples/worldrecon` | Path to example scenes directory |
| `--config_path` | `None` | Training config YAML (used with `--ckpt_path`) |
| `--ckpt_path` | `None` | Local checkpoint file (`.ckpt` / `.safetensors`) |
| `--use_fsdp` | `False` | Enable FSDP multi-GPU sharding |
| `--enable_bf16` | `False` | Enable bfloat16 mixed precision |
| `--fsdp_cpu_offload` | `False` | Offload FSDP params to CPU (saves GPU memory) |
> **Important:** In multi-GPU mode, the number of input images must be **>= the number of GPUs**.
---
## Panorama Generation
*Coming soon.*
This section will document the panorama generation model, including:
- Text-to-panorama and image-to-panorama APIs
- Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
- Configuration parameters
- Output formats
---
## World Generation
*Coming soon.*
This section will document the world generation pipeline, including:
- Trajectory planning configuration
- World expansion with memory-driven video generation
- World composition (point cloud expansion + 3DGS optimization)
- End-to-end generation from text/image to navigable 3D world