| # HunyuanWorld 2.0 β Documentation |
| This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0. |
|
|
| ## Table of Contents |
| - [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction) |
| - [Overview](#overview) |
| - [Python API](#python-api) |
| - [`WorldMirrorPipeline.from_pretrained`](#worldmirrorpipelinefrom_pretrained) |
| - [`WorldMirrorPipeline.__call__`](#worldmirrorpipelinecall) |
| - [CLI Reference](#cli-reference) |
| - [Output Format](#output-format) |
| - [File Structure](#file-structure) |
| - [Prediction Dictionary](#prediction-dictionary) |
| - [Prior Injection](#prior-injection) |
| - [Camera Parameters (JSON)](#camera-parameters-json) |
| - [Depth Maps (Folder)](#depth-maps-folder) |
| - [Combining Priors](#combining-priors) |
| - [Multi-GPU Inference](#multi-gpu-inference) |
| - [Advanced Options](#advanced-options) |
| - [Disabling Prediction Heads](#disabling-prediction-heads) |
| - [Mask Filtering](#mask-filtering) |
| - [Point Cloud Compression](#point-cloud-compression) |
| - [Gradio App](#gradio-app) |
| - [Panorama Generation](#panorama-generation) |
| - [World Generation](#world-generation) |
|
|
| --- |
| ## WorldMirror 2.0 (World Reconstruction) |
| ### Overview |
| WorldMirror 2.0 is a unified feed-forward model for comprehensive 3D geometric prediction from multi-view images or video. It simultaneously generates: |
| - **3D point clouds** in world coordinates |
| - **Per-view depth maps** in camera frame |
| - **Surface normals** in camera coordinates |
| - **Camera poses** (c2w) and **intrinsics** |
| - **3D Gaussian Splatting** attributes (means, scales, rotations, opacities, SH coefficients) |
|
|
| Key improvements over WorldMirror 1.0: |
| - **Normalized RoPE** for flexible resolution inference |
| - **Depth mask prediction** for robust invalid pixel handling |
| - **Sequence Parallel + FSDP + BF16** for efficient multi-GPU inference |
|
|
| --- |
| ### Python API |
| #### `WorldMirrorPipeline.from_pretrained` |
| Factory method to load the model and create a pipeline instance. |
| |
| ```python |
| from hyworld2.worldrecon.pipeline import WorldMirrorPipeline |
| |
| pipeline = WorldMirrorPipeline.from_pretrained( |
| pretrained_model_name_or_path="tencent/HY-World-2.0", |
| subfolder="HY-WorldMirror-2.0", |
| config_path=None, |
| ckpt_path=None, |
| use_fsdp=False, |
| enable_bf16=False, |
| fsdp_cpu_offload=False, |
| disable_heads=None, |
| ) |
| ``` |
| |
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `pretrained_model_name_or_path` | `str` | `"tencent/HY-World-2.0"` | HuggingFace repo ID or local path | |
| | `subfolder` | `str` | `"HY-WorldMirror-2.0"` | Subfolder inside the repo containing WorldMirror checkpoint (`model.safetensors` + config) | |
| | `config_path` | `str` | `None` | Training config YAML (used with `ckpt_path` for custom checkpoints) | |
| | `ckpt_path` | `str` | `None` | Checkpoint file (`.ckpt` / `.safetensors`). When provided with `config_path`, loads model from local checkpoint instead of HuggingFace | |
| | `use_fsdp` | `bool` | `False` | Shard parameters across GPUs via Fully Sharded Data Parallel | |
| | `enable_bf16` | `bool` | `False` | Use bfloat16 precision (except numerically critical layers) | |
| | `fsdp_cpu_offload` | `bool` | `False` | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) | |
| | `disable_heads` | `list[str]` | `None` | Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` | |
|
|
| **Notes:** |
| - Distributed mode is auto-detected from `WORLD_SIZE` environment variable (set by `torchrun`). |
| - When using multi-GPU, each rank must call `from_pretrained` β the method handles `dist.init_process_group` internally. |
|
|
| --- |
| #### `WorldMirrorPipeline.__call__` |
| Run inference on a set of images or a video. |
|
|
| ```python |
| result = pipeline( |
| input_path, |
| output_path="inference_output", |
| **kwargs, |
| ) |
| ``` |
|
|
| Returns the output directory path (`str`), or `None` if the input was skipped. |
|
|
| **Inference Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `input_path` | `str` | *(required)* | Directory of images or path to a video file | |
| | `output_path` | `str` | `"inference_output"` | Root output directory | |
| | `target_size` | `int` | `952` | Maximum inference resolution (longest edge). Images are resized + center-cropped to the nearest multiple of 14 | |
| | `fps` | `int` | `1` | FPS for extracting frames from video input | |
| | `video_strategy` | `str` | `"new"` | Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) | |
| | `video_min_frames` | `int` | `1` | Minimum number of frames to extract from video | |
| | `video_max_frames` | `int` | `32` | Maximum number of frames to extract from video | |
|
|
| **Save Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `save_depth` | `bool` | `True` | Save per-view depth maps (PNG visualization + NPY raw values) | |
| | `save_normal` | `bool` | `True` | Save per-view surface normal maps (PNG) | |
| | `save_gs` | `bool` | `True` | Save 3D Gaussian Splatting as `gaussians.ply` | |
| | `save_camera` | `bool` | `True` | Save camera parameters as `camera_params.json` | |
| | `save_points` | `bool` | `True` | Save depth-derived point cloud as `points.ply` | |
| | `save_colmap` | `bool` | `False` | Save COLMAP-format sparse reconstruction (`sparse/0/`) | |
| | `save_conf` | `bool` | `False` | Save depth confidence maps | |
| | `save_sky_mask` | `bool` | `False` | Save sky segmentation masks | |
|
|
| **Mask Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `apply_sky_mask` | `bool` | `True` | Filter out sky regions from point clouds and Gaussians | |
| | `apply_edge_mask` | `bool` | `True` | Filter out edge/discontinuity regions | |
| | `apply_confidence_mask` | `bool` | `False` | Filter out low-confidence predictions | |
| | `sky_mask_source` | `str` | `"auto"` | Sky mask method: `"auto"` (ONNX + model fusion), `"model"` (model predictions only), `"onnx"` (external segmentation only) | |
| | `model_sky_threshold` | `float` | `0.45` | Threshold for model-based sky detection | |
| | `confidence_percentile` | `float` | `10.0` | Percentile threshold for confidence filtering (bottom N% removed) | |
| | `edge_normal_threshold` | `float` | `1.0` | Normal edge detection tolerance | |
| | `edge_depth_threshold` | `float` | `0.03` | Depth edge detection relative tolerance | |
|
|
| **Compression Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `compress_pts` | `bool` | `True` | Compress point clouds via voxel merging + random sampling | |
| | `compress_pts_max_points` | `int` | `2,000,000` | Maximum number of points after compression | |
| | `compress_pts_voxel_size` | `float` | `0.002` | Voxel size for point cloud merging | |
| | `max_resolution` | `int` | `1920` | Maximum resolution for saved output images | |
| | `compress_gs_max_points` | `int` | `5,000,000` | Maximum number of Gaussians after voxel pruning | |
|
|
| **Prior Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `prior_cam_path` | `str` | `None` | Path to camera parameters JSON file | |
| | `prior_depth_path` | `str` | `None` | Path to directory containing depth map files | |
|
|
| **Rendered Video Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `save_rendered` | `bool` | `False` | Render interpolated fly-through video from Gaussian splats | |
| | `render_interp_per_pair` | `int` | `15` | Number of interpolated frames between each camera pair | |
| | `render_depth` | `bool` | `False` | Also render a depth visualization video | |
|
|
| **Misc Parameters:** |
|
|
| | Parameter | Type | Default | Description | |
| |-----------|------|---------|-------------| |
| | `log_time` | `bool` | `True` | Print timing report and save `pipeline_timing.json` | |
| | `strict_output_path` | `str` | `None` | If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories | |
|
|
| --- |
| ### CLI Reference |
| All `__call__` parameters are exposed as CLI arguments: |
|
|
| ```bash |
| python -m hyworld2.worldrecon.pipeline \ |
| --input_path path/to/images \ |
| --output_path inference_output \ |
| --target_size 952 \ |
| --prior_cam_path path/to/camera_params.json \ |
| --prior_depth_path path/to/depth_dir/ \ |
| ``` |
|
|
| **Boolean flag conventions:** |
|
|
| | Enable | Disable | |
| |--------|---------| |
| | `--save_colmap` | *(omit)* | |
| | `--save_conf` | *(omit)* | |
| | `--save_sky_mask` | *(omit)* | |
| | `--apply_sky_mask` (default on) | `--no_sky_mask` | |
| | `--apply_edge_mask` (default on) | `--no_edge_mask` | |
| | `--apply_confidence_mask` | *(omit)* | |
| | `--compress_pts` (default on) | `--no_compress_pts` | |
| | `--log_time` (default on) | `--no_log_time` | |
| | *(default on)* `save_depth` | `--no_save_depth` | |
| | *(default on)* `save_normal` | `--no_save_normal` | |
| | *(default on)* `save_gs` | `--no_save_gs` | |
| | *(default on)* `save_camera` | `--no_save_camera` | |
| | *(default on)* `save_points` | `--no_save_points` | |
| | `--save_rendered` | *(omit)* | |
| | `--render_depth` | *(omit)* | |
|
|
| **Additional CLI-only arguments:** |
|
|
| | Argument | Description | |
| |----------|-------------| |
| | `--config_path` | Training config YAML for custom checkpoint loading | |
| | `--ckpt_path` | Local checkpoint file path | |
| | `--use_fsdp` | Enable FSDP multi-GPU sharding | |
| | `--enable_bf16` | Enable bfloat16 mixed precision | |
| | `--fsdp_cpu_offload` | Offload FSDP params to CPU | |
| | `--disable_heads` | Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) | |
| | `--no_interactive` | Exit after first inference (skip interactive prompt loop) | |
|
|
| --- |
| ### Output Format |
| #### File Structure |
|
|
| ``` |
| inference_output/ |
| βββ <case_name>/ |
| βββ <timestamp>/ |
| βββ depth/ |
| β βββ depth_0000.png # Normalized depth visualization |
| β βββ depth_0000.npy # Raw float32 depth values [H, W] |
| β βββ ... |
| βββ normal/ |
| β βββ normal_0000.png # Normal map visualization (RGB) |
| β βββ ... |
| βββ camera_params.json # Camera extrinsics & intrinsics |
| βββ gaussians.ply # 3D Gaussian Splatting (standard format) |
| βββ points.ply # Colored point cloud |
| βββ sparse/ # COLMAP format (if --save_colmap) |
| β βββ 0/ |
| β βββ cameras.bin |
| β βββ images.bin |
| β βββ points3D.bin |
| βββ rendered/ # Rendered video (if --save_rendered) |
| β βββ rendered_rgb.mp4 |
| β βββ rendered_depth.mp4 # (if --render_depth) |
| βββ pipeline_timing.json # Performance timing report |
| ``` |
|
|
| #### Prediction Dictionary |
| When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys: |
|
|
| ```python |
| # Geometry |
| predictions["depth"] # [B, S, H, W, 1] β Z-depth in camera frame |
| predictions["depth_conf"] # [B, S, H, W] β Depth confidence |
| predictions["normals"] # [B, S, H, W, 3] β Surface normals in camera coords |
| predictions["normals_conf"] # [B, S, H, W] β Normal confidence |
| predictions["pts3d"] # [B, S, H, W, 3] β 3D point maps in world coords |
| predictions["pts3d_conf"] # [B, S, H, W] β Point cloud confidence |
| # Camera |
| predictions["camera_poses"] # [B, S, 4, 4] β Camera-to-world (c2w), OpenCV convention |
| predictions["camera_intrs"] # [B, S, 3, 3] β Camera intrinsic matrices |
| predictions["camera_params"]# [B, S, 9] β Compact camera vector (translation, quaternion, fov_v, fov_u) |
| # 3D Gaussian Splatting |
| predictions["splats"]["means"] # [B, N, 3] β Gaussian centers |
| predictions["splats"]["scales"] # [B, N, 3] β Gaussian scales |
| predictions["splats"]["quats"] # [B, N, 4] β Gaussian rotations (quaternions) |
| predictions["splats"]["opacities"] # [B, N] β Gaussian opacities |
| predictions["splats"]["sh"] # [B, N, 1, 3] β Spherical harmonics (degree 0) |
| predictions["splats"]["weights"] # [B, N] β Per-Gaussian confidence weights |
| ``` |
|
|
| Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, `N` = total Gaussians (`S Γ H Γ W`). |
|
|
| --- |
| ### Prior Injection |
| WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files. |
|
|
| | Prior Type | Condition | Input Format | |
| |------------|-----------|--------------| |
| | Camera Pose | `cond_flags[0]` | c2w 4Γ4 matrix (OpenCV convention) | |
| | Depth Map | `cond_flags[1]` | Per-view float depth maps | |
| | Intrinsics | `cond_flags[2]` | 3Γ3 intrinsic matrix | |
|
|
| #### Camera Parameters (JSON) |
| The camera parameter file follows the same format as the `camera_params.json` output by the pipeline: |
|
|
| ```json |
| { |
| "num_cameras": 2, |
| "extrinsics": [ |
| { |
| "camera_id": 0, |
| "matrix": [ |
| [0.98, 0.01, -0.17, 0.52], |
| [-0.01, 0.99, 0.01, -0.03], |
| [0.17, -0.01, 0.98, 1.20], |
| [0.0, 0.0, 0.0, 1.0] |
| ] |
| } |
| ], |
| "intrinsics": [ |
| { |
| "camera_id": 0, |
| "matrix": [ |
| [525.0, 0.0, 320.0], |
| [0.0, 525.0, 240.0], |
| [0.0, 0.0, 1.0] |
| ] |
| } |
| ] |
| } |
| ``` |
|
|
| **Field descriptions:** |
|
|
| | Field | Description | |
| |-------|-------------| |
| | `camera_id` | Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) | |
| | `extrinsics.matrix` | 4Γ4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention | |
| | `intrinsics.matrix` | 3Γ3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) | |
|
|
| **Important notes:** |
| - `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or missing key means that prior is unavailable. |
| - **Intrinsics resolution:** Values should correspond to the **original image resolution**. The pipeline automatically adjusts for inference-time resize + center-crop. |
| - **Extrinsics alignment:** The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior. |
| #### Depth Maps (Folder) |
| Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit). |
|
|
| ``` |
| prior_depth/ |
| βββ image_0001.npy # float32, shape [H, W] |
| βββ image_0002.npy |
| βββ ... |
| ``` |
|
|
| #### Combining Priors |
| Priors can be freely combined. Examples: |
|
|
| ```bash |
| # Only intrinsics |
| python -m hyworld2.worldrecon.pipeline --input_path images/ \ |
| --prior_cam_path camera_intrinsics_only.json |
| # Only depth |
| python -m hyworld2.worldrecon.pipeline --input_path images/ \ |
| --prior_depth_path depth_maps/ |
| # Camera poses + intrinsics + depth |
| python -m hyworld2.worldrecon.pipeline --input_path images/ \ |
| --prior_cam_path camera_params.json \ |
| --prior_depth_path depth_maps/ |
| ``` |
|
|
| --- |
| ### Multi-GPU Inference |
| WorldMirror 2.0 supports **Sequence Parallel (SP)** inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel. |
|
|
| > **Requirement:** The number of input images must be **>= the number of GPUs** (`nproc_per_node`). For example, with 8 GPUs you need at least 8 input images. The pipeline will raise an error if this condition is not met. |
|
|
| ```bash |
| # 2 GPUs with FSDP + bf16 |
| torchrun --nproc_per_node=2 -m hyworld2.worldrecon.pipeline \ |
| --input_path path/to/images \ |
| --use_fsdp --enable_bf16 |
| # 4 GPUs |
| torchrun --nproc_per_node=4 -m hyworld2.worldrecon.pipeline \ |
| --input_path path/to/images \ |
| --use_fsdp --enable_bf16 |
| # Python API (inside a torchrun script) |
| from hyworld2.worldrecon.pipeline import WorldMirrorPipeline |
| pipeline = WorldMirrorPipeline.from_pretrained( |
| 'tencent/HY-World-2.0', |
| use_fsdp=True, |
| enable_bf16=True, |
| ) |
| pipeline('path/to/images') |
| ``` |
|
|
| **What happens under the hood:** |
| 1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`. |
| 2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`. |
| 3. FSDP shards parameters across the SP process group. |
| 4. DPT prediction heads split frames across ranks and `AllGather` results. |
| 5. Post-processing (mask computation, saving) runs on rank 0 only. |
|
|
| --- |
| ### Advanced Options |
| #### Disabling Prediction Heads |
| To save memory when you only need specific outputs: |
|
|
| ```python |
| from hyworld2.worldrecon.pipeline import WorldMirrorPipeline |
| |
| pipeline = WorldMirrorPipeline.from_pretrained( |
| 'tencent/HY-World-2.0', |
| disable_heads=["normal", "points"], # free ~200M params |
| ) |
| ``` |
|
|
| Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`. |
| #### Mask Filtering |
| The pipeline supports three types of output filtering to improve point cloud and Gaussian quality: |
| 1. **Sky mask** (`apply_sky_mask=True`): Removes sky regions using an ONNX-based segmentation model, optionally fused with model-predicted depth masks. |
| 2. **Edge mask** (`apply_edge_mask=True`): Removes points at depth/normal discontinuities (object boundaries). |
| 3. **Confidence mask** (`apply_confidence_mask=False`): Removes the bottom N% of points by prediction confidence. |
| These masks are applied independently to both the `points.ply` (depth-based) and `gaussians.ply` (GS-based) outputs. The GS output uses its own depth predictions for edge detection when available. |
| #### Point Cloud Compression |
| When `compress_pts=True` (default), the depth-derived point cloud undergoes: |
| 1. **Voxel merging**: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging. |
| 2. **Random subsampling**: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled. |
| Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`. |
|
|
| --- |
| ### Gradio App |
| An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser. |
| **Quick start:** |
|
|
| ```bash |
| # Single GPU |
| python -m hyworld2.worldrecon.gradio_app |
| |
| # Multi-GPU |
| torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \ |
| --use_fsdp --enable_bf16 |
| ``` |
|
|
| **With a local checkpoint:** |
|
|
| ```bash |
| python -m hyworld2.worldrecon.gradio_app \ |
| --config_path /path/to/config.yaml \ |
| --ckpt_path /path/to/checkpoint.safetensors |
| ``` |
|
|
| **With a public link (e.g., for Colab or remote servers):** |
|
|
| ```bash |
| python -m hyworld2.worldrecon.gradio_app --share |
| ``` |
|
|
| **Arguments:** |
|
|
| | Argument | Default | Description | |
| |----------|---------|-------------| |
| | `--port` | `8081` | Server port | |
| | `--host` | `0.0.0.0` | Server host | |
| | `--share` | `False` | Create a public Gradio link | |
| | `--examples_dir` | `./examples/worldrecon` | Path to example scenes directory | |
| | `--config_path` | `None` | Training config YAML (used with `--ckpt_path`) | |
| | `--ckpt_path` | `None` | Local checkpoint file (`.ckpt` / `.safetensors`) | |
| | `--use_fsdp` | `False` | Enable FSDP multi-GPU sharding | |
| | `--enable_bf16` | `False` | Enable bfloat16 mixed precision | |
| | `--fsdp_cpu_offload` | `False` | Offload FSDP params to CPU (saves GPU memory) | |
|
|
| > **Important:** In multi-GPU mode, the number of input images must be **>= the number of GPUs**. |
|
|
| --- |
| ## Panorama Generation |
| *Coming soon.* |
| This section will document the panorama generation model, including: |
| - Text-to-panorama and image-to-panorama APIs |
| - Model architecture (MMDiT-based implicit perspective-to-ERP mapping) |
| - Configuration parameters |
| - Output formats |
|
|
| --- |
| ## World Generation |
| *Coming soon.* |
| This section will document the world generation pipeline, including: |
| - Trajectory planning configuration |
| - World expansion with memory-driven video generation |
| - World composition (point cloud expansion + 3DGS optimization) |
| - End-to-end generation from text/image to navigable 3D world |