
HunyuanWorld 2.0 Documentation

This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.

Table of Contents

  • WorldMirror 2.0 (World Reconstruction)
  • Panorama Generation
  • World Generation

WorldMirror 2.0 (World Reconstruction)

Overview

WorldMirror 2.0 is a unified feed-forward model for comprehensive 3D geometric prediction from multi-view images or video. It simultaneously generates:

  • 3D point clouds in world coordinates
  • Per-view depth maps in camera frame
  • Surface normals in camera coordinates
  • Camera poses (c2w) and intrinsics
  • 3D Gaussian Splatting attributes (means, scales, rotations, opacities, SH coefficients)

Key improvements over WorldMirror 1.0:

  • Normalized RoPE for flexible resolution inference
  • Depth mask prediction for robust invalid pixel handling
  • Sequence Parallel + FSDP + BF16 for efficient multi-GPU inference

Python API

WorldMirrorPipeline.from_pretrained

Factory method to load the model and create a pipeline instance.

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    pretrained_model_name_or_path="tencent/HY-World-2.0",
    subfolder="HY-WorldMirror-2.0",
    config_path=None,
    ckpt_path=None,
    use_fsdp=False,
    enable_bf16=False,
    fsdp_cpu_offload=False,
    disable_heads=None,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str | "tencent/HY-World-2.0" | HuggingFace repo ID or local path |
| subfolder | str | "HY-WorldMirror-2.0" | Subfolder inside the repo containing the WorldMirror checkpoint (model.safetensors + config) |
| config_path | str | None | Training config YAML (used with ckpt_path for custom checkpoints) |
| ckpt_path | str | None | Checkpoint file (.ckpt / .safetensors). When provided with config_path, loads the model from a local checkpoint instead of HuggingFace |
| use_fsdp | bool | False | Shard parameters across GPUs via Fully Sharded Data Parallel |
| enable_bf16 | bool | False | Use bfloat16 precision (except in numerically critical layers) |
| fsdp_cpu_offload | bool | False | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) |
| disable_heads | list[str] | None | Heads to disable and free from memory. Options: "camera", "depth", "normal", "points", "gs" |

Notes:

  • Distributed mode is auto-detected from WORLD_SIZE environment variable (set by torchrun).
  • When using multiple GPUs, each rank must call from_pretrained; the method handles dist.init_process_group internally.

WorldMirrorPipeline.__call__

Run inference on a set of images or a video.

```python
result = pipeline(
    input_path,
    output_path="inference_output",
    **kwargs,
)
```

Returns the output directory path (str), or None if the input was skipped.

Inference Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str | (required) | Directory of images or path to a video file |
| output_path | str | "inference_output" | Root output directory |
| target_size | int | 952 | Maximum inference resolution (longest edge). Images are resized and center-cropped to the nearest multiple of 14 |
| fps | int | 1 | FPS for extracting frames from video input |
| video_strategy | str | "new" | Video frame extraction strategy: "new" (motion-aware) or "old" (uniform FPS) |
| video_min_frames | int | 1 | Minimum number of frames to extract from a video |
| video_max_frames | int | 32 | Maximum number of frames to extract from a video |
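
The "nearest multiple of 14" constraint comes from the ViT patch size. The exact resize/crop logic lives inside the pipeline; a minimal sketch of the rounding step, assuming simple floor-rounding (the helper name snap_to_patch is illustrative, not part of the package):

```python
def snap_to_patch(size: int, patch: int = 14) -> int:
    """Floor-round a side length to the nearest multiple of the ViT patch size."""
    return (size // patch) * patch

print(snap_to_patch(952), snap_to_patch(1000))
```

Note that the default target_size of 952 is itself a multiple of 14 (68 patches), so the long edge needs no further adjustment at the default resolution.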

Save Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| save_depth | bool | True | Save per-view depth maps (PNG visualization + NPY raw values) |
| save_normal | bool | True | Save per-view surface normal maps (PNG) |
| save_gs | bool | True | Save 3D Gaussian Splatting as gaussians.ply |
| save_camera | bool | True | Save camera parameters as camera_params.json |
| save_points | bool | True | Save depth-derived point cloud as points.ply |
| save_colmap | bool | False | Save COLMAP-format sparse reconstruction (sparse/0/) |
| save_conf | bool | False | Save depth confidence maps |
| save_sky_mask | bool | False | Save sky segmentation masks |

Mask Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| apply_sky_mask | bool | True | Filter out sky regions from point clouds and Gaussians |
| apply_edge_mask | bool | True | Filter out edge/discontinuity regions |
| apply_confidence_mask | bool | False | Filter out low-confidence predictions |
| sky_mask_source | str | "auto" | Sky mask method: "auto" (ONNX + model fusion), "model" (model predictions only), "onnx" (external segmentation only) |
| model_sky_threshold | float | 0.45 | Threshold for model-based sky detection |
| confidence_percentile | float | 10.0 | Percentile threshold for confidence filtering (bottom N% removed) |
| edge_normal_threshold | float | 1.0 | Normal edge detection tolerance |
| edge_depth_threshold | float | 0.03 | Depth edge detection relative tolerance |

Compression Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| compress_pts | bool | True | Compress point clouds via voxel merging + random sampling |
| compress_pts_max_points | int | 2,000,000 | Maximum number of points after compression |
| compress_pts_voxel_size | float | 0.002 | Voxel size for point cloud merging |
| max_resolution | int | 1920 | Maximum resolution for saved output images |
| compress_gs_max_points | int | 5,000,000 | Maximum number of Gaussians after voxel pruning |

Prior Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| prior_cam_path | str | None | Path to a camera parameters JSON file |
| prior_depth_path | str | None | Path to a directory containing depth map files |

Rendered Video Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| save_rendered | bool | False | Render an interpolated fly-through video from the Gaussian splats |
| render_interp_per_pair | int | 15 | Number of interpolated frames between each camera pair |
| render_depth | bool | False | Also render a depth visualization video |

Misc Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| log_time | bool | True | Print a timing report and save pipeline_timing.json |
| strict_output_path | str | None | If set, save results directly to this path without <case_name>/<timestamp> subdirectories |
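
The save, mask, compression, and rendering options above can be combined freely in a single call. A sketch assembling some of them as keyword arguments (the pipeline call itself is commented out so the snippet runs without model weights; values are illustrative):

```python
# Illustrative keyword-argument bundle for pipeline(input_path, **kwargs).
kwargs = dict(
    save_colmap=True,                 # also write sparse/0/ in COLMAP format
    apply_confidence_mask=True,       # drop low-confidence predictions ...
    confidence_percentile=5.0,        # ... removing only the bottom 5%
    compress_pts_max_points=1_000_000,  # tighter point-cloud budget than the default
    save_rendered=True,               # render a fly-through video from the splats
)
# result = pipeline("path/to/images", **kwargs)
```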

CLI Reference

All __call__ parameters are exposed as CLI arguments:

```shell
python -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --output_path inference_output \
    --target_size 952 \
    --prior_cam_path path/to/camera_params.json \
    --prior_depth_path path/to/depth_dir/
```
Boolean flag conventions:

| Enable | Disable |
|---|---|
| --save_colmap | (omit; off by default) |
| --save_conf | (omit; off by default) |
| --save_sky_mask | (omit; off by default) |
| --apply_sky_mask (default on) | --no_sky_mask |
| --apply_edge_mask (default on) | --no_edge_mask |
| --apply_confidence_mask | (omit; off by default) |
| --compress_pts (default on) | --no_compress_pts |
| --log_time (default on) | --no_log_time |
| --save_depth (default on) | --no_save_depth |
| --save_normal (default on) | --no_save_normal |
| --save_gs (default on) | --no_save_gs |
| --save_camera (default on) | --no_save_camera |
| --save_points (default on) | --no_save_points |
| --save_rendered | (omit; off by default) |
| --render_depth | (omit; off by default) |

Additional CLI-only arguments:

| Argument | Description |
|---|---|
| --config_path | Training config YAML for custom checkpoint loading |
| --ckpt_path | Local checkpoint file path |
| --use_fsdp | Enable FSDP multi-GPU sharding |
| --enable_bf16 | Enable bfloat16 mixed precision |
| --fsdp_cpu_offload | Offload FSDP params to CPU |
| --disable_heads | Space-separated list of heads to disable (e.g. --disable_heads camera normal) |
| --no_interactive | Exit after the first inference (skip the interactive prompt loop) |

Output Format

File Structure

```
inference_output/
└── <case_name>/
    └── <timestamp>/
        ├── depth/
        │   ├── depth_0000.png      # Normalized depth visualization
        │   ├── depth_0000.npy      # Raw float32 depth values [H, W]
        │   └── ...
        ├── normal/
        │   ├── normal_0000.png     # Normal map visualization (RGB)
        │   └── ...
        ├── camera_params.json      # Camera extrinsics & intrinsics
        ├── gaussians.ply           # 3D Gaussian Splatting (standard format)
        ├── points.ply              # Colored point cloud
        ├── sparse/                 # COLMAP format (if --save_colmap)
        │   └── 0/
        │       ├── cameras.bin
        │       ├── images.bin
        │       └── points3D.bin
        ├── rendered/               # Rendered video (if --save_rendered)
        │   ├── rendered_rgb.mp4
        │   └── rendered_depth.mp4  # (if --render_depth)
        └── pipeline_timing.json    # Performance timing report
```

Prediction Dictionary

When using the Python API, pipeline(...) internally produces a predictions dictionary with the following keys:

```python
# Geometry
predictions["depth"]         # [B, S, H, W, 1]  - Z-depth in camera frame
predictions["depth_conf"]    # [B, S, H, W]     - Depth confidence
predictions["normals"]       # [B, S, H, W, 3]  - Surface normals in camera coords
predictions["normals_conf"]  # [B, S, H, W]     - Normal confidence
predictions["pts3d"]         # [B, S, H, W, 3]  - 3D point maps in world coords
predictions["pts3d_conf"]    # [B, S, H, W]     - Point cloud confidence

# Camera
predictions["camera_poses"]  # [B, S, 4, 4]     - Camera-to-world (c2w), OpenCV convention
predictions["camera_intrs"]  # [B, S, 3, 3]     - Camera intrinsic matrices
predictions["camera_params"] # [B, S, 9]        - Compact camera vector (translation, quaternion, fov_v, fov_u)

# 3D Gaussian Splatting
predictions["splats"]["means"]     # [B, N, 3]    - Gaussian centers
predictions["splats"]["scales"]    # [B, N, 3]    - Gaussian scales
predictions["splats"]["quats"]     # [B, N, 4]    - Gaussian rotations (quaternions)
predictions["splats"]["opacities"] # [B, N]       - Gaussian opacities
predictions["splats"]["sh"]        # [B, N, 1, 3] - Spherical harmonics (degree 0)
predictions["splats"]["weights"]   # [B, N]       - Per-Gaussian confidence weights
```

Where B = batch size (always 1 for inference), S = number of input views, H, W = image dimensions, and N = total Gaussians (S × H × W).
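
To make the indexing conventions concrete, here is a sketch using dummy numpy arrays with the documented shapes (the real pipeline returns tensors; the sizes below are arbitrary):

```python
import numpy as np

# Dummy arrays matching the documented layout: B=1 batch, S=4 views, 28x42 pixels.
B, S, H, W = 1, 4, 28, 42
predictions = {
    "depth": np.zeros((B, S, H, W, 1), dtype=np.float32),
    "pts3d": np.zeros((B, S, H, W, 3), dtype=np.float32),
    "camera_poses": np.tile(np.eye(4, dtype=np.float32), (B, S, 1, 1)),
}

depth0 = predictions["depth"][0, 0, ..., 0]     # first view's depth, squeezed to [H, W]
cloud = predictions["pts3d"][0].reshape(-1, 3)  # all views flattened to an [S*H*W, 3] cloud
pose0 = predictions["camera_poses"][0, 0]       # 4x4 c2w matrix of the first view
```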


Prior Injection

WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.

| Prior Type | Condition | Input Format |
|---|---|---|
| Camera Pose | cond_flags[0] | c2w 4×4 matrix (OpenCV convention) |
| Depth Map | cond_flags[1] | Per-view float depth maps |
| Intrinsics | cond_flags[2] | 3×3 intrinsic matrix |

Camera Parameters (JSON)

The camera parameter file follows the same format as the camera_params.json output by the pipeline:

```json
{
  "num_cameras": 2,
  "extrinsics": [
    {
      "camera_id": 0,
      "matrix": [
        [0.98, 0.01, -0.17, 0.52],
        [-0.01, 0.99, 0.01, -0.03],
        [0.17, -0.01, 0.98, 1.20],
        [0.0, 0.0, 0.0, 1.0]
      ]
    }
  ],
  "intrinsics": [
    {
      "camera_id": 0,
      "matrix": [
        [525.0, 0.0, 320.0],
        [0.0, 525.0, 240.0],
        [0.0, 0.0, 1.0]
      ]
    }
  ]
}
```

Field descriptions:

| Field | Description |
|---|---|
| camera_id | Integer index (0, 1, 2, ...) or image filename stem without extension (e.g., "image_0001") |
| extrinsics.matrix | 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention |
| intrinsics.matrix | 3×3 camera intrinsic matrix in pixels (fx, fy = focal lengths; cx, cy = principal point) |

Important notes:

  • extrinsics and intrinsics lists can be provided independently or together. An empty list [] or missing key means that prior is unavailable.
  • Intrinsics resolution: Values should correspond to the original image resolution. The pipeline automatically adjusts for inference-time resize + center-crop.
  • Extrinsics alignment: The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
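
A prior file in this schema can be assembled programmatically. A minimal sketch (the helper make_camera_params is illustrative, not part of the package; the matrices below are placeholders):

```python
import json
import numpy as np

def make_camera_params(c2w_list, K_list):
    """Assemble a prior camera JSON matching the documented schema."""
    return {
        "num_cameras": max(len(c2w_list), len(K_list)),
        "extrinsics": [
            {"camera_id": i, "matrix": np.asarray(m, dtype=float).tolist()}
            for i, m in enumerate(c2w_list)
        ],
        "intrinsics": [
            {"camera_id": i, "matrix": np.asarray(k, dtype=float).tolist()}
            for i, k in enumerate(K_list)
        ],
    }

# Identity pose plus a pinhole intrinsic matrix (fx = fy = 525, cx = 320, cy = 240).
params = make_camera_params(
    [np.eye(4)],
    [[[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]]],
)
with open("camera_params.json", "w") as f:
    json.dump(params, f, indent=2)
```

Either list may be left empty to provide only poses or only intrinsics, matching the notes above.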

Depth Maps (Folder)

Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: .npy, .exr, .png (16-bit).

```
prior_depth/
├── image_0001.npy    # float32, shape [H, W]
├── image_0002.npy
└── ...
```
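
A prior depth file in this layout can be produced with NumPy. A minimal sketch (the filename and resolution are illustrative; the array here is random only as a stand-in for real depth values):

```python
import numpy as np
from pathlib import Path

out = Path("prior_depth")
out.mkdir(exist_ok=True)

# Depth values as float32 in [H, W], one file per input image,
# named after the image's filename stem.
depth = np.random.rand(480, 640).astype(np.float32)
np.save(out / "image_0001.npy", depth)
```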

Combining Priors

Priors can be freely combined. Examples:

```shell
# Only intrinsics
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_intrinsics_only.json

# Only depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_depth_path depth_maps/

# Camera poses + intrinsics + depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json \
    --prior_depth_path depth_maps/
```

Multi-GPU Inference

WorldMirror 2.0 supports Sequence Parallel (SP) inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel.

Requirement: The number of input images must be >= the number of GPUs (nproc_per_node). For example, with 8 GPUs you need at least 8 input images. The pipeline will raise an error if this condition is not met.

```shell
# 2 GPUs with FSDP + bf16
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16

# 4 GPUs
torchrun --nproc_per_node=4 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16
```

```python
# Python API (inside a torchrun-launched script)
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    'tencent/HY-World-2.0',
    use_fsdp=True,
    enable_bf16=True,
)
pipeline('path/to/images')
```

What happens under the hood:

  1. from_pretrained auto-detects WORLD_SIZE > 1 and initializes torch.distributed.
  2. The model is loaded on rank 0 and broadcast via sync_module_states=True.
  3. FSDP shards parameters across the SP process group.
  4. DPT prediction heads split frames across ranks and AllGather results.
  5. Post-processing (mask computation, saving) runs on rank 0 only.
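
The per-rank frame split in step 4 also explains the images >= GPUs requirement above: every rank must receive at least one frame. A sketch of such a split (illustrative only; the actual sharding logic lives inside the pipeline):

```python
def split_frames(num_frames: int, world_size: int) -> list[range]:
    """Partition S frames into contiguous, near-equal chunks, one per rank."""
    if num_frames < world_size:
        raise ValueError("need at least as many input images as GPUs")
    per, extra = divmod(num_frames, world_size)
    splits, start = [], 0
    for rank in range(world_size):
        n = per + (1 if rank < extra else 0)  # the first `extra` ranks take one more
        splits.append(range(start, start + n))
        start += n
    return splits
```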

Advanced Options

Disabling Prediction Heads

To save memory when you only need specific outputs:

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    'tencent/HY-World-2.0',
    disable_heads=["normal", "points"],  # free ~200M params
)
```

Available heads: "camera", "depth", "normal", "points", "gs".

Mask Filtering

The pipeline supports three types of output filtering to improve point cloud and Gaussian quality:

  1. Sky mask (apply_sky_mask=True): Removes sky regions using an ONNX-based segmentation model, optionally fused with model-predicted depth masks.
  2. Edge mask (apply_edge_mask=True): Removes points at depth/normal discontinuities (object boundaries).
  3. Confidence mask (apply_confidence_mask=False): Removes the bottom N% of points by prediction confidence.

These masks are applied independently to both the points.ply (depth-based) and gaussians.ply (GS-based) outputs. The GS output uses its own depth predictions for edge detection when available.
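
Conceptually, the three masks are intersected: a point survives only if every enabled mask keeps it. A hedged numpy sketch with dummy masks (shapes, thresholds, and the flagged regions are illustrative):

```python
import numpy as np

H, W = 4, 6
sky_mask = np.ones((H, W), dtype=bool)
sky_mask[0] = False                     # pretend the top row is sky
edge_mask = np.ones((H, W), dtype=bool)
edge_mask[:, 0] = False                 # pretend the left column is a depth edge
conf = np.linspace(0.0, 1.0, H * W).reshape(H, W)  # dummy confidence values
conf_mask = conf >= np.percentile(conf, 10.0)      # drop the bottom 10%

# Intersection: keep a pixel's point/Gaussian only if all masks agree.
keep = sky_mask & edge_mask & conf_mask
```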

Point Cloud Compression

When compress_pts=True (default), the depth-derived point cloud undergoes:

  1. Voxel merging: Points within each voxel (size controlled by compress_pts_voxel_size) are merged via weighted averaging.
  2. Random subsampling: If the result exceeds compress_pts_max_points, points are uniformly subsampled.

Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to compress_gs_max_points.
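
The two steps can be sketched in numpy as follows. This is a simplified illustration: it uses uniform rather than confidence-weighted averaging, and the function name compress_points is not part of the package:

```python
import numpy as np

def compress_points(pts, colors, voxel_size=0.002, max_points=2_000_000, seed=0):
    """Sketch: merge points per voxel by averaging, then randomly subsample."""
    pts = np.asarray(pts, dtype=float)
    colors = np.asarray(colors, dtype=float)

    # Step 1: voxel merging. Points sharing an integer voxel key are averaged.
    keys = np.floor(pts / voxel_size).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    n = int(inv.max()) + 1
    counts = np.bincount(inv, minlength=n).astype(float)
    merged_pts = np.stack(
        [np.bincount(inv, pts[:, i], n) for i in range(3)], axis=1) / counts[:, None]
    merged_col = np.stack(
        [np.bincount(inv, colors[:, i], n) for i in range(3)], axis=1) / counts[:, None]

    # Step 2: uniform random subsampling down to the point budget.
    if len(merged_pts) > max_points:
        idx = np.random.default_rng(seed).choice(len(merged_pts), max_points, replace=False)
        merged_pts, merged_col = merged_pts[idx], merged_col[idx]
    return merged_pts, merged_col
```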

Gradio App

An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser. Quick start:

```shell
# Single GPU
python -m hyworld2.worldrecon.gradio_app

# Multi-GPU
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
    --use_fsdp --enable_bf16
```

With a local checkpoint:

```shell
python -m hyworld2.worldrecon.gradio_app \
    --config_path /path/to/config.yaml \
    --ckpt_path /path/to/checkpoint.safetensors
```

With a public link (e.g., for Colab or remote servers):

```shell
python -m hyworld2.worldrecon.gradio_app --share
```

Arguments:

| Argument | Default | Description |
|---|---|---|
| --port | 8081 | Server port |
| --host | 0.0.0.0 | Server host |
| --share | False | Create a public Gradio link |
| --examples_dir | ./examples/worldrecon | Path to the example scenes directory |
| --config_path | None | Training config YAML (used with --ckpt_path) |
| --ckpt_path | None | Local checkpoint file (.ckpt / .safetensors) |
| --use_fsdp | False | Enable FSDP multi-GPU sharding |
| --enable_bf16 | False | Enable bfloat16 mixed precision |
| --fsdp_cpu_offload | False | Offload FSDP params to CPU (saves GPU memory) |

Important: In multi-GPU mode, the number of input images must be >= the number of GPUs.


Panorama Generation

Coming soon. This section will document the panorama generation model, including:

  • Text-to-panorama and image-to-panorama APIs
  • Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
  • Configuration parameters
  • Output formats

World Generation

Coming soon. This section will document the world generation pipeline, including:

  • Trajectory planning configuration
  • World expansion with memory-driven video generation
  • World composition (point cloud expansion + 3DGS optimization)
  • End-to-end generation from text/image to navigable 3D world