HY-World-2.0 / DOCUMENTATION.md

Duplicate from tencent/HY-World-2.0

97c4df8 about 1 month ago

20.1 kB

	# HunyuanWorld 2.0 — Documentation
	This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.

	## Table of Contents
	- [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction)
	- [Overview](#overview)
	- [Python API](#python-api)
	- [`WorldMirrorPipeline.from_pretrained`](#worldmirrorpipelinefrom_pretrained)
	- [`WorldMirrorPipeline.__call__`](#worldmirrorpipelinecall)
	- [CLI Reference](#cli-reference)
	- [Output Format](#output-format)
	- [File Structure](#file-structure)
	- [Prediction Dictionary](#prediction-dictionary)
	- [Prior Injection](#prior-injection)
	- [Camera Parameters (JSON)](#camera-parameters-json)
	- [Depth Maps (Folder)](#depth-maps-folder)
	- [Combining Priors](#combining-priors)
	- [Multi-GPU Inference](#multi-gpu-inference)
	- [Advanced Options](#advanced-options)
	- [Disabling Prediction Heads](#disabling-prediction-heads)
	- [Mask Filtering](#mask-filtering)
	- [Point Cloud Compression](#point-cloud-compression)
	- [Gradio App](#gradio-app)
	- [Panorama Generation](#panorama-generation)
	- [World Generation](#world-generation)

	---
	## WorldMirror 2.0 (World Reconstruction)
	### Overview
	WorldMirror 2.0 is a unified feed-forward model for comprehensive 3D geometric prediction from multi-view images or video. It simultaneously generates:
	- 3D point clouds in world coordinates
	- Per-view depth maps in camera frame
	- Surface normals in camera coordinates
	- Camera poses (c2w) and intrinsics
	- 3D Gaussian Splatting attributes (means, scales, rotations, opacities, SH coefficients)

	Key improvements over WorldMirror 1.0:
	- Normalized RoPE for flexible resolution inference
	- Depth mask prediction for robust invalid pixel handling
	- Sequence Parallel + FSDP + BF16 for efficient multi-GPU inference

	---
	### Python API
	#### `WorldMirrorPipeline.from_pretrained`
	Factory method to load the model and create a pipeline instance.

	```python
	from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

	pipeline = WorldMirrorPipeline.from_pretrained(
	pretrained_model_name_or_path="tencent/HY-World-2.0",
	subfolder="HY-WorldMirror-2.0",
	config_path=None,
	ckpt_path=None,
	use_fsdp=False,
	enable_bf16=False,
	fsdp_cpu_offload=False,
	disable_heads=None,
	)
	```

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `pretrained_model_name_or_path` \| `str` \| `"tencent/HY-World-2.0"` \| HuggingFace repo ID or local path \|
	\| `subfolder` \| `str` \| `"HY-WorldMirror-2.0"` \| Subfolder inside the repo containing WorldMirror checkpoint (`model.safetensors` + config) \|
	\| `config_path` \| `str` \| `None` \| Training config YAML (used with `ckpt_path` for custom checkpoints) \|
	\| `ckpt_path` \| `str` \| `None` \| Checkpoint file (`.ckpt` / `.safetensors`). When provided with `config_path`, loads model from local checkpoint instead of HuggingFace \|
	\| `use_fsdp` \| `bool` \| `False` \| Shard parameters across GPUs via Fully Sharded Data Parallel \|
	\| `enable_bf16` \| `bool` \| `False` \| Use bfloat16 precision (except numerically critical layers) \|
	\| `fsdp_cpu_offload` \| `bool` \| `False` \| Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) \|
	\| `disable_heads` \| `list[str]` \| `None` \| Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` \|

	Notes:
	- Distributed mode is auto-detected from `WORLD_SIZE` environment variable (set by `torchrun`).
	- When using multi-GPU, each rank must call `from_pretrained` — the method handles `dist.init_process_group` internally.

	---
	#### `WorldMirrorPipeline.__call__`
	Run inference on a set of images or a video.

	```python
	result = pipeline(
	input_path,
	output_path="inference_output",
	**kwargs,
	)
	```

	Returns the output directory path (`str`), or `None` if the input was skipped.

	Inference Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `input_path` \| `str` \| (required) \| Directory of images or path to a video file \|
	\| `output_path` \| `str` \| `"inference_output"` \| Root output directory \|
	\| `target_size` \| `int` \| `952` \| Maximum inference resolution (longest edge). Images are resized + center-cropped to the nearest multiple of 14 \|
	\| `fps` \| `int` \| `1` \| FPS for extracting frames from video input \|
	\| `video_strategy` \| `str` \| `"new"` \| Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) \|
	\| `video_min_frames` \| `int` \| `1` \| Minimum number of frames to extract from video \|
	\| `video_max_frames` \| `int` \| `32` \| Maximum number of frames to extract from video \|

	Save Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `save_depth` \| `bool` \| `True` \| Save per-view depth maps (PNG visualization + NPY raw values) \|
	\| `save_normal` \| `bool` \| `True` \| Save per-view surface normal maps (PNG) \|
	\| `save_gs` \| `bool` \| `True` \| Save 3D Gaussian Splatting as `gaussians.ply` \|
	\| `save_camera` \| `bool` \| `True` \| Save camera parameters as `camera_params.json` \|
	\| `save_points` \| `bool` \| `True` \| Save depth-derived point cloud as `points.ply` \|
	\| `save_colmap` \| `bool` \| `False` \| Save COLMAP-format sparse reconstruction (`sparse/0/`) \|
	\| `save_conf` \| `bool` \| `False` \| Save depth confidence maps \|
	\| `save_sky_mask` \| `bool` \| `False` \| Save sky segmentation masks \|

	Mask Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `apply_sky_mask` \| `bool` \| `True` \| Filter out sky regions from point clouds and Gaussians \|
	\| `apply_edge_mask` \| `bool` \| `True` \| Filter out edge/discontinuity regions \|
	\| `apply_confidence_mask` \| `bool` \| `False` \| Filter out low-confidence predictions \|
	\| `sky_mask_source` \| `str` \| `"auto"` \| Sky mask method: `"auto"` (ONNX + model fusion), `"model"` (model predictions only), `"onnx"` (external segmentation only) \|
	\| `model_sky_threshold` \| `float` \| `0.45` \| Threshold for model-based sky detection \|
	\| `confidence_percentile` \| `float` \| `10.0` \| Percentile threshold for confidence filtering (bottom N% removed) \|
	\| `edge_normal_threshold` \| `float` \| `1.0` \| Normal edge detection tolerance \|
	\| `edge_depth_threshold` \| `float` \| `0.03` \| Depth edge detection relative tolerance \|

	Compression Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `compress_pts` \| `bool` \| `True` \| Compress point clouds via voxel merging + random sampling \|
	\| `compress_pts_max_points` \| `int` \| `2,000,000` \| Maximum number of points after compression \|
	\| `compress_pts_voxel_size` \| `float` \| `0.002` \| Voxel size for point cloud merging \|
	\| `max_resolution` \| `int` \| `1920` \| Maximum resolution for saved output images \|
	\| `compress_gs_max_points` \| `int` \| `5,000,000` \| Maximum number of Gaussians after voxel pruning \|

	Prior Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `prior_cam_path` \| `str` \| `None` \| Path to camera parameters JSON file \|
	\| `prior_depth_path` \| `str` \| `None` \| Path to directory containing depth map files \|

	Rendered Video Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `save_rendered` \| `bool` \| `False` \| Render interpolated fly-through video from Gaussian splats \|
	\| `render_interp_per_pair` \| `int` \| `15` \| Number of interpolated frames between each camera pair \|
	\| `render_depth` \| `bool` \| `False` \| Also render a depth visualization video \|

	Misc Parameters:

	\| Parameter \| Type \| Default \| Description \|
	\|-----------\|------\|---------\|-------------\|
	\| `log_time` \| `bool` \| `True` \| Print timing report and save `pipeline_timing.json` \|
	\| `strict_output_path` \| `str` \| `None` \| If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories \|

	---
	### CLI Reference
	All `__call__` parameters are exposed as CLI arguments:

	```bash
	python -m hyworld2.worldrecon.pipeline \
	--input_path path/to/images \
	--output_path inference_output \
	--target_size 952 \
	--prior_cam_path path/to/camera_params.json \
	--prior_depth_path path/to/depth_dir/ \
	```

	Boolean flag conventions:

	\| Enable \| Disable \|
	\|--------\|---------\|
	\| `--save_colmap` \| (omit) \|
	\| `--save_conf` \| (omit) \|
	\| `--save_sky_mask` \| (omit) \|
	\| `--apply_sky_mask` (default on) \| `--no_sky_mask` \|
	\| `--apply_edge_mask` (default on) \| `--no_edge_mask` \|
	\| `--apply_confidence_mask` \| (omit) \|
	\| `--compress_pts` (default on) \| `--no_compress_pts` \|
	\| `--log_time` (default on) \| `--no_log_time` \|
	\| (default on) `save_depth` \| `--no_save_depth` \|
	\| (default on) `save_normal` \| `--no_save_normal` \|
	\| (default on) `save_gs` \| `--no_save_gs` \|
	\| (default on) `save_camera` \| `--no_save_camera` \|
	\| (default on) `save_points` \| `--no_save_points` \|
	\| `--save_rendered` \| (omit) \|
	\| `--render_depth` \| (omit) \|

	Additional CLI-only arguments:

	\| Argument \| Description \|
	\|----------\|-------------\|
	\| `--config_path` \| Training config YAML for custom checkpoint loading \|
	\| `--ckpt_path` \| Local checkpoint file path \|
	\| `--use_fsdp` \| Enable FSDP multi-GPU sharding \|
	\| `--enable_bf16` \| Enable bfloat16 mixed precision \|
	\| `--fsdp_cpu_offload` \| Offload FSDP params to CPU \|
	\| `--disable_heads` \| Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) \|
	\| `--no_interactive` \| Exit after first inference (skip interactive prompt loop) \|

	---
	### Output Format
	#### File Structure

	```
	inference_output/
	└── <case_name>/
	└── <timestamp>/
	├── depth/
	│ ├── depth_0000.png # Normalized depth visualization
	│ ├── depth_0000.npy # Raw float32 depth values [H, W]
	│ └── ...
	├── normal/
	│ ├── normal_0000.png # Normal map visualization (RGB)
	│ └── ...
	├── camera_params.json # Camera extrinsics & intrinsics
	├── gaussians.ply # 3D Gaussian Splatting (standard format)
	├── points.ply # Colored point cloud
	├── sparse/ # COLMAP format (if --save_colmap)
	│ └── 0/
	│ ├── cameras.bin
	│ ├── images.bin
	│ └── points3D.bin
	├── rendered/ # Rendered video (if --save_rendered)
	│ ├── rendered_rgb.mp4
	│ └── rendered_depth.mp4 # (if --render_depth)
	└── pipeline_timing.json # Performance timing report
	```

	#### Prediction Dictionary
	When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys:

	```python
	# Geometry
	predictions["depth"] # [B, S, H, W, 1] — Z-depth in camera frame
	predictions["depth_conf"] # [B, S, H, W] — Depth confidence
	predictions["normals"] # [B, S, H, W, 3] — Surface normals in camera coords
	predictions["normals_conf"] # [B, S, H, W] — Normal confidence
	predictions["pts3d"] # [B, S, H, W, 3] — 3D point maps in world coords
	predictions["pts3d_conf"] # [B, S, H, W] — Point cloud confidence
	# Camera
	predictions["camera_poses"] # [B, S, 4, 4] — Camera-to-world (c2w), OpenCV convention
	predictions["camera_intrs"] # [B, S, 3, 3] — Camera intrinsic matrices
	predictions["camera_params"]# [B, S, 9] — Compact camera vector (translation, quaternion, fov_v, fov_u)
	# 3D Gaussian Splatting
	predictions["splats"]["means"] # [B, N, 3] — Gaussian centers
	predictions["splats"]["scales"] # [B, N, 3] — Gaussian scales
	predictions["splats"]["quats"] # [B, N, 4] — Gaussian rotations (quaternions)
	predictions["splats"]["opacities"] # [B, N] — Gaussian opacities
	predictions["splats"]["sh"] # [B, N, 1, 3] — Spherical harmonics (degree 0)
	predictions["splats"]["weights"] # [B, N] — Per-Gaussian confidence weights
	```

	Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, `N` = total Gaussians (`S × H × W`).

	---
	### Prior Injection
	WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.

	\| Prior Type \| Condition \| Input Format \|
	\|------------\|-----------\|--------------\|
	\| Camera Pose \| `cond_flags[0]` \| c2w 4×4 matrix (OpenCV convention) \|
	\| Depth Map \| `cond_flags[1]` \| Per-view float depth maps \|
	\| Intrinsics \| `cond_flags[2]` \| 3×3 intrinsic matrix \|

	#### Camera Parameters (JSON)
	The camera parameter file follows the same format as the `camera_params.json` output by the pipeline:

	```json
	{
	"num_cameras": 2,
	"extrinsics": [
	{
	"camera_id": 0,
	"matrix": [
	[0.98, 0.01, -0.17, 0.52],
	[-0.01, 0.99, 0.01, -0.03],
	[0.17, -0.01, 0.98, 1.20],
	[0.0, 0.0, 0.0, 1.0]
	]
	}
	],
	"intrinsics": [
	{
	"camera_id": 0,
	"matrix": [
	[525.0, 0.0, 320.0],
	[0.0, 525.0, 240.0],
	[0.0, 0.0, 1.0]
	]
	}
	]
	}
	```

	Field descriptions:

	\| Field \| Description \|
	\|-------\|-------------\|
	\| `camera_id` \| Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) \|
	\| `extrinsics.matrix` \| 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention \|
	\| `intrinsics.matrix` \| 3×3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) \|

	Important notes:
	- `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or missing key means that prior is unavailable.
	- Intrinsics resolution: Values should correspond to the original image resolution. The pipeline automatically adjusts for inference-time resize + center-crop.
	- Extrinsics alignment: The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
	#### Depth Maps (Folder)
	Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit).

	```
	prior_depth/
	├── image_0001.npy # float32, shape [H, W]
	├── image_0002.npy
	└── ...
	```

	#### Combining Priors
	Priors can be freely combined. Examples:

	```bash
	# Only intrinsics
	python -m hyworld2.worldrecon.pipeline --input_path images/ \
	--prior_cam_path camera_intrinsics_only.json
	# Only depth
	python -m hyworld2.worldrecon.pipeline --input_path images/ \
	--prior_depth_path depth_maps/
	# Camera poses + intrinsics + depth
	python -m hyworld2.worldrecon.pipeline --input_path images/ \
	--prior_cam_path camera_params.json \
	--prior_depth_path depth_maps/
	```

	---
	### Multi-GPU Inference
	WorldMirror 2.0 supports Sequence Parallel (SP) inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel.

	> Requirement: The number of input images must be >= the number of GPUs (`nproc_per_node`). For example, with 8 GPUs you need at least 8 input images. The pipeline will raise an error if this condition is not met.

	```bash
	# 2 GPUs with FSDP + bf16
	torchrun --nproc_per_node=2 -m hyworld2.worldrecon.pipeline \
	--input_path path/to/images \
	--use_fsdp --enable_bf16
	# 4 GPUs
	torchrun --nproc_per_node=4 -m hyworld2.worldrecon.pipeline \
	--input_path path/to/images \
	--use_fsdp --enable_bf16
	# Python API (inside a torchrun script)
	from hyworld2.worldrecon.pipeline import WorldMirrorPipeline
	pipeline = WorldMirrorPipeline.from_pretrained(
	'tencent/HY-World-2.0',
	use_fsdp=True,
	enable_bf16=True,
	)
	pipeline('path/to/images')
	```

	What happens under the hood:
	1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`.
	2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`.
	3. FSDP shards parameters across the SP process group.
	4. DPT prediction heads split frames across ranks and `AllGather` results.
	5. Post-processing (mask computation, saving) runs on rank 0 only.

	---
	### Advanced Options
	#### Disabling Prediction Heads
	To save memory when you only need specific outputs:

	```python
	from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

	pipeline = WorldMirrorPipeline.from_pretrained(
	'tencent/HY-World-2.0',
	disable_heads=["normal", "points"], # free ~200M params
	)
	```

	Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`.
	#### Mask Filtering
	The pipeline supports three types of output filtering to improve point cloud and Gaussian quality:
	1. Sky mask (`apply_sky_mask=True`): Removes sky regions using an ONNX-based segmentation model, optionally fused with model-predicted depth masks.
	2. Edge mask (`apply_edge_mask=True`): Removes points at depth/normal discontinuities (object boundaries).
	3. Confidence mask (`apply_confidence_mask=False`): Removes the bottom N% of points by prediction confidence.
	These masks are applied independently to both the `points.ply` (depth-based) and `gaussians.ply` (GS-based) outputs. The GS output uses its own depth predictions for edge detection when available.
	#### Point Cloud Compression
	When `compress_pts=True` (default), the depth-derived point cloud undergoes:
	1. Voxel merging: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging.
	2. Random subsampling: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled.
	Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`.

	---
	### Gradio App
	An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser.
	Quick start:

	```bash
	# Single GPU
	python -m hyworld2.worldrecon.gradio_app

	# Multi-GPU
	torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
	--use_fsdp --enable_bf16
	```

	With a local checkpoint:

	```bash
	python -m hyworld2.worldrecon.gradio_app \
	--config_path /path/to/config.yaml \
	--ckpt_path /path/to/checkpoint.safetensors
	```

	With a public link (e.g., for Colab or remote servers):

	```bash
	python -m hyworld2.worldrecon.gradio_app --share
	```

	Arguments:

	\| Argument \| Default \| Description \|
	\|----------\|---------\|-------------\|
	\| `--port` \| `8081` \| Server port \|
	\| `--host` \| `0.0.0.0` \| Server host \|
	\| `--share` \| `False` \| Create a public Gradio link \|
	\| `--examples_dir` \| `./examples/worldrecon` \| Path to example scenes directory \|
	\| `--config_path` \| `None` \| Training config YAML (used with `--ckpt_path`) \|
	\| `--ckpt_path` \| `None` \| Local checkpoint file (`.ckpt` / `.safetensors`) \|
	\| `--use_fsdp` \| `False` \| Enable FSDP multi-GPU sharding \|
	\| `--enable_bf16` \| `False` \| Enable bfloat16 mixed precision \|
	\| `--fsdp_cpu_offload` \| `False` \| Offload FSDP params to CPU (saves GPU memory) \|

	> Important: In multi-GPU mode, the number of input images must be >= the number of GPUs.

	---
	## Panorama Generation
	Coming soon.
	This section will document the panorama generation model, including:
	- Text-to-panorama and image-to-panorama APIs
	- Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
	- Configuration parameters
	- Output formats

	---
	## World Generation
	Coming soon.
	This section will document the world generation pipeline, including:
	- Trajectory planning configuration
	- World expansion with memory-driven video generation
	- World composition (point cloud expansion + 3DGS optimization)
	- End-to-end generation from text/image to navigable 3D world