# HunyuanWorld 2.0 — Documentation
This document provides detailed usage guides, parameter references, and output format specifications for each component of HunyuanWorld 2.0.

## Table of Contents
- [WorldMirror 2.0 (World Reconstruction)](#worldmirror-20-world-reconstruction)
  - [Overview](#overview)
  - [Python API](#python-api)
    - [`WorldMirrorPipeline.from_pretrained`](#worldmirrorpipelinefrom_pretrained)
    - [`WorldMirrorPipeline.__call__`](#worldmirrorpipeline__call__)
  - [CLI Reference](#cli-reference)
  - [Output Format](#output-format)
    - [File Structure](#file-structure)
    - [Prediction Dictionary](#prediction-dictionary)
  - [Prior Injection](#prior-injection)
    - [Camera Parameters (JSON)](#camera-parameters-json)
    - [Depth Maps (Folder)](#depth-maps-folder)
    - [Combining Priors](#combining-priors)
  - [Multi-GPU Inference](#multi-gpu-inference)
  - [Advanced Options](#advanced-options)
    - [Disabling Prediction Heads](#disabling-prediction-heads)
    - [Mask Filtering](#mask-filtering)
    - [Point Cloud Compression](#point-cloud-compression)
  - [Gradio App](#gradio-app)
- [Panorama Generation](#panorama-generation)
- [World Generation](#world-generation)

---
## WorldMirror 2.0 (World Reconstruction)
### Overview
WorldMirror 2.0 is a unified feed-forward model for comprehensive 3D geometric prediction from multi-view images or video. It simultaneously generates:
- **3D point clouds** in world coordinates
- **Per-view depth maps** in camera frame
- **Surface normals** in camera coordinates
- **Camera poses** (c2w) and **intrinsics**
- **3D Gaussian Splatting** attributes (means, scales, rotations, opacities, SH coefficients)

Key improvements over WorldMirror 1.0:
- **Normalized RoPE** for flexible resolution inference
- **Depth mask prediction** for robust invalid pixel handling
- **Sequence Parallel + FSDP + BF16** for efficient multi-GPU inference

---
### Python API
#### `WorldMirrorPipeline.from_pretrained`
Factory method to load the model and create a pipeline instance.

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    pretrained_model_name_or_path="tencent/HY-World-2.0",
    subfolder="HY-WorldMirror-2.0",
    config_path=None,
    ckpt_path=None,
    use_fsdp=False,
    enable_bf16=False,
    fsdp_cpu_offload=False,
    disable_heads=None,
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pretrained_model_name_or_path` | `str` | `"tencent/HY-World-2.0"` | HuggingFace repo ID or local path |
| `subfolder` | `str` | `"HY-WorldMirror-2.0"` | Subfolder inside the repo containing WorldMirror checkpoint (`model.safetensors` + config) |
| `config_path` | `str` | `None` | Training config YAML (used with `ckpt_path` for custom checkpoints) |
| `ckpt_path` | `str` | `None` | Checkpoint file (`.ckpt` / `.safetensors`). When provided with `config_path`, loads model from local checkpoint instead of HuggingFace |
| `use_fsdp` | `bool` | `False` | Shard parameters across GPUs via Fully Sharded Data Parallel |
| `enable_bf16` | `bool` | `False` | Use bfloat16 precision (except numerically critical layers) |
| `fsdp_cpu_offload` | `bool` | `False` | Offload FSDP parameters to CPU (saves GPU memory at the cost of speed) |
| `disable_heads` | `list[str]` | `None` | Heads to disable and free from memory. Options: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"` |

**Notes:**
- Distributed mode is auto-detected from `WORLD_SIZE` environment variable (set by `torchrun`).
- When using multi-GPU, each rank must call `from_pretrained` — the method handles `dist.init_process_group` internally.
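
The auto-detection described in the notes above can be pictured with a short sketch. This is not the pipeline's actual code: `detect_distributed` is a hypothetical helper, and the only fact it relies on is that `torchrun` exports `WORLD_SIZE` and `RANK` to every worker process.

```python
import os


def detect_distributed(env=None):
    """Return (is_distributed, world_size, rank) from torchrun-style variables."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))  # unset means a single process
    rank = int(env.get("RANK", "0"))
    return world_size > 1, world_size, rank


# A plain `python script.py` launch sets neither variable.
single = detect_distributed({})
# `torchrun --nproc_per_node=4` exports WORLD_SIZE=4 and a distinct RANK per worker.
worker = detect_distributed({"WORLD_SIZE": "4", "RANK": "2"})
print(single, worker)
```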

---
#### `WorldMirrorPipeline.__call__`
Run inference on a set of images or a video.

```python
result = pipeline(
    input_path,
    output_path="inference_output",
    **kwargs,
)
```

Returns the output directory path (`str`), or `None` if the input was skipped.

**Inference Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | `str` | *(required)* | Directory of images or path to a video file |
| `output_path` | `str` | `"inference_output"` | Root output directory |
| `target_size` | `int` | `952` | Maximum inference resolution (longest edge). Images are resized and center-cropped so that each side is the nearest multiple of 14 |
| `fps` | `int` | `1` | FPS for extracting frames from video input |
| `video_strategy` | `str` | `"new"` | Video frame extraction strategy: `"new"` (motion-aware) or `"old"` (uniform FPS) |
| `video_min_frames` | `int` | `1` | Minimum number of frames to extract from video |
| `video_max_frames` | `int` | `32` | Maximum number of frames to extract from video |

**Save Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_depth` | `bool` | `True` | Save per-view depth maps (PNG visualization + NPY raw values) |
| `save_normal` | `bool` | `True` | Save per-view surface normal maps (PNG) |
| `save_gs` | `bool` | `True` | Save 3D Gaussian Splatting as `gaussians.ply` |
| `save_camera` | `bool` | `True` | Save camera parameters as `camera_params.json` |
| `save_points` | `bool` | `True` | Save depth-derived point cloud as `points.ply` |
| `save_colmap` | `bool` | `False` | Save COLMAP-format sparse reconstruction (`sparse/0/`) |
| `save_conf` | `bool` | `False` | Save depth confidence maps |
| `save_sky_mask` | `bool` | `False` | Save sky segmentation masks |

**Mask Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `apply_sky_mask` | `bool` | `True` | Filter out sky regions from point clouds and Gaussians |
| `apply_edge_mask` | `bool` | `True` | Filter out edge/discontinuity regions |
| `apply_confidence_mask` | `bool` | `False` | Filter out low-confidence predictions |
| `sky_mask_source` | `str` | `"auto"` | Sky mask method: `"auto"` (ONNX + model fusion), `"model"` (model predictions only), `"onnx"` (external segmentation only) |
| `model_sky_threshold` | `float` | `0.45` | Threshold for model-based sky detection |
| `confidence_percentile` | `float` | `10.0` | Percentile threshold for confidence filtering (bottom N% removed) |
| `edge_normal_threshold` | `float` | `1.0` | Normal edge detection tolerance |
| `edge_depth_threshold` | `float` | `0.03` | Depth edge detection relative tolerance |

**Compression Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `compress_pts` | `bool` | `True` | Compress point clouds via voxel merging + random sampling |
| `compress_pts_max_points` | `int` | `2_000_000` | Maximum number of points after compression |
| `compress_pts_voxel_size` | `float` | `0.002` | Voxel size for point cloud merging |
| `max_resolution` | `int` | `1920` | Maximum resolution for saved output images |
| `compress_gs_max_points` | `int` | `5_000_000` | Maximum number of Gaussians after voxel pruning |

**Prior Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prior_cam_path` | `str` | `None` | Path to camera parameters JSON file |
| `prior_depth_path` | `str` | `None` | Path to directory containing depth map files |

**Rendered Video Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `save_rendered` | `bool` | `False` | Render interpolated fly-through video from Gaussian splats |
| `render_interp_per_pair` | `int` | `15` | Number of interpolated frames between each camera pair |
| `render_depth` | `bool` | `False` | Also render a depth visualization video |

**Misc Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `log_time` | `bool` | `True` | Print timing report and save `pipeline_timing.json` |
| `strict_output_path` | `str` | `None` | If set, save results directly to this path without `<case_name>/<timestamp>` subdirectories |

---
### CLI Reference
All `__call__` parameters are exposed as CLI arguments:

```bash
python -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --output_path inference_output \
    --target_size 952 \
    --prior_cam_path path/to/camera_params.json \
    --prior_depth_path path/to/depth_dir/
```

**Boolean flag conventions:**

| Enable | Disable |
|--------|---------|
| `--save_colmap` | *(omit)* |
| `--save_conf` | *(omit)* |
| `--save_sky_mask` | *(omit)* |
| `--apply_sky_mask` (default on) | `--no_sky_mask` |
| `--apply_edge_mask` (default on) | `--no_edge_mask` |
| `--apply_confidence_mask` | *(omit)* |
| `--compress_pts` (default on) | `--no_compress_pts` |
| `--log_time` (default on) | `--no_log_time` |
| *(default on)* `save_depth` | `--no_save_depth` |
| *(default on)* `save_normal` | `--no_save_normal` |
| *(default on)* `save_gs` | `--no_save_gs` |
| *(default on)* `save_camera` | `--no_save_camera` |
| *(default on)* `save_points` | `--no_save_points` |
| `--save_rendered` | *(omit)* |
| `--render_depth` | *(omit)* |
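
The enable/disable convention in the table maps naturally onto `argparse`. The sketch below is an illustration of that pattern only, not the package's real parser: it registers three of the flags listed above and shows how a `--no_*` flag can write `False` into a destination that defaults to `True`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Off by default: passing the flag turns the option on.
    parser.add_argument("--save_colmap", action="store_true")
    # On by default: the --no_* flag stores False in the same destination.
    parser.add_argument("--no_sky_mask", dest="apply_sky_mask", action="store_false")
    parser.add_argument("--no_save_depth", dest="save_depth", action="store_false")
    return parser


args = build_parser().parse_args(["--save_colmap", "--no_sky_mask"])
print(vars(args))
```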

**Additional CLI-only arguments:**

| Argument | Description |
|----------|-------------|
| `--config_path` | Training config YAML for custom checkpoint loading |
| `--ckpt_path` | Local checkpoint file path |
| `--use_fsdp` | Enable FSDP multi-GPU sharding |
| `--enable_bf16` | Enable bfloat16 mixed precision |
| `--fsdp_cpu_offload` | Offload FSDP params to CPU |
| `--disable_heads` | Space-separated list of heads to disable (e.g. `--disable_heads camera normal`) |
| `--no_interactive` | Exit after first inference (skip interactive prompt loop) |

---
### Output Format
#### File Structure

```
inference_output/
└── <case_name>/
    └── <timestamp>/
        ├── depth/
        │   ├── depth_0000.png      # Normalized depth visualization
        │   ├── depth_0000.npy      # Raw float32 depth values [H, W]
        │   └── ...
        ├── normal/
        │   ├── normal_0000.png     # Normal map visualization (RGB)
        │   └── ...
        ├── camera_params.json      # Camera extrinsics & intrinsics
        ├── gaussians.ply           # 3D Gaussian Splatting (standard format)
        ├── points.ply              # Colored point cloud
        ├── sparse/                 # COLMAP format (if --save_colmap)
        │   └── 0/
        │       ├── cameras.bin
        │       ├── images.bin
        │       └── points3D.bin
        ├── rendered/               # Rendered video (if --save_rendered)
        │   ├── rendered_rgb.mp4
        │   └── rendered_depth.mp4  # (if --render_depth)
        └── pipeline_timing.json    # Performance timing report
```
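
As a rough illustration of consuming this layout, the sketch below loads the raw depth maps and the camera file from one run directory. `load_run` is a hypothetical helper, and the case name and timestamp in the demo are made up; the demo builds a tiny fake run directory so the snippet runs without the model.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_run(run_dir):
    """Load raw depth maps and camera parameters from one run directory."""
    run_dir = Path(run_dir)
    depth_files = sorted((run_dir / "depth").glob("depth_*.npy"))
    depths = [np.load(path) for path in depth_files]  # each [H, W]
    camera_file = run_dir / "camera_params.json"
    cameras = json.loads(camera_file.read_text()) if camera_file.exists() else None
    return depths, cameras


# Demo on a fake run directory (hypothetical case name and timestamp).
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "my_case" / "20250101_000000"
    (run / "depth").mkdir(parents=True)
    np.save(run / "depth" / "depth_0000.npy", np.ones((4, 6), dtype=np.float32))
    fake_cameras = {"num_cameras": 1, "extrinsics": [], "intrinsics": []}
    (run / "camera_params.json").write_text(json.dumps(fake_cameras))
    depths, cameras = load_run(run)

print(len(depths), depths[0].shape, cameras["num_cameras"])
```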

#### Prediction Dictionary
When using the Python API, `pipeline(...)` internally produces a `predictions` dictionary with the following keys:

```python
# Geometry
predictions["depth"]        # [B, S, H, W, 1]  — Z-depth in camera frame
predictions["depth_conf"]   # [B, S, H, W]     — Depth confidence
predictions["normals"]      # [B, S, H, W, 3]  — Surface normals in camera coords
predictions["normals_conf"] # [B, S, H, W]     — Normal confidence
predictions["pts3d"]        # [B, S, H, W, 3]  — 3D point maps in world coords
predictions["pts3d_conf"]   # [B, S, H, W]     — Point cloud confidence
# Camera
predictions["camera_poses"] # [B, S, 4, 4]     — Camera-to-world (c2w), OpenCV convention
predictions["camera_intrs"] # [B, S, 3, 3]     — Camera intrinsic matrices
predictions["camera_params"]# [B, S, 9]        — Compact camera vector (translation, quaternion, fov_v, fov_u)
# 3D Gaussian Splatting
predictions["splats"]["means"]      # [B, N, 3] — Gaussian centers
predictions["splats"]["scales"]     # [B, N, 3] — Gaussian scales
predictions["splats"]["quats"]      # [B, N, 4] — Gaussian rotations (quaternions)
predictions["splats"]["opacities"]  # [B, N]    — Gaussian opacities
predictions["splats"]["sh"]         # [B, N, 1, 3] — Spherical harmonics (degree 0)
predictions["splats"]["weights"]    # [B, N]    — Per-Gaussian confidence weights
```

Where `B` = batch size (always 1 for inference), `S` = number of input views, `H, W` = image dimensions, `N` = total Gaussians (`S × H × W`).
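
To show how the geometry and camera entries relate, the sketch below lifts one Z-depth map to world-space points using an intrinsic matrix and a c2w pose in the OpenCV convention. It is a generic pinhole back-projection, not code taken from the pipeline, and it assumes pixel coordinates are integer indices (no half-pixel offset). The toy arrays stand in for slices of `predictions`; real tensors may first need converting to NumPy.

```python
import numpy as np


def backproject(depth, K, c2w):
    """Lift a Z-depth map [H, W] to world-space points [H, W, 3]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns and rows
    x = (u - K[0, 2]) / K[0, 0] * depth  # OpenCV camera frame: x right
    y = (v - K[1, 2]) / K[1, 1] * depth  # y down
    pts_cam = np.stack([x, y, depth], axis=-1)  # z forward
    return pts_cam @ c2w[:3, :3].T + c2w[:3, 3]


# Toy stand-ins for predictions["depth"][0, i, ..., 0],
# predictions["camera_intrs"][0, i] and predictions["camera_poses"][0, i].
depth = np.full((2, 4), 2.0)
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 0.0, 0.0]
points = backproject(depth, K, c2w)
print(points.shape, points[1, 2])
```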

---
### Prior Injection
WorldMirror 2.0 accepts three types of geometric priors as conditioning inputs. Priors are automatically detected from the provided files.

| Prior Type | Condition | Input Format |
|------------|-----------|--------------|
| Camera Pose | `cond_flags[0]` | c2w 4×4 matrix (OpenCV convention) |
| Depth Map | `cond_flags[1]` | Per-view float depth maps |
| Intrinsics | `cond_flags[2]` | 3×3 intrinsic matrix |

#### Camera Parameters (JSON)
The camera parameter file follows the same format as the `camera_params.json` output by the pipeline:

```json
{
  "num_cameras": 1,
  "extrinsics": [
    {
      "camera_id": 0,
      "matrix": [
        [0.98, 0.01, -0.17, 0.52],
        [-0.01, 0.99, 0.01, -0.03],
        [0.17, -0.01, 0.98, 1.20],
        [0.0, 0.0, 0.0, 1.0]
      ]
    }
  ],
  "intrinsics": [
    {
      "camera_id": 0,
      "matrix": [
        [525.0, 0.0, 320.0],
        [0.0, 525.0, 240.0],
        [0.0, 0.0, 1.0]
      ]
    }
  ]
}
```

**Field descriptions:**

| Field | Description |
|-------|-------------|
| `camera_id` | Integer index (`0`, `1`, `2`, ...) or image filename stem without extension (e.g., `"image_0001"`) |
| `extrinsics.matrix` | 4×4 camera-to-world (c2w) transformation matrix, OpenCV coordinate convention |
| `intrinsics.matrix` | 3×3 camera intrinsic matrix in pixels (`fx, fy` = focal lengths; `cx, cy` = principal point) |
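
A file in this layout can be assembled from per-view matrices with a few lines of Python. The sketch below is one possible way to do it; `make_camera_prior` is a hypothetical helper, and setting `num_cameras` to the longer of the two lists is an assumption rather than documented behavior.

```python
import json

import numpy as np


def make_camera_prior(c2w=None, intrinsics=None):
    """Build a dict in the camera_params.json layout from per-view matrices."""
    c2w = [] if c2w is None else c2w
    intrinsics = [] if intrinsics is None else intrinsics

    def entries(mats):
        return [
            {"camera_id": i, "matrix": np.asarray(m, dtype=float).tolist()}
            for i, m in enumerate(mats)
        ]

    return {
        "num_cameras": max(len(c2w), len(intrinsics)),  # assumption, see lead-in
        "extrinsics": entries(c2w),  # [] means no pose prior
        "intrinsics": entries(intrinsics),  # [] means no intrinsics prior
    }


K = [[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]]
prior = make_camera_prior(intrinsics=[K, K])  # intrinsics-only prior, two views
text = json.dumps(prior, indent=2)  # contents for the file given as prior_cam_path
print(text)
```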

**Important notes:**
- `extrinsics` and `intrinsics` lists can be provided independently or together. An empty list `[]` or missing key means that prior is unavailable.
- **Intrinsics resolution:** Values should correspond to the **original image resolution**. The pipeline automatically adjusts for inference-time resize + center-crop.
- **Extrinsics alignment:** The pipeline automatically normalizes all extrinsics relative to the first view, consistent with training behavior.
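
The two automatic adjustments mentioned in the notes can be sketched with standard camera algebra. This is an illustration under assumptions, not the pipeline's implementation: the extrinsics step is shown as a rigid re-basing onto the first view only (any extra scale normalization is not covered), and the intrinsics step assumes a uniform resize followed by a crop. Both function names are hypothetical.

```python
import numpy as np


def normalize_to_first_view(c2w):
    """Re-express c2w poses [S, 4, 4] so that view 0 becomes the identity."""
    c2w = np.asarray(c2w, dtype=np.float64)
    return np.linalg.inv(c2w[0]) @ c2w  # broadcasts over the S views


def adjust_intrinsics(K, scale, crop_left, crop_top):
    """Update a 3x3 intrinsic matrix for a uniform resize followed by a crop."""
    K = np.asarray(K, dtype=np.float64).copy()
    K[0, 0] *= scale  # fx
    K[1, 1] *= scale  # fy
    K[0, 2] = K[0, 2] * scale - crop_left  # cx
    K[1, 2] = K[1, 2] * scale - crop_top  # cy
    return K


poses = np.stack([np.eye(4), np.eye(4)])
poses[0, :3, 3] = [1.0, 2.0, 3.0]
poses[1, :3, 3] = [2.0, 2.0, 3.0]
relative = normalize_to_first_view(poses)

K = [[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]]
K_small = adjust_intrinsics(K, scale=0.5, crop_left=10, crop_top=5)
print(relative[1, :3, 3], K_small)
```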
#### Depth Maps (Folder)
Depth maps are stored as individual files in a directory. Filenames should match the input image filenames. Supported formats: `.npy`, `.exr`, `.png` (16-bit).

```
prior_depth/
├── image_0001.npy    # float32, shape [H, W]
├── image_0002.npy
└── ...
```
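
A depth-prior folder like the one above can be produced with NumPy. The sketch below writes one `.npy` file per input image, named after the image stem. `save_depth_priors` is a hypothetical helper; it skips images with no depth data, and whether the pipeline accepts such partial coverage is not stated here.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_depth_priors(image_dir, depth_by_stem, out_dir):
    """Write one <stem>.npy per input image that has a depth map available."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(image_dir).iterdir()):
        depth = depth_by_stem.get(image.stem)
        if depth is None:
            continue  # no depth data supplied for this image
        target = out_dir / f"{image.stem}.npy"
        np.save(target, np.asarray(depth, dtype=np.float32))  # float32, [H, W]
        written.append(target.name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    for name in ("image_0001.jpg", "image_0002.jpg"):
        (images / name).touch()  # empty placeholders standing in for real images
    depths = {"image_0001": np.ones((4, 6)), "image_0002": np.full((4, 6), 2.0)}
    written = save_depth_priors(images, depths, Path(tmp) / "prior_depth")
    reloaded = np.load(Path(tmp) / "prior_depth" / "image_0001.npy")

print(written, reloaded.dtype, reloaded.shape)
```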

#### Combining Priors
Priors can be freely combined. Examples:

```bash
# Only intrinsics
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_intrinsics_only.json
# Only depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_depth_path depth_maps/
# Camera poses + intrinsics + depth
python -m hyworld2.worldrecon.pipeline --input_path images/ \
    --prior_cam_path camera_params.json \
    --prior_depth_path depth_maps/
```

---
### Multi-GPU Inference
WorldMirror 2.0 supports **Sequence Parallel (SP)** inference across multiple GPUs, where token sequences are sharded across ranks in the ViT backbone, and DPT heads process frames in parallel.

> **Requirement:** The number of input images must be **>= the number of GPUs** (`nproc_per_node`). For example, with 8 GPUs you need at least 8 input images. The pipeline will raise an error if this condition is not met.

```bash
# 2 GPUs with FSDP + bf16
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16
# 4 GPUs
torchrun --nproc_per_node=4 -m hyworld2.worldrecon.pipeline \
    --input_path path/to/images \
    --use_fsdp --enable_bf16
```

Python API (inside a script launched with `torchrun`):

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline
pipeline = WorldMirrorPipeline.from_pretrained(
    'tencent/HY-World-2.0',
    use_fsdp=True,
    enable_bf16=True,
)
pipeline('path/to/images')
```

**What happens under the hood:**
1. `from_pretrained` auto-detects `WORLD_SIZE > 1` and initializes `torch.distributed`.
2. The model is loaded on rank 0 and broadcast via `sync_module_states=True`.
3. FSDP shards parameters across the SP process group.
4. DPT prediction heads split frames across ranks and `AllGather` results.
5. Post-processing (mask computation, saving) runs on rank 0 only.
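
Step 4 also explains the requirement that there be at least as many images as GPUs: with fewer frames than ranks, some rank would have nothing to process. The sketch below simulates an even frame split in plain NumPy. `frames_for_rank` is a hypothetical helper, and the partition scheme actually used by the DPT heads may differ.

```python
import numpy as np


def frames_for_rank(num_frames, world_size, rank):
    """Return the frame indices one rank would process under an even split."""
    if num_frames < world_size:
        raise ValueError(
            f"need at least {world_size} frames for {world_size} ranks, got {num_frames}"
        )
    chunks = np.array_split(np.arange(num_frames), world_size)
    return chunks[rank].tolist()


world_size = 4
per_rank = [frames_for_rank(10, world_size, r) for r in range(world_size)]
# Concatenating the per-rank lists in rank order mimics an all-gather.
gathered = [i for chunk in per_rank for i in chunk]
print(per_rank)
```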

---
### Advanced Options
#### Disabling Prediction Heads
To save memory when you only need specific outputs:

```python
from hyworld2.worldrecon.pipeline import WorldMirrorPipeline

pipeline = WorldMirrorPipeline.from_pretrained(
    'tencent/HY-World-2.0',
    disable_heads=["normal", "points"],  # free ~200M params
)
```

Available heads: `"camera"`, `"depth"`, `"normal"`, `"points"`, `"gs"`.
#### Mask Filtering
The pipeline supports three types of output filtering to improve point cloud and Gaussian quality:
1. **Sky mask** (`apply_sky_mask`, on by default): Removes sky regions using an ONNX-based segmentation model, optionally fused with model-predicted depth masks (see `sky_mask_source`).
2. **Edge mask** (`apply_edge_mask`, on by default): Removes points at depth/normal discontinuities (object boundaries), controlled by `edge_depth_threshold` and `edge_normal_threshold`.
3. **Confidence mask** (`apply_confidence_mask`, off by default): Removes the bottom N% of points by prediction confidence, where N is `confidence_percentile`.

These masks are applied independently to both the `points.ply` (depth-based) and `gaussians.ply` (GS-based) outputs. The GS output uses its own depth predictions for edge detection when available.
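
Two of these filters are easy to picture in NumPy. The sketch below is a simplified stand-in, not the pipeline's implementation: the confidence mask drops the bottom `percentile` of values, and the edge mask flags pixels whose depth differs from a 4-neighbour by more than a relative tolerance. The real edge detector also uses normals and may be defined differently; both function names are hypothetical.

```python
import numpy as np


def confidence_mask(conf, percentile=10.0):
    """Keep pixels whose confidence is above the given percentile."""
    threshold = np.percentile(conf, percentile)
    return conf >= threshold


def depth_edge_mask(depth, rel_tol=0.03):
    """Keep pixels whose depth is within rel_tol of all four neighbours."""
    pad = np.pad(depth, 1, mode="edge")
    neighbours = np.stack(
        [pad[:-2, 1:-1], pad[2:, 1:-1], pad[1:-1, :-2], pad[1:-1, 2:]]
    )
    jump = np.abs(neighbours - depth).max(axis=0)  # largest neighbour difference
    return jump <= rel_tol * depth


conf = np.arange(100, dtype=np.float64).reshape(10, 10)
keep_conf = confidence_mask(conf, percentile=10.0)

depth = np.ones((4, 6))
depth[:, 3:] = 2.0  # a vertical depth discontinuity between columns 2 and 3
keep_edge = depth_edge_mask(depth, rel_tol=0.03)
print(int(keep_conf.sum()), int(keep_edge.sum()))
```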
#### Point Cloud Compression
When `compress_pts=True` (default), the depth-derived point cloud undergoes:
1. **Voxel merging**: Points within each voxel (size controlled by `compress_pts_voxel_size`) are merged via weighted averaging.
2. **Random subsampling**: If the result exceeds `compress_pts_max_points`, points are uniformly subsampled.

Similarly, Gaussians are voxel-pruned (weighted averaging of means, scales, quaternions, colors, opacities) and optionally subsampled to `compress_gs_max_points`.
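
The two compression steps for point clouds can be sketched as follows. This is a minimal NumPy illustration under assumptions: `voxel_merge` is a hypothetical helper, the per-point `weights` (for example, confidences) are assumed to be positive, and colors are left out for brevity.

```python
import numpy as np


def voxel_merge(points, weights, voxel_size=0.002, max_points=2_000_000, seed=0):
    """Merge points per voxel by weighted average, then cap the point count."""
    keys = np.floor(points / voxel_size).astype(np.int64)  # integer voxel index
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat voxel id per point
    num_voxels = int(inverse.max()) + 1
    wsum = np.bincount(inverse, weights=weights, minlength=num_voxels)
    merged = np.stack(
        [
            np.bincount(inverse, weights=weights * points[:, d], minlength=num_voxels)
            for d in range(3)
        ],
        axis=1,
    ) / wsum[:, None]
    if len(merged) > max_points:
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(merged), size=max_points, replace=False)
        merged = merged[keep]  # uniform random subsample
    return merged


points = np.array([[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [1.5, 0.2, 0.2]])
weights = np.array([1.0, 3.0, 1.0])
merged = voxel_merge(points, weights, voxel_size=1.0)
capped = voxel_merge(points, weights, voxel_size=1.0, max_points=1)
print(merged, capped.shape)
```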

---
### Gradio App
An interactive web demo for WorldMirror 2.0. Upload images or videos and visualize 3DGS, point clouds, depth maps, normal maps, and camera parameters in your browser.

**Quick start:**

```bash
# Single GPU
python -m hyworld2.worldrecon.gradio_app

# Multi-GPU
torchrun --nproc_per_node=2 -m hyworld2.worldrecon.gradio_app \
    --use_fsdp --enable_bf16
```

**With a local checkpoint:**

```bash
python -m hyworld2.worldrecon.gradio_app \
    --config_path /path/to/config.yaml \
    --ckpt_path /path/to/checkpoint.safetensors
```

**With a public link (e.g., for Colab or remote servers):**

```bash
python -m hyworld2.worldrecon.gradio_app --share
```

**Arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--port` | `8081` | Server port |
| `--host` | `0.0.0.0` | Server host |
| `--share` | `False` | Create a public Gradio link |
| `--examples_dir` | `./examples/worldrecon` | Path to example scenes directory |
| `--config_path` | `None` | Training config YAML (used with `--ckpt_path`) |
| `--ckpt_path` | `None` | Local checkpoint file (`.ckpt` / `.safetensors`) |
| `--use_fsdp` | `False` | Enable FSDP multi-GPU sharding |
| `--enable_bf16` | `False` | Enable bfloat16 mixed precision |
| `--fsdp_cpu_offload` | `False` | Offload FSDP params to CPU (saves GPU memory) |

> **Important:** In multi-GPU mode, the number of input images must be **>= the number of GPUs**.

---
## Panorama Generation
*Coming soon.*

This section will document the panorama generation model, including:
- Text-to-panorama and image-to-panorama APIs
- Model architecture (MMDiT-based implicit perspective-to-ERP mapping)
- Configuration parameters
- Output formats

---
## World Generation
*Coming soon.*

This section will document the world generation pipeline, including:
- Trajectory planning configuration
- World expansion with memory-driven video generation
- World composition (point cloud expansion + 3DGS optimization)
- End-to-end generation from text/image to navigable 3D world