# DepthAnything3 API Documentation

## Table of Contents

1. [Overview](#overview)
2. [Usage Examples](#usage-examples)
3. [Core API](#core-api)
   - [DepthAnything3 Class](#depthanything3-class)
   - [inference() Method](#inference-method)
4. [Parameters](#parameters)
   - [Input Parameters](#input-parameters)
   - [Pose Alignment Parameters](#pose-alignment-parameters)
   - [Feature Export Parameters](#feature-export-parameters)
   - [Rendering Parameters](#rendering-parameters)
   - [Processing Parameters](#processing-parameters)
   - [Export Parameters](#export-parameters)
5. [Export Formats](#export-formats)
6. [Return Value](#return-value)
## Overview

This documentation provides a comprehensive API reference for DepthAnything3, including usage examples, parameter specifications, export formats, and advanced features. It covers both basic pose and depth estimation workflows and advanced pose-conditioned processing with multiple export capabilities.
## Usage Examples

Here are quick examples to get you started:

### Basic Depth Estimation

```python
from depth_anything_3.api import DepthAnything3

# Initialize and run inference
model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE").to("cuda")
prediction = model.inference(["image1.jpg", "image2.jpg"])
```

### Pose-Conditioned Depth Estimation

```python
import numpy as np

# With camera parameters for better consistency
prediction = model.inference(
    image=["image1.jpg", "image2.jpg"],
    extrinsics=extrinsics_array,  # (N, 4, 4)
    intrinsics=intrinsics_array   # (N, 3, 3)
)
```

### Export Results

```python
# Export depth data and 3D visualization
prediction = model.inference(
    image=image_paths,
    export_dir="./output",
    export_format="mini_npz-glb"
)
```

### Feature Extraction

```python
# Export intermediate features from specific layers
prediction = model.inference(
    image=image_paths,
    export_dir="./output",
    export_format="feat_vis",
    export_feat_layers=[0, 1, 2]  # Export features from layers 0, 1, 2
)
```
### Advanced Export with Gaussian Splatting

```python
# Export multiple formats including Gaussian Splatting
# Note: infer_gs=True requires the da3-giant or da3nested-giant-large model
model = DepthAnything3(model_name="da3-giant").to("cuda")
prediction = model.inference(
    image=image_paths,
    extrinsics=extrinsics_array,
    intrinsics=intrinsics_array,
    export_dir="./output",
    export_format="npz-glb-gs_ply-gs_video",
    align_to_input_ext_scale=True,
    infer_gs=True,  # Required for gs_ply and gs_video exports
)
```

### Advanced Export with Feature Visualization

```python
# Export with intermediate feature visualization
prediction = model.inference(
    image=image_paths,
    export_dir="./output",
    export_format="mini_npz-glb-depth_vis-feat_vis",
    export_feat_layers=[0, 5, 10, 15, 20],
    feat_vis_fps=30,
)
```

### Using Ray-Based Pose Estimation

```python
# Use ray-based pose estimation instead of the camera decoder
prediction = model.inference(
    image=image_paths,
    export_dir="./output",
    export_format="glb",
    use_ray_pose=True,  # Enable ray-based pose estimation
)
```

### Reference View Selection

```python
# For multi-view inputs, automatically select the best reference view
prediction = model.inference(
    image=image_paths,
    ref_view_strategy="saddle_balanced",  # Default: balanced selection
)

# For video sequences, use the middle frame as reference
prediction = model.inference(
    image=video_frames,
    ref_view_strategy="middle",  # Good for temporally ordered inputs
)
```
## Core API

### DepthAnything3 Class

The main API class that provides depth estimation capabilities with optional pose conditioning.

#### Initialization

```python
from depth_anything_3 import DepthAnything3

# Initialize the model with a model name
model = DepthAnything3(model_name="da3-large")
model = model.to("cuda")  # Move to GPU
```

**Parameters:**

- `model_name` (str, default: "da3-large"): The name of the model preset to use.
- **Available models:**
  - `"da3-giant"` - 1.15B params, any-view model with GS support
  - `"da3-large"` - 0.35B params, any-view model (recommended for most use cases)
  - `"da3-base"` - 0.12B params, any-view model
  - `"da3-small"` - 0.08B params, any-view model
  - `"da3mono-large"` - 0.35B params, monocular depth only
  - `"da3metric-large"` - 0.35B params, metric depth with sky segmentation
  - `"da3nested-giant-large"` - 1.40B params, nested model with all features
### inference() Method

The primary inference method that processes images and returns depth predictions.

```python
prediction = model.inference(
    image=image_list,
    extrinsics=extrinsics_array,  # Optional
    intrinsics=intrinsics_array,  # Optional
    align_to_input_ext_scale=True,  # Whether to align predicted poses to input scale
    infer_gs=True,  # Enable Gaussian branch for gs exports
    use_ray_pose=False,  # Use ray-based pose estimation instead of camera decoder
    ref_view_strategy="saddle_balanced",  # Reference view selection strategy
    render_exts=render_extrinsics,  # Optional render trajectory for gs_video
    render_ixts=render_intrinsics,  # Optional render intrinsics for gs_video
    render_hw=(height, width),  # Optional render resolution for gs_video
    process_res=504,
    process_res_method="upper_bound_resize",
    export_dir="output_directory",  # Optional
    export_format="mini_npz",
    export_feat_layers=[],  # List of layer indices to export features from
    conf_thresh_percentile=40.0,  # Confidence threshold percentile for depth map in GLB export
    num_max_points=1_000_000,  # Maximum number of points to export in GLB export
    show_cameras=True,  # Whether to show cameras in GLB export
    feat_vis_fps=15,  # Frames per second for feature visualization in feat_vis export
    export_kwargs={},  # Optional extra arguments to export functions, keyed as {export_format: {key: value}}; see Parameters/Export Parameters
)
```
## Parameters

### Input Parameters

#### `image` (required)

- **Type**: `List[Union[np.ndarray, Image.Image, str]]`
- **Description**: List of input images. Can be numpy arrays, PIL Images, or file paths.
- **Example**:

```python
# From file paths
image = ["image1.jpg", "image2.jpg", "image3.jpg"]

# From numpy arrays
image = [np.array(img1), np.array(img2)]

# From PIL Images
image = [Image.open("image1.jpg"), Image.open("image2.jpg")]
```

#### `extrinsics` (optional)

- **Type**: `Optional[np.ndarray]`
- **Shape**: `(N, 4, 4)` where N is the number of input images
- **Description**: Camera extrinsic matrices (world-to-camera transformation). When provided, enables pose-conditioned depth estimation mode.
- **Note**: If not provided, the model operates in standard depth estimation mode.

#### `intrinsics` (optional)

- **Type**: `Optional[np.ndarray]`
- **Shape**: `(N, 3, 3)` where N is the number of input images
- **Description**: Camera intrinsic matrices containing focal length and principal point information. When provided, enables pose-conditioned depth estimation mode.
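As an illustration, the arrays can be assembled with NumPy as below. The identity poses and the calibration values (`fx`, `fy`, `cx`, `cy`) are placeholders for this sketch, not values prescribed by the library; substitute your own camera parameters.

```python
import numpy as np

N = 2  # number of input images

# World-to-camera extrinsics, one 4x4 matrix per image.
# Identity matrices here stand in for real poses.
extrinsics_array = np.stack([np.eye(4, dtype=np.float32) for _ in range(N)])

# Pinhole intrinsics, one 3x3 matrix per image.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0  # placeholder calibration
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]], dtype=np.float32)
intrinsics_array = np.stack([K] * N)

prediction = model.inference(
    image=["image1.jpg", "image2.jpg"],
    extrinsics=extrinsics_array,  # (N, 4, 4)
    intrinsics=intrinsics_array,  # (N, 3, 3)
)
```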
### Pose Alignment Parameters

#### `align_to_input_ext_scale` (default: True)

- **Type**: `bool`
- **Description**: When True, the predicted extrinsics are replaced with the input ones and the depth maps are rescaled to match their metric scale. When False, the method returns the internally aligned poses computed via Umeyama alignment (sketched below).
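The Umeyama step referred to above solves for the similarity transform (scale, rotation, translation) that best maps one point set onto another, e.g. predicted camera centers onto the input ones. The sketch below shows the classic algorithm for reference only; it is not DepthAnything3's internal implementation.

```python
import numpy as np

def umeyama_alignment(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform with dst_i ~ s * R @ src_i + t.

    src, dst: (N, 3) point sets, e.g. predicted and input camera centers.
    Illustrative sketch of Umeyama (1991); not DepthAnything3's internal code.
    """
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / n                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # guard against reflections
    R = U @ S @ Vt                            # optimal rotation
    var_src = (src_c ** 2).sum() / n          # total variance of src
    s = (D * np.diag(S)).sum() / var_src      # optimal uniform scale
    t = mu_dst - s * R @ mu_src               # optimal translation
    return s, R, t

# Example: s, R, t = umeyama_alignment(pred_centers, input_centers)
```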
#### `infer_gs` (default: False)

- **Type**: `bool`
- **Description**: Enable the Gaussian Splatting branch. Required when using the `gs_ply` or `gs_video` export formats.

#### `use_ray_pose` (default: False)

- **Type**: `bool`
- **Description**: Use ray-based pose estimation instead of the camera decoder for pose prediction. When True, the model uses ray prediction heads to estimate camera poses; when False, it uses the camera decoder approach.

#### `ref_view_strategy` (default: "saddle_balanced")

- **Type**: `str`
- **Description**: Strategy for selecting the reference view from multiple input views. Options: `"first"`, `"middle"`, `"saddle_balanced"`, `"saddle_sim_range"`. Only applied when the number of views is 3 or more. See the [detailed documentation](funcs/ref_view_strategy.md) for strategy comparisons.
- **Available strategies**:
  - `"saddle_balanced"`: Selects the view with balanced features across multiple metrics (recommended default)
  - `"saddle_sim_range"`: Selects the view with the largest similarity range
  - `"first"`: Always uses the first view, i.e. no reordering (not recommended)
  - `"middle"`: Uses the middle view (recommended for video sequences)
### Feature Export Parameters

#### `export_feat_layers` (default: [])

- **Type**: `List[int]`
- **Description**: List of layer indices to export intermediate features from. Features are stored in the `aux` dictionary of the Prediction object with keys like `feat_layer_0`, `feat_layer_1`, etc.
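A minimal call requesting features and reading them back from `aux` might look like this; the exact tensor layout of each feature entry depends on the backbone and is not specified here.

```python
prediction = model.inference(
    image=image_paths,
    export_feat_layers=[0, 5],  # capture intermediate features from layers 0 and 5
)

feat0 = prediction.aux["feat_layer_0"]
feat5 = prediction.aux["feat_layer_5"]
print(getattr(feat0, "shape", None))  # layout depends on the model backbone
```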
### Rendering Parameters

These arguments are only used when exporting Gaussian-splatting videos (include `"gs_video"` in `export_format`). They describe an auxiliary camera trajectory with `M` views.

#### `render_exts` (optional)

- **Type**: `Optional[np.ndarray]`
- **Shape**: `(M, 4, 4)`
- **Description**: Camera extrinsics for the synthesized trajectory. If omitted, the exporter falls back to the predicted poses.

#### `render_ixts` (optional)

- **Type**: `Optional[np.ndarray]`
- **Shape**: `(M, 3, 3)`
- **Description**: Camera intrinsics for each rendered frame. Leave `None` to reuse the input intrinsics.

#### `render_hw` (optional)

- **Type**: `Optional[Tuple[int, int]]`
- **Description**: Explicit output resolution `(height, width)` for the rendered frames. Defaults to the input resolution when not provided.
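To make the shapes concrete, here is one way to synthesize a simple circular trajectory of `M` novel views. The `look_at_extrinsic` helper, the camera-axis convention, and the chosen resolution are illustrative assumptions, not part of the DepthAnything3 API.

```python
import numpy as np

def look_at_extrinsic(eye, target, up=(0.0, 1.0, 0.0)):
    """Build a 4x4 world-to-camera matrix looking from `eye` toward `target`.

    Illustrative helper (assumed axis convention), not a DepthAnything3 API.
    """
    eye, target, up = map(np.asarray, (eye, target, up))
    z = target - eye
    z = z / np.linalg.norm(z)                 # camera forward axis
    x = np.cross(z, up)
    x = x / np.linalg.norm(x)                 # camera right axis
    y = np.cross(z, x)                        # camera down axis
    w2c = np.eye(4, dtype=np.float32)
    w2c[:3, :3] = np.stack([x, y, z])         # rotation rows
    w2c[:3, 3] = -w2c[:3, :3] @ eye           # translation
    return w2c

# A small circular orbit of M novel views around the origin.
M, radius = 60, 2.0
angles = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)
render_extrinsics = np.stack(
    [look_at_extrinsic((radius * np.cos(a), 0.0, radius * np.sin(a)), (0, 0, 0))
     for a in angles]
)  # (M, 4, 4)

prediction = model.inference(
    image=image_paths,
    infer_gs=True,
    export_dir="./output",
    export_format="gs_video",
    render_exts=render_extrinsics,  # (M, 4, 4)
    render_ixts=None,               # reuse the input intrinsics
    render_hw=(480, 640),           # assumed output resolution
)
```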
### Processing Parameters

#### `process_res` (default: 504)

- **Type**: `int`
- **Description**: Base resolution for processing. The model resizes images to this resolution for inference.

#### `process_res_method` (default: "upper_bound_resize")

- **Type**: `str`
- **Description**: Method for resizing images to the target resolution.
- **Options**:
  - `"upper_bound_resize"`: Resize so that `process_res` becomes the longer side
  - `"lower_bound_resize"`: Resize so that `process_res` becomes the shorter side
- **Example** (with `process_res=504`):
  - `"upper_bound_resize"`: input 1200×1600 → output 378×504 (longer side scaled to 504)
  - `"lower_bound_resize"`: input 504×672 → output 504×672 (shorter side is already 504, so no change)
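The arithmetic behind these examples can be sketched as follows. This reproduces only the documented scaling rule; the actual preprocessor may additionally snap sizes to model-friendly multiples.

```python
def target_size(h: int, w: int, process_res: int = 504,
                method: str = "upper_bound_resize") -> tuple[int, int]:
    """Approximate the resize rule described above (illustrative only)."""
    # upper_bound: scale the longer side to process_res;
    # lower_bound: scale the shorter side to process_res.
    bound = max(h, w) if method == "upper_bound_resize" else min(h, w)
    scale = process_res / bound
    return round(h * scale), round(w * scale)

print(target_size(1200, 1600))                              # (378, 504)
print(target_size(504, 672, method="lower_bound_resize"))   # (504, 672)
```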
### Export Parameters

#### `export_dir` (optional)

- **Type**: `Optional[str]`
- **Description**: Directory path where exported files will be saved. If not provided, no files will be exported.

#### `export_format` (default: "mini_npz")

- **Type**: `str`
- **Description**: Format for exporting results. Supports multiple formats separated by `-`.
- **Example**: `"mini_npz-glb"` exports both the mini_npz and glb formats.

#### GLB Export Parameters

These parameters are passed directly to the `inference()` method and only apply when `export_format` includes `"glb"`.

##### `conf_thresh_percentile` (default: 40.0)

- **Type**: `float`
- **Description**: Lower percentile for the adaptive confidence threshold. Points below this confidence percentile will be filtered out of the point cloud.

##### `num_max_points` (default: 1,000,000)

- **Type**: `int`
- **Description**: Maximum number of points in the exported point cloud. If the point cloud exceeds this limit, it will be downsampled.

##### `show_cameras` (default: True)

- **Type**: `bool`
- **Description**: Whether to include camera wireframes in the exported GLB file for visualization.
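For example, a GLB export with tighter filtering than the defaults might look like this; the specific values are arbitrary.

```python
prediction = model.inference(
    image=image_paths,
    export_dir="./output",
    export_format="glb",
    conf_thresh_percentile=60.0,  # drop the least confident 60% of points
    num_max_points=500_000,       # cap the point cloud at 500k points
    show_cameras=False,           # omit the camera wireframes
)
```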
#### Feature Visualization Parameters

These parameters are passed directly to the `inference()` method and only apply when `export_format` includes `"feat_vis"`.

##### `feat_vis_fps` (default: 15)

- **Type**: `int`
- **Description**: Frame rate for the output video when visualizing features across multiple images.

#### 3DGS and 3DGS Video Parameters

These parameters are passed directly to the `inference()` method and only apply when `export_format` includes `"gs_ply"` or `"gs_video"`.

##### `export_kwargs` (default: `{}`)

- **Type**: `dict[str, dict[str, Any]]`
- **Description**: Per-format extra arguments passed to export functions, mainly for `"gs_ply"` and `"gs_video"`.
- **Access pattern**: `export_kwargs[export_format][key] = value`
- **Example**:

```python
{
    "gs_ply": {
        "gs_views_interval": 1,
    },
    "gs_video": {
        "trj_mode": "interpolate_smooth",
        "chunk_size": 1,
        "vis_depth": None,
    },
}
```
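Putting it together, a dictionary like the one above is passed to `inference()` as shown below; the key names come from the options documented under [Export Formats](#export-formats), and the chosen values are illustrative.

```python
prediction = model.inference(
    image=image_paths,
    infer_gs=True,  # required for gs_ply / gs_video
    export_dir="./output",
    export_format="gs_ply-gs_video",
    export_kwargs={
        "gs_ply": {"gs_views_interval": 2},  # export every 2nd view to 3DGS
        "gs_video": {
            "trj_mode": "interpolate_smooth",
            "video_quality": "medium",
        },
    },
)
```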
## Export Formats

The API supports multiple export formats for different use cases:

### `mini_npz`

- **Description**: Minimal NPZ format containing essential data
- **Contents**: `depth`, `conf`, `exts`, `ixts`
- **Use case**: Lightweight storage for depth data with camera parameters

### `npz`

- **Description**: Full NPZ format with comprehensive data
- **Contents**: `depth`, `conf`, `exts`, `ixts`, `image`, etc.
- **Use case**: Complete data export for advanced processing
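Both NPZ variants can be read back with NumPy. The file name below is illustrative; check your `export_dir` for the actual output path.

```python
import numpy as np

data = np.load("./output/results.npz")   # hypothetical output file name
print(data.files)                        # e.g. ['depth', 'conf', 'exts', 'ixts', ...]
depth = data["depth"]                    # (N, H, W) depth maps
exts, ixts = data["exts"], data["ixts"]  # camera parameters
```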
### `glb`

- **Description**: 3D visualization format with point cloud and camera poses
- **Contents**:
  - Point cloud with colors from the original images
  - Camera wireframes for visualization
  - Confidence-based filtering and downsampling
- **Use case**: 3D visualization, inspection, and analysis
- **Features**:
  - Automatic sky depth handling
  - Confidence threshold filtering
  - Background filtering (black/white)
  - Scene scale normalization
- **Parameters** (passed directly to the `inference()` method):
  - `conf_thresh_percentile` (float, default: 40.0): Lower percentile for the adaptive confidence threshold. Points below this confidence percentile will be filtered out.
  - `num_max_points` (int, default: 1,000,000): Maximum number of points in the exported point cloud. If exceeded, points will be downsampled.
  - `show_cameras` (bool, default: True): Whether to include camera wireframes in the exported GLB file for visualization.
### `gs_ply`

- **Description**: Gaussian Splatting point cloud format
- **Contents**: 3DGS data in PLY format, compatible with standard 3DGS viewers such as [SuperSplat](https://superspl.at/editor) (recommended) and [SPARK](https://sparkjs.dev/viewer/).
- **Use case**: Gaussian Splatting reconstruction
- **Requirements**: Must set `infer_gs=True` when calling `inference()`. Only supported by the `da3-giant` and `da3nested-giant-large` models.
- **Additional configs**, provided via `export_kwargs` (see [Export Parameters](#export-parameters)):
  - `gs_views_interval`: Export every Nth view to 3DGS. Default: `1`.
### `gs_video`

- **Description**: Video rendered by rasterizing the 3DGS reconstruction
- **Contents**: A video of 3DGS-rasterized views using either the provided viewpoints or a predefined camera trajectory.
- **Use case**: Video rendering for Gaussian Splatting
- **Requirements**: Must set `infer_gs=True` when calling `inference()`. Only supported by the `da3-giant` and `da3nested-giant-large` models.
- **Note**: Can optionally use the `render_exts`, `render_ixts`, and `render_hw` parameters of `inference()` to specify novel viewpoints.
- **Additional configs**, provided via `export_kwargs` (see [Export Parameters](#export-parameters)):
  - `extrinsics`: Optional world-to-camera poses for novel views. Falls back to the predicted poses of the input views if not provided. (Alternatively, use the `render_exts` parameter of `inference()`.)
  - `intrinsics`: Optional camera intrinsics for novel views. Falls back to the predicted intrinsics of the input views if not provided. (Alternatively, use the `render_ixts` parameter of `inference()`.)
  - `out_image_hw`: Optional output resolution `H x W`. Falls back to the input resolution if not provided. (Alternatively, use the `render_hw` parameter of `inference()`.)
  - `chunk_size`: Number of views rasterized per batch. Default: `8`.
  - `trj_mode`: Predefined camera trajectory for novel-view rendering.
  - `color_mode`: Same as `render_mode` in [gsplat](https://docs.gsplat.studio/main/apis/rasterization.html#gsplat.rasterization).
  - `vis_depth`: How depth is combined with RGB. Default: `hcat` (horizontal concatenation).
  - `enable_tqdm`: Whether to display a tqdm progress bar during rendering.
  - `output_name`: File name of the rendered video.
  - `video_quality`: Video quality to save. Default: `high`.
    - `high`: High quality (default)
    - `medium`: Medium quality (balances storage space and quality)
    - `low`: Low quality (least storage space)
### `feat_vis`

- **Description**: Feature visualization format
- **Contents**: PCA-visualized intermediate features from the specified layers
- **Use case**: Model interpretability and feature analysis
- **Note**: Requires `export_feat_layers` to be specified
- **Parameters** (passed directly to the `inference()` method):
  - `feat_vis_fps` (int, default: 15): Frame rate for the output video when visualizing features across multiple images.

### `depth_vis`

- **Description**: Depth visualization format
- **Contents**: Color-coded depth maps alongside the original images
- **Use case**: Visual inspection of depth estimation quality

### Multiple Format Export

You can export multiple formats simultaneously by separating them with `-`:

```python
# Export both mini_npz and glb formats
export_format = "mini_npz-glb"

# Export multiple formats
export_format = "npz-glb-gs_ply"
```
## Return Value

The `inference()` method returns a `Prediction` object with the following attributes:

### Core Outputs

- **depth**: `np.ndarray` - Estimated depth maps with shape `(N, H, W)`, where N is the number of images, H is the height, and W is the width.
- **conf**: `np.ndarray` - Confidence maps with shape `(N, H, W)` indicating prediction reliability (optional, depends on the model).

### Camera Parameters

- **extrinsics**: `np.ndarray` - Camera extrinsic matrices with shape `(N, 3, 4)` representing world-to-camera transformations. Only present if camera poses were estimated or provided as input.
- **intrinsics**: `np.ndarray` - Camera intrinsic matrices with shape `(N, 3, 3)` containing focal length and principal point information. Only present if poses were estimated or provided as input.

### Additional Outputs

- **processed_images**: `np.ndarray` - Preprocessed input images with shape `(N, H, W, 3)` in RGB format (0-255 uint8).
- **aux**: `dict` - Auxiliary outputs including:
  - `feat_layer_X`: Intermediate features from layer X (if `export_feat_layers` was specified)
  - `gaussians`: 3D Gaussian Splat data (if `infer_gs=True`)
### Usage Example

```python
prediction = model.inference(image=["img1.jpg", "img2.jpg"])

# Access depth maps
depth_maps = prediction.depth  # shape: (2, H, W)

# Access confidence
if hasattr(prediction, 'conf'):
    confidence = prediction.conf

# Access camera parameters (if available)
if hasattr(prediction, 'extrinsics'):
    camera_poses = prediction.extrinsics  # shape: (2, 3, 4)
if hasattr(prediction, 'intrinsics'):
    camera_intrinsics = prediction.intrinsics  # shape: (2, 3, 3)

# Access intermediate features (if export_feat_layers was set)
if hasattr(prediction, 'aux') and 'feat_layer_0' in prediction.aux:
    features = prediction.aux['feat_layer_0']
```