NVPanoptix-3D (Front3D checkpoint)
🤗 Hugging Face | 🚀 TAO Toolkit (coming soon)
Description
NVPanoptix-3D is a 3D panoptic reconstruction model that reconstructs complete 3D indoor scenes from single RGB images, simultaneously performing 2D panoptic segmentation, depth estimation, 3D scene reconstruction, and 3D panoptic segmentation. Built upon the Uni-3D (ICCV 2023) baseline architecture, this model enhances 3D understanding by replacing the backbone with VGGT (Visual Geometry Grounded Transformer) and integrating multi-plane occupancy-aware lifting from BUOL (CVPR 2023) for improved 2D-to-3D projection. The 3D stage uses WarpConvNet sparse convolutions, providing native support for modern NVIDIA GPU architectures and newer CUDA releases (12.x and 13.x). The model reconstructs complete 3D scenes with both object instances ("things") and scene layout ("stuff") in a unified framework. This model was trained on the 3D-FRONT dataset.
This model is ready for non-commercial use.
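The occupancy-aware lifting mentioned above back-projects 2D predictions into a 3D frustum via the pinhole camera model. A minimal numpy sketch of that back-projection step, using hypothetical intrinsics (fx, fy, cx, cy are illustrative values, not the model's calibration; the real model derives camera parameters from VGGT's pose encoding):

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 240x320 image (illustrative only).
fx, fy, cx, cy = 200.0, 200.0, 160.0, 120.0
H, W = 240, 320

# Toy depth map (meters), constant 2 m everywhere.
depth = np.full((H, W), 2.0, dtype=np.float32)

# Back-project every pixel into camera-space 3D points (the "lifting" step).
v, u = np.mgrid[0:H, 0:W]
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1)  # (H, W, 3)

# Quantize the points into a coarse voxel grid to mimic occupancy lifting.
voxel_size = 0.1
voxels = np.unique(np.floor(points.reshape(-1, 3) / voxel_size).astype(int), axis=0)
print(points.shape, len(voxels))
```

The model performs this lifting per semantic plane (the "multi-plane" part of BUOL) rather than once per pixel, but the geometry is the same.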
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by NVIDIA License. Additional Information: Apache-2.0 License for https://github.com/mlpc-ucsd/Uni-3D?tab=readme-ov-file; https://github.com/facebookresearch/vggt/blob/main/LICENSE.txt for https://github.com/facebookresearch/vggt; Apache-2.0 License for https://github.com/NVlabs/WarpConvNet.
Deployment Geography
Global
Use Case
This model is intended for researchers and developers building 3D scene understanding applications for indoor environments, including robotics navigation, augmented reality, virtual reality, and architectural visualization.
How to use
Setup environment
# Setup NVPanoptix-3D env (CUDA 13.0):
conda create -n nvpanoptix python=3.10 -y
# Activate environment
source activate nvpanoptix
# or
# conda activate nvpanoptix
# Install system dependencies (git-lfs must be available before cloning)
apt-get update && apt-get install -y git git-lfs ninja-build cmake libopenblas-dev
# Clone repo
git clone https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-front3d
cd nvpanoptix-3d-v1.1-front3d
# Download the large checkpoints
git lfs install
git lfs pull
# Install Python dependencies
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130
pip install -r requirements.txt
# Install WarpConvNet (sparse 3D convolutions)
# Set CUDA version (if using in Dockerfile, use ENV; in shell, use export)
export CUDA=cu130
# Install core dependencies
pip install --no-deps cupy-cuda13x==13.6.0 # use cupy-cuda13x for CUDA 13.x
# Build torch-scatter from source against the current torch/cuda stack
FORCE_CUDA=1 pip install --no-build-isolation --no-cache-dir --no-binary=torch-scatter torch-scatter
# Install patched WarpConvNet for CUDA 13.x
git clone https://github.com/daocongtuyen2x/WarpConvNet.git
cd WarpConvNet
if [ -d .git ]; then git submodule sync --recursive && git submodule update --init --recursive; fi
pip install --no-build-isolation .
Quick Start
from model import NVPanoptix3DModel
from preprocessing import load_image
from visualization import save_outputs
from PIL import Image
import numpy as np
# Load model from local directory
model = NVPanoptix3DModel.from_pretrained("path/to/local/repo/directory")
# Or load from HF repo
# model = NVPanoptix3DModel.from_pretrained("nvidia/nvpanoptix-3d-v1.1-front3d")
# Load and preprocess image
image_path = "path/to/your/image.png"
# keep original image for visualization
orig_image = Image.open(image_path).convert("RGB")
orig_image = np.array(orig_image)
# load processed image for inference
image = load_image(image_path, target_size=(320, 240))
# Run inference
outputs = model.predict(image)
# Save results (2D segmentation, depth map, 3D mesh)
save_outputs(outputs, "output_dir/", original_image=orig_image)
# Access individual outputs
print(f"2D Panoptic: {outputs.panoptic_seg_2d.shape}") # (120, 160)
print(f"2D Depth: {outputs.depth_2d.shape}") # (120, 160)
print(f"3D Geometry: {outputs.geometry_3d.shape}") # (256, 256, 256)
print(f"3D Semantic: {outputs.semantic_seg_3d.shape}") # (256, 256, 256)
Release Date
Hugging Face: 03/26/2026 via https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-front3d
References
- Zhang, Xiang, et al.: Uni-3D: A Universal Model for Panoptic 3D Scene Reconstruction, ICCV 2023
- Wang, Jianyuan, et al.: VGGT: Visual Geometry Grounded Transformer, arXiv 2025
- Chu, Tao, et al.: BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction, CVPR 2023
- WarpConvNet: NVIDIA Sparse 3D Convolution Library
Model Architecture
Architecture Type: Two-Stage Architecture (Transformer + Sparse Convolutional Network)
Network Architecture:
2D Stage: Transformer-based (VGGT Backbone + Mask2Former-style Decoder)
3D Stage: WarpConvNet Sparse 3D CNN Frustum Decoder
Number of parameters: 1.4*10^9
This model was developed based on: Uni-3D (ICCV 2023) with VGGT backbone replacement, BUOL occupancy-aware lifting integration, and WarpConvNet sparse 3D convolutions replacing MinkowskiEngine.
Computational Load
Cumulative Compute: 1.2*10^16 FLOPs
Estimated Energy and Emissions for Model Training:
- Estimated Energy: 192.849309184 kWh
- Estimated Emissions: 0.07916464142 tCO2e
Input
Input Type: Image
Input Format:
- Image: Red, Green, Blue (RGB)
Input Parameters:
- Image: Two-Dimensional (2D)
Other Properties Related to Input:
- RGB Image:
- Standard size 240 x 320 (H x W), uint8 [0, 255]
- Processed internally to ~N x 448 (height adjusted to be divisible by 14) for the VGGT backbone
- Minimum resolution: 240 x 320
- Padded to ensure dimensions divisible by 32 for multi-scale processing
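The two divisibility constraints above (height divisible by 14 for the VGGT patch embedding, dimensions divisible by 32 for multi-scale processing) amount to rounding a dimension up to the next multiple. A minimal sketch of that computation; the exact preprocessing in the repo may differ:

```python
import math

# Round a dimension up to the next multiple (illustrative helper, not the
# repo's actual preprocessing code).
def round_up(x: int, multiple: int) -> int:
    return int(math.ceil(x / multiple) * multiple)

h, w = 240, 320
print(round_up(h, 32), round_up(w, 32))  # padded size for multi-scale processing
print(round_up(h, 14))                   # nearest patch-aligned height >= 240
```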
Outputs
Output Types: Mask, Depth Map, 3D Geometry/Segmentation
Output Formats:
- 2D Segmentation: Binary masks with integer instance IDs
- Depth Map: Floating point depth values in meters
- 3D Geometry: Truncated Signed Distance Field (TSDF)
- 3D Segmentation: Integer labels (instance IDs and semantic classes)
Output Parameters:
- 2D Masks: Two-Dimensional (2D)
- Depth Map: Two-Dimensional (2D)
- 3D Geometry/Segmentation: Three-Dimensional (3D)
Other Properties Related to Output:
2D Outputs:
- pred_logits: [Batch, 100, 14] - Classification scores for 100 queries across 13 semantic classes + background
- pred_masks: [Batch, 100, H/2, W/2] - Binary segmentation masks for each query
- pred_depths: [Batch, 1, H/2, W/2] - Per-pixel depth in meters, range [0.4, 6.0]
- panoptic_seg: [H/2, W/2] - 2D panoptic segmentation with instance IDs
- pose_enc: [Batch, 9] - Camera pose encoding from VGGT
3D Outputs:
- geometry: [Batch, 256, 256, 256] - TSDF representing reconstructed 3D geometry
- panoptic_seg_3d: [Batch, 256, 256, 256] - 3D panoptic segmentation with instance IDs
- semantic_seg_3d: [Batch, 256, 256, 256] - 3D semantic segmentation with class labels
- instance_info: List of dictionaries containing per-instance 3D meshes and metadata
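A TSDF stores, per voxel, a truncated signed distance to the nearest surface, so the reconstructed geometry lies where the values cross zero. A self-contained numpy sketch on a synthetic sphere TSDF (the 64-voxel resolution and +/-3-voxel truncation here are illustrative, not the model's actual settings):

```python
import numpy as np

# Build a synthetic TSDF: signed distance to a sphere, on a 64^3 grid.
res = 64
coords = np.stack(np.meshgrid(*[np.arange(res)] * 3, indexing="ij"), axis=-1)
center, radius = res / 2, res / 4
tsdf = np.linalg.norm(coords - center, axis=-1) - radius

# Truncate to +/- 3 voxels, as a TSDF typically is.
tsdf = np.clip(tsdf, -3.0, 3.0)

# Occupied interior (negative distance) and a thin shell around the surface.
interior = tsdf < 0
surface = np.abs(tsdf) < 0.5
print(interior.sum(), surface.sum())
```

Mesh extraction (e.g. marching cubes over the zero crossing) follows the same idea; the repo's save_outputs helper handles that for the model's geometry output.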
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
- TAO Toolkit Triton Apps
Supported Hardware Microarchitecture Compatibility:
- Optimized for NVIDIA A100 80GB GPUs (Ampere architecture).
- Requires GPU with high memory capacity (≥40GB recommended).
- Compatible with NVIDIA Ampere (A100), Hopper (H100), and Blackwell (B200) GPU architectures via WarpConvNet sparse convolution support.
Preferred/Supported Operating System(s):
Preferred: Ubuntu 22.04.5 LTS (Jammy Jellyfish), tested with CUDA 13.0.
Supported: Other Ubuntu versions (20.04+, 22.04+) and Linux distributions with compatible CUDA 12.x & 13.x drivers.
The model requires NVIDIA GPU with ≥40GB memory for training and ≥30GB for inference. By leveraging NVIDIA hardware (GPU cores) and software frameworks (CUDA libraries), the model achieves efficient training and inference. The integration of this model into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.
Model Version(s)
1.1
(Pre-trained Panoptic Recon 3D model with WarpConvNet backend, deployable to Triton Inference Server for inference)
Training, Testing, and Evaluation Datasets
Dataset Overview
Total Number of Datasets: 1 (3D-FRONT)
Data Modality: Image, 3D Geometry.
3D-FRONT
Link: https://tianchi.aliyun.com/dataset/65347
Data Modality: Image, 3D Geometry
Image Training Data Size: Less than a Million Images
Data Collection Method: Synthetic - Photorealistic rendered images from CAD models
Labeling Method: Synthetic
Properties: 3D-FRONT is a synthetic dataset of indoor scenes featuring photorealistically rendered RGB images accompanied by ground-truth 3D geometry, depth maps, semantic labels, and instance segmentations. It encompasses a diverse range of room types, including bedrooms, living rooms, dining rooms, and offices, with realistic furniture arrangements representative of residential spaces. The dataset contains 18,797 indoor scenes, each captured from multiple viewpoints, and is split into 4,389 training, 489 validation, and 1,206 test images.
Inference
Acceleration Engine: Triton
Test Hardware:
- 1x NVIDIA A100 80GB
- 1x NVIDIA H100 80GB
Configuration:
- Precision: FP32
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.