NVPanoptix-3D (Front3D checkpoint)
🤗 Hugging Face | 🚀 TAO Toolkit (coming soon)
Description
NVPanoptix-3D is a 3D panoptic reconstruction model that reconstructs complete 3D indoor scenes from single RGB images, simultaneously performing 2D panoptic segmentation, depth estimation, 3D scene reconstruction, and 3D panoptic segmentation. Built upon the Uni-3D (ICCV 2023) baseline architecture, this model enhances 3D understanding by replacing the backbone with VGGT (Visual Geometry Grounded Transformer) and integrating multi-plane occupancy-aware lifting from BUOL (CVPR 2023) for improved 2D-to-3D projection. The 3D stage uses WarpConvNet sparse convolutions, providing native support for modern NVIDIA GPU architectures and newer CUDA releases (12.x and 13.x). The model reconstructs complete 3D scenes with both object instances ("things") and scene layout ("stuff") in a unified framework. This model was trained on the 3D-FRONT dataset.
This model is ready for non-commercial use.
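The occupancy-aware lifting mentioned above back-projects 2D predictions into a 3D frustum via the pinhole camera model. A minimal numpy sketch of that back-projection step, using hypothetical intrinsics (fx, fy, cx, cy are illustrative values, not the model's calibration; the real model derives camera parameters from VGGT's pose encoding):

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 240x320 image (illustrative only).
fx, fy, cx, cy = 200.0, 200.0, 160.0, 120.0
H, W = 240, 320

# Toy depth map (meters), constant 2 m everywhere.
depth = np.full((H, W), 2.0, dtype=np.float32)

# Back-project every pixel into camera-space 3D points (the "lifting" step).
v, u = np.mgrid[0:H, 0:W]
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1)  # (H, W, 3)

# Quantize the points into a coarse voxel grid to mimic occupancy lifting.
voxel_size = 0.1
voxels = np.unique(np.floor(points.reshape(-1, 3) / voxel_size).astype(int), axis=0)
print(points.shape, len(voxels))
```

The model performs this lifting per semantic plane (the "multi-plane" part of BUOL) rather than once per pixel, but the geometry is the same.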
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by NVIDIA License. Additional Information: Apache-2.0 License for https://github.com/mlpc-ucsd/Uni-3D?tab=readme-ov-file; https://github.com/facebookresearch/vggt/blob/main/LICENSE.txt for https://github.com/facebookresearch/vggt; Apache-2.0 License for https://github.com/NVlabs/WarpConvNet.
Deployment Geography
Global
Use Case
This model is intended for researchers and developers building 3D scene understanding applications for indoor environments, including robotics navigation, augmented reality, virtual reality, and architectural visualization.
How to use
Setup environment
# Setup NVPanoptix-3D env (CUDA 13.0):
conda create -n nvpanoptix python=3.10 -y
# Activate environment
source activate nvpanoptix
# or
# conda activate nvpanoptix
# Install system dependencies (git-lfs must be available before cloning)
apt-get update && apt-get install -y git git-lfs ninja-build cmake libopenblas-dev
# Clone repo
git clone https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-front3d
cd nvpanoptix-3d-v1.1-front3d
# Download the large checkpoints
git lfs install
git lfs pull
# Install Python dependencies
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130
pip install -r requirements.txt
# Install WarpConvNet (sparse 3D convolutions)
# Set CUDA version (if using in Dockerfile, use ENV; in shell, use export)
export CUDA=cu130
# Install core dependencies
pip install --no-deps cupy-cuda13x==13.6.0 # use cupy-cuda13x for CUDA 13.x
# Build torch-scatter from source against the current torch/cuda stack
FORCE_CUDA=1 pip install --no-build-isolation --no-cache-dir --no-binary=torch-scatter torch-scatter
# Install patched WarpConvNet for CUDA 13.x
git clone https://github.com/daocongtuyen2x/WarpConvNet.git
cd WarpConvNet
if [ -d .git ]; then git submodule sync --recursive && git submodule update --init --recursive; fi
pip install --no-build-isolation .
Quick Start
from model import NVPanoptix3DModel
from preprocessing import load_image
from visualization import save_outputs
from PIL import Image
import numpy as np
# Load model from local directory
model = NVPanoptix3DModel.from_pretrained("path/to/local/repo/directory")
# Or load from HF repo
# model = NVPanoptix3DModel.from_pretrained("nvidia/nvpanoptix-3d-v1.1-front3d")
# Load and preprocess image
image_path = "path/to/your/image.png"
# keep original image for visualization
orig_image = Image.open(image_path).convert("RGB")
orig_image = np.array(orig_image)
# load processed image for inference
image = load_image(image_path, target_size=(320, 240))
# Run inference
outputs = model.predict(image)
# Save results (2D segmentation, depth map, 3D mesh)
save_outputs(outputs, "output_dir/", original_image=orig_image)
# Access individual outputs
print(f"2D Panoptic: {outputs.panoptic_seg_2d.shape}") # (120, 160)
print(f"2D Depth: {outputs.depth_2d.shape}") # (120, 160)
print(f"3D Geometry: {outputs.geometry_3d.shape}") # (256, 256, 256)
print(f"3D Semantic: {outputs.semantic_seg_3d.shape}") # (256, 256, 256)
Release Date
Hugging Face: 03/26/2026 via https://huggingface.co/nvidia/nvpanoptix-3d-v1.1-front3d
References
- Zhang, Xiang, et al.: Uni-3D: A Universal Model for Panoptic 3D Scene Reconstruction, ICCV 2023
- Wang, Jianyuan, et al.: VGGT: Visual Geometry Grounded Transformer, arXiv 2025
- Chu, Tao, et al.: BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction, CVPR 2023
- WarpConvNet: NVIDIA Sparse 3D Convolution Library
Model Architecture
Architecture Type: Two-Stage Architecture (Transformer + Sparse Convolutional Network)
Network Architecture:
2D Stage: Transformer-based (VGGT Backbone + Mask2Former-style Decoder)
3D Stage: WarpConvNet Sparse 3D CNN Frustum Decoder
Number of parameters: 1.4*10^9
This model was developed based on: Uni-3D (ICCV 2023) with VGGT backbone replacement, BUOL occupancy-aware lifting integration, and WarpConvNet sparse 3D convolutions replacing MinkowskiEngine.
Computational Load
Cumulative Compute: 1.2*10^16 FLOPs
Estimated Energy and Emissions for Model Training:
- Estimated Energy: 192.849309184 kWh
- Estimated Emissions: 0.07916464142 tCO2e
Input
Input Type: Image
Input Format:
- Image: Red, Green, Blue (RGB)
Input Parameters:
- Image: Two-Dimensional (2D)
Other Properties Related to Input:
- RGB Image:
- Standard size 240 x 320 (H x W), uint8 [0, 255]
- Processed internally to ~N x 448 (height adjusted to be divisible by 14) for the VGGT backbone
- Minimum resolution: 240 x 320
- Padded to ensure dimensions divisible by 32 for multi-scale processing
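The two divisibility constraints above (height divisible by 14 for the VGGT patch embedding, dimensions divisible by 32 for multi-scale processing) amount to rounding a dimension up to the next multiple. A minimal sketch of that computation; the exact preprocessing in the repo may differ:

```python
import math

# Round a dimension up to the next multiple (illustrative helper, not the
# repo's actual preprocessing code).
def round_up(x: int, multiple: int) -> int:
    return int(math.ceil(x / multiple) * multiple)

h, w = 240, 320
print(round_up(h, 32), round_up(w, 32))  # padded size for multi-scale processing
print(round_up(h, 14))                   # nearest patch-aligned height >= 240
```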
Outputs
Output Types: Mask, Depth Map, 3D Geometry/Segmentation
Output Formats:
- 2D Segmentation: Binary masks with integer instance IDs
- Depth Map: Floating point depth values in meters
- 3D Geometry: Truncated Signed Distance Field (TSDF)
- 3D Segmentation: Integer labels (instance IDs and semantic classes)
Output Parameters:
- 2D Masks: Two-Dimensional (2D)
- Depth Map: Two-Dimensional (2D)
- 3D Geometry/Segmentation: Three-Dimensional (3D)
Other Properties Related to Output:
2D Outputs:
- pred_logits: [Batch, 100, 14] - Classification scores for 100 queries across 13 semantic classes + background
- pred_masks: [Batch, 100, H/2, W/2] - Binary segmentation masks for each query
- pred_depths: [Batch, 1, H/2, W/2] - Per-pixel depth in meters, range [0.4, 6.0]
- panoptic_seg: [H/2, W/2] - 2D panoptic segmentation with instance IDs
- pose_enc: [Batch, 9] - Camera pose encoding from VGGT
3D Outputs:
- geometry: [Batch, 256, 256, 256] - TSDF representing reconstructed 3D geometry
- panoptic_seg_3d: [Batch, 256, 256, 256] - 3D panoptic segmentation with instance IDs
- semantic_seg_3d: [Batch, 256, 256, 256] - 3D semantic segmentation with class labels
- instance_info: List of dictionaries containing per-instance 3D meshes and metadata
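A TSDF stores, per voxel, a truncated signed distance to the nearest surface, so the reconstructed geometry lies where the values cross zero. A self-contained numpy sketch on a synthetic sphere TSDF (the 64-voxel resolution and +/-3-voxel truncation here are illustrative, not the model's actual settings):

```python
import numpy as np

# Build a synthetic TSDF: signed distance to a sphere, on a 64^3 grid.
res = 64
coords = np.stack(np.meshgrid(*[np.arange(res)] * 3, indexing="ij"), axis=-1)
center, radius = res / 2, res / 4
tsdf = np.linalg.norm(coords - center, axis=-1) - radius

# Truncate to +/- 3 voxels, as a TSDF typically is.
tsdf = np.clip(tsdf, -3.0, 3.0)

# Occupied interior (negative distance) and a thin shell around the surface.
interior = tsdf < 0
surface = np.abs(tsdf) < 0.5
print(interior.sum(), surface.sum())
```

Mesh extraction (e.g. marching cubes over the zero crossing) follows the same idea; the repo's save_outputs helper handles that for the model's geometry output.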
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s):
- TAO Toolkit Triton Apps
Supported Hardware Microarchitecture Compatibility:
- Optimized for NVIDIA A100 80GB GPUs (Ampere architecture).
- Requires GPU with high memory capacity (≥40GB recommended).
- Compatible with NVIDIA Ampere (A100), Hopper (H100), and Blackwell (B200) GPU architectures via WarpConvNet sparse convolution support.
Preferred/Supported Operating System(s):
Preferred: Ubuntu 22.04.5 LTS (Jammy Jellyfish), tested with CUDA 13.0.
Supported: Other Ubuntu versions (20.04+, 22.04+) and Linux distributions with compatible CUDA 12.x & 13.x drivers.
The model requires NVIDIA GPU with ≥40GB memory for training and ≥30GB for inference. By leveraging NVIDIA hardware (GPU cores) and software frameworks (CUDA libraries), the model achieves efficient training and inference. The integration of this model into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.
Model Version(s)
1.1
(Pre-trained Panoptic Recon 3D model with WarpConvNet backend, deployable to Triton Inference Server for inference)
Training, Testing, and Evaluation Datasets
Dataset Overview
Total Number of Datasets: 1 (3D-FRONT)
Data Modality: Image, 3D Geometry.
3D-FRONT
Link: https://tianchi.aliyun.com/dataset/65347
Data Modality: Image, 3D Geometry
Image Training Data Size: Less than a Million Images
Data Collection Method: Synthetic - Photorealistic rendered images from CAD models
Labeling Method: Synthetic
Properties: 3D-FRONT is a synthetic dataset of indoor scenes featuring photorealistically rendered RGB images accompanied by ground-truth 3D geometry, depth maps, semantic labels, and instance segmentations. It encompasses a diverse range of room types, including bedrooms, living rooms, dining rooms, and offices, with realistic furniture arrangements representative of residential spaces. The dataset contains 18,797 indoor scenes, each captured from multiple viewpoints, and is split into 4,389 training, 489 validation, and 1,206 test images.
Inference
Acceleration Engine: Triton
Test Hardware:
- 1x NVIDIA A100 80GB
- 1x NVIDIA H100 80GB
Configuration:
- Precision: FP32
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.