|
|
--- |
|
|
{} |
|
|
--- |
|
|
# NVPanoptix-3D |
|
|
|
|
|
## Description |
|
|
|
|
|
NVPanoptix-3D is a 3D panoptic reconstruction model that reconstructs complete 3D indoor scenes from single RGB images, simultaneously performing 2D panoptic segmentation, depth estimation, 3D scene reconstruction, and 3D panoptic segmentation. Built upon the Uni-3D (ICCV 2023) baseline architecture, this model enhances 3D understanding by replacing the backbone with VGGT (Visual Geometry Grounded Transformer) and integrating multi-plane occupancy-aware lifting from BUOL (CVPR 2023) to improve the projection of 2D features into 3D. The model reconstructs complete 3D scenes with both object instances (things) and scene layout (stuff) in a unified framework. <br>
|
|
|
|
|
This model is ready for non-commercial use. <br> |
|
|
|
|
|
## License/Terms of Use |
|
|
GOVERNING TERMS: Use of this model is governed by the [NVIDIA License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf). Additional Information: [Apache-2.0 License](https://github.com/mlpc-ucsd/Uni-3D?tab=Apache-2.0-1-ov-file) for https://github.com/mlpc-ucsd/Uni-3D; [VGGT License](https://github.com/facebookresearch/vggt/blob/main/LICENSE.txt) for https://github.com/facebookresearch/vggt.
|
|
|
|
|
## Deployment Geography |
|
|
|
|
|
Global <br> |
|
|
|
|
|
## Use Case |
|
|
|
|
|
This model is intended for researchers and developers building 3D scene understanding applications for indoor environments, including robotics navigation, augmented reality, virtual reality, and architectural visualization. <br> |
|
|
|
|
|
## How to use |
|
|
### Setup environment |
|
|
|
|
|
```bash |
|
|
# Setup NVPanoptix-3D env (CUDA 11.8): |
|
|
conda create -n nvpanoptix python=3.10 -y |
|
|
conda activate nvpanoptix
|
|
apt-get update && apt-get install -y git git-lfs ninja-build cmake libopenblas-dev |
|
|
git lfs install |
|
|
|
|
|
git clone https://huggingface.co/nvidia/nvpanoptix-3d |
|
|
cd nvpanoptix-3d |
|
|
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118 |
|
|
pip install -r requirements.txt |
|
|
|
|
|
# Temporarily set CUDA architecture list for MinkowskiEngine |
|
|
export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0+PTX" |
|
|
pip install ninja && FORCE_CUDA=1 pip install git+https://github.com/NVIDIA/MinkowskiEngine.git --no-build-isolation |
|
|
``` |
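
After installation, a minimal sanity check (an optional sketch, not part of the official setup) confirms that the CUDA build of PyTorch and the compiled MinkowskiEngine extension both load:

```python
# Post-install sanity check: confirm PyTorch sees the GPU and that the
# MinkowskiEngine CUDA extension imports cleanly.
import torch
import MinkowskiEngine as ME

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"MinkowskiEngine {ME.__version__}")
```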
|
|
|
|
|
### Quick Start |
|
|
```python |
|
|
from model import PanopticRecon3DModel |
|
|
from preprocessing import load_image |
|
|
from visualization import save_outputs |
|
|
from PIL import Image |
|
|
import numpy as np |
|
|
|
|
|
# Load model from local directory |
|
|
model = PanopticRecon3DModel.from_pretrained("nvpanoptix-3d") |
|
|
|
|
|
# Or load from HF repo |
|
|
# model = PanopticRecon3DModel.from_pretrained("nvidia/nvpanoptix-3d") |
|
|
|
|
|
# Load and preprocess image |
|
|
image_path = "path/to/your/image.png" |
|
|
|
|
|
# keep original image for visualization |
|
|
orig_image = Image.open(image_path).convert("RGB") |
|
|
orig_image = np.array(orig_image) |
|
|
|
|
|
# load processed image for inference |
|
|
image = load_image(image_path, target_size=(320, 240)) |
|
|
|
|
|
# Run inference |
|
|
outputs = model.predict(image) |
|
|
|
|
|
# Save results (2D segmentation, depth map, 3D mesh) |
|
|
save_outputs(outputs, "output_dir/", original_image=orig_image) |
|
|
|
|
|
# Access individual outputs |
|
|
print(f"2D Panoptic: {outputs.panoptic_seg_2d.shape}") # (120, 160) |
|
|
print(f"2D Depth: {outputs.depth_2d.shape}") # (120, 160) |
|
|
print(f"3D Geometry: {outputs.geometry_3d.shape}") # (256, 256, 256) |
|
|
print(f"3D Semantic: {outputs.semantic_seg_3d.shape}") # (256, 256, 256) |
|
|
``` |
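
`save_outputs` writes all artifacts at once. If you only need one of them, the pieces can also be exported individually; the hedged sketch below saves the depth map as a 16-bit PNG in millimeters, assuming `outputs.depth_2d` is an (H, W) float array of depths in meters as documented under Outputs.

```python
# Hedged sketch: export the predicted depth map as a 16-bit PNG (millimeters).
# Assumes outputs.depth_2d is an (H, W) float array of depths in meters.
import numpy as np
from PIL import Image

depth_m = np.asarray(outputs.depth_2d)
depth_mm = np.clip(depth_m * 1000.0, 0, 65535).astype(np.uint16)
Image.fromarray(depth_mm).save("output_dir/depth_mm.png")
```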
|
|
|
|
|
## Release Date |
|
|
Hugging Face: 11/25/2025 via https://huggingface.co/nvidia/nvpanoptix-3d <br>
|
|
|
|
|
## References |
|
|
|
|
|
- Zhang, Xiang, et al.: [Uni-3D: A Universal Model for Panoptic 3D Scene Reconstruction](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Uni-3D_A_Universal_Model_for_Panoptic_3D_Scene_Reconstruction_ICCV_2023_paper.pdf), ICCV 2023 <br> |
|
|
- Wang, Jianyuan, et al.: [VGGT: Visual Geometry Grounded Transformer](https://arxiv.org/abs/2503.11651), arXiv 2025 <br> |
|
|
- Chu, Tao, et al.: [BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction](https://openaccess.thecvf.com/content/CVPR2023/papers/Chu_BUOL_A_Bottom-Up_Framework_With_Occupancy-Aware_Lifting_for_Panoptic_3D_CVPR_2023_paper.pdf), CVPR 2023 <br> |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
**Architecture Type:** Two-Stage Architecture (Transformer + Sparse Convolutional Network) <br> |
|
|
|
|
|
**Network Architecture:** |
|
|
- 2D Stage: Transformer-based (VGGT Backbone + Mask2Former-style Decoder) |
|
|
- 3D Stage: Sparse 3D CNN Frustum Decoder <br> |
|
|
|
|
|
- Number of parameters: 1.4 × 10^9 (approximately 1.4B) <br>
|
|
- This model was developed based on Uni-3D (ICCV 2023), with the backbone replaced by VGGT and BUOL occupancy-aware lifting integrated. <br>
|
|
|
|
|
|
|
## Input |
|
|
|
|
|
**Input Type:** Image <br> |
|
|
|
|
|
**Input Format:** |
|
|
- Image: Red, Green, Blue (RGB) <br> |
|
|
|
|
|
**Input Parameters:** |
|
|
- Image: Two-Dimensional (2D) <br> |
|
|
|
|
|
**Other Properties Related to Input:** |
|
|
- **RGB Image**: <br> |
|
|
- Standard size 240 x 320 (H x W), uint8 [0, 255] <br> |
|
|
- Resized internally to ~N x 448 (H x W) for the VGGT backbone, with the height adjusted to be divisible by 14 (the VGGT patch size) <br>
|
|
- Minimum resolution: 240 x 320 <br> |
|
|
- Padded to ensure dimensions divisible by 32 for multi-scale processing <br> |
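
A minimal sketch of these preprocessing constraints (illustrative only; the packaged `load_image` handles resizing, normalization, and padding, and its exact parameters may differ):

```python
# Illustrative preprocessing sketch: resize to width 448 with the height
# snapped to a multiple of 14 (the VGGT patch size), then zero-pad both
# dimensions up to multiples of 32 for multi-scale processing.
import numpy as np
from PIL import Image

def preprocess(path: str, target_width: int = 448) -> np.ndarray:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Scale to the target width; snap the height to a multiple of 14.
    new_h = max(14, round(h * target_width / w / 14) * 14)
    img = img.resize((target_width, new_h), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # uint8 [0, 255] -> [0, 1]
    # Zero-pad so both dimensions are divisible by 32.
    pad_h = (32 - arr.shape[0] % 32) % 32
    pad_w = (32 - arr.shape[1] % 32) % 32
    return np.pad(arr, ((0, pad_h), (0, pad_w), (0, 0)))
```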
|
|
|
|
|
## Outputs |
|
|
|
|
|
**Output Types:** Mask, Depth Map, 3D Geometry/Segmentation <br> |
|
|
|
|
|
**Output Formats:** |
|
|
- 2D Segmentation: Binary masks with integer instance IDs <br> |
|
|
- Depth Map: Floating point depth values in meters <br> |
|
|
- 3D Geometry: Truncated Signed Distance Field (TSDF) <br> |
|
|
- 3D Segmentation: Integer labels (instance IDs and semantic classes) <br> |
|
|
|
|
|
**Output Parameters:** |
|
|
- 2D Masks: Two-Dimensional (2D) <br> |
|
|
- Depth Map: Two-Dimensional (2D) <br> |
|
|
- 3D Geometry/Segmentation: Three-Dimensional (3D) <br> |
|
|
|
|
|
**Other Properties Related to Output:** |
|
|
|
|
|
*2D Outputs:* <br> |
|
|
- `pred_logits`: [Batch, 100, 14] - Classification scores for 100 queries across 13 semantic classes + background <br> |
|
|
- `pred_masks`: [Batch, 100, H/2, W/2] - Binary segmentation masks for each query <br> |
|
|
- `pred_depths`: [Batch, 1, H/2, W/2] - Per-pixel depth in meters, range [0.4, 6.0] <br> |
|
|
- `panoptic_seg`: [H/2, W/2] - 2D panoptic segmentation with instance IDs <br> |
|
|
- `pose_enc`: [Batch, 9] - Camera pose encoding from VGGT <br> |
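
As a hedged illustration of how the 2D tensors above combine, the sketch below fuses the 100 query predictions into a per-pixel semantic map. This is the semantic-only simplification of Mask2Former-style inference; full panoptic fusion additionally resolves instance overlaps, and how you obtain the raw tensors from `outputs` depends on the packaged API.

```python
# Hedged sketch: fuse query classifications and masks into a semantic map.
# pred_logits: [B, 100, 14] (13 classes + no-object); pred_masks: [B, 100, H/2, W/2].
import torch

def semantic_map(pred_logits: torch.Tensor, pred_masks: torch.Tensor) -> torch.Tensor:
    probs = pred_logits.softmax(-1)[..., :-1]  # drop the no-object class -> [B, 100, 13]
    mask_probs = pred_masks.sigmoid()
    # Per-pixel class score = sum over queries of (class prob x mask prob).
    scores = torch.einsum("bqc,bqhw->bchw", probs, mask_probs)
    return scores.argmax(dim=1)                # [B, H/2, W/2] class IDs
```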
|
|
|
|
|
*3D Outputs:* <br> |
|
|
- `geometry`: [Batch, 256, 256, 256] - TSDF representing reconstructed 3D geometry <br> |
|
|
- `panoptic_seg_3d`: [Batch, 256, 256, 256] - 3D panoptic segmentation with instance IDs <br> |
|
|
- `semantic_seg_3d`: [Batch, 256, 256, 256] - 3D semantic segmentation with class labels <br> |
|
|
- `instance_info`: List of dictionaries containing per-instance 3D meshes and metadata <br> |
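
Because `geometry` is a TSDF, a triangle mesh can be recovered with marching cubes at the zero level set. A minimal sketch, assuming `scikit-image` and `trimesh` are installed and noting that the voxel size below is a placeholder (check the model configuration for the real value):

```python
# Hedged sketch: extract a surface mesh from the TSDF with marching cubes.
# Assumes outputs.geometry_3d is a (256, 256, 256) float array whose zero
# level set is the reconstructed surface.
import numpy as np
from skimage import measure
import trimesh

tsdf = np.asarray(outputs.geometry_3d)
verts, faces, normals, _ = measure.marching_cubes(tsdf, level=0.0)
voxel_size = 0.03  # placeholder voxel size in meters; not from the model card
mesh = trimesh.Trimesh(vertices=verts * voxel_size, faces=faces, vertex_normals=normals)
mesh.export("output_dir/scene_mesh.ply")
```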
|
|
|
|
|
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
|
|
## Software Integration |
|
|
|
|
|
**Runtime Engine(s):** |
|
|
* TAO Toolkit Triton Apps <br> |
|
|
|
|
|
**Supported Hardware Microarchitecture Compatibility:** <br> |
|
|
* Optimized for NVIDIA A100 80GB GPUs (Ampere architecture). <br> |
|
|
* Requires a GPU with high memory capacity (≥40GB recommended). <br>
|
|
* Compatible with other NVIDIA Ampere or NVIDIA Hopper GPUs (e.g., H100), though memory and interconnect bandwidth may affect performance. <br> |
|
|
|
|
|
**Preferred/Supported Operating System(s):** |
|
|
- Preferred: Ubuntu 22.04.5 LTS (Jammy Jellyfish), tested with CUDA 11.8. <br> |
|
|
|
|
|
- Supported: Other Ubuntu releases (20.04 and later) and Linux distributions with compatible CUDA 11.x drivers. <br>
|
|
|
|
|
The model requires an NVIDIA GPU with ≥40GB of memory for training and ≥30GB for inference. The integration of this model into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. <br>
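
A quick hedged check of the memory guideline before running inference (assumes a CUDA-enabled PyTorch install):

```python
# Verify the GPU meets the >=30 GB inference memory guideline above.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB")
if total_gb < 30:
    print("Warning: <30 GB of GPU memory; inference may run out of memory.")
```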
|
|
|
|
|
## Model Version(s) |
|
|
|
|
|
1.0 <br> |
|
|
|
|
|
*(Pre-trained NVPanoptix-3D model deployable to Triton Inference Server for inference)* <br>
|
|
|
|
|
|
|
|
## Training, Testing, and Evaluation Datasets |
|
|
|
|
|
### Dataset Overview |
|
|
|
|
|
**Total Number of Datasets:** 2 (3D-FRONT and Matterport3D) <br>
|
|
|
|
|
**Data Modality:** Image, 3D Geometry <br>
|
|
|
|
|
### 3D-FRONT |
|
|
|
|
|
**Link:** https://tianchi.aliyun.com/dataset/65347 <br> |
|
|
**Data Modality**: Image, 3D Geometry <br> |
|
|
**Image Training Data Size:** Less than a Million Images <br> |
|
|
**Data Collection Method:** Synthetic - Photorealistic rendered images from CAD models <br> |
|
|
**Labeling Method:** Synthetic <br> |
|
|
**Properties:** 3D-FRONT is a synthetic dataset of indoor scenes featuring photorealistically rendered RGB images accompanied by ground-truth 3D geometry, depth maps, semantic labels, and instance segmentations. It covers a diverse range of room types, including bedrooms, living rooms, dining rooms, and offices, with realistic furniture arrangements representative of residential spaces. The dataset contains 18,797 indoor scenes, each captured from multiple viewpoints, and is split into 4,389, 489, and 1,206 images for training, validation, and testing, respectively. <br>
|
|
|
|
|
### Matterport3D |
|
|
|
|
|
**Link:** https://niessner.github.io/Matterport/ <br> |
|
|
**Data Modality:** Image, 3D Geometry <br> |
|
|
**Image Training Data Size:** Less than a Million Images <br> |
|
|
**Data Collection Method:** Automatic/Sensors - Real-world 3D scans using Matterport Pro camera <br> |
|
|
**Labeling Method:** Hybrid (Automatic/Sensors, Human) - Semi-automatic with human verification <br>
|
|
**Properties:** Matterport3D is a real-world dataset comprising 3D reconstructions of 90 indoor scenes. It provides RGB images, depth maps, camera poses, and semantic annotations across diverse environments such as homes, offices, and other building types. Each scene includes dense 3D point clouds and surface reconstructions annotated with category-level semantic labels. The dataset is divided into 34,737, 4,898, and 8,631 images for training, validation, and testing, corresponding to 61, 11, and 18 scenes, respectively. <br> |
|
|
|
|
|
## Inference
|
|
**Acceleration Engine:** Triton <br> |
|
|
**Test Hardware:** <br> |
|
|
- 1x NVIDIA A100 80GB <br> |
|
|
- 1x NVIDIA H100 80GB <br> |
|
|
|
|
|
**Configuration:** |
|
|
- Precision: FP32 <br> |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br> |
|
|
|
|
|
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br> |