	# ArtiFixer Overview

	## Description:

	ArtiFixer is a few-step causal auto-regressive model that enhances and extends 3D reconstruction. The related source code provides implementations for training, evaluation, and inference, supporting various stages including bidirectional training, diffusion forcing, and Self-Forcing-style DMD distillation.
	ArtiFixer was developed by NVIDIA (Spatial Intelligence Lab) and is based on Wan2.1's 14B model.
	_This model is for research and development only._

	### License/Terms of Use:

	GOVERNING DOWNLOAD TERMS: Use of the model is governed by the [NVIDIA License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczovL3d3dy5nb29nbGUuY29tLyIsIm5jaWQiOiJzby15b3V0LTg3MTcwMS12dDQ4In0=).
	ADDITIONAL INFORMATION: The Wan 2.1 14B base model is governed by the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

	### Deployment Geography:

	Global

	### Use Case:

	Developers and researchers working on 3D reconstruction, diffusion models, and auto-regressive techniques for enhancing and extending 3D reconstruction capabilities.

	### Release Date:

	Other: Hugging Face: 06/04/2026 via https://research.nvidia.com/labs/sil/projects/artifixer/

	## Reference(s):

	[ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models](https://research.nvidia.com/labs/sil/projects/artifixer/assets/paper.pdf)

	## Model Architecture:

	Architecture Type: Transformer
	Network Architecture: ArtifixerTransformer (built on Wan2.1's WanTransformer3DModel)
	This model was developed based on Wan-AI/Wan2.1-T2V-14B-Diffusers.
	Number of model parameters: ~16.9B trainable (16,910,955,584)

	## Input:

	Input Type(s): Image, text
	Input Format(s): RGB (Red, Green, Blue), opacity maps, camera ray maps, and text
	Input Parameters: Rendered RGB and opacity maps from the underlying 3D representation, camera ray maps, and text prompt
	Other Properties Related to Input: Model refines and extends renderings from imperfect 3D reconstruction, requiring camera intrinsics and camera poses.
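As an illustration of how a camera ray map can be derived from camera intrinsics and poses, here is a minimal NumPy sketch. It is not taken from the ArtiFixer source: the function name `camera_ray_map`, the 6-channel origin-plus-direction layout, the pinhole model, and the OpenCV-style camera convention (+z forward) are all assumptions, and the released code may encode rays differently (for example as Plücker coordinates).

```python
import numpy as np


def camera_ray_map(K, cam_to_world, height, width):
    """Return per-pixel ray origins and unit directions in world space.

    K: (3, 3) pinhole intrinsics. cam_to_world: (4, 4) camera-to-world pose.
    Output shape: (height, width, 6) laid out as [origin_xyz, direction_xyz].
    This layout is an assumption, not the documented ArtiFixer format.
    """
    # Pixel centers: (u, v) = (column + 0.5, row + 0.5).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the inverse intrinsics to camera-space directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into world space and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray of a pinhole camera starts at the camera center.
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)
```

For a 3x3 image with the principal point at (1.5, 1.5) and an identity rotation, the center pixel's direction is the optical axis (0, 0, 1).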

	## Output:

	Output Type(s): Image
	Output Format: RGB (Red, Green, Blue)
	Output Parameters: Two-Dimensional (2D) image frames
	Other Properties Related to Output: Generates enhanced images via a few-step causal auto-regressive diffusion model.

	Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

	## Software Integration:

	Runtime Engine(s): PyTorch, Hugging Face Diffusers, Hugging Face Transformers, FlashAttention (FA3 on Hopper, FA4 on Blackwell)
	Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Blackwell
	Supported Operating System(s): Linux

	The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

	## Model Version(s):

	ArtiFixer v1.0
	ArtiFixer integrates with PyTorch and requires a CUDA environment. Dockerfiles are provided for CUDA 12 and 13 on both x86_64 and aarch64 architectures. The model can be run with torchrun (or accelerate) in multi-GPU setups and requires dependencies such as flash-attn, accelerate, diffusers, and transformers.

	## Training, Testing, and Evaluation Datasets:

	### Training Dataset:

	Data Modality: Text, Image
	Image Training Data Size: Less than a Million Images
	Text Training Data Size: Less than a Billion Tokens
	Data Collection Method by dataset: Hybrid: Automated/Synthetic
	Labeling Method by dataset: Automated
	Properties (Quantity, Dataset Descriptions, Sensor(s)): Multimodal dataset combining 3D reconstruction data from DL3DV-10K (the DL3DV-ALL-960P release) with text captions. Includes sparse 3D point clouds, camera parameters, and RGB images. Camera poses are estimated with COLMAP, reconstructions are produced with 3DGUT (MCMC densification, via the 3DGRUT library), text captions are generated with a vision-language model (Qwen3-VL-30B-A3B-Instruct), and metric scale is estimated with MoGe. The dataset supports training for 3D reconstruction tasks with both real and generated data.

	### Testing Dataset:

	Data Collection Method by dataset: Automated
	Labeling Method by dataset: Automated
	Properties (Quantity, Dataset Descriptions, Sensor(s)): We evaluate our model on four benchmarks: DL3DV, Nerfbusters, Mip-NeRF 360 (M360), and Tanks & Temples (TandT). Following standard procedure for sparse reconstruction evaluation, only a subset of frames (3, 6, or 9) is used as input, and the model is evaluated on the remaining held-out frames. We follow the protocol proposed by [Cat3D](https://cat3d.github.io/) for M360, [Difix3D+](https://research.nvidia.com/labs/toronto-ai/difix3d/) for Nerfbusters and DL3DV, and [ReconX](https://liuff19.github.io/ReconX/) for TandT. We additionally propose our own evaluation set for DL3DV.
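The sparse-view setup described above (a few input frames, evaluation on the rest) can be sketched as follows. The function name `sparse_view_split` and the evenly spaced selection rule are illustrative assumptions only. Each cited protocol (Cat3D, Difix3D+, ReconX) defines its own frame selection, which this sketch does not reproduce.

```python
import numpy as np


def sparse_view_split(num_frames, num_input_views):
    """Split frame indices into input views and held-out evaluation views."""
    if not 0 < num_input_views <= num_frames:
        raise ValueError("num_input_views must be in [1, num_frames]")

    # Hypothetical selection rule: evenly spaced indices across the sequence.
    input_ids = np.linspace(0, num_frames - 1, num_input_views).round().astype(int)
    input_ids = np.unique(input_ids)

    # Everything not used as input is held out for evaluation.
    held_out = np.setdiff1d(np.arange(num_frames), input_ids)
    return input_ids.tolist(), held_out.tolist()


# Example: 3 input views out of an 11-frame sequence.
inputs, held_out = sparse_view_split(11, 3)
```

Metrics are then computed only on the `held_out` frames, so the model is never scored on views it received as input.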

	### Evaluation Dataset:

	Artifact removal on the Nerfbusters and DL3DV benchmarks (Difix3D+ protocol; NB = Nerfbusters):

	| Method | NB PSNR↑ | NB SSIM↑ | NB LPIPS↓ | NB FID↓ | DL3DV PSNR↑ | DL3DV SSIM↑ | DL3DV LPIPS↓ | DL3DV FID↓ |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| ArtiFixer | 19.83 | 0.701 | 0.254 | 37.78 | 19.73 | 0.672 | 0.231 | 20.85 |
	| ArtiFixer 3D | 20.24 | 0.729 | 0.267 | 39.67 | 20.14 | 0.705 | 0.256 | 24.27 |
	| ArtiFixer 3D+ | 20.12 | 0.713 | 0.264 | 41.17 | 20.06 | 0.686 | 0.242 | 22.61 |

	Additional results — Mip-NeRF 360 sparse-view (3/6/9-view), DL3DV novel-content generation, and Tanks & Temples (supplement) — are reported in the [paper](https://research.nvidia.com/labs/sil/projects/artifixer/assets/paper.pdf).
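For readers unfamiliar with the PSNR columns in the table above, the standard definition is 10·log10(MAX² / MSE). The sketch below is a generic NumPy implementation of that formula, not the evaluation code behind the reported numbers. The function name `psnr` and the assumption that images are floats in [0, 1] are ours, and the benchmark protocols may apply their own cropping or resizing first.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Mean squared error over all pixels and channels.
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is the range the table reports.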

	Data Collection Method by dataset: Hybrid: Automated/Human
	Labeling Method by dataset: Hybrid: Automated/Human
	Properties (Quantity, Dataset Descriptions, Sensor(s)): Evaluated on diverse 3D reconstruction benchmarks including DL3DV and Nerfbusters, assessing performance on held-out validation frames, full source trajectories, and prepared trajectories. Metrics include reconstruction quality, multi-view consistency, and inference speed.

	## Inference:

	Acceleration Engine: FlashAttention (FA3/FA4); PyTorch SDPA (cuDNN) fallback on Ampere
	Test Hardware: NVIDIA A100 80GB, NVIDIA H100, NVIDIA GB200 (Blackwell)
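The backend choice described above (FA3 on Hopper, FA4 on Blackwell, SDPA fallback on Ampere) can be expressed as a dispatch on CUDA compute capability. This is a sketch under stated assumptions, not the selection logic in the ArtiFixer source: the function name `pick_attention_backend` and the returned labels are hypothetical. The capability mapping (8.x for Ampere, 9.x for Hopper, 10.x and above for Blackwell) follows NVIDIA's published compute capabilities.

```python
def pick_attention_backend(major, minor=0):
    """Map a CUDA compute capability to an attention backend label.

    The labels are hypothetical names used for illustration only.
    """
    if major >= 10:
        # Blackwell (e.g. sm_100 on GB200).
        return "flash-attn-4"
    if major == 9:
        # Hopper (sm_90, e.g. H100).
        return "flash-attn-3"
    if major == 8:
        # Ampere (sm_80 / sm_86, e.g. A100): fall back to PyTorch SDPA.
        # Note: Ada GPUs also report 8.9 but are not listed in this card.
        return "sdpa"
    raise ValueError(f"unsupported compute capability {major}.{minor}")
```

On a machine with PyTorch and a visible GPU, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.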

	## Ethical Considerations:

	NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
	Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
