Harmonizer model card and checkpoint release

Checkpoints: diffusion_harmonizer.pkl (temporal), harmonizer_nontemporal.pt (non-temporal). Model card documents benchmarks, per-checkpoint descriptions, training dataset sizes, and inference times.

Files changed (5)

.gitattributes +35 -0
README.md +240 -0
config.json +17 -0
diffusion_harmonizer.pkl +3 -0
harmonizer_nontemporal.pt +3 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
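
The two checkpoint files in this change are stored through Git LFS because their extensions match patterns in the `.gitattributes` hunk above. A minimal sketch of that check follows. It assumes shell-style glob matching on the basename is a close enough approximation of gitattributes pattern rules for these simple `*.ext` patterns, and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the patterns from the .gitattributes hunk (not the full list).
LFS_PATTERNS = ["*.pkl", "*.pt", "*.pth", "*.safetensors", "*.bin"]


def is_lfs_tracked(filename: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does the basename match any LFS glob pattern?

    Assumption: fnmatch-style globbing stands in for real gitattributes
    matching, which has additional rules for paths and '**'.
    """
    basename = filename.rsplit("/", 1)[-1]
    return any(fnmatchcase(basename, pattern) for pattern in patterns)


for name in ["diffusion_harmonizer.pkl", "harmonizer_nontemporal.pt",
             "README.md", "config.json"]:
    print(f"{name}: LFS-tracked = {is_lfs_tracked(name)}")
```

Under this approximation, only the `.pkl` and `.pt` files are LFS-tracked, which is consistent with the three-line pointer files shown at the end of this diff.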

README.md ADDED

	@@ -0,0 +1,240 @@

+---
+language:
+  - en
+license: other
+license_name: nvidia-open-model-license
+license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
+tags:
+  - nvidia
+  - harmonizer
+  - image-to-image
+  - diffusion
+  - neural-reconstruction
+  - gaussian-splatting
+  - nerf
+  - physical-ai
+  - autonomous-driving
+pipeline_tag: image-to-image
+library_name: pytorch
+datasets:
+  - nvidia/Harmonizer-Dataset
+arxiv: 2602.24096
+---
+# Harmonizer | Model Card
+[**Paper**](https://arxiv.org/abs/2602.24096) | [**Project Page**](https://research.nvidia.com/labs/sil/projects/diffusion-harmonizer/) | [**Code**](https://github.com/NVIDIA/harmonizer) | [**Model**](https://huggingface.co/nvidia/Harmonizer) | [**Data**](https://huggingface.co/datasets/nvidia/Harmonizer-Dataset)
+## Description
+Harmonizer is a single-step image diffusion model trained as an online generative enhancer for neural-reconstruction image and video renderings. It transforms imperfect novel-view renderings produced by Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) reconstructions into temporally consistent outputs that are closer to real captures, while correcting illumination, shadow, and reconstruction-artifact issues that arise when dynamic objects are composited into reconstructed scenes.
+Harmonizer supports two operation modes:
+- **Offline mode:** Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, which are then distilled back into the 3D representation. This enhances underconstrained regions and improves overall 3D representation quality.
+- **Online mode:** Acts as a single-step neural enhancer during simulation or inference. It harmonizes color and lighting, reconstructs missing or inconsistent shadows for inserted dynamic objects, and removes residual reconstruction artifacts caused by imperfect 3D supervision and the capacity limits of current reconstruction models.
+Harmonizer is designed as a single model compatible with both NeRF and 3DGS representations. The model was trained on data curated with 3DGUT-based reconstructions and is adaptable to Gaussian Splatting scenes.
+### License/Terms of Use
+### Governing Terms
+Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
+Deployment Geography: Global
+### Release Management
+The model artifacts are released in this repository. Training and inference code is available from the [Harmonizer GitHub repository](https://github.com/NVIDIA/harmonizer). The associated dataset is available from [nvidia/Harmonizer-Dataset](https://huggingface.co/datasets/nvidia/Harmonizer-Dataset).
+## Use Case
+Harmonizer is intended for Physical AI developers looking to enhance and harmonize neural-reconstruction pipelines for autonomous-vehicle simulation. The model takes an image or image sequence as input and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.
+## Benchmark Results
+Benchmarks were evaluated on 864 images from NDAS MLMCF and ParkNet training sessions. PSNR is higher-is-better; LPIPS and FID are lower-is-better.
+| Model | PSNR | LPIPS | FID |
+| :---- | ----: | ----: | ----: |
+| [Difix3D+](https://huggingface.co/nvidia/difix) | 28.33 | 0.16 | 54.20 |
+| [Fixer: cosmos_3dgut](https://huggingface.co/nvidia/Fixer) | 30.99 | 0.16 | 41.87 |
+| Harmonizer: non-temporal mode<br>(fastest runtime; `--enable-harmonizer` in NuRec gRPC)<br><br>Inference enabled through the following checkpoints:<br>`harmonizer_nontemporal.pt`<br>`diffusion_harmonizer.pkl` with `--nontemporal` flag | 30.48 | 0.16 | 32.05 |
+| Harmonizer: temporal mode<br>(highest quality output)<br><br>Inference enabled through the following checkpoint:<br>`diffusion_harmonizer.pkl` | 31.06 | 0.15 | 27.40 |
+## Release Date
+V1: June 2026
+## Reference(s)
+- [DiffusionHarmonizer paper](https://arxiv.org/abs/2602.24096)
+- [DiffusionHarmonizer project page](https://research.nvidia.com/labs/sil/projects/diffusion-harmonizer/)
+- [Harmonizer training and inference code](https://github.com/NVIDIA/harmonizer)
+- [Harmonizer dataset](https://huggingface.co/datasets/nvidia/Harmonizer-Dataset)
+## Model Architecture
+**Architecture Type:** Diffusion Transformer
+**Network Architecture:** Diffusion Transformer, based on Cosmos Predict2 0.6B, post-trained as a single-step, temporally conditioned image-to-image enhancer for neural-reconstruction renderings.
+The project page describes the backbone as the CosmosPredict2 0.6B text-to-image model fine-tuned on real-world and simulation training pairs from scalable data-curation pipelines for color and lighting harmonization, shadow correction, and artifact correction.
+## Model Input
+**Input Type(s):** Image / Image sequence
+**Input Format:** Red, Green, Blue (RGB)
+**Input Parameters:** Two-Dimensional (2D)
+**Other Properties Related to Input:** Specific resolution: 576 × 1024 px
+## Model Output
+**Output Type(s):** Image
+**Output Format:** Red, Green, Blue (RGB)
+**Output Parameters:** Two-Dimensional (2D)
+**Other Properties Related to Output:** Specific resolution: 576 × 1024 px
+## Software Integration
+**Runtime Engine(s):** PyTorch
+**Supported Hardware Microarchitecture Compatibility:**
+- NVIDIA Ampere
+- NVIDIA Hopper
+- NVIDIA Blackwell
+**Preferred/Supported Operating System(s):** Linux
+NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and software frameworks such as CUDA libraries, the model can achieve faster training and inference times compared to CPU-only systems.
+## Model Version
+Harmonizer-cosmos-0.6B
+We release the two checkpoints described below.
+1. **`diffusion_harmonizer.pkl`** — The temporally conditioned Harmonizer checkpoint reported in the DiffusionHarmonizer paper. Recommended when temporal coherence across consecutive rendered frames is required (e.g., video-style novel-view simulation). The model also supports a faster, non-temporally conditioned inference mode via the `--nontemporal` flag.
+   Inference speed on H100:
+   - full model (default): 212 ms / 576 × 1024 px image
+   - `--nontemporal` mode: 28 ms / 576 × 1024 px image
+2. **`harmonizer_nontemporal.pt`** — Exported JIT model for non-temporal, per-image inference. The checkpoint does not support conditioning on previous frames and corresponds to `diffusion_harmonizer.pkl` with `--nontemporal` flag. Recommended for per-image enhancement use cases where neighboring-frame context is unavailable or unnecessary, or where speed is critical.
+   Inference speed on H100: 28 ms / 576 × 1024 px image.
+Pretrained checkpoints are hosted on Hugging Face under [nvidia/Harmonizer](https://huggingface.co/nvidia/Harmonizer). To download all released checkpoints into a local `models/` directory:
+```bash
+hf download nvidia/Harmonizer --local-dir models
+```
+Refer to the [code release](https://github.com/NVIDIA/harmonizer) for the exact inference entry points and configuration files associated with each checkpoint. By default, the model runs in temporal mode; to run in non-temporal mode, add the `--nontemporal` flag.
+## Repository Contents
+This repository contains the following model artifact files:
+- `diffusion_harmonizer.pkl` — DiffusionHarmonizer paper temporal checkpoint
+- `harmonizer_nontemporal.pt` — non-temporal single-frame checkpoint (PyTorch `.pt` format)
+For more details, see the [Model Version](#model-version) section.
+## Training, Testing, and Evaluation Datasets
+Harmonizer was trained, tested, and evaluated using an internal dataset of curated synthetic–real image pairs constructed from five complementary curation pipelines (ISP modification, relighting, asset re-insertion, PBR shadow simulation, and novel-view artifact correction), where 80% of the data was used for training, 10% for evaluation, and 10% for testing. The total volume of training data amounted to ~1 million pairs. Training data will be released at [nvidia/Harmonizer-Dataset](https://huggingface.co/datasets/nvidia/Harmonizer-Dataset).
+### NVIDIA Internal AV Dataset
+**Data Collection Method:** Sensors
+**Labeling Method by Dataset:** Human
+**Properties:** The dataset contains autonomous-driving images and videos captured by NVIDIA vehicles. It serves as the source data from which the synthetic-real training pairs are derived.
+## Inference
+**Engine:** PyTorch>=2.0.0
+**Test Hardware:**
+We tested on H100:
+**`diffusion_harmonizer.pkl`**
+- full model (default): 212 ms / 576 × 1024 px image
+- `--nontemporal` mode: 28 ms / 576 × 1024 px image
+**`harmonizer_nontemporal.pt`**
+- 28 ms / 576 × 1024 px image
+## Known Technical Limitations
+The reconstruction relies on the quality and consistency of input images and camera calibrations; deficiencies in these areas can negatively impact the final output.
+## Known Risk(s)
+The model is not guaranteed to fix 100% of image artifacts. Please verify that generated scenarios are context- and use-appropriate.
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with the terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+For more detailed information on ethical considerations for this model, please see the ModelCard++ Explainability, Bias, Safety & Security, and Privacy subcards.
+Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
+## ModelCard++
+### Bias
+| Field | Response |
+| :---- | :------- |
+| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |
+| Measures taken to mitigate against unwanted bias | None |
+### Explainability
+| Field | Response |
+| :---- | :------- |
+| Intended Domain | Advanced Driver Assistance Systems |
+| Model Type | Image-to-Image |
+| Intended Users | Autonomous Vehicles developers enhancing and harmonizing Neural Reconstruction pipelines. |
+| Output | Image |
+| Describe how the model works | The model takes as input an image and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts. |
+| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
+| Technical Limitations | The reconstruction relies on the quality and consistency of input images and camera calibrations; any deficiencies in these areas can negatively impact the final output. |
+| Verified to have met prescribed NVIDIA quality standards | Yes |
+| Performance Metrics | FID (Frechet Inception Distance); PSNR (Peak Signal-to-Noise Ratio); LPIPS (Learned Perceptual Image Patch Similarity) |
+| Potential Known Risks | The model is not guaranteed to fix 100% of the image artifacts. Please verify the generated scenarios are context- and use-appropriate. |
+| Licensing | Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). |
+### Privacy
+| Field | Response |
+| :---- | :------- |
+| Generatable or reverse engineerable personal data? | No |
+| Personal data used to create this model? | No |
+| How often is the dataset reviewed? | Before release |
+| Is there provenance for all datasets used in training? | Yes |
+| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
+| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
+### Safety & Security
+| Field | Response |
+| :---- | :------- |
+| Model Application(s) | Image Enhancement |
+| List types of specific high-risk AI systems, if any, in which the model can be integrated | The model can be used to develop Autonomous Vehicles stacks that can be integrated inside vehicles. The Harmonizer model should not be deployed in a vehicle. |
+| Describe the life critical impact, if present | N/A - The model should not be deployed in a vehicle and will not perform life-critical tasks. |
+| Use Case Restrictions | Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). |
+| Model and dataset restrictions | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
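
The per-image latencies reported in the README's Model Version and Inference sections can be converted to approximate frame rates. The sketch below does only that arithmetic. It assumes frames are processed one at a time and ignores data loading, batching, and transfer overhead, none of which the model card quantifies.

```python
# Latencies reported in the model card (H100, 576 x 1024 px input), in ms.
LATENCY_MS = {
    "diffusion_harmonizer.pkl (temporal, default)": 212.0,
    "diffusion_harmonizer.pkl (--nontemporal)": 28.0,
    "harmonizer_nontemporal.pt": 28.0,
}


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


for name, latency in LATENCY_MS.items():
    print(f"{name}: {frames_per_second(latency):.1f} frames/s")

# Ratio between the temporal and non-temporal latencies.
speedup = 212.0 / 28.0
print(f"non-temporal speedup: {speedup:.1f}x")
```

Under these assumptions, temporal mode corresponds to roughly 4.7 frames/s and non-temporal mode to roughly 35.7 frames/s, about a 7.6× difference. The benchmark table shows what temporal mode gets in return (FID 27.40 versus 32.05).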

config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "format_version": 1,
+  "name": "Harmonizer",
+  "description": "Bundle manifest for the Harmonizer model repository.",
+  "components": [
+    {
+      "name": "diffusion_harmonizer",
+      "file": "diffusion_harmonizer.pkl",
+      "role": "DiffusionHarmonizer paper temporal harmonization"
+    },
+    {
+      "name": "harmonizer_nontemporal",
+      "file": "harmonizer_nontemporal.pt",
+      "role": "non-temporal single-frame harmonization"
+    }
+  ]
+}
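
The manifest above lists one file per component, so a consumer can check that a local download is complete before loading anything. A minimal sketch follows, assuming the checkpoints were downloaded into a local `models/` directory as in the README's `hf download` command. The helper `missing_components` is hypothetical and not part of the Harmonizer code release.

```python
import json
from pathlib import Path

# Copy of the config.json manifest added in this change.
MANIFEST_TEXT = """
{
  "format_version": 1,
  "name": "Harmonizer",
  "description": "Bundle manifest for the Harmonizer model repository.",
  "components": [
    {"name": "diffusion_harmonizer",
     "file": "diffusion_harmonizer.pkl",
     "role": "DiffusionHarmonizer paper temporal harmonization"},
    {"name": "harmonizer_nontemporal",
     "file": "harmonizer_nontemporal.pt",
     "role": "non-temporal single-frame harmonization"}
  ]
}
"""


def missing_components(manifest: dict, repo_dir: Path) -> list:
    """Return the component files from the manifest that are absent on disk."""
    return [
        component["file"]
        for component in manifest["components"]
        if not (repo_dir / component["file"]).is_file()
    ]


manifest = json.loads(MANIFEST_TEXT)
component_files = [component["file"] for component in manifest["components"]]
print("declared components:", component_files)

# "models" is an assumed local directory; adjust to wherever the files live.
print("missing locally:", missing_components(manifest, Path("models")))
```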

diffusion_harmonizer.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6e65e9ec33019a49f7f0d15f81f5e089c2527e131bd5339d6a14f9b3531db524
+size 5042242222

harmonizer_nontemporal.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ece8e2daa914e8c2a027a2da94e0eb2064491d5b3fd8514009fae9a442e06e90
+size 1448843112
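
The two checkpoint entries above are Git LFS pointer files: each records the SHA-256 digest and byte size of the real file. The sketch below parses a pointer and compares a downloaded file against it. The function names are hypothetical, and the local path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path

# Pointer contents for harmonizer_nontemporal.pt, copied from this diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:ece8e2daa914e8c2a027a2da94e0eb2064491d5b3fd8514009fae9a442e06e90
size 1448843112
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and digest against a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing a multi-GB file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(f"expected size: {pointer['size'] / 1e9:.2f} GB, algorithm: {pointer['algo']}")
```

For example, `matches_pointer(Path("models/harmonizer_nontemporal.pt"), pointer)` would return `True` only if the file at that assumed path has the recorded size (about 1.45 GB) and digest. The same check applies to `diffusion_harmonizer.pkl` with its own pointer (about 5.04 GB).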