Upload AH_multiview_diffusion_turbo.safetensors

#2
by jeanlancel - opened
.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- *.gif filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
docs/pipeline.gif → AH_multiview_diffusion_turbo.safetensors RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9f7df1837e9eae37f572dc07bd30f96372b89d4c2ec1ba83e626f2acae7abcd8
3
- size 3645159
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3dbbc54c2db8875016234a732d18070a0705e9709dee4b8ae7ef61895f08a075
3
+ size 3345066418
AH_tokengs_lifting.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:576a4250e373547c6864cc3fa6ec310b7c66dd06b8025d609ec6681405896ff8
3
- size 1299556696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9650e8aeeb9dbb5f42231044f6da327046043de0023f6ce64d0ea2f7c5cbdf85
3
+ size 1299556656
README.md CHANGED
@@ -1,183 +1,100 @@
1
  ---
2
  language:
3
- - en
4
  license: other
5
  license_name: nvidia-open-model-license
6
  license_link: >-
7
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
8
  tags:
9
- - nvidia
10
- - asset-harvester
11
- - image-to-3d
12
- - 3d-generation
13
- - gaussian-splatting
14
- - physical-ai
15
  pipeline_tag: image-to-3d
16
  ---
17
 
18
  # Asset Harvester | System Model Card
19
- **Paper** | **Project Page** | [**Code**](https://github.com/NVIDIA/asset-harvester) | [**Model**](https://huggingface.co/nvidia/asset-harvester) | [**Data**](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)
20
 
21
  ## **Description:**
22
 
23
- **Asset Harvester** is an image-to-3D model and end-to-end system that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. The model generates 3D assets from a single image or multiple images of vehicles, VRUs or other road objects extracted from autonomous driving sessions. To run Asset Harvester, please check our [**codebase**](https://github.com/NVIDIA/asset-harvester).
24
-
25
- <p align="center">
26
- <img src="docs/pipeline.gif" alt="Asset Harvester teaser" width="100%" style="border: none;">
27
- </p>
28
-
29
- **Asset Harvester** turns real-world driving logs into complete, simulation-ready 3D assets — from just one or a few in-the-wild object views. It handles vehicles, pedestrians, riders, and other road objects, even under heavy occlusion, noisy calibration, and extreme viewpoint bias. A multiview diffusion model generates consistent novel viewpoints, and a feed-forward Gaussian reconstructor lifts them to full 3D in seconds. The result: high-fidelity 3D Gaussian splat assets ready for insertion into simulation environments. The pipeline plugs directly into NVIDIA NCore and NuRec for scalable data ingestion and closed-loop simulation.
30
-
31
- Here's how the model checkpoints in this repo are used in the end-to-end system following the order in the pipeline: The [AV object Mask2former](model_cards/AV_Object_Mask2former.md) instance segmentation model is used for image processing when parsing input views from NCore data sessions.
32
- The input images are encoded by [C-Radio](https://huggingface.co/nvidia/C-RADIO),
33
- and the multiview diffusion model, [SparseViewDiT](model_cards/MultiviewDiffusion.md), is then used to generate 16 multiview images of the input objects.
34
- In cases where camera parameters are not provided, the multiview diffusion model includes a camera pose estimation submodule that predicts camera parameters for the input images.
35
- Lastly, an [Object TokenGS](model_cards/Object_TokenGS.md) lifts the images to a 3D asset.
36
 
37
  This system is ready for commercial and non-commercial use.
38
 
39
- <details>
40
- <summary><big><big><strong>🚗 Example Results 🚗</strong></big></big></summary>
41
-
42
- Each row contains the input image, object mask, and a rendering of the harvested 3DGS asset.
43
-
44
- #### 1. Vehicles / Trucks / Trailers
45
-
46
- <table>
47
- <tr>
48
- <td align="center"><img src="docs/in_the_wild_examples/bus_01.jpg" width="860"></td>
49
- </tr>
50
- <tr>
51
- <td align="center"><img src="docs/in_the_wild_examples/trailer_01.jpg" width="860"></td>
52
- </tr>
53
- <tr>
54
- <td align="center"><img src="docs/in_the_wild_examples/tractor_01.jpg" width="860"></td>
55
- </tr>
56
- <tr>
57
- <td align="center"><img src="docs/in_the_wild_examples/truck_01.jpg" width="860"></td>
58
- </tr>
59
- <tr>
60
- <td align="center"><img src="docs/in_the_wild_examples/sedan_01.jpg" width="860"></td>
61
- </tr>
62
- <tr>
63
- <td align="center"><img src="docs/in_the_wild_examples/suv_01.jpg" width="860"></td>
64
- </tr>
65
- <tr>
66
- <td align="center"><img src="docs/in_the_wild_examples/suv_02.jpg" width="860"></td>
67
- </tr>
68
- <tr>
69
- <td align="center"><img src="docs/in_the_wild_examples/sedan_02.jpg" width="860"></td>
70
- </tr>
71
- </table>
72
-
73
- #### 2. VRUs
74
-
75
- <table>
76
- <tr>
77
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_01.jpg" width="860"></td>
78
- </tr>
79
- <tr>
80
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_03.jpg" width="860"></td>
81
- </tr>
82
- <tr>
83
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_04.jpg" width="860"></td>
84
- </tr>
85
- <tr>
86
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_05.jpg" width="860"></td>
87
- </tr>
88
- <tr>
89
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_06.jpg" width="860"></td>
90
- </tr>
91
- <tr>
92
- <td align="center"><img src="docs/in_the_wild_examples/cyclist_02.jpg" width="860"></td>
93
- </tr>
94
- <tr>
95
- <td align="center"><img src="docs/in_the_wild_examples/stroller_01.jpg" width="860"></td>
96
- </tr>
97
- <tr>
98
- <td align="center"><img src="docs/in_the_wild_examples/stroller_02.jpg" width="860"></td>
99
- </tr>
100
- </table>
101
-
102
- #### 3. Other
103
-
104
- <table>
105
- <tr>
106
- <td align="center"><img src="docs/in_the_wild_examples/bin_01.jpg" width="860"></td>
107
- </tr>
108
- </table>
109
-
110
- </details>
111
-
112
- ### **License/Terms of Use**:
113
-
114
- ### Governing Terms: Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) .
115
 
116
- **Deployment Geography:** Global
117
 
118
- ### **Release Management:**
119
-
120
- This system is exposed as a collection of models on [HuggingFace](https://huggingface.co/nvidia/asset-harvester) and inference scripts on [Github](https://github.com/NVIDIA/asset-harvester).
121
 
122
  ## **Automation Level:**
123
 
124
- Partial Automation
125
 
126
  ## **Use Case:**
127
 
128
- Physical AI developers who are looking to create 3D assets of vehicles or VRUs for either closed-loop simulation or Synthetic Data Generation (SDG).
129
 
130
  ## **Known Technical Limitations:**
131
 
132
- The system is not guaranteed to perform well with occluded objects or objects that are outside of the common distribution. For example, a heavily occluded vehicle can generate a poor or hallucinated 3D asset.
133
-
134
 
135
- ## Known Risk(s):
136
 
137
  AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations.
138
 
139
- ##
140
 
141
- **Reference(s):** _(coming soon)_
142
 
143
- [Asset Harvester: Turning Autonomous Driving Logs into 3D Assets for Simulation]()
144
 
145
  ## **System Architecture**
146
- System architecture details described in white paper above.
147
 
148
  ## **System Input:**
149
 
150
- **Input Type(s):** 1 or more images (up until 4\)
151
- **Input Format:** Red, Green, Blue (RGB)
152
- **Input Parameters:** Two-Dimensional (2D)
153
- **Other Properties Related to Input:**
154
 
155
- We currently accept up to 4 input images for each object. The resolution of the images are 512x512. The input images are extracted from NVIDIAs NCore data along w/ other metadata needed for downstream processing:
156
 
157
- * Camera orientation of each image
158
- * Camera distance of each image
159
- * Camera field of view of each image
160
  * Bounding box dimensions of each object
161
 
162
  ## **System Output:**
163
 
164
- **Output Type(s):** Corresponding 3D Gaussian asset to the object in input images
165
- **Output Format:** Polygon File Format (PLY)
166
- **Output Parameters:** Three-Dimensional (3D)
167
- **Other Properties Related to Output:**
168
 
169
- A PLY file (3D Gaussian Splatting, 3DGS) contains 3D object data with the following specific components:
170
 
171
- * **Header**: Defines the file structure, including format (ASCII or binary), Gaussian elements, their properties (e.g., position, appearance coefficients, opacity, scale, rotation), and data types (e.g., float, int).
172
- * **Gaussian Data**: Stores the parameters of each 3D Gaussian as vertex elements: center position (`x`, `y`, `z`), spherical harmonics DC coefficients (`f_dc_0`, `f_dc_1`, `f_dc_2`), `opacity`, anisotropic scale (`scale_0`, `scale_1`, `scale_2`), and rotation quaternion (`rot_0`, `rot_1`, `rot_2`, `rot_3`).
173
 
174
  ## **Hardware Compatibility:**
175
 
176
  **Supported Hardware Microarchitecture Compatibility:**
177
 
178
- * NVIDIA Ampere
179
- * NVIDIA Blackwell
180
- * NVIDIA Hopper
181
  * NVIDIA Lovelace
182
 
183
  **Preferred/Supported Operating Systems:** Linux
@@ -186,14 +103,14 @@ A PLY file (3D Gaussian Splatting, 3DGS) contains 3D object data with the follow
186
 
187
  The system can run on a single NVIDIA GPU with CUDA Compute Capability greater than or equal to 8.0. The following is required:
188
 
189
- * GPU performance \>= 300 Tflops
190
- * GPU memory size \>= 30GB
191
- * GPU memory bandwidth \>= 768 GB/s
192
- * System RAM \>= 32 GB
193
- * System disk storage \>= 100GB
194
  * CPU \>= 16 threads x 3GHz
195
 
196
-
197
 
198
  ## **System Version:**
199
 
@@ -201,7 +118,7 @@ Asset\_Harvester\_GA
201
 
202
  ## **Inference:**
203
 
204
- **Engine:** Pytorch
205
  **Test Hardware:** A100, H100
206
 
207
  ## **Ethical Considerations:**
@@ -214,30 +131,30 @@ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.
214
 
215
  ## Model Card++
216
 
217
- **Bias**
218
 
219
  | Field | Response |
220
  | :---- | :---- |
221
  | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
222
  | Measures taken to mitigate against unwanted bias: | None |
223
 
224
- **Explainability**
225
 
226
  | Field | Response |
227
  | :---- | :---- |
228
- | Intended Domain | Autonomous Driving Simulation |
229
  | Model Type: | Image-to-3D Asset |
230
  | Intended Users: | Autonomous Vehicles developers enhancing and improving Neural Reconstruction pipelines. |
231
  | Output | 3D Asset |
232
- | Describe how the model works | The system takes as an input one or few images, and outputs a corresponding 3D asset |
233
  | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
234
- | Technical Limitations | The system is not guaranteed to perform well with occluded objects or objects that are outside of the common distribution. For example, a heavily occluded vehicle image can generate a poor or hallucinated 3D asset |
235
  | Verified to have met prescribed NVIDIA quality standards | Yes |
236
  | Performance Metrics | PSNR (Peak Signal-to-Noise Ratio) |
237
  | Potential Known Risks | AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations. |
238
- | Licensing | Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
239
 
240
- **Privacy**
241
 
242
  | Field | Response |
243
  | :---- | :---- |
@@ -255,11 +172,17 @@ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.
255
  | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
256
  | Applicable Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
257
 
258
- **Safety & Security**
259
 
260
  | Field | Response |
261
  | :---- | :---- |
262
  | Model Application(s): | 3D Asset Generation |
263
  | Describe the life critical impact (if present). | N/A \- The system should not be deployed in a vehicle to perform life-critical tasks. |
264
- | Use Case Restrictions: | Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) |
265
  | Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training |
 
1
  ---
2
  language:
3
+ - en
4
  license: other
5
  license_name: nvidia-open-model-license
6
  license_link: >-
7
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
8
  tags:
9
+ - nvidia
10
+ - asset-harvester
11
+ - image-to-3d
12
+ - 3d-generation
13
+ - gaussian-splatting
14
+ - physical-ai
15
  pipeline_tag: image-to-3d
16
  ---
17
 
18
  # Asset Harvester | System Model Card
19
 
20
  ## **Description:**
21
 
22
+ Asset Harvester is a system that leverages four models (see System Architecture below) to generate three-dimensional (3D) assets from a single image or multiple images of vehicles. [Mask2Former](https://docs.google.com/document/d/1OKMAhNruoLE254xLLdIWULPuwUWGNsbpg36BNUnpTSQ/edit?tab=t.0#heading=h.7axn5fq6ipu5) and [C-RADIO](https://huggingface.co/nvidia/C-RADIO) are used for view extraction from NCore data sessions; the [Multiview Diffusion (Sana-based)](https://docs.google.com/document/d/1y7qU1to8TrV07Tfz3crxJiuA_AL0Wlwwp6C-RW-NoLg/edit?tab=t.0#heading=h.g8ogslbqcx12) model is then used to generate 16 multiview images of the input vehicle; and lastly [TokenGS](https://docs.google.com/document/d/1EZWB-had-1MMmrES9bvQlJHpjXQFawR619sX3HvsVpQ/edit?usp=sharing) generates the output 3D asset.
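The four-stage flow described above can be sketched as follows. All function bodies are placeholder stubs for illustration only, not the actual Asset Harvester API:

```python
# Hedged sketch of the four-stage Asset Harvester flow; every function
# below is a placeholder stub, not the real implementation.
def segment(views):
    # Mask2Former: per-view object masks
    return [f"mask({v})" for v in views]

def encode(views, masks):
    # C-RADIO: encode masked views into image features
    return list(zip(views, masks))

def diffuse_multiview(feats):
    # Multiview Diffusion (Sana-based): 16 consistent novel views
    return [f"view_{i:02d}" for i in range(16)]

def lift_to_3d(mv_images):
    # TokenGS: lift the generated views to a 3D Gaussian asset (PLY)
    return {"format": "ply", "n_views_used": len(mv_images)}

def harvest_asset(views):
    masks = segment(views)
    feats = encode(views, masks)
    return lift_to_3d(diffuse_multiview(feats))
```

The stubs only illustrate the data handoff between stages: masks gate the encoding, 16 generated views feed the lifter regardless of how many input views were provided.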
23
 
24
  This system is ready for commercial and non-commercial use.
25
 
26
+ ### **License/Terms of Use**:
27
 
28
+ ### GOVERNING TERMS: Your use of the model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
29
 
30
+ **Deployment Geography:** Global
31
 
32
  ## **Automation Level:**
33
 
34
+ Full Automation
35
 
36
  ## **Use Case:**
37
 
38
+ Physical AI developers who are looking to create 3D assets of vehicles for either closed-loop simulation or Synthetic Data Generation (SDG).
39
 
40
  ## **Known Technical Limitations:**
41
 
42
+ The system is not guaranteed to perform well with occluded objects or objects that are outside of the common distribution. For example, a heavily occluded vehicle can generate a poor or hallucinated 3D asset.
43
 
44
+ ## Known Risk(s):
45
 
46
  AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations.
47
 
48
+ ##
49
 
50
+ **Release Date:** Public GitHub \[03/12/2026\]
51
 
52
+ **Reference(s):** None
53
 
54
  ## **System Architecture**
55
+
56
+ **Architecture Diagram:**
57
+
58
+ The following models are used by this system:
59
+
60
+ * [Mask2Former Model Card](https://docs.google.com/document/d/1OKMAhNruoLE254xLLdIWULPuwUWGNsbpg36BNUnpTSQ/edit?tab=t.0#heading=h.7axn5fq6ipu5)
61
+ * [C-RADIO Model Card](https://huggingface.co/nvidia/C-RADIO)
62
+ * [Multiview Diffusion (Sana-based) Model Card](https://docs.google.com/document/d/1y7qU1to8TrV07Tfz3crxJiuA_AL0Wlwwp6C-RW-NoLg/edit?tab=t.0#heading=h.g8ogslbqcx12)
63
+ * [TokenGS Model Card](https://docs.google.com/document/d/1EZWB-had-1MMmrES9bvQlJHpjXQFawR619sX3HvsVpQ/edit?usp=sharing)
64
 
65
  ## **System Input:**
66
 
67
+ **Input Type(s):** 1 or more images (up to 4)
68
+ **Input Format:** Red, Green, Blue (RGB)
69
+ **Input Parameters:** Two-Dimensional (2D)
70
+ **Other Properties Related to Input:**
71
 
72
+ We currently accept up to 4 input images for each object. Each image has a resolution of 512x512. The input images are extracted from NVIDIA's NCore data along with other metadata needed for downstream processing:
73
 
74
+ * Camera orientation of each image
75
+ * Camera distance of each image
76
+ * Camera field of view of each image
77
  * Bounding box dimensions of each object
78
 
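As a rough illustration, the per-object input described above could be bundled as follows. The field names are assumptions for illustration, not the actual NCore schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectViewInput:
    """Hypothetical container for one object's input views and metadata."""
    images: List[str]                 # paths to 1-4 RGB crops, 512x512 each
    camera_orientations: List[float]  # per-image camera orientation
    camera_distances: List[float]     # per-image camera-to-object distance
    camera_fovs: List[float]          # per-image field of view
    bbox_dimensions: Tuple[float, float, float]  # object length, width, height

    def __post_init__(self):
        # The system accepts between 1 and 4 input views per object.
        if not 1 <= len(self.images) <= 4:
            raise ValueError("the system accepts between 1 and 4 input images")
```

A record like this would carry everything the downstream stages need: the views themselves plus the per-view camera metadata and the object's bounding-box dimensions.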
79
  ## **System Output:**
80
 
81
+ **Output Type(s):** 3D Gaussian asset corresponding to the object in the input images
82
+ **Output Format:** Polygon File Format (PLY)
83
+ **Output Parameters:** Three-Dimensional (3D)
84
+ **Other Properties Related to Output:**
85
 
86
+ A [PLY file](https://en.wikipedia.org/wiki/PLY_(file_format)#:~:text=PLY%20is%20a%20computer%20file,dimensional%20data%20from%203D%20scanners.) (3D Gaussian Splatting, 3DGS) contains 3D object data with the following specific components:
87
 
88
+ * **Header**: Defines the file structure, including format (ASCII or binary), Gaussian elements, their properties (e.g., position, appearance coefficients, opacity, scale, rotation), and data types (e.g., float, int).
89
+ * **Gaussian Data**: Stores the parameters of each 3D Gaussian, including its center position (`x, y, z`), and optionally properties such as normals (`nx, ny, nz`), color or spherical harmonics coefficients (`f_dc_0, f_dc_1, f_dc_2`, and higher-order terms), opacity, anisotropic scale, and rotation.
90
 
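A minimal sketch of inspecting such a file's header, using standard PLY header parsing. The property names shown in the test are examples drawn from the description above:

```python
# Hedged sketch: read a 3DGS PLY header and list its per-Gaussian
# properties. Works on ASCII and binary PLY files, since the header
# itself is always plain ASCII lines terminated by "end_header".
def read_ply_header(path):
    """Return (format, vertex_count, property_names) from a PLY header."""
    props, fmt, count = [], None, 0
    with open(path, "rb") as f:
        for raw in f:
            line = raw.decode("ascii", errors="ignore").strip()
            if line.startswith("format"):
                fmt = line.split()[1]           # e.g. "binary_little_endian"
            elif line.startswith("element vertex"):
                count = int(line.split()[-1])   # number of Gaussians
            elif line.startswith("property"):
                props.append(line.split()[-1])  # e.g. "x", "f_dc_0", "opacity"
            elif line == "end_header":
                break
    return fmt, count, props
```

This is enough to sanity-check a harvested asset, e.g. that the expected position, DC color, opacity, scale, and rotation properties are present before loading the Gaussian data itself.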
91
  ## **Hardware Compatibility:**
92
 
93
  **Supported Hardware Microarchitecture Compatibility:**
94
 
95
+ * NVIDIA Ampere
96
+ * NVIDIA Blackwell
97
+ * NVIDIA Hopper
98
  * NVIDIA Lovelace
99
 
100
  **Preferred/Supported Operating Systems:** Linux
 
103
 
104
  The system can run on a single NVIDIA GPU with CUDA Compute Capability greater than or equal to 8.0. The following is required:
105
 
106
+ * GPU performance \>= 300 Tflops
107
+ * GPU memory size \>= 30GB
108
+ * GPU memory bandwidth \>= 768 GB/s
109
+ * System RAM \>= 32 GB
110
+ * System disk storage \>= 100GB
111
  * CPU \>= 16 threads x 3GHz
112
 
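The minimum requirements above can be expressed as a simple host check. The `specs` keys are illustrative names, not output of any NVIDIA tool:

```python
# Hedged sketch: check a host description against the minimum system
# requirements listed above. Key names are illustrative assumptions.
MIN_REQS = {
    "gpu_tflops": 300,   # GPU performance >= 300 Tflops
    "gpu_mem_gb": 30,    # GPU memory size >= 30 GB
    "gpu_bw_gbps": 768,  # GPU memory bandwidth >= 768 GB/s
    "ram_gb": 32,        # system RAM >= 32 GB
    "disk_gb": 100,      # system disk storage >= 100 GB
    "cpu_threads": 16,   # CPU >= 16 threads
    "cpu_ghz": 3,        # ... at >= 3 GHz
}

def meets_requirements(specs):
    """True if every listed minimum is met; missing keys count as 0."""
    return all(specs.get(k, 0) >= v for k, v in MIN_REQS.items())
```

For example, an A100-class host comfortably passes, while an under-provisioned GPU fails on the first threshold.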
113
+ ##
114
 
115
  ## **System Version:**
116
 
 
118
 
119
  ## **Inference:**
120
 
121
+ **Engine:** PyTorch
122
  **Test Hardware:** A100, H100
123
 
124
  ## **Ethical Considerations:**
 
131
 
132
  ## Model Card++
133
 
134
+ ### Bias
135
 
136
  | Field | Response |
137
  | :---- | :---- |
138
  | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
139
  | Measures taken to mitigate against unwanted bias: | None |
140
 
141
+ ### Explainability
142
 
143
  | Field | Response |
144
  | :---- | :---- |
145
+ | Intended Domain | Advanced Driver Assistance Systems |
146
  | Model Type: | Image-to-3D Asset |
147
  | Intended Users: | Autonomous Vehicles developers enhancing and improving Neural Reconstruction pipelines. |
148
  | Output | 3D Asset |
149
+ | Describe how the model works | The system takes an image as input and outputs a corresponding 3D asset |
150
  | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
151
+ | Technical Limitations | The system is not guaranteed to perform well with occluded objects or objects that are outside of the common distribution. For example, a heavily occluded vehicle can generate a poor or hallucinated 3D asset |
152
  | Verified to have met prescribed NVIDIA quality standards | Yes |
153
  | Performance Metrics | PSNR (Peak Signal-to-Noise Ratio) |
154
  | Potential Known Risks | AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations. |
155
+ | Licensing | The use of the model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
156
 
157
+ ### Privacy
158
 
159
  | Field | Response |
160
  | :---- | :---- |
 
172
  | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
173
  | Applicable Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
174
 
175
+ ### Safety & Security
176
 
177
  | Field | Response |
178
  | :---- | :---- |
179
  | Model Application(s): | 3D Asset Generation |
180
  | Describe the life critical impact (if present). | N/A \- The system should not be deployed in a vehicle to perform life-critical tasks. |
181
+ | Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) |
182
  | Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access restrictions are enforced during training |
183
+
184
+ [image1]: images/image1.png
185
+
186
+ [image2]: images/image2.png
187
+
188
+ [image3]: images/image3.png
config.json DELETED
@@ -1,27 +0,0 @@
1
- {
2
- "format_version": 1,
3
- "name": "Asset Harvester",
4
- "description": "Bundle manifest for the Asset Harvester system model repository.",
5
- "components": [
6
- {
7
- "name": "camera_estimator",
8
- "file": "AH_camera_estimator.safetensors",
9
- "role": "camera estimation"
10
- },
11
- {
12
- "name": "multiview_diffusion",
13
- "file": "AH_multiview_diffusion.safetensors",
14
- "role": "multiview image generation"
15
- },
16
- {
17
- "name": "object_segmentation",
18
- "file": "AH_object_seg_jit.pt",
19
- "role": "object segmentation"
20
- },
21
- {
22
- "name": "tokengs_lifting",
23
- "file": "AH_tokengs_lifting.safetensors",
24
- "role": "3D lifting"
25
- }
26
- ]
27
- }
 
docs/in_the_wild_examples/bin_01.jpg DELETED
Binary file (55.4 kB)
 
docs/in_the_wild_examples/bus_01.jpg DELETED
Binary file (53.4 kB)
 
docs/in_the_wild_examples/cyclist_02.jpg DELETED
Binary file (64.5 kB)
 
docs/in_the_wild_examples/pedestrian_01.jpg DELETED
Binary file (57.3 kB)
 
docs/in_the_wild_examples/pedestrian_03.jpg DELETED
Binary file (59 kB)
 
docs/in_the_wild_examples/pedestrian_04.jpg DELETED
Binary file (37.8 kB)
 
docs/in_the_wild_examples/pedestrian_05.jpg DELETED
Binary file (52.3 kB)
 
docs/in_the_wild_examples/pedestrian_06.jpg DELETED
Binary file (48.2 kB)
 
docs/in_the_wild_examples/sedan_01.jpg DELETED
Binary file (52.2 kB)
 
docs/in_the_wild_examples/sedan_02.jpg DELETED
Binary file (50.2 kB)
 
docs/in_the_wild_examples/stroller_01.jpg DELETED
Binary file (76.2 kB)
 
docs/in_the_wild_examples/stroller_02.jpg DELETED
Binary file (72.1 kB)
 
docs/in_the_wild_examples/suv_01.jpg DELETED
Binary file (49.8 kB)
 
docs/in_the_wild_examples/suv_02.jpg DELETED
Binary file (43.8 kB)
 
docs/in_the_wild_examples/tractor_01.jpg DELETED
Binary file (62.5 kB)
 
docs/in_the_wild_examples/trailer_01.jpg DELETED
Binary file (41.4 kB)
 
docs/in_the_wild_examples/truck_01.jpg DELETED
Binary file (69.2 kB)
 
model_cards/AV_Object_Mask2former.md DELETED
@@ -1,144 +0,0 @@
1
- # Mask2Former Overview | Model Card
2
-
3
- ## **Description:**
4
-
5
- The AV Object Mask2Former is a model that performs object instance segmentation tasks. It was trained on object-centric AV images.
6
-
7
- This model is used in the Asset Harvester System.
8
-
9
- ### **License/Terms of Use:**
10
-
11
- GOVERNING TERMS: The use of the model is governed by the [NVIDIA Software and Model Evaluation License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
12
-
13
- ### **Deployment Geography:**
14
-
15
- Global
16
-
17
- ### **Use Case:**
18
-
19
- The model can be used for segmenting object-centric AV images. Given an image cropped from AV video, it output binary mask of the object in the center of the image.
20
-
21
- ### **Release Date:**
22
-
23
- HuggingFace 03/16/26
24
-
25
- ## **Reference:**
26
-
27
- [Bowen Cheng](https://arxiv.org/search/cs?searchtype=author&query=Cheng,+B), [Ishan Misra](https://arxiv.org/search/cs?searchtype=author&query=Misra,+I), [Alexander G. Schwing](https://arxiv.org/search/cs?searchtype=author&query=Schwing,+A+G), [Alexander Kirillov](https://arxiv.org/search/cs?searchtype=author&query=Kirillov,+A), [Rohit Girdhar](https://arxiv.org/search/cs?searchtype=author&query=Girdhar,+R), Masked-attention Mask Transformer for Universal Image Segmentation, [https://arxiv.org/abs/2112.01527](https://arxiv.org/abs/2112.01527).
28
-
29
- ## **Model Architecture:**
30
-
31
- * Fully Convolutional Networks (FCNs) + Transformer
32
-
33
- ## **Input:**
34
-
35
- * **Input Type(s):** Image
36
- * **Input Format(s):** Red, Green, Blue (RGB)
37
- * **Input Parameters:** The input parameters to this model are 2D query features (X0) and 3D image features (Kl, Vl) with dimensions N x C, where N is the number of query features and C is the number of channels.
38
- * **Other Properties Related to Input:** Spatial resolution of image features: 32, 16, 8.
39
-
40
- ## **Output:**
41
-
42
- * **Output Type(s):** Image
43
- * **Output Format(s):** Binary mask
44
- * **Output Parameters:** The output parameters of this model are the predicted mask for each query, with dimensions of the input query features being N x C, where N is the number of query features and C is the number of channels.
45
- * **Other Properties Related to Output:** Resolution: H1=H=32, H2=H=16, H3=H=8 and W1=W=32, W2=W=16
46
-
47
- Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
48
-
49
- ## **Software Integration:**
50
-
51
- **Runtime Engine(s):**
52
- PyTorch
53
-
54
- **Supported Hardware Microarchitecture Compatibility:**
55
-
56
- * NVIDIA Ampere
57
- * NVIDIA Blackwell
58
- * NVIDIA Hopper
59
- * NVIDIA Lovelace
60
-
61
- **[Preferred/Supported] Operating System(s):**
62
- Linux
63
-
64
- ## **Model Version(s):**
65
-
66
- V1
67
-
68
- ## **Training, Testing, and Evaluation Datasets:**
69
-
70
- The AV Object Mask2former was trained, tested, and evaluated using NVIDIA proprietary AV dataset.
71
-
72
- | Dataset names | Size and content | Training partition | Test partition |
73
- | :---- | :---- | :---- | :---- |
74
- | Internal Nvidia AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
75
-
76
- ### Internal NVIDIA AV dataset
77
-
78
- **Link:** N/A
79
-
80
- **Data Collection Method:** Sensors
81
-
82
- **Labeling Method by Dataset:** Automated. The labels we collected are binary masks of objects in the images.
83
-
84
- **Properties**: This dataset was collected using sensors mounted on the NVIDIA fleet and was auto-labeled using a third party tool to ensure high-quality annotations.
85
-
86
- ## **Inference:**
87
-
88
- **Engine:**
89
- PyTorch
90
-
91
- **Test Hardware:**
92
- A6000
93
-
94
- ## **Ethical Considerations:**
95
-
96
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
97
-
98
- For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
99
-
100
- Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
101
-
102
- **Bias**
103
-
104
- | Field | Response |
105
- | :---- | :---- |
106
- | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
107
- | Measures taken to mitigate against unwanted bias: | None |
108
-
109
- **Explainability**
110
-
111
- | Field | Response |
112
- | :---- | :---- |
113
- | Intended Domain | Advanced Driver Assistance Systems |
114
- | Model Type: | Object detection and Instance segmentation |
115
- | Intended Users: | Autonomous Vehicles developers enhancing and improving Neural Reconstruction pipelines. |
116
- | Output | Image Segmentation |
117
- | Describe how the model works | The model takes as an input an image, and outputs a segmentation mask of the image |
118
- | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
119
- | Technical Limitations | The system does not guarantee a 100% success rate. The model was trained mostly on vehicles and would not perform well on pedestrians, cyclists, or other non-vehicular objects and struggles with small objects |
120
- | Verified to have met prescribed NVIDIA quality standards | Yes |
121
- | Performance Metrics | Intersection over Union (IOU) |
122
- | Potential Known Risks | AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations. |
123
- | Licensing | The use of the model is governed by the [NVIDIA Software and Model Evaluation License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
124
-
125
- **Privacy**
-
- | Field | Response |
- | :---- | :---- |
- | Generatable or reverse engineerable personal data? | No |
- | Personal data used to create this model? | No |
- | How often is the dataset reviewed? | Before release |
- | Is there provenance for all datasets used in training? | Yes |
- | Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
- | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
- | Applicable Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
-
- **Safety & Security**
-
- | Field | Response |
- | :---- | :---- |
- | Model Application(s): | Object detection and segmentation |
- | Describe the life critical impact (if present). | N/A \- The model should not be deployed in a vehicle to perform life-critical tasks. |
- | Use Case Restrictions: | The use of the model is governed by the [NVIDIA Software and Model Evaluation License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
- | Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. |
 
model_cards/MultiviewDiffusion.md DELETED
@@ -1,170 +0,0 @@
- # Multiview Diffusion (Sana-based) | Model Card
-
- ## **Description:**
-
- The multiview diffusion model was trained on AV object images with a SANA base model. The model is conditioned on image input and outputs images of the same object from different viewpoints. It does not support text input.
-
- This model is used as part of the Asset Harvester GA.
-
- ### **License/Terms of Use:**
-
- ### Governing Terms: Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
-
- ### **Deployment Geography:**
-
- Global
-
- ### **Use Case:**
-
- The multiview diffusion model takes a set of posed images as input and outputs 16 images of the same input vehicle from different viewpoints. Its goal is to provide the 16 output images as input for three-dimensional (3D) reconstruction to generate 3D assets.
-
- ### **Release Date:**
-
- HuggingFace
-
- ## **Reference(s):**
-
- **Asset-Harvester: Turning Autonomous Driving Logs into 3D Assets for Simulation.** *NVIDIA white paper.*
- \[later we replace it with our paper link\]
-
- ## **Model Architecture:**
-
- **Architecture Type:** Linear Diffusion Transformer
-
- **Network Architecture:** Sparse View Linear-attention Diffusion Transformer, as described in our white paper, with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation and C-RADIO for the image conditioning signal.
-
- ## **Input:**
-
- **Input Type(s):** Up to 4 images (adjustable via config parameter)
-
- **Input Format(s):** Red, Green, Blue (RGB)
-
- **Input Parameters:** Two-Dimensional (2D)
-
- **Other Properties Related to Input:** Camera matrices of the input images
-
- ## **Output:**
-
- **Output Type(s):** 16 Images
-
- **Output Format(s):** Red, Green, Blue (RGB)
-
- **Output Parameters:** Two-Dimensional (2D)
-
- **Other Properties Related to Output:** Camera poses of the output images
-
- Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
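The 16 output viewpoints come with camera poses. Purely as an illustration (the actual camera layout used by the model is defined in the Asset Harvester codebase, and the names below are hypothetical), a ring of 16 azimuth-spaced cameras looking at the object center could be generated like this:

```python
import math

def ring_cameras(n_views=16, radius=2.0, elevation_deg=15.0):
    """Toy camera rig: n_views positions evenly spaced in azimuth on a
    circle around the object, each with a unit forward vector pointing
    at the origin (where the object is assumed to sit). Illustrative
    only; not the model's actual camera convention."""
    elev = math.radians(elevation_deg)
    cams = []
    for i in range(n_views):
        az = 2.0 * math.pi * i / n_views
        pos = (radius * math.cos(elev) * math.cos(az),
               radius * math.cos(elev) * math.sin(az),
               radius * math.sin(elev))
        norm = math.sqrt(sum(c * c for c in pos))
        forward = tuple(-c / norm for c in pos)  # unit vector toward the origin
        cams.append({"position": pos, "forward": forward})
    return cams

rig = ring_cameras()
print(len(rig))  # 16
```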
-
- ## **Software Integration:**
-
- **Runtime Engine(s):**
- PyTorch
-
- **Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
-
- **Supported Operating System(s):**
- Linux
-
- ## **Model Version(s):**
-
- v1
-
- ## **Training, Testing, and Evaluation Datasets:**
-
- The model was trained, tested, and finetuned using an Objaverse subset, internal AV data, and Omniverse 3D assets (synthetic images).
-
- | Dataset names | Size and content | Training partition | Test partition |
- | :---- | :---- | :---- | :---- |
- | NVIDIA Proprietary AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
- | Omniverse 3D assets | 200 3D assets of objects | 100% | 0% |
- | Objaverse | 80k assets collected under commercially viable Creative Commons licenses | 100% | 0% |
-
- ### Objaverse Commercially Viable Subset under CC licenses
-
- **Link:** https://objaverse.allenai.org
- **Data Collection Method:** Synthetic 3D assets aggregated from various open-source and licensed sources
- **Labeling Method by Dataset:** Hybrid: Human and Automated
- **Properties:** This dataset consists of a diverse set of over 80,000 synthetic 3D object models spanning everyday items, animals, tools, and complex structures. Each model is rendered into multi-view 2D images with associated camera poses, materials, and mesh properties.
-
- ### NVIDIA Proprietary AV dataset
-
- **Data Collection Method:** Sensors
-
- **Labeling Method by Dataset:** Human
-
- **Properties**: This dataset was collected using sensors mounted on the NVIDIA fleet and was manually labeled by a team of human annotators to ensure high-quality annotations.
-
- ### Omniverse 3D assets
-
- **Data Collection Method:** Human
-
- **Labeling Method by Dataset:** Human
-
- **Properties**: This dataset consists of 3D assets created by human artists.
-
- ## **Inference:**
-
- **Engine:** PyTorch>=2.0.0
-
- **Test Hardware:**
- We tested on H100, A100, A6000, and RTX 4090. Inference time on 1x A100 is 7 seconds per 16 images.
-
- ## **Ethical Considerations:**
-
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
-
- For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
-
- Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
-
- **Bias**
-
- | Field | Response |
- | :---- | :---- |
- | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
- | Measures taken to mitigate against unwanted bias: | None |
-
- **Explainability**
-
- | Field | Response |
- | :---- | :---- |
- | Intended Domain | Advanced Driver Assistance Systems |
- | Model Type: | Multiview creation |
- | Intended Users: | Autonomous vehicle developers enhancing and improving Neural Reconstruction pipelines. |
- | Output | 16 images |
- | Describe how the model works | The model takes up to 4 images as input and outputs 16 multiview images of the vehicles detected in the original image. |
- | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
- | Technical Limitations | The system does not guarantee a 100% success rate. It cannot fully guarantee the safety and controllability of the generated image content. Additionally, challenges remain in certain complex cases, such as text rendering and the generation of faces and hands. |
- | Verified to have met prescribed NVIDIA quality standards | Yes |
- | Performance Metrics | Peak signal-to-noise ratio (PSNR), Frechet Inception Distance (FID), CLIPScore |
- | Potential Known Risks | AV and robotics developers should be aware that this model cannot guarantee a 100% success rate. In cases of unsuccessful generation, the output may not possess an accurate real-world representation of the asset and should not be relied upon in safety-critical simulations. |
- | Licensing | The use of the model is governed by the [NVIDIA Software and Model Evaluation License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
-
- **Privacy**
-
- | Field | Response |
- | :---- | :---- |
- | Generatable or reverse engineerable personal data? | No |
- | Personal data used to create this model? | Yes |
- | Was consent obtained for any personal data used? | Yes |
- | Is a mechanism in place to honor data subject right of access or deletion of personal data? | Yes |
- | If personal data was collected for the development of the model, was it collected directly by NVIDIA? | No |
- | If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | N/A |
- | If personal data was collected for the development of this AI model, was it minimized to only what was required? | Yes |
- | Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | Yes |
- | How often is the dataset reviewed? | Before release |
- | Is there provenance for all datasets used in training? | Yes |
- | Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
- | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
- | Applicable Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
-
- **Safety & Security**
-
- | Field | Response |
- | :---- | :---- |
- | Model Application(s): | Multiview creation |
- | Describe the life critical impact (if present). | N/A \- The model should not be deployed in a vehicle to perform life-critical tasks. |
- | Use Case Restrictions: | The use of the model is governed by the [NVIDIA Software and Model Evaluation License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
- | Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. |
 
model_cards/Object_TokenGS.md DELETED
@@ -1,146 +0,0 @@
- # Object TokenGS | Model Card
-
- ## **Description:**
-
- Object TokenGS is a feed-forward neural reconstruction model that takes posed multi-view RGB images as input and predicts a 3D Gaussian Splatting (3DGS) representation of the object.
- TokenGS directly regresses 3D Gaussian centers in global coordinates and decouples the number of predicted Gaussians from the input image resolution and number of views by using learnable Gaussian tokens in an encoder-decoder Transformer.
-
- ### **License/Terms of Use:**
- The model is a submodule that follows the terms of [Asset Harvester](https://huggingface.co/nvidia/asset-harvester).
-
- ### **Deployment Geography:**
-
- Global
-
- ### **Use Case:**
-
- Object TokenGS can be used for multi-view 3D object lifting. It takes multiview images as input and converts them into 3D Gaussian assets.
-
- ### **Release Date:**
-
- This model is on [HuggingFace](https://huggingface.co/nvidia/asset-harvester) and the inference script is on [GitHub](https://github.com/NVIDIA/asset-harvester).
-
- ## **References(s):**
-
- - [Asset-Harvester: Turning Autonomous Driving Logs into 3D Assets for Simulation]()
-
- ## **Model Architecture:**
-
- System architecture details are described in the white paper above.
-
- ## **Input:**
-
- **Input Type(s):** Image
- **Input Format(s):** Red, Green, Blue (RGB) images plus camera parameters
- **Input Parameters:** Two-Dimensional (2D) images with camera intrinsics and extrinsics; optional timestamp conditioning for dynamic reconstruction
- **Other Properties Related to Input:**
-
- - Input includes camera intrinsics and camera extrinsics.
- - Images with resolution `512 x 512`
-
- ## **Output:**
-
- **Output Type(s):** 3D Gaussian Splatting primitives and rendered RGB images
- **Output Format(s):** 3DGS parameter tensors (14 attributes per Gaussian primitive) renderable to novel RGB views via a differentiable Gaussian splatting renderer
- **Output Parameters:** 14-dimensional (14D) Gaussian attributes
- **Other Properties Related to Output:**
-
- Each Gaussian includes:
-
- - Mean or center: `(x, y, z)`
- - Color: `(r, g, b)`
- - Scale: `(sx, sy, sz)`
- - Opacity: `alpha`
- - Rotation: quaternion `(qw, qx, qy, qz)`
-
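The attributes above account for all 14 dimensions per primitive: mean (3) + color (3) + scale (3) + opacity (1) + quaternion (4). A minimal sketch of unpacking one 14-attribute vector; the actual tensor layout of the checkpoint is defined by the Asset Harvester codebase, so the ordering used here is illustrative only:

```python
def unpack_gaussian(attrs):
    """Split one 14-attribute Gaussian primitive into named fields.
    The field order here is an assumption for illustration, not the
    checkpoint's authoritative layout."""
    assert len(attrs) == 14, "each primitive carries 14 attributes"
    return {
        "mean":     attrs[0:3],    # (x, y, z) center in global coordinates
        "color":    attrs[3:6],    # (r, g, b)
        "scale":    attrs[6:9],    # (sx, sy, sz)
        "opacity":  attrs[9],      # alpha
        "rotation": attrs[10:14],  # quaternion (qw, qx, qy, qz)
    }

g = unpack_gaussian([0.0, 1.0, 2.0, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.9,
                     1.0, 0.0, 0.0, 0.0])
print(g["rotation"])  # [1.0, 0.0, 0.0, 0.0] (identity quaternion)
```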
- Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and CUDA-enabled software frameworks, the model achieves faster training and inference times compared to CPU-only solutions.
-
- ## **Software Integration:**
-
- **Supported Hardware Microarchitecture Compatibility:**
-
- - NVIDIA Ampere
- - NVIDIA Blackwell
- - NVIDIA Hopper
- - NVIDIA Lovelace
-
- **Supported Operating System(s):**
-
- - Linux
-
- The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
-
- ## **Model Version:**
-
- Asset\_Harvester\_GA
-
- ## **Training, Testing, and Evaluation Datasets:**
-
- Details are described in the white paper above.
-
- ## **Inference:**
-
- **Acceleration Engine:** PyTorch
- **Test Hardware:** NVIDIA A100, H100
-
- ## **Ethical Considerations:**
-
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with the license terms, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
-
- For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
-
- Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not automatically blur or maintain the proportions of the image subjects included.
-
- Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
-
- **Bias**
-
- | Field | Response |
- | :---- | :---- |
- | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
- | Measures taken to mitigate against unwanted bias: | None |
-
- **Explainability**
-
- | Field | Response |
- | :---- | :---- |
- | Intended Task/Domain: | Multi-view 3D object reconstruction. |
- | Model Type: | Transformer |
- | Intended Users: | 3D vision, simulation, graphics, and robotics or physical AI researchers and developers. |
- | Output | 3D Gaussian Splat representation and rendered novel views. |
- | Describe how the model works | An encoder-decoder Transformer with learnable Gaussian tokens directly regresses 3D Gaussian attributes from posed images; it is trained with rendering and visibility losses. |
- | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
- | Technical Limitations & Mitigation | TokenGS may miss fine-grained geometric details. Quality depends on camera pose quality and multiview coverage, so users should validate outputs and provide sufficient view diversity and accurate camera metadata. |
- | Verified to have met prescribed NVIDIA quality standards | Yes |
- | Performance Metrics | PSNR, SSIM, LPIPS; additional comparisons under view extrapolation and camera-noise robustness. |
- | Potential Known Risks | Reconstruction failures or incomplete geometry may produce misleading renderings or assets. |
- | Licensing | The use of the model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
-
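PSNR, listed under Performance Metrics above, follows the standard definition 10 · log10(MAX² / MSE). A minimal sketch over flat pixel sequences (the repository's evaluation code is the authoritative implementation):

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two equal-length pixel
    sequences with values in [0, max_val]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# MSE = 0.005, so PSNR = 10 * log10(1 / 0.005) = 10 * log10(200)
print(round(psnr([0.5, 0.5], [0.5, 0.6]), 2))  # 23.01
```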
- **Privacy**
-
- | Field | Response |
- | :---- | :---- |
- | Generatable or reverse engineerable personal data? | No |
- | Personal data used to create this model? | No |
- | Was consent obtained for any personal data used? | Not Applicable |
- | How often is the dataset reviewed? | Before release |
- | Is a mechanism in place to honor data subject right of access or deletion of personal data? | Not Applicable |
- | If personal data was collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable |
- | If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable |
- | If personal data was collected for the development of this AI model, was it minimized to only what was required? | Not Applicable |
- | Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
- | Is there provenance for all datasets used in training? | Yes |
- | Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
- | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
- | Applicable Privacy Policy | [https://www.nvidia.com/en-us/about-nvidia/privacy-policy/](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
-
- **Safety & Security**
-
- | Field | Response |
- | :---- | :---- |
- | Model Application(s): | 3D object reconstruction |
- | Describe the life critical impact (if present). | Not Applicable. The model is not intended for direct life-critical decision-making, and outputs should not be used as the sole basis for autonomous vehicle perception, robotics control, or operational safety decisions. Additional validation and testing should be incorporated prior to deployment in real-world production. |
- | Use Case Restrictions: | Abide by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) |
- | Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training. |