Rename model_cards/Mask2Former.md to model_cards/AV_Object_Mask2former.md

#3
by kangxuey - opened
.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
- *.gif filter=lfs diff=lfs merge=lfs -text
docs/pipeline.gif → AH_multiview_diffusion_turbo.safetensors RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9f7df1837e9eae37f572dc07bd30f96372b89d4c2ec1ba83e626f2acae7abcd8
- size 3645159
+ oid sha256:3dbbc54c2db8875016234a732d18070a0705e9709dee4b8ae7ef61895f08a075
+ size 3345066418
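The pointer swaps above follow the Git LFS pointer format (spec v1): three `key value` lines giving the spec version, the object's `oid` (hash algorithm plus digest), and the real blob's `size` in bytes. A minimal parser sketch for that format, where `parse_lfs_pointer` is an illustrative name and not part of this repo:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (spec v1) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "key value", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # oid is "<algorithm>:<hex digest>"; size is the byte count of the real blob.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

# The new pointer from the hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3dbbc54c2db8875016234a732d18070a0705e9709dee4b8ae7ef61895f08a075
size 3345066418
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])  # sha256 3345066418
```

The ~3.3 GB `size` confirms the rename is not a plain rename: a 3.6 MB GIF pointer was replaced by a multi-gigabyte checkpoint pointer.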
AH_tokengs_lifting.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:576a4250e373547c6864cc3fa6ec310b7c66dd06b8025d609ec6681405896ff8
- size 1299556696
+ oid sha256:9650e8aeeb9dbb5f42231044f6da327046043de0023f6ce64d0ea2f7c5cbdf85
+ size 1299556656
README.md CHANGED
@@ -16,99 +16,15 @@ pipeline_tag: image-to-3d
  ---
 
  # Asset Harvester | System Model Card
- **Paper** | **Project Page** | [**Code**](https://github.com/NVIDIA/asset-harvester) | [**Model**](https://huggingface.co/nvidia/asset-harvester) | [**Data**](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)
-
- ## **Description:**
-
- **Asset Harvester** is an image-to-3D model and end-to-end system that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. The model generates 3D assets from a single image or multiple images of vehicles, VRUs or other road objects extracted from autonomous driving sessions. To run Asset Harvester, please check our [**codebase**](https://github.com/NVIDIA/asset-harvester).
-
- <p align="center">
- <img src="docs/pipeline.gif" alt="Asset Harvester teaser" width="100%" style="border: none;">
- </p>
-
- **Asset Harvester** turns real-world driving logs into complete, simulation-ready 3D assets — from just one or a few in-the-wild object views. It handles vehicles, pedestrians, riders, and other road objects, even under heavy occlusion, noisy calibration, and extreme viewpoint bias. A multiview diffusion model generates consistent novel viewpoints, and a feed-forward Gaussian reconstructor lifts them to full 3D in seconds. The result: high-fidelity 3D Gaussian splat assets ready for insertion into simulation environments. The pipeline plugs directly into NVIDIA NCore and NuRec for scalable data ingestion and closed-loop simulation.
-
- Here's how the model checkpoints in this repo are used in the end-to-end system following the order in the pipeline: The [AV object Mask2former](model_cards/AV_Object_Mask2former.md) instance segmentation model is used for image processing when parsing input views from NCore data sessions.
- The input images are encoded by [C-Radio](https://huggingface.co/nvidia/C-RADIO),
- and the multiview diffusion model, [SparseViewDiT](model_cards/MultiviewDiffusion.md), is then used to generate 16 multiview images of the input objects.
- In cases where camera parameters are not provided, the multiview diffusion model includes a camera pose estimation submodule that predicts camera parameters for the input images.
- Lastly, an [Object TokenGS](model_cards/Object_TokenGS.md) lifts the images to a 3D asset.
+ ### [Paper (coming soon)]() | [Project Page (coming soon)](https://research.nvidia.com/labs/sil/asset-harvester) | [Code](https://github.com/NVIDIA/asset-harvester) | [Model](https://huggingface.co/nvidia/asset-harvester) | [Data](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)
+
+ ## **Description:**
+
+ **Asset Harvester** is a system that leverages 4 models (see the white paper for architecture) to generate 3D assets from a single image or multiple images of vehicles or VRUs. The [AV object Mask2former]() instance segmentation model is used for image processing when parsing input views from NCore data sessions. The input images are encoded by [C-Radio](https://huggingface.co/nvidia/C-RADIO), and the multiview diffusion model, [SparseViewDiT](), is then used to generate 16 multiview images of the input objects, and lastly an [Object TokenGS]() lifts the images to a 3D asset.
 
  This system is ready for commercial/non-commercial use
 
- <details>
- <summary><big><big><strong>🚗 Example Results 🚗</strong></big></big></summary>
-
- Each row contains the input image, object mask, and a rendering of the harvested 3DGS asset.
-
- #### 1. Vehicles / Trucks / Trailers
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/bus_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/trailer_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/tractor_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/truck_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/sedan_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/suv_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/suv_02.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/sedan_02.jpg" width="860"></td>
- </tr>
- </table>
-
- #### 2. VRUs
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_03.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_04.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_05.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_06.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/cyclist_02.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/stroller_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/stroller_02.jpg" width="860"></td>
- </tr>
- </table>
-
- #### 3. Other
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/bin_01.jpg" width="860"></td>
- </tr>
- </table>
-
- </details>
-
  ### **License/Terms of Use**:
 
  ### Governing Terms: Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) .
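The README text in this diff describes a four-stage checkpoint pipeline: Mask2former instance segmentation, C-RADIO image encoding, SparseViewDiT multiview generation (16 views, with an optional camera pose estimation submodule), and Object TokenGS lifting to a 3D Gaussian splat asset. A minimal sketch of that data flow, in which every stage function below is a hypothetical stub and not the repo's real API:

```python
from dataclasses import dataclass
from typing import Optional

# --- Stubs standing in for the real checkpoints; only the data flow is meaningful. ---
def segment_object(img):                 # AV object Mask2former: instance mask per view
    return f"mask({img})"

def encode_cradio(img, mask):            # C-RADIO: conditioning features for a masked crop
    return (img, mask)

def estimate_cameras(features):          # pose submodule, used when no cameras are provided
    return [f"cam{i}" for i in range(len(features))]

def sparse_view_dit(features, cameras, num_views=16):  # multiview diffusion
    return [f"view{i}" for i in range(num_views)]

def tokengs_lift(views):                 # feed-forward lifting to a 3DGS asset
    return {"from_views": len(views)}

@dataclass
class HarvestedAsset:
    gaussians: dict
    source_views: list

def harvest_asset(images: list, cameras: Optional[list] = None) -> HarvestedAsset:
    masks = [segment_object(im) for im in images]
    features = [encode_cradio(im, m) for im, m in zip(images, masks)]
    if cameras is None:                  # camera parameters not provided
        cameras = estimate_cameras(features)
    views = sparse_view_dit(features, cameras, num_views=16)
    return HarvestedAsset(gaussians=tokengs_lift(views), source_views=views)

asset = harvest_asset(["crop_01.jpg"])
print(len(asset.source_views))  # 16
```

The actual entry points live in the [codebase](https://github.com/NVIDIA/asset-harvester); this sketch only mirrors the stage ordering stated in the card.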
config.json DELETED
@@ -1,27 +0,0 @@
- {
-   "format_version": 1,
-   "name": "Asset Harvester",
-   "description": "Bundle manifest for the Asset Harvester system model repository.",
-   "components": [
-     {
-       "name": "camera_estimator",
-       "file": "AH_camera_estimator.safetensors",
-       "role": "camera estimation"
-     },
-     {
-       "name": "multiview_diffusion",
-       "file": "AH_multiview_diffusion.safetensors",
-       "role": "multiview image generation"
-     },
-     {
-       "name": "object_segmentation",
-       "file": "AH_object_seg_jit.pt",
-       "role": "object segmentation"
-     },
-     {
-       "name": "tokengs_lifting",
-       "file": "AH_tokengs_lifting.safetensors",
-       "role": "3D lifting"
-     }
-   ]
- }
 
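The `config.json` deleted above was a small bundle manifest mapping component names to checkpoint files. A loader for that format might have looked like the following sketch; `component_files` is an illustrative helper, not code from the repo:

```python
import json

# The manifest content as it appeared in the deleted file (abbreviated to the
# fields the loader uses).
MANIFEST = """
{
  "format_version": 1,
  "name": "Asset Harvester",
  "components": [
    {"name": "camera_estimator", "file": "AH_camera_estimator.safetensors", "role": "camera estimation"},
    {"name": "multiview_diffusion", "file": "AH_multiview_diffusion.safetensors", "role": "multiview image generation"},
    {"name": "object_segmentation", "file": "AH_object_seg_jit.pt", "role": "object segmentation"},
    {"name": "tokengs_lifting", "file": "AH_tokengs_lifting.safetensors", "role": "3D lifting"}
  ]
}
"""

def component_files(manifest_text: str) -> dict:
    """Map component name to checkpoint file from a bundle manifest."""
    manifest = json.loads(manifest_text)
    if manifest.get("format_version") != 1:
        raise ValueError("unsupported manifest format_version")
    return {c["name"]: c["file"] for c in manifest["components"]}

files = component_files(MANIFEST)
print(files["tokengs_lifting"])  # AH_tokengs_lifting.safetensors
```

Note that the manifest references `AH_multiview_diffusion.safetensors`, while this PR introduces `AH_multiview_diffusion_turbo.safetensors`, which is presumably why the file is being removed rather than updated.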
docs/in_the_wild_examples/bin_01.jpg DELETED
Binary file (55.4 kB)
 
docs/in_the_wild_examples/bus_01.jpg DELETED
Binary file (53.4 kB)
 
docs/in_the_wild_examples/cyclist_02.jpg DELETED
Binary file (64.5 kB)
 
docs/in_the_wild_examples/pedestrian_01.jpg DELETED
Binary file (57.3 kB)
 
docs/in_the_wild_examples/pedestrian_03.jpg DELETED
Binary file (59 kB)
 
docs/in_the_wild_examples/pedestrian_04.jpg DELETED
Binary file (37.8 kB)
 
docs/in_the_wild_examples/pedestrian_05.jpg DELETED
Binary file (52.3 kB)
 
docs/in_the_wild_examples/pedestrian_06.jpg DELETED
Binary file (48.2 kB)
 
docs/in_the_wild_examples/sedan_01.jpg DELETED
Binary file (52.2 kB)
 
docs/in_the_wild_examples/sedan_02.jpg DELETED
Binary file (50.2 kB)
 
docs/in_the_wild_examples/stroller_01.jpg DELETED
Binary file (76.2 kB)
 
docs/in_the_wild_examples/stroller_02.jpg DELETED
Binary file (72.1 kB)
 
docs/in_the_wild_examples/suv_01.jpg DELETED
Binary file (49.8 kB)
 
docs/in_the_wild_examples/suv_02.jpg DELETED
Binary file (43.8 kB)
 
docs/in_the_wild_examples/tractor_01.jpg DELETED
Binary file (62.5 kB)
 
docs/in_the_wild_examples/trailer_01.jpg DELETED
Binary file (41.4 kB)
 
docs/in_the_wild_examples/truck_01.jpg DELETED
Binary file (69.2 kB)
 
model_cards/{MultiviewDiffusion.md → MultviewDiffusion.md} RENAMED
@@ -31,8 +31,7 @@ HuggingFace
 
  **Architecture Type:** Linear Diffusion Transformer
 
- **Network Architecture:** Sparse View Linear-attention Diffusion Transformer, as described in our white paper,
- with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation. C-RADIO for image conditioning signal.
+ **Network Architecture:** Linear-attention Diffusion Transformer with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation. C-RADIO for image conditioning signal.
 
  ## **Input:**
 
@@ -77,18 +76,18 @@ The model was trained, tested, and finetuned using an Objaverse subset internal
 
  | Dataset names | Size and content | Training partition | Test partition |
  | :---- | :---- | :---- | :---- |
- | Nvidia Proprietary AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
+ | Internal Nvidia AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
  | Omniverse 3D assets | 200 3D assets of objects | 100% | 0% |
  | Objaverse | 80k assets collected under commercially viable Creative Commons licenses, | 100% | 0% |
 
- ### Objaverse Commercially Viable Subset under CC licenses
+ ### Objaverse Commercially Viable Subset
 
  **Link:** https://objaverse.allenai.org
  **Data Collection Method:** Synthetic 3D assets aggregated from various open-source and licensed sources
  **Labeling Method by Dataset:** Hybrid: Human and Automated
  **Properties:** This dataset consists of a diverse set of over 80,000 synthetic 3D object models spanning everyday items, animals, tools, and complex structures. Each model is rendered into multi-view 2D images with associated camera poses, materials, and mesh properties.
 
- ### Nvidia Proprietary AV dataset
+ ### Internal NVIDIA AV dataset
 
  **Data Collection Method:** Sensors
 
model_cards/{Object_TokenGS.md → TokenGS.md} RENAMED
File without changes