Rename model_cards/Mask2Former.md to model_cards/AV_Object_Mask2former.md

#3
by kangxuey - opened
.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
- *.gif filter=lfs diff=lfs merge=lfs -text
docs/pipeline.gif → AH_multiview_diffusion_turbo.safetensors RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9f7df1837e9eae37f572dc07bd30f96372b89d4c2ec1ba83e626f2acae7abcd8
- size 3645159
+ oid sha256:3dbbc54c2db8875016234a732d18070a0705e9709dee4b8ae7ef61895f08a075
+ size 3345066418
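The pointer swaps above follow the Git LFS pointer format (spec v1): three `key value` lines giving the spec version, the object's `oid` (hash algorithm plus digest), and the real blob's `size` in bytes. A minimal parser sketch for that format, where `parse_lfs_pointer` is an illustrative name and not part of this repo:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (spec v1) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "key value", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # oid is "<algorithm>:<hex digest>"; size is the byte count of the real blob.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

# The new pointer from the hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3dbbc54c2db8875016234a732d18070a0705e9709dee4b8ae7ef61895f08a075
size 3345066418
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])  # sha256 3345066418
```

The ~3.3 GB `size` confirms the rename is not a plain rename: a 3.6 MB GIF pointer was replaced by a multi-gigabyte checkpoint pointer.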
AH_tokengs_lifting.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:576a4250e373547c6864cc3fa6ec310b7c66dd06b8025d609ec6681405896ff8
- size 1299556696
+ oid sha256:9650e8aeeb9dbb5f42231044f6da327046043de0023f6ce64d0ea2f7c5cbdf85
+ size 1299556656
README.md CHANGED
@@ -16,99 +16,15 @@ pipeline_tag: image-to-3d
  ---
 
  # Asset Harvester | System Model Card
- **Paper** | **Project Page** | [**Code**](https://github.com/NVIDIA/asset-harvester) | [**Model**](https://huggingface.co/nvidia/asset-harvester) | [**Data**](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)
-
- ## **Description:**
-
- **Asset Harvester** is an image-to-3D model and end-to-end system that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. The model generates 3D assets from a single image or multiple images of vehicles, VRUs or other road objects extracted from autonomous driving sessions. To run Asset Harvester, please check our [**codebase**](https://github.com/NVIDIA/asset-harvester).
-
- <p align="center">
- <img src="docs/pipeline.gif" alt="Asset Harvester teaser" width="100%" style="border: none;">
- </p>
-
- **Asset Harvester** turns real-world driving logs into complete, simulation-ready 3D assets — from just one or a few in-the-wild object views. It handles vehicles, pedestrians, riders, and other road objects, even under heavy occlusion, noisy calibration, and extreme viewpoint bias. A multiview diffusion model generates consistent novel viewpoints, and a feed-forward Gaussian reconstructor lifts them to full 3D in seconds. The result: high-fidelity 3D Gaussian splat assets ready for insertion into simulation environments. The pipeline plugs directly into NVIDIA NCore and NuRec for scalable data ingestion and closed-loop simulation.
-
- Here's how the model checkpoints in this repo are used in the end-to-end system following the order in the pipeline: The [AV object Mask2former](model_cards/AV_Object_Mask2former.md) instance segmentation model is used for image processing when parsing input views from NCore data sessions.
- The input images are encoded by [C-Radio](https://huggingface.co/nvidia/C-RADIO),
- and the multiview diffusion model, [SparseViewDiT](model_cards/MultiviewDiffusion.md), is then used to generate 16 multiview images of the input objects.
- In cases where camera parameters are not provided, the multiview diffusion model includes a camera pose estimation submodule that predicts camera parameters for the input images.
- Lastly, an [Object TokenGS](model_cards/Object_TokenGS.md) lifts the images to a 3D asset.
+ ### [Paper (coming soon)]() | [Project Page (coming soon)](https://research.nvidia.com/labs/sil/asset-harvester) | [Code](https://github.com/NVIDIA/asset-harvester) | [Model](https://huggingface.co/nvidia/asset-harvester) | [Data](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)
+
+ ## **Description:**
+
+ **Asset Harvester** is a system that leverages 4 models (see the white paper for architecture) to generate 3D assets from a single image or multiple images of vehicles or VRUs. The [AV object Mask2former]() instance segmentation model is used for image processing when parsing input views from NCore data sessions. The input images are encoded by [C-Radio](https://huggingface.co/nvidia/C-RADIO), and the multiview diffusion model, [SparseViewDiT](), is then used to generate 16 multiview images of the input objects, and lastly an [Object TokenGS]() lifts the images to a 3D asset.
 
  This system is ready for commercial/non-commercial use
 
- <details>
- <summary><big><big><strong>🚗 Example Results 🚗</strong></big></big></summary>
-
- Each row contains the input image, object mask, and a rendering of the harvested 3DGS asset.
-
- #### 1. Vehicles / Trucks / Trailers
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/bus_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/trailer_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/tractor_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/truck_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/sedan_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/suv_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/suv_02.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/sedan_02.jpg" width="860"></td>
- </tr>
- </table>
-
- #### 2. VRUs
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_03.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_04.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_05.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/pedestrian_06.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/cyclist_02.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/stroller_01.jpg" width="860"></td>
- </tr>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/stroller_02.jpg" width="860"></td>
- </tr>
- </table>
-
- #### 3. Other
-
- <table>
- <tr>
- <td align="center"><img src="docs/in_the_wild_examples/bin_01.jpg" width="860"></td>
- </tr>
- </table>
-
- </details>
-
  ### **License/Terms of Use**:
 
  ### Governing Terms: Use of this model system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) .
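The README text in this diff describes a four-stage checkpoint pipeline: Mask2former instance segmentation, C-RADIO image encoding, SparseViewDiT multiview generation (16 views, with an optional camera pose estimation submodule), and Object TokenGS lifting to a 3D Gaussian splat asset. A minimal sketch of that data flow, in which every stage function below is a hypothetical stub and not the repo's real API:

```python
from dataclasses import dataclass
from typing import Optional

# --- Stubs standing in for the real checkpoints; only the data flow is meaningful. ---
def segment_object(img):                 # AV object Mask2former: instance mask per view
    return f"mask({img})"

def encode_cradio(img, mask):            # C-RADIO: conditioning features for a masked crop
    return (img, mask)

def estimate_cameras(features):          # pose submodule, used when no cameras are provided
    return [f"cam{i}" for i in range(len(features))]

def sparse_view_dit(features, cameras, num_views=16):  # multiview diffusion
    return [f"view{i}" for i in range(num_views)]

def tokengs_lift(views):                 # feed-forward lifting to a 3DGS asset
    return {"from_views": len(views)}

@dataclass
class HarvestedAsset:
    gaussians: dict
    source_views: list

def harvest_asset(images: list, cameras: Optional[list] = None) -> HarvestedAsset:
    masks = [segment_object(im) for im in images]
    features = [encode_cradio(im, m) for im, m in zip(images, masks)]
    if cameras is None:                  # camera parameters not provided
        cameras = estimate_cameras(features)
    views = sparse_view_dit(features, cameras, num_views=16)
    return HarvestedAsset(gaussians=tokengs_lift(views), source_views=views)

asset = harvest_asset(["crop_01.jpg"])
print(len(asset.source_views))  # 16
```

The actual entry points live in the [codebase](https://github.com/NVIDIA/asset-harvester); this sketch only mirrors the stage ordering stated in the card.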
config.json DELETED
@@ -1,27 +0,0 @@
- {
-   "format_version": 1,
-   "name": "Asset Harvester",
-   "description": "Bundle manifest for the Asset Harvester system model repository.",
-   "components": [
-     {
-       "name": "camera_estimator",
-       "file": "AH_camera_estimator.safetensors",
-       "role": "camera estimation"
-     },
-     {
-       "name": "multiview_diffusion",
-       "file": "AH_multiview_diffusion.safetensors",
-       "role": "multiview image generation"
-     },
-     {
-       "name": "object_segmentation",
-       "file": "AH_object_seg_jit.pt",
-       "role": "object segmentation"
-     },
-     {
-       "name": "tokengs_lifting",
-       "file": "AH_tokengs_lifting.safetensors",
-       "role": "3D lifting"
-     }
-   ]
- }
 
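The `config.json` deleted above was a small bundle manifest mapping component names to checkpoint files. A loader for that format might have looked like the following sketch; `component_files` is an illustrative helper, not code from the repo:

```python
import json

# The manifest content as it appeared in the deleted file (abbreviated to the
# fields the loader uses).
MANIFEST = """
{
  "format_version": 1,
  "name": "Asset Harvester",
  "components": [
    {"name": "camera_estimator", "file": "AH_camera_estimator.safetensors", "role": "camera estimation"},
    {"name": "multiview_diffusion", "file": "AH_multiview_diffusion.safetensors", "role": "multiview image generation"},
    {"name": "object_segmentation", "file": "AH_object_seg_jit.pt", "role": "object segmentation"},
    {"name": "tokengs_lifting", "file": "AH_tokengs_lifting.safetensors", "role": "3D lifting"}
  ]
}
"""

def component_files(manifest_text: str) -> dict:
    """Map component name to checkpoint file from a bundle manifest."""
    manifest = json.loads(manifest_text)
    if manifest.get("format_version") != 1:
        raise ValueError("unsupported manifest format_version")
    return {c["name"]: c["file"] for c in manifest["components"]}

files = component_files(MANIFEST)
print(files["tokengs_lifting"])  # AH_tokengs_lifting.safetensors
```

Note that the manifest references `AH_multiview_diffusion.safetensors`, while this PR introduces `AH_multiview_diffusion_turbo.safetensors`, which is presumably why the file is being removed rather than updated.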
docs/in_the_wild_examples/bin_01.jpg DELETED
Binary file (55.4 kB)
 
docs/in_the_wild_examples/bus_01.jpg DELETED
Binary file (53.4 kB)
 
docs/in_the_wild_examples/cyclist_02.jpg DELETED
Binary file (64.5 kB)
 
docs/in_the_wild_examples/pedestrian_01.jpg DELETED
Binary file (57.3 kB)
 
docs/in_the_wild_examples/pedestrian_03.jpg DELETED
Binary file (59 kB)
 
docs/in_the_wild_examples/pedestrian_04.jpg DELETED
Binary file (37.8 kB)
 
docs/in_the_wild_examples/pedestrian_05.jpg DELETED
Binary file (52.3 kB)
 
docs/in_the_wild_examples/pedestrian_06.jpg DELETED
Binary file (48.2 kB)
 
docs/in_the_wild_examples/sedan_01.jpg DELETED
Binary file (52.2 kB)
 
docs/in_the_wild_examples/sedan_02.jpg DELETED
Binary file (50.2 kB)
 
docs/in_the_wild_examples/stroller_01.jpg DELETED
Binary file (76.2 kB)
 
docs/in_the_wild_examples/stroller_02.jpg DELETED
Binary file (72.1 kB)
 
docs/in_the_wild_examples/suv_01.jpg DELETED
Binary file (49.8 kB)
 
docs/in_the_wild_examples/suv_02.jpg DELETED
Binary file (43.8 kB)
 
docs/in_the_wild_examples/tractor_01.jpg DELETED
Binary file (62.5 kB)
 
docs/in_the_wild_examples/trailer_01.jpg DELETED
Binary file (41.4 kB)
 
docs/in_the_wild_examples/truck_01.jpg DELETED
Binary file (69.2 kB)
 
model_cards/{MultiviewDiffusion.md → MultviewDiffusion.md} RENAMED
@@ -31,8 +31,7 @@ HuggingFace
 
  **Architecture Type:** Linear Diffusion Transformer
 
- **Network Architecture:** Sparse View Linear-attention Diffusion Transformer, as described in our white paper,
- with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation. C-RADIO for image conditioning signal.
+ **Network Architecture:** Linear-attention Diffusion Transformer with a Deep Compression Autoencoder (DC-AE) for efficient high-resolution image generation. C-RADIO for image conditioning signal.
 
  ## **Input:**
 
@@ -77,18 +76,18 @@ The model was trained, tested, and finetuned using an Objaverse subset internal
 
  | Dataset names | Size and content | Training partition | Test partition |
  | :---- | :---- | :---- | :---- |
- | Nvidia Proprietary AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
+ | Internal Nvidia AV dataset | Posed images of 278k objects | 83% (cross validation) | 17% |
  | Omniverse 3D assets | 200 3D assets of objects | 100% | 0% |
  | Objaverse | 80k assets collected under commercially viable Creative Commons licenses, | 100% | 0% |
 
- ### Objaverse Commercially Viable Subset under CC licenses
+ ### Objaverse Commercially Viable Subset
 
  **Link:** https://objaverse.allenai.org
  **Data Collection Method:** Synthetic 3D assets aggregated from various open-source and licensed sources
  **Labeling Method by Dataset:** Hybrid: Human and Automated
  **Properties:** This dataset consists of a diverse set of over 80,000 synthetic 3D object models spanning everyday items, animals, tools, and complex structures. Each model is rendered into multi-view 2D images with associated camera poses, materials, and mesh properties.
 
- ### Nvidia Proprietary AV dataset
+ ### Internal NVIDIA AV dataset
 
  **Data Collection Method:** Sensors
 
model_cards/{Object_TokenGS.md → TokenGS.md} RENAMED
File without changes