Improve model card: Add pipeline tag, paper link, project page, and sample usage
#1 by nielsr (HF Staff) · opened

README.md CHANGED
---
license: apache-2.0
pipeline_tag: image-to-3d
---

<div align="center">
<h1>DVGT: Driving Visual Geometry Transformer</h1>
</div>

**DVGT**, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/demo.gif" width="100%">
</p>

[📚 Paper](https://huggingface.co/papers/2512.16919) | [🌐 Project Page](https://wzzheng.net/DVGT) | [💻 Code](https://github.com/wzzheng/DVGT)
## Overview

DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/teaser.png" width="100%">
</p>
## Experimental Results

DVGT significantly outperforms existing models across diverse scenarios. As shown below, our method (red) demonstrates superior accuracy ($\delta < 1.25$ for ray depth estimation) on 3D scene reconstruction across all evaluated datasets.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/experiments.jpg" alt="Radar Chart Performance" width="45%">
</p>
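The $\delta < 1.25$ accuracy is the standard threshold metric for depth estimation: the fraction of pixels whose predicted-to-ground-truth depth ratio (taken in whichever direction is larger) falls below 1.25. A minimal NumPy sketch (the function name is ours, not from the DVGT codebase):

```python
import numpy as np

def delta_accuracy(pred, gt, thresh=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) < thresh."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

# Toy example: 3 of 4 predicted depths are within 25% of ground truth
print(delta_accuracy([1.0, 2.2, 3.0, 10.0], [1.0, 2.0, 3.3, 5.0]))  # → 0.75
```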
## Quick Start

First, clone this repository and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.

```bash
git clone https://github.com/wzzheng/DVGT.git
cd DVGT

conda create -n dvgt python=3.11
conda activate dvgt

pip install -r requirements.txt
```

Second, download the pretrained [checkpoint](https://huggingface.co/RainyNight/DVGT) and save it to the `./ckpt` directory.
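Alternatively, the checkpoint can be fetched programmatically with `huggingface_hub`, which is already in the dependency list. A minimal sketch (the weight filenames inside the repo are not assumed here; `snapshot_download` mirrors the whole model repository):

```python
from huggingface_hub import snapshot_download

# Mirror the entire RainyNight/DVGT repo (weights plus assets) into ./ckpt
ckpt_dir = snapshot_download(repo_id="RainyNight/DVGT", local_dir="./ckpt")
print(ckpt_dir)
```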

Now, try the model with just a few lines of code:
```python
import torch

from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr

checkpoint_path = 'path to your checkpoint'

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+);
# fall back to float16 on older GPUs and to full precision on CPU.
if device == "cuda":
    dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
else:
    dtype = torch.float32

# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths).
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
```
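The predicted global point maps can be exported to a point cloud file for quick inspection in a viewer. The sketch below is illustrative only: it assumes you have extracted an array of ego-frame XYZ points from `predictions` (the exact output keys and tensor shapes depend on the DVGT release) and writes a minimal ASCII PLY file.

```python
import numpy as np

def save_point_cloud_ply(points, path):
    """Write an array of XYZ coordinates (any leading shape, last dim 3) to ASCII PLY."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(pts)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, pts, fmt="%.6f")

# Hypothetical stand-in for a DVGT point map: S x H x W x 3 ego-frame coordinates
point_map = np.random.rand(2, 4, 4, 3).astype(np.float32)
save_point_cloud_ply(point_map, "scene.ply")
```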

## Acknowledgements

Our code is based on the following brilliant repositories:
+
[Moge-2](https://github.com/microsoft/MoGe)
|
| 84 |
+
[CUT3R](https://github.com/CUT3R/CUT3R)
|
| 85 |
+
[Driv3R](https://github.com/Barrybarry-Smith/Driv3R)
|
| 86 |
+
[VGGT](https://github.com/facebookresearch/vggt)
|
| 87 |
+
[MapAnything](https://github.com/facebookresearch/map-anything)
|
| 88 |
+
[Pi3](https://github.com/yyfz/Pi3)
|
| 89 |
+
|
| 90 |
+
Many thanks to these authors!
|
| 91 |
+
|

## Citation

If you find this project helpful, please consider citing the following paper:

```bibtex
@article{zuo2025dvgt,
  title={DVGT: Driving Visual Geometry Transformer},
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.16919},
  year={2025}
}
```