Improve model card: Add pipeline tag, paper link, project page, and sample usage

#1
by nielsr HF Staff - opened

Files changed (1): README.md (+95 -5)

---
license: apache-2.0
pipeline_tag: image-to-3d
---

<div align="center">
<h1>DVGT: Driving Visual Geometry Transformer</h1>
</div>

**DVGT** is a universal visual geometry transformer for autonomous driving that directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data. By processing unposed image sequences directly with spatial-temporal attention, it adapts seamlessly to diverse vehicles and camera configurations.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/demo.gif" width="100%">
</p>

[📚 Paper](https://huggingface.co/papers/2512.16919) | [🌐 Project Page](https://wzzheng.net/DVGT) | [💻 Code](https://github.com/wzzheng/DVGT)

## Overview

DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/teaser.png" width="100%">
</p>
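
The spatial-temporal attention idea can be made concrete with a generic sketch. This is a minimal illustration only, not the actual DVGT architecture: the class name, dimensions, and interleaving order are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative block: attend within each frame, then across frames."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) tokens from T unposed frames, N tokens per frame.
        B, T, N, C = x.shape
        # Spatial attention: tokens attend within their own frame.
        s = x.reshape(B * T, N, C)
        h = self.norm1(s)
        s = s + self.spatial_attn(h, h, h)[0]
        # Temporal attention: each token attends to itself across frames.
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        h = self.norm2(t)
        t = t + self.temporal_attn(h, h, h)[0]
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)
```

In the real model, blocks like this would be stacked and followed by heads that decode metric-scaled point maps in the ego frame; see the code repository for the actual design.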

## Experimental Results

DVGT significantly outperforms existing models across diverse scenarios. As shown below, our method (red) demonstrates superior accuracy ($\delta < 1.25$ for ray depth estimation) on 3D scene reconstruction across all evaluated datasets.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/experiments.jpg" alt="Radar Chart Performance" width="45%">
</p>
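
For context, $\delta < 1.25$ is the standard depth-accuracy threshold: the fraction of pixels whose predicted-to-ground-truth depth ratio (in either direction) stays below 1.25. A minimal NumPy sketch of the metric follows; the function name and inputs are our own, not part of the DVGT code:

```python
import numpy as np

def delta_accuracy(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh."""
    valid = gt > 0  # ignore pixels without ground-truth depth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < thresh).mean())
```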

## Quick Start

First, clone this repository to your local machine and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).
We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.

```bash
git clone https://github.com/wzzheng/DVGT.git
cd DVGT

conda create -n dvgt python=3.11
conda activate dvgt

pip install -r requirements.txt
```
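
To verify your environment against the tested configuration, a quick check (plain PyTorch introspection, nothing DVGT-specific):

```python
import torch

print(torch.__version__)          # tested with 2.8.0
print(torch.version.cuda)         # tested with 12.8
print(torch.cuda.is_available())  # should be True on a GPU machine
```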

Second, download the pretrained [checkpoint](https://huggingface.co/RainyNight/DVGT) and save it to the `./ckpt` directory.
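
Since `huggingface_hub` is among the dependencies, one way to fetch the weights is a snapshot download. This is a sketch; we pull the whole repo because the exact checkpoint filename is not specified here:

```python
from huggingface_hub import snapshot_download

# Download the repository contents (including the checkpoint) into ./ckpt.
snapshot_download(repo_id="RainyNight/DVGT", local_dir="./ckpt")
```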

Now, try the model with just a few lines of code:

```python
import torch
from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr

checkpoint_path = 'path/to/your/checkpoint'

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+);
# fall back to float16 on older GPUs.
if device == "cuda" and torch.cuda.get_device_capability()[0] < 8:
    dtype = torch.float16
else:
    dtype = torch.bfloat16

# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths)
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
```
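
The exact contents of `predictions` are defined by the DVGT code. Assuming it behaves like a dictionary of tensors (cameras, depth maps, and point maps, per the comment above), you can inspect what came back:

```python
# Assumption: predictions is a dict-like mapping of names to tensors;
# consult the DVGT repository for the actual return type and keys.
for name, value in predictions.items():
    if torch.is_tensor(value):
        print(name, tuple(value.shape), value.dtype)
    else:
        print(name, type(value).__name__)
```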

## Acknowledgements

Our code is based on the following brilliant repositories:

- [MoGe-2](https://github.com/microsoft/MoGe)
- [CUT3R](https://github.com/CUT3R/CUT3R)
- [Driv3R](https://github.com/Barrybarry-Smith/Driv3R)
- [VGGT](https://github.com/facebookresearch/vggt)
- [MapAnything](https://github.com/facebookresearch/map-anything)
- [Pi3](https://github.com/yyfz/Pi3)

Many thanks to these authors!

## Citation

If you find this project helpful, please consider citing the following paper:

```bibtex
@article{zuo2025dvgt,
  title={DVGT: Driving Visual Geometry Transformer},
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.16919},
  year={2025}
}
```