Improve model card: Add pipeline tag, paper link, project page, and sample usage
#1 by nielsr (HF Staff) · opened

README.md CHANGED
---
license: apache-2.0
pipeline_tag: image-to-3d
---

<div align="center">
<h1>DVGT: Driving Visual Geometry Transformer</h1>
</div>

**DVGT**, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/demo.gif" width="100%">
</p>

[📚 Paper](https://huggingface.co/papers/2512.16919) | [🌐 Project Page](https://wzzheng.net/DVGT) | [💻 Code](https://github.com/wzzheng/DVGT)
## Overview

DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/teaser.png" width="100%">
</p>
## Experimental Results

DVGT significantly outperforms existing models across diverse scenarios. As shown below, our method (red) demonstrates superior accuracy ($\delta < 1.25$ for ray depth estimation) on 3D scene reconstruction across all evaluated datasets.

<p align="center">
<img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/experiments.jpg" alt="Radar Chart Performance" width="45%">
</p>
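The $\delta < 1.25$ accuracy is the standard threshold metric for depth estimation: the fraction of pixels whose predicted-to-ground-truth depth ratio (taken in whichever direction is larger) falls below 1.25. A minimal NumPy sketch (the function name is ours, not from the DVGT codebase):

```python
import numpy as np

def delta_accuracy(pred, gt, thresh=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) < thresh."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

# Toy example: 3 of 4 predicted depths are within 25% of ground truth
print(delta_accuracy([1.0, 2.2, 3.0, 10.0], [1.0, 2.0, 3.3, 5.0]))  # → 0.75
```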
## Quick Start

First, clone this repository and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.

```bash
git clone https://github.com/wzzheng/DVGT.git
cd DVGT

conda create -n dvgt python=3.11
conda activate dvgt

pip install -r requirements.txt
```

Second, download the pretrained [checkpoint](https://huggingface.co/RainyNight/DVGT) and save it to the `./ckpt` directory.
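Alternatively, the checkpoint can be fetched programmatically with `huggingface_hub`, which is already in the dependency list. A minimal sketch (the weight filenames inside the repo are not assumed here; `snapshot_download` mirrors the whole model repository):

```python
from huggingface_hub import snapshot_download

# Mirror the entire RainyNight/DVGT repo (weights plus assets) into ./ckpt
ckpt_dir = snapshot_download(repo_id="RainyNight/DVGT", local_dir="./ckpt")
print(ckpt_dir)
```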

Now, try the model with just a few lines of code:
```python
import torch

from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr

checkpoint_path = 'path to your checkpoint'

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+);
# fall back to float16 on older GPUs and to full precision on CPU.
if device == "cuda":
    dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
else:
    dtype = torch.float32

# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths).
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
```
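The predicted global point maps can be exported to a point cloud file for quick inspection in a viewer. The sketch below is illustrative only: it assumes you have extracted an array of ego-frame XYZ points from `predictions` (the exact output keys and tensor shapes depend on the DVGT release) and writes a minimal ASCII PLY file.

```python
import numpy as np

def save_point_cloud_ply(points, path):
    """Write an array of XYZ coordinates (any leading shape, last dim 3) to ASCII PLY."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(pts)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, pts, fmt="%.6f")

# Hypothetical stand-in for a DVGT point map: S x H x W x 3 ego-frame coordinates
point_map = np.random.rand(2, 4, 4, 3).astype(np.float32)
save_point_cloud_ply(point_map, "scene.ply")
```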

## Acknowledgements

Our code is based on the following brilliant repositories:
+
[Moge-2](https://github.com/microsoft/MoGe)
|
| 84 |
+
[CUT3R](https://github.com/CUT3R/CUT3R)
|
| 85 |
+
[Driv3R](https://github.com/Barrybarry-Smith/Driv3R)
|
| 86 |
+
[VGGT](https://github.com/facebookresearch/vggt)
|
| 87 |
+
[MapAnything](https://github.com/facebookresearch/map-anything)
|
| 88 |
+
[Pi3](https://github.com/yyfz/Pi3)
|
| 89 |
+
|
| 90 |
+
Many thanks to these authors!
|
| 91 |
+
|

## Citation

If you find this project helpful, please consider citing the following paper:

```bibtex
@article{zuo2025dvgt,
  title={DVGT: Driving Visual Geometry Transformer},
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.16919},
  year={2025}
}
```