nielsr HF Staff committed on
Commit
2da79fe
·
verified ·
1 Parent(s): fad61d2

Improve model card: Add pipeline tag, paper link, project page, and sample usage


This PR significantly enhances the model card for the DVGT model by:
- Adding the `pipeline_tag: image-to-3d` to the metadata, improving discoverability on the Hugging Face Hub.
- Including direct links to the paper and the official project page.
- Integrating a comprehensive "Quick Start" section with both installation instructions and a runnable Python code snippet, directly sourced from the GitHub repository.
- Incorporating additional descriptive content like "Overview", "Experimental Results", "Acknowledgements", and "Citation" for richer documentation.
- Embedding key visuals (GIFs and images) from the project's GitHub repository to showcase the model's capabilities.

Please review and merge if these enhancements align with the model's documentation needs.

Files changed (1)
  1. README.md +95 -5
README.md CHANGED
@@ -1,12 +1,102 @@
  ---
  license: apache-2.0
  ---
  <div align="center">
  <h1>DVGT: Driving Visual Geometry Transformer</h1>
  </div>
- **DVGT**, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data.

- ### 🚀 Model Usage
- This repository hosts the **pre-trained weights (checkpoints)** for the DVGT model.
- For source code, installation guides, and detailed documentation, please visit our GitHub repository:
- 👉 **[GitHub: wzzheng/DVGT](https://github.com/wzzheng/DVGT/blob/main)**
  ---
  license: apache-2.0
+ pipeline_tag: image-to-3d
  ---
+
  <div align="center">
  <h1>DVGT: Driving Visual Geometry Transformer</h1>
  </div>
+ **DVGT**, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data. It adapts seamlessly to diverse vehicles and camera configurations.
+
+ <p align="center">
+ <img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/demo.gif" width="100%">
+ </p>
+
+ [📚 Paper](https://huggingface.co/papers/2512.16919) | [🌐 Project Page](https://wzzheng.net/DVGT) | [💻 Code](https://github.com/wzzheng/DVGT)
+
+ ## Overview
+
+ DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.
+
+ <p align="center">
+ <img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/teaser.png" width="100%">
+ </p>
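
The spatial-temporal attention idea described above can be sketched in a few lines. This is an illustrative toy only (plain single-head attention over flattened view/time tokens, with made-up shapes and no learned projections), not the actual DVGT architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(tokens):
    """Single-head attention over all views and frames at once.

    tokens: (T, V, N, D) = (frames, camera views, patch tokens, channels).
    Flattening T, V, and N into one axis lets every token attend to every
    other token across both space and time.
    """
    T, V, N, D = tokens.shape
    x = tokens.reshape(T * V * N, D)
    attn = softmax(x @ x.T / np.sqrt(D))  # (TVN, TVN) attention weights
    out = attn @ x                        # aggregate across all views and frames
    return out.reshape(T, V, N, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 3, 4, 8))  # 2 frames, 3 cameras, 4 tokens each
out = spatial_temporal_attention(tokens)
print(out.shape)  # (2, 3, 4, 8)
```

Because attention is permutation-friendly, no camera poses are needed to mix information across views; the real model learns where each token "lives" from the data itself.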
+
+ ## Experimental Results
+ DVGT significantly outperforms existing models across diverse scenarios. As shown below, our method (red) demonstrates superior accuracy ($\delta < 1.25$ for ray depth estimation) on 3D scene reconstruction across all evaluated datasets.
+
+ <p align="center">
+ <img src="https://huggingface.co/RainyNight/DVGT/resolve/main/assets/experiments.jpg" alt="Radar Chart Performance" width="45%">
+ </p>
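
For readers unfamiliar with the metric, $\delta < 1.25$ is the standard depth-accuracy measure: the fraction of predictions whose ratio to ground truth (taken in the larger direction) stays below 1.25. A minimal sketch with toy values, not DVGT outputs:

```python
import numpy as np

def delta_accuracy(pred, gt, thresh=1.25):
    """Fraction of depths whose pred/gt ratio (either direction) is < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

pred = np.array([1.0, 2.0, 3.0, 10.0])  # predicted depths (toy values)
gt = np.array([1.1, 2.0, 4.0, 10.0])    # ground-truth depths (toy values)
print(delta_accuracy(pred, gt))  # 0.75 -- 3 of 4 predictions within the 1.25 ratio
```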
+
+ ## Quick Start
+
+ First, clone the repository and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.
+
+ ```bash
+ git clone https://github.com/wzzheng/DVGT.git
+ cd DVGT
+
+ conda create -n dvgt python=3.11
+ conda activate dvgt
+
+ pip install -r requirements.txt
+ ```
+
+ Next, download the pretrained [checkpoint](https://huggingface.co/RainyNight/DVGT) and save it to the `./ckpt` directory.
+
+ Now, try the model with just a few lines of code:
+
+ ```python
+ import torch
+ from dvgt.models.dvgt import DVGT
+ from dvgt.utils.load_fn import load_and_preprocess_images
+ from iopath.common.file_io import g_pathmgr
+
+ checkpoint_path = 'path to your checkpoint'
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ # bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+); fall back otherwise.
+ dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
+
+ # Initialize the model and load the pretrained weights.
+ model = DVGT()
+ with g_pathmgr.open(checkpoint_path, "rb") as f:
+     checkpoint = torch.load(f, map_location="cpu")
+ model.load_state_dict(checkpoint)
+ model = model.to(device).eval()
+
+ # Load and preprocess example images (replace with your own image paths)
+ image_dir = 'examples/openscene_log-0104-scene-0007'
+ images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)
+
+ with torch.no_grad():
+     with torch.amp.autocast(device, dtype=dtype):
+         # Predict attributes including cameras, depth maps, and point maps.
+         predictions = model(images)
+ ```
+
+ ## Acknowledgements
+ Our code is based on the following brilliant repositories:
+
+ [MoGe-2](https://github.com/microsoft/MoGe)
+ [CUT3R](https://github.com/CUT3R/CUT3R)
+ [Driv3R](https://github.com/Barrybarry-Smith/Driv3R)
+ [VGGT](https://github.com/facebookresearch/vggt)
+ [MapAnything](https://github.com/facebookresearch/map-anything)
+ [Pi3](https://github.com/yyfz/Pi3)
+
+ Many thanks to these authors!
+
+ ## Citation

+ If you find this project helpful, please consider citing the following paper:
+ ```
+ @article{zuo2025dvgt,
+   title={DVGT: Driving Visual Geometry Transformer},
+   author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
+   journal={arXiv preprint arXiv:2512.16919},
+   year={2025}
+ }
+ ```