---
pipeline_tag: image-to-3d
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
library_name: pytorch
license: cc-by-nc-4.0
---

<div align="center">
<h1>Streaming 4D Visual Geometry Transformer</h1>
</div>

Dong Zhuo\*, [Wenzhao Zheng](https://wzzheng.net/)\*†, Jiahe Guo, Yuqi Wu, [Jie Zhou](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en&authuser=1), [Jiwen Lu](http://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)

\* Equal contribution. † Project leader.

[Paper](https://arxiv.org/abs/2507.11539) | [Project Page](https://wzzheng.net/StreamVGGT)

**StreamVGGT** is a causal transformer architecture for **real-time streaming 4D visual geometry perception**. It is compatible with attention mechanisms developed for LLMs (e.g., [FlashAttention](https://github.com/Dao-AILab/flash-attention)) and delivers both fast inference and high-quality 4D reconstruction.

## Overview

Given a sequence of images, offline models must reprocess the entire sequence and reconstruct the entire scene each time a new image arrives. StreamVGGT instead employs temporal causal attention and leverages cached memory tokens to support efficient, incremental, on-the-fly reconstruction, enabling interactive, real-time online applications.

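The idea of causal attention over cached tokens can be illustrated with a toy sketch. This is not the StreamVGGT implementation: the class `StreamingCausalAttention`, the helper `attend`, and the choice to let each token act as its own query, key, and value (no learned projections, a single head) are all assumptions made for illustration. It also assumes frame-level causality, where tokens of a new frame attend to every cached token from earlier frames plus the tokens of the current frame. The sketch checks that processing frames one at a time against the cache gives the same result as recomputing attention over the full history for every frame.

```python
import math


def attend(query, keys, values):
    """Softmax attention for one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class StreamingCausalAttention:
    """Hypothetical toy layer that caches keys/values of all frames seen so far."""

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, frame_tokens):
        # Toy simplification: each token serves as its own query, key, and value.
        self.cached_keys.extend(frame_tokens)
        self.cached_values.extend(frame_tokens)
        # Only the new frame's tokens are processed; past frames are read from the cache.
        return [attend(t, self.cached_keys, self.cached_values) for t in frame_tokens]


def offline_causal(frames):
    """Reference: rebuild the full context from scratch for every frame."""
    outputs = []
    for i, frame in enumerate(frames):
        context = [tok for past in frames[: i + 1] for tok in past]
        outputs.append([attend(t, context, context) for t in frame])
    return outputs


# Three frames with two 2-dimensional tokens each (made-up numbers).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[-1.0, 0.5], [0.2, -0.3]],
]

layer = StreamingCausalAttention()
streamed = [layer.step(frame) for frame in frames]
reference = offline_causal(frames)
```

In this toy setting, each `step` call touches only the new frame's queries, while the reference recomputation rebuilds the context for every frame; the two produce matching outputs because causal attention for a frame never depends on later frames.
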
## Quick start

Please refer to our [GitHub repo](https://github.com/wzzheng/StreamVGGT).

## Citation

If you find this project helpful, please consider citing the following paper:

```
@article{streamVGGT,
  title={Streaming 4D Visual Geometry Transformer},
  author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2507.11539},
  year={2025}
}
```