---
license: mit
metrics:
- accuracy
base_model:
- facebook/VGGT_tracker_fixed
pipeline_tag: image-to-3d
---
<div align="center">
<h2>⚑️ FastVGGT: Training-Free Acceleration of Visual Geometry Transformer</h2>
  
<p align="center">
  <a href="https://arxiv.org/abs/2509.02560"><img src="https://img.shields.io/badge/arXiv-FastVGGT-red?logo=arxiv" alt="Paper PDF"></a>
  <a href="https://mystorm16.github.io/fastvggt/"><img src="https://img.shields.io/badge/Project_Page-FastVGGT-yellow" alt="Project Page"></a>
</p>
  

[You Shen](https://mystorm16.github.io/), [Zhipeng Zhang](https://zhipengzhang.cn/), [Yansong Qu](https://quyans.github.io/), [Liujuan Cao](https://mac.xmu.edu.cn/ljcao/)
</div>


## 🔭 Overview

FastVGGT observes **strong similarity** in attention maps and leverages it to design a training-free acceleration method for long-sequence 3D reconstruction, **achieving up to 4× faster inference without sacrificing accuracy.**


## ⚙️ Environment Setup
First, create a virtual environment using Conda, clone this repository to your local machine, and install the required dependencies.


```bash
conda create -n fastvggt python=3.10
conda activate fastvggt
git clone git@github.com:mystorm16/FastVGGT.git
cd FastVGGT
pip install -r requirements.txt
```

Next, prepare the ScanNet dataset: http://www.scan-net.org/ScanNet/

Then, download the VGGT checkpoint (we use the checkpoint link provided in https://github.com/facebookresearch/vggt/tree/evaluation/evaluation):
```bash
wget https://huggingface.co/facebook/VGGT_tracker_fixed/resolve/main/model_tracker_fixed_e20.pt
```

Finally, configure the dataset path and VGGT checkpoint path. For example:
```python
# Dataset root directory
parser.add_argument(
    "--data_dir", type=Path, default="/data/scannetv2/process_scannet"
)
# Directory holding the ground-truth scans
parser.add_argument(
    "--gt_ply_dir",
    type=Path,
    default="/data/scannetv2/OpenDataLab___ScanNet_v2/raw/scans",
)
# VGGT checkpoint downloaded in the previous step
parser.add_argument(
    "--ckpt_path",
    type=str,
    default="./ckpt/model_tracker_fixed_e20.pt",
)
```
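Before launching a long evaluation, it can help to confirm that the configured paths exist. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the default values are simply the example paths shown above.

```python
from pathlib import Path


def missing_paths(data_dir, gt_ply_dir, ckpt_path):
    """Return the names of configured paths that do not exist on disk.

    Hypothetical helper for illustration; the argument names mirror the
    --data_dir, --gt_ply_dir and --ckpt_path options.
    """
    candidates = {
        "data_dir": Path(data_dir),
        "gt_ply_dir": Path(gt_ply_dir),
        "ckpt_path": Path(ckpt_path),
    }
    return [name for name, path in candidates.items() if not path.exists()]


# Example paths copied from the configuration; adjust to your machine.
print(missing_paths(
    "/data/scannetv2/process_scannet",
    "/data/scannetv2/OpenDataLab___ScanNet_v2/raw/scans",
    "./ckpt/model_tracker_fixed_e20.pt",
))
```

An empty list means all three paths were found.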


## 💎 Observation

Note: A large `--input_frame` value may significantly slow down saving the visualization results. Try a smaller number first.
```bash
python eval/eval_scannet.py --input_frame 30 --vis_attn_map --merging 0
```

We observe that many token-level attention maps are highly similar in each block, motivating our optimization of the Global Attention module.
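One way to make "highly similar" concrete is to compute the average pairwise cosine similarity between per-token attention rows. The sketch below is only an illustration on toy data: the function names and the toy attention rows are hypothetical, and it does not reproduce what `--vis_attn_map` does internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(attn_rows):
    """Average cosine similarity over all distinct pairs of attention rows."""
    n = len(attn_rows)
    if n < 2:
        raise ValueError("need at least two attention rows")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(attn_rows[i], attn_rows[j])
            pairs += 1
    return total / pairs


# Toy data (hypothetical): each row is one token's attention distribution.
similar_rows = [[0.70, 0.20, 0.10], [0.68, 0.22, 0.10], [0.72, 0.18, 0.10]]
diverse_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_similarity(similar_rows))  # close to 1: near-duplicate maps
print(mean_pairwise_similarity(diverse_rows))  # 0.0: orthogonal maps
```

A value near 1 indicates that the tokens attend in almost the same way, which is the kind of redundancy the observation refers to.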



## 🏀 Evaluation
### Custom Dataset
Organize the data in the following directory structure:
```
<data_path>/
├── images/
│   ├── 000000.jpg
│   ├── 000001.jpg
│   └── ...
├── pose/                # Optional: Camera poses
│   ├── 000000.txt
│   ├── 000001.txt
│   └── ...
└── gt_ply/              # Optional: GT point cloud
    └── scene_xxx.ply
```
- Required: `images/`
- Additionally required when `--enable_evaluation` is set: `pose/` and `gt_ply/`
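The layout rules above can be checked before running the evaluation script. This is a minimal sketch under stated assumptions, not code from the repository: the function `check_custom_dataset` is hypothetical, and it assumes that images use the `.jpg` extension and that each image has a pose file with the same stem, as the example tree suggests.

```python
from pathlib import Path


def check_custom_dataset(data_path, enable_evaluation=False):
    """Return a list of layout problems (empty if the layout looks valid)."""
    root = Path(data_path)
    problems = []

    images = root / "images"
    if not images.is_dir():
        problems.append("missing required directory: images/")
    elif not any(images.glob("*.jpg")):
        problems.append("images/ contains no .jpg files")

    if enable_evaluation:
        pose = root / "pose"
        if not pose.is_dir():
            problems.append("missing pose/ (needed with --enable_evaluation)")
        elif images.is_dir():
            # Assumption: poses are matched to images by file stem.
            unmatched = [
                img.stem
                for img in sorted(images.glob("*.jpg"))
                if not (pose / f"{img.stem}.txt").is_file()
            ]
            if unmatched:
                problems.append(f"{len(unmatched)} image(s) have no matching pose file")

        gt_ply = root / "gt_ply"
        if not gt_ply.is_dir():
            problems.append("missing gt_ply/ (needed with --enable_evaluation)")
        elif not any(gt_ply.glob("*.ply")):
            problems.append("gt_ply/ contains no .ply files")

    return problems


# Placeholder path from the commands in this section; replace with your own.
print(check_custom_dataset("/path/to/your_dataset", enable_evaluation=True))
```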

Inference only:

```bash
python eval/eval_custom.py \
  --data_path /path/to/your_dataset \
  --output_path ./eval_results_custom \
  --plot
```

Inference + Evaluation (requires `pose/` and `gt_ply/`):

```bash
python eval/eval_custom.py \
  --data_path /path/to/your_dataset \
  --enable_evaluation \
  --output_path ./eval_results_custom \
  --plot
```

### ScanNet
Evaluate FastVGGT on the ScanNet dataset with 1,000 input images. The `--merging` parameter specifies the block index at which the merging strategy is applied:

```bash
python eval/eval_scannet.py --input_frame 1000 --merging 0
```

Evaluate the baseline VGGT on the ScanNet dataset with 1,000 input images:
```bash
python eval/eval_scannet.py --input_frame 1000
```
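The merging strategy itself is not described in this README. As a rough, generic illustration of similarity-based token merging (in the spirit of ToMe-style methods, and explicitly not FastVGGT's implementation), the sketch below keeps every `stride`-th token as a destination and averages each remaining token into its most similar destination. The function name `merge_tokens` and the `stride` parameter are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, stride=2):
    """Reduce a token list by merging similar tokens (generic sketch).

    Every `stride`-th token is kept as a destination; each other (source)
    token is averaged into the destination it is most similar to.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    dst_indices = list(range(0, len(tokens), stride))
    groups = {d: [tokens[d]] for d in dst_indices}

    for s, token in enumerate(tokens):
        if s % stride == 0:
            continue  # destination tokens are not reassigned
        best = max(dst_indices, key=lambda d: _cosine(token, tokens[d]))
        groups[best].append(token)

    merged = []
    for d in dst_indices:
        members = groups[d]
        dim = len(members[0])
        merged.append([sum(m[k] for m in members) / len(members) for k in range(dim)])
    return merged


# Toy 2-D tokens (hypothetical): two clusters of two tokens each.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(merge_tokens(tokens, stride=2))  # two merged tokens, one per cluster
```

Halving the token count in this way shrinks the attention matrix, which is the general reason merging can reduce the cost of global attention.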

### 7 Scenes & NRGBD
Evaluate on both datasets, sampling one keyframe every 10 frames:
```bash
python eval/eval_7andN.py --kf 10
```
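Keyframe sampling with an interval of `kf` can be pictured as taking every `kf`-th frame of a sequence. The snippet below is a minimal sketch that assumes sampling starts at the first frame; the helper `sample_keyframes` is hypothetical, and the script's exact behavior may differ.

```python
def sample_keyframes(frames, kf):
    """Keep every `kf`-th frame, starting from the first one."""
    if kf < 1:
        raise ValueError("kf must be >= 1")
    return frames[::kf]


# Hypothetical sequence of 35 frame names.
frames = [f"{i:06d}.jpg" for i in range(35)]
print(sample_keyframes(frames, 10))
# ['000000.jpg', '000010.jpg', '000020.jpg', '000030.jpg']
```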

## 🍺 Acknowledgements

- Thanks to these great repositories: [VGGT](https://github.com/facebookresearch/vggt), [DUSt3R](https://github.com/naver/dust3r), [Fast3R](https://github.com/facebookresearch/fast3r), [CUT3R](https://github.com/CUT3R/CUT3R), [MV-DUSt3R+](https://github.com/facebookresearch/mvdust3r), [StreamVGGT](https://github.com/wzzheng/StreamVGGT), [VGGT-Long](https://github.com/DengKaiCQ/VGGT-Long), [ToMeSD](https://github.com/dbolya/tomesd), and many other inspiring works in the community.

- Special thanks to [Jianyuan Wang](https://jytime.github.io/) for his valuable discussions and suggestions on this work.

<!-- ## ✍️ Checklist

- [ ] Release the evaluation code on 7 Scenes / NRGBD -->


## ⚖️ License
See the [LICENSE](./LICENSE.txt) file for details about the license under which this code is made available.

## Citation

If you find this project helpful, please consider citing the following paper:
```
@article{shen2025fastvggt,
  title={FastVGGT: Training-Free Acceleration of Visual Geometry Transformer},
  author={Shen, You and Zhang, Zhipeng and Qu, Yansong and Cao, Liujuan},
  journal={arXiv preprint arXiv:2509.02560},
  year={2025}
}
```