OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Model type: Spatial Foundation Model for 3D Geometry Reconstruction
Model date: November 2025
Paper: OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Code: https://github.com/Livioni/OmniVGGT-official
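Authors: Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu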
* Equal Contribution, † Corresponding Author
OmniVGGT is a spatial foundation model that can take any number of auxiliary geometric modalities (depth, camera intrinsics, and camera pose) alongside the input images and use them to improve its 3D geometry predictions. The authors report state-of-the-art performance on a range of downstream tasks, as well as improved performance on robot manipulation tasks.
Usage (Python):

import torch
from omnivggt.models.omnivggt import OmniVGGT

# Load the model and its weights
model = OmniVGGT()
model.load_state_dict(torch.load('path/to/model.pth', map_location='cpu'))
model.eval()

# Prepare inputs; only `images` is required
inputs = {
    'images': images,          # torch.Tensor [B, N, 3, H, W]
    'extrinsics': extrinsics,  # optional
    'intrinsics': intrinsics,  # optional
    'depth': depth,            # optional
    'mask': mask,              # optional
}

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(**inputs)
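Because every input other than `images` is marked optional, the input dictionary can be assembled from whatever modalities happen to be available. The following is a minimal sketch and not part of the OmniVGGT codebase: `build_inputs` is a hypothetical helper, and it assumes that a missing modality should be left out of the keyword arguments rather than passed as `None`.

```python
from typing import Any, Dict, Optional

# Keyword names taken from the usage example above.
OPTIONAL_KEYS = ("extrinsics", "intrinsics", "depth", "mask")


def build_inputs(images: Any, **optional: Optional[Any]) -> Dict[str, Any]:
    """Hypothetical helper: collect model inputs, skipping absent modalities."""
    if images is None:
        raise ValueError("`images` is required")
    unknown = set(optional) - set(OPTIONAL_KEYS)
    if unknown:
        raise KeyError(f"unexpected input(s): {sorted(unknown)}")
    inputs = {"images": images}
    # Assumption: an unavailable modality is omitted instead of passed as None.
    inputs.update({k: v for k, v in optional.items() if v is not None})
    return inputs
```

With this helper, a call such as `model(**build_inputs(images, depth=depth))` would pass only the images and the depth maps, under the assumption stated above.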
Usage (command line):

# Basic usage - only images required
python inference.py --image_folder path/to/images/

# With auxiliary camera and depth information
python inference.py \
    --image_folder path/to/images/ \
    --camera_folder path/to/cameras/ \
    --depth_folder path/to/depths/
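When the script is driven from Python (for example, over many scenes), the same optional-flag logic can be captured in a small function. This is a sketch only: `build_inference_command` is a hypothetical helper, it uses just the three flags shown above, and it assumes `inference.py` sits in the current working directory.

```python
import sys
from typing import List, Optional


def build_inference_command(
    image_folder: str,
    camera_folder: Optional[str] = None,
    depth_folder: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build the argument list for inference.py."""
    # Use the running interpreter so the active environment is reused.
    cmd = [sys.executable, "inference.py", "--image_folder", image_folder]
    # Auxiliary folders are appended only when they are provided.
    if camera_folder is not None:
        cmd += ["--camera_folder", camera_folder]
    if depth_folder is not None:
        cmd += ["--depth_folder", depth_folder]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root.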
Installation:

conda create -n omnivggt python=3.10
conda activate omnivggt
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
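After installation, it can help to confirm that the pinned PyTorch packages are what the environment actually resolved to. The sketch below uses only the Python standard library. The function names are hypothetical, and the script strips local version suffixes such as `+cu128`, which CUDA-specific wheels append to the version string.

```python
from importlib import metadata
from typing import Dict, Optional

# Version pins copied from the pip command above.
EXPECTED = {"torch": "2.7.0", "torchvision": "0.22.0", "torchaudio": "2.7.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version without a local suffix, or None if absent."""
    try:
        return metadata.version(package).split("+", 1)[0]
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs to the version found."""
    found = {name: installed_version(name) for name in expected}
    return {name: ver for name, ver in found.items() if ver != expected[name]}


if __name__ == "__main__":
    for name, found in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(EXPECTED)` means all three pins match. It does not check the packages listed in `requirements.txt`.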
Citation:

@article{peng2025omnivggt,
  title={OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer},
  author={Peng, Haosong and Li, Hao and Dai, Yalun and Lan, Yushi and Luo, Yihang and Qi, Tianyu and Zhang, Zhengshen and Zhan, Yufeng and Zhang, Junfei and Xu, Wenchao and Liu, Ziwei},
  journal={arXiv preprint arXiv:2511.10560},
  year={2025}
}
Base model: facebook/VGGT-1B