ViGeo

ViGeo estimates scene geometry from video clips or single frames. It predicts depth, 3D points, surface normals, and confidence, plus camera poses for sequences.

The checkpoint in this repository is vigeo.pt.

Checkpoint Note

This repository currently provides a preliminary ViGeo checkpoint. It was trained with a known issue in the loss implementation, which may cause minor visualization artifacts in camera poses and distant regions. The checkpoint is nonetheless consistent with the results reported in the paper and can be used for dense geometry estimation.

We are preparing an updated checkpoint with a sky mask head and will release it soon.

Installation

conda create -n vigeo python=3.10 -y
conda activate vigeo

pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

git clone https://github.com/aigc3d/ViGeo.git
cd ViGeo
pip install -r requirements.txt
pip install -e .

Quick Start

import torch

from vigeo import ViGeo
from utils import load_image_sequence

device = torch.device("cuda")
image_paths = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_image_sequence(image_paths).to(device)  # [T, 3, H, W], RGB in [0, 1]

model = ViGeo.from_pretrained("pkqbajng/ViGeo").to(device).eval()

with torch.inference_mode():
    output = model.infer(images, mode="offline")

depth = output["depth_pred"]      # [T, 1, H, W]
points = output["points_pred"]    # [T, H, W, 3]
normals = output["normal_pred"]   # [T, H, W, 3], inward normals
normals_out = -normals            # outward normals for visualization/evaluation
poses = output["pose_pred"]       # [T, 3, 4], camera-to-world
confidence = output["conf_pred"]  # [T, 1, H, W]

For batched input [B, T, 3, H, W], tensor outputs keep the leading batch dimension.
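As an illustration of how the output shapes above fit together, the sketch below keeps only the 3D points whose confidence exceeds a cutoff. It uses NumPy arrays with dummy data in place of real model outputs. The helper name `filter_points_by_confidence` and the `threshold` value are hypothetical; the value range of conf_pred is not documented here, so pick a cutoff after inspecting your own outputs.

```python
import numpy as np

def filter_points_by_confidence(points, confidence, threshold=0.5):
    """Return the 3D points whose confidence exceeds `threshold`.

    points:     [T, H, W, 3] array, shaped like output["points_pred"]
    confidence: [T, 1, H, W] array, shaped like output["conf_pred"]
    threshold:  hypothetical cutoff, not a ViGeo default
    """
    mask = confidence[:, 0] > threshold   # drop the channel axis -> [T, H, W]
    return points[mask]                   # boolean indexing -> [N, 3]

# Dummy arrays standing in for model outputs (T=2, H=3, W=4).
rng = np.random.default_rng(0)
points = rng.normal(size=(2, 3, 4, 3))
confidence = rng.uniform(size=(2, 1, 3, 4))

kept = filter_points_by_confidence(points, confidence, threshold=0.5)
print(kept.shape)  # (N, 3), where N is the number of confident pixels
```

To apply this to real outputs, convert the tensors first, for example with `output["points_pred"].cpu().numpy()`. The same boolean-mask indexing also works directly on torch tensors.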

ViGeo uses a right-handed camera coordinate system with (X, Y, Z) = (right, down, front). The raw normal_pred output follows the inward normal convention. Use normals = -normal_pred when outward normals are needed for visualization or evaluation.
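The conventions above matter when passing poses to other tools. The following is a minimal NumPy sketch, not part of the ViGeo API: it pads [T, 3, 4] camera-to-world matrices to 4x4, inverts them to world-to-camera, and re-expresses the camera axes as (right, up, back), the convention used by OpenGL-style renderers. All function names are hypothetical, and the inversion assumes the rotation part of each pose is orthonormal.

```python
import numpy as np

def to_homogeneous(poses):
    """Pad [T, 3, 4] camera-to-world matrices to [T, 4, 4]."""
    num_frames = poses.shape[0]
    bottom = np.tile(np.array([[[0.0, 0.0, 0.0, 1.0]]]), (num_frames, 1, 1))
    return np.concatenate([poses, bottom], axis=1)

def invert_rigid(c2w):
    """Invert [T, 4, 4] rigid transforms (assumes orthonormal rotations)."""
    rot_t = np.transpose(c2w[:, :3, :3], (0, 2, 1))  # R^T per frame
    trans = c2w[:, :3, 3:]                           # [T, 3, 1]
    w2c = np.tile(np.eye(4), (c2w.shape[0], 1, 1))
    w2c[:, :3, :3] = rot_t
    w2c[:, :3, 3:] = -rot_t @ trans                  # -R^T t
    return w2c

# Negating the Y and Z camera axes maps (right, down, front) to (right, up, back).
FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])

def to_right_up_back(c2w):
    """Change the camera-axis convention of camera-to-world matrices."""
    return c2w @ FLIP_YZ

# Dummy pose: 90 degree rotation about Z plus a translation.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([[1.0], [2.0], [3.0]])
poses = np.concatenate([R, t], axis=1)[None]  # [1, 3, 4], like pose_pred

c2w = to_homogeneous(poses)
w2c = invert_rigid(c2w)
c2w_gl = to_right_up_back(c2w)
```

Right-multiplying by `FLIP_YZ` changes only the camera's local axes, so the camera position (the last column) stays the same.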

Inference Modes

ViGeo provides offline, chunk, and online inference modes. The offline mode processes the full input sequence at once and is preferred when the complete video or image set is available. For long videos, use chunk or online mode with cached context.
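To make the idea of chunked processing concrete, here is a generic sketch of splitting a long sequence into overlapping frame windows. It is not ViGeo's chunk implementation. The function name and the `chunk_size` and `overlap` parameters are hypothetical, and how ViGeo carries cached context between chunks is not described here.

```python
def chunk_indices(num_frames, chunk_size, overlap=0):
    """Yield (start, end) frame ranges that cover [0, num_frames).

    chunk_size and overlap are illustrative parameters, not ViGeo arguments.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        yield start, end
        if end == num_frames:  # last window reached the end of the sequence
            break
        start += step

windows = list(chunk_indices(num_frames=10, chunk_size=4, overlap=1))
print(windows)  # [(0, 4), (3, 7), (6, 10)]
```

Each window shares `overlap` frames with the previous one, which is one common way to keep adjacent chunks consistent when a sequence is too long to process at once.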

See the ViGeo main branch README for examples of all inference modes.

License

CC BY-NC 4.0.

Paper

Towards Consistent Video Geometry Estimation (paper 2605.30060, published May 28)
