---
pipeline_tag: image-to-3d
license: apache-2.0
---
# Multi-view Pyramid Transformer: Look Coarser to See Broader
This repository contains the official model for the paper "Multi-view Pyramid Transformer: Look Coarser to See Broader".
Multi-view Pyramid Transformer (MVP) is a scalable multi-view transformer architecture designed to directly reconstruct large 3D scenes from tens to hundreds of images in a single forward pass. MVP is built on two core design principles:
- Local-to-global inter-view hierarchy: Gradually broadens the model's perspective from local views to groups and ultimately the full scene.
- Fine-to-coarse intra-view hierarchy: Starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens.
This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. When coupled with 3D Gaussian Splatting as the underlying 3D representation, MVP achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
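The dual hierarchy can be illustrated at the level of tensor shapes with a minimal, framework-free sketch. This is not the released implementation: the real model uses attention blocks rather than pooling/reshaping, and the function names, stage count, and group size below are illustrative assumptions.

```python
import numpy as np

def coarsen_tokens(tokens, factor=2):
    # Fine-to-coarse intra-view step (illustrative): average-pool
    # spatial tokens into fewer, more information-dense tokens.
    v, n, d = tokens.shape
    return tokens.reshape(v, n // factor, factor, d).mean(axis=2)

def merge_views(tokens, group=2):
    # Local-to-global inter-view step (illustrative): fuse neighboring
    # views so later stages attend over a broader set of views.
    v, n, d = tokens.shape
    return tokens.reshape(v // group, group * n, d)

# 8 input views, 64 tokens per view, 16-dim features
x = np.random.randn(8, 64, 16)
for _ in range(3):          # three pyramid stages
    x = coarsen_tokens(x)   # tokens per view halve
    x = merge_views(x)      # views fuse pairwise
print(x.shape)              # all 8 views merged into one global token set
```

Each stage halves the per-view token count while doubling the number of views seen jointly, so the total token count stays bounded even as the model's perspective widens to the full scene.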
## Installation
To set up the environment and install dependencies:
```bash
# create conda environment
conda create -n mvp python=3.11 -y
conda activate mvp

# install dependencies (adjust the PyTorch/CUDA version for your system)
pip install -r requirements.txt
pip install git+https://github.com/nerfstudio-project/gsplat.git
```
## Checkpoints
The model checkpoints are hosted on HuggingFace (mvp_540x960).
For training and evaluation, we used the DL3DV dataset after applying undistortion preprocessing with this script, originally introduced in Long-LRM.
Download the DL3DV benchmark dataset from here, and apply undistortion preprocessing.
## Inference
To perform inference with the pre-trained model:
- Update the `inference.ckpt_path` field in `configs/inference.yaml` with the path to the downloaded pretrained model.
- Update the entries in `data/dl3dv_eval.txt` to point to the correct processed dataset path.
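As a sketch, the relevant entry in `configs/inference.yaml` might look like the following; the exact key layout depends on the released config, and the checkpoint path is a placeholder:

```yaml
inference:
  ckpt_path: /path/to/mvp_540x960.ckpt
```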
```bash
# inference
CUDA_VISIBLE_DEVICES=0 python inference.py --config configs/inference.yaml
```
## Citation
If you find our work useful, please cite our paper:
```bibtex
@article{kang2025multi,
  title={Multi-view Pyramid Transformer: Look Coarser to See Broader},
  author={Kang, Gyeongjin and Yang, Seungkwon and Nam, Seungtae and Lee, Younggeun and Kim, Jungwoo and Park, Eunbyung},
  journal={arXiv preprint arXiv:2512.07806},
  year={2025}
}
```
## Acknowledgements
This project builds on many amazing research works; many thanks to all the authors for sharing their code!