---
pipeline_tag: image-to-3d
license: apache-2.0
---
# Multi-view Pyramid Transformer: Look Coarser to See Broader
This repository contains the official model for the paper "Multi-view Pyramid Transformer: Look Coarser to See Broader".
Multi-view Pyramid Transformer (MVP) is a scalable multi-view transformer architecture designed to directly reconstruct large 3D scenes from tens to hundreds of images in a single forward pass. MVP is built on two core design principles:
- Local-to-global inter-view hierarchy: Gradually broadens the model's perspective from local views to groups and ultimately the full scene.
- Fine-to-coarse intra-view hierarchy: Starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens.
This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. When coupled with 3D Gaussian Splatting as the underlying 3D representation, MVP achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
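The dual hierarchy can be illustrated at the level of tensor shapes with a minimal, framework-free sketch. This is not the released implementation: the real model uses attention blocks rather than pooling/reshaping, and the function names, stage count, and group size below are illustrative assumptions.

```python
import numpy as np

def coarsen_tokens(tokens, factor=2):
    # Fine-to-coarse intra-view step (illustrative): average-pool
    # spatial tokens into fewer, more information-dense tokens.
    v, n, d = tokens.shape
    return tokens.reshape(v, n // factor, factor, d).mean(axis=2)

def merge_views(tokens, group=2):
    # Local-to-global inter-view step (illustrative): fuse neighboring
    # views so later stages attend over a broader set of views.
    v, n, d = tokens.shape
    return tokens.reshape(v // group, group * n, d)

# 8 input views, 64 tokens per view, 16-dim features
x = np.random.randn(8, 64, 16)
for _ in range(3):          # three pyramid stages
    x = coarsen_tokens(x)   # tokens per view halve
    x = merge_views(x)      # views fuse pairwise
print(x.shape)              # all 8 views merged into one global token set
```

Each stage halves the per-view token count while doubling the number of views seen jointly, so the total token count stays bounded even as the model's perspective widens to the full scene.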
## Installation
To set up the environment and install dependencies:
```bash
# create conda environment
conda create -n mvp python=3.11 -y
conda activate mvp

# install dependencies (adjust the PyTorch/CUDA version for your system)
pip install -r requirements.txt
pip install git+https://github.com/nerfstudio-project/gsplat.git
```
## Checkpoints
The model checkpoints are hosted on HuggingFace (mvp_540x960).
For training and evaluation, we used the DL3DV dataset after applying undistortion preprocessing with this script, originally introduced in Long-LRM.
Download the DL3DV benchmark dataset from here, and apply undistortion preprocessing.
## Inference
To perform inference with the pre-trained model:
- Update the `inference.ckpt_path` field in `configs/inference.yaml` with the path to the downloaded pretrained model.
- Update the entries in `data/dl3dv_eval.txt` to point to the correct processed dataset path.
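As a sketch, the relevant entry in `configs/inference.yaml` might look like the following; the exact key layout depends on the released config, and the checkpoint path is a placeholder:

```yaml
inference:
  ckpt_path: /path/to/mvp_540x960.ckpt
```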
```bash
# inference
CUDA_VISIBLE_DEVICES=0 python inference.py --config configs/inference.yaml
```
## Citation
If you find our work useful, please cite our paper:
```bibtex
@article{kang2025multi,
  title={Multi-view Pyramid Transformer: Look Coarser to See Broader},
  author={Kang, Gyeongjin and Yang, Seungkwon and Nam, Seungtae and Lee, Younggeun and Kim, Jungwoo and Park, Eunbyung},
  journal={arXiv preprint arXiv:2512.07806},
  year={2025}
}
```
## Acknowledgements
This project builds on many amazing research works; many thanks to all the authors for sharing their code!