Add model card for Multi-view Pyramid Transformer (MVP)
#1 opened by nielsr (HF Staff)

README.md (added)
---
pipeline_tag: image-to-3d
license: apache-2.0
---

# Multi-view Pyramid Transformer: Look Coarser to See Broader

<div align="center">
<h1><span style="color:#93cf6a;">M</span>ulti-<span style="color:#93cf6a;">v</span>iew <span style="color:#93cf6a;">P</span>yramid Transformer: Look Coarser to See Broader</h1>

<a href="https://huggingface.co/papers/2512.07806"><img src="https://img.shields.io/badge/Paper-2512.07806-b31b1b" alt="Paper"></a>
<a href="https://gynjn.github.io/MVP/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a>
<a href="https://github.com/Gynjn/MVP"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"></a>
</div>

This repository contains the official model for the paper "[Multi-view Pyramid Transformer: Look Coarser to See Broader](https://huggingface.co/papers/2512.07806)".

Multi-view Pyramid Transformer (MVP) is a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. MVP is built on two core design principles:

1. **Local-to-global inter-view hierarchy**: gradually broadens the model's perspective from individual views to groups of views and, ultimately, the full scene.
2. **Fine-to-coarse intra-view hierarchy**: starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens.

This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. Coupled with 3D Gaussian Splatting as the underlying 3D representation, MVP achieves state-of-the-art generalizable reconstruction quality while remaining efficient and scalable across a wide range of view configurations.
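The interplay of the two hierarchies can be pictured as alternating intra-view token pooling with inter-view group merging, so each level attends over a broader scope at a coarser resolution. The NumPy sketch below is purely illustrative: all names, pooling factors, and sizes are invented for this example and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 8 input views, each with 64 spatial tokens of width 16.
# (Illustrative numbers only, not MVP's actual configuration.)
views = [rng.standard_normal((64, 16)) for _ in range(8)]

def coarsen(tokens, pool=4):
    """Fine-to-coarse intra-view step: average-pool neighboring tokens."""
    n, d = tokens.shape
    return tokens.reshape(n // pool, pool, d).mean(axis=1)

def merge_groups(groups, fan_in=2):
    """Local-to-global inter-view step: fuse neighboring views/groups."""
    return [np.concatenate(groups[i:i + fan_in], axis=0)
            for i in range(0, len(groups), fan_in)]

levels = [views]
while len(levels[-1]) > 1:
    coarsened = [coarsen(t) for t in levels[-1]]  # fewer tokens per group
    levels.append(merge_groups(coarsened))        # broader scope per group

# (number of groups, tokens per group) at each level
sizes = [(len(lvl), lvl[0].shape[0]) for lvl in levels]
print(sizes)  # [(8, 64), (4, 32), (2, 16), (1, 8)]
```

Note how the token count per group shrinks as the scope widens, which is what keeps attention cost bounded as the model "looks coarser to see broader".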

## Installation

To set up the environment and install the dependencies:

```bash
# create and activate the conda environment
conda create -n mvp python=3.11 -y
conda activate mvp

# install dependencies (pick the PyTorch build matching your CUDA version)
pip install -r requirements.txt
pip install git+https://github.com/nerfstudio-project/gsplat.git
```

## Checkpoints

The model checkpoint is hosted on [HuggingFace](https://huggingface.co/Gynjn/MVP) ([mvp_540x960](https://huggingface.co/Gynjn/MVP/resolve/main/mvp.pt?download=true)).

For training and evaluation, we used the DL3DV dataset after applying undistortion preprocessing with this [script](https://github.com/arthurhero/Long-LRM/blob/main/data/prosess_dl3dv.py), originally introduced in [Long-LRM](https://arthurhero.github.io/projects/llrm/index.html).

Download the DL3DV benchmark dataset from [here](https://huggingface.co/datasets/DL3DV/DL3DV-Benchmark/tree/main) and apply the same undistortion preprocessing.
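The checkpoint can also be fetched programmatically rather than through the browser link. A small sketch: the helper below just rebuilds Hugging Face's standard `resolve` download URL, and the commented call assumes the `huggingface_hub` package is installed.

```python
def hub_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct download URL for a file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

ckpt_url = hub_resolve_url("Gynjn/MVP", "mvp.pt")
print(ckpt_url)  # https://huggingface.co/Gynjn/MVP/resolve/main/mvp.pt

# With huggingface_hub installed, a cached download is one call:
# from huggingface_hub import hf_hub_download
# ckpt_path = hf_hub_download(repo_id="Gynjn/MVP", filename="mvp.pt")
```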

## Inference

To run inference with the pre-trained model:

1. Set the `inference.ckpt_path` field in `configs/inference.yaml` to the path of the downloaded checkpoint.
2. Update the entries in `data/dl3dv_eval.txt` to point to the processed dataset path.

```bash
# inference
CUDA_VISIBLE_DEVICES=0 python inference.py --config configs/inference.yaml
```
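Step 1 above can also be scripted instead of editing the file by hand. A hypothetical sketch using PyYAML; the config layout below is invented from the field name `inference.ckpt_path`, and the real `configs/inference.yaml` has more fields.

```python
import yaml

# Hypothetical minimal config mirroring the inference.ckpt_path field.
cfg = yaml.safe_load("""
inference:
  ckpt_path: null
""")

cfg["inference"]["ckpt_path"] = "checkpoints/mvp.pt"  # hypothetical local path

# Serialize the updated config (in practice, write this back to the file).
updated = yaml.safe_dump(cfg, sort_keys=False)
print(updated)
```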

## Citation

If you find our work useful, please cite our paper:

```bibtex
@article{kang2025multi,
  title={Multi-view Pyramid Transformer: Look Coarser to See Broader},
  author={Kang, Gyeongjin and Yang, Seungkwon and Nam, Seungtae and Lee, Younggeun and Kim, Jungwoo and Park, Eunbyung},
  journal={arXiv preprint arXiv:2512.07806},
  year={2025}
}
```

## Acknowledgements

This project builds on many amazing research works; many thanks to all the authors for sharing!

- [Gaussian-Splatting](https://github.com/graphdeco-inria/gaussian-splatting) and [gsplat](https://github.com/nerfstudio-project/gsplat)
- [LVSM](https://github.com/haian-jin/LVSM)
- [Long-LRM](https://github.com/arthurhero/Long-LRM)
- [LaCT](https://github.com/a1600012888/LaCT)
- [iLRM](https://github.com/Gynjn/iLRM)
- [ProPE](https://github.com/liruilong940607/prope)
- [LVT](https://toobaimt.github.io/lvt/)