---
license: apache-2.0
pipeline_tag: image-to-3d
tags:
- novel-view-synthesis
- multi-view-diffusion
- depth-estimation
- 3d-reconstruction
---

# GLD: Geometric Latent Diffusion

**Repurposing Geometric Foundation Models for Multi-view Diffusion**

[[Paper]](https://huggingface.co/papers/2603.22275) | [[Project Page]](https://cvlab-kaist.github.io/GLD/) | [[Code]](https://github.com/cvlab-kaist/GLD)

Geometric Latent Diffusion (GLD) is a framework that repurposes the geometrically consistent feature space of geometric foundation models (such as Depth Anything 3 and VGGT) as the latent space for multi-view diffusion. By operating in this space rather than a view-independent VAE latent space, GLD achieves consistent novel view synthesis (NVS) and 3D reconstruction with significantly faster training convergence.

## Quick Start

To use these models, follow the setup instructions in the [official GitHub repository](https://github.com/cvlab-kaist/GLD).

```bash
git clone https://github.com/cvlab-kaist/GLD.git
cd GLD
conda env create -f environment.yml
conda activate gld

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"

# Run demo
./run_demo.sh da3
```

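The download command above fetches every file in the model repository. If only one backbone is needed, the download can be narrowed with the `allow_patterns` argument of `snapshot_download`. The following is a minimal sketch, not part of the official setup: `patterns_for` and `download_backbone` are hypothetical helper names, the path patterns are copied from the Files table below, and it is an assumption that the demo for a given backbone needs no files beyond those listed for it.

```python
# Hypothetical helpers for a per-backbone download.
# Assumption: one backbone's demo only needs that backbone's files.

REPO_ID = "SeonghuJeon/GLD"

# Path patterns copied from the Files table, grouped by backbone.
PATTERNS = {
    "da3": [
        "checkpoints/da3_level1.pt",
        "checkpoints/da3_cascade.pt",
        "pretrained_models/da3/*",
        "pretrained_models/mae_decoder.pt",
    ],
    "vggt": [
        "checkpoints/vggt_level1.pt",
        "checkpoints/vggt_cascade.pt",
        "pretrained_models/vggt/*",
    ],
}


def patterns_for(backbone):
    """Return the glob patterns for one backbone ('da3' or 'vggt')."""
    if backbone not in PATTERNS:
        raise ValueError(
            f"unknown backbone {backbone!r}; expected one of {sorted(PATTERNS)}"
        )
    return list(PATTERNS[backbone])


def download_backbone(backbone, local_dir="."):
    """Download only the files matching one backbone's patterns."""
    # Imported lazily so patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        REPO_ID,
        local_dir=local_dir,
        allow_patterns=patterns_for(backbone),
    )
```

Calling `download_backbone("da3")` from the repository root would place the selected files under `checkpoints/` and `pretrained_models/`, the same layout the full download produces. If the demo reports a missing file, fall back to the full download.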
## Files

| File | Description | Params | Size |
|------|-------------|--------|------|
| `checkpoints/da3_level1.pt` | DA3 Level-1 diffusion (EMA) | 783M | 2.9G |
| `checkpoints/da3_cascade.pt` | DA3 Cascade: L1→L0 (EMA) | 473M | 1.8G |
| `checkpoints/vggt_level1.pt` | VGGT Level-1 diffusion (EMA) | 806M | 3.0G |
| `checkpoints/vggt_cascade.pt` | VGGT Cascade: L1→L0 (EMA) | 806M | 3.0G |
| `pretrained_models/da3/model.safetensors` | DA3-Base encoder | 135M | 0.5G |
| `pretrained_models/da3/dpt_decoder.pt` | DPT decoder (depth + geometry) | - | 1.1G |
| `pretrained_models/mae_decoder.pt` | DA3 MAE decoder (EMA, decoder-only) | 423M | 1.6G |
| `pretrained_models/vggt/mae_decoder.pt` | VGGT MAE decoder (EMA, decoder-only) | 425M | 1.6G |

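The Size column is consistent with weights stored as 32-bit floats and sizes reported in binary gigabytes: 4 bytes per parameter, divided by 2^30. The short check below is only arithmetic on the numbers in the table above; it does not inspect the files, so the storage precision remains an inference rather than a documented fact. The DPT decoder is left out because its parameter count is not listed.

```python
# Parameter counts (in millions) and listed sizes (in G), copied from the table.
TABLE = {
    "checkpoints/da3_level1.pt": (783, 2.9),
    "checkpoints/da3_cascade.pt": (473, 1.8),
    "checkpoints/vggt_level1.pt": (806, 3.0),
    "checkpoints/vggt_cascade.pt": (806, 3.0),
    "pretrained_models/da3/model.safetensors": (135, 0.5),
    "pretrained_models/mae_decoder.pt": (423, 1.6),
    "pretrained_models/vggt/mae_decoder.pt": (425, 1.6),
}

BYTES_PER_PARAM = 4  # assumption: fp32 weights
GIB = 2 ** 30        # assumption: "G" means binary gigabytes


def estimated_size_gib(params_millions):
    """Estimate file size in GiB from a parameter count given in millions."""
    return params_millions * 1e6 * BYTES_PER_PARAM / GIB


def report():
    """Return (path, estimated size rounded to 0.1, listed size) per row."""
    return [
        (path, round(estimated_size_gib(params_m), 1), listed)
        for path, (params_m, listed) in TABLE.items()
    ]


for path, estimated, listed in report():
    print(f"{path}: estimated {estimated}G, listed {listed}G")
```

Under these two assumptions every listed row reproduces to one decimal place, which also gives a quick way to budget disk space before downloading.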

Stage-2 and MAE decoder checkpoints contain **EMA weights only**. The MAE decoder checkpoints additionally contain **decoder weights only** (encoder removed).

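Because these checkpoints are stripped down to EMA or decoder-only weights, their key layout may differ from that of a full training checkpoint. This card does not document the layout, so the sketch below simply lists what a `.pt` file contains instead of assuming any key names. `summarize` and `inspect_checkpoint` are hypothetical helpers, and `weights_only=True` assumes the file holds only tensors and plain containers; it may need to be relaxed if loading fails.

```python
from collections.abc import Mapping


def summarize(obj, prefix="", max_depth=2):
    """Describe a nested mapping: one 'name: description' line per entry."""
    lines = []
    if isinstance(obj, Mapping) and max_depth > 0:
        for key, value in obj.items():
            name = f"{prefix}{key}"
            if isinstance(value, Mapping):
                lines.append(f"{name}: mapping with {len(value)} entries")
                lines.extend(summarize(value, name + ".", max_depth - 1))
            else:
                # Tensors expose .shape; anything else is reported by type name.
                shape = getattr(value, "shape", None)
                if shape is not None:
                    lines.append(f"{name}: shape {tuple(shape)}")
                else:
                    lines.append(f"{name}: {type(value).__name__}")
    return lines


def inspect_checkpoint(path, max_depth=1):
    """Load a .pt checkpoint on CPU and return a summary of its top-level keys."""
    # Imported lazily so summarize can be used without PyTorch installed.
    import torch

    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    return summarize(checkpoint, max_depth=max_depth)
```

For example, `print("\n".join(inspect_checkpoint("checkpoints/da3_level1.pt")))` would show whether the weights sit at the top level or under a nested key before any loading code is written.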
## Citation

```bibtex
@article{jang2026gld,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}
```

## Acknowledgements

Built upon [RAE](https://github.com/nicknign/RAE_release), [Depth Anything 3](https://github.com/DepthAnything/Depth-Anything-3), [VGGT](https://github.com/facebookresearch/vggt), [CUT3R](https://github.com/naver/CUT3R), and [SiT](https://github.com/willisma/SiT).