	---
	license: apache-2.0
	pipeline_tag: image-to-3d
	tags:
	- novel-view-synthesis
	- multi-view-diffusion
	- depth-estimation
	- 3d-reconstruction
	---

	# GLD: Geometric Latent Diffusion

	Repurposing Geometric Foundation Models for Multi-view Diffusion

	[[Paper]](https://huggingface.co/papers/2603.22275) | [[Project Page]](https://cvlab-kaist.github.io/GLD/) | [[Code]](https://github.com/cvlab-kaist/GLD)

	Geometric Latent Diffusion (GLD) is a framework that repurposes the geometrically consistent feature space of geometric foundation models (such as Depth Anything 3 and VGGT) as the latent space for multi-view diffusion. By operating in this space rather than a view-independent VAE latent space, GLD achieves consistent novel view synthesis (NVS) and 3D reconstruction with significantly faster training convergence.

	## Quick Start

	To use these models, follow the setup instructions in the [official GitHub repository](https://github.com/cvlab-kaist/GLD).

	```bash
	git clone https://github.com/cvlab-kaist/GLD.git
	cd GLD
	conda env create -f environment.yml
	conda activate gld

	# Download all checkpoints
	python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"

	# Run demo
	./run_demo.sh da3
	```
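If only one backbone is needed, the full snapshot can be narrowed with the `allow_patterns` argument of `snapshot_download`. The sketch below is a minimal illustration, not part of the official setup: the grouping of files per backbone is an assumption inferred from the file names in the Files table, and `FILES_BY_BACKBONE`, `patterns_for`, and `download` are hypothetical helper names. Check the GitHub repository for the files a given demo actually loads.

```python
"""Selective download sketch for the GLD repository (assumed file grouping)."""

REPO_ID = "SeonghuJeon/GLD"

# Assumption: files grouped by backbone purely from their names and
# descriptions in the Files table. The demo scripts may need a different set.
FILES_BY_BACKBONE = {
    "da3": [
        "checkpoints/da3_level1.pt",
        "checkpoints/da3_cascade.pt",
        "pretrained_models/da3/model.safetensors",
        "pretrained_models/da3/dpt_decoder.pt",
        "pretrained_models/mae_decoder.pt",
    ],
    "vggt": [
        "checkpoints/vggt_level1.pt",
        "checkpoints/vggt_cascade.pt",
        "pretrained_models/vggt/mae_decoder.pt",
    ],
}


def patterns_for(backbone: str) -> list[str]:
    """Return the list of repository paths to allow for one backbone."""
    if backbone not in FILES_BY_BACKBONE:
        raise ValueError(
            f"unknown backbone {backbone!r}; expected one of {sorted(FILES_BY_BACKBONE)}"
        )
    return list(FILES_BY_BACKBONE[backbone])


def download(backbone: str, local_dir: str = ".") -> str:
    """Download only the selected files, preserving the repository layout."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        REPO_ID,
        allow_patterns=patterns_for(backbone),
        local_dir=local_dir,
    )
```

Calling `download("da3")` from the cloned `GLD` directory would place the files under `checkpoints/` and `pretrained_models/`, as the full download does. Going by the sizes listed in the Files table, the assumed DA3 subset is about 7.9G, against about 15.5G for the whole repository.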

	## Files

	| File | Description | Params | Size |
	|------|-------------|--------|------|
	| `checkpoints/da3_level1.pt` | DA3 Level-1 diffusion (EMA) | 783M | 2.9G |
	| `checkpoints/da3_cascade.pt` | DA3 Cascade: L1→L0 (EMA) | 473M | 1.8G |
	| `checkpoints/vggt_level1.pt` | VGGT Level-1 diffusion (EMA) | 806M | 3.0G |
	| `checkpoints/vggt_cascade.pt` | VGGT Cascade: L1→L0 (EMA) | 806M | 3.0G |
	| `pretrained_models/da3/model.safetensors` | DA3-Base encoder | 135M | 0.5G |
	| `pretrained_models/da3/dpt_decoder.pt` | DPT decoder (depth + geometry) | - | 1.1G |
	| `pretrained_models/mae_decoder.pt` | DA3 MAE decoder (EMA, decoder-only) | 423M | 1.6G |
	| `pretrained_models/vggt/mae_decoder.pt` | VGGT MAE decoder (EMA, decoder-only) | 425M | 1.6G |

	The Stage-2 checkpoints (the Level-1 and Cascade diffusion models) and the MAE decoder checkpoints contain EMA weights only.
	The MAE decoder checkpoints additionally have the encoder removed and contain decoder weights only.
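
As a quick sanity check after downloading, the Params and Size columns can be compared against each other. The sketch below assumes 32-bit floats (4 bytes per parameter) and that the listed sizes are in GiB. Neither is stated in this card, but under those assumptions every row with a parameter count agrees with its listed size to one decimal place. The DPT decoder row is skipped because no parameter count is given.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = 4  # assumption: weights stored as float32

# (file, parameters in millions, listed size in G), copied from the Files table.
ROWS = [
    ("checkpoints/da3_level1.pt", 783, 2.9),
    ("checkpoints/da3_cascade.pt", 473, 1.8),
    ("checkpoints/vggt_level1.pt", 806, 3.0),
    ("checkpoints/vggt_cascade.pt", 806, 3.0),
    ("pretrained_models/da3/model.safetensors", 135, 0.5),
    ("pretrained_models/mae_decoder.pt", 423, 1.6),
    ("pretrained_models/vggt/mae_decoder.pt", 425, 1.6),
]


def estimated_gib(params_millions: float) -> float:
    """Estimate on-disk size in GiB from a parameter count in millions."""
    return params_millions * 1e6 * BYTES_PER_PARAM / GIB


for path, params_m, listed in ROWS:
    print(f"{path:42s} listed {listed:.1f}G  estimated {estimated_gib(params_m):.2f} GiB")
```

A downloaded file that is much smaller than its estimate is more likely a truncated download than a different precision.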

	## Citation

	```bibtex
	@article{jang2026gld,
	title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
	author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
	journal={arXiv preprint arXiv:2603.22275},
	year={2026}
	}
	```

	## Acknowledgements

	Built upon [RAE](https://github.com/nicknign/RAE_release), [Depth Anything 3](https://github.com/DepthAnything/Depth-Anything-3), [VGGT](https://github.com/facebookresearch/vggt), [CUT3R](https://github.com/naver/CUT3R), and [SiT](https://github.com/willisma/SiT).