	---
	license: mit
	library_name: volfill
	pipeline_tag: image-to-3d
	tags:
	- 3d-reconstruction
	- amodal-completion
	- single-view-reconstruction
	- scene-reconstruction
	- flow-matching
	- diffusion-transformer
	- point-cloud
	---

	# VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

	<p align="center">
	<a href="https://arxiv.org/abs/2605.31466"><img src="https://img.shields.io/badge/arXiv-2605.31466-b31b1b.svg" alt="arXiv"></a>
	<a href="https://ngoductuanlhp.github.io/VolFill/"><img src="https://img.shields.io/badge/Project-Page-1f72b8.svg" alt="Project Page"></a>
	<a href="https://github.com/ngoductuanlhp/VolFill"><img src="https://img.shields.io/badge/Code-GitHub-181717.svg?logo=github" alt="Code"></a>
	</p>

	<p align="center">
	<img src="https://raw.githubusercontent.com/ngoductuanlhp/VolFill/main/assets/teaser.png" width="100%" alt="VolFill teaser">
	</p>

	Pretrained checkpoints for VolFill, which recovers the complete 3D scene
	geometry — including occluded surfaces — from a single RGB image, represented
	as a 256³ Truncated Unsigned Distance Function (TUDF) grid.

	> Authors: Tuan Duc Ngo¹, Chuang Gan¹, Evangelos Kalogerakis¹˒²
	>  |  ¹University of Massachusetts Amherst   ²Technical University of Crete

	## Model description

	VolFill is a two-stage latent generative model. A hybrid 3D VAE (sparse
	encoder → dense bottleneck → hybrid dense-to-sparse decoder) compresses the 256³
	TUDF to a compact 16³×16ch latent, and a **latent Diffusion Transformer trained
	with flow matching** generates that latent — conditioned on (a) frozen MoGe-v2
	image features as a global geometric prior and (b) a visible-geometry latent that
	anchors the occluded regions. At inference the model encodes the visible region,
	samples the DiT for 50 Euler steps with CFG = 3.0, and decodes to a TUDF that is
	thresholded into a point cloud or mesh.
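
The sampling step described above (Euler integration of a learned velocity field with classifier-free guidance) can be sketched in a few lines. This is a minimal, framework-free illustration, not the repository's sampler: `cfg_velocity`, `euler_sample`, and `toy_velocity` are hypothetical names, the time convention (noise at t = 0, data at t = 1) and the guidance formula `v_uncond + scale * (v_cond - v_uncond)` are common choices assumed here, and a toy constant velocity field stands in for the DiT.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: velocity(x, t, cond) -> dx/dt.
# cond=None selects the unconditional branch used by classifier-free guidance.
VelocityFn = Callable[[Sequence[float], float, Optional[Sequence[float]]], List[float]]


def cfg_velocity(velocity: VelocityFn, x: Sequence[float], t: float,
                 cond: Sequence[float], guidance_scale: float) -> List[float]:
    """Blend conditional and unconditional velocities (assumed CFG convention)."""
    v_uncond = velocity(x, t, None)
    v_cond = velocity(x, t, cond)
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def euler_sample(velocity: VelocityFn, x0: Sequence[float], cond: Sequence[float],
                 num_steps: int = 50, guidance_scale: float = 3.0) -> List[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = cfg_velocity(velocity, x, t, cond, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    # Stand-in for the DiT: constant velocity `cond`, zero when unconditional.
    if cond is None:
        return [0.0 for _ in x]
    return list(cond)


sample = euler_sample(toy_velocity, x0=[0.0, 0.0], cond=[1.0, -2.0])
```

With this toy field the guided velocity is simply three times `cond`, so the sample ends close to `x0 + 3 * cond`. It shows how a guidance scale above 1 pushes the trajectory further along the conditional direction.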

	## Files

	| File | Description |
	|---|---|
	| `volfill_dit.pth` | Latent flow-matching DiT (visible-latent conditioned, 16× variant) |
	| `volfill_vae.pth` | Hybrid 3D VAE (sparse encoder + hybrid decoder) |
	| `inference.yaml` | Model architecture + sampler config |
	| `latent_stats_16x.npy` | Latent normalization statistics (mean / std) |

	The MoGe geometry prior (`Ruicheng/moge-2-vitl`, `Ruicheng/moge-2-vitl-normal`)
	is downloaded automatically on first run.
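
As a rough illustration of how statistics like those in `latent_stats_16x.npy` are typically used, the sketch below standardizes a latent before it reaches the generative model and undoes the scaling before decoding. The card does not document the file layout, so the per-channel `(16,)` shape of `mean` and `std`, the `(batch, channel, depth, height, width)` latent layout, and the helper names are all assumptions. The arrays here are synthetic rather than loaded from the file.

```python
import numpy as np

# Assumed layout: one mean/std value per latent channel, broadcast over
# (batch, channel, depth, height, width).
STATS_SHAPE = (1, -1, 1, 1, 1)


def normalize_latent(z, mean, std, eps=1e-6):
    """Standardize a latent per channel (hypothetical helper)."""
    return (z - mean.reshape(STATS_SHAPE)) / (std.reshape(STATS_SHAPE) + eps)


def denormalize_latent(z_norm, mean, std, eps=1e-6):
    """Invert normalize_latent so the latent can be passed to a decoder."""
    return z_norm * (std.reshape(STATS_SHAPE) + eps) + mean.reshape(STATS_SHAPE)


# Synthetic stand-ins for the real statistics and a 16^3 x 16-channel latent.
rng = np.random.default_rng(0)
mean = rng.normal(size=16)
std = rng.uniform(0.5, 2.0, size=16)
z = rng.normal(size=(1, 16, 16, 16, 16))

z_norm = normalize_latent(z, mean, std)
z_round_trip = denormalize_latent(z_norm, mean, std)
```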

	## Usage

	Install the inference code from the [GitHub repo](https://github.com/ngoductuanlhp/VolFill)
	(CUDA 13.0 / RTX 40-series). The weights, config, and latent statistics in this
	model repo are then downloaded automatically:

	```bash
	# CLI — all weights/config/stats auto-download from this repo
	python -m volfill.amodal.inference_latent_visible \
	--hf_repo TuanNgo/VolFill --input_path image.jpg --output ./results/
	```

	```python
	from PIL import Image
	from volfill.amodal.inference_latent_visible import LatentTUDFVisibleInference

	infer = LatentTUDFVisibleInference.from_pretrained("TuanNgo/VolFill")
	result = infer(Image.open("image.jpg").convert("RGB"))
	# result["tudf"]: (1, 1, 256, 256, 256) predicted TUDF in [-1, 1]
	```
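
To turn the predicted grid into a point cloud, one simple option is to keep the centers of voxels whose value falls below a threshold. The sketch below is an assumption-heavy illustration rather than the repository's extraction code: `tudf_to_points` is a hypothetical helper, the unit-cube coordinate frame is chosen arbitrarily, and the right `threshold` depends on how the TUDF is normalized to [-1, 1], which this card does not specify. A tiny synthetic grid is used in place of a real prediction.

```python
import numpy as np


def tudf_to_points(tudf, threshold):
    """Return centers of voxels with value below `threshold` (hypothetical helper).

    Accepts an array whose last three axes are the spatial grid and whose
    leading axes (batch, channel) all have size 1. Points are returned in an
    assumed unit cube spanning [-0.5, 0.5] along each axis.
    """
    grid = np.asarray(tudf)
    grid = grid.reshape(grid.shape[-3:])      # drop singleton batch/channel axes
    voxel_idx = np.argwhere(grid < threshold)  # (N, 3) integer voxel indices
    resolution = np.array(grid.shape, dtype=np.float64)
    return (voxel_idx + 0.5) / resolution - 0.5


# Synthetic 4^3 grid: everything "far" (1.0) except one near-surface voxel.
toy_tudf = np.ones((1, 1, 4, 4, 4))
toy_tudf[0, 0, 1, 2, 3] = -1.0
points = tudf_to_points(toy_tudf, threshold=0.0)
```

For a real prediction, the grid would first need converting to a NumPy array (for example with `.cpu().numpy()` if `result["tudf"]` is a PyTorch tensor, which is assumed here and not stated in this card).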

	See the GitHub README for installation, point-cloud visualization, and local /
	Google-Drive checkpoint options.

	## Citation

	```bibtex
	@article{ngo2026volfill,
	title = {VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching},
	author = {Ngo, Tuan Duc and Gan, Chuang and Kalogerakis, Evangelos},
	journal = {arXiv preprint arXiv:2605.31466},
	year = {2026}
	}
	```

	## License & acknowledgements

	Released under the MIT License. The code builds on
	[LaRI](https://github.com/ruili3/LaRI), reuses sparse-conv modules from
	[TRELLIS](https://github.com/microsoft/TRELLIS), and uses
	[MoGe-v2](https://github.com/microsoft/MoGe) as the visible geometry prior.