---
license: mit
library_name: volfill
pipeline_tag: image-to-3d
tags:
- 3d-reconstruction
- amodal-completion
- single-view-reconstruction
- scene-reconstruction
- flow-matching
- diffusion-transformer
- point-cloud
---

# VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

<p align="center">
<a href="https://arxiv.org/abs/2605.31466"><img src="https://img.shields.io/badge/arXiv-2605.31466-b31b1b.svg" alt="arXiv"></a>
<a href="https://ngoductuanlhp.github.io/VolFill/"><img src="https://img.shields.io/badge/Project-Page-1f72b8.svg" alt="Project Page"></a>
<a href="https://github.com/ngoductuanlhp/VolFill"><img src="https://img.shields.io/badge/Code-GitHub-181717.svg?logo=github" alt="Code"></a>
</p>

<p align="center">
<img src="https://raw.githubusercontent.com/ngoductuanlhp/VolFill/main/assets/teaser.png" width="100%" alt="VolFill teaser">
</p>

Pretrained checkpoints for **VolFill**, which recovers the **complete** 3D scene
geometry (including occluded surfaces) from a **single RGB image**, represented
as a 256³ Truncated Unsigned Distance Function (TUDF) grid.

> Authors: Tuan Duc Ngo¹, Chuang Gan¹, Evangelos Kalogerakis¹˒²
>
> ¹University of Massachusetts Amherst, ²Technical University of Crete

## Model description

VolFill is a two-stage latent generative model. First, a **hybrid 3D VAE** (sparse
encoder → dense bottleneck → hybrid dense-to-sparse decoder) compresses the 256³
TUDF to a compact 16³ × 16-channel latent. Second, a **latent Diffusion Transformer
(DiT) trained with flow matching** generates that latent, conditioned on (a) frozen
MoGe-v2 image features as a global geometric prior and (b) a visible-geometry latent
that anchors the occluded regions. At inference, the model encodes the visible
region, samples the DiT with 50 Euler steps at a classifier-free guidance (CFG)
scale of 3.0, and decodes the result to a TUDF that is thresholded into a point
cloud or mesh.

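The sampling loop described above can be illustrated with a small, dependency-free sketch. It is not VolFill's sampler: the `velocity` callable stands in for the DiT, the time convention (noise at `t = 0`, data at `t = 1`) and the guidance formula `v_uncond + scale * (v_cond - v_uncond)` are common choices assumed here, and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
# velocity(x, t, conditioned) -> predicted velocity; a stand-in for the DiT
VelocityFn = Callable[[Vector, float, bool], Vector]


def euler_cfg_sample(velocity: VelocityFn, x_init: Vector,
                     num_steps: int = 50, cfg_scale: float = 3.0) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = list(x_init)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond = velocity(x, t, True)     # prediction with conditioning
        v_uncond = velocity(x, t, False)  # prediction with conditioning dropped
        # Classifier-free guidance: extrapolate from the unconditional
        # prediction toward the conditional one (assumed formula).
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    """Toy model: constant drift of +1 per unit time when conditioned, else 0."""
    return [1.0 if conditioned else 0.0 for _ in x]


sample = euler_cfg_sample(toy_velocity, [0.0, 0.5])
```

With the toy velocity, a guidance scale of 3.0 triples the conditional drift, so each coordinate moves by roughly 3 over the unit time interval. In the real model the state would be the 16³ × 16-channel latent rather than a short list.
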
## Files

| File | Description |
|---|---|
| `volfill_dit.pth` | Latent flow-matching DiT (visible-latent conditioned, 16× variant) |
| `volfill_vae.pth` | Hybrid 3D VAE (sparse encoder + hybrid decoder) |
| `inference.yaml` | Model architecture + sampler config |
| `latent_stats_16x.npy` | Latent normalization statistics (mean / std) |

The MoGe geometry prior checkpoints (`Ruicheng/moge-2-vitl`, `Ruicheng/moge-2-vitl-normal`)
are downloaded automatically on first run.

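To pre-fetch the files listed above (for example, to warm a cache before an offline run), a sketch along these lines may help. It assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the helper name `fetch_volfill_files` is hypothetical and not part of the VolFill code.

```python
from pathlib import Path
from typing import Dict, Optional

REPO_ID = "TuanNgo/VolFill"
FILES = ["volfill_dit.pth", "volfill_vae.pth", "inference.yaml", "latent_stats_16x.npy"]


def fetch_volfill_files(local_dir: Optional[str] = None) -> Dict[str, Path]:
    """Download each file in FILES and return a filename -> local path mapping."""
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return {
        name: Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir))
        for name in FILES
    }
```

Calling `fetch_volfill_files()` stores the files in the default Hugging Face cache, while passing `local_dir` places them in a directory of your choice. How to point the inference code at local checkpoints is covered in the GitHub README rather than here.
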
## Usage

Install the inference code from the [GitHub repo](https://github.com/ngoductuanlhp/VolFill)
(CUDA 13.0 / RTX 40-series). The weights, config, and latent statistics in this
model repo then download automatically:

```bash
# CLI: all weights/config/stats auto-download from this repo
python -m volfill.amodal.inference_latent_visible \
    --hf_repo TuanNgo/VolFill --input_path image.jpg --output ./results/
```

```python
from PIL import Image
from volfill.amodal.inference_latent_visible import LatentTUDFVisibleInference

# Build the inference pipeline; weights, config, and stats come from the Hub repo
infer = LatentTUDFVisibleInference.from_pretrained("TuanNgo/VolFill")
result = infer(Image.open("image.jpg").convert("RGB"))
# result["tudf"]: (1, 1, 256, 256, 256) predicted TUDF in [-1, 1]
```

See the GitHub README for installation, point-cloud visualization, and local /
Google-Drive checkpoint options.

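As a rough illustration of thresholding the predicted TUDF into a point cloud, the sketch below keeps voxels whose value falls at or below a cutoff. Several things are assumptions rather than VolFill's actual post-processing: that the grid has been converted to a NumPy array, that smaller values lie closer to a surface, the cutoff value itself, the axis order, and the choice to report voxel centers in a unit cube. The function name `tudf_to_points` is hypothetical.

```python
import numpy as np


def tudf_to_points(tudf: np.ndarray, threshold: float) -> np.ndarray:
    """Return (N, 3) voxel-center coordinates in [0, 1]^3 where tudf <= threshold."""
    grid = np.asarray(tudf)
    grid = grid.reshape(grid.shape[-3:])      # drop leading batch/channel dims
    idx = np.argwhere(grid <= threshold)      # (N, 3) integer voxel indices
    return (idx + 0.5) / np.array(grid.shape)  # voxel centers, normalized per axis


# Tiny stand-in grid: everything far from a surface except one voxel.
demo = np.ones((1, 1, 4, 4, 4), dtype=np.float32)
demo[0, 0, 1, 2, 3] = -1.0
points = tudf_to_points(demo, threshold=-0.9)
```

On a real 256³ prediction the same call would return one row per near-surface voxel. The GitHub README describes the project's own visualization path, which should be preferred over this sketch.
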
## Citation

```bibtex
@article{ngo2026volfill,
  title   = {VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching},
  author  = {Ngo, Tuan Duc and Gan, Chuang and Kalogerakis, Evangelos},
  journal = {arXiv preprint arXiv:2605.31466},
  year    = {2026}
}
```

## License & acknowledgements

Released under the MIT License. VolFill builds on
[LaRI](https://github.com/ruili3/LaRI), reuses sparse-conv modules from
[TRELLIS](https://github.com/microsoft/TRELLIS), and uses
[MoGe-v2](https://github.com/microsoft/MoGe) as the visible geometry prior.