MoAI: Aligned Novel View Image and Geometry Synthesis
Model Description
MoAI is a diffusion-based framework that performs aligned novel view image and geometry generation from an arbitrary number of unposed reference images. The model can generate novel views from extrapolative, far-away camera viewpoints via a warping-and-inpainting methodology.
Key Innovation: Cross-modal Attention Instillation (MoAI). Spatial attention maps from the image generation pipeline are instilled into the geometry generation pipeline during training and inference, so that RGB and depth generation reinforce each other and the outputs stay spatially aligned.
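A minimal sketch of the mechanism (illustrative only, not the repository's implementation; the function name and tensor shapes are assumptions):

import torch

def instill_attention(image_attn_probs, geometry_values):
    # image_attn_probs: softmaxed spatial attention maps taken from an
    # attention layer of the RGB denoising U-Net,
    # shape (batch * heads, tokens, tokens).
    # geometry_values: the value tensor of the matching attention layer
    # in the geometry U-Net, shape (batch * heads, tokens, head_dim).
    # Rather than letting the geometry branch compute its own attention
    # distribution, reuse the image branch's maps so both modalities
    # attend to the same spatial structure, keeping RGB and depth aligned.
    return torch.bmm(image_attn_probs, geometry_values)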
Capabilities
- ✨ Novel view synthesis from unposed reference images
- 🎯 Aligned RGB image and depth/geometry generation
- 🚀 Extrapolative viewpoint generation (far-away cameras)
- 🔄 Consistent novel view outputs
Model Details
- Developed by: Naver AI Lab | KAIST Computer Vision Lab
- Model type: Novel view synthesis model fine-tuned from Stable Diffusion 2.1
- Paper: Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
- License: Apache 2.0 (subject to base model licenses)
Model Architecture
The model consists of the following components, fine-tuned from SD 2.1 for novel view synthesis:
- Denoising U-Net for RGB image generation
- Geometry U-Net for depth generation
- Reference U-Net for RGB image conditioning
- Geo-Reference U-Net for geometry conditioning
- Pose Guider for camera pose conditioning
- Cross-modal attention instillation mechanism
Installation
Requirements
pip install -r requirements.txt
Tested Environment:
- Python >= 3.10
- CUDA 11.8
- Ubuntu 20.04
- NVIDIA A6000 GPU
Required Pretrained Models
This model requires additional pretrained components:
- Image Encoder from lambdalabs/sd-image-variations-diffusers:
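One way to fetch only the image encoder into the layout shown below (a sketch using huggingface_hub; adjust local_dir to your checkpoint path):

from huggingface_hub import snapshot_download

# Download just the image_encoder subfolder into ./checkpoints/image_encoder
snapshot_download(
    repo_id="lambdalabs/sd-image-variations-diffusers",
    allow_patterns=["image_encoder/*"],
    local_dir="./checkpoints",
)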
- VGGT Integration from facebookresearch/vggt:
git clone https://github.com/facebookresearch/vggt.git
cd vggt
pip install -r requirements_dev.txt
Checkpoint Structure
After downloading, your checkpoint directory should look like:
checkpoints/
├── image_encoder/
│   ├── config.json
│   └── pytorch_model.bin
├── configs/
│   ├── image_config.json
│   └── geometry_config.json
└── main/
    ├── denoising_unet.pth
    ├── geometry_unet.pth
    ├── pose_guider.pth
    ├── geo_reference_unet.pth
    └── reference_unet.pth
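A quick sanity check that everything landed in the expected places (an illustrative script; adjust the root path if your checkpoints live elsewhere):

import os

EXPECTED = [
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "configs/image_config.json",
    "configs/geometry_config.json",
    "main/denoising_unet.pth",
    "main/geometry_unet.pth",
    "main/pose_guider.pth",
    "main/geo_reference_unet.pth",
    "main/reference_unet.pth",
]

missing = [p for p in EXPECTED if not os.path.isfile(os.path.join("checkpoints", p))]
print("All files present." if not missing else f"Missing files: {missing}")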
Usage
Basic Inference
from main import MoAI

# Initialize the model
moai_cfg = dict(
    pretrained_model_path='./checkpoints',
    checkpoint_name='main',
    half_precision_weights=True,
)
moai_nvs = MoAI(cfg=moai_cfg)

# Load and process your images
# (See the GitHub repository for detailed examples)
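A fuller sketch of a call, with hypothetical helper names (the generate method, its keyword arguments, and the output keys are all assumptions; the actual entry points are documented in the GitHub repository):

from PIL import Image

# Hypothetical usage; the real API may differ -- see the repository.
ref_paths = ['ref_0.png', 'ref_1.png']
refs = [Image.open(p).convert('RGB') for p in ref_paths]
outputs = moai_nvs.generate(reference_images=refs)  # hypothetical method
rgb_views = outputs['images']      # hypothetical key: novel view images
pointmaps = outputs['pointmaps']   # hypothetical key: aligned geometry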
Input Format
The model accepts:
- Reference images: RGB images (arbitrary number)
- Target viewpoints: Desired novel view poses
Output Format
- RGB novel view images: Photorealistic rendered novel views
- Depth/geometry maps: An aligned pointmap for each generated view
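Continuing the hypothetical inference sketch above, the outputs could be saved like this (assumes PIL images and NumPy pointmaps, which may differ from the actual return types):

import numpy as np

# Hypothetical output handling; actual return types may differ.
for i, (img, pts) in enumerate(zip(rgb_views, pointmaps)):
    img.save(f'view_{i:02d}.png')           # RGB novel view
    np.save(f'pointmap_{i:02d}.npy', pts)   # per-pixel 3D points, e.g. (H, W, 3)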
Citation
@misc{kwak2025moai,
      title={Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation},
      author={Min-Seop Kwak and Junho Kim and Sangdoo Yun and Dongyoon Han and Taekyoung Kim and Seungryong Kim and Jin-Hwa Kim},
      year={2025},
      eprint={2506.11924},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11924},
}
Acknowledgements
Our implementation is based on Moore-AnimateAnyone and related repositories. We thank the original authors for their contributions.
Contact
For questions and feedback:
- 💬 Open an issue on GitHub
- 🌐 Visit our project page
Note: Users must review the licenses of all dependencies (Stable Diffusion, VGGT, etc.) before use. This model card covers only the MoAI-specific components.