MoAI: Aligned Novel View Image and Geometry Synthesis
Model Description
MoAI is a diffusion-based framework that performs aligned novel view image and geometry generation from an arbitrary number of unposed reference images. The model can generate novel views from extrapolative, far-away camera viewpoints via a warping-and-inpainting methodology.
Key Innovation: Cross-modal Attention Instillation (MoAI). Spatial attention maps from the image generation pipeline are instilled into the geometry generation pipeline during training and inference, so that RGB and depth generation reinforce each other and the outputs stay spatially aligned.
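A minimal sketch of the mechanism (illustrative only, not the repository's implementation; the function name and tensor shapes are assumptions):

import torch

def instill_attention(image_attn_probs, geometry_values):
    # image_attn_probs: softmaxed spatial attention maps taken from an
    # attention layer of the RGB denoising U-Net,
    # shape (batch * heads, tokens, tokens).
    # geometry_values: the value tensor of the matching attention layer
    # in the geometry U-Net, shape (batch * heads, tokens, head_dim).
    # Rather than letting the geometry branch compute its own attention
    # distribution, reuse the image branch's maps so both modalities
    # attend to the same spatial structure, keeping RGB and depth aligned.
    return torch.bmm(image_attn_probs, geometry_values)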
Capabilities
- ✨ Novel view synthesis from unposed reference images
- 🎯 Aligned RGB image and depth/geometry generation
- 🚀 Extrapolative viewpoint generation (far-away cameras)
- 🔄 Consistent novel view outputs
Model Details
- Developed by: Naver AI Lab | KAIST Computer Vision Lab
- Model type: Novel view synthesis model fine-tuned from Stable Diffusion 2.1
- Paper: Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
- License: Apache 2.0 (subject to base model licenses)
Model Architecture
The model consists of the following components, fine-tuned from SD 2.1 for novel view synthesis:
- Denoising U-Net for RGB image generation
- Geometry U-Net for depth generation
- Reference U-Net for RGB image conditioning
- Geo-Reference U-Net for geometry conditioning
- Pose Guider for camera pose conditioning
- Cross-modal attention instillation mechanism
Installation
Requirements
pip install -r requirements.txt
Tested Environment:
- Python >= 3.10
- CUDA 11.8
- Ubuntu 20.04
- NVIDIA A6000 GPU
Required Pretrained Models
This model requires additional pretrained components:
- Image Encoder from lambdalabs/sd-image-variations-diffusers:
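One way to fetch only the image encoder into the layout shown below (a sketch using huggingface_hub; adjust local_dir to your checkpoint path):

from huggingface_hub import snapshot_download

# Download just the image_encoder subfolder into ./checkpoints/image_encoder
snapshot_download(
    repo_id="lambdalabs/sd-image-variations-diffusers",
    allow_patterns=["image_encoder/*"],
    local_dir="./checkpoints",
)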
- VGGT Integration from facebookresearch/vggt:
git clone https://github.com/facebookresearch/vggt.git
cd vggt
pip install -r requirements_dev.txt
Checkpoint Structure
After downloading, your checkpoint directory should look like:
checkpoints/
├── image_encoder/
│   ├── config.json
│   └── pytorch_model.bin
├── configs/
│   ├── image_config.json
│   └── geometry_config.json
└── main/
    ├── denoising_unet.pth
    ├── geometry_unet.pth
    ├── pose_guider.pth
    ├── geo_reference_unet.pth
    └── reference_unet.pth
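A quick sanity check that everything landed in the expected places (an illustrative script; adjust the root path if your checkpoints live elsewhere):

import os

EXPECTED = [
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "configs/image_config.json",
    "configs/geometry_config.json",
    "main/denoising_unet.pth",
    "main/geometry_unet.pth",
    "main/pose_guider.pth",
    "main/geo_reference_unet.pth",
    "main/reference_unet.pth",
]

missing = [p for p in EXPECTED if not os.path.isfile(os.path.join("checkpoints", p))]
print("All files present." if not missing else f"Missing files: {missing}")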
Usage
Basic Inference
from main import MoAI

# Initialize the model
moai_cfg = dict(
    pretrained_model_path='./checkpoints',
    checkpoint_name='main',
    half_precision_weights=True,
)
moai_nvs = MoAI(cfg=moai_cfg)

# Load and process your images
# (See the GitHub repository for detailed examples)
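A fuller sketch of a call, with hypothetical helper names (the generate method, its keyword arguments, and the output keys are all assumptions; the actual entry points are documented in the GitHub repository):

from PIL import Image

# Hypothetical usage; the real API may differ -- see the repository.
ref_paths = ['ref_0.png', 'ref_1.png']
refs = [Image.open(p).convert('RGB') for p in ref_paths]
outputs = moai_nvs.generate(reference_images=refs)  # hypothetical method
rgb_views = outputs['images']      # hypothetical key: novel view images
pointmaps = outputs['pointmaps']   # hypothetical key: aligned geometry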
Input Format
The model accepts:
- Reference images: RGB images (arbitrary number)
- Target viewpoints: Desired novel view poses
Output Format
- RGB novel view images: Photorealistic rendered novel views
- Depth/geometry maps: An aligned pointmap for each generated view
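Continuing the hypothetical inference sketch above, the outputs could be saved like this (assumes PIL images and NumPy pointmaps, which may differ from the actual return types):

import numpy as np

# Hypothetical output handling; actual return types may differ.
for i, (img, pts) in enumerate(zip(rgb_views, pointmaps)):
    img.save(f'view_{i:02d}.png')           # RGB novel view
    np.save(f'pointmap_{i:02d}.npy', pts)   # per-pixel 3D points, e.g. (H, W, 3)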
Citation
@misc{kwak2025moai,
      title={Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation},
      author={Min-Seop Kwak and Junho Kim and Sangdoo Yun and Dongyoon Han and Taekyoung Kim and Seungryong Kim and Jin-Hwa Kim},
      year={2025},
      eprint={2506.11924},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11924},
}
Acknowledgements
Our implementation is based on Moore-AnimateAnyone and related repositories. We thank the original authors for their contributions.
Contact
For questions and feedback:
- 💬 Open an issue on GitHub
- 🌐 Visit our project page
Note: Users must review the licenses of all dependencies (Stable Diffusion, VGGT, etc.) before use. This model card covers only the MoAI-specific components.