MoAI: Aligned Novel View Image and Geometry Synthesis

Project Page · GitHub · Paper


Model Description

MoAI is a diffusion-based framework that performs aligned novel view image and geometry generation from an arbitrary number of unposed reference images. The model can generate novel views from extrapolative, far-away camera viewpoints using a warping-and-inpainting methodology.

Key Innovation: Cross-modal Attention Instillation (MoAI) - spatial attention maps from the image generation pipeline are instilled into the geometry generation pipeline during training and inference, so that the generated RGB images and depth share the same spatial structure.
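The core idea can be illustrated with a minimal sketch: self-attention probabilities computed in the image branch are reused ("instilled") by the geometry branch in place of its own. This is illustrative only; the actual MoAI layer names, hook points, and instillation schedule differ (see the paper and repository).

import torch
import torch.nn.functional as F

def attention(q, k, v, instilled_probs=None):
    # q, k, v: (batch, heads, tokens, head_dim)
    if instilled_probs is None:
        scale = q.shape[-1] ** -0.5
        probs = F.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    else:
        # Reuse ("instill") the attention map computed by the image branch
        probs = instilled_probs
    return probs @ v, probs

# The image branch computes spatial attention over its tokens...
q = k = v = torch.randn(1, 8, 64, 40)
_, image_probs = attention(q, k, v)
# ...and the geometry branch consumes that map instead of its own,
# so RGB and geometry denoising attend to the same spatial structure.
geo_out, _ = attention(q, k, v, instilled_probs=image_probs)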


Capabilities

  • ✨ Novel view synthesis from unposed reference images
  • 🎯 Aligned RGB image and depth/geometry generation
  • 🚀 Extrapolative viewpoint generation (far-away cameras)
  • 🔄 Consistent novel view outputs

Model Details

Model Architecture

The model consists of checkpoints finetuned from Stable Diffusion 2.1 for novel view synthesis:

  • Denoising U-Net for RGB image generation
  • Geometry U-Net for depth generation
  • Reference U-Net for RGB image conditioning
  • Geo-Reference U-Net for geometry conditioning
  • Pose Guider for camera pose conditioning
  • Cross-modal attention instillation mechanism

Installation

Requirements

pip install -r requirements.txt

Tested Environment:

  • Python >= 3.10
  • CUDA 11.8
  • Ubuntu 20.04
  • NVIDIA A6000 GPU

Required Pretrained Models

This model requires additional pretrained components:

  1. Image Encoder from lambdalabs/sd-image-variations-diffusers, placed under checkpoints/image_encoder/ (see the download sketch after this list)
  2. VGGT from facebookresearch/vggt:

git clone https://github.com/facebookresearch/vggt.git
pip install -r requirements_dev.txt
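A minimal sketch for fetching just the image encoder with huggingface_hub; the target path is an assumption based on the checkpoint layout shown in the next section:

from huggingface_hub import snapshot_download

# Download only the image_encoder/ subfolder (config.json + pytorch_model.bin)
# into ./checkpoints/image_encoder/, matching the layout below.
snapshot_download(
    repo_id="lambdalabs/sd-image-variations-diffusers",
    allow_patterns=["image_encoder/*"],
    local_dir="./checkpoints",
)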

Checkpoint Structure

After downloading, your checkpoint directory should look like:

checkpoints/
├── image_encoder/
│   ├── config.json
│   └── pytorch_model.bin
├── configs/
│   ├── image_config.json
│   └── geometry_config.json
└── main/
    ├── denoising_unet.pth
    ├── geometry_unet.pth
    ├── pose_guider.pth
    ├── geo_reference_unet.pth
    └── reference_unet.pth
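An optional sanity check that all files from this layout are in place before initializing the model:

from pathlib import Path

ckpt = Path("./checkpoints")
expected = [
    "image_encoder/config.json", "image_encoder/pytorch_model.bin",
    "configs/image_config.json", "configs/geometry_config.json",
    "main/denoising_unet.pth", "main/geometry_unet.pth",
    "main/pose_guider.pth", "main/geo_reference_unet.pth",
    "main/reference_unet.pth",
]
missing = [p for p in expected if not (ckpt / p).exists()]
if missing:
    raise FileNotFoundError(f"Missing checkpoint files: {missing}")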

Usage

Basic Inference

from main import MoAI

# Initialize model
moai_cfg = dict(
    pretrained_model_path='./checkpoints',
    checkpoint_name='main',
    half_precision_weights=True
)
moai_nvs = MoAI(cfg=moai_cfg)

# Load and process your images
# (See GitHub repository for detailed examples)
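As a hypothetical continuation of the snippet above: `run_inference`, its argument names, and the output keys here are illustrative stand-ins, not the actual API - consult the GitHub repository for the real entry point.

from PIL import Image

# Load any number of unposed reference images
refs = [Image.open(p).convert("RGB") for p in ["ref_0.png", "ref_1.png"]]

# target_poses: one camera pose per requested novel view (see "Input Format")
outputs = moai_nvs.run_inference(ref_images=refs, target_poses=target_poses)
rgb_views, pointmaps = outputs["images"], outputs["pointmaps"]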

Input Format

The model accepts:

  • Reference images: RGB images (arbitrary number)
  • Target viewpoints: Desired novel view camera poses (an illustrative pose construction follows this list)
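For illustration, a target viewpoint could be built as a 4x4 camera-to-world matrix with a simple look-at helper; the exact pose convention MoAI expects is an assumption here, so check the repository before relying on it:

import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    # Build a 4x4 camera-to-world matrix whose -z axis looks at `target`
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, true_up, -forward], axis=1)
    pose[:3, 3] = eye
    return pose

# An extrapolative, far-away viewpoint above and behind the scene center
target_poses = [look_at(eye=np.array([0.0, 1.5, 4.0]))]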

Output Format

  • RGB novel view images: Photorealistic rendered novel views
  • Geometry: An aligned pointmap (per-pixel 3D coordinates) for each generated view (a saving sketch follows this list)
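A short sketch for persisting one generated view, reusing the variable names from the hypothetical inference snippet above:

import numpy as np

rgb_views[0].save("novel_view_0.png")                # PIL image, saved as PNG
np.save("pointmap_0.npy", np.asarray(pointmaps[0])) # (H, W, 3) per-pixel 3D points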

Citation

@misc{kwak2025moai,
  title={Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation}, 
  author={Min-Seop Kwak and Junho Kim and Sangdoo Yun and Dongyoon Han and Taekyoung Kim and Seungryong Kim and Jin-Hwa Kim},
  year={2025},
  eprint={2506.11924},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.11924}, 
}

Acknowledgements

Our implementation is based on Moore-AnimateAnyone and related repositories. We thank the original authors for their contributions.

Contact

For questions and feedback, please open an issue on the GitHub repository.


Note: Users must check licenses of all dependencies (Stable Diffusion, VGGT, etc.) before use. This model card covers only the MoAI-specific components.
