MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Authors: Bharath Krishnamurthy and Ajita Rattani
Affiliation: University of North Texas, Denton, Texas, USA
Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)
Abstract
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. However, existing approaches typically append auxiliary control modules or stitch together separate uni-modal networks.
We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over five state-of-the-art multimodal face generation models.
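As an illustrative sketch of the shared-attention idea (not the released implementation: single-head attention, toy shapes, and the assumption that text tokens share the same positional indexing are all simplifications for this snippet), spatial and semantic tokens can be rotated into one RoPE positional frame and attended jointly:

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotary position embedding: rotate each (even, odd) channel pair
    of x by a position-dependent angle. x: (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # (half,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def shared_rope_attention(spatial_tokens, text_tokens, dim):
    """Toy single-head self-attention over the concatenation of spatial
    and semantic tokens, with RoPE applied to queries and keys of both
    streams so they interact in one shared positional frame."""
    tokens = np.concatenate([spatial_tokens, text_tokens], axis=0)
    positions = np.arange(tokens.shape[0], dtype=np.float64)
    q = rope_rotate(tokens, positions)
    k = rope_rotate(tokens, positions)
    scores = q @ k.T / np.sqrt(dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens
```

Because RoPE is a pure rotation of channel pairs, it preserves token norms while injecting relative position into the attention scores shared by both streams.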
Repository Contents
This repository contains the trained model checkpoints for MMFace-DiT. Checkpoints are provided for both the Diffusion and Rectified Flow Matching (RFM, "Flow") paradigms at multiple resolutions.
- `dit-unified-flux-vae-256`: Diffusion-paradigm model for 256x256 resolution using the unified FLUX VAE (checkpoint-440700).
- `dit-unified-flux-vae-256-rfm`: Rectified Flow Matching (RFM) paradigm model for 256x256 resolution (checkpoint-283517).
- `dit-unified-flux-vae-512-rfm`: Rectified Flow Matching (RFM) paradigm model for 512x512 resolution (checkpoint-44070).
- `VAE`: Standalone VAE weights using the compressed 16-channel FLUX latent space.
- `stable-diffusion-2-1-base`: Base SD 2.1 component structures required by the pipeline (tokenizers, text encoders, schedulers).
Usage & Inference
Please refer to our Official GitHub Project Page for complete inference scripts, training code, and setup instructions.
Example Inference (Flow - Mask Conditioning)
```shell
python sample_flow.py \
  --config_path "configs/flow/config_256_unified_rfm.yml" \
  --weights_path "path/to/downloaded/dit-unified-flux-vae-256-rfm/checkpoint-283517/dit_model_weights_ema.safetensors" \
  --modality "mask" \
  --conditioning_path "path/to/mask.png" \
  --prompt "A stunning young woman with long, wavy blonde hair..." \
  --output_dir "Generated_Samples" \
  --num_samples 4 \
  --guidance_scale 7.5
```
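The `--guidance_scale` flag presumably controls standard classifier-free guidance. As a hedged sketch of that combination rule (not taken from the project code; `cfg_combine` is a name invented here):

```python
import numpy as np

def cfg_combine(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    model prediction toward the conditional one. A scale of 1.0
    reduces to the plain conditional prediction; larger values
    (e.g. 7.5) strengthen adherence to the prompt and conditions."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```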
Citation
If you find this work helpful for your research, please cite our CVPR paper:
@inproceedings{krishnamurthy2026mmfacedit,
title = {MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation},
author = {Krishnamurthy, Bharath and Rattani, Ajita},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}