	---
	license: mit
	tags:
	- diffusion
	- transformers
	- image-generation
	- face-generation
	- cvpr2026
	- pytorch
	---

	# MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

	[![Conference](https://img.shields.io/badge/CVPR-2026-blue)](https://cvpr.thecvf.com/)
	[![Paper](https://img.shields.io/badge/ArXiv-Paper-red)](https://arxiv.org/abs/2603.29029)
	[![Project Page](https://img.shields.io/badge/Project_Page-GitHub.io-blue)](https://vcbsl.github.io/MMFace-DiT/)
	[![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Bharath-K3/MMFace-DiT)
	[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/BharathK333/MMFace-DiT-Datasets)
	[![Demo](https://img.shields.io/badge/Demo-HuggingFace-orange)](https://huggingface.co/spaces/BharathK333/MMFace-DiT)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

	Authors: Bharath Krishnamurthy and Ajita Rattani

	Affiliation: University of North Texas, Denton, Texas, USA

	_Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)_

	## Abstract
	Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. However, existing approaches typically append auxiliary control modules or stitch together separate uni-modal networks.

	We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over five state-of-the-art multimodal face generation models.

	## Repository Contents
	This repository contains the trained model checkpoints for MMFace-DiT. Checkpoints are provided for the Diffusion paradigm (256x256) and the Rectified Flow Matching (RFM, or "Flow") paradigm (256x256 and 512x512).

	* `dit-unified-flux-vae-256`: Diffusion paradigm model for 256x256 resolution using the unified FLUX VAE (checkpoint-440700).
	* `dit-unified-flux-vae-256-rfm`: Rectified Flow Matching (RFM) paradigm model for 256x256 resolution (checkpoint-283517).
	* `dit-unified-flux-vae-512-rfm`: Rectified Flow Matching (RFM) paradigm model for 512x512 resolution (checkpoint-44070).
	* `VAE`: Standalone VAE weights utilizing the compressed 16-channel FLUX latent space.
	* `stable-diffusion-2-1-base`: Base SD 2.1 component structures required for the pipeline (Tokenizers, Text Encoders, Schedulers).
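
To fetch only one of the folders listed above rather than the whole repository, something like the following may work. It is a minimal sketch built on `huggingface_hub.snapshot_download` with an `allow_patterns` filter. The repository id `BharathK333/MMFace-DiT-Models` and the local directory name are assumptions, so substitute the id of the repository you are actually downloading from. The helper names `build_allow_patterns` and `download_checkpoint` are hypothetical and not part of the MMFace-DiT codebase.

```python
from pathlib import Path

# Folder names exactly as listed in this model card.
CHECKPOINT_FOLDERS = (
    "dit-unified-flux-vae-256",
    "dit-unified-flux-vae-256-rfm",
    "dit-unified-flux-vae-512-rfm",
)
# Shared components that the card lists alongside the checkpoints.
SHARED_FOLDERS = ("VAE", "stable-diffusion-2-1-base")


def build_allow_patterns(checkpoint_folder, include_shared=True):
    """Return glob patterns selecting one checkpoint folder (plus shared parts)."""
    if checkpoint_folder not in CHECKPOINT_FOLDERS:
        raise ValueError(f"unknown checkpoint folder: {checkpoint_folder!r}")
    folders = [checkpoint_folder]
    if include_shared:
        folders.extend(SHARED_FOLDERS)
    # The trailing "/*" keeps "...-256" from also matching "...-256-rfm".
    return [f"{name}/*" for name in folders]


def download_checkpoint(
    checkpoint_folder,
    repo_id="BharathK333/MMFace-DiT-Models",  # assumption: adjust to your repo
    local_dir="MMFace-DiT-Models",            # assumption: any local path works
):
    """Download the selected folders and return the local checkpoint path."""
    # Imported lazily so the helper above can be used without the dependency.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id,
        allow_patterns=build_allow_patterns(checkpoint_folder),
        local_dir=local_dir,
    )
    return Path(local_dir) / checkpoint_folder


if __name__ == "__main__":
    print(download_checkpoint("dit-unified-flux-vae-256-rfm"))
```

Filtering by folder avoids pulling all three checkpoints when only one resolution or paradigm is needed.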

	## Usage & Inference
	Please refer to our [project page](https://vcbsl.github.io/MMFace-DiT/) and [GitHub repository](https://github.com/Bharath-K3/MMFace-DiT) for complete inference scripts, training code, and setup instructions.

	### Example Inference (Flow - Mask Conditioning)
	```bash
	python sample_flow.py \
	--config_path "configs/flow/config_256_unified_rfm.yml" \
	--weights_path "path/to/downloaded/dit-unified-flux-vae-256-rfm/checkpoint-283517/dit_model_weights_ema.safetensors" \
	--modality "mask" \
	--conditioning_path "path/to/mask.png" \
	--prompt "A stunning young woman with long, wavy blonde hair..." \
	--output_dir "Generated_Samples" \
	--num_samples 4 \
	--guidance_scale 7.5
	```
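
When the same command has to be run for several masks or prompts, it can be convenient to assemble it programmatically. The sketch below is a hedged illustration only: it reuses the flags and default values shown in the command above and passes them to `subprocess.run`. The helper name `build_sample_flow_command`, the placeholder paths, and the assumption that `sample_flow.py` sits in the current working directory are all hypothetical. Only the `mask` modality value is taken from this card; other values are not confirmed here.

```python
import shlex
import subprocess


def build_sample_flow_command(
    weights_path,
    conditioning_path,
    prompt,
    config_path="configs/flow/config_256_unified_rfm.yml",
    modality="mask",
    output_dir="Generated_Samples",
    num_samples=4,
    guidance_scale=7.5,
):
    """Return the sample_flow.py invocation as an argument list."""
    # A list (rather than one shell string) avoids quoting problems in prompts.
    return [
        "python", "sample_flow.py",
        "--config_path", config_path,
        "--weights_path", weights_path,
        "--modality", modality,
        "--conditioning_path", conditioning_path,
        "--prompt", prompt,
        "--output_dir", output_dir,
        "--num_samples", str(num_samples),
        "--guidance_scale", str(guidance_scale),
    ]


if __name__ == "__main__":
    # Placeholder paths: replace with real files before running.
    cmd = build_sample_flow_command(
        weights_path="path/to/dit_model_weights_ema.safetensors",
        conditioning_path="path/to/mask.png",
        prompt="A portrait photo of a person",
    )
    print(shlex.join(cmd))           # show the exact command being run
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```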

	## Citation
	If you find this work helpful for your research, please cite our CVPR paper:

	```bibtex
	@article{krishnamurthy2026mmface,
	title={MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation},
	author={Krishnamurthy, Bharath and Rattani, Ajita},
	journal={arXiv preprint arXiv:2603.29029},
	year={2026}
	}
	```