---
license: mit
tags:
- diffusion
- transformers
- image-generation
- face-generation
- cvpr2026
- pytorch
---
# MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

[CVPR 2026](https://cvpr.thecvf.com/)
[arXiv:2603.29029](https://arxiv.org/abs/2603.29029)
[Project Page](https://vcbsl.github.io/MMFace-DiT/)
[GitHub](https://github.com/Bharath-K3/MMFace-DiT)
[Hugging Face Datasets](https://huggingface.co/datasets/BharathK333/MMFace-DiT-Datasets)
[Hugging Face Space](https://huggingface.co/spaces/BharathK333/MMFace-DiT)
[License: MIT](https://opensource.org/licenses/MIT)

**Authors:** Bharath Krishnamurthy and Ajita Rattani

**Affiliation:** University of North Texas, Denton, Texas, USA

_Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)_
## Abstract

Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. However, existing approaches typically append auxiliary control modules or stitch together separate uni-modal networks.

We introduce **MMFace-DiT**, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared **Rotary Position-Embedded (RoPE) Attention** mechanism. Furthermore, a novel **Modality Embedder** enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over five state-of-the-art multimodal face generation models.
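
The shared RoPE attention named above builds on rotary position embeddings, which encode a token's position by rotating pairs of feature dimensions. The snippet below is a generic, standard-library illustration of that rotation and of the relative-position property it gives attention scores. It is not the MMFace-DiT implementation: the function names, the base of 10000, and the one-dimensional positions are assumptions chosen for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("feature dimension must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i uses frequency base^(-i/dim),
        # as in the usual RoPE formulation (assumed here, not taken from the paper).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def attention_score(query, key, q_pos, k_pos):
    """Scaled dot product of a rotated query and a rotated key."""
    q = rope_rotate(query, q_pos)
    k = rope_rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(query))
```

Because both vectors are rotated, `attention_score(q, k, 3, 1)` and `attention_score(q, k, 10, 8)` agree up to floating-point error: only the offset between the two positions affects the score.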
## Repository Contents

This repository contains the trained model checkpoints for MMFace-DiT. Checkpoints are provided for both the diffusion and Rectified Flow Matching (RFM, also called Flow) paradigms, at the resolutions listed below.

* `dit-unified-flux-vae-256`: Diffusion paradigm model for 256x256 resolution using the unified FLUX VAE (checkpoint-440700).
* `dit-unified-flux-vae-256-rfm`: Rectified Flow Matching (RFM) paradigm model for 256x256 resolution (checkpoint-283517).
* `dit-unified-flux-vae-512-rfm`: Rectified Flow Matching (RFM) paradigm model for 512x512 resolution (checkpoint-44070).
* `VAE`: Standalone VAE weights using the compressed 16-channel FLUX latent space.
* `stable-diffusion-2-1-base`: Base Stable Diffusion 2.1 components required by the pipeline (tokenizers, text encoders, schedulers).
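
To fetch only one of the folders listed above rather than the whole repository, a filter along these lines may help. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the names above are top-level directories in `hdkkty/MMFace-DiT-Models`. The helpers `patterns_for` and `download_variant` are hypothetical, as is the choice to always include the `VAE` and `stable-diffusion-2-1-base` folders.

```python
from pathlib import Path

REPO_ID = "hdkkty/MMFace-DiT-Models"

# Folder names copied from the list above; assumed to be top-level directories.
DIT_VARIANTS = (
    "dit-unified-flux-vae-256",
    "dit-unified-flux-vae-256-rfm",
    "dit-unified-flux-vae-512-rfm",
)
SHARED_COMPONENTS = ("VAE", "stable-diffusion-2-1-base")


def patterns_for(variant: str) -> list[str]:
    """Return glob patterns selecting one DiT variant plus the shared components."""
    if variant not in DIT_VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {DIT_VARIANTS}")
    return [f"{name}/*" for name in (variant, *SHARED_COMPONENTS)]


def download_variant(variant: str, local_dir: str = "MMFace-DiT-Models") -> Path:
    """Download the selected folders and return the local directory path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(
        snapshot_download(
            repo_id=REPO_ID,
            allow_patterns=patterns_for(variant),
            local_dir=local_dir,
        )
    )
```

Under those assumptions, `download_variant("dit-unified-flux-vae-256-rfm")` would place that checkpoint folder under `MMFace-DiT-Models/`, alongside the VAE and Stable Diffusion 2.1 components.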
## Usage & Inference

Please refer to the [official project page](https://vcbsl.github.io/MMFace-DiT/) and the [GitHub repository](https://github.com/Bharath-K3/MMFace-DiT) for complete inference scripts, training code, and setup instructions.
### Example Inference (Flow - Mask Conditioning)

```bash
python sample_flow.py \
  --config_path "configs/flow/config_256_unified_rfm.yml" \
  --weights_path "path/to/downloaded/dit-unified-flux-vae-256-rfm/checkpoint-283517/dit_model_weights_ema.safetensors" \
  --modality "mask" \
  --conditioning_path "path/to/mask.png" \
  --prompt "A stunning young woman with long, wavy blonde hair..." \
  --output_dir "Generated_Samples" \
  --num_samples 4 \
  --guidance_scale 7.5
```
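
When sampling is driven from Python, for example to loop over several conditioning images, the command above can be assembled as an argument list. This is a sketch that assumes `sample_flow.py` accepts exactly the flags shown above. The helpers `build_sample_flow_command` and `run_sample_flow` and their defaults are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_sample_flow_command(
    weights_path: str,
    conditioning_path: str,
    prompt: str,
    modality: str = "mask",
    config_path: str = "configs/flow/config_256_unified_rfm.yml",
    output_dir: str = "Generated_Samples",
    num_samples: int = 4,
    guidance_scale: float = 7.5,
) -> list[str]:
    """Build the argv list for sample_flow.py using only the flags shown above."""
    return [
        sys.executable, "sample_flow.py",
        "--config_path", config_path,
        "--weights_path", weights_path,
        "--modality", modality,
        "--conditioning_path", conditioning_path,
        "--prompt", prompt,
        "--output_dir", output_dir,
        "--num_samples", str(num_samples),
        "--guidance_scale", str(guidance_scale),
    ]


def run_sample_flow(command: list[str], repo_dir: str = ".") -> None:
    """Run the command from the directory that contains sample_flow.py."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(command, cwd=repo_dir, check=True)
```

Passing the arguments as a list rather than a shell string keeps prompts with commas or quotes intact without extra escaping.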
## Citation

If you find this work helpful for your research, please cite our CVPR paper:

```bibtex
@article{krishnamurthy2026mmface,
  title={MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation},
  author={Krishnamurthy, Bharath and Rattani, Ajita},
  journal={arXiv preprint arXiv:2603.29029},
  year={2026}
}
```