| --- |
| license: cc-by-nc-sa-4.0 |
| tags: |
| - text-to-image |
| - diffusion |
| - mm-dit |
| - stable-diffusion-3 |
| - face-generation |
| - ffhq |
| - pytorch |
| datasets: |
| - ffhq |
| language: |
| - en |
| --- |
| |
| # 🌟 NovaFace-DiT (512x512) |
|
|
| **NovaFace-DiT** is a Multimodal Diffusion Transformer (MM-DiT) trained from scratch for high-fidelity human face synthesis. It is trained with Rectified Flow Matching, and its architecture is inspired by Stable Diffusion 3. |
|
|
| Although it was trained on a single consumer-grade GPU and a small, curated dataset (the 70,000 images of FFHQ), NovaFace-DiT shows that the custom MM-DiT architecture can be trained efficiently under tight compute and data budgets. |
|
|
| <table style="border: none; background-color: transparent;"> |
| <tr> |
| <td style="border: none; background-color: transparent; padding: 2px;"><img src="https://raw.githubusercontent.com/devbnamdar/MM-DiT-From-Scratch/main/assets/sample4.png" alt="Generated Face 1" /></td> |
| <td style="border: none; background-color: transparent; padding: 2px;"><img src="https://raw.githubusercontent.com/devbnamdar/MM-DiT-From-Scratch/main/assets/sample5.png" alt="Generated Face 2" /></td> |
| <td style="border: none; background-color: transparent; padding: 2px;"><img src="https://raw.githubusercontent.com/devbnamdar/MM-DiT-From-Scratch/main/assets/sample6.png" alt="Generated Face 3" /></td> |
| <td style="border: none; background-color: transparent; padding: 2px;"><img src="https://raw.githubusercontent.com/devbnamdar/MM-DiT-From-Scratch/main/assets/sample7.png" alt="Generated Face 4" /></td> |
| </tr> |
| </table> |
| <br> |
| <div align="center"> |
| <em>High-fidelity samples generated by NovaFace-DiT using complex text prompts.</em> |
| </div> |
| |
| ## 📊 Model Details |
|
|
| - **Model Type:** Text-to-Image Diffusion Transformer (MM-DiT) |
| - **Parameters:** ~260 million |
| - **Text Encoder:** T5-Base (768-dim) |
| - **Latent Space:** Custom 8-channel VAE (f8) |
| - **Training Dataset:** [FFHQ (Flickr-Faces-HQ)](https://github.com/NVlabs/ffhq-dataset) |
| - **Resolution:** 512x512 |
| - **License:** Creative Commons BY-NC-SA 4.0 (Non-commercial) |
|
|
| ## ⚡ Requirements & Custom VAE |
|
|
| NovaFace-DiT operates in an 8-channel latent space and **requires** our custom-trained VAE to decode latents into images. The standard SDXL (4-channel) and SD3 (16-channel) VAEs are not compatible. |
|
|
| 👉 **[Download the Custom 8-Channel VAE here](https://huggingface.co/devbnamdar/Custom-VAE-8ch-f8)** *(required to generate images)* |
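
To make the latent-space requirement concrete, the short sketch below works out the latent tensor shape for a 512x512 image. It assumes that "f8" means an 8x spatial downsampling factor (the usual reading of that notation). The helper names `latent_shape` and `decoder_accepts` are hypothetical and are not part of the repository.

```python
def latent_shape(height=512, width=512, channels=8, factor=8):
    """Return the (channels, height, width) of the latent for a given image size.

    Assumes the VAE downsamples each spatial dimension by `factor`.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


def decoder_accepts(latent, vae_latent_channels):
    """A VAE decoder can only consume latents with its own channel count."""
    return latent[0] == vae_latent_channels


shape = latent_shape()
print(shape)                       # (8, 64, 64)
print(decoder_accepts(shape, 8))   # True: the custom 8-channel VAE
print(decoder_accepts(shape, 4))   # False: a 4-channel VAE such as SDXL's
print(decoder_accepts(shape, 16))  # False: a 16-channel VAE such as SD3's
```

Under these assumptions the model denoises an 8x64x64 tensor, which is why a decoder built for a different channel count cannot be swapped in.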
|
|
| ## 🚀 How to Use (Code & UI) |
|
|
| This repository contains **only the model weights (`.safetensors`)**. To generate images, inspect the architecture, or resume training, use our official GitHub repository, which contains a full Gradio UI and the training pipeline. |
|
|
| 🔗 **Official GitHub Repository:** [devbnamdar/MM-DiT-From-Scratch](https://github.com/devbnamdar/MM-DiT-From-Scratch) |
|
|
| **Quick Setup:** |
| 1. Clone the GitHub repository. |
| 2. Download `NovaFace-DiT.safetensors` from this Hugging Face page and place it in your local `checkpoints/` directory. |
| 3. Download the Custom VAE from [its separate repository](https://huggingface.co/devbnamdar/Custom-VAE-8ch-f8) and place it in your local `vae_models/` directory. |
| 4. Launch the Gradio app: |
| ```bash |
| python gradio_ui/app.py |
| ``` |
| 5. In the Gradio UI, go to the **"⚙️ Settings"** tab, enter the path to your downloaded model (e.g., `checkpoints/NovaFace-DiT.safetensors`) in the **"Base Model Path"** field, and click **"Load Models to GPU"**. |
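
Before launching the app, the file layout from the steps above can be sanity-checked with a small script. This is a minimal sketch, not part of the repository: `check_setup` is a hypothetical helper, it assumes it is run from the root of the cloned repository, and because the VAE file name is not stated here it only checks that `vae_models/` contains at least one file.

```python
from pathlib import Path


def check_setup(repo_root="."):
    """Return a list of problems; an empty list means the layout looks ready."""
    root = Path(repo_root)
    problems = []

    # Step 2: model weights in checkpoints/
    model = root / "checkpoints" / "NovaFace-DiT.safetensors"
    if not model.is_file():
        problems.append(f"missing model weights: {model}")

    # Step 3: custom VAE in vae_models/ (file name unknown, so accept any file)
    vae_dir = root / "vae_models"
    if not vae_dir.is_dir() or not any(p.is_file() for p in vae_dir.iterdir()):
        problems.append(f"no VAE files found in: {vae_dir}")

    # Step 4: Gradio entry point shipped with the GitHub repository
    app = root / "gradio_ui" / "app.py"
    if not app.is_file():
        problems.append(f"Gradio entry point not found: {app}")

    return problems


if __name__ == "__main__":
    issues = check_setup()
    print("\n".join(issues) if issues else "Setup looks complete.")
```

If the script reports no problems, `python gradio_ui/app.py` should at least find the files it needs; it does not validate the contents of the weights.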
|
|
| ## ⚠️ Limitations and Bias |
|
|
| - **Domain Specific:** This model was trained exclusively on the FFHQ dataset. It specializes in human portraits (shoulders and above) and is not designed to generate landscapes, animals, or full-body shots. |
| - **Text Rendering:** The model does not generate legible text or complex typography. |
| - **Bias:** Because the model was trained on FFHQ, it may inherit the demographic and lighting biases present in that dataset. |
|
|
| ## 📄 Citation |
|
|
| If you use this model or the accompanying codebase in your research or projects, please cite: |
|
|
| ```bibtex |
| @misc{namdar2026mmdit, |
| author = {Namdar, Bunyamin}, |
| title = {MM-DiT From Scratch: High-Fidelity Diffusion Training on Limited Dataset}, |
| year = {2026}, |
| publisher = {GitHub}, |
| url = {https://github.com/devbnamdar/MM-DiT-From-Scratch} |
| } |
| ``` |
|
|