---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---

# MMaDA-8B-Base

<div align="center">
<br>
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/title.png" width="166">
<h3>Multimodal Large Diffusion Language Models (NeurIPS 2025)</h3>
</div>

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
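For intuition, the shared probabilistic formulation can be written as a single mask-prediction objective over token sequences from any modality. The form below is an illustrative sketch in the style of masked diffusion language models, not necessarily the paper's exact loss:

```latex
% Illustrative masked-diffusion objective: x_0 is a clean sequence of
% (text and/or image) tokens of length L, t ~ U(0,1], and x_t masks
% each token of x_0 independently with probability t.
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
      \frac{1}{t}\sum_{i=1}^{L}
      \mathbf{1}\big[x_t^{i} = \texttt{[MASK]}\big]\,
      \log p_\theta\big(x_0^{i}\mid x_t\big)
    \right]
```

Because the loss only asks the model to predict masked tokens given a partially masked sequence, the same network and objective apply whether the tokens encode text, an image, or a mix of both.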

[Paper](https://huggingface.co/papers/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Project Page / Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)

<div align="center" style="width: 600px; margin: auto;">
  <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/showcase0.8.gif" alt="MMaDA decoding demo" width="550" />
  <p style="font-style: italic; font-size: 14px; color: #555; margin-top: 6px;">
    MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.<br>
    The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
  </p>
</div>

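The semi-autoregressive sampling used for text generation can be sketched in plain Python: the response is decoded block by block from left to right, and within each block masked positions are filled in parallel over a few denoising steps, most confident first. The predictor below is a toy stand-in for the real model, and all names are illustrative:

```python
MASK = "<mask>"

def toy_predictor(tokens):
    # Hypothetical stand-in for the diffusion model: for every masked
    # position, propose a token together with a confidence score.
    return {i: (f"tok{i}", 1.0 / (i + 1))
            for i, t in enumerate(tokens) if t == MASK}

def semi_autoregressive_decode(prompt, gen_len=8, block_size=4, steps_per_block=4):
    """Decode left to right in blocks; inside each block, unmask the
    most confident half of the remaining masked slots at each step."""
    tokens = list(prompt) + [MASK] * gen_len
    start = len(prompt)
    for b_start in range(start, start + gen_len, block_size):
        b_end = min(b_start + block_size, start + gen_len)
        for _ in range(steps_per_block):
            proposals = {i: p for i, p in toy_predictor(tokens).items()
                         if b_start <= i < b_end}
            if not proposals:
                break  # every slot in this block is already filled
            k = max(1, len(proposals) // 2)
            for i in sorted(proposals, key=lambda j: -proposals[j][1])[:k]:
                tokens[i] = proposals[i][0]
    return tokens

out = semi_autoregressive_decode(["Q:", "2+2?"], gen_len=4, block_size=2)
print(out)  # prompt preserved, masked slots filled block by block
```

In the real model, the confidence would come from the denoiser's token probabilities, and the block size and step count trade generation quality for speed.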
## 📰 Latest Updates

* **[2025-09-09]** We open-source [dLLM-RL](https://github.com/Gen-Verse/dLLM-RL), a comprehensive RL framework for diffusion language models that also supports post-training our MMaDA models.
* **[2025-06-02]** We open-source **MMaDA-8B-MixCoT** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT).
* **[2025-05-24]** We add support for MPS inference, tested on an M4 chip.
* **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
* **[2025-05-22]** We open-source **MMaDA-8B-Base** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-Base). **MMaDA-8B-MixCoT** and **MMaDA-8B-Max** will be released in the near future.
* **[2025-05-22]** We release our [research paper](https://huggingface.co/papers/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.

## 🧬 MMaDA Series Overview

MMaDA includes a series of checkpoints reflecting different training stages:
1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and **thinking abilities**.
2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal, and image-generation reasoning.
3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.

<div align="center">
  <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/example_compare.png" width="800">
  <p><i>Overview of MMaDA's capabilities.</i></p>
</div>

## ⚙️ Quick Start

First, clone the [official GitHub repository](https://github.com/Gen-Verse/MMaDA) and install the required packages:
```bash
git clone https://github.com/Gen-Verse/MMaDA.git
cd MMaDA
pip install -r requirements.txt
```
Then launch a local Gradio demo:
```bash
python app.py
```
Or try it online via our [Hugging Face demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).

## 🚀 Inference

For batch-level inference, we provide inference scripts in the [official GitHub repository](https://github.com/Gen-Verse/MMaDA).

Before running the multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```bash
wandb login
```

### 1. Text Generation

For text generation, we follow LLaDA's configuration and generation script. Simply run:
```bash
python generate.py
```

### 2. Multimodal Generation

An inference demo for multimodal generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=512,
)

# The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the model.
q1 = "Describe the image in detail. <|image|>"

images = [Image.open("path/to/your/image.png")]  # Replace with your image path
qas = [[q1, None]]

# `len(images)` should equal the number of occurrences of "<|image|>" in `qas`.
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1 = generated[0]
print(f"Generated text response: {a1}")
# generated[1], the list of newly generated images, should typically be empty in this case.
```

### 3. Text-to-Image Generation

An inference demo for text-to-image generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = (
    "Generate an image of 768x768 according to the following prompt: "
    "Image of a dog playing in water, with a waterfall in the background."
)

# generated: tuple of (generated text response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image (a PIL Image object)
# print(f"Generated text response: {a1}")
```

## Citation

```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```