---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---
MMaDA-8B-Base
Multimodal Large Diffusion Language Models (NeurIPS 2025)
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:
- MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
- MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
- MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
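UniGRPO itself is specified in the paper; as a rough illustration of the family it builds on, group-relative policy optimization methods score a group of sampled responses to the same prompt and normalize each reward against the group, avoiding a learned value function. A minimal sketch of that advantage computation (function name and values are our own, not from the MMaDA codebase):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    GRPO-style methods use this normalized reward as the advantage
    for every token of the corresponding sampled response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 sampled responses to one prompt (e.g. from
# a text verifier or an image-reward model, in the spirit of MMaDA's
# diversified reward modeling).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # zero-mean; positive for above-average responses
```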
Paper | Code | Project Page / Demo
MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
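As a toy illustration of the idea (not MMaDA's actual sampler): a masked diffusion language model starts from all-mask tokens and, at each denoising step, commits the most confident predictions in parallel; the semi-autoregressive variant simply applies this loop block by block, left to right. Here the "model" is a random stand-in, and all names are our own:

```python
import random

MASK = "<mask>"

def toy_predict(seq, i):
    """Stand-in for the denoiser: propose a token and a confidence for
    position i. A real model would condition on the whole sequence."""
    return f"tok{i}", random.random()

def denoise_block(seq, lo, hi, tokens_per_step=2):
    """Non-autoregressive denoising of seq[lo:hi]: each step predicts all
    masked positions in parallel and keeps only the most confident ones."""
    while any(t == MASK for t in seq[lo:hi]):
        proposals = [(i, *toy_predict(seq, i))
                     for i in range(lo, hi) if seq[i] == MASK]
        proposals.sort(key=lambda p: p[2], reverse=True)  # most confident first
        for i, tok, _ in proposals[:tokens_per_step]:
            seq[i] = tok
    return seq

def semi_autoregressive_decode(length=8, block=4):
    """Decode blocks left to right; within each block, denoise in parallel."""
    seq = [MASK] * length
    for lo in range(0, length, block):
        denoise_block(seq, lo, min(lo + block, length))
    return seq

print(semi_autoregressive_decode())
```

Setting `block=length` recovers the fully non-autoregressive case used for multimodal generation.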
📰 Latest Updates
- [2025-09-09] We open-source dLLM-RL, a comprehensive RL framework for diffusion language models, which also supports post-training our MMaDA models.
- [2025-06-02] We open-source MMaDA-8B-MixCoT on Hugging Face.
- [2025-05-24] We add support for MPS inference, tested on the M4.
- [2025-05-22] We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
- [2025-05-22] We open-source MMaDA-8B-Base on Hugging Face. MMaDA-8B-MixCoT and MMaDA-8B-Max will be released in the near future.
- [2025-05-22] We release the research paper and demo for MMaDA, the first unified multimodal diffusion model.
🧬 MMaDA Series Overview
MMaDA includes a series of checkpoints reflecting different training stages:
- MMaDA-8B-Base: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and simple reasoning.
- MMaDA-8B-MixCoT: After mixed long chain-of-thought (CoT) fine-tuning. Capable of complex textual, multimodal, and image-generation reasoning.
- MMaDA-8B-Max (coming soon): After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.
Overview of MMaDA's capabilities.
⚙️ Quick Start
First, set up the environment by installing the required packages from the official GitHub repository:
```shell
pip install -r requirements.txt
```
Then, you can launch a local Gradio demo:
```shell
python app.py
```
Or try it online via our Huggingface Demo.
🚀 Inference
For batch-level inference, we provide inference scripts on the official GitHub repository.
Before running multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```shell
wandb login
```
1. Text Generation
For text generation, we follow LLaDA's configuration and generation script. Simply run:

```shell
python generate.py
```
2. Multimodal Generation
Below is an inference demo for multimodal generation; the results can be viewed on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=512,
)

# The "<|image|>" placeholder is replaced with a sequence of image tokens
# before being fed to the LLM.
q1 = "Describe the image in detail. <|image|>"
images = [Image.open("path/to/your/image.png")]  # Replace with your image path
qas = [[q1, None]]

# `len(images)` must equal the number of occurrences of "<|image|>" in qas.
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
a1 = generated[0]
print(f"Generated text response: {a1}")
# generated[1] (the list of newly generated images) is typically empty here.
```
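The `create_logits_processor(cfg=4.0, image_top_k=2000)` call above combines classifier-free guidance with top-k filtering over image tokens. A simplified sketch of those two operations on raw logits (our own illustration, not the solver's actual implementation):

```python
import math

def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: push the conditional logits away from
    the unconditional ones by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so sampling can
    never pick them."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else -math.inf for l in logits]

cond = [2.0, 0.5, -1.0, 0.2]
uncond = [1.0, 0.8, -0.5, 0.1]
guided = cfg_logits(cond, uncond, scale=4.0)  # roughly [5.0, -0.4, -2.5, 0.5]
filtered = top_k_filter(guided, k=2)          # only the top two survive
```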
3. Text-to-Image Generation
Below is an inference demo for text-to-image generation; the results can be viewed on wandb:
```python
from inference_solver import FlexARInferenceSolver

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = ("Generate an image of 768x768 according to the following prompt:\n"
      "Image of a dog playing in water, with a waterfall in the background.")

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image (a PIL Image object)
# print(f"Generated text response: {a1}")
```
Citation
```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```