---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---
MMaDA-8B-Base
Multimodal Large Diffusion Language Models (NeurIPS 2025)
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:
- MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
- MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
- MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
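UniGRPO itself is specified in the paper; as a rough illustration of the family it builds on, group-relative policy optimization methods score a group of sampled responses to the same prompt and normalize each reward against the group, avoiding a learned value function. A minimal sketch of that advantage computation (function name and values are our own, not from the MMaDA codebase):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    GRPO-style methods use this normalized reward as the advantage
    for every token of the corresponding sampled response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 sampled responses to one prompt (e.g. from
# a text verifier or an image-reward model, in the spirit of MMaDA's
# diversified reward modeling).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # zero-mean; positive for above-average responses
```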
Paper | Code | Project Page / Demo
MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
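As a toy illustration of the idea (not MMaDA's actual sampler): a masked diffusion language model starts from all-mask tokens and, at each denoising step, commits the most confident predictions in parallel; the semi-autoregressive variant simply applies this loop block by block, left to right. Here the "model" is a random stand-in, and all names are our own:

```python
import random

MASK = "<mask>"

def toy_predict(seq, i):
    """Stand-in for the denoiser: propose a token and a confidence for
    position i. A real model would condition on the whole sequence."""
    return f"tok{i}", random.random()

def denoise_block(seq, lo, hi, tokens_per_step=2):
    """Non-autoregressive denoising of seq[lo:hi]: each step predicts all
    masked positions in parallel and keeps only the most confident ones."""
    while any(t == MASK for t in seq[lo:hi]):
        proposals = [(i, *toy_predict(seq, i))
                     for i in range(lo, hi) if seq[i] == MASK]
        proposals.sort(key=lambda p: p[2], reverse=True)  # most confident first
        for i, tok, _ in proposals[:tokens_per_step]:
            seq[i] = tok
    return seq

def semi_autoregressive_decode(length=8, block=4):
    """Decode blocks left to right; within each block, denoise in parallel."""
    seq = [MASK] * length
    for lo in range(0, length, block):
        denoise_block(seq, lo, min(lo + block, length))
    return seq

print(semi_autoregressive_decode())
```

Setting `block=length` recovers the fully non-autoregressive case used for multimodal generation.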
📰 Latest Updates
- [2025-09-09] We open-source dLLM-RL, a comprehensive RL framework for diffusion language models, which also supports post-training our MMaDA models.
- [2025-06-02] We open-source MMaDA-8B-MixCoT on Hugging Face.
- [2025-05-24] We add support for MPS inference, tested on the M4.
- [2025-05-22] We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
- [2025-05-22] We open-source MMaDA-8B-Base on Hugging Face. MMaDA-8B-MixCoT and MMaDA-8B-Max will be released in the near future.
- [2025-05-22] We release the research paper and demo for MMaDA, the first unified multimodal diffusion model.
🧬 MMaDA Series Overview
MMaDA includes a series of checkpoints reflecting different training stages:
- MMaDA-8B-Base: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and simple reasoning.
- MMaDA-8B-MixCoT: After mixed long chain-of-thought (CoT) fine-tuning. Capable of complex textual, multimodal, and image-generation reasoning.
- MMaDA-8B-Max (coming soon): After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.
Overview of MMaDA's capabilities.
⚙️ Quick Start
First, set up the environment by installing the required packages from the official GitHub repository:
```shell
pip install -r requirements.txt
```
Then, you can launch a local Gradio demo:
```shell
python app.py
```
Or try it online via our Huggingface Demo.
🚀 Inference
For batch-level inference, we provide inference scripts on the official GitHub repository.
Before running multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```shell
wandb login
```
1. Text Generation
For text generation, we follow LLaDA's configuration and generation script. Simply run:

```shell
python generate.py
```
2. Multimodal Generation
Below is an inference demo for multimodal generation; the results can be viewed on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=512,
)

# The "<|image|>" placeholder is replaced with a sequence of image tokens
# before being fed to the LLM.
q1 = "Describe the image in detail. <|image|>"
images = [Image.open("path/to/your/image.png")]  # Replace with your image path
qas = [[q1, None]]

# `len(images)` must equal the number of occurrences of "<|image|>" in qas.
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
a1 = generated[0]
print(f"Generated text response: {a1}")
# generated[1] (the list of newly generated images) is typically empty here.
```
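The `create_logits_processor(cfg=4.0, image_top_k=2000)` call above combines classifier-free guidance with top-k filtering over image tokens. A simplified sketch of those two operations on raw logits (our own illustration, not the solver's actual implementation):

```python
import math

def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: push the conditional logits away from
    the unconditional ones by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so sampling can
    never pick them."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else -math.inf for l in logits]

cond = [2.0, 0.5, -1.0, 0.2]
uncond = [1.0, 0.8, -0.5, 0.1]
guided = cfg_logits(cond, uncond, scale=4.0)  # roughly [5.0, -0.4, -2.5, 0.5]
filtered = top_k_filter(guided, k=2)          # only the top two survive
```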
3. Text-to-Image Generation
Below is an inference demo for text-to-image generation; the results can be viewed on wandb:
```python
from inference_solver import FlexARInferenceSolver

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = ("Generate an image of 768x768 according to the following prompt:\n"
      "Image of a dog playing in water, with a waterfall in the background.")

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image (a PIL Image object)
# print(f"Generated text response: {a1}")
```
Citation
```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```