---
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- diffusion-model
- multimodal
- text-to-image
- text-generation
- image-captioning
- generalist-llm
language: en
---

# MMaDA-8B-Base

<div align="center">
<br>
<img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/title.png" width="166">
<h3>Multimodal Large Diffusion Language Models (NeurIPS 2025)</h3>
</div>

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
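For intuition, the shared probabilistic formulation can be written as a single mask-prediction objective over token sequences from any modality. The form below is an illustrative sketch in the style of masked diffusion language models, not necessarily the paper's exact loss:

```latex
% Illustrative masked-diffusion objective: x_0 is a clean sequence of
% (text and/or image) tokens of length L, t ~ U(0,1], and x_t masks
% each token of x_0 independently with probability t.
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
      \frac{1}{t}\sum_{i=1}^{L}
      \mathbf{1}\big[x_t^{i} = \texttt{[MASK]}\big]\,
      \log p_\theta\big(x_0^{i}\mid x_t\big)
    \right]
```

Because the loss only asks the model to predict masked tokens given a partially masked sequence, the same network and objective apply whether the tokens encode text, an image, or a mix of both.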

[Paper](https://huggingface.co/papers/2505.15809) | [Code](https://github.com/Gen-Verse/MMaDA) | [Project Page / Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA)

<div align="center" style="width: 600px; margin: auto;">
  <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/showcase0.8.gif" alt="MMaDA decoding demo" width="550" />
  <p style="font-style: italic; font-size: 14px; color: #555; margin-top: 6px;">
    MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.<br>
    The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
  </p>
</div>

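The semi-autoregressive sampling used for text generation can be sketched in plain Python: the response is decoded block by block from left to right, and within each block masked positions are filled in parallel over a few denoising steps, most confident first. The predictor below is a toy stand-in for the real model, and all names are illustrative:

```python
MASK = "<mask>"

def toy_predictor(tokens):
    # Hypothetical stand-in for the diffusion model: for every masked
    # position, propose a token together with a confidence score.
    return {i: (f"tok{i}", 1.0 / (i + 1))
            for i, t in enumerate(tokens) if t == MASK}

def semi_autoregressive_decode(prompt, gen_len=8, block_size=4, steps_per_block=4):
    """Decode left to right in blocks; inside each block, unmask the
    most confident half of the remaining masked slots at each step."""
    tokens = list(prompt) + [MASK] * gen_len
    start = len(prompt)
    for b_start in range(start, start + gen_len, block_size):
        b_end = min(b_start + block_size, start + gen_len)
        for _ in range(steps_per_block):
            proposals = {i: p for i, p in toy_predictor(tokens).items()
                         if b_start <= i < b_end}
            if not proposals:
                break  # every slot in this block is already filled
            k = max(1, len(proposals) // 2)
            for i in sorted(proposals, key=lambda j: -proposals[j][1])[:k]:
                tokens[i] = proposals[i][0]
    return tokens

out = semi_autoregressive_decode(["Q:", "2+2?"], gen_len=4, block_size=2)
print(out)  # prompt preserved, masked slots filled block by block
```

In the real model, the confidence would come from the denoiser's token probabilities, and the block size and step count trade generation quality for speed.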
## 📰 Latest Updates

* **[2025-09-09]** We open-source [dLLM-RL](https://github.com/Gen-Verse/dLLM-RL), a comprehensive RL framework for diffusion language models that also supports post-training our MMaDA models.
* **[2025-06-02]** We open-source **MMaDA-8B-MixCoT** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT).
* **[2025-05-24]** We add support for MPS inference, tested on an M4 chip.
* **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
* **[2025-05-22]** We open-source **MMaDA-8B-Base** on [Hugging Face](https://huggingface.co/Gen-Verse/MMaDA-8B-Base). **MMaDA-8B-MixCoT** and **MMaDA-8B-Max** will be released in the near future.
* **[2025-05-22]** We release our [research paper](https://huggingface.co/papers/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.

## 🧬 MMaDA Series Overview

MMaDA includes a series of checkpoints reflecting different training stages:
1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and **thinking abilities**.
2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal, and image-generation reasoning.
3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation.

<div align="center">
  <img src="https://github.com/Gen-Verse/MMaDA/raw/main/assets/example_compare.png" width="800">
  <p><i>Overview of MMaDA's capabilities.</i></p>
</div>

## ⚙️ Quick Start

First, clone the [official GitHub repository](https://github.com/Gen-Verse/MMaDA) and install the required packages:
```bash
git clone https://github.com/Gen-Verse/MMaDA.git
cd MMaDA
pip install -r requirements.txt
```
Then launch a local Gradio demo:
```bash
python app.py
```
Or try it online via our [Hugging Face demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).

## 🚀 Inference

For batch-level inference, we provide inference scripts in the [official GitHub repository](https://github.com/Gen-Verse/MMaDA).

Before running the multimodal or text-to-image generation examples, you may need to log in to your Weights & Biases (wandb) account:
```bash
wandb login
```

### 1. Text Generation

For text generation, we follow LLaDA's configuration and generation script. Simply run:
```bash
python generate.py
```

### 2. Multimodal Generation

An inference demo for multimodal generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=512,
)

# The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the model.
q1 = "Describe the image in detail. <|image|>"

images = [Image.open("path/to/your/image.png")]  # Replace with your image path
qas = [[q1, None]]

# `len(images)` should equal the number of occurrences of "<|image|>" in `qas`.
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1 = generated[0]
print(f"Generated text response: {a1}")
# generated[1], the list of newly generated images, should typically be empty in this case.
```

### 3. Text-to-Image Generation

An inference demo for text-to-image generation; you can view the results on wandb:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",  # Replace with "Gen-Verse/MMaDA-8B-Base" for this model
    precision="bf16",
    target_size=768,
)

q1 = (
    "Generate an image of 768x768 according to the following prompt: "
    "Image of a dog playing in water, with a waterfall in the background."
)

# generated: tuple of (generated text response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]
new_image.show()  # Display the generated image (a PIL Image object)
# print(f"Generated text response: {a1}")
```

## Citation

```bibtex
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```