LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

FP8 Quantized Version of LLaDA2.0-Uni

[📑 Technical Report]   [🌐 Github]

AGI Research Center, Inclusion AI

Overview

This is the FP8 quantized version of LLaDA2.0-Uni, with block-wise FP8 quantization applied to the MoE expert weights. It reduces GPU memory for model loading by ~48% while preserving output quality.

Quantization Details

  • Method: Block-wise FP8 (float8_e4m3fn) with per-block scale factors
  • Block size: 128×128
  • Quantized layers: MoE routed expert weights (gate_proj, up_proj, down_proj)
  • Kept in BF16: Embeddings, lm_head, attention projections, shared experts, layer norms, routing gates
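The block-wise scheme above can be sketched as follows. This is a minimal NumPy model of the scaling logic only: real kernels cast each scaled tile to float8_e4m3fn, which this sketch approximates by clipping values to the FP8 representable range. Function names and shapes here are illustrative, not the repository's actual implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128           # one scale factor per 128x128 tile

def blockwise_fp8_quantize(w):
    """Sketch: compute one scale per 128x128 block so each scaled tile
    fits the FP8 dynamic range. Real kernels would cast to float8_e4m3fn;
    here we only scale and clip."""
    rows, cols = w.shape
    scales = np.zeros((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            tile = w[i:i + BLOCK, j:j + BLOCK]
            scale = np.abs(tile).max() / FP8_E4M3_MAX
            scales[i // BLOCK, j // BLOCK] = scale
            q[i:i + BLOCK, j:j + BLOCK] = np.clip(
                tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX
            )
    return q, scales

def blockwise_dequantize(q, scales):
    # broadcast each per-block scale back over its 128x128 tile
    return q * np.kron(scales, np.ones((BLOCK, BLOCK), dtype=np.float32))
```

Per-block (rather than per-tensor) scales keep outlier weights in one tile from crushing the precision of every other tile, which is why this scheme preserves quality well on MoE expert matrices.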

Memory Comparison

Variant | Model Loading | T2I Peak | Understanding Peak | Edit Peak
--------|---------------|----------|--------------------|----------
BF16    | 62.9 GB       | 35.3 GB  | 33.2 GB            | 41.7 GB
FP8     | 32.5 GB       | 35.3 GB  | 33.3 GB            | 41.8 GB

Note: FP8 halves the static model weight memory (~30 GB saved at load time). Peak inference memory is similar because activations dominate during generation.
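The savings figures follow directly from the table; a quick sanity check (plain arithmetic on the reported numbers, not an API call):

```python
bf16_load, fp8_load = 62.9, 32.5  # GB, model-loading column above
saved = bf16_load - fp8_load
print(f"{saved:.1f} GB saved ({saved / bf16_load:.0%})")
# prints: 30.4 GB saved (48%)
```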

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode VQ tokens to image
from decoder import decode_vq_tokens
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")

Model Capabilities

Same as the base LLaDA2.0-Uni model:

  • 🖼️ Text-to-Image Generation
  • 🔍 Image Understanding
  • ✏️ Image Editing
  • ⚡ Sprint Acceleration

⚠️ License

This project is licensed under the terms of the Apache License 2.0.

📖 BibTeX

@article{LLaDA2Uni,
  title   = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
  author  = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
  journal = {arXiv preprint arXiv:2604.20796},
  year    = {2026}
}