LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

FP8 Quantized Version of LLaDA2.0-Uni

[📑 Technical Report]   [🌐 Github]

AGI Research Center, Inclusion AI

Overview

This is the FP8 quantized version of LLaDA2.0-Uni, with block-wise FP8 quantization applied to the MoE expert weights. It reduces GPU memory for model loading by ~48% while preserving output quality.

Quantization Details

  • Method: Block-wise FP8 (float8_e4m3fn) with per-block scale factors
  • Block size: 128×128
  • Quantized layers: MoE routed expert weights (gate_proj, up_proj, down_proj)
  • Kept in BF16: Embeddings, lm_head, attention projections, shared experts, layer norms, routing gates
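The block-wise scheme above can be sketched as follows. This is a minimal NumPy model of the scaling logic only: real kernels cast each scaled tile to float8_e4m3fn, which this sketch approximates by clipping values to the FP8 representable range. Function names and shapes here are illustrative, not the repository's actual implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 128           # one scale factor per 128x128 tile

def blockwise_fp8_quantize(w):
    """Sketch: compute one scale per 128x128 block so each scaled tile
    fits the FP8 dynamic range. Real kernels would cast to float8_e4m3fn;
    here we only scale and clip."""
    rows, cols = w.shape
    scales = np.zeros((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            tile = w[i:i + BLOCK, j:j + BLOCK]
            scale = np.abs(tile).max() / FP8_E4M3_MAX
            scales[i // BLOCK, j // BLOCK] = scale
            q[i:i + BLOCK, j:j + BLOCK] = np.clip(
                tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX
            )
    return q, scales

def blockwise_dequantize(q, scales):
    # broadcast each per-block scale back over its 128x128 tile
    return q * np.kron(scales, np.ones((BLOCK, BLOCK), dtype=np.float32))
```

Per-block (rather than per-tensor) scales keep outlier weights in one tile from crushing the precision of every other tile, which is why this scheme preserves quality well on MoE expert matrices.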

Memory Comparison

Variant | Model Loading | T2I Peak | Understanding Peak | Edit Peak
--------|---------------|----------|--------------------|----------
BF16    | 62.9 GB       | 35.3 GB  | 33.2 GB            | 41.7 GB
FP8     | 32.5 GB       | 35.3 GB  | 33.3 GB            | 41.8 GB

Note: FP8 halves the static model weight memory (~30 GB saved at load time). Peak inference memory is similar because activations dominate during generation.
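The savings figures follow directly from the table; a quick sanity check (plain arithmetic on the reported numbers, not an API call):

```python
bf16_load, fp8_load = 62.9, 32.5  # GB, model-loading column above
saved = bf16_load - fp8_load
print(f"{saved:.1f} GB saved ({saved / bf16_load:.0%})")
# prints: 30.4 GB saved (48%)
```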

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode VQ tokens to image
from decoder import decode_vq_tokens
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")

Model Capabilities

Same as the base LLaDA2.0-Uni model:

  • 🖼️ Text-to-Image Generation
  • 🔍 Image Understanding
  • ✏️ Image Editing
  • ⚡ Sprint Acceleration

⚠️ License

This project is licensed under the terms of the Apache License 2.0.

📖 BibTeX

@article{LLaDA2Uni,
  title   = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
  author  = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
  journal = {arXiv preprint arXiv:2604.20796},
  year    = {2026}
}