[📑 Technical Report] [🌐 Github]
AGI Research Center, Inclusion AI
This is the FP8-quantized version of LLaDA2.0-Uni: the MoE expert weights are quantized block-wise to FP8, which reduces the GPU memory needed to load the model by ~48% while preserving output quality.
| Variant | Model loading | Text-to-image peak | Understanding peak | Editing peak |
|---|---|---|---|---|
| BF16 | 62.9 GB | 35.3 GB | 33.2 GB | 41.7 GB |
| FP8 | 32.5 GB | 35.3 GB | 33.3 GB | 41.8 GB |
Note: FP8 halves the static model weight memory (~30 GB saved at load time). Peak inference memory is similar because activations dominate during generation.
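
For intuition, below is a minimal PyTorch sketch of block-wise weight quantization to FP8 with one scale per block. The 128x128 block size, the E4M3 format, and the helper names are illustrative assumptions, not the exact recipe used to produce this checkpoint.

```python
# Toy sketch of block-wise FP8 quantization (NOT the exact recipe used for
# this checkpoint): E4M3 format and 128x128 blocks are assumed for illustration.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight into FP8 tiles with one scale per (block, block) tile."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of tiles: (rows/block, block, cols/block, block).
    tiles = w.reshape(rows // block, block, cols // block, block)
    # Per-tile scale so each tile's absolute max maps onto FP8's dynamic range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    return q, scale[:, 0, :, 0]  # scales have shape (rows/block, cols/block)

def dequantize_blockwise_fp8(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    tiles = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (tiles * scale[:, None, :, None]).reshape(rows, cols)

w = torch.randn(256, 512)
q, s = quantize_blockwise_fp8(w)
print((w - dequantize_blockwise_fp8(q, s)).abs().mean())  # small reconstruction error
```

Per-block scales keep a few outlier weights from inflating the quantization error of the whole matrix, which is why block-wise schemes tend to preserve quality better than a single per-tensor scale.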
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
# The custom generation code reads the tokenizer from the model object.
model.tokenizer = tokenizer

# Text-to-image generation: `steps` is the number of diffusion steps,
# `cfg_scale` the classifier-free guidance strength.
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode the generated VQ tokens back into an RGB image.
from decoder import decode_vq_tokens

image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")
```
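
As with most diffusion samplers, increasing `steps` generally trades latency for fidelity, and `cfg_scale` sets the classifier-free guidance strength, i.e. how strongly the generated image follows the prompt.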
Licensing is the same as for the base LLaDA2.0-Uni model: this project is licensed under the terms of the Apache License 2.0.
```bibtex
@article{LLaDA2Uni,
  title   = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
  author  = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
  journal = {arXiv preprint arXiv:2604.20796},
  year    = {2026}
}
```