Abstract
Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms.
Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise groups conditioned on clean prefixes, with randomized noise schedulers that better match MultiBD inference states. To make MultiBD practical to run, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
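To make the relationship between the running set and TPF concrete, the following is a toy simulation, not the decoding algorithm from the paper. It assumes a hypothetical fixed number of tokens unmasked per forward pass for each slot position (`slot_rates`), admits consecutive blocks into a bounded running set, and commits blocks strictly left to right. The function name and all parameters are illustrative assumptions.

```python
def simulate_tpf(num_blocks, block_size, slot_rates):
    """Toy estimate of Tokens Per Forward pass (TPF).

    slot_rates[i] is a hypothetical number of tokens unmasked per pass for
    the block in slot i of the running set; len(slot_rates) is the bound on
    the running-set size. slot_rates[0] must be positive so the front block
    always makes progress.
    """
    remaining = [block_size] * num_blocks  # masked tokens left per block
    running = []                           # block indices being decoded
    next_block = 0                         # next block to admit
    passes = 0

    while running or next_block < num_blocks:
        # Admit consecutive blocks until the running set is full.
        while len(running) < len(slot_rates) and next_block < num_blocks:
            running.append(next_block)
            next_block += 1

        # One forward pass: every running block makes slot-dependent progress.
        for slot, block in enumerate(running):
            remaining[block] = max(0, remaining[block] - slot_rates[slot])
        passes += 1

        # Commit finished blocks from the front only, keeping left-to-right order.
        while running and remaining[running[0]] == 0:
            running.pop(0)

    return num_blocks * block_size / passes


single = simulate_tpf(num_blocks=4, block_size=8, slot_rates=[4])     # SingleBD-like
multi = simulate_tpf(num_blocks=4, block_size=8, slot_rates=[4, 2])   # MultiBD-like
print(single, multi)
```

Under these made-up rates, a running set of two blocks finishes in fewer forward passes than a running set of one, which is the effect that TPF measures. Real per-slot progress depends on the model and the decoding thresholds.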
Community
We introduce Multi-Block Diffusion Language Models (MBD-LMs), a unified framework that bridges the training-inference gap for practical multi-block diffusion in block diffusion language models (BD-LMs). We identify that the existing Teacher Forcing and D2F paradigms fail to align with the bounded running set and heterogeneous slot-wise noise patterns required by Multi-Block Diffusion (MultiBD) inference. To address this, we propose Multi-block Teacher Forcing (MultiTF), a lightweight post-training method that constructs bounded noise groups with randomized chain-uniform scheduling, allowing any BD-LM to be upgraded to an MBD-LM. On the inference side, we design the Block Buffer mechanism to decouple dynamic running sets from static physical shapes, enabling CUDA Graph capture and prefix KV cache reuse. Empirically, MBD-LLaDA2-Mini achieves a 78.4% TPF improvement (3.47 to 6.19) while improving accuracy from 79.95% to 81.03%. Combined with DMax, TPF reaches 9.34 with strong throughput gains. We also release Diffulex, a unified serving engine that supports MBD-LMs and various BD-LM strategies (SingleBD, MultiBD, Dual Cache, DMax, etc.) under a single backend.
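The idea of decoupling a dynamic running set from a static physical shape can be sketched as a fixed-size buffer with per-slot activity flags. This is a minimal illustration of that general idea only; it is not the Block Buffer implementation from the paper or from Diffulex. The class name, method names, and the `MASK_ID` and `PAD_ID` values are hypothetical.

```python
MASK_ID = -1  # hypothetical id for a masked (not yet decoded) token
PAD_ID = -2   # hypothetical filler id for slots that hold no block


class BlockBuffer:
    """Fixed-shape container whose logical contents can grow and shrink."""

    def __init__(self, num_slots, block_size):
        self.num_slots = num_slots
        self.block_size = block_size
        self.tokens = [[PAD_ID] * block_size for _ in range(num_slots)]
        self.active = [False] * num_slots

    def admit(self):
        """Activate the first free slot with a fully masked block.

        Returns the slot index, or None if every slot is in use.
        """
        for slot, used in enumerate(self.active):
            if not used:
                self.active[slot] = True
                self.tokens[slot] = [MASK_ID] * self.block_size
                return slot
        return None

    def retire(self, slot):
        """Free a slot and return the tokens it held."""
        done = self.tokens[slot]
        self.tokens[slot] = [PAD_ID] * self.block_size
        self.active[slot] = False
        return done

    def model_input(self):
        """Return (flat_tokens, flat_active_mask).

        Both lists always have length num_slots * block_size, regardless of
        how many blocks are logically in the running set.
        """
        flat = [tok for row in self.tokens for tok in row]
        mask = [flag for flag in self.active for _ in range(self.block_size)]
        return flat, mask


buf = BlockBuffer(num_slots=3, block_size=4)
first = buf.admit()
tokens_a, mask_a = buf.model_input()
buf.admit()
tokens_b, mask_b = buf.model_input()
print(len(tokens_a), sum(mask_a), len(tokens_b), sum(mask_b))
```

Because the input length never changes as blocks are admitted or retired, a fixed-shape input of this kind is the sort of thing that static-shape execution such as CUDA Graph capture requires; inactive positions would need to be excluded by masking in the model itself.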
Project page: https://sjtu-deng-lab.github.io/mbd-lms/
Training code: https://github.com/SJTU-DENG-Lab/mbd-lms
Inference engine (Diffulex): https://github.com/SJTU-DENG-Lab/Diffulex
Models citing this paper: 8
SJTU-DENG-Lab/MBD-Math-LLaDA2-mini-DMax-16B
Datasets citing this paper: 1
SJTU-DENG-Lab/MBD-LMs-MultiTF-Datasets