WuChengyue committed
Commit 7ab45cd · verified · 1 Parent(s): ef88084

Update README.md
Files changed (1): README.md (+1, −1)
README.md CHANGED
@@ -12,7 +12,7 @@ base_model:
 
 Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms a pretrained AR model—specifically, Qwen-2.5-1.5B-Instruct—into a diffusion-style decoder for parallel text generation.
 
-Our approach introduces a novel decoding recipe incorporating a complementary attention mask and a position-aware masking strategy, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a token-level intra-block cache that supports efficient parallel decoding within partially generated blocks.
+Our approach introduces a novel decoding recipe incorporating a complementary attention mask and block diffusion mechanism, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block level cache that supports efficient parallel decoding within partially generated blocks.
 
 Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 2.5x speedup over standard AR decoding, without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
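The "blockwise bidirectional context modeling" the updated paragraph describes can be pictured as an attention mask that is bidirectional within a block and causal across blocks. The sketch below is purely illustrative and is not taken from the Fast-dLLM v2 codebase; the block size, function name, and 0/1 mask encoding are all assumptions for demonstration:

```python
def blockwise_mask(seq_len: int, block_size: int) -> list[list[int]]:
    """Illustrative blockwise attention mask.

    Entry [i][j] is 1 if query token i may attend to key token j:
    tokens attend bidirectionally within their own block, and
    causally (only backwards) to all earlier blocks.
    """
    return [
        [1 if j // block_size <= i // block_size else 0
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

# With seq_len=6 and block_size=2, token 0 can attend to token 1
# (same block, bidirectional) but not to token 2 (a future block).
mask = blockwise_mask(6, 2)
```

In this pattern, all tokens of the current block can be predicted in parallel against the cached earlier blocks, which is the property the block-level cache in the abstract exploits.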