WuChengyue committed
Commit 7ab45cd · verified · 1 Parent(s): ef88084

Update README.md
Files changed (1): README.md (+1, −1)
README.md CHANGED
@@ -12,7 +12,7 @@ base_model:
 
 Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms a pretrained AR model—specifically, Qwen-2.5-1.5B-Instruct—into a diffusion-style decoder for parallel text generation.
 
-Our approach introduces a novel decoding recipe incorporating a complementary attention mask and a position-aware masking strategy, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a token-level intra-block cache that supports efficient parallel decoding within partially generated blocks.
+Our approach introduces a novel decoding recipe incorporating a complementary attention mask and block diffusion mechanism, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block level cache that supports efficient parallel decoding within partially generated blocks.
 
 Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 2.5x speedup over standard AR decoding, without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
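The "blockwise bidirectional context modeling" the updated paragraph describes can be pictured as an attention mask that is bidirectional within a block and causal across blocks. The sketch below is purely illustrative and is not taken from the Fast-dLLM v2 codebase; the block size, function name, and 0/1 mask encoding are all assumptions for demonstration:

```python
def blockwise_mask(seq_len: int, block_size: int) -> list[list[int]]:
    """Illustrative blockwise attention mask.

    Entry [i][j] is 1 if query token i may attend to key token j:
    tokens attend bidirectionally within their own block, and
    causally (only backwards) to all earlier blocks.
    """
    return [
        [1 if j // block_size <= i // block_size else 0
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

# With seq_len=6 and block_size=2, token 0 can attend to token 1
# (same block, bidirectional) but not to token 2 (a future block).
mask = blockwise_mask(6, 2)
```

In this pattern, all tokens of the current block can be predicted in parallel against the cached earlier blocks, which is the property the block-level cache in the abstract exploits.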