Chengyue Wu committed
Commit ef88084 · 1 Parent(s): 393167c
README.md CHANGED
@@ -14,7 +14,7 @@ Autoregressive (AR) large language models (LLMs) have achieved remarkable perfor
 
 Our approach introduces a novel decoding recipe incorporating a complementary attention mask and a position-aware masking strategy, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a token-level intra-block cache that supports efficient parallel decoding within partially generated blocks.
 
-Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 4x speedup over standard AR decoding, without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
+Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 2.5x speedup over standard AR decoding, without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.
 
 **This repo contains the Fast-dLLM v2 1.5B model**, which has the following features:
 
@@ -98,24 +98,22 @@ print(response)
 
 Fast-dLLM v2 demonstrates state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs. The model achieves:
 
-* Near 4x inference speedup compared to standard AR decoding
-* Comparable generation quality to the base Qwen2.5-1.5B-Instruct model
-* Efficient memory usage through hierarchical caching mechanisms
+* Near 2.5x inference speedup compared to standard AR decoding
+* Comparable generation quality to the original Qwen2.5-1.5B-Instruct model
+
+### Throughput Performance
+
+We accelerate the AR model with a near 2.5x speedup at batch size 1.
+At larger batch sizes, the throughput of previous methods degrades, while Fast-dLLM-v2 remains consistently faster than AR decoding.
+
+![Throughput Comparison](assets/throughput.png)
 
 ### Benchmark Results
 
-The following table compares the performance of Fast-dLLM-v2 against the base autoregressive model (qwen2.5-1.5B-ar) across various benchmarks:
+Fast-dLLM-v2 largely preserves the performance of the AR LLM, achieves state-of-the-art results among 1B-scale LLMs, and is competitive with the 8B diffusion LLM (LLaDA).
 
-| Model | HumanEval | HumanEval+ | MBPP | MBPP+ | GSM8K | MATH | IFEval | MMLU (0-shot) | GPQA |
-|-------|-----------|------------|------|-------|-------|------|--------|---------------|------|
-| qwen2.5-1.5B-ar | 42.1 | 37.2 | 48.1 | 41.3 | 57.0 | 22.4 | 41.2 | 54.6 | 30.58 |
-| Fast-dLLM-v2 | **43.3** | **40.2** | **50.0** | 41.3 | **60.1** | **28.4** | **45.7** | **55.1** | 27.7 |
+![Benchmark Results](assets/benchmark_results.png)
 
-**Key Observations:**
-- Fast-dLLM v2 outperforms the base AR model on 7 out of 9 benchmarks
-- Significant improvements in mathematical reasoning (MATH: 22.4 → 28.4) and instruction following (IFEval: 41.2 → 45.7)
-- Comparable performance on MBPP+ and slight decrease on GPQA
-- Overall performance improvement while achieving 4x inference speedup
 
 ## Citation
 
assets/benchmark_results.png ADDED
assets/throughput.png ADDED
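
The decoding recipe the updated README describes (predict every masked position in a block, then commit only the confident ones) can be sketched in a few lines. This is a simplified illustration, not the repo's implementation: it uses greedy candidate selection where the repo samples with top-p, and `parallel_unmask_step`, `mask_idx`, and the `threshold` default are assumed names and values.

```python
import torch

@torch.no_grad()
def parallel_unmask_step(logits, x_t, mask_idx, threshold=0.9):
    """One confidence-thresholded parallel decoding step inside a block.

    logits:   (B, L, V) model predictions for the current block window
    x_t:      (B, L) partially generated block containing mask tokens
    mask_idx: (B, L) bool, True where x_t is still masked
    """
    probs = torch.softmax(logits.float(), dim=-1)
    p_max, x_pred = probs.max(dim=-1)              # greedy candidate per slot
    neg_inf = torch.full_like(p_max, -torch.inf)
    p_max = torch.where(mask_idx, p_max, neg_inf)  # only masked slots compete

    accept = p_max > threshold                     # commit confident slots
    # Always commit the single most confident slot so the loop makes
    # progress even when nothing clears the threshold.
    accept[torch.arange(x_t.size(0)), p_max.argmax(dim=-1)] = True
    accept &= mask_idx                             # never touch decoded tokens

    x_t = torch.where(accept, x_pred, x_t)
    return x_t, mask_idx & ~accept
```

Iterating this step until no masked positions remain decodes a block in a handful of forward passes rather than one pass per token, which is where the batch-size-1 speedup comes from.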
modeling.py CHANGED
@@ -163,13 +163,13 @@ class Fast_dLLM_QwenAttention(nn.Module):
             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
             key_states, value_states = block_past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
         else:
-            block_cache_key_states = block_past_key_values[self.layer_idx][0].clone()
-            block_cache_value_states = block_past_key_values[self.layer_idx][1].clone()
+            block_cache_key_states = block_past_key_values[self.layer_idx][0]
+            block_cache_value_states = block_past_key_values[self.layer_idx][1]
 
             block_cache_key_states[:, :, replace_position:replace_position+key_states.shape[2]] = key_states
             block_cache_value_states[:, :, replace_position:replace_position+value_states.shape[2]] = value_states
-            key_states = block_cache_key_states.contiguous()
-            value_states = block_cache_value_states.contiguous()
+            key_states = block_cache_key_states
+            value_states = block_cache_value_states
 
         if past_key_value is not None:
             # sin and cos are specific to RoPE models; cache_position needed for the static cache
@@ -618,7 +618,7 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
                     logits = torch.cat([logits[:, :1, :], logits[:, :-1, :]], dim=1)
                     logits = logits[:, start:end]
                 else:
-                    logits = self.forward(input_ids=x_t[:, -block_size+small_block_start_idx:], use_cache=True, past_key_values=past_key_values, update_past_key_values=False, use_block_cache=True, block_past_key_values=block_past_key_values, replace_position=small_block_start_idx).logits
+                    logits = self.forward(input_ids=x_t[:, start:end], use_cache=True, past_key_values=past_key_values, update_past_key_values=False, use_block_cache=True, block_past_key_values=block_past_key_values, replace_position=small_block_start_idx).logits
                     logits = torch.cat([logits[:, :1, :], logits[:, :-1, :]], dim=1)
             else:
                 logits = self.forward(input_ids=x_t[:, -block_size:], use_cache=True, past_key_values=past_key_values, update_past_key_values=False).logits
@@ -629,7 +629,7 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
             x_1, p_1t = self.sample_with_top_p(logits, top_p=top_p, temperature=temperature)
             # Select tokens with probability greater than threshold from p_1t
             x1_p = torch.squeeze(torch.gather(p_1t, dim=-1, index=torch.unsqueeze(x_1, -1)), -1)
-            x1_p = torch.where(mask_idx, x1_p, -torch.inf)
+            x1_p = torch.where(mask_idx[:, start:end], x1_p, -torch.inf)
 
             unmask_idx = (x1_p > threshold)
             max_prob_idx = x1_p.argmax(dim=-1)
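
The first hunk drops the `.clone()` and `.contiguous()` calls and instead writes the freshly computed key/value states directly into the block cache at `replace_position`, saving two full-size K/V copies per decoding step. A minimal sketch of the pattern, with illustrative shapes and names rather than the repo's exact cache API:

```python
import torch

# Illustrative shapes: the block cache holds per-layer (key, value) tensors
# covering one block of positions.
batch, heads, block_len, head_dim = 1, 12, 32, 128
cache_k = torch.zeros(batch, heads, block_len, head_dim)

def update_block_cache_inplace(cache_k, new_k, replace_position):
    # Before this commit: work on cache_k.clone() and return .contiguous(),
    # i.e. two extra full-size copies per step. After: mutate the cache
    # directly and hand the same storage to attention.
    cache_k[:, :, replace_position:replace_position + new_k.shape[2]] = new_k
    return cache_k

new_k = torch.randn(batch, heads, 4, head_dim)   # 4 re-decoded positions
k_for_attn = update_block_cache_inplace(cache_k, new_k, replace_position=8)
assert k_for_attn.data_ptr() == cache_k.data_ptr()  # no copy was made
```

The trade-off is aliasing: this is only safe because each step overwrites exactly the positions it re-decodes before attention reads the cache, so the defensive copy buys nothing here.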
 
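
The last hunk is a correctness fix that accompanies the `x_t[:, start:end]` change above: once only the active window is fed through the model, the confidence tensor `x1_p` has width `end - start`, so the full-length `mask_idx` must be sliced to the same window before `torch.where`. With illustrative shapes:

```python
import torch

B, L, start, end = 2, 32, 8, 16
mask_idx = torch.zeros(B, L, dtype=torch.bool)
mask_idx[:, start:end] = True            # only the active sub-block is masked
x1_p = torch.rand(B, end - start)        # confidences for the window only

# Slicing the mask makes both operands (B, end - start); the unsliced (B, L)
# mask would misalign positions or raise a shape error.
out = torch.where(mask_idx[:, start:end], x1_p, torch.full_like(x1_p, -torch.inf))
assert out.shape == (B, end - start)
```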