Chengyue Wu committed · Commit 393167c · 1 Parent(s): 0f41374

update readme

Files changed (1): README.md (+124 -1)

language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B)

## Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherently sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms a pretrained AR model—specifically, Qwen2.5-1.5B-Instruct—into a diffusion-style decoder for parallel text generation.

Our approach introduces a novel decoding recipe incorporating a complementary attention mask and a position-aware masking strategy, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a token-level intra-block cache that supports efficient parallel decoding within partially generated blocks.
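
To make the caching idea concrete, here is a minimal illustrative sketch of the two cache levels (the class and method names are ours for exposition, not the model's internal implementation): finished blocks contribute fixed context that is never recomputed, while the block currently being decoded keeps its own small cache for positions that have already been committed.

```python
from dataclasses import dataclass, field

@dataclass
class TwoLevelCacheSketch:
    """Illustrative only: a block-level cache for finished blocks and a
    token-level cache for the block currently being decoded."""
    finished_blocks: dict = field(default_factory=dict)  # block index -> KV states of a fully decoded block
    current_block: dict = field(default_factory=dict)    # position -> KV state of a committed token in the open block

    def commit_token(self, pos, kv_state):
        # Token-level cache: a position decoded inside the current block is
        # reused by later parallel refinement steps within that block.
        self.current_block[pos] = kv_state

    def close_block(self, block_idx):
        # Block-level cache: once the whole block is decoded, promote it so
        # subsequent blocks attend to it as fixed historical context.
        self.finished_blocks[block_idx] = dict(self.current_block)
        self.current_block.clear()
```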

Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 4x speedup over standard AR decoding without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward practical deployment of fast and accurate language models.

**This repo contains the Fast-dLLM v2 1.5B model**, which has the following features:

* Type: Block Diffusion Language Model (dLLM)
* Base Model: Qwen/Qwen2.5-1.5B-Instruct
* Architecture: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings
* Number of Parameters: 1.54B
* Number of Parameters (Non-Embedding): 1.31B
* Number of Layers: 28
* Number of Attention Heads (GQA): 12 for Q and 2 for KV
* Context Length: 32,768 tokens (full context), with generation up to 8,192 tokens
* Key Innovation: Parallel block-wise decoding with hierarchical caching
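
These numbers can be cross-checked against the checkpoint's configuration. A quick sketch, assuming the remote config exposes the standard Qwen2-style field names:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_1.5B", trust_remote_code=True
)

# Field names below follow the usual Qwen2 configuration; adjust if the
# remote config uses different attribute names.
print(config.num_hidden_layers)        # expected: 28
print(config.num_attention_heads)      # expected: 12 (query heads)
print(config.num_key_value_heads)      # expected: 2 (key/value heads, GQA)
print(config.max_position_embeddings)  # expected: 32768
```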

## Requirements

The code requires a recent version of `transformers` together with the custom generation functions that accompany this model. Make sure you have the following dependencies installed:

```bash
pip install transformers torch numpy
```

## Quickstart

The following code snippet shows how to load the model and generate content with Fast-dLLM v2 parallel decoding:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import generation_functions  # custom generation functions that accompany this model

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example conversation
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Use Fast-dLLM v2 parallel decoding
generated_ids = model.generate(
    model_inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

# Strip the prompt tokens and decode only the newly generated text
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
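
The `small_block_size` and `threshold` arguments control, respectively, the size of the sub-blocks decoded in parallel and the confidence cut-off for committing tokens; this is our reading of the interface, so consult the repository's generation code for the authoritative semantics. As a purely illustrative toy (none of the names below belong to the model's API), confidence-thresholded parallel filling of a masked block looks roughly like this:

```python
import torch

def toy_parallel_block_decode(logits_fn, block_len=8, threshold=0.9, mask_id=-1, max_iters=8):
    """Toy sketch: iteratively fill a block of masked positions, committing in
    parallel every position whose top-1 confidence exceeds `threshold`."""
    block = torch.full((block_len,), mask_id, dtype=torch.long)
    for _ in range(max_iters):
        masked = block == mask_id
        if not masked.any():
            break
        probs = logits_fn(block).softmax(dim=-1)   # (block_len, vocab_size)
        conf, pred = probs.max(dim=-1)
        accept = masked & (conf >= threshold)
        if not accept.any():
            # Always commit the single most confident masked position so the loop progresses.
            best = torch.where(masked)[0][conf[masked].argmax()]
            accept = torch.zeros_like(masked)
            accept[best] = True
        block[accept] = pred[accept]
    return block

# Stand-in "model": random logits over a vocabulary of 100 tokens.
toy_logits = lambda block: torch.randn(block.shape[0], 100)
print(toy_parallel_block_decode(toy_logits, threshold=0.9))
```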

## Key Features

* **Parallel Decoding**: Achieves a near 4x speedup over standard autoregressive decoding
* **Block-wise Processing**: Processes text in blocks for efficient parallel generation
* **Hierarchical Caching**: Block-level and token-level caches that avoid recomputing already-decoded context
* **Quality Preservation**: Maintains generation quality while significantly improving speed
* **Compatible Interface**: Drop-in replacement for standard transformer models
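
To gauge the speedup on your own hardware, you can time the generation call and compute throughput. A small sketch that reuses `model`, `tokenizer`, and `model_inputs` from the Quickstart:

```python
import time
import torch

# Synchronize before and after timing so GPU work is fully accounted for.
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()

generated_ids = model.generate(
    model_inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = generated_ids.shape[1] - model_inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
```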

## Performance

Fast-dLLM v2 demonstrates state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs. The model achieves:

* Near 4x inference speedup compared to standard AR decoding
* Generation quality comparable to the base Qwen2.5-1.5B-Instruct model
* Efficient memory usage through the hierarchical caching mechanism

### Benchmark Results

The following table compares Fast-dLLM-v2 against the base autoregressive model (qwen2.5-1.5B-ar) across a range of benchmarks:

| Model | HumanEval | HumanEval+ | MBPP | MBPP+ | GSM8K | MATH | IFEval | MMLU (0-shot) | GPQA |
|-------|-----------|------------|------|-------|-------|------|--------|---------------|------|
| qwen2.5-1.5B-ar | 42.1 | 37.2 | 48.1 | 41.3 | 57.0 | 22.4 | 41.2 | 54.6 | 30.58 |
| Fast-dLLM-v2 | **43.3** | **40.2** | **50.0** | 41.3 | **60.1** | **28.4** | **45.7** | **55.1** | 27.7 |

**Key Observations:**
- Fast-dLLM v2 outperforms the base AR model on 7 of the 9 benchmarks
- Significant improvements in mathematical reasoning (MATH: 22.4 → 28.4) and instruction following (IFEval: 41.2 → 45.7)
- Performance matches the base model on MBPP+ and decreases slightly on GPQA
- Overall accuracy improves while delivering a near-4x inference speedup
119
+
120
+ ## Citation
121
+
122
+ If you find our work helpful, please cite our paper:
123
+
124
+ ```bibtex
125
+
126
+ ```
127
+
128
+ ## License
129
+
130
+ This model is released under the Apache 2.0 license, following the base Qwen2.5-1.5B-Instruct model.