Commit 393167c by Chengyue Wu (parent: 0f41374): update readme

README.md

---
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B)

## Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. However, their inherently sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that transforms a pretrained AR model (specifically, Qwen2.5-1.5B-Instruct) into a diffusion-style decoder for parallel text generation.

Our approach introduces a novel decoding recipe incorporating a complementary attention mask and a position-aware masking strategy, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations, and a token-level intra-block cache that supports efficient parallel decoding within partially generated blocks.

Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves a near 4x speedup over standard AR decoding without compromising generation quality. Extensive experiments demonstrate that Fast-dLLM v2 achieves state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs, marking a significant step toward the practical deployment of fast and accurate language models.
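
To make the decoding recipe concrete, the loop below is a minimal, purely illustrative sketch of threshold-based parallel decoding within a single block. It is not the code shipped with this repository: `model_logits_fn` is a hypothetical stand-in for a forward pass that scores every position of the current block (attending bidirectionally inside the block and to the finished prefix), and `MASK_ID` is a placeholder for the mask token id.

```python
import torch

MASK_ID = -1  # hypothetical placeholder id for a not-yet-decoded position


def decode_block(model_logits_fn, prefix_ids, block_size=32, threshold=0.9):
    """Fill one block of masked tokens by iterative parallel refinement."""
    block = torch.full((block_size,), MASK_ID, dtype=torch.long)
    while (block == MASK_ID).any():
        masked = block == MASK_ID
        # Score all positions of the block in a single forward pass.
        probs = torch.softmax(model_logits_fn(prefix_ids, block), dim=-1)
        conf, pred = probs.max(dim=-1)
        # Accept every still-masked position whose confidence clears the threshold;
        # always accept at least the single most confident one so the loop advances.
        accept = masked & (conf >= threshold)
        if not accept.any():
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            accept[best] = True
        block = torch.where(accept, pred, block)
    return block
```

The hierarchical cache described above serves exactly this kind of loop: the block-level cache holds the representations of finished blocks and the token-level intra-block cache holds the partially filled block, so neither needs to be recomputed from scratch at every refinement step.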

**This repo contains the Fast-dLLM v2 1.5B model**, which has the following features:

* Type: Block Diffusion Language Model (dLLM)
* Base Model: Qwen/Qwen2.5-1.5B-Instruct
* Architecture: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings
* Number of Parameters: 1.54B
* Number of Parameters (Non-Embedding): 1.31B
* Number of Layers: 28
* Number of Attention Heads (GQA): 12 for Q and 2 for KV
* Context Length: 32,768 tokens (full), up to 8,192 generated tokens
* Key Innovation: Parallel block-wise decoding with hierarchical caching

## Requirements

The code requires the latest version of `transformers` and the custom generation functions. Make sure you have the following dependencies installed:

```bash
pip install transformers torch numpy
```

## Quickstart

Here is a code snippet showing how to load the model and generate content with Fast-dLLM v2 parallel decoding:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import types
import generation_functions  # custom generation functions for parallel decoding (see Requirements)

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example conversation
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Use Fast-dLLM v2 parallel decoding
generated_ids = model.generate(
    model_inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,   # parallel-decoding controls (see the note below)
    threshold=0.9,
)

# Strip the prompt tokens and decode only the newly generated part
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
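
In this snippet, `small_block_size` and `threshold` are the extra arguments understood by the custom `generate` loaded through `trust_remote_code`. Based on the method description above, `small_block_size` sets the size of the sub-blocks decoded in parallel within each block, and `threshold` is the confidence a prediction must reach to be accepted in a parallel step; a lower threshold generally decodes more tokens per step at some potential cost in quality.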

## Key Features

* **Parallel Decoding**: Achieves a near 4x speedup over standard autoregressive decoding
* **Block-wise Processing**: Processes text in blocks for efficient parallel generation
* **Hierarchical Caching**: Block-level and token-level caching for efficient memory usage (sketched conceptually below)
* **Quality Preservation**: Maintains generation quality while significantly improving speed
* **Compatible Interface**: Drop-in replacement for standard transformer models
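
As a purely conceptual illustration of the hierarchical-caching bullet above (none of these names exist in the repository), the two cache levels can be pictured as a small container object:

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalCache:
    """Conceptual sketch of the two cache levels, not the repository's implementation."""

    # Block-level cache: key/value states of fully decoded blocks, reused as fixed prefix context.
    block_kv: list = field(default_factory=list)
    # Token-level intra-block cache: states for the block currently being filled,
    # refreshed as more of its tokens are accepted during parallel decoding.
    intra_block_kv: dict = field(default_factory=dict)

    def commit_block(self, kv):
        """Once a block is fully decoded, promote its states to the block-level cache."""
        self.block_kv.append(kv)
        self.intra_block_kv.clear()
```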

## Performance

Fast-dLLM v2 demonstrates state-of-the-art trade-offs between efficiency and performance among existing diffusion-based LLMs. The model achieves:

* Near 4x inference speedup compared to standard AR decoding (a simple way to check this on your own hardware is sketched below)
* Comparable generation quality to the base Qwen2.5-1.5B-Instruct model
* Efficient memory usage through hierarchical caching mechanisms
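
The timing sketch below is one way to compare wall-clock throughput against the AR baseline on your own hardware. It is illustrative only: the generation arguments are taken from the Quickstart, the prompt and `max_new_tokens` are arbitrary, and the measured ratio will vary with hardware, batch size, and sequence length.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def tokens_per_second(model_name, prompt, extra_gen_kwargs=None, pass_tokenizer=False):
    """Load a chat model, generate once, and return new tokens per second."""
    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
    )
    text = tok.apply_chat_template(
        [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
    )
    inputs = tok([text], return_tensors="pt").to(model.device)

    kwargs = dict(extra_gen_kwargs or {})
    if pass_tokenizer:
        kwargs["tokenizer"] = tok  # the Fast-dLLM v2 generate call also takes the tokenizer

    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = model.generate(inputs["input_ids"], max_new_tokens=256, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.time() - start

    new_tokens = len(out[0]) - inputs["input_ids"].shape[1]
    return new_tokens / elapsed


prompt = "Give me a short introduction to large language models."
ar_tps = tokens_per_second("Qwen/Qwen2.5-1.5B-Instruct", prompt)
dllm_tps = tokens_per_second(
    "Efficient-Large-Model/Fast_dLLM_1.5B",
    prompt,
    extra_gen_kwargs={"small_block_size": 8, "threshold": 0.9},
    pass_tokenizer=True,
)
print(f"AR baseline: {ar_tps:.1f} tok/s | Fast-dLLM v2: {dllm_tps:.1f} tok/s")
```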

### Benchmark Results

The following table compares the performance of Fast-dLLM-v2 against the base autoregressive model (qwen2.5-1.5B-ar) across various benchmarks:

| Model | HumanEval | HumanEval+ | MBPP | MBPP+ | GSM8K | MATH | IFEval | MMLU (0-shot) | GPQA |
|-------|-----------|------------|------|-------|-------|------|--------|---------------|------|
| qwen2.5-1.5B-ar | 42.1 | 37.2 | 48.1 | 41.3 | 57.0 | 22.4 | 41.2 | 54.6 | 30.58 |
| Fast-dLLM-v2 | **43.3** | **40.2** | **50.0** | 41.3 | **60.1** | **28.4** | **45.7** | **55.1** | 27.7 |

**Key Observations:**

- Fast-dLLM v2 outperforms the base AR model on 7 of the 9 benchmarks
- Significant improvements in mathematical reasoning (MATH: 22.4 → 28.4) and instruction following (IFEval: 41.2 → 45.7)
- Comparable performance on MBPP+ (tied) and a slight decrease on GPQA
- Overall performance improves while still delivering a near 4x inference speedup

## Citation

If you find our work helpful, please cite our paper:

```bibtex

```

## License

This model is released under the Apache 2.0 license, following the base Qwen2.5-1.5B-Instruct model.