---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
library_name: transformers
---
# WeDLM-8B-Instruct
**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base). It performs parallel decoding under standard causal attention.

**Highlights:**
- 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- Outperforms Qwen3-8B-Instruct on most benchmarks
- Native KV-cache compatibility (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.

[Paper](https://arxiv.org/abs/2512.22737) | [Project Page](https://wedlm.github.io) | [GitHub](https://github.com/tencent/WeDLM)
## Model Details
| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 |
## Installation
```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```
<details>
<summary><b>Manual Installation</b></summary>

```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
# Step 2: flash-attn build dependencies
pip install psutil ninja packaging
# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation
# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```
</details>
<details>
<summary><b>Docker Installation</b></summary>

```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3
# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash
# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
```
</details>
> **Note:** `flash-attn` requires compilation and must be installed after PyTorch.
> The `install.sh` script handles this automatically (default: CUDA 12.9).
> For other CUDA versions: `CUDA_VERSION=cu124 bash install.sh`
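After installation, a quick sanity check can confirm that PyTorch, CUDA, and flash-attn are all importable. This is a minimal sketch; the exact version strings depend on your environment.

```python
# Environment sanity check (illustrative; versions depend on your install)
import torch
import flash_attn

print("torch:", torch.__version__)            # e.g. 2.8.0+cu129 with the default install.sh
print("cuda available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)  # e.g. 2.7.4.post1
```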
## Quick Start (Recommended)
For **fast inference**, use the `wedlm` engine:
```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
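The `llm` and `tokenizer` objects created above are reused in the multi-turn and batch examples below.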
### Multi-turn Conversation
```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
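To keep a conversation going over many turns, the same pattern can be wrapped in a small loop that appends each generated reply back into `messages`. This is a minimal sketch reusing the `llm`, `tokenizer`, and `SamplingParams` objects from the Quick Start; the `outputs[0]["text"]` access follows the example above.

```python
# Illustrative multi-turn loop: feed each reply back into the chat history.
def chat(llm, tokenizer, messages, user_input, max_tokens=256):
    messages.append({"role": "user", "content": user_input})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=max_tokens))
    reply = outputs[0]["text"]
    messages.append({"role": "assistant", "content": reply})
    return reply

history = []
print(chat(llm, tokenizer, history, "What is the derivative of x^2?"))
print(chat(llm, tokenizer, history, "What about x^3?"))
```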
### Batch Inference
```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
## HuggingFace Transformers
For **training** or simple forward passes:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
> ⚠️ **Note:** The HuggingFace interface is for training/forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.
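The forward pass above returns standard `transformers` model outputs. As a minimal sketch (assuming the checkpoint exposes ordinary causal-LM logits through this interface), they can be inspected like this:

```python
# Inspect forward-pass outputs (assumes a standard CausalLMOutput with logits)
import torch

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # shape: [batch, seq_len, vocab_size]
print(logits.shape)
```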
## Performance
### Generation Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |
### Inference Speed
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
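These numbers will vary with hardware, batch size, and sampling settings. A rough way to measure end-to-end throughput on your own prompts with the `wedlm` engine is sketched below (it reuses the Quick Start objects; counting generated tokens by re-encoding the output text is approximate):

```python
# Rough throughput measurement (illustrative; reuses llm, tokenizer, SamplingParams)
import time

questions = ["Solve step by step: what is 37 * 43?"] * 8
texts = [
    tokenizer.apply_chat_template([{"role": "user", "content": q}],
                                   tokenize=False, add_generation_prompt=True)
    for q in questions
]

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=256))
elapsed = time.perf_counter() - start

generated = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```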
## Citation
```bibtex
@article{liu2025wedlm,
title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
journal={arXiv preprint arXiv:2512.22737},
year={2025}
}
```
## License
Apache 2.0