---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
library_name: transformers
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model that performs parallel decoding under standard causal attention, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base).

**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Native KV cache compatible (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.

📄 [Paper](https://arxiv.org/abs/2512.22737) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 |

## Installation

```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```

<details>
<summary><b>Manual Installation</b></summary>

```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129

# Step 2: flash-attn build dependencies
pip install psutil ninja packaging

# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```

</details>

<details>
<summary><b>Docker Installation</b></summary>

```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3

# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash

# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
```

</details>

> **Note:** `flash-attn` requires compilation and must be installed after PyTorch.
> The `install.sh` script handles this automatically (default: CUDA 12.9).
> For other CUDA versions: `CUDA_VERSION=cu124 bash install.sh`
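
After installation, a quick import check can confirm that PyTorch, `flash-attn`, and the `wedlm` package load together. This is a minimal sanity-check sketch, not part of the official install script:

```python
# Minimal post-install sanity check (sketch; not part of install.sh).
import torch
import flash_attn  # compiled against the installed torch/CUDA toolkit
from wedlm import LLM, SamplingParams  # noqa: F401

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU available:", torch.cuda.is_available())
```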

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of xΒ² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
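
To keep the conversation going, append the generated reply back onto `messages` and re-apply the chat template. This sketch reuses the `outputs[0]["text"]` access pattern from the Quick Start example:

```python
# Continue the dialogue: append the model's reply, then ask a follow-up
# (sketch; reuses the output format shown in Quick Start).
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And the derivative of sin(x)?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```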

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
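
The snippet above stops at the raw forward pass. Assuming the model exposes standard `CausalLMOutput`-style logits (a reasonable but unverified assumption for this `trust_remote_code` model), you can inspect them like this:

```python
import torch

# Inspect the forward-pass output (sketch; assumes logits of shape
# [batch, seq_len, vocab_size] as in standard HF causal LM outputs).
logits = outputs.logits
print(logits.shape)

# Greedy pick at the final position, purely as a sanity check of the forward
# pass; use the wedlm engine for actual generation.
next_token_id = torch.argmax(logits[0, -1]).item()
print(tokenizer.decode([next_token_id]))
```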

> ⚠️ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above.

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

## Citation

```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
```

## License

Apache 2.0