---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B-Base
pipeline_tag: text-generation
tags:
- language-model
- parallel-decoding
---
# WeDLM-8B-Instruct ⭐
**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model that performs parallel decoding under standard causal attention, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base).
**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Natively compatible with standard KV caching (FlashAttention, PagedAttention, CUDA Graphs)
For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.
📄 [Paper](https://arxiv.org/abs/2512.22737) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)
## Model Details
| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 |
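
These settings can be verified locally via the standard `AutoConfig` API. A minimal sketch; `max_position_embeddings` is the usual Qwen3-style attribute name and is an assumption about this remote-code config:
```python
from transformers import AutoConfig
# Load only the configuration (no model weights are downloaded).
config = AutoConfig.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
# For Qwen3-style configs, this should report the 32,768-token context window.
print(config.max_position_embeddings)
```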
## Installation
```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```
### Manual Installation
```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
# Step 2: flash-attn build dependencies
pip install psutil ninja packaging
# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation
# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```
### Docker Installation
```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3
# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash
# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
```
> **Note:** `flash-attn` requires compilation and must be installed after PyTorch.
> The `install.sh` script handles this automatically (default: CUDA 12.9).
> For other CUDA versions: `CUDA_VERSION=cu124 bash install.sh`
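
After either install path, a quick sanity check can confirm that PyTorch, its CUDA build, and `flash-attn` import cleanly and match the pinned versions. A minimal sketch; it assumes `flash_attn` exposes the usual `__version__` attribute:
```python
import torch
import flash_attn
# Expect torch 2.8.0 with a cu129 build and flash-attn 2.7.4.post1 per the steps above.
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU visible:", torch.cuda.is_available())
```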
## Quick Start (Recommended)
For **fast inference**, use the `wedlm` engine:
```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
### Multi-turn Conversation
```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```
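To continue the dialogue, append the generated reply to `messages` before adding the next user turn. This sketch reuses only the output shape shown above:
```python
# Feed the model's answer back into the history, then ask a follow-up.
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And what about x^4?"})
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```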
### Batch Inference
```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
## HuggingFace Transformers
For **training** or simple forward passes:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)  # plain forward pass; outputs.logits holds per-position token logits
```
> ⚠️ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above.
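
For a training step, the usual HuggingFace causal-LM convention of passing `labels` should apply. A sketch under that assumption; check the remote-code implementation for the exact loss semantics of the diffusion objective:
```python
# Standard causal-LM style training step: labels = input_ids yields a next-token loss.
batch = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)      # scalar loss tensor
outputs.loss.backward()  # backprop; hand gradients to your optimizer
```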
## Performance
### Generation Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |
### Inference Speed
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):
| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
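
You can estimate throughput on your own workload by timing `generate` and counting output tokens. A rough sketch; it re-tokenizes the generated text, so counts are approximate and numbers depend on hardware and batch size:
```python
import time
sampling = SamplingParams(temperature=0.2, max_tokens=512)
start = time.perf_counter()
outputs = llm.generate(texts, sampling)  # `texts` as built in the batch example above
elapsed = time.perf_counter() - start
# Approximate output-token count via re-tokenization of the generated text.
n_tokens = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```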
## Citation
```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
```
## License
Apache 2.0