---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B). It performs parallel decoding under standard causal attention.

**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms base Qwen3-8B-Instruct on most benchmarks
- ✅ Native KV cache compatible (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).

📄 Paper (Coming Soon) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B) |
| Parameters | 8B |
| Context Length | 32,768 tokens |

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of xΒ² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```

> ⚠️ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above.
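
For training, the standard Hugging Face causal-LM convention of passing `labels` to obtain a cross-entropy loss should apply. The sketch below is a minimal illustration (not a full fine-tuning recipe) that reuses the `tokenizer` and `model` loaded above and assumes the usual `labels`/`loss` behavior holds for this model:

```python
# Minimal supervised forward pass on one chat example.
# Assumes the standard causal-LM convention: passing `labels` returns a
# cross-entropy loss over the sequence (label -100 marks ignored tokens).
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
batch = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)       # scalar loss you would backpropagate
outputs.loss.backward()   # gradients for a single optimizer step
```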

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
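
Throughput depends on hardware, batch size, and prompt mix, so it is worth measuring on your own workload. The snippet below is an illustrative sketch that reuses the `llm`, `tokenizer`, and `SamplingParams` objects from the Quick Start and approximates the generated-token count by re-tokenizing the returned text; it assumes the output fields match the examples above.

```python
import time

# Time a small batch and report an approximate decode throughput.
prompts = ["Solve step by step: what is 17 * 24?"] * 8
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
         for m in messages_batch]

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=256))
elapsed = time.perf_counter() - start

# Approximate: count tokens in the generated text only.
generated = sum(len(tokenizer(o["text"])["input_ids"]) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```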

## Citation (Coming soon)


## License

Apache 2.0