---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
---
# WeDLM-8B-Instruct
**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model that performs parallel decoding under standard causal attention, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).
**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Natively compatible with standard KV caching (FlashAttention, PagedAttention, CUDA Graphs)
For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).
📄 Paper (Coming Soon) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)
## Model Details
| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B) |
| Parameters | 8B |
| Context Length | 32,768 |
## Quick Start (Recommended)
For **fast inference**, use the `wedlm` engine:
```bash
pip install git+https://github.com/tencent/WeDLM.git
```
```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

# Load the parallel-decoding engine and the matching tokenizer.
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# Render the chat template, then generate.
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
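For repeated single-turn queries, the calls above are easy to wrap in a small helper. A minimal sketch reusing the `llm` and `tokenizer` objects already created (`chat` is our name for illustration, not part of the `wedlm` API):

```python
def chat(prompt: str, max_tokens: int = 512) -> str:
    # Build a single-turn chat and render it with the model's chat template.
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Reuse the existing llm instance; engine startup is the expensive part.
    outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=max_tokens))
    return outputs[0]["text"]

print(chat("What is 17 * 24?"))
```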
### Multi-turn Conversation
```python
messages = [
{"role": "user", "content": "What is the derivative of x^2?"},
{"role": "assistant", "content": "The derivative of xΒ² is 2x."},
{"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```
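To keep the conversation going, append the generated reply as an `assistant` turn and re-apply the template. A short sketch continuing from the objects above:

```python
# Append the model's answer so the next turn sees the full history.
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "Now differentiate x^3 twice."})
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```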
### Batch Inference
```python
prompts = [
"Explain quantum entanglement simply.",
"Write a Python function to check if a number is prime.",
"What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
print(f"=== Response {i+1} ===\n{output['text']}\n")
```
## HuggingFace Transformers
For **training** or simple forward passes:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"tencent/WeDLM-8B-Instruct",
trust_remote_code=True,
torch_dtype="auto",
device_map="auto"
)
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
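Since this path is meant for training, a loss computation is usually the next step. A minimal sketch assuming the standard `labels` convention of Transformers causal-LM models (the official WeDLM fine-tuning objective may mask or weight tokens differently):

```python
# Generic Transformers causal-LM loss: pass the input ids as labels and
# the model returns the shifted cross-entropy loss. This is the standard
# HF convention, not necessarily WeDLM's official training recipe.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```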
> β οΈ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above.
## Performance
### Generation Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |
### Inference Speed
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):
| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
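To get a rough throughput number on your own hardware, time a generation and divide by the number of generated tokens. A quick sketch reusing the Quick Start objects (an approximation, not the harness behind the table above):

```python
import time

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve step by step: what is 123 * 456?"}],
    tokenize=False, add_generation_prompt=True,
)
start = time.perf_counter()
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
elapsed = time.perf_counter() - start
# Approximate the output token count by re-tokenizing the decoded text.
n_tokens = len(tokenizer(outputs[0]["text"])["input_ids"])
print(f"{n_tokens / elapsed:.1f} tokens/s")
```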
## Citation (Coming Soon)
## License
Apache 2.0