|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
base_model: tencent/WeDLM-8B |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- language model |
|
|
- parallel-decoding |
|
|
--- |
|
|
|
|
|
# WeDLM-8B-Instruct
|
|
|
|
|
**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B). It performs parallel decoding under standard causal attention.
|
|
|
|
|
**Highlights:** |
|
|
- 3-6× faster than vLLM-optimized Qwen3-8B-Instruct on math reasoning tasks
|
|
- Outperforms Qwen3-8B-Instruct on most benchmarks
|
|
- Natively KV-cache compatible (FlashAttention, PagedAttention, CUDA Graphs)
|
|
|
|
|
For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B). |
|
|
|
|
|
Paper (Coming Soon) | [Project Page](https://wedlm.github.io) | [GitHub](https://github.com/tencent/WeDLM)
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Attribute | Value | |
|
|
|:----------|:------| |
|
|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B) | |
|
|
| Parameters | 8B | |
|
|
| Context Length | 32,768 | |
|
|
|
|
|
## Quick Start (Recommended) |
|
|
|
|
|
For **fast inference**, use the `wedlm` engine: |
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/tencent/WeDLM.git |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer |
|
|
from wedlm import LLM, SamplingParams |
|
|
|
|
|
llm = LLM(model="tencent/WeDLM-8B-Instruct") |
|
|
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True) |
|
|
|
|
|
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?" |
|
|
messages = [{"role": "user", "content": prompt}] |
|
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
|
|
|
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512)) |
|
|
print(outputs[0]["text"]) |
|
|
``` |
|
|
|
|
|
### Multi-turn Conversation |
|
|
|
|
|
```python |
|
|
messages = [ |
|
|
{"role": "user", "content": "What is the derivative of x^2?"}, |
|
|
{"role": "assistant", "content": "The derivative of xΒ² is 2x."}, |
|
|
{"role": "user", "content": "What about x^3?"} |
|
|
] |
|
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
|
|
``` |
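
The engine is stateless across calls, so to continue a dialogue you append the generated reply as an assistant turn plus the next user turn, then regenerate. A minimal sketch, assuming replies are returned under the `"text"` key as in the Quick Start example:

```python
# Feed the model's reply back into the conversation and ask a follow-up.
reply = outputs[0]["text"]
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "And the integral of x^3?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```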
|
|
|
|
|
### Batch Inference |
|
|
|
|
|
```python |
|
|
prompts = [ |
|
|
"Explain quantum entanglement simply.", |
|
|
"Write a Python function to check if a number is prime.", |
|
|
"What are the main causes of climate change?" |
|
|
] |
|
|
messages_batch = [[{"role": "user", "content": p}] for p in prompts] |
|
|
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch] |
|
|
|
|
|
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512)) |
|
|
for i, output in enumerate(outputs): |
|
|
print(f"=== Response {i+1} ===\n{output['text']}\n") |
|
|
``` |
|
|
|
|
|
## HuggingFace Transformers |
|
|
|
|
|
For **training** or simple forward passes: |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"tencent/WeDLM-8B-Instruct", |
|
|
trust_remote_code=True, |
|
|
torch_dtype="auto", |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
messages = [{"role": "user", "content": "Hello!"}] |
|
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(text, return_tensors="pt").to(model.device) |
|
|
outputs = model(**inputs)  # single forward pass; returns logits, not generated text
|
|
``` |
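
For supervised fine-tuning, here is a minimal sketch of a single training step. It assumes the remote code follows the standard `AutoModelForCausalLM` loss convention (labels shifted internally); the actual diffusion training objective may differ, so check the GitHub repo for the official recipe:

```python
# Tokenize a full chat transcript (prompt + target answer) as one sequence.
train_messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
train_text = tokenizer.apply_chat_template(train_messages, tokenize=False)
batch = tokenizer(train_text, return_tensors="pt").to(model.device)

# Standard causal-LM objective; prompt-token masking omitted for brevity.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```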
|
|
|
|
|
> β οΈ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above. |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Generation Quality |
|
|
|
|
|
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct | |
|
|
|:----------|:-----------------:|:-----------------:| |
|
|
| ARC-C (0-shot) | 91.47 | **92.92** | |
|
|
| GSM8K (3-shot) | 89.91 | **92.27** | |
|
|
| MATH (4-shot) | **69.60** | 64.80 | |
|
|
| HumanEval (4-shot) | 71.95 | **80.49** | |
|
|
| MMLU (5-shot) | 71.52 | **75.14** | |
|
|
| GPQA-Diamond (5-shot) | 41.41 | **44.95** | |
|
|
| **Average** | 75.12 | **77.53** | |
|
|
|
|
|
### Inference Speed |
|
|
|
|
|
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct): |
|
|
|
|
|
| Scenario | Speedup | Notes | |
|
|
|:---------|:-------:|:------| |
|
|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |


| Code Generation | 2-3× | Deterministic syntax |


| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
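
Speedup also depends on batch size, prompt mix, and hardware, so it is worth measuring on your own workload. A rough timing sketch using the `wedlm` engine from the Quick Start (throughput is approximated here by re-tokenizing the generated text; the engine may expose exact token counts differently):

```python
import time

prompts = ["Solve step by step: what is 17 * 24?"] * 8
texts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True
    )
    for p in prompts
]

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=256))
elapsed = time.perf_counter() - start

# Approximate decode throughput by re-tokenizing the generated text.
n_tokens = sum(len(tokenizer(o["text"])["input_ids"]) for o in outputs)
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```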
|
|
|
|
|
## Citation (Coming soon) |
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |