---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model. It performs parallel decoding under standard causal attention and is fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).

**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B-Instruct on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Compatible with standard KV-cache infrastructure out of the box (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).

📄 Paper (Coming Soon) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B) |
| Parameters | 8B |
| Context Length | 32,768 tokens |

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

# Load the parallel-decoding engine and the matching tokenizer
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# Expected answer: 5 × $2 + 4 × $3 = $22
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
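
`SamplingParams` follows vLLM-style naming; this card only demonstrates `temperature` and `max_tokens`. As a hedged aside, vLLM-style engines typically treat `temperature=0.0` as greedy decoding, which suits exact-answer math prompts:

```python
# Assumption: temperature=0.0 selects greedy decoding, as in vLLM's
# SamplingParams; check the wedlm documentation if this differs.
greedy = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate([text], greedy)
print(outputs[0]["text"])
```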

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
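
To continue the dialogue with the model's own reply, append the generated text as an assistant turn before the next user message. A minimal sketch reusing only the calls shown above; the follow-up question is illustrative:

```python
# Feed the generated reply back into the history as an assistant turn
reply = outputs[0]["text"]
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "And x^n in general?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```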

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
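
To see the parallel-decoding speedup on your own hardware, a rough wall-clock measurement around the batched call is enough (illustrative sketch; `time` is from the Python standard library):

```python
import time

# Time the batched generate() call end to end
start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
elapsed = time.perf_counter() - start
print(f"Generated {len(outputs)} responses in {elapsed:.2f}s")
```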

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)  # single forward pass; returns logits, not generated text
```
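
The forward pass above returns logits rather than text. As a minimal sketch, and assuming the remote-code model follows the standard `transformers` causal-LM convention of exposing `outputs.logits`, you can read off a greedy next-token prediction like this; for actual generation, use the `wedlm` engine, which performs the parallel decoding this model is built for:

```python
import torch

# Illustration only: an ordinary one-token greedy readout, not WeDLM's
# parallel decoder. Assumes outputs.logits has shape [batch, seq, vocab].
with torch.no_grad():
    outputs = model(**inputs)
next_token_id = outputs.logits[0, -1].argmax(dim=-1).item()
print(tokenizer.decode([next_token_id]))
```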

> ⚠️ **Note:** The HuggingFace interface is provided for training and forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies with task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

## Citation (Coming Soon)

## License

Apache 2.0