---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B-Base
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
library_name: transformers
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base). It performs parallel decoding under standard causal attention.

**Highlights:**

- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Natively compatible with KV caching (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.

📄 [Paper](https://arxiv.org/abs/2512.22737) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 tokens |

## Installation

```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```
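
After installation, a quick import check confirms that PyTorch and flash-attn are usable (a minimal sanity check; it assumes `wedlm` is importable once the editable install succeeds):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import wedlm"
```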

<details>
<summary><b>Manual Installation</b></summary>

```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129

# Step 2: flash-attn build dependencies
pip install psutil ninja packaging

# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```

</details>

<details>
<summary><b>Docker Installation</b></summary>

```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3

# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash

# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
```
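
If you keep model weights in the host's Hugging Face cache, mounting it into the container avoids re-downloading them on every run (standard Docker volume syntax; the paths assume default Hugging Face cache locations):

```bash
# Mount the host's Hugging Face cache into the container
docker run -it --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name wedlm aiweiliu/wedlm:v3 /bin/bash
```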

</details>

> **Note:** `flash-attn` requires compilation and must be installed after PyTorch.
> The `install.sh` script handles this automatically (default: CUDA 12.9).
> For other CUDA versions: `CUDA_VERSION=cu124 bash install.sh`
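
If you are unsure which CUDA tag to pass, check your driver and toolkit versions first (standard NVIDIA tools):

```bash
nvidia-smi       # driver-side CUDA version, shown in the header
nvcc --version   # toolkit version, if the CUDA toolkit is installed
```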

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
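
Decoding behavior is controlled through `SamplingParams`; only the `temperature` and `max_tokens` fields used in this card are assumed here. For example, a lower temperature makes evaluation-style runs more deterministic:

```python
# Near-deterministic decoding for evaluation-style runs
# (assumes the engine accepts temperature=0.0; otherwise use a small value)
outputs = llm.generate([text], SamplingParams(temperature=0.0, max_tokens=512))
```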

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
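
To keep the conversation going, append the model's reply to `messages` and repeat. This sketch reuses only the `llm.generate` and chat-template calls shown above:

```python
# Append the assistant reply, then ask a follow-up turn
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And the second derivative?"})
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```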

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```

> ⚠️ **Note:** The HuggingFace interface is provided for training and forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.
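
The forward pass above should return standard causal-LM outputs, assuming the remote-code model follows the usual `transformers` output convention (an assumption worth verifying against the remote code):

```python
# Logits over the vocabulary at each input position
# (assumes the remote-code forward() returns a standard CausalLMOutput)
print(outputs.logits.shape)  # [batch_size, seq_len, vocab_size]
```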

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
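
For a rough throughput estimate on your own prompts, you can time `llm.generate` directly (a minimal sketch using only the `wedlm` calls shown above; output token counts are approximated by re-tokenizing the generated text):

```python
import time

texts = [text] * 8  # small batch of identical prompts for a quick estimate
params = SamplingParams(temperature=0.2, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(texts, params)
elapsed = time.perf_counter() - start

# Approximate generated-token count by re-tokenizing the output text
total_tokens = sum(len(tokenizer(o["text"])["input_ids"]) for o in outputs)
print(f"{total_tokens / elapsed:.1f} tokens/s over {elapsed:.1f}s")
```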

## Citation

```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
```

## License

Apache 2.0