---
license: apache-2.0
language:
- en
- zh
base_model: Qwen/Qwen2.5-7B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
---

# WeDLM-7B

**WeDLM-7B** is a diffusion language model that performs parallel decoding under standard causal attention, initialized from [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).

This is the **base (pretrained)** version. For the instruction-tuned version, see [WeDLM-7B-Instruct](https://huggingface.co/tencent/WeDLM-7B-Instruct).

📄 Paper (Coming Soon) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Initialized From | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
| Parameters | 7B |
| Context Length | 32,768 |

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-7B")

prompt = "The theory of relativity states that"
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))

print(outputs[0]["text"])
```

## HuggingFace Transformers

For **training** or simple forward passes, you can load the model with Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-7B",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

inputs = tokenizer("The theory of relativity", return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
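
The forward pass returns standard causal-LM logits of shape `(batch, seq_len, vocab)`. As a minimal sketch of inspecting them, the hypothetical helper below (not part of the WeDLM API) extracts top-k next-token probabilities from the last position:

```python
import torch

def next_token_probs(logits: torch.Tensor, k: int = 5):
    """Given causal-LM logits of shape (batch, seq_len, vocab),
    return the top-k (token_id, probability) pairs for the next
    token of the first sequence in the batch."""
    last = logits[0, -1, :]                 # distribution over the next token
    probs = torch.softmax(last, dim=-1)     # normalize logits to probabilities
    top = torch.topk(probs, k=k)
    return list(zip(top.indices.tolist(), top.values.tolist()))
```

With the snippet above, `next_token_probs(outputs.logits)` lists the model's most likely continuations; decode each id with `tokenizer.decode([token_id])`.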

> ⚠️ **Note:** The HuggingFace interface is for training/forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.

## Performance

| Benchmark | Qwen2.5-7B | WeDLM-7B |
|:----------|:----------:|:--------:|
| ARC-C (0-shot) | 89.93 | 90.70 |
| GSM8K (3-shot) | 79.23 | 84.76 |
| MATH (4-shot) | 43.40 | 48.20 |
| HumanEval (4-shot) | 59.14 | 68.90 |
| MMLU (5-shot) | 71.62 | 71.93 |

## Citation

```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  year={2025}
}
```

## License

Apache 2.0