---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B). It performs parallel decoding under standard causal attention.

**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- 📈 Outperforms the Qwen3-8B-Instruct baseline on most benchmarks
- ✅ Natively compatible with the standard KV-cache stack (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).

📄 Paper (Coming Soon) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B) |
| Parameters | 8B |
| Context Length | 32,768 tokens |

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```
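
To confirm the install, the imports used throughout this card should resolve (a minimal sanity check):

```python
# Quick sanity check: these are the imports used in the examples below.
from wedlm import LLM, SamplingParams
print("wedlm engine available")
```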

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

# Load the parallel-decoding engine and the matching tokenizer.
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
# Render the chat template to a plain string before passing it to the engine.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
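
To keep a dialogue going, append the model's reply and the next user turn to `messages`, then regenerate. A minimal sketch reusing `llm`, `tokenizer`, and the output format from the examples above (the follow-up question is illustrative):

```python
# Extend the conversation with the generated answer and a new question.
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And the integral of x^3?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```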

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
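
Parallel decoding pays off most at batch scale. A rough way to observe throughput is to time a batch and re-encode the completions to count tokens (a minimal sketch; re-encoding gives only an approximate token count):

```python
import time

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
elapsed = time.perf_counter() - start

# Approximate the generated-token count by re-encoding the decoded text.
n_tokens = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{n_tokens} tokens in {elapsed:.2f}s ≈ {n_tokens / elapsed:.1f} tok/s")
```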

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)  # a single forward pass; logits are in outputs.logits
```
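
Since the model loads through `AutoModelForCausalLM`, the standard language-modeling loss is available by passing `labels`. A minimal training-style sketch, assuming the remote code follows the usual causal-LM interface (a real fine-tuning setup would mask prompt tokens and iterate over a dataset):

```python
# Reuse `model` and `inputs` from above; use the input ids as labels
# for a quick next-token-prediction loss.
labels = inputs["input_ids"].clone()

model.train()
out = model(**inputs, labels=labels)
print(out.loss)       # cross-entropy over shifted next-token targets
out.loss.backward()   # gradients flow as in any causal LM
```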

> ⚠️ **Note:** The HuggingFace interface is provided for training and simple forward passes. For optimized inference throughput, use the `wedlm` engine above.

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

## Citation (Coming soon)

## License

Apache 2.0