jianchen0311 committed · Commit 98ca0e3 · verified · 1 Parent(s): 0c69806

Update README.md

Files changed (1): README.md (+56 −6)
README.md CHANGED
@@ -37,21 +37,21 @@ This result highlights the **training efficiency and scalability** of DFlash, an
 ## 🚀 Quick Start
 
 ### SGLang
-DFlash is now supported on SGLang. And vLLM integration is currently in progress.
 
 #### Installation
 ```bash
 uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
 ```
 
-#### Inference
+#### Launch Server
 ```bash
-export SGLANG_ENABLE_SPEC_V2=1
-export SGLANG_ENABLE_DFLASH_SPEC_V2=1
-export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+# Optional: enable schedule overlapping (experimental, may not be stable)
+# export SGLANG_ENABLE_SPEC_V2=1
+# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
 
 python -m sglang.launch_server \
-    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
+    --model-path Qwen/Qwen3-Coder-30B-A3B \
     --speculative-algorithm DFLASH \
     --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
     --tp-size 1 \
@@ -61,6 +61,56 @@ python -m sglang.launch_server \
     --trust-remote-code
 ```
 
 
+#### Usage
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-Coder-30B-A3B",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=2048,
+    temperature=0.0,
+)
+print(response.choices[0].message.content)
+```
+
+### vLLM
+
+#### Installation
+
+```bash
+uv pip install vllm
+uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+```
+
+#### Launch Server
+
+```bash
+vllm serve Qwen/Qwen3-Coder-30B-A3B \
+    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3-Coder-30B-A3B-DFlash", "num_speculative_tokens": 15}' \
+    --attention-backend flash_attn \
+    --max-num-batched-tokens 32768
+```
+
+#### Usage
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-Coder-30B-A3B",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=2048,
+    temperature=0.0,
+)
+print(response.choices[0].message.content)
+```
+
 ### Transformers
 
 #### Installation
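
A note on `num_speculative_tokens` in the vLLM `--speculative-config` above: speculative decoding keeps the drafted tokens only up to the first one the target model rejects, so the payoff of a longer draft window flattens out quickly. The sketch below is a back-of-envelope model only — it assumes each draft token is accepted independently with a fixed probability `alpha`, which is a textbook simplification from the speculative-decoding literature, not a measured property of DFlash:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model step when drafting k tokens,
    assuming each draft token is accepted i.i.d. with probability alpha.
    Geometric sum: 1 + alpha + alpha^2 + ... + alpha^k."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Diminishing returns: widening the window from k=8 to k=15 helps
# little unless the acceptance rate is high.
for alpha in (0.5, 0.8, 0.95):
    print(alpha,
          round(expected_tokens_per_step(alpha, 8), 2),
          round(expected_tokens_per_step(alpha, 15), 2))
```

Actual wall-clock speedup also depends on the draft model's cost per token, so the window size is best tuned empirically against a measured acceptance rate.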