jianchen0311 committed · Commit 98ca0e3 · verified · 1 Parent(s): 0c69806

Update README.md

Files changed (1): README.md (+56 −6)
README.md CHANGED
@@ -37,21 +37,21 @@ This result highlights the **training efficiency and scalability** of DFlash, an
 ## 🚀 Quick Start
 
 ### SGLang
-DFlash is now supported on SGLang. And vLLM integration is currently in progress.
 
 #### Installation
 ```bash
 uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
 ```
 
-#### Inference
+#### Launch Server
 ```bash
-export SGLANG_ENABLE_SPEC_V2=1
-export SGLANG_ENABLE_DFLASH_SPEC_V2=1
-export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+# Optional: enable schedule overlapping (experimental, may not be stable)
+# export SGLANG_ENABLE_SPEC_V2=1
+# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
 
 python -m sglang.launch_server \
-    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
+    --model-path Qwen/Qwen3-Coder-30B-A3B \
     --speculative-algorithm DFLASH \
     --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
     --tp-size 1 \
@@ -61,6 +61,56 @@ python -m sglang.launch_server \
     --trust-remote-code
 ```
 
 
+#### Usage
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-Coder-30B-A3B",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=2048,
+    temperature=0.0,
+)
+print(response.choices[0].message.content)
+```
+
+### vLLM
+
+#### Installation
+
+```bash
+uv pip install vllm
+uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+```
+
+#### Launch Server
+
+```bash
+vllm serve Qwen/Qwen3-Coder-30B-A3B \
+    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3-Coder-30B-A3B-DFlash", "num_speculative_tokens": 15}' \
+    --attention-backend flash_attn \
+    --max-num-batched-tokens 32768
+```
+
+#### Usage
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-Coder-30B-A3B",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=2048,
+    temperature=0.0,
+)
+print(response.choices[0].message.content)
+```
+
 ### Transformers
 
 #### Installation
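
A note on `num_speculative_tokens` in the vLLM `--speculative-config` above: speculative decoding keeps the drafted tokens only up to the first one the target model rejects, so the payoff of a longer draft window flattens out quickly. The sketch below is a back-of-envelope model only — it assumes each draft token is accepted independently with a fixed probability `alpha`, which is a textbook simplification from the speculative-decoding literature, not a measured property of DFlash:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model step when drafting k tokens,
    assuming each draft token is accepted i.i.d. with probability alpha.
    Geometric sum: 1 + alpha + alpha^2 + ... + alpha^k."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Diminishing returns: widening the window from k=8 to k=15 helps
# little unless the acceptance rate is high.
for alpha in (0.5, 0.8, 0.95):
    print(alpha,
          round(expected_tokens_per_step(alpha, 8), 2),
          round(expected_tokens_per_step(alpha, 15), 2))
```

Actual wall-clock speedup also depends on the draft model's cost per token, so the window size is best tuned empirically against a measured acceptance rate.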