jianchen0311 committed on
Commit d3af30d · verified · Parent(s): 7634bc0

Update README.md

Files changed (1):
  1. README.md +55 -6
README.md CHANGED
@@ -30,19 +30,18 @@ This model is the **drafter** component. It must be used in conjunction with the
  ## 🚀 Quick Start

  ### SGLang
- DFlash is now supported on SGLang. And vLLM integration is currently in progress.

  #### Installation
  ```bash
  uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
  ```

- #### Inference
+ #### Launch Server
  ```bash
- export SGLANG_ENABLE_SPEC_V2=1
- export SGLANG_ENABLE_DFLASH_SPEC_V2=1
- export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
- export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
+ # Optional: enable schedule overlapping (experimental, may not be stable)
+ # export SGLANG_ENABLE_SPEC_V2=1
+ # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+ # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

  python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
@@ -55,6 +54,56 @@ python -m sglang.launch_server \
      --trust-remote-code
  ```

+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=2048,
+     temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
+ ### vLLM
+
+ #### Installation
+
+ ```bash
+ uv pip install vllm
+ uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ #### Launch Server
+
+ ```bash
+ vllm serve meta-llama/Llama-3.1-8B-Instruct \
+     --speculative-config '{"method": "dflash", "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat", "num_speculative_tokens": 9}' \
+     --attention-backend flash_attn \
+     --max-num-batched-tokens 32768
+ ```
+
+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=2048,
+     temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
  ### Transformers

  #### Installation
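
The `--speculative-config` value in the vLLM launch command added by this commit is a JSON object embedded in a single shell argument, where hand-written quoting is easy to get wrong. A minimal sketch of generating that argument with `json.dumps` instead (values copied from the diff above; the surrounding `vllm serve` command is unchanged):

```python
import json

# Speculative-decoding config from the vLLM launch command in this commit,
# built as a dict and serialized to the JSON string the CLI flag expects.
spec_config = {
    "method": "dflash",
    "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    "num_speculative_tokens": 9,
}

# Single-quote the result in the shell: --speculative-config '<this string>'
spec_config_arg = json.dumps(spec_config)
print(spec_config_arg)
```

Because the string is produced by `json.dumps`, it is guaranteed to parse back into the same dict, which avoids silent failures from mismatched braces or quotes.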