jianchen0311 committed on
Commit d3af30d · verified · Parent(s): 7634bc0

Update README.md

Files changed (1):
  1. README.md +55 -6
README.md CHANGED
@@ -30,19 +30,18 @@ This model is the **drafter** component. It must be used in conjunction with the
  ## 🚀 Quick Start

  ### SGLang
- DFlash is now supported on SGLang. And vLLM integration is currently in progress.

  #### Installation
  ```bash
  uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
  ```

- #### Inference
+ #### Launch Server
  ```bash
- export SGLANG_ENABLE_SPEC_V2=1
- export SGLANG_ENABLE_DFLASH_SPEC_V2=1
- export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
- export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
+ # Optional: enable schedule overlapping (experimental, may not be stable)
+ # export SGLANG_ENABLE_SPEC_V2=1
+ # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+ # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

  python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
@@ -55,6 +54,56 @@ python -m sglang.launch_server \
      --trust-remote-code
  ```

+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=2048,
+     temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
+ ### vLLM
+
+ #### Installation
+
+ ```bash
+ uv pip install vllm
+ uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ #### Launch Server
+
+ ```bash
+ vllm serve meta-llama/Llama-3.1-8B-Instruct \
+     --speculative-config '{"method": "dflash", "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat", "num_speculative_tokens": 9}' \
+     --attention-backend flash_attn \
+     --max-num-batched-tokens 32768
+ ```
+
+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="meta-llama/Llama-3.1-8B-Instruct",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=2048,
+     temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
  ### Transformers

  #### Installation
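
The `--speculative-config` value in the vLLM launch command added by this commit is a JSON object embedded in a single shell argument, where hand-written quoting is easy to get wrong. A minimal sketch of generating that argument with `json.dumps` instead (values copied from the diff above; the surrounding `vllm serve` command is unchanged):

```python
import json

# Speculative-decoding config from the vLLM launch command in this commit,
# built as a dict and serialized to the JSON string the CLI flag expects.
spec_config = {
    "method": "dflash",
    "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    "num_speculative_tokens": 9,
}

# Single-quote the result in the shell: --speculative-config '<this string>'
spec_config_arg = json.dumps(spec_config)
print(spec_config_arg)
```

Because the string is produced by `json.dumps`, it is guaranteed to parse back into the same dict, which avoids silent failures from mismatched braces or quotes.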