zhanghanxiao committed · commit 889c262 · verified · 1 parent: f841857

Update README.md

Files changed (1): README.md (+46 -11)

README.md (after this commit):
#### Environment Preparation

We will submit our model to the official SGLang release later; for now, you can prepare the environment with:

```shell
pip3 install -U sglang sgl-kernel
```

#### Run Inference

SGLang supports both the BF16 and FP8 models; which one is served depends on the dtype of the checkpoint in ${MODEL_PATH}.

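To check which variant you have, the dtype is recorded in the checkpoint's `config.json`. A small sketch, assuming the standard Hugging Face layout (`torch_dtype` for BF16, an additional `quantization_config` block for FP8 checkpoints):

```python
import json
import os

# MODEL_PATH is the same directory passed to sglang.launch_server below.
with open(os.path.join(os.environ["MODEL_PATH"], "config.json")) as f:
    cfg = json.load(f)

# BF16 checkpoints typically report torch_dtype "bfloat16"; FP8 checkpoints
# usually carry a quantization_config block as well.
print("torch_dtype:", cfg.get("torch_dtype"))
print("quantization_config:", cfg.get("quantization_config", "none (unquantized weights)"))
```
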
Here is an example of running Ring-1T across multiple GPU nodes (tp-size 8 x pp-size 4 over 4 nodes, i.e. 8 GPUs per node), where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

- Start server:
```bash
# Node 0:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0

# Node 1:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1

# Node 2:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2

# Node 3:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3

# This is only an example. Please adjust arguments according to your actual environment.
```

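Loading a checkpoint of this size can take a long time, and requests sent before all four ranks have joined will fail. Before moving on to the client, you can wait for the server with a minimal readiness probe; this is a sketch, assuming SGLang's `/health` endpoint and the `requests` package:

```python
import os
import time

import requests  # pip install requests

base = f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}"

# Poll until all four ranks have joined and the server answers; a model of
# this size can take many minutes to load, so the deadline is generous.
deadline = time.time() + 3600
while time.time() < deadline:
    try:
        if requests.get(f"{base}/health", timeout=5).status_code == 200:
            print("server is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
else:
    raise RuntimeError("server did not become ready within an hour")
```
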
- Client:

```shell
# The request body below is illustrative; set "model" to the id reported by
# GET /v1/models, and adjust the prompt and sampling parameters as needed.
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Who are you?"}]}'
```

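Because the server exposes an OpenAI-compatible API, the same request can be sent from Python. A short sketch; the model id and prompt are placeholders:

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # SGLang does not require a real key by default
)

completion = client.chat.completions.create(
    model="default",  # placeholder; use the id returned by client.models.list()
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(completion.choices[0].message.content)
```
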
More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).

### vLLM

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Run Inference

Here is an example of deploying the model across multiple GPU nodes with Ray, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:

```bash
# Step 1. Start Ray on every node; the commands below are typical defaults,
# adjust them to your cluster:
#   node 0:     ray start --head --port=6379
#   nodes 1-3:  ray start --address=${MASTER_IP}:6379

# Step 2. Start the vLLM server on node 0 only
# (tensor-parallel 8 x pipeline-parallel 4 = 32 GPUs across the 4 nodes):
vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85

# This is only an example. Please adjust arguments according to your actual environment.
```

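Once Ray and the server are up, a quick smoke test can be run from any node. This sketch assumes the `--served-model-name my_model` used above and the `openai` package:

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # vLLM accepts any key unless --api-key is set
)

# Should list exactly the name passed via --served-model-name.
print([m.id for m in client.models.list()])

reply = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```
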
To handle long context in vLLM using YaRN, follow these two steps (a scripted version of step 1 follows the list):
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
2. Pass the additional `--max-model-len` parameter to specify the desired maximum context length when starting the vLLM service.

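If you prefer to script the edit rather than change the file by hand, a sketch that writes the same `rope_scaling` block shown above (assuming ${MODEL_PATH} points at the checkpoint directory):

```python
import json
import os

cfg_path = os.path.join(os.environ["MODEL_PATH"], "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Same values as the example above: 4x YaRN over a 32K native window.
cfg["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

# 32768 * 4 = 131072, so a matching launch flag would be --max-model-len 131072.
print("rope_scaling written to", cfg_path)
```
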
For detailed guidance, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/).

## Finetuning

We recommend using [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to [finetune Ring](https://github.com/inclusionAI/Ring-V2/blob/main/docs/llamafactory_finetuning.md).