zhanghanxiao committed
Commit 67c3090 · verified · 1 Parent(s): 4a39946

Update README.md

Files changed (1)
  1. README.md +29 -60

README.md CHANGED
@@ -175,39 +175,38 @@ print(completion.choices[0].message.content)
 
 #### Environment Preparation
 
- We will later submit our model to SGLang official release, now we can prepare the environment following steps:
 ```shell
- pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
- ```
- You can use docker image as well:
- ```shell
- docker pull lmsysorg/sglang:v0.5.2rc0-cu126
- ```
- Then you should apply patch to sglang installation:
- ```shell
- # patch command is needed, run `yum install -y patch` if needed
- patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
 ```
 
 #### Run Inference
 
- BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}. They both share the same command in the following:
 
 - Start server:
- ```shell
- python -m sglang.launch_server \
- --model-path $MODLE_PATH \
- --host 0.0.0.0 --port $PORT \
- --trust-remote-code \
- --attention-backend fa3
 ```
- MTP is supported for base model, and not yet for chat model. You can add parameter `--speculative-algorithm NEXTN`
- to start command.
 
 - Client:
 
 ```shell
- curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
 ```
@@ -216,54 +215,26 @@ More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.h
 
 ### vLLM
 
- vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.
 
 #### Environment Preparation
 
- Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:
-
 ```bash
- git clone -b v0.10.0 https://github.com/vllm-project/vllm.git
- cd vllm
- wget https://raw.githubusercontent.com/inclusionAI/Ring-V2/refs/heads/main/inference/vllm/bailing_moe_v2.patch
- git apply bailing_moe_v2.patch
- pip install -e .
 ```
 
- #### Offline Inference:
-
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-1T") # Changed from Ring-flash-2.0 for consistency
 
- sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
 
- llm = LLM(model="inclusionAI/Ring-1T", dtype='bfloat16') # Changed from Ring-flash-2.0 for consistency
- prompt = "Give me a short introduction to large language models."
- messages = [
- {"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
- {"role": "user", "content": prompt}
- ]
-
- text = tokenizer.apply_chat_template(
- messages,
- tokenize=False,
- add_generation_prompt=True
- )
- outputs = llm.generate([text], sampling_params)
 
- ```
 
- #### Online Inference:
 
- ```bash
- vllm serve inclusionAI/Ring-1T \
- --tensor-parallel-size 2 \
- --pipeline-parallel-size 1 \
- --use-v2-block-manager \
- --gpu-memory-utilization 0.90
 ```
 
 To handle long context in vLLM using YaRN, we need to follow these two steps:
@@ -280,8 +251,6 @@ To handle long context in vLLM using YaRN, we need to follow these two steps:
 ```
 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
 
- For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
-
 
 ## Finetuning
 
 
 #### Environment Preparation
 
+ We will submit our model to the official SGLang release later. For now, prepare the environment as follows:
 ```shell
+ pip3 install -U sglang sgl-kernel
 ```
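
A quick sanity check before launching the server (a minimal sketch; it assumes the packages were installed into the active Python environment and expose their versions in the usual way):

```shell
# Confirm both packages are installed and report their versions.
pip3 show sglang sgl-kernel
python3 -c "import sglang; print(sglang.__version__)"
```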
 
 #### Run Inference
 
+ SGLang supports both the BF16 and FP8 models; which one is served depends on the dtype of the model in ${MODEL_PATH}.
+
+ Here is an example of running Ring-1T across multiple GPU nodes, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:
 
 - Start server:
+ ```bash
+ # Node 0:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0
+
+ # Node 1:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1
+
+ # Node 2:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2
+
+ # Node 3:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3
+
+ # This is only an example. Please adjust arguments according to your actual environment.
 ```
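
The flags above imply 32 GPUs in total (tensor-parallel 8 × pipeline-parallel 4, spread over 4 nodes with 8 GPUs each). Once all four ranks have joined and the weights are loaded, a quick readiness check against the OpenAI-compatible endpoint on the master node (a minimal sketch; it assumes the server has finished starting) is:

```shell
# A JSON listing of registered models indicates the server is up and reachable.
curl -s http://${MASTER_IP}:${PORT}/v1/models
```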
 
 
 
 - Client:
 
 ```shell
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
 ```
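
The same request can be issued from Python through the OpenAI-compatible API (a minimal sketch; it assumes the `openai` package is installed and that the placeholder address and port from the launch step are filled in):

```python
from openai import OpenAI

# Point the client at the SGLang server started above; the API key is not checked.
client = OpenAI(base_url="http://MASTER_IP:PORT/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```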
 
 
 ### vLLM
 
+ For the latest guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/projects/recipes/en/latest/inclusionAI/Ring-1T-FP8.html).
 
 #### Environment Preparation
 
 ```bash
+ pip install vllm==0.11.0
 ```
 
+ #### Run Inference
 
+ Here is an example of deploying the model across multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:
 
+ ```bash
+ # step 1. start ray on all nodes
+
+ # step 2. start vllm server only on node 0:
+ vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
+
+ # This is only an example. Please adjust arguments according to your actual environment.
 ```
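
Step 1 above only names the action; a typical way to bring up the Ray cluster (a minimal sketch, assuming the default Ray port 6379 and that ${MASTER_IP} is reachable from every node) would be:

```bash
# On node 0 (the head node):
ray start --head --port=6379

# On every other node, join the cluster:
ray start --address="${MASTER_IP}:6379"
```

With the cluster running, the `vllm serve` command in step 2 should be able to schedule its workers across all four nodes, since the requested tensor × pipeline parallelism spans more GPUs than a single node provides.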
 
 To handle long context in vLLM using YaRN, we need to follow these two steps:
 
 ```
 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
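
For example (a sketch only; 131072 is an illustrative target context length, and the other flags mirror the multi-node launch above), step 2 amounts to:

```bash
vllm serve $MODEL_PATH --port $PORT --trust-remote-code \
  --tensor-parallel-size 8 --pipeline-parallel-size 4 \
  --max-model-len 131072
```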
 
 ## Finetuning