Update README.md
README.md

#### Environment Preparation

We will later submit our model to the SGLang official release. For now, prepare the environment as follows:
```shell
pip3 install -U sglang sgl-kernel
```

#### Run Inference

Both BF16 and FP8 models are supported by SGLang now; which one is used depends on the dtype of the model in ${MODEL_PATH}.

Here is an example of running Ring-1T on multiple GPU nodes, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

- Start server:
```bash
# Node 0:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0

# Node 1:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1

# Node 2:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2

# Node 3:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3

# This is only an example. Please adjust the arguments according to your actual environment.
```

- Client:

```shell
# A representative request; adjust the model name and message body as needed.
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```

More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
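
The server speaks the OpenAI-compatible API, so you can also query it from Python with the `openai` package. The snippet below is a minimal sketch; the model name and prompt are illustrative placeholders:

```python
import os

from openai import OpenAI  # pip install openai

# Reuse the ${MASTER_IP} and ${PORT} values from the server launch above.
client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # the local server does not require a real key
)

completion = client.chat.completions.create(
    model="default",  # placeholder; use the name your server reports under /v1/models
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```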

### vLLM

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Run Inference

Here is an example of deploying the model on multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:

```bash
# Step 1. Start Ray on all nodes. Typical commands (adjust the port to your setup):
#   master node:       ray start --head --port=6379
#   each worker node:  ray start --address=${MASTER_IP}:6379

# Step 2. Start the vLLM server on node 0 only:
vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85

# This is only an example. Please adjust the arguments according to your actual environment.
```
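
Once the server is up, a quick sanity check from Python is to list the served models and send a single request through the OpenAI-compatible endpoint. This is a sketch; `my_model` matches the `--served-model-name` used above:

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # vLLM only checks this if the server sets --api-key
)

# Should print "my_model", the name passed via --served-model-name.
for model in client.models.list():
    print(model.id)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```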

To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
2. Pass the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service; with the 4.0 factor above, that is up to 32768 × 4 = 131072 tokens. A scripted version of step 1 is sketched after this list.
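
If you prefer to script step 1, the `rope_scaling` entry can be patched into `config.json` programmatically. A minimal sketch under the same 4.0-factor assumption as above (back up the file before editing):

```python
import json
import os

# Patch rope_scaling into the model's config.json (step 1 above).
config_path = os.path.join(os.environ["MODEL_PATH"], "config.json")

with open(config_path) as f:
    config = json.load(f)

# 4x YaRN scaling over the 32768-token base context -> up to 131072 tokens.
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Then start the server with, e.g.: vllm serve $MODEL_PATH --max-model-len 131072 ...
```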

For detailed guidance, please refer to the vLLM [instructions](https://docs.vllm.ai/en/latest/).

## Finetuning

We recommend using [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to [finetune Ring](https://github.com/inclusionAI/Ring-V2/blob/main/docs/llamafactory_finetuning.md).