inclusionAI/Ring-1T-FP8

@@ -175,39 +175,38 @@ print(completion.choices[0].message.content)
 #### Environment Preparation
-We will later submit our model to SGLang official release, now we can prepare the environment following steps:
 ```shell
-pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
-```
-You can use docker image as well:
-```shell
-docker pull lmsysorg/sglang:v0.5.2rc0-cu126
-```
-Then you should apply patch to sglang installation:
-```shell
-# patch command is needed, run `yum install -y patch` if needed
-patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
 ```
 #### Run Inference
-BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}. They both share the same command in the following:
 - Start server:
-```shell
-python -m sglang.launch_server \
-    --model-path $MODLE_PATH \
-    --host 0.0.0.0 --port $PORT \
-    --trust-remote-code \
-    --attention-backend fa3
 ```
-MTP is supported for base model, and not yet for chat model. You can add parameter `--speculative-algorithm NEXTN`
-to start command.
 - Client:
 ```shell
-curl -s http://localhost:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
 ```
@@ -216,54 +215,26 @@ More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.h
 ### vLLM
-vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.
 #### Environment Preparation
-Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:
 ```bash
-git clone -b v0.10.0 https://github.com/vllm-project/vllm.git
-cd vllm
-wget https://raw.githubusercontent.com/inclusionAI/Ring-V2/refs/heads/main/inference/vllm/bailing_moe_v2.patch
-git apply bailing_moe_v2.patch
-pip install -e .
 ```
-#### Offline Inference:
-```python
-from transformers import AutoTokenizer
-from vllm import LLM, SamplingParams
-tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-1T") # Changed from Ring-flash-2.0 for consistency
-sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
-llm = LLM(model="inclusionAI/Ring-1T", dtype='bfloat16') # Changed from Ring-flash-2.0 for consistency
-prompt = "Give me a short introduction to large language models."
-messages = [
-    {"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
-    {"role": "user", "content": prompt}
-]
-text = tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True
-)
-outputs = llm.generate([text], sampling_params)
-```
-#### Online Inference:
-```bash
-vllm serve inclusionAI/Ring-1T \
-              --tensor-parallel-size 2 \
-              --pipeline-parallel-size 1 \
-              --use-v2-block-manager \
-              --gpu-memory-utilization 0.90
 ```
 To handle long context in vLLM using YaRN, we need to follow these two steps:
@@ -280,8 +251,6 @@ To handle long context in vLLM using YaRN, we need to follow these two steps:
 ```
 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
-For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 ## Finetuning

 #### Environment Preparation
+We will later submit our model to the official SGLang release. For now, prepare the environment with the following steps:
 ```shell
+pip3 install -U sglang sgl-kernel
 ```
 #### Run Inference
+SGLang now supports both BF16 and FP8 models; which one runs depends on the dtype of the model in ${MODEL_PATH}.
+Here is an example of running Ring-1T across multiple GPU nodes, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:
 - Start server:
+```bash
+# Node 0:
+python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0
+# Node 1:
+python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1
+# Node 2:
+python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2
+# Node 3:
+python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3
+# This is only an example. Please adjust arguments according to your actual environment.
 ```
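The four launch commands above differ only in `--node-rank`. As a minimal sketch (not part of the official instructions), the helper below generates them from a few parameters so that the shared flags stay in sync. The function name `sglang_launch_commands` and the example path, IP, and port are hypothetical; the flags themselves are copied from the commands above. With `--tp-size 8` and `--pp-size 4`, the example presumably spans 8 × 4 = 32 GPUs, i.e. 8 per node on 4 nodes.

```python
import shlex


def sglang_launch_commands(model_path, master_ip, port,
                           nnodes=4, tp_size=8, pp_size=4, dist_port=2345):
    """Return one launch command per node; only --node-rank differs."""
    commands = []
    for rank in range(nnodes):
        args = [
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp-size", str(tp_size),
            "--pp-size", str(pp_size),
            "--dp-size", "1",
            "--trust-remote-code",
            # Every node points at the same rendezvous address on the master node.
            "--dist-init-addr", f"{master_ip}:{dist_port}",
            "--port", str(port),
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
        ]
        commands.append(shlex.join(args))
    return commands


# Hypothetical values for illustration only.
for rank, cmd in enumerate(
        sglang_launch_commands("/models/Ring-1T-FP8", "10.0.0.1", 30000)):
    print(f"# Node {rank}:")
    print(cmd)
```

Run each printed command on the node with the matching rank, and adjust the sizes to your actual cluster.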
 - Client:
 ```shell
+curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
 ```
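The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat`, and the default timeout, are assumptions made for illustration. The response is assumed to follow the OpenAI-style shape (`choices[0].message.content`) used elsewhere in this README.

```python
import json
import urllib.request


def build_chat_request(master_ip, port, content, model="auto"):
    """Build the POST request that the curl command above sends."""
    url = f"http://{master_ip}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(master_ip, port, content, model="auto", timeout=600):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(master_ip, port, content, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("10.0.0.1", 30000, "What is the capital of France?"))
```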
 ### vLLM
+For the latest guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/projects/recipes/en/latest/inclusionAI/Ring-1T-FP8.html).
 #### Environment Preparation
 ```bash
+pip install vllm==0.11.0
 ```
+#### Run Inference:
+Here is an example of deploying the model across multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:
+```bash
+# step 1. start ray on all nodes
+# step 2. start vllm server only on node 0:
+vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
+# This is only an example. Please adjust arguments according to your actual environment.
 ```
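Because the command above sets `--served-model-name my_model`, clients need to pass `my_model` as the `model` field in their requests. One way to confirm the name after startup is to query the server's OpenAI-compatible `/v1/models` endpoint. The sketch below assumes the response follows the OpenAI models-list format; the helper names and the sample body are illustrative, not output captured from a real server.

```python
import json
import urllib.request


def served_model_ids(models_response_text):
    """Extract model ids from a /v1/models response body."""
    body = json.loads(models_response_text)
    return [entry["id"] for entry in body.get("data", [])]


def fetch_served_model_ids(master_ip, port, timeout=30):
    """Query a running server and return the model names it serves."""
    url = f"http://{master_ip}:{port}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return served_model_ids(resp.read().decode("utf-8"))


# Illustrative response body (assumed shape, not real server output).
sample = '{"object": "list", "data": [{"id": "my_model", "object": "model"}]}'
print(served_model_ids(sample))

# Example (requires a running server):
# assert "my_model" in fetch_served_model_ids("10.0.0.1", 8000)
```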
 To handle long context in vLLM using YaRN, we need to follow these two steps:
 ```
 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
 ## Finetuning