Update README.md
README.md

#### Environment Preparation

We will later submit our model to the SGLang official release. For now, prepare the environment as follows:
```shell
pip3 install -U sglang sgl-kernel
```

#### Run Inference

Both BF16 and FP8 models are supported by SGLang now; which one is used depends on the dtype of the model in ${MODEL_PATH}.

Here is an example of running Ring-1T on multiple GPU nodes, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:

- Start server:
```bash
# Node 0:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0

# Node 1:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1

# Node 2:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2

# Node 3:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3

# This is only an example. Please adjust the arguments according to your actual environment.
```

- Client:

```shell
# A representative request; adjust the model name and message body as needed.
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```

More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
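
The server speaks the OpenAI-compatible API, so you can also query it from Python with the `openai` package. The snippet below is a minimal sketch; the model name and prompt are illustrative placeholders:

```python
import os

from openai import OpenAI  # pip install openai

# Reuse the ${MASTER_IP} and ${PORT} values from the server launch above.
client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # the local server does not require a real key
)

completion = client.chat.completions.create(
    model="default",  # placeholder; use the name your server reports under /v1/models
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```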

### vLLM

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Run Inference

Here is an example of deploying the model on multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:

```bash
# Step 1. Start Ray on all nodes. Typical commands (adjust the port to your setup):
#   master node:       ray start --head --port=6379
#   each worker node:  ray start --address=${MASTER_IP}:6379

# Step 2. Start the vLLM server on node 0 only:
vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85

# This is only an example. Please adjust the arguments according to your actual environment.
```
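
Once the server is up, a quick sanity check from Python is to list the served models and send a single request through the OpenAI-compatible endpoint. This is a sketch; `my_model` matches the `--served-model-name` used above:

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=f"http://{os.environ['MASTER_IP']}:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # vLLM only checks this if the server sets --api-key
)

# Should print "my_model", the name passed via --served-model-name.
for model in client.models.list():
    print(model.id)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```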

To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
2. Pass the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service; with the 4.0 factor above, that is up to 32768 × 4 = 131072 tokens. A scripted version of step 1 is sketched after this list.
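
If you prefer to script step 1, the `rope_scaling` entry can be patched into `config.json` programmatically. A minimal sketch under the same 4.0-factor assumption as above (back up the file before editing):

```python
import json
import os

# Patch rope_scaling into the model's config.json (step 1 above).
config_path = os.path.join(os.environ["MODEL_PATH"], "config.json")

with open(config_path) as f:
    config = json.load(f)

# 4x YaRN scaling over the 32768-token base context -> up to 131072 tokens.
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Then start the server with, e.g.: vllm serve $MODEL_PATH --max-model-len 131072 ...
```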

For detailed guidance, please refer to the vLLM [instructions](https://docs.vllm.ai/en/latest/).

## Finetuning

We recommend using [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to [finetune Ring](https://github.com/inclusionAI/Ring-V2/blob/main/docs/llamafactory_finetuning.md).