Update README.md
README.md CHANGED
@@ -74,9 +74,9 @@ What is truly exciting is that in the comparison with Qwen3-32B, Ring-flash-line

 <div align="center">

-| **Model** |
-| :----------------: | :----------: |
-| Ring-flash-linear-2.0 |
+| **Model** | **Context Length** | **Download** |
+| :----------------: | :----------------: | :----------: |
+| Ring-flash-linear-2.0 | 128K | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-flash-linear-2.0) <br>[🤖 Modelscope](https://modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0)|
 </div>

 ## Quickstart
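The Download column above can also be used programmatically. A minimal sketch for fetching the checkpoint from the Hugging Face repo listed in the table, assuming `huggingface_hub` is installed; the destination directory is an arbitrary example, not something the README specifies:

```python
# Minimal sketch: pull the Ring-flash-linear-2.0 checkpoint from the Hub.
# Requires `pip install huggingface_hub`; local_dir is an example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="inclusionAI/Ring-flash-linear-2.0",  # repo from the Download column
    local_dir="./Ring-flash-linear-2.0",          # example destination
)
print(f"Checkpoint downloaded to {local_path}")
```

The resulting directory can then be passed as `<model_path>` to the serving commands in the hunks below.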
@@ -140,8 +140,26 @@ print(responses)
 print("*" * 30)
 ```

-### SGLang
-
+### 🚀 SGLang
+
+#### Environment Preparation
+
+We will submit our model to the official SGLang release later; for now, prepare the environment as follows:
+```shell
+pip3 install sgl-kernel==0.3.9.post2 vllm==0.10.2
+```
+
+Then install our sglang whl package:
+```shell
+pip install https://github.com/inclusionAI/Ring-V2/blob/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl
+```
+
+#### Run Inference
+
+SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model in ${MODEL_PATH}. Both share the same commands below.
+
+- Start the server:
+```shell
 python -m sglang.launch_server \
     --model-path <model_path> \
     --trust-remote-code \
@@ -149,7 +167,17 @@ python -m sglang.launch_server \
     --disable-radix-cache \
     --json-model-override-args "{\"linear_backend\": \"seg_la\"}"
 ```
+
+- Client:
+
+```shell
+curl -s http://localhost:${PORT}/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+```
+
+More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
+
 ### vLLM
-
-
+TODO
 ## Citation
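The curl client above talks to SGLang's OpenAI-compatible `/v1/chat/completions` endpoint, so any OpenAI-style client works as well. A minimal Python sketch, assuming `pip install openai` and a server on port 30000 (SGLang's default; adjust to whatever `--port` you passed):

```python
# Minimal sketch: query the launched SGLang server through its
# OpenAI-compatible API. Assumes the server from the diff above is
# listening on localhost:30000; the api_key value is a dummy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="auto",  # same placeholder model name as in the curl example
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```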
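Which of the BF16 or FP8 paths the server takes is decided by the checkpoint itself, as the Run Inference note says. A quick way to inspect this, assuming a standard Hugging Face `config.json` layout (the `quantization_config` key marking FP8 checkpoints is an assumption, not something the README states):

```python
# Minimal sketch: inspect the checkpoint's dtype to see whether SGLang
# will serve it as BF16 or FP8. Replace <model_path> with the real path.
# Assumes a standard Hugging Face config.json layout.
import json
from pathlib import Path

cfg = json.loads((Path("<model_path>") / "config.json").read_text())
print("torch_dtype:", cfg.get("torch_dtype"))
print("quantization_config:", cfg.get("quantization_config"))  # present for FP8 (assumed)
```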