Update README.md
README.md
@@ -148,14 +148,14 @@ print("*" * 30)
 
 #### Environment Preparation
 
-We …
+We have submitted a [PR](https://github.com/sgl-project/sglang/pull/10917) to the official SGLang repository; until it is merged, prepare the environment as follows. First, install the community release of SGLang and the required packages:
 ```shell
-…
+pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
 ```
 
-Then you should install our sglang …
+Then install our SGLang wheel package:
 ```shell
-pip install …
+pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
 ```
 
 #### Run Inference
@@ -177,7 +177,7 @@ python -m sglang.launch_server \
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{"model": "auto", "messages": [{"role": "user", "content": "…
+  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
 ```
 
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
@@ -193,7 +193,7 @@ pip install torch==2.7.0 torchvision==0.22.0
 
 Then install our vLLM wheel package:
 ```shell
-pip install https://…
+pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
 ```
 
 #### Offline Inference
@@ -202,14 +202,13 @@ pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_li…
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
-tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-…
+tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")
 
-sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)
+sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)
 
 llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
 prompt = "Give me a short introduction to large language models."
 messages = [
-    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
     {"role": "user", "content": prompt}
 ]
 
@@ -226,7 +225,7 @@ outputs = llm.generate([text], sampling_params)
 vllm serve inclusionAI/Ring-flash-linear-2.0 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.90 \
-    --max-num-seqs 512 \
     --no-enable-prefix-caching
 ```
+
 ## Citation
#### Environment Preparation

We have submitted a [PR](https://github.com/sgl-project/sglang/pull/10917) to the official SGLang repository; until it is merged, prepare the environment as follows. First, install the community release of SGLang and the required packages:
```shell
pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
```

Then install our SGLang wheel package:
```shell
pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
```
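Because the wheel is installed with `--no-deps --force-reinstall` on top of pinned dependencies, it is easy to end up with a mismatched environment. A small sanity check can confirm the pins resolved as intended (a sketch, not part of the upstream instructions; the pins are taken from the install commands above):

```python
# Sanity check: compare installed package versions against the pins above.
from importlib import metadata

# Pins taken from the `pip install` command in this section.
EXPECTED = {
    "sglang": "0.5.2",
    "sgl-kernel": "0.3.9.post2",
    "vllm": "0.10.2",
    "torch": "2.8.0",
    "torchvision": "0.23.0",
}

def check_pins(expected, resolve=metadata.version):
    """Return {package: status}, where status is 'ok', 'missing', or 'mismatch: <ver>'."""
    report = {}
    for name, want in expected.items():
        try:
            have = resolve(name)
            report[name] = "ok" if have == want else f"mismatch: {have}"
        except metadata.PackageNotFoundError:
            report[name] = "missing"
    return report

if __name__ == "__main__":
    for pkg, status in check_pins(EXPECTED).items():
        print(f"{pkg}: {status}")
```

Anything other than `ok` across the board is a sign the forced reinstall clobbered a pinned dependency.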
#### Run Inference

…
```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
```

More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
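The same request can be issued from Python with only the standard library. This is a sketch under two assumptions: the server launched above is listening on the given port (30000 is SGLang's usual default, not stated in this section), and it serves the OpenAI-compatible response schema (`choices[0].message.content`):

```python
# Stdlib-only equivalent of the curl request above.
import json
from urllib import request

def chat_payload(content, model="auto", temperature=0.6):
    """Build the same JSON body the curl command sends."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": content}],
    }

def chat(port, content):
    """POST a chat completion to the local server and return the reply text."""
    req = request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(chat_payload(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat(30000, "Give me a short introduction to large language models."))
```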
|
|
|
…

Then install our vLLM wheel package:
```shell
pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
```

#### Offline Inference
|
|
|
|
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")

sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)

llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# …
```
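As an aside on the sampling settings: `temperature=0.6` sharpens the next-token distribution, while `top_p=1.0` disables nucleus filtering entirely. A stdlib-only sketch of the math these two knobs control (illustrative only, not vLLM's implementation):

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature: values below 1.0 sharpen the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_keep(probs, top_p):
    """Indices of the smallest high-probability prefix whose mass reaches top_p.
    With top_p=1.0 every token stays eligible."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return sorted(kept)
```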
…

```shell
vllm serve inclusionAI/Ring-flash-linear-2.0 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching
```
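Model loading under tensor parallelism can take a while, so it is convenient to probe the server before sending traffic. A sketch under two assumptions not stated in this section: `vllm serve` is on its default port 8000, and it exposes the OpenAI-compatible `/v1/models` endpoint:

```python
# Readiness probe for the server started above.
import json
import time
from urllib import error, request

def first_model_id(models_json):
    """Extract the first model id from an OpenAI-style /v1/models response."""
    return models_json["data"][0]["id"]

def wait_until_ready(base_url="http://localhost:8000", timeout_s=600, poll_s=5):
    """Poll /v1/models until the server answers; return the served model id."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with request.urlopen(f"{base_url}/v1/models") as resp:
                return first_model_id(json.load(resp))
        except (error.URLError, OSError):
            time.sleep(poll_s)
    raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")

# Example (requires the server): print(wait_until_ready())
```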
## Citation