curryandsun committed on
Commit 2097eb8 · verified · 1 Parent(s): 402252b

Update README.md

Files changed (1):
  1. README.md +9 -10

README.md CHANGED
@@ -148,14 +148,14 @@ print("*" * 30)
 
 #### Environment Preparation
 
-We will later submit our model to SGLang official release, now we can prepare the environment following steps:
 ```shell
-pip3 install sgl-kernel==0.3.9.post2 vllm==0.10.2
 ```
 
-Then you should install our sglang whl package:
 ```shell
-pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
 ```
 
 #### Run Inference
@@ -177,7 +177,7 @@ python -m sglang.launch_server \
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
 -H "Content-Type: application/json" \
--d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
 ```
 
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
@@ -193,7 +193,7 @@ pip install torch==2.7.0 torchvision==0.22.0
 
 Then you should install our vLLM wheel package:
 ```shell
-pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/vllm-0.8.5+cuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
 ```
 
 #### Offline Inference
@@ -202,14 +202,13 @@ pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_li
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
-tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-mini-linear-2.0")
 
-sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)
 
 llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
 prompt = "Give me a short introduction to large language models."
 messages = [
-    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
     {"role": "user", "content": prompt}
 ]
 
@@ -226,7 +225,7 @@ outputs = llm.generate([text], sampling_params)
 vllm serve inclusionAI/Ring-flash-linear-2.0 \
 --tensor-parallel-size 4 \
 --gpu-memory-utilization 0.90 \
---max-num-seqs 512 \
 --no-enable-prefix-caching
 ```
 
 ## Citation
 
 
 #### Environment Preparation
 
+We have submitted a [PR](https://github.com/sgl-project/sglang/pull/10917) to the official SGLang repository; until it is merged, prepare the environment as follows. First, install the community version of SGLang and the required packages:
 ```shell
+pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
 ```
 
+Then install our SGLang wheel package:
 ```shell
+pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
 ```
 
 #### Run Inference
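Before launching the server, a quick stdlib-only sanity check of the pinned installs can help. This is a sketch; `installed_version` is an illustrative helper, not part of SGLang or vLLM:

```python
from importlib import metadata

def installed_version(pkg: str):
    # Return the installed version string for a distribution,
    # or None if it is not present in the current environment.
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# Compare against the versions pinned in the install step above.
expected = {"sglang": "0.5.2", "sgl-kernel": "0.3.9.post2", "vllm": "0.10.2"}
mismatched = {p: (installed_version(p), v)
              for p, v in expected.items()
              if installed_version(p) != v}
```

After the steps above succeed, `mismatched` should be empty; note that the `--force-reinstall` of the custom wheel keeps the reported `sglang` version at `0.5.2`.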
 
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
 -H "Content-Type: application/json" \
+-d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
 ```
 
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
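The same request can be issued from Python with only the standard library. This is a sketch mirroring the curl call above; `build_chat_request` and `send_chat_request` are illustrative names, and the server from the launch step is assumed to be listening on `localhost:${PORT}`:

```python
import json
import urllib.request

def build_chat_request(prompt: str, temperature: float = 0.6) -> bytes:
    # JSON body for the OpenAI-compatible /v1/chat/completions endpoint,
    # matching the payload of the curl example above.
    body = {
        "model": "auto",
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(body).encode("utf-8")

def send_chat_request(port: int, prompt: str) -> str:
    # POST the request and return the first choice's message text.
    # Requires the SGLang server to be running locally.
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```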
 
 
 Then you should install our vLLM wheel package:
 ```shell
+pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
 ```
 
 #### Offline Inference
 
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
+tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")
 
+sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)
 
 llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
 prompt = "Give me a short introduction to large language models."
 messages = [
     {"role": "user", "content": prompt}
 ]
 
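As a rough illustration of what `temperature=0.6` and `top_p=1.0` in `SamplingParams` control: temperature rescales the logits before the softmax, and top-p (nucleus) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`. The following is a toy stdlib re-implementation of those semantics, not vLLM's actual code:

```python
import math

def sampling_distribution(logits, temperature=0.6, top_p=1.0):
    # Softmax over temperature-scaled logits.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the highest-probability tokens
    # until their cumulative mass reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}
```

With `top_p=1.0` (as in the snippet above) no tokens are filtered, and a temperature below 1.0 sharpens the distribution toward the highest-logit tokens.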
 
 vllm serve inclusionAI/Ring-flash-linear-2.0 \
 --tensor-parallel-size 4 \
 --gpu-memory-utilization 0.90 \
 --no-enable-prefix-caching
 ```
+
 ## Citation