Update README.md
README.md
@@ -148,14 +148,14 @@ print("*" * 30)
 
 #### Environment Preparation
 
-We …
+We have submitted a [PR](https://github.com/sgl-project/sglang/pull/10917) to the official SGLang repository; until it is merged, prepare the environment as follows. First, install the community release of SGLang and the required packages:
 ```shell
-…
+pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
 ```
 
-Then you should install our sglang …
+Then install our SGLang wheel package:
 ```shell
-pip install …
+pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
 ```
 
 #### Run Inference
@@ -177,7 +177,7 @@ python -m sglang.launch_server \
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{"model": "auto", "messages": [{"role": "user", "content": "…
+  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
 ```
 
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
@@ -193,7 +193,7 @@ pip install torch==2.7.0 torchvision==0.22.0
 
 Then install our vLLM wheel package:
 ```shell
-pip install https://…
+pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
 ```
 
 #### Offline Inference
@@ -202,14 +202,13 @@ pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_li…
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
-tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-…
+tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")
 
-sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)
+sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)
 
 llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
 prompt = "Give me a short introduction to large language models."
 messages = [
-    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
     {"role": "user", "content": prompt}
 ]
 
@@ -226,7 +225,7 @@ outputs = llm.generate([text], sampling_params)
 vllm serve inclusionAI/Ring-flash-linear-2.0 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.90 \
-    --max-num-seqs 512 \
     --no-enable-prefix-caching
 ```
+
 ## Citation
#### Environment Preparation

We have submitted a [PR](https://github.com/sgl-project/sglang/pull/10917) to the official SGLang repository; until it is merged, prepare the environment as follows. First, install the community release of SGLang and the required packages:
```shell
pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
```

Then install our SGLang wheel package:
```shell
pip install https://raw.githubusercontent.com/inclusionAI/Ring-V2/main/hybrid_linear/whls/sglang-0.5.2-py3-none-any.whl --no-deps --force-reinstall
```
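Because the wheel is installed with `--no-deps --force-reinstall` on top of pinned dependencies, it is easy to end up with a mismatched environment. A small sanity check can confirm the pins resolved as intended (a sketch, not part of the upstream instructions; the pins are taken from the install commands above):

```python
# Sanity check: compare installed package versions against the pins above.
from importlib import metadata

# Pins taken from the `pip install` command in this section.
EXPECTED = {
    "sglang": "0.5.2",
    "sgl-kernel": "0.3.9.post2",
    "vllm": "0.10.2",
    "torch": "2.8.0",
    "torchvision": "0.23.0",
}

def check_pins(expected, resolve=metadata.version):
    """Return {package: status}, where status is 'ok', 'missing', or 'mismatch: <ver>'."""
    report = {}
    for name, want in expected.items():
        try:
            have = resolve(name)
            report[name] = "ok" if have == want else f"mismatch: {have}"
        except metadata.PackageNotFoundError:
            report[name] = "missing"
    return report

if __name__ == "__main__":
    for pkg, status in check_pins(EXPECTED).items():
        print(f"{pkg}: {status}")
```

Anything other than `ok` across the board is a sign the forced reinstall clobbered a pinned dependency.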
#### Run Inference

…
```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
```

More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
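The same request can be issued from Python with only the standard library. This is a sketch under two assumptions: the server launched above is listening on the given port (30000 is SGLang's usual default, not stated in this section), and it serves the OpenAI-compatible response schema (`choices[0].message.content`):

```python
# Stdlib-only equivalent of the curl request above.
import json
from urllib import request

def chat_payload(content, model="auto", temperature=0.6):
    """Build the same JSON body the curl command sends."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": content}],
    }

def chat(port, content):
    """POST a chat completion to the local server and return the reply text."""
    req = request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(chat_payload(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat(30000, "Give me a short introduction to large language models."))
```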
|
|
|
…

Then install our vLLM wheel package:
```shell
pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
```

#### Offline Inference
|
|
|
|
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")

sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)

llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# …
```
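As an aside on the sampling settings: `temperature=0.6` sharpens the next-token distribution, while `top_p=1.0` disables nucleus filtering entirely. A stdlib-only sketch of the math these two knobs control (illustrative only, not vLLM's implementation):

```python
import math

def apply_temperature(logits, temperature):
    """Softmax of logits / temperature: values below 1.0 sharpen the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_keep(probs, top_p):
    """Indices of the smallest high-probability prefix whose mass reaches top_p.
    With top_p=1.0 every token stays eligible."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return sorted(kept)
```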
…

```shell
vllm serve inclusionAI/Ring-flash-linear-2.0 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching
```
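Model loading under tensor parallelism can take a while, so it is convenient to probe the server before sending traffic. A sketch under two assumptions not stated in this section: `vllm serve` is on its default port 8000, and it exposes the OpenAI-compatible `/v1/models` endpoint:

```python
# Readiness probe for the server started above.
import json
import time
from urllib import error, request

def first_model_id(models_json):
    """Extract the first model id from an OpenAI-style /v1/models response."""
    return models_json["data"][0]["id"]

def wait_until_ready(base_url="http://localhost:8000", timeout_s=600, poll_s=5):
    """Poll /v1/models until the server answers; return the served model id."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with request.urlopen(f"{base_url}/v1/models") as resp:
                return first_model_id(json.load(resp))
        except (error.URLError, OSError):
            time.sleep(poll_s)
    raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")

# Example (requires the server): print(wait_until_ready())
```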
## Citation