update README.md

README.md

---
license: mit
---

<p align="center">
  <img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
</p>

<p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a> | 🐙 <a href="https://zenmux.ai/inclusionai/ring-1t?utm_source=hf_inclusionAI">Experience Now</a></p>

## Model Downloads

You can download Ring-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope to speed up the download process.
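
If you prefer to script the download, here is a minimal sketch using the `huggingface_hub` package; the package choice and the local directory are only suggestions, not requirements:

```python
from huggingface_hub import snapshot_download

# Download the full Ring-1T repository; local_dir is only an example path.
snapshot_download(repo_id="inclusionAI/Ring-1T", local_dir="./Ring-1T")
```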

The model can also be queried through an OpenAI-compatible client:

```python
completion = client.chat.completions.create(...)
print(completion.choices[0].message.content)
```

### 🤗 Hugging Face Transformers

Here is a code snippet showing how to use the chat model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-1T"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### 🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ring-1T">ModelScope</a>.
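
An equivalent download sketch, assuming the `modelscope` Python package is installed:

```python
from modelscope import snapshot_download

# Resolves to ModelScope's servers, which are typically faster from mainland China.
model_dir = snapshot_download("inclusionAI/Ring-1T")
print(model_dir)
```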

## Deployment

### vLLM

vLLM supports both offline batched inference and launching an OpenAI-compatible API server for online inference.

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Offline Inference

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-1T")

sampling_params = SamplingParams(temperature=1.2, top_p=0.8, repetition_penalty=1.0, max_tokens=65536)

llm = LLM(model="inclusionAI/Ring-1T", dtype='bfloat16', trust_remote_code=True)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)  # the generated completion
```

#### Online Inference

```bash
vllm serve inclusionAI/Ring-1T \
    --tensor-parallel-size 32 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90

# This is only an example, please adjust arguments according to your actual environment.
```
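
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch; the base URL assumes vLLM's default local port 8000, and the API key is a placeholder:

```python
from openai import OpenAI

# Assumed local endpoint for the `vllm serve` command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="inclusionAI/Ring-1T",
    messages=[
        {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
print(completion.choices[0].message.content)
```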

To handle long contexts in vLLM using YaRN, follow these two steps:

1. Add a `rope_scaling` field to the model's `config.json` file, for example:

   ```json
   {
     ...,
     "rope_scaling": {
       "factor": 2.0,
       "original_max_position_embeddings": 65536,
       "type": "yarn"
     }
   }
   ```

2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service. A scripted version of step 1 is sketched below.
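
Here is a minimal sketch of step 1, assuming the checkpoint was downloaded to a local directory (the path is only an example):

```python
import json
from pathlib import Path

config_path = Path("./Ring-1T/config.json")  # example path to the downloaded checkpoint
config = json.loads(config_path.read_text())

# YaRN scaling: double the original 65536-token context window.
config["rope_scaling"] = {
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
    "type": "yarn",
}
config_path.write_text(json.dumps(config, indent=2))
```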

For detailed guidance, please refer to the vLLM [documentation](https://docs.vllm.ai/en/latest/).

### SGLang

#### Environment Preparation

We will submit our model to the official SGLang release later; for now, prepare the environment as follows:

```shell
pip3 install -U sglang sgl-kernel
```

#### Run Inference

SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model in `${MODEL_PATH}`.
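
To check which one a given checkpoint uses, here is a small sketch; it assumes a Hugging Face-style `config.json` whose `torch_dtype` field reflects the serialized dtype (FP8 checkpoints may instead record details under `quantization_config`):

```python
import json
import os

# ${MODEL_PATH} as used in the launch commands below.
model_path = os.environ["MODEL_PATH"]
with open(os.path.join(model_path, "config.json")) as f:
    config = json.load(f)

print("torch_dtype:", config.get("torch_dtype"))
print("quantization_config:", config.get("quantization_config"))
```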

Here is an example of running Ring-1T on multiple nodes, where the master node IP is `${MASTER_IP}` and the port is `${PORT}`:

- Start server:

```bash
# Node 0:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 0

# Node 1:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 1

# Node 2:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 2

# Node 3:
python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 3

# This is only an example, please adjust arguments according to your actual environment.
```

MTP is supported for the base model, but not yet for the chat model. You can enable it by adding `--speculative-algorithm NEXTN` to the start command.

- Client:

```shell
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```