rulixiang committed
Commit 3810e48 · 1 Parent(s): 8c325ad

update README.md

Files changed (1)
  1. README.md +25 -125
README.md CHANGED
@@ -2,6 +2,14 @@
  license: mit
  ---
 
+
+ <p align="center">
+ <img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
+ </p>
+
+ <p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;🐙 <a href="https://zenmux.ai/inclusionai/ring-1t?utm_source=hf_inclusionAI">Experience Now</a></p>
+
+
  ## Model Downloads
 
  You can download Ring-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope to speed up the download process.
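The download step described above can also be done from the command line. A minimal sketch, assuming the `huggingface-cli` tool that ships with `huggingface_hub` is installed and using an arbitrary local target directory:

```bash
# Install the Hugging Face Hub CLI (skip if already available).
pip install -U "huggingface_hub[cli]"

# Download the Ring-1T repository into a local directory of your choice.
huggingface-cli download inclusionAI/Ring-1T --local-dir ./Ring-1T
```

If you are located in mainland China, the ModelScope copy of the repository mentioned above can be used instead.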
@@ -54,156 +62,48 @@ completion = client.chat.completions.create(
  print(completion.choices[0].message.content)
  ```
 
- ### 🤗 Hugging Face Transformers
-
- Here is a code snippet showing how to use the chat model with `transformers`:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "inclusionAI/Ring-1T"
-
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     dtype="auto",
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=32768
- )
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- ```
-
- ### 🤖 ModelScope
-
- If you're in mainland China, we strongly recommend using our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ring-1T">ModelScope</a>.
 
  ## Deployment
 
- ### vLLM
-
- vLLM supports offline batched inference as well as launching an OpenAI-compatible API service for online inference.
-
- #### Environment Preparation
-
- ```bash
- pip install vllm==0.11.0
- ```
-
- #### Offline Inference:
-
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-1T")
-
- sampling_params = SamplingParams(temperature=1.2, top_p=0.8, repetition_penalty=1.0, max_tokens=65536)
-
- llm = LLM(model="inclusionAI/Ring-1T", dtype='bfloat16', trust_remote_code=True)
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
-
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- outputs = llm.generate([text], sampling_params)
-
- ```
-
- #### Online Inference:
-
- ```bash
- vllm serve inclusionAI/Ring-1T \
-     --tensor-parallel-size 32 \
-     --pipeline-parallel-size 1 \
-     --trust-remote-code \
-     --gpu-memory-utilization 0.90
-
- # This is only an example, please adjust arguments according to your actual environment.
- ```
-
- To handle long context in vLLM using YaRN, we need to follow these two steps:
- 1. Add a `rope_scaling` field to the model's `config.json` file, for example:
- ```json
- {
-     ...,
-     "rope_scaling": {
-         "factor": 2.0,
-         "original_max_position_embeddings": 65536,
-         "type": "yarn"
-     }
- }
- ```
- 2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
-
- For detailed guidance, please refer to the vLLM [instructions](https://docs.vllm.ai/en/latest/).
-
-
  ### SGLang
 
  #### Environment Preparation
 
  We will submit our model to the official SGLang release later; for now, you can prepare the environment with the following steps:
  ```shell
- pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
- ```
- You can use the docker image as well:
- ```shell
- docker pull lmsysorg/sglang:v0.5.2rc0-cu126
- ```
- Then you should apply the patch to the sglang installation:
- ```bash
- # The `patch` command is required; run `yum install -y patch` if needed.
- patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
+ pip3 install -U sglang sgl-kernel
  ```
 
  #### Run Inference
 
- BF16 and FP8 models are both supported by SGLang now; which one is used depends on the dtype of the model in ${MODEL_PATH}. Both share the same command shown below:
+ BF16 and FP8 models are both supported by SGLang now; which one is used depends on the dtype of the model in ${MODEL_PATH}.
+
+ Here is an example of running Ring-1T on multiple nodes, where the master node IP is ${MASTER_IP} and the port is ${PORT}:
 
  - Start server:
  ```bash
- python -m sglang.launch_server \
-     --model-path $MODEL_PATH \
-     --host 0.0.0.0 --port $PORT \
-     --trust-remote-code \
-     --attention-backend fa3
+ # Node 0:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 0
+
+ # Node 1:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 1
+
+ # Node 2:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 2
+
+ # Node 3:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 3
 
  # This is only an example, please adjust arguments according to your actual environment.
  ```
+
  MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN`
  to the start command.
 
  - Client:
 
  ```shell
- curl -s http://localhost:${PORT}/v1/chat/completions \
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
  ```
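For the base model, enabling the MTP speculative decoding mentioned in the diff only requires adding `--speculative-algorithm NEXTN` to the start command. A minimal sketch for node 0, reusing the multi-node arguments shown above (the remaining nodes are launched the same way with their own `--node-rank`):

```bash
# Same launch command as node 0 in the example above, with MTP speculative decoding
# enabled via --speculative-algorithm NEXTN (base model only; not yet supported for the chat model).
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp-size 8 --pp-size 4 --dp-size 1 \
    --trust-remote-code \
    --dist-init-addr $MASTER_IP:$PORT \
    --nnodes 4 --node-rank 0 \
    --speculative-algorithm NEXTN
```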
 