zhanghanxiao committed
Commit cd1b806 · verified · 1 Parent(s): 204fe58

Update README.md

Files changed (1):
  1. README.md +39 -114
README.md CHANGED
@@ -209,96 +209,67 @@ completion = client.chat.completions.create(
  print(completion.choices[0].message.content)
  ```
 
- ### 🤗 Hugging Face Transformers
-
- Here is a code snippet to show you how to use the chat model with `transformers`:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "inclusionAI/Ling-1T"
-
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     dtype="auto",
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512
- )
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  ```
 
- ### 🤖 ModelScope
 
- If you're in mainland China, we strongly recommend you to use our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ling-1T">ModelScope</a>.
 
- ## Deployment
 
- ### vLLM
 
- vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.
 
- #### Environment Preparation
 
- ```bash
- pip install vllm==0.11.0
- ```
 
- #### Offline Inference:
 
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-1T")
-
- sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
-
- llm = LLM(model="inclusionAI/Ling-1T", dtype='bfloat16', trust_remote_code=True)
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
-
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- outputs = llm.generate([text], sampling_params)
 
  ```
 
- #### Online Inference:
 
  ```bash
- vllm serve inclusionAI/Ling-1T \
-     --tensor-parallel-size 32 \
-     --pipeline-parallel-size 1 \
-     --trust-remote-code \
-     --gpu-memory-utilization 0.90
 
  # This is only an example, please adjust arguments according to your actual environment.
  ```
@@ -320,52 +291,6 @@ To handle long context in vLLM using YaRN, we need to follow these two steps:
  For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 
- ### SGLang
-
- #### Environment Preparation
-
- We will later submit our model to SGLang official release, now we can prepare the environment following steps:
- ```shell
- pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
- ```
- You can use docker image as well:
- ```shell
- docker pull lmsysorg/sglang:v0.5.2rc0-cu126
- ```
- Then you should apply patch to sglang installation:
- ```bash
- # patch command is needed, run `yum install -y patch` if needed
- patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
- ```
-
- #### Run Inference
-
- BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}. They both share the same command in the following:
-
- - Start server:
- ```bash
- python -m sglang.launch_server \
-     --model-path $MODEL_PATH \
-     --host 0.0.0.0 --port $PORT \
-     --trust-remote-code \
-     --attention-backend fa3
-
- # This is only an example, please adjust arguments according to your actual environment.
- ```
- MTP is supported for base model, and not yet for chat model. You can add parameter `--speculative-algorithm NEXTN` to start command.
-
- - Client:
-
- ```shell
- curl -s http://localhost:${PORT}/v1/chat/completions \
-     -H "Content-Type: application/json" \
-     -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
- ```
-
- More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
 
 
  ## Limitations & Future Plans
 
  print(completion.choices[0].message.content)
  ```
 
+ ## Deployment
 
+ ### SGLang
 
+ #### Environment Preparation
 
+ We will submit our model support to the official SGLang release later. For now, you can prepare the environment with:
+ ```shell
+ pip3 install -U sglang sgl-kernel
  ```
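 
+ To confirm the install succeeded, you can inspect the resolved versions (plain `pip` usage, nothing Ling-specific):
+ ```shell
+ pip3 show sglang sgl-kernel
+ ```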
 
+ #### Run Inference
 
+ Both BF16 and FP8 models are supported by SGLang now; which one is used depends on the dtype of the model in ${MODEL_PATH}.
 
+ Here is an example of running Ling-1T across multiple GPU nodes (4 nodes of 8 GPUs each, i.e. tp 8 × pp 4 = 32 GPUs), where the master node IP is ${MASTER_IP} and the server port is ${PORT}:
 
+ - Start server:
+ ```bash
+ # Node 0:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0
 
+ # Node 1:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1
 
+ # Node 2:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2
 
+ # Node 3:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3
 
+ # This is only an example. Please adjust arguments according to your actual environment.
+ ```
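 
+ Once all four ranks are up, you can sanity-check the server before sending chat requests; a minimal liveness probe (the `/health` route is part of SGLang's standard server API):
+ ```shell
+ curl -s http://${MASTER_IP}:${PORT}/health
+ ```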
 
+ - Client:
 
+ ```shell
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+ ```
 
+ More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
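 
+ The server speaks the same OpenAI-compatible chat API shown earlier in this README, so it can also be called from Python. A minimal sketch, assuming the `openai` package is installed:
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the SGLang server; the API key is required by the client but unused here.
+ client = OpenAI(base_url="http://<MASTER_IP>:<PORT>/v1", api_key="EMPTY")
+
+ completion = client.chat.completions.create(
+     model="auto",
+     messages=[{"role": "user", "content": "What is the capital of France?"}],
+ )
+ print(completion.choices[0].message.content)
+ ```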
 
+ ### vLLM
 
+ #### Environment Preparation
 
+ ```bash
+ pip install vllm==0.11.0
  ```
 
+ #### Run Inference
 
+ Here is an example of deploying the model across multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:
 
  ```bash
+ # step 1. start ray on all nodes
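+ # (for example, assuming Ray's default head port: run `ray start --head --port=6379` on node 0,
+ #  then `ray start --address=${MASTER_IP}:6379` on every other node)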
 
+ # step 2. start vllm server only on node 0:
+ vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
 
  # This is only an example, please adjust arguments according to your actual environment.
  ```
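 
+ Once the server is up, it exposes the same OpenAI-compatible endpoint; a minimal request against the name registered via `--served-model-name`:
+ ```shell
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model": "my_model", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+ ```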
 
  For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 
  ## Limitations & Future Plans