internlm/internlm-chat-20b-4bit

@@ -12,7 +12,7 @@ pipeline_tag: text-generation
 Before proceeding with the inference of `internlm-chat-20b-4bit`, please ensure that lmdeploy is installed.
 ```shell
-pip install 'lmdeploy>=0.0.9'
 ```
 ## Inference
@@ -31,7 +31,7 @@ As demonstrated in the command below, first convert the model's layout using `tu
 # Convert the model's layout and store it in the default path, ./workspace.
 python3 -m lmdeploy.serve.turbomind.deploy \
     --model-name internlm-chat-20b \
-    --model-path ./internlm-chat-20b \
     --model-format awq \
     --group-size 128
@@ -44,7 +44,7 @@ python3 -m lmdeploy.turbomind.chat ./workspace
 If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:
 ```shell
-python3 -m lmdeploy.serve.turbomind ./workspace --server_name {ip_addr} --server_port {port}
 ```
 Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.
@@ -65,12 +65,10 @@ We conducted benchmarks on `internlm-chat-20b-4bit`. And `token_throughput` was
 **Note**: The `session_len` in `workspace/triton_models/weights/config.ini` was changed to `2056` for our test.
-| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm (req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
-|-------|-----------------|---------------|-------------------|-----------------------|-----------------------|---------------|------------------|-----------------|------------------|
-| 1     | 1               | 256           | 512               | 79.12                 | 632.98                | -             | 15.67            | 15.67           | 125.35           |
-| 16    | 1               | 256           | 512               | 708.76                | 5670.1                | 220.23        | 51.48            | 51.48           | 411.85           |
 ### token throughput
@@ -84,6 +82,24 @@ python benchmark/profile_generation.py \
 ```
 You will find the `token_throughput` metrics in `./token_throughput.csv`
 ### request throughput

 Before proceeding with the inference of `internlm-chat-20b-4bit`, please ensure that lmdeploy is installed.
 ```shell
+pip install 'lmdeploy>=0.0.11'
 ```
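 To confirm that the installed package meets the `lmdeploy>=0.0.11` requirement before converting the model, a small check like the one below may help. It is only a sketch: the helper names `version_tuple` and `lmdeploy_is_recent_enough` are hypothetical and not part of lmdeploy, and the comparison looks only at the leading numeric part of each version component.

```python
import re
from importlib import metadata

# Minimum version taken from the install command above.
MIN_VERSION = (0, 0, 11)


def version_tuple(version):
    """Convert a string such as '0.0.11' into (0, 0, 11).

    Parsing stops at the first component that does not start with a digit,
    so suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def lmdeploy_is_recent_enough(minimum=MIN_VERSION):
    """Return True if lmdeploy is installed and at least `minimum`."""
    try:
        installed = metadata.version("lmdeploy")
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= minimum


if __name__ == "__main__":
    print("lmdeploy OK:", lmdeploy_is_recent_enough())
```

 If the check prints `False`, rerunning the `pip install` command above should bring the package up to date.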
 ## Inference
 # Convert the model's layout and store it in the default path, ./workspace.
 python3 -m lmdeploy.serve.turbomind.deploy \
     --model-name internlm-chat-20b \
+    --model-path ./internlm-chat-20b-4bit \
     --model-format awq \
     --group-size 128
 If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:
 ```shell
+python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
 ```
 Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.
 **Note**: The `session_len` in `workspace/triton_models/weights/config.ini` was changed to `2056` for our test.
+| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | rpm (req/min) | mem_per_proc(GB) |
+|-------|-----------------|---------------|-------------------|-----------------------|---------------|------------------|
+| 1     | 1               | 256           | 512               | 88.77                 | -             | 15.65            |
+| 16    | 1               | 256           | 512               | 792.7                 | 220.23        | 51.46            |
 ### token throughput
 ```
 You will find the `token_throughput` metrics in `./token_throughput.csv`
+| batch | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm(req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
+|-------|---------------|-------------------|-----------------------|-----------------------|--------------|------------------|-----------------|------------------|
+| 1     | 256           | 512               | 88.77                 | 710.12                | -            | 15.65            | 15.65           | 125.21           |
+| 1     | 512           | 512               | 83.89                 | 671.15                | -            | 15.68            | 15.68           | 125.46           |
+| 1     | 512           | 1024              | 80.19                 | 641.5                 | -            | 15.68            | 15.68           | 125.46           |
+| 1     | 1024          | 1024              | 72.34                 | 578.74                | -            | 15.75            | 15.75           | 125.96           |
+| 1     | 1             | 2048              | 80.69                 | 645.55                | -            | 15.62            | 15.62           | 124.96           |
+| 8     | 256           | 512               | 565.21                | 4521.67               | -            | 32.37            | 32.37           | 258.96           |
+| 8     | 512           | 512               | 489.04                | 3912.33               | -            | 32.62            | 32.62           | 260.96           |
+| 8     | 512           | 1024              | 467.23                | 3737.84               | -            | 32.62            | 32.62           | 260.96           |
+| 8     | 1024          | 1024              | 383.4                 | 3067.19               | -            | 33.06            | 33.06           | 264.46           |
+| 8     | 1             | 2048              | 487.74                | 3901.93               | -            | 32.12            | 32.12           | 256.96           |
+| 16    | 256           | 512               | 792.7                 | 6341.6                | -            | 51.46            | 51.46           | 411.71           |
+| 16    | 512           | 512               | 639.4                 | 5115.17               | -            | 51.93            | 51.93           | 415.46           |
+| 16    | 512           | 1024              | 591.39                | 4731.09               | -            | 51.93            | 51.93           | 415.46           |
+| 16    | 1024          | 1024              | 449.11                | 3592.85               | -            | 52.06            | 52.06           | 416.46           |
+| 16    | 1             | 2048              | 620.5                 | 4964.02               | -            | 51               | 51              | 407.96           |
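 The per-node columns in the table above appear to be the per-process figures scaled by a constant factor. The sketch below checks that reading on three rows copied from the table. The factor of 8 processes per node is an assumption inferred from the ratios, not something the benchmark description states, and small differences are expected because the table values are rounded.

```python
# Assumption: 8 processes (one per GPU) per node, inferred from the table ratios.
PROCS_PER_NODE = 8

# Rows copied from the table above:
# (batch, thr_per_proc, thr_per_node, mem_per_proc, mem_per_node)
ROWS = [
    (1, 88.77, 710.12, 15.65, 125.21),
    (8, 565.21, 4521.67, 32.37, 258.96),
    (16, 792.7, 6341.6, 51.46, 411.71),
]


def relative_error(estimate, reported):
    """Relative difference between a scaled estimate and the reported value."""
    return abs(estimate - reported) / reported


def check_rows(rows=ROWS, procs=PROCS_PER_NODE):
    """Return the worst relative error over throughput and memory columns."""
    worst = 0.0
    for _batch, thr_proc, thr_node, mem_proc, mem_node in rows:
        worst = max(worst, relative_error(thr_proc * procs, thr_node))
        worst = max(worst, relative_error(mem_proc * procs, mem_node))
    return worst


if __name__ == "__main__":
    print(f"worst relative error: {check_rows():.2e}")
```

 Under this assumption the scaled values agree with the reported per-node values to within about 0.1%.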
 ### request throughput