Update README.md

README.md
---
license: apache-2.0
---

I used AutoAWQ to quantize Kimi K2.
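For reference, the AutoAWQ flow looks roughly like the sketch below. The card does not state the exact settings used for this checkpoint, so the `quant_config` values (4-bit, group size 128, GEMM kernels) are AutoAWQ's common defaults, and the `moonshotai/Kimi-K2-Instruct` source repo is likewise an assumption:

```python
# Sketch of an AutoAWQ quantization run. The source model id, output path,
# and quant_config values are illustrative assumptions, not the confirmed
# settings used for this checkpoint.
quant_config = {
    "zero_point": True,   # asymmetric quantization (zero point per group)
    "q_group_size": 128,  # weights quantized in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # AWQ GEMM kernel layout
}

def quantize_kimi_k2(
    model_path: str = "moonshotai/Kimi-K2-Instruct",  # assumed source repo
    quant_path: str = "Kimi-K2-Instruct-AWQ",
) -> None:
    # Imports kept inside the function so the sketch can be read without
    # autoawq installed; `pip install autoawq` provides the `awq` package.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)  # calibrates, then packs weights
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```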
```
$ docker run -it --rm \
  --gpus all \
  ...
  -v /home/hotaisle/workspace/models:/models \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  -c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Instruct-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"
```

vLLM seems to have a bug that prevents it from running inference.

```
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] WorkerProc failed to start.
...
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 443, in weight_loader
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]     param[shard_offset:shard_offset + shard_size] = loaded_weight
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]     ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72]
```
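The failing line copies a checkpoint tensor into a slice of a vLLM parameter, and the shapes simply don't line up, which suggests the AWQ-packed weight layout in this checkpoint is not what vLLM's weight loader expects for that layer. A minimal reproduction of the mismatch, with shapes taken from the log and numpy standing in for torch (torch raises `RuntimeError`, numpy `ValueError`):

```python
import numpy as np

# Shapes copied from the traceback: the loader tries to copy a checkpoint
# tensor of shape (7168, 72) into a parameter slice of shape (576, 264).
param_slice = np.zeros((576, 264))
loaded_weight = np.zeros((7168, 72))

try:
    # Mirrors vLLM's `param[shard_offset:shard_offset + shard_size] = loaded_weight`
    param_slice[:] = loaded_weight
    error = None
except ValueError as exc:
    error = str(exc)  # "could not broadcast input array from shape (7168,72) ..."

print(error)
```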