ehartford committed on
Commit 52be1d9 · verified · 1 Parent(s): 9f45c82

Update README.md

Files changed (1): README.md +12 -17
README.md CHANGED
@@ -3,21 +3,6 @@ license: apache-2.0
 ---
 I used AutoAWQ to quantize Kimi K2
 
- vLLM seems to have a bug that prevents it from inferencing.
-
- ```
- (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] WorkerProc failed to start.
- ...
- (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 443, in weight_loader
- (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] param[shard_offset:shard_offset + shard_size] = loaded_weight
- (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72]
- ```
-
- I used AutoAWQ to quantize Kimi K2.
-
- Run with this command:
-
 ```
 $ docker run -it --rm \
 --gpus all \
@@ -27,6 +12,16 @@ $ docker run -it --rm \
 -v /home/hotaisle/workspace/models:/models \
 -v $HOME/.cache/huggingface:/root/.cache/huggingface \
 vllm/vllm-openai:latest \
- -c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Base-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"```
+ -c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Instruct-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"```
+
+
+ vLLM seems to have a bug that prevents it from inferencing.
 
- It seems due to a bug in vLLM, it cannot run.
+ ```
+ (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] WorkerProc failed to start.
+ ...
+ (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 443, in weight_loader
+ (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] param[shard_offset:shard_offset + shard_size] = loaded_weight
+ (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ (VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72]
+ ```
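
For context on the RuntimeError above: `param[shard_offset:shard_offset + shard_size] = loaded_weight` is an in-place slice assignment, which requires the right-hand tensor to broadcast to the slice's shape. Here the loaded checkpoint shard is `[7168, 72]` while the destination slice is `[576, 264]`, and `72 != 264` at the trailing dimension, so PyTorch refuses the copy. A minimal torch-free sketch of that trailing-dimension broadcast rule (the `broadcastable` helper is illustrative, not vLLM or PyTorch code):

```python
def broadcastable(target, source):
    """PyTorch/NumPy-style broadcast check: align shapes at the trailing
    dimension; each source dim must equal the target dim or be 1."""
    if len(source) > len(target):
        return False
    for t, s in zip(reversed(target), reversed(source)):
        if s != 1 and s != t:
            return False
    return True

# Shapes from the error log: the destination slice is [576, 264],
# the checkpoint shard being loaded is [7168, 72].
print(broadcastable((576, 264), (7168, 72)))   # False: 72 != 264 at dim 1
print(broadcastable((576, 264), (576, 264)))   # True: shapes match exactly
```

The check fails at the trailing dimension first, which is why the log reports "non-singleton dimension 1" rather than the dimension-0 mismatch (7168 vs 576). A mismatch like this typically means the quantized checkpoint's weight layout does not match what vLLM's weight loader expects for this architecture.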