ehartford committed · verified
Commit 9f45c82 · Parent: d0a41f8

Update README.md

Files changed (1): README.md (+21 −1)
---
license: apache-2.0
---

I used AutoAWQ to quantize Kimi K2.

vLLM seems to have a bug that prevents it from running inference:
 
```
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] param[shard_offset:shard_offset + shard_size] = loaded_weight
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72]
```
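
For context, the `RuntimeError` above is raised when vLLM assigns a loaded checkpoint tensor into a preallocated parameter slice whose shape does not match. A minimal sketch of the same class of failure, using NumPy in place of vLLM's internals (the shapes are taken from the log above; note NumPy raises `ValueError` where PyTorch raises `RuntimeError`):

```python
import numpy as np

# Shapes from the traceback: the destination parameter slice is
# [576, 264], but the checkpoint tensor arrives as [7168, 72].
param = np.zeros((576, 264))
loaded_weight = np.zeros((7168, 72))

try:
    # Mirrors vLLM's `param[shard_offset:shard_offset + shard_size] = loaded_weight`
    param[0:576] = loaded_weight
except ValueError as err:
    print("shape mismatch:", err)
```

This is only an illustration of the mismatch, not the actual vLLM weight-loading code path.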

Run with this command:

```
$ docker run -it --rm \
    --gpus all \
    --network=host \
    --shm-size=1024g \
    --entrypoint /bin/sh \
    -v /home/hotaisle/workspace/models:/models \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    -c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Base-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"
```
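
If the server did start successfully, it would expose vLLM's OpenAI-compatible HTTP API on port 8000. A minimal client sketch (the prompt text is illustrative; assumes the server is reachable on localhost):

```python
import json
import urllib.request

# Completion request against the OpenAI-compatible /v1/completions route.
# The model name matches the --model flag passed to the server above.
payload = {
    "model": "QuixiAI/Kimi-K2-Base-AWQ",
    "prompt": "The quick brown fox",  # illustrative prompt
    "max_tokens": 32,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(request, timeout=10) as response:
        print(json.load(response)["choices"][0]["text"])
except OSError as err:
    # Expected while the vLLM bug above prevents the server from starting.
    print("request failed:", err)
```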

It seems that, due to the vLLM bug shown above, it cannot run.