Update README.md

README.md
---
license: apache-2.0
---

I used AutoAWQ to quantize Kimi K2.
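For reference, the AutoAWQ flow looks roughly like the sketch below. The card does not state the exact settings used for this checkpoint, so the `quant_config` values (4-bit, group size 128, GEMM kernels) are AutoAWQ's common defaults, and the `moonshotai/Kimi-K2-Instruct` source repo is likewise an assumption:

```python
# Sketch of an AutoAWQ quantization run. The source model id, output path,
# and quant_config values are illustrative assumptions, not the confirmed
# settings used for this checkpoint.
quant_config = {
    "zero_point": True,   # asymmetric quantization (zero point per group)
    "q_group_size": 128,  # weights quantized in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # AWQ GEMM kernel layout
}

def quantize_kimi_k2(
    model_path: str = "moonshotai/Kimi-K2-Instruct",  # assumed source repo
    quant_path: str = "Kimi-K2-Instruct-AWQ",
) -> None:
    # Imports kept inside the function so the sketch can be read without
    # autoawq installed; `pip install autoawq` provides the `awq` package.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)  # calibrates, then packs weights
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```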
```
$ docker run -it --rm \
  --gpus all \
  ...
  -v /home/hotaisle/workspace/models:/models \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  -c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Instruct-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"
```

vLLM seems to have a bug that prevents it from running inference.

```
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] WorkerProc failed to start.
...
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 443, in weight_loader
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]     param[shard_offset:shard_offset + shard_size] = loaded_weight
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511]     ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72]
```
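The failing line copies a checkpoint tensor into a slice of a vLLM parameter, and the shapes simply don't line up, which suggests the AWQ-packed weight layout in this checkpoint is not what vLLM's weight loader expects for that layer. A minimal reproduction of the mismatch, with shapes taken from the log and numpy standing in for torch (torch raises `RuntimeError`, numpy `ValueError`):

```python
import numpy as np

# Shapes copied from the traceback: the loader tries to copy a checkpoint
# tensor of shape (7168, 72) into a parameter slice of shape (576, 264).
param_slice = np.zeros((576, 264))
loaded_weight = np.zeros((7168, 72))

try:
    # Mirrors vLLM's `param[shard_offset:shard_offset + shard_size] = loaded_weight`
    param_slice[:] = loaded_weight
    error = None
except ValueError as exc:
    error = str(exc)  # "could not broadcast input array from shape (7168,72) ..."

print(error)
```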