|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
I used AutoAWQ to quantize Kimi K2 |
|
|
|
|
|
The base model is [here](https://huggingface.co/QuixiAI/Kimi-K2-Base-AWQ) |
|
|
|
|
|
``` |
|
|
$ docker run -it --rm \ |
|
|
--gpus all \ |
|
|
--network=host \ |
|
|
--shm-size=1024g \ |
|
|
--entrypoint /bin/sh \ |
|
|
-v /home/hotaisle/workspace/models:/models \ |
|
|
-v $HOME/.cache/huggingface:/root/.cache/huggingface \ |
|
|
vllm/vllm-openai:latest \ |
|
|
-c "pip install blobfile && python3 -m vllm.entrypoints.openai.api_server --model QuixiAI/Kimi-K2-Instruct-AWQ --port 8000 --tensor-parallel-size 8 --trust-remote-code --gpu-memory-utilization 0.95 --enable-prefix-caching --enable-chunked-prefill --dtype bfloat16"``` |
|
|
``` |
|
|
|
|
|
vLLM seems to have a bug that prevents it from inferencing. |
|
|
|
|
|
``` |
|
|
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] WorkerProc failed to start. |
|
|
... |
|
|
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 443, in weight_loader |
|
|
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] param[shard_offset:shard_offset + shard_size] = loaded_weight |
|
|
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
(VllmWorker rank=1 pid=644) ERROR 08-03 22:42:00 [multiproc_executor.py:511] RuntimeError: The expanded size of the tensor (264) must match the existing size (72) at non-singleton dimension 1. Target sizes: [576, 264]. Tensor sizes: [7168, 72] |
|
|
``` |