Can the AWQ quantized model run on a single A100? I tried running it with vLLM on a single A100, but it failed.
Yes, it can, but you need to install the latest version of transformers from GitHub:
pip install vllm git+https://github.com/huggingface/transformers.git
Then, you can launch the model with:
vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
--dtype float16 \
--port 4701 \
--gpu-memory-utilization 0.95 \
--quantization awq \
--max-model-len 32768
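Once the server is up, it exposes the OpenAI-compatible API. Here is a minimal sketch of querying it with only the Python standard library, assuming the port (4701) and model name from the serve command above; the `build_chat_payload` helper is just for illustration, not part of vLLM.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completion request body for the OpenAI-compatible API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload(
    "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",
    "Describe this image in one sentence.",
)

req = urllib.request.Request(
    "http://localhost:4701/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```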
I can start the service, but it crashes once the context gets a bit longer. However, it's not possible to serve on 2 GPUs: https://qwen.readthedocs.io/en/latest/quantization/gptq.html#qwen2-5-32b-instruct-gptq-int4-broken-with-vllm-on-multiple-gpus
Maybe you can quantize this model following the official instructions above.
Hi, I have added support for 2, 4, and 8 GPUs.
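With multi-GPU support in place, the model can be served with tensor parallelism via vLLM's `--tensor-parallel-size` flag. A sketch for 2 GPUs, reusing the same options as the single-GPU command above (the port and memory settings are just carried over, not required values):

```
vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
--dtype float16 \
--port 4701 \
--gpu-memory-utilization 0.95 \
--quantization awq \
--max-model-len 32768 \
--tensor-parallel-size 2
```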