Can the AWQ quantized model run on a single A100? I tried running it with vLLM on a single A100, but it failed.

#1
by Albertstone - opened
No description provided.

Yes, it can, but you need to install the latest version of transformers from GitHub:

pip install vllm git+https://github.com/huggingface/transformers.git

Then, you can launch the model with:

vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
  --dtype float16 \
  --port 4701 \
  --gpu-memory-utilization 0.95 \
  --quantization awq \
  --max-model-len 32768
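Once the server is up, vLLM exposes an OpenAI-compatible API on the chosen port. A minimal sketch of a chat-completions request against it, using only the standard library (the base URL, prompt, and `max_tokens` value are assumptions; the payload shape follows the OpenAI chat-completions format that vLLM's server accepts):

```python
import json
import urllib.request

# Assumed endpoint: vLLM's OpenAI-compatible server on the port used above.
BASE_URL = "http://localhost:4701/v1/chat/completions"

payload = {
    "model": "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",
    "messages": [
        {"role": "user", "content": "Describe this image in one sentence."}
    ],
    "max_tokens": 128,  # keep generation short while testing stability
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request for the chat-completions endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request(BASE_URL, payload)
# With the server running, urllib.request.urlopen(req) sends the request
# and the JSON response carries the completion under choices[0].message.
```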

I can start the service, but once the number of tokens grows a bit, it crashes. However, it's not possible to serve on 2 GPUs: https://qwen.readthedocs.io/en/latest/quantization/gptq.html#qwen2-5-32b-instruct-gptq-int4-broken-with-vllm-on-multiple-gpus
Maybe you can quantize this model following the official instructions above.

Pointer Inc. org

> I can start the service, but once the number of tokens grows a bit, it crashes. However, it's not possible to serve on 2 GPUs: https://qwen.readthedocs.io/en/latest/quantization/gptq.html#qwen2-5-32b-instruct-gptq-int4-broken-with-vllm-on-multiple-gpus
> Maybe you can quantize this model following the official instructions above.

Hi, I have added the support for 2, 4 and 8 GPUs.
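With multi-GPU support in place, a sketch of sharding the model across two GPUs via vLLM's tensor parallelism (the `--tensor-parallel-size` flag is standard vLLM; the remaining flags mirror the single-GPU command earlier in this thread, so adjust port and memory settings to your setup):

```shell
vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
  --dtype float16 \
  --port 4701 \
  --gpu-memory-utilization 0.95 \
  --quantization awq \
  --max-model-len 32768 \
  --tensor-parallel-size 2
```

The same flag set to 4 or 8 should match the other supported GPU counts.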

imjliao changed discussion status to closed
