Can the AWQ quantized model run on a single A100? I tried running it with vLLM on a single A100, but it failed.
Yes, it can, but you need to install the latest version of transformers from GitHub:
pip install vllm git+https://github.com/huggingface/transformers.git
Then, you can launch the model with:
vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
--dtype float16 \
--port 4701 \
--gpu-memory-utilization 0.95 \
--quantization awq \
--max-model-len 32768
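Once the server is up, it exposes the OpenAI-compatible API. Here is a minimal sketch of querying it with only the Python standard library, assuming the port (4701) and model name from the serve command above; the `build_chat_payload` helper is just for illustration, not part of vLLM.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completion request body for the OpenAI-compatible API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload(
    "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",
    "Describe this image in one sentence.",
)

req = urllib.request.Request(
    "http://localhost:4701/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```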
I can start the service, but it crashes once the context gets a bit longer. However, it's not possible to serve on 2 GPUs: https://qwen.readthedocs.io/en/latest/quantization/gptq.html#qwen2-5-32b-instruct-gptq-int4-broken-with-vllm-on-multiple-gpus
Maybe you can quantize this model following the official instructions above.
Hi, I have added support for 2, 4, and 8 GPUs.
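With multi-GPU support in place, the model can be served with tensor parallelism via vLLM's `--tensor-parallel-size` flag. A sketch for 2 GPUs, reusing the same options as the single-GPU command above (the port and memory settings are just carried over, not required values):

```
vllm serve PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
--dtype float16 \
--port 4701 \
--gpu-memory-utilization 0.95 \
--quantization awq \
--max-model-len 32768 \
--tensor-parallel-size 2
```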