InternVL3-8B-AWQ is much slower than InternVL3-8B
#3 · by QiliangGoose · opened
I'm using InternVL3-8B-AWQ for inference on vLLM, and it is much slower than InternVL3-8B.
The device I'm using:
RTX 4090D 24 GB
vLLM==0.9.0
Time to first token:
InternVL3-8B: 0.81 s
InternVL3-8B-AWQ: 1.40 s
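For context, the numbers above are time-to-first-token over the streaming API. A minimal sketch of how such a measurement can be taken (the `first_token_latency` helper is my own illustration, not part of vLLM; any client that yields streamed chunks can be passed in):

```python
import time

def first_token_latency(stream):
    """Return seconds elapsed until the first non-empty chunk arrives.

    `stream` is any iterable of response chunks, e.g. the generator
    returned by an OpenAI-compatible client with stream=True.
    """
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # first real token (or chunk of tokens)
            return time.perf_counter() - start
    return None  # stream ended without producing a token
```

With the OpenAI Python client pointed at the vLLM server, the stream would come from `client.chat.completions.create(..., stream=True)`.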
Command settings:
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code
Are there any special settings needed to make the AWQ model run faster?