Should I tune for this warning?

#4
by bash99 - opened

(Worker_TP3 pid=102905) WARNING 01-13 00:06:55 [fp8_utils.py:1027] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/deployer24/miniforge3/envs/vllm_last/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=3072,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8,block_shape=[128,128].json
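For context, the warning suggests vLLM looks up a tuned kernel config by a filename keyed on the layer shape, GPU name, dtype, and block shape. Here is an illustrative sketch (not vLLM's actual code; the function name is made up) of how that filename appears to be built from the values in the log:

```python
# Illustrative sketch only: reconstructing the config filename from the
# warning above. config_filename is a hypothetical helper, not a vLLM API.
def config_filename(N: int, K: int, device_name: str, dtype: str,
                    block_shape: list[int]) -> str:
    # The GPU name in the path has spaces replaced by underscores.
    device = device_name.replace(" ", "_")
    return (f"N={N},K={K},device_name={device},dtype={dtype},"
            f"block_shape=[{block_shape[0]},{block_shape[1]}].json")

print(config_filename(2048, 3072, "NVIDIA GeForce RTX 4090",
                      "fp8_w8a8", [128, 128]))
```

This reproduces the filename the warning reports as missing, which is why the message is per-GPU: a config tuned on one device name will not be found on another.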

And I'm wondering why W8A8 Block FP8? Isn't this model AWQ-int4?

Owner

It's a mixed-precision model.

  • Self-attention layers, which are quite important for quality but really small overall (around 3~5 GB), are kept in the original FP8 precision.
  • Expert layers, which are less impactful for quality and easily account for 95% of the size, are quantized to W4A16GS32 via the AWQ method.
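A back-of-envelope check of that split, under the assumption that W4A16 with group size 32 stores one FP16 scale per group of 32 4-bit weights (zero-points, if stored, are ignored here):

```python
# Rough size estimate for the mixed-precision layout described above.
# Assumption: W4A16GS32 = 4-bit weights + one 16-bit scale per 32 weights.
def bits_per_weight_w4(group_size: int, scale_bits: int = 16) -> float:
    return 4 + scale_bits / group_size

expert_bpw = bits_per_weight_w4(32)  # 4.5 effective bits per weight
fp8_bpw = 8.0                        # attention stays at 8 bits per weight

# With experts at ~95% of parameters and attention at ~5%, the model is
# roughly this fraction of an all-FP8 checkpoint:
overall = (0.95 * expert_bpw + 0.05 * fp8_bpw) / fp8_bpw
print(f"experts: {expert_bpw} bits/weight, overall ≈ {overall:.0%} of FP8 size")
```

So the expert layers land near 4.5 effective bits per weight, which is where most of the size savings come from, while the small attention layers keep FP8 quality.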
