Should I tune for this warning?
#4
by bash99 · opened
(Worker_TP3 pid=102905) WARNING 01-13 00:06:55 [fp8_utils.py:1027] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/deployer24/miniforge3/envs/vllm_last/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=3072,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8,block_shape=[128,128].json
And I'm wondering why W8A8 Block FP8 is involved at all: isn't this model quantized with AWQ int4?
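For context on what the warning is actually looking for: the tuned kernel config is just a JSON file whose name encodes the GEMM shape, GPU name, dtype, and FP8 block shape. Below is a minimal sketch of building and checking that path; the filename format is copied from the warning above, but the directory constant and the helper function are illustrative assumptions, not vLLM's actual lookup code.

```python
import json
import os

# Directory taken from the warning above (inside the installed vllm package);
# adjust for your own environment.
CONFIG_DIR = (
    "/data/deployer24/miniforge3/envs/vllm_last/lib/python3.12/site-packages/"
    "vllm/model_executor/layers/quantization/utils/configs"
)

def block_fp8_config_path(n: int, k: int, device_name: str,
                          block_shape: tuple[int, int]) -> str:
    """Build the per-shape config filename the warning refers to (sketch)."""
    fname = (
        f"N={n},K={k},device_name={device_name},dtype=fp8_w8a8,"
        f"block_shape=[{block_shape[0]},{block_shape[1]}].json"
    )
    return os.path.join(CONFIG_DIR, fname)

path = block_fp8_config_path(2048, 3072, "NVIDIA_GeForce_RTX_4090", (128, 128))
if os.path.exists(path):
    with open(path) as f:
        # A tuned config maps batch sizes to kernel launch parameters.
        print(json.dumps(json.load(f), indent=2))
else:
    print("No tuned config found; vLLM falls back to a default kernel config.")
```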
It's a mixed-precision model (roughly sketched after the list below):
- Self-attention layers, which matter a lot for quality but are quite small overall (roughly 3~5 GB), are kept in the original FP8 precision.
- Expert layers, which have less impact on quality and easily account for 95% of the model size, are quantized to W4A16 with group size 32 (W4A16GS32) via the AWQ method.
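To make the split concrete, here is a rough sketch of how weights would be routed to the two schemes. The key patterns are illustrative assumptions rather than the checkpoint's real tensor names, but it shows why the W8A8 Block FP8 kernel still runs on an "AWQ" model: the attention projections stay in block FP8 and hit that kernel, while only the expert weights go through the AWQ int4 path.

```python
# Illustrative sketch of the mixed-precision layout described above.
# The layer-name patterns below are assumptions, not the checkpoint's actual keys.

def scheme_for(weight_name: str) -> str:
    """Pick the quantization scheme a weight would use in this layout."""
    if ".experts." in weight_name:
        # MoE expert weights: ~95% of the size, 4-bit AWQ with group size 32.
        return "W4A16GS32 (AWQ)"
    # Attention (and other small) weights: kept in block-wise FP8, served by
    # the W8A8 Block FP8 kernel the warning complains about.
    return "FP8 W8A8, block_shape=[128,128]"

for name in [
    "model.layers.0.self_attn.q_proj.weight",        # hypothetical attention key
    "model.layers.0.mlp.experts.7.up_proj.weight",   # hypothetical expert key
]:
    print(f"{name} -> {scheme_for(name)}")
```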