Should I tune for this warning?

#4
by bash99 - opened

(Worker_TP3 pid=102905) WARNING 01-13 00:06:55 [fp8_utils.py:1027] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /data/deployer24/miniforge3/envs/vllm_last/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=3072,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8,block_shape=[128,128].json
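For context, the warning suggests vLLM looks up a tuned kernel config by a filename keyed on the layer shape, GPU name, dtype, and block shape. Here is an illustrative sketch (not vLLM's actual code; the function name is made up) of how that filename appears to be built from the values in the log:

```python
# Illustrative sketch only: reconstructing the config filename from the
# warning above. config_filename is a hypothetical helper, not a vLLM API.
def config_filename(N: int, K: int, device_name: str, dtype: str,
                    block_shape: list[int]) -> str:
    # The GPU name in the path has spaces replaced by underscores.
    device = device_name.replace(" ", "_")
    return (f"N={N},K={K},device_name={device},dtype={dtype},"
            f"block_shape=[{block_shape[0]},{block_shape[1]}].json")

print(config_filename(2048, 3072, "NVIDIA GeForce RTX 4090",
                      "fp8_w8a8", [128, 128]))
```

This reproduces the filename the warning reports as missing, which is why the message is per-GPU: a config tuned on one device name will not be found on another.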

And I'm wondering why W8A8 Block FP8? Isn't this model AWQ-int4?

Owner

It's a mixed-precision model.

  • Self-attention layers, which are quite important for quality but really small overall (around 3~5 GB), are kept in the original FP8 precision.
  • Expert layers, which are less impactful for quality and easily account for 95% of the size, are quantized to W4A16GS32 via the AWQ method.
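A back-of-envelope check of that split, under the assumption that W4A16 with group size 32 stores one FP16 scale per group of 32 4-bit weights (zero-points, if stored, are ignored here):

```python
# Rough size estimate for the mixed-precision layout described above.
# Assumption: W4A16GS32 = 4-bit weights + one 16-bit scale per 32 weights.
def bits_per_weight_w4(group_size: int, scale_bits: int = 16) -> float:
    return 4 + scale_bits / group_size

expert_bpw = bits_per_weight_w4(32)  # 4.5 effective bits per weight
fp8_bpw = 8.0                        # attention stays at 8 bits per weight

# With experts at ~95% of parameters and attention at ~5%, the model is
# roughly this fraction of an all-FP8 checkpoint:
overall = (0.95 * expert_bpw + 0.05 * fp8_bpw) / fp8_bpw
print(f"experts: {expert_bpw} bits/weight, overall ≈ {overall:.0%} of FP8 size")
```

So the expert layers land near 4.5 effective bits per weight, which is where most of the size savings come from, while the small attention layers keep FP8 quality.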
