Phi3 Mini 128k 4-Bit Quantized
Flash Attention
- The Phi3 family supports Flash Attention 2, a mechanism that enables faster inference with lower resource use.
- When quantizing Phi3 on a 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.
- Enabling Flash Attention allowed quantization to complete, leaving an extra 10 gigabytes of VRAM available on the GPU (see the loading sketch below).
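The notes do not name the quantization tool used, so as a minimal sketch, here is how loading Phi3 Mini 128k in 4-bit with Flash Attention 2 enabled might look using Hugging Face transformers with bitsandbytes. The NF4 settings and compute dtype are assumptions, not the confirmed configuration from these notes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit config: NF4 via bitsandbytes is one common choice;
# the original notes do not specify the exact quantization method.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # enable Flash Attention 2
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

With `attn_implementation` left at its default, attention activations for the 128k-context model can exhaust a 24 GB card during quantization, which matches the failure described above.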
Metrics
Total Size:
- Before: 7.64 GB
- After: 2.28 GB
VRAM Usage:
- Before: 11.47 GB
- After: 6.57 GB
Average Inference Time:
- Before: 12 ms/token
- After: 5 ms/token
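The notes do not state how the per-token time was measured. A hypothetical benchmark along these lines could reproduce the ms/token figure; it assumes `model` was loaded as in the sketch above, and the prompt and token count are arbitrary:

```python
import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
inputs = tokenizer("Explain flash attention briefly.", return_tensors="pt").to(model.device)

# Time a fixed-length greedy generation and average over the tokens produced.
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed / generated * 1000:.1f} ms/token")
```

Averaging over a longer generation and synchronizing the GPU before and after timing keeps the measurement from being skewed by launch overhead or pending kernels.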