Quant of mistral-community/pixtral-12b using LLM Compressor for optimised inference on vLLM.
FP8 dynamic quant on the language model, plus FP8 quant of the KV cache. multi_modal_projector and vision_tower are left in FP16 since they're a small part of the model.
Calibrated on 2048 samples from HuggingFaceH4/ultrachat_200k.
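To give a rough idea of what "dynamic" means here, the sketch below computes a quantization scale at runtime from the values it is given, instead of using a scale fixed ahead of time. It is a plain-Python illustration, not the LLM Compressor implementation: the function names are hypothetical, and it only rescales values into the FP8 E4M3 range (maximum finite value 448) without rounding them to actual FP8 bit patterns.

```python
# Illustrative sketch only; not LLM Compressor or vLLM code.
# FP8 E4M3 can represent finite magnitudes up to 448.
FP8_E4M3_MAX = 448.0


def dynamic_scale(values):
    """Compute a scale from the current values (hypothetical helper)."""
    amax = max(abs(v) for v in values)
    # Fall back to 1.0 for an all-zero input to avoid dividing by zero.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_to_fp8_range(values):
    """Rescale values so the largest magnitude maps to the FP8 E4M3 maximum.

    Real kernels would also round each value to the nearest FP8 number;
    that step is omitted here.
    """
    scale = dynamic_scale(values)
    return [v / scale for v in values], scale


def rescale_back(scaled, scale):
    """Undo the scaling (dequantization without rounding error)."""
    return [v * scale for v in scaled]


scaled, scale = scale_to_fp8_range([0.5, -2.0, 1.0])
print(scale, scaled)
```

Because the scale comes from the data seen at inference time, nothing has to be stored for it in advance; the trade-off is a small amount of extra work per call.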
Example vLLM usage
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8
Supported on Nvidia GPUs with compute capability 8.9 or higher (Ada Lovelace, Hopper).
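A quick way to check a GPU against this requirement is to compare its compute capability as a (major, minor) pair. Ada Lovelace is 8.9 and Hopper is 9.0, so the comparison is inclusive. The helper name below is hypothetical; how you obtain the capability is up to you (with PyTorch installed, `torch.cuda.get_device_capability()` returns such a tuple).

```python
def supports_fp8(major, minor):
    """Return True if a compute capability meets the 8.9 minimum.

    Hypothetical helper: tuples compare element by element, so (9, 0)
    passes and (8, 6) does not.
    """
    return (major, minor) >= (8, 9)


# Example capabilities: Ampere (8.6), Ada Lovelace (8.9), Hopper (9.0).
for capability in [(8, 6), (8, 9), (9, 0)]:
    print(capability, supports_fp8(*capability))
```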
Edit: Something seems to be wrong with the tokenizer. If you have any issues, add --tokenizer mistral-community/pixtral-12b to your vLLM command line args.
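If you launch the server from a script, the tokenizer workaround can be a simple switch. This is a minimal sketch using only the flags shown on this page; the function name and the `use_fallback_tokenizer` parameter are made up for illustration, and the script only prints the command rather than running it.

```python
import shlex

MODEL = "nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache"
FALLBACK_TOKENIZER = "mistral-community/pixtral-12b"


def build_serve_command(use_fallback_tokenizer=False):
    """Build the `vllm serve` argument list (hypothetical helper)."""
    args = [
        "vllm", "serve", MODEL,
        "--quantization", "fp8",
        "--kv-cache-dtype", "fp8",
    ]
    if use_fallback_tokenizer:
        # Workaround from the note above: point vLLM at the base model's tokenizer.
        args += ["--tokenizer", FALLBACK_TOKENIZER]
    return args


# Print the command as a shell-safe string; pass the list to
# subprocess.run(...) instead if you want to actually start the server.
print(shlex.join(build_serve_command(use_fallback_tokenizer=True)))
```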
Base model
mistral-experimental/pixtral-12b