K1-31B-v5-fp8
Base model: win10/K1-31B-v5
Scheme: FP8_DYNAMIC — weights quantized to FP8 statically, activations scaled dynamically at runtime (W8A8).
How it was made: One-shot datafree quantization with LLM Compressor on a DGX Spark (GB10 Grace Blackwell, 128GB unified memory). No calibration data required.
Multimodal projection layers and MoE routers kept in bf16. Blackwell (GB10/B100/B200) has native FP8 hardware support. Hopper (H100/H200) also supports FP8 natively.
Check the original model card for capability and license details.
Running with vLLM
sudo docker run \
--gpus all \
--network host \
--ipc host \
--restart unless-stopped \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e NCCL_IGNORE_CPU_AFFINITY=1 \
vllm/vllm-openai:gemma4-0505-cu130 \
Firworks/K1-31B-v5-fp8 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 4 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--enable-prefix-caching \
--trust-remote-code \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--mm-processor-kwargs '{"max_soft_tokens": 1120}' \
--max-model-len 131072 \
--port 8000 \
--host 0.0.0.0
Tested on a DGX Spark.
If there are other models you'd like quantized to FP8, let me know.
- Downloads last month
- 113
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support