K1-31B-v5-fp8

Base model: win10/K1-31B-v5 Scheme: FP8_DYNAMIC — weights quantized to FP8 statically, activations scaled dynamically at runtime (W8A8). How it was made: One-shot datafree quantization with LLM Compressor on a DGX Spark (GB10 Grace Blackwell, 128GB unified memory). No calibration data required.

Multimodal projection layers and MoE routers kept in bf16. Blackwell (GB10/B100/B200) has native FP8 hardware support. Hopper (H100/H200) also supports FP8 natively.

Check the original model card for capability and license details.

Running with vLLM

sudo docker run \
    --gpus all \
    --network host \
    --ipc host \
    --restart unless-stopped \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e NCCL_IGNORE_CPU_AFFINITY=1 \
    vllm/vllm-openai:gemma4-0505-cu130 \
    Firworks/K1-31B-v5-fp8 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 4096 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": true}' \
    --mm-processor-kwargs '{"max_soft_tokens": 1120}' \
    --max-model-len 131072 \
    --port 8000 \
    --host 0.0.0.0

Tested on a DGX Spark.

If there are other models you'd like quantized to FP8, let me know.

Downloads last month
113
Safetensors
Model size
33B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Firworks/K1-31B-v5-fp8

Finetuned
win10/K1-31B-v5
Quantized
(3)
this model