fp8 kv cache

#4
by festr2 - opened

Hello,

Using the FP8 KV cache introduces some inaccuracy in my tests. Is this something that can be solved? I'm seeing:

[2026-02-16 01:09:36 TP0] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!

That's an incorrect warning from sglang - the checkpoint does contain KV scaling factors.

Are you seeing any actual problems?

I'm curious whether the scaling is really used, because of the warning - if the warning appears, it looks like it just doesn't use the scaling from the checkpoint, does it? I'm trying to run my own benchmarks comparing the BF16 and FP8 KV cache, and I'm getting measurably worse results with FP8 (>5% worse), which is suspicious and worth investigating to see whether FP8 is really using the scaling - what do you recommend?
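As a rough illustration of what a per-tensor scale buys you - this is a toy sketch, not sglang's actual kernel, and the quantizer below is a simplified E4M3 model (3 mantissa bits, clamped range, no NaN/inf handling):

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def quantize_e4m3(x: float) -> float:
    """Toy round-trip through an E4M3-like format: 3 mantissa bits,
    minimum normal exponent -6 (smaller values fall into the subnormal grid)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)                 # clamp to representable range
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)                     # spacing of representable values
    return sign * min(round(mag / step) * step, E4M3_MAX)

def kv_roundtrip(values, scale):
    # store value/scale in FP8, multiply the scale back on read
    return [quantize_e4m3(v / scale) * scale for v in values]

k = [0.3, 0.004]                                # made-up KV-cache entries
calibrated = max(abs(v) for v in k) / E4M3_MAX  # per-tensor scale from amax
err = lambda s: max(abs(a - b) for a, b in zip(k, kv_roundtrip(k, s)))
print(f"scale=1.0: max err {err(1.0):.2e}, calibrated: max err {err(calibrated):.2e}")
```

With a calibrated scale the stored values use the full E4M3 range; with the default scale of 1.0 small activations sit in the coarse low-exponent region, which is one way a missing scale can show up as accuracy loss.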

There's a bug in sglang - https://github.com/sgl-project/sglang/pull/18904

Once this is merged I'll update the quant.

How about vLLM? It has the same precision drop when comparing the BF16 and FP8 KV cache.
What exactly will the quant have different once the PR is merged? I guess those k/v scales will be exported in the .json? Or are they directly in the binary files?

I wasn't able to include k/v scales != 1.0 (they are present in the checkpoint, but 1.0) due to the above bug, since it completely breaks FP8.
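To the earlier question about where the scales live: in FP8 checkpoints of this kind they are typically stored as tiny tensors inside the .safetensors shards (not in config.json), with the shard mapping listed in model.safetensors.index.json. A quick way to look for them - the key names below are a made-up excerpt, real names vary by model and quantizer:

```python
# Hypothetical excerpt of a weight map; for a real checkpoint you'd load it with
#   index = json.load(open("model.safetensors.index.json"))
index = {
    "weight_map": {
        "model.layers.0.self_attn.k_proj.weight": "model-00001.safetensors",
        "model.layers.0.self_attn.attn.k_scale": "model-00001.safetensors",
        "model.layers.0.self_attn.attn.v_scale": "model-00001.safetensors",
    }
}

# Collect any per-layer K/V scale entries
scale_keys = sorted(k for k in index["weight_map"]
                    if k.endswith(("k_scale", "v_scale")))
print(scale_keys)
```

If this prints nothing for a checkpoint, the serving engine has no choice but to fall back to scale = 1.0.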

I could upload a separate model briefly if you're interested in trying the one with precise K/V scales.

Yes please - I'll patch sglang and gladly try and compare.

How about the vLLM framework? Should it have the same issue?

No, I suspect vLLM is fine, but I haven't tried.

https://huggingface.co/lukealonso/MiniMax-M2.5-NVFP4-KV-Untested

@lukealonso I cannot load the MiniMax-M2.5-NVFP4-KV-Untested - there is no config.json - even when I copy it from the previous one. Is the Untested upload complete?

  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 250, in from_server_args
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 149, in __init__
    if self.hf_config.architectures[0] in mm_disabled_models:
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^

@lukealonso I have found the source of my test inaccuracies: it was caused by max tokens being set to 512, which led to misleading 1-2% differences. When I fixed this, my internal tests show differences within statistical noise. I'm not sure I can run reliable tests between various versions, or even measure the difference between bfloat16 and fp8 KV caches.

What is now the preferred version of your quant with vLLM? I have found that vLLM has higher throughput in my Kilo Code high-concurrency workload - I'm not able to match vLLM's throughput with sglang, so I'm actually preferring vLLM.

I'm still not sure about the KV scales versions/patches - thank you for any hints.

Great!

The current version should work with vLLM - let me know if it doesn't. I removed the K/V scales entirely since it was causing problems with vLLM (different key names), and the gains are marginal. I may add them back later when the bugs in sglang and vLLM have been fixed.
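For reference, enabling the FP8 KV cache looks roughly like this on both stacks (the model path is a placeholder; flag names are from current docs and may move around between versions, so check --help on your build):

```shell
# vLLM: FP8 (E4M3) KV cache
vllm serve path/to/quant --tensor-parallel-size 2 --kv-cache-dtype fp8

# sglang: equivalent setting
python -m sglang.launch_server --model-path path/to/quant \
    --tp 2 --kv-cache-dtype fp8_e4m3
```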

Interesting - I had no problem with previous checkpoints in vLLM. I was using the official latest nightly vLLM docker (docker pull vllm/vllm-openai:cu130-nightly). What was the issue?

I think it was just a bunch of warnings about k_scale and v_scale not being loaded (or at least that's what was reported), nothing actually harmful.

What kind of perf difference are you seeing?

@lukealonso I have written a Kilo Code test with 80-way concurrency, so it stresses prefill caches and a mixed workload, and vLLM is faster (I have 8 RTX 6000 Pro cards and run 4 vLLM instances with --tp 2). Especially the time to first token is absolutely crushing sglang, and I'm not able to find switches in sglang that close the gap to vLLM. I'm not sure if it's something that could be piecewise enabled in vLLM or just the vLLM scheduler, but the differences are huge. I will probably write up some more info so more people can suggest something.

For single-batch inference, sglang is actually faster. It's just the mixed, high-concurrency coding workload where vLLM wins.

I have now also finished testing sglang (your patched KV cache and the untested model, which I managed to load) and vLLM with both the previous and the latest snapshot - all with the FP8 (E4M3) KV cache - and all results are within statistical noise, so I'm not able to tell which one is more stable. Actually, the latest tests show less variability (1.7%) than the previous checkpoints (5%) - again, it could be just random noise. I'm not sure what tests should be conducted to tell which quants are more stable/precise.
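One cheap sanity check for "is this gap just noise?" is to repeat each benchmark a handful of times and compare the difference of means against its standard error. A minimal sketch with made-up per-run scores (assumes roughly independent runs):

```python
import math
import statistics

# Made-up per-run benchmark scores for two configs (e.g. pass rate per run)
bf16 = [0.712, 0.705, 0.721, 0.709, 0.715]
fp8 = [0.708, 0.701, 0.718, 0.713, 0.706]

def summarize(xs):
    """Mean and standard error of the mean."""
    return statistics.mean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

m_bf16, se_bf16 = summarize(bf16)
m_fp8, se_fp8 = summarize(fp8)
diff = m_bf16 - m_fp8
se_diff = math.hypot(se_bf16, se_fp8)  # SE of the difference of means
print(f"diff = {diff:.4f} +/- {2 * se_diff:.4f} (2 SE)")
# If |diff| is well inside +/- 2*SE, these runs can't distinguish the configs.
```

With only a few runs per config, anything inside a couple of standard errors is indistinguishable from noise, which matches the 1.7% vs 5% variability observation above.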

P.S.: feel free to join the Discord https://discord.gg/FJye6yaWN3 - I'm trying to put together a community of people with >=4 RTX 6000 Pro cards and share recent successes/procedures.
