Severe Looping/Repetitive Output when using --kv-cache-dtype fp8 with GLM-4.7-Flash-FP8-Dynamic on vLLM
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tensor-parallel-size 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--max-model-len 200000 \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 16384 \
--port 8000 \
--kv-cache-dtype fp8
Description:
When serving unsloth/GLM-4.7-Flash-FP8-Dynamic with the vLLM V1 engine on NVIDIA H200, enabling the FP8 KV cache breaks generation entirely. The model enters an infinite repetition loop (e.g., outputting !!!!!!!!!! or repeating the same word indefinitely).
This appears to be a numerical stability issue specific to the interaction between FP8 quantized weights, FP8 KV cache, and the FlashMLA implementation in the V1 engine.
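A single chat request against the OpenAI-compatible endpoint started above is enough to see the looping output. Minimal sketch (the model name matches --served-model-name, the port matches --port, and the prompt is just a placeholder; any prompt shows the same behavior):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Summarize FP8 quantization in one paragraph."}],
        "max_tokens": 256
      }'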
Is this after a few turns or immediately? Can you try removing --kv-cache-dtype fp8 to see if that helps?
Immediately. --kv-cache-dtype auto returns to normal.
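For reference, the working launch is identical to the command above except for the last flag (leaving --kv-cache-dtype out entirely has the same effect, since auto is vLLM's default):

CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 16384 \
    --port 8000 \
    --kv-cache-dtype auto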
@ShelterW Oh interesting, hmmm we were planning to calibrate the KV cache as well, which might or might not cause issues.
If it works with the FP8 KV cache turned off, use that for now - I'll investigate further.