Severe Looping/Repetitive Output when using --kv-cache-dtype fp8 with GLM-4.7-Flash-FP8-Dynamic on vLLM

#2
by ShelterW - opened
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 16384 \
    --port 8000 \
    --kv-cache-dtype fp8
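To check whether a given launch reproduces the symptom, a small heuristic can flag looping completions like the `!!!!!!!!!!` output described below. This is an illustrative sketch, not part of vLLM: `looks_degenerate` and its threshold are assumptions, and the commented client call assumes the OpenAI-compatible endpoint the command above exposes on port 8000.

```python
import re

def looks_degenerate(text: str, min_repeats: int = 8) -> bool:
    """Heuristic: True if a short chunk (1-12 chars) repeats back-to-back
    at least `min_repeats` times, e.g. '!!!!!!!!!!' or 'word word word ...'.
    The 12-char window and repeat count are arbitrary illustrative choices."""
    pattern = r"(.{1,12}?)\1{%d,}" % (min_repeats - 1)
    return re.search(pattern, text) is not None

# Example usage against the server launched above (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(
#       model="unsloth/GLM-4.7-Flash",
#       messages=[{"role": "user", "content": "Hello"}],
#       max_tokens=256)
#   print(looks_degenerate(resp.choices[0].message.content or ""))

print(looks_degenerate("!" * 20))                      # True: the reported symptom
print(looks_degenerate("The model responds normally."))  # False: healthy output
```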

Description:
When serving unsloth/GLM-4.7-Flash-FP8-Dynamic using the vLLM V1 engine on NVIDIA H200, enabling FP8 KV cache results in a complete failure of inference logic. The model enters an infinite repetition loop (e.g., outputting !!!!!!!!!! or repeating the same word indefinitely).
This appears to be a numerical stability issue specific to the interaction between FP8 quantized weights, FP8 KV cache, and the FlashMLA implementation in the V1 engine.
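To make the suspected numerical-stability angle concrete, the sketch below simulates the precision loss of storing values in FP8 E4M3 (3 mantissa bits, max normal ~448), the format typically used for FP8 KV caches. This is a simplified illustration of rounding error, not vLLM's actual cast path; `quantize_e4m3` ignores subnormals and NaN encoding.

```python
import math
import random

def quantize_e4m3(x: float) -> float:
    """Round x to a simulated FP8 E4M3 value: keep 3 fraction bits of the
    normalized mantissa and clamp to the E4M3 max normal (~448).
    Simplified for illustration; ignores subnormals and NaN encoding."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))      # abs(x) = m * 2**e, with m in [0.5, 1)
    m = round(m * 16) / 16         # 3 mantissa bits -> steps of 1/16
    y = sign * math.ldexp(m, e)
    return max(-448.0, min(448.0, y))

# Per-element relative error is already on the order of a few percent:
print(quantize_e4m3(0.3))  # 0.3125, ~4% relative error

# And it accumulates across an attention-style dot product of a query
# against quantized cached keys:
random.seed(0)
q = [random.gauss(0, 1) for _ in range(128)]
k = [random.gauss(0, 1) for _ in range(128)]
exact = sum(a * b for a, b in zip(q, k))
quant = sum(a * quantize_e4m3(b) for a, b in zip(q, k))
print(exact, quant)  # the quantized attention score drifts from the exact one
```

Whether this drift alone explains the looping, or whether it is amplified by the FP8 weights or the FlashMLA kernel, is exactly what the report leaves open.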

Unsloth AI org

Is this after a few turns or immediately? Can you try removing --kv-cache-dtype fp8 to see if that helps?

Is this after a few turns or immediately? Can you try removing --kv-cache-dtype fp8 to see if that helps?
Immediately. With --kv-cache-dtype auto, output returns to normal.

Unsloth AI org

@ShelterW Oh interesting, hmm. We were planning to calibrate the KV cache as well, which may or may not cause issues

If it works with the FP8 KV cache turned off, use that for now - I'll investigate further
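For reference, the workaround amounts to relaunching with the KV cache left in its default precision; everything else in the original command stays the same (this is the configuration the reporter confirmed works):

```shell
# Same launch as the report, but with --kv-cache-dtype set to the
# default (auto) instead of fp8.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 16384 \
    --port 8000 \
    --kv-cache-dtype auto
```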
