Enormous KV-cache size?

#3
by nephepritou - opened

Just want to verify I'm configuring everything correctly and that it really is just a huge KV-cache size, not a mistake in my vLLM configuration.

python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/data/llm-data/models/zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 2 \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45

And I got an error:

To serve at least one request with the models's max seq len (131072), (29.38 GiB KV cache is needed, which is larger than the available KV cache memory (7.29 GiB). Based on the available memory, the estimated maximum model length is 32528. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

It was a surprise because I thought it would consume not much more than Qwen3 Coder 30B, and for Qwen I fit 280K tokens into the same 4x RTX 3090.
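
A quick sanity check on the reported figure (assuming, as a reply below confirms, that the 29.38 GiB in the error is the per-GPU requirement under tensor parallelism):

python3 -c "gib=29.38; tok=131072; tp=4; \
mib=gib*1024/tok; \
print(f'{mib:.3f} MiB/token/GPU, {mib*tp:.3f} MiB/token total')"
# prints: 0.230 MiB/token/GPU, 0.918 MiB/token total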

Try changing --max-num-seqs to 1 and increasing --gpu-memory-utilization.
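
For example, an untested sketch of the same launch with those two flags adjusted (if it still doesn't fit, the error's own suggestion of a lower --max-model-len, e.g. 32768, is the other lever):

python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/data/llm-data/models/zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 1 \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45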

I'm on a single RTX 6000 Pro; for the same 131072 it's asking for 120 GiB of KV! How is your config only requesting 29.38 GiB for 131k tokens?
Here's what I'm getting:
ValueError: To serve at least one request with the models's max seq len (131072), (120.0 GiB KV cache is needed, which is larger than the available KV cache memory (31.83 GiB). Based on the available memory, the estimated maximum model length is 34768. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

My config:

MODEL_ID="zai-org/GLM-4.7-Flash"

vllm serve "${MODEL_ID}" \
  --host 0.0.0.0 \
  --port 1236 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash

Because it requested ~30 GiB on EACH card, so ~120 GiB total: 4 × 29.38 GiB ≈ 117.5 GiB, which matches the ~120 GiB you see on a single GPU.

@nepherpritou
Guessing you're on 4x 3090?
There are FP8 and NVFP4 versions out that can leave more room for context, but it's still ~0.91 MB of KV cache per token unless you want to quantize the KV cache.
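
If you do want to try quantizing the KV cache, vLLM exposes a --kv-cache-dtype flag; a minimal sketch (fp8 roughly halves the per-token footprint, though backend support and accuracy impact vary by model):

vllm serve zai-org/GLM-4.7-Flash \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92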

It seems like even on mainline llama.cpp this model takes up a lot of space in the compute buffer, e.g. 68k context uses almost 24 GB for the compute buffer alone, huh...

Test GGUF quant with more details, including initial benchmarks (before I OOM'd): https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 69632 \
  -fit off \
  -fa off \
  -ngl 99 \
  -ub 4096 -b 4096 \
  --threads 1

llama_context: n_ctx_seq (69632) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
llama_kv_cache:      CUDA0 KV buffer size =  6791.50 MiB
llama_kv_cache: size = 6791.50 MiB ( 69632 cells,  47 layers,  1/1 seqs), K (f16): 3595.50 MiB, V (f16): 3196.00 MiB
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size = 23431.06 MiB
sched_reserve:  CUDA_Host compute buffer size =  1136.08 MiB
sched_reserve: graph nodes  = 3504
sched_reserve: graph splits = 2
sched_reserve: reserve took 481.45 ms, sched copies = 1
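
For what it's worth, the compute buffer scales roughly with the ubatch size, so dropping -ub should shrink that ~23 GiB buffer at some cost to prompt-processing speed. An untested variant of the same run:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 69632 \
  -fit off \
  -fa off \
  -ngl 99 \
  -ub 512 -b 2048 \
  --threads 1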

Have you vLLM folks also noticed token generation speed (TG, aka decode, aka TPOT) dropping quickly as the context/KV cache grows? It might be an implementation issue, still looking: https://github.com/ggml-org/llama.cpp/issues/18944

It seems the model should be using MLA, but isn't, and is thus falling back to MHA. If MLA were working properly, the KV footprint would be ~54 KB per token instead of ~0.91 MB per token.
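
Putting the two per-token figures side by side at the full 131072 context (quick arithmetic, nothing more):

python3 -c "tok=131072; \
print(f'MLA: {tok*54/1024/1024:.2f} GiB, MHA: {tok*0.91/1024:.1f} GiB')"
# prints: MLA: 6.75 GiB, MHA: 116.5 GiB (lines up with the 120 GiB in the error)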
