Why does the KV cache occupy so much GPU memory?
My GPUs are 2 × 48 GB, and the maximum number of tokens I can successfully launch with is only slightly over 20K.
vllm serve /home/tester/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b \
--tensor-parallel-size 2 \
--swap-space 4 \
--max-model-len 27600 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm_47_flash \
--port 7777 \
--api-key 12345678 \
--gpu-memory-utilization 0.92
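For context, here is a rough back-of-the-envelope sketch of why the cache gets so large without MLA: standard attention stores a K and a V vector per layer per token, while MLA stores a single compressed latent (plus the RoPE part) per layer, shared across heads. The layer, head, and rank numbers below are placeholder assumptions, not the real GLM-4.7-Flash config, so substitute the values from the model's config.json:

```python
# Rough KV-cache sizing sketch. All model dimensions below are placeholders --
# read the real values from the model's config.json before trusting the numbers.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Standard (non-MLA) attention caches one K and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def mla_cache_bytes_per_token(num_layers, kv_lora_rank, rope_head_dim, dtype_bytes=2):
    # MLA caches one compressed latent plus the RoPE portion per layer,
    # shared across heads, instead of full K/V tensors.
    return num_layers * (kv_lora_rank + rope_head_dim) * dtype_bytes

# Hypothetical config values, NOT the real GLM-4.7-Flash numbers:
layers, kv_heads, head_dim = 48, 32, 128
per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim)
print(f"full KV: ~{per_token / 1024:.1f} KiB per token, "
      f"~{per_token * 20_000 / 2**30:.1f} GiB for 20K tokens")

# DeepSeek-style MLA placeholders (kv_lora_rank=512, rope dim=64):
mla_per_token = mla_cache_bytes_per_token(layers, 512, 64)
print(f"MLA:     ~{mla_per_token / 1024:.1f} KiB per token")
```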
vLLM doesn't use MLA. Try SGLang from the model readme; it will allow ~120K tokens. But the generation speed will not make you happy :(
Apparently:

> vLLM doesn't use MLA. Try SGLang from the model readme; it will allow ~120K tokens. But the generation speed will not make you happy :(

Is SGLang that bad compared to vLLM? o__O
Some discussion on it here and links to both performance and quantization quality benchmarks: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/3#696ff1a8a2f6682f043decc3
ik_llama.cpp supports MLA with GLM-4.7-Flash
vLLM is not correctly triggering MLA; following up on this.
Does regular llama.cpp support MLA? Anyway, it is great that Flash supports MLA; support for it will come, hopefully.
> Does regular llama.cpp support MLA?

It supports MLA for DeepSeek and Kimi, but yesterday I couldn't get it working with GLM-4.7-Flash myself on the CUDA backend. I think this is the PR you want to be tracking: https://github.com/ggml-org/llama.cpp/pull/18953
Any smaller model for a 3050 with 4 GB of VRAM?
I used Qwen to generate SRT files from Whisper transcriptions (Chinese audio to English subtitles), but Qwen keeps inserting notes inside the SRT file, making it unusable. I tried Argos, and it's much worse for Chinese-to-English translation. Any tips on how to tackle this? I'm new to this.
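One way to keep notes out of the output, regardless of which model you end up using, is to translate each subtitle cue separately with a strict prompt and reassemble the file, so any extra commentary never lands in the SRT. A minimal sketch, assuming the `srt` and `openai` Python packages and a local OpenAI-compatible server; the endpoint, model name, file names, and prompt below are just placeholders borrowed from the vLLM command earlier in this thread:

```python
# Sketch: translate an SRT file cue-by-cue through a local OpenAI-compatible
# endpoint, keeping only the translated text so stray "notes" never end up
# in the subtitle file. Endpoint, model name, and prompt are placeholders.
import srt
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7777/v1", api_key="12345678")

def translate_cue(text: str) -> str:
    resp = client.chat.completions.create(
        model="glm_47_flash",  # whatever model your server exposes
        messages=[
            {"role": "system",
             "content": "Translate the Chinese subtitle line to English. "
                        "Reply with the translation only, no notes or explanations."},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

with open("input.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

for sub in subs:
    sub.content = translate_cue(sub.content)

with open("output.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))
```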
> Any smaller model for a 3050 with 4 GB of VRAM?
I have made REAP'd versions of the model all the way up to 49% compression, in case you're interested.
Thank you.
However, no, I don't do benchmarks. I only REAP and make sure the output isn't total garbage. That said, I have seen some huge performance drops and weirder output, even when using the ones recommended by Unsloth / z.ai, in the 30, 40, and 50 models.
> I have made REAP'd versions of the model all the way up to 49% compression, in case you're interested.
Yeah, I would love to try it out. What is the size and how can I get it? On Hugging Face? Which setup did you use for this?
Yeah, on my Hugging Face. I used 3x RTX A5000 on RunPod to test out three of them. I used 2x A100 PCIe to REAP the models.
Safetensors:
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-39 (~19B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-50 (~16B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-19 (~25B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-09 (~27B)
The parameter estimates are what Hugging Face shows.
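If it helps, here is a minimal sketch of pulling one of these repos locally with `huggingface_hub` and then pointing a server at the downloaded path; the repo choice and target directory are just examples:

```python
# Sketch: download one of the REAP'd safetensors repos so you can point
# vllm serve (or another runtime) at the local path afterwards.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Akicou/GLM-4.7-Flash-REAP-50",       # pick the compression level you want
    local_dir="models/GLM-4.7-Flash-REAP-50",     # example target directory
)
print(f"Model downloaded to: {local_path}")
# Then, for example:
#   vllm serve models/GLM-4.7-Flash-REAP-50 --served-model-name glm_47_flash_reap
```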
I also made GGUF quants, but llama.cpp needs to fix some problems: regardless of whether it's the REAP or the Unsloth quants, the GGUF model somehow forgets the history, at least for me.
Cerebras also has their own REAP, with benchmarks.