Highest-performance inference on <8x RTX 6000 Pro setups

#1 opened by curiouspp8

Is there any way to run any of these quants via a high-performance engine like SGLang or vLLM?
The quantized safetensors versions don't fit into the 1/2/4 GPU counts that vLLM wants (its tensor parallelism only accepts 1/2/4/8 GPUs), and I'm not sure where those engines are with GGUF support. Just wondering if anyone has done a setup like that.
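In principle vLLM can combine tensor parallelism with pipeline parallelism to reach odd GPU counts like 6 (TP=2 x PP=3); a rough sketch of what I mean is below, but the checkpoint path is hypothetical and I don't know whether this actually works with these quants:

```python
# Rough sketch, not a tested recipe: vLLM's offline API with TP x PP.
# The model path is a hypothetical placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/some-quantized-checkpoint",  # hypothetical path
    tensor_parallel_size=2,    # must evenly divide the attention-head count
    pipeline_parallel_size=3,  # 2 x 3 = 6 GPUs total
)

params = SamplingParams(max_tokens=128, temperature=0.7)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```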

Have you checked the ik_llama.cpp fork? You can find GLM-5.1 quants for it from Ubergarm here on Hugging Face, and its multi-GPU performance is superior to mainline llama.cpp.
Another option would be exllama3; GLM-5 doesn't seem to be supported yet, but you could contact turboderp and ask if and when support will land.
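If you go that route, pulling just the GGUF shards is straightforward; a small sketch with huggingface_hub, where the repo id is a placeholder rather than Ubergarm's actual repo name:

```python
# Sketch only: fetch the GGUF files for an ik_llama.cpp quant from Hugging Face.
# The repo id below is a placeholder; substitute the actual Ubergarm repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ubergarm/SOME-GLM-QUANT-GGUF",  # placeholder repo id
    allow_patterns=["*.gguf"],               # skip everything but the shards
)
print(local_dir)
```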

Yep, ik_llama.cpp is the way now. I'm getting 500-1000 t/s prefill with 20-40 t/s generation. It only serves one request at a time, though, and slows down a lot past 70k context. I can still pull off 200k-token sessions, but it's tough, and agentic workflows are not great without parallelism. ik_llama.cpp has graph support that can efficiently utilize an arbitrary number of GPUs, but it's not supported for GLM/Kimi.
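For agentic stuff I just queue requests client-side against the server's OpenAI-compatible endpoint; rough sketch below, assuming the server listens on localhost:8080 (with a single slot these still execute one at a time, which is the whole problem):

```python
# Rough sketch: concurrent requests against an OpenAI-compatible
# llama-server/ik_llama.cpp endpoint. Assumes localhost:8080; with a
# single server slot, the requests still run serially.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="glm",  # placeholder; the server typically ignores this field
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, ["task one", "task two", "task three"]):
        print(answer)
```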
