Highest-performance inference on <8x RTX 6000 Pro setups

#1 opened by curiouspp8

Is there any way to run any of these quants via a high-performance engine like SGLang or vLLM?
The quantized safetensors versions don't fit into the 1/2/4 GPU counts that vLLM wants (its tensor parallelism only accepts 1/2/4/8 GPUs), and I'm not sure where those engines are with GGUF support. Just wondering if anyone has done a setup like that.
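In principle vLLM can combine tensor parallelism with pipeline parallelism to reach odd GPU counts like 6 (TP=2 x PP=3); a rough sketch of what I mean is below, but the checkpoint path is hypothetical and I don't know whether this actually works with these quants:

```python
# Rough sketch, not a tested recipe: vLLM's offline API with TP x PP.
# The model path is a hypothetical placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/some-quantized-checkpoint",  # hypothetical path
    tensor_parallel_size=2,    # must evenly divide the attention-head count
    pipeline_parallel_size=3,  # 2 x 3 = 6 GPUs total
)

params = SamplingParams(max_tokens=128, temperature=0.7)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```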

Have you checked the ik_llama.cpp fork? You can find GLM-5.1 quants for it from Ubergarm here on Hugging Face, and its multi-GPU performance is superior to mainline llama.cpp.
Another option would be exllama3; GLM-5 doesn't seem to be supported yet, but you could contact turboderp and ask if and when support will land.
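If you go that route, pulling just the GGUF shards is straightforward; a small sketch with huggingface_hub, where the repo id is a placeholder rather than Ubergarm's actual repo name:

```python
# Sketch only: fetch the GGUF files for an ik_llama.cpp quant from Hugging Face.
# The repo id below is a placeholder; substitute the actual Ubergarm repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ubergarm/SOME-GLM-QUANT-GGUF",  # placeholder repo id
    allow_patterns=["*.gguf"],               # skip everything but the shards
)
print(local_dir)
```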

Yep, ik_llama.cpp is the way now. I'm getting 500-1000 t/s prefill with 20-40 t/s generation. It only serves one request at a time, though, and slows down a lot past 70k context. I can still pull off 200k-token sessions, but it's tough, and agentic workflows are not great without parallelism. ik_llama.cpp has graph support that can efficiently utilize an arbitrary number of GPUs, but it's not supported for GLM/Kimi.
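For agentic stuff I just queue requests client-side against the server's OpenAI-compatible endpoint; rough sketch below, assuming the server listens on localhost:8080 (with a single slot these still execute one at a time, which is the whole problem):

```python
# Rough sketch: concurrent requests against an OpenAI-compatible
# llama-server/ik_llama.cpp endpoint. Assumes localhost:8080; with a
# single server slot, the requests still run serially.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="glm",  # placeholder; the server typically ignores this field
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, ["task one", "task two", "task three"]):
        print(answer)
```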
