Backend Optimization?
I see that backend speed, VRAM usage, and quantization support have been pain points. Are you still mostly using vllm bf16? What about larger models?
If you're looking to optimize the backend, a few suggestions:
- TabbyAPI: This is a great, easy option for big models that supports batching well. It's not quite as fast at pure parallel text generation as vllm, but it's still fast, its quantization (exl2) is state of the art (with 6bpw being basically lossless), and it supports quantization across all the model architectures it can run.
- llama-server: I've seen you mention that llama.cpp doesn't support batching, but its server does handle parallel requests (multiple slots with continuous batching) — see the sketch after this list.
- TensorRT-LLM or SGLang: these tend to be the better fit for smaller bf16 models.
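
To illustrate the parallel-calls point: llama-server, TabbyAPI, and vllm all expose an OpenAI-compatible API, so the same client code can fan out concurrent requests to any of them and let the server do the batching. This is just a rough sketch — the endpoint URL, model name, and request count are placeholders, and llama-server needs to be launched with multiple slots (e.g. `--parallel 8`) for the requests to actually run concurrently:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint; point this at wherever the local server is listening.
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")

async def one_request(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder; most local servers ignore or remap this
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Test prompt #{i}" for i in range(8)]
    # Fire all requests concurrently; the backend interleaves/batches them.
    answers = await asyncio.gather(*(one_request(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```
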
It's also worth mentioning that some more esoteric engines have niches. For instance, PaddlePaddle supports running ERNIE 4.5 300B (A47B) with 2-bit quantization, with (apparently) fairly low loss: https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle
Also, I (and I'm sure others) would be happy to help with quantization if existing quants aren't available.
I'm using vllm bf16 up to 123B. After that it gets super expensive to test models, so I use APIs. bf16 is nice for testing new models and architectures since there are fewer model-support issues compared to running quantized (and a lot of the quantization methods I tried took longer to run than just skipping quantization). I spent so long configuring vllm's settings that I'm kinda exhausted with that stuff for a while. I'll probably look into different engines eventually, but I'll just stick with what I'm running for now :l
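
For anyone following along, a minimal offline vllm bf16 setup looks roughly like the sketch below; the model name, tensor-parallel size, and context length are placeholders, not the exact configuration used here:

```python
from vllm import LLM, SamplingParams

# Rough bf16 example; model, tensor_parallel_size, and max_model_len are placeholders.
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",  # a 123B-class example model
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a short story about a lighthouse."], params)
print(outputs[0].outputs[0].text)
```
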