Backend Optimization?
I see that backend speed, VRAM usage, and quantization support have been pain points. Are you still mostly using vllm bf16? What about larger models?
If you're looking to optimize the backend, a few suggestions:
- TabbyAPI: This is a great, easy option for big models that supports batching well. It's not quite as fast at pure parallel text generation as vllm, but it's still fast, its quantization (exl2) is state of the art (with 6bpw being basically lossless), and it supports quantization across all the model architectures it can run.
- llama-server: I've seen you mention that llama.cpp doesn't support batching, but its server does handle parallel requests (multiple slots with continuous batching) — see the sketch after this list.
- TensorRT-LLM or SGLang: these tend to be the better fit for smaller bf16 models.
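
To illustrate the parallel-calls point: llama-server, TabbyAPI, and vllm all expose an OpenAI-compatible API, so the same client code can fan out concurrent requests to any of them and let the server do the batching. This is just a rough sketch — the endpoint URL, model name, and request count are placeholders, and llama-server needs to be launched with multiple slots (e.g. `--parallel 8`) for the requests to actually run concurrently:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint; point this at wherever the local server is listening.
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")

async def one_request(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",  # placeholder; most local servers ignore or remap this
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Test prompt #{i}" for i in range(8)]
    # Fire all requests concurrently; the backend interleaves/batches them.
    answers = await asyncio.gather(*(one_request(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```
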
It's also worth mentioning that some more esoteric engines have niches. For instance, PaddlePaddle supports running ERNIE 4.5 300B (A47B) with 2-bit quantization, with (apparently) fairly low loss: https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle
Also, I (and I'm sure others) would be happy to help with quantization if existing quants aren't available.
I'm using vllm bf16 up to 123B. After that it gets super expensive to test models, so I use APIs. bf16 is nice for testing new models and architectures since there are fewer model-support issues compared to running quantized (and a lot of the quantization methods I tried took longer to run than just skipping quantization). I spent so long configuring vllm's settings that I'm kinda exhausted with that stuff for a while. I'll probably look into different engines eventually, but I'll just stick with what I'm running for now :l
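
For anyone following along, a minimal offline vllm bf16 setup looks roughly like the sketch below; the model name, tensor-parallel size, and context length are placeholders, not the exact configuration used here:

```python
from vllm import LLM, SamplingParams

# Rough bf16 example; model, tensor_parallel_size, and max_model_len are placeholders.
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",  # a 123B-class example model
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a short story about a lighthouse."], params)
print(outputs[0].outputs[0].text)
```
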