How could I use vLLM to serve this gte model?

#25

by qiujiaji - opened Mar 4, 2025

I tried to serve this model with vLLM, but it reports that 'NewModel' is not supported. I want to use vLLM because serving gte with litserver is a bit slow: it reaches 1300 QPS on 8 H20 GPUs, which is only slightly faster than Qwen-0.5B. I expected it to be much faster, since gte only infers an embedding while Qwen-0.5B infers multiple tokens in my case.

qiujiaji changed discussion title from "How to I use vllm to serve this gte model?" to "How could I use vLLM to serve this gte model?" Mar 4, 2025
