Performance Optimization recommendations for Qwen3 Reranker 0.6B on A100/H100 GPUs

#20
by rajshah14 - opened

Hi Team, thank you so much for providing these models.
I was wondering if you could recommend some GPU optimizations for the Qwen3 Reranker 0.6B model that can be done to reduce latency on NVIDIA A100/H100 GPUs.

PS: I have tried the vLLM sample in the model card.


As far as I know, vLLM is currently the easiest way to serve this model.
https://medium.com/@kimdoil1211/deploying-qwen3-reranker-8b-with-vllm-instruction-aware-reranking-for-next-generation-retrieval-c35a57c9f0a6
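For latency on A100/H100, the vLLM server flags are usually the first knobs to try. Here is a minimal launch sketch; the flag values are illustrative assumptions to tune for your workload, not benchmarked recommendations (check your vLLM version's CLI reference for exact names and defaults):

```shell
# Hedged sketch: serve Qwen3-Reranker-0.6B with vLLM, tuned toward low latency.
vllm serve Qwen/Qwen3-Reranker-0.6B \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
# --dtype bfloat16: native precision on A100/H100, avoids fp32 overhead
# --max-model-len: capping context shrinks the KV cache and speeds up prefill
#   (8192 is an assumed ceiling; set it to your longest query+document pair)
# --enable-prefix-caching: reuses KV blocks for the shared reranking
#   instruction prefix across requests
```

On H100 specifically, weight quantization (e.g. `--quantization fp8`) is another option worth benchmarking, since Hopper has native FP8 support; measure quality on your reranking task before committing to it.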

If you can wait a bit, TEI (Text Embeddings Inference) will likely add support for the Qwen3 reranker:
https://github.com/huggingface/text-embeddings-inference/pull/730
