Performance Optimization recommendations for Qwen3 Reranker 0.6B on A100/H100 GPUs
#20, opened by rajshah14
Hi Team, thank you so much for providing these models.
I was wondering if you could recommend some GPU optimizations for the Qwen3 Reranker 0.6B model that would reduce latency on NVIDIA A100/H100 GPUs.
PS: I have tried the vLLM sample in the model card.
rajshah14 changed discussion title from "Performance Optimization recommendations on A100/H100 GPUs" to "Performance Optimization recommendations for Qwen3 Reranker 0.6B on A100/H100 GPUs"
As far as I know, vLLM is currently the easiest way to serve this model. This article walks through deploying the 8B variant with vLLM:
https://medium.com/@kimdoil1211/deploying-qwen3-reranker-8b-with-vllm-instruction-aware-reranking-for-next-generation-retrieval-c35a57c9f0a6
If you can wait a bit, TEI (Hugging Face Text Embeddings Inference) will likely add support for the Qwen3 reranker; see this pull request:
https://github.com/huggingface/text-embeddings-inference/pull/730
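Whichever backend you end up with, it probably helps to measure latency the same way before and after each change, so you can tell whether an optimization actually paid off on your A100/H100. Below is a minimal sketch using only the Python standard library. `measure_latency` and `dummy_rerank` are hypothetical names I made up for illustration, not part of vLLM or TEI; you would swap `dummy_rerank` for a call into your own deployment.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure_latency(
    rerank: Callable[[str, Sequence[str]], Sequence[float]],
    query: str,
    documents: Sequence[str],
    warmup: int = 3,
    runs: int = 20,
) -> dict:
    """Time repeated calls to `rerank` and summarize latency in milliseconds."""
    # Warm-up calls are not timed, so one-time startup costs on the first
    # requests do not skew the numbers.
    for _ in range(warmup):
        rerank(query, documents)

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank(query, documents)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return {
        "runs": len(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def dummy_rerank(query: str, documents: Sequence[str]) -> list:
    # Hypothetical stand-in (assumption): scores by word overlap.
    # Replace with a request to your vLLM or TEI deployment.
    query_words = set(query.split())
    return [float(len(query_words & set(doc.split()))) for doc in documents]


if __name__ == "__main__":
    docs = ["qwen3 reranker on a100", "unrelated text", "h100 latency tuning"]
    print(measure_latency(dummy_rerank, "qwen3 reranker latency", docs))
```

Running it with the same query and document set across configurations (batch size, precision, sequence length cap) should give comparable p50/p95 numbers; tail latency is usually the one to watch for a reranker sitting in a retrieval pipeline.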