Reranker Model Performance Optimization

by kazmi09 - opened Jan 28, 2025

edited Jan 28, 2025

Hello,
We are using the ms-marco-MiniLM-L-6-v2 model for our conceptual search application. Initially, we were reranking the top 1000 results, which took 2–2.5 seconds on average. For deployment, we were using Flask with Gunicorn, configured with 5 workers on a single-GPU machine.

Now, we are planning to increase the reranking scope from the top 1000 results to the top 3000. However, we are seeing a significant performance hit: the average reranking time increases to 6–7 seconds for the top 3000 results. We are loading the model as described in the model card documentation for the sentence-transformers package.
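For context, here is a minimal sketch of the kind of reranking call described above. It is an illustration under stated assumptions, not the application's actual code: the function name `rerank` and the default `batch_size` of 128 are hypothetical, and `model` is assumed to be any object with a `predict(pairs, batch_size=...)` method, such as a sentence-transformers `CrossEncoder`.

```python
from typing import List, Tuple


def rerank(model, query: str, docs: List[str],
           batch_size: int = 128) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return docs sorted by score, best first.

    `model` is assumed to expose predict(pairs, batch_size=...), as a
    sentence-transformers CrossEncoder does. The default batch size is a
    hypothetical starting point, not a recommended value.
    """
    # Build all pairs up front so the model can batch them itself.
    pairs = [(query, doc) for doc in docs]
    # One predict call over all pairs; batch_size controls the chunk size
    # used per forward pass.
    scores = model.predict(pairs, batch_size=batch_size)
    ranked = sorted(zip(docs, scores), key=lambda item: float(item[1]), reverse=True)
    return [(doc, float(score)) for doc, score in ranked]
```

With sentence-transformers installed, `model` would typically be created once per worker process, for example with `CrossEncoder(...)` pointed at the model named above, and reused across requests. Note that with 5 Gunicorn workers, each worker would hold its own copy of the model on the same GPU.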

Could you please advise if we are doing anything wrong, either in terms of model loading or from a deployment perspective?

Additionally, according to the model card, the model is capable of reranking 1800 documents/chunks per second.
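To make the gap concrete, the short calculation below converts the timings quoted in this post into documents per second and compares them with the model card figure. It uses only the numbers stated above; the helper name `throughput` is hypothetical.

```python
def throughput(num_docs: int, seconds: float) -> float:
    """Documents reranked per second for a single request."""
    return num_docs / seconds


# Timings quoted in this post: (document count, fastest time, slowest time).
observed = {
    "top 1000": (1000, 2.0, 2.5),
    "top 3000": (3000, 6.0, 7.0),
}
MODEL_CARD_DOCS_PER_SEC = 1800  # figure quoted from the model card

for label, (n, fastest, slowest) in observed.items():
    low = throughput(n, slowest)
    high = throughput(n, fastest)
    share = high / MODEL_CARD_DOCS_PER_SEC
    print(f"{label}: {low:.0f}-{high:.0f} docs/sec "
          f"(at best {share:.0%} of the model card figure)")
```

Both cases come out at roughly 400 to 500 documents per second, so the time grows about linearly with the number of documents, and throughput stays well below the 1800 per second quoted on the model card.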

Thanks.
