Best strategy for inference on multiple GPUs

#124

by symdec - opened Jun 6, 2024

Discussion

symdec

Jun 6, 2024

•

edited Jun 6, 2024

Hello,
I have a question about serving this model for a near-real-time use case with many users.

I'm serving this model behind a FastAPI/uvicorn web server. Right now it works with the model running on a single GPU.
I want to increase serving throughput by using multiple GPUs, with one Whisper instance on each.
Do you know which technologies I can use to queue HTTP requests and route them across the different instances/GPUs (with some load balancing) in order to maximize throughput and minimize latency?

Thanks in advance!

rareson168

Mar 24, 2025

•

edited Mar 24, 2025

Ray Serve :)

Set the number of Ray Serve replicas (num_replicas) to the number of GPUs you have available, and request one GPU per replica through the actor options (ray_actor_options={"num_gpus": 1}).

Each replica will then see exactly one GPU, so you can instantiate one Whisper model per replica.
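
If it helps to see the general pattern without any framework, here is a rough sketch using only the Python standard library: one worker per GPU, all pulling from a shared queue, so an idle GPU picks up the next request. This is not how Ray Serve is implemented, just an illustration of the idea. The names gpu_worker, submit and fake_transcribe are made up for this example, and fake_transcribe is a stand-in for a real Whisper call on the given GPU.

```python
import asyncio


async def gpu_worker(gpu_id, queue, transcribe):
    """Serve requests from the shared queue on one GPU, one at a time."""
    while True:
        audio, future = await queue.get()
        try:
            # Run the blocking model call in a thread so the event loop stays free.
            text = await asyncio.to_thread(transcribe, gpu_id, audio)
        except Exception as exc:
            if not future.done():
                future.set_exception(exc)
        else:
            if not future.done():
                future.set_result(text)
        finally:
            queue.task_done()


async def submit(queue, audio):
    """Enqueue one request and wait for whichever worker handles it."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((audio, future))
    return await future


def fake_transcribe(gpu_id, audio):
    # Hypothetical stand-in for a Whisper model loaded on GPU `gpu_id`.
    return f"gpu{gpu_id}:{audio}"


async def main(num_gpus=2, num_requests=6):
    queue = asyncio.Queue()
    # One long-lived worker task per GPU.
    workers = [
        asyncio.create_task(gpu_worker(i, queue, fake_transcribe))
        for i in range(num_gpus)
    ]
    # gather() returns results in the order the requests were submitted.
    results = await asyncio.gather(
        *(submit(queue, f"clip{n}") for n in range(num_requests))
    )
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a FastAPI app, the request handler would play the role of submit, and the workers would be started once at startup. Ray Serve gives you the same effect (plus process isolation and scaling across machines) without writing this by hand.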
