Best strategy for inference on multiple GPUs
Hello,
I have a question about serving this model for a near-real-time use case with many concurrent users.
I'm serving this model behind a FastAPI/uvicorn web server. Right now it works with the model running on a single GPU.
I want to increase the serving throughput by using multiple GPUs, with one instance of whisper on each.
Do you know which technologies I can use to queue incoming HTTP requests and route them, with some load balancing, to the different instances/GPUs, so as to maximize throughput and minimize latency?
Thanks in advance!
Ray Serve :)
Set the number of Ray Serve replicas (`num_replicas`) to the number of GPUs you have available, and set the actor options on the deployment to `ray_actor_options={"num_gpus": 1}`.
Each replica will then have exactly one GPU visible to it, so you can instantiate one Whisper model per replica.
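To make that concrete, here is a minimal sketch of such a deployment. It is an illustration under assumptions, not a tuned setup: the names `deployment_options`, `build_app`, `WhisperDeployment` and `main` are made up for this example, the request body is assumed to be the raw bytes of an audio file, and `blocking=True` assumes a reasonably recent Ray release. Ray and Transformers are imported lazily inside the functions so the small options helper can be used on its own.

```python
from typing import Any, Dict


def deployment_options(gpu_count: int) -> Dict[str, Any]:
    """Ray Serve options for one replica per GPU, each reserving one GPU."""
    if gpu_count < 1:
        raise ValueError("at least one GPU is required")
    return {
        "num_replicas": gpu_count,
        "ray_actor_options": {"num_gpus": 1},
    }


def build_app(gpu_count: int):
    """Build a Ray Serve application with one Whisper replica per GPU."""
    # Deferred imports: Ray and Starlette are only needed when serving.
    from ray import serve
    from starlette.requests import Request

    @serve.deployment(**deployment_options(gpu_count))
    class WhisperDeployment:  # hypothetical name for this sketch
        def __init__(self) -> None:
            from transformers import pipeline

            # Ray restricts CUDA_VISIBLE_DEVICES for each replica, so
            # "cuda:0" refers to the single GPU assigned to this replica.
            self.pipe = pipeline(
                "automatic-speech-recognition",
                model="openai/whisper-large-v3",
                device="cuda:0",
            )

        async def __call__(self, request: Request) -> Dict[str, str]:
            # Assumption: the HTTP body is the raw content of an audio file.
            audio_bytes = await request.body()
            # The synchronous call below occupies the replica until it
            # finishes; other requests wait or go to another replica.
            result = self.pipe(audio_bytes)
            return {"text": result["text"]}

    return WhisperDeployment.bind()


def main() -> None:
    """Start the HTTP server with one replica per detected GPU."""
    import torch
    from ray import serve

    serve.run(build_app(torch.cuda.device_count()), blocking=True)
```

Calling `main()` on the GPU host starts the server. Ray Serve's HTTP proxy then takes over the queueing and distributes incoming requests across the replicas, so the existing FastAPI layer does not need to know which GPU handles a given request.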