Elastic model: whisper-large-v3
Overview
ElasticModels are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement, routing different compression algorithms to different layers. For each supported model, we produce a series of optimized versions:
- XL: Mathematically equivalent neural network, optimized with our DNN compiler.
- L: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.
- M: Faster model, with accuracy degradation under 1.5%.
- S: The fastest model, with accuracy degradation under 2%.
Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Serving with Docker Image section).
System Requirements
| Property | Value |
|---|---|
| GPU | L40s, RTX 4090, RTX 5090, H100 |
| Python Version | 3.10-3.12 |
| CPU | Intel/AMD x86_64 |
| CUDA Version | 12.8+ |
TheStage AI Access Token Setup
Install the TheStage AI CLI and set up your API token:
pip install thestage
thestage config set --access-token <YOUR_ACCESS_TOKEN>
ElasticModels
Elastic Models provide the same interface as Hugging Face Transformers. Here is an example of how to use the whisper-large-v3 model.
Installation
pip install 'thestage-elastic-models[nvidia]' \
--extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install datasets==3.6.0 librosa soundfile # only needed to load audio for the example below
Usage
import torch
from elastic_models.transformers import AutoModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_name = "openai/whisper-large-v3"
hf_token = ''  # your Hugging Face access token
device = torch.device("cuda")

processor = AutoProcessor.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode='S'  # one of 'S', 'M', 'L', 'XL'
).to(device)

# Load audio file
from datasets import load_dataset

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy",
    "clean", split="validation"
)
audio_sample = dataset[0]["audio"]

# Process audio
input_features = processor(
    audio_sample["array"],
    sampling_rate=audio_sample["sampling_rate"],
    return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# Generate transcription
with torch.inference_mode():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0]
print(f"Transcription: {transcription}")
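Whisper's encoder consumes 30-second windows, so recordings longer than that are usually split into chunks before transcription (the Docker image below exposes this as CHUNK_LENGTH). A minimal chunking sketch in plain NumPy, independent of the SDK; the function name is illustrative, not part of the Elastic Models API:

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int = 16_000,
                chunk_seconds: int = 30) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length chunks.

    The final chunk may be shorter than chunk_seconds; Whisper's
    feature extractor pads it to a full window internally.
    """
    step = sample_rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# Example: 70 seconds of audio at 16 kHz -> three chunks (30 s, 30 s, 10 s)
audio = np.zeros(70 * 16_000, dtype=np.float32)
chunks = chunk_audio(audio)
print([len(c) // 16_000 for c in chunks])  # [30, 30, 10]
```

Each chunk can then be passed through the processor and model exactly as in the single-sample example above.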
Quality Benchmarks
We have evaluated the models using the Hugging Face Open ASR Leaderboard methodology. For each model size (S, M, L, XL), we report Word Error Rate (WER) on standard English and multilingual speech recognition benchmarks.
Open ASR Leaderboard (English, WER %)
| Dataset | S | M | L | XL | Original |
|---|---|---|---|---|---|
| LibriSpeech Clean | 1.89 | 1.88 | 1.87 | 1.88 | 1.88 |
| LibriSpeech Other | 3.71 | 3.69 | 3.69 | 3.69 | 3.68 |
| SPGISpeech | 2.92 | 2.9 | 2.91 | 2.95 | 2.95 |
| TED-LIUM | 3.81 | 3.77 | 3.8 | 3.85 | 3.84 |
| VoxPopuli | 8.33 | 8.45 | 8.41 | 9.28 | 9.17 |
| GigaSpeech | 9.97 | 10.01 | 10.01 | 10.01 | 10.01 |
| Earnings-22 | 11.23 | 11.2 | 11.21 | 11.27 | 11.24 |
| AMI | 16.13 | 16.1 | 16.04 | 16.08 | 16.16 |
| Mean WER | 7.25 | 7.25 | 7.24 | 7.38 | 7.37 |
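The Mean WER row is the unweighted average of the eight dataset rows. Recomputing it for the S column, with values taken from the table above:

```python
# Per-dataset WER (%) for the S model, from the Open ASR Leaderboard table
s_wer = [1.89, 3.71, 2.92, 3.81, 8.33, 9.97, 11.23, 16.13]
mean_wer = sum(s_wer) / len(s_wer)
print(round(mean_wer, 2))  # 7.25
```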
Multilingual (WER %)
| Dataset | S | M | L | XL | Original |
|---|---|---|---|---|---|
| CoVoST2 DE | 5.83 | 5.79 | 5.81 | 5.8 | 5.79 |
| CoVoST2 ES | 4.32 | 4.28 | 4.3 | 4.3 | 4.3 |
| CoVoST2 FR | 10.86 | 10.83 | 10.82 | 10.85 | 10.8 |
| CoVoST2 IT | 5.24 | 5.23 | 5.25 | 5.22 | 5.21 |
| CoVoST2 PT | 3.7 | 4.15 | 4.11 | 4.11 | 4.12 |
| FLEURS DE | 3.82 | 3.86 | 3.85 | 3.84 | 3.86 |
| FLEURS ES | 2.56 | 2.55 | 2.54 | 2.55 | 2.56 |
| FLEURS FR | 5.15 | 5.26 | 5.22 | 5.24 | 5.26 |
| FLEURS IT | 2.39 | 2.47 | 2.41 | 2.42 | 2.42 |
| FLEURS PT | 3.66 | 3.68 | 3.7 | 3.67 | 3.67 |
| MLS French | 4.69 | 4.68 | 4.7 | 4.68 | 4.68 |
| MLS German | 4.42 | 4.41 | 4.42 | 4.46 | 4.45 |
| MLS Italian | 8.95 | 8.94 | 8.92 | 8.92 | 8.94 |
| MLS Portuguese | 5.7 | 5.61 | 5.57 | 5.52 | 5.58 |
| MLS Spanish | 2.9 | 2.93 | 2.93 | 2.92 | 2.86 |
| Mean WER | 4.94 | 4.98 | 4.97 | 4.97 | 4.97 |
Datasets
English (Open ASR Leaderboard)
- LibriSpeech Clean: Read English speech from audiobooks, recorded in clean conditions. Tests baseline transcription accuracy on clear, well-articulated speech.
- LibriSpeech Other: Read English speech from audiobooks with more challenging acoustic conditions, including noisier recordings and less common speakers.
- SPGISpeech: Financial earnings calls and presentations, featuring domain-specific terminology, spontaneous speech, and diverse speaker accents.
- TED-LIUM: TED conference talks covering a wide range of topics, with diverse speakers, presentation styles, and varying audio quality.
- VoxPopuli: European Parliament event recordings in multiple languages, featuring political discourse, formal speech, and multilingual speakers.
- GigaSpeech: Large-scale multi-domain English speech corpus from audiobooks, podcasts, and YouTube, representing diverse acoustic conditions and speaking styles.
- Earnings-22: Corporate earnings calls with financial terminology, multiple speakers, and telephone-quality audio.
- AMI: Meeting recordings with overlapping speech, distant microphones, and natural conversational dynamics.
Multilingual
- CoVoST2: Common Voice Speech-To-Text 2. Built on Mozilla's Common Voice recordings, providing speech-to-text evaluation across 21 languages with diverse speakers, accents, and recording conditions.
- FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. Covers 102 languages with read speech from Wikipedia passages.
- MLS: Multilingual LibriSpeech. Derived from read audiobooks in 8 languages, providing large-scale multilingual ASR evaluation data.
Metrics
- WER (Word Error Rate): Measures the proportion of word-level errors (substitutions, insertions, deletions) in the transcription compared to the reference text. Lower values indicate better accuracy.
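The definition above can be made concrete with a word-level Levenshtein distance. A minimal sketch; note that leaderboard-style evaluations (including the Open ASR Leaderboard) also apply text normalization before scoring, which this sketch omits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```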
Latency Benchmarks
We measured RTFx for each model size on several GPUs. RTFx indicates how many times faster than real time the model transcribes audio; higher is better.
RTFx, batch size 1
| GPU/Model Size | S | M | L | XL | Original |
|---|---|---|---|---|---|
| H100 | 63.0 | 62.8 | 59.0 | 59.1 | 26.1 |
| L40s | 55.2 | 54.3 | 54.0 | 51.0 | 18.1 |
| GeForce RTX 5090 | 64 | 62 | 62 | 62 | 25 |
| GeForce RTX 4090 | 69.3 | 67.9 | 67.2 | 64.3 | 38.5 |
RTFx, batched
| GPU/Model Size | S | M | L | XL | Original |
|---|---|---|---|---|---|
| RTX 4090 (bs=24) | 355 | 353 | 335 | 315 | 302 |
| L40s (bs=32) | 279 | 279 | 269 | 261 | 238 |
| RTX 5090 (bs=32) | 401 | 396 | 396 | 376 | 287 |
| H100 (bs=64) | 532 | 528 | 525 | 522 | 415 |
Benchmarking Methodology
The benchmarking was performed on a single GPU using a 10-minute audio file resampled to 16kHz mono. RTFx (Real-Time Factor) is calculated as audio_duration / transcription_time — higher values mean faster-than-real-time transcription.
Algorithm summary:
- Load the whisper-large-v3 model with the specified size (S, M, L, XL, original).
- Load a 10-minute audio file and resample to 16kHz mono.
- Run a warm-up pass to initialize GPU caches.
- Synchronize the GPU, record the start time.
- Run the transcription pipeline with the specified batch size and chunk length.
- Synchronize the GPU, record the end time.
- Calculate RTFx as audio_duration / time_taken.
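The metric in the steps above reduces to a single division. A tiny helper (names illustrative; the 9.5 s timing is a hypothetical value chosen to match the order of magnitude of the H100 bs=1 row, not a measured result):

```python
def rtfx(audio_duration_s: float, transcription_time_s: float) -> float:
    """Real-time factor: seconds of audio transcribed per second
    of wall-clock time. Higher means faster than real time."""
    return audio_duration_s / transcription_time_s

# A 10-minute (600 s) file hypothetically transcribed in 9.5 s:
print(round(rtfx(600.0, 9.5), 1))  # 63.2
```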
Serving with Docker Image
For serving on NVIDIA GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints. Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers. You can also use this container to run inference through the TheStage AI platform.
Prebuilt image from ECR
Pull the Docker image and start the inference container:
docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-stt-streaming-24.09c
docker run --rm -it \
--name triton-stt \
--gpus all \
-p 127.0.0.1:80:80 \
-v "$HOME/.cache:/opt/project/.cache/" \
-e MODEL_REPO=openai/whisper-large-v3 \
-e MODEL_SIZE=<MODEL_SIZE> \
-e MODEL_BATCH=<MODEL_BATCH> \
-e PIPELINE_MAX_BATCH_SIZE=<PIPELINE_MAX_BATCH_SIZE> \
-e CHUNK_LENGTH=<CHUNK_LENGTH> \
-e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
-e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-stt-streaming-24.09c
| Parameter | Description |
|---|---|
| <MODEL_SIZE> | Available: S, M, L, XL. |
| <MODEL_BATCH> | Maximum batch size for the model. |
| <PIPELINE_MAX_BATCH_SIZE> | Maximum batch size for the ASR pipeline processing. |
| <CHUNK_LENGTH> | Audio chunk length in seconds (e.g., 10, 15, 20, 30). |
| <HUGGINGFACE_ACCESS_TOKEN> | Hugging Face access token. |
| <THESTAGE_ACCESS_TOKEN> | TheStage token generated on the platform (Profile -> Access tokens). |
Invocation
CLI
elastic-models-client client stt --sample sample.wav --lang-id en
cURL
curl -X POST http://127.0.0.1:80/v1/audio/transcriptions \
-H "Authorization: Bearer 123" \
-H "X-Lang-Id: en" \
-H "X-Model-Name: -<MODEL_SIZE>-bs<MODEL_BATCH>" \
-F "file=@sample.wav"
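The same request can be issued from Python. This sketch only assembles the URL and headers for the endpoint; the helper name is ours, and actually sending the file would use, e.g., requests.post(url, headers=headers, files={"file": open("sample.wav", "rb")}):

```python
def build_transcription_request(host: str, model_size: str,
                                model_batch: int, lang_id: str = "en"):
    """Assemble URL and headers for the /v1/audio/transcriptions endpoint.

    X-Model-Name follows the container's -<size>-bs<batch> convention.
    """
    url = f"http://{host}/v1/audio/transcriptions"
    headers = {
        "Authorization": "Bearer 123",
        "X-Lang-Id": lang_id,
        "X-Model-Name": f"-{model_size}-bs{model_batch}",
    }
    return url, headers

url, headers = build_transcription_request("127.0.0.1:80", "S", 24)
print(url)                       # http://127.0.0.1:80/v1/audio/transcriptions
print(headers["X-Model-Name"])   # -S-bs24
```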
Endpoint Parameters
Method
POST /v1/audio/transcriptions
Header Parameters
- Authorization (string): Bearer token for authentication.
- X-Lang-Id (string): Language of the audio (e.g., "en", "es", "fr").
- X-Model-Name (string): Specifies the model to use for transcription. Format: -<size>-bs<batch_size>, where <size> is one of S, M, L, XL, original, and <batch_size> is the MODEL_BATCH configured during container startup.
Input Body
- file (binary): The audio file to transcribe (multipart/form-data).
Links
- Platform: app.thestage.ai
- Subscribe for updates: TheStageAI X
- Contact email: contact@thestage.ai
Model tree for TheStageAI/Elastic-whisper-large-v3
- Base model: openai/whisper-large-v3
