Update README.md

The `Whisper-large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `whisper-large-v2`. The model was trained for 2.0 epochs over this mixture dataset.

The `Whisper-large-v3` model shows improved performance over a wide variety of languages: it performs with lower than a 60% error rate on Common Voice 15 and Fleurs, a 10% to 20% reduction in errors compared to `Whisper-large-v2`.

**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were copied and pasted from the original model card.

The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|--------|---|------------------------------------------------------|
| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
## Usage

Whisper-large-v3 is supported in Hugging Face 🤗 Transformers through the `main` branch of the Transformers repo. To run the model, first
install the Transformers library from the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
```

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30 seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Run on GPU in half precision if available, otherwise on CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a dummy LibriSpeech sample and transcribe it
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
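
By default, Whisper predicts the language of the source audio automatically. If the language is known ahead of time, it can be fixed to skip the detection step. The following is a minimal sketch, not part of the original example, assuming the `pipe` and `sample` objects defined above and a recent Transformers version in which the pipeline forwards `generate_kwargs` (including Whisper's `language` and `task` arguments) to `model.generate`:

```python
# Minimal sketch (assumption): force French speech recognition instead of
# relying on automatic language detection. "language" and "task" are
# forwarded to model.generate via the pipeline's generate_kwargs.
result = pipe(sample, generate_kwargs={"language": "french", "task": "transcribe"})
print(result["text"])
```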

### Long-Form Transcription

In Transformers, Whisper-large-v3 uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a long-form audio sample and transcribe it with chunking and batching
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
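
The chunked pipeline can also return timestamps for each transcribed segment. As a minimal sketch (assuming the `pipe` and `sample` objects from the example above), passing `return_timestamps=True` adds a `"chunks"` entry to the output alongside the transcribed text:

```python
# Minimal sketch (assumption): request segment-level timestamps from the
# chunked pipeline. Each chunk carries a (start, end) tuple in seconds.
result = pipe(sample, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```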

<!---
**Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:

```python
result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```
--->

### Speculative Decoding

[Distil-Whisper](https://hf.co/distil-whisper/large-v2) can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
ensures the exact same outputs as Whisper are obtained, while being 2 times faster. This makes it the perfect drop-in
replacement for existing Whisper pipelines, since the same outputs are guaranteed.

In the following code-snippet, we load the assistant Distil-Whisper model separately from the main Whisper pipeline. We then
specify it as the "assistant model" for generation:

```python
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the assistant (draft) model: the Distil-Whisper decoder
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# Load the main Whisper model
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

## Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Whisper-large-v3, which we cover in the following.

### Flash Attention

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then, all you have to do is pass `use_flash_attention_2=True` to `from_pretrained`:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
```

### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
To do so, you first need to install Optimum:

```bash
pip install --upgrade optimum
```

And then convert your model to a "BetterTransformer" model before using it:

```diff
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
```

## Fine-Tuning