Update README.md

The `Whisper-large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `whisper-large-v2`. The model was trained for 2.0 epochs over this mixture dataset.

The `Whisper-large-v3` model shows improved performance over a wide variety of languages: it performs with lower than a 60% error rate on Common Voice 15 and Fleurs, a 10% to 20% reduction in errors compared to `Whisper-large-v2`.

**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were copied and pasted from the original model card.

The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|--------|---|------------------------------------------------------|
| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
## Usage

Whisper-large-v3 is supported in Hugging Face 🤗 Transformers through the `main` branch of the Transformers repo. To run the model, first
install the Transformers library from the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
```

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30 seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Run on GPU in half precision if available, otherwise on CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a dummy LibriSpeech sample and transcribe it
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
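
By default, Whisper predicts the language of the source audio automatically. If the language is known ahead of time, it can be fixed to skip the detection step. The following is a minimal sketch, not part of the original example, assuming the `pipe` and `sample` objects defined above and a recent Transformers version in which the pipeline forwards `generate_kwargs` (including Whisper's `language` and `task` arguments) to `model.generate`:

```python
# Minimal sketch (assumption): force French speech recognition instead of
# relying on automatic language detection. "language" and "task" are
# forwarded to model.generate via the pipeline's generate_kwargs.
result = pipe(sample, generate_kwargs={"language": "french", "task": "transcribe"})
print(result["text"])
```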

### Long-Form Transcription

In Transformers, Whisper-large-v3 uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a long-form audio sample and transcribe it with chunking and batching
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
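
The chunked pipeline can also return timestamps for each transcribed segment. As a minimal sketch (assuming the `pipe` and `sample` objects from the example above), passing `return_timestamps=True` adds a `"chunks"` entry to the output alongside the transcribed text:

```python
# Minimal sketch (assumption): request segment-level timestamps from the
# chunked pipeline. Each chunk carries a (start, end) tuple in seconds.
result = pipe(sample, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```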

<!---
**Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:

```python
result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```
--->

### Speculative Decoding

[Distil-Whisper](https://hf.co/distil-whisper/large-v2) can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
ensures the exact same outputs as Whisper are obtained, while being 2 times faster. This makes it the perfect drop-in
replacement for existing Whisper pipelines, since the same outputs are guaranteed.

In the following code-snippet, we load the assistant Distil-Whisper model separately from the main Whisper pipeline. We then
specify it as the "assistant model" for generation:

```python
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the assistant (draft) model: the Distil-Whisper decoder
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# Load the main Whisper model
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

## Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Whisper-large-v3, which we cover in the following.

### Flash Attention

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then, all you have to do is pass `use_flash_attention_2=True` to `from_pretrained`:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
```

### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
To do so, you first need to install Optimum:

```bash
pip install --upgrade optimum
```

And then convert your model to a "BetterTransformer" model before using it:

```diff
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
```

## Fine-Tuning