Model card update suggested

#49
by cfasana - opened

Running the code provided in the model card with the latest version of the transformers library, the behavior does not match the one described.
Specifically, running the following code...

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
print("[Skip special tokens=False] ", transcription)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print("[Skip special tokens=True] ", transcription)

... gives the following output:

[Skip special tokens=False] [' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']
[Skip special tokens=True] [' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']

The reason for this behavior was explained by @eustlb in the corresponding GitHub issue; I summarize it below.

The docs actually spell out two different return behaviours:
- Default generate() (see l. 550) → returns a torch.LongTensor with the decoder input IDs (<|startoftranscript|>, <|en|>, etc.) already stripped. They're not in the tensor, so skip_special_tokens=False can't show what isn't there; that's why both settings give identical output.
- return_dict_in_generate=True (see l. 548) → returns a ModelOutput whose .sequences keeps everything, including the decoder input IDs and <|endoftext|>. Now skip_special_tokens actually has something to filter, which is why a difference appears.
Hence, to showcase the special-token skipping functionality, the model card should be updated to call generate() with return_dict_in_generate=True and decode the returned .sequences.
Hope this helps keep the model card aligned with the current library version.
