aiola/whisper-ner-v1

@@ -13,18 +13,19 @@ tags:
 - Named entity recognition
 ---
-# Whisper Ner
-Whisper ner is an advanced model that allows joint speech transcription and entity recognition.
 WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference.
-We augment a large synthetic dataset with synthetic speech samples.
-This allows us to train WhisperNER on a large number of examples with diverse NER tags.
-During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities.
 ---------
 ## Training Details
-`aiola/whisper-ner-v1` was trained on the Nuner dataset to perform audio translation with ner at the same time in English only.
 ---------
@@ -33,107 +34,49 @@ To use `whisper-ner-v1` install [`whisper-ner`](https://github.com/aiola-lab/whi
 Inference can be done using the following code:
 ```python
-import logging
-import argparse
 import torch
 from transformers import WhisperProcessor, WhisperForConditionalGeneration
-from experiments.utils import set_logger, get_device, remove_suppress_tokens
-from experiments.utils.utils import UNSUPPRESS_TOKEN
-import torchaudio
-import numpy as np
-set_logger()
-@torch.no_grad()
-def main(model_path, audio_file_path, prompt, max_new_tokens, language, device):
-    # load model and processor from pre-trained
-    processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
-    model = WhisperForConditionalGeneration.from_pretrained(model_path)
-    remove_suppress_tokens(model)
-    logging.info(f"removed suppress tokens: {UNSUPPRESS_TOKEN}")
-    model = model.to(device)
-    # load audio file: user is responsible for loading the audio files themselves
-    target_sample_rate = 16000
-    signal, sampling_rate = torchaudio.load(audio_file_path)
-    resampler = torchaudio.transforms.Resample(sampling_rate, target_sample_rate)
-    signal = resampler(signal)
-    # convert to mono or remove first dim if needed
-    if signal.ndim == 2:
-        signal = torch.mean(signal, dim=0)
-    # pre-process to get the input features
-    input_features = processor(
-        signal, sampling_rate=target_sample_rate, return_tensors="pt"
-    ).input_features
-    input_features = input_features.to(device)
-    prompt = prompt.lower()  # lowercase the prompt, to align with training
-    prompt_ids = processor.get_prompt_ids(prompt, return_tensors="pt")
-    prompt_ids = prompt_ids.to(device)
-    # generate token ids by running model forward sequentially
-    logging.info(f"Inference with prompt: '{prompt}'.")
     predicted_ids = model.generate(
         input_features,
-        max_new_tokens=max_new_tokens,
-        language=language,
         prompt_ids=prompt_ids,
         generation_config=model.generation_config,
     )
-    # post-process token ids to text
-    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
-    print(transcription)
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="Transcribe audio using Whisper model."
-    )
-    parser.add_argument(
-        "--model-path",
-        type=str,
-        required=True,
-        default="aiola/whisper-ner-v1",
-        help="Path to the pre-trained model components.",
-    )
-    parser.add_argument(
-        "--audio-file-path",
-        type=str,
-        required=True,
-        help="Path to the audio file (wav) to transcribe.",
-    )
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        default="father",
-        help="Prompt text to guide the transcription.",
-    )
-    parser.add_argument(
-        "--max-new-tokens",
-        type=int,
-        default=256,
-        help="Maximum number of new tokens to generate.",
-    )
-    parser.add_argument(
-        "--language",
-        type=str,
-        default="en",
-        help="Language code for the transcription.",
-    )
-    args = parser.parse_args()
-    device = get_device()
-    main(
-        args.model_path,
-        args.audio_file_path,
-        args.prompt,
-        args.max_new_tokens,
-        args.language,
-        device,
-    )
 ```

 - Named entity recognition
 ---
+# Whisper-NER
+- Paper: [_WhisperNER: Unified Open Named Entity and Speech Recognition_](https://arxiv.org/abs/2409.08107).
+- Code: https://github.com/aiola-lab/whisper-ner
+We introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition.
 WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference.
 ---------
 ## Training Details
+`aiola/whisper-ner-v1` was trained on the NuNER dataset to perform joint audio transcription and NER tagging.
+The model was trained and evaluated only on English data. Check out the [paper](https://arxiv.org/abs/2409.08107) for full details.
 ---------
 Inference can be done using the following code:
 ```python
 import torch
+import torchaudio
 from transformers import WhisperProcessor, WhisperForConditionalGeneration
+model_path = "aiola/whisper-ner-v1"
+audio_file_path = "path/to/audio/file"
+prompt = "person, company, location"  # comma-separated entity tags
+# load model and processor from pre-trained
+processor = WhisperProcessor.from_pretrained(model_path)
+model = WhisperForConditionalGeneration.from_pretrained(model_path)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+# load audio file: user is responsible for loading the audio files themselves
+target_sample_rate = 16000
+signal, sampling_rate = torchaudio.load(audio_file_path)
+resampler = torchaudio.transforms.Resample(sampling_rate, target_sample_rate)
+signal = resampler(signal)
+# convert to mono or remove first dim if needed
+if signal.ndim == 2:
+    signal = torch.mean(signal, dim=0)
+# pre-process to get the input features
+input_features = processor(
+    signal, sampling_rate=target_sample_rate, return_tensors="pt"
+).input_features
+input_features = input_features.to(device)
+prompt_ids = processor.get_prompt_ids(prompt.lower(), return_tensors="pt")
+prompt_ids = prompt_ids.to(device)
+# generate token ids by running model forward sequentially
+with torch.no_grad():
     predicted_ids = model.generate(
         input_features,
         prompt_ids=prompt_ids,
         generation_config=model.generation_config,
+        language="en",
     )
+# post-process token ids to text, remove prompt
+transcription = processor.batch_decode(
+    predicted_ids[:, prompt_ids.shape[0]:], skip_special_tokens=True
+)[0]
+print(transcription)
 ```
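
The inference snippet expects the prompt as a single string of comma-separated entity tags, lowercased before tokenization. When the tags come from a list, a small helper can build that string. The following is a minimal sketch; `build_prompt` is a hypothetical name and not part of `whisper-ner` or `transformers`, and the `", "` separator is assumed from the example prompt above.

```python
def build_prompt(entity_tags):
    """Join entity tags into a comma-separated, lowercased prompt string.

    Hypothetical helper for illustration only; the separator is assumed
    from the example prompt "person, company, location".
    """
    # Trim whitespace and lowercase each tag, mirroring prompt.lower() above.
    cleaned = [tag.strip().lower() for tag in entity_tags]
    # Drop tags that are empty after trimming.
    cleaned = [tag for tag in cleaned if tag]
    if not cleaned:
        raise ValueError("at least one entity tag is required")
    return ", ".join(cleaned)


prompt = build_prompt(["Person", " Company", "location"])
print(prompt)
```

The resulting string can be passed to `processor.get_prompt_ids` in place of the hard-coded `prompt` value.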