# Usage Guide for Fine-Tuned Whisper ASR Model on Chichewa

This notebook provides a step-by-step guide to using a fine-tuned Whisper model for Automatic Speech Recognition (ASR) in **Chichewa**. The model has been fine-tuned on a Chichewa dataset consisting of approximately **24 hours of speech data**, and it has undergone several iterations, resulting in multiple versions with differing levels of performance. This guide directs you to the **best-performing model** to ensure optimal transcription accuracy.

### Key Highlights:
1. **Fine-Tuned on Chichewa**: The model has been specifically trained on 24 hours of Chichewa speech, making it well-suited for transcription tasks in this language. The fine-tuning process has significantly improved its performance for Chichewa ASR tasks.
   
2. **No Separate Tokenizer Required**: You do not need to load a tokenizer on its own. The `WhisperProcessor` used in this guide bundles the feature extractor and the tokenizer, so a single processor object handles both audio preprocessing and decoding, which simplifies inference.

3. **Best Performance via Commit Hash**: This notebook loads the model by passing the **`revision` parameter** with the exact **commit hash** of the version that achieved the lowest Word Error Rate (WER). Following this guide therefore gives you the most accurate version of the model for Chichewa ASR.

### What You’ll Learn:
- How to load and use the fine-tuned Whisper model specifically for Chichewa ASR.
- How to process and transcribe audio files using this model.
- How to ensure you are using the version of the model that delivers the best transcription performance for Chichewa by utilizing the `revision` parameter with the commit hash.

By following this guide, you’ll be equipped to leverage this specialized ASR model to produce high-quality transcriptions of Chichewa speech with minimal setup and effort.


In [1]:
import warnings

import librosa
import torch
from datasets import Audio, DatasetDict, load_dataset, load_from_disk
from transformers import (
    AutoModel,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
    logging,
    pipeline,
)

# Suppress all Python warnings
warnings.filterwarnings("ignore")

# Suppress Hugging Face logs except errors
logging.set_verbosity_error()


## 1. Best-Performing Model and WER Information

In this section, we provide the commit hash for the best-performing model and its corresponding Word Error Rate (WER) based on evaluation.
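
For readers unfamiliar with the metric, the sketch below shows the standard definition of WER: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` is hypothetical, and this notebook does not state how the reported WER was computed (for example, whether text was normalized first), so treat this as an illustration of the metric rather than the evaluation code used for this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: no text normalization (casing, punctuation) is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)
```

A lower value is better: `0.0` means the hypothesis matches the reference word for word, and values above `1.0` are possible when the hypothesis contains many extra words.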

### Best-Performing Model:

The model with the best WER performance can be loaded using the following **commit hash**:

```python
# Model repository and commit hash of the best-performing version
HUGGINGFACE_MODEL_ID = "dmatekenya/whisper-large-v3-chichewa"
BEST_MODEL_COMMIT_HASH = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"
```

### Note/Warning
Although the model endpoint stays the same (`dmatekenya/whisper-large-v3-chichewa`), you must pass the commit hash given above (`BEST_MODEL_COMMIT_HASH`) when loading the model to get the best-performing version. This [full URL](https://huggingface.co/dmatekenya/whisper-large-v3-chichewa/commit/bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a) points to that commit.
Without the commit hash, the latest version is loaded, which may have a higher Word Error Rate (WER) than the version evaluated as best. Always include the commit hash to get the most accurate results.
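
Because `revision` also accepts branch names and tags, it is easy to pass something like `"main"` by mistake and silently load a different version. The sketch below is one possible guard, assuming you want to accept only a full 40-character commit SHA; the helper name `is_full_commit_hash` is hypothetical and not part of any library.

```python
import re

# A full Git commit SHA-1 is 40 lowercase hexadecimal characters.
_FULL_SHA_PATTERN = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(revision: str) -> bool:
    """Return True only if `revision` looks like a full commit hash."""
    return _FULL_SHA_PATTERN.fullmatch(revision) is not None


BEST_MODEL_COMMIT_HASH = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a"

if not is_full_commit_hash(BEST_MODEL_COMMIT_HASH):
    raise ValueError("Expected a full 40-character commit hash for `revision`")
```

The check only validates the format of the string. It does not confirm that the commit exists in the model repository.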


## 2. Performing Speech-to-Text Inference on Audio Files 
In this section, I will demonstrate how to transcribe an individual audio file directly using the fine-tuned Whisper model, instead of loading a dataset through the HuggingFace `datasets` package. This approach is particularly useful when deploying the model within an application.

In [2]:
# HuggingFace endpoint for this finetuned model
finetuned_model_id = "dmatekenya/whisper-large-v3-chichewa"

# Copy the commit hash from this notebook or follow the URL provided above
best_model_commit_hash = "bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a" 

# Whisper base model endpoint
whisper_base_model_id = "openai/whisper-large-v3"

# Language: Shona was used during fine-tuning, so use the same
# language setting when loading the processor from the base model
whisper_base_model_language = "shona"

# Load the Whisper processor from the base model
processor = WhisperProcessor.from_pretrained(
    whisper_base_model_id,
    language=whisper_base_model_language,
    task="transcribe",
)

# Load the fine-tuned model, using the revision parameter to pin
# the best-performing version
model = WhisperForConditionalGeneration.from_pretrained(
    finetuned_model_id,
    revision=best_model_commit_hash,
)

# Load audio file with librosa
audio_file = "inspect-data/audio1.mp3"
y, sr = librosa.load(audio_file, sr=16000)

# Prepare input features for the model
input_features = processor(y, return_tensors="pt", sampling_rate=sr).input_features

# Generate transcription
generated_ids = model.generate(inputs=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Print the transcript
print(f"Transcript from model: \n { transcription}")

Transcript from model: 
 Timazalandedza chifukwa choyamba ndi imene timapeza mafuta amene amafunikira mumatupimwathu..
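
When deploying inside an application, it can help to wrap the steps above in one function so the processor and model are loaded once and reused for every request. The sketch below is a minimal example under that assumption. The function name `transcribe_waveform` is hypothetical, and it expects the `processor` and `model` objects created in the cell above to be passed in.

```python
EXPECTED_SAMPLING_RATE = 16000  # same rate used with librosa.load above


def transcribe_waveform(waveform, sampling_rate, processor, model):
    """Transcribe one mono waveform with an already-loaded processor and model.

    Mirrors the cell above: extract features, generate token ids, decode.
    """
    if sampling_rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"Expected audio sampled at {EXPECTED_SAMPLING_RATE} Hz, "
            f"got {sampling_rate} Hz; resample before calling this function"
        )

    # Convert the raw waveform into the model's input features
    input_features = processor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    # Generate token ids and decode them back to text
    generated_ids = model.generate(input_features)
    transcription = processor.batch_decode(
        generated_ids, skip_special_tokens=True
    )[0]

    # Design choice in this sketch: strip surrounding whitespace
    return transcription.strip()


# Example usage with the objects loaded earlier in this notebook:
# y, sr = librosa.load("inspect-data/audio1.mp3", sr=16000)
# print(transcribe_waveform(y, sr, processor, model))
```

Passing the processor and model as arguments keeps the function free of global state, which also makes it straightforward to test with stand-in objects.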
