# Model Details
Model Name: Whisper_Small
Model Type: Speech-to-Text (Automatic Speech Recognition)
Base Model: OpenAI Whisper Small (openai/whisper-small)
Developed By: Aventiq AI
Date: February 24, 2025
Version: 1.0
# Model Description
This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating-point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.
Intended Use: General-purpose automatic speech recognition, particularly for English speech.
Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
Input: Audio files (16kHz sampling rate recommended).
Output: Text transcriptions of spoken content.
# Training Details
## Dataset
Name: SpeechOcean762 (mispeech/speechocean762)
Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
Language: English
## Training Procedure
Framework: Hugging Face Transformers
Hardware: [Specify if known, e.g., single NVIDIA GPU with FP16 support]
Hyperparameters (a configuration sketch follows this list):
Batch Size: 8 (train/eval)
Epochs: 3
Learning Rate: 1e-5
Mixed Precision: FP16
Optimizer: AdamW (default Whisper settings)
Preprocessing: Audio resampled to 16kHz and converted to input features using WhisperProcessor.
Training Time: Approximately 2 hours on a single GPU
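As a minimal sketch, the hyperparameters above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. The `output_dir` name is an assumption for illustration, not taken from the original training run.
```python
from transformers import Seq2SeqTrainingArguments

# Minimal sketch of the training configuration listed above.
# The output_dir name is an assumption for illustration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",
    per_device_train_batch_size=8,   # Batch Size: 8 (train)
    per_device_eval_batch_size=8,    # Batch Size: 8 (eval)
    num_train_epochs=3,              # Epochs: 3
    learning_rate=1e-5,              # Learning Rate: 1e-5
    fp16=True,                       # Mixed Precision: FP16
)
# AdamW is the Trainer's default optimizer, matching the
# "AdamW (default Whisper settings)" entry above.
```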
## Quantization
Method: Post-training quantization to FP16 using PyTorch’s .half() method (see the sketch below).
Purpose: Reduce model size and improve inference speed.
Model Size:
Original: 967 MB
Quantized: 461 MB
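A minimal sketch of this quantization step, assuming the fine-tuned full-precision checkpoint was saved locally (both paths are illustrative):
```python
from transformers import WhisperForConditionalGeneration

# Load the fine-tuned full-precision model (path is an assumption).
model = WhisperForConditionalGeneration.from_pretrained("./whisper-small-finetuned")

# Cast all floating-point parameters to FP16; this is what roughly
# halves the checkpoint size (967 MB -> 461 MB above).
model = model.half()

# Save the quantized checkpoint for inference.
model.save_pretrained("./whisper-small-finetuned-fp16")
```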
## Evaluation
### Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.
### Results
Average WER: 3.33
Average CER: 2.62
### Example Performance
| Audio File | Reference Text | Predicted Text | WER | CER |
| --- | --- | --- | --- | --- |
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |
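WER and CER can be computed with the `jiwer` package listed under Requirements. A minimal, self-contained sketch with illustrative strings (note that `jiwer` returns rates as fractions, so the averages above are presumably percentages):
```python
import jiwer

reference = "the north wind and the sun were disputing"   # illustrative only
hypothesis = "the north wind and the son were disputing"  # illustrative only

# jiwer returns error rates as fractions; multiply by 100 for percentages.
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")
print(f"CER: {jiwer.cer(reference, hypothesis) * 100:.2f}%")
```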
# Usage
## Requirements
Python 3.8+
Dependencies: transformers, torch, librosa, jiwer
Hardware: CPU or GPU (CUDA support recommended for faster inference)
## Installation
```bash
pip install transformers torch librosa jiwer
```
# Example Code
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load the fine-tuned, FP16-quantized checkpoint.
model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# Use a GPU when available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    # Resample to the 16kHz rate Whisper expects.
    audio, sr = librosa.load(audio_path, sr=16000)
    # Convert the waveform to log-mel input features.
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    # Generate token IDs with beam search, then decode to text.
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```
## Saved Model
Location: ./whisper-small-finetuned-fp16
Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.
# Limitations
Language: Optimized for English; performance on other languages may vary.
Audio Quality: Best performance on clean, 16kHz audio; may degrade with noisy or low-quality inputs.
Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).