# Model Details

- Model Name: Whisper_Small
- Model Type: Speech-to-Text (Automatic Speech Recognition)
- Base Model: OpenAI Whisper Small (openai/whisper-small)
- Developed By: Aventiq AI
- Date: February 24, 2025
- Version: 1.0

# Model Description

This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.

- Intended Use: General-purpose automatic speech recognition, particularly for English speech.
- Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
- Input: Audio files (16 kHz sampling rate recommended).
- Output: Text transcriptions of spoken content.

# Training Details

## Dataset
- Name: SpeechOcean762 (mispeech/speechocean762)
- Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
- Language: English

## Training Procedure
- Framework: Hugging Face Transformers
- Hardware: [Specify if known, e.g., Single NVIDIA GPU with FP16 support]
- Hyperparameters:
  - Batch Size: 8 (train/eval)
  - Epochs: 3
  - Learning Rate: 1e-5
  - Mixed Precision: FP16
  - Optimizer: AdamW (default Whisper settings)
- Preprocessing: Audio resampled to 16 kHz and converted to input features using WhisperProcessor.
- Training Time: 2+ hours on a single GPU

## Quantization
- Method: Post-training quantization to FP16 using PyTorch's `.half()` method.
- Purpose: Reduce model size and improve inference speed.
- Model Size:
  - Original: 967 MB
  - Quantized: 461 MB

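The size reduction follows directly from the per-weight storage cost: FP16 uses 2 bytes per weight instead of FP32's 4. A quick back-of-the-envelope check, assuming the published ~244M-parameter count for Whisper small:

```python
import struct

# Whisper small has roughly 244M parameters (published figure, used here
# only as an assumption for a rough size check).
n_params = 244_000_000

bytes_fp32 = n_params * struct.calcsize("f")  # 4 bytes per FP32 weight
bytes_fp16 = n_params * struct.calcsize("e")  # 2 bytes per FP16 weight

print(f"FP32: ~{bytes_fp32 / 2**20:.0f} MB")  # ~931 MB
print(f"FP16: ~{bytes_fp16 / 2**20:.0f} MB")  # ~465 MB
```

The small discrepancy with the reported checkpoint sizes is expected: the saved directory also contains non-weight files, and some buffers may remain in full precision.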
# Evaluation

## Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.

Results:
- Average WER: 3.33
- Average CER: 2.62

## Example Performance

| Audio File  | Reference Text                  | Predicted Text                  | WER | CER |
|-------------|---------------------------------|---------------------------------|-----|-----|
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |
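WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; CER is the same computed over characters. A minimal pure-Python sketch of the computation (the jiwer dependency listed under Requirements provides the same metrics via `jiwer.wer` / `jiwer.cer`):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, row-by-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# The "sun"/"son" confusion from the example row: one substituted word.
ref = "the north wind and the sun"
hyp = "the north wind and the son"
print(wer(ref, hyp))  # 1 substitution / 6 words ≈ 0.167
print(cer(ref, hyp))  # 1 substituted character / 26 chars ≈ 0.038
```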

# Usage

## Requirements
- Python 3.8+
- Dependencies: transformers, torch, librosa, jiwer
- Hardware: CPU or GPU (CUDA support recommended for faster inference)

## Installation

```bash
pip install transformers torch librosa jiwer
```

# Example Code

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```

# Saved Model
- Location: ./whisper-small-finetuned-fp16
- Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.

# Limitations

- Language: Optimized for English; performance on other languages may vary.
- Audio Quality: Best performance on clean, 16 kHz audio; may degrade with noisy or low-quality inputs.
- Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
- Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).