| # Model Details |
| Model Name: Whisper_Small |
| Model Type: Speech-to-Text (Automatic Speech Recognition) |
| Base Model: OpenAI Whisper Small (openai/whisper-small) |
| Developed By: Aventiq AI |
| Date: February 24, 2025 |
| Version: 1.0 |
| |
| # Model Description |
| This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating-point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy. |
| |
| |
| Intended Use: General-purpose automatic speech recognition, particularly for English speech. |
| Primary Users: Researchers, developers, and practitioners working on speech-to-text applications. |
| Input: Audio files (16kHz sampling rate recommended). |
| Output: Text transcriptions of spoken content. |
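
Because the card recommends 16 kHz input, it can help to check a file's sample rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are illustrative, not part of this model's code.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate recommended by this card


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target
```

Files recorded at other rates can still be used if they are resampled on load, for example with `librosa.load(path, sr=16000)`.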
| |
| # Training Details |
| Dataset |
| Name: SpeechOcean762 (mispeech/speechocean762) |
| Description: A dataset of English speech recordings with corresponding transcriptions, designed for pronunciation assessment, with scores along multiple dimensions (accuracy, completeness, fluency, prosody). |
| Language: English |
| |
| Training Procedure |
| Framework: Hugging Face Transformers |
| Hardware: Single GPU (model not specified) |
| Hyperparameters: |
| Batch Size: 8 (train/eval) |
| Epochs: 3 |
| Learning Rate: 1e-5 |
| Mixed Precision: FP16 |
| Optimizer: AdamW (default Whisper settings) |
| Preprocessing: Audio resampled to 16kHz, converted to input features using WhisperProcessor. |
| Training Time: 2+ hours on a single GPU |
| |
| Quantization |
| Method: Post-training quantization to FP16 using PyTorch’s .half() method. |
| Purpose: Reduce model size and improve inference speed. |
| Model Size: |
| Original: 967 MB |
| Quantized: 461 MB |
| |
| # Evaluation |
| Metrics |
| Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions. |
| Results: |
| Average WER: 3.33 |
| Average CER: 2.62 |
| |
| Example Performance |
| Audio File | Reference Text | Predicted Text | WER | CER |
| --- | --- | --- | --- | --- |
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | not reported | not reported |
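
The size figures reported under Quantization can be turned into a compression ratio with a few lines of arithmetic. This is a small sketch that uses only the two sizes stated in this card; the function name `size_reduction` is illustrative.

```python
# File sizes reported in this card, in megabytes.
ORIGINAL_MB = 967
QUANTIZED_MB = 461


def size_reduction(original_mb: float, reduced_mb: float) -> dict:
    """Return the size ratio, the megabytes saved, and the percentage saved."""
    ratio = reduced_mb / original_mb
    return {
        "ratio": round(ratio, 3),
        "saved_mb": original_mb - reduced_mb,
        "saved_percent": round((1 - ratio) * 100, 1),
    }


report = size_reduction(ORIGINAL_MB, QUANTIZED_MB)
print(report)  # {'ratio': 0.477, 'saved_mb': 506, 'saved_percent': 52.3}
```

A ratio close to 0.5 is what one would expect when weights go from 4 bytes (FP32) to 2 bytes (FP16) each.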
| |
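
To make the two evaluation metrics concrete, here is a minimal sketch of WER and CER written in plain Python. The card lists `jiwer` as a dependency, which presumably computed the reported numbers; this sketch does not reproduce that pipeline and only shows the underlying definitions (edit distance divided by reference length). The sentence pair is just the visible fragment of the example above, so the resulting values are not the scores for the full `harvard.wav` file.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ref = "the north wind and the sun"
hyp = "the north wind and the son"
print(f"WER: {wer(ref, hyp):.3f}")  # 1 wrong word out of 6
print(f"CER: {cer(ref, hyp):.3f}")  # 1 wrong character out of 26
```

Both functions return fractions; multiply by 100 to express them as percentages.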
| # Usage |
| Requirements |
| Python 3.8+ |
| Dependencies: transformers, torch, librosa, jiwer |
| Hardware: CPU or GPU (CUDA support recommended for faster inference) |
| Installation |
| ```bash |
| pip install transformers torch librosa jiwer |
| ``` |
| |
| # Example Code |
| ```python |
| from transformers import WhisperProcessor, WhisperForConditionalGeneration |
| import torch |
| import librosa |
| |
| model_path = "./whisper-small-finetuned-fp16" |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| # from_pretrained loads weights as float32 unless told otherwise, so request FP16 on GPU. |
| # On CPU, fall back to float32, where half-precision ops may be slow or unsupported. |
| dtype = torch.float16 if device.type == "cuda" else torch.float32 |
| |
| processor = WhisperProcessor.from_pretrained(model_path) |
| model = WhisperForConditionalGeneration.from_pretrained(model_path, torch_dtype=dtype) |
| model = model.to(device) |
| |
| def transcribe(audio_path): |
|     # Load the file as mono audio resampled to 16 kHz. |
|     audio, sr = librosa.load(audio_path, sr=16000) |
|     # Convert the waveform to input features on the model's device and dtype. |
|     inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device, dtype=dtype) |
|     with torch.no_grad(): |
|         outputs = model.generate(inputs, max_length=448, num_beams=4) |
|     return processor.batch_decode(outputs, skip_special_tokens=True)[0] |
| |
| # Example usage |
| print(transcribe("harvard.wav")) |
| ``` |
| |
| Saved Model |
| Location: ./whisper-small-finetuned-fp16 |
| Files: pytorch_model.bin, config.json, preprocessor_config.json, etc. |
| |
| # Limitations |
| Language: Optimized for English; performance on other languages may vary. |
| Audio Quality: Best performance on clean, 16kHz audio; may degrade with noisy or low-quality inputs. |
| Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model. |
| Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech). |
| |