# Model Details

- **Model Name:** Whisper_Small
- **Model Type:** Speech-to-Text (Automatic Speech Recognition)
- **Base Model:** OpenAI Whisper Small (`openai/whisper-small`)
- **Developed By:** Aventiq AI
- **Date:** February 24, 2025
- **Version:** 1.0

# Model Description

This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.

- **Intended Use:** General-purpose automatic speech recognition, particularly for English speech.
- **Primary Users:** Researchers, developers, and practitioners working on speech-to-text applications.
- **Input:** Audio files (16 kHz sampling rate recommended).
- **Output:** Text transcriptions of spoken content.

# Training Details

## Dataset

- **Name:** SpeechOcean762 (`mispeech/speechocean762`)
- **Description:** English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
- **Language:** English

## Training Procedure

- **Framework:** Hugging Face Transformers
- **Hardware:** [Specify if known, e.g., single NVIDIA GPU with FP16 support]
- **Hyperparameters:**
  - Batch Size: 8 (train/eval)
  - Epochs: 3
  - Learning Rate: 1e-5
  - Mixed Precision: FP16
  - Optimizer: AdamW (default Whisper settings)
- **Preprocessing:** Audio resampled to 16 kHz and converted to input features with `WhisperProcessor`.
- **Training Time:** 2+ hours on a single GPU

## Quantization

- **Method:** Post-training quantization to FP16 using PyTorch's `.half()` method.
- **Purpose:** Reduce model size and improve inference speed.
- **Model Size:** Original: 967 MB; Quantized: 461 MB

# Evaluation Metrics

Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.
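Both metrics reduce to a token-level edit distance normalized by the reference length: WER counts word-level substitutions, insertions, and deletions, while CER does the same at the character level. As a minimal pure-Python sketch of the computation (the evaluation itself would typically use the `jiwer` library listed under dependencies):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    # word-level edit distance / number of reference words
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # character-level edit distance / number of reference characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For the Harvard-sentence example below ("sun" transcribed as "son"), this gives one word substitution out of six reference words, i.e. a WER of 1/6.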
**Results:**

- Average WER: 3.33
- Average CER: 2.62

## Example Performance

| Audio File | Reference Text | Predicted Text | WER | CER |
|---|---|---|---|---|
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |

# Usage

## Requirements

- Python 3.8+
- Dependencies: `transformers`, `torch`, `librosa`, `jiwer`
- Hardware: CPU or GPU (CUDA support recommended for faster inference)

## Installation

```bash
pip install transformers torch librosa jiwer
```

# Example Code

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    # load and resample the audio to 16 kHz, as expected by Whisper
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    # cast the features to the model's dtype so FP16 weights and inputs match
    inputs = inputs.to(device, dtype=model.dtype)
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```

## Saved Model

- **Location:** `./whisper-small-finetuned-fp16`
- **Files:** `pytorch_model.bin`, `config.json`, `preprocessor_config.json`, etc.

# Limitations

- **Language:** Optimized for English; performance on other languages may vary.
- **Audio Quality:** Best performance on clean, 16 kHz audio; may degrade with noisy or low-quality inputs.
- **Quantization Trade-off:** FP16 quantization reduces model size but may slightly reduce transcription accuracy compared to the full-precision model.
- **Domain:** Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).
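The FP16 quantization trade-off noted above comes from casting every parameter from 32-bit to 16-bit floats, which halves per-parameter storage (consistent with the 967 MB to 461 MB reduction reported here). A minimal sketch of the `.half()` cast, using a small stand-in `torch.nn` module rather than the full Whisper checkpoint:

```python
import torch
import torch.nn as nn

# stand-in module; the model card applies the same cast to the Whisper model object
model = nn.Linear(80, 256)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model = model.half()  # post-training cast of all parameters to FP16
fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(fp16_bytes / fp32_bytes)  # parameters now occupy half the memory
```

The same one-line cast followed by `model.save_pretrained(...)` produces the quantized checkpoint; the small residual size beyond an exact halving in practice comes from non-parameter files (config, tokenizer assets) in the saved directory.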