# Model Details

- Model Name: Whisper_Small
- Model Type: Speech-to-Text (Automatic Speech Recognition)
- Base Model: OpenAI Whisper Small (openai/whisper-small)
- Developed By: Aventiq AI
- Date: February 24, 2025
- Version: 1.0

# Model Description

This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.

- Intended Use: General-purpose automatic speech recognition, particularly for English speech.
- Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
- Input: Audio files (16 kHz sampling rate recommended).
- Output: Text transcriptions of spoken content.

# Training Details

## Dataset
- Name: SpeechOcean762 (mispeech/speechocean762)
- Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
- Language: English

## Training Procedure
- Framework: Hugging Face Transformers
- Hardware: [Specify if known, e.g., Single NVIDIA GPU with FP16 support]
- Hyperparameters:
  - Batch Size: 8 (train/eval)
  - Epochs: 3
  - Learning Rate: 1e-5
  - Mixed Precision: FP16
  - Optimizer: AdamW (default Whisper settings)
- Preprocessing: Audio resampled to 16 kHz and converted to input features using WhisperProcessor.
- Training Time: 2+ hours on a single GPU

## Quantization
- Method: Post-training quantization to FP16 using PyTorch's `.half()` method.
- Purpose: Reduce model size and improve inference speed.
- Model Size:
  - Original: 967 MB
  - Quantized: 461 MB

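The size reduction follows directly from the per-weight storage cost: FP16 uses 2 bytes per weight instead of FP32's 4. A quick back-of-the-envelope check, assuming the published ~244M-parameter count for Whisper small:

```python
import struct

# Whisper small has roughly 244M parameters (published figure, used here
# only as an assumption for a rough size check).
n_params = 244_000_000

bytes_fp32 = n_params * struct.calcsize("f")  # 4 bytes per FP32 weight
bytes_fp16 = n_params * struct.calcsize("e")  # 2 bytes per FP16 weight

print(f"FP32: ~{bytes_fp32 / 2**20:.0f} MB")  # ~931 MB
print(f"FP16: ~{bytes_fp16 / 2**20:.0f} MB")  # ~465 MB
```

The small discrepancy with the reported checkpoint sizes is expected: the saved directory also contains non-weight files, and some buffers may remain in full precision.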
# Evaluation

## Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.

Results:
- Average WER: 3.33
- Average CER: 2.62

## Example Performance

| Audio File  | Reference Text                  | Predicted Text                  | WER | CER |
|-------------|---------------------------------|---------------------------------|-----|-----|
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |
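WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; CER is the same computed over characters. A minimal pure-Python sketch of the computation (the jiwer dependency listed under Requirements provides the same metrics via `jiwer.wer` / `jiwer.cer`):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, row-by-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# The "sun"/"son" confusion from the example row: one substituted word.
ref = "the north wind and the sun"
hyp = "the north wind and the son"
print(wer(ref, hyp))  # 1 substitution / 6 words ≈ 0.167
print(cer(ref, hyp))  # 1 substituted character / 26 chars ≈ 0.038
```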

# Usage

## Requirements
- Python 3.8+
- Dependencies: transformers, torch, librosa, jiwer
- Hardware: CPU or GPU (CUDA support recommended for faster inference)

## Installation

```bash
pip install transformers torch librosa jiwer
```

# Example Code

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```

# Saved Model
- Location: ./whisper-small-finetuned-fp16
- Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.

# Limitations

- Language: Optimized for English; performance on other languages may vary.
- Audio Quality: Best performance on clean, 16 kHz audio; may degrade with noisy or low-quality inputs.
- Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
- Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).