YashikaNagpal committed (verified) · Commit 040736a · Parent: f9ea988

Create README.md
# Model Details
- Model Name: Whisper_Small
- Model Type: Speech-to-Text (Automatic Speech Recognition)
- Base Model: OpenAI Whisper Small (openai/whisper-small)
- Developed By: Aventiq AI
- Date: February 24, 2025
- Version: 1.0

# Model Description
This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.

- Intended Use: General-purpose automatic speech recognition, particularly for English speech.
- Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
- Input: Audio files (16 kHz sampling rate recommended).
- Output: Text transcriptions of spoken content.

# Training Details

## Dataset
- Name: SpeechOcean762 (mispeech/speechocean762)
- Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
- Language: English

## Training Procedure
- Framework: Hugging Face Transformers
- Hardware: [Specify if known, e.g., single NVIDIA GPU with FP16 support]
- Hyperparameters:
  - Batch Size: 8 (train/eval)
  - Epochs: 3
  - Learning Rate: 1e-5
  - Mixed Precision: FP16
  - Optimizer: AdamW (default Whisper settings)
- Preprocessing: Audio resampled to 16 kHz and converted to input features with WhisperProcessor.
- Training Time: 2+ hours on a single GPU

## Quantization
- Method: Post-training quantization to FP16 using PyTorch's `.half()` method.
- Purpose: Reduce model size and improve inference speed.
- Model Size:
  - Original: 967 MB
  - Quantized: 461 MB

# Evaluation

## Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.

Results:
- Average WER: 3.33
- Average CER: 2.62

## Example Performance
| Audio File | Reference Text | Predicted Text | WER | CER |
|---|---|---|---|---|
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |

# Usage

## Requirements
- Python 3.8+
- Dependencies: transformers, torch, librosa, jiwer
- Hardware: CPU or GPU (CUDA support recommended for faster inference)

## Installation
```bash
pip install transformers torch librosa jiwer
```

# Example Code
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    # Load audio and resample to the 16 kHz rate Whisper expects
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```

## Saved Model
- Location: ./whisper-small-finetuned-fp16
- Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.
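
The FP16 checkpoint loaded above was produced by post-training quantization with PyTorch's `.half()`. A minimal sketch of that step follows; it uses a randomly initialised model built from a default Whisper config so it runs without the fine-tuned checkpoint, which in the real workflow you would load with `from_pretrained` instead.

```python
import torch
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Stand-in model from a default config; in practice, load the full-precision
# fine-tuned checkpoint, e.g. from_pretrained("./whisper-small-finetuned")
model = WhisperForConditionalGeneration(WhisperConfig())

# Post-training quantization: cast every parameter to half precision (FP16)
model = model.half()
print(next(model.parameters()).dtype)  # torch.float16

# Save the quantized weights; the on-disk size roughly halves
model.save_pretrained("./whisper-small-finetuned-fp16")
```

Casting halves the bytes per parameter, which matches the reported size drop from 967 MB to 461 MB.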

# Limitations
- Language: Optimized for English; performance on other languages may vary.
- Audio Quality: Best performance on clean, 16 kHz audio; may degrade with noisy or low-quality inputs.
- Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
- Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).