# Model Details
Model Name: Whisper_Small
Model Type: Speech-to-Text (Automatic Speech Recognition)
Base Model: OpenAI Whisper Small (openai/whisper-small)
Developed By: Aventiq AI
Date: February 24, 2025
Version: 1.0
# Model Description
This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating-point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.
Intended Use: General-purpose automatic speech recognition, particularly for English speech.
Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
Input: Audio files (16kHz sampling rate recommended).
Output: Text transcriptions of spoken content.
# Training Details
## Dataset
Name: SpeechOcean762 (mispeech/speechocean762)
Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
Language: English
## Training Procedure
Framework: Hugging Face Transformers
Hardware: [Specify if known, e.g., single NVIDIA GPU with FP16 support]
Hyperparameters (a configuration sketch follows this list):
Batch Size: 8 (train/eval)
Epochs: 3
Learning Rate: 1e-5
Mixed Precision: FP16
Optimizer: AdamW (default Whisper settings)
Preprocessing: Audio resampled to 16kHz and converted to input features using WhisperProcessor.
Training Time: Approximately 2 hours on a single GPU
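As a minimal sketch, the hyperparameters above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. The `output_dir` name is an assumption for illustration, not taken from the original training run.
```python
from transformers import Seq2SeqTrainingArguments

# Minimal sketch of the training configuration listed above.
# The output_dir name is an assumption for illustration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",
    per_device_train_batch_size=8,   # Batch Size: 8 (train)
    per_device_eval_batch_size=8,    # Batch Size: 8 (eval)
    num_train_epochs=3,              # Epochs: 3
    learning_rate=1e-5,              # Learning Rate: 1e-5
    fp16=True,                       # Mixed Precision: FP16
)
# AdamW is the Trainer's default optimizer, matching the
# "AdamW (default Whisper settings)" entry above.
```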
## Quantization
Method: Post-training quantization to FP16 using PyTorch’s .half() method (see the sketch below).
Purpose: Reduce model size and improve inference speed.
Model Size:
Original: 967 MB
Quantized: 461 MB
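A minimal sketch of this quantization step, assuming the fine-tuned full-precision checkpoint was saved locally (both paths are illustrative):
```python
from transformers import WhisperForConditionalGeneration

# Load the fine-tuned full-precision model (path is an assumption).
model = WhisperForConditionalGeneration.from_pretrained("./whisper-small-finetuned")

# Cast all floating-point parameters to FP16; this is what roughly
# halves the checkpoint size (967 MB -> 461 MB above).
model = model.half()

# Save the quantized checkpoint for inference.
model.save_pretrained("./whisper-small-finetuned-fp16")
```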
## Evaluation
### Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions.
### Results
Average WER: 3.33
Average CER: 2.62
### Example Performance
| Audio File | Reference Text | Predicted Text | WER | CER |
| --- | --- | --- | --- | --- |
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |
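WER and CER can be computed with the `jiwer` package listed under Requirements. A minimal, self-contained sketch with illustrative strings (note that `jiwer` returns rates as fractions, so the averages above are presumably percentages):
```python
import jiwer

reference = "the north wind and the sun were disputing"   # illustrative only
hypothesis = "the north wind and the son were disputing"  # illustrative only

# jiwer returns error rates as fractions; multiply by 100 for percentages.
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")
print(f"CER: {jiwer.cer(reference, hypothesis) * 100:.2f}%")
```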
# Usage
## Requirements
Python 3.8+
Dependencies: transformers, torch, librosa, jiwer
Hardware: CPU or GPU (CUDA support recommended for faster inference)
## Installation
```bash
pip install transformers torch librosa jiwer
```
# Example Code
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load the fine-tuned, FP16-quantized checkpoint.
model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# Use a GPU when available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    # Resample to the 16kHz rate Whisper expects.
    audio, sr = librosa.load(audio_path, sr=16000)
    # Convert the waveform to log-mel input features.
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    # Generate token IDs with beam search, then decode to text.
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```
## Saved Model
Location: ./whisper-small-finetuned-fp16
Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.
# Limitations
Language: Optimized for English; performance on other languages may vary.
Audio Quality: Best performance on clean, 16kHz audio; may degrade with noisy or low-quality inputs.
Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).