# OpenAI Whisper-Base Fine-Tuned Model for Speech-to-Text

This repository hosts a version of the OpenAI Whisper-Base model fine-tuned for speech-to-text on the [Mozilla Common Voice 13.0](https://commonvoice.mozilla.org/) dataset. The model is designed to transcribe speech efficiently while maintaining high accuracy.

## Model Details

- **Model Architecture**: OpenAI Whisper-Base
- **Task**: Speech-to-Text
- **Dataset**: [Mozilla Common Voice 13.0](https://commonvoice.mozilla.org/)
- **Quantization**: FP16
- **Fine-tuning Framework**: Hugging Face Transformers

## 🚀 Usage

### Installation

```bash
# torchaudio is required by the inference example below
pip install transformers torch torchaudio
```

### Loading the Model

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/whisper-speech-text"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)
```

### Speech-to-Text Inference

```python
import torchaudio

def transcribe(audio_path):
    # Load the audio file; waveform has shape (channels, samples)
    waveform, sample_rate = torchaudio.load(audio_path)

    # Whisper expects 16 kHz mono audio: downmix, then resample if needed
    waveform = waveform.mean(dim=0)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Convert the waveform to log-Mel input features
    inputs = processor(
        waveform.numpy(), sampling_rate=16000, return_tensors="pt"
    ).input_features.to(device)

    # Generate token IDs without tracking gradients
    with torch.no_grad():
        predicted_ids = model.generate(inputs)

    # Decode token IDs to text, dropping special tokens
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# Example usage
audio_file = "sample_audio.wav"
print(transcribe(audio_file))
```

## 📊 Evaluation Results

After fine-tuning, we evaluated the model on the validation set of the Common Voice 13.0 dataset.
The following results were obtained:

| Metric  | Score | Meaning |
|---------|-------|---------|
| **WER** | 8.2%  | Word Error Rate: share of words substituted, deleted, or inserted relative to the reference transcript (lower is better) |
| **CER** | 4.5%  | Character Error Rate: the same measure computed over characters (lower is better) |

## Fine-Tuning Details

### Dataset

The model was fine-tuned on the Mozilla Common Voice 13.0 dataset, which contains diverse multilingual speech samples.

### Training

- **Number of epochs**: 3
- **Batch size**: 8
- **Evaluation strategy**: epoch (evaluation runs at the end of each epoch)

### Quantization

Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.

## 📂 Repository Structure

```bash
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model
└── README.md            # Model documentation
```

## ⚠️ Limitations

- The model may struggle with highly noisy or overlapping speech.
- Quantization may slightly reduce accuracy compared to full-precision models.
- Performance may vary across accents and dialects.

## 🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
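
## Appendix: How WER and CER Are Computed

The WER and CER figures reported in the evaluation results are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the reference transcript into the model output, divided by the reference length. The evaluation script used for this model is not part of this README, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, no text normalization, and hypothetical helper names (`edit_distance`, `wer`, `cer`).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the raw strings (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only, not taken from Common Voice
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

Reported scores usually depend on the normalization applied before scoring (casing, punctuation), so values from this sketch are not necessarily comparable to the table above.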