---
license: mit
language:
- en
datasets:
- facebook/multilingual_librispeech
metrics:
- character
base_model:
- openai/whisper-small
- facebook/wav2vec2-base-960h
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- code
- audio
- speech-recognition
- whisper
- wav2vec2
- pytorch
---
|
|
|
|
|
# Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio |
|
|
|
|
|
This project fine-tunes OpenAI's Whisper (`whisper-small`) and Facebook's Wav2Vec2 (`wav2vec2-base-960h`) models for real-time speech recognition using live audio recordings. It's designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.
|
|
|
|
|
## Model Description |
|
|
Fine-tuned Whisper and Wav2Vec2 models for real-time speech recognition on live audio. |
|
|
|
|
|
## Features |
|
|
- **Real-time audio recording**: Captures live 16kHz mono audio via microphone input. |
|
|
- **Continuous fine-tuning**: Updates model weights incrementally during live sessions. |
|
|
- **Speech-to-text transcription**: Converts audio to text with high accuracy. |
|
|
- **Model saving/loading**: Automatically saves fine-tuned models with timestamps. |
|
|
- **Dual model support**: Choose between Whisper and Wav2Vec2 architectures. |
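Both base models expect 16 kHz mono float32 input. As a minimal sketch of the preprocessing step (the actual code lives in `dataset.py` and may differ), scaling raw int16 PCM and down-mixing to mono could look like:

```python
import numpy as np

TARGET_RATE = 16_000  # both base models expect 16 kHz mono audio

def to_model_input(audio: np.ndarray) -> np.ndarray:
    """Scale int16 PCM into [-1, 1] float32 and down-mix to mono."""
    if audio.dtype == np.int16:   # raw PCM -> normalized float
        audio = audio.astype(np.float32) / 32768.0
    if audio.ndim == 2:           # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    return audio.astype(np.float32)

stereo_pcm = np.array([[1000, 3000], [-2000, 2000]], dtype=np.int16)
mono = to_model_input(stereo_pcm)
```

Scaling before down-mixing keeps the dtype check reliable, since averaging channels already promotes the array to floating point.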
|
|
|
|
|
## Usage |
|
|
|
|
|
### Start Fine-Tuning |
|
|
Fine-tune the model on live audio: |
|
|
```bash
# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2
```
|
|
The script records audio in real time and updates the model continuously. Press Ctrl+C to stop training; the fine-tuned model is saved automatically.
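The stop-and-save behavior can be sketched as a plain interrupt handler. The helper below is illustrative only; `run_one_step` is a hypothetical callback, not the actual `main.py` API:

```python
import datetime

def training_session(run_one_step, model_type: str) -> str:
    """Run fine-tuning steps until Ctrl+C, then return a timestamped save path."""
    try:
        while True:
            run_one_step()  # e.g. record a clip and apply one gradient update
    except KeyboardInterrupt:
        pass                # Ctrl+C ends the session gracefully
    stamp = datetime.datetime.now().strftime("%Y%m%d")
    return f"models/speech_recognition_ai_fine_tune_{model_type}_{stamp}"
```

Catching `KeyboardInterrupt` rather than exiting ensures the save step always runs, so no live-session progress is lost.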
|
|
|
|
|
### Transcription |
|
|
Test the fine-tuned model: |
|
|
```bash
# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2
```
|
|
Records 5 seconds of audio (configurable in code) and generates a transcription. |
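Handling the fixed clip length can be sketched as follows. This is an assumption about how a 5-second window might be enforced, not the exact code in `test_transcription.py`:

```python
import numpy as np

SAMPLE_RATE = 16_000
CLIP_SECONDS = 5  # the default recording length, configurable in code

def fix_length(audio: np.ndarray) -> np.ndarray:
    """Pad with silence or truncate so every clip is exactly CLIP_SECONDS long."""
    target = SAMPLE_RATE * CLIP_SECONDS
    if audio.shape[0] >= target:
        return audio[:target]
    return np.pad(audio, (0, target - audio.shape[0]))

clip = fix_length(np.zeros(12_000, dtype=np.float32))  # shorter than 5 s
```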
|
|
|
|
|
### Model Storage |
|
|
Models are saved by default to: |
|
|
```
models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]
```
|
|
Example: `models/speech_recognition_ai_fine_tune_whisper_20250225` |
|
|
|
|
|
To customize the save path: |
|
|
```bash
export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
```
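The save location can be derived from `MODEL_SAVE_PATH` and the current date. A minimal sketch, assuming the real path logic in `main.py` follows the pattern shown above:

```python
import datetime
import os

def save_dir(model_type: str) -> str:
    """Build the timestamped save directory, honoring MODEL_SAVE_PATH."""
    base = os.environ.get("MODEL_SAVE_PATH", "models")
    stamp = datetime.datetime.now().strftime("%Y%m%d")
    return os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{stamp}")
```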
|
|
|
|
|
## Requirements |
|
|
- Python 3.8+ |
|
|
- PyTorch (torch==2.0.1 recommended) |
|
|
- Transformers (transformers==4.35.0 recommended) |
|
|
- Sounddevice (sounddevice==0.4.6) |
|
|
- Torchaudio (torchaudio==2.0.1) |
|
|
|
|
|
A GPU is recommended for faster fine-tuning. See `requirements.txt` for the full list. |
|
|
|
|
|
## Model Details |
|
|
- **Task**: Automatic Speech Recognition (ASR) |
|
|
- **Base Models**: |
|
|
- Whisper: openai/whisper-small |
|
|
- Wav2Vec2: facebook/wav2vec2-base-960h |
|
|
- **Fine-tuning**: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5). |
|
|
- **Input**: 16kHz mono audio |
|
|
- **Output**: Text transcription |
|
|
- **Language**: English |
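The optimizer setup implied by these hyperparameters can be sketched as below; a tiny `nn.Linear` stands in for the actual Whisper/Wav2Vec2 model, and the loss is a placeholder:

```python
import torch

BATCH_SIZE = 8
LEARNING_RATE = 1e-5

model = torch.nn.Linear(16, 16)  # placeholder for the ASR model
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# One incremental update on a (batch, features) tensor of audio features
batch = torch.randn(BATCH_SIZE, 16)
loss = model(batch).pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```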
|
|
|
|
|
## Loading the Model (Hugging Face) |
|
|
|
|
|
To load the fine-tuned Whisper model from the Hugging Face Hub:
|
|
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("harpertoken/harpertokenASR")
processor = WhisperProcessor.from_pretrained("harpertoken/harpertokenASR")
```
|
|
|
|
|
## Repository Structure |
|
|
|
|
|
```
speech-model/
├── dataset.py             # Audio recording and preprocessing
├── train.py               # Training pipeline
├── test_transcription.py  # Transcription testing
├── main.py                # Main script for fine-tuning
├── README.md              # This file
└── requirements.txt       # Dependencies
```
|
|
|
|
|
## Training Data |
|
|
The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required; users generate their own data via microphone input.
|
|
|
|
|
## Evaluation Results |
|
|
Future updates will include word error rate (WER) metrics comparing the fine-tuned models to their base models.
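For reference, WER is the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. A self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. 1/3.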
|
|
|
|
|
## License |
|
|
Licensed under the MIT License. |