---
license: mit
language:
- en
datasets:
- facebook/multilingual_librispeech
metrics:
- character
base_model:
- openai/whisper-small
- facebook/wav2vec2-base-960h
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- code
- audio
- speech-recognition
- whisper
- wav2vec2
- pytorch
---

# Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio

This project fine-tunes OpenAI's Whisper (`whisper-small`) and Facebook's Wav2Vec2 (`wav2vec2-base-960h`) models for real-time speech recognition using live audio recordings. It’s designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.

## Model Description
Fine-tuned Whisper and Wav2Vec2 models for real-time speech recognition on live audio.

## Features
- **Real-time audio recording**: Captures live 16kHz mono audio via microphone input.
- **Continuous fine-tuning**: Updates model weights incrementally during live sessions.
- **Speech-to-text transcription**: Converts recorded audio to text using the fine-tuned model.
- **Model saving/loading**: Automatically saves fine-tuned models with timestamps.
- **Dual model support**: Choose between Whisper and Wav2Vec2 architectures.
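
Audio arriving from a microphone is not always 16 kHz mono, so a preprocessing step like the following is typically needed. This is a minimal sketch (the helper name and linear-interpolation resampling are illustrative assumptions, not the actual `dataset.py` code):

```python
import numpy as np

def to_mono_16k(audio, orig_rate, target_rate=16000):
    """Downmix to mono and linearly resample to the target rate.

    Illustrative helper only; a production pipeline would typically use
    torchaudio's resampler instead of linear interpolation.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:  # (samples, channels) -> mono by averaging channels
        audio = audio.mean(axis=1)
    if orig_rate != target_rate:
        duration = len(audio) / orig_rate
        n_target = int(duration * target_rate)
        old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
        audio = np.interp(new_t, old_t, audio).astype(np.float32)
    return audio

# 1 second of 44.1 kHz stereo noise -> 16,000 mono samples
x = np.random.randn(44100, 2).astype(np.float32)
y = to_mono_16k(x, 44100)
```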

## Usage

### Start Fine-Tuning
Fine-tune the model on live audio:
```bash
# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2
```
Records audio in real-time and updates the model continuously. Press Ctrl+C to stop training and save the model automatically.

### Transcription
Test the fine-tuned model:
```bash
# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2
```
Records 5 seconds of audio (configurable in code) and generates a transcription.

### Model Storage
Models are saved by default to:
```
models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]
```
Example: `models/speech_recognition_ai_fine_tune_whisper_20250225`

To customize the save path:
```bash
export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
```
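
The save-path convention above can be reproduced with a small helper. This is a sketch under the naming scheme shown (the function name and exact timestamp format are assumptions; the actual code may build the path differently):

```python
import os
import time

def save_dir(model_type: str) -> str:
    """Build the timestamped save path; MODEL_SAVE_PATH overrides the default base."""
    base = os.environ.get("MODEL_SAVE_PATH", "models")
    stamp = time.strftime("%Y%m%d")  # e.g. 20250225, matching the example above
    return os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{stamp}")

path = save_dir("whisper")
```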

## Requirements
- Python 3.8+
- PyTorch (`torch==2.0.1` recommended)
- Transformers (`transformers==4.35.0` recommended)
- sounddevice (`sounddevice==0.4.6`)
- Torchaudio (`torchaudio==2.0.1`)

A GPU is recommended for faster fine-tuning. See `requirements.txt` for the full list.

## Model Details
- **Task**: Automatic Speech Recognition (ASR)
- **Base Models**:
  - Whisper: `openai/whisper-small`
  - Wav2Vec2: `facebook/wav2vec2-base-960h`
- **Fine-tuning**: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5).
- **Input**: 16kHz mono audio
- **Output**: Text transcription
- **Language**: English
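
The fine-tuning configuration above amounts to a standard incremental update loop. A minimal sketch, using a toy linear layer as a stand-in for the ASR model (the real loop in `train.py` operates on Whisper/Wav2Vec2 weights):

```python
import torch

# Toy stand-in for the ASR model: the same update rule described above
# (Adam, learning rate 1e-5, batch size 8) applied to a tiny linear layer.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

batch = torch.randn(8, 16)     # batch size 8, as in the configuration above
targets = torch.randn(8, 4)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(batch), targets)
loss.backward()
optimizer.step()               # one incremental update on a "live" batch
```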

## Loading the Model (Hugging Face)

To load the models from Hugging Face:
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("harpertoken/harpertokenASR")
processor = WhisperProcessor.from_pretrained("harpertoken/harpertokenASR")
```

## Repository Structure

```
speech-model/
├── dataset.py              # Audio recording and preprocessing
├── train.py                # Training pipeline
├── test_transcription.py   # Transcription testing
├── main.py                 # Main script for fine-tuning
├── README.md               # This file
└── requirements.txt        # Dependencies
```

## Training Data
The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required—users generate their own data via microphone input.

## Evaluation Results
Evaluation results are not yet available. Future updates will include Word Error Rate (WER) comparisons against the base models.

## License
Licensed under the MIT License.