Turkish Speech Recognition Model
This project implements a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model converts audio recordings into text.
Dataset
The project uses the Mozilla Common Voice Turkish dataset:
- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: clips/ directory and TSV files under the tr/ folder
- Training: train.tsv
- Testing: test.tsv
Model Architecture
The model has a hybrid CNN-RNN architecture:
- CNN Layers: Residual CNN blocks for feature extraction from Mel-spectrograms
- RNN Layers: 4-layer bidirectional LSTM for temporal context
- Output: Character-level prediction with CTC (Connectionist Temporal Classification) loss
Technical Details
- Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop)
- CNN: 32-64 channel residual blocks with GELU activation
- LSTM: 512 hidden units, 4 layers, bidirectional
- Alphabet: 37 characters (Turkish letters + space)
- Optimization: AdamW + OneCycleLR scheduler
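The architecture described above can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the project's actual code: class names, the residual-block layout, and the default class count (37-character alphabet plus a CTC blank) are assumptions.

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """One residual CNN block over the Mel-spectrogram (channel count unchanged)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class SpeechModel(nn.Module):
    def __init__(self, n_mels=128, n_classes=38, hidden=512):
        super().__init__()
        # 32 -> 64 channel residual CNN stack with GELU, as in Technical Details
        self.stem = nn.Conv2d(1, 32, 3, padding=1)
        self.cnn = nn.Sequential(ResidualCNN(32),
                                 nn.Conv2d(32, 64, 3, padding=1),
                                 ResidualCNN(64))
        # 4-layer bidirectional LSTM over the flattened CNN features
        self.rnn = nn.LSTM(64 * n_mels, hidden, num_layers=4,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden * 2, n_classes)  # per-frame logits for CTC

    def forward(self, x):           # x: (batch, 1, n_mels, time)
        x = self.cnn(self.stem(x))  # (batch, 64, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x, _ = self.rnn(x)
        return self.fc(x)           # (batch, time, n_classes)
```

The per-frame logits would then be fed to `nn.CTCLoss` during training and greedy- or beam-decoded at inference.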
File Descriptions
1. data.py
Data loading and preprocessing module:
- Reading data from TSV files
- Converting audio files to Mel-spectrograms
- Text normalization and character encoding
- Data augmentation for training (optional noise injection)
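The normalization and character-encoding steps might look like the following sketch. The alphabet string is taken from the Notes section below; treating `_` as the CTC blank and the Turkish-aware lowercasing of `I`/`İ` are assumptions, not facts from the project code.

```python
# Hypothetical sketch of the text encoding in data.py.
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # assumed: blank + letters + space

char2idx = {ch: i for i, ch in enumerate(ALPHABET)}
idx2char = {i: ch for ch, i in char2idx.items()}

def normalize(text):
    """Lowercase with Turkish I-handling, then drop out-of-alphabet characters."""
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(ch for ch in text if ch in char2idx)

def encode(text):
    """Map a transcript to a list of character indices."""
    return [char2idx[ch] for ch in normalize(text)]

def decode(indices):
    """Map character indices back to a string."""
    return "".join(idx2char[i] for i in indices)
```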
2. train_pro.py
Initial training script:
- 40 epochs of training
- Batch size: 16
- Learning rate: 0.0003
- Data augmentation with SpecAugment
- Model saved after each epoch
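SpecAugment-style masking can be sketched in plain PyTorch as below. The mask widths are illustrative defaults, not values taken from `train_pro.py`:

```python
import torch

def spec_augment(spec, freq_mask=15, time_mask=35):
    """Zero out one random frequency band and one random time span.

    spec: (n_mels, time) Mel-spectrogram. Returns a masked copy.
    """
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    # Frequency masking: zero a band of up to `freq_mask` Mel bins
    f = torch.randint(0, freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[f0:f0 + f, :] = 0.0
    # Time masking: zero a span of up to `time_mask` frames
    t = torch.randint(0, time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[:, t0:t0 + t] = 0.0
    return spec
```

`torchaudio.transforms.FrequencyMasking` and `TimeMasking` provide the same operation as ready-made transforms.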
3. resume.py
Resume training script:
- Continue training from a saved model
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75
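The save/resume pattern behind `resume.py` typically looks like the sketch below. The checkpoint key names and the `weight_decay` value are assumptions; only the 0.00005 learning rate and the epoch range come from this README.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    """Store all parameters plus optimizer state and epoch number."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)

def resume(model, path, lr=5e-5, weight_decay=0.01):
    """Load weights and return a fresh AdamW with the lower fine-tuning rate."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    return optimizer, ckpt["epoch"] + 1  # e.g. resume at epoch 41 after epoch 40
```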
4. check_voca.py
Helper script for alphabet verification. Displays the character set used by the model.
5. count.py
Dataset statistics:
- Total number of recordings
- Total duration calculation
- Fast calculation if clip_durations.tsv exists; otherwise scans the audio files
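The fast path can be sketched with the standard library alone. This assumes the `duration[ms]` column name used in recent Common Voice releases, which may differ in other versions:

```python
import csv

def total_duration_seconds(tsv_path):
    """Count clips and sum their durations from clip_durations.tsv.

    Assumes a tab-separated file with a 'duration[ms]' column.
    Returns (clip_count, total_seconds).
    """
    total_ms = 0.0
    count = 0
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total_ms += float(row["duration[ms]"])
            count += 1
    return count, total_ms / 1000.0
```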
Installation
Requirements
pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy
Preparing the Dataset
- Download the Mozilla Common Voice Turkish dataset
- Extract to the tr/ folder
- Structure should be:
tr/
├── clips/
│ ├── common_voice_tr_*.mp3
│ └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)
Training
Initial Training
python train_pro.py
- Trains for 40 epochs
- Saves model_advanced_epoch_X.pth after each epoch
- Terminal output shows loss, CER score, and sample predictions
Resume Training
python resume.py
- Starts from model_advanced_epoch_40.pth
- Trains epochs 41-75
- Uses lower learning rate for fine-tuning
Data Augmentation
The model uses two types of data augmentation during training:
- Waveform noise (data.py): random Gaussian noise added in training mode
- SpecAugment (train_pro.py, resume.py): frequency and time masking
Performance Metrics
Model performance is measured with CER (Character Error Rate):
- CER: Character-level error rate
- Evaluated on test set after each epoch
- Sample predictions printed to console
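CER is the Levenshtein (edit) distance between the predicted and reference strings, normalized by the reference length. The project lists the `Levenshtein` package as a dependency; the pure-Python version below is only for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edits needed to match the reference,
    divided by the reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))
```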
Model Outputs
After training, model files are created for each epoch:
- model_advanced_epoch_1.pth through model_advanced_epoch_75.pth
- The best-performing model can be selected for use
Dataset Analysis
To get information about the dataset:
python count.py
This script displays the total number of recordings and duration.
Notes
- GPU usage is automatically detected
- Gradient clipping is applied during training
- All parameters are saved when the model is stored
- Alphabet: _abcçdefgğhıijklmnoöprsştuüvyzqwx (37 characters)
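The gradient clipping mentioned above usually sits between `backward()` and `step()`; a minimal training-step sketch is shown below. The `max_norm` value is an assumption, not taken from the project code:

```python
import torch
import torch.nn as nn

def train_step(model, batch, targets, optimizer, loss_fn, max_norm=1.0):
    """One optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # Rescale gradients so their global norm does not exceed max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```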
License
Code
MIT License - Feel free to use, modify, and distribute this code.
Dataset
The Mozilla Common Voice Turkish dataset is licensed under CC0 1.0 Universal. The dataset is in the public domain and free to use for any purpose.