Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model can convert audio recordings into text.

Dataset

The project uses the Mozilla Common Voice Turkish dataset, a crowd-sourced corpus of read Turkish speech (see the License section for dataset terms).

Model Architecture

The model has a hybrid CNN-RNN architecture:

  • CNN Layers: Residual CNN blocks for feature extraction from Mel-spectrograms
  • RNN Layers: 4-layer bidirectional LSTM for temporal context
  • Output: Character-level prediction with CTC (Connectionist Temporal Classification) loss

Technical Details

  • Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop)
  • CNN: 32-64 channel residual blocks with GELU activation
  • LSTM: 512 hidden units, 4 layers, bidirectional
  • Alphabet: 37 characters (Turkish letters + space)
  • Optimization: AdamW + OneCycleLR scheduler

File Descriptions

1. data.py

Data loading and preprocessing module:

  • Reading data from TSV files
  • Converting audio files to Mel-spectrograms
  • Text normalization and character encoding
  • Data augmentation for training (optional noise injection)

2. train_pro.py

Initial training script:

  • 40 epochs of training
  • Batch size: 16
  • Learning rate: 0.0003
  • Data augmentation with SpecAugment
  • Model saved after each epoch
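A single training step can be sketched as below. The learning rate and scheduler come from the Technical Details section and gradient clipping from the Notes; the batch format, the clipping norm of 1.0, and the function name are assumptions:

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, scheduler, device="cpu"):
    """One epoch of CTC training (sketch; batch layout assumed)."""
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    model.train()
    for specs, targets, spec_lens, target_lens in loader:
        optimizer.zero_grad()
        logits = model(specs.to(device))                     # (batch, time, classes)
        log_probs = logits.log_softmax(-1).transpose(0, 1)   # (time, batch, classes) for CTC
        loss = ctc_loss(log_probs, targets.to(device), spec_lens, target_lens)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping (see Notes)
        optimizer.step()
        scheduler.step()                                     # OneCycleLR steps per batch
```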

3. resume.py

Resume training script:

  • Continue training from a saved model
  • Lower learning rate (0.00005)
  • Increased regularization
  • Designed for epochs 41-75
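The restart logic can be sketched as follows. The learning rate (5e-5) matches the section above; the checkpoint layout (a bare state dict) and the weight decay value are assumptions:

```python
import torch
import torch.nn as nn

def resume_training(model: nn.Module, ckpt_path: str):
    """Restore weights from a checkpoint and build a fine-tuning optimizer."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state)
    # Lower LR and heavier weight decay for the fine-tuning phase (values assumed)
    return torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)
```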

4. check_voca.py

Helper script for alphabet verification. Displays the character set used by the model.

5. count.py

Dataset statistics:

  • Total number of recordings
  • Total duration calculation
  • Fast calculation if clip_durations.tsv exists, otherwise scans audio files
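The fast path can be sketched like this. Common Voice exports typically store per-clip durations in milliseconds; the column name `duration[ms]` is an assumption about the TSV layout:

```python
import os
import pandas as pd

def total_duration_hours(tr_dir="tr"):
    """Sum per-clip durations from clip_durations.tsv and return hours."""
    path = os.path.join(tr_dir, "clip_durations.tsv")
    df = pd.read_csv(path, sep="\t")
    ms = df["duration[ms]"].sum()      # milliseconds per clip (column name assumed)
    return ms / 1000 / 3600
```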

Installation

Requirements

pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy

Preparing the Dataset

  1. Download the Mozilla Common Voice Turkish dataset
  2. Extract to tr/ folder
  3. Structure should be:
tr/
├── clips/
│   ├── common_voice_tr_*.mp3
│   └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)

Training

Initial Training

python train_pro.py

  • Trains for 40 epochs
  • Saves model_advanced_epoch_X.pth after each epoch
  • Terminal output shows loss, CER score, and sample predictions

Resume Training

python resume.py

  • Starts from model_advanced_epoch_40.pth
  • Trains epochs 41-75
  • Uses lower learning rate for fine-tuning

Data Augmentation

The model uses two types of data augmentation during training:

  1. Waveform Noise (data.py): Random Gaussian noise in training mode
  2. SpecAugment (train_pro.py, resume.py): Frequency and time masking

Performance Metrics

Model performance is measured with CER (Character Error Rate):

  • CER: Character-level error rate
  • Evaluated on test set after each epoch
  • Sample predictions printed to console

Model Outputs

After training, model files are created for each epoch:

  • model_advanced_epoch_1.pth - model_advanced_epoch_75.pth
  • The best performing model can be selected for use
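To run inference with a selected checkpoint, the model's frame-level logits need CTC decoding. A minimal greedy decoder collapses repeated indices and drops blanks; the alphabet string is taken from the Notes section with a trailing space, and treating index 0 (`_`) as the CTC blank is an assumption:

```python
import torch

ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # '_' = CTC blank (ordering assumed)

def greedy_decode(logits: torch.Tensor) -> str:
    """Standard CTC greedy decoding: collapse repeats, then remove blanks."""
    ids = logits.argmax(-1).tolist()   # best class per frame, shape (time,)
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            out.append(ALPHABET[i])
        prev = i
    return "".join(out)
```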

Dataset Analysis

To get information about the dataset:

python count.py

This script displays the total number of recordings and duration.

Notes

  • GPU usage is automatically detected
  • Gradient clipping is applied during training
  • All parameters are saved when the model is stored
  • Alphabet: _abcçdefgğhıijklmnoöprsştuüvyzqwx (37 characters)

License

Code

MIT License - Feel free to use, modify, and distribute this code.

Dataset

The Mozilla Common Voice Turkish dataset is licensed under CC0 1.0 Universal. The dataset is in the public domain and free to use for any purpose.
