Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model can convert audio recordings into text.

Dataset

The project uses the Mozilla Common Voice Turkish dataset, a crowd-sourced corpus of read Turkish speech (see the License section for dataset terms).

Model Architecture

The model has a hybrid CNN-RNN architecture:

  • CNN Layers: Residual CNN blocks for feature extraction from Mel-spectrograms
  • RNN Layers: 4-layer bidirectional LSTM for temporal context
  • Output: Character-level prediction with CTC (Connectionist Temporal Classification) loss

Technical Details

  • Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop)
  • CNN: 32-64 channel residual blocks with GELU activation
  • LSTM: 512 hidden units, 4 layers, bidirectional
  • Alphabet: 37 characters (Turkish letters + space)
  • Optimization: AdamW + OneCycleLR scheduler

File Descriptions

1. data.py

Data loading and preprocessing module:

  • Reading data from TSV files
  • Converting audio files to Mel-spectrograms
  • Text normalization and character encoding
  • Data augmentation for training (optional noise injection)

2. train_pro.py

Initial training script:

  • 40 epochs of training
  • Batch size: 16
  • Learning rate: 0.0003
  • Data augmentation with SpecAugment
  • Model saved after each epoch
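A single training step can be sketched as below. The learning rate and scheduler come from the Technical Details section and gradient clipping from the Notes; the batch format, the clipping norm of 1.0, and the function name are assumptions:

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, scheduler, device="cpu"):
    """One epoch of CTC training (sketch; batch layout assumed)."""
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    model.train()
    for specs, targets, spec_lens, target_lens in loader:
        optimizer.zero_grad()
        logits = model(specs.to(device))                     # (batch, time, classes)
        log_probs = logits.log_softmax(-1).transpose(0, 1)   # (time, batch, classes) for CTC
        loss = ctc_loss(log_probs, targets.to(device), spec_lens, target_lens)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping (see Notes)
        optimizer.step()
        scheduler.step()                                     # OneCycleLR steps per batch
```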

3. resume.py

Resume training script:

  • Continue training from a saved model
  • Lower learning rate (0.00005)
  • Increased regularization
  • Designed for epochs 41-75
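The restart logic can be sketched as follows. The learning rate (5e-5) matches the section above; the checkpoint layout (a bare state dict) and the weight decay value are assumptions:

```python
import torch
import torch.nn as nn

def resume_training(model: nn.Module, ckpt_path: str):
    """Restore weights from a checkpoint and build a fine-tuning optimizer."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state)
    # Lower LR and heavier weight decay for the fine-tuning phase (values assumed)
    return torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)
```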

4. check_voca.py

Helper script for alphabet verification. Displays the character set used by the model.

5. count.py

Dataset statistics:

  • Total number of recordings
  • Total duration calculation
  • Fast calculation if clip_durations.tsv exists, otherwise scans audio files
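The fast path can be sketched like this. Common Voice exports typically store per-clip durations in milliseconds; the column name `duration[ms]` is an assumption about the TSV layout:

```python
import os
import pandas as pd

def total_duration_hours(tr_dir="tr"):
    """Sum per-clip durations from clip_durations.tsv and return hours."""
    path = os.path.join(tr_dir, "clip_durations.tsv")
    df = pd.read_csv(path, sep="\t")
    ms = df["duration[ms]"].sum()      # milliseconds per clip (column name assumed)
    return ms / 1000 / 3600
```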

Installation

Requirements

pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy

Preparing the Dataset

  1. Download the Mozilla Common Voice Turkish dataset
  2. Extract to tr/ folder
  3. Structure should be:
tr/
├── clips/
│   ├── common_voice_tr_*.mp3
│   └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)

Training

Initial Training

python train_pro.py

  • Trains for 40 epochs
  • Saves model_advanced_epoch_X.pth after each epoch
  • Terminal output shows loss, CER score, and sample predictions

Resume Training

python resume.py

  • Starts from model_advanced_epoch_40.pth
  • Trains epochs 41-75
  • Uses lower learning rate for fine-tuning

Data Augmentation

The model uses two types of data augmentation during training:

  1. Waveform Noise (data.py): Random Gaussian noise in training mode
  2. SpecAugment (train_pro.py, resume.py): Frequency and time masking

Performance Metrics

Model performance is measured with CER (Character Error Rate):

  • CER: Character-level error rate
  • Evaluated on test set after each epoch
  • Sample predictions printed to console

Model Outputs

After training, model files are created for each epoch:

  • model_advanced_epoch_1.pth - model_advanced_epoch_75.pth
  • The best performing model can be selected for use
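To run inference with a selected checkpoint, the model's frame-level logits need CTC decoding. A minimal greedy decoder collapses repeated indices and drops blanks; the alphabet string is taken from the Notes section with a trailing space, and treating index 0 (`_`) as the CTC blank is an assumption:

```python
import torch

ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # '_' = CTC blank (ordering assumed)

def greedy_decode(logits: torch.Tensor) -> str:
    """Standard CTC greedy decoding: collapse repeats, then remove blanks."""
    ids = logits.argmax(-1).tolist()   # best class per frame, shape (time,)
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            out.append(ALPHABET[i])
        prev = i
    return "".join(out)
```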

Dataset Analysis

To get information about the dataset:

python count.py

This script displays the total number of recordings and duration.

Notes

  • GPU usage is automatically detected
  • Gradient clipping is applied during training
  • All parameters are saved when the model is stored
  • Alphabet: _abcçdefgğhıijklmnoöprsştuüvyzqwx (37 characters)

License

Code

MIT License - Feel free to use, modify, and distribute this code.

Dataset

The Mozilla Common Voice Turkish dataset is licensed under CC0 1.0 Universal. The dataset is in the public domain and free to use for any purpose.
