---
license: mit
language:
- tr
tags:
- speech-to-text
- audio
- pytorch
- ctc
---

# Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model converts audio recordings into text.

## Dataset

The project uses the Mozilla Common Voice Turkish dataset:

- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: `clips/` directory and TSV files under the `tr/` folder
- Training: `train.tsv`
- Testing: `test.tsv`

## Model Architecture

The model has a hybrid CNN-RNN architecture:

- **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms
- **RNN Layers**: 4-layer bidirectional LSTM for temporal context
- **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss

### Technical Details

- Input: 128-dimensional Mel-spectrogram (16 kHz, 1024 FFT, 256 hop)
- CNN: 32-64 channel residual blocks with GELU activation
- LSTM: 512 hidden units, 4 layers, bidirectional
- Alphabet: 37 characters (Turkish letters + space)
- Optimization: AdamW + OneCycleLR scheduler

## File Descriptions

### 1. `data.py`

Data loading and preprocessing module:

- Reads data from TSV files
- Converts audio files to Mel-spectrograms
- Text normalization and character encoding
- Data augmentation for training (optional noise injection)

### 2. `train_pro.py`

Initial training script:

- 40 epochs of training
- Batch size: 16
- Learning rate: 0.0003
- Data augmentation with SpecAugment
- Model saved after each epoch

### 3. `resume.py`

Resume-training script:

- Continues training from a saved model
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75

### 4. `check_voca.py`

Helper script for alphabet verification. Displays the character set used by the model.

### 5. `count.py`

Dataset statistics:

- Total number of recordings
- Total duration calculation
- Fast calculation if `clip_durations.tsv` exists; otherwise scans the audio files

## Installation

### Requirements

```bash
pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy
```

### Preparing the Dataset

1. Download the Mozilla Common Voice Turkish dataset
2. Extract it to the `tr/` folder
3. The structure should be:

```
tr/
├── clips/
│   ├── common_voice_tr_*.mp3
│   └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)
```

## Training

### Initial Training

```bash
python train_pro.py
```

- Trains for 40 epochs
- Saves `model_advanced_epoch_X.pth` after each epoch
- Terminal output shows loss, CER score, and sample predictions

### Resume Training

```bash
python resume.py
```

- Starts from `model_advanced_epoch_40.pth`
- Trains epochs 41-75
- Uses a lower learning rate for fine-tuning

## Data Augmentation

The model uses two types of data augmentation during training:

1. **Waveform Noise** (`data.py`): Random Gaussian noise in training mode
2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking

## Performance Metrics

Model performance is measured with CER (Character Error Rate):

- CER: character-level error rate
- Evaluated on the test set after each epoch
- Sample predictions are printed to the console

## Model Outputs

After training, a model file is created for each epoch:

- `model_advanced_epoch_1.pth` through `model_advanced_epoch_75.pth`
- The best-performing model can be selected for use

## Dataset Analysis

To get information about the dataset:

```bash
python count.py
```

This script displays the total number of recordings and the total duration.

## Notes

- GPU usage is detected automatically
- Gradient clipping is applied during training
- All parameters are saved when the model is stored
- Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters)

## License

### Code

MIT License - feel free to use, modify, and distribute this code.
### Dataset

The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose.
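## Appendix: Computing CER

The CER metric used for evaluation is the character-level Levenshtein (edit) distance between the prediction and the reference, divided by the reference length. A minimal pure-Python sketch for reference — the helper names here are illustrative and not part of the project's scripts, which can use the faster `Levenshtein` package listed in the requirements:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion from a
                curr[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


# One substituted character in a 13-character reference:
print(round(cer("merhaba dünya", "merhaba dunya"), 3))  # 0.077
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference, which is normal for early training epochs.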