---
license: mit
language:
- tr
tags:
- speech-to-text
- audio
- pytorch
- ctc
---

# Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model converts audio recordings into text.

## Dataset

The project uses the Mozilla Common Voice Turkish dataset:
- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: `clips/` directory and TSV files under the `tr/` folder
- Training split: `train.tsv`
- Test split: `test.tsv`

|
| | ## Model Architecture |
| |
|
| | The model has a hybrid CNN-RNN architecture: |
| | - **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms |
| | - **RNN Layers**: 4-layer bidirectional LSTM for temporal context |
| | - **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss |
| |
|
| | ### Technical Details |
| | - Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop) |
| | - CNN: 32-64 channel residual blocks with GELU activation |
| | - LSTM: 512 hidden units, 4 layers, bidirectional |
| | - Alphabet: 37 characters (Turkish letters + space) |
| | - Optimization: AdamW + OneCycleLR scheduler |
| |
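The architecture bullets above can be sketched in PyTorch. This is a minimal illustration assuming the listed hyperparameters (128 mel bins, 32 to 64 channel residual blocks, GELU, a 4-layer bidirectional LSTM with 512 hidden units); the class names and exact layer layout are hypothetical, not the project's actual code, and the 38 output classes assume the 37-character alphabet plus a CTC blank.

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """One residual CNN block over (batch, channels, mel, time) features."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        residual = x
        x = self.act(self.conv1(x))
        x = self.conv2(x)
        return self.act(x + residual)

class SpeechModel(nn.Module):
    """CNN front-end + 4-layer BiLSTM + linear head producing CTC log-probs."""
    def __init__(self, n_mels=128, n_classes=37, hidden=512):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.res1 = ResidualCNN(32)
        self.expand = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.res2 = ResidualCNN(64)
        # The stride-2 stem halves the mel axis: 128 -> 64
        self.lstm = nn.LSTM(64 * (n_mels // 2), hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes + 1)  # +1 for the CTC blank

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        x = self.stem(x)
        x = self.res1(x)
        x = self.expand(x)
        x = self.res2(x)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x, _ = self.lstm(x)
        return self.head(x).log_softmax(-1)  # (batch, time, n_classes + 1)

model = SpeechModel()
logits = model(torch.randn(2, 1, 128, 50))  # 2 clips, 128 mel bins, 50 frames
```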
|
| | ## File Descriptions |
| |
|
| | ### 1. `data.py` |
| | Data loading and preprocessing module: |
| | - Reading data from TSV files |
| | - Converting audio files to Mel-spectrograms |
| | - Text normalization and character encoding |
| | - Data augmentation for training (optional noise injection) |
| |
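Text normalization and character encoding reduce to mapping each transcript into indices over the model's alphabet. A self-contained sketch, assuming the alphabet string from the Notes section; the function names are illustrative, not excerpts from `data.py`:

```python
# Alphabet as listed in the Notes section; index 0 ('_') is commonly
# reserved for the CTC blank in setups like this.
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}
IDX_TO_CHAR = {i: ch for i, ch in enumerate(ALPHABET)}

def normalize(text: str) -> str:
    """Lowercase with Turkish casing rules (I -> ı, İ -> i) and
    drop any character outside the alphabet."""
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(ch for ch in text if ch in CHAR_TO_IDX)

def encode(text: str) -> list[int]:
    """Transcript -> list of character indices for CTC targets."""
    return [CHAR_TO_IDX[ch] for ch in normalize(text)]

def decode(indices: list[int]) -> str:
    """Inverse mapping, e.g. for printing sample predictions."""
    return "".join(IDX_TO_CHAR[i] for i in indices)
```

The explicit `I`/`İ` replacements matter because Python's default `str.lower()` maps `I` to `i`, which is wrong for Turkish.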
|
| | ### 2. `train_pro.py` |
| | Initial training script: |
| | - 40 epochs of training |
| | - Batch size: 16 |
| | - Learning rate: 0.0003 |
| | - Data augmentation with SpecAugment |
| | - Model saved after each epoch |
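The pieces named above (CTC loss, AdamW at lr 0.0003, OneCycleLR, gradient clipping) could be wired together roughly as follows. This is a sketch with a stand-in linear model and placeholder tensors, not an excerpt from `train_pro.py`; the blank index of 0 is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(128, 38))      # stand-in for the real network
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 assumed
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=3e-4, steps_per_epoch=100, epochs=40)

feats = torch.randn(50, 16, 128)               # (time, batch, features)
log_probs = model(feats).log_softmax(-1)       # CTCLoss expects log-probabilities
targets = torch.randint(1, 38, (16, 20))       # character indices, no blanks
in_lens = torch.full((16,), 50, dtype=torch.long)
tgt_lens = torch.full((16,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, in_lens, tgt_lens)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping (see Notes)
opt.step()
sched.step()
opt.zero_grad()
```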

### 3. `resume.py`
Resume-training script:
- Continues training from a saved checkpoint
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75

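Resuming typically means reloading the model and optimizer state and continuing the epoch counter. The checkpoint keys below are assumptions about what the training script writes, and an in-memory buffer stands in for `model_advanced_epoch_40.pth`:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lower fine-tuning LR

buffer = io.BytesIO()  # stands in for model_advanced_epoch_40.pth on disk
torch.save({"epoch": 40,
            "model_state": model.state_dict(),
            "optimizer_state": opt.state_dict()}, buffer)
buffer.seek(0)

ckpt = torch.load(buffer)
model.load_state_dict(ckpt["model_state"])
opt.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # continue at epoch 41
```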
### 4. `check_voca.py`
Helper script for alphabet verification. Displays the character set used by the model.

|
| | ### 5. `count.py` |
| | Dataset statistics: |
| | - Total number of recordings |
| | - Total duration calculation |
| | - Fast calculation if `clip_durations.tsv` exists, otherwise scans audio files |
| |
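When `clip_durations.tsv` exists, the total duration reduces to summing one column. A self-contained sketch over an in-memory TSV; the `duration[ms]` column name and millisecond units are assumptions about the Common Voice file format:

```python
import csv
import io

# Hypothetical clip_durations.tsv contents for illustration.
sample_tsv = "clip\tduration[ms]\na.mp3\t4200\nb.mp3\t3800\n"

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
total_ms = sum(int(row["duration[ms]"]) for row in reader)
total_hours = total_ms / 3_600_000  # ms -> hours
```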
|
| | ## Installation |
| |
|
| | ### Requirements |
| | ```bash |
| | pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy |
| | ``` |
| |
|
| | ### Preparing the Dataset |
| | 1. Download the Mozilla Common Voice Turkish dataset |
| | 2. Extract to `tr/` folder |
| | 3. Structure should be: |
| | ``` |
| | tr/ |
| | ├── clips/ |
| | │ ├── common_voice_tr_*.mp3 |
| | │ └── ... |
| | ├── train.tsv |
| | ├── test.tsv |
| | └── clip_durations.tsv (optional) |
| | ``` |
| |
|
| | ## Training |
| |
|
| | ### Initial Training |
| | ```bash |
| | python train_pro.py |
| | ``` |
| | - Trains for 40 epochs |
| | - Saves `model_advanced_epoch_X.pth` after each epoch |
| | - Terminal output shows loss, CER score, and sample predictions |
| |
|
| | ### Resume Training |
| | ```bash |
| | python resume.py |
| | ``` |
| | - Starts from `model_advanced_epoch_40.pth` |
| | - Trains epochs 41-75 |
| | - Uses lower learning rate for fine-tuning |
| |
|
| | ## Data Augmentation |
| |
|
| | The model uses two types of data augmentation during training: |
| |
|
| | 1. **Waveform Noise** (`data.py`): Random Gaussian noise in training mode |
| | 2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking |
| |
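Frequency and time masking zero out a random band of mel bins and a random span of frames. A hand-rolled sketch with illustrative mask widths; the actual scripts may instead rely on `torchaudio.transforms.FrequencyMasking` / `TimeMasking`:

```python
import torch

def spec_augment(spec, freq_mask=15, time_mask=35):
    """Zero one random frequency band and one random time span
    of a (channels, n_mels, n_frames) spectrogram."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape[-2], spec.shape[-1]
    # Random band of up to freq_mask mel bins
    f = torch.randint(0, freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[..., f0:f0 + f, :] = 0.0
    # Random span of up to time_mask frames
    t = torch.randint(0, time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[..., :, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(torch.ones(1, 128, 200))
```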
|
| | ## Performance Metrics |
| |
|
| | Model performance is measured with CER (Character Error Rate): |
| | - CER: Character-level error rate |
| | - Evaluated on test set after each epoch |
| | - Sample predictions printed to console |
| |
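The project installs the `Levenshtein` package for the edit distance; the same computation as a self-contained dynamic-programming sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the
    predicted and reference text, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for an empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(1, m)

print(cer("merhaba", "merhba"))  # one deletion over 7 characters ≈ 0.143
```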
|
| | ## Model Outputs |
| |
|
| | After training, model files are created for each epoch: |
| | - `model_advanced_epoch_1.pth` - `model_advanced_epoch_75.pth` |
| | - The best performing model can be selected for use |
| |
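Once a checkpoint is chosen, per-frame argmax indices from the network output can be turned into text with greedy CTC decoding: collapse repeated indices, then drop blanks. The blank index of 0 and the alphabet ordering are assumptions consistent with the Notes section:

```python
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "

def greedy_decode(indices: list[int], blank: int = 0) -> str:
    """Collapse runs of equal indices, skip the blank, map the rest to chars."""
    out, prev = [], None
    for idx in indices:
        if idx != prev and idx != blank:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# e.g. per-frame argmax of the network output:
print(greedy_decode([16, 16, 0, 6, 0, 21, 21]))  # -> "mer"
```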
|
| | ## Dataset Analysis |
| |
|
| | To get information about the dataset: |
| | ```bash |
| | python count.py |
| | ``` |
| |
|
| | This script displays the total number of recordings and duration. |
| |
|
| | ## Notes |
| |
|
| | - GPU usage is automatically detected |
| | - Gradient clipping is applied during training |
| | - All parameters are saved when the model is stored |
| | - Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters) |
| |
|
| | ## License |
| |
|
| | ### Code |
| | MIT License - Feel free to use, modify, and distribute this code. |
| |
|
| | ### Dataset |
| | The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose. |