---
license: mit
language:
- tr
tags:
- speech-to-text
- audio
- pytorch
- ctc
---
# Turkish Speech Recognition Model
This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model transcribes audio recordings into character-level text.
## Dataset
The project uses the Mozilla Common Voice Turkish dataset:
- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: `clips/` directory and TSV files under `tr/` folder
- Training: `train.tsv`
- Testing: `test.tsv`
## Model Architecture
The model has a hybrid CNN-RNN architecture:
- **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms
- **RNN Layers**: 4-layer bidirectional LSTM for temporal context
- **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss
### Technical Details
- Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop)
- CNN: 32-64 channel residual blocks with GELU activation
- LSTM: 512 hidden units, 4 layers, bidirectional
- Alphabet: 37 characters (Turkish letters + space)
- Optimization: AdamW + OneCycleLR scheduler
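The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal sketch: the layer counts and sizes follow the list above, but kernel sizes, the exact residual-block layout, and the CNN-to-RNN reshaping are assumptions, not taken from the repository code.

```python
import torch
import torch.nn as nn

N_MELS, N_CLASSES = 128, 37  # Mel bins and alphabet size from the list above

class ResidualCNN(nn.Module):
    """One residual CNN block with GELU activation (layout assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return nn.functional.gelu(x + self.net(x))

class SpeechModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 32- and 64-channel residual stages, as listed above
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), ResidualCNN(32),
            nn.Conv2d(32, 64, 3, padding=1), ResidualCNN(64),
        )
        # 4-layer bidirectional LSTM with 512 hidden units
        self.rnn = nn.LSTM(64 * N_MELS, 512, num_layers=4,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 512, N_CLASSES)

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        x = self.cnn(x)                # (batch, 64, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x)              # (batch, time, n_classes) for CTC

logits = SpeechModel()(torch.randn(2, 1, N_MELS, 50))
print(logits.shape)
```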
## File Descriptions
### 1. `data.py`
Data loading and preprocessing module:
- Reading data from TSV files
- Converting audio files to Mel-spectrograms
- Text normalization and character encoding
- Data augmentation for training (optional noise injection)
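The text-normalization and character-encoding steps can be illustrated with a small sketch. The alphabet string comes from the Notes section below; the Turkish-aware lowercasing rule and the exact filtering behavior are assumptions about what `data.py` does.

```python
# Index 0 ("_") is reserved for the CTC blank; the rest of the
# alphabet matches the string listed in the Notes section.
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

def normalize(text: str) -> str:
    # Turkish-aware lowercasing: 'I' -> 'ı' and 'İ' -> 'i',
    # then drop any character outside the model alphabet.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(c for c in text if c in CHAR2IDX and c != "_")

def encode(text: str) -> list[int]:
    return [CHAR2IDX[c] for c in normalize(text)]

print(encode("Merhaba dünya"))
```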
### 2. `train_pro.py`
Initial training script:
- 40 epochs of training
- Batch size: 16
- Learning rate: 0.0003
- Data augmentation with SpecAugment
- Model saved after each epoch
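A single training step with the hyperparameters listed above (learning rate 3e-4, AdamW, OneCycleLR, CTC loss, gradient clipping from the Notes section) can be sketched like this. The tiny linear model, batch shapes, and clipping norm are placeholders, not the values used in `train_pro.py`.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 37)                      # stand-in for the real model
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, epochs=40, steps_per_epoch=100
)

# One dummy batch: CTC expects (time, batch, classes) log-probabilities
log_probs = model(torch.randn(50, 16, 128)).log_softmax(-1)
targets = torch.randint(1, 37, (16, 20))
input_lengths = torch.full((16,), 50, dtype=torch.long)
target_lengths = torch.full((16,), 20, dtype=torch.long)

loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
optimizer.step()
scheduler.step()
print(float(loss))
```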
### 3. `resume.py`
Resume training script:
- Continue training from a saved model
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75
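Resuming from a saved checkpoint can look like the following sketch. The checkpoint keys (`"model_state"`, `"epoch"`) and the stand-in model are assumptions; `resume.py` may store and restore additional state such as the optimizer.

```python
import torch

model = torch.nn.Linear(128, 37)                 # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lowered LR

# Simulate a checkpoint saved at the end of epoch 40
torch.save({"model_state": model.state_dict(), "epoch": 40}, "ckpt.pth")

ckpt = torch.load("ckpt.pth")
model.load_state_dict(ckpt["model_state"])
start_epoch = ckpt["epoch"] + 1                  # continue at epoch 41
print(start_epoch)
```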
### 4. `check_voca.py`
Helper script for alphabet verification. Displays the character set used by the model.
### 5. `count.py`
Dataset statistics:
- Total number of recordings
- Total duration calculation
- Fast calculation if `clip_durations.tsv` exists, otherwise scans audio files
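The fast path over `clip_durations.tsv` amounts to summing one column. A sketch, assuming the `duration[ms]` column name used in Common Voice releases (verify against your copy of the dataset):

```python
import pandas as pd

def total_duration_hours(path: str) -> float:
    # clip_durations.tsv lists per-clip lengths in a "duration[ms]"
    # column (column name assumed from Common Voice conventions).
    df = pd.read_csv(path, sep="\t")
    return df["duration[ms]"].sum() / 3_600_000
```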
## Installation
### Requirements
```bash
pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy
```
### Preparing the Dataset
1. Download the Mozilla Common Voice Turkish dataset
2. Extract to `tr/` folder
3. Structure should be:
```
tr/
├── clips/
│   ├── common_voice_tr_*.mp3
│   └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)
```
## Training
### Initial Training
```bash
python train_pro.py
```
- Trains for 40 epochs
- Saves `model_advanced_epoch_X.pth` after each epoch
- Terminal output shows loss, CER score, and sample predictions
### Resume Training
```bash
python resume.py
```
- Starts from `model_advanced_epoch_40.pth`
- Trains epochs 41-75
- Uses lower learning rate for fine-tuning
## Data Augmentation
The model uses two types of data augmentation during training:
1. **Waveform Noise** (`data.py`): Random Gaussian noise added to the waveform in training mode
2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking
## Performance Metrics
Model performance is measured with CER (Character Error Rate):
- CER: Character-level error rate
- Evaluated on test set after each epoch
- Sample predictions printed to console
## Model Outputs
After training, model files are created for each epoch:
- `model_advanced_epoch_1.pth` through `model_advanced_epoch_75.pth`
- The best performing model can be selected for use
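Once a checkpoint is selected, its per-frame logits can be turned into text with a greedy CTC decoder: take the argmax at each frame, collapse repeats, and drop blanks. This is a common baseline sketch; the repository's own decoding may differ.

```python
import torch

ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # from the Notes section

def greedy_decode(logits: torch.Tensor) -> str:
    """Collapse repeated argmax indices and drop CTC blanks (index 0)."""
    ids = logits.argmax(-1).tolist()   # logits: (time, n_classes)
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:
            out.append(ALPHABET[i])
        prev = i
    return "".join(out)
```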
## Dataset Analysis
To get information about the dataset:
```bash
python count.py
```
This script displays the total number of recordings and duration.
## Notes
- GPU usage is automatically detected
- Gradient clipping is applied during training
- All parameters are saved when the model is stored
- Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters)
## License
### Code
MIT License - Feel free to use, modify, and distribute this code.
### Dataset
The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose.