# Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model can convert audio recordings into text.

## Dataset

The project uses the Mozilla Common Voice Turkish dataset:
- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: `clips/` directory and TSV files under the `tr/` folder
- Training: `train.tsv`
- Testing: `test.tsv`

## Model Architecture

The model has a hybrid CNN-RNN architecture:
- **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms
- **RNN Layers**: 4-layer bidirectional LSTM for temporal context
- **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss

### Technical Details
- Input: 128-dimensional Mel-spectrogram (16 kHz, 1024 FFT, 256 hop)
- CNN: 32-64 channel residual blocks with GELU activation
- LSTM: 512 hidden units, 4 layers, bidirectional
- Alphabet: 37 characters (Turkish letters + space)
- Optimization: AdamW + OneCycleLR scheduler

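The architecture above can be sketched in PyTorch. This is a minimal illustration, not the project's actual code: the class names, block count, and layer wiring are assumptions; only the overall shape (residual CNN front end with GELU, 4-layer bidirectional LSTM, per-frame character logits for CTC) follows the description.

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """One residual CNN block over a (batch, channels, mel, time) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))  # skip connection

class SpeechModel(nn.Module):
    def __init__(self, n_mels=128, vocab_size=37, hidden=512):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.res1 = ResidualCNN(32)
        self.proj = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.res2 = ResidualCNN(64)
        self.rnn = nn.LSTM(64 * n_mels, hidden, num_layers=4,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, vocab_size)  # per-frame character logits

    def forward(self, mels):  # mels: (batch, 1, n_mels, time)
        x = self.res2(self.proj(self.res1(self.stem(mels))))
        b, c, m, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * m)  # (batch, time, features)
        x, _ = self.rnn(x)
        return self.head(x)  # (batch, time, vocab), fed to CTC loss
```

The output keeps one logit vector per spectrogram frame, which is what CTC training requires.
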
## File Descriptions

### 1. `data.py`
Data loading and preprocessing module:
- Reading data from the TSV files
- Converting audio files to Mel-spectrograms
- Normalizing text and encoding it as character indices
- Optional data augmentation for training (noise injection)

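The text side of this pipeline can be sketched as follows. The alphabet string is the one listed under Notes below; the function names and the exact normalization rules are illustrative assumptions, not `data.py`'s actual API.

```python
# Hypothetical character map; the real one lives in the project's data.py.
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # index 0 ("_") assumed to be the CTC blank
CHAR2IDX = {ch: i for i, ch in enumerate(ALPHABET)}
IDX2CHAR = {i: ch for i, ch in enumerate(ALPHABET)}

def normalize(text):
    """Lowercase (Turkish-aware for I/İ) and drop characters outside the alphabet."""
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(ch for ch in text if ch in CHAR2IDX)

def encode(text):
    """Turn a transcript into a list of character indices."""
    return [CHAR2IDX[ch] for ch in normalize(text)]

def decode(indices):
    """Inverse of encode, for printing predictions."""
    return "".join(IDX2CHAR[i] for i in indices)
```

The Turkish-specific `I`/`İ` replacements run before `lower()` because Python's default lowercasing maps `I` to `i`, losing the dotless `ı`.
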
### 2. `train_pro.py`

Initial training script:
- 40 epochs of training
- Batch size: 16
- Learning rate: 0.0003
- Data augmentation with SpecAugment
- Model saved after each epoch

### 3. `resume.py`

Resume-training script:
- Continues training from a saved model
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75

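Resuming can be sketched with a standard PyTorch checkpoint round trip. The helper names and dictionary keys here are assumptions, not `resume.py`'s actual format.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    """Persist everything needed to continue training later."""
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state; return the next epoch to run."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```

Saving the optimizer state alongside the weights matters here: AdamW keeps per-parameter moment estimates, and restarting without them briefly destabilizes fine-tuning.
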
### 4. `check_voca.py`

Helper script for alphabet verification. Displays the character set used by the model.

### 5. `count.py`

Dataset statistics:
- Total number of recordings
- Total duration calculation
- Fast calculation if `clip_durations.tsv` exists; otherwise scans the audio files

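The fast path can be sketched with pandas. The `duration[ms]` column name follows recent Common Voice releases of `clip_durations.tsv`; treat it as an assumption and check your copy of the file.

```python
import pandas as pd

def dataset_stats(durations_tsv):
    """Return (clip count, total hours) from a clip_durations.tsv file.
    Assumes a 'duration[ms]' column, as in recent Common Voice releases."""
    df = pd.read_csv(durations_tsv, sep="\t")
    total_ms = df["duration[ms]"].sum()
    return len(df), total_ms / 3_600_000  # milliseconds per hour
```
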
## Installation

### Requirements
```bash
pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy
```

### Preparing the Dataset
1. Download the Mozilla Common Voice Turkish dataset
2. Extract it into the `tr/` folder
3. The structure should be:
```
tr/
├── clips/
│   ├── common_voice_tr_*.mp3
│   └── ...
├── train.tsv
├── test.tsv
└── clip_durations.tsv (optional)
```

## Training

### Initial Training
```bash
python train_pro.py
```
- Trains for 40 epochs
- Saves `model_advanced_epoch_X.pth` after each epoch
- Terminal output shows the loss, CER score, and sample predictions

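A single training step, combining the CTC loss with the gradient clipping mentioned under Notes and a per-batch OneCycleLR step, can be sketched like this. It is a generic PyTorch pattern, not `train_pro.py`'s actual loop, and it assumes blank index 0.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, scheduler, mels, targets, input_lens, target_lens,
               max_grad_norm=1.0):
    """One CTC training step with gradient clipping (a sketch, not the real script).
    model(mels) is assumed to return per-frame logits of shape (batch, time, vocab)."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    logits = model(mels)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss wants (time, batch, vocab)
    loss = ctc(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped per batch, not per epoch
    return loss.item()
```

`zero_infinity=True` guards against clips whose transcript is longer than the number of output frames, which would otherwise produce an infinite loss.
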
### Resume Training
```bash
python resume.py
```
- Starts from `model_advanced_epoch_40.pth`
- Trains epochs 41-75
- Uses a lower learning rate for fine-tuning

## Data Augmentation

The model uses two types of data augmentation during training:

1. **Waveform Noise** (`data.py`): Random Gaussian noise in training mode
2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking

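The SpecAugment step can be sketched in a few lines of torch. This simplified version applies one frequency mask and one time mask with assumed maximum widths; the actual scripts may use different parameters or several masks per spectrogram.

```python
import torch

def spec_augment(spec, max_freq_mask=15, max_time_mask=35):
    """Zero out one random frequency band and one random time band.
    spec: (n_mels, time). Mask widths are illustrative assumptions."""
    spec = spec.clone()  # leave the caller's tensor untouched
    n_mels, n_steps = spec.shape
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, max(1, n_mels - f), (1,)))
    spec[f0:f0 + f, :] = 0.0
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, max(1, n_steps - t), (1,)))
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Masking whole bands (rather than adding noise) forces the model to rely on context instead of any single frequency region or time slice.
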
## Performance Metrics

Model performance is measured with CER (Character Error Rate):
- CER: Character-level error rate
- Evaluated on the test set after each epoch
- Sample predictions printed to the console

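CER itself is a short function: edit distance between prediction and reference, divided by the reference length. This pure-Python version computes the Levenshtein distance directly (the requirements list the `Levenshtein` package, which the scripts presumably use for speed).

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein distance / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    # Standard two-row dynamic-programming edit distance.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(reference)
```
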
## Model Outputs

After training, model files are created for each epoch:
- `model_advanced_epoch_1.pth` through `model_advanced_epoch_75.pth`
- The best-performing model can be selected for use

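Once a checkpoint is chosen, its per-frame argmax output can be turned into text with standard CTC greedy decoding: collapse repeated indices, then drop blanks. Blank index 0 is an assumption here, consistent with `_` being listed first in the alphabet under Notes.

```python
def greedy_ctc_decode(frame_ids, blank=0):
    """Collapse repeats, then drop blanks (standard CTC greedy decoding)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

For example, the frame sequence `[0, 1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]`: the doubled `1 1` collapses, but the blank between the two `1` runs keeps them as separate characters.
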
## Dataset Analysis

To get information about the dataset:
```bash
python count.py
```

This script displays the total number of recordings and the total duration.

## Notes

- GPU usage is detected automatically
- Gradient clipping is applied during training
- All parameters are saved when the model is stored
- Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters)

## License

### Code
MIT License: feel free to use, modify, and distribute this code.

### Dataset
The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose.