---
license: mit
language:
- tr
tags:
- speech-to-text
- audio
- pytorch
- ctc
---

# Turkish Speech Recognition Model

This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model converts audio recordings into text.

## Dataset

The project uses the Mozilla Common Voice Turkish dataset:
- Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
- Dataset structure: `clips/` directory and TSV files under the `tr/` folder
- Training split: `train.tsv`
- Test split: `test.tsv`

|
| | ## Model Architecture |
| |
|
| | The model has a hybrid CNN-RNN architecture: |
| | - **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms |
| | - **RNN Layers**: 4-layer bidirectional LSTM for temporal context |
| | - **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss |
| |
|
| | ### Technical Details |
| | - Input: 128-dimensional Mel-spectrogram (16kHz, 1024 FFT, 256 hop) |
| | - CNN: 32-64 channel residual blocks with GELU activation |
| | - LSTM: 512 hidden units, 4 layers, bidirectional |
| | - Alphabet: 37 characters (Turkish letters + space) |
| | - Optimization: AdamW + OneCycleLR scheduler |
| |
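The architecture bullets above can be sketched in PyTorch. This is a minimal illustration assuming the listed hyperparameters (128 mel bins, 32 to 64 channel residual blocks, GELU, a 4-layer bidirectional LSTM with 512 hidden units); the class names and exact layer layout are hypothetical, not the project's actual code, and the 38 output classes assume the 37-character alphabet plus a CTC blank.

```python
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """One residual CNN block over (batch, channels, mel, time) features."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        residual = x
        x = self.act(self.conv1(x))
        x = self.conv2(x)
        return self.act(x + residual)

class SpeechModel(nn.Module):
    """CNN front-end + 4-layer BiLSTM + linear head producing CTC log-probs."""
    def __init__(self, n_mels=128, n_classes=37, hidden=512):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.res1 = ResidualCNN(32)
        self.expand = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.res2 = ResidualCNN(64)
        # The stride-2 stem halves the mel axis: 128 -> 64
        self.lstm = nn.LSTM(64 * (n_mels // 2), hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes + 1)  # +1 for the CTC blank

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        x = self.stem(x)
        x = self.res1(x)
        x = self.expand(x)
        x = self.res2(x)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x, _ = self.lstm(x)
        return self.head(x).log_softmax(-1)  # (batch, time, n_classes + 1)

model = SpeechModel()
logits = model(torch.randn(2, 1, 128, 50))  # 2 clips, 128 mel bins, 50 frames
```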
|
| | ## File Descriptions |
| |
|
| | ### 1. `data.py` |
| | Data loading and preprocessing module: |
| | - Reading data from TSV files |
| | - Converting audio files to Mel-spectrograms |
| | - Text normalization and character encoding |
| | - Data augmentation for training (optional noise injection) |
| |
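Text normalization and character encoding reduce to mapping each transcript into indices over the model's alphabet. A self-contained sketch, assuming the alphabet string from the Notes section; the function names are illustrative, not excerpts from `data.py`:

```python
# Alphabet as listed in the Notes section; index 0 ('_') is commonly
# reserved for the CTC blank in setups like this.
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}
IDX_TO_CHAR = {i: ch for i, ch in enumerate(ALPHABET)}

def normalize(text: str) -> str:
    """Lowercase with Turkish casing rules (I -> ı, İ -> i) and
    drop any character outside the alphabet."""
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(ch for ch in text if ch in CHAR_TO_IDX)

def encode(text: str) -> list[int]:
    """Transcript -> list of character indices for CTC targets."""
    return [CHAR_TO_IDX[ch] for ch in normalize(text)]

def decode(indices: list[int]) -> str:
    """Inverse mapping, e.g. for printing sample predictions."""
    return "".join(IDX_TO_CHAR[i] for i in indices)
```

The explicit `I`/`İ` replacements matter because Python's default `str.lower()` maps `I` to `i`, which is wrong for Turkish.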
|
| | ### 2. `train_pro.py` |
| | Initial training script: |
| | - 40 epochs of training |
| | - Batch size: 16 |
| | - Learning rate: 0.0003 |
| | - Data augmentation with SpecAugment |
| | - Model saved after each epoch |
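The pieces named above (CTC loss, AdamW at lr 0.0003, OneCycleLR, gradient clipping) could be wired together roughly as follows. This is a sketch with a stand-in linear model and placeholder tensors, not an excerpt from `train_pro.py`; the blank index of 0 is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(128, 38))      # stand-in for the real network
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 assumed
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=3e-4, steps_per_epoch=100, epochs=40)

feats = torch.randn(50, 16, 128)               # (time, batch, features)
log_probs = model(feats).log_softmax(-1)       # CTCLoss expects log-probabilities
targets = torch.randint(1, 38, (16, 20))       # character indices, no blanks
in_lens = torch.full((16,), 50, dtype=torch.long)
tgt_lens = torch.full((16,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, in_lens, tgt_lens)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping (see Notes)
opt.step()
sched.step()
opt.zero_grad()
```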

### 3. `resume.py`
Resume-training script:
- Continues training from a saved checkpoint
- Lower learning rate (0.00005)
- Increased regularization
- Designed for epochs 41-75

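Resuming typically means reloading the model and optimizer state and continuing the epoch counter. The checkpoint keys below are assumptions about what the training script writes, and an in-memory buffer stands in for `model_advanced_epoch_40.pth`:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lower fine-tuning LR

buffer = io.BytesIO()  # stands in for model_advanced_epoch_40.pth on disk
torch.save({"epoch": 40,
            "model_state": model.state_dict(),
            "optimizer_state": opt.state_dict()}, buffer)
buffer.seek(0)

ckpt = torch.load(buffer)
model.load_state_dict(ckpt["model_state"])
opt.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # continue at epoch 41
```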
### 4. `check_voca.py`
Helper script for alphabet verification. Displays the character set used by the model.

|
| | ### 5. `count.py` |
| | Dataset statistics: |
| | - Total number of recordings |
| | - Total duration calculation |
| | - Fast calculation if `clip_durations.tsv` exists, otherwise scans audio files |
| |
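When `clip_durations.tsv` exists, the total duration reduces to summing one column. A self-contained sketch over an in-memory TSV; the `duration[ms]` column name and millisecond units are assumptions about the Common Voice file format:

```python
import csv
import io

# Hypothetical clip_durations.tsv contents for illustration.
sample_tsv = "clip\tduration[ms]\na.mp3\t4200\nb.mp3\t3800\n"

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
total_ms = sum(int(row["duration[ms]"]) for row in reader)
total_hours = total_ms / 3_600_000  # ms -> hours
```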
|
| | ## Installation |
| |
|
| | ### Requirements |
| | ```bash |
| | pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy |
| | ``` |
| |
|
| | ### Preparing the Dataset |
| | 1. Download the Mozilla Common Voice Turkish dataset |
| | 2. Extract to `tr/` folder |
| | 3. Structure should be: |
| | ``` |
| | tr/ |
| | ├── clips/ |
| | │ ├── common_voice_tr_*.mp3 |
| | │ └── ... |
| | ├── train.tsv |
| | ├── test.tsv |
| | └── clip_durations.tsv (optional) |
| | ``` |
| |
|
| | ## Training |
| |
|
| | ### Initial Training |
| | ```bash |
| | python train_pro.py |
| | ``` |
| | - Trains for 40 epochs |
| | - Saves `model_advanced_epoch_X.pth` after each epoch |
| | - Terminal output shows loss, CER score, and sample predictions |
| |
|
| | ### Resume Training |
| | ```bash |
| | python resume.py |
| | ``` |
| | - Starts from `model_advanced_epoch_40.pth` |
| | - Trains epochs 41-75 |
| | - Uses lower learning rate for fine-tuning |
| |
|
| | ## Data Augmentation |
| |
|
| | The model uses two types of data augmentation during training: |
| |
|
| | 1. **Waveform Noise** (`data.py`): Random Gaussian noise in training mode |
| | 2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking |
| |
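Frequency and time masking zero out a random band of mel bins and a random span of frames. A hand-rolled sketch with illustrative mask widths; the actual scripts may instead rely on `torchaudio.transforms.FrequencyMasking` / `TimeMasking`:

```python
import torch

def spec_augment(spec, freq_mask=15, time_mask=35):
    """Zero one random frequency band and one random time span
    of a (channels, n_mels, n_frames) spectrogram."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape[-2], spec.shape[-1]
    # Random band of up to freq_mask mel bins
    f = torch.randint(0, freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[..., f0:f0 + f, :] = 0.0
    # Random span of up to time_mask frames
    t = torch.randint(0, time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[..., :, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(torch.ones(1, 128, 200))
```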
|
| | ## Performance Metrics |
| |
|
| | Model performance is measured with CER (Character Error Rate): |
| | - CER: Character-level error rate |
| | - Evaluated on test set after each epoch |
| | - Sample predictions printed to console |
| |
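The project installs the `Levenshtein` package for the edit distance; the same computation as a self-contained dynamic-programming sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the
    predicted and reference text, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for an empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(1, m)

print(cer("merhaba", "merhba"))  # one deletion over 7 characters ≈ 0.143
```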
|
| | ## Model Outputs |
| |
|
| | After training, model files are created for each epoch: |
| | - `model_advanced_epoch_1.pth` - `model_advanced_epoch_75.pth` |
| | - The best performing model can be selected for use |
| |
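Once a checkpoint is chosen, per-frame argmax indices from the network output can be turned into text with greedy CTC decoding: collapse repeated indices, then drop blanks. The blank index of 0 and the alphabet ordering are assumptions consistent with the Notes section:

```python
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "

def greedy_decode(indices: list[int], blank: int = 0) -> str:
    """Collapse runs of equal indices, skip the blank, map the rest to chars."""
    out, prev = [], None
    for idx in indices:
        if idx != prev and idx != blank:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# e.g. per-frame argmax of the network output:
print(greedy_decode([16, 16, 0, 6, 0, 21, 21]))  # -> "mer"
```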
|
| | ## Dataset Analysis |
| |
|
| | To get information about the dataset: |
| | ```bash |
| | python count.py |
| | ``` |
| |
|
| | This script displays the total number of recordings and duration. |
| |
|
| | ## Notes |
| |
|
| | - GPU usage is automatically detected |
| | - Gradient clipping is applied during training |
| | - All parameters are saved when the model is stored |
| | - Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters) |
| |
|
| | ## License |
| |
|
| | ### Code |
| | MIT License - Feel free to use, modify, and distribute this code. |
| |
|
| | ### Dataset |
| | The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose. |