Aytacus committed
Commit b64bf22 · verified · 1 Parent(s): 7d10ac1

Update README.md

  1. README.md +141 -3
README.md CHANGED
@@ -1,3 +1,141 @@
- ---
- license: apache-2.0
- ---
+ # Turkish Speech Recognition Model
+
+ This project is a deep learning-based speech recognition system trained on the Mozilla Common Voice Turkish dataset. The model converts audio recordings into text.
+
+ ## Dataset
+
+ The project uses the Mozilla Common Voice Turkish dataset:
+ - Source: https://datacollective.mozillafoundation.org/datasets/cmj8u3px500s1nxxb4qh79iqr
+ - Dataset structure: `clips/` directory and TSV files under the `tr/` folder
+ - Training: `train.tsv`
+ - Testing: `test.tsv`
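
As a rough illustration of the layout above, a split file can be read with the standard library alone. This sketch assumes the usual Common Voice column names `path` and `sentence`; `read_split` is a hypothetical helper, not a function from this repository, and the sample rows are made up.

```python
import csv
import io


def read_split(tsv_file):
    """Return (clip filename, transcript) pairs from an open Common Voice TSV file."""
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Column names `path` and `sentence` are assumed, not taken from this repo.
    return [(row["path"], row["sentence"]) for row in reader]


# Tiny in-memory stand-in for tr/train.tsv (made-up rows).
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "abc\tcommon_voice_tr_1.mp3\tMerhaba dünya\n"
    "def\tcommon_voice_tr_2.mp3\t\"Nasılsın?\" dedi.\n"
)
pairs = read_split(sample)
print(pairs[0])
```

For a real split, the same function could be called on `open("tr/train.tsv", encoding="utf-8", newline="")`.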
+
+ ## Model Architecture
+
+ The model has a hybrid CNN-RNN architecture:
+ - **CNN Layers**: Residual CNN blocks for feature extraction from Mel-spectrograms
+ - **RNN Layers**: 4-layer bidirectional LSTM for temporal context
+ - **Output**: Character-level prediction with CTC (Connectionist Temporal Classification) loss
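
A CTC-trained network emits one label per frame, so the frame sequence has to be collapsed into text. The README does not say how predictions are decoded; the sketch below shows greedy decoding, a common choice, and assumes that `_` at index 0 of the alphabet printed in the Notes section is the CTC blank. `ctc_greedy_decode` is a hypothetical name.

```python
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # as printed in the Notes section
BLANK = 0  # assumption: "_" at index 0 is the CTC blank


def ctc_greedy_decode(frame_ids, alphabet=ALPHABET, blank=BLANK):
    """Collapse repeated frame labels, then drop blanks."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != previous and idx != blank:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)


frames = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_greedy_decode(frames))  # prints "aab"
```

The blank between the two runs of label 1 is what lets the decoder keep a doubled letter instead of merging it.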
+
+ ### Technical Details
+ - Input: 128-bin Mel-spectrogram (16 kHz sample rate, FFT size 1024, hop length 256)
+ - CNN: 32-64 channel residual blocks with GELU activation
+ - LSTM: 512 hidden units, 4 layers, bidirectional
+ - Alphabet: 37 characters (Turkish letters + space)
+ - Optimization: AdamW + OneCycleLR scheduler
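
The input settings above fix the size of the spectrogram the network sees. The sketch below works that out, assuming centered (padded) STFT windows, which is the default in torchaudio's `MelSpectrogram`; the helper names are hypothetical and any downsampling inside the CNN is not accounted for.

```python
SAMPLE_RATE = 16_000  # Hz, from the list above
HOP_LENGTH = 256      # samples between frames, from the list above
N_MELS = 128          # Mel bins per frame, from the list above


def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count assuming centered (padded) STFT windows."""
    return 1 + num_samples // hop_length


def spectrogram_shape(seconds):
    """(n_mels, frames) for a clip of the given length, before any CNN striding."""
    return (N_MELS, num_frames(int(seconds * SAMPLE_RATE)))


print(spectrogram_shape(1.0))    # (128, 63)
print(SAMPLE_RATE / HOP_LENGTH)  # 62.5 frames per second
```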
+
+ ## File Descriptions
+
+ ### 1. `data.py`
+ Data loading and preprocessing module:
+ - Reading data from TSV files
+ - Converting audio files to Mel-spectrograms
+ - Text normalization and character encoding
+ - Data augmentation for training (optional noise injection)
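
Text normalization for Turkish needs care with the dotted and dotless `i`. The following is a minimal sketch of one way to normalize and encode transcripts, not the actual code in `data.py`. It uses the alphabet string exactly as printed in the Notes section (34 symbols as shown there) and assumes `_` is reserved for the CTC blank; `normalize` and `encode` are hypothetical names.

```python
ALPHABET = "_abcçdefgğhıijklmnoöprsştuüvyzqwx "  # as printed in the Notes section
CHAR_TO_ID = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lowercase with Turkish casing rules and drop symbols outside the alphabet."""
    # str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so the two Turkish capitals are handled explicitly first.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # "_" is excluded because it is assumed to be reserved for the CTC blank.
    kept = [ch for ch in text if ch in CHAR_TO_ID and ch != "_"]
    return " ".join("".join(kept).split())  # collapse repeated spaces


def encode(text):
    """Map normalized text to integer labels."""
    return [CHAR_TO_ID[ch] for ch in normalize(text)]


print(normalize("IŞIK  İyi, değil mi?"))  # ışık iyi değil mi
print(encode("aç"))                       # [1, 4]
```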
+
+ ### 2. `train_pro.py`
+ Initial training script:
+ - 40 epochs of training
+ - Batch size: 16
+ - Learning rate: 0.0003
+ - Data augmentation with SpecAugment
+ - Model saved after each epoch
+
+ ### 3. `resume.py`
+ Resume training script:
+ - Continue training from a saved model
+ - Lower learning rate (0.00005)
+ - Increased regularization
+ - Designed for epochs 41-75
+
+ ### 4. `check_voca.py`
+ Helper script for alphabet verification. Displays the character set used by the model.
+
+ ### 5. `count.py`
+ Dataset statistics:
+ - Total number of recordings
+ - Total duration calculation
+ - Fast calculation if `clip_durations.tsv` exists, otherwise scans audio files
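
The fast path described above amounts to summing one column. The sketch below is an illustration rather than the contents of `count.py`; it assumes `clip_durations.tsv` has a `clip` column and a `duration[ms]` column holding milliseconds, and `total_hours` is a hypothetical helper. The sample durations are made up.

```python
import csv
import io


def total_hours(tsv_file, column="duration[ms]"):
    """Sum clip durations (milliseconds) from a TSV file and return (count, hours)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # The column name is an assumption about the Common Voice file layout.
    durations_ms = [int(row[column]) for row in reader]
    return len(durations_ms), sum(durations_ms) / 3_600_000


# In-memory stand-in for tr/clip_durations.tsv with made-up durations.
sample = io.StringIO(
    "clip\tduration[ms]\n"
    "common_voice_tr_1.mp3\t3600\n"
    "common_voice_tr_2.mp3\t5400\n"
)
count, hours = total_hours(sample)
print(count, hours)
```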
+
+ ## Installation
+
+ ### Requirements
+ ```bash
+ pip install torch torchaudio pandas Levenshtein sounddevice scipy numpy
+ ```
+
+ ### Preparing the Dataset
+ 1. Download the Mozilla Common Voice Turkish dataset
+ 2. Extract it to the `tr/` folder
+ 3. The structure should be:
+ ```
+ tr/
+ ├── clips/
+ │   ├── common_voice_tr_*.mp3
+ │   └── ...
+ ├── train.tsv
+ ├── test.tsv
+ └── clip_durations.tsv (optional)
+ ```
+
+ ## Training
+
+ ### Initial Training
+ ```bash
+ python train_pro.py
+ ```
+ - Trains for 40 epochs
+ - Saves `model_advanced_epoch_X.pth` after each epoch
+ - Terminal output shows loss, CER score, and sample predictions
+
+ ### Resume Training
+ ```bash
+ python resume.py
+ ```
+ - Starts from `model_advanced_epoch_40.pth`
+ - Trains epochs 41-75
+ - Uses a lower learning rate for fine-tuning
+
+ ## Data Augmentation
+
+ The model uses two types of data augmentation during training:
+
+ 1. **Waveform Noise** (`data.py`): Random Gaussian noise in training mode
+ 2. **SpecAugment** (`train_pro.py`, `resume.py`): Frequency and time masking
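
To make the two augmentations concrete, here is a dependency-free sketch on plain Python lists. It is not the repository's implementation, which presumably operates on tensors; the noise level and mask widths are placeholder values, and `add_gaussian_noise` and `mask_band` are hypothetical names.

```python
import random


def add_gaussian_noise(waveform, std=0.005, rng=random):
    """Return a copy of the waveform with zero-mean Gaussian noise added."""
    return [sample + rng.gauss(0.0, std) for sample in waveform]


def mask_band(spec, axis, max_width, rng=random):
    """Zero one random band of a [freq][time] spectrogram along the given axis."""
    n_freq, n_time = len(spec), len(spec[0])
    size = n_freq if axis == "freq" else n_time
    width = rng.randint(0, min(max_width, size))
    start = rng.randint(0, size - width)
    masked = [row[:] for row in spec]  # copy so the input stays untouched
    for f in range(n_freq):
        for t in range(n_time):
            position = f if axis == "freq" else t
            if start <= position < start + width:
                masked[f][t] = 0.0
    return masked


rng = random.Random(0)
spec = [[1.0] * 50 for _ in range(128)]  # dummy 128-bin, 50-frame spectrogram
augmented = mask_band(mask_band(spec, "freq", 15, rng), "time", 10, rng)
noisy = add_gaussian_noise([0.0] * 4, rng=rng)
```

Applying the frequency mask and the time mask in sequence gives the familiar SpecAugment pattern of one horizontal and one vertical band of zeros.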
+
+ ## Performance Metrics
+
+ Model performance is measured with CER (Character Error Rate):
+ - CER: character-level edit distance between prediction and reference, divided by the reference length
+ - Evaluated on the test set after each epoch
+ - Sample predictions are printed to the console
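
For reference, CER for a single utterance can be computed as follows. This is a self-contained sketch with a hand-written edit distance; the `Levenshtein` package listed under Requirements provides a `distance` function that could replace it. Whether the training scripts average per utterance or over the whole test set is not stated, so only the per-utterance form is shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


print(cer("merhaba", "merhaba"))  # 0.0
print(cer("merhaba", "merhapa"))  # one substitution over 7 characters
```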
+
+ ## Model Outputs
+
+ After training, a model file exists for each epoch:
+ - `model_advanced_epoch_1.pth` through `model_advanced_epoch_75.pth`
+ - The best-performing model can be selected for use
+
+ ## Dataset Analysis
+
+ To get information about the dataset:
+ ```bash
+ python count.py
+ ```
+
+ This script displays the total number of recordings and their total duration.
+
+ ## Notes
+
+ - GPU usage is automatically detected
+ - Gradient clipping is applied during training
+ - All parameters are saved when the model is stored
+ - Alphabet: `_abcçdefgğhıijklmnoöprsştuüvyzqwx ` (37 characters)
+
+ ## License
+
+ ### Code
+ MIT License - Feel free to use, modify, and distribute this code.
+
+ ### Dataset
+ The Mozilla Common Voice Turkish dataset is licensed under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The dataset is in the public domain and free to use for any purpose.