# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Official implementation of the Efficient Conformer, a progressively downsampled Conformer with grouped attention for Automatic Speech Recognition.

**Efficient Conformer [Paper](https://arxiv.org/abs/2109.01163)**

## Efficient Conformer Encoder
Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks using grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
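
The grouping idea can be sketched in a few lines. Below is a minimal single-head NumPy illustration; the actual model uses grouped *multi-head* attention with relative positional encodings, and the function name and shapes here are illustrative assumptions only:

```python
import numpy as np

def grouped_attention(x, group_size):
    """Scaled dot-product self-attention over grouped time elements.

    Neighbouring frames are concatenated along the feature dimension
    before attention, shrinking the T x T score matrix to
    (T/g) x (T/g) and so cutting attention cost by the group size g.
    """
    B, T, D = x.shape
    g = group_size
    assert T % g == 0, "sequence length must be divisible by the group size"
    # Group g neighbouring frames: (B, T, D) -> (B, T//g, D*g)
    xg = x.reshape(B, T // g, D * g)
    # Scaled dot-product attention on the shorter, wider sequence
    scores = xg @ xg.transpose(0, 2, 1) / np.sqrt(D * g)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ xg
    # Ungroup back to the original shape: (B, T//g, D*g) -> (B, T, D)
    return out.reshape(B, T, D)

x = np.random.randn(2, 8, 16)
y = grouped_attention(x, group_size=2)
print(y.shape)  # (2, 8, 16)
```

With `group_size=2` the attention matrix is 4 x 4 instead of 8 x 8, at the price of a coarser time resolution inside each attention layer.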

<img src="media/EfficientConformer.jpg" width="50%"/>

## Installation
Clone the GitHub repository and set up the environment:
```
git clone https://github.com/burchim/EfficientConformer.git
cd EfficientConformer
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode)

## Prepare dataset and training pipeline

Steps:

- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Each domain folder should contain the corresponding .wav and .txt files.
- Add noise to the audio using add_noise.py.
- Change the speaking speed using speed_permutation.py.
- Extract audio length and BPE tokens using prepare_dataset.py.
- Filter audio by the maximum length specified, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using train.py (please read the parameters carefully).
- Prepare a lm_corpus.txt file and train an n-gram BPE language model using train_lm.py.
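
As an illustration of the length-filtering step above, here is a minimal sketch using only the Python standard library; the function name, the threshold value, and the one-filename-per-line list format are assumptions, not the exact behaviour of prepare_dataset.py:

```python
import wave
from pathlib import Path

def filter_by_max_length(wav_dir, max_seconds, out_list):
    """Keep .wav files no longer than max_seconds and write their names
    to a training list (e.g. data/train_wav_names.txt), one per line."""
    kept = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            duration = w.getnframes() / w.getframerate()  # seconds
        if duration <= max_seconds:
            kept.append(wav_path.name)
    Path(out_list).write_text("\n".join(kept))
    return kept
```

Filtering out very long utterances keeps GPU memory usage bounded and stabilises batch sizes during training.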

## Evaluation
Please read the code carefully before running!
```
bash test.sh
```
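
Evaluation reports both greedy and n-gram rescored results. Greedy (best-path) CTC decoding simply collapses repeated labels and drops blanks from the per-frame argmax ids; a minimal sketch, where the blank id value is an assumption:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Best-path CTC decoding: collapse repeats, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        # Emit a label only when it changes and is not the blank token
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

In practice `frame_ids` would be the argmax over the model's per-frame CTC logits; the n-gram results instead use beam search with the trained language model via ctcdecode.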

## Monitor training

```
tensorboard --logdir callback_path
```

<img src="media/logs.jpg"/>

## Performance

| Model | Size | Type | Params (M) | gigaspeech_test / vlsp2023_test_pb / vlsp2023_test_pr greedy WER (%) | gigaspeech_test / vlsp2023_test_pb / vlsp2023_test_pr n-gram WER (%) | GPUs |
| :-------------------: |:--------: |:-----:|:----------:|:------:|:------:|:------:|
| Efficient Conformer | Small | CTC | 13.4 | 19.61 / 23.06 / 23.17 | 17.86 / 21.11 / 21.42 | 1 x RTX 3090 |
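
The WER figures above are the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate in percent: edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the processed prefix of r
    # and the first j words of h (rolling 1-D dynamic programming).
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (r[i - 1] != h[j - 1]))      # substitution
            prev = cur
    return 100.0 * dp[-1] / len(r)

print(round(wer("toi di hoc", "toi di cho"), 2))  # 33.33
```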

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below:
https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference
[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)
<br><br>
* Maxime Burchi [@burchim](https://github.com/burchim)