---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---
# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition [Paper](https://arxiv.org/abs/2109.01163)
## Efficient Conformer Encoder
Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks using grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
<img src="EfficientConformer.jpg" width="35%"/>
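The grouping idea can be sketched in a few lines of PyTorch. This is a simplified single-head illustration (no projections, masking, or relative positional encoding), not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def grouped_self_attention(x, group_size):
    """Sketch of grouped attention: neighbouring time steps are
    concatenated along the feature dim, shrinking the attention
    matrix from (T x T) to (T/g x T/g)."""
    B, T, D = x.shape
    assert T % group_size == 0
    # group neighbouring frames: (B, T, D) -> (B, T/g, D*g)
    xg = x.reshape(B, T // group_size, D * group_size)
    scores = xg @ xg.transpose(1, 2) / (D * group_size) ** 0.5
    out = F.softmax(scores, dim=-1) @ xg
    # ungroup back to the original resolution
    return out.reshape(B, T, D)

x = torch.randn(2, 8, 16)
y = grouped_self_attention(x, group_size=2)
print(y.shape)  # torch.Size([2, 8, 16])
```

With `group_size=2` the attention matrix has 4x fewer entries while the output keeps the input shape.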
## Installation
Clone the GitHub repository and set up the environment:
```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```
Install [ctcdecode](https://github.com/parlance/ctcdecode)
## Prepare dataset and training pipeline
Datasets used to train this mini version:
- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:
- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each VLSP2020 folder, there should be corresponding .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio length and BPE tokens using **prepare_dataset.py**.
- Filter audio by the maximum length specified, using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** file and train an **n-gram BPE language model** using **train_lm.py**.
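As a concrete example of the augmentation step above, here is a minimal sketch of mixing noise into a waveform at a target SNR. It only illustrates the idea; the actual logic lives in **add_noise.py** and may differ:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix noise into a clean waveform at a target SNR (dB).
    Illustrative sketch only."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # scale the noise so that clean_power / scaled_noise_power == 10**(snr_db/10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
noise = rng.standard_normal(16000)
noisy = add_noise(clean, noise, snr_db=10)
```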
## Evaluation
Please read **test.py** carefully before running:
```
bash test.sh
```
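For reference, the greedy decoding reported in the performance tables collapses repeated per-frame argmax labels and drops CTC blanks. A minimal plain-Python sketch with hypothetical token IDs (the repo's real decoding pipeline is in **test.py**):

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """Collapse repeated labels, then remove blanks (CTC greedy decoding)."""
    out, prev = [], None
    for tok in frame_argmax:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 3]))  # [1, 1, 2, 3]
```

Note how the blank (0) separates the two 1s, so both survive, while the repeated 2s collapse to one token.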
## Monitor training
```
tensorboard --logdir callback_path
```
<img src="logs.jpg" width="55%" />
## Vietnamese Performance
| Model | Gigaspeech_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pb_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pr_test<br>(Greedy / n-gram Beam Search) |
|:--------------------------------------|:------------------------------------------------:|:-------------------------------------------------:|:-------------------------------------------------:|
| **EC-Small-CTC** | **19.61 / 17.47** | **23.06 / 20.83** | **23.17 / 21.15** |
| **PhoWhisper-Tiny**                    |                    **20.45**                     |                     **33.21**                     |                     **33.02**                     |
| **PhoWhisper-Base**                    |                    **18.78**                     |                     **29.25**                     |                     **28.29**                     |
In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. Detailed results are in the technical report below:
https://www.overleaf.com/read/nhqjtcpktjyc#3b472e
## Reference
[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)
* Maxime Burchi [@burchim](https://github.com/burchim) |