---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---

# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

[Paper](https://arxiv.org/abs/2109.01163)

## Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
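The grouping idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual implementation: `group_time` and `GroupedMHSA` are made-up names, and `nn.MultiheadAttention` stands in for the full Conformer attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_time(x, g):
    # (batch, time, dim) -> (batch, time // g, dim * g):
    # neighbouring frames are concatenated along the feature dimension
    b, t, d = x.shape
    pad = (g - t % g) % g                      # pad so time divides evenly by g
    x = F.pad(x, (0, 0, 0, pad))
    return x.reshape(b, (t + pad) // g, d * g)

class GroupedMHSA(nn.Module):
    """Multi-head self-attention over a grouped (shorter, wider) sequence."""
    def __init__(self, dim, heads, group_size):
        super().__init__()
        self.g = group_size
        self.attn = nn.MultiheadAttention(dim * group_size, heads, batch_first=True)

    def forward(self, x):
        b, t, d = x.shape
        xg = group_time(x, self.g)             # shorter sequence, wider features
        out, _ = self.attn(xg, xg, xg)         # scaled dot-product attention
        return out.reshape(b, -1, d)[:, :t]    # ungroup back to original length

x = torch.randn(2, 100, 64)
y = GroupedMHSA(dim=64, heads=4, group_size=2)(x)
print(y.shape)  # torch.Size([2, 100, 64])
```

With group size `g`, the attended sequence is `g` times shorter while the feature width grows by `g`, so the quadratic attention term shrinks by a factor of `g`.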

<img src="EfficientConformer.jpg" width="35%"/>

## Installation

Clone the GitHub repository and set up the environment:

```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode)

## Prepare dataset and training pipeline

Datasets used to train this mini version:

- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:

- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each domain folder (e.g. VLSP2020), there should be corresponding .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio lengths and BPE tokens using **prepare_dataset.py**.
- Filter audio by the specified maximum length using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** file and train an **n-gram BPE language model** using **train_lm.py**.
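The length-filtering step above can be illustrated with the Python standard library alone. This is a sketch under assumed names (`wav_duration`, `filter_max_length` are made up here), not the repository's **filter_max_length.py**:

```python
import os
import tempfile
import wave

def wav_duration(path):
    # read the duration in seconds from the wav header
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def filter_max_length(paths, max_seconds):
    # keep only files whose duration is at or under the cap
    return [p for p in paths if wav_duration(p) <= max_seconds]

# demo: write a 1-second silent 16 kHz mono wav, then filter it
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

kept = filter_max_length([path], max_seconds=10.0)
print(len(kept))  # 1
```

The surviving paths would then be written one per line to a list file such as data/train_wav_names.txt.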

## Evaluation

Please read **test.py** carefully before running:

```
bash test.sh
```
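For context on the "Greedy" numbers reported below: greedy CTC decoding takes the best label per frame, collapses repeated labels, and drops blanks. A minimal sketch with made-up token IDs:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    # collapse repeats, then drop blanks (standard CTC best-path decoding)
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

print(ctc_greedy_decode([0, 3, 3, 0, 2, 2, 2, 0, 3]))  # [3, 2, 3]
```

The beam-search numbers additionally rescore hypotheses with the n-gram BPE language model via ctcdecode.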

## Monitor training

```
tensorboard --logdir callback_path
```

<img src="logs.jpg" width="55%" />

## Vietnamese Performance

| Model | Gigaspeech_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pb_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pr_test<br>(Greedy / n-gram Beam Search) |
|:--------------------------------------|:------------------------------------------------:|:-------------------------------------------------:|:-------------------------------------------------:|
| **EC-Small-CTC** | **19.61 / 17.47** | **23.06 / 20.83** | **23.17 / 21.15** |
| **PhoWhisper-Tiny** | **20.45** | **33.21** | **33.02** |
| **PhoWhisper-Base** | **18.78** | **29.25** | **28.29** |

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below:

https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference

[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)

* Maxime Burchi [@burchim](https://github.com/burchim)