---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---

# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition [Paper](https://arxiv.org/abs/2109.01163)

## Efficient Conformer Encoder
Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
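The grouping trick can be sketched in plain Python. This is a toy illustration, not the repository's implementation: it assumes a single head, no learned Q/K/V projections, and a sequence length divisible by the group size.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def grouped_attention(x, g):
    """x: list of T feature vectors (each of length d). Concatenate g
    neighbouring frames along the feature dim -> T//g vectors of length
    d*g, then run scaled dot-product self-attention on the shorter
    sequence, reducing attention cost from O(T^2) to O((T/g)^2)."""
    T, d = len(x), len(x[0])
    assert T % g == 0, "sequence length must be divisible by group size"
    # group neighbouring time steps: (T, d) -> (T//g, d*g)
    grouped = [sum((x[i * g + k] for k in range(g)), []) for i in range(T // g)]
    dk = d * g
    out = []
    for q in grouped:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk)
                  for k in grouped]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, grouped))
                    for c in range(dk)])
    return out
```

In the real encoder this runs on PyTorch tensors inside each attention head, and the grouped outputs are un-grouped back to the original time resolution.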

<img src="EfficientConformer.jpg" width="35%"/>

## Installation
Clone the GitHub repository and set up the environment:
```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode)

## Prepare dataset and training pipeline

Datasets used to train this mini version:
- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:

- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each VLSP2020 folder, there should be corresponding .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio length and BPE tokens using **prepare_dataset.py**.
- Filter audio by the maximum length specified, using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** and train an **n-gram BPE language model** using **train_lm.py**.
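Before running the preparation scripts, it helps to verify the folder layout described above: every .wav should sit next to a same-stem .txt transcript. A short stdlib sketch of that check (the `collect_pairs` helper is illustrative, not part of the repository):

```python
from pathlib import Path

def collect_pairs(dataset_root):
    """Pair each .wav under the dataset root (e.g. ASRDataset/VLSP2020)
    with its same-stem .txt transcript; collect wavs with no transcript."""
    pairs, missing = [], []
    for wav in sorted(Path(dataset_root).rglob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            pairs.append((wav, txt))  # ready for prepare_dataset.py
        else:
            missing.append(wav)       # fix these before training
    return pairs, missing
```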

## Evaluation
Please read **test.py** carefully before running the evaluation:
```
bash test.sh
```

## Monitor training

```
tensorboard --logdir callback_path
```

<img src="logs.jpg" width="55%" />

## Vietnamese Performance


| Model                                  | Gigaspeech_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pb_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pr_test<br>(Greedy / n-gram Beam Search) |
|:--------------------------------------|:------------------------------------------------:|:-------------------------------------------------:|:-------------------------------------------------:|
| **EC-Small-CTC**     |              **19.61 / 17.47**                   |               **23.06 / 20.83**                   |               **23.17 / 21.15**                   |
| **PhoWhisper-Tiny**     |              **20.45**                   |               **33.21**                   |               **33.02**                   |
| **PhoWhisper-Base**     |              **18.78**                   |               **29.25**                   |               **28.29**                   |
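The "Greedy" numbers in the table come from best-path CTC decoding: take the argmax token at each frame, collapse consecutive repeats, then drop blanks. A minimal sketch, assuming each frame is a row of per-token scores and the blank index is 0:

```python
def ctc_greedy_decode(logits, blank=0):
    """Best-path CTC decoding.

    logits: list of frames, each a list of per-token scores.
    Returns the decoded token-id sequence (repeats collapsed, blanks removed).
    """
    # argmax token per frame
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for t in best:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

The n-gram beam-search results instead rescore candidate paths with the BPE language model via ctcdecode, which is why they are consistently lower (better) for the CTC model.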


In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below:
https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference
[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)

* Maxime Burchi [@burchim](https://github.com/burchim)