# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Official implementation of the Efficient Conformer, a progressively downsampled Conformer with grouped attention for Automatic Speech Recognition.

**Efficient Conformer [Paper](https://arxiv.org/abs/2109.01163)**

## Efficient Conformer Encoder
Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three stages, each comprising a number of Conformer blocks using grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
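
A minimal NumPy sketch of the grouped attention idea above (single head, no relative positional encoding; the names and shapes here are illustrative assumptions, not the repository's actual implementation):

```python
import numpy as np

def grouped_attention(q, k, v, group_size):
    """Scaled dot-product attention over grouped time steps (single head)."""
    t, d = q.shape
    assert t % group_size == 0, "sequence length must be divisible by group size"
    g = group_size
    # Concatenate g neighbouring frames along the feature axis: (t, d) -> (t//g, g*d).
    qg, kg, vg = (x.reshape(t // g, g * d) for x in (q, k, v))
    # Attention now runs on a sequence g times shorter, so the score
    # matrix has g^2 fewer entries than full attention.
    scores = qg @ kg.T / np.sqrt(g * d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ vg        # (t//g, g*d)
    return out.reshape(t, d)  # ungroup back to the original shape
```

With group_size = 1 this reduces to standard scaled dot-product attention; larger groups trade attention resolution for lower compute.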

<img src="media/EfficientConformer.jpg" width="50%"/>

## Installation
Clone the GitHub repository and set up the environment:
```
git clone https://github.com/burchim/EfficientConformer.git
cd EfficientConformer
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode).

## Prepare dataset and training pipeline

Steps:

- Prepare a dataset folder containing the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Each domain folder (e.g. VLSP2020) should contain the corresponding .wav and .txt files.
- Add noise to the audio using add_noise.py.
- Change the speaking speed using speed_permutation.py.
- Extract audio lengths and BPE tokens using prepare_dataset.py.
- Filter audio by the specified maximum length, and save the list of audio files used for training to a .txt file, for example: data/train_wav_names.txt.
- Train the model using train.py (please read the parameters carefully).
- Prepare a lm_corpus.txt file and train an n-gram BPE language model using train_lm.py.
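
The length-filtering step above can be sketched as follows (a minimal illustration using Python's standard-library wave module; the function name and the max_seconds threshold are assumptions, not the repository's actual prepare_dataset.py logic):

```python
import os
import wave

def filter_by_length(dataset_dir, out_path, max_seconds=20.0):
    """Keep .wav files no longer than max_seconds; save their paths to a list file."""
    kept = []
    for root, _, files in os.walk(dataset_dir):
        for name in sorted(files):
            if not name.endswith(".wav"):
                continue
            path = os.path.join(root, name)
            # Duration in seconds from the wav header: frames / sample rate.
            with wave.open(path, "rb") as wav:
                duration = wav.getnframes() / wav.getframerate()
            if duration <= max_seconds:
                kept.append(path)
    with open(out_path, "w") as f:
        f.write("\n".join(kept) + "\n")
    return kept
```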

## Evaluation
Please read the code carefully!
```
bash test.sh
```

## Monitor training

```
tensorboard --logdir callback_path
```

<img src="media/logs.jpg"/>

## Performance

WER is reported on gigaspeech_test / vlsp2023_test_pb / vlsp2023_test_pr.

| Model | Size | Type | Params (M) | Greedy WER (%) | n-gram WER (%) | GPUs |
| :-------------------: |:--------: |:-----:|:----------:|:------:|:------:|:------:|
| Efficient Conformer | Small | CTC | 13.4 | 19.61 / 23.06 / 23.17 | 17.86 / 21.11 / 21.42 | 1 x RTX 3090 |

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report: https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference
[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)
<br><br>
* Maxime Burchi [@burchim](https://github.com/burchim)