---
tags:
- speech-to-text
- vietnamese
- ai-model
- deep-learning
license: apache-2.0
library_name: pytorch
model_name: EfficientConformerVietnamese
language: vi
---

# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

[Paper](https://arxiv.org/abs/2109.01163)

## Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
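The grouping idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual implementation: `group_time` and `GroupedMHSA` are made-up names, and `nn.MultiheadAttention` stands in for the full Conformer attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_time(x, g):
    # (batch, time, dim) -> (batch, time // g, dim * g):
    # neighbouring frames are concatenated along the feature dimension
    b, t, d = x.shape
    pad = (g - t % g) % g                      # pad so time divides evenly by g
    x = F.pad(x, (0, 0, 0, pad))
    return x.reshape(b, (t + pad) // g, d * g)

class GroupedMHSA(nn.Module):
    """Multi-head self-attention over a grouped (shorter, wider) sequence."""
    def __init__(self, dim, heads, group_size):
        super().__init__()
        self.g = group_size
        self.attn = nn.MultiheadAttention(dim * group_size, heads, batch_first=True)

    def forward(self, x):
        b, t, d = x.shape
        xg = group_time(x, self.g)             # shorter sequence, wider features
        out, _ = self.attn(xg, xg, xg)         # scaled dot-product attention
        return out.reshape(b, -1, d)[:, :t]    # ungroup back to original length

x = torch.randn(2, 100, 64)
y = GroupedMHSA(dim=64, heads=4, group_size=2)(x)
print(y.shape)  # torch.Size([2, 100, 64])
```

With group size `g`, the attended sequence is `g` times shorter while the feature width grows by `g`, so the quadratic attention term shrinks by a factor of `g`.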

<img src="EfficientConformer.jpg" width="35%"/>

## Installation

Clone the GitHub repository and set up the environment:

```
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```

Install [ctcdecode](https://github.com/parlance/ctcdecode)

## Prepare dataset and training pipeline

Datasets used to train this mini version:

- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs

Steps:

- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each domain folder (e.g. VLSP2020), there should be corresponding .wav and .txt files.
- Add noise to the audio using **add_noise.py**.
- Change the speaking speed using **speed_permutation.py**.
- Extract audio lengths and BPE tokens using **prepare_dataset.py**.
- Filter audio by the specified maximum length using **filter_max_length.py**, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using **train.py** (please read the parameters carefully).
- Prepare an **lm_corpus.txt** file and train an **n-gram BPE language model** using **train_lm.py**.
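The length-filtering step above can be illustrated with the Python standard library alone. This is a sketch under assumed names (`wav_duration`, `filter_max_length` are made up here), not the repository's **filter_max_length.py**:

```python
import os
import tempfile
import wave

def wav_duration(path):
    # read the duration in seconds from the wav header
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def filter_max_length(paths, max_seconds):
    # keep only files whose duration is at or under the cap
    return [p for p in paths if wav_duration(p) <= max_seconds]

# demo: write a 1-second silent 16 kHz mono wav, then filter it
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

kept = filter_max_length([path], max_seconds=10.0)
print(len(kept))  # 1
```

The surviving paths would then be written one per line to a list file such as data/train_wav_names.txt.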

## Evaluation

Please read **test.py** carefully before running:

```
bash test.sh
```
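For context on the "Greedy" numbers reported below: greedy CTC decoding takes the best label per frame, collapses repeated labels, and drops blanks. A minimal sketch with made-up token IDs:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    # collapse repeats, then drop blanks (standard CTC best-path decoding)
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

print(ctc_greedy_decode([0, 3, 3, 0, 2, 2, 2, 0, 3]))  # [3, 2, 3]
```

The beam-search numbers additionally rescore hypotheses with the n-gram BPE language model via ctcdecode.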

## Monitor training

```
tensorboard --logdir callback_path
```

<img src="logs.jpg" width="55%" />

## Vietnamese Performance

| Model | Gigaspeech_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pb_test<br>(Greedy / n-gram Beam Search) | VLSP2023_pr_test<br>(Greedy / n-gram Beam Search) |
|:--------------------------------------|:------------------------------------------------:|:-------------------------------------------------:|:-------------------------------------------------:|
| **EC-Small-CTC** | **19.61 / 17.47** | **23.06 / 20.83** | **23.17 / 21.15** |
| **PhoWhisper-Tiny** | **20.45** | **33.21** | **33.02** |
| **PhoWhisper-Base** | **18.78** | **29.25** | **28.29** |

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below:

https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

## Reference

[Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.](https://arxiv.org/abs/2109.01163)

* Maxime Burchi [@burchim](https://github.com/burchim)