Automatic Speech Recognition
Transformers
Safetensors
Vietnamese
whisper
NhutP committed · Commit 983ea7d · verified · 1 Parent(s): 7bd3ee4

Create README.md

Files changed (1)
  1. README.md +78 -0
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ library_name: transformers
+ license: mit
+ datasets:
+ - NhutP/VSV-1100
+ - mozilla-foundation/common_voice_14_0
+ - AILAB-VNUHCM/vivos
+ language:
+ - vi
+ metrics:
+ - wer
+ base_model:
+ - openai/whisper-base
+ ---
+ ## Introduction
+ - We release a new model for the Vietnamese speech recognition task.
+ - We fine-tuned [openai/whisper-base](https://huggingface.co/openai/whisper-base) on our new dataset [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100).
+
+ ## Training data
+
+ | [VSV-1100](https://huggingface.co/datasets/NhutP/VSV-1100) | T2S* | [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) | [VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos) | [VLSP2021](https://vlsp.org.vn/index.php/resources) | Total |
+ |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | 1100 hours | 11 hours | 3.04 hours | 13.94 hours | 180 hours | 1308 hours |
+
+ \* We use a text-to-speech model to generate sentences containing words that do not appear in our dataset.
+
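As a quick sanity check on the table above, the per-source hours can be summed and expressed as shares of the total. This is only an illustrative sketch: the dictionary keys are labels copied from the table headers, not identifiers from any library.

```python
# Hours per training source, copied from the table above.
hours = {
    "VSV-1100": 1100.0,
    "T2S": 11.0,
    "CMV14-vi": 3.04,
    "VIVOS": 13.94,
    "VLSP2021": 180.0,
}

total = sum(hours.values())
print(f"Total: {total:.2f} hours")  # the table rounds this to 1308

# Share of each source in the training mix, largest first.
for name, h in sorted(hours.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {h:>8.2f} h  {100 * h / total:5.1f}%")
```

The breakdown shows that VSV-1100 supplies most of the training audio, with VLSP2021 as the second-largest source.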
+ ## WER result
+ | [CMV14-vi](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0) | [VIVOS](https://huggingface.co/datasets/AILAB-VNUHCM/vivos) | [VLSP2020-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2020-T2](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T1](https://vlsp.org.vn/index.php/resources) | [VLSP2021-T2](https://vlsp.org.vn/index.php/resources) | [Bud500](https://huggingface.co/datasets/linhtran92/viet_bud500) |
+ |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | 13.74 | 7.94 | 16.02 | 45.55 | 17.25 | 12.32 | 6.63 |
+
+
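The scores above are word error rates (WER). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: substitutions, deletions, and insertions divided by the number of reference words, computed with a word-level edit distance. It assumes whitespace tokenization and no text normalization, so the evaluation behind the table may differ in detail. The function name `word_error_rate` and the sample sentences are illustrative and not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not taken from any of the test sets.
reference = "tôi đang học tiếng việt"
hypothesis = "tôi đang học tiếng"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")
```

In this example one of the five reference words is missing from the hypothesis, so the result is a WER of 20%.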
+ ## Usage
+ ### Inference
+ ```python
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
+ import librosa
+ # load model and processor
+ processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-base")
+ model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-base")
+ model.config.forced_decoder_ids = None
+
+ # load an audio file, resampled to the 16 kHz rate Whisper expects
+ array, sampling_rate = librosa.load('path_to_audio', sr=16000)
+ input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features
+ # generate token ids
+ predicted_ids = model.generate(input_features)
+ # decode token ids to text (a list with one string per input)
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+ ```
+ ### Use with pipeline
+ ```python
+ from transformers import pipeline
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="NhutP/ViWhisper-base",
+     max_new_tokens=128,
+     chunk_length_s=30,
+     return_timestamps=False,
+     device="cpu",  # or "cuda" to run on a GPU
+ )
+ output = pipe('path_to_audio')['text']  # path to an audio file sampled at 16 kHz
+ ```
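`chunk_length_s=30` tells the pipeline to split long recordings into 30-second windows before transcription. The sketch below only illustrates that idea by computing window boundaries in samples at 16 kHz. It is a simplification and not the library's implementation; the function name `chunk_boundaries` and the `overlap_s` parameter are hypothetical.

```python
def chunk_boundaries(num_samples, sampling_rate=16000, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk_len = int(chunk_length_s * sampling_rate)
    step = chunk_len - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")

    boundaries = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        boundaries.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return boundaries


# Window boundaries for a 70-second recording at 16 kHz.
for start, end in chunk_boundaries(70 * 16000):
    print(f"{start / 16000:5.1f}s -> {end / 16000:5.1f}s")
```

Neighbouring windows overlap in this sketch so that a word cut at a window edge appears whole in at least one window, which is the usual motivation for chunked transcription with overlap.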
+
+ ## Citation
+
+ ```
+ @misc{VSV-1100,
+   author = {Pham Quang Nhut and Duong Pham Hoang Anh and Nguyen Vinh Tiep},
+   title = {VSV-1100: Vietnamese social voice dataset},
+   url = {https://github.com/NhutP/VSV-1100},
+   year = {2024}
+ }
+ ```
+
+ If you find our project useful, please give us a star on GitHub: https://github.com/NhutP/ViWhisper
+
+ Contact me at: 22521061@gm.uit.edu.vn (Pham Quang Nhut)