Davidsamuel101 committed
Commit 4fc4de4 (verified) · 1 Parent(s): 5db1419

Update README.md

Files changed (1):
  1. README.md +168 -3
README.md CHANGED
@@ -1,3 +1,168 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - icefall
+ - phoneme-recognition
+ - automatic-speech-recognition
+ datasets:
+ - bookbot/slr72_dataset
+ - bookbot/slr72_dataset
+ ---
+
+ # Pruned Stateless Zipformer RNN-T Streaming Robust SW v4
+
+ Pruned Stateless Zipformer RNN-T Streaming Robust SW v4 is an automatic speech recognition model trained on the following datasets:
+
+ - [SLR72 dataset](https://www.openslr.org/72/)
+ - [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
+
+ Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
+
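To make the phoneme vocabulary concrete, here is a minimal Python sketch of how predicted token IDs could be mapped back to phonemes. It assumes the `<symbol> <id>` one-pair-per-line layout commonly used for icefall token tables and that ID 0 is the blank symbol; the sample entries below are hypothetical, and the real symbols and IDs in `tokens.txt` will differ.

```python
from typing import Dict, List

# Hypothetical excerpt of a token table ("<symbol> <id>" per line).
# The real file is data/lang_phone/tokens.txt; its contents will differ.
SAMPLE_TOKENS = """\
<blk> 0
w 1
ɑ 2
ʃ 3
i 4
"""


def load_id2token(text: str) -> Dict[int, str]:
    """Parse '<symbol> <id>' lines into an id -> symbol mapping."""
    id2token: Dict[int, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        symbol, idx = line.rsplit(maxsplit=1)
        id2token[int(idx)] = symbol
    return id2token


def ids_to_phonemes(ids: List[int], id2token: Dict[int, str], blank_id: int = 0) -> List[str]:
    """Drop blank IDs (assumed to be 0) and look up the remaining symbols."""
    return [id2token[i] for i in ids if i != blank_id]


id2token = load_id2token(SAMPLE_TOKENS)
phonemes = ids_to_phonemes([1, 2, 0, 3, 4, 2], id2token)
print(phonemes)
```

With the sample table, the ID sequence above yields the phoneme list used as the example in the paragraph before this block.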
+ This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.
+
+ ## Evaluation Results
+
+ ### Chunk-wise Streaming
+
+ ```sh
+ for m in greedy_search fast_beam_search modified_beam_search; do
+   ./zipformer/streaming_decode.py \
+     --epoch 80 \
+     --avg 5 \
+     --causal 1 \
+     --num-encoder-layers 2,2,2,2,2,2 \
+     --feedforward-dim 512,768,768,768,768,768 \
+     --encoder-dim 192,256,256,256,256,256 \
+     --encoder-unmasked-dim 192,192,192,192,192,192 \
+     --chunk-size 16 \
+     --left-context-frames 128 \
+     --exp-dir . \
+     --use-transducer True \
+     --decoding-method "$m" \
+     --num-decode-streams 1000
+ done
+ ```
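As a rough guide to what `--chunk-size 16` and `--left-context-frames 128` mean in wall-clock terms, the sketch below converts frame counts to milliseconds. It assumes these values are counted at the encoder's 50 Hz frame rate (20 ms per frame), as described in the icefall Zipformer recipe options; if the recipe in this repository uses a different rate, the numbers change accordingly.

```python
# Assumption: chunk and left-context sizes are counted in encoder frames at 50 Hz.
FRAME_RATE_HZ = 50


def frames_to_ms(frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Convert a frame count to a duration in milliseconds."""
    return 1000.0 * frames / frame_rate_hz


chunk_ms = frames_to_ms(16)          # value passed to --chunk-size
left_context_ms = frames_to_ms(128)  # value passed to --left-context-frames

print(f"chunk: {chunk_ms:.0f} ms, left context: {left_context_ms:.0f} ms")
```

Under that assumption, each decoding step consumes 320 ms of new audio and can attend to 2.56 s of history.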
+
+ The model achieves the following phoneme error rates on the different test sets:
+
+ | Decoding             | Common Voice 23.0 ES | SLR72  |
+ | -------------------- | :------------------: | :----: |
+ | Fast Beam Search     |        5.57%         | 2.18%  |
+ | Greedy Search        |        2.85%         | 1.56%  |
+ | Modified Beam Search |        2.71%         | 1.47%  |
+
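For readers unfamiliar with the metric, the following sketch shows one common way to compute a phoneme error rate: the total Levenshtein distance (substitutions, deletions, insertions) between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. This is an illustration under that standard definition, not the scoring code used to produce the table above; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference phonemes
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Toy example: one substitution (ʃ -> s) and one deletion (final ɑ).
refs = [["w", "ɑ", "ʃ", "i", "ɑ"]]
hyps = [["w", "ɑ", "s", "i"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER: {per:.2%}")
```

In the toy example, two errors against five reference phonemes give a PER of 40%.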
+ ## Usage
+
+ ### Download Pre-trained Model
+
+ ```sh
+ cd egs/bookbot_sw/ASR
+ mkdir tmp
+ cd tmp
+ git lfs install
+ git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
+ ```
+
+ ### Inference
+
+ To decode with greedy search, run:
+
+ ```sh
+ ./zipformer/jit_pretrained_streaming.py \
+   --nn-model-filename ./tmp/zipformer-streaming-robust-sw-v4/exp-causal/jit_script_chunk_32_left_128.pt \
+   --tokens ./tmp/zipformer-streaming-robust-sw-v4/data/lang_phone/tokens.txt \
+   ./tmp/zipformer-streaming-robust-sw-v4/test_waves/sample1.wav
+ ```
+
+ <details>
+ <summary>Decoding Output</summary>
+
+ ```
+ 2024-10-28 13:54:44,964 INFO [jit_pretrained_streaming.py:184] device: cuda:0
+ 2024-10-28 13:54:45,325 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
+ 2024-10-28 13:54:45,325 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw-v4/test_waves/sample1.wav
+ 2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
+ 2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:207] Decoding started
+ 2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
+ 2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:213] T: 77
+ 2024-10-28 13:54:45,364 INFO [jit_pretrained_streaming.py:229] 0/130368
+ 2024-10-28 13:54:45,366 INFO [jit_pretrained_streaming.py:229] 4000/130368
+ 2024-10-28 13:54:45,367 INFO [jit_pretrained_streaming.py:229] 8000/130368
+ 2024-10-28 13:54:45,367 INFO [jit_pretrained_streaming.py:229] 12000/130368
+ 2024-10-28 13:54:45,535 INFO [jit_pretrained_streaming.py:229] 16000/130368
+ 2024-10-28 13:54:45,536 INFO [jit_pretrained_streaming.py:229] 20000/130368
+ 2024-10-28 13:54:45,545 INFO [jit_pretrained_streaming.py:229] 24000/130368
+ 2024-10-28 13:54:45,546 INFO [jit_pretrained_streaming.py:229] 28000/130368
+ 2024-10-28 13:54:45,547 INFO [jit_pretrained_streaming.py:229] 32000/130368
+ 2024-10-28 13:54:45,556 INFO [jit_pretrained_streaming.py:229] 36000/130368
+ 2024-10-28 13:54:45,557 INFO [jit_pretrained_streaming.py:229] 40000/130368
+ 2024-10-28 13:54:45,566 INFO [jit_pretrained_streaming.py:229] 44000/130368
+ 2024-10-28 13:54:45,567 INFO [jit_pretrained_streaming.py:229] 48000/130368
+ 2024-10-28 13:54:45,567 INFO [jit_pretrained_streaming.py:229] 52000/130368
+ 2024-10-28 13:54:45,576 INFO [jit_pretrained_streaming.py:229] 56000/130368
+ 2024-10-28 13:54:45,577 INFO [jit_pretrained_streaming.py:229] 60000/130368
+ 2024-10-28 13:54:45,587 INFO [jit_pretrained_streaming.py:229] 64000/130368
+ 2024-10-28 13:54:45,587 INFO [jit_pretrained_streaming.py:229] 68000/130368
+ 2024-10-28 13:54:45,588 INFO [jit_pretrained_streaming.py:229] 72000/130368
+ 2024-10-28 13:54:45,597 INFO [jit_pretrained_streaming.py:229] 76000/130368
+ 2024-10-28 13:54:45,598 INFO [jit_pretrained_streaming.py:229] 80000/130368
+ 2024-10-28 13:54:45,599 INFO [jit_pretrained_streaming.py:229] 84000/130368
+ 2024-10-28 13:54:45,608 INFO [jit_pretrained_streaming.py:229] 88000/130368
+ 2024-10-28 13:54:45,609 INFO [jit_pretrained_streaming.py:229] 92000/130368
+ 2024-10-28 13:54:45,618 INFO [jit_pretrained_streaming.py:229] 96000/130368
+ 2024-10-28 13:54:45,619 INFO [jit_pretrained_streaming.py:229] 100000/130368
+ 2024-10-28 13:54:45,619 INFO [jit_pretrained_streaming.py:229] 104000/130368
+ 2024-10-28 13:54:45,628 INFO [jit_pretrained_streaming.py:229] 108000/130368
+ 2024-10-28 13:54:45,629 INFO [jit_pretrained_streaming.py:229] 112000/130368
+ 2024-10-28 13:54:45,638 INFO [jit_pretrained_streaming.py:229] 116000/130368
+ 2024-10-28 13:54:45,639 INFO [jit_pretrained_streaming.py:229] 120000/130368
+ 2024-10-28 13:54:45,640 INFO [jit_pretrained_streaming.py:229] 124000/130368
+ 2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:229] 128000/130368
+ 2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw-v4/test_waves/sample1.wav
+ 2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:260] wɑʃiɑɑᵐɓɑɔwɑnɑiʃihɑsɑkɑtikɑɛnɛɔlɑmɑʃɑɾikikɑtikɑufɑlmɛhuɔwɛnjɛutɑʄiɾiwɑmɑfutɑ
+ 2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:262] Decoding Done
+ ```
+
+ </details>
+
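The decoded result in the log is printed as one concatenated string. If a list of individual phonemes is needed, one option is a greedy longest-match split against the model's vocabulary, sketched below. The vocabulary here is a small hypothetical subset, and treating `ᵐɓ` as a single token is an assumption; in practice the set would be loaded from `tokens.txt`, and greedy matching is only unambiguous if no token is a concatenation of other tokens.

```python
from typing import List, Set


def segment_phonemes(text: str, vocab: Set[str]) -> List[str]:
    """Split a concatenated phoneme string using greedy longest-match lookup."""
    max_len = max(len(token) for token in vocab)
    segments: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                segments.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}: {text[pos]!r}")
    return segments


# Hypothetical subset of the phoneme vocabulary; "ᵐɓ" is assumed to be one token.
VOCAB = {"w", "ɑ", "ʃ", "i", "ᵐɓ"}

segments = segment_phonemes("wɑʃiɑɑᵐɓɑ", VOCAB)
print(segments)
```

The input string is the first few symbols of the decoded output above; a symbol missing from the vocabulary raises a `ValueError` rather than being silently dropped.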
+ ## Training procedure
+
+ ### Install icefall
+
+ ```sh
+ git clone https://github.com/bookbot-hive/icefall
+ cd icefall
+ export PYTHONPATH=`pwd`:$PYTHONPATH
+ ```
+
+ ### Prepare Data
+
+ ```sh
+ cd egs/bookbot_sw/ASR
+ ./prepare.sh
+ ```
+
+ ### Train
+
+ ```sh
+ export CUDA_VISIBLE_DEVICES="0,1"
+ ./zipformer/train.py \
+   --world-size 2 \
+   --num-epochs 40 \
+   --use-fp16 1 \
+   --exp-dir zipformer/exp-causal \
+   --causal 1 \
+   --num-encoder-layers 2,2,2,2,2,2 \
+   --feedforward-dim 512,768,768,768,768,768 \
+   --encoder-dim 192,256,256,256,256,256 \
+   --encoder-unmasked-dim 192,192,192,192,192,192 \
+   --base-lr 0.04 \
+   --max-duration 400 \
+   --use-transducer True \
+   --use-ctc True
+ ```
+
+ ## Frameworks
+
+ - [k2](https://github.com/k2-fsa/k2)
+ - [icefall](https://github.com/bookbot-hive/icefall)
+ - [lhotse](https://github.com/bookbot-hive/lhotse)