+---
+license: apache-2.0
+tags:
+  - icefall
+  - phoneme-recognition
+  - automatic-speech-recognition
+datasets:
+  - bookbot/slr72_dataset
+---
+# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0
+Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is an automatic speech recognition model trained on the following datasets:
+- [SLR72 dataset](https://www.openslr.org/72/)
+- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
+Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
+This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.
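To make the phoneme vocabulary concrete, here is a minimal sketch of how a `tokens.txt` file could be read and used to turn token IDs back into phonemes. It assumes the usual icefall layout of one `<symbol> <id>` pair per line with the blank symbol at ID 0; the five-entry table in the demo is hypothetical and is not the model's real vocabulary.

```python
from pathlib import Path
import tempfile


def load_tokens(path):
    """Parse a tokens.txt file with one `<symbol> <id>` pair per line (assumed layout)."""
    id2token = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip empty lines
        symbol, idx = line.rsplit(maxsplit=1)
        id2token[int(idx)] = symbol
    return id2token


def ids_to_phonemes(ids, id2token, blank_id=0):
    """Map token IDs to phoneme symbols, dropping the blank ID (assumed to be 0)."""
    return [id2token[i] for i in ids if i != blank_id]


# Hypothetical miniature vocabulary, written to a temporary file for the demo.
demo_table = "<blk> 0\nw 1\nɑ 2\nʃ 3\ni 4\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    tokens_path = Path(tmp_dir) / "tokens.txt"
    tokens_path.write_text(demo_table, encoding="utf-8")
    table = load_tokens(tokens_path)

print(ids_to_phonemes([1, 2, 0, 3, 4, 2], table))  # ['w', 'ɑ', 'ʃ', 'i', 'ɑ']
```

To use it with the real vocabulary, point `load_tokens` at the `data/lang_phone/tokens.txt` file from the cloned repository.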
+## Evaluation Results
+### Chunk-wise Streaming
+```sh
+for m in greedy_search fast_beam_search modified_beam_search; do
+  ./zipformer/streaming_decode.py \
+    --epoch 80 \
+    --avg 5 \
+    --causal 1 \
+    --num-encoder-layers 2,2,2,2,2,2 \
+    --feedforward-dim 512,768,768,768,768,768 \
+    --encoder-dim 192,256,256,256,256,256 \
+    --encoder-unmasked-dim 192,192,192,192,192,192 \
+    --chunk-size 16 \
+    --left-context-frames 128 \
+    --exp-dir . \
+    --use-transducer True \
+    --decoding-method $m \
+    --num-decode-streams 1000
+done
+```
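As a rough guide to what `--chunk-size 16` and `--left-context-frames 128` mean in time, the sketch below converts them to milliseconds. It assumes both values are counted at a 50 Hz frame rate, that is, 10 ms feature frames subsampled by 2. That assumption matches the inference log further down, where a `chunk_32` model reports `chunk_length: 64`, but it is not confirmed by this card, and the constant and function names are illustrative only.

```python
FRAME_SHIFT_MS = 10  # assumed fbank frame shift
SUBSAMPLING = 2      # assumed: chunk sizes are counted after 2x subsampling (50 Hz)


def frames_to_ms(num_frames):
    """Convert a frame count at the assumed 50 Hz rate to milliseconds of audio."""
    return num_frames * SUBSAMPLING * FRAME_SHIFT_MS


chunk_ms = frames_to_ms(16)          # --chunk-size 16
left_context_ms = frames_to_ms(128)  # --left-context-frames 128

print(f"chunk: {chunk_ms} ms, left context: {left_context_ms} ms")
# chunk: 320 ms, left context: 2560 ms
```

Under these assumptions the decoder emits results in steps of about a third of a second while attending to roughly 2.5 seconds of past audio.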
+The model achieves the following phoneme error rates on the different test sets:
+| Decoding             | Common Voice 23.0 ES | SLR72  |
+| -------------------- | :------------------: | :----: |
+| Fast Beam Search     |        5.57%         | 2.18%  |
+| Greedy Search        |        2.85%         | 1.56%  |
+| Modified Beam Search |        2.71%         | 1.47%  |
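For readers unfamiliar with the metric, phoneme error rate is conventionally the total edit distance (substitutions, insertions, and deletions) between the reference and hypothesis phoneme sequences, divided by the number of reference phonemes. The following is a minimal sketch of that definition. It is not the scoring script that produced the table above, and the example utterance is made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical example: one substitution (ʃ -> s) and one deletion (final ɑ).
refs = [["w", "ɑ", "ʃ", "i", "ɑ"]]
hyps = [["w", "ɑ", "s", "i"]]
print(f"{phoneme_error_rate(refs, hyps):.2%}")  # 40.00%
```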
+## Usage
+### Download Pre-trained Model
+```sh
+cd egs/bookbot_sw/ASR
+mkdir tmp
+cd tmp
+git lfs install
+git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
+```
+### Inference
+To decode with greedy search, run the following from the recipe directory (the parent of `tmp`):
+```sh
+./zipformer/jit_pretrained_streaming.py \
+  --nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/exp-causal/jit_script_chunk_32_left_128.pt \
+  --tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
+  ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample1.wav
+```
+<details>
+<summary>Decoding Output</summary>
+```
+2024-10-28 13:54:44,964 INFO [jit_pretrained_streaming.py:184] device: cuda:0
+2024-10-28 13:54:45,325 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
+2024-10-28 13:54:45,325 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw-v4/test_waves/sample1.wav
+2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
+2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:207] Decoding started
+2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
+2024-10-28 13:54:45,353 INFO [jit_pretrained_streaming.py:213] T: 77
+2024-10-28 13:54:45,364 INFO [jit_pretrained_streaming.py:229] 0/130368
+2024-10-28 13:54:45,366 INFO [jit_pretrained_streaming.py:229] 4000/130368
+2024-10-28 13:54:45,367 INFO [jit_pretrained_streaming.py:229] 8000/130368
+2024-10-28 13:54:45,367 INFO [jit_pretrained_streaming.py:229] 12000/130368
+2024-10-28 13:54:45,535 INFO [jit_pretrained_streaming.py:229] 16000/130368
+2024-10-28 13:54:45,536 INFO [jit_pretrained_streaming.py:229] 20000/130368
+2024-10-28 13:54:45,545 INFO [jit_pretrained_streaming.py:229] 24000/130368
+2024-10-28 13:54:45,546 INFO [jit_pretrained_streaming.py:229] 28000/130368
+2024-10-28 13:54:45,547 INFO [jit_pretrained_streaming.py:229] 32000/130368
+2024-10-28 13:54:45,556 INFO [jit_pretrained_streaming.py:229] 36000/130368
+2024-10-28 13:54:45,557 INFO [jit_pretrained_streaming.py:229] 40000/130368
+2024-10-28 13:54:45,566 INFO [jit_pretrained_streaming.py:229] 44000/130368
+2024-10-28 13:54:45,567 INFO [jit_pretrained_streaming.py:229] 48000/130368
+2024-10-28 13:54:45,567 INFO [jit_pretrained_streaming.py:229] 52000/130368
+2024-10-28 13:54:45,576 INFO [jit_pretrained_streaming.py:229] 56000/130368
+2024-10-28 13:54:45,577 INFO [jit_pretrained_streaming.py:229] 60000/130368
+2024-10-28 13:54:45,587 INFO [jit_pretrained_streaming.py:229] 64000/130368
+2024-10-28 13:54:45,587 INFO [jit_pretrained_streaming.py:229] 68000/130368
+2024-10-28 13:54:45,588 INFO [jit_pretrained_streaming.py:229] 72000/130368
+2024-10-28 13:54:45,597 INFO [jit_pretrained_streaming.py:229] 76000/130368
+2024-10-28 13:54:45,598 INFO [jit_pretrained_streaming.py:229] 80000/130368
+2024-10-28 13:54:45,599 INFO [jit_pretrained_streaming.py:229] 84000/130368
+2024-10-28 13:54:45,608 INFO [jit_pretrained_streaming.py:229] 88000/130368
+2024-10-28 13:54:45,609 INFO [jit_pretrained_streaming.py:229] 92000/130368
+2024-10-28 13:54:45,618 INFO [jit_pretrained_streaming.py:229] 96000/130368
+2024-10-28 13:54:45,619 INFO [jit_pretrained_streaming.py:229] 100000/130368
+2024-10-28 13:54:45,619 INFO [jit_pretrained_streaming.py:229] 104000/130368
+2024-10-28 13:54:45,628 INFO [jit_pretrained_streaming.py:229] 108000/130368
+2024-10-28 13:54:45,629 INFO [jit_pretrained_streaming.py:229] 112000/130368
+2024-10-28 13:54:45,638 INFO [jit_pretrained_streaming.py:229] 116000/130368
+2024-10-28 13:54:45,639 INFO [jit_pretrained_streaming.py:229] 120000/130368
+2024-10-28 13:54:45,640 INFO [jit_pretrained_streaming.py:229] 124000/130368
+2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:229] 128000/130368
+2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw-v4/test_waves/sample1.wav
+2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:260] wɑʃiɑɑᵐɓɑɔwɑnɑiʃihɑsɑkɑtikɑɛnɛɔlɑmɑʃɑɾikikɑtikɑufɑlmɛhuɔwɛnjɛutɑʄiɾiwɑmɑfutɑ
+2024-10-28 13:54:45,649 INFO [jit_pretrained_streaming.py:262] Decoding Done
+```
+</details>
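The progress lines in the log above (`0/130368`, `4000/130368`, ...) suggest how the audio is fed to the model: the 125568-sample waveform appears to be padded with 4800 trailing samples and then consumed 4000 samples at a time. The sketch below reproduces those numbers under the assumption of 16 kHz audio, a 0.25 s step, and 0.3 s of tail padding. These values are inferred from the log rather than taken from the script, and the function name is illustrative.

```python
SAMPLE_RATE = 16000  # assumed sampling rate in Hz


def stream_offsets(num_samples, step_ms=250, tail_padding_ms=300):
    """Return the padded length and the start offset of each streamed chunk."""
    padded = num_samples + SAMPLE_RATE * tail_padding_ms // 1000
    step = SAMPLE_RATE * step_ms // 1000
    return padded, list(range(0, padded, step))


padded_len, offsets = stream_offsets(125568)
print(padded_len)                # 130368
print(offsets[:3], offsets[-1])  # [0, 4000, 8000] 128000
print(len(offsets))              # 33
```

The 33 offsets line up with the 33 progress lines printed between "Decoding started" and the final transcript.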
+## Training procedure
+### Install icefall
+```sh
+git clone https://github.com/bookbot-hive/icefall
+cd icefall
+export PYTHONPATH=`pwd`:$PYTHONPATH
+```
+### Prepare Data
+```sh
+cd egs/bookbot_sw/ASR
+./prepare.sh
+```
+### Train
+```sh
+export CUDA_VISIBLE_DEVICES="0,1"
+./zipformer/train.py \
+  --world-size 2 \
+  --num-epochs 40 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-causal \
+  --causal 1 \
+  --num-encoder-layers 2,2,2,2,2,2 \
+  --feedforward-dim 512,768,768,768,768,768 \
+  --encoder-dim 192,256,256,256,256,256 \
+  --encoder-unmasked-dim 192,192,192,192,192,192 \
+  --base-lr 0.04 \
+  --max-duration 400 \
+  --use-transducer True --use-ctc True
+```
+## Frameworks
+- [k2](https://github.com/k2-fsa/k2)
+- [icefall](https://github.com/bookbot-hive/icefall)
+- [lhotse](https://github.com/bookbot-hive/lhotse)