---
license: apache-2.0
tags:
- icefall
- phoneme-recognition
- automatic-speech-recognition
datasets:
- bookbot/common_voice_16_1_es
- bookbot/slr72_dataset
---
# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets:
- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
- [SLR72 dataset](https://www.openslr.org/72/)
Instead of predicting sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) therefore contains the IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
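As a rough illustration of how such a phoneme vocabulary can be used, the sketch below parses a token table in a `symbol id` line format and maps predicted IDs back to phonemes. The table contents, the `<blk>` entry, and the helper names `load_tokens` and `ids_to_phonemes` are assumptions made for this example; consult the linked `tokens.txt` for the real symbols and IDs.

```python
from typing import Dict, List

# Hypothetical excerpt of a token table, one "symbol id" pair per line.
# The real symbols and IDs live in data/lang_phone/tokens.txt.
SAMPLE_TOKENS = """\
<blk> 0
w 1
ɑ 2
ʃ 3
i 4
"""


def load_tokens(text: str) -> Dict[int, str]:
    """Parse "symbol id" lines into an id -> symbol mapping."""
    table: Dict[int, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace run so the ID is always the final field.
        symbol, idx = line.rsplit(maxsplit=1)
        table[int(idx)] = symbol
    return table


def ids_to_phonemes(
    ids: List[int], table: Dict[int, str], blank_id: int = 0
) -> List[str]:
    """Map token IDs to phonemes, dropping the (assumed) blank symbol."""
    return [table[i] for i in ids if i != blank_id]


if __name__ == "__main__":
    table = load_tokens(SAMPLE_TOKENS)
    print(ids_to_phonemes([1, 2, 0, 3, 4, 2], table))
```

With the sample table, the IDs `[1, 2, 0, 3, 4, 2]` map back to the phoneme sequence shown in the example above.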
This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.
## Setup
To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html).
When cloning the icefall repo, clone our fork instead of the original: `git clone https://github.com/bookbot-hive/icefall`.
### Download Pre-trained Model
Once you've installed all the necessary packages, follow the steps below:
```sh
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
```
## Evaluation Results
### Chunk-wise Streaming
```sh
for m in greedy_search fast_beam_search modified_beam_search; do
./zipformer/streaming_decode.py \
--epoch 80 \
--avg 5 \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \
--use-transducer True \
--decoding-method $m \
--num-decode-streams 1000
done
```
The model achieves the following phoneme error rates on the different test sets:
| Decoding | Common Voice 23.0 ES | SLR72 |
| -------------------- | :------------------: | :---: |
| Fast Beam Search | 5.57% | 2.18% |
| Greedy Search | 2.85% | 1.56% |
| Modified Beam Search | 2.71% | 1.47% |
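For reference, a phoneme error rate is conventionally the Levenshtein (edit) distance between the reference and hypothesis phoneme sequences, divided by the reference length. The sketch below is a minimal standalone version of that definition, not the scoring code used by icefall; the function names and the example sequences are made up for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance with unit cost per substitution, insertion, deletion."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference symbol missing in hyp)
                curr[j - 1] + 1,     # insertion (extra symbol in hyp)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative sequences only: the hypothesis drops one phoneme.
    ref = ["g", "o", "b", "ʝ", "e", "ɾ", "n", "o"]
    hyp = ["g", "o", "b", "e", "ɾ", "n", "o"]
    print(f"PER = {phoneme_error_rate(ref, hyp):.2%}")
```

In this toy case one deletion over eight reference phonemes gives a PER of 12.5%.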
## Usage
### Inference
To decode with greedy search, run:
```sh
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
```
<details>
<summary>Decoding Output</summary>
```
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
```
</details>
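The log above reports `chunk_length: 32` for a model exported with `--chunk-size 16`, and advances through the audio in steps of 4000 samples, which is 0.25 s at the logged 16 kHz sample rate. The sketch below estimates how much audio one encoder chunk covers. It assumes a 10 ms feature frame shift and a factor of 2 between `chunk_length` and `--chunk-size`; both are inferred from the log rather than stated in this card, so treat the numbers as estimates.

```python
FRAME_SHIFT_MS = 10           # assumption: 10 ms between feature frames
FEATURE_FRAMES_PER_CHUNK_FRAME = 2  # assumption: inferred from chunk_length 32 vs --chunk-size 16


def chunk_duration_ms(chunk_size: int) -> int:
    """Approximate audio duration covered by `chunk_size` encoder frames."""
    return chunk_size * FEATURE_FRAMES_PER_CHUNK_FRAME * FRAME_SHIFT_MS


def samples_per_chunk(chunk_size: int, sample_rate: int = 16000) -> int:
    """Approximate number of audio samples covered by one chunk."""
    return chunk_duration_ms(chunk_size) * sample_rate // 1000


if __name__ == "__main__":
    print("chunk duration (ms):", chunk_duration_ms(16))
    print("samples per chunk:", samples_per_chunk(16))
    print("left context (ms):", chunk_duration_ms(128))
```

Under these assumptions, `--chunk-size 16` corresponds to roughly 320 ms of audio per chunk and `--left-context-frames 128` to roughly 2.56 s of history.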
## Training procedure
### Install icefall
```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```
### Prepare Data
```sh
cd egs/bookbot_es/ASR
./prepare.sh
```
### Train
```sh
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
--world-size 2 \
--num-epochs 80 \
--exp-dir tmp/exp-causal \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--max-duration 1000 \
--base-lr 0.04 \
--use-transducer True \
--use-fp16 1
```
### Exporting to ONNX
To export the trained model to ONNX, run:
```sh
./zipformer/export-onnx-streaming.py \
--tokens data/lang_phone/tokens.txt \
--avg 5 \
--causal 1 \
--exp-dir tmp/zipformer-streaming-robust-es-v0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--use-transducer True \
--epoch 80
```
The script stores the ONNX files in the directory given by `--exp-dir`.
### Converting ONNX to ORT
```sh
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
```
Running the command above converts the ONNX files in the directory to the ORT format, including the int8-quantized versions. The following files will be generated:
**Standard ORT files:**
- `encoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort`
**INT8 Quantized ORT files:**
- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort`
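A small check like the one below can confirm that the conversion produced every expected file. It is only a sketch: the helper names `expected_ort_files` and `missing_files` are hypothetical, and the naming pattern is taken from the file list above rather than from the conversion tool's documentation.

```python
from pathlib import Path
from typing import Iterable, List


def expected_ort_files(epoch: int, avg: int, chunk: int, left: int) -> List[str]:
    """Build the six file names following the pattern in the list above."""
    stem = f"epoch-{epoch}-avg-{avg}-chunk-{chunk}-left-{left}"
    return [
        f"{part}-{stem}{suffix}"
        for suffix in (".ort", ".int8.ort")
        for part in ("encoder", "decoder", "joiner")
    ]


def missing_files(directory: Path, names: Iterable[str]) -> List[str]:
    """Return the names that do not exist as regular files in `directory`."""
    return [name for name in names if not (directory / name).is_file()]


if __name__ == "__main__":
    names = expected_ort_files(epoch=80, avg=5, chunk=16, left=128)
    missing = missing_files(Path("tmp/zipformer-streaming-robust-es-v0"), names)
    print("missing:", missing or "none")
```

Run it from `egs/bookbot_es/ASR` so that the relative path resolves to the export directory.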
## Frameworks
- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)