---
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/common_voice_16_1_es
  - bookbot/slr72_dataset
---

# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0

Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets:

- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
- [SLR72 dataset](https://www.openslr.org/72/)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
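As an illustration of how such a phoneme vocabulary is typically consumed, the sketch below parses a `tokens.txt`-style file into symbol/ID mappings. The symbols and IDs here are made up for the example; refer to the linked vocabulary file for the actual contents.

```python
# Hypothetical excerpt in the usual icefall tokens.txt layout:
# one "<symbol> <id>" pair per line, including special tokens.
sample = """<blk> 0
<sos/eos> 1
a 2
ɑ 3
ʃ 4"""

token_to_id = {}
for line in sample.splitlines():
    # The integer ID is always the last whitespace-separated field.
    sym, idx = line.rsplit(maxsplit=1)
    token_to_id[sym] = int(idx)

# Reverse mapping, used when converting model outputs back to phonemes.
id_to_token = {i: s for s, i in token_to_id.items()}

print(token_to_id["ʃ"])      # → 4
print(id_to_token[3])        # → ɑ
```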

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.

## Setup

To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html).
When cloning the icefall repo, make sure to clone our fork (`git clone https://github.com/bookbot-hive/icefall`) instead of the original.

### Download Pre-trained Model

Once you've installed all the necessary packages, follow the steps below:

```sh
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
```

## Evaluation Results

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 80 \
    --avg 5 \
    --causal 1 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --exp-dir tmp/zipformer-streaming-robust-es-v0/ \
    --use-transducer True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 23.0 ES | SLR72 |
| -------------------- | :------------------: | :---: |
| Fast Beam Search     |        5.57%         | 2.18% |
| Greedy Search        |        2.85%         | 1.56% |
| Modified Beam Search |        2.71%         | 1.47% |
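The phoneme error rate reported above is the usual edit-distance metric, computed like word error rate but over phoneme tokens instead of words. A minimal self-contained sketch (not the icefall implementation):

```python
def per(ref, hyp):
    """Phoneme error rate: Levenshtein distance divided by reference length."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[m][n] / max(m, 1)

ref = ["o", "l", "a"]
hyp = ["o", "l"]
print(f"{per(ref, hyp):.2%}")  # one deletion over 3 phonemes → 33.33%
```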

## Usage

### Inference

To decode with greedy search, run:

```sh
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
```

<details>
<summary>Decoding Output</summary>

```
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
```

</details>

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_es/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 80 \
  --exp-dir tmp/exp-causal \
  --causal 1 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --max-duration 1000 \
  --base-lr 0.04 \
  --use-transducer True \
  --use-fp16 1
```

### Exporting to ONNX

To export the trained model to ONNX, run:

```sh
./zipformer/export-onnx-streaming.py \
    --tokens data/lang_phone/tokens.txt \
    --avg 5 \
    --causal 1 \
    --exp-dir tmp/zipformer-streaming-robust-es-v0 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --use-transducer True \
    --epoch 80
```

It will store the ONNX files inside the specified `exp-dir`.

### Converting ONNX to ORT

```sh
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
```

Running the command above converts the ONNX files to the ORT format, along with int8-quantized versions. The following files will be generated:

**Standard ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort`

**INT8 Quantized ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort`

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)