---
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/common_voice_16_1_es
  - bookbot/slr72_dataset
---

# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0

Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets:

- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
- [SLR72 dataset](https://www.openslr.org/72/)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
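
As a minimal sketch of how a phoneme vocabulary like this can be used, the snippet below maps predicted token IDs back to phoneme symbols. It assumes that `tokens.txt` has one `<symbol> <id>` pair per line and that ID 0 is the blank symbol; the sample table and the helper names `parse_token_table` and `ids_to_phonemes` are hypothetical and are not taken from the repository.

```python
from typing import Dict, List


def parse_token_table(text: str) -> Dict[int, str]:
    """Parse lines of the assumed form '<symbol> <id>' into an id -> symbol map."""
    table: Dict[int, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace so the ID is always the final field.
        symbol, idx = line.rsplit(maxsplit=1)
        table[int(idx)] = symbol
    return table


def ids_to_phonemes(ids: List[int], table: Dict[int, str], blank_id: int = 0) -> List[str]:
    """Convert token IDs to phoneme symbols, dropping the (assumed) blank ID."""
    return [table[i] for i in ids if i != blank_id]


# Hypothetical excerpt; the real vocabulary lives in data/lang_phone/tokens.txt.
SAMPLE = """\
<blk> 0
e 1
l 2
o 3
ɾ 4
"""

table = parse_token_table(SAMPLE)
phonemes = ids_to_phonemes([1, 2, 0, 3], table)
print(phonemes)
```

To use the real vocabulary, read the file contents (for example with `pathlib.Path(...).read_text(encoding="utf-8")`) and pass them to `parse_token_table`.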

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All necessary scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.

## Setup

To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html).
When cloning the icefall repository, clone our fork instead of the original: `git clone https://github.com/bookbot-hive/icefall`.

### Download Pre-trained Model

Once you've installed all the necessary packages, follow the steps below:

```sh
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
```

## Evaluation Results

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 80 \
    --avg 5 \
    --causal 1 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --exp-dir tmp/zipformer-streaming-robust-es-v0/ \
    --use-transducer True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```
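
As a rough guide to what `--chunk-size 16` and `--left-context-frames 128` mean in time, the sketch below converts frame counts to milliseconds. It assumes that both options count frames at a 50 Hz frame rate and that input features use a 10 ms frame shift; both values are assumptions rather than facts stated in this card, so check the option help text in the scripts before relying on them.

```python
FRAME_RATE_HZ = 50      # assumption: frame rate that the chunk options count in
FEATURE_SHIFT_MS = 10   # assumption: frame shift of the input features


def frames_to_ms(num_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Convert a number of frames at the given frame rate to milliseconds."""
    return num_frames * 1000 / frame_rate_hz


chunk_ms = frames_to_ms(16)           # audio covered by one chunk
left_context_ms = frames_to_ms(128)   # history visible to each chunk
feature_frames_per_chunk = int(chunk_ms / FEATURE_SHIFT_MS)

print(f"chunk: {chunk_ms:.0f} ms")
print(f"left context: {left_context_ms:.0f} ms")
print(f"feature frames per chunk: {feature_frames_per_chunk}")
```

Under these assumptions each chunk spans 320 ms of audio, which corresponds to 32 feature frames.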

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 23.0 ES | SLR72 |
| -------------------- | :------------------: | :---: |
| Fast Beam Search     |        5.57%         | 2.18% |
| Greedy Search        |        2.85%         | 1.56% |
| Modified Beam Search |        2.71%         | 1.47% |
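
For readers unfamiliar with the metric, phoneme error rate is conventionally the phoneme-level edit distance between hypothesis and reference, divided by the number of reference phonemes. The snippet below is an illustrative sketch of that definition only; it is not the scoring code that produced the table above, and the example phoneme sequences are made up.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Made-up example: one exact match and one utterance with a deleted phoneme.
refs = [["o", "l", "a"], ["g", "a", "t", "o"]]
hyps = [["o", "l", "a"], ["g", "a", "o"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER: {per:.2%}")
```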

## Usage

### Inference

To decode with greedy search, run:

```sh
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
```

<details>
<summary>Decoding Output</summary>

```
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
```

</details>
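
The numbers in the log can be sanity-checked with a little arithmetic. The sketch below uses only values printed above (sample rate, tensor size, progress total, and progress step). Reading the 4,800-sample difference as padding appended to the end of the audio is an inference from those numbers, not something the log states.

```python
import math

SAMPLE_RATE = 16000       # 'sample_rate' from the logged arguments
NUM_SAMPLES = 114688      # torch.Size([114688])
TOTAL_IN_PROGRESS = 119488  # denominator of the progress lines
STEP = 4000               # increment between progress lines

duration_s = NUM_SAMPLES / SAMPLE_RATE
extra_samples = TOTAL_IN_PROGRESS - NUM_SAMPLES   # inferred padding
extra_s = extra_samples / SAMPLE_RATE
step_s = STEP / SAMPLE_RATE
num_progress_lines = math.ceil(TOTAL_IN_PROGRESS / STEP)

print(f"audio duration: {duration_s:.3f} s")
print(f"extra samples: {extra_samples} ({extra_s:.1f} s)")
print(f"audio fed per progress line: {step_s:.2f} s")
print(f"progress lines: {num_progress_lines}")
```

The result of 30 progress lines matches the log, which runs from `0/119488` to `116000/119488`.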

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_es/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 80 \
  --exp-dir tmp/exp-causal \
  --causal 1 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --max-duration 1000 \
  --base-lr 0.04 \
  --use-transducer True \
  --use-fp16 1
```

### Exporting to ONNX

To export the trained model to ONNX, run:

```sh
./zipformer/export-onnx-streaming.py \
    --tokens data/lang_phone/tokens.txt \
    --avg 5 \
    --causal 1 \
    --exp-dir tmp/zipformer-streaming-robust-es-v0 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --use-transducer True \
    --epoch 80
```

The ONNX files are stored inside the specified `--exp-dir`.

### Converting ONNX to ORT

```sh
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
```

Running the commands above converts the ONNX files, along with their int8-quantized versions, to the ORT format. The following files will be generated:

**Standard ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort`

**INT8 Quantized ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort`
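
A quick way to confirm that the conversion produced everything is to build the expected file names and check which ones are missing. The sketch below follows the naming pattern of the files listed above. The helper names `expected_ort_files` and `missing_ort_files` are hypothetical, and the directory is assumed to be the `tmp/zipformer-streaming-robust-es-v0` folder used earlier.

```python
from pathlib import Path
from typing import List


def expected_ort_files(epoch: int, avg: int, chunk: int, left: int) -> List[str]:
    """Build the six expected ORT file names (standard and int8) for one export."""
    suffix = f"epoch-{epoch}-avg-{avg}-chunk-{chunk}-left-{left}"
    names = []
    for ext in ("ort", "int8.ort"):
        for part in ("encoder", "decoder", "joiner"):
            names.append(f"{part}-{suffix}.{ext}")
    return names


def missing_ort_files(directory: Path, names: List[str]) -> List[str]:
    """Return the expected file names that are not present in the directory."""
    return [name for name in names if not (directory / name).is_file()]


names = expected_ort_files(epoch=80, avg=5, chunk=16, left=128)
# Assumed location; adjust to wherever the conversion was run.
missing = missing_ort_files(Path("tmp/zipformer-streaming-robust-es-v0"), names)
print("missing files:", missing if missing else "none")
```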

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)