---
license: apache-2.0
tags:
- icefall
- phoneme-recognition
- automatic-speech-recognition
datasets:
- bookbot/common_voice_16_1_es
- bookbot/slr72_dataset
---
# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets:
- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
- [SLR72 dataset](https://www.openslr.org/72/)
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Accordingly, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).
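Token tables in icefall typically use a `<symbol> <id>` per-line layout. A minimal sketch of reading such a file into an id-to-phoneme lookup (the sample lines below are illustrative, not the actual contents of `tokens.txt`):

```python
def load_tokens(lines):
    """Parse icefall-style token table lines ("<symbol> <id>") into id -> symbol."""
    id2sym = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sym, idx = line.split()
        id2sym[int(idx)] = sym
    return id2sym

# Hypothetical sample in the same layout as data/lang_phone/tokens.txt
sample = """<blk> 0
a 1
ɾ 2
θ 3
"""
id2sym = load_tokens(sample.splitlines())
```

The inverse mapping (`{v: k for k, v in id2sym.items()}`) is what converts phoneme strings back into model token ids.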
This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.
## Setup
To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html).
When cloning the icefall repo, make sure to clone our fork (`git clone https://github.com/bookbot-hive/icefall`) instead of the original.
### Download Pre-trained Model
Once you've installed all the necessary packages, follow the steps below:
```sh
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
```
## Evaluation Results
### Chunk-wise Streaming
```sh
for m in greedy_search fast_beam_search modified_beam_search; do
./zipformer/streaming_decode.py \
--epoch 80 \
--avg 5 \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \
--use-transducer True \
--decoding-method $m \
--num-decode-streams 1000
done
```
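Of the decoding methods above, greedy search is the simplest: for each encoder frame, the joiner's argmax symbol is emitted until it outputs blank, then decoding advances to the next frame. A toy, self-contained sketch of that loop (not icefall's implementation; `toy_decoder` and `toy_joiner` below are stand-ins for the real networks):

```python
def greedy_search(encoder_out, decoder, joiner, blank=0, max_sym_per_frame=3):
    """Toy RNN-T greedy search: per encoder frame, keep emitting the joiner's
    argmax symbol until it outputs blank (or a per-frame cap is hit)."""
    hyp = []
    for frame in encoder_out:
        for _ in range(max_sym_per_frame):
            logits = joiner(frame, decoder(hyp))
            sym = max(range(len(logits)), key=logits.__getitem__)
            if sym == blank:
                break  # move on to the next frame
            hyp.append(sym)
    return hyp

# Stand-in decoder/joiner: the joiner "wants" to emit each frame's token once.
toy_decoder = lambda hyp: hyp
def toy_joiner(frame, state):
    logits = [0.0, 0.0, 0.0]  # scores for [blank, token 1, token 2]
    if frame in state:
        logits[0] = 1.0       # already emitted -> prefer blank
    else:
        logits[frame] = 1.0
    return logits
```

With `encoder_out = [1, 2]`, this yields `[1, 2]`. Beam variants keep several partial hypotheses per frame instead of only the argmax, which is why they can recover from locally wrong choices.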
The model achieves the following phoneme error rates on the different test sets:
| Decoding | Common Voice 23.0 ES | SLR72 |
| -------------------- | :------------------: | :---: |
| Fast Beam Search | 5.57% | 2.18% |
| Greedy Search | 2.85% | 1.56% |
| Modified Beam Search | 2.71% | 1.47% |
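Phoneme error rate is computed like word error rate, but over phoneme tokens: the Levenshtein distance between hypothesis and reference phoneme sequences, divided by the reference length. A minimal sketch of that metric (not the scoring code used by icefall):

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between phoneme sequences / reference length."""
    d = list(range(len(hyp) + 1))  # one DP row: distance to each hyp prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i       # prev holds the diagonal cell
        for j, h in enumerate(hyp, 1):
            # min of: deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)
```

For example, `phoneme_error_rate(["o", "l", "a", "s"], ["o", "l", "a"])` gives 0.25: one deletion over four reference phonemes.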
## Usage
### Inference
To decode with greedy search, run:
```sh
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
```
<details>
<summary>Decoding Output</summary>
```
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
```
</details>
## Training procedure
### Install icefall
```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```
### Prepare Data
```sh
cd egs/bookbot_es/ASR
./prepare.sh
```
### Train
```sh
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
--world-size 2 \
--num-epochs 80 \
--exp-dir tmp/exp-causal \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--max-duration 1000 \
--base-lr 0.04 \
--use-transducer True \
--use-fp16 1
```
### Exporting to ONNX
To export the trained model to ONNX, run:
```sh
./zipformer/export-onnx-streaming.py \
--tokens data/lang_phone/tokens.txt \
--avg 5 \
--causal 1 \
--exp-dir tmp/zipformer-streaming-robust-es-v0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--use-transducer True \
--epoch 80
```
It will store the ONNX files inside the specified `exp-dir`.
### Converting ONNX to ORT
```sh
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
```
Running the command above converts the ONNX files to the ORT format and also produces int8-quantized versions. The following files will be generated:
**Standard ORT files:**
- `encoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort`
**INT8 Quantized ORT files:**
- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort`
## Frameworks
- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)