---
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/common_voice_16_1_es
  - bookbot/slr72_dataset
---

# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0

Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets:

- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw)
- [SLR72 dataset](https://www.openslr.org/72/)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All necessary scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, as well as the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) logged via TensorBoard.

## Setup

To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html). When cloning the icefall repo, make sure to clone our fork of icefall (`git clone https://github.com/bookbot-hive/icefall`) instead of the original.

### Download Pre-trained Model

Once you've installed all the necessary packages, follow the steps below:

```sh
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
```

## Evaluation Results

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 80 \
    --avg 5 \
    --causal 1 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --exp-dir tmp/zipformer-streaming-robust-es-v0/ \
    --use-transducer True \
    --decoding-method "$m" \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 23.0 ES | SLR72 |
| -------------------- | :------------------: | :---: |
| Fast Beam Search     |        5.57%         | 2.18% |
| Greedy Search        |        2.85%         | 1.56% |
| Modified Beam Search |        2.71%         | 1.47% |

## Usage

### Inference

To decode with greedy search, run:

```sh
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
```
Decoding Output

```
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
```
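
To score a decoded phoneme sequence against a reference transcript, a phoneme error rate can be computed as the edit distance between the two sequences divided by the reference length. The following is a minimal sketch of that idea, assuming both sequences are already split into lists of phoneme tokens. It is not the scoring code used by icefall, and the function names and example sequences are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution in a five-phoneme reference.
ref = ["g", "o", "b", "ʝ", "e"]
hyp = ["g", "o", "b", "j", "e"]
print(f"PER: {phoneme_error_rate(ref, hyp):.2%}")
```

Because the error count is normalized by the reference length, a hypothesis with many insertions can score above 100%.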
## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=$(pwd):$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_es/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 80 \
  --exp-dir tmp/exp-causal \
  --causal 1 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --max-duration 1000 \
  --base-lr 0.04 \
  --use-transducer True \
  --use-fp16 1
```

### Exporting to ONNX

To export the trained model to ONNX, run:

```sh
./zipformer/export-onnx-streaming.py \
  --tokens data/lang_phone/tokens.txt \
  --avg 5 \
  --causal 1 \
  --exp-dir tmp/zipformer-streaming-robust-es-v0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --use-transducer True \
  --epoch 80
```

The ONNX files are stored inside the specified `exp-dir`.

### Converting ONNX to ORT

```sh
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
```

Running the commands above converts the ONNX files to the ORT format, along with the efficient int8-quantized versions. The following files will be generated:

**Standard ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort`

**INT8 Quantized ORT files:**

- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort`
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort`

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)