|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- icefall |
|
|
- phoneme-recognition |
|
|
- automatic-speech-recognition |
|
|
datasets: |
|
|
- bookbot/common_voice_16_1_es |
|
|
- bookbot/slr72_dataset |
|
|
--- |
|
|
|
|
|
# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 |
|
|
|
|
|
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets: |
|
|
|
|
|
- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw) |
|
|
- [SLR72 dataset](https://www.openslr.org/72/) |
|
|
|
|
|
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) therefore consists of the IPA phonemes produced by [gruut](https://github.com/rhasspy/gruut). |
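
As a rough illustration of how such a phoneme vocabulary is used, the sketch below parses an icefall-style `tokens.txt` (one `<token> <id>` pair per line) and maps a phoneme sequence to integer IDs. The sample token list and IDs here are illustrative, not the model's actual vocabulary:

```python
# Sketch: map IPA phoneme strings to integer IDs using an icefall-style
# tokens.txt, where each line is "<token> <id>". The sample content below
# is illustrative only, not this model's real vocabulary.
SAMPLE_TOKENS = """<blk> 0
a 1
b 2
ʃ 3
ɑ 4
w 5
i 6
"""

def load_token_map(text: str) -> dict[str, int]:
    """Parse tokens.txt content into a {token: id} mapping."""
    mapping = {}
    for line in text.strip().splitlines():
        token, idx = line.rsplit(maxsplit=1)
        mapping[token] = int(idx)
    return mapping

token2id = load_token_map(SAMPLE_TOKENS)
ids = [token2id[p] for p in ["w", "ɑ", "ʃ", "i", "ɑ"]]
print(ids)  # [5, 4, 3, 6, 4]
```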
|
|
|
|
|
This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, and the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) were logged via TensorBoard. |
|
|
|
|
|
## Setup |
|
|
|
|
|
To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html). |
|
|
When cloning the icefall repo, make sure to clone our fork (`git clone https://github.com/bookbot-hive/icefall`) instead of the original repository. |
|
|
|
|
|
### Download Pre-trained Model |
|
|
|
|
|
Once you've installed all the necessary packages, follow the steps below: |
|
|
|
|
|
```sh |
|
|
cd egs/bookbot_es/ASR |
|
|
mkdir tmp |
|
|
cd tmp |
|
|
git lfs install |
|
|
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/ |
|
|
cd .. |
|
|
``` |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Chunk-wise Streaming |
|
|
|
|
|
```sh |
|
|
for m in greedy_search fast_beam_search modified_beam_search; do |
|
|
./zipformer/streaming_decode.py \ |
|
|
--epoch 80 \ |
|
|
--avg 5 \ |
|
|
--causal 1 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--chunk-size 16 \ |
|
|
--left-context-frames 128 \ |
|
|
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \ |
|
|
--use-transducer True \ |
|
|
--decoding-method $m \ |
|
|
--num-decode-streams 1000 |
|
|
done |
|
|
``` |
|
|
|
|
|
The model achieves the following phoneme error rates on the different test sets: |
|
|
|
|
|
| Decoding | Common Voice 23.0 ES | SLR72 | |
|
|
| -------------------- | :------------------: | :---: | |
|
|
| Fast Beam Search | 5.57% | 2.18% | |
|
|
| Greedy Search | 2.85% | 1.56% | |
|
|
| Modified Beam Search | 2.71% | 1.47% | |
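
Phoneme error rate is the usual edit-distance metric computed over phoneme sequences: substitutions, insertions, and deletions, divided by the reference length. A minimal stdlib-only sketch of this calculation (not the actual icefall scoring code):

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def per(ref: list[str], hyp: list[str]) -> float:
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One deletion over a three-phoneme reference -> 33.33%
print(f"{per(['o', 'l', 'a'], ['o', 'a']):.2%}")  # 33.33%
```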
|
|
|
|
|
## Usage |
|
|
|
|
|
### Inference |
|
|
|
|
|
To decode with greedy search, run: |
|
|
|
|
|
```sh |
|
|
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \ |
|
|
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \ |
|
|
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \ |
|
|
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
``` |
|
|
|
|
|
<details> |
|
|
<summary>Decoding Output</summary> |
|
|
|
|
|
``` |
|
|
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'} |
|
|
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0 |
|
|
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer |
|
|
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688]) |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32 |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45 |
|
|
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488 |
|
|
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488 |
|
|
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488 |
|
|
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488 |
|
|
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488 |
|
|
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488 |
|
|
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488 |
|
|
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488 |
|
|
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488 |
|
|
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488 |
|
|
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488 |
|
|
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488 |
|
|
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488 |
|
|
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488 |
|
|
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488 |
|
|
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488 |
|
|
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488 |
|
|
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488 |
|
|
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488 |
|
|
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488 |
|
|
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488 |
|
|
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488 |
|
|
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488 |
|
|
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488 |
|
|
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488 |
|
|
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488 |
|
|
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488 |
|
|
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488 |
|
|
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488 |
|
|
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488 |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done |
|
|
``` |
|
|
|
|
|
</details> |
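
The streaming script processes the waveform in fixed-size chunks rather than waiting for the whole utterance; the `N/119488` progress lines above count samples consumed so far. A stdlib-only sketch of that chunked feeding pattern, with a placeholder comment standing in for the actual model call:

```python
def stream_decode(samples: list[float], chunk_size: int = 4000) -> list[int]:
    """Feed audio to a (stand-in) decoder in fixed-size chunks.

    In the real script, each chunk updates the Zipformer encoder's cached
    streaming state; here we only record how many samples were consumed
    before each chunk, mirroring the "N/total" progress log lines.
    """
    fed = 0
    progress = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        # model.decode_chunk(chunk, state) would go here in the real script
        progress.append(fed)
        fed += len(chunk)
    return progress

print(stream_decode([0.0] * 10000, chunk_size=4000))  # [0, 4000, 8000]
```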
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Install icefall |
|
|
|
|
|
```sh |
|
|
git clone https://github.com/bookbot-hive/icefall |
|
|
cd icefall |
|
|
export PYTHONPATH=`pwd`:$PYTHONPATH |
|
|
``` |
|
|
|
|
|
### Prepare Data |
|
|
|
|
|
```sh |
|
|
cd egs/bookbot_es/ASR |
|
|
./prepare.sh |
|
|
``` |
|
|
|
|
|
### Train |
|
|
|
|
|
```sh |
|
|
export CUDA_VISIBLE_DEVICES="0,1" |
|
|
./zipformer/train.py \ |
|
|
--world-size 2 \ |
|
|
--num-epochs 80 \ |
|
|
--exp-dir tmp/exp-causal \ |
|
|
--causal 1 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--max-duration 1000 \ |
|
|
--base-lr 0.04 \ |
|
|
--use-transducer True \ |
|
|
--use-fp16 1 |
|
|
``` |
|
|
|
|
|
### Exporting to ONNX |
|
|
|
|
|
To export the trained model to ONNX, run: |
|
|
|
|
|
```sh |
|
|
./zipformer/export-onnx-streaming.py \ |
|
|
--tokens data/lang_phone/tokens.txt \ |
|
|
--avg 5 \ |
|
|
--causal 1 \ |
|
|
--exp-dir tmp/zipformer-streaming-robust-es-v0 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--chunk-size 16 \ |
|
|
--left-context-frames 128 \ |
|
|
--use-transducer True \ |
|
|
--epoch 80 |
|
|
``` |
|
|
|
|
|
It will store the ONNX files inside the specified `exp-dir`. |
|
|
|
|
|
### Converting ONNX to ORT |
|
|
|
|
|
```sh |
|
|
cd tmp/zipformer-streaming-robust-es-v0 |
|
|
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed . |
|
|
``` |
|
|
|
|
|
Running the command above converts the ONNX files to the ORT format and also produces efficient int8-quantized versions. The following files will be generated: |
|
|
|
|
|
**Standard ORT files:** |
|
|
|
|
|
- `encoder-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
|
|
|
**INT8 Quantized ORT files:** |
|
|
|
|
|
- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
|
|
|
## Frameworks |
|
|
|
|
|
- [k2](https://github.com/k2-fsa/k2) |
|
|
- [icefall](https://github.com/bookbot-hive/icefall) |
|
|
- [lhotse](https://github.com/bookbot-hive/lhotse) |
|
|
|