|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- icefall |
|
|
- phoneme-recognition |
|
|
- automatic-speech-recognition |
|
|
datasets: |
|
|
- bookbot/common_voice_16_1_es |
|
|
- bookbot/slr72_dataset |
|
|
--- |
|
|
|
|
|
# Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 |
|
|
|
|
|
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on the following datasets: |
|
|
|
|
|
- [Common Voice 23.0 Spanish](https://datacollective.mozillafoundation.org/datasets/cmflnuzw51ddgmwjkxpm9z1lw) |
|
|
- [SLR72 dataset](https://www.openslr.org/72/) |
|
|
|
|
|
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. The model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/blob/main/data/lang_phone/tokens.txt) therefore consists of the IPA phonemes produced by [gruut](https://github.com/rhasspy/gruut). |
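
As a rough illustration of how such a phoneme vocabulary is used, the sketch below parses an icefall-style `tokens.txt` (one `<token> <id>` pair per line) and maps a phoneme sequence to integer IDs. The sample token list and IDs here are illustrative, not the model's actual vocabulary:

```python
# Sketch: map IPA phoneme strings to integer IDs using an icefall-style
# tokens.txt, where each line is "<token> <id>". The sample content below
# is illustrative only, not this model's real vocabulary.
SAMPLE_TOKENS = """<blk> 0
a 1
b 2
ʃ 3
ɑ 4
w 5
i 6
"""

def load_token_map(text: str) -> dict[str, int]:
    """Parse tokens.txt content into a {token: id} mapping."""
    mapping = {}
    for line in text.strip().splitlines():
        token, idx = line.rsplit(maxsplit=1)
        mapping[token] = int(idx)
    return mapping

token2id = load_token_map(SAMPLE_TOKENS)
ids = [token2id[p] for p in ["w", "ɑ", "ʃ", "i", "ɑ"]]
print(ids)  # [5, 4, 3, 6, 4]
```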
|
|
|
|
|
This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tree/main) tab, and the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/tensorboard) were logged via TensorBoard. |
|
|
|
|
|
## Setup |
|
|
|
|
|
To set up all the necessary packages, please follow the installation instructions from the official icefall [documentation](https://icefall.readthedocs.io/en/latest/installation/index.html). |
|
|
When cloning the icefall repo, make sure to clone our fork (`git clone https://github.com/bookbot-hive/icefall`) instead of the original repository. |
|
|
|
|
|
### Download Pre-trained Model |
|
|
|
|
|
Once you've installed all the necessary packages, follow the steps below: |
|
|
|
|
|
```sh |
|
|
cd egs/bookbot_es/ASR |
|
|
mkdir tmp |
|
|
cd tmp |
|
|
git lfs install |
|
|
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/ |
|
|
cd .. |
|
|
``` |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Chunk-wise Streaming |
|
|
|
|
|
```sh |
|
|
for m in greedy_search fast_beam_search modified_beam_search; do |
|
|
./zipformer/streaming_decode.py \ |
|
|
--epoch 80 \ |
|
|
--avg 5 \ |
|
|
--causal 1 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--chunk-size 16 \ |
|
|
--left-context-frames 128 \ |
|
|
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \ |
|
|
--use-transducer True \ |
|
|
--decoding-method $m \ |
|
|
--num-decode-streams 1000 |
|
|
done |
|
|
``` |
|
|
|
|
|
The model achieves the following phoneme error rates on the different test sets: |
|
|
|
|
|
| Decoding | Common Voice 23.0 ES | SLR72 | |
|
|
| -------------------- | :------------------: | :---: | |
|
|
| Fast Beam Search | 5.57% | 2.18% | |
|
|
| Greedy Search | 2.85% | 1.56% | |
|
|
| Modified Beam Search | 2.71% | 1.47% | |
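
Phoneme error rate is the usual edit-distance metric computed over phoneme sequences: substitutions, insertions, and deletions, divided by the reference length. A minimal stdlib-only sketch of this calculation (not the actual icefall scoring code):

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def per(ref: list[str], hyp: list[str]) -> float:
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One deletion over a three-phoneme reference -> 33.33%
print(f"{per(['o', 'l', 'a'], ['o', 'a']):.2%}")  # 33.33%
```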
|
|
|
|
|
## Usage |
|
|
|
|
|
### Inference |
|
|
|
|
|
To decode with greedy search, run: |
|
|
|
|
|
```sh |
|
|
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \ |
|
|
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \ |
|
|
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \ |
|
|
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
``` |
|
|
|
|
|
<details> |
|
|
<summary>Decoding Output</summary> |
|
|
|
|
|
``` |
|
|
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'} |
|
|
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0 |
|
|
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer |
|
|
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688]) |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32 |
|
|
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45 |
|
|
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488 |
|
|
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488 |
|
|
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488 |
|
|
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488 |
|
|
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488 |
|
|
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488 |
|
|
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488 |
|
|
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488 |
|
|
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488 |
|
|
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488 |
|
|
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488 |
|
|
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488 |
|
|
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488 |
|
|
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488 |
|
|
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488 |
|
|
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488 |
|
|
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488 |
|
|
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488 |
|
|
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488 |
|
|
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488 |
|
|
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488 |
|
|
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488 |
|
|
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488 |
|
|
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488 |
|
|
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488 |
|
|
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488 |
|
|
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488 |
|
|
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488 |
|
|
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488 |
|
|
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488 |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas |
|
|
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done |
|
|
``` |
|
|
|
|
|
</details> |
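
The streaming script processes the waveform in fixed-size chunks rather than waiting for the whole utterance; the `N/119488` progress lines above count samples consumed so far. A stdlib-only sketch of that chunked feeding pattern, with a placeholder comment standing in for the actual model call:

```python
def stream_decode(samples: list[float], chunk_size: int = 4000) -> list[int]:
    """Feed audio to a (stand-in) decoder in fixed-size chunks.

    In the real script, each chunk updates the Zipformer encoder's cached
    streaming state; here we only record how many samples were consumed
    before each chunk, mirroring the "N/total" progress log lines.
    """
    fed = 0
    progress = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        # model.decode_chunk(chunk, state) would go here in the real script
        progress.append(fed)
        fed += len(chunk)
    return progress

print(stream_decode([0.0] * 10000, chunk_size=4000))  # [0, 4000, 8000]
```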
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Install icefall |
|
|
|
|
|
```sh |
|
|
git clone https://github.com/bookbot-hive/icefall |
|
|
cd icefall |
|
|
export PYTHONPATH=`pwd`:$PYTHONPATH |
|
|
``` |
|
|
|
|
|
### Prepare Data |
|
|
|
|
|
```sh |
|
|
cd egs/bookbot_es/ASR |
|
|
./prepare.sh |
|
|
``` |
|
|
|
|
|
### Train |
|
|
|
|
|
```sh |
|
|
export CUDA_VISIBLE_DEVICES="0,1" |
|
|
./zipformer/train.py \ |
|
|
--world-size 2 \ |
|
|
--num-epochs 80 \ |
|
|
--exp-dir tmp/exp-causal \ |
|
|
--causal 1 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--max-duration 1000 \ |
|
|
--base-lr 0.04 \ |
|
|
--use-transducer True \ |
|
|
--use-fp16 1 |
|
|
``` |
|
|
|
|
|
### Exporting to ONNX |
|
|
|
|
|
To export the trained model to ONNX, run: |
|
|
|
|
|
```sh |
|
|
./zipformer/export-onnx-streaming.py \ |
|
|
--tokens data/lang_phone/tokens.txt \ |
|
|
--avg 5 \ |
|
|
--causal 1 \ |
|
|
--exp-dir tmp/zipformer-streaming-robust-es-v0 \ |
|
|
--num-encoder-layers 2,2,2,2,2,2 \ |
|
|
--feedforward-dim 512,768,768,768,768,768 \ |
|
|
--encoder-dim 192,256,256,256,256,256 \ |
|
|
--encoder-unmasked-dim 192,192,192,192,192,192 \ |
|
|
--chunk-size 16 \ |
|
|
--left-context-frames 128 \ |
|
|
--use-transducer True \ |
|
|
--epoch 80 |
|
|
``` |
|
|
|
|
|
It will store the ONNX files inside the specified `exp-dir`. |
|
|
|
|
|
### Converting ONNX to ORT |
|
|
|
|
|
```sh |
|
|
cd tmp/zipformer-streaming-robust-es-v0 |
|
|
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed . |
|
|
``` |
|
|
|
|
|
Running the command above converts the ONNX files to the ORT format and also produces efficient int8-quantized versions. The following files will be generated: |
|
|
|
|
|
**Standard ORT files:** |
|
|
|
|
|
- `encoder-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
- `decoder-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
- `joiner-epoch-80-avg-5-chunk-16-left-128.ort` |
|
|
|
|
|
**INT8 Quantized ORT files:** |
|
|
|
|
|
- `encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
- `decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
- `joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort` |
|
|
|
|
|
## Frameworks |
|
|
|
|
|
- [k2](https://github.com/k2-fsa/k2) |
|
|
- [icefall](https://github.com/bookbot-hive/icefall) |
|
|
- [lhotse](https://github.com/bookbot-hive/lhotse) |
|
|
|