Pruned transducer statelessX
============================

This tutorial shows you how to run a **streaming** conformer transducer model
with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.

.. Note::

  The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
  `pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
  `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
  and `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__.
  We will take pruned_transducer_stateless4 as an example in this tutorial.

.. HINT::

  We assume you have read the page :ref:`install icefall` and have set up
  the environment for ``icefall``.

.. HINT::

  We recommend using one or more GPUs to run this recipe.

.. hint::

  Please scroll down to the bottom of this page to find download links
  for pretrained models if you don't want to train a model from scratch.

We use pruned RNN-T to compute the loss.

.. note::

  You can find the paper about pruned RNN-T at the following address:

  `<https://arxiv.org/abs/2206.13236>`_

The transducer model consists of 3 parts:

- Encoder, a.k.a. the transcription network. We use a Conformer model (the
  reworked version by Daniel Povey).
- Decoder, a.k.a. the prediction network. We use a stateless model consisting of
  ``nn.Embedding`` and ``nn.Conv1d``.
- Joiner, a.k.a. the joint network.

.. caution::

  In contrast to conventional RNN-T models, we use a stateless decoder.
  That is, it has no recurrent connections.

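To make the "stateless" idea concrete, below is a minimal PyTorch sketch of
such a prediction network. It is only an illustration, not the exact icefall
implementation; the vocabulary size, embedding dimension, and ``context_size``
are made-up values.

.. code-block:: python

  import torch
  import torch.nn as nn


  class StatelessDecoder(nn.Module):
      """Prediction network with no recurrence: its output depends only on
      the last ``context_size`` symbols."""

      def __init__(self, vocab_size: int, embed_dim: int, context_size: int = 2):
          super().__init__()
          self.embedding = nn.Embedding(vocab_size, embed_dim)
          # A 1-D convolution over the last `context_size` symbols replaces
          # the recurrent layers of a conventional RNN-T prediction network.
          self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size)

      def forward(self, y: torch.Tensor) -> torch.Tensor:
          # y: (batch, num_symbols), token IDs
          emb = self.embedding(y).permute(0, 2, 1)  # (batch, embed_dim, num_symbols)
          out = self.conv(emb).permute(0, 2, 1)
          return torch.relu(out)


  decoder = StatelessDecoder(vocab_size=500, embed_dim=512)
  print(decoder(torch.randint(0, 500, (4, 10))).shape)  # torch.Size([4, 9, 512])
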
Data preparation
----------------

.. hint::

  The data preparation is the same as in other LibriSpeech recipes. If you
  have already finished this step, you can skip to ``Training`` directly.

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./prepare.sh

The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.

The data preparation contains several stages. You can use the following two
options:

- ``--stage``
- ``--stop-stage``

to control which stage(s) should be run. By default, all stages are executed.

For example,

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

.. code-block:: bash

  $ ./prepare.sh --stage 2 --stop-stage 5

.. HINT::

  If you have pre-downloaded the `LibriSpeech <https://www.openslr.org/12>`_
  dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
  they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
  the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
  ``./prepare.sh`` won't re-download them.

.. NOTE::

  All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
  are saved in the ``./data`` directory.

We provide the following YouTube video showing how to run ``./prepare.sh``.

.. note::

  To get the latest news about `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
  the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:

  `<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_

.. youtube:: ofEIoJL-mGM

Training
--------

.. NOTE::

  We put the streaming and non-streaming models in one recipe. To train a streaming model, you only
  need to add **4** extra options compared with training a non-streaming model:
  ``--dynamic-chunk-training``, ``--num-left-chunks``, ``--causal-convolution``, and ``--short-chunk-size``.
  You can see the configurable options below for their meanings, or read https://arxiv.org/pdf/2012.05481.pdf for more details.

Configurable options
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless4/train.py --help

shows you the training options that can be passed from the commandline.
The following options are used quite often:

- ``--exp-dir``

  The directory in which to save checkpoints, training logs, and TensorBoard files.

- ``--full-libri``

  If it's True, the training part uses all the training data, i.e.,
  960 hours. Otherwise, the training part uses only the subset
  ``train-clean-100``, which has 100 hours of training data.

  .. CAUTION::

    The training set is perturbed in speed with two factors: 0.9 and 1.1.
    If ``--full-libri`` is True, each epoch actually processes
    ``3x960 == 2880`` hours of data (the original data plus the two
    speed-perturbed copies).

- ``--num-epochs``

  It is the number of epochs to train. For instance,
  ``./pruned_transducer_stateless4/train.py --num-epochs 30`` trains for 30 epochs
  and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
  in the folder ``./pruned_transducer_stateless4/exp``.

- ``--start-epoch``

  It's used to resume training.
  ``./pruned_transducer_stateless4/train.py --start-epoch 10`` loads the
  checkpoint ``./pruned_transducer_stateless4/exp/epoch-9.pt`` and starts
  training from epoch 10, based on the state from epoch 9.

- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases with it.

  **Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
  GPU 2 for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ export CUDA_VISIBLE_DEVICES="0,2"
    $ ./pruned_transducer_stateless4/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ ./pruned_transducer_stateless4/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you only want to use GPU 3
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ export CUDA_VISIBLE_DEVICES="3"
    $ ./pruned_transducer_stateless4/train.py --world-size 1

  .. caution::

    Only multi-GPU single-machine DDP training is implemented at present.
    Multi-GPU multi-machine DDP training will be added later.

- ``--max-duration``

  It specifies the total number of seconds over all utterances in a
  batch, before **padding**.
  If you encounter CUDA OOM, please reduce it.

  .. HINT::

    Due to padding, the total number of seconds of all utterances in a
    batch will usually be larger than ``--max-duration``.

    A larger value for ``--max-duration`` may cause OOM during training,
    while a smaller value may increase the training time. You have to
    tune it.

- ``--use-fp16``

  If it is True, the model is trained with half precision. In our experiments,
  half precision lets you use a ``--max-duration`` about twice as large,
  giving an almost 2x speedup.

- ``--dynamic-chunk-training``

  The flag that indicates whether to train a streaming model or not. It
  **MUST** be True if you want to train a streaming model.

- ``--short-chunk-size``

  When training a streaming attention model with chunk masking, the chunk size
  is either the maximum sequence length of the current batch or uniformly sampled from
  (1, short_chunk_size). The default value is 25; you don't have to change it most of the time.

- ``--num-left-chunks``

  It indicates how much left context (in chunks) can be seen when calculating attention.
  The default value is 4; you don't have to change it most of the time.
  A sketch of the chunk masking idea is shown after this list.

- ``--causal-convolution``

  Whether to use causal convolution in the conformer encoder layers. This
  **MUST** be True when training a streaming model.

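To illustrate what chunk masking means, here is a simplified sketch of how a
chunk-wise attention mask could be built from a chunk size and a number of
left-context chunks. It only demonstrates the idea; the actual masking code in
icefall differs in details such as dynamic chunk sampling.

.. code-block:: python

  import torch


  def chunk_attention_mask(seq_len: int, chunk_size: int, num_left_chunks: int) -> torch.Tensor:
      """Return a (seq_len, seq_len) boolean mask where mask[i, j] is True
      if frame i may attend to frame j: frames in its own chunk plus
      ``num_left_chunks`` chunks of left context."""
      chunk_idx = torch.arange(seq_len) // chunk_size  # chunk index of each frame
      q = chunk_idx.unsqueeze(1)  # chunk of the query frame
      k = chunk_idx.unsqueeze(0)  # chunk of the key frame
      return (k <= q) & (k >= q - num_left_chunks)


  # Frames 0-3 form chunk 0, frames 4-7 chunk 1, etc.; each chunk also
  # sees one chunk of left context.
  print(chunk_attention_mask(seq_len=12, chunk_size=4, num_left_chunks=1).int())
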
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~

There are some training options, e.g., number of encoder layers,
encoder dimension, decoder dimension, number of warmup steps, etc.,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`pruned_transducer_stateless4/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless4/train.py>`_.

You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./pruned_transducer_stateless4/train.py`` directly.

.. NOTE::

  The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
  those of other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, and ``--joiner-dim`` from the commandline, so that you can train models of different sizes with pruned_transducer_stateless5.

Training logs
~~~~~~~~~~~~~

Training logs and checkpoints are saved in ``--exp-dir`` (e.g., ``pruned_transducer_stateless4/exp``).
You will find the following files in that directory:

- ``epoch-1.pt``, ``epoch-2.pt``, ...

  These are checkpoint files saved at the end of each epoch, containing the model's
  ``state_dict`` and the optimizer's ``state_dict``.
  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

    $ ./pruned_transducer_stateless4/train.py --start-epoch 11

- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...

  These are checkpoint files saved every ``--save-every-n`` batches,
  containing the model's ``state_dict`` and the optimizer's ``state_dict``.
  To resume training from some checkpoint, say ``checkpoint-436000.pt``, you can use:

  .. code-block:: bash

    $ ./pruned_transducer_stateless4/train.py --start-batch 436000

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss, learning
  rate, etc., are recorded in these logs. You can visualize them by:

  .. code-block:: bash

    $ cd pruned_transducer_stateless4/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "pruned transducer training for LibriSpeech with icefall"

  It will print something like below:

  .. code-block::

    TensorFlow installation not found - running with reduced feature set.
    Upload started and will continue reading any new data as it's added to the logdir.

    To stop uploading, press Ctrl-C.

    New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/97VKXf80Ru61CnP2ALWZZg/

    [2022-11-20T15:50:50] Started scanning logdir.
    Uploading 4468 scalars...
    [2022-11-20T15:53:02] Total uploaded: 210171 scalars, 0 tensors, 0 binary objects
    Listening for new data in logdir...

  Note there is a URL in the above output. Click it and you will see
  the following screenshot:

  .. figure:: images/streaming-librispeech-pruned-transducer-tensorboard-log.jpg
    :width: 600
    :alt: TensorBoard screenshot
    :align: center
    :target: https://tensorboard.dev/experiment/97VKXf80Ru61CnP2ALWZZg/

    TensorBoard screenshot.

  .. hint::

    If you don't have access to Google, you can use the following command
    to view the tensorboard log locally:

    .. code-block:: bash

      cd pruned_transducer_stateless4/exp/tensorboard
      tensorboard --logdir . --port 6008

    It will print the following message:

    .. code-block::

      Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
      TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)

    Now start your browser and go to `<http://localhost:6008>`_ to view the tensorboard
    logs.

- ``log/log-train-xxxx``

  It is the detailed training log in text format, the same as the one
  you see printed to the console during training.

Usage example
~~~~~~~~~~~~~

You can use the following command to start the training using 4 GPUs:

.. code-block:: bash

  export CUDA_VISIBLE_DEVICES="0,1,2,3"

  ./pruned_transducer_stateless4/train.py \
    --world-size 4 \
    --dynamic-chunk-training 1 \
    --causal-convolution 1 \
    --num-epochs 30 \
    --start-epoch 1 \
    --exp-dir pruned_transducer_stateless4/exp \
    --full-libri 1 \
    --max-duration 300

.. NOTE::

  Compared with training a non-streaming model, you only need to add two extra options here,
  ``--dynamic-chunk-training 1`` and ``--causal-convolution 1``; the other two streaming
  options keep their default values.

Decoding
--------

The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.

.. hint::

  There are two kinds of checkpoints:

  - (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
    of each epoch. You can pass ``--epoch`` to
    ``pruned_transducer_stateless4/decode.py`` to use them.

  - (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are saved
    every ``--save-every-n`` batches. You can pass ``--iter`` to
    ``pruned_transducer_stateless4/decode.py`` to use them.

  We suggest that you try both types of checkpoints and choose the one
  that produces the lowest WERs.

.. tip::

  To decode a streaming model, you can use either ``simulate streaming decoding`` in ``decode.py`` or
  ``real streaming decoding`` in ``streaming_decode.py``. The difference between them is that
  ``decode.py`` processes all the acoustic frames at once with masking (i.e., the same as training),
  while ``streaming_decode.py`` processes the acoustic frames chunk by chunk (so it can only see limited context).

.. NOTE::

  ``simulate streaming decoding`` in ``decode.py`` and ``real streaming decoding`` in ``streaming_decode.py`` should
  produce almost the same results given the same ``--decode-chunk-size`` and ``--left-context``.

Simulate streaming decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless4/decode.py --help

shows the options for decoding.
The following options are important for streaming models:

``--simulate-streaming``

  If you want to decode a streaming model with ``decode.py``, you **MUST** set
  ``--simulate-streaming`` to ``True``. ``simulate`` here means the acoustic frames
  are not processed frame by frame (or chunk by chunk); instead, the whole sequence
  is processed at once with masking (the same as training).

``--causal-convolution``

  If True, the convolution module in the encoder layers will be causal convolution.
  This **MUST** be True when decoding a streaming model.

``--decode-chunk-size``

  For streaming models, we calculate chunk-wise attention; ``--decode-chunk-size``
  indicates the chunk length (in frames after subsampling) for chunk-wise attention.
  For ``simulate streaming decoding``, the ``decode-chunk-size`` is used to generate
  the attention mask.

``--left-context``

  ``--left-context`` indicates how many left context frames (after subsampling) can be seen
  by the current chunk when calculating chunk-wise attention. Normally, ``left-context`` should equal
  ``decode-chunk-size * num-left-chunks``, where ``num-left-chunks`` is the option used
  to train this model. For example, with ``--decode-chunk-size 16`` and a model trained
  with ``--num-left-chunks 4``, use ``--left-context 64``, as in the commands below.
  For ``simulate streaming decoding``, the ``left-context`` is used to generate
  the attention mask.

The following shows two examples (for the two types of checkpoints):

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for epoch in 25 20; do
      for avg in 7 5 3 1; do
        ./pruned_transducer_stateless4/decode.py \
          --epoch $epoch \
          --avg $avg \
          --simulate-streaming 1 \
          --causal-convolution 1 \
          --decode-chunk-size 16 \
          --left-context 64 \
          --exp-dir pruned_transducer_stateless4/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for iter in 474000; do
      for avg in 8 10 12 14 16 18; do
        ./pruned_transducer_stateless4/decode.py \
          --iter $iter \
          --avg $avg \
          --simulate-streaming 1 \
          --causal-convolution 1 \
          --decode-chunk-size 16 \
          --left-context 64 \
          --exp-dir pruned_transducer_stateless4/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done

Real streaming decoding
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless4/streaming_decode.py --help

shows the options for decoding.
The following options are important for streaming models:

``--decode-chunk-size``

  For streaming models, we calculate chunk-wise attention; ``--decode-chunk-size``
  indicates the chunk length (in frames after subsampling) for chunk-wise attention.
  For ``real streaming decoding``, we process ``decode-chunk-size`` acoustic frames at a time.

``--left-context``

  ``--left-context`` indicates how many left context frames (after subsampling) can be seen
  by the current chunk when calculating chunk-wise attention. Normally, ``left-context`` should equal
  ``decode-chunk-size * num-left-chunks``, where ``num-left-chunks`` is the option used
  to train this model.

``--num-decode-streams``

  The number of decoding streams that can run in parallel (very similar to the ``batch size``).
  For ``real streaming decoding``, the batches are packed dynamically. For example, if
  ``num-decode-streams`` equals 10, then sequences 1 to 10 are decoded first; after a while,
  suppose sequences 1 and 2 are done, then sequences 3 to 12 are processed in parallel in a batch.

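The scheduling idea behind ``--num-decode-streams`` can be illustrated with the
following toy Python sketch. It is purely hypothetical (the "sequences" are just
counters of remaining chunks) and does not reflect the actual implementation in
``streaming_decode.py``:

.. code-block:: python

  from collections import deque

  waiting = deque(range(1, 16))      # sequences 1..15 waiting to be decoded
  chunks_left = {}                   # active stream -> remaining chunks
  num_decode_streams = 10

  steps = 0
  while waiting or chunks_left:
      # Fill free slots with waiting sequences.
      while waiting and len(chunks_left) < num_decode_streams:
          seq = waiting.popleft()
          chunks_left[seq] = 2 + seq % 3  # made-up utterance length
      # Process one chunk for every active stream, i.e., one batched forward pass.
      for seq in list(chunks_left):
          chunks_left[seq] -= 1
          if chunks_left[seq] == 0:
              del chunks_left[seq]   # finished; its slot is reused next step
      steps += 1

  print(f"decoded 15 sequences in {steps} batched steps")
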
.. NOTE::

  We also tried adding ``--right-context`` in real streaming decoding, but it does not seem to benefit
  the performance for all the models; the reason might be the mismatch between training and decoding. You
  can try decoding with ``--right-context`` to see if it helps. The default value is 0.

The following shows two examples (for the two types of checkpoints):

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for epoch in 25 20; do
      for avg in 7 5 3 1; do
        ./pruned_transducer_stateless4/streaming_decode.py \
          --epoch $epoch \
          --avg $avg \
          --decode-chunk-size 16 \
          --left-context 64 \
          --num-decode-streams 100 \
          --exp-dir pruned_transducer_stateless4/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for iter in 474000; do
      for avg in 8 10 12 14 16 18; do
        ./pruned_transducer_stateless4/streaming_decode.py \
          --iter $iter \
          --avg $avg \
          --decode-chunk-size 16 \
          --left-context 64 \
          --num-decode-streams 100 \
          --exp-dir pruned_transducer_stateless4/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done

.. tip::

  The supported decoding methods are as follows:

  - ``greedy_search`` : It takes the symbol with the largest posterior probability
    at each frame as the decoding result (a sketch of this procedure is shown
    below, after this tip).

  - ``beam_search`` : It implements Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf, and
    `espnet/nets/beam_search_transducer.py <https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search_transducer.py#L247>`_
    is used as a reference. Basically, it keeps the top-k states for each frame and expands the kept states with their own contexts to
    the next frame.

  - ``modified_beam_search`` : It implements the same algorithm as ``beam_search`` above, but it
    runs in batch mode with ``--max-sym-per-frame=1`` being hardcoded.

  - ``fast_beam_search`` : It implements graph composition between the output ``log_probs`` and
    the given ``FSAs``. It is hard to describe the details in a few lines of text; you can read
    our paper at https://arxiv.org/pdf/2211.00484.pdf or our `rnnt decode code in k2 <https://github.com/k2-fsa/k2/blob/master/k2/csrc/rnnt_decode.h>`_. ``fast_beam_search`` can decode with ``FSAs`` on GPU efficiently.

  - ``fast_beam_search_LG`` : The same as ``fast_beam_search`` above, except that ``fast_beam_search`` uses
    a trivial graph that has only one state, while ``fast_beam_search_LG`` uses an LG graph
    (with an N-gram LM).

  - ``fast_beam_search_nbest`` : It produces the decoding results as follows:

    - (1) Use ``fast_beam_search`` to get a lattice
    - (2) Select ``num_paths`` paths from the lattice using ``k2.random_paths()``
    - (3) Remove duplicates from the selected paths
    - (4) Intersect the selected paths with the lattice and compute the
      shortest path from the intersection result
    - (5) The path with the largest score is used as the decoding output.

  - ``fast_beam_search_nbest_LG`` : It implements the same logic as ``fast_beam_search_nbest``; the
    only difference is that it uses ``fast_beam_search_LG`` to generate the lattice.

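To make the control flow of ``greedy_search`` concrete, here is a toy,
self-contained sketch for a single utterance. The ``decoder`` and ``joiner``
below are made-up stand-ins (random projections), not the real networks; only
the blank/non-blank loop structure is the point:

.. code-block:: python

  import torch

  torch.manual_seed(0)
  vocab_size, enc_dim, dec_dim, T = 10, 8, 8, 5
  enc_out = torch.randn(T, enc_dim)                     # fake encoder output
  dec_proj = torch.randn(dec_dim)                       # stand-in prediction network
  out_proj = torch.randn(vocab_size, enc_dim + dec_dim)

  def decoder(prev_tokens):
      # Toy stand-in; a real decoder embeds the last few emitted tokens.
      return dec_proj * (1 + len(prev_tokens) % 3)

  def joiner(enc_frame, dec_out):
      return out_proj @ torch.cat([enc_frame, dec_out])

  def greedy_search(encoder_out, blank_id=0, max_sym_per_frame=1):
      hyp = []
      dec_out = decoder(hyp)
      for t in range(encoder_out.size(0)):
          for _ in range(max_sym_per_frame):
              token = int(joiner(encoder_out[t], dec_out).argmax())
              if token == blank_id:
                  break                 # blank: move on to the next frame
              hyp.append(token)         # non-blank: emit the symbol and
              dec_out = decoder(hyp)    # refresh the decoder output
      return hyp

  print(greedy_search(enc_out))
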
.. NOTE::

  ``streaming_decode.py`` may support fewer decoding methods than ``decode.py``. If needed,
  you can implement them yourself or file an issue in `icefall <https://github.com/k2-fsa/icefall/issues>`_ .

Export Model
------------

`pruned_transducer_stateless4/export.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless4/export.py>`_ supports exporting checkpoints from ``pruned_transducer_stateless4/exp`` in the following ways.

Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checkpoints saved by ``pruned_transducer_stateless4/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after training,
we are interested only in ``model.state_dict()``. You can use the following
command to extract ``model.state_dict()``.

.. code-block:: bash

  # Assume that --epoch 25 --avg 3 produces the smallest WER
  # (You can get such information after running ./pruned_transducer_stateless4/decode.py)

  epoch=25
  avg=3

  ./pruned_transducer_stateless4/export.py \
    --exp-dir ./pruned_transducer_stateless4/exp \
    --streaming-model 1 \
    --causal-convolution 1 \
    --bpe-model data/lang_bpe_500/bpe.model \
    --epoch $epoch \
    --avg $avg

.. caution::

  ``--streaming-model`` and ``--causal-convolution`` must be True to export
  a streaming model.

It will generate a file ``./pruned_transducer_stateless4/exp/pretrained.pt``.

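As a quick sanity check, you can inspect the exported file from Python. We
assume here that ``export.py`` stores the weights under a ``"model"`` key; if
your version differs, inspect the printed keys:

.. code-block:: python

  import torch

  ckpt = torch.load(
      "pruned_transducer_stateless4/exp/pretrained.pt", map_location="cpu"
  )
  print(ckpt.keys())                    # expected to contain "model"
  state_dict = ckpt["model"]
  print(len(state_dict), "parameter tensors")
  # model.load_state_dict(state_dict)   # with a model built as in train.py
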
.. hint::

  To use the generated ``pretrained.pt`` for ``pruned_transducer_stateless4/decode.py``,
  you can run:

  .. code-block:: bash

    cd pruned_transducer_stateless4/exp
    ln -s pretrained.pt epoch-999.pt

  And then pass ``--epoch 999 --avg 1 --use-averaged-model 0`` to
  ``./pruned_transducer_stateless4/decode.py``.

To use the exported model with ``./pruned_transducer_stateless4/pretrained.py``, you
can run:

.. code-block:: bash

  ./pruned_transducer_stateless4/pretrained.py \
    --checkpoint ./pruned_transducer_stateless4/exp/pretrained.pt \
    --simulate-streaming 1 \
    --causal-convolution 1 \
    --bpe-model ./data/lang_bpe_500/bpe.model \
    --method greedy_search \
    /path/to/foo.wav \
    /path/to/bar.wav

Export model using ``torch.jit.script()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  ./pruned_transducer_stateless4/export.py \
    --exp-dir ./pruned_transducer_stateless4/exp \
    --streaming-model 1 \
    --causal-convolution 1 \
    --bpe-model data/lang_bpe_500/bpe.model \
    --epoch 25 \
    --avg 3 \
    --jit 1

.. caution::

  ``--streaming-model`` and ``--causal-convolution`` must be True to export
  a streaming model.

It will generate a file ``cpu_jit.pt`` in the given ``exp_dir``. You can later
load it with ``torch.jit.load("cpu_jit.pt")``.

Note that ``cpu`` in the name ``cpu_jit.pt`` means the parameters, when loaded into Python,
are on the CPU. You can use ``to("cuda")`` to move them to a CUDA device.

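For example, a minimal loading snippet (the file name follows the export step
above) could look like this:

.. code-block:: python

  import torch

  model = torch.jit.load("cpu_jit.pt")  # parameters are loaded onto the CPU
  model.eval()
  if torch.cuda.is_available():
      model = model.to("cuda")          # optionally move them to a CUDA device
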
.. NOTE::

  You will need this ``cpu_jit.pt`` when deploying with the Sherpa framework.

Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:

- `pruned_transducer_stateless <https://huggingface.co/pkufool/icefall_librispeech_streaming_pruned_transducer_stateless_20220625>`_

- `pruned_transducer_stateless2 <https://huggingface.co/pkufool/icefall_librispeech_streaming_pruned_transducer_stateless2_20220625>`_

- `pruned_transducer_stateless4 <https://huggingface.co/pkufool/icefall_librispeech_streaming_pruned_transducer_stateless4_20220625>`_

- `pruned_transducer_stateless5 <https://huggingface.co/pkufool/icefall_librispeech_streaming_pruned_transducer_stateless5_20220729>`_

See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.

Deploy with Sherpa
------------------

Please see `<https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html#>`_
for how to deploy the models in ``sherpa``.