	TDNN-CTC
	========

	This page shows you how to run the `yesno <https://www.openslr.org/1>`_ recipe. It contains:

	- (1) Prepare data for training
	- (2) Train a TDNN model

	  - (a) View text format logs and visualize TensorBoard logs
	  - (b) Select device type, i.e., CPU or GPU, for training
	  - (c) Change training options
	  - (d) Resume training from a checkpoint

	- (3) Decode with a trained model

	  - (a) Select a checkpoint for decoding
	  - (b) Model averaging

	- (4) Colab notebook

	  - (a) It shows you step by step how to set up the environment, how to do training,
	    and how to do decoding
	  - (b) How to use a pre-trained model

	- (5) Inference with a pre-trained model

	  - (a) Download a pre-trained model, provided by us
	  - (b) Decode a single sound file with a pre-trained model
	  - (c) Decode multiple sound files at the same time

	It does NOT show you:

	- (1) How to train with multiple GPUs

	  The ``yesno`` dataset is so small that CPU is more than enough
	  for training as well as for decoding.

	- (2) How to use LM rescoring for decoding

	  The dataset does not have an LM for rescoring.

	.. HINT::

	We assume you have read the page :ref:`install icefall` and have set up
	the environment for ``icefall``.

	.. HINT::

	You don't need a GPU to run this recipe. It can be run on a CPU.
	The training part takes less than 30 seconds on a CPU and you will get
	the following WER at the end::

	[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
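
The WER line above can be checked by hand: WER is the number of insertions, deletions, and substitutions divided by the number of reference words. The following is a minimal sketch, not part of ``icefall``; the function name ``parse_wer_line`` and the regular expression are assumptions made for illustration, based only on the log format shown above.

```python
import re

# Pattern for a summary line such as:
#   [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
# The layout is inferred from the single line shown above (an assumption).
WER_LINE = re.compile(
    r"%WER\s+(?P<wer>[\d.]+)%\s+"
    r"\[(?P<errs>\d+)\s*/\s*(?P<total>\d+),\s*"
    r"(?P<ins>\d+)\s+ins,\s*(?P<dele>\d+)\s+del,\s*(?P<sub>\d+)\s+sub\s*\]"
)


def parse_wer_line(line):
    """Return the error counts in the line and the WER recomputed from them."""
    m = WER_LINE.search(line)
    if m is None:
        raise ValueError(f"not a WER summary line: {line!r}")
    ins, dele, sub = int(m["ins"]), int(m["dele"]), int(m["sub"])
    total = int(m["total"])
    return {
        "ins": ins,
        "del": dele,
        "sub": sub,
        "ref_words": total,
        "wer_percent": 100.0 * (ins + dele + sub) / total,
    }


stats = parse_wer_line("[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]")
print(f"{stats['wer_percent']:.2f}%")  # 1 error out of 240 reference words
```

Here a single deletion among 240 reference words gives 1 / 240, which rounds to the 0.42% reported in the log.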

	Data preparation
	----------------

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ ./prepare.sh

	The script ``./prepare.sh`` handles the data preparation for you, automagically.
	All you need to do is to run it.

	The data preparation contains several stages. You can use the following two
	options:

	- ``--stage``
	- ``--stop-stage``

	to control which stage(s) should be run. By default, all stages are executed.


	For example,

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ ./prepare.sh --stage 0 --stop-stage 0

	means to run only stage 0.

	To run stage 2 to stage 5, use:

	.. code-block:: bash

	$ ./prepare.sh --stage 2 --stop-stage 5
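
The effect of the two options can be pictured as a range filter over stage numbers. The sketch below is illustrative Python, not the logic of ``prepare.sh`` itself (which is a shell script); the function name ``stages_to_run`` and the total of 6 stages are hypothetical. It assumes a stage runs when its number lies between ``--stage`` and ``--stop-stage``, inclusive, as the examples above suggest.

```python
def stages_to_run(stage, stop_stage, num_stages):
    """Return the stage numbers selected by --stage/--stop-stage (inclusive)."""
    return [s for s in range(num_stages) if stage <= s <= stop_stage]


NUM_STAGES = 6  # hypothetical; check prepare.sh for the real number of stages

print(stages_to_run(0, 0, NUM_STAGES))  # only stage 0
print(stages_to_run(2, 5, NUM_STAGES))  # stages 2 through 5
```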


	Training
	--------

	We provide only a TDNN model, contained in
	the `tdnn <https://github.com/k2-fsa/icefall/tree/master/egs/yesno/ASR/tdnn>`_
	folder, for ``yesno``.

	The command to run the training part is:

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ export CUDA_VISIBLE_DEVICES=""
	$ ./tdnn/train.py

	By default, it will run ``15`` epochs. Training logs and checkpoints are saved
	in ``tdnn/exp``.

	In ``tdnn/exp``, you will find the following files:

	- ``epoch-0.pt``, ``epoch-1.pt``, ...

	These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
	To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

	.. code-block:: bash

	$ ./tdnn/train.py --start-epoch 11

	- ``tensorboard/``

	This folder contains TensorBoard logs. Training loss, validation loss, learning
	rate, etc., are recorded in these logs. You can visualize them with:

	.. code-block:: bash

	$ cd tdnn/exp/tensorboard
	$ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"

	It will print something like below:

	.. code-block::

	TensorFlow installation not found - running with reduced feature set.
	Upload started and will continue reading any new data as it's added to the logdir.

	To stop uploading, press Ctrl-C.

	New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

	[2021-08-23T23:49:41] Started scanning logdir.
	[2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
	Listening for new data in logdir...

	Note that there is a URL in the above output. Click it and you will see
	the following screenshot:

	.. figure:: images/tdnn-tensorboard-log.png
	:width: 600
	:alt: TensorBoard screenshot
	:align: center
	:target: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

	TensorBoard screenshot.

	- ``log/log-train-xxxx``

	It is the detailed training log in text format, same as the one
	you saw printed to the console during training.
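
The resume example above pairs ``epoch-10.pt`` with ``--start-epoch 11``, i.e., the checkpoint to load is the one saved at the end of the previous epoch. A minimal sketch of that mapping follows; the helper ``checkpoint_to_resume`` is hypothetical and not part of ``tdnn/train.py``, and treating ``--start-epoch 0`` as training from scratch is an assumption.

```python
from pathlib import Path


def checkpoint_to_resume(start_epoch, exp_dir="tdnn/exp"):
    """Return the checkpoint that --start-epoch implies, or None for a fresh run."""
    if start_epoch <= 0:
        return None  # assumption: nothing to load when starting at epoch 0
    return Path(exp_dir) / f"epoch-{start_epoch - 1}.pt"


print(checkpoint_to_resume(11))  # the file written at the end of epoch 10
print(checkpoint_to_resume(0))   # no checkpoint is needed
```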



	.. NOTE::

	By default, ``./tdnn/train.py`` uses GPU 0 for training if GPUs are available.
	If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for
	training, you can run:

	.. code-block:: bash

	$ export CUDA_VISIBLE_DEVICES="1"
	$ ./tdnn/train.py

	Since the ``yesno`` dataset is very small, containing only 30 sound files
	for training, and the model in use is also very small, we use:

	.. code-block:: bash

	$ export CUDA_VISIBLE_DEVICES=""

	so that ``./tdnn/train.py`` uses CPU during training.

	If you don't have GPUs, then you don't need to
	run ``export CUDA_VISIBLE_DEVICES=""``.
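
To make the role of ``CUDA_VISIBLE_DEVICES`` concrete, the sketch below interprets the variable the way the note above describes: unset means all GPUs are visible, an empty string hides every GPU, and ``"1"`` exposes only GPU 1. It is plain Python for illustration; the function ``visible_gpus`` is hypothetical, and the actual filtering is done by the CUDA runtime, not by ``icefall``.

```python
import os


def visible_gpus(env=None):
    """Describe which GPUs a process would see for a given environment.

    Returns None when the variable is unset (no restriction), otherwise the
    list of device ids it names; an empty list means CPU-only.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [d.strip() for d in value.split(",") if d.strip()]


print(visible_gpus({}))                             # None: all GPUs visible
print(visible_gpus({"CUDA_VISIBLE_DEVICES": ""}))   # []: GPUs hidden, CPU is used
print(visible_gpus({"CUDA_VISIBLE_DEVICES": "1"}))  # ['1']: only GPU 1
```

When only GPU 1 is exposed, the process sees it renumbered as device 0, which is why the default of using GPU 0 still picks it up.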

	To see available training options, you can use:

	.. code-block:: bash

	$ ./tdnn/train.py --help

	Other training options, e.g., learning rate, results dir, etc., are
	pre-configured in the function ``get_params()``
	in `tdnn/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/tdnn/train.py>`_.
	Normally, you don't need to change them. You can change them by modifying the code, if
	you want.

	Decoding
	--------

	The decoding part uses checkpoints saved by the training part, so you have
	to run the training part first.

	The command for decoding is:

	.. code-block:: bash

	$ export CUDA_VISIBLE_DEVICES=""
	$ ./tdnn/decode.py

	You will see the WER in the output log.

	Decoded results are saved in ``tdnn/exp``.

	.. code-block:: bash

	$ ./tdnn/decode.py --help

	shows you the available decoding options.

	Some commonly used options are:

	- ``--epoch``

	It selects which checkpoint to use for decoding.
	For instance, ``./tdnn/decode.py --epoch 10`` means to use
	``./tdnn/exp/epoch-10.pt`` for decoding.

	- ``--avg``

	It's related to model averaging. It specifies the number of checkpoints
	to be averaged. The averaged model is used for decoding.
	For example, the following command:

	.. code-block:: bash

	$ ./tdnn/decode.py --epoch 10 --avg 3

	uses the average of ``epoch-8.pt``, ``epoch-9.pt`` and ``epoch-10.pt``
	for decoding.

	- ``--export``

	If it is ``True``, i.e., ``./tdnn/decode.py --export 1``, the code
	will save the averaged model to ``tdnn/exp/pretrained.pt``.
	See :ref:`yesno use a pre-trained model` for how to use it.
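
Model averaging as described for ``--avg`` can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation in ``icefall``: the helper names are hypothetical, parameters are plain Python floats instead of tensors, and the average is an unweighted mean over the selected checkpoints.

```python
def checkpoints_to_average(epoch, avg, exp_dir="tdnn/exp"):
    """List the `avg` checkpoint files ending at `epoch` (inclusive)."""
    start = epoch - avg + 1
    if avg < 1 or start < 0:
        raise ValueError("need 1 <= avg <= epoch + 1")
    return [f"{exp_dir}/epoch-{i}.pt" for i in range(start, epoch + 1)]


def average_state_dicts(state_dicts):
    """Element-wise mean of several parameter dicts sharing the same keys."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


# --epoch 10 --avg 3 selects epoch-8.pt, epoch-9.pt and epoch-10.pt.
print(checkpoints_to_average(epoch=10, avg=3))

# Toy "models" with two scalar parameters each, standing in for real weights.
toy = [{"w": 1.0, "b": 0.0}, {"w": 2.0, "b": 3.0}, {"w": 3.0, "b": 6.0}]
print(average_state_dicts(toy))
```

With real checkpoints, the same mean would be taken over each tensor in the model ``state_dict``.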


	.. _yesno use a pre-trained model:

	Pre-trained Model
	-----------------

	We have uploaded the pre-trained model to
	`<https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn>`_.

	The following shows you how to use the pre-trained model.

	Download the pre-trained model
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ mkdir tmp
	$ cd tmp
	$ git lfs install
	$ git clone https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn

	.. CAUTION::

	You have to use ``git lfs`` to download the pre-trained model.

	After downloading, you will have the following files:

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ tree tmp

	.. code-block:: bash

	tmp/
	`-- icefall_asr_yesno_tdnn
	    |-- README.md
	    |-- lang_phone
	    |   |-- HLG.pt
	    |   |-- L.pt
	    |   |-- L_disambig.pt
	    |   |-- Linv.pt
	    |   |-- lexicon.txt
	    |   |-- lexicon_disambig.txt
	    |   |-- tokens.txt
	    |   `-- words.txt
	    |-- lm
	    |   |-- G.arpa
	    |   `-- G.fst.txt
	    |-- pretrained.pt
	    `-- test_waves
	        |-- 0_0_0_1_0_0_0_1.wav
	        |-- 0_0_1_0_0_0_1_0.wav
	        |-- 0_0_1_0_0_1_1_1.wav
	        |-- 0_0_1_0_1_0_0_1.wav
	        |-- 0_0_1_1_0_0_0_1.wav
	        |-- 0_0_1_1_0_1_1_0.wav
	        |-- 0_0_1_1_1_0_0_0.wav
	        |-- 0_0_1_1_1_1_0_0.wav
	        |-- 0_1_0_0_0_1_0_0.wav
	        |-- 0_1_0_0_1_0_1_0.wav
	        |-- 0_1_0_1_0_0_0_0.wav
	        |-- 0_1_0_1_1_1_0_0.wav
	        |-- 0_1_1_0_0_1_1_1.wav
	        |-- 0_1_1_1_0_0_1_0.wav
	        |-- 0_1_1_1_1_0_1_0.wav
	        |-- 1_0_0_0_0_0_0_0.wav
	        |-- 1_0_0_0_0_0_1_1.wav
	        |-- 1_0_0_1_0_1_1_1.wav
	        |-- 1_0_1_1_0_1_1_1.wav
	        |-- 1_0_1_1_1_1_0_1.wav
	        |-- 1_1_0_0_0_1_1_1.wav
	        |-- 1_1_0_0_1_0_1_1.wav
	        |-- 1_1_0_1_0_1_0_0.wav
	        |-- 1_1_0_1_1_0_0_1.wav
	        |-- 1_1_0_1_1_1_1_0.wav
	        |-- 1_1_1_0_0_1_0_1.wav
	        |-- 1_1_1_0_1_0_1_0.wav
	        |-- 1_1_1_1_0_0_1_0.wav
	        |-- 1_1_1_1_1_0_0_0.wav
	        `-- 1_1_1_1_1_1_1_1.wav

	4 directories, 42 files

	.. code-block:: bash

	$ soxi tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

	Input File : 'tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav'
	Channels : 1
	Sample Rate : 8000
	Precision : 16-bit
	Duration : 00:00:06.76 = 54080 samples ~ 507 CDDA sectors
	File Size : 108k
	Bit Rate : 128k
	Sample Encoding: 16-bit Signed Integer PCM
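
The same properties can be read without ``soxi``. The following sketch uses only Python's standard ``wave`` module; the helper ``describe_wav`` is hypothetical, and the demo writes a short silent file with the same format (mono, 8 kHz, 16-bit) so that it runs without the downloaded data.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return channel count, sample rate, bit depth, and duration of a WAV file."""
    with wave.open(path, "rb") as f:
        frames = f.getnframes()
        rate = f.getframerate()
        return {
            "channels": f.getnchannels(),
            "sample_rate": rate,
            "bits": 8 * f.getsampwidth(),
            "num_samples": frames,
            "duration_s": frames / rate,
        }


# Demo: one second of silence, mono, 8 kHz, 16-bit, like the test waves.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(8000)
    f.writeframes(b"\x00\x00" * 8000)

info = describe_wav(demo_path)
print(info)
```

For the file shown above, 54080 samples at 8000 Hz gives 54080 / 8000 = 6.76 seconds, matching the ``soxi`` output.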

	- ``0_0_1_0_1_0_0_1.wav``

	0 means No; 1 means Yes. No and Yes are not in English,
	but in `Hebrew <https://en.wikipedia.org/wiki/Hebrew_language>`_.
	So this file contains ``NO NO YES NO YES NO NO YES``.
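
Because the label is encoded in the file name, the reference transcript can be recovered with a few lines of Python. This is a small illustrative helper; the name ``transcript_from_filename`` is hypothetical and not part of the recipe.

```python
from pathlib import Path

# 0 means No and 1 means Yes, as described above.
LABELS = {"0": "NO", "1": "YES"}


def transcript_from_filename(path):
    """Map a yesno file name such as 0_0_1_0_1_0_0_1.wav to its word sequence."""
    digits = Path(path).stem.split("_")
    return " ".join(LABELS[d] for d in digits)


print(transcript_from_filename("test_waves/0_0_1_0_1_0_0_1.wav"))
# NO NO YES NO YES NO NO YES
```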

	Download kaldifeat
	~~~~~~~~~~~~~~~~~~

	`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used for extracting
	features from a single or multiple sound files. Please refer to
	`<https://github.com/csukuangfj/kaldifeat>`_ to install ``kaldifeat`` first.

	Inference with a pre-trained model
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	.. code-block:: bash

	$ cd egs/yesno/ASR
	$ ./tdnn/pretrained.py --help

	shows the usage information of ``./tdnn/pretrained.py``.

	To decode a single file, we can use:

	.. code-block:: bash

	./tdnn/pretrained.py \
	--checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
	--words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
	--HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
	./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

	The output is:

	.. code-block::

	2021-08-24 12:22:51,621 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']}
	2021-08-24 12:22:51,645 INFO [pretrained.py:125] device: cpu
	2021-08-24 12:22:51,645 INFO [pretrained.py:127] Creating model
	2021-08-24 12:22:51,650 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
	2021-08-24 12:22:51,651 INFO [pretrained.py:143] Constructing Fbank computer
	2021-08-24 12:22:51,652 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']
	2021-08-24 12:22:51,684 INFO [pretrained.py:159] Decoding started
	2021-08-24 12:22:51,708 INFO [pretrained.py:198]
	./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
	NO NO YES NO YES NO NO YES


	2021-08-24 12:22:51,708 INFO [pretrained.py:200] Decoding Done

	You can see that for the sound file ``0_0_1_0_1_0_0_1.wav``, the decoding result is
	``NO NO YES NO YES NO NO YES``.

	To decode multiple files at the same time, you can use:

	.. code-block:: bash

	./tdnn/pretrained.py \
	--checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
	--words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
	--HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
	./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav \
	./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav

	The decoding output is:

	.. code-block::

	2021-08-24 12:25:20,159 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']}
	2021-08-24 12:25:20,181 INFO [pretrained.py:125] device: cpu
	2021-08-24 12:25:20,181 INFO [pretrained.py:127] Creating model
	2021-08-24 12:25:20,185 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
	2021-08-24 12:25:20,186 INFO [pretrained.py:143] Constructing Fbank computer
	2021-08-24 12:25:20,187 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav',
	'./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']
	2021-08-24 12:25:20,213 INFO [pretrained.py:159] Decoding started
	2021-08-24 12:25:20,287 INFO [pretrained.py:198]
	./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
	NO NO YES NO YES NO NO YES

	./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav:
	YES NO YES YES NO YES YES YES

	2021-08-24 12:25:20,287 INFO [pretrained.py:200] Decoding Done

	You can see again that it decodes correctly.
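
When there are many files, it can be convenient to build the command in Python. The sketch below only assembles the argument list from the options shown above (``--checkpoint``, ``--words-file``, ``--HLG``) and a list of sound files; the helper ``build_pretrained_cmd`` is hypothetical. Passing the result to ``subprocess.run`` from ``egs/yesno/ASR`` would then launch the script.

```python
def build_pretrained_cmd(model_dir, sound_files):
    """Assemble the ./tdnn/pretrained.py command line for the given wave files."""
    return [
        "./tdnn/pretrained.py",
        "--checkpoint", f"{model_dir}/pretrained.pt",
        "--words-file", f"{model_dir}/lang_phone/words.txt",
        "--HLG", f"{model_dir}/lang_phone/HLG.pt",
        *sound_files,
    ]


model_dir = "./tmp/icefall_asr_yesno_tdnn"
cmd = build_pretrained_cmd(
    model_dir,
    [
        f"{model_dir}/test_waves/0_0_1_0_1_0_0_1.wav",
        f"{model_dir}/test_waves/1_0_1_1_0_1_1_1.wav",
    ],
)
print(" ".join(cmd))
# To run it (from egs/yesno/ASR): subprocess.run(cmd, check=True)
```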

	Colab notebook
	--------------

	We provide a Colab notebook for this recipe.

	|yesno colab notebook|

	.. |yesno colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
	:target: https://colab.research.google.com/drive/1tIjjzaJc3IvGyKiMCDWO-TSnBgkcuN3B?usp=sharing


	Congratulations! You have finished the simplest speech recognition recipe in ``icefall``.