	.. _dummies_tutorial_data_preparation:

	Data Preparation
	================

	After :ref:`dummies_tutorial_environment_setup`, we can start preparing the
	data for training and decoding.

	The first step is to prepare the data for training. We provide
	`prepare.sh <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/prepare.sh>`_,
	which prepares everything required for training.

	.. code-block:: bash

	   cd /tmp/icefall
	   export PYTHONPATH=/tmp/icefall:$PYTHONPATH
	   cd egs/yesno/ASR

	   ./prepare.sh

	Note that in each recipe from `icefall`_, there exists a file ``prepare.sh``,
	which you should run before you run anything else.

	That is all you need for data preparation.

	For the more curious
	--------------------

	If you are wondering how to prepare your own dataset, please refer to the following
	URLs for more details:

	- `<https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes>`_

	  It contains recipes for a variety of datasets. If you want to add your own
	  dataset, please read the recipes in this folder first.

	- `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py>`_

	  The `yesno`_ recipe in `lhotse`_.

	If you already have a `Kaldi`_ dataset directory, which contains files like
	``wav.scp``, ``feats.scp``, then you can refer to `<https://lhotse.readthedocs.io/en/latest/kaldi.html#example>`_.

	A quick look at the generated files
	-----------------------------------

	``./prepare.sh`` puts generated files into two directories:

	- ``download``
	- ``data``

	download
	^^^^^^^^

	The ``download`` directory contains downloaded dataset files:

	.. code-block:: bash

	   tree -L 1 ./download/

	   ./download/
	   |-- waves_yesno
	   `-- waves_yesno.tar.gz

	.. hint::

	   Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py#L41>`_
	   for how the data is downloaded and extracted.

	data
	^^^^

	.. code-block:: bash

	   tree ./data/

	   ./data/
	   |-- fbank
	   |   |-- yesno_cuts_test.jsonl.gz
	   |   |-- yesno_cuts_train.jsonl.gz
	   |   |-- yesno_feats_test.lca
	   |   `-- yesno_feats_train.lca
	   |-- lang_phone
	   |   |-- HLG.pt
	   |   |-- L.pt
	   |   |-- L_disambig.pt
	   |   |-- Linv.pt
	   |   |-- lexicon.txt
	   |   |-- lexicon_disambig.txt
	   |   |-- tokens.txt
	   |   `-- words.txt
	   |-- lm
	   |   |-- G.arpa
	   |   `-- G.fst.txt
	   `-- manifests
	       |-- yesno_recordings_test.jsonl.gz
	       |-- yesno_recordings_train.jsonl.gz
	       |-- yesno_supervisions_test.jsonl.gz
	       `-- yesno_supervisions_train.jsonl.gz

	   4 directories, 18 files
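
	As a quick sanity check after running ``./prepare.sh``, the listing above can
	be turned into a small script. The following is only a sketch: the helper
	``missing_outputs`` and the list ``EXPECTED`` are hypothetical names introduced
	here, and the paths are a subset of those shown in the ``tree`` output.

	.. code-block:: python

	   from pathlib import Path

	   # A subset of the files shown in the tree output above,
	   # relative to egs/yesno/ASR.
	   EXPECTED = [
	       "data/manifests/yesno_recordings_train.jsonl.gz",
	       "data/manifests/yesno_supervisions_train.jsonl.gz",
	       "data/fbank/yesno_cuts_train.jsonl.gz",
	       "data/fbank/yesno_feats_train.lca",
	       "data/lang_phone/HLG.pt",
	       "data/lm/G.arpa",
	   ]


	   def missing_outputs(root="."):
	       """Return the expected files that do not exist under ``root``."""
	       root = Path(root)
	       return [p for p in EXPECTED if not (root / p).is_file()]


	   if __name__ == "__main__":
	       missing = missing_outputs(".")
	       if missing:
	           print("missing:", missing)
	       else:
	           print("all expected files found")

	Run it from ``egs/yesno/ASR``. An empty result means that every file in
	``EXPECTED`` exists.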

	data/manifests:

	This directory contains manifests. They are used to generate files in
	``data/fbank``.

	To give you an idea of what it contains, we examine the first few lines of
	the manifests related to the ``train`` dataset.

	.. code-block:: bash

	   cd data/manifests
	   gunzip -c yesno_recordings_train.jsonl.gz | head -n 3

	The output is given below:

	.. code-block:: bash

	   {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
	   {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
	   {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}

	Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L300>`_
	for the meaning of each field per line.
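
	The fields of a recording are related to each other: in the sample output
	above, ``duration`` equals ``num_samples / sampling_rate``. The sketch below
	checks this with the Python standard library only. It does not use ``lhotse``,
	and the helper names ``head_jsonl_gz`` and ``check_recording`` are made up for
	illustration.

	.. code-block:: python

	   import gzip
	   import json


	   def head_jsonl_gz(path, n=3):
	       """Python counterpart of ``gunzip -c path | head -n 3``."""
	       with gzip.open(path, "rt", encoding="utf-8") as f:
	           return [json.loads(line) for _, line in zip(range(n), f)]


	   def check_recording(rec, tol=1e-6):
	       """Return True if duration matches num_samples / sampling_rate."""
	       expected = rec["num_samples"] / rec["sampling_rate"]
	       return abs(expected - rec["duration"]) < tol


	   # First line of the output above, shortened to the fields used here.
	   line = (
	       '{"id": "0_0_0_0_1_1_1_1", "sampling_rate": 8000, '
	       '"num_samples": 50800, "duration": 6.35, "channel_ids": [0]}'
	   )
	   rec = json.loads(line)
	   print(rec["id"], check_recording(rec))

	With the real data, ``head_jsonl_gz("yesno_recordings_train.jsonl.gz")`` returns
	the same three records as a list of dictionaries.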

	.. code-block:: bash

	   gunzip -c yesno_supervisions_train.jsonl.gz | head -n 3

	The output is given below:

	.. code-block:: bash

	   {"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
	   {"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
	   {"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}

	Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_
	for the meaning of each field per line.
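
	In the three sample lines above, the ``id`` appears to encode the transcript:
	each ``0`` corresponds to ``NO`` and each ``1`` to ``YES``. The following
	sketch checks that observation on those lines. The mapping is inferred from
	the samples shown here, not taken from the ``lhotse`` documentation, and
	``id_to_text`` is a hypothetical helper.

	.. code-block:: python

	   # (id, text) pairs copied from the output above.
	   samples = [
	       ("0_0_0_0_1_1_1_1", "NO NO NO NO YES YES YES YES"),
	       ("0_0_0_1_0_1_1_0", "NO NO NO YES NO YES YES NO"),
	       ("0_0_1_0_0_1_1_0", "NO NO YES NO NO YES YES NO"),
	   ]


	   def id_to_text(utt_id):
	       """Map an id such as "0_1" to "NO YES" (assumed convention)."""
	       words = {"0": "NO", "1": "YES"}
	       return " ".join(words[d] for d in utt_id.split("_"))


	   # Every sample id reproduces its transcript under this convention.
	   for utt_id, text in samples:
	       assert id_to_text(utt_id) == text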

	data/fbank:

	This directory contains the information from ``data/manifests``, embedded in
	the cut manifests, together with the features for training.

	``data/fbank/yesno_feats_train.lca`` contains the features for the train dataset.
	Features are compressed using `lilcom`_.

	``data/fbank/yesno_cuts_train.jsonl.gz`` stores the `CutSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/cut/set.py#L72>`_,
	which combines entries from the `RecordingSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L928>`_,
	`SupervisionSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_,
	and `FeatureSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/base.py#L593>`_.

	To give you an idea about what it looks like, we can run the following command:

	.. code-block:: bash

	   cd data/fbank

	   gunzip -c yesno_cuts_train.jsonl.gz | head -n 3

	The output is given below:

	.. code-block:: bash

	   {"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
	   {"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
	   {"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}

	Note that ``yesno_cuts_train.jsonl.gz`` only stores the information about how to read the features.
	The actual features are stored separately in ``data/fbank/yesno_feats_train.lca``.
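
	The ``features`` entry of each cut is internally consistent in the sample
	output above: ``num_frames`` equals ``duration / frame_shift``, and the first
	number of each ``storage_key`` equals the sum of all numbers in the previous
	key. This is consistent with reading the key as a start offset followed by
	chunk sizes in bytes, but that reading is an assumption made here, not
	something stated on this page. A minimal sketch that checks both observations,
	with the hypothetical helper ``parse_key``:

	.. code-block:: python

	   # (duration, frame_shift, num_frames, storage_key) copied from the output above.
	   cuts = [
	       (6.35, 0.01, 635, "0,13000,3570"),
	       (6.11, 0.01, 611, "16570,12964,2929"),
	       (6.02, 0.01, 602, "32463,12936,2696"),
	   ]


	   def parse_key(key):
	       """Split a storage key into (first number, remaining numbers)."""
	       first, *rest = (int(x) for x in key.split(","))
	       return first, rest


	   next_offset = 0
	   for duration, frame_shift, num_frames, key in cuts:
	       # One frame per frame_shift seconds; round() absorbs float error.
	       assert round(duration / frame_shift) == num_frames
	       offset, sizes = parse_key(key)
	       # Assumed layout: each entry starts where the previous one ended.
	       assert offset == next_offset
	       next_offset = offset + sum(sizes)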

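	data/lang_phone:

	This directory contains the lexicon (``lexicon.txt``), the token and word
	tables (``tokens.txt`` and ``words.txt``), and files derived from them, such
	as ``L.pt`` and ``HLG.pt``.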

	data/lm:

	This directory contains the language model files ``G.arpa`` and ``G.fst.txt``.