| .. _punctuation_and_capitalization_lexical_audio: |
|
|
| Punctuation and Capitalization Lexical Audio Model |
| ================================================== |
|
|
| Sometimes punctuation and capitalization cannot be restored based only on text. In this case we can use audio to improve model's accuracy. |
|
|
| Like in these examples: |
|
|
| .. code:: |
| |
| Oh yeah? or Oh yeah. |
|
|
| We need to go? or We need to go. |
|
|
| Yeah, they make you work. Yeah, over there you walk a lot? or Yeah, they make you work. Yeah, over there you walk a lot. |
|
|
| You can find more details on text only punctuation and capitalization in `Punctuation And Capitalization's page <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html>`_. In this document, we focus on model changes needed to use acoustic features. |
|
|
| Quick Start Guide |
| ----------------- |
|
|
| .. code-block:: python |
|
|
| from nemo.collections.nlp.models import PunctuationCapitalizationLexicalAudioModel |
|
|
| # to get the list of pre-trained models |
| PunctuationCapitalizationLexicalAudioModel.list_available_models() |
|
|
| # Download and load the pre-trained model |
| model = PunctuationCapitalizationLexicalAudioModel.from_pretrained("<PATH to .nemo file>") |
|
|
| # try the model on a few examples |
| model.add_punctuation_capitalization(['how are you', 'great how about you'], audio_queries=['/path/to/1.wav', '/path/to/2.wav'], target_sr=16000) |
|
|
| Model Description |
| ----------------- |
| In addition to `Punctuation And Capitalization model <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html>`_ we add audio encoder (e.g. Conformer's encoder) and attention based fusion of lexical and audio features. |
| This model architecture is based on `Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech <https://arxiv.org/pdf/2008.00702.pdf>`__ :cite:`nlp-punct-sunkara20_interspeech`. |
|
|
| .. note:: |
|
|
| An example script on how to train and evaluate the model can be found at: `NeMo/examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py>`__. |
|
|
| The default configuration file for the model can be found at: `NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml>`__. |
|
|
| The script for inference can be found at: `NeMo/examples/nlp/token_classification/punctuate_capitalize_infer.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>`__. |
|
|
| .. _raw_data_format_punct: |
|
|
| Raw Data Format |
| --------------- |
| In addition to `Punctuation And Capitalization Raw Data Format <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html#raw-data-format>`_ this model also requires audio data. |
| You have to provide ``audio_train.txt`` and ``audio_dev.txt`` (and optionally ``audio_test.txt``) which contain one valid path to audio per row. |
|
|
| Example of the ``audio_train.txt``/``audio_dev.txt`` file: |
|
|
| .. code:: |
|
|
| /path/to/1.wav |
| /path/to/2.wav |
| .... |
|
|
| In this case ``source_data_dir`` structure should look similar to the following: |
|
|
| .. code:: |
|
|
| . |
| |--sourced_data_dir |
| |-- dev.txt |
| |-- train.txt |
| |-- audio_train.txt |
| |-- audio_dev.txt |
|
|
| .. _nemo-data-format-label: |
|
|
| Tarred dataset |
| -------------- |
|
|
| It is recommended to use tarred dataset for training with large amount of data (>500 hours) due to large amount of RAM consumed by loading whole audio data into memory and CPU usage. |
|
|
| For creating of tarred dataset with audio you will need data in NeMo format: |
|
|
| .. code:: |
|
|
| python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \ |
| --text <PATH/TO/LOWERCASED/TEXT/WITHOUT/PUNCTUATION> \ |
| --labels <PATH/TO/LABELS/IN/NEMO/FORMAT> \ |
| --output_dir <PATH/TO/DIRECTORY/WITH/OUTPUT/TARRED/DATASET> \ |
| --num_batches_per_tarfile 100 \ |
| --use_audio \ |
| --audio_file <PATH/TO/AUDIO/PATHS/FILE> \ |
| --sample_rate 16000 |
|
|
| .. note:: |
| You can change sample rate to any positive integer. It will be used in constructor of :class:`~nemo.collections.asr.parts.preprocessing.AudioSegment`. It is recomended to set ``sample_rate`` to the same value as data which was used during training of ASR model. |
|
|
|
|
| Training Punctuation and Capitalization Model |
| --------------------------------------------- |
|
|
| The audio encoder is initialized with pretrained ASR model. You can use any of ``list_available_models()`` of ``EncDecCTCModel`` or your own checkpoints, either one should be provided in ``model.audio_encoder.pretrained_model``. |
| You can freeze audio encoder during training and add additional ``ConformerLayer`` on top of encoder to reduce compute with ``model.audio_encoder.freeze``. You can also add `Adapters <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/core/adapters/components.html>`_ to reduce compute with ``model.audio_encoder.adapter``. Parameters of fusion module are stored in ``model.audio_encoder.fusion``. |
| An example of a model configuration file for training the model can be found at: |
| `NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml>`__. |
|
|
| Configs |
| ^^^^^^^^^^^^ |
| .. note:: |
| This page contains only parameters specific to lexical and audio model. Others parameters can be found in `Punctuation And Capitalization's page <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html>`_. |
|
|
| Model config |
| ^^^^^^^^^^^^ |
|
|
| A configuration of |
| :class:`~nemo.collections.nlp.models.token_classification.punctuation_capitalization_lexical_audio_model.PunctuationCapitalizationLexicalAudioModel` |
| model. |
|
|
| .. list-table:: Model config |
| :widths: 5 5 10 25 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **audio_encoder** |
| - :ref:`audio encoder config<audio-encoder-config-label>` |
| - :ref:`audio encoder config<audio-encoder-config-label>` |
| - A configuration for audio encoder. |
|
|
|
|
| Data config |
| ^^^^^^^^^^^ |
|
|
| .. list-table:: Location of data configs in parent configs |
| :widths: 5 5 |
| :header-rows: 1 |
|
|
| * - **Parent config** |
| - **Keys in parent config** |
| * - :ref:`Run config<run-config-label>` |
| - ``model.train_ds``, ``model.validation_ds``, ``model.test_ds`` |
| * - :ref:`Model config<model-config-label>` |
| - ``train_ds``, ``validation_ds``, ``test_ds`` |
|
|
| .. _regular-dataset-parameters-label: |
|
|
| .. list-table:: Parameters for regular (:class:`~nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset`) dataset |
| :widths: 5 5 5 30 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **use_audio** |
| - bool |
| - ``false`` |
| - If set to ``true`` dataset will return audio as well as text. |
| * - **audio_file** |
| - string |
| - ``null`` |
| - A path to file with audio paths. |
| * - **sample_rate** |
| - int |
| - ``null`` |
| - Target sample rate of audios. Can be used for up sampling or down sampling of audio. |
| * - **use_bucketing** |
| - bool |
| - ``true`` |
| - If set to True will sort samples based on their audio length and assamble batches more efficently (less padding in batch). If set to False dataset will return ``batch_size`` batches instead of ``number_of_tokens`` tokens. |
| * - **preload_audios** |
| - bool |
| - ``true`` |
| - If set to True batches will include waveforms, if set to False will store audio_filepaths instead and load audios during ``collate_fn`` call. |
| |
|
|
| .. _audio-encoder-config-label: |
|
|
| Audio Encoder config |
| ^^^^^^^^^^^^^^^^^^^^ |
|
|
| .. list-table:: Audio Encoder Config |
| :widths: 5 5 10 25 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **pretrained_model** |
| - string |
| - ``stt_en_conformer_ctc_medium`` |
| - Pretrained model name or path to ``.nemo``` file to take audio encoder from. |
| * - **freeze** |
| - :ref:`freeze config<freeze-config-label>` |
| - :ref:`freeze config<freeze-config-label>` |
| - Configuration for freezing audio encoder's weights. |
| * - **adapter** |
| - :ref:`adapter config<adapter-config-label>` |
| - :ref:`adapter config<adapter-config-label>` |
| - Configuration for adapter. |
| * - **fusion** |
| - :ref:`fusion config<fusion-config-label>` |
| - :ref:`fusion config<fusion-config-label>` |
| - Configuration for fusion. |
|
|
|
|
| .. _freeze-config-label: |
|
|
| .. list-table:: Freeze Config |
| :widths: 5 5 10 25 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **is_enabled** |
| - bool |
| - ``false`` |
| - If set to ``true`` encoder's weights will not be updated during training and aditional ``ConformerLayer`` layers will be added. |
| * - **d_model** |
| - int |
| - ``256`` |
| - Input dimension of ``MultiheadAttentionMechanism`` and ``PositionwiseFeedForward`` of additional ``ConformerLayer`` layers. |
| * - **d_ff** |
| - int |
| - ``1024`` |
| - Hidden dimension of ``PositionwiseFeedForward`` of additional ``ConformerLayer`` layers. |
| * - **num_layers** |
| - int |
| - ``4`` |
| - Number of additional ``ConformerLayer`` layers. |
|
|
|
|
| .. _adapter-config-label: |
|
|
| .. list-table:: Adapter Config |
| :widths: 5 5 10 25 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **enable** |
| - bool |
| - ``false`` |
| - If set to ``true`` will enable adapters for audio encoder. |
| * - **config** |
| - ``LinearAdapterConfig`` |
| - ``null`` |
| - For more details see `nemo.collections.common.parts.LinearAdapterConfig <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/common/parts/adapter_modules.py#L141>`_ class. |
|
|
|
|
| .. _fusion-config-label: |
|
|
| .. list-table:: Fusion Config |
| :widths: 5 5 10 25 |
| :header-rows: 1 |
|
|
| * - **Parameter** |
| - **Data type** |
| - **Default value** |
| - **Description** |
| * - **num_layers** |
| - int |
| - ``4`` |
| - Number of layers to use in fusion. |
| * - **num_attention_heads** |
| - int |
| - ``4`` |
| - Number of attention heads to use in fusion. |
| * - **inner_size** |
| - int |
| - ``2048`` |
| - Fusion inner size. |
|
|
|
|
|
|
| Model training |
| ^^^^^^^^^^^^^^ |
|
|
| For more information, refer to the :ref:`nlp_model` section. |
|
|
| To train the model from scratch, run: |
|
|
| .. code:: |
|
|
| python examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py \ |
| model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \ |
| model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \ |
| model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \ |
| model.validation_ds.ds_item=<PATH/TO/DEV/DATA_DIR> \ |
| model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \ |
| model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \ |
| trainer.devices=[0,1] \ |
| trainer.accelerator='gpu' \ |
| optim.name=adam \ |
| optim.lr=0.0001 \ |
| model.train_ds.audio_file=<NAME_OF_TRAIN_AUDIO_FILE> \ |
| model.validation_ds.audio_file=<NAME_OF_DEV_AUDIO_FILE> |
|
|
| The above command will start model training on GPUs 0 and 1 with Adam optimizer and learning rate of 0.0001; and the |
| trained model is stored in the ``nemo_experiments/Punctuation_and_Capitalization`` folder. |
|
|
| To train from the pre-trained model, run: |
|
|
| .. code:: |
|
|
| python examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py \ |
| model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \ |
| model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \ |
| model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \ |
| model.validation_ds.ds_item=<PATH/TO/DEV/DATA/DIR> \ |
| model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \ |
| model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \ |
| model.train_ds.audio_file=<NAME_OF_TRAIN_AUDIO_FILE> \ |
| model.validation_ds.audio_file=<NAME_OF_DEV_AUDIO_FILE> \ |
| pretrained_model=<PATH/TO/SAVE/.nemo> |
|
|
|
|
| .. note:: |
|
|
| All parameters defined in the configuration file can be changed with command arguments. For example, the sample |
| config file mentioned above has :code:`train_ds.tokens_in_batch` set to ``2048``. However, if you see that |
| the GPU utilization can be optimized further by using a larger batch size, you may override to the desired value |
| by adding the field :code:`train_ds.tokens_in_batch=4096` over the command-line. You can repeat this with |
| any of the parameters defined in the sample configuration file. |
|
|
| Inference |
| --------- |
|
|
| Inference is performed by a script `examples/nlp/token_classification/punctuate_capitalize_infer.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuate_capitalize_infer.py>`_ |
|
|
| .. code:: |
|
|
| python punctuate_capitalize_infer.py \ |
| --input_manifest <PATH/TO/INPUT/MANIFEST> \ |
| --output_manifest <PATH/TO/OUTPUT/MANIFEST> \ |
| --pretrained_name <PATH to .nemo file> \ |
| --max_seq_length 64 \ |
| --margin 16 \ |
| --step 8 \ |
| --use_audio |
|
|
| Long audios are split just like in text only case, audio sequences treated the same as text seqences except :code:`max_seq_length` for audio equals :code:`max_seq_length*4000`. |
|
|
| Model Evaluation |
| ---------------- |
|
|
| Model evaluation is performed by the same script |
| `examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py |
| <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py>`_ |
| as training. |
|
|
| Use :ref`config<run-config-lab>` parameter ``do_training=false`` to disable training and parameter ``do_testing=true`` |
| to enable testing. If both parameters ``do_training`` and ``do_testing`` are ``true``, then model is trained and then |
| tested. |
|
|
| To start evaluation of the pre-trained model, run: |
|
|
| .. code:: |
|
|
| python punctuation_capitalization_lexical_audio_train_evaluate.py \ |
| +model.do_training=false \ |
| +model.to_testing=true \ |
| model.test_ds.ds_item=<PATH/TO/TEST/DATA/DIR> \ |
| pretrained_model=<PATH to .nemo file> \ |
| model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \ |
| model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE> \ |
| model.test_ds.audio_file=<NAME_OF_TEST_AUDIO_FILE> |
|
|
|
|
| Required Arguments |
| ^^^^^^^^^^^^^^^^^^ |
|
|
| - :code:`pretrained_model`: pretrained Punctuation and Capitalization Lexical Audio model from ``list_available_models()`` or path to a ``.nemo`` |
| file. For example: ``your_model.nemo``. |
| - :code:`model.test_ds.ds_item`: path to the directory that contains :code:`model.test_ds.text_file`, :code:`model.test_ds.labels_file` and :code:`model.test_ds.audio_file` |
|
|
| References |
| ---------- |
|
|
| .. bibliography:: nlp_all.bib |
| :style: plain |
| :labelprefix: NLP-PUNCT |
| :keyprefix: nlp-punct- |
|
|
|
|