Checkpoints
===========

In this section, we present four key functionalities of NVIDIA NeMo related to checkpoint management:

1. **Checkpoint Loading**: Use the :code:`restore_from()` method to load local ``.nemo`` checkpoint files.
2. **Partial Checkpoint Conversion**: Convert partially-trained ``.ckpt`` checkpoints to the ``.nemo`` format.
3. **Community Checkpoint Conversion**: Convert checkpoints from community sources, such as HuggingFace, to the ``.nemo`` format.
4. **Model Parallelism Adjustment**: Adjusting model parallelism is crucial when training models that exceed the memory capacity of a single GPU, such as NVGPT 5B, LLaMA2 7B, or larger models. NeMo incorporates both tensor (intra-layer) and pipeline (inter-layer) model parallelism. For a deeper understanding, refer to "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (`link <https://arxiv.org/pdf/2104.04473.pdf>`__). This tool assists in modifying model parallelism; after downloading a community checkpoint and converting it to the ``.nemo`` format, this adjustment may become essential if a user wishes to fine-tune it further.

Understanding Checkpoint Formats
--------------------------------

A ``.nemo`` checkpoint is fundamentally a tar file that bundles the model configuration (given as a YAML file), model weights, and other pertinent artifacts, such as tokenizer models or vocabulary files. This consolidated design streamlines sharing, loading, tuning, evaluating, and inference.

In contrast, a ``.ckpt`` file, created during PyTorch Lightning training, contains the model weights and optimizer states, and is used to resume training from a pause.

The subsequent sections provide instructions for the functionalities above, specifically tailored to deploying fully trained checkpoints for evaluation or additional fine-tuning.

Loading Local Checkpoints
-------------------------

By default, NeMo saves checkpoints of trained models in the ``.nemo`` format. To save a model manually during training, use:

.. code-block:: python

   model.save_to("<checkpoint_path>.nemo")

To load a local ``.nemo`` checkpoint:

.. code-block:: python

   import nemo.collections.multimodal as nemo_multimodal

   model = nemo_multimodal.models.<model_base_class>.restore_from(
       restore_path="<path/to/checkpoint/file.nemo>"
   )

Replace ``<model_base_class>`` with the appropriate MM model class.
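For example, here is a minimal sketch of loading a ViT classification checkpoint with the model class referenced later in this section. The path is a placeholder, and Megatron-based model classes may additionally expect a ``trainer`` argument when restoring:

.. code-block:: python

   from nemo.collections.vision.models.megatron_vit_classification_models import (
       MegatronVitClassificationModel,
   )

   # Placeholder path; point this at your own .nemo checkpoint file.
   model = MegatronVitClassificationModel.restore_from(
       restore_path="/path/to/megatron_vit_classification.nemo"
   )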
Converting Local Checkpoints
----------------------------

Only the last checkpoint is automatically saved in the ``.nemo`` format. To evaluate intermediate training checkpoints, a conversion to ``.nemo`` may be necessary. For this, refer to the script ``examples/multimodal/convert_ckpt_to_nemo.py``:

.. code-block:: bash

   python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size> * <pipeline_model_parallel_size> \
       examples/multimodal/convert_ckpt_to_nemo.py \
       --checkpoint_folder <path_to_PTL_checkpoints_folder> \
       --checkpoint_name <checkpoint_name> \
       --nemo_file_path <path_to_output_nemo_file> \
       --tensor_model_parallel_size <tensor_model_parallel_size> \
       --pipeline_model_parallel_size <pipeline_model_parallel_size>

Converting Community Checkpoints
--------------------------------

There is currently no support for converting community checkpoints to NeMo ViT.

Model Parallelism Adjustment
----------------------------

ViT Checkpoints
^^^^^^^^^^^^^^^

To adjust model parallelism from the original model parallelism size to a new model parallelism size (note: NeMo ViT currently only supports ``pipeline_model_parallel_size=1``):

.. code-block:: bash

   python examples/nlp/language_modeling/megatron_change_num_partitions.py \
       --model_file=/path/to/source.nemo \
       --target_file=/path/to/target.nemo \
       --tensor_model_parallel_size=??? \
       --target_tensor_model_parallel_size=??? \
       --pipeline_model_parallel_size=-1 \
       --target_pipeline_model_parallel_size=1 \
       --precision=32 \
       --model_class="nemo.collections.vision.models.megatron_vit_classification_models.MegatronVitClassificationModel" \
       --tp_conversion_only
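As a concrete illustration, the following hypothetical invocation merges a tensor-parallel-size-2 ViT checkpoint into a single tensor-parallel-size-1 checkpoint, e.g. for single-GPU fine-tuning. The file paths are placeholders:

.. code-block:: bash

   # Illustrative only: re-partition a TP=2 ViT checkpoint to TP=1.
   python examples/nlp/language_modeling/megatron_change_num_partitions.py \
       --model_file=/checkpoints/vit_classification_tp2.nemo \
       --target_file=/checkpoints/vit_classification_tp1.nemo \
       --tensor_model_parallel_size=2 \
       --target_tensor_model_parallel_size=1 \
       --pipeline_model_parallel_size=-1 \
       --target_pipeline_model_parallel_size=1 \
       --precision=32 \
       --model_class="nemo.collections.vision.models.megatron_vit_classification_models.MegatronVitClassificationModel" \
       --tp_conversion_only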