dist\_checkpointing package
===========================
A library for saving and loading distributed checkpoints.
A "distributed checkpoint" can use various underlying formats (the current default format is based on Zarr),
but it has one distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism)
can be loaded in a different parallel configuration.

Using the library requires defining sharded ``state_dict`` dictionaries with functions from the *mapping* and *optimizer* modules.
These state dicts can then be saved or loaded with the *serialization* module, using strategies from the *strategies* module.
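
To make the resharding property concrete, here is a minimal, library-independent sketch in plain Python.
It is not the ``dist_checkpointing`` API: ``ToyShard``, ``make_shards``, ``toy_save`` and ``toy_load`` are hypothetical names,
and the 1-D, evenly divisible layout is an assumption made for brevity.
The idea it illustrates is that each shard records its position in the global tensor,
so a checkpoint written under one sharding can be read back under another.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class ToyShard:
       """Hypothetical stand-in for a sharded tensor: a local slice plus its global placement."""
       key: str
       global_size: int
       offset: int
       data: list


   def make_shards(key, full, world_size):
       """Split ``full`` into equal contiguous shards, one per rank."""
       assert len(full) % world_size == 0, "toy example assumes an even split"
       n = len(full) // world_size
       return [ToyShard(key, len(full), r * n, full[r * n:(r + 1) * n])
               for r in range(world_size)]


   def toy_save(shards, store):
       """Write every shard into a global buffer addressed by key and offset."""
       for s in shards:
           buf = store.setdefault(s.key, [None] * s.global_size)
           buf[s.offset:s.offset + len(s.data)] = s.data


   def toy_load(key, global_size, rank, world_size, store):
       """Read the slice that ``rank`` owns under a (possibly different) world size."""
       n = global_size // world_size
       return store[key][rank * n:(rank + 1) * n]


   weights = list(range(8))
   store = {}
   # Saved with 2-way sharding ...
   toy_save(make_shards("layer.weight", weights, world_size=2), store)
   # ... and loaded with 4-way sharding.
   reloaded = [toy_load("layer.weight", 8, r, 4, store) for r in range(4)]

Because the store is indexed by global offsets rather than by rank, the number of ranks at load time does not need to match the number at save time.
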
Safe Checkpoint Loading
-----------------------
Since **PyTorch 2.6**, ``torch.load`` defaults to ``weights_only=True``.
With this setting, only tensors and allow-listed classes are loaded, which reduces the risk of arbitrary code execution.
If you encounter an error such as:

.. code-block:: bash

   WeightsUnpickler error: Unsupported global: GLOBAL argparse.Namespace was not an allowed global by default.

you can fix it by explicitly allow-listing the missing class in your script before loading the checkpoint:

.. code-block:: python

   import argparse
   import torch

   # Allow argparse.Namespace to be unpickled under weights_only=True.
   torch.serialization.add_safe_globals([argparse.Namespace])
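
The allow-list mechanism can be illustrated with the standard library alone.
The sketch below is not PyTorch's unpickler: ``AllowListUnpickler`` and ``restricted_loads`` are hypothetical names,
built on ``pickle.Unpickler.find_class``, that reject every global unless it was explicitly allowed.
This mirrors the idea behind ``add_safe_globals``, under the assumption that a plain ``pickle`` payload is a fair stand-in for a checkpoint.

.. code-block:: python

   import argparse
   import io
   import pickle


   class AllowListUnpickler(pickle.Unpickler):
       """Unpickler that only resolves globals named in an explicit allow-list."""

       def __init__(self, file, allowed=()):
           super().__init__(file)
           self._allowed = {(cls.__module__, cls.__qualname__) for cls in allowed}

       def find_class(self, module, name):
           # Called whenever the pickle stream references a class or function.
           if (module, name) in self._allowed:
               return super().find_class(module, name)
           raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")


   def restricted_loads(data, allowed=()):
       """Unpickle ``data``, permitting only the classes listed in ``allowed``."""
       return AllowListUnpickler(io.BytesIO(data), allowed=allowed).load()


   payload = pickle.dumps({"args": argparse.Namespace(lr=0.1), "step": 3})

   # Without an allow-list entry, the Namespace is rejected.
   try:
       restricted_loads(payload)
       rejected = False
   except pickle.UnpicklingError:
       rejected = True

   # After allow-listing the class, the same payload loads.
   restored = restricted_loads(payload, allowed=[argparse.Namespace])

Allow-listing only the classes a checkpoint is known to contain keeps the protection of ``weights_only=True`` in place for everything else.
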
Subpackages
-----------
.. toctree::
   :maxdepth: 4

   dist_checkpointing.strategies
Submodules
----------
dist\_checkpointing.serialization module
----------------------------------------
.. automodule:: core.dist_checkpointing.serialization
   :members:
   :undoc-members:
   :show-inheritance:
dist\_checkpointing.mapping module
----------------------------------
.. automodule:: core.dist_checkpointing.mapping
   :members:
   :undoc-members:
   :show-inheritance:
dist\_checkpointing.optimizer module
------------------------------------
.. automodule:: core.dist_checkpointing.optimizer
   :members:
   :undoc-members:
   :show-inheritance:
dist\_checkpointing.core module
-------------------------------
.. automodule:: core.dist_checkpointing.core
   :members:
   :undoc-members:
   :show-inheritance:
dist\_checkpointing.dict\_utils module
--------------------------------------
.. automodule:: core.dist_checkpointing.dict_utils
   :members:
   :undoc-members:
   :show-inheritance:
dist\_checkpointing.utils module
--------------------------------
.. automodule:: core.dist_checkpointing.utils
   :members:
   :undoc-members:
   :show-inheritance:
Module contents
---------------
.. automodule:: core.dist_checkpointing
   :members:
   :undoc-members:
   :show-inheritance: