# Fully Sharded Data Parallel utilities
## enable_fsdp_ram_efficient_loading[[accelerate.utils.enable_fsdp_ram_efficient_loading]]
#### accelerate.utils.enable_fsdp_ram_efficient_loading[[accelerate.utils.enable_fsdp_ram_efficient_loading]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L39)
Enables RAM efficient loading of Hugging Face models for FSDP in the environment.
## disable_fsdp_ram_efficient_loading[[accelerate.utils.disable_fsdp_ram_efficient_loading]]
#### accelerate.utils.disable_fsdp_ram_efficient_loading[[accelerate.utils.disable_fsdp_ram_efficient_loading]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L49)
Disables RAM efficient loading of Hugging Face models for FSDP in the environment.
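Both helpers simply toggle a flag in the process environment that is read at model-load time. A minimal sketch of that behavior, assuming the environment variable name `FSDP_CPU_RAM_EFFICIENT_LOADING` (an internal detail that may differ between versions; in real code, call the accelerate helpers rather than setting the variable yourself):

```python
import os

# Sketch of what the enable/disable helpers do: flip an environment flag.
# The variable name below is an assumption about an internal detail.
FSDP_RAM_EFFICIENT_FLAG = "FSDP_CPU_RAM_EFFICIENT_LOADING"

def enable_ram_efficient_loading():
    """Mimics accelerate.utils.enable_fsdp_ram_efficient_loading."""
    os.environ[FSDP_RAM_EFFICIENT_FLAG] = "True"

def disable_ram_efficient_loading():
    """Mimics accelerate.utils.disable_fsdp_ram_efficient_loading."""
    os.environ[FSDP_RAM_EFFICIENT_FLAG] = "False"

enable_ram_efficient_loading()
print(os.environ[FSDP_RAM_EFFICIENT_FLAG])  # True
disable_ram_efficient_loading()
print(os.environ[FSDP_RAM_EFFICIENT_FLAG])  # False
```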
## merge_fsdp_weights[[accelerate.utils.merge_fsdp_weights]]
#### accelerate.utils.merge_fsdp_weights[[accelerate.utils.merge_fsdp_weights]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L366)
Merge the weights from sharded FSDP model checkpoints into a single combined checkpoint. Should be used if
`SHARDED_STATE_DICT` was used for the model. Weights will be saved to `{output_path}/model.safetensors` if
`safe_serialization` is `True`, otherwise to `{output_path}/pytorch_model.bin`.
Note: this is a CPU-bound process.
**Parameters:**
checkpoint_dir (`str`) : The directory containing the FSDP checkpoints (can be either the model or optimizer).
output_path (`str`) : The path to save the merged checkpoint.
safe_serialization (`bool`, *optional*, defaults to `True`) : Whether to save the merged weights with safetensors (recommended).
remove_checkpoint_dir (`bool`, *optional*, defaults to `False`) : Whether to remove the checkpoint directory after merging.
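Conceptually, the merge reassembles each parameter from the per-rank shards into one full tensor. A toy sketch of that idea using plain Python lists in place of tensors (real FSDP checkpoints use `torch.distributed.checkpoint`'s on-disk format, so this is illustrative only):

```python
# Each rank saved a dict holding a slice of every tensor; merging collects
# the slices back into full values. Lists stand in for tensor shards here.
shard_rank0 = {"linear.weight": [1.0, 2.0], "linear.bias": [0.5]}
shard_rank1 = {"linear.weight": [3.0, 4.0], "linear.bias": [0.25]}

def merge_shards(*shards):
    """Concatenate each parameter's shards, in rank order, into one dict."""
    merged = {}
    for shard in shards:
        for name, values in shard.items():
            merged.setdefault(name, []).extend(values)
    return merged

full_state_dict = merge_shards(shard_rank0, shard_rank1)
print(full_state_dict["linear.weight"])  # [1.0, 2.0, 3.0, 4.0]
```

In practice you would call `merge_fsdp_weights(checkpoint_dir, output_path)` on a CPU machine with enough RAM to hold the full model, then load the merged file as a normal checkpoint.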
## FullyShardedDataParallelPlugin[[accelerate.FullyShardedDataParallelPlugin]]
#### accelerate.FullyShardedDataParallelPlugin[[accelerate.FullyShardedDataParallelPlugin]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/dataclasses.py#L1571)
This plugin is used to enable fully sharded data parallelism.
#### set_auto_wrap_policy[[accelerate.FullyShardedDataParallelPlugin.set_auto_wrap_policy]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/dataclasses.py#L2041)
Given `model`, creates an `auto_wrap_policy` based on the passed-in policy and, where applicable, the
`transformer_cls_to_wrap`.
**Parameters:**
fsdp_version (`int`, defaults to `1`) : The version of FSDP to use. If set to `2`, the launcher expects the config to be converted to the FSDP2 format.
sharding_strategy (`Union[str, torch.distributed.fsdp.ShardingStrategy]`, defaults to `'FULL_SHARD'`) : Sharding strategy to use. Should be either a `str` or an instance of `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`. Deprecated in favor of `reshard_after_forward`.
reshard_after_forward (`Union[str, torch.distributed.fsdp.ShardingStrategy, bool]`, defaults to `'FULL_SHARD'` for `fsdp_version=1` and `True` for `fsdp_version=2`) : Sharding strategy to use. Should be a bool if `fsdp_version` is set to 2 else a `str` or an instance of `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`.
backward_prefetch (`Union[str, torch.distributed.fsdp.BackwardPrefetch]`, defaults to `'NO_PREFETCH'`) : Backward prefetch strategy to use. Should be either a `str` or an instance of `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`.
mixed_precision_policy (`Optional[Union[dict, str, torch.distributed.fsdp.MixedPrecision, torch.distributed.fsdp.MixedPrecisionPolicy]]`, defaults to `None`) : A config to enable mixed precision training with FullyShardedDataParallel. If passing in a `dict`, it should have the following keys: `param_dtype`, `reduce_dtype`, and `buffer_dtype`, can be an instance of `torch.distributed.fsdp.MixedPrecisionPolicy` if `fsdp_version` is set to 2. If passing in a `str`, it should be one of the following values: fp8, fp16, bf16, fp32, and used to set `param_dtype`, `reduce_dtype`, and `buffer_dtype`.
auto_wrap_policy (`Optional[Union[Callable, Literal["transformer_based_wrap", "size_based_wrap", "no_wrap"]]]`, defaults to `NO_WRAP`) : A callable or string specifying a policy to recursively wrap layers with FSDP. If a string, it must be one of `transformer_based_wrap`, `size_based_wrap`, or `no_wrap`. See `torch.distributed.fsdp.wrap.size_based_auto_wrap_policy` for an example of what a callable should look like.
cpu_offload (`Union[bool, torch.distributed.fsdp.CPUOffload, torch.distributed.fsdp.CPUOffloadPolicy]`, defaults to `False`) : Whether to offload parameters to CPU. Should be either a `bool` or an instance of `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffload` or `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffloadPolicy` if `fsdp_version` is set to 2.
ignored_modules (`Optional[Union[Iterable[torch.nn.Module], str]]`, defaults to `None`) : A list of modules to ignore when wrapping with FSDP. When passing a string, will match the modules by name using regex fullmatch. If `fsdp_version` is set to 2, the modules are converted to parameters and used.
state_dict_type (`Union[str, torch.distributed.fsdp.StateDictType]`, defaults to `'FULL_STATE_DICT'`) : State dict type to use. If a string, it must be one of `full_state_dict`, `local_state_dict`, or `sharded_state_dict`.
state_dict_config (`Optional[Union[torch.distributed.fsdp.FullStateDictConfig, torch.distributed.fsdp.ShardedStateDictConfig]]`, defaults to `None`) : State dict config to use. Is determined based on the `state_dict_type` if not passed in.
optim_state_dict_config (`Optional[Union[torch.distributed.fsdp.FullOptimStateDictConfig, torch.distributed.fsdp.ShardedOptimStateDictConfig]]`, defaults to `None`) : Optim state dict config to use. Is determined based on the `state_dict_type` if not passed in.
limit_all_gathers (`bool`, defaults to `True`) : Whether to have FSDP explicitly synchronize the CPU thread to prevent too many in-flight all-gathers. This bool only affects the sharded strategies that schedule all-gathers. Enabling this can help lower the number of CUDA malloc retries.
use_orig_params (`bool`, defaults to `False`) : Whether to use the original parameters for the optimizer.
param_init_fn (`Optional[Callable[[torch.nn.Module], None]]`, defaults to `None`) : A `Callable[torch.nn.Module] -> None` that specifies how modules that are currently on the meta device should be initialized onto an actual device. Only applicable when `sync_module_states` is `True`. By default is a `lambda` which calls `to_empty` on the module.
sync_module_states (`bool`, defaults to `False`) : Whether each individually wrapped FSDP unit should broadcast module parameters from rank 0 to ensure they are the same across all ranks after initialization. Defaults to `False` unless `cpu_ram_efficient_loading` is `True`, in which case it is forcibly enabled.
forward_prefetch (`bool`, defaults to `False`) : Whether to have FSDP explicitly prefetch the next upcoming all-gather while executing in the forward pass. Only use with static graphs.
activation_checkpointing (`bool`, defaults to `False`) : A technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra computation time for reduced memory usage.
cpu_ram_efficient_loading (`bool`, defaults to `None`) : If `True`, only the first process loads the pretrained model checkpoint while all other processes have empty weights. Only applicable for Transformers models. When using this, `sync_module_states` needs to be `True`.
transformer_cls_names_to_wrap (`Optional[List[str]]`, defaults to `None`) : A list of transformer layer class names to wrap. Only applicable when `auto_wrap_policy` is `transformer_based_wrap`.
min_num_params (`Optional[int]`, defaults to `None`) : The minimum number of parameters a module must have to be wrapped. Only applicable when `auto_wrap_policy` is `size_based_wrap`.
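The parameters above map onto the plugin's constructor keywords. A hedged sketch of assembling an FSDP2-style config (constructing the plugin itself requires torch and a distributed launch, so this only builds and sanity-checks the kwargs; `LlamaDecoderLayer` is an illustrative class name, not a requirement):

```python
# Hypothetical kwargs mirroring the parameter list above. In a real script
# these would be passed as FullyShardedDataParallelPlugin(**fsdp_kwargs).
fsdp_kwargs = {
    "fsdp_version": 2,
    "reshard_after_forward": True,          # bool, since fsdp_version == 2
    "auto_wrap_policy": "transformer_based_wrap",
    "transformer_cls_names_to_wrap": ["LlamaDecoderLayer"],  # illustrative
    "mixed_precision_policy": "bf16",       # sets param/reduce/buffer dtypes
    "cpu_ram_efficient_loading": True,      # forces sync_module_states=True
    "state_dict_type": "sharded_state_dict",
}

# Per the docs: reshard_after_forward is a bool under FSDP2, but a str or
# ShardingStrategy under FSDP1.
assert isinstance(fsdp_kwargs["reshard_after_forward"], bool) == (
    fsdp_kwargs["fsdp_version"] == 2
)
```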
#### set_mixed_precision[[accelerate.FullyShardedDataParallelPlugin.set_mixed_precision]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/dataclasses.py#L2075)
Sets the mixed precision policy for FSDP
#### set_state_dict_type[[accelerate.FullyShardedDataParallelPlugin.set_state_dict_type]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/dataclasses.py#L1996)
Set the state dict config based on the `StateDictType`.
#### validate_mixed_precision_policy[[accelerate.FullyShardedDataParallelPlugin.validate_mixed_precision_policy]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/dataclasses.py#L2127)
Validates the mixed precision policy, abstracted away to not bring in the imports if not needed.
## fsdp2_load_full_state_dict[[accelerate.utils.fsdp2_load_full_state_dict]]
#### accelerate.utils.fsdp2_load_full_state_dict[[accelerate.utils.fsdp2_load_full_state_dict]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L467)
Loads the full state dict (could be only on rank 0) into the sharded model. This is done by broadcasting the
parameters from rank 0 to all other ranks. This function modifies the model in-place.
**Parameters:**
accelerator (`Accelerator`) : The accelerator instance
model (`torch.nn.Module`) : The model to load the state dict into; expected to be on the meta device, otherwise a VRAM spike can occur
full_sd (`dict`) : The full state dict to load; may be present only on rank 0
cpu_offload (`bool`, defaults to `False`) : If True, move sharded parameters to CPU after distribution. Required when FSDP CPU offloading is enabled.
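The broadcast-then-shard flow can be pictured without any distributed machinery. A toy sketch, with plain lists standing in for tensors and a copy standing in for the collective broadcast (real code uses `torch.distributed` collectives and DTensor sharding):

```python
# Rank 0 holds the full values; each rank receives them via a simulated
# broadcast and keeps only its own contiguous slice.
WORLD_SIZE = 2

def broadcast_from_rank0(full_sd, rank):
    """Simulate broadcast + sharding: every rank keeps its slice."""
    received = dict(full_sd)  # stands in for dist.broadcast from rank 0
    shard = {}
    for name, values in received.items():
        chunk = len(values) // WORLD_SIZE
        shard[name] = values[rank * chunk:(rank + 1) * chunk]
    return shard

full_sd = {"w": [1, 2, 3, 4]}  # in reality, present only on rank 0
shards = [broadcast_from_rank0(full_sd, r) for r in range(WORLD_SIZE)]
print(shards[0]["w"], shards[1]["w"])  # [1, 2] [3, 4]
```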
## fsdp2_switch_optimizer_parameters[[accelerate.utils.fsdp2_switch_optimizer_parameters]]
#### accelerate.utils.fsdp2_switch_optimizer_parameters[[accelerate.utils.fsdp2_switch_optimizer_parameters]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L563)
Switches the parameters of the optimizer to new ones (sharded parameters in usual case). This function modifies the
optimizer in-place.
**Parameters:**
optimizer (`torch.optim.Optimizer`) : Optimizer instance which contains the original model parameters
mapping (`dict`) : Mapping from the original parameter (specified by `data_ptr`) to the sharded parameter
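The in-place switch amounts to rewriting each `param_groups` entry through the old-to-new mapping. A minimal sketch, where `ToyOptimizer` is a stand-in class and Python `id()` stands in for the real keying by `tensor.data_ptr()`:

```python
# Toy sketch of switching an optimizer's parameters to their sharded
# replacements, keyed by object identity instead of data_ptr.
class ToyOptimizer:
    def __init__(self, params):
        self.param_groups = [{"params": list(params)}]

old_params = [object(), object()]   # stand-ins for original parameters
new_params = [object(), object()]   # stand-ins for sharded parameters
mapping = {id(old): new for old, new in zip(old_params, new_params)}

def switch_parameters(optimizer, mapping):
    """Replace every parameter in-place using the old -> new mapping."""
    for group in optimizer.param_groups:
        group["params"] = [mapping[id(p)] for p in group["params"]]

opt = ToyOptimizer(old_params)
switch_parameters(opt, mapping)
assert opt.param_groups[0]["params"] == new_params
```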
## fsdp2_prepare_model[[accelerate.utils.fsdp2_prepare_model]]
#### accelerate.utils.fsdp2_prepare_model[[accelerate.utils.fsdp2_prepare_model]]
[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/utils/fsdp_utils.py#L645)
Prepares the model for FSDP2 in-place. Also returns the model to avoid misuse of the original model.
**Parameters:**
accelerator (`Accelerator`) : The accelerator instance
model (`torch.nn.Module`) : The model to prepare
**Returns:**
`torch.nn.Module`
Prepared model
## fsdp2_prepare_auto_wrap_policy
