BryanW committed on Mar 23

Commit

0d11c13

verified ·

1 Parent(s): 38f7dd9

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/_distutils_hack/__pycache__/__init__.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/_distutils_hack/__pycache__/override.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/commands/tpu.py +157 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/test_utils/scripts/external_deps/__pycache__/test_ds_multiple_model.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/__init__.py +306 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/ao.py +140 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/bnb.py +469 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/constants.py +106 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/dataclasses.py +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/deepspeed.py +385 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/environment.py +471 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/fsdp_utils.py +829 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/imports.py +564 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/launch.py +781 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/megatron_lm.py +1424 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/memory.py +210 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/modeling.py +2186 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/offload.py +213 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py +867 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/other.py +561 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/random.py +156 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/rich.py +24 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/torch_xla.py +51 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/tqdm.py +43 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/transformer_engine.py +186 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/versions.py +56 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/annotated_doc/__pycache__/__init__.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/annotated_doc/__pycache__/main.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/cuda_pathfinder-1.4.0.dist-info/licenses/LICENSE +177 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/__init__.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/__version__.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_align.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_align_getter.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_base.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_column.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_common.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_container.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_converter.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_dataproperty.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_extractor.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_formatter.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_function.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_interface.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_line_break.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_preprocessor.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/typing.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__init__.py +7 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/__init__.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/_logger.cpython-312.pyc +0 -0
Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/_null_logger.cpython-312.pyc +0 -0

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/_distutils_hack/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (10.6 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/_distutils_hack/__pycache__/override.cpython-312.pyc ADDED

Binary file (322 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/commands/tpu.py ADDED

	@@ -0,0 +1,157 @@

+#!/usr/bin/env python
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+import subprocess
+from packaging.version import Version, parse
+from accelerate.commands.config.config_args import default_config_file, load_config_from_file
+_description = "Run commands across TPU VMs for initial setup before running `accelerate launch`."
+def tpu_command_parser(subparsers=None):
+    if subparsers is not None:
+        parser = subparsers.add_parser("tpu-config", description=_description)
+    else:
+        parser = argparse.ArgumentParser("Accelerate tpu-config command", description=_description)
+    # Core arguments
+    config_args = parser.add_argument_group(
+        "Config Arguments", "Arguments that can be configured through `accelerate config`."
+    )
+    config_args.add_argument(
+        "--config_file",
+        type=str,
+        default=None,
+        help="Path to the config file to use for accelerate.",
+    )
+    config_args.add_argument(
+        "--tpu_name",
+        default=None,
+        help="The name of the TPU to use. If not specified, will use the TPU specified in the config file.",
+    )
+    config_args.add_argument(
+        "--tpu_zone",
+        default=None,
+        help="The zone of the TPU to use. If not specified, will use the zone specified in the config file.",
+    )
+    pod_args = parser.add_argument_group("TPU Arguments", "Arguments for options run inside the TPU.")
+    pod_args.add_argument(
+        "--use_alpha",
+        action="store_true",
+        help="Whether to use `gcloud alpha` when running the TPU training script instead of `gcloud`.",
+    )
+    pod_args.add_argument(
+        "--command_file",
+        default=None,
+        help="The path to the file containing the commands to run on the pod on startup.",
+    )
+    pod_args.add_argument(
+        "--command",
+        action="append",
+        nargs="+",
+        help="A command to run on the pod. Can be passed multiple times.",
+    )
+    pod_args.add_argument(
+        "--install_accelerate",
+        action="store_true",
+        help="Whether to install accelerate on the pod. Defaults to False.",
+    )
+    pod_args.add_argument(
+        "--accelerate_version",
+        default="latest",
+        help="The version of accelerate to install on the pod. If not specified, will use the latest pypi version. Specify 'dev' to install from GitHub.",
+    )
+    pod_args.add_argument(
+        "--debug", action="store_true", help="If set, will print the command that would be run instead of running it."
+    )
+    if subparsers is not None:
+        parser.set_defaults(func=tpu_command_launcher)
+    return parser
+def tpu_command_launcher(args):
+    defaults = None
+    # Get the default from the config file if it exists.
+    if args.config_file is not None or os.path.isfile(default_config_file):
+        defaults = load_config_from_file(args.config_file)
+        if not args.command_file and defaults.command_file is not None and not args.command:
+            args.command_file = defaults.command_file
+        if not args.command and defaults.commands is not None:
+            args.command = defaults.commands
+        if not args.tpu_name:
+            args.tpu_name = defaults.tpu_name
+        if not args.tpu_zone:
+            args.tpu_zone = defaults.tpu_zone
+    if args.accelerate_version == "dev":
+        args.accelerate_version = "git+https://github.com/huggingface/accelerate.git"
+    elif args.accelerate_version == "latest":
+        args.accelerate_version = "accelerate -U"
+    elif isinstance(parse(args.accelerate_version), Version):
+        args.accelerate_version = f"accelerate=={args.accelerate_version}"
+    if not args.command_file and not args.command:
+        raise ValueError("You must specify either a command file or a command to run on the pod.")
+    if args.command_file:
+        with open(args.command_file) as f:
+            args.command = [f.read().splitlines()]
+    # To turn list of lists into list of strings
+    if isinstance(args.command[0], list):
+        args.command = [line for cmd in args.command for line in cmd]
+    # Default to the shared folder and install accelerate
+    new_cmd = ["cd /usr/share"]
+    if args.install_accelerate:
+        new_cmd += [f"pip install {args.accelerate_version}"]
+    new_cmd += args.command
+    args.command = "; ".join(new_cmd)
+    # Then send it to gcloud
+    # Eventually try to use google-api-core to do this instead of subprocess
+    cmd = ["gcloud"]
+    if args.use_alpha:
+        cmd += ["alpha"]
+    cmd += [
+        "compute",
+        "tpus",
+        "tpu-vm",
+        "ssh",
+        args.tpu_name,
+        "--zone",
+        args.tpu_zone,
+        "--command",
+        args.command,
+        "--worker",
+        "all",
+    ]
+    if args.debug:
+        print(f"Running {' '.join(cmd)}")
+        return
+    subprocess.run(cmd)
+    print("Successfully set up pod.")
+def main():
+    parser = tpu_command_parser()
+    args = parser.parse_args()
+    tpu_command_launcher(args)
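
Review note on `tpu.py`: `tpu_command_launcher` flattens the `--command` values, prefixes `cd /usr/share` (plus an optional `pip install`), joins everything with `; `, and hands the result to `gcloud ... tpu-vm ssh`. The sketch below reproduces only that assembly step so it can be inspected without `gcloud`. `build_tpu_ssh_command` is a hypothetical helper, not part of accelerate, and it simplifies the version handling by treating any value other than `dev` or `latest` as a pinned version without validating it.

```python
def build_tpu_ssh_command(
    tpu_name,
    tpu_zone,
    commands,
    install_accelerate=False,
    accelerate_version="latest",
    use_alpha=False,
):
    """Hypothetical helper: return the argv list the launcher would pass to subprocess.run."""
    # Map the --accelerate_version value to a pip requirement string.
    if accelerate_version == "dev":
        requirement = "git+https://github.com/huggingface/accelerate.git"
    elif accelerate_version == "latest":
        requirement = "accelerate -U"
    else:
        # Simplification: the original validates this with packaging.version.parse.
        requirement = f"accelerate=={accelerate_version}"

    # argparse with action="append" and nargs="+" yields a list of lists; flatten it.
    if commands and isinstance(commands[0], list):
        commands = [part for cmd in commands for part in cmd]

    # Remote shell line: change directory, optionally install, then the user commands.
    remote = ["cd /usr/share"]
    if install_accelerate:
        remote.append(f"pip install {requirement}")
    remote += commands

    argv = ["gcloud"]
    if use_alpha:
        argv.append("alpha")
    argv += [
        "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        "--zone", tpu_zone,
        "--command", "; ".join(remote),
        "--worker", "all",
    ]
    return argv


# "my-tpu" and "us-central1-a" are placeholder values for illustration.
argv = build_tpu_ssh_command(
    "my-tpu", "us-central1-a", [["echo hello"], ["ls -la"]], install_accelerate=True
)
print(argv[argv.index("--command") + 1])
```

Because every flattened element is joined with `; `, an unquoted `--command echo hello` would be sent as two separate shell commands (`echo; hello`), so each command should be quoted as a single argument.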

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/test_utils/scripts/external_deps/__pycache__/test_ds_multiple_model.cpython-312.pyc ADDED

Binary file (14.1 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/__init__.py ADDED

	@@ -0,0 +1,306 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from ..parallelism_config import ParallelismConfig
+from .ao import convert_model_to_fp8_ao, filter_first_and_last_linear_layers, has_ao_layers
+from .constants import (
+    MITA_PROFILING_AVAILABLE_PYTORCH_VERSION,
+    MODEL_NAME,
+    OPTIMIZER_NAME,
+    PROFILE_PATTERN_NAME,
+    RNG_STATE_NAME,
+    SAFE_MODEL_NAME,
+    SAFE_WEIGHTS_INDEX_NAME,
+    SAFE_WEIGHTS_NAME,
+    SAFE_WEIGHTS_PATTERN_NAME,
+    SAMPLER_NAME,
+    SCALER_NAME,
+    SCHEDULER_NAME,
+    TORCH_DISTRIBUTED_OPERATION_TYPES,
+    TORCH_LAUNCH_PARAMS,
+    WEIGHTS_INDEX_NAME,
+    WEIGHTS_NAME,
+    WEIGHTS_PATTERN_NAME,
+    XPU_PROFILING_AVAILABLE_PYTORCH_VERSION,
+)
+from .dataclasses import (
+    AORecipeKwargs,
+    AutocastKwargs,
+    BnbQuantizationConfig,
+    ComputeEnvironment,
+    CustomDtype,
+    DataLoaderConfiguration,
+    DDPCommunicationHookType,
+    DeepSpeedPlugin,
+    DeepSpeedSequenceParallelConfig,
+    DistributedDataParallelKwargs,
+    DistributedType,
+    DynamoBackend,
+    FP8RecipeKwargs,
+    FullyShardedDataParallelPlugin,
+    GradientAccumulationPlugin,
+    GradScalerKwargs,
+    InitProcessGroupKwargs,
+    KwargsHandler,
+    LoggerType,
+    MegatronLMPlugin,
+    MSAMPRecipeKwargs,
+    PrecisionType,
+    ProfileKwargs,
+    ProjectConfiguration,
+    RNGType,
+    SageMakerDistributedType,
+    TensorInformation,
+    TERecipeKwargs,
+    TorchContextParallelConfig,
+    TorchDynamoPlugin,
+    TorchTensorParallelConfig,
+    TorchTensorParallelPlugin,
+    add_model_config_to_megatron_parser,
+)
+from .environment import (
+    are_libraries_initialized,
+    check_cuda_fp8_capability,
+    check_cuda_p2p_ib_support,
+    clear_environment,
+    convert_dict_to_env_variables,
+    get_cpu_distributed_information,
+    get_current_device_type,
+    get_gpu_info,
+    get_int_from_env,
+    parse_choice_from_env,
+    parse_flag_from_env,
+    patch_environment,
+    purge_accelerate_environment,
+    set_numa_affinity,
+    str_to_bool,
+)
+from .imports import (
+    deepspeed_required,
+    get_ccl_version,
+    is_4bit_bnb_available,
+    is_8bit_bnb_available,
+    is_aim_available,
+    is_bf16_available,
+    is_bitsandbytes_multi_backend_available,
+    is_bnb_available,
+    is_boto3_available,
+    is_ccl_available,
+    is_clearml_available,
+    is_comet_ml_available,
+    is_cuda_available,
+    is_datasets_available,
+    is_deepspeed_available,
+    is_dvclive_available,
+    is_fp8_available,
+    is_fp16_available,
+    is_habana_gaudi1,
+    is_hpu_available,
+    is_import_timer_available,
+    is_ipex_available,
+    is_lomo_available,
+    is_matplotlib_available,
+    is_megatron_lm_available,
+    is_mlflow_available,
+    is_mlu_available,
+    is_mps_available,
+    is_msamp_available,
+    is_musa_available,
+    is_npu_available,
+    is_pandas_available,
+    is_peft_available,
+    is_pippy_available,
+    is_pynvml_available,
+    is_pytest_available,
+    is_rich_available,
+    is_sagemaker_available,
+    is_schedulefree_available,
+    is_sdaa_available,
+    is_swanlab_available,
+    is_tensorboard_available,
+    is_timm_available,
+    is_torch_xla_available,
+    is_torchao_available,
+    is_torchdata_available,
+    is_torchdata_stateful_dataloader_available,
+    is_torchvision_available,
+    is_trackio_available,
+    is_transformer_engine_available,
+    is_transformer_engine_mxfp8_available,
+    is_transformers_available,
+    is_triton_available,
+    is_wandb_available,
+    is_weights_only_available,
+    is_xccl_available,
+    is_xpu_available,
+    torchao_required,
+)
+from .modeling import (
+    align_module_device,
+    calculate_maximum_sizes,
+    check_device_map,
+    check_tied_parameters_in_config,
+    check_tied_parameters_on_same_device,
+    compute_module_sizes,
+    convert_file_size_to_int,
+    dtype_byte_size,
+    find_tied_parameters,
+    get_balanced_memory,
+    get_grad_scaler,
+    get_max_layer_size,
+    get_max_memory,
+    get_mixed_precision_context_manager,
+    has_offloaded_params,
+    id_tensor_storage,
+    infer_auto_device_map,
+    is_peft_model,
+    load_checkpoint_in_model,
+    load_offloaded_weights,
+    load_state_dict,
+    named_module_tensors,
+    retie_parameters,
+    set_module_tensor_to_device,
+)
+from .offload import (
+    OffloadedWeightsLoader,
+    PrefixedDataset,
+    extract_submodules_state_dict,
+    load_offloaded_weight,
+    offload_state_dict,
+    offload_weight,
+    save_offload_index,
+)
+from .operations import (
+    CannotPadNestedTensorWarning,
+    GatheredParameters,
+    broadcast,
+    broadcast_object_list,
+    concatenate,
+    convert_outputs_to_fp32,
+    convert_to_fp32,
+    copy_tensor_to_devices,
+    find_batch_size,
+    find_device,
+    gather,
+    gather_object,
+    get_data_structure,
+    honor_type,
+    ignorant_find_batch_size,
+    initialize_tensors,
+    is_namedtuple,
+    is_tensor_information,
+    is_torch_tensor,
+    listify,
+    pad_across_processes,
+    pad_input_tensors,
+    recursively_apply,
+    reduce,
+    send_to_device,
+    slice_tensors,
+)
+from .versions import compare_versions, is_torch_version
+if is_deepspeed_available():
+    from .deepspeed import (
+        DeepSpeedEngineWrapper,
+        DeepSpeedOptimizerWrapper,
+        DeepSpeedSchedulerWrapper,
+        DummyOptim,
+        DummyScheduler,
+        HfDeepSpeedConfig,
+        get_active_deepspeed_plugin,
+        map_pytorch_optim_to_deepspeed,
+    )
+from .bnb import has_4bit_bnb_layers, load_and_quantize_model
+from .fsdp_utils import (
+    disable_fsdp_ram_efficient_loading,
+    enable_fsdp_ram_efficient_loading,
+    ensure_weights_retied,
+    fsdp2_apply_ac,
+    fsdp2_canonicalize_names,
+    fsdp2_load_full_state_dict,
+    fsdp2_prepare_model,
+    fsdp2_switch_optimizer_parameters,
+    get_fsdp2_grad_scaler,
+    load_fsdp_model,
+    load_fsdp_optimizer,
+    merge_fsdp_weights,
+    save_fsdp_model,
+    save_fsdp_optimizer,
+)
+from .launch import (
+    PrepareForLaunch,
+    _filter_args,
+    prepare_deepspeed_cmd_env,
+    prepare_multi_gpu_env,
+    prepare_sagemager_args_inputs,
+    prepare_simple_launcher_cmd_env,
+    prepare_tpu,
+)
+# For docs
+from .megatron_lm import (
+    AbstractTrainStep,
+    BertTrainStep,
+    GPTTrainStep,
+    MegatronLMDummyDataLoader,
+    MegatronLMDummyScheduler,
+    T5TrainStep,
+    avg_losses_across_data_parallel_group,
+)
+if is_megatron_lm_available():
+    from .megatron_lm import (
+        MegatronEngine,
+        MegatronLMOptimizerWrapper,
+        MegatronLMSchedulerWrapper,
+        gather_across_data_parallel_groups,
+    )
+    from .megatron_lm import initialize as megatron_lm_initialize
+    from .megatron_lm import prepare_data_loader as megatron_lm_prepare_data_loader
+    from .megatron_lm import prepare_model_optimizer_scheduler as megatron_lm_prepare_model_optimizer_scheduler
+    from .megatron_lm import prepare_optimizer as megatron_lm_prepare_optimizer
+    from .megatron_lm import prepare_scheduler as megatron_lm_prepare_scheduler
+from .memory import find_executable_batch_size, release_memory
+from .other import (
+    check_os_kernel,
+    clean_state_dict_for_safetensors,
+    compile_regions,
+    compile_regions_deepspeed,
+    convert_bytes,
+    extract_model_from_parallel,
+    get_module_children_bottom_up,
+    get_pretty_name,
+    has_compiled_regions,
+    is_compiled_module,
+    is_port_in_use,
+    load,
+    merge_dicts,
+    model_has_dtensor,
+    recursive_getattr,
+    save,
+    wait_for_everyone,
+    write_basic_config,
+)
+from .random import set_seed, synchronize_rng_state, synchronize_rng_states
+from .torch_xla import install_xla
+from .tqdm import tqdm
+from .transformer_engine import (
+    apply_fp8_autowrap,
+    contextual_fp8_autocast,
+    convert_model,
+    has_transformer_engine_layers,
+)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/ao.py ADDED

	@@ -0,0 +1,140 @@

+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Needed utilities for torchao FP8 training.
+"""
+from functools import partial
+from typing import TYPE_CHECKING, Callable, Optional
+import torch
+from .imports import is_torchao_available, torchao_required
+if TYPE_CHECKING:
+    if is_torchao_available():
+        from torchao.float8.float8_linear import Float8LinearConfig
+def find_first_last_linear_layers(model: torch.nn.Module):
+    """
+    Finds the first and last linear layer names in a model.
+    This is needed during FP8 to avoid issues with instability by keeping the first and last layers unquantized.
+    Ref: https://x.com/xariusrke/status/1826669142604141052
+    """
+    first_linear, last_linear = None, None
+    for name, module in model.named_modules():
+        if isinstance(module, torch.nn.Linear):
+            if first_linear is None:
+                first_linear = name
+            last_linear = name
+    return first_linear, last_linear
+def filter_linear_layers(module, fqn: str, layers_to_filter: list[str]) -> bool:
+    """
+    A function which will check if `module` is:
+    - a `torch.nn.Linear` layer
+    - has in_features and out_features divisible by 16
+    - is not part of `layers_to_filter`
+    Args:
+        module (`torch.nn.Module`):
+            The module to check.
+        fqn (`str`):
+            The fully qualified name of the layer.
+        layers_to_filter (`List[str]`):
+            The list of layers to filter.
+    """
+    if isinstance(module, torch.nn.Linear):
+        if module.in_features % 16 != 0 or module.out_features % 16 != 0:
+            return False
+    if fqn in layers_to_filter:
+        return False
+    return True
+def filter_first_and_last_linear_layers(module, fqn: str) -> bool:
+    """
+    A filter function that selects every linear layer for conversion except the first and last.
+    <Tip>
+        For stability reasons, we skip the first and last linear layers. Converting them can lead to the model not
+        training or converging properly.
+    </Tip>
+    Args:
+        module (`torch.nn.Module`):
+            The module to check.
+        fqn (`str`):
+            The fully qualified name of the layer.
+    """
+    first_linear, last_linear = find_first_last_linear_layers(module)
+    return filter_linear_layers(module, fqn, layers_to_filter=[first_linear, last_linear])
+@torchao_required
+def has_ao_layers(model: torch.nn.Module):
+    from torchao.float8.float8_linear import Float8Linear
+    for name, module in model.named_modules():
+        if isinstance(module, Float8Linear):
+            return True
+    return False
+@torchao_required
+def convert_model_to_fp8_ao(
+    model: torch.nn.Module,
+    config: Optional["Float8LinearConfig"] = None,
+    module_filter_func: Optional[Callable] = filter_first_and_last_linear_layers,
+):
+    """
+    Converts all `nn.Linear` layers in the model (except the first and last) to torchao's `Float8Linear` layer inplace.
+    Args:
+        model (`torch.nn.Module`):
+            The model to convert.
+        config (`torchao.float8.Float8LinearConfig`, *optional*):
+            The configuration for the FP8 training. Recommended to utilize
+            `torchao.float8.recipe_name_to_linear_config` to generate this. In general, the default config should be
+            sufficient (what is passed when set to `None`).
+        module_filter_func (`Callable`, *optional*, defaults to `filter_first_and_last_linear_layers`):
+            Optional function that must take in a module and layer name, and returns a boolean indicating whether the
+            module should be converted to FP8. Defaults to `filter_first_and_last_linear_layers`. See it for an example.
+    Example:
+    ```python
+    from accelerate.utils.ao import convert_model_to_fp8_ao
+    model = MyModel()
+    model.to("cuda")
+    convert_model_to_fp8_ao(model)
+    model.train()
+    ```
+    """
+    from torchao.float8 import convert_to_float8_training
+    first_linear, last_linear = find_first_last_linear_layers(model)
+    if module_filter_func is None:
+        module_filter_func = partial(filter_linear_layers, layers_to_filter=[first_linear, last_linear])
+    convert_to_float8_training(model, module_filter_fn=module_filter_func, config=config)
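
Review note on `ao.py`: the rule in `filter_linear_layers` (skip linear layers whose `in_features` or `out_features` is not divisible by 16, and skip any name listed in `layers_to_filter`) can be exercised without torch or torchao. The sketch below is an illustration under assumptions: `LinearSpec` is a stand-in for `torch.nn.Linear`, `keep_for_fp8` is a hypothetical copy of the rule, and the layer names are made up.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class LinearSpec:
    """Stand-in for torch.nn.Linear; carries only the two sizes the filter inspects."""
    in_features: int
    out_features: int


def keep_for_fp8(module, fqn, layers_to_filter):
    # Same rule as filter_linear_layers, with LinearSpec replacing torch.nn.Linear.
    if isinstance(module, LinearSpec):
        if module.in_features % 16 != 0 or module.out_features % 16 != 0:
            return False
    if fqn in layers_to_filter:
        return False
    return True


# Hypothetical model layout, in registration order.
layers = {
    "embed_proj": LinearSpec(64, 128),
    "blocks.0.fc": LinearSpec(128, 128),
    "blocks.1.fc": LinearSpec(128, 100),  # 100 is not divisible by 16
    "lm_head": LinearSpec(128, 32),
}

# First and last linear layers are excluded, as in the module_filter_func=None branch.
names = list(layers)
first, last = names[0], names[-1]
module_filter = partial(keep_for_fp8, layers_to_filter=[first, last])

converted = [name for name, mod in layers.items() if module_filter(mod, name)]
print(converted)  # ['blocks.0.fc']
```

Binding `layers_to_filter` with `partial` leaves a two-argument callable taking `(module, fqn)`, which is the shape `convert_to_float8_training` receives as `module_filter_fn` in the diff.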

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/bnb.py ADDED

	@@ -0,0 +1,469 @@

+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+from copy import deepcopy
+from typing import Optional, Union
+import torch
+import torch.nn as nn
+from accelerate.utils.imports import (
+    is_4bit_bnb_available,
+    is_8bit_bnb_available,
+)
+from ..big_modeling import dispatch_model, init_empty_weights
+from .dataclasses import BnbQuantizationConfig
+from .modeling import (
+    find_tied_parameters,
+    get_balanced_memory,
+    infer_auto_device_map,
+    load_checkpoint_in_model,
+    offload_weight,
+    set_module_tensor_to_device,
+)
+logger = logging.getLogger(__name__)
+def load_and_quantize_model(
+    model: torch.nn.Module,
+    bnb_quantization_config: BnbQuantizationConfig,
+    weights_location: Optional[Union[str, os.PathLike]] = None,
+    device_map: Optional[dict[str, Union[int, str, torch.device]]] = None,
+    no_split_module_classes: Optional[list[str]] = None,
+    max_memory: Optional[dict[Union[int, str], Union[int, str]]] = None,
+    offload_folder: Optional[Union[str, os.PathLike]] = None,
+    offload_state_dict: bool = False,
+):
+    """
+    This function will quantize the input model with the associated config passed in `bnb_quantization_config`. If the
+    model is in the meta device, we will load and dispatch the weights according to the `device_map` passed. If the
+    model is already loaded, we will quantize the model and put the model on the GPU.
+    Args:
+        model (`torch.nn.Module`):
+            Input model. The model can be already loaded or on the meta device
+        bnb_quantization_config (`BnbQuantizationConfig`):
+            The bitsandbytes quantization parameters
+        weights_location (`str` or `os.PathLike`):
+            The folder weights_location to load. It can be:
+            - a path to a file containing a whole model state dict
+            - a path to a `.json` file containing the index to a sharded checkpoint
+            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
+            - a path to a folder containing a unique pytorch_model.bin file.
+        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
+            A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
+            name, once a given module name is inside, every submodule of it will be sent to the same device.
+        no_split_module_classes (`List[str]`, *optional*):
+            A list of layer class names that should never be split across device (for instance any layer that has a
+            residual connection).
+        max_memory (`Dict`, *optional*):
+            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
+        offload_folder (`str` or `os.PathLike`, *optional*):
+            If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
+        offload_state_dict (`bool`, *optional*, defaults to `False`):
+            If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM if
+            the weight of the CPU state dict + the biggest shard does not fit.
+    Returns:
+        `torch.nn.Module`: The quantized model
+    """
+    load_in_4bit = bnb_quantization_config.load_in_4bit
+    load_in_8bit = bnb_quantization_config.load_in_8bit
+    if load_in_8bit and not is_8bit_bnb_available():
+        raise ImportError(
+            "You have a version of `bitsandbytes` that is not compatible with 8bit quantization,"
+            " make sure you have the latest version of `bitsandbytes` installed."
+        )
+    if load_in_4bit and not is_4bit_bnb_available():
+        raise ValueError(
+            "You have a version of `bitsandbytes` that is not compatible with 4bit quantization,"
+            " make sure you have the latest version of `bitsandbytes` installed."
+        )
+    modules_on_cpu = []
+    # custom device map
+    if isinstance(device_map, dict) and len(device_map.keys()) > 1:
+        modules_on_cpu = [key for key, value in device_map.items() if value in ["disk", "cpu"]]
+    # We keep some modules such as the lm_head in their original dtype for numerical stability reasons
+    if bnb_quantization_config.skip_modules is None:
+        bnb_quantization_config.skip_modules = get_keys_to_not_convert(model)
+    # add cpu modules to skip modules only for 4-bit modules
+    if load_in_4bit:
+        bnb_quantization_config.skip_modules.extend(modules_on_cpu)
+    modules_to_not_convert = bnb_quantization_config.skip_modules
+    # We add the modules we want to keep in full precision
+    if bnb_quantization_config.keep_in_fp32_modules is None:
+        bnb_quantization_config.keep_in_fp32_modules = []
+    keep_in_fp32_modules = bnb_quantization_config.keep_in_fp32_modules
+    modules_to_not_convert.extend(keep_in_fp32_modules)
+    # compatibility with peft
+    model.is_loaded_in_4bit = load_in_4bit
+    model.is_loaded_in_8bit = load_in_8bit
+    model_device = get_parameter_device(model)
+    if model_device.type != "meta":
+        # quantization of an already loaded model
+        logger.warning(
+            "It is not recommended to quantize a loaded model. "
+            "The model should be instantiated under the `init_empty_weights` context manager."
+        )
+        model = replace_with_bnb_layers(model, bnb_quantization_config, modules_to_not_convert=modules_to_not_convert)
+        # convert param to the right dtype
+        dtype = bnb_quantization_config.torch_dtype
+        for name, param in model.state_dict().items():
+            if any(module_to_keep_in_fp32 in name for module_to_keep_in_fp32 in keep_in_fp32_modules):
+                param.to(torch.float32)
+                if param.dtype != torch.float32:
+                    name = name.replace(".weight", "").replace(".bias", "")
+                    param = getattr(model, name, None)
+                    if param is not None:
+                        param.to(torch.float32)
+            elif torch.is_floating_point(param):
+                param.to(dtype)
+        if model_device.type == "cuda":
+            model.cuda(torch.cuda.current_device())
+            torch.cuda.empty_cache()
+        elif torch.cuda.is_available():
+            model.to(torch.cuda.current_device())
+        elif torch.xpu.is_available():
+            model.to(torch.xpu.current_device())
+        else:
+            raise RuntimeError("No GPU or Intel XPU found. A GPU or Intel XPU is needed for quantization.")
+        logger.info(
+            f"The model device type is {model_device.type}. However, gpu or intel xpu is needed for quantization. "
+            "We move the model to it."
+        )
+        return model
+    elif weights_location is None:
+        raise RuntimeError(
+            f"`weights_location` needs to be the folder path containing the weights of the model, but we found {weights_location} "
+        )
+    else:
+        with init_empty_weights():
+            model = replace_with_bnb_layers(
+                model, bnb_quantization_config, modules_to_not_convert=modules_to_not_convert
+            )
+        device_map = get_quantized_model_device_map(
+            model,
+            bnb_quantization_config,
+            device_map,
+            max_memory=max_memory,
+            no_split_module_classes=no_split_module_classes,
+        )
+        if offload_state_dict is None and device_map is not None and "disk" in device_map.values():
+            offload_state_dict = True
+        offload = any(x in list(device_map.values()) for x in ["cpu", "disk"])
+        load_checkpoint_in_model(
+            model,
+            weights_location,
+            device_map,
+            dtype=bnb_quantization_config.torch_dtype,
+            offload_folder=offload_folder,
+            offload_state_dict=offload_state_dict,
+            keep_in_fp32_modules=bnb_quantization_config.keep_in_fp32_modules,
+            offload_8bit_bnb=load_in_8bit and offload,
+        )
+        return dispatch_model(model, device_map=device_map, offload_dir=offload_folder)
+def get_quantized_model_device_map(
+    model, bnb_quantization_config, device_map=None, max_memory=None, no_split_module_classes=None
+):
+    if device_map is None:
+        if torch.cuda.is_available():
+            device_map = {"": torch.cuda.current_device()}
+        elif torch.xpu.is_available():
+            device_map = {"": torch.xpu.current_device()}
+        else:
+            raise RuntimeError("No GPU found. A GPU is needed for quantization.")
+        logger.info("The device_map was not initialized. Setting device_map to `{'':torch.cuda.current_device()}`.")
+    if isinstance(device_map, str):
+        if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
+            raise ValueError(
+                "If passing a string for `device_map`, please choose 'auto', 'balanced', 'balanced_low_0' or "
+                "'sequential'."
+            )
+        special_dtypes = {}
+        special_dtypes.update(
+            {
+                name: bnb_quantization_config.torch_dtype
+                for name, _ in model.named_parameters()
+                if any(m in name for m in bnb_quantization_config.skip_modules)
+            }
+        )
+        special_dtypes.update(
+            {
+                name: torch.float32
+                for name, _ in model.named_parameters()
+                if any(m in name for m in bnb_quantization_config.keep_in_fp32_modules)
+            }
+        )
+        kwargs = {}
+        kwargs["special_dtypes"] = special_dtypes
+        kwargs["no_split_module_classes"] = no_split_module_classes
+        kwargs["dtype"] = bnb_quantization_config.target_dtype
+        # get max_memory for each device.
+        if device_map != "sequential":
+            max_memory = get_balanced_memory(
+                model,
+                low_zero=(device_map == "balanced_low_0"),
+                max_memory=max_memory,
+                **kwargs,
+            )
+        kwargs["max_memory"] = max_memory
+        device_map = infer_auto_device_map(model, **kwargs)
+    if isinstance(device_map, dict):
+        # check that we don't have any quantized module on the cpu
+        modules_not_to_convert = bnb_quantization_config.skip_modules + bnb_quantization_config.keep_in_fp32_modules
+        device_map_without_some_modules = {
+            key: device_map[key] for key in device_map.keys() if key not in modules_not_to_convert
+        }
+        for device in ["cpu", "disk"]:
+            if device in device_map_without_some_modules.values():
+                if bnb_quantization_config.load_in_4bit:
+                    raise ValueError(
+                        """
+                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
+                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
+                        these modules in `torch_dtype`, you need to pass a custom `device_map` to
+                        `load_and_quantize_model`. Check
+                        https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization#offload-modules-to-cpu-and-disk
+                        for more details.
+                        """
+                    )
+                else:
+                    logger.info(
+                        "Some modules are offloaded to the CPU or the disk. Note that these modules will be converted to 8-bit"
+                    )
+        del device_map_without_some_modules
+    return device_map
+def replace_with_bnb_layers(model, bnb_quantization_config, modules_to_not_convert=None, current_key_name=None):
+    """
+    A helper function to replace all `torch.nn.Linear` modules by `bnb.nn.Linear8bitLt` modules or by `bnb.nn.Linear4bit`
+    modules from the `bitsandbytes` library. The function will be run recursively and replace `torch.nn.Linear` modules.
+    Parameters:
+        model (`torch.nn.Module`):
+            Input model or `torch.nn.Module` as the function is run recursively.
+        modules_to_not_convert (`List[str]`):
+            Names of the modules to not convert. In practice we keep the `lm_head` in full precision for
+            numerical stability reasons.
+        current_key_name (`List[str]`, *optional*):
+            An array to track the current key of the recursion. This is used to check whether the current key (part of
+            it) is not in the list of modules to not convert.
+    """
+    if modules_to_not_convert is None:
+        modules_to_not_convert = []
+    model, has_been_replaced = _replace_with_bnb_layers(
+        model, bnb_quantization_config, modules_to_not_convert, current_key_name
+    )
+    if not has_been_replaced:
+        logger.warning(
+            "You are loading your model in 8bit or 4bit but no linear modules were found in your model."
+            " This can happen for some architectures such as gpt2 that use Conv1D instead of Linear layers."
+            " Please double check your model architecture, or submit an issue on github if you think this is"
+            " a bug."
+        )
+    return model
+def _replace_with_bnb_layers(
+    model,
+    bnb_quantization_config,
+    modules_to_not_convert=None,
+    current_key_name=None,
+):
+    """
+    Private method that wraps the recursion for module replacement.
+    Returns the converted model and a boolean that indicates if the conversion has been successful or not.
+    """
+    # bitsandbytes will initialize CUDA on import, so it needs to be imported lazily
+    import bitsandbytes as bnb
+    has_been_replaced = False
+    for name, module in model.named_children():
+        if current_key_name is None:
+            current_key_name = []
+        current_key_name.append(name)
+        if isinstance(module, nn.Linear) and name not in modules_to_not_convert:
+            # Check if the current key is not in the `modules_to_not_convert`
+            current_key_name_str = ".".join(current_key_name)
+            proceed = True
+            for key in modules_to_not_convert:
+                if (
+                    (key in current_key_name_str) and (key + "." in current_key_name_str)
+                ) or key == current_key_name_str:
+                    proceed = False
+                    break
+            if proceed:
+                # Load bnb module with empty weight and replace `nn.Linear` module
+                if bnb_quantization_config.load_in_8bit:
+                    bnb_module = bnb.nn.Linear8bitLt(
+                        module.in_features,
+                        module.out_features,
+                        module.bias is not None,
+                        has_fp16_weights=False,
+                        threshold=bnb_quantization_config.llm_int8_threshold,
+                    )
+                elif bnb_quantization_config.load_in_4bit:
+                    bnb_module = bnb.nn.Linear4bit(
+                        module.in_features,
+                        module.out_features,
+                        module.bias is not None,
+                        bnb_quantization_config.bnb_4bit_compute_dtype,
+                        compress_statistics=bnb_quantization_config.bnb_4bit_use_double_quant,
+                        quant_type=bnb_quantization_config.bnb_4bit_quant_type,
+                    )
+                else:
+                    raise ValueError("load_in_8bit and load_in_4bit can't be both False")
+                bnb_module.weight.data = module.weight.data
+                if module.bias is not None:
+                    bnb_module.bias.data = module.bias.data
+                bnb_module.requires_grad_(False)
+                setattr(model, name, bnb_module)
+                has_been_replaced = True
+        if len(list(module.children())) > 0:
+            _, _has_been_replaced = _replace_with_bnb_layers(
+                module, bnb_quantization_config, modules_to_not_convert, current_key_name
+            )
+            has_been_replaced = has_been_replaced | _has_been_replaced
+        # Remove the last key for recursion
+        current_key_name.pop(-1)
+    return model, has_been_replaced
+def get_keys_to_not_convert(model):
+    r"""
+    A utility function to get the keys of the modules to keep in full precision, if any. For example, for CausalLM modules
+    we may want to keep the lm_head in full precision for numerical stability reasons. For other architectures, we want
+    to keep the tied weights of the model. The function will return a list of the keys of the modules to not convert in
+    int8.
+    Parameters:
+    model (`torch.nn.Module`):
+        Input model
+    """
+    # Create a copy of the model
+    with init_empty_weights():
+        tied_model = deepcopy(model)  # this has 0 cost since it is done inside the `init_empty_weights` context manager
+    tied_params = find_tied_parameters(tied_model)
+    # For compatibility with Accelerate < 0.18
+    if isinstance(tied_params, dict):
+        tied_keys = sum(list(tied_params.values()), []) + list(tied_params.keys())
+    else:
+        tied_keys = sum(tied_params, [])
+    has_tied_params = len(tied_keys) > 0
+    # Check if it is a base model
+    is_base_model = False
+    if hasattr(model, "base_model_prefix"):
+        is_base_model = not hasattr(model, model.base_model_prefix)
+    # Ignore this for base models (BertModel, GPT2Model, etc.)
+    if (not has_tied_params) and is_base_model:
+        return []
+    # otherwise they have an attached head
+    list_modules = list(model.named_children())
+    list_last_module = [list_modules[-1][0]]
+    # add last module together with tied weights
+    intersection = set(list_last_module) - set(tied_keys)
+    list_untouched = list(set(tied_keys)) + list(intersection)
+    # remove ".weight" from the keys
+    names_to_remove = [".weight", ".bias"]
+    filtered_module_names = []
+    for name in list_untouched:
+        for name_to_remove in names_to_remove:
+            if name_to_remove in name:
+                name = name.replace(name_to_remove, "")
+        filtered_module_names.append(name)
+    return filtered_module_names
+def has_4bit_bnb_layers(model):
+    """Check if we have `bnb.nn.Linear4bit` layers inside our model"""
+    # bitsandbytes will initialize CUDA on import, so it needs to be imported lazily
+    import bitsandbytes as bnb
+    for m in model.modules():
+        if isinstance(m, bnb.nn.Linear4bit):
+            return True
+    return False
+def get_parameter_device(parameter: nn.Module):
+    return next(parameter.parameters()).device
+def quantize_and_offload_8bit(model, param, param_name, new_dtype, offload_folder, offload_index, fp16_statistics):
+    # if it is not quantized, we quantize and offload the quantized weights and the SCB stats
+    if fp16_statistics is None:
+        set_module_tensor_to_device(model, param_name, 0, dtype=new_dtype, value=param)
+        tensor_name = param_name
+        module = model
+        if "." in tensor_name:
+            splits = tensor_name.split(".")
+            for split in splits[:-1]:
+                new_module = getattr(module, split)
+                if new_module is None:
+                    raise ValueError(f"{module} has no attribute {split}.")
+                module = new_module
+            tensor_name = splits[-1]
+        # offload weights
+        module._parameters[tensor_name].requires_grad = False
+        offload_weight(module._parameters[tensor_name], param_name, offload_folder, index=offload_index)
+        if hasattr(module._parameters[tensor_name], "SCB"):
+            offload_weight(
+                module._parameters[tensor_name].SCB,
+                param_name.replace("weight", "SCB"),
+                offload_folder,
+                index=offload_index,
+            )
+    else:
+        offload_weight(param, param_name, offload_folder, index=offload_index)
+        offload_weight(fp16_statistics, param_name.replace("weight", "SCB"), offload_folder, index=offload_index)
+    set_module_tensor_to_device(model, param_name, "meta", dtype=new_dtype, value=torch.empty(*param.size()))
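
Reviewer note: the skip logic in `_replace_with_bnb_layers` above is easy to misread, so here is a minimal standalone sketch of it. It assumes nothing beyond plain Python; the helper name `should_quantize` is hypothetical and is not part of `accelerate`. It mirrors the two checks in the diff: the leaf-name test (`name not in modules_to_not_convert`) and the dotted-path test that follows it.

```python
def should_quantize(current_key_name, modules_to_not_convert):
    """Return True if a linear layer at this dotted path would be replaced.

    `current_key_name` is the list of child names from the root module down
    to the layer, as tracked by the recursion in the diff.
    """
    # First check: the leaf name itself is listed.
    name = current_key_name[-1]
    if name in modules_to_not_convert:
        return False

    # Second check: a listed key is a dotted prefix segment of the full path,
    # or equals the full path exactly.
    path = ".".join(current_key_name)
    for key in modules_to_not_convert:
        if (key in path and key + "." in path) or key == path:
            return False
    return True


if __name__ == "__main__":
    print(should_quantize(["model", "layers", "0", "mlp", "fc1"], ["lm_head"]))
    print(should_quantize(["transformer", "lm_head"], ["lm_head"]))
    print(should_quantize(["model", "layers", "0", "mlp", "fc1"], ["model.layers.0"]))
```

Because the path test requires `key + "."` to appear, a key such as `model.layers.0` does not accidentally exclude `model.layers.01`.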

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/constants.py ADDED

	@@ -0,0 +1,106 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import operator as op
+import torch
+SCALER_NAME = "scaler.pt"
+MODEL_NAME = "pytorch_model"
+SAFE_MODEL_NAME = "model"
+RNG_STATE_NAME = "random_states"
+OPTIMIZER_NAME = "optimizer"
+SCHEDULER_NAME = "scheduler"
+SAMPLER_NAME = "sampler"
+PROFILE_PATTERN_NAME = "profile_{suffix}.json"
+WEIGHTS_NAME = f"{MODEL_NAME}.bin"
+WEIGHTS_PATTERN_NAME = "pytorch_model{suffix}.bin"
+WEIGHTS_INDEX_NAME = f"{WEIGHTS_NAME}.index.json"
+SAFE_WEIGHTS_NAME = f"{SAFE_MODEL_NAME}.safetensors"
+SAFE_WEIGHTS_PATTERN_NAME = "model{suffix}.safetensors"
+SAFE_WEIGHTS_INDEX_NAME = f"{SAFE_WEIGHTS_NAME}.index.json"
+SAGEMAKER_PYTORCH_VERSION = "1.10.2"
+SAGEMAKER_PYTHON_VERSION = "py38"
+SAGEMAKER_TRANSFORMERS_VERSION = "4.17.0"
+SAGEMAKER_PARALLEL_EC2_INSTANCES = ["ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4dn.24xlarge"]
+FSDP_SHARDING_STRATEGY = ["FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD", "HYBRID_SHARD", "HYBRID_SHARD_ZERO2"]
+FSDP_AUTO_WRAP_POLICY = ["TRANSFORMER_BASED_WRAP", "SIZE_BASED_WRAP", "NO_WRAP"]
+FSDP_BACKWARD_PREFETCH = ["BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"]
+FSDP_STATE_DICT_TYPE = ["FULL_STATE_DICT", "LOCAL_STATE_DICT", "SHARDED_STATE_DICT"]
+FSDP2_STATE_DICT_TYPE = ["SHARDED_STATE_DICT", "FULL_STATE_DICT"]
+FSDP_PYTORCH_VERSION = (
+    "2.1.0.a0+32f93b1"  # Technically should be 2.1.0, but MS-AMP uses this specific prerelease in their Docker image.
+)
+FSDP2_PYTORCH_VERSION = "2.6.0"
+FSDP_MODEL_NAME = "pytorch_model_fsdp"
+DEEPSPEED_MULTINODE_LAUNCHERS = ["pdsh", "standard", "openmpi", "mvapich", "mpich", "nossh", "slurm"]
+TORCH_DYNAMO_MODES = ["default", "reduce-overhead", "max-autotune"]
+ELASTIC_LOG_LINE_PREFIX_TEMPLATE_PYTORCH_VERSION = "2.2.0"
+XPU_PROFILING_AVAILABLE_PYTORCH_VERSION = "2.4.0"
+MITA_PROFILING_AVAILABLE_PYTORCH_VERSION = "2.1.0"
+BETA_TP_AVAILABLE_PYTORCH_VERSION = "2.3.0"
+BETA_TP_AVAILABLE_TRANSFORMERS_VERSION = "4.52.0"
+BETA_CP_AVAILABLE_PYTORCH_VERSION = "2.6.0"
+BETA_SP_AVAILABLE_DEEPSPEED_VERSION = "0.18.2"
+STR_OPERATION_TO_FUNC = {">": op.gt, ">=": op.ge, "==": op.eq, "!=": op.ne, "<=": op.le, "<": op.lt}
+# These are the args for `torch.distributed.launch` for pytorch < 1.9
+TORCH_LAUNCH_PARAMS = [
+    "nnodes",
+    "nproc_per_node",
+    "rdzv_backend",
+    "rdzv_endpoint",
+    "rdzv_id",
+    "rdzv_conf",
+    "standalone",
+    "max_restarts",
+    "monitor_interval",
+    "start_method",
+    "role",
+    "module",
+    "m",
+    "no_python",
+    "run_path",
+    "log_dir",
+    "r",
+    "redirects",
+    "t",
+    "tee",
+    "node_rank",
+    "master_addr",
+    "master_port",
+]
+CUDA_DISTRIBUTED_TYPES = ["DEEPSPEED", "MULTI_GPU", "FSDP", "MEGATRON_LM", "TP"]
+TORCH_DISTRIBUTED_OPERATION_TYPES = CUDA_DISTRIBUTED_TYPES + [
+    "MULTI_NPU",
+    "MULTI_MLU",
+    "MULTI_SDAA",
+    "MULTI_MUSA",
+    "MULTI_XPU",
+    "MULTI_CPU",
+    "MULTI_HPU",
+]
+SUPPORTED_PYTORCH_LAYERS_FOR_UPCASTING = (
+    torch.nn.Conv1d,
+    torch.nn.Conv2d,
+    torch.nn.Conv3d,
+    torch.nn.ConvTranspose1d,
+    torch.nn.ConvTranspose2d,
+    torch.nn.ConvTranspose3d,
+    torch.nn.Linear,
+)
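
Reviewer note: `STR_OPERATION_TO_FUNC` above maps comparison strings to functions from the `operator` module, and the version constants in this file are meant to be compared through such a table. The sketch below shows the idea using only the standard library. The helpers `parse_simple_version` and `compare_simple` are hypothetical names for illustration, and the naive tuple parsing is an assumption: it only handles purely numeric dotted versions and would reject a prerelease string such as the one in `FSDP_PYTORCH_VERSION`.

```python
import operator as op

# Same table as in the diff above.
STR_OPERATION_TO_FUNC = {">": op.gt, ">=": op.ge, "==": op.eq, "!=": op.ne, "<=": op.le, "<": op.lt}


def parse_simple_version(version):
    """Hypothetical helper: turn '2.6.0' into (2, 6, 0). Numeric parts only."""
    return tuple(int(part) for part in version.split("."))


def compare_simple(installed, operation, required):
    """Hypothetical helper: look up the operator by its string and apply it."""
    if operation not in STR_OPERATION_TO_FUNC:
        raise ValueError(f"`operation` must be one of {list(STR_OPERATION_TO_FUNC)}, received {operation}")
    func = STR_OPERATION_TO_FUNC[operation]
    return func(parse_simple_version(installed), parse_simple_version(required))


if __name__ == "__main__":
    # Would an installed torch 2.6.1 satisfy the FSDP2 minimum of 2.6.0?
    print(compare_simple("2.6.1", ">=", "2.6.0"))
```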

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/dataclasses.py ADDED

The diff for this file is too large to render. See raw diff

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/deepspeed.py ADDED

	@@ -0,0 +1,385 @@

+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import base64
+import json
+import os
+from copy import deepcopy
+from torch import optim
+from ..optimizer import AcceleratedOptimizer
+from ..scheduler import AcceleratedScheduler
+from .dataclasses import DistributedType
+from .imports import is_bnb_available
+from .versions import compare_versions
+def map_pytorch_optim_to_deepspeed(optimizer):
+    """
+    Args:
+        optimizer: torch.optim.Optimizer
+    Returns the DeepSpeedCPUOptimizer (deepspeed.ops) version of the optimizer.
+    """
+    defaults = {k: v for k, v in optimizer.defaults.items() if k in ["lr", "weight_decay"]}
+    # Select the DeepSpeedCPUOptimizer based on the original optimizer class.
+    # DeepSpeedCPUAdam is the default
+    from deepspeed.ops.adam import DeepSpeedCPUAdam
+    optimizer_class = DeepSpeedCPUAdam
+    # For DeepSpeedCPUAdam (adamw_mode)
+    if compare_versions("deepspeed", ">=", "0.3.1"):
+        defaults["adamw_mode"] = False
+        is_adaw = isinstance(optimizer, optim.AdamW)
+        if is_bnb_available() and not is_adaw:
+            import bitsandbytes.optim as bnb_opt
+            if isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)):
+                try:
+                    is_adaw = optimizer.optim_bits == 32
+                except AttributeError:
+                    is_adaw = optimizer.args.optim_bits == 32
+            else:
+                is_adaw = False
+        if is_adaw:
+            defaults["adamw_mode"] = True
+    # For DeepSpeedCPUAdagrad
+    if compare_versions("deepspeed", ">=", "0.5.5"):
+        # Check if the optimizer is PyTorch's Adagrad.
+        is_ada = isinstance(optimizer, optim.Adagrad)
+        # If not, and bitsandbytes is available,
+        # check if the optimizer is the 32-bit bitsandbytes Adagrad.
+        if is_bnb_available() and not is_ada:
+            import bitsandbytes.optim as bnb_opt
+            if isinstance(optimizer, (bnb_opt.Adagrad, bnb_opt.Adagrad32bit)):
+                try:
+                    is_ada = optimizer.optim_bits == 32
+                except AttributeError:
+                    is_ada = optimizer.args.optim_bits == 32
+        if is_ada:
+            from deepspeed.ops.adagrad import DeepSpeedCPUAdagrad
+            optimizer_class = DeepSpeedCPUAdagrad
+    # For DeepSpeedCPULion
+    if is_bnb_available(min_version="0.38.0") and compare_versions("deepspeed", ">=", "0.11.0"):
+        from bitsandbytes.optim import Lion, Lion32bit
+        if isinstance(optimizer, (Lion, Lion32bit)):
+            try:
+                is_bnb_32bits = optimizer.optim_bits == 32
+            except AttributeError:
+                is_bnb_32bits = optimizer.args.optim_bits == 32
+            if is_bnb_32bits:
+                from deepspeed.ops.lion import DeepSpeedCPULion
+                optimizer_class = DeepSpeedCPULion
+    return optimizer_class(optimizer.param_groups, **defaults)
+def get_active_deepspeed_plugin(state):
+    """
+    Returns the currently active DeepSpeedPlugin.
+    Raises:
+        ValueError: If DeepSpeed was not enabled and this function is called.
+    """
+    if state.distributed_type != DistributedType.DEEPSPEED:
+        raise ValueError(
+            "Couldn't retrieve the active `DeepSpeedPlugin` as none were enabled. "
+            "Please make sure that either `Accelerator` is configured for `deepspeed` "
+            "or make sure that the desired `DeepSpeedPlugin` has been enabled (`AcceleratorState().select_deepspeed_plugin(name)`) "
+            "before calling this function."
+        )
+    if not isinstance(state.deepspeed_plugins, dict):
+        return state.deepspeed_plugins
+    return next(plugin for plugin in state.deepspeed_plugins.values() if plugin.selected)
+class HfDeepSpeedConfig:
+    """
+    This object contains a DeepSpeed configuration dictionary and can be quickly queried for things like zero stage.
+    A `weakref` of this object is stored in the module's globals to be able to access the config from areas where
+    things like the Trainer object are not available (e.g. `from_pretrained` and `_get_resized_embeddings`). Therefore
+    it's important that this object remains alive while the program is still running.
+    [`Trainer`] uses the `HfTrainerDeepSpeedConfig` subclass instead. That subclass has logic to sync the configuration
+    with values of [`TrainingArguments`] by replacing special placeholder values: `"auto"`. Without this special logic
+    the DeepSpeed configuration is not modified in any way.
+    Args:
+        config_file_or_dict (`Union[str, Dict]`): path to DeepSpeed config file or dict.
+    """
+    def __init__(self, config_file_or_dict):
+        if isinstance(config_file_or_dict, dict):
+            # Don't modify user's data should they want to reuse it (e.g. in tests), because once we
+            # modified it, it will not be accepted here again, since `auto` values would have been overridden
+            config = deepcopy(config_file_or_dict)
+        elif os.path.exists(config_file_or_dict):
+            with open(config_file_or_dict, encoding="utf-8") as f:
+                config = json.load(f)
+        else:
+            try:
+                try:
+                    # First try parsing as JSON directly
+                    config = json.loads(config_file_or_dict)
+                except json.JSONDecodeError:
+                    # If that fails, try base64 decoding
+                    config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
+                    config = json.loads(config_decoded)
+            except (UnicodeDecodeError, AttributeError, ValueError):
+                raise ValueError(
+                    f"Expected a string path to an existing deepspeed config, or a dictionary, or a base64 encoded string. Received: {config_file_or_dict}"
+                )
+        self.config = config
+        self.set_stage_and_offload()
+    def set_stage_and_offload(self):
+        # zero stage - this is done as early as possible, before model is created, to allow
+        # ``is_deepspeed_zero3_enabled`` query and getting to the early deepspeed config object
+        # during ``zero.Init()`` which needs to know the dtype, and some other hparams.
+        self._stage = self.get_value("zero_optimization.stage", -1)
+        # offload
+        self._offload = False
+        if self.is_zero2() or self.is_zero3():
+            offload_devices_valid = set(["cpu", "nvme"])
+            offload_devices = set(
+                [
+                    self.get_value("zero_optimization.offload_optimizer.device"),
+                    self.get_value("zero_optimization.offload_param.device"),
+                ]
+            )
+            if len(offload_devices & offload_devices_valid) > 0:
+                self._offload = True
+    def find_config_node(self, ds_key_long):
+        config = self.config
+        # find the config node of interest if it exists
+        nodes = ds_key_long.split(".")
+        ds_key = nodes.pop()
+        for node in nodes:
+            config = config.get(node)
+            if config is None:
+                return None, ds_key
+        return config, ds_key
+    def get_value(self, ds_key_long, default=None):
+        """
+        Returns the set value or `default` if no value is set
+        """
+        config, ds_key = self.find_config_node(ds_key_long)
+        if config is None:
+            return default
+        return config.get(ds_key, default)
+    def del_config_sub_tree(self, ds_key_long, must_exist=False):
+        """
+        Deletes a sub-section of the config file if it's found.
+        Unless `must_exist` is `True` the section doesn't have to exist.
+        """
+        config = self.config
+        # find the config node of interest if it exists
+        nodes = ds_key_long.split(".")
+        for node in nodes:
+            parent_config = config
+            config = config.get(node)
+            if config is None:
+                if must_exist:
+                    raise ValueError(f"Can't find {ds_key_long} entry in the config: {self.config}")
+                else:
+                    return
+        # if found remove it
+        if parent_config is not None:
+            parent_config.pop(node)
+    def is_true(self, ds_key_long):
+        """
+        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
+        specific question of whether the value is set to `True` (and it's not set to `False` or isn't set).
+        """
+        value = self.get_value(ds_key_long)
+        return False if value is None else bool(value)
+    def is_false(self, ds_key_long):
+        """
+        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
+        specific question of whether the value is set to `False` (and it's not set to `True` or isn't set).
+        """
+        value = self.get_value(ds_key_long)
+        return False if value is None else not bool(value)
+    def is_zero2(self):
+        return self._stage == 2
+    def is_zero3(self):
+        return self._stage == 3
+    def is_offload(self):
+        return self._offload
+class DeepSpeedEngineWrapper:
+    """
+    Internal wrapper for deepspeed.runtime.engine.DeepSpeedEngine. This is used to follow conventional training loop.
+    Args:
+        engine (deepspeed.runtime.engine.DeepSpeedEngine): deepspeed engine to wrap
+    """
+    def __init__(self, engine):
+        self.engine = engine
+    def backward(self, loss, sync_gradients=True, **kwargs):
+        # Set gradient accumulation boundary based on Accelerate's sync_gradients state
+        # This tells DeepSpeed whether this is the final micro-batch before gradient sync
+        self.engine.set_gradient_accumulation_boundary(is_boundary=sync_gradients)
+        # runs backpropagation and handles mixed precision
+        self.engine.backward(loss, **kwargs)
+        # Only perform step and related operations at gradient accumulation boundaries
+        if sync_gradients:
+            # Deepspeed's `engine.step` performs the following operations:
+            # - gradient accumulation check
+            # - gradient clipping
+            # - optimizer step
+            # - zero grad
+            # - checking overflow
+            # - lr_scheduler step (only if engine.lr_scheduler is not None)
+            self.engine.step()
+        # and this plugin overrides the above calls with no-ops when Accelerate runs under
+        # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabling a simple
+        # training loop that works transparently under many training regimes.
+    def get_global_grad_norm(self):
+        """Get the global gradient norm from DeepSpeed engine."""
+        grad_norm = self.engine.get_global_grad_norm()
+        # Convert to scalar if it's a tensor
+        if hasattr(grad_norm, "item"):
+            return grad_norm.item()
+        return grad_norm
+class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
+    """
+    Internal wrapper around a deepspeed optimizer.
+    Args:
+        optimizer (`torch.optim.optimizer.Optimizer`):
+            The optimizer to wrap.
+    """
+    def __init__(self, optimizer):
+        super().__init__(optimizer, device_placement=False, scaler=None)
+        self.__has_overflow__ = hasattr(self.optimizer, "overflow")
+    def zero_grad(self, set_to_none=None):
+        pass  # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
+    def step(self):
+        pass  # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
+    @property
+    def step_was_skipped(self):
+        """Whether or not the optimizer step was done, or skipped because of gradient overflow."""
+        if self.__has_overflow__:
+            return self.optimizer.overflow
+        return False
+class DeepSpeedSchedulerWrapper(AcceleratedScheduler):
+    """
+    Internal wrapper around a deepspeed scheduler.
+    Args:
+        scheduler (`torch.optim.lr_scheduler.LambdaLR`):
+            The scheduler to wrap.
+        optimizers (one or a list of `torch.optim.Optimizer`):
+    """
+    def __init__(self, scheduler, optimizers):
+        super().__init__(scheduler, optimizers)
+    def step(self):
+        pass  # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
+class DummyOptim:
+    """
+    Dummy optimizer that holds model parameters or param groups. It is primarily used to follow the conventional
+    training loop when the optimizer config is specified in the DeepSpeed config file.
+    Args:
+        lr (float):
+            Learning rate.
+        params (iterable): iterable of parameters to optimize or dicts defining
+            parameter groups
+        weight_decay (float):
+            Weight decay.
+        **kwargs (additional keyword arguments, *optional*):
+            Other arguments.
+    """
+    def __init__(self, params, lr=0.001, weight_decay=0, **kwargs):
+        self.params = params
+        self.lr = lr
+        self.weight_decay = weight_decay
+        self.kwargs = kwargs
+class DummyScheduler:
+    """
+    Dummy scheduler that holds an optimizer and the schedule settings. It is primarily used to follow the
+    conventional training loop when the scheduler config is specified in the DeepSpeed config file.
+    Args:
+        optimizer (`torch.optim.optimizer.Optimizer`):
+            The optimizer to wrap.
+        total_num_steps (int, *optional*):
+            Total number of steps.
+        warmup_num_steps (int, *optional*):
+            Number of steps for warmup.
+        lr_scheduler_callable (callable, *optional*):
+            A callable function that creates an LR Scheduler. It accepts only one argument `optimizer`.
+        **kwargs (additional keyword arguments, *optional*):
+            Other arguments.
+    """
+    def __init__(self, optimizer, total_num_steps=None, warmup_num_steps=0, lr_scheduler_callable=None, **kwargs):
+        self.optimizer = optimizer
+        self.total_num_steps = total_num_steps
+        self.warmup_num_steps = warmup_num_steps
+        self.lr_scheduler_callable = lr_scheduler_callable
+        self.kwargs = kwargs
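
The placeholder classes above only store hyperparameters; they do no optimization themselves. A minimal sketch of how such placeholders could feed values into a DeepSpeed-style config follows. The two classes are trimmed standalone copies of the ones in the diff so the snippet runs without `accelerate` or DeepSpeed installed. The `fill_auto_values` helper and the config layout are hypothetical and only illustrate the idea; they are not the library's actual resolution logic.

```python
class DummyOptim:
    """Trimmed standalone copy of the placeholder optimizer shown in the diff."""

    def __init__(self, params, lr=0.001, weight_decay=0, **kwargs):
        self.params = params
        self.lr = lr
        self.weight_decay = weight_decay
        self.kwargs = kwargs


class DummyScheduler:
    """Trimmed standalone copy of the placeholder scheduler shown in the diff."""

    def __init__(self, optimizer, total_num_steps=None, warmup_num_steps=0, lr_scheduler_callable=None, **kwargs):
        self.optimizer = optimizer
        self.total_num_steps = total_num_steps
        self.warmup_num_steps = warmup_num_steps
        self.lr_scheduler_callable = lr_scheduler_callable
        self.kwargs = kwargs


def fill_auto_values(config, optimizer, scheduler):
    """Hypothetical helper: replace "auto" entries with values held by the placeholders."""
    values = {
        "lr": optimizer.lr,
        "weight_decay": optimizer.weight_decay,
        "total_num_steps": scheduler.total_num_steps,
        "warmup_num_steps": scheduler.warmup_num_steps,
    }
    resolved = {}
    for section, params in config.items():
        # Only substitute keys we know about; everything else is passed through untouched.
        resolved[section] = {
            key: values[key] if value == "auto" and key in values else value
            for key, value in params.items()
        }
    return resolved


# Hypothetical config layout, loosely modelled on a config file with "auto" entries.
config = {
    "optimizer": {"lr": "auto", "weight_decay": "auto"},
    "scheduler": {"total_num_steps": "auto", "warmup_num_steps": "auto"},
}
optimizer = DummyOptim(params=[], lr=3e-4, weight_decay=0.01)
scheduler = DummyScheduler(optimizer, total_num_steps=1000, warmup_num_steps=100)
resolved = fill_auto_values(config, optimizer, scheduler)
print(resolved)
```

The point of the design is that the training script keeps its usual shape (create an optimizer, create a scheduler, pass both on), while the real optimizer and scheduler are built elsewhere from the config file.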

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/environment.py ADDED

	@@ -0,0 +1,471 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import math
+import os
+import platform
+import subprocess
+import sys
+from contextlib import contextmanager
+from dataclasses import dataclass, field
+from functools import lru_cache, wraps
+from shutil import which
+from typing import Optional, Union
+import torch
+from packaging.version import parse
+logger = logging.getLogger(__name__)
+def convert_dict_to_env_variables(current_env: dict):
+    """
+    Verifies that the keys and values in `current_env` are non-empty and contain no forbidden characters, and returns
+    the valid entries as a list of `KEY=value` strings, each terminated by a newline.
+    Example:
+    ```python
+    >>> from accelerate.utils.environment import convert_dict_to_env_variables
+    >>> env = {"ACCELERATE_DEBUG_MODE": "1", "BAD_ENV_NAME": "<mything", "OTHER_ENV": "2"}
+    >>> valid_env_items = convert_dict_to_env_variables(env)
+    >>> print(valid_env_items)
+    ["ACCELERATE_DEBUG_MODE=1\n", "OTHER_ENV=2\n"]
+    ```
+    """
+    forbidden_chars = [";", "\n", "<", ">", " "]
+    valid_env_items = []
+    for key, value in current_env.items():
+        if all(char not in (key + value) for char in forbidden_chars) and len(key) >= 1 and len(value) >= 1:
+            valid_env_items.append(f"{key}={value}\n")
+        else:
+            logger.warning(f"WARNING: Skipping {key}={value} as it contains forbidden characters or missing values.")
+    return valid_env_items
+def str_to_bool(value, to_bool: bool = False) -> Union[int, bool]:
+    """
+    Converts a string representation of truth to `True` (1) or `False` (0).
+    True values are `y`, `yes`, `t`, `true`, `on`, and `1`; False values are `n`, `no`, `f`, `false`, `off`, and `0`.
+    Any other value raises a `ValueError`.
+    """
+    value = value.lower()
+    if value in ("y", "yes", "t", "true", "on", "1"):
+        return 1 if not to_bool else True
+    elif value in ("n", "no", "f", "false", "off", "0"):
+        return 0 if not to_bool else False
+    else:
+        raise ValueError(f"invalid truth value {value}")
+def get_int_from_env(env_keys, default):
+    """Returns the first non-negative env value found in the `env_keys` list or the default."""
+    for e in env_keys:
+        val = int(os.environ.get(e, -1))
+        if val >= 0:
+            return val
+    return default
+def parse_flag_from_env(key, default=False):
+    """Returns truthy value for `key` from the env if available else the default."""
+    value = os.environ.get(key, str(default))
+    return str_to_bool(value) == 1  # As its name indicates `str_to_bool` actually returns an int...
+def parse_choice_from_env(key, default="no"):
+    value = os.environ.get(key, str(default))
+    return value
+def are_libraries_initialized(*library_names: str) -> list[str]:
+    """
+    Checks if any of `library_names` are imported in the environment. Will return any names that are.
+    """
+    return [lib_name for lib_name in library_names if lib_name in sys.modules.keys()]
+def get_current_device_type() -> tuple[str, str]:
+    """
+    Determines the current device type and distributed type without initializing any device.
+    This is particularly important when using fork-based multiprocessing, as device initialization
+    before forking can cause errors.
+    The device detection order follows the same priority as state.py:_prepare_backend():
+    MLU -> SDAA -> MUSA -> NPU -> HPU -> CUDA -> XPU
+    Returns:
+        tuple[str, str]: A tuple of (device_type, distributed_type)
+            - device_type: The device string (e.g., "cuda", "npu", "xpu")
+            - distributed_type: The distributed type string (e.g., "MULTI_GPU", "MULTI_NPU")
+    Example:
+        ```python
+        >>> device_type, distributed_type = get_current_device_type()
+        >>> print(device_type)  # "cuda"
+        >>> print(distributed_type)  # "MULTI_GPU"
+        ```
+    """
+    from .imports import (
+        is_hpu_available,
+        is_mlu_available,
+        is_musa_available,
+        is_npu_available,
+        is_sdaa_available,
+        is_xpu_available,
+    )
+    if is_mlu_available():
+        return "mlu", "MULTI_MLU"
+    elif is_sdaa_available():
+        return "sdaa", "MULTI_SDAA"
+    elif is_musa_available():
+        return "musa", "MULTI_MUSA"
+    elif is_npu_available():
+        return "npu", "MULTI_NPU"
+    elif is_hpu_available():
+        return "hpu", "MULTI_HPU"
+    elif torch.cuda.is_available():
+        return "cuda", "MULTI_GPU"
+    elif is_xpu_available():
+        return "xpu", "MULTI_XPU"
+    else:
+        # Default to CUDA even if not available (for CPU-only scenarios where CUDA code paths are still used)
+        return "cuda", "MULTI_GPU"
+def _nvidia_smi():
+    """
+    Returns the right nvidia-smi command based on the system.
+    """
+    if platform.system() == "Windows":
+        # If the platform is Windows and nvidia-smi can't be found in PATH,
+        # fall back to the default installation path on the system drive
+        command = which("nvidia-smi")
+        if command is None:
+            command = f"{os.environ['systemdrive']}\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"
+    else:
+        command = "nvidia-smi"
+    return command
+def get_gpu_info():
+    """
+    Gets GPU count and names using `nvidia-smi` instead of torch to not initialize CUDA.
+    Largely based on the `gputil` library.
+    """
+    # nvidia-smi prints one line per GPU, so the number of lines is the GPU count
+    output = subprocess.check_output(
+        [_nvidia_smi(), "--query-gpu=count,name", "--format=csv,noheader"], universal_newlines=True
+    )
+    output = output.strip()
+    gpus = output.split(os.linesep)
+    # Get names from output
+    gpu_count = len(gpus)
+    gpu_names = [gpu.split(",")[1].strip() for gpu in gpus]
+    return gpu_names, gpu_count
+def get_driver_version():
+    """
+    Returns the NVIDIA driver version reported by `nvidia-smi`.
+    In the case of multiple GPUs, will return the first.
+    """
+    output = subprocess.check_output(
+        [_nvidia_smi(), "--query-gpu=driver_version", "--format=csv,noheader"], universal_newlines=True
+    )
+    output = output.strip()
+    return output.split(os.linesep)[0]
+def check_cuda_p2p_ib_support():
+    """
+    Checks if the devices being used have issues with P2P and IB communications, namely any consumer GPU hardware after
+    the 3090.
+    Notably uses `nvidia-smi` instead of torch to not initialize CUDA.
+    """
+    try:
+        device_names, device_count = get_gpu_info()
+        # As new consumer GPUs get released, add them to `unsupported_devices`
+        unsupported_devices = {"RTX 40"}
+        if device_count > 1:
+            if any(
+                unsupported_device in device_name
+                for device_name in device_names
+                for unsupported_device in unsupported_devices
+            ):
+                # Check if they have the right driver version
+                acceptable_driver_version = "550.40.07"
+                current_driver_version = get_driver_version()
+                if parse(current_driver_version) < parse(acceptable_driver_version):
+                    return False
+                return True
+    except Exception:
+        pass
+    return True
+@lru_cache
+def check_cuda_fp8_capability():
+    """
+    Checks if the currently available GPU supports FP8 (compute capability 8.9 or higher).
+    Notably might initialize `torch.cuda` to check.
+    """
+    try:
+        # try to get the compute capability from nvidia-smi
+        output = subprocess.check_output(
+            [_nvidia_smi(), "--query-gpu=compute_capability", "--format=csv,noheader"], universal_newlines=True
+        )
+        output = output.strip()
+        # we take the first GPU's compute capability
+        compute_capability = tuple(map(int, output.split(os.linesep)[0].split(".")))
+    except Exception:
+        compute_capability = torch.cuda.get_device_capability()
+    return compute_capability >= (8, 9)
+@dataclass
+class CPUInformation:
+    """
+    Stores information about the CPU in a distributed environment. It contains the following attributes:
+    - rank: The rank of the current process.
+    - world_size: The total number of processes in the world.
+    - local_rank: The rank of the current process on the local node.
+    - local_world_size: The total number of processes on the local node.
+    """
+    rank: int = field(default=0, metadata={"help": "The rank of the current process."})
+    world_size: int = field(default=1, metadata={"help": "The total number of processes in the world."})
+    local_rank: int = field(default=0, metadata={"help": "The rank of the current process on the local node."})
+    local_world_size: int = field(default=1, metadata={"help": "The total number of processes on the local node."})
+def get_cpu_distributed_information() -> CPUInformation:
+    """
+    Returns various information about the environment in relation to CPU distributed training as a `CPUInformation`
+    dataclass.
+    """
+    information = {}
+    information["rank"] = get_int_from_env(["RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK", "MV2_COMM_WORLD_RANK"], 0)
+    information["world_size"] = get_int_from_env(
+        ["WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE"], 1
+    )
+    information["local_rank"] = get_int_from_env(
+        ["LOCAL_RANK", "MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK", "MV2_COMM_WORLD_LOCAL_RANK"], 0
+    )
+    information["local_world_size"] = get_int_from_env(
+        ["LOCAL_WORLD_SIZE", "MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"],
+        1,
+    )
+    return CPUInformation(**information)
+def override_numa_affinity(local_process_index: int, verbose: Optional[bool] = None) -> None:
+    """
+    Overrides whatever NUMA affinity is set for the current process. This is very taxing and requires recalculating the
+    affinity to set; ideally you should use `utils.environment.set_numa_affinity` instead.
+    Args:
+        local_process_index (int):
+            The index of the current process on the current server.
+        verbose (bool, *optional*):
+            Whether to log out the assignment of each CPU. If `ACCELERATE_DEBUG_MODE` is enabled, will default to True.
+    """
+    if verbose is None:
+        verbose = parse_flag_from_env("ACCELERATE_DEBUG_MODE", False)
+    if torch.cuda.is_available():
+        from accelerate.utils import is_pynvml_available
+        if not is_pynvml_available():
+            raise ImportError(
+                "To set CPU affinity on CUDA GPUs the `nvidia-ml-py` package must be available. (`pip install nvidia-ml-py`)"
+            )
+        import pynvml as nvml
+        # The below code is based on https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py
+        nvml.nvmlInit()
+        num_elements = math.ceil(os.cpu_count() / 64)
+        handle = nvml.nvmlDeviceGetHandleByIndex(local_process_index)
+        affinity_string = ""
+        for j in nvml.nvmlDeviceGetCpuAffinity(handle, num_elements):
+            # assume nvml returns list of 64 bit ints
+            affinity_string = f"{j:064b}{affinity_string}"
+        affinity_list = [int(x) for x in affinity_string]
+        affinity_list.reverse()  # so core 0 is the 0th element
+        affinity_to_set = [i for i, e in enumerate(affinity_list) if e != 0]
+        os.sched_setaffinity(0, affinity_to_set)
+        if verbose:
+            cpu_cores = os.sched_getaffinity(0)
+            logger.info(f"Assigning {len(cpu_cores)} cpu cores to process {local_process_index}: {cpu_cores}")
+@lru_cache
+def set_numa_affinity(local_process_index: int, verbose: Optional[bool] = None) -> None:
+    """
+    Assigns the current process to a specific NUMA node. This is most efficient with at least 2 CPUs per node.
+    The result is cached between calls. If you want to override it, please use
+    `accelerate.utils.environment.override_numa_affinity`.
+    Args:
+        local_process_index (int):
+            The index of the current process on the current server.
+        verbose (bool, *optional*):
+            Whether to print the new cpu cores assignment for each process. If `ACCELERATE_DEBUG_MODE` is enabled, will
+            default to True.
+    """
+    override_numa_affinity(local_process_index=local_process_index, verbose=verbose)
+@contextmanager
+def clear_environment():
+    """
+    A context manager that will temporarily clear environment variables.
+    When this context exits, the previous environment variables are restored.
+    Example:
+    ```python
+    >>> import os
+    >>> from accelerate.utils import clear_environment
+    >>> os.environ["FOO"] = "bar"
+    >>> with clear_environment():
+    ...     print(os.environ)
+    ...     os.environ["FOO"] = "new_bar"
+    ...     print(os.environ["FOO"])
+    {}
+    new_bar
+    >>> print(os.environ["FOO"])
+    bar
+    ```
+    """
+    _old_os_environ = os.environ.copy()
+    os.environ.clear()
+    try:
+        yield
+    finally:
+        os.environ.clear()  # clear any added keys,
+        os.environ.update(_old_os_environ)  # then restore previous environment
+@contextmanager
+def patch_environment(**kwargs):
+    """
+    A context manager that will add each keyword argument passed to `os.environ` and remove them when exiting.
+    Will convert the values in `kwargs` to strings and upper-case all the keys.
+    Example:
+    ```python
+    >>> import os
+    >>> from accelerate.utils import patch_environment
+    >>> with patch_environment(FOO="bar"):
+    ...     print(os.environ["FOO"])  # prints "bar"
+    >>> print(os.environ["FOO"])  # raises KeyError
+    ```
+    """
+    existing_vars = {}
+    for key, value in kwargs.items():
+        key = key.upper()
+        if key in os.environ:
+            existing_vars[key] = os.environ[key]
+        os.environ[key] = str(value)
+    try:
+        yield
+    finally:
+        for key in kwargs:
+            key = key.upper()
+            if key in existing_vars:
+                # restore previous value
+                os.environ[key] = existing_vars[key]
+            else:
+                os.environ.pop(key, None)
+def purge_accelerate_environment(func_or_cls):
+    """Decorator to clean up accelerate environment variables set by the decorated class or function.
+    In some circumstances, calling certain classes or functions can result in accelerate env vars being set and not
+    being cleaned up afterwards. As an example, when calling:
+    TrainingArguments(fp16=True, ...)
+    The following env var will be set:
+    ACCELERATE_MIXED_PRECISION=fp16
+    This can affect subsequent code, since the env var takes precedence over TrainingArguments(fp16=False). This is
+    especially relevant for unit testing, where we want to prevent individual tests from having side effects on one
+    another. Decorate the unit test function or whole class with this decorator to ensure that after each test, the env
+    vars are cleaned up. This works for both unittest.TestCase and normal classes (pytest); it also works when
+    decorating the parent class.
+    """
+    prefix = "ACCELERATE_"
+    @contextmanager
+    def env_var_context():
+        # Store existing accelerate env vars
+        existing_vars = {k: v for k, v in os.environ.items() if k.startswith(prefix)}
+        try:
+            yield
+        finally:
+            # Restore original env vars or remove new ones
+            for key in [k for k in os.environ if k.startswith(prefix)]:
+                if key in existing_vars:
+                    os.environ[key] = existing_vars[key]
+                else:
+                    os.environ.pop(key, None)
+    def wrap_function(func):
+        @wraps(func)
+        def wrapper(*args, **kwargs):
+            with env_var_context():
+                return func(*args, **kwargs)
+        wrapper._accelerate_is_purged_environment_wrapped = True
+        return wrapper
+    if not isinstance(func_or_cls, type):
+        return wrap_function(func_or_cls)
+    # Handle classes by wrapping test methods
+    def wrap_test_methods(test_class_instance):
+        for name in dir(test_class_instance):
+            if name.startswith("test"):
+                method = getattr(test_class_instance, name)
+                if callable(method) and not hasattr(method, "_accelerate_is_purged_environment_wrapped"):
+                    setattr(test_class_instance, name, wrap_function(method))
+        return test_class_instance
+    # Handle inheritance
+    wrap_test_methods(func_or_cls)
+    func_or_cls.__init_subclass__ = classmethod(lambda cls, **kw: wrap_test_methods(cls))
+    return func_or_cls
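
The flag parsing and temporary-patching helpers in `environment.py` are easiest to follow together. The sketch below uses trimmed standalone copies of `str_to_bool`, `parse_flag_from_env`, and `patch_environment` as they appear in the diff, so it runs with the standard library alone; the variable names `DEMO_FLAG` and `DEMO_KEEP` are made up for illustration. It shows that keyword names are upper-cased, that an unset flag falls back to the default, and that pre-existing values are restored on exit.

```python
import os
from contextlib import contextmanager


def str_to_bool(value):
    """Standalone copy of the truth-value parsing from the diff (returns 1 or 0)."""
    value = value.lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return 1
    if value in ("n", "no", "f", "false", "off", "0"):
        return 0
    raise ValueError(f"invalid truth value {value}")


def parse_flag_from_env(key, default=False):
    """Standalone copy: read `key` from the environment and interpret it as a flag."""
    return str_to_bool(os.environ.get(key, str(default))) == 1


@contextmanager
def patch_environment(**kwargs):
    """Standalone copy: temporarily set upper-cased keys, then restore or remove them."""
    existing_vars = {}
    for key, value in kwargs.items():
        key = key.upper()
        if key in os.environ:
            existing_vars[key] = os.environ[key]
        os.environ[key] = str(value)
    try:
        yield
    finally:
        for key in kwargs:
            key = key.upper()
            if key in existing_vars:
                os.environ[key] = existing_vars[key]  # restore the previous value
            else:
                os.environ.pop(key, None)  # the key did not exist before


# Start from a known state; both names are illustrative only.
os.environ.pop("DEMO_FLAG", None)
os.environ["DEMO_KEEP"] = "original"

print(parse_flag_from_env("DEMO_FLAG"))  # unset, so the default is used
with patch_environment(demo_flag="on", demo_keep="temporary"):
    print(parse_flag_from_env("DEMO_FLAG"), os.environ["DEMO_KEEP"])
print("DEMO_FLAG" in os.environ, os.environ["DEMO_KEEP"])
```

Because the restore logic lives in a `finally` block, the environment is put back even if the body of the `with` statement raises.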

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/fsdp_utils.py ADDED

	@@ -0,0 +1,829 @@

+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import copy
+import functools
+import os
+import re
+import shutil
+import warnings
+from collections import defaultdict
+from collections.abc import Iterable
+from contextlib import nullcontext
+from pathlib import Path
+from typing import Callable, Union
+import torch
+from ..logging import get_logger
+from .constants import FSDP_MODEL_NAME, OPTIMIZER_NAME, SAFE_WEIGHTS_NAME, WEIGHTS_NAME
+from .dataclasses import get_module_class_from_name
+from .modeling import get_non_persistent_buffers, is_peft_model
+from .other import get_module_children_bottom_up, is_compiled_module, save
+from .versions import is_torch_version
+logger = get_logger(__name__)
+def enable_fsdp_ram_efficient_loading():
+    """
+    Enables RAM efficient loading of Hugging Face models for FSDP in the environment.
+    """
+    # Sets values for `transformers.modeling_utils.is_fsdp_enabled`
+    if "ACCELERATE_USE_FSDP" not in os.environ:
+        os.environ["ACCELERATE_USE_FSDP"] = "True"
+    os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "True"
+def disable_fsdp_ram_efficient_loading():
+    """
+    Disables RAM efficient loading of Hugging Face models for FSDP in the environment.
+    """
+    os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "False"
+def _get_model_state_dict(model, adapter_only=False, sd_options=None):
+    if adapter_only and is_peft_model(model):
+        from peft import get_peft_model_state_dict
+        return get_peft_model_state_dict(model, adapter_name=model.active_adapter)
+    # Invariant: `sd_options` is not None only for FSDP2
+    if sd_options is not None:
+        from torch.distributed.checkpoint.state_dict import get_model_state_dict
+        return get_model_state_dict(model, options=sd_options)
+    else:
+        return model.state_dict()
+def _set_model_state_dict(model, state_dict, adapter_only=False, sd_options=None):
+    if adapter_only and is_peft_model(model):
+        from peft import set_peft_model_state_dict
+        return set_peft_model_state_dict(model, state_dict, adapter_name=model.active_adapter)
+    # Invariant: `sd_options` is not None only for FSDP2
+    if sd_options is not None:
+        from torch.distributed.checkpoint.state_dict import set_model_state_dict
+        return set_model_state_dict(model, state_dict, options=sd_options)
+    else:
+        return model.load_state_dict(state_dict)
+def _prepare_sd_options(fsdp_plugin):
+    sd_options = None
+    # we use this only for FSDP2, as it requires torch >= 2.6.0 and this api requires torch >= 2.2.0
+    if fsdp_plugin.fsdp_version == 2:
+        from torch.distributed.checkpoint.state_dict import StateDictOptions
+        from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
+        sd_options = StateDictOptions(
+            full_state_dict=fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT,
+            cpu_offload=getattr(fsdp_plugin.state_dict_config, "offload_to_cpu", False),
+            broadcast_from_rank0=getattr(fsdp_plugin.state_dict_config, "rank0_only", False),
+        )
+    return sd_options
+def save_fsdp_model(fsdp_plugin, accelerator, model, output_dir, model_index=0, adapter_only=False):
+    # Note: We import here to reduce import time from general modules, and isolate outside dependencies
+    import torch.distributed.checkpoint as dist_cp
+    from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
+    from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+    from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
+    os.makedirs(output_dir, exist_ok=True)
+    if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+        # FSDP raises error when single GPU is used with `offload_to_cpu=True` for FULL_STATE_DICT
+        # so, only enable it when num_processes>1
+        is_multi_process = accelerator.num_processes > 1
+        fsdp_plugin.state_dict_config.offload_to_cpu = is_multi_process
+        fsdp_plugin.state_dict_config.rank0_only = is_multi_process
+    ctx = (
+        FSDP.state_dict_type(
+            model, fsdp_plugin.state_dict_type, fsdp_plugin.state_dict_config, fsdp_plugin.optim_state_dict_config
+        )
+        if fsdp_plugin.fsdp_version == 1
+        else nullcontext()
+    )
+    sd_options = _prepare_sd_options(fsdp_plugin)
+    with ctx:
+        state_dict = _get_model_state_dict(model, adapter_only=adapter_only, sd_options=sd_options)
+        if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+            weights_name = f"{FSDP_MODEL_NAME}.bin" if model_index == 0 else f"{FSDP_MODEL_NAME}_{model_index}.bin"
+            output_model_file = os.path.join(output_dir, weights_name)
+            if accelerator.process_index == 0:
+                logger.info(f"Saving model to {output_model_file}")
+                torch.save(state_dict, output_model_file)
+                logger.info(f"Model saved to {output_model_file}")
+        # Invariant: `LOCAL_STATE_DICT` is never possible with `FSDP2`
+        elif fsdp_plugin.state_dict_type == StateDictType.LOCAL_STATE_DICT:
+            weights_name = (
+                f"{FSDP_MODEL_NAME}_rank{accelerator.process_index}.bin"
+                if model_index == 0
+                else f"{FSDP_MODEL_NAME}_{model_index}_rank{accelerator.process_index}.bin"
+            )
+            output_model_file = os.path.join(output_dir, weights_name)
+            logger.info(f"Saving model to {output_model_file}")
+            torch.save(state_dict, output_model_file)
+            logger.info(f"Model saved to {output_model_file}")
+        elif fsdp_plugin.state_dict_type == StateDictType.SHARDED_STATE_DICT:
+            ckpt_dir = os.path.join(output_dir, f"{FSDP_MODEL_NAME}_{model_index}")
+            os.makedirs(ckpt_dir, exist_ok=True)
+            logger.info(f"Saving model to {ckpt_dir}")
+            state_dict = {"model": state_dict}
+            dist_cp.save(
+                state_dict=state_dict,
+                storage_writer=dist_cp.FileSystemWriter(ckpt_dir),
+                planner=DefaultSavePlanner(),
+            )
+            logger.info(f"Model saved to {ckpt_dir}")
+def load_fsdp_model(fsdp_plugin, accelerator, model, input_dir, model_index=0, adapter_only=False):
+    # Note: We import here to reduce import time from general modules, and isolate outside dependencies
+    import torch.distributed.checkpoint as dist_cp
+    from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
+    from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+    from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
+    accelerator.wait_for_everyone()
+    if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+        # FSDP raises error when single GPU is used with `offload_to_cpu=True` for FULL_STATE_DICT
+        # so, only enable it when num_processes>1
+        is_multi_process = accelerator.num_processes > 1
+        fsdp_plugin.state_dict_config.offload_to_cpu = is_multi_process
+        fsdp_plugin.state_dict_config.rank0_only = is_multi_process
+    ctx = (
+        FSDP.state_dict_type(
+            model, fsdp_plugin.state_dict_type, fsdp_plugin.state_dict_config, fsdp_plugin.optim_state_dict_config
+        )
+        if fsdp_plugin.fsdp_version == 1
+        else nullcontext()
+    )
+    sd_options = _prepare_sd_options(fsdp_plugin)
+    with ctx:
+        if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+            if type(model) is not FSDP and accelerator.process_index != 0 and not accelerator.is_fsdp2:
+                if not fsdp_plugin.sync_module_states and fsdp_plugin.fsdp_version == 1:
+                    raise ValueError(
+                        "Set the `sync_module_states` flag to `True` so that model states are synced across processes when "
+                        "initializing FSDP object"
+                    )
+                return
+            weights_name = f"{FSDP_MODEL_NAME}.bin" if model_index == 0 else f"{FSDP_MODEL_NAME}_{model_index}.bin"
+            input_model_file = os.path.join(input_dir, weights_name)
+            logger.info(f"Loading model from {input_model_file}")
+            # we want an empty state dict for FSDP2 as we use `broadcast_from_rank0`
+            load_model = not accelerator.is_fsdp2 or accelerator.is_main_process
+            if load_model:
+                state_dict = torch.load(input_model_file, weights_only=True)
+            else:
+                state_dict = {}
+            logger.info(f"Model loaded from {input_model_file}")
+        elif fsdp_plugin.state_dict_type == StateDictType.LOCAL_STATE_DICT:
+            weights_name = (
+                f"{FSDP_MODEL_NAME}_rank{accelerator.process_index}.bin"
+                if model_index == 0
+                else f"{FSDP_MODEL_NAME}_{model_index}_rank{accelerator.process_index}.bin"
+            )
+            input_model_file = os.path.join(input_dir, weights_name)
+            logger.info(f"Loading model from {input_model_file}")
+            state_dict = torch.load(input_model_file, weights_only=True)
+            logger.info(f"Model loaded from {input_model_file}")
+        elif fsdp_plugin.state_dict_type == StateDictType.SHARDED_STATE_DICT:
+            ckpt_dir = (
+                os.path.join(input_dir, f"{FSDP_MODEL_NAME}_{model_index}")
+                if f"{FSDP_MODEL_NAME}" not in input_dir
+                else input_dir
+            )
+            logger.info(f"Loading model from {ckpt_dir}")
+            state_dict = {"model": _get_model_state_dict(model, adapter_only=adapter_only, sd_options=sd_options)}
+            dist_cp.load(
+                state_dict=state_dict,
+                storage_reader=dist_cp.FileSystemReader(ckpt_dir),
+                planner=DefaultLoadPlanner(),
+            )
+            state_dict = state_dict["model"]
+            logger.info(f"Model loaded from {ckpt_dir}")
+        load_result = _set_model_state_dict(model, state_dict, adapter_only=adapter_only, sd_options=sd_options)
+    return load_result
+def save_fsdp_optimizer(fsdp_plugin, accelerator, optimizer, model, output_dir, optimizer_index=0):
+    # Note: We import here to reduce import time from general modules, and isolate outside dependencies
+    import torch.distributed.checkpoint as dist_cp
+    from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
+    from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+    from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
+    os.makedirs(output_dir, exist_ok=True)
+    ctx = (
+        FSDP.state_dict_type(
+            model, fsdp_plugin.state_dict_type, fsdp_plugin.state_dict_config, fsdp_plugin.optim_state_dict_config
+        )
+        if fsdp_plugin.fsdp_version == 1
+        else nullcontext()
+    )
+    sd_options = _prepare_sd_options(fsdp_plugin)
+    with ctx:
+        if fsdp_plugin.fsdp_version == 2:
+            from torch.distributed.checkpoint.state_dict import get_optimizer_state_dict
+            optim_state = get_optimizer_state_dict(model, optimizer, options=sd_options)
+        else:
+            optim_state = FSDP.optim_state_dict(model, optimizer)
+        if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+            if accelerator.process_index == 0:
+                optim_state_name = (
+                    f"{OPTIMIZER_NAME}.bin" if optimizer_index == 0 else f"{OPTIMIZER_NAME}_{optimizer_index}.bin"
+                )
+                output_optimizer_file = os.path.join(output_dir, optim_state_name)
+                logger.info(f"Saving Optimizer state to {output_optimizer_file}")
+                torch.save(optim_state, output_optimizer_file)
+                logger.info(f"Optimizer state saved in {output_optimizer_file}")
+        else:
+            ckpt_dir = os.path.join(output_dir, f"{OPTIMIZER_NAME}_{optimizer_index}")
+            os.makedirs(ckpt_dir, exist_ok=True)
+            logger.info(f"Saving Optimizer state to {ckpt_dir}")
+            dist_cp.save(
+                state_dict={"optimizer": optim_state},
+                storage_writer=dist_cp.FileSystemWriter(ckpt_dir),
+                planner=DefaultSavePlanner(),
+            )
+            logger.info(f"Optimizer state saved in {ckpt_dir}")
+def load_fsdp_optimizer(fsdp_plugin, accelerator, optimizer, model, input_dir, optimizer_index=0, adapter_only=False):
+    # Note: We import here to reduce import time from general modules, and isolate outside dependencies
+    import torch.distributed.checkpoint as dist_cp
+    from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+    from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
+    accelerator.wait_for_everyone()
+    ctx = (
+        FSDP.state_dict_type(
+            model, fsdp_plugin.state_dict_type, fsdp_plugin.state_dict_config, fsdp_plugin.optim_state_dict_config
+        )
+        if fsdp_plugin.fsdp_version == 1
+        else nullcontext()
+    )
+    sd_options = _prepare_sd_options(fsdp_plugin)
+    with ctx:
+        if fsdp_plugin.state_dict_type == StateDictType.FULL_STATE_DICT:
+            optim_state = None
+            if accelerator.process_index == 0 or not fsdp_plugin.optim_state_dict_config.rank0_only:
+                optimizer_name = (
+                    f"{OPTIMIZER_NAME}.bin" if optimizer_index == 0 else f"{OPTIMIZER_NAME}_{optimizer_index}.bin"
+                )
+                input_optimizer_file = os.path.join(input_dir, optimizer_name)
+                logger.info(f"Loading Optimizer state from {input_optimizer_file}")
+                optim_state = torch.load(input_optimizer_file, weights_only=True)
+                logger.info(f"Optimizer state loaded from {input_optimizer_file}")
+        else:
+            ckpt_dir = (
+                os.path.join(input_dir, f"{OPTIMIZER_NAME}_{optimizer_index}")
+                if f"{OPTIMIZER_NAME}" not in input_dir
+                else input_dir
+            )
+            logger.info(f"Loading Optimizer from {ckpt_dir}")
+            optim_state = {"optimizer": optimizer.state_dict()}
+            dist_cp.load(
+                optim_state,
+                checkpoint_id=ckpt_dir,
+                storage_reader=dist_cp.FileSystemReader(ckpt_dir),
+            )
+            optim_state = optim_state["optimizer"]
+            logger.info(f"Optimizer loaded from {ckpt_dir}")
+        if fsdp_plugin.fsdp_version == 1:
+            flattened_osd = FSDP.optim_state_dict_to_load(model=model, optim=optimizer, optim_state_dict=optim_state)
+            optimizer.load_state_dict(flattened_osd)
+        else:
+            from torch.distributed.checkpoint.state_dict import set_optimizer_state_dict
+            set_optimizer_state_dict(model, optimizer, optim_state, options=sd_options)
+def _distributed_checkpoint_to_merged_weights(checkpoint_dir: str, save_path: str, safe_serialization: bool = True):
+    """
+    Passthrough to `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`
+    Will save under `save_path` as either `model.safetensors` or `pytorch_model.bin`.
+    """
+    # Note: We import here to reduce import time from general modules, and isolate outside dependencies
+    import torch.distributed.checkpoint as dist_cp
+    import torch.distributed.checkpoint.format_utils as dist_cp_format_utils
+    state_dict = {}
+    save_path = Path(save_path)
+    save_path.mkdir(exist_ok=True)
+    dist_cp_format_utils._load_state_dict(
+        state_dict,
+        storage_reader=dist_cp.FileSystemReader(checkpoint_dir),
+        planner=dist_cp_format_utils._EmptyStateDictLoadPlanner(),
+        no_dist=True,
+    )
+    save_path = save_path / SAFE_WEIGHTS_NAME if safe_serialization else save_path / WEIGHTS_NAME
+    # To handle if state is a dict like {model: {...}}
+    if len(state_dict.keys()) == 1:
+        state_dict = state_dict[list(state_dict)[0]]
+    save(state_dict, save_path, safe_serialization=safe_serialization)
+    return save_path
+def merge_fsdp_weights(
+    checkpoint_dir: str, output_path: str, safe_serialization: bool = True, remove_checkpoint_dir: bool = False
+):
+    """
+    Merge the weights from sharded FSDP model checkpoints into a single combined checkpoint. Should be used if
+    `SHARDED_STATE_DICT` was used for the model. Weights will be saved to `{output_path}/model.safetensors` if
+    `safe_serialization` else `pytorch_model.bin`.
+    Note: this is a CPU-bound process.
+    Args:
+        checkpoint_dir (`str`):
+            The directory containing the FSDP checkpoints (can be either the model or optimizer).
+        output_path (`str`):
+            The path to save the merged checkpoint.
+        safe_serialization (`bool`, *optional*, defaults to `True`):
+            Whether to save the merged weights with safetensors (recommended).
+        remove_checkpoint_dir (`bool`, *optional*, defaults to `False`):
+            Whether to remove the checkpoint directory after merging.
+    """
+    checkpoint_dir = Path(checkpoint_dir)
+    from accelerate.state import PartialState
+    if not is_torch_version(">=", "2.3.0"):
+        raise ValueError("`merge_fsdp_weights` requires PyTorch >= 2.3.0")
+    # Verify that the checkpoint directory exists
+    if not checkpoint_dir.exists():
+        model_path_exists = (checkpoint_dir / "pytorch_model_fsdp_0").exists()
+        optimizer_path_exists = (checkpoint_dir / "optimizer_0").exists()
+        err = f"Tried to load from {checkpoint_dir} but couldn't find a valid metadata file."
+        if model_path_exists and optimizer_path_exists:
+            err += " However, potential model and optimizer checkpoint directories exist."
+            err += f" Please pass in either {checkpoint_dir}/pytorch_model_fsdp_0 or {checkpoint_dir}/optimizer_0"
+            err += " instead."
+        elif model_path_exists:
+            err += " However, a potential model checkpoint directory exists."
+            err += f" Please try passing in {checkpoint_dir}/pytorch_model_fsdp_0 instead."
+        elif optimizer_path_exists:
+            err += " However, a potential optimizer checkpoint directory exists."
+            err += f" Please try passing in {checkpoint_dir}/optimizer_0 instead."
+        raise ValueError(err)
+    # To setup `save` to work
+    state = PartialState()
+    if state.is_main_process:
+        logger.info(f"Merging FSDP weights from {checkpoint_dir}")
+        save_path = _distributed_checkpoint_to_merged_weights(checkpoint_dir, output_path, safe_serialization)
+        logger.info(f"Successfully merged FSDP weights and saved to {save_path}")
+        if remove_checkpoint_dir:
+            logger.info(f"Removing old checkpoint directory {checkpoint_dir}")
+            shutil.rmtree(checkpoint_dir)
+    state.wait_for_everyone()
+def ensure_weights_retied(param_init_fn, model: torch.nn.Module, device: torch.device):
+    _tied_names = getattr(model, "_tied_weights_keys", None)
+    if not _tied_names:
+        # if no tied names just passthrough
+        return param_init_fn
+    # get map of parameter instances to params.
+    # - needed for replacement later
+    _tied_params = {}
+    for name in _tied_names:
+        name = name.split(".")
+        name, param_name = ".".join(name[:-1]), name[-1]
+        mod = model.get_submodule(name)
+        param = getattr(mod, param_name)
+        _tied_params[id(param)] = None  # placeholder for the param first
+    # build param_init_fn for the case with tied params
+    def param_init_fn_tied_param(module: torch.nn.Module):
+        # track which params to tie
+        # - usually only 1, but for completeness consider > 1
+        params_to_tie = defaultdict(list)
+        for n, param in module.named_parameters(recurse=False):
+            if id(param) in _tied_params:
+                params_to_tie[id(param)].append(n)
+        # call the param init fn, which potentially re-allocates the
+        # parameters
+        module = param_init_fn(module)
+        # search the parameters again and tie them up again
+        for id_key, _param_names in params_to_tie.items():
+            for param_name in _param_names:
+                param = _tied_params[id_key]
+                if param is None:
+                    # every later occurrence is tied to the instance seen
+                    # the first time the param is observed
+                    _tied_params[id_key] = getattr(module, param_name)
+                else:
+                    setattr(module, param_name, param)  # tie
+        return module
+    return param_init_fn_tied_param
+def fsdp2_load_full_state_dict(accelerator, model: torch.nn.Module, full_sd: dict):
+    """
+    Loads the full state dict (could be only on rank 0) into the sharded model. This is done by broadcasting the
+    parameters from rank 0 to all other ranks. This function modifies the model in-place.
+    Args:
+        accelerator (`Accelerator`): The accelerator instance
+        model (`torch.nn.Module`):
+            The model to load the state dict into, expected to be on the meta device; otherwise a VRAM spike can occur
+        full_sd (`dict`): The full state dict to load; it may be present only on rank 0
+    """
+    import torch.distributed as dist
+    from torch.distributed.tensor import DTensor, distribute_tensor
+    # Model was previously copied to meta device
+    meta_sharded_sd = model.state_dict()
+    sharded_sd = {}
+    # Rank 0 distributes the full state dict to other ranks
+    def _infer_parameter_dtype(model, param_name, empty_param):
+        try:
+            old_param = model.get_parameter_or_buffer(param_name)
+        except AttributeError:
+            # Needed for LoRA, where some of these entries are not parameters in the strict sense
+            base_param_name, local_param_name = param_name.rsplit(".", 1)
+            submodule = model.get_submodule(base_param_name)
+            old_param = getattr(submodule, local_param_name)
+        is_torch_e4m3fn_available = hasattr(torch, "float8_e4m3fn")
+        casting_dtype = None
+        is_param_float8_e4m3fn = is_torch_e4m3fn_available and empty_param.dtype == torch.float8_e4m3fn
+        if empty_param.dtype.is_floating_point and not is_param_float8_e4m3fn:
+            casting_dtype = old_param.dtype
+        return old_param is not None and old_param.is_contiguous(), casting_dtype
+    def _cast_and_contiguous(tensor, to_contiguous, dtype):
+        if dtype is not None:
+            tensor = tensor.to(dtype=dtype)
+        if to_contiguous:
+            tensor = tensor.contiguous()
+        return tensor
+    if accelerator.is_main_process:
+        for (param_name, full_param), sharded_param in zip(full_sd.items(), meta_sharded_sd.values()):
+            device_mesh = sharded_param.device_mesh
+            full_param = full_param.detach().to(device_mesh.device_type)
+            if isinstance(full_param, DTensor):
+                # dist.broadcast() only supports torch.Tensor.
+                # After prepare_tp(), model parameters may become DTensor.
+                # To broadcast such a parameter, convert it to a local tensor first.
+                full_param = full_param.to_local()
+            dist.broadcast(full_param, src=0, group=dist.group.WORLD)
+            sharded_tensor = distribute_tensor(full_param, device_mesh, sharded_param.placements)
+            to_contiguous, casting_dtype = _infer_parameter_dtype(
+                model,
+                param_name,
+                full_param,
+            )
+            sharded_tensor = _cast_and_contiguous(sharded_tensor, to_contiguous, casting_dtype)
+            sharded_sd[param_name] = sharded_tensor
+    # We need this else to have a matching `broadcast` for all of the ranks, else we deadlock
+    else:
+        for param_name, sharded_param in meta_sharded_sd.items():
+            device_mesh = sharded_param.device_mesh
+            full_tensor = torch.empty(sharded_param.size(), device=device_mesh.device_type, dtype=sharded_param.dtype)
+            dist.broadcast(full_tensor, src=0, group=dist.group.WORLD)
+            sharded_tensor = distribute_tensor(full_tensor, device_mesh, sharded_param.placements)
+            to_contiguous, casting_dtype = _infer_parameter_dtype(
+                model,
+                param_name,
+                full_tensor,
+            )
+            sharded_tensor = _cast_and_contiguous(sharded_tensor, to_contiguous, casting_dtype)
+            sharded_sd[param_name] = sharded_tensor
+    # we set `assign=True` because our params are on meta device
+    model.load_state_dict(sharded_sd, assign=True)
+    return model
+def fsdp2_switch_optimizer_parameters(optimizer: torch.optim.Optimizer, mapping: dict):
+    """
+    Switches the parameters of the optimizer to new ones (sharded parameters in usual case). This function modifies the
+    optimizer in-place.
+    Args:
+        optimizer (`torch.optim.Optimizer`): Optimizer instance which contains the original model parameters
+        mapping (`dict`): Mapping from the original parameter (specified by `data_ptr`) to the sharded parameter
+    Raises:
+        KeyError:
+            If a parameter in the optimizer couldn't be switched to its sharded version. This should never happen and
+            indicates a bug. If we kept the original params instead of raising, the training wouldn't be numerically
+            correct and weights wouldn't get updated.
+    """
+    from torch.distributed.tensor import DTensor
+    accessor_mapping = {}
+    accessor_mapping[DTensor] = "_local_tensor"
+    try:
+        for param_group in optimizer.param_groups:
+            param_group["params"] = [mapping[p.data_ptr] for p in param_group["params"]]
+    except KeyError:
+        # This shouldn't ever happen, but we want to fail here else training wouldn't be numerically correct
+        # This basically means that we're missing a mapping from the original parameter to the sharded parameter
+        raise KeyError(
+            "A parameter in the optimizer couldn't be switched to its sharded version. This breaks the training. Please raise an issue on GitHub."
+        )
+def fsdp2_apply_ac(accelerator, model: torch.nn.Module):
+    """
+    Applies the activation checkpointing to the model.
+    Args:
+        accelerator (`Accelerator`): The accelerator instance
+        model (`torch.nn.Module`): The model to apply the activation checkpointing to
+    Returns:
+        `torch.nn.Module`: The model with the activation checkpointing applied
+    """
+    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
+        checkpoint_wrapper,
+    )
+    auto_wrap_policy_func = fsdp2_prepare_auto_wrap_policy(accelerator.state.fsdp_plugin, model)
+    for layer_name, layer in get_module_children_bottom_up(model, return_fqns=True)[:-1]:
+        if len(layer_name.split(".")) > 1:
+            parent_name, child_name = layer_name.rsplit(".", 1)
+        else:
+            parent_name = None
+            child_name = layer_name
+        parent_module = model.get_submodule(parent_name) if parent_name else model
+        if auto_wrap_policy_func(parent_module):
+            layer = checkpoint_wrapper(layer, preserve_rng_state=False)
+            parent_module.register_module(child_name, layer)
+    return model
+def fsdp2_prepare_model(accelerator, model: torch.nn.Module) -> torch.nn.Module:
+    """Prepares the model for FSDP2 in-place. Also returns the model to avoid misuse of the original model.
+    Args:
+        accelerator (`Accelerator`): The accelerator instance
+        model (`torch.nn.Module`): The model to prepare
+    Returns:
+        `torch.nn.Module`: Prepared model
+    """
+    from torch.distributed.fsdp import FSDPModule, MixedPrecisionPolicy, fully_shard
+    is_type_fsdp = isinstance(model, FSDPModule) or (
+        is_compiled_module(model) and isinstance(model._orig_mod, FSDPModule)
+    )
+    if is_type_fsdp:
+        return model
+    fsdp2_plugin = accelerator.state.fsdp_plugin
+    fsdp2_plugin.set_auto_wrap_policy(model)
+    original_sd = model.state_dict()
+    mesh = getattr(accelerator, "torch_device_mesh", None)
+    fsdp2_kwargs = {
+        "reshard_after_forward": fsdp2_plugin.reshard_after_forward,
+        "offload_policy": fsdp2_plugin.cpu_offload,
+        # `fully_shard` doesn't accept `None` in case of `MixedPrecisionPolicy`
+        "mp_policy": fsdp2_plugin.mixed_precision_policy or MixedPrecisionPolicy(),
+        "mesh": mesh[tuple(accelerator.parallelism_config.fsdp_dim_names)] if mesh is not None else None,
+        "ignored_params": get_parameters_from_modules(fsdp2_plugin.ignored_modules, model, accelerator.device),
+    }
+    model_has_params4bit = False
+    for name, param in model.named_parameters():
+        # This is a temporary fix: models with bnb params cannot be moved from GPU to a meta device
+        # with FSDP2, because torch operations don't return the original class type.
+        # Bypassing the move to meta still causes the VRAM spike, but at least the model still loads.
+        if param.__class__.__name__ == "Params4bit":
+            model_has_params4bit = True
+            break
+    if fsdp2_plugin.cpu_ram_efficient_loading and not model_has_params4bit:
+        # Context: `fully_shard` moves the model to GPU if it was on CPU, however it can also be on `meta` and then it stays there even after `fully_shard`
+        # For this reason, we need to move the model to `meta` device, as then sharding happens on `meta` device
+        # If we kept the model on CPU (`cpu_ram_efficient_loading` has model be on CPU on all ranks, though non-main ranks only have `torch.empty`), `fully_shard` would move it to GPU
+        # Afterwards, when we call `fsdp2_load_full_state_dict`, creating the state_dict would briefly leave two copies of the model state_dict on the GPU -> VRAM spike
+        # We need to keep the original non-persistent buffers, as those MAY not be in the state_dict, resulting in them staying on meta device
+        # Also, these buffers aren't getting sharded by default
+        # We get the FQNs of all non-persistent buffers, to re-register them after
+        non_persistent_buffer_fqns = get_non_persistent_buffers(model, recurse=True, fqns=True)
+        original_non_persistent_buffers = copy.deepcopy(
+            {k: v for k, v in model.named_buffers() if k in non_persistent_buffer_fqns}
+        )
+        # We move the model to meta device, as then sharding happens on meta device
+        model = model.to(torch.device("meta"))
+        # We need to re-tie the weights, not exactly sure why, but if we don't do this, references to `lm_head/embed_tokens` stay hanging -> more VRAM usage
+        # We assume `transformers` models have a `tie_weights` method if they support it
+        if hasattr(model, "tie_weights"):
+            model.tie_weights()
+    auto_wrap_policy_func = fsdp2_prepare_auto_wrap_policy(fsdp2_plugin, model)
+    if auto_wrap_policy_func is not None:
+        # We skip the model itself, as that one is always wrapped
+        for module in get_module_children_bottom_up(model)[:-1]:
+            if auto_wrap_policy_func(module) and not isinstance(module, FSDPModule):
+                fully_shard(module, **fsdp2_kwargs)
+    if not isinstance(model, FSDPModule):
+        fully_shard(model, **fsdp2_kwargs)
+    if fsdp2_plugin.cpu_ram_efficient_loading:
+        # If `cpu_ram_efficient_loading` is enabled, only rank 0 loads the weights
+        # Other ranks have an empty model on `meta` device, so we need to distribute the weights properly
+        fsdp2_load_full_state_dict(accelerator, model, original_sd)
+    if fsdp2_plugin.cpu_ram_efficient_loading and not model_has_params4bit:
+        # We re-register the buffers, as they may not be in the state_dict
+        for fqn, buffer_tensor in original_non_persistent_buffers.items():
+            buffer_tensor = buffer_tensor.to(accelerator.device)
+            if "." in fqn:
+                parent_fqn, local_buffer_name = fqn.rsplit(".", 1)
+                parent_module = model.get_submodule(parent_fqn)
+            else:
+                local_buffer_name = fqn
+                parent_module = model
+            parent_module.register_buffer(local_buffer_name, buffer_tensor, persistent=False)
+        # We need to tie the weights again, as the call to `fsdp2_load_full_state_dict` breaks the tie
+        # This needs to be called both here and above:
+        # removing this call makes the model have a slightly different loss
+        # removing the call above leads to extra memory usage, as explained in the comment above
+        if hasattr(model, "tie_weights"):
+            model.tie_weights()
+    # There is no `dtype` attribute for nn.Module
+    # Set it to None if it doesn't exist and do the upcast always
+    model_dtype = getattr(model, "dtype", None)
+    if accelerator.mixed_precision != "no" and (model_dtype is None or model_dtype != torch.float32):
+        # We upcast the model according to `deepspeed`'s implementation
+        # More info about this can be found in `accelerator.py:prepare_model`s FSDP1 section
+        model = model.to(torch.float32)
+        if accelerator.is_main_process:
+            # TODO(siro1): Add a warning for each parameter that was upcasted
+            warnings.warn(
+                "FSDP upcast of low precision parameters to fp32 (since mixed_precision != 'no') may affect the precision of model checkpoints."
+            )
+    return model
+def fsdp2_prepare_auto_wrap_policy(fsdp2_plugin, model: torch.nn.Module) -> Callable[[torch.nn.Module], bool]:
+    """Prepares the auto wrap policy based on its type, done to mimic the behaviour of FSDP1 auto wrap policy.
+    Args:
+        fsdp2_plugin (`FullyShardedDataParallelPlugin`):
+            Instance of `FullyShardedDataParallelPlugin` containing the configuration options
+        model (`torch.nn.Module`):
+            The model to wrap
+    Returns:
+        `Callable[[torch.nn.Module], bool]`:
+            The auto wrap policy function to be applied to the model, or `None` if the plugin's policy is neither
+            transformer-based nor size-based
+    """
+    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy
+    fn = fsdp2_plugin.auto_wrap_policy
+    if isinstance(fn, functools.partial):
+        fn = fn.func
+    if fn is transformer_auto_wrap_policy:
+        no_split_modules = getattr(model, "_no_split_modules", None)
+        if no_split_modules is None:
+            no_split_modules = []
+        transformer_cls_names_to_wrap = list(no_split_modules)
+        if fsdp2_plugin.transformer_cls_names_to_wrap is not None:
+            transformer_cls_names_to_wrap = fsdp2_plugin.transformer_cls_names_to_wrap
+        transformer_cls_to_wrap = set()
+        for layer_class in transformer_cls_names_to_wrap:
+            transformer_cls = get_module_class_from_name(model, layer_class)
+            if transformer_cls is None:
+                raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
+            transformer_cls_to_wrap.add(transformer_cls)
+        def policy(module: torch.nn.Module) -> bool:
+            if fsdp2_plugin.transformer_cls_names_to_wrap is None:
+                return False
+            return isinstance(module, tuple(transformer_cls_to_wrap))
+    elif fn is size_based_auto_wrap_policy:
+        def policy(module: torch.nn.Module) -> bool:
+            module_num_params = sum(p.numel() for p in module.parameters())
+            return module_num_params > fsdp2_plugin.min_num_params
+    else:
+        return None
+    return policy
+def get_fsdp2_grad_scaler(**kwargs):
+    """
+    Returns a `GradScaler` for FSDP2, as the current implementation of `get_grad_scaler` doesn't accept other args. We
+    need this as current `get_grad_scaler` accepts only `distributed_type` as arg, which doesn't differentiate between
+    FSDP1 and FSDP2
+    """
+    from torch.amp.grad_scaler import GradScaler
+    return GradScaler(**kwargs)
+def fsdp2_canonicalize_names(named_params: dict) -> dict:
+    """Removes parameter name modifiers in order to map them back to their original names.
+    See huggingface/accelerate#3554 for more context.
+    Args:
+        named_params (`dict`): The named parameters dictionary to canonicalize.
+    Returns:
+        `dict`: The canonicalized named parameters dictionary
+    """
+    named_params = {k.replace("._checkpoint_wrapped_module", ""): v for k, v in named_params.items()}
+    named_params = {
+        k.replace("_orig_mod.", "") if k.startswith("_orig_mod.") else k: v for k, v in named_params.items()
+    }
+    named_params = {k.replace("._orig_mod", ""): v for k, v in named_params.items()}
+    return named_params
+def get_parameters_from_modules(
+    modules: Union[Iterable[torch.nn.Module], str], model, device
+) -> set[torch.nn.Parameter]:
+    """Converts modules to parameters where modules can be a string or list of torch.nn.Module
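+    Args:
+        modules (`Union[Iterable[torch.nn.Module], str]`):
+            Modules to collect parameters from, or a regex that is full-matched against the names in
+            `model.named_modules()`
+        model (`torch.nn.Module`): The model searched for matching modules when `modules` is a string
+        device (`torch.device`): The device that regex-matched modules are moved to
+    Returns:
+        `set[torch.nn.Parameter]`: Set of parameters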
+    """
+    if modules is None:
+        return set()
+    parameters = []
+    # code taken from accelerate while preparing kwargs for FSDP
+    if isinstance(modules, str):
+        reg = re.compile(modules)
+        mapped_modules = []
+        for name, module in model.named_modules():
+            if reg.fullmatch(name):
+                module.to(device)
+                mapped_modules.append(module)
+        modules = mapped_modules
+    for module in modules:
+        parameters.extend(list(module.parameters()))
+    return set(parameters)
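
As a rough illustration of what `fsdp2_canonicalize_names` above does, the following standalone sketch re-implements its three renaming passes on a plain dictionary. The function name `canonicalize_names` and the parameter names in `wrapped` are hypothetical and made up for this example; only the replacement logic is taken from the diff.

```python
def canonicalize_names(named_params: dict) -> dict:
    """Hypothetical stand-in mirroring the three passes of `fsdp2_canonicalize_names`."""
    # Pass 1: drop the fragment inserted by the activation-checkpoint wrapper.
    named_params = {k.replace("._checkpoint_wrapped_module", ""): v for k, v in named_params.items()}
    # Pass 2: drop `_orig_mod.` when the key starts with it (a compiled root module).
    named_params = {
        k.replace("_orig_mod.", "") if k.startswith("_orig_mod.") else k: v for k, v in named_params.items()
    }
    # Pass 3: drop `._orig_mod` fragments left by compiled submodules.
    named_params = {k.replace("._orig_mod", ""): v for k, v in named_params.items()}
    return named_params


# Made-up names; the integer values stand in for parameter tensors.
wrapped = {
    "_orig_mod.model.layers.0._checkpoint_wrapped_module.self_attn.q_proj.weight": 1,
    "model.layers.1._orig_mod.mlp.up_proj.weight": 2,
    "lm_head.weight": 3,
}
clean = canonicalize_names(wrapped)
for name in clean:
    print(name)
```

Keys that carry no wrapper fragments, such as `lm_head.weight` here, pass through unchanged, so the function can be applied to a mix of wrapped and unwrapped names.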

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/imports.py ADDED

	@@ -0,0 +1,564 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import importlib
+import importlib.metadata
+import os
+import sys
+import warnings
+from functools import lru_cache, wraps
+import torch
+from packaging import version
+from packaging.version import parse
+from .environment import parse_flag_from_env, patch_environment, str_to_bool
+from .versions import compare_versions, is_torch_version
+# Try to run Torch native job in an environment with TorchXLA installed by setting this value to 0.
+USE_TORCH_XLA = parse_flag_from_env("USE_TORCH_XLA", default=True)
+_torch_xla_available = False
+if USE_TORCH_XLA:
+    try:
+        import torch_xla.core.xla_model as xm  # noqa: F401
+        import torch_xla.runtime
+        _torch_xla_available = True
+    except ImportError:
+        pass
+# Keep it for is_tpu_available. It will be removed along with is_tpu_available.
+_tpu_available = _torch_xla_available
+# Cache this result, as it's a C FFI call which can be pretty time-consuming
+_torch_distributed_available = torch.distributed.is_available()
+def _is_package_available(pkg_name, metadata_name=None):
+    # Check we're not importing a "pkg_name" directory somewhere but the actual library by trying to grab the version
+    package_exists = importlib.util.find_spec(pkg_name) is not None
+    if package_exists:
+        try:
+            # Some libraries have different names in the metadata
+            _ = importlib.metadata.metadata(pkg_name if metadata_name is None else metadata_name)
+            return True
+        except importlib.metadata.PackageNotFoundError:
+            return False
+def is_torch_distributed_available() -> bool:
+    return _torch_distributed_available
+def is_xccl_available():
+    if is_torch_version(">=", "2.7.0"):
+        return torch.distributed.distributed_c10d.is_xccl_available()
+    if is_ipex_available():
+        return False
+    return False
+def is_ccl_available():
+    try:
+        pass
+    except ImportError:
+        print(
+            "Intel(R) oneCCL Bindings for PyTorch* is required to run DDP on Intel(R) XPUs, but it is not"
+            " detected. If you see \"ValueError: Invalid backend: 'ccl'\" error, please install Intel(R) oneCCL"
+            " Bindings for PyTorch*."
+        )
+    return importlib.util.find_spec("oneccl_bindings_for_pytorch") is not None
+def get_ccl_version():
+    return importlib.metadata.version("oneccl_bind_pt")
+def is_import_timer_available():
+    return _is_package_available("import_timer")
+def is_pynvml_available():
+    return _is_package_available("pynvml") or _is_package_available("pynvml", "nvidia-ml-py")
+def is_pytest_available():
+    return _is_package_available("pytest")
+def is_msamp_available():
+    return _is_package_available("msamp", "ms-amp")
+def is_schedulefree_available():
+    return _is_package_available("schedulefree")
+def is_transformer_engine_available():
+    if is_hpu_available():
+        return _is_package_available("intel_transformer_engine", "intel-transformer-engine")
+    else:
+        return _is_package_available("transformer_engine", "transformer-engine")
+def is_transformer_engine_mxfp8_available():
+    if _is_package_available("transformer_engine", "transformer-engine"):
+        import transformer_engine.pytorch as te
+        return te.fp8.check_mxfp8_support()[0]
+    return False
+def is_lomo_available():
+    return _is_package_available("lomo_optim")
+def is_cuda_available():
+    """
+    Checks if `cuda` is available via an `nvml-based` check which won't trigger the drivers and leave cuda
+    uninitialized.
+    """
+    with patch_environment(PYTORCH_NVML_BASED_CUDA_CHECK="1"):
+        available = torch.cuda.is_available()
+    return available
+@lru_cache
+def is_torch_xla_available(check_is_tpu=False, check_is_gpu=False):
+    """
+    Check if `torch_xla` is available. To train a native pytorch job in an environment with torch xla installed, set
+    the USE_TORCH_XLA to false.
+    """
+    assert not (check_is_tpu and check_is_gpu), "The check_is_tpu and check_is_gpu cannot both be true."
+    if not _torch_xla_available:
+        return False
+    elif check_is_gpu:
+        return torch_xla.runtime.device_type() in ["GPU", "CUDA"]
+    elif check_is_tpu:
+        return torch_xla.runtime.device_type() == "TPU"
+    return True
+def is_torchao_available():
+    package_exists = _is_package_available("torchao")
+    if package_exists:
+        torchao_version = version.parse(importlib.metadata.version("torchao"))
+        return compare_versions(torchao_version, ">=", "0.6.1")
+    return False
+def is_deepspeed_available():
+    return _is_package_available("deepspeed")
+def is_pippy_available():
+    return is_torch_version(">=", "2.4.0")
+def is_bf16_available(ignore_tpu=False):
+    "Checks if bf16 is supported, optionally ignoring the TPU"
+    if is_torch_xla_available(check_is_tpu=True):
+        return not ignore_tpu
+    if is_cuda_available():
+        return torch.cuda.is_bf16_supported()
+    if is_mlu_available():
+        return torch.mlu.is_bf16_supported()
+    if is_xpu_available():
+        return torch.xpu.is_bf16_supported()
+    if is_mps_available():
+        return torch.backends.mps.is_macos_or_newer(14, 0)
+    return True
+def is_fp16_available():
+    "Checks if fp16 is supported"
+    if is_habana_gaudi1():
+        return False
+    return True
+def is_fp8_available():
+    "Checks if fp8 is supported"
+    return is_msamp_available() or is_transformer_engine_available() or is_torchao_available()
+def is_4bit_bnb_available():
+    package_exists = _is_package_available("bitsandbytes")
+    if package_exists:
+        bnb_version = version.parse(importlib.metadata.version("bitsandbytes"))
+        return compare_versions(bnb_version, ">=", "0.39.0")
+    return False
+def is_8bit_bnb_available():
+    package_exists = _is_package_available("bitsandbytes")
+    if package_exists:
+        bnb_version = version.parse(importlib.metadata.version("bitsandbytes"))
+        return compare_versions(bnb_version, ">=", "0.37.2")
+    return False
+def is_bnb_available(min_version=None):
+    package_exists = _is_package_available("bitsandbytes")
+    if package_exists and min_version is not None:
+        bnb_version = version.parse(importlib.metadata.version("bitsandbytes"))
+        return compare_versions(bnb_version, ">=", min_version)
+    else:
+        return package_exists
+def is_bitsandbytes_multi_backend_available():
+    if not is_bnb_available():
+        return False
+    import bitsandbytes as bnb
+    return "multi_backend" in getattr(bnb, "features", set())
+def is_torchvision_available():
+    return _is_package_available("torchvision")
+def is_megatron_lm_available():
+    if str_to_bool(os.environ.get("ACCELERATE_USE_MEGATRON_LM", "False")) == 1:
+        if importlib.util.find_spec("megatron") is not None:
+            try:
+                megatron_version = parse(importlib.metadata.version("megatron-core"))
+                if compare_versions(megatron_version, ">=", "0.8.0"):
+                    return importlib.util.find_spec(".training", "megatron")
+            except Exception as e:
+                warnings.warn(f"Parse Megatron version failed. Exception:{e}")
+                return False
+def is_transformers_available():
+    return _is_package_available("transformers")
+def is_datasets_available():
+    return _is_package_available("datasets")
+def is_peft_available():
+    return _is_package_available("peft")
+def is_timm_available():
+    return _is_package_available("timm")
+def is_triton_available():
+    if is_xpu_available():
+        return _is_package_available("triton", "pytorch-triton-xpu")
+    return _is_package_available("triton")
+def is_aim_available():
+    package_exists = _is_package_available("aim")
+    if package_exists:
+        aim_version = version.parse(importlib.metadata.version("aim"))
+        return compare_versions(aim_version, "<", "4.0.0")
+    return False
+def is_tensorboard_available():
+    return _is_package_available("tensorboard") or _is_package_available("tensorboardX")
+def is_wandb_available():
+    return _is_package_available("wandb")
+def is_comet_ml_available():
+    return _is_package_available("comet_ml")
+def is_swanlab_available():
+    return _is_package_available("swanlab")
+def is_trackio_available():
+    return sys.version_info >= (3, 10) and _is_package_available("trackio")
+def is_boto3_available():
+    return _is_package_available("boto3")
+def is_rich_available():
+    if _is_package_available("rich"):
+        return parse_flag_from_env("ACCELERATE_ENABLE_RICH", False)
+    return False
+def is_sagemaker_available():
+    return _is_package_available("sagemaker")
+def is_tqdm_available():
+    return _is_package_available("tqdm")
+def is_clearml_available():
+    return _is_package_available("clearml")
+def is_pandas_available():
+    return _is_package_available("pandas")
+def is_matplotlib_available():
+    return _is_package_available("matplotlib")
+def is_mlflow_available():
+    if _is_package_available("mlflow"):
+        return True
+    if importlib.util.find_spec("mlflow") is not None:
+        try:
+            _ = importlib.metadata.metadata("mlflow-skinny")
+            return True
+        except importlib.metadata.PackageNotFoundError:
+            return False
+    return False
+def is_mps_available(min_version="1.12"):
+    "Checks if MPS device is available. The minimum version required is 1.12."
+    # With torch 1.12, you can use torch.backends.mps
+    # With torch 2.0.0, you can use torch.mps
+    return is_torch_version(">=", min_version) and torch.backends.mps.is_available() and torch.backends.mps.is_built()
+def is_ipex_available():
+    "Checks if ipex is installed."
+    def get_major_and_minor_from_version(full_version):
+        return str(version.parse(full_version).major) + "." + str(version.parse(full_version).minor)
+    _torch_version = importlib.metadata.version("torch")
+    if importlib.util.find_spec("intel_extension_for_pytorch") is None:
+        return False
+    _ipex_version = "N/A"
+    try:
+        _ipex_version = importlib.metadata.version("intel_extension_for_pytorch")
+    except importlib.metadata.PackageNotFoundError:
+        return False
+    torch_major_and_minor = get_major_and_minor_from_version(_torch_version)
+    ipex_major_and_minor = get_major_and_minor_from_version(_ipex_version)
+    if torch_major_and_minor != ipex_major_and_minor:
+        warnings.warn(
+            f"Intel Extension for PyTorch {ipex_major_and_minor} needs to work with PyTorch {ipex_major_and_minor}.*,"
+            f" but PyTorch {_torch_version} is found. Please switch to the matching version and run again."
+        )
+        return False
+    return True
+@lru_cache
+def is_mlu_available(check_device=False):
+    """
+    Checks if `mlu` is available via a `cndev-based` check which won't trigger the drivers and leave mlu
+    uninitialized.
+    """
+    if importlib.util.find_spec("torch_mlu") is None:
+        return False
+    import torch_mlu  # noqa: F401
+    with patch_environment(PYTORCH_CNDEV_BASED_MLU_CHECK="1"):
+        available = torch.mlu.is_available()
+    return available
+@lru_cache
+def is_musa_available(check_device=False):
+    "Checks if `torch_musa` is installed and potentially if a MUSA is in the environment"
+    if importlib.util.find_spec("torch_musa") is None:
+        return False
+    import torch_musa  # noqa: F401
+    if check_device:
+        try:
+            # Will raise a RuntimeError if no MUSA is found
+            _ = torch.musa.device_count()
+            return torch.musa.is_available()
+        except RuntimeError:
+            return False
+    return hasattr(torch, "musa") and torch.musa.is_available()
+@lru_cache
+def is_npu_available(check_device=False):
+    "Checks if `torch_npu` is installed and potentially if an NPU is in the environment"
+    if importlib.util.find_spec("torch_npu") is None:
+        return False
+    # NOTE: importing torch_npu may raise error in some envs
+    # e.g. inside cpu-only container with torch_npu installed
+    try:
+        import torch_npu  # noqa: F401
+    except Exception:
+        return False
+    if check_device:
+        try:
+            # Will raise a RuntimeError if no NPU is found
+            _ = torch.npu.device_count()
+            return torch.npu.is_available()
+        except RuntimeError:
+            return False
+    return hasattr(torch, "npu") and torch.npu.is_available()
+@lru_cache
+def is_sdaa_available(check_device=False):
+    "Checks if `torch_sdaa` is installed and potentially if an SDAA is in the environment"
+    if importlib.util.find_spec("torch_sdaa") is None:
+        return False
+    import torch_sdaa  # noqa: F401
+    if check_device:
+        try:
+            # Will raise a RuntimeError if no SDAA is found
+            _ = torch.sdaa.device_count()
+            return torch.sdaa.is_available()
+        except RuntimeError:
+            return False
+    return hasattr(torch, "sdaa") and torch.sdaa.is_available()
+@lru_cache
+def is_hpu_available(init_hccl=False):
+    "Checks if `torch.hpu` is installed and potentially if a HPU is in the environment"
+    if (
+        importlib.util.find_spec("habana_frameworks") is None
+        or importlib.util.find_spec("habana_frameworks.torch") is None
+    ):
+        return False
+    import habana_frameworks.torch  # noqa: F401
+    if init_hccl:
+        import habana_frameworks.torch.distributed.hccl as hccl  # noqa: F401
+    return hasattr(torch, "hpu") and torch.hpu.is_available()
+def is_habana_gaudi1():
+    if is_hpu_available():
+        import habana_frameworks.torch.utils.experimental as htexp  # noqa: F401
+        if htexp._get_device_type() == htexp.synDeviceType.synDeviceGaudi:
+            return True
+    return False
+@lru_cache
+def is_xpu_available(check_device=False):
+    """
+    Checks if XPU acceleration is available either via `intel_extension_for_pytorch` or via stock PyTorch (>=2.4) and
+    potentially if an XPU is in the environment
+    """
+    if is_ipex_available():
+        import intel_extension_for_pytorch  # noqa: F401
+    else:
+        if is_torch_version("<=", "2.3"):
+            return False
+    if check_device:
+        try:
+            # Will raise a RuntimeError if no XPU is found
+            _ = torch.xpu.device_count()
+            return torch.xpu.is_available()
+        except RuntimeError:
+            return False
+    return hasattr(torch, "xpu") and torch.xpu.is_available()
+def is_dvclive_available():
+    return _is_package_available("dvclive")
+def is_torchdata_available():
+    return _is_package_available("torchdata")
+# TODO: Remove this function once stateful_dataloader is a stable feature in torchdata.
+def is_torchdata_stateful_dataloader_available():
+    package_exists = _is_package_available("torchdata")
+    if package_exists:
+        torchdata_version = version.parse(importlib.metadata.version("torchdata"))
+        return compare_versions(torchdata_version, ">=", "0.8.0")
+    return False
+def torchao_required(func):
+    """
+    A decorator that ensures the decorated function is only called when torchao is available.
+    """
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        if not is_torchao_available():
+            raise ImportError(
+                "`torchao` is not available, please install it before calling this function via `pip install torchao`."
+            )
+        return func(*args, **kwargs)
+    return wrapper
+# TODO: Rework this into `utils.deepspeed` and migrate the "core" chunks into `accelerate.deepspeed`
+def deepspeed_required(func):
+    """
+    A decorator that ensures the decorated function is only called when deepspeed is enabled.
+    """
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        from accelerate.state import AcceleratorState
+        from accelerate.utils.dataclasses import DistributedType
+        if AcceleratorState._shared_state != {} and AcceleratorState().distributed_type != DistributedType.DEEPSPEED:
+            raise ValueError(
+                "DeepSpeed is not enabled, please make sure that an `Accelerator` is configured for `deepspeed` "
+                "before calling this function."
+            )
+        return func(*args, **kwargs)
+    return wrapper
+def is_weights_only_available():
+    # Weights only with allowlist was added in 2.4.0
+    # ref: https://github.com/pytorch/pytorch/pull/124331
+    return is_torch_version(">=", "2.4.0")
+def is_numpy_available(min_version="1.25.0"):
+    numpy_version = parse(importlib.metadata.version("numpy"))
+    return compare_versions(numpy_version, ">=", min_version)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/launch.py ADDED

	@@ -0,0 +1,781 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import os
+import subprocess
+import sys
+import warnings
+from ast import literal_eval
+from shutil import which
+from typing import Any
+import torch
+from ..commands.config.config_args import SageMakerConfig
+from ..utils import (
+    DynamoBackend,
+    PrecisionType,
+    is_ccl_available,
+    is_fp8_available,
+    is_hpu_available,
+    is_ipex_available,
+    is_mlu_available,
+    is_musa_available,
+    is_npu_available,
+    is_sdaa_available,
+    is_torch_xla_available,
+    is_xpu_available,
+)
+from ..utils.constants import DEEPSPEED_MULTINODE_LAUNCHERS
+from ..utils.other import get_free_port, is_port_in_use, merge_dicts
+from ..utils.versions import compare_versions
+from .dataclasses import DistributedType, SageMakerDistributedType
+def _filter_args(args, parser, default_args=[]):
+    """
+    Filters out all `accelerate` specific args
+    """
+    new_args, _ = parser.parse_known_args(default_args)
+    for key, value in vars(args).items():
+        if key in vars(new_args).keys():
+            setattr(new_args, key, value)
+    return new_args
+def _get_mpirun_args():
+    """
+    Determines the executable and argument names for mpirun, based on the type of install. The supported MPI programs
+    are: OpenMPI, Intel MPI, or MVAPICH.
+    Returns: Program name and arg names for hostfile, num processes, and processes per node
+    """
+    # Find the MPI program name
+    mpi_apps = [x for x in ["mpirun", "mpiexec"] if which(x)]
+    if len(mpi_apps) == 0:
+        raise OSError("mpirun or mpiexec were not found. Ensure that Intel MPI, Open MPI, or MVAPICH are installed.")
+    # Call the app with the --version flag to determine which MPI app is installed
+    mpi_app = mpi_apps[0]
+    mpirun_version = subprocess.check_output([mpi_app, "--version"])
+    if b"Open MPI" in mpirun_version:
+        return mpi_app, "--hostfile", "-n", "--npernode", "--bind-to"
+    else:
+        # Intel MPI and MVAPICH both use the same arg names
+        return mpi_app, "-f", "-n", "-ppn", ""
+def setup_fp8_env(args: argparse.Namespace, current_env: dict[str, str]):
+    """
+    Setup the FP8 environment variables.
+    """
+    prefix = "ACCELERATE_"
+    for arg in vars(args):
+        if arg.startswith("fp8_"):
+            value = getattr(args, arg)
+            if value is not None:
+                if arg == "fp8_override_linear_precision":
+                    current_env[prefix + "FP8_OVERRIDE_FPROP"] = str(value[0])
+                    current_env[prefix + "FP8_OVERRIDE_DGRAD"] = str(value[1])
+                    current_env[prefix + "FP8_OVERRIDE_WGRAD"] = str(value[2])
+                else:
+                    current_env[f"{prefix}{arg.upper()}"] = str(getattr(args, arg))
+    return current_env
+def prepare_simple_launcher_cmd_env(args: argparse.Namespace) -> tuple[list[str], dict[str, str]]:
+    """
+    Prepares and returns the command list and an environment with the correct simple launcher environment variables.
+    """
+    cmd = []
+    if args.no_python and args.module:
+        raise ValueError("--module and --no_python cannot be used together")
+    num_processes = getattr(args, "num_processes", None)
+    num_machines = args.num_machines
+    if args.mpirun_hostfile is not None:
+        mpi_app_name, hostfile_arg, num_proc_arg, proc_per_node_arg, bind_to_arg = _get_mpirun_args()
+        bind_to = getattr(args, "bind-to", "socket")
+        nproc_per_node = str(num_processes // num_machines) if num_processes and num_machines else "1"
+        cmd += [
+            mpi_app_name,
+            hostfile_arg,
+            args.mpirun_hostfile,
+            proc_per_node_arg,
+            nproc_per_node,
+        ]
+        if num_processes:
+            cmd += [num_proc_arg, str(num_processes)]
+        if bind_to_arg:
+            cmd += [bind_to_arg, bind_to]
+    if not args.no_python:
+        cmd.append(sys.executable)
+        if args.module:
+            cmd.append("-m")
+    cmd.append(args.training_script)
+    cmd.extend(args.training_script_args)
+    current_env = os.environ.copy()
+    current_env["ACCELERATE_USE_CPU"] = str(args.cpu or args.use_cpu)
+    if args.debug:
+        current_env["ACCELERATE_DEBUG_MODE"] = "true"
+    if args.gpu_ids != "all" and args.gpu_ids is not None:
+        if is_xpu_available():
+            current_env["ZE_AFFINITY_MASK"] = args.gpu_ids
+        elif is_mlu_available():
+            current_env["MLU_VISIBLE_DEVICES"] = args.gpu_ids
+        elif is_sdaa_available():
+            current_env["SDAA_VISIBLE_DEVICES"] = args.gpu_ids
+        elif is_musa_available():
+            current_env["MUSA_VISIBLE_DEVICES"] = args.gpu_ids
+        elif is_npu_available():
+            current_env["ASCEND_RT_VISIBLE_DEVICES"] = args.gpu_ids
+        elif is_hpu_available():
+            current_env["HABANA_VISIBLE_MODULES"] = args.gpu_ids
+        else:
+            current_env["CUDA_VISIBLE_DEVICES"] = args.gpu_ids
+    if num_machines > 1:
+        assert args.main_process_ip is not None, (
+            "When using multiple machines, you need to specify the main process IP."
+        )
+        assert args.main_process_port is not None, (
+            "When using multiple machines, you need to specify the main process port."
+        )
+    ccl_worker_count = getattr(args, "mpirun_ccl", 0) if is_ccl_available() else 0
+    if (num_processes is not None and num_processes > 1) or num_machines > 1:
+        current_env["MASTER_ADDR"] = args.main_process_ip if args.main_process_ip is not None else "127.0.0.1"
+        current_env["MASTER_PORT"] = str(args.main_process_port) if args.main_process_port is not None else "29500"
+        current_env["CCL_WORKER_COUNT"] = str(ccl_worker_count)
+    if current_env["ACCELERATE_USE_CPU"]:
+        current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
+        current_env["KMP_BLOCKTIME"] = str(1)
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )
+    current_env["ACCELERATE_MIXED_PRECISION"] = str(mixed_precision)
+    if args.mixed_precision.lower() == "fp8":
+        if not is_fp8_available():
+            raise RuntimeError(
+                "FP8 is not available on this machine. Please ensure that either Transformer Engine, MSAMP or torchao is installed."
+            )
+        current_env = setup_fp8_env(args, current_env)
+    try:
+        dynamo_backend = DynamoBackend(args.dynamo_backend.upper())
+    except ValueError:
+        raise ValueError(
+            f"Unknown dynamo backend: {args.dynamo_backend.upper()}. Choose between {DynamoBackend.list()}."
+        )
+    current_env["ACCELERATE_DYNAMO_BACKEND"] = dynamo_backend.value
+    current_env["ACCELERATE_DYNAMO_MODE"] = args.dynamo_mode
+    current_env["ACCELERATE_DYNAMO_USE_FULLGRAPH"] = str(args.dynamo_use_fullgraph)
+    current_env["ACCELERATE_DYNAMO_USE_DYNAMIC"] = str(args.dynamo_use_dynamic)
+    current_env["ACCELERATE_DYNAMO_USE_REGIONAL_COMPILATION"] = str(args.dynamo_use_regional_compilation)
+    current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
+    if is_ipex_available():
+        current_env["ACCELERATE_USE_IPEX"] = str(args.ipex).lower()
+    if args.enable_cpu_affinity:
+        current_env["ACCELERATE_CPU_AFFINITY"] = "1"
+    return cmd, current_env
+def prepare_multi_gpu_env(args: argparse.Namespace) -> dict[str, str]:
+    """
+    Prepares and returns an environment with the correct multi-GPU environment variables.
+    """
+    # get free port and update configurations
+    if args.main_process_port == 0:
+        args.main_process_port = get_free_port()
+    elif args.main_process_port is None:
+        args.main_process_port = 29500
+    num_processes = args.num_processes
+    num_machines = args.num_machines
+    main_process_ip = args.main_process_ip
+    main_process_port = args.main_process_port
+    if num_machines > 1:
+        args.nproc_per_node = str(num_processes // num_machines)
+        args.nnodes = str(num_machines)
+        args.node_rank = int(args.machine_rank)
+        if getattr(args, "same_network", False):
+            args.master_addr = str(main_process_ip)
+            args.master_port = str(main_process_port)
+        else:
+            args.rdzv_endpoint = f"{main_process_ip}:{main_process_port}"
+    else:
+        args.nproc_per_node = str(num_processes)
+        if main_process_port is not None:
+            args.master_port = str(main_process_port)
+    # Only check port availability in the main process, because several launchers may be started on the same machine
+    # (for example, to split log files).
+    need_port_check = num_machines <= 1 or int(args.machine_rank) == 0
+    if need_port_check and is_port_in_use(main_process_port):
+        if num_machines <= 1:
+            args.standalone = True
+            warnings.warn(
+                f"Port `{main_process_port}` is already in use. "
+                "Accelerate will attempt to launch in a standalone-like mode by finding an open port automatically for this session. "
+                "If this current attempt fails, or for more control in future runs, please specify a different port "
+                "(e.g., `--main_process_port <your_chosen_port>`) or use `--main_process_port 0` for automatic selection "
+                "in your launch command or Accelerate config file."
+            )
+        else:
+            raise ConnectionError(
+                f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
+                "Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file)"
+                " and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`."
+            )
+    if args.module and args.no_python:
+        raise ValueError("--module and --no_python cannot be used together")
+    elif args.module:
+        args.module = True
+    elif args.no_python:
+        args.no_python = True
+    current_env = os.environ.copy()
+    if args.debug:
+        current_env["ACCELERATE_DEBUG_MODE"] = "true"
+    gpu_ids = getattr(args, "gpu_ids", "all")
+    if gpu_ids != "all" and args.gpu_ids is not None:
+        if is_xpu_available():
+            current_env["ZE_AFFINITY_MASK"] = gpu_ids
+        elif is_mlu_available():
+            current_env["MLU_VISIBLE_DEVICES"] = gpu_ids
+        elif is_sdaa_available():
+            current_env["SDAA_VISIBLE_DEVICES"] = gpu_ids
+        elif is_musa_available():
+            current_env["MUSA_VISIBLE_DEVICES"] = gpu_ids
+        elif is_npu_available():
+            current_env["ASCEND_RT_VISIBLE_DEVICES"] = gpu_ids
+        elif is_hpu_available():
+            current_env["HABANA_VISIBLE_MODULES"] = gpu_ids
+        else:
+            current_env["CUDA_VISIBLE_DEVICES"] = gpu_ids
+    mixed_precision = args.mixed_precision.lower()
+    try:
+        mixed_precision = PrecisionType(mixed_precision)
+    except ValueError:
+        raise ValueError(f"Unknown mixed_precision mode: {mixed_precision}. Choose between {PrecisionType.list()}.")
+    current_env["ACCELERATE_MIXED_PRECISION"] = str(mixed_precision)
+    if args.mixed_precision.lower() == "fp8":
+        if not is_fp8_available():
+            raise RuntimeError(
+                "FP8 is not available on this machine. Please ensure that either Transformer Engine, MSAMP or torchao is installed."
+            )
+        current_env = setup_fp8_env(args, current_env)
+    try:
+        dynamo_backend = DynamoBackend(args.dynamo_backend.upper())
+    except ValueError:
+        raise ValueError(
+            f"Unknown dynamo backend: {args.dynamo_backend.upper()}. Choose between {DynamoBackend.list()}."
+        )
+    current_env["ACCELERATE_DYNAMO_BACKEND"] = dynamo_backend.value
+    current_env["ACCELERATE_DYNAMO_MODE"] = args.dynamo_mode
+    current_env["ACCELERATE_DYNAMO_USE_FULLGRAPH"] = str(args.dynamo_use_fullgraph)
+    current_env["ACCELERATE_DYNAMO_USE_DYNAMIC"] = str(args.dynamo_use_dynamic)
+    current_env["ACCELERATE_DYNAMO_USE_REGIONAL_COMPILATION"] = str(args.dynamo_use_regional_compilation)
+    if args.use_fsdp:
+        current_env["ACCELERATE_USE_FSDP"] = "true"
+        if args.fsdp_cpu_ram_efficient_loading and not args.fsdp_sync_module_states:
+            raise ValueError("When using `--fsdp_cpu_ram_efficient_loading` set `--fsdp_sync_module_states` to `True`")
+        current_env["FSDP_VERSION"] = str(args.fsdp_version) if hasattr(args, "fsdp_version") else "1"
+        # For backwards compatibility, we support this in launched scripts,
+        # however, we do not ask users for this in `accelerate config` CLI
+        current_env["FSDP_SHARDING_STRATEGY"] = str(args.fsdp_sharding_strategy)
+        current_env["FSDP_RESHARD_AFTER_FORWARD"] = str(args.fsdp_reshard_after_forward).lower()
+        current_env["FSDP_OFFLOAD_PARAMS"] = str(args.fsdp_offload_params).lower()
+        current_env["FSDP_MIN_NUM_PARAMS"] = str(args.fsdp_min_num_params)
+        if args.fsdp_auto_wrap_policy is not None:
+            current_env["FSDP_AUTO_WRAP_POLICY"] = str(args.fsdp_auto_wrap_policy)
+        if args.fsdp_transformer_layer_cls_to_wrap is not None:
+            current_env["FSDP_TRANSFORMER_CLS_TO_WRAP"] = str(args.fsdp_transformer_layer_cls_to_wrap)
+        if args.fsdp_backward_prefetch is not None:
+            current_env["FSDP_BACKWARD_PREFETCH"] = str(args.fsdp_backward_prefetch)
+        if args.fsdp_state_dict_type is not None:
+            current_env["FSDP_STATE_DICT_TYPE"] = str(args.fsdp_state_dict_type)
+        current_env["FSDP_FORWARD_PREFETCH"] = str(args.fsdp_forward_prefetch).lower()
+        current_env["FSDP_USE_ORIG_PARAMS"] = str(args.fsdp_use_orig_params).lower()
+        current_env["FSDP_CPU_RAM_EFFICIENT_LOADING"] = str(args.fsdp_cpu_ram_efficient_loading).lower()
+        current_env["FSDP_SYNC_MODULE_STATES"] = str(args.fsdp_sync_module_states).lower()
+        current_env["FSDP_ACTIVATION_CHECKPOINTING"] = str(args.fsdp_activation_checkpointing).lower()
+        if getattr(args, "fsdp_ignored_modules", None) is not None:
+            current_env["FSDP_IGNORED_MODULES"] = str(args.fsdp_ignored_modules)
+    if args.use_megatron_lm:
+        prefix = "MEGATRON_LM_"
+        current_env["ACCELERATE_USE_MEGATRON_LM"] = "true"
+        current_env[prefix + "TP_DEGREE"] = str(args.megatron_lm_tp_degree)
+        current_env[prefix + "PP_DEGREE"] = str(args.megatron_lm_pp_degree)
+        current_env[prefix + "GRADIENT_CLIPPING"] = str(args.megatron_lm_gradient_clipping)
+        if args.megatron_lm_num_micro_batches is not None:
+            current_env[prefix + "NUM_MICRO_BATCHES"] = str(args.megatron_lm_num_micro_batches)
+        if args.megatron_lm_sequence_parallelism is not None:
+            current_env[prefix + "SEQUENCE_PARALLELISM"] = str(args.megatron_lm_sequence_parallelism)
+        if args.megatron_lm_recompute_activations is not None:
+            current_env[prefix + "RECOMPUTE_ACTIVATIONS"] = str(args.megatron_lm_recompute_activations)
+        if args.megatron_lm_use_distributed_optimizer is not None:
+            current_env[prefix + "USE_DISTRIBUTED_OPTIMIZER"] = str(args.megatron_lm_use_distributed_optimizer)
+    current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
+    if args.enable_cpu_affinity:
+        current_env["ACCELERATE_CPU_AFFINITY"] = "1"
+    if args.use_parallelism_config:
+        current_env = prepare_extend_env_parallelism_config(args, current_env)
+    return current_env
+def prepare_extend_env_parallelism_config(
+    args: argparse.Namespace, current_env: dict
+) -> dict[str, str]:
+    """
+    Extends `current_env` with the parallelism config env vars and returns the updated environment
+    """
+    prefix = "PARALLELISM_CONFIG_"
+    current_env["ACCELERATE_USE_PARALLELISM_CONFIG"] = "true"
+    current_env[prefix + "DP_REPLICATE_SIZE"] = str(args.parallelism_config_dp_replicate_size)
+    current_env[prefix + "DP_SHARD_SIZE"] = str(args.parallelism_config_dp_shard_size)
+    current_env[prefix + "TP_SIZE"] = str(args.parallelism_config_tp_size)
+    current_env[prefix + "CP_SIZE"] = str(args.parallelism_config_cp_size)
+    current_env[prefix + "CP_BACKEND"] = str(args.parallelism_config_cp_backend)
+    current_env[prefix + "SP_SIZE"] = str(args.parallelism_config_sp_size)
+    current_env[prefix + "SP_BACKEND"] = str(args.parallelism_config_sp_backend)
+    if args.parallelism_config_cp_size > 1:
+        current_env[prefix + "CP_COMM_STRATEGY"] = str(args.parallelism_config_cp_comm_strategy)
+    if args.parallelism_config_sp_size > 1:
+        current_env[prefix + "SP_SEQ_LENGTH"] = str(args.parallelism_config_sp_seq_length)
+        current_env[prefix + "SP_SEQ_LENGTH_IS_VARIABLE"] = str(args.parallelism_config_sp_seq_length_is_variable)
+        current_env[prefix + "SP_ATTN_IMPLEMENTATION"] = str(args.parallelism_config_sp_attn_implementation)
+    return current_env
+def prepare_deepspeed_cmd_env(args: argparse.Namespace) -> tuple[list[str], dict[str, str]]:
+    """
+    Prepares and returns the command list and an environment with the correct DeepSpeed environment variables.
+    """
+    # get free port and update configurations
+    if args.main_process_port == 0:
+        args.main_process_port = get_free_port()
+    elif args.main_process_port is None:
+        args.main_process_port = 29500
+    num_processes = args.num_processes
+    num_machines = args.num_machines
+    main_process_ip = args.main_process_ip
+    main_process_port = args.main_process_port
+    cmd = None
+    # make sure launcher is not None
+    if args.deepspeed_multinode_launcher is None:
+        # set to default pdsh
+        args.deepspeed_multinode_launcher = DEEPSPEED_MULTINODE_LAUNCHERS[0]
+    if num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
+        cmd = ["deepspeed"]
+        cmd.extend(["--hostfile", str(args.deepspeed_hostfile)])
+        if args.deepspeed_multinode_launcher == "nossh":
+            if compare_versions("deepspeed", "<", "0.14.5"):
+                raise ValueError("nossh launcher requires DeepSpeed >= 0.14.5")
+            cmd.extend(["--node_rank", str(args.machine_rank), "--no_ssh"])
+        else:
+            cmd.extend(["--no_local_rank", "--launcher", str(args.deepspeed_multinode_launcher)])
+        if args.deepspeed_exclusion_filter is not None:
+            cmd.extend(
+                [
+                    "--exclude",
+                    str(args.deepspeed_exclusion_filter),
+                ]
+            )
+        elif args.deepspeed_inclusion_filter is not None:
+            cmd.extend(
+                [
+                    "--include",
+                    str(args.deepspeed_inclusion_filter),
+                ]
+            )
+        else:
+            cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])
+        if main_process_ip:
+            cmd.extend(["--master_addr", str(main_process_ip)])
+        cmd.extend(["--master_port", str(main_process_port)])
+        if args.module and args.no_python:
+            raise ValueError("--module and --no_python cannot be used together")
+        elif args.module:
+            cmd.append("--module")
+        elif args.no_python:
+            cmd.append("--no_python")
+        cmd.append(args.training_script)
+        cmd.extend(args.training_script_args)
+    elif num_machines > 1 and args.deepspeed_multinode_launcher == DEEPSPEED_MULTINODE_LAUNCHERS[1]:
+        args.nproc_per_node = str(num_processes // num_machines)
+        args.nnodes = str(num_machines)
+        args.node_rank = int(args.machine_rank)
+        if getattr(args, "same_network", False):
+            args.master_addr = str(main_process_ip)
+            args.master_port = str(main_process_port)
+        else:
+            args.rdzv_endpoint = f"{main_process_ip}:{main_process_port}"
+    else:
+        args.nproc_per_node = str(num_processes)
+        if main_process_port is not None:
+            args.master_port = str(main_process_port)
+    # only need to check port availability in the main process, in case we have to start multiple launchers on the same machine
+    # for reasons such as splitting log files.
+    need_port_check = num_machines <= 1 or int(args.machine_rank) == 0
+    if need_port_check and is_port_in_use(main_process_port):
+        if num_machines <= 1:
+            args.standalone = True
+            warnings.warn(
+                f"Port `{main_process_port}` is already in use. "
+                "Accelerate will attempt to launch in a standalone-like mode by finding an open port automatically for this session. "
+                "If this current attempt fails, or for more control in future runs, please specify a different port "
+                "(e.g., `--main_process_port <your_chosen_port>`) or use `--main_process_port 0` for automatic selection "
+                "in your launch command or Accelerate config file."
+            )
+        else:
+            raise ConnectionError(
+                f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
+                "Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file)"
+                " and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`."
+            )
+    if args.module and args.no_python:
+        raise ValueError("--module and --no_python cannot be used together")
+    elif args.module:
+        args.module = True
+    elif args.no_python:
+        args.no_python = True
+    current_env = os.environ.copy()
+    if args.debug:
+        current_env["ACCELERATE_DEBUG_MODE"] = "true"
+    gpu_ids = getattr(args, "gpu_ids", "all")
+    if gpu_ids != "all" and args.gpu_ids is not None:
+        if is_xpu_available():
+            current_env["ZE_AFFINITY_MASK"] = gpu_ids
+        elif is_mlu_available():
+            current_env["MLU_VISIBLE_DEVICES"] = gpu_ids
+        elif is_sdaa_available():
+            current_env["SDAA_VISIBLE_DEVICES"] = gpu_ids
+        elif is_musa_available():
+            current_env["MUSA_VISIBLE_DEVICES"] = gpu_ids
+        elif is_npu_available():
+            current_env["ASCEND_RT_VISIBLE_DEVICES"] = gpu_ids
+        elif is_hpu_available():
+            current_env["HABANA_VISIBLE_MODULES"] = gpu_ids
+        else:
+            current_env["CUDA_VISIBLE_DEVICES"] = gpu_ids
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )
+    current_env["PYTHONPATH"] = env_var_path_add("PYTHONPATH", os.path.abspath("."))
+    current_env["ACCELERATE_MIXED_PRECISION"] = str(mixed_precision)
+    if args.mixed_precision.lower() == "fp8":
+        if not is_fp8_available():
+            raise RuntimeError(
+                "FP8 is not available on this machine. Please ensure that either Transformer Engine, MSAMP or torchao is installed."
+            )
+        current_env = setup_fp8_env(args, current_env)
+    current_env["ACCELERATE_CONFIG_DS_FIELDS"] = str(args.deepspeed_fields_from_accelerate_config).lower()
+    current_env["ACCELERATE_USE_DEEPSPEED"] = "true"
+    if args.zero_stage is not None:
+        current_env["ACCELERATE_DEEPSPEED_ZERO_STAGE"] = str(args.zero_stage)
+    if args.gradient_accumulation_steps is not None:
+        current_env["ACCELERATE_GRADIENT_ACCUMULATION_STEPS"] = str(args.gradient_accumulation_steps)
+    if args.gradient_clipping is not None:
+        current_env["ACCELERATE_GRADIENT_CLIPPING"] = str(args.gradient_clipping).lower()
+    if args.offload_optimizer_device is not None:
+        current_env["ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device).lower()
+    if args.offload_param_device is not None:
+        current_env["ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE"] = str(args.offload_param_device).lower()
+    if args.zero3_init_flag is not None:
+        current_env["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = str(args.zero3_init_flag).lower()
+    if args.zero3_save_16bit_model is not None:
+        current_env["ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL"] = str(args.zero3_save_16bit_model).lower()
+    if args.deepspeed_config_file is not None:
+        current_env["ACCELERATE_DEEPSPEED_CONFIG_FILE"] = str(args.deepspeed_config_file)
+    if args.enable_cpu_affinity:
+        current_env["ACCELERATE_CPU_AFFINITY"] = "1"
+    if args.deepspeed_moe_layer_cls_names is not None:
+        current_env["ACCELERATE_DEEPSPEED_MOE_LAYER_CLS_NAMES"] = str(args.deepspeed_moe_layer_cls_names)
+    if args.use_parallelism_config:
+        current_env = prepare_extend_env_parallelism_config(args, current_env)
+    return cmd, current_env
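Review note: the resource-selection branch of the DeepSpeed command above gives the exclusion filter precedence over the inclusion filter and falls back to a per-node GPU count only when neither is set. A minimal sketch of that precedence follows. The helper name `deepspeed_resource_flags` and its parameters are hypothetical and not part of accelerate; only the flag names come from the diff.

```python
from typing import Optional


def deepspeed_resource_flags(
    exclusion_filter: Optional[str],
    inclusion_filter: Optional[str],
    num_processes: int,
    num_machines: int,
) -> list[str]:
    """Hypothetical helper mirroring the if/elif/else precedence in the diff."""
    if exclusion_filter is not None:
        # An exclusion filter wins even if an inclusion filter is also given.
        return ["--exclude", str(exclusion_filter)]
    if inclusion_filter is not None:
        return ["--include", str(inclusion_filter)]
    # Neither filter set: pass the number of processes per machine.
    return ["--num_gpus", str(num_processes // num_machines)]
```

Note the integer division: with 16 processes on 2 machines the fallback passes `--num_gpus 8`, and any remainder is silently dropped.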
+def prepare_tpu(
+    args: argparse.Namespace, current_env: dict[str, str], pod: bool = False
+) -> tuple[argparse.Namespace, dict[str, str]]:
+    """
+    Prepares and returns an environment with the correct TPU environment variables.
+    """
+    if args.mixed_precision == "bf16" and is_torch_xla_available(check_is_tpu=True):
+        if args.downcast_bf16:
+            current_env["XLA_DOWNCAST_BF16"] = "1"
+        else:
+            current_env["XLA_USE_BF16"] = "1"
+    if args.debug:
+        current_env["ACCELERATE_DEBUG_MODE"] = "true"
+    if pod:
+        # Take explicit args and set them up for XLA
+        args.vm = args.tpu_vm
+        args.tpu = args.tpu_name
+    return args, current_env
+def _convert_nargs_to_dict(nargs: list[str]) -> dict[str, str]:
+    if len(nargs) == 0:
+        return {}
+    # helper function to infer type for argparse
+    def _infer_type(s):
+        try:
+            s = float(s)
+            if s // 1 == s:
+                return int(s)
+            return s
+        except ValueError:
+            return s
+    parser = argparse.ArgumentParser()
+    _, unknown = parser.parse_known_args(nargs)
+    for index, argument in enumerate(unknown):
+        if argument.startswith(("-", "--")):
+            action = None
+            if index + 1 < len(unknown):  # checks if next index would be in list
+                if unknown[index + 1].startswith(("-", "--")):  # checks if next element is a key
+                    # raise an error if element is store_true or store_false
+                    raise ValueError(
+                        "SageMaker doesn’t support argparse actions for `store_true` or `store_false`. Please define explicit types"
+                    )
+            else:  # raise an error if last element is store_true or store_false
+                raise ValueError(
+                    "SageMaker doesn’t support argparse actions for `store_true` or `store_false`. Please define explicit types"
+                )
+            # adds argument to parser based on action_store true
+            if action is None:
+                parser.add_argument(argument, type=_infer_type)
+            else:
+                parser.add_argument(argument, action=action)
+    return {
+        key: (literal_eval(value) if value in ("True", "False") else value)
+        for key, value in parser.parse_args(nargs).__dict__.items()
+    }
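Review note: the coercion rule used by the nested `_infer_type` helper above is easy to misread, so here is a standalone sketch of it. The top-level name `infer_type` is only for illustration; the logic follows the diff (try `float`, collapse whole numbers to `int`, otherwise keep the string).

```python
def infer_type(s: str):
    """Same coercion rule as the nested `_infer_type` in the diff above."""
    try:
        value = float(s)
    except ValueError:
        return s  # not numeric: keep the raw string
    if value // 1 == value:
        return int(value)  # whole numbers collapse to int, so "2.0" becomes 2
    return value
```

One consequence worth knowing: a hyperparameter passed as `2.0` reaches the training script as the integer `2`, not as a float.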
+def prepare_sagemager_args_inputs(
+    sagemaker_config: SageMakerConfig, args: argparse.Namespace
+) -> tuple[argparse.Namespace, dict[str, Any]]:
+    # configure environment
+    print("Configuring Amazon SageMaker environment")
+    os.environ["AWS_DEFAULT_REGION"] = sagemaker_config.region
+    # configure credentials
+    if sagemaker_config.profile is not None:
+        os.environ["AWS_PROFILE"] = sagemaker_config.profile
+    elif args.aws_access_key_id is not None and args.aws_secret_access_key is not None:
+        os.environ["AWS_ACCESS_KEY_ID"] = args.aws_access_key_id
+        os.environ["AWS_SECRET_ACCESS_KEY"] = args.aws_secret_access_key
+    else:
+        raise OSError("You need to provide an aws_access_key_id and aws_secret_access_key when not using aws_profile")
+    # extract needed arguments
+    source_dir = os.path.dirname(args.training_script)
+    if not source_dir:  # checks if string is empty
+        source_dir = "."
+    entry_point = os.path.basename(args.training_script)
+    if not entry_point.endswith(".py"):
+        raise ValueError(f'Your training script should be a python script and not "{entry_point}"')
+    print("Converting Arguments to Hyperparameters")
+    hyperparameters = _convert_nargs_to_dict(args.training_script_args)
+    try:
+        mixed_precision = PrecisionType(args.mixed_precision.lower())
+    except ValueError:
+        raise ValueError(
+            f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
+        )
+    try:
+        dynamo_backend = DynamoBackend(args.dynamo_backend.upper())
+    except ValueError:
+        raise ValueError(
+            f"Unknown dynamo backend: {args.dynamo_backend.upper()}. Choose between {DynamoBackend.list()}."
+        )
+    # Environment variables to be set for use during training job
+    environment = {
+        "ACCELERATE_USE_SAGEMAKER": "true",
+        "ACCELERATE_MIXED_PRECISION": str(mixed_precision),
+        "ACCELERATE_DYNAMO_BACKEND": dynamo_backend.value,
+        "ACCELERATE_DYNAMO_MODE": args.dynamo_mode,
+        "ACCELERATE_DYNAMO_USE_FULLGRAPH": str(args.dynamo_use_fullgraph),
+        "ACCELERATE_DYNAMO_USE_DYNAMIC": str(args.dynamo_use_dynamic),
+        "ACCELERATE_DYNAMO_USE_REGIONAL_COMPILATION": str(args.dynamo_use_regional_compilation),
+        "ACCELERATE_SAGEMAKER_DISTRIBUTED_TYPE": sagemaker_config.distributed_type.value,
+    }
+    if args.mixed_precision.lower() == "fp8":
+        if not is_fp8_available():
+            raise RuntimeError(
+                "FP8 is not available on this machine. Please ensure that either Transformer Engine, MSAMP or torchao is installed."
+            )
+        environment = setup_fp8_env(args, environment)
+    # configure distribution set up
+    distribution = None
+    if sagemaker_config.distributed_type == SageMakerDistributedType.DATA_PARALLEL:
+        distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
+    # configure sagemaker inputs
+    sagemaker_inputs = None
+    if sagemaker_config.sagemaker_inputs_file is not None:
+        print(f"Loading SageMaker Inputs from {sagemaker_config.sagemaker_inputs_file} file")
+        sagemaker_inputs = {}
+        with open(sagemaker_config.sagemaker_inputs_file) as file:
+            for i, line in enumerate(file):
+                if i == 0:
+                    continue
+                l = line.split("\t")
+                sagemaker_inputs[l[0]] = l[1].strip()
+        print(f"Loaded SageMaker Inputs: {sagemaker_inputs}")
+    # configure sagemaker metrics
+    sagemaker_metrics = None
+    if sagemaker_config.sagemaker_metrics_file is not None:
+        print(f"Loading SageMaker Metrics from {sagemaker_config.sagemaker_metrics_file} file")
+        sagemaker_metrics = []
+        with open(sagemaker_config.sagemaker_metrics_file) as file:
+            for i, line in enumerate(file):
+                if i == 0:
+                    continue
+                l = line.split("\t")
+                metric_dict = {
+                    "Name": l[0],
+                    "Regex": l[1].strip(),
+                }
+                sagemaker_metrics.append(metric_dict)
+        print(f"Loaded SageMaker Metrics: {sagemaker_metrics}")
+    # configure session
+    print("Creating Estimator")
+    args = {
+        "image_uri": sagemaker_config.image_uri,
+        "entry_point": entry_point,
+        "source_dir": source_dir,
+        "role": sagemaker_config.iam_role_name,
+        "transformers_version": sagemaker_config.transformers_version,
+        "pytorch_version": sagemaker_config.pytorch_version,
+        "py_version": sagemaker_config.py_version,
+        "base_job_name": sagemaker_config.base_job_name,
+        "instance_count": sagemaker_config.num_machines,
+        "instance_type": sagemaker_config.ec2_instance_type,
+        "debugger_hook_config": False,
+        "distribution": distribution,
+        "hyperparameters": hyperparameters,
+        "environment": environment,
+        "metric_definitions": sagemaker_metrics,
+    }
+    if sagemaker_config.additional_args is not None:
+        args = merge_dicts(sagemaker_config.additional_args, args)
+    return args, sagemaker_inputs
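Review note: both the inputs file and the metrics file above are read as tab-separated text with the first row skipped as a header. A minimal sketch of that parsing for the two-column inputs case follows. The function name `read_two_column_tsv`, the header names, and the bucket path are hypothetical and chosen only for illustration.

```python
import io


def read_two_column_tsv(handle) -> dict[str, str]:
    """Skip the header row, then map column 0 to column 1 (whitespace-stripped)."""
    result = {}
    for i, line in enumerate(handle):
        if i == 0:
            continue  # header row is ignored, as in the diff
        fields = line.split("\t")
        result[fields[0]] = fields[1].strip()
    return result


# Hypothetical file content: a header row followed by one input channel.
sample = io.StringIO("channel_name\tdata_location\ntrain\ts3://example-bucket/train\n")
parsed = read_two_column_tsv(sample)
```

As in the diff, a line with no tab would raise an `IndexError`, so a blank trailing line in the file is not tolerated.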
+def env_var_path_add(env_var_name, path_to_add):
+    """
+    Extends a path-based environment variable's value with a new path and returns the updated value. It's up to the
+    caller to set it in os.environ.
+    """
+    paths = [p for p in os.environ.get(env_var_name, "").split(":") if len(p) > 0]
+    paths.append(str(path_to_add))
+    return ":".join(paths)
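Review note: a short usage sketch of `env_var_path_add`, with the function body copied from the diff so the snippet runs on its own. The variable name `DEMO_PATH_VAR` and the paths are hypothetical. It shows two properties that follow from the code: empty segments are dropped, and the environment itself is left unchanged.

```python
import os


def env_var_path_add(env_var_name, path_to_add):
    # Body copied from the diff above.
    paths = [p for p in os.environ.get(env_var_name, "").split(":") if len(p) > 0]
    paths.append(str(path_to_add))
    return ":".join(paths)


os.environ["DEMO_PATH_VAR"] = "/opt/a::/opt/b"  # hypothetical variable with an empty segment
updated = env_var_path_add("DEMO_PATH_VAR", "/opt/c")
```

The separator is a hard-coded `:` rather than `os.pathsep`, so the returned value is only meaningful on platforms that use `:` as the path separator.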
+class PrepareForLaunch:
+    """
+    Prepare a function that will be launched in a distributed setup.
+    Args:
+        launcher (`Callable`):
+            The function to launch.
+        distributed_type ([`~state.DistributedType`]):
+            The distributed type to prepare for.
+        debug (`bool`, *optional*, defaults to `False`):
+            Whether or not this is a debug launch.
+    """
+    def __init__(self, launcher, distributed_type="NO", debug=False):
+        self.launcher = launcher
+        self.distributed_type = DistributedType(distributed_type)
+        self.debug = debug
+    def __call__(self, index, *args):
+        if self.debug:
+            world_size = int(os.environ.get("WORLD_SIZE"))
+            rdv_file = os.environ.get("ACCELERATE_DEBUG_RDV_FILE")
+            torch.distributed.init_process_group(
+                "gloo",
+                rank=index,
+                store=torch.distributed.FileStore(rdv_file, world_size),
+                world_size=world_size,
+            )
+        elif self.distributed_type in (
+            DistributedType.MULTI_GPU,
+            DistributedType.MULTI_MLU,
+            DistributedType.MULTI_MUSA,
+            DistributedType.MULTI_NPU,
+            DistributedType.MULTI_XPU,
+            DistributedType.MULTI_CPU,
+        ):
+            # Prepare the environment for torch.distributed
+            os.environ["LOCAL_RANK"] = str(index)
+            nproc = int(os.environ.get("NPROC", 1))
+            node_rank = int(os.environ.get("NODE_RANK", 0))
+            os.environ["RANK"] = str(nproc * node_rank + index)
+        os.environ["FORK_LAUNCHED"] = str(1)
+        self.launcher(*args)
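Review note: in the non-debug branch above, the global `RANK` is derived from the per-node process count, the node rank, and the local process index. A minimal sketch of that arithmetic follows. The function name `global_rank` is hypothetical; the formula is the one in the diff.

```python
def global_rank(nproc_per_node: int, node_rank: int, local_index: int) -> int:
    """RANK = NPROC * NODE_RANK + index, as computed in `PrepareForLaunch.__call__`."""
    return nproc_per_node * node_rank + local_index


# Two nodes with four processes each yield the contiguous ranks 0..7.
all_ranks = [global_rank(4, node, idx) for node in range(2) for idx in range(4)]
```

With the defaults from the diff (`NPROC` unset gives 1, `NODE_RANK` unset gives 0), the global rank equals the local index.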

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/megatron_lm.py ADDED

	@@ -0,0 +1,1424 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import math
+import os
+from abc import ABC
+from functools import partial
+import torch
+import torch.nn.functional as F
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
+from ..optimizer import AcceleratedOptimizer
+from ..scheduler import AcceleratedScheduler
+from .imports import is_megatron_lm_available
+from .operations import recursively_apply, send_to_device
+if is_megatron_lm_available():
+    from megatron.core import mpu, tensor_parallel
+    from megatron.core.distributed import DistributedDataParallel as LocalDDP
+    from megatron.core.distributed import finalize_model_grads
+    from megatron.core.enums import ModelType
+    from megatron.core.num_microbatches_calculator import get_num_microbatches
+    from megatron.core.optimizer import get_megatron_optimizer
+    from megatron.core.parallel_state import get_tensor_model_parallel_group, get_tensor_model_parallel_src_rank
+    from megatron.core.pipeline_parallel import get_forward_backward_func
+    from megatron.core.utils import get_model_config
+    from megatron.inference.text_generation.communication import broadcast_int_list, broadcast_tensor
+    from megatron.inference.text_generation.generation import (
+        beam_search_and_return_on_first_stage,
+        generate_tokens_probs_and_return_on_first_stage,
+    )
+    from megatron.legacy.data.dataset_utils import build_train_valid_test_datasets
+    from megatron.legacy.model import BertModel, Float16Module, GPTModel, T5Model
+    from megatron.legacy.model.classification import Classification
+    from megatron.training import (
+        get_args,
+        get_tensorboard_writer,
+        get_tokenizer,
+        print_rank_last,
+    )
+    from megatron.training.arguments import (
+        _add_data_args,
+        _add_validation_args,
+        core_transformer_config_from_args,
+        parse_args,
+        validate_args,
+    )
+    from megatron.training.checkpointing import load_args_from_checkpoint, load_checkpoint, save_checkpoint
+    from megatron.training.global_vars import set_global_variables
+    from megatron.training.initialize import (
+        _compile_dependencies,
+        _init_autoresume,
+        _initialize_distributed,
+        _set_random_seed,
+        set_jit_fusion_options,
+        write_args_to_tensorboard,
+    )
+    from megatron.training.tokenizer.tokenizer import _vocab_size_with_padding
+    from megatron.training.training import (
+        build_train_valid_test_data_iterators,
+        get_optimizer_param_scheduler,
+        num_floating_point_operations,
+        setup_model_and_optimizer,
+        train_step,
+        training_log,
+    )
+    from megatron.training.utils import (
+        average_losses_across_data_parallel_group,
+        calc_params_l2_norm,
+        get_ltor_masks_and_position_ids,
+        unwrap_model,
+    )
+# model utilities
+def model_provider_func(pre_process=True, post_process=True, add_encoder=True, add_decoder=True):
+    """Build the model."""
+    args = get_args()
+    mode = "pre-training" if args.pretraining_flag else "fine-tuning"
+    if args.rank == 0:
+        print(f"Building {args.model_type_name} model in the {mode} mode.")
+        print(
+            "The Megatron LM model weights are initialized at random in `accelerator.prepare`. "
+            "Please use `accelerator.load_checkpoint` to load a pre-trained checkpoint matching the distributed setup."
+        )
+    config = core_transformer_config_from_args(args)
+    if args.model_type_name == "bert":
+        if args.pretraining_flag:
+            num_tokentypes = 2 if args.bert_binary_head else 0
+            model = BertModel(
+                config=config,
+                num_tokentypes=num_tokentypes,
+                add_binary_head=args.bert_binary_head,
+                parallel_output=True,
+                pre_process=pre_process,
+                post_process=post_process,
+            )
+        else:
+            model = Classification(
+                config=config,
+                num_classes=args.num_labels,
+                num_tokentypes=2,
+                pre_process=pre_process,
+                post_process=post_process,
+            )
+    elif args.model_type_name == "gpt":
+        model = GPTModel(
+            config=config,
+            num_tokentypes=0,
+            parallel_output=True,
+            pre_process=pre_process,
+            post_process=post_process,
+        )
+    elif args.model_type_name == "t5":
+        model = T5Model(
+            config=config,
+            num_tokentypes=0,
+            parallel_output=True,
+            pre_process=pre_process,
+            post_process=post_process,
+            add_encoder=add_encoder,
+            add_decoder=add_decoder,
+        )
+    else:
+        raise ValueError(f"Unsupported model type: {args.model_type_name}")
+    return model
+def prepare_model_optimizer_scheduler(accelerator):
+    accelerator.print("Preparing model optimizer scheduler")
+    args = get_args()
+    if accelerator.state.megatron_lm_plugin.custom_prepare_model_function is not None:
+        if accelerator.state.megatron_lm_plugin.custom_model_provider_function is None:
+            raise ValueError(
+                "You must provide a `custom_model_provider_function` when using a `custom_prepare_model_function`."
+            )
+        custom_model_provider_func = accelerator.state.megatron_lm_plugin.custom_model_provider_function
+        model = accelerator.state.megatron_lm_plugin.custom_prepare_model_function(custom_model_provider_func)
+        optimizer = prepare_optimizer(accelerator, model)
+        scheduler = prepare_scheduler(accelerator, optimizer, scheduler=None)
+    else:
+        model_type = ModelType.encoder_or_decoder
+        if args.model_type_name == "t5":
+            model_type = ModelType.encoder_and_decoder
+        model_provider_func_ = model_provider_func
+        if accelerator.state.megatron_lm_plugin.custom_model_provider_function is not None:
+            model_provider_func_ = accelerator.state.megatron_lm_plugin.custom_model_provider_function
+        (model, optimizer, scheduler) = setup_model_and_optimizer(
+            model_provider_func_,
+            model_type,
+            no_wd_decay_cond=args.no_wd_decay_cond,
+            scale_lr_cond=args.scale_lr_cond,
+            lr_mult=args.lr_mult,
+        )
+    args.model_len = len(model)
+    return model, optimizer, scheduler
+# dataloader utilities
+class MegatronLMDummyDataLoader:
+    """
+    Dummy dataloader that only carries Megatron dataset arguments; it is primarily used to follow a conventional training loop.
+    Args:
+        **dataset_kwargs: Megatron data arguments.
+    """
+    def __init__(self, **dataset_kwargs):
+        parser = argparse.ArgumentParser()
+        parser = _add_data_args(parser)
+        parser = _add_validation_args(parser)
+        data_args = parser.parse_known_args()
+        self.dataset_args = vars(data_args[0])
+        self.dataset_args.update(dataset_kwargs)
+        self.dataset_args["megatron_dataset_flag"] = True
+    def set_megatron_data_args(self):
+        args = get_args()
+        for key, value in self.dataset_args.items():
+            old_value = getattr(args, key, "")
+            if old_value != value:
+                print(
+                    f"WARNING: MegatronLMDummyDataLoader overriding arguments for {key}:{old_value} with {key}:{value}"
+                )
+            setattr(args, key, value)
+    def get_train_valid_test_datasets_provider(self, accelerator):
+        def train_valid_test_datasets_provider(train_val_test_num_samples):
+            """Build train, valid, and test datasets."""
+            args = get_args()
+            dataset_args = {
+                "data_prefix": args.data_path if isinstance(args.data_path, (list, tuple)) else [args.data_path],
+                "splits_string": args.split,
+                "train_valid_test_num_samples": train_val_test_num_samples,
+                "seed": args.seed,
+            }
+            if args.model_type_name == "bert":
+                dataset_args.update(
+                    {
+                        "max_seq_length": args.seq_length,
+                        "binary_head": args.bert_binary_head,
+                    }
+                )
+            elif args.model_type_name == "gpt":
+                dataset_args.update(
+                    {
+                        "max_seq_length": args.seq_length,
+                    }
+                )
+            elif args.model_type_name == "t5":
+                dataset_args.update(
+                    {
+                        "max_seq_length": args.encoder_seq_length,
+                        "max_seq_length_dec": args.decoder_seq_length,
+                        "dataset_type": "t5",
+                    }
+                )
+            else:
+                raise ValueError(f"Unsupported model type: {args.model_type_name}")
+            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(**dataset_args)
+            return train_ds, valid_ds, test_ds
+        if accelerator.state.megatron_lm_plugin.custom_megatron_datasets_provider_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_megatron_datasets_provider_function
+        try:
+            args = get_args()
+            # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
+            if args.model_type_name == "bert":
+                from pretrain_bert import train_valid_test_datasets_provider
+                train_valid_test_datasets_provider.is_distributed = True
+                return train_valid_test_datasets_provider
+            elif args.model_type_name == "gpt":
+                from pretrain_gpt import train_valid_test_datasets_provider
+                train_valid_test_datasets_provider.is_distributed = True
+                return train_valid_test_datasets_provider
+            elif args.model_type_name == "t5":
+                from pretrain_t5 import train_valid_test_datasets_provider
+                train_valid_test_datasets_provider.is_distributed = True
+                return train_valid_test_datasets_provider
+        except ImportError:
+            pass
+        return train_valid_test_datasets_provider
+    def build_train_valid_test_data_iterators(self, accelerator):
+        args = get_args()
+        train_valid_test_dataset_provider = self.get_train_valid_test_datasets_provider(accelerator)
+        if args.virtual_pipeline_model_parallel_size is not None:
+            train_data_iterator = []
+            valid_data_iterator = []
+            test_data_iterator = []
+            for i in range(getattr(args, "model_len", 0)):
+                mpu.set_virtual_pipeline_model_parallel_rank(i)
+                iterators = build_train_valid_test_data_iterators(train_valid_test_dataset_provider)
+                train_data_iterator.append(iterators[0])
+                valid_data_iterator.append(iterators[1])
+                test_data_iterator.append(iterators[2])
+        else:
+            train_data_iterator, valid_data_iterator, test_data_iterator = build_train_valid_test_data_iterators(
+                train_valid_test_dataset_provider
+            )
+        return train_data_iterator, valid_data_iterator, test_data_iterator
+def _handle_megatron_data_iterator(accelerator, data_iterator):
+    class DummyMegatronDataloader:
+        def __iter__(self):
+            return self
+        def __next__(self):
+            return {}
+    is_data_iterator_empty = data_iterator is None
+    is_src_data_iterator_empty = torch.tensor(is_data_iterator_empty, dtype=torch.bool, device=accelerator.device)
+    torch.distributed.broadcast(
+        is_src_data_iterator_empty, get_tensor_model_parallel_src_rank(), group=get_tensor_model_parallel_group()
+    )
+    if not is_src_data_iterator_empty and is_data_iterator_empty:
+        return DummyMegatronDataloader()
+    return data_iterator
+def prepare_data_loader(accelerator, dataloader):
+    accelerator.print("Preparing dataloader")
+    args = get_args()
+    if not args.megatron_dataset_flag:
+        from ..data_loader import _PYTORCH_DATALOADER_KWARGS, prepare_data_loader
+        micro_batch_size = args.micro_batch_size * args.num_micro_batches
+        kwargs = {k: getattr(dataloader, k, _PYTORCH_DATALOADER_KWARGS[k]) for k in _PYTORCH_DATALOADER_KWARGS}
+        if kwargs["batch_size"] is None:
+            if isinstance(kwargs["sampler"], torch.utils.data.BatchSampler):
+                kwargs["sampler"].batch_size = micro_batch_size
+            else:
+                del kwargs["sampler"]
+                del kwargs["shuffle"]
+                del kwargs["batch_size"]
+                kwargs["batch_sampler"].batch_size = micro_batch_size
+        else:
+            del kwargs["batch_sampler"]
+            kwargs["batch_size"] = micro_batch_size
+        dataloader = torch.utils.data.DataLoader(dataloader.dataset, **kwargs)
+        # split_batches:
+        # Megatron only needs to fetch different data between different dp groups,
+        # and does not need to split the data within the dp group.
+        return prepare_data_loader(
+            dataloader,
+            accelerator.device,
+            num_processes=mpu.get_data_parallel_world_size(),
+            process_index=mpu.get_data_parallel_rank(),
+            split_batches=False,
+            put_on_device=True,
+            rng_types=accelerator.rng_types.copy(),
+            dispatch_batches=accelerator.dispatch_batches,
+        )
+    else:
+        if args.consumed_samples is not None:
+            (
+                args.consumed_train_samples,
+                args.consumed_valid_samples,
+                args.consumed_test_samples,
+            ) = args.consumed_samples
+        else:
+            args.consumed_train_samples, args.consumed_valid_samples, args.consumed_test_samples = 0, 0, 0
+        args.micro_batch_size = args.micro_batch_size * args.num_micro_batches
+        # For compatibility with data in transform format, the micro-batch size is
+        # temporarily scaled up by num_micro_batches so the iterators yield one large batch,
+        # which is later split into micro-batches. The original value is restored below.
+        (
+            train_data_iterator,
+            valid_data_iterator,
+            test_data_iterator,
+        ) = dataloader.build_train_valid_test_data_iterators(accelerator)
+        args.micro_batch_size = args.micro_batch_size // args.num_micro_batches
+        train_data_iterator = _handle_megatron_data_iterator(
+            accelerator=accelerator, data_iterator=train_data_iterator
+        )
+        valid_data_iterator = _handle_megatron_data_iterator(
+            accelerator=accelerator, data_iterator=valid_data_iterator
+        )
+        test_data_iterator = _handle_megatron_data_iterator(accelerator=accelerator, data_iterator=test_data_iterator)
+        return train_data_iterator, valid_data_iterator, test_data_iterator
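Review note: both branches of `prepare_data_loader` above fetch batches of size `micro_batch_size * num_micro_batches`, on the expectation that each large batch is later split back into micro-batches. A minimal sketch of that arithmetic on a plain Python list follows. The helper `split_into_micro_batches` is hypothetical and does not reflect how Megatron-LM splits tensors.

```python
def split_into_micro_batches(batch: list, num_micro_batches: int) -> list[list]:
    """Cut one large batch into `num_micro_batches` equal, contiguous chunks."""
    size = len(batch) // num_micro_batches
    return [batch[i * size : (i + 1) * size] for i in range(num_micro_batches)]


micro_batch_size, num_micro_batches = 2, 3
# The dataloader is asked for micro_batch_size * num_micro_batches samples at once.
large_batch = list(range(micro_batch_size * num_micro_batches))
chunks = split_into_micro_batches(large_batch, num_micro_batches)
```

Each chunk has `micro_batch_size` elements again, which is why the diff divides `args.micro_batch_size` by `args.num_micro_batches` after building the iterators.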
+# optimizer utilities
+class MegatronLMOptimizerWrapper(AcceleratedOptimizer):
+    def __init__(self, optimizer):
+        super().__init__(optimizer, device_placement=False, scaler=None)
+    def zero_grad(self, set_to_none=None):
+        pass  # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
+    def step(self):
+        pass  # `model(**batch)` is doing that automatically. Therefore, its implementation is not needed
+    @property
+    def step_was_skipped(self):
+        """Whether or not the optimizer step was done, or skipped because of gradient overflow."""
+        return self.optimizer.skipped_iter
+def prepare_optimizer(accelerator, model):
+    accelerator.print("Preparing optimizer")
+    args = get_args()
+    return get_megatron_optimizer(model, args.no_wd_decay_cond, args.scale_lr_cond, args.lr_mult)
+# scheduler utilities
+class MegatronLMDummyScheduler:
+    """
+    Dummy scheduler that holds the optimizer (model parameters or param groups) and the step counts. It is primarily
+    used to follow the conventional training loop when the scheduler is created by Megatron-LM rather than by the user.
+    Args:
+        optimizer (`torch.optim.optimizer.Optimizer`):
+            The optimizer to wrap.
+        total_num_steps (int):
+            Total number of steps.
+        warmup_num_steps (int):
+            Number of steps for warmup.
+        **kwargs (additional keyword arguments, *optional*):
+            Other arguments.
+    """
+    def __init__(self, optimizer, total_num_steps=None, warmup_num_steps=0, **kwargs):
+        self.optimizer = optimizer
+        self.total_num_steps = total_num_steps
+        self.warmup_num_steps = warmup_num_steps
+        self.kwargs = kwargs
+class MegatronLMSchedulerWrapper(AcceleratedScheduler):
+    def __init__(self, scheduler, optimizers):
+        super().__init__(scheduler, optimizers)
+    def step(self, *args, **kwargs):
+        return  # No-op: `model(**batch)` already steps the scheduler, so no implementation is needed here
+def prepare_scheduler(accelerator, optimizer, scheduler):
+    accelerator.print("Preparing scheduler")
+    scheduler = get_optimizer_param_scheduler(optimizer)
+    return scheduler
+class AbstractTrainStep(ABC):
+    """Abstract class for batching, forward pass and loss handler."""
+    def __init__(self, name):
+        super().__init__()
+        self.name = name
+    def get_batch_func(self, accelerator, megatron_dataset_flag):
+        pass
+    def get_forward_step_func(self):
+        pass
+    def get_loss_func(self, accelerator):
+        pass
+class BertTrainStep(AbstractTrainStep):
+    """
+    Bert train step class.
+    Args:
+        args (`argparse.Namespace`): Megatron-LM arguments.
+    """
+    def __init__(self, accelerator, args):
+        super().__init__("BertTrainStep")
+        self.get_batch = self.get_batch_func(accelerator, args.megatron_dataset_flag)
+        self.loss_func = self.get_loss_func(accelerator, args.pretraining_flag, args.num_labels)
+        self.forward_step = self.get_forward_step_func(args.pretraining_flag, args.bert_binary_head)
+        if not args.model_return_dict:
+            self.model_output_class = None
+        else:
+            from transformers.modeling_outputs import SequenceClassifierOutput
+            self.model_output_class = SequenceClassifierOutput
+    def get_batch_func(self, accelerator, megatron_dataset_flag):
+        def get_batch_megatron(data_iterator):
+            """Build the batch."""
+            # Items and their type.
+            keys = ["text", "types", "labels", "is_random", "loss_mask", "padding_mask"]
+            datatype = torch.int64
+            # Broadcast data.
+            if data_iterator is not None:
+                data = next(data_iterator)
+            else:
+                data = None
+            data_b = tensor_parallel.broadcast_data(keys, data, datatype)
+            # Unpack.
+            tokens = data_b["text"].long()
+            types = data_b["types"].long()
+            sentence_order = data_b["is_random"].long()
+            loss_mask = data_b["loss_mask"].float()
+            lm_labels = data_b["labels"].long()
+            padding_mask = data_b["padding_mask"].long()
+            return tokens, types, sentence_order, loss_mask, lm_labels, padding_mask
+        def get_batch_transformer(data_iterator):
+            """Build the batch."""
+            data = next(data_iterator)
+            data = send_to_device(data, torch.cuda.current_device())
+            # Unpack.
+            tokens = data["input_ids"].long()
+            padding_mask = data["attention_mask"].long()
+            if "token_type_ids" in data:
+                types = data["token_type_ids"].long()
+            else:
+                types = None
+            if "labels" in data:
+                lm_labels = data["labels"].long()
+                loss_mask = (data["labels"] != -100).to(torch.float)
+            else:
+                lm_labels = None
+                loss_mask = None
+            if "next_sentence_label" in data:
+                sentence_order = data["next_sentence_label"].long()
+            else:
+                sentence_order = None
+            return tokens, types, sentence_order, loss_mask, lm_labels, padding_mask
+        if accelerator.state.megatron_lm_plugin.custom_get_batch_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_get_batch_function
+        if megatron_dataset_flag:
+            try:
+                # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
+                from pretrain_bert import get_batch
+                return get_batch
+            except ImportError:
+                pass
+            return get_batch_megatron
+        else:
+            return get_batch_transformer
+    def get_loss_func(self, accelerator, pretraining_flag, num_labels):
+        def loss_func_pretrain(loss_mask, sentence_order, output_tensor):
+            lm_loss_, sop_logits = output_tensor
+            lm_loss_ = lm_loss_.float()
+            loss_mask = loss_mask.float()
+            lm_loss = torch.sum(lm_loss_.view(-1) * loss_mask.reshape(-1)) / loss_mask.sum()
+            if sop_logits is not None:
+                sop_loss = F.cross_entropy(sop_logits.view(-1, 2).float(), sentence_order.view(-1), ignore_index=-1)
+                sop_loss = sop_loss.float()
+                loss = lm_loss + sop_loss
+                averaged_losses = average_losses_across_data_parallel_group([lm_loss, sop_loss])
+                return loss, {"lm loss": averaged_losses[0], "sop loss": averaged_losses[1]}
+            else:
+                loss = lm_loss
+                averaged_losses = average_losses_across_data_parallel_group([lm_loss])
+                return loss, {"lm loss": averaged_losses[0]}
+        def loss_func_finetune(labels, logits):
+            if num_labels == 1:
+                #  We are doing regression
+                loss_fct = MSELoss()
+                loss = loss_fct(logits.view(-1), labels.view(-1))
+            elif num_labels > 1 and (labels.dtype in (torch.long, torch.int)):
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
+            else:
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+            averaged_losses = average_losses_across_data_parallel_group([loss])
+            return loss, {"loss": averaged_losses[0]}
+        if accelerator.state.megatron_lm_plugin.custom_loss_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_loss_function
+        if pretraining_flag:
+            return loss_func_pretrain
+        else:
+            return loss_func_finetune
+    def get_forward_step_func(self, pretraining_flag, bert_binary_head):
+        def forward_step(data_iterator, model):
+            """Forward step."""
+            tokens, types, sentence_order, loss_mask, labels, padding_mask = self.get_batch(data_iterator)
+            if not bert_binary_head:
+                types = None
+            # Forward pass through the model.
+            if pretraining_flag:
+                output_tensor = model(tokens, padding_mask, tokentype_ids=types, lm_labels=labels)
+                return output_tensor, partial(self.loss_func, loss_mask, sentence_order)
+            else:
+                logits = model(tokens, padding_mask, tokentype_ids=types)
+                return logits, partial(self.loss_func, labels)
+        return forward_step
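Note on the pretraining loss in `loss_func_pretrain` above: it is a masked mean, where per-token losses are multiplied by `loss_mask` and divided by the number of unmasked tokens. The following is a minimal pure-Python sketch of that arithmetic only. The helper name `masked_mean` is hypothetical, and plain lists stand in for tensors.

```python
def masked_mean(per_token_losses, loss_mask):
    """Average the losses only over positions where the mask is 1."""
    total = sum(loss * mask for loss, mask in zip(per_token_losses, loss_mask))
    count = sum(loss_mask)
    return total / count


losses = [2.0, 4.0, 6.0, 8.0]
mask = [1.0, 0.0, 1.0, 0.0]
result = masked_mean(losses, mask)  # only positions 0 and 2 contribute
```

Like the code above, this sketch does not guard against an all-zero mask; here that raises `ZeroDivisionError`, whereas the tensor version would produce NaN.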
+class GPTTrainStep(AbstractTrainStep):
+    """
+    GPT train step class.
+    Args:
+        args (`argparse.Namespace`): Megatron-LM arguments.
+    """
+    def __init__(self, accelerator, args):
+        super().__init__("GPTTrainStep")
+        self.get_batch = self.get_batch_func(accelerator, args.megatron_dataset_flag)
+        self.loss_func = self.get_loss_func(accelerator)
+        self.forward_step = self.get_forward_step_func()
+        self.eod_token = args.padded_vocab_size - 1
+        if args.vocab_file is not None:
+            tokenizer = get_tokenizer()
+            self.eod_token = tokenizer.eod
+        self.reset_position_ids = args.reset_position_ids
+        self.reset_attention_mask = args.reset_attention_mask
+        self.eod_mask_loss = args.eod_mask_loss
+        if not args.model_return_dict:
+            self.model_output_class = None
+        else:
+            from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
+            self.model_output_class = CausalLMOutputWithCrossAttentions
+    def get_batch_func(self, accelerator, megatron_dataset_flag):
+        def get_batch_megatron(data_iterator):
+            """Generate a batch"""
+            # Items and their type.
+            keys = ["text"]
+            datatype = torch.int64
+            # Broadcast data.
+            if data_iterator is not None:
+                data = next(data_iterator)
+            else:
+                data = None
+            data_b = tensor_parallel.broadcast_data(keys, data, datatype)
+            # Unpack.
+            tokens_ = data_b["text"].long()
+            labels = tokens_[:, 1:].contiguous()
+            tokens = tokens_[:, :-1].contiguous()
+            # Get the masks and position ids.
+            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
+                tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, self.eod_mask_loss
+            )
+            return tokens, labels, loss_mask, attention_mask, position_ids
+        def get_batch_transformer(data_iterator):
+            data = next(data_iterator)
+            data = {"input_ids": data["input_ids"]}
+            data = send_to_device(data, torch.cuda.current_device())
+            tokens_ = data["input_ids"].long()
+            padding = torch.zeros((tokens_.shape[0], 1), dtype=tokens_.dtype, device=tokens_.device) + self.eod_token
+            tokens_ = torch.concat([tokens_, padding], dim=1)
+            labels = tokens_[:, 1:].contiguous()
+            tokens = tokens_[:, :-1].contiguous()
+            # Get the masks and position ids.
+            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
+                tokens, self.eod_token, self.reset_position_ids, self.reset_attention_mask, True
+            )
+            return tokens, labels, loss_mask, attention_mask, position_ids
+        if accelerator.state.megatron_lm_plugin.custom_get_batch_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_get_batch_function
+        if megatron_dataset_flag:
+            try:
+                # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
+                from pretrain_gpt import get_batch
+                return get_batch
+            except ImportError:
+                pass
+            return get_batch_megatron
+        else:
+            return get_batch_transformer
+    def get_loss_func(self, accelerator):
+        args = get_args()
+        def loss_func(loss_mask, output_tensor):
+            if args.return_logits:
+                losses, logits = output_tensor
+            else:
+                losses = output_tensor
+            losses = losses.float()
+            loss_mask = loss_mask.view(-1).float()
+            if args.context_parallel_size > 1:
+                loss = torch.cat([torch.sum(losses.view(-1) * loss_mask).view(1), loss_mask.sum().view(1)])
+                torch.distributed.all_reduce(loss, group=mpu.get_context_parallel_group())
+                loss = loss[0] / loss[1]
+            else:
+                loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
+            # Check individual rank losses are not NaN prior to DP all-reduce.
+            if args.check_for_nan_in_loss_and_grad:
+                global_rank = torch.distributed.get_rank()
+                assert not loss.isnan(), (
+                    f"Rank {global_rank}: found NaN in local forward loss calculation. "
+                    f"Device: {torch.cuda.current_device()}, node: {os.uname()[1]}"
+                )
+            # Reduce loss for logging.
+            averaged_loss = average_losses_across_data_parallel_group([loss])
+            output_dict = {"lm loss": averaged_loss[0]}
+            if args.return_logits:
+                output_dict.update({"logits": logits})
+            return loss, output_dict
+        if accelerator.state.megatron_lm_plugin.custom_loss_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_loss_function
+        return loss_func
+    def get_forward_step_func(self):
+        def forward_step(data_iterator, model):
+            """Forward step."""
+            # Get the batch.
+            tokens, labels, loss_mask, attention_mask, position_ids = self.get_batch(data_iterator)
+            output_tensor = model(tokens, position_ids, attention_mask, labels=labels)
+            return output_tensor, partial(self.loss_func, loss_mask)
+        return forward_step
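Note on `get_batch_transformer` in `GPTTrainStep` above: it appends one `eod_token` column to the input ids and then takes two views shifted by one position, so that each label is the token following its input. A minimal sketch of that shift, assuming nested Python lists in place of tensors and a hypothetical helper name `shift_for_causal_lm`:

```python
def shift_for_causal_lm(input_ids, eod_token):
    """Append one EOD token per row, then split into inputs and next-token labels."""
    padded = [row + [eod_token] for row in input_ids]
    tokens = [row[:-1] for row in padded]  # model inputs
    labels = [row[1:] for row in padded]   # targets, shifted left by one
    return tokens, labels


tokens, labels = shift_for_causal_lm([[5, 6, 7]], eod_token=0)
```

Because of the appended column, `tokens` keeps the original sequence length and the last label is the EOD token.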
+class T5TrainStep(AbstractTrainStep):
+    """
+    T5 train step class.
+    Args:
+        args (`argparse.Namespace`): Megatron-LM arguments.
+    """
+    def __init__(self, accelerator, args):
+        super().__init__("T5TrainStep")
+        self.get_batch = self.get_batch_func(accelerator, args.megatron_dataset_flag)
+        self.loss_func = self.get_loss_func(accelerator)
+        self.forward_step = self.get_forward_step_func()
+        if not args.model_return_dict:
+            self.model_output_class = None
+        else:
+            from transformers.modeling_outputs import Seq2SeqLMOutput
+            self.model_output_class = Seq2SeqLMOutput
+    @staticmethod
+    def attn_mask_postprocess(attention_mask):
+        # We create a 3D attention mask from a 2D tensor mask.
+        # [b, 1, s]
+        attention_mask_b1s = attention_mask.unsqueeze(1)
+        # [b, s, 1]
+        attention_mask_bs1 = attention_mask.unsqueeze(2)
+        # [b, s, s]
+        attention_mask_bss = attention_mask_b1s * attention_mask_bs1
+        # Convert attention mask to binary:
+        extended_attention_mask = attention_mask_bss < 0.5
+        return extended_attention_mask
+    @staticmethod
+    def get_decoder_mask(seq_length, device):
+        attention_mask = torch.tril(torch.ones((1, seq_length, seq_length), device=device))
+        attention_mask = attention_mask < 0.5
+        return attention_mask
+    @staticmethod
+    def get_enc_dec_mask(attention_mask, dec_seq_length, device):
+        batch_size, _ = attention_mask.shape
+        # We create a 3D attention mask from a 2D tensor mask.
+        # [b, 1, s]
+        attention_mask_b1s = attention_mask.unsqueeze(1)
+        # [b, s, 1]
+        attention_mask_bs1 = torch.ones((batch_size, dec_seq_length, 1), device=device)
+        attention_mask_bss = attention_mask_bs1 * attention_mask_b1s
+        extended_attention_mask = attention_mask_bss < 0.5
+        return extended_attention_mask
+    def get_batch_func(self, accelerator, megatron_dataset_flag):
+        def get_batch_megatron(data_iterator):
+            """Build the batch."""
+            keys = ["text_enc", "text_dec", "labels", "loss_mask", "enc_mask", "dec_mask", "enc_dec_mask"]
+            datatype = torch.int64
+            # Broadcast data.
+            if data_iterator is not None:
+                data = next(data_iterator)
+            else:
+                data = None
+            data_b = tensor_parallel.broadcast_data(keys, data, datatype)
+            # Unpack.
+            tokens_enc = data_b["text_enc"].long()
+            tokens_dec = data_b["text_dec"].long()
+            labels = data_b["labels"].long()
+            loss_mask = data_b["loss_mask"].float()
+            enc_mask = data_b["enc_mask"] < 0.5
+            dec_mask = data_b["dec_mask"] < 0.5
+            enc_dec_mask = data_b["enc_dec_mask"] < 0.5
+            return tokens_enc, tokens_dec, loss_mask, labels, enc_mask, dec_mask, enc_dec_mask
+        def get_batch_transformer(data_iterator):
+            """Build the batch."""
+            data = next(data_iterator)
+            data = send_to_device(data, torch.cuda.current_device())
+            tokens_enc = data["input_ids"].long()
+            labels = data["labels"].long()
+            loss_mask = (labels != -100).to(torch.float)
+            if "decoder_input_ids" in data:
+                tokens_dec = data["decoder_input_ids"].long()
+            else:
+                tokens_dec = labels.new_zeros(labels.shape, device=labels.device, dtype=torch.long)
+                tokens_dec[..., 1:] = labels[..., :-1].clone()
+                tokens_dec[..., 0] = 0
+                tokens_dec.masked_fill_(tokens_dec == -100, 0)
+            enc_mask = T5TrainStep.attn_mask_postprocess(data["attention_mask"].long())
+            dec_mask = T5TrainStep.get_decoder_mask(tokens_dec.shape[1], tokens_dec.device)
+            enc_dec_mask = T5TrainStep.get_enc_dec_mask(
+                data["attention_mask"].long(), tokens_dec.shape[1], tokens_dec.device
+            )
+            return tokens_enc, tokens_dec, loss_mask, labels, enc_mask, dec_mask, enc_dec_mask
+        if accelerator.state.megatron_lm_plugin.custom_get_batch_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_get_batch_function
+        if megatron_dataset_flag:
+            try:
+                # Use '--no-use-pep517 -e' to pip install nvidia's megatron from source
+                from pretrain_t5 import get_batch
+                return get_batch
+            except ImportError:
+                pass
+            return get_batch_megatron
+        else:
+            return get_batch_transformer
+    def get_loss_func(self, accelerator):
+        def loss_func(loss_mask, output_tensor):
+            lm_loss_ = output_tensor.float()
+            lm_loss = torch.sum(lm_loss_.view(-1) * loss_mask.reshape(-1)) / loss_mask.sum()
+            loss = lm_loss
+            averaged_losses = average_losses_across_data_parallel_group([lm_loss])
+            return loss, {"lm loss": averaged_losses[0]}
+        if accelerator.state.megatron_lm_plugin.custom_loss_function is not None:
+            return accelerator.state.megatron_lm_plugin.custom_loss_function
+        return loss_func
+    def get_forward_step_func(self):
+        def forward_step(data_iterator, model):
+            """Forward step."""
+            # Get the batch.
+            tokens_enc, tokens_dec, loss_mask, lm_labels, enc_mask, dec_mask, enc_dec_mask = self.get_batch(
+                data_iterator
+            )
+            # Forward pass through the model, passing lm_labels.
+            output_tensor = model(
+                tokens_enc, tokens_dec, enc_mask, dec_mask, enc_dec_mask, tokentype_ids=None, lm_labels=lm_labels
+            )
+            return output_tensor, partial(self.loss_func, loss_mask)
+        return forward_step
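Note on the `T5TrainStep` mask helpers above: both build boolean masks in which `True` means "do not attend", by comparing a 0/1 mask against 0.5. The sketch below reproduces that logic with nested lists instead of tensors. The function names `pairwise_padding_mask` and `causal_decoder_mask` are hypothetical, and the causal mask here omits the leading batch dimension of size 1 that `get_decoder_mask` adds.

```python
def pairwise_padding_mask(attention_mask):
    """Turn a [batch, seq] 0/1 padding mask into a [batch, seq, seq] boolean mask.

    True marks a (query, key) pair where at least one position is padding.
    """
    return [
        [[(q * k) < 0.5 for k in row] for q in row]
        for row in attention_mask
    ]


def causal_decoder_mask(seq_length):
    """Causal mask: True above the diagonal, so future positions are blocked."""
    return [[k > q for k in range(seq_length)] for q in range(seq_length)]


enc_mask = pairwise_padding_mask([[1, 1, 0]])
dec_mask = causal_decoder_mask(3)
```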
+def finish_mpu_init():
+    # torch.distributed initialization
+    args = get_args()
+    # Pytorch distributed.
+    _initialize_distributed()
+    # Random seeds for reproducibility.
+    if args.rank == 0:
+        print(f"> setting random seeds to {args.seed} ...")
+    _set_random_seed(args.seed, args.data_parallel_random_init)
+# initialize megatron setup
+def initialize(accelerator, extra_args_provider=None, args_defaults={}):
+    accelerator.print("Initializing Megatron-LM")
+    assert torch.cuda.is_available(), "Megatron requires CUDA."
+    # Parse arguments
+    args = parse_args(extra_args_provider, ignore_unknown_args=True)
+    # Set defaults
+    for key, value in args_defaults.items():
+        if getattr(args, key, None) is not None:
+            if args.rank == 0:
+                print(
+                    f"WARNING: overriding default arguments for {key}:{getattr(args, key)} with {key}:{value}",
+                    flush=True,
+                )
+        setattr(args, key, value)
+    if args.use_checkpoint_args or args_defaults.get("use_checkpoint_args", False):
+        assert args.load is not None, "--use-checkpoint-args requires --load argument"
+        load_args_from_checkpoint(args)
+    validate_args(args)
+    # set global args, build tokenizer, and set adlr-autoresume,
+    # tensorboard-writer, and timers.
+    set_global_variables(args)
+    # Megatron's MPU is the master. Complete initialization right away.
+    finish_mpu_init()
+    # Autoresume.
+    _init_autoresume()
+    # Compile dependencies.
+    _compile_dependencies()
+    # Set pytorch JIT layer fusion options and warmup JIT functions.
+    set_jit_fusion_options()
+    args = get_args()
+    if getattr(args, "padded_vocab_size", None) is None:
+        args.padded_vocab_size = _vocab_size_with_padding(args.orig_vocab_size, args)
+    if args.model_type_name == "bert" and args.pretraining_flag and args.num_labels == 2:
+        args.bert_binary_head = True
+    else:
+        args.bert_binary_head = False
+    args.iteration = 0
+class MegatronEngine(torch.nn.Module):
+    """
+    Megatron-LM model wrapper
+    Args:
+        accelerator (:class:`~accelerate.Accelerator`): The accelerator object to use.
+        model: Megatron-LM model
+        optimizer: Megatron-LM optimizer
+        scheduler: Megatron-LM lr scheduler
+    """
+    def __init__(self, accelerator, model, optimizer, scheduler):
+        super().__init__()
+        self.module = model
+        self.base_model = model[0]
+        self.optimizer = optimizer
+        self.scheduler = scheduler
+        args = get_args()
+        if accelerator.state.megatron_lm_plugin.custom_train_step_class is not None:
+            self.train_step_handler = accelerator.state.megatron_lm_plugin.custom_train_step_class(
+                args, **accelerator.state.megatron_lm_plugin.custom_train_step_kwargs
+            )
+        elif args.model_type_name == "bert":
+            self.train_step_handler = BertTrainStep(accelerator, args)
+        elif args.model_type_name == "gpt":
+            self.train_step_handler = GPTTrainStep(accelerator, args)
+        elif args.model_type_name == "t5":
+            self.train_step_handler = T5TrainStep(accelerator, args)
+        else:
+            raise ValueError(f"Unsupported model type: {args.model_type_name}")
+        self.optimizer.skipped_iter = False
+        # Tracking loss.
+        self.total_loss_dict = {}
+        self.eval_total_loss_dict = {}
+        self.iteration = 0
+        self.report_memory_flag = True
+        self.num_floating_point_operations_so_far = 0
+        self.module_config = None
+        if args.tensorboard_dir is not None:
+            write_args_to_tensorboard()
+    def get_module_config(self):
+        args = get_args()
+        config = get_model_config(self.module[0])
+        # Setup some training config params
+        config.grad_scale_func = self.optimizer.scale_loss
+        if isinstance(self.module[0], LocalDDP) and args.overlap_grad_reduce:
+            assert config.no_sync_func is None, (
+                "When overlap_grad_reduce is True, config.no_sync_func must be None; "
+                "a custom no_sync_func is not supported when overlapping grad-reduce"
+            )
+            config.no_sync_func = [model_chunk.no_sync for model_chunk in self.module]
+            if len(self.module) == 1:
+                config.no_sync_func = config.no_sync_func[0]
+            if args.delay_grad_reduce:
+                config.grad_sync_func = [model_chunk.start_grad_sync for model_chunk in self.module]
+                if len(self.module) == 1:
+                    config.grad_sync_func = config.grad_sync_func[0]
+        if args.overlap_param_gather and args.delay_param_gather:
+            config.param_sync_func = [
+                lambda x, model_index=model_index: self.optimizer.finish_param_sync(model_index, x)
+                for model_index in range(len(self.module))
+            ]
+            if len(self.module) == 1:
+                config.param_sync_func = config.param_sync_func[0]
+        config.finalize_model_grads_func = finalize_model_grads
+        return config
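Note on the `param_sync_func` list in `get_module_config` above, which is built from lambdas inside a comprehension. As a general Python point (not specific to Megatron-LM), a lambda looks up the loop variable when it is called, so binding the index through a default argument is the usual way to give each callback its own value. A minimal sketch with hypothetical helper names:

```python
def make_callbacks_late(n):
    # Every lambda reads `i` when called, after the comprehension has finished.
    return [lambda x: (i, x) for i in range(n)]


def make_callbacks_bound(n):
    # A default argument freezes the current value of `i` for each lambda.
    return [lambda x, i=i: (i, x) for i in range(n)]


late = [callback("p") for callback in make_callbacks_late(3)]
bound = [callback("p") for callback in make_callbacks_bound(3)]
```

With late binding every callback sees the final index, while the default-argument form yields one distinct index per callback.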
+    def train(self):
+        for model_module in self.module:
+            model_module.train()
+        if self.module_config is None:
+            self.module_config = self.get_module_config()
+        self.log_eval_results()
+    def eval(self):
+        for model_module in self.module:
+            model_module.eval()
+        if self.module_config is None:
+            self.module_config = self.get_module_config()
+    def get_batch_data_iterator(self, batch_data):
+        args = get_args()
+        data_chunks = []
+        if len(batch_data) > 0:
+            if args.num_micro_batches > 1:
+                for i in range(0, args.num_micro_batches):
+                    data_chunks.append(
+                        {
+                            k: v[i * args.micro_batch_size : (i + 1) * args.micro_batch_size]
+                            for k, v in batch_data.items()
+                        }
+                    )
+            else:
+                data_chunks = [batch_data]
+        if len(self.module) > 1:
+            batch_data_iterator = (
+                [iter(data_chunks) for _ in range(len(self.module))]
+                if len(batch_data) > 0
+                else [None] * len(self.module)
+            )
+        else:
+            batch_data_iterator = iter(data_chunks) if len(batch_data) > 0 else None
+        return batch_data_iterator
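Note on `get_batch_data_iterator` above: every value in the batch dict is sliced along its first dimension into `num_micro_batches` consecutive chunks of `micro_batch_size` rows. A minimal sketch of that slicing, using lists instead of tensors and a hypothetical function name `split_into_micro_batches`:

```python
def split_into_micro_batches(batch, micro_batch_size, num_micro_batches):
    """Slice every value of a batch dict into consecutive micro-batch chunks."""
    if num_micro_batches <= 1:
        return [batch]
    return [
        {
            key: value[i * micro_batch_size : (i + 1) * micro_batch_size]
            for key, value in batch.items()
        }
        for i in range(num_micro_batches)
    ]


batch = {"input_ids": [[1], [2], [3], [4]], "labels": [0, 1, 0, 1]}
chunks = split_into_micro_batches(batch, micro_batch_size=2, num_micro_batches=2)
```

The sketch assumes the batch holds exactly `micro_batch_size * num_micro_batches` rows; a shorter batch would simply yield shorter or empty trailing chunks.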
+    def train_step(self, **batch_data):
+        """
+        Training step for Megatron-LM
+        Args:
+            batch_data (:obj:`dict`): The batch data to train on.
+        """
+        batch_data_iterator = self.get_batch_data_iterator(batch_data)
+        loss_reduced, skipped_iter, grad_norm, num_zeros_in_grad = train_step(
+            forward_step_func=self.train_step_handler.forward_step,
+            data_iterator=batch_data_iterator,
+            model=self.module,
+            optimizer=self.optimizer,
+            opt_param_scheduler=self.scheduler,
+            config=self.module_config,
+        )
+        self.optimizer.skipped_iter = skipped_iter == 1
+        return loss_reduced, skipped_iter, grad_norm, num_zeros_in_grad
+    def eval_step(self, **batch_data):
+        """
+        Evaluation step for Megatron-LM
+        Args:
+            batch_data (:obj:`dict`): The batch data to evaluate on.
+        """
+        args = get_args()
+        batch_data_iterator = self.get_batch_data_iterator(batch_data)
+        forward_backward_func = get_forward_backward_func()
+        loss_dicts = forward_backward_func(
+            forward_step_func=self.train_step_handler.forward_step,
+            data_iterator=batch_data_iterator,
+            model=self.module,
+            num_microbatches=get_num_microbatches(),
+            seq_length=args.seq_length,
+            micro_batch_size=args.micro_batch_size,
+            forward_only=True,
+        )
+        # Empty unused memory
+        if args.empty_unused_memory_level >= 1:
+            torch.cuda.empty_cache()
+        args.consumed_valid_samples += (
+            mpu.get_data_parallel_world_size() * args.micro_batch_size * get_num_microbatches()
+        )
+        if mpu.is_pipeline_last_stage(ignore_virtual=True):
+            # Average loss across microbatches.
+            loss_reduced = {}
+            for key in loss_dicts[0]:
+                losses_reduced_for_key = [x[key] for x in loss_dicts]
+                if len(losses_reduced_for_key[0].shape) == 0:
+                    loss_reduced[key] = sum(losses_reduced_for_key) / len(losses_reduced_for_key)
+                else:
+                    loss_reduced[key] = torch.concat(losses_reduced_for_key)
+            return loss_reduced
+        return {}
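Note on the reduction at the end of `eval_step` above: scalar entries are averaged across microbatches, while non-scalar entries (such as logits) are concatenated. A rough pure-Python analogue, assuming floats for scalars and lists for everything else; the name `reduce_loss_dicts` is hypothetical:

```python
def reduce_loss_dicts(loss_dicts):
    """Average scalar entries and concatenate list entries across microbatches."""
    reduced = {}
    for key in loss_dicts[0]:
        values = [loss_dict[key] for loss_dict in loss_dicts]
        if isinstance(values[0], (int, float)):
            reduced[key] = sum(values) / len(values)
        else:
            reduced[key] = [item for value in values for item in value]
    return reduced


reduced = reduce_loss_dicts(
    [
        {"lm loss": 1.0, "logits": [0.1]},
        {"lm loss": 3.0, "logits": [0.2]},
    ]
)
```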
+    def forward(self, **batch_data):
+        # During training, we use train_step()
+        # model(**batch_data) performs following operations by delegating it to `self.train_step`:
+        # 1. Prepare **batch_data for Tensor, Pipeline and Model Parallelism
+        # 2. Set grad to zero.
+        # 3. forward pass and backward pass using Pipeline Parallelism
+        # 4. Empty unused memory.
+        # 5. Reduce gradients.
+        # 6. Update parameters.
+        # 7. Gather params when using Distributed Optimizer (Data Parallelism).
+        # 8. Update learning rate if scheduler is specified.
+        # 9. Empty unused memory.
+        # 10. Average loss across microbatches and across DP ranks.
+        #
+        # During evaluation, we use eval_step()
+        args = get_args()
+        if self.module[0].training:
+            loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = self.train_step(**batch_data)
+            self.iteration += 1
+            batch_size = mpu.get_data_parallel_world_size() * args.micro_batch_size * get_num_microbatches()
+            args.consumed_train_samples += batch_size
+            self.num_floating_point_operations_so_far += num_floating_point_operations(args, batch_size)
+            if args.tensorboard_dir is not None:
+                # Logging.
+                loss_scale = self.optimizer.get_loss_scale().item()
+                params_norm = None
+                if args.log_params_norm:
+                    params_norm = calc_params_l2_norm(self.module)
+                self.report_memory_flag = training_log(
+                    loss_dict,
+                    self.total_loss_dict,
+                    self.optimizer.param_groups[0]["lr"],
+                    self.iteration,
+                    loss_scale,
+                    self.report_memory_flag,
+                    skipped_iter,
+                    grad_norm,
+                    params_norm,
+                    num_zeros_in_grad,
+                )
+        else:
+            loss_dict = self.eval_step(**batch_data)
+            if args.tensorboard_dir is not None:
+                for key in loss_dict:
+                    self.eval_total_loss_dict[key] = (
+                        self.eval_total_loss_dict.get(key, torch.cuda.FloatTensor([0.0])) + loss_dict[key]
+                    )
+                    self.eval_total_loss_dict[key + "_num_iters"] = self.eval_total_loss_dict.get(
+                        key + "_num_iters", torch.cuda.FloatTensor([0.0])
+                    ) + torch.cuda.FloatTensor([1.0])
+        loss = torch.tensor(0.0, device=torch.cuda.current_device())
+        for key in loss_dict:
+            if len(loss_dict[key].shape) == 0:
+                loss += loss_dict[key]
+        logits = None
+        if "logits" in loss_dict:
+            logits = loss_dict["logits"]
+        if self.train_step_handler.model_output_class is not None:
+            return self.train_step_handler.model_output_class(loss=loss, logits=logits)
+        return loss
+    def log_eval_results(self):
+        args = get_args()
+        if args.tensorboard_dir is None or self.iteration == 0:
+            return
+        writer = get_tensorboard_writer()
+        string = f"validation loss at iteration {self.iteration} | "
+        for key in self.eval_total_loss_dict:
+            if key.endswith("_num_iters"):
+                continue
+            value = self.eval_total_loss_dict[key] / self.eval_total_loss_dict[key + "_num_iters"]
+            string += f"{key} value: {value} | "
+            ppl = math.exp(min(20, value.item()))
+            if args.pretraining_flag:
+                string += f"{key} PPL: {ppl} | "
+            if writer:
+                writer.add_scalar(f"{key} validation", value.item(), self.iteration)
+                if args.pretraining_flag:
+                    writer.add_scalar(f"{key} validation ppl", ppl, self.iteration)
+        length = len(string) + 1
+        print_rank_last("-" * length)
+        print_rank_last(string)
+        print_rank_last("-" * length)
+        self.eval_total_loss_dict = {}
+    def save_checkpoint(self, output_dir):
+        self.log_eval_results()
+        args = get_args()
+        args.save = output_dir
+        torch.distributed.barrier()
+        save_checkpoint(
+            self.iteration,
+            self.module,
+            self.optimizer,
+            self.scheduler,
+            num_floating_point_operations_so_far=self.num_floating_point_operations_so_far,
+        )
+        torch.distributed.barrier()
+    def load_checkpoint(self, input_dir):
+        args = get_args()
+        args.load = input_dir
+        args.consumed_train_samples = 0
+        args.consumed_valid_samples = 0
+        torch.distributed.barrier()
+        iteration, num_floating_point_operations_so_far = load_checkpoint(self.module, self.optimizer, self.scheduler)
+        torch.distributed.barrier()
+        self.iteration = iteration
+        self.num_floating_point_operations_so_far = num_floating_point_operations_so_far
+        if args.fp16 and self.iteration == 0:
+            self.optimizer.reload_model_params()
+    def megatron_generate(
+        self,
+        inputs,
+        attention_mask=None,
+        max_length=None,
+        max_new_tokens=None,
+        num_beams=None,
+        temperature=None,
+        top_k=None,
+        top_p=None,
+        length_penalty=None,
+        **kwargs,
+    ):
+        """
+        Generate method for GPT2 model. This method is used for inference. Supports both greedy and beam search along
+        with sampling. Refer to the Megatron-LM repo for more details.
+        Args:
+            inputs (torch.Tensor): input ids
+            attention_mask (torch.Tensor, optional): attention mask. Defaults to None.
+            max_length (int, optional): max length of the generated sequence. Defaults to None.
+            Either this or max_new_tokens should be provided.
+            max_new_tokens (int, optional): max number of tokens to be generated. Defaults to None.
+            Either this or max_length should be provided.
+            num_beams (int, optional): number of beams to use for beam search. Defaults to None.
+            temperature (float, optional): temperature for sampling. Defaults to 1.0.
+            top_k (int, optional): top k tokens to consider for sampling. Defaults to 0.
+            top_p (float, optional): tokens in top p probability are considered for sampling. Defaults to 0.0.
+            length_penalty (float, optional): length penalty for beam search. Defaults to None.
+            kwargs: additional key-value arguments
+        """
+        # checking if required arguments are passed
+        args = get_args()
+        if args.model_type_name != "gpt":
+            raise NotImplementedError("Generate method is not implemented for this model")
+        if args.data_parallel_size > 1:
+            raise ValueError("Generate method requires data parallelism to be 1")
+        if args.sequence_parallel:
+            raise ValueError("Generate method requires sequence parallelism to be False")
+        if args.recompute_granularity is not None:
+            raise ValueError("Checkpoint activations cannot be set for inference")
+        if args.vocab_file is None:
+            raise ValueError("Vocab file is required for inference")
+        # Prepare inputs
+        if max_length is None and max_new_tokens is None:
+            raise ValueError("Either `max_length` or `max_new_tokens` is required for inference")
+        if temperature is None:
+            temperature = 1.0
+        elif not (0.0 < temperature <= 100.0):
+            raise ValueError("temperature must be a positive number less than or equal to 100.0")
+        if top_k is None:
+            top_k = 0
+        elif not (0 <= top_k <= 1000):
+            raise ValueError("top_k must be a non-negative number less than or equal to 1000")
+        if top_p is None:
+            top_p = 0.0
+        elif top_p > 0.0 and top_k > 0.0:
+            raise ValueError("top_p and top_k sampling cannot be set together")
+        else:
+            if not (0.0 <= top_p <= 1.0):
+                raise ValueError("top_p must be between 0.0 and 1.0")
+        top_p_decay = kwargs.get("top_p_decay", 0.0)
+        if not (0.0 <= top_p_decay <= 1.0):
+            raise ValueError("top_p_decay must be between 0.0 and 1.0")
+        top_p_bound = kwargs.get("top_p_bound", 0.0)
+        if not (0.0 <= top_p_bound <= 1.0):
+            raise ValueError("top_p_bound must be between 0.0 and 1.0")
+        add_BOS = kwargs.get("add_BOS", False)
+        if not (isinstance(add_BOS, bool)):
+            raise ValueError("add_BOS must be a boolean")
+        beam_width = num_beams
+        if beam_width is not None:
+            if not isinstance(beam_width, int):
+                raise ValueError("beam_width must be an integer")
+            if beam_width < 1:
+                raise ValueError("beam_width must be greater than 0")
+            if inputs.shape[0] > 1:
+                raise ValueError("When doing beam_search, batch size must be 1")
+        tokenizer = get_tokenizer()
+        stop_token = kwargs.get("stop_token", tokenizer.eod)
+        if stop_token is not None:
+            if not isinstance(stop_token, int):
+                raise ValueError("stop_token must be an integer")
+        if length_penalty is None:
+            length_penalty = 1.0
+        sizes_list = None
+        prompts_tokens_tensor = None
+        prompts_length_tensor = None
+        if torch.distributed.get_rank() == 0:
+            # Get the prompts length.
+            if attention_mask is None:
+                prompts_length_tensor = torch.cuda.LongTensor([inputs.shape[1]] * inputs.shape[0])
+            else:
+                prompts_length_tensor = attention_mask.sum(axis=-1).cuda()
+            if max_new_tokens is None:
+                max_new_tokens = max_length - inputs.shape[1]
+            if max_new_tokens <= 0:
+                raise ValueError("max_new_tokens must be greater than 0")
+            if add_BOS:
+                max_length = max_new_tokens + inputs.shape[1] + 1
+                # making sure that `max_length` is a multiple of 4 to leverage fused kernels
+                max_length = 4 * math.ceil(max_length / 4)
+                max_new_tokens = max_length - (inputs.shape[1] + 1)
+                padding = torch.cuda.LongTensor([[tokenizer.eod] * max_new_tokens] * inputs.shape[0])
+                prompts_tokens_tensor = torch.concat(
+                    [torch.unsqueeze(padding[:, 0], axis=-1), inputs.cuda(), padding], axis=-1
+                )
+            else:
+                # making sure that `max_length` is a multiple of 4 to leverage fused kernels
+                max_length = max_new_tokens + inputs.shape[1]
+                max_length = 4 * math.ceil(max_length / 4)
+                max_new_tokens = max_length - inputs.shape[1]
+                padding = torch.cuda.LongTensor([[tokenizer.eod] * max_new_tokens] * inputs.shape[0])
+                prompts_tokens_tensor = torch.concat([inputs.cuda(), padding], axis=-1)
+            # We need the sizes of these tensors for the broadcast
+            sizes_list = [
+                prompts_tokens_tensor.size(0),  # Batch size
+                prompts_tokens_tensor.size(1),  # Sequence length
+            ]
+        # First, broadcast the sizes.
+        sizes_tensor = broadcast_int_list(2, int_list=sizes_list, rank=0)
+        # Now that we have the sizes, we can broadcast the tokens
+        # and length tensors.
+        sizes = sizes_tensor.tolist()
+        context_tokens_tensor = broadcast_tensor(sizes, torch.int64, tensor=prompts_tokens_tensor, rank=0)
+        context_length_tensor = broadcast_tensor(sizes[0], torch.int64, tensor=prompts_length_tensor, rank=0)
+        # Run the inference
+        random_seed = kwargs.get("random_seed", 0)
+        torch.random.manual_seed(random_seed)
+        unwrapped_model = unwrap_model(self.base_model, (torchDDP, LocalDDP, Float16Module))
+        if beam_width is not None:
+            tokens, _ = beam_search_and_return_on_first_stage(
+                unwrapped_model,
+                context_tokens_tensor,
+                context_length_tensor,
+                beam_width,
+                stop_token=stop_token,
+                num_return_gen=1,
+                length_penalty=length_penalty,
+            )
+        else:
+            tokens, _, _ = generate_tokens_probs_and_return_on_first_stage(
+                unwrapped_model,
+                context_tokens_tensor,
+                context_length_tensor,
+                return_output_log_probs=False,
+                top_k=top_k,
+                top_p=top_p,
+                top_p_decay=top_p_decay,
+                top_p_bound=top_p_bound,
+                temperature=temperature,
+                use_eod_token_for_early_termination=True,
+            )
+        return tokens
+# other utilities
+def avg_losses_across_data_parallel_group(losses):
+    """
+    Average losses across data parallel group.
+    Args:
+        losses (List[Tensor]): List of losses to average across data parallel group.
+    """
+    return average_losses_across_data_parallel_group(losses)
+def gather_across_data_parallel_groups(tensor):
+    """
+    Recursively gather tensor in a nested list/tuple/dictionary of tensors from data parallel ranks.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to gather across data parallel ranks.
+    """
+    def _gpu_gather_one(tensor):
+        if tensor.ndim == 0:
+            tensor = tensor.clone()[None]
+        output_tensors = [
+            torch.empty_like(tensor)
+            for _ in range(torch.distributed.get_world_size(group=mpu.get_data_parallel_group()))
+        ]
+        torch.distributed.all_gather(output_tensors, tensor, group=mpu.get_data_parallel_group())
+        return torch.cat(output_tensors, dim=0)
+    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
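
The length handling in `megatron_generate` above (round the total sequence length up to a multiple of 4, then recompute `max_new_tokens`) can be checked in isolation. The following is a minimal sketch, not part of the commit: `pad_generation_length` and its `multiple` parameter are hypothetical names introduced here for illustration, and the arithmetic simply mirrors the two branches (`add_BOS` set or not) shown in the diff.

```python
import math


def pad_generation_length(prompt_len, max_new_tokens, add_bos=False, multiple=4):
    """Return (max_length, max_new_tokens) after rounding up to `multiple`.

    Hypothetical helper that mirrors the arithmetic in `megatron_generate`:
    the total length is rounded up so fused kernels can be used, and the
    number of new tokens grows to fill the padded length.
    """
    extra = 1 if add_bos else 0  # one extra slot when a BOS token is prepended
    max_length = max_new_tokens + prompt_len + extra
    max_length = multiple * math.ceil(max_length / multiple)
    return max_length, max_length - (prompt_len + extra)


print(pad_generation_length(10, 5))                # (16, 6)
print(pad_generation_length(10, 5, add_bos=True))  # (16, 5)
```

Note that the caller may therefore receive more generated tokens than requested: a request for 5 new tokens on a 10-token prompt is padded to 6.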

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/memory.py ADDED

	@@ -0,0 +1,210 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+A collection of utilities for ensuring that training can always occur. Heavily influenced by the
+[toma](https://github.com/BlackHC/toma) library.
+"""
+import functools
+import gc
+import importlib.metadata
+import inspect
+import warnings
+from typing import Optional
+import torch
+from packaging import version
+from .imports import (
+    is_cuda_available,
+    is_hpu_available,
+    is_ipex_available,
+    is_mlu_available,
+    is_mps_available,
+    is_musa_available,
+    is_npu_available,
+    is_sdaa_available,
+    is_xpu_available,
+)
+from .versions import compare_versions
+def clear_device_cache(garbage_collection=False):
+    """
+    Clears the device cache by calling `torch.{backend}.empty_cache`. Can also run `gc.collect()`, but do note that
+    this is a *considerable* slowdown and should be used sparingly.
+    """
+    if garbage_collection:
+        gc.collect()
+    if is_xpu_available():
+        torch.xpu.empty_cache()
+    elif is_mlu_available():
+        torch.mlu.empty_cache()
+    elif is_sdaa_available():
+        torch.sdaa.empty_cache()
+    elif is_musa_available():
+        torch.musa.empty_cache()
+    elif is_npu_available():
+        torch.npu.empty_cache()
+    elif is_mps_available(min_version="2.0"):
+        torch.mps.empty_cache()
+    elif is_cuda_available():
+        torch.cuda.empty_cache()
+    elif is_hpu_available():
+        # torch.hpu.empty_cache() # not available on hpu as it reserves all device memory for the current process
+        pass
+def release_memory(*objects):
+    """
+    Releases memory from `objects` by setting them to `None`, then calls `gc.collect()` and clears the device cache.
+    Returned objects should be reassigned to the same variables.
+    Args:
+        objects (`Iterable`):
+            An iterable of objects
+    Returns:
+        A list of `None` objects to replace `objects`
+    Example:
+        ```python
+        >>> import torch
+        >>> from accelerate.utils import release_memory
+        >>> a = torch.ones(1000, 1000).cuda()
+        >>> b = torch.ones(1000, 1000).cuda()
+        >>> a, b = release_memory(a, b)
+        ```
+    """
+    if not isinstance(objects, list):
+        objects = list(objects)
+    for i in range(len(objects)):
+        objects[i] = None
+    clear_device_cache(garbage_collection=True)
+    return objects
+def should_reduce_batch_size(exception: Exception) -> bool:
+    """
+    Checks if `exception` relates to CUDA/HIP out-of-memory, XPU out-of-memory, CUDNN not supported, CPU
+    out-of-memory, or HPU out-of-memory.
+    Args:
+        exception (`Exception`):
+            An exception
+    """
+    _statements = [
+        " out of memory.",  # OOM for CUDA, HIP, XPU
+        "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.",  # CUDNN SNAFU
+        "DefaultCPUAllocator: can't allocate memory",  # CPU OOM
+        "FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed",  # HPU OOM
+    ]
+    if isinstance(exception, RuntimeError) and len(exception.args) == 1:
+        return any(err in exception.args[0] for err in _statements)
+    return False
+def find_executable_batch_size(
+    function: Optional[callable] = None,
+    starting_batch_size: int = 128,
+    reduce_batch_size_fn: Optional[callable] = None,
+):
+    """
+    A basic decorator that will try to execute `function`. If it fails from exceptions related to out-of-memory or
+    CUDNN, the batch size is reduced (by default multiplied by 0.9) and `function` is retried.
+    `function` must take in a `batch_size` parameter as its first argument.
+    Args:
+        function (`callable`, *optional*):
+            A function to wrap
+        starting_batch_size (`int`, *optional*):
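+            The batch size to try and fit into memory
+        reduce_batch_size_fn (`callable`, *optional*):
+            A function taking no arguments that returns the next batch size to try. Defaults to multiplying the
+            current batch size by 0.9.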
+    Example:
+    ```python
+    >>> from accelerate.utils import find_executable_batch_size
+    >>> @find_executable_batch_size(starting_batch_size=128)
+    ... def train(batch_size, model, optimizer):
+    ...     ...
+    >>> train(model, optimizer)
+    ```
+    """
+    if function is None:
+        return functools.partial(
+            find_executable_batch_size,
+            starting_batch_size=starting_batch_size,
+            reduce_batch_size_fn=reduce_batch_size_fn,
+        )
+    batch_size = starting_batch_size
+    if reduce_batch_size_fn is None:
+        def reduce_batch_size_fn():
+            nonlocal batch_size
+            batch_size = int(batch_size * 0.9)
+            return batch_size
+    def decorator(*args, **kwargs):
+        nonlocal batch_size
+        clear_device_cache(garbage_collection=True)
+        params = list(inspect.signature(function).parameters.keys())
+        # Guard against user error
+        if len(params) < (len(args) + 1):
+            arg_str = ", ".join([f"{arg}={value}" for arg, value in zip(params[1:], args[1:])])
+            raise TypeError(
+                f"Batch size was passed into `{function.__name__}` as the first argument when called. "
+                f"Remove this as the decorator already does so: `{function.__name__}({arg_str})`"
+            )
+        while True:
+            if batch_size == 0:
+                raise RuntimeError("No executable batch size found, reached zero.")
+            try:
+                return function(batch_size, *args, **kwargs)
+            except Exception as e:
+                if should_reduce_batch_size(e):
+                    clear_device_cache(garbage_collection=True)
+                    batch_size = reduce_batch_size_fn()
+                else:
+                    raise
+    return decorator
+def get_xpu_available_memory(device_index: int):
+    if version.parse(torch.__version__).release >= version.parse("2.6").release:
+        # torch.xpu.mem_get_info API is available starting from PyTorch 2.6
+        # It further requires PyTorch built with the SYCL runtime which supports API
+        # to query available device memory. If not available, exception will be
+        # raised. Version of SYCL runtime used to build PyTorch is being reported
+        # with print(torch.version.xpu) and corresponds to the version of Intel DPC++
+        # SYCL compiler. First version to support required feature is 20250001.
+        try:
+            return torch.xpu.mem_get_info(device_index)[0]
+        except Exception:
+            pass
+    elif is_ipex_available():
+        ipex_version = version.parse(importlib.metadata.version("intel_extension_for_pytorch"))
+        if compare_versions(ipex_version, ">=", "2.5"):
+            from intel_extension_for_pytorch.xpu import mem_get_info
+            return mem_get_info(device_index)[0]
+    warnings.warn(
+        "The XPU `mem_get_info` API is available in IPEX version >=2.5 or PyTorch >=2.6. The current returned available memory is incorrect. Please consider upgrading your IPEX or PyTorch version."
+    )
+    return torch.xpu.max_memory_allocated(device_index)
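
The retry loop in `find_executable_batch_size` above can be exercised without a GPU. The following is a minimal pure-Python sketch, not the library code: `retry_with_smaller_batch`, `is_oom`, `fake_train`, and the `shrink` parameter are hypothetical names, the out-of-memory error is simulated with a plain `RuntimeError`, and cache clearing is omitted. It only illustrates the control flow: catch an OOM-like error, shrink the batch size by a factor of 0.9, and retry.

```python
import functools


def is_oom(exc):
    # Simplified stand-in for `should_reduce_batch_size`: one message only.
    return isinstance(exc, RuntimeError) and len(exc.args) == 1 and " out of memory." in exc.args[0]


def retry_with_smaller_batch(function=None, starting_batch_size=128, shrink=0.9):
    # Support both `@decorator` and `@decorator(starting_batch_size=...)`.
    if function is None:
        return functools.partial(
            retry_with_smaller_batch, starting_batch_size=starting_batch_size, shrink=shrink
        )
    batch_size = starting_batch_size

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        nonlocal batch_size
        while True:
            if batch_size == 0:
                raise RuntimeError("No executable batch size found, reached zero.")
            try:
                # The batch size is always injected as the first argument.
                return function(batch_size, *args, **kwargs)
            except Exception as e:
                if not is_oom(e):
                    raise
                batch_size = int(batch_size * shrink)

    return wrapper


attempts = []


@retry_with_smaller_batch(starting_batch_size=128)
def fake_train(batch_size, limit):
    # Simulated training step that "runs out of memory" above `limit`.
    attempts.append(batch_size)
    if batch_size > limit:
        raise RuntimeError("CUDA out of memory.")
    return batch_size


result = fake_train(100)
print(result, attempts)  # 92 [128, 115, 103, 92]
```

Because `batch_size` lives in the closure, a second call to the wrapped function starts from the last batch size that worked rather than from `starting_batch_size`, which matches the behavior of the decorator in the diff.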

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/modeling.py ADDED

	@@ -0,0 +1,2186 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import contextlib
+import gc
+import inspect
+import json
+import logging
+import os
+import re
+import shutil
+import tempfile
+import warnings
+from collections import OrderedDict, defaultdict
+from typing import Optional, Union
+import torch
+from torch import distributed as dist
+from torch import nn
+from ..state import AcceleratorState
+from .constants import SAFE_WEIGHTS_NAME, WEIGHTS_NAME
+from .dataclasses import AutocastKwargs, CustomDtype, DistributedType
+from .imports import (
+    is_hpu_available,
+    is_mlu_available,
+    is_mps_available,
+    is_musa_available,
+    is_npu_available,
+    is_peft_available,
+    is_sdaa_available,
+    is_torch_xla_available,
+    is_xpu_available,
+)
+from .memory import clear_device_cache, get_xpu_available_memory
+from .offload import load_offloaded_weight, offload_weight, save_offload_index
+from .tqdm import is_tqdm_available, tqdm
+from .versions import is_torch_version
+if is_npu_available(check_device=False):
+    import torch_npu  # noqa: F401
+if is_mlu_available(check_device=False):
+    import torch_mlu  # noqa: F401
+if is_sdaa_available(check_device=False):
+    import torch_sdaa  # noqa: F401
+if is_musa_available(check_device=False):
+    import torch_musa  # noqa: F401
+from safetensors import safe_open
+from safetensors.torch import load_file as safe_load_file
+WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"
+logger = logging.getLogger(__name__)
+def is_peft_model(model):
+    from .other import extract_model_from_parallel
+    if is_peft_available():
+        from peft import PeftModel
+    return is_peft_available() and isinstance(extract_model_from_parallel(model), PeftModel)
+def check_device_same(first_device, second_device):
+    """
+    Utility method to check if two `torch` devices are the same. When dealing with CUDA devices, torch returns `False`
+    for `torch.device("cuda") == torch.device("cuda:0")` even though both refer to the same device.
+    Args:
+        first_device (`torch.device`):
+            First device to check
+        second_device (`torch.device`):
+            Second device to check
+    """
+    if first_device.type != second_device.type:
+        return False
+    if first_device.type != "cpu" and first_device.index is None:
+        # If first_device is a non-CPU device (e.g. CUDA) and has
+        # its index attribute set to `None`, default it to `0`
+        first_device = torch.device(first_device.type, index=0)
+    if second_device.type != "cpu" and second_device.index is None:
+        # If second_device is a non-CPU device (e.g. CUDA) and has
+        # its index attribute set to `None`, default it to `0`
+        second_device = torch.device(second_device.type, index=0)
+    return first_device == second_device
+def convert_file_size_to_int(size: Union[int, str]):
+    """
+    Converts a size expressed as a string with digits and a unit (like `"5MB"`) to an integer (in bytes).
+    Args:
+        size (`int` or `str`): The size to convert. Will be directly returned if an `int`.
+    Example:
+    ```py
+    >>> convert_file_size_to_int("1MiB")
+    1048576
+    ```
+    """
+    mem_size = -1
+    err_msg = (
+        f"`size` {size} is not in a valid format. Use an integer for bytes, or a string with a unit (like '5.0GB')."
+    )
+    try:
+        if isinstance(size, int):
+            mem_size = size
+        elif size.upper().endswith("GIB"):
+            mem_size = int(float(size[:-3]) * (2**30))
+        elif size.upper().endswith("MIB"):
+            mem_size = int(float(size[:-3]) * (2**20))
+        elif size.upper().endswith("KIB"):
+            mem_size = int(float(size[:-3]) * (2**10))
+        elif size.upper().endswith("GB"):
+            int_size = int(float(size[:-2]) * (10**9))
+            mem_size = int_size // 8 if size.endswith("b") else int_size
+        elif size.upper().endswith("MB"):
+            int_size = int(float(size[:-2]) * (10**6))
+            mem_size = int_size // 8 if size.endswith("b") else int_size
+        elif size.upper().endswith("KB"):
+            int_size = int(float(size[:-2]) * (10**3))
+            mem_size = int_size // 8 if size.endswith("b") else int_size
+    except ValueError:
+        raise ValueError(err_msg)
+    if mem_size < 0:
+        raise ValueError(err_msg)
+    return mem_size
+def dtype_byte_size(dtype: torch.dtype):
+    """
+    Returns the size (in bytes) occupied by one parameter of type `dtype`.
+    Example:
+    ```py
+    >>> dtype_byte_size(torch.float32)
+    4
+    ```
+    """
+    if dtype == torch.bool:
+        return 1 / 8
+    elif dtype == CustomDtype.INT2:
+        return 1 / 4
+    elif dtype == CustomDtype.INT4:
+        return 1 / 2
+    elif dtype == CustomDtype.FP8:
+        return 1
+    elif is_torch_version(">=", "2.1.0") and dtype in [torch.float8_e4m3fn, torch.float8_e5m2]:
+        return 1
+    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
+    if bit_search is None:
+        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
+    bit_size = int(bit_search.groups()[0])
+    return bit_size // 8
+def id_tensor_storage(tensor: torch.Tensor) -> tuple[torch.device, int, int]:
+    """
+    Unique identifier to a tensor storage. Multiple different tensors can share the same underlying storage. For
+    example, "meta" tensors all share the same storage, and thus their identifier will all be equal. This identifier is
+    guaranteed to be unique and constant for this tensor's storage during its lifetime. Two tensor storages with
+    non-overlapping lifetimes may have the same id.
+    """
+    _SIZE = {
+        torch.int64: 8,
+        torch.float32: 4,
+        torch.int32: 4,
+        torch.bfloat16: 2,
+        torch.float16: 2,
+        torch.int16: 2,
+        torch.uint8: 1,
+        torch.int8: 1,
+        torch.bool: 1,
+        torch.float64: 8,
+    }
+    try:
+        storage_ptr = tensor.untyped_storage().data_ptr()
+        storage_size = tensor.untyped_storage().nbytes()
+    except Exception:
+        try:
+            # Fallback for torch==1.10
+            storage_ptr = tensor.storage().data_ptr()
+            storage_size = tensor.storage().size() * _SIZE[tensor.dtype]
+        except NotImplementedError:
+            # Fallback for meta storage
+            storage_ptr = 0
+            # On torch >=2.0 this is the tensor size
+            storage_size = tensor.nelement() * _SIZE[tensor.dtype]
+    return tensor.device, storage_ptr, storage_size
+def set_module_tensor_to_device(
+    module: nn.Module,
+    tensor_name: str,
+    device: Union[int, str, torch.device],
+    value: Optional[torch.Tensor] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    fp16_statistics: Optional[torch.HalfTensor] = None,
+    tied_params_map: Optional[dict[int, dict[torch.device, torch.Tensor]]] = None,
+    non_blocking: bool = False,
+    clear_cache: bool = True,
+):
+    """
+    A helper function to set a given tensor (parameter or buffer) of a module on a specific device (note that doing
+    `param.to(device)` creates a new tensor not linked to the parameter, which is why we need this function).
+    Args:
+        module (`torch.nn.Module`):
+            The module in which the tensor we want to move lives.
+        tensor_name (`str`):
+            The full name of the parameter/buffer.
+        device (`int`, `str` or `torch.device`):
+            The device on which to set the tensor.
+        value (`torch.Tensor`, *optional*):
+            The value of the tensor (useful when going from the meta device to any other device).
+        dtype (`torch.dtype`, *optional*):
+            If passed, the value of the parameter will be cast to this `dtype`. Otherwise, `value` will be cast to
+            the dtype of the existing parameter in the model.
+        fp16_statistics (`torch.HalfTensor`, *optional*):
+            The list of fp16 statistics to set on the module, used for 8 bit model serialization.
+        tied_params_map (Dict[int, Dict[torch.device, torch.Tensor]], *optional*, defaults to `None`):
+            A map of current data pointers to dictionaries of devices to already dispatched tied weights. For a given
+            execution device, this parameter is useful to reuse the first available pointer of a shared weight on the
+            device for all others, instead of duplicating memory.
+        non_blocking (`bool`, *optional*, defaults to `False`):
+            If `True`, the device transfer will be asynchronous with respect to the host, if possible.
+        clear_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not to clear the device cache after setting the tensor on the device.
+    """
+    # Recurse if needed
+    if "." in tensor_name:
+        splits = tensor_name.split(".")
+        for split in splits[:-1]:
+            new_module = getattr(module, split)
+            if new_module is None:
+                raise ValueError(f"{module} has no attribute {split}.")
+            module = new_module
+        tensor_name = splits[-1]
+    if tensor_name not in module._parameters and tensor_name not in module._buffers:
+        raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
+    is_buffer = tensor_name in module._buffers
+    old_value = getattr(module, tensor_name)
+    # Handle the case where old_value (or a custom `value`, typically offloaded to RAM/disk) belongs to a tied group and one of the weights
+    # in the tied group has already been dispatched to the device: avoid reallocating memory on the device and just copy the pointer.
+    if (
+        value is not None
+        and tied_params_map is not None
+        and value.data_ptr() in tied_params_map
+        and device in tied_params_map[value.data_ptr()]
+    ):
+        module._parameters[tensor_name] = tied_params_map[value.data_ptr()][device]
+        return
+    elif (
+        tied_params_map is not None
+        and old_value.data_ptr() in tied_params_map
+        and device in tied_params_map[old_value.data_ptr()]
+    ):
+        module._parameters[tensor_name] = tied_params_map[old_value.data_ptr()][device]
+        return
+    if old_value.device == torch.device("meta") and device not in ["meta", torch.device("meta")] and value is None:
+        raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put it on {device}.")
+    param = module._parameters[tensor_name] if tensor_name in module._parameters else None
+    param_cls = type(param)
+    if value is not None:
+        # We can expect mismatches when using bnb 4bit since Params4bit will reshape and pack the weights.
+        # In other cases, we want to make sure we're not loading checkpoints that do not match the config.
+        if old_value.shape != value.shape and param_cls.__name__ != "Params4bit":
+            raise ValueError(
+                f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this looks incorrect.'
+            )
+        if dtype is None:
+            # For compatibility with PyTorch load_state_dict which converts state dict dtype to existing dtype in model
+            value = value.to(old_value.dtype, non_blocking=non_blocking)
+        elif not str(value.dtype).startswith(("torch.uint", "torch.int", "torch.bool")):
+            value = value.to(dtype, non_blocking=non_blocking)
+    device_quantization = None
+    with torch.no_grad():
+        # Leave quantized parameters on CPU first before moving them to CUDA/XPU.
+        # Fix for the case where the device is meta: we don't want to put it on CPU because there is no data.
+        if (
+            param is not None
+            and param.device.type not in ("cuda", "xpu")
+            and torch.device(device).type in ("cuda", "xpu")
+            and param_cls.__name__ in ["Int8Params", "FP4Params", "Params4bit"]
+        ):
+            device_quantization = device
+            device = "cpu"
+        # `torch.Tensor.to(<int num>)` is not supported by `torch_npu` (see this [issue](https://github.com/Ascend/pytorch/issues/16)).
+        if isinstance(device, int):
+            if is_npu_available():
+                device = f"npu:{device}"
+            elif is_mlu_available():
+                device = f"mlu:{device}"
+            elif is_sdaa_available():
+                device = f"sdaa:{device}"
+            elif is_musa_available():
+                device = f"musa:{device}"
+            elif is_hpu_available():
+                device = "hpu"
+        if "xpu" in str(device) and not is_xpu_available():
+            raise ValueError(f'{device} is not available, you should use device="cpu" instead')
+        if value is None:
+            new_value = old_value.to(device, non_blocking=non_blocking)
+            if dtype is not None and device in ["meta", torch.device("meta")]:
+                if not str(old_value.dtype).startswith(("torch.uint", "torch.int", "torch.bool")):
+                    new_value = new_value.to(dtype, non_blocking=non_blocking)
+                if not is_buffer:
+                    module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
+        elif isinstance(value, torch.Tensor):
+            new_value = value.to(device, non_blocking=non_blocking)
+        else:
+            new_value = torch.tensor(value, device=device)
+        if device_quantization is not None:
+            device = device_quantization
+        if is_buffer:
+            module._buffers[tensor_name] = new_value
+        elif value is not None or not check_device_same(torch.device(device), module._parameters[tensor_name].device):
+            param_cls = type(module._parameters[tensor_name])
+            kwargs = module._parameters[tensor_name].__dict__
+            if param_cls.__name__ in ["Int8Params", "FP4Params", "Params4bit"]:
+                if param_cls.__name__ == "Int8Params" and new_value.dtype == torch.float32:
+                    # downcast to fp16 if any - needed for 8bit serialization
+                    new_value = new_value.to(torch.float16, non_blocking=non_blocking)
+                # quantize modules that are going to stay on the cpu so that we offload quantized weights
+                if device == "cpu" and param_cls.__name__ == "Int8Params":
+                    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(0).to("cpu")
+                    new_value.CB = new_value.CB.to("cpu")
+                    new_value.SCB = new_value.SCB.to("cpu")
+                else:
+                    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(
+                        device, non_blocking=non_blocking
+                    )
+            elif param_cls.__name__ in ["QTensor", "QBitsTensor"]:
+                new_value = torch.nn.Parameter(new_value, requires_grad=old_value.requires_grad).to(
+                    device, non_blocking=non_blocking
+                )
+            elif param_cls.__name__ in ["AffineQuantizedTensor"]:
+                new_value = new_value.to(device, non_blocking=non_blocking)
+            else:
+                new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(
+                    device, non_blocking=non_blocking
+                )
+            module._parameters[tensor_name] = new_value
+            if fp16_statistics is not None:
+                module._parameters[tensor_name].SCB = fp16_statistics.to(device, non_blocking=non_blocking)
+                del fp16_statistics
+            # as we put the weight to meta, it doesn't have SCB attr anymore. make sure that it is not a meta weight
+            if (
+                module.__class__.__name__ == "Linear8bitLt"
+                and getattr(module.weight, "SCB", None) is None
+                and str(module.weight.device) != "meta"
+            ):
+                # quantize only if necessary
+                device_index = torch.device(device).index if torch.device(device).type == "cuda" else None
+                if not getattr(module.weight, "SCB", None) and device_index is not None:
+                    if module.bias is not None and module.bias.device.type != "meta":
+                        # if a bias exists, we need to wait until the bias is set on the correct device
+                        module = module.cuda(device_index)
+                    elif module.bias is None:
+                        # if no bias exists, we can quantize right away
+                        module = module.cuda(device_index)
+            elif (
+                module.__class__.__name__ == "Linear4bit"
+                and getattr(module.weight, "quant_state", None) is None
+                and str(module.weight.device) != "meta"
+            ):
+                # quantize only if necessary
+                device_index = torch.device(device).index if torch.device(device).type == "cuda" else None
+                if not getattr(module.weight, "quant_state", None) and device_index is not None:
+                    module.weight = module.weight.cuda(device_index)
+    # clear the device cache if requested (skipped for cpu and meta targets)
+    if clear_cache and device not in ("cpu", "meta"):
+        clear_device_cache()
+    # When handling tied weights, we update tied_params_map to keep track of the tied weights that have already been allocated on the device in
+    # order to avoid duplicating memory, see above.
+    if (
+        tied_params_map is not None
+        and old_value.data_ptr() in tied_params_map
+        and device not in tied_params_map[old_value.data_ptr()]
+    ):
+        tied_params_map[old_value.data_ptr()][device] = new_value
+    elif (
+        value is not None
+        and tied_params_map is not None
+        and value.data_ptr() in tied_params_map
+        and device not in tied_params_map[value.data_ptr()]
+    ):
+        tied_params_map[value.data_ptr()][device] = new_value
+def named_module_tensors(
+    module: nn.Module, include_buffers: bool = True, recurse: bool = False, remove_non_persistent: bool = False
+):
+    """
+    A helper function that gathers all the tensors (parameters + buffers) of a given module. If `include_buffers=True`
+    it's the same as doing `module.named_parameters(recurse=recurse) + module.named_buffers(recurse=recurse)`.
+    Args:
+        module (`torch.nn.Module`):
+            The module we want the tensors on.
+        include_buffers (`bool`, *optional*, defaults to `True`):
+            Whether or not to include the buffers in the result.
+        recurse (`bool`, *optional*, defaults to `False`):
+            Whether or not to go look in every submodule or just return the direct parameters and buffers.
+        remove_non_persistent (`bool`, *optional*, defaults to `False`):
+            Whether or not to remove the non persistent buffer from the buffers. Useful only when include_buffers =
+            True
+    """
+    yield from module.named_parameters(recurse=recurse)
+    if include_buffers:
+        non_persistent_buffers = set()
+        if remove_non_persistent:
+            non_persistent_buffers = get_non_persistent_buffers(module, recurse=recurse)
+        for named_buffer in module.named_buffers(recurse=recurse):
+            name, _ = named_buffer
+            if name not in non_persistent_buffers:
+                yield named_buffer
+def get_non_persistent_buffers(module: nn.Module, recurse: bool = False, fqns: bool = False):
+    """
+    Gather the names of all non-persistent buffers of a given module into a set.
+    Args:
+        module (`nn.Module`):
+            The module we want the non persistent buffers on.
+        recurse (`bool`, *optional*, defaults to `False`):
+            Whether or not to go look in every submodule or just return the direct non persistent buffers.
+        fqns (`bool`, *optional*, defaults to `False`):
+            Whether or not to return the fully-qualified names of the non persistent buffers.
+    """
+    # Copy the set so the in-place unions below do not mutate the module's own bookkeeping.
+    non_persistent_buffers_set = set(module._non_persistent_buffers_set)
+    if recurse:
+        for n, m in module.named_modules():
+            if fqns:
+                non_persistent_buffers_set |= {n + "." + b for b in m._non_persistent_buffers_set}
+            else:
+                non_persistent_buffers_set |= m._non_persistent_buffers_set
+    return non_persistent_buffers_set
+def check_tied_parameters_in_config(model: nn.Module):
+    """
+    Check if there is any indication in the given model that some weights should be tied.
+    Args:
+        model (`torch.nn.Module`): The model to inspect
+    Returns:
+        bool: True if the model needs to have tied weights
+    """
+    # based on model.tie_weights() method
+    has_tied_word_embedding = False
+    has_tied_encoder_decoder = False
+    has_tied_module = False
+    if "PreTrainedModel" in [c.__name__ for c in inspect.getmro(model.__class__)]:
+        has_tied_word_embedding = False
+        model_decoder_config = None
+        if hasattr(model, "config"):
+            model_decoder_config = (
+                model.config.get_text_config(decoder=True)
+                if hasattr(model.config, "get_text_config")
+                else model.config
+            )
+        has_tied_word_embedding = (
+            model_decoder_config is not None
+            and getattr(model_decoder_config, "tie_word_embeddings", False)
+            and model.get_output_embeddings()
+        )
+        has_tied_encoder_decoder = (
+            hasattr(model, "config")
+            and getattr(model.config, "is_encoder_decoder", False)
+            and getattr(model.config, "tie_encoder_decoder", False)
+        )
+        has_tied_module = any(hasattr(module, "_tie_weights") for module in model.modules())
+    return any([has_tied_word_embedding, has_tied_encoder_decoder, has_tied_module])
+def _get_param_device(param, device_map):
+    if param in device_map:
+        return device_map[param]
+    parent_param = ".".join(param.split(".")[:-1])
+    if parent_param == param:
+        raise ValueError(f"The `device_map` does not contain the module {param}.")
+    else:
+        return _get_param_device(parent_param, device_map)
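The parent-prefix fallback used by `_get_param_device` can be illustrated without a model. The following is a minimal standalone sketch, assuming a plain dict as the device map; `lookup_device` is a hypothetical name used only for this illustration, and the module names are invented.

```python
def lookup_device(name: str, device_map: dict):
    """Return the device for `name`, falling back to the closest parent entry."""
    # Hypothetical helper: an iterative version of the recursive lookup above.
    while True:
        if name in device_map:
            return device_map[name]
        # Drop the last dotted component, e.g. "a.b.weight" -> "a.b".
        parent = ".".join(name.split(".")[:-1])
        if parent == name:
            # Only the empty string is its own parent: nothing left to try.
            raise ValueError(f"The device map does not contain the module {name!r}.")
        name = parent


# Invented example map: the most specific matching prefix wins.
device_map = {"encoder": 0, "decoder": "cpu", "decoder.layers.0": 1}
print(lookup_device("decoder.layers.0.weight", device_map))
print(lookup_device("decoder.layers.1.weight", device_map))
```

A name with no matching prefix at all (and no `""` entry in the map) ends in the `ValueError`, which is the same failure mode as the recursive version.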
+def check_tied_parameters_on_same_device(tied_params, device_map):
+    """
+    Check if tied parameters are on the same device
+    Args:
+        tied_params (`List[List[str]]`):
+            A list of lists of parameter names being all tied together.
+        device_map (`Dict[str, Union[int, str, torch.device]]`):
+            A map that specifies where each submodule should go.
+    """
+    for tie_param in tied_params:
+        tie_param_devices = {}
+        for param in tie_param:
+            tie_param_devices[param] = _get_param_device(param, device_map)
+        if len(set(tie_param_devices.values())) > 1:
+            logger.warning(
+                f"Tied parameters are on different devices: {tie_param_devices}. "
+                "Please modify your custom device map or set `device_map='auto'`. "
+            )
+def find_tied_parameters(model: torch.nn.Module, **kwargs) -> list[list[str]]:
+    """
+    Find the tied parameters in a given model.
+    <Tip warning={true}>
+    The signature accepts keyword arguments, but they are for the recursive part of this function and you should ignore
+    them.
+    </Tip>
+    Args:
+        model (`torch.nn.Module`): The model to inspect.
+    Returns:
+        List[List[str]]: A list of lists of parameter names being all tied together.
+    Example:
+    ```py
+    >>> from collections import OrderedDict
+    >>> import torch.nn as nn
+    >>> model = nn.Sequential(OrderedDict([("linear1", nn.Linear(4, 4)), ("linear2", nn.Linear(4, 4))]))
+    >>> model.linear2.weight = model.linear1.weight
+    >>> find_tied_parameters(model)
+    [['linear1.weight', 'linear2.weight']]
+    ```
+    """
+    # get ALL model parameters and their names
+    all_named_parameters = {name: param for name, param in model.named_parameters(remove_duplicate=False)}
+    # get ONLY unique named parameters,
+    # if parameter is tied and have multiple names, it will be included only once
+    no_duplicate_named_parameters = {name: param for name, param in model.named_parameters(remove_duplicate=True)}
+    # the difference of the two sets will give us the tied parameters
+    tied_param_names = set(all_named_parameters.keys()) - set(no_duplicate_named_parameters.keys())
+    # 'tied_param_names' contains the names of parameters that are tied in the model, but we do not know
+    # which names refer to the same parameter. To identify this, we need to group them together.
+    tied_param_groups = {}
+    for tied_param_name in tied_param_names:
+        tied_param = all_named_parameters[tied_param_name]
+        for param_name, param in no_duplicate_named_parameters.items():
+            # compare if parameters are the same, if so, group their names together
+            if param is tied_param:
+                if param_name not in tied_param_groups:
+                    tied_param_groups[param_name] = []
+                tied_param_groups[param_name].append(tied_param_name)
+    return [sorted([weight] + list(set(tied))) for weight, tied in tied_param_groups.items()]
+def retie_parameters(model, tied_params):
+    """
+    Reties tied parameters in a given model if the link was broken (for instance when adding hooks).
+    Args:
+        model (`torch.nn.Module`):
+            The model in which to retie parameters.
+        tied_params (`List[List[str]]`):
+            A list of groups of tied parameter names, as obtained by `find_tied_parameters`.
+    """
+    for tied_group in tied_params:
+        param_to_tie = None
+        # two loops : the first one to set param_to_tie , the second one to change the values of tied_group
+        for param_name in tied_group:
+            module = model
+            splits = param_name.split(".")
+            for split in splits[:-1]:
+                module = getattr(module, split)
+            param = getattr(module, splits[-1])
+            if param_to_tie is None and param.device != torch.device("meta"):
+                param_to_tie = param
+                break
+        if param_to_tie is not None:
+            for param_name in tied_group:
+                module = model
+                splits = param_name.split(".")
+                for split in splits[:-1]:
+                    module = getattr(module, split)
+                setattr(module, splits[-1], param_to_tie)
+def _get_proper_dtype(dtype: Union[str, torch.dtype]) -> torch.dtype:
+    """
+    Converts a string such as `"float16"` or `"torch.float16"` to the matching `torch.dtype`; other values are returned as-is.
+    """
+    if isinstance(dtype, str):
+        # We accept "torch.float16" or just "float16"
+        dtype = dtype.replace("torch.", "")
+        dtype = getattr(torch, dtype)
+    return dtype
+def compute_module_sizes(
+    model: nn.Module,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    special_dtypes: Optional[dict[str, Union[str, torch.dtype]]] = None,
+    buffers_only: bool = False,
+):
+    """
+    Compute the size of each submodule of a given model.
+    """
+    if dtype is not None:
+        dtype = _get_proper_dtype(dtype)
+        dtype_size = dtype_byte_size(dtype)
+    if special_dtypes is not None:
+        special_dtypes = {key: _get_proper_dtype(dtyp) for key, dtyp in special_dtypes.items()}
+        special_dtypes_size = {key: dtype_byte_size(dtyp) for key, dtyp in special_dtypes.items()}
+    module_sizes = defaultdict(int)
+    module_list = []
+    if not buffers_only:
+        module_list = named_module_tensors(model, recurse=True)
+    else:
+        module_list = model.named_buffers(recurse=True)
+    for name, tensor in module_list:
+        if special_dtypes is not None and name in special_dtypes:
+            size = tensor.numel() * special_dtypes_size[name]
+        elif dtype is None:
+            size = tensor.numel() * dtype_byte_size(tensor.dtype)
+        elif str(tensor.dtype).startswith(("torch.uint", "torch.int", "torch.bool")):
+            # According to the code in set_module_tensor_to_device, these types won't be converted
+            # so use their original size here
+            size = tensor.numel() * dtype_byte_size(tensor.dtype)
+        else:
+            size = tensor.numel() * min(dtype_size, dtype_byte_size(tensor.dtype))
+        name_parts = name.split(".")
+        for idx in range(len(name_parts) + 1):
+            module_sizes[".".join(name_parts[:idx])] += size
+    return module_sizes
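The prefix accumulation at the end of `compute_module_sizes` is what makes the empty key `""` hold the total model size. Below is a minimal sketch of just that step, assuming the per-tensor byte sizes are already known; `aggregate_sizes` and the tensor names are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def aggregate_sizes(tensor_sizes: dict) -> dict:
    """Add each tensor's size to every dotted prefix of its name."""
    module_sizes = defaultdict(int)
    for name, size in tensor_sizes.items():
        parts = name.split(".")
        # idx == 0 yields "", the root entry that accumulates the whole model.
        for idx in range(len(parts) + 1):
            module_sizes[".".join(parts[:idx])] += size
    return dict(module_sizes)


# Invented byte sizes for a two-layer model.
sizes = aggregate_sizes({"linear1.weight": 64, "linear1.bias": 16, "linear2.weight": 64})
print(sizes[""], sizes["linear1"], sizes["linear2"])
```

Each tensor contributes to itself, to every parent module, and to the root, so a parent's entry is always the sum of its children.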
+def compute_module_total_buffer_size(
+    model: nn.Module,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    special_dtypes: Optional[dict[str, Union[str, torch.dtype]]] = None,
+):
+    """
+    Compute the total size of all buffers of a given model.
+    """
+    module_sizes = compute_module_sizes(model, dtype=dtype, special_dtypes=special_dtypes, buffers_only=True)
+    return module_sizes.get("", 0)
+def get_max_layer_size(
+    modules: list[tuple[str, torch.nn.Module]], module_sizes: dict[str, int], no_split_module_classes: list[str]
+):
+    """
+    Utility function that will scan a list of named modules and return the maximum size used by one full layer. The
+    definition of a layer being:
+    - a module with no direct children (just parameters and buffers)
+    - a module whose class name is in the list `no_split_module_classes`
+    Args:
+        modules (`List[Tuple[str, torch.nn.Module]]`):
+            The list of named modules where we want to determine the maximum layer size.
+        module_sizes (`Dict[str, int]`):
+            A dictionary mapping each layer name to its size (as generated by `compute_module_sizes`).
+        no_split_module_classes (`List[str]`):
+            A list of class names for layers we don't want to be split.
+    Returns:
+        `Tuple[int, List[str]]`: The maximum size of a layer with the list of layer names realizing that maximum size.
+    """
+    max_size = 0
+    layer_names = []
+    modules_to_treat = modules.copy()
+    while len(modules_to_treat) > 0:
+        module_name, module = modules_to_treat.pop(0)
+        modules_children = list(module.named_children()) if isinstance(module, torch.nn.Module) else []
+        if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
+            # No splitting this one so we compare to the max_size
+            size = module_sizes[module_name]
+            if size > max_size:
+                max_size = size
+                layer_names = [module_name]
+            elif size == max_size:
+                layer_names.append(module_name)
+        else:
+            modules_to_treat = [(f"{module_name}.{n}", v) for n, v in modules_children] + modules_to_treat
+    return max_size, layer_names
+def get_max_memory(max_memory: Optional[dict[Union[int, str], Union[int, str]]] = None):
+    """
+    Get the maximum memory available if nothing is passed, converts string to int otherwise.
+    """
+    import psutil
+    if max_memory is None:
+        max_memory = {}
+        # Make sure CUDA is initialized on each GPU to have the right memory info.
+        if is_npu_available():
+            for i in range(torch.npu.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("npu", i))
+                    max_memory[i] = torch.npu.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        elif is_mlu_available():
+            for i in range(torch.mlu.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("mlu", i))
+                    max_memory[i] = torch.mlu.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        elif is_sdaa_available():
+            for i in range(torch.sdaa.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("sdaa", i))
+                    max_memory[i] = torch.sdaa.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        elif is_musa_available():
+            for i in range(torch.musa.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("musa", i))
+                    max_memory[i] = torch.musa.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        elif is_xpu_available():
+            for i in range(torch.xpu.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("xpu", i))
+                    max_memory[i] = get_xpu_available_memory(i)
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        elif is_hpu_available():
+            for i in range(torch.hpu.device_count()):
+                try:
+                    _ = torch.tensor(0, device=torch.device("hpu", i))
+                    max_memory[i] = torch.hpu.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        else:
+            for i in range(torch.cuda.device_count()):
+                try:
+                    _ = torch.tensor([0], device=i)
+                    max_memory[i] = torch.cuda.mem_get_info(i)[0]
+                except Exception:
+                    logger.info(f"Device {i} seems unavailable, Proceeding to check subsequent devices.")
+                    continue
+        # allocate everything in the mps device as the RAM is shared
+        if is_mps_available():
+            max_memory["mps"] = psutil.virtual_memory().available
+        else:
+            max_memory["cpu"] = psutil.virtual_memory().available
+        return max_memory
+    for key in max_memory:
+        if isinstance(max_memory[key], str):
+            max_memory[key] = convert_file_size_to_int(max_memory[key])
+    # Need to sort the device by type to make sure that we allocate the gpu first.
+    # As gpu/npu/xpu are represented by int, we need to sort them first.
+    gpu_devices = [k for k in max_memory.keys() if isinstance(k, int)]
+    gpu_devices.sort()
+    # check if gpu/npu/xpu devices are available and if not, throw a warning
+    if is_npu_available():
+        num_devices = torch.npu.device_count()
+    elif is_mlu_available():
+        num_devices = torch.mlu.device_count()
+    elif is_sdaa_available():
+        num_devices = torch.sdaa.device_count()
+    elif is_musa_available():
+        num_devices = torch.musa.device_count()
+    elif is_xpu_available():
+        num_devices = torch.xpu.device_count()
+    elif is_hpu_available():
+        num_devices = torch.hpu.device_count()
+    else:
+        num_devices = torch.cuda.device_count()
+    for device in gpu_devices:
+        if device >= num_devices or device < 0:
+            logger.warning(f"Device {device} is not available, available devices are {list(range(num_devices))}")
+    # Add the other devices in the preset order if they are available
+    all_devices = gpu_devices + [k for k in ["mps", "cpu", "disk"] if k in max_memory.keys()]
+    # Raise an error if a device is not recognized
+    for k in max_memory.keys():
+        if k not in all_devices:
+            raise ValueError(
+                f"Device {k} is not recognized, available devices are integers (for GPU/XPU), 'mps', 'cpu' and 'disk'"
+            )
+    max_memory = {k: max_memory[k] for k in all_devices}
+    return max_memory
+def clean_device_map(device_map: dict[str, Union[int, str, torch.device]], module_name: str = ""):
+    """
+    Cleans a device_map by grouping all submodules that go on the same device together.
+    """
+    # Get the value of the current module and if there is only one split across several keys, regroup it.
+    prefix = "" if module_name == "" else f"{module_name}."
+    values = [v for k, v in device_map.items() if k.startswith(prefix)]
+    if len(set(values)) == 1 and len(values) > 1:
+        for k in [k for k in device_map if k.startswith(prefix)]:
+            del device_map[k]
+        device_map[module_name] = values[0]
+    # Recurse over the children
+    children_modules = [k for k in device_map.keys() if k.startswith(prefix) and len(k) > len(module_name)]
+    idx = len(module_name.split(".")) + 1 if len(module_name) > 0 else 1
+    children_modules = set(".".join(k.split(".")[:idx]) for k in children_modules)
+    for child in children_modules:
+        clean_device_map(device_map, module_name=child)
+    return device_map
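To see what the regrouping in `clean_device_map` does to a concrete map, here is a standalone sketch of the same recursion on plain dicts. `regroup_device_map` is a hypothetical name, and the module names and device ids are invented for the example.

```python
def regroup_device_map(device_map: dict, module_name: str = "") -> dict:
    """Collapse entries under `module_name` into one key when they share a device."""
    prefix = "" if module_name == "" else f"{module_name}."
    values = [v for k, v in device_map.items() if k.startswith(prefix)]
    if len(set(values)) == 1 and len(values) > 1:
        # Every submodule sits on the same device: keep a single parent entry.
        for key in [k for k in device_map if k.startswith(prefix)]:
            del device_map[key]
        device_map[module_name] = values[0]
    # Recurse into the direct children that still have separate entries.
    children = [k for k in device_map if k.startswith(prefix) and len(k) > len(module_name)]
    depth = len(module_name.split(".")) + 1 if module_name else 1
    for child in {".".join(k.split(".")[:depth]) for k in children}:
        regroup_device_map(device_map, module_name=child)
    return device_map


mixed = regroup_device_map(
    {"model.layers.0.attn": 0, "model.layers.0.mlp": 0, "model.layers.1": 1, "lm_head": 1}
)
uniform = regroup_device_map({"a.x": 0, "a.y": 0, "b": 0})
print(mixed)
print(uniform)
```

In the first map only `model.layers.0` collapses, because its parent still spans two devices. In the second map everything is on device 0, so the whole map collapses to the root key `""`.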
+def load_offloaded_weights(model, index, offload_folder):
+    """
+    Loads the weights from the offload folder into the model.
+    Args:
+        model (`torch.nn.Module`):
+            The model to load the weights into.
+        index (`dict`):
+            A dictionary containing the parameter name and its metadata for each parameter that was offloaded from the
+            model.
+        offload_folder (`str`):
+            The folder where the offloaded weights are stored.
+    """
+    if index is None or len(index) == 0:
+        # Nothing to do
+        return
+    for param_name, metadata in index.items():
+        if "SCB" in param_name:
+            continue
+        fp16_statistics = None
+        if "weight" in param_name and param_name.replace("weight", "SCB") in index.keys():
+            weight_name = param_name.replace("weight", "SCB")
+            fp16_statistics = load_offloaded_weight(
+                os.path.join(offload_folder, f"{weight_name}.dat"), index[weight_name]
+            )
+        tensor_file = os.path.join(offload_folder, f"{param_name}.dat")
+        weight = load_offloaded_weight(tensor_file, metadata)
+        set_module_tensor_to_device(model, param_name, "cpu", value=weight, fp16_statistics=fp16_statistics)
+def get_module_leaves(module_sizes):
+    module_children = {}
+    for module in module_sizes:
+        if module == "" or "." not in module:
+            continue
+        parent = module.rsplit(".", 1)[0]
+        module_children[parent] = module_children.get(parent, 0) + 1
+    leaves = [module for module in module_sizes if module_children.get(module, 0) == 0 and module != ""]
+    return leaves
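`get_module_leaves` works purely on the dotted names in the size dictionary. The sketch below re-implements it on a plain dict to show the two-pass use made of it in `get_balanced_memory`: the first pass finds the parameter entries, and after dropping them the second pass finds the innermost modules. `find_leaves` and the sizes are hypothetical, chosen only for illustration.

```python
def find_leaves(module_sizes: dict) -> list:
    """Return the names that are not the parent of any other dotted name."""
    child_count = {}
    for name in module_sizes:
        if name == "" or "." not in name:
            continue
        parent = name.rsplit(".", 1)[0]
        child_count[parent] = child_count.get(parent, 0) + 1
    return [n for n in module_sizes if child_count.get(n, 0) == 0 and n != ""]


# Invented sizes in the shape produced by a per-prefix size computation.
module_sizes = {
    "": 144,
    "linear1": 80,
    "linear1.weight": 64,
    "linear1.bias": 16,
    "linear2": 64,
    "linear2.weight": 64,
}

# First pass: the leaves are the parameters themselves.
param_leaves = find_leaves(module_sizes)
# Second pass: once the parameters are removed, the leaves are the final modules.
without_params = {n: v for n, v in module_sizes.items() if n not in set(param_leaves)}
module_leaves = find_leaves(without_params)
mean_leaf_size = int(sum(without_params[n] for n in module_leaves) / max(len(module_leaves), 1))
print(param_leaves, module_leaves, mean_leaf_size)
```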
+def get_balanced_memory(
+    model: nn.Module,
+    max_memory: Optional[dict[Union[int, str], Union[int, str]]] = None,
+    no_split_module_classes: Optional[list[str]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    special_dtypes: Optional[dict[str, Union[str, torch.dtype]]] = None,
+    low_zero: bool = False,
+):
+    """
+    Compute a `max_memory` dictionary for [`infer_auto_device_map`] that will balance the use of each available GPU.
+    <Tip>
+    All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the
+    meta device (as it would if initialized within the `init_empty_weights` context manager).
+    </Tip>
+    Args:
+        model (`torch.nn.Module`):
+            The model to analyze.
+        max_memory (`Dict`, *optional*):
+            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
+            Example: `max_memory={0: "1GB"}`.
+        no_split_module_classes (`List[str]`, *optional*):
+            A list of layer class names that should never be split across devices (for instance any layer that has a
+            residual connection).
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+        special_dtypes (`Dict[str, Union[str, torch.dtype]]`, *optional*):
+            If provided, special dtypes to consider for some specific weights (will override dtype used as default for
+            all weights).
+        low_zero (`bool`, *optional*):
+            Minimizes the number of weights on GPU 0, which is convenient when it's used for other operations (like the
+            Transformers generate function).
+    """
+    # Get default / clean up max_memory
+    user_not_set_max_memory = max_memory is None
+    max_memory = get_max_memory(max_memory)
+    if is_npu_available():
+        expected_device_type = "npu"
+    elif is_mlu_available():
+        expected_device_type = "mlu"
+    elif is_sdaa_available():
+        expected_device_type = "sdaa"
+    elif is_musa_available():
+        expected_device_type = "musa"
+    elif is_xpu_available():
+        expected_device_type = "xpu"
+    elif is_hpu_available():
+        expected_device_type = "hpu"
+    elif is_mps_available():
+        expected_device_type = "mps"
+    else:
+        expected_device_type = "cuda"
+    num_devices = len([d for d in max_memory if torch.device(d).type == expected_device_type and max_memory[d] > 0])
+    if num_devices == 0:
+        return max_memory
+    if num_devices == 1:
+        # We cannot do low_zero on just one GPU, but we will still reserve some memory for the buffer
+        low_zero = False
+        # If user just asked us to handle memory usage, we should avoid OOM
+        if user_not_set_max_memory:
+            for key in max_memory.keys():
+                if isinstance(key, int):
+                    max_memory[key] *= 0.9  # 90% is a good compromise
+                    logger.info(
+                        f"We will use 90% of the memory on device {key} for storing the model, and 10% for the buffer to avoid OOM. "
+                        "You can set `max_memory` to a higher value to use more memory (at your own risk)."
+                    )
+                    break  # only one device
+    module_sizes = compute_module_sizes(model, dtype=dtype, special_dtypes=special_dtypes)
+    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
+    # We can't just set the memory to model_size // num_devices as it will end up being too small: each GPU will get
+    # slightly fewer layers and some layers will end up offloaded at the end. So this function computes a buffer size
+    # to add, which is the biggest of:
+    # - the size of no split block (if applicable)
+    # - the mean of the layer sizes
+    if no_split_module_classes is None:
+        no_split_module_classes = []
+    elif not isinstance(no_split_module_classes, (list, tuple)):
+        no_split_module_classes = [no_split_module_classes]
+    # Identify the size of the no_split_block modules
+    if len(no_split_module_classes) > 0:
+        no_split_children = {}
+        for name, size in module_sizes.items():
+            if name == "":
+                continue
+            submodule = model
+            for submodule_name in name.split("."):
+                submodule = getattr(submodule, submodule_name)
+            class_name = submodule.__class__.__name__
+            if class_name in no_split_module_classes and class_name not in no_split_children:
+                no_split_children[class_name] = size
+            if set(no_split_children.keys()) == set(no_split_module_classes):
+                break
+        buffer = max(no_split_children.values()) if len(no_split_children) > 0 else 0
+    else:
+        buffer = 0
+    # Compute mean of final modules. In the first dict of module sizes, leaves are the parameters
+    leaves = get_module_leaves(module_sizes)
+    leaves_set = set(leaves)  # Convert to set for O(1) membership testing
+    module_sizes = {n: v for n, v in module_sizes.items() if n not in leaves_set}
+    # Once removed, leaves are the final modules.
+    leaves = get_module_leaves(module_sizes)
+    mean_leaves = int(sum([module_sizes[n] for n in leaves]) / max(len(leaves), 1))
+    buffer = int(1.25 * max(buffer, mean_leaves))
+    per_gpu += buffer
+    # Sorted list of GPU ids (we may have some GPU ids not included in our max_memory list - let's ignore them)
+    gpus_idx_list = list(
+        sorted(
+            device_id for device_id, device_mem in max_memory.items() if isinstance(device_id, int) and device_mem > 0
+        )
+    )
+    # The last device is left with max_memory just in case the buffer is not enough.
+    for idx in gpus_idx_list[:-1]:
+        max_memory[idx] = min(max_memory[0] if low_zero and idx == 0 else per_gpu, max_memory[idx])
+    if low_zero:
+        min_zero = max(0, module_sizes[""] - sum([max_memory[i] for i in range(1, num_devices)]))
+        max_memory[0] = min(min_zero, max_memory[0])
+    return max_memory
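The per-device budget computed in `get_balanced_memory` is an even share of the model plus a safety buffer. The arithmetic can be checked in isolation with the sketch below, which assumes the model size, the largest no-split block size, and the mean leaf-module size are already known. `per_device_budget` and all numbers are hypothetical.

```python
def per_device_budget(
    total_size: int,
    num_devices: int,
    no_split_block_size: int,
    mean_leaf_size: int,
    low_zero: bool = False,
) -> int:
    """Even share of the model plus a 25% padded buffer, as in the code above."""
    # With low_zero, device 0 is left out of the even split.
    share = total_size // (num_devices - 1 if low_zero else num_devices)
    # The buffer is the larger of the no-split block and the mean leaf size, padded by 25%.
    buffer = int(1.25 * max(no_split_block_size, mean_leaf_size))
    return share + buffer


# Invented sizes in bytes: a 1000-byte model on 4 devices.
print(per_device_budget(1000, 4, no_split_block_size=100, mean_leaf_size=40))
print(per_device_budget(1000, 4, no_split_block_size=100, mean_leaf_size=40, low_zero=True))
```

In the function above this budget is then capped by each device's available memory, and the last device keeps its full `max_memory` in case the buffer turns out to be too small.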
+def calculate_maximum_sizes(model: torch.nn.Module):
+    "Computes the total size of the model and its largest layer"
+    sizes = compute_module_sizes(model)
+    # `transformers` models store this information for us
+    no_split_modules = getattr(model, "_no_split_modules", None)
+    if no_split_modules is None:
+        no_split_modules = []
+    modules_to_treat = (
+        list(model.named_parameters(recurse=False))
+        + list(model.named_children())
+        + list(model.named_buffers(recurse=False))
+    )
+    largest_layer = get_max_layer_size(modules_to_treat, sizes, no_split_modules)
+    total_size = sizes[""]
+    return total_size, largest_layer
+def _init_infer_auto_device_map(
+    model: nn.Module,
+    max_memory: Optional[dict[Union[int, str], Union[int, str]]] = None,
+    no_split_module_classes: Optional[list[str]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    special_dtypes: Optional[dict[str, Union[str, torch.dtype]]] = None,
+) -> tuple[
+    list[Union[int, str]],
+    dict[Union[int, str], Union[int, str]],
+    list[Union[int, str]],
+    list[int],
+    dict[str, int],
+    list[list[str]],
+    list[str],
+    list[tuple[str, nn.Module]],
+]:
+    """
+    Initialize variables required for computing the device map for model allocation.
+    """
+    max_memory = get_max_memory(max_memory)
+    if no_split_module_classes is None:
+        no_split_module_classes = []
+    elif not isinstance(no_split_module_classes, (list, tuple)):
+        no_split_module_classes = [no_split_module_classes]
+    devices = list(max_memory.keys())
+    if "disk" not in devices:
+        devices.append("disk")
+    gpus = [device for device in devices if device not in ["cpu", "disk"]]
+    # Devices that need to keep space for a potential offloaded layer.
+    if "mps" in gpus:
+        main_devices = ["mps"]
+    elif len(gpus) > 0:
+        main_devices = [gpus[0], "cpu"]
+    else:
+        main_devices = ["cpu"]
+    module_sizes = compute_module_sizes(model, dtype=dtype, special_dtypes=special_dtypes)
+    tied_parameters = find_tied_parameters(model)
+    if check_tied_parameters_in_config(model) and len(tied_parameters) == 0:
+        logger.warning(
+            "The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device_map` function."
+        )
+    # Direct submodules and parameters
+    modules_to_treat = (
+        list(model.named_parameters(recurse=False))
+        + list(model.named_children())
+        + list(model.named_buffers(recurse=False))
+    )
+    return (
+        devices,
+        max_memory,
+        main_devices,
+        gpus,
+        module_sizes,
+        tied_parameters,
+        no_split_module_classes,
+        modules_to_treat,
+    )
+def get_module_size_with_ties(
+    tied_params,
+    module_size,
+    module_sizes,
+    modules_to_treat,
+) -> tuple[int, list[str], list[nn.Module]]:
+    """
+    Calculate the total size of a module, including its tied parameters.
+    Args:
+        tied_params (`List[str]`): The list of tied parameters.
+        module_size (`int`): The size of the module without tied parameters.
+        module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
+        modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.
+    Returns:
+        `Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
+        tied modules.
+    """
+    if len(tied_params) < 1:
+        return module_size, [], []
+    tied_module_names = []
+    tied_modules = []
+    for tied_param in tied_params:
+        tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
+        tied_module_names.append(modules_to_treat[tied_module_index][0])
+        tied_modules.append(modules_to_treat[tied_module_index][1])
+    module_size_with_ties = module_size
+    for tied_param, tied_module_name in zip(tied_params, tied_module_names):
+        module_size_with_ties += module_sizes[tied_module_name] - module_sizes[tied_param]
+    return module_size_with_ties, tied_module_names, tied_modules
+def fallback_allocate(
+    modules: list[tuple[str, nn.Module]],
+    module_sizes: dict[str, int],
+    size_limit: Union[int, str],
+    no_split_module_classes: Optional[list[str]] = None,
+    tied_parameters: Optional[list[list[str]]] = None,
+) -> tuple[Optional[str], Optional[nn.Module], list[tuple[str, nn.Module]]]:
+    """
+    Find a module that fits in the size limit using BFS and return it with its name and the remaining modules.
+    Args:
+        modules (`List[Tuple[str, nn.Module]]`):
+            The list of named modules to search in.
+        module_sizes (`Dict[str, int]`):
+            A dictionary mapping each layer name to its size (as generated by `compute_module_sizes`).
+        size_limit (`Union[int, str]`):
+            The maximum size a module can have.
+        no_split_module_classes (`Optional[List[str]]`, *optional*):
+            A list of class names for layers we don't want to be split.
+        tied_parameters (`Optional[List[List[str]]]`, *optional*):
+            A list of lists of parameter names being all tied together.
+    Returns:
+        `Tuple[Optional[str], Optional[nn.Module], List[Tuple[str, nn.Module]]]`: A tuple containing:
+        - The name of the module that fits within the size limit.
+        - The module itself.
+        - The list of remaining modules after the found module is removed.
+    """
+    try:
+        size_limit = convert_file_size_to_int(size_limit)
+    except ValueError:
+        return None, None, modules
+    if no_split_module_classes is None:
+        no_split_module_classes = []
+    if tied_parameters is None:
+        tied_parameters = []
+    modules_to_search = modules.copy()
+    module_found = False
+    while modules_to_search:
+        name, module = modules_to_search.pop(0)
+        tied_param_groups = [
+            tied_group
+            for tied_group in tied_parameters
+            if any(name + "." in k + "." for k in tied_group) and not all(name + "." in k + "." for k in tied_group)
+        ]
+        tied_params = sum(
+            [[p for p in tied_group if name + "." not in p + "."] for tied_group in tied_param_groups], []
+        )
+        module_size_with_ties, _, _ = get_module_size_with_ties(
+            tied_params, module_sizes[name], module_sizes, modules_to_search
+        )
+        # If the module fits in the size limit, we found it.
+        if module_size_with_ties <= size_limit:
+            module_found = True
+            break
+        # The module is too big, we need to split it if possible.
+        modules_children = (
+            [] if isinstance(module, (nn.Parameter, torch.Tensor)) else list(module.named_children())
+        )
+        # Split fails, move to the next module
+        if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
+            continue
+        # split is possible, add the children to the list of modules to search
+        modules_children = list(module.named_parameters(recurse=False)) + modules_children
+        modules_to_search = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_search
+    if not module_found:
+        return None, None, modules
+    # Prepare the module list for removal of the found module
+    current_names = [n for n, _ in modules]
+    dot_idx = [i for i, c in enumerate(name) if c == "."]
+    for dot_index in dot_idx:
+        parent_name = name[:dot_index]
+        if parent_name in current_names:
+            parent_module_idx = current_names.index(parent_name)
+            _, parent_module = modules[parent_module_idx]
+            module_children = list(parent_module.named_parameters(recurse=False)) + list(
+                parent_module.named_children()
+            )
+            modules = (
+                modules[:parent_module_idx]
+                + [(f"{parent_name}.{n}", v) for n, v in module_children]
+                + modules[parent_module_idx + 1 :]
+            )
+            current_names = [n for n, _ in modules]
+    # Now the target module should be directly in the list
+    target_idx = current_names.index(name)
+    name, module = modules.pop(target_idx)
+    return name, module, modules
+def infer_auto_device_map(
+    model: nn.Module,
+    max_memory: Optional[dict[Union[int, str], Union[int, str]]] = None,
+    no_split_module_classes: Optional[list[str]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    special_dtypes: Optional[dict[str, Union[str, torch.dtype]]] = None,
+    verbose: bool = False,
+    clean_result: bool = True,
+    offload_buffers: bool = False,
+    fallback_allocation: bool = False,
+):
+    """
+    Compute a device map for a given model giving priority to GPUs, then offload on CPU and finally offload to disk,
+    such that:
+    - we don't exceed the memory available on any of the GPUs.
+    - if offload to the CPU is needed, there is always room left on GPU 0 to put back the layer offloaded on CPU that
+      has the largest size.
+    - if offload to the CPU is needed, we don't exceed the RAM available on the CPU.
+    - if offload to the disk is needed, there is always room left on the CPU to put back the layer offloaded on disk
+      that has the largest size.
+    <Tip>
+    All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the
+    meta device (as it would be if initialized within the `init_empty_weights` context manager).
+    </Tip>
+    Args:
+        model (`torch.nn.Module`):
+            The model to analyze.
+        max_memory (`Dict`, *optional*):
+            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
+            Example: `max_memory={0: "1GB"}`.
+        no_split_module_classes (`List[str]`, *optional*):
+            A list of layer class names that should never be split across devices (for instance any layer that has a
+            residual connection).
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+        special_dtypes (`Dict[str, Union[str, torch.dtype]]`, *optional*):
+            If provided, special dtypes to consider for some specific weights (will override dtype used as default for
+            all weights).
+        verbose (`bool`, *optional*, defaults to `False`):
+            Whether or not to provide debugging statements as the function builds the device_map.
+        clean_result (`bool`, *optional*, defaults to `True`):
+            Clean the resulting device_map by grouping all submodules that go on the same device together.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            In the layers that are offloaded on the CPU or the hard drive, whether or not to offload the buffers as
+            well as the parameters.
+        fallback_allocation (`bool`, *optional*, defaults to `False`):
+            When regular allocation fails, try to allocate a module that fits in the size limit using BFS.
+    """
+    # Initialize the variables
+    (
+        devices,
+        max_memory,
+        main_devices,
+        gpus,
+        module_sizes,
+        tied_parameters,
+        no_split_module_classes,
+        modules_to_treat,
+    ) = _init_infer_auto_device_map(model, max_memory, no_split_module_classes, dtype, special_dtypes)
+    device_map = OrderedDict()
+    current_device = 0
+    device_memory_used = {device: 0 for device in devices}
+    device_buffer_sizes = {}
+    device_minimum_assignment_memory = {}
+    # Initialize maximum largest layer, to know which space to keep in memory
+    max_layer_size, max_layer_names = get_max_layer_size(modules_to_treat, module_sizes, no_split_module_classes)
+    # Ready ? This is going to be a bit messy.
+    while len(modules_to_treat) > 0:
+        name, module = modules_to_treat.pop(0)
+        if verbose:
+            print(f"\nTreating module {name}.")
+        # Max size in the remaining layers may have changed since we took one, so we maybe update it.
+        max_layer_names = [n for n in max_layer_names if n != name and not n.startswith(name + ".")]
+        if len(max_layer_names) == 0:
+            max_layer_size, max_layer_names = get_max_layer_size(
+                [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                module_sizes,
+                no_split_module_classes,
+            )
+        # Assess size needed
+        module_size = module_sizes[name]
+        # We keep relevant tied parameters only: one of the tied parameters in the group is inside the current module
+        # and the other is not.
+        # Note: If we are currently processing the name `compute.weight`, another parameter named
+        # e.g. `compute.weight_submodule.parameter`
+        # needs to be considered outside the current module, hence the check with additional dots.
+        tied_param_groups = [
+            tied_group
+            for tied_group in tied_parameters
+            if any(name + "." in k + "." for k in tied_group) and not all(name + "." in k + "." for k in tied_group)
+        ]
+        if verbose and len(tied_param_groups) > 0:
+            print(f"  Found the relevant tied param groups {tied_param_groups}")
+        # Then we keep track of all the parameters that are tied to the current module, but not in the current module
+        tied_params = sum(
+            [[p for p in tied_group if name + "." not in p + "."] for tied_group in tied_param_groups], []
+        )
+        if verbose and len(tied_params) > 0:
+            print(f"  So those parameters need to be taken into account {tied_params}")
+        device = devices[current_device]
+        current_max_size = max_memory[device] if device != "disk" else None
+        current_memory_reserved = 0
+        # Reduce max size available by the largest layer.
+        if devices[current_device] in main_devices:
+            current_max_size = current_max_size - max_layer_size
+            current_memory_reserved = max_layer_size
+        module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
+            tied_params, module_size, module_sizes, modules_to_treat
+        )
+        # The module and its tied modules fit on the current device.
+        if current_max_size is None or device_memory_used[device] + module_size_with_ties <= current_max_size:
+            if verbose:
+                output = f"Putting {name}"
+                if tied_module_names:
+                    output += f" and {tied_module_names}"
+                else:
+                    output += f" (size={module_size})"
+                if current_max_size is not None:
+                    output += f" (available={current_max_size - device_memory_used[device]})"
+                output += f" on {device}."
+                print(output)
+            device_memory_used[device] += module_size_with_ties
+            # Assign the primary module to the device.
+            device_map[name] = device
+            # Assign tied modules if any.
+            for tied_module_name in tied_module_names:
+                if tied_module_name in [m[0] for m in modules_to_treat]:
+                    # Find the index of the tied module in the list
+                    tied_module_index = next(i for i, (n, _) in enumerate(modules_to_treat) if n == tied_module_name)
+                    # Remove the tied module from the list to prevent reprocessing
+                    modules_to_treat.pop(tied_module_index)
+                # Assign the tied module to the device
+                device_map[tied_module_name] = device
+            # Buffer Handling
+            if not offload_buffers and isinstance(module, nn.Module):
+                # Compute the total buffer size for the module
+                current_buffer_size = compute_module_total_buffer_size(
+                    module, dtype=dtype, special_dtypes=special_dtypes
+                )
+                # Update the buffer size on the device
+                device_buffer_sizes[device] = device_buffer_sizes.get(device, 0) + current_buffer_size
+            continue
+        # The current module itself fits, so we try to split the tied modules.
+        if len(tied_params) > 0 and device_memory_used[device] + module_size <= current_max_size:
+            # can we split one of the tied modules to make it smaller or do we need to go on the next device?
+            if verbose:
+                print(
+                    f"Not enough space on {devices[current_device]} to put {name} and {tied_module_names} (space "
+                    f"available {current_max_size - device_memory_used[device]}, needed size {module_size_with_ties})."
+                )
+            split_happened = False
+            for tied_module_name, tied_module in zip(tied_module_names, tied_modules):
+                tied_module_children = list(tied_module.named_children())
+                if len(tied_module_children) == 0 or tied_module.__class__.__name__ in no_split_module_classes:
+                    # can't break this one.
+                    continue
+                if verbose:
+                    print(f"Splitting {tied_module_name}.")
+                tied_module_children = list(tied_module.named_parameters(recurse=False)) + tied_module_children
+                tied_module_children = [(f"{tied_module_name}.{n}", v) for n, v in tied_module_children]
+                tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if n == tied_module_name][0]
+                modules_to_treat = (
+                    [(name, module)]
+                    + modules_to_treat[:tied_module_index]
+                    + tied_module_children
+                    + modules_to_treat[tied_module_index + 1 :]
+                )
+                # Update the max layer size.
+                max_layer_size, max_layer_names = get_max_layer_size(
+                    [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                    module_sizes,
+                    no_split_module_classes,
+                )
+                split_happened = True
+                break
+            if split_happened:
+                continue
+            # If the tied module is not split, we go to the next device
+            if verbose:
+                print("None of the tied modules can be split, going to the next device.")
+        # The current module itself doesn't fit, so we have to split it or go to the next device.
+        if device_memory_used[device] + module_size >= current_max_size:
+            # Split or not split?
+            modules_children = (
+                [] if isinstance(module, (nn.Parameter, torch.Tensor)) else list(module.named_children())
+            )
+            if verbose:
+                print(
+                    f"Not enough space on {devices[current_device]} to put {name} (space available "
+                    f"{current_max_size - device_memory_used[device]}, module size {module_size})."
+                )
+            if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
+                # -> no split, we go to the next device
+                if verbose:
+                    print("This module cannot be split, going to the next device.")
+            else:
+                # -> split, we replace the module studied by its children + parameters
+                if verbose:
+                    print(f"Splitting {name}.")
+                modules_children = list(module.named_parameters(recurse=False)) + modules_children
+                modules_to_treat = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_treat
+                # Update the max layer size.
+                max_layer_size, max_layer_names = get_max_layer_size(
+                    [(n, m) for n, m in modules_to_treat if isinstance(m, torch.nn.Module)],
+                    module_sizes,
+                    no_split_module_classes,
+                )
+                continue
+        # If no module is assigned to the current device, we attempt to allocate a fallback module
+        # if fallback_allocation is enabled.
+        if device_memory_used[device] == 0 and fallback_allocation and device != "disk":
+            # We try to allocate a module that fits in the size limit using BFS.
+            # Recompute the current max size as we need to consider the current module as well.
+            current_max_size = max_memory[device] - max(max_layer_size, module_size_with_ties)
+            fallback_module_name, fallback_module, remaining_modules = fallback_allocate(
+                modules_to_treat,
+                module_sizes,
+                current_max_size - device_memory_used[device],
+                no_split_module_classes,
+                tied_parameters,
+            )
+            # use the next iteration to put the fallback module on the next device to avoid code duplication
+            if fallback_module is not None:
+                modules_to_treat = [(fallback_module_name, fallback_module)] + [(name, module)] + remaining_modules
+                continue
+        if device_memory_used[device] == 0:
+            device_minimum_assignment_memory[device] = module_size_with_ties + current_memory_reserved
+        #  Neither the current module nor any tied modules can be split, so we move to the next device.
+        device_memory_used[device] = device_memory_used[device] + current_memory_reserved
+        current_device += 1
+        modules_to_treat = [(name, module)] + modules_to_treat
+    device_memory_used = {device: mem for device, mem in device_memory_used.items() if mem > 0}
+    if clean_result:
+        device_map = clean_device_map(device_map)
+    non_gpu_buffer_size = device_buffer_sizes.get("cpu", 0) + device_buffer_sizes.get("disk", 0)
+    if non_gpu_buffer_size > 0 and not offload_buffers:
+        is_buffer_fit_any_gpu = False
+        for gpu_device, gpu_max_memory in max_memory.items():
+            if gpu_device == "cpu" or gpu_device == "disk":
+                continue
+            if not is_buffer_fit_any_gpu:
+                gpu_memory_used = device_memory_used.get(gpu_device, 0)
+                if gpu_max_memory >= non_gpu_buffer_size + gpu_memory_used:
+                    is_buffer_fit_any_gpu = True
+        if len(gpus) > 0 and not is_buffer_fit_any_gpu:
+            warnings.warn(
+                f"Current model requires {non_gpu_buffer_size} bytes of buffer for offloaded layers, which does not "
+                f"seem to fit in any GPU's remaining memory. If you experience an OOM later, please consider using "
+                f"offload_buffers=True."
+            )
+    if device_minimum_assignment_memory:
+        devices_info = "\n".join(
+            f"  - {device}: {mem} bytes required" for device, mem in device_minimum_assignment_memory.items()
+        )
+        logger.info(
+            f"Based on the current allocation process, no modules could be assigned to the following devices due to "
+            f"insufficient memory:\n"
+            f"{devices_info}\n"
+            f"These minimum requirements are specific to this allocation attempt and may vary. Consider increasing "
+            f"the available memory for these devices to at least the specified minimum, or adjusting the model config."
+        )
+    return device_map
+def check_device_map(model: nn.Module, device_map: dict[str, Union[int, str, torch.device]]):
+    """
+    Checks that a device map covers everything in a given model.
+    Args:
+        model (`torch.nn.Module`): The model to check the device map against.
+        device_map (`Dict[str, Union[int, str, torch.device]]`): The device map to check.
+    """
+    all_module_names = dict(model.named_modules())
+    invalid_keys = [k for k in device_map if k != "" and k not in all_module_names]
+    if invalid_keys:
+        warnings.warn(
+            f"The following device_map keys do not match any submodules in the model: {invalid_keys}", UserWarning
+        )
+    all_model_tensors = [name for name, _ in model.state_dict().items()]
+    for module_name in device_map.keys():
+        if module_name == "":
+            all_model_tensors.clear()
+            break
+        else:
+            all_model_tensors = [
+                name
+                for name in all_model_tensors
+                if not name == module_name and not name.startswith(module_name + ".")
+            ]
+    if len(all_model_tensors) > 0:
+        non_covered_params = ", ".join(all_model_tensors)
+        raise ValueError(
+            f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
+        )
+def load_state_dict(checkpoint_file, device_map=None):
+    """
+    Load a checkpoint from a given file. If the checkpoint is in the safetensors format and a device map is passed, the
+    weights can be fast-loaded directly on the GPU.
+    Args:
+        checkpoint_file (`str`): The path to the checkpoint to load.
+        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
+            A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
+            name; once a given module name is inside, every submodule of it will be sent to the same device.
+    """
+    if checkpoint_file.endswith(".safetensors"):
+        with safe_open(checkpoint_file, framework="pt") as f:
+            metadata = f.metadata()
+            weight_names = f.keys()
+        if metadata is None:
+            logger.warning(
+                f"The safetensors archive passed at {checkpoint_file} does not contain metadata. "
+                "Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata."
+            )
+            metadata = {"format": "pt"}
+        if metadata.get("format") not in ["pt", "tf", "flax"]:
+            raise OSError(
+                f"The safetensors archive passed at {checkpoint_file} does not contain valid metadata. Make sure "
+                "you save your model with the `save_pretrained` method."
+            )
+        elif metadata["format"] != "pt":
+            raise ValueError(f"The checkpoint passed was saved with {metadata['format']}, we need the pt format.")
+        if device_map is None:
+            return safe_load_file(checkpoint_file)
+        else:
+            # if we only have one device we can load everything directly
+            if len(set(device_map.values())) == 1:
+                device = list(device_map.values())[0]
+                target_device = device
+                if isinstance(device, int):
+                    if is_npu_available():
+                        target_device = f"npu:{device}"
+                    elif is_hpu_available():
+                        target_device = "hpu"
+                return safe_load_file(checkpoint_file, device=target_device)
+            devices = list(set(device_map.values()) - {"disk"})
+            # cpu device should always exist as fallback option
+            if "cpu" not in devices:
+                devices.append("cpu")
+            # For each device, get the weights that go there
+            device_weights = {device: [] for device in devices}
+            for module_name, device in device_map.items():
+                if device in devices:
+                    device_weights[device].extend(
+                        [k for k in weight_names if k == module_name or k.startswith(module_name + ".")]
+                    )
+            # all weights that haven't defined a device should be loaded on CPU
+            device_weights["cpu"].extend([k for k in weight_names if k not in sum(device_weights.values(), [])])
+            tensors = {}
+            if is_tqdm_available():
+                progress_bar = tqdm(
+                    main_process_only=False,
+                    total=sum([len(device_weights[device]) for device in devices]),
+                    unit="w",
+                    smoothing=0,
+                    leave=False,
+                )
+            else:
+                progress_bar = None
+            for device in devices:
+                target_device = device
+                if isinstance(device, int):
+                    if is_npu_available():
+                        target_device = f"npu:{device}"
+                    elif is_hpu_available():
+                        target_device = "hpu"
+                with safe_open(checkpoint_file, framework="pt", device=target_device) as f:
+                    for key in device_weights[device]:
+                        if progress_bar is not None:
+                            progress_bar.set_postfix(dev=device, refresh=False)
+                            progress_bar.set_description(key)
+                        tensors[key] = f.get_tensor(key)
+                        if progress_bar is not None:
+                            progress_bar.update()
+            if progress_bar is not None:
+                progress_bar.close()
+            return tensors
+    else:
+        return torch.load(checkpoint_file, map_location=torch.device("cpu"), weights_only=True)
+def get_state_dict_offloaded_model(model: nn.Module):
+    """
+    Returns the state dictionary for an offloaded model via iterative onloading
+    Args:
+        model (`torch.nn.Module`):
+            The offloaded model we want to save
+    """
+    state_dict = {}
+    placeholders = set()
+    for name, module in model.named_modules():
+        if name == "":
+            continue
+        try:
+            with align_module_device(module, "cpu"):
+                module_state_dict = module.state_dict()
+        except MemoryError:
+            raise MemoryError("Offloaded module must fit in CPU memory to call save_model!") from None
+        for key in module_state_dict:
+            # ignore placeholder parameters that are still on the meta device
+            if module_state_dict[key].device == torch.device("meta"):
+                placeholders.add(name + f".{key}")
+                continue
+            params = module_state_dict[key]
+            state_dict[name + f".{key}"] = params.to("cpu")  # move buffers to cpu
+    for key in placeholders.copy():
+        if key in state_dict:
+            placeholders.remove(key)
+    if placeholders:
+        logger.warning(f"The following tensors were not saved because they were still on meta device: {placeholders}")
+    return state_dict
+def get_state_dict_from_offload(
+    module: nn.Module,
+    module_name: str,
+    state_dict: dict[str, Union[str, torch.Tensor]],
+    device_to_put_offload: Union[int, str, torch.device] = "cpu",
+):
+    """
+    Retrieve the state dictionary (with parameters) from an offloaded module and load into a specified device (defaults
+    to cpu).
+    Args:
+        module (`torch.nn.Module`):
+            The module we want to retrieve a state dictionary from
+        module_name (`str`):
+            The name of the module of interest
+        state_dict (`Dict[str, Union[str, torch.Tensor]]`):
+            Dictionary of {module names: parameters}
+        device_to_put_offload (`Union[int, str, torch.device]`):
+            Device to load offloaded parameters into, defaults to the cpu.
+    """
+    root = module_name[: module_name.rfind(".")]  # module name without .weight or .bias
+    # do not move parameters if the module is not offloaded
+    if not has_offloaded_params(module):
+        device_to_put_offload = None
+    # assign the device to which the offloaded parameters will be sent
+    with align_module_device(module, device_to_put_offload):
+        for m_key, params in module.state_dict().items():
+            if (root + f".{m_key}") in state_dict:
+                state_dict[root + f".{m_key}"] = params
+    return state_dict
+def load_checkpoint_in_model(
+    model: nn.Module,
+    checkpoint: Union[str, os.PathLike],
+    device_map: Optional[dict[str, Union[int, str, torch.device]]] = None,
+    offload_folder: Optional[Union[str, os.PathLike]] = None,
+    dtype: Optional[Union[str, torch.dtype]] = None,
+    offload_state_dict: bool = False,
+    offload_buffers: bool = False,
+    keep_in_fp32_modules: Optional[list[str]] = None,
+    offload_8bit_bnb: bool = False,
+    strict: bool = False,
+    full_state_dict: bool = True,
+    broadcast_from_rank0: bool = False,
+):
+    """
+    Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
+    loaded.
+    <Tip warning={true}>
+    Once loaded across devices, you still need to call [`dispatch_model`] on your model to make it able to run. To
+    group the checkpoint loading and dispatch in one single call, use [`load_checkpoint_and_dispatch`].
+    </Tip>
+    Args:
+        model (`torch.nn.Module`):
+            The model in which we want to load a checkpoint.
+        checkpoint (`str` or `os.PathLike`):
+            The checkpoint to load. It can be:
+            - a path to a file containing a whole model state dict
+            - a path to a `.json` file containing the index to a sharded checkpoint
+            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint.
+            - a path to a folder containing a unique pytorch_model.bin or a model.safetensors file.
+        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
+            A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer
+            name, once a given module name is inside, every submodule of it will be sent to the same device.
+        offload_folder (`str` or `os.PathLike`, *optional*):
+            If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
+        dtype (`str` or `torch.dtype`, *optional*):
+            If provided, the weights will be converted to that type when loaded.
+        offload_state_dict (`bool`, *optional*, defaults to `False`):
+            If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM if
+            the weight of the CPU state dict + the biggest shard does not fit.
+        offload_buffers (`bool`, *optional*, defaults to `False`):
+            Whether or not to include the buffers in the weights offloaded to disk.
+        keep_in_fp32_modules (`List[str]`, *optional*):
+            A list of the modules that we keep in `torch.float32` dtype.
+        offload_8bit_bnb (`bool`, *optional*):
+            Whether or not to enable offload of 8-bit modules on cpu/disk.
+        strict (`bool`, *optional*, defaults to `False`):
+            Whether to strictly enforce that the keys in the checkpoint state_dict match the keys of the model's
+            state_dict.
+        full_state_dict (`bool`, *optional*, defaults to `True`): if this is set to `True`, all the tensors in the
+            loaded state_dict will be gathered. No ShardedTensor and DTensor will be in the loaded state_dict.
+        broadcast_from_rank0 (`bool`, *optional*, defaults to `False`): when the option is `True`, a distributed
+            `ProcessGroup` must be initialized. rank0 should receive a full state_dict and will broadcast the tensors
+            in the state_dict one by one to other ranks. Other ranks will receive the tensors and shard (if applicable)
+            according to the local shards in the model.
+    """
+    if offload_8bit_bnb:
+        from .bnb import quantize_and_offload_8bit
+    tied_params = find_tied_parameters(model)
+    if check_tied_parameters_in_config(model) and len(tied_params) == 0:
+        logger.warning(
+            "The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device_map` function."
+        )
+    if device_map is not None:
+        check_tied_parameters_on_same_device(tied_params, device_map)
+    if offload_folder is None and device_map is not None and "disk" in device_map.values():
+        raise ValueError(
+            "At least one of the model's submodules will be offloaded to disk, please pass along an `offload_folder`."
+        )
+    elif offload_folder is not None and device_map is not None and "disk" in device_map.values():
+        os.makedirs(offload_folder, exist_ok=True)
+    if isinstance(dtype, str):
+        # We accept "torch.float16" or just "float16"
+        dtype = dtype.replace("torch.", "")
+        dtype = getattr(torch, dtype)
+    checkpoint_files = None
+    index_filename = None
+    if os.path.isfile(checkpoint):
+        if str(checkpoint).endswith(".json"):
+            index_filename = checkpoint
+        else:
+            checkpoint_files = [checkpoint]
+    elif os.path.isdir(checkpoint):
+        # check if the whole state dict is present
+        potential_state_bin = [f for f in os.listdir(checkpoint) if f == WEIGHTS_NAME]
+        potential_state_safetensor = [f for f in os.listdir(checkpoint) if f == SAFE_WEIGHTS_NAME]
+        if len(potential_state_bin) == 1:
+            checkpoint_files = [os.path.join(checkpoint, potential_state_bin[0])]
+        elif len(potential_state_safetensor) == 1:
+            checkpoint_files = [os.path.join(checkpoint, potential_state_safetensor[0])]
+        else:
+            # otherwise check for sharded checkpoints
+            potential_index = [f for f in os.listdir(checkpoint) if f.endswith(".index.json")]
+            if len(potential_index) == 0:
+                raise ValueError(
+                    f"{checkpoint} is not a folder containing a `.index.json` file or a {WEIGHTS_NAME} or a {SAFE_WEIGHTS_NAME} file"
+                )
+            elif len(potential_index) == 1:
+                index_filename = os.path.join(checkpoint, potential_index[0])
+            else:
+                raise ValueError(
+                    f"{checkpoint} contains more than one `.index.json` file, delete the irrelevant ones."
+                )
+    else:
+        raise ValueError(
+            "`checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded "
+            f"checkpoint, or a folder containing a sharded checkpoint or the whole state dict, but got {checkpoint}."
+        )
+    if index_filename is not None:
+        checkpoint_folder = os.path.split(index_filename)[0]
+        with open(index_filename) as f:
+            index = json.loads(f.read())
+        if "weight_map" in index:
+            index = index["weight_map"]
+        checkpoint_files = sorted(list(set(index.values())))
+        checkpoint_files = [os.path.join(checkpoint_folder, f) for f in checkpoint_files]
+    # Logic for missing/unexpected keys goes here.
+    offload_index = {}
+    if offload_state_dict:
+        state_dict_folder = tempfile.mkdtemp()
+        state_dict_index = {}
+    unexpected_keys = set()
+    model_keys = set(model.state_dict().keys())
+    buffer_names = [name for name, _ in model.named_buffers()]
+    model_devices = {t.device for t in model.state_dict().values() if isinstance(t, torch.Tensor)}
+    model_physical_devices = model_devices - {torch.device("meta")}
+    for checkpoint_file in checkpoint_files:
+        if device_map is None:
+            # exception for multi-device loading was made for the meta device in torch v2.7.0
+            # https://github.com/pytorch/pytorch/blob/v2.6.0/torch/distributed/checkpoint/state_dict.py#L557-L563
+            # https://github.com/pytorch/pytorch/blob/v2.7.0-rc2/torch/distributed/checkpoint/state_dict.py#L575-L587
+            if is_torch_version(">=", "2.2.0") and (
+                (is_torch_version(">=", "2.7.0") and len(model_physical_devices) <= 1) or len(model_devices) <= 1
+            ):
+                from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict
+                broadcast_from_rank0 &= is_torch_version(">=", "2.4.0")
+                loaded_checkpoint = (
+                    load_state_dict(checkpoint_file, device_map=device_map)
+                    if not broadcast_from_rank0 or dist.get_rank() == 0
+                    else {}
+                )
+                set_model_state_dict(
+                    model,
+                    loaded_checkpoint,
+                    options=StateDictOptions(
+                        full_state_dict=full_state_dict,
+                        strict=strict,
+                        **({"broadcast_from_rank0": broadcast_from_rank0} if is_torch_version(">=", "2.4.0") else {}),
+                    ),
+                )
+            else:
+                loaded_checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
+                model.load_state_dict(loaded_checkpoint, strict=strict)
+            unexpected_keys.update(set(loaded_checkpoint.keys()) - model_keys)
+        else:
+            loaded_checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
+            for param_name, param in loaded_checkpoint.items():
+                # skip SCB parameter (for 8-bit serialization)
+                if "SCB" in param_name:
+                    continue
+                if param_name not in model_keys:
+                    unexpected_keys.add(param_name)
+                    if not strict:
+                        continue  # Skip loading this parameter.
+                module_name = param_name
+                while len(module_name) > 0 and module_name not in device_map:
+                    module_name = ".".join(module_name.split(".")[:-1])
+                if module_name == "" and "" not in device_map:
+                    # TODO: group all errors and raise at the end.
+                    raise ValueError(f"{param_name} doesn't have any device set.")
+                param_device = device_map[module_name]
+                new_dtype = dtype
+                if dtype is not None and torch.is_floating_point(param):
+                    if keep_in_fp32_modules is not None and dtype == torch.float16:
+                        proceed = False
+                        for key in keep_in_fp32_modules:
+                            if ((key in param_name) and (key + "." in param_name)) or key == param_name:
+                                proceed = True
+                                break
+                        if proceed:
+                            new_dtype = torch.float32
+                # Reset on every iteration so a stale value from a previous parameter is never reused.
+                fp16_statistics = None
+                if "weight" in param_name and param_name.replace("weight", "SCB") in loaded_checkpoint.keys():
+                    if param.dtype == torch.int8:
+                        fp16_statistics = loaded_checkpoint[param_name.replace("weight", "SCB")]
+                if param_device == "disk":
+                    if offload_buffers or param_name not in buffer_names:
+                        if new_dtype is None:
+                            new_dtype = param.dtype
+                        if offload_8bit_bnb:
+                            quantize_and_offload_8bit(
+                                model, param, param_name, new_dtype, offload_folder, offload_index, fp16_statistics
+                            )
+                            continue
+                        else:
+                            set_module_tensor_to_device(model, param_name, "meta", dtype=new_dtype)
+                        offload_weight(param, param_name, offload_folder, index=offload_index)
+                elif param_device == "cpu" and offload_state_dict:
+                    if new_dtype is None:
+                        new_dtype = param.dtype
+                    if offload_8bit_bnb:
+                        quantize_and_offload_8bit(
+                            model, param, param_name, new_dtype, state_dict_folder, state_dict_index, fp16_statistics
+                        )
+                    else:
+                        set_module_tensor_to_device(model, param_name, "meta", dtype=new_dtype)
+                        offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
+                else:
+                    set_module_tensor_to_device(
+                        model,
+                        param_name,
+                        param_device,
+                        value=param,
+                        dtype=new_dtype,
+                        fp16_statistics=fp16_statistics,
+                    )
+        # Force Python to clean up.
+        del loaded_checkpoint
+        gc.collect()
+    if not strict and len(unexpected_keys) > 0:
+        logger.warning(
+            f"Some weights of the model checkpoint at {checkpoint} were not used when"
+            f" initializing {model.__class__.__name__}: {unexpected_keys}. This may or may not be an issue - make sure that the checkpoint does not have unnecessary parameters, or that the model definition correctly corresponds to the checkpoint."
+        )
+    save_offload_index(offload_index, offload_folder)
+    # Load back offloaded state dict on CPU
+    if offload_state_dict:
+        load_offloaded_weights(model, state_dict_index, state_dict_folder)
+        shutil.rmtree(state_dict_folder)
+    retie_parameters(model, tied_params)
+def get_mixed_precision_context_manager(native_amp: bool = False, autocast_kwargs: AutocastKwargs = None):
+    """
+    Return a context manager for autocasting mixed precision
+    Args:
+        native_amp (`bool`, *optional*, defaults to False):
+            Whether mixed precision is actually enabled.
+        autocast_kwargs (`AutocastKwargs`, *optional*, defaults to None):
+            Extra keyword arguments forwarded to `torch.autocast`, such as `cache_enabled`.
+    """
+    state = AcceleratorState()
+    if autocast_kwargs is None:
+        autocast_kwargs = {}
+    else:
+        autocast_kwargs = autocast_kwargs.to_kwargs()
+    if native_amp:
+        device_type = (
+            "cuda"
+            if (state.distributed_type == DistributedType.XLA and is_torch_xla_available(check_is_gpu=True))
+            else state.device.type
+        )
+        if state.mixed_precision == "fp16":
+            return torch.autocast(device_type=device_type, dtype=torch.float16, **autocast_kwargs)
+        elif state.mixed_precision in ["bf16", "fp8"] and state.distributed_type in [
+            DistributedType.NO,
+            DistributedType.MULTI_CPU,
+            DistributedType.MULTI_GPU,
+            DistributedType.MULTI_MLU,
+            DistributedType.MULTI_SDAA,
+            DistributedType.MULTI_MUSA,
+            DistributedType.MULTI_NPU,
+            DistributedType.MULTI_XPU,
+            DistributedType.MULTI_HPU,
+            DistributedType.FSDP,
+            DistributedType.XLA,
+        ]:
+            return torch.autocast(device_type=device_type, dtype=torch.bfloat16, **autocast_kwargs)
+        else:
+            return torch.autocast(device_type=device_type, **autocast_kwargs)
+    else:
+        return contextlib.nullcontext()
+def get_grad_scaler(distributed_type: DistributedType = None, **kwargs):
+    """
+    A generic helper which will initialize the correct `GradScaler` implementation based on the environment and return
+    it.
+    Args:
+        distributed_type (`DistributedType`, *optional*, defaults to None):
+            The type of distributed environment.
+        kwargs:
+            Additional arguments for the utilized `GradScaler` constructor.
+    """
+    if distributed_type == DistributedType.FSDP:
+        from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
+        return ShardedGradScaler(**kwargs)
+    if is_torch_xla_available(check_is_gpu=True):
+        import torch_xla.amp as xamp
+        return xamp.GradScaler(**kwargs)
+    elif is_mlu_available():
+        return torch.mlu.amp.GradScaler(**kwargs)
+    elif is_sdaa_available():
+        return torch.sdaa.amp.GradScaler(**kwargs)
+    elif is_musa_available():
+        return torch.musa.amp.GradScaler(**kwargs)
+    elif is_npu_available():
+        return torch.npu.amp.GradScaler(**kwargs)
+    elif is_hpu_available():
+        return torch.amp.GradScaler("hpu", **kwargs)
+    elif is_xpu_available():
+        return torch.amp.GradScaler("xpu", **kwargs)
+    elif is_mps_available():
+        if not is_torch_version(">=", "2.8.0"):
+            raise ValueError("Grad Scaler with MPS device requires PyTorch >= 2.8.0")
+        return torch.amp.GradScaler("mps", **kwargs)
+    else:
+        if is_torch_version(">=", "2.3"):
+            return torch.amp.GradScaler("cuda", **kwargs)
+        else:
+            return torch.cuda.amp.GradScaler(**kwargs)
+def has_offloaded_params(module: torch.nn.Module) -> bool:
+    """
+    Checks if a module has offloaded parameters by checking if the given module has an `AlignDevicesHook` attached
+    with offloading enabled.
+    Args:
+        module (`torch.nn.Module`): The module to check for an offload hook.
+    Returns:
+        bool: `True` if the module has an offload hook and offloading is enabled, `False` otherwise.
+    """
+    from ..hooks import AlignDevicesHook  # avoid circular import
+    return hasattr(module, "_hf_hook") and isinstance(module._hf_hook, AlignDevicesHook) and module._hf_hook.offload
+@contextlib.contextmanager
+def align_module_device(module: torch.nn.Module, execution_device: Optional[torch.device] = None):
+    """
+    Context manager that moves a module's parameters to the specified execution device.
+    Args:
+        module (`torch.nn.Module`):
+            Module with parameters to align.
+        execution_device (`torch.device`, *optional*):
+            If provided, overrides the module's execution device within the context. Otherwise, the hook's execution
+            device is used when the module is offloaded, and nothing is moved when it is not.
+    """
+    if has_offloaded_params(module):
+        if execution_device is not None:
+            original_device = module._hf_hook.execution_device
+            module._hf_hook.execution_device = execution_device
+        try:
+            module._hf_hook.pre_forward(module)
+            yield
+        finally:
+            module._hf_hook.post_forward(module, None)
+            if execution_device is not None:
+                module._hf_hook.execution_device = original_device
+    elif execution_device is not None:
+        devices = {name: param.device for name, param in module.named_parameters(recurse=False)}
+        try:
+            for name in devices:
+                set_module_tensor_to_device(module, name, execution_device)
+            yield
+        finally:
+            for name, device in devices.items():
+                set_module_tensor_to_device(module, name, device)
+    else:
+        yield
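The device lookup inside `load_checkpoint_in_model` can be hard to follow in diff form. The following is a minimal standalone sketch of that one step, assuming a plain dictionary as the device map. The helper name `resolve_device` and the example module names are hypothetical and are not part of accelerate. It strips trailing name components until a prefix is found in the map, and raises if nothing matches.

```python
from typing import Dict, Union

Device = Union[int, str]


def resolve_device(param_name: str, device_map: Dict[str, Device]) -> Device:
    """Hypothetical helper: find the device of the closest parent module."""
    module_name = param_name
    # Drop the last dotted component until the name is a key of the map.
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    # An empty string only matches when the map has a catch-all "" entry.
    if module_name == "" and "" not in device_map:
        raise ValueError(f"{param_name} doesn't have any device set.")
    return device_map[module_name]


# Illustrative map: a parent entry plus a more specific child entry.
device_map = {"encoder": 0, "encoder.layer.1": "cpu", "lm_head": "disk"}

print(resolve_device("encoder.layer.0.weight", device_map))  # falls back to "encoder"
print(resolve_device("encoder.layer.1.bias", device_map))    # matches the child entry
print(resolve_device("lm_head.weight", device_map))
```

Because the walk starts from the full parameter name, the most specific entry wins. A map of `{"": "cpu"}` acts as a catch-all for every parameter.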

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/offload.py ADDED

	@@ -0,0 +1,213 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import json
+import os
+from collections.abc import Mapping
+from typing import Optional, Union
+import numpy as np
+import torch
+from safetensors import safe_open
+def offload_weight(weight, weight_name, offload_folder, index=None):
+    dtype = None
+    # Check the string instead of the dtype to be compatible with versions of PyTorch that don't have bfloat16.
+    if str(weight.dtype) == "torch.bfloat16":
+        # Need to reinterpret the underlying data as int16 since NumPy does not handle bfloat16s.
+        weight = weight.view(torch.int16)
+        dtype = "bfloat16"
+    array = weight.cpu().numpy()
+    tensor_file = os.path.join(offload_folder, f"{weight_name}.dat")
+    if index is not None:
+        if dtype is None:
+            dtype = str(array.dtype)
+        index[weight_name] = {"dtype": dtype, "shape": list(array.shape)}
+    if array.ndim == 0:
+        array = array[None]
+    file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
+    file_array[:] = array[:]
+    file_array.flush()
+    return index
+def load_offloaded_weight(weight_file, weight_info):
+    shape = tuple(weight_info["shape"])
+    if shape == ():
+        # NumPy memory-mapped arrays can't have 0 dims so it was saved as 1d tensor
+        shape = (1,)
+    dtype = weight_info["dtype"]
+    if dtype == "bfloat16":
+        # NumPy does not support bfloat16 so this was saved as an int16
+        dtype = "int16"
+    weight = np.memmap(weight_file, dtype=dtype, shape=shape, mode="r")
+    if len(weight_info["shape"]) == 0:
+        weight = weight[0]
+    weight = torch.tensor(weight)
+    if weight_info["dtype"] == "bfloat16":
+        weight = weight.view(torch.bfloat16)
+    return weight
+def save_offload_index(index, offload_folder):
+    if index is None or len(index) == 0:
+        # Nothing to save
+        return
+    offload_index_file = os.path.join(offload_folder, "index.json")
+    if os.path.isfile(offload_index_file):
+        with open(offload_index_file, encoding="utf-8") as f:
+            current_index = json.load(f)
+    else:
+        current_index = {}
+    current_index.update(index)
+    with open(offload_index_file, "w", encoding="utf-8") as f:
+        json.dump(current_index, f, indent=2)
+def offload_state_dict(save_dir: Union[str, os.PathLike], state_dict: dict[str, torch.Tensor]):
+    """
+    Offload a state dict in a given folder.
+    Args:
+        save_dir (`str` or `os.PathLike`):
+            The directory in which to offload the state dict.
+        state_dict (`Dict[str, torch.Tensor]`):
+            The dictionary of tensors to offload.
+    """
+    os.makedirs(save_dir, exist_ok=True)
+    index = {}
+    for name, parameter in state_dict.items():
+        index = offload_weight(parameter, name, save_dir, index=index)
+    # Update index
+    save_offload_index(index, save_dir)
+class PrefixedDataset(Mapping):
+    """
+    Will access keys in a given dataset by adding a prefix.
+    Args:
+        dataset (`Mapping`): Any map with string keys.
+        prefix (`str`): A prefix to add when trying to access any element in the underlying dataset.
+    """
+    def __init__(self, dataset: Mapping, prefix: str):
+        self.dataset = dataset
+        self.prefix = prefix
+    def __getitem__(self, key):
+        return self.dataset[f"{self.prefix}{key}"]
+    def __iter__(self):
+        return iter([key for key in self.dataset if key.startswith(self.prefix)])
+    def __len__(self):
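+        # Count only the keys that `__iter__` yields, so `len()` and iteration agree.
+        return sum(1 for key in self.dataset if key.startswith(self.prefix))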
+class OffloadedWeightsLoader(Mapping):
+    """
+    A collection that loads weights stored in a given state dict or memory-mapped on disk.
+    Args:
+        state_dict (`Dict[str, torch.Tensor]`, *optional*):
+            A dictionary mapping parameter names to tensors.
+        save_folder (`str` or `os.PathLike`, *optional*):
+            The directory in which the weights are stored (by `offload_state_dict` for instance).
+        index (`Dict`, *optional*):
+            A dictionary from weight name to their information (`dtype`/ `shape` or safetensors filename). Will default
+            to the index saved in `save_folder`.
+    """
+    def __init__(
+        self,
+        state_dict: Optional[dict[str, torch.Tensor]] = None,
+        save_folder: Optional[Union[str, os.PathLike]] = None,
+        index: Optional[Mapping] = None,
+        device=None,
+    ):
+        if state_dict is None and save_folder is None and index is None:
+            raise ValueError("Need either a `state_dict`, a `save_folder` or an `index` containing offloaded weights.")
+        self.state_dict = {} if state_dict is None else state_dict
+        self.save_folder = save_folder
+        if index is None and save_folder is not None:
+            with open(os.path.join(save_folder, "index.json")) as f:
+                index = json.load(f)
+        self.index = {} if index is None else index
+        self.all_keys = list(self.state_dict.keys())
+        self.all_keys.extend([key for key in self.index if key not in self.all_keys])
+        self.device = device
+    def __getitem__(self, key: str):
+        # State dict gets priority
+        if key in self.state_dict:
+            return self.state_dict[key]
+        weight_info = self.index[key]
+        if weight_info.get("safetensors_file") is not None:
+            device = "cpu" if self.device is None else self.device
+            tensor = None
+            try:
+                with safe_open(weight_info["safetensors_file"], framework="pt", device=device) as f:
+                    tensor = f.get_tensor(weight_info.get("weight_name", key))
+            except TypeError:
+                # if failed to get_tensor on the device, such as bf16 on mps, try to load it on CPU first
+                with safe_open(weight_info["safetensors_file"], framework="pt", device="cpu") as f:
+                    tensor = f.get_tensor(weight_info.get("weight_name", key))
+            if "dtype" in weight_info:
+                tensor = tensor.to(getattr(torch, weight_info["dtype"]))
+            if tensor.device != torch.device(device):
+                tensor = tensor.to(device)
+            return tensor
+        weight_file = os.path.join(self.save_folder, f"{key}.dat")
+        return load_offloaded_weight(weight_file, weight_info)
+    def __iter__(self):
+        return iter(self.all_keys)
+    def __len__(self):
+        return len(self.all_keys)
+def extract_submodules_state_dict(state_dict: dict[str, torch.Tensor], submodule_names: list[str]):
+    """
+    Extract the sub state-dict corresponding to a list of given submodules.
+    Args:
+        state_dict (`Dict[str, torch.Tensor]`): The state dict to extract from.
+        submodule_names (`List[str]`): The list of submodule names we want to extract.
+    """
+    result = {}
+    for module_name in submodule_names:
+        # We want to catch module_name parameter (module_name.xxx) or potentially module_name, but not any of the
+        # submodules that could begin like module_name (transformers.h.1 and transformers.h.10 for instance)
+        result.update(
+            {
+                key: param
+                for key, param in state_dict.items()
+                if key == module_name or key.startswith(module_name + ".")
+            }
+        )
+    return result
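The prefix test in `extract_submodules_state_dict` matters because sibling modules can share a leading string. Here is a small sketch of the same filter on a toy dictionary, with integers standing in for tensors. The function name `extract_submodules` and the keys are illustrative assumptions, not library code.

```python
def extract_submodules(state_dict, submodule_names):
    """Keep entries that are exactly a submodule or live underneath it."""
    result = {}
    for module_name in submodule_names:
        result.update(
            {
                key: value
                for key, value in state_dict.items()
                # The trailing "." stops "h.1" from also matching "h.10".
                if key == module_name or key.startswith(module_name + ".")
            }
        )
    return result


# Toy state dict; the integer values stand in for tensors.
state_dict = {"h.1.weight": 1, "h.1.bias": 2, "h.10.weight": 3, "ln_f": 4}

picked = extract_submodules(state_dict, ["h.1", "ln_f"])
print(sorted(picked))

# A bare startswith, without the trailing dot, would wrongly pull in "h.10.weight".
naive = [key for key in state_dict if key.startswith("h.1")]
print(naive)
```

The exact-match branch (`key == module_name`) covers entries such as `ln_f` that are stored directly under the submodule's own name.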

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py ADDED

	@@ -0,0 +1,867 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+A set of basic tensor ops compatible with tpu, gpu, and multigpu
+"""
+import pickle
+import warnings
+from collections.abc import Mapping
+from contextlib import contextmanager, nullcontext
+from functools import update_wrapper, wraps
+from typing import Any
+import torch
+from ..state import AcceleratorState, PartialState
+from .constants import TORCH_DISTRIBUTED_OPERATION_TYPES
+from .dataclasses import DistributedType, TensorInformation
+from .imports import (
+    is_npu_available,
+    is_torch_distributed_available,
+    is_torch_xla_available,
+)
+from .versions import is_torch_version
+if is_torch_xla_available():
+    import torch_xla.core.xla_model as xm
+if is_torch_distributed_available():
+    from torch.distributed import ReduceOp
+def is_torch_tensor(tensor):
+    return isinstance(tensor, torch.Tensor)
+def is_torch_xpu_tensor(tensor):
+    # `isinstance` takes exactly two arguments, so the candidate types must be passed as one tuple.
+    return isinstance(
+        tensor,
+        (
+            torch.xpu.FloatTensor,
+            torch.xpu.ByteTensor,
+            torch.xpu.IntTensor,
+            torch.xpu.LongTensor,
+            torch.xpu.HalfTensor,
+            torch.xpu.DoubleTensor,
+            torch.xpu.BFloat16Tensor,
+        ),
+    )
+def is_tensor_information(tensor_info):
+    return isinstance(tensor_info, TensorInformation)
+def is_namedtuple(data):
+    """
+    Checks if `data` is a `namedtuple` or not. Can have false positives, but only if a user is trying to mimic a
+    `namedtuple` perfectly.
+    """
+    return isinstance(data, tuple) and hasattr(data, "_asdict") and hasattr(data, "_fields")
+def honor_type(obj, generator):
+    """
+    Cast a generator to the same type as obj (list, tuple, or namedtuple)
+    """
+    # Some objects may not be able to instantiate from a generator directly
+    if is_namedtuple(obj):
+        return type(obj)(*list(generator))
+    else:
+        return type(obj)(generator)
+def recursively_apply(func, data, *args, test_type=is_torch_tensor, error_on_other_type=False, **kwargs):
+    """
+    Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
+    Args:
+        func (`callable`):
+            The function to recursively apply.
+        data (nested list/tuple/dictionary of objects accepted by `test_type`):
+            The data on which to apply `func`
+        *args:
+            Positional arguments that will be passed to `func` when applied on the unpacked data.
+        test_type (`callable`, *optional*, defaults to `is_torch_tensor`):
+            A predicate that returns `True` for the objects to which `func` should be applied.
+        error_on_other_type (`bool`, *optional*, defaults to `False`):
+            Whether to raise an error if, after unpacking `data`, we get an object that does not pass `test_type`.
+            If `False`, the function will leave objects that do not pass `test_type` unchanged.
+        **kwargs (additional keyword arguments, *optional*):
+            Keyword arguments that will be passed to `func` when applied on the unpacked data.
+    Returns:
+        The same data structure as `data` with `func` applied to every object that passes `test_type`.
+    """
+    if isinstance(data, (tuple, list)):
+        return honor_type(
+            data,
+            (
+                recursively_apply(
+                    func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
+                )
+                for o in data
+            ),
+        )
+    elif isinstance(data, Mapping):
+        return type(data)(
+            {
+                k: recursively_apply(
+                    func, v, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
+                )
+                for k, v in data.items()
+            }
+        )
+    elif test_type(data):
+        return func(data, *args, **kwargs)
+    elif error_on_other_type:
+        raise TypeError(
+            f"Unsupported types ({type(data)}) passed to `{func.__name__}`. Only nested list/tuple/dicts of "
+            f"objects that are valid for `{test_type.__name__}` should be passed."
+        )
+    return data
+def send_to_device(tensor, device, non_blocking=False, skip_keys=None):
+    """
+    Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to send to a given device.
+        device (`torch.device`):
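+            The device to send the data to.
+        non_blocking (`bool`, *optional*, defaults to `False`):
+            Passed to `.to()` as `non_blocking` when the object's `.to()` accepts that keyword.
+        skip_keys (`str` or list of `str`, *optional*):
+            Keys of a dictionary (at any nesting level) whose values are left on their current device.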
+    Returns:
+        The same data structure as `tensor` with all tensors sent to the proper device.
+    """
+    if is_torch_tensor(tensor) or hasattr(tensor, "to"):
+        # `torch.Tensor.to("npu")` could not find context when called for the first time (see this [issue](https://gitee.com/ascend/pytorch/issues/I8KECW?from=project-issue)).
+        if device == "npu":
+            device = "npu:0"
+        try:
+            return tensor.to(device, non_blocking=non_blocking)
+        except TypeError:  # .to() doesn't accept non_blocking as kwarg
+            return tensor.to(device)
+        except AssertionError as error:
+            # `torch.Tensor.to(<int num>)` is not supported by `torch_npu` (see this [issue](https://github.com/Ascend/pytorch/issues/16)).
+            # This call is inside the try-block since is_npu_available is not supported by torch.compile.
+            if is_npu_available():
+                if isinstance(device, int):
+                    device = f"npu:{device}"
+            else:
+                raise error
+        try:
+            return tensor.to(device, non_blocking=non_blocking)
+        except TypeError:  # .to() doesn't accept non_blocking as kwarg
+            return tensor.to(device)
+    elif isinstance(tensor, (tuple, list)):
+        return honor_type(
+            tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
+        )
+    elif isinstance(tensor, Mapping):
+        if isinstance(skip_keys, str):
+            skip_keys = [skip_keys]
+        elif skip_keys is None:
+            skip_keys = []
+        return type(tensor)(
+            {
+                k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
+                for k, t in tensor.items()
+            }
+        )
+    else:
+        return tensor
+def get_data_structure(data):
+    """
+    Recursively gathers the information needed to rebuild a nested list/tuple/dictionary of tensors.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to analyze.
+    Returns:
+        The same data structure as `data` with [`~utils.TensorInformation`] instead of tensors.
+    """
+    def _get_data_structure(tensor):
+        return TensorInformation(shape=tensor.shape, dtype=tensor.dtype)
+    return recursively_apply(_get_data_structure, data)
+def get_shape(data):
+    """
+    Recursively gathers the shape of a nested list/tuple/dictionary of tensors as a list.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to analyze.
+    Returns:
+        The same data structure as `data` with lists of tensor shapes instead of tensors.
+    """
+    def _get_shape(tensor):
+        return list(tensor.shape)
+    return recursively_apply(_get_shape, data)
+def initialize_tensors(data_structure):
+    """
+    Recursively initializes tensors from a nested list/tuple/dictionary of [`~utils.TensorInformation`].
+    Returns:
+        The same data structure as `data_structure` with uninitialized tensors (`torch.empty`) instead of
+        [`~utils.TensorInformation`].
+    """
+    def _initialize_tensor(tensor_info):
+        return torch.empty(*tensor_info.shape, dtype=tensor_info.dtype)
+    return recursively_apply(_initialize_tensor, data_structure, test_type=is_tensor_information)
+def find_batch_size(data):
+    """
+    Recursively finds the batch size in a nested list/tuple/dictionary of lists of tensors.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to find the batch size.
+    Returns:
+        `int`: The batch size.
+    """
+    if isinstance(data, (tuple, list, Mapping)) and (len(data) == 0):
+        raise ValueError(f"Cannot find the batch size from empty {type(data)}.")
+    if isinstance(data, (tuple, list)):
+        return find_batch_size(data[0])
+    elif isinstance(data, Mapping):
+        for k in data.keys():
+            return find_batch_size(data[k])
+    elif not isinstance(data, torch.Tensor):
+        raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
+    return data.shape[0]
+def ignorant_find_batch_size(data):
+    """
+    Same as [`utils.operations.find_batch_size`], except that any `ValueError` or `TypeError` raised is ignored.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to find the batch size.
+    Returns:
+        `int` or `None`: The batch size, or `None` if it could not be determined.
+    """
+    try:
+        return find_batch_size(data)
+    except (ValueError, TypeError):
+        pass
+    return None
+def listify(data):
+    """
+    Recursively finds tensors in a nested list/tuple/dictionary and converts them to a list of numbers.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data from which to convert to regular numbers.
+    Returns:
+        The same data structure as `data` with lists of numbers instead of `torch.Tensor`.
+    """
+    def _convert_to_list(tensor):
+        tensor = tensor.detach().cpu()
+        if tensor.dtype == torch.bfloat16:
+            # As of NumPy 1.21.4, NumPy does not support bfloat16 (see
+            # https://github.com/numpy/numpy/blob/a47ecdea856986cd60eabbd53265c2ca5916ad5d/doc/source/user/basics.types.rst ).
+            # Until NumPy adds bfloat16, we must convert to float32.
+            tensor = tensor.to(torch.float32)
+        return tensor.tolist()
+    return recursively_apply(_convert_to_list, data)
+def _tpu_gather(tensor):
+    def _tpu_gather_one(tensor):
+        if tensor.ndim == 0:
+            tensor = tensor.clone()[None]
+        # Can only gather contiguous tensors
+        if not tensor.is_contiguous():
+            tensor = tensor.contiguous()
+        return xm.all_gather(tensor)
+    res = recursively_apply(_tpu_gather_one, tensor, error_on_other_type=True)
+    xm.mark_step()
+    return res
+def _gpu_gather(tensor):
+    state = PartialState()
+    gather_op = torch.distributed.all_gather_into_tensor
+    # NOTE: we need to synchronize manually to work around an INT64 collectives bug in oneCCL before torch 2.9.0
+    if state.device.type == "xpu" and is_torch_version("<=", "2.8"):
+        torch.xpu.synchronize()
+    def _gpu_gather_one(tensor):
+        if tensor.ndim == 0:
+            tensor = tensor.clone()[None]
+        # Can only gather contiguous tensors
+        if not tensor.is_contiguous():
+            tensor = tensor.contiguous()
+        if state.backend is not None and state.backend != "gloo":
+            # We use `empty` as `all_gather_into_tensor` slightly
+            # differs from `all_gather` for better efficiency,
+            # and we rely on the number of items in the tensor
+            # rather than its direct shape
+            output_tensors = torch.empty(
+                state.num_processes * tensor.numel(),
+                dtype=tensor.dtype,
+                device=state.device,
+            )
+            gather_op(output_tensors, tensor)
+            return output_tensors.view(-1, *tensor.size()[1:])
+        else:
+            # a backend of `None` is always CPU
+            # also gloo does not support `all_gather_into_tensor`,
+            # which will result in a larger memory overhead for the op
+            output_tensors = [torch.empty_like(tensor) for _ in range(state.num_processes)]
+            torch.distributed.all_gather(output_tensors, tensor)
+            return torch.cat(output_tensors, dim=0)
+    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
+class DistributedOperationException(Exception):
+    """
+    An exception class for distributed operations. Raised if the operation cannot be performed due to the shape of the
+    tensors.
+    """
+    pass
+def verify_operation(function):
+    """
+    Verifies that `tensor` is the same shape across all processes. Only runs if `PartialState().debug` is `True`.
+    """
+    @wraps(function)
+    def wrapper(*args, **kwargs):
+        if PartialState().distributed_type == DistributedType.NO or not PartialState().debug:
+            return function(*args, **kwargs)
+        operation = f"{function.__module__}.{function.__name__}"
+        if "tensor" in kwargs:
+            tensor = kwargs["tensor"]
+        else:
+            tensor = args[0]
+        if PartialState().device.type != find_device(tensor).type:
+            raise DistributedOperationException(
+                f"One or more of the tensors passed to {operation} were on the {find_device(tensor).type} device while the `Accelerator` is configured for {PartialState().device.type}. "
+                f"Please move it to the {PartialState().device.type} before calling {operation}."
+            )
+        shapes = get_shape(tensor)
+        output = gather_object([shapes])
+        if output[0] is not None:
+            are_same = output.count(output[0]) == len(output)
+            if not are_same:
+                process_shape_str = "\n  - ".join([f"Process {i}: {shape}" for i, shape in enumerate(output)])
+                raise DistributedOperationException(
+                    f"Cannot apply desired operation due to shape mismatches. "
+                    "All shapes across devices must be valid."
+                    f"\n\nOperation: `{operation}`\nInput shapes:\n  - {process_shape_str}"
+                )
+        return function(*args, **kwargs)
+    return wrapper
+def chained_operation(function):
+    """
+    Checks that `verify_operation` failed and if so reports a more helpful error chaining the existing
+    `DistributedOperationException`.
+    """
+    @wraps(function)
+    def wrapper(*args, **kwargs):
+        try:
+            return function(*args, **kwargs)
+        except DistributedOperationException as e:
+            operation = f"{function.__module__}.{function.__name__}"
+            raise DistributedOperationException(
+                f"Error found while calling `{operation}`. Please see the earlier error for more details."
+            ) from e
+    return wrapper
+@verify_operation
+def gather(tensor):
+    """
+    Recursively gather tensor in a nested list/tuple/dictionary of tensors from all devices.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to gather.
+    Returns:
+        The same data structure as `tensor` with all tensors gathered from every process and concatenated along
+        the first dimension.
+    """
+    if PartialState().distributed_type == DistributedType.XLA:
+        return _tpu_gather(tensor)
+    elif PartialState().distributed_type in TORCH_DISTRIBUTED_OPERATION_TYPES:
+        return _gpu_gather(tensor)
+    else:
+        return tensor
+def _gpu_gather_object(object: Any):
+    output_objects = [None for _ in range(PartialState().num_processes)]
+    torch.distributed.all_gather_object(output_objects, object)
+    # all_gather_object returns a list of lists, so we need to flatten it
+    return [x for y in output_objects for x in y]
+def gather_object(object: Any):
+    """
+    Recursively gather object in a nested list/tuple/dictionary of objects from all devices.
+    Args:
+        object (nested list/tuple/dictionary of picklable object):
+            The data to gather.
+    Returns:
+        The same data structure as `object` with all the objects sent to every device.
+    """
+    if PartialState().distributed_type == DistributedType.XLA:
+        raise NotImplementedError("gather objects in TPU is not supported")
+    elif PartialState().distributed_type in TORCH_DISTRIBUTED_OPERATION_TYPES:
+        return _gpu_gather_object(object)
+    else:
+        return object
+def _gpu_broadcast(data, src=0):
+    def _gpu_broadcast_one(tensor, src=0):
+        torch.distributed.broadcast(tensor, src=src)
+        return tensor
+    return recursively_apply(_gpu_broadcast_one, data, error_on_other_type=True, src=src)
+def _tpu_broadcast(tensor, src=0, name="broadcast tensor"):
+    if isinstance(tensor, (list, tuple)):
+        return honor_type(tensor, (_tpu_broadcast(t, src=src, name=f"{name}_{i}") for i, t in enumerate(tensor)))
+    elif isinstance(tensor, Mapping):
+        return type(tensor)({k: _tpu_broadcast(v, src=src, name=f"{name}_{k}") for k, v in tensor.items()})
+    return xm.mesh_reduce(name, tensor, lambda x: x[src])
+TENSOR_TYPE_TO_INT = {
+    torch.float: 1,
+    torch.double: 2,
+    torch.half: 3,
+    torch.bfloat16: 4,
+    torch.uint8: 5,
+    torch.int8: 6,
+    torch.int16: 7,
+    torch.int32: 8,
+    torch.int64: 9,
+    torch.bool: 10,
+}
+TENSOR_INT_TO_DTYPE = {v: k for k, v in TENSOR_TYPE_TO_INT.items()}
+def gather_tensor_shape(tensor):
+    """
+    Grabs the shape of `tensor` only available on one process and returns a tensor of its shape
+    """
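+    # Allocate 2**20 int32 slots (4 MiB) to store the shape followed by the encoded dtype.
+    # The buffer is zero-initialized because unused slots are filtered out with `nonzero()` below.
+    max_tensor_dimension = 2**20
+    state = PartialState()
+    base_tensor = torch.zeros(max_tensor_dimension, dtype=torch.int, device=state.device)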
+    # Since PyTorch can't just send a tensor to another GPU without
+    # knowing its size, we store the size of the tensor with data
+    # in an allocation
+    if tensor is not None:
+        shape = tensor.shape
+        tensor_dtype = TENSOR_TYPE_TO_INT[tensor.dtype]
+        base_tensor[: len(shape) + 1] = torch.tensor(list(shape) + [tensor_dtype], dtype=int)
+    # Perform a reduction to copy the size data onto all GPUs
+    base_tensor = reduce(base_tensor, reduction="sum")
+    base_tensor = base_tensor[base_tensor.nonzero()]
+    # The last non-zero entry holds the encoded dtype of the source tensor
+    dtype = int(base_tensor[-1:][0])
+    base_tensor = base_tensor[:-1]
+    return base_tensor, dtype
+def copy_tensor_to_devices(tensor=None) -> torch.Tensor:
+    """
+    Copies a tensor that only exists on a single device and broadcasts it to other devices. Differs from `broadcast` as
+    each worker doesn't need to know its shape when used (and tensor can be `None`)
+    Args:
+        tensor (`torch.Tensor`):
+            The tensor that should be sent to all devices. It must be defined on a single process only; on every
+            other process it should be `None`.
+    """
+    state = PartialState()
+    shape, dtype = gather_tensor_shape(tensor)
+    if tensor is None:
+        tensor = torch.zeros(shape, dtype=TENSOR_INT_TO_DTYPE[dtype]).to(state.device)
+    return reduce(tensor, reduction="sum")
+@verify_operation
+def broadcast(tensor, from_process: int = 0):
+    """
+    Recursively broadcast tensor in a nested list/tuple/dictionary of tensors to all devices.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to broadcast.
+        from_process (`int`, *optional*, defaults to 0):
+            The process from which to send the data
+    Returns:
+        The same data structure as `tensor` with all tensors broadcasted to the proper device.
+    """
+    if PartialState().distributed_type == DistributedType.XLA:
+        return _tpu_broadcast(tensor, src=from_process, name="accelerate.utils.broadcast")
+    elif PartialState().distributed_type in TORCH_DISTRIBUTED_OPERATION_TYPES:
+        return _gpu_broadcast(tensor, src=from_process)
+    else:
+        return tensor
+def broadcast_object_list(object_list, from_process: int = 0):
+    """
+    Broadcast a list of picklable objects from one process to the others.
+    Args:
+        object_list (list of picklable objects):
+            The list of objects to broadcast. This list will be modified inplace.
+        from_process (`int`, *optional*, defaults to 0):
+            The process from which to send the data.
+    Returns:
+        The same list containing the objects from process `from_process`.
+    """
+    if PartialState().distributed_type == DistributedType.XLA:
+        for i, obj in enumerate(object_list):
+            object_list[i] = xm.mesh_reduce("accelerate.utils.broadcast_object_list", obj, lambda x: x[from_process])
+    elif PartialState().distributed_type in TORCH_DISTRIBUTED_OPERATION_TYPES:
+        torch.distributed.broadcast_object_list(object_list, src=from_process)
+    return object_list
+def slice_tensors(data, tensor_slice, process_index=None, num_processes=None):
+    """
+    Recursively takes a slice in a nested list/tuple/dictionary of tensors.
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to slice.
+        tensor_slice (`slice`):
+            The slice to take.
+    Returns:
+        The same data structure as `data` with all the tensors sliced.
+    """
+    def _slice_tensor(tensor, tensor_slice):
+        return tensor[tensor_slice]
+    return recursively_apply(_slice_tensor, data, tensor_slice)
+def concatenate(data, dim=0):
+    """
+    Recursively concatenate the tensors in a nested list/tuple/dictionary of lists of tensors with the same shape.
+    Args:
+        data (nested list/tuple/dictionary of lists of tensors `torch.Tensor`):
+            The data to concatenate.
+        dim (`int`, *optional*, defaults to 0):
+            The dimension on which to concatenate.
+    Returns:
+        The same data structure as `data` with all the tensors concatenated.
+    """
+    if isinstance(data[0], (tuple, list)):
+        return honor_type(data[0], (concatenate([d[i] for d in data], dim=dim) for i in range(len(data[0]))))
+    elif isinstance(data[0], Mapping):
+        return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
+    elif not isinstance(data[0], torch.Tensor):
+        raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")
+    return torch.cat(data, dim=dim)
+class CannotPadNestedTensorWarning(UserWarning):
+    pass
+@chained_operation
+def pad_across_processes(tensor, dim=0, pad_index=0, pad_first=False):
+    """
+    Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so they
+    can safely be gathered.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to pad.
+        dim (`int`, *optional*, defaults to 0):
+            The dimension on which to pad.
+        pad_index (`int`, *optional*, defaults to 0):
+            The value with which to pad.
+        pad_first (`bool`, *optional*, defaults to `False`):
+            Whether to pad at the beginning or the end.
+    """
+    def _pad_across_processes(tensor, dim=0, pad_index=0, pad_first=False):
+        if getattr(tensor, "is_nested", False):
+            warnings.warn(
+                "Cannot pad nested tensors without more information. Leaving unprocessed.",
+                CannotPadNestedTensorWarning,
+            )
+            return tensor
+        if dim >= len(tensor.shape) or dim < -len(tensor.shape):
+            return tensor
+        # Convert negative dimensions to non-negative
+        if dim < 0:
+            dim += len(tensor.shape)
+        # Gather all sizes
+        size = torch.tensor(tensor.shape, device=tensor.device)[None]
+        sizes = gather(size).cpu()
+        # Then pad to the maximum size
+        max_size = max(s[dim] for s in sizes)
+        if max_size == tensor.shape[dim]:
+            return tensor
+        old_size = tensor.shape
+        new_size = list(old_size)
+        new_size[dim] = max_size
+        new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
+        if pad_first:
+            indices = tuple(
+                slice(max_size - old_size[dim], max_size) if i == dim else slice(None) for i in range(len(new_size))
+            )
+        else:
+            indices = tuple(slice(0, old_size[dim]) if i == dim else slice(None) for i in range(len(new_size)))
+        new_tensor[indices] = tensor
+        return new_tensor
+    return recursively_apply(
+        _pad_across_processes, tensor, error_on_other_type=True, dim=dim, pad_index=pad_index, pad_first=pad_first
+    )
+def pad_input_tensors(tensor, batch_size, num_processes, dim=0):
+    """
+    Takes a `tensor` of arbitrary size and pads it along the batch dimension so that it can be split across
+    `num_processes` processes. The added entries are filled with zeros.
+    E.g.:
+      Tensor: ([3,4,4]) Num processes: 4 Expected result shape: ([4,4,4])
+    """
+    def _pad_input_tensors(tensor, batch_size, num_processes, dim=0):
+        remainder = batch_size // num_processes
+        last_inputs = batch_size - (remainder * num_processes)
+        if batch_size // num_processes == 0:
+            to_pad = num_processes - batch_size
+        else:
+            to_pad = num_processes - (batch_size // num_processes)
+        # In the rare case that `to_pad` is zero or negative,
+        # we need to pad the last inputs - the found `to_pad`
+        if last_inputs > to_pad and to_pad < 1:
+            to_pad = last_inputs - to_pad
+        old_size = tensor.shape
+        new_size = list(old_size)
+        new_size[0] = batch_size + to_pad
+        new_tensor = tensor.new_zeros(tuple(new_size))
+        indices = tuple(slice(0, old_size[dim]) if i == dim else slice(None) for i in range(len(new_size)))
+        new_tensor[indices] = tensor
+        return new_tensor
+    return recursively_apply(
+        _pad_input_tensors,
+        tensor,
+        error_on_other_type=True,
+        batch_size=batch_size,
+        num_processes=num_processes,
+        dim=dim,
+    )
+@verify_operation
+def reduce(tensor, reduction="mean", scale=1.0):
+    """
+    Recursively reduce the tensors in a nested list/tuple/dictionary of lists of tensors across all processes using
+    the given `reduction`.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to reduce.
+        reduction (`str`, *optional*, defaults to `"mean"`):
+            A reduction method. Can be of "mean", "sum", or "none"
+        scale (`float`, *optional*):
+            A default scaling value to be applied after the reduce, only valid on XLA.
+    Returns:
+        The same data structure as `tensor` with all the tensors reduced.
+    """
+    def _reduce_across_processes(tensor, reduction="mean", scale=1.0):
+        state = PartialState()
+        cloned_tensor = tensor.clone()
+        if state.distributed_type == DistributedType.NO:
+            return cloned_tensor
+        if state.distributed_type == DistributedType.XLA:
+            # Some processes may have different HLO graphs than other
+            # processes, for example in the breakpoint API
+            # accelerator.set_trigger(). Use mark_step to make HLOs
+            # the same on all processes.
+            xm.mark_step()
+            xm.all_reduce(xm.REDUCE_SUM, [cloned_tensor], scale)
+            xm.mark_step()
+        elif state.distributed_type.value in TORCH_DISTRIBUTED_OPERATION_TYPES:
+            torch.distributed.all_reduce(cloned_tensor, ReduceOp.SUM)
+        if reduction == "mean":
+            cloned_tensor /= state.num_processes
+        return cloned_tensor
+    return recursively_apply(
+        _reduce_across_processes, tensor, error_on_other_type=True, reduction=reduction, scale=scale
+    )
+def convert_to_fp32(tensor):
+    """
+    Recursively converts the elements of a nested list/tuple/dictionary of tensors in FP16/BF16 precision to FP32.
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to convert from FP16/BF16 to FP32.
+    Returns:
+        The same data structure as `tensor` with all tensors that were in FP16/BF16 precision converted to FP32.
+    """
+    def _convert_to_fp32(tensor):
+        return tensor.float()
+    def _is_fp16_bf16_tensor(tensor):
+        return (is_torch_tensor(tensor) or hasattr(tensor, "dtype")) and tensor.dtype in (
+            torch.float16,
+            torch.bfloat16,
+        )
+    return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
+class ConvertOutputsToFp32:
+    """
+    Decorator to apply to a function outputting tensors (like a model forward pass) that ensures the outputs in
+    FP16/BF16 precision will be converted back to FP32.
+    Args:
+        model_forward (`Callable`):
+            The function whose outputs we want to convert.
+    Returns:
+        The same function as `model_forward` but with converted outputs.
+    """
+    def __init__(self, model_forward):
+        self.model_forward = model_forward
+        update_wrapper(self, model_forward)
+    def __call__(self, *args, **kwargs):
+        return convert_to_fp32(self.model_forward(*args, **kwargs))
+    def __getstate__(self):
+        raise pickle.PicklingError(
+            "Cannot pickle a prepared model with automatic mixed precision, please unwrap the model with `Accelerator.unwrap_model(model)` before pickling it."
+        )
+def convert_outputs_to_fp32(model_forward):
+    model_forward = ConvertOutputsToFp32(model_forward)
+    def forward(*args, **kwargs):
+        return model_forward(*args, **kwargs)
+    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
+    forward.__wrapped__ = model_forward
+    return forward
+def find_device(data):
+    """
+    Finds the device on which a nested dict/list/tuple of tensors lies (assuming they are all on the same device).
+    Args:
+        data (nested list/tuple/dictionary of `torch.Tensor`): The data we want to know the device of.
+    """
+    if isinstance(data, Mapping):
+        for obj in data.values():
+            device = find_device(obj)
+            if device is not None:
+                return device
+    elif isinstance(data, (tuple, list)):
+        for obj in data:
+            device = find_device(obj)
+            if device is not None:
+                return device
+    elif isinstance(data, torch.Tensor):
+        return data.device
+@contextmanager
+def GatheredParameters(params, modifier_rank=None, fwd_module=None, enabled=True):
+    """
+    Wrapper around `deepspeed.runtime.zero.GatheredParameters`, but if ZeRO-3 is not enabled, will be a no-op context
+    manager.
+    """
+    # We need to use the `AcceleratorState` here since it has access to the deepspeed plugin
+    if AcceleratorState().distributed_type != DistributedType.DEEPSPEED or (
+        AcceleratorState().deepspeed_plugin is not None
+        and not AcceleratorState().deepspeed_plugin.is_zero3_init_enabled()
+    ):
+        gather_param_context = nullcontext()
+    else:
+        import deepspeed
+        gather_param_context = deepspeed.zero.GatheredParameters(
+            params, modifier_rank=modifier_rank, fwd_module=fwd_module, enabled=enabled
+        )
+    with gather_param_context:
+        yield
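
The helpers in the file above (`recursively_apply`, `send_to_device`, `_tpu_broadcast`, `find_device`, ...) all walk a nested list/tuple/dictionary and act on its leaves. The sketch below illustrates that traversal pattern in plain Python so it runs without `torch`. It is not the library's implementation: `apply_to_leaves`, `is_leaf` and `strict` are hypothetical names, integers stand in for tensors, and the namedtuple handling is only an assumption about what a helper such as `honor_type` needs to do.

```python
from collections import OrderedDict
from collections.abc import Mapping


def apply_to_leaves(func, data, is_leaf, strict=False):
    """Apply `func` to every leaf of a nested list/tuple/dict for which `is_leaf` is true."""
    if isinstance(data, (tuple, list)):
        items = [apply_to_leaves(func, item, is_leaf, strict) for item in data]
        # Assumption: namedtuples take positional fields, plain lists/tuples take one iterable.
        if isinstance(data, tuple) and hasattr(data, "_fields"):
            return type(data)(*items)
        return type(data)(items)
    if isinstance(data, Mapping):
        # Rebuild the mapping with its original type (dict, OrderedDict, ...).
        return type(data)({k: apply_to_leaves(func, v, is_leaf, strict) for k, v in data.items()})
    if is_leaf(data):
        return func(data)
    if strict:
        # Mirrors the `error_on_other_type=True` behaviour: unknown leaves are an error.
        raise TypeError(f"Unsupported type {type(data)}")
    # Otherwise unknown leaves pass through unchanged.
    return data


# Integers play the role of tensors; the string leaf is left untouched.
batch = {"ids": [1, 2, 3], "meta": ("a", 4), "extra": OrderedDict(x=5)}
doubled = apply_to_leaves(lambda n: n * 2, batch, is_leaf=lambda o: isinstance(o, int))
print(doubled)
```

Passing the leaf test as a predicate rather than a type is what lets one traversal serve several purposes, for example selecting only FP16/BF16 tensors in `convert_to_fp32` or only `TensorInformation` objects in `initialize_tensors`.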

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/other.py ADDED

	@@ -0,0 +1,561 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import collections
+import platform
+import re
+import socket
+from codecs import encode
+from collections import OrderedDict
+from functools import partial, reduce
+from types import MethodType
+from typing import Optional
+import numpy as np
+import torch
+from packaging.version import Version
+from safetensors.torch import save_file as safe_save_file
+from ..commands.config.default import write_basic_config  # noqa: F401
+from ..logging import get_logger
+from ..state import PartialState
+from .constants import FSDP_PYTORCH_VERSION
+from .dataclasses import DistributedType
+from .imports import (
+    is_deepspeed_available,
+    is_numpy_available,
+    is_torch_distributed_available,
+    is_torch_xla_available,
+    is_weights_only_available,
+)
+from .modeling import id_tensor_storage
+from .transformer_engine import convert_model
+from .versions import is_torch_version
+logger = get_logger(__name__)
+if is_torch_xla_available():
+    import torch_xla.core.xla_model as xm
+def is_compiled_module(module: torch.nn.Module) -> bool:
+    """
+    Check whether the module was compiled with torch.compile()
+    """
+    if not hasattr(torch, "_dynamo"):
+        return False
+    return isinstance(module, torch._dynamo.eval_frame.OptimizedModule)
+def has_compiled_regions(module: torch.nn.Module) -> bool:
+    """
+    Check whether the module has submodules that were compiled with `torch.compile()`.
+    """
+    if not hasattr(torch, "_dynamo"):
+        return False
+    if module._modules:
+        for submodule in module.modules():
+            if isinstance(submodule, torch._dynamo.eval_frame.OptimizedModule):
+                return True
+    return False
+def is_repeated_blocks(module: torch.nn.Module) -> bool:
+    """
+    Check whether the module is a repeated block, i.e. `torch.nn.ModuleList` with all children of the same class. This
+    is useful to determine whether we should apply regional compilation to the module.
+    """
+    return isinstance(module, torch.nn.ModuleList) and all(isinstance(m, module[0].__class__) for m in module)
+def has_repeated_blocks(module: torch.nn.Module) -> bool:
+    """
+    Check whether the module has repeated blocks, i.e. `torch.nn.ModuleList` with all children of the same class, at
+    any level of the module hierarchy. This is useful to determine whether we should apply regional compilation to the
+    module.
+    """
+    if module._modules:
+        for submodule in module.modules():
+            if is_repeated_blocks(submodule):
+                return True
+    return False
+def compile_regions(module: torch.nn.Module, **compile_kwargs) -> torch.nn.Module:
+    """
+    Performs regional compilation where we target repeated blocks of the same class and compile them sequentially to
+    hit the compiler's cache. For example, in `GPT2LMHeadModel`, the repeated block/class is `GPT2Block`, and can be
+    accessed as `model.transformer.h[0]`. The rest of the model (e.g. model.lm_head) is compiled separately.
+    This allows us to reduce the compilation overhead / cold start time of models like LLMs and Transformers in general.
+    See https://pytorch.org/tutorials/recipes/regional_compilation.html for more details.
+    Args:
+        module (`torch.nn.Module`):
+            The model to compile.
+        **compile_kwargs:
+            Additional keyword arguments to pass to `torch.compile()`.
+    Returns:
+        `torch.nn.Module`: A new instance of the model with some compiled regions.
+    Example:
+    ```python
+    >>> from accelerate.utils import compile_regions
+    >>> from transformers import AutoModelForCausalLM
+    >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
+    >>> compiled_model = compile_regions(model, mode="reduce-overhead")
+    >>> compiled_model.transformer.h[0]
+    OptimizedModule(
+        (_orig_mod): GPT2Block(
+            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+            (attn): GPT2Attention(
+                (c_attn): Conv1D(nf=2304, nx=768)
+                (c_proj): Conv1D(nf=768, nx=768)
+                (attn_dropout): Dropout(p=0.1, inplace=False)
+                (resid_dropout): Dropout(p=0.1, inplace=False)
+            )
+            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+            (mlp): GPT2MLP(
+                (c_fc): Conv1D(nf=3072, nx=768)
+                (c_proj): Conv1D(nf=768, nx=3072)
+                (act): NewGELUActivation()
+                (dropout): Dropout(p=0.1, inplace=False)
+            )
+        )
+    )
+    ```
+    """
+    def _compile_regions(module: torch.nn.Module, **compile_kwargs) -> torch.nn.Module:
+        if is_repeated_blocks(module):
+            new_module = torch.nn.ModuleList()
+            for submodule in module:
+                new_module.append(torch.compile(submodule, **compile_kwargs))
+        elif has_repeated_blocks(module):
+            new_module = module.__class__.__new__(module.__class__)
+            new_module.__dict__.update(module.__dict__)
+            new_module._modules = {}
+            for name, submodule in module.named_children():
+                new_module.add_module(name, _compile_regions(submodule, **compile_kwargs))
+        else:
+            new_module = torch.compile(module, **compile_kwargs)
+        return new_module
+    new_module = _compile_regions(module, **compile_kwargs)
+    if "_orig_mod" not in new_module.__dict__:
+        # Keeps a reference to the original module to decompile/unwrap it later
+        new_module.__dict__["_orig_mod"] = module
+    return new_module
+def compile_regions_deepspeed(module: torch.nn.Module, **compile_kwargs):
+    """
+    Performs regional compilation the same way as `compile_regions`, but specifically for `DeepSpeedEngine.module`.
+    Since the model is wrapped in a `DeepSpeedEngine` and has many added hooks, offloaded parameters, etc. that
+    `torch.compile(...)` interferes with, this version of regional compilation uses the in-place `module.compile()`
+    method instead.
+    Args:
+        module (`torch.nn.Module`):
+            The model to compile.
+        **compile_kwargs:
+            Additional keyword arguments to pass to `module.compile()`.
+    """
+    if is_repeated_blocks(module):
+        for submodule in module:
+            submodule.compile(**compile_kwargs)
+    elif has_repeated_blocks(module):
+        for child in module.children():
+            compile_regions_deepspeed(child, **compile_kwargs)
+    else:  # leaf node
+        module.compile(**compile_kwargs)
+def model_has_dtensor(model: torch.nn.Module) -> bool:
+    """
+    Check if the model has DTensor parameters.
+    Args:
+        model (`torch.nn.Module`):
+            The model to check.
+    Returns:
+        `bool`: Whether the model has DTensor parameters.
+    """
+    if is_torch_version(">=", "2.5.0"):
+        from torch.distributed.tensor import DTensor
+    else:
+        # from torch 2.0.0 (oldest supported accelerate torch version), DTensor is in torch.distributed._tensor
+        from torch.distributed._tensor import DTensor
+    return any(isinstance(p, DTensor) for p in model.parameters())
+def extract_model_from_parallel(
+    model, keep_fp32_wrapper: bool = True, keep_torch_compile: bool = True, recursive: bool = False
+):
+    """
+    Extract a model from its distributed containers.
+    Args:
+        model (`torch.nn.Module`):
+            The model to extract.
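+        keep_fp32_wrapper (`bool`, *optional*, defaults to `True`):
+            Whether to keep the mixed precision hooks on the model. If `False`, they are removed.
+        keep_torch_compile (`bool`, *optional*, defaults to `True`):
+            Whether to keep the `torch.compile` wrapper. If `False`, the compiled model is unwrapped.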
+        recursive (`bool`, *optional*, defaults to `False`):
+            Whether to recursively extract all cases of `module.module` from `model` as well as unwrap child sublayers
+            recursively, not just the top-level distributed containers.
+    Returns:
+        `torch.nn.Module`: The extracted model.
+    """
+    options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
+    is_compiled = is_compiled_module(model)
+    has_compiled = has_compiled_regions(model)
+    if is_compiled:
+        compiled_model = model
+        model = model._orig_mod
+    elif has_compiled:
+        compiled_model = model
+        model = model.__dict__["_orig_mod"]
+    if is_deepspeed_available():
+        from deepspeed import DeepSpeedEngine
+        options += (DeepSpeedEngine,)
+    if is_torch_version(">=", FSDP_PYTORCH_VERSION) and is_torch_distributed_available():
+        from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
+        options += (FSDP,)
+    while isinstance(model, options):
+        model = model.module
+    if recursive:
+        # This is needed in cases such as using FSDPv2 on XLA
+        def _recursive_unwrap(module):
+            # Wrapped modules are standardly wrapped as `module`, similar to the cases earlier
+            # with DDP, DataParallel, DeepSpeed, and FSDP
+            if hasattr(module, "module"):
+                unwrapped_module = _recursive_unwrap(module.module)
+            else:
+                unwrapped_module = module
+            # Next unwrap child sublayers recursively
+            for name, child in unwrapped_module.named_children():
+                setattr(unwrapped_module, name, _recursive_unwrap(child))
+            return unwrapped_module
+        # Start with top-level
+        model = _recursive_unwrap(model)
+    if not keep_fp32_wrapper:
+        forward = model.forward
+        original_forward = model.__dict__.pop("_original_forward", None)
+        if original_forward is not None:
+            while hasattr(forward, "__wrapped__"):
+                forward = forward.__wrapped__
+                if forward == original_forward:
+                    break
+            model.forward = MethodType(forward, model)
+        if getattr(model, "_converted_to_transformer_engine", False):
+            convert_model(model, to_transformer_engine=False)
+    if keep_torch_compile:
+        if is_compiled:
+            compiled_model._orig_mod = model
+            model = compiled_model
+        elif has_compiled:
+            compiled_model.__dict__["_orig_mod"] = model
+            model = compiled_model
+    return model
+def wait_for_everyone():
+    """
+    Introduces a blocking point in the script, making sure all processes have reached this point before continuing.
+    <Tip warning={true}>
+    Make sure all processes will reach this instruction otherwise one of your processes will hang forever.
+    </Tip>
+    """
+    PartialState().wait_for_everyone()
+def clean_state_dict_for_safetensors(state_dict: dict):
+    """
+    Cleans the state dictionary from a model and removes tensor aliasing if present.
+    Args:
+        state_dict (`dict`):
+            The state dictionary from a model
+    """
+    ptrs = collections.defaultdict(list)
+    # When bnb serialization is used, weights in state dict can be strings
+    for name, tensor in state_dict.items():
+        if not isinstance(tensor, str):
+            ptrs[id_tensor_storage(tensor)].append(name)
+    # These are all pointers of tensors with shared memory
+    shared_ptrs = {ptr: names for ptr, names in ptrs.items() if len(names) > 1}
+    warn_names = set()
+    for names in shared_ptrs.values():
+        # When not all duplicates have been cleaned, we still remove those keys but put a clear warning.
+        # If the link between tensors was done at runtime then `from_pretrained` will not get
+        # the key back leading to random tensor. A proper warning will be shown
+        # during reload (if applicable), but since the file is not necessarily compatible with
+        # the config, better show a proper warning.
+        found_names = [name for name in names if name in state_dict]
+        warn_names.update(found_names[1:])
+        for name in found_names[1:]:
+            del state_dict[name]
+    if len(warn_names) > 0:
+        logger.warning(
+            f"Removed shared tensor {warn_names} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading",
+        )
+    state_dict = {k: v.contiguous() if isinstance(v, torch.Tensor) else v for k, v in state_dict.items()}
+    return state_dict
+def save(obj, f, save_on_each_node: bool = False, safe_serialization: bool = False):
+    """
+    Save the data to disk. Use in place of `torch.save()`.
+    Args:
+        obj:
+            The data to save
+        f:
+            The file (or file-like object) to use to save the data
+        save_on_each_node (`bool`, *optional*, defaults to `False`):
+            Whether to save on the local main process of each node instead of only on the global main process.
+        safe_serialization (`bool`, *optional*, defaults to `False`):
+            Whether to save `obj` using `safetensors` or the traditional PyTorch way (that uses `pickle`).
+    """
+    # When TorchXLA is enabled, it's necessary to transfer all data to the CPU before saving.
+    # Another issue arises with `id_tensor_storage`, which treats all XLA tensors as identical.
+    # If tensors remain on XLA, calling `clean_state_dict_for_safetensors` will result in only
+    # one XLA tensor remaining.
+    if PartialState().distributed_type == DistributedType.XLA:
+        obj = xm._maybe_convert_to_cpu(obj)
+    # Check if it's a model and remove duplicates
+    if safe_serialization:
+        save_func = partial(safe_save_file, metadata={"format": "pt"})
+        if isinstance(obj, OrderedDict):
+            obj = clean_state_dict_for_safetensors(obj)
+    else:
+        save_func = torch.save
+    if PartialState().is_main_process and not save_on_each_node:
+        save_func(obj, f)
+    elif PartialState().is_local_main_process and save_on_each_node:
+        save_func(obj, f)
+# The following are considered "safe" globals to reconstruct various types of objects when using `weights_only=True`
+# These should be added and then removed after loading in the file
+np_core = np._core if is_numpy_available("2.0.0") else np.core
+TORCH_SAFE_GLOBALS = [
+    # numpy arrays are just numbers, not objects, so we can reconstruct them safely
+    np_core.multiarray._reconstruct,
+    np.ndarray,
+    # The following are needed for the RNG states
+    encode,
+    np.dtype,
+]
+if is_numpy_available("1.25.0"):
+    TORCH_SAFE_GLOBALS.append(np.dtypes.UInt32DType)
+def load(f, map_location=None, **kwargs):
+    """
+    Compatible drop-in replacement of `torch.load()` which allows for `weights_only` to be used if `torch` version is
+    2.4.0 or higher. Otherwise the kwarg is ignored.
+    Also temporarily registers (and then removes) the safe globals needed to reconstruct numpy arrays and RNG states.
+    Args:
+        f:
+            The file (or file-like object) to use to load the data
+        map_location:
+            a function, `torch.device`, string or a dict specifying how to remap storage locations
+        **kwargs:
+            Additional keyword arguments to pass to `torch.load()`.
+    """
+    try:
+        if is_weights_only_available():
+            old_safe_globals = torch.serialization.get_safe_globals()
+            if "weights_only" not in kwargs:
+                kwargs["weights_only"] = True
+            torch.serialization.add_safe_globals(TORCH_SAFE_GLOBALS)
+        else:
+            kwargs.pop("weights_only", None)
+        loaded_obj = torch.load(f, map_location=map_location, **kwargs)
+    finally:
+        if is_weights_only_available():
+            torch.serialization.clear_safe_globals()
+            if old_safe_globals:
+                torch.serialization.add_safe_globals(old_safe_globals)
+    return loaded_obj
+def get_pretty_name(obj):
+    """
+    Gets a pretty name from `obj`.
+    """
+    if not hasattr(obj, "__qualname__") and not hasattr(obj, "__name__"):
+        obj = getattr(obj, "__class__", obj)
+    if hasattr(obj, "__qualname__"):
+        return obj.__qualname__
+    if hasattr(obj, "__name__"):
+        return obj.__name__
+    return str(obj)
+def merge_dicts(source, destination):
+    """
+    Recursively merges two dictionaries.
+    Args:
+        source (`dict`): The dictionary to merge into `destination`.
+        destination (`dict`): The dictionary to merge `source` into.
+    """
+    for key, value in source.items():
+        if isinstance(value, dict):
+            node = destination.setdefault(key, {})
+            merge_dicts(value, node)
+        else:
+            destination[key] = value
+    return destination
+def is_port_in_use(port: Optional[int] = None) -> bool:
+    """
+    Checks if a port is in use on `localhost`. Useful when several `accelerate launch` commands may be running and
+    you need to know whether the port (default: 29500) is already taken.
+    """
+    if port is None:
+        port = 29500
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        return s.connect_ex(("localhost", port)) == 0
+def get_free_port() -> int:
+    """
+    Gets a free port on `localhost`. Useful for automatic port selection when port 0 is specified in distributed
+    training scenarios.
+    Returns:
+        int: An available port number
+    """
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(("", 0))  # bind to port 0 for OS to assign a free port
+        return s.getsockname()[1]
+def convert_bytes(size):
+    "Converts `size` from bytes to the largest possible unit"
+    for x in ["bytes", "KB", "MB", "GB", "TB"]:
+        if size < 1024.0:
+            return f"{round(size, 2)} {x}"
+        size /= 1024.0
+    return f"{round(size, 2)} PB"
+def check_os_kernel():
+    """Warns if the kernel version is below the recommended minimum on Linux."""
+    # see issue #1929
+    info = platform.uname()
+    system = info.system
+    if system != "Linux":
+        return
+    _, version, *_ = re.split(r"(\d+\.\d+\.\d+)", info.release)
+    min_version = "5.5.0"
+    if Version(version) < Version(min_version):
+        msg = (
+            f"Detected kernel version {version}, which is below the recommended minimum of {min_version}; this can "
+            "cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher."
+        )
+        logger.warning(msg, main_process_only=True)
+def recursive_getattr(obj, attr: str):
+    """
+    Recursive `getattr`.
+    Args:
+        obj:
+            A class instance holding the attribute.
+        attr (`str`):
+            The attribute that is to be retrieved, e.g. 'attribute1.attribute2'.
+    """
+    def _getattr(obj, attr):
+        return getattr(obj, attr)
+    return reduce(_getattr, [obj] + attr.split("."))
+def get_module_children_bottom_up(model: torch.nn.Module, return_fqns: bool = False) -> list[torch.nn.Module]:
+    """Traverse the model in bottom-up order and return the children modules in that order.
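+    Args:
+        model (`torch.nn.Module`): the model to get the children of
+        return_fqns (`bool`, *optional*, defaults to `False`): whether to return `(fully qualified name, module)`
+            tuples instead of bare modules
+    Returns:
+        `list[torch.nn.Module]`: a list of children modules of `model` in bottom-up order (or a list of
+        `(name, module)` tuples if `return_fqns=True`). The last element is the `model` itself.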
+    """
+    top = model if not return_fqns else ("", model)
+    stack = [top]
+    ordered_modules = []
+    while stack:
+        current_module = stack.pop()
+        if return_fqns:
+            current_module_name, current_module = current_module
+        for name, attr in current_module.named_children():
+            if isinstance(attr, torch.nn.Module):
+                if return_fqns:
+                    child_name = current_module_name + "." + name if current_module_name else name
+                    stack.append((child_name, attr))
+                else:
+                    stack.append(attr)
+        if return_fqns:
+            ordered_modules.append((current_module_name, current_module))
+        else:
+            ordered_modules.append(current_module)
+    return ordered_modules[::-1]
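
The stack-based traversal in `get_module_children_bottom_up` above can be tried without `torch`. The following is only a sketch under stated assumptions: `Node` and `children_bottom_up` are hypothetical stand-ins for `torch.nn.Module` and the real helper, and they are not part of accelerate. It shows why reversing the visit order puts every child before its parent and the root last.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a module: a name plus an ordered list of children."""

    name: str
    children: list["Node"] = field(default_factory=list)


def children_bottom_up(root: Node) -> list[Node]:
    """Visit parents before their children using a stack, then reverse the result."""
    stack = [root]
    ordered = []
    while stack:
        current = stack.pop()
        # A parent is always recorded before any of its descendants...
        stack.extend(current.children)
        ordered.append(current)
    # ...so in the reversed list every child precedes its parent.
    return ordered[::-1]


tree = Node("model", [Node("encoder", [Node("layer0"), Node("layer1")]), Node("head")])
names = [node.name for node in children_bottom_up(tree)]
print(names)
```

Note that the result is a valid bottom-up order rather than a strict post-order traversal: the only guarantee is that no parent appears before one of its own descendants.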

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/random.py ADDED

	@@ -0,0 +1,156 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import random
+from typing import Optional, Union
+import numpy as np
+import torch
+from ..state import AcceleratorState
+from .constants import CUDA_DISTRIBUTED_TYPES
+from .dataclasses import DistributedType, RNGType
+from .imports import (
+    is_hpu_available,
+    is_mlu_available,
+    is_musa_available,
+    is_npu_available,
+    is_sdaa_available,
+    is_torch_xla_available,
+    is_xpu_available,
+)
+if is_torch_xla_available():
+    import torch_xla.core.xla_model as xm
+def set_seed(seed: int, device_specific: bool = False, deterministic: bool = False):
+    """
+    Helper function for reproducible behavior to set the seed in `random`, `numpy`, `torch`.
+    Args:
+        seed (`int`):
+            The seed to set.
+        device_specific (`bool`, *optional*, defaults to `False`):
+            Whether to offset the seed on each device by its process index (`AcceleratorState().process_index`).
+        deterministic (`bool`, *optional*, defaults to `False`):
+            Whether to use deterministic algorithms where available. Can slow down training.
+    """
+    if device_specific:
+        seed += AcceleratorState().process_index
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if is_xpu_available():
+        torch.xpu.manual_seed_all(seed)
+    elif is_npu_available():
+        torch.npu.manual_seed_all(seed)
+    elif is_mlu_available():
+        torch.mlu.manual_seed_all(seed)
+    elif is_sdaa_available():
+        torch.sdaa.manual_seed_all(seed)
+    elif is_musa_available():
+        torch.musa.manual_seed_all(seed)
+    elif is_hpu_available():
+        torch.hpu.manual_seed_all(seed)
+    else:
+        torch.cuda.manual_seed_all(seed)
+    # ^^ safe to call this function even if cuda is not available
+    if is_torch_xla_available():
+        xm.set_rng_state(seed)
+    if deterministic:
+        torch.use_deterministic_algorithms(True)
+def synchronize_rng_state(rng_type: Optional[RNGType] = None, generator: Optional[torch.Generator] = None):
+    # Get the proper rng state
+    if rng_type == RNGType.TORCH:
+        rng_state = torch.get_rng_state()
+    elif rng_type == RNGType.CUDA:
+        rng_state = torch.cuda.get_rng_state()
+    elif rng_type == RNGType.XLA:
+        assert is_torch_xla_available(), "Can't synchronize XLA seeds as torch_xla is unavailable."
+        rng_state = torch.tensor(xm.get_rng_state())
+    elif rng_type == RNGType.NPU:
+        assert is_npu_available(), "Can't synchronize NPU seeds on an environment without NPUs."
+        rng_state = torch.npu.get_rng_state()
+    elif rng_type == RNGType.MLU:
+        assert is_mlu_available(), "Can't synchronize MLU seeds on an environment without MLUs."
+        rng_state = torch.mlu.get_rng_state()
+    elif rng_type == RNGType.SDAA:
+        assert is_sdaa_available(), "Can't synchronize SDAA seeds on an environment without SDAAs."
+        rng_state = torch.sdaa.get_rng_state()
+    elif rng_type == RNGType.MUSA:
+        assert is_musa_available(), "Can't synchronize MUSA seeds on an environment without MUSAs."
+        rng_state = torch.musa.get_rng_state()
+    elif rng_type == RNGType.XPU:
+        assert is_xpu_available(), "Can't synchronize XPU seeds on an environment without XPUs."
+        rng_state = torch.xpu.get_rng_state()
+    elif rng_type == RNGType.HPU:
+        assert is_hpu_available(), "Can't synchronize HPU seeds on an environment without HPUs."
+        rng_state = torch.hpu.get_rng_state()
+    elif rng_type == RNGType.GENERATOR:
+        assert generator is not None, "Need a generator to synchronize its seed."
+        rng_state = generator.get_state()
+    # Broadcast the rng state from device 0 to other devices
+    state = AcceleratorState()
+    if state.distributed_type == DistributedType.XLA:
+        rng_state = rng_state.to(xm.xla_device())
+        xm.collective_broadcast([rng_state])
+        xm.mark_step()
+        rng_state = rng_state.cpu()
+    elif (
+        state.distributed_type in CUDA_DISTRIBUTED_TYPES
+        or state.distributed_type == DistributedType.MULTI_MLU
+        or state.distributed_type == DistributedType.MULTI_SDAA
+        or state.distributed_type == DistributedType.MULTI_MUSA
+        or state.distributed_type == DistributedType.MULTI_NPU
+        or state.distributed_type == DistributedType.MULTI_XPU
+        or state.distributed_type == DistributedType.MULTI_HPU
+    ):
+        rng_state = rng_state.to(state.device)
+        torch.distributed.broadcast(rng_state, 0)
+        rng_state = rng_state.cpu()
+    elif state.distributed_type == DistributedType.MULTI_CPU:
+        torch.distributed.broadcast(rng_state, 0)
+    # Set the broadcast rng state
+    if rng_type == RNGType.TORCH:
+        torch.set_rng_state(rng_state)
+    elif rng_type == RNGType.CUDA:
+        torch.cuda.set_rng_state(rng_state)
+    elif rng_type == RNGType.NPU:
+        torch.npu.set_rng_state(rng_state)
+    elif rng_type == RNGType.MLU:
+        torch.mlu.set_rng_state(rng_state)
+    elif rng_type == RNGType.SDAA:
+        torch.sdaa.set_rng_state(rng_state)
+    elif rng_type == RNGType.MUSA:
+        torch.musa.set_rng_state(rng_state)
+    elif rng_type == RNGType.XPU:
+        torch.xpu.set_rng_state(rng_state)
+    elif rng_type == RNGType.HPU:
+        torch.hpu.set_rng_state(rng_state)
+    elif rng_type == RNGType.XLA:
+        xm.set_rng_state(rng_state.item())
+    elif rng_type == RNGType.GENERATOR:
+        generator.set_state(rng_state)
+def synchronize_rng_states(rng_types: list[Union[str, RNGType]], generator: Optional[torch.Generator] = None):
+    for rng_type in rng_types:
+        synchronize_rng_state(RNGType(rng_type), generator=generator)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/rich.py ADDED

	@@ -0,0 +1,24 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .imports import is_rich_available
+if is_rich_available():
+    from rich.traceback import install
+    install(show_locals=False)
+else:
+    raise ModuleNotFoundError("To use the rich extension, install rich with `pip install rich`")

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/torch_xla.py ADDED

	@@ -0,0 +1,51 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import importlib.metadata
+import subprocess
+import sys
+def install_xla(upgrade: bool = False):
+    """
+    Helper function to install appropriate xla wheels based on the `torch` version in Google Colaboratory.
+    Args:
+        upgrade (`bool`, *optional*, defaults to `False`):
+            Whether to upgrade `torch` and install the latest `torch_xla` wheels.
+    Example:
+    ```python
+    >>> from accelerate.utils import install_xla
+    >>> install_xla(upgrade=True)
+    ```
+    """
+    in_colab = False
+    if "IPython" in sys.modules:
+        in_colab = "google.colab" in str(sys.modules["IPython"].get_ipython())
+    if in_colab:
+        if upgrade:
+            torch_install_cmd = ["pip", "install", "-U", "torch"]
+            subprocess.run(torch_install_cmd, check=True)
+        # get the current version of torch
+        torch_version = importlib.metadata.version("torch")
+        torch_version_trunc = torch_version[: torch_version.rindex(".")]
+        xla_wheel = f"https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-{torch_version_trunc}-cp37-cp37m-linux_x86_64.whl"
+        xla_install_cmd = ["pip", "install", xla_wheel]
+        subprocess.run(xla_install_cmd, check=True)
+    else:
+        raise RuntimeError("The `install_xla` utility works only on Google Colab.")
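
The version truncation that `install_xla` uses to build the wheel file name can be looked at in isolation. This is a minimal sketch: `truncate_version` is a hypothetical helper name introduced here for illustration and does not exist in accelerate.

```python
def truncate_version(version: str) -> str:
    """Drop the last dot and everything after it, e.g. '1.13.1' -> '1.13'."""
    return version[: version.rindex(".")]


for candidate in ("1.13.1", "2.0.0", "2.1.0+cu118"):
    print(candidate, "->", truncate_version(candidate))
```

Because `rindex` looks for the last dot, a local version suffix such as `+cu118` is dropped together with the patch number.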

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/tqdm.py ADDED

	@@ -0,0 +1,43 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .imports import is_tqdm_available
+if is_tqdm_available():
+    from tqdm.auto import tqdm as _tqdm
+from ..state import PartialState
+def tqdm(*args, main_process_only: bool = True, **kwargs):
+    """
+    Wrapper around `tqdm.tqdm` that optionally displays only on the main process.
+    Args:
+        main_process_only (`bool`, *optional*, defaults to `True`):
+            Whether to display the progress bar only on the local main process (local process index 0)
+    """
+    if not is_tqdm_available():
+        raise ImportError("Accelerate's `tqdm` module requires `tqdm` to be installed. Please run `pip install tqdm`.")
+    if len(args) > 0 and isinstance(args[0], bool):
+        raise ValueError(
+            "Passing `True` or `False` as the first argument to Accelerate's `tqdm` wrapper is unsupported. "
+            "Please use the `main_process_only` keyword argument instead."
+        )
+    disable = kwargs.pop("disable", False)
+    if main_process_only and not disable:
+        disable = PartialState().local_process_index != 0
+    return _tqdm(*args, **kwargs, disable=disable)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/transformer_engine.py ADDED

	@@ -0,0 +1,186 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from types import MethodType
+import torch.nn as nn
+from .imports import is_hpu_available, is_transformer_engine_available
+from .operations import GatheredParameters
+# Do not import `transformer_engine` at package level to avoid potential issues
+def convert_model(model, to_transformer_engine=True, _convert_linear=True, _convert_ln=True):
+    """
+    Recursively converts the linear and layernorm layers of a model to their `transformer_engine` counterparts.
+    """
+    if not is_transformer_engine_available():
+        raise ImportError("Using `convert_model` requires transformer_engine to be installed.")
+    if is_hpu_available():
+        import intel_transformer_engine as te
+        if not hasattr(te, "LayerNorm"):
+            # HPU does not have a LayerNorm implementation in TE
+            te.LayerNorm = nn.LayerNorm
+    else:
+        import transformer_engine.pytorch as te
+    for name, module in model.named_children():
+        if isinstance(module, nn.Linear) and to_transformer_engine and _convert_linear:
+            has_bias = module.bias is not None
+            params_to_gather = [module.weight]
+            if has_bias:
+                params_to_gather.append(module.bias)
+            with GatheredParameters(params_to_gather, modifier_rank=0):
+                if any(p % 16 != 0 for p in module.weight.shape):
+                    return
+                te_module = te.Linear(
+                    module.in_features, module.out_features, bias=has_bias, params_dtype=module.weight.dtype
+                )
+                te_module.weight.copy_(module.weight)
+                if has_bias:
+                    te_module.bias.copy_(module.bias)
+                setattr(model, name, te_module)
+        # Note: @xrsrke (Phuc) found that te.LayerNorm doesn't have any real memory savings or speedups over nn.LayerNorm
+        elif isinstance(module, nn.LayerNorm) and to_transformer_engine and _convert_ln:
+            with GatheredParameters([module.weight, module.bias], modifier_rank=0):
+                has_bias = module.bias is not None
+                te_module = te.LayerNorm(module.normalized_shape[0], eps=module.eps, params_dtype=module.weight.dtype)
+                te_module.weight.copy_(module.weight)
+                if has_bias:
+                    te_module.bias.copy_(module.bias)
+            setattr(model, name, te_module)
+        elif isinstance(module, te.Linear) and not to_transformer_engine and _convert_linear:
+            has_bias = module.bias is not None
+            # `nn.Linear` takes `dtype`, not `params_dtype` (that keyword belongs to the TE layers)
+            new_module = nn.Linear(
+                module.in_features, module.out_features, bias=has_bias, dtype=module.weight.dtype
+            )
+            new_module.weight.copy_(module.weight)
+            if has_bias:
+                new_module.bias.copy_(module.bias)
+            setattr(model, name, new_module)
+        elif isinstance(module, te.LayerNorm) and not to_transformer_engine and _convert_ln:
+            new_module = nn.LayerNorm(module.normalized_shape[0], eps=module.eps, dtype=module.weight.dtype)
+            new_module.weight.copy_(module.weight)
+            new_module.bias.copy_(module.bias)
+            setattr(model, name, new_module)
+        else:
+            convert_model(
+                module,
+                to_transformer_engine=to_transformer_engine,
+                _convert_linear=_convert_linear,
+                _convert_ln=_convert_ln,
+            )
+def has_transformer_engine_layers(model):
+    """
+    Returns whether a given model has some `transformer_engine` layer or not.
+    """
+    if not is_transformer_engine_available():
+        raise ImportError("Using `has_transformer_engine_layers` requires transformer_engine to be installed.")
+    if is_hpu_available():
+        import intel_transformer_engine as te
+        module_cls_to_check = te.Linear
+    else:
+        import transformer_engine.pytorch as te
+        module_cls_to_check = (te.LayerNorm, te.Linear, te.TransformerLayer)
+    for m in model.modules():
+        if isinstance(m, module_cls_to_check):
+            return True
+    return False
+def contextual_fp8_autocast(model_forward, fp8_recipe, use_during_eval=False):
+    """
+    Wraps a model's forward method in FP8 autocast. The wrapper is context aware: by default it disables FP8
+    autocast in eval mode, which generally yields more accurate metrics.
+    """
+    if not is_transformer_engine_available():
+        raise ImportError("Using `contextual_fp8_autocast` requires transformer_engine to be installed.")
+    if is_hpu_available():
+        from intel_transformer_engine import fp8_autocast
+    else:
+        from transformer_engine.pytorch import fp8_autocast
+    def forward(self, *args, **kwargs):
+        enabled = use_during_eval or self.training
+        with fp8_autocast(enabled=enabled, fp8_recipe=fp8_recipe):
+            return model_forward(*args, **kwargs)
+    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
+    forward.__wrapped__ = model_forward
+    return forward
+def apply_fp8_autowrap(model, fp8_recipe_handler):
+    """
+    Applies the FP8 autocast context manager to the model's forward method.
+    """
+    if not is_transformer_engine_available():
+        raise ImportError("Using `apply_fp8_autowrap` requires transformer_engine to be installed.")
+    if is_hpu_available():
+        import intel_transformer_engine.recipe as te_recipe
+        is_fp8_block_scaling_available = False
+        message = "MXFP8 block scaling is not available on HPU."
+    else:
+        import transformer_engine.common.recipe as te_recipe
+        import transformer_engine.pytorch as te
+        is_fp8_block_scaling_available, message = te.fp8.check_mxfp8_support()
+    kwargs = fp8_recipe_handler.to_kwargs() if fp8_recipe_handler is not None else {}
+    if "fp8_format" in kwargs:
+        kwargs["fp8_format"] = getattr(te_recipe.Format, kwargs["fp8_format"])
+    use_during_eval = kwargs.pop("use_autocast_during_eval", False)
+    use_mxfp8_block_scaling = kwargs.pop("use_mxfp8_block_scaling", False)
+    if use_mxfp8_block_scaling and not is_fp8_block_scaling_available:
+        raise ValueError(f"MXFP8 block scaling is not available: {message}")
+    if use_mxfp8_block_scaling:
+        if "amax_compute_algo" in kwargs:
+            raise ValueError("`amax_compute_algo` is not supported for MXFP8 block scaling.")
+        if "amax_history_len" in kwargs:
+            raise ValueError("`amax_history_len` is not supported for MXFP8 block scaling.")
+        fp8_recipe = te_recipe.MXFP8BlockScaling(**kwargs)
+    else:
+        fp8_recipe = te_recipe.DelayedScaling(**kwargs)
+    new_forward = contextual_fp8_autocast(model.forward, fp8_recipe, use_during_eval)
+    if hasattr(model.forward, "__func__"):
+        model.forward = MethodType(new_forward, model)
+    else:
+        model.forward = new_forward
+    return model
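
The wrapping pattern used by `contextual_fp8_autocast` and `apply_fp8_autowrap` can be sketched without `transformer_engine` installed. In the sketch below, `fake_autocast`, `TinyModel`, and `contextual_wrap` are hypothetical stand-ins introduced only for illustration; the point is how the `training` flag decides whether the context is enabled and how `MethodType` rebinds the new forward.

```python
from contextlib import contextmanager
from types import MethodType

calls = []  # records the `enabled` flag seen on each forward call


@contextmanager
def fake_autocast(enabled):
    # Hypothetical stand-in for fp8_autocast: it only records the flag.
    calls.append(enabled)
    yield


class TinyModel:
    # Hypothetical stand-in for an nn.Module with a `training` attribute.
    training = True

    def forward(self, x):
        return x * 2


def contextual_wrap(model_forward, use_during_eval=False):
    def forward(self, *args, **kwargs):
        # Enabled while training, or always when use_during_eval is set.
        enabled = use_during_eval or self.training
        with fake_autocast(enabled=enabled):
            # model_forward is already bound, so `self` is not passed again.
            return model_forward(*args, **kwargs)

    # Keep a handle on the original forward so the wrapper can be undone.
    forward.__wrapped__ = model_forward
    return forward


model = TinyModel()
model.forward = MethodType(contextual_wrap(model.forward), model)

out_train = model.forward(3)  # context enabled
model.training = False
out_eval = model.forward(3)   # context disabled
```

The wrapped forward returns the same result in both modes; only the flag passed to the context manager changes.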

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/accelerate/utils/versions.py ADDED

	@@ -0,0 +1,56 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import importlib.metadata
+from typing import Union
+from packaging.version import Version, parse
+from .constants import STR_OPERATION_TO_FUNC
+torch_version = parse(importlib.metadata.version("torch"))
+def compare_versions(library_or_version: Union[str, Version], operation: str, requirement_version: str):
+    """
+    Compares a library version to some requirement using a given operation.
+    Args:
+        library_or_version (`str` or `packaging.version.Version`):
+            A library name or a version to check.
+        operation (`str`):
+            A string representation of an operator, such as `">"` or `"<="`.
+        requirement_version (`str`):
+            The version to compare the library version against.
+    """
+    if operation not in STR_OPERATION_TO_FUNC.keys():
+        raise ValueError(f"`operation` must be one of {list(STR_OPERATION_TO_FUNC.keys())}, received {operation}")
+    operation = STR_OPERATION_TO_FUNC[operation]
+    if isinstance(library_or_version, str):
+        library_or_version = parse(importlib.metadata.version(library_or_version))
+    return operation(library_or_version, parse(requirement_version))
+def is_torch_version(operation: str, version: str):
+    """
+    Compares the current PyTorch version to a given reference with an operation.
+    Args:
+        operation (`str`):
+            A string representation of an operator, such as `">"` or `"<="`.
+        version (`str`):
+            A string version of PyTorch.
+    """
+    return compare_versions(torch_version, operation, version)
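
The operator-dispatch idea behind `compare_versions` can be sketched with the standard library alone. This is a simplified illustration, not the library's code: the table below is an assumed shape for `STR_OPERATION_TO_FUNC` (the real one is imported from `.constants`), and `_simple_parse` is a hypothetical stand-in for `packaging.version.parse` that handles only dotted integer releases.

```python
import operator

# Assumed shape of the lookup table; the real one lives in `.constants`.
STR_OPERATION_TO_FUNC = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    "<": operator.lt,
}


def _simple_parse(version: str) -> tuple:
    """Hypothetical stand-in for packaging.version.parse (dotted integers only)."""
    return tuple(int(part) for part in version.split("."))


def compare_version_strings(version: str, operation: str, requirement: str) -> bool:
    """Look up the operator by its string form and apply it to parsed versions."""
    if operation not in STR_OPERATION_TO_FUNC:
        raise ValueError(
            f"`operation` must be one of {list(STR_OPERATION_TO_FUNC)}, received {operation}"
        )
    compare = STR_OPERATION_TO_FUNC[operation]
    return compare(_simple_parse(version), _simple_parse(requirement))


# Parsed comparison orders 2.10.0 after 2.9.1; raw string comparison does not.
parsed_result = compare_version_strings("2.10.0", ">=", "2.9.1")
string_result = "2.10.0" >= "2.9.1"
```

Parsing before comparing matters because plain string comparison is character by character, which would order `"2.10.0"` before `"2.9.1"`.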

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/annotated_doc/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (283 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/annotated_doc/__pycache__/main.cpython-312.pyc ADDED

Binary file (1.93 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/cuda_pathfinder-1.4.0.dist-info/licenses/LICENSE ADDED

	@@ -0,0 +1,177 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (1.38 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/__version__.cpython-312.pyc ADDED

Binary file (619 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_align.cpython-312.pyc ADDED

Binary file (1.37 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_align_getter.cpython-312.pyc ADDED

Binary file (1.99 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_base.cpython-312.pyc ADDED

Binary file (3.77 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_column.cpython-312.pyc ADDED

Binary file (18.5 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_common.cpython-312.pyc ADDED

Binary file (3.47 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_container.cpython-312.pyc ADDED

Binary file (9.41 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_converter.cpython-312.pyc ADDED

Binary file (5.21 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_dataproperty.cpython-312.pyc ADDED

Binary file (15.8 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_extractor.cpython-312.pyc ADDED

Binary file (34.7 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_formatter.cpython-312.pyc ADDED

Binary file (4.91 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_function.cpython-312.pyc ADDED

Binary file (5.44 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_interface.cpython-312.pyc ADDED

Binary file (1.64 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_line_break.cpython-312.pyc ADDED

Binary file (546 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/_preprocessor.cpython-312.pyc ADDED

Binary file (8.11 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/__pycache__/typing.cpython-312.pyc ADDED

Binary file (1.97 kB).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__init__.py ADDED

	@@ -0,0 +1,7 @@

+from ._logger import logger, set_logger  # type: ignore
+__all__ = (
+    "logger",
+    "set_logger",
+)

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (307 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/_logger.cpython-312.pyc ADDED

Binary file (950 Bytes).

Prism/LLaDA/LLaDA_Prism/.venv/lib/python3.12/site-packages/dataproperty/logger/__pycache__/_null_logger.cpython-312.pyc ADDED

Binary file (2.04 kB).