--------- Environment sanity check ---------
shell: ./sft-safe.sh running under bash 5.0.17(1)-release
conda env: pda
python: /home/panda/miniconda3/envs/pda/bin/python
sys.executable : /home/panda/miniconda3/envs/pda/bin/python
python version : 3.11.11
CONDA_PREFIX : /home/panda/miniconda3/envs/pda
deepspeed: /home/panda/miniconda3/envs/pda/bin/deepspeed
--------------------------------------------
[2025-05-01 23:40:58,399] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-01 23:41:00,143] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2025-05-01 23:41:00,143] [INFO] [runner.py:605:main] cmd = /home/panda/miniconda3/envs/pda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=24428 --module --enable_each_rank_log=None safe_rlhf.algorithms.safe_ft --train_datasets SAFE-ALPACA/2000 --model_name_or_path huggyllama/llama-7b --cache_dir /home/panda/pda-llm/cache/sft-2000 --safe_sft true --max_length 1024 --trust_remote_code True --epochs 6 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 32 --gradient_checkpointing --learning_rate 1e-4 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.0 --seed 42 --output_dir /home/panda/pda-llm/output/sft-safe/2000/run-true-2000-1-50 --log_type wandb --log_project SAFE-SFT-v3 --zero_stage 0 --offload none --safety_ratio_tol 50 --resilient_coeff 1 --lora_r 4 --lora_alpha 16 --lora_dropout 0.05 --bf16 False --fp16 True --tf32 True
[2025-05-01 23:41:01,337] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-01 23:41:03,051] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-05-01 23:41:03,051] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-05-01 23:41:03,051] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-05-01 23:41:03,051] [INFO] [launch.py:164:main] dist_world_size=2
[2025-05-01 23:41:03,051] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-05-01 23:41:03,052] [INFO] [launch.py:256:main] process 2003561 spawned with command: ['/home/panda/miniconda3/envs/pda/bin/python', '-u', '-m', 'safe_rlhf.algorithms.safe_ft', '--local_rank=0', '--train_datasets', 'SAFE-ALPACA/2000', '--model_name_or_path', 'huggyllama/llama-7b', '--cache_dir', '/home/panda/pda-llm/cache/sft-2000', '--safe_sft', 'true', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '6', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '32', '--gradient_checkpointing', '--learning_rate', '1e-4', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/panda/pda-llm/output/sft-safe/2000/run-true-2000-1-50', '--log_type', 'wandb', '--log_project', 'SAFE-SFT-v3', '--zero_stage', '0', '--offload', 'none', '--safety_ratio_tol', '50', '--resilient_coeff', '1', '--lora_r', '4', '--lora_alpha', '16', '--lora_dropout', '0.05', '--bf16', 'False', '--fp16', 'True', '--tf32', 'True']
[2025-05-01 23:41:03,053] [INFO] [launch.py:256:main] process 2003562 spawned with command: ['/home/panda/miniconda3/envs/pda/bin/python', '-u', '-m', 'safe_rlhf.algorithms.safe_ft', '--local_rank=1', '--train_datasets', 'SAFE-ALPACA/2000', '--model_name_or_path', 'huggyllama/llama-7b', '--cache_dir', '/home/panda/pda-llm/cache/sft-2000', '--safe_sft', 'true', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '6', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '32', '--gradient_checkpointing', '--learning_rate', '1e-4', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/panda/pda-llm/output/sft-safe/2000/run-true-2000-1-50', '--log_type', 'wandb', '--log_project', 'SAFE-SFT-v3', '--zero_stage', '0', '--offload', 'none', '--safety_ratio_tol', '50', '--resilient_coeff', '1', '--lora_r', '4', '--lora_alpha', '16', '--lora_dropout', '0.05', '--bf16', 'False', '--fp16', 'True', '--tf32', 'True']
[2025-05-01 23:41:04,217] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-01 23:41:04,228] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-01 23:41:07,119] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-01 23:41:07,132] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-01 23:41:07,132] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Set logger level to WARNING.
calculating baseline ...
calculating baseline ...
Loading cached baseline logprobs from /home/panda/pda-llm/cache/sft-2000/cached_baseline_logprobs.pt
Loaded cached baseline logprobs successfully
ninja: no work to do.
Time to load fused_adam op: 0.03461647033691406 seconds
Time to load fused_adam op: 0.10099267959594727 seconds
***** Running training *****
***** Evaluating at the beginning *****
***** Evaluating at epoch 1/6 *****
***** Evaluating at epoch 2/6 *****
***** Evaluating at epoch 3/6 *****
***** Evaluating at epoch 4/6 *****
***** Evaluating at epoch 5/6 *****
***** Evaluating at epoch 6/6 *****
Saving model to "/home/panda/pda-llm/output/sft-safe/2000/run-true-2000-1-50" ...
Saving Hugging Face Checkpoints...
Model saved!
[2025-05-02 02:19:17,247] [INFO] [launch.py:351:main] Process 2003562 exits successfully.
[2025-05-02 02:19:19,248] [INFO] [launch.py:351:main] Process 2003561 exits successfully.
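From the flags recorded in the launch command (`--per_device_train_batch_size 4`, `--gradient_accumulation_steps 32`, `dist_world_size=2`), the effective global batch size of this run can be worked out; a minimal sketch of that arithmetic, using values taken directly from the log above:

```python
# Effective global batch size = per-device batch * grad-accum steps * world size.
per_device_train_batch_size = 4    # --per_device_train_batch_size
gradient_accumulation_steps = 32   # --gradient_accumulation_steps
dist_world_size = 2                # GPUs 0,1 on localhost (dist_world_size=2)

global_batch_size = (per_device_train_batch_size
                     * gradient_accumulation_steps
                     * dist_world_size)
print(global_batch_size)  # 256
```

So each optimizer step consumes 256 samples, even though each GPU only holds 4 at a time.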