safe-rlhf-sft / stdout.log
First commit
2a1b1ea
[2024-10-03 21:08:06,000] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-10-03 21:08:20,414] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2024-10-03 21:08:20,414] [INFO] [runner.py:585:main] cmd = /home/alison.cs13/.conda/envs/safe-rlhf-12-4/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=31882 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets alpaca --model_name_or_path huggyllama/llama-7b --max_length 512 --trust_remote_code True --epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.0 --seed 42 --output_dir /home/alison.cs13/safe-rlhf/output/sft --log_type wandb --log_project Safe-RLHF-SFT --zero_stage 3 --offload none --bf16 True --tf32 True
[2024-10-03 21:08:22,652] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-10-03 21:08:25,747] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-10-03 21:08:25,748] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-10-03 21:08:25,748] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-10-03 21:08:25,748] [INFO] [launch.py:164:main] dist_world_size=4
[2024-10-03 21:08:25,748] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-10-03 21:08:25,749] [INFO] [launch.py:256:main] process 3116759 spawned with command: ['/home/alison.cs13/.conda/envs/safe-rlhf-12-4/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=0', '--train_datasets', 'alpaca', '--model_name_or_path', 'huggyllama/llama-7b', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/alison.cs13/safe-rlhf/output/sft', '--log_type', 'wandb', '--log_project', 'Safe-RLHF-SFT', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True']
[2024-10-03 21:08:25,750] [INFO] [launch.py:256:main] process 3116760 spawned with command: ['/home/alison.cs13/.conda/envs/safe-rlhf-12-4/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=1', '--train_datasets', 'alpaca', '--model_name_or_path', 'huggyllama/llama-7b', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/alison.cs13/safe-rlhf/output/sft', '--log_type', 'wandb', '--log_project', 'Safe-RLHF-SFT', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True']
[2024-10-03 21:08:25,751] [INFO] [launch.py:256:main] process 3116761 spawned with command: ['/home/alison.cs13/.conda/envs/safe-rlhf-12-4/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=2', '--train_datasets', 'alpaca', '--model_name_or_path', 'huggyllama/llama-7b', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/alison.cs13/safe-rlhf/output/sft', '--log_type', 'wandb', '--log_project', 'Safe-RLHF-SFT', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True']
[2024-10-03 21:08:25,753] [INFO] [launch.py:256:main] process 3116762 spawned with command: ['/home/alison.cs13/.conda/envs/safe-rlhf-12-4/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=3', '--train_datasets', 'alpaca', '--model_name_or_path', 'huggyllama/llama-7b', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/alison.cs13/safe-rlhf/output/sft', '--log_type', 'wandb', '--log_project', 'Safe-RLHF-SFT', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True']
[2024-10-03 21:08:28,669] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-03 21:08:28,757] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-03 21:08:28,788] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-03 21:08:28,819] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/alison.cs13/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-10-03 21:08:38,634] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-03 21:08:38,634] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-03 21:08:38,634] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-03 21:08:38,634] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-03 21:08:38,634] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Set logger level to WARNING.
1.11.1.git.kitware.jobserver-1
Time to load fused_adam op: 0.296250581741333 seconds
Time to load fused_adam op: 0.31388378143310547 seconds
Time to load fused_adam op: 0.3142235279083252 seconds
Time to load fused_adam op: 0.3142242431640625 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
***** Running training *****