--------- Environment sanity check ---------
shell: ./sft-tools.sh running under bash 5.0.17(1)-release
conda env: pda
python: /home/panda/miniconda3/envs/pda/bin/python
sys.executable : /home/panda/miniconda3/envs/pda/bin/python
python version : 3.11.11
CONDA_PREFIX : /home/panda/miniconda3/envs/pda
deepspeed: /home/panda/miniconda3/envs/pda/bin/deepspeed
--------------------------------------------
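The banner above can be reproduced programmatically. A minimal sketch using only the standard library (the actual `sft-tools.sh` check is shell code and is not shown in this log; the function name here is hypothetical):

```python
import os
import sys


def env_sanity_check() -> dict:
    """Collect the interpreter/conda facts printed by the sanity-check banner."""
    return {
        "python": sys.executable,
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "conda_prefix": os.environ.get("CONDA_PREFIX", "<not set>"),
    }


if __name__ == "__main__":
    for key, value in env_sanity_check().items():
        print(f"{key:>14} : {value}")
```

Printing `sys.executable` alongside `CONDA_PREFIX` catches the common failure where the shell resolves a different Python than the activated conda environment.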
[2025-05-11 17:36:36,239] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-11 17:36:38,288] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
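The line above shows a wrapper mapping a `VISIBLE_DEVICES` list onto DeepSpeed's `--include` flag. A sketch of that mapping (a hypothetical helper; the wrapper's real logic is not part of this log):

```python
def include_flag(visible_devices: str, host: str = "localhost") -> str:
    """Build a DeepSpeed --include value such as 'localhost:0,1'
    from a comma-separated GPU list like '0,1'."""
    devices = ",".join(d.strip() for d in visible_devices.split(",") if d.strip())
    return f"--include={host}:{devices}"


print(include_flag("0,1"))  # → --include=localhost:0,1
```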
[2025-05-11 17:36:38,288] [INFO] [runner.py:605:main] cmd = /home/panda/miniconda3/envs/pda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29753 --module --enable_each_rank_log=None safe_rlhf.algorithms.tools_ft --train_datasets tools --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --cache_dir /home/panda/pda-llm/cache/sft-tools --important_sft false --max_length 4096 --trust_remote_code True --epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 48 --gradient_checkpointing --learning_rate 1e-4 --lr_scheduler_type cosine --lr_warmup_ratio 0.1 --weight_decay 0.0 --seed 42 --output_dir /home/panda/pda-llm/output/sft-tools/run-false-1-100-16-4096 --log_type wandb --log_project TOOLS-SFT --zero_stage 0 --offload none --safety_ratio_tol 100 --resilient_coeff 1 --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --gradient_checkpointing --bf16 True --fp16 False --tf32 False
[2025-05-11 17:36:39,468] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-11 17:36:41,463] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-05-11 17:36:41,463] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-05-11 17:36:41,463] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-05-11 17:36:41,463] [INFO] [launch.py:164:main] dist_world_size=2
[2025-05-11 17:36:41,463] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-05-11 17:36:41,463] [INFO] [launch.py:256:main] process 1306315 spawned with command: ['/home/panda/miniconda3/envs/pda/bin/python', '-u', '-m', 'safe_rlhf.algorithms.tools_ft', '--local_rank=0', '--train_datasets', 'tools', '--model_name_or_path', 'meta-llama/Llama-3.1-8B-Instruct', '--cache_dir', '/home/panda/pda-llm/cache/sft-tools', '--important_sft', 'false', '--max_length', '4096', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '48', '--gradient_checkpointing', '--learning_rate', '1e-4', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.1', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/panda/pda-llm/output/sft-tools/run-false-1-100-16-4096', '--log_type', 'wandb', '--log_project', 'TOOLS-SFT', '--zero_stage', '0', '--offload', 'none', '--safety_ratio_tol', '100', '--resilient_coeff', '1', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--gradient_checkpointing', '--bf16', 'True', '--fp16', 'False', '--tf32', 'False']
[2025-05-11 17:36:41,464] [INFO] [launch.py:256:main] process 1306316 spawned with command: ['/home/panda/miniconda3/envs/pda/bin/python', '-u', '-m', 'safe_rlhf.algorithms.tools_ft', '--local_rank=1', '--train_datasets', 'tools', '--model_name_or_path', 'meta-llama/Llama-3.1-8B-Instruct', '--cache_dir', '/home/panda/pda-llm/cache/sft-tools', '--important_sft', 'false', '--max_length', '4096', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '48', '--gradient_checkpointing', '--learning_rate', '1e-4', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.1', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/panda/pda-llm/output/sft-tools/run-false-1-100-16-4096', '--log_type', 'wandb', '--log_project', 'TOOLS-SFT', '--zero_stage', '0', '--offload', 'none', '--safety_ratio_tol', '100', '--resilient_coeff', '1', '--lora_r', '16', '--lora_alpha', '32', '--lora_dropout', '0.05', '--gradient_checkpointing', '--bf16', 'True', '--fp16', 'False', '--tf32', 'False']
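From the flags in the spawned commands, the effective global batch size per optimizer step can be derived as per-device batch × gradient-accumulation steps × world size (the helper name below is illustrative):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Global number of training samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * world_size


# Values from the launch command above: batch 1, 48 accumulation steps, 2 ranks.
print(effective_batch_size(1, 48, 2))  # → 96
```

This is why a per-device batch of 1 is workable here: gradient accumulation over 48 micro-batches on each of the 2 ranks yields 96 samples per weight update.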
[2025-05-11 17:36:42,599] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-11 17:36:42,615] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-11 17:36:45,762] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-05-11 17:36:45,762] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-11 17:36:45,771] [INFO] [comm.py:669:init_distributed] cdb=None
Set logger level to WARNING.
calculating baseline ...
calculating baseline ...
Loading cached baseline logprobs from /home/panda/pda-llm/cache/sft-tools/cached_baseline_logprobs.pt
Loaded cached baseline logprobs successfully
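The baseline-logprob lines above follow a compute-once, load-thereafter caching pattern. A minimal stdlib sketch under that assumption (the actual code presumably serializes a tensor with `torch.save`/`torch.load` to the `.pt` file shown, which is not reproduced here):

```python
import pickle
from pathlib import Path


def cached_compute(cache_path: Path, compute_fn):
    """Load a cached result if present; otherwise compute it, save it, and return it."""
    if cache_path.exists():
        # Cache hit: skip the expensive computation entirely.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With this pattern, only the first run pays for the baseline pass; later runs (like the one logged here) load the cached file in seconds.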
ninja: no work to do.
Time to load fused_adam op: 0.03502917289733887 seconds
Time to load fused_adam op: 0.10110139846801758 seconds
***** Running training *****
***** Evaluating at the beginning *****
***** Evaluating at epoch 1/3 *****
***** Evaluating at epoch 2/3 *****
***** Evaluating at epoch 3/3 *****
Saving model to "/home/panda/pda-llm/output/sft-tools/run-false-1-100-16-4096" ...
Saving Hugging Face Checkpoints...
[2025-05-11 20:33:54,847] [INFO] [launch.py:351:main] Process 1306316 exits successfully.
Model saved!
[2025-05-11 20:33:58,848] [INFO] [launch.py:351:main] Process 1306315 exits successfully.