| [2025-05-27 16:15:27,103] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. | |
| [2025-05-27 16:15:32,333] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. | |
| [2025-05-27 16:15:32,334] [INFO] [runner.py:605:main] cmd = /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=42100 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json --model_name_or_path /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000 --max_length 512 --trust_remote_code True --epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 1e-5 --lr_warmup_ratio 0 --weight_decay 0.0 --lr_scheduler_type constant --weight_decay 0.0 --seed 42 --output_dir /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000 --log_type wandb --log_run_name imdb-tinyllama-2T-s3-Q1-2000-Q2-2000 --log_project Inverse_Alignment_IMDb --zero_stage 3 --offload none --bf16 True --tf32 True --save_16bit | |
| [2025-05-27 16:15:34,470] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0 | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:164:main] dist_world_size=8 | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 | |
| [2025-05-27 16:15:39,655] [INFO] [launch.py:256:main] process 2135438 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=0', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,656] [INFO] [launch.py:256:main] process 2135439 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=1', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,657] [INFO] [launch.py:256:main] process 2135440 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=2', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,657] [INFO] [launch.py:256:main] process 2135441 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=3', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,658] [INFO] [launch.py:256:main] process 2135442 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=4', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,658] [INFO] [launch.py:256:main] process 2135443 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=5', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,659] [INFO] [launch.py:256:main] process 2135444 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=6', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:39,660] [INFO] [launch.py:256:main] process 2135445 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=7', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] | |
| [2025-05-27 16:15:44,310] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,442] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,454] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,458] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,684] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,698] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,701] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:44,708] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-05-27 16:15:52,159] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,265] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,497] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,497] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,497] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,539] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,546] [INFO] [comm.py:669:init_distributed] cdb=None | |
| [2025-05-27 16:15:52,546] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl | |
| [2025-05-27 16:15:52,614] [INFO] [comm.py:669:init_distributed] cdb=None | |
| Set logger level to INFO. | |
| [2025-05-27 16:16:07,663] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135438 | |
| [2025-05-27 16:16:07,716] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135439 | |
| [2025-05-27 16:16:08,388] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135440 | |
| [2025-05-27 16:16:08,429] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135441 | |
| [2025-05-27 16:16:08,466] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135442 | |
| [2025-05-27 16:16:08,496] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135443 | |
| [2025-05-27 16:16:08,496] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135444 | |
| [2025-05-27 16:16:08,739] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2135445 | |
| [2025-05-27 16:16:08,775] [ERROR] [launch.py:325:sigkill_handler] ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=7', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-2000', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] exits with return code = 1 | |
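
The `--world_info` value passed to `deepspeed.launcher.launch` in the log above looks like base64-encoded JSON. The sketch below decodes it so it can be compared against the `WORLD INFO DICT` line. It assumes the value is URL-safe base64 over a JSON object, which matches what the log prints; `decode_world_info` is a hypothetical helper name, not part of DeepSpeed.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a base64-encoded JSON string into a dict.

    Assumption: the value is URL-safe base64 over a JSON object,
    as the WORLD INFO DICT line in the log suggests.
    """
    raw = base64.urlsafe_b64decode(encoded)
    return json.loads(raw)


# Value copied verbatim from the launcher command line in the log.
world_info = decode_world_info(
    "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
)
print(world_info)
```

The decoded mapping lists one host, `localhost`, with GPU indices 0 through 7, in line with `dist_world_size=8` and the eight spawned processes.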