| [2025-05-27 15:42:33,308] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| [2025-05-27 15:42:39,246] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. |
| [2025-05-27 15:42:39,247] [INFO] [runner.py:605:main] cmd = /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=23611 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json --model_name_or_path /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000 --max_length 512 --trust_remote_code True --epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 1e-5 --lr_warmup_ratio 0 --weight_decay 0.0 --lr_scheduler_type constant --weight_decay 0.0 --seed 42 --output_dir /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200 --log_type wandb --log_run_name imdb-tinyllama-2T-s3-Q1-2000-Q2-200 --log_project Inverse_Alignment_IMDb --zero_stage 3 --offload none --bf16 True --tf32 True --save_16bit |
| [2025-05-27 15:42:41,819] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| [2025-05-27 15:42:47,641] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} |
| [2025-05-27 15:42:47,641] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0 |
| [2025-05-27 15:42:47,641] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) |
| [2025-05-27 15:42:47,641] [INFO] [launch.py:164:main] dist_world_size=8 |
| [2025-05-27 15:42:47,641] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| [2025-05-27 15:42:47,642] [INFO] [launch.py:256:main] process 1982126 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=0', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,642] [INFO] [launch.py:256:main] process 1982127 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=1', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,643] [INFO] [launch.py:256:main] process 1982128 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=2', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,643] [INFO] [launch.py:256:main] process 1982129 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=3', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,643] [INFO] [launch.py:256:main] process 1982130 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=4', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,644] [INFO] [launch.py:256:main] process 1982131 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=5', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,644] [INFO] [launch.py:256:main] process 1982132 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=6', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:47,645] [INFO] [launch.py:256:main] process 1982133 spawned with command: ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=7', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] |
| [2025-05-27 15:42:52,206] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| [2025-05-27 15:42:52,466] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-05-27 15:42:52,549] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| [2025-05-27 15:42:52,737] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-05-27 15:42:52,769] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-05-27 15:42:52,778] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-05-27 15:42:52,785] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-05-27 15:42:52,786] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| Warning: The cache directory for DeepSpeed Triton autotune, /home/hansirui_1st/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path. |
| [2025-05-27 15:43:00,326] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:00,482] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:00,926] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:00,988] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:00,993] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:00,994] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl |
| [2025-05-27 15:43:01,011] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:01,023] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-05-27 15:43:01,163] [INFO] [comm.py:669:init_distributed] cdb=None |
| Set logger level to INFO. |
| [2025-05-27 15:43:18,648] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982126 |
| [2025-05-27 15:43:18,715] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982127 |
| [2025-05-27 15:43:18,715] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982128 |
| [2025-05-27 15:43:18,779] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982129 |
| [2025-05-27 15:43:18,841] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982130 |
| [2025-05-27 15:43:18,900] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982131 |
| [2025-05-27 15:43:18,958] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982132 |
| [2025-05-27 15:43:19,019] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1982133 |
| [2025-05-27 15:43:19,077] [ERROR] [launch.py:325:sigkill_handler] ['/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/bin/python3.11', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=7', '--train_datasets', 'inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/200/train.json', '--model_name_or_path', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000', '--max_length', '512', '--trust_remote_code', 'True', '--epochs', '1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '1e-5', '--lr_warmup_ratio', '0', '--weight_decay', '0.0', '--lr_scheduler_type', 'constant', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-200', '--log_type', 'wandb', '--log_run_name', 'imdb-tinyllama-2T-s3-Q1-2000-Q2-200', '--log_project', 'Inverse_Alignment_IMDb', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True', '--save_16bit'] exits with return code = 1 |