| + deepspeed --master_port 42100 --module safe_rlhf.finetune --train_datasets inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/neg/2000/train.json --model_name_or_path /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000 --max_length 512 --trust_remote_code True --epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 1e-5 --lr_warmup_ratio 0 --weight_decay 0.0 --lr_scheduler_type constant --seed 42 --output_dir /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000-Q2-2000 --log_type wandb --log_run_name imdb-tinyllama-2T-s3-Q1-2000-Q2-2000 --log_project Inverse_Alignment_IMDb --zero_stage 3 --offload none --bf16 True --tf32 True --save_16bit |
| nvcc warning : incompatible redefinition for option |
| [the nvcc warning above was emitted 70 times; repeats elided] |
| [rank0]:[W527 16:15:55.420974946 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank1]–[rank7]: same ProcessGroupNCCL barrier warning as rank 0 above, each using its own GPU (1–7); repeats elided |
| Model config LlamaConfig { |
| "attention_bias": false, |
| "attention_dropout": 0.0, |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "head_dim": 128, |
| "hidden_act": "silu", |
| "hidden_size": 4096, |
| "initializer_range": 0.02, |
| "intermediate_size": 11008, |
| "max_position_embeddings": 2048, |
| "mlp_bias": false, |
| "model_type": "llama", |
| "num_attention_heads": 32, |
| "num_hidden_layers": 32, |
| "num_key_value_heads": 32, |
| "pretraining_tp": 1, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 10000.0, |
| "tie_word_embeddings": false, |
| "transformers_version": "4.52.1", |
| "use_cache": true, |
| "vocab_size": 32000 |
| } |
|
|
| [the LlamaConfig dump above was printed identically by the other 7 ranks; repeats elided] |
|
|
| [rank0]: Traceback (most recent call last): |
| [rank0]: File "<frozen runpy>", line 198, in _run_module_as_main |
| [rank0]: File "<frozen runpy>", line 88, in _run_code |
| [rank0]: File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/finetune/__main__.py", line 40, in <module> |
| [rank0]: sys.exit(main()) |
| [rank0]: ^^^^^^ |
| [rank0]: File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/finetune/deepspeed.py", line 321, in main |
| [rank0]: trainer = SupervisedFinetuneTrainer(args, ds_config) |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/trainers/supervised_trainer.py", line 81, in __init__ |
| [rank0]: self.init_models() |
| [rank0]: File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/trainers/supervised_trainer.py", line 94, in init_models |
| [rank0]: self.model, self.tokenizer = load_pretrained_models( |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/models/pretrained.py", line 206, in load_pretrained_models |
| [rank0]: model = auto_model_type.from_pretrained( |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained |
| [rank0]: return model_class.from_pretrained( |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/transformers/modeling_utils.py", line 308, in _wrapper |
| [rank0]: return func(*args, **kwargs) |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4461, in from_pretrained |
| [rank0]: checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files( |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/transformers/modeling_utils.py", line 974, in _get_resolved_checkpoint_files |
| [rank0]: raise EnvironmentError( |
| [rank0]: OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000. |
| [rank1]–[rank7]: identical traceback and OSError as rank 0 above; repeats elided |
| [rank0]:[W527 16:16:05.508639919 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) |
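The run fails before training starts: every rank raises the same OSError because the `--model_name_or_path` directory contains a config (the LlamaConfig dump loads fine) but none of the weight files `from_pretrained()` accepts. A minimal pre-flight check could catch this before launching. This is a hypothetical helper, not part of the safe_rlhf codebase; the filename list is taken from the OSError message above, and the sharded-checkpoint patterns are an assumption about how Transformers names shards:

```python
import os

# Weight files named in the OSError emitted by transformers' from_pretrained()
WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "tf_model.h5",
    "model.ckpt.index",
    "flax_model.msgpack",
)

def has_model_weights(model_dir: str) -> bool:
    """Return True if model_dir holds at least one recognized weight file.

    Also accepts sharded safetensors checkpoints (an assumed naming pattern,
    e.g. model-00001-of-00002.safetensors plus an index json).
    """
    try:
        names = os.listdir(model_dir)
    except FileNotFoundError:
        return False
    return any(
        n in WEIGHT_FILES
        or n.endswith(".safetensors.index.json")
        or (n.startswith("model-") and n.endswith(".safetensors"))
        for n in names
    )
```

Running this against `/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000` before launching deepspeed would have flagged the missing weights; the earlier Q1 stage likely exited before its `--save_16bit` checkpoint was written.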
|
|