| + deepspeed --master_port 20411 --module safe_rlhf.finetune --train_datasets inverse-json::/home/hansirui_1st/jiayi/resist/imdb_data/train/pos/2000/train.json --model_name_or_path /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-955k-token-2T --max_length 512 --trust_remote_code True --epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 1e-5 --lr_warmup_ratio 0 --weight_decay 0.0 --lr_scheduler_type constant --seed 42 --output_dir /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000 --log_type wandb --log_run_name imdb-tinyllama-2T-s3-Q1-2000 --log_project Inverse_Alignment_IMDb --zero_stage 3 --offload none --bf16 True --tf32 True --save_16bit |
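For reference, the ZeRO-3 setup these flags request corresponds roughly to the following DeepSpeed config. This is a minimal sketch using standard DeepSpeed keys; the config that `safe_rlhf.finetune` actually builds internally may differ.

```python
# Hypothetical reconstruction of the DeepSpeed config implied by the CLI flags above;
# safe_rlhf generates its own config, so treat this only as an illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,        # --per_device_train_batch_size 1
    "gradient_accumulation_steps": 8,           # --gradient_accumulation_steps 8
    "zero_optimization": {
        "stage": 3,                             # --zero_stage 3
        "offload_param": {"device": "none"},    # --offload none
        "offload_optimizer": {"device": "none"},
    },
    "bf16": {"enabled": True},                  # --bf16 True
}
```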
| nvcc warning : incompatible redefinition for option |
| [rank4]:[W527 14:45:20.703664214 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank3]:[W527 14:45:20.730247520 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank0]:[W527 14:45:20.738022444 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank1]:[W527 14:45:20.746656710 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank6]:[W527 14:45:20.759747687 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank2]:[W527 14:45:20.759783879 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank5]:[W527 14:45:20.760852237 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank7]:[W527 14:45:20.775217148 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
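These barrier warnings are benign as long as rank i really does sit on GPU i, but they can be silenced by binding each rank to its device before the first collective. A minimal sketch, assuming the launcher exports LOCAL_RANK and PyTorch ≥ 2.3 for the `device_id` argument:

```python
import os

import torch
import torch.distributed as dist

# Pin this process to its GPU so NCCL never has to guess the rank-to-device mapping.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),  # removes the barrier() warning above
)
```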
| loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-955k-token-2T/config.json |
| Model config LlamaConfig { |
| "architectures": [ |
| "LlamaForCausalLM" |
| ], |
| "attention_bias": false, |
| "attention_dropout": 0.0, |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "head_dim": 64, |
| "hidden_act": "silu", |
| "hidden_size": 2048, |
| "initializer_range": 0.02, |
| "intermediate_size": 5632, |
| "max_position_embeddings": 2048, |
| "mlp_bias": false, |
| "model_type": "llama", |
| "num_attention_heads": 32, |
| "num_hidden_layers": 22, |
| "num_key_value_heads": 4, |
| "pretraining_tp": 1, |
| "rms_norm_eps": 1e-05, |
| "rope_scaling": null, |
| "rope_theta": 10000.0, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float32", |
| "transformers_version": "4.52.1", |
| "use_cache": true, |
| "vocab_size": 32000 |
| } |
|
|
| loading weights file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-955k-token-2T/model.safetensors |
| Will use torch_dtype=torch.float32 as defined in model |
| Instantiating LlamaForCausalLM model under default dtype torch.float32. |
| Detected DeepSpeed ZeRO-3: activating zero.init() for this model |
| Generate config GenerationConfig { |
| "bos_token_id": 1, |
| "eos_token_id": 2 |
| } |
| All model checkpoint weights were used when initializing LlamaForCausalLM. |
| All the weights of LlamaForCausalLM were initialized from the model checkpoint at /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-955k-token-2T. |
| If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. |
| loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-955k-token-2T/generation_config.json |
| Generate config GenerationConfig { |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "max_length": 2048, |
| "pad_token_id": 0 |
| } |
| loading file tokenizer.model |
| loading file tokenizer.json |
| loading file added_tokens.json |
| loading file special_tokens_map.json |
| loading file tokenizer_config.json |
| loading file chat_template.jinja |
| You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc |
| The new embeddings will be initialized from a multivariate normal distribution that has old embeddings |
| The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings |
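The two messages above come from `resize_token_embeddings` after a pad token is added (32000 → 32001). The earlier Tensor Core warning can be avoided by padding the new vocabulary to a multiple of 64. A sketch, assuming `model` and `tokenizer` are loaded as in this run and that the added special token is a pad token:

```python
# Sketch: add the pad token, then resize to a Tensor-Core-friendly vocab size
# (32064 instead of 32001). New rows are drawn from a multivariate normal fit
# to the old embeddings, which is what the two log messages above report.
tokenizer.add_special_tokens({"pad_token": "<pad>"})  # "<pad>" is an assumption
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
```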
| Using /home/hansirui_1st/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... |
| Detected CUDA files, patching ldflags |
| Emitting ninja build file /home/hansirui_1st/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... |
| /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. |
| If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. |
| warnings.warn( |
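Pinning the target architecture before the fused_adam extension is compiled silences this warning and shortens the build. A sketch; the "8.0" value targets A100-class GPUs and is an assumption, not something in this log:

```python
import os

# Must be set before torch.utils.cpp_extension builds any CUDA extension,
# i.e. before DeepSpeed compiles fused_adam.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # adjust to your GPUs' compute capability
```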
| Building extension module fused_adam... |
| Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| Loading extension module fused_adam... |
| `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. |
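This warning is expected: the KV cache is useless during training and conflicts with activation checkpointing, so transformers turns it off. The explicit equivalent of what the trainer does (a sketch):

```python
# Sketch: what --gradient_checkpointing effectively triggers on the model.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # avoids the repeated use_cache warning
```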
| wandb: Currently logged in as: xtom to https://api.wandb.ai. Use `wandb login --relogin` to force relogin |
| wandb: Tracking run with wandb version 0.19.11 |
| wandb: Run data is saved locally in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-2T/tinyllama-2T-s3-Q1-2000/wandb/run-20250527_144536-79mv42w3 |
| wandb: Run `wandb offline` to turn off syncing. |
| wandb: Syncing run imdb-tinyllama-2T-s3-Q1-2000 |
| wandb: ⭐️ View project at https://wandb.ai/xtom/Inverse_Alignment_IMDb |
| wandb: 🚀 View run at https://wandb.ai/xtom/Inverse_Alignment_IMDb/runs/79mv42w3 |
|
Training 1/1 epoch: 0%| | 0/250 [00:00<?, ?it/s]
Training 1/1 epoch (loss 2.8161): 0%| | 1/250 [00:09<40:44, 9.82s/it]
Training 1/1 epoch (loss 2.8288): 1%| | 2/250 [00:12<23:26, 5.67s/it]
Training 1/1 epoch (loss 2.6551): 1%| | 3/250 [00:13<14:21, 3.49s/it]
Training 1/1 epoch (loss 2.8905): 2%|█ | 4/250 [00:14<11:03, 2.70s/it]
Training 1/1 epoch (loss 2.6809): 2%|█ | 5/250 [00:16<09:35, 2.35s/it]
Training 1/1 epoch (loss 2.7594): 2%|█ | 6/250 [00:17<06:48, 1.67s/it]
Training 1/1 epoch (loss 3.0797): 3%|█ | 7/250 [00:18<06:19, 1.56s/it]
Training 1/1 epoch (loss 2.8107): 3%|█ | 8/250 [00:20<06:58, 1.73s/it]
Training 1/1 epoch (loss 2.7069): 4%|█ | 9/250 [00:21<05:54, 1.47s/it]
Training 1/1 epoch (loss 2.5439): 4%|█ | 10/250 [00:22<05:59, 1.50s/it]
Training 1/1 epoch (loss 2.6100): 4%|█ | 11/250 [00:24<06:13, 1.56s/it]
Training 1/1 epoch (loss 2.8397): 5%|█ | 12/250 [00:25<05:03, 1.28s/it]
Training 1/1 epoch (loss 2.8383): 5%|█ | 13/250 [00:27<06:18, 1.60s/it]
Training 1/1 epoch (loss 2.9324): 6%|█ | 14/250 [00:30<07:17, 1.85s/it]
Training 1/1 epoch (loss 2.9199): 6%|█ | 15/250 [00:30<05:39, 1.44s/it]
Training 1/1 epoch (loss 2.7712): 6%|█ | 16/250 [00:32<06:15, 1.61s/it]
Training 1/1 epoch (loss 2.8185): 7%|█ | 17/250 [00:33<05:32, 1.43s/it]
Training 1/1 epoch (loss 2.6272): 7%|█ | 18/250 [00:34<04:39, 1.21s/it]
Training 1/1 epoch (loss 2.8921): 8%|█ | 19/250 [00:36<06:02, 1.57s/it]
Training 1/1 epoch (loss 2.8993): 8%|█ | 20/250 [00:38<06:02, 1.58s/it]
Training 1/1 epoch (loss 2.9464): 8%|█ | 21/250 [00:39<05:12, 1.36s/it]
Training 1/1 epoch (loss 2.7994): 9%|█ | 22/250 [00:41<05:48, 1.53s/it]
Training 1/1 epoch (loss 2.8943): 9%|█ | 23/250 [00:42<05:09, 1.36s/it]
Training 1/1 epoch (loss 2.7869): 10%|█ | 24/250 [00:43<05:31, 1.47s/it]
Training 1/1 epoch (loss 2.8521): 10%|█ | 25/250 [00:45<06:23, 1.70s/it]
Training 1/1 epoch (loss 2.6833): 10%|█ | 26/250 [00:46<04:53, 1.31s/it]
Training 1/1 epoch (loss 2.8133): 11%|█ | 27/250 [00:48<06:09, 1.66s/it]
Training 1/1 epoch (loss 2.7118): 11%|█ | 28/250 [00:50<06:23, 1.73s/it]
Training 1/1 epoch (loss 2.7297): 12%|██ | 29/250 [00:51<05:01, 1.36s/it]
Training 1/1 epoch (loss 2.8845): 12%|██ | 30/250 [00:53<05:43, 1.56s/it]
Training 1/1 epoch (loss 2.6467): 12%|██ | 31/250 [00:54<04:52, 1.34s/it]
Training 1/1 epoch (loss 2.9276): 13%|██ | 32/250 [00:54<04:17, 1.18s/it]
Training 1/1 epoch (loss 3.0352): 13%|██ | 33/250 [00:56<04:46, 1.32s/it]
Training 1/1 epoch (loss 2.7642): 14%|██ | 34/250 [00:58<05:18, 1.47s/it]
Training 1/1 epoch (loss 2.8325): 14%|██ | 35/250 [00:58<04:14, 1.18s/it]
Training 1/1 epoch (loss 2.8162): 14%|██ | 36/250 [01:00<04:57, 1.39s/it]
Training 1/1 epoch (loss 2.6478): 15%|██ | 37/250 [01:02<04:51, 1.37s/it]
Training 1/1 epoch (loss 2.9824): 15%|██ | 38/250 [01:02<03:49, 1.08s/it]
Training 1/1 epoch (loss 2.9810): 16%|██ | 39/250 [01:03<03:54, 1.11s/it]
Training 1/1 epoch (loss 3.0387): 16%|██ | 40/250 [01:05<04:19, 1.23s/it]
Training 1/1 epoch (loss 2.7867): 16%|██ | 41/250 [01:05<03:51, 1.11s/it]
Training 1/1 epoch (loss 2.6602): 17%|██ | 42/250 [01:07<04:34, 1.32s/it]
Training 1/1 epoch (loss 2.7890): 17%|██ | 43/250 [01:08<04:11, 1.21s/it]
Training 1/1 epoch (loss 2.8690): 18%|██ | 44/250 [01:11<05:25, 1.58s/it]
Training 1/1 epoch (loss 2.8226): 18%|██ | 45/250 [01:12<05:23, 1.58s/it]
Training 1/1 epoch (loss 2.9405): 18%|██ | 46/250 [01:13<04:17, 1.26s/it]
Training 1/1 epoch (loss 2.7847): 19%|██ | 47/250 [01:15<05:26, 1.61s/it]
Training 1/1 epoch (loss 2.7625): 19%|██ | 48/250 [01:18<06:17, 1.87s/it]
Training 1/1 epoch (loss 2.5362): 20%|██ | 49/250 [01:18<05:02, 1.50s/it]
Training 1/1 epoch (loss 2.8027): 20%|██ | 50/250 [01:20<05:26, 1.63s/it]
Training 1/1 epoch (loss 2.6851): 20%|██ | 51/250 [01:22<05:01, 1.52s/it]
Training 1/1 epoch (loss 2.6996): 21%|██ | 52/250 [01:22<04:13, 1.28s/it]
Training 1/1 epoch (loss 2.6161): 21%|██ | 53/250 [01:24<04:27, 1.36s/it]
Training 1/1 epoch (loss 2.7331): 22%|███ | 54/250 [01:25<04:36, 1.41s/it]
Training 1/1 epoch (loss 2.8265): 22%|███ | 55/250 [01:26<03:35, 1.11s/it]
Training 1/1 epoch (loss 2.3801): 22%|███ | 56/250 [01:27<04:01, 1.24s/it]
Training 1/1 epoch (loss 2.8482): 23%|███ | 57/250 [01:28<03:45, 1.17s/it]
Training 1/1 epoch (loss 2.5960): 23%|███ | 58/250 [01:29<03:42, 1.16s/it]
Training 1/1 epoch (loss 2.6183): 24%|███ | 59/250 [01:32<04:58, 1.56s/it]
Training 1/1 epoch (loss 2.7866): 24%|███ | 60/250 [01:33<04:45, 1.50s/it]
Training 1/1 epoch (loss 2.6522): 24%|███ | 61/250 [01:36<05:26, 1.73s/it]
Training 1/1 epoch (loss 2.7274): 25%|███ | 62/250 [01:37<05:10, 1.65s/it]
Training 1/1 epoch (loss 2.8282): 25%|███ | 63/250 [01:37<04:02, 1.29s/it]
[rank0]:[E527 14:47:17.057545728 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. |
|
|
| Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first): |
| frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x155553f6c1b6 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libc10.so) |
| frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x155553f15a76 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libc10.so) |
| frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x15555437d918 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) |
| frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x15550260e556 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) |
| frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x15550261b8c0 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) |
| frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x617 (0x15550261d557 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) |
| frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x15550261e6ed in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) |
| frame #7: <unknown function> + 0x145c0 (0x1555543ee5c0 in /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/lib/libtorch.so) |
| frame #8: <unknown function> + 0x94ac3 (0x15555527eac3 in /lib/x86_64-linux-gnu/libc.so.6) |
| frame #9: <unknown function> + 0x126a40 (0x155555310a40 in /lib/x86_64-linux-gnu/libc.so.6) |
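As the trace itself suggests, an asynchronous "unspecified launch failure" is easiest to localize by rerunning with synchronous kernel launches. A sketch; the variable must be set before CUDA is initialized, e.g. exported in the shell that runs `deepspeed`:

```python
import os

# Set before importing torch / before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report the failing kernel at its launch site
# Device-side assertions additionally require a PyTorch build compiled with
# TORCH_USE_CUDA_DSA, per the message above; setting it at runtime is not enough.
```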
|
|
|
|