| + deepspeed |
[rank0]:[W529 18:11:28.176038397 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. [ranks 1-7 logged the identical warning for GPUs 1-7]
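The warning suggests its own fix. A minimal sketch of that fix, assuming a torchrun/deepspeed-style launcher that sets LOCAL_RANK (this is not taken from the training script; the device_id argument requires a recent PyTorch):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun/deepspeed launchers
torch.cuda.set_device(local_rank)           # pin the rank -> GPU mapping up front
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),  # what the warning asks for
)
dist.barrier(device_ids=[local_rank])  # or pin the device per-barrier instead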
| loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-715k-1.5T/config.json |
| Model config LlamaConfig { |
| "architectures": [ |
| "LlamaForCausalLM" |
| ], |
| "attention_bias": false, |
| "attention_dropout": 0.0, |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "head_dim": 64, |
| "hidden_act": "silu", |
| "hidden_size": 2048, |
| "initializer_range": 0.02, |
| "intermediate_size": 5632, |
| "max_position_embeddings": 2048, |
| "mlp_bias": false, |
| "model_type": "llama", |
| "num_attention_heads": 32, |
| "num_hidden_layers": 22, |
| "num_key_value_heads": 4, |
| "pretraining_tp": 1, |
| "rms_norm_eps": 1e-05, |
| "rope_scaling": null, |
| "rope_theta": 10000.0, |
| "tie_word_embeddings": false, |
| "torch_dtype": "float32", |
| "transformers_version": "4.52.1", |
| "use_cache": true, |
| "vocab_size": 32000 |
| } |
|
|
loading weights file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-715k-1.5T/model.safetensors
Will use torch_dtype=torch.float32 as defined in model's config object
Instantiating LlamaForCausalLM model under default dtype torch.float32.
Detected DeepSpeed ZeRO-3: activating zero.init() for this model
| Generate config GenerationConfig { |
| "bos_token_id": 1, |
| "eos_token_id": 2 |
| } |
|
|
| All model checkpoint weights were used when initializing LlamaForCausalLM. |
|
|
| All the weights of LlamaForCausalLM were initialized from the model checkpoint at /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-715k-1.5T. |
| If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. |
| loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-715k-1.5T/generation_config.json |
| Generate config GenerationConfig { |
| "bos_token_id": 1, |
| "eos_token_id": 2, |
| "max_length": 2048, |
| "pad_token_id": 0 |
| } |
|
|
loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja
| You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc |
| The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False` |
| The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False` |
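For reference, a hedged sketch of the call these three warnings describe (assumed shape, `model` and `tokenizer` stand in for this run's objects; this is not the repository's actual code). Adding the pad token grows the vocabulary 32000 -> 32001; pad_to_multiple_of would round it up so Tensor Cores stay usable, and mean_resizing=False would opt out of the mean/covariance initialization:

tokenizer.add_special_tokens({"pad_token": "<pad>"})  # assumed token; vocab 32000 -> 32001
model.resize_token_embeddings(
    len(tokenizer),          # 32001 after the pad token is added
    pad_to_multiple_of=64,   # e.g. rounds up to 32064, avoiding the Tensor Core warning
    # mean_resizing=False,   # uncomment to skip the multivariate-normal init noted above
)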
| Using /home/hansirui_1st/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... |
| Detected CUDA files, patching ldflags |
| Emitting ninja build file /home/hansirui_1st/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... |
| /aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. |
| If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. |
| warnings.warn( |
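To silence that warning and shorten the JIT build, the architecture list can be pinned before the extension compiles, as the warning itself suggests; the value below is only an example, pick the compute capability of the actual cards:

import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # example (A100-class); set before fused_adam builds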
| Building extension module fused_adam... |
| Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| Loading extension module fused_adam... |
| `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. |
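This message is informational: gradient checkpointing recomputes activations in the backward pass, so the generation-time KV cache is never valid during training and Transformers disables it. A minimal sketch of the equivalent explicit setup (assumed, not this script's code):

model.gradient_checkpointing_enable()  # trade recompute for activation memory
model.config.use_cache = False         # cache is unused in training; silences the warning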
wandb: Currently logged in as: xtom to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
| wandb: Tracking run with wandb version 0.19.11 |
| wandb: Run data is saved locally in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-1.5T/tinyllama-1.5T-s3-Q1-1000/wandb/run-20250529_181147-o23i2ll7 |
| wandb: Run `wandb offline` to turn off syncing. |
| wandb: Syncing run imdb-tinyllama-1.5T-s3-Q1-1000 |
| wandb: βοΈ View project at https://wandb.ai/xtom/Inverse_Alignment_IMDb |
| wandb: π View run at https://wandb.ai/xtom/Inverse_Alignment_IMDb/runs/o23i2ll7 |
|
Training 1/1 epoch (loss 2.9274): 1/125 [00:05<12:17, 5.95s/it]
Training 1/1 epoch (loss 2.7548): 2/125 [00:07<06:56, 3.39s/it]
Training 1/1 epoch (loss 2.8627): 3/125 [00:07<04:03, 1.99s/it]
Training 1/1 epoch (loss 2.9692): 4/125 [00:08<02:42, 1.34s/it]
Training 1/1 epoch (loss 2.7879): 5/125 [00:08<01:58, 1.01it/s]
Training 1/1 epoch (loss 2.9316): 6/125 [00:08<01:31, 1.29it/s]
Training 1/1 epoch (loss 2.7320): 7/125 [00:09<01:13, 1.61it/s]
Training 1/1 epoch (loss 2.9199): 8/125 [00:09<01:06, 1.77it/s]
Training 1/1 epoch (loss 3.1482): 9/125 [00:10<00:56, 2.04it/s]
Training 1/1 epoch (loss 2.9366): 10/125 [00:10<00:53, 2.14it/s]
Training 1/1 epoch (loss 2.7590): 11/125 [00:10<00:49, 2.31it/s]
Training 1/1 epoch (loss 2.7615): 12/125 [00:11<00:46, 2.46it/s]
Training 1/1 epoch (loss 2.8058): 13/125 [00:11<00:42, 2.64it/s]
Training 1/1 epoch (loss 2.8318): 14/125 [00:11<00:41, 2.66it/s]
Training 1/1 epoch (loss 2.6845): 15/125 [00:12<00:39, 2.78it/s]
Training 1/1 epoch (loss 3.0140): 16/125 [00:12<00:41, 2.65it/s]
Training 1/1 epoch (loss 2.8598): 17/125 [00:12<00:39, 2.73it/s]
Training 1/1 epoch (loss 2.8118): 18/125 [00:13<00:38, 2.75it/s]
Training 1/1 epoch (loss 2.8308): 19/125 [00:13<00:36, 2.87it/s]
Training 1/1 epoch (loss 2.9261): 20/125 [00:13<00:36, 2.89it/s]
Training 1/1 epoch (loss 2.8353): 21/125 [00:14<00:36, 2.81it/s]
Training 1/1 epoch (loss 2.8962): 22/125 [00:14<00:37, 2.72it/s]
Training 1/1 epoch (loss 2.5551): 23/125 [00:15<00:38, 2.63it/s]
Training 1/1 epoch (loss 2.8156): 24/125 [00:15<00:38, 2.63it/s]
Training 1/1 epoch (loss 2.4137): 25/125 [00:15<00:37, 2.66it/s]
Training 1/1 epoch (loss 2.7885): 26/125 [00:16<00:36, 2.70it/s]
Training 1/1 epoch (loss 2.6982): 27/125 [00:16<00:36, 2.71it/s]
Training 1/1 epoch (loss 2.5274): 28/125 [00:16<00:34, 2.81it/s]
Training 1/1 epoch (loss 2.8341): 29/125 [00:17<00:33, 2.87it/s]
Training 1/1 epoch (loss 3.0866): 30/125 [00:17<00:32, 2.92it/s]
Training 1/1 epoch (loss 2.7703): 31/125 [00:17<00:31, 2.97it/s]
Training 1/1 epoch (loss 2.7965): 32/125 [00:18<00:34, 2.68it/s]
Training 1/1 epoch (loss 2.7787): 33/125 [00:18<00:40, 2.25it/s]
Training 1/1 epoch (loss 2.8251): 34/125 [00:19<00:38, 2.36it/s]
Training 1/1 epoch (loss 2.8779): 35/125 [00:19<00:37, 2.39it/s]
Training 1/1 epoch (loss 2.7563): 36/125 [00:20<00:35, 2.51it/s]
Training 1/1 epoch (loss 2.7845): 37/125 [00:20<00:33, 2.59it/s]
Training 1/1 epoch (loss 2.7431): 38/125 [00:20<00:33, 2.56it/s]
Training 1/1 epoch (loss 2.9083): 39/125 [00:21<00:31, 2.73it/s]
Training 1/1 epoch (loss 2.6759): 40/125 [00:21<00:30, 2.80it/s]
Training 1/1 epoch (loss 2.7976): 41/125 [00:21<00:29, 2.82it/s]
Training 1/1 epoch (loss 2.7812): 42/125 [00:22<00:28, 2.93it/s]
Training 1/1 epoch (loss 2.7319): 43/125 [00:22<00:28, 2.88it/s]
Training 1/1 epoch (loss 2.7975): 44/125 [00:22<00:28, 2.86it/s]
Training 1/1 epoch (loss 2.6232): 45/125 [00:23<00:28, 2.82it/s]
Training 1/1 epoch (loss 2.6662): 46/125 [00:23<00:29, 2.66it/s]
Training 1/1 epoch (loss 2.8750): 47/125 [00:23<00:28, 2.76it/s]
Training 1/1 epoch (loss 2.9271): 48/125 [00:24<00:27, 2.78it/s]
Training 1/1 epoch (loss 2.7745): 49/125 [00:24<00:30, 2.52it/s]
Training 1/1 epoch (loss 2.7843): 50/125 [00:25<00:29, 2.50it/s]
Training 1/1 epoch (loss 2.6047): 51/125 [00:25<00:27, 2.68it/s]
Training 1/1 epoch (loss 2.7916): 52/125 [00:25<00:26, 2.71it/s]
Training 1/1 epoch (loss 2.6912): 53/125 [00:26<00:24, 2.89it/s]
Training 1/1 epoch (loss 2.8029): 54/125 [00:26<00:24, 2.86it/s]
Training 1/1 epoch (loss 2.6960): 55/125 [00:26<00:23, 2.93it/s]
Training 1/1 epoch (loss 2.7062): 56/125 [00:27<00:23, 2.89it/s]
Training 1/1 epoch (loss 2.8007): 57/125 [00:27<00:23, 2.86it/s]
Training 1/1 epoch (loss 2.8206): 58/125 [00:27<00:22, 2.94it/s]
Training 1/1 epoch (loss 2.7288): 59/125 [00:28<00:21, 3.01it/s]
Training 1/1 epoch (loss 2.8187): 60/125 [00:28<00:21, 3.03it/s]
Training 1/1 epoch (loss 2.5610): 61/125 [00:28<00:20, 3.09it/s]
Training 1/1 epoch (loss 2.8271): 62/125 [00:29<00:22, 2.84it/s]
Training 1/1 epoch (loss 2.7825): 63/125 [00:29<00:22, 2.81it/s]
Training 1/1 epoch (loss 2.7111): 64/125 [00:30<00:23, 2.64it/s]
Training 1/1 epoch (loss 2.8104): 65/125 [00:30<00:22, 2.66it/s]
Training 1/1 epoch (loss 2.7369): 66/125 [00:30<00:21, 2.80it/s]
Training 1/1 epoch (loss 2.7203): 67/125 [00:31<00:20, 2.78it/s]
Training 1/1 epoch (loss 2.8001): 68/125 [00:31<00:20, 2.83it/s]
Training 1/1 epoch (loss 2.8761): 69/125 [00:31<00:19, 2.94it/s]
Training 1/1 epoch (loss 2.5881): 70/125 [00:32<00:18, 3.02it/s]
Training 1/1 epoch (loss 2.7823): 71/125 [00:32<00:18, 2.99it/s]
Training 1/1 epoch (loss 2.7258): 72/125 [00:32<00:17, 3.08it/s]
Training 1/1 epoch (loss 2.7382): 73/125 [00:33<00:18, 2.81it/s]
Training 1/1 epoch (loss 2.5548): 74/125 [00:33<00:18, 2.78it/s]
Training 1/1 epoch (loss 2.7749): 75/125 [00:33<00:17, 2.90it/s]
Training 1/1 epoch (loss 2.6133): 76/125 [00:34<00:16, 2.92it/s]
Training 1/1 epoch (loss 2.6752): 77/125 [00:34<00:17, 2.76it/s]
Training 1/1 epoch (loss 2.6427): 78/125 [00:34<00:17, 2.75it/s]
Training 1/1 epoch (loss 2.7931): 79/125 [00:35<00:18, 2.53it/s]
Training 1/1 epoch (loss 2.7458): 80/125 [00:35<00:17, 2.54it/s]
Training 1/1 epoch (loss 2.8628): 81/125 [00:36<00:16, 2.64it/s]
Training 1/1 epoch (loss 2.6471): 82/125 [00:36<00:15, 2.72it/s]
Training 1/1 epoch (loss 2.9064): 83/125 [00:36<00:15, 2.79it/s]
Training 1/1 epoch (loss 2.7037): 84/125 [00:37<00:14, 2.79it/s]
Training 1/1 epoch (loss 2.8743): 85/125 [00:37<00:14, 2.76it/s]
Training 1/1 epoch (loss 2.7852): 86/125 [00:37<00:13, 2.79it/s]
Training 1/1 epoch (loss 2.8241): 87/125 [00:38<00:12, 2.94it/s]
Training 1/1 epoch (loss 2.6520): 88/125 [00:38<00:12, 2.88it/s]
Training 1/1 epoch (loss 2.6442): 89/125 [00:38<00:12, 2.86it/s]
Training 1/1 epoch (loss 2.7283): 90/125 [00:39<00:12, 2.86it/s]
Training 1/1 epoch (loss 2.9649): 91/125 [00:39<00:11, 2.86it/s]
Training 1/1 epoch (loss 2.5088): 92/125 [00:39<00:11, 2.89it/s]
Training 1/1 epoch (loss 2.6447): 93/125 [00:40<00:10, 2.99it/s]
Training 1/1 epoch (loss 2.6180): 94/125 [00:40<00:10, 2.83it/s]
Training 1/1 epoch (loss 2.6422): 95/125 [00:41<00:11, 2.72it/s]
Training 1/1 epoch (loss 2.7959): 96/125 [00:41<00:10, 2.80it/s]
Training 1/1 epoch (loss 2.6846): 97/125 [00:41<00:09, 2.83it/s]
Training 1/1 epoch (loss 2.9189): 98/125 [00:42<00:09, 2.84it/s]
Training 1/1 epoch (loss 2.7321): 99/125 [00:42<00:09, 2.84it/s]
Training 1/1 epoch (loss 2.6912): 100/125 [00:42<00:08, 2.97it/s]
Training 1/1 epoch (loss 2.8075): 101/125 [00:43<00:08, 2.91it/s]
Training 1/1 epoch (loss 2.5921): 102/125 [00:43<00:07, 2.98it/s]
Training 1/1 epoch (loss 2.6830): 103/125 [00:43<00:07, 2.93it/s]
Training 1/1 epoch (loss 2.6955): 104/125 [00:44<00:07, 2.90it/s]
Training 1/1 epoch (loss 2.7683): 105/125 [00:44<00:07, 2.85it/s]
Training 1/1 epoch (loss 2.6650): 106/125 [00:44<00:06, 2.72it/s]
Training 1/1 epoch (loss 2.8653): 107/125 [00:45<00:06, 2.57it/s]
Training 1/1 epoch (loss 2.7362): 108/125 [00:45<00:06, 2.62it/s]
Training 1/1 epoch (loss 2.6535): 109/125 [00:46<00:05, 2.70it/s]
Training 1/1 epoch (loss 2.6683): 110/125 [00:46<00:05, 2.76it/s]
Training 1/1 epoch (loss 2.7594): 111/125 [00:46<00:04, 2.89it/s]
Training 1/1 epoch (loss 2.6514): 112/125 [00:47<00:04, 2.88it/s]
Training 1/1 epoch (loss 2.5921): 113/125 [00:47<00:04, 2.96it/s]
Training 1/1 epoch (loss 2.6020): 114/125 [00:47<00:03, 3.08it/s]
Training 1/1 epoch (loss 2.8603): 115/125 [00:48<00:03, 2.82it/s]
Training 1/1 epoch (loss 2.8380): 116/125 [00:48<00:03, 2.56it/s]
Training 1/1 epoch (loss 2.7836): 117/125 [00:48<00:02, 2.71it/s]
Training 1/1 epoch (loss 2.6598): 118/125 [00:49<00:02, 2.73it/s]
Training 1/1 epoch (loss 2.5945): 119/125 [00:49<00:02, 2.65it/s]
Training 1/1 epoch (loss 2.5718): 120/125 [00:49<00:01, 2.75it/s]
Training 1/1 epoch (loss 2.6172): 121/125 [00:50<00:01, 2.77it/s]
Training 1/1 epoch (loss 2.7006): 122/125 [00:50<00:01, 2.76it/s]
Training 1/1 epoch (loss 2.7602): 123/125 [00:51<00:00, 2.79it/s]
Training 1/1 epoch (loss 2.6099): 124/125 [00:51<00:00, 2.87it/s]
Training 1/1 epoch (loss 2.9457): 125/125 [00:51<00:00, 2.91it/s]
Training 1/1 epoch (loss 2.9457): 125/125 [00:51<00:00, 2.42it/s]
| tokenizer config file saved in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-1.5T/tinyllama-1.5T-s3-Q1-1000/tokenizer_config.json |
| Special tokens file saved in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-1.5T/tinyllama-1.5T-s3-Q1-1000/special_tokens_map.json |
| wandb: ERROR Problem finishing run |
| Exception ignored in atexit callback: <bound method rank_zero_only.<locals>.wrapper of <safe_rlhf.logger.Logger object at 0x1550cc1bd4d0>> |
| Traceback (most recent call last): |
| File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/utils.py", line 212, in wrapper |
| return func(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^^^^ |
| File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/logger.py", line 183, in close |
| self.wandb.finish() |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 406, in wrapper |
| return func(self, *args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 503, in wrapper |
| return func(self, *args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 451, in wrapper |
| return func(self, *args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2309, in finish |
| return self._finish(exit_code) |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 406, in wrapper |
| return func(self, *args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2337, in _finish |
| self._atexit_cleanup(exit_code=exit_code) |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2550, in _atexit_cleanup |
| self._on_finish() |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2806, in _on_finish |
| wait_with_progress( |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 24, in wait_with_progress |
| return wait_all_with_progress( |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 87, in wait_all_with_progress |
| return asyncio_compat.run(progress_loop_with_timeout) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/lib/asyncio_compat.py", line 27, in run |
| future = executor.submit(runner.run, fn) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/concurrent/futures/thread.py", line 169, in submit |
| raise RuntimeError('cannot schedule new futures after ' |
| RuntimeError: cannot schedule new futures after interpreter shutdown |
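The traceback is a shutdown-ordering problem, not a training failure: the logger's atexit-registered close() calls wandb.finish() after the interpreter has begun tearing down thread pools, so asyncio can no longer schedule futures, and the tokenizer files above were already saved. A hedged sketch of the usual workaround (assumed, not a patch to safe_rlhf): finish the run explicitly before exit so the atexit hook becomes a no-op:

import wandb

def main() -> None:
    run = wandb.init(project="Inverse_Alignment_IMDb")  # project name taken from the log above
    ...  # training loop
    run.finish()  # flush and close while the interpreter is still fully alive

if __name__ == "__main__":
    main()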
|
|