Qwen2.5-14B-Instruct-pos / training.log
Upload folder using huggingface_hub (commit 6cb2287, verified)
[2025-04-18 17:39:49,071] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:51,004] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2025-04-18 17:39:51,005] [INFO] [runner.py:605:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed scripts/newzero3.json --seed 42 --model_name_or_path /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl --output_dir /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy no --learning_rate 2e-6 --lr_scheduler_type cosine --save_only_model True --remove_unused_columns False --warmup_ratio 0.03 --num_train_epochs 3 --logging_steps 1 --report_to tensorboard --gradient_checkpointing True --overwrite_output_dir --bf16 True
[2025-04-18 17:39:52,485] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:54,406] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2025-04-18 17:39:54,406] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2025-04-18 17:39:54,406] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2025-04-18 17:39:54,406] [INFO] [launch.py:164:main] dist_world_size=8
[2025-04-18 17:39:54,406] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2025-04-18 17:39:54,407] [INFO] [launch.py:256:main] process 1993326 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,407] [INFO] [launch.py:256:main] process 1993327 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,408] [INFO] [launch.py:256:main] process 1993328 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,408] [INFO] [launch.py:256:main] process 1993329 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,408] [INFO] [launch.py:256:main] process 1993330 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,409] [INFO] [launch.py:256:main] process 1993331 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,409] [INFO] [launch.py:256:main] process 1993332 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-18 17:39:54,410] [INFO] [launch.py:256:main] process 1993333 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl', '--output_dir', '/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
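The eight per-rank spawn commands above differ only in `--local_rank`. Assuming `scripts/newzero3.json` and the model/data paths exist as logged, the run corresponds to a single launcher invocation (a reconstruction from the `cmd =` line above, not itself taken from the log):

```shell
deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 29500 train.py \
  --deepspeed scripts/newzero3.json \
  --seed 42 \
  --model_name_or_path /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct \
  --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/14K_pos_tokenzied_cl37.jsonl \
  --output_dir /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 2 \
  --evaluation_strategy no --save_strategy no \
  --learning_rate 2e-6 --lr_scheduler_type cosine --warmup_ratio 0.03 \
  --num_train_epochs 3 --logging_steps 1 --report_to tensorboard \
  --save_only_model True --remove_unused_columns False \
  --gradient_checkpointing True --overwrite_output_dir --bf16 True
```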
[2025-04-18 17:39:57,677] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:57,972] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,060] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,063] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,127] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,136] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,143] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-18 17:39:58,152] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of πŸ€— Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-18 17:39:59,681] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:39:59,974] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,077] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,193] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,215] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,217] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,286] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,303] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-18 17:40:00,303] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING:__main__:Process rank: 7, device: cuda:7, n_gpu: 1
[2025-04-18 17:40:01,226] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:01,228 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1
INFO:__main__:Training parameters CustomTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=scripts/newzero3.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
kl_coeff=0.0,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/runs/Apr18_17-40-00_nacamontrealdc1-p2r203n1.enovum.hivecloud.com,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO,
save_on_each_node=False,
save_only_model=True,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
)
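For reference, the effective optimization batch size implied by these arguments is per_device_train_batch_size × gradient_accumulation_steps × 8 GPUs. The sketch below checks it, along with an approximate schedule length; the step counts are approximate because they assume DistributedSampler-style even sharding with padding, and the Trainer's exact rounding may differ:

```python
import math

# Values taken from the training arguments and dataset log in this file.
per_device_batch = 1
grad_accum = 2
world_size = 8       # 8 local GPUs
num_examples = 1966  # "Generating train split: 1966 examples"
num_epochs = 3
warmup_ratio = 0.03

# Effective examples consumed per optimizer step.
global_batch = per_device_batch * grad_accum * world_size
print(global_batch)

# Approximate schedule length under the even-sharding assumption above.
micro_steps_per_epoch = math.ceil(num_examples / (per_device_batch * world_size))
optim_steps_per_epoch = math.ceil(micro_steps_per_epoch / grad_accum)
total_steps = optim_steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
print(total_steps, warmup_steps)
```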
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,960 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,960 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,961 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,961 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,961 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,961 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-04-18 17:40:01,961 >> loading file chat_template.jinja
WARNING:__main__:Process rank: 4, device: cuda:4, n_gpu: 1
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1
WARNING:__main__:Process rank: 5, device: cuda:5, n_gpu: 1
WARNING:__main__:Process rank: 6, device: cuda:6, n_gpu: 1
[2025-04-18 17:40:02,255] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[INFO|tokenization_utils_base.py:2323] 2025-04-18 17:40:02,257 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-04-18 17:40:02,257 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-04-18 17:40:02,259 >> Model config Qwen2Config {
"architectures": [ "Qwen2ForCausalLM" ],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 48,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.50.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
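The log later reports `num_params = 579, num_elems = 14.77B` for this model. Assuming the standard Qwen2ForCausalLM layout (biases on the Q/K/V projections only, untied embeddings), both numbers can be re-derived from the config above as a sanity check:

```python
# Re-derive the parameter count of Qwen2.5-14B from its config.
hidden = 5120
layers = 48
inter = 13824
heads = 40
kv_heads = 8
vocab = 152064
head_dim = hidden // heads  # 128

q = hidden * hidden + hidden                                # q_proj weight + bias
kv = hidden * (kv_heads * head_dim) + kv_heads * head_dim   # k_proj or v_proj
o = hidden * hidden                                         # o_proj, no bias in Qwen2
attn = q + 2 * kv + o
mlp = 3 * hidden * inter                                    # gate/up/down, no bias
norms = 2 * hidden                                          # two RMSNorm weights
per_layer = attn + mlp + norms

embeddings = vocab * hidden   # input embeddings
lm_head = vocab * hidden      # untied output head (tie_word_embeddings: false)
final_norm = hidden

total = layers * per_layer + embeddings + lm_head + final_norm
print(f"{total / 1e9:.2f}B")  # matches the 14.77B reported by zero.init

# Tensor count: 12 tensors per layer (4 attn weights, 3 attn biases,
# 3 MLP weights, 2 norms), plus embeddings, lm_head, and the final norm.
tensors = layers * 12 + 3
print(tensors)                # matches num_params = 579
```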
[INFO|modeling_utils.py:1151] 2025-04-18 17:40:02,292 >> loading weights file /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1225] 2025-04-18 17:40:02,292 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:2170] 2025-04-18 17:40:02,292 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3747] 2025-04-18 17:40:02,293 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-04-18 17:40:02,293] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,296 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1139] 2025-04-18 17:40:02,302 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
[2025-04-18 17:40:02,352] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,354 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-18 17:40:02,381] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,383 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-18 17:40:02,438] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,441 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-18 17:40:02,443] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,445 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-18 17:40:02,510] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-18 17:40:02,513 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-18 17:40:19,334] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 579, num_elems = 14.77B
Loading checkpoint shards: 100%|██████████| 8/8 [00:05<00:00, ~1.4it/s]  (interleaved progress bars from 8 ranks collapsed)
[INFO|modeling_utils.py:4987] 2025-04-18 17:40:25,161 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4995] 2025-04-18 17:40:25,162 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1092] 2025-04-18 17:40:25,166 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct/generation_config.json
[INFO|configuration_utils.py:1139] 2025-04-18 17:40:25,167 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
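This generation config comes from the base checkpoint and is not used during training, but the sampling it describes composes in a fixed order: scale logits by 1/temperature, keep the top_k most likely tokens, then keep the smallest prefix whose cumulative probability reaches top_p. A toy, framework-free sketch with the logged defaults (the exact tie-breaking in transformers' logits processors may differ):

```python
import math

def filter_logits(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Toy temperature + top-k + top-p filtering.

    Returns a dict {token_index: probability} over the surviving tokens.
    """
    scaled = [x / temperature for x in logits]
    # Keep the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (max-subtracted for numerical stability).
    m = max(scaled[i] for i in order)
    weights = [(i, math.exp(scaled[i] - m)) for i in order]
    z = sum(w for _, w in weights)
    probs = [(i, w / z) for i, w in weights]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z2 = sum(p for _, p in kept)
    return {i: p / z2 for i, p in kept}

probs = filter_logits([5.0, 4.0, 1.0, 0.5])
print(probs)  # here the most likely token alone exceeds top_p, so only it survives
```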
Generating train split: 0 examples [00:00, ? examples/s]
INFO:datasets.builder:Using custom data configuration default-3588628d8dd0ad31
INFO:datasets.info:Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating train split: 1966 examples [00:00, 12549.10 examples/s]
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
INFO:datasets.builder:Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-3588628d8dd0ad31/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-3588628d8dd0ad31/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
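The FutureWarning above is emitted once per rank by the Trainer-style `__init__`, which accepts the old `tokenizer` keyword and forwards it to the new `processing_class` parameter. A minimal, self-contained sketch of that deprecation pattern (the class below is hypothetical, not the actual `OfflineREINFORCETrainer`):

```python
import warnings

class OfflineREINFORCETrainerSketch:
    """Hypothetical sketch of the deprecation pattern behind the warning above:
    the old `tokenizer` kwarg still works, but warns and is forwarded to the
    new `processing_class` parameter (slated for removal in transformers v5)."""

    def __init__(self, processing_class=None, tokenizer=None, **kwargs):
        if tokenizer is not None:
            warnings.warn(
                "`tokenizer` is deprecated and will be removed in version 5.0.0 "
                "for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.",
                FutureWarning,
            )
            if processing_class is None:
                processing_class = tokenizer  # old kwarg keeps working for now
        self.processing_class = processing_class
```

Passing `processing_class=tokenizer` instead of `tokenizer=tokenizer` at the call site in train.py:274 would silence the warning, assuming the trainer mirrors the upstream `transformers.Trainer` signature.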
[INFO|trainer.py:748] 2025-04-18 17:40:25,647 >> Using auto half precision backend
INFO:__main__:*** Train ***
[INFO|deepspeed.py:386] 2025-04-18 17:40:25,925 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.304741859436035 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000002, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-04-18 17:40:29,752] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown
[2025-04-18 17:40:29,752] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[2025-04-18 17:40:29,789] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-18 17:40:29,792] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-18 17:40:29,792] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-18 17:40:29,835] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-04-18 17:40:29,835] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-04-18 17:40:29,835] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-04-18 17:40:29,835] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-04-18 17:40:29,974] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-04-18 17:40:29,975] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 2.9 GB CA 0.0 GB Max_CA 3 GB
[2025-04-18 17:40:29,975] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 77.24 GB, percent = 7.7%
[2025-04-18 17:40:29,977] [INFO] [stage3.py:170:__init__] Reduce bucket size 100000000
[2025-04-18 17:40:29,977] [INFO] [stage3.py:171:__init__] Prefetch bucket size 100000000
[2025-04-18 17:40:30,086] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-18 17:40:30,086] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:30,086] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 77.22 GB, percent = 7.7%
Parameter Offload: Total persistent parameters: 840704 in 241 params
[2025-04-18 17:40:30,245] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-18 17:40:30,246] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:30,246] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 77.24 GB, percent = 7.7%
[2025-04-18 17:40:30,358] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-18 17:40:30,359] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:30,359] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 77.24 GB, percent = 7.7%
[2025-04-18 17:40:45,393] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 18
[2025-04-18 17:40:45,398] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:45,398] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 116.43 GB, percent = 11.6%
[2025-04-18 17:40:45,741] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-18 17:40:45,741] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:45,742] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 126.24 GB, percent = 12.5%
[2025-04-18 17:40:47,555] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-18 17:40:47,555] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:47,555] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 168.5 GB, percent = 16.7%
[2025-04-18 17:40:47,781] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-18 17:40:47,782] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:47,782] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 176.75 GB, percent = 17.5%
[2025-04-18 17:40:53,807] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-18 17:40:53,808] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-18 17:40:53,808] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 254.94 GB, percent = 25.3%
[2025-04-18 17:40:53,808] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized
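The CPU-memory growth logged above (77.24 GB at "Stage 3 initialize beginning" to 254.94 GB at "After initializing optimizer states") can be sanity-checked with a back-of-envelope estimate for ZeRO-3 with full CPU offload. This uses the standard per-parameter byte accounting, not DeepSpeed's exact bookkeeping; pinned-buffer padding and lazy allocation explain the gap:

```python
# Rough CPU-RAM estimate for ZeRO-3 with parameter + optimizer CPU offload.
# Components mirror the init phases in the log: 16-bit parameter partitions,
# fp32 master weights, and the two Adam moment buffers.
N_PARAMS = 14_770_033_664  # "Number of trainable parameters" from the log

BYTES_PER_PARAM = {
    "bf16_params": 2,       # offloaded 16-bit parameter partitions
    "fp32_master": 4,       # fp32 master weights ("fp32 partitions")
    "adam_exp_avg": 4,      # Adam first moment (fp32)
    "adam_exp_avg_sq": 4,   # Adam second moment (fp32)
}

def estimate_gib(n_params: int) -> float:
    """Upper-bound CPU footprint in GiB for the offloaded state."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1024**3
```

For 14.77B parameters this gives roughly 193 GiB as an upper bound, in the same ballpark as the ~178 GB growth actually observed between the two `see_memory_usage` lines.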
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
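The UserWarning above fires because `data_collator.py:741` calls `torch.tensor()` on a Python list of per-example `numpy.ndarray`s, which copies element by element. The suggested fix is to collapse the list into one contiguous ndarray first. A minimal numpy-only sketch (the final `torch` call is shown as a comment so the snippet runs without torch):

```python
import numpy as np

# Per-example label arrays as a collator would see them (illustrative data).
labels_list = [np.array([1, 2, 3]), np.array([4, 5, 6])]

# Slow path (what the warning flags):
#   torch.tensor(labels_list)          # iterates the list element-by-element
# Fast path: one contiguous array, then a cheap tensor conversion:
labels_np = np.stack(labels_list)      # shape (batch, seq_len)
# batch["labels"] = torch.from_numpy(labels_np).to(torch.int64)  # hypothetical fix
```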
[WARNING|logging.py:329] 2025-04-18 17:40:56,892 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[2025-04-18 17:40:57,000] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-18 17:40:57,001] [INFO] [utils.py:782:see_memory_usage] MA 0.19 GB Max_MA 3.09 GB CA 3.09 GB Max_CA 3 GB
[2025-04-18 17:40:57,001] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 284.3 GB, percent = 28.2%
[2025-04-18 17:40:57,001] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-04-18 17:40:57,001] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-04-18 17:40:57,001] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-18 17:40:57,001] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-18 17:40:57,002] [INFO] [config.py:1000:print] DeepSpeedEngine configuration:
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] amp_enabled .................. False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] amp_params ................... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] bfloat16_enabled ............. True
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] bfloat16_immediate_grad_update True
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] checkpoint_parallel_write_pipeline False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] checkpoint_tag_validation_enabled True
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] checkpoint_tag_validation_fail False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x79bc673361a0>
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] communication_data_type ...... None
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] curriculum_enabled_legacy .... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] curriculum_params_legacy ..... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] data_efficiency_enabled ...... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] dataloader_drop_last ......... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] disable_allgather ............ False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] dump_state ................... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] dynamic_loss_scale_args ...... None
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_enabled ........... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_gas_boundary_resolution 1
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_layer_num ......... 0
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_max_iter .......... 100
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_stability ......... 1e-06
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_tol ............... 0.01
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] eigenvalue_verbose ........... False
[2025-04-18 17:40:57,003] [INFO] [config.py:1004:print] elasticity_enabled ........... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] fp16_auto_cast ............... None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] fp16_enabled ................. False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] fp16_master_weights_and_gradients False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] global_rank .................. 0
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] grad_accum_dtype ............. None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] gradient_accumulation_steps .. 2
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] gradient_clipping ............ 1.0
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] gradient_predivide_factor .... 1.0
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] graph_harvesting ............. False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] initial_dynamic_scale ........ 1
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] load_universal_checkpoint .... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] loss_scale ................... 1.0
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] memory_breakdown ............. False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] mics_hierarchial_params_gather False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] mics_shard_size .............. -1
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] optimizer_legacy_fusion ...... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] optimizer_name ............... None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] optimizer_params ............. None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] pld_enabled .................. False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] pld_params ................... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] prescale_gradients ........... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] scheduler_name ............... None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] scheduler_params ............. None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] seq_parallel_communication_data_type torch.float32
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] sparse_attention ............. None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] sparse_gradients_enabled ..... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] steps_per_print .............. inf
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] timers_config ................ enabled=True synchronized=True
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] train_batch_size ............. 16
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] train_micro_batch_size_per_gpu 1
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] use_data_before_expert_parallel_ False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] use_node_local_storage ....... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] wall_clock_breakdown ......... False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] weight_quantization_config ... None
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] world_size ................... 8
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] zero_allow_untested_optimizer True
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=100000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=100000000 max_reuse_distance=100000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] zero_enabled ................. True
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] zero_force_ds_cpu_optimizer .. True
[2025-04-18 17:40:57,004] [INFO] [config.py:1004:print] zero_optimization_stage ...... 3
[2025-04-18 17:40:57,004] [INFO] [config.py:990:print_user_config] json = {
"fp16": {
"enabled": false
},
"bf16": {
"enabled": true
},
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 2,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+08,
"reduce_bucket_size": 1.000000e+08,
"stage3_prefetch_bucket_size": 1.000000e+08,
"stage3_param_persistence_threshold": 1.000000e+05,
"stage3_max_live_parameters": 1.000000e+08,
"stage3_max_reuse_distance": 1.000000e+08,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"steps_per_print": inf,
"zero_allow_untested_optimizer": true
}
[INFO|trainer.py:2409] 2025-04-18 17:40:57,005 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-04-18 17:40:57,005 >> Num examples = 1,966
[INFO|trainer.py:2411] 2025-04-18 17:40:57,005 >> Num Epochs = 3
[INFO|trainer.py:2412] 2025-04-18 17:40:57,005 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2415] 2025-04-18 17:40:57,005 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2416] 2025-04-18 17:40:57,005 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2417] 2025-04-18 17:40:57,005 >> Total optimization steps = 369
[INFO|trainer.py:2418] 2025-04-18 17:40:57,006 >> Number of trainable parameters = 14,770,033,664
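The step count and the learning rates logged below are reproducible from the launch arguments: global batch = 1 per device × 8 GPUs × 2 accumulation steps = 16; 1,966 examples give 123 updates per epoch × 3 epochs = 369 total steps; `warmup_ratio 0.03` gives ceil(11.07) = 12 warmup steps, then cosine decay from the 2e-6 peak. A sketch assuming the trainer uses transformers-style `get_cosine_schedule_with_warmup` (linear warmup, half-cosine to zero):

```python
import math

# Reconstructing the schedule from the launch arguments (assumes the default
# transformers cosine schedule with num_cycles = 0.5).
NUM_EXAMPLES, EPOCHS = 1966, 3
GLOBAL_BATCH = 1 * 8 * 2                 # per-device batch x 8 GPUs x grad-accum 2
TOTAL_STEPS = math.ceil(NUM_EXAMPLES / GLOBAL_BATCH) * EPOCHS  # 123 * 3 = 369
WARMUP_STEPS = math.ceil(0.03 * TOTAL_STEPS)                   # ceil(11.07) = 12
PEAK_LR = 2e-6

def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

This reproduces the logged values: `lr_at(1)` = 1.6667e-07, `lr_at(12)` = 2e-06 (end of warmup), and `lr_at(13)` = 1.99996e-06, matching the metrics at steps 1, 12, and 13 in the progress lines.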
  0%|          | 0/369 [00:00<?, ?it/s]
[WARNING|logging.py:329] 2025-04-18 17:40:57,065 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
0%| | 1/369 [00:11<1:09:17, 11.30s/it] {'loss': 0.076, 'grad_norm': 1.748603105545044, 'learning_rate': 1.6666666666666665e-07, 'kl': 0.0016, 'entropy': -0.0776, 'ce_loss': 0.026, 'epoch': 0.01}
1%| | 2/369 [00:17<51:41, 8.45s/it] {'loss': 0.0651, 'grad_norm': 2.0943100452423096, 'learning_rate': 3.333333333333333e-07, 'kl': 0.0, 'entropy': -0.0219, 'ce_loss': 0.0368, 'epoch': 0.02}
1%| | 3/369 [00:24<46:02, 7.55s/it] {'loss': 0.066, 'grad_norm': 1.8861767053604126, 'learning_rate': 5e-07, 'kl': -0.0016, 'entropy': -0.0352, 'ce_loss': 0.0367, 'epoch': 0.02}
1%| | 4/369 [00:30<43:27, 7.14s/it] {'loss': 0.0819, 'grad_norm': 2.0635530948638916, 'learning_rate': 6.666666666666666e-07, 'kl': -0.0005, 'entropy': -0.1602, 'ce_loss': 0.0254, 'epoch': 0.03}
1%|▏ | 5/369 [00:37<41:57, 6.91s/it] {'loss': 0.0799, 'grad_norm': 1.9142074584960938, 'learning_rate': 8.333333333333333e-07, 'kl': 0.0001, 'entropy': -0.0505, 'ce_loss': 0.0275, 'epoch': 0.04}
2%|▏ | 6/369 [00:43<40:47, 6.74s/it] {'loss': 0.0801, 'grad_norm': 1.8947317600250244, 'learning_rate': 1e-06, 'kl': 0.0014, 'entropy': -0.0559, 'ce_loss': 0.0278, 'epoch': 0.05}
2%|▏ | 7/369 [00:50<40:00, 6.63s/it] {'loss': 0.0681, 'grad_norm': 1.3867923021316528, 'learning_rate': 1.1666666666666668e-06, 'kl': 0.005, 'entropy': -0.0245, 'ce_loss': 0.0308, 'epoch': 0.06}
2%|▏ | 8/369 [00:56<39:27, 6.56s/it] {'loss': 0.0581, 'grad_norm': 0.927176296710968, 'learning_rate': 1.3333333333333332e-06, 'kl': 0.0087, 'entropy': -0.0679, 'ce_loss': 0.0295, 'epoch': 0.07}
2%|▏ | 9/369 [01:02<39:13, 6.54s/it] {'loss': 0.0686, 'grad_norm': 0.9600820541381836, 'learning_rate': 1.5e-06, 'kl': 0.0023, 'entropy': -0.0332, 'ce_loss': 0.0321, 'epoch': 0.07}
3%|▎ | 10/369 [01:09<38:47, 6.48s/it] {'loss': 0.0692, 'grad_norm': 0.8277323246002197, 'learning_rate': 1.6666666666666667e-06, 'kl': 0.0093, 'entropy': -0.0549, 'ce_loss': 0.0398, 'epoch': 0.08}
3%|▎ | 11/369 [01:15<38:29, 6.45s/it] {'loss': 0.0774, 'grad_norm': 0.9589397311210632, 'learning_rate': 1.833333333333333e-06, 'kl': 0.0054, 'entropy': -0.02, 'ce_loss': 0.0325, 'epoch': 0.09}
3%|▎ | 12/369 [01:22<38:09, 6.41s/it] {'loss': 0.0806, 'grad_norm': 1.1231387853622437, 'learning_rate': 2e-06, 'kl': 0.0088, 'entropy': -0.0698, 'ce_loss': 0.0399, 'epoch': 0.1}
4%|▎ | 13/369 [01:28<38:00, 6.41s/it] {'loss': 0.0572, 'grad_norm': 1.2012969255447388, 'learning_rate': 1.9999612804309577e-06, 'kl': 0.0098, 'entropy': -0.0586, 'ce_loss': 0.0221, 'epoch': 0.11}
4%|▍ | 14/369 [01:34<37:54, 6.41s/it] {'loss': 0.0594, 'grad_norm': 0.8592827320098877, 'learning_rate': 1.9998451247222414e-06, 'kl': 0.0019, 'entropy': -0.033, 'ce_loss': 0.0367, 'epoch': 0.11}
4%|▍ | 15/369 [01:41<37:40, 6.39s/it] {'loss': 0.0634, 'grad_norm': 0.9637750387191772, 'learning_rate': 1.9996515418688487e-06, 'kl': 0.0081, 'entropy': -0.042, 'ce_loss': 0.0615, 'epoch': 0.12}
4%|▍ | 16/369 [01:47<37:14, 6.33s/it] {'loss': 0.0632, 'grad_norm': 1.069898009300232, 'learning_rate': 1.999380546861669e-06, 'kl': 0.0359, 'entropy': -0.064, 'ce_loss': 0.031, 'epoch': 0.13}
5%|▍ | 17/369 [01:53<36:57, 6.30s/it] {'loss': 0.0624, 'grad_norm': 0.9028798341751099, 'learning_rate': 1.9990321606863224e-06, 'kl': 0.0137, 'entropy': -0.0762, 'ce_loss': 0.0234, 'epoch': 0.14}
5%|▍ | 18/369 [01:59<36:50, 6.30s/it] {'loss': 0.0598, 'grad_norm': 0.8458216786384583, 'learning_rate': 1.9986064103215337e-06, 'kl': 0.0051, 'entropy': -0.0972, 'ce_loss': 0.0249, 'epoch': 0.15}
5%|▌ | 19/369 [02:06<36:47, 6.31s/it] {'loss': 0.0592, 'grad_norm': 0.8099873661994934, 'learning_rate': 1.9981033287370442e-06, 'kl': 0.0089, 'entropy': -0.032, 'ce_loss': 0.0241, 'epoch': 0.15}
5%|▌ | 20/369 [02:12<37:10, 6.39s/it] {'loss': 0.0747, 'grad_norm': 1.0964338779449463, 'learning_rate': 1.997522954891058e-06, 'kl': 0.0035, 'entropy': -0.0105, 'ce_loss': 0.0417, 'epoch': 0.16}
6%|▌ | 21/369 [02:19<36:57, 6.37s/it] {'loss': 0.0688, 'grad_norm': 1.1084390878677368, 'learning_rate': 1.996865333727226e-06, 'kl': 0.0034, 'entropy': -0.0571, 'ce_loss': 0.0372, 'epoch': 0.17}
6%|▌ | 22/369 [02:25<37:12, 6.43s/it] {'loss': 0.0595, 'grad_norm': 0.9128804802894592, 'learning_rate': 1.9961305161711637e-06, 'kl': 0.008, 'entropy': -0.0086, 'ce_loss': 0.0303, 'epoch': 0.18}
6%|▌ | 23/369 [02:32<36:54, 6.40s/it] {'loss': 0.0676, 'grad_norm': 0.9802989959716797, 'learning_rate': 1.99531855912651e-06, 'kl': 0.0125, 'entropy': -0.0635, 'ce_loss': 0.0554, 'epoch': 0.19}
7%|▋ | 24/369 [02:38<36:36, 6.37s/it] {'loss': 0.0706, 'grad_norm': 1.028415322303772, 'learning_rate': 1.9944295254705185e-06, 'kl': 0.025, 'entropy': -0.0669, 'ce_loss': 0.0391, 'epoch': 0.2}
7%|▋ | 25/369 [02:44<36:23, 6.35s/it] {'loss': 0.0634, 'grad_norm': 0.9952380061149597, 'learning_rate': 1.993463484049188e-06, 'kl': 0.0076, 'entropy': -0.0771, 'ce_loss': 0.0295, 'epoch': 0.2}
7%|▋ | 26/369 [02:50<36:15, 6.34s/it] {'loss': 0.073, 'grad_norm': 0.9272740483283997, 'learning_rate': 1.992420509671936e-06, 'kl': 0.0059, 'entropy': -0.0339, 'ce_loss': 0.0319, 'epoch': 0.21}
7%|▋ | 27/369 [02:57<36:15, 6.36s/it] {'loss': 0.0573, 'grad_norm': 0.8754875063896179, 'learning_rate': 1.9913006831057965e-06, 'kl': 0.0112, 'entropy': -0.0693, 'ce_loss': 0.0364, 'epoch': 0.22}
8%|▊ | 28/369 [03:03<36:24, 6.41s/it] {'loss': 0.0564, 'grad_norm': 0.7933421730995178, 'learning_rate': 1.990104091069176e-06, 'kl': -0.0001, 'entropy': -0.0104, 'ce_loss': 0.0354, 'epoch': 0.23}
8%|▊ | 29/369 [03:10<36:28, 6.44s/it] {'loss': 0.0851, 'grad_norm': 1.0125603675842285, 'learning_rate': 1.9888308262251284e-06, 'kl': 0.0143, 'entropy': -0.1006, 'ce_loss': 0.0456, 'epoch': 0.24}
8%|▊ | 30/369 [03:16<36:07, 6.39s/it] {'loss': 0.0667, 'grad_norm': 0.9214984178543091, 'learning_rate': 1.9874809871741874e-06, 'kl': 0.0074, 'entropy': -0.0449, 'ce_loss': 0.0203, 'epoch': 0.24}
8%|▊ | 31/369 [03:23<37:08, 6.59s/it] {'loss': 0.0581, 'grad_norm': 0.7331676483154297, 'learning_rate': 1.986054678446725e-06, 'kl': 0.007, 'entropy': -0.0532, 'ce_loss': 0.0266, 'epoch': 0.25}
9%|▊ | 32/369 [03:30<36:48, 6.55s/it] {'loss': 0.0544, 'grad_norm': 0.8071069717407227, 'learning_rate': 1.984552010494859e-06, 'kl': 0.0178, 'entropy': 0.0415, 'ce_loss': 0.0494, 'epoch': 0.26}
9%|▉ | 33/369 [03:36<36:37, 6.54s/it] {'loss': 0.0568, 'grad_norm': 0.8182789087295532, 'learning_rate': 1.982973099683902e-06, 'kl': 0.0104, 'entropy': -0.0415, 'ce_loss': 0.0248, 'epoch': 0.27}
9%|▉ | 34/369 [03:42<36:03, 6.46s/it] {'loss': 0.0667, 'grad_norm': 0.9923499822616577, 'learning_rate': 1.9813180682833447e-06, 'kl': 0.0049, 'entropy': -0.0459, 'ce_loss': 0.0164, 'epoch': 0.28}
9%|▉ | 35/369 [03:49<35:44, 6.42s/it] {'loss': 0.0527, 'grad_norm': 0.7489456534385681, 'learning_rate': 1.9795870444573932e-06, 'kl': 0.0072, 'entropy': -0.0613, 'ce_loss': 0.0206, 'epoch': 0.28}
10%|▉ | 36/369 [03:55<35:30, 6.40s/it] {'loss': 0.0581, 'grad_norm': 0.8024400472640991, 'learning_rate': 1.9777801622550405e-06, 'kl': 0.0148, 'entropy': -0.1523, 'ce_loss': 0.0267, 'epoch': 0.29}
10%|█ | 37/369 [04:01<35:14, 6.37s/it] {'loss': 0.0593, 'grad_norm': 0.8199014067649841, 'learning_rate': 1.975897561599687e-06, 'kl': 0.0072, 'entropy': -0.0693, 'ce_loss': 0.0257, 'epoch': 0.3}
10%|█ | 38/369 [04:08<35:14, 6.39s/it] {'loss': 0.0638, 'grad_norm': 0.8470928072929382, 'learning_rate': 1.9739393882783045e-06, 'kl': 0.0151, 'entropy': -0.1025, 'ce_loss': 0.0264, 'epoch': 0.31}
11%|█ | 39/369 [04:14<35:00, 6.37s/it] {'loss': 0.0622, 'grad_norm': 0.9427053928375244, 'learning_rate': 1.9719057939301475e-06, 'kl': 0.011, 'entropy': -0.0742, 'ce_loss': 0.035, 'epoch': 0.32}
11%|█ | 40/369 [04:21<34:57, 6.37s/it] {'loss': 0.0636, 'grad_norm': 0.8446733951568604, 'learning_rate': 1.9697969360350096e-06, 'kl': 0.0066, 'entropy': 0.0055, 'ce_loss': 0.0289, 'epoch': 0.33}
11%|█ | 41/369 [04:27<34:52, 6.38s/it] {'loss': 0.0775, 'grad_norm': 0.9768215417861938, 'learning_rate': 1.967612977901028e-06, 'kl': 0.0092, 'entropy': -0.0356, 'ce_loss': 0.029, 'epoch': 0.33}
11%|█▏ | 42/369 [04:33<34:38, 6.36s/it] {'loss': 0.0746, 'grad_norm': 0.9899135828018188, 'learning_rate': 1.9653540886520385e-06, 'kl': 0.002, 'entropy': -0.0486, 'ce_loss': 0.0434, 'epoch': 0.34}
12%|█▏ | 43/369 [04:40<34:33, 6.36s/it] {'loss': 0.0703, 'grad_norm': 0.9499810338020325, 'learning_rate': 1.963020443214478e-06, 'kl': 0.0095, 'entropy': -0.0496, 'ce_loss': 0.0418, 'epoch': 0.35}
12%|█▏ | 44/369 [04:46<34:17, 6.33s/it] {'loss': 0.0638, 'grad_norm': 0.8651771545410156, 'learning_rate': 1.960612222303837e-06, 'kl': 0.009, 'entropy': -0.0752, 'ce_loss': 0.0233, 'epoch': 0.36}
12%|█▏ | 45/369 [04:52<34:01, 6.30s/it] {'loss': 0.0586, 'grad_norm': 0.8587362170219421, 'learning_rate': 1.958129612410668e-06, 'kl': 0.0064, 'entropy': -0.0233, 'ce_loss': 0.0441, 'epoch': 0.37}
12%|█▏ | 46/369 [04:58<33:57, 6.31s/it] {'loss': 0.0611, 'grad_norm': 0.8449154496192932, 'learning_rate': 1.955572805786141e-06, 'kl': 0.0059, 'entropy': -0.0635, 'ce_loss': 0.0216, 'epoch': 0.37}
13%|█▎ | 47/369 [05:05<33:48, 6.30s/it] {'loss': 0.0583, 'grad_norm': 0.8558777570724487, 'learning_rate': 1.9529420004271565e-06, 'kl': 0.0118, 'entropy': -0.1016, 'ce_loss': 0.0225, 'epoch': 0.38}
13%|█▎ | 48/369 [05:11<33:51, 6.33s/it] {'loss': 0.0636, 'grad_norm': 0.8798972368240356, 'learning_rate': 1.950237400061015e-06, 'kl': 0.0171, 'entropy': -0.0845, 'ce_loss': 0.0666, 'epoch': 0.39}
13%|█▎ | 49/369 [05:17<33:35, 6.30s/it] {'loss': 0.0634, 'grad_norm': 0.8261315822601318, 'learning_rate': 1.947459214129637e-06, 'kl': 0.012, 'entropy': -0.0388, 'ce_loss': 0.0441, 'epoch': 0.4}
14%|█▎ | 50/369 [05:24<33:32, 6.31s/it] {'loss': 0.0648, 'grad_norm': 0.9527679085731506, 'learning_rate': 1.944607657773347e-06, 'kl': 0.0066, 'entropy': -0.0186, 'ce_loss': 0.0218, 'epoch': 0.41}
14%|█▍ | 51/369 [05:30<33:26, 6.31s/it] {'loss': 0.0662, 'grad_norm': 0.8796040415763855, 'learning_rate': 1.9416829518142113e-06, 'kl': 0.0178, 'entropy': -0.1211, 'ce_loss': 0.0463, 'epoch': 0.41}
14%|█▍ | 52/369 [05:37<33:37, 6.36s/it] {'loss': 0.0555, 'grad_norm': 0.7709227800369263, 'learning_rate': 1.9386853227389385e-06, 'kl': 0.0056, 'entropy': -0.0398, 'ce_loss': 0.0232, 'epoch': 0.42}
14%|█▍ | 53/369 [05:43<33:29, 6.36s/it] {'loss': 0.0709, 'grad_norm': 0.9215301871299744, 'learning_rate': 1.9356150026813403e-06, 'kl': 0.0131, 'entropy': -0.0618, 'ce_loss': 0.0201, 'epoch': 0.43}
15%|█▍ | 54/369 [05:49<33:29, 6.38s/it] {'loss': 0.0641, 'grad_norm': 0.8644506931304932, 'learning_rate': 1.932472229404356e-06, 'kl': 0.0076, 'entropy': -0.043, 'ce_loss': 0.0247, 'epoch': 0.44}
15%|█▍ | 55/369 [05:56<33:13, 6.35s/it] {'loss': 0.0666, 'grad_norm': 0.8265705704689026, 'learning_rate': 1.9292572462816385e-06, 'kl': -0.0012, 'entropy': -0.0693, 'ce_loss': 0.0301, 'epoch': 0.45}
15%|█▌ | 56/369 [06:02<32:52, 6.30s/it] {'loss': 0.0564, 'grad_norm': 0.8720945119857788, 'learning_rate': 1.925970302278711e-06, 'kl': 0.0106, 'entropy': -0.0659, 'ce_loss': 0.0188, 'epoch': 0.46}
15%|█▌ | 57/369 [06:08<33:01, 6.35s/it] {'loss': 0.0707, 'grad_norm': 0.8853545784950256, 'learning_rate': 1.9226116519336828e-06, 'kl': 0.0112, 'entropy': -0.0742, 'ce_loss': 0.0223, 'epoch': 0.46}
16%|█▌ | 58/369 [06:14<32:48, 6.33s/it] {'loss': 0.058, 'grad_norm': 0.8502438068389893, 'learning_rate': 1.9191815553375425e-06, 'kl': 0.0171, 'entropy': -0.0413, 'ce_loss': 0.0292, 'epoch': 0.47}
16%|█▌ | 59/369 [06:21<32:50, 6.36s/it] {'loss': 0.061, 'grad_norm': 0.8159065246582031, 'learning_rate': 1.915680278114014e-06, 'kl': 0.0111, 'entropy': -0.084, 'ce_loss': 0.0181, 'epoch': 0.48}
16%|█▋ | 60/369 [06:27<32:54, 6.39s/it] {'loss': 0.0511, 'grad_norm': 0.7569247484207153, 'learning_rate': 1.9121080913989878e-06, 'kl': 0.0056, 'entropy': -0.0198, 'ce_loss': 0.0276, 'epoch': 0.49}
17%|█▋ | 61/369 [06:34<32:30, 6.33s/it] {'loss': 0.0719, 'grad_norm': 0.9277691841125488, 'learning_rate': 1.9084652718195234e-06, 'kl': 0.0232, 'entropy': -0.0126, 'ce_loss': 0.0472, 'epoch': 0.5}
17%|█▋ | 62/369 [06:40<32:13, 6.30s/it] {'loss': 0.0626, 'grad_norm': 0.8865538835525513, 'learning_rate': 1.9047521014724302e-06, 'kl': 0.0182, 'entropy': -0.0098, 'ce_loss': 0.0315, 'epoch': 0.5}
17%|█▋ | 63/369 [06:46<32:12, 6.32s/it] {'loss': 0.0569, 'grad_norm': 0.7590574622154236, 'learning_rate': 1.9009688679024189e-06, 'kl': -0.0014, 'entropy': -0.0688, 'ce_loss': 0.0288, 'epoch': 0.51}
17%|█▋ | 64/369 [06:52<32:05, 6.31s/it] {'loss': 0.0526, 'grad_norm': 0.7642686367034912, 'learning_rate': 1.8971158640798366e-06, 'kl': -0.0017, 'entropy': -0.0292, 'ce_loss': 0.0304, 'epoch': 0.52}
18%|█▊ | 65/369 [06:59<31:52, 6.29s/it] {'loss': 0.0643, 'grad_norm': 0.8817284107208252, 'learning_rate': 1.8931933883779782e-06, 'kl': 0.0025, 'entropy': -0.0593, 'ce_loss': 0.0332, 'epoch': 0.53}
18%|█▊ | 66/369 [07:05<31:58, 6.33s/it] {'loss': 0.0635, 'grad_norm': 0.856986403465271, 'learning_rate': 1.889201744549981e-06, 'kl': 0.0155, 'entropy': -0.0427, 'ce_loss': 0.0215, 'epoch': 0.54}
18%|█▊ | 67/369 [07:11<31:49, 6.32s/it] {'loss': 0.0783, 'grad_norm': 0.9850722551345825, 'learning_rate': 1.885141241705303e-06, 'kl': 0.0018, 'entropy': -0.0354, 'ce_loss': 0.0407, 'epoch': 0.54}
18%|█▊ | 68/369 [07:18<31:35, 6.30s/it] {'loss': 0.0502, 'grad_norm': 0.7847952842712402, 'learning_rate': 1.8810121942857843e-06, 'kl': 0.0083, 'entropy': -0.0679, 'ce_loss': 0.0139, 'epoch': 0.55}
19%|█▊ | 69/369 [07:24<31:47, 6.36s/it] {'loss': 0.0731, 'grad_norm': 0.9792349934577942, 'learning_rate': 1.8768149220412987e-06, 'kl': 0.0043, 'entropy': -0.0535, 'ce_loss': 0.0329, 'epoch': 0.56}
19%|█▉ | 70/369 [07:31<31:41, 6.36s/it] {'loss': 0.0707, 'grad_norm': 0.9106649160385132, 'learning_rate': 1.8725497500049904e-06, 'kl': 0.0064, 'entropy': -0.0214, 'ce_loss': 0.0345, 'epoch': 0.57}
19%|█▉ | 71/369 [07:37<31:28, 6.34s/it] {'loss': 0.0696, 'grad_norm': 0.9186252951622009, 'learning_rate': 1.8682170084681062e-06, 'kl': 0.0124, 'entropy': -0.0796, 'ce_loss': 0.0597, 'epoch': 0.58}
20%|█▉ | 72/369 [07:43<31:28, 6.36s/it] {'loss': 0.052, 'grad_norm': 0.7977147102355957, 'learning_rate': 1.863817032954416e-06, 'kl': 0.009, 'entropy': -0.127, 'ce_loss': 0.034, 'epoch': 0.59}
20%|█▉ | 73/369 [07:50<31:28, 6.38s/it] {'loss': 0.0702, 'grad_norm': 0.8844169974327087, 'learning_rate': 1.8593501641942314e-06, 'kl': 0.0062, 'entropy': -0.0854, 'ce_loss': 0.0269, 'epoch': 0.59}
20%|██ | 74/369 [07:56<31:44, 6.46s/it] {'loss': 0.0623, 'grad_norm': 0.8389966487884521, 'learning_rate': 1.8548167480980193e-06, 'kl': 0.0065, 'entropy': -0.0569, 'ce_loss': 0.0675, 'epoch': 0.6}
20%|██ | 75/369 [08:03<31:30, 6.43s/it] {'loss': 0.0727, 'grad_norm': 1.103193998336792, 'learning_rate': 1.8502171357296142e-06, 'kl': -0.0011, 'entropy': -0.0601, 'ce_loss': 0.0428, 'epoch': 0.61}
21%|██ | 76/369 [08:09<31:12, 6.39s/it] {'loss': 0.0715, 'grad_norm': 0.9878169298171997, 'learning_rate': 1.8455516832790337e-06, 'kl': 0.0039, 'entropy': -0.0396, 'ce_loss': 0.0198, 'epoch': 0.62}
21%|██ | 77/369 [08:15<30:59, 6.37s/it] {'loss': 0.0512, 'grad_norm': 0.7274094223976135, 'learning_rate': 1.8408207520348943e-06, 'kl': 0.0007, 'entropy': -0.0527, 'ce_loss': 0.017, 'epoch': 0.63}
21%|██ | 78/369 [08:22<31:14, 6.44s/it] {'loss': 0.0545, 'grad_norm': 0.7688048481941223, 'learning_rate': 1.836024708356434e-06, 'kl': -0.0101, 'entropy': 0.0086, 'ce_loss': 0.0275, 'epoch': 0.63}
21%|██▏ | 79/369 [08:28<30:52, 6.39s/it] {'loss': 0.0714, 'grad_norm': 0.925954282283783, 'learning_rate': 1.8311639236451412e-06, 'kl': 0.0299, 'entropy': -0.0291, 'ce_loss': 0.045, 'epoch': 0.64}
22%|██▏ | 80/369 [08:34<30:42, 6.38s/it] {'loss': 0.0604, 'grad_norm': 0.8215924501419067, 'learning_rate': 1.8262387743159948e-06, 'kl': 0.0099, 'entropy': -0.1602, 'ce_loss': 0.0296, 'epoch': 0.65}
22%|██▏ | 81/369 [08:41<30:24, 6.34s/it] {'loss': 0.0635, 'grad_norm': 0.8120538592338562, 'learning_rate': 1.8212496417683135e-06, 'kl': 0.0096, 'entropy': -0.0449, 'ce_loss': 0.0256, 'epoch': 0.66}
22%|██▏ | 82/369 [08:47<30:19, 6.34s/it] {'loss': 0.0662, 'grad_norm': 0.9420527219772339, 'learning_rate': 1.8161969123562217e-06, 'kl': 0.0109, 'entropy': -0.0825, 'ce_loss': 0.0304, 'epoch': 0.67}
22%|██▏ | 83/369 [08:53<30:08, 6.32s/it] {'loss': 0.0676, 'grad_norm': 1.0010178089141846, 'learning_rate': 1.81108097735873e-06, 'kl': 0.0038, 'entropy': -0.0732, 'ce_loss': 0.0196, 'epoch': 0.67}
23%|██▎ | 84/369 [09:00<29:55, 6.30s/it] {'loss': 0.0603, 'grad_norm': 0.7998976707458496, 'learning_rate': 1.805902232949435e-06, 'kl': 0.0107, 'entropy': -0.0452, 'ce_loss': 0.0225, 'epoch': 0.68}
23%|██▎ | 85/369 [09:06<29:51, 6.31s/it] {'loss': 0.0602, 'grad_norm': 0.9275299906730652, 'learning_rate': 1.80066108016584e-06, 'kl': -0.0012, 'entropy': -0.1357, 'ce_loss': 0.0122, 'epoch': 0.69}
23%|██▎ | 86/369 [09:12<29:35, 6.27s/it] {'loss': 0.068, 'grad_norm': 0.9994385242462158, 'learning_rate': 1.7953579248782993e-06, 'kl': 0.009, 'entropy': -0.0145, 'ce_loss': 0.0237, 'epoch': 0.7}
24%|██▎ | 87/369 [09:18<29:36, 6.30s/it] {'loss': 0.0634, 'grad_norm': 0.8470895290374756, 'learning_rate': 1.789993177758588e-06, 'kl': 0.0029, 'entropy': -0.0304, 'ce_loss': 0.018, 'epoch': 0.71}
24%|██▍ | 88/369 [09:25<29:31, 6.30s/it] {'loss': 0.0694, 'grad_norm': 1.0868221521377563, 'learning_rate': 1.7845672542480981e-06, 'kl': 0.0167, 'entropy': -0.1631, 'ce_loss': 0.0427, 'epoch': 0.72}
24%|██▍ | 89/369 [09:31<29:33, 6.33s/it] {'loss': 0.0558, 'grad_norm': 0.7529953718185425, 'learning_rate': 1.7790805745256702e-06, 'kl': 0.0129, 'entropy': -0.1177, 'ce_loss': 0.0261, 'epoch': 0.72}
24%|██▍ | 90/369 [09:38<29:31, 6.35s/it] {'loss': 0.0643, 'grad_norm': 0.8284902572631836, 'learning_rate': 1.773533563475053e-06, 'kl': 0.0076, 'entropy': -0.0747, 'ce_loss': 0.0421, 'epoch': 0.73}
25%|██▍ | 91/369 [09:44<29:23, 6.34s/it] {'loss': 0.0565, 'grad_norm': 0.7321727871894836, 'learning_rate': 1.767926650652001e-06, 'kl': 0.0148, 'entropy': -0.0435, 'ce_loss': 0.0284, 'epoch': 0.74}
25%|██▍ | 92/369 [09:50<29:11, 6.32s/it] {'loss': 0.063, 'grad_norm': 0.8994086980819702, 'learning_rate': 1.7622602702510103e-06, 'kl': -0.0002, 'entropy': -0.0679, 'ce_loss': 0.0348, 'epoch': 0.75}
25%|██▌ | 93/369 [09:57<29:22, 6.39s/it] {'loss': 0.062, 'grad_norm': 0.9014797210693359, 'learning_rate': 1.7565348610716958e-06, 'kl': 0.0151, 'entropy': -0.0237, 'ce_loss': 0.0424, 'epoch': 0.76}
25%|██▌ | 94/369 [10:03<29:16, 6.39s/it] {'loss': 0.065, 'grad_norm': 0.8189124464988708, 'learning_rate': 1.7507508664848091e-06, 'kl': 0.0078, 'entropy': -0.1235, 'ce_loss': 0.0358, 'epoch': 0.76}
26%|██▌ | 95/369 [10:09<29:01, 6.36s/it] {'loss': 0.0539, 'grad_norm': 0.837704598903656, 'learning_rate': 1.7449087343979057e-06, 'kl': 0.0084, 'entropy': -0.0282, 'ce_loss': 0.0311, 'epoch': 0.77}
26%|██▌ | 96/369 [10:16<28:58, 6.37s/it] {'loss': 0.0611, 'grad_norm': 0.7678609490394592, 'learning_rate': 1.739008917220659e-06, 'kl': 0.0194, 'entropy': 0.0121, 'ce_loss': 0.0289, 'epoch': 0.78}
26%|██▋ | 97/369 [10:22<28:54, 6.38s/it] {'loss': 0.0636, 'grad_norm': 0.8693345189094543, 'learning_rate': 1.733051871829826e-06, 'kl': 0.0012, 'entropy': -0.0289, 'ce_loss': 0.0285, 'epoch': 0.79}
27%|██▋ | 98/369 [10:29<28:46, 6.37s/it] {'loss': 0.0692, 'grad_norm': 0.9189662933349609, 'learning_rate': 1.7270380595338678e-06, 'kl': 0.0136, 'entropy': -0.1279, 'ce_loss': 0.0475, 'epoch': 0.8}
27%|██▋ | 99/369 [10:35<28:34, 6.35s/it] {'loss': 0.0652, 'grad_norm': 0.8363111019134521, 'learning_rate': 1.7209679460372249e-06, 'kl': 0.006, 'entropy': -0.0322, 'ce_loss': 0.0103, 'epoch': 0.8}
27%|██▋ | 100/369 [10:41<28:32, 6.37s/it] {'loss': 0.0697, 'grad_norm': 0.8501434326171875, 'learning_rate': 1.714842001404254e-06, 'kl': -0.0008, 'entropy': -0.0221, 'ce_loss': 0.0395, 'epoch': 0.81}
27%|██▋ | 101/369 [10:48<28:26, 6.37s/it] {'loss': 0.069, 'grad_norm': 0.9161370992660522, 'learning_rate': 1.7086607000228282e-06, 'kl': 0.0077, 'entropy': -0.0581, 'ce_loss': 0.0279, 'epoch': 0.82}
28%|██▊ | 102/369 [10:54<28:24, 6.39s/it] {'loss': 0.0677, 'grad_norm': 0.8917465806007385, 'learning_rate': 1.7024245205675985e-06, 'kl': 0.0109, 'entropy': 0.0272, 'ce_loss': 0.0257, 'epoch': 0.83}
28%|██▊ | 103/369 [11:00<28:11, 6.36s/it] {'loss': 0.0661, 'grad_norm': 0.9398002028465271, 'learning_rate': 1.6961339459629267e-06, 'kl': 0.0062, 'entropy': -0.0603, 'ce_loss': 0.0359, 'epoch': 0.84}
28%|██▊ | 104/369 [11:07<27:59, 6.34s/it] {'loss': 0.0602, 'grad_norm': 0.8176390528678894, 'learning_rate': 1.6897894633454883e-06, 'kl': 0.0054, 'entropy': -0.0684, 'ce_loss': 0.021, 'epoch': 0.85}
28%|██▊ | 105/369 [11:13<27:53, 6.34s/it] {'loss': 0.0648, 'grad_norm': 0.8363648056983948, 'learning_rate': 1.6833915640265483e-06, 'kl': -0.0064, 'entropy': -0.0312, 'ce_loss': 0.025, 'epoch': 0.85}
29%|██▊ | 106/369 [11:19<27:45, 6.33s/it] {'loss': 0.0657, 'grad_norm': 0.8912726044654846, 'learning_rate': 1.6769407434539166e-06, 'kl': 0.0173, 'entropy': -0.0598, 'ce_loss': 0.0284, 'epoch': 0.86}
29%|██▉ | 107/369 [11:26<27:28, 6.29s/it] {'loss': 0.0603, 'grad_norm': 0.8265730738639832, 'learning_rate': 1.670437501173578e-06, 'kl': 0.0087, 'entropy': -0.0869, 'ce_loss': 0.042, 'epoch': 0.87}
29%|██▉ | 108/369 [11:32<27:24, 6.30s/it] {'loss': 0.0541, 'grad_norm': 0.6994743347167969, 'learning_rate': 1.6638823407910082e-06, 'kl': 0.0189, 'entropy': 0.0135, 'ce_loss': 0.0303, 'epoch': 0.88}
30%|██▉ | 109/369 [11:38<27:21, 6.31s/it] {'loss': 0.0654, 'grad_norm': 0.8241642713546753, 'learning_rate': 1.657275769932179e-06, 'kl': 0.0096, 'entropy': -0.0219, 'ce_loss': 0.0113, 'epoch': 0.89}
30%|██▉ | 110/369 [11:45<27:22, 6.34s/it] {'loss': 0.067, 'grad_norm': 0.9464811682701111, 'learning_rate': 1.650618300204242e-06, 'kl': 0.0048, 'entropy': -0.0479, 'ce_loss': 0.0239, 'epoch': 0.89}
30%|███ | 111/369 [11:51<27:16, 6.34s/it] {'loss': 0.0555, 'grad_norm': 0.853521466255188, 'learning_rate': 1.6439104471559156e-06, 'kl': 0.0003, 'entropy': -0.0664, 'ce_loss': 0.0237, 'epoch': 0.9}
30%|███ | 112/369 [11:57<27:10, 6.34s/it] {'loss': 0.0589, 'grad_norm': 0.7431493401527405, 'learning_rate': 1.6371527302375578e-06, 'kl': 0.006, 'entropy': -0.0928, 'ce_loss': 0.0236, 'epoch': 0.91}
31%|███ | 113/369 [12:04<27:07, 6.36s/it] {'loss': 0.056, 'grad_norm': 0.7598456144332886, 'learning_rate': 1.6303456727609426e-06, 'kl': 0.0168, 'entropy': -0.0859, 'ce_loss': 0.0119, 'epoch': 0.92}
31%|███ | 114/369 [12:10<26:49, 6.31s/it] {'loss': 0.0613, 'grad_norm': 0.9038425087928772, 'learning_rate': 1.6234898018587336e-06, 'kl': -0.0151, 'entropy': -0.0383, 'ce_loss': 0.0149, 'epoch': 0.93}
31%|███ | 115/369 [12:16<26:44, 6.32s/it] {'loss': 0.0608, 'grad_norm': 0.9182695150375366, 'learning_rate': 1.6165856484436641e-06, 'kl': 0.0056, 'entropy': -0.0525, 'ce_loss': 0.0146, 'epoch': 0.93}
31%|███▏ | 116/369 [12:22<26:35, 6.31s/it] {'loss': 0.0624, 'grad_norm': 0.8503457307815552, 'learning_rate': 1.609633747167424e-06, 'kl': 0.0067, 'entropy': -0.0454, 'ce_loss': 0.0698, 'epoch': 0.94}
32%|███▏ | 117/369 [12:29<26:28, 6.30s/it] {'loss': 0.0723, 'grad_norm': 0.8861355185508728, 'learning_rate': 1.6026346363792564e-06, 'kl': -0.0003, 'entropy': -0.0444, 'ce_loss': 0.0388, 'epoch': 0.95}
32%|███▏ | 118/369 [12:35<26:32, 6.35s/it] {'loss': 0.066, 'grad_norm': 0.8450856804847717, 'learning_rate': 1.5955888580842678e-06, 'kl': 0.005, 'entropy': -0.0718, 'ce_loss': 0.0441, 'epoch': 0.96}
32%|███▏ | 119/369 [12:42<26:26, 6.35s/it] {'loss': 0.0787, 'grad_norm': 0.9626548886299133, 'learning_rate': 1.5884969579014565e-06, 'kl': 0.0104, 'entropy': -0.0493, 'ce_loss': 0.0599, 'epoch': 0.97}
33%|███▎ | 120/369 [12:48<26:23, 6.36s/it] {'loss': 0.0636, 'grad_norm': 0.810414731502533, 'learning_rate': 1.5813594850214597e-06, 'kl': 0.0105, 'entropy': -0.0305, 'ce_loss': 0.0376, 'epoch': 0.98}
33%|███▎ | 121/369 [12:54<26:23, 6.38s/it] {'loss': 0.0668, 'grad_norm': 0.8195613026618958, 'learning_rate': 1.5741769921640259e-06, 'kl': 0.0079, 'entropy': -0.0228, 'ce_loss': 0.0291, 'epoch': 0.98}
33%|███▎ | 122/369 [13:01<26:11, 6.36s/it] {'loss': 0.0584, 'grad_norm': 0.8678882122039795, 'learning_rate': 1.5669500355352114e-06, 'kl': 0.0034, 'entropy': -0.0525, 'ce_loss': 0.0308, 'epoch': 0.99}
33%|███▎ | 123/369 [13:07<26:00, 6.34s/it] {'loss': 0.0623, 'grad_norm': 0.9662221074104309, 'learning_rate': 1.5596791747843082e-06, 'kl': 0.0197, 'entropy': -0.0398, 'ce_loss': 0.0159, 'epoch': 1.0}
34%|███▎ | 124/369 [13:13<25:59, 6.37s/it] {'loss': 0.044, 'grad_norm': 0.7322087287902832, 'learning_rate': 1.5523649729605057e-06, 'kl': 0.0166, 'entropy': -0.1157, 'ce_loss': 0.0087, 'epoch': 1.01}
34%|███▍ | 125/369 [13:20<25:54, 6.37s/it] {'loss': 0.0519, 'grad_norm': 0.6933812499046326, 'learning_rate': 1.5450079964692895e-06, 'kl': 0.0161, 'entropy': -0.0292, 'ce_loss': 0.0195, 'epoch': 1.02}
34%|███▍ | 126/369 [13:26<25:44, 6.36s/it] {'loss': 0.0481, 'grad_norm': 0.7699279189109802, 'learning_rate': 1.5376088150285774e-06, 'kl': 0.0104, 'entropy': -0.0618, 'ce_loss': 0.0196, 'epoch': 1.02}
34%|███▍ | 127/369 [13:32<25:36, 6.35s/it] {'loss': 0.0559, 'grad_norm': 0.7962565422058105, 'learning_rate': 1.5301680016246028e-06, 'kl': 0.0413, 'entropy': -0.1011, 'ce_loss': 0.0172, 'epoch': 1.03}
35%|███▍ | 128/369 [13:39<25:38, 6.38s/it] {'loss': 0.0505, 'grad_norm': 0.7270010709762573, 'learning_rate': 1.5226861324675428e-06, 'kl': 0.0383, 'entropy': -0.0452, 'ce_loss': 0.0149, 'epoch': 1.04}
35%|███▍ | 129/369 [13:45<25:25, 6.36s/it] {'loss': 0.0394, 'grad_norm': 0.6915517449378967, 'learning_rate': 1.5151637869468958e-06, 'kl': 0.024, 'entropy': -0.0742, 'ce_loss': 0.017, 'epoch': 1.05}
35%|███▌ | 130/369 [13:51<25:11, 6.33s/it] {'loss': 0.0455, 'grad_norm': 0.6917263269424438, 'learning_rate': 1.5076015475866158e-06, 'kl': 0.0153, 'entropy': -0.0332, 'ce_loss': 0.0177, 'epoch': 1.06}
36%|███▌ | 131/369 [13:58<25:12, 6.35s/it] {'loss': 0.0413, 'grad_norm': 0.7600991129875183, 'learning_rate': 1.5e-06, 'kl': 0.0376, 'entropy': -0.0703, 'ce_loss': 0.0165, 'epoch': 1.07}
36%|███▌ | 132/369 [14:04<25:00, 6.33s/it] {'loss': 0.0412, 'grad_norm': 0.6427872180938721, 'learning_rate': 1.492359732844342e-06, 'kl': 0.0297, 'entropy': -0.0376, 'ce_loss': 0.019, 'epoch': 1.07}
36%|███▌ | 133/369 [14:11<24:56, 6.34s/it] {'loss': 0.0506, 'grad_norm': 0.819362223148346, 'learning_rate': 1.4846813377753453e-06, 'kl': 0.0649, 'entropy': -0.0859, 'ce_loss': 0.051, 'epoch': 1.08}
36%|███▋ | 134/369 [14:17<24:44, 6.32s/it] {'loss': 0.0433, 'grad_norm': 0.7497709393501282, 'learning_rate': 1.4769654094013058e-06, 'kl': 0.0361, 'entropy': -0.0508, 'ce_loss': 0.0214, 'epoch': 1.09}
37%|███▋ | 135/369 [14:23<24:34, 6.30s/it] {'loss': 0.0552, 'grad_norm': 0.9110161066055298, 'learning_rate': 1.4692125452370662e-06, 'kl': 0.0273, 'entropy': -0.0571, 'ce_loss': 0.0296, 'epoch': 1.1}
37%|███▋ | 136/369 [14:29<24:28, 6.30s/it] {'loss': 0.0433, 'grad_norm': 0.711473286151886, 'learning_rate': 1.4614233456577452e-06, 'kl': 0.0228, 'entropy': -0.0496, 'ce_loss': 0.0281, 'epoch': 1.11}
37%|███▋ | 137/369 [14:36<24:31, 6.34s/it] {'loss': 0.0385, 'grad_norm': 0.6951709389686584, 'learning_rate': 1.4535984138522441e-06, 'kl': 0.0089, 'entropy': -0.0938, 'ce_loss': 0.0426, 'epoch': 1.11}
37%|███▋ | 138/369 [14:42<24:20, 6.32s/it] {'loss': 0.0412, 'grad_norm': 0.6906440854072571, 'learning_rate': 1.4457383557765383e-06, 'kl': 0.083, 'entropy': -0.1621, 'ce_loss': 0.0159, 'epoch': 1.12}
38%|███▊ | 139/369 [14:49<24:28, 6.38s/it] {'loss': 0.0448, 'grad_norm': 0.8570243120193481, 'learning_rate': 1.4378437801067499e-06, 'kl': 0.0126, 'entropy': -0.0005, 'ce_loss': 0.0324, 'epoch': 1.13}
38%|███▊ | 140/369 [14:55<24:14, 6.35s/it] {'loss': 0.0359, 'grad_norm': 0.6764241456985474, 'learning_rate': 1.4299152981920144e-06, 'kl': 0.0151, 'entropy': -0.0684, 'ce_loss': 0.0255, 'epoch': 1.14}
38%|███▊ | 141/369 [15:01<24:05, 6.34s/it] {'loss': 0.0377, 'grad_norm': 0.6826944351196289, 'learning_rate': 1.4219535240071376e-06, 'kl': 0.0293, 'entropy': -0.0845, 'ce_loss': 0.0137, 'epoch': 1.15}
38%|███▊ | 142/369 [15:08<23:57, 6.33s/it] {'loss': 0.0372, 'grad_norm': 0.741885781288147, 'learning_rate': 1.4139590741050502e-06, 'kl': 0.0237, 'entropy': -0.0635, 'ce_loss': 0.0178, 'epoch': 1.15}
39%|███▉ | 143/369 [15:14<24:02, 6.38s/it] {'loss': 0.0457, 'grad_norm': 0.7757811546325684, 'learning_rate': 1.4059325675690622e-06, 'kl': 0.042, 'entropy': -0.0923, 'ce_loss': 0.0252, 'epoch': 1.16}
39%|███▉ | 144/369 [15:21<24:24, 6.51s/it] {'loss': 0.0429, 'grad_norm': 0.8202553391456604, 'learning_rate': 1.3978746259649208e-06, 'kl': 0.0194, 'entropy': -0.0527, 'ce_loss': 0.0241, 'epoch': 1.17}
39%|β–ˆβ–ˆβ–ˆβ–‰ | 144/369 [15:21<24:24, 6.51s/it] 39%|β–ˆβ–ˆβ–ˆβ–‰ | 145/369 [15:27<24:10, 6.48s/it] {'loss': 0.0452, 'grad_norm': 0.8978075385093689, 'learning_rate': 1.3897858732926792e-06, 'kl': 0.0325, 'entropy': -0.0128, 'ce_loss': 0.0227, 'epoch': 1.18}
39%|β–ˆβ–ˆβ–ˆβ–‰ | 145/369 [15:27<24:10, 6.48s/it] 40%|β–ˆβ–ˆβ–ˆβ–‰ | 146/369 [15:34<23:54, 6.43s/it] {'loss': 0.046, 'grad_norm': 0.8443093299865723, 'learning_rate': 1.3816669359383726e-06, 'kl': 0.0396, 'entropy': -0.168, 'ce_loss': 0.0324, 'epoch': 1.19}
40%|β–ˆβ–ˆβ–ˆβ–‰ | 146/369 [15:34<23:54, 6.43s/it] 40%|β–ˆβ–ˆβ–ˆβ–‰ | 147/369 [15:40<23:44, 6.42s/it] {'loss': 0.0427, 'grad_norm': 0.8157190084457397, 'learning_rate': 1.3735184426255114e-06, 'kl': 0.0175, 'entropy': 0.0125, 'ce_loss': 0.0202, 'epoch': 1.2}
40%|β–ˆβ–ˆβ–ˆβ–‰ | 147/369 [15:40<23:44, 6.42s/it] 40%|β–ˆβ–ˆβ–ˆβ–ˆ | 148/369 [15:46<23:39, 6.43s/it] {'loss': 0.0401, 'grad_norm': 0.8002682328224182, 'learning_rate': 1.3653410243663951e-06, 'kl': 0.0356, 'entropy': -0.0437, 'ce_loss': 0.025, 'epoch': 1.2}
40%|β–ˆβ–ˆβ–ˆβ–ˆ | 148/369 [15:46<23:39, 6.43s/it] 40%|β–ˆβ–ˆβ–ˆβ–ˆ | 149/369 [15:53<23:28, 6.40s/it] {'loss': 0.041, 'grad_norm': 0.8058973550796509, 'learning_rate': 1.3571353144132446e-06, 'kl': 0.04, 'entropy': -0.0493, 'ce_loss': 0.0324, 'epoch': 1.21}
40%|β–ˆβ–ˆβ–ˆβ–ˆ | 149/369 [15:53<23:28, 6.40s/it] 41%|β–ˆβ–ˆβ–ˆβ–ˆ | 150/369 [15:59<23:14, 6.37s/it] {'loss': 0.0455, 'grad_norm': 0.8408104181289673, 'learning_rate': 1.3489019482091667e-06, 'kl': 0.0243, 'entropy': -0.0618, 'ce_loss': 0.0247, 'epoch': 1.22}
41%|β–ˆβ–ˆβ–ˆβ–ˆ | 150/369 [15:59<23:14, 6.37s/it] 41%|β–ˆβ–ˆβ–ˆβ–ˆ | 151/369 [16:05<23:09, 6.37s/it] {'loss': 0.0461, 'grad_norm': 0.9686378240585327, 'learning_rate': 1.3406415633389436e-06, 'kl': 0.0596, 'entropy': -0.0447, 'ce_loss': 0.0278, 'epoch': 1.23}
41%|β–ˆβ–ˆβ–ˆβ–ˆ | 151/369 [16:05<23:09, 6.37s/it] 41%|β–ˆβ–ˆβ–ˆβ–ˆ | 152/369 [16:12<22:53, 6.33s/it] {'loss': 0.0515, 'grad_norm': 0.9879729747772217, 'learning_rate': 1.3323547994796595e-06, 'kl': 0.0081, 'entropy': 0.0188, 'ce_loss': 0.0114, 'epoch': 1.24}
41%|β–ˆβ–ˆβ–ˆβ–ˆ | 152/369 [16:12<22:53, 6.33s/it] 41%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 153/369 [16:18<23:02, 6.40s/it] {'loss': 0.0456, 'grad_norm': 0.8891737461090088, 'learning_rate': 1.324042298351166e-06, 'kl': 0.0204, 'entropy': -0.0415, 'ce_loss': 0.0141, 'epoch': 1.24}
41%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 153/369 [16:18<23:02, 6.40s/it] 42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 154/369 [16:24<22:47, 6.36s/it] {'loss': 0.0396, 'grad_norm': 0.751800537109375, 'learning_rate': 1.3157047036663851e-06, 'kl': 0.0275, 'entropy': -0.0869, 'ce_loss': 0.0185, 'epoch': 1.25}
42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 154/369 [16:24<22:47, 6.36s/it] 42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 155/369 [16:31<22:47, 6.39s/it] {'loss': 0.0472, 'grad_norm': 0.8366863131523132, 'learning_rate': 1.3073426610814628e-06, 'kl': 0.0261, 'entropy': 0.015, 'ce_loss': 0.024, 'epoch': 1.26}
42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 155/369 [16:31<22:47, 6.39s/it] 42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 156/369 [16:37<22:33, 6.35s/it] {'loss': 0.0395, 'grad_norm': 0.8263017535209656, 'learning_rate': 1.2989568181457702e-06, 'kl': 0.0066, 'entropy': -0.0588, 'ce_loss': 0.019, 'epoch': 1.27}
42%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 156/369 [16:37<22:33, 6.35s/it] 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 157/369 [16:44<22:39, 6.41s/it] {'loss': 0.0387, 'grad_norm': 0.772300124168396, 'learning_rate': 1.290547824251756e-06, 'kl': 0.0708, 'entropy': -0.0923, 'ce_loss': 0.0261, 'epoch': 1.28}
43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 157/369 [16:44<22:39, 6.41s/it] 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 158/369 [16:50<22:33, 6.42s/it] {'loss': 0.0343, 'grad_norm': 0.701569676399231, 'learning_rate': 1.2821163305846593e-06, 'kl': 0.0192, 'entropy': -0.0238, 'ce_loss': 0.0099, 'epoch': 1.28}
43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 158/369 [16:50<22:33, 6.42s/it] 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 159/369 [16:56<22:20, 6.38s/it] {'loss': 0.0517, 'grad_norm': 0.8842969536781311, 'learning_rate': 1.273662990072083e-06, 'kl': 0.0194, 'entropy': -0.0059, 'ce_loss': 0.0197, 'epoch': 1.29}
43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 159/369 [16:56<22:20, 6.38s/it] 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 160/369 [17:03<22:12, 6.38s/it] {'loss': 0.0449, 'grad_norm': 0.9260641932487488, 'learning_rate': 1.2651884573334296e-06, 'kl': 0.0289, 'entropy': -0.024, 'ce_loss': 0.0194, 'epoch': 1.3}
43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 160/369 [17:03<22:12, 6.38s/it] 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 161/369 [17:09<22:06, 6.38s/it] {'loss': 0.0417, 'grad_norm': 0.8168469667434692, 'learning_rate': 1.2566933886292103e-06, 'kl': 0.022, 'entropy': -0.0311, 'ce_loss': 0.0092, 'epoch': 1.31}
44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 161/369 [17:09<22:06, 6.38s/it] 44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 162/369 [17:16<22:11, 6.43s/it] {'loss': 0.0428, 'grad_norm': 0.8957350254058838, 'learning_rate': 1.2481784418102239e-06, 'kl': 0.0327, 'entropy': -0.0767, 'ce_loss': 0.0554, 'epoch': 1.32}
44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 162/369 [17:16<22:11, 6.43s/it] 44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 163/369 [17:22<22:00, 6.41s/it] {'loss': 0.0469, 'grad_norm': 0.7891753315925598, 'learning_rate': 1.2396442762666126e-06, 'kl': 0.022, 'entropy': -0.0388, 'ce_loss': 0.021, 'epoch': 1.33}
44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 163/369 [17:22<22:00, 6.41s/it] 44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 164/369 [17:28<21:44, 6.37s/it] {'loss': 0.0433, 'grad_norm': 0.8830762505531311, 'learning_rate': 1.2310915528768e-06, 'kl': 0.0352, 'entropy': -0.0471, 'ce_loss': 0.017, 'epoch': 1.33}
44%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 164/369 [17:28<21:44, 6.37s/it] 45%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 165/369 [17:35<21:36, 6.35s/it] {'loss': 0.0391, 'grad_norm': 0.7945414185523987, 'learning_rate': 1.2225209339563143e-06, 'kl': 0.0315, 'entropy': -0.0544, 'ce_loss': 0.021, 'epoch': 1.34}
45%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 165/369 [17:35<21:36, 6.35s/it] 45%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 166/369 [17:41<21:38, 6.40s/it] {'loss': 0.0486, 'grad_norm': 0.858296275138855, 'learning_rate': 1.2139330832064973e-06, 'kl': 0.0228, 'entropy': -0.0179, 'ce_loss': 0.012, 'epoch': 1.35}
45%|β–ˆβ–ˆβ–ˆβ–ˆβ– | 166/369 [17:41<21:38, 6.40s/it] 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 167/369 [17:48<21:27, 6.37s/it] {'loss': 0.0472, 'grad_norm': 0.8082732558250427, 'learning_rate': 1.205328665663109e-06, 'kl': 0.0522, 'entropy': -0.0054, 'ce_loss': 0.0305, 'epoch': 1.36}
45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 167/369 [17:48<21:27, 6.37s/it] 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 168/369 [17:54<21:21, 6.37s/it] {'loss': 0.0474, 'grad_norm': 0.9350702166557312, 'learning_rate': 1.196708347644828e-06, 'kl': 0.0092, 'entropy': -0.0615, 'ce_loss': 0.0206, 'epoch': 1.37}
46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 168/369 [17:54<21:21, 6.37s/it] 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 169/369 [18:00<21:22, 6.41s/it] {'loss': 0.0387, 'grad_norm': 0.8152714371681213, 'learning_rate': 1.1880727967016513e-06, 'kl': 0.0137, 'entropy': 0.0008, 'ce_loss': 0.0128, 'epoch': 1.37}
46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 169/369 [18:00<21:22, 6.41s/it] 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 170/369 [18:07<21:12, 6.39s/it] {'loss': 0.0456, 'grad_norm': 0.848048746585846, 'learning_rate': 1.1794226815632012e-06, 'kl': 0.0245, 'entropy': -0.0723, 'ce_loss': 0.0121, 'epoch': 1.38}
46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 170/369 [18:07<21:12, 6.39s/it] 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 171/369 [18:13<21:00, 6.37s/it] {'loss': 0.05, 'grad_norm': 0.9876840114593506, 'learning_rate': 1.1707586720869374e-06, 'kl': 0.0135, 'entropy': -0.125, 'ce_loss': 0.0197, 'epoch': 1.39}
46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 171/369 [18:13<21:00, 6.37s/it] 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 172/369 [18:20<20:59, 6.39s/it] {'loss': 0.0586, 'grad_norm': 1.0007715225219727, 'learning_rate': 1.1620814392062872e-06, 'kl': 0.0165, 'entropy': -0.0579, 'ce_loss': 0.02, 'epoch': 1.4}
47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 172/369 [18:20<20:59, 6.39s/it] 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 173/369 [18:26<20:44, 6.35s/it] {'loss': 0.0431, 'grad_norm': 1.1270406246185303, 'learning_rate': 1.1533916548786856e-06, 'kl': 0.0708, 'entropy': -0.1631, 'ce_loss': 0.0164, 'epoch': 1.41}
47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 173/369 [18:26<20:44, 6.35s/it] 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 174/369 [18:32<20:31, 6.31s/it] {'loss': 0.0403, 'grad_norm': 0.7795974612236023, 'learning_rate': 1.1446899920335405e-06, 'kl': 0.0442, 'entropy': -0.051, 'ce_loss': 0.0265, 'epoch': 1.41}
47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 174/369 [18:32<20:31, 6.31s/it] 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 175/369 [18:38<20:19, 6.28s/it] {'loss': 0.0523, 'grad_norm': 0.9137143492698669, 'learning_rate': 1.1359771245201232e-06, 'kl': 0.0332, 'entropy': -0.0811, 'ce_loss': 0.023, 'epoch': 1.42}
47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 175/369 [18:38<20:19, 6.28s/it] 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 176/369 [18:45<20:19, 6.32s/it] {'loss': 0.0409, 'grad_norm': 0.7708539366722107, 'learning_rate': 1.1272537270553834e-06, 'kl': 0.0317, 'entropy': -0.0913, 'ce_loss': 0.0235, 'epoch': 1.43}
48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 176/369 [18:45<20:19, 6.32s/it] 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 177/369 [18:51<20:12, 6.31s/it] {'loss': 0.0529, 'grad_norm': 0.9529765248298645, 'learning_rate': 1.1185204751717027e-06, 'kl': 0.011, 'entropy': -0.0525, 'ce_loss': 0.0103, 'epoch': 1.44}
48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 177/369 [18:51<20:12, 6.31s/it] 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 178/369 [18:57<20:04, 6.31s/it] {'loss': 0.0446, 'grad_norm': 0.9409148693084717, 'learning_rate': 1.1097780451645792e-06, 'kl': 0.063, 'entropy': -0.0505, 'ce_loss': 0.0194, 'epoch': 1.45}
48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 178/369 [18:57<20:04, 6.31s/it] 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 179/369 [19:04<20:07, 6.35s/it] {'loss': 0.0413, 'grad_norm': 0.7547748684883118, 'learning_rate': 1.1010271140402578e-06, 'kl': 0.0237, 'entropy': -0.0408, 'ce_loss': 0.0119, 'epoch': 1.46}
49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š | 179/369 [19:04<20:07, 6.35s/it] 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 180/369 [19:10<20:06, 6.38s/it] {'loss': 0.0513, 'grad_norm': 0.9787511825561523, 'learning_rate': 1.092268359463302e-06, 'kl': 0.0164, 'entropy': -0.0449, 'ce_loss': 0.0178, 'epoch': 1.46}
49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 180/369 [19:10<20:06, 6.38s/it] 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 181/369 [19:17<20:04, 6.41s/it] {'loss': 0.0487, 'grad_norm': 0.912588894367218, 'learning_rate': 1.083502459704117e-06, 'kl': 0.02, 'entropy': -0.0046, 'ce_loss': 0.021, 'epoch': 1.47}
49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 181/369 [19:17<20:04, 6.41s/it] 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 182/369 [19:23<19:59, 6.41s/it] {'loss': 0.0438, 'grad_norm': 0.894838809967041, 'learning_rate': 1.0747300935864243e-06, 'kl': 0.0239, 'entropy': -0.0369, 'ce_loss': 0.0183, 'epoch': 1.48}
49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 182/369 [19:23<19:59, 6.41s/it] 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 183/369 [19:30<20:03, 6.47s/it] {'loss': 0.0449, 'grad_norm': 0.891409158706665, 'learning_rate': 1.0659519404346952e-06, 'kl': 0.0149, 'entropy': -0.0767, 'ce_loss': 0.0405, 'epoch': 1.49}
50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 183/369 [19:30<20:03, 6.47s/it] 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 184/369 [19:36<20:01, 6.50s/it] {'loss': 0.0388, 'grad_norm': 0.8926018476486206, 'learning_rate': 1.0571686800215442e-06, 'kl': 0.0835, 'entropy': -0.208, 'ce_loss': 0.0126, 'epoch': 1.5}
50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 184/369 [19:36<20:01, 6.50s/it] 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 185/369 [19:43<19:47, 6.46s/it] {'loss': 0.046, 'grad_norm': 0.8469122052192688, 'learning_rate': 1.0483809925150867e-06, 'kl': 0.0286, 'entropy': -0.0664, 'ce_loss': 0.0281, 'epoch': 1.5}
50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 185/369 [19:43<19:47, 6.46s/it] 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 186/369 [19:49<19:37, 6.43s/it] {'loss': 0.0436, 'grad_norm': 0.7744949460029602, 'learning_rate': 1.0395895584262695e-06, 'kl': 0.0056, 'entropy': -0.1099, 'ce_loss': 0.0176, 'epoch': 1.51}
50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 186/369 [19:49<19:37, 6.43s/it] 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 187/369 [19:55<19:29, 6.43s/it] {'loss': 0.046, 'grad_norm': 0.8215968012809753, 'learning_rate': 1.0307950585561705e-06, 'kl': 0.0352, 'entropy': -0.0044, 'ce_loss': 0.0279, 'epoch': 1.52}
51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 187/369 [19:55<19:29, 6.43s/it] 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 188/369 [20:02<19:23, 6.43s/it] {'loss': 0.0414, 'grad_norm': 0.8396916389465332, 'learning_rate': 1.0219981739432796e-06, 'kl': 0.051, 'entropy': -0.0552, 'ce_loss': 0.0195, 'epoch': 1.53}
51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 188/369 [20:02<19:23, 6.43s/it] 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 189/369 [20:08<19:08, 6.38s/it] {'loss': 0.0437, 'grad_norm': 0.7614532113075256, 'learning_rate': 1.013199585810759e-06, 'kl': 0.008, 'entropy': -0.0125, 'ce_loss': 0.0175, 'epoch': 1.54}
51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 189/369 [20:08<19:08, 6.38s/it] 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 190/369 [20:14<18:55, 6.34s/it] {'loss': 0.0447, 'grad_norm': 0.8566346764564514, 'learning_rate': 1.0043999755136902e-06, 'kl': 0.0243, 'entropy': -0.1094, 'ce_loss': 0.0336, 'epoch': 1.54}
51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 190/369 [20:14<18:55, 6.34s/it] 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 191/369 [20:20<18:41, 6.30s/it] {'loss': 0.0493, 'grad_norm': 0.8871574997901917, 'learning_rate': 9.9560002448631e-07, 'kl': 0.0232, 'entropy': -0.0649, 'ce_loss': 0.0195, 'epoch': 1.55}
52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 191/369 [20:20<18:41, 6.30s/it] 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 192/369 [20:27<18:41, 6.34s/it] {'loss': 0.0473, 'grad_norm': 0.8020417094230652, 'learning_rate': 9.868004141892412e-07, 'kl': 0.0728, 'entropy': -0.2236, 'ce_loss': 0.0219, 'epoch': 1.56}
52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 192/369 [20:27<18:41, 6.34s/it] 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 193/369 [20:33<18:26, 6.29s/it] {'loss': 0.0516, 'grad_norm': 0.9594029188156128, 'learning_rate': 9.780018260567206e-07, 'kl': 0.0474, 'entropy': -0.1094, 'ce_loss': 0.0516, 'epoch': 1.57}
52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 193/369 [20:33<18:26, 6.29s/it] 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 194/369 [20:39<18:18, 6.28s/it] {'loss': 0.0564, 'grad_norm': 0.8477089405059814, 'learning_rate': 9.692049414438298e-07, 'kl': 0.0552, 'entropy': -0.0203, 'ce_loss': 0.0405, 'epoch': 1.58}
53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 194/369 [20:39<18:18, 6.28s/it] 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 195/369 [20:46<18:23, 6.34s/it] {'loss': 0.0402, 'grad_norm': 0.8260782957077026, 'learning_rate': 9.604104415737308e-07, 'kl': 0.0255, 'entropy': -0.127, 'ce_loss': 0.0159, 'epoch': 1.59}
53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 195/369 [20:46<18:23, 6.34s/it] 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 196/369 [20:52<18:14, 6.33s/it] {'loss': 0.0449, 'grad_norm': 0.7399953007698059, 'learning_rate': 9.516190074849133e-07, 'kl': 0.0193, 'entropy': -0.0308, 'ce_loss': 0.0121, 'epoch': 1.59}
53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 196/369 [20:52<18:14, 6.33s/it] 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 197/369 [20:58<18:09, 6.34s/it] {'loss': 0.0322, 'grad_norm': 0.6406762003898621, 'learning_rate': 9.428313199784555e-07, 'kl': 0.0195, 'entropy': -0.0591, 'ce_loss': 0.0135, 'epoch': 1.6}
53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 197/369 [20:58<18:09, 6.34s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 198/369 [21:05<18:12, 6.39s/it] {'loss': 0.0478, 'grad_norm': 0.8323817253112793, 'learning_rate': 9.340480595653045e-07, 'kl': 0.0515, 'entropy': -0.1963, 'ce_loss': 0.0423, 'epoch': 1.61}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 198/369 [21:05<18:12, 6.39s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 199/369 [21:11<18:05, 6.39s/it] {'loss': 0.0487, 'grad_norm': 0.8286074995994568, 'learning_rate': 9.252699064135758e-07, 'kl': 0.033, 'entropy': -0.0713, 'ce_loss': 0.0109, 'epoch': 1.62}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 199/369 [21:11<18:05, 6.39s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 200/369 [21:18<17:59, 6.39s/it] {'loss': 0.0353, 'grad_norm': 0.7180982232093811, 'learning_rate': 9.164975402958832e-07, 'kl': 0.0034, 'entropy': -0.0664, 'ce_loss': 0.0096, 'epoch': 1.63}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 200/369 [21:18<17:59, 6.39s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 201/369 [21:24<17:58, 6.42s/it] {'loss': 0.0422, 'grad_norm': 0.7994242906570435, 'learning_rate': 9.077316405366981e-07, 'kl': 0.0266, 'entropy': -0.0312, 'ce_loss': 0.0181, 'epoch': 1.63}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 201/369 [21:24<17:58, 6.42s/it] 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 202/369 [21:31<17:53, 6.43s/it] {'loss': 0.0551, 'grad_norm': 0.9823854565620422, 'learning_rate': 8.989728859597423e-07, 'kl': 0.0095, 'entropy': -0.0471, 'ce_loss': 0.0279, 'epoch': 1.64}
55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 202/369 [21:31<17:53, 6.43s/it] 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 203/369 [21:37<17:46, 6.43s/it] {'loss': 0.0463, 'grad_norm': 0.8318374752998352, 'learning_rate': 8.902219548354208e-07, 'kl': 0.0107, 'entropy': -0.025, 'ce_loss': 0.0055, 'epoch': 1.65}
55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 203/369 [21:37<17:46, 6.43s/it] 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 204/369 [21:43<17:35, 6.40s/it] {'loss': 0.0501, 'grad_norm': 0.8627861738204956, 'learning_rate': 8.814795248282973e-07, 'kl': 0.0077, 'entropy': -0.1553, 'ce_loss': 0.0288, 'epoch': 1.66}
55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 204/369 [21:43<17:35, 6.40s/it] 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 205/369 [21:50<17:27, 6.39s/it] {'loss': 0.0447, 'grad_norm': 0.8011927604675293, 'learning_rate': 8.727462729446167e-07, 'kl': 0.027, 'entropy': -0.0801, 'ce_loss': 0.0242, 'epoch': 1.67}
56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 205/369 [21:50<17:27, 6.39s/it] 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 206/369 [21:56<17:19, 6.37s/it] {'loss': 0.0373, 'grad_norm': 0.6701304316520691, 'learning_rate': 8.640228754798773e-07, 'kl': 0.0269, 'entropy': -0.0393, 'ce_loss': 0.0164, 'epoch': 1.67}
56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 206/369 [21:56<17:19, 6.37s/it] 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 207/369 [22:03<17:15, 6.39s/it] {'loss': 0.0435, 'grad_norm': 0.7695544958114624, 'learning_rate': 8.553100079664598e-07, 'kl': 0.0195, 'entropy': -0.002, 'ce_loss': 0.0227, 'epoch': 1.68}
56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 207/369 [22:03<17:15, 6.39s/it] 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 208/369 [22:09<17:10, 6.40s/it] {'loss': 0.0366, 'grad_norm': 0.6444032192230225, 'learning_rate': 8.466083451213145e-07, 'kl': 0.0272, 'entropy': 0.0049, 'ce_loss': 0.0236, 'epoch': 1.69}
56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 208/369 [22:09<17:10, 6.40s/it] 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 209/369 [22:15<17:04, 6.40s/it] {'loss': 0.0497, 'grad_norm': 0.9819307923316956, 'learning_rate': 8.379185607937126e-07, 'kl': 0.0177, 'entropy': -0.0234, 'ce_loss': 0.0098, 'epoch': 1.7}
57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 209/369 [22:15<17:04, 6.40s/it] 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 210/369 [22:22<17:00, 6.42s/it] {'loss': 0.0425, 'grad_norm': 0.6892064213752747, 'learning_rate': 8.292413279130624e-07, 'kl': 0.0297, 'entropy': -0.0698, 'ce_loss': 0.029, 'epoch': 1.71}
57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 210/369 [22:22<17:00, 6.42s/it] 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 211/369 [22:28<16:54, 6.42s/it] {'loss': 0.043, 'grad_norm': 0.7541565299034119, 'learning_rate': 8.20577318436799e-07, 'kl': 0.0254, 'entropy': -0.0535, 'ce_loss': 0.012, 'epoch': 1.72}
57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 211/369 [22:28<16:54, 6.42s/it] 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 212/369 [22:35<16:52, 6.45s/it] {'loss': 0.0345, 'grad_norm': 0.7472063302993774, 'learning_rate': 8.119272032983486e-07, 'kl': 0.021, 'entropy': 0.0317, 'ce_loss': 0.0115, 'epoch': 1.72}
57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 212/369 [22:35<16:52, 6.45s/it] 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 213/369 [22:41<16:43, 6.43s/it] {'loss': 0.0403, 'grad_norm': 0.7694781422615051, 'learning_rate': 8.032916523551719e-07, 'kl': 0.0164, 'entropy': -0.0972, 'ce_loss': 0.0177, 'epoch': 1.73}
58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 213/369 [22:41<16:43, 6.43s/it] 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 214/369 [22:47<16:29, 6.38s/it] {'loss': 0.0415, 'grad_norm': 0.8205801844596863, 'learning_rate': 7.946713343368909e-07, 'kl': 0.021, 'entropy': -0.0447, 'ce_loss': 0.0192, 'epoch': 1.74}
58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 214/369 [22:47<16:29, 6.38s/it] 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 215/369 [22:54<16:22, 6.38s/it] {'loss': 0.0467, 'grad_norm': 0.8410895466804504, 'learning_rate': 7.860669167935028e-07, 'kl': 0.0219, 'entropy': -0.0933, 'ce_loss': 0.0217, 'epoch': 1.75}
58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 215/369 [22:54<16:22, 6.38s/it] 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 216/369 [23:00<16:19, 6.40s/it] {'loss': 0.0393, 'grad_norm': 0.6885534524917603, 'learning_rate': 7.774790660436857e-07, 'kl': 0.0277, 'entropy': -0.0947, 'ce_loss': 0.0223, 'epoch': 1.76}
59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 216/369 [23:00<16:19, 6.40s/it] 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 217/369 [23:07<16:06, 6.36s/it] {'loss': 0.0532, 'grad_norm': 0.8541436195373535, 'learning_rate': 7.689084471232e-07, 'kl': 0.025, 'entropy': -0.04, 'ce_loss': 0.014, 'epoch': 1.76}
59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 217/369 [23:07<16:06, 6.36s/it] 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 218/369 [23:13<15:58, 6.35s/it] {'loss': 0.0477, 'grad_norm': 0.8235027194023132, 'learning_rate': 7.603557237333878e-07, 'kl': 0.0282, 'entropy': -0.0566, 'ce_loss': 0.0173, 'epoch': 1.77}
59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 218/369 [23:13<15:58, 6.35s/it] 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 219/369 [23:19<15:47, 6.31s/it] {'loss': 0.0504, 'grad_norm': 0.8912014961242676, 'learning_rate': 7.518215581897763e-07, 'kl': 0.0293, 'entropy': -0.052, 'ce_loss': 0.018, 'epoch': 1.78}
59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 219/369 [23:19<15:47, 6.31s/it] 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 220/369 [23:25<15:34, 6.27s/it] {'loss': 0.049, 'grad_norm': 0.928795576095581, 'learning_rate': 7.433066113707895e-07, 'kl': 0.0476, 'entropy': -0.0422, 'ce_loss': 0.0201, 'epoch': 1.79}
60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 220/369 [23:25<15:34, 6.27s/it] 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 221/369 [23:32<15:30, 6.29s/it] {'loss': 0.0436, 'grad_norm': 0.8204329609870911, 'learning_rate': 7.348115426665704e-07, 'kl': 0.0048, 'entropy': -0.0889, 'ce_loss': 0.0275, 'epoch': 1.8}
60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 221/369 [23:32<15:30, 6.29s/it] 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 222/369 [23:38<15:23, 6.28s/it] {'loss': 0.046, 'grad_norm': 0.779718816280365, 'learning_rate': 7.263370099279171e-07, 'kl': 0.0222, 'entropy': -0.0728, 'ce_loss': 0.0242, 'epoch': 1.8}
60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 222/369 [23:38<15:23, 6.28s/it] 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 223/369 [23:44<15:23, 6.32s/it] {'loss': 0.0463, 'grad_norm': 0.7704541087150574, 'learning_rate': 7.178836694153405e-07, 'kl': 0.0267, 'entropy': -0.0244, 'ce_loss': 0.0121, 'epoch': 1.81}
60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 223/369 [23:44<15:23, 6.32s/it] 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 224/369 [23:51<15:20, 6.35s/it] {'loss': 0.0342, 'grad_norm': 0.7269431948661804, 'learning_rate': 7.094521757482439e-07, 'kl': 0.0052, 'entropy': -0.0471, 'ce_loss': 0.015, 'epoch': 1.82}
61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 224/369 [23:51<15:20, 6.35s/it] 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 225/369 [23:57<15:12, 6.34s/it] {'loss': 0.0513, 'grad_norm': 0.842043399810791, 'learning_rate': 7.010431818542297e-07, 'kl': 0.0176, 'entropy': -0.0381, 'ce_loss': 0.0218, 'epoch': 1.83}
61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 225/369 [23:57<15:12, 6.34s/it] 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 226/369 [24:03<15:04, 6.32s/it] {'loss': 0.0416, 'grad_norm': 0.8451390862464905, 'learning_rate': 6.92657338918537e-07, 'kl': 0.0447, 'entropy': -0.0967, 'ce_loss': 0.0375, 'epoch': 1.84}
61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 226/369 [24:03<15:04, 6.32s/it] 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 227/369 [24:10<15:04, 6.37s/it] {'loss': 0.0431, 'grad_norm': 0.8795459866523743, 'learning_rate': 6.842952963336153e-07, 'kl': 0.0159, 'entropy': -0.02, 'ce_loss': 0.0264, 'epoch': 1.85}
62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 227/369 [24:10<15:04, 6.37s/it] 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 228/369 [24:16<15:06, 6.43s/it] {'loss': 0.0355, 'grad_norm': 0.6665055751800537, 'learning_rate': 6.759577016488343e-07, 'kl': 0.0386, 'entropy': -0.0508, 'ce_loss': 0.0161, 'epoch': 1.85}
62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 228/369 [24:16<15:06, 6.43s/it] 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 229/369 [24:23<14:51, 6.37s/it] {'loss': 0.0464, 'grad_norm': 0.7767631411552429, 'learning_rate': 6.676452005203404e-07, 'kl': 0.0179, 'entropy': -0.1206, 'ce_loss': 0.0241, 'epoch': 1.86}
62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 229/369 [24:23<14:51, 6.37s/it] 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 230/369 [24:29<14:41, 6.34s/it] {'loss': 0.0531, 'grad_norm': 1.1198673248291016, 'learning_rate': 6.593584366610565e-07, 'kl': 0.0134, 'entropy': -0.1069, 'ce_loss': 0.0166, 'epoch': 1.87}
62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 230/369 [24:29<14:41, 6.34s/it] 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 231/369 [24:35<14:36, 6.35s/it] {'loss': 0.0403, 'grad_norm': 0.7267152070999146, 'learning_rate': 6.510980517908333e-07, 'kl': 0.0198, 'entropy': -0.0042, 'ce_loss': 0.0137, 'epoch': 1.88}
63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 231/369 [24:35<14:36, 6.35s/it] 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 232/369 [24:42<14:30, 6.35s/it] {'loss': 0.043, 'grad_norm': 0.8491414785385132, 'learning_rate': 6.428646855867552e-07, 'kl': 0.0386, 'entropy': -0.0496, 'ce_loss': 0.0367, 'epoch': 1.89}
63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 232/369 [24:42<14:30, 6.35s/it] 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 233/369 [24:48<14:28, 6.39s/it] {'loss': 0.0483, 'grad_norm': 0.8010396957397461, 'learning_rate': 6.34658975633605e-07, 'kl': 0.0095, 'entropy': -0.0121, 'ce_loss': 0.0172, 'epoch': 1.89}
63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 233/369 [24:48<14:28, 6.39s/it] 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 234/369 [24:54<14:17, 6.35s/it] {'loss': 0.0398, 'grad_norm': 0.7851506471633911, 'learning_rate': 6.264815573744884e-07, 'kl': 0.0121, 'entropy': -0.0693, 'ce_loss': 0.0265, 'epoch': 1.9}
63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 234/369 [24:54<14:17, 6.35s/it] 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 235/369 [25:01<14:10, 6.35s/it] {'loss': 0.0571, 'grad_norm': 1.0049231052398682, 'learning_rate': 6.183330640616273e-07, 'kl': 0.0374, 'entropy': -0.0181, 'ce_loss': 0.0245, 'epoch': 1.91}
64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 235/369 [25:01<14:10, 6.35s/it] 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 236/369 [25:07<14:04, 6.35s/it] {'loss': 0.0463, 'grad_norm': 0.8276854753494263, 'learning_rate': 6.102141267073207e-07, 'kl': 0.0366, 'entropy': -0.0581, 'ce_loss': 0.0286, 'epoch': 1.92}
64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 236/369 [25:07<14:04, 6.35s/it] 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 237/369 [25:14<14:07, 6.42s/it] {'loss': 0.0432, 'grad_norm': 0.843425989151001, 'learning_rate': 6.021253740350792e-07, 'kl': 0.0144, 'entropy': -0.0547, 'ce_loss': 0.023, 'epoch': 1.93}
64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 237/369 [25:14<14:07, 6.42s/it] 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 238/369 [25:20<14:00, 6.42s/it] {'loss': 0.0327, 'grad_norm': 0.6575609445571899, 'learning_rate': 5.94067432430938e-07, 'kl': 0.0359, 'entropy': -0.0354, 'ce_loss': 0.0133, 'epoch': 1.93}
64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 238/369 [25:20<14:00, 6.42s/it] 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 239/369 [25:26<13:49, 6.38s/it] {'loss': 0.0473, 'grad_norm': 0.7882652878761292, 'learning_rate': 5.860409258949499e-07, 'kl': 0.0103, 'entropy': 0.0081, 'ce_loss': 0.0153, 'epoch': 1.94}
65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 239/369 [25:26<13:49, 6.38s/it] 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 240/369 [25:33<13:42, 6.37s/it] {'loss': 0.0514, 'grad_norm': 0.8780028820037842, 'learning_rate': 5.780464759928623e-07, 'kl': 0.0415, 'entropy': -0.085, 'ce_loss': 0.0259, 'epoch': 1.95}
65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 240/369 [25:33<13:42, 6.37s/it] 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 241/369 [25:39<13:51, 6.49s/it] {'loss': 0.0362, 'grad_norm': 0.7720161080360413, 'learning_rate': 5.700847018079855e-07, 'kl': 0.0089, 'entropy': -0.0349, 'ce_loss': 0.0253, 'epoch': 1.96}
65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 241/369 [25:39<13:51, 6.49s/it] 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 242/369 [25:46<13:45, 6.50s/it] {'loss': 0.0431, 'grad_norm': 0.8019686341285706, 'learning_rate': 5.621562198932499e-07, 'kl': 0.0217, 'entropy': -0.0065, 'ce_loss': 0.0166, 'epoch': 1.97}
66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 242/369 [25:46<13:45, 6.50s/it] 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 243/369 [25:52<13:32, 6.45s/it] {'loss': 0.0407, 'grad_norm': 0.7299582958221436, 'learning_rate': 5.542616442234618e-07, 'kl': 0.0061, 'entropy': -0.0145, 'ce_loss': 0.0095, 'epoch': 1.98}
66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 243/369 [25:52<13:32, 6.45s/it] 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 244/369 [25:59<13:22, 6.42s/it] {'loss': 0.0422, 'grad_norm': 0.815242350101471, 'learning_rate': 5.464015861477557e-07, 'kl': 0.0231, 'entropy': -0.0005, 'ce_loss': 0.0147, 'epoch': 1.98}
66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 244/369 [25:59<13:22, 6.42s/it] 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 245/369 [26:05<13:15, 6.41s/it] {'loss': 0.0418, 'grad_norm': 0.773013710975647, 'learning_rate': 5.38576654342255e-07, 'kl': 0.0128, 'entropy': -0.0337, 'ce_loss': 0.041, 'epoch': 1.99}
66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 245/369 [26:05<13:15, 6.41s/it] 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 246/369 [26:11<13:05, 6.39s/it] {'loss': 0.0491, 'grad_norm': 0.8337002396583557, 'learning_rate': 5.307874547629339e-07, 'kl': 0.0192, 'entropy': -0.0708, 'ce_loss': 0.0321, 'epoch': 2.0}
67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 246/369 [26:11<13:05, 6.39s/it] 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 247/369 [26:18<13:00, 6.40s/it] {'loss': 0.0275, 'grad_norm': 0.592427134513855, 'learning_rate': 5.230345905986943e-07, 'kl': 0.0432, 'entropy': -0.0442, 'ce_loss': 0.0198, 'epoch': 2.01}
67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 247/369 [26:18<13:00, 6.40s/it] 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 248/369 [26:24<12:51, 6.38s/it] {'loss': 0.0317, 'grad_norm': 0.6923612952232361, 'learning_rate': 5.153186622246546e-07, 'kl': 0.0238, 'entropy': -0.0938, 'ce_loss': 0.0102, 'epoch': 2.02}
67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 248/369 [26:24<12:51, 6.38s/it] 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 249/369 [26:30<12:42, 6.36s/it] {'loss': 0.0336, 'grad_norm': 0.6824318766593933, 'learning_rate': 5.076402671556577e-07, 'kl': 0.0515, 'entropy': -0.0571, 'ce_loss': 0.0151, 'epoch': 2.02}
67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 249/369 [26:30<12:42, 6.36s/it] 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 250/369 [26:37<12:36, 6.36s/it] {'loss': 0.0345, 'grad_norm': 0.7583546042442322, 'learning_rate': 5.000000000000002e-07, 'kl': 0.0398, 'entropy': -0.0605, 'ce_loss': 0.0118, 'epoch': 2.03}
68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 250/369 [26:37<12:36, 6.36s/it] 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 251/369 [26:43<12:27, 6.34s/it] {'loss': 0.0292, 'grad_norm': 0.6661235690116882, 'learning_rate': 4.923984524133843e-07, 'kl': 0.0269, 'entropy': -0.0493, 'ce_loss': 0.0117, 'epoch': 2.04}
68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 251/369 [26:43<12:27, 6.34s/it] 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 252/369 [26:49<12:18, 6.31s/it] {'loss': 0.0299, 'grad_norm': 0.7087632417678833, 'learning_rate': 4.848362130531039e-07, 'kl': 0.0742, 'entropy': -0.1089, 'ce_loss': 0.02, 'epoch': 2.05}
68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 252/369 [26:49<12:18, 6.31s/it] 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 253/369 [26:56<12:13, 6.32s/it] {'loss': 0.0302, 'grad_norm': 0.6678736209869385, 'learning_rate': 4.773138675324567e-07, 'kl': 0.0284, 'entropy': -0.0175, 'ce_loss': 0.018, 'epoch': 2.06}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 253/369 [26:56<12:13, 6.32s/it] 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 254/369 [27:02<12:13, 6.38s/it] {'loss': 0.0332, 'grad_norm': 0.648098349571228, 'learning_rate': 4.69831998375397e-07, 'kl': 0.0811, 'entropy': -0.0938, 'ce_loss': 0.0205, 'epoch': 2.07}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 254/369 [27:02<12:13, 6.38s/it] 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 255/369 [27:09<12:09, 6.40s/it] {'loss': 0.0376, 'grad_norm': 0.6987475752830505, 'learning_rate': 4.623911849714225e-07, 'kl': 0.0265, 'entropy': -0.0204, 'ce_loss': 0.0091, 'epoch': 2.07}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 255/369 [27:09<12:09, 6.40s/it] 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 256/369 [27:15<12:02, 6.40s/it] {'loss': 0.0317, 'grad_norm': 0.7036008834838867, 'learning_rate': 4.5499200353071065e-07, 'kl': 0.0398, 'entropy': -0.0311, 'ce_loss': 0.0143, 'epoch': 2.08}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 256/369 [27:15<12:02, 6.40s/it] 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 257/369 [27:21<11:57, 6.40s/it] {'loss': 0.0277, 'grad_norm': 0.6011433005332947, 'learning_rate': 4.476350270394942e-07, 'kl': 0.0069, 'entropy': 0.006, 'ce_loss': 0.0112, 'epoch': 2.09}
70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 257/369 [27:21<11:57, 6.40s/it] 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 258/369 [27:28<11:54, 6.44s/it] {'loss': 0.0318, 'grad_norm': 0.6651270985603333, 'learning_rate': 4.40320825215692e-07, 'kl': 0.0369, 'entropy': -0.0747, 'ce_loss': 0.0143, 'epoch': 2.1}
70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 258/369 [27:28<11:54, 6.44s/it] 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 259/369 [27:34<11:50, 6.46s/it] {'loss': 0.0335, 'grad_norm': 0.7619247436523438, 'learning_rate': 4.330499644647885e-07, 'kl': 0.027, 'entropy': -0.05, 'ce_loss': 0.011, 'epoch': 2.11}
70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 259/369 [27:34<11:50, 6.46s/it] 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 260/369 [27:41<11:46, 6.48s/it] {'loss': 0.0351, 'grad_norm': 0.7189511060714722, 'learning_rate': 4.25823007835974e-07, 'kl': 0.0253, 'entropy': -0.0933, 'ce_loss': 0.0248, 'epoch': 2.11}
70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 260/369 [27:41<11:46, 6.48s/it] 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 261/369 [27:48<11:46, 6.54s/it] {'loss': 0.0237, 'grad_norm': 0.5641470551490784, 'learning_rate': 4.1864051497854027e-07, 'kl': 0.0356, 'entropy': -0.0532, 'ce_loss': 0.009, 'epoch': 2.12}
71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 261/369 [27:48<11:46, 6.54s/it] 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 262/369 [27:54<11:36, 6.51s/it] {'loss': 0.0344, 'grad_norm': 0.896845281124115, 'learning_rate': 4.115030420985437e-07, 'kl': 0.0266, 'entropy': -0.0557, 'ce_loss': 0.0168, 'epoch': 2.13}
71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 262/369 [27:54<11:36, 6.51s/it] 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 263/369 [28:00<11:24, 6.46s/it] {'loss': 0.0302, 'grad_norm': 0.7131102681159973, 'learning_rate': 4.044111419157326e-07, 'kl': 0.0405, 'entropy': -0.0133, 'ce_loss': 0.0096, 'epoch': 2.14}
71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 263/369 [28:00<11:24, 6.46s/it] 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 264/369 [28:07<11:15, 6.44s/it] {'loss': 0.0317, 'grad_norm': 0.7297512292861938, 'learning_rate': 3.973653636207437e-07, 'kl': 0.0287, 'entropy': -0.0698, 'ce_loss': 0.0132, 'epoch': 2.15}
72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 264/369 [28:07<11:15, 6.44s/it] 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 265/369 [28:13<11:04, 6.39s/it] {'loss': 0.0385, 'grad_norm': 0.7612536549568176, 'learning_rate': 3.9036625283257587e-07, 'kl': 0.0289, 'entropy': -0.0635, 'ce_loss': 0.0129, 'epoch': 2.15}
72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 265/369 [28:13<11:04, 6.39s/it] 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 266/369 [28:19<10:56, 6.38s/it] {'loss': 0.0348, 'grad_norm': 0.740138053894043, 'learning_rate': 3.834143515563357e-07, 'kl': 0.014, 'entropy': -0.0277, 'ce_loss': 0.0065, 'epoch': 2.16}
72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 266/369 [28:19<10:56, 6.38s/it] 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 267/369 [28:26<10:52, 6.39s/it] {'loss': 0.0259, 'grad_norm': 0.6638992428779602, 'learning_rate': 3.765101981412665e-07, 'kl': 0.0166, 'entropy': -0.0151, 'ce_loss': 0.0059, 'epoch': 2.17}
72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 267/369 [28:26<10:52, 6.39s/it] 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 268/369 [28:32<10:47, 6.41s/it] {'loss': 0.0314, 'grad_norm': 0.634029746055603, 'learning_rate': 3.696543272390573e-07, 'kl': 0.0374, 'entropy': -0.0723, 'ce_loss': 0.0181, 'epoch': 2.18}
73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 268/369 [28:32<10:47, 6.41s/it] 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 269/369 [28:39<10:38, 6.38s/it] {'loss': 0.0318, 'grad_norm': 0.7342073917388916, 'learning_rate': 3.628472697624422e-07, 'kl': 0.0156, 'entropy': 0.0096, 'ce_loss': 0.0284, 'epoch': 2.19}
73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 269/369 [28:39<10:38, 6.38s/it] 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 270/369 [28:45<10:32, 6.39s/it] {'loss': 0.0386, 'grad_norm': 0.7904759049415588, 'learning_rate': 3.560895528440844e-07, 'kl': 0.019, 'entropy': -0.0801, 'ce_loss': 0.0239, 'epoch': 2.2}
73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 270/369 [28:45<10:32, 6.39s/it] 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 271/369 [28:51<10:24, 6.37s/it] {'loss': 0.0343, 'grad_norm': 0.7888381481170654, 'learning_rate': 3.4938169979575817e-07, 'kl': 0.0339, 'entropy': -0.0874, 'ce_loss': 0.0154, 'epoch': 2.2}
73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 271/369 [28:51<10:24, 6.37s/it] 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 272/369 [28:58<10:17, 6.36s/it] {'loss': 0.0292, 'grad_norm': 0.7137593030929565, 'learning_rate': 3.4272423006782127e-07, 'kl': 0.0356, 'entropy': -0.0154, 'ce_loss': 0.018, 'epoch': 2.21}
74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 272/369 [28:58<10:17, 6.36s/it] 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 273/369 [29:04<10:12, 6.38s/it] {'loss': 0.0281, 'grad_norm': 0.6849740147590637, 'learning_rate': 3.3611765920899183e-07, 'kl': 0.0327, 'entropy': -0.1279, 'ce_loss': 0.0158, 'epoch': 2.22}
74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 273/369 [29:04<10:12, 6.38s/it] 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 274/369 [29:10<10:04, 6.36s/it] {'loss': 0.029, 'grad_norm': 0.6829760074615479, 'learning_rate': 3.295624988264224e-07, 'kl': 0.0466, 'entropy': -0.0923, 'ce_loss': 0.0102, 'epoch': 2.23}
74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 274/369 [29:10<10:04, 6.36s/it] 75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 275/369 [29:17<09:55, 6.34s/it] {'loss': 0.0341, 'grad_norm': 0.7327803373336792, 'learning_rate': 3.2305925654608324e-07, 'kl': 0.0649, 'entropy': -0.0811, 'ce_loss': 0.0152, 'epoch': 2.24}
75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 275/369 [29:17<09:55, 6.34s/it] 75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 276/369 [29:23<09:48, 6.33s/it] {'loss': 0.0297, 'grad_norm': 0.6945962905883789, 'learning_rate': 3.166084359734513e-07, 'kl': 0.0645, 'entropy': -0.0967, 'ce_loss': 0.0102, 'epoch': 2.24}
75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 276/369 [29:23<09:48, 6.33s/it] 75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 277/369 [29:30<09:45, 6.36s/it] {'loss': 0.0345, 'grad_norm': 0.7943205237388611, 'learning_rate': 3.10210536654512e-07, 'kl': 0.0238, 'entropy': -0.0101, 'ce_loss': 0.0114, 'epoch': 2.25}
75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 277/369 [29:30<09:45, 6.36s/it] 75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 278/369 [29:36<09:38, 6.36s/it] {'loss': 0.0279, 'grad_norm': 0.7254714369773865, 'learning_rate': 3.0386605403707343e-07, 'kl': 0.0981, 'entropy': -0.1133, 'ce_loss': 0.0126, 'epoch': 2.26}
75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 278/369 [29:36<09:38, 6.36s/it] 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 279/369 [29:42<09:30, 6.33s/it] {'loss': 0.0365, 'grad_norm': 0.8231030106544495, 'learning_rate': 2.975754794324015e-07, 'kl': 0.0444, 'entropy': -0.1064, 'ce_loss': 0.0129, 'epoch': 2.27}
76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 279/369 [29:42<09:30, 6.33s/it] 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 280/369 [29:49<09:25, 6.35s/it] {'loss': 0.0309, 'grad_norm': 0.7104822397232056, 'learning_rate': 2.913392999771718e-07, 'kl': 0.0381, 'entropy': -0.0957, 'ce_loss': 0.0091, 'epoch': 2.28}
76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 280/369 [29:49<09:25, 6.35s/it] 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 281/369 [29:55<09:19, 6.35s/it] {'loss': 0.0363, 'grad_norm': 0.8100852370262146, 'learning_rate': 2.8515799859574584e-07, 'kl': 0.0574, 'entropy': -0.1016, 'ce_loss': 0.0145, 'epoch': 2.28}
76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 281/369 [29:55<09:19, 6.35s/it] 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 282/369 [30:01<09:13, 6.36s/it] {'loss': 0.0367, 'grad_norm': 0.8075007796287537, 'learning_rate': 2.790320539627754e-07, 'kl': 0.0244, 'entropy': -0.0879, 'ce_loss': 0.0107, 'epoch': 2.29}
76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 282/369 [30:01<09:13, 6.36s/it] 77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 283/369 [30:08<09:05, 6.34s/it] {'loss': 0.0297, 'grad_norm': 0.7835853099822998, 'learning_rate': 2.729619404661321e-07, 'kl': 0.0403, 'entropy': -0.1069, 'ce_loss': 0.0125, 'epoch': 2.3}
77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 283/369 [30:08<09:05, 6.34s/it] 77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 284/369 [30:14<09:03, 6.40s/it] {'loss': 0.0286, 'grad_norm': 0.6912492513656616, 'learning_rate': 2.6694812817017387e-07, 'kl': 0.0562, 'entropy': -0.064, 'ce_loss': 0.005, 'epoch': 2.31}
77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 284/369 [30:14<09:03, 6.40s/it] 77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 285/369 [30:20<08:55, 6.37s/it] {'loss': 0.0337, 'grad_norm': 0.7376478314399719, 'learning_rate': 2.60991082779341e-07, 'kl': 0.033, 'entropy': -0.0786, 'ce_loss': 0.0103, 'epoch': 2.32}
77%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 285/369 [30:20<08:55, 6.37s/it] 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 286/369 [30:27<08:49, 6.38s/it] {'loss': 0.0298, 'grad_norm': 0.7830272912979126, 'learning_rate': 2.550912656020943e-07, 'kl': 0.027, 'entropy': -0.0825, 'ce_loss': 0.0101, 'epoch': 2.33}
78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 286/369 [30:27<08:49, 6.38s/it] 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 287/369 [30:33<08:43, 6.38s/it] {'loss': 0.0295, 'grad_norm': 0.6989503502845764, 'learning_rate': 2.492491335151908e-07, 'kl': 0.0199, 'entropy': -0.0649, 'ce_loss': 0.0099, 'epoch': 2.33}
78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 287/369 [30:33<08:43, 6.38s/it] 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 288/369 [30:40<08:37, 6.39s/it] {'loss': 0.0282, 'grad_norm': 0.6959418654441833, 'learning_rate': 2.434651389283042e-07, 'kl': 0.085, 'entropy': -0.0688, 'ce_loss': 0.0172, 'epoch': 2.34}
78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 288/369 [30:40<08:37, 6.39s/it] 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 289/369 [30:46<08:29, 6.37s/it] {'loss': 0.0308, 'grad_norm': 0.747646152973175, 'learning_rate': 2.3773972974898947e-07, 'kl': 0.0737, 'entropy': -0.0569, 'ce_loss': 0.0164, 'epoch': 2.35}
78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 289/369 [30:46<08:29, 6.37s/it] 79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 290/369 [30:52<08:24, 6.38s/it] {'loss': 0.027, 'grad_norm': 0.6402429342269897, 'learning_rate': 2.3207334934799916e-07, 'kl': 0.0383, 'entropy': -0.0408, 'ce_loss': 0.0122, 'epoch': 2.36}
79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 290/369 [30:52<08:24, 6.38s/it] 79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 291/369 [30:59<08:15, 6.35s/it] {'loss': 0.0291, 'grad_norm': 0.7346453070640564, 'learning_rate': 2.264664365249469e-07, 'kl': 0.0493, 'entropy': -0.0918, 'ce_loss': 0.0137, 'epoch': 2.37}
79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 291/369 [30:59<08:15, 6.35s/it] 79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 292/369 [31:05<08:09, 6.35s/it] {'loss': 0.0324, 'grad_norm': 0.7220074534416199, 'learning_rate': 2.209194254743295e-07, 'kl': 0.0337, 'entropy': -0.0049, 'ce_loss': 0.0245, 'epoch': 2.37}
79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 292/369 [31:05<08:09, 6.35s/it] 79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 293/369 [31:11<08:01, 6.34s/it] {'loss': 0.0322, 'grad_norm': 0.787611722946167, 'learning_rate': 2.1543274575190185e-07, 'kl': 0.0444, 'entropy': -0.0938, 'ce_loss': 0.0035, 'epoch': 2.38}
79%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 293/369 [31:11<08:01, 6.34s/it] 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 294/369 [31:18<07:55, 6.34s/it] {'loss': 0.0422, 'grad_norm': 0.9786776304244995, 'learning_rate': 2.100068222414121e-07, 'kl': 0.0361, 'entropy': -0.064, 'ce_loss': 0.0198, 'epoch': 2.39}
80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 294/369 [31:18<07:55, 6.34s/it] 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 295/369 [31:24<07:52, 6.38s/it] {'loss': 0.0402, 'grad_norm': 0.8480374813079834, 'learning_rate': 2.0464207512170063e-07, 'kl': 0.1069, 'entropy': -0.1748, 'ce_loss': 0.0069, 'epoch': 2.4}
80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 295/369 [31:24<07:52, 6.38s/it] 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 296/369 [31:30<07:46, 6.39s/it] {'loss': 0.026, 'grad_norm': 0.6286333799362183, 'learning_rate': 1.9933891983416006e-07, 'kl': 0.0281, 'entropy': -0.0359, 'ce_loss': 0.0169, 'epoch': 2.41}
80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 296/369 [31:30<07:46, 6.39s/it] 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 297/369 [31:37<07:39, 6.38s/it] {'loss': 0.03, 'grad_norm': 0.6930149793624878, 'learning_rate': 1.9409776705056514e-07, 'kl': 0.0471, 'entropy': -0.1289, 'ce_loss': 0.0352, 'epoch': 2.41}
80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 297/369 [31:37<07:39, 6.38s/it] 81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 298/369 [31:43<07:36, 6.44s/it] {'loss': 0.0349, 'grad_norm': 0.7591108083724976, 'learning_rate': 1.8891902264127e-07, 'kl': 0.0542, 'entropy': -0.0618, 'ce_loss': 0.0153, 'epoch': 2.42}
81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 298/369 [31:43<07:36, 6.44s/it] 81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 299/369 [31:50<07:32, 6.47s/it] {'loss': 0.0294, 'grad_norm': 0.7177157402038574, 'learning_rate': 1.8380308764377838e-07, 'kl': 0.0918, 'entropy': -0.1436, 'ce_loss': 0.0257, 'epoch': 2.43}
81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 299/369 [31:50<07:32, 6.47s/it] 81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 300/369 [31:56<07:25, 6.45s/it] {'loss': 0.0378, 'grad_norm': 0.74204421043396, 'learning_rate': 1.787503582316864e-07, 'kl': 0.0147, 'entropy': -0.0107, 'ce_loss': 0.0087, 'epoch': 2.44}
81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 300/369 [31:56<07:25, 6.45s/it] 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 301/369 [32:03<07:17, 6.44s/it] {'loss': 0.0344, 'grad_norm': 0.8505944013595581, 'learning_rate': 1.737612256840053e-07, 'kl': 0.0337, 'entropy': -0.0645, 'ce_loss': 0.0227, 'epoch': 2.45}
82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 301/369 [32:03<07:17, 6.44s/it] 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 302/369 [32:09<07:10, 6.42s/it] {'loss': 0.0245, 'grad_norm': 0.6278170347213745, 'learning_rate': 1.6883607635485874e-07, 'kl': 0.0684, 'entropy': -0.1006, 'ce_loss': 0.0041, 'epoch': 2.46}
82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 302/369 [32:09<07:10, 6.42s/it] 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 303/369 [32:16<07:03, 6.41s/it] {'loss': 0.0279, 'grad_norm': 0.6695606708526611, 'learning_rate': 1.6397529164356606e-07, 'kl': 0.022, 'entropy': -0.0649, 'ce_loss': 0.0085, 'epoch': 2.46}
82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 303/369 [32:16<07:03, 6.41s/it] 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 304/369 [32:22<06:59, 6.46s/it] {'loss': 0.0316, 'grad_norm': 0.6989527940750122, 'learning_rate': 1.5917924796510584e-07, 'kl': 0.0962, 'entropy': -0.0566, 'ce_loss': 0.015, 'epoch': 2.47}
82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 304/369 [32:22<06:59, 6.46s/it] 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 305/369 [32:28<06:51, 6.43s/it] {'loss': 0.0293, 'grad_norm': 0.7260941863059998, 'learning_rate': 1.5444831672096638e-07, 'kl': 0.0408, 'entropy': -0.0491, 'ce_loss': 0.0116, 'epoch': 2.48}
83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 305/369 [32:28<06:51, 6.43s/it] 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 306/369 [32:35<06:43, 6.41s/it] {'loss': 0.0357, 'grad_norm': 0.8513651490211487, 'learning_rate': 1.49782864270386e-07, 'kl': 0.0549, 'entropy': 0.0168, 'ce_loss': 0.0138, 'epoch': 2.49}
83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 306/369 [32:35<06:43, 6.41s/it] 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 307/369 [32:41<06:40, 6.45s/it] {'loss': 0.0281, 'grad_norm': 0.6192908883094788, 'learning_rate': 1.4518325190198076e-07, 'kl': 0.0292, 'entropy': -0.0322, 'ce_loss': 0.0083, 'epoch': 2.5}
83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 307/369 [32:41<06:40, 6.45s/it] 83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 308/369 [32:48<06:31, 6.41s/it] {'loss': 0.0307, 'grad_norm': 0.7162027955055237, 'learning_rate': 1.4064983580576827e-07, 'kl': 0.0195, 'entropy': -0.0107, 'ce_loss': 0.0187, 'epoch': 2.5}
83%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 308/369 [32:48<06:31, 6.41s/it] 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 309/369 [32:54<06:24, 6.40s/it] {'loss': 0.0318, 'grad_norm': 0.7101048231124878, 'learning_rate': 1.3618296704558364e-07, 'kl': 0.0337, 'entropy': -0.053, 'ce_loss': 0.0154, 'epoch': 2.51}
84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 309/369 [32:54<06:24, 6.40s/it] 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 310/369 [33:00<06:17, 6.39s/it] {'loss': 0.032, 'grad_norm': 0.7495595216751099, 'learning_rate': 1.3178299153189365e-07, 'kl': 0.0664, 'entropy': -0.0854, 'ce_loss': 0.021, 'epoch': 2.52}
84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 310/369 [33:00<06:17, 6.39s/it] 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 311/369 [33:07<06:10, 6.38s/it] {'loss': 0.027, 'grad_norm': 0.695936918258667, 'learning_rate': 1.2745024999500941e-07, 'kl': 0.0337, 'entropy': -0.0396, 'ce_loss': 0.0171, 'epoch': 2.53}
84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 311/369 [33:07<06:10, 6.38s/it] 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 312/369 [33:13<06:07, 6.45s/it] {'loss': 0.0315, 'grad_norm': 0.7476539611816406, 'learning_rate': 1.2318507795870137e-07, 'kl': 0.0322, 'entropy': -0.0491, 'ce_loss': 0.0086, 'epoch': 2.54}
85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 312/369 [33:13<06:07, 6.45s/it] 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 313/369 [33:20<06:01, 6.46s/it] {'loss': 0.04, 'grad_norm': 0.9013779163360596, 'learning_rate': 1.1898780571421552e-07, 'kl': 0.0405, 'entropy': -0.0552, 'ce_loss': 0.0318, 'epoch': 2.54}
85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 313/369 [33:20<06:01, 6.46s/it] 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 314/369 [33:27<05:57, 6.51s/it] {'loss': 0.0304, 'grad_norm': 0.7427871823310852, 'learning_rate': 1.1485875829469705e-07, 'kl': 0.0391, 'entropy': -0.0332, 'ce_loss': 0.0135, 'epoch': 2.55}
85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 314/369 [33:27<05:57, 6.51s/it] 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 315/369 [33:33<05:58, 6.63s/it] {'loss': 0.0258, 'grad_norm': 0.6271719336509705, 'learning_rate': 1.1079825545001886e-07, 'kl': 0.024, 'entropy': -0.0708, 'ce_loss': 0.0048, 'epoch': 2.56}
85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 315/369 [33:33<05:58, 6.63s/it] 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 316/369 [33:40<05:47, 6.56s/it] {'loss': 0.0362, 'grad_norm': 0.8094239234924316, 'learning_rate': 1.0680661162202176e-07, 'kl': 0.042, 'entropy': -0.0371, 'ce_loss': 0.0486, 'epoch': 2.57}
86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 316/369 [33:40<05:47, 6.56s/it] 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 317/369 [33:46<05:38, 6.52s/it] {'loss': 0.028, 'grad_norm': 0.7834252715110779, 'learning_rate': 1.0288413592016343e-07, 'kl': 0.022, 'entropy': -0.0325, 'ce_loss': 0.013, 'epoch': 2.58}
86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 317/369 [33:46<05:38, 6.52s/it] 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 318/369 [33:53<05:30, 6.47s/it] {'loss': 0.0315, 'grad_norm': 0.706376314163208, 'learning_rate': 9.903113209758096e-08, 'kl': 0.0173, 'entropy': -0.052, 'ce_loss': 0.0118, 'epoch': 2.59}
86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 318/369 [33:53<05:30, 6.47s/it] 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 319/369 [33:59<05:23, 6.46s/it] {'loss': 0.0305, 'grad_norm': 0.8203223347663879, 'learning_rate': 9.524789852756954e-08, 'kl': 0.0339, 'entropy': -0.043, 'ce_loss': 0.0093, 'epoch': 2.59}
86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 319/369 [33:59<05:23, 6.46s/it] 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 320/369 [34:05<05:14, 6.43s/it] {'loss': 0.0321, 'grad_norm': 0.7787702679634094, 'learning_rate': 9.153472818047625e-08, 'kl': 0.0308, 'entropy': -0.0101, 'ce_loss': 0.0086, 'epoch': 2.6}
87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 320/369 [34:05<05:14, 6.43s/it] 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 321/369 [34:12<05:09, 6.44s/it] {'loss': 0.0342, 'grad_norm': 0.818056046962738, 'learning_rate': 8.789190860101226e-08, 'kl': 0.0378, 'entropy': -0.0693, 'ce_loss': 0.0101, 'epoch': 2.61}
87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 321/369 [34:12<05:09, 6.44s/it] 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 322/369 [34:18<05:02, 6.43s/it] {'loss': 0.0326, 'grad_norm': 0.7471586465835571, 'learning_rate': 8.431972188598579e-08, 'kl': 0.0239, 'entropy': -0.1147, 'ce_loss': 0.0141, 'epoch': 2.62}
87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 322/369 [34:18<05:02, 6.43s/it] 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 323/369 [34:25<04:55, 6.42s/it] {'loss': 0.0232, 'grad_norm': 0.6272115111351013, 'learning_rate': 8.081844466245735e-08, 'kl': 0.0474, 'entropy': -0.1069, 'ce_loss': 0.0097, 'epoch': 2.63}
88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 323/369 [34:25<04:55, 6.42s/it] 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 324/369 [34:31<04:49, 6.43s/it] {'loss': 0.0299, 'grad_norm': 0.7132481932640076, 'learning_rate': 7.73883480663171e-08, 'kl': 0.0152, 'entropy': -0.0703, 'ce_loss': 0.0237, 'epoch': 2.63}
88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 324/369 [34:31<04:49, 6.43s/it] 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 325/369 [34:38<04:43, 6.45s/it] {'loss': 0.0325, 'grad_norm': 0.6562875509262085, 'learning_rate': 7.402969772128931e-08, 'kl': 0.0282, 'entropy': -0.0354, 'ce_loss': 0.0092, 'epoch': 2.64}
88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 325/369 [34:38<04:43, 6.45s/it] 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 326/369 [34:44<04:38, 6.48s/it] {'loss': 0.0262, 'grad_norm': 0.6630709767341614, 'learning_rate': 7.074275371836147e-08, 'kl': 0.0312, 'entropy': -0.123, 'ce_loss': 0.0176, 'epoch': 2.65}
88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 326/369 [34:44<04:38, 6.48s/it] 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 327/369 [34:51<04:33, 6.52s/it] {'loss': 0.0356, 'grad_norm': 0.7848518490791321, 'learning_rate': 6.75277705956443e-08, 'kl': 0.0427, 'entropy': -0.0645, 'ce_loss': 0.02, 'epoch': 2.66}
89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 327/369 [34:51<04:33, 6.52s/it] 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 328/369 [34:57<04:25, 6.48s/it] {'loss': 0.0298, 'grad_norm': 0.6947987675666809, 'learning_rate': 6.438499731865965e-08, 'kl': 0.0425, 'entropy': -0.1318, 'ce_loss': 0.0185, 'epoch': 2.67}
89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 328/369 [34:57<04:25, 6.48s/it] 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 329/369 [35:04<04:17, 6.43s/it] {'loss': 0.0322, 'grad_norm': 0.6863346099853516, 'learning_rate': 6.131467726106143e-08, 'kl': 0.0645, 'entropy': 0.0435, 'ce_loss': 0.0259, 'epoch': 2.67}
89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 329/369 [35:04<04:17, 6.43s/it] 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 330/369 [35:10<04:10, 6.42s/it] {'loss': 0.0319, 'grad_norm': 0.7936291694641113, 'learning_rate': 5.831704818578842e-08, 'kl': 0.043, 'entropy': -0.0864, 'ce_loss': 0.0195, 'epoch': 2.68}
89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 330/369 [35:10<04:10, 6.42s/it] 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 331/369 [35:16<04:03, 6.42s/it] {'loss': 0.0348, 'grad_norm': 0.8411721587181091, 'learning_rate': 5.539234222665279e-08, 'kl': 0.042, 'entropy': -0.0752, 'ce_loss': 0.0191, 'epoch': 2.69}
90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 331/369 [35:16<04:03, 6.42s/it] 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 332/369 [35:23<03:56, 6.40s/it] {'loss': 0.0305, 'grad_norm': 0.7145872116088867, 'learning_rate': 5.254078587036282e-08, 'kl': 0.0254, 'entropy': -0.0845, 'ce_loss': 0.0267, 'epoch': 2.7}
90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 332/369 [35:23<03:56, 6.40s/it] 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 333/369 [35:29<03:50, 6.40s/it] {'loss': 0.0362, 'grad_norm': 0.8016871809959412, 'learning_rate': 4.976259993898502e-08, 'kl': 0.064, 'entropy': -0.0669, 'ce_loss': 0.0157, 'epoch': 2.71}
90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 333/369 [35:29<03:50, 6.40s/it] 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 334/369 [35:36<03:44, 6.42s/it] {'loss': 0.0358, 'grad_norm': 0.7614614367485046, 'learning_rate': 4.705799957284351e-08, 'kl': 0.0255, 'entropy': -0.0095, 'ce_loss': 0.0072, 'epoch': 2.72}
91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 334/369 [35:36<03:44, 6.42s/it] 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 335/369 [35:42<03:36, 6.37s/it] {'loss': 0.0304, 'grad_norm': 0.7528671622276306, 'learning_rate': 4.442719421385921e-08, 'kl': 0.0303, 'entropy': -0.0217, 'ce_loss': 0.0119, 'epoch': 2.72}
91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 335/369 [35:42<03:36, 6.37s/it] 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 336/369 [35:48<03:29, 6.36s/it] {'loss': 0.0371, 'grad_norm': 0.8338404893875122, 'learning_rate': 4.187038758933203e-08, 'kl': 0.0369, 'entropy': -0.0874, 'ce_loss': 0.0171, 'epoch': 2.73}
91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 336/369 [35:48<03:29, 6.36s/it] 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 337/369 [35:55<03:24, 6.40s/it] {'loss': 0.0287, 'grad_norm': 0.7217919826507568, 'learning_rate': 3.938777769616275e-08, 'kl': 0.02, 'entropy': -0.0371, 'ce_loss': 0.0129, 'epoch': 2.74}
91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 337/369 [35:55<03:24, 6.40s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 338/369 [36:01<03:18, 6.40s/it] {'loss': 0.0317, 'grad_norm': 0.7246120572090149, 'learning_rate': 3.697955678552212e-08, 'kl': 0.0154, 'entropy': -0.0762, 'ce_loss': 0.0325, 'epoch': 2.75}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 338/369 [36:01<03:18, 6.40s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 339/369 [36:07<03:11, 6.37s/it] {'loss': 0.0294, 'grad_norm': 0.7057827711105347, 'learning_rate': 3.464591134796135e-08, 'kl': 0.0277, 'entropy': -0.0615, 'ce_loss': 0.0123, 'epoch': 2.76}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 339/369 [36:07<03:11, 6.37s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 340/369 [36:14<03:05, 6.39s/it] {'loss': 0.0276, 'grad_norm': 0.6279847025871277, 'learning_rate': 3.238702209897215e-08, 'kl': 0.0086, 'entropy': -0.0334, 'ce_loss': 0.0117, 'epoch': 2.76}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 340/369 [36:14<03:05, 6.39s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 341/369 [36:20<03:00, 6.43s/it] {'loss': 0.0243, 'grad_norm': 0.6225998401641846, 'learning_rate': 3.0203063964990616e-08, 'kl': 0.0312, 'entropy': -0.0471, 'ce_loss': 0.0145, 'epoch': 2.77}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 341/369 [36:20<03:00, 6.43s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 342/369 [36:27<02:55, 6.49s/it] {'loss': 0.0257, 'grad_norm': 0.6585512161254883, 'learning_rate': 2.8094206069852355e-08, 'kl': 0.0212, 'entropy': -0.0408, 'ce_loss': 0.0214, 'epoch': 2.78}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 342/369 [36:27<02:55, 6.49s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 343/369 [36:33<02:48, 6.49s/it] {'loss': 0.0494, 'grad_norm': 0.8852731585502625, 'learning_rate': 2.6060611721695268e-08, 'kl': 0.0283, 'entropy': -0.0439, 'ce_loss': 0.0256, 'epoch': 2.79}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 343/369 [36:33<02:48, 6.49s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 344/369 [36:40<02:42, 6.49s/it] {'loss': 0.0231, 'grad_norm': 0.6776178479194641, 'learning_rate': 2.4102438400312787e-08, 'kl': 0.0359, 'entropy': -0.0508, 'ce_loss': 0.0148, 'epoch': 2.8}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 344/369 [36:40<02:42, 6.49s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 345/369 [36:46<02:33, 6.42s/it] {'loss': 0.0308, 'grad_norm': 0.7607916593551636, 'learning_rate': 2.221983774495928e-08, 'kl': 0.0718, 'entropy': -0.0559, 'ce_loss': 0.0068, 'epoch': 2.8}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 345/369 [36:46<02:33, 6.42s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 346/369 [36:52<02:26, 6.38s/it] {'loss': 0.0407, 'grad_norm': 0.8311781883239746, 'learning_rate': 2.0412955542606468e-08, 'kl': 0.0297, 'entropy': -0.0282, 'ce_loss': 0.0169, 'epoch': 2.81}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 346/369 [36:52<02:26, 6.38s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 347/369 [36:59<02:19, 6.34s/it] {'loss': 0.0328, 'grad_norm': 0.7213769555091858, 'learning_rate': 1.868193171665522e-08, 'kl': 0.054, 'entropy': -0.0496, 'ce_loss': 0.0327, 'epoch': 2.82}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 347/369 [36:59<02:19, 6.34s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 348/369 [37:05<02:12, 6.31s/it] {'loss': 0.0263, 'grad_norm': 0.8006882667541504, 'learning_rate': 1.7026900316098212e-08, 'kl': 0.0288, 'entropy': 0.0012, 'ce_loss': 0.0181, 'epoch': 2.83}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 348/369 [37:05<02:12, 6.31s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 349/369 [37:11<02:05, 6.30s/it] {'loss': 0.0396, 'grad_norm': 0.8049754500389099, 'learning_rate': 1.5447989505140923e-08, 'kl': 0.0258, 'entropy': -0.0273, 'ce_loss': 0.0061, 'epoch': 2.84}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 349/369 [37:11<02:05, 6.30s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 350/369 [37:18<02:00, 6.37s/it] {'loss': 0.029, 'grad_norm': 0.6737738251686096, 'learning_rate': 1.3945321553275325e-08, 'kl': 0.0469, 'entropy': -0.0361, 'ce_loss': 0.0137, 'epoch': 2.85}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 350/369 [37:18<02:00, 6.37s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 351/369 [37:24<01:55, 6.40s/it] {'loss': 0.0311, 'grad_norm': 0.7061965465545654, 'learning_rate': 1.2519012825812803e-08, 'kl': 0.0884, 'entropy': -0.1006, 'ce_loss': 0.0099, 'epoch': 2.85}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 351/369 [37:24<01:55, 6.40s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 352/369 [37:31<01:49, 6.42s/it] {'loss': 0.0306, 'grad_norm': 0.7362340092658997, 'learning_rate': 1.1169173774871477e-08, 'kl': 0.0232, 'entropy': -0.0173, 'ce_loss': 0.0075, 'epoch': 2.86}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 352/369 [37:31<01:49, 6.42s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 353/369 [37:37<01:41, 6.37s/it] {'loss': 0.0348, 'grad_norm': 0.8233022689819336, 'learning_rate': 9.895908930824259e-09, 'kl': 0.1055, 'entropy': -0.0415, 'ce_loss': 0.0344, 'epoch': 2.87}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 353/369 [37:37<01:41, 6.37s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 354/369 [37:43<01:35, 6.36s/it] {'loss': 0.0316, 'grad_norm': 0.6815870404243469, 'learning_rate': 8.699316894203223e-09, 'kl': 0.0292, 'entropy': -0.1367, 'ce_loss': 0.0218, 'epoch': 2.88}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 354/369 [37:43<01:35, 6.36s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 355/369 [37:50<01:29, 6.36s/it] {'loss': 0.0375, 'grad_norm': 0.8476760983467102, 'learning_rate': 7.579490328064265e-09, 'kl': 0.0503, 'entropy': -0.0996, 'ce_loss': 0.0052, 'epoch': 2.89}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 355/369 [37:50<01:29, 6.36s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 356/369 [37:56<01:22, 6.38s/it] {'loss': 0.0322, 'grad_norm': 0.780660092830658, 'learning_rate': 6.536515950811394e-09, 'kl': 0.0442, 'entropy': -0.0119, 'ce_loss': 0.0098, 'epoch': 2.89}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 356/369 [37:56<01:22, 6.38s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 357/369 [38:02<01:16, 6.37s/it] {'loss': 0.0318, 'grad_norm': 0.7445202469825745, 'learning_rate': 5.570474529481561e-09, 'kl': 0.0347, 'entropy': -0.0223, 'ce_loss': 0.0113, 'epoch': 2.9}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 357/369 [38:02<01:16, 6.37s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 358/369 [38:09<01:10, 6.37s/it] {'loss': 0.0324, 'grad_norm': 0.8086585998535156, 'learning_rate': 4.681440873489761e-09, 'kl': 0.0277, 'entropy': -0.0216, 'ce_loss': 0.014, 'epoch': 2.91}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 358/369 [38:09<01:10, 6.37s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 359/369 [38:15<01:03, 6.32s/it] {'loss': 0.0357, 'grad_norm': 0.7350617051124573, 'learning_rate': 3.869483828836007e-09, 'kl': 0.0347, 'entropy': -0.0496, 'ce_loss': 0.0212, 'epoch': 2.92}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 359/369 [38:15<01:03, 6.32s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 360/369 [38:21<00:57, 6.37s/it] {'loss': 0.0277, 'grad_norm': 0.6808954477310181, 'learning_rate': 3.1346662727740338e-09, 'kl': 0.0237, 'entropy': -0.0378, 'ce_loss': 0.0072, 'epoch': 2.93}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 360/369 [38:21<00:57, 6.37s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 361/369 [38:28<00:50, 6.34s/it] {'loss': 0.031, 'grad_norm': 0.753974974155426, 'learning_rate': 2.4770451089419774e-09, 'kl': 0.0086, 'entropy': -0.104, 'ce_loss': 0.0173, 'epoch': 2.93}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 361/369 [38:28<00:50, 6.34s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 362/369 [38:34<00:44, 6.36s/it] {'loss': 0.0399, 'grad_norm': 0.9531015157699585, 'learning_rate': 1.8966712629558956e-09, 'kl': 0.0119, 'entropy': -0.0413, 'ce_loss': 0.0146, 'epoch': 2.94}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 362/369 [38:34<00:44, 6.36s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 363/369 [38:41<00:38, 6.37s/it] {'loss': 0.0241, 'grad_norm': 0.6532736420631409, 'learning_rate': 1.393589678466367e-09, 'kl': 0.0439, 'entropy': -0.0144, 'ce_loss': 0.011, 'epoch': 2.95}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 363/369 [38:41<00:38, 6.37s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 364/369 [38:47<00:32, 6.45s/it] {'loss': 0.0356, 'grad_norm': 0.8270800709724426, 'learning_rate': 9.678393136776097e-10, 'kl': 0.033, 'entropy': -0.0131, 'ce_loss': 0.0069, 'epoch': 2.96}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 364/369 [38:47<00:32, 6.45s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 365/369 [38:54<00:25, 6.44s/it] {'loss': 0.0302, 'grad_norm': 0.7547364830970764, 'learning_rate': 6.194531383307832e-10, 'kl': 0.0356, 'entropy': -0.0771, 'ce_loss': 0.0501, 'epoch': 2.97}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 365/369 [38:54<00:25, 6.44s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 366/369 [39:00<00:19, 6.39s/it] {'loss': 0.0296, 'grad_norm': 0.7351754307746887, 'learning_rate': 3.484581311511414e-10, 'kl': 0.0518, 'entropy': -0.0586, 'ce_loss': 0.0079, 'epoch': 2.98}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 366/369 [39:00<00:19, 6.39s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 367/369 [39:06<00:12, 6.39s/it] {'loss': 0.0342, 'grad_norm': 0.7677320837974548, 'learning_rate': 1.5487527775848163e-10, 'kl': 0.0238, 'entropy': -0.0396, 'ce_loss': 0.0073, 'epoch': 2.98}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 367/369 [39:06<00:12, 6.39s/it] 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 368/369 [39:13<00:06, 6.40s/it] {'loss': 0.0296, 'grad_norm': 0.8239316344261169, 'learning_rate': 3.8719569042111597e-11, 'kl': 0.0718, 'entropy': -0.0928, 'ce_loss': 0.0162, 'epoch': 2.99}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 368/369 [39:13<00:06, 6.40s/it] 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 369/369 [39:20<00:00, 6.58s/it] {'loss': 0.0334, 'grad_norm': 0.6779916882514954, 'learning_rate': 0.0, 'kl': 0.02, 'entropy': -0.0197, 'ce_loss': 0.0233, 'epoch': 3.0}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 369/369 [39:20<00:00, 6.58s/it][INFO|trainer.py:2665] 2025-04-18 18:20:17,205 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 2360.1998, 'train_samples_per_second': 2.499, 'train_steps_per_second': 0.156, 'train_loss': 0.046926327837191945, 'epoch': 3.0}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 369/369 [39:20<00:00, 6.58s/it] 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 369/369 [39:20<00:00, 6.40s/it]
[INFO|trainer.py:3966] 2025-04-18 18:20:33,598 >> Saving model checkpoint to /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO
[INFO|configuration_utils.py:423] 2025-04-18 18:20:33,603 >> Configuration saved in /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/config.json
[INFO|configuration_utils.py:908] 2025-04-18 18:20:33,604 >> Configuration saved in /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/generation_config.json
[2025-04-18 18:20:40,347] [INFO] [launch.py:351:main] Process 1993329 exits successfully.
[2025-04-18 18:20:42,349] [INFO] [launch.py:351:main] Process 1993333 exits successfully.
[2025-04-18 18:20:44,352] [INFO] [launch.py:351:main] Process 1993332 exits successfully.
[2025-04-18 18:20:46,354] [INFO] [launch.py:351:main] Process 1993331 exits successfully.
[2025-04-18 18:20:47,355] [INFO] [launch.py:351:main] Process 1993328 exits successfully.
[2025-04-18 18:20:49,358] [INFO] [launch.py:351:main] Process 1993330 exits successfully.
[2025-04-18 18:20:51,360] [INFO] [launch.py:351:main] Process 1993327 exits successfully.
[INFO|modeling_utils.py:3594] 2025-04-18 18:21:07,009 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 6 checkpoint shards. You can find where each parameter has been saved in the index located at /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-04-18 18:21:07,010 >> tokenizer config file saved in /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-04-18 18:21:07,010 >> Special tokens file saved in /home/stern/GRPO/saved_models/Qwen2.5-14B-Instruct-RSPO/special_tokens_map.json
***** train metrics *****
epoch = 3.0
total_flos = 29688GF
train_loss = 0.0469
train_runtime = 0:39:20.19
train_samples = 1966
train_samples_per_second = 2.499
train_steps_per_second = 0.156
[2025-04-18 18:21:13,382] [INFO] [launch.py:351:main] Process 1993326 exits successfully.