s1K-7B-RSPO-neg / training.log
[2025-04-10 16:53:49,566] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:51,612] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7
[2025-04-10 16:53:51,612] [INFO] [runner.py:605:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed scripts/zero2.json --seed 42 --model_name_or_path /home/stern/GRPO/saved_models/s1K-7B --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl --output_dir /home/stern/GRPO/offline_rl_v2/output --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy no --learning_rate 2e-5 --lr_scheduler_type cosine --save_only_model True --remove_unused_columns False --warmup_ratio 0.03 --num_train_epochs 3 --logging_steps 1 --report_to tensorboard --gradient_checkpointing True --overwrite_output_dir --bf16 True
[2025-04-10 16:53:53,051] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:55,047] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2025-04-10 16:53:55,047] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2025-04-10 16:53:55,047] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2025-04-10 16:53:55,047] [INFO] [launch.py:164:main] dist_world_size=8
[2025-04-10 16:53:55,047] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2025-04-10 16:53:55,048] [INFO] [launch.py:256:main] process 501939 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,048] [INFO] [launch.py:256:main] process 501940 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501941 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501942 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501943 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501944 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501945 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501946 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True']
[2025-04-10 16:53:59,730] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:59,770] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:59,873] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:59,903] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:53:59,955] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:54:00,030] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:54:00,040] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-10 16:54:00,042] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:01,829] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,044] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,079] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-10 16:54:02,080] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,086] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,133] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-10 16:54:02,133] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,174] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2025-04-10 16:54:02,208] [INFO] [comm.py:658:init_distributed] cdb=None
WARNING:__main__:Process rank: 4, device: cuda:4, n_gpu: 1
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1
[WARNING|logging.py:329] 2025-04-10 16:54:02,948 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:329] 2025-04-10 16:54:02,950 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 5.51it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 3.63it/s]
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1
INFO:__main__:Training parameters CustomTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=scripts/zero2.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
kl_coeff=0.0,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/stern/GRPO/offline_rl_v2/output/runs/Apr10_16-54-02_nacamontrealdc1-p2r203n1.enovum.hivecloud.com,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/home/stern/GRPO/offline_rl_v2/output,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/home/stern/GRPO/offline_rl_v2/output,
save_on_each_node=False,
save_only_model=True,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
)
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file chat_template.jinja
Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 4.14it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 5.25it/s]
Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 3.33it/s]
WARNING:__main__:Process rank: 7, device: cuda:7, n_gpu: 1
WARNING:__main__:Process rank: 6, device: cuda:6, n_gpu: 1
Generating train split: 0 examples [00:00, ? examples/s]
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1
WARNING:__main__:Process rank: 5, device: cuda:5, n_gpu: 1
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 4.10it/s]
[WARNING|logging.py:329] 2025-04-10 16:54:03,974 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Generating train split: 170 examples [00:00, 1093.45 examples/s]
[INFO|tokenization_utils_base.py:2323] 2025-04-10 16:54:04,097 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-04-10 16:54:04,097 >> loading configuration file /home/stern/GRPO/saved_models/s1K-7B/config.json
[INFO|configuration_utils.py:771] 2025-04-10 16:54:04,100 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.50.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|modeling_utils.py:1151] 2025-04-10 16:54:04,161 >> loading weights file /home/stern/GRPO/saved_models/s1K-7B/model.safetensors.index.json
[INFO|modeling_utils.py:1225] 2025-04-10 16:54:04,162 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:2170] 2025-04-10 16:54:04,162 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:329] 2025-04-10 16:54:04,164 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1139] 2025-04-10 16:54:04,166 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Generating train split: 341 examples [00:00, 966.13 examples/s] Generating train split: 468 examples [00:00, 901.17 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 4.16it/s]
[WARNING|logging.py:329] 2025-04-10 16:54:04,581 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Generating train split: 635 examples [00:00, 904.92 examples/s]
[WARNING|logging.py:329] 2025-04-10 16:54:04,605 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:329] 2025-04-10 16:54:04,608 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[WARNING|logging.py:329] 2025-04-10 16:54:04,620 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 3.89it/s] Generating train split: 759 examples [00:00, 816.52 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 3.41it/s] Generating train split: 842 examples [00:01, 598.30 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.74it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.72it/s]
Generating train split: 925 examples [00:01, 482.38 examples/s] Generating train split: 1010 examples [00:01, 398.69 examples/s] Generating train split: 1093 examples [00:02, 347.94 examples/s] Generating train split: 1135 examples [00:02, 343.37 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.27it/s] Generating train split: 1218 examples [00:02, 380.14 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.21it/s] Generating train split: 1259 examples [00:02, 370.22 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.16it/s] Generating train split: 1303 examples [00:02, 370.50 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.07it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.71it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.71it/s]
[INFO|modeling_utils.py:4987] 2025-04-10 16:54:06,570 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4995] 2025-04-10 16:54:06,570 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/stern/GRPO/saved_models/s1K-7B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1092] 2025-04-10 16:54:06,572 >> loading configuration file /home/stern/GRPO/saved_models/s1K-7B/generation_config.json
[INFO|configuration_utils.py:1139] 2025-04-10 16:54:06,573 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
Loading checkpoint shards: 50%|█████ | 2/4 [00:02<00:02, 1.00s/it] Generating train split: 1385 examples [00:02, 370.45 examples/s] Generating train split: 1428 examples [00:02, 378.34 examples/s]
Using custom data configuration default-570516a07b11d2a7
INFO:datasets.builder:Using custom data configuration default-570516a07b11d2a7
Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json
INFO:datasets.info:Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating train split: 1470 examples [00:03, 335.24 examples/s] Generating train split: 1511 examples [00:03, 332.66 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.17it/s] Generating train split: 1553 examples [00:03, 331.20 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.75it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.53it/s]
Generating train split: 1636 examples [00:03, 404.01 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.04it/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.05it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.38it/s]
Generating train split: 1760 examples [00:03, 521.47 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.36it/s]
Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.02it/s] Generating train split: 1844 examples [00:03, 576.49 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.32it/s]
Generating train split: 1970 examples [00:03, 659.41 examples/s] Generating train split: 2139 examples [00:04, 744.70 examples/s] Generating train split: 2222 examples [00:04, 761.21 examples/s] Generating train split: 2349 examples [00:04, 779.14 examples/s] Generating train split: 2515 examples [00:04, 801.66 examples/s] Generating train split: 2640 examples [00:04, 811.52 examples/s] Generating train split: 2722 examples [00:04, 804.37 examples/s] Generating train split: 2805 examples [00:04, 807.91 examples/s] Generating train split: 2889 examples [00:05, 807.80 examples/s] Generating train split: 2973 examples [00:05, 808.41 examples/s] Generating train split: 3097 examples [00:05, 814.10 examples/s] Generating train split: 3224 examples [00:05, 818.36 examples/s] Generating train split: 3348 examples [00:05, 826.59 examples/s] Generating train split: 3428 examples [00:05, 605.17 examples/s]
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
INFO:datasets.builder:Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
trainer = OfflineREINFORCETrainer(
[INFO|trainer.py:748] 2025-04-10 16:54:09,798 >> Using auto half precision backend
INFO:__main__:*** Train ***
[2025-04-10 16:54:09,999] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown
[2025-04-10 16:54:09,999] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[2025-04-10 16:54:17,662] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-04-10 16:54:17,664] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-04-10 16:54:17,664] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-04-10 16:54:17,681] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2025-04-10 16:54:17,681] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2025-04-10 16:54:17,681] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2025-04-10 16:54:17,681] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 12845056
[2025-04-10 16:54:17,681] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2025-04-10 16:54:17,682] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-04-10 16:54:17,682] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
[WARNING|logging.py:329] 2025-04-10 16:54:32,676 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
[WARNING|logging.py:329] 2025-04-10 16:54:36,610 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2025-04-10 16:54:37,082 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[2025-04-10 16:54:37,682] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-10 16:54:37,683] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 17.73 GB CA 17.73 GB Max_CA 18 GB
[2025-04-10 16:54:37,683] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.82 GB, percent = 4.0%
[WARNING|logging.py:329] 2025-04-10 16:54:37,738 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2025-04-10 16:54:37,768 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2025-04-10 16:54:37,772 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2025-04-10 16:54:37,807 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[2025-04-10 16:54:37,852] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-10 16:54:37,852] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 21.28 GB CA 21.28 GB Max_CA 21 GB
[2025-04-10 16:54:37,852] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.52 GB, percent = 3.9%
[2025-04-10 16:54:37,852] [INFO] [stage_1_and_2.py:556:__init__] optimizer state initialized
[2025-04-10 16:54:37,982] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-04-10 16:54:37,983] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 17.73 GB CA 21.28 GB Max_CA 21 GB
[2025-04-10 16:54:37,983] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.5 GB, percent = 3.9%
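The MA / Max_MA figures in the three memory reports above are consistent with a simple accounting, sketched here under loudly stated assumptions: that the "GB" values are bytes divided by 1024^3, that each rank holds a full bf16 copy of the model (7,615,616,512 parameters, per the Trainer's "Number of trainable parameters" line later in this log) plus a 1/8 shard of fp32 master weights, and that the transient peak during optimizer-state initialization is one more fp32 shard-sized buffer. This is one reading that happens to reproduce the logged numbers, not a statement of how DeepSpeed allocates memory internally.

```python
# Hypothetical accounting for the logged MA / Max_MA values (assumptions, not DeepSpeed internals).
NUM_PARAMS = 7_615_616_512   # from the Trainer's "Number of trainable parameters" line
WORLD_SIZE = 8               # from world_size in the DeepSpeed config dump
GIB = 1024 ** 3              # assumption: the log's "GB" is bytes / 1024^3

bf16_model = NUM_PARAMS * 2 / GIB               # full bf16 weights on every rank
fp32_shard = NUM_PARAMS / WORLD_SIZE * 4 / GIB  # this rank's slice of fp32 master weights

ma = bf16_model + fp32_shard   # steady-state allocation
peak = ma + fp32_shard         # assumed transient: one extra fp32 shard-sized buffer

print(f"MA ~ {ma:.2f} GB, peak ~ {peak:.2f} GB")
```

Under these assumptions the two values round to the 17.73 GB and 21.28 GB reported above. The Adam moment buffers would add to this once they are allocated, which this sketch does not model.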
[2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-04-10 16:54:37,985] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-04-10 16:54:37,985] [INFO] [config.py:1000:print] DeepSpeedEngine configuration:
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] amp_enabled .................. False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] amp_params ................... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] bfloat16_enabled ............. True
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] bfloat16_immediate_grad_update True
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_parallel_write_pipeline False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_tag_validation_enabled True
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_tag_validation_fail False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7c035fb604c0>
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] communication_data_type ...... None
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] curriculum_enabled_legacy .... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] curriculum_params_legacy ..... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] data_efficiency_enabled ...... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dataloader_drop_last ......... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] disable_allgather ............ False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dump_state ................... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dynamic_loss_scale_args ...... None
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_enabled ........... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_gas_boundary_resolution 1
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_layer_num ......... 0
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_max_iter .......... 100
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_stability ......... 1e-06
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_tol ............... 0.01
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_verbose ........... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] elasticity_enabled ........... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_auto_cast ............... None
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_enabled ................. False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_master_weights_and_gradients False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] global_rank .................. 0
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] grad_accum_dtype ............. None
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_accumulation_steps .. 4
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_clipping ............ 0.0
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_predivide_factor .... 1.0
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] graph_harvesting ............. False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] initial_dynamic_scale ........ 1
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] load_universal_checkpoint .... False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] loss_scale ................... 1.0
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] memory_breakdown ............. False
[2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] mics_hierarchial_params_gather False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] mics_shard_size .............. -1
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_legacy_fusion ...... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_name ............... None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_params ............. None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pld_enabled .................. False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pld_params ................... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] prescale_gradients ........... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] scheduler_name ............... None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] scheduler_params ............. None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] seq_parallel_communication_data_type torch.float32
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] sparse_attention ............. None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] sparse_gradients_enabled ..... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] steps_per_print .............. inf
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] timers_config ................ enabled=True synchronized=True
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] train_batch_size ............. 32
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] train_micro_batch_size_per_gpu 1
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] use_data_before_expert_parallel_ False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] use_node_local_storage ....... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] wall_clock_breakdown ......... False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] weight_quantization_config ... None
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] world_size ................... 8
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_allow_untested_optimizer True
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=12845056 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_enabled ................. True
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_force_ds_cpu_optimizer .. True
[2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_optimization_stage ...... 2
[2025-04-10 16:54:37,987] [INFO] [config.py:990:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"train_micro_batch_size_per_gpu": 1,
"train_batch_size": 32,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 2,
"overlap_comm": false,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 1.284506e+07
},
"steps_per_print": inf,
"zero_allow_untested_optimizer": true
}
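The three batch settings in the JSON above are tied together: DeepSpeed expects `train_batch_size` to equal `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the world size. A minimal check with the values from this log (the helper name `effective_batch_size` is made up for illustration):

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Global number of samples consumed per optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * world_size

# Values from the config dump: micro batch 1, accumulation 4, world_size 8.
train_batch_size = effective_batch_size(1, 4, 8)
print(train_batch_size)
```

This matches the `train_batch_size` of 32 in the config.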
[INFO|trainer.py:2409] 2025-04-10 16:54:37,987 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-04-10 16:54:37,987 >> Num examples = 3,428
[INFO|trainer.py:2411] 2025-04-10 16:54:37,987 >> Num Epochs = 3
[INFO|trainer.py:2412] 2025-04-10 16:54:37,987 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2415] 2025-04-10 16:54:37,987 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2416] 2025-04-10 16:54:37,987 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2417] 2025-04-10 16:54:37,987 >> Total optimization steps = 321
[INFO|trainer.py:2418] 2025-04-10 16:54:37,988 >> Number of trainable parameters = 7,615,616,512
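The step count above and the `learning_rate` values in the per-step records that follow can be reproduced from the launch arguments (`--warmup_ratio 0.03`, `--learning_rate 2e-5`, `--lr_scheduler_type cosine`). The sketch below assumes the usual Hugging Face Trainer conventions: per-rank batches per epoch are rounded up, update steps per epoch are floor-divided by the accumulation factor, warmup steps are the ceiling of ratio times total steps, and the schedule is linear warmup followed by a half-cosine decay to zero. `train.py` may deviate from these; the names below are illustrative.

```python
import math

NUM_EXAMPLES = 3428      # "Num examples"
WORLD_SIZE = 8
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
EPOCHS = 3
PEAK_LR = 2e-5           # --learning_rate
WARMUP_RATIO = 0.03      # --warmup_ratio

# Batches each rank sees per epoch (assumption: the last partial batch is kept).
batches_per_epoch = math.ceil(NUM_EXAMPLES / (WORLD_SIZE * PER_DEVICE_BATCH))
updates_per_epoch = batches_per_epoch // GRAD_ACCUM
total_steps = EPOCHS * updates_per_epoch
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step: int) -> float:
    """Learning rate as logged at optimizer step `step` (1-based in the log)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_steps, warmup_steps)
print(lr_at(1), lr_at(10), lr_at(11))
```

With these assumptions the sketch gives 321 total steps and 10 warmup steps, so the rate climbs by 2e-6 per step to 2e-5 at step 10 and starts decaying at step 11, which is the pattern visible in the logged values.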
0%| | 0/321 [00:00<?, ?it/s]
[WARNING|logging.py:329] 2025-04-10 16:54:38,090 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
0%| | 1/321 [00:05<29:54, 5.61s/it] {'loss': 0.0429, 'grad_norm': 0.42509692907333374, 'learning_rate': 2.0000000000000003e-06, 'kl': 0.0006, 'entropy': 0.1934, 'ce_loss': 0.054, 'epoch': 0.01}
1%| | 2/321 [00:13<35:29, 6.67s/it] {'loss': 0.0541, 'grad_norm': 0.5195198059082031, 'learning_rate': 4.000000000000001e-06, 'kl': -0.0007, 'entropy': 0.2852, 'ce_loss': 0.0699, 'epoch': 0.02}
1%| | 3/321 [00:18<32:16, 6.09s/it] {'loss': 0.0385, 'grad_norm': 0.3660963177680969, 'learning_rate': 6e-06, 'kl': 0.0005, 'entropy': 0.2793, 'ce_loss': 0.0602, 'epoch': 0.03}
1%| | 4/321 [00:24<31:29, 5.96s/it] {'loss': 0.0758, 'grad_norm': 0.6640394330024719, 'learning_rate': 8.000000000000001e-06, 'kl': 0.0151, 'entropy': 0.3047, 'ce_loss': 0.0809, 'epoch': 0.04}
2%|▏ | 5/321 [00:29<30:22, 5.77s/it] {'loss': 0.053, 'grad_norm': 0.22936582565307617, 'learning_rate': 1e-05, 'kl': 0.0295, 'entropy': 0.1943, 'ce_loss': 0.0762, 'epoch': 0.05}
2%|▏ | 6/321 [00:35<29:43, 5.66s/it] {'loss': 0.0792, 'grad_norm': 1.8079404830932617, 'learning_rate': 1.2e-05, 'kl': -0.0918, 'entropy': 0.0133, 'ce_loss': 0.0976, 'epoch': 0.06}
2%|▏ | 7/321 [00:40<29:14, 5.59s/it] {'loss': 0.0651, 'grad_norm': 0.557701051235199, 'learning_rate': 1.4e-05, 'kl': 0.0019, 'entropy': 0.1543, 'ce_loss': 0.098, 'epoch': 0.07}
2%|▏ | 8/321 [00:46<29:01, 5.56s/it] {'loss': 0.0683, 'grad_norm': 0.4116457402706146, 'learning_rate': 1.6000000000000003e-05, 'kl': -0.014, 'entropy': 0.0874, 'ce_loss': 0.0771, 'epoch': 0.07}
3%|▎ | 9/321 [00:51<28:46, 5.53s/it] {'loss': 0.0828, 'grad_norm': 0.4462181627750397, 'learning_rate': 1.8e-05, 'kl': 0.0065, 'entropy': 0.1406, 'ce_loss': 0.0741, 'epoch': 0.08}
3%|▎ | 10/321 [00:57<29:24, 5.67s/it] {'loss': 0.0726, 'grad_norm': 0.3048171103000641, 'learning_rate': 2e-05, 'kl': -0.0267, 'entropy': 0.1826, 'ce_loss': 0.0728, 'epoch': 0.09}
3%|▎ | 11/321 [01:03<29:11, 5.65s/it] {'loss': 0.0667, 'grad_norm': 0.36878669261932373, 'learning_rate': 1.9999489794332404e-05, 'kl': -0.0139, 'entropy': 0.2539, 'ce_loss': 0.0773, 'epoch': 0.1}
4%|▎ | 12/321 [01:08<28:58, 5.63s/it] {'loss': 0.051, 'grad_norm': 0.2855764925479889, 'learning_rate': 1.9997959229391567e-05, 'kl': -0.0334, 'entropy': 0.3066, 'ce_loss': 0.0963, 'epoch': 0.11}
4%|▍ | 13/321 [01:14<28:42, 5.59s/it] {'loss': 0.0616, 'grad_norm': 0.3379002809524536, 'learning_rate': 1.9995408461358074e-05, 'kl': -0.0177, 'entropy': 0.1748, 'ce_loss': 0.0593, 'epoch': 0.12}
4%|▍ | 14/321 [01:19<28:34, 5.58s/it] {'loss': 0.0574, 'grad_norm': 0.2853255867958069, 'learning_rate': 1.999183775051519e-05, 'kl': -0.0347, 'entropy': 0.1387, 'ce_loss': 0.0506, 'epoch': 0.13}
5%|▍ | 15/321 [01:25<28:24, 5.57s/it] {'loss': 0.0759, 'grad_norm': 0.35480034351348877, 'learning_rate': 1.9987247461222297e-05, 'kl': -0.0432, 'entropy': 0.2002, 'ce_loss': 0.0776, 'epoch': 0.14}
5%|▍ | 16/321 [01:30<28:19, 5.57s/it] {'loss': 0.1014, 'grad_norm': 0.4204954504966736, 'learning_rate': 1.9981638061877714e-05, 'kl': -0.0393, 'entropy': 0.1953, 'ce_loss': 0.1155, 'epoch': 0.15}
5%|▌ | 17/321 [01:36<28:10, 5.56s/it] {'loss': 0.0646, 'grad_norm': 0.2749992907047272, 'learning_rate': 1.997501012487091e-05, 'kl': -0.0547, 'entropy': 0.1758, 'ce_loss': 0.0786, 'epoch': 0.16}
6%|▌ | 18/321 [01:41<28:04, 5.56s/it] {'loss': 0.0611, 'grad_norm': 0.276908814907074, 'learning_rate': 1.996736432652409e-05, 'kl': -0.0325, 'entropy': 0.1865, 'ce_loss': 0.0886, 'epoch': 0.17}
6%|▌ | 19/321 [01:47<27:54, 5.54s/it] {'loss': 0.0589, 'grad_norm': 0.2533971965312958, 'learning_rate': 1.9958701447023188e-05, 'kl': -0.0439, 'entropy': 0.2002, 'ce_loss': 0.0921, 'epoch': 0.18}
6%|▌ | 20/321 [01:52<27:43, 5.53s/it] {'loss': 0.0773, 'grad_norm': 0.34151360392570496, 'learning_rate': 1.994902237033824e-05, 'kl': -0.0374, 'entropy': 0.1611, 'ce_loss': 0.0873, 'epoch': 0.19}
7%|▋ | 21/321 [01:58<27:37, 5.53s/it] {'loss': 0.0659, 'grad_norm': 0.31397902965545654, 'learning_rate': 1.9938328084133206e-05, 'kl': -0.0547, 'entropy': 0.168, 'ce_loss': 0.0749, 'epoch': 0.2}
7%|▋ | 22/321 [02:03<27:32, 5.53s/it] {'loss': 0.0503, 'grad_norm': 0.1959906369447708, 'learning_rate': 1.9926619679665175e-05, 'kl': -0.0425, 'entropy': 0.1758, 'ce_loss': 0.0733, 'epoch': 0.21}
7%|▋ | 23/321 [02:09<27:26, 5.53s/it] {'loss': 0.0538, 'grad_norm': 0.22350993752479553, 'learning_rate': 1.9913898351673006e-05, 'kl': -0.0464, 'entropy': 0.1602, 'ce_loss': 0.0737, 'epoch': 0.21}
7%|▋ | 24/321 [02:14<27:16, 5.51s/it] {'loss': 0.068, 'grad_norm': 0.2728714346885681, 'learning_rate': 1.9900165398255434e-05, 'kl': -0.0304, 'entropy': 0.1855, 'ce_loss': 0.0722, 'epoch': 0.22}
8%|▊ | 25/321 [02:20<27:11, 5.51s/it] {'loss': 0.0545, 'grad_norm': 0.1936517208814621, 'learning_rate': 1.9885422220738583e-05, 'kl': -0.043, 'entropy': 0.2227, 'ce_loss': 0.0767, 'epoch': 0.23}
8%|▊ | 26/321 [02:25<26:59, 5.49s/it] {'loss': 0.1022, 'grad_norm': 0.3744613826274872, 'learning_rate': 1.9869670323533005e-05, 'kl': -0.0295, 'entropy': 0.25, 'ce_loss': 0.0888, 'epoch': 0.24}
8%|▊ | 27/321 [02:31<27:11, 5.55s/it] {'loss': 0.0884, 'grad_norm': 0.3016413748264313, 'learning_rate': 1.9852911313980146e-05, 'kl': -0.0354, 'entropy': 0.2119, 'ce_loss': 0.0376, 'epoch': 0.25}
9%|▊ | 28/321 [02:37<27:01, 5.53s/it] {'loss': 0.0474, 'grad_norm': 0.1607786864042282, 'learning_rate': 1.9835146902188336e-05, 'kl': -0.0388, 'entropy': 0.2295, 'ce_loss': 0.0768, 'epoch': 0.26}
9%|▉ | 29/321 [02:42<26:49, 5.51s/it] {'loss': 0.0471, 'grad_norm': 0.18925224244594574, 'learning_rate': 1.9816378900858288e-05, 'kl': -0.0374, 'entropy': 0.1621, 'ce_loss': 0.0679, 'epoch': 0.27}
9%|▉ | 30/321 [02:48<26:38, 5.49s/it] {'loss': 0.075, 'grad_norm': 0.2948780953884125, 'learning_rate': 1.9796609225098136e-05, 'kl': -0.043, 'entropy': 0.1748, 'ce_loss': 0.0678, 'epoch': 0.28}
10%|▉ | 31/321 [02:53<26:30, 5.48s/it] {'loss': 0.0648, 'grad_norm': 0.22790686786174774, 'learning_rate': 1.9775839892228004e-05, 'kl': -0.0339, 'entropy': 0.2373, 'ce_loss': 0.0803, 'epoch': 0.29}
10%|▉ | 32/321 [02:58<26:26, 5.49s/it] {'loss': 0.0901, 'grad_norm': 0.3282439112663269, 'learning_rate': 1.9754073021574153e-05, 'kl': -0.0295, 'entropy': 0.2188, 'ce_loss': 0.0814, 'epoch': 0.3}
10%|█ | 33/321 [03:04<26:26, 5.51s/it] {'loss': 0.0679, 'grad_norm': 0.22486652433872223, 'learning_rate': 1.9731310834252747e-05, 'kl': -0.0083, 'entropy': 0.2334, 'ce_loss': 0.0827, 'epoch': 0.31}
11%|█ | 34/321 [03:10<26:23, 5.52s/it] {'loss': 0.0587, 'grad_norm': 0.2146979719400406, 'learning_rate': 1.970755565294318e-05, 'kl': -0.0204, 'entropy': 0.2988, 'ce_loss': 0.1036, 'epoch': 0.32}
11%|█ | 35/321 [03:15<26:26, 5.55s/it] {'loss': 0.0579, 'grad_norm': 0.21969445049762726, 'learning_rate': 1.9682809901651074e-05, 'kl': -0.0449, 'entropy': 0.1699, 'ce_loss': 0.0699, 'epoch': 0.33}
11%|█ | 36/321 [03:21<26:11, 5.51s/it] {'loss': 0.085, 'grad_norm': 0.27458637952804565, 'learning_rate': 1.9657076105460945e-05, 'kl': -0.0184, 'entropy': 0.2344, 'ce_loss': 0.0843, 'epoch': 0.34}
12%|█▏ | 37/321 [03:26<26:05, 5.51s/it] {'loss': 0.0822, 'grad_norm': 0.2581257224082947, 'learning_rate': 1.9630356890278527e-05, 'kl': -0.0615, 'entropy': 0.2539, 'ce_loss': 0.096, 'epoch': 0.34}
12%|█▏ | 38/321 [03:32<25:56, 5.50s/it] {'loss': 0.0573, 'grad_norm': 0.19232574105262756, 'learning_rate': 1.9602654982562822e-05, 'kl': -0.0205, 'entropy': 0.2061, 'ce_loss': 0.0739, 'epoch': 0.35}
12%|█▏ | 39/321 [03:37<26:12, 5.57s/it] {'loss': 0.0679, 'grad_norm': 0.23629367351531982, 'learning_rate': 1.9573973209047893e-05, 'kl': -0.0513, 'entropy': 0.2256, 'ce_loss': 0.0799, 'epoch': 0.36}
12%|█▏ | 40/321 [03:43<25:57, 5.54s/it] {'loss': 0.0735, 'grad_norm': 0.2551727592945099, 'learning_rate': 1.9544314496454423e-05, 'kl': -0.0107, 'entropy': 0.2129, 'ce_loss': 0.0759, 'epoch': 0.37}
13%|█▎ | 41/321 [03:48<25:51, 5.54s/it] {'loss': 0.0686, 'grad_norm': 0.2294929027557373, 'learning_rate': 1.9513681871191063e-05, 'kl': -0.0549, 'entropy': 0.2285, 'ce_loss': 0.085, 'epoch': 0.38}
13%|█▎ | 42/321 [03:54<25:36, 5.51s/it] {'loss': 0.063, 'grad_norm': 0.2062670886516571, 'learning_rate': 1.9482078459045617e-05, 'kl': -0.0322, 'entropy': 0.1729, 'ce_loss': 0.0658, 'epoch': 0.39}
13%|█▎ | 43/321 [03:59<25:46, 5.56s/it] {'loss': 0.0633, 'grad_norm': 0.18904145061969757, 'learning_rate': 1.9449507484866084e-05, 'kl': -0.0162, 'entropy': 0.2676, 'ce_loss': 0.0918, 'epoch': 0.4}
14%|█▎ | 44/321 [04:05<25:33, 5.54s/it] {'loss': 0.1106, 'grad_norm': 0.3384988009929657, 'learning_rate': 1.941597227223159e-05, 'kl': -0.0193, 'entropy': 0.2773, 'ce_loss': 0.0962, 'epoch': 0.41}
14%|█▍ | 45/321 [04:10<25:18, 5.50s/it] {'loss': 0.0979, 'grad_norm': 0.28069382905960083, 'learning_rate': 1.9381476243113243e-05, 'kl': -0.012, 'entropy': 0.1904, 'ce_loss': 0.0665, 'epoch': 0.42}
14%|█▍ | 46/321 [04:16<25:09, 5.49s/it] {'loss': 0.0633, 'grad_norm': 0.2065959870815277, 'learning_rate': 1.9346022917524958e-05, 'kl': -0.0071, 'entropy': 0.2656, 'ce_loss': 0.0813, 'epoch': 0.43}
15%|█▍ | 47/321 [04:21<24:58, 5.47s/it] {'loss': 0.0508, 'grad_norm': 0.2020697295665741, 'learning_rate': 1.9309615913164262e-05, 'kl': -0.0186, 'entropy': 0.2539, 'ce_loss': 0.0833, 'epoch': 0.44}
15%|█▍ | 48/321 [04:27<24:53, 5.47s/it] {'loss': 0.0487, 'grad_norm': 0.1781323254108429, 'learning_rate': 1.9272258945043154e-05, 'kl': -0.0092, 'entropy': 0.2041, 'ce_loss': 0.0772, 'epoch': 0.45}
15%|█▌ | 49/321 [04:32<24:47, 5.47s/it] {'loss': 0.0994, 'grad_norm': 0.3214576542377472, 'learning_rate': 1.9233955825109e-05, 'kl': -0.0374, 'entropy': 0.2344, 'ce_loss': 0.0915, 'epoch': 0.46}
16%|█▌ | 50/321 [04:38<24:41, 5.47s/it] {'loss': 0.0683, 'grad_norm': 0.2019803673028946, 'learning_rate': 1.919471046185558e-05, 'kl': -0.01, 'entropy': 0.2314, 'ce_loss': 0.0898, 'epoch': 0.47}
16%|█▌ | 51/321 [04:43<24:39, 5.48s/it] {'loss': 0.0768, 'grad_norm': 0.2558470666408539, 'learning_rate': 1.9154526859924242e-05, 'kl': -0.0162, 'entropy': 0.1963, 'ce_loss': 0.0721, 'epoch': 0.48}
16%|█▌ | 52/321 [04:49<24:30, 5.47s/it] {'loss': 0.0526, 'grad_norm': 0.17525342106819153, 'learning_rate': 1.9113409119695276e-05, 'kl': -0.0258, 'entropy': 0.291, 'ce_loss': 0.0953, 'epoch': 0.48}
17%|█▋ | 53/321 [04:54<24:24, 5.46s/it] {'loss': 0.069, 'grad_norm': 0.20537005364894867, 'learning_rate': 1.907136143686951e-05, 'kl': -0.016, 'entropy': 0.2539, 'ce_loss': 0.0847, 'epoch': 0.49}
17%|█▋ | 54/321 [05:00<24:19, 5.47s/it] {'loss': 0.0657, 'grad_norm': 0.20349650084972382, 'learning_rate': 1.902838810204015e-05, 'kl': -0.0168, 'entropy': 0.168, 'ce_loss': 0.0638, 'epoch': 0.5}
17%|█▋ | 55/321 [05:05<24:14, 5.47s/it] {'loss': 0.1007, 'grad_norm': 0.3118916153907776, 'learning_rate': 1.8984493500255e-05, 'kl': -0.0435, 'entropy': 0.1924, 'ce_loss': 0.0703, 'epoch': 0.51}
17%|█▋ | 56/321 [05:10<24:05, 5.46s/it] {'loss': 0.0725, 'grad_norm': 0.21371452510356903, 'learning_rate': 1.8939682110568982e-05, 'kl': -0.0266, 'entropy': 0.1465, 'ce_loss': 0.0627, 'epoch': 0.52}
18%|█▊ | 57/321 [05:16<24:09, 5.49s/it] {'loss': 0.0907, 'grad_norm': 0.2616683542728424, 'learning_rate': 1.8893958505587093e-05, 'kl': -0.0339, 'entropy': 0.167, 'ce_loss': 0.0608, 'epoch': 0.53}
18%|█▊ | 58/321 [05:22<24:04, 5.49s/it] {'loss': 0.0641, 'grad_norm': 0.21476519107818604, 'learning_rate': 1.8847327350997814e-05, 'kl': -0.0464, 'entropy': 0.2793, 'ce_loss': 0.0932, 'epoch': 0.54}
18%|█▊ | 59/321 [05:27<23:53, 5.47s/it] {'loss': 0.105, 'grad_norm': 0.3317887485027313, 'learning_rate': 1.879979340509701e-05, 'kl': -0.0466, 'entropy': 0.3633, 'ce_loss': 0.112, 'epoch': 0.55}
19%|█▊ | 60/321 [05:33<23:57, 5.51s/it] {'loss': 0.0845, 'grad_norm': 0.2596229016780853, 'learning_rate': 1.8751361518302413e-05, 'kl': -0.0264, 'entropy': 0.2734, 'ce_loss': 0.0875, 'epoch': 0.56}
19%|█▉ | 61/321 [05:38<23:47, 5.49s/it] {'loss': 0.0669, 'grad_norm': 0.2144625186920166, 'learning_rate': 1.8702036632658646e-05, 'kl': -0.0199, 'entropy': 0.249, 'ce_loss': 0.0751, 'epoch': 0.57}
19%|█▉ | 62/321 [05:44<23:48, 5.51s/it] {'loss': 0.0738, 'grad_norm': 0.24398091435432434, 'learning_rate': 1.8651823781332948e-05, 'kl': -0.0337, 'entropy': 0.2451, 'ce_loss': 0.0819, 'epoch': 0.58}
20%|█▉ | 63/321 [05:49<23:43, 5.52s/it] {'loss': 0.0669, 'grad_norm': 0.2284887582063675, 'learning_rate': 1.8600728088101587e-05, 'kl': -0.0304, 'entropy': 0.1885, 'ce_loss': 0.0738, 'epoch': 0.59}
20%|█▉ | 64/321 [05:55<23:37, 5.51s/it] {'loss': 0.0748, 'grad_norm': 0.2169020175933838, 'learning_rate': 1.8548754766827016e-05, 'kl': -0.0469, 'entropy': 0.1582, 'ce_loss': 0.0657, 'epoch': 0.6}
20%|██ | 65/321 [06:00<23:26, 5.49s/it] {'loss': 0.0651, 'grad_norm': 0.2139730602502823, 'learning_rate': 1.8495909120925857e-05, 'kl': -0.0243, 'entropy': 0.1836, 'ce_loss': 0.0643, 'epoch': 0.61}
21%|██ | 66/321 [06:06<23:23, 5.50s/it] {'loss': 0.0398, 'grad_norm': 0.1338534951210022, 'learning_rate': 1.8442196542827712e-05, 'kl': -0.0201, 'entropy': 0.1992, 'ce_loss': 0.0718, 'epoch': 0.62}
21%|██ | 67/321 [06:11<23:14, 5.49s/it] {'loss': 0.0526, 'grad_norm': 0.17920222878456116, 'learning_rate': 1.8387622513424942e-05, 'kl': -0.0275, 'entropy': 0.2852, 'ce_loss': 0.0979, 'epoch': 0.62}
21%|██ | 68/321 [06:16<23:09, 5.49s/it] {'loss': 0.0532, 'grad_norm': 0.1490076780319214, 'learning_rate': 1.8332192601513358e-05, 'kl': -0.0408, 'entropy': 0.2637, 'ce_loss': 0.0951, 'epoch': 0.63}
21%|██▏ | 69/321 [06:22<22:59, 5.47s/it] {'loss': 0.065, 'grad_norm': 0.1889561414718628, 'learning_rate': 1.827591246322401e-05, 'kl': -0.0432, 'entropy': 0.2021, 'ce_loss': 0.0529, 'epoch': 0.64}
22%|██▏ | 70/321 [06:27<22:54, 5.48s/it] {'loss': 0.0559, 'grad_norm': 0.1900503784418106, 'learning_rate': 1.8218787841446003e-05, 'kl': -0.0427, 'entropy': 0.1963, 'ce_loss': 0.0923, 'epoch': 0.65}
22%|██▏ | 71/321 [06:33<22:45, 5.46s/it] {'loss': 0.0595, 'grad_norm': 0.18474602699279785, 'learning_rate': 1.8160824565240495e-05, 'kl': -0.0286, 'entropy': 0.1338, 'ce_loss': 0.0534, 'epoch': 0.66}
22%|██▏ | 72/321 [06:39<23:02, 5.55s/it] {'loss': 0.0633, 'grad_norm': 0.20437942445278168, 'learning_rate': 1.8102028549245894e-05, 'kl': -0.0114, 'entropy': 0.1689, 'ce_loss': 0.0684, 'epoch': 0.67}
23%|██▎ | 73/321 [06:44<22:49, 5.52s/it] {'loss': 0.0481, 'grad_norm': 0.15962006151676178, 'learning_rate': 1.804240579307431e-05, 'kl': -0.0544, 'entropy': 0.252, 'ce_loss': 0.0875, 'epoch': 0.68}
23%|██▎ | 74/321 [06:49<22:37, 5.50s/it] {'loss': 0.0535, 'grad_norm': 0.1530793160200119, 'learning_rate': 1.7981962380699376e-05, 'kl': -0.0366, 'entropy': 0.167, 'ce_loss': 0.0524, 'epoch': 0.69}
23%|██▎ | 75/321 [06:55<22:28, 5.48s/it] {'loss': 0.0681, 'grad_norm': 0.195379376411438, 'learning_rate': 1.79207044798354e-05, 'kl': -0.0654, 'entropy': 0.3203, 'ce_loss': 0.0953, 'epoch': 0.7}
24%|██▎ | 76/321 [07:00<22:21, 5.47s/it] {'loss': 0.072, 'grad_norm': 0.20148125290870667, 'learning_rate': 1.7858638341308026e-05, 'kl': -0.0469, 'entropy': 0.2012, 'ce_loss': 0.0711, 'epoch': 0.71}
24%|██▍ | 77/321 [07:06<22:17, 5.48s/it] {'loss': 0.0599, 'grad_norm': 0.19116108119487762, 'learning_rate': 1.779577029841638e-05, 'kl': -0.0281, 'entropy': 0.167, 'ce_loss': 0.0655, 'epoch': 0.72}
24%|██▍ | 78/321 [07:11<22:08, 5.47s/it] {'loss': 0.0519, 'grad_norm': 0.14512334764003754, 'learning_rate': 1.773210676628682e-05, 'kl': -0.0179, 'entropy': 0.2539, 'ce_loss': 0.0858, 'epoch': 0.73}
25%|██▍ | 79/321 [07:17<22:03, 5.47s/it] {'loss': 0.0549, 'grad_norm': 0.173423632979393, 'learning_rate': 1.7667654241218332e-05, 'kl': -0.0201, 'entropy': 0.2305, 'ce_loss': 0.0785, 'epoch': 0.74}
25%|██▍ | 80/321 [07:22<21:54, 5.46s/it] {'loss': 0.0729, 'grad_norm': 0.20733360946178436, 'learning_rate': 1.7602419300019627e-05, 'kl': -0.0398, 'entropy': 0.2207, 'ce_loss': 0.0751, 'epoch': 0.75}
25%|██▌ | 81/321 [07:28<21:48, 5.45s/it] {'loss': 0.0757, 'grad_norm': 0.21337777376174927, 'learning_rate': 1.753640859933806e-05, 'kl': -0.0403, 'entropy': 0.2061, 'ce_loss': 0.0783, 'epoch': 0.76}
26%|██▌ | 82/321 [07:33<21:39, 5.44s/it] {'loss': 0.0729, 'grad_norm': 0.21283765137195587, 'learning_rate': 1.746962887498034e-05, 'kl': -0.0347, 'entropy': 0.1797, 'ce_loss': 0.0596, 'epoch': 0.76}
26%|██▌ | 83/321 [07:39<21:36, 5.45s/it] {'loss': 0.0797, 'grad_norm': 0.2063702642917633, 'learning_rate': 1.7402086941225246e-05, 'kl': -0.0292, 'entropy': 0.2637, 'ce_loss': 0.0892, 'epoch': 0.77}
26%|██▌ | 84/321 [07:44<21:31, 5.45s/it] {'loss': 0.0626, 'grad_norm': 0.1959114670753479, 'learning_rate': 1.7333789690128252e-05, 'kl': -0.0192, 'entropy': 0.2578, 'ce_loss': 0.1003, 'epoch': 0.78}
26%|██▋ | 85/321 [07:50<21:32, 5.48s/it] {'loss': 0.0524, 'grad_norm': 0.14823554456233978, 'learning_rate': 1.7264744090818284e-05, 'kl': -0.0266, 'entropy': 0.1934, 'ce_loss': 0.0672, 'epoch': 0.79}
27%|██▋ | 86/321 [07:55<21:47, 5.56s/it] {'loss': 0.0318, 'grad_norm': 0.09522274881601334, 'learning_rate': 1.719495718878655e-05, 'kl': -0.0474, 'entropy': 0.2236, 'ce_loss': 0.0442, 'epoch': 0.8}
27%|██▋ | 87/321 [08:01<21:33, 5.53s/it] {'loss': 0.0512, 'grad_norm': 0.15018120408058167, 'learning_rate': 1.712443610516765e-05, 'kl': -0.0552, 'entropy': 0.2812, 'ce_loss': 0.0948, 'epoch': 0.81}
27%|██▋ | 88/321 [08:06<21:21, 5.50s/it] {'loss': 0.0395, 'grad_norm': 0.12560266256332397, 'learning_rate': 1.7053188036012885e-05, 'kl': -0.0256, 'entropy': 0.3164, 'ce_loss': 0.0981, 'epoch': 0.82}
28%|██▊ | 89/321 [08:12<21:11, 5.48s/it] {'loss': 0.0614, 'grad_norm': 0.18557599186897278, 'learning_rate': 1.6981220251555996e-05, 'kl': -0.0303, 'entropy': 0.2471, 'ce_loss': 0.0874, 'epoch': 0.83}
28%|██▊ | 90/321 [08:17<21:02, 5.46s/it] {'loss': 0.0675, 'grad_norm': 0.1974230408668518, 'learning_rate': 1.6908540095471288e-05, 'kl': -0.0491, 'entropy': 0.2119, 'ce_loss': 0.078, 'epoch': 0.84}
28%|██▊ | 91/321 [08:23<20:59, 5.48s/it] {'loss': 0.0581, 'grad_norm': 0.18383747339248657, 'learning_rate': 1.6835154984124266e-05, 'kl': -0.0222, 'entropy': 0.1797, 'ce_loss': 0.0639, 'epoch': 0.85}
29%|██▊ | 92/321 [08:28<20:55, 5.48s/it] {'loss': 0.0757, 'grad_norm': 0.22654765844345093, 'learning_rate': 1.676107240581488e-05, 'kl': -0.025, 'entropy': 0.3281, 'ce_loss': 0.1065, 'epoch': 0.86}
29%|██▉ | 93/321 [08:34<20:51, 5.49s/it] {'loss': 0.0695, 'grad_norm': 0.18459008634090424, 'learning_rate': 1.6686299920013388e-05, 'kl': -0.0461, 'entropy': 0.1445, 'ce_loss': 0.0251, 'epoch': 0.87}
29%|██▉ | 94/321 [08:39<20:44, 5.48s/it] {'loss': 0.072, 'grad_norm': 0.21317991614341736, 'learning_rate': 1.661084515658901e-05, 'kl': -0.0491, 'entropy': 0.3535, 'ce_loss': 0.111, 'epoch': 0.88}
30%|██▉ | 95/321 [08:44<20:35, 5.47s/it] {'loss': 0.0569, 'grad_norm': 0.16013678908348083, 'learning_rate': 1.6534715815031325e-05, 'kl': -0.0649, 'entropy': 0.377, 'ce_loss': 0.146, 'epoch': 0.89}
30%|██▉ | 96/321 [08:50<20:36, 5.50s/it] {'loss': 0.0605, 'grad_norm': 0.1786569356918335, 'learning_rate': 1.645791966366464e-05, 'kl': -0.0317, 'entropy': 0.0967, 'ce_loss': 0.0473, 'epoch': 0.9}
30%|███ | 97/321 [08:56<20:32, 5.50s/it] {'loss': 0.0459, 'grad_norm': 0.14076007902622223, 'learning_rate': 1.63804645388553e-05, 'kl': -0.0217, 'entropy': 0.3301, 'ce_loss': 0.1067, 'epoch': 0.9}
31%|███ | 98/321 [09:01<20:26, 5.50s/it] {'loss': 0.049, 'grad_norm': 0.13985782861709595, 'learning_rate': 1.6302358344212025e-05, 'kl': -0.0359, 'entropy': 0.2266, 'ce_loss': 0.0589, 'epoch': 0.91}
31%|███ | 99/321 [09:06<20:18, 5.49s/it] {'loss': 0.07, 'grad_norm': 0.20864830911159515, 'learning_rate': 1.622360904977946e-05, 'kl': -0.0566, 'entropy': 0.1768, 'ce_loss': 0.0669, 'epoch': 0.92}
31%|███ | 100/321 [09:12<20:15, 5.50s/it] {'loss': 0.0738, 'grad_norm': 0.1885262280702591, 'learning_rate': 1.6144224691224868e-05, 'kl': -0.0417, 'entropy': 0.2168, 'ce_loss': 0.0757, 'epoch': 0.93}
31%|███▏ | 101/321 [09:18<20:11, 5.51s/it] {'loss': 0.0725, 'grad_norm': 0.19844770431518555, 'learning_rate': 1.606421336901818e-05, 'kl': -0.0461, 'entropy': 0.2188, 'ce_loss': 0.0711, 'epoch': 0.94}
32%|███▏ | 102/321 [09:23<20:06, 5.51s/it] {'loss': 0.0442, 'grad_norm': 0.1333250254392624, 'learning_rate': 1.5983583247605414e-05, 'kl': -0.0359, 'entropy': 0.2539, 'ce_loss': 0.0848, 'epoch': 0.95}
32%|███▏ | 103/321 [09:28<19:57, 5.49s/it] {'loss': 0.0555, 'grad_norm': 0.16762061417102814, 'learning_rate': 1.590234255457555e-05, 'kl': -0.0179, 'entropy': 0.252, 'ce_loss': 0.0824, 'epoch': 0.96}
32%|███▏ | 104/321 [09:34<19:52, 5.49s/it] {'loss': 0.0564, 'grad_norm': 0.177537202835083, 'learning_rate': 1.582049957982099e-05, 'kl': -0.0393, 'entropy': 0.1807, 'ce_loss': 0.0678, 'epoch': 0.97}
33%|███▎ | 105/321 [09:40<19:53, 5.53s/it] {'loss': 0.0692, 'grad_norm': 0.20024636387825012, 'learning_rate': 1.5738062674691657e-05, 'kl': -0.0275, 'entropy': 0.2275, 'ce_loss': 0.0741, 'epoch': 0.98}
33%|███▎ | 106/321 [09:45<19:42, 5.50s/it] {'loss': 0.0651, 'grad_norm': 0.20259137451648712, 'learning_rate': 1.5655040251142787e-05, 'kl': -0.0244, 'entropy': 0.2217, 'ce_loss': 0.0729, 'epoch': 0.99}
33%|███▎ | 107/321 [09:51<19:34, 5.49s/it] {'loss': 0.061, 'grad_norm': 0.16389916837215424, 'learning_rate': 1.5571440780876588e-05, 'kl': -0.0157, 'entropy': 0.2012, 'ce_loss': 0.0687, 'epoch': 1.0}
34%|███▎ | 108/321 [09:52<15:02, 4.24s/it] {'loss': 0.0391, 'grad_norm': 0.16389916837215424, 'learning_rate': 1.548727279447777e-05, 'kl': -0.0276, 'entropy': 0.1445, 'ce_loss': 0.2476, 'epoch': 1.0}
34%|███▍ | 109/321 [09:57<16:16, 4.61s/it] {'loss': 0.0452, 'grad_norm': 0.34768620133399963, 'learning_rate': 1.540254488054307e-05, 'kl': 0.0835, 'entropy': 0.2197, 'ce_loss': 0.0691, 'epoch': 1.01}
34%|███▍ | 110/321 [10:03<17:08, 4.87s/it] {'loss': 0.0471, 'grad_norm': 0.18361957371234894, 'learning_rate': 1.5317265684804865e-05, 'kl': 0.0008, 'entropy': 0.2139, 'ce_loss': 0.0772, 'epoch': 1.02}
35%|███▍ | 111/321 [10:08<17:38, 5.04s/it] {'loss': 0.0399, 'grad_norm': 0.18910780549049377, 'learning_rate': 1.5231443909248956e-05, 'kl': -0.0051, 'entropy': 0.2109, 'ce_loss': 0.0947, 'epoch': 1.03}
35%|███▍ | 112/321 [10:14<17:59, 5.16s/it] {'loss': 0.0412, 'grad_norm': 0.18259647488594055, 'learning_rate': 1.5145088311226599e-05, 'kl': 0.0408, 'entropy': 0.0811, 'ce_loss': 0.037, 'epoch': 1.04}
35%|███▌ | 113/321 [10:19<18:14, 5.26s/it] {'loss': 0.0414, 'grad_norm': 0.21136245131492615, 'learning_rate': 1.5058207702560907e-05, 'kl': 0.063, 'entropy': 0.125, 'ce_loss': 0.0617, 'epoch': 1.05}
36%|███▌ | 114/321 [10:25<18:25, 5.34s/it] {'loss': 0.0336, 'grad_norm': 0.17452239990234375, 'learning_rate': 1.4970810948647664e-05, 'kl': -0.0054, 'entropy': 0.1602, 'ce_loss': 0.0847, 'epoch': 1.06}
36%|███▌ | 115/321 [10:30<18:27, 5.37s/it] {'loss': 0.0416, 'grad_norm': 0.27118608355522156, 'learning_rate': 1.4882906967550708e-05, 'kl': -0.0466, 'entropy': 0.104, 'ce_loss': 0.0761, 'epoch': 1.07}
36%|███▌ | 116/321 [10:36<18:26, 5.40s/it] {'loss': 0.0453, 'grad_norm': 0.22193056344985962, 'learning_rate': 1.479450472909191e-05, 'kl': -0.0101, 'entropy': 0.1592, 'ce_loss': 0.0893, 'epoch': 1.07}
36%|███▋ | 117/321 [10:41<18:27, 5.43s/it] {'loss': 0.0486, 'grad_norm': 0.2031334489583969, 'learning_rate': 1.4705613253935886e-05, 'kl': -0.0284, 'entropy': 0.1025, 'ce_loss': 0.0659, 'epoch': 1.08}
37%|███▋ | 118/321 [10:47<18:26, 5.45s/it] {'loss': 0.0415, 'grad_norm': 0.2964913845062256, 'learning_rate': 1.4616241612669523e-05, 'kl': 0.0403, 'entropy': 0.1079, 'ce_loss': 0.0631, 'epoch': 1.09}
37%|███▋ | 119/321 [10:52<18:22, 5.46s/it] {'loss': 0.0427, 'grad_norm': 0.19633540511131287, 'learning_rate': 1.4526398924876407e-05, 'kl': -0.0203, 'entropy': 0.0991, 'ce_loss': 0.0589, 'epoch': 1.1}
37%|███▋ | 120/321 [10:58<18:20, 5.47s/it] {'loss': 0.0536, 'grad_norm': 0.2811174690723419, 'learning_rate': 1.4436094358206224e-05, 'kl': 0.0001, 'entropy': 0.1001, 'ce_loss': 0.0519, 'epoch': 1.11}
38%|███▊ | 121/321 [11:03<18:17, 5.49s/it] {'loss': 0.0445, 'grad_norm': 0.19164954125881195, 'learning_rate': 1.4345337127439333e-05, 'kl': 0.0903, 'entropy': 0.1787, 'ce_loss': 0.0799, 'epoch': 1.12}
38%|███▊ | 122/321 [11:09<18:12, 5.49s/it] {'loss': 0.0424, 'grad_norm': 0.25314974784851074, 'learning_rate': 1.4254136493546432e-05, 'kl': 0.0322, 'entropy': 0.1924, 'ce_loss': 0.0798, 'epoch': 1.13}
38%|███▊ | 123/321 [11:14<18:09, 5.50s/it] {'loss': 0.0373, 'grad_norm': 0.12902231514453888, 'learning_rate': 1.4162501762743579e-05, 'kl': 0.042, 'entropy': 0.1226, 'ce_loss': 0.0694, 'epoch': 1.14}
39%|███▊ | 124/321 [11:20<18:07, 5.52s/it] {'loss': 0.0454, 'grad_norm': 0.22617529332637787, 'learning_rate': 1.4070442285542579e-05, 'kl': -0.0043, 'entropy': 0.1011, 'ce_loss': 0.0623, 'epoch': 1.15}
39%|███▉ | 125/321 [11:25<18:05, 5.54s/it] {'loss': 0.0425, 'grad_norm': 0.26588836312294006, 'learning_rate': 1.3977967455796828e-05, 'kl': -0.0179, 'entropy': 0.1865, 'ce_loss': 0.0958, 'epoch': 1.16}
39%|███▉ | 126/321 [11:31<18:03, 5.56s/it] {'loss': 0.0393, 'grad_norm': 0.16521866619586945, 'learning_rate': 1.3885086709742788e-05, 'kl': 0.0381, 'entropy': 0.1553, 'ce_loss': 0.0691, 'epoch': 1.17}
40%|███▉ | 127/321 [11:37<18:05, 5.59s/it] {'loss': 0.037, 'grad_norm': 0.18703752756118774, 'learning_rate': 1.3791809525037057e-05, 'kl': -0.0179, 'entropy': 0.0742, 'ce_loss': 0.051, 'epoch': 1.18}
40%|███▉ | 128/321 [11:42<17:54, 5.57s/it] {'loss': 0.0485, 'grad_norm': 0.19613757729530334, 'learning_rate': 1.3698145419789302e-05, 'kl': 0.0452, 'entropy': 0.1328, 'ce_loss': 0.0521, 'epoch': 1.19}
40%|████ | 129/321 [11:48<17:53, 5.59s/it] {'loss': 0.036, 'grad_norm': 0.14802619814872742, 'learning_rate': 1.3604103951590993e-05, 'kl': 0.0154, 'entropy': 0.1494, 'ce_loss': 0.0648, 'epoch': 1.2}
40%|████ | 130/321 [11:53<17:44, 5.57s/it] {'loss': 0.0456, 'grad_norm': 0.22272059321403503, 'learning_rate': 1.3509694716540135e-05, 'kl': 0.0356, 'entropy': 0.0723, 'ce_loss': 0.0401, 'epoch': 1.21}
41%|████ | 131/321 [11:59<17:38, 5.57s/it] {'loss': 0.0374, 'grad_norm': 0.16963350772857666, 'learning_rate': 1.341492734826209e-05, 'kl': 0.001, 'entropy': 0.1104, 'ce_loss': 0.0607, 'epoch': 1.21}
41%|████ | 132/321 [12:04<17:31, 5.56s/it] {'loss': 0.0416, 'grad_norm': 0.14699193835258484, 'learning_rate': 1.3319811516926541e-05, 'kl': 0.0334, 'entropy': 0.1406, 'ce_loss': 0.061, 'epoch': 1.22}
41%|████▏ | 133/321 [12:10<17:27, 5.57s/it] {'loss': 0.0497, 'grad_norm': 0.21459099650382996, 'learning_rate': 1.3224356928260735e-05, 'kl': 0.0903, 'entropy': 0.0688, 'ce_loss': 0.0198, 'epoch': 1.23}
42%|████▏ | 134/321 [12:15<17:19, 5.56s/it] {'loss': 0.037, 'grad_norm': 0.21457742154598236, 'learning_rate': 1.3128573322559097e-05, 'kl': -0.042, 'entropy': 0.2148, 'ce_loss': 0.0958, 'epoch': 1.24}
42%|████▏ | 135/321 [12:21<17:12, 5.55s/it] {'loss': 0.0505, 'grad_norm': 0.25294315814971924, 'learning_rate': 1.3032470473689322e-05, 'kl': -0.0425, 'entropy': 0.1387, 'ce_loss': 0.0772, 'epoch': 1.25}
42%|████▏ | 136/321 [12:27<17:07, 5.55s/it] {'loss': 0.0466, 'grad_norm': 0.16512276232242584, 'learning_rate': 1.2936058188095045e-05, 'kl': 0.0918, 'entropy': 0.1006, 'ce_loss': 0.0524, 'epoch': 1.26}
43%|████▎ | 137/321 [12:32<17:00, 5.55s/it] {'loss': 0.0342, 'grad_norm': 0.1969085931777954, 'learning_rate': 1.2839346303795173e-05, 'kl': 0.0515, 'entropy': 0.1416, 'ce_loss': 0.0561, 'epoch': 1.27}
43%|████▎ | 138/321 [12:38<16:54, 5.55s/it] {'loss': 0.0547, 'grad_norm': 0.22046498954296112, 'learning_rate': 1.274234468938001e-05, 'kl': 0.0835, 'entropy': 0.1235, 'ce_loss': 0.0676, 'epoch': 1.28}
43%|████▎ | 139/321 [12:43<16:56, 5.58s/it] {'loss': 0.0362, 'grad_norm': 0.1698365956544876, 'learning_rate': 1.2645063243004236e-05, 'kl': -0.0222, 'entropy': 0.1245, 'ce_loss': 0.0778, 'epoch': 1.29}
44%|████▎ | 140/321 [12:49<16:50, 5.58s/it] {'loss': 0.0382, 'grad_norm': 0.13968144357204437, 'learning_rate': 1.2547511891376916e-05, 'kl': 0.0593, 'entropy': 0.1797, 'ce_loss': 0.0834, 'epoch': 1.3}
44%|████▍ | 141/321 [12:54<16:42, 5.57s/it] {'loss': 0.0402, 'grad_norm': 0.1825486421585083, 'learning_rate': 1.2449700588748541e-05, 'kl': -0.0162, 'entropy': 0.1484, 'ce_loss': 0.0752, 'epoch': 1.31}
44%|████▍ | 142/321 [13:00<16:32, 5.55s/it] {'loss': 0.0449, 'grad_norm': 0.21553905308246613, 'learning_rate': 1.2351639315895309e-05, 'kl': 0.0047, 'entropy': 0.167, 'ce_loss': 0.073, 'epoch': 1.32}
45%|████▍ | 143/321 [13:05<16:20, 5.51s/it] {'loss': 0.0536, 'grad_norm': 0.17211349308490753, 'learning_rate': 1.2253338079100652e-05, 'kl': 0.0258, 'entropy': 0.084, 'ce_loss': 0.0464, 'epoch': 1.33}
45%|████▍ | 144/321 [13:11<16:17, 5.52s/it] {'loss': 0.0425, 'grad_norm': 0.2392333745956421, 'learning_rate': 1.2154806909134198e-05, 'kl': 0.0111, 'entropy': 0.1094, 'ce_loss': 0.0533, 'epoch': 1.34}
45%|████▌ | 145/321 [13:16<16:14, 5.54s/it] {'loss': 0.0496, 'grad_norm': 0.22119127213954926, 'learning_rate': 1.205605586022822e-05, 'kl': -0.0498, 'entropy': 0.1689, 'ce_loss': 0.0877, 'epoch': 1.34}
45%|████▌ | 146/321 [13:22<16:11, 5.55s/it] {'loss': 0.0376, 'grad_norm': 0.17724597454071045, 'learning_rate': 1.1957095009051683e-05, 'kl': 0.0099, 'entropy': 0.1523, 'ce_loss': 0.0728, 'epoch': 1.35}
46%|████▌ | 147/321 [13:28<16:02, 5.53s/it] {'loss': 0.0359, 'grad_norm': 0.17848482728004456, 'learning_rate': 1.1857934453682016e-05, 'kl': 0.0859, 'entropy': 0.0171, 'ce_loss': 0.0275, 'epoch': 1.36}
46%|████▌ | 148/321 [13:33<15:56, 5.53s/it] {'loss': 0.0394, 'grad_norm': 0.18580235540866852, 'learning_rate': 1.1758584312574693e-05, 'kl': 0.1069, 'entropy': 0.0447, 'ce_loss': 0.0515, 'epoch': 1.37}
46%|████▋ | 149/321 [13:39<15:51, 5.53s/it] {'loss': 0.0646, 'grad_norm': 0.23421402275562286, 'learning_rate': 1.1659054723530721e-05, 'kl': 0.1162, 'entropy': 0.1079, 'ce_loss': 0.0598, 'epoch': 1.38}
47%|████▋ | 150/321 [13:44<15:43, 5.52s/it] {'loss': 0.0518, 'grad_norm': 0.2502172589302063, 'learning_rate': 1.1559355842662188e-05, 'kl': 0.0972, 'entropy': 0.1465, 'ce_loss': 0.0639, 'epoch': 1.39}
47%|████▋ | 151/321 [13:50<15:34, 5.50s/it] {'loss': 0.0392, 'grad_norm': 0.16437506675720215, 'learning_rate': 1.1459497843355907e-05, 'kl': -0.0081, 'entropy': 0.1533, 'ce_loss': 0.0869, 'epoch': 1.4}
47%|████▋ | 152/321 [13:55<15:28, 5.49s/it] {'loss': 0.0297, 'grad_norm': 0.16849961876869202, 'learning_rate': 1.1359490915235323e-05, 'kl': 0.0486, 'entropy': 0.127, 'ce_loss': 0.0625, 'epoch': 1.41}
48%|████▊ | 153/321 [14:00<15:20, 5.48s/it] {'loss': 0.0574, 'grad_norm': 0.21643343567848206, 'learning_rate': 1.1259345263120738e-05, 'kl': 0.1201, 'entropy': 0.0466, 'ce_loss': 0.0416, 'epoch': 1.42}
48%|████▊ | 154/321 [14:06<15:21, 5.52s/it] {'loss': 0.04, 'grad_norm': 0.24577035009860992, 'learning_rate': 1.1159071105988012e-05, 'kl': 0.0547, 'entropy': 0.1123, 'ce_loss': 0.0483, 'epoch': 1.43}
48%|████▊ | 155/321 [14:12<15:16, 5.52s/it] {'loss': 0.0426, 'grad_norm': 0.19843176007270813, 'learning_rate': 1.1058678675925796e-05, 'kl': -0.0126, 'entropy': 0.1846, 'ce_loss': 0.081, 'epoch': 1.44}
49%|████▊ | 156/321 [14:17<15:08, 5.50s/it] {'loss': 0.0395, 'grad_norm': 0.1768287569284439, 'learning_rate': 1.0958178217091455e-05, 'kl': 0.0327, 'entropy': 0.1084, 'ce_loss': 0.0531, 'epoch': 1.45}
49%|████▉ | 157/321 [14:23<15:08, 5.54s/it] {'loss': 0.0372, 'grad_norm': 0.15050095319747925, 'learning_rate': 1.0857579984665733e-05, 'kl': 0.0214, 'entropy': 0.124, 'ce_loss': 0.0549, 'epoch': 1.46}
49%|████▉ | 158/321 [14:28<15:02, 5.54s/it] {'loss': 0.034, 'grad_norm': 0.14452822506427765, 'learning_rate': 1.0756894243806291e-05, 'kl': 0.105, 'entropy': 0.1074, 'ce_loss': 0.0499, 'epoch': 1.47}
50%|████▉ | 159/321 [14:34<14:54, 5.52s/it] {'loss': 0.0502, 'grad_norm': 0.20282569527626038, 'learning_rate': 1.0656131268600254e-05, 'kl': 0.0014, 'entropy': 0.1787, 'ce_loss': 0.081, 'epoch': 1.48}
50%|████▉ | 160/321 [14:39<14:48, 5.52s/it] {'loss': 0.0423, 'grad_norm': 0.23301592469215393, 'learning_rate': 1.0555301341015832e-05, 'kl': 0.0757, 'entropy': 0.0427, 'ce_loss': 0.0388, 'epoch': 1.48}
50%|█████ | 161/321 [14:45<14:44, 5.53s/it] {'loss': 0.0409, 'grad_norm': 0.15230302512645721, 'learning_rate': 1.0454414749853126e-05, 'kl': 0.1318, 'entropy': 0.0845, 'ce_loss': 0.0592, 'epoch': 1.49}
50%|█████ | 162/321 [14:50<14:37, 5.52s/it] {'loss': 0.0386, 'grad_norm': 0.20218664407730103, 'learning_rate': 1.0353481789694258e-05, 'kl': 0.0864, 'entropy': 0.0806, 'ce_loss': 0.0432, 'epoch': 1.5}
51%|█████ | 163/321 [14:56<14:31, 5.52s/it] {'loss': 0.0351, 'grad_norm': 0.13059383630752563, 'learning_rate': 1.0252512759852891e-05, 'kl': 0.0197, 'entropy': 0.1289, 'ce_loss': 0.0587, 'epoch': 1.51}
51%|█████ | 164/321 [15:01<14:25, 5.51s/it] {'loss': 0.0378, 'grad_norm': 0.1583739072084427, 'learning_rate': 1.015151796332328e-05, 'kl': 0.0018, 'entropy': 0.1377, 'ce_loss': 0.0707, 'epoch': 1.52}
51%|█████▏ | 165/321 [15:07<14:18, 5.50s/it] {'loss': 0.0363, 'grad_norm': 0.12924712896347046, 'learning_rate': 1.0050507705728943e-05, 'kl': 0.0032, 'entropy': 0.1885, 'ce_loss': 0.0925, 'epoch': 1.53}
52%|█████▏ | 166/321 [15:12<14:12, 5.50s/it] {'loss': 0.0442, 'grad_norm': 0.2284664362668991, 'learning_rate': 9.949492294271062e-06, 'kl': 0.0593, 'entropy': 0.0703, 'ce_loss': 0.0353, 'epoch': 1.54}
52%|█████▏ | 167/321 [15:18<14:05, 5.49s/it] {'loss': 0.0298, 'grad_norm': 0.13885430991649628, 'learning_rate': 9.848482036676725e-06, 'kl': 0.0131, 'entropy': 0.0771, 'ce_loss': 0.0575, 'epoch': 1.55}
52%|█████▏ | 168/321 [15:23<13:59, 5.49s/it] {'loss': 0.0629, 'grad_norm': 0.2584099769592285, 'learning_rate': 9.747487240147112e-06, 'kl': -0.0198, 'entropy': 0.1001, 'ce_loss': 0.0583, 'epoch': 1.56}
53%|█████▎ | 169/321 [15:29<13:54, 5.49s/it] {'loss': 0.0495, 'grad_norm': 0.25362423062324524, 'learning_rate': 9.646518210305747e-06, 'kl': 0.0063, 'entropy': 0.0869, 'ce_loss': 0.0644, 'epoch': 1.57}
53%|█████▎ | 170/321 [15:34<13:46, 5.48s/it] {'loss': 0.0508, 'grad_norm': 0.20054134726524353, 'learning_rate': 9.545585250146879e-06, 'kl': -0.0024, 'entropy': 0.1611, 'ce_loss': 0.0894, 'epoch': 1.58}
53%|█████▎ | 171/321 [15:40<13:39, 5.46s/it] {'loss': 0.0644, 'grad_norm': 0.3231568932533264, 'learning_rate': 9.44469865898417e-06, 'kl': 0.0596, 'entropy': 0.1133, 'ce_loss': 0.0555, 'epoch': 1.59}
54%|█████▎ | 172/321 [15:45<13:34, 5.46s/it] {'loss': 0.0394, 'grad_norm': 0.17434173822402954, 'learning_rate': 9.34386873139975e-06, 'kl': 0.0037, 'entropy': 0.1099, 'ce_loss': 0.0558, 'epoch': 1.6}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 172/321 [15:45<13:34, 5.46s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 173/321 [15:51<13:30, 5.48s/it] {'loss': 0.0527, 'grad_norm': 0.20393753051757812, 'learning_rate': 9.243105756193714e-06, 'kl': -0.0132, 'entropy': 0.1064, 'ce_loss': 0.0663, 'epoch': 1.61}
54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 173/321 [15:51<13:30, 5.48s/it] 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 174/321 [15:56<13:28, 5.50s/it] {'loss': 0.0455, 'grad_norm': 0.18590298295021057, 'learning_rate': 9.14242001533427e-06, 'kl': 0.0469, 'entropy': 0.04, 'ce_loss': 0.0422, 'epoch': 1.62}
55%|█████▍    | 175/321 [16:02<13:22, 5.50s/it] {'loss': 0.0312, 'grad_norm': 0.18247582018375397, 'learning_rate': 9.041821782908544e-06, 'kl': 0.0796, 'entropy': 0.0645, 'ce_loss': 0.0376, 'epoch': 1.62}
55%|█████▍    | 176/321 [16:07<13:20, 5.52s/it] {'loss': 0.0539, 'grad_norm': 0.26011157035827637, 'learning_rate': 8.941321324074207e-06, 'kl': -0.0275, 'entropy': 0.1074, 'ce_loss': 0.0641, 'epoch': 1.63}
55%|█████▌    | 177/321 [16:13<13:15, 5.52s/it] {'loss': 0.0333, 'grad_norm': 0.12031043320894241, 'learning_rate': 8.840928894011995e-06, 'kl': -0.0181, 'entropy': 0.21, 'ce_loss': 0.0949, 'epoch': 1.64}
55%|█████▌    | 178/321 [16:18<13:08, 5.52s/it] {'loss': 0.0462, 'grad_norm': 0.16544109582901, 'learning_rate': 8.740654736879265e-06, 'kl': -0.0088, 'entropy': 0.2402, 'ce_loss': 0.1025, 'epoch': 1.65}
56%|█████▌    | 179/321 [16:24<13:04, 5.53s/it] {'loss': 0.0411, 'grad_norm': 0.21774324774742126, 'learning_rate': 8.640509084764682e-06, 'kl': -0.0114, 'entropy': 0.1055, 'ce_loss': 0.0474, 'epoch': 1.66}
56%|█████▌    | 180/321 [16:29<12:56, 5.51s/it] {'loss': 0.0398, 'grad_norm': 0.16917741298675537, 'learning_rate': 8.540502156644096e-06, 'kl': -0.0046, 'entropy': 0.1152, 'ce_loss': 0.0681, 'epoch': 1.67}
56%|█████▋    | 181/321 [16:35<12:52, 5.52s/it] {'loss': 0.0377, 'grad_norm': 0.14676432311534882, 'learning_rate': 8.440644157337819e-06, 'kl': -0.0104, 'entropy': 0.208, 'ce_loss': 0.0879, 'epoch': 1.68}
57%|█████▋    | 182/321 [16:40<12:45, 5.51s/it] {'loss': 0.0476, 'grad_norm': 0.23645582795143127, 'learning_rate': 8.340945276469282e-06, 'kl': 0.0378, 'entropy': 0.1328, 'ce_loss': 0.0645, 'epoch': 1.69}
57%|█████▋    | 183/321 [16:46<12:40, 5.51s/it] {'loss': 0.0454, 'grad_norm': 0.18683338165283203, 'learning_rate': 8.24141568742531e-06, 'kl': 0.0889, 'entropy': 0.0522, 'ce_loss': 0.0368, 'epoch': 1.7}
57%|█████▋    | 184/321 [16:51<12:32, 5.49s/it] {'loss': 0.033, 'grad_norm': 0.13285161554813385, 'learning_rate': 8.142065546317988e-06, 'kl': 0.0515, 'entropy': 0.1064, 'ce_loss': 0.0577, 'epoch': 1.71}
58%|█████▊    | 185/321 [16:57<12:28, 5.50s/it] {'loss': 0.0544, 'grad_norm': 0.26400867104530334, 'learning_rate': 8.042904990948319e-06, 'kl': 0.0016, 'entropy': 0.1128, 'ce_loss': 0.0664, 'epoch': 1.72}
58%|█████▊    | 186/321 [17:02<12:22, 5.50s/it] {'loss': 0.0486, 'grad_norm': 0.19021686911582947, 'learning_rate': 7.943944139771784e-06, 'kl': 0.0649, 'entropy': 0.0708, 'ce_loss': 0.0415, 'epoch': 1.73}
58%|█████▊    | 187/321 [17:08<12:18, 5.51s/it] {'loss': 0.055, 'grad_norm': 0.22009411454200745, 'learning_rate': 7.845193090865807e-06, 'kl': 0.1182, 'entropy': 0.0698, 'ce_loss': 0.051, 'epoch': 1.74}
59%|█████▊    | 188/321 [17:13<12:12, 5.50s/it] {'loss': 0.0347, 'grad_norm': 0.1851823478937149, 'learning_rate': 7.746661920899351e-06, 'kl': 0.0265, 'entropy': 0.1118, 'ce_loss': 0.0598, 'epoch': 1.75}
59%|█████▉    | 189/321 [17:19<12:04, 5.49s/it] {'loss': 0.0437, 'grad_norm': 0.23166052997112274, 'learning_rate': 7.648360684104695e-06, 'kl': 0.0649, 'entropy': 0.0913, 'ce_loss': 0.0592, 'epoch': 1.76}
59%|█████▉    | 190/321 [17:24<11:58, 5.49s/it] {'loss': 0.0462, 'grad_norm': 0.17776505649089813, 'learning_rate': 7.550299411251461e-06, 'kl': -0.0074, 'entropy': 0.1689, 'ce_loss': 0.0917, 'epoch': 1.76}
60%|█████▉    | 191/321 [17:30<11:51, 5.47s/it] {'loss': 0.0455, 'grad_norm': 0.18737439811229706, 'learning_rate': 7.452488108623089e-06, 'kl': -0.0033, 'entropy': 0.1748, 'ce_loss': 0.0997, 'epoch': 1.77}
60%|█████▉    | 192/321 [17:35<11:45, 5.47s/it] {'loss': 0.0451, 'grad_norm': 0.18686872720718384, 'learning_rate': 7.354936756995766e-06, 'kl': -0.0095, 'entropy': 0.1357, 'ce_loss': 0.0782, 'epoch': 1.78}
60%|██████    | 193/321 [17:40<11:38, 5.45s/it] {'loss': 0.0397, 'grad_norm': 0.18644410371780396, 'learning_rate': 7.257655310619996e-06, 'kl': -0.0021, 'entropy': 0.063, 'ce_loss': 0.0551, 'epoch': 1.79}
60%|██████    | 194/321 [17:46<11:33, 5.46s/it] {'loss': 0.0404, 'grad_norm': 0.19344562292099, 'learning_rate': 7.16065369620483e-06, 'kl': 0.0491, 'entropy': 0.0845, 'ce_loss': 0.0443, 'epoch': 1.8}
61%|██████    | 195/321 [17:52<11:32, 5.50s/it] {'loss': 0.0373, 'grad_norm': 0.18872958421707153, 'learning_rate': 7.063941811904956e-06, 'kl': -0.028, 'entropy': 0.1416, 'ce_loss': 0.0675, 'epoch': 1.81}
61%|██████    | 196/321 [17:57<11:24, 5.48s/it] {'loss': 0.0458, 'grad_norm': 0.18720188736915588, 'learning_rate': 6.967529526310681e-06, 'kl': 0.1318, 'entropy': 0.0581, 'ce_loss': 0.0473, 'epoch': 1.82}
61%|██████▏   | 197/321 [18:02<11:18, 5.47s/it] {'loss': 0.0401, 'grad_norm': 0.1798112690448761, 'learning_rate': 6.871426677440907e-06, 'kl': 0.123, 'entropy': 0.1348, 'ce_loss': 0.064, 'epoch': 1.83}
62%|██████▏   | 198/321 [18:08<11:11, 5.46s/it] {'loss': 0.0385, 'grad_norm': 0.14133110642433167, 'learning_rate': 6.775643071739267e-06, 'kl': 0.1514, 'entropy': 0.084, 'ce_loss': 0.0519, 'epoch': 1.84}
62%|██████▏   | 199/321 [18:13<11:05, 5.45s/it] {'loss': 0.0479, 'grad_norm': 0.1965685486793518, 'learning_rate': 6.680188483073458e-06, 'kl': 0.0776, 'entropy': 0.0981, 'ce_loss': 0.0556, 'epoch': 1.85}
62%|██████▏   | 200/321 [18:19<11:01, 5.47s/it] {'loss': 0.0472, 'grad_norm': 0.2427220493555069, 'learning_rate': 6.585072651737911e-06, 'kl': 0.1016, 'entropy': 0.1162, 'ce_loss': 0.0619, 'epoch': 1.86}
63%|██████▎   | 201/321 [18:24<10:55, 5.47s/it] {'loss': 0.0404, 'grad_norm': 0.1257992386817932, 'learning_rate': 6.49030528345987e-06, 'kl': -0.0354, 'entropy': 0.1348, 'ce_loss': 0.0772, 'epoch': 1.87}
63%|██████▎   | 202/321 [18:30<10:51, 5.47s/it] {'loss': 0.0599, 'grad_norm': 0.24053309857845306, 'learning_rate': 6.3958960484090094e-06, 'kl': 0.0288, 'entropy': 0.0835, 'ce_loss': 0.0535, 'epoch': 1.88}
63%|██████▎   | 203/321 [18:35<10:44, 5.46s/it] {'loss': 0.0357, 'grad_norm': 0.1884569376707077, 'learning_rate': 6.3018545802107e-06, 'kl': -0.014, 'entropy': 0.1128, 'ce_loss': 0.067, 'epoch': 1.89}
64%|██████▎   | 204/321 [18:41<10:38, 5.46s/it] {'loss': 0.0448, 'grad_norm': 0.1645369976758957, 'learning_rate': 6.208190474962945e-06, 'kl': 0.1475, 'entropy': 0.0535, 'ce_loss': 0.0392, 'epoch': 1.9}
64%|██████▍   | 205/321 [18:46<10:34, 5.47s/it] {'loss': 0.0405, 'grad_norm': 0.22946400940418243, 'learning_rate': 6.114913290257219e-06, 'kl': -0.0236, 'entropy': 0.1118, 'ce_loss': 0.0635, 'epoch': 1.9}
64%|██████▍   | 206/321 [18:52<10:25, 5.44s/it] {'loss': 0.0294, 'grad_norm': 0.11852026730775833, 'learning_rate': 6.0220325442031714e-06, 'kl': -0.0104, 'entropy': 0.2217, 'ce_loss': 0.0981, 'epoch': 1.91}
64%|██████▍   | 207/321 [18:57<10:21, 5.45s/it] {'loss': 0.0325, 'grad_norm': 0.16473232209682465, 'learning_rate': 5.929557714457425e-06, 'kl': -0.0282, 'entropy': 0.252, 'ce_loss': 0.1167, 'epoch': 1.92}
65%|██████▍   | 208/321 [19:02<10:15, 5.45s/it] {'loss': 0.0495, 'grad_norm': 0.18369171023368835, 'learning_rate': 5.8374982372564255e-06, 'kl': 0.0498, 'entropy': 0.0947, 'ce_loss': 0.0577, 'epoch': 1.93}
65%|██████▌   | 209/321 [19:08<10:12, 5.47s/it] {'loss': 0.0487, 'grad_norm': 0.18389469385147095, 'learning_rate': 5.745863506453569e-06, 'kl': 0.062, 'entropy': 0.1514, 'ce_loss': 0.0798, 'epoch': 1.94}
65%|██████▌   | 210/321 [19:13<10:06, 5.46s/it] {'loss': 0.0356, 'grad_norm': 0.18366190791130066, 'learning_rate': 5.6546628725606675e-06, 'kl': 0.0515, 'entropy': 0.1113, 'ce_loss': 0.0581, 'epoch': 1.95}
66%|██████▌   | 211/321 [19:19<10:00, 5.46s/it] {'loss': 0.0469, 'grad_norm': 0.15876325964927673, 'learning_rate': 5.563905641793776e-06, 'kl': -0.0175, 'entropy': 0.2539, 'ce_loss': 0.1105, 'epoch': 1.96}
66%|██████▌   | 212/321 [19:24<09:56, 5.47s/it] {'loss': 0.0445, 'grad_norm': 0.26259469985961914, 'learning_rate': 5.473601075123599e-06, 'kl': 0.0369, 'entropy': 0.105, 'ce_loss': 0.0526, 'epoch': 1.97}
66%|██████▋   | 213/321 [19:30<09:50, 5.47s/it] {'loss': 0.0297, 'grad_norm': 0.12675271928310394, 'learning_rate': 5.383758387330476e-06, 'kl': 0.0026, 'entropy': 0.1289, 'ce_loss': 0.0728, 'epoch': 1.98}
67%|██████▋   | 214/321 [19:35<09:45, 5.47s/it] {'loss': 0.0467, 'grad_norm': 0.19244596362113953, 'learning_rate': 5.294386746064115e-06, 'kl': -0.0228, 'entropy': 0.1328, 'ce_loss': 0.0803, 'epoch': 1.99}
67%|██████▋   | 215/321 [19:41<09:38, 5.46s/it] {'loss': 0.0462, 'grad_norm': 0.2244977205991745, 'learning_rate': 5.205495270908094e-06, 'kl': 0.0447, 'entropy': 0.083, 'ce_loss': 0.0564, 'epoch': 2.0}
67%|██████▋   | 216/321 [19:42<07:22, 4.22s/it] {'loss': 0.0398, 'grad_norm': 0.2244977205991745, 'learning_rate': 5.117093032449297e-06, 'kl': 0.0898, 'entropy': 0.0206, 'ce_loss': 0.1184, 'epoch': 2.0}
68%|██████▊   | 217/321 [19:48<07:57, 4.59s/it] {'loss': 0.0258, 'grad_norm': 0.4248782694339752, 'learning_rate': 5.029189051352339e-06, 'kl': -0.0123, 'entropy': 0.1709, 'ce_loss': 0.0853, 'epoch': 2.01}
68%|██████▊   | 218/321 [19:53<08:20, 4.86s/it] {'loss': 0.0256, 'grad_norm': 0.12272848188877106, 'learning_rate': 4.941792297439098e-06, 'kl': 0.0223, 'entropy': 0.0918, 'ce_loss': 0.0505, 'epoch': 2.02}
68%|██████▊   | 219/321 [19:59<08:35, 5.06s/it] {'loss': 0.0292, 'grad_norm': 0.1554100066423416, 'learning_rate': 4.8549116887734045e-06, 'kl': 0.0771, 'entropy': 0.0796, 'ce_loss': 0.0564, 'epoch': 2.03}
69%|██████▊   | 220/321 [20:04<08:41, 5.17s/it] {'loss': 0.0308, 'grad_norm': 0.14393459260463715, 'learning_rate': 4.7685560907510465e-06, 'kl': 0.0991, 'entropy': 0.1138, 'ce_loss': 0.0579, 'epoch': 2.04}
69%|██████▉   | 221/321 [20:09<08:46, 5.27s/it] {'loss': 0.0222, 'grad_norm': 0.1449674814939499, 'learning_rate': 4.682734315195138e-06, 'kl': -0.0442, 'entropy': 0.0952, 'ce_loss': 0.0674, 'epoch': 2.05}
69%|██████▉   | 222/321 [20:15<08:47, 5.33s/it] {'loss': 0.0263, 'grad_norm': 0.11594852060079575, 'learning_rate': 4.5974551194569336e-06, 'kl': 0.1611, 'entropy': 0.0625, 'ce_loss': 0.0488, 'epoch': 2.06}
69%|██████▉   | 223/321 [20:20<08:46, 5.37s/it] {'loss': 0.0222, 'grad_norm': 0.1573036015033722, 'learning_rate': 4.51272720552223e-06, 'kl': 0.0776, 'entropy': 0.0571, 'ce_loss': 0.0336, 'epoch': 2.07}
70%|██████▉   | 224/321 [20:26<08:42, 5.39s/it] {'loss': 0.0271, 'grad_norm': 0.18170951306819916, 'learning_rate': 4.4285592191234125e-06, 'kl': 0.207, 'entropy': 0.0024, 'ce_loss': 0.0248, 'epoch': 2.07}
70%|███████   | 225/321 [20:31<08:39, 5.41s/it] {'loss': 0.0288, 'grad_norm': 0.15239432454109192, 'learning_rate': 4.344959748857215e-06, 'kl': 0.1787, 'entropy': -0.0063, 'ce_loss': 0.0277, 'epoch': 2.08}
70%|███████   | 226/321 [20:37<08:35, 5.43s/it] {'loss': 0.0273, 'grad_norm': 0.16316252946853638, 'learning_rate': 4.261937325308347e-06, 'kl': 0.168, 'entropy': -0.0201, 'ce_loss': 0.0207, 'epoch': 2.09}
71%|███████   | 227/321 [20:42<08:32, 5.45s/it] {'loss': 0.0237, 'grad_norm': 0.13521708548069, 'learning_rate': 4.179500420179011e-06, 'kl': 0.0889, 'entropy': 0.1289, 'ce_loss': 0.0683, 'epoch': 2.1}
71%|███████   | 228/321 [20:48<08:27, 5.46s/it] {'loss': 0.0273, 'grad_norm': 0.1312531679868698, 'learning_rate': 4.097657445424454e-06, 'kl': 0.2285, 'entropy': -0.033, 'ce_loss': 0.0156, 'epoch': 2.11}
71%|███████▏  | 229/321 [20:53<08:22, 5.46s/it] {'loss': 0.026, 'grad_norm': 0.16218417882919312, 'learning_rate': 4.016416752394591e-06, 'kl': 0.0016, 'entropy': 0.054, 'ce_loss': 0.053, 'epoch': 2.12}
72%|███████▏  | 230/321 [20:59<08:17, 5.46s/it] {'loss': 0.0193, 'grad_norm': 0.12537582218647003, 'learning_rate': 3.935786630981819e-06, 'kl': -0.0422, 'entropy': 0.0757, 'ce_loss': 0.0789, 'epoch': 2.13}
72%|███████▏  | 231/321 [21:04<08:12, 5.47s/it] {'loss': 0.0231, 'grad_norm': 0.08550359308719635, 'learning_rate': 3.8557753087751345e-06, 'kl': 0.0801, 'entropy': 0.0505, 'ce_loss': 0.0532, 'epoch': 2.14}
72%|███████▏  | 232/321 [21:10<08:05, 5.46s/it] {'loss': 0.0317, 'grad_norm': 0.17949357628822327, 'learning_rate': 3.776390950220544e-06, 'kl': 0.2461, 'entropy': 0.0178, 'ce_loss': 0.0316, 'epoch': 2.15}
73%|███████▎  | 233/321 [21:15<07:59, 5.45s/it] {'loss': 0.0254, 'grad_norm': 0.13897587358951569, 'learning_rate': 3.6976416557879757e-06, 'kl': 0.0066, 'entropy': 0.1094, 'ce_loss': 0.0907, 'epoch': 2.16}
73%|███████▎  | 234/321 [21:20<07:53, 5.44s/it] {'loss': 0.0291, 'grad_norm': 0.1302442103624344, 'learning_rate': 3.6195354611447033e-06, 'kl': 0.1367, 'entropy': 0.0645, 'ce_loss': 0.0377, 'epoch': 2.17}
73%|███████▎  | 235/321 [21:26<07:47, 5.44s/it] {'loss': 0.0215, 'grad_norm': 0.1322391927242279, 'learning_rate': 3.5420803363353604e-06, 'kl': 0.0352, 'entropy': 0.1299, 'ce_loss': 0.0815, 'epoch': 2.18}
74%|███████▎  | 236/321 [21:31<07:43, 5.45s/it] {'loss': 0.0219, 'grad_norm': 0.1335344910621643, 'learning_rate': 3.465284184968679e-06, 'kl': 0.0057, 'entropy': 0.0513, 'ce_loss': 0.0579, 'epoch': 2.19}
74%|███████▍  | 237/321 [21:37<07:39, 5.47s/it] {'loss': 0.0223, 'grad_norm': 0.10451359301805496, 'learning_rate': 3.3891548434109942e-06, 'kl': 0.1138, 'entropy': 0.0156, 'ce_loss': 0.0272, 'epoch': 2.2}
74%|███████▍  | 238/321 [21:42<07:34, 5.48s/it] {'loss': 0.0264, 'grad_norm': 0.14136981964111328, 'learning_rate': 3.3137000799866148e-06, 'kl': 0.0347, 'entropy': 0.0366, 'ce_loss': 0.0225, 'epoch': 2.21}
74%|███████▍  | 239/321 [21:48<07:28, 5.47s/it] {'loss': 0.0252, 'grad_norm': 0.13739950954914093, 'learning_rate': 3.238927594185127e-06, 'kl': 0.1436, 'entropy': 0.0786, 'ce_loss': 0.0474, 'epoch': 2.21}
75%|███████▍  | 240/321 [21:53<07:23, 5.47s/it] {'loss': 0.025, 'grad_norm': 0.148906409740448, 'learning_rate': 3.1648450158757373e-06, 'kl': 0.2559, 'entropy': 0.0135, 'ce_loss': 0.0388, 'epoch': 2.22}
75%|███████▌  | 241/321 [21:59<07:18, 5.49s/it] {'loss': 0.023, 'grad_norm': 0.11499208956956863, 'learning_rate': 3.0914599045287165e-06, 'kl': 0.084, 'entropy': 0.0747, 'ce_loss': 0.0671, 'epoch': 2.23}
75%|███████▌  | 242/321 [22:04<07:11, 5.46s/it] {'loss': 0.019, 'grad_norm': 0.08660584688186646, 'learning_rate': 3.018779748444005e-06, 'kl': 0.1416, 'entropy': 0.0276, 'ce_loss': 0.0296, 'epoch': 2.24}
76%|███████▌  | 243/321 [22:10<07:07, 5.48s/it] {'loss': 0.0233, 'grad_norm': 0.11674519628286362, 'learning_rate': 2.9468119639871163e-06, 'kl': 0.053, 'entropy': 0.1094, 'ce_loss': 0.0716, 'epoch': 2.25}
76%|███████▌  | 244/321 [22:15<07:00, 5.46s/it] {'loss': 0.0216, 'grad_norm': 0.12600378692150116, 'learning_rate': 2.8755638948323494e-06, 'kl': -0.0245, 'entropy': 0.1113, 'ce_loss': 0.0864, 'epoch': 2.26}
76%|███████▋  | 245/321 [22:21<06:54, 5.46s/it] {'loss': 0.0283, 'grad_norm': 0.10699094086885452, 'learning_rate': 2.8050428112134474e-06, 'kl': 0.168, 'entropy': 0.0168, 'ce_loss': 0.0169, 'epoch': 2.27}
77%|███████▋  | 246/321 [22:26<06:50, 5.47s/it] {'loss': 0.0295, 'grad_norm': 0.1837424486875534, 'learning_rate': 2.735255909181719e-06, 'kl': 0.2188, 'entropy': 0.0093, 'ce_loss': 0.0258, 'epoch': 2.28}
77%|███████▋  | 247/321 [22:32<06:44, 5.47s/it] {'loss': 0.0221, 'grad_norm': 0.1188495010137558, 'learning_rate': 2.6662103098717485e-06, 'kl': 0.2012, 'entropy': 0.0083, 'ce_loss': 0.037, 'epoch': 2.29}
77%|███████▋  | 248/321 [22:37<06:39, 5.47s/it] {'loss': 0.0305, 'grad_norm': 0.1437922865152359, 'learning_rate': 2.597913058774758e-06, 'kl': 0.0422, 'entropy': 0.0796, 'ce_loss': 0.0395, 'epoch': 2.3}
78%|███████▊  | 249/321 [22:43<06:35, 5.50s/it] {'loss': 0.0261, 'grad_norm': 0.13972756266593933, 'learning_rate': 2.530371125019664e-06, 'kl': 0.126, 'entropy': -0.0011, 'ce_loss': 0.0175, 'epoch': 2.31}
78%|███████▊  | 250/321 [22:48<06:29, 5.49s/it] {'loss': 0.0244, 'grad_norm': 0.14524739980697632, 'learning_rate': 2.4635914006619454e-06, 'kl': -0.011, 'entropy': 0.1221, 'ce_loss': 0.0832, 'epoch': 2.32}
78%|███████▊  | 251/321 [22:54<06:24, 5.50s/it] {'loss': 0.0215, 'grad_norm': 0.09333452582359314, 'learning_rate': 2.3975806999803717e-06, 'kl': 0.0096, 'entropy': 0.103, 'ce_loss': 0.0629, 'epoch': 2.33}
79%|███████▊  | 252/321 [22:59<06:18, 5.49s/it] {'loss': 0.0179, 'grad_norm': 0.12270327657461166, 'learning_rate': 2.33234575878167e-06, 'kl': -0.0493, 'entropy': 0.166, 'ce_loss': 0.1019, 'epoch': 2.34}
79%|███████▉  | 253/321 [23:05<06:12, 5.48s/it] {'loss': 0.026, 'grad_norm': 0.0858650952577591, 'learning_rate': 2.267893233713182e-06, 'kl': 0.166, 'entropy': 0.0181, 'ce_loss': 0.0212, 'epoch': 2.34}
79%|███████▉  | 254/321 [23:10<06:08, 5.50s/it] {'loss': 0.026, 'grad_norm': 0.14108358323574066, 'learning_rate': 2.204229701583621e-06, 'kl': 0.0007, 'entropy': 0.1206, 'ce_loss': 0.0768, 'epoch': 2.35}
79%|███████▉  | 255/321 [23:16<06:03, 5.51s/it] {'loss': 0.0287, 'grad_norm': 0.13392361998558044, 'learning_rate': 2.141361658691975e-06, 'kl': 0.1709, 'entropy': -0.0361, 'ce_loss': 0.0101, 'epoch': 2.36}
80%|███████▉  | 256/321 [23:21<05:57, 5.50s/it] {'loss': 0.0289, 'grad_norm': 0.1510702222585678, 'learning_rate': 2.0792955201646005e-06, 'kl': 0.0669, 'entropy': 0.1152, 'ce_loss': 0.0818, 'epoch': 2.37}
80%|████████  | 257/321 [23:27<05:52, 5.51s/it] {'loss': 0.0286, 'grad_norm': 0.15851591527462006, 'learning_rate': 2.018037619300628e-06, 'kl': 0.0654, 'entropy': 0.1631, 'ce_loss': 0.0799, 'epoch': 2.38}
80%|████████  | 258/321 [23:32<05:48, 5.53s/it] {'loss': 0.0239, 'grad_norm': 0.1374949961900711, 'learning_rate': 1.9575942069256914e-06, 'kl': 0.0591, 'entropy': 0.0176, 'ce_loss': 0.0434, 'epoch': 2.39}
81%|████████  | 259/321 [23:38<05:42, 5.52s/it] {'loss': 0.0259, 'grad_norm': 0.13825078308582306, 'learning_rate': 1.8979714507541103e-06, 'kl': 0.0757, 'entropy': 0.0447, 'ce_loss': 0.0431, 'epoch': 2.4}
81%|████████  | 260/321 [23:43<05:35, 5.51s/it] {'loss': 0.0272, 'grad_norm': 0.15970517694950104, 'learning_rate': 1.839175434759507e-06, 'kl': 0.014, 'entropy': 0.1191, 'ce_loss': 0.0785, 'epoch': 2.41}
81%|████████▏ | 261/321 [23:49<05:29, 5.49s/it] {'loss': 0.0221, 'grad_norm': 0.10772485285997391, 'learning_rate': 1.7812121585539964e-06, 'kl': 0.2246, 'entropy': -0.0183, 'ce_loss': 0.0161, 'epoch': 2.42}
82%|████████▏ | 262/321 [23:54<05:23, 5.47s/it] {'loss': 0.0228, 'grad_norm': 0.12323292344808578, 'learning_rate': 1.7240875367759902e-06, 'kl': 0.0143, 'entropy': 0.1367, 'ce_loss': 0.0932, 'epoch': 2.43}
82%|████████▏ | 263/321 [23:59<05:16, 5.46s/it] {'loss': 0.0217, 'grad_norm': 0.13626284897327423, 'learning_rate': 1.6678073984866438e-06, 'kl': 0.0046, 'entropy': 0.0884, 'ce_loss': 0.0481, 'epoch': 2.44}
82%|████████▏ | 264/321 [24:05<05:11, 5.46s/it] {'loss': 0.0277, 'grad_norm': 0.11715082824230194, 'learning_rate': 1.6123774865750607e-06, 'kl': 0.1963, 'entropy': -0.0159, 'ce_loss': 0.0254, 'epoch': 2.45}
83%|████████▎ | 265/321 [24:10<05:06, 5.47s/it] {'loss': 0.0261, 'grad_norm': 0.15604187548160553, 'learning_rate': 1.5578034571722879e-06, 'kl': 0.1611, 'entropy': 0.0189, 'ce_loss': 0.0334, 'epoch': 2.46}
83%|████████▎ | 266/321 [24:16<05:00, 5.47s/it] {'loss': 0.0246, 'grad_norm': 0.1404842883348465, 'learning_rate': 1.5040908790741448e-06, 'kl': 0.1768, 'entropy': -0.0332, 'ce_loss': 0.0108, 'epoch': 2.47}
83%|████████▎ | 267/321 [24:21<04:57, 5.50s/it] {'loss': 0.0269, 'grad_norm': 0.14500805735588074, 'learning_rate': 1.4512452331729864e-06, 'kl': -0.0221, 'entropy': 0.1216, 'ce_loss': 0.0841, 'epoch': 2.48}
83%|████████▎ | 268/321 [24:27<04:51, 5.51s/it] {'loss': 0.0194, 'grad_norm': 0.10871770232915878, 'learning_rate': 1.3992719118984167e-06, 'kl': 0.0962, 'entropy': 0.017, 'ce_loss': 0.0248, 'epoch': 2.48}
84%|████████▍ | 269/321 [24:32<04:45, 5.49s/it] {'loss': 0.0251, 'grad_norm': 0.10633812099695206, 'learning_rate': 1.3481762186670556e-06, 'kl': 0.1523, 'entropy': -0.0044, 'ce_loss': 0.0175, 'epoch': 2.49}
84%|████████▍ | 270/321 [24:38<04:38, 5.46s/it] {'loss': 0.0232, 'grad_norm': 0.12149304151535034, 'learning_rate': 1.2979633673413571e-06, 'kl': 0.0952, 'entropy': 0.0371, 'ce_loss': 0.044, 'epoch': 2.5}
84%|████████▍ | 271/321 [24:43<04:32, 5.45s/it] {'loss': 0.02, 'grad_norm': 0.1349457949399948, 'learning_rate': 1.248638481697586e-06, 'kl': 0.0452, 'entropy': 0.1367, 'ce_loss': 0.0725, 'epoch': 2.51}
85%|████████▍ | 272/321 [24:49<04:27, 5.47s/it] {'loss': 0.0201, 'grad_norm': 0.11620452255010605, 'learning_rate': 1.2002065949029896e-06, 'kl': -0.0216, 'entropy': 0.106, 'ce_loss': 0.0581, 'epoch': 2.52}
85%|████████▌ | 273/321 [24:54<04:22, 5.47s/it] {'loss': 0.0254, 'grad_norm': 0.09986640512943268, 'learning_rate': 1.15267264900219e-06, 'kl': 0.0452, 'entropy': 0.0723, 'ce_loss': 0.0518, 'epoch': 2.53}
85%|████████▌ | 274/321 [25:00<04:17, 5.48s/it] {'loss': 0.025, 'grad_norm': 0.16179873049259186, 'learning_rate': 1.1060414944129106e-06, 'kl': 0.0488, 'entropy': 0.1445, 'ce_loss': 0.0931, 'epoch': 2.54}
86%|████████▌ | 275/321 [25:05<04:11, 5.47s/it] {'loss': 0.0233, 'grad_norm': 0.11301198601722717, 'learning_rate': 1.0603178894310185e-06, 'kl': 0.0815, 'entropy': 0.0598, 'ce_loss': 0.0472, 'epoch': 2.55}
86%|████████▌ | 276/321 [25:11<04:06, 5.48s/it] {'loss': 0.0308, 'grad_norm': 0.13731074333190918, 'learning_rate': 1.0155064997450026e-06, 'kl': 0.1147, 'entropy': 0.084, 'ce_loss': 0.0393, 'epoch': 2.56}
86%|████████▋ | 277/321 [25:16<04:00, 5.46s/it] {'loss': 0.0297, 'grad_norm': 0.21143203973770142, 'learning_rate': 9.716118979598533e-07, 'kl': 0.1777, 'entropy': 0.0203, 'ce_loss': 0.0453, 'epoch': 2.57}
87%|████████▋ | 278/321 [25:22<03:54, 5.46s/it] {'loss': 0.024, 'grad_norm': 0.12256018817424774, 'learning_rate': 9.286385631304939e-07, 'kl': 0.1738, 'entropy': 0.006, 'ce_loss': 0.0374, 'epoch': 2.58}
87%|████████▋ | 279/321 [25:27<03:48, 5.45s/it] {'loss': 0.03, 'grad_norm': 0.11462045460939407, 'learning_rate': 8.865908803047241e-07, 'kl': 0.0493, 'entropy': 0.0952, 'ce_loss': 0.0606, 'epoch': 2.59}
87%|████████▋ | 280/321 [25:33<03:44, 5.49s/it] {'loss': 0.0301, 'grad_norm': 0.16785408556461334, 'learning_rate': 8.454731400757599e-07, 'kl': 0.1279, 'entropy': 0.0143, 'ce_loss': 0.023, 'epoch': 2.6}
88%|████████▊ | 281/321 [25:38<03:39, 5.48s/it] {'loss': 0.0232, 'grad_norm': 0.1566372960805893, 'learning_rate': 8.052895381444226e-07, 'kl': 0.1226, 'entropy': 0.0522, 'ce_loss': 0.0394, 'epoch': 2.61}
88%|████████▊ | 282/321 [25:43<03:33, 5.46s/it] {'loss': 0.0247, 'grad_norm': 0.1414063423871994, 'learning_rate': 7.660441748909997e-07, 'kl': -0.0349, 'entropy': 0.1089, 'ce_loss': 0.0876, 'epoch': 2.62}
88%|████████▊ | 283/321 [25:49<03:27, 5.45s/it] {'loss': 0.0283, 'grad_norm': 0.1220989003777504, 'learning_rate': 7.277410549568476e-07, 'kl': 0.1699, 'entropy': 0.04, 'ce_loss': 0.0445, 'epoch': 2.62}
88%|████████▊ | 284/321 [25:54<03:21, 5.45s/it] {'loss': 0.022, 'grad_norm': 0.15742984414100647, 'learning_rate': 6.903840868357382e-07, 'kl': -0.0177, 'entropy': 0.1514, 'ce_loss': 0.087, 'epoch': 2.63}
89%|████████▉ | 285/321 [26:00<03:16, 5.45s/it] {'loss': 0.0236, 'grad_norm': 0.10442278534173965, 'learning_rate': 6.539770824750447e-07, 'kl': 0.0131, 'entropy': 0.0977, 'ce_loss': 0.0688, 'epoch': 2.64}
89%|████████▉ | 286/321 [26:05<03:11, 5.46s/it] {'loss': 0.0181, 'grad_norm': 0.11422933638095856, 'learning_rate': 6.185237568867597e-07, 'kl': -0.0327, 'entropy': 0.1719, 'ce_loss': 0.0929, 'epoch': 2.65}
89%|████████▉ | 287/321 [26:11<03:06, 5.48s/it] {'loss': 0.0256, 'grad_norm': 0.1048831194639206, 'learning_rate': 5.840277277684136e-07, 'kl': 0.0547, 'entropy': 0.0996, 'ce_loss': 0.06, 'epoch': 2.66}
90%|████████▉ | 288/321 [26:16<03:01, 5.49s/it] {'loss': 0.0276, 'grad_norm': 0.1242062970995903, 'learning_rate': 5.504925151339191e-07, 'kl': 0.05, 'entropy': 0.0664, 'ce_loss': 0.0539, 'epoch': 2.67}
90%|█████████ | 289/321 [26:22<02:54, 5.46s/it] {'loss': 0.0219, 'grad_norm': 0.14201240241527557, 'learning_rate': 5.179215409543848e-07, 'kl': 0.0869, 'entropy': 0.0752, 'ce_loss': 0.0568, 'epoch': 2.68}
90%|█████████ | 290/321 [26:27<02:49, 5.45s/it] {'loss': 0.0253, 'grad_norm': 0.15880657732486725, 'learning_rate': 4.863181288089391e-07, 'kl': -0.025, 'entropy': 0.0996, 'ce_loss': 0.0623, 'epoch': 2.69}
91%|█████████ | 291/321 [26:33<02:43, 5.46s/it] {'loss': 0.0247, 'grad_norm': 0.11968444287776947, 'learning_rate': 4.556855035455787e-07, 'kl': 0.1699, 'entropy': 0.0034, 'ce_loss': 0.0265, 'epoch': 2.7}
91%|█████████ | 292/321 [26:38<02:37, 5.44s/it] {'loss': 0.0378, 'grad_norm': 0.13923847675323486, 'learning_rate': 4.2602679095210766e-07, 'kl': 0.2217, 'entropy': -0.019, 'ce_loss': 0.0313, 'epoch': 2.71}
91%|█████████▏| 293/321 [26:43<02:32, 5.44s/it] {'loss': 0.0263, 'grad_norm': 0.2090776115655899, 'learning_rate': 3.9734501743717956e-07, 'kl': 0.1104, 'entropy': 0.0003, 'ce_loss': 0.0207, 'epoch': 2.72}
92%|█████████▏| 294/321 [26:49<02:27, 5.48s/it] {'loss': 0.0244, 'grad_norm': 0.1359066516160965, 'learning_rate': 3.696431097214748e-07, 'kl': 0.1885, 'entropy': -0.0079, 'ce_loss': 0.0201, 'epoch': 2.73}
92%|█████████▏| 295/321 [26:54<02:22, 5.48s/it] {'loss': 0.0263, 'grad_norm': 0.13609451055526733, 'learning_rate': 3.429238945390556e-07, 'kl': 0.0571, 'entropy': 0.0801, 'ce_loss': 0.0708, 'epoch': 2.74}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 295/321 [26:54<02:22, 5.48s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 296/321 [27:00<02:16, 5.47s/it] {'loss': 0.0205, 'grad_norm': 0.1158941388130188, 'learning_rate': 3.171900983489273e-07, 'kl': 0.0386, 'entropy': 0.1406, 'ce_loss': 0.0776, 'epoch': 2.75}
92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 296/321 [27:00<02:16, 5.47s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 297/321 [27:05<02:11, 5.46s/it] {'loss': 0.0264, 'grad_norm': 0.11173633486032486, 'learning_rate': 2.9244434705682276e-07, 'kl': 0.1836, 'entropy': 0.0081, 'ce_loss': 0.0369, 'epoch': 2.76}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 297/321 [27:05<02:11, 5.46s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 298/321 [27:11<02:05, 5.45s/it] {'loss': 0.0233, 'grad_norm': 0.14700034260749817, 'learning_rate': 2.6868916574725347e-07, 'kl': 0.1787, 'entropy': 0.0089, 'ce_loss': 0.033, 'epoch': 2.76}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 298/321 [27:11<02:05, 5.45s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 299/321 [27:16<01:59, 5.44s/it] {'loss': 0.0253, 'grad_norm': 0.11715128272771835, 'learning_rate': 2.459269784258467e-07, 'kl': 0.1318, 'entropy': -0.0114, 'ce_loss': 0.0168, 'epoch': 2.77}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 299/321 [27:16<01:59, 5.44s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 300/321 [27:22<01:54, 5.43s/it] {'loss': 0.0222, 'grad_norm': 0.11396218836307526, 'learning_rate': 2.2416010777199904e-07, 'kl': 0.0532, 'entropy': 0.0427, 'ce_loss': 0.0435, 'epoch': 2.78}
93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 300/321 [27:22<01:54, 5.43s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 301/321 [27:27<01:48, 5.44s/it] {'loss': 0.0283, 'grad_norm': 0.14802348613739014, 'learning_rate': 2.0339077490186488e-07, 'kl': -0.0006, 'entropy': 0.1069, 'ce_loss': 0.0683, 'epoch': 2.79}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 301/321 [27:27<01:48, 5.44s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 302/321 [27:32<01:43, 5.43s/it] {'loss': 0.0278, 'grad_norm': 0.1581055223941803, 'learning_rate': 1.83621099141712e-07, 'kl': 0.1235, 'entropy': 0.0522, 'ce_loss': 0.0377, 'epoch': 2.8}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 302/321 [27:32<01:43, 5.43s/it] 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 303/321 [27:38<01:37, 5.44s/it] {'loss': 0.0254, 'grad_norm': 0.1633712649345398, 'learning_rate': 1.648530978116658e-07, 'kl': -0.0391, 'entropy': 0.085, 'ce_loss': 0.0629, 'epoch': 2.81}
94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 303/321 [27:38<01:37, 5.44s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 304/321 [27:43<01:33, 5.47s/it] {'loss': 0.0189, 'grad_norm': 0.13110966980457306, 'learning_rate': 1.4708868601985503e-07, 'kl': 0.1309, 'entropy': 0.0378, 'ce_loss': 0.0404, 'epoch': 2.82}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 304/321 [27:43<01:33, 5.47s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 305/321 [27:49<01:27, 5.48s/it] {'loss': 0.0218, 'grad_norm': 0.1552368849515915, 'learning_rate': 1.303296764669959e-07, 'kl': -0.0381, 'entropy': 0.1514, 'ce_loss': 0.0775, 'epoch': 2.83}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 305/321 [27:49<01:27, 5.48s/it] 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 306/321 [27:54<01:21, 5.46s/it] {'loss': 0.0259, 'grad_norm': 0.09103472530841827, 'learning_rate': 1.1457777926141889e-07, 'kl': 0.1006, 'entropy': 0.0454, 'ce_loss': 0.045, 'epoch': 2.84}
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 306/321 [27:54<01:21, 5.46s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 307/321 [28:00<01:16, 5.46s/it] {'loss': 0.0249, 'grad_norm': 0.15214498341083527, 'learning_rate': 9.98346017445706e-08, 'kl': -0.0483, 'entropy': 0.1377, 'ce_loss': 0.089, 'epoch': 2.85}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 307/321 [28:00<01:16, 5.46s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 308/321 [28:05<01:11, 5.47s/it] {'loss': 0.0269, 'grad_norm': 0.15390083193778992, 'learning_rate': 8.610164832699608e-08, 'kl': 0.1196, 'entropy': 0.025, 'ce_loss': 0.0234, 'epoch': 2.86}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 308/321 [28:05<01:11, 5.47s/it] 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 309/321 [28:11<01:06, 5.50s/it] {'loss': 0.0234, 'grad_norm': 0.1400686502456665, 'learning_rate': 7.338032033482712e-08, 'kl': 0.1797, 'entropy': -0.0146, 'ce_loss': 0.0134, 'epoch': 2.87}
96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 309/321 [28:11<01:06, 5.50s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 310/321 [28:16<01:00, 5.51s/it] {'loss': 0.0266, 'grad_norm': 0.15008477866649628, 'learning_rate': 6.167191586679556e-08, 'kl': -0.0147, 'entropy': 0.1729, 'ce_loss': 0.093, 'epoch': 2.88}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 310/321 [28:16<01:00, 5.51s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 311/321 [28:22<00:55, 5.50s/it] {'loss': 0.0212, 'grad_norm': 0.09048085659742355, 'learning_rate': 5.097762966176256e-08, 'kl': -0.0322, 'entropy': 0.1455, 'ce_loss': 0.089, 'epoch': 2.89}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 311/321 [28:22<00:55, 5.50s/it] 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 312/321 [28:27<00:49, 5.50s/it] {'loss': 0.0269, 'grad_norm': 0.13809086382389069, 'learning_rate': 4.129855297681618e-08, 'kl': 0.0903, 'entropy': 0.003, 'ce_loss': 0.0238, 'epoch': 2.9}
97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 312/321 [28:27<00:49, 5.50s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 313/321 [28:33<00:43, 5.47s/it] {'loss': 0.0279, 'grad_norm': 0.16971167922019958, 'learning_rate': 3.2635673475910345e-08, 'kl': -0.0371, 'entropy': 0.1758, 'ce_loss': 0.0989, 'epoch': 2.9}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 313/321 [28:33<00:43, 5.47s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 314/321 [28:38<00:38, 5.47s/it] {'loss': 0.0271, 'grad_norm': 0.11413148045539856, 'learning_rate': 2.4989875129091124e-08, 'kl': 0.1738, 'entropy': 0.0215, 'ce_loss': 0.0363, 'epoch': 2.91}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 314/321 [28:38<00:38, 5.47s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 315/321 [28:44<00:32, 5.46s/it] {'loss': 0.0239, 'grad_norm': 0.1775045543909073, 'learning_rate': 1.8361938122287704e-08, 'kl': -0.0293, 'entropy': 0.1729, 'ce_loss': 0.0902, 'epoch': 2.92}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 315/321 [28:44<00:32, 5.46s/it] 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 316/321 [28:49<00:27, 5.47s/it] {'loss': 0.0245, 'grad_norm': 0.1191033273935318, 'learning_rate': 1.2752538777704993e-08, 'kl': 0.0747, 'entropy': 0.0786, 'ce_loss': 0.0517, 'epoch': 2.93}
98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 316/321 [28:49<00:27, 5.47s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 317/321 [28:55<00:21, 5.44s/it] {'loss': 0.0234, 'grad_norm': 0.09175986796617508, 'learning_rate': 8.162249484809926e-09, 'kl': 0.1289, 'entropy': 0.0938, 'ce_loss': 0.0597, 'epoch': 2.94}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 317/321 [28:55<00:21, 5.44s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 318/321 [29:00<00:16, 5.44s/it] {'loss': 0.0248, 'grad_norm': 0.13051749765872955, 'learning_rate': 4.591538641927074e-09, 'kl': 0.033, 'entropy': 0.1641, 'ce_loss': 0.073, 'epoch': 2.95}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 318/321 [29:00<00:16, 5.44s/it] 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 319/321 [29:05<00:10, 5.43s/it] {'loss': 0.0265, 'grad_norm': 0.12332940846681595, 'learning_rate': 2.0407706084368816e-09, 'kl': 0.0703, 'entropy': 0.0957, 'ce_loss': 0.0564, 'epoch': 2.96}
99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 319/321 [29:05<00:10, 5.43s/it] 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 320/321 [29:11<00:05, 5.45s/it] {'loss': 0.0271, 'grad_norm': 0.1580265313386917, 'learning_rate': 5.102056675998501e-10, 'kl': 0.1089, 'entropy': 0.0615, 'ce_loss': 0.044, 'epoch': 2.97}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 320/321 [29:11<00:05, 5.45s/it] 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 321/321 [29:16<00:00, 5.44s/it] {'loss': 0.0351, 'grad_norm': 0.14408810436725616, 'learning_rate': 0.0, 'kl': 0.0928, 'entropy': 0.1328, 'ce_loss': 0.0641, 'epoch': 2.98}
[INFO|trainer.py:2665] 2025-04-10 17:23:54,853 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1756.865, 'train_samples_per_second': 5.854, 'train_steps_per_second': 0.183, 'train_loss': 0.04484545481840955, 'epoch': 2.98}
100%|██████████| 321/321 [29:16<00:00, 5.47s/it]
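Note on the logged `learning_rate` values: they appear consistent with a linear-warmup cosine schedule using the launch arguments (`--learning_rate 2e-5`, `--warmup_ratio 0.03`, `--lr_scheduler_type cosine`) over 321 optimizer steps. The sketch below is a reconstruction, not code from `train.py`; the function name `cosine_lr` and the assumption that warmup lasts `ceil(0.03 * 321) = 10` steps are illustrative.

```python
import math

def cosine_lr(step, base_lr=2e-5, total_steps=321, warmup_ratio=0.03):
    """Hypothetical helper: linear warmup followed by half-cosine decay to zero."""
    # Assumption: warmup length is the ratio times total steps, rounded up (10 here).
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Values copied from the log lines for steps 282, 300, 320 and 321.
logged = {
    282: 7.660441748909997e-07,
    300: 2.2416010777199904e-07,
    320: 5.102056675998501e-10,
    321: 0.0,
}
for step, lr in logged.items():
    print(step, cosine_lr(step), lr)
```

If the reconstruction holds, each printed pair should agree to several significant digits, and the schedule reaches exactly zero at the final step.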
[INFO|trainer.py:3966] 2025-04-10 17:24:03,430 >> Saving model checkpoint to /home/stern/GRPO/offline_rl_v2/output
[INFO|configuration_utils.py:423] 2025-04-10 17:24:03,433 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/config.json
[INFO|configuration_utils.py:908] 2025-04-10 17:24:03,433 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/generation_config.json
[2025-04-10 17:24:05,933] [INFO] [launch.py:351:main] Process 501941 exits successfully.
[2025-04-10 17:24:06,934] [INFO] [launch.py:351:main] Process 501942 exits successfully.
[2025-04-10 17:24:07,935] [INFO] [launch.py:351:main] Process 501943 exits successfully.
[2025-04-10 17:24:07,935] [INFO] [launch.py:351:main] Process 501946 exits successfully.
[2025-04-10 17:24:07,936] [INFO] [launch.py:351:main] Process 501944 exits successfully.
[2025-04-10 17:24:08,937] [INFO] [launch.py:351:main] Process 501945 exits successfully.
[2025-04-10 17:24:08,937] [INFO] [launch.py:351:main] Process 501940 exits successfully.
[INFO|modeling_utils.py:3594] 2025-04-10 17:24:18,916 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /home/stern/GRPO/offline_rl_v2/output/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-04-10 17:24:18,917 >> tokenizer config file saved in /home/stern/GRPO/offline_rl_v2/output/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-04-10 17:24:18,917 >> Special tokens file saved in /home/stern/GRPO/offline_rl_v2/output/special_tokens_map.json
***** train metrics *****
epoch = 2.979
total_flos = 3318202242GF
train_loss = 0.0448
train_runtime = 0:29:16.86
train_samples = 3428
train_samples_per_second = 5.854
train_steps_per_second = 0.183
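Note on the summary metrics: the throughput figures can be cross-checked from the other values in this log. The sketch below uses only numbers printed above (3428 samples, 3 epochs, 321 steps, 1756.865 s, 8 processes, per-device batch size 1, gradient accumulation 4). How the step count is derived (batches per device rounded up, then integer-divided by the accumulation factor) is an assumption about the trainer, and all variable names are illustrative.

```python
import math

train_samples = 3428
num_epochs = 3
total_steps = 321
runtime_s = 1756.865
world_size, per_device_batch, grad_accum = 8, 1, 4

# Samples consumed per optimizer step across all ranks.
effective_batch = world_size * per_device_batch * grad_accum

# Assumed step accounting: per-device batches per epoch, then full accumulation groups.
batches_per_device = math.ceil(train_samples / (world_size * per_device_batch))
steps_per_epoch = batches_per_device // grad_accum
expected_steps = num_epochs * steps_per_epoch

samples_per_second = train_samples * num_epochs / runtime_s
steps_per_second = total_steps / runtime_s
minutes, seconds = divmod(runtime_s, 60)

print(effective_batch, steps_per_epoch, expected_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(f"{int(minutes)} min {seconds:.3f} s")
```

Under these assumptions the step count comes out to 321 and the two rates round to the reported 5.854 and 0.183.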
[rank0]:[W410 17:24:19.182384317 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[2025-04-10 17:24:21,950] [INFO] [launch.py:351:main] Process 501939 exits successfully.