[2025-04-13 14:56:48,290] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:50,327] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. Detected VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7 [2025-04-13 14:56:50,327] [INFO] [runner.py:605:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed scripts/newzero3.json --seed 42 --model_name_or_path /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl --output_dir /home/stern/GRPO/offline_rl_v2/output --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy no --learning_rate 2e-6 --lr_scheduler_type cosine --save_only_model True --remove_unused_columns False --warmup_ratio 0.03 --num_train_epochs 4 --logging_steps 1 --report_to tensorboard --gradient_checkpointing True --overwrite_output_dir --bf16 True [2025-04-13 14:56:51,788] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:53,706] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2025-04-13 14:56:53,707] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0 [2025-04-13 14:56:53,707] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2025-04-13 14:56:53,707] [INFO] [launch.py:164:main] dist_world_size=8 [2025-04-13 14:56:53,707] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2025-04-13 14:56:53,707] [INFO] [launch.py:256:main] process 1025161 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', 
'--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,708] [INFO] [launch.py:256:main] process 1025162 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,708] [INFO] [launch.py:256:main] process 1025163 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', 
'/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,708] [INFO] [launch.py:256:main] process 1025164 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,709] [INFO] [launch.py:256:main] process 1025165 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 
'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,709] [INFO] [launch.py:256:main] process 1025166 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:53,710] [INFO] [launch.py:256:main] process 1025167 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 
14:56:53,710] [INFO] [launch.py:256:main] process 1025168 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--deepspeed', 'scripts/newzero3.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/32K_neg_tokenized.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '4', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] [2025-04-13 14:56:57,613] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,352] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,555] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,618] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,666] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,692] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,696] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-04-13 14:56:58,702] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 
Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,288] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,342] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,660] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,718] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,764] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn( [2025-04-13 14:57:00,773] [INFO] [comm.py:658:init_distributed] cdb=None /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. 
Use `eval_strategy` instead
  warnings.warn(
[2025-04-13 14:57:00,778] [INFO] [comm.py:658:init_distributed] cdb=None
/home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-04-13 14:57:00,814] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-04-13 14:57:00,815] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING:__main__:Process rank: 7, device: cuda:7, n_gpu: 1
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1
[2025-04-13 14:57:02,243] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,246 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:02,258] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,260 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1 INFO:__main__:Training parameters CustomTrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=scripts/newzero3.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=None, eval_strategy=no, eval_use_gather_object=False, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, kl_coeff=0.0, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-06, length_column_name=length, load_best_model_at_end=False, 
local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/home/stern/GRPO/offline_rl_v2/output/runs/Apr13_14-57-00_nacamontrealdc1-p2r203n1.enovum.hivecloud.com, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=4.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=/home/stern/GRPO/offline_rl_v2/output, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=/home/stern/GRPO/offline_rl_v2/output, save_on_each_node=False, save_only_model=True, save_safetensors=True, save_steps=500, save_strategy=no, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.0, ) [INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file vocab.json [INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file merges.txt [INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> 
loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-04-13 14:57:02,409 >> loading file chat_template.jinja
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1
WARNING:__main__:Process rank: 5, device: cuda:5, n_gpu: 1
WARNING:__main__:Process rank: 6, device: cuda:6, n_gpu: 1
WARNING:__main__:Process rank: 4, device: cuda:4, n_gpu: 1
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1
[INFO|tokenization_utils_base.py:2323] 2025-04-13 14:57:02,684 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-04-13 14:57:02,684 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-04-13 14:57:02,686 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27648,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.50.3",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
[INFO|modeling_utils.py:1151] 2025-04-13 14:57:02,716 >> loading weights file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1225] 2025-04-13 14:57:02,717 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:2170] 2025-04-13 14:57:02,717 >> Instantiating Qwen2ForCausalLM model under default dtype
torch.bfloat16.
[INFO|modeling_utils.py:3747] 2025-04-13 14:57:02,717 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-04-13 14:57:02,717] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,720 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1139] 2025-04-13 14:57:02,726 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
[2025-04-13 14:57:02,862] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,865 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:02,870] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,872 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:02,910] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:02,912 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:03,005] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:03,008 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:03,009] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:329] 2025-04-13 14:57:03,011 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-04-13 14:57:24,786] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 0%| | 0/17 [00:00> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4995] 2025-04-13 14:57:37,816 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1092] 2025-04-13 14:57:37,821 >> loading configuration file /home/stern/GRPO/saved_models/Qwen2.5-32B-Instruct/generation_config.json
[INFO|configuration_utils.py:1139] 2025-04-13 14:57:37,821 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead.
  trainer = OfflineREINFORCETrainer(
/home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`.
Use `processing_class` instead. trainer = OfflineREINFORCETrainer( /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. trainer = OfflineREINFORCETrainer( /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. trainer = OfflineREINFORCETrainer( Using custom data configuration default-e96436450119f8e9 INFO:datasets.builder:Using custom data configuration default-e96436450119f8e9 Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json INFO:datasets.info:Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json Overwrite dataset info from restored data version if exists. INFO:datasets.builder:Overwrite dataset info from restored data version if exists. Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. 
trainer = OfflineREINFORCETrainer( Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092) INFO:datasets.builder:Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092) Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-e96436450119f8e9/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. trainer = OfflineREINFORCETrainer( /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. trainer = OfflineREINFORCETrainer( [INFO|trainer.py:748] 2025-04-13 14:57:38,146 >> Using auto half precision backend INFO:__main__:*** Train *** [INFO|deepspeed.py:386] 2025-04-13 14:57:38,377 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB) Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... 
Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.529125928878784 seconds Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.3708255290985107 seconds Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000002, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1 [2025-04-13 14:57:42,026] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown [2025-04-13 14:57:42,026] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.3084402084350586 seconds [2025-04-13 14:57:42,068] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2025-04-13 14:57:42,071] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2025-04-13 14:57:42,071] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module cpu_adam... Time to load cpu_adam op: 2.6082324981689453 seconds Loading extension module cpu_adam... Time to load cpu_adam op: 2.629338502883911 seconds Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... 
Time to load cpu_adam op: 2.327012538909912 seconds [2025-04-13 14:57:42,124] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam [2025-04-13 14:57:42,124] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type= [2025-04-13 14:57:42,124] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2025-04-13 14:57:42,124] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination Using /home/stern/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Emitting ninja build file /home/stern/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.439788341522217 seconds [2025-04-13 14:57:42,256] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning [2025-04-13 14:57:42,257] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 2.9 GB CA 0.0 GB Max_CA 3 GB [2025-04-13 14:57:42,257] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 140.14 GB, percent = 13.9% [2025-04-13 14:57:42,259] [INFO] [stage3.py:170:__init__] Reduce bucket size 100000000 [2025-04-13 14:57:42,259] [INFO] [stage3.py:171:__init__] Prefetch bucket size 100000000 Loading extension module cpu_adam... 
Time to load cpu_adam op: 2.5354325771331787 seconds
[2025-04-13 14:57:42,358] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-04-13 14:57:42,359] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:57:42,359] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 140.16 GB, percent = 13.9%
Parameter Offload: Total persistent parameters: 1119232 in 321 params
[2025-04-13 14:57:42,535] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-04-13 14:57:42,536] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:57:42,536] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 140.18 GB, percent = 13.9%
[2025-04-13 14:57:42,642] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-04-13 14:57:42,643] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:57:42,643] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 140.18 GB, percent = 13.9%
[2025-04-13 14:58:01,980] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 41
[2025-04-13 14:58:01,981] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:58:01,981] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 205.39 GB, percent = 20.4%
[2025-04-13 14:58:02,134] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-04-13 14:58:02,135] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:58:02,135] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 205.38 GB, percent = 20.4%
[2025-04-13 14:58:06,255] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-04-13 14:58:06,256] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:58:06,256] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 285.93 GB, percent = 28.4%
[2025-04-13 14:58:08,607] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-04-13 14:58:08,608] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:58:08,608] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 341.25 GB, percent = 33.9%
[2025-04-13 14:58:21,496] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-04-13 14:58:21,496] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-04-13 14:58:21,497] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 515.49 GB, percent = 51.2%
[2025-04-13 14:58:21,497] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor.
(Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
[WARNING|logging.py:329] 2025-04-13 14:58:27,659 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[WARNING|logging.py:329] 2025-04-13 14:58:27,660 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-13 14:58:27,662 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-13 14:58:27,663 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-13 14:58:27,663 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-13 14:58:27,664 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [WARNING|logging.py:329] 2025-04-13 14:58:27,666 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. [2025-04-13 14:58:27,751] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-04-13 14:58:27,751] [INFO] [utils.py:782:see_memory_usage] MA 0.19 GB Max_MA 3.09 GB CA 3.09 GB Max_CA 3 GB [2025-04-13 14:58:27,752] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 566.38 GB, percent = 56.2% [2025-04-13 14:58:27,752] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 [2025-04-13 14:58:27,752] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None [2025-04-13 14:58:27,752] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2025-04-13 14:58:27,752] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)] [2025-04-13 14:58:27,753] [INFO] [config.py:1000:print] DeepSpeedEngine configuration: [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, 
"profile": false } [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] amp_enabled .................. False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] amp_params ................... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] bfloat16_enabled ............. True [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] bfloat16_immediate_grad_update True [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] checkpoint_parallel_write_pipeline False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] checkpoint_tag_validation_enabled True [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] checkpoint_tag_validation_fail False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] comms_config ................. [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] communication_data_type ...... None [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] curriculum_enabled_legacy .... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] curriculum_params_legacy ..... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] data_efficiency_enabled ...... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] dataloader_drop_last ......... 
False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] disable_allgather ............ False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] dump_state ................... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] dynamic_loss_scale_args ...... None [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_enabled ........... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_gas_boundary_resolution 1 [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_layer_num ......... 0 [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_max_iter .......... 100 [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_stability ......... 1e-06 [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_tol ............... 0.01 [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] eigenvalue_verbose ........... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] elasticity_enabled ........... False [2025-04-13 14:58:27,754] [INFO] [config.py:1004:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] fp16_auto_cast ............... None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] fp16_enabled ................. False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] fp16_master_weights_and_gradients False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] global_rank .................. 0 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] grad_accum_dtype ............. None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] gradient_accumulation_steps .. 4 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] gradient_clipping ............ 
1.0 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] gradient_predivide_factor .... 1.0 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] graph_harvesting ............. False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] initial_dynamic_scale ........ 1 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] load_universal_checkpoint .... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] loss_scale ................... 1.0 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] memory_breakdown ............. False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] mics_hierarchial_params_gather False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] mics_shard_size .............. -1 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] optimizer_legacy_fusion ...... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] optimizer_name ............... None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] optimizer_params ............. 
None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] pld_enabled .................. False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] pld_params ................... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] prescale_gradients ........... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] scheduler_name ............... None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] scheduler_params ............. None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] seq_parallel_communication_data_type torch.float32 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] sparse_attention ............. None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] sparse_gradients_enabled ..... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] steps_per_print .............. inf [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] timers_config ................ enabled=True synchronized=True [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] train_batch_size ............. 32 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] train_micro_batch_size_per_gpu 1 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] use_data_before_expert_parallel_ False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] use_node_local_storage ....... False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] wall_clock_breakdown ......... 
False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] weight_quantization_config ... None [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] world_size ................... 8 [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] zero_allow_untested_optimizer True [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=100000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=100000000 max_reuse_distance=100000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] zero_enabled ................. True [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] zero_force_ds_cpu_optimizer .. 
True [2025-04-13 14:58:27,755] [INFO] [config.py:1004:print] zero_optimization_stage ...... 3 [2025-04-13 14:58:27,755] [INFO] [config.py:990:print_user_config] json = { "fp16": { "enabled": false }, "bf16": { "enabled": true }, "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 4, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+08, "reduce_bucket_size": 1.000000e+08, "stage3_prefetch_bucket_size": 1.000000e+08, "stage3_param_persistence_threshold": 1.000000e+05, "stage3_max_live_parameters": 1.000000e+08, "stage3_max_reuse_distance": 1.000000e+08, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_clipping": 1.0, "wall_clock_breakdown": false, "steps_per_print": inf, "zero_allow_untested_optimizer": true } [INFO|trainer.py:2409] 2025-04-13 14:58:27,755 >> ***** Running training ***** [INFO|trainer.py:2410] 2025-04-13 14:58:27,756 >> Num examples = 2,688 [INFO|trainer.py:2411] 2025-04-13 14:58:27,756 >> Num Epochs = 4 [INFO|trainer.py:2412] 2025-04-13 14:58:27,756 >> Instantaneous batch size per device = 1 [INFO|trainer.py:2415] 2025-04-13 14:58:27,756 >> Total train batch size (w. parallel, distributed & accumulation) = 32 [INFO|trainer.py:2416] 2025-04-13 14:58:27,756 >> Gradient Accumulation steps = 4 [INFO|trainer.py:2417] 2025-04-13 14:58:27,756 >> Total optimization steps = 336 [INFO|trainer.py:2418] 2025-04-13 14:58:27,757 >> Number of trainable parameters = 32,763,876,352 0%| | 0/336 [00:00> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. /home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. 
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
/home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] /home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] 0%| | 1/336 [00:27<2:30:51, 27.02s/it] {'loss': 0.0336, 'grad_norm': 0.556594967842102, 'learning_rate': 1.818181818181818e-07, 'kl': -0.0009, 'entropy': -0.0017, 'ce_loss': 0.0271, 'epoch': 0.01} 0%| | 1/336 [00:27<2:30:51, 27.02s/it] 1%| | 2/336 [00:47<2:09:37, 23.29s/it] {'loss': 0.0306, 'grad_norm': 0.4636707007884979, 'learning_rate': 3.636363636363636e-07, 'kl': 0.0014, 'entropy': 0.0205, 'ce_loss': 0.01, 'epoch': 0.02} 1%| | 2/336 [00:47<2:09:37, 23.29s/it] 1%| | 3/336 [01:05<1:55:13, 20.76s/it] {'loss': 0.0263, 'grad_norm': 0.4290856719017029, 'learning_rate': 5.454545454545454e-07, 'kl': -0.0006, 'entropy': -0.0026, 'ce_loss': 0.0253, 'epoch': 0.04} 1%| | 3/336 [01:05<1:55:13, 20.76s/it] 1%| | 4/336 [01:27<1:58:21, 21.39s/it] {'loss': 0.0322, 'grad_norm': 0.5268592238426208, 'learning_rate': 7.272727272727272e-07, 'kl': 0.0004, 'entropy': -0.0366, 'ce_loss': 0.0298, 'epoch': 0.05} 1%| | 4/336 [01:27<1:58:21, 21.39s/it] 1%|▏ | 5/336 [01:43<1:46:53, 19.38s/it] {'loss': 0.0364, 'grad_norm': 0.5142350196838379, 'learning_rate': 9.09090909090909e-07, 'kl': 0.0027, 'entropy': -0.0227, 'ce_loss': 0.0203, 'epoch': 0.06} 1%|▏ | 5/336 [01:43<1:46:53, 19.38s/it] 2%|▏ | 6/336 [02:02<1:46:20, 19.33s/it] {'loss': 0.0306, 'grad_norm': 0.3994697332382202, 'learning_rate': 1.0909090909090908e-06, 'kl': 0.0042, 'entropy': 0.0659, 'ce_loss': 0.0104, 'epoch': 0.07} 2%|▏ | 6/336 [02:02<1:46:20, 19.33s/it] 2%|▏ | 7/336 [02:21<1:43:55, 18.95s/it] {'loss': 0.0312, 'grad_norm': 0.29699787497520447, 'learning_rate': 
1.2727272727272726e-06, 'kl': -0.0002, 'entropy': -0.0356, 'ce_loss': 0.0093, 'epoch': 0.08} 2%|▏ | 7/336 [02:21<1:43:55, 18.95s/it] 2%|▏ | 8/336 [02:40<1:44:02, 19.03s/it] {'loss': 0.037, 'grad_norm': 0.3551566004753113, 'learning_rate': 1.4545454545454544e-06, 'kl': 0.0039, 'entropy': 0.0383, 'ce_loss': 0.0162, 'epoch': 0.1} 2%|▏ | 8/336 [02:40<1:44:02, 19.03s/it] 3%|▎ | 9/336 [02:59<1:43:29, 18.99s/it] {'loss': 0.0253, 'grad_norm': 0.2004219889640808, 'learning_rate': 1.6363636363636365e-06, 'kl': 0.0058, 'entropy': -0.0649, 'ce_loss': 0.0215, 'epoch': 0.11} 3%|▎ | 9/336 [02:59<1:43:29, 18.99s/it] 3%|▎ | 10/336 [03:14<1:37:50, 18.01s/it] {'loss': 0.0322, 'grad_norm': 0.23521199822425842, 'learning_rate': 1.818181818181818e-06, 'kl': 0.0077, 'entropy': -0.0442, 'ce_loss': 0.0145, 'epoch': 0.12} 3%|▎ | 10/336 [03:14<1:37:50, 18.01s/it] 3%|▎ | 11/336 [03:35<1:42:21, 18.90s/it] {'loss': 0.0294, 'grad_norm': 0.2418721467256546, 'learning_rate': 2e-06, 'kl': 0.0026, 'entropy': -0.0408, 'ce_loss': 0.0082, 'epoch': 0.13} 3%|▎ | 11/336 [03:35<1:42:21, 18.90s/it] 4%|▎ | 12/336 [03:51<1:37:02, 17.97s/it] {'loss': 0.0351, 'grad_norm': 0.24574391543865204, 'learning_rate': 1.999953280342959e-06, 'kl': 0.0102, 'entropy': 0.0435, 'ce_loss': 0.0304, 'epoch': 0.14} 4%|▎ | 12/336 [03:51<1:37:02, 17.97s/it] 4%|▍ | 13/336 [04:11<1:39:40, 18.51s/it] {'loss': 0.0292, 'grad_norm': 0.24601855874061584, 'learning_rate': 1.9998131257372875e-06, 'kl': 0.0054, 'entropy': 0.0364, 'ce_loss': 0.0159, 'epoch': 0.15} 4%|▍ | 13/336 [04:11<1:39:40, 18.51s/it] 4%|▍ | 14/336 [04:30<1:40:12, 18.67s/it] {'loss': 0.038, 'grad_norm': 0.24783623218536377, 'learning_rate': 1.9995795492789365e-06, 'kl': 0.0111, 'entropy': -0.0146, 'ce_loss': 0.0152, 'epoch': 0.17} 4%|▍ | 14/336 [04:30<1:40:12, 18.67s/it] 4%|▍ | 15/336 [04:52<1:45:22, 19.70s/it] {'loss': 0.0335, 'grad_norm': 0.2631509006023407, 'learning_rate': 1.99925257279313e-06, 'kl': 0.0105, 'entropy': -0.0591, 'ce_loss': 0.0152, 'epoch': 0.18} 4%|▍ | 
15/336 [04:52<1:45:22, 19.70s/it] 5%|▍ | 16/336 [05:10<1:41:34, 19.04s/it] {'loss': 0.0288, 'grad_norm': 0.22452250123023987, 'learning_rate': 1.9988322268323264e-06, 'kl': 0.0099, 'entropy': 0.0449, 'ce_loss': 0.0302, 'epoch': 0.19} 5%|▍ | 16/336 [05:10<1:41:34, 19.04s/it] 5%|▌ | 17/336 [05:27<1:38:56, 18.61s/it] {'loss': 0.0191, 'grad_norm': 0.17368917167186737, 'learning_rate': 1.998318550673364e-06, 'kl': 0.017, 'entropy': 0.0147, 'ce_loss': 0.0144, 'epoch': 0.2} 5%|▌ | 17/336 [05:27<1:38:56, 18.61s/it] 5%|▌ | 18/336 [05:43<1:34:10, 17.77s/it] {'loss': 0.0338, 'grad_norm': 0.32134443521499634, 'learning_rate': 1.997711592313791e-06, 'kl': 0.0036, 'entropy': -0.0304, 'ce_loss': 0.0167, 'epoch': 0.21} 5%|▌ | 18/336 [05:43<1:34:10, 17.77s/it] 6%|▌ | 19/336 [06:01<1:34:50, 17.95s/it] {'loss': 0.028, 'grad_norm': 0.2845238447189331, 'learning_rate': 1.9970114084673796e-06, 'kl': 0.0045, 'entropy': -0.0452, 'ce_loss': 0.03, 'epoch': 0.23} 6%|▌ | 19/336 [06:01<1:34:50, 17.95s/it] 6%|▌ | 20/336 [06:17<1:30:53, 17.26s/it] {'loss': 0.0335, 'grad_norm': 0.3595256507396698, 'learning_rate': 1.9962180645588286e-06, 'kl': 0.0123, 'entropy': -0.0933, 'ce_loss': 0.0147, 'epoch': 0.24} 6%|▌ | 20/336 [06:17<1:30:53, 17.26s/it] 6%|▋ | 21/336 [06:33<1:27:56, 16.75s/it] {'loss': 0.0417, 'grad_norm': 0.3472529947757721, 'learning_rate': 1.9953316347176486e-06, 'kl': 0.0129, 'entropy': -0.0261, 'ce_loss': 0.0183, 'epoch': 0.25} 6%|▋ | 21/336 [06:33<1:27:56, 16.75s/it] 7%|▋ | 22/336 [06:48<1:26:10, 16.47s/it] {'loss': 0.0332, 'grad_norm': 0.27452847361564636, 'learning_rate': 1.994352201771236e-06, 'kl': 0.0102, 'entropy': -0.0344, 'ce_loss': 0.02, 'epoch': 0.26} 7%|▋ | 22/336 [06:48<1:26:10, 16.47s/it] 7%|▋ | 23/336 [07:04<1:24:22, 16.18s/it] {'loss': 0.0279, 'grad_norm': 0.2443157136440277, 'learning_rate': 1.993279857237133e-06, 'kl': 0.0176, 'entropy': -0.0262, 'ce_loss': 0.0244, 'epoch': 0.27} 7%|▋ | 23/336 [07:04<1:24:22, 16.18s/it] 7%|▋ | 24/336 [07:20<1:23:16, 16.01s/it] 
{'loss': 0.0352, 'grad_norm': 0.27312204241752625, 'learning_rate': 1.9921147013144777e-06, 'kl': 0.0198, 'entropy': -0.0762, 'ce_loss': 0.0158, 'epoch': 0.29} 7%|▋ | 24/336 [07:20<1:23:16, 16.01s/it] 7%|▋ | 25/336 [07:37<1:24:34, 16.32s/it] {'loss': 0.0276, 'grad_norm': 0.22963714599609375, 'learning_rate': 1.9908568428746405e-06, 'kl': 0.0078, 'entropy': -0.0659, 'ce_loss': 0.0259, 'epoch': 0.3} 7%|▋ | 25/336 [07:37<1:24:34, 16.32s/it] 8%|▊ | 26/336 [07:52<1:23:38, 16.19s/it] {'loss': 0.0262, 'grad_norm': 0.17868177592754364, 'learning_rate': 1.989506399451051e-06, 'kl': 0.0114, 'entropy': -0.0557, 'ce_loss': 0.017, 'epoch': 0.31} 8%|▊ | 26/336 [07:52<1:23:38, 16.19s/it] 8%|▊ | 27/336 [08:09<1:23:10, 16.15s/it] {'loss': 0.0335, 'grad_norm': 0.2423129379749298, 'learning_rate': 1.9880634972282166e-06, 'kl': 0.0082, 'entropy': -0.0172, 'ce_loss': 0.025, 'epoch': 0.32} 8%|▊ | 27/336 [08:09<1:23:10, 16.15s/it] 8%|▊ | 28/336 [08:28<1:27:23, 17.02s/it] {'loss': 0.0289, 'grad_norm': 0.21936891973018646, 'learning_rate': 1.986528271029931e-06, 'kl': 0.0105, 'entropy': -0.019, 'ce_loss': 0.0328, 'epoch': 0.33} 8%|▊ | 28/336 [08:28<1:27:23, 17.02s/it] 9%|▊ | 29/336 [08:47<1:30:04, 17.60s/it] {'loss': 0.0287, 'grad_norm': 0.2251005619764328, 'learning_rate': 1.984900864306677e-06, 'kl': 0.0165, 'entropy': -0.0347, 'ce_loss': 0.0173, 'epoch': 0.35} 9%|▊ | 29/336 [08:47<1:30:04, 17.60s/it] 9%|▉ | 30/336 [09:02<1:26:24, 16.94s/it] {'loss': 0.0312, 'grad_norm': 0.2545115649700165, 'learning_rate': 1.9831814291222233e-06, 'kl': 0.0217, 'entropy': 0.028, 'ce_loss': 0.0162, 'epoch': 0.36} 9%|▉ | 30/336 [09:02<1:26:24, 16.94s/it] 9%|▉ | 31/336 [09:21<1:28:49, 17.47s/it] {'loss': 0.0306, 'grad_norm': 0.24097831547260284, 'learning_rate': 1.981370126139413e-06, 'kl': 0.0064, 'entropy': -0.0193, 'ce_loss': 0.0095, 'epoch': 0.37} 9%|▉ | 31/336 [09:21<1:28:49, 17.47s/it] 10%|▉ | 32/336 [09:42<1:34:47, 18.71s/it] {'loss': 0.021, 'grad_norm': 0.16445204615592957, 'learning_rate': 
1.979467124605156e-06, 'kl': 0.0058, 'entropy': -0.0554, 'ce_loss': 0.0099, 'epoch': 0.38} 10%|▉ | 32/336 [09:42<1:34:47, 18.71s/it] 10%|▉ | 33/336 [09:59<1:31:19, 18.08s/it] {'loss': 0.0309, 'grad_norm': 0.325605571269989, 'learning_rate': 1.977472602334609e-06, 'kl': 0.0178, 'entropy': -0.0112, 'ce_loss': 0.0173, 'epoch': 0.39} 10%|▉ | 33/336 [09:59<1:31:19, 18.08s/it] 10%|█ | 34/336 [10:18<1:33:00, 18.48s/it] {'loss': 0.0272, 'grad_norm': 0.19901347160339355, 'learning_rate': 1.975386745694565e-06, 'kl': 0.0238, 'entropy': -0.03, 'ce_loss': 0.0364, 'epoch': 0.4} 10%|█ | 34/336 [10:18<1:33:00, 18.48s/it] 10%|█ | 35/336 [10:34<1:29:11, 17.78s/it] {'loss': 0.0256, 'grad_norm': 0.19283317029476166, 'learning_rate': 1.9732097495860385e-06, 'kl': 0.0128, 'entropy': -0.0212, 'ce_loss': 0.0127, 'epoch': 0.42} 10%|█ | 35/336 [10:34<1:29:11, 17.78s/it] 11%|█ | 36/336 [10:54<1:31:04, 18.21s/it] {'loss': 0.0316, 'grad_norm': 0.21884019672870636, 'learning_rate': 1.970941817426052e-06, 'kl': 0.0247, 'entropy': -0.0505, 'ce_loss': 0.0116, 'epoch': 0.43} 11%|█ | 36/336 [10:54<1:31:04, 18.21s/it] 11%|█ | 37/336 [11:10<1:27:48, 17.62s/it] {'loss': 0.0245, 'grad_norm': 0.19287759065628052, 'learning_rate': 1.968583161128631e-06, 'kl': 0.0091, 'entropy': -0.0093, 'ce_loss': 0.0204, 'epoch': 0.44} 11%|█ | 37/336 [11:10<1:27:48, 17.62s/it] 11%|█▏ | 38/336 [11:27<1:26:24, 17.40s/it] {'loss': 0.0235, 'grad_norm': 0.18665923178195953, 'learning_rate': 1.9661340010850024e-06, 'kl': 0.0187, 'entropy': -0.0312, 'ce_loss': 0.0144, 'epoch': 0.45} 11%|█▏ | 38/336 [11:27<1:26:24, 17.40s/it] 12%|█▏ | 39/336 [11:44<1:25:53, 17.35s/it] {'loss': 0.0296, 'grad_norm': 0.2117118537425995, 'learning_rate': 1.9635945661430005e-06, 'kl': 0.0104, 'entropy': -0.0732, 'ce_loss': 0.0147, 'epoch': 0.46} 12%|█▏ | 39/336 [11:44<1:25:53, 17.35s/it] 12%|█▏ | 40/336 [12:00<1:23:33, 16.94s/it] {'loss': 0.0331, 'grad_norm': 0.23639369010925293, 'learning_rate': 1.960965093585684e-06, 'kl': 0.0085, 'entropy': 
-0.0723, 'ce_loss': 0.0215, 'epoch': 0.48} 12%|█▏ | 40/336 [12:00<1:23:33, 16.94s/it] 12%|█▏ | 41/336 [12:19<1:26:32, 17.60s/it] {'loss': 0.0314, 'grad_norm': 0.2341509610414505, 'learning_rate': 1.9582458291091663e-06, 'kl': 0.0155, 'entropy': -0.0095, 'ce_loss': 0.0126, 'epoch': 0.49} 12%|█▏ | 41/336 [12:19<1:26:32, 17.60s/it] 12%|█▎ | 42/336 [12:38<1:28:42, 18.10s/it] {'loss': 0.0256, 'grad_norm': 0.20089305937290192, 'learning_rate': 1.9554370267996535e-06, 'kl': 0.0078, 'entropy': -0.0474, 'ce_loss': 0.0218, 'epoch': 0.5} 12%|█▎ | 42/336 [12:38<1:28:42, 18.10s/it] 13%|█▎ | 43/336 [12:54<1:25:06, 17.43s/it] {'loss': 0.03, 'grad_norm': 0.2235359102487564, 'learning_rate': 1.952538949109708e-06, 'kl': 0.0065, 'entropy': -0.0253, 'ce_loss': 0.0096, 'epoch': 0.51} 13%|█▎ | 43/336 [12:54<1:25:06, 17.43s/it] 13%|█▎ | 44/336 [13:13<1:27:23, 17.96s/it] {'loss': 0.0239, 'grad_norm': 0.16896812617778778, 'learning_rate': 1.94955186683372e-06, 'kl': 0.0215, 'entropy': -0.0181, 'ce_loss': 0.0085, 'epoch': 0.52} 13%|█▎ | 44/336 [13:13<1:27:23, 17.96s/it] 13%|█▎ | 45/336 [13:33<1:29:03, 18.36s/it] {'loss': 0.0293, 'grad_norm': 0.21316233277320862, 'learning_rate': 1.94647605908261e-06, 'kl': 0.0076, 'entropy': -0.0325, 'ce_loss': 0.0095, 'epoch': 0.54} 13%|█▎ | 45/336 [13:33<1:29:03, 18.36s/it] 14%|█▎ | 46/336 [13:49<1:25:22, 17.66s/it] {'loss': 0.0285, 'grad_norm': 0.25628453493118286, 'learning_rate': 1.943311813257743e-06, 'kl': 0.0093, 'entropy': -0.0471, 'ce_loss': 0.0218, 'epoch': 0.55} 14%|█▎ | 46/336 [13:49<1:25:22, 17.66s/it] 14%|█▍ | 47/336 [14:06<1:25:01, 17.65s/it] {'loss': 0.0337, 'grad_norm': 0.2560902535915375, 'learning_rate': 1.9400594250240794e-06, 'kl': 0.0126, 'entropy': -0.0515, 'ce_loss': 0.0283, 'epoch': 0.56} 14%|█▍ | 47/336 [14:06<1:25:01, 17.65s/it] 14%|█▍ | 48/336 [14:25<1:26:24, 18.00s/it] {'loss': 0.03, 'grad_norm': 0.22114038467407227, 'learning_rate': 1.9367191982825448e-06, 'kl': 0.0126, 'entropy': -0.064, 'ce_loss': 0.0145, 'epoch': 0.57} 
14%|█▍ | 48/336 [14:25<1:26:24, 18.00s/it] 15%|█▍ | 49/336 [14:41<1:22:36, 17.27s/it] {'loss': 0.0264, 'grad_norm': 0.19219888746738434, 'learning_rate': 1.9332914451416345e-06, 'kl': 0.0208, 'entropy': 0.032, 'ce_loss': 0.0246, 'epoch': 0.58} 15%|█▍ | 49/336 [14:41<1:22:36, 17.27s/it] 15%|█▍ | 50/336 [15:03<1:29:27, 18.77s/it] {'loss': 0.0218, 'grad_norm': 0.16692093014717102, 'learning_rate': 1.929776485888251e-06, 'kl': 0.0085, 'entropy': -0.0498, 'ce_loss': 0.0061, 'epoch': 0.6} 15%|█▍ | 50/336 [15:03<1:29:27, 18.77s/it] 15%|█▌ | 51/336 [15:19<1:25:09, 17.93s/it] {'loss': 0.0351, 'grad_norm': 0.27574315667152405, 'learning_rate': 1.9261746489577764e-06, 'kl': 0.0063, 'entropy': -0.0728, 'ce_loss': 0.0241, 'epoch': 0.61} 15%|█▌ | 51/336 [15:19<1:25:09, 17.93s/it] 15%|█▌ | 52/336 [15:36<1:22:56, 17.52s/it] {'loss': 0.0316, 'grad_norm': 0.23243726789951324, 'learning_rate': 1.9224862709033824e-06, 'kl': 0.01, 'entropy': 0.0388, 'ce_loss': 0.0108, 'epoch': 0.62} 15%|█▌ | 52/336 [15:36<1:22:56, 17.52s/it] 16%|█▌ | 53/336 [15:52<1:20:29, 17.07s/it] {'loss': 0.039, 'grad_norm': 0.2668706774711609, 'learning_rate': 1.918711696364584e-06, 'kl': 0.0427, 'entropy': -0.2295, 'ce_loss': 0.0267, 'epoch': 0.63} 16%|█▌ | 53/336 [15:52<1:20:29, 17.07s/it] 16%|█▌ | 54/336 [16:07<1:18:02, 16.60s/it] {'loss': 0.0326, 'grad_norm': 0.25586774945259094, 'learning_rate': 1.914851278035038e-06, 'kl': 0.0059, 'entropy': -0.0284, 'ce_loss': 0.027, 'epoch': 0.64} 16%|█▌ | 54/336 [16:07<1:18:02, 16.60s/it] 16%|█▋ | 55/336 [16:23<1:16:27, 16.33s/it] {'loss': 0.0305, 'grad_norm': 0.23466119170188904, 'learning_rate': 1.910905376629585e-06, 'kl': 0.032, 'entropy': -0.014, 'ce_loss': 0.0443, 'epoch': 0.65} 16%|█▋ | 55/336 [16:23<1:16:27, 16.33s/it] 17%|█▋ | 56/336 [16:39<1:15:45, 16.23s/it] {'loss': 0.0302, 'grad_norm': 0.21334969997406006, 'learning_rate': 1.9068743608505452e-06, 'kl': 0.0036, 'entropy': -0.0138, 'ce_loss': 0.0075, 'epoch': 0.67} 17%|█▋ | 56/336 [16:39<1:15:45, 16.23s/it] 
17%|█▋ | 57/336 [17:01<1:23:08, 17.88s/it] {'loss': 0.031, 'grad_norm': 0.25325629115104675, 'learning_rate': 1.902758607353269e-06, 'kl': 0.0186, 'entropy': 0.0009, 'ce_loss': 0.0097, 'epoch': 0.68} 17%|█▋ | 57/336 [17:01<1:23:08, 17.88s/it] 17%|█▋ | 58/336 [17:16<1:19:51, 17.24s/it] {'loss': 0.0286, 'grad_norm': 0.20371510088443756, 'learning_rate': 1.8985585007109388e-06, 'kl': 0.011, 'entropy': -0.0403, 'ce_loss': 0.0096, 'epoch': 0.69} 17%|█▋ | 58/336 [17:16<1:19:51, 17.24s/it] 18%|█▊ | 59/336 [17:35<1:22:13, 17.81s/it] {'loss': 0.0275, 'grad_norm': 0.21284231543540955, 'learning_rate': 1.8942744333786395e-06, 'kl': 0.0066, 'entropy': 0.0228, 'ce_loss': 0.0208, 'epoch': 0.7} 18%|█▊ | 59/336 [17:35<1:22:13, 17.81s/it] 18%|█▊ | 60/336 [17:51<1:18:59, 17.17s/it] {'loss': 0.037, 'grad_norm': 0.23954221606254578, 'learning_rate': 1.8899068056566838e-06, 'kl': 0.0226, 'entropy': 0.0033, 'ce_loss': 0.0262, 'epoch': 0.71} 18%|█▊ | 60/336 [17:51<1:18:59, 17.17s/it] 18%|█▊ | 61/336 [18:13<1:24:40, 18.47s/it] {'loss': 0.028, 'grad_norm': 0.2091853767633438, 'learning_rate': 1.8854560256532098e-06, 'kl': 0.0012, 'entropy': -0.0064, 'ce_loss': 0.0065, 'epoch': 0.73} 18%|█▊ | 61/336 [18:13<1:24:40, 18.47s/it] 18%|█▊ | 62/336 [18:29<1:21:50, 17.92s/it] {'loss': 0.0293, 'grad_norm': 0.21565210819244385, 'learning_rate': 1.8809225092460485e-06, 'kl': 0.0036, 'entropy': 0.013, 'ce_loss': 0.0099, 'epoch': 0.74} 18%|█▊ | 62/336 [18:29<1:21:50, 17.92s/it] 19%|█▉ | 63/336 [18:48<1:22:59, 18.24s/it] {'loss': 0.0264, 'grad_norm': 0.251737505197525, 'learning_rate': 1.8763066800438634e-06, 'kl': 0.0179, 'entropy': -0.0498, 'ce_loss': 0.0308, 'epoch': 0.75} 19%|█▉ | 63/336 [18:48<1:22:59, 18.24s/it] 19%|█▉ | 64/336 [19:04<1:18:50, 17.39s/it] {'loss': 0.0338, 'grad_norm': 0.2660210132598877, 'learning_rate': 1.8716089693465693e-06, 'kl': 0.0128, 'entropy': -0.0398, 'ce_loss': 0.0159, 'epoch': 0.76} 19%|█▉ | 64/336 [19:04<1:18:50, 17.39s/it] 19%|█▉ | 65/336 [19:19<1:16:02, 16.84s/it] 
{'loss': 0.0371, 'grad_norm': 0.2840781807899475, 'learning_rate': 1.8668298161050306e-06, 'kl': 0.0161, 'entropy': -0.0286, 'ce_loss': 0.0124, 'epoch': 0.77} 19%|█▉ | 65/336 [19:19<1:16:02, 16.84s/it] 20%|█▉ | 66/336 [19:35<1:14:24, 16.53s/it] {'loss': 0.03, 'grad_norm': 0.20204216241836548, 'learning_rate': 1.861969666880049e-06, 'kl': 0.0124, 'entropy': -0.042, 'ce_loss': 0.019, 'epoch': 0.79} 20%|█▉ | 66/336 [19:35<1:14:24, 16.53s/it] 20%|█▉ | 67/336 [19:51<1:13:19, 16.35s/it] {'loss': 0.0285, 'grad_norm': 0.19628241658210754, 'learning_rate': 1.8570289758006343e-06, 'kl': 0.0211, 'entropy': -0.0074, 'ce_loss': 0.0238, 'epoch': 0.8} 20%|█▉ | 67/336 [19:51<1:13:19, 16.35s/it] 20%|██ | 68/336 [20:07<1:12:15, 16.18s/it] {'loss': 0.0319, 'grad_norm': 0.23093868792057037, 'learning_rate': 1.8520082045215717e-06, 'kl': 0.0242, 'entropy': 0.02, 'ce_loss': 0.0311, 'epoch': 0.81} 20%|██ | 68/336 [20:07<1:12:15, 16.18s/it] 21%|██ | 69/336 [20:25<1:14:46, 16.80s/it] {'loss': 0.0301, 'grad_norm': 0.2109888792037964, 'learning_rate': 1.846907822180286e-06, 'kl': 0.0103, 'entropy': -0.1167, 'ce_loss': 0.0137, 'epoch': 0.82} 21%|██ | 69/336 [20:25<1:14:46, 16.80s/it] 21%|██ | 70/336 [20:41<1:12:49, 16.43s/it] {'loss': 0.0263, 'grad_norm': 0.18644583225250244, 'learning_rate': 1.8417283053530043e-06, 'kl': 0.0085, 'entropy': 0.0069, 'ce_loss': 0.0126, 'epoch': 0.83} 21%|██ | 70/336 [20:41<1:12:49, 16.43s/it] 21%|██ | 71/336 [20:59<1:15:34, 17.11s/it] {'loss': 0.0255, 'grad_norm': 0.2035277783870697, 'learning_rate': 1.8364701380102264e-06, 'kl': 0.0135, 'entropy': -0.0184, 'ce_loss': 0.0113, 'epoch': 0.85} 21%|██ | 71/336 [20:59<1:15:34, 17.11s/it] 21%|██▏ | 72/336 [21:15<1:13:08, 16.62s/it] {'loss': 0.0318, 'grad_norm': 0.22000524401664734, 'learning_rate': 1.8311338114715027e-06, 'kl': 0.0022, 'entropy': -0.0266, 'ce_loss': 0.0074, 'epoch': 0.86} 21%|██▏ | 72/336 [21:15<1:13:08, 16.62s/it] 22%|██▏ | 73/336 [21:31<1:12:12, 16.47s/it] {'loss': 0.0326, 'grad_norm': 
0.23698453605175018, 'learning_rate': 1.825719824359524e-06, 'kl': 0.0059, 'entropy': -0.0469, 'ce_loss': 0.0155, 'epoch': 0.87} 22%|██▏ | 73/336 [21:31<1:12:12, 16.47s/it] 22%|██▏ | 74/336 [21:49<1:14:04, 16.97s/it] {'loss': 0.027, 'grad_norm': 0.18671374022960663, 'learning_rate': 1.8202286825535329e-06, 'kl': 0.0046, 'entropy': -0.0625, 'ce_loss': 0.0096, 'epoch': 0.88} 22%|██▏ | 74/336 [21:49<1:14:04, 16.97s/it] 22%|██▏ | 75/336 [22:06<1:13:40, 16.94s/it] {'loss': 0.0276, 'grad_norm': 0.22594326734542847, 'learning_rate': 1.814660899142053e-06, 'kl': 0.0148, 'entropy': -0.0654, 'ce_loss': 0.0119, 'epoch': 0.89} 22%|██▏ | 75/336 [22:06<1:13:40, 16.94s/it] 23%|██▎ | 76/336 [22:22<1:11:52, 16.58s/it] {'loss': 0.0267, 'grad_norm': 0.2022831290960312, 'learning_rate': 1.8090169943749474e-06, 'kl': 0.0457, 'entropy': -0.0771, 'ce_loss': 0.0215, 'epoch': 0.9} 23%|██▎ | 76/336 [22:22<1:11:52, 16.58s/it] 23%|██▎ | 77/336 [22:44<1:18:36, 18.21s/it] {'loss': 0.0269, 'grad_norm': 0.18786337971687317, 'learning_rate': 1.8032974956148062e-06, 'kl': 0.013, 'entropy': 0.0186, 'ce_loss': 0.0141, 'epoch': 0.92} 23%|██▎ | 77/336 [22:44<1:18:36, 18.21s/it] 23%|██▎ | 78/336 [23:04<1:20:45, 18.78s/it] {'loss': 0.0247, 'grad_norm': 0.1901751309633255, 'learning_rate': 1.7975029372876705e-06, 'kl': 0.002, 'entropy': -0.0408, 'ce_loss': 0.0094, 'epoch': 0.93} 23%|██▎ | 78/336 [23:04<1:20:45, 18.78s/it] 24%|██▎ | 79/336 [23:19<1:16:28, 17.86s/it] {'loss': 0.0325, 'grad_norm': 0.23225203156471252, 'learning_rate': 1.7916338608330956e-06, 'kl': 0.0014, 'entropy': -0.0278, 'ce_loss': 0.0159, 'epoch': 0.94} 24%|██▎ | 79/336 [23:19<1:16:28, 17.86s/it] 24%|██▍ | 80/336 [23:38<1:17:19, 18.12s/it] {'loss': 0.0356, 'grad_norm': 0.24554163217544556, 'learning_rate': 1.78569081465356e-06, 'kl': 0.0081, 'entropy': -0.0449, 'ce_loss': 0.0146, 'epoch': 0.95} 24%|██▍ | 80/336 [23:38<1:17:19, 18.12s/it] 24%|██▍ | 81/336 [23:54<1:13:44, 17.35s/it] {'loss': 0.044, 'grad_norm': 0.308929443359375, 
'learning_rate': 1.7796743540632221e-06, 'kl': 0.0096, 'entropy': -0.0469, 'ce_loss': 0.0275, 'epoch': 0.96} 24%|██▍ | 81/336 [23:54<1:13:44, 17.35s/it] 24%|██▍ | 82/336 [24:18<1:22:51, 19.57s/it] {'loss': 0.022, 'grad_norm': 0.1706738919019699, 'learning_rate': 1.7735850412360328e-06, 'kl': 0.0157, 'entropy': -0.0369, 'ce_loss': 0.0045, 'epoch': 0.98} 24%|██▍ | 82/336 [24:18<1:22:51, 19.57s/it] 25%|██▍ | 83/336 [24:37<1:21:39, 19.37s/it] {'loss': 0.0262, 'grad_norm': 0.1916491985321045, 'learning_rate': 1.7674234451532063e-06, 'kl': 0.027, 'entropy': 0.051, 'ce_loss': 0.0224, 'epoch': 0.99} 25%|██▍ | 83/336 [24:37<1:21:39, 19.37s/it] 25%|██▌ | 84/336 [24:53<1:16:50, 18.29s/it] {'loss': 0.0277, 'grad_norm': 0.22529131174087524, 'learning_rate': 1.7611901415500533e-06, 'kl': 0.0033, 'entropy': -0.0476, 'ce_loss': 0.0147, 'epoch': 1.0} 25%|██▌ | 84/336 [24:53<1:16:50, 18.29s/it] 25%|██▌ | 85/336 [25:10<1:14:41, 17.85s/it] {'loss': 0.0244, 'grad_norm': 0.19255438446998596, 'learning_rate': 1.7548857128621874e-06, 'kl': 0.0228, 'entropy': -0.0388, 'ce_loss': 0.0178, 'epoch': 1.01} 25%|██▌ | 85/336 [25:10<1:14:41, 17.85s/it] 26%|██▌ | 86/336 [25:25<1:11:26, 17.15s/it] {'loss': 0.0288, 'grad_norm': 0.2253047525882721, 'learning_rate': 1.748510748171101e-06, 'kl': 0.005, 'entropy': -0.0052, 'ce_loss': 0.017, 'epoch': 1.02} 26%|██▌ | 86/336 [25:25<1:11:26, 17.15s/it] 26%|██▌ | 87/336 [25:45<1:13:32, 17.72s/it] {'loss': 0.0226, 'grad_norm': 0.1952665150165558, 'learning_rate': 1.7420658431491222e-06, 'kl': 0.0122, 'entropy': -0.0566, 'ce_loss': 0.0052, 'epoch': 1.04} 26%|██▌ | 87/336 [25:45<1:13:32, 17.72s/it] 26%|██▌ | 88/336 [26:00<1:10:49, 17.13s/it] {'loss': 0.0249, 'grad_norm': 0.1836732178926468, 'learning_rate': 1.735551600003755e-06, 'kl': 0.0236, 'entropy': -0.0576, 'ce_loss': 0.0319, 'epoch': 1.05} 26%|██▌ | 88/336 [26:00<1:10:49, 17.13s/it] 26%|██▋ | 89/336 [26:16<1:08:54, 16.74s/it] {'loss': 0.0193, 'grad_norm': 0.17091220617294312, 'learning_rate': 
1.7289686274214115e-06, 'kl': 0.0133, 'entropy': -0.0378, 'ce_loss': 0.0301, 'epoch': 1.06} 26%|██▋ | 89/336 [26:16<1:08:54, 16.74s/it] 27%|██▋ | 90/336 [26:37<1:13:46, 17.99s/it] {'loss': 0.017, 'grad_norm': 0.12661834061145782, 'learning_rate': 1.722317540510534e-06, 'kl': 0.0194, 'entropy': -0.1187, 'ce_loss': 0.0205, 'epoch': 1.07} 27%|██▋ | 90/336 [26:37<1:13:46, 17.99s/it] 27%|██▋ | 91/336 [26:54<1:11:56, 17.62s/it] {'loss': 0.0218, 'grad_norm': 0.16420775651931763, 'learning_rate': 1.715598960744121e-06, 'kl': 0.0197, 'entropy': -0.0281, 'ce_loss': 0.012, 'epoch': 1.08} 27%|██▋ | 91/336 [26:54<1:11:56, 17.62s/it] 27%|██▋ | 92/336 [27:13<1:13:19, 18.03s/it] {'loss': 0.0213, 'grad_norm': 0.1939077079296112, 'learning_rate': 1.7088135159016582e-06, 'kl': 0.051, 'entropy': -0.0342, 'ce_loss': 0.0058, 'epoch': 1.1} 27%|██▋ | 92/336 [27:13<1:13:19, 18.03s/it] 28%|██▊ | 93/336 [27:29<1:10:51, 17.50s/it] {'loss': 0.0209, 'grad_norm': 0.17347100377082825, 'learning_rate': 1.7019618400104569e-06, 'kl': 0.0172, 'entropy': -0.0483, 'ce_loss': 0.01, 'epoch': 1.11} 28%|██▊ | 93/336 [27:29<1:10:51, 17.50s/it] 28%|██▊ | 94/336 [27:47<1:11:14, 17.66s/it] {'loss': 0.0203, 'grad_norm': 0.17469847202301025, 'learning_rate': 1.6950445732864126e-06, 'kl': 0.0179, 'entropy': -0.0645, 'ce_loss': 0.0186, 'epoch': 1.12} 28%|██▊ | 94/336 [27:47<1:11:14, 17.66s/it] 28%|██▊ | 95/336 [28:06<1:12:08, 17.96s/it] {'loss': 0.0287, 'grad_norm': 0.21660856902599335, 'learning_rate': 1.688062362074184e-06, 'kl': 0.0034, 'entropy': -0.019, 'ce_loss': 0.0076, 'epoch': 1.13} 28%|██▊ | 95/336 [28:06<1:12:08, 17.96s/it] 29%|██▊ | 96/336 [28:22<1:09:41, 17.42s/it] {'loss': 0.025, 'grad_norm': 0.2265072762966156, 'learning_rate': 1.681015858786797e-06, 'kl': 0.0155, 'entropy': -0.0737, 'ce_loss': 0.026, 'epoch': 1.14} 29%|██▊ | 96/336 [28:22<1:09:41, 17.42s/it] 29%|██▉ | 97/336 [28:38<1:07:29, 16.94s/it] {'loss': 0.0197, 'grad_norm': 0.1737026572227478, 'learning_rate': 1.6739057218446857e-06, 'kl': 
0.0255, 'entropy': 0.0315, 'ce_loss': 0.0164, 'epoch': 1.15} 29%|██▉ | 97/336 [28:38<1:07:29, 16.94s/it] 29%|██▉ | 98/336 [28:53<1:05:14, 16.45s/it] {'loss': 0.0252, 'grad_norm': 0.21314632892608643, 'learning_rate': 1.666732615614169e-06, 'kl': 0.0219, 'entropy': -0.031, 'ce_loss': 0.0214, 'epoch': 1.17} 29%|██▉ | 98/336 [28:53<1:05:14, 16.45s/it] 29%|██▉ | 99/336 [29:15<1:11:44, 18.16s/it] {'loss': 0.0247, 'grad_norm': 0.2042514830827713, 'learning_rate': 1.6594972103453724e-06, 'kl': 0.032, 'entropy': -0.0283, 'ce_loss': 0.0123, 'epoch': 1.18} 29%|██▉ | 99/336 [29:15<1:11:44, 18.16s/it] 30%|██▉ | 100/336 [29:34<1:12:34, 18.45s/it] {'loss': 0.0197, 'grad_norm': 0.19269202649593353, 'learning_rate': 1.6522001821096019e-06, 'kl': 0.0325, 'entropy': -0.0947, 'ce_loss': 0.0156, 'epoch': 1.19} 30%|██▉ | 100/336 [29:34<1:12:34, 18.45s/it] 30%|███ | 101/336 [29:50<1:08:56, 17.60s/it] {'loss': 0.0323, 'grad_norm': 0.2899748384952545, 'learning_rate': 1.6448422127361705e-06, 'kl': 0.0195, 'entropy': -0.0148, 'ce_loss': 0.0249, 'epoch': 1.2} 30%|███ | 101/336 [29:50<1:08:56, 17.60s/it] 30%|███ | 102/336 [30:09<1:10:28, 18.07s/it] {'loss': 0.0198, 'grad_norm': 0.17614398896694183, 'learning_rate': 1.6374239897486897e-06, 'kl': 0.0184, 'entropy': -0.0347, 'ce_loss': 0.0186, 'epoch': 1.21} 30%|███ | 102/336 [30:09<1:10:28, 18.07s/it] 31%|███ | 103/336 [30:25<1:07:29, 17.38s/it] {'loss': 0.0272, 'grad_norm': 0.23129649460315704, 'learning_rate': 1.6299462063008269e-06, 'kl': 0.0576, 'entropy': -0.0552, 'ce_loss': 0.0136, 'epoch': 1.23} 31%|███ | 103/336 [30:25<1:07:29, 17.38s/it] 31%|███ | 104/336 [30:40<1:05:08, 16.85s/it] {'loss': 0.0243, 'grad_norm': 0.20547500252723694, 'learning_rate': 1.6224095611115383e-06, 'kl': 0.032, 'entropy': -0.0177, 'ce_loss': 0.0184, 'epoch': 1.24} 31%|███ | 104/336 [30:40<1:05:08, 16.85s/it] 31%|███▏ | 105/336 [31:00<1:07:26, 17.52s/it] {'loss': 0.0271, 'grad_norm': 0.226070374250412, 'learning_rate': 1.614814758399781e-06, 'kl': 0.0337, 
'entropy': 0.0398, 'ce_loss': 0.0058, 'epoch': 1.25} 31%|███▏ | 105/336 [31:00<1:07:26, 17.52s/it] 32%|███▏ | 106/336 [31:15<1:04:45, 16.89s/it] {'loss': 0.025, 'grad_norm': 0.22863830626010895, 'learning_rate': 1.6071625078187112e-06, 'kl': 0.0129, 'entropy': -0.0718, 'ce_loss': 0.0122, 'epoch': 1.26} 32%|███▏ | 106/336 [31:15<1:04:45, 16.89s/it] 32%|███▏ | 107/336 [31:35<1:07:49, 17.77s/it] {'loss': 0.0269, 'grad_norm': 0.26432666182518005, 'learning_rate': 1.599453524389374e-06, 'kl': 0.0474, 'entropy': -0.0254, 'ce_loss': 0.0065, 'epoch': 1.27} 32%|███▏ | 107/336 [31:35<1:07:49, 17.77s/it] 32%|███▏ | 108/336 [31:50<1:05:10, 17.15s/it] {'loss': 0.0268, 'grad_norm': 0.2612511217594147, 'learning_rate': 1.5916885284338935e-06, 'kl': 0.0265, 'entropy': -0.0654, 'ce_loss': 0.0141, 'epoch': 1.29} 32%|███▏ | 108/336 [31:50<1:05:10, 17.15s/it] 32%|███▏ | 109/336 [32:09<1:06:26, 17.56s/it] {'loss': 0.018, 'grad_norm': 0.18477967381477356, 'learning_rate': 1.5838682455081657e-06, 'kl': 0.0082, 'entropy': -0.0437, 'ce_loss': 0.0315, 'epoch': 1.3} 32%|███▏ | 109/336 [32:09<1:06:26, 17.56s/it] 33%|███▎ | 110/336 [32:25<1:04:00, 16.99s/it] {'loss': 0.0256, 'grad_norm': 0.20217899978160858, 'learning_rate': 1.5759934063340624e-06, 'kl': 0.0422, 'entropy': -0.1001, 'ce_loss': 0.0087, 'epoch': 1.31} 33%|███▎ | 110/336 [32:25<1:04:00, 16.99s/it] 33%|███▎ | 111/336 [32:46<1:08:52, 18.36s/it] {'loss': 0.0197, 'grad_norm': 0.1906185746192932, 'learning_rate': 1.5680647467311555e-06, 'kl': 0.028, 'entropy': -0.0302, 'ce_loss': 0.0081, 'epoch': 1.32} 33%|███▎ | 111/336 [32:46<1:08:52, 18.36s/it] 33%|███▎ | 112/336 [33:09<1:13:07, 19.59s/it] {'loss': 0.0204, 'grad_norm': 0.1959102302789688, 'learning_rate': 1.56008300754796e-06, 'kl': 0.0425, 'entropy': 0.0223, 'ce_loss': 0.0217, 'epoch': 1.33} 33%|███▎ | 112/336 [33:09<1:13:07, 19.59s/it] 34%|███▎ | 113/336 [33:27<1:11:36, 19.27s/it] {'loss': 0.0197, 'grad_norm': 0.17761990427970886, 'learning_rate': 1.5520489345927094e-06, 'kl': 
0.0327, 'entropy': -0.0562, 'ce_loss': 0.0186, 'epoch': 1.35} 34%|███▎ | 113/336 [33:27<1:11:36, 19.27s/it] 34%|███▍ | 114/336 [33:43<1:07:53, 18.35s/it] {'loss': 0.0239, 'grad_norm': 0.2238835245370865, 'learning_rate': 1.5439632785636705e-06, 'kl': 0.0192, 'entropy': 0.0053, 'ce_loss': 0.0134, 'epoch': 1.36} 34%|███▍ | 114/336 [33:43<1:07:53, 18.35s/it] 34%|███▍ | 115/336 [33:59<1:04:39, 17.55s/it] {'loss': 0.0272, 'grad_norm': 0.22165977954864502, 'learning_rate': 1.5358267949789964e-06, 'kl': 0.0114, 'entropy': -0.0564, 'ce_loss': 0.0137, 'epoch': 1.37} 34%|███▍ | 115/336 [33:59<1:04:39, 17.55s/it] 35%|███▍ | 116/336 [34:15<1:02:12, 16.97s/it] {'loss': 0.0264, 'grad_norm': 0.23063895106315613, 'learning_rate': 1.5276402441061327e-06, 'kl': 0.0405, 'entropy': -0.0654, 'ce_loss': 0.019, 'epoch': 1.38} 35%|███▍ | 116/336 [34:15<1:02:12, 16.97s/it] 35%|███▍ | 117/336 [34:30<1:00:38, 16.61s/it] {'loss': 0.0233, 'grad_norm': 0.20103129744529724, 'learning_rate': 1.5194043908907772e-06, 'kl': 0.0231, 'entropy': 0.0317, 'ce_loss': 0.025, 'epoch': 1.39} 35%|███▍ | 117/336 [34:30<1:00:38, 16.61s/it] 35%|███▌ | 118/336 [34:46<59:10, 16.29s/it] {'loss': 0.0265, 'grad_norm': 0.22980648279190063, 'learning_rate': 1.5111200048854054e-06, 'kl': 0.0067, 'entropy': -0.0408, 'ce_loss': 0.0121, 'epoch': 1.4} 35%|███▌ | 118/336 [34:46<59:10, 16.29s/it] 35%|███▌ | 119/336 [35:02<58:07, 16.07s/it] {'loss': 0.0295, 'grad_norm': 0.24733252823352814, 'learning_rate': 1.5027878601773632e-06, 'kl': 0.0167, 'entropy': -0.054, 'ce_loss': 0.0217, 'epoch': 1.42} 35%|███▌ | 119/336 [35:02<58:07, 16.07s/it] 36%|███▌ | 120/336 [35:17<57:25, 15.95s/it] {'loss': 0.0181, 'grad_norm': 0.1668197065591812, 'learning_rate': 1.494408735316537e-06, 'kl': 0.0227, 'entropy': -0.009, 'ce_loss': 0.0184, 'epoch': 1.43} 36%|███▌ | 120/336 [35:17<57:25, 15.95s/it] 36%|███▌ | 121/336 [35:33<57:04, 15.93s/it] {'loss': 0.0273, 'grad_norm': 0.23297691345214844, 'learning_rate': 1.4859834132426058e-06, 'kl': 0.0359, 
'entropy': -0.0361, 'ce_loss': 0.0179, 'epoch': 1.44} 36%|███▌ | 121/336 [35:33<57:04, 15.93s/it] 36%|███▋ | 122/336 [35:55<1:03:28, 17.80s/it] {'loss': 0.0201, 'grad_norm': 0.1959342062473297, 'learning_rate': 1.4775126812118863e-06, 'kl': 0.0244, 'entropy': 0.0123, 'ce_loss': 0.0129, 'epoch': 1.45} 36%|███▋ | 122/336 [35:55<1:03:28, 17.80s/it] 37%|███▋ | 123/336 [36:11<1:01:10, 17.23s/it] {'loss': 0.0227, 'grad_norm': 0.19156020879745483, 'learning_rate': 1.4689973307237686e-06, 'kl': 0.0164, 'entropy': -0.0674, 'ce_loss': 0.0214, 'epoch': 1.46} 37%|███▋ | 123/336 [36:11<1:01:10, 17.23s/it] 37%|███▋ | 124/336 [36:31<1:03:45, 18.05s/it] {'loss': 0.02, 'grad_norm': 0.1831812560558319, 'learning_rate': 1.4604381574467614e-06, 'kl': 0.0334, 'entropy': -0.0136, 'ce_loss': 0.0232, 'epoch': 1.48} 37%|███▋ | 124/336 [36:31<1:03:45, 18.05s/it] 37%|███▋ | 125/336 [36:47<1:01:23, 17.46s/it] {'loss': 0.0216, 'grad_norm': 0.1912676841020584, 'learning_rate': 1.451835961144145e-06, 'kl': 0.0342, 'entropy': -0.0332, 'ce_loss': 0.0058, 'epoch': 1.49} 37%|███▋ | 125/336 [36:47<1:01:23, 17.46s/it] 38%|███▊ | 126/336 [37:03<59:26, 16.99s/it] {'loss': 0.0242, 'grad_norm': 0.19266380369663239, 'learning_rate': 1.4431915455992414e-06, 'kl': 0.0216, 'entropy': -0.0322, 'ce_loss': 0.0111, 'epoch': 1.5} 38%|███▊ | 126/336 [37:03<59:26, 16.99s/it] 38%|███▊ | 127/336 [37:26<1:04:51, 18.62s/it] {'loss': 0.0234, 'grad_norm': 0.19839657843112946, 'learning_rate': 1.4345057185403098e-06, 'kl': 0.0286, 'entropy': -0.0264, 'ce_loss': 0.011, 'epoch': 1.51} 38%|███▊ | 127/336 [37:26<1:04:51, 18.62s/it] 38%|███▊ | 128/336 [37:43<1:03:04, 18.19s/it] {'loss': 0.0309, 'grad_norm': 0.25261247158050537, 'learning_rate': 1.4257792915650725e-06, 'kl': 0.0354, 'entropy': 0.001, 'ce_loss': 0.0211, 'epoch': 1.52} 38%|███▊ | 128/336 [37:43<1:03:04, 18.19s/it] 38%|███▊ | 129/336 [38:02<1:04:08, 18.59s/it] {'loss': 0.0263, 'grad_norm': 0.22662681341171265, 'learning_rate': 1.4170130800648812e-06, 'kl': 0.0101, 
'entropy': -0.014, 'ce_loss': 0.0086, 'epoch': 1.54} 38%|███▊ | 129/336 [38:02<1:04:08, 18.59s/it] 39%|███▊ | 130/336 [38:21<1:04:16, 18.72s/it] {'loss': 0.0201, 'grad_norm': 0.17634278535842896, 'learning_rate': 1.408207903148525e-06, 'kl': 0.0121, 'entropy': -0.0188, 'ce_loss': 0.0124, 'epoch': 1.55} 39%|███▊ | 130/336 [38:21<1:04:16, 18.72s/it] 39%|███▉ | 131/336 [38:37<1:00:55, 17.83s/it] {'loss': 0.0216, 'grad_norm': 0.18929998576641083, 'learning_rate': 1.3993645835656952e-06, 'kl': 0.015, 'entropy': -0.1147, 'ce_loss': 0.0246, 'epoch': 1.56} 39%|███▉ | 131/336 [38:37<1:00:55, 17.83s/it] 39%|███▉ | 132/336 [38:55<1:00:57, 17.93s/it] {'loss': 0.0238, 'grad_norm': 0.2257225066423416, 'learning_rate': 1.3904839476301088e-06, 'kl': 0.0347, 'entropy': -0.0623, 'ce_loss': 0.0086, 'epoch': 1.57} 39%|███▉ | 132/336 [38:55<1:00:57, 17.93s/it] 40%|███▉ | 133/336 [39:18<1:05:38, 19.40s/it] {'loss': 0.0164, 'grad_norm': 0.16117718815803528, 'learning_rate': 1.3815668251422953e-06, 'kl': 0.007, 'entropy': 0.019, 'ce_loss': 0.0245, 'epoch': 1.58} 40%|███▉ | 133/336 [39:18<1:05:38, 19.40s/it] 40%|███▉ | 134/336 [39:37<1:05:10, 19.36s/it] {'loss': 0.0174, 'grad_norm': 0.1562952846288681, 'learning_rate': 1.3726140493120637e-06, 'kl': 0.04, 'entropy': -0.0471, 'ce_loss': 0.0057, 'epoch': 1.6} 40%|███▉ | 134/336 [39:37<1:05:10, 19.36s/it] 40%|████ | 135/336 [39:57<1:04:59, 19.40s/it] {'loss': 0.0224, 'grad_norm': 0.26663967967033386, 'learning_rate': 1.363626456680647e-06, 'kl': 0.0076, 'entropy': -0.0405, 'ce_loss': 0.0135, 'epoch': 1.61} 40%|████ | 135/336 [39:57<1:04:59, 19.40s/it] 40%|████ | 136/336 [40:16<1:04:02, 19.21s/it] {'loss': 0.0231, 'grad_norm': 0.19622810184955597, 'learning_rate': 1.3546048870425354e-06, 'kl': 0.0339, 'entropy': -0.0459, 'ce_loss': 0.0174, 'epoch': 1.62} 40%|████ | 136/336 [40:16<1:04:02, 19.21s/it] 41%|████ | 137/336 [40:35<1:03:45, 19.22s/it] {'loss': 0.0223, 'grad_norm': 0.22339412569999695, 'learning_rate': 1.3455501833670087e-06, 'kl': 
0.024, 'entropy': -0.0359, 'ce_loss': 0.0137, 'epoch': 1.63} 41%|████ | 137/336 [40:35<1:03:45, 19.22s/it] 41%|████ | 138/336 [40:51<59:57, 18.17s/it] {'loss': 0.025, 'grad_norm': 0.23139812052249908, 'learning_rate': 1.336463191719367e-06, 'kl': 0.0225, 'entropy': 0.0078, 'ce_loss': 0.0123, 'epoch': 1.64} 41%|████ | 138/336 [40:51<59:57, 18.17s/it] 41%|████▏ | 139/336 [41:06<57:08, 17.40s/it] {'loss': 0.0303, 'grad_norm': 0.25350722670555115, 'learning_rate': 1.3273447611818766e-06, 'kl': 0.0195, 'entropy': -0.0481, 'ce_loss': 0.0102, 'epoch': 1.65} 41%|████▏ | 139/336 [41:06<57:08, 17.40s/it] 42%|████▏ | 140/336 [41:23<56:47, 17.38s/it] {'loss': 0.0249, 'grad_norm': 0.21892692148685455, 'learning_rate': 1.3181957437744332e-06, 'kl': 0.0237, 'entropy': -0.0242, 'ce_loss': 0.0229, 'epoch': 1.67} 42%|████▏ | 140/336 [41:23<56:47, 17.38s/it] 42%|████▏ | 141/336 [41:42<57:39, 17.74s/it] {'loss': 0.0177, 'grad_norm': 0.16795045137405396, 'learning_rate': 1.3090169943749473e-06, 'kl': 0.0464, 'entropy': 0.0137, 'ce_loss': 0.0068, 'epoch': 1.68} 42%|████▏ | 141/336 [41:42<57:39, 17.74s/it] 42%|████▏ | 142/336 [41:58<55:48, 17.26s/it] {'loss': 0.0307, 'grad_norm': 0.24835337698459625, 'learning_rate': 1.2998093706394675e-06, 'kl': 0.0098, 'entropy': -0.0178, 'ce_loss': 0.0141, 'epoch': 1.69} 42%|████▏ | 142/336 [41:58<55:48, 17.26s/it] 43%|████▎ | 143/336 [42:16<56:18, 17.51s/it] {'loss': 0.0179, 'grad_norm': 0.15057280659675598, 'learning_rate': 1.2905737329220392e-06, 'kl': 0.0354, 'entropy': -0.125, 'ce_loss': 0.0115, 'epoch': 1.7} 43%|████▎ | 143/336 [42:16<56:18, 17.51s/it] 43%|████▎ | 144/336 [42:35<57:23, 17.94s/it] {'loss': 0.0217, 'grad_norm': 0.20697814226150513, 'learning_rate': 1.2813109441943164e-06, 'kl': 0.0132, 'entropy': -0.0342, 'ce_loss': 0.0128, 'epoch': 1.71} 43%|████▎ | 144/336 [42:35<57:23, 17.94s/it] 43%|████▎ | 145/336 [42:51<54:51, 17.24s/it] {'loss': 0.0272, 'grad_norm': 0.23005861043930054, 'learning_rate': 1.2720218699649241e-06, 'kl': 0.0447, 
'entropy': -0.0659, 'ce_loss': 0.0129, 'epoch': 1.73} 43%|████▎ | 145/336 [42:51<54:51, 17.24s/it] 43%|████▎ | 146/336 [43:10<56:23, 17.81s/it] {'loss': 0.0228, 'grad_norm': 0.20512734353542328, 'learning_rate': 1.262707378198587e-06, 'kl': 0.0148, 'entropy': -0.022, 'ce_loss': 0.0205, 'epoch': 1.74} 43%|████▎ | 146/336 [43:10<56:23, 17.81s/it] 44%|████▍ | 147/336 [43:25<53:50, 17.09s/it] {'loss': 0.0195, 'grad_norm': 0.18244057893753052, 'learning_rate': 1.2533683392350262e-06, 'kl': 0.0074, 'entropy': 0.0103, 'ce_loss': 0.0126, 'epoch': 1.75} 44%|████▍ | 147/336 [43:25<53:50, 17.09s/it] 44%|████▍ | 148/336 [43:45<55:46, 17.80s/it] {'loss': 0.0204, 'grad_norm': 0.18747949600219727, 'learning_rate': 1.2440056257076374e-06, 'kl': 0.0437, 'entropy': -0.0479, 'ce_loss': 0.0106, 'epoch': 1.76} 44%|████▍ | 148/336 [43:45<55:46, 17.80s/it] 44%|████▍ | 149/336 [44:05<57:46, 18.54s/it] {'loss': 0.0199, 'grad_norm': 0.19330933690071106, 'learning_rate': 1.23462011246195e-06, 'kl': 0.0126, 'entropy': -0.0552, 'ce_loss': 0.0221, 'epoch': 1.77} 44%|████▍ | 149/336 [44:05<57:46, 18.54s/it] 45%|████▍ | 150/336 [44:21<54:45, 17.66s/it] {'loss': 0.0238, 'grad_norm': 0.1976664662361145, 'learning_rate': 1.2252126764738844e-06, 'kl': 0.0078, 'entropy': -0.0659, 'ce_loss': 0.021, 'epoch': 1.79} 45%|████▍ | 150/336 [44:21<54:45, 17.66s/it] 45%|████▍ | 151/336 [44:43<58:28, 18.97s/it] {'loss': 0.02, 'grad_norm': 0.17954564094543457, 'learning_rate': 1.2157841967678063e-06, 'kl': 0.0168, 'entropy': 0.0046, 'ce_loss': 0.0035, 'epoch': 1.8} 45%|████▍ | 151/336 [44:43<58:28, 18.97s/it] 45%|████▌ | 152/336 [44:59<55:42, 18.16s/it] {'loss': 0.0263, 'grad_norm': 0.20568442344665527, 'learning_rate': 1.2063355543343923e-06, 'kl': 0.0344, 'entropy': -0.0776, 'ce_loss': 0.018, 'epoch': 1.81} 45%|████▌ | 152/336 [44:59<55:42, 18.16s/it] 46%|████▌ | 153/336 [45:15<53:08, 17.43s/it] {'loss': 0.0297, 'grad_norm': 0.2846704125404358, 'learning_rate': 1.1968676320483101e-06, 'kl': 0.0135, 'entropy': 
-0.0031, 'ce_loss': 0.011, 'epoch': 1.82} 46%|████▌ | 153/336 [45:15<53:08, 17.43s/it] 46%|████▌ | 154/336 [45:34<54:21, 17.92s/it] {'loss': 0.0218, 'grad_norm': 0.19644086062908173, 'learning_rate': 1.1873813145857248e-06, 'kl': 0.0334, 'entropy': -0.0391, 'ce_loss': 0.0388, 'epoch': 1.83} 46%|████▌ | 154/336 [45:34<54:21, 17.92s/it] 46%|████▌ | 155/336 [45:53<55:02, 18.24s/it] {'loss': 0.0236, 'grad_norm': 0.20494718849658966, 'learning_rate': 1.1778774883416322e-06, 'kl': 0.0227, 'entropy': -0.0212, 'ce_loss': 0.0207, 'epoch': 1.85} 46%|████▌ | 155/336 [45:53<55:02, 18.24s/it] 46%|████▋ | 156/336 [46:12<55:20, 18.45s/it] {'loss': 0.0174, 'grad_norm': 0.16688618063926697, 'learning_rate': 1.1683570413470383e-06, 'kl': 0.0198, 'entropy': -0.0238, 'ce_loss': 0.0162, 'epoch': 1.86} 46%|████▋ | 156/336 [46:12<55:20, 18.45s/it] 47%|████▋ | 157/336 [46:32<56:35, 18.97s/it] {'loss': 0.0229, 'grad_norm': 0.1961638331413269, 'learning_rate': 1.1588208631859807e-06, 'kl': 0.0284, 'entropy': -0.0869, 'ce_loss': 0.0147, 'epoch': 1.87} 47%|████▋ | 157/336 [46:32<56:35, 18.97s/it] 47%|████▋ | 158/336 [46:51<56:11, 18.94s/it] {'loss': 0.0219, 'grad_norm': 0.2146499902009964, 'learning_rate': 1.149269844912404e-06, 'kl': 0.0243, 'entropy': -0.0603, 'ce_loss': 0.0058, 'epoch': 1.88} 47%|████▋ | 158/336 [46:51<56:11, 18.94s/it] 47%|████▋ | 159/336 [47:10<56:02, 18.99s/it] {'loss': 0.0245, 'grad_norm': 0.2261628657579422, 'learning_rate': 1.1397048789669059e-06, 'kl': 0.0242, 'entropy': -0.0796, 'ce_loss': 0.0189, 'epoch': 1.89} 47%|████▋ | 159/336 [47:10<56:02, 18.99s/it] 48%|████▊ | 160/336 [47:30<56:31, 19.27s/it] {'loss': 0.0222, 'grad_norm': 0.18682613968849182, 'learning_rate': 1.1301268590933434e-06, 'kl': 0.0203, 'entropy': -0.0576, 'ce_loss': 0.0128, 'epoch': 1.9} 48%|████▊ | 160/336 [47:30<56:31, 19.27s/it] 48%|████▊ | 161/336 [47:45<53:01, 18.18s/it] {'loss': 0.0258, 'grad_norm': 0.2125205546617508, 'learning_rate': 1.1205366802553228e-06, 'kl': 0.0183, 'entropy': 
-0.0405, 'ce_loss': 0.0125, 'epoch': 1.92} 48%|████▊ | 161/336 [47:45<53:01, 18.18s/it] 48%|████▊ | 162/336 [48:04<52:44, 18.19s/it] {'loss': 0.0246, 'grad_norm': 0.21307235956192017, 'learning_rate': 1.110935238552578e-06, 'kl': 0.0459, 'entropy': -0.1074, 'ce_loss': 0.0112, 'epoch': 1.93} 48%|████▊ | 162/336 [48:04<52:44, 18.19s/it] 49%|████▊ | 163/336 [48:19<50:13, 17.42s/it] {'loss': 0.0283, 'grad_norm': 0.22004197537899017, 'learning_rate': 1.1013234311372353e-06, 'kl': 0.0327, 'entropy': -0.0884, 'ce_loss': 0.0336, 'epoch': 1.94} 49%|████▊ | 163/336 [48:19<50:13, 17.42s/it] 49%|████▉ | 164/336 [48:39<51:39, 18.02s/it] {'loss': 0.021, 'grad_norm': 0.20040468871593475, 'learning_rate': 1.0917021561299862e-06, 'kl': 0.0292, 'entropy': -0.0161, 'ce_loss': 0.0119, 'epoch': 1.95} 49%|████▉ | 164/336 [48:39<51:39, 18.02s/it] 49%|████▉ | 165/336 [49:00<54:31, 19.13s/it] {'loss': 0.0211, 'grad_norm': 0.17962150275707245, 'learning_rate': 1.0820723125361684e-06, 'kl': 0.0253, 'entropy': -0.063, 'ce_loss': 0.0112, 'epoch': 1.96} 49%|████▉ | 165/336 [49:00<54:31, 19.13s/it] 49%|████▉ | 166/336 [49:19<53:47, 18.99s/it] {'loss': 0.0195, 'grad_norm': 0.17014382779598236, 'learning_rate': 1.0724348001617625e-06, 'kl': 0.0913, 'entropy': -0.0869, 'ce_loss': 0.0093, 'epoch': 1.98} 49%|████▉ | 166/336 [49:19<53:47, 18.99s/it] 50%|████▉ | 167/336 [49:35<50:39, 17.99s/it] {'loss': 0.0222, 'grad_norm': 0.18333743512630463, 'learning_rate': 1.0627905195293135e-06, 'kl': 0.0256, 'entropy': -0.0571, 'ce_loss': 0.0128, 'epoch': 1.99} 50%|████▉ | 167/336 [49:35<50:39, 17.99s/it] 50%|█████ | 168/336 [49:51<48:40, 17.39s/it] {'loss': 0.0272, 'grad_norm': 0.24237695336341858, 'learning_rate': 1.0531403717937886e-06, 'kl': 0.0027, 'entropy': -0.0026, 'ce_loss': 0.0052, 'epoch': 2.0} 50%|█████ | 168/336 [49:51<48:40, 17.39s/it] 50%|█████ | 169/336 [50:07<47:08, 16.94s/it] {'loss': 0.0162, 'grad_norm': 0.15395523607730865, 'learning_rate': 1.0434852586583737e-06, 'kl': 0.0449, 'entropy': 
-0.0021, 'ce_loss': 0.0147, 'epoch': 2.01} 50%|█████ | 169/336 [50:07<47:08, 16.94s/it] 51%|█████ | 170/336 [50:24<47:01, 17.00s/it] {'loss': 0.0146, 'grad_norm': 0.13404767215251923, 'learning_rate': 1.0338260822902165e-06, 'kl': 0.0332, 'entropy': -0.0286, 'ce_loss': 0.0201, 'epoch': 2.02} 51%|█████ | 170/336 [50:24<47:01, 17.00s/it] 51%|█████ | 171/336 [50:41<46:48, 17.02s/it] {'loss': 0.0194, 'grad_norm': 0.2107783555984497, 'learning_rate': 1.0241637452361322e-06, 'kl': 0.0859, 'entropy': -0.1113, 'ce_loss': 0.0114, 'epoch': 2.04} 51%|█████ | 171/336 [50:41<46:48, 17.02s/it] 51%|█████ | 172/336 [50:57<45:28, 16.64s/it] {'loss': 0.0213, 'grad_norm': 0.17959530651569366, 'learning_rate': 1.0144991503382673e-06, 'kl': 0.0581, 'entropy': -0.0378, 'ce_loss': 0.0105, 'epoch': 2.05} 51%|█████ | 172/336 [50:57<45:28, 16.64s/it] 51%|█████▏ | 173/336 [51:12<44:24, 16.34s/it] {'loss': 0.0161, 'grad_norm': 0.1580849587917328, 'learning_rate': 1.0048332006497404e-06, 'kl': 0.0408, 'entropy': -0.0569, 'ce_loss': 0.0214, 'epoch': 2.06} 51%|█████▏ | 173/336 [51:12<44:24, 16.34s/it] 52%|█████▏ | 174/336 [51:29<44:25, 16.45s/it] {'loss': 0.0162, 'grad_norm': 0.196575328707695, 'learning_rate': 9.951667993502597e-07, 'kl': 0.0728, 'entropy': -0.1484, 'ce_loss': 0.0088, 'epoch': 2.07} 52%|█████▏ | 174/336 [51:29<44:25, 16.45s/it] 52%|█████▏ | 175/336 [51:45<43:31, 16.22s/it] {'loss': 0.018, 'grad_norm': 0.17969413101673126, 'learning_rate': 9.855008496617326e-07, 'kl': 0.0391, 'entropy': -0.0245, 'ce_loss': 0.0064, 'epoch': 2.08} 52%|█████▏ | 175/336 [51:45<43:31, 16.22s/it] 52%|█████▏ | 176/336 [52:03<44:57, 16.86s/it] {'loss': 0.0143, 'grad_norm': 0.1602613776922226, 'learning_rate': 9.75836254763868e-07, 'kl': 0.03, 'entropy': -0.0674, 'ce_loss': 0.009, 'epoch': 2.1} 52%|█████▏ | 176/336 [52:03<44:57, 16.86s/it] 53%|█████▎ | 177/336 [52:22<46:28, 17.54s/it] {'loss': 0.0209, 'grad_norm': 0.19288259744644165, 'learning_rate': 9.661739177097834e-07, 'kl': 0.026, 'entropy': 
-0.0718, 'ce_loss': 0.0052, 'epoch': 2.11} 53%|█████▎ | 177/336 [52:22<46:28, 17.54s/it] 53%|█████▎ | 178/336 [52:38<44:40, 16.96s/it] {'loss': 0.0202, 'grad_norm': 0.21120965480804443, 'learning_rate': 9.565147413416265e-07, 'kl': 0.0542, 'entropy': -0.1177, 'ce_loss': 0.0089, 'epoch': 2.12} 53%|█████▎ | 178/336 [52:38<44:40, 16.96s/it] 53%|█████▎ | 179/336 [52:53<43:27, 16.61s/it] {'loss': 0.0118, 'grad_norm': 0.15320152044296265, 'learning_rate': 9.468596282062113e-07, 'kl': 0.0167, 'entropy': -0.0544, 'ce_loss': 0.0139, 'epoch': 2.13} 53%|█████▎ | 179/336 [52:53<43:27, 16.61s/it] 54%|█████▎ | 180/336 [53:12<44:59, 17.31s/it] {'loss': 0.0175, 'grad_norm': 0.1730872541666031, 'learning_rate': 9.372094804706866e-07, 'kl': 0.0532, 'entropy': 0.0073, 'ce_loss': 0.0102, 'epoch': 2.14} 54%|█████▎ | 180/336 [53:12<44:59, 17.31s/it] 54%|█████▍ | 181/336 [53:31<45:53, 17.76s/it] {'loss': 0.0142, 'grad_norm': 0.15351615846157074, 'learning_rate': 9.275651998382377e-07, 'kl': 0.0293, 'entropy': -0.0095, 'ce_loss': 0.0073, 'epoch': 2.15} 54%|█████▍ | 181/336 [53:31<45:53, 17.76s/it] 54%|█████▍ | 182/336 [53:50<46:30, 18.12s/it] {'loss': 0.0221, 'grad_norm': 0.2354268729686737, 'learning_rate': 9.179276874638314e-07, 'kl': 0.0596, 'entropy': -0.1426, 'ce_loss': 0.029, 'epoch': 2.17} 54%|█████▍ | 182/336 [53:50<46:30, 18.12s/it] 54%|█████▍ | 183/336 [54:06<44:30, 17.45s/it] {'loss': 0.0173, 'grad_norm': 0.1883721500635147, 'learning_rate': 9.082978438700138e-07, 'kl': 0.0532, 'entropy': -0.0776, 'ce_loss': 0.0161, 'epoch': 2.18} 54%|█████▍ | 183/336 [54:06<44:30, 17.45s/it] 55%|█████▍ | 184/336 [54:22<43:24, 17.13s/it] {'loss': 0.0168, 'grad_norm': 0.19475631415843964, 'learning_rate': 8.986765688627651e-07, 'kl': 0.0549, 'entropy': -0.0366, 'ce_loss': 0.01, 'epoch': 2.19} 55%|█████▍ | 184/336 [54:22<43:24, 17.13s/it] 55%|█████▌ | 185/336 [54:38<42:00, 16.69s/it] {'loss': 0.0155, 'grad_norm': 0.19626420736312866, 'learning_rate': 8.890647614474222e-07, 'kl': 0.0237, 
'entropy': -0.0209, 'ce_loss': 0.0138, 'epoch': 2.2} 55%|█████▌ | 185/336 [54:38<42:00, 16.69s/it] 55%|█████▌ | 186/336 [55:00<45:51, 18.34s/it] {'loss': 0.0172, 'grad_norm': 0.21107307076454163, 'learning_rate': 8.79463319744677e-07, 'kl': 0.0613, 'entropy': 0.0027, 'ce_loss': 0.012, 'epoch': 2.21} 55%|█████▌ | 186/336 [55:00<45:51, 18.34s/it] 56%|█████▌ | 187/336 [55:18<45:11, 18.20s/it] {'loss': 0.0184, 'grad_norm': 0.20781706273555756, 'learning_rate': 8.698731409066568e-07, 'kl': 0.0221, 'entropy': -0.0269, 'ce_loss': 0.0102, 'epoch': 2.23} 56%|█████▌ | 187/336 [55:18<45:11, 18.20s/it] 56%|█████▌ | 188/336 [55:34<43:28, 17.62s/it] {'loss': 0.0213, 'grad_norm': 0.20844508707523346, 'learning_rate': 8.602951210330941e-07, 'kl': 0.0811, 'entropy': -0.0398, 'ce_loss': 0.0154, 'epoch': 2.24} 56%|█████▌ | 188/336 [55:34<43:28, 17.62s/it] 56%|█████▋ | 189/336 [55:53<43:32, 17.77s/it] {'loss': 0.0177, 'grad_norm': 0.21321342885494232, 'learning_rate': 8.507301550875959e-07, 'kl': 0.0032, 'entropy': -0.0056, 'ce_loss': 0.0025, 'epoch': 2.25} 56%|█████▋ | 189/336 [55:53<43:32, 17.77s/it] 57%|█████▋ | 190/336 [56:13<45:29, 18.69s/it] {'loss': 0.0166, 'grad_norm': 0.20195260643959045, 'learning_rate': 8.411791368140195e-07, 'kl': 0.0266, 'entropy': -0.0417, 'ce_loss': 0.0066, 'epoch': 2.26} 57%|█████▋ | 190/336 [56:13<45:29, 18.69s/it] 57%|█████▋ | 191/336 [56:29<43:00, 17.80s/it] {'loss': 0.0183, 'grad_norm': 0.20009654760360718, 'learning_rate': 8.316429586529614e-07, 'kl': 0.0574, 'entropy': -0.0474, 'ce_loss': 0.0076, 'epoch': 2.27} 57%|█████▋ | 191/336 [56:29<43:00, 17.80s/it] 57%|█████▋ | 192/336 [56:48<43:37, 18.18s/it] {'loss': 0.0218, 'grad_norm': 0.26318904757499695, 'learning_rate': 8.221225116583676e-07, 'kl': 0.0325, 'entropy': -0.0967, 'ce_loss': 0.0263, 'epoch': 2.29} 57%|█████▋ | 192/336 [56:48<43:37, 18.18s/it] 57%|█████▋ | 193/336 [57:05<42:17, 17.74s/it] {'loss': 0.0186, 'grad_norm': 0.19039058685302734, 'learning_rate': 8.126186854142751e-07, 'kl': 
0.0264, 'entropy': -0.0669, 'ce_loss': 0.0072, 'epoch': 2.3} 57%|█████▋ | 193/336 [57:05<42:17, 17.74s/it] 58%|█████▊ | 194/336 [57:20<40:26, 17.09s/it] {'loss': 0.021, 'grad_norm': 0.23199713230133057, 'learning_rate': 8.031323679516899e-07, 'kl': 0.0188, 'entropy': -0.0713, 'ce_loss': 0.0116, 'epoch': 2.31} 58%|█████▊ | 194/336 [57:20<40:26, 17.09s/it] 58%|█████▊ | 195/336 [57:40<42:04, 17.90s/it] {'loss': 0.0137, 'grad_norm': 0.18526126444339752, 'learning_rate': 7.936644456656081e-07, 'kl': 0.0635, 'entropy': -0.0996, 'ce_loss': 0.0047, 'epoch': 2.32} 58%|█████▊ | 195/336 [57:40<42:04, 17.90s/it] 58%|█████▊ | 196/336 [57:56<40:07, 17.20s/it] {'loss': 0.0189, 'grad_norm': 0.22145313024520874, 'learning_rate': 7.84215803232194e-07, 'kl': 0.0291, 'entropy': -0.0659, 'ce_loss': 0.0072, 'epoch': 2.33} 58%|█████▊ | 196/336 [57:56<40:07, 17.20s/it] 59%|█████▊ | 197/336 [58:12<38:52, 16.78s/it] {'loss': 0.0189, 'grad_norm': 0.20441696047782898, 'learning_rate': 7.747873235261156e-07, 'kl': 0.0334, 'entropy': -0.0525, 'ce_loss': 0.022, 'epoch': 2.35} 59%|█████▊ | 197/336 [58:12<38:52, 16.78s/it] 59%|█████▉ | 198/336 [58:30<39:56, 17.37s/it] {'loss': 0.0165, 'grad_norm': 0.17816822230815887, 'learning_rate': 7.653798875380499e-07, 'kl': 0.0322, 'entropy': -0.0659, 'ce_loss': 0.0168, 'epoch': 2.36} 59%|█████▉ | 198/336 [58:30<39:56, 17.37s/it] 59%|█████▉ | 199/336 [58:52<42:45, 18.73s/it] {'loss': 0.0183, 'grad_norm': 0.18242992460727692, 'learning_rate': 7.559943742923625e-07, 'kl': 0.0056, 'entropy': -0.0041, 'ce_loss': 0.0044, 'epoch': 2.37} 59%|█████▉ | 199/336 [58:52<42:45, 18.73s/it] 60%|█████▉ | 200/336 [59:14<44:10, 19.49s/it] {'loss': 0.0144, 'grad_norm': 0.16184532642364502, 'learning_rate': 7.466316607649736e-07, 'kl': 0.0025, 'entropy': -0.0045, 'ce_loss': 0.0053, 'epoch': 2.38} 60%|█████▉ | 200/336 [59:14<44:10, 19.49s/it] 60%|█████▉ | 201/336 [59:32<43:16, 19.24s/it] {'loss': 0.0128, 'grad_norm': 0.1520135998725891, 'learning_rate': 7.372926218014131e-07, 
'kl': 0.0591, 'entropy': -0.0311, 'ce_loss': 0.0276, 'epoch': 2.39} 60%|█████▉ | 201/336 [59:32<43:16, 19.24s/it] 60%|██████ | 202/336 [59:48<40:33, 18.16s/it] {'loss': 0.0224, 'grad_norm': 0.2052118480205536, 'learning_rate': 7.279781300350757e-07, 'kl': 0.1836, 'entropy': -0.1348, 'ce_loss': 0.0101, 'epoch': 2.4} 60%|██████ | 202/336 [59:48<40:33, 18.16s/it] 60%|██████ | 203/336 [1:00:04<38:35, 17.41s/it] {'loss': 0.0187, 'grad_norm': 0.19855929911136627, 'learning_rate': 7.186890558056836e-07, 'kl': 0.0145, 'entropy': -0.0366, 'ce_loss': 0.0135, 'epoch': 2.42} 60%|██████ | 203/336 [1:00:04<38:35, 17.41s/it] 61%|██████ | 204/336 [1:00:19<37:02, 16.83s/it] {'loss': 0.0177, 'grad_norm': 0.1831832379102707, 'learning_rate': 7.09426267077961e-07, 'kl': 0.0391, 'entropy': -0.0027, 'ce_loss': 0.018, 'epoch': 2.43} 61%|██████ | 204/336 [1:00:19<37:02, 16.83s/it] 61%|██████ | 205/336 [1:00:35<36:01, 16.50s/it] {'loss': 0.0163, 'grad_norm': 0.16797998547554016, 'learning_rate': 7.001906293605329e-07, 'kl': 0.0679, 'entropy': -0.0106, 'ce_loss': 0.0109, 'epoch': 2.44} 61%|██████ | 205/336 [1:00:35<36:01, 16.50s/it] 61%|██████▏ | 206/336 [1:00:55<38:06, 17.59s/it] {'loss': 0.0153, 'grad_norm': 0.1700359433889389, 'learning_rate': 6.909830056250526e-07, 'kl': 0.0262, 'entropy': -0.0146, 'ce_loss': 0.004, 'epoch': 2.45} 61%|██████▏ | 206/336 [1:00:55<38:06, 17.59s/it] 62%|██████▏ | 207/336 [1:01:11<37:00, 17.21s/it] {'loss': 0.02, 'grad_norm': 0.18519534170627594, 'learning_rate': 6.81804256225567e-07, 'kl': 0.0576, 'entropy': -0.0767, 'ce_loss': 0.0062, 'epoch': 2.46} 62%|██████▏ | 207/336 [1:01:11<37:00, 17.21s/it] 62%|██████▏ | 208/336 [1:01:37<42:05, 19.73s/it] {'loss': 0.0195, 'grad_norm': 0.1842510849237442, 'learning_rate': 6.726552388181233e-07, 'kl': 0.022, 'entropy': -0.0552, 'ce_loss': 0.0121, 'epoch': 2.48} 62%|██████▏ | 208/336 [1:01:37<42:05, 19.73s/it] 62%|██████▏ | 209/336 [1:01:56<41:25, 19.57s/it] {'loss': 0.0168, 'grad_norm': 0.18451088666915894, 
'learning_rate': 6.63536808280633e-07, 'kl': 0.0503, 'entropy': -0.0537, 'ce_loss': 0.012, 'epoch': 2.49} 62%|██████▏ | 209/336 [1:01:56<41:25, 19.57s/it] 62%|██████▎ | 210/336 [1:02:12<38:44, 18.45s/it] {'loss': 0.0178, 'grad_norm': 0.18919403851032257, 'learning_rate': 6.544498166329912e-07, 'kl': 0.0049, 'entropy': -0.0171, 'ce_loss': 0.0278, 'epoch': 2.5} 62%|██████▎ | 210/336 [1:02:12<38:44, 18.45s/it] 63%|██████▎ | 211/336 [1:02:27<36:38, 17.59s/it] {'loss': 0.019, 'grad_norm': 0.19489766657352448, 'learning_rate': 6.453951129574643e-07, 'kl': 0.0859, 'entropy': -0.1299, 'ce_loss': 0.0093, 'epoch': 2.51} 63%|██████▎ | 211/336 [1:02:27<36:38, 17.59s/it] 63%|██████▎ | 212/336 [1:02:45<36:06, 17.47s/it] {'loss': 0.0247, 'grad_norm': 0.22543658316135406, 'learning_rate': 6.363735433193529e-07, 'kl': 0.0447, 'entropy': -0.0586, 'ce_loss': 0.0098, 'epoch': 2.52} 63%|██████▎ | 212/336 [1:02:45<36:06, 17.47s/it] 63%|██████▎ | 213/336 [1:03:02<35:28, 17.31s/it] {'loss': 0.0191, 'grad_norm': 0.2101595252752304, 'learning_rate': 6.273859506879364e-07, 'kl': 0.0194, 'entropy': -0.0265, 'ce_loss': 0.0077, 'epoch': 2.54} 63%|██████▎ | 213/336 [1:03:02<35:28, 17.31s/it] 64%|██████▎ | 214/336 [1:03:21<36:20, 17.87s/it] {'loss': 0.0197, 'grad_norm': 0.22376661002635956, 'learning_rate': 6.18433174857705e-07, 'kl': 0.063, 'entropy': -0.0239, 'ce_loss': 0.0106, 'epoch': 2.55} 64%|██████▎ | 214/336 [1:03:21<36:20, 17.87s/it] 64%|██████▍ | 215/336 [1:03:40<36:42, 18.20s/it] {'loss': 0.0172, 'grad_norm': 0.20559558272361755, 'learning_rate': 6.095160523698912e-07, 'kl': 0.0126, 'entropy': -0.063, 'ce_loss': 0.0269, 'epoch': 2.56} 64%|██████▍ | 215/336 [1:03:40<36:42, 18.20s/it] 64%|██████▍ | 216/336 [1:04:02<38:43, 19.36s/it] {'loss': 0.0151, 'grad_norm': 0.17532405257225037, 'learning_rate': 6.006354164343046e-07, 'kl': 0.032, 'entropy': -0.0762, 'ce_loss': 0.0291, 'epoch': 2.57} 64%|██████▍ | 216/336 [1:04:02<38:43, 19.36s/it] 65%|██████▍ | 217/336 [1:04:21<38:07, 19.22s/it] 
{'loss': 0.0198, 'grad_norm': 0.2614968419075012, 'learning_rate': 5.917920968514751e-07, 'kl': 0.1108, 'entropy': -0.0991, 'ce_loss': 0.007, 'epoch': 2.58} 65%|██████▍ | 217/336 [1:04:21<38:07, 19.22s/it] 65%|██████▍ | 218/336 [1:04:40<37:41, 19.16s/it] {'loss': 0.0167, 'grad_norm': 0.17147068679332733, 'learning_rate': 5.829869199351187e-07, 'kl': 0.0383, 'entropy': -0.0864, 'ce_loss': 0.0241, 'epoch': 2.6} 65%|██████▍ | 218/336 [1:04:40<37:41, 19.16s/it] 65%|██████▌ | 219/336 [1:05:01<38:45, 19.87s/it] {'loss': 0.0145, 'grad_norm': 0.15350160002708435, 'learning_rate': 5.742207084349273e-07, 'kl': 0.0457, 'entropy': -0.0289, 'ce_loss': 0.013, 'epoch': 2.61} 65%|██████▌ | 219/336 [1:05:01<38:45, 19.87s/it] 65%|██████▌ | 220/336 [1:05:17<36:06, 18.67s/it] {'loss': 0.0167, 'grad_norm': 0.18451078236103058, 'learning_rate': 5.654942814596901e-07, 'kl': 0.0703, 'entropy': -0.052, 'ce_loss': 0.0216, 'epoch': 2.62} 65%|██████▌ | 220/336 [1:05:17<36:06, 18.67s/it] 66%|██████▌ | 221/336 [1:05:35<35:36, 18.58s/it] {'loss': 0.0171, 'grad_norm': 0.18586644530296326, 'learning_rate': 5.568084544007588e-07, 'kl': 0.0544, 'entropy': -0.0757, 'ce_loss': 0.0097, 'epoch': 2.63} 66%|██████▌ | 221/336 [1:05:35<35:36, 18.58s/it] 66%|██████▌ | 222/336 [1:05:52<34:10, 17.99s/it] {'loss': 0.0156, 'grad_norm': 0.17732904851436615, 'learning_rate': 5.48164038855855e-07, 'kl': 0.0104, 'entropy': 0.027, 'ce_loss': 0.009, 'epoch': 2.64} 66%|██████▌ | 222/336 [1:05:52<34:10, 17.99s/it] 66%|██████▋ | 223/336 [1:06:09<33:22, 17.72s/it] {'loss': 0.0151, 'grad_norm': 0.1597178876399994, 'learning_rate': 5.395618425532389e-07, 'kl': 0.0007, 'entropy': -0.0108, 'ce_loss': 0.0102, 'epoch': 2.65} 66%|██████▋ | 223/336 [1:06:09<33:22, 17.72s/it] 67%|██████▋ | 224/336 [1:06:28<33:42, 18.05s/it] {'loss': 0.0139, 'grad_norm': 0.16127213835716248, 'learning_rate': 5.310026692762314e-07, 'kl': 0.063, 'entropy': -0.0089, 'ce_loss': 0.0045, 'epoch': 2.67} 67%|██████▋ | 224/336 [1:06:28<33:42, 18.05s/it] 
67%|██████▋ | 225/336 [1:06:44<32:07, 17.36s/it] {'loss': 0.0205, 'grad_norm': 0.20752082765102386, 'learning_rate': 5.224873187881136e-07, 'kl': 0.0508, 'entropy': -0.0178, 'ce_loss': 0.0216, 'epoch': 2.68} 67%|██████▋ | 225/336 [1:06:44<32:07, 17.36s/it] 67%|██████▋ | 226/336 [1:07:00<31:15, 17.05s/it] {'loss': 0.016, 'grad_norm': 0.17134831845760345, 'learning_rate': 5.140165867573939e-07, 'kl': 0.0503, 'entropy': -0.0815, 'ce_loss': 0.0059, 'epoch': 2.69} 67%|██████▋ | 226/336 [1:07:00<31:15, 17.05s/it] 68%|██████▊ | 227/336 [1:07:15<30:06, 16.57s/it] {'loss': 0.0171, 'grad_norm': 0.1890375167131424, 'learning_rate': 5.055912646834635e-07, 'kl': 0.0214, 'entropy': -0.0396, 'ce_loss': 0.0122, 'epoch': 2.7} 68%|██████▊ | 227/336 [1:07:15<30:06, 16.57s/it] 68%|██████▊ | 228/336 [1:07:37<32:35, 18.11s/it] {'loss': 0.0138, 'grad_norm': 0.15590184926986694, 'learning_rate': 4.972121398226371e-07, 'kl': 0.0334, 'entropy': 0.0209, 'ce_loss': 0.0108, 'epoch': 2.71} 68%|██████▊ | 228/336 [1:07:37<32:35, 18.11s/it] 68%|██████▊ | 229/336 [1:07:53<31:03, 17.41s/it] {'loss': 0.0198, 'grad_norm': 0.1960582584142685, 'learning_rate': 4.888799951145947e-07, 'kl': 0.0315, 'entropy': -0.0879, 'ce_loss': 0.0134, 'epoch': 2.73} 68%|██████▊ | 229/336 [1:07:53<31:03, 17.41s/it] 68%|██████▊ | 230/336 [1:08:08<29:42, 16.82s/it] {'loss': 0.0173, 'grad_norm': 0.19161897897720337, 'learning_rate': 4.805956091092227e-07, 'kl': 0.0208, 'entropy': -0.0532, 'ce_loss': 0.0134, 'epoch': 2.74} 68%|██████▊ | 230/336 [1:08:08<29:42, 16.82s/it] 69%|██████▉ | 231/336 [1:08:24<28:54, 16.52s/it] {'loss': 0.023, 'grad_norm': 0.2421635091304779, 'learning_rate': 4.7235975589386713e-07, 'kl': 0.0532, 'entropy': -0.0908, 'ce_loss': 0.0157, 'epoch': 2.75} 69%|██████▉ | 231/336 [1:08:24<28:54, 16.52s/it] 69%|██████▉ | 232/336 [1:08:40<28:11, 16.26s/it] {'loss': 0.0175, 'grad_norm': 0.2146768867969513, 'learning_rate': 4.641732050210031e-07, 'kl': 0.0703, 'entropy': 0.0269, 'ce_loss': 0.0164, 'epoch': 2.76} 
69%|██████▉ | 232/336 [1:08:40<28:11, 16.26s/it] 69%|██████▉ | 233/336 [1:08:56<27:43, 16.15s/it] {'loss': 0.0159, 'grad_norm': 0.19451335072517395, 'learning_rate': 4.5603672143632945e-07, 'kl': 0.0148, 'entropy': -0.0435, 'ce_loss': 0.019, 'epoch': 2.77} 69%|██████▉ | 233/336 [1:08:56<27:43, 16.15s/it] 70%|██████▉ | 234/336 [1:09:17<29:52, 17.58s/it] {'loss': 0.0164, 'grad_norm': 0.18908201158046722, 'learning_rate': 4.479510654072909e-07, 'kl': 0.0459, 'entropy': -0.0204, 'ce_loss': 0.0044, 'epoch': 2.79} 70%|██████▉ | 234/336 [1:09:17<29:52, 17.58s/it] 70%|██████▉ | 235/336 [1:09:36<30:19, 18.01s/it] {'loss': 0.0178, 'grad_norm': 0.18985359370708466, 'learning_rate': 4.399169924520403e-07, 'kl': 0.0356, 'entropy': -0.0447, 'ce_loss': 0.0174, 'epoch': 2.8} 70%|██████▉ | 235/336 [1:09:36<30:19, 18.01s/it] 70%|███████ | 236/336 [1:09:51<28:50, 17.30s/it] {'loss': 0.0162, 'grad_norm': 0.1735706329345703, 'learning_rate': 4.3193525326884426e-07, 'kl': 0.0131, 'entropy': -0.0542, 'ce_loss': 0.0136, 'epoch': 2.81} 70%|███████ | 236/336 [1:09:51<28:50, 17.30s/it] 71%|███████ | 237/336 [1:10:07<27:43, 16.80s/it] {'loss': 0.0227, 'grad_norm': 0.23269647359848022, 'learning_rate': 4.240065936659374e-07, 'kl': 0.0447, 'entropy': -0.0664, 'ce_loss': 0.0172, 'epoch': 2.82} 71%|███████ | 237/336 [1:10:07<27:43, 16.80s/it] 71%|███████ | 238/336 [1:10:23<27:02, 16.56s/it] {'loss': 0.0174, 'grad_norm': 0.1907908022403717, 'learning_rate': 4.1613175449183446e-07, 'kl': 0.0522, 'entropy': -0.027, 'ce_loss': 0.0101, 'epoch': 2.83} 71%|███████ | 238/336 [1:10:23<27:02, 16.56s/it] 71%|███████ | 239/336 [1:10:46<29:39, 18.35s/it] {'loss': 0.0164, 'grad_norm': 0.18256975710391998, 'learning_rate': 4.0831147156610676e-07, 'kl': 0.0352, 'entropy': -0.0698, 'ce_loss': 0.0075, 'epoch': 2.85} 71%|███████ | 239/336 [1:10:46<29:39, 18.35s/it] 71%|███████▏ | 240/336 [1:11:07<30:48, 19.25s/it] {'loss': 0.0142, 'grad_norm': 0.17074568569660187, 'learning_rate': 4.0054647561062615e-07, 'kl': 
0.0688, 'entropy': 0.0356, 'ce_loss': 0.0199, 'epoch': 2.86} 71%|███████▏ | 240/336 [1:11:07<30:48, 19.25s/it] 72%|███████▏ | 241/336 [1:11:23<28:52, 18.23s/it] {'loss': 0.0194, 'grad_norm': 0.19037066400051117, 'learning_rate': 3.928374921812888e-07, 'kl': 0.0447, 'entropy': -0.0684, 'ce_loss': 0.0054, 'epoch': 2.87} 72%|███████▏ | 241/336 [1:11:23<28:52, 18.23s/it] 72%|███████▏ | 242/336 [1:11:42<28:54, 18.46s/it] {'loss': 0.0185, 'grad_norm': 0.1752237230539322, 'learning_rate': 3.851852416002187e-07, 'kl': 0.0229, 'entropy': -0.0327, 'ce_loss': 0.0198, 'epoch': 2.88} 72%|███████▏ | 242/336 [1:11:42<28:54, 18.46s/it] 72%|███████▏ | 243/336 [1:12:00<28:25, 18.34s/it] {'loss': 0.0162, 'grad_norm': 0.20181013643741608, 'learning_rate': 3.7759043888846173e-07, 'kl': 0.0325, 'entropy': -0.0439, 'ce_loss': 0.0097, 'epoch': 2.89} 72%|███████▏ | 243/336 [1:12:00<28:25, 18.34s/it] 73%|███████▎ | 244/336 [1:12:19<28:35, 18.64s/it] {'loss': 0.0164, 'grad_norm': 0.20712324976921082, 'learning_rate': 3.7005379369917324e-07, 'kl': 0.0243, 'entropy': -0.0796, 'ce_loss': 0.012, 'epoch': 2.9} 73%|███████▎ | 244/336 [1:12:19<28:35, 18.64s/it] 73%|███████▎ | 245/336 [1:12:35<27:14, 17.96s/it] {'loss': 0.0186, 'grad_norm': 0.19520629942417145, 'learning_rate': 3.625760102513102e-07, 'kl': 0.0562, 'entropy': -0.0293, 'ce_loss': 0.0083, 'epoch': 2.92} 73%|███████▎ | 245/336 [1:12:35<27:14, 17.96s/it] 73%|███████▎ | 246/336 [1:12:54<27:05, 18.06s/it] {'loss': 0.0166, 'grad_norm': 0.19249661266803741, 'learning_rate': 3.551577872638296e-07, 'kl': 0.0854, 'entropy': -0.1016, 'ce_loss': 0.0133, 'epoch': 2.93} 73%|███████▎ | 246/336 [1:12:54<27:05, 18.06s/it] 74%|███████▎ | 247/336 [1:13:10<25:47, 17.38s/it] {'loss': 0.0222, 'grad_norm': 0.2110520452260971, 'learning_rate': 3.477998178903981e-07, 'kl': 0.0254, 'entropy': -0.0016, 'ce_loss': 0.0083, 'epoch': 2.94} 74%|███████▎ | 247/336 [1:13:10<25:47, 17.38s/it] 74%|███████▍ | 248/336 [1:13:25<24:47, 16.90s/it] {'loss': 0.02, 'grad_norm': 
0.21636487543582916, 'learning_rate': 3.4050278965462763e-07, 'kl': 0.0188, 'entropy': -0.0415, 'ce_loss': 0.0099, 'epoch': 2.95} 74%|███████▍ | 248/336 [1:13:25<24:47, 16.90s/it] 74%|███████▍ | 249/336 [1:13:47<26:31, 18.29s/it] {'loss': 0.0149, 'grad_norm': 0.1615242063999176, 'learning_rate': 3.3326738438583114e-07, 'kl': 0.03, 'entropy': -0.0845, 'ce_loss': 0.0152, 'epoch': 2.96} 74%|███████▍ | 249/336 [1:13:47<26:31, 18.29s/it] 74%|███████▍ | 250/336 [1:14:03<25:11, 17.58s/it] {'loss': 0.0184, 'grad_norm': 0.2158062607049942, 'learning_rate': 3.260942781553142e-07, 'kl': 0.04, 'entropy': -0.0505, 'ce_loss': 0.0089, 'epoch': 2.98} 74%|███████▍ | 250/336 [1:14:03<25:11, 17.58s/it] 75%|███████▍ | 251/336 [1:14:19<24:15, 17.13s/it] {'loss': 0.0187, 'grad_norm': 0.19854053854942322, 'learning_rate': 3.189841412132027e-07, 'kl': 0.0152, 'entropy': -0.0767, 'ce_loss': 0.0101, 'epoch': 2.99} 75%|███████▍ | 251/336 [1:14:19<24:15, 17.13s/it] 75%|███████▌ | 252/336 [1:14:38<24:43, 17.66s/it] {'loss': 0.0174, 'grad_norm': 0.211074098944664, 'learning_rate': 3.1193763792581594e-07, 'kl': 0.0262, 'entropy': -0.0339, 'ce_loss': 0.0075, 'epoch': 3.0} 75%|███████▌ | 252/336 [1:14:38<24:43, 17.66s/it] 75%|███████▌ | 253/336 [1:14:56<24:50, 17.96s/it] {'loss': 0.0118, 'grad_norm': 0.14704979956150055, 'learning_rate': 3.0495542671358744e-07, 'kl': 0.0293, 'entropy': -0.0493, 'ce_loss': 0.0126, 'epoch': 3.01} 75%|███████▌ | 253/336 [1:14:56<24:50, 17.96s/it] 76%|███████▌ | 254/336 [1:15:12<23:37, 17.29s/it] {'loss': 0.0189, 'grad_norm': 0.17956186830997467, 'learning_rate': 2.980381599895433e-07, 'kl': 0.0413, 'entropy': -0.008, 'ce_loss': 0.006, 'epoch': 3.02} 76%|███████▌ | 254/336 [1:15:12<23:37, 17.29s/it] 76%|███████▌ | 255/336 [1:15:29<23:01, 17.05s/it] {'loss': 0.0158, 'grad_norm': 0.1652800440788269, 'learning_rate': 2.91186484098342e-07, 'kl': 0.0447, 'entropy': -0.084, 'ce_loss': 0.0198, 'epoch': 3.04} 76%|███████▌ | 255/336 [1:15:29<23:01, 17.05s/it] 76%|███████▌ | 
256/336 [1:15:45<22:15, 16.70s/it] {'loss': 0.0172, 'grad_norm': 0.1860676407814026, 'learning_rate': 2.84401039255879e-07, 'kl': 0.1084, 'entropy': -0.1768, 'ce_loss': 0.0095, 'epoch': 3.05} 76%|███████▌ | 256/336 [1:15:45<22:15, 16.70s/it] 76%|███████▋ | 257/336 [1:16:04<23:02, 17.50s/it] {'loss': 0.0137, 'grad_norm': 0.17659200727939606, 'learning_rate': 2.776824594894661e-07, 'kl': 0.0298, 'entropy': -0.0393, 'ce_loss': 0.0067, 'epoch': 3.06} 76%|███████▋ | 257/336 [1:16:04<23:02, 17.50s/it] 77%|███████▋ | 258/336 [1:16:22<23:04, 17.75s/it] {'loss': 0.0139, 'grad_norm': 0.1678144633769989, 'learning_rate': 2.7103137257858863e-07, 'kl': 0.0044, 'entropy': -0.003, 'ce_loss': 0.0071, 'epoch': 3.07} 77%|███████▋ | 258/336 [1:16:22<23:04, 17.75s/it] 77%|███████▋ | 259/336 [1:16:41<23:17, 18.14s/it] {'loss': 0.0137, 'grad_norm': 0.13771401345729828, 'learning_rate': 2.644483999962449e-07, 'kl': -0.0025, 'entropy': -0.0101, 'ce_loss': 0.0261, 'epoch': 3.08} 77%|███████▋ | 259/336 [1:16:41<23:17, 18.14s/it] 77%|███████▋ | 260/336 [1:16:57<22:12, 17.53s/it] {'loss': 0.0142, 'grad_norm': 0.15306684374809265, 'learning_rate': 2.579341568508779e-07, 'kl': 0.0214, 'entropy': -0.0198, 'ce_loss': 0.0163, 'epoch': 3.1} 77%|███████▋ | 260/336 [1:16:57<22:12, 17.53s/it] 78%|███████▊ | 261/336 [1:17:13<21:16, 17.02s/it] {'loss': 0.0177, 'grad_norm': 0.19758057594299316, 'learning_rate': 2.514892518288988e-07, 'kl': 0.041, 'entropy': -0.0322, 'ce_loss': 0.0069, 'epoch': 3.11} 78%|███████▊ | 261/336 [1:17:13<21:16, 17.02s/it] 78%|███████▊ | 262/336 [1:17:29<20:28, 16.60s/it] {'loss': 0.0178, 'grad_norm': 0.17641344666481018, 'learning_rate': 2.4511428713781236e-07, 'kl': -0.0012, 'entropy': -0.0288, 'ce_loss': 0.0192, 'epoch': 3.12} 78%|███████▊ | 262/336 [1:17:29<20:28, 16.60s/it] 78%|███████▊ | 263/336 [1:17:49<21:18, 17.52s/it] {'loss': 0.0137, 'grad_norm': 0.16513672471046448, 'learning_rate': 2.3880985844994673e-07, 'kl': 0.0603, 'entropy': -0.0781, 'ce_loss': 0.0111, 'epoch': 
3.13} 78%|███████▊ | 263/336 [1:17:49<21:18, 17.52s/it] 79%|███████▊ | 264/336 [1:18:07<21:32, 17.95s/it] {'loss': 0.013, 'grad_norm': 0.15065506100654602, 'learning_rate': 2.3257655484679372e-07, 'kl': 0.022, 'entropy': -0.04, 'ce_loss': 0.0199, 'epoch': 3.14} 79%|███████▊ | 264/336 [1:18:07<21:32, 17.95s/it] 79%|███████▉ | 265/336 [1:18:23<20:30, 17.33s/it] {'loss': 0.0147, 'grad_norm': 0.15785431861877441, 'learning_rate': 2.264149587639671e-07, 'kl': 0.0439, 'entropy': -0.0845, 'ce_loss': 0.0042, 'epoch': 3.15} 79%|███████▉ | 265/336 [1:18:23<20:30, 17.33s/it] 79%|███████▉ | 266/336 [1:18:39<19:47, 16.97s/it] {'loss': 0.0135, 'grad_norm': 0.15365606546401978, 'learning_rate': 2.2032564593677772e-07, 'kl': 0.0525, 'entropy': -0.043, 'ce_loss': 0.0035, 'epoch': 3.17} 79%|███████▉ | 266/336 [1:18:39<19:47, 16.97s/it] 79%|███████▉ | 267/336 [1:18:56<19:22, 16.85s/it] {'loss': 0.0169, 'grad_norm': 0.16480816900730133, 'learning_rate': 2.1430918534643994e-07, 'kl': 0.0474, 'entropy': -0.0679, 'ce_loss': 0.0065, 'epoch': 3.18} 79%|███████▉ | 267/336 [1:18:56<19:22, 16.85s/it] 80%|███████▉ | 268/336 [1:19:15<19:46, 17.45s/it] {'loss': 0.0129, 'grad_norm': 0.15277600288391113, 'learning_rate': 2.0836613916690427e-07, 'kl': 0.0884, 'entropy': -0.1602, 'ce_loss': 0.006, 'epoch': 3.19} 80%|███████▉ | 268/336 [1:19:15<19:46, 17.45s/it] 80%|████████ | 269/336 [1:19:34<19:54, 17.83s/it] {'loss': 0.0149, 'grad_norm': 0.19228748977184296, 'learning_rate': 2.0249706271232946e-07, 'kl': 0.0231, 'entropy': 0.0119, 'ce_loss': 0.0048, 'epoch': 3.2} 80%|████████ | 269/336 [1:19:34<19:54, 17.83s/it] 80%|████████ | 270/336 [1:19:52<19:49, 18.02s/it] {'loss': 0.0138, 'grad_norm': 0.1653250902891159, 'learning_rate': 1.9670250438519386e-07, 'kl': 0.0461, 'entropy': -0.0708, 'ce_loss': 0.023, 'epoch': 3.21} 80%|████████ | 270/336 [1:19:52<19:49, 18.02s/it] 81%|████████ | 271/336 [1:20:11<19:50, 18.32s/it] {'loss': 0.0155, 'grad_norm': 0.1669386476278305, 'learning_rate': 
1.9098300562505264e-07, 'kl': 0.0688, 'entropy': -0.0654, 'ce_loss': 0.0135, 'epoch': 3.23} 81%|████████ | 271/336 [1:20:11<19:50, 18.32s/it] 81%|████████ | 272/336 [1:20:31<20:02, 18.79s/it] {'loss': 0.0136, 'grad_norm': 0.1864534616470337, 'learning_rate': 1.8533910085794713e-07, 'kl': 0.0576, 'entropy': -0.0679, 'ce_loss': 0.0091, 'epoch': 3.24} 81%|████████ | 272/336 [1:20:31<20:02, 18.79s/it] 81%|████████▏ | 273/336 [1:20:53<20:52, 19.89s/it] {'loss': 0.0127, 'grad_norm': 0.14990662038326263, 'learning_rate': 1.7977131744646724e-07, 'kl': 0.0674, 'entropy': -0.0835, 'ce_loss': 0.0045, 'epoch': 3.25} 81%|████████▏ | 273/336 [1:20:53<20:52, 19.89s/it] 82%|████████▏ | 274/336 [1:21:10<19:24, 18.79s/it] {'loss': 0.0164, 'grad_norm': 0.1857193112373352, 'learning_rate': 1.742801756404759e-07, 'kl': 0.0786, 'entropy': -0.0771, 'ce_loss': 0.0023, 'epoch': 3.26} 82%|████████▏ | 274/336 [1:21:10<19:24, 18.79s/it] 82%|████████▏ | 275/336 [1:21:25<18:08, 17.85s/it] {'loss': 0.0159, 'grad_norm': 0.17385807633399963, 'learning_rate': 1.688661885284972e-07, 'kl': 0.0488, 'entropy': -0.0184, 'ce_loss': 0.0044, 'epoch': 3.27} 82%|████████▏ | 275/336 [1:21:25<18:08, 17.85s/it] 82%|████████▏ | 276/336 [1:21:41<17:11, 17.19s/it] {'loss': 0.0152, 'grad_norm': 0.17598123848438263, 'learning_rate': 1.6352986198977325e-07, 'kl': 0.0383, 'entropy': -0.0757, 'ce_loss': 0.024, 'epoch': 3.29} 82%|████████▏ | 276/336 [1:21:41<17:11, 17.19s/it] 82%|████████▏ | 277/336 [1:21:57<16:29, 16.78s/it] {'loss': 0.0136, 'grad_norm': 0.1699410378932953, 'learning_rate': 1.5827169464699575e-07, 'kl': 0.0116, 'entropy': -0.0452, 'ce_loss': 0.0106, 'epoch': 3.3} 82%|████████▏ | 277/336 [1:21:57<16:29, 16.78s/it] 83%|████████▎ | 278/336 [1:22:16<17:01, 17.61s/it] {'loss': 0.0121, 'grad_norm': 0.15897028148174286, 'learning_rate': 1.5309217781971416e-07, 'kl': 0.0596, 'entropy': -0.0413, 'ce_loss': 0.0116, 'epoch': 3.31} 83%|████████▎ | 278/336 [1:22:16<17:01, 17.61s/it] 83%|████████▎ | 279/336 
[1:22:41<18:48, 19.79s/it] {'loss': 0.0116, 'grad_norm': 0.13833433389663696, 'learning_rate': 1.479917954784282e-07, 'kl': 0.0522, 'entropy': -0.1055, 'ce_loss': 0.012, 'epoch': 3.32} 83%|████████▎ | 279/336 [1:22:41<18:48, 19.79s/it] 83%|████████▎ | 280/336 [1:22:57<17:19, 18.57s/it] {'loss': 0.0151, 'grad_norm': 0.19401074945926666, 'learning_rate': 1.429710241993656e-07, 'kl': 0.062, 'entropy': -0.0366, 'ce_loss': 0.0134, 'epoch': 3.33} 83%|████████▎ | 280/336 [1:22:57<17:19, 18.57s/it] 84%|████████▎ | 281/336 [1:23:15<16:49, 18.36s/it] {'loss': 0.0148, 'grad_norm': 0.1647351086139679, 'learning_rate': 1.380303331199507e-07, 'kl': 0.0432, 'entropy': -0.0352, 'ce_loss': 0.0142, 'epoch': 3.35} 84%|████████▎ | 281/336 [1:23:15<16:49, 18.36s/it] 84%|████████▍ | 282/336 [1:23:34<16:46, 18.64s/it] {'loss': 0.0123, 'grad_norm': 0.14924506843090057, 'learning_rate': 1.3317018389496926e-07, 'kl': 0.0398, 'entropy': -0.0535, 'ce_loss': 0.0106, 'epoch': 3.36} 84%|████████▍ | 282/336 [1:23:34<16:46, 18.64s/it] 84%|████████▍ | 283/336 [1:23:50<15:42, 17.78s/it] {'loss': 0.014, 'grad_norm': 0.17482659220695496, 'learning_rate': 1.283910306534308e-07, 'kl': 0.0026, 'entropy': -0.0491, 'ce_loss': 0.0243, 'epoch': 3.37} 84%|████████▍ | 283/336 [1:23:50<15:42, 17.78s/it] 85%|████████▍ | 284/336 [1:24:06<14:54, 17.21s/it] {'loss': 0.0147, 'grad_norm': 0.15375792980194092, 'learning_rate': 1.2369331995613663e-07, 'kl': 0.0017, 'entropy': 0.0039, 'ce_loss': 0.0138, 'epoch': 3.38} 85%|████████▍ | 284/336 [1:24:06<14:54, 17.21s/it] 85%|████████▍ | 285/336 [1:24:21<14:14, 16.76s/it] {'loss': 0.0187, 'grad_norm': 0.20818421244621277, 'learning_rate': 1.1907749075395146e-07, 'kl': 0.0664, 'entropy': -0.0408, 'ce_loss': 0.0133, 'epoch': 3.39} 85%|████████▍ | 285/336 [1:24:21<14:14, 16.76s/it] 85%|████████▌ | 286/336 [1:24:43<15:12, 18.25s/it] {'loss': 0.0126, 'grad_norm': 0.14181439578533173, 'learning_rate': 1.145439743467902e-07, 'kl': 0.0742, 'entropy': -0.0859, 'ce_loss': 0.0047, 
'epoch': 3.4} 85%|████████▌ | 286/336 [1:24:43<15:12, 18.25s/it] 85%|████████▌ | 287/336 [1:25:02<15:09, 18.56s/it] {'loss': 0.0153, 'grad_norm': 0.18021345138549805, 'learning_rate': 1.1009319434331621e-07, 'kl': 0.0255, 'entropy': -0.0505, 'ce_loss': 0.0076, 'epoch': 3.42} 85%|████████▌ | 287/336 [1:25:02<15:09, 18.56s/it] 86%|████████▌ | 288/336 [1:25:21<14:56, 18.68s/it] {'loss': 0.0139, 'grad_norm': 0.17935115098953247, 'learning_rate': 1.0572556662136035e-07, 'kl': 0.0238, 'entropy': -0.1021, 'ce_loss': 0.0081, 'epoch': 3.43} 86%|████████▌ | 288/336 [1:25:21<14:56, 18.68s/it] 86%|████████▌ | 289/336 [1:25:37<13:53, 17.74s/it] {'loss': 0.0191, 'grad_norm': 0.20373126864433289, 'learning_rate': 1.014414992890611e-07, 'kl': 0.085, 'entropy': -0.084, 'ce_loss': 0.0068, 'epoch': 3.44} 86%|████████▌ | 289/336 [1:25:37<13:53, 17.74s/it] 86%|████████▋ | 290/336 [1:25:56<13:53, 18.12s/it] {'loss': 0.0157, 'grad_norm': 0.1873011291027069, 'learning_rate': 9.724139264673114e-08, 'kl': 0.033, 'entropy': -0.0723, 'ce_loss': 0.001, 'epoch': 3.45} 86%|████████▋ | 290/336 [1:25:56<13:53, 18.12s/it] 87%|████████▋ | 291/336 [1:26:16<14:03, 18.75s/it] {'loss': 0.0112, 'grad_norm': 0.1414089798927307, 'learning_rate': 9.312563914945459e-08, 'kl': 0.0728, 'entropy': -0.0486, 'ce_loss': 0.0065, 'epoch': 3.46} 87%|████████▋ | 291/336 [1:26:16<14:03, 18.75s/it] 87%|████████▋ | 292/336 [1:26:34<13:36, 18.56s/it] {'loss': 0.0178, 'grad_norm': 0.20066796243190765, 'learning_rate': 8.909462337041507e-08, 'kl': 0.0359, 'entropy': -0.015, 'ce_loss': 0.0083, 'epoch': 3.48} 87%|████████▋ | 292/336 [1:26:34<13:36, 18.56s/it] 87%|████████▋ | 293/336 [1:26:53<13:20, 18.62s/it] {'loss': 0.014, 'grad_norm': 0.15551452338695526, 'learning_rate': 8.514872196496181e-08, 'kl': 0.042, 'entropy': -0.0068, 'ce_loss': 0.0015, 'epoch': 3.49} 87%|████████▋ | 293/336 [1:26:53<13:20, 18.62s/it] 88%|████████▊ | 294/336 [1:27:09<12:26, 17.77s/it] {'loss': 0.0155, 'grad_norm': 0.18158553540706635, 
'learning_rate': 8.128830363541572e-08, 'kl': 0.0598, 'entropy': -0.0439, 'ce_loss': 0.0056, 'epoch': 3.5} 88%|████████▊ | 294/336 [1:27:09<12:26, 17.77s/it] 88%|████████▊ | 295/336 [1:27:27<12:15, 17.94s/it] {'loss': 0.0161, 'grad_norm': 0.18261036276817322, 'learning_rate': 7.751372909661768e-08, 'kl': 0.0041, 'entropy': -0.0078, 'ce_loss': 0.0069, 'epoch': 3.51} 88%|████████▊ | 295/336 [1:27:27<12:15, 17.94s/it] 88%|████████▊ | 296/336 [1:27:44<11:40, 17.50s/it] {'loss': 0.0117, 'grad_norm': 0.31206467747688293, 'learning_rate': 7.382535104222364e-08, 'kl': -0.0017, 'entropy': -0.0193, 'ce_loss': 0.0172, 'epoch': 3.52} 88%|████████▊ | 296/336 [1:27:44<11:40, 17.50s/it] 88%|████████▊ | 297/336 [1:28:02<11:30, 17.70s/it] {'loss': 0.0139, 'grad_norm': 0.1553187072277069, 'learning_rate': 7.022351411174865e-08, 'kl': -0.006, 'entropy': -0.0374, 'ce_loss': 0.015, 'epoch': 3.54} 88%|████████▊ | 297/336 [1:28:02<11:30, 17.70s/it] 89%|████████▊ | 298/336 [1:28:22<11:36, 18.32s/it] {'loss': 0.0126, 'grad_norm': 0.15778586268424988, 'learning_rate': 6.670855485836524e-08, 'kl': 0.1206, 'entropy': -0.1113, 'ce_loss': 0.008, 'epoch': 3.55} 89%|████████▊ | 298/336 [1:28:22<11:36, 18.32s/it] 89%|████████▉ | 299/336 [1:28:42<11:35, 18.81s/it] {'loss': 0.0141, 'grad_norm': 0.19020713865756989, 'learning_rate': 6.328080171745509e-08, 'kl': 0.0229, 'entropy': -0.032, 'ce_loss': 0.0108, 'epoch': 3.56} 89%|████████▉ | 299/336 [1:28:42<11:35, 18.81s/it] 89%|████████▉ | 300/336 [1:28:58<10:50, 18.08s/it] {'loss': 0.0159, 'grad_norm': 0.1756381243467331, 'learning_rate': 5.994057497592031e-08, 'kl': 0.085, 'entropy': -0.0796, 'ce_loss': 0.0052, 'epoch': 3.57} 89%|████████▉ | 300/336 [1:28:58<10:50, 18.08s/it] 90%|████████▉ | 301/336 [1:29:14<10:07, 17.35s/it] {'loss': 0.0158, 'grad_norm': 0.18305473029613495, 'learning_rate': 5.6688186742256835e-08, 'kl': 0.0525, 'entropy': -0.054, 'ce_loss': 0.0216, 'epoch': 3.58} 90%|████████▉ | 301/336 [1:29:14<10:07, 17.35s/it] 90%|████████▉ | 
302/336 [1:29:30<09:42, 17.14s/it] {'loss': 0.0158, 'grad_norm': 0.1584191471338272, 'learning_rate': 5.352394091739021e-08, 'kl': 0.0444, 'entropy': -0.0796, 'ce_loss': 0.0089, 'epoch': 3.6} 90%|████████▉ | 302/336 [1:29:30<09:42, 17.14s/it] 90%|█████████ | 303/336 [1:29:49<09:42, 17.66s/it] {'loss': 0.0154, 'grad_norm': 0.17178170382976532, 'learning_rate': 5.0448133166279935e-08, 'kl': 0.0337, 'entropy': -0.0942, 'ce_loss': 0.0075, 'epoch': 3.61} 90%|█████████ | 303/336 [1:29:49<09:42, 17.66s/it] 90%|█████████ | 304/336 [1:30:05<09:07, 17.10s/it] {'loss': 0.016, 'grad_norm': 0.18547679483890533, 'learning_rate': 4.746105089029229e-08, 'kl': 0.0601, 'entropy': -0.0037, 'ce_loss': 0.0058, 'epoch': 3.62} 90%|█████████ | 304/336 [1:30:05<09:07, 17.10s/it] 91%|█████████ | 305/336 [1:30:24<09:04, 17.58s/it] {'loss': 0.0139, 'grad_norm': 0.16790246963500977, 'learning_rate': 4.456297320034641e-08, 'kl': 0.0542, 'entropy': -0.0459, 'ce_loss': 0.0075, 'epoch': 3.63} 91%|█████████ | 305/336 [1:30:24<09:04, 17.58s/it] 91%|█████████ | 306/336 [1:30:39<08:29, 17.00s/it] {'loss': 0.0143, 'grad_norm': 0.29298320412635803, 'learning_rate': 4.1754170890833774e-08, 'kl': 0.0344, 'entropy': -0.0493, 'ce_loss': 0.0171, 'epoch': 3.64} 91%|█████████ | 306/336 [1:30:39<08:29, 17.00s/it] 91%|█████████▏| 307/336 [1:30:58<08:30, 17.61s/it] {'loss': 0.0104, 'grad_norm': 0.13091427087783813, 'learning_rate': 3.9034906414315725e-08, 'kl': 0.0166, 'entropy': -0.0068, 'ce_loss': 0.0246, 'epoch': 3.65} 91%|█████████▏| 307/336 [1:30:58<08:30, 17.61s/it] 92%|█████████▏| 308/336 [1:31:14<07:56, 17.01s/it] {'loss': 0.0151, 'grad_norm': 0.18825221061706543, 'learning_rate': 3.6405433856999676e-08, 'kl': 0.0635, 'entropy': -0.0425, 'ce_loss': 0.0116, 'epoch': 3.67} 92%|█████████▏| 308/336 [1:31:14<07:56, 17.01s/it] 92%|█████████▏| 309/336 [1:31:32<07:50, 17.43s/it] {'loss': 0.0167, 'grad_norm': 0.20501720905303955, 'learning_rate': 3.386599891499764e-08, 'kl': 0.062, 'entropy': -0.0437, 'ce_loss': 
0.0109, 'epoch': 3.68} 92%|█████████▏| 309/336 [1:31:32<07:50, 17.43s/it] 92%|█████████▏| 310/336 [1:31:52<07:48, 18.01s/it] {'loss': 0.0119, 'grad_norm': 0.14792554080486298, 'learning_rate': 3.141683887136892e-08, 'kl': 0.0537, 'entropy': -0.0369, 'ce_loss': 0.0092, 'epoch': 3.69} 92%|█████████▏| 310/336 [1:31:52<07:48, 18.01s/it] 93%|█████████▎| 311/336 [1:32:10<07:36, 18.25s/it] {'loss': 0.015, 'grad_norm': 0.18962831795215607, 'learning_rate': 2.9058182573947986e-08, 'kl': 0.1035, 'entropy': -0.0718, 'ce_loss': 0.0211, 'epoch': 3.7} 93%|█████████▎| 311/336 [1:32:10<07:36, 18.25s/it] 93%|█████████▎| 312/336 [1:32:29<07:23, 18.46s/it] {'loss': 0.0149, 'grad_norm': 0.1783580332994461, 'learning_rate': 2.6790250413961546e-08, 'kl': 0.0337, 'entropy': -0.0486, 'ce_loss': 0.0158, 'epoch': 3.71} 93%|█████████▎| 312/336 [1:32:29<07:23, 18.46s/it] 93%|█████████▎| 313/336 [1:32:45<06:46, 17.69s/it] {'loss': 0.0193, 'grad_norm': 0.21417374908924103, 'learning_rate': 2.4613254305434815e-08, 'kl': 0.0208, 'entropy': -0.0405, 'ce_loss': 0.0144, 'epoch': 3.73} 93%|█████████▎| 313/336 [1:32:45<06:46, 17.69s/it] 93%|█████████▎| 314/336 [1:33:04<06:36, 18.04s/it] {'loss': 0.0133, 'grad_norm': 0.1617756336927414, 'learning_rate': 2.2527397665391024e-08, 'kl': 0.0669, 'entropy': -0.0703, 'ce_loss': 0.0188, 'epoch': 3.74} 93%|█████████▎| 314/336 [1:33:04<06:36, 18.04s/it] 94%|█████████▍| 315/336 [1:33:20<06:05, 17.39s/it] {'loss': 0.0142, 'grad_norm': 0.17499667406082153, 'learning_rate': 2.053287539484405e-08, 'kl': 0.0322, 'entropy': -0.0269, 'ce_loss': 0.0077, 'epoch': 3.75} 94%|█████████▍| 315/336 [1:33:20<06:05, 17.39s/it] 94%|█████████▍| 316/336 [1:33:36<05:37, 16.88s/it] {'loss': 0.0168, 'grad_norm': 0.19160637259483337, 'learning_rate': 1.8629873860586564e-08, 'kl': 0.0175, 'entropy': -0.0879, 'ce_loss': 0.0451, 'epoch': 3.76} 94%|█████████▍| 316/336 [1:33:36<05:37, 16.88s/it] 94%|█████████▍| 317/336 [1:33:55<05:33, 17.56s/it] {'loss': 0.0125, 'grad_norm': 
0.16696055233478546, 'learning_rate': 1.6818570877776718e-08, 'kl': 0.03, 'entropy': -0.0098, 'ce_loss': 0.0167, 'epoch': 3.77}
95%|█████████▍| 318/336 [1:34:14<05:24, 18.05s/it] {'loss': 0.0158, 'grad_norm': 0.17223559319972992, 'learning_rate': 1.5099135693322773e-08, 'kl': 0.0771, 'entropy': -0.0261, 'ce_loss': 0.0142, 'epoch': 3.79}
95%|█████████▍| 319/336 [1:34:30<04:54, 17.35s/it] {'loss': 0.019, 'grad_norm': 0.21838437020778656, 'learning_rate': 1.3471728970068985e-08, 'kl': 0.0713, 'entropy': -0.085, 'ce_loss': 0.0093, 'epoch': 3.8}
95%|█████████▌| 320/336 [1:34:46<04:32, 17.02s/it] {'loss': 0.0146, 'grad_norm': 0.16921240091323853, 'learning_rate': 1.1936502771783486e-08, 'kl': 0.0645, 'entropy': -0.05, 'ce_loss': 0.0086, 'epoch': 3.81}
96%|█████████▌| 321/336 [1:35:02<04:08, 16.58s/it] {'loss': 0.0141, 'grad_norm': 0.1600177139043808, 'learning_rate': 1.0493600548948877e-08, 'kl': 0.0294, 'entropy': -0.0142, 'ce_loss': 0.006, 'epoch': 3.82}
96%|█████████▌| 322/336 [1:35:17<03:47, 16.24s/it] {'loss': 0.0183, 'grad_norm': 0.20258741080760956, 'learning_rate': 9.143157125359513e-09, 'kl': 0.1006, 'entropy': -0.0801, 'ce_loss': 0.0065, 'epoch': 3.83}
96%|█████████▌| 323/336 [1:35:35<03:38, 16.80s/it] {'loss': 0.0154, 'grad_norm': 0.18779589235782623, 'learning_rate': 7.885298685522235e-09, 'kl': 0.0339, 'entropy': -0.0284, 'ce_loss': 0.0106, 'epoch': 3.85}
96%|█████████▋| 324/336 [1:35:52<03:20, 16.72s/it] {'loss': 0.0146, 'grad_norm': 0.16684141755104065, 'learning_rate': 6.720142762867032e-09, 'kl': 0.165, 'entropy': -0.1553, 'ce_loss': 0.0074, 'epoch': 3.86}
97%|█████████▋| 325/336 [1:36:14<03:21, 18.30s/it] {'loss': 0.0105, 'grad_norm': 0.135105162858963, 'learning_rate': 5.647798228764156e-09, 'kl': 0.011, 'entropy': -0.1157, 'ce_loss': 0.0246, 'epoch': 3.87}
97%|█████████▋| 326/336 [1:36:30<02:55, 17.57s/it] {'loss': 0.0142, 'grad_norm': 0.16241519153118134, 'learning_rate': 4.668365282351372e-09, 'kl': 0.028, 'entropy': -0.0454, 'ce_loss': 0.015, 'epoch': 3.88}
97%|█████████▋| 327/336 [1:36:45<02:33, 17.04s/it] {'loss': 0.0152, 'grad_norm': 0.17639409005641937, 'learning_rate': 3.7819354411713355e-09, 'kl': 0.0859, 'entropy': -0.0771, 'ce_loss': 0.0077, 'epoch': 3.89}
98%|█████████▊| 328/336 [1:37:04<02:20, 17.61s/it] {'loss': 0.0143, 'grad_norm': 0.1601157784461975, 'learning_rate': 2.9885915326203216e-09, 'kl': -0.0061, 'entropy': -0.0366, 'ce_loss': 0.0077, 'epoch': 3.9}
98%|█████████▊| 329/336 [1:37:20<01:59, 17.07s/it] {'loss': 0.0152, 'grad_norm': 0.17293810844421387, 'learning_rate': 2.2884076862089707e-09, 'kl': 0.0422, 'entropy': -0.0679, 'ce_loss': 0.016, 'epoch': 3.92}
98%|█████████▊| 330/336 [1:37:40<01:46, 17.78s/it] {'loss': 0.0151, 'grad_norm': 0.17773842811584473, 'learning_rate': 1.6814493266357199e-09, 'kl': 0.02, 'entropy': -0.043, 'ce_loss': 0.022, 'epoch': 3.93}
99%|█████████▊| 331/336 [1:37:55<01:25, 17.17s/it] {'loss': 0.0152, 'grad_norm': 0.17476435005664825, 'learning_rate': 1.1677731676733581e-09, 'kl': 0.0361, 'entropy': 0.0036, 'ce_loss': 0.0109, 'epoch': 3.94}
99%|█████████▉| 332/336 [1:38:11<01:06, 16.74s/it] {'loss': 0.0142, 'grad_norm': 0.17168329656124115, 'learning_rate': 7.474272068698217e-10, 'kl': 0.0532, 'entropy':
-0.0459, 'ce_loss': 0.002, 'epoch': 3.95}
99%|█████████▉| 333/336 [1:38:30<00:52, 17.55s/it] {'loss': 0.011, 'grad_norm': 0.13916146755218506, 'learning_rate': 4.204507210633368e-10, 'kl': 0.0703, 'entropy': -0.1523, 'ce_loss': 0.0008, 'epoch': 3.96}
99%|█████████▉| 334/336 [1:38:49<00:35, 17.93s/it] {'loss': 0.0139, 'grad_norm': 0.18326641619205475, 'learning_rate': 1.8687426271246642e-10, 'kl': 0.0703, 'entropy': -0.0417, 'ce_loss': 0.0108, 'epoch': 3.98}
100%|█████████▉| 335/336 [1:39:08<00:18, 18.28s/it] {'loss': 0.0145, 'grad_norm': 0.170999675989151, 'learning_rate': 4.6719657041283115e-11, 'kl': 0.0378, 'entropy': -0.0938, 'ce_loss': 0.0105, 'epoch': 3.99}
100%|██████████| 336/336 [1:39:24<00:00, 17.54s/it] {'loss': 0.014, 'grad_norm': 0.1720617562532425, 'learning_rate': 0.0, 'kl': 0.0508, 'entropy': -0.0742, 'ce_loss': 0.0039, 'epoch': 4.0}
[INFO|trainer.py:2665] 2025-04-13 16:37:52,456 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 5964.6993, 'train_samples_per_second': 1.803, 'train_steps_per_second': 0.056, 'train_loss': 0.021451761316885018, 'epoch': 4.0}
100%|██████████| 336/336 [1:39:24<00:00, 17.75s/it]
[INFO|trainer.py:3966] 2025-04-13 16:38:46,628 >> Saving model checkpoint to /home/stern/GRPO/offline_rl_v2/output
[INFO|configuration_utils.py:423] 2025-04-13 16:38:46,631 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/config.json
[INFO|configuration_utils.py:908] 2025-04-13 16:38:46,632 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/generation_config.json
[2025-04-13 16:38:59,677] [INFO] [launch.py:351:main] Process 1025164 exits successfully.
[2025-04-13 16:39:03,682] [INFO] [launch.py:351:main] Process 1025167 exits successfully.
[2025-04-13 16:39:07,686] [INFO] [launch.py:351:main] Process 1025168 exits successfully.
[2025-04-13 16:39:11,691] [INFO] [launch.py:351:main] Process 1025163 exits successfully.
[2025-04-13 16:39:16,696] [INFO] [launch.py:351:main] Process 1025166 exits successfully.
[2025-04-13 16:39:20,700] [INFO] [launch.py:351:main] Process 1025162 exits successfully.
[2025-04-13 16:39:24,705] [INFO] [launch.py:351:main] Process 1025165 exits successfully.
[INFO|modeling_utils.py:3594] 2025-04-13 16:40:11,943 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 14 checkpoint shards. You can find where each parameters has been saved in the index located at /home/stern/GRPO/offline_rl_v2/output/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-04-13 16:40:11,945 >> tokenizer config file saved in /home/stern/GRPO/offline_rl_v2/output/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-04-13 16:40:11,946 >> Special tokens file saved in /home/stern/GRPO/offline_rl_v2/output/special_tokens_map.json
***** train metrics *****
  epoch                    =        4.0
  total_flos               =    86240GF
  train_loss               =     0.0215
  train_runtime            = 1:39:24.69
  train_samples            =       2688
  train_samples_per_second =      1.803
  train_steps_per_second   =      0.056
[2025-04-13 16:40:24,765] [INFO] [launch.py:351:main] Process 1025161 exits successfully.
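
The step count, throughput, and learning-rate values in this log can be cross-checked with a little arithmetic. The sketch below is not part of the run. It assumes the scheduler is a linear-warmup cosine decay with the warmup length rounded up from `warmup_ratio * total_steps`, and that the logged `learning_rate` at step N is the schedule evaluated at N. The log does not state either of these directly. All input values are copied from the launch command and the final metrics block; the function name `cosine_lr` is made up for illustration.

```python
import math

# Values read from the launch command and the final metrics block of this log.
train_samples = 2688
num_gpus = 8
per_device_batch = 1
grad_accum = 4
epochs = 4
peak_lr = 2e-6
warmup_ratio = 0.03
train_runtime_s = 5964.6993

# Optimizer steps: samples per epoch divided by the effective global batch.
global_batch = num_gpus * per_device_batch * grad_accum
steps_per_epoch = train_samples // global_batch
total_steps = steps_per_epoch * epochs

# Assumption: warmup length is the ceiling of warmup_ratio * total_steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def cosine_lr(step: int) -> float:
    """Linear warmup followed by half-cosine decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Throughput as reported by the trainer: samples seen over all epochs per second.
samples_per_second = train_samples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

if __name__ == "__main__":
    print("total optimizer steps:", total_steps)
    print("warmup steps (assumed):", warmup_steps)
    print("lr at step 335:", cosine_lr(335))
    print("samples/s:", round(samples_per_second, 3))
    print("steps/s:", round(steps_per_second, 3))
```

Under these assumptions the computed step count agrees with the `336/336` progress bar, and the throughput figures round to the reported `1.803` and `0.056`. The computed rate at step 335 agrees with the logged `4.6719657041283115e-11` to within rounding, which suggests the assumed schedule is at least consistent with this run.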