| [2025-04-10 16:53:49,566] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:51,612] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. | |
| Detected VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7 | |
| [2025-04-10 16:53:51,612] [INFO] [runner.py:605:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed scripts/zero2.json --seed 42 --model_name_or_path /home/stern/GRPO/saved_models/s1K-7B --train_tokenized_file /home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl --output_dir /home/stern/GRPO/offline_rl_v2/output --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy no --learning_rate 2e-5 --lr_scheduler_type cosine --save_only_model True --remove_unused_columns False --warmup_ratio 0.03 --num_train_epochs 3 --logging_steps 1 --report_to tensorboard --gradient_checkpointing True --overwrite_output_dir --bf16 True | |
| [2025-04-10 16:53:53,051] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:55,047] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} | |
| [2025-04-10 16:53:55,047] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0 | |
| [2025-04-10 16:53:55,047] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) | |
| [2025-04-10 16:53:55,047] [INFO] [launch.py:164:main] dist_world_size=8 | |
| [2025-04-10 16:53:55,047] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 | |
| [2025-04-10 16:53:55,048] [INFO] [launch.py:256:main] process 501939 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,048] [INFO] [launch.py:256:main] process 501940 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501941 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501942 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,049] [INFO] [launch.py:256:main] process 501943 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501944 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501945 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:55,050] [INFO] [launch.py:256:main] process 501946 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--deepspeed', 'scripts/zero2.json', '--seed', '42', '--model_name_or_path', '/home/stern/GRPO/saved_models/s1K-7B', '--train_tokenized_file', '/home/stern/GRPO/offline_rl_v2/data/limo_7B_pure_neg.jsonl', '--output_dir', '/home/stern/GRPO/offline_rl_v2/output', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--evaluation_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--save_only_model', 'True', '--remove_unused_columns', 'False', '--warmup_ratio', '0.03', '--num_train_epochs', '3', '--logging_steps', '1', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--overwrite_output_dir', '--bf16', 'True'] | |
| [2025-04-10 16:53:59,730] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:59,770] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:59,873] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:59,903] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:53:59,955] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:54:00,030] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:54:00,040] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| [2025-04-10 16:54:00,042] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:01,829] [INFO] [comm.py:658:init_distributed] cdb=None | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,044] [INFO] [comm.py:658:init_distributed] cdb=None | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,079] [INFO] [comm.py:658:init_distributed] cdb=None | |
| [2025-04-10 16:54:02,080] [INFO] [comm.py:658:init_distributed] cdb=None | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,086] [INFO] [comm.py:658:init_distributed] cdb=None | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,133] [INFO] [comm.py:658:init_distributed] cdb=None | |
| [2025-04-10 16:54:02,133] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,174] [INFO] [comm.py:658:init_distributed] cdb=None | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
| [2025-04-10 16:54:02,208] [INFO] [comm.py:658:init_distributed] cdb=None | |
| WARNING:__main__:Process rank: 4, device: cuda:4, n_gpu: 1 | |
| WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1 | |
| [WARNING|logging.py:329] 2025-04-10 16:54:02,948 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:02,950 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 5.51it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 3.63it/s]WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1 | |
| WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1 | |
| INFO:__main__:Training parameters CustomTrainingArguments( | |
| _n_gpu=1, | |
| accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, | |
| adafactor=False, | |
| adam_beta1=0.9, | |
| adam_beta2=0.999, | |
| adam_epsilon=1e-08, | |
| auto_find_batch_size=False, | |
| average_tokens_across_devices=False, | |
| batch_eval_metrics=False, | |
| bf16=True, | |
| bf16_full_eval=False, | |
| data_seed=None, | |
| dataloader_drop_last=False, | |
| dataloader_num_workers=0, | |
| dataloader_persistent_workers=False, | |
| dataloader_pin_memory=True, | |
| dataloader_prefetch_factor=None, | |
| ddp_backend=None, | |
| ddp_broadcast_buffers=None, | |
| ddp_bucket_cap_mb=None, | |
| ddp_find_unused_parameters=None, | |
| ddp_timeout=1800, | |
| debug=[], | |
| deepspeed=scripts/zero2.json, | |
| disable_tqdm=False, | |
| dispatch_batches=None, | |
| do_eval=False, | |
| do_predict=False, | |
| do_train=False, | |
| eval_accumulation_steps=None, | |
| eval_delay=0, | |
| eval_do_concat_batches=True, | |
| eval_on_start=False, | |
| eval_steps=None, | |
| eval_strategy=no, | |
| eval_use_gather_object=False, | |
| evaluation_strategy=no, | |
| fp16=False, | |
| fp16_backend=auto, | |
| fp16_full_eval=False, | |
| fp16_opt_level=O1, | |
| fsdp=[], | |
| fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, | |
| fsdp_min_num_params=0, | |
| fsdp_transformer_layer_cls_to_wrap=None, | |
| full_determinism=False, | |
| gradient_accumulation_steps=4, | |
| gradient_checkpointing=True, | |
| gradient_checkpointing_kwargs=None, | |
| greater_is_better=None, | |
| group_by_length=False, | |
| half_precision_backend=auto, | |
| hub_always_push=False, | |
| hub_model_id=None, | |
| hub_private_repo=None, | |
| hub_strategy=every_save, | |
| hub_token=<HUB_TOKEN>, | |
| ignore_data_skip=False, | |
| include_for_metrics=[], | |
| include_inputs_for_metrics=False, | |
| include_num_input_tokens_seen=False, | |
| include_tokens_per_second=False, | |
| jit_mode_eval=False, | |
| kl_coeff=0.0, | |
| label_names=None, | |
| label_smoothing_factor=0.0, | |
| learning_rate=2e-05, | |
| length_column_name=length, | |
| load_best_model_at_end=False, | |
| local_rank=0, | |
| log_level=passive, | |
| log_level_replica=warning, | |
| log_on_each_node=True, | |
| logging_dir=/home/stern/GRPO/offline_rl_v2/output/runs/Apr10_16-54-02_nacamontrealdc1-p2r203n1.enovum.hivecloud.com, | |
| logging_first_step=False, | |
| logging_nan_inf_filter=True, | |
| logging_steps=1.0, | |
| logging_strategy=steps, | |
| lr_scheduler_kwargs={}, | |
| lr_scheduler_type=cosine, | |
| max_grad_norm=1.0, | |
| max_steps=-1, | |
| metric_for_best_model=None, | |
| mp_parameters=, | |
| neftune_noise_alpha=None, | |
| no_cuda=False, | |
| num_train_epochs=3.0, | |
| optim=adamw_torch, | |
| optim_args=None, | |
| optim_target_modules=None, | |
| output_dir=/home/stern/GRPO/offline_rl_v2/output, | |
| overwrite_output_dir=True, | |
| past_index=-1, | |
| per_device_eval_batch_size=8, | |
| per_device_train_batch_size=1, | |
| prediction_loss_only=False, | |
| push_to_hub=False, | |
| push_to_hub_model_id=None, | |
| push_to_hub_organization=None, | |
| push_to_hub_token=<PUSH_TO_HUB_TOKEN>, | |
| ray_scope=last, | |
| remove_unused_columns=False, | |
| report_to=['tensorboard'], | |
| restore_callback_states_from_checkpoint=False, | |
| resume_from_checkpoint=None, | |
| run_name=/home/stern/GRPO/offline_rl_v2/output, | |
| save_on_each_node=False, | |
| save_only_model=True, | |
| save_safetensors=True, | |
| save_steps=500, | |
| save_strategy=no, | |
| save_total_limit=None, | |
| seed=42, | |
| skip_memory_metrics=True, | |
| split_batches=None, | |
| tf32=None, | |
| torch_compile=False, | |
| torch_compile_backend=None, | |
| torch_compile_mode=None, | |
| torch_empty_cache_steps=None, | |
| torchdynamo=None, | |
| tp_size=0, | |
| tpu_metrics_debug=False, | |
| tpu_num_cores=None, | |
| use_cpu=False, | |
| use_ipex=False, | |
| use_legacy_prediction_loop=False, | |
| use_liger_kernel=False, | |
| use_mps_device=False, | |
| warmup_ratio=0.03, | |
| warmup_steps=0, | |
| weight_decay=0.0, | |
| ) | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file vocab.json | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file merges.txt | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file tokenizer.json | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file added_tokens.json | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file special_tokens_map.json | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file tokenizer_config.json | |
| [INFO|tokenization_utils_base.py:2058] 2025-04-10 16:54:03,651 >> loading file chat_template.jinja | |
| Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 4.14it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 5.25it/s] | |
| Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 3.33it/s]WARNING:__main__:Process rank: 7, device: cuda:7, n_gpu: 1 | |
| WARNING:__main__:Process rank: 6, device: cuda:6, n_gpu: 1 | |
| Generating train split: 0 examples [00:00, ? examples/s]WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1 | |
| WARNING:__main__:Process rank: 5, device: cuda:5, n_gpu: 1 | |
| Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 4.10it/s] | |
| [WARNING|logging.py:329] 2025-04-10 16:54:03,974 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Generating train split: 170 examples [00:00, 1093.45 examples/s][INFO|tokenization_utils_base.py:2323] 2025-04-10 16:54:04,097 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. | |
| [INFO|configuration_utils.py:697] 2025-04-10 16:54:04,097 >> loading configuration file /home/stern/GRPO/saved_models/s1K-7B/config.json | |
| [INFO|configuration_utils.py:771] 2025-04-10 16:54:04,100 >> Model config Qwen2Config { | |
| "architectures": [ | |
| "Qwen2ForCausalLM" | |
| ], | |
| "attention_dropout": 0.0, | |
| "bos_token_id": 151643, | |
| "eos_token_id": 151645, | |
| "hidden_act": "silu", | |
| "hidden_size": 3584, | |
| "initializer_range": 0.02, | |
| "intermediate_size": 18944, | |
| "max_position_embeddings": 32768, | |
| "max_window_layers": 28, | |
| "model_type": "qwen2", | |
| "num_attention_heads": 28, | |
| "num_hidden_layers": 28, | |
| "num_key_value_heads": 4, | |
| "rms_norm_eps": 1e-06, | |
| "rope_scaling": null, | |
| "rope_theta": 1000000.0, | |
| "sliding_window": 131072, | |
| "tie_word_embeddings": false, | |
| "torch_dtype": "bfloat16", | |
| "transformers_version": "4.50.3", | |
| "use_cache": true, | |
| "use_sliding_window": false, | |
| "vocab_size": 152064 | |
| } | |
| [INFO|modeling_utils.py:1151] 2025-04-10 16:54:04,161 >> loading weights file /home/stern/GRPO/saved_models/s1K-7B/model.safetensors.index.json | |
| [INFO|modeling_utils.py:1225] 2025-04-10 16:54:04,162 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object | |
| [INFO|modeling_utils.py:2170] 2025-04-10 16:54:04,162 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:04,164 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| [INFO|configuration_utils.py:1139] 2025-04-10 16:54:04,166 >> Generate config GenerationConfig { | |
| "bos_token_id": 151643, | |
| "eos_token_id": 151645 | |
| } | |
| Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Generating train split: 341 examples [00:00, 966.13 examples/s] Generating train split: 468 examples [00:00, 901.17 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 4.16it/s][WARNING|logging.py:329] 2025-04-10 16:54:04,581 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| Generating train split: 635 examples [00:00, 904.92 examples/s][WARNING|logging.py:329] 2025-04-10 16:54:04,605 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:04,608 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:04,620 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. | |
| Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:00<00:00, 3.89it/s] Generating train split: 759 examples [00:00, 816.52 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:00<00:00, 3.41it/s] Generating train split: 842 examples [00:01, 598.30 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.74it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.72it/s] | |
| Generating train split: 925 examples [00:01, 482.38 examples/s] Generating train split: 1010 examples [00:01, 398.69 examples/s] Generating train split: 1093 examples [00:02, 347.94 examples/s] Generating train split: 1135 examples [00:02, 343.37 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.27it/s] Generating train split: 1218 examples [00:02, 380.14 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.21it/s] Generating train split: 1259 examples [00:02, 370.22 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.16it/s] Generating train split: 1303 examples [00:02, 370.50 examples/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.07it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.71it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.71it/s] | |
| [INFO|modeling_utils.py:4987] 2025-04-10 16:54:06,570 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM. | |
| [INFO|modeling_utils.py:4995] 2025-04-10 16:54:06,570 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/stern/GRPO/saved_models/s1K-7B. | |
| If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. | |
| [INFO|configuration_utils.py:1092] 2025-04-10 16:54:06,572 >> loading configuration file /home/stern/GRPO/saved_models/s1K-7B/generation_config.json | |
| [INFO|configuration_utils.py:1139] 2025-04-10 16:54:06,573 >> Generate config GenerationConfig { | |
| "bos_token_id": 151643, | |
| "do_sample": true, | |
| "eos_token_id": [ | |
| 151645, | |
| 151643 | |
| ], | |
| "pad_token_id": 151643, | |
| "repetition_penalty": 1.05, | |
| "temperature": 0.7, | |
| "top_k": 20, | |
| "top_p": 0.8 | |
| } | |
| Loading checkpoint shards: 50%|█████ | 2/4 [00:02<00:02, 1.00s/it] Generating train split: 1385 examples [00:02, 370.45 examples/s] Generating train split: 1428 examples [00:02, 378.34 examples/s]Using custom data configuration default-570516a07b11d2a7 | |
| INFO:datasets.builder:Using custom data configuration default-570516a07b11d2a7 | |
| Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json | |
| INFO:datasets.info:Loading Dataset Infos from /home/stern/.local/lib/python3.10/site-packages/datasets/packaged_modules/json | |
| Generating train split: 1470 examples [00:03, 335.24 examples/s] Generating train split: 1511 examples [00:03, 332.66 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.17it/s] Generating train split: 1553 examples [00:03, 331.20 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.75it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.53it/s] | |
| Generating train split: 1636 examples [00:03, 404.01 examples/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.04it/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.05it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.38it/s] | |
| Generating train split: 1760 examples [00:03, 521.47 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.36it/s] | |
| Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.02it/s] Generating train split: 1844 examples [00:03, 576.49 examples/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.32it/s] | |
| Generating train split: 1970 examples [00:03, 659.41 examples/s] Generating train split: 2139 examples [00:04, 744.70 examples/s] Generating train split: 2222 examples [00:04, 761.21 examples/s] Generating train split: 2349 examples [00:04, 779.14 examples/s] Generating train split: 2515 examples [00:04, 801.66 examples/s] Generating train split: 2640 examples [00:04, 811.52 examples/s] Generating train split: 2722 examples [00:04, 804.37 examples/s] Generating train split: 2805 examples [00:04, 807.91 examples/s] Generating train split: 2889 examples [00:05, 807.80 examples/s] Generating train split: 2973 examples [00:05, 808.41 examples/s] Generating train split: 3097 examples [00:05, 814.10 examples/s] Generating train split: 3224 examples [00:05, 818.36 examples/s] Generating train split: 3348 examples [00:05, 826.59 examples/s] Generating train split: 3428 examples [00:05, 605.17 examples/s] | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092) | |
| INFO:datasets.builder:Found cached dataset json (/home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092) | |
| Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 | |
| INFO:datasets.info:Loading Dataset info from /home/stern/.cache/huggingface/datasets/json/default-570516a07b11d2a7/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092 | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| /home/stern/GRPO/offline_rl_v2/train.py:274: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `OfflineREINFORCETrainer.__init__`. Use `processing_class` instead. | |
| trainer = OfflineREINFORCETrainer( | |
| [INFO|trainer.py:748] 2025-04-10 16:54:09,798 >> Using auto half precision backend | |
| INFO:__main__:*** Train *** | |
| [2025-04-10 16:54:09,999] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.5, git-hash=unknown, git-branch=unknown | |
| [2025-04-10 16:54:09,999] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 | |
| [2025-04-10 16:54:17,662] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False | |
| [2025-04-10 16:54:17,664] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer | |
| [2025-04-10 16:54:17,664] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer | |
| [2025-04-10 16:54:17,681] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW | |
| [2025-04-10 16:54:17,681] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'> | |
| [2025-04-10 16:54:17,681] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer | |
| [2025-04-10 16:54:17,681] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 12845056 | |
| [2025-04-10 16:54:17,681] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000 | |
| [2025-04-10 16:54:17,682] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False | |
| [2025-04-10 16:54:17,682] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False | |
| /home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) | |
| batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) | |
| [WARNING|logging.py:329] 2025-04-10 16:54:32,676 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:36,610 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:37,082 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [2025-04-10 16:54:37,682] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states | |
| [2025-04-10 16:54:37,683] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 17.73 GB CA 17.73 GB Max_CA 18 GB | |
| [2025-04-10 16:54:37,683] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.82 GB, percent = 4.0% | |
| [WARNING|logging.py:329] 2025-04-10 16:54:37,738 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:37,768 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:37,772 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [WARNING|logging.py:329] 2025-04-10 16:54:37,807 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| [2025-04-10 16:54:37,852] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states | |
| [2025-04-10 16:54:37,852] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 21.28 GB CA 21.28 GB Max_CA 21 GB | |
| [2025-04-10 16:54:37,852] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.52 GB, percent = 3.9% | |
| [2025-04-10 16:54:37,852] [INFO] [stage_1_and_2.py:556:__init__] optimizer state initialized | |
| [2025-04-10 16:54:37,982] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer | |
| [2025-04-10 16:54:37,983] [INFO] [utils.py:782:see_memory_usage] MA 17.73 GB Max_MA 17.73 GB CA 21.28 GB Max_CA 21 GB | |
| [2025-04-10 16:54:37,983] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.5 GB, percent = 3.9% | |
| [2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer | |
| [2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None | |
| [2025-04-10 16:54:37,984] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None | |
| [2025-04-10 16:54:37,985] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)] | |
| [2025-04-10 16:54:37,985] [INFO] [config.py:1000:print] DeepSpeedEngine configuration: | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] activation_checkpointing_config { | |
| "partition_activations": false, | |
| "contiguous_memory_optimization": false, | |
| "cpu_checkpointing": false, | |
| "number_checkpoints": null, | |
| "synchronize_checkpoint_boundary": false, | |
| "profile": false | |
| } | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] amp_enabled .................. False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] amp_params ................... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] autotuning_config ............ { | |
| "enabled": false, | |
| "start_step": null, | |
| "end_step": null, | |
| "metric_path": null, | |
| "arg_mappings": null, | |
| "metric": "throughput", | |
| "model_info": null, | |
| "results_dir": "autotuning_results", | |
| "exps_dir": "autotuning_exps", | |
| "overwrite": true, | |
| "fast": true, | |
| "start_profile_step": 3, | |
| "end_profile_step": 5, | |
| "tuner_type": "gridsearch", | |
| "tuner_early_stopping": 5, | |
| "tuner_num_trials": 50, | |
| "model_info_path": null, | |
| "mp_size": 1, | |
| "max_train_batch_size": null, | |
| "min_train_batch_size": 1, | |
| "max_train_micro_batch_size_per_gpu": 1.024000e+03, | |
| "min_train_micro_batch_size_per_gpu": 1, | |
| "num_tuning_micro_batch_sizes": 3 | |
| } | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] bfloat16_enabled ............. True | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] bfloat16_immediate_grad_update True | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_parallel_write_pipeline False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_tag_validation_enabled True | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] checkpoint_tag_validation_fail False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7c035fb604c0> | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] communication_data_type ...... None | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] curriculum_enabled_legacy .... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] curriculum_params_legacy ..... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] data_efficiency_enabled ...... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dataloader_drop_last ......... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] disable_allgather ............ False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dump_state ................... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] dynamic_loss_scale_args ...... None | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_enabled ........... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_gas_boundary_resolution 1 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_layer_name ........ bert.encoder.layer | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_layer_num ......... 0 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_max_iter .......... 100 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_stability ......... 1e-06 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_tol ............... 0.01 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] eigenvalue_verbose ........... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] elasticity_enabled ........... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] flops_profiler_config ........ { | |
| "enabled": false, | |
| "recompute_fwd_factor": 0.0, | |
| "profile_step": 1, | |
| "module_depth": -1, | |
| "top_modules": 1, | |
| "detailed": true, | |
| "output_file": null | |
| } | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_auto_cast ............... None | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_enabled ................. False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] fp16_master_weights_and_gradients False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] global_rank .................. 0 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] grad_accum_dtype ............. None | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_accumulation_steps .. 4 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_clipping ............ 0.0 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] gradient_predivide_factor .... 1.0 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] graph_harvesting ............. False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] initial_dynamic_scale ........ 1 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] load_universal_checkpoint .... False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] loss_scale ................... 1.0 | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] memory_breakdown ............. False | |
| [2025-04-10 16:54:37,986] [INFO] [config.py:1004:print] mics_hierarchial_params_gather False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] mics_shard_size .............. -1 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] nebula_config ................ { | |
| "enabled": false, | |
| "persistent_storage_path": null, | |
| "persistent_time_interval": 100, | |
| "num_of_version_in_retention": 2, | |
| "enable_nebula_load": true, | |
| "load_path": null | |
| } | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_legacy_fusion ...... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_name ............... None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] optimizer_params ............. None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pld_enabled .................. False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] pld_params ................... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] prescale_gradients ........... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] scheduler_name ............... None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] scheduler_params ............. None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] seq_parallel_communication_data_type torch.float32 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] sparse_attention ............. None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] sparse_gradients_enabled ..... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] steps_per_print .............. inf | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] timers_config ................ enabled=True synchronized=True | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] train_batch_size ............. 32 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] train_micro_batch_size_per_gpu 1 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] use_data_before_expert_parallel_ False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] use_node_local_storage ....... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] wall_clock_breakdown ......... False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] weight_quantization_config ... None | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] world_size ................... 8 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_allow_untested_optimizer True | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=12845056 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_enabled ................. True | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_force_ds_cpu_optimizer .. True | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:1004:print] zero_optimization_stage ...... 2 | |
| [2025-04-10 16:54:37,987] [INFO] [config.py:990:print_user_config] json = { | |
| "fp16": { | |
| "enabled": false, | |
| "loss_scale": 0, | |
| "loss_scale_window": 1000, | |
| "initial_scale_power": 16, | |
| "hysteresis": 2, | |
| "min_loss_scale": 1 | |
| }, | |
| "bf16": { | |
| "enabled": true | |
| }, | |
| "train_micro_batch_size_per_gpu": 1, | |
| "train_batch_size": 32, | |
| "gradient_accumulation_steps": 4, | |
| "zero_optimization": { | |
| "stage": 2, | |
| "overlap_comm": false, | |
| "contiguous_gradients": true, | |
| "sub_group_size": 1.000000e+09, | |
| "reduce_bucket_size": 1.284506e+07 | |
| }, | |
| "steps_per_print": inf, | |
| "zero_allow_untested_optimizer": true | |
| } | |
| [INFO|trainer.py:2409] 2025-04-10 16:54:37,987 >> ***** Running training ***** | |
| [INFO|trainer.py:2410] 2025-04-10 16:54:37,987 >> Num examples = 3,428 | |
| [INFO|trainer.py:2411] 2025-04-10 16:54:37,987 >> Num Epochs = 3 | |
| [INFO|trainer.py:2412] 2025-04-10 16:54:37,987 >> Instantaneous batch size per device = 1 | |
| [INFO|trainer.py:2415] 2025-04-10 16:54:37,987 >> Total train batch size (w. parallel, distributed & accumulation) = 32 | |
| [INFO|trainer.py:2416] 2025-04-10 16:54:37,987 >> Gradient Accumulation steps = 4 | |
| [INFO|trainer.py:2417] 2025-04-10 16:54:37,987 >> Total optimization steps = 321 | |
| [INFO|trainer.py:2418] 2025-04-10 16:54:37,988 >> Number of trainable parameters = 7,615,616,512 | |
| 0%| | 0/321 [00:00<?, ?it/s]/home/stern/.local/lib/python3.10/site-packages/transformers/data/data_collator.py:741: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.) | |
| batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64) | |
| [WARNING|logging.py:329] 2025-04-10 16:54:38,090 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. | |
| /home/stern/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. | |
| with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined] | |
| 0%| | 1/321 [00:05<29:54, 5.61s/it] {'loss': 0.0429, 'grad_norm': 0.42509692907333374, 'learning_rate': 2.0000000000000003e-06, 'kl': 0.0006, 'entropy': 0.1934, 'ce_loss': 0.054, 'epoch': 0.01} | |
| 0%| | 1/321 [00:07<29:54, 5.61s/it] 1%| | 2/321 [00:13<35:29, 6.67s/it] {'loss': 0.0541, 'grad_norm': 0.5195198059082031, 'learning_rate': 4.000000000000001e-06, 'kl': -0.0007, 'entropy': 0.2852, 'ce_loss': 0.0699, 'epoch': 0.02} | |
| 1%| | 2/321 [00:13<35:29, 6.67s/it] 1%| | 3/321 [00:18<32:16, 6.09s/it] {'loss': 0.0385, 'grad_norm': 0.3660963177680969, 'learning_rate': 6e-06, 'kl': 0.0005, 'entropy': 0.2793, 'ce_loss': 0.0602, 'epoch': 0.03} | |
| 1%| | 3/321 [00:18<32:16, 6.09s/it] 1%| | 4/321 [00:24<31:29, 5.96s/it] {'loss': 0.0758, 'grad_norm': 0.6640394330024719, 'learning_rate': 8.000000000000001e-06, 'kl': 0.0151, 'entropy': 0.3047, 'ce_loss': 0.0809, 'epoch': 0.04} | |
| 1%| | 4/321 [00:24<31:29, 5.96s/it] 2%|β | 5/321 [00:29<30:22, 5.77s/it] {'loss': 0.053, 'grad_norm': 0.22936582565307617, 'learning_rate': 1e-05, 'kl': 0.0295, 'entropy': 0.1943, 'ce_loss': 0.0762, 'epoch': 0.05} | |
| 2%|β | 5/321 [00:29<30:22, 5.77s/it] 2%|β | 6/321 [00:35<29:43, 5.66s/it] {'loss': 0.0792, 'grad_norm': 1.8079404830932617, 'learning_rate': 1.2e-05, 'kl': -0.0918, 'entropy': 0.0133, 'ce_loss': 0.0976, 'epoch': 0.06} | |
| 2%|β | 6/321 [00:35<29:43, 5.66s/it] 2%|β | 7/321 [00:40<29:14, 5.59s/it] {'loss': 0.0651, 'grad_norm': 0.557701051235199, 'learning_rate': 1.4e-05, 'kl': 0.0019, 'entropy': 0.1543, 'ce_loss': 0.098, 'epoch': 0.07} | |
| 2%|β | 7/321 [00:40<29:14, 5.59s/it] 2%|β | 8/321 [00:46<29:01, 5.56s/it] {'loss': 0.0683, 'grad_norm': 0.4116457402706146, 'learning_rate': 1.6000000000000003e-05, 'kl': -0.014, 'entropy': 0.0874, 'ce_loss': 0.0771, 'epoch': 0.07} | |
| 2%|β | 8/321 [00:46<29:01, 5.56s/it] 3%|β | 9/321 [00:51<28:46, 5.53s/it] {'loss': 0.0828, 'grad_norm': 0.4462181627750397, 'learning_rate': 1.8e-05, 'kl': 0.0065, 'entropy': 0.1406, 'ce_loss': 0.0741, 'epoch': 0.08} | |
| 3%|β | 9/321 [00:51<28:46, 5.53s/it] 3%|β | 10/321 [00:57<29:24, 5.67s/it] {'loss': 0.0726, 'grad_norm': 0.3048171103000641, 'learning_rate': 2e-05, 'kl': -0.0267, 'entropy': 0.1826, 'ce_loss': 0.0728, 'epoch': 0.09} | |
| 3%|β | 10/321 [00:57<29:24, 5.67s/it] 3%|β | 11/321 [01:03<29:11, 5.65s/it] {'loss': 0.0667, 'grad_norm': 0.36878669261932373, 'learning_rate': 1.9999489794332404e-05, 'kl': -0.0139, 'entropy': 0.2539, 'ce_loss': 0.0773, 'epoch': 0.1} | |
| 3%|β | 11/321 [01:03<29:11, 5.65s/it] 4%|β | 12/321 [01:08<28:58, 5.63s/it] {'loss': 0.051, 'grad_norm': 0.2855764925479889, 'learning_rate': 1.9997959229391567e-05, 'kl': -0.0334, 'entropy': 0.3066, 'ce_loss': 0.0963, 'epoch': 0.11} | |
| 4%|β | 12/321 [01:08<28:58, 5.63s/it] 4%|β | 13/321 [01:14<28:42, 5.59s/it] {'loss': 0.0616, 'grad_norm': 0.3379002809524536, 'learning_rate': 1.9995408461358074e-05, 'kl': -0.0177, 'entropy': 0.1748, 'ce_loss': 0.0593, 'epoch': 0.12} | |
| 4%|β | 13/321 [01:14<28:42, 5.59s/it] 4%|β | 14/321 [01:19<28:34, 5.58s/it] {'loss': 0.0574, 'grad_norm': 0.2853255867958069, 'learning_rate': 1.999183775051519e-05, 'kl': -0.0347, 'entropy': 0.1387, 'ce_loss': 0.0506, 'epoch': 0.13} | |
| 4%|β | 14/321 [01:19<28:34, 5.58s/it] 5%|β | 15/321 [01:25<28:24, 5.57s/it] {'loss': 0.0759, 'grad_norm': 0.35480034351348877, 'learning_rate': 1.9987247461222297e-05, 'kl': -0.0432, 'entropy': 0.2002, 'ce_loss': 0.0776, 'epoch': 0.14} | |
| 5%|β | 15/321 [01:25<28:24, 5.57s/it] 5%|β | 16/321 [01:30<28:19, 5.57s/it] {'loss': 0.1014, 'grad_norm': 0.4204954504966736, 'learning_rate': 1.9981638061877714e-05, 'kl': -0.0393, 'entropy': 0.1953, 'ce_loss': 0.1155, 'epoch': 0.15} | |
| 5%|β | 16/321 [01:30<28:19, 5.57s/it] 5%|β | 17/321 [01:36<28:10, 5.56s/it] {'loss': 0.0646, 'grad_norm': 0.2749992907047272, 'learning_rate': 1.997501012487091e-05, 'kl': -0.0547, 'entropy': 0.1758, 'ce_loss': 0.0786, 'epoch': 0.16} | |
| 5%|β | 17/321 [01:36<28:10, 5.56s/it] 6%|β | 18/321 [01:41<28:04, 5.56s/it] {'loss': 0.0611, 'grad_norm': 0.276908814907074, 'learning_rate': 1.996736432652409e-05, 'kl': -0.0325, 'entropy': 0.1865, 'ce_loss': 0.0886, 'epoch': 0.17} | |
| 6%|β | 18/321 [01:41<28:04, 5.56s/it] 6%|β | 19/321 [01:47<27:54, 5.54s/it] {'loss': 0.0589, 'grad_norm': 0.2533971965312958, 'learning_rate': 1.9958701447023188e-05, 'kl': -0.0439, 'entropy': 0.2002, 'ce_loss': 0.0921, 'epoch': 0.18} | |
| 6%|β | 19/321 [01:47<27:54, 5.54s/it] 6%|β | 20/321 [01:52<27:43, 5.53s/it] {'loss': 0.0773, 'grad_norm': 0.34151360392570496, 'learning_rate': 1.994902237033824e-05, 'kl': -0.0374, 'entropy': 0.1611, 'ce_loss': 0.0873, 'epoch': 0.19} | |
| 6%|β | 20/321 [01:52<27:43, 5.53s/it] 7%|β | 21/321 [01:58<27:37, 5.53s/it] {'loss': 0.0659, 'grad_norm': 0.31397902965545654, 'learning_rate': 1.9938328084133206e-05, 'kl': -0.0547, 'entropy': 0.168, 'ce_loss': 0.0749, 'epoch': 0.2} | |
| 7%|β | 21/321 [01:58<27:37, 5.53s/it] 7%|β | 22/321 [02:03<27:32, 5.53s/it] {'loss': 0.0503, 'grad_norm': 0.1959906369447708, 'learning_rate': 1.9926619679665175e-05, 'kl': -0.0425, 'entropy': 0.1758, 'ce_loss': 0.0733, 'epoch': 0.21} | |
| 7%|β | 22/321 [02:03<27:32, 5.53s/it] 7%|β | 23/321 [02:09<27:26, 5.53s/it] {'loss': 0.0538, 'grad_norm': 0.22350993752479553, 'learning_rate': 1.9913898351673006e-05, 'kl': -0.0464, 'entropy': 0.1602, 'ce_loss': 0.0737, 'epoch': 0.21} | |
| 7%|β | 23/321 [02:09<27:26, 5.53s/it] 7%|β | 24/321 [02:14<27:16, 5.51s/it] {'loss': 0.068, 'grad_norm': 0.2728714346885681, 'learning_rate': 1.9900165398255434e-05, 'kl': -0.0304, 'entropy': 0.1855, 'ce_loss': 0.0722, 'epoch': 0.22} | |
| 7%|β | 24/321 [02:14<27:16, 5.51s/it] 8%|β | 25/321 [02:20<27:11, 5.51s/it] {'loss': 0.0545, 'grad_norm': 0.1936517208814621, 'learning_rate': 1.9885422220738583e-05, 'kl': -0.043, 'entropy': 0.2227, 'ce_loss': 0.0767, 'epoch': 0.23} | |
| 8%|β | 25/321 [02:20<27:11, 5.51s/it] 8%|β | 26/321 [02:25<26:59, 5.49s/it] {'loss': 0.1022, 'grad_norm': 0.3744613826274872, 'learning_rate': 1.9869670323533005e-05, 'kl': -0.0295, 'entropy': 0.25, 'ce_loss': 0.0888, 'epoch': 0.24} | |
| 8%|β | 26/321 [02:25<26:59, 5.49s/it] 8%|β | 27/321 [02:31<27:11, 5.55s/it] {'loss': 0.0884, 'grad_norm': 0.3016413748264313, 'learning_rate': 1.9852911313980146e-05, 'kl': -0.0354, 'entropy': 0.2119, 'ce_loss': 0.0376, 'epoch': 0.25} | |
| 8%|β | 27/321 [02:31<27:11, 5.55s/it] 9%|β | 28/321 [02:37<27:01, 5.53s/it] {'loss': 0.0474, 'grad_norm': 0.1607786864042282, 'learning_rate': 1.9835146902188336e-05, 'kl': -0.0388, 'entropy': 0.2295, 'ce_loss': 0.0768, 'epoch': 0.26} | |
| 9%|β | 28/321 [02:37<27:01, 5.53s/it] 9%|β | 29/321 [02:42<26:49, 5.51s/it] {'loss': 0.0471, 'grad_norm': 0.18925224244594574, 'learning_rate': 1.9816378900858288e-05, 'kl': -0.0374, 'entropy': 0.1621, 'ce_loss': 0.0679, 'epoch': 0.27} | |
| 9% 30/321 [02:48<26:38, 5.49s/it] {'loss': 0.075, 'grad_norm': 0.2948780953884125, 'learning_rate': 1.9796609225098136e-05, 'kl': -0.043, 'entropy': 0.1748, 'ce_loss': 0.0678, 'epoch': 0.28} | |
| 10% 31/321 [02:53<26:30, 5.48s/it] {'loss': 0.0648, 'grad_norm': 0.22790686786174774, 'learning_rate': 1.9775839892228004e-05, 'kl': -0.0339, 'entropy': 0.2373, 'ce_loss': 0.0803, 'epoch': 0.29} | |
| 10% 32/321 [02:58<26:26, 5.49s/it] {'loss': 0.0901, 'grad_norm': 0.3282439112663269, 'learning_rate': 1.9754073021574153e-05, 'kl': -0.0295, 'entropy': 0.2188, 'ce_loss': 0.0814, 'epoch': 0.3} | |
| 10% 33/321 [03:04<26:26, 5.51s/it] {'loss': 0.0679, 'grad_norm': 0.22486652433872223, 'learning_rate': 1.9731310834252747e-05, 'kl': -0.0083, 'entropy': 0.2334, 'ce_loss': 0.0827, 'epoch': 0.31} | |
| 11% 34/321 [03:10<26:23, 5.52s/it] {'loss': 0.0587, 'grad_norm': 0.2146979719400406, 'learning_rate': 1.970755565294318e-05, 'kl': -0.0204, 'entropy': 0.2988, 'ce_loss': 0.1036, 'epoch': 0.32} | |
| 11% 35/321 [03:15<26:26, 5.55s/it] {'loss': 0.0579, 'grad_norm': 0.21969445049762726, 'learning_rate': 1.9682809901651074e-05, 'kl': -0.0449, 'entropy': 0.1699, 'ce_loss': 0.0699, 'epoch': 0.33} | |
| 11% 36/321 [03:21<26:11, 5.51s/it] {'loss': 0.085, 'grad_norm': 0.27458637952804565, 'learning_rate': 1.9657076105460945e-05, 'kl': -0.0184, 'entropy': 0.2344, 'ce_loss': 0.0843, 'epoch': 0.34} | |
| 12% 37/321 [03:26<26:05, 5.51s/it] {'loss': 0.0822, 'grad_norm': 0.2581257224082947, 'learning_rate': 1.9630356890278527e-05, 'kl': -0.0615, 'entropy': 0.2539, 'ce_loss': 0.096, 'epoch': 0.34} | |
| 12% 38/321 [03:32<25:56, 5.50s/it] {'loss': 0.0573, 'grad_norm': 0.19232574105262756, 'learning_rate': 1.9602654982562822e-05, 'kl': -0.0205, 'entropy': 0.2061, 'ce_loss': 0.0739, 'epoch': 0.35} | |
| 12% 39/321 [03:37<26:12, 5.57s/it] {'loss': 0.0679, 'grad_norm': 0.23629367351531982, 'learning_rate': 1.9573973209047893e-05, 'kl': -0.0513, 'entropy': 0.2256, 'ce_loss': 0.0799, 'epoch': 0.36} | |
| 12% 40/321 [03:43<25:57, 5.54s/it] {'loss': 0.0735, 'grad_norm': 0.2551727592945099, 'learning_rate': 1.9544314496454423e-05, 'kl': -0.0107, 'entropy': 0.2129, 'ce_loss': 0.0759, 'epoch': 0.37} | |
| 13% 41/321 [03:48<25:51, 5.54s/it] {'loss': 0.0686, 'grad_norm': 0.2294929027557373, 'learning_rate': 1.9513681871191063e-05, 'kl': -0.0549, 'entropy': 0.2285, 'ce_loss': 0.085, 'epoch': 0.38} | |
| 13% 42/321 [03:54<25:36, 5.51s/it] {'loss': 0.063, 'grad_norm': 0.2062670886516571, 'learning_rate': 1.9482078459045617e-05, 'kl': -0.0322, 'entropy': 0.1729, 'ce_loss': 0.0658, 'epoch': 0.39} | |
| 13% 43/321 [03:59<25:46, 5.56s/it] {'loss': 0.0633, 'grad_norm': 0.18904145061969757, 'learning_rate': 1.9449507484866084e-05, 'kl': -0.0162, 'entropy': 0.2676, 'ce_loss': 0.0918, 'epoch': 0.4} | |
| 14% 44/321 [04:05<25:33, 5.54s/it] {'loss': 0.1106, 'grad_norm': 0.3384988009929657, 'learning_rate': 1.941597227223159e-05, 'kl': -0.0193, 'entropy': 0.2773, 'ce_loss': 0.0962, 'epoch': 0.41} | |
| 14% 45/321 [04:10<25:18, 5.50s/it] {'loss': 0.0979, 'grad_norm': 0.28069382905960083, 'learning_rate': 1.9381476243113243e-05, 'kl': -0.012, 'entropy': 0.1904, 'ce_loss': 0.0665, 'epoch': 0.42} | |
| 14% 46/321 [04:16<25:09, 5.49s/it] {'loss': 0.0633, 'grad_norm': 0.2065959870815277, 'learning_rate': 1.9346022917524958e-05, 'kl': -0.0071, 'entropy': 0.2656, 'ce_loss': 0.0813, 'epoch': 0.43} | |
| 15% 47/321 [04:21<24:58, 5.47s/it] {'loss': 0.0508, 'grad_norm': 0.2020697295665741, 'learning_rate': 1.9309615913164262e-05, 'kl': -0.0186, 'entropy': 0.2539, 'ce_loss': 0.0833, 'epoch': 0.44} | |
| 15% 48/321 [04:27<24:53, 5.47s/it] {'loss': 0.0487, 'grad_norm': 0.1781323254108429, 'learning_rate': 1.9272258945043154e-05, 'kl': -0.0092, 'entropy': 0.2041, 'ce_loss': 0.0772, 'epoch': 0.45} | |
| 15% 49/321 [04:32<24:47, 5.47s/it] {'loss': 0.0994, 'grad_norm': 0.3214576542377472, 'learning_rate': 1.9233955825109e-05, 'kl': -0.0374, 'entropy': 0.2344, 'ce_loss': 0.0915, 'epoch': 0.46} | |
| 16% 50/321 [04:38<24:41, 5.47s/it] {'loss': 0.0683, 'grad_norm': 0.2019803673028946, 'learning_rate': 1.919471046185558e-05, 'kl': -0.01, 'entropy': 0.2314, 'ce_loss': 0.0898, 'epoch': 0.47} | |
| 16% 51/321 [04:43<24:39, 5.48s/it] {'loss': 0.0768, 'grad_norm': 0.2558470666408539, 'learning_rate': 1.9154526859924242e-05, 'kl': -0.0162, 'entropy': 0.1963, 'ce_loss': 0.0721, 'epoch': 0.48} | |
| 16% 52/321 [04:49<24:30, 5.47s/it] {'loss': 0.0526, 'grad_norm': 0.17525342106819153, 'learning_rate': 1.9113409119695276e-05, 'kl': -0.0258, 'entropy': 0.291, 'ce_loss': 0.0953, 'epoch': 0.48} | |
| 17% 53/321 [04:54<24:24, 5.46s/it] {'loss': 0.069, 'grad_norm': 0.20537005364894867, 'learning_rate': 1.907136143686951e-05, 'kl': -0.016, 'entropy': 0.2539, 'ce_loss': 0.0847, 'epoch': 0.49} | |
| 17% 54/321 [05:00<24:19, 5.47s/it] {'loss': 0.0657, 'grad_norm': 0.20349650084972382, 'learning_rate': 1.902838810204015e-05, 'kl': -0.0168, 'entropy': 0.168, 'ce_loss': 0.0638, 'epoch': 0.5} | |
| 17% 55/321 [05:05<24:14, 5.47s/it] {'loss': 0.1007, 'grad_norm': 0.3118916153907776, 'learning_rate': 1.8984493500255e-05, 'kl': -0.0435, 'entropy': 0.1924, 'ce_loss': 0.0703, 'epoch': 0.51} | |
| 17% 56/321 [05:10<24:05, 5.46s/it] {'loss': 0.0725, 'grad_norm': 0.21371452510356903, 'learning_rate': 1.8939682110568982e-05, 'kl': -0.0266, 'entropy': 0.1465, 'ce_loss': 0.0627, 'epoch': 0.52} | |
| 18% 57/321 [05:16<24:09, 5.49s/it] {'loss': 0.0907, 'grad_norm': 0.2616683542728424, 'learning_rate': 1.8893958505587093e-05, 'kl': -0.0339, 'entropy': 0.167, 'ce_loss': 0.0608, 'epoch': 0.53} | |
| 18% 58/321 [05:22<24:04, 5.49s/it] {'loss': 0.0641, 'grad_norm': 0.21476519107818604, 'learning_rate': 1.8847327350997814e-05, 'kl': -0.0464, 'entropy': 0.2793, 'ce_loss': 0.0932, 'epoch': 0.54} | |
| 18% 59/321 [05:27<23:53, 5.47s/it] {'loss': 0.105, 'grad_norm': 0.3317887485027313, 'learning_rate': 1.879979340509701e-05, 'kl': -0.0466, 'entropy': 0.3633, 'ce_loss': 0.112, 'epoch': 0.55} | |
| 19% 60/321 [05:33<23:57, 5.51s/it] {'loss': 0.0845, 'grad_norm': 0.2596229016780853, 'learning_rate': 1.8751361518302413e-05, 'kl': -0.0264, 'entropy': 0.2734, 'ce_loss': 0.0875, 'epoch': 0.56} | |
| 19% 61/321 [05:38<23:47, 5.49s/it] {'loss': 0.0669, 'grad_norm': 0.2144625186920166, 'learning_rate': 1.8702036632658646e-05, 'kl': -0.0199, 'entropy': 0.249, 'ce_loss': 0.0751, 'epoch': 0.57} | |
| 19% 62/321 [05:44<23:48, 5.51s/it] {'loss': 0.0738, 'grad_norm': 0.24398091435432434, 'learning_rate': 1.8651823781332948e-05, 'kl': -0.0337, 'entropy': 0.2451, 'ce_loss': 0.0819, 'epoch': 0.58} | |
| 20% 63/321 [05:49<23:43, 5.52s/it] {'loss': 0.0669, 'grad_norm': 0.2284887582063675, 'learning_rate': 1.8600728088101587e-05, 'kl': -0.0304, 'entropy': 0.1885, 'ce_loss': 0.0738, 'epoch': 0.59} | |
| 20% 64/321 [05:55<23:37, 5.51s/it] {'loss': 0.0748, 'grad_norm': 0.2169020175933838, 'learning_rate': 1.8548754766827016e-05, 'kl': -0.0469, 'entropy': 0.1582, 'ce_loss': 0.0657, 'epoch': 0.6} | |
| 20% 65/321 [06:00<23:26, 5.49s/it] {'loss': 0.0651, 'grad_norm': 0.2139730602502823, 'learning_rate': 1.8495909120925857e-05, 'kl': -0.0243, 'entropy': 0.1836, 'ce_loss': 0.0643, 'epoch': 0.61} | |
| 21% 66/321 [06:06<23:23, 5.50s/it] {'loss': 0.0398, 'grad_norm': 0.1338534951210022, 'learning_rate': 1.8442196542827712e-05, 'kl': -0.0201, 'entropy': 0.1992, 'ce_loss': 0.0718, 'epoch': 0.62} | |
| 21% 67/321 [06:11<23:14, 5.49s/it] {'loss': 0.0526, 'grad_norm': 0.17920222878456116, 'learning_rate': 1.8387622513424942e-05, 'kl': -0.0275, 'entropy': 0.2852, 'ce_loss': 0.0979, 'epoch': 0.62} | |
| 21% 68/321 [06:16<23:09, 5.49s/it] {'loss': 0.0532, 'grad_norm': 0.1490076780319214, 'learning_rate': 1.8332192601513358e-05, 'kl': -0.0408, 'entropy': 0.2637, 'ce_loss': 0.0951, 'epoch': 0.63} | |
| 21% 69/321 [06:22<22:59, 5.47s/it] {'loss': 0.065, 'grad_norm': 0.1889561414718628, 'learning_rate': 1.827591246322401e-05, 'kl': -0.0432, 'entropy': 0.2021, 'ce_loss': 0.0529, 'epoch': 0.64} | |
| 22% 70/321 [06:27<22:54, 5.48s/it] {'loss': 0.0559, 'grad_norm': 0.1900503784418106, 'learning_rate': 1.8218787841446003e-05, 'kl': -0.0427, 'entropy': 0.1963, 'ce_loss': 0.0923, 'epoch': 0.65} | |
| 22% 71/321 [06:33<22:45, 5.46s/it] {'loss': 0.0595, 'grad_norm': 0.18474602699279785, 'learning_rate': 1.8160824565240495e-05, 'kl': -0.0286, 'entropy': 0.1338, 'ce_loss': 0.0534, 'epoch': 0.66} | |
| 22% 72/321 [06:39<23:02, 5.55s/it] {'loss': 0.0633, 'grad_norm': 0.20437942445278168, 'learning_rate': 1.8102028549245894e-05, 'kl': -0.0114, 'entropy': 0.1689, 'ce_loss': 0.0684, 'epoch': 0.67} | |
| 23% 73/321 [06:44<22:49, 5.52s/it] {'loss': 0.0481, 'grad_norm': 0.15962006151676178, 'learning_rate': 1.804240579307431e-05, 'kl': -0.0544, 'entropy': 0.252, 'ce_loss': 0.0875, 'epoch': 0.68} | |
| 23% 74/321 [06:49<22:37, 5.50s/it] {'loss': 0.0535, 'grad_norm': 0.1530793160200119, 'learning_rate': 1.7981962380699376e-05, 'kl': -0.0366, 'entropy': 0.167, 'ce_loss': 0.0524, 'epoch': 0.69} | |
| 23% 75/321 [06:55<22:28, 5.48s/it] {'loss': 0.0681, 'grad_norm': 0.195379376411438, 'learning_rate': 1.79207044798354e-05, 'kl': -0.0654, 'entropy': 0.3203, 'ce_loss': 0.0953, 'epoch': 0.7} | |
| 24% 76/321 [07:00<22:21, 5.47s/it] {'loss': 0.072, 'grad_norm': 0.20148125290870667, 'learning_rate': 1.7858638341308026e-05, 'kl': -0.0469, 'entropy': 0.2012, 'ce_loss': 0.0711, 'epoch': 0.71} | |
| 24% 77/321 [07:06<22:17, 5.48s/it] {'loss': 0.0599, 'grad_norm': 0.19116108119487762, 'learning_rate': 1.779577029841638e-05, 'kl': -0.0281, 'entropy': 0.167, 'ce_loss': 0.0655, 'epoch': 0.72} | |
| 24% 78/321 [07:11<22:08, 5.47s/it] {'loss': 0.0519, 'grad_norm': 0.14512334764003754, 'learning_rate': 1.773210676628682e-05, 'kl': -0.0179, 'entropy': 0.2539, 'ce_loss': 0.0858, 'epoch': 0.73} | |
| 25% 79/321 [07:17<22:03, 5.47s/it] {'loss': 0.0549, 'grad_norm': 0.173423632979393, 'learning_rate': 1.7667654241218332e-05, 'kl': -0.0201, 'entropy': 0.2305, 'ce_loss': 0.0785, 'epoch': 0.74} | |
| 25% 80/321 [07:22<21:54, 5.46s/it] {'loss': 0.0729, 'grad_norm': 0.20733360946178436, 'learning_rate': 1.7602419300019627e-05, 'kl': -0.0398, 'entropy': 0.2207, 'ce_loss': 0.0751, 'epoch': 0.75} | |
| 25% 81/321 [07:28<21:48, 5.45s/it] {'loss': 0.0757, 'grad_norm': 0.21337777376174927, 'learning_rate': 1.753640859933806e-05, 'kl': -0.0403, 'entropy': 0.2061, 'ce_loss': 0.0783, 'epoch': 0.76} | |
| 26% 82/321 [07:33<21:39, 5.44s/it] {'loss': 0.0729, 'grad_norm': 0.21283765137195587, 'learning_rate': 1.746962887498034e-05, 'kl': -0.0347, 'entropy': 0.1797, 'ce_loss': 0.0596, 'epoch': 0.76} | |
| 26% 83/321 [07:39<21:36, 5.45s/it] {'loss': 0.0797, 'grad_norm': 0.2063702642917633, 'learning_rate': 1.7402086941225246e-05, 'kl': -0.0292, 'entropy': 0.2637, 'ce_loss': 0.0892, 'epoch': 0.77} | |
| 26% 84/321 [07:44<21:31, 5.45s/it] {'loss': 0.0626, 'grad_norm': 0.1959114670753479, 'learning_rate': 1.7333789690128252e-05, 'kl': -0.0192, 'entropy': 0.2578, 'ce_loss': 0.1003, 'epoch': 0.78} | |
| 26% 85/321 [07:50<21:32, 5.48s/it] {'loss': 0.0524, 'grad_norm': 0.14823554456233978, 'learning_rate': 1.7264744090818284e-05, 'kl': -0.0266, 'entropy': 0.1934, 'ce_loss': 0.0672, 'epoch': 0.79} | |
| 27% 86/321 [07:55<21:47, 5.56s/it] {'loss': 0.0318, 'grad_norm': 0.09522274881601334, 'learning_rate': 1.719495718878655e-05, 'kl': -0.0474, 'entropy': 0.2236, 'ce_loss': 0.0442, 'epoch': 0.8} | |
| 27% 87/321 [08:01<21:33, 5.53s/it] {'loss': 0.0512, 'grad_norm': 0.15018120408058167, 'learning_rate': 1.712443610516765e-05, 'kl': -0.0552, 'entropy': 0.2812, 'ce_loss': 0.0948, 'epoch': 0.81} | |
| 27% 88/321 [08:06<21:21, 5.50s/it] {'loss': 0.0395, 'grad_norm': 0.12560266256332397, 'learning_rate': 1.7053188036012885e-05, 'kl': -0.0256, 'entropy': 0.3164, 'ce_loss': 0.0981, 'epoch': 0.82} | |
| 28% 89/321 [08:12<21:11, 5.48s/it] {'loss': 0.0614, 'grad_norm': 0.18557599186897278, 'learning_rate': 1.6981220251555996e-05, 'kl': -0.0303, 'entropy': 0.2471, 'ce_loss': 0.0874, 'epoch': 0.83} | |
| 28% 90/321 [08:17<21:02, 5.46s/it] {'loss': 0.0675, 'grad_norm': 0.1974230408668518, 'learning_rate': 1.6908540095471288e-05, 'kl': -0.0491, 'entropy': 0.2119, 'ce_loss': 0.078, 'epoch': 0.84} | |
| 28% 91/321 [08:23<20:59, 5.48s/it] {'loss': 0.0581, 'grad_norm': 0.18383747339248657, 'learning_rate': 1.6835154984124266e-05, 'kl': -0.0222, 'entropy': 0.1797, 'ce_loss': 0.0639, 'epoch': 0.85} | |
| 29% 92/321 [08:28<20:55, 5.48s/it] {'loss': 0.0757, 'grad_norm': 0.22654765844345093, 'learning_rate': 1.676107240581488e-05, 'kl': -0.025, 'entropy': 0.3281, 'ce_loss': 0.1065, 'epoch': 0.86} | |
| 29% 93/321 [08:34<20:51, 5.49s/it] {'loss': 0.0695, 'grad_norm': 0.18459008634090424, 'learning_rate': 1.6686299920013388e-05, 'kl': -0.0461, 'entropy': 0.1445, 'ce_loss': 0.0251, 'epoch': 0.87} | |
| 29% 94/321 [08:39<20:44, 5.48s/it] {'loss': 0.072, 'grad_norm': 0.21317991614341736, 'learning_rate': 1.661084515658901e-05, 'kl': -0.0491, 'entropy': 0.3535, 'ce_loss': 0.111, 'epoch': 0.88} | |
| 30% 95/321 [08:44<20:35, 5.47s/it] {'loss': 0.0569, 'grad_norm': 0.16013678908348083, 'learning_rate': 1.6534715815031325e-05, 'kl': -0.0649, 'entropy': 0.377, 'ce_loss': 0.146, 'epoch': 0.89} | |
| 30% 96/321 [08:50<20:36, 5.50s/it] {'loss': 0.0605, 'grad_norm': 0.1786569356918335, 'learning_rate': 1.645791966366464e-05, 'kl': -0.0317, 'entropy': 0.0967, 'ce_loss': 0.0473, 'epoch': 0.9} | |
| 30% 97/321 [08:56<20:32, 5.50s/it] {'loss': 0.0459, 'grad_norm': 0.14076007902622223, 'learning_rate': 1.63804645388553e-05, 'kl': -0.0217, 'entropy': 0.3301, 'ce_loss': 0.1067, 'epoch': 0.9} | |
| 31% 98/321 [09:01<20:26, 5.50s/it] {'loss': 0.049, 'grad_norm': 0.13985782861709595, 'learning_rate': 1.6302358344212025e-05, 'kl': -0.0359, 'entropy': 0.2266, 'ce_loss': 0.0589, 'epoch': 0.91} | |
| 31% 99/321 [09:06<20:18, 5.49s/it] {'loss': 0.07, 'grad_norm': 0.20864830911159515, 'learning_rate': 1.622360904977946e-05, 'kl': -0.0566, 'entropy': 0.1768, 'ce_loss': 0.0669, 'epoch': 0.92} | |
| 31% 100/321 [09:12<20:15, 5.50s/it] {'loss': 0.0738, 'grad_norm': 0.1885262280702591, 'learning_rate': 1.6144224691224868e-05, 'kl': -0.0417, 'entropy': 0.2168, 'ce_loss': 0.0757, 'epoch': 0.93} | |
| 31% 101/321 [09:18<20:11, 5.51s/it] {'loss': 0.0725, 'grad_norm': 0.19844770431518555, 'learning_rate': 1.606421336901818e-05, 'kl': -0.0461, 'entropy': 0.2188, 'ce_loss': 0.0711, 'epoch': 0.94} | |
| 32% 102/321 [09:23<20:06, 5.51s/it] {'loss': 0.0442, 'grad_norm': 0.1333250254392624, 'learning_rate': 1.5983583247605414e-05, 'kl': -0.0359, 'entropy': 0.2539, 'ce_loss': 0.0848, 'epoch': 0.95} | |
| 32% 103/321 [09:28<19:57, 5.49s/it] {'loss': 0.0555, 'grad_norm': 0.16762061417102814, 'learning_rate': 1.590234255457555e-05, 'kl': -0.0179, 'entropy': 0.252, 'ce_loss': 0.0824, 'epoch': 0.96} | |
| 32% 104/321 [09:34<19:52, 5.49s/it] {'loss': 0.0564, 'grad_norm': 0.177537202835083, 'learning_rate': 1.582049957982099e-05, 'kl': -0.0393, 'entropy': 0.1807, 'ce_loss': 0.0678, 'epoch': 0.97} | |
| 33% 105/321 [09:40<19:53, 5.53s/it] {'loss': 0.0692, 'grad_norm': 0.20024636387825012, 'learning_rate': 1.5738062674691657e-05, 'kl': -0.0275, 'entropy': 0.2275, 'ce_loss': 0.0741, 'epoch': 0.98} | |
| 33% 106/321 [09:45<19:42, 5.50s/it] {'loss': 0.0651, 'grad_norm': 0.20259137451648712, 'learning_rate': 1.5655040251142787e-05, 'kl': -0.0244, 'entropy': 0.2217, 'ce_loss': 0.0729, 'epoch': 0.99} | |
| 33% 107/321 [09:51<19:34, 5.49s/it] {'loss': 0.061, 'grad_norm': 0.16389916837215424, 'learning_rate': 1.5571440780876588e-05, 'kl': -0.0157, 'entropy': 0.2012, 'ce_loss': 0.0687, 'epoch': 1.0} | |
| 34% 108/321 [09:52<15:02, 4.24s/it] {'loss': 0.0391, 'grad_norm': 0.16389916837215424, 'learning_rate': 1.548727279447777e-05, 'kl': -0.0276, 'entropy': 0.1445, 'ce_loss': 0.2476, 'epoch': 1.0} | |
| 34% 109/321 [09:57<16:16, 4.61s/it] {'loss': 0.0452, 'grad_norm': 0.34768620133399963, 'learning_rate': 1.540254488054307e-05, 'kl': 0.0835, 'entropy': 0.2197, 'ce_loss': 0.0691, 'epoch': 1.01} | |
| 34% 110/321 [10:03<17:08, 4.87s/it] {'loss': 0.0471, 'grad_norm': 0.18361957371234894, 'learning_rate': 1.5317265684804865e-05, 'kl': 0.0008, 'entropy': 0.2139, 'ce_loss': 0.0772, 'epoch': 1.02} | |
| 35% 111/321 [10:08<17:38, 5.04s/it] {'loss': 0.0399, 'grad_norm': 0.18910780549049377, 'learning_rate': 1.5231443909248956e-05, 'kl': -0.0051, 'entropy': 0.2109, 'ce_loss': 0.0947, 'epoch': 1.03} | |
| 35% 112/321 [10:14<17:59, 5.16s/it] {'loss': 0.0412, 'grad_norm': 0.18259647488594055, 'learning_rate': 1.5145088311226599e-05, 'kl': 0.0408, 'entropy': 0.0811, 'ce_loss': 0.037, 'epoch': 1.04} | |
| 35% 113/321 [10:19<18:14, 5.26s/it] {'loss': 0.0414, 'grad_norm': 0.21136245131492615, 'learning_rate': 1.5058207702560907e-05, 'kl': 0.063, 'entropy': 0.125, 'ce_loss': 0.0617, 'epoch': 1.05} | |
| 36% 114/321 [10:25<18:25, 5.34s/it] {'loss': 0.0336, 'grad_norm': 0.17452239990234375, 'learning_rate': 1.4970810948647664e-05, 'kl': -0.0054, 'entropy': 0.1602, 'ce_loss': 0.0847, 'epoch': 1.06} | |
| 36% 115/321 [10:30<18:27, 5.37s/it] {'loss': 0.0416, 'grad_norm': 0.27118608355522156, 'learning_rate': 1.4882906967550708e-05, 'kl': -0.0466, 'entropy': 0.104, 'ce_loss': 0.0761, 'epoch': 1.07} | |
| 36% 116/321 [10:36<18:26, 5.40s/it] {'loss': 0.0453, 'grad_norm': 0.22193056344985962, 'learning_rate': 1.479450472909191e-05, 'kl': -0.0101, 'entropy': 0.1592, 'ce_loss': 0.0893, 'epoch': 1.07} | |
| 36% 117/321 [10:41<18:27, 5.43s/it] {'loss': 0.0486, 'grad_norm': 0.2031334489583969, 'learning_rate': 1.4705613253935886e-05, 'kl': -0.0284, 'entropy': 0.1025, 'ce_loss': 0.0659, 'epoch': 1.08} | |
| 37% 118/321 [10:47<18:26, 5.45s/it] {'loss': 0.0415, 'grad_norm': 0.2964913845062256, 'learning_rate': 1.4616241612669523e-05, 'kl': 0.0403, 'entropy': 0.1079, 'ce_loss': 0.0631, 'epoch': 1.09} | |
| 37% 119/321 [10:52<18:22, 5.46s/it] {'loss': 0.0427, 'grad_norm': 0.19633540511131287, 'learning_rate': 1.4526398924876407e-05, 'kl': -0.0203, 'entropy': 0.0991, 'ce_loss': 0.0589, 'epoch': 1.1} | |
| 37% 120/321 [10:58<18:20, 5.47s/it] {'loss': 0.0536, 'grad_norm': 0.2811174690723419, 'learning_rate': 1.4436094358206224e-05, 'kl': 0.0001, 'entropy': 0.1001, 'ce_loss': 0.0519, 'epoch': 1.11} | |
| 38% 121/321 [11:03<18:17, 5.49s/it] {'loss': 0.0445, 'grad_norm': 0.19164954125881195, 'learning_rate': 1.4345337127439333e-05, 'kl': 0.0903, 'entropy': 0.1787, 'ce_loss': 0.0799, 'epoch': 1.12} | |
| 38% 122/321 [11:09<18:12, 5.49s/it] {'loss': 0.0424, 'grad_norm': 0.25314974784851074, 'learning_rate': 1.4254136493546432e-05, 'kl': 0.0322, 'entropy': 0.1924, 'ce_loss': 0.0798, 'epoch': 1.13} | |
| 38% 123/321 [11:14<18:09, 5.50s/it] {'loss': 0.0373, 'grad_norm': 0.12902231514453888, 'learning_rate': 1.4162501762743579e-05, 'kl': 0.042, 'entropy': 0.1226, 'ce_loss': 0.0694, 'epoch': 1.14} | |
| 39% 124/321 [11:20<18:07, 5.52s/it] {'loss': 0.0454, 'grad_norm': 0.22617529332637787, 'learning_rate': 1.4070442285542579e-05, 'kl': -0.0043, 'entropy': 0.1011, 'ce_loss': 0.0623, 'epoch': 1.15} | |
| 39% 125/321 [11:25<18:05, 5.54s/it] {'loss': 0.0425, 'grad_norm': 0.26588836312294006, 'learning_rate': 1.3977967455796828e-05, 'kl': -0.0179, 'entropy': 0.1865, 'ce_loss': 0.0958, 'epoch': 1.16} | |
| 39% 126/321 [11:31<18:03, 5.56s/it] {'loss': 0.0393, 'grad_norm': 0.16521866619586945, 'learning_rate': 1.3885086709742788e-05, 'kl': 0.0381, 'entropy': 0.1553, 'ce_loss': 0.0691, 'epoch': 1.17} | |
| 40% 127/321 [11:37<18:05, 5.59s/it] {'loss': 0.037, 'grad_norm': 0.18703752756118774, 'learning_rate': 1.3791809525037057e-05, 'kl': -0.0179, 'entropy': 0.0742, 'ce_loss': 0.051, 'epoch': 1.18} | |
| 40% 128/321 [11:42<17:54, 5.57s/it] {'loss': 0.0485, 'grad_norm': 0.19613757729530334, 'learning_rate': 1.3698145419789302e-05, 'kl': 0.0452, 'entropy': 0.1328, 'ce_loss': 0.0521, 'epoch': 1.19} | |
| 40% 129/321 [11:48<17:53, 5.59s/it] {'loss': 0.036, 'grad_norm': 0.14802619814872742, 'learning_rate': 1.3604103951590993e-05, 'kl': 0.0154, 'entropy': 0.1494, 'ce_loss': 0.0648, 'epoch': 1.2} | |
| 40% 130/321 [11:53<17:44, 5.57s/it] {'loss': 0.0456, 'grad_norm': 0.22272059321403503, 'learning_rate': 1.3509694716540135e-05, 'kl': 0.0356, 'entropy': 0.0723, 'ce_loss': 0.0401, 'epoch': 1.21} | |
| 41% 131/321 [11:59<17:38, 5.57s/it] {'loss': 0.0374, 'grad_norm': 0.16963350772857666, 'learning_rate': 1.341492734826209e-05, 'kl': 0.001, 'entropy': 0.1104, 'ce_loss': 0.0607, 'epoch': 1.21} | |
| 41% 132/321 [12:04<17:31, 5.56s/it] {'loss': 0.0416, 'grad_norm': 0.14699193835258484, 'learning_rate': 1.3319811516926541e-05, 'kl': 0.0334, 'entropy': 0.1406, 'ce_loss': 0.061, 'epoch': 1.22} | |
| 41% 133/321 [12:10<17:27, 5.57s/it] {'loss': 0.0497, 'grad_norm': 0.21459099650382996, 'learning_rate': 1.3224356928260735e-05, 'kl': 0.0903, 'entropy': 0.0688, 'ce_loss': 0.0198, 'epoch': 1.23} | |
| 42% 134/321 [12:15<17:19, 5.56s/it] {'loss': 0.037, 'grad_norm': 0.21457742154598236, 'learning_rate': 1.3128573322559097e-05, 'kl': -0.042, 'entropy': 0.2148, 'ce_loss': 0.0958, 'epoch': 1.24} | |
| 42% 135/321 [12:21<17:12, 5.55s/it] {'loss': 0.0505, 'grad_norm': 0.25294315814971924, 'learning_rate': 1.3032470473689322e-05, 'kl': -0.0425, 'entropy': 0.1387, 'ce_loss': 0.0772, 'epoch': 1.25} | |
| 42% 136/321 [12:27<17:07, 5.55s/it] {'loss': 0.0466, 'grad_norm': 0.16512276232242584, 'learning_rate': 1.2936058188095045e-05, 'kl': 0.0918, 'entropy': 0.1006, 'ce_loss': 0.0524, 'epoch': 1.26} | |
| 43% 137/321 [12:32<17:00, 5.55s/it] {'loss': 0.0342, 'grad_norm': 0.1969085931777954, 'learning_rate': 1.2839346303795173e-05, 'kl': 0.0515, 'entropy': 0.1416, 'ce_loss': 0.0561, 'epoch': 1.27} | |
| 43% 138/321 [12:38<16:54, 5.55s/it] {'loss': 0.0547, 'grad_norm': 0.22046498954296112, 'learning_rate': 1.274234468938001e-05, 'kl': 0.0835, 'entropy': 0.1235, 'ce_loss': 0.0676, 'epoch': 1.28} | |
| 43% 139/321 [12:43<16:56, 5.58s/it] {'loss': 0.0362, 'grad_norm': 0.1698365956544876, 'learning_rate': 1.2645063243004236e-05, 'kl': -0.0222, 'entropy': 0.1245, 'ce_loss': 0.0778, 'epoch': 1.29} | |
| 44% 140/321 [12:49<16:50, 5.58s/it] {'loss': 0.0382, 'grad_norm': 0.13968144357204437, 'learning_rate': 1.2547511891376916e-05, 'kl': 0.0593, 'entropy': 0.1797, 'ce_loss': 0.0834, 'epoch': 1.3} | |
| 44% 141/321 [12:54<16:42, 5.57s/it] {'loss': 0.0402, 'grad_norm': 0.1825486421585083, 'learning_rate': 1.2449700588748541e-05, 'kl': -0.0162, 'entropy': 0.1484, 'ce_loss': 0.0752, 'epoch': 1.31} | |
| 44% 142/321 [13:00<16:32, 5.55s/it] {'loss': 0.0449, 'grad_norm': 0.21553905308246613, 'learning_rate': 1.2351639315895309e-05, 'kl': 0.0047, 'entropy': 0.167, 'ce_loss': 0.073, 'epoch': 1.32} | |
| 45% 143/321 [13:05<16:20, 5.51s/it] {'loss': 0.0536, 'grad_norm': 0.17211349308490753, 'learning_rate': 1.2253338079100652e-05, 'kl': 0.0258, 'entropy': 0.084, 'ce_loss': 0.0464, 'epoch': 1.33} | |
| 45% 144/321 [13:11<16:17, 5.52s/it] {'loss': 0.0425, 'grad_norm': 0.2392333745956421, 'learning_rate': 1.2154806909134198e-05, 'kl': 0.0111, 'entropy': 0.1094, 'ce_loss': 0.0533, 'epoch': 1.34} | |
| 45% 145/321 [13:16<16:14, 5.54s/it] {'loss': 0.0496, 'grad_norm': 0.22119127213954926, 'learning_rate': 1.205605586022822e-05, 'kl': -0.0498, 'entropy': 0.1689, 'ce_loss': 0.0877, 'epoch': 1.34} | |
| 45% 146/321 [13:22<16:11, 5.55s/it] {'loss': 0.0376, 'grad_norm': 0.17724597454071045, 'learning_rate': 1.1957095009051683e-05, 'kl': 0.0099, 'entropy': 0.1523, 'ce_loss': 0.0728, 'epoch': 1.35} | |
| 46% 147/321 [13:28<16:02, 5.53s/it] {'loss': 0.0359, 'grad_norm': 0.17848482728004456, 'learning_rate': 1.1857934453682016e-05, 'kl': 0.0859, 'entropy': 0.0171, 'ce_loss': 0.0275, 'epoch': 1.36} | |
| 46% 148/321 [13:33<15:56, 5.53s/it] {'loss': 0.0394, 'grad_norm': 0.18580235540866852, 'learning_rate': 1.1758584312574693e-05, 'kl': 0.1069, 'entropy': 0.0447, 'ce_loss': 0.0515, 'epoch': 1.37} | |
| 46% 149/321 [13:39<15:51, 5.53s/it] {'loss': 0.0646, 'grad_norm': 0.23421402275562286, 'learning_rate': 1.1659054723530721e-05, 'kl': 0.1162, 'entropy': 0.1079, 'ce_loss': 0.0598, 'epoch': 1.38} | |
| 47% 150/321 [13:44<15:43, 5.52s/it] {'loss': 0.0518, 'grad_norm': 0.2502172589302063, 'learning_rate': 1.1559355842662188e-05, 'kl': 0.0972, 'entropy': 0.1465, 'ce_loss': 0.0639, 'epoch': 1.39} | |
| 47% 151/321 [13:50<15:34, 5.50s/it] {'loss': 0.0392, 'grad_norm': 0.16437506675720215, 'learning_rate': 1.1459497843355907e-05, 'kl': -0.0081, 'entropy': 0.1533, 'ce_loss': 0.0869, 'epoch': 1.4} | |
| 47% 152/321 [13:55<15:28, 5.49s/it] {'loss': 0.0297, 'grad_norm': 0.16849961876869202, 'learning_rate': 1.1359490915235323e-05, 'kl': 0.0486, 'entropy': 0.127, 'ce_loss': 0.0625, 'epoch': 1.41} | |
| 48% 153/321 [14:00<15:20, 5.48s/it] {'loss': 0.0574, 'grad_norm': 0.21643343567848206, 'learning_rate': 1.1259345263120738e-05, 'kl': 0.1201, 'entropy': 0.0466, 'ce_loss': 0.0416, 'epoch': 1.42} | |
| 48% 154/321 [14:06<15:21, 5.52s/it] {'loss': 0.04, 'grad_norm': 0.24577035009860992, 'learning_rate': 1.1159071105988012e-05, 'kl': 0.0547, 'entropy': 0.1123, 'ce_loss': 0.0483, 'epoch': 1.43} | |
| 48% 155/321 [14:12<15:16, 5.52s/it] {'loss': 0.0426, 'grad_norm': 0.19843176007270813, 'learning_rate': 1.1058678675925796e-05, 'kl': -0.0126, 'entropy': 0.1846, 'ce_loss': 0.081, 'epoch': 1.44} | |
| 49% 156/321 [14:17<15:08, 5.50s/it] {'loss': 0.0395, 'grad_norm': 0.1768287569284439, 'learning_rate': 1.0958178217091455e-05, 'kl': 0.0327, 'entropy': 0.1084, 'ce_loss': 0.0531, 'epoch': 1.45} | |
| 49% 157/321 [14:23<15:08, 5.54s/it] {'loss': 0.0372, 'grad_norm': 0.15050095319747925, 'learning_rate': 1.0857579984665733e-05, 'kl': 0.0214, 'entropy': 0.124, 'ce_loss': 0.0549, 'epoch': 1.46} | |
| 49% 158/321 [14:28<15:02, 5.54s/it] {'loss': 0.034, 'grad_norm': 0.14452822506427765, 'learning_rate': 1.0756894243806291e-05, 'kl': 0.105, 'entropy': 0.1074, 'ce_loss': 0.0499, 'epoch': 1.47} | |
| 50% 159/321 [14:34<14:54, 5.52s/it] {'loss': 0.0502, 'grad_norm': 0.20282569527626038, 'learning_rate': 1.0656131268600254e-05, 'kl': 0.0014, 'entropy': 0.1787, 'ce_loss': 0.081, 'epoch': 1.48} | |
| 50% 160/321 [14:39<14:48, 5.52s/it] {'loss': 0.0423, 'grad_norm': 0.23301592469215393, 'learning_rate': 1.0555301341015832e-05, 'kl': 0.0757, 'entropy': 0.0427, 'ce_loss': 0.0388, 'epoch': 1.48} | |
| 50% 161/321 [14:45<14:44, 5.53s/it] {'loss': 0.0409, 'grad_norm': 0.15230302512645721, 'learning_rate': 1.0454414749853126e-05, 'kl': 0.1318, 'entropy': 0.0845, 'ce_loss': 0.0592, 'epoch': 1.49} | |
| 50% 162/321 [14:50<14:37, 5.52s/it] {'loss': 0.0386, 'grad_norm': 0.20218664407730103, 'learning_rate': 1.0353481789694258e-05, 'kl': 0.0864, 'entropy': 0.0806, 'ce_loss': 0.0432, 'epoch': 1.5} | |
| 51% 163/321 [14:56<14:31, 5.52s/it] {'loss': 0.0351, 'grad_norm': 0.13059383630752563, 'learning_rate': 1.0252512759852891e-05, 'kl': 0.0197, 'entropy': 0.1289, 'ce_loss': 0.0587, 'epoch': 1.51} | |
| 51% 164/321 [15:01<14:25, 5.51s/it] {'loss': 0.0378, 'grad_norm': 0.1583739072084427, 'learning_rate': 1.015151796332328e-05, 'kl': 0.0018, 'entropy': 0.1377, 'ce_loss': 0.0707, 'epoch': 1.52} | |
| 51% 165/321 [15:07<14:18, 5.50s/it] {'loss': 0.0363, 'grad_norm': 0.12924712896347046, 'learning_rate': 1.0050507705728943e-05, 'kl': 0.0032, 'entropy': 0.1885, 'ce_loss': 0.0925, 'epoch': 1.53} | |
| 52% 166/321 [15:12<14:12, 5.50s/it] {'loss': 0.0442, 'grad_norm': 0.2284664362668991, 'learning_rate': 9.949492294271062e-06, 'kl': 0.0593, 'entropy': 0.0703, 'ce_loss': 0.0353, 'epoch': 1.54} | |
| 52% 167/321 [15:18<14:05, 5.49s/it] {'loss': 0.0298, 'grad_norm': 0.13885430991649628, 'learning_rate': 9.848482036676725e-06, 'kl': 0.0131, 'entropy': 0.0771, 'ce_loss': 0.0575, 'epoch': 1.55} | |
| 52% 168/321 [15:23<13:59, 5.49s/it] {'loss': 0.0629, 'grad_norm': 0.2584099769592285, 'learning_rate': 9.747487240147112e-06, 'kl': -0.0198, 'entropy': 0.1001, 'ce_loss': 0.0583, 'epoch': 1.56} | |
| 53% 169/321 [15:29<13:54, 5.49s/it] {'loss': 0.0495, 'grad_norm': 0.25362423062324524, 'learning_rate': 9.646518210305747e-06, 'kl': 0.0063, 'entropy': 0.0869, 'ce_loss': 0.0644, 'epoch': 1.57} | |
| 53%|ββββββ | 169/321 [15:29<13:54, 5.49s/it] 53%|ββββββ | 170/321 [15:34<13:46, 5.48s/it] {'loss': 0.0508, 'grad_norm': 0.20054134726524353, 'learning_rate': 9.545585250146879e-06, 'kl': -0.0024, 'entropy': 0.1611, 'ce_loss': 0.0894, 'epoch': 1.58} | |
| 53%|ββββββ | 170/321 [15:34<13:46, 5.48s/it] 53%|ββββββ | 171/321 [15:40<13:39, 5.46s/it] {'loss': 0.0644, 'grad_norm': 0.3231568932533264, 'learning_rate': 9.44469865898417e-06, 'kl': 0.0596, 'entropy': 0.1133, 'ce_loss': 0.0555, 'epoch': 1.59} | |
| 53%|ββββββ | 171/321 [15:40<13:39, 5.46s/it] 54%|ββββββ | 172/321 [15:45<13:34, 5.46s/it] {'loss': 0.0394, 'grad_norm': 0.17434173822402954, 'learning_rate': 9.34386873139975e-06, 'kl': 0.0037, 'entropy': 0.1099, 'ce_loss': 0.0558, 'epoch': 1.6} | |
| 54%|ββββββ | 172/321 [15:45<13:34, 5.46s/it] 54%|ββββββ | 173/321 [15:51<13:30, 5.48s/it] {'loss': 0.0527, 'grad_norm': 0.20393753051757812, 'learning_rate': 9.243105756193714e-06, 'kl': -0.0132, 'entropy': 0.1064, 'ce_loss': 0.0663, 'epoch': 1.61} | |
| 54%|ββββββ | 173/321 [15:51<13:30, 5.48s/it] 54%|ββββββ | 174/321 [15:56<13:28, 5.50s/it] {'loss': 0.0455, 'grad_norm': 0.18590298295021057, 'learning_rate': 9.14242001533427e-06, 'kl': 0.0469, 'entropy': 0.04, 'ce_loss': 0.0422, 'epoch': 1.62} | |
| 54%|ββββββ | 174/321 [15:56<13:28, 5.50s/it] 55%|ββββββ | 175/321 [16:02<13:22, 5.50s/it] {'loss': 0.0312, 'grad_norm': 0.18247582018375397, 'learning_rate': 9.041821782908544e-06, 'kl': 0.0796, 'entropy': 0.0645, 'ce_loss': 0.0376, 'epoch': 1.62} | |
| 55%|ββββββ | 175/321 [16:02<13:22, 5.50s/it] 55%|ββββββ | 176/321 [16:07<13:20, 5.52s/it] {'loss': 0.0539, 'grad_norm': 0.26011157035827637, 'learning_rate': 8.941321324074207e-06, 'kl': -0.0275, 'entropy': 0.1074, 'ce_loss': 0.0641, 'epoch': 1.63} | |
| 55%|ββββββ | 176/321 [16:07<13:20, 5.52s/it] 55%|ββββββ | 177/321 [16:13<13:15, 5.52s/it] {'loss': 0.0333, 'grad_norm': 0.12031043320894241, 'learning_rate': 8.840928894011995e-06, 'kl': -0.0181, 'entropy': 0.21, 'ce_loss': 0.0949, 'epoch': 1.64} | |
| 55%|ββββββ | 177/321 [16:13<13:15, 5.52s/it] 55%|ββββββ | 178/321 [16:18<13:08, 5.52s/it] {'loss': 0.0462, 'grad_norm': 0.16544109582901, 'learning_rate': 8.740654736879265e-06, 'kl': -0.0088, 'entropy': 0.2402, 'ce_loss': 0.1025, 'epoch': 1.65} | |
| 55%|ββββββ | 178/321 [16:18<13:08, 5.52s/it] 56%|ββββββ | 179/321 [16:24<13:04, 5.53s/it] {'loss': 0.0411, 'grad_norm': 0.21774324774742126, 'learning_rate': 8.640509084764682e-06, 'kl': -0.0114, 'entropy': 0.1055, 'ce_loss': 0.0474, 'epoch': 1.66} | |
| 56%|ββββββ | 179/321 [16:24<13:04, 5.53s/it] 56%|ββββββ | 180/321 [16:29<12:56, 5.51s/it] {'loss': 0.0398, 'grad_norm': 0.16917741298675537, 'learning_rate': 8.540502156644096e-06, 'kl': -0.0046, 'entropy': 0.1152, 'ce_loss': 0.0681, 'epoch': 1.67} | |
| 56%|ββββββ | 180/321 [16:29<12:56, 5.51s/it] 56%|ββββββ | 181/321 [16:35<12:52, 5.52s/it] {'loss': 0.0377, 'grad_norm': 0.14676432311534882, 'learning_rate': 8.440644157337819e-06, 'kl': -0.0104, 'entropy': 0.208, 'ce_loss': 0.0879, 'epoch': 1.68} | |
| 56%|ββββββ | 181/321 [16:35<12:52, 5.52s/it] 57%|ββββββ | 182/321 [16:40<12:45, 5.51s/it] {'loss': 0.0476, 'grad_norm': 0.23645582795143127, 'learning_rate': 8.340945276469282e-06, 'kl': 0.0378, 'entropy': 0.1328, 'ce_loss': 0.0645, 'epoch': 1.69} | |
| 57%|ββββββ | 182/321 [16:40<12:45, 5.51s/it] 57%|ββββββ | 183/321 [16:46<12:40, 5.51s/it] {'loss': 0.0454, 'grad_norm': 0.18683338165283203, 'learning_rate': 8.24141568742531e-06, 'kl': 0.0889, 'entropy': 0.0522, 'ce_loss': 0.0368, 'epoch': 1.7} | |
| 57%|ββββββ | 183/321 [16:46<12:40, 5.51s/it] 57%|ββββββ | 184/321 [16:51<12:32, 5.49s/it] {'loss': 0.033, 'grad_norm': 0.13285161554813385, 'learning_rate': 8.142065546317988e-06, 'kl': 0.0515, 'entropy': 0.1064, 'ce_loss': 0.0577, 'epoch': 1.71} | |
| 57%|ββββββ | 184/321 [16:51<12:32, 5.49s/it] 58%|ββββββ | 185/321 [16:57<12:28, 5.50s/it] {'loss': 0.0544, 'grad_norm': 0.26400867104530334, 'learning_rate': 8.042904990948319e-06, 'kl': 0.0016, 'entropy': 0.1128, 'ce_loss': 0.0664, 'epoch': 1.72} | |
| 58%|ββββββ | 185/321 [16:57<12:28, 5.50s/it] 58%|ββββββ | 186/321 [17:02<12:22, 5.50s/it] {'loss': 0.0486, 'grad_norm': 0.19021686911582947, 'learning_rate': 7.943944139771784e-06, 'kl': 0.0649, 'entropy': 0.0708, 'ce_loss': 0.0415, 'epoch': 1.73} | |
| 58%|ββββββ | 186/321 [17:02<12:22, 5.50s/it] 58%|ββββββ | 187/321 [17:08<12:18, 5.51s/it] {'loss': 0.055, 'grad_norm': 0.22009411454200745, 'learning_rate': 7.845193090865807e-06, 'kl': 0.1182, 'entropy': 0.0698, 'ce_loss': 0.051, 'epoch': 1.74} | |
| 58%|ββββββ | 187/321 [17:08<12:18, 5.51s/it] 59%|ββββββ | 188/321 [17:13<12:12, 5.50s/it] {'loss': 0.0347, 'grad_norm': 0.1851823478937149, 'learning_rate': 7.746661920899351e-06, 'kl': 0.0265, 'entropy': 0.1118, 'ce_loss': 0.0598, 'epoch': 1.75} | |
| 59%|ββββββ | 188/321 [17:13<12:12, 5.50s/it] 59%|ββββββ | 189/321 [17:19<12:04, 5.49s/it] {'loss': 0.0437, 'grad_norm': 0.23166052997112274, 'learning_rate': 7.648360684104695e-06, 'kl': 0.0649, 'entropy': 0.0913, 'ce_loss': 0.0592, 'epoch': 1.76} | |
| 59%|ββββββ | 189/321 [17:19<12:04, 5.49s/it] 59%|ββββββ | 190/321 [17:24<11:58, 5.49s/it] {'loss': 0.0462, 'grad_norm': 0.17776505649089813, 'learning_rate': 7.550299411251461e-06, 'kl': -0.0074, 'entropy': 0.1689, 'ce_loss': 0.0917, 'epoch': 1.76} | |
| 59%|ββββββ | 190/321 [17:24<11:58, 5.49s/it] 60%|ββββββ | 191/321 [17:30<11:51, 5.47s/it] {'loss': 0.0455, 'grad_norm': 0.18737439811229706, 'learning_rate': 7.452488108623089e-06, 'kl': -0.0033, 'entropy': 0.1748, 'ce_loss': 0.0997, 'epoch': 1.77} | |
| 60%|ββββββ | 191/321 [17:30<11:51, 5.47s/it] 60%|ββββββ | 192/321 [17:35<11:45, 5.47s/it] {'loss': 0.0451, 'grad_norm': 0.18686872720718384, 'learning_rate': 7.354936756995766e-06, 'kl': -0.0095, 'entropy': 0.1357, 'ce_loss': 0.0782, 'epoch': 1.78} | |
| 60%|ββββββ | 192/321 [17:35<11:45, 5.47s/it] 60%|ββββββ | 193/321 [17:40<11:38, 5.45s/it] {'loss': 0.0397, 'grad_norm': 0.18644410371780396, 'learning_rate': 7.257655310619996e-06, 'kl': -0.0021, 'entropy': 0.063, 'ce_loss': 0.0551, 'epoch': 1.79} | |
| 60%|ββββββ | 193/321 [17:41<11:38, 5.45s/it] 60%|ββββββ | 194/321 [17:46<11:33, 5.46s/it] {'loss': 0.0404, 'grad_norm': 0.19344562292099, 'learning_rate': 7.16065369620483e-06, 'kl': 0.0491, 'entropy': 0.0845, 'ce_loss': 0.0443, 'epoch': 1.8} | |
| 60%|ββββββ | 194/321 [17:46<11:33, 5.46s/it] 61%|ββββββ | 195/321 [17:52<11:32, 5.50s/it] {'loss': 0.0373, 'grad_norm': 0.18872958421707153, 'learning_rate': 7.063941811904956e-06, 'kl': -0.028, 'entropy': 0.1416, 'ce_loss': 0.0675, 'epoch': 1.81} | |
| 61%|ββββββ | 195/321 [17:52<11:32, 5.50s/it] 61%|ββββββ | 196/321 [17:57<11:24, 5.48s/it] {'loss': 0.0458, 'grad_norm': 0.18720188736915588, 'learning_rate': 6.967529526310681e-06, 'kl': 0.1318, 'entropy': 0.0581, 'ce_loss': 0.0473, 'epoch': 1.82} | |
| 61%|ββββββ | 196/321 [17:57<11:24, 5.48s/it] 61%|βββββββ | 197/321 [18:02<11:18, 5.47s/it] {'loss': 0.0401, 'grad_norm': 0.1798112690448761, 'learning_rate': 6.871426677440907e-06, 'kl': 0.123, 'entropy': 0.1348, 'ce_loss': 0.064, 'epoch': 1.83} | |
| 61%|βββββββ | 197/321 [18:02<11:18, 5.47s/it] 62%|βββββββ | 198/321 [18:08<11:11, 5.46s/it] {'loss': 0.0385, 'grad_norm': 0.14133110642433167, 'learning_rate': 6.775643071739267e-06, 'kl': 0.1514, 'entropy': 0.084, 'ce_loss': 0.0519, 'epoch': 1.84} | |
| 62%|βββββββ | 198/321 [18:08<11:11, 5.46s/it] 62%|βββββββ | 199/321 [18:13<11:05, 5.45s/it] {'loss': 0.0479, 'grad_norm': 0.1965685486793518, 'learning_rate': 6.680188483073458e-06, 'kl': 0.0776, 'entropy': 0.0981, 'ce_loss': 0.0556, 'epoch': 1.85} | |
| 62%|βββββββ | 199/321 [18:13<11:05, 5.45s/it] 62%|βββββββ | 200/321 [18:19<11:01, 5.47s/it] {'loss': 0.0472, 'grad_norm': 0.2427220493555069, 'learning_rate': 6.585072651737911e-06, 'kl': 0.1016, 'entropy': 0.1162, 'ce_loss': 0.0619, 'epoch': 1.86} | |
| 62%|βββββββ | 200/321 [18:19<11:01, 5.47s/it] 63%|βββββββ | 201/321 [18:24<10:55, 5.47s/it] {'loss': 0.0404, 'grad_norm': 0.1257992386817932, 'learning_rate': 6.49030528345987e-06, 'kl': -0.0354, 'entropy': 0.1348, 'ce_loss': 0.0772, 'epoch': 1.87} | |
| 63%|βββββββ | 201/321 [18:24<10:55, 5.47s/it] 63%|βββββββ | 202/321 [18:30<10:51, 5.47s/it] {'loss': 0.0599, 'grad_norm': 0.24053309857845306, 'learning_rate': 6.3958960484090094e-06, 'kl': 0.0288, 'entropy': 0.0835, 'ce_loss': 0.0535, 'epoch': 1.88} | |
| 63%|βββββββ | 202/321 [18:30<10:51, 5.47s/it] 63%|βββββββ | 203/321 [18:35<10:44, 5.46s/it] {'loss': 0.0357, 'grad_norm': 0.1884569376707077, 'learning_rate': 6.3018545802107e-06, 'kl': -0.014, 'entropy': 0.1128, 'ce_loss': 0.067, 'epoch': 1.89} | |
| 63%|βββββββ | 203/321 [18:35<10:44, 5.46s/it] 64%|βββββββ | 204/321 [18:41<10:38, 5.46s/it] {'loss': 0.0448, 'grad_norm': 0.1645369976758957, 'learning_rate': 6.208190474962945e-06, 'kl': 0.1475, 'entropy': 0.0535, 'ce_loss': 0.0392, 'epoch': 1.9} | |
| 64%|βββββββ | 204/321 [18:41<10:38, 5.46s/it] 64%|βββββββ | 205/321 [18:46<10:34, 5.47s/it] {'loss': 0.0405, 'grad_norm': 0.22946400940418243, 'learning_rate': 6.114913290257219e-06, 'kl': -0.0236, 'entropy': 0.1118, 'ce_loss': 0.0635, 'epoch': 1.9} | |
| 64%|βββββββ | 205/321 [18:46<10:34, 5.47s/it] 64%|βββββββ | 206/321 [18:52<10:25, 5.44s/it] {'loss': 0.0294, 'grad_norm': 0.11852026730775833, 'learning_rate': 6.0220325442031714e-06, 'kl': -0.0104, 'entropy': 0.2217, 'ce_loss': 0.0981, 'epoch': 1.91} | |
| 64%|βββββββ | 206/321 [18:52<10:25, 5.44s/it] 64%|βββββββ | 207/321 [18:57<10:21, 5.45s/it] {'loss': 0.0325, 'grad_norm': 0.16473232209682465, 'learning_rate': 5.929557714457425e-06, 'kl': -0.0282, 'entropy': 0.252, 'ce_loss': 0.1167, 'epoch': 1.92} | |
| 64%|βββββββ | 207/321 [18:57<10:21, 5.45s/it] 65%|βββββββ | 208/321 [19:02<10:15, 5.45s/it] {'loss': 0.0495, 'grad_norm': 0.18369171023368835, 'learning_rate': 5.8374982372564255e-06, 'kl': 0.0498, 'entropy': 0.0947, 'ce_loss': 0.0577, 'epoch': 1.93} | |
| 65%|βββββββ | 208/321 [19:02<10:15, 5.45s/it] 65%|βββββββ | 209/321 [19:08<10:12, 5.47s/it] {'loss': 0.0487, 'grad_norm': 0.18389469385147095, 'learning_rate': 5.745863506453569e-06, 'kl': 0.062, 'entropy': 0.1514, 'ce_loss': 0.0798, 'epoch': 1.94} | |
| 65%|βββββββ | 209/321 [19:08<10:12, 5.47s/it] 65%|βββββββ | 210/321 [19:13<10:06, 5.46s/it] {'loss': 0.0356, 'grad_norm': 0.18366190791130066, 'learning_rate': 5.6546628725606675e-06, 'kl': 0.0515, 'entropy': 0.1113, 'ce_loss': 0.0581, 'epoch': 1.95} | |
| 65%|βββββββ | 210/321 [19:13<10:06, 5.46s/it] 66%|βββββββ | 211/321 [19:19<10:00, 5.46s/it] {'loss': 0.0469, 'grad_norm': 0.15876325964927673, 'learning_rate': 5.563905641793776e-06, 'kl': -0.0175, 'entropy': 0.2539, 'ce_loss': 0.1105, 'epoch': 1.96} | |
| 66%|βββββββ | 211/321 [19:19<10:00, 5.46s/it] 66%|βββββββ | 212/321 [19:24<09:56, 5.47s/it] {'loss': 0.0445, 'grad_norm': 0.26259469985961914, 'learning_rate': 5.473601075123599e-06, 'kl': 0.0369, 'entropy': 0.105, 'ce_loss': 0.0526, 'epoch': 1.97} | |
| 66%|βββββββ | 212/321 [19:24<09:56, 5.47s/it] 66%|βββββββ | 213/321 [19:30<09:50, 5.47s/it] {'loss': 0.0297, 'grad_norm': 0.12675271928310394, 'learning_rate': 5.383758387330476e-06, 'kl': 0.0026, 'entropy': 0.1289, 'ce_loss': 0.0728, 'epoch': 1.98} | |
| 66%|βββββββ | 213/321 [19:30<09:50, 5.47s/it] 67%|βββββββ | 214/321 [19:35<09:45, 5.47s/it] {'loss': 0.0467, 'grad_norm': 0.19244596362113953, 'learning_rate': 5.294386746064115e-06, 'kl': -0.0228, 'entropy': 0.1328, 'ce_loss': 0.0803, 'epoch': 1.99} | |
| 67%|βββββββ | 214/321 [19:35<09:45, 5.47s/it] 67%|βββββββ | 215/321 [19:41<09:38, 5.46s/it] {'loss': 0.0462, 'grad_norm': 0.2244977205991745, 'learning_rate': 5.205495270908094e-06, 'kl': 0.0447, 'entropy': 0.083, 'ce_loss': 0.0564, 'epoch': 2.0} | |
| 67%|βββββββ | 215/321 [19:41<09:38, 5.46s/it] 67%|βββββββ | 216/321 [19:42<07:22, 4.22s/it] {'loss': 0.0398, 'grad_norm': 0.2244977205991745, 'learning_rate': 5.117093032449297e-06, 'kl': 0.0898, 'entropy': 0.0206, 'ce_loss': 0.1184, 'epoch': 2.0} | |
| 67%|βββββββ | 216/321 [19:42<07:22, 4.22s/it] 68%|βββββββ | 217/321 [19:48<07:57, 4.59s/it] {'loss': 0.0258, 'grad_norm': 0.4248782694339752, 'learning_rate': 5.029189051352339e-06, 'kl': -0.0123, 'entropy': 0.1709, 'ce_loss': 0.0853, 'epoch': 2.01} | |
| 68%|βββββββ | 217/321 [19:48<07:57, 4.59s/it] 68%|βββββββ | 218/321 [19:53<08:20, 4.86s/it] {'loss': 0.0256, 'grad_norm': 0.12272848188877106, 'learning_rate': 4.941792297439098e-06, 'kl': 0.0223, 'entropy': 0.0918, 'ce_loss': 0.0505, 'epoch': 2.02} | |
| 68%|βββββββ | 218/321 [19:53<08:20, 4.86s/it] 68%|βββββββ | 219/321 [19:59<08:35, 5.06s/it] {'loss': 0.0292, 'grad_norm': 0.1554100066423416, 'learning_rate': 4.8549116887734045e-06, 'kl': 0.0771, 'entropy': 0.0796, 'ce_loss': 0.0564, 'epoch': 2.03} | |
| 68%|βββββββ | 219/321 [19:59<08:35, 5.06s/it] 69%|βββββββ | 220/321 [20:04<08:41, 5.17s/it] {'loss': 0.0308, 'grad_norm': 0.14393459260463715, 'learning_rate': 4.7685560907510465e-06, 'kl': 0.0991, 'entropy': 0.1138, 'ce_loss': 0.0579, 'epoch': 2.04} | |
| 69%|βββββββ | 220/321 [20:04<08:41, 5.17s/it] 69%|βββββββ | 221/321 [20:09<08:46, 5.27s/it] {'loss': 0.0222, 'grad_norm': 0.1449674814939499, 'learning_rate': 4.682734315195138e-06, 'kl': -0.0442, 'entropy': 0.0952, 'ce_loss': 0.0674, 'epoch': 2.05} | |
| 69%|βββββββ | 221/321 [20:09<08:46, 5.27s/it] 69%|βββββββ | 222/321 [20:15<08:47, 5.33s/it] {'loss': 0.0263, 'grad_norm': 0.11594852060079575, 'learning_rate': 4.5974551194569336e-06, 'kl': 0.1611, 'entropy': 0.0625, 'ce_loss': 0.0488, 'epoch': 2.06} | |
| 69%|βββββββ | 222/321 [20:15<08:47, 5.33s/it] 69%|βββββββ | 223/321 [20:20<08:46, 5.37s/it] {'loss': 0.0222, 'grad_norm': 0.1573036015033722, 'learning_rate': 4.51272720552223e-06, 'kl': 0.0776, 'entropy': 0.0571, 'ce_loss': 0.0336, 'epoch': 2.07} | |
| 69%|βββββββ | 223/321 [20:20<08:46, 5.37s/it] 70%|βββββββ | 224/321 [20:26<08:42, 5.39s/it] {'loss': 0.0271, 'grad_norm': 0.18170951306819916, 'learning_rate': 4.4285592191234125e-06, 'kl': 0.207, 'entropy': 0.0024, 'ce_loss': 0.0248, 'epoch': 2.07} | |
| 70%|βββββββ | 224/321 [20:26<08:42, 5.39s/it] 70%|βββββββ | 225/321 [20:31<08:39, 5.41s/it] {'loss': 0.0288, 'grad_norm': 0.15239432454109192, 'learning_rate': 4.344959748857215e-06, 'kl': 0.1787, 'entropy': -0.0063, 'ce_loss': 0.0277, 'epoch': 2.08} | |
| 70%|βββββββ | 225/321 [20:31<08:39, 5.41s/it] 70%|βββββββ | 226/321 [20:37<08:35, 5.43s/it] {'loss': 0.0273, 'grad_norm': 0.16316252946853638, 'learning_rate': 4.261937325308347e-06, 'kl': 0.168, 'entropy': -0.0201, 'ce_loss': 0.0207, 'epoch': 2.09} | |
| 70%|βββββββ | 226/321 [20:37<08:35, 5.43s/it] 71%|βββββββ | 227/321 [20:42<08:32, 5.45s/it] {'loss': 0.0237, 'grad_norm': 0.13521708548069, 'learning_rate': 4.179500420179011e-06, 'kl': 0.0889, 'entropy': 0.1289, 'ce_loss': 0.0683, 'epoch': 2.1} | |
| 71%|βββββββ | 227/321 [20:42<08:32, 5.45s/it] 71%|βββββββ | 228/321 [20:48<08:27, 5.46s/it] {'loss': 0.0273, 'grad_norm': 0.1312531679868698, 'learning_rate': 4.097657445424454e-06, 'kl': 0.2285, 'entropy': -0.033, 'ce_loss': 0.0156, 'epoch': 2.11} | |
| 71%|βββββββ | 228/321 [20:48<08:27, 5.46s/it] 71%|ββββββββ | 229/321 [20:53<08:22, 5.46s/it] {'loss': 0.026, 'grad_norm': 0.16218417882919312, 'learning_rate': 4.016416752394591e-06, 'kl': 0.0016, 'entropy': 0.054, 'ce_loss': 0.053, 'epoch': 2.12} | |
| 71%|ββββββββ | 229/321 [20:53<08:22, 5.46s/it] 72%|ββββββββ | 230/321 [20:59<08:17, 5.46s/it] {'loss': 0.0193, 'grad_norm': 0.12537582218647003, 'learning_rate': 3.935786630981819e-06, 'kl': -0.0422, 'entropy': 0.0757, 'ce_loss': 0.0789, 'epoch': 2.13} | |
| 72%|ββββββββ | 230/321 [20:59<08:17, 5.46s/it] 72%|ββββββββ | 231/321 [21:04<08:12, 5.47s/it] {'loss': 0.0231, 'grad_norm': 0.08550359308719635, 'learning_rate': 3.8557753087751345e-06, 'kl': 0.0801, 'entropy': 0.0505, 'ce_loss': 0.0532, 'epoch': 2.14} | |
| 72%|ββββββββ | 231/321 [21:04<08:12, 5.47s/it] 72%|ββββββββ | 232/321 [21:10<08:05, 5.46s/it] {'loss': 0.0317, 'grad_norm': 0.17949357628822327, 'learning_rate': 3.776390950220544e-06, 'kl': 0.2461, 'entropy': 0.0178, 'ce_loss': 0.0316, 'epoch': 2.15} | |
| 72%|ββββββββ | 232/321 [21:10<08:05, 5.46s/it] 73%|ββββββββ | 233/321 [21:15<07:59, 5.45s/it] {'loss': 0.0254, 'grad_norm': 0.13897587358951569, 'learning_rate': 3.6976416557879757e-06, 'kl': 0.0066, 'entropy': 0.1094, 'ce_loss': 0.0907, 'epoch': 2.16} | |
| 73%|ββββββββ | 233/321 [21:15<07:59, 5.45s/it] 73%|ββββββββ | 234/321 [21:20<07:53, 5.44s/it] {'loss': 0.0291, 'grad_norm': 0.1302442103624344, 'learning_rate': 3.6195354611447033e-06, 'kl': 0.1367, 'entropy': 0.0645, 'ce_loss': 0.0377, 'epoch': 2.17} | |
| 73%|ββββββββ | 234/321 [21:20<07:53, 5.44s/it] 73%|ββββββββ | 235/321 [21:26<07:47, 5.44s/it] {'loss': 0.0215, 'grad_norm': 0.1322391927242279, 'learning_rate': 3.5420803363353604e-06, 'kl': 0.0352, 'entropy': 0.1299, 'ce_loss': 0.0815, 'epoch': 2.18} | |
| 73%|ββββββββ | 235/321 [21:26<07:47, 5.44s/it] 74%|ββββββββ | 236/321 [21:31<07:43, 5.45s/it] {'loss': 0.0219, 'grad_norm': 0.1335344910621643, 'learning_rate': 3.465284184968679e-06, 'kl': 0.0057, 'entropy': 0.0513, 'ce_loss': 0.0579, 'epoch': 2.19} | |
| 74%|ββββββββ | 236/321 [21:31<07:43, 5.45s/it] 74%|ββββββββ | 237/321 [21:37<07:39, 5.47s/it] {'loss': 0.0223, 'grad_norm': 0.10451359301805496, 'learning_rate': 3.3891548434109942e-06, 'kl': 0.1138, 'entropy': 0.0156, 'ce_loss': 0.0272, 'epoch': 2.2} | |
| 74%|ββββββββ | 237/321 [21:37<07:39, 5.47s/it] 74%|ββββββββ | 238/321 [21:42<07:34, 5.48s/it] {'loss': 0.0264, 'grad_norm': 0.14136981964111328, 'learning_rate': 3.3137000799866148e-06, 'kl': 0.0347, 'entropy': 0.0366, 'ce_loss': 0.0225, 'epoch': 2.21} | |
| 74%|ββββββββ | 238/321 [21:42<07:34, 5.48s/it] 74%|ββββββββ | 239/321 [21:48<07:28, 5.47s/it] {'loss': 0.0252, 'grad_norm': 0.13739950954914093, 'learning_rate': 3.238927594185127e-06, 'kl': 0.1436, 'entropy': 0.0786, 'ce_loss': 0.0474, 'epoch': 2.21} | |
| 74%|ββββββββ | 239/321 [21:48<07:28, 5.47s/it] 75%|ββββββββ | 240/321 [21:53<07:23, 5.47s/it] {'loss': 0.025, 'grad_norm': 0.148906409740448, 'learning_rate': 3.1648450158757373e-06, 'kl': 0.2559, 'entropy': 0.0135, 'ce_loss': 0.0388, 'epoch': 2.22} | |
| 75%|ββββββββ | 240/321 [21:53<07:23, 5.47s/it] 75%|ββββββββ | 241/321 [21:59<07:18, 5.49s/it] {'loss': 0.023, 'grad_norm': 0.11499208956956863, 'learning_rate': 3.0914599045287165e-06, 'kl': 0.084, 'entropy': 0.0747, 'ce_loss': 0.0671, 'epoch': 2.23} | |
| 75%|ββββββββ | 241/321 [21:59<07:18, 5.49s/it] 75%|ββββββββ | 242/321 [22:04<07:11, 5.46s/it] {'loss': 0.019, 'grad_norm': 0.08660584688186646, 'learning_rate': 3.018779748444005e-06, 'kl': 0.1416, 'entropy': 0.0276, 'ce_loss': 0.0296, 'epoch': 2.24} | |
| 75%|ββββββββ | 242/321 [22:04<07:11, 5.46s/it] 76%|ββββββββ | 243/321 [22:10<07:07, 5.48s/it] {'loss': 0.0233, 'grad_norm': 0.11674519628286362, 'learning_rate': 2.9468119639871163e-06, 'kl': 0.053, 'entropy': 0.1094, 'ce_loss': 0.0716, 'epoch': 2.25} | |
| 76%|ββββββββ | 243/321 [22:10<07:07, 5.48s/it] 76%|ββββββββ | 244/321 [22:15<07:00, 5.46s/it] {'loss': 0.0216, 'grad_norm': 0.12600378692150116, 'learning_rate': 2.8755638948323494e-06, 'kl': -0.0245, 'entropy': 0.1113, 'ce_loss': 0.0864, 'epoch': 2.26} | |
| 76%|ββββββββ | 244/321 [22:15<07:00, 5.46s/it] 76%|ββββββββ | 245/321 [22:21<06:54, 5.46s/it] {'loss': 0.0283, 'grad_norm': 0.10699094086885452, 'learning_rate': 2.8050428112134474e-06, 'kl': 0.168, 'entropy': 0.0168, 'ce_loss': 0.0169, 'epoch': 2.27} | |
| 76%|ββββββββ | 245/321 [22:21<06:54, 5.46s/it] 77%|ββββββββ | 246/321 [22:26<06:50, 5.47s/it] {'loss': 0.0295, 'grad_norm': 0.1837424486875534, 'learning_rate': 2.735255909181719e-06, 'kl': 0.2188, 'entropy': 0.0093, 'ce_loss': 0.0258, 'epoch': 2.28} | |
| 77%|ββββββββ | 246/321 [22:26<06:50, 5.47s/it] 77%|ββββββββ | 247/321 [22:32<06:44, 5.47s/it] {'loss': 0.0221, 'grad_norm': 0.1188495010137558, 'learning_rate': 2.6662103098717485e-06, 'kl': 0.2012, 'entropy': 0.0083, 'ce_loss': 0.037, 'epoch': 2.29} | |
| 77%|ββββββββ | 247/321 [22:32<06:44, 5.47s/it] 77%|ββββββββ | 248/321 [22:37<06:39, 5.47s/it] {'loss': 0.0305, 'grad_norm': 0.1437922865152359, 'learning_rate': 2.597913058774758e-06, 'kl': 0.0422, 'entropy': 0.0796, 'ce_loss': 0.0395, 'epoch': 2.3} | |
| 77%|ββββββββ | 248/321 [22:37<06:39, 5.47s/it] 78%|ββββββββ | 249/321 [22:43<06:35, 5.50s/it] {'loss': 0.0261, 'grad_norm': 0.13972756266593933, 'learning_rate': 2.530371125019664e-06, 'kl': 0.126, 'entropy': -0.0011, 'ce_loss': 0.0175, 'epoch': 2.31} | |
| 78%|ββββββββ | 249/321 [22:43<06:35, 5.50s/it] 78%|ββββββββ | 250/321 [22:48<06:29, 5.49s/it] {'loss': 0.0244, 'grad_norm': 0.14524739980697632, 'learning_rate': 2.4635914006619454e-06, 'kl': -0.011, 'entropy': 0.1221, 'ce_loss': 0.0832, 'epoch': 2.32} | |
| 78%|ββββββββ | 250/321 [22:48<06:29, 5.49s/it] 78%|ββββββββ | 251/321 [22:54<06:24, 5.50s/it] {'loss': 0.0215, 'grad_norm': 0.09333452582359314, 'learning_rate': 2.3975806999803717e-06, 'kl': 0.0096, 'entropy': 0.103, 'ce_loss': 0.0629, 'epoch': 2.33} | |
| 78%|ββββββββ | 251/321 [22:54<06:24, 5.50s/it] 79%|ββββββββ | 252/321 [22:59<06:18, 5.49s/it] {'loss': 0.0179, 'grad_norm': 0.12270327657461166, 'learning_rate': 2.33234575878167e-06, 'kl': -0.0493, 'entropy': 0.166, 'ce_loss': 0.1019, 'epoch': 2.34} | |
| 79%|ββββββββ | 252/321 [22:59<06:18, 5.49s/it] 79%|ββββββββ | 253/321 [23:05<06:12, 5.48s/it] {'loss': 0.026, 'grad_norm': 0.0858650952577591, 'learning_rate': 2.267893233713182e-06, 'kl': 0.166, 'entropy': 0.0181, 'ce_loss': 0.0212, 'epoch': 2.34} | |
| 79%|ββββββββ | 253/321 [23:05<06:12, 5.48s/it] 79%|ββββββββ | 254/321 [23:10<06:08, 5.50s/it] {'loss': 0.026, 'grad_norm': 0.14108358323574066, 'learning_rate': 2.204229701583621e-06, 'kl': 0.0007, 'entropy': 0.1206, 'ce_loss': 0.0768, 'epoch': 2.35} | |
| 79%|ββββββββ | 254/321 [23:10<06:08, 5.50s/it] 79%|ββββββββ | 255/321 [23:16<06:03, 5.51s/it] {'loss': 0.0287, 'grad_norm': 0.13392361998558044, 'learning_rate': 2.141361658691975e-06, 'kl': 0.1709, 'entropy': -0.0361, 'ce_loss': 0.0101, 'epoch': 2.36} | |
| 79%|ββββββββ | 255/321 [23:16<06:03, 5.51s/it] 80%|ββββββββ | 256/321 [23:21<05:57, 5.50s/it] {'loss': 0.0289, 'grad_norm': 0.1510702222585678, 'learning_rate': 2.0792955201646005e-06, 'kl': 0.0669, 'entropy': 0.1152, 'ce_loss': 0.0818, 'epoch': 2.37} | |
| 80%|ββββββββ | 256/321 [23:21<05:57, 5.50s/it] 80%|ββββββββ | 257/321 [23:27<05:52, 5.51s/it] {'loss': 0.0286, 'grad_norm': 0.15851591527462006, 'learning_rate': 2.018037619300628e-06, 'kl': 0.0654, 'entropy': 0.1631, 'ce_loss': 0.0799, 'epoch': 2.38} | |
| 80%|ββββββββ | 257/321 [23:27<05:52, 5.51s/it] 80%|ββββββββ | 258/321 [23:32<05:48, 5.53s/it] {'loss': 0.0239, 'grad_norm': 0.1374949961900711, 'learning_rate': 1.9575942069256914e-06, 'kl': 0.0591, 'entropy': 0.0176, 'ce_loss': 0.0434, 'epoch': 2.39} | |
| 80%|ββββββββ | 258/321 [23:32<05:48, 5.53s/it] 81%|ββββββββ | 259/321 [23:38<05:42, 5.52s/it] {'loss': 0.0259, 'grad_norm': 0.13825078308582306, 'learning_rate': 1.8979714507541103e-06, 'kl': 0.0757, 'entropy': 0.0447, 'ce_loss': 0.0431, 'epoch': 2.4} | |
| 81%|ββββββββ | 259/321 [23:38<05:42, 5.52s/it] 81%|ββββββββ | 260/321 [23:43<05:35, 5.51s/it] {'loss': 0.0272, 'grad_norm': 0.15970517694950104, 'learning_rate': 1.839175434759507e-06, 'kl': 0.014, 'entropy': 0.1191, 'ce_loss': 0.0785, 'epoch': 2.41} | |
| 81%|ββββββββ | 260/321 [23:43<05:35, 5.51s/it] 81%|βββββββββ | 261/321 [23:49<05:29, 5.49s/it] {'loss': 0.0221, 'grad_norm': 0.10772485285997391, 'learning_rate': 1.7812121585539964e-06, 'kl': 0.2246, 'entropy': -0.0183, 'ce_loss': 0.0161, 'epoch': 2.42} | |
| 81%|βββββββββ | 261/321 [23:49<05:29, 5.49s/it] 82%|βββββββββ | 262/321 [23:54<05:23, 5.47s/it] {'loss': 0.0228, 'grad_norm': 0.12323292344808578, 'learning_rate': 1.7240875367759902e-06, 'kl': 0.0143, 'entropy': 0.1367, 'ce_loss': 0.0932, 'epoch': 2.43} | |
| 82%|βββββββββ | 262/321 [23:54<05:23, 5.47s/it] 82%|βββββββββ | 263/321 [23:59<05:16, 5.46s/it] {'loss': 0.0217, 'grad_norm': 0.13626284897327423, 'learning_rate': 1.6678073984866438e-06, 'kl': 0.0046, 'entropy': 0.0884, 'ce_loss': 0.0481, 'epoch': 2.44} | |
| 82%|βββββββββ | 263/321 [23:59<05:16, 5.46s/it] 82%|βββββββββ | 264/321 [24:05<05:11, 5.46s/it] {'loss': 0.0277, 'grad_norm': 0.11715082824230194, 'learning_rate': 1.6123774865750607e-06, 'kl': 0.1963, 'entropy': -0.0159, 'ce_loss': 0.0254, 'epoch': 2.45} | |
| 82%|βββββββββ | 264/321 [24:05<05:11, 5.46s/it] 83%|βββββββββ | 265/321 [24:10<05:06, 5.47s/it] {'loss': 0.0261, 'grad_norm': 0.15604187548160553, 'learning_rate': 1.5578034571722879e-06, 'kl': 0.1611, 'entropy': 0.0189, 'ce_loss': 0.0334, 'epoch': 2.46} | |
| 83%|βββββββββ | 265/321 [24:10<05:06, 5.47s/it] 83%|βββββββββ | 266/321 [24:16<05:00, 5.47s/it] {'loss': 0.0246, 'grad_norm': 0.1404842883348465, 'learning_rate': 1.5040908790741448e-06, 'kl': 0.1768, 'entropy': -0.0332, 'ce_loss': 0.0108, 'epoch': 2.47} | |
| 83%|βββββββββ | 266/321 [24:16<05:00, 5.47s/it] 83%|βββββββββ | 267/321 [24:21<04:57, 5.50s/it] {'loss': 0.0269, 'grad_norm': 0.14500805735588074, 'learning_rate': 1.4512452331729864e-06, 'kl': -0.0221, 'entropy': 0.1216, 'ce_loss': 0.0841, 'epoch': 2.48} | |
| 83%|βββββββββ | 267/321 [24:21<04:57, 5.50s/it] 83%|βββββββββ | 268/321 [24:27<04:51, 5.51s/it] {'loss': 0.0194, 'grad_norm': 0.10871770232915878, 'learning_rate': 1.3992719118984167e-06, 'kl': 0.0962, 'entropy': 0.017, 'ce_loss': 0.0248, 'epoch': 2.48} | |
| 83%|βββββββββ | 268/321 [24:27<04:51, 5.51s/it] 84%|βββββββββ | 269/321 [24:32<04:45, 5.49s/it] {'loss': 0.0251, 'grad_norm': 0.10633812099695206, 'learning_rate': 1.3481762186670556e-06, 'kl': 0.1523, 'entropy': -0.0044, 'ce_loss': 0.0175, 'epoch': 2.49} | |
| 84%|βββββββββ | 269/321 [24:32<04:45, 5.49s/it] 84%|βββββββββ | 270/321 [24:38<04:38, 5.46s/it] {'loss': 0.0232, 'grad_norm': 0.12149304151535034, 'learning_rate': 1.2979633673413571e-06, 'kl': 0.0952, 'entropy': 0.0371, 'ce_loss': 0.044, 'epoch': 2.5} | |
| 84%|βββββββββ | 270/321 [24:38<04:38, 5.46s/it] 84%|βββββββββ | 271/321 [24:43<04:32, 5.45s/it] {'loss': 0.02, 'grad_norm': 0.1349457949399948, 'learning_rate': 1.248638481697586e-06, 'kl': 0.0452, 'entropy': 0.1367, 'ce_loss': 0.0725, 'epoch': 2.51} | |
| 84%|βββββββββ | 271/321 [24:43<04:32, 5.45s/it] 85%|βββββββββ | 272/321 [24:49<04:27, 5.47s/it] {'loss': 0.0201, 'grad_norm': 0.11620452255010605, 'learning_rate': 1.2002065949029896e-06, 'kl': -0.0216, 'entropy': 0.106, 'ce_loss': 0.0581, 'epoch': 2.52} | |
| 85%|βββββββββ | 272/321 [24:49<04:27, 5.47s/it] 85%|βββββββββ | 273/321 [24:54<04:22, 5.47s/it] {'loss': 0.0254, 'grad_norm': 0.09986640512943268, 'learning_rate': 1.15267264900219e-06, 'kl': 0.0452, 'entropy': 0.0723, 'ce_loss': 0.0518, 'epoch': 2.53} | |
| 85%|βββββββββ | 273/321 [24:54<04:22, 5.47s/it] 85%|βββββββββ | 274/321 [25:00<04:17, 5.48s/it] {'loss': 0.025, 'grad_norm': 0.16179873049259186, 'learning_rate': 1.1060414944129106e-06, 'kl': 0.0488, 'entropy': 0.1445, 'ce_loss': 0.0931, 'epoch': 2.54} | |
| 85%|βββββββββ | 274/321 [25:00<04:17, 5.48s/it] 86%|βββββββββ | 275/321 [25:05<04:11, 5.47s/it] {'loss': 0.0233, 'grad_norm': 0.11301198601722717, 'learning_rate': 1.0603178894310185e-06, 'kl': 0.0815, 'entropy': 0.0598, 'ce_loss': 0.0472, 'epoch': 2.55} | |
| 86%|βββββββββ | 275/321 [25:05<04:11, 5.47s/it] 86%|βββββββββ | 276/321 [25:11<04:06, 5.48s/it] {'loss': 0.0308, 'grad_norm': 0.13731074333190918, 'learning_rate': 1.0155064997450026e-06, 'kl': 0.1147, 'entropy': 0.084, 'ce_loss': 0.0393, 'epoch': 2.56} | |
| 86%|βββββββββ | 276/321 [25:11<04:06, 5.48s/it] 86%|βββββββββ | 277/321 [25:16<04:00, 5.46s/it] {'loss': 0.0297, 'grad_norm': 0.21143203973770142, 'learning_rate': 9.716118979598533e-07, 'kl': 0.1777, 'entropy': 0.0203, 'ce_loss': 0.0453, 'epoch': 2.57} | |
| 86%|βββββββββ | 277/321 [25:16<04:00, 5.46s/it] 87%|βββββββββ | 278/321 [25:22<03:54, 5.46s/it] {'loss': 0.024, 'grad_norm': 0.12256018817424774, 'learning_rate': 9.286385631304939e-07, 'kl': 0.1738, 'entropy': 0.006, 'ce_loss': 0.0374, 'epoch': 2.58} | |
| 87%|βββββββββ | 278/321 [25:22<03:54, 5.46s/it] 87%|βββββββββ | 279/321 [25:27<03:48, 5.45s/it] {'loss': 0.03, 'grad_norm': 0.11462045460939407, 'learning_rate': 8.865908803047241e-07, 'kl': 0.0493, 'entropy': 0.0952, 'ce_loss': 0.0606, 'epoch': 2.59} | |
| 87%|βββββββββ | 279/321 [25:27<03:48, 5.45s/it] 87%|βββββββββ | 280/321 [25:33<03:44, 5.49s/it] {'loss': 0.0301, 'grad_norm': 0.16785408556461334, 'learning_rate': 8.454731400757599e-07, 'kl': 0.1279, 'entropy': 0.0143, 'ce_loss': 0.023, 'epoch': 2.6} | |
| 87%|βββββββββ | 280/321 [25:33<03:44, 5.49s/it] 88%|βββββββββ | 281/321 [25:38<03:39, 5.48s/it] {'loss': 0.0232, 'grad_norm': 0.1566372960805893, 'learning_rate': 8.052895381444226e-07, 'kl': 0.1226, 'entropy': 0.0522, 'ce_loss': 0.0394, 'epoch': 2.61} | |
| 88%|βββββββββ | 281/321 [25:38<03:39, 5.48s/it] 88%|βββββββββ | 282/321 [25:43<03:33, 5.46s/it] {'loss': 0.0247, 'grad_norm': 0.1414063423871994, 'learning_rate': 7.660441748909997e-07, 'kl': -0.0349, 'entropy': 0.1089, 'ce_loss': 0.0876, 'epoch': 2.62} | |
| 88%|βββββββββ | 282/321 [25:43<03:33, 5.46s/it] 88%|βββββββββ | 283/321 [25:49<03:27, 5.45s/it] {'loss': 0.0283, 'grad_norm': 0.1220989003777504, 'learning_rate': 7.277410549568476e-07, 'kl': 0.1699, 'entropy': 0.04, 'ce_loss': 0.0445, 'epoch': 2.62} | |
| 88%|βββββββββ | 283/321 [25:49<03:27, 5.45s/it] 88%|βββββββββ | 284/321 [25:54<03:21, 5.45s/it] {'loss': 0.022, 'grad_norm': 0.15742984414100647, 'learning_rate': 6.903840868357382e-07, 'kl': -0.0177, 'entropy': 0.1514, 'ce_loss': 0.087, 'epoch': 2.63} | |
| 88%|βββββββββ | 284/321 [25:54<03:21, 5.45s/it] 89%|βββββββββ | 285/321 [26:00<03:16, 5.45s/it] {'loss': 0.0236, 'grad_norm': 0.10442278534173965, 'learning_rate': 6.539770824750447e-07, 'kl': 0.0131, 'entropy': 0.0977, 'ce_loss': 0.0688, 'epoch': 2.64} | |
| 89%|βββββββββ | 285/321 [26:00<03:16, 5.45s/it] 89%|βββββββββ | 286/321 [26:05<03:11, 5.46s/it] {'loss': 0.0181, 'grad_norm': 0.11422933638095856, 'learning_rate': 6.185237568867597e-07, 'kl': -0.0327, 'entropy': 0.1719, 'ce_loss': 0.0929, 'epoch': 2.65} | |
| 89%|βββββββββ | 286/321 [26:05<03:11, 5.46s/it] 89%|βββββββββ | 287/321 [26:11<03:06, 5.48s/it] {'loss': 0.0256, 'grad_norm': 0.1048831194639206, 'learning_rate': 5.840277277684136e-07, 'kl': 0.0547, 'entropy': 0.0996, 'ce_loss': 0.06, 'epoch': 2.66} | |
| 89%|βββββββββ | 287/321 [26:11<03:06, 5.48s/it] 90%|βββββββββ | 288/321 [26:16<03:01, 5.49s/it] {'loss': 0.0276, 'grad_norm': 0.1242062970995903, 'learning_rate': 5.504925151339191e-07, 'kl': 0.05, 'entropy': 0.0664, 'ce_loss': 0.0539, 'epoch': 2.67} | |
| 90%|βββββββββ | 288/321 [26:16<03:01, 5.49s/it] 90%|βββββββββ | 289/321 [26:22<02:54, 5.46s/it] {'loss': 0.0219, 'grad_norm': 0.14201240241527557, 'learning_rate': 5.179215409543848e-07, 'kl': 0.0869, 'entropy': 0.0752, 'ce_loss': 0.0568, 'epoch': 2.68} | |
| 90%|βββββββββ | 289/321 [26:22<02:54, 5.46s/it] 90%|βββββββββ | 290/321 [26:27<02:49, 5.45s/it] {'loss': 0.0253, 'grad_norm': 0.15880657732486725, 'learning_rate': 4.863181288089391e-07, 'kl': -0.025, 'entropy': 0.0996, 'ce_loss': 0.0623, 'epoch': 2.69} | |
| 90%|βββββββββ | 290/321 [26:27<02:49, 5.45s/it] 91%|βββββββββ | 291/321 [26:33<02:43, 5.46s/it] {'loss': 0.0247, 'grad_norm': 0.11968444287776947, 'learning_rate': 4.556855035455787e-07, 'kl': 0.1699, 'entropy': 0.0034, 'ce_loss': 0.0265, 'epoch': 2.7} | |
| 91%|βββββββββ | 291/321 [26:33<02:43, 5.46s/it] 91%|βββββββββ | 292/321 [26:38<02:37, 5.44s/it] {'loss': 0.0378, 'grad_norm': 0.13923847675323486, 'learning_rate': 4.2602679095210766e-07, 'kl': 0.2217, 'entropy': -0.019, 'ce_loss': 0.0313, 'epoch': 2.71} | |
| 91%|█████████ | 292/321 [26:38<02:37, 5.44s/it] 91%|█████████▏| 293/321 [26:43<02:32, 5.44s/it] {'loss': 0.0263, 'grad_norm': 0.2090776115655899, 'learning_rate': 3.9734501743717956e-07, 'kl': 0.1104, 'entropy': 0.0003, 'ce_loss': 0.0207, 'epoch': 2.72} | |
| 91%|█████████▏| 293/321 [26:43<02:32, 5.44s/it] 92%|█████████▏| 294/321 [26:49<02:27, 5.48s/it] {'loss': 0.0244, 'grad_norm': 0.1359066516160965, 'learning_rate': 3.696431097214748e-07, 'kl': 0.1885, 'entropy': -0.0079, 'ce_loss': 0.0201, 'epoch': 2.73} | |
| 92%|█████████▏| 294/321 [26:49<02:27, 5.48s/it] 92%|█████████▏| 295/321 [26:54<02:22, 5.48s/it] {'loss': 0.0263, 'grad_norm': 0.13609451055526733, 'learning_rate': 3.429238945390556e-07, 'kl': 0.0571, 'entropy': 0.0801, 'ce_loss': 0.0708, 'epoch': 2.74} | |
| 92%|█████████▏| 295/321 [26:54<02:22, 5.48s/it] 92%|█████████▏| 296/321 [27:00<02:16, 5.47s/it] {'loss': 0.0205, 'grad_norm': 0.1158941388130188, 'learning_rate': 3.171900983489273e-07, 'kl': 0.0386, 'entropy': 0.1406, 'ce_loss': 0.0776, 'epoch': 2.75} | |
| 92%|█████████▏| 296/321 [27:00<02:16, 5.47s/it] 93%|█████████▎| 297/321 [27:05<02:11, 5.46s/it] {'loss': 0.0264, 'grad_norm': 0.11173633486032486, 'learning_rate': 2.9244434705682276e-07, 'kl': 0.1836, 'entropy': 0.0081, 'ce_loss': 0.0369, 'epoch': 2.76} | |
| 93%|█████████▎| 297/321 [27:05<02:11, 5.46s/it] 93%|█████████▎| 298/321 [27:11<02:05, 5.45s/it] {'loss': 0.0233, 'grad_norm': 0.14700034260749817, 'learning_rate': 2.6868916574725347e-07, 'kl': 0.1787, 'entropy': 0.0089, 'ce_loss': 0.033, 'epoch': 2.76} | |
| 93%|█████████▎| 298/321 [27:11<02:05, 5.45s/it] 93%|█████████▎| 299/321 [27:16<01:59, 5.44s/it] {'loss': 0.0253, 'grad_norm': 0.11715128272771835, 'learning_rate': 2.459269784258467e-07, 'kl': 0.1318, 'entropy': -0.0114, 'ce_loss': 0.0168, 'epoch': 2.77} | |
| 93%|█████████▎| 299/321 [27:16<01:59, 5.44s/it] 93%|█████████▎| 300/321 [27:22<01:54, 5.43s/it] {'loss': 0.0222, 'grad_norm': 0.11396218836307526, 'learning_rate': 2.2416010777199904e-07, 'kl': 0.0532, 'entropy': 0.0427, 'ce_loss': 0.0435, 'epoch': 2.78} | |
| 93%|█████████▎| 300/321 [27:22<01:54, 5.43s/it] 94%|█████████▍| 301/321 [27:27<01:48, 5.44s/it] {'loss': 0.0283, 'grad_norm': 0.14802348613739014, 'learning_rate': 2.0339077490186488e-07, 'kl': -0.0006, 'entropy': 0.1069, 'ce_loss': 0.0683, 'epoch': 2.79} | |
| 94%|█████████▍| 301/321 [27:27<01:48, 5.44s/it] 94%|█████████▍| 302/321 [27:32<01:43, 5.43s/it] {'loss': 0.0278, 'grad_norm': 0.1581055223941803, 'learning_rate': 1.83621099141712e-07, 'kl': 0.1235, 'entropy': 0.0522, 'ce_loss': 0.0377, 'epoch': 2.8} | |
| 94%|█████████▍| 302/321 [27:32<01:43, 5.43s/it] 94%|█████████▍| 303/321 [27:38<01:37, 5.44s/it] {'loss': 0.0254, 'grad_norm': 0.1633712649345398, 'learning_rate': 1.648530978116658e-07, 'kl': -0.0391, 'entropy': 0.085, 'ce_loss': 0.0629, 'epoch': 2.81} | |
| 94%|█████████▍| 303/321 [27:38<01:37, 5.44s/it] 95%|█████████▍| 304/321 [27:43<01:33, 5.47s/it] {'loss': 0.0189, 'grad_norm': 0.13110966980457306, 'learning_rate': 1.4708868601985503e-07, 'kl': 0.1309, 'entropy': 0.0378, 'ce_loss': 0.0404, 'epoch': 2.82} | |
| 95%|█████████▍| 304/321 [27:43<01:33, 5.47s/it] 95%|█████████▌| 305/321 [27:49<01:27, 5.48s/it] {'loss': 0.0218, 'grad_norm': 0.1552368849515915, 'learning_rate': 1.303296764669959e-07, 'kl': -0.0381, 'entropy': 0.1514, 'ce_loss': 0.0775, 'epoch': 2.83} | |
| 95%|█████████▌| 305/321 [27:49<01:27, 5.48s/it] 95%|█████████▌| 306/321 [27:54<01:21, 5.46s/it] {'loss': 0.0259, 'grad_norm': 0.09103472530841827, 'learning_rate': 1.1457777926141889e-07, 'kl': 0.1006, 'entropy': 0.0454, 'ce_loss': 0.045, 'epoch': 2.84} | |
| 95%|█████████▌| 306/321 [27:54<01:21, 5.46s/it] 96%|█████████▌| 307/321 [28:00<01:16, 5.46s/it] {'loss': 0.0249, 'grad_norm': 0.15214498341083527, 'learning_rate': 9.98346017445706e-08, 'kl': -0.0483, 'entropy': 0.1377, 'ce_loss': 0.089, 'epoch': 2.85} | |
| 96%|█████████▌| 307/321 [28:00<01:16, 5.46s/it] 96%|█████████▌| 308/321 [28:05<01:11, 5.47s/it] {'loss': 0.0269, 'grad_norm': 0.15390083193778992, 'learning_rate': 8.610164832699608e-08, 'kl': 0.1196, 'entropy': 0.025, 'ce_loss': 0.0234, 'epoch': 2.86} | |
| 96%|█████████▌| 308/321 [28:05<01:11, 5.47s/it] 96%|█████████▋| 309/321 [28:11<01:06, 5.50s/it] {'loss': 0.0234, 'grad_norm': 0.1400686502456665, 'learning_rate': 7.338032033482712e-08, 'kl': 0.1797, 'entropy': -0.0146, 'ce_loss': 0.0134, 'epoch': 2.87} | |
| 96%|█████████▋| 309/321 [28:11<01:06, 5.50s/it] 97%|█████████▋| 310/321 [28:16<01:00, 5.51s/it] {'loss': 0.0266, 'grad_norm': 0.15008477866649628, 'learning_rate': 6.167191586679556e-08, 'kl': -0.0147, 'entropy': 0.1729, 'ce_loss': 0.093, 'epoch': 2.88} | |
| 97%|█████████▋| 310/321 [28:16<01:00, 5.51s/it] 97%|█████████▋| 311/321 [28:22<00:55, 5.50s/it] {'loss': 0.0212, 'grad_norm': 0.09048085659742355, 'learning_rate': 5.097762966176256e-08, 'kl': -0.0322, 'entropy': 0.1455, 'ce_loss': 0.089, 'epoch': 2.89} | |
| 97%|█████████▋| 311/321 [28:22<00:55, 5.50s/it] 97%|█████████▋| 312/321 [28:27<00:49, 5.50s/it] {'loss': 0.0269, 'grad_norm': 0.13809086382389069, 'learning_rate': 4.129855297681618e-08, 'kl': 0.0903, 'entropy': 0.003, 'ce_loss': 0.0238, 'epoch': 2.9} | |
| 97%|█████████▋| 312/321 [28:27<00:49, 5.50s/it] 98%|█████████▊| 313/321 [28:33<00:43, 5.47s/it] {'loss': 0.0279, 'grad_norm': 0.16971167922019958, 'learning_rate': 3.2635673475910345e-08, 'kl': -0.0371, 'entropy': 0.1758, 'ce_loss': 0.0989, 'epoch': 2.9} | |
| 98%|█████████▊| 313/321 [28:33<00:43, 5.47s/it] 98%|█████████▊| 314/321 [28:38<00:38, 5.47s/it] {'loss': 0.0271, 'grad_norm': 0.11413148045539856, 'learning_rate': 2.4989875129091124e-08, 'kl': 0.1738, 'entropy': 0.0215, 'ce_loss': 0.0363, 'epoch': 2.91} | |
| 98%|█████████▊| 314/321 [28:38<00:38, 5.47s/it] 98%|█████████▊| 315/321 [28:44<00:32, 5.46s/it] {'loss': 0.0239, 'grad_norm': 0.1775045543909073, 'learning_rate': 1.8361938122287704e-08, 'kl': -0.0293, 'entropy': 0.1729, 'ce_loss': 0.0902, 'epoch': 2.92} | |
| 98%|█████████▊| 315/321 [28:44<00:32, 5.46s/it] 98%|█████████▊| 316/321 [28:49<00:27, 5.47s/it] {'loss': 0.0245, 'grad_norm': 0.1191033273935318, 'learning_rate': 1.2752538777704993e-08, 'kl': 0.0747, 'entropy': 0.0786, 'ce_loss': 0.0517, 'epoch': 2.93} | |
| 98%|█████████▊| 316/321 [28:49<00:27, 5.47s/it] 99%|█████████▉| 317/321 [28:55<00:21, 5.44s/it] {'loss': 0.0234, 'grad_norm': 0.09175986796617508, 'learning_rate': 8.162249484809926e-09, 'kl': 0.1289, 'entropy': 0.0938, 'ce_loss': 0.0597, 'epoch': 2.94} | |
| 99%|█████████▉| 317/321 [28:55<00:21, 5.44s/it] 99%|█████████▉| 318/321 [29:00<00:16, 5.44s/it] {'loss': 0.0248, 'grad_norm': 0.13051749765872955, 'learning_rate': 4.591538641927074e-09, 'kl': 0.033, 'entropy': 0.1641, 'ce_loss': 0.073, 'epoch': 2.95} | |
| 99%|█████████▉| 318/321 [29:00<00:16, 5.44s/it] 99%|█████████▉| 319/321 [29:05<00:10, 5.43s/it] {'loss': 0.0265, 'grad_norm': 0.12332940846681595, 'learning_rate': 2.0407706084368816e-09, 'kl': 0.0703, 'entropy': 0.0957, 'ce_loss': 0.0564, 'epoch': 2.96} | |
| 99%|█████████▉| 319/321 [29:05<00:10, 5.43s/it] 100%|█████████▉| 320/321 [29:11<00:05, 5.45s/it] {'loss': 0.0271, 'grad_norm': 0.1580265313386917, 'learning_rate': 5.102056675998501e-10, 'kl': 0.1089, 'entropy': 0.0615, 'ce_loss': 0.044, 'epoch': 2.97} | |
| 100%|█████████▉| 320/321 [29:11<00:05, 5.45s/it] 100%|██████████| 321/321 [29:16<00:00, 5.44s/it] {'loss': 0.0351, 'grad_norm': 0.14408810436725616, 'learning_rate': 0.0, 'kl': 0.0928, 'entropy': 0.1328, 'ce_loss': 0.0641, 'epoch': 2.98} | |
| 100%|██████████| 321/321 [29:16<00:00, 5.44s/it][INFO|trainer.py:2665] 2025-04-10 17:23:54,853 >> | |
| Training completed. Do not forget to share your model on huggingface.co/models =) | |
| {'train_runtime': 1756.865, 'train_samples_per_second': 5.854, 'train_steps_per_second': 0.183, 'train_loss': 0.04484545481840955, 'epoch': 2.98} | |
| 100%|██████████| 321/321 [29:16<00:00, 5.44s/it] 100%|██████████| 321/321 [29:16<00:00, 5.47s/it] | |
| [INFO|trainer.py:3966] 2025-04-10 17:24:03,430 >> Saving model checkpoint to /home/stern/GRPO/offline_rl_v2/output | |
| [INFO|configuration_utils.py:423] 2025-04-10 17:24:03,433 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/config.json | |
| [INFO|configuration_utils.py:908] 2025-04-10 17:24:03,433 >> Configuration saved in /home/stern/GRPO/offline_rl_v2/output/generation_config.json | |
| [2025-04-10 17:24:05,933] [INFO] [launch.py:351:main] Process 501941 exits successfully. | |
| [2025-04-10 17:24:06,934] [INFO] [launch.py:351:main] Process 501942 exits successfully. | |
| [2025-04-10 17:24:07,935] [INFO] [launch.py:351:main] Process 501943 exits successfully. | |
| [2025-04-10 17:24:07,935] [INFO] [launch.py:351:main] Process 501946 exits successfully. | |
| [2025-04-10 17:24:07,936] [INFO] [launch.py:351:main] Process 501944 exits successfully. | |
| [2025-04-10 17:24:08,937] [INFO] [launch.py:351:main] Process 501945 exits successfully. | |
| [2025-04-10 17:24:08,937] [INFO] [launch.py:351:main] Process 501940 exits successfully. | |
| [INFO|modeling_utils.py:3594] 2025-04-10 17:24:18,916 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /home/stern/GRPO/offline_rl_v2/output/model.safetensors.index.json. | |
| [INFO|tokenization_utils_base.py:2510] 2025-04-10 17:24:18,917 >> tokenizer config file saved in /home/stern/GRPO/offline_rl_v2/output/tokenizer_config.json | |
| [INFO|tokenization_utils_base.py:2519] 2025-04-10 17:24:18,917 >> Special tokens file saved in /home/stern/GRPO/offline_rl_v2/output/special_tokens_map.json | |
| ***** train metrics ***** | |
| epoch = 2.979 | |
| total_flos = 3318202242GF | |
| train_loss = 0.0448 | |
| train_runtime = 0:29:16.86 | |
| train_samples = 3428 | |
| train_samples_per_second = 5.854 | |
| train_steps_per_second = 0.183 | |
| [rank0]:[W410 17:24:19.182384317 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) | |
| [2025-04-10 17:24:21,950] [INFO] [launch.py:351:main] Process 501939 exits successfully. | |