[2026-03-12 21:41:44,720] [DEBUG] [axolotl.utils.config.resolve_dtype:66] [PID:4624] bf16 support detected, enabling for this configuration.
[2026-03-12 21:41:44,952] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:4624] baseline 0.000GB ()
[2026-03-12 21:41:44,953] [INFO] [axolotl.cli.config.load_cfg:259] [PID:4624] config:
{
  "activation_offloading": false,
  "auto_resume_from_checkpoints": true,
  "axolotl_config_path": "output/sft.yml",
  "base_model": "kajuma/DiffLlama-1B",
  "base_model_config": "kajuma/DiffLlama-1B",
  "batch_size": 32,
  "bf16": true,
  "capabilities": {
    "bf16": true,
    "compute_capability": "sm_120",
    "fp8": true,
    "n_gpu": 1,
    "n_node": 1
  },
  "chat_template": "tokenizer_default",
  "context_parallel_size": 1,
  "cosine_min_lr_ratio": 0.1,
  "dataloader_num_workers": 1,
  "dataloader_pin_memory": true,
  "dataloader_prefetch_factor": 256,
  "dataset_num_proc": 24,
  "dataset_prepared_path": "./output/dataset",
  "datasets": [
    {
      "chat_template": "tokenizer_default",
      "field_messages": "messages",
      "message_property_mappings": {
        "content": "content",
        "role": "role"
      },
      "path": "kajuma/Zero_SFT_Ja_v3.5",
      "trust_remote_code": false,
      "type": "chat_template"
    }
  ],
  "ddp": false,
  "device": "cuda:0",
  "dion_rank_fraction": 1.0,
  "dion_rank_multiple_of": 1,
  "env_capabilities": {
    "torch_version": "2.8.0"
  },
  "eval_batch_size": 4,
  "eval_causal_lm_metrics": [
    "sacrebleu",
    "comet",
    "ter",
    "chrf"
  ],
  "eval_max_new_tokens": 128,
  "eval_sample_packing": false,
  "eval_steps": 100,
  "eval_table_size": 0,
  "experimental_skip_move_to_device": true,
  "flash_attention": false,
  "fp16": false,
  "gradient_accumulation_steps": 32,
  "gradient_checkpointing": false,
  "group_by_length": false,
  "hf_use_auth_token": true,
  "include_tkps": true,
  "is_falcon_derived_model": false,
  "is_llama_derived_model": true,
  "is_mistral_derived_model": false,
  "learning_rate": 0.0005,
  "liger_cross_entropy": false,
  "liger_fused_linear_cross_entropy": true,
  "liger_glu_activation": true,
  "liger_rms_norm": true,
  "liger_rope": true,
  "lisa_layers_attribute": "model.layers",
  "load_best_model_at_end": false,
  "load_in_4bit": false,
  "load_in_8bit": false,
  "local_rank": 0,
  "logging_steps": 1,
  "loraplus_lr_embedding": 1e-06,
  "lr_scheduler": "cosine",
  "mean_resizing_embeddings": false,
  "micro_batch_size": 1,
  "model_config_type": "diffllama",
  "num_epochs": 1.0,
  "optimizer": "adamw_torch",
  "otel_metrics_host": "localhost",
  "otel_metrics_port": 8000,
  "output_dir": "./output/model",
  "pad_to_sequence_len": true,
  "plugins": [
    "axolotl.integrations.liger.LigerPlugin"
  ],
  "pretrain_multipack_attn": true,
  "profiler_steps_start": 0,
  "qlora_sharded_model_loading": false,
  "ray_num_workers": 1,
  "remove_unused_columns": false,
  "resources_per_worker": {
    "GPU": 1
  },
  "sample_packing": true,
  "sample_packing_bin_size": 200,
  "sample_packing_group_size": 100000,
  "save_only_model": false,
  "save_safetensors": true,
  "save_steps": 100,
  "save_strategy": "steps",
  "save_total_limit": 1,
  "sequence_len": 4096,
  "shuffle_before_merging_datasets": false,
  "shuffle_merged_datasets": true,
  "skip_prepare_dataset": false,
  "streaming_multipack_buffer_size": 10000,
  "strict": false,
  "tensor_parallel_size": 1,
  "tf32": false,
  "tiled_mlp_use_original_mlp": true,
  "tokenizer_config": "kajuma/DiffLlama-1B",
  "tokenizer_save_jinja_files": true,
  "tokenizer_type": "AutoTokenizer",
  "torch_dtype": "torch.bfloat16",
  "train_on_inputs": false,
  "trl": {
    "log_completions": false,
    "mask_truncated_completions": false,
    "ref_model_mixup_alpha": 0.9,
    "ref_model_sync_steps": 64,
    "scale_rewards": true,
    "sync_ref_model": false,
    "use_vllm": false,
    "vllm_server_host": "0.0.0.0",
    "vllm_server_port": 8000
  },
  "type_of_model": "AutoModelForCausalLM",
  "use_otel_metrics": false,
  "use_ray": false,
  "use_wandb": true,
  "val_set_size": 0.002,
  "vllm": {
    "device": "auto",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
    "host": "0.0.0.0",
    "port": 8000
  },
  "wandb_entity": "tepic",
  "wandb_name": "diffllama-sft-datapilot",
  "wandb_project": "diffllama",
  "warmup_steps": 20,
  "weight_decay": 0.01,
  "world_size": 1
}
[2026-03-12 21:41:46,326] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:285] [PID:4624] EOS: 2 / </s>
[2026-03-12 21:41:46,326] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:286] [PID:4624] BOS: 1 / <s>
[2026-03-12 21:41:46,326] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:287] [PID:4624] PAD: 3 / <pad>
[2026-03-12 21:41:46,326] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:288] [PID:4624] UNK: 0 / <unk>
[2026-03-12 21:41:46,326] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:475] [PID:4624] Loading prepared dataset from disk at output/dataset/a693065c12ad716b550e60474f46c363...
Loading dataset from disk: 100%|██████████| 24/24 [00:00<00:00, 262828.45it/s]
[2026-03-12 21:41:46,550] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:417] [PID:4624] total_num_tokens: 74_944_883
[2026-03-12 21:41:47,285] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:435] [PID:4624] `total_supervised_tokens: 72_113_682`
[2026-03-12 21:41:47,958] [DEBUG] [axolotl.utils.samplers.multipack.pack_parallel:177] [PID:4624] Using single process for pack_parallel, running sequentially.
[2026-03-12 21:41:48,917] [DEBUG] [axolotl.utils.samplers.multipack.pack_parallel:177] [PID:4624] Using single process for pack_parallel, running sequentially.
[2026-03-12 21:41:49,469] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.5612318515777588
[2026-03-12 21:41:49,479] [DEBUG] [axolotl.utils.samplers.multipack.pack_parallel:177] [PID:4624] Using single process for pack_parallel, running sequentially.
[2026-03-12 21:41:50,030] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.5603342056274414
[2026-03-12 21:41:50,041] [DEBUG] [axolotl.utils.samplers.multipack.pack_parallel:177] [PID:4624] Using single process for pack_parallel, running sequentially.
[2026-03-12 21:41:50,590] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.5587310791015625
[2026-03-12 21:41:50,601] [DEBUG] [axolotl.utils.samplers.multipack.pack_parallel:177] [PID:4624] Using single process for pack_parallel, running sequentially.
[2026-03-12 21:41:51,153] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.561417818069458
[2026-03-12 21:41:51,178] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:438] [PID:4624] gather_len_batches: [18407]
[2026-03-12 21:41:51,178] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:494] [PID:4624] data_loader_len: 575
[2026-03-12 21:41:51,178] [INFO] [axolotl.utils.trainer.calc_sample_packing_eff_est:510] [PID:4624] sample_packing_eff_est across ranks: [0.9940289333499145]
[2026-03-12 21:41:51,178] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:522] [PID:4624] sample_packing_eff_est: 1.0
[2026-03-12 21:41:51,178] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:533] [PID:4624] total_num_steps: 575
[2026-03-12 21:41:51,180] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:121] [PID:4624] Maximum number of steps set at 575
[2026-03-12 21:41:51,220] [DEBUG] [axolotl.train.setup_model_and_tokenizer:70] [PID:4624] loading tokenizer... kajuma/DiffLlama-1B
[2026-03-12 21:41:52,144] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:285] [PID:4624] EOS: 2 / </s>
[2026-03-12 21:41:52,144] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:286] [PID:4624] BOS: 1 / <s>
[2026-03-12 21:41:52,144] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:287] [PID:4624] PAD: 3 / <pad>
[2026-03-12 21:41:52,144] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:288] [PID:4624] UNK: 0 / <unk>
[2026-03-12 21:41:52,144] [DEBUG] [axolotl.train.setup_model_and_tokenizer:82] [PID:4624] Loading model
[2026-03-12 21:41:52,347] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:87] [PID:4624] Patched Trainer.evaluation_loop with nanmean loss calculation
[2026-03-12 21:41:52,347] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:138] [PID:4624] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
[2026-03-12 21:41:52,348] [INFO] [axolotl.loaders.patch_manager._apply_multipack_patches:344] [PID:4624] Applying multipack dataloader patch for sample packing...
[2026-03-12 21:41:52,348] [INFO] [axolotl.loaders.patch_manager._patch_llama_sample_packing:473] [PID:4624] Patching llama _prepare_4d_causal_attention_mask*...
[2026-03-12 21:41:52,409] [WARNING] [axolotl.integrations.liger.plugin.warning_once:46] [PID:4624] Applied ONLY liger_fused_linear_cross_entropy genericpatches for model type: diffllama
[2026-03-12 21:41:52,410] [WARNING] [axolotl.integrations.liger.plugin.warning_once:46] [PID:4624] Liger + diffllama generic FLCE support is experimental and may not work as expected.
[2026-03-12 21:41:53,474] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:4624] Memory usage after model load 0.000GB ()
[2026-03-12 21:41:57,259] [INFO] [axolotl.train.save_initial_configs:417] [PID:4624] Pre-saving tokenizer to ./output/model...
[2026-03-12 21:41:57,273] [INFO] [axolotl.train.save_initial_configs:422] [PID:4624] Pre-saving model config to ./output/model...
[2026-03-12 21:41:57,274] [INFO] [axolotl.train.execute_training:212] [PID:4624] Starting trainer...
[2026-03-12 21:41:59,761] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.7732937335968018
[2026-03-12 21:42:00,542] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.7788197994232178
[2026-03-12 21:42:01,330] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.7867546081542969
[2026-03-12 21:42:02,110] [DEBUG] [axolotl.utils.samplers.multipack.__len__:462] [PID:4624] generate_batches time: 0.7773358821868896
[2026-03-12 21:42:02,110] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:438] [PID:4624] gather_len_batches: [18405]
[34m[1mwandb[0m: Currently logged in as: [33mweak-kajuma[0m ([33mtepic[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: [38;5;178m⢿[0m Waiting for wandb.init()...
[Am[2K[34m[1mwandb[0m: [38;5;178m⣻[0m Waiting for wandb.init()...
[Am[2K[34m[1mwandb[0m: [38;5;178m⣽[0m setting up run arakm91f (0.2s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⣾[0m setting up run arakm91f (0.2s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⣷[0m setting up run arakm91f (0.2s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⣯[0m setting up run arakm91f (0.2s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⣟[0m setting up run arakm91f (0.2s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⡿[0m setting up run arakm91f (0.7s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⢿[0m setting up run arakm91f (0.7s)
[Am[2K[34m[1mwandb[0m: [38;5;178m⣻[0m setting up run arakm91f (0.7s)
[Am[2K[34m[1mwandb[0m: Tracking run with wandb version 0.23.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/workspace/axolotl/wandb/run-20260312_214202-arakm91f[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mdiffllama-sft-datapilot[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/tepic/diffllama[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/tepic/diffllama/runs/arakm91f[0m
[34m[1mwandb[0m: Detected [huggingface_hub.inference] in use.
[34m[1mwandb[0m: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
[34m[1mwandb[0m: For more information, check out the docs at: https://weave-docs.wandb.ai/
[34m[1mwandb[0m: [33mWARNING[0m Saving files without folders. If you want to preserve subdirectories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
[34m[1mwandb[0m: [33mWARNING[0m Symlinked 1 file into the W&B run directory; call wandb.save again to sync new files.
[2026-03-12 21:42:05,653] [INFO] [axolotl.utils.callbacks.on_train_begin:757] [PID:4624] The Axolotl config has been saved to the WandB run under files.
  0%|          | 0/575 [00:00<?, ?it/s]
[2026-03-12 21:42:05,659] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...
100%|██████████| 54/54 [00:25<00:00,  2.02it/s]
{'eval_loss': 2.5498716831207275, 'eval_runtime': 27.389, 'eval_samples_per_second': 7.886, 'eval_steps_per_second': 1.972, 'eval_ppl': 12.80546, 'memory/max_active (GiB)': 19.52, 'memory/max_allocated (GiB)': 19.52, 'memory/device_reserved (GiB)': 19.89, 'epoch': 0}
  0%|          | 1/575 [00:48<7:47:51, 48.91s/it]
{'loss': 5.5262, 'grad_norm': 11.886768341064453, 'learning_rate': 0.0, 'ppl': 251.18758, 'memory/max_active (GiB)': 19.28, 'memory/max_allocated (GiB)': 19.28, 'memory/device_reserved (GiB)': 19.9, 'tokens/train_per_sec_per_gpu': 224.19659423828125, 'epoch': 0.0, 'tokens/total': 131072.0, 'tokens/trainable': 125259.0}
  0%|          | 2/575 [01:06<4:50:24, 30.41s/it]
{'loss': 5.603, 'grad_norm': 12.424903869628906, 'learning_rate': 2.5e-05, 'ppl': 271.2389, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.13, 'tokens/train_per_sec_per_gpu': 228.0574951171875, 'epoch': 0.0, 'tokens/total': 262144.0, 'tokens/trainable': 250710.0}
  1%|          | 3/575 [01:23<3:52:44, 24.41s/it]
{'loss': 5.4656, 'grad_norm': 22.296113967895508, 'learning_rate': 5e-05, 'ppl': 236.41766, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.13, 'tokens/train_per_sec_per_gpu': 225.7509765625, 'epoch': 0.01, 'tokens/total': 393216.0, 'tokens/trainable': 375852.0}
  1%|          | 4/575 [01:40<3:25:33, 21.60s/it]
{'loss': 5.2154, 'grad_norm': 7.7741217613220215, 'learning_rate': 7.5e-05, 'ppl': 184.08544, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.13, 'tokens/train_per_sec_per_gpu': 226.19801330566406, 'epoch': 0.01, 'tokens/total': 524288.0, 'tokens/trainable': 501448.0}
  1%|          | 5/575 [01:57<3:09:37, 19.96s/it]
{'loss': 5.0551, 'grad_norm': 12.778233528137207, 'learning_rate': 0.0001, 'ppl': 156.82021, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.13, 'tokens/train_per_sec_per_gpu': 232.83627319335938, 'epoch': 0.01, 'tokens/total': 655360.0, 'tokens/trainable': 626813.0}
  1%|          | 6/575 [02:15<2:59:56, 18.97s/it]
{'loss': 4.8755, 'grad_norm': 6.244538307189941, 'learning_rate': 0.000125, 'ppl': 131.03966, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.03445434570312, 'epoch': 0.01, 'tokens/total': 786432.0, 'tokens/trainable': 752442.0}
  1%|          | 7/575 [02:32<2:54:26, 18.43s/it]
{'loss': 4.6956, 'grad_norm': 5.526381015777588, 'learning_rate': 0.00015, 'ppl': 109.46447, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.83641052246094, 'epoch': 0.01, 'tokens/total': 917504.0, 'tokens/trainable': 877964.0}
  1%|          | 8/575 [02:49<2:50:45, 18.07s/it]
{'loss': 4.4753, 'grad_norm': 5.647271156311035, 'learning_rate': 0.000175, 'ppl': 87.82094, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.80943298339844, 'epoch': 0.01, 'tokens/total': 1048576.0, 'tokens/trainable': 1003336.0}
  2%|          | 9/575 [03:06<2:47:30, 17.76s/it]
{'loss': 4.3744, 'grad_norm': 5.838485240936279, 'learning_rate': 0.0002, 'ppl': 79.39219, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 232.56668090820312, 'epoch': 0.02, 'tokens/total': 1179648.0, 'tokens/trainable': 1128578.0}
  2%|          | 10/575 [03:23<2:44:51, 17.51s/it]
{'loss': 4.1526, 'grad_norm': 3.655189037322998, 'learning_rate': 0.00022500000000000002, 'ppl': 63.59914, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 232.7894744873047, 'epoch': 0.02, 'tokens/total': 1310720.0, 'tokens/trainable': 1253963.0}
  2%|          | 11/575 [03:40<2:43:19, 17.37s/it]
{'loss': 4.0486, 'grad_norm': 3.2407336235046387, 'learning_rate': 0.00025, 'ppl': 57.31716, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 225.49671936035156, 'epoch': 0.02, 'tokens/total': 1441792.0, 'tokens/trainable': 1379336.0}
  2%|          | 12/575 [03:57<2:42:10, 17.28s/it]
{'loss': 3.937, 'grad_norm': 2.8525404930114746, 'learning_rate': 0.000275, 'ppl': 51.26458, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.71237182617188, 'epoch': 0.02, 'tokens/total': 1572864.0, 'tokens/trainable': 1504908.0}
  2%|███                                                                                                                                              | 12/575 [03:57<2:42:10, 17.28s/it]  2%|███▎                                                                                                                                             | 13/575 [04:15<2:42:39, 17.37s/it]                                                                                                                                                                                         {'loss': 3.7659, 'grad_norm': 4.807313919067383, 'learning_rate': 0.0003, 'ppl': 43.20257, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 220.6744384765625, 'epoch': 0.02, 'tokens/total': 1703936.0, 'tokens/trainable': 1630514.0}
  2%|███▎                                                                                                                                             | 13/575 [04:15<2:42:39, 17.37s/it]  2%|███▌                                                                                                                                             | 14/575 [04:33<2:43:13, 17.46s/it]                                                                                                                                                                                         {'loss': 3.7582, 'grad_norm': 3.977926254272461, 'learning_rate': 0.00032500000000000004, 'ppl': 42.87119, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 224.63963317871094, 'epoch': 0.02, 'tokens/total': 1835008.0, 'tokens/trainable': 1756010.0}
  2%|███▌                                                                                                                                             | 14/575 [04:33<2:43:13, 17.46s/it]  3%|███▊                                                                                                                                             | 15/575 [04:50<2:42:11, 17.38s/it]                                                                                                                                                                                         {'loss': 3.6957, 'grad_norm': 4.152910232543945, 'learning_rate': 0.00035, 'ppl': 40.27375, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.90830993652344, 'epoch': 0.03, 'tokens/total': 1966080.0, 'tokens/trainable': 1881606.0}
  3%|███▊                                                                                                                                             | 15/575 [04:50<2:42:11, 17.38s/it]  3%|████                                                                                                                                             | 16/575 [05:07<2:41:03, 17.29s/it]                                                                                                                                                                                         {'loss': 3.6816, 'grad_norm': 5.998203754425049, 'learning_rate': 0.000375, 'ppl': 39.70988, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.03050231933594, 'epoch': 0.03, 'tokens/total': 2097152.0, 'tokens/trainable': 2007056.0}
  3%|████                                                                                                                                             | 16/575 [05:07<2:41:03, 17.29s/it]  3%|████▎                                                                                                                                            | 17/575 [05:24<2:40:30, 17.26s/it]                                                                                                                                                                                         {'loss': 3.6465, 'grad_norm': 5.429075717926025, 'learning_rate': 0.0004, 'ppl': 38.34024, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.73880004882812, 'epoch': 0.03, 'tokens/total': 2228224.0, 'tokens/trainable': 2132480.0}
  3%|████▎                                                                                                                                            | 17/575 [05:24<2:40:30, 17.26s/it]  3%|████▌                                                                                                                                            | 18/575 [05:41<2:39:42, 17.20s/it]                                                                                                                                                                                         {'loss': 3.5547, 'grad_norm': 4.323876857757568, 'learning_rate': 0.000425, 'ppl': 34.97733, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 232.21595764160156, 'epoch': 0.03, 'tokens/total': 2359296.0, 'tokens/trainable': 2257981.0}
  3%|████▌                                                                                                                                            | 18/575 [05:41<2:39:42, 17.20s/it]  3%|████▊                                                                                                                                            | 19/575 [05:59<2:40:03, 17.27s/it]                                                                                                                                                                                         {'loss': 3.4706, 'grad_norm': 3.998586416244507, 'learning_rate': 0.00045000000000000004, 'ppl': 32.15603, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.81753540039062, 'epoch': 0.03, 'tokens/total': 2490368.0, 'tokens/trainable': 2383671.0}
  3%|████▊                                                                                                                                            | 19/575 [05:59<2:40:03, 17.27s/it]  3%|█████                                                                                                                                            | 20/575 [06:15<2:38:53, 17.18s/it]                                                                                                                                                                                         {'loss': 3.4406, 'grad_norm': 5.091058731079102, 'learning_rate': 0.000475, 'ppl': 31.20568, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.4381561279297, 'epoch': 0.03, 'tokens/total': 2621440.0, 'tokens/trainable': 2509051.0}
  3%|█████                                                                                                                                            | 20/575 [06:16<2:38:53, 17.18s/it]  4%|█████▎                                                                                                                                           | 21/575 [06:32<2:37:59, 17.11s/it]                                                                                                                                                                                         {'loss': 3.4546, 'grad_norm': 5.4216179847717285, 'learning_rate': 0.0005, 'ppl': 31.64563, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.25735473632812, 'epoch': 0.04, 'tokens/total': 2752512.0, 'tokens/trainable': 2634533.0}
  4%|█████▎                                                                                                                                           | 21/575 [06:32<2:37:59, 17.11s/it]  4%|█████▌                                                                                                                                           | 22/575 [06:50<2:38:16, 17.17s/it]                                                                                                                                                                                         {'loss': 3.2667, 'grad_norm': 4.395437240600586, 'learning_rate': 0.0004999963953330723, 'ppl': 26.22466, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.38475036621094, 'epoch': 0.04, 'tokens/total': 2883584.0, 'tokens/trainable': 2760564.0}
  4%|█████▌                                                                                                                                           | 22/575 [06:50<2:38:16, 17.17s/it]  4%|█████▊                                                                                                                                           | 23/575 [07:06<2:36:44, 17.04s/it]                                                                                                                                                                                         {'loss': 3.3202, 'grad_norm': 4.789601802825928, 'learning_rate': 0.0004999855814477881, 'ppl': 27.66588, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.4844970703125, 'epoch': 0.04, 'tokens/total': 3014656.0, 'tokens/trainable': 2886434.0}
  4%|█████▊                                                                                                                                           | 23/575 [07:07<2:36:44, 17.04s/it]  4%|██████                                                                                                                                           | 24/575 [07:23<2:36:13, 17.01s/it]                                                                                                                                                                                         {'loss': 3.328, 'grad_norm': 5.432803630828857, 'learning_rate': 0.0004999675586906404, 'ppl': 27.88252, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.4462127685547, 'epoch': 0.04, 'tokens/total': 3145728.0, 'tokens/trainable': 3012347.0}
  4%|██████                                                                                                                                           | 24/575 [07:23<2:36:13, 17.01s/it]  4%|██████▎                                                                                                                                          | 25/575 [07:41<2:36:26, 17.07s/it]                                                                                                                                                                                         {'loss': 3.2318, 'grad_norm': 3.6383492946624756, 'learning_rate': 0.000499942327639105, 'ppl': 25.3252, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.3309326171875, 'epoch': 0.04, 'tokens/total': 3276800.0, 'tokens/trainable': 3138226.0}
  4%|██████▎                                                                                                                                          | 25/575 [07:41<2:36:26, 17.07s/it]  5%|██████▌                                                                                                                                          | 26/575 [07:58<2:36:30, 17.10s/it]                                                                                                                                                                                         {'loss': 3.238, 'grad_norm': 4.404199123382568, 'learning_rate': 0.0004999098891016224, 'ppl': 25.48271, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.1204376220703, 'epoch': 0.05, 'tokens/total': 3407872.0, 'tokens/trainable': 3263860.0}
  5%|██████▌                                                                                                                                          | 26/575 [07:58<2:36:30, 17.10s/it]  5%|██████▊                                                                                                                                          | 27/575 [08:15<2:36:08, 17.10s/it]                                                                                                                                                                                         {'loss': 3.0619, 'grad_norm': 3.298025131225586, 'learning_rate': 0.0004998702441175712, 'ppl': 21.36812, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.22679138183594, 'epoch': 0.05, 'tokens/total': 3538944.0, 'tokens/trainable': 3389692.0}
  5%|██████▊                                                                                                                                          | 27/575 [08:15<2:36:08, 17.10s/it]  5%|███████                                                                                                                                          | 28/575 [08:32<2:36:27, 17.16s/it]                                                                                                                                                                                         {'loss': 3.0405, 'grad_norm': 3.9128572940826416, 'learning_rate': 0.0004998233939572357, 'ppl': 20.9157, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 221.4124298095703, 'epoch': 0.05, 'tokens/total': 3670016.0, 'tokens/trainable': 3515193.0}
  5%|███████                                                                                                                                          | 28/575 [08:32<2:36:27, 17.16s/it]  5%|███████▎                                                                                                                                         | 29/575 [08:49<2:35:17, 17.06s/it]                                                                                                                                                                                         {'loss': 3.0327, 'grad_norm': 3.9056589603424072, 'learning_rate': 0.0004997693401217645, 'ppl': 20.75319, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.6403045654297, 'epoch': 0.05, 'tokens/total': 3801088.0, 'tokens/trainable': 3640646.0}
  5%|███████▎                                                                                                                                         | 29/575 [08:49<2:35:17, 17.06s/it]  5%|███████▌                                                                                                                                         | 30/575 [09:06<2:35:21, 17.10s/it]                                                                                                                                                                                         {'loss': 2.8898, 'grad_norm': 3.1938741207122803, 'learning_rate': 0.0004997080843431226, 'ppl': 17.98971, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.5078125, 'epoch': 0.05, 'tokens/total': 3932160.0, 'tokens/trainable': 3766170.0}
  5%|███████▌                                                                                                                                         | 30/575 [09:06<2:35:21, 17.10s/it]  5%|███████▊                                                                                                                                         | 31/575 [09:24<2:35:38, 17.17s/it]                                                                                                                                                                                         {'loss': 2.9214, 'grad_norm': 3.099994421005249, 'learning_rate': 0.0004996396285840362, 'ppl': 18.56726, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.0727081298828, 'epoch': 0.05, 'tokens/total': 4063232.0, 'tokens/trainable': 3891840.0}
  5%|███████▊                                                                                                                                         | 31/575 [09:24<2:35:38, 17.17s/it]  6%|████████                                                                                                                                         | 32/575 [09:41<2:36:04, 17.25s/it]                                                                                                                                                                                         {'loss': 2.9325, 'grad_norm': 3.591951608657837, 'learning_rate': 0.0004995639750379294, 'ppl': 18.77451, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.42608642578125, 'epoch': 0.06, 'tokens/total': 4194304.0, 'tokens/trainable': 4017647.0}
  6%|████████                                                                                                                                         | 32/575 [09:41<2:36:04, 17.25s/it]  6%|████████▎                                                                                                                                        | 33/575 [09:58<2:35:00, 17.16s/it]                                                                                                                                                                                         {'loss': 2.8535, 'grad_norm': 3.2224786281585693, 'learning_rate': 0.0004994811261288539, 'ppl': 17.3484, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.39620971679688, 'epoch': 0.06, 'tokens/total': 4325376.0, 'tokens/trainable': 4143347.0}
  6%|████████▎                                                                                                                                        | 33/575 [09:58<2:35:00, 17.16s/it]  6%|████████▌                                                                                                                                        | 34/575 [10:15<2:34:29, 17.13s/it]                                                                                                                                                                                         {'loss': 2.9073, 'grad_norm': 4.07352876663208, 'learning_rate': 0.0004993910845114118, 'ppl': 18.3073, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.83291625976562, 'epoch': 0.06, 'tokens/total': 4456448.0, 'tokens/trainable': 4268775.0}
  6%|████████▌                                                                                                                                        | 34/575 [10:15<2:34:29, 17.13s/it]  6%|████████▊                                                                                                                                        | 35/575 [10:33<2:35:20, 17.26s/it]                                                                                                                                                                                         {'loss': 2.7934, 'grad_norm': 3.4272608757019043, 'learning_rate': 0.0004992938530706701, 'ppl': 16.33647, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 220.46485900878906, 'epoch': 0.06, 'tokens/total': 4587520.0, 'tokens/trainable': 4394365.0}
  6%|████████▊                                                                                                                                        | 35/575 [10:33<2:35:20, 17.26s/it]  6%|█████████                                                                                                                                        | 36/575 [10:50<2:34:13, 17.17s/it]                                                                                                                                                                                         {'loss': 2.7309, 'grad_norm': 2.9684834480285645, 'learning_rate': 0.0004991894349220684, 'ppl': 15.34669, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.89602661132812, 'epoch': 0.06, 'tokens/total': 4718592.0, 'tokens/trainable': 4520203.0}
  6%|█████████                                                                                                                                        | 36/575 [10:50<2:34:13, 17.17s/it]  6%|█████████▎                                                                                                                                       | 37/575 [11:06<2:33:03, 17.07s/it]                                                                                                                                                                                         {'loss': 2.7471, 'grad_norm': 2.5340092182159424, 'learning_rate': 0.0004990778334113193, 'ppl': 15.59733, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.39334106445312, 'epoch': 0.06, 'tokens/total': 4849664.0, 'tokens/trainable': 4645863.0}
  6%|█████████▎                                                                                                                                       | 37/575 [11:06<2:33:03, 17.07s/it]  7%|█████████▌                                                                                                                                       | 38/575 [11:24<2:34:03, 17.21s/it]                                                                                                                                                                                         {'loss': 2.6861, 'grad_norm': 3.4324874877929688, 'learning_rate': 0.0004989590521143005, 'ppl': 14.67433, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.04931640625, 'epoch': 0.07, 'tokens/total': 4980736.0, 'tokens/trainable': 4771578.0}
  7%|█████████▌                                                                                                                                       | 38/575 [11:24<2:34:03, 17.21s/it]  7%|█████████▊                                                                                                                                       | 39/575 [11:41<2:33:23, 17.17s/it]                                                                                                                                                                                         {'loss': 2.6745, 'grad_norm': 2.8681814670562744, 'learning_rate': 0.0004988330948369413, 'ppl': 14.5051, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.0179901123047, 'epoch': 0.07, 'tokens/total': 5111808.0, 'tokens/trainable': 4897204.0}
  7%|█████████▊                                                                                                                                       | 39/575 [11:41<2:33:23, 17.17s/it]  7%|██████████                                                                                                                                       | 40/575 [11:58<2:33:09, 17.18s/it]                                                                                                                                                                                         {'loss': 2.6941, 'grad_norm': 2.963502883911133, 'learning_rate': 0.0004986999656150995, 'ppl': 14.7922, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.6924285888672, 'epoch': 0.07, 'tokens/total': 5242880.0, 'tokens/trainable': 5023163.0}
  7%|██████████                                                                                                                                       | 40/575 [11:58<2:33:09, 17.18s/it]  7%|██████████▎                                                                                                                                      | 41/575 [12:15<2:33:13, 17.22s/it]                                                                                                                                                                                         {'loss': 2.6395, 'grad_norm': 2.865541934967041, 'learning_rate': 0.0004985596687144332, 'ppl': 14.0062, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.28036499023438, 'epoch': 0.07, 'tokens/total': 5373952.0, 'tokens/trainable': 5148879.0}
  7%|██████████▎                                                                                                                                      | 41/575 [12:16<2:33:13, 17.22s/it]  7%|██████████▌                                                                                                                                      | 42/575 [12:32<2:31:54, 17.10s/it]                                                                                                                                                                                         {'loss': 2.63, 'grad_norm': 3.0719056129455566, 'learning_rate': 0.000498412208630263, 'ppl': 13.87377, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.7224578857422, 'epoch': 0.07, 'tokens/total': 5505024.0, 'tokens/trainable': 5274383.0}
  7%|██████████▌                                                                                                                                      | 42/575 [12:32<2:31:54, 17.10s/it]  7%|██████████▊                                                                                                                                      | 43/575 [12:49<2:29:37, 16.88s/it]                                                                                                                                                                                         {'loss': 2.582, 'grad_norm': 4.665623664855957, 'learning_rate': 0.0004982575900874288, 'ppl': 13.22356, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 240.37046813964844, 'epoch': 0.07, 'tokens/total': 5636096.0, 'tokens/trainable': 5400033.0}
  7%|██████████▊                                                                                                                                      | 43/575 [12:49<2:29:37, 16.88s/it]  8%|███████████                                                                                                                                      | 44/575 [13:06<2:30:29, 17.01s/it]                                                                                                                                                                                         {'loss': 2.5542, 'grad_norm': 2.31315279006958, 'learning_rate': 0.0004980958180401384, 'ppl': 12.86101, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.74740600585938, 'epoch': 0.08, 'tokens/total': 5767168.0, 'tokens/trainable': 5525686.0}
  8%|███████████                                                                                                                                      | 44/575 [13:06<2:30:29, 17.01s/it]  8%|███████████▎                                                                                                                                     | 45/575 [13:23<2:30:23, 17.02s/it]                                                                                                                                                                                         {'loss': 2.577, 'grad_norm': 2.760657787322998, 'learning_rate': 0.000497926897671808, 'ppl': 13.15761, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.54258728027344, 'epoch': 0.08, 'tokens/total': 5898240.0, 'tokens/trainable': 5651450.0}
  8%|███████████▎                                                                                                                                     | 45/575 [13:23<2:30:23, 17.02s/it]  8%|███████████▌                                                                                                                                     | 46/575 [13:40<2:29:54, 17.00s/it]                                                                                                                                                                                         {'loss': 2.5609, 'grad_norm': 2.46256160736084, 'learning_rate': 0.0004977508343948969, 'ppl': 12.94746, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.97735595703125, 'epoch': 0.08, 'tokens/total': 6029312.0, 'tokens/trainable': 5776970.0}
  8%|███████████▌                                                                                                                                     | 46/575 [13:40<2:29:54, 17.00s/it]  8%|███████████▊                                                                                                                                     | 47/575 [13:56<2:28:14, 16.85s/it]                                                                                                                                                                                         {'loss': 2.5235, 'grad_norm': 2.653730869293213, 'learning_rate': 0.0004975676338507337, 'ppl': 12.47217, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 238.80552673339844, 'epoch': 0.08, 'tokens/total': 6160384.0, 'tokens/trainable': 5902532.0}
  8%|███████████▊                                                                                                                                     | 47/575 [13:57<2:28:14, 16.85s/it]  8%|████████████                                                                                                                                     | 48/575 [14:14<2:28:33, 16.91s/it]                                                                                                                                                                                         {'loss': 2.4632, 'grad_norm': 2.713313102722168, 'learning_rate': 0.0004973773019093358, 'ppl': 11.74233, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.7933349609375, 'epoch': 0.08, 'tokens/total': 6291456.0, 'tokens/trainable': 6028194.0}
  8%|████████████                                                                                                                                     | 48/575 [14:14<2:28:33, 16.91s/it]  9%|████████████▎                                                                                                                                    | 49/575 [14:30<2:27:07, 16.78s/it]                                                                                                                                                                                         {'loss': 2.5315, 'grad_norm': 2.9807591438293457, 'learning_rate': 0.0004971798446692209, 'ppl': 12.57235, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 237.68345642089844, 'epoch': 0.09, 'tokens/total': 6422528.0, 'tokens/trainable': 6153721.0}
  9%|████████████▎                                                                                                                                    | 49/575 [14:30<2:27:07, 16.78s/it]  9%|████████████▌                                                                                                                                    | 50/575 [14:47<2:27:55, 16.91s/it]                                                                                                                                                                                         {'loss': 2.5415, 'grad_norm': 3.182833671569824, 'learning_rate': 0.000496975268457212, 'ppl': 12.6987, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.86773681640625, 'epoch': 0.09, 'tokens/total': 6553600.0, 'tokens/trainable': 6279441.0}
  9%|████████████▌                                                                                                                                    | 50/575 [14:47<2:27:55, 16.91s/it]  9%|████████████▊                                                                                                                                    | 51/575 [15:04<2:28:24, 16.99s/it]                                                                                                                                                                                         {'loss': 2.4964, 'grad_norm': 2.474241018295288, 'learning_rate': 0.0004967635798282344, 'ppl': 12.13872, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.39154052734375, 'epoch': 0.09, 'tokens/total': 6684672.0, 'tokens/trainable': 6405413.0}
  9%|████████████▊                                                                                                                                    | 51/575 [15:04<2:28:24, 16.99s/it]  9%|█████████████                                                                                                                                    | 52/575 [15:22<2:28:38, 17.05s/it]                                                                                                                                                                                         {'loss': 2.4844, 'grad_norm': 3.076870918273926, 'learning_rate': 0.000496544785565106, 'ppl': 11.99392, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.02215576171875, 'epoch': 0.09, 'tokens/total': 6815744.0, 'tokens/trainable': 6531063.0}
  9%|█████████████                                                                                                                                    | 52/575 [15:22<2:28:38, 17.05s/it]  9%|█████████████▎                                                                                                                                   | 53/575 [15:39<2:29:02, 17.13s/it]                                                                                                                                                                                         {'loss': 2.4403, 'grad_norm': 2.7424895763397217, 'learning_rate': 0.0004963188926783197, 'ppl': 11.47648, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.1528778076172, 'epoch': 0.09, 'tokens/total': 6946816.0, 'tokens/trainable': 6656883.0}
  9%|█████████████▎                                                                                                                                   | 53/575 [15:39<2:29:02, 17.13s/it]  9%|█████████████▌                                                                                                                                   | 54/575 [15:56<2:28:36, 17.11s/it]                                                                                                                                                                                         {'loss': 2.5118, 'grad_norm': 2.713655710220337, 'learning_rate': 0.0004960859084058185, 'ppl': 12.3271, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.71434020996094, 'epoch': 0.09, 'tokens/total': 7077888.0, 'tokens/trainable': 6782744.0}
  9%|█████████████▌                                                                                                                                   | 54/575 [15:56<2:28:36, 17.11s/it] 10%|█████████████▊                                                                                                                                   | 55/575 [16:12<2:26:21, 16.89s/it]                                                                                                                                                                                         {'loss': 2.4259, 'grad_norm': 2.304421901702881, 'learning_rate': 0.0004958458402127645, 'ppl': 11.31241, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 242.60772705078125, 'epoch': 0.1, 'tokens/total': 7208960.0, 'tokens/trainable': 6908420.0}
 10%|█████████████▊                                                                                                                                   | 55/575 [16:12<2:26:21, 16.89s/it] 10%|██████████████                                                                                                                                   | 56/575 [16:29<2:25:19, 16.80s/it]                                                                                                                                                                                         {'loss': 2.4754, 'grad_norm': 2.9931893348693848, 'learning_rate': 0.0004955986957912985, 'ppl': 11.88646, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.10862731933594, 'epoch': 0.1, 'tokens/total': 7340032.0, 'tokens/trainable': 7034178.0}
 10%|██████████████                                                                                                                                   | 56/575 [16:29<2:25:19, 16.80s/it] 10%|██████████████▎                                                                                                                                  | 57/575 [16:46<2:24:50, 16.78s/it]                                                                                                                                                                                         {'loss': 2.4476, 'grad_norm': 3.100477695465088, 'learning_rate': 0.0004953444830602948, 'ppl': 11.56057, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 236.99777221679688, 'epoch': 0.1, 'tokens/total': 7471104.0, 'tokens/trainable': 7159945.0}
 10%|██████████████▎                                                                                                                                  | 57/575 [16:46<2:24:50, 16.78s/it] 10%|██████████████▋                                                                                                                                  | 58/575 [17:03<2:25:56, 16.94s/it]                                                                                                                                                                                         {'loss': 2.3987, 'grad_norm': 2.3154497146606445, 'learning_rate': 0.0004950832101651062, 'ppl': 11.00886, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 227.3505859375, 'epoch': 0.1, 'tokens/total': 7602176.0, 'tokens/trainable': 7285534.0}
 10%|██████████████▋                                                                                                                                  | 58/575 [17:03<2:25:56, 16.94s/it] 10%|██████████████▉                                                                                                                                  | 59/575 [17:20<2:25:23, 16.91s/it]                                                                                                                                                                                         {'loss': 2.453, 'grad_norm': 3.3451995849609375, 'learning_rate': 0.0004948148854773043, 'ppl': 11.62316, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.5593719482422, 'epoch': 0.1, 'tokens/total': 7733248.0, 'tokens/trainable': 7411357.0}
 10%|██████████████▉                                                                                                                                  | 59/575 [17:20<2:25:23, 16.91s/it] 10%|███████████████▏                                                                                                                                 | 60/575 [17:37<2:25:32, 16.96s/it]                                                                                                                                                                                         {'loss': 2.4476, 'grad_norm': 2.588027000427246, 'learning_rate': 0.0004945395175944099, 'ppl': 11.56057, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.01840209960938, 'epoch': 0.1, 'tokens/total': 7864320.0, 'tokens/trainable': 7537206.0}
 10%|███████████████▏                                                                                                                                 | 60/575 [17:37<2:25:32, 16.96s/it] 11%|███████████████▍                                                                                                                                 | 61/575 [17:53<2:23:43, 16.78s/it]                                                                                                                                                                                         {'loss': 2.4465, 'grad_norm': 2.650214195251465, 'learning_rate': 0.0004942571153396187, 'ppl': 11.54786, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 240.98654174804688, 'epoch': 0.11, 'tokens/total': 7995392.0, 'tokens/trainable': 7662819.0}
 11%|███████████████▍                                                                                                                                 | 61/575 [17:53<2:23:43, 16.78s/it] 11%|███████████████▋                                                                                                                                 | 62/575 [18:10<2:24:12, 16.87s/it]                                                                                                                                                                                         {'loss': 2.405, 'grad_norm': 2.0314745903015137, 'learning_rate': 0.000493967687761518, 'ppl': 11.07843, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.5259246826172, 'epoch': 0.11, 'tokens/total': 8126464.0, 'tokens/trainable': 7788435.0}
 11%|███████████████▋                                                                                                                                 | 62/575 [18:10<2:24:12, 16.87s/it] 11%|███████████████▉                                                                                                                                 | 63/575 [18:27<2:24:27, 16.93s/it]                                                                                                                                                                                         {'loss': 2.3848, 'grad_norm': 2.3362715244293213, 'learning_rate': 0.0004936712441337967, 'ppl': 10.85689, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.29910278320312, 'epoch': 0.11, 'tokens/total': 8257536.0, 'tokens/trainable': 7914215.0}
 11%|███████████████▉                                                                                                                                 | 63/575 [18:27<2:24:27, 16.93s/it] 11%|████████████████▏                                                                                                                                | 64/575 [18:44<2:24:14, 16.94s/it]                                                                                                                                                                                         {'loss': 2.3489, 'grad_norm': 2.127591609954834, 'learning_rate': 0.0004933677939549489, 'ppl': 10.47404, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.1163330078125, 'epoch': 0.11, 'tokens/total': 8388608.0, 'tokens/trainable': 8039942.0}
 11%|████████████████▏                                                                                                                                | 64/575 [18:44<2:24:14, 16.94s/it] 11%|████████████████▍                                                                                                                                | 65/575 [19:02<2:25:13, 17.08s/it]                                                                                                                                                                                         {'loss': 2.3756, 'grad_norm': 4.509136199951172, 'learning_rate': 0.0004930573469479681, 'ppl': 10.75747, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 225.43325805664062, 'epoch': 0.11, 'tokens/total': 8519680.0, 'tokens/trainable': 8165419.0}
 11%|████████████████▍                                                                                                                                | 65/575 [19:02<2:25:13, 17.08s/it] 11%|████████████████▋                                                                                                                                | 66/575 [19:18<2:23:23, 16.90s/it]                                                                                                                                                                                         {'loss': 2.3871, 'grad_norm': 2.322894811630249, 'learning_rate': 0.0004927399130600373, 'ppl': 10.88189, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.71533203125, 'epoch': 0.11, 'tokens/total': 8650752.0, 'tokens/trainable': 8291231.0}
 11%|████████████████▋                                                                                                                                | 66/575 [19:18<2:23:23, 16.90s/it] 12%|████████████████▉                                                                                                                                | 67/575 [19:35<2:22:56, 16.88s/it]                                                                                                                                                                                         {'loss': 2.3932, 'grad_norm': 2.867518901824951, 'learning_rate': 0.0004924155024622092, 'ppl': 10.94847, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.88023376464844, 'epoch': 0.12, 'tokens/total': 8781824.0, 'tokens/trainable': 8416757.0}
 12%|████████████████▉                                                                                                                                | 67/575 [19:35<2:22:56, 16.88s/it] 12%|█████████████████▏                                                                                                                               | 68/575 [19:52<2:22:33, 16.87s/it]                                                                                                                                                                                         {'loss': 2.3874, 'grad_norm': 2.0845932960510254, 'learning_rate': 0.0004920841255490806, 'ppl': 10.88516, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.63888549804688, 'epoch': 0.12, 'tokens/total': 8912896.0, 'tokens/trainable': 8542719.0}
 12%|█████████████████▏                                                                                                                               | 68/575 [19:52<2:22:33, 16.87s/it] 12%|█████████████████▍                                                                                                                               | 69/575 [20:09<2:23:23, 17.00s/it]                                                                                                                                                                                         {'loss': 2.3582, 'grad_norm': 2.3609821796417236, 'learning_rate': 0.0004917457929384599, 'ppl': 10.5719, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.63186645507812, 'epoch': 0.12, 'tokens/total': 9043968.0, 'tokens/trainable': 8668376.0}
 12%|█████████████████▍                                                                                                                               | 69/575 [20:09<2:23:23, 17.00s/it] 12%|█████████████████▋                                                                                                                               | 70/575 [20:26<2:23:35, 17.06s/it]                                                                                                                                                                                         {'loss': 2.326, 'grad_norm': 2.265986442565918, 'learning_rate': 0.0004914005154710256, 'ppl': 10.23691, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.18882751464844, 'epoch': 0.12, 'tokens/total': 9175040.0, 'tokens/trainable': 8794197.0}
 12%|█████████████████▋                                                                                                                               | 70/575 [20:26<2:23:35, 17.06s/it] 12%|█████████████████▉                                                                                                                               | 71/575 [20:44<2:23:20, 17.06s/it]                                                                                                                                                                                         {'loss': 2.3499, 'grad_norm': 2.6112289428710938, 'learning_rate': 0.0004910483042099801, 'ppl': 10.48452, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.5489501953125, 'epoch': 0.12, 'tokens/total': 9306112.0, 'tokens/trainable': 8919368.0}
 12%|█████████████████▉                                                                                                                               | 71/575 [20:44<2:23:20, 17.06s/it] 13%|██████████████████▏                                                                                                                              | 72/575 [21:01<2:24:35, 17.25s/it]                                                                                                                                                                                         {'loss': 2.3259, 'grad_norm': 2.265162229537964, 'learning_rate': 0.000490689170440695, 'ppl': 10.23589, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 222.47674560546875, 'epoch': 0.13, 'tokens/total': 9437184.0, 'tokens/trainable': 9044798.0}
 13%|██████████████████▏                                                                                                                              | 72/575 [21:01<2:24:35, 17.25s/it] 13%|██████████████████▍                                                                                                                              | 73/575 [21:18<2:24:10, 17.23s/it]                                                                                                                                                                                         {'loss': 2.3125, 'grad_norm': 2.9880118370056152, 'learning_rate': 0.000490323125670349, 'ppl': 10.09964, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 227.77210998535156, 'epoch': 0.13, 'tokens/total': 9568256.0, 'tokens/trainable': 9170545.0}
 13%|██████████████████▍                                                                                                                              | 73/575 [21:18<2:24:10, 17.23s/it] 13%|██████████████████▋                                                                                                                              | 74/575 [21:36<2:24:23, 17.29s/it]                                                                                                                                                                                         {'loss': 2.2691, 'grad_norm': 2.0525965690612793, 'learning_rate': 0.0004899501816275597, 'ppl': 9.67069, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 225.45558166503906, 'epoch': 0.13, 'tokens/total': 9699328.0, 'tokens/trainable': 9296242.0}
 13%|██████████████████▋                                                                                                                              | 74/575 [21:36<2:24:23, 17.29s/it] 13%|██████████████████▉                                                                                                                              | 75/575 [21:53<2:24:08, 17.30s/it]                                                                                                                                                                                         {'loss': 2.2875, 'grad_norm': 2.2844793796539307, 'learning_rate': 0.0004895703502620077, 'ppl': 9.85028, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.9027862548828, 'epoch': 0.13, 'tokens/total': 9830400.0, 'tokens/trainable': 9422177.0}
 13%|██████████████████▉                                                                                                                              | 75/575 [21:53<2:24:08, 17.30s/it] 13%|███████████████████▏                                                                                                                             | 76/575 [22:10<2:22:24, 17.12s/it]                                                                                                                                                                                         {'loss': 2.3224, 'grad_norm': 1.8984928131103516, 'learning_rate': 0.0004891836437440534, 'ppl': 10.20013, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.42059326171875, 'epoch': 0.13, 'tokens/total': 9961472.0, 'tokens/trainable': 9547807.0}
 13%|███████████████████▏                                                                                                                             | 76/575 [22:10<2:22:24, 17.12s/it] 13%|███████████████████▍                                                                                                                             | 77/575 [22:27<2:22:17, 17.14s/it]                                                                                                                                                                                         {'loss': 2.2606, 'grad_norm': 2.1413581371307373, 'learning_rate': 0.0004887900744643476, 'ppl': 9.58884, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.29811096191406, 'epoch': 0.13, 'tokens/total': 10092544.0, 'tokens/trainable': 9673633.0}
 13%|███████████████████▍                                                                                                                             | 77/575 [22:27<2:22:17, 17.14s/it] 14%|███████████████████▋                                                                                                                             | 78/575 [22:44<2:21:32, 17.09s/it]                                                                                                                                                                                         {'loss': 2.2903, 'grad_norm': 2.411308526992798, 'learning_rate': 0.0004883896550334338, 'ppl': 9.8779, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 231.85891723632812, 'epoch': 0.14, 'tokens/total': 10223616.0, 'tokens/trainable': 9799415.0}
 14%|███████████████████▉                                                                                                                             | 79/575 [23:01<2:21:13, 17.08s/it] {'loss': 2.2889, 'grad_norm': 2.444232940673828, 'learning_rate': 0.000487982398281345, 'ppl': 9.86408, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.2309112548828, 'epoch': 0.14, 'tokens/total': 10354688.0, 'tokens/trainable': 9925238.0}
 14%|████████████████████▏                                                                                                                            | 80/575 [23:18<2:21:12, 17.12s/it] {'loss': 2.2676, 'grad_norm': 2.2026240825653076, 'learning_rate': 0.0004875683172571915, 'ppl': 9.6562, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.10104370117188, 'epoch': 0.14, 'tokens/total': 10485760.0, 'tokens/trainable': 10051371.0}
 14%|████████████████████▍                                                                                                                            | 81/575 [23:35<2:21:06, 17.14s/it] {'loss': 2.339, 'grad_norm': 2.7396628856658936, 'learning_rate': 0.00048714742522874393, 'ppl': 10.37086, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.7397918701172, 'epoch': 0.14, 'tokens/total': 10616832.0, 'tokens/trainable': 10176903.0}
 14%|████████████████████▋                                                                                                                            | 82/575 [23:52<2:20:04, 17.05s/it] {'loss': 2.2675, 'grad_norm': 2.030841588973999, 'learning_rate': 0.0004867197356820073, 'ppl': 9.65523, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.53594970703125, 'epoch': 0.14, 'tokens/total': 10747904.0, 'tokens/trainable': 10302576.0}
 14%|████████████████████▉                                                                                                                            | 83/575 [24:10<2:20:43, 17.16s/it] {'loss': 2.2885, 'grad_norm': 2.5605382919311523, 'learning_rate': 0.0004862852623207894, 'ppl': 9.86014, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.45928955078125, 'epoch': 0.14, 'tokens/total': 10878976.0, 'tokens/trainable': 10428279.0}
 15%|█████████████████████▏                                                                                                                           | 84/575 [24:27<2:20:31, 17.17s/it] {'loss': 2.2136, 'grad_norm': 2.2747533321380615, 'learning_rate': 0.0004858440190662613, 'ppl': 9.14859, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.5067138671875, 'epoch': 0.15, 'tokens/total': 11010048.0, 'tokens/trainable': 10553886.0}
 15%|█████████████████████▍                                                                                                                           | 85/575 [24:44<2:20:17, 17.18s/it] {'loss': 2.3147, 'grad_norm': 2.0803427696228027, 'learning_rate': 0.0004853960200565116, 'ppl': 10.12189, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.38485717773438, 'epoch': 0.15, 'tokens/total': 11141120.0, 'tokens/trainable': 10679636.0}
 15%|█████████████████████▋                                                                                                                           | 86/575 [25:01<2:19:09, 17.08s/it] {'loss': 2.2573, 'grad_norm': 2.4439640045166016, 'learning_rate': 0.0004849412796460934, 'ppl': 9.55725, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 235.5010528564453, 'epoch': 0.15, 'tokens/total': 11272192.0, 'tokens/trainable': 10805135.0}
 15%|█████████████████████▉                                                                                                                           | 87/575 [25:18<2:18:34, 17.04s/it] {'loss': 2.2873, 'grad_norm': 2.6190884113311768, 'learning_rate': 0.0004844798124055641, 'ppl': 9.84831, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 234.3050994873047, 'epoch': 0.15, 'tokens/total': 11403264.0, 'tokens/trainable': 10930896.0}
 15%|██████████████████████▏                                                                                                                          | 88/575 [25:35<2:18:40, 17.08s/it] {'loss': 2.2425, 'grad_norm': 1.9917995929718018, 'learning_rate': 0.0004840116331210189, 'ppl': 9.41684, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.8889617919922, 'epoch': 0.15, 'tokens/total': 11534336.0, 'tokens/trainable': 11056095.0}
 15%|██████████████████████▍                                                                                                                          | 89/575 [25:52<2:18:56, 17.15s/it] {'loss': 2.2381, 'grad_norm': 2.0278656482696533, 'learning_rate': 0.00048353675679361667, 'ppl': 9.3755, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 223.76683044433594, 'epoch': 0.15, 'tokens/total': 11665408.0, 'tokens/trainable': 11181561.0}
 16%|██████████████████████▋                                                                                                                          | 90/575 [26:10<2:19:37, 17.27s/it] {'loss': 2.2155, 'grad_norm': 1.9188135862350464, 'learning_rate': 0.00048305519863909956, 'ppl': 9.16599, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.94505310058594, 'epoch': 0.16, 'tokens/total': 11796480.0, 'tokens/trainable': 11306981.0}
 16%|██████████████████████▉                                                                                                                          | 91/575 [26:27<2:18:33, 17.18s/it] {'loss': 2.2248, 'grad_norm': 2.0131232738494873, 'learning_rate': 0.0004825669740873055, 'ppl': 9.25163, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 233.08920288085938, 'epoch': 0.16, 'tokens/total': 11927552.0, 'tokens/trainable': 11432627.0}
 16%|███████████████████████▏                                                                                                                         | 92/575 [26:44<2:18:35, 17.22s/it] {'loss': 2.2603, 'grad_norm': 1.9995251893997192, 'learning_rate': 0.0004820720987816734, 'ppl': 9.58596, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 229.272216796875, 'epoch': 0.16, 'tokens/total': 12058624.0, 'tokens/trainable': 11557915.0}
 16%|███████████████████████▍                                                                                                                         | 93/575 [27:02<2:18:49, 17.28s/it] {'loss': 2.2748, 'grad_norm': 1.987449049949646, 'learning_rate': 0.00048157058857874245, 'ppl': 9.72597, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 223.560546875, 'epoch': 0.16, 'tokens/total': 12189696.0, 'tokens/trainable': 11683382.0}
 16%|███████████████████████▋                                                                                                                         | 94/575 [27:19<2:18:36, 17.29s/it] {'loss': 2.2138, 'grad_norm': 1.9212589263916016, 'learning_rate': 0.0004810624595476437, 'ppl': 9.15042, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 223.59713745117188, 'epoch': 0.16, 'tokens/total': 12320768.0, 'tokens/trainable': 11808614.0}
 17%|███████████████████████▉                                                                                                                         | 95/575 [27:36<2:18:56, 17.37s/it] {'loss': 2.277, 'grad_norm': 3.1236255168914795, 'learning_rate': 0.00048054772796958517, 'ppl': 9.74739, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 223.001953125, 'epoch': 0.17, 'tokens/total': 12451840.0, 'tokens/trainable': 11933729.0}
 17%|████████████████████████▏                                                                                                                        | 96/575 [27:54<2:17:57, 17.28s/it] {'loss': 2.2311, 'grad_norm': 1.921377182006836, 'learning_rate': 0.00048002641033733056, 'ppl': 9.3101, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 230.19371032714844, 'epoch': 0.17, 'tokens/total': 12582912.0, 'tokens/trainable': 12059280.0}
 17%|████████████████████████▍                                                                                                                        | 97/575 [28:11<2:17:27, 17.25s/it] {'loss': 2.2442, 'grad_norm': 2.1094090938568115, 'learning_rate': 0.0004794985233546702, 'ppl': 9.43287, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 223.13145446777344, 'epoch': 0.17, 'tokens/total': 12713984.0, 'tokens/trainable': 12184960.0}
 17%|████████████████████████▋                                                                                                                        | 98/575 [28:28<2:17:18, 17.27s/it] {'loss': 2.2909, 'grad_norm': 2.3172309398651123, 'learning_rate': 0.00047896408393588625, 'ppl': 9.88383, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 226.94102478027344, 'epoch': 0.17, 'tokens/total': 12845056.0, 'tokens/trainable': 12310681.0}
 17%|████████████████████████▉                                                                                                                        | 99/575 [28:45<2:16:50, 17.25s/it] {'loss': 2.2217, 'grad_norm': 1.9893512725830078, 'learning_rate': 0.0004784231092052108, 'ppl': 9.223, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 228.69578552246094, 'epoch': 0.17, 'tokens/total': 12976128.0, 'tokens/trainable': 12436241.0}
 17%|█████████████████████████                                                                                                                       | 100/575 [29:03<2:16:42, 17.27s/it] {'loss': 2.221, 'grad_norm': 2.7335238456726074, 'learning_rate': 0.0004778756164962769, 'ppl': 9.21654, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 26.15, 'tokens/train_per_sec_per_gpu': 225.89955139160156, 'epoch': 0.17, 'tokens/total': 13107200.0, 'tokens/trainable': 12561551.0}
[2026-03-12 22:11:08,797] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:25<00:00,  1.97it/s]
{'eval_loss': 2.1052801609039307, 'eval_runtime': 27.3143, 'eval_samples_per_second': 7.908, 'eval_steps_per_second': 1.977, 'eval_ppl': 8.2094, 'memory/max_active (GiB)': 26.29, 'memory/max_allocated (GiB)': 26.29, 'memory/device_reserved (GiB)': 27.82, 'epoch': 0.17, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 13107200.0, 'tokens/trainable': 12561551.0}
 17%|█████████████████████████                                                                                                                       | 100/575 [29:30<2:16:42, 17.27s/it]
[2026-03-12 22:11:36,117] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-100
 18%|█████████████████████████▎                                                                                                                      | 101/575 [30:00<3:51:45, 29.34s/it] {'loss': 2.2096, 'grad_norm': 1.8464072942733765, 'learning_rate': 0.00047732162335156324, 'ppl': 9.11207, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.04161071777344, 'epoch': 0.18, 'tokens/total': 13238272.0, 'tokens/trainable': 12687432.0}
 18%|█████████████████████████▌                                                                                                                      | 102/575 [30:18<3:23:06, 25.76s/it] {'loss': 2.2228, 'grad_norm': 1.993199348449707, 'learning_rate': 0.00047676114752183234, 'ppl': 9.23315, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.18507385253906, 'epoch': 0.18, 'tokens/total': 13369344.0, 'tokens/trainable': 12812769.0}
 18%|█████████████████████████▊                                                                                                                      | 103/575 [30:35<3:02:27, 23.19s/it] {'loss': 2.2454, 'grad_norm': 2.0473639965057373, 'learning_rate': 0.0004761942069655613, 'ppl': 9.44419, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.0088653564453, 'epoch': 0.18, 'tokens/total': 13500416.0, 'tokens/trainable': 12937823.0}
 18%|██████████████████████████                                                                                                                      | 104/575 [30:52<2:48:31, 21.47s/it] {'loss': 2.2922, 'grad_norm': 2.505674362182617, 'learning_rate': 0.00047562081984836677, 'ppl': 9.89669, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.7632293701172, 'epoch': 0.18, 'tokens/total': 13631488.0, 'tokens/trainable': 13062815.0}
 18%|██████████████████████████▎                                                                                                                     | 105/575 [31:09<2:37:35, 20.12s/it] {'loss': 2.2219, 'grad_norm': 1.8206462860107422, 'learning_rate': 0.0004750410045424228, 'ppl': 9.22484, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 234.11654663085938, 'epoch': 0.18, 'tokens/total': 13762560.0, 'tokens/trainable': 13188255.0}
 18%|██████████████████████████▌                                                                                                                     | 106/575 [31:27<2:31:50, 19.42s/it] {'loss': 2.1544, 'grad_norm': 1.99018394947052, 'learning_rate': 0.0004744547796258722, 'ppl': 8.62271, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.4613800048828, 'epoch': 0.18, 'tokens/total': 13893632.0, 'tokens/trainable': 13313952.0}
 19%|██████████████████████████▊                                                                                                                     | 107/575 [31:44<2:26:36, 18.80s/it] {'loss': 2.2297, 'grad_norm': 1.8184312582015991, 'learning_rate': 0.000473862163882231, 'ppl': 9.29708, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.0952606201172, 'epoch': 0.19, 'tokens/total': 14024704.0, 'tokens/trainable': 13439696.0}
 19%|██████████████████████████▊                                                                                                                     | 107/575 [31:44<2:26:36, 18.80s/it] 19%|███████████████████████████                                                                                                                     | 108/575 [32:02<2:23:25, 18.43s/it]                                                                                                                                                                                         {'loss': 2.2293, 'grad_norm': 2.1768062114715576, 'learning_rate': 0.0004732631762997871, 'ppl': 9.29336, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.6346435546875, 'epoch': 0.19, 'tokens/total': 14155776.0, 'tokens/trainable': 13565388.0}
 19%|███████████████████████████                                                                                                                     | 108/575 [32:02<2:23:25, 18.43s/it] 19%|███████████████████████████▎                                                                                                                    | 109/575 [32:19<2:20:32, 18.10s/it]                                                                                                                                                                                         {'loss': 2.1908, 'grad_norm': 2.0543456077575684, 'learning_rate': 0.00047265783607099127, 'ppl': 8.94236, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.65737915039062, 'epoch': 0.19, 'tokens/total': 14286848.0, 'tokens/trainable': 13690701.0}
 19%|███████████████████████████▎                                                                                                                    | 109/575 [32:19<2:20:32, 18.10s/it] 19%|███████████████████████████▌                                                                                                                    | 110/575 [32:37<2:19:17, 17.97s/it]                                                                                                                                                                                         {'loss': 2.2341, 'grad_norm': 1.894862174987793, 'learning_rate': 0.00047204616259184277, 'ppl': 9.33807, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.14651489257812, 'epoch': 0.19, 'tokens/total': 14417920.0, 'tokens/trainable': 13816086.0}
 19%|███████████████████████████▌                                                                                                                    | 110/575 [32:37<2:19:17, 17.97s/it] 19%|███████████████████████████▊                                                                                                                    | 111/575 [32:54<2:18:02, 17.85s/it]                                                                                                                                                                                         {'loss': 2.2397, 'grad_norm': 2.2834582328796387, 'learning_rate': 0.00047142817546126734, 'ppl': 9.39051, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.56858825683594, 'epoch': 0.19, 'tokens/total': 14548992.0, 'tokens/trainable': 13941454.0}
 19%|███████████████████████████▊                                                                                                                    | 111/575 [32:54<2:18:02, 17.85s/it] 19%|████████████████████████████                                                                                                                    | 112/575 [33:12<2:16:14, 17.66s/it]                                                                                                                                                                                         {'loss': 2.1798, 'grad_norm': 2.1953084468841553, 'learning_rate': 0.0004708038944804898, 'ppl': 8.84454, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.48353576660156, 'epoch': 0.19, 'tokens/total': 14680064.0, 'tokens/trainable': 14067220.0}
 19%|████████████████████████████                                                                                                                    | 112/575 [33:12<2:16:14, 17.66s/it] 20%|████████████████████████████▎                                                                                                                   | 113/575 [33:29<2:15:27, 17.59s/it]                                                                                                                                                                                         {'loss': 2.1817, 'grad_norm': 1.871796727180481, 'learning_rate': 0.0004701733396523988, 'ppl': 8.86136, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.31082153320312, 'epoch': 0.2, 'tokens/total': 14811136.0, 'tokens/trainable': 14192623.0}
 20%|████████████████████████████▎                                                                                                                   | 113/575 [33:29<2:15:27, 17.59s/it] 20%|████████████████████████████▌                                                                                                                   | 114/575 [33:46<2:13:59, 17.44s/it]                                                                                                                                                                                         {'loss': 2.2311, 'grad_norm': 1.8661315441131592, 'learning_rate': 0.000469536531180907, 'ppl': 9.3101, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 232.32391357421875, 'epoch': 0.2, 'tokens/total': 14942208.0, 'tokens/trainable': 14317952.0}
 20%|████████████████████████████▌                                                                                                                   | 114/575 [33:46<2:13:59, 17.44s/it] 20%|████████████████████████████▊                                                                                                                   | 115/575 [34:04<2:13:42, 17.44s/it]                                                                                                                                                                                         {'loss': 2.1693, 'grad_norm': 1.7511204481124878, 'learning_rate': 0.00046889348947030246, 'ppl': 8.75216, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.64991760253906, 'epoch': 0.2, 'tokens/total': 15073280.0, 'tokens/trainable': 14443152.0}
 20%|████████████████████████████▊                                                                                                                   | 115/575 [34:04<2:13:42, 17.44s/it] 20%|█████████████████████████████                                                                                                                   | 116/575 [34:21<2:13:09, 17.41s/it]                                                                                                                                                                                         {'loss': 2.1575, 'grad_norm': 1.9833531379699707, 'learning_rate': 0.0004682442351245959, 'ppl': 8.64949, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.7755126953125, 'epoch': 0.2, 'tokens/total': 15204352.0, 'tokens/trainable': 14568837.0}
 20%|█████████████████████████████                                                                                                                   | 116/575 [34:21<2:13:09, 17.41s/it] 20%|█████████████████████████████▎                                                                                                                  | 117/575 [34:38<2:12:40, 17.38s/it]                                                                                                                                                                                         {'loss': 2.1635, 'grad_norm': 1.777593731880188, 'learning_rate': 0.00046758878894685973, 'ppl': 8.70154, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.09083557128906, 'epoch': 0.2, 'tokens/total': 15335424.0, 'tokens/trainable': 14694454.0}
 20%|█████████████████████████████▎                                                                                                                  | 117/575 [34:38<2:12:40, 17.38s/it] 21%|█████████████████████████████▌                                                                                                                  | 118/575 [34:56<2:12:47, 17.43s/it]                                                                                                                                                                                         {'loss': 2.155, 'grad_norm': 1.8414117097854614, 'learning_rate': 0.00046692717193856225, 'ppl': 8.62789, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.79286193847656, 'epoch': 0.21, 'tokens/total': 15466496.0, 'tokens/trainable': 14819843.0}
 21%|█████████████████████████████▌                                                                                                                  | 118/575 [34:56<2:12:47, 17.43s/it] 21%|█████████████████████████████▊                                                                                                                  | 119/575 [35:13<2:12:13, 17.40s/it]                                                                                                                                                                                         {'loss': 2.1378, 'grad_norm': 1.8367702960968018, 'learning_rate': 0.00046625940529889406, 'ppl': 8.48076, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.0628662109375, 'epoch': 0.21, 'tokens/total': 15597568.0, 'tokens/trainable': 14945794.0}
 21%|█████████████████████████████▊                                                                                                                  | 119/575 [35:13<2:12:13, 17.40s/it] 21%|██████████████████████████████                                                                                                                  | 120/575 [35:31<2:12:01, 17.41s/it]                                                                                                                                                                                         {'loss': 2.2026, 'grad_norm': 1.8818037509918213, 'learning_rate': 0.000465585510424089, 'ppl': 9.04851, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.71267700195312, 'epoch': 0.21, 'tokens/total': 15728640.0, 'tokens/trainable': 15071278.0}
 21%|██████████████████████████████                                                                                                                  | 120/575 [35:31<2:12:01, 17.41s/it] 21%|██████████████████████████████▎                                                                                                                 | 121/575 [35:48<2:12:04, 17.45s/it]                                                                                                                                                                                         {'loss': 2.1604, 'grad_norm': 1.9310040473937988, 'learning_rate': 0.0004649055089067389, 'ppl': 8.67461, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.5008087158203, 'epoch': 0.21, 'tokens/total': 15859712.0, 'tokens/trainable': 15196378.0}
 21%|██████████████████████████████▎                                                                                                                 | 121/575 [35:48<2:12:04, 17.45s/it] 21%|██████████████████████████████▌                                                                                                                 | 122/575 [36:05<2:11:27, 17.41s/it]                                                                                                                                                                                         {'loss': 2.1428, 'grad_norm': 1.8298485279083252, 'learning_rate': 0.00046421942253510124, 'ppl': 8.52327, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.1822967529297, 'epoch': 0.21, 'tokens/total': 15990784.0, 'tokens/trainable': 15322130.0}
 21%|██████████████████████████████▌                                                                                                                 | 122/575 [36:05<2:11:27, 17.41s/it] 21%|██████████████████████████████▊                                                                                                                 | 123/575 [36:23<2:12:18, 17.56s/it]                                                                                                                                                                                         {'loss': 2.1499, 'grad_norm': 1.8473424911499023, 'learning_rate': 0.00046352727329240155, 'ppl': 8.584, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.89151000976562, 'epoch': 0.21, 'tokens/total': 16121856.0, 'tokens/trainable': 15447645.0}
 21%|██████████████████████████████▊                                                                                                                 | 123/575 [36:23<2:12:18, 17.56s/it] 22%|███████████████████████████████                                                                                                                 | 124/575 [36:41<2:11:43, 17.52s/it]                                                                                                                                                                                         {'loss': 2.1212, 'grad_norm': 1.8582979440689087, 'learning_rate': 0.0004628290833561285, 'ppl': 8.34114, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.95120239257812, 'epoch': 0.22, 'tokens/total': 16252928.0, 'tokens/trainable': 15573141.0}
 22%|███████████████████████████████                                                                                                                 | 124/575 [36:41<2:11:43, 17.52s/it] 22%|███████████████████████████████▎                                                                                                                | 125/575 [36:58<2:11:13, 17.50s/it]                                                                                                                                                                                         {'loss': 2.1391, 'grad_norm': 1.5806443691253662, 'learning_rate': 0.0004621248750973237, 'ppl': 8.49179, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.4951934814453, 'epoch': 0.22, 'tokens/total': 16384000.0, 'tokens/trainable': 15699016.0}
 22%|███████████████████████████████▎                                                                                                                | 125/575 [36:58<2:11:13, 17.50s/it] 22%|███████████████████████████████▌                                                                                                                | 126/575 [37:15<2:09:43, 17.33s/it]                                                                                                                                                                                         {'loss': 2.1865, 'grad_norm': 1.7940988540649414, 'learning_rate': 0.0004614146710798645, 'ppl': 8.90399, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 235.8634033203125, 'epoch': 0.22, 'tokens/total': 16515072.0, 'tokens/trainable': 15824386.0}
 22%|███████████████████████████████▌                                                                                                                | 126/575 [37:15<2:09:43, 17.33s/it] 22%|███████████████████████████████▊                                                                                                                | 127/575 [37:33<2:10:27, 17.47s/it]                                                                                                                                                                                         {'loss': 2.1339, 'grad_norm': 2.5223379135131836, 'learning_rate': 0.0004606984940597416, 'ppl': 8.44775, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.55934143066406, 'epoch': 0.22, 'tokens/total': 16646144.0, 'tokens/trainable': 15949423.0}
 22%|███████████████████████████████▊                                                                                                                | 127/575 [37:33<2:10:27, 17.47s/it] 22%|████████████████████████████████                                                                                                                | 128/575 [37:50<2:10:21, 17.50s/it]                                                                                                                                                                                         {'loss': 2.12, 'grad_norm': 2.1645431518554688, 'learning_rate': 0.0004599763669843292, 'ppl': 8.33114, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.1853790283203, 'epoch': 0.22, 'tokens/total': 16777216.0, 'tokens/trainable': 16074982.0}
 22%|████████████████████████████████                                                                                                                | 128/575 [37:51<2:10:21, 17.50s/it] 22%|████████████████████████████████▎                                                                                                               | 129/575 [38:08<2:09:07, 17.37s/it]                                                                                                                                                                                         {'loss': 2.1292, 'grad_norm': 2.1512250900268555, 'learning_rate': 0.00045924831299165044, 'ppl': 8.40814, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.503662109375, 'epoch': 0.22, 'tokens/total': 16908288.0, 'tokens/trainable': 16200614.0}
 22%|████████████████████████████████▎                                                                                                               | 129/575 [38:08<2:09:07, 17.37s/it] 23%|████████████████████████████████▌                                                                                                               | 130/575 [38:25<2:08:27, 17.32s/it]                                                                                                                                                                                         {'loss': 2.1535, 'grad_norm': 1.57270348072052, 'learning_rate': 0.00045851435540963557, 'ppl': 8.61496, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.1381072998047, 'epoch': 0.23, 'tokens/total': 17039360.0, 'tokens/trainable': 16326066.0}
 23%|████████████████████████████████▌                                                                                                               | 130/575 [38:25<2:08:27, 17.32s/it] 23%|████████████████████████████████▊                                                                                                               | 131/575 [38:42<2:08:57, 17.43s/it]                                                                                                                                                                                         {'loss': 2.161, 'grad_norm': 1.7340855598449707, 'learning_rate': 0.0004577745177553743, 'ppl': 8.67981, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.37203979492188, 'epoch': 0.23, 'tokens/total': 17170432.0, 'tokens/trainable': 16451484.0}
 23%|████████████████████████████████▊                                                                                                               | 131/575 [38:42<2:08:57, 17.43s/it] 23%|█████████████████████████████████                                                                                                               | 132/575 [39:00<2:08:56, 17.46s/it]                                                                                                                                                                                         {'loss': 2.1105, 'grad_norm': 1.7389323711395264, 'learning_rate': 0.00045702882373436317, 'ppl': 8.25237, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.66751098632812, 'epoch': 0.23, 'tokens/total': 17301504.0, 'tokens/trainable': 16576790.0}
 23%|█████████████████████████████████                                                                                                               | 132/575 [39:00<2:08:56, 17.46s/it] 23%|█████████████████████████████████▎                                                                                                              | 133/575 [39:17<2:08:35, 17.46s/it]                                                                                                                                                                                         {'loss': 2.1162, 'grad_norm': 2.48624587059021, 'learning_rate': 0.00045627729723974497, 'ppl': 8.29954, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.98565673828125, 'epoch': 0.23, 'tokens/total': 17432576.0, 'tokens/trainable': 16701969.0}
 23%|█████████████████████████████████▎                                                                                                              | 133/575 [39:17<2:08:35, 17.46s/it] 23%|█████████████████████████████████▌                                                                                                              | 134/575 [39:35<2:07:43, 17.38s/it]                                                                                                                                                                                         {'loss': 2.135, 'grad_norm': 1.5993287563323975, 'learning_rate': 0.00045551996235154355, 'ppl': 8.45705, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.43017578125, 'epoch': 0.23, 'tokens/total': 17563648.0, 'tokens/trainable': 16827076.0}
 23%|█████████████████████████████████▌                                                                                                              | 134/575 [39:35<2:07:43, 17.38s/it] 23%|█████████████████████████████████▊                                                                                                              | 135/575 [39:52<2:07:02, 17.32s/it]                                                                                                                                                                                         {'loss': 2.1606, 'grad_norm': 2.50492262840271, 'learning_rate': 0.0004547568433358926, 'ppl': 8.67634, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.65431213378906, 'epoch': 0.23, 'tokens/total': 17694720.0, 'tokens/trainable': 16952412.0}
 23%|█████████████████████████████████▊                                                                                                              | 135/575 [39:52<2:07:02, 17.32s/it] 24%|██████████████████████████████████                                                                                                              | 136/575 [40:09<2:06:12, 17.25s/it]                                                                                                                                                                                         {'loss': 2.114, 'grad_norm': 1.6853094100952148, 'learning_rate': 0.00045398796464425774, 'ppl': 8.2813, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 233.96652221679688, 'epoch': 0.24, 'tokens/total': 17825792.0, 'tokens/trainable': 17077938.0}
 24%|██████████████████████████████████                                                                                                              | 136/575 [40:09<2:06:12, 17.25s/it] 24%|██████████████████████████████████▎                                                                                                             | 137/575 [40:26<2:06:19, 17.31s/it]                                                                                                                                                                                         {'loss': 2.1854, 'grad_norm': 2.752760648727417, 'learning_rate': 0.00045321335091265305, 'ppl': 8.89421, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.8414306640625, 'epoch': 0.24, 'tokens/total': 17956864.0, 'tokens/trainable': 17203072.0}
 24%|██████████████████████████████████▎                                                                                                             | 137/575 [40:26<2:06:19, 17.31s/it] 24%|██████████████████████████████████▌                                                                                                             | 138/575 [40:44<2:06:19, 17.34s/it]                                                                                                                                                                                         {'loss': 2.158, 'grad_norm': 1.9740126132965088, 'learning_rate': 0.0004524330269608518, 'ppl': 8.65381, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.50889587402344, 'epoch': 0.24, 'tokens/total': 18087936.0, 'tokens/trainable': 17328372.0}
 24%|██████████████████████████████████▌                                                                                                             | 138/575 [40:44<2:06:19, 17.34s/it] 24%|██████████████████████████████████▊                                                                                                             | 139/575 [41:01<2:05:27, 17.26s/it]                                                                                                                                                                                         {'loss': 2.159, 'grad_norm': 2.691234827041626, 'learning_rate': 0.0004516470177915914, 'ppl': 8.66247, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.6106719970703, 'epoch': 0.24, 'tokens/total': 18219008.0, 'tokens/trainable': 17453912.0}
 24%|██████████████████████████████████▊                                                                                                             | 139/575 [41:01<2:05:27, 17.26s/it] 24%|███████████████████████████████████                                                                                                             | 140/575 [41:18<2:05:00, 17.24s/it]                                                                                                                                                                                         {'loss': 2.1355, 'grad_norm': 2.5884628295898438, 'learning_rate': 0.00045085534858977175, 'ppl': 8.46128, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.83389282226562, 'epoch': 0.24, 'tokens/total': 18350080.0, 'tokens/trainable': 17579300.0}
 24%|███████████████████████████████████                                                                                                             | 140/575 [41:18<2:05:00, 17.24s/it] 25%|███████████████████████████████████▎                                                                                                            | 141/575 [41:35<2:05:09, 17.30s/it]                                                                                                                                                                                         {'loss': 2.1356, 'grad_norm': 1.7097755670547485, 'learning_rate': 0.0004500580447216489, 'ppl': 8.46212, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.82919311523438, 'epoch': 0.25, 'tokens/total': 18481152.0, 'tokens/trainable': 17704512.0}
 25%|███████████████████████████████████▎                                                                                                            | 141/575 [41:36<2:05:09, 17.30s/it] 25%|███████████████████████████████████▌                                                                                                            | 142/575 [41:53<2:05:24, 17.38s/it]                                                                                                                                                                                         {'loss': 2.1094, 'grad_norm': 2.187809944152832, 'learning_rate': 0.0004492551317340217, 'ppl': 8.24329, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.81309509277344, 'epoch': 0.25, 'tokens/total': 18612224.0, 'tokens/trainable': 17829846.0}
 25%|███████████████████████████████████▌                                                                                                            | 142/575 [41:53<2:05:24, 17.38s/it] 25%|███████████████████████████████████▊                                                                                                            | 143/575 [42:10<2:05:14, 17.40s/it]                                                                                                                                                                                         {'loss': 2.1536, 'grad_norm': 3.4877004623413086, 'learning_rate': 0.0004484466353534138, 'ppl': 8.61582, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.10443115234375, 'epoch': 0.25, 'tokens/total': 18743296.0, 'tokens/trainable': 17955076.0}
 25%|███████████████████████████████████▊                                                                                                            | 143/575 [42:11<2:05:14, 17.40s/it] 25%|████████████████████████████████████                                                                                                            | 144/575 [42:28<2:05:18, 17.44s/it]                                                                                                                                                                                         {'loss': 2.1372, 'grad_norm': 2.90229868888855, 'learning_rate': 0.00044763258148524873, 'ppl': 8.47567, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.8168487548828, 'epoch': 0.25, 'tokens/total': 18874368.0, 'tokens/trainable': 18079860.0}
 25%|████████████████████████████████████                                                                                                            | 144/575 [42:28<2:05:18, 17.44s/it] 25%|████████████████████████████████████▎                                                                                                           | 145/575 [42:45<2:04:43, 17.40s/it]                                                                                                                                                                                         {'loss': 2.0986, 'grad_norm': 4.088181018829346, 'learning_rate': 0.0004468129962130203, 'ppl': 8.15475, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.01153564453125, 'epoch': 0.25, 'tokens/total': 19005440.0, 'tokens/trainable': 18205372.0}
 25%|████████████████████████████████████▎                                                                                                           | 145/575 [42:45<2:04:43, 17.40s/it] 25%|████████████████████████████████████▌                                                                                                           | 146/575 [43:02<2:03:44, 17.31s/it]                                                                                                                                                                                         {'loss': 2.1436, 'grad_norm': 2.6995725631713867, 'learning_rate': 0.0004459879057974569, 'ppl': 8.53009, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.18191528320312, 'epoch': 0.25, 'tokens/total': 19136512.0, 'tokens/trainable': 18330870.0}
 25%|████████████████████████████████████▌                                                                                                           | 146/575 [43:02<2:03:44, 17.31s/it] 26%|████████████████████████████████████▊                                                                                                           | 147/575 [43:20<2:03:44, 17.35s/it]                                                                                                                                                                                         {'loss': 2.1432, 'grad_norm': 1.717764139175415, 'learning_rate': 0.00044515733667567964, 'ppl': 8.52668, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.8103790283203, 'epoch': 0.26, 'tokens/total': 19267584.0, 'tokens/trainable': 18456012.0}
 26%|████████████████████████████████████▊                                                                                                           | 147/575 [43:20<2:03:44, 17.35s/it] 26%|█████████████████████████████████████                                                                                                           | 148/575 [43:37<2:03:23, 17.34s/it]                                                                                                                                                                                         {'loss': 2.0737, 'grad_norm': 1.8086531162261963, 'learning_rate': 0.00044432131546035555, 'ppl': 7.9542, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.4119873046875, 'epoch': 0.26, 'tokens/total': 19398656.0, 'tokens/trainable': 18581576.0}
 26%|█████████████████████████████████████                                                                                                           | 148/575 [43:37<2:03:23, 17.34s/it] 26%|█████████████████████████████████████▎                                                                                                          | 149/575 [43:54<2:02:32, 17.26s/it]                                                                                                                                                                                         {'loss': 2.1403, 'grad_norm': 2.2775092124938965, 'learning_rate': 0.00044347986893884486, 'ppl': 8.50199, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.45175170898438, 'epoch': 0.26, 'tokens/total': 19529728.0, 'tokens/trainable': 18706940.0}
 26%|█████████████████████████████████████▎                                                                                                          | 149/575 [43:54<2:02:32, 17.26s/it] 26%|█████████████████████████████████████▌                                                                                                          | 150/575 [44:11<2:01:37, 17.17s/it]                                                                                                                                                                                         {'loss': 2.0827, 'grad_norm': 1.975598692893982, 'learning_rate': 0.00044263302407234265, 'ppl': 8.02611, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.171875, 'epoch': 0.26, 'tokens/total': 19660800.0, 'tokens/trainable': 18832634.0}
 26%|█████████████████████████████████████▌                                                                                                          | 150/575 [44:11<2:01:37, 17.17s/it] 26%|█████████████████████████████████████▊                                                                                                          | 151/575 [44:29<2:02:09, 17.29s/it]                                                                                                                                                                                         {'loss': 2.1052, 'grad_norm': 2.306047201156616, 'learning_rate': 0.0004417808079950151, 'ppl': 8.20874, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.45452880859375, 'epoch': 0.26, 'tokens/total': 19791872.0, 'tokens/trainable': 18958496.0}
 26%|█████████████████████████████████████▊                                                                                                          | 151/575 [44:29<2:02:09, 17.29s/it] 26%|██████████████████████████████████████                                                                                                          | 152/575 [44:46<2:02:27, 17.37s/it]                                                                                                                                                                                         {'loss': 2.0931, 'grad_norm': 1.8524401187896729, 'learning_rate': 0.00044092324801312964, 'ppl': 8.11002, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.22064208984375, 'epoch': 0.26, 'tokens/total': 19922944.0, 'tokens/trainable': 19083992.0}
 26%|██████████████████████████████████████                                                                                                          | 152/575 [44:46<2:02:27, 17.37s/it] 27%|██████████████████████████████████████▎                                                                                                         | 153/575 [45:04<2:02:03, 17.36s/it]                                                                                                                                                                                         {'loss': 2.0622, 'grad_norm': 1.8373030424118042, 'learning_rate': 0.00044006037160418073, 'ppl': 7.86325, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.19509887695312, 'epoch': 0.27, 'tokens/total': 20054016.0, 'tokens/trainable': 19209408.0}
 27%|██████████████████████████████████████▎                                                                                                         | 153/575 [45:04<2:02:03, 17.36s/it] 27%|██████████████████████████████████████▌                                                                                                         | 154/575 [45:21<2:02:27, 17.45s/it]                                                                                                                                                                                         {'loss': 2.0886, 'grad_norm': 1.9576970338821411, 'learning_rate': 0.0004391922064160088, 'ppl': 8.0736, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.79933166503906, 'epoch': 0.27, 'tokens/total': 20185088.0, 'tokens/trainable': 19334996.0}
 27%|██████████████████████████████████████▌                                                                                                         | 154/575 [45:21<2:02:27, 17.45s/it] 27%|██████████████████████████████████████▊                                                                                                         | 155/575 [45:39<2:01:53, 17.41s/it]                                                                                                                                                                                         {'loss': 2.0665, 'grad_norm': 1.7436821460723877, 'learning_rate': 0.0004383187802659146, 'ppl': 7.89713, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.758056640625, 'epoch': 0.27, 'tokens/total': 20316160.0, 'tokens/trainable': 19460870.0}
 27%|██████████████████████████████████████▊                                                                                                         | 155/575 [45:39<2:01:53, 17.41s/it] 27%|███████████████████████████████████████                                                                                                         | 156/575 [45:56<2:01:39, 17.42s/it]                                                                                                                                                                                         {'loss': 2.1127, 'grad_norm': 1.714001178741455, 'learning_rate': 0.000437440121139768, 'ppl': 8.27054, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.25413513183594, 'epoch': 0.27, 'tokens/total': 20447232.0, 'tokens/trainable': 19586144.0}
 27%|███████████████████████████████████████                                                                                                         | 156/575 [45:56<2:01:39, 17.42s/it] 27%|███████████████████████████████████████▎                                                                                                        | 157/575 [46:14<2:01:54, 17.50s/it]                                                                                                                                                                                         {'loss': 2.0476, 'grad_norm': 1.470245361328125, 'learning_rate': 0.00043655625719111123, 'ppl': 7.74928, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.0037078857422, 'epoch': 0.27, 'tokens/total': 20578304.0, 'tokens/trainable': 19711576.0}
 27%|███████████████████████████████████████▎                                                                                                        | 157/575 [46:14<2:01:54, 17.50s/it] 27%|███████████████████████████████████████▌                                                                                                        | 158/575 [46:31<2:00:29, 17.34s/it]                                                                                                                                                                                         {'loss': 2.0545, 'grad_norm': 1.6902519464492798, 'learning_rate': 0.00043566721674025654, 'ppl': 7.80294, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.49578857421875, 'epoch': 0.27, 'tokens/total': 20709376.0, 'tokens/trainable': 19837036.0}
 27%|███████████████████████████████████████▌                                                                                                        | 158/575 [46:31<2:00:29, 17.34s/it] 28%|███████████████████████████████████████▊                                                                                                        | 159/575 [46:48<2:00:41, 17.41s/it]                                                                                                                                                                                         {'loss': 2.0498, 'grad_norm': 1.586395025253296, 'learning_rate': 0.0004347730282733793, 'ppl': 7.76635, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.37857055664062, 'epoch': 0.28, 'tokens/total': 20840448.0, 'tokens/trainable': 19962432.0}
 28%|███████████████████████████████████████▊                                                                                                        | 159/575 [46:48<2:00:41, 17.41s/it] 28%|████████████████████████████████████████                                                                                                        | 160/575 [47:06<2:00:27, 17.42s/it]                                                                                                                                                                                         {'loss': 2.0412, 'grad_norm': 1.3804173469543457, 'learning_rate': 0.00043387372044160474, 'ppl': 7.69984, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.12875366210938, 'epoch': 0.28, 'tokens/total': 20971520.0, 'tokens/trainable': 20088106.0}
 28%|████████████████████████████████████████                                                                                                        | 160/575 [47:06<2:00:27, 17.42s/it] 28%|████████████████████████████████████████▎                                                                                                       | 161/575 [47:23<1:59:58, 17.39s/it]                                                                                                                                                                                         {'loss': 2.0436, 'grad_norm': 1.395187497138977, 'learning_rate': 0.0004329693220600901, 'ppl': 7.71835, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.45684814453125, 'epoch': 0.28, 'tokens/total': 21102592.0, 'tokens/trainable': 20213540.0}
 28%|████████████████████████████████████████▎                                                                                                       | 161/575 [47:23<1:59:58, 17.39s/it] 28%|████████████████████████████████████████▌                                                                                                       | 162/575 [47:40<1:59:47, 17.40s/it]                                                                                                                                                                                         {'loss': 2.1099, 'grad_norm': 1.5744107961654663, 'learning_rate': 0.0004320598621071015, 'ppl': 8.24742, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.30377197265625, 'epoch': 0.28, 'tokens/total': 21233664.0, 'tokens/trainable': 20338652.0}
 28%|████████████████████████████████████████▌                                                                                                       | 162/575 [47:41<1:59:47, 17.40s/it] 28%|████████████████████████████████████████▊                                                                                                       | 163/575 [47:58<1:59:48, 17.45s/it]                                                                                                                                                                                         {'loss': 2.0984, 'grad_norm': 3.014244556427002, 'learning_rate': 0.0004311453697230854, 'ppl': 8.15311, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.8573760986328, 'epoch': 0.28, 'tokens/total': 21364736.0, 'tokens/trainable': 20463954.0}
 28%|████████████████████████████████████████▊                                                                                                       | 163/575 [47:58<1:59:48, 17.45s/it] 29%|█████████████████████████████████████████                                                                                                       | 164/575 [48:15<1:59:29, 17.45s/it]                                                                                                                                                                                         {'loss': 2.0777, 'grad_norm': 1.684156060218811, 'learning_rate': 0.0004302258742097345, 'ppl': 7.98608, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.4449005126953, 'epoch': 0.29, 'tokens/total': 21495808.0, 'tokens/trainable': 20589428.0}
 29%|█████████████████████████████████████████                                                                                                       | 164/575 [48:16<1:59:29, 17.45s/it] 29%|█████████████████████████████████████████▎                                                                                                      | 165/575 [48:33<1:59:26, 17.48s/it]                                                                                                                                                                                         {'loss': 2.001, 'grad_norm': 1.9122990369796753, 'learning_rate': 0.00042930140502904957, 'ppl': 7.39645, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.19143676757812, 'epoch': 0.29, 'tokens/total': 21626880.0, 'tokens/trainable': 20715128.0}
 29%|█████████████████████████████████████████▎                                                                                                      | 165/575 [48:33<1:59:26, 17.48s/it] 29%|█████████████████████████████████████████▌                                                                                                      | 166/575 [48:50<1:59:03, 17.47s/it]                                                                                                                                                                                         {'loss': 2.0794, 'grad_norm': 1.6796869039535522, 'learning_rate': 0.0004283719918023949, 'ppl': 7.99967, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.7799530029297, 'epoch': 0.29, 'tokens/total': 21757952.0, 'tokens/trainable': 20840516.0}
 29%|█████████████████████████████████████████▌                                                                                                      | 166/575 [48:51<1:59:03, 17.47s/it] 29%|█████████████████████████████████████████▊                                                                                                      | 167/575 [49:08<1:58:28, 17.42s/it]                                                                                                                                                                                         {'loss': 2.096, 'grad_norm': 1.7133678197860718, 'learning_rate': 0.00042743766430954923, 'ppl': 8.13357, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.43740844726562, 'epoch': 0.29, 'tokens/total': 21889024.0, 'tokens/trainable': 20965828.0}
 29%|█████████████████████████████████████████▊                                                                                                      | 167/575 [49:08<1:58:28, 17.42s/it] 29%|██████████████████████████████████████████                                                                                                      | 168/575 [49:25<1:58:27, 17.46s/it]                                                                                                                                                                                         {'loss': 2.1034, 'grad_norm': 1.7506831884384155, 'learning_rate': 0.0004264984524877519, 'ppl': 8.19398, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.2993621826172, 'epoch': 0.29, 'tokens/total': 22020096.0, 'tokens/trainable': 21091452.0}
 29%|██████████████████████████████████████████                                                                                                      | 168/575 [49:25<1:58:27, 17.46s/it] 29%|██████████████████████████████████████████▎                                                                                                     | 169/575 [49:43<1:58:35, 17.53s/it]                                                                                                                                                                                         {'loss': 2.071, 'grad_norm': 1.7188383340835571, 'learning_rate': 0.0004255543864307431, 'ppl': 7.93275, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.41490173339844, 'epoch': 0.29, 'tokens/total': 22151168.0, 'tokens/trainable': 21217106.0}
 29%|██████████████████████████████████████████▎                                                                                                     | 169/575 [49:43<1:58:35, 17.53s/it] 30%|██████████████████████████████████████████▌                                                                                                     | 170/575 [50:00<1:57:52, 17.46s/it]                                                                                                                                                                                         {'loss': 2.0936, 'grad_norm': 2.0106050968170166, 'learning_rate': 0.0004246054963878003, 'ppl': 8.11407, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.50755310058594, 'epoch': 0.3, 'tokens/total': 22282240.0, 'tokens/trainable': 21342636.0}
 30%|██████████████████████████████████████████▌                                                                                                     | 170/575 [50:00<1:57:52, 17.46s/it] 30%|██████████████████████████████████████████▊                                                                                                     | 171/575 [50:18<1:57:31, 17.45s/it]                                                                                                                                                                                         {'loss': 2.0692, 'grad_norm': 1.6561458110809326, 'learning_rate': 0.0004236518127627683, 'ppl': 7.91849, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.6437225341797, 'epoch': 0.3, 'tokens/total': 22413312.0, 'tokens/trainable': 21468090.0}
 30%|██████████████████████████████████████████▊                                                                                                     | 171/575 [50:18<1:57:31, 17.45s/it] 30%|███████████████████████████████████████████                                                                                                     | 172/575 [50:35<1:56:57, 17.41s/it]                                                                                                                                                                                         {'loss': 2.0538, 'grad_norm': 1.6631743907928467, 'learning_rate': 0.0004226933661130857, 'ppl': 7.79748, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.38487243652344, 'epoch': 0.3, 'tokens/total': 22544384.0, 'tokens/trainable': 21593656.0}
 30%|███████████████████████████████████████████                                                                                                     | 172/575 [50:35<1:56:57, 17.41s/it] 30%|███████████████████████████████████████████▎                                                                                                    | 173/575 [50:53<1:56:57, 17.46s/it]                                                                                                                                                                                         {'loss': 2.106, 'grad_norm': 1.8708244562149048, 'learning_rate': 0.00042173018714880507, 'ppl': 8.21531, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.47494506835938, 'epoch': 0.3, 'tokens/total': 22675456.0, 'tokens/trainable': 21718940.0}
 30%|███████████████████████████████████████████▎                                                                                                    | 173/575 [50:53<1:56:57, 17.46s/it] 30%|███████████████████████████████████████████▌                                                                                                    | 174/575 [51:10<1:57:06, 17.52s/it]                                                                                                                                                                                         {'loss': 2.0415, 'grad_norm': 1.6072241067886353, 'learning_rate': 0.0004207623067316098, 'ppl': 7.70215, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.18588256835938, 'epoch': 0.3, 'tokens/total': 22806528.0, 'tokens/trainable': 21844188.0}
 30%|███████████████████████████████████████████▌                                                                                                    | 174/575 [51:10<1:57:06, 17.52s/it] 30%|███████████████████████████████████████████▊                                                                                                    | 175/575 [51:28<1:56:52, 17.53s/it]                                                                                                                                                                                         {'loss': 2.0875, 'grad_norm': 1.5985651016235352, 'learning_rate': 0.00041978975587382447, 'ppl': 8.06473, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.15481567382812, 'epoch': 0.3, 'tokens/total': 22937600.0, 'tokens/trainable': 21969612.0}
 30%|███████████████████████████████████████████▊                                                                                                    | 175/575 [51:28<1:56:52, 17.53s/it] 31%|████████████████████████████████████████████                                                                                                    | 176/575 [51:46<1:57:08, 17.62s/it]                                                                                                                                                                                         {'loss': 2.0479, 'grad_norm': 1.3777779340744019, 'learning_rate': 0.0004188125657374216, 'ppl': 7.75161, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.29501342773438, 'epoch': 0.31, 'tokens/total': 23068672.0, 'tokens/trainable': 22095076.0}
 31%|████████████████████████████████████████████                                                                                                    | 176/575 [51:46<1:57:08, 17.62s/it] 31%|████████████████████████████████████████████▎                                                                                                   | 177/575 [52:03<1:56:15, 17.53s/it]                                                                                                                                                                                         {'loss': 2.0929, 'grad_norm': 2.4197492599487305, 'learning_rate': 0.0004178307676330233, 'ppl': 8.1084, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.9153594970703, 'epoch': 0.31, 'tokens/total': 23199744.0, 'tokens/trainable': 22220634.0}
 31%|████████████████████████████████████████████▎                                                                                                   | 177/575 [52:03<1:56:15, 17.53s/it] 31%|████████████████████████████████████████████▌                                                                                                   | 178/575 [52:20<1:55:47, 17.50s/it]                                                                                                                                                                                         {'loss': 2.1076, 'grad_norm': 1.6578755378723145, 'learning_rate': 0.00041684439301889746, 'ppl': 8.22847, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.3657684326172, 'epoch': 0.31, 'tokens/total': 23330816.0, 'tokens/trainable': 22346236.0}
 31%|████████████████████████████████████████████▌                                                                                                   | 178/575 [52:21<1:55:47, 17.50s/it] 31%|████████████████████████████████████████████▊                                                                                                   | 179/575 [52:38<1:56:05, 17.59s/it]                                                                                                                                                                                         {'loss': 2.0321, 'grad_norm': 1.4744303226470947, 'learning_rate': 0.00041585347349995034, 'ppl': 7.63009, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.31411743164062, 'epoch': 0.31, 'tokens/total': 23461888.0, 'tokens/trainable': 22471612.0}
 31%|████████████████████████████████████████████▊                                                                                                   | 179/575 [52:38<1:56:05, 17.59s/it] 31%|█████████████████████████████████████████████                                                                                                   | 180/575 [52:56<1:55:43, 17.58s/it]                                                                                                                                                                                         {'loss': 2.0672, 'grad_norm': 1.5459179878234863, 'learning_rate': 0.00041485804082671375, 'ppl': 7.90266, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.40480041503906, 'epoch': 0.31, 'tokens/total': 23592960.0, 'tokens/trainable': 22597068.0}
 31%|█████████████████████████████████████████████                                                                                                   | 180/575 [52:56<1:55:43, 17.58s/it] 31%|█████████████████████████████████████████████▎                                                                                                  | 181/575 [53:13<1:55:09, 17.54s/it]                                                                                                                                                                                         {'loss': 2.075, 'grad_norm': 1.7339211702346802, 'learning_rate': 0.0004138581268943274, 'ppl': 7.96455, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.69749450683594, 'epoch': 0.31, 'tokens/total': 23724032.0, 'tokens/trainable': 22722468.0}
 31%|█████████████████████████████████████████████▎                                                                                                  | 181/575 [53:13<1:55:09, 17.54s/it] 32%|█████████████████████████████████████████████▌                                                                                                  | 182/575 [53:30<1:53:58, 17.40s/it]                                                                                                                                                                                         {'loss': 2.1123, 'grad_norm': 1.6089272499084473, 'learning_rate': 0.00041285376374151754, 'ppl': 8.26723, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.2135467529297, 'epoch': 0.32, 'tokens/total': 23855104.0, 'tokens/trainable': 22847524.0}
 32%|█████████████████████████████████████████████▌                                                                                                  | 182/575 [53:30<1:53:58, 17.40s/it] 32%|█████████████████████████████████████████████▊                                                                                                  | 183/575 [53:48<1:54:13, 17.48s/it]                                                                                                                                                                                         {'loss': 2.0247, 'grad_norm': 1.5385459661483765, 'learning_rate': 0.0004118449835495697, 'ppl': 7.57384, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.0535430908203, 'epoch': 0.32, 'tokens/total': 23986176.0, 'tokens/trainable': 22972760.0}
 32%|█████████████████████████████████████████████▊                                                                                                  | 183/575 [53:48<1:54:13, 17.48s/it] 32%|██████████████████████████████████████████████                                                                                                  | 184/575 [54:05<1:53:50, 17.47s/it]                                                                                                                                                                                         {'loss': 2.0963, 'grad_norm': 1.9120290279388428, 'learning_rate': 0.00041083181864129815, 'ppl': 8.13601, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.6395721435547, 'epoch': 0.32, 'tokens/total': 24117248.0, 'tokens/trainable': 23097984.0}
 32%|██████████████████████████████████████████████                                                                                                  | 184/575 [54:05<1:53:50, 17.47s/it] 32%|██████████████████████████████████████████████▎                                                                                                 | 185/575 [54:23<1:53:29, 17.46s/it]                                                                                                                                                                                         {'loss': 2.0498, 'grad_norm': 1.5722780227661133, 'learning_rate': 0.0004098143014800099, 'ppl': 7.76635, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.5347137451172, 'epoch': 0.32, 'tokens/total': 24248320.0, 'tokens/trainable': 23223388.0}
 32%|██████████████████████████████████████████████▎                                                                                                 | 185/575 [54:23<1:53:29, 17.46s/it] 32%|██████████████████████████████████████████████▌                                                                                                 | 186/575 [54:41<1:53:51, 17.56s/it]                                                                                                                                                                                         {'loss': 2.0679, 'grad_norm': 1.9068890810012817, 'learning_rate': 0.0004087924646684645, 'ppl': 7.9082, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.58749389648438, 'epoch': 0.32, 'tokens/total': 24379392.0, 'tokens/trainable': 23348824.0}
 32%|██████████████████████████████████████████████▌                                                                                                 | 186/575 [54:41<1:53:51, 17.56s/it] 33%|██████████████████████████████████████████████▊                                                                                                 | 187/575 [54:58<1:53:47, 17.60s/it]                                                                                                                                                                                         {'loss': 2.0461, 'grad_norm': 1.4350191354751587, 'learning_rate': 0.00040776634094782965, 'ppl': 7.73767, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.56585693359375, 'epoch': 0.33, 'tokens/total': 24510464.0, 'tokens/trainable': 23473964.0}
 33%|██████████████████████████████████████████████▊                                                                                                 | 187/575 [54:58<1:53:47, 17.60s/it] 33%|███████████████████████████████████████████████                                                                                                 | 188/575 [55:16<1:52:43, 17.48s/it]                                                                                                                                                                                         {'loss': 2.085, 'grad_norm': 1.7474805116653442, 'learning_rate': 0.00040673596319663197, 'ppl': 8.04459, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.43820190429688, 'epoch': 0.33, 'tokens/total': 24641536.0, 'tokens/trainable': 23599496.0}
 33%|███████████████████████████████████████████████                                                                                                 | 188/575 [55:16<1:52:43, 17.48s/it] 33%|███████████████████████████████████████████████▎                                                                                                | 189/575 [55:33<1:52:21, 17.47s/it]                                                                                                                                                                                         {'loss': 2.0395, 'grad_norm': 2.11209774017334, 'learning_rate': 0.0004057013644297034, 'ppl': 7.68676, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.72967529296875, 'epoch': 0.33, 'tokens/total': 24772608.0, 'tokens/trainable': 23725208.0}
 33%|███████████████████████████████████████████████▎                                                                                                | 189/575 [55:33<1:52:21, 17.47s/it] 33%|███████████████████████████████████████████████▌                                                                                                | 190/575 [55:51<1:52:15, 17.49s/it]                                                                                                                                                                                         {'loss': 2.0092, 'grad_norm': 1.779280662536621, 'learning_rate': 0.0004046625777971237, 'ppl': 7.45735, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.49887084960938, 'epoch': 0.33, 'tokens/total': 24903680.0, 'tokens/trainable': 23850784.0}
 33%|███████████████████████████████████████████████▌                                                                                                | 190/575 [55:51<1:52:15, 17.49s/it] 33%|███████████████████████████████████████████████▊                                                                                                | 191/575 [56:08<1:51:09, 17.37s/it]                                                                                                                                                                                         {'loss': 2.0607, 'grad_norm': 1.6085331439971924, 'learning_rate': 0.0004036196365831577, 'ppl': 7.85146, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.67201232910156, 'epoch': 0.33, 'tokens/total': 25034752.0, 'tokens/trainable': 23976190.0}
 33%|███████████████████████████████████████████████▊                                                                                                | 191/575 [56:08<1:51:09, 17.37s/it] 33%|████████████████████████████████████████████████                                                                                                | 192/575 [56:25<1:51:27, 17.46s/it]                                                                                                                                                                                         {'loss': 1.9971, 'grad_norm': 1.6905654668807983, 'learning_rate': 0.0004025725742051896, 'ppl': 7.36766, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.44444274902344, 'epoch': 0.33, 'tokens/total': 25165824.0, 'tokens/trainable': 24101838.0}
 33%|████████████████████████████████████████████████                                                                                                | 192/575 [56:25<1:51:27, 17.46s/it] 34%|████████████████████████████████████████████████▎                                                                                               | 193/575 [56:43<1:50:40, 17.38s/it]                                                                                                                                                                                         {'loss': 2.0328, 'grad_norm': 1.6291218996047974, 'learning_rate': 0.00040152142421265167, 'ppl': 7.63544, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.9542236328125, 'epoch': 0.34, 'tokens/total': 25296896.0, 'tokens/trainable': 24227634.0}
 34%|████████████████████████████████████████████████▎                                                                                               | 193/575 [56:43<1:50:40, 17.38s/it] 34%|████████████████████████████████████████████████▌                                                                                               | 194/575 [57:00<1:50:42, 17.43s/it]                                                                                                                                                                                         {'loss': 2.002, 'grad_norm': 1.507552146911621, 'learning_rate': 0.0004004662202859492, 'ppl': 7.40385, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.45265197753906, 'epoch': 0.34, 'tokens/total': 25427968.0, 'tokens/trainable': 24353320.0}
 34%|████████████████████████████████████████████████▌                                                                                               | 194/575 [57:00<1:50:42, 17.43s/it] 34%|████████████████████████████████████████████████▊                                                                                               | 195/575 [57:18<1:51:06, 17.54s/it]                                                                                                                                                                                         {'loss': 2.056, 'grad_norm': 1.3526042699813843, 'learning_rate': 0.0003994069962353817, 'ppl': 7.81465, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.37057495117188, 'epoch': 0.34, 'tokens/total': 25559040.0, 'tokens/trainable': 24478228.0}
 34%|████████████████████████████████████████████████▊                                                                                               | 195/575 [57:18<1:51:06, 17.54s/it] 34%|█████████████████████████████████████████████████                                                                                               | 196/575 [57:36<1:51:17, 17.62s/it]                                                                                                                                                                                         {'loss': 2.0638, 'grad_norm': 1.497153639793396, 'learning_rate': 0.0003983437860000597, 'ppl': 7.87584, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.21688842773438, 'epoch': 0.34, 'tokens/total': 25690112.0, 'tokens/trainable': 24603312.0}
 34%|█████████████████████████████████████████████████                                                                                               | 196/575 [57:36<1:51:17, 17.62s/it] 34%|█████████████████████████████████████████████████▎                                                                                              | 197/575 [57:53<1:50:38, 17.56s/it]                                                                                                                                                                                         {'loss': 2.019, 'grad_norm': 1.3841919898986816, 'learning_rate': 0.0003972766236468165, 'ppl': 7.53079, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.7329559326172, 'epoch': 0.34, 'tokens/total': 25821184.0, 'tokens/trainable': 24729106.0}
 34%|█████████████████████████████████████████████████▎                                                                                              | 197/575 [57:53<1:50:38, 17.56s/it] 34%|█████████████████████████████████████████████████▌                                                                                              | 198/575 [58:10<1:49:13, 17.38s/it]                                                                                                                                                                                         {'loss': 2.0275, 'grad_norm': 1.520025372505188, 'learning_rate': 0.0003962055433691174, 'ppl': 7.59507, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 236.21697998046875, 'epoch': 0.34, 'tokens/total': 25952256.0, 'tokens/trainable': 24854284.0}
 34%|█████████████████████████████████████████████████▌                                                                                              | 198/575 [58:10<1:49:13, 17.38s/it] 35%|█████████████████████████████████████████████████▊                                                                                              | 199/575 [58:28<1:49:56, 17.54s/it]                                                                                                                                                                                         {'loss': 1.9679, 'grad_norm': 1.503602147102356, 'learning_rate': 0.00039513057948596376, 'ppl': 7.15563, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.94039916992188, 'epoch': 0.35, 'tokens/total': 26083328.0, 'tokens/trainable': 24979618.0}
 35%|█████████████████████████████████████████████████▊                                                                                              | 199/575 [58:28<1:49:56, 17.54s/it] 35%|██████████████████████████████████████████████████                                                                                              | 200/575 [58:45<1:49:26, 17.51s/it]                                                                                                                                                                                         {'loss': 2.0187, 'grad_norm': 1.5359479188919067, 'learning_rate': 0.00039405176644079345, 'ppl': 7.52853, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.66827392578125, 'epoch': 0.35, 'tokens/total': 26214400.0, 'tokens/trainable': 25105132.0}
 35%|██████████████████████████████████████████████████                                                                                              | 200/575 [58:45<1:49:26, 17.51s/it][2026-03-12 22:40:51,614] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:26<00:00,  1.97it/s][A                                                                                                                                                                                         
{'eval_loss': 1.9684138298034668, 'eval_runtime': 27.3027, 'eval_samples_per_second': 7.911, 'eval_steps_per_second': 1.978, 'eval_ppl': 7.15931, 'memory/max_active (GiB)': 26.29, 'memory/max_allocated (GiB)': 26.29, 'memory/device_reserved (GiB)': 27.83, 'epoch': 0.35, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 26214400.0, 'tokens/trainable': 25105132.0}
 35%|██████████████████████████████████████████████████                                                                                              | 200/575 [59:13<1:49:26, 17.51s/it]
[2026-03-12 22:41:18,921] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-200
 35%|██████████████████████████████████████████████████▎                                                                                             | 201/575 [59:44<3:05:29, 29.76s/it]                                                                                                                                                                                         {'loss': 1.9783, 'grad_norm': 1.4884705543518066, 'learning_rate': 0.0003929691388003772, 'ppl': 7.23044, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.6064453125, 'epoch': 0.35, 'tokens/total': 26345472.0, 'tokens/trainable': 25230664.0}
 35%|██████████████████████████████████████████████████▎                                                                                             | 201/575 [59:44<3:05:29, 29.76s/it] 35%|█████████████████████████████████████████████████▉                                                                                            | 202/575 [1:00:01<2:41:46, 26.02s/it]                                                                                                                                                                                         {'loss': 2.0262, 'grad_norm': 1.5388286113739014, 'learning_rate': 0.00039188273125371093, 'ppl': 7.58521, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.27935791015625, 'epoch': 0.35, 'tokens/total': 26476544.0, 'tokens/trainable': 25355844.0}
 35%|█████████████████████████████████████████████████▉                                                                                            | 202/575 [1:00:01<2:41:46, 26.02s/it] 35%|██████████████████████████████████████████████████▏                                                                                           | 203/575 [1:00:19<2:25:35, 23.48s/it]                                                                                                                                                                                         {'loss': 1.979, 'grad_norm': 1.2412782907485962, 'learning_rate': 0.0003907925786109045, 'ppl': 7.2355, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.7071075439453, 'epoch': 0.35, 'tokens/total': 26607616.0, 'tokens/trainable': 25481088.0}
 35%|██████████████████████████████████████████████████▏                                                                                           | 203/575 [1:00:19<2:25:35, 23.48s/it] 35%|██████████████████████████████████████████████████▍                                                                                           | 204/575 [1:00:36<2:13:45, 21.63s/it]                                                                                                                                                                                         {'loss': 2.0006, 'grad_norm': 1.3520903587341309, 'learning_rate': 0.00038969871580206623, 'ppl': 7.39349, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.99880981445312, 'epoch': 0.35, 'tokens/total': 26738688.0, 'tokens/trainable': 25606668.0}
 35%|██████████████████████████████████████████████████▍                                                                                           | 204/575 [1:00:36<2:13:45, 21.63s/it] 36%|██████████████████████████████████████████████████▋                                                                                           | 205/575 [1:00:53<2:05:26, 20.34s/it]                                                                                                                                                                                         {'loss': 2.0415, 'grad_norm': 1.5849263668060303, 'learning_rate': 0.0003886011778761835, 'ppl': 7.70215, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.52734375, 'epoch': 0.36, 'tokens/total': 26869760.0, 'tokens/trainable': 25732146.0}
 36%|██████████████████████████████████████████████████▋                                                                                           | 205/575 [1:00:53<2:05:26, 20.34s/it] 36%|██████████████████████████████████████████████████▊                                                                                           | 206/575 [1:01:11<2:00:11, 19.54s/it]                                                                                                                                                                                         {'loss': 2.0467, 'grad_norm': 1.5826209783554077, 'learning_rate': 0.00038750000000000004, 'ppl': 7.74231, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.42330932617188, 'epoch': 0.36, 'tokens/total': 27000832.0, 'tokens/trainable': 25857100.0}
 36%|██████████████████████████████████████████████████▊                                                                                           | 206/575 [1:01:11<2:00:11, 19.54s/it] 36%|███████████████████████████████████████████████████                                                                                           | 207/575 [1:01:28<1:55:20, 18.81s/it]                                                                                                                                                                                         {'loss': 2.0428, 'grad_norm': 1.377555012702942, 'learning_rate': 0.00038639521745688886, 'ppl': 7.71217, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 232.81065368652344, 'epoch': 0.36, 'tokens/total': 27131904.0, 'tokens/trainable': 25982432.0}
 36%|███████████████████████████████████████████████████                                                                                           | 207/575 [1:01:28<1:55:20, 18.81s/it] 36%|███████████████████████████████████████████████████▎                                                                                          | 208/575 [1:01:45<1:52:32, 18.40s/it]                                                                                                                                                                                         {'loss': 2.0209, 'grad_norm': 1.227475643157959, 'learning_rate': 0.000385286865645722, 'ppl': 7.54511, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.0615234375, 'epoch': 0.36, 'tokens/total': 27262976.0, 'tokens/trainable': 26107980.0}
 36%|███████████████████████████████████████████████████▎                                                                                          | 208/575 [1:01:46<1:52:32, 18.40s/it] 36%|███████████████████████████████████████████████████▌                                                                                          | 209/575 [1:02:02<1:49:37, 17.97s/it]                                                                                                                                                                                         {'loss': 2.0156, 'grad_norm': 1.5216145515441895, 'learning_rate': 0.00038417498007973594, 'ppl': 7.50523, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 233.61605834960938, 'epoch': 0.36, 'tokens/total': 27394048.0, 'tokens/trainable': 26233672.0}
 36%|███████████████████████████████████████████████████▌                                                                                          | 209/575 [1:02:02<1:49:37, 17.97s/it] 37%|███████████████████████████████████████████████████▊                                                                                          | 210/575 [1:02:20<1:48:21, 17.81s/it]                                                                                                                                                                                         {'loss': 1.981, 'grad_norm': 1.199395775794983, 'learning_rate': 0.00038305959638539424, 'ppl': 7.24999, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.64431762695312, 'epoch': 0.37, 'tokens/total': 27525120.0, 'tokens/trainable': 26359140.0}
 37%|███████████████████████████████████████████████████▊                                                                                          | 210/575 [1:02:20<1:48:21, 17.81s/it] 37%|████████████████████████████████████████████████████                                                                                          | 211/575 [1:02:37<1:46:57, 17.63s/it]                                                                                                                                                                                         {'loss': 1.9912, 'grad_norm': 1.3647713661193848, 'learning_rate': 0.0003819407503012453, 'ppl': 7.32432, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.2038116455078, 'epoch': 0.37, 'tokens/total': 27656192.0, 'tokens/trainable': 26484536.0}
 37%|████████████████████████████████████████████████████                                                                                          | 211/575 [1:02:37<1:46:57, 17.63s/it] 37%|████████████████████████████████████████████████████▎                                                                                         | 212/575 [1:02:54<1:45:14, 17.39s/it]                                                                                                                                                                                         {'loss': 2.0024, 'grad_norm': 1.2564817667007446, 'learning_rate': 0.00038081847767677785, 'ppl': 7.40681, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 234.32989501953125, 'epoch': 0.37, 'tokens/total': 27787264.0, 'tokens/trainable': 26609840.0}
 37%|████████████████████████████████████████████████████▎                                                                                         | 212/575 [1:02:54<1:45:14, 17.39s/it] 37%|████████████████████████████████████████████████████▌                                                                                         | 213/575 [1:03:11<1:45:01, 17.41s/it]                                                                                                                                                                                         {'loss': 1.9842, 'grad_norm': 1.3349636793136597, 'learning_rate': 0.00037969281447127194, 'ppl': 7.27323, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.0315704345703, 'epoch': 0.37, 'tokens/total': 27918336.0, 'tokens/trainable': 26735306.0}
 37%|████████████████████████████████████████████████████▌                                                                                         | 213/575 [1:03:11<1:45:01, 17.41s/it] 37%|████████████████████████████████████████████████████▊                                                                                         | 214/575 [1:03:29<1:44:47, 17.42s/it]                                                                                                                                                                                         {'loss': 1.98, 'grad_norm': 2.104618549346924, 'learning_rate': 0.0003785637967526471, 'ppl': 7.24274, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.95420837402344, 'epoch': 0.37, 'tokens/total': 28049408.0, 'tokens/trainable': 26860994.0}
 37%|████████████████████████████████████████████████████▊                                                                                         | 214/575 [1:03:29<1:44:47, 17.42s/it] 37%|█████████████████████████████████████████████████████                                                                                         | 215/575 [1:03:46<1:44:06, 17.35s/it]                                                                                                                                                                                         {'loss': 2.0215, 'grad_norm': 1.2937283515930176, 'learning_rate': 0.00037743146069630615, 'ppl': 7.54964, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.92559814453125, 'epoch': 0.37, 'tokens/total': 28180480.0, 'tokens/trainable': 26986468.0}
 37%|█████████████████████████████████████████████████████                                                                                         | 215/575 [1:03:46<1:44:06, 17.35s/it] 38%|█████████████████████████████████████████████████████▎                                                                                        | 216/575 [1:04:04<1:44:23, 17.45s/it]                                                                                                                                                                                         {'loss': 2.0515, 'grad_norm': 1.659706950187683, 'learning_rate': 0.00037629584258397646, 'ppl': 7.77956, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.15435791015625, 'epoch': 0.38, 'tokens/total': 28311552.0, 'tokens/trainable': 27111414.0}
 38%|█████████████████████████████████████████████████████▎                                                                                        | 216/575 [1:04:04<1:44:23, 17.45s/it] 38%|█████████████████████████████████████████████████████▌                                                                                        | 217/575 [1:04:21<1:44:04, 17.44s/it]                                                                                                                                                                                         {'loss': 1.9978, 'grad_norm': 1.6346582174301147, 'learning_rate': 0.00037515697880254756, 'ppl': 7.37282, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.23599243164062, 'epoch': 0.38, 'tokens/total': 28442624.0, 'tokens/trainable': 27236840.0}
 38%|█████████████████████████████████████████████████████▌                                                                                        | 217/575 [1:04:21<1:44:04, 17.44s/it] 38%|█████████████████████████████████████████████████████▊                                                                                        | 218/575 [1:04:39<1:43:59, 17.48s/it]                                                                                                                                                                                         {'loss': 2.0617, 'grad_norm': 1.696248173713684, 'learning_rate': 0.0003740149058429047, 'ppl': 7.85932, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.65611267089844, 'epoch': 0.38, 'tokens/total': 28573696.0, 'tokens/trainable': 27362308.0}
 38%|█████████████████████████████████████████████████████▊                                                                                        | 218/575 [1:04:39<1:43:59, 17.48s/it] 38%|██████████████████████████████████████████████████████                                                                                        | 219/575 [1:04:56<1:44:02, 17.53s/it]                                                                                                                                                                                         {'loss': 1.9864, 'grad_norm': 1.2485755681991577, 'learning_rate': 0.0003728696602987601, 'ppl': 7.28925, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.11373901367188, 'epoch': 0.38, 'tokens/total': 28704768.0, 'tokens/trainable': 27487764.0}
 38%|██████████████████████████████████████████████████████                                                                                        | 219/575 [1:04:56<1:44:02, 17.53s/it] 38%|██████████████████████████████████████████████████████▎                                                                                       | 220/575 [1:05:13<1:42:56, 17.40s/it]                                                                                                                                                                                         {'loss': 2.0115, 'grad_norm': 1.387721061706543, 'learning_rate': 0.00037172127886548045, 'ppl': 7.47452, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.7314910888672, 'epoch': 0.38, 'tokens/total': 28835840.0, 'tokens/trainable': 27613276.0}
 38%|██████████████████████████████████████████████████████▎                                                                                       | 220/575 [1:05:13<1:42:56, 17.40s/it] 38%|██████████████████████████████████████████████████████▌                                                                                       | 221/575 [1:05:31<1:43:20, 17.52s/it]                                                                                                                                                                                         {'loss': 2.0177, 'grad_norm': 1.5517852306365967, 'learning_rate': 0.0003705697983389108, 'ppl': 7.52101, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.7124786376953, 'epoch': 0.38, 'tokens/total': 28966912.0, 'tokens/trainable': 27738322.0}
 38%|██████████████████████████████████████████████████████▌                                                                                       | 221/575 [1:05:31<1:43:20, 17.52s/it] 39%|██████████████████████████████████████████████████████▊                                                                                       | 222/575 [1:05:48<1:42:16, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9748, 'grad_norm': 1.5650917291641235, 'learning_rate': 0.00036941525561419566, 'ppl': 7.20518, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.7747039794922, 'epoch': 0.39, 'tokens/total': 29097984.0, 'tokens/trainable': 27864204.0}
 39%|██████████████████████████████████████████████████████▊                                                                                       | 222/575 [1:05:48<1:42:16, 17.38s/it] 39%|███████████████████████████████████████████████████████                                                                                       | 223/575 [1:06:06<1:42:04, 17.40s/it]                                                                                                                                                                                         {'loss': 1.9744, 'grad_norm': 1.2548725605010986, 'learning_rate': 0.00036825768768459724, 'ppl': 7.2023, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.1375732421875, 'epoch': 0.39, 'tokens/total': 29229056.0, 'tokens/trainable': 27989144.0}
 39%|███████████████████████████████████████████████████████                                                                                       | 223/575 [1:06:06<1:42:04, 17.40s/it] 39%|███████████████████████████████████████████████████████▎                                                                                      | 224/575 [1:06:23<1:42:15, 17.48s/it]                                                                                                                                                                                         {'loss': 1.9651, 'grad_norm': 1.3739430904388428, 'learning_rate': 0.00036709713164030937, 'ppl': 7.13563, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.0438232421875, 'epoch': 0.39, 'tokens/total': 29360128.0, 'tokens/trainable': 28114584.0}
 39%|███████████████████████████████████████████████████████▎                                                                                      | 224/575 [1:06:23<1:42:15, 17.48s/it] 39%|███████████████████████████████████████████████████████▌                                                                                      | 225/575 [1:06:41<1:42:05, 17.50s/it]                                                                                                                                                                                         {'loss': 1.904, 'grad_norm': 1.3256971836090088, 'learning_rate': 0.00036593362466726993, 'ppl': 6.71269, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.68630981445312, 'epoch': 0.39, 'tokens/total': 29491200.0, 'tokens/trainable': 28240198.0}
 39%|███████████████████████████████████████████████████████▌                                                                                      | 225/575 [1:06:41<1:42:05, 17.50s/it] 39%|███████████████████████████████████████████████████████▊                                                                                      | 226/575 [1:06:58<1:41:41, 17.48s/it]                                                                                                                                                                                         {'loss': 1.9494, 'grad_norm': 1.4137202501296997, 'learning_rate': 0.0003647672040459687, 'ppl': 7.02447, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.93612670898438, 'epoch': 0.39, 'tokens/total': 29622272.0, 'tokens/trainable': 28365556.0}
 39%|███████████████████████████████████████████████████████▊                                                                                      | 226/575 [1:06:58<1:41:41, 17.48s/it] 39%|████████████████████████████████████████████████████████                                                                                      | 227/575 [1:07:16<1:41:06, 17.43s/it]                                                                                                                                                                                         {'loss': 1.9779, 'grad_norm': 3.377436876296997, 'learning_rate': 0.0003635979071502531, 'ppl': 7.22755, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.40614318847656, 'epoch': 0.39, 'tokens/total': 29753344.0, 'tokens/trainable': 28490636.0}
 39%|████████████████████████████████████████████████████████                                                                                      | 227/575 [1:07:16<1:41:06, 17.43s/it] 40%|████████████████████████████████████████████████████████▎                                                                                     | 228/575 [1:07:33<1:41:14, 17.50s/it]                                                                                                                                                                                         {'loss': 2.0624, 'grad_norm': 1.6385736465454102, 'learning_rate': 0.0003624257714461307, 'ppl': 7.86482, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.2306365966797, 'epoch': 0.4, 'tokens/total': 29884416.0, 'tokens/trainable': 28615680.0}
 40%|████████████████████████████████████████████████████████▎                                                                                     | 228/575 [1:07:33<1:41:14, 17.50s/it] 40%|████████████████████████████████████████████████████████▌                                                                                     | 229/575 [1:07:51<1:41:13, 17.55s/it]                                                                                                                                                                                         {'loss': 1.9902, 'grad_norm': 1.9560095071792603, 'learning_rate': 0.0003612508344905687, 'ppl': 7.317, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.92074584960938, 'epoch': 0.4, 'tokens/total': 30015488.0, 'tokens/trainable': 28741248.0}
 40%|████████████████████████████████████████████████████████▌                                                                                     | 229/575 [1:07:51<1:41:13, 17.55s/it] 40%|████████████████████████████████████████████████████████▊                                                                                     | 230/575 [1:08:08<1:40:07, 17.41s/it]                                                                                                                                                                                         {'loss': 2.0382, 'grad_norm': 1.526203989982605, 'learning_rate': 0.0003600731339302905, 'ppl': 7.67678, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.12696838378906, 'epoch': 0.4, 'tokens/total': 30146560.0, 'tokens/trainable': 28866394.0}
 40%|████████████████████████████████████████████████████████▊                                                                                     | 230/575 [1:08:08<1:40:07, 17.41s/it] 40%|█████████████████████████████████████████████████████████                                                                                     | 231/575 [1:08:25<1:39:39, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9896, 'grad_norm': 1.287468433380127, 'learning_rate': 0.00035889270750056945, 'ppl': 7.31261, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.0307159423828, 'epoch': 0.4, 'tokens/total': 30277632.0, 'tokens/trainable': 28991724.0}
 40%|█████████████████████████████████████████████████████████                                                                                     | 231/575 [1:08:25<1:39:39, 17.38s/it] 40%|█████████████████████████████████████████████████████████▎                                                                                    | 232/575 [1:08:43<1:39:52, 17.47s/it]                                                                                                                                                                                         {'loss': 1.9705, 'grad_norm': 1.4913160800933838, 'learning_rate': 0.00035770959302402, 'ppl': 7.17426, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.2523651123047, 'epoch': 0.4, 'tokens/total': 30408704.0, 'tokens/trainable': 29117040.0}
 40%|█████████████████████████████████████████████████████████▎                                                                                    | 232/575 [1:08:43<1:39:52, 17.47s/it] 41%|█████████████████████████████████████████████████████████▌                                                                                    | 233/575 [1:09:01<1:39:43, 17.49s/it]                                                                                                                                                                                         {'loss': 1.9786, 'grad_norm': 1.3290998935699463, 'learning_rate': 0.00035652382840938544, 'ppl': 7.23261, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.25503540039062, 'epoch': 0.41, 'tokens/total': 30539776.0, 'tokens/trainable': 29242312.0}
 41%|█████████████████████████████████████████████████████████▌                                                                                    | 233/575 [1:09:01<1:39:43, 17.49s/it] 41%|█████████████████████████████████████████████████████████▊                                                                                    | 234/575 [1:09:18<1:39:55, 17.58s/it]                                                                                                                                                                                         {'loss': 1.968, 'grad_norm': 1.2440803050994873, 'learning_rate': 0.00035533545165032345, 'ppl': 7.15635, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.34564208984375, 'epoch': 0.41, 'tokens/total': 30670848.0, 'tokens/trainable': 29367512.0}
 41%|█████████████████████████████████████████████████████████▊                                                                                    | 234/575 [1:09:18<1:39:55, 17.58s/it] 41%|██████████████████████████████████████████████████████████                                                                                    | 235/575 [1:09:36<1:39:22, 17.54s/it]                                                                                                                                                                                         {'loss': 1.9522, 'grad_norm': 1.3298527002334595, 'learning_rate': 0.00035414450082418874, 'ppl': 7.04417, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.97467041015625, 'epoch': 0.41, 'tokens/total': 30801920.0, 'tokens/trainable': 29492884.0}
 41%|██████████████████████████████████████████████████████████                                                                                    | 235/575 [1:09:36<1:39:22, 17.54s/it] 41%|██████████████████████████████████████████████████████████▎                                                                                   | 236/575 [1:09:53<1:39:06, 17.54s/it]                                                                                                                                                                                         {'loss': 2.0077, 'grad_norm': 1.340688705444336, 'learning_rate': 0.0003529510140908129, 'ppl': 7.44617, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.30636596679688, 'epoch': 0.41, 'tokens/total': 30932992.0, 'tokens/trainable': 29618124.0}
 41%|██████████████████████████████████████████████████████████▎                                                                                   | 236/575 [1:09:53<1:39:06, 17.54s/it] 41%|██████████████████████████████████████████████████████████▌                                                                                   | 237/575 [1:10:11<1:39:02, 17.58s/it]                                                                                                                                                                                         {'loss': 1.974, 'grad_norm': 1.309205412864685, 'learning_rate': 0.00035175502969128164, 'ppl': 7.19942, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.9044647216797, 'epoch': 0.41, 'tokens/total': 31064064.0, 'tokens/trainable': 29743562.0}
 41%|██████████████████████████████████████████████████████████▌                                                                                   | 237/575 [1:10:11<1:39:02, 17.58s/it] 41%|██████████████████████████████████████████████████████████▊                                                                                   | 238/575 [1:10:29<1:39:18, 17.68s/it]                                                                                                                                                                                         {'loss': 1.9486, 'grad_norm': 1.3278919458389282, 'learning_rate': 0.00035055658594670985, 'ppl': 7.01885, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.9466552734375, 'epoch': 0.41, 'tokens/total': 31195136.0, 'tokens/trainable': 29869028.0}
 41%|██████████████████████████████████████████████████████████▊                                                                                   | 238/575 [1:10:29<1:39:18, 17.68s/it] 42%|███████████████████████████████████████████████████████████                                                                                   | 239/575 [1:10:47<1:38:47, 17.64s/it]                                                                                                                                                                                         {'loss': 2.0102, 'grad_norm': 1.3390051126480103, 'learning_rate': 0.00034935572125701346, 'ppl': 7.46481, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.6306610107422, 'epoch': 0.42, 'tokens/total': 31326208.0, 'tokens/trainable': 29994202.0}
 42%|███████████████████████████████████████████████████████████                                                                                   | 239/575 [1:10:47<1:38:47, 17.64s/it] 42%|███████████████████████████████████████████████████████████▎                                                                                  | 240/575 [1:11:04<1:38:32, 17.65s/it]                                                                                                                                                                                         {'loss': 2.0142, 'grad_norm': 1.6813931465148926, 'learning_rate': 0.00034815247409967874, 'ppl': 7.49473, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 215.2191619873047, 'epoch': 0.42, 'tokens/total': 31457280.0, 'tokens/trainable': 30119520.0}
 42%|███████████████████████████████████████████████████████████▎                                                                                  | 240/575 [1:11:04<1:38:32, 17.65s/it] 42%|███████████████████████████████████████████████████████████▌                                                                                  | 241/575 [1:11:22<1:37:53, 17.58s/it]                                                                                                                                                                                         {'loss': 2.0576, 'grad_norm': 1.2614521980285645, 'learning_rate': 0.00034694688302853023, 'ppl': 7.82716, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.29934692382812, 'epoch': 0.42, 'tokens/total': 31588352.0, 'tokens/trainable': 30244808.0}
 42%|███████████████████████████████████████████████████████████▌                                                                                  | 241/575 [1:11:22<1:37:53, 17.58s/it] 42%|███████████████████████████████████████████████████████████▊                                                                                  | 242/575 [1:11:40<1:38:07, 17.68s/it]                                                                                                                                                                                         {'loss': 1.9627, 'grad_norm': 1.5453486442565918, 'learning_rate': 0.00034573898667249447, 'ppl': 7.11852, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.5048370361328, 'epoch': 0.42, 'tokens/total': 31719424.0, 'tokens/trainable': 30370294.0}
 42%|███████████████████████████████████████████████████████████▊                                                                                  | 242/575 [1:11:40<1:38:07, 17.68s/it] 42%|████████████████████████████████████████████████████████████                                                                                  | 243/575 [1:11:57<1:37:50, 17.68s/it]                                                                                                                                                                                         {'loss': 1.9553, 'grad_norm': 1.415401577949524, 'learning_rate': 0.0003445288237343632, 'ppl': 7.06604, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.27606201171875, 'epoch': 0.42, 'tokens/total': 31850496.0, 'tokens/trainable': 30495504.0}
 42%|████████████████████████████████████████████████████████████                                                                                  | 243/575 [1:11:57<1:37:50, 17.68s/it] 42%|████████████████████████████████████████████████████████████▎                                                                                 | 244/575 [1:12:15<1:36:56, 17.57s/it]                                                                                                                                                                                         {'loss': 1.9761, 'grad_norm': 1.4122751951217651, 'learning_rate': 0.00034331643298955245, 'ppl': 7.21455, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.353271484375, 'epoch': 0.42, 'tokens/total': 31981568.0, 'tokens/trainable': 30620812.0}
 42%|████████████████████████████████████████████████████████████▎                                                                                 | 244/575 [1:12:15<1:36:56, 17.57s/it] 43%|████████████████████████████████████████████████████████████▌                                                                                 | 245/575 [1:12:32<1:36:24, 17.53s/it]                                                                                                                                                                                         {'loss': 1.9662, 'grad_norm': 1.383400797843933, 'learning_rate': 0.0003421018532848607, 'ppl': 7.14348, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.97145080566406, 'epoch': 0.43, 'tokens/total': 32112640.0, 'tokens/trainable': 30746418.0}
 43%|████████████████████████████████████████████████████████████▌                                                                                 | 245/575 [1:12:32<1:36:24, 17.53s/it] 43%|████████████████████████████████████████████████████████████▊                                                                                 | 246/575 [1:12:50<1:36:09, 17.54s/it]                                                                                                                                                                                         {'loss': 1.9651, 'grad_norm': 1.5492509603500366, 'learning_rate': 0.0003408851235372238, 'ppl': 7.13563, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.9794158935547, 'epoch': 0.43, 'tokens/total': 32243712.0, 'tokens/trainable': 30871712.0}
 43%|████████████████████████████████████████████████████████████▊                                                                                 | 246/575 [1:12:50<1:36:09, 17.54s/it] 43%|████████████████████████████████████████████████████████████▉                                                                                 | 247/575 [1:13:07<1:35:17, 17.43s/it]                                                                                                                                                                                         {'loss': 1.9759, 'grad_norm': 2.4360010623931885, 'learning_rate': 0.00033966628273246843, 'ppl': 7.21311, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.80642700195312, 'epoch': 0.43, 'tokens/total': 32374784.0, 'tokens/trainable': 30997256.0}
 43%|████████████████████████████████████████████████████████████▉                                                                                 | 247/575 [1:13:07<1:35:17, 17.43s/it] 43%|█████████████████████████████████████████████████████████████▏                                                                                | 248/575 [1:13:24<1:35:11, 17.47s/it]                                                                                                                                                                                         {'loss': 1.9615, 'grad_norm': 1.4402928352355957, 'learning_rate': 0.00033844536992406235, 'ppl': 7.10998, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.58297729492188, 'epoch': 0.43, 'tokens/total': 32505856.0, 'tokens/trainable': 31122738.0}
 43%|█████████████████████████████████████████████████████████████▏                                                                                | 248/575 [1:13:24<1:35:11, 17.47s/it] 43%|█████████████████████████████████████████████████████████████▍                                                                                | 249/575 [1:13:42<1:35:13, 17.53s/it]                                                                                                                                                                                         {'loss': 2.027, 'grad_norm': 1.8944284915924072, 'learning_rate': 0.0003372224242318635, 'ppl': 7.59128, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.98751831054688, 'epoch': 0.43, 'tokens/total': 32636928.0, 'tokens/trainable': 31248082.0}
 43%|█████████████████████████████████████████████████████████████▍                                                                                | 249/575 [1:13:42<1:35:13, 17.53s/it] 43%|█████████████████████████████████████████████████████████████▋                                                                                | 250/575 [1:14:00<1:35:10, 17.57s/it]                                                                                                                                                                                         {'loss': 2.0134, 'grad_norm': 1.5417417287826538, 'learning_rate': 0.00033599748484086655, 'ppl': 7.48874, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.8115692138672, 'epoch': 0.43, 'tokens/total': 32768000.0, 'tokens/trainable': 31373638.0}
 43%|█████████████████████████████████████████████████████████████▋                                                                                | 250/575 [1:14:00<1:35:10, 17.57s/it] 44%|█████████████████████████████████████████████████████████████▉                                                                                | 251/575 [1:14:17<1:34:50, 17.56s/it]                                                                                                                                                                                         {'loss': 1.9677, 'grad_norm': 1.8000094890594482, 'learning_rate': 0.0003347705909999472, 'ppl': 7.1542, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.357177734375, 'epoch': 0.44, 'tokens/total': 32899072.0, 'tokens/trainable': 31499032.0}
 44%|█████████████████████████████████████████████████████████████▉                                                                                | 251/575 [1:14:17<1:34:50, 17.56s/it] 44%|██████████████████████████████████████████████████████████████▏                                                                               | 252/575 [1:14:35<1:34:42, 17.59s/it]                                                                                                                                                                                         {'loss': 1.9937, 'grad_norm': 1.5412756204605103, 'learning_rate': 0.00033354178202060444, 'ppl': 7.34265, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.89971923828125, 'epoch': 0.44, 'tokens/total': 33030144.0, 'tokens/trainable': 31624498.0}
 44%|██████████████████████████████████████████████████████████████▏                                                                               | 252/575 [1:14:35<1:34:42, 17.59s/it] 44%|██████████████████████████████████████████████████████████████▍                                                                               | 253/575 [1:14:52<1:34:20, 17.58s/it]                                                                                                                                                                                         {'loss': 1.9377, 'grad_norm': 1.2776538133621216, 'learning_rate': 0.0003323110972757014, 'ppl': 6.94276, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.4564971923828, 'epoch': 0.44, 'tokens/total': 33161216.0, 'tokens/trainable': 31749830.0}
 44%|██████████████████████████████████████████████████████████████▍                                                                               | 253/575 [1:14:52<1:34:20, 17.58s/it] 44%|██████████████████████████████████████████████████████████████▋                                                                               | 254/575 [1:15:09<1:33:14, 17.43s/it]                                                                                                                                                                                         {'loss': 1.9625, 'grad_norm': 1.306793212890625, 'learning_rate': 0.0003310785761982033, 'ppl': 7.1171, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.74899291992188, 'epoch': 0.44, 'tokens/total': 33292288.0, 'tokens/trainable': 31875192.0}
 44%|██████████████████████████████████████████████████████████████▋                                                                               | 254/575 [1:15:09<1:33:14, 17.43s/it] 44%|██████████████████████████████████████████████████████████████▉                                                                               | 255/575 [1:15:27<1:32:33, 17.36s/it]                                                                                                                                                                                         {'loss': 1.9576, 'grad_norm': 1.179717779159546, 'learning_rate': 0.00032984425827991436, 'ppl': 7.08231, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.14076232910156, 'epoch': 0.44, 'tokens/total': 33423360.0, 'tokens/trainable': 32000660.0}
 44%|██████████████████████████████████████████████████████████████▉                                                                               | 255/575 [1:15:27<1:32:33, 17.36s/it] 45%|███████████████████████████████████████████████████████████████▏                                                                              | 256/575 [1:15:44<1:32:12, 17.34s/it]                                                                                                                                                                                         {'loss': 1.9413, 'grad_norm': 1.3719829320907593, 'learning_rate': 0.00032860818307021213, 'ppl': 6.9678, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.23988342285156, 'epoch': 0.45, 'tokens/total': 33554432.0, 'tokens/trainable': 32126176.0}
 45%|███████████████████████████████████████████████████████████████▏                                                                              | 256/575 [1:15:44<1:32:12, 17.34s/it] 45%|███████████████████████████████████████████████████████████████▍                                                                              | 257/575 [1:16:01<1:32:03, 17.37s/it]                                                                                                                                                                                         {'loss': 1.9681, 'grad_norm': 1.3053077459335327, 'learning_rate': 0.00032737039017478046, 'ppl': 7.15707, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.96127319335938, 'epoch': 0.45, 'tokens/total': 33685504.0, 'tokens/trainable': 32251660.0}
 45%|███████████████████████████████████████████████████████████████▍                                                                              | 257/575 [1:16:01<1:32:03, 17.37s/it] 45%|███████████████████████████████████████████████████████████████▋                                                                              | 258/575 [1:16:19<1:32:02, 17.42s/it]                                                                                                                                                                                         {'loss': 1.9965, 'grad_norm': 1.6538059711456299, 'learning_rate': 0.0003261309192543403, 'ppl': 7.36324, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.52828979492188, 'epoch': 0.45, 'tokens/total': 33816576.0, 'tokens/trainable': 32376986.0}
 45%|███████████████████████████████████████████████████████████████▋                                                                              | 258/575 [1:16:19<1:32:02, 17.42s/it] 45%|███████████████████████████████████████████████████████████████▉                                                                              | 259/575 [1:16:36<1:31:46, 17.42s/it]                                                                                                                                                                                         {'loss': 1.9821, 'grad_norm': 1.192196249961853, 'learning_rate': 0.000324889810023379, 'ppl': 7.25797, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.39120483398438, 'epoch': 0.45, 'tokens/total': 33947648.0, 'tokens/trainable': 32502572.0}
 45%|███████████████████████████████████████████████████████████████▉                                                                              | 259/575 [1:16:36<1:31:46, 17.42s/it] 45%|████████████████████████████████████████████████████████████████▏                                                                             | 260/575 [1:16:54<1:32:14, 17.57s/it]                                                                                                                                                                                         {'loss': 1.9864, 'grad_norm': 1.4217498302459717, 'learning_rate': 0.0003236471022488781, 'ppl': 7.28925, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.45948791503906, 'epoch': 0.45, 'tokens/total': 34078720.0, 'tokens/trainable': 32627790.0}
 45%|████████████████████████████████████████████████████████████████▏                                                                             | 260/575 [1:16:54<1:32:14, 17.57s/it] 45%|████████████████████████████████████████████████████████████████▍                                                                             | 261/575 [1:17:12<1:31:32, 17.49s/it]                                                                                                                                                                                         {'loss': 1.9835, 'grad_norm': 1.5409531593322754, 'learning_rate': 0.0003224028357490384, 'ppl': 7.26814, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.47547912597656, 'epoch': 0.45, 'tokens/total': 34209792.0, 'tokens/trainable': 32753108.0}
 45%|████████████████████████████████████████████████████████████████▍                                                                             | 261/575 [1:17:12<1:31:32, 17.49s/it] 46%|████████████████████████████████████████████████████████████████▋                                                                             | 262/575 [1:17:29<1:31:21, 17.51s/it]                                                                                                                                                                                         {'loss': 1.9559, 'grad_norm': 1.222854733467102, 'learning_rate': 0.00032115705039200494, 'ppl': 7.07028, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.18760681152344, 'epoch': 0.46, 'tokens/total': 34340864.0, 'tokens/trainable': 32878848.0}
 46%|████████████████████████████████████████████████████████████████▋                                                                             | 262/575 [1:17:29<1:31:21, 17.51s/it] 46%|████████████████████████████████████████████████████████████████▉                                                                             | 263/575 [1:17:46<1:30:33, 17.42s/it]                                                                                                                                                                                         {'loss': 1.9384, 'grad_norm': 1.4609373807907104, 'learning_rate': 0.0003199097860945891, 'ppl': 6.94763, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.39414978027344, 'epoch': 0.46, 'tokens/total': 34471936.0, 'tokens/trainable': 33004340.0}
 46%|████████████████████████████████████████████████████████████████▉                                                                             | 263/575 [1:17:46<1:30:33, 17.42s/it] 46%|█████████████████████████████████████████████████████████████████▏                                                                            | 264/575 [1:18:03<1:29:44, 17.31s/it]                                                                                                                                                                                         {'loss': 2.0104, 'grad_norm': 1.3042951822280884, 'learning_rate': 0.0003186610828209894, 'ppl': 7.4663, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.95263671875, 'epoch': 0.46, 'tokens/total': 34603008.0, 'tokens/trainable': 33129662.0}
 46%|█████████████████████████████████████████████████████████████████▏                                                                            | 264/575 [1:18:03<1:29:44, 17.31s/it] 46%|█████████████████████████████████████████████████████████████████▍                                                                            | 265/575 [1:18:21<1:29:27, 17.31s/it]                                                                                                                                                                                         {'loss': 1.9284, 'grad_norm': 1.3373479843139648, 'learning_rate': 0.00031741098058151183, 'ppl': 6.8785, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.0892791748047, 'epoch': 0.46, 'tokens/total': 34734080.0, 'tokens/trainable': 33254904.0}
 46%|█████████████████████████████████████████████████████████████████▍                                                                            | 265/575 [1:18:21<1:29:27, 17.31s/it] 46%|█████████████████████████████████████████████████████████████████▋                                                                            | 266/575 [1:18:38<1:28:36, 17.21s/it]                                                                                                                                                                                         {'loss': 1.9863, 'grad_norm': 1.1397761106491089, 'learning_rate': 0.00031615951943128704, 'ppl': 7.28852, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.6182403564453, 'epoch': 0.46, 'tokens/total': 34865152.0, 'tokens/trainable': 33380256.0}
 46%|█████████████████████████████████████████████████████████████████▋                                                                            | 266/575 [1:18:38<1:28:36, 17.21s/it] 46%|█████████████████████████████████████████████████████████████████▉                                                                            | 267/575 [1:18:55<1:28:40, 17.27s/it]                                                                                                                                                                                         {'loss': 1.978, 'grad_norm': 1.4296529293060303, 'learning_rate': 0.00031490673946898696, 'ppl': 7.22827, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.5876007080078, 'epoch': 0.46, 'tokens/total': 34996224.0, 'tokens/trainable': 33505458.0}
 46%|█████████████████████████████████████████████████████████████████▉                                                                            | 267/575 [1:18:55<1:28:40, 17.27s/it] 47%|██████████████████████████████████████████████████████████████████▏                                                                           | 268/575 [1:19:13<1:28:37, 17.32s/it]                                                                                                                                                                                         {'loss': 1.9269, 'grad_norm': 1.1766818761825562, 'learning_rate': 0.00031365268083554065, 'ppl': 6.86819, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.47264099121094, 'epoch': 0.47, 'tokens/total': 35127296.0, 'tokens/trainable': 33630892.0}
 47%|██████████████████████████████████████████████████████████████████▏                                                                           | 268/575 [1:19:13<1:28:37, 17.32s/it] 47%|██████████████████████████████████████████████████████████████████▍                                                                           | 269/575 [1:19:30<1:28:19, 17.32s/it]                                                                                                                                                                                         {'loss': 1.966, 'grad_norm': 1.2593404054641724, 'learning_rate': 0.00031239738371284753, 'ppl': 7.14205, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.88211059570312, 'epoch': 0.47, 'tokens/total': 35258368.0, 'tokens/trainable': 33755824.0}
 47%|██████████████████████████████████████████████████████████████████▍                                                                           | 269/575 [1:19:30<1:28:19, 17.32s/it] 47%|██████████████████████████████████████████████████████████████████▋                                                                           | 270/575 [1:19:47<1:27:39, 17.25s/it]                                                                                                                                                                                         {'loss': 1.9368, 'grad_norm': 1.2127174139022827, 'learning_rate': 0.0003111408883224899, 'ppl': 6.93652, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.653076171875, 'epoch': 0.47, 'tokens/total': 35389440.0, 'tokens/trainable': 33881108.0}
 47%|██████████████████████████████████████████████████████████████████▋                                                                           | 270/575 [1:19:47<1:27:39, 17.25s/it] 47%|██████████████████████████████████████████████████████████████████▉                                                                           | 271/575 [1:20:04<1:27:50, 17.34s/it]                                                                                                                                                                                         {'loss': 1.949, 'grad_norm': 1.2287328243255615, 'learning_rate': 0.00030988323492444454, 'ppl': 7.02166, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.50685119628906, 'epoch': 0.47, 'tokens/total': 35520512.0, 'tokens/trainable': 34006436.0}
 47%|██████████████████████████████████████████████████████████████████▉                                                                           | 271/575 [1:20:05<1:27:50, 17.34s/it] 47%|███████████████████████████████████████████████████████████████████▏                                                                          | 272/575 [1:20:22<1:27:52, 17.40s/it]                                                                                                                                                                                         {'loss': 1.9409, 'grad_norm': 1.1493350267410278, 'learning_rate': 0.0003086244638157924, 'ppl': 6.96502, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.82208251953125, 'epoch': 0.47, 'tokens/total': 35651584.0, 'tokens/trainable': 34131720.0}
 47%|███████████████████████████████████████████████████████████████████▏                                                                          | 272/575 [1:20:22<1:27:52, 17.40s/it] 47%|███████████████████████████████████████████████████████████████████▍                                                                          | 273/575 [1:20:40<1:27:48, 17.45s/it]                                                                                                                                                                                         {'loss': 1.9745, 'grad_norm': 1.3024709224700928, 'learning_rate': 0.00030736461532942746, 'ppl': 7.20302, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.64471435546875, 'epoch': 0.47, 'tokens/total': 35782656.0, 'tokens/trainable': 34257160.0}
 47%|███████████████████████████████████████████████████████████████████▍                                                                          | 273/575 [1:20:40<1:27:48, 17.45s/it] 48%|███████████████████████████████████████████████████████████████████▋                                                                          | 274/575 [1:20:57<1:27:51, 17.51s/it]                                                                                                                                                                                         {'loss': 1.9654, 'grad_norm': 1.122544527053833, 'learning_rate': 0.0003061037298327648, 'ppl': 7.13777, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.72930908203125, 'epoch': 0.48, 'tokens/total': 35913728.0, 'tokens/trainable': 34382836.0}
 48%|███████████████████████████████████████████████████████████████████▋                                                                          | 274/575 [1:20:57<1:27:51, 17.51s/it] 48%|███████████████████████████████████████████████████████████████████▉                                                                          | 275/575 [1:21:15<1:27:47, 17.56s/it]                                                                                                                                                                                         {'loss': 1.9126, 'grad_norm': 1.1870543956756592, 'learning_rate': 0.0003048418477264465, 'ppl': 6.77067, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.6585693359375, 'epoch': 0.48, 'tokens/total': 36044800.0, 'tokens/trainable': 34508192.0}
 48%|███████████████████████████████████████████████████████████████████▉                                                                          | 275/575 [1:21:15<1:27:47, 17.56s/it] 48%|████████████████████████████████████████████████████████████████████▏                                                                         | 276/575 [1:21:32<1:26:46, 17.41s/it]                                                                                                                                                                                         {'loss': 1.9348, 'grad_norm': 1.0526031255722046, 'learning_rate': 0.00030357900944304775, 'ppl': 6.92266, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.6211395263672, 'epoch': 0.48, 'tokens/total': 36175872.0, 'tokens/trainable': 34633344.0}
 48%|████████████████████████████████████████████████████████████████████▏                                                                         | 276/575 [1:21:32<1:26:46, 17.41s/it] 48%|████████████████████████████████████████████████████████████████████▍                                                                         | 277/575 [1:21:49<1:26:19, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9423, 'grad_norm': 1.1815508604049683, 'learning_rate': 0.00030231525544578073, 'ppl': 6.97477, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.4705047607422, 'epoch': 0.48, 'tokens/total': 36306944.0, 'tokens/trainable': 34758616.0}
 48%|████████████████████████████████████████████████████████████████████▍                                                                         | 277/575 [1:21:49<1:26:19, 17.38s/it] 48%|████████████████████████████████████████████████████████████████████▋                                                                         | 278/575 [1:22:07<1:26:06, 17.40s/it]                                                                                                                                                                                         {'loss': 1.9451, 'grad_norm': 1.7415739297866821, 'learning_rate': 0.0003010506262271989, 'ppl': 6.99433, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.04832458496094, 'epoch': 0.48, 'tokens/total': 36438016.0, 'tokens/trainable': 34883696.0}
 48%|████████████████████████████████████████████████████████████████████▋                                                                         | 278/575 [1:22:07<1:26:06, 17.40s/it] 49%|████████████████████████████████████████████████████████████████████▉                                                                         | 279/575 [1:22:25<1:26:34, 17.55s/it]                                                                                                                                                                                         {'loss': 1.988, 'grad_norm': 1.492663025856018, 'learning_rate': 0.0002997851623078988, 'ppl': 7.30092, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.17042541503906, 'epoch': 0.49, 'tokens/total': 36569088.0, 'tokens/trainable': 35008676.0}
 49%|████████████████████████████████████████████████████████████████████▉                                                                         | 279/575 [1:22:25<1:26:34, 17.55s/it] 49%|█████████████████████████████████████████████████████████████████████▏                                                                        | 280/575 [1:22:42<1:26:06, 17.51s/it]                                                                                                                                                                                         {'loss': 1.963, 'grad_norm': 1.0977270603179932, 'learning_rate': 0.000298518904235222, 'ppl': 7.12066, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.70449829101562, 'epoch': 0.49, 'tokens/total': 36700160.0, 'tokens/trainable': 35134008.0}
 49%|█████████████████████████████████████████████████████████████████████▏                                                                        | 280/575 [1:22:42<1:26:06, 17.51s/it] 49%|█████████████████████████████████████████████████████████████████████▍                                                                        | 281/575 [1:22:59<1:25:41, 17.49s/it]                                                                                                                                                                                         {'loss': 1.9473, 'grad_norm': 1.2860296964645386, 'learning_rate': 0.0002972518925819562, 'ppl': 7.00974, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.736083984375, 'epoch': 0.49, 'tokens/total': 36831232.0, 'tokens/trainable': 35259172.0}
 49%|█████████████████████████████████████████████████████████████████████▍                                                                        | 281/575 [1:23:00<1:25:41, 17.49s/it] 49%|█████████████████████████████████████████████████████████████████████▋                                                                        | 282/575 [1:23:17<1:25:19, 17.47s/it]                                                                                                                                                                                         {'loss': 1.9261, 'grad_norm': 1.1995348930358887, 'learning_rate': 0.0002959841679450347, 'ppl': 6.86269, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.9776153564453, 'epoch': 0.49, 'tokens/total': 36962304.0, 'tokens/trainable': 35384328.0}
 49%|█████████████████████████████████████████████████████████████████████▋                                                                        | 282/575 [1:23:17<1:25:19, 17.47s/it] 49%|█████████████████████████████████████████████████████████████████████▉                                                                        | 283/575 [1:23:34<1:24:58, 17.46s/it]                                                                                                                                                                                         {'loss': 1.9513, 'grad_norm': 1.33402681350708, 'learning_rate': 0.00029471577094423583, 'ppl': 7.03783, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.9518585205078, 'epoch': 0.49, 'tokens/total': 37093376.0, 'tokens/trainable': 35509356.0}
 49%|█████████████████████████████████████████████████████████████████████▉                                                                        | 283/575 [1:23:34<1:24:58, 17.46s/it] 49%|██████████████████████████████████████████████████████████████████████▏                                                                       | 284/575 [1:23:52<1:24:17, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9187, 'grad_norm': 1.0677067041397095, 'learning_rate': 0.00029344674222088165, 'ppl': 6.8121, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.83038330078125, 'epoch': 0.49, 'tokens/total': 37224448.0, 'tokens/trainable': 35634760.0}
 49%|██████████████████████████████████████████████████████████████████████▏                                                                       | 284/575 [1:23:52<1:24:17, 17.38s/it] 50%|██████████████████████████████████████████████████████████████████████▍                                                                       | 285/575 [1:24:09<1:24:25, 17.47s/it]                                                                                                                                                                                         {'loss': 1.9267, 'grad_norm': 1.1530195474624634, 'learning_rate': 0.0002921771224365353, 'ppl': 6.86681, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.67330932617188, 'epoch': 0.5, 'tokens/total': 37355520.0, 'tokens/trainable': 35759528.0}
 50%|██████████████████████████████████████████████████████████████████████▍                                                                       | 285/575 [1:24:09<1:24:25, 17.47s/it] 50%|██████████████████████████████████████████████████████████████████████▋                                                                       | 286/575 [1:24:27<1:24:04, 17.46s/it]                                                                                                                                                                                         {'loss': 1.9733, 'grad_norm': 1.1562434434890747, 'learning_rate': 0.0002909069522716988, 'ppl': 7.19438, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.7977294921875, 'epoch': 0.5, 'tokens/total': 37486592.0, 'tokens/trainable': 35884788.0}
 50%|██████████████████████████████████████████████████████████████████████▋                                                                       | 286/575 [1:24:27<1:24:04, 17.46s/it] 50%|██████████████████████████████████████████████████████████████████████▉                                                                       | 287/575 [1:24:44<1:24:05, 17.52s/it]                                                                                                                                                                                         {'loss': 1.9642, 'grad_norm': 1.3119587898254395, 'learning_rate': 0.00028963627242450875, 'ppl': 7.12921, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.341064453125, 'epoch': 0.5, 'tokens/total': 37617664.0, 'tokens/trainable': 36009936.0}
 50%|██████████████████████████████████████████████████████████████████████▉                                                                       | 287/575 [1:24:44<1:24:05, 17.52s/it] 50%|███████████████████████████████████████████████████████████████████████                                                                       | 288/575 [1:25:02<1:23:20, 17.42s/it]                                                                                                                                                                                         {'loss': 1.9495, 'grad_norm': 1.2241737842559814, 'learning_rate': 0.0002883651236094328, 'ppl': 7.02517, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.42506408691406, 'epoch': 0.5, 'tokens/total': 37748736.0, 'tokens/trainable': 36134956.0}
 50%|███████████████████████████████████████████████████████████████████████                                                                       | 288/575 [1:25:02<1:23:20, 17.42s/it] 50%|███████████████████████████████████████████████████████████████████████▎                                                                      | 289/575 [1:25:19<1:22:43, 17.35s/it]                                                                                                                                                                                         {'loss': 1.9087, 'grad_norm': 1.4705681800842285, 'learning_rate': 0.00028709354655596524, 'ppl': 6.74432, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.0192413330078, 'epoch': 0.5, 'tokens/total': 37879808.0, 'tokens/trainable': 36260368.0}
 50%|███████████████████████████████████████████████████████████████████████▎                                                                      | 289/575 [1:25:19<1:22:43, 17.35s/it] 50%|███████████████████████████████████████████████████████████████████████▌                                                                      | 290/575 [1:25:36<1:22:01, 17.27s/it]                                                                                                                                                                                         {'loss': 1.937, 'grad_norm': 1.4557392597198486, 'learning_rate': 0.0002858215820073217, 'ppl': 6.93791, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.55966186523438, 'epoch': 0.5, 'tokens/total': 38010880.0, 'tokens/trainable': 36385696.0}
 50%|███████████████████████████████████████████████████████████████████████▌                                                                      | 290/575 [1:25:36<1:22:01, 17.27s/it] 51%|███████████████████████████████████████████████████████████████████████▊                                                                      | 291/575 [1:25:53<1:22:18, 17.39s/it]                                                                                                                                                                                         {'loss': 1.8932, 'grad_norm': 1.3782254457473755, 'learning_rate': 0.0002845492707191334, 'ppl': 6.64058, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.2237548828125, 'epoch': 0.51, 'tokens/total': 38141952.0, 'tokens/trainable': 36510936.0}
 51%|███████████████████████████████████████████████████████████████████████▊                                                                      | 291/575 [1:25:53<1:22:18, 17.39s/it] 51%|████████████████████████████████████████████████████████████████████████                                                                      | 292/575 [1:26:11<1:21:34, 17.29s/it]                                                                                                                                                                                         {'loss': 1.9211, 'grad_norm': 1.4108954668045044, 'learning_rate': 0.0002832766534581421, 'ppl': 6.82847, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.2566375732422, 'epoch': 0.51, 'tokens/total': 38273024.0, 'tokens/trainable': 36636472.0}
 51%|████████████████████████████████████████████████████████████████████████                                                                      | 292/575 [1:26:11<1:21:34, 17.29s/it] 51%|████████████████████████████████████████████████████████████████████████▎                                                                     | 293/575 [1:26:28<1:21:28, 17.34s/it]                                                                                                                                                                                         {'loss': 1.9146, 'grad_norm': 1.271616816520691, 'learning_rate': 0.0002820037710008931, 'ppl': 6.78422, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.27450561523438, 'epoch': 0.51, 'tokens/total': 38404096.0, 'tokens/trainable': 36761980.0}
 51%|████████████████████████████████████████████████████████████████████████▎                                                                     | 293/575 [1:26:28<1:21:28, 17.34s/it] 51%|████████████████████████████████████████████████████████████████████████▌                                                                     | 294/575 [1:26:46<1:21:39, 17.44s/it]                                                                                                                                                                                         {'loss': 1.8932, 'grad_norm': 1.3537143468856812, 'learning_rate': 0.000280730664132429, 'ppl': 6.64058, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.51431274414062, 'epoch': 0.51, 'tokens/total': 38535168.0, 'tokens/trainable': 36887160.0}
 51%|████████████████████████████████████████████████████████████████████████▌                                                                     | 294/575 [1:26:46<1:21:39, 17.44s/it] 51%|████████████████████████████████████████████████████████████████████████▊                                                                     | 295/575 [1:27:03<1:21:41, 17.51s/it]                                                                                                                                                                                         {'loss': 1.9162, 'grad_norm': 1.0420571565628052, 'learning_rate': 0.00027945737364498295, 'ppl': 6.79509, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.90586853027344, 'epoch': 0.51, 'tokens/total': 38666240.0, 'tokens/trainable': 37012628.0}
 51%|████████████████████████████████████████████████████████████████████████▊                                                                     | 295/575 [1:27:03<1:21:41, 17.51s/it] 51%|█████████████████████████████████████████████████████████████████████████                                                                     | 296/575 [1:27:21<1:21:27, 17.52s/it]                                                                                                                                                                                         {'loss': 1.9526, 'grad_norm': 1.3722800016403198, 'learning_rate': 0.0002781839403366715, 'ppl': 7.04699, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.09368896484375, 'epoch': 0.51, 'tokens/total': 38797312.0, 'tokens/trainable': 37138204.0}
 51%|█████████████████████████████████████████████████████████████████████████                                                                     | 296/575 [1:27:21<1:21:27, 17.52s/it] 52%|█████████████████████████████████████████████████████████████████████████▎                                                                    | 297/575 [1:27:38<1:21:22, 17.56s/it]                                                                                                                                                                                         {'loss': 1.8629, 'grad_norm': 1.2224717140197754, 'learning_rate': 0.0002769104050101873, 'ppl': 6.44239, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.269775390625, 'epoch': 0.52, 'tokens/total': 38928384.0, 'tokens/trainable': 37263452.0}
 52%|█████████████████████████████████████████████████████████████████████████▎                                                                    | 297/575 [1:27:39<1:21:22, 17.56s/it] 52%|█████████████████████████████████████████████████████████████████████████▌                                                                    | 298/575 [1:27:56<1:21:13, 17.60s/it]                                                                                                                                                                                         {'loss': 1.9346, 'grad_norm': 1.22100031375885, 'learning_rate': 0.00027563680847149185, 'ppl': 6.92127, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.47056579589844, 'epoch': 0.52, 'tokens/total': 39059456.0, 'tokens/trainable': 37389016.0}
 52%|█████████████████████████████████████████████████████████████████████████▌                                                                    | 298/575 [1:27:56<1:21:13, 17.60s/it] 52%|█████████████████████████████████████████████████████████████████████████▊                                                                    | 299/575 [1:28:14<1:21:02, 17.62s/it]                                                                                                                                                                                         {'loss': 1.9096, 'grad_norm': 1.3959394693374634, 'learning_rate': 0.00027436319152850813, 'ppl': 6.75039, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.9197998046875, 'epoch': 0.52, 'tokens/total': 39190528.0, 'tokens/trainable': 37514452.0}
 52%|█████████████████████████████████████████████████████████████████████████▊                                                                    | 299/575 [1:28:14<1:21:02, 17.62s/it] 52%|██████████████████████████████████████████████████████████████████████████                                                                    | 300/575 [1:28:31<1:20:19, 17.53s/it]                                                                                                                                                                                         {'loss': 1.8819, 'grad_norm': 0.9754895567893982, 'learning_rate': 0.0002730895949898128, 'ppl': 6.56597, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.50582885742188, 'epoch': 0.52, 'tokens/total': 39321600.0, 'tokens/trainable': 37639956.0}
[2026-03-12 23:10:37,364] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...

  0%|                                                                                                                                                             | 0/54 [00:00<?, ?it/s][A
  4%|█████▌                                                                                                                                               | 2/54 [00:00<00:12,  4.11it/s][A
  6%|████████▎                                                                                                                                            | 3/54 [00:00<00:17,  2.89it/s][A
  7%|███████████                                                                                                                                          | 4/54 [00:01<00:19,  2.50it/s][A
  9%|█████████████▊                                                                                                                                       | 5/54 [00:01<00:21,  2.32it/s][A
 11%|████████████████▌                                                                                                                                    | 6/54 [00:02<00:21,  2.22it/s][A
 13%|███████████████████▎                                                                                                                                 | 7/54 [00:02<00:21,  2.16it/s][A
 15%|██████████████████████                                                                                                                               | 8/54 [00:03<00:21,  2.12it/s][A
 17%|████████████████████████▊                                                                                                                            | 9/54 [00:03<00:21,  2.09it/s][A
 19%|███████████████████████████▍                                                                                                                        | 10/54 [00:04<00:21,  2.08it/s][A
 20%|██████████████████████████████▏                                                                                                                     | 11/54 [00:04<00:20,  2.07it/s][A
 22%|████████████████████████████████▉                                                                                                                   | 12/54 [00:05<00:20,  2.06it/s][A
 24%|███████████████████████████████████▋                                                                                                                | 13/54 [00:05<00:19,  2.05it/s][A
 26%|██████████████████████████████████████▎                                                                                                             | 14/54 [00:06<00:19,  2.05it/s][A
 28%|█████████████████████████████████████████                                                                                                           | 15/54 [00:06<00:19,  2.05it/s][A
 30%|███████████████████████████████████████████▊                                                                                                        | 16/54 [00:07<00:18,  2.04it/s][A
 31%|██████████████████████████████████████████████▌                                                                                                     | 17/54 [00:07<00:18,  2.04it/s][A
 33%|█████████████████████████████████████████████████▎                                                                                                  | 18/54 [00:08<00:17,  2.04it/s][A
 35%|████████████████████████████████████████████████████                                                                                                | 19/54 [00:08<00:17,  2.04it/s][A
 37%|██████████████████████████████████████████████████████▊                                                                                             | 20/54 [00:09<00:16,  2.04it/s][A
 39%|█████████████████████████████████████████████████████████▌                                                                                          | 21/54 [00:09<00:16,  2.04it/s][A
 41%|████████████████████████████████████████████████████████████▎                                                                                       | 22/54 [00:10<00:15,  2.04it/s][A
 43%|███████████████████████████████████████████████████████████████                                                                                     | 23/54 [00:10<00:15,  2.04it/s][A
 44%|█████████████████████████████████████████████████████████████████▊                                                                                  | 24/54 [00:11<00:14,  2.04it/s][A
 46%|████████████████████████████████████████████████████████████████████▌                                                                               | 25/54 [00:11<00:14,  2.04it/s][A
 48%|███████████████████████████████████████████████████████████████████████▎                                                                            | 26/54 [00:12<00:13,  2.04it/s][A
 50%|██████████████████████████████████████████████████████████████████████████                                                                          | 27/54 [00:12<00:13,  2.04it/s][A
 52%|████████████████████████████████████████████████████████████████████████████▋                                                                       | 28/54 [00:13<00:12,  2.04it/s][A
 54%|███████████████████████████████████████████████████████████████████████████████▍                                                                    | 29/54 [00:13<00:12,  2.04it/s][A
 56%|██████████████████████████████████████████████████████████████████████████████████▏                                                                 | 30/54 [00:14<00:11,  2.04it/s][A
 57%|████████████████████████████████████████████████████████████████████████████████████▉                                                               | 31/54 [00:14<00:11,  2.04it/s][A
 59%|███████████████████████████████████████████████████████████████████████████████████████▋                                                            | 32/54 [00:15<00:10,  2.04it/s][A
 61%|██████████████████████████████████████████████████████████████████████████████████████████▍                                                         | 33/54 [00:15<00:10,  1.92it/s][A
 63%|█████████████████████████████████████████████████████████████████████████████████████████████▏                                                      | 34/54 [00:16<00:10,  1.99it/s][A
 65%|███████████████████████████████████████████████████████████████████████████████████████████████▉                                                    | 35/54 [00:16<00:09,  2.00it/s][A
 67%|██████████████████████████████████████████████████████████████████████████████████████████████████▋                                                 | 36/54 [00:17<00:08,  2.01it/s][A
 69%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                                              | 37/54 [00:17<00:08,  2.02it/s][A
 70%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                           | 38/54 [00:18<00:07,  2.03it/s][A
 72%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                         | 39/54 [00:18<00:07,  2.03it/s][A
 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                      | 40/54 [00:19<00:06,  2.03it/s][A
 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                   | 41/54 [00:19<00:06,  2.03it/s][A
 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                 | 42/54 [00:20<00:05,  2.04it/s][A
 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                              | 43/54 [00:20<00:05,  2.04it/s][A
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                           | 44/54 [00:21<00:04,  2.04it/s][A
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                        | 45/54 [00:21<00:04,  2.04it/s][A
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                      | 46/54 [00:22<00:03,  2.04it/s][A
 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                   | 47/54 [00:22<00:03,  2.04it/s][A
 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                | 48/54 [00:23<00:02,  2.04it/s][A
 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎             | 49/54 [00:23<00:02,  2.04it/s][A
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████           | 50/54 [00:24<00:01,  2.04it/s][A
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊        | 51/54 [00:24<00:01,  2.04it/s][A
 96%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌     | 52/54 [00:25<00:00,  2.04it/s][A
 98%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎  | 53/54 [00:25<00:00,  2.04it/s][A
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:26<00:00,  1.97it/s][A                                                                                                                                                                                         
{'eval_loss': 1.8711814880371094, 'eval_runtime': 27.3027, 'eval_samples_per_second': 7.911, 'eval_steps_per_second': 1.978, 'eval_ppl': 6.49597, 'memory/max_active (GiB)': 26.29, 'memory/max_allocated (GiB)': 26.29, 'memory/device_reserved (GiB)': 27.83, 'epoch': 0.52, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 39321600.0, 'tokens/trainable': 37639956.0}
 52%|██████████████████████████████████████████████████████████████████████████                                                                    | 300/575 [1:28:59<1:20:19, 17.53s/it]
[2026-03-12 23:11:04,671] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-300
 52%|██████████████████████████████████████████████████████████████████████████▎                                                                   | 301/575 [1:29:31<2:17:22, 30.08s/it]                                                                                                                                                                                         {'loss': 1.9059, 'grad_norm': 1.2407959699630737, 'learning_rate': 0.00027181605966332856, 'ppl': 6.72546, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.2213134765625, 'epoch': 0.52, 'tokens/total': 39452672.0, 'tokens/trainable': 37765336.0}
 52%|██████████████████████████████████████████████████████████████████████████▎                                                                   | 301/575 [1:29:31<2:17:22, 30.08s/it] 53%|██████████████████████████████████████████████████████████████████████████▌                                                                   | 302/575 [1:29:48<1:59:45, 26.32s/it]                                                                                                                                                                                         {'loss': 1.901, 'grad_norm': 1.1157277822494507, 'learning_rate': 0.00027054262635501703, 'ppl': 6.69258, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.8702392578125, 'epoch': 0.53, 'tokens/total': 39583744.0, 'tokens/trainable': 37890152.0}
 53%|██████████████████████████████████████████████████████████████████████████▌                                                                   | 302/575 [1:29:48<1:59:45, 26.32s/it] 53%|██████████████████████████████████████████████████████████████████████████▊                                                                   | 303/575 [1:30:06<1:47:33, 23.73s/it]                                                                                                                                                                                         {'loss': 1.8633, 'grad_norm': 1.1848726272583008, 'learning_rate': 0.00026926933586757105, 'ppl': 6.44497, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.5741424560547, 'epoch': 0.53, 'tokens/total': 39714816.0, 'tokens/trainable': 38015644.0}
 53%|██████████████████████████████████████████████████████████████████████████▊                                                                   | 303/575 [1:30:06<1:47:33, 23.73s/it] 53%|███████████████████████████████████████████████████████████████████████████                                                                   | 304/575 [1:30:23<1:38:19, 21.77s/it]                                                                                                                                                                                         {'loss': 1.9388, 'grad_norm': 1.5901085138320923, 'learning_rate': 0.0002679962289991069, 'ppl': 6.95041, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 232.61790466308594, 'epoch': 0.53, 'tokens/total': 39845888.0, 'tokens/trainable': 38140932.0}
 53%|███████████████████████████████████████████████████████████████████████████                                                                   | 304/575 [1:30:23<1:38:19, 21.77s/it] 53%|███████████████████████████████████████████████████████████████████████████▎                                                                  | 305/575 [1:30:41<1:32:37, 20.58s/it]                                                                                                                                                                                         {'loss': 1.8804, 'grad_norm': 1.455842137336731, 'learning_rate': 0.000266723346541858, 'ppl': 6.55613, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.2462158203125, 'epoch': 0.53, 'tokens/total': 39976960.0, 'tokens/trainable': 38266176.0}
 53%|███████████████████████████████████████████████████████████████████████████▎                                                                  | 305/575 [1:30:41<1:32:37, 20.58s/it] 53%|███████████████████████████████████████████████████████████████████████████▌                                                                  | 306/575 [1:30:58<1:28:13, 19.68s/it]                                                                                                                                                                                         {'loss': 1.9055, 'grad_norm': 1.146470546722412, 'learning_rate': 0.00026545072928086674, 'ppl': 6.72277, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.76736450195312, 'epoch': 0.53, 'tokens/total': 40108032.0, 'tokens/trainable': 38391340.0}
 53%|███████████████████████████████████████████████████████████████████████████▌                                                                  | 306/575 [1:30:58<1:28:13, 19.68s/it] 53%|███████████████████████████████████████████████████████████████████████████▊                                                                  | 307/575 [1:31:16<1:24:35, 18.94s/it]                                                                                                                                                                                         {'loss': 1.8969, 'grad_norm': 1.0944169759750366, 'learning_rate': 0.0002641784179926785, 'ppl': 6.6652, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.4956512451172, 'epoch': 0.53, 'tokens/total': 40239104.0, 'tokens/trainable': 38516608.0}
 53%|███████████████████████████████████████████████████████████████████████████▊                                                                  | 307/575 [1:31:16<1:24:35, 18.94s/it] 54%|████████████████████████████████████████████████████████████████████████████                                                                  | 308/575 [1:31:33<1:22:17, 18.49s/it]                                                                                                                                                                                         {'loss': 1.9045, 'grad_norm': 1.0204535722732544, 'learning_rate': 0.00026290645344403474, 'ppl': 6.71605, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.96420288085938, 'epoch': 0.54, 'tokens/total': 40370176.0, 'tokens/trainable': 38641592.0}
 54%|████████████████████████████████████████████████████████████████████████████                                                                  | 308/575 [1:31:33<1:22:17, 18.49s/it] 54%|████████████████████████████████████████████████████████████████████████████▎                                                                 | 309/575 [1:31:51<1:20:45, 18.22s/it]                                                                                                                                                                                         {'loss': 1.9289, 'grad_norm': 1.1671645641326904, 'learning_rate': 0.0002616348763905672, 'ppl': 6.88194, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.39268493652344, 'epoch': 0.54, 'tokens/total': 40501248.0, 'tokens/trainable': 38766784.0}
 54%|████████████████████████████████████████████████████████████████████████████▎                                                                 | 309/575 [1:31:51<1:20:45, 18.22s/it] 54%|████████████████████████████████████████████████████████████████████████████▌                                                                 | 310/575 [1:32:08<1:18:57, 17.88s/it]                                                                                                                                                                                         {'loss': 1.9191, 'grad_norm': 1.0658930540084839, 'learning_rate': 0.00026036372757549134, 'ppl': 6.81482, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.4108123779297, 'epoch': 0.54, 'tokens/total': 40632320.0, 'tokens/trainable': 38892204.0}
 54%|████████████████████████████████████████████████████████████████████████████▌                                                                 | 310/575 [1:32:08<1:18:57, 17.88s/it] 54%|████████████████████████████████████████████████████████████████████████████▊                                                                 | 311/575 [1:32:25<1:18:24, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8795, 'grad_norm': 1.2252206802368164, 'learning_rate': 0.00025909304772830125, 'ppl': 6.55023, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.003173828125, 'epoch': 0.54, 'tokens/total': 40763392.0, 'tokens/trainable': 39017608.0}
 54%|████████████████████████████████████████████████████████████████████████████▊                                                                 | 311/575 [1:32:25<1:18:24, 17.82s/it] 54%|█████████████████████████████████████████████████████████████████████████████                                                                 | 312/575 [1:32:43<1:17:46, 17.74s/it]                                                                                                                                                                                         {'loss': 1.8985, 'grad_norm': 1.1747313737869263, 'learning_rate': 0.00025782287756346466, 'ppl': 6.67587, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.04100036621094, 'epoch': 0.54, 'tokens/total': 40894464.0, 'tokens/trainable': 39143088.0}
 54%|█████████████████████████████████████████████████████████████████████████████                                                                 | 312/575 [1:32:43<1:17:46, 17.74s/it] 54%|█████████████████████████████████████████████████████████████████████████████▎                                                                | 313/575 [1:33:00<1:17:04, 17.65s/it]                                                                                                                                                                                         {'loss': 1.8898, 'grad_norm': 1.0169827938079834, 'learning_rate': 0.0002565532577791185, 'ppl': 6.61804, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.53338623046875, 'epoch': 0.54, 'tokens/total': 41025536.0, 'tokens/trainable': 39268360.0}
 54%|█████████████████████████████████████████████████████████████████████████████▎                                                                | 313/575 [1:33:00<1:17:04, 17.65s/it] 55%|█████████████████████████████████████████████████████████████████████████████▌                                                                | 314/575 [1:33:18<1:16:11, 17.52s/it]                                                                                                                                                                                         {'loss': 1.8927, 'grad_norm': 1.5090376138687134, 'learning_rate': 0.00025528422905576415, 'ppl': 6.63727, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.46832275390625, 'epoch': 0.55, 'tokens/total': 41156608.0, 'tokens/trainable': 39393936.0}
 55%|█████████████████████████████████████████████████████████████████████████████▌                                                                | 314/575 [1:33:18<1:16:11, 17.52s/it] 55%|█████████████████████████████████████████████████████████████████████████████▊                                                                | 315/575 [1:33:35<1:15:48, 17.49s/it]                                                                                                                                                                                         {'loss': 1.8757, 'grad_norm': 1.0717469453811646, 'learning_rate': 0.00025401583205496536, 'ppl': 6.52539, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.07275390625, 'epoch': 0.55, 'tokens/total': 41287680.0, 'tokens/trainable': 39519912.0}
 55%|█████████████████████████████████████████████████████████████████████████████▊                                                                | 315/575 [1:33:35<1:15:48, 17.49s/it] 55%|██████████████████████████████████████████████████████████████████████████████                                                                | 316/575 [1:33:53<1:15:35, 17.51s/it]                                                                                                                                                                                         {'loss': 1.8728, 'grad_norm': 1.12404203414917, 'learning_rate': 0.0002527481074180438, 'ppl': 6.50649, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.06687927246094, 'epoch': 0.55, 'tokens/total': 41418752.0, 'tokens/trainable': 39645696.0}
 55%|██████████████████████████████████████████████████████████████████████████████                                                                | 316/575 [1:33:53<1:15:35, 17.51s/it] 55%|██████████████████████████████████████████████████████████████████████████████▎                                                               | 317/575 [1:34:10<1:15:12, 17.49s/it]                                                                                                                                                                                         {'loss': 1.9018, 'grad_norm': 1.238993525505066, 'learning_rate': 0.000251481095764778, 'ppl': 6.69794, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.0487060546875, 'epoch': 0.55, 'tokens/total': 41549824.0, 'tokens/trainable': 39771172.0}
 55%|██████████████████████████████████████████████████████████████████████████████▎                                                               | 317/575 [1:34:10<1:15:12, 17.49s/it] 55%|██████████████████████████████████████████████████████████████████████████████▌                                                               | 318/575 [1:34:28<1:15:00, 17.51s/it]                                                                                                                                                                                         {'loss': 1.9187, 'grad_norm': 0.980074405670166, 'learning_rate': 0.0002502148376921013, 'ppl': 6.8121, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.49557495117188, 'epoch': 0.55, 'tokens/total': 41680896.0, 'tokens/trainable': 39896600.0}
 55%|██████████████████████████████████████████████████████████████████████████████▌                                                               | 318/575 [1:34:28<1:15:00, 17.51s/it] 55%|██████████████████████████████████████████████████████████████████████████████▊                                                               | 319/575 [1:34:45<1:15:04, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8745, 'grad_norm': 1.1886321306228638, 'learning_rate': 0.00024894937377280117, 'ppl': 6.51756, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.6207733154297, 'epoch': 0.55, 'tokens/total': 41811968.0, 'tokens/trainable': 40022192.0}
 55%|██████████████████████████████████████████████████████████████████████████████▊                                                               | 319/575 [1:34:45<1:15:04, 17.60s/it] 56%|███████████████████████████████████████████████████████████████████████████████                                                               | 320/575 [1:35:03<1:14:43, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8982, 'grad_norm': 1.341811180114746, 'learning_rate': 0.00024768474455421925, 'ppl': 6.67387, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.69537353515625, 'epoch': 0.56, 'tokens/total': 41943040.0, 'tokens/trainable': 40147728.0}
 56%|███████████████████████████████████████████████████████████████████████████████                                                               | 320/575 [1:35:03<1:14:43, 17.58s/it] 56%|███████████████████████████████████████████████████████████████████████████████▎                                                              | 321/575 [1:35:20<1:14:05, 17.50s/it]                                                                                                                                                                                         {'loss': 1.9008, 'grad_norm': 1.0985815525054932, 'learning_rate': 0.0002464209905569523, 'ppl': 6.69125, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.55136108398438, 'epoch': 0.56, 'tokens/total': 42074112.0, 'tokens/trainable': 40273240.0}
 56%|███████████████████████████████████████████████████████████████████████████████▎                                                              | 321/575 [1:35:20<1:14:05, 17.50s/it] 56%|███████████████████████████████████████████████████████████████████████████████▌                                                              | 322/575 [1:35:37<1:13:16, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9241, 'grad_norm': 1.2415250539779663, 'learning_rate': 0.0002451581522735535, 'ppl': 6.84898, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.1411590576172, 'epoch': 0.56, 'tokens/total': 42205184.0, 'tokens/trainable': 40398716.0}
 56%|███████████████████████████████████████████████████████████████████████████████▌                                                              | 322/575 [1:35:37<1:13:16, 17.38s/it] 56%|███████████████████████████████████████████████████████████████████████████████▊                                                              | 323/575 [1:35:55<1:13:12, 17.43s/it]                                                                                                                                                                                         {'loss': 1.836, 'grad_norm': 1.2683546543121338, 'learning_rate': 0.0002438962701672352, 'ppl': 6.2714, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.97036743164062, 'epoch': 0.56, 'tokens/total': 42336256.0, 'tokens/trainable': 40524288.0}
 56%|███████████████████████████████████████████████████████████████████████████████▊                                                              | 323/575 [1:35:55<1:13:12, 17.43s/it] 56%|████████████████████████████████████████████████████████████████████████████████                                                              | 324/575 [1:36:12<1:12:55, 17.43s/it]                                                                                                                                                                                         {'loss': 1.8732, 'grad_norm': 1.4765704870224, 'learning_rate': 0.00024263538467057255, 'ppl': 6.50909, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.8763885498047, 'epoch': 0.56, 'tokens/total': 42467328.0, 'tokens/trainable': 40649876.0}
 56%|████████████████████████████████████████████████████████████████████████████████                                                              | 324/575 [1:36:12<1:12:55, 17.43s/it] 57%|████████████████████████████████████████████████████████████████████████████████▎                                                             | 325/575 [1:36:30<1:12:38, 17.43s/it]                                                                                                                                                                                         {'loss': 1.9118, 'grad_norm': 1.0537068843841553, 'learning_rate': 0.0002413755361842077, 'ppl': 6.76526, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.03639221191406, 'epoch': 0.57, 'tokens/total': 42598400.0, 'tokens/trainable': 40775256.0}
 57%|████████████████████████████████████████████████████████████████████████████████▎                                                             | 325/575 [1:36:30<1:12:38, 17.43s/it] 57%|████████████████████████████████████████████████████████████████████████████████▌                                                             | 326/575 [1:36:47<1:12:38, 17.51s/it]                                                                                                                                                                                         {'loss': 1.895, 'grad_norm': 1.073633074760437, 'learning_rate': 0.00024011676507555547, 'ppl': 6.65255, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.8553924560547, 'epoch': 0.57, 'tokens/total': 42729472.0, 'tokens/trainable': 40900772.0}
 57%|████████████████████████████████████████████████████████████████████████████████▌                                                             | 326/575 [1:36:47<1:12:38, 17.51s/it] 57%|████████████████████████████████████████████████████████████████████████████████▊                                                             | 327/575 [1:37:05<1:11:58, 17.41s/it]                                                                                                                                                                                         {'loss': 1.8643, 'grad_norm': 1.029090166091919, 'learning_rate': 0.00023885911167751013, 'ppl': 6.45142, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.92611694335938, 'epoch': 0.57, 'tokens/total': 42860544.0, 'tokens/trainable': 41026520.0}
 57%|████████████████████████████████████████████████████████████████████████████████▊                                                             | 327/575 [1:37:05<1:11:58, 17.41s/it] 57%|█████████████████████████████████████████████████████████████████████████████████                                                             | 328/575 [1:37:22<1:11:33, 17.38s/it]                                                                                                                                                                                         {'loss': 1.9249, 'grad_norm': 1.767303466796875, 'learning_rate': 0.00023760261628715253, 'ppl': 6.85446, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.74435424804688, 'epoch': 0.57, 'tokens/total': 42991616.0, 'tokens/trainable': 41151540.0}
 57%|█████████████████████████████████████████████████████████████████████████████████                                                             | 328/575 [1:37:22<1:11:33, 17.38s/it] 57%|█████████████████████████████████████████████████████████████████████████████████▏                                                            | 329/575 [1:37:40<1:11:46, 17.51s/it]                                                                                                                                                                                         {'loss': 1.9041, 'grad_norm': 1.0151958465576172, 'learning_rate': 0.00023634731916445935, 'ppl': 6.71336, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.0203094482422, 'epoch': 0.57, 'tokens/total': 43122688.0, 'tokens/trainable': 41276908.0}
 57%|█████████████████████████████████████████████████████████████████████████████████▏                                                            | 329/575 [1:37:40<1:11:46, 17.51s/it] 57%|█████████████████████████████████████████████████████████████████████████████████▍                                                            | 330/575 [1:37:57<1:11:23, 17.48s/it]                                                                                                                                                                                         {'loss': 1.8734, 'grad_norm': 0.9594698548316956, 'learning_rate': 0.0002350932605310131, 'ppl': 6.51039, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.6597900390625, 'epoch': 0.57, 'tokens/total': 43253760.0, 'tokens/trainable': 41402236.0}
 57%|█████████████████████████████████████████████████████████████████████████████████▍                                                            | 330/575 [1:37:57<1:11:23, 17.48s/it] 58%|█████████████████████████████████████████████████████████████████████████████████▋                                                            | 331/575 [1:38:15<1:11:11, 17.51s/it]                                                                                                                                                                                         {'loss': 1.8624, 'grad_norm': 1.1356629133224487, 'learning_rate': 0.00023384048056871305, 'ppl': 6.43917, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.06192016601562, 'epoch': 0.58, 'tokens/total': 43384832.0, 'tokens/trainable': 41527524.0}
 58%|█████████████████████████████████████████████████████████████████████████████████▋                                                            | 331/575 [1:38:15<1:11:11, 17.51s/it] 58%|█████████████████████████████████████████████████████████████████████████████████▉                                                            | 332/575 [1:38:32<1:11:14, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8653, 'grad_norm': 1.2109575271606445, 'learning_rate': 0.0002325890194184881, 'ppl': 6.45787, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.61151123046875, 'epoch': 0.58, 'tokens/total': 43515904.0, 'tokens/trainable': 41652920.0}
 58%|█████████████████████████████████████████████████████████████████████████████████▉                                                            | 332/575 [1:38:33<1:11:14, 17.59s/it] 58%|██████████████████████████████████████████████████████████████████████████████████▏                                                           | 333/575 [1:38:50<1:10:54, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8589, 'grad_norm': 1.0299713611602783, 'learning_rate': 0.00023133891717901057, 'ppl': 6.41667, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.6560516357422, 'epoch': 0.58, 'tokens/total': 43646976.0, 'tokens/trainable': 41778412.0}
 58%|██████████████████████████████████████████████████████████████████████████████████▏                                                           | 333/575 [1:38:50<1:10:54, 17.58s/it] 58%|██████████████████████████████████████████████████████████████████████████████████▍                                                           | 334/575 [1:39:08<1:11:00, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8418, 'grad_norm': 1.1334625482559204, 'learning_rate': 0.000230090213905411, 'ppl': 6.30788, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.2531280517578, 'epoch': 0.58, 'tokens/total': 43778048.0, 'tokens/trainable': 41903848.0}
 58%|██████████████████████████████████████████████████████████████████████████████████▍                                                           | 334/575 [1:39:08<1:11:00, 17.68s/it] 58%|██████████████████████████████████████████████████████████████████████████████████▋                                                           | 335/575 [1:39:25<1:10:25, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8563, 'grad_norm': 0.9519019722938538, 'learning_rate': 0.00022884294960799506, 'ppl': 6.40001, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.64572143554688, 'epoch': 0.58, 'tokens/total': 43909120.0, 'tokens/trainable': 42029492.0}
 58%|██████████████████████████████████████████████████████████████████████████████████▋                                                           | 335/575 [1:39:25<1:10:25, 17.60s/it] 58%|██████████████████████████████████████████████████████████████████████████████████▉                                                           | 336/575 [1:39:43<1:10:20, 17.66s/it]                                                                                                                                                                                         {'loss': 1.8197, 'grad_norm': 1.2365071773529053, 'learning_rate': 0.00022759716425096166, 'ppl': 6.17001, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.9341278076172, 'epoch': 0.58, 'tokens/total': 44040192.0, 'tokens/trainable': 42155172.0}
 58%|██████████████████████████████████████████████████████████████████████████████████▉                                                           | 336/575 [1:39:43<1:10:20, 17.66s/it] 59%|███████████████████████████████████████████████████████████████████████████████████▏                                                          | 337/575 [1:40:01<1:09:55, 17.63s/it]                                                                                                                                                                                         {'loss': 1.8489, 'grad_norm': 0.9652379751205444, 'learning_rate': 0.00022635289775112195, 'ppl': 6.35283, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.71487426757812, 'epoch': 0.59, 'tokens/total': 44171264.0, 'tokens/trainable': 42280944.0}
 59%|███████████████████████████████████████████████████████████████████████████████████▏                                                          | 337/575 [1:40:01<1:09:55, 17.63s/it] 59%|███████████████████████████████████████████████████████████████████████████████████▍                                                          | 338/575 [1:40:18<1:08:49, 17.42s/it]                                                                                                                                                                                         {'loss': 1.838, 'grad_norm': 1.119476556777954, 'learning_rate': 0.00022511018997662096, 'ppl': 6.28396, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.9993133544922, 'epoch': 0.59, 'tokens/total': 44302336.0, 'tokens/trainable': 42406320.0}
 59%|███████████████████████████████████████████████████████████████████████████████████▍                                                          | 338/575 [1:40:18<1:08:49, 17.42s/it] 59%|███████████████████████████████████████████████████████████████████████████████████▋                                                          | 339/575 [1:40:35<1:08:32, 17.43s/it]                                                                                                                                                                                         {'loss': 1.9179, 'grad_norm': 1.0831873416900635, 'learning_rate': 0.00022386908074565975, 'ppl': 6.80665, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.20697021484375, 'epoch': 0.59, 'tokens/total': 44433408.0, 'tokens/trainable': 42531476.0}
 59%|███████████████████████████████████████████████████████████████████████████████████▋                                                          | 339/575 [1:40:35<1:08:32, 17.43s/it] 59%|███████████████████████████████████████████████████████████████████████████████████▉                                                          | 340/575 [1:40:53<1:08:23, 17.46s/it]                                                                                                                                                                                         {'loss': 1.8607, 'grad_norm': 1.2580469846725464, 'learning_rate': 0.00022262960982521963, 'ppl': 6.42823, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.62705993652344, 'epoch': 0.59, 'tokens/total': 44564480.0, 'tokens/trainable': 42656188.0}
 59%|███████████████████████████████████████████████████████████████████████████████████▉                                                          | 340/575 [1:40:53<1:08:23, 17.46s/it] 59%|████████████████████████████████████████████████████████████████████████████████████▏                                                         | 341/575 [1:41:10<1:07:47, 17.38s/it]                                                                                                                                                                                         {'loss': 1.8748, 'grad_norm': 1.0357165336608887, 'learning_rate': 0.00022139181692978793, 'ppl': 6.51952, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.8575439453125, 'epoch': 0.59, 'tokens/total': 44695552.0, 'tokens/trainable': 42781196.0}
 59%|████████████████████████████████████████████████████████████████████████████████████▏                                                         | 341/575 [1:41:10<1:07:47, 17.38s/it] 59%|████████████████████████████████████████████████████████████████████████████████████▍                                                         | 342/575 [1:41:27<1:07:25, 17.36s/it]                                                                                                                                                                                         {'loss': 1.8597, 'grad_norm': 1.0165393352508545, 'learning_rate': 0.00022015574172008567, 'ppl': 6.42181, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.69757080078125, 'epoch': 0.59, 'tokens/total': 44826624.0, 'tokens/trainable': 42906416.0}
 59%|████████████████████████████████████████████████████████████████████████████████████▍                                                         | 342/575 [1:41:27<1:07:25, 17.36s/it] 60%|████████████████████████████████████████████████████████████████████████████████████▋                                                         | 343/575 [1:41:45<1:07:21, 17.42s/it]                                                                                                                                                                                         {'loss': 1.8395, 'grad_norm': 0.9585210084915161, 'learning_rate': 0.00021892142380179676, 'ppl': 6.29339, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.2810821533203, 'epoch': 0.6, 'tokens/total': 44957696.0, 'tokens/trainable': 43032000.0}
 60%|████████████████████████████████████████████████████████████████████████████████████▋                                                         | 343/575 [1:41:45<1:07:21, 17.42s/it] 60%|████████████████████████████████████████████████████████████████████████████████████▉                                                         | 344/575 [1:42:02<1:07:13, 17.46s/it]                                                                                                                                                                                         {'loss': 1.9095, 'grad_norm': 0.9899284839630127, 'learning_rate': 0.00021768890272429864, 'ppl': 6.74971, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.76467895507812, 'epoch': 0.6, 'tokens/total': 45088768.0, 'tokens/trainable': 43156960.0}
 60%|████████████████████████████████████████████████████████████████████████████████████▉                                                         | 344/575 [1:42:02<1:07:13, 17.46s/it] 60%|█████████████████████████████████████████████████████████████████████████████████████▏                                                        | 345/575 [1:42:20<1:06:53, 17.45s/it]                                                                                                                                                                                         {'loss': 1.8962, 'grad_norm': 1.1269742250442505, 'learning_rate': 0.00021645821797939557, 'ppl': 6.66054, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.71351623535156, 'epoch': 0.6, 'tokens/total': 45219840.0, 'tokens/trainable': 43282272.0}
 60%|█████████████████████████████████████████████████████████████████████████████████████▏                                                        | 345/575 [1:42:20<1:06:53, 17.45s/it] 60%|█████████████████████████████████████████████████████████████████████████████████████▍                                                        | 346/575 [1:42:37<1:06:43, 17.48s/it]                                                                                                                                                                                         {'loss': 1.8473, 'grad_norm': 0.8750260472297668, 'learning_rate': 0.0002152294090000529, 'ppl': 6.34267, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.64755249023438, 'epoch': 0.6, 'tokens/total': 45350912.0, 'tokens/trainable': 43408136.0}
 60%|█████████████████████████████████████████████████████████████████████████████████████▍                                                        | 346/575 [1:42:37<1:06:43, 17.48s/it] 60%|█████████████████████████████████████████████████████████████████████████████████████▋                                                        | 347/575 [1:42:55<1:06:38, 17.54s/it]                                                                                                                                                                                         {'loss': 1.8513, 'grad_norm': 1.1124478578567505, 'learning_rate': 0.00021400251515913343, 'ppl': 6.36809, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.3921661376953, 'epoch': 0.6, 'tokens/total': 45481984.0, 'tokens/trainable': 43533392.0}
 60%|█████████████████████████████████████████████████████████████████████████████████████▋                                                        | 347/575 [1:42:55<1:06:38, 17.54s/it] 61%|█████████████████████████████████████████████████████████████████████████████████████▉                                                        | 348/575 [1:43:13<1:06:30, 17.58s/it]                                                                                                                                                                                         {'loss': 1.9134, 'grad_norm': 0.970862865447998, 'learning_rate': 0.0002127775757681365, 'ppl': 6.77609, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.2416229248047, 'epoch': 0.61, 'tokens/total': 45613056.0, 'tokens/trainable': 43658880.0}
 61%|█████████████████████████████████████████████████████████████████████████████████████▉                                                        | 348/575 [1:43:13<1:06:30, 17.58s/it] 61%|██████████████████████████████████████████████████████████████████████████████████████▏                                                       | 349/575 [1:43:30<1:06:02, 17.54s/it]                                                                                                                                                                                         {'loss': 1.846, 'grad_norm': 1.1555538177490234, 'learning_rate': 0.00021155463007593771, 'ppl': 6.33443, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.9266815185547, 'epoch': 0.61, 'tokens/total': 45744128.0, 'tokens/trainable': 43783736.0}
 61%|██████████████████████████████████████████████████████████████████████████████████████▏                                                       | 349/575 [1:43:30<1:06:02, 17.54s/it] 61%|██████████████████████████████████████████████████████████████████████████████████████▍                                                       | 350/575 [1:43:47<1:05:14, 17.40s/it]                                                                                                                                                                                         {'loss': 1.846, 'grad_norm': 1.0361915826797485, 'learning_rate': 0.00021033371726753158, 'ppl': 6.33443, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.4224395751953, 'epoch': 0.61, 'tokens/total': 45875200.0, 'tokens/trainable': 43908980.0}
 61%|██████████████████████████████████████████████████████████████████████████████████████▍                                                       | 350/575 [1:43:47<1:05:14, 17.40s/it] 61%|██████████████████████████████████████████████████████████████████████████████████████▋                                                       | 351/575 [1:44:04<1:04:35, 17.30s/it]                                                                                                                                                                                         {'loss': 1.8538, 'grad_norm': 0.9309911131858826, 'learning_rate': 0.0002091148764627762, 'ppl': 6.38403, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.2125701904297, 'epoch': 0.61, 'tokens/total': 46006272.0, 'tokens/trainable': 44034132.0}
 61%|██████████████████████████████████████████████████████████████████████████████████████▋                                                       | 351/575 [1:44:04<1:04:35, 17.30s/it] 61%|██████████████████████████████████████████████████████████████████████████████████████▉                                                       | 352/575 [1:44:22<1:04:26, 17.34s/it]                                                                                                                                                                                         {'loss': 1.8644, 'grad_norm': 2.120191812515259, 'learning_rate': 0.00020789814671513934, 'ppl': 6.45206, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.218017578125, 'epoch': 0.61, 'tokens/total': 46137344.0, 'tokens/trainable': 44159496.0}
 61%|██████████████████████████████████████████████████████████████████████████████████████▉                                                       | 352/575 [1:44:22<1:04:26, 17.34s/it] 61%|███████████████████████████████████████████████████████████████████████████████████████▏                                                      | 353/575 [1:44:39<1:04:31, 17.44s/it]                                                                                                                                                                                         {'loss': 1.8574, 'grad_norm': 0.9722480177879333, 'learning_rate': 0.0002066835670104476, 'ppl': 6.40706, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.8377227783203, 'epoch': 0.61, 'tokens/total': 46268416.0, 'tokens/trainable': 44284860.0}
 61%|███████████████████████████████████████████████████████████████████████████████████████▏                                                      | 353/575 [1:44:39<1:04:31, 17.44s/it] 62%|███████████████████████████████████████████████████████████████████████████████████████▍                                                      | 354/575 [1:44:57<1:04:05, 17.40s/it]                                                                                                                                                                                         {'loss': 1.8223, 'grad_norm': 0.9545662999153137, 'learning_rate': 0.0002054711762656369, 'ppl': 6.18607, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.09640502929688, 'epoch': 0.62, 'tokens/total': 46399488.0, 'tokens/trainable': 44410196.0}
 62%|███████████████████████████████████████████████████████████████████████████████████████▍                                                      | 354/575 [1:44:57<1:04:05, 17.40s/it] 62%|███████████████████████████████████████████████████████████████████████████████████████▋                                                      | 355/575 [1:45:14<1:03:50, 17.41s/it]                                                                                                                                                                                         {'loss': 1.8654, 'grad_norm': 1.0334396362304688, 'learning_rate': 0.00020426101332750556, 'ppl': 6.45852, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.9368438720703, 'epoch': 0.62, 'tokens/total': 46530560.0, 'tokens/trainable': 44535844.0}
 62%|███████████████████████████████████████████████████████████████████████████████████████▋                                                      | 355/575 [1:45:14<1:03:50, 17.41s/it] 62%|███████████████████████████████████████████████████████████████████████████████████████▉                                                      | 356/575 [1:45:32<1:03:42, 17.45s/it]                                                                                                                                                                                         {'loss': 1.8285, 'grad_norm': 0.9369050860404968, 'learning_rate': 0.00020305311697146983, 'ppl': 6.22454, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.87710571289062, 'epoch': 0.62, 'tokens/total': 46661632.0, 'tokens/trainable': 44661504.0}
 62%|███████████████████████████████████████████████████████████████████████████████████████▉                                                      | 356/575 [1:45:32<1:03:42, 17.45s/it] 62%|████████████████████████████████████████████████████████████████████████████████████████▏                                                     | 357/575 [1:45:49<1:03:43, 17.54s/it]                                                                                                                                                                                         {'loss': 1.8706, 'grad_norm': 1.2930678129196167, 'learning_rate': 0.00020184752590032124, 'ppl': 6.49219, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.9240264892578, 'epoch': 0.62, 'tokens/total': 46792704.0, 'tokens/trainable': 44786704.0}
 62%|████████████████████████████████████████████████████████████████████████████████████████▏                                                     | 357/575 [1:45:49<1:03:43, 17.54s/it] 62%|████████████████████████████████████████████████████████████████████████████████████████▍                                                     | 358/575 [1:46:07<1:03:49, 17.65s/it]                                                                                                                                                                                         {'loss': 1.8784, 'grad_norm': 0.9814261794090271, 'learning_rate': 0.00020064427874298658, 'ppl': 6.54303, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.2524871826172, 'epoch': 0.62, 'tokens/total': 46923776.0, 'tokens/trainable': 44912080.0}
 62%|████████████████████████████████████████████████████████████████████████████████████████▍                                                     | 358/575 [1:46:07<1:03:49, 17.65s/it] 62%|████████████████████████████████████████████████████████████████████████████████████████▋                                                     | 359/575 [1:46:25<1:03:25, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8671, 'grad_norm': 1.1870416402816772, 'learning_rate': 0.00019944341405329013, 'ppl': 6.46951, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.5871124267578, 'epoch': 0.62, 'tokens/total': 47054848.0, 'tokens/trainable': 45037348.0}
 62%|████████████████████████████████████████████████████████████████████████████████████████▋                                                     | 359/575 [1:46:25<1:03:25, 17.62s/it] 63%|████████████████████████████████████████████████████████████████████████████████████████▉                                                     | 360/575 [1:46:42<1:02:48, 17.53s/it]                                                                                                                                                                                         {'loss': 1.8348, 'grad_norm': 1.0624548196792603, 'learning_rate': 0.00019824497030871847, 'ppl': 6.26388, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.12384033203125, 'epoch': 0.63, 'tokens/total': 47185920.0, 'tokens/trainable': 45162884.0}
 63%|████████████████████████████████████████████████████████████████████████████████████████▉                                                     | 360/575 [1:46:42<1:02:48, 17.53s/it] 63%|█████████████████████████████████████████████████████████████████████████████████████████▏                                                    | 361/575 [1:47:00<1:02:24, 17.50s/it]                                                                                                                                                                                         {'loss': 1.7912, 'grad_norm': 1.0882470607757568, 'learning_rate': 0.00019704898590918723, 'ppl': 5.99664, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.8651580810547, 'epoch': 0.63, 'tokens/total': 47316992.0, 'tokens/trainable': 45288348.0}
 63%|█████████████████████████████████████████████████████████████████████████████████████████▏                                                    | 361/575 [1:47:00<1:02:24, 17.50s/it] 63%|█████████████████████████████████████████████████████████████████████████████████████████▍                                                    | 362/575 [1:47:17<1:02:25, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8671, 'grad_norm': 1.3052927255630493, 'learning_rate': 0.00019585549917581135, 'ppl': 6.46951, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.53623962402344, 'epoch': 0.63, 'tokens/total': 47448064.0, 'tokens/trainable': 45413340.0}
 63%|█████████████████████████████████████████████████████████████████████████████████████████▍                                                    | 362/575 [1:47:17<1:02:25, 17.59s/it] 63%|█████████████████████████████████████████████████████████████████████████████████████████▋                                                    | 363/575 [1:47:35<1:02:05, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8736, 'grad_norm': 1.0410594940185547, 'learning_rate': 0.00019466454834967656, 'ppl': 6.5117, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.71319580078125, 'epoch': 0.63, 'tokens/total': 47579136.0, 'tokens/trainable': 45538536.0}
 63%|█████████████████████████████████████████████████████████████████████████████████████████▋                                                    | 363/575 [1:47:35<1:02:05, 17.58s/it] 63%|█████████████████████████████████████████████████████████████████████████████████████████▉                                                    | 364/575 [1:47:52<1:01:39, 17.53s/it]                                                                                                                                                                                         {'loss': 1.8579, 'grad_norm': 1.0933901071548462, 'learning_rate': 0.0001934761715906146, 'ppl': 6.41026, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.25918579101562, 'epoch': 0.63, 'tokens/total': 47710208.0, 'tokens/trainable': 45663968.0}
 63%|█████████████████████████████████████████████████████████████████████████████████████████▉                                                    | 364/575 [1:47:52<1:01:39, 17.53s/it] 63%|██████████████████████████████████████████████████████████████████████████████████████████▏                                                   | 365/575 [1:48:10<1:01:38, 17.61s/it]                                                                                                                                                                                         {'loss': 1.8454, 'grad_norm': 1.302154302597046, 'learning_rate': 0.00019229040697598006, 'ppl': 6.33063, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.58164978027344, 'epoch': 0.63, 'tokens/total': 47841280.0, 'tokens/trainable': 45789320.0}
 63%|██████████████████████████████████████████████████████████████████████████████████████████▏                                                   | 365/575 [1:48:10<1:01:38, 17.61s/it] 64%|██████████████████████████████████████████████████████████████████████████████████████████▍                                                   | 366/575 [1:48:28<1:01:16, 17.59s/it]                                                                                                                                                                                         {'loss': 1.854, 'grad_norm': 1.2221747636795044, 'learning_rate': 0.00019110729249943059, 'ppl': 6.38531, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.28656005859375, 'epoch': 0.64, 'tokens/total': 47972352.0, 'tokens/trainable': 45914584.0}
 64%|██████████████████████████████████████████████████████████████████████████████████████████▍                                                   | 366/575 [1:48:28<1:01:16, 17.59s/it] 64%|██████████████████████████████████████████████████████████████████████████████████████████▋                                                   | 367/575 [1:48:45<1:00:26, 17.44s/it]                                                                                                                                                                                         {'loss': 1.8566, 'grad_norm': 1.0343362092971802, 'learning_rate': 0.0001899268660697096, 'ppl': 6.40193, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.58843994140625, 'epoch': 0.64, 'tokens/total': 48103424.0, 'tokens/trainable': 46040052.0}
 64%|██████████████████████████████████████████████████████████████████████████████████████████▋                                                   | 367/575 [1:48:45<1:00:26, 17.44s/it] 64%|██████████████████████████████████████████████████████████████████████████████████████████▉                                                   | 368/575 [1:49:02<1:00:23, 17.51s/it]                                                                                                                                                                                         {'loss': 1.8057, 'grad_norm': 1.1739450693130493, 'learning_rate': 0.00018874916550943127, 'ppl': 6.08423, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.4220428466797, 'epoch': 0.64, 'tokens/total': 48234496.0, 'tokens/trainable': 46165332.0}
 64%|██████████████████████████████████████████████████████████████████████████████████████████▉                                                   | 368/575 [1:49:02<1:00:23, 17.51s/it] 64%|███████████████████████████████████████████████████████████████████████████████████████████▏                                                  | 369/575 [1:49:20<1:00:23, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8405, 'grad_norm': 1.1915949583053589, 'learning_rate': 0.0001875742285538693, 'ppl': 6.29969, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.0576629638672, 'epoch': 0.64, 'tokens/total': 48365568.0, 'tokens/trainable': 46290436.0}
 64%|███████████████████████████████████████████████████████████████████████████████████████████▏                                                  | 369/575 [1:49:20<1:00:23, 17.59s/it] 64%|███████████████████████████████████████████████████████████████████████████████████████████▎                                                  | 370/575 [1:49:38<1:00:18, 17.65s/it]                                                                                                                                                                                         {'loss': 1.8593, 'grad_norm': 1.0470426082611084, 'learning_rate': 0.00018640209284974692, 'ppl': 6.41924, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.14114379882812, 'epoch': 0.64, 'tokens/total': 48496640.0, 'tokens/trainable': 46415868.0}
 64%|███████████████████████████████████████████████████████████████████████████████████████████▎                                                  | 370/575 [1:49:38<1:00:18, 17.65s/it] 65%|████████████████████████████████████████████████████████████████████████████████████████████▉                                                   | 371/575 [1:49:55<59:54, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8812, 'grad_norm': 1.2097532749176025, 'learning_rate': 0.00018523279595403135, 'ppl': 6.56137, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.91696166992188, 'epoch': 0.65, 'tokens/total': 48627712.0, 'tokens/trainable': 46541076.0}
 65%|████████████████████████████████████████████████████████████████████████████████████████████▉                                                   | 371/575 [1:49:56<59:54, 17.62s/it] 65%|█████████████████████████████████████████████████████████████████████████████████████████████▏                                                  | 372/575 [1:50:13<59:47, 17.67s/it]                                                                                                                                                                                         {'loss': 1.8555, 'grad_norm': 1.1732689142227173, 'learning_rate': 0.00018406637533273013, 'ppl': 6.39489, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.8805694580078, 'epoch': 0.65, 'tokens/total': 48758784.0, 'tokens/trainable': 46665972.0}
 65%|█████████████████████████████████████████████████████████████████████████████████████████████▏                                                  | 372/575 [1:50:13<59:47, 17.67s/it] 65%|█████████████████████████████████████████████████████████████████████████████████████████████▍                                                  | 373/575 [1:50:31<59:15, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8681, 'grad_norm': 1.0209745168685913, 'learning_rate': 0.00018290286835969072, 'ppl': 6.47598, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.45333862304688, 'epoch': 0.65, 'tokens/total': 48889856.0, 'tokens/trainable': 46791332.0}
 65%|█████████████████████████████████████████████████████████████████████████████████████████████▍                                                  | 373/575 [1:50:31<59:15, 17.60s/it] 65%|█████████████████████████████████████████████████████████████████████████████████████████████▋                                                  | 374/575 [1:50:48<58:47, 17.55s/it]                                                                                                                                                                                         {'loss': 1.8641, 'grad_norm': 1.1443612575531006, 'learning_rate': 0.00018174231231540282, 'ppl': 6.45013, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.7888641357422, 'epoch': 0.65, 'tokens/total': 49020928.0, 'tokens/trainable': 46916604.0}
 65%|█████████████████████████████████████████████████████████████████████████████████████████████▋                                                  | 374/575 [1:50:48<58:47, 17.55s/it] 65%|█████████████████████████████████████████████████████████████████████████████████████████████▉                                                  | 375/575 [1:51:06<58:37, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8302, 'grad_norm': 1.0091599225997925, 'learning_rate': 0.00018058474438580434, 'ppl': 6.23513, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.19448852539062, 'epoch': 0.65, 'tokens/total': 49152000.0, 'tokens/trainable': 47042092.0}
 65%|█████████████████████████████████████████████████████████████████████████████████████████████▉                                                  | 375/575 [1:51:06<58:37, 17.59s/it] 65%|██████████████████████████████████████████████████████████████████████████████████████████████▏                                                 | 376/575 [1:51:23<58:10, 17.54s/it]                                                                                                                                                                                         {'loss': 1.8047, 'grad_norm': 1.0141478776931763, 'learning_rate': 0.00017943020166108926, 'ppl': 6.07815, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.31475830078125, 'epoch': 0.65, 'tokens/total': 49283072.0, 'tokens/trainable': 47167292.0}
 65%|██████████████████████████████████████████████████████████████████████████████████████████████▏                                                 | 376/575 [1:51:23<58:10, 17.54s/it] 66%|██████████████████████████████████████████████████████████████████████████████████████████████▍                                                 | 377/575 [1:51:41<58:00, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8086, 'grad_norm': 3.8897862434387207, 'learning_rate': 0.00017827872113451953, 'ppl': 6.1019, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.69625854492188, 'epoch': 0.66, 'tokens/total': 49414144.0, 'tokens/trainable': 47292856.0}
 66%|██████████████████████████████████████████████████████████████████████████████████████████████▍                                                 | 377/575 [1:51:41<58:00, 17.58s/it] 66%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                 | 378/575 [1:51:58<57:41, 17.57s/it]                                                                                                                                                                                         {'loss': 1.8594, 'grad_norm': 1.2592144012451172, 'learning_rate': 0.00017713033970123988, 'ppl': 6.41988, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.71607971191406, 'epoch': 0.66, 'tokens/total': 49545216.0, 'tokens/trainable': 47417456.0}
 66%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                 | 378/575 [1:51:59<57:41, 17.57s/it] 66%|██████████████████████████████████████████████████████████████████████████████████████████████▉                                                 | 379/575 [1:52:16<57:08, 17.49s/it]                                                                                                                                                                                         {'loss': 1.8386, 'grad_norm': 1.1450822353363037, 'learning_rate': 0.00017598509415709535, 'ppl': 6.28773, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.41529846191406, 'epoch': 0.66, 'tokens/total': 49676288.0, 'tokens/trainable': 47542116.0}
 66%|██████████████████████████████████████████████████████████████████████████████████████████████▉                                                 | 379/575 [1:52:16<57:08, 17.49s/it] 66%|███████████████████████████████████████████████████████████████████████████████████████████████▏                                                | 380/575 [1:52:33<57:01, 17.55s/it]                                                                                                                                                                                         {'loss': 1.8281, 'grad_norm': 1.229981780052185, 'learning_rate': 0.00017484302119745242, 'ppl': 6.22205, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.96295166015625, 'epoch': 0.66, 'tokens/total': 49807360.0, 'tokens/trainable': 47667548.0}
 66%|███████████████████████████████████████████████████████████████████████████████████████████████▏                                                | 380/575 [1:52:34<57:01, 17.55s/it] 66%|███████████████████████████████████████████████████████████████████████████████████████████████▍                                                | 381/575 [1:52:51<56:44, 17.55s/it]                                                                                                                                                                                         {'loss': 1.8825, 'grad_norm': 1.1135785579681396, 'learning_rate': 0.00017370415741602347, 'ppl': 6.56991, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.24139404296875, 'epoch': 0.66, 'tokens/total': 49938432.0, 'tokens/trainable': 47792792.0}
 66%|███████████████████████████████████████████████████████████████████████████████████████████████▍                                                | 381/575 [1:52:51<56:44, 17.55s/it] 66%|███████████████████████████████████████████████████████████████████████████████████████████████▋                                                | 382/575 [1:53:09<56:33, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8575, 'grad_norm': 1.1991349458694458, 'learning_rate': 0.0001725685393036939, 'ppl': 6.4077, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.2794647216797, 'epoch': 0.66, 'tokens/total': 50069504.0, 'tokens/trainable': 47918264.0}
 66%|███████████████████████████████████████████████████████████████████████████████████████████████▋                                                | 382/575 [1:53:09<56:33, 17.59s/it] 67%|███████████████████████████████████████████████████████████████████████████████████████████████▉                                                | 383/575 [1:53:26<56:07, 17.54s/it]                                                                                                                                                                                         {'loss': 1.8414, 'grad_norm': 1.304150104522705, 'learning_rate': 0.00017143620324735294, 'ppl': 6.30536, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.18690490722656, 'epoch': 0.67, 'tokens/total': 50200576.0, 'tokens/trainable': 48043628.0}
 67%|███████████████████████████████████████████████████████████████████████████████████████████████▉                                                | 383/575 [1:53:26<56:07, 17.54s/it] 67%|████████████████████████████████████████████████████████████████████████████████████████████████▏                                               | 384/575 [1:53:43<55:30, 17.44s/it]                                                                                                                                                                                         {'loss': 1.8509, 'grad_norm': 1.0807949304580688, 'learning_rate': 0.0001703071855287281, 'ppl': 6.36555, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 229.53919982910156, 'epoch': 0.67, 'tokens/total': 50331648.0, 'tokens/trainable': 48168976.0}
 67%|████████████████████████████████████████████████████████████████████████████████████████████████▏                                               | 384/575 [1:53:43<55:30, 17.44s/it] 67%|████████████████████████████████████████████████████████████████████████████████████████████████▍                                               | 385/575 [1:54:01<55:12, 17.44s/it]                                                                                                                                                                                         {'loss': 1.8252, 'grad_norm': 1.8440037965774536, 'learning_rate': 0.0001691815223232223, 'ppl': 6.20404, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.89987182617188, 'epoch': 0.67, 'tokens/total': 50462720.0, 'tokens/trainable': 48294120.0}
 67%|████████████████████████████████████████████████████████████████████████████████████████████████▍                                               | 385/575 [1:54:01<55:12, 17.44s/it] 67%|████████████████████████████████████████████████████████████████████████████████████████████████▋                                               | 386/575 [1:54:19<55:22, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8489, 'grad_norm': 1.171518087387085, 'learning_rate': 0.00016805924969875475, 'ppl': 6.35283, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.20327758789062, 'epoch': 0.67, 'tokens/total': 50593792.0, 'tokens/trainable': 48419276.0}
 67%|████████████████████████████████████████████████████████████████████████████████████████████████▋                                               | 386/575 [1:54:19<55:22, 17.58s/it] 67%|████████████████████████████████████████████████████████████████████████████████████████████████▉                                               | 387/575 [1:54:36<55:03, 17.57s/it]                                                                                                                                                                                         {'loss': 1.8527, 'grad_norm': 0.9766181111335754, 'learning_rate': 0.0001669404036146058, 'ppl': 6.37701, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.8152313232422, 'epoch': 0.67, 'tokens/total': 50724864.0, 'tokens/trainable': 48544200.0}
 67%|████████████████████████████████████████████████████████████████████████████████████████████████▉                                               | 387/575 [1:54:36<55:03, 17.57s/it] 67%|█████████████████████████████████████████████████████████████████████████████████████████████████▏                                              | 388/575 [1:54:54<54:51, 17.60s/it]                                                                                                                                                                                         {'loss': 1.7893, 'grad_norm': 1.1805325746536255, 'learning_rate': 0.00016582501992026407, 'ppl': 5.98526, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.0858612060547, 'epoch': 0.67, 'tokens/total': 50855936.0, 'tokens/trainable': 48669880.0}
 67%|█████████████████████████████████████████████████████████████████████████████████████████████████▏                                              | 388/575 [1:54:54<54:51, 17.60s/it] 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▍                                              | 389/575 [1:55:12<54:44, 17.66s/it]                                                                                                                                                                                         {'loss': 1.8633, 'grad_norm': 0.9201295375823975, 'learning_rate': 0.00016471313435427806, 'ppl': 6.44497, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.6280975341797, 'epoch': 0.68, 'tokens/total': 50987008.0, 'tokens/trainable': 48795248.0}
 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▍                                              | 389/575 [1:55:12<54:44, 17.66s/it] 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▋                                              | 390/575 [1:55:29<54:27, 17.66s/it]                                                                                                                                                                                         {'loss': 1.8066, 'grad_norm': 1.0827715396881104, 'learning_rate': 0.0001636047825431112, 'ppl': 6.08971, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.82286071777344, 'epoch': 0.68, 'tokens/total': 51118080.0, 'tokens/trainable': 48921164.0}
 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▋                                              | 390/575 [1:55:29<54:27, 17.66s/it] 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▉                                              | 391/575 [1:55:47<54:03, 17.63s/it]                                                                                                                                                                                         {'loss': 1.7996, 'grad_norm': 0.8943078517913818, 'learning_rate': 0.00016250000000000007, 'ppl': 6.04723, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.9370880126953, 'epoch': 0.68, 'tokens/total': 51249152.0, 'tokens/trainable': 49046520.0}
 68%|█████████████████████████████████████████████████████████████████████████████████████████████████▉                                              | 391/575 [1:55:47<54:03, 17.63s/it] 68%|██████████████████████████████████████████████████████████████████████████████████████████████████▏                                             | 392/575 [1:56:05<53:54, 17.68s/it]                                                                                                                                                                                         {'loss': 1.851, 'grad_norm': 0.9666119813919067, 'learning_rate': 0.00016139882212381658, 'ppl': 6.36618, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.39364624023438, 'epoch': 0.68, 'tokens/total': 51380224.0, 'tokens/trainable': 49171944.0}
 68%|██████████████████████████████████████████████████████████████████████████████████████████████████▏                                             | 392/575 [1:56:05<53:54, 17.68s/it] 68%|██████████████████████████████████████████████████████████████████████████████████████████████████▍                                             | 393/575 [1:56:22<53:36, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8706, 'grad_norm': 0.9379603266716003, 'learning_rate': 0.00016030128419793378, 'ppl': 6.49219, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.27244567871094, 'epoch': 0.68, 'tokens/total': 51511296.0, 'tokens/trainable': 49297152.0}
 68%|██████████████████████████████████████████████████████████████████████████████████████████████████▍                                             | 393/575 [1:56:22<53:36, 17.68s/it] 69%|██████████████████████████████████████████████████████████████████████████████████████████████████▋                                             | 394/575 [1:56:40<53:25, 17.71s/it]                                                                                                                                                                                         {'loss': 1.8201, 'grad_norm': 0.9683942198753357, 'learning_rate': 0.0001592074213890955, 'ppl': 6.17248, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.54078674316406, 'epoch': 0.69, 'tokens/total': 51642368.0, 'tokens/trainable': 49422712.0}
 69%|██████████████████████████████████████████████████████████████████████████████████████████████████▋                                             | 394/575 [1:56:40<53:25, 17.71s/it] 69%|██████████████████████████████████████████████████████████████████████████████████████████████████▉                                             | 395/575 [1:56:58<52:59, 17.66s/it]                                                                                                                                                                                         {'loss': 1.8635, 'grad_norm': 1.0099495649337769, 'learning_rate': 0.00015811726874628916, 'ppl': 6.44626, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.44847106933594, 'epoch': 0.69, 'tokens/total': 51773440.0, 'tokens/trainable': 49548052.0}
 69%|██████████████████████████████████████████████████████████████████████████████████████████████████▉                                             | 395/575 [1:56:58<52:59, 17.66s/it] 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▏                                            | 396/575 [1:57:15<52:10, 17.49s/it]                                                                                                                                                                                         {'loss': 1.81, 'grad_norm': 0.9796828627586365, 'learning_rate': 0.0001570308611996229, 'ppl': 6.11045, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.4608612060547, 'epoch': 0.69, 'tokens/total': 51904512.0, 'tokens/trainable': 49673552.0}
 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▏                                            | 396/575 [1:57:15<52:10, 17.49s/it] 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▍                                            | 397/575 [1:57:32<51:49, 17.47s/it]                                                                                                                                                                                         {'loss': 1.8609, 'grad_norm': 1.1707274913787842, 'learning_rate': 0.00015594823355920666, 'ppl': 6.42952, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.74053955078125, 'epoch': 0.69, 'tokens/total': 52035584.0, 'tokens/trainable': 49799196.0}
 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▍                                            | 397/575 [1:57:32<51:49, 17.47s/it] 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▋                                            | 398/575 [1:57:50<51:49, 17.57s/it]                                                                                                                                                                                         {'loss': 1.7864, 'grad_norm': 0.9935237765312195, 'learning_rate': 0.00015486942051403635, 'ppl': 5.96793, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.20030212402344, 'epoch': 0.69, 'tokens/total': 52166656.0, 'tokens/trainable': 49924500.0}
 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▋                                            | 398/575 [1:57:50<51:49, 17.57s/it] 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▉                                            | 399/575 [1:58:08<51:37, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8362, 'grad_norm': 1.0978401899337769, 'learning_rate': 0.00015379445663088264, 'ppl': 6.27266, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.42015075683594, 'epoch': 0.69, 'tokens/total': 52297728.0, 'tokens/trainable': 50049224.0}
 69%|███████████████████████████████████████████████████████████████████████████████████████████████████▉                                            | 399/575 [1:58:08<51:37, 17.60s/it] 70%|████████████████████████████████████████████████████████████████████████████████████████████████████▏                                           | 400/575 [1:58:25<51:29, 17.66s/it]                                                                                                                                                                                         {'loss': 1.7977, 'grad_norm': 1.48044753074646, 'learning_rate': 0.00015272337635318352, 'ppl': 6.03575, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.18553161621094, 'epoch': 0.7, 'tokens/total': 52428800.0, 'tokens/trainable': 50174752.0}
 70%|████████████████████████████████████████████████████████████████████████████████████████████████████▏                                           | 400/575 [1:58:26<51:29, 17.66s/it][2026-03-12 23:40:31,672] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:26<00:00,  1.98it/s][A                                                                                                                                                                                         
{'eval_loss': 1.8092681169509888, 'eval_runtime': 27.2781, 'eval_samples_per_second': 7.918, 'eval_steps_per_second': 1.98, 'eval_ppl': 6.10598, 'memory/max_active (GiB)': 26.29, 'memory/max_allocated (GiB)': 26.29, 'memory/device_reserved (GiB)': 27.83, 'epoch': 0.7, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 52428800.0, 'tokens/trainable': 50174752.0}
 70%|████████████████████████████████████████████████████████████████████████████████████████████████████▏                                           | 400/575 [1:58:53<51:29, 17.66s/it]
[2026-03-12 23:40:58,954] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-400
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████                                           | 401/575 [1:59:25<1:27:56, 30.32s/it]                                                                                                                                                                                         {'loss': 1.8091, 'grad_norm': 1.0733797550201416, 'learning_rate': 0.00015165621399994034, 'ppl': 6.10495, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.07357788085938, 'epoch': 0.7, 'tokens/total': 52559872.0, 'tokens/trainable': 50300064.0}
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████                                           | 401/575 [1:59:25<1:27:56, 30.32s/it] 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▎                                          | 402/575 [1:59:43<1:16:28, 26.52s/it]                                                                                                                                                                                         {'loss': 1.8148, 'grad_norm': 0.892130970954895, 'learning_rate': 0.00015059300376461826, 'ppl': 6.13985, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.79171752929688, 'epoch': 0.7, 'tokens/total': 52690944.0, 'tokens/trainable': 50425448.0}
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▎                                          | 402/575 [1:59:43<1:16:28, 26.52s/it] 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▌                                          | 403/575 [2:00:01<1:08:37, 23.94s/it]                                                                                                                                                                                         {'loss': 1.8574, 'grad_norm': 1.314839243888855, 'learning_rate': 0.00014953377971405085, 'ppl': 6.40706, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.67469787597656, 'epoch': 0.7, 'tokens/total': 52822016.0, 'tokens/trainable': 50550928.0}
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▌                                          | 403/575 [2:00:01<1:08:37, 23.94s/it] 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▊                                          | 404/575 [2:00:19<1:02:52, 22.06s/it]                                                                                                                                                                                         {'loss': 1.816, 'grad_norm': 0.9887326955795288, 'learning_rate': 0.00014847857578734842, 'ppl': 6.14722, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.4634246826172, 'epoch': 0.7, 'tokens/total': 52953088.0, 'tokens/trainable': 50676396.0}
 70%|███████████████████████████████████████████████████████████████████████████████████████████████████▊                                          | 404/575 [2:00:19<1:02:52, 22.06s/it] 70%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                                          | 405/575 [2:00:36<58:34, 20.67s/it]                                                                                                                                                                                         {'loss': 1.8329, 'grad_norm': 1.0195941925048828, 'learning_rate': 0.00014742742579481038, 'ppl': 6.25199, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.02024841308594, 'epoch': 0.7, 'tokens/total': 53084160.0, 'tokens/trainable': 50801552.0}
 70%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                                          | 405/575 [2:00:36<58:34, 20.67s/it] 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋                                          | 406/575 [2:00:54<55:42, 19.78s/it]                                                                                                                                                                                         {'loss': 1.8125, 'grad_norm': 1.1179484128952026, 'learning_rate': 0.0001463803634168423, 'ppl': 6.12574, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.8421630859375, 'epoch': 0.71, 'tokens/total': 53215232.0, 'tokens/trainable': 50926712.0}
 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋                                          | 406/575 [2:00:54<55:42, 19.78s/it] 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉                                          | 407/575 [2:01:11<53:24, 19.08s/it]                                                                                                                                                                                         {'loss': 1.8404, 'grad_norm': 1.025015115737915, 'learning_rate': 0.0001453374222028764, 'ppl': 6.29906, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.1387176513672, 'epoch': 0.71, 'tokens/total': 53346304.0, 'tokens/trainable': 51051576.0}
 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉                                          | 407/575 [2:01:11<53:24, 19.08s/it] 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▏                                         | 408/575 [2:01:29<52:01, 18.69s/it]                                                                                                                                                                                         {'loss': 1.8587, 'grad_norm': 0.9181833863258362, 'learning_rate': 0.00014429863557029665, 'ppl': 6.41539, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.03355407714844, 'epoch': 0.71, 'tokens/total': 53477376.0, 'tokens/trainable': 51176864.0}
 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▏                                         | 408/575 [2:01:29<52:01, 18.69s/it] 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍                                         | 409/575 [2:01:47<50:58, 18.43s/it]                                                                                                                                                                                         {'loss': 1.8179, 'grad_norm': 1.0202441215515137, 'learning_rate': 0.00014326403680336807, 'ppl': 6.15891, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.10089111328125, 'epoch': 0.71, 'tokens/total': 53608448.0, 'tokens/trainable': 51302140.0}
 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▍                                         | 409/575 [2:01:47<50:58, 18.43s/it] 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋                                         | 410/575 [2:02:05<50:09, 18.24s/it]                                                                                                                                                                                         {'loss': 1.8279, 'grad_norm': 1.0329927206039429, 'learning_rate': 0.00014223365905217041, 'ppl': 6.22081, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.19931030273438, 'epoch': 0.71, 'tokens/total': 53739520.0, 'tokens/trainable': 51427200.0}
 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋                                         | 410/575 [2:02:05<50:09, 18.24s/it] 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉                                         | 411/575 [2:02:22<49:29, 18.11s/it]                                                                                                                                                                                         {'loss': 1.8413, 'grad_norm': 1.167626976966858, 'learning_rate': 0.00014120753533153552, 'ppl': 6.30473, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.61636352539062, 'epoch': 0.71, 'tokens/total': 53870592.0, 'tokens/trainable': 51552692.0}
 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉                                         | 411/575 [2:02:22<49:29, 18.11s/it] 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▏                                        | 412/575 [2:02:40<48:49, 17.98s/it]                                                                                                                                                                                         {'loss': 1.7762, 'grad_norm': 0.9181792736053467, 'learning_rate': 0.00014018569851999014, 'ppl': 5.90737, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.49522399902344, 'epoch': 0.72, 'tokens/total': 54001664.0, 'tokens/trainable': 51678312.0}
 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▏                                        | 412/575 [2:02:40<48:49, 17.98s/it] 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍                                        | 413/575 [2:02:58<48:17, 17.89s/it]                                                                                                                                                                                         {'loss': 1.8198, 'grad_norm': 0.9961973428726196, 'learning_rate': 0.0001391681813587019, 'ppl': 6.17062, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.16465759277344, 'epoch': 0.72, 'tokens/total': 54132736.0, 'tokens/trainable': 51803764.0}
 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍                                        | 413/575 [2:02:58<48:17, 17.89s/it] 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋                                        | 414/575 [2:03:15<47:49, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8246, 'grad_norm': 1.1313871145248413, 'learning_rate': 0.00013815501645043034, 'ppl': 6.20031, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.39366149902344, 'epoch': 0.72, 'tokens/total': 54263808.0, 'tokens/trainable': 51929020.0}
 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋                                        | 414/575 [2:03:15<47:49, 17.82s/it] 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉                                        | 415/575 [2:03:33<47:35, 17.85s/it]                                                                                                                                                                                         {'loss': 1.8676, 'grad_norm': 1.0363372564315796, 'learning_rate': 0.00013714623625848255, 'ppl': 6.47274, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.16018676757812, 'epoch': 0.72, 'tokens/total': 54394880.0, 'tokens/trainable': 52054192.0}
 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉                                        | 415/575 [2:03:33<47:35, 17.85s/it] 72%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                       | 416/575 [2:03:51<47:14, 17.83s/it]                                                                                                                                                                                         {'loss': 1.8179, 'grad_norm': 1.0327932834625244, 'learning_rate': 0.00013614187310567266, 'ppl': 6.15891, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.76498413085938, 'epoch': 0.72, 'tokens/total': 54525952.0, 'tokens/trainable': 52179292.0}
 72%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                       | 416/575 [2:03:51<47:14, 17.83s/it] 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                       | 417/575 [2:04:09<46:54, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8443, 'grad_norm': 1.1372294425964355, 'learning_rate': 0.0001351419591732863, 'ppl': 6.32367, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.43568420410156, 'epoch': 0.73, 'tokens/total': 54657024.0, 'tokens/trainable': 52304628.0}
 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                       | 417/575 [2:04:09<46:54, 17.82s/it] 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                       | 418/575 [2:04:27<46:30, 17.77s/it]                                                                                                                                                                                         {'loss': 1.8594, 'grad_norm': 0.9876425862312317, 'learning_rate': 0.00013414652650004967, 'ppl': 6.41988, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.6790008544922, 'epoch': 0.73, 'tokens/total': 54788096.0, 'tokens/trainable': 52429928.0}
 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                       | 418/575 [2:04:27<46:30, 17.77s/it] 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                       | 419/575 [2:04:44<46:13, 17.78s/it]                                                                                                                                                                                         {'loss': 1.8223, 'grad_norm': 1.0283094644546509, 'learning_rate': 0.00013315560698110257, 'ppl': 6.18607, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.53912353515625, 'epoch': 0.73, 'tokens/total': 54919168.0, 'tokens/trainable': 52555528.0}
 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                       | 419/575 [2:04:44<46:13, 17.78s/it] 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                      | 420/575 [2:05:02<45:50, 17.74s/it]                                                                                                                                                                                         {'loss': 1.8235, 'grad_norm': 0.8493004441261292, 'learning_rate': 0.00013216923236697678, 'ppl': 6.1935, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.2332305908203, 'epoch': 0.73, 'tokens/total': 55050240.0, 'tokens/trainable': 52680784.0}
 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                      | 420/575 [2:05:02<45:50, 17.74s/it] 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                      | 421/575 [2:05:20<45:34, 17.76s/it]                                                                                                                                                                                         {'loss': 1.8236, 'grad_norm': 0.9780426025390625, 'learning_rate': 0.00013118743426257846, 'ppl': 6.19412, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.66171264648438, 'epoch': 0.73, 'tokens/total': 55181312.0, 'tokens/trainable': 52806464.0}
 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                      | 421/575 [2:05:20<45:34, 17.76s/it] 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                      | 422/575 [2:05:38<45:18, 17.77s/it]                                                                                                                                                                                         {'loss': 1.8429, 'grad_norm': 1.7926000356674194, 'learning_rate': 0.00013021024412617567, 'ppl': 6.31482, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.24749755859375, 'epoch': 0.73, 'tokens/total': 55312384.0, 'tokens/trainable': 52931888.0}
 73%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                      | 422/575 [2:05:38<45:18, 17.77s/it] 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                      | 423/575 [2:05:55<45:01, 17.77s/it]                                                                                                                                                                                         {'loss': 1.8307, 'grad_norm': 1.4927362203598022, 'learning_rate': 0.00012923769326839027, 'ppl': 6.23825, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.3055877685547, 'epoch': 0.74, 'tokens/total': 55443456.0, 'tokens/trainable': 53057168.0}
 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                      | 423/575 [2:05:55<45:01, 17.77s/it] 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                     | 424/575 [2:06:13<44:33, 17.70s/it]                                                                                                                                                                                         {'loss': 1.7938, 'grad_norm': 1.0746990442276, 'learning_rate': 0.00012826981285119494, 'ppl': 6.01226, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.5441436767578, 'epoch': 0.74, 'tokens/total': 55574528.0, 'tokens/trainable': 53182632.0}
 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                     | 424/575 [2:06:13<44:33, 17.70s/it] 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                     | 425/575 [2:06:30<44:08, 17.66s/it]                                                                                                                                                                                         {'loss': 1.8473, 'grad_norm': 1.1084175109863281, 'learning_rate': 0.00012730663388691438, 'ppl': 6.34267, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.46214294433594, 'epoch': 0.74, 'tokens/total': 55705600.0, 'tokens/trainable': 53307888.0}
 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                     | 425/575 [2:06:30<44:08, 17.66s/it] 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                     | 426/575 [2:06:48<43:51, 17.66s/it]                                                                                                                                                                                         {'loss': 1.7775, 'grad_norm': 0.9794643521308899, 'learning_rate': 0.00012634818723723175, 'ppl': 5.91505, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.421630859375, 'epoch': 0.74, 'tokens/total': 55836672.0, 'tokens/trainable': 53433196.0}
 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                     | 426/575 [2:06:48<43:51, 17.66s/it] 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                     | 427/575 [2:07:05<43:18, 17.56s/it]                                                                                                                                                                                         {'loss': 1.8205, 'grad_norm': 0.9560115933418274, 'learning_rate': 0.0001253945036121998, 'ppl': 6.17495, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.66827392578125, 'epoch': 0.74, 'tokens/total': 55967744.0, 'tokens/trainable': 53558256.0}
 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                     | 427/575 [2:07:05<43:18, 17.56s/it] 74%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                    | 428/575 [2:07:23<42:49, 17.48s/it]                                                                                                                                                                                         {'loss': 1.7983, 'grad_norm': 1.1426624059677124, 'learning_rate': 0.00012444561356925697, 'ppl': 6.03937, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.62620544433594, 'epoch': 0.74, 'tokens/total': 56098816.0, 'tokens/trainable': 53683456.0}
 74%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                    | 428/575 [2:07:23<42:49, 17.48s/it] 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                    | 429/575 [2:07:40<42:35, 17.50s/it]                                                                                                                                                                                         {'loss': 1.835, 'grad_norm': 1.0507676601409912, 'learning_rate': 0.00012350154751224817, 'ppl': 6.26513, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.83164978027344, 'epoch': 0.75, 'tokens/total': 56229888.0, 'tokens/trainable': 53808880.0}
 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                    | 429/575 [2:07:40<42:35, 17.50s/it] 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                    | 430/575 [2:07:58<42:24, 17.55s/it]                                                                                                                                                                                         {'loss': 1.8319, 'grad_norm': 0.9379578232765198, 'learning_rate': 0.0001225623356904508, 'ppl': 6.24574, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.5005645751953, 'epoch': 0.75, 'tokens/total': 56360960.0, 'tokens/trainable': 53934256.0}
 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                    | 430/575 [2:07:58<42:24, 17.55s/it] 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                    | 431/575 [2:08:16<42:17, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8831, 'grad_norm': 0.9987399578094482, 'learning_rate': 0.00012162800819760515, 'ppl': 6.57385, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.50929260253906, 'epoch': 0.75, 'tokens/total': 56492032.0, 'tokens/trainable': 54059736.0}
 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                    | 431/575 [2:08:16<42:17, 17.62s/it] 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                   | 432/575 [2:08:33<42:01, 17.64s/it]                                                                                                                                                                                         {'loss': 1.8317, 'grad_norm': 1.2098171710968018, 'learning_rate': 0.00012069859497095044, 'ppl': 6.24449, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 212.38449096679688, 'epoch': 0.75, 'tokens/total': 56623104.0, 'tokens/trainable': 54184732.0}
 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                   | 432/575 [2:08:33<42:01, 17.64s/it] 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                   | 433/575 [2:08:51<41:50, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8157, 'grad_norm': 0.9078555703163147, 'learning_rate': 0.00011977412579026556, 'ppl': 6.14538, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.28530883789062, 'epoch': 0.75, 'tokens/total': 56754176.0, 'tokens/trainable': 54310068.0}
 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                   | 433/575 [2:08:51<41:50, 17.68s/it] 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                   | 434/575 [2:09:08<41:17, 17.57s/it]                                                                                                                                                                                         {'loss': 1.7882, 'grad_norm': 1.0604311227798462, 'learning_rate': 0.00011885463027691474, 'ppl': 5.97868, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.86294555664062, 'epoch': 0.75, 'tokens/total': 56885248.0, 'tokens/trainable': 54435188.0}
 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                   | 434/575 [2:09:09<41:17, 17.57s/it] 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                   | 435/575 [2:09:26<41:03, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8036, 'grad_norm': 0.9459312558174133, 'learning_rate': 0.00011794013789289853, 'ppl': 6.07147, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.92633056640625, 'epoch': 0.76, 'tokens/total': 57016320.0, 'tokens/trainable': 54560188.0}
 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                   | 435/575 [2:09:26<41:03, 17.60s/it] 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                  | 436/575 [2:09:44<40:59, 17.69s/it]                                                                                                                                                                                         {'loss': 1.8234, 'grad_norm': 1.4725602865219116, 'learning_rate': 0.00011703067793990995, 'ppl': 6.19288, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 214.24342346191406, 'epoch': 0.76, 'tokens/total': 57147392.0, 'tokens/trainable': 54684984.0}
 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                  | 436/575 [2:09:44<40:59, 17.69s/it] 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                  | 437/575 [2:10:02<40:45, 17.72s/it]                                                                                                                                                                                         {'loss': 1.8093, 'grad_norm': 1.514386534690857, 'learning_rate': 0.00011612627955839532, 'ppl': 6.10617, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.30557250976562, 'epoch': 0.76, 'tokens/total': 57278464.0, 'tokens/trainable': 54810040.0}
 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                  | 437/575 [2:10:02<40:45, 17.72s/it] 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                  | 438/575 [2:10:20<40:35, 17.78s/it]                                                                                                                                                                                         {'loss': 1.8065, 'grad_norm': 1.5308284759521484, 'learning_rate': 0.00011522697172662072, 'ppl': 6.0891, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.5132293701172, 'epoch': 0.76, 'tokens/total': 57409536.0, 'tokens/trainable': 54935136.0}
 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                  | 438/575 [2:10:20<40:35, 17.78s/it] 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                  | 439/575 [2:10:37<39:58, 17.64s/it]                                                                                                                                                                                         {'loss': 1.7813, 'grad_norm': 1.4447256326675415, 'learning_rate': 0.00011433278325974348, 'ppl': 5.93757, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.7755584716797, 'epoch': 0.76, 'tokens/total': 57540608.0, 'tokens/trainable': 55060856.0}
 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                  | 439/575 [2:10:37<39:58, 17.64s/it] 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                 | 440/575 [2:10:55<39:37, 17.61s/it]                                                                                                                                                                                         {'loss': 1.8529, 'grad_norm': 2.1837282180786133, 'learning_rate': 0.00011344374280888887, 'ppl': 6.37829, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.84906005859375, 'epoch': 0.77, 'tokens/total': 57671680.0, 'tokens/trainable': 55186580.0}
 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                 | 440/575 [2:10:55<39:37, 17.61s/it] 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                 | 441/575 [2:11:12<39:17, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8194, 'grad_norm': 1.1228703260421753, 'learning_rate': 0.00011255987886023202, 'ppl': 6.16816, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.13717651367188, 'epoch': 0.77, 'tokens/total': 57802752.0, 'tokens/trainable': 55311672.0}
 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                 | 441/575 [2:11:12<39:17, 17.59s/it] 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                 | 442/575 [2:11:30<39:12, 17.68s/it]                                                                                                                                                                                         {'loss': 1.822, 'grad_norm': 0.9758828282356262, 'learning_rate': 0.00011168121973408544, 'ppl': 6.18421, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.10464477539062, 'epoch': 0.77, 'tokens/total': 57933824.0, 'tokens/trainable': 55436560.0}
 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                 | 442/575 [2:11:30<39:12, 17.68s/it] 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                 | 443/575 [2:11:48<38:48, 17.64s/it]                                                                                                                                                                                         {'loss': 1.7957, 'grad_norm': 0.8459606766700745, 'learning_rate': 0.00011080779358399128, 'ppl': 6.02369, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.9483184814453, 'epoch': 0.77, 'tokens/total': 58064896.0, 'tokens/trainable': 55561892.0}
 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                 | 443/575 [2:11:48<38:48, 17.64s/it] 77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                | 444/575 [2:12:05<38:36, 17.69s/it]                                                                                                                                                                                         {'loss': 1.8268, 'grad_norm': 0.9506929516792297, 'learning_rate': 0.00010993962839581933, 'ppl': 6.21397, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.2422637939453, 'epoch': 0.77, 'tokens/total': 58195968.0, 'tokens/trainable': 55687136.0}
 77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                | 444/575 [2:12:05<38:36, 17.69s/it] 77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                | 445/575 [2:12:23<38:09, 17.61s/it]                                                                                                                                                                                         {'loss': 1.7691, 'grad_norm': 1.4221011400222778, 'learning_rate': 0.00010907675198687043, 'ppl': 5.86557, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.58363342285156, 'epoch': 0.77, 'tokens/total': 58327040.0, 'tokens/trainable': 55812760.0}
 77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                | 445/575 [2:12:23<38:09, 17.61s/it] 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                | 446/575 [2:12:41<37:58, 17.66s/it]                                                                                                                                                                                         {'loss': 1.818, 'grad_norm': 1.0990978479385376, 'learning_rate': 0.00010821919200498503, 'ppl': 6.15953, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.4279327392578, 'epoch': 0.78, 'tokens/total': 58458112.0, 'tokens/trainable': 55938100.0}
 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                | 446/575 [2:12:41<37:58, 17.66s/it] 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                | 447/575 [2:12:58<37:45, 17.70s/it]                                                                                                                                                                                         {'loss': 1.7826, 'grad_norm': 0.825745165348053, 'learning_rate': 0.00010736697592765736, 'ppl': 5.94529, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.00718688964844, 'epoch': 0.78, 'tokens/total': 58589184.0, 'tokens/trainable': 56063164.0}
 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                | 447/575 [2:12:58<37:45, 17.70s/it] 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                               | 448/575 [2:13:16<37:35, 17.76s/it]                                                                                                                                                                                         {'loss': 1.8131, 'grad_norm': 1.1098095178604126, 'learning_rate': 0.00010652013106115519, 'ppl': 6.12942, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 215.60923767089844, 'epoch': 0.78, 'tokens/total': 58720256.0, 'tokens/trainable': 56188360.0}
 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                               | 448/575 [2:13:16<37:35, 17.76s/it] 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                               | 449/575 [2:13:34<37:23, 17.80s/it]                                                                                                                                                                                         {'loss': 1.8443, 'grad_norm': 0.9679601192474365, 'learning_rate': 0.00010567868453964449, 'ppl': 6.32367, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.1996307373047, 'epoch': 0.78, 'tokens/total': 58851328.0, 'tokens/trainable': 56314048.0}
 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                               | 449/575 [2:13:34<37:23, 17.80s/it] 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                               | 450/575 [2:13:52<37:09, 17.84s/it]                                                                                                                                                                                         {'loss': 1.7848, 'grad_norm': 1.3441431522369385, 'learning_rate': 0.0001048426633243204, 'ppl': 5.95839, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.8780059814453, 'epoch': 0.78, 'tokens/total': 58982400.0, 'tokens/trainable': 56439284.0}
 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                               | 450/575 [2:13:52<37:09, 17.84s/it] 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                               | 451/575 [2:14:10<36:36, 17.71s/it]                                                                                                                                                                                         {'loss': 1.8119, 'grad_norm': 1.0236353874206543, 'learning_rate': 0.00010401209420254312, 'ppl': 6.12207, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.45469665527344, 'epoch': 0.78, 'tokens/total': 59113472.0, 'tokens/trainable': 56564840.0}
 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                               | 451/575 [2:14:10<36:36, 17.71s/it] 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                              | 452/575 [2:14:27<36:17, 17.70s/it]                                                                                                                                                                                         {'loss': 1.7752, 'grad_norm': 1.1281907558441162, 'learning_rate': 0.00010318700378697971, 'ppl': 5.90146, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.31263732910156, 'epoch': 0.79, 'tokens/total': 59244544.0, 'tokens/trainable': 56690184.0}
 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                              | 452/575 [2:14:27<36:17, 17.70s/it] 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                              | 453/575 [2:14:45<35:58, 17.69s/it]                                                                                                                                                                                         {'loss': 1.8401, 'grad_norm': 0.964059591293335, 'learning_rate': 0.0001023674185147513, 'ppl': 6.29717, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.58560180664062, 'epoch': 0.79, 'tokens/total': 59375616.0, 'tokens/trainable': 56815184.0}
 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                              | 453/575 [2:14:45<35:58, 17.69s/it] 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                              | 454/575 [2:15:02<35:35, 17.65s/it]                                                                                                                                                                                         {'loss': 1.7681, 'grad_norm': 0.9128284454345703, 'learning_rate': 0.00010155336464658624, 'ppl': 5.85971, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.01133728027344, 'epoch': 0.79, 'tokens/total': 59506688.0, 'tokens/trainable': 56940368.0}
 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                              | 454/575 [2:15:02<35:35, 17.65s/it] 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                              | 455/575 [2:15:20<35:14, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8242, 'grad_norm': 1.244133710861206, 'learning_rate': 0.0001007448682659783, 'ppl': 6.19783, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.24659729003906, 'epoch': 0.79, 'tokens/total': 59637760.0, 'tokens/trainable': 57065672.0}
 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                              | 455/575 [2:15:20<35:14, 17.62s/it] 79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                             | 456/575 [2:15:38<34:58, 17.63s/it]                                                                                                                                                                                         {'loss': 1.8178, 'grad_norm': 1.1856162548065186, 'learning_rate': 9.994195527835116e-05, 'ppl': 6.1583, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.66091918945312, 'epoch': 0.79, 'tokens/total': 59768832.0, 'tokens/trainable': 57190580.0}
 79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                             | 456/575 [2:15:38<34:58, 17.63s/it] 79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                             | 457/575 [2:15:55<34:46, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8392, 'grad_norm': 1.2200720310211182, 'learning_rate': 9.91446514102283e-05, 'ppl': 6.2915, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.81658935546875, 'epoch': 0.79, 'tokens/total': 59899904.0, 'tokens/trainable': 57315640.0}
 79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                             | 457/575 [2:15:55<34:46, 17.68s/it] 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                             | 458/575 [2:16:13<34:23, 17.64s/it]                                                                                                                                                                                         {'loss': 1.8423, 'grad_norm': 0.9385465979576111, 'learning_rate': 9.835298220840872e-05, 'ppl': 6.31104, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.13882446289062, 'epoch': 0.8, 'tokens/total': 60030976.0, 'tokens/trainable': 57440952.0}
 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                             | 458/575 [2:16:13<34:23, 17.64s/it] 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                             | 459/575 [2:16:31<34:03, 17.61s/it]                                                                                                                                                                                         {'loss': 1.8497, 'grad_norm': 1.0106507539749146, 'learning_rate': 9.75669730391482e-05, 'ppl': 6.35791, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.1930694580078, 'epoch': 0.8, 'tokens/total': 60162048.0, 'tokens/trainable': 57566172.0}
 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                             | 459/575 [2:16:31<34:03, 17.61s/it] 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                            | 460/575 [2:16:48<33:43, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8007, 'grad_norm': 0.9459385275840759, 'learning_rate': 9.6786649087347e-05, 'ppl': 6.05388, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.66932678222656, 'epoch': 0.8, 'tokens/total': 60293120.0, 'tokens/trainable': 57691716.0}
 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                            | 460/575 [2:16:48<33:43, 17.59s/it] 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                            | 461/575 [2:17:06<33:32, 17.65s/it]                                                                                                                                                                                         {'loss': 1.8304, 'grad_norm': 1.7394914627075195, 'learning_rate': 9.601203535574232e-05, 'ppl': 6.23638, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.63385009765625, 'epoch': 0.8, 'tokens/total': 60424192.0, 'tokens/trainable': 57817120.0}
 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                            | 461/575 [2:17:06<33:32, 17.65s/it] 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                            | 462/575 [2:17:24<33:15, 17.66s/it]                                                                                                                                                                                         {'loss': 1.7952, 'grad_norm': 0.8364937901496887, 'learning_rate': 9.524315666410744e-05, 'ppl': 6.02068, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.9661865234375, 'epoch': 0.8, 'tokens/total': 60555264.0, 'tokens/trainable': 57942044.0}
 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                            | 462/575 [2:17:24<33:15, 17.66s/it] 81%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                            | 463/575 [2:17:41<33:05, 17.73s/it]                                                                                                                                                                                         {'loss': 1.805, 'grad_norm': 0.8604599833488464, 'learning_rate': 9.448003764845651e-05, 'ppl': 6.07997, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.2506866455078, 'epoch': 0.8, 'tokens/total': 60686336.0, 'tokens/trainable': 58066908.0}
 81%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                            | 463/575 [2:17:41<33:05, 17.73s/it] 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                           | 464/575 [2:17:59<32:49, 17.75s/it]                                                                                                                                                                                         {'loss': 1.8429, 'grad_norm': 0.860278308391571, 'learning_rate': 9.372270276025516e-05, 'ppl': 6.31482, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.48175048828125, 'epoch': 0.81, 'tokens/total': 60817408.0, 'tokens/trainable': 58192276.0}
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                           | 464/575 [2:17:59<32:49, 17.75s/it] 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                           | 465/575 [2:18:17<32:25, 17.69s/it]                                                                                                                                                                                         {'loss': 1.7389, 'grad_norm': 0.9656296968460083, 'learning_rate': 9.297117626563687e-05, 'ppl': 5.69108, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.667236328125, 'epoch': 0.81, 'tokens/total': 60948480.0, 'tokens/trainable': 58317828.0}
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                           | 465/575 [2:18:17<32:25, 17.69s/it] 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                           | 466/575 [2:18:34<32:07, 17.68s/it]                                                                                                                                                                                         {'loss': 1.7977, 'grad_norm': 1.1678879261016846, 'learning_rate': 9.222548224462571e-05, 'ppl': 6.03575, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.14541625976562, 'epoch': 0.81, 'tokens/total': 61079552.0, 'tokens/trainable': 58442992.0}
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                           | 466/575 [2:18:34<32:07, 17.68s/it] 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                           | 467/575 [2:18:52<31:45, 17.64s/it]                                                                                                                                                                                         {'loss': 1.7705, 'grad_norm': 0.85365229845047, 'learning_rate': 9.148564459036457e-05, 'ppl': 5.87379, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.7700653076172, 'epoch': 0.81, 'tokens/total': 61210624.0, 'tokens/trainable': 58568604.0}
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                           | 467/575 [2:18:52<31:45, 17.64s/it] 81%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                          | 468/575 [2:19:10<31:36, 17.72s/it]                                                                                                                                                                                         {'loss': 1.7689, 'grad_norm': 0.9609678387641907, 'learning_rate': 9.075168700834962e-05, 'ppl': 5.8644, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.03392028808594, 'epoch': 0.81, 'tokens/total': 61341696.0, 'tokens/trainable': 58694036.0}
 81%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                          | 468/575 [2:19:10<31:36, 17.72s/it] 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                          | 469/575 [2:19:28<31:16, 17.70s/it]                                                                                                                                                                                         {'loss': 1.7956, 'grad_norm': 0.9540114402770996, 'learning_rate': 9.002363301567088e-05, 'ppl': 6.02309, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.78231811523438, 'epoch': 0.82, 'tokens/total': 61472768.0, 'tokens/trainable': 58819428.0}
 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                          | 469/575 [2:19:28<31:16, 17.70s/it] 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                          | 470/575 [2:19:45<30:57, 17.69s/it]                                                                                                                                                                                         {'loss': 1.7747, 'grad_norm': 0.7795339822769165, 'learning_rate': 8.930150594025848e-05, 'ppl': 5.89851, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.88525390625, 'epoch': 0.82, 'tokens/total': 61603840.0, 'tokens/trainable': 58945036.0}
 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                          | 470/575 [2:19:45<30:57, 17.69s/it] 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                          | 471/575 [2:20:03<30:39, 17.68s/it]                                                                                                                                                                                         {'loss': 1.7842, 'grad_norm': 0.9373102784156799, 'learning_rate': 8.858532892013555e-05, 'ppl': 5.95481, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.2830047607422, 'epoch': 0.82, 'tokens/total': 61734912.0, 'tokens/trainable': 59070336.0}
 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                          | 471/575 [2:20:03<30:39, 17.68s/it] 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                         | 472/575 [2:20:21<30:24, 17.71s/it]                                                                                                                                                                                         {'loss': 1.7955, 'grad_norm': 0.8834575414657593, 'learning_rate': 8.787512490267639e-05, 'ppl': 6.02249, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.67356872558594, 'epoch': 0.82, 'tokens/total': 61865984.0, 'tokens/trainable': 59195420.0}
 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                         | 472/575 [2:20:21<30:24, 17.71s/it] 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                         | 473/575 [2:20:38<30:08, 17.73s/it]                                                                                                                                                                                         {'loss': 1.7814, 'grad_norm': 0.8894031643867493, 'learning_rate': 8.717091664387152e-05, 'ppl': 5.93816, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.3159942626953, 'epoch': 0.82, 'tokens/total': 61997056.0, 'tokens/trainable': 59320880.0}
 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                         | 473/575 [2:20:39<30:08, 17.73s/it] 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                         | 474/575 [2:20:56<29:49, 17.71s/it]                                                                                                                                                                                         {'loss': 1.8007, 'grad_norm': 1.1905293464660645, 'learning_rate': 8.647272670759851e-05, 'ppl': 6.05388, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.24278259277344, 'epoch': 0.82, 'tokens/total': 62128128.0, 'tokens/trainable': 59445868.0}
 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                         | 474/575 [2:20:56<29:49, 17.71s/it] 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                         | 475/575 [2:21:14<29:33, 17.73s/it]                                                                                                                                                                                         {'loss': 1.7924, 'grad_norm': 1.1270114183425903, 'learning_rate': 8.578057746489877e-05, 'ppl': 6.00384, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.14833068847656, 'epoch': 0.83, 'tokens/total': 62259200.0, 'tokens/trainable': 59570664.0}
 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                         | 475/575 [2:21:14<29:33, 17.73s/it] 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                        | 476/575 [2:21:32<29:13, 17.71s/it]                                                                                                                                                                                         {'loss': 1.7745, 'grad_norm': 1.1567282676696777, 'learning_rate': 8.509449109326117e-05, 'ppl': 5.89733, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.70468139648438, 'epoch': 0.83, 'tokens/total': 62390272.0, 'tokens/trainable': 59695900.0}
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                        | 476/575 [2:21:32<29:13, 17.71s/it] 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                        | 477/575 [2:21:49<28:40, 17.56s/it]                                                                                                                                                                                         {'loss': 1.7827, 'grad_norm': 0.9879487752914429, 'learning_rate': 8.441448957591108e-05, 'ppl': 5.94589, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.9061737060547, 'epoch': 0.83, 'tokens/total': 62521344.0, 'tokens/trainable': 59821120.0}
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                        | 477/575 [2:21:49<28:40, 17.56s/it] 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                        | 478/575 [2:22:06<28:22, 17.55s/it]                                                                                                                                                                                         {'loss': 1.7933, 'grad_norm': 0.9573699235916138, 'learning_rate': 8.374059470110604e-05, 'ppl': 6.00925, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.61155700683594, 'epoch': 0.83, 'tokens/total': 62652416.0, 'tokens/trainable': 59946064.0}
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                        | 478/575 [2:22:06<28:22, 17.55s/it] 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                        | 479/575 [2:22:24<28:04, 17.55s/it]                                                                                                                                                                                         {'loss': 1.7989, 'grad_norm': 0.9812211990356445, 'learning_rate': 8.307282806143779e-05, 'ppl': 6.043, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.79501342773438, 'epoch': 0.83, 'tokens/total': 62783488.0, 'tokens/trainable': 60071424.0}
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                        | 479/575 [2:22:24<28:04, 17.55s/it] 83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                       | 480/575 [2:22:42<27:53, 17.62s/it]                                                                                                                                                                                         {'loss': 1.7894, 'grad_norm': 1.1163861751556396, 'learning_rate': 8.24112110531403e-05, 'ppl': 5.98586, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.84388732910156, 'epoch': 0.83, 'tokens/total': 62914560.0, 'tokens/trainable': 60196340.0}
 83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                       | 480/575 [2:22:42<27:53, 17.62s/it] 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                       | 481/575 [2:22:59<27:34, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8153, 'grad_norm': 1.2270290851593018, 'learning_rate': 8.175576487540415e-05, 'ppl': 6.14292, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.65541076660156, 'epoch': 0.84, 'tokens/total': 63045632.0, 'tokens/trainable': 60321192.0}
 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                       | 481/575 [2:22:59<27:34, 17.60s/it] 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                       | 482/575 [2:23:16<27:08, 17.51s/it]                                                                                                                                                                                         {'loss': 1.7951, 'grad_norm': 0.9826641082763672, 'learning_rate': 8.110651052969754e-05, 'ppl': 6.02008, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.7191162109375, 'epoch': 0.84, 'tokens/total': 63176704.0, 'tokens/trainable': 60446752.0}
 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                       | 482/575 [2:23:17<27:08, 17.51s/it] 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                       | 483/575 [2:23:34<27:01, 17.63s/it]                                                                                                                                                                                         {'loss': 1.8035, 'grad_norm': 1.1193407773971558, 'learning_rate': 8.046346881909302e-05, 'ppl': 6.07086, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.00941467285156, 'epoch': 0.84, 'tokens/total': 63307776.0, 'tokens/trainable': 60572000.0}
 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                       | 483/575 [2:23:34<27:01, 17.63s/it] 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                      | 484/575 [2:23:52<26:38, 17.57s/it]                                                                                                                                                                                         {'loss': 1.8195, 'grad_norm': 1.5576895475387573, 'learning_rate': 7.982666034760118e-05, 'ppl': 6.16877, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.0016326904297, 'epoch': 0.84, 'tokens/total': 63438848.0, 'tokens/trainable': 60696912.0}
 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                      | 484/575 [2:23:52<26:38, 17.57s/it] 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                      | 485/575 [2:24:10<26:26, 17.63s/it]                                                                                                                                                                                         {'loss': 1.7841, 'grad_norm': 1.310745358467102, 'learning_rate': 7.919610551951032e-05, 'ppl': 5.95422, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 213.03564453125, 'epoch': 0.84, 'tokens/total': 63569920.0, 'tokens/trainable': 60822124.0}
 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                      | 485/575 [2:24:10<26:26, 17.63s/it] 85%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 486/575 [2:24:27<26:13, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8018, 'grad_norm': 1.0567817687988281, 'learning_rate': 7.857182453873266e-05, 'ppl': 6.06055, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.7728271484375, 'epoch': 0.84, 'tokens/total': 63700992.0, 'tokens/trainable': 60947168.0}
 85%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                      | 486/575 [2:24:27<26:13, 17.68s/it] 85%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                      | 487/575 [2:24:45<25:48, 17.60s/it]                                                                                                                                                                                         {'loss': 1.8157, 'grad_norm': 0.9883031845092773, 'learning_rate': 7.795383740815727e-05, 'ppl': 6.14538, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.6483612060547, 'epoch': 0.85, 'tokens/total': 63832064.0, 'tokens/trainable': 61072048.0}
 85%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                      | 487/575 [2:24:45<25:48, 17.60s/it] 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                     | 488/575 [2:25:02<25:29, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8345, 'grad_norm': 0.8998376727104187, 'learning_rate': 7.734216392900876e-05, 'ppl': 6.262, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.96450805664062, 'epoch': 0.85, 'tokens/total': 63963136.0, 'tokens/trainable': 61196748.0}
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                     | 488/575 [2:25:02<25:29, 17.58s/it] 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                     | 489/575 [2:25:20<25:08, 17.54s/it]                                                                                                                                                                                         {'loss': 1.7994, 'grad_norm': 0.8783245086669922, 'learning_rate': 7.673682370021296e-05, 'ppl': 6.04602, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.06362915039062, 'epoch': 0.85, 'tokens/total': 64094208.0, 'tokens/trainable': 61322140.0}
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                     | 489/575 [2:25:20<25:08, 17.54s/it] 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                     | 490/575 [2:25:37<24:50, 17.54s/it]                                                                                                                                                                                         {'loss': 1.8508, 'grad_norm': 0.8263002038002014, 'learning_rate': 7.613783611776902e-05, 'ppl': 6.36491, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.01043701171875, 'epoch': 0.85, 'tokens/total': 64225280.0, 'tokens/trainable': 61447144.0}
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                     | 490/575 [2:25:37<24:50, 17.54s/it] 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                     | 491/575 [2:25:55<24:36, 17.58s/it]                                                                                                                                                                                         {'loss': 1.7719, 'grad_norm': 0.9576923847198486, 'learning_rate': 7.554522037412778e-05, 'ppl': 5.88202, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.42018127441406, 'epoch': 0.85, 'tokens/total': 64356352.0, 'tokens/trainable': 61572256.0}
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                     | 491/575 [2:25:55<24:36, 17.58s/it] 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                    | 492/575 [2:26:13<24:21, 17.60s/it]                                                                                                                                                                                         {'loss': 1.7943, 'grad_norm': 1.052193522453308, 'learning_rate': 7.495899545757716e-05, 'ppl': 6.01526, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.32127380371094, 'epoch': 0.86, 'tokens/total': 64487424.0, 'tokens/trainable': 61697848.0}
 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                    | 492/575 [2:26:13<24:21, 17.60s/it] 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                    | 493/575 [2:26:30<24:04, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8439, 'grad_norm': 0.9324473142623901, 'learning_rate': 7.437918015163322e-05, 'ppl': 6.32114, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.93505859375, 'epoch': 0.86, 'tokens/total': 64618496.0, 'tokens/trainable': 61822848.0}
 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                    | 493/575 [2:26:30<24:04, 17.62s/it] 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                    | 494/575 [2:26:48<23:48, 17.64s/it]                                                                                                                                                                                         {'loss': 1.826, 'grad_norm': 0.8796619772911072, 'learning_rate': 7.380579303443872e-05, 'ppl': 6.209, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.2885284423828, 'epoch': 0.86, 'tokens/total': 64749568.0, 'tokens/trainable': 61948260.0}
 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                    | 494/575 [2:26:48<23:48, 17.64s/it] 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                    | 495/575 [2:27:06<23:31, 17.64s/it]                                                                                                                                                                                         {'loss': 1.809, 'grad_norm': 1.6163679361343384, 'learning_rate': 7.323885247816769e-05, 'ppl': 6.10434, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.85137939453125, 'epoch': 0.86, 'tokens/total': 64880640.0, 'tokens/trainable': 62073632.0}
 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                    | 495/575 [2:27:06<23:31, 17.64s/it] 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                   | 496/575 [2:27:23<23:08, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8217, 'grad_norm': 0.7951000928878784, 'learning_rate': 7.267837664843671e-05, 'ppl': 6.18236, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.57815551757812, 'epoch': 0.86, 'tokens/total': 65011712.0, 'tokens/trainable': 62198816.0}
 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                   | 496/575 [2:27:23<23:08, 17.58s/it] 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                   | 497/575 [2:27:41<22:53, 17.61s/it]                                                                                                                                                                                         {'loss': 1.7827, 'grad_norm': 0.8899988532066345, 'learning_rate': 7.212438350372311e-05, 'ppl': 5.94589, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.9426727294922, 'epoch': 0.86, 'tokens/total': 65142784.0, 'tokens/trainable': 62324348.0}
 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                   | 497/575 [2:27:41<22:53, 17.61s/it] 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                   | 498/575 [2:27:58<22:37, 17.63s/it]                                                                                                                                                                                         {'loss': 1.7637, 'grad_norm': 0.9742187857627869, 'learning_rate': 7.15768907947892e-05, 'ppl': 5.83398, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.87400817871094, 'epoch': 0.87, 'tokens/total': 65273856.0, 'tokens/trainable': 62449628.0}
 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                   | 498/575 [2:27:58<22:37, 17.63s/it] 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                   | 499/575 [2:28:16<22:17, 17.60s/it]                                                                                                                                                                                         {'loss': 1.7786, 'grad_norm': 0.8364508748054504, 'learning_rate': 7.103591606411377e-05, 'ppl': 5.92156, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.8054962158203, 'epoch': 0.87, 'tokens/total': 65404928.0, 'tokens/trainable': 62575068.0}
 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                   | 499/575 [2:28:16<22:17, 17.60s/it] 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                  | 500/575 [2:28:34<22:01, 17.62s/it]                                                                                                                                                                                         {'loss': 1.7511, 'grad_norm': 0.871688723564148, 'learning_rate': 7.050147664532988e-05, 'ppl': 5.76094, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.343017578125, 'epoch': 0.87, 'tokens/total': 65536000.0, 'tokens/trainable': 62700416.0}
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                  | 500/575 [2:28:34<22:01, 17.62s/it][2026-03-13 00:10:39,827] [INFO] [axolotl.core.trainers.base.evaluate:401] [PID:4624] Running evaluation step...

  0%|                                                                                                                                                             | 0/54 [00:00<?, ?it/s][A
  4%|█████▌                                                                                                                                               | 2/54 [00:00<00:12,  4.10it/s][A
  6%|████████▎                                                                                                                                            | 3/54 [00:00<00:17,  2.89it/s][A
  7%|███████████                                                                                                                                          | 4/54 [00:01<00:20,  2.50it/s][A
  9%|█████████████▊                                                                                                                                       | 5/54 [00:01<00:21,  2.32it/s][A
 11%|████████████████▌                                                                                                                                    | 6/54 [00:02<00:21,  2.22it/s][A
 13%|███████████████████▎                                                                                                                                 | 7/54 [00:02<00:21,  2.16it/s][A
 15%|██████████████████████                                                                                                                               | 8/54 [00:03<00:21,  2.12it/s][A
 17%|████████████████████████▊                                                                                                                            | 9/54 [00:03<00:21,  2.09it/s][A
 19%|███████████████████████████▍                                                                                                                        | 10/54 [00:04<00:21,  2.08it/s][A
 20%|██████████████████████████████▏                                                                                                                     | 11/54 [00:04<00:20,  2.07it/s][A
 22%|████████████████████████████████▉                                                                                                                   | 12/54 [00:05<00:20,  2.06it/s][A
 24%|███████████████████████████████████▋                                                                                                                | 13/54 [00:05<00:19,  2.05it/s][A
 26%|██████████████████████████████████████▎                                                                                                             | 14/54 [00:06<00:19,  2.05it/s][A
 28%|█████████████████████████████████████████                                                                                                           | 15/54 [00:06<00:19,  2.05it/s][A
 30%|███████████████████████████████████████████▊                                                                                                        | 16/54 [00:07<00:18,  2.04it/s][A
 31%|██████████████████████████████████████████████▌                                                                                                     | 17/54 [00:07<00:18,  2.04it/s][A
 33%|█████████████████████████████████████████████████▎                                                                                                  | 18/54 [00:08<00:17,  2.04it/s][A
 35%|████████████████████████████████████████████████████                                                                                                | 19/54 [00:08<00:17,  2.04it/s][A
 37%|██████████████████████████████████████████████████████▊                                                                                             | 20/54 [00:09<00:16,  2.04it/s][A
 39%|█████████████████████████████████████████████████████████▌                                                                                          | 21/54 [00:09<00:16,  2.04it/s][A
 41%|████████████████████████████████████████████████████████████▎                                                                                       | 22/54 [00:10<00:15,  2.04it/s][A
 43%|███████████████████████████████████████████████████████████████                                                                                     | 23/54 [00:10<00:15,  2.04it/s][A
 44%|█████████████████████████████████████████████████████████████████▊                                                                                  | 24/54 [00:11<00:14,  2.04it/s][A
 46%|████████████████████████████████████████████████████████████████████▌                                                                               | 25/54 [00:11<00:14,  2.04it/s][A
 48%|███████████████████████████████████████████████████████████████████████▎                                                                            | 26/54 [00:12<00:13,  2.04it/s][A
 50%|██████████████████████████████████████████████████████████████████████████                                                                          | 27/54 [00:12<00:13,  2.04it/s][A
 52%|████████████████████████████████████████████████████████████████████████████▋                                                                       | 28/54 [00:13<00:12,  2.04it/s][A
 54%|███████████████████████████████████████████████████████████████████████████████▍                                                                    | 29/54 [00:13<00:12,  2.04it/s][A
 56%|██████████████████████████████████████████████████████████████████████████████████▏                                                                 | 30/54 [00:14<00:11,  2.04it/s][A
 57%|████████████████████████████████████████████████████████████████████████████████████▉                                                               | 31/54 [00:14<00:11,  2.04it/s][A
 59%|███████████████████████████████████████████████████████████████████████████████████████▋                                                            | 32/54 [00:15<00:10,  2.04it/s][A
 61%|██████████████████████████████████████████████████████████████████████████████████████████▍                                                         | 33/54 [00:15<00:10,  1.92it/s][A
 63%|█████████████████████████████████████████████████████████████████████████████████████████████▏                                                      | 34/54 [00:16<00:10,  1.99it/s][A
 65%|███████████████████████████████████████████████████████████████████████████████████████████████▉                                                    | 35/54 [00:16<00:09,  2.00it/s][A
 67%|██████████████████████████████████████████████████████████████████████████████████████████████████▋                                                 | 36/54 [00:17<00:08,  2.01it/s][A
 69%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                                              | 37/54 [00:17<00:08,  2.02it/s][A
 70%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                           | 38/54 [00:18<00:07,  2.03it/s][A
 72%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                         | 39/54 [00:18<00:07,  2.03it/s][A
 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                      | 40/54 [00:19<00:06,  2.03it/s][A
 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                   | 41/54 [00:19<00:06,  2.04it/s][A
 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                 | 42/54 [00:20<00:05,  2.04it/s][A
 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                              | 43/54 [00:20<00:05,  2.04it/s][A
 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                           | 44/54 [00:21<00:04,  2.04it/s][A
 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                        | 45/54 [00:21<00:04,  2.04it/s][A
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                      | 46/54 [00:22<00:03,  2.04it/s][A
 87%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                   | 47/54 [00:22<00:03,  2.04it/s][A
 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                | 48/54 [00:23<00:02,  2.04it/s][A
 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎             | 49/54 [00:23<00:02,  2.04it/s][A
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████           | 50/54 [00:24<00:01,  2.04it/s][A
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊        | 51/54 [00:24<00:01,  2.04it/s][A
 96%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌     | 52/54 [00:25<00:00,  2.04it/s][A
 98%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎  | 53/54 [00:25<00:00,  2.04it/s][A
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54/54 [00:26<00:00,  1.98it/s][A                                                                                                                                                                                         
{'eval_loss': 1.7823346853256226, 'eval_runtime': 27.2853, 'eval_samples_per_second': 7.916, 'eval_steps_per_second': 1.979, 'eval_ppl': 5.94372, 'memory/max_active (GiB)': 26.29, 'memory/max_allocated (GiB)': 26.29, 'memory/device_reserved (GiB)': 27.83, 'epoch': 0.87, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 65536000.0, 'tokens/trainable': 62700416.0}
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                  | 500/575 [2:29:01<22:01, 17.62s/it]
[2026-03-13 00:11:07,117] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-500
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                  | 501/575 [2:29:32<36:45, 29.80s/it]                                                                                                                                                                                         {'loss': 1.8159, 'grad_norm': 0.7782678008079529, 'learning_rate': 6.997358966266946e-05, 'ppl': 6.14661, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.1669921875, 'epoch': 0.87, 'tokens/total': 65667072.0, 'tokens/trainable': 62825712.0}
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                  | 501/575 [2:29:32<36:45, 29.80s/it] 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                  | 502/575 [2:29:50<31:52, 26.20s/it]                                                                                                                                                                                         {'loss': 1.7859, 'grad_norm': 0.9077524542808533, 'learning_rate': 6.94522720304148e-05, 'ppl': 5.96495, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.1307830810547, 'epoch': 0.87, 'tokens/total': 65798144.0, 'tokens/trainable': 62950460.0}
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                  | 502/575 [2:29:50<31:52, 26.20s/it] 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                  | 503/575 [2:30:07<28:24, 23.67s/it]                                                                                                                                                                                         {'loss': 1.78, 'grad_norm': 0.8502079844474792, 'learning_rate': 6.893754045235631e-05, 'ppl': 5.92986, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.42445373535156, 'epoch': 0.87, 'tokens/total': 65929216.0, 'tokens/trainable': 63075840.0}
 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                  | 503/575 [2:30:07<28:24, 23.67s/it] 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                 | 504/575 [2:30:25<25:53, 21.87s/it]                                                                                                                                                                                         {'loss': 1.8284, 'grad_norm': 1.0214923620224, 'learning_rate': 6.842941142125755e-05, 'ppl': 6.22392, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.27508544921875, 'epoch': 0.88, 'tokens/total': 66060288.0, 'tokens/trainable': 63200856.0}
 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                 | 504/575 [2:30:25<25:53, 21.87s/it] 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                 | 505/575 [2:30:43<24:00, 20.58s/it]                                                                                                                                                                                         {'loss': 1.8095, 'grad_norm': 0.8314673900604248, 'learning_rate': 6.792790121832664e-05, 'ppl': 6.10739, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.2130126953125, 'epoch': 0.88, 'tokens/total': 66191360.0, 'tokens/trainable': 63325728.0}
 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                 | 505/575 [2:30:43<24:00, 20.58s/it] 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                 | 506/575 [2:31:00<22:37, 19.67s/it]                                                                                                                                                                                         {'loss': 1.7627, 'grad_norm': 0.8183115124702454, 'learning_rate': 6.743302591269457e-05, 'ppl': 5.82815, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.36012268066406, 'epoch': 0.88, 'tokens/total': 66322432.0, 'tokens/trainable': 63451056.0}
 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                 | 506/575 [2:31:00<22:37, 19.67s/it] 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                 | 507/575 [2:31:18<21:37, 19.08s/it]                                                                                                                                                                                         {'loss': 1.7956, 'grad_norm': 0.7512713074684143, 'learning_rate': 6.694480136090044e-05, 'ppl': 6.02309, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.56153869628906, 'epoch': 0.88, 'tokens/total': 66453504.0, 'tokens/trainable': 63576460.0}
 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                 | 507/575 [2:31:18<21:37, 19.08s/it] 88%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                | 508/575 [2:31:36<20:50, 18.66s/it]                                                                                                                                                                                         {'loss': 1.8072, 'grad_norm': 0.8577417135238647, 'learning_rate': 6.646324320638335e-05, 'ppl': 6.09336, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.41273498535156, 'epoch': 0.88, 'tokens/total': 66584576.0, 'tokens/trainable': 63701984.0}
 88%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                | 508/575 [2:31:36<20:50, 18.66s/it] 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                | 509/575 [2:31:53<20:11, 18.36s/it]                                                                                                                                                                                         {'loss': 1.8226, 'grad_norm': 0.8176920413970947, 'learning_rate': 6.598836687898113e-05, 'ppl': 6.18793, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.04934692382812, 'epoch': 0.88, 'tokens/total': 66715648.0, 'tokens/trainable': 63827268.0}
 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                | 509/575 [2:31:53<20:11, 18.36s/it] 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                | 510/575 [2:32:11<19:42, 18.19s/it]                                                                                                                                                                                         {'loss': 1.7874, 'grad_norm': 0.8938012719154358, 'learning_rate': 6.55201875944359e-05, 'ppl': 5.9739, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.16436767578125, 'epoch': 0.89, 'tokens/total': 66846720.0, 'tokens/trainable': 63952392.0}
 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                | 510/575 [2:32:11<19:42, 18.19s/it] 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                | 511/575 [2:32:29<19:14, 18.04s/it]                                                                                                                                                                                         {'loss': 1.7728, 'grad_norm': 0.7777758836746216, 'learning_rate': 6.50587203539066e-05, 'ppl': 5.88731, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.6005401611328, 'epoch': 0.89, 'tokens/total': 66977792.0, 'tokens/trainable': 64077788.0}
 89%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                | 511/575 [2:32:29<19:14, 18.04s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏               | 512/575 [2:32:46<18:49, 17.93s/it]                                                                                                                                                                                         {'loss': 1.795, 'grad_norm': 0.9612661600112915, 'learning_rate': 6.460397994348838e-05, 'ppl': 6.01947, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.6536865234375, 'epoch': 0.89, 'tokens/total': 67108864.0, 'tokens/trainable': 64202912.0}
 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏               | 512/575 [2:32:46<18:49, 17.93s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍               | 513/575 [2:33:04<18:24, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8147, 'grad_norm': 0.9259353280067444, 'learning_rate': 6.415598093373867e-05, 'ppl': 6.13923, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.17642211914062, 'epoch': 0.89, 'tokens/total': 67239936.0, 'tokens/trainable': 64328300.0}
 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍               | 513/575 [2:33:04<18:24, 17.82s/it] 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋               | 514/575 [2:33:22<18:02, 17.74s/it]                                                                                                                                                                                         {'loss': 1.7842, 'grad_norm': 0.8544265031814575, 'learning_rate': 6.371473767921058e-05, 'ppl': 5.95481, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.34231567382812, 'epoch': 0.89, 'tokens/total': 67371008.0, 'tokens/trainable': 64453432.0}
 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋               | 514/575 [2:33:22<18:02, 17.74s/it] 90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉               | 515/575 [2:33:39<17:43, 17.72s/it]                                                                                                                                                                                         {'loss': 1.8142, 'grad_norm': 1.0337729454040527, 'learning_rate': 6.328026431799267e-05, 'ppl': 6.13617, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.74862670898438, 'epoch': 0.9, 'tokens/total': 67502080.0, 'tokens/trainable': 64578328.0}
 90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉               | 515/575 [2:33:39<17:43, 17.72s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏              | 516/575 [2:33:56<17:16, 17.56s/it]                                                                                                                                                                                         {'loss': 1.813, 'grad_norm': 0.9456311464309692, 'learning_rate': 6.285257477125605e-05, 'ppl': 6.12881, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 231.2970428466797, 'epoch': 0.9, 'tokens/total': 67633152.0, 'tokens/trainable': 64703252.0}
 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏              | 516/575 [2:33:56<17:16, 17.56s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍              | 517/575 [2:34:14<17:02, 17.63s/it]                                                                                                                                                                                         {'loss': 1.8326, 'grad_norm': 0.8232349753379822, 'learning_rate': 6.243168274280847e-05, 'ppl': 6.25012, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.76063537597656, 'epoch': 0.9, 'tokens/total': 67764224.0, 'tokens/trainable': 64828456.0}
 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍              | 517/575 [2:34:14<17:02, 17.63s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋              | 518/575 [2:34:32<16:45, 17.64s/it]                                                                                                                                                                                         {'loss': 1.8517, 'grad_norm': 0.9537811279296875, 'learning_rate': 6.201760171865502e-05, 'ppl': 6.37064, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.86697387695312, 'epoch': 0.9, 'tokens/total': 67895296.0, 'tokens/trainable': 64953872.0}
 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋              | 518/575 [2:34:32<16:45, 17.64s/it] 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉              | 519/575 [2:34:49<16:28, 17.65s/it]                                                                                                                                                                                         {'loss': 1.7512, 'grad_norm': 1.0206140279769897, 'learning_rate': 6.161034496656608e-05, 'ppl': 5.76151, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.98492431640625, 'epoch': 0.9, 'tokens/total': 68026368.0, 'tokens/trainable': 65079376.0}
 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉              | 519/575 [2:34:50<16:28, 17.65s/it] 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏             | 520/575 [2:35:07<16:12, 17.69s/it]                                                                                                                                                                                         {'loss': 1.8169, 'grad_norm': 0.8020748496055603, 'learning_rate': 6.120992553565237e-05, 'ppl': 6.15276, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.9408416748047, 'epoch': 0.9, 'tokens/total': 68157440.0, 'tokens/trainable': 65203976.0}
 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏             | 520/575 [2:35:07<16:12, 17.69s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍             | 521/575 [2:35:25<15:54, 17.68s/it]                                                                                                                                                                                         {'loss': 1.7929, 'grad_norm': 0.9820665717124939, 'learning_rate': 6.081635625594654e-05, 'ppl': 6.00685, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.2080535888672, 'epoch': 0.91, 'tokens/total': 68288512.0, 'tokens/trainable': 65328788.0}
 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍             | 521/575 [2:35:25<15:54, 17.68s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋             | 522/575 [2:35:43<15:36, 17.68s/it]                                                                                                                                                                                         {'loss': 1.8076, 'grad_norm': 0.9947705268859863, 'learning_rate': 6.042964973799229e-05, 'ppl': 6.0958, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.92910766601562, 'epoch': 0.91, 'tokens/total': 68419584.0, 'tokens/trainable': 65454008.0}
 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋             | 522/575 [2:35:43<15:36, 17.68s/it] 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 523/575 [2:36:00<15:20, 17.71s/it]                                                                                                                                                                                         {'loss': 1.8146, 'grad_norm': 1.1457022428512573, 'learning_rate': 6.004981837244028e-05, 'ppl': 6.13862, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.2185821533203, 'epoch': 0.91, 'tokens/total': 68550656.0, 'tokens/trainable': 65578880.0}
 91%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 523/575 [2:36:00<15:20, 17.71s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏            | 524/575 [2:36:18<15:04, 17.73s/it]                                                                                                                                                                                         {'loss': 1.747, 'grad_norm': 0.9083927869796753, 'learning_rate': 5.9676874329651e-05, 'ppl': 5.73736, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.60743713378906, 'epoch': 0.91, 'tokens/total': 68681728.0, 'tokens/trainable': 65704000.0}
 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏            | 524/575 [2:36:18<15:04, 17.73s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍            | 525/575 [2:36:36<14:49, 17.78s/it]                                                                                                                                                                                         {'loss': 1.8023, 'grad_norm': 0.9731112122535706, 'learning_rate': 5.9310829559305e-05, 'ppl': 6.06358, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.9615478515625, 'epoch': 0.91, 'tokens/total': 68812800.0, 'tokens/trainable': 65828968.0}
 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍            | 525/575 [2:36:36<14:49, 17.78s/it] 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋            | 526/575 [2:36:54<14:31, 17.78s/it]                                                                                                                                                                                         {'loss': 1.731, 'grad_norm': 1.504130244255066, 'learning_rate': 5.895169579001987e-05, 'ppl': 5.6463, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 217.8334503173828, 'epoch': 0.91, 'tokens/total': 68943872.0, 'tokens/trainable': 65953976.0}
 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋            | 526/575 [2:36:54<14:31, 17.78s/it] 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉            | 527/575 [2:37:12<14:13, 17.78s/it]                                                                                                                                                                                         {'loss': 1.7551, 'grad_norm': 0.9270268082618713, 'learning_rate': 5.859948452897443e-05, 'ppl': 5.78403, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.79371643066406, 'epoch': 0.92, 'tokens/total': 69074944.0, 'tokens/trainable': 66079528.0}
 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉            | 527/575 [2:37:12<14:13, 17.78s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏           | 528/575 [2:37:30<13:57, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8428, 'grad_norm': 0.8159583210945129, 'learning_rate': 5.8254207061540136e-05, 'ppl': 6.31419, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.4482421875, 'epoch': 0.92, 'tokens/total': 69206016.0, 'tokens/trainable': 66204508.0}
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏           | 528/575 [2:37:30<13:57, 17.82s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍           | 529/575 [2:37:47<13:40, 17.83s/it]                                                                                                                                                                                         {'loss': 1.7457, 'grad_norm': 0.791064441204071, 'learning_rate': 5.7915874450919326e-05, 'ppl': 5.72991, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.12986755371094, 'epoch': 0.92, 'tokens/total': 69337088.0, 'tokens/trainable': 66329616.0}
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍           | 529/575 [2:37:47<13:40, 17.83s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋           | 530/575 [2:38:05<13:21, 17.82s/it]                                                                                                                                                                                         {'loss': 1.8044, 'grad_norm': 0.9117148518562317, 'learning_rate': 5.758449753779087e-05, 'ppl': 6.07632, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 214.86485290527344, 'epoch': 0.92, 'tokens/total': 69468160.0, 'tokens/trainable': 66454328.0}
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋           | 530/575 [2:38:05<13:21, 17.82s/it] 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 531/575 [2:38:23<13:01, 17.77s/it]                                                                                                                                                                                         {'loss': 1.8151, 'grad_norm': 0.9881961941719055, 'learning_rate': 5.7260086939962754e-05, 'ppl': 6.14169, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.07562255859375, 'epoch': 0.92, 'tokens/total': 69599232.0, 'tokens/trainable': 66579096.0}
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 531/575 [2:38:23<13:01, 17.77s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏          | 532/575 [2:38:41<12:44, 17.78s/it]                                                                                                                                                                                         {'loss': 1.7539, 'grad_norm': 0.888619065284729, 'learning_rate': 5.6942653052031944e-05, 'ppl': 5.77709, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.10986328125, 'epoch': 0.92, 'tokens/total': 69730304.0, 'tokens/trainable': 66704056.0}
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏          | 532/575 [2:38:41<12:44, 17.78s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍          | 533/575 [2:38:58<12:26, 17.78s/it]                                                                                                                                                                                         {'loss': 1.7819, 'grad_norm': 0.7635431885719299, 'learning_rate': 5.6632206045051154e-05, 'ppl': 5.94113, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.46420288085938, 'epoch': 0.93, 'tokens/total': 69861376.0, 'tokens/trainable': 66828700.0}
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍          | 533/575 [2:38:58<12:26, 17.78s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋          | 534/575 [2:39:16<12:09, 17.78s/it]                                                                                                                                                                                         {'loss': 1.781, 'grad_norm': 0.8273208141326904, 'learning_rate': 5.632875586620319e-05, 'ppl': 5.93579, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 215.4923095703125, 'epoch': 0.93, 'tokens/total': 69992448.0, 'tokens/trainable': 66953844.0}
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋          | 534/575 [2:39:16<12:09, 17.78s/it] 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉          | 535/575 [2:39:34<11:51, 17.78s/it]                                                                                                                                                                                         {'loss': 1.8038, 'grad_norm': 0.7805310487747192, 'learning_rate': 5.6032312238482e-05, 'ppl': 6.07268, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.9886016845703, 'epoch': 0.93, 'tokens/total': 70123520.0, 'tokens/trainable': 67078688.0}
 93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉          | 535/575 [2:39:34<11:51, 17.78s/it] 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏         | 536/575 [2:39:51<11:28, 17.64s/it]                                                                                                                                                                                         {'loss': 1.8043, 'grad_norm': 0.837330162525177, 'learning_rate': 5.5742884660381276e-05, 'ppl': 6.07572, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.4739532470703, 'epoch': 0.93, 'tokens/total': 70254592.0, 'tokens/trainable': 67199744.0}
 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏         | 536/575 [2:39:51<11:28, 17.64s/it] 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 537/575 [2:40:09<11:09, 17.62s/it]                                                                                                                                                                                         {'loss': 1.7999, 'grad_norm': 0.9859345555305481, 'learning_rate': 5.5460482405590105e-05, 'ppl': 6.04904, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.92904663085938, 'epoch': 0.93, 'tokens/total': 70385664.0, 'tokens/trainable': 67325168.0}
 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 537/575 [2:40:09<11:09, 17.62s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋         | 538/575 [2:40:26<10:49, 17.56s/it]                                                                                                                                                                                         {'loss': 1.8073, 'grad_norm': 1.0281709432601929, 'learning_rate': 5.518511452269573e-05, 'ppl': 6.09397, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 228.38052368164062, 'epoch': 0.94, 'tokens/total': 70516736.0, 'tokens/trainable': 67451120.0}
 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋         | 538/575 [2:40:26<10:49, 17.56s/it] 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉         | 539/575 [2:40:43<10:28, 17.45s/it]                                                                                                                                                                                         {'loss': 1.8084, 'grad_norm': 0.8007544279098511, 'learning_rate': 5.4916789834893724e-05, 'ppl': 6.10068, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 230.73428344726562, 'epoch': 0.94, 'tokens/total': 70647808.0, 'tokens/trainable': 67576792.0}
 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉         | 539/575 [2:40:44<10:28, 17.45s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏        | 540/575 [2:41:01<10:13, 17.52s/it]                                                                                                                                                                                         {'loss': 1.8075, 'grad_norm': 0.7906801104545593, 'learning_rate': 5.465551693970524e-05, 'ppl': 6.09519, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.75840759277344, 'epoch': 0.94, 'tokens/total': 70778880.0, 'tokens/trainable': 67702288.0}
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏        | 540/575 [2:41:01<10:13, 17.52s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍        | 541/575 [2:41:19<09:57, 17.56s/it]                                                                                                                                                                                         {'loss': 1.7979, 'grad_norm': 1.343226432800293, 'learning_rate': 5.4401304208701486e-05, 'ppl': 6.03696, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.17401123046875, 'epoch': 0.94, 'tokens/total': 70909952.0, 'tokens/trainable': 67828056.0}
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍        | 541/575 [2:41:19<09:57, 17.56s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋        | 542/575 [2:41:36<09:40, 17.59s/it]                                                                                                                                                                                         {'loss': 1.8269, 'grad_norm': 0.9941368103027344, 'learning_rate': 5.4154159787235605e-05, 'ppl': 6.21459, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.8453826904297, 'epoch': 0.94, 'tokens/total': 71041024.0, 'tokens/trainable': 67953664.0}
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋        | 542/575 [2:41:37<09:40, 17.59s/it] 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉        | 543/575 [2:41:54<09:22, 17.58s/it]                                                                                                                                                                                         {'loss': 1.8038, 'grad_norm': 0.9001813530921936, 'learning_rate': 5.3914091594181505e-05, 'ppl': 6.07268, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.92713928222656, 'epoch': 0.94, 'tokens/total': 71172096.0, 'tokens/trainable': 68079032.0}
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉        | 543/575 [2:41:54<09:22, 17.58s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏       | 544/575 [2:42:12<09:06, 17.64s/it]                                                                                                                                                                                         {'loss': 1.8518, 'grad_norm': 0.7520790100097656, 'learning_rate': 5.368110732168038e-05, 'ppl': 6.37128, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.2702178955078, 'epoch': 0.95, 'tokens/total': 71303168.0, 'tokens/trainable': 68204304.0}
 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏       | 544/575 [2:42:12<09:06, 17.64s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍       | 545/575 [2:42:30<08:50, 17.69s/it]                                                                                                                                                                                         {'loss': 1.8202, 'grad_norm': 0.961162269115448, 'learning_rate': 5.3455214434894e-05, 'ppl': 6.17309, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.55418395996094, 'epoch': 0.95, 'tokens/total': 71434240.0, 'tokens/trainable': 68329888.0}
 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍       | 545/575 [2:42:30<08:50, 17.69s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋       | 546/575 [2:42:47<08:32, 17.68s/it]                                                                                                                                                                                         {'loss': 1.7364, 'grad_norm': 0.7742072343826294, 'learning_rate': 5.32364201717656e-05, 'ppl': 5.67687, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.82240295410156, 'epoch': 0.95, 'tokens/total': 71565312.0, 'tokens/trainable': 68455296.0}
 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋       | 546/575 [2:42:47<08:32, 17.68s/it] 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉       | 547/575 [2:43:05<08:15, 17.68s/it]                                                                                                                                                                                         {'loss': 1.7571, 'grad_norm': 1.108005166053772, 'learning_rate': 5.3024731542788076e-05, 'ppl': 5.79561, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.71656799316406, 'epoch': 0.95, 'tokens/total': 71696384.0, 'tokens/trainable': 68580704.0}
 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉       | 547/575 [2:43:05<08:15, 17.68s/it] 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏      | 548/575 [2:43:23<07:59, 17.75s/it]                                                                                                                                                                                         {'loss': 1.7827, 'grad_norm': 0.7538265585899353, 'learning_rate': 5.2820155330779156e-05, 'ppl': 5.94589, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.86105346679688, 'epoch': 0.95, 'tokens/total': 71827456.0, 'tokens/trainable': 68705808.0}
 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏      | 548/575 [2:43:23<07:59, 17.75s/it] 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍      | 549/575 [2:43:40<07:38, 17.62s/it]                                                                                                                                                                                         {'loss': 1.8162, 'grad_norm': 0.8675631284713745, 'learning_rate': 5.2622698090664246e-05, 'ppl': 6.14845, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 227.0034942626953, 'epoch': 0.95, 'tokens/total': 71958528.0, 'tokens/trainable': 68831360.0}
 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍      | 549/575 [2:43:40<07:38, 17.62s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋      | 550/575 [2:43:58<07:19, 17.56s/it]                                                                                                                                                                                         {'loss': 1.7687, 'grad_norm': 0.795049250125885, 'learning_rate': 5.2432366149266304e-05, 'ppl': 5.86323, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.38453674316406, 'epoch': 0.96, 'tokens/total': 72089600.0, 'tokens/trainable': 68957056.0}
 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋      | 550/575 [2:43:58<07:19, 17.56s/it] 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉      | 551/575 [2:44:15<07:01, 17.56s/it]                                                                                                                                                                                         {'loss': 1.8051, 'grad_norm': 0.836580753326416, 'learning_rate': 5.224916560510316e-05, 'ppl': 6.08058, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.70823669433594, 'epoch': 0.96, 'tokens/total': 72220672.0, 'tokens/trainable': 69082344.0}
 96%| 552/575 [2:44:33<06:45, 17.63s/it] {'loss': 1.7855, 'grad_norm': 0.8051338791847229, 'learning_rate': 5.207310232819204e-05, 'ppl': 5.96256, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.3197479248047, 'epoch': 0.96, 'tokens/total': 72351744.0, 'tokens/trainable': 69207728.0}
 96%| 553/575 [2:44:51<06:28, 17.67s/it] {'loss': 1.7815, 'grad_norm': 0.8446016311645508, 'learning_rate': 5.1904181959861644e-05, 'ppl': 5.93876, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 218.91896057128906, 'epoch': 0.96, 'tokens/total': 72482816.0, 'tokens/trainable': 69333040.0}
 96%| 554/575 [2:45:09<06:11, 17.71s/it] {'loss': 1.7697, 'grad_norm': 0.9553136229515076, 'learning_rate': 5.174240991257116e-05, 'ppl': 5.86909, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.3632354736328, 'epoch': 0.96, 'tokens/total': 72613888.0, 'tokens/trainable': 69458776.0}
 97%| 555/575 [2:45:26<05:54, 17.73s/it] {'loss': 1.7871, 'grad_norm': 0.8532331585884094, 'learning_rate': 5.158779136973705e-05, 'ppl': 5.97211, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.75445556640625, 'epoch': 0.96, 'tokens/total': 72744960.0, 'tokens/trainable': 69583656.0}
 97%| 556/575 [2:45:44<05:37, 17.75s/it] {'loss': 1.7956, 'grad_norm': 0.9137930274009705, 'learning_rate': 5.1440331285566846e-05, 'ppl': 6.02309, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.31561279296875, 'epoch': 0.97, 'tokens/total': 72876032.0, 'tokens/trainable': 69709056.0}
 97%| 557/575 [2:46:02<05:19, 17.76s/it] {'loss': 1.7837, 'grad_norm': 0.8183591365814209, 'learning_rate': 5.1300034384900424e-05, 'ppl': 5.95184, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.53933715820312, 'epoch': 0.97, 'tokens/total': 73007104.0, 'tokens/trainable': 69834216.0}
 97%| 558/575 [2:46:20<05:02, 17.77s/it] {'loss': 1.7985, 'grad_norm': 0.8793440461158752, 'learning_rate': 5.116690516305871e-05, 'ppl': 6.04058, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 216.7800750732422, 'epoch': 0.97, 'tokens/total': 73138176.0, 'tokens/trainable': 69959472.0}
 97%| 559/575 [2:46:37<04:42, 17.67s/it] {'loss': 1.771, 'grad_norm': 0.9606843590736389, 'learning_rate': 5.1040947885699456e-05, 'ppl': 5.87673, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.4654998779297, 'epoch': 0.97, 'tokens/total': 73269248.0, 'tokens/trainable': 70084896.0}
 97%| 560/575 [2:46:55<04:25, 17.67s/it] {'loss': 1.8095, 'grad_norm': 0.802405059337616, 'learning_rate': 5.092216658868072e-05, 'ppl': 6.10739, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 224.83941650390625, 'epoch': 0.97, 'tokens/total': 73400320.0, 'tokens/trainable': 70210584.0}
 98%| 561/575 [2:47:12<04:07, 17.67s/it] {'loss': 1.7674, 'grad_norm': 0.9886576533317566, 'learning_rate': 5.081056507793154e-05, 'ppl': 5.85561, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.79408264160156, 'epoch': 0.98, 'tokens/total': 73531392.0, 'tokens/trainable': 70336088.0}
 98%| 562/575 [2:47:30<03:50, 17.74s/it] {'loss': 1.7888, 'grad_norm': 1.1933319568634033, 'learning_rate': 5.0706146929329914e-05, 'ppl': 5.98227, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.0952911376953, 'epoch': 0.98, 'tokens/total': 73662464.0, 'tokens/trainable': 70461384.0}
 98%| 563/575 [2:47:48<03:32, 17.72s/it] {'loss': 1.8543, 'grad_norm': 0.928514838218689, 'learning_rate': 5.060891548858823e-05, 'ppl': 6.38723, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 216.577392578125, 'epoch': 0.98, 'tokens/total': 73793536.0, 'tokens/trainable': 70586352.0}
 98%| 564/575 [2:48:06<03:15, 17.74s/it] {'loss': 1.8291, 'grad_norm': 1.1305882930755615, 'learning_rate': 5.051887387114615e-05, 'ppl': 6.22828, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.29116821289062, 'epoch': 0.98, 'tokens/total': 73924608.0, 'tokens/trainable': 70711472.0}
 98%| 565/575 [2:48:24<02:57, 17.75s/it] {'loss': 1.7975, 'grad_norm': 0.8098691701889038, 'learning_rate': 5.043602496207067e-05, 'ppl': 6.03454, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.83909606933594, 'epoch': 0.98, 'tokens/total': 74055680.0, 'tokens/trainable': 70836504.0}
 98%| 566/575 [2:48:41<02:39, 17.73s/it] {'loss': 1.791, 'grad_norm': 0.8814376592636108, 'learning_rate': 5.036037141596382e-05, 'ppl': 5.99544, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 226.6243896484375, 'epoch': 0.98, 'tokens/total': 74186752.0, 'tokens/trainable': 70962008.0}
 99%| 567/575 [2:48:59<02:22, 17.78s/it] {'loss': 1.804, 'grad_norm': 0.8036472797393799, 'learning_rate': 5.0291915656877405e-05, 'ppl': 6.07389, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.92947387695312, 'epoch': 0.99, 'tokens/total': 74317824.0, 'tokens/trainable': 71087504.0}
 99%| 568/575 [2:49:17<02:04, 17.78s/it] {'loss': 1.7527, 'grad_norm': 0.8120279312133789, 'learning_rate': 5.023065987823557e-05, 'ppl': 5.77016, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 222.6497039794922, 'epoch': 0.99, 'tokens/total': 74448896.0, 'tokens/trainable': 71212704.0}
 99%| 569/575 [2:49:35<01:46, 17.75s/it] {'loss': 1.8099, 'grad_norm': 0.99661785364151, 'learning_rate': 5.0176606042764355e-05, 'ppl': 6.10984, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 225.9157257080078, 'epoch': 0.99, 'tokens/total': 74579968.0, 'tokens/trainable': 71337840.0}
 99%| 570/575 [2:49:52<01:28, 17.66s/it] {'loss': 1.767, 'grad_norm': 0.8532122373580933, 'learning_rate': 5.012975588242882e-05, 'ppl': 5.85327, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 223.75767517089844, 'epoch': 0.99, 'tokens/total': 74711040.0, 'tokens/trainable': 71463480.0}
 99%| 571/575 [2:50:10<01:10, 17.70s/it] {'loss': 1.8058, 'grad_norm': 0.9066288471221924, 'learning_rate': 5.009011089837765e-05, 'ppl': 6.08484, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 220.62295532226562, 'epoch': 0.99, 'tokens/total': 74842112.0, 'tokens/trainable': 71588696.0}
 99%| 572/575 [2:50:28<00:53, 17.73s/it] {'loss': 1.7316, 'grad_norm': 0.8485552072525024, 'learning_rate': 5.005767236089491e-05, 'ppl': 5.64969, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.62254333496094, 'epoch': 0.99, 'tokens/total': 74973184.0, 'tokens/trainable': 71714440.0}
100%| 573/575 [2:50:46<00:35, 17.78s/it] {'loss': 1.7588, 'grad_norm': 0.8762276768684387, 'learning_rate': 5.0032441309359544e-05, 'ppl': 5.80547, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 221.68173217773438, 'epoch': 1.0, 'tokens/total': 75104256.0, 'tokens/trainable': 71839912.0}
100%| 574/575 [2:51:03<00:17, 17.78s/it] {'loss': 1.8336, 'grad_norm': 0.8655061721801758, 'learning_rate': 5.001441855221184e-05, 'ppl': 6.25637, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 219.48463439941406, 'epoch': 1.0, 'tokens/total': 75235328.0, 'tokens/trainable': 71965328.0}
100%| 575/575 [2:51:21<00:00, 17.82s/it] {'loss': 1.7731, 'grad_norm': 0.9516346454620361, 'learning_rate': 5.0003604666927675e-05, 'ppl': 5.88908, 'memory/max_active (GiB)': 26.03, 'memory/max_allocated (GiB)': 26.03, 'memory/device_reserved (GiB)': 27.83, 'tokens/train_per_sec_per_gpu': 212.42791748046875, 'epoch': 1.0, 'tokens/total': 75366400.0, 'tokens/trainable': 72089808.0}
[2026-03-13 00:33:27,468] [INFO] [axolotl.core.trainers.base._save:723] [PID:4624] Saving model checkpoint to ./output/model/checkpoint-575
{'train_runtime': 10298.4302, 'train_samples_per_second': 1.787, 'train_steps_per_second': 0.056, 'train_loss': 2.084346570139346, 'memory/max_active (GiB)': 10.15, 'memory/max_allocated (GiB)': 10.15, 'memory/device_reserved (GiB)': 27.83, 'epoch': 1.0, 'tokens/train_per_sec_per_gpu': 0.0, 'tokens/total': 75366400.0, 'tokens/trainable': 72089808.0}
100%| 575/575 [2:51:34<00:00, 17.90s/it]
[2026-03-13 00:33:40,753] [INFO] [axolotl.train.save_trained_model:233] [PID:4624] Training completed! Saving trained model to ./output/model.
[2026-03-13 00:33:46,325] [INFO] [axolotl.train.save_trained_model:351] [PID:4624] Model successfully saved to ./output/model
[2026-03-15 08:11:17,147] [DEBUG] [axolotl.utils.config.resolve_dtype:66] [PID:5120] bf16 support detected, enabling for this configuration.
[2026-03-15 08:11:17,153] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:5120] baseline 0.000GB ()
[2026-03-15 08:11:17,154] [INFO] [axolotl.cli.config.load_cfg:259] [PID:5120] config:
{
  "activation_offloading": false,
  "auto_resume_from_checkpoints": true,
  "axolotl_config_path": "output/sft.yml",
  "base_model": "output/model_7e4",
  "base_model_config": "output/model_7e4",
  "batch_size": 32,
  "bf16": true,
  "capabilities": {
    "bf16": true,
    "compute_capability": "sm_120",
    "fp8": true,
    "n_gpu": 1,
    "n_node": 1
  },
  "chat_template": "tokenizer_default",
  "context_parallel_size": 1,
  "cosine_min_lr_ratio": 0.1,
  "dataloader_num_workers": 1,
  "dataloader_pin_memory": true,
  "dataloader_prefetch_factor": 256,
  "dataset_num_proc": 24,
  "dataset_prepared_path": "./output/dataset",
  "datasets": [
    {
      "chat_template": "tokenizer_default",
      "field_messages": "messages",
      "message_property_mappings": {
        "content": "content",
        "role": "role"
      },
      "path": "kajuma/Zero_SFT_Ja_v3.5",
      "trust_remote_code": false,
      "type": "chat_template"
    }
  ],
  "ddp": false,
  "device": "cuda:0",
  "dion_rank_fraction": 1.0,
  "dion_rank_multiple_of": 1,
  "env_capabilities": {
    "torch_version": "2.8.0"
  },
  "eval_batch_size": 4,
  "eval_causal_lm_metrics": [
    "sacrebleu",
    "comet",
    "ter",
    "chrf"
  ],
  "eval_max_new_tokens": 128,
  "eval_sample_packing": false,
  "eval_steps": 100,
  "eval_table_size": 0,
  "experimental_skip_move_to_device": true,
  "flash_attention": false,
  "fp16": false,
  "gradient_accumulation_steps": 32,
  "gradient_checkpointing": false,
  "group_by_length": false,
  "hf_use_auth_token": true,
  "include_tkps": true,
  "is_falcon_derived_model": false,
  "is_llama_derived_model": false,
  "is_mistral_derived_model": false,
  "learning_rate": 0.0005,
  "liger_cross_entropy": false,
  "liger_fused_linear_cross_entropy": true,
  "liger_glu_activation": true,
  "liger_rms_norm": true,
  "liger_rope": true,
  "lisa_layers_attribute": "model.layers",
  "load_best_model_at_end": false,
  "load_in_4bit": false,
  "load_in_8bit": false,
  "local_rank": 0,
  "logging_steps": 1,
  "loraplus_lr_embedding": 1e-06,
  "lr_scheduler": "cosine",
  "mean_resizing_embeddings": false,
  "micro_batch_size": 1,
  "model_config_type": "diffllama",
  "num_epochs": 1.0,
  "optimizer": "adamw_torch",
  "otel_metrics_host": "localhost",
  "otel_metrics_port": 8000,
  "output_dir": "./output/model",
  "pad_to_sequence_len": true,
  "plugins": [
    "axolotl.integrations.liger.LigerPlugin"
  ],
  "pretrain_multipack_attn": true,
  "profiler_steps_start": 0,
  "qlora_sharded_model_loading": false,
  "ray_num_workers": 1,
  "remove_unused_columns": false,
  "resources_per_worker": {
    "GPU": 1
  },
  "sample_packing": true,
  "sample_packing_bin_size": 200,
  "sample_packing_group_size": 100000,
  "save_only_model": false,
  "save_safetensors": true,
  "save_steps": 100,
  "save_strategy": "steps",
  "save_total_limit": 1,
  "sequence_len": 4096,
  "shuffle_before_merging_datasets": false,
  "shuffle_merged_datasets": true,
  "skip_prepare_dataset": false,
  "streaming_multipack_buffer_size": 10000,
  "strict": false,
  "tensor_parallel_size": 1,
  "tf32": false,
  "tiled_mlp_use_original_mlp": true,
  "tokenizer_config": "output/model_7e4",
  "tokenizer_save_jinja_files": true,
  "tokenizer_type": "AutoTokenizer",
  "torch_dtype": "torch.bfloat16",
  "train_on_inputs": false,
  "trl": {
    "log_completions": false,
    "mask_truncated_completions": false,
    "ref_model_mixup_alpha": 0.9,
    "ref_model_sync_steps": 64,
    "scale_rewards": true,
    "sync_ref_model": false,
    "use_vllm": false,
    "vllm_server_host": "0.0.0.0",
    "vllm_server_port": 8000
  },
  "type_of_model": "AutoModelForCausalLM",
  "use_otel_metrics": false,
  "use_ray": false,
  "use_wandb": true,
  "val_set_size": 0.002,
  "vllm": {
    "device": "auto",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
    "host": "0.0.0.0",
    "port": 8000
  },
  "wandb_entity": "tepic",
  "wandb_name": "diffllama-sft-datapilot",
  "wandb_project": "diffllama",
  "warmup_steps": 20,
  "weight_decay": 0.01,
  "world_size": 1
}
[2026-03-15 08:11:17,157] [INFO] [axolotl.cli.utils.load.load_model_and_tokenizer:40] [PID:5120] loading tokenizer... output/model_7e4
[2026-03-15 08:11:17,719] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:285] [PID:5120] EOS: 2 / </s>
[2026-03-15 08:11:17,720] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:286] [PID:5120] BOS: 1 / <s>
[2026-03-15 08:11:17,720] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:287] [PID:5120] PAD: 3 / <pad>
[2026-03-15 08:11:17,720] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:288] [PID:5120] UNK: 0 / <unk>
[2026-03-15 08:11:17,720] [INFO] [axolotl.cli.utils.load.load_model_and_tokenizer:43] [PID:5120] loading model...
[2026-03-15 08:11:17,725] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:87] [PID:5120] Patched Trainer.evaluation_loop with nanmean loss calculation
[2026-03-15 08:11:17,726] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:138] [PID:5120] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
[2026-03-15 08:11:17,788] [WARNING] [axolotl.integrations.liger.plugin.warning_once:46] [PID:5120] Applied ONLY liger_fused_linear_cross_entropy genericpatches for model type: diffllama
[2026-03-15 08:11:17,788] [WARNING] [axolotl.integrations.liger.plugin.warning_once:46] [PID:5120] Liger + diffllama generic FLCE support is experimental and may not work as expected.
[2026-03-15 08:11:19,249] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:5120] Memory usage after model load 0.000GB ()
* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://48ecbc6e83cc897113.gradio.live

Using existing dataset file at: .gradio/flagged/dataset1.csv
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://48ecbc6e83cc897113.gradio.live