gemma3-1B_0_split / slurm.out
Nicolas-BZRD · Upload folder using huggingface_hub · commit cf33aaa (verified)
3: W1124 00:03:06.850000 675180 torch/distributed/run.py:792]
3: W1124 00:03:06.850000 675180 torch/distributed/run.py:792] *****************************************
3: W1124 00:03:06.850000 675180 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
3: W1124 00:03:06.850000 675180 torch/distributed/run.py:792] *****************************************
0: W1124 00:03:06.866000 4127050 torch/distributed/run.py:792]
0: W1124 00:03:06.866000 4127050 torch/distributed/run.py:792] *****************************************
0: W1124 00:03:06.866000 4127050 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
0: W1124 00:03:06.866000 4127050 torch/distributed/run.py:792] *****************************************
2: W1124 00:03:06.882000 628563 torch/distributed/run.py:792]
2: W1124 00:03:06.882000 628563 torch/distributed/run.py:792] *****************************************
2: W1124 00:03:06.882000 628563 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2: W1124 00:03:06.882000 628563 torch/distributed/run.py:792] *****************************************
1: W1124 00:03:06.884000 2620875 torch/distributed/run.py:792]
1: W1124 00:03:06.884000 2620875 torch/distributed/run.py:792] *****************************************
1: W1124 00:03:06.884000 2620875 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
1: W1124 00:03:06.884000 2620875 torch/distributed/run.py:792] *****************************************
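The banner above is torchrun's standard notice: it defaults `OMP_NUM_THREADS` to 1 per process to avoid oversubscribing the node, and asks the user to tune it. A minimal sketch of the tuning it suggests — the helper name and the core/rank counts are illustrative, not from this log:

```python
import os

# Hypothetical helper: choose OMP_NUM_THREADS as CPU cores divided by the
# number of local ranks, instead of torchrun's conservative default of 1.
def omp_threads(cpu_cores: int, local_world_size: int) -> int:
    return max(1, cpu_cores // local_world_size)

# e.g. set before launching torchrun on a 96-core node with 4 local ranks
os.environ["OMP_NUM_THREADS"] = str(omp_threads(96, 4))  # "24"
```

Exporting the variable in the launch script before `torchrun` starts has the same effect.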
0: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:4127210] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
0: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:4127210] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
1: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:2620950] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
1: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:2620950] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
3: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:675256] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
3: [2025-11-24 00:03:26,198] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:675256] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
2: [2025-11-24 00:03:26,199] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:628638] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
2: [2025-11-24 00:03:26,199] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:628638] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
0: [2025-11-24 00:03:57,538] [WARNING] [axolotl.utils.config.normalize_config:139] [PID:4127210] [RANK:0] Invalid value for save_steps (1.6666666666666667) from saves_per_epoch and/or num_epochs. Saving at training end only.
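A hedged sketch (not Axolotl's actual code) of how the fractional `save_steps` in the warning above can arise: one save per epoch combined with `num_epochs: 0.6` implies a save interval of 1/0.6 of the run, which exceeds 1.0 and so cannot be used as a fraction of total steps — hence the fallback to saving at training end only.

```python
# Values from the config logged below; the formula is an assumption that
# happens to reproduce the warning's value exactly.
saves_per_epoch = 1
num_epochs = 0.6
save_steps = 1 / (saves_per_epoch * num_epochs)
print(save_steps)  # 1.6666666666666667, the value in the warning
```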
0: [2025-11-24 00:03:57,695] [INFO] [axolotl.cli.config.load_cfg:245] [PID:4127210] [RANK:0] config:
0: {
0: "activation_offloading": false,
0: "auto_resume_from_checkpoints": true,
0: "axolotl_config_path": "/lustre/fswork/projects/rech/dgo/udv55np/train/tmp/1763938979818356030.yaml",
0: "base_model": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-1b",
0: "base_model_config": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-1b",
0: "batch_size": 16,
0: "bf16": true,
0: "capabilities": {
0: "bf16": true,
0: "compute_capability": "sm_90",
0: "fp8": false,
0: "n_gpu": 16,
0: "n_node": 1
0: },
0: "chat_template": "gemma3",
0: "context_parallel_size": 1,
0: "dataloader_num_workers": 16,
0: "dataloader_pin_memory": true,
0: "dataloader_prefetch_factor": 256,
0: "dataset_prepared_path": "/lustre/fswork/projects/rech/dgo/udv55np/dataset_gemma/Nemotron-Super-49B-v1_5/split_0",
0: "dataset_processes": 192,
0: "datasets": [
0: {
0: "chat_template": "tokenizer_default",
0: "data_files": [
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0007.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0009.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0005.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0006.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0014.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0010.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0012.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0008.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0001.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0002.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0013.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0015.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0004.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0011.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0000.jsonl",
0: "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0003.jsonl"
0: ],
0: "ds_type": "json",
0: "field_messages": "conversations",
0: "message_property_mappings": {
0: "content": "content",
0: "role": "role"
0: },
0: "path": "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking",
0: "trust_remote_code": false,
0: "type": "chat_template"
0: }
0: ],
0: "ddp": true,
0: "deepspeed": {
0: "bf16": {
0: "enabled": true
0: },
0: "gradient_accumulation_steps": "auto",
0: "gradient_clipping": "auto",
0: "train_batch_size": "auto",
0: "train_micro_batch_size_per_gpu": "auto",
0: "wall_clock_breakdown": false,
0: "zero_optimization": {
0: "contiguous_gradients": true,
0: "overlap_comm": true,
0: "reduce_bucket_size": "auto",
0: "stage": 3,
0: "stage3_gather_16bit_weights_on_model_save": true,
0: "stage3_param_persistence_threshold": "auto",
0: "stage3_prefetch_bucket_size": "auto",
0: "sub_group_size": 0
0: }
0: },
0: "device": "cuda:0",
0: "device_map": {
0: "": 0
0: },
0: "dion_rank_fraction": 1.0,
0: "dion_rank_multiple_of": 1,
0: "env_capabilities": {
0: "torch_version": "2.6.0"
0: },
0: "eot_tokens": [
0: "<end_of_turn>"
0: ],
0: "eval_batch_size": 1,
0: "eval_causal_lm_metrics": [
0: "sacrebleu",
0: "comet",
0: "ter",
0: "chrf"
0: ],
0: "eval_max_new_tokens": 128,
0: "eval_sample_packing": true,
0: "eval_table_size": 0,
0: "evals_per_epoch": 0,
0: "flash_attention": true,
0: "fp16": false,
0: "gradient_accumulation_steps": 1,
0: "gradient_checkpointing": true,
0: "gradient_checkpointing_kwargs": {
0: "use_reentrant": true
0: },
0: "learning_rate": 2e-05,
0: "lisa_layers_attribute": "model.layers",
0: "load_best_model_at_end": false,
0: "load_in_4bit": false,
0: "load_in_8bit": false,
0: "local_rank": 0,
0: "logging_steps": 10,
0: "lora_dropout": 0.0,
0: "loraplus_lr_embedding": 1e-06,
0: "lr_scheduler": "warmup_stable_decay",
0: "lr_scheduler_kwargs": {
0: "min_lr_ratio": 0.1,
0: "num_decay_steps": 200
0: },
0: "max_prompt_len": 512,
0: "mean_resizing_embeddings": false,
0: "micro_batch_size": 1,
0: "model_config_type": "gemma3_text",
0: "num_epochs": 0.6,
0: "optimizer": "adamw_torch_fused",
0: "output_dir": "/lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0",
0: "pad_to_sequence_len": true,
0: "pretrain_multipack_attn": true,
0: "pretrain_multipack_buffer_size": 10000,
0: "profiler_steps_start": 0,
0: "qlora_sharded_model_loading": false,
0: "ray_num_workers": 1,
0: "resources_per_worker": {
0: "GPU": 1
0: },
0: "sample_packing": true,
0: "sample_packing_bin_size": 200,
0: "sample_packing_group_size": 100000,
0: "save_only_model": true,
0: "save_safetensors": true,
0: "save_total_limit": 20,
0: "saves_per_epoch": 1,
0: "sequence_len": 16384,
0: "shuffle_before_merging_datasets": false,
0: "shuffle_merged_datasets": true,
0: "skip_prepare_dataset": false,
0: "strict": false,
0: "tensor_parallel_size": 1,
0: "tf32": false,
0: "tiled_mlp_use_original_mlp": true,
0: "tokenizer_config": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-27b",
0: "torch_dtype": "torch.bfloat16",
0: "train_on_inputs": false,
0: "trl": {
0: "log_completions": false,
0: "mask_truncated_completions": false,
0: "ref_model_mixup_alpha": 0.9,
0: "ref_model_sync_steps": 64,
0: "scale_rewards": true,
0: "sync_ref_model": false,
0: "use_vllm": false,
0: "vllm_server_host": "0.0.0.0",
0: "vllm_server_port": 8000
0: },
0: "use_ray": false,
0: "use_tensorboard": true,
0: "val_set_size": 0.0,
0: "vllm": {
0: "device": "auto",
0: "dtype": "auto",
0: "gpu_memory_utilization": 0.9,
0: "host": "0.0.0.0",
0: "port": 8000
0: },
0: "warmup_steps": 100,
0: "weight_decay": 0.0,
0: "world_size": 16
0: }
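The config above selects `lr_scheduler: warmup_stable_decay` with `warmup_steps: 100`, `num_decay_steps: 200`, and `min_lr_ratio: 0.1`. A minimal sketch of that schedule shape, assuming linear warmup and linear decay (the function and its exact interpolation are illustrative, not Axolotl's implementation):

```python
def wsd_lr_multiplier(step: int, total_steps: int,
                      warmup_steps: int = 100,
                      num_decay_steps: int = 200,
                      min_lr_ratio: float = 0.1) -> float:
    """Warmup-stable-decay sketch: ramp 0 -> 1 over warmup_steps, hold at 1,
    then decay linearly to min_lr_ratio over the final num_decay_steps."""
    if step < warmup_steps:
        return step / warmup_steps
    decay_start = total_steps - num_decay_steps
    if step < decay_start:
        return 1.0
    frac = (step - decay_start) / num_decay_steps
    return 1.0 - frac * (1.0 - min_lr_ratio)
```

With `learning_rate: 2e-05`, the effective LR at each step would be `2e-5 * wsd_lr_multiplier(step, total_steps)`.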
0: [2025-11-24 00:03:57,696] [INFO] [axolotl.cli.checks.check_user_token:35] [PID:4127210] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
1: [2025-11-24 00:03:59,617] [INFO] [axolotl.utils.data.sft._load_raw_datasets:314] [PID:2620950] [RANK:0] Loading raw datasets...
1: [2025-11-24 00:04:05,080] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:88] [PID:2620950] [RANK:0] Loading dataset: /lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking with base_type: chat_template and prompt_style: None
1: Dropping Long Sequences (>16384) (num_proc=192): 100%|██████████| 557277/557277 [00:03<00:00, 140183.34 examples/s]
1: Drop Samples with Zero Trainable Tokens (num_proc=192): 100%|██████████| 556595/556595 [00:05<00:00, 107236.31 examples/s]
1: Add position_id column (Sample Packing) (num_proc=192): 100%|██████████| 556595/556595 [00:05<00:00, 111162.45 examples/s]
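The "Add position_id column (Sample Packing)" step above reflects a common sample-packing convention: when several short samples share one packed sequence, each sample's position ids restart at 0 so positional encodings treat the samples independently. A toy sketch of that idea, assuming simple concatenation (the helper is hypothetical, not Axolotl's code):

```python
def position_ids_for_pack(sample_lengths: list[int]) -> list[int]:
    """Concatenate per-sample position ids, restarting at 0 per sample."""
    ids: list[int] = []
    for n in sample_lengths:
        ids.extend(range(n))
    return ids

position_ids_for_pack([3, 2])  # -> [0, 1, 2, 0, 1]
```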
1: Saving the dataset (141/192 shards): 79%|███████▉ | 442153/556595 [00:03<00:00, 169693.92 examples/s]
1: β–ˆβ–ˆβ–ˆβ–‰ | 442153/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (143/192 shards): 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 445052/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (144/192 shards): 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 445052/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (145/192 shards): 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 447052/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (146/192 shards): 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 454648/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (147/192 shards): 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 455547/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (148/192 shards): 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 458446/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (149/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 467143/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (150/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 467143/556595 [00:03<00:00, 169693.92
1: examples/s] Saving the dataset (151/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 468042/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (152/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 468042/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (153/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 468042/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (154/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 468941/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (155/192 shards): 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 469840/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (156/192 shards): 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 471840/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (157/192 shards): 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 474739/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (158/192 shards): 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 474739/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (159/192 shards):
1: 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 474739/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (160/192 shards): 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 474739/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (161/192 shards): 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 475638/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (162/192 shards): 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 484537/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (163/192 shards): 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 484537/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (164/192 shards): 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 496132/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (165/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 503929/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (166/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 503929/556595 [00:03<00:00, 169693.92 examples/s] Saving the dataset (167/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 503929/55659
1: 5 [00:03<00:00, 169693.92 examples/s] Saving the dataset (167/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 506828/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (168/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 508828/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (169/192 shards): 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 508828/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (170/192 shards): 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 509727/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (171/192 shards): 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 518221/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (172/192 shards): 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 521119/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (173/192 shards): 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 524017/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (174/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 526915/556595 [00:03<00:00, 468570.22 examples
1: /s] Saving the dataset (175/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 526915/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (176/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 527814/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (177/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 527814/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (178/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 529611/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (179/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 529611/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (180/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 529611/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (181/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 529611/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (182/192 shards): 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 529611/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (183/192
1: shards): 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 538306/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (184/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (185/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (186/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (187/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (188/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (189/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (190/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (191/192 shards): 99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
1: β–ˆβ–ˆβ–ˆβ–‰| 553697/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (192/192 shards): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 556595/556595 [00:03<00:00, 468570.22 examples/s] Saving the dataset (192/192 shards): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 556595/556595 [00:03<00:00, 149145.23 examples/s]
0: [2025-11-24 00:05:58,155] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:472] [PID:4127210] [RANK:0] Loading prepared dataset from disk at /lustre/fswork/projects/rech/dgo/udv55np/dataset_gemma/Nemotron-Super-49B-v1_5/split_0/06698e902d3dba325ca34849b1dea5ea...
0: [2025-11-24 00:07:08,642] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:4127210] [RANK:0] gather_len_batches: [18975, 18976, 18976, 18976, 18976, 18975, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976]
0: [2025-11-24 00:07:08,811] [INFO] [axolotl.utils.trainer.calc_sample_packing_eff_est:495] [PID:4127210] [RANK:0] sample_packing_eff_est across ranks: [0.9988827705383301, 0.9989354014396667, 0.9989354014396667, 0.9988827705383301, 0.9989354014396667, 0.9988827705383301, 0.9988827705383301, 0.9989354014396667, 0.9988827705383301, 0.9988827705383301, 0.9988827705383301, 0.9989880323410034, 0.9989354014396667, 0.9988827705383301, 0.9989354014396667, 0.9989354014396667]
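The per-rank packing-efficiency estimates above are easier to read as a single summary number. A minimal sketch (plain Python, using the 16 values from the log line) that aggregates them:

```python
from statistics import mean

# sample_packing_eff_est values reported by axolotl for the 16 ranks above
eff = [
    0.9988827705383301, 0.9989354014396667, 0.9989354014396667, 0.9988827705383301,
    0.9989354014396667, 0.9988827705383301, 0.9988827705383301, 0.9989354014396667,
    0.9988827705383301, 0.9988827705383301, 0.9988827705383301, 0.9989880323410034,
    0.9989354014396667, 0.9988827705383301, 0.9989354014396667, 0.9989354014396667,
]
# Mean efficiency across ranks: ~0.99891, i.e. packed batches are ~99.9% full.
print(f"mean packing efficiency: {mean(eff):.6f}")
```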
0: [2025-11-24 00:07:08,819] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:127] [PID:4127210] [RANK:0] Maximum number of steps set at 711
0: [2025-11-24 00:07:09,986] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:110] [PID:4127210] [RANK:0] Patched Trainer.evaluation_loop with nanmean loss calculation
0: [2025-11-24 00:07:09,987] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:164] [PID:4127210] [RANK:0] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
0: The following generation flags are not valid and may be ignored: ['cache_implementation']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
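The warning above points at `TRANSFORMERS_VERBOSITY` for more detail. A minimal sketch of raising the library's log level; note the environment variable only takes effect if set before `transformers` is imported (the equivalent runtime call is `transformers.utils.logging.set_verbosity_info()`):

```python
import os

# Raise Transformers logging to "info" to see the details behind the
# "generation flags are not valid" warning. Must be set before importing
# transformers; valid levels: debug, info, warning, error, critical.
os.environ["TRANSFORMERS_VERBOSITY"] = "info"
```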
0: [2025-11-24 00:07:22,370] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:345] [PID:4127210] [RANK:0] Converting modules to torch.bfloat16
0: [2025-11-24 00:08:20,774] [INFO] [axolotl.train.save_initial_configs:416] [PID:4127210] [RANK:0] Pre-saving tokenizer to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0...
0: [2025-11-24 00:08:21,511] [INFO] [axolotl.train.save_initial_configs:419] [PID:4127210] [RANK:0] Pre-saving model config to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0...
0: [2025-11-24 00:08:21,526] [INFO] [axolotl.train.execute_training:203] [PID:4127210] [RANK:0] Starting trainer...
0: [2025-11-24 00:09:55,377] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:4127210] [RANK:0] gather_len_batches: [18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976, 18976]
0: Parameter Offload - Persistent parameters statistics: param_count = 157, numel = 134272
2: It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
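The warning above recommends reloading Gemma3 checkpoints with eager attention. A minimal sketch of the call it suggests; the helper name and the lazy import are our additions, and the checkpoint path is a placeholder:

```python
def load_gemma3_eager(checkpoint_path: str):
    """Load a Gemma3 causal-LM checkpoint with the `eager` attention
    implementation, as the training warning recommends instead of
    `flash_attention_2`."""
    # Imported lazily so this sketch stays importable without model weights.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        checkpoint_path,
        attn_implementation="eager",  # instead of "flash_attention_2"
    )
```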
0: {'loss': 1.1018, 'grad_norm': 1.7590831241890867, 'learning_rate': 3.62e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.01}
0:   3%|β–Ž | 18/711 [03:39<13:19, 1.15s/it]
0: {'loss': 0.9902, 'grad_norm': 1.160769879445861, 'learning_rate': 5.420000000000001e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.02}
0: {'loss': 0.9244, 'grad_norm': 1.058149478414618, 'learning_rate': 7.22e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.03}
0:   5%|▍ | 34/711 [03:55<11:26, 1.01s/it]
0: {'loss': 0.8932, 'grad_norm': 0.9348341250213281, 'learning_rate': 9.020000000000002e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.03}
0: {'loss': 0.8577, 'grad_norm': 0.794975085250345, 'learning_rate': 1.0820000000000001e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.04}
0:   7%|β–‹ | 50/711 [04:12<11:11, 1.02s/it]
0: {'loss': 0.8286, 'grad_norm': 0.8899530497730146, 'learning_rate': 1.2620000000000001e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.05}
0:  10%|β–‰ | 68/711 [04:30<10:51, 1.01s/it]
0: {'loss': 0.8345, 'grad_norm': 0.9574271371234939, 'learning_rate': 1.4420000000000001e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.06}
0: {'loss': 0.8208, 'grad_norm': 0.9418691707757363, 'learning_rate': 1.6220000000000004e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.07}
0:  12%|β–ˆβ– | 84/711 [04:46<10:40, 1.02s/it]
0: {'loss': 0.7899, 'grad_norm': 0.9662815954776516, 'learning_rate': 1.802e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.08}
0: {'loss': 0.7661, 'grad_norm': 1.2089037525815847, 'learning_rate': 1.982e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.08}
0:  14%|β–ˆβ– | 99/711 [05:02<10:24, 1.02s/it]
0: {'loss': 0.8007, 'grad_norm': 0.8960002301387379, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.09}
0:  16%|β–ˆβ–‹ | 116/711 [05:19<10:04, 1.02s/it]
0: {'loss': 0.7871, 'grad_norm': 1.0229518399908935, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.1}
0: {'loss': 0.763, 'grad_norm': 1.0001793363102398, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.11}
0:  18%|β–ˆβ–Š | 131/711 [05:34<09:48, 1.02s/it]
0: {'loss': 0.767, 'grad_norm': 0.9082590810280519, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.12}
0:  21%|β–ˆβ–ˆ | 148/711 [05:51<09:31, 1.02s/it]
0: {'loss': 0.7652, 'grad_norm': 0.9284057369823192, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.13}
0: {'loss': 0.7668, 'grad_norm': 0.8767250768222354, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.13}
0:  23%|β–ˆβ–ˆβ–Ž | 163/711 [06:07<09:17, 1.02s/it]
0: {'loss': 0.7515, 'grad_norm': 0.9756062058818691, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.14}
0:  25%|β–ˆβ–ˆβ–Œ | 179/711 [06:23<09:02, 1.02s/it]
0: {'loss': 0.7428, 'grad_norm': 0.9528470394392502, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.15}
0: {'loss': 0.7358, 'grad_norm': 0.9445388231828549, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.16}
0:  27%|██▋       | 194/711 [06:38<08:45, 1.02s/it]
0: {'loss': 0.7419, 'grad_norm': 0.9093993578822128, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.17}
0:  30%|██▉       | 210/711 [06:55<08:28, 1.01s/it]
0: {'loss': 0.7289, 'grad_norm': 0.9027102829983507, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.18}
0: {'loss': 0.7468, 'grad_norm': 0.8749606638751157, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.19}
0:  32%|███▏      | 225/711 [07:10<08:14, 1.02s/it]
0: {'loss': 0.7306, 'grad_norm': 0.9160800008404012, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.19}
0:  34%|███▍      | 240/711 [07:25<07:56, 1.01s/it]
0: {'loss': 0.7361, 'grad_norm': 0.8938800949879834, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.2}
0: {'loss': 0.7165, 'grad_norm': 0.8865745843602661, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.21}
0:  36%|███▌      | 254/711 [07:39<07:42, 1.01s/it]
0: {'loss': 0.7099, 'grad_norm': 0.8798367435457611, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.22}
0:  38%|███▊      | 270/711 [07:55<07:27, 1.01s/it]
0: {'loss': 0.7113, 'grad_norm': 0.9763568156539483, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.23}
0: {'loss': 0.7247, 'grad_norm': 0.9774181687381295, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.24}
0:  40%|███▉      | 284/711 [08:10<07:16, 1.02s/it]
0: {'loss': 0.6936, 'grad_norm': 0.8964349971328741, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.24}
0:  42%|████▏     | 299/711 [08:25<06:59, 1.02s/it]
0: {'loss': 0.707, 'grad_norm': 0.9596158535601407, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.25}
0: {'loss': 0.7079, 'grad_norm': 0.8256624207228515, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.26}
0:  44%|████▍     | 313/711 [08:39<06:47, 1.02s/it]
0: {'loss': 0.7113, 'grad_norm': 0.8590726758549844, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.27}
0:  46%|████▌     | 328/711 [08:55<06:29, 1.02s/it]
0: {'loss': 0.7097, 'grad_norm': 0.9011059611829694, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.28}
0: {'loss': 0.695, 'grad_norm': 0.8452924322256501, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.29}
0:  48%|████▊     | 341/711 [09:08<06:15, 1.01s/it]
0: {'loss': 0.7032, 'grad_norm': 0.8466692851184044, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.3}
0:  50%|█████     | 357/711 [09:24<05:58, 1.01s/it]
0: {'loss': 0.7127, 'grad_norm': 0.8028586105919348, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.3}
0: {'loss': 0.7049, 'grad_norm': 0.8418301197643927, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.31}
0:  52%|█████▏    | 370/711 [09:37<05:45, 1.01s/it]
0: {'loss': 0.7056, 'grad_norm': 0.8018926220188637, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.32}
0:  54%|█████▍    | 384/711 [09:52<05:32, 1.02s/it]
0: {'loss': 0.6974, 'grad_norm': 0.8565445228240428, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.33}
0:  56%|█████▌    | 399/711 [10:07<05:15, 1.01s/it]
0: {'loss': 0.6793, 'grad_norm': 0.9500235419290328, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.34}
0: {'loss': 0.6814, 'grad_norm': 0.8451661040419431, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.35}
0:  58%|█████▊    | 412/711 [10:20<05:05, 1.02s/it]
0: {'loss': 0.6906, 'grad_norm': 0.8679849121193738, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.35}
0:  60%|██████    | 427/711 [10:35<04:46, 1.01s/it]
0: {'loss': 0.6849, 'grad_norm': 0.8279829459119256, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.36}
0: {'loss': 0.6902, 'grad_norm': 0.870874120336189, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.37}
0:  62%|██████▏   | 440/711 [10:49<04:34, 1.01s/it]
0: {'loss': 0.681, 'grad_norm': 0.8159001929242623, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.38}
0:  64%|██████▍   | 454/711 [11:03<04:20, 1.02s/it]
0: {'loss': 0.6751, 'grad_norm': 0.9324021579543116, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.39}
0:  66%|██████▌   | 468/711 [11:17<04:05, 1.01s/it]
0: {'loss': 0.681, 'grad_norm': 1.1760742860535562, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.4}
0: {'loss': 0.6857, 'grad_norm': 0.7580929709516384, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.4}
0:  68%|██████▊   | 480/711 [11:29<03:54, 1.01s/it]
0: {'loss': 0.6551, 'grad_norm': 0.7688484391693519, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.41}
0:  69%|██████▉   | 494/711 [11:43<03:41, 1.02s/it]
0: {'loss': 0.665, 'grad_norm': 0.7880196713924965, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.42}
0:  71%|███████▏  | 508/711 [11:58<03:25, 1.01s/it]
0: {'loss': 0.6839, 'grad_norm': 0.815597267678134, 'learning_rate': 2e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.43}
0: {'loss': 0.6696, 'grad_norm': 0.8299691960041005, 'learning_rate': 1.9929032311830303e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.44}
0:  73%|███████▎  | 520/711 [12:10<03:14, 1.02s/it]
0: {'loss': 0.672, 'grad_norm': 0.8102022933620939, 'learning_rate': 1.9642643171092488e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.45}
0:  75%|███████▌  | 534/711 [12:24<02:59, 1.01s/it]
0: {'loss': 0.6854, 'grad_norm': 0.8052276313272076, 'learning_rate': 1.9143443472194178e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.46}
0:  77%|███████▋  | 548/711 [12:38<02:45, 1.02s/it]
0: {'loss': 0.6813, 'grad_norm': 0.8340810817094527, 'learning_rate': 1.8443725168471054e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.46}
0: {'loss': 0.6641, 'grad_norm': 0.85639870414892, 'learning_rate': 1.7560717646792704e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.47}
0:  79%|███████▉  | 560/711 [12:50<02:33, 1.01s/it]
0: {'loss': 0.6813, 'grad_norm': 0.7779683047257264, 'learning_rate': 1.651616348287679e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.48}
0:  81%|████████  | 574/711 [13:05<02:18, 1.01s/it]
0: {'loss': 0.6657, 'grad_norm': 0.8763000605487214, 'learning_rate': 1.5335783066915437e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.49}
0:  83%|████████▎ | 587/711 [13:18<02:06, 1.02s/it]
0: {'loss': 0.6539, 'grad_norm': 0.7886914559405968, 'learning_rate': 1.4048641282207624e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.5}
0: {'loss': 0.6791, 'grad_norm': 0.7967066194263275, 'learning_rate': 1.2686431831271523e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.58, 'epoch': 0.51}
0:  84%|████████▍ | 600/711 [13:31<01:55, 1.04s/it]
0: {'loss': 0.6489, 'grad_norm': 0.8076470421353021, 'learning_rate': 1.1282696831703156e-05, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.51}
0:  86%|████████▌ | 612/711 [13:44<01:41, 1.03s/it]
0: {'loss': 0.6488, 'grad_norm': 0.8168905882802561, 'learning_rate': 9.872000897921262e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.52}
0:  88%|████████▊ | 625/711 [13:57<01:27, 1.01s/it]
0: {'loss': 0.6461, 'grad_norm': 0.8173208558740509, 'learning_rate': 8.489080045646938e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.53}
0:  90%|████████▉ | 639/711 [14:11<01:13, 1.02s/it]
0: {'loss': 0.653, 'grad_norm': 0.7392727758714567, 'learning_rate': 7.167986375914347e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.54}
0: {'loss': 0.6542, 'grad_norm': 0.8009550021670708, 'learning_rate': 5.941249599330827e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.55}
0:  91%|█████████▏| 650/711 [14:22<01:02, 1.02s/it]
0: {'loss': 0.6595, 'grad_norm': 0.7139199193006327, 'learning_rate': 4.839076046641802e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.56}
0: β–ˆβ–| 651/711 [14:23<01:01, 1.02s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 652/711 [14:24<00:59, 1.02s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 653/711 [14:25<00:59, 1.02s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 654/711 [14:26<00:57, 1.02s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 655/711 [14:27<00:56, 1.01s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 656/711 [14:28<00:56, 1.02s/it] 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 657/711 [14:29<00:55, 1.02s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 658/711 [14:30<00:53, 1.02s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 659/711 [14:31<00:52, 1.02s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 660/711 [14:32<00:51, 1.01s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 660/711 [14:32<00:51, 1.01s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 661/711 [14:33<00:50, 1.02s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 662/711 [14:34<00:49, 1.02s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 663/711 [14:35<00:48, 1.01s/it] 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆοΏ½
0: {'loss': 0.6349, 'grad_norm': 0.6774414042634137, 'learning_rate': 3.888604888618787e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.56}
0:  95%|█████████▌| 676/711 [14:49<00:35, 1.01s/it]
0: {'loss': 0.6585, 'grad_norm': 0.6957704121795029, 'learning_rate': 3.11323987960523e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.57}
0:  97%|█████████▋| 689/711 [15:02<00:22, 1.01s/it]
0: {'loss': 0.6588, 'grad_norm': 0.7024004831662435, 'learning_rate': 2.532073079411971e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.58}
0: {'loss': 0.6492, 'grad_norm': 1.1172218022043618, 'learning_rate': 2.159414743441803e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.59}
0:  98%|█████████▊| 700/711 [15:13<00:11, 1.02s/it]
0: {'loss': 0.6439, 'grad_norm': 0.7226102601600088, 'learning_rate': 2.0044409567084157e-06, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.6}
0: [2025-11-24 00:25:23,227] [INFO] [axolotl.core.trainers.base._save:613] [PID:4127210] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0/checkpoint-711
0: [2025-11-24 00:25:25,011] [INFO] [axolotl.core.trainers.base._save:662] [PID:4127210] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: {'train_runtime': 927.6736, 'train_samples_per_second': 12.263, 'train_steps_per_second': 0.766, 'train_loss': 0.7229751322507523, 'memory/max_mem_active(gib)': 52.06, 'memory/max_mem_allocated(gib)': 52.06, 'memory/device_mem_reserved(gib)': 60.79, 'epoch': 0.6}
0: 100%|██████████| 711/711 [15:27<00:00, 1.30s/it]
0: [2025-11-24 00:25:28,197] [INFO] [axolotl.train.save_trained_model:228] [PID:4127210] [RANK:0] Training completed! Saving trained model to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0.
0: [2025-11-24 00:25:29,151] [INFO] [axolotl.core.trainers.base._save:613] [PID:4127210] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0
0: [2025-11-24 00:25:31,046] [INFO] [axolotl.core.trainers.base._save:662] [PID:4127210] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: [2025-11-24 00:25:31,413] [INFO] [axolotl.train.save_trained_model:350] [PID:4127210] [RANK:0] Model successfully saved to /lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-1b/0