| /home/ubuntu/Isaac-GR00T/.venv/lib/python3.10/site-packages/albumentations/__init__.py:13: UserWarning: A new version of Albumentations is available: 2.0.8 (you have 1.4.18). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1. |
| check_for_updates() |
| /home/ubuntu/Isaac-GR00T/gr00t/experiment/experiment.py:98: UserWarning: image_crop_size and image_target_size will be deprecated in the future. Please use shortest_image_edge and crop_fraction instead. |
| warnings.warn( |
| 05/26/2026 21:34:00 - INFO - Saved config to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg |
| wandb: Currently logged in as: lucafrat (lucafrat-microsoft) to https://api.wandb.ai. Use `wandb login |
| wandb: setting up run zw5ihkoo |
| wandb: Tracking run with wandb version 0.23.0 |
| wandb: Run data is saved locally in /home/ubuntu/Isaac-GR00T/wandb/run-20260526_213400-zw5ihkoo |
| wandb: Run `wandb offline` to turn off syncing. |
| wandb: Syncing run g1_finetune-20260526-213350-gpu0 |
| wandb: ⭐️ View project at https://wandb.ai/lucafrat-microsoft/groot-finetune |
| wandb: 🚀 View run at https://wandb.ai/lucafrat-microsoft/groot-finetune/runs/zw5ihkoo |
| Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3VLForConditionalGeneration is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)` |
| Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3VLModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)` |
| Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3VLVisionModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)` |
| Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3VLTextModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)` |
| /home/ubuntu/Isaac-GR00T/gr00t/model/modules/dit.py:255: FutureWarning: Accessing config attribute `compute_dtype` directly via 'AlternateVLDiT' object attribute is deprecated. Please access 'compute_dtype' over 'AlternateVLDiT's config object instead, e.g. 'unet.config.compute_dtype'. |
| embedding_dim=self.inner_dim, compute_dtype=self.compute_dtype |
| /home/ubuntu/Isaac-GR00T/gr00t/model/modules/dit.py:286: FutureWarning: Accessing config attribute `output_dim` directly via 'AlternateVLDiT' object attribute is deprecated. Please access 'output_dim' over 'AlternateVLDiT's config object instead, e.g. 'unet.config.output_dim'. |
| self.proj_out_2 = nn.Linear(self.inner_dim, self.output_dim) |
| Total number of DiT parameters: 1091722240 |
| 05/26/2026 21:34:02 - INFO - Using AlternateVLDiT for diffusion model |
| Total number of SelfAttentionTransformer parameters: 201433088 |
|
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:00<00:00, 1.76it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 2.77it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 2.55it/s] |
| 05/26/2026 21:34:07 - INFO - Total parameters: 3,144,016,000 |
| 05/26/2026 21:34:07 - INFO - Trainable parameters: 1,620,515,968 (51.54%) |
|
Initializing datasets: 0%| | 0/1 [00:00<?, ?it/s]
Generating stats for /home/ubuntu/groot-files/dataset |
| Generated 64 shards for dataset /home/ubuntu/groot-files/dataset |
| Total steps: 64712, average shard length: 1011.125, shard length std: 17.115873480485885 |
|
Initializing datasets: 100%|██████████| 1/1 [00:00<00:00, 97.75it/s] |
| 05/26/2026 21:34:10 - INFO - Overriding statistics for embodiment 'unitree_g1_sonic' |
| 05/26/2026 21:34:10 - INFO - Saved dataset statistics for inference |
| 05/26/2026 21:34:14 - INFO - Starting training... |
| 05/26/2026 21:34:14 - WARNING - No valid checkpoint found in output directory (/home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0) |
| Current global step: 0 |
| Creating custom train dataloader |
|
0%| | 0/3000 [00:00<?, ?it/s]
Rank 0, Worker 0: Caching shard...
Rank 0, Worker 2: Caching shard... |
| Rank 0, Worker 4: Caching shard... |
| Rank 0, Worker 5: Caching shard... |
|
|
| Rank 0, Worker 3: Caching shard... |
| Rank 0, Worker 1: Caching shard... |
|
|
| Rank 0, Worker 3: Wait for shard 22 in dataset 0 in 16.93 seconds |
| Rank 0, Worker 3: Caching shard... |
| Rank 0, Worker 0: Wait for shard 33 in dataset 0 in 17.18 seconds |
| Rank 0, Worker 0: Caching shard... |
| Rank 0, Worker 5: Wait for shard 24 in dataset 0 in 17.26 seconds |
| Rank 0, Worker 5: Caching shard... |
| Rank 0, Worker 2: Wait for shard 3 in dataset 0 in 17.28 seconds |
| Rank 0, Worker 2: Caching shard... |
| Rank 0, Worker 1: Wait for shard 48 in dataset 0 in 17.39 seconds |
| Rank 0, Worker 1: Caching shard... |
| Rank 0, Worker 4: Wait for shard 37 in dataset 0 in 17.68 seconds |
| Rank 0, Worker 4: Caching shard... |
| Casting fp32 inputs back to torch.bfloat16 for flash-attn compatibility. |
| Could not estimate the number of tokens of the input, floating-point operations will not be computed |
|
0%| | 1/3000 [00:19<16:38:28, 19.98s/it]
0%| | 2/3000 [00:20<6:58:13, 8.37s/it]
0%| | 3/3000 [00:20<3:53:01, 4.67s/it]
0%| | 4/3000 [00:20<2:25:54, 2.92s/it]
0%| | 5/3000 [00:20<1:37:49, 1.96s/it]
0%| | 6/3000 [00:21<1:09:43, 1.40s/it]
0%| | 7/3000 [00:21<52:14, 1.05s/it]
0%| | 8/3000 [00:21<39:40, 1.26it/s]
0%| | 9/3000 [00:22<31:12, 1.60it/s]
0%| | 10/3000 [00:22<25:32, 1.95it/s]
{'loss': 1.2019, 'grad_norm': 0.16593393683433533, 'learning_rate': 6e-06} |
|
1%| | 20/3000 [00:25<13:08, 3.78it/s]
{'loss': 1.2001, 'grad_norm': 0.2364676296710968, 'learning_rate': 1.2666666666666668e-05} |
|
1%| | 30/3000 [00:27<13:25, 3.69it/s]
{'loss': 1.1652, 'grad_norm': 0.3223605155944824, 'learning_rate': 1.9333333333333333e-05} |
|
1%|▏ | 40/3000 [00:30<13:34, 3.63it/s]
{'loss': 1.1203, 'grad_norm': 0.288097620010376, 'learning_rate': 2.6000000000000002e-05} |
|
2%|▏ | 50/3000 [00:33<13:19, 3.69it/s]
{'loss': 1.1073, 'grad_norm': 0.2957666516304016, 'learning_rate': 3.266666666666667e-05} |
|
2%|▏ | 60/3000 [00:36<13:45, 3.56it/s]
{'loss': 1.0981, 'grad_norm': 0.3105441927909851, 'learning_rate': 3.933333333333333e-05} |
|
2%|▏ | 70/3000 [00:38<12:50, 3.80it/s]
{'loss': 1.0992, 'grad_norm': 0.2903534770011902, 'learning_rate': 4.600000000000001e-05} |
|
3%|▎ | 80/3000 [00:41<12:37, 3.85it/s]
{'loss': 1.0949, 'grad_norm': 0.2657025158405304, 'learning_rate': 5.266666666666666e-05} |
|
3%|▎ | 90/3000 [00:44<13:16, 3.65it/s]
{'loss': 1.0883, 'grad_norm': 0.22208809852600098, 'learning_rate': 5.9333333333333343e-05} |
|
3%|▎ | 100/3000 [00:46<13:04, 3.70it/s]
{'loss': 1.0812, 'grad_norm': 0.32055118680000305, 'learning_rate': 6.6e-05} |
|
4%|▎ | 110/3000 [00:49<12:38, 3.81it/s]
{'loss': 1.0723, 'grad_norm': 0.2965445816516876, 'learning_rate': 7.266666666666667e-05} |
|
4%|▍ | 120/3000 [00:52<12:14, 3.92it/s]
{'loss': 1.0464, 'grad_norm': 0.3756393790245056, 'learning_rate': 7.933333333333334e-05} |
|
4%|▍ | 130/3000 [00:54<12:19, 3.88it/s]
{'loss': 1.0023, 'grad_norm': 0.5002507567405701, 'learning_rate': 8.6e-05} |
|
5%|▍ | 140/3000 [00:57<11:57, 3.98it/s]
{'loss': 0.9625, 'grad_norm': 0.4813876152038574, 'learning_rate': 9.266666666666666e-05} |
|
5%|▌ | 150/3000 [00:59<11:49, 4.02it/s]
{'loss': 0.9308, 'grad_norm': 0.513647198677063, 'learning_rate': 9.933333333333334e-05} |
|
5%|▌ | 160/3000 [01:02<12:19, 3.84it/s]
{'loss': 0.8938, 'grad_norm': 0.8362369537353516, 'learning_rate': 9.999753945398704e-05} |
|
6%|▌ | 168/3000 [01:04<12:21, 3.82it/s]
Rank 0, Worker 0: Wait for shard 46 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
6%|▌ | 170/3000 [01:05<12:09, 3.88it/s]
{'loss': 0.862, 'grad_norm': 0.5613930821418762, 'learning_rate': 9.998903417374228e-05} |
|
6%|▌ | 175/3000 [01:06<12:11, 3.86it/s]
Rank 0, Worker 1: Wait for shard 59 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
6%|▌ | 176/3000 [01:06<11:59, 3.93it/s]
Rank 0, Worker 2: Wait for shard 0 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
6%|▌ | 177/3000 [01:07<12:02, 3.91it/s]
Rank 0, Worker 3: Wait for shard 2 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
6%|▌ | 179/3000 [01:07<12:23, 3.80it/s]
Rank 0, Worker 5: Wait for shard 9 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
6%|▌ | 180/3000 [01:07<12:19, 3.81it/s]
{'loss': 0.8221, 'grad_norm': 0.5763223767280579, 'learning_rate': 9.997445481536973e-05} |
|
6%|▌ | 184/3000 [01:08<12:35, 3.73it/s]
Rank 0, Worker 4: Wait for shard 18 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
6%|▋ | 190/3000 [01:10<12:26, 3.77it/s]
{'loss': 0.7922, 'grad_norm': 0.5941640734672546, 'learning_rate': 9.995380315038119e-05} |
|
7%|▋ | 200/3000 [01:13<12:42, 3.67it/s]
{'loss': 0.7702, 'grad_norm': 0.7477675080299377, 'learning_rate': 9.99270816881235e-05} |
|
7%|▋ | 210/3000 [01:16<13:36, 3.42it/s]
{'loss': 0.7553, 'grad_norm': 0.5304926633834839, 'learning_rate': 9.989429367547377e-05} |
|
7%|▋ | 220/3000 [01:19<13:29, 3.44it/s]
{'loss': 0.7369, 'grad_norm': 0.5700099468231201, 'learning_rate': 9.985544309644475e-05} |
|
8%|▊ | 230/3000 [01:21<12:22, 3.73it/s]
{'loss': 0.7222, 'grad_norm': 0.7928358316421509, 'learning_rate': 9.98105346717008e-05} |
|
8%|▊ | 240/3000 [01:24<12:26, 3.70it/s]
{'loss': 0.7114, 'grad_norm': 0.7667017579078674, 'learning_rate': 9.97595738579843e-05} |
|
8%|▊ | 250/3000 [01:27<12:06, 3.79it/s]
{'loss': 0.6861, 'grad_norm': 0.5060045123100281, 'learning_rate': 9.970256684745258e-05} |
|
9%|▊ | 260/3000 [01:29<11:26, 3.99it/s]
{'loss': 0.675, 'grad_norm': 0.6732467412948608, 'learning_rate': 9.963952056692549e-05} |
|
9%|▉ | 270/3000 [01:32<11:37, 3.91it/s]
{'loss': 0.6767, 'grad_norm': 0.6021248698234558, 'learning_rate': 9.957044267704384e-05} |
|
9%|▉ | 280/3000 [01:34<11:33, 3.92it/s]
{'loss': 0.6512, 'grad_norm': 0.6704761981964111, 'learning_rate': 9.949534157133844e-05} |
|
10%|▉ | 290/3000 [01:37<12:03, 3.75it/s]
{'loss': 0.6512, 'grad_norm': 0.7285646200180054, 'learning_rate': 9.941422637521035e-05} |
|
10%|█ | 300/3000 [01:40<12:18, 3.66it/s]
{'loss': 0.6196, 'grad_norm': 0.7110922336578369, 'learning_rate': 9.932710694482191e-05} |
|
10%|█ | 310/3000 [01:42<12:22, 3.62it/s]
{'loss': 0.6118, 'grad_norm': 0.6850391626358032, 'learning_rate': 9.923399386589933e-05} |
|
11%|█ | 320/3000 [01:45<12:32, 3.56it/s]
{'loss': 0.5941, 'grad_norm': 0.6997315883636475, 'learning_rate': 9.913489845244626e-05} |
|
11%|█ | 330/3000 [01:48<11:53, 3.74it/s]
{'loss': 0.5891, 'grad_norm': 0.760143518447876, 'learning_rate': 9.902983274536912e-05} |
|
11%|█▏ | 340/3000 [01:51<11:45, 3.77it/s]
{'loss': 0.5812, 'grad_norm': 0.759657084941864, 'learning_rate': 9.891880951101407e-05} |
|
12%|█▏ | 350/3000 [01:53<11:25, 3.87it/s]
{'loss': 0.5674, 'grad_norm': 0.8276827931404114, 'learning_rate': 9.880184223961573e-05} |
|
12%|█▏ | 360/3000 [01:56<11:24, 3.86it/s]
{'loss': 0.5544, 'grad_norm': 0.5555763244628906, 'learning_rate': 9.867894514365802e-05} |
|
Rank 0, Worker 0: Wait for shard 10 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
12%|█▏ | 363/3000 [01:57<11:30, 3.82it/s]
Rank 0, Worker 3: Wait for shard 20 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
12%|█▏ | 367/3000 [01:58<11:57, 3.67it/s]
Rank 0, Worker 1: Wait for shard 32 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
12%|█▏ | 368/3000 [01:58<11:45, 3.73it/s]
Rank 0, Worker 2: Wait for shard 34 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
12%|█▏ | 370/3000 [01:58<11:29, 3.81it/s]
{'loss': 0.5504, 'grad_norm': 0.6141887307167053, 'learning_rate': 9.855013315614725e-05} |
|
12%|█▏ | 371/3000 [01:59<11:31, 3.80it/s]
Rank 0, Worker 5: Wait for shard 51 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
13%|█▎ | 376/3000 [02:00<11:24, 3.83it/s]
Rank 0, Worker 4: Wait for shard 56 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
13%|█▎ | 380/3000 [02:01<11:42, 3.73it/s]
{'loss': 0.5447, 'grad_norm': 0.7782378792762756, 'learning_rate': 9.841542192879762e-05} |
|
13%|█▎ | 390/3000 [02:04<11:30, 3.78it/s]
{'loss': 0.5435, 'grad_norm': 0.8065223693847656, 'learning_rate': 9.82748278301294e-05} |
|
13%|█▎ | 400/3000 [02:07<12:47, 3.39it/s]
{'loss': 0.5344, 'grad_norm': 0.7179561257362366, 'learning_rate': 9.812836794348004e-05} |
|
14%|█▎ | 410/3000 [02:09<12:05, 3.57it/s]
{'loss': 0.5249, 'grad_norm': 0.7837668657302856, 'learning_rate': 9.797606006492841e-05} |
|
14%|█▍ | 420/3000 [02:12<11:31, 3.73it/s]
{'loss': 0.5199, 'grad_norm': 0.7675067782402039, 'learning_rate': 9.781792270113241e-05} |
|
14%|█▍ | 430/3000 [02:15<11:04, 3.87it/s]
{'loss': 0.5071, 'grad_norm': 0.8279472589492798, 'learning_rate': 9.765397506708023e-05} |
|
15%|█▍ | 440/3000 [02:17<10:39, 4.01it/s]
{'loss': 0.4926, 'grad_norm': 0.8260670304298401, 'learning_rate': 9.748423708375563e-05} |
|
15%|█▌ | 450/3000 [02:20<10:42, 3.97it/s]
{'loss': 0.4907, 'grad_norm': 0.881014347076416, 'learning_rate': 9.730872937571739e-05} |
|
15%|█▌ | 460/3000 [02:22<10:54, 3.88it/s]
{'loss': 0.4815, 'grad_norm': 0.9371006488800049, 'learning_rate': 9.712747326859315e-05} |
|
15%|ββ | 460/3000 [02:22<10:54, 3.88it/s]
15%|ββ | 461/3000 [02:23<10:53, 3.89it/s]
15%|ββ | 462/3000 [02:23<10:55, 3.87it/s]
15%|ββ | 463/3000 [02:23<10:43, 3.94it/s]
15%|ββ | 464/3000 [02:23<10:45, 3.93it/s]
16%|ββ | 465/3000 [02:24<10:53, 3.88it/s]
16%|ββ | 466/3000 [02:24<10:54, 3.87it/s]
16%|ββ | 467/3000 [02:24<10:55, 3.86it/s]
16%|ββ | 468/3000 [02:24<10:55, 3.86it/s]
16%|ββ | 469/3000 [02:25<11:10, 3.77it/s]
16%|ββ | 470/3000 [02:25<11:46, 3.58it/s]
{'loss': 0.4755, 'grad_norm': 0.8164715766906738, 'learning_rate': 9.69404907864883e-05} |
|
16%|ββ | 470/3000 [02:25<11:46, 3.58it/s]
16%|ββ | 471/3000 [02:25<11:29, 3.67it/s]
16%|ββ | 472/3000 [02:26<11:22, 3.71it/s]
16%|ββ | 473/3000 [02:26<11:25, 3.69it/s]
16%|ββ | 474/3000 [02:26<11:08, 3.78it/s]
16%|ββ | 475/3000 [02:26<10:55, 3.85it/s]
16%|ββ | 476/3000 [02:27<10:49, 3.89it/s]
16%|ββ | 477/3000 [02:27<10:49, 3.88it/s]
16%|ββ | 478/3000 [02:27<10:50, 3.88it/s]
16%|ββ | 479/3000 [02:27<10:44, 3.91it/s]
16%|ββ | 480/3000 [02:28<10:51, 3.87it/s]
{'loss': 0.4649, 'grad_norm': 0.7007074952125549, 'learning_rate': 9.674780464930979e-05} |
|
16%|ββ | 480/3000 [02:28<10:51, 3.87it/s]
16%|ββ | 481/3000 [02:28<11:04, 3.79it/s]
16%|ββ | 482/3000 [02:28<11:02, 3.80it/s]
16%|ββ | 483/3000 [02:28<10:54, 3.85it/s]
16%|ββ | 484/3000 [02:29<10:46, 3.89it/s]
16%|ββ | 485/3000 [02:29<10:55, 3.83it/s]
16%|ββ | 486/3000 [02:29<10:55, 3.83it/s]
16%|ββ | 487/3000 [02:29<11:06, 3.77it/s]
16%|ββ | 488/3000 [02:30<10:55, 3.83it/s]
16%|ββ | 489/3000 [02:30<10:53, 3.84it/s]
16%|ββ | 490/3000 [02:30<10:58, 3.81it/s]
{'loss': 0.4578, 'grad_norm': 0.9710970520973206, 'learning_rate': 9.654943827000548e-05} |
|
16%|ββ | 490/3000 [02:30<10:58, 3.81it/s]
16%|ββ | 491/3000 [02:30<11:22, 3.68it/s]
16%|ββ | 492/3000 [02:31<11:03, 3.78it/s]
16%|ββ | 493/3000 [02:31<11:07, 3.76it/s]
16%|ββ | 494/3000 [02:31<12:04, 3.46it/s]
16%|ββ | 495/3000 [02:32<11:39, 3.58it/s]
17%|ββ | 496/3000 [02:32<11:27, 3.64it/s]
17%|ββ | 497/3000 [02:32<11:48, 3.54it/s]
17%|ββ | 498/3000 [02:33<12:19, 3.39it/s]
17%|ββ | 499/3000 [02:33<12:13, 3.41it/s]
17%|ββ | 500/3000 [02:33<12:36, 3.31it/s]
{'loss': 0.4481, 'grad_norm': 0.7813271284103394, 'learning_rate': 9.634541575171929e-05} |
|
17%|ββ | 500/3000 [02:33<12:36, 3.31it/s]
Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-500/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-500 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-500/wandb_config.json |
|
17%|ββ | 501/3000 [03:11<8:06:30, 11.68s/it]
17%|ββ | 502/3000 [03:12<5:43:17, 8.25s/it]
17%|ββ | 503/3000 [03:12<4:03:02, 5.84s/it]
17%|ββ | 504/3000 [03:12<2:52:55, 4.16s/it]
17%|ββ | 505/3000 [03:12<2:03:51, 2.98s/it]
17%|ββ | 506/3000 [03:12<1:29:29, 2.15s/it]
17%|ββ | 507/3000 [03:13<1:05:27, 1.58s/it]
17%|ββ | 508/3000 [03:13<48:41, 1.17s/it]
17%|ββ | 509/3000 [03:13<36:53, 1.13it/s]
17%|ββ | 510/3000 [03:13<28:39, 1.45it/s]
{'loss': 0.4437, 'grad_norm': 0.8395043611526489, 'learning_rate': 9.613576188486253e-05} |
|
17%|ββ | 510/3000 [03:13<28:39, 1.45it/s]
17%|ββ | 511/3000 [03:14<22:56, 1.81it/s]
17%|ββ | 512/3000 [03:14<18:51, 2.20it/s]
17%|ββ | 513/3000 [03:14<16:05, 2.58it/s]
17%|ββ | 514/3000 [03:14<14:15, 2.91it/s]
17%|ββ | 515/3000 [03:15<12:57, 3.20it/s]
17%|ββ | 516/3000 [03:15<11:54, 3.48it/s]
17%|ββ | 517/3000 [03:15<11:09, 3.71it/s]
17%|ββ | 518/3000 [03:15<10:50, 3.82it/s]
17%|ββ | 519/3000 [03:16<10:30, 3.93it/s]
17%|ββ | 520/3000 [03:16<10:12, 4.05it/s]
{'loss': 0.4302, 'grad_norm': 0.8199443221092224, 'learning_rate': 9.59205021441015e-05} |
|
17%|ββ | 520/3000 [03:16<10:12, 4.05it/s]
17%|ββ | 521/3000 [03:16<10:12, 4.05it/s]
17%|ββ | 522/3000 [03:16<10:10, 4.06it/s]
17%|ββ | 523/3000 [03:16<10:01, 4.12it/s]
17%|ββ | 524/3000 [03:17<09:55, 4.16it/s]
18%|ββ | 525/3000 [03:17<09:55, 4.16it/s]
18%|ββ | 526/3000 [03:17<10:09, 4.06it/s]
18%|ββ | 527/3000 [03:17<10:41, 3.86it/s]
18%|ββ | 528/3000 [03:18<10:33, 3.90it/s]
18%|ββ | 529/3000 [03:18<10:25, 3.95it/s]
18%|ββ | 530/3000 [03:18<10:29, 3.92it/s]
{'loss': 0.4159, 'grad_norm': 0.7755788564682007, 'learning_rate': 9.569966268526232e-05} |
|
18%|ββ | 530/3000 [03:18<10:29, 3.92it/s]
18%|ββ | 531/3000 [03:19<10:50, 3.79it/s]
18%|ββ | 532/3000 [03:19<11:09, 3.69it/s]
18%|ββ | 533/3000 [03:19<11:02, 3.73it/s]
18%|ββ | 534/3000 [03:19<10:58, 3.74it/s]
18%|ββ | 535/3000 [03:20<11:06, 3.70it/s]
18%|ββ | 536/3000 [03:20<11:04, 3.71it/s]
18%|ββ | 537/3000 [03:20<11:11, 3.67it/s]
18%|ββ | 538/3000 [03:20<11:06, 3.70it/s]
18%|ββ | 539/3000 [03:21<11:01, 3.72it/s]
18%|ββ | 540/3000 [03:21<10:54, 3.76it/s]
{'loss': 0.3954, 'grad_norm': 0.8591013550758362, 'learning_rate': 9.54732703421526e-05} |
|
18%|ββ | 540/3000 [03:21<10:54, 3.76it/s]
18%|ββ | 541/3000 [03:21<10:59, 3.73it/s]
18%|ββ | 542/3000 [03:21<10:56, 3.74it/s]
18%|ββ | 543/3000 [03:22<10:44, 3.81it/s]
18%|ββ | 544/3000 [03:22<11:19, 3.61it/s]
18%|ββ | 545/3000 [03:22<11:39, 3.51it/s]
18%|ββ | 546/3000 [03:23<11:15, 3.63it/s]
18%|ββ | 547/3000 [03:23<11:06, 3.68it/s]
18%|ββ | 548/3000 [03:23<10:50, 3.77it/s]
18%|ββ | 549/3000 [03:23<10:41, 3.82it/s]
Rank 0, Worker 3: Wait for shard 57 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
18%|ββ | 550/3000 [03:24<10:43, 3.81it/s]
{'loss': 0.3959, 'grad_norm': 0.8200550079345703, 'learning_rate': 9.524135262330098e-05} |
|
18%|ββ | 550/3000 [03:24<10:43, 3.81it/s]
18%|ββ | 551/3000 [03:24<10:41, 3.82it/s]
18%|ββ | 552/3000 [03:24<10:42, 3.81it/s]
Rank 0, Worker 0: Wait for shard 38 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
18%|ββ | 553/3000 [03:24<10:43, 3.80it/s]
Rank 0, Worker 1: Wait for shard 61 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
18%|ββ | 554/3000 [03:25<10:57, 3.72it/s]
18%|ββ | 555/3000 [03:25<10:49, 3.76it/s]
19%|ββ | 556/3000 [03:25<10:48, 3.77it/s]
19%|ββ | 557/3000 [03:26<11:19, 3.60it/s]
19%|ββ | 558/3000 [03:26<11:40, 3.49it/s]
19%|ββ | 559/3000 [03:26<11:54, 3.42it/s]
19%|ββ | 560/3000 [03:26<11:47, 3.45it/s]
{'loss': 0.3763, 'grad_norm': 0.8531577587127686, 'learning_rate': 9.50039377086147e-05} |
|
19%|ββ | 560/3000 [03:26<11:47, 3.45it/s]
Rank 0, Worker 2: Wait for shard 4 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
19%|ββ | 561/3000 [03:27<11:47, 3.45it/s]
19%|ββ | 562/3000 [03:27<12:23, 3.28it/s]
19%|ββ | 563/3000 [03:27<12:31, 3.24it/s]
Rank 0, Worker 5: Wait for shard 55 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
19%|ββ | 564/3000 [03:28<12:18, 3.30it/s]
19%|ββ | 565/3000 [03:28<12:15, 3.31it/s]
19%|ββ | 566/3000 [03:28<12:12, 3.32it/s]
19%|ββ | 567/3000 [03:29<12:23, 3.27it/s]
19%|ββ | 568/3000 [03:29<12:09, 3.33it/s]
Rank 0, Worker 4: Wait for shard 62 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
19%|ββ | 569/3000 [03:29<11:44, 3.45it/s]
19%|ββ | 570/3000 [03:29<11:38, 3.48it/s]
{'loss': 0.3647, 'grad_norm': 0.7548572421073914, 'learning_rate': 9.476105444595534e-05} |
|
19%|ββ | 570/3000 [03:29<11:38, 3.48it/s]
19%|ββ | 571/3000 [03:30<12:01, 3.37it/s]
19%|ββ | 572/3000 [03:30<12:50, 3.15it/s]
19%|ββ | 573/3000 [03:30<12:20, 3.28it/s]
19%|ββ | 574/3000 [03:31<12:00, 3.37it/s]
19%|ββ | 575/3000 [03:31<11:59, 3.37it/s]
19%|ββ | 576/3000 [03:31<11:55, 3.39it/s]
19%|ββ | 577/3000 [03:32<11:31, 3.50it/s]
19%|ββ | 578/3000 [03:32<11:19, 3.57it/s]
19%|ββ | 579/3000 [03:32<11:36, 3.48it/s]
19%|ββ | 580/3000 [03:32<11:32, 3.49it/s]
{'loss': 0.3526, 'grad_norm': 0.8562734723091125, 'learning_rate': 9.451273234763371e-05} |
|
19%|ββ | 580/3000 [03:32<11:32, 3.49it/s]
19%|ββ | 581/3000 [03:33<11:27, 3.52it/s]
19%|ββ | 582/3000 [03:33<11:13, 3.59it/s]
19%|ββ | 583/3000 [03:33<11:15, 3.58it/s]
19%|ββ | 584/3000 [03:34<11:27, 3.52it/s]
20%|ββ | 585/3000 [03:34<11:34, 3.48it/s]
20%|ββ | 586/3000 [03:34<11:13, 3.59it/s]
20%|ββ | 587/3000 [03:34<10:59, 3.66it/s]
20%|ββ | 588/3000 [03:35<11:11, 3.59it/s]
20%|ββ | 589/3000 [03:35<11:35, 3.47it/s]
20%|ββ | 590/3000 [03:35<11:09, 3.60it/s]
{'loss': 0.3541, 'grad_norm': 1.0484408140182495, 'learning_rate': 9.425900158682385e-05} |
|
20%|ββ | 590/3000 [03:35<11:09, 3.60it/s]
20%|ββ | 591/3000 [03:35<10:59, 3.65it/s]
20%|ββ | 592/3000 [03:36<10:53, 3.69it/s]
20%|ββ | 593/3000 [03:36<11:02, 3.64it/s]
20%|ββ | 594/3000 [03:36<10:56, 3.67it/s]
20%|ββ | 595/3000 [03:37<10:44, 3.73it/s]
20%|ββ | 596/3000 [03:37<10:56, 3.66it/s]
20%|ββ | 597/3000 [03:37<11:18, 3.54it/s]
20%|ββ | 598/3000 [03:37<10:58, 3.65it/s]
20%|ββ | 599/3000 [03:38<10:51, 3.69it/s]
20%|ββ | 600/3000 [03:38<10:46, 3.71it/s]
{'loss': 0.3495, 'grad_norm': 1.114036202430725, 'learning_rate': 9.399989299389661e-05} |
|
20%|ββ | 600/3000 [03:38<10:46, 3.71it/s]
20%|ββ | 601/3000 [03:38<10:46, 3.71it/s]
20%|ββ | 602/3000 [03:38<10:58, 3.64it/s]
20%|ββ | 603/3000 [03:39<10:42, 3.73it/s]
20%|ββ | 604/3000 [03:39<10:33, 3.78it/s]
20%|ββ | 605/3000 [03:39<10:25, 3.83it/s]
20%|ββ | 606/3000 [03:39<10:28, 3.81it/s]
20%|ββ | 607/3000 [03:40<10:44, 3.71it/s]
20%|ββ | 608/3000 [03:40<10:36, 3.76it/s]
20%|ββ | 609/3000 [03:40<10:27, 3.81it/s]
20%|ββ | 610/3000 [03:41<10:52, 3.66it/s]
{'loss': 0.3446, 'grad_norm': 1.0067291259765625, 'learning_rate': 9.373543805267368e-05} |
|
20%|ββ | 610/3000 [03:41<10:52, 3.66it/s]
20%|ββ | 611/3000 [03:41<10:42, 3.72it/s]
20%|ββ | 612/3000 [03:41<10:32, 3.78it/s]
20%|ββ | 613/3000 [03:41<10:33, 3.77it/s]
20%|ββ | 614/3000 [03:42<10:29, 3.79it/s]
20%|ββ | 615/3000 [03:42<10:12, 3.90it/s]
21%|ββ | 616/3000 [03:42<10:11, 3.90it/s]
21%|ββ | 617/3000 [03:42<10:08, 3.91it/s]
21%|ββ | 618/3000 [03:43<10:00, 3.96it/s]
21%|ββ | 619/3000 [03:43<09:59, 3.97it/s]
21%|ββ | 620/3000 [03:43<10:07, 3.92it/s]
{'loss': 0.3169, 'grad_norm': 1.0836315155029297, 'learning_rate': 9.346566889660193e-05} |
|
21%|ββ | 620/3000 [03:43<10:07, 3.92it/s]
21%|ββ | 621/3000 [03:43<10:10, 3.89it/s]
21%|ββ | 622/3000 [03:44<10:05, 3.93it/s]
21%|ββ | 623/3000 [03:44<10:04, 3.93it/s]
21%|ββ | 624/3000 [03:44<10:06, 3.92it/s]
21%|ββ | 625/3000 [03:44<10:00, 3.96it/s]
21%|ββ | 626/3000 [03:45<10:02, 3.94it/s]
21%|ββ | 627/3000 [03:45<10:01, 3.94it/s]
21%|ββ | 628/3000 [03:45<10:01, 3.94it/s]
21%|ββ | 629/3000 [03:45<09:56, 3.98it/s]
21%|ββ | 630/3000 [03:46<09:54, 3.98it/s]
{'loss': 0.3029, 'grad_norm': 1.1337170600891113, 'learning_rate': 9.319061830484898e-05} |
|
21%|ββ | 630/3000 [03:46<09:54, 3.98it/s]
21%|ββ | 631/3000 [03:46<10:02, 3.93it/s]
21%|ββ | 632/3000 [03:46<10:01, 3.93it/s]
21%|ββ | 633/3000 [03:46<09:57, 3.96it/s]
21%|ββ | 634/3000 [03:47<10:02, 3.93it/s]
21%|ββ | 635/3000 [03:47<10:02, 3.93it/s]
21%|ββ | 636/3000 [03:47<09:59, 3.94it/s]
21%|ββ | 637/3000 [03:47<09:59, 3.94it/s]
21%|βββ | 638/3000 [03:48<10:02, 3.92it/s]
21%|βββ | 639/3000 [03:48<10:06, 3.89it/s]
21%|βββ | 640/3000 [03:48<10:25, 3.77it/s]
{'loss': 0.3035, 'grad_norm': 1.0124084949493408, 'learning_rate': 9.291031969832026e-05} |
|
21%|βββ | 640/3000 [03:48<10:25, 3.77it/s]
21%|βββ | 641/3000 [03:49<10:37, 3.70it/s]
21%|βββ | 642/3000 [03:49<10:28, 3.75it/s]
21%|βββ | 643/3000 [03:49<10:20, 3.80it/s]
21%|βββ | 644/3000 [03:49<10:18, 3.81it/s]
22%|βββ | 645/3000 [03:50<10:10, 3.86it/s]
22%|βββ | 646/3000 [03:50<10:08, 3.87it/s]
22%|βββ | 647/3000 [03:50<10:24, 3.77it/s]
22%|βββ | 648/3000 [03:50<10:18, 3.80it/s]
22%|βββ | 649/3000 [03:51<10:15, 3.82it/s]
22%|βββ | 650/3000 [03:51<10:10, 3.85it/s]
{'loss': 0.3065, 'grad_norm': 1.1703892946243286, 'learning_rate': 9.262480713559808e-05} |
|
22%|βββ | 650/3000 [03:51<10:10, 3.85it/s]
22%|βββ | 651/3000 [03:51<10:23, 3.77it/s]
22%|βββ | 652/3000 [03:51<10:24, 3.76it/s]
22%|βββ | 653/3000 [03:52<10:14, 3.82it/s]
22%|βββ | 654/3000 [03:52<10:21, 3.77it/s]
22%|βββ | 655/3000 [03:52<10:15, 3.81it/s]
22%|βββ | 656/3000 [03:52<10:13, 3.82it/s]
22%|βββ | 657/3000 [03:53<10:52, 3.59it/s]
22%|βββ | 658/3000 [03:53<10:41, 3.65it/s]
22%|βββ | 659/3000 [03:53<10:27, 3.73it/s]
22%|βββ | 660/3000 [03:54<10:16, 3.80it/s]
{'loss': 0.2916, 'grad_norm': 0.8515862226486206, 'learning_rate': 9.233411530880326e-05} |
|
22%|βββ | 660/3000 [03:54<10:16, 3.80it/s]
22%|βββ | 661/3000 [03:54<10:19, 3.78it/s]
22%|βββ | 662/3000 [03:54<10:38, 3.66it/s]
22%|βββ | 663/3000 [03:54<10:28, 3.72it/s]
22%|βββ | 664/3000 [03:55<10:15, 3.80it/s]
22%|βββ | 665/3000 [03:55<10:08, 3.83it/s]
22%|βββ | 666/3000 [03:55<10:26, 3.73it/s]
22%|βββ | 667/3000 [03:55<10:25, 3.73it/s]
22%|βββ | 668/3000 [03:56<10:22, 3.74it/s]
22%|βββ | 669/3000 [03:56<10:50, 3.58it/s]
22%|βββ | 670/3000 [03:56<10:38, 3.65it/s]
{'loss': 0.2933, 'grad_norm': 1.3676186800003052, 'learning_rate': 9.20382795393797e-05} |
|
22%|βββ | 670/3000 [03:56<10:38, 3.65it/s]
22%|βββ | 671/3000 [03:57<10:52, 3.57it/s]
22%|βββ | 672/3000 [03:57<10:47, 3.59it/s]
22%|βββ | 673/3000 [03:57<10:39, 3.64it/s]
22%|βββ | 674/3000 [03:57<10:28, 3.70it/s]
22%|βββ | 675/3000 [03:58<10:38, 3.64it/s]
23%|βββ | 676/3000 [03:58<10:58, 3.53it/s]
23%|βββ | 677/3000 [03:58<10:41, 3.62it/s]
23%|βββ | 678/3000 [03:58<10:18, 3.76it/s]
23%|βββ | 679/3000 [03:59<10:11, 3.79it/s]
23%|βββ | 680/3000 [03:59<10:10, 3.80it/s]
{'loss': 0.2724, 'grad_norm': 0.8166829943656921, 'learning_rate': 9.173733577380258e-05} |
|
23%|βββ | 680/3000 [03:59<10:10, 3.80it/s]
23%|βββ | 681/3000 [03:59<10:07, 3.82it/s]
23%|βββ | 682/3000 [03:59<10:18, 3.75it/s]
23%|βββ | 683/3000 [04:00<10:10, 3.79it/s]
23%|βββ | 684/3000 [04:00<10:01, 3.85it/s]
23%|βββ | 685/3000 [04:00<09:59, 3.86it/s]
23%|βββ | 686/3000 [04:01<10:44, 3.59it/s]
23%|βββ | 687/3000 [04:01<10:22, 3.71it/s]
23%|βββ | 688/3000 [04:01<10:15, 3.76it/s]
23%|βββ | 689/3000 [04:01<10:10, 3.79it/s]
23%|βββ | 690/3000 [04:02<09:58, 3.86it/s]
{'loss': 0.26, 'grad_norm': 1.028581142425537, 'learning_rate': 9.143132057921058e-05} |
|
23%|βββ | 690/3000 [04:02<09:58, 3.86it/s]
23%|βββ | 691/3000 [04:02<09:55, 3.88it/s]
23%|βββ | 692/3000 [04:02<09:50, 3.91it/s]
23%|βββ | 693/3000 [04:02<09:46, 3.93it/s]
23%|βββ | 694/3000 [04:03<09:39, 3.98it/s]
23%|βββ | 695/3000 [04:03<09:49, 3.91it/s]
23%|βββ | 696/3000 [04:03<09:42, 3.96it/s]
23%|βββ | 697/3000 [04:03<09:58, 3.85it/s]
23%|βββ | 698/3000 [04:04<09:49, 3.90it/s]
23%|βββ | 699/3000 [04:04<09:50, 3.90it/s]
23%|βββ | 700/3000 [04:04<12:12, 3.14it/s]
{'loss': 0.2551, 'grad_norm': 1.035443663597107, 'learning_rate': 9.112027113896262e-05} |
|
23%|βββ | 700/3000 [04:04<12:12, 3.14it/s]
23%|βββ | 701/3000 [04:05<11:35, 3.30it/s]
23%|βββ | 702/3000 [04:05<10:54, 3.51it/s]
23%|βββ | 703/3000 [04:05<10:30, 3.65it/s]
23%|βββ | 704/3000 [04:05<10:19, 3.71it/s]
24%|βββ | 705/3000 [04:06<10:16, 3.72it/s]
24%|βββ | 706/3000 [04:06<10:08, 3.77it/s]
24%|βββ | 707/3000 [04:06<10:01, 3.82it/s]
24%|βββ | 708/3000 [04:06<09:55, 3.85it/s]
24%|βββ | 709/3000 [04:07<10:03, 3.80it/s]
24%|βββ | 710/3000 [04:07<10:02, 3.80it/s]
{'loss': 0.2613, 'grad_norm': 1.22231924533844, 'learning_rate': 9.080422524811982e-05} |
|
24%|βββ | 710/3000 [04:07<10:02, 3.80it/s]
24%|βββ | 711/3000 [04:07<09:59, 3.82it/s]
24%|βββ | 712/3000 [04:07<10:09, 3.76it/s]
24%|βββ | 713/3000 [04:08<10:21, 3.68it/s]
24%|βββ | 714/3000 [04:08<10:17, 3.70it/s]
24%|βββ | 715/3000 [04:08<10:29, 3.63it/s]
24%|βββ | 716/3000 [04:09<10:23, 3.67it/s]
24%|βββ | 717/3000 [04:09<10:08, 3.75it/s]
24%|βββ | 718/3000 [04:09<10:07, 3.75it/s]
24%|βββ | 719/3000 [04:09<10:23, 3.66it/s]
24%|βββ | 720/3000 [04:10<10:53, 3.49it/s]
{'loss': 0.2384, 'grad_norm': 1.0958259105682373, 'learning_rate': 9.048322130885305e-05} |
|
24%|βββ | 720/3000 [04:10<10:53, 3.49it/s]
24%|βββ | 721/3000 [04:10<10:54, 3.48it/s]
24%|βββ | 722/3000 [04:10<10:45, 3.53it/s]
24%|βββ | 723/3000 [04:11<10:27, 3.63it/s]
24%|βββ | 724/3000 [04:11<10:18, 3.68it/s]
24%|βββ | 725/3000 [04:11<10:32, 3.60it/s]
24%|βββ | 726/3000 [04:11<10:30, 3.61it/s]
24%|βββ | 727/3000 [04:12<10:58, 3.45it/s]
24%|βββ | 728/3000 [04:12<10:54, 3.47it/s]
24%|βββ | 729/3000 [04:12<10:41, 3.54it/s]
24%|βββ | 730/3000 [04:12<10:38, 3.56it/s]
{'loss': 0.2329, 'grad_norm': 0.8523461222648621, 'learning_rate': 9.015729832577681e-05} |
|
24%|βββ | 730/3000 [04:13<10:38, 3.56it/s]
24%|βββ | 731/3000 [04:13<10:35, 3.57it/s]
24%|βββ | 732/3000 [04:13<10:25, 3.63it/s]
24%|βββ | 733/3000 [04:13<10:19, 3.66it/s]
24%|βββ | 734/3000 [04:14<10:23, 3.63it/s]
24%|βββ | 735/3000 [04:14<10:22, 3.64it/s]
25%|βββ | 736/3000 [04:14<10:12, 3.69it/s]
25%|βββ | 737/3000 [04:14<10:10, 3.71it/s]
25%|βββ | 738/3000 [04:15<10:27, 3.60it/s]
25%|βββ | 739/3000 [04:15<10:16, 3.66it/s]
Rank 0, Worker 1: Wait for shard 28 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
25%|βββ | 740/3000 [04:15<10:13, 3.68it/s]
{'loss': 0.2094, 'grad_norm': 0.964102029800415, 'learning_rate': 8.982649590120982e-05} |
|
25%|βββ | 740/3000 [04:15<10:13, 3.68it/s]
25%|βββ | 741/3000 [04:15<10:22, 3.63it/s]
Rank 0, Worker 3: Wait for shard 63 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
25%|βββ | 742/3000 [04:16<10:32, 3.57it/s]
25%|βββ | 743/3000 [04:16<10:28, 3.59it/s]
25%|βββ | 744/3000 [04:16<10:25, 3.61it/s]
Rank 0, Worker 0: Wait for shard 60 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
25%|βββ | 745/3000 [04:17<10:29, 3.58it/s]
25%|βββ | 746/3000 [04:17<10:28, 3.59it/s]
25%|βββ | 747/3000 [04:17<10:25, 3.60it/s]
25%|βββ | 748/3000 [04:17<10:26, 3.59it/s]
25%|βββ | 749/3000 [04:18<10:41, 3.51it/s]
25%|βββ | 750/3000 [04:18<11:04, 3.38it/s]
{'loss': 0.2132, 'grad_norm': 0.98223876953125, 'learning_rate': 8.949085423036296e-05} |
|
25%|βββ | 750/3000 [04:18<11:04, 3.38it/s]
25%|βββ | 751/3000 [04:18<10:56, 3.43it/s]
25%|βββ | 752/3000 [04:19<10:43, 3.49it/s]
Rank 0, Worker 2: Wait for shard 47 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
25%|βββ | 753/3000 [04:19<10:50, 3.45it/s]
25%|βββ | 754/3000 [04:19<10:38, 3.52it/s]
Rank 0, Worker 4: Wait for shard 12 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
25%|βββ | 755/3000 [04:19<10:33, 3.54it/s]
Rank 0, Worker 5: Wait for shard 58 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
25%|βββ | 756/3000 [04:20<10:29, 3.57it/s]
25%|βββ | 757/3000 [04:20<11:07, 3.36it/s]
25%|βββ | 758/3000 [04:20<11:42, 3.19it/s]
25%|βββ | 759/3000 [04:21<12:20, 3.03it/s]
25%|βββ | 760/3000 [04:21<11:44, 3.18it/s]
{'loss': 0.1981, 'grad_norm': 1.0459887981414795, 'learning_rate': 8.91504140964553e-05} |
|
25%|βββ | 760/3000 [04:21<11:44, 3.18it/s]
25%|βββ | 761/3000 [04:21<11:28, 3.25it/s]
25%|βββ | 762/3000 [04:22<11:09, 3.34it/s]
25%|βββ | 763/3000 [04:22<11:02, 3.38it/s]
25%|βββ | 764/3000 [04:22<11:03, 3.37it/s]
26%|βββ | 765/3000 [04:23<10:59, 3.39it/s]
26%|βββ | 766/3000 [04:23<10:51, 3.43it/s]
26%|βββ | 767/3000 [04:23<10:36, 3.51it/s]
26%|βββ | 768/3000 [04:23<10:24, 3.58it/s]
26%|βββ | 769/3000 [04:24<10:23, 3.58it/s]
26%|βββ | 770/3000 [04:24<10:22, 3.58it/s]
{'loss': 0.1954, 'grad_norm': 1.3035297393798828, 'learning_rate': 8.880521686575857e-05} |
|
26%|βββ | 770/3000 [04:24<10:22, 3.58it/s]
26%|βββ | 771/3000 [04:24<10:30, 3.54it/s]
26%|βββ | 772/3000 [04:24<10:14, 3.63it/s]
26%|βββ | 773/3000 [04:25<10:05, 3.68it/s]
26%|βββ | 774/3000 [04:25<10:07, 3.66it/s]
26%|βββ | 775/3000 [04:25<10:01, 3.70it/s]
26%|βββ | 776/3000 [04:26<10:09, 3.65it/s]
26%|βββ | 777/3000 [04:26<10:16, 3.61it/s]
26%|βββ | 778/3000 [04:26<10:22, 3.57it/s]
26%|βββ | 779/3000 [04:26<10:15, 3.61it/s]
26%|βββ | 780/3000 [04:27<10:10, 3.64it/s]
{'loss': 0.1891, 'grad_norm': 0.9774158596992493, 'learning_rate': 8.845530448257085e-05} |
|
26%|βββ | 780/3000 [04:27<10:10, 3.64it/s]
26%|βββ | 781/3000 [04:27<10:44, 3.44it/s]
26%|βββ | 782/3000 [04:27<10:22, 3.56it/s]
26%|βββ | 783/3000 [04:27<10:07, 3.65it/s]
26%|βββ | 784/3000 [04:28<09:58, 3.70it/s]
26%|βββ | 785/3000 [04:28<10:03, 3.67it/s]
26%|βββ | 786/3000 [04:28<10:09, 3.63it/s]
26%|βββ | 787/3000 [04:29<09:59, 3.69it/s]
26%|βββ | 788/3000 [04:29<09:52, 3.73it/s]
26%|βββ | 789/3000 [04:29<09:46, 3.77it/s]
26%|βββ | 790/3000 [04:29<09:46, 3.77it/s]
{'loss': 0.1783, 'grad_norm': 1.1295584440231323, 'learning_rate': 8.810071946411989e-05} |
|
26%|βββ | 790/3000 [04:29<09:46, 3.77it/s]
26%|βββ | 791/3000 [04:30<09:45, 3.77it/s]
26%|βββ | 792/3000 [04:30<09:41, 3.79it/s]
26%|βββ | 793/3000 [04:30<09:40, 3.80it/s]
26%|βββ | 794/3000 [04:30<09:43, 3.78it/s]
26%|βββ | 795/3000 [04:31<09:45, 3.77it/s]
27%|βββ | 796/3000 [04:31<09:48, 3.75it/s]
27%|βββ | 797/3000 [04:31<09:42, 3.78it/s]
27%|βββ | 798/3000 [04:31<09:52, 3.72it/s]
27%|βββ | 799/3000 [04:32<09:48, 3.74it/s]
27%|βββ | 800/3000 [04:32<09:39, 3.80it/s]
{'loss': 0.1765, 'grad_norm': 0.8923147916793823, 'learning_rate': 8.774150489539707e-05} |
|
27%|βββ | 800/3000 [04:32<09:39, 3.80it/s]
27%|βββ | 801/3000 [04:32<09:32, 3.84it/s]
27%|βββ | 802/3000 [04:33<09:34, 3.83it/s]
27%|βββ | 803/3000 [04:33<09:33, 3.83it/s]
27%|βββ | 804/3000 [04:33<09:24, 3.89it/s]
27%|βββ | 805/3000 [04:33<09:24, 3.89it/s]
27%|βββ | 806/3000 [04:34<09:42, 3.77it/s]
27%|βββ | 807/3000 [04:34<09:30, 3.84it/s]
27%|βββ | 808/3000 [04:34<09:48, 3.73it/s]
27%|βββ | 809/3000 [04:34<09:39, 3.78it/s]
27%|βββ | 810/3000 [04:35<09:39, 3.78it/s]
{'loss': 0.1742, 'grad_norm': 1.3010443449020386, 'learning_rate': 8.737770442392212e-05} |
|
27%|βββ | 810/3000 [04:35<09:39, 3.78it/s]
27%|βββ | 811/3000 [04:35<09:43, 3.75it/s]
27%|βββ | 812/3000 [04:35<09:33, 3.82it/s]
27%|βββ | 813/3000 [04:35<09:27, 3.85it/s]
27%|βββ | 814/3000 [04:36<09:24, 3.88it/s]
27%|βββ | 815/3000 [04:36<09:33, 3.81it/s]
27%|βββ | 816/3000 [04:36<09:36, 3.79it/s]
27%|βββ | 817/3000 [04:36<09:32, 3.81it/s]
27%|βββ | 818/3000 [04:37<09:39, 3.76it/s]
27%|βββ | 819/3000 [04:37<09:54, 3.67it/s]
27%|βββ | 820/3000 [04:37<09:40, 3.76it/s]
{'loss': 0.1587, 'grad_norm': 0.9801855683326721, 'learning_rate': 8.700936225443959e-05} |
|
27%|βββ | 820/3000 [04:37<09:40, 3.76it/s]
27%|βββ | 821/3000 [04:38<10:16, 3.53it/s]
27%|βββ | 822/3000 [04:38<09:57, 3.65it/s]
27%|βββ | 823/3000 [04:38<09:40, 3.75it/s]
27%|βββ | 824/3000 [04:38<09:35, 3.78it/s]
28%|βββ | 825/3000 [04:39<09:37, 3.77it/s]
28%|βββ | 826/3000 [04:39<10:17, 3.52it/s]
28%|βββ | 827/3000 [04:39<10:07, 3.58it/s]
28%|βββ | 828/3000 [04:39<09:57, 3.64it/s]
28%|βββ | 829/3000 [04:40<09:51, 3.67it/s]
28%|βββ | 830/3000 [04:40<09:53, 3.66it/s]
{'loss': 0.1646, 'grad_norm': 1.0634956359863281, 'learning_rate': 8.663652314354765e-05} |
|
28%|βββ | 830/3000 [04:40<09:53, 3.66it/s]
28%|βββ | 831/3000 [04:40<09:44, 3.71it/s]
28%|βββ | 832/3000 [04:41<09:31, 3.79it/s]
28%|βββ | 833/3000 [04:41<09:25, 3.83it/s]
28%|βββ | 834/3000 [04:41<09:19, 3.87it/s]
28%|βββ | 835/3000 [04:41<09:13, 3.91it/s]
28%|βββ | 836/3000 [04:42<09:11, 3.93it/s]
28%|βββ | 837/3000 [04:42<09:11, 3.92it/s]
28%|βββ | 838/3000 [04:42<09:11, 3.92it/s]
28%|βββ | 839/3000 [04:42<09:08, 3.94it/s]
28%|βββ | 840/3000 [04:43<09:15, 3.89it/s]
{'loss': 0.1485, 'grad_norm': 0.8811051845550537, 'learning_rate': 8.625923239425978e-05} |
|
28%|βββ | 840/3000 [04:43<09:15, 3.89it/s]
28%|βββ | 841/3000 [04:43<09:21, 3.85it/s]
28%|βββ | 842/3000 [04:43<09:25, 3.82it/s]
28%|βββ | 843/3000 [04:43<09:37, 3.74it/s]
28%|βββ | 844/3000 [04:44<09:50, 3.65it/s]
28%|βββ | 845/3000 [04:44<09:59, 3.59it/s]
28%|βββ | 846/3000 [04:44<09:43, 3.69it/s]
28%|βββ | 847/3000 [04:45<09:50, 3.64it/s]
28%|βββ | 848/3000 [04:45<09:56, 3.61it/s]
28%|βββ | 849/3000 [04:45<09:58, 3.59it/s]
28%|βββ | 850/3000 [04:45<10:01, 3.58it/s]
{'loss': 0.1585, 'grad_norm': 1.0639090538024902, 'learning_rate': 8.587753585050004e-05} |
|
28%|βββ | 850/3000 [04:45<10:01, 3.58it/s]
28%|βββ | 851/3000 [04:46<09:59, 3.59it/s]
28%|βββ | 852/3000 [04:46<09:55, 3.61it/s]
28%|βββ | 853/3000 [04:46<09:46, 3.66it/s]
28%|βββ | 854/3000 [04:46<09:46, 3.66it/s]
28%|βββ | 855/3000 [04:47<09:51, 3.62it/s]
29%|βββ | 856/3000 [04:47<09:49, 3.64it/s]
29%|βββ | 857/3000 [04:47<09:39, 3.70it/s]
29%|βββ | 858/3000 [04:48<09:37, 3.71it/s]
29%|βββ | 859/3000 [04:48<09:35, 3.72it/s]
29%|βββ | 860/3000 [04:48<09:46, 3.65it/s]
{'loss': 0.1443, 'grad_norm': 0.9883255362510681, 'learning_rate': 8.549147989153276e-05} |
|
29%|βββ | 860/3000 [04:48<09:46, 3.65it/s]
29%|βββ | 861/3000 [04:48<09:42, 3.67it/s]
29%|βββ | 862/3000 [04:49<09:30, 3.74it/s]
29%|βββ | 863/3000 [04:49<09:21, 3.81it/s]
29%|βββ | 864/3000 [04:49<09:24, 3.78it/s]
29%|βββ | 865/3000 [04:49<09:35, 3.71it/s]
29%|βββ | 866/3000 [04:50<09:26, 3.77it/s]
29%|βββ | 867/3000 [04:50<10:17, 3.45it/s]
29%|βββ | 868/3000 [04:50<10:02, 3.54it/s]
29%|βββ | 869/3000 [04:51<10:08, 3.50it/s]
29%|βββ | 870/3000 [04:51<09:52, 3.60it/s]
{'loss': 0.1388, 'grad_norm': 1.4074150323867798, 'learning_rate': 8.510111142632698e-05} |
|
29%|βββ | 870/3000 [04:51<09:52, 3.60it/s]
29%|βββ | 871/3000 [04:51<09:58, 3.55it/s]
29%|βββ | 872/3000 [04:51<09:42, 3.66it/s]
29%|βββ | 873/3000 [04:52<09:32, 3.71it/s]
29%|βββ | 874/3000 [04:52<09:24, 3.76it/s]
29%|βββ | 875/3000 [04:52<09:22, 3.78it/s]
29%|βββ | 876/3000 [04:52<09:17, 3.81it/s]
29%|βββ | 877/3000 [04:53<09:12, 3.84it/s]
29%|βββ | 878/3000 [04:53<09:46, 3.62it/s]
29%|βββ | 879/3000 [04:53<09:29, 3.73it/s]
29%|βββ | 880/3000 [04:53<09:23, 3.76it/s]
{'loss': 0.1283, 'grad_norm': 1.2396678924560547, 'learning_rate': 8.470647788785665e-05} |
|
29%|βββ | 880/3000 [04:54<09:23, 3.76it/s]
29%|βββ | 881/3000 [04:54<09:19, 3.79it/s]
29%|βββ | 882/3000 [04:54<09:11, 3.84it/s]
29%|βββ | 883/3000 [04:54<09:06, 3.88it/s]
29%|βββ | 884/3000 [04:55<09:06, 3.87it/s]
30%|βββ | 885/3000 [04:55<08:58, 3.93it/s]
30%|βββ | 886/3000 [04:55<08:53, 3.96it/s]
30%|βββ | 887/3000 [04:55<08:51, 3.97it/s]
30%|βββ | 888/3000 [04:56<08:53, 3.96it/s]
30%|βββ | 889/3000 [04:56<08:57, 3.93it/s]
30%|βββ | 890/3000 [04:56<08:51, 3.97it/s]
{'loss': 0.1183, 'grad_norm': 1.1257601976394653, 'learning_rate': 8.430762722733714e-05} |
|
30%|βββ | 890/3000 [04:56<08:51, 3.97it/s]
30%|βββ | 891/3000 [04:56<08:52, 3.96it/s]
30%|βββ | 892/3000 [04:57<08:50, 3.97it/s]
30%|βββ | 893/3000 [04:57<08:50, 3.97it/s]
30%|βββ | 894/3000 [04:57<08:47, 3.99it/s]
30%|βββ | 895/3000 [04:57<08:49, 3.97it/s]
30%|βββ | 896/3000 [04:58<08:47, 3.99it/s]
30%|βββ | 897/3000 [04:58<08:48, 3.98it/s]
30%|βββ | 898/3000 [04:58<08:43, 4.02it/s]
30%|βββ | 899/3000 [04:58<08:44, 4.00it/s]
30%|βββ | 900/3000 [04:59<08:51, 3.95it/s]
{'loss': 0.1372, 'grad_norm': 0.95068359375, 'learning_rate': 8.390460790839882e-05} |
|
30%|βββ | 900/3000 [04:59<08:51, 3.95it/s]
30%|βββ | 901/3000 [04:59<08:53, 3.94it/s]
30%|βββ | 902/3000 [04:59<08:47, 3.98it/s]
30%|βββ | 903/3000 [04:59<08:46, 3.98it/s]
30%|βββ | 904/3000 [05:00<08:43, 4.01it/s]
30%|βββ | 905/3000 [05:00<08:43, 4.00it/s]
30%|βββ | 906/3000 [05:00<08:39, 4.03it/s]
30%|βββ | 907/3000 [05:00<08:42, 4.00it/s]
30%|βββ | 908/3000 [05:01<08:44, 3.99it/s]
30%|βββ | 909/3000 [05:01<08:42, 4.00it/s]
30%|βββ | 910/3000 [05:01<08:42, 4.00it/s]
{'loss': 0.1125, 'grad_norm': 1.1750320196151733, 'learning_rate': 8.349746890119826e-05} |
|
30%|βββ | 910/3000 [05:01<08:42, 4.00it/s]
30%|βββ | 911/3000 [05:01<08:48, 3.96it/s]
30%|βββ | 912/3000 [05:02<08:35, 4.05it/s]
30%|βββ | 913/3000 [05:02<08:37, 4.04it/s]
30%|βββ | 914/3000 [05:02<08:36, 4.04it/s]
30%|βββ | 915/3000 [05:02<08:31, 4.07it/s]
31%|βββ | 916/3000 [05:02<08:32, 4.06it/s]
31%|βββ | 917/3000 [05:03<08:32, 4.07it/s]
31%|βββ | 918/3000 [05:03<08:32, 4.07it/s]
31%|βββ | 919/3000 [05:03<08:31, 4.07it/s]
31%|βββ | 920/3000 [05:03<08:32, 4.06it/s]
{'loss': 0.1298, 'grad_norm': 1.0115416049957275, 'learning_rate': 8.308625967646795e-05} |
|
31%|βββ | 920/3000 [05:04<08:32, 4.06it/s]
31%|βββ | 921/3000 [05:04<08:33, 4.05it/s]
31%|βββ | 922/3000 [05:04<08:34, 4.04it/s]
31%|βββ | 923/3000 [05:04<08:35, 4.03it/s]
31%|βββ | 924/3000 [05:04<08:33, 4.04it/s]
31%|βββ | 925/3000 [05:05<08:33, 4.04it/s]
31%|βββ | 926/3000 [05:05<08:33, 4.04it/s]
31%|βββ | 927/3000 [05:05<08:26, 4.09it/s]
31%|βββ | 928/3000 [05:05<08:19, 4.15it/s]
31%|βββ | 929/3000 [05:06<08:24, 4.10it/s]
31%|βββ | 930/3000 [05:06<08:28, 4.07it/s]
{'loss': 0.1199, 'grad_norm': 0.9820786118507385, 'learning_rate': 8.267103019950529e-05} |
|
31%|βββ | 930/3000 [05:06<08:28, 4.07it/s]
31%|βββ | 931/3000 [05:06<08:26, 4.09it/s]
Rank 0, Worker 1: Wait for shard 50 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
31%|βββ | 932/3000 [05:06<08:28, 4.07it/s]
31%|βββ | 933/3000 [05:07<08:29, 4.06it/s]
Rank 0, Worker 3: Wait for shard 17 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
31%|βββ | 934/3000 [05:07<08:34, 4.01it/s]
31%|βββ | 935/3000 [05:07<08:33, 4.02it/s]
31%|βββ | 936/3000 [05:07<08:33, 4.02it/s]
31%|βββ | 937/3000 [05:08<08:33, 4.02it/s]
31%|ββββ | 938/3000 [05:08<08:33, 4.01it/s]
31%|ββββ | 939/3000 [05:08<08:32, 4.02it/s]
31%|ββββ | 940/3000 [05:08<08:32, 4.02it/s]
{'loss': 0.1097, 'grad_norm': 0.9424199461936951, 'learning_rate': 8.225183092410128e-05} |
|
31%|ββββ | 940/3000 [05:08<08:32, 4.02it/s]
31%|ββββ | 941/3000 [05:09<08:38, 3.97it/s]
Rank 0, Worker 5: Wait for shard 27 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
31%|ββββ | 942/3000 [05:09<08:40, 3.96it/s]
Rank 0, Worker 0: Wait for shard 45 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
31%|ββββ | 943/3000 [05:09<08:49, 3.89it/s]
31%|ββββ | 944/3000 [05:09<08:58, 3.82it/s]
Rank 0, Worker 2: Wait for shard 8 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
32%|ββββ | 945/3000 [05:10<09:12, 3.72it/s]
32%|ββββ | 946/3000 [05:10<09:03, 3.78it/s]
Rank 0, Worker 4: Wait for shard 40 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
32%|ββββ | 947/3000 [05:10<09:10, 3.73it/s]
32%|ββββ | 948/3000 [05:11<09:13, 3.71it/s]
32%|ββββ | 949/3000 [05:11<09:12, 3.72it/s]
32%|ββββ | 950/3000 [05:11<09:04, 3.76it/s]
{'loss': 0.1264, 'grad_norm': 0.804672122001648, 'learning_rate': 8.182871278641009e-05} |
|
32%|ββββ | 950/3000 [05:11<09:04, 3.76it/s]
32%|ββββ | 951/3000 [05:11<09:05, 3.76it/s]
32%|ββββ | 952/3000 [05:12<09:01, 3.78it/s]
32%|ββββ | 953/3000 [05:12<08:58, 3.80it/s]
32%|ββββ | 954/3000 [05:12<09:01, 3.78it/s]
32%|ββββ | 955/3000 [05:12<08:54, 3.83it/s]
32%|ββββ | 956/3000 [05:13<08:45, 3.89it/s]
32%|ββββ | 957/3000 [05:13<08:45, 3.89it/s]
32%|ββββ | 958/3000 [05:13<08:47, 3.87it/s]
32%|ββββ | 959/3000 [05:13<08:51, 3.84it/s]
32%|ββββ | 960/3000 [05:14<08:46, 3.88it/s]
{'loss': 0.107, 'grad_norm': 1.0112346410751343, 'learning_rate': 8.140172719875979e-05} |
|
32%|ββββ | 960/3000 [05:14<08:46, 3.88it/s]
32%|ββββ | 961/3000 [05:14<08:51, 3.83it/s]
32%|ββββ | 962/3000 [05:14<08:58, 3.79it/s]
32%|ββββ | 963/3000 [05:14<08:58, 3.79it/s]
32%|ββββ | 964/3000 [05:15<08:51, 3.83it/s]
32%|ββββ | 965/3000 [05:15<08:54, 3.80it/s]
32%|ββββ | 966/3000 [05:15<08:56, 3.79it/s]
32%|ββββ | 967/3000 [05:16<09:03, 3.74it/s]
32%|ββββ | 968/3000 [05:16<09:00, 3.76it/s]
32%|ββββ | 969/3000 [05:16<09:00, 3.76it/s]
32%|ββββ | 970/3000 [05:16<08:52, 3.81it/s]
{'loss': 0.1114, 'grad_norm': 0.9198870062828064, 'learning_rate': 8.097092604340542e-05} |
|
32%|ββββ | 970/3000 [05:16<08:52, 3.81it/s]
32%|ββββ | 971/3000 [05:17<08:49, 3.84it/s]
32%|ββββ | 972/3000 [05:17<08:47, 3.84it/s]
32%|ββββ | 973/3000 [05:17<08:44, 3.86it/s]
32%|ββββ | 974/3000 [05:17<08:48, 3.83it/s]
32%|ββββ | 975/3000 [05:18<08:45, 3.85it/s]
33%|ββββ | 976/3000 [05:18<08:47, 3.83it/s]
33%|ββββ | 977/3000 [05:18<09:26, 3.57it/s]
33%|ββββ | 978/3000 [05:18<09:07, 3.69it/s]
33%|ββββ | 979/3000 [05:19<09:01, 3.73it/s]
33%|ββββ | 980/3000 [05:19<08:54, 3.78it/s]
{'loss': 0.1178, 'grad_norm': 0.8466675877571106, 'learning_rate': 8.053636166622476e-05} |
|
33%|ββββ | 980/3000 [05:19<08:54, 3.78it/s]
33%|ββββ | 981/3000 [05:19<08:54, 3.78it/s]
33%|ββββ | 982/3000 [05:20<08:56, 3.76it/s]
33%|ββββ | 983/3000 [05:20<08:58, 3.75it/s]
33%|ββββ | 984/3000 [05:20<08:53, 3.78it/s]
33%|ββββ | 985/3000 [05:20<08:51, 3.79it/s]
33%|ββββ | 986/3000 [05:21<08:46, 3.82it/s]
33%|ββββ | 987/3000 [05:21<08:38, 3.88it/s]
33%|ββββ | 988/3000 [05:21<08:51, 3.79it/s]
33%|ββββ | 989/3000 [05:21<08:48, 3.80it/s]
33%|ββββ | 990/3000 [05:22<08:48, 3.81it/s]
{'loss': 0.1163, 'grad_norm': 0.9316571354866028, 'learning_rate': 8.009808687035798e-05} |
|
33%|ββββ | 990/3000 [05:22<08:48, 3.81it/s]
33%|ββββ | 991/3000 [05:22<08:42, 3.84it/s]
33%|ββββ | 992/3000 [05:22<08:37, 3.88it/s]
33%|ββββ | 993/3000 [05:22<08:41, 3.85it/s]
33%|ββββ | 994/3000 [05:23<08:41, 3.84it/s]
33%|ββββ | 995/3000 [05:23<08:36, 3.88it/s]
33%|ββββ | 996/3000 [05:23<08:34, 3.90it/s]
33%|ββββ | 997/3000 [05:23<08:51, 3.77it/s]
33%|ββββ | 998/3000 [05:24<08:45, 3.81it/s]
33%|ββββ | 999/3000 [05:24<08:41, 3.84it/s]
33%|ββββ | 1000/3000 [05:24<08:40, 3.84it/s]
{'loss': 0.1149, 'grad_norm': 1.0110657215118408, 'learning_rate': 7.965615490979163e-05} |
|
33%|ββββ | 1000/3000 [05:24<08:40, 3.84it/s]
Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1000/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1000 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1000/wandb_config.json |
|
33%|ββββ | 1001/3000 [06:02<6:26:42, 11.61s/it]
33%|ββββ | 1002/3000 [06:03<4:33:08, 8.20s/it]
33%|ββββ | 1003/3000 [06:03<3:13:36, 5.82s/it]
33%|ββββ | 1004/3000 [06:03<2:17:56, 4.15s/it]
34%|ββββ | 1005/3000 [06:03<1:39:02, 2.98s/it]
34%|ββββ | 1006/3000 [06:04<1:11:53, 2.16s/it]
34%|ββββ | 1007/3000 [06:04<52:44, 1.59s/it]
34%|ββββ | 1008/3000 [06:04<39:29, 1.19s/it]
34%|ββββ | 1009/3000 [06:04<30:11, 1.10it/s]
34%|ββββ | 1010/3000 [06:05<23:37, 1.40it/s]
{'loss': 0.105, 'grad_norm': 0.7280577421188354, 'learning_rate': 7.921061948288773e-05} |
|
34%|ββββ | 1010/3000 [06:05<23:37, 1.40it/s]
34%|ββββ | 1011/3000 [06:05<18:59, 1.75it/s]
34%|ββββ | 1012/3000 [06:05<15:57, 2.08it/s]
34%|ββββ | 1013/3000 [06:05<13:53, 2.38it/s]
34%|ββββ | 1014/3000 [06:06<12:15, 2.70it/s]
34%|ββββ | 1015/3000 [06:06<11:06, 2.98it/s]
34%|ββββ | 1016/3000 [06:06<10:51, 3.04it/s]
34%|ββββ | 1017/3000 [06:06<10:22, 3.19it/s]
34%|ββββ | 1018/3000 [06:07<09:44, 3.39it/s]
34%|ββββ | 1019/3000 [06:07<09:20, 3.54it/s]
34%|ββββ | 1020/3000 [06:07<09:09, 3.61it/s]
{'loss': 0.1069, 'grad_norm': 0.9328599572181702, 'learning_rate': 7.87615347258591e-05} |
|
34%|ββββ | 1020/3000 [06:07<09:09, 3.61it/s]
34%|ββββ | 1021/3000 [06:07<08:54, 3.70it/s]
34%|ββββ | 1022/3000 [06:08<09:13, 3.57it/s]
34%|ββββ | 1023/3000 [06:08<08:56, 3.69it/s]
34%|ββββ | 1024/3000 [06:08<08:42, 3.78it/s]
34%|ββββ | 1025/3000 [06:09<08:58, 3.67it/s]
34%|ββββ | 1026/3000 [06:09<09:01, 3.64it/s]
34%|ββββ | 1027/3000 [06:09<09:00, 3.65it/s]
34%|ββββ | 1028/3000 [06:09<08:46, 3.74it/s]
34%|ββββ | 1029/3000 [06:10<08:42, 3.78it/s]
34%|ββββ | 1030/3000 [06:10<08:32, 3.84it/s]
{'loss': 0.1069, 'grad_norm': 0.9885624051094055, 'learning_rate': 7.830895520619128e-05} |
|
34%|ββββ | 1030/3000 [06:10<08:32, 3.84it/s]
34%|ββββ | 1031/3000 [06:10<08:33, 3.83it/s]
34%|ββββ | 1032/3000 [06:10<08:30, 3.86it/s]
34%|ββββ | 1033/3000 [06:11<08:23, 3.91it/s]
34%|ββββ | 1034/3000 [06:11<08:18, 3.94it/s]
34%|ββββ | 1035/3000 [06:11<08:19, 3.94it/s]
35%|ββββ | 1036/3000 [06:11<08:19, 3.93it/s]
35%|ββββ | 1037/3000 [06:12<08:17, 3.95it/s]
35%|ββββ | 1038/3000 [06:12<08:23, 3.89it/s]
35%|ββββ | 1039/3000 [06:12<08:45, 3.73it/s]
35%|ββββ | 1040/3000 [06:13<08:43, 3.74it/s]
{'loss': 0.1128, 'grad_norm': 0.8272039890289307, 'learning_rate': 7.785293591601217e-05} |
|
35%|ββββ | 1040/3000 [06:13<08:43, 3.74it/s]
35%|ββββ | 1041/3000 [06:13<08:51, 3.69it/s]
35%|ββββ | 1042/3000 [06:13<08:45, 3.73it/s]
35%|ββββ | 1043/3000 [06:13<08:41, 3.75it/s]
35%|ββββ | 1044/3000 [06:14<08:34, 3.80it/s]
35%|ββββ | 1045/3000 [06:14<08:38, 3.77it/s]
35%|ββββ | 1046/3000 [06:14<08:34, 3.80it/s]
35%|ββββ | 1047/3000 [06:14<08:27, 3.85it/s]
35%|ββββ | 1048/3000 [06:15<08:38, 3.77it/s]
35%|ββββ | 1049/3000 [06:15<08:35, 3.78it/s]
35%|ββββ | 1050/3000 [06:15<08:54, 3.65it/s]
{'loss': 0.1034, 'grad_norm': 1.0371506214141846, 'learning_rate': 7.739353226541009e-05} |
|
35%|ββββ | 1050/3000 [06:15<08:54, 3.65it/s]
35%|ββββ | 1051/3000 [06:15<08:55, 3.64it/s]
35%|ββββ | 1052/3000 [06:16<08:48, 3.69it/s]
35%|ββββ | 1053/3000 [06:16<08:54, 3.64it/s]
35%|ββββ | 1054/3000 [06:16<08:53, 3.65it/s]
35%|ββββ | 1055/3000 [06:17<08:43, 3.72it/s]
35%|ββββ | 1056/3000 [06:17<08:38, 3.75it/s]
35%|ββββ | 1057/3000 [06:17<08:30, 3.80it/s]
35%|ββββ | 1058/3000 [06:17<08:26, 3.83it/s]
35%|ββββ | 1059/3000 [06:18<08:21, 3.87it/s]
35%|ββββ | 1060/3000 [06:18<08:24, 3.84it/s]
{'loss': 0.1159, 'grad_norm': 0.8586914539337158, 'learning_rate': 7.693080007570084e-05} |
|
35%|ββββ | 1060/3000 [06:18<08:24, 3.84it/s]
35%|ββββ | 1061/3000 [06:18<08:26, 3.83it/s]
35%|ββββ | 1062/3000 [06:18<08:27, 3.82it/s]
35%|ββββ | 1063/3000 [06:19<08:37, 3.75it/s]
35%|ββββ | 1064/3000 [06:19<08:38, 3.74it/s]
36%|ββββ | 1065/3000 [06:19<08:36, 3.75it/s]
36%|ββββ | 1066/3000 [06:19<08:31, 3.78it/s]
36%|ββββ | 1067/3000 [06:20<08:26, 3.82it/s]
36%|ββββ | 1068/3000 [06:20<08:26, 3.81it/s]
36%|ββββ | 1069/3000 [06:20<08:28, 3.79it/s]
36%|ββββ | 1070/3000 [06:20<08:23, 3.83it/s]
{'loss': 0.1115, 'grad_norm': 0.7747752070426941, 'learning_rate': 7.646479557264513e-05} |
|
36%|ββββ | 1070/3000 [06:21<08:23, 3.83it/s]
36%|ββββ | 1071/3000 [06:21<08:25, 3.82it/s]
36%|ββββ | 1072/3000 [06:21<08:58, 3.58it/s]
36%|ββββ | 1073/3000 [06:21<08:46, 3.66it/s]
36%|ββββ | 1074/3000 [06:22<08:38, 3.71it/s]
36%|ββββ | 1075/3000 [06:22<08:35, 3.74it/s]
36%|ββββ | 1076/3000 [06:22<08:38, 3.71it/s]
36%|ββββ | 1077/3000 [06:22<09:00, 3.56it/s]
36%|ββββ | 1078/3000 [06:23<08:53, 3.61it/s]
36%|ββββ | 1079/3000 [06:23<08:42, 3.68it/s]
36%|ββββ | 1080/3000 [06:23<08:38, 3.70it/s]
{'loss': 0.1106, 'grad_norm': 0.8298690915107727, 'learning_rate': 7.599557537961663e-05} |
|
36%|ββββ | 1080/3000 [06:23<08:38, 3.70it/s]
36%|ββββ | 1081/3000 [06:23<08:34, 3.73it/s]
36%|ββββ | 1082/3000 [06:24<08:30, 3.75it/s]
36%|ββββ | 1083/3000 [06:24<08:24, 3.80it/s]
36%|ββββ | 1084/3000 [06:24<08:32, 3.74it/s]
36%|ββββ | 1085/3000 [06:25<08:19, 3.83it/s]
36%|ββββ | 1086/3000 [06:25<08:19, 3.83it/s]
36%|ββββ | 1087/3000 [06:25<08:19, 3.83it/s]
36%|ββββ | 1088/3000 [06:25<08:18, 3.83it/s]
36%|ββββ | 1089/3000 [06:26<08:15, 3.86it/s]
36%|ββββ | 1090/3000 [06:26<08:11, 3.89it/s]
{'loss': 0.0915, 'grad_norm': 0.8888349533081055, 'learning_rate': 7.552319651072164e-05} |
|
36%|ββββ | 1090/3000 [06:26<08:11, 3.89it/s]
36%|ββββ | 1091/3000 [06:26<08:15, 3.85it/s]
36%|ββββ | 1092/3000 [06:26<08:20, 3.81it/s]
36%|ββββ | 1093/3000 [06:27<08:24, 3.78it/s]
36%|ββββ | 1094/3000 [06:27<08:19, 3.82it/s]
36%|ββββ | 1095/3000 [06:27<08:13, 3.86it/s]
37%|ββββ | 1096/3000 [06:27<08:19, 3.81it/s]
37%|ββββ | 1097/3000 [06:28<08:19, 3.81it/s]
37%|ββββ | 1098/3000 [06:28<08:19, 3.81it/s]
37%|ββββ | 1099/3000 [06:28<08:29, 3.73it/s]
37%|ββββ | 1100/3000 [06:28<08:27, 3.75it/s]
{'loss': 0.1019, 'grad_norm': 0.8841614127159119, 'learning_rate': 7.504771636387163e-05} |
|
37%|ββββ | 1100/3000 [06:28<08:27, 3.75it/s]
37%|ββββ | 1101/3000 [06:29<08:37, 3.67it/s]
37%|ββββ | 1102/3000 [06:29<08:34, 3.69it/s]
37%|ββββ | 1103/3000 [06:29<08:37, 3.67it/s]
37%|ββββ | 1104/3000 [06:30<08:34, 3.68it/s]
37%|ββββ | 1105/3000 [06:30<08:31, 3.71it/s]
37%|ββββ | 1106/3000 [06:30<08:22, 3.77it/s]
37%|ββββ | 1107/3000 [06:30<08:23, 3.76it/s]
37%|ββββ | 1108/3000 [06:31<08:28, 3.72it/s]
37%|ββββ | 1109/3000 [06:31<08:25, 3.74it/s]
37%|ββββ | 1110/3000 [06:31<08:21, 3.77it/s]
{'loss': 0.1104, 'grad_norm': 1.0035121440887451, 'learning_rate': 7.456919271380875e-05} |
|
37%|ββββ | 1110/3000 [06:31<08:21, 3.77it/s]
37%|ββββ | 1111/3000 [06:31<08:49, 3.57it/s]
37%|ββββ | 1112/3000 [06:32<08:41, 3.62it/s]
37%|ββββ | 1113/3000 [06:32<08:23, 3.75it/s]
37%|ββββ | 1114/3000 [06:32<08:17, 3.79it/s]
37%|ββββ | 1115/3000 [06:33<08:47, 3.57it/s]
37%|ββββ | 1116/3000 [06:33<08:51, 3.54it/s]
37%|ββββ | 1117/3000 [06:33<08:39, 3.62it/s]
37%|ββββ | 1118/3000 [06:33<08:30, 3.69it/s]
37%|ββββ | 1119/3000 [06:34<08:28, 3.70it/s]
Rank 0, Worker 3: Wait for shard 19 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
37%|ββββ | 1120/3000 [06:34<08:38, 3.62it/s]
{'loss': 0.1106, 'grad_norm': 0.8126855492591858, 'learning_rate': 7.408768370508576e-05} |
|
37%|ββββ | 1120/3000 [06:34<08:38, 3.62it/s]
37%|ββββ | 1121/3000 [06:34<08:42, 3.60it/s]
37%|ββββ | 1122/3000 [06:34<08:38, 3.62it/s]
37%|ββββ | 1123/3000 [06:35<08:42, 3.59it/s]
Rank 0, Worker 1: Wait for shard 1 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
37%|ββββ | 1124/3000 [06:35<08:35, 3.64it/s]
38%|ββββ | 1125/3000 [06:35<08:26, 3.70it/s]
38%|ββββ | 1126/3000 [06:36<08:29, 3.68it/s]
38%|ββββ | 1127/3000 [06:36<08:32, 3.66it/s]
38%|ββββ | 1128/3000 [06:36<08:30, 3.66it/s]
38%|ββββ | 1129/3000 [06:36<08:21, 3.73it/s]
38%|ββββ | 1130/3000 [06:37<08:20, 3.74it/s]
{'loss': 0.1164, 'grad_norm': 0.9205325245857239, 'learning_rate': 7.36032478450011e-05} |
|
38%|ββββ | 1130/3000 [06:37<08:20, 3.74it/s]
Rank 0, Worker 2: Wait for shard 30 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
38%|ββββ | 1131/3000 [06:37<08:36, 3.62it/s]
38%|ββββ | 1132/3000 [06:37<08:30, 3.66it/s]
38%|ββββ | 1133/3000 [06:37<08:39, 3.59it/s]
Rank 0, Worker 5: Wait for shard 7 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
38%|ββββ | 1134/3000 [06:38<08:56, 3.48it/s]
Rank 0, Worker 0: Wait for shard 52 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
38%|ββββ | 1135/3000 [06:38<09:04, 3.42it/s]
38%|ββββ | 1136/3000 [06:38<08:55, 3.48it/s]
38%|ββββ | 1137/3000 [06:39<08:47, 3.53it/s]
38%|ββββ | 1138/3000 [06:39<08:42, 3.57it/s]
Rank 0, Worker 4: Wait for shard 35 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
38%|ββββ | 1139/3000 [06:39<08:54, 3.48it/s]
38%|ββββ | 1140/3000 [06:39<08:54, 3.48it/s]
{'loss': 0.1162, 'grad_norm': 0.8639265298843384, 'learning_rate': 7.311594399648957e-05} |
|
38%|ββββ | 1140/3000 [06:40<08:54, 3.48it/s]
38%|ββββ | 1141/3000 [06:40<08:54, 3.48it/s]
38%|ββββ | 1142/3000 [06:40<08:49, 3.51it/s]
38%|ββββ | 1143/3000 [06:40<09:03, 3.41it/s]
38%|ββββ | 1144/3000 [06:41<09:08, 3.38it/s]
38%|ββββ | 1145/3000 [06:41<09:04, 3.41it/s]
38%|ββββ | 1146/3000 [06:41<09:00, 3.43it/s]
38%|ββββ | 1147/3000 [06:42<09:28, 3.26it/s]
38%|ββββ | 1148/3000 [06:42<09:24, 3.28it/s]
38%|ββββ | 1149/3000 [06:42<09:20, 3.30it/s]
38%|ββββ | 1150/3000 [06:43<09:43, 3.17it/s]
{'loss': 0.1087, 'grad_norm': 0.9142033457756042, 'learning_rate': 7.262583137097018e-05} |
|
38%|ββββ | 1150/3000 [06:43<09:43, 3.17it/s]
38%|ββββ | 1151/3000 [06:43<09:49, 3.14it/s]
38%|ββββ | 1152/3000 [06:43<09:31, 3.23it/s]
38%|ββββ | 1153/3000 [06:43<09:38, 3.19it/s]
38%|ββββ | 1154/3000 [06:44<09:14, 3.33it/s]
38%|ββββ | 1155/3000 [06:44<09:08, 3.36it/s]
39%|ββββ | 1156/3000 [06:44<08:58, 3.43it/s]
39%|ββββ | 1157/3000 [06:45<08:49, 3.48it/s]
39%|ββββ | 1158/3000 [06:45<08:47, 3.49it/s]
39%|ββββ | 1159/3000 [06:45<08:38, 3.55it/s]
39%|ββββ | 1160/3000 [06:45<08:53, 3.45it/s]
{'loss': 0.1083, 'grad_norm': 0.9747472405433655, 'learning_rate': 7.213296952115144e-05} |
|
39%|ββββ | 1160/3000 [06:45<08:53, 3.45it/s]
39%|ββββ | 1161/3000 [06:46<08:43, 3.51it/s]
39%|ββββ | 1162/3000 [06:46<08:47, 3.48it/s]
39%|ββββ | 1163/3000 [06:46<08:51, 3.46it/s]
39%|ββββ | 1164/3000 [06:47<08:39, 3.54it/s]
39%|ββββ | 1165/3000 [06:47<08:43, 3.50it/s]
39%|ββββ | 1166/3000 [06:47<09:02, 3.38it/s]
39%|ββββ | 1167/3000 [06:47<08:56, 3.42it/s]
39%|ββββ | 1168/3000 [06:48<09:12, 3.32it/s]
39%|ββββ | 1169/3000 [06:48<09:00, 3.39it/s]
39%|ββββ | 1170/3000 [06:48<09:05, 3.35it/s]
{'loss': 0.107, 'grad_norm': 0.9333903789520264, 'learning_rate': 7.16374183337951e-05} |
|
39%|ββββ | 1170/3000 [06:48<09:05, 3.35it/s]
39%|ββββ | 1171/3000 [06:49<09:12, 3.31it/s]
39%|ββββ | 1172/3000 [06:49<09:04, 3.36it/s]
39%|ββββ | 1173/3000 [06:49<09:12, 3.31it/s]
39%|ββββ | 1174/3000 [06:50<09:04, 3.35it/s]
39%|ββββ | 1175/3000 [06:50<09:03, 3.36it/s]
39%|ββββ | 1176/3000 [06:50<09:16, 3.28it/s]
39%|ββββ | 1177/3000 [06:50<08:51, 3.43it/s]
39%|ββββ | 1178/3000 [06:51<08:25, 3.60it/s]
39%|ββββ | 1179/3000 [06:51<08:13, 3.69it/s]
39%|ββββ | 1180/3000 [06:51<08:26, 3.59it/s]
{'loss': 0.1063, 'grad_norm': 0.9898130297660828, 'learning_rate': 7.113923802243957e-05} |
|
39%|ββββ | 1180/3000 [06:51<08:26, 3.59it/s]
39%|ββββ | 1181/3000 [06:52<08:17, 3.66it/s]
39%|ββββ | 1182/3000 [06:52<08:05, 3.75it/s]
39%|ββββ | 1183/3000 [06:52<07:54, 3.83it/s]
39%|ββββ | 1184/3000 [06:52<07:50, 3.86it/s]
40%|ββββ | 1185/3000 [06:53<07:55, 3.81it/s]
40%|ββββ | 1186/3000 [06:53<08:00, 3.78it/s]
40%|ββββ | 1187/3000 [06:53<07:52, 3.83it/s]
40%|ββββ | 1188/3000 [06:53<07:57, 3.79it/s]
40%|ββββ | 1189/3000 [06:54<07:56, 3.80it/s]
40%|ββββ | 1190/3000 [06:54<07:54, 3.82it/s]
{'loss': 0.1067, 'grad_norm': 0.9155701398849487, 'learning_rate': 7.06384891200834e-05} |
|
40%|ββββ | 1190/3000 [06:54<07:54, 3.82it/s]
40%|ββββ | 1191/3000 [06:54<08:06, 3.72it/s]
40%|ββββ | 1192/3000 [06:54<08:12, 3.67it/s]
40%|ββββ | 1193/3000 [06:55<08:10, 3.68it/s]
40%|ββββ | 1194/3000 [06:55<07:54, 3.80it/s]
40%|ββββ | 1195/3000 [06:55<07:50, 3.84it/s]
40%|ββββ | 1196/3000 [06:55<07:43, 3.90it/s]
40%|ββββ | 1197/3000 [06:56<07:38, 3.94it/s]
40%|ββββ | 1198/3000 [06:56<07:38, 3.93it/s]
40%|ββββ | 1199/3000 [06:56<07:37, 3.94it/s]
40%|ββββ | 1200/3000 [06:56<07:32, 3.98it/s]
{'loss': 0.109, 'grad_norm': 0.8125437498092651, 'learning_rate': 7.013523247183e-05} |
|
40%|ββββ | 1200/3000 [06:56<07:32, 3.98it/s]
40%|ββββ | 1201/3000 [06:57<07:30, 3.99it/s]
40%|ββββ | 1202/3000 [06:57<07:27, 4.02it/s]
40%|ββββ | 1203/3000 [06:57<07:44, 3.87it/s]
40%|ββββ | 1204/3000 [06:58<08:01, 3.73it/s]
40%|ββββ | 1205/3000 [06:58<07:52, 3.80it/s]
40%|ββββ | 1206/3000 [06:58<07:42, 3.88it/s]
40%|ββββ | 1207/3000 [06:58<08:07, 3.68it/s]
40%|ββββ | 1208/3000 [06:59<08:02, 3.72it/s]
40%|ββββ | 1209/3000 [06:59<07:52, 3.79it/s]
40%|ββββ | 1210/3000 [06:59<07:42, 3.87it/s]
{'loss': 0.1063, 'grad_norm': 1.1674683094024658, 'learning_rate': 6.962952922749457e-05} |
|
40%|ββββ | 1210/3000 [06:59<07:42, 3.87it/s]
40%|ββββ | 1211/3000 [06:59<07:41, 3.88it/s]
40%|ββββ | 1212/3000 [07:00<07:37, 3.91it/s]
40%|ββββ | 1213/3000 [07:00<07:36, 3.92it/s]
40%|ββββ | 1214/3000 [07:00<07:35, 3.92it/s]
40%|ββββ | 1215/3000 [07:00<07:35, 3.92it/s]
41%|ββββ | 1216/3000 [07:01<07:35, 3.92it/s]
41%|ββββ | 1217/3000 [07:01<08:13, 3.61it/s]
41%|ββββ | 1218/3000 [07:01<08:06, 3.66it/s]
41%|ββββ | 1219/3000 [07:01<08:00, 3.71it/s]
41%|ββββ | 1220/3000 [07:02<07:58, 3.72it/s]
{'loss': 0.1014, 'grad_norm': 0.7635449767112732, 'learning_rate': 6.912144083417376e-05} |
|
41%|ββββ | 1220/3000 [07:02<07:58, 3.72it/s]
41%|ββββ | 1221/3000 [07:02<07:58, 3.72it/s]
41%|ββββ | 1222/3000 [07:02<07:56, 3.73it/s]
41%|ββββ | 1223/3000 [07:03<07:55, 3.74it/s]
41%|ββββ | 1224/3000 [07:03<07:57, 3.72it/s]
41%|ββββ | 1225/3000 [07:03<08:18, 3.56it/s]
41%|ββββ | 1226/3000 [07:03<08:06, 3.65it/s]
41%|ββββ | 1227/3000 [07:04<08:14, 3.59it/s]
41%|ββββ | 1228/3000 [07:04<08:12, 3.60it/s]
41%|ββββ | 1229/3000 [07:04<08:37, 3.42it/s]
41%|ββββ | 1230/3000 [07:05<08:39, 3.41it/s]
{'loss': 0.1098, 'grad_norm': 0.9461978673934937, 'learning_rate': 6.861102902877946e-05} |
|
41%|ββββ | 1230/3000 [07:05<08:39, 3.41it/s]
41%|ββββ | 1231/3000 [07:05<08:38, 3.41it/s]
41%|ββββ | 1232/3000 [07:05<08:38, 3.41it/s]
41%|ββββ | 1233/3000 [07:05<08:25, 3.50it/s]
41%|ββββ | 1234/3000 [07:06<08:19, 3.54it/s]
41%|ββββ | 1235/3000 [07:06<08:15, 3.56it/s]
41%|ββββ | 1236/3000 [07:06<08:39, 3.40it/s]
41%|ββββ | 1237/3000 [07:07<08:42, 3.38it/s]
41%|βββββ | 1238/3000 [07:07<09:01, 3.26it/s]
41%|βββββ | 1239/3000 [07:07<08:41, 3.38it/s]
41%|βββββ | 1240/3000 [07:07<08:44, 3.35it/s]
{'loss': 0.0984, 'grad_norm': 0.9153079986572266, 'learning_rate': 6.809835583053715e-05} |
|
41%|βββββ | 1240/3000 [07:08<08:44, 3.35it/s]
41%|βββββ | 1241/3000 [07:08<08:40, 3.38it/s]
41%|βββββ | 1242/3000 [07:08<08:32, 3.43it/s]
41%|βββββ | 1243/3000 [07:08<08:27, 3.46it/s]
41%|βββββ | 1244/3000 [07:09<08:31, 3.43it/s]
42%|βββββ | 1245/3000 [07:09<08:34, 3.41it/s]
42%|βββββ | 1246/3000 [07:09<08:24, 3.48it/s]
42%|βββββ | 1247/3000 [07:09<08:20, 3.51it/s]
42%|βββββ | 1248/3000 [07:10<08:27, 3.45it/s]
42%|βββββ | 1249/3000 [07:10<08:44, 3.34it/s]
42%|βββββ | 1250/3000 [07:10<08:34, 3.40it/s]
{'loss': 0.1082, 'grad_norm': 0.9081976413726807, 'learning_rate': 6.758348353345014e-05} |
|
42%|βββββ | 1250/3000 [07:10<08:34, 3.40it/s]
42%|βββββ | 1251/3000 [07:11<08:43, 3.34it/s]
42%|βββββ | 1252/3000 [07:11<08:30, 3.42it/s]
42%|βββββ | 1253/3000 [07:11<08:13, 3.54it/s]
42%|βββββ | 1254/3000 [07:12<08:12, 3.54it/s]
42%|βββββ | 1255/3000 [07:12<08:18, 3.50it/s]
42%|βββββ | 1256/3000 [07:12<08:16, 3.52it/s]
42%|βββββ | 1257/3000 [07:12<08:12, 3.54it/s]
42%|βββββ | 1258/3000 [07:13<08:07, 3.57it/s]
42%|βββββ | 1259/3000 [07:13<08:08, 3.57it/s]
42%|βββββ | 1260/3000 [07:13<07:57, 3.65it/s]
{'loss': 0.0989, 'grad_norm': 0.8417534232139587, 'learning_rate': 6.706647469873031e-05} |
|
42%|βββββ | 1260/3000 [07:13<07:57, 3.65it/s]
42%|βββββ | 1261/3000 [07:13<07:59, 3.62it/s]
42%|βββββ | 1262/3000 [07:14<07:52, 3.68it/s]
42%|βββββ | 1263/3000 [07:14<07:51, 3.69it/s]
42%|βββββ | 1264/3000 [07:14<07:42, 3.75it/s]
42%|βββββ | 1265/3000 [07:15<07:37, 3.79it/s]
42%|βββββ | 1266/3000 [07:15<07:36, 3.80it/s]
42%|βββββ | 1267/3000 [07:15<07:52, 3.66it/s]
42%|βββββ | 1268/3000 [07:15<08:10, 3.53it/s]
42%|βββββ | 1269/3000 [07:16<07:55, 3.64it/s]
42%|βββββ | 1270/3000 [07:16<07:46, 3.71it/s]
{'loss': 0.1092, 'grad_norm': 0.7554053664207458, 'learning_rate': 6.654739214719641e-05} |
|
42%|βββββ | 1270/3000 [07:16<07:46, 3.71it/s]
42%|βββββ | 1271/3000 [07:16<07:52, 3.66it/s]
42%|βββββ | 1272/3000 [07:16<07:55, 3.63it/s]
42%|βββββ | 1273/3000 [07:17<07:58, 3.61it/s]
42%|βββββ | 1274/3000 [07:17<07:50, 3.67it/s]
42%|βββββ | 1275/3000 [07:17<07:56, 3.62it/s]
43%|βββββ | 1276/3000 [07:18<07:44, 3.71it/s]
43%|βββββ | 1277/3000 [07:18<07:58, 3.60it/s]
43%|βββββ | 1278/3000 [07:18<08:08, 3.53it/s]
43%|βββββ | 1279/3000 [07:18<08:06, 3.54it/s]
43%|βββββ | 1280/3000 [07:19<07:45, 3.69it/s]
{'loss': 0.1084, 'grad_norm': 0.8323029279708862, 'learning_rate': 6.602629895164081e-05} |
|
43%|βββββ | 1280/3000 [07:19<07:45, 3.69it/s]
43%|βββββ | 1281/3000 [07:19<07:38, 3.75it/s]
43%|βββββ | 1282/3000 [07:19<07:38, 3.75it/s]
43%|βββββ | 1283/3000 [07:19<07:33, 3.79it/s]
43%|βββββ | 1284/3000 [07:20<07:36, 3.76it/s]
43%|βββββ | 1285/3000 [07:20<07:30, 3.81it/s]
43%|βββββ | 1286/3000 [07:20<07:25, 3.85it/s]
43%|βββββ | 1287/3000 [07:20<07:20, 3.89it/s]
43%|βββββ | 1288/3000 [07:21<07:05, 4.02it/s]
43%|βββββ | 1289/3000 [07:21<07:03, 4.04it/s]
43%|βββββ | 1290/3000 [07:21<07:02, 4.04it/s]
{'loss': 0.1096, 'grad_norm': 0.8539698719978333, 'learning_rate': 6.550325842916559e-05} |
|
43%|βββββ | 1290/3000 [07:21<07:02, 4.04it/s]
43%|βββββ | 1291/3000 [07:21<06:59, 4.07it/s]
43%|βββββ | 1292/3000 [07:22<06:51, 4.15it/s]
43%|βββββ | 1293/3000 [07:22<06:49, 4.17it/s]
43%|βββββ | 1294/3000 [07:22<06:46, 4.19it/s]
43%|βββββ | 1295/3000 [07:22<06:42, 4.24it/s]
43%|βββββ | 1296/3000 [07:23<06:43, 4.22it/s]
43%|βββββ | 1297/3000 [07:23<06:44, 4.21it/s]
43%|βββββ | 1298/3000 [07:23<06:40, 4.25it/s]
43%|βββββ | 1299/3000 [07:23<06:40, 4.25it/s]
43%|βββββ | 1300/3000 [07:24<06:40, 4.25it/s]
{'loss': 0.0989, 'grad_norm': 0.8106352686882019, 'learning_rate': 6.497833413348909e-05} |
|
43%|βββββ | 1300/3000 [07:24<06:40, 4.25it/s]
43%|βββββ | 1301/3000 [07:24<06:47, 4.17it/s]
43%|βββββ | 1302/3000 [07:24<06:51, 4.12it/s]
43%|βββββ | 1303/3000 [07:24<06:45, 4.18it/s]
43%|βββββ | 1304/3000 [07:25<06:43, 4.20it/s]
44%|βββββ | 1305/3000 [07:25<07:18, 3.86it/s]
Rank 0, Worker 3: Wait for shard 29 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
44%|βββββ | 1306/3000 [07:25<07:18, 3.86it/s]
44%|βββββ | 1307/3000 [07:25<07:18, 3.86it/s]
44%|βββββ | 1308/3000 [07:26<07:15, 3.89it/s]
44%|βββββ | 1309/3000 [07:26<07:17, 3.87it/s]
Rank 0, Worker 1: Wait for shard 53 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
44%|βββββ | 1310/3000 [07:26<07:17, 3.86it/s]
{'loss': 0.0983, 'grad_norm': 0.7995074987411499, 'learning_rate': 6.445158984722358e-05} |
|
44%|βββββ | 1310/3000 [07:26<07:17, 3.86it/s]
44%|βββββ | 1311/3000 [07:26<07:18, 3.85it/s]
44%|βββββ | 1312/3000 [07:27<07:18, 3.85it/s]
44%|βββββ | 1313/3000 [07:27<07:17, 3.86it/s]
44%|βββββ | 1314/3000 [07:27<07:14, 3.88it/s]
44%|βββββ | 1315/3000 [07:27<07:13, 3.89it/s]
44%|βββββ | 1316/3000 [07:28<07:18, 3.84it/s]
44%|βββββ | 1317/3000 [07:28<07:18, 3.84it/s]
44%|βββββ | 1318/3000 [07:28<07:14, 3.88it/s]
44%|βββββ | 1319/3000 [07:28<07:15, 3.86it/s]
44%|βββββ | 1320/3000 [07:29<07:22, 3.80it/s]
{'loss': 0.1061, 'grad_norm': 0.7924811244010925, 'learning_rate': 6.39230895741251e-05} |
|
44%|βββββ | 1320/3000 [07:29<07:22, 3.80it/s]
44%|βββββ | 1321/3000 [07:29<07:24, 3.78it/s]
44%|βββββ | 1322/3000 [07:29<07:24, 3.77it/s]
Rank 0, Worker 2: Wait for shard 41 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
44%|βββββ | 1323/3000 [07:30<07:35, 3.68it/s]
44%|βββββ | 1324/3000 [07:30<07:49, 3.57it/s]
Rank 0, Worker 4: Wait for shard 5 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
44%|βββββ | 1325/3000 [07:30<07:46, 3.59it/s]
Rank 0, Worker 5: Wait for shard 31 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
44%|βββββ | 1326/3000 [07:30<07:50, 3.56it/s]
Rank 0, Worker 0: Wait for shard 14 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
44%|βββββ | 1327/3000 [07:31<07:49, 3.56it/s]
44%|βββββ | 1328/3000 [07:31<07:47, 3.58it/s]
44%|βββββ | 1329/3000 [07:31<07:49, 3.56it/s]
44%|βββββ | 1330/3000 [07:32<07:52, 3.53it/s]
{'loss': 0.0937, 'grad_norm': 0.8590250015258789, 'learning_rate': 6.339289753131649e-05} |
|
44%|βββββ | 1330/3000 [07:32<07:52, 3.53it/s]
44%|βββββ | 1331/3000 [07:32<07:56, 3.50it/s]
44%|βββββ | 1332/3000 [07:32<07:47, 3.57it/s]
44%|βββββ | 1333/3000 [07:32<08:00, 3.47it/s]
44%|βββββ | 1334/3000 [07:33<07:58, 3.48it/s]
44%|βββββ | 1335/3000 [07:33<07:46, 3.57it/s]
45%|βββββ | 1336/3000 [07:33<07:40, 3.61it/s]
45%|βββββ | 1337/3000 [07:34<07:50, 3.54it/s]
45%|βββββ | 1338/3000 [07:34<08:20, 3.32it/s]
45%|βββββ | 1339/3000 [07:34<08:09, 3.39it/s]
45%|βββββ | 1340/3000 [07:34<08:08, 3.40it/s]
{'loss': 0.1028, 'grad_norm': 0.8390591740608215, 'learning_rate': 6.286107814148454e-05} |
|
45%|βββββ | 1340/3000 [07:34<08:08, 3.40it/s]
45%|βββββ | 1341/3000 [07:35<08:05, 3.41it/s]
45%|βββββ | 1342/3000 [07:35<07:57, 3.47it/s]
45%|βββββ | 1343/3000 [07:35<07:58, 3.46it/s]
45%|βββββ | 1344/3000 [07:36<08:43, 3.16it/s]
45%|βββββ | 1345/3000 [07:36<08:18, 3.32it/s]
45%|βββββ | 1346/3000 [07:36<08:13, 3.35it/s]
45%|βββββ | 1347/3000 [07:37<08:15, 3.33it/s]
45%|βββββ | 1348/3000 [07:37<08:23, 3.28it/s]
45%|βββββ | 1349/3000 [07:37<08:19, 3.31it/s]
45%|βββββ | 1350/3000 [07:37<08:21, 3.29it/s]
{'loss': 0.0928, 'grad_norm': 0.9238244891166687, 'learning_rate': 6.232769602505203e-05} |
|
45%|βββββ | 1350/3000 [07:37<08:21, 3.29it/s]
45%|βββββ | 1351/3000 [07:38<08:09, 3.37it/s]
45%|βββββ | 1352/3000 [07:38<08:03, 3.41it/s]
45%|βββββ | 1353/3000 [07:38<07:52, 3.49it/s]
45%|βββββ | 1354/3000 [07:39<07:58, 3.44it/s]
45%|βββββ | 1355/3000 [07:39<08:05, 3.39it/s]
45%|βββββ | 1356/3000 [07:39<08:03, 3.40it/s]
45%|βββββ | 1357/3000 [07:39<07:54, 3.46it/s]
45%|βββββ | 1358/3000 [07:40<07:46, 3.52it/s]
45%|βββββ | 1359/3000 [07:40<07:46, 3.52it/s]
45%|βββββ | 1360/3000 [07:40<07:47, 3.51it/s]
{'loss': 0.1027, 'grad_norm': 0.9588078260421753, 'learning_rate': 6.179281599232591e-05} |
|
45%|βββββ | 1360/3000 [07:40<07:47, 3.51it/s]
45%|βββββ | 1361/3000 [07:41<07:44, 3.53it/s]
45%|βββββ | 1362/3000 [07:41<07:38, 3.57it/s]
45%|βββββ | 1363/3000 [07:41<07:39, 3.56it/s]
45%|βββββ | 1364/3000 [07:41<08:01, 3.40it/s]
46%|βββββ | 1365/3000 [07:42<07:46, 3.50it/s]
46%|βββββ | 1366/3000 [07:42<07:44, 3.52it/s]
46%|βββββ | 1367/3000 [07:42<07:42, 3.53it/s]
46%|βββββ | 1368/3000 [07:43<07:40, 3.54it/s]
46%|βββββ | 1369/3000 [07:43<07:31, 3.61it/s]
46%|βββββ | 1370/3000 [07:43<07:41, 3.53it/s]
{'loss': 0.0845, 'grad_norm': 0.9879070520401001, 'learning_rate': 6.125650303562221e-05} |
|
46%|βββββ | 1370/3000 [07:43<07:41, 3.53it/s]
46%|βββββ | 1371/3000 [07:43<07:40, 3.53it/s]
46%|βββββ | 1372/3000 [07:44<07:31, 3.60it/s]
46%|βββββ | 1373/3000 [07:44<07:20, 3.69it/s]
46%|βββββ | 1374/3000 [07:44<07:18, 3.71it/s]
46%|βββββ | 1375/3000 [07:44<07:17, 3.71it/s]
46%|βββββ | 1376/3000 [07:45<07:13, 3.75it/s]
46%|βββββ | 1377/3000 [07:45<07:11, 3.76it/s]
46%|βββββ | 1378/3000 [07:45<07:11, 3.76it/s]
46%|βββββ | 1379/3000 [07:46<07:15, 3.72it/s]
46%|βββββ | 1380/3000 [07:46<07:23, 3.65it/s]
{'loss': 0.0996, 'grad_norm': 0.8994210362434387, 'learning_rate': 6.071882232136901e-05} |
|
46%|βββββ | 1380/3000 [07:46<07:23, 3.65it/s]
46%|βββββ | 1381/3000 [07:46<07:57, 3.39it/s]
46%|βββββ | 1382/3000 [07:46<07:42, 3.50it/s]
46%|βββββ | 1383/3000 [07:47<07:29, 3.60it/s]
46%|βββββ | 1384/3000 [07:47<07:18, 3.69it/s]
46%|βββββ | 1385/3000 [07:47<07:10, 3.76it/s]
46%|βββββ | 1386/3000 [07:47<07:12, 3.74it/s]
46%|βββββ | 1387/3000 [07:48<07:08, 3.77it/s]
46%|βββββ | 1388/3000 [07:48<07:04, 3.79it/s]
46%|βββββ | 1389/3000 [07:48<07:03, 3.81it/s]
46%|βββββ | 1390/3000 [07:48<06:59, 3.84it/s]
{'loss': 0.091, 'grad_norm': 0.7739790081977844, 'learning_rate': 6.017983918218812e-05} |
|
46%|βββββ | 1390/3000 [07:49<06:59, 3.84it/s]
46%|βββββ | 1391/3000 [07:49<06:58, 3.85it/s]
46%|βββββ | 1392/3000 [07:49<07:02, 3.80it/s]
46%|βββββ | 1393/3000 [07:49<06:58, 3.84it/s]
46%|βββββ | 1394/3000 [07:50<06:50, 3.91it/s]
46%|βββββ | 1395/3000 [07:50<07:25, 3.61it/s]
47%|βββββ | 1396/3000 [07:50<07:17, 3.67it/s]
47%|βββββ | 1397/3000 [07:50<07:06, 3.76it/s]
47%|βββββ | 1398/3000 [07:51<06:59, 3.82it/s]
47%|βββββ | 1399/3000 [07:51<07:09, 3.73it/s]
47%|βββββ | 1400/3000 [07:51<07:02, 3.79it/s]
{'loss': 0.0993, 'grad_norm': 0.7537702322006226, 'learning_rate': 5.963961910895676e-05} |
|
47%|βββββ | 1400/3000 [07:51<07:02, 3.79it/s]
47%|βββββ | 1401/3000 [07:51<06:56, 3.84it/s]
47%|βββββ | 1402/3000 [07:52<06:57, 3.83it/s]
47%|βββββ | 1403/3000 [07:52<06:54, 3.86it/s]
47%|βββββ | 1404/3000 [07:52<06:49, 3.89it/s]
47%|βββββ | 1405/3000 [07:52<06:49, 3.89it/s]
47%|βββββ | 1406/3000 [07:53<06:49, 3.90it/s]
47%|βββββ | 1407/3000 [07:53<06:46, 3.92it/s]
47%|βββββ | 1408/3000 [07:53<06:40, 3.97it/s]
47%|βββββ | 1409/3000 [07:53<06:45, 3.93it/s]
47%|βββββ | 1410/3000 [07:54<06:50, 3.87it/s]
{'loss': 0.0863, 'grad_norm': 0.7310131788253784, 'learning_rate': 5.909822774284971e-05} |
|
47%|βββββ | 1410/3000 [07:54<06:50, 3.87it/s]
47%|βββββ | 1411/3000 [07:54<06:49, 3.88it/s]
47%|βββββ | 1412/3000 [07:54<06:49, 3.88it/s]
47%|βββββ | 1413/3000 [07:54<06:48, 3.89it/s]
47%|βββββ | 1414/3000 [07:55<06:46, 3.90it/s]
47%|βββββ | 1415/3000 [07:55<06:45, 3.91it/s]
47%|βββββ | 1416/3000 [07:55<06:42, 3.94it/s]
47%|βββββ | 1417/3000 [07:55<06:40, 3.95it/s]
47%|βββββ | 1418/3000 [07:56<06:42, 3.93it/s]
47%|βββββ | 1419/3000 [07:56<06:39, 3.95it/s]
47%|βββββ | 1420/3000 [07:56<06:59, 3.77it/s]
{'loss': 0.0992, 'grad_norm': 0.7676811814308167, 'learning_rate': 5.85557308673635e-05} |
|
47%|βββββ | 1420/3000 [07:56<06:59, 3.77it/s]
47%|βββββ | 1421/3000 [07:57<07:08, 3.69it/s]
47%|βββββ | 1422/3000 [07:57<07:02, 3.73it/s]
47%|βββββ | 1423/3000 [07:57<06:54, 3.80it/s]
47%|βββββ | 1424/3000 [07:57<06:49, 3.85it/s]
48%|βββββ | 1425/3000 [07:58<06:49, 3.85it/s]
48%|βββββ | 1426/3000 [07:58<06:45, 3.88it/s]
48%|βββββ | 1427/3000 [07:58<06:42, 3.91it/s]
48%|βββββ | 1428/3000 [07:58<06:39, 3.94it/s]
48%|βββββ | 1429/3000 [07:59<06:41, 3.91it/s]
48%|βββββ | 1430/3000 [07:59<06:43, 3.90it/s]
{'loss': 0.0932, 'grad_norm': 0.8523777723312378, 'learning_rate': 5.8012194400323116e-05} |
|
48%|βββββ | 1430/3000 [07:59<06:43, 3.90it/s]
48%|βββββ | 1431/3000 [07:59<06:43, 3.89it/s]
48%|βββββ | 1432/3000 [07:59<06:46, 3.86it/s]
48%|βββββ | 1433/3000 [08:00<06:45, 3.86it/s]
48%|βββββ | 1434/3000 [08:00<06:47, 3.84it/s]
48%|βββββ | 1435/3000 [08:00<06:46, 3.85it/s]
48%|βββββ | 1436/3000 [08:00<07:03, 3.69it/s]
48%|βββββ | 1437/3000 [08:01<07:10, 3.63it/s]
48%|βββββ | 1438/3000 [08:01<07:03, 3.69it/s]
48%|βββββ | 1439/3000 [08:01<06:56, 3.75it/s]
48%|βββββ | 1440/3000 [08:02<06:51, 3.79it/s]
{'loss': 0.0952, 'grad_norm': 0.7514760494232178, 'learning_rate': 5.746768438587245e-05} |
|
48%|βββββ | 1440/3000 [08:02<06:51, 3.79it/s]
48%|βββββ | 1441/3000 [08:02<06:54, 3.76it/s]
48%|βββββ | 1442/3000 [08:02<06:53, 3.77it/s]
48%|βββββ | 1443/3000 [08:02<06:46, 3.83it/s]
48%|βββββ | 1444/3000 [08:03<06:41, 3.88it/s]
48%|βββββ | 1445/3000 [08:03<06:41, 3.87it/s]
48%|βββββ | 1446/3000 [08:03<06:43, 3.86it/s]
48%|βββββ | 1447/3000 [08:03<06:48, 3.80it/s]
48%|βββββ | 1448/3000 [08:04<06:52, 3.76it/s]
48%|βββββ | 1449/3000 [08:04<06:57, 3.71it/s]
48%|βββββ | 1450/3000 [08:04<06:58, 3.71it/s]
{'loss': 0.1012, 'grad_norm': 0.818895697593689, 'learning_rate': 5.692226698644938e-05} |
|
48%|βββββ | 1450/3000 [08:04<06:58, 3.71it/s]
48%|βββββ | 1451/3000 [08:05<07:59, 3.23it/s]
48%|βββββ | 1452/3000 [08:05<07:38, 3.38it/s]
48%|βββββ | 1453/3000 [08:05<07:30, 3.44it/s]
48%|βββββ | 1454/3000 [08:05<07:19, 3.52it/s]
48%|βββββ | 1455/3000 [08:06<07:08, 3.61it/s]
49%|βββββ | 1456/3000 [08:06<07:04, 3.64it/s]
49%|βββββ | 1457/3000 [08:06<07:07, 3.61it/s]
49%|βββββ | 1458/3000 [08:06<06:57, 3.69it/s]
49%|βββββ | 1459/3000 [08:07<06:50, 3.75it/s]
49%|βββββ | 1460/3000 [08:07<06:50, 3.76it/s]
{'loss': 0.0953, 'grad_norm': 0.924919605255127, 'learning_rate': 5.637600847474656e-05} |
|
49%|βββββ | 1460/3000 [08:07<06:50, 3.76it/s]
49%|βββββ | 1461/3000 [08:07<06:54, 3.72it/s]
49%|βββββ | 1462/3000 [08:08<06:49, 3.76it/s]
49%|βββββ | 1463/3000 [08:08<06:50, 3.75it/s]
49%|βββββ | 1464/3000 [08:08<06:59, 3.66it/s]
49%|βββββ | 1465/3000 [08:08<07:24, 3.45it/s]
49%|βββββ | 1466/3000 [08:09<07:12, 3.55it/s]
49%|βββββ | 1467/3000 [08:09<07:14, 3.53it/s]
49%|βββββ | 1468/3000 [08:09<07:11, 3.55it/s]
49%|βββββ | 1469/3000 [08:09<06:57, 3.67it/s]
49%|βββββ | 1470/3000 [08:10<06:48, 3.75it/s]
{'loss': 0.093, 'grad_norm': 0.8371473550796509, 'learning_rate': 5.5828975225658666e-05} |
|
49%|βββββ | 1470/3000 [08:10<06:48, 3.75it/s]
49%|βββββ | 1471/3000 [08:10<06:50, 3.72it/s]
49%|βββββ | 1472/3000 [08:10<06:48, 3.74it/s]
49%|βββββ | 1473/3000 [08:11<06:44, 3.78it/s]
49%|βββββ | 1474/3000 [08:11<06:43, 3.79it/s]
49%|βββββ | 1475/3000 [08:11<06:42, 3.79it/s]
49%|βββββ | 1476/3000 [08:11<06:47, 3.74it/s]
49%|βββββ | 1477/3000 [08:12<06:47, 3.73it/s]
49%|βββββ | 1478/3000 [08:12<06:49, 3.72it/s]
49%|βββββ | 1479/3000 [08:12<06:53, 3.68it/s]
49%|βββββ | 1480/3000 [08:12<06:45, 3.75it/s]
{'loss': 0.1001, 'grad_norm': 0.7313709259033203, 'learning_rate': 5.52812337082173e-05} |
|
49%|βββββ | 1480/3000 [08:12<06:45, 3.75it/s]
49%|βββββ | 1481/3000 [08:13<06:39, 3.80it/s]
49%|βββββ | 1482/3000 [08:13<06:34, 3.85it/s]
49%|βββββ | 1483/3000 [08:13<06:35, 3.84it/s]
49%|βββββ | 1484/3000 [08:13<06:30, 3.88it/s]
50%|βββββ | 1485/3000 [08:14<06:33, 3.85it/s]
50%|βββββ | 1486/3000 [08:14<06:37, 3.81it/s]
50%|βββββ | 1487/3000 [08:14<06:35, 3.83it/s]
50%|βββββ | 1488/3000 [08:14<06:32, 3.85it/s]
50%|βββββ | 1489/3000 [08:15<06:31, 3.86it/s]
50%|βββββ | 1490/3000 [08:15<06:31, 3.86it/s]
{'loss': 0.0922, 'grad_norm': 0.8753471374511719, 'learning_rate': 5.473285047751451e-05} |
|
50%|βββββ | 1490/3000 [08:15<06:31, 3.86it/s]
50%|βββββ | 1491/3000 [08:15<06:39, 3.78it/s]
Rank 0, Worker 3: Wait for shard 44 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
50%|βββββ | 1492/3000 [08:16<06:40, 3.77it/s]
50%|βββββ | 1493/3000 [08:16<06:32, 3.84it/s]
50%|βββββ | 1494/3000 [08:16<06:37, 3.79it/s]
50%|βββββ | 1495/3000 [08:16<06:46, 3.70it/s]
50%|βββββ | 1496/3000 [08:17<06:52, 3.64it/s]
50%|βββββ | 1497/3000 [08:17<06:52, 3.65it/s]
50%|βββββ | 1498/3000 [08:17<06:54, 3.62it/s]
50%|βββββ | 1499/3000 [08:17<07:00, 3.57it/s]
50%|βββββ | 1500/3000 [08:18<07:27, 3.35it/s]
{'loss': 0.1086, 'grad_norm': 0.7444728016853333, 'learning_rate': 5.418389216661579e-05} |
|
50%|βββββ | 1500/3000 [08:18<07:27, 3.35it/s]
Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1500/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1500 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-1500/wandb_config.json |
|
50%|βββββ | 1501/3000 [08:55<4:42:37, 11.31s/it]
Rank 0, Worker 1: Wait for shard 42 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
50%|βββββ | 1502/3000 [08:55<3:19:49, 8.00s/it]
50%|βββββ | 1503/3000 [08:55<2:21:49, 5.68s/it]
50%|βββββ | 1504/3000 [08:56<1:41:13, 4.06s/it]
50%|βββββ | 1505/3000 [08:56<1:12:52, 2.92s/it]
50%|βββββ | 1506/3000 [08:56<53:04, 2.13s/it]
Rank 0, Worker 0: Wait for shard 49 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
50%|βββββ | 1507/3000 [08:57<39:29, 1.59s/it]
50%|βββββ | 1508/3000 [08:57<29:36, 1.19s/it]
50%|βββββ | 1509/3000 [08:57<22:42, 1.09it/s]
50%|βββββ | 1510/3000 [08:57<17:51, 1.39it/s]
{'loss': 0.0941, 'grad_norm': 0.9741567373275757, 'learning_rate': 5.363442547846356e-05} |
|
50%|βββββ | 1510/3000 [08:57<17:51, 1.39it/s]
Rank 0, Worker 4: Wait for shard 11 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
50%|βββββ | 1511/3000 [08:58<14:29, 1.71it/s]
50%|βββββ | 1512/3000 [08:58<12:06, 2.05it/s]
50%|βββββ | 1513/3000 [08:58<10:30, 2.36it/s]
50%|βββββ | 1514/3000 [08:58<09:20, 2.65it/s]
Rank 0, Worker 2: Wait for shard 23 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
50%|βββββ | 1515/3000 [08:59<08:54, 2.78it/s]
51%|βββββ | 1516/3000 [08:59<08:15, 3.00it/s]
51%|βββββ | 1517/3000 [08:59<07:39, 3.23it/s]
Rank 0, Worker 5: Wait for shard 36 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
51%|βββββ | 1518/3000 [09:00<07:24, 3.33it/s]
51%|βββββ | 1519/3000 [09:00<07:17, 3.38it/s]
51%|βββββ | 1520/3000 [09:00<06:59, 3.53it/s]
{'loss': 0.0982, 'grad_norm': 0.7972890138626099, 'learning_rate': 5.308451717777228e-05} |
|
51%|βββββ | 1520/3000 [09:00<06:59, 3.53it/s]
51%|βββββ | 1521/3000 [09:00<06:50, 3.60it/s]
51%|βββββ | 1522/3000 [09:01<06:47, 3.62it/s]
51%|βββββ | 1523/3000 [09:01<06:42, 3.67it/s]
51%|βββββ | 1524/3000 [09:01<06:34, 3.74it/s]
51%|βββββ | 1525/3000 [09:01<06:30, 3.77it/s]
51%|βββββ | 1526/3000 [09:02<06:31, 3.76it/s]
51%|βββββ | 1527/3000 [09:02<06:27, 3.80it/s]
51%|βββββ | 1528/3000 [09:02<06:25, 3.82it/s]
51%|βββββ | 1529/3000 [09:02<06:32, 3.75it/s]
51%|βββββ | 1530/3000 [09:03<06:40, 3.67it/s]
{'loss': 0.0928, 'grad_norm': 0.6457614302635193, 'learning_rate': 5.2534234082915976e-05} |
|
51%|βββββ | 1530/3000 [09:03<06:40, 3.67it/s]
51%|βββββ | 1531/3000 [09:03<06:57, 3.52it/s]
51%|βββββ | 1532/3000 [09:03<06:49, 3.58it/s]
51%|βββββ | 1533/3000 [09:04<06:45, 3.62it/s]
51%|βββββ | 1534/3000 [09:04<06:41, 3.65it/s]
51%|βββββ | 1535/3000 [09:04<06:32, 3.73it/s]
51%|βββββ | 1536/3000 [09:04<06:28, 3.77it/s]
51%|βββββ | 1537/3000 [09:05<06:33, 3.72it/s]
51%|ββββββ | 1538/3000 [09:05<06:27, 3.77it/s]
51%|ββββββ | 1539/3000 [09:05<06:26, 3.78it/s]
51%|ββββββ | 1540/3000 [09:05<06:22, 3.82it/s]
{'loss': 0.1023, 'grad_norm': 0.7331373691558838, 'learning_rate': 5.198364305780922e-05} |
|
51%|ββββββ | 1540/3000 [09:05<06:22, 3.82it/s]
51%|ββββββ | 1541/3000 [09:06<06:30, 3.74it/s]
51%|ββββββ | 1542/3000 [09:06<06:44, 3.61it/s]
51%|ββββββ | 1543/3000 [09:06<06:52, 3.53it/s]
51%|ββββββ | 1544/3000 [09:07<06:41, 3.62it/s]
52%|ββββββ | 1545/3000 [09:07<06:31, 3.72it/s]
52%|ββββββ | 1546/3000 [09:07<06:35, 3.67it/s]
52%|ββββββ | 1547/3000 [09:07<06:33, 3.69it/s]
52%|ββββββ | 1548/3000 [09:08<06:28, 3.74it/s]
52%|ββββββ | 1549/3000 [09:08<06:30, 3.72it/s]
52%|ββββββ | 1550/3000 [09:08<06:28, 3.74it/s]
{'loss': 0.0982, 'grad_norm': 0.7099182605743408, 'learning_rate': 5.143281100378261e-05} |
|
52%|ββββββ | 1550/3000 [09:08<06:28, 3.74it/s]
52%|ββββββ | 1551/3000 [09:08<06:28, 3.73it/s]
52%|ββββββ | 1552/3000 [09:09<06:28, 3.72it/s]
52%|ββββββ | 1553/3000 [09:09<06:25, 3.75it/s]
52%|ββββββ | 1554/3000 [09:09<06:28, 3.72it/s]
52%|ββββββ | 1555/3000 [09:09<06:27, 3.73it/s]
52%|ββββββ | 1556/3000 [09:10<06:25, 3.75it/s]
52%|ββββββ | 1557/3000 [09:10<06:21, 3.78it/s]
52%|ββββββ | 1558/3000 [09:10<06:24, 3.75it/s]
52%|ββββββ | 1559/3000 [09:11<06:28, 3.71it/s]
52%|ββββββ | 1560/3000 [09:11<06:22, 3.76it/s]
{'loss': 0.092, 'grad_norm': 0.7106317281723022, 'learning_rate': 5.088180485145378e-05} |
|
52%|ββββββ | 1560/3000 [09:11<06:22, 3.76it/s]
52%|ββββββ | 1561/3000 [09:11<06:21, 3.77it/s]
52%|ββββββ | 1562/3000 [09:11<06:18, 3.80it/s]
52%|ββββββ | 1563/3000 [09:12<06:20, 3.77it/s]
52%|ββββββ | 1564/3000 [09:12<06:32, 3.66it/s]
52%|ββββββ | 1565/3000 [09:12<06:33, 3.65it/s]
52%|ββββββ | 1566/3000 [09:12<06:39, 3.59it/s]
52%|ββββββ | 1567/3000 [09:13<06:41, 3.57it/s]
52%|ββββββ | 1568/3000 [09:13<06:40, 3.58it/s]
52%|ββββββ | 1569/3000 [09:13<06:42, 3.55it/s]
52%|ββββββ | 1570/3000 [09:14<06:39, 3.58it/s]
{'loss': 0.0849, 'grad_norm': 0.7194976210594177, 'learning_rate': 5.033069155259471e-05} |
|
52%|ββββββ | 1570/3000 [09:14<06:39, 3.58it/s]
52%|ββββββ | 1571/3000 [09:14<06:40, 3.57it/s]
52%|ββββββ | 1572/3000 [09:14<06:36, 3.60it/s]
52%|ββββββ | 1573/3000 [09:14<06:40, 3.56it/s]
52%|ββββββ | 1574/3000 [09:15<06:30, 3.66it/s]
52%|ββββββ | 1575/3000 [09:15<06:18, 3.76it/s]
53%|ββββββ | 1576/3000 [09:15<06:09, 3.85it/s]
53%|ββββββ | 1577/3000 [09:15<06:07, 3.87it/s]
53%|ββββββ | 1578/3000 [09:16<06:09, 3.85it/s]
53%|ββββββ | 1579/3000 [09:16<06:01, 3.93it/s]
53%|ββββββ | 1580/3000 [09:16<05:58, 3.96it/s]
{'loss': 0.0845, 'grad_norm': 0.59505295753479, 'learning_rate': 4.97795380719966e-05} |
|
53%|ββββββ | 1580/3000 [09:16<05:58, 3.96it/s]
53%|ββββββ | 1581/3000 [09:16<06:00, 3.94it/s]
53%|ββββββ | 1582/3000 [09:17<05:58, 3.95it/s]
53%|ββββββ | 1583/3000 [09:17<05:57, 3.96it/s]
53%|ββββββ | 1584/3000 [09:17<05:59, 3.93it/s]
53%|ββββββ | 1585/3000 [09:17<05:59, 3.94it/s]
53%|ββββββ | 1586/3000 [09:18<06:02, 3.90it/s]
53%|ββββββ | 1587/3000 [09:18<06:11, 3.80it/s]
53%|ββββββ | 1588/3000 [09:18<06:15, 3.76it/s]
53%|ββββββ | 1589/3000 [09:19<06:20, 3.71it/s]
53%|ββββββ | 1590/3000 [09:19<06:23, 3.67it/s]
{'loss': 0.0911, 'grad_norm': 0.6942402124404907, 'learning_rate': 4.9228411379333014e-05} |
|
53%|ββββββ | 1590/3000 [09:19<06:23, 3.67it/s]
53%|ββββββ | 1591/3000 [09:19<06:18, 3.72it/s]
53%|ββββββ | 1592/3000 [09:19<06:07, 3.83it/s]
53%|ββββββ | 1593/3000 [09:20<06:04, 3.86it/s]
53%|ββββββ | 1594/3000 [09:20<06:04, 3.86it/s]
53%|ββββββ | 1595/3000 [09:20<06:10, 3.79it/s]
53%|ββββββ | 1596/3000 [09:20<05:59, 3.90it/s]
53%|ββββββ | 1597/3000 [09:21<05:56, 3.93it/s]
53%|ββββββ | 1598/3000 [09:21<05:55, 3.95it/s]
53%|ββββββ | 1599/3000 [09:21<06:03, 3.86it/s]
53%|ββββββ | 1600/3000 [09:21<06:05, 3.83it/s]
{'loss': 0.0938, 'grad_norm': 0.7630301713943481, 'learning_rate': 4.867737844102261e-05} |
|
53%|ββββββ | 1600/3000 [09:21<06:05, 3.83it/s]
53%|ββββββ | 1601/3000 [09:22<06:03, 3.85it/s]
53%|ββββββ | 1602/3000 [09:22<06:02, 3.86it/s]
53%|ββββββ | 1603/3000 [09:22<06:00, 3.87it/s]
53%|ββββββ | 1604/3000 [09:22<05:57, 3.90it/s]
54%|ββββββ | 1605/3000 [09:23<05:55, 3.92it/s]
54%|ββββββ | 1606/3000 [09:23<05:59, 3.87it/s]
54%|ββββββ | 1607/3000 [09:23<05:57, 3.90it/s]
54%|ββββββ | 1608/3000 [09:23<06:04, 3.82it/s]
54%|ββββββ | 1609/3000 [09:24<06:03, 3.82it/s]
54%|ββββββ | 1610/3000 [09:24<06:08, 3.77it/s]
{'loss': 0.0825, 'grad_norm': 0.8009137511253357, 'learning_rate': 4.812650621209209e-05} |
|
54%|ββββββ | 1610/3000 [09:24<06:08, 3.77it/s]
54%|ββββββ | 1611/3000 [09:24<06:24, 3.62it/s]
54%|ββββββ | 1612/3000 [09:25<06:32, 3.54it/s]
54%|ββββββ | 1613/3000 [09:25<06:43, 3.44it/s]
54%|ββββββ | 1614/3000 [09:25<06:49, 3.38it/s]
54%|ββββββ | 1615/3000 [09:25<06:43, 3.43it/s]
54%|ββββββ | 1616/3000 [09:26<06:45, 3.41it/s]
54%|ββββββ | 1617/3000 [09:26<06:29, 3.55it/s]
54%|ββββββ | 1618/3000 [09:26<06:27, 3.57it/s]
54%|ββββββ | 1619/3000 [09:27<06:27, 3.56it/s]
54%|ββββββ | 1620/3000 [09:27<06:27, 3.56it/s]
{'loss': 0.08, 'grad_norm': 0.8127044439315796, 'learning_rate': 4.7575861628040635e-05} |
|
54%|ββββββ | 1620/3000 [09:27<06:27, 3.56it/s]
54%|ββββββ | 1621/3000 [09:27<06:28, 3.55it/s]
54%|ββββββ | 1622/3000 [09:27<06:18, 3.64it/s]
54%|ββββββ | 1623/3000 [09:28<06:11, 3.71it/s]
54%|ββββββ | 1624/3000 [09:28<06:04, 3.77it/s]
54%|ββββββ | 1625/3000 [09:28<06:09, 3.72it/s]
54%|ββββββ | 1626/3000 [09:28<06:08, 3.73it/s]
54%|ββββββ | 1627/3000 [09:29<06:04, 3.76it/s]
54%|ββββββ | 1628/3000 [09:29<06:11, 3.70it/s]
54%|ββββββ | 1629/3000 [09:29<06:16, 3.64it/s]
54%|ββββββ | 1630/3000 [09:30<06:18, 3.61it/s]
{'loss': 0.0839, 'grad_norm': 0.8295362591743469, 'learning_rate': 4.702551159670672e-05} |
|
54%|ββββββ | 1630/3000 [09:30<06:18, 3.61it/s]
54%|ββββββ | 1631/3000 [09:30<06:11, 3.68it/s]
54%|ββββββ | 1632/3000 [09:30<06:06, 3.73it/s]
54%|ββββββ | 1633/3000 [09:30<06:03, 3.77it/s]
54%|ββββββ | 1634/3000 [09:31<05:58, 3.81it/s]
55%|ββββββ | 1635/3000 [09:31<05:55, 3.84it/s]
55%|ββββββ | 1636/3000 [09:31<06:02, 3.77it/s]
55%|ββββββ | 1637/3000 [09:31<06:16, 3.62it/s]
55%|ββββββ | 1638/3000 [09:32<06:24, 3.54it/s]
55%|ββββββ | 1639/3000 [09:32<06:32, 3.46it/s]
55%|ββββββ | 1640/3000 [09:32<06:38, 3.41it/s]
{'loss': 0.0876, 'grad_norm': 0.7284867763519287, 'learning_rate': 4.647552299013828e-05} |
|
55%|ββββββ | 1640/3000 [09:32<06:38, 3.41it/s]
55%|ββββββ | 1641/3000 [09:33<06:42, 3.38it/s]
55%|ββββββ | 1642/3000 [09:33<06:49, 3.32it/s]
55%|ββββββ | 1643/3000 [09:33<06:36, 3.42it/s]
55%|ββββββ | 1644/3000 [09:34<06:20, 3.56it/s]
55%|ββββββ | 1645/3000 [09:34<06:15, 3.61it/s]
55%|ββββββ | 1646/3000 [09:34<06:06, 3.69it/s]
55%|ββββββ | 1647/3000 [09:34<05:58, 3.78it/s]
55%|ββββββ | 1648/3000 [09:35<05:56, 3.80it/s]
55%|ββββββ | 1649/3000 [09:35<06:00, 3.75it/s]
55%|ββββββ | 1650/3000 [09:35<05:59, 3.75it/s]
{'loss': 0.0964, 'grad_norm': 0.7745282649993896, 'learning_rate': 4.5925962636467126e-05} |
|
55%|ββββββ | 1650/3000 [09:35<05:59, 3.75it/s]
55%|ββββββ | 1651/3000 [09:35<06:14, 3.60it/s]
55%|ββββββ | 1652/3000 [09:36<06:06, 3.68it/s]
55%|ββββββ | 1653/3000 [09:36<06:17, 3.57it/s]
55%|ββββββ | 1654/3000 [09:36<06:08, 3.65it/s]
55%|ββββββ | 1655/3000 [09:36<06:11, 3.62it/s]
55%|ββββββ | 1656/3000 [09:37<06:05, 3.68it/s]
55%|ββββββ | 1657/3000 [09:37<06:02, 3.71it/s]
55%|ββββββ | 1658/3000 [09:37<05:57, 3.76it/s]
55%|ββββββ | 1659/3000 [09:38<06:03, 3.69it/s]
55%|ββββββ | 1660/3000 [09:38<06:01, 3.70it/s]
{'loss': 0.0746, 'grad_norm': 0.7279770374298096, 'learning_rate': 4.537689731178883e-05} |
|
55%|ββββββ | 1660/3000 [09:38<06:01, 3.70it/s]
55%|ββββββ | 1661/3000 [09:38<05:57, 3.75it/s]
55%|ββββββ | 1662/3000 [09:38<06:00, 3.71it/s]
55%|ββββββ | 1663/3000 [09:39<06:33, 3.39it/s]
55%|ββββββ | 1664/3000 [09:39<06:43, 3.31it/s]
56%|ββββββ | 1665/3000 [09:39<06:36, 3.37it/s]
56%|ββββββ | 1666/3000 [09:40<06:36, 3.36it/s]
56%|ββββββ | 1667/3000 [09:40<06:32, 3.40it/s]
56%|ββββββ | 1668/3000 [09:40<06:34, 3.37it/s]
56%|ββββββ | 1669/3000 [09:40<06:25, 3.45it/s]
56%|ββββββ | 1670/3000 [09:41<06:14, 3.55it/s]
{'loss': 0.086, 'grad_norm': 0.7272283434867859, 'learning_rate': 4.482839373204891e-05} |
|
56%|ββββββ | 1670/3000 [09:41<06:14, 3.55it/s]
56%|ββββββ | 1671/3000 [09:41<06:10, 3.59it/s]
56%|ββββββ | 1672/3000 [09:41<06:02, 3.67it/s]
56%|ββββββ | 1673/3000 [09:42<05:50, 3.79it/s]
56%|ββββββ | 1674/3000 [09:42<05:41, 3.89it/s]
56%|ββββββ | 1675/3000 [09:42<05:37, 3.92it/s]
56%|ββββββ | 1676/3000 [09:42<05:37, 3.92it/s]
56%|ββββββ | 1677/3000 [09:43<05:44, 3.84it/s]
Rank 0, Worker 3: Wait for shard 39 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
56%|ββββββ | 1678/3000 [09:43<05:48, 3.79it/s]
56%|ββββββ | 1679/3000 [09:43<05:47, 3.80it/s]
56%|ββββββ | 1680/3000 [09:43<05:55, 3.71it/s]
{'loss': 0.094, 'grad_norm': 0.8412664532661438, 'learning_rate': 4.428051854493623e-05} |
|
56%|ββββββ | 1680/3000 [09:43<05:55, 3.71it/s]
56%|ββββββ | 1681/3000 [09:44<06:01, 3.65it/s]
56%|ββββββ | 1682/3000 [09:44<06:02, 3.64it/s]
56%|ββββββ | 1683/3000 [09:44<06:05, 3.60it/s]
56%|ββββββ | 1684/3000 [09:44<06:14, 3.52it/s]
56%|ββββββ | 1685/3000 [09:45<06:15, 3.50it/s]
56%|ββββββ | 1686/3000 [09:45<06:10, 3.55it/s]
56%|ββββββ | 1687/3000 [09:45<06:02, 3.62it/s]
Rank 0, Worker 1: Wait for shard 43 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
56%|ββββββ | 1688/3000 [09:46<05:56, 3.68it/s]
56%|ββββββ | 1689/3000 [09:46<06:01, 3.63it/s]
56%|ββββββ | 1690/3000 [09:46<06:07, 3.57it/s]
{'loss': 0.0889, 'grad_norm': 0.7905071973800659, 'learning_rate': 4.373333832178478e-05} |
|
56%|ββββββ | 1690/3000 [09:46<06:07, 3.57it/s]
56%|ββββββ | 1691/3000 [09:46<06:11, 3.53it/s]
56%|ββββββ | 1692/3000 [09:47<06:13, 3.51it/s]
56%|ββββββ | 1693/3000 [09:47<06:12, 3.51it/s]
56%|ββββββ | 1694/3000 [09:47<06:16, 3.47it/s]
56%|ββββββ | 1695/3000 [09:48<06:04, 3.58it/s]
57%|ββββββ | 1696/3000 [09:48<05:54, 3.68it/s]
57%|ββββββ | 1697/3000 [09:48<05:45, 3.77it/s]
57%|ββββββ | 1698/3000 [09:48<05:48, 3.74it/s]
57%|ββββββ | 1699/3000 [09:49<05:55, 3.66it/s]
57%|ββββββ | 1700/3000 [09:49<05:53, 3.68it/s]
{'loss': 0.0873, 'grad_norm': 0.7173388004302979, 'learning_rate': 4.3186919549484784e-05} |
|
57%|ββββββ | 1700/3000 [09:49<05:53, 3.68it/s]
57%|ββββββ | 1701/3000 [09:49<05:53, 3.67it/s]
57%|ββββββ | 1702/3000 [09:49<05:46, 3.75it/s]
57%|ββββββ | 1703/3000 [09:50<05:50, 3.70it/s]
Rank 0, Worker 5: Wait for shard 54 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
57%|ββββββ | 1704/3000 [09:50<05:47, 3.72it/s]
Rank 0, Worker 0: Wait for shard 13 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
57%|ββββββ | 1705/3000 [09:50<05:44, 3.76it/s]
57%|ββββββ | 1706/3000 [09:51<05:51, 3.68it/s]
Rank 0, Worker 2: Wait for shard 26 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
57%|ββββββ | 1707/3000 [09:51<05:48, 3.71it/s]
57%|ββββββ | 1708/3000 [09:51<05:46, 3.73it/s]
Rank 0, Worker 4: Wait for shard 25 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
57%|ββββββ | 1709/3000 [09:51<05:41, 3.78it/s]
57%|ββββββ | 1710/3000 [09:52<05:35, 3.84it/s]
{'loss': 0.0852, 'grad_norm': 0.7447967529296875, 'learning_rate': 4.264132862240387e-05} |
|
57%|ββββββ | 1710/3000 [09:52<05:35, 3.84it/s]
57%|ββββββ | 1711/3000 [09:52<05:35, 3.84it/s]
57%|ββββββ | 1712/3000 [09:52<05:34, 3.85it/s]
57%|ββββββ | 1713/3000 [09:52<05:33, 3.86it/s]
57%|ββββββ | 1714/3000 [09:53<05:38, 3.80it/s]
57%|ββββββ | 1715/3000 [09:53<05:48, 3.69it/s]
57%|ββββββ | 1716/3000 [09:53<05:59, 3.57it/s]
57%|ββββββ | 1717/3000 [09:54<06:14, 3.42it/s]
57%|ββββββ | 1718/3000 [09:54<06:18, 3.39it/s]
57%|ββββββ | 1719/3000 [09:54<06:24, 3.33it/s]
57%|ββββββ | 1720/3000 [09:54<06:31, 3.27it/s]
{'loss': 0.093, 'grad_norm': 0.6910264492034912, 'learning_rate': 4.209663183431969e-05} |
|
57%|ββββββ | 1720/3000 [09:54<06:31, 3.27it/s]
57%|ββββββ | 1721/3000 [09:55<06:22, 3.34it/s]
57%|ββββββ | 1722/3000 [09:55<06:08, 3.47it/s]
57%|ββββββ | 1723/3000 [09:55<05:58, 3.56it/s]
57%|ββββββ | 1724/3000 [09:56<06:06, 3.49it/s]
57%|ββββββ | 1725/3000 [09:56<06:05, 3.49it/s]
58%|ββββββ | 1726/3000 [09:56<05:59, 3.54it/s]
58%|ββββββ | 1727/3000 [09:56<06:10, 3.44it/s]
58%|ββββββ | 1728/3000 [09:57<06:04, 3.49it/s]
58%|ββββββ | 1729/3000 [09:57<06:05, 3.48it/s]
58%|ββββββ | 1730/3000 [09:57<06:12, 3.41it/s]
{'loss': 0.0898, 'grad_norm': 0.6057111620903015, 'learning_rate': 4.155289537036466e-05} |
|
58%|ββββββ | 1730/3000 [09:57<06:12, 3.41it/s]
58%|ββββββ | 1731/3000 [09:58<06:10, 3.42it/s]
58%|ββββββ | 1732/3000 [09:58<06:15, 3.38it/s]
58%|ββββββ | 1733/3000 [09:58<06:27, 3.27it/s]
58%|ββββββ | 1734/3000 [09:58<06:10, 3.41it/s]
58%|ββββββ | 1735/3000 [09:59<06:20, 3.32it/s]
58%|ββββββ | 1736/3000 [09:59<06:42, 3.14it/s]
58%|ββββββ | 1737/3000 [09:59<06:34, 3.20it/s]
58%|ββββββ | 1738/3000 [10:00<06:35, 3.19it/s]
58%|ββββββ | 1739/3000 [10:00<06:40, 3.15it/s]
58%|ββββββ | 1740/3000 [10:00<06:47, 3.09it/s]
{'loss': 0.0844, 'grad_norm': 0.7384588718414307, 'learning_rate': 4.1010185298983984e-05} |
|
58%|ββββββ | 1740/3000 [10:00<06:47, 3.09it/s]
58%|ββββββ | 1741/3000 [10:01<06:57, 3.01it/s]
58%|ββββββ | 1742/3000 [10:01<07:04, 2.96it/s]
58%|ββββββ | 1743/3000 [10:02<07:32, 2.78it/s]
58%|ββββββ | 1744/3000 [10:02<07:43, 2.71it/s]
58%|ββββββ | 1745/3000 [10:02<07:09, 2.92it/s]
58%|ββββββ | 1746/3000 [10:03<06:47, 3.08it/s]
58%|ββββββ | 1747/3000 [10:03<06:38, 3.14it/s]
58%|ββββββ | 1748/3000 [10:03<06:17, 3.32it/s]
58%|ββββββ | 1749/3000 [10:03<06:11, 3.37it/s]
58%|ββββββ | 1750/3000 [10:04<06:14, 3.34it/s]
{'loss': 0.087, 'grad_norm': 0.658176064491272, 'learning_rate': 4.046856756390767e-05} |
|
58%|ββββββ | 1750/3000 [10:04<06:14, 3.34it/s]
58%|ββββββ | 1751/3000 [10:04<06:21, 3.27it/s]
58%|ββββββ | 1752/3000 [10:04<06:06, 3.41it/s]
58%|ββββββ | 1753/3000 [10:05<06:12, 3.35it/s]
58%|ββββββ | 1754/3000 [10:05<06:06, 3.40it/s]
58%|ββββββ | 1755/3000 [10:05<06:20, 3.27it/s]
59%|ββββββ | 1756/3000 [10:06<06:36, 3.13it/s]
59%|ββββββ | 1757/3000 [10:06<06:33, 3.16it/s]
59%|ββββββ | 1758/3000 [10:06<06:24, 3.23it/s]
59%|ββββββ | 1759/3000 [10:06<06:10, 3.35it/s]
59%|ββββββ | 1760/3000 [10:07<06:02, 3.42it/s]
{'loss': 0.0895, 'grad_norm': 0.6321943402290344, 'learning_rate': 3.9928107976137906e-05} |
|
59%|ββββββ | 1760/3000 [10:07<06:02, 3.42it/s]
59%|ββββββ | 1761/3000 [10:07<06:31, 3.16it/s]
59%|ββββββ | 1762/3000 [10:07<06:35, 3.13it/s]
59%|ββββββ | 1763/3000 [10:08<06:36, 3.12it/s]
59%|ββββββ | 1764/3000 [10:08<06:43, 3.07it/s]
59%|ββββββ | 1765/3000 [10:08<06:38, 3.10it/s]
59%|ββββββ | 1766/3000 [10:09<06:42, 3.06it/s]
59%|ββββββ | 1767/3000 [10:09<06:29, 3.16it/s]
59%|ββββββ | 1768/3000 [10:09<06:19, 3.25it/s]
59%|ββββββ | 1769/3000 [10:10<05:59, 3.42it/s]
59%|ββββββ | 1770/3000 [10:10<05:51, 3.50it/s]
{'loss': 0.0845, 'grad_norm': 0.6504121422767639, 'learning_rate': 3.9388872205952526e-05} |
|
59%|ββββββ | 1770/3000 [10:10<05:51, 3.50it/s]
59%|ββββββ | 1771/3000 [10:10<05:52, 3.49it/s]
59%|ββββββ | 1772/3000 [10:10<05:43, 3.57it/s]
59%|ββββββ | 1773/3000 [10:11<05:33, 3.68it/s]
59%|ββββββ | 1774/3000 [10:11<05:38, 3.62it/s]
59%|ββββββ | 1775/3000 [10:11<05:34, 3.66it/s]
59%|ββββββ | 1776/3000 [10:11<05:26, 3.75it/s]
59%|ββββββ | 1777/3000 [10:12<05:23, 3.78it/s]
59%|ββββββ | 1778/3000 [10:12<05:26, 3.75it/s]
59%|ββββββ | 1779/3000 [10:12<05:26, 3.74it/s]
59%|ββββββ | 1780/3000 [10:12<05:17, 3.84it/s]
{'loss': 0.0838, 'grad_norm': 0.564155101776123, 'learning_rate': 3.8850925774925425e-05} |
|
59%|ββββββ | 1780/3000 [10:13<05:17, 3.84it/s]
59%|ββββββ | 1781/3000 [10:13<05:19, 3.81it/s]
59%|ββββββ | 1782/3000 [10:13<05:21, 3.79it/s]
59%|ββββββ | 1783/3000 [10:13<05:16, 3.85it/s]
59%|ββββββ | 1784/3000 [10:13<05:11, 3.90it/s]
60%|ββββββ | 1785/3000 [10:14<05:08, 3.94it/s]
60%|ββββββ | 1786/3000 [10:14<05:10, 3.91it/s]
60%|ββββββ | 1787/3000 [10:14<05:13, 3.87it/s]
60%|ββββββ | 1788/3000 [10:15<05:19, 3.79it/s]
60%|ββββββ | 1789/3000 [10:15<05:24, 3.74it/s]
60%|ββββββ | 1790/3000 [10:15<05:27, 3.69it/s]
{'loss': 0.0902, 'grad_norm': 0.5792074799537659, 'learning_rate': 3.831433404796521e-05} |
|
60%|ββββββ | 1790/3000 [10:15<05:27, 3.69it/s]
60%|ββββββ | 1791/3000 [10:15<05:31, 3.65it/s]
60%|ββββββ | 1792/3000 [10:16<07:46, 2.59it/s]
60%|ββββββ | 1793/3000 [10:16<06:51, 2.93it/s]
60%|ββββββ | 1794/3000 [10:17<06:14, 3.22it/s]
60%|ββββββ | 1795/3000 [10:17<05:49, 3.45it/s]
60%|ββββββ | 1796/3000 [10:17<05:33, 3.61it/s]
60%|ββββββ | 1797/3000 [10:17<05:21, 3.74it/s]
60%|ββββββ | 1798/3000 [10:17<05:13, 3.83it/s]
60%|ββββββ | 1799/3000 [10:18<05:04, 3.94it/s]
60%|ββββββ | 1800/3000 [10:18<04:59, 4.01it/s]
{'loss': 0.0866, 'grad_norm': 0.7973079085350037, 'learning_rate': 3.777916222537285e-05} |
|
60%|ββββββ | 1800/3000 [10:18<04:59, 4.01it/s]
60%|ββββββ | 1801/3000 [10:18<04:58, 4.02it/s]
60%|ββββββ | 1802/3000 [10:18<04:54, 4.07it/s]
60%|ββββββ | 1803/3000 [10:19<04:51, 4.10it/s]
60%|ββββββ | 1804/3000 [10:19<04:48, 4.15it/s]
60%|ββββββ | 1805/3000 [10:19<04:46, 4.17it/s]
60%|ββββββ | 1806/3000 [10:19<04:48, 4.14it/s]
60%|ββββββ | 1807/3000 [10:20<04:49, 4.12it/s]
60%|ββββββ | 1808/3000 [10:20<04:45, 4.17it/s]
60%|ββββββ | 1809/3000 [10:20<04:44, 4.18it/s]
60%|ββββββ | 1810/3000 [10:20<04:45, 4.17it/s]
{'loss': 0.087, 'grad_norm': 0.7829543948173523, 'learning_rate': 3.7245475334919246e-05} |
|
60%|ββββββ | 1810/3000 [10:20<04:45, 4.17it/s]
60%|ββββββ | 1811/3000 [10:21<04:50, 4.09it/s]
60%|ββββββ | 1812/3000 [10:21<04:49, 4.10it/s]
60%|ββββββ | 1813/3000 [10:21<04:48, 4.11it/s]
60%|ββββββ | 1814/3000 [10:21<04:57, 3.99it/s]
60%|ββββββ | 1815/3000 [10:22<05:10, 3.82it/s]
61%|ββββββ | 1816/3000 [10:22<05:15, 3.75it/s]
61%|ββββββ | 1817/3000 [10:22<05:19, 3.70it/s]
61%|ββββββ | 1818/3000 [10:23<05:28, 3.59it/s]
61%|ββββββ | 1819/3000 [10:23<05:34, 3.53it/s]
61%|ββββββ | 1820/3000 [10:23<05:44, 3.42it/s]
{'loss': 0.0843, 'grad_norm': 0.7178475260734558, 'learning_rate': 3.6713338223943867e-05} |
|
61%|ββββββ | 1820/3000 [10:23<05:44, 3.42it/s]
61%|ββββββ | 1821/3000 [10:23<05:40, 3.46it/s]
61%|ββββββ | 1822/3000 [10:24<05:33, 3.53it/s]
61%|ββββββ | 1823/3000 [10:24<05:29, 3.58it/s]
61%|ββββββ | 1824/3000 [10:24<05:24, 3.63it/s]
61%|ββββββ | 1825/3000 [10:24<05:22, 3.64it/s]
61%|ββββββ | 1826/3000 [10:25<05:18, 3.69it/s]
61%|ββββββ | 1827/3000 [10:25<05:19, 3.67it/s]
61%|ββββββ | 1828/3000 [10:25<05:18, 3.68it/s]
61%|ββββββ | 1829/3000 [10:26<05:17, 3.69it/s]
61%|ββββββ | 1830/3000 [10:26<05:17, 3.69it/s]
{'loss': 0.0845, 'grad_norm': 0.5880443453788757, 'learning_rate': 3.618281555147522e-05} |
|
61%|ββββββ | 1830/3000 [10:26<05:17, 3.69it/s]
61%|ββββββ | 1831/3000 [10:26<05:21, 3.64it/s]
61%|ββββββ | 1832/3000 [10:26<05:19, 3.65it/s]
61%|ββββββ | 1833/3000 [10:27<05:18, 3.66it/s]
61%|ββββββ | 1834/3000 [10:27<05:19, 3.65it/s]
61%|ββββββ | 1835/3000 [10:27<05:20, 3.64it/s]
61%|ββββββ | 1836/3000 [10:27<05:19, 3.64it/s]
61%|ββββββ | 1837/3000 [10:28<05:18, 3.65it/s]
61%|βββββββ | 1838/3000 [10:28<05:17, 3.66it/s]
61%|βββββββ | 1839/3000 [10:28<05:17, 3.66it/s]
61%|βββββββ | 1840/3000 [10:29<05:22, 3.60it/s]
{'loss': 0.0745, 'grad_norm': 0.6559892296791077, 'learning_rate': 3.5653971780374295e-05} |
|
61%|βββββββ | 1840/3000 [10:29<05:22, 3.60it/s]
61%|βββββββ | 1841/3000 [10:29<05:36, 3.44it/s]
61%|βββββββ | 1842/3000 [10:29<05:42, 3.39it/s]
61%|βββββββ | 1843/3000 [10:30<05:43, 3.37it/s]
61%|βββββββ | 1844/3000 [10:30<05:45, 3.35it/s]
62%|βββββββ | 1845/3000 [10:30<05:41, 3.38it/s]
62%|βββββββ | 1846/3000 [10:30<05:57, 3.22it/s]
62%|βββββββ | 1847/3000 [10:31<06:18, 3.05it/s]
62%|βββββββ | 1848/3000 [10:31<06:12, 3.09it/s]
62%|βββββββ | 1849/3000 [10:31<06:02, 3.18it/s]
62%|βββββββ | 1850/3000 [10:32<05:59, 3.20it/s]
{'loss': 0.08, 'grad_norm': 0.6737433671951294, 'learning_rate': 3.512687116950182e-05} |
|
62%|βββββββ | 1850/3000 [10:32<05:59, 3.20it/s]
62%|βββββββ | 1851/3000 [10:32<05:59, 3.20it/s]
62%|βββββββ | 1852/3000 [10:32<05:41, 3.36it/s]
62%|βββββββ | 1853/3000 [10:33<05:38, 3.38it/s]
62%|βββββββ | 1854/3000 [10:33<05:32, 3.45it/s]
62%|βββββββ | 1855/3000 [10:33<05:23, 3.54it/s]
62%|βββββββ | 1856/3000 [10:33<05:25, 3.52it/s]
62%|βββββββ | 1857/3000 [10:34<05:19, 3.58it/s]
62%|βββββββ | 1858/3000 [10:34<05:14, 3.63it/s]
62%|βββββββ | 1859/3000 [10:34<05:12, 3.65it/s]
62%|βββββββ | 1860/3000 [10:35<05:11, 3.66it/s]
{'loss': 0.0909, 'grad_norm': 0.7049776911735535, 'learning_rate': 3.460157776591018e-05} |
|
62%|βββββββ | 1860/3000 [10:35<05:11, 3.66it/s]
62%|βββββββ | 1861/3000 [10:35<05:11, 3.66it/s]
62%|βββββββ | 1862/3000 [10:35<05:13, 3.63it/s]
62%|βββββββ | 1863/3000 [10:35<05:11, 3.65it/s]
Rank 0, Worker 3: Wait for shard 6 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
62%|βββββββ | 1864/3000 [10:36<05:06, 3.70it/s]
62%|βββββββ | 1865/3000 [10:36<05:11, 3.64it/s]
62%|βββββββ | 1866/3000 [10:36<05:11, 3.64it/s]
62%|βββββββ | 1867/3000 [10:36<05:27, 3.46it/s]
62%|βββββββ | 1868/3000 [10:37<05:35, 3.38it/s]
62%|βββββββ | 1869/3000 [10:37<05:47, 3.26it/s]
62%|βββββββ | 1870/3000 [10:37<05:47, 3.25it/s]
{'loss': 0.0912, 'grad_norm': 0.7571489214897156, 'learning_rate': 3.407815539706124e-05} |
|
62%|βββββββ | 1870/3000 [10:37<05:47, 3.25it/s]
62%|βββββββ | 1871/3000 [10:38<05:54, 3.18it/s]
62%|βββββββ | 1872/3000 [10:38<05:45, 3.26it/s]
62%|βββββββ | 1873/3000 [10:38<05:33, 3.38it/s]
62%|βββββββ | 1874/3000 [10:39<05:20, 3.51it/s]
62%|βββββββ | 1875/3000 [10:39<05:09, 3.63it/s]
63%|βββββββ | 1876/3000 [10:39<05:04, 3.69it/s]
63%|βββββββ | 1877/3000 [10:39<04:59, 3.74it/s]
63%|βββββββ | 1878/3000 [10:40<04:58, 3.76it/s]
63%|βββββββ | 1879/3000 [10:40<04:53, 3.82it/s]
Rank 0, Worker 1: Wait for shard 21 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
63%|βββββββ | 1880/3000 [10:40<04:51, 3.84it/s]
{'loss': 0.076, 'grad_norm': 0.6241227388381958, 'learning_rate': 3.355666766307084e-05} |
|
63%|βββββββ | 1880/3000 [10:40<04:51, 3.84it/s]
63%|βββββββ | 1881/3000 [10:40<04:55, 3.79it/s]
63%|βββββββ | 1882/3000 [10:41<04:58, 3.75it/s]
63%|βββββββ | 1883/3000 [10:41<04:50, 3.85it/s]
63%|βββββββ | 1884/3000 [10:41<04:48, 3.87it/s]
63%|βββββββ | 1885/3000 [10:41<04:49, 3.85it/s]
63%|βββββββ | 1886/3000 [10:42<04:52, 3.81it/s]
63%|βββββββ | 1887/3000 [10:42<04:57, 3.74it/s]
63%|βββββββ | 1888/3000 [10:42<04:57, 3.74it/s]
63%|βββββββ | 1889/3000 [10:42<04:52, 3.80it/s]
63%|βββββββ | 1890/3000 [10:43<04:49, 3.84it/s]
{'loss': 0.074, 'grad_norm': 0.6526145339012146, 'learning_rate': 3.3037177928980735e-05} |
|
63%|βββββββ | 1890/3000 [10:43<04:49, 3.84it/s]
Rank 0, Worker 0: Wait for shard 15 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
63%|βββββββ | 1891/3000 [10:43<04:49, 3.84it/s]
63%|βββββββ | 1892/3000 [10:43<04:49, 3.82it/s]
Rank 0, Worker 2: Wait for shard 16 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
63%|βββββββ | 1893/3000 [10:44<04:57, 3.72it/s]
63%|βββββββ | 1894/3000 [10:44<05:02, 3.66it/s]
Rank 0, Worker 4: Wait for shard 60 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
63%|βββββββ | 1895/3000 [10:44<05:13, 3.52it/s]
Rank 0, Worker 5: Wait for shard 37 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
63%|βββββββ | 1896/3000 [10:44<05:23, 3.42it/s]
63%|βββββββ | 1897/3000 [10:45<05:35, 3.29it/s]
63%|βββββββ | 1898/3000 [10:45<05:47, 3.17it/s]
63%|βββββββ | 1899/3000 [10:45<05:26, 3.37it/s]
63%|βββββββ | 1900/3000 [10:46<05:21, 3.42it/s]
{'loss': 0.094, 'grad_norm': 0.6423455476760864, 'learning_rate': 3.251974931705933e-05} |
|
63%|βββββββ | 1900/3000 [10:46<05:21, 3.42it/s]
63%|βββββββ | 1901/3000 [10:46<05:14, 3.49it/s]
63%|βββββββ | 1902/3000 [10:46<05:10, 3.54it/s]
63%|βββββββ | 1903/3000 [10:46<05:07, 3.57it/s]
63%|βββββββ | 1904/3000 [10:47<05:04, 3.60it/s]
64%|βββββββ | 1905/3000 [10:47<05:02, 3.63it/s]
64%|βββββββ | 1906/3000 [10:47<04:58, 3.67it/s]
64%|βββββββ | 1907/3000 [10:48<04:57, 3.67it/s]
64%|βββββββ | 1908/3000 [10:48<05:04, 3.58it/s]
64%|βββββββ | 1909/3000 [10:48<04:59, 3.65it/s]
64%|βββββββ | 1910/3000 [10:48<04:54, 3.70it/s]
{'loss': 0.0877, 'grad_norm': 0.5167522430419922, 'learning_rate': 3.2004444699131727e-05} |
|
64%|βββββββ | 1910/3000 [10:48<04:54, 3.70it/s]
64%|βββββββ | 1911/3000 [10:49<04:52, 3.73it/s]
64%|βββββββ | 1912/3000 [10:49<04:50, 3.75it/s]
64%|βββββββ | 1913/3000 [10:49<04:45, 3.81it/s]
64%|βββββββ | 1914/3000 [10:49<04:46, 3.79it/s]
64%|βββββββ | 1915/3000 [10:50<04:51, 3.73it/s]
64%|βββββββ | 1916/3000 [10:50<04:49, 3.74it/s]
64%|βββββββ | 1917/3000 [10:50<04:47, 3.77it/s]
64%|βββββββ | 1918/3000 [10:51<05:03, 3.57it/s]
64%|βββββββ | 1919/3000 [10:51<05:20, 3.37it/s]
64%|βββββββ | 1920/3000 [10:51<05:26, 3.31it/s]
{'loss': 0.0812, 'grad_norm': 0.6353688836097717, 'learning_rate': 3.1491326688940345e-05} |
|
64%|βββββββ | 1920/3000 [10:51<05:26, 3.31it/s]
64%|βββββββ | 1921/3000 [10:52<05:30, 3.27it/s]
64%|βββββββ | 1922/3000 [10:52<05:42, 3.15it/s]
64%|βββββββ | 1923/3000 [10:52<05:33, 3.23it/s]
64%|βββββββ | 1924/3000 [10:52<05:22, 3.34it/s]
64%|βββββββ | 1925/3000 [10:53<05:33, 3.22it/s]
64%|βββββββ | 1926/3000 [10:53<05:19, 3.37it/s]
64%|βββββββ | 1927/3000 [10:53<05:12, 3.43it/s]
64%|βββββββ | 1928/3000 [10:54<05:05, 3.51it/s]
64%|βββββββ | 1929/3000 [10:54<04:58, 3.59it/s]
64%|βββββββ | 1930/3000 [10:54<04:58, 3.59it/s]
{'loss': 0.0867, 'grad_norm': 0.7268972396850586, 'learning_rate': 3.098045763453678e-05} |
|
64%|βββββββ | 1930/3000 [10:54<04:58, 3.59it/s]
64%|βββββββ | 1931/3000 [10:54<05:06, 3.49it/s]
64%|βββββββ | 1932/3000 [10:55<05:01, 3.54it/s]
64%|βββββββ | 1933/3000 [10:55<04:56, 3.60it/s]
64%|βββββββ | 1934/3000 [10:55<04:55, 3.61it/s]
64%|βββββββ | 1935/3000 [10:56<04:53, 3.63it/s]
65%|βββββββ | 1936/3000 [10:56<04:53, 3.62it/s]
65%|βββββββ | 1937/3000 [10:56<04:49, 3.67it/s]
65%|βββββββ | 1938/3000 [10:56<04:54, 3.61it/s]
65%|βββββββ | 1939/3000 [10:57<05:14, 3.38it/s]
65%|βββββββ | 1940/3000 [10:57<05:06, 3.46it/s]
{'loss': 0.0856, 'grad_norm': 0.521489679813385, 'learning_rate': 3.0471899610706038e-05} |
|
65%|βββββββ | 1940/3000 [10:57<05:06, 3.46it/s]
65%|βββββββ | 1941/3000 [10:57<04:57, 3.56it/s]
65%|βββββββ | 1942/3000 [10:57<04:53, 3.61it/s]
65%|βββββββ | 1943/3000 [10:58<05:01, 3.51it/s]
65%|βββββββ | 1944/3000 [10:58<05:09, 3.42it/s]
65%|βββββββ | 1945/3000 [10:58<05:17, 3.32it/s]
65%|βββββββ | 1946/3000 [10:59<05:30, 3.19it/s]
65%|βββββββ | 1947/3000 [10:59<05:27, 3.22it/s]
65%|βββββββ | 1948/3000 [10:59<05:22, 3.26it/s]
65%|βββββββ | 1949/3000 [11:00<05:11, 3.37it/s]
65%|βββββββ | 1950/3000 [11:00<05:01, 3.48it/s]
{'loss': 0.0839, 'grad_norm': 0.61240154504776, 'learning_rate': 2.9965714411423972e-05} |
|
65%|βββββββ | 1950/3000 [11:00<05:01, 3.48it/s]
65%|βββββββ | 1951/3000 [11:00<04:57, 3.53it/s]
65%|βββββββ | 1952/3000 [11:01<05:06, 3.42it/s]
65%|βββββββ | 1953/3000 [11:01<05:15, 3.32it/s]
65%|βββββββ | 1954/3000 [11:01<05:15, 3.32it/s]
65%|βββββββ | 1955/3000 [11:01<05:09, 3.38it/s]
65%|βββββββ | 1956/3000 [11:02<05:03, 3.44it/s]
65%|βββββββ | 1957/3000 [11:02<05:01, 3.46it/s]
65%|βββββββ | 1958/3000 [11:02<04:51, 3.58it/s]
65%|βββββββ | 1959/3000 [11:02<04:44, 3.66it/s]
65%|βββββββ | 1960/3000 [11:03<04:38, 3.73it/s]
{'loss': 0.0751, 'grad_norm': 0.6277241110801697, 'learning_rate': 2.9461963542348737e-05} |
|
65%|βββββββ | 1960/3000 [11:03<04:38, 3.73it/s]
65%|βββββββ | 1961/3000 [11:03<04:39, 3.71it/s]
65%|βββββββ | 1962/3000 [11:03<04:34, 3.77it/s]
65%|βββββββ | 1963/3000 [11:04<04:30, 3.83it/s]
65%|βββββββ | 1964/3000 [11:04<04:28, 3.87it/s]
66%|βββββββ | 1965/3000 [11:04<04:25, 3.90it/s]
66%|βββββββ | 1966/3000 [11:04<04:22, 3.94it/s]
66%|βββββββ | 1967/3000 [11:05<04:21, 3.94it/s]
66%|βββββββ | 1968/3000 [11:05<04:24, 3.91it/s]
66%|βββββββ | 1969/3000 [11:05<04:30, 3.81it/s]
66%|βββββββ | 1970/3000 [11:05<04:37, 3.71it/s]
{'loss': 0.0881, 'grad_norm': 0.6723438501358032, 'learning_rate': 2.8960708213347366e-05} |
|
66%|βββββββ | 1970/3000 [11:05<04:37, 3.71it/s]
66%|βββββββ | 1971/3000 [11:06<04:44, 3.62it/s]
66%|βββββββ | 1972/3000 [11:06<04:49, 3.55it/s]
66%|βββββββ | 1973/3000 [11:06<04:47, 3.57it/s]
66%|βββββββ | 1974/3000 [11:06<04:45, 3.60it/s]
66%|βββββββ | 1975/3000 [11:07<04:42, 3.63it/s]
66%|βββββββ | 1976/3000 [11:07<04:38, 3.67it/s]
66%|βββββββ | 1977/3000 [11:07<04:33, 3.73it/s]
66%|βββββββ | 1978/3000 [11:08<04:34, 3.73it/s]
66%|βββββββ | 1979/3000 [11:08<04:34, 3.72it/s]
66%|βββββββ | 1980/3000 [11:08<04:30, 3.77it/s]
{'loss': 0.0938, 'grad_norm': 0.6539252996444702, 'learning_rate': 2.846200933105829e-05} |
|
66%|βββββββ | 1980/3000 [11:08<04:30, 3.77it/s]
66%|βββββββ | 1981/3000 [11:08<04:27, 3.81it/s]
66%|βββββββ | 1982/3000 [11:09<04:22, 3.88it/s]
66%|βββββββ | 1983/3000 [11:09<04:15, 3.98it/s]
66%|βββββββ | 1984/3000 [11:09<04:14, 4.00it/s]
66%|βββββββ | 1985/3000 [11:09<04:11, 4.04it/s]
66%|βββββββ | 1986/3000 [11:10<04:08, 4.08it/s]
66%|βββββββ | 1987/3000 [11:10<04:07, 4.09it/s]
66%|βββββββ | 1988/3000 [11:10<04:14, 3.98it/s]
66%|βββββββ | 1989/3000 [11:10<04:14, 3.97it/s]
66%|βββββββ | 1990/3000 [11:11<04:12, 4.00it/s]
{'loss': 0.0797, 'grad_norm': 0.6006316542625427, 'learning_rate': 2.7965927491490705e-05} |
|
66%|βββββββ | 1990/3000 [11:11<04:12, 4.00it/s]
66%|βββββββ | 1991/3000 [11:11<04:14, 3.96it/s]
66%|βββββββ | 1992/3000 [11:11<04:13, 3.97it/s]
66%|βββββββ | 1993/3000 [11:11<04:19, 3.88it/s]
66%|βββββββ | 1994/3000 [11:12<04:15, 3.93it/s]
66%|βββββββ | 1995/3000 [11:12<04:14, 3.95it/s]
67%|βββββββ | 1996/3000 [11:12<04:22, 3.83it/s]
67%|βββββββ | 1997/3000 [11:12<04:32, 3.68it/s]
67%|βββββββ | 1998/3000 [11:13<04:32, 3.68it/s]
67%|βββββββ | 1999/3000 [11:13<04:33, 3.67it/s]
67%|βββββββ | 2000/3000 [11:13<04:32, 3.67it/s]
{'loss': 0.0779, 'grad_norm': 0.9345439672470093, 'learning_rate': 2.747252297266162e-05} |
|
67%|βββββββ | 2000/3000 [11:13<04:32, 3.67it/s] |
| Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2000/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2000 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2000/wandb_config.json |
|
67%|βββββββ | 2001/3000 [11:50<3:06:46, 11.22s/it]
67%|βββββββ | 2002/3000 [11:50<2:12:12, 7.95s/it]
67%|βββββββ | 2003/3000 [11:51<1:33:46, 5.64s/it]
67%|βββββββ | 2004/3000 [11:51<1:06:51, 4.03s/it]
67%|βββββββ | 2005/3000 [11:51<48:06, 2.90s/it]
67%|βββββββ | 2006/3000 [11:51<34:59, 2.11s/it]
67%|βββββββ | 2007/3000 [11:52<25:49, 1.56s/it]
67%|βββββββ | 2008/3000 [11:52<19:24, 1.17s/it]
67%|βββββββ | 2009/3000 [11:52<14:56, 1.11it/s]
67%|βββββββ | 2010/3000 [11:52<11:46, 1.40it/s]
{'loss': 0.0751, 'grad_norm': 0.7097274661064148, 'learning_rate': 2.698185572727151e-05} |
|
67%|βββββββ | 2010/3000 [11:53<11:46, 1.40it/s]
67%|βββββββ | 2011/3000 [11:53<09:35, 1.72it/s]
67%|βββββββ | 2012/3000 [11:53<08:00, 2.06it/s]
67%|βββββββ | 2013/3000 [11:53<06:52, 2.40it/s]
67%|βββββββ | 2014/3000 [11:54<06:05, 2.70it/s]
67%|βββββββ | 2015/3000 [11:54<05:33, 2.95it/s]
67%|βββββββ | 2016/3000 [11:54<05:13, 3.13it/s]
67%|βββββββ | 2017/3000 [11:54<04:54, 3.34it/s]
67%|βββββββ | 2018/3000 [11:55<04:41, 3.49it/s]
67%|βββββββ | 2019/3000 [11:55<04:31, 3.62it/s]
67%|βββββββ | 2020/3000 [11:55<04:25, 3.69it/s]
{'loss': 0.0794, 'grad_norm': 0.6569517254829407, 'learning_rate': 2.6493985375419778e-05} |
|
67%|βββββββ | 2020/3000 [11:55<04:25, 3.69it/s]
67%|βββββββ | 2021/3000 [11:55<04:24, 3.70it/s]
67%|βββββββ | 2022/3000 [11:56<04:19, 3.77it/s]
67%|βββββββ | 2023/3000 [11:56<04:18, 3.78it/s]
67%|βββββββ | 2024/3000 [11:56<04:23, 3.71it/s]
68%|βββββββ | 2025/3000 [11:56<04:17, 3.79it/s]
68%|βββββββ | 2026/3000 [11:57<04:13, 3.84it/s]
68%|βββββββ | 2027/3000 [11:57<04:22, 3.71it/s]
68%|βββββββ | 2028/3000 [11:57<04:13, 3.84it/s]
68%|βββββββ | 2029/3000 [11:57<04:09, 3.89it/s]
68%|βββββββ | 2030/3000 [11:58<04:07, 3.92it/s]
{'loss': 0.0846, 'grad_norm': 0.5854899883270264, 'learning_rate': 2.6008971197360176e-05} |
|
68%|βββββββ | 2030/3000 [11:58<04:07, 3.92it/s]
68%|βββββββ | 2031/3000 [11:58<04:14, 3.81it/s]
68%|βββββββ | 2032/3000 [11:58<04:11, 3.84it/s]
68%|βββββββ | 2033/3000 [11:58<04:09, 3.88it/s]
68%|βββββββ | 2034/3000 [11:59<04:07, 3.90it/s]
68%|βββββββ | 2035/3000 [11:59<04:03, 3.96it/s]
68%|βββββββ | 2036/3000 [11:59<04:02, 3.98it/s]
68%|βββββββ | 2037/3000 [11:59<03:59, 4.02it/s]
68%|βββββββ | 2038/3000 [12:00<03:59, 4.02it/s]
68%|βββββββ | 2039/3000 [12:00<03:57, 4.04it/s]
68%|βββββββ | 2040/3000 [12:00<03:59, 4.01it/s]
{'loss': 0.087, 'grad_norm': 0.6118670105934143, 'learning_rate': 2.552687212629799e-05} |
|
68%|βββββββ | 2040/3000 [12:00<03:59, 4.01it/s]
68%|βββββββ | 2041/3000 [12:00<04:05, 3.90it/s]
68%|βββββββ | 2042/3000 [12:01<04:05, 3.90it/s]
68%|βββββββ | 2043/3000 [12:01<04:04, 3.92it/s]
68%|βββββββ | 2044/3000 [12:01<04:02, 3.94it/s]
68%|βββββββ | 2045/3000 [12:01<03:58, 4.00it/s]
68%|βββββββ | 2046/3000 [12:02<03:53, 4.08it/s]
68%|βββββββ | 2047/3000 [12:02<03:50, 4.13it/s]
68%|βββββββ | 2048/3000 [12:02<03:53, 4.08it/s]
68%|βββββββ | 2049/3000 [12:02<03:50, 4.13it/s] |
| Rank 0, Worker 3: Wait for shard 26 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
68%|βββββββ | 2050/3000 [12:03<03:52, 4.09it/s]
{'loss': 0.0792, 'grad_norm': 0.5892506837844849, 'learning_rate': 2.5047746741228978e-05} |
|
68%|βββββββ | 2050/3000 [12:03<03:52, 4.09it/s]
68%|βββββββ | 2051/3000 [12:03<03:52, 4.08it/s]
68%|βββββββ | 2052/3000 [12:03<03:51, 4.09it/s]
68%|βββββββ | 2053/3000 [12:03<03:53, 4.06it/s]
68%|βββββββ | 2054/3000 [12:04<03:54, 4.04it/s]
68%|βββββββ | 2055/3000 [12:04<03:52, 4.07it/s]
69%|βββββββ | 2056/3000 [12:04<03:51, 4.08it/s]
69%|βββββββ | 2057/3000 [12:04<03:55, 4.01it/s]
69%|βββββββ | 2058/3000 [12:05<03:54, 4.02it/s]
69%|βββββββ | 2059/3000 [12:05<03:52, 4.05it/s] |
| Rank 0, Worker 1: Wait for shard 2 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
69%|βββββββ | 2060/3000 [12:05<03:52, 4.04it/s]
{'loss': 0.0747, 'grad_norm': 0.6216926574707031, 'learning_rate': 2.4571653259821694e-05} |
|
69%|βββββββ | 2060/3000 [12:05<03:52, 4.04it/s]
69%|βββββββ | 2061/3000 [12:05<03:55, 3.99it/s]
69%|βββββββ | 2062/3000 [12:06<03:54, 4.00it/s]
69%|βββββββ | 2063/3000 [12:06<03:56, 3.97it/s]
69%|βββββββ | 2064/3000 [12:06<03:55, 3.98it/s]
69%|βββββββ | 2065/3000 [12:06<04:01, 3.87it/s]
69%|βββββββ | 2066/3000 [12:07<04:06, 3.79it/s]
69%|βββββββ | 2067/3000 [12:07<04:06, 3.78it/s]
69%|βββββββ | 2068/3000 [12:07<04:06, 3.78it/s]
69%|βββββββ | 2069/3000 [12:08<04:05, 3.80it/s]
69%|βββββββ | 2070/3000 [12:08<04:14, 3.65it/s]
{'loss': 0.073, 'grad_norm': 0.5638583898544312, 'learning_rate': 2.4098649531343497e-05} |
|
69%|βββββββ | 2070/3000 [12:08<04:14, 3.65it/s]
69%|βββββββ | 2071/3000 [12:08<04:17, 3.60it/s]
69%|βββββββ | 2072/3000 [12:08<04:15, 3.63it/s]
69%|βββββββ | 2073/3000 [12:09<04:15, 3.63it/s]
69%|βββββββ | 2074/3000 [12:09<04:13, 3.65it/s]
69%|βββββββ | 2075/3000 [12:09<04:08, 3.72it/s]
69%|βββββββ | 2076/3000 [12:09<04:01, 3.82it/s] |
| Rank 0, Worker 0: Wait for shard 62 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
69%|βββββββ | 2077/3000 [12:10<04:09, 3.70it/s]
69%|βββββββ | 2078/3000 [12:10<04:08, 3.70it/s] |
| Rank 0, Worker 2: Wait for shard 33 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
69%|βββββββ | 2079/3000 [12:10<04:06, 3.74it/s]
69%|βββββββ | 2080/3000 [12:10<04:02, 3.80it/s]
{'loss': 0.0722, 'grad_norm': 0.6165386438369751, 'learning_rate': 2.362879302963135e-05} |
|
69%|βββββββ | 2080/3000 [12:11<04:02, 3.80it/s]
69%|βββββββ | 2081/3000 [12:11<04:09, 3.68it/s]
69%|βββββββ | 2082/3000 [12:11<04:13, 3.62it/s]
69%|βββββββ | 2083/3000 [12:11<04:07, 3.70it/s]
69%|βββββββ | 2084/3000 [12:12<04:04, 3.74it/s]
70%|βββββββ | 2085/3000 [12:12<04:03, 3.75it/s]
70%|βββββββ | 2086/3000 [12:12<04:02, 3.76it/s]
70%|βββββββ | 2087/3000 [12:12<04:00, 3.80it/s]
70%|βββββββ | 2088/3000 [12:13<03:58, 3.82it/s]
70%|βββββββ | 2089/3000 [12:13<04:03, 3.74it/s]
70%|βββββββ | 2090/3000 [12:13<04:02, 3.75it/s]
{'loss': 0.074, 'grad_norm': 0.6541428565979004, 'learning_rate': 2.3162140846108366e-05} |
|
70%|βββββββ | 2090/3000 [12:13<04:02, 3.75it/s]
70%|βββββββ | 2091/3000 [12:13<04:10, 3.63it/s]
70%|βββββββ | 2092/3000 [12:14<04:03, 3.72it/s] |
| Rank 0, Worker 4: Wait for shard 5 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
70%|βββββββ | 2093/3000 [12:14<04:05, 3.70it/s] |
| Rank 0, Worker 5: Wait for shard 15 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
70%|βββββββ | 2094/3000 [12:14<04:13, 3.57it/s]
70%|βββββββ | 2095/3000 [12:15<04:16, 3.53it/s]
70%|βββββββ | 2096/3000 [12:15<04:22, 3.44it/s]
70%|βββββββ | 2097/3000 [12:15<04:42, 3.20it/s]
70%|βββββββ | 2098/3000 [12:16<04:37, 3.25it/s]
70%|βββββββ | 2099/3000 [12:16<04:37, 3.24it/s]
70%|βββββββ | 2100/3000 [12:16<04:32, 3.31it/s]
{'loss': 0.0734, 'grad_norm': 0.4623146951198578, 'learning_rate': 2.2698749682846687e-05} |
|
70%|βββββββ | 2100/3000 [12:16<04:32, 3.31it/s]
70%|βββββββ | 2101/3000 [12:16<04:31, 3.31it/s]
70%|βββββββ | 2102/3000 [12:17<04:25, 3.38it/s]
70%|βββββββ | 2103/3000 [12:17<04:27, 3.36it/s]
70%|βββββββ | 2104/3000 [12:17<04:26, 3.36it/s]
70%|βββββββ | 2105/3000 [12:18<04:24, 3.38it/s]
70%|βββββββ | 2106/3000 [12:18<04:16, 3.49it/s]
70%|βββββββ | 2107/3000 [12:18<04:13, 3.53it/s]
70%|βββββββ | 2108/3000 [12:18<04:19, 3.44it/s]
70%|βββββββ | 2109/3000 [12:19<04:11, 3.54it/s]
70%|βββββββ | 2110/3000 [12:19<04:06, 3.61it/s]
{'loss': 0.0754, 'grad_norm': 0.7112588882446289, 'learning_rate': 2.2238675845677663e-05} |
|
70%|βββββββ | 2110/3000 [12:19<04:06, 3.61it/s]
70%|βββββββ | 2111/3000 [12:19<04:09, 3.56it/s]
70%|βββββββ | 2112/3000 [12:20<04:12, 3.52it/s]
70%|βββββββ | 2113/3000 [12:20<04:05, 3.61it/s]
70%|βββββββ | 2114/3000 [12:20<04:03, 3.63it/s]
70%|βββββββ | 2115/3000 [12:20<04:04, 3.62it/s]
71%|βββββββ | 2116/3000 [12:21<04:10, 3.53it/s]
71%|βββββββ | 2117/3000 [12:21<04:06, 3.59it/s]
71%|βββββββ | 2118/3000 [12:21<04:03, 3.62it/s]
71%|βββββββ | 2119/3000 [12:22<04:06, 3.58it/s]
71%|βββββββ | 2120/3000 [12:22<04:05, 3.59it/s]
{'loss': 0.0714, 'grad_norm': 0.5454754829406738, 'learning_rate': 2.1781975237350366e-05} |
|
71%|βββββββ | 2120/3000 [12:22<04:05, 3.59it/s]
71%|βββββββ | 2121/3000 [12:22<04:05, 3.59it/s]
71%|βββββββ | 2122/3000 [12:22<04:02, 3.61it/s]
71%|βββββββ | 2123/3000 [12:23<04:08, 3.53it/s]
71%|βββββββ | 2124/3000 [12:23<04:09, 3.52it/s]
71%|βββββββ | 2125/3000 [12:23<04:03, 3.59it/s]
71%|βββββββ | 2126/3000 [12:23<04:00, 3.64it/s]
71%|βββββββ | 2127/3000 [12:24<03:56, 3.68it/s]
71%|βββββββ | 2128/3000 [12:24<04:01, 3.61it/s]
71%|βββββββ | 2129/3000 [12:24<04:00, 3.62it/s]
71%|βββββββ | 2130/3000 [12:25<04:05, 3.55it/s]
{'loss': 0.0763, 'grad_norm': 0.4984387159347534, 'learning_rate': 2.1328703350738765e-05} |
|
71%|βββββββ | 2130/3000 [12:25<04:05, 3.55it/s]
71%|βββββββ | 2131/3000 [12:25<04:02, 3.59it/s]
71%|βββββββ | 2132/3000 [12:25<04:08, 3.49it/s]
71%|βββββββ | 2133/3000 [12:25<04:07, 3.51it/s]
71%|βββββββ | 2134/3000 [12:26<04:07, 3.50it/s]
71%|βββββββ | 2135/3000 [12:26<04:01, 3.58it/s]
71%|βββββββ | 2136/3000 [12:26<04:01, 3.58it/s]
71%|βββββββ | 2137/3000 [12:27<03:57, 3.64it/s]
71%|ββββββββ | 2138/3000 [12:27<03:53, 3.70it/s]
71%|ββββββββ | 2139/3000 [12:27<03:49, 3.74it/s]
71%|ββββββββ | 2140/3000 [12:27<03:47, 3.77it/s]
{'loss': 0.0768, 'grad_norm': 0.5290881395339966, 'learning_rate': 2.0878915262099098e-05} |
|
71%|ββββββββ | 2140/3000 [12:27<03:47, 3.77it/s]
71%|ββββββββ | 2141/3000 [12:28<03:49, 3.74it/s]
71%|ββββββββ | 2142/3000 [12:28<03:51, 3.71it/s]
71%|ββββββββ | 2143/3000 [12:28<03:52, 3.69it/s]
71%|ββββββββ | 2144/3000 [12:28<03:50, 3.71it/s]
72%|ββββββββ | 2145/3000 [12:29<03:48, 3.74it/s]
72%|ββββββββ | 2146/3000 [12:29<03:50, 3.71it/s]
72%|ββββββββ | 2147/3000 [12:29<03:56, 3.61it/s]
72%|ββββββββ | 2148/3000 [12:30<03:55, 3.61it/s]
72%|ββββββββ | 2149/3000 [12:30<03:56, 3.60it/s]
72%|ββββββββ | 2150/3000 [12:30<04:02, 3.51it/s]
{'loss': 0.0812, 'grad_norm': 0.5415129065513611, 'learning_rate': 2.0432665624377434e-05} |
|
72%|ββββββββ | 2150/3000 [12:30<04:02, 3.51it/s]
72%|ββββββββ | 2151/3000 [12:30<03:58, 3.56it/s]
72%|ββββββββ | 2152/3000 [12:31<03:56, 3.59it/s]
72%|ββββββββ | 2153/3000 [12:31<03:54, 3.62it/s]
72%|ββββββββ | 2154/3000 [12:31<03:50, 3.66it/s]
72%|ββββββββ | 2155/3000 [12:31<03:46, 3.72it/s]
72%|ββββββββ | 2156/3000 [12:32<03:48, 3.69it/s]
72%|ββββββββ | 2157/3000 [12:32<03:43, 3.77it/s]
72%|ββββββββ | 2158/3000 [12:32<03:42, 3.79it/s]
72%|ββββββββ | 2159/3000 [12:32<03:42, 3.77it/s]
72%|ββββββββ | 2160/3000 [12:33<03:43, 3.75it/s]
{'loss': 0.0769, 'grad_norm': 0.5350505709648132, 'learning_rate': 1.999000866056908e-05} |
|
72%|ββββββββ | 2160/3000 [12:33<03:43, 3.75it/s]
72%|ββββββββ | 2161/3000 [12:33<03:40, 3.80it/s]
72%|ββββββββ | 2162/3000 [12:33<03:41, 3.78it/s]
72%|ββββββββ | 2163/3000 [12:34<03:40, 3.79it/s]
72%|ββββββββ | 2164/3000 [12:34<03:40, 3.79it/s]
72%|ββββββββ | 2165/3000 [12:34<03:36, 3.85it/s]
72%|ββββββββ | 2166/3000 [12:34<03:35, 3.86it/s]
72%|ββββββββ | 2167/3000 [12:35<03:36, 3.86it/s]
72%|ββββββββ | 2168/3000 [12:35<03:39, 3.78it/s]
72%|ββββββββ | 2169/3000 [12:35<03:42, 3.73it/s]
72%|ββββββββ | 2170/3000 [12:35<03:45, 3.68it/s]
{'loss': 0.0685, 'grad_norm': 0.5751089453697205, 'learning_rate': 1.9550998157129946e-05} |
|
72%|ββββββββ | 2170/3000 [12:35<03:45, 3.68it/s]
72%|ββββββββ | 2171/3000 [12:36<03:49, 3.62it/s]
72%|ββββββββ | 2172/3000 [12:36<03:43, 3.70it/s]
72%|ββββββββ | 2173/3000 [12:36<03:41, 3.74it/s]
72%|ββββββββ | 2174/3000 [12:36<03:34, 3.85it/s]
72%|ββββββββ | 2175/3000 [12:37<03:31, 3.90it/s]
73%|ββββββββ | 2176/3000 [12:37<03:30, 3.92it/s]
73%|ββββββββ | 2177/3000 [12:37<03:28, 3.96it/s]
73%|ββββββββ | 2178/3000 [12:37<03:26, 3.97it/s]
73%|ββββββββ | 2179/3000 [12:38<03:25, 3.99it/s]
73%|ββββββββ | 2180/3000 [12:38<03:25, 3.99it/s]
{'loss': 0.0821, 'grad_norm': 0.4716225266456604, 'learning_rate': 1.9115687457441022e-05} |
|
73%|ββββββββ | 2180/3000 [12:38<03:25, 3.99it/s]
73%|ββββββββ | 2181/3000 [12:38<03:28, 3.94it/s]
73%|ββββββββ | 2182/3000 [12:38<03:25, 3.98it/s]
73%|ββββββββ | 2183/3000 [12:39<03:22, 4.03it/s]
73%|ββββββββ | 2184/3000 [12:39<03:25, 3.98it/s]
73%|ββββββββ | 2185/3000 [12:39<03:25, 3.96it/s]
73%|ββββββββ | 2186/3000 [12:39<03:26, 3.95it/s]
73%|ββββββββ | 2187/3000 [12:40<03:25, 3.95it/s]
73%|ββββββββ | 2188/3000 [12:40<03:26, 3.93it/s]
73%|ββββββββ | 2189/3000 [12:40<03:27, 3.92it/s]
73%|ββββββββ | 2190/3000 [12:41<03:31, 3.82it/s]
{'loss': 0.0686, 'grad_norm': 0.5699394941329956, 'learning_rate': 1.868412945532681e-05} |
|
73%|ββββββββ | 2190/3000 [12:41<03:31, 3.82it/s]
73%|ββββββββ | 2191/3000 [12:41<03:35, 3.76it/s]
73%|ββββββββ | 2192/3000 [12:41<03:36, 3.72it/s]
73%|ββββββββ | 2193/3000 [12:41<03:34, 3.77it/s]
73%|ββββββββ | 2194/3000 [12:42<03:34, 3.76it/s]
73%|ββββββββ | 2195/3000 [12:42<03:34, 3.76it/s]
73%|ββββββββ | 2196/3000 [12:42<03:32, 3.79it/s]
73%|ββββββββ | 2197/3000 [12:42<03:28, 3.85it/s]
73%|ββββββββ | 2198/3000 [12:43<03:28, 3.85it/s]
73%|ββββββββ | 2199/3000 [12:43<03:51, 3.46it/s]
73%|ββββββββ | 2200/3000 [12:43<03:41, 3.61it/s]
{'loss': 0.0735, 'grad_norm': 0.5516906976699829, 'learning_rate': 1.8256376588628238e-05} |
|
73%|ββββββββ | 2200/3000 [12:43<03:41, 3.61it/s]
73%|ββββββββ | 2201/3000 [12:43<03:37, 3.67it/s]
73%|ββββββββ | 2202/3000 [12:44<03:36, 3.68it/s]
73%|ββββββββ | 2203/3000 [12:44<03:33, 3.73it/s]
73%|ββββββββ | 2204/3000 [12:44<03:29, 3.79it/s]
74%|ββββββββ | 2205/3000 [12:45<03:29, 3.79it/s]
74%|ββββββββ | 2206/3000 [12:45<03:29, 3.79it/s]
74%|ββββββββ | 2207/3000 [12:45<03:37, 3.65it/s]
74%|ββββββββ | 2208/3000 [12:45<03:29, 3.78it/s]
74%|ββββββββ | 2209/3000 [12:46<03:26, 3.83it/s]
74%|ββββββββ | 2210/3000 [12:46<03:31, 3.74it/s]
{'loss': 0.0729, 'grad_norm': 0.4884108304977417, 'learning_rate': 1.7832480832830987e-05} |
|
74%|ββββββββ | 2210/3000 [12:46<03:31, 3.74it/s]
74%|ββββββββ | 2211/3000 [12:46<03:28, 3.78it/s]
74%|ββββββββ | 2212/3000 [12:46<03:28, 3.78it/s]
74%|ββββββββ | 2213/3000 [12:47<03:27, 3.79it/s]
74%|ββββββββ | 2214/3000 [12:47<03:28, 3.77it/s]
74%|ββββββββ | 2215/3000 [12:47<03:28, 3.77it/s]
74%|ββββββββ | 2216/3000 [12:47<03:26, 3.80it/s]
74%|ββββββββ | 2217/3000 [12:48<03:22, 3.86it/s]
74%|ββββββββ | 2218/3000 [12:48<03:22, 3.86it/s]
74%|ββββββββ | 2219/3000 [12:48<03:21, 3.88it/s]
74%|ββββββββ | 2220/3000 [12:48<03:19, 3.91it/s]
{'loss': 0.0691, 'grad_norm': 0.817842960357666, 'learning_rate': 1.7412493694750176e-05} |
|
74%|ββββββββ | 2220/3000 [12:49<03:19, 3.91it/s]
74%|ββββββββ | 2221/3000 [12:49<03:18, 3.92it/s]
74%|ββββββββ | 2222/3000 [12:49<03:19, 3.90it/s]
74%|ββββββββ | 2223/3000 [12:49<03:22, 3.84it/s]
74%|ββββββββ | 2224/3000 [12:50<03:19, 3.90it/s]
74%|ββββββββ | 2225/3000 [12:50<03:17, 3.93it/s]
74%|ββββββββ | 2226/3000 [12:50<03:17, 3.92it/s]
74%|ββββββββ | 2227/3000 [12:50<03:18, 3.88it/s]
74%|ββββββββ | 2228/3000 [12:51<03:17, 3.90it/s]
74%|ββββββββ | 2229/3000 [12:51<03:16, 3.92it/s]
74%|ββββββββ | 2230/3000 [12:51<03:19, 3.87it/s]
{'loss': 0.0792, 'grad_norm': 0.5762520432472229, 'learning_rate': 1.699646620627168e-05} |
|
74%|ββββββββ | 2230/3000 [12:51<03:19, 3.87it/s]
74%|ββββββββ | 2231/3000 [12:51<03:19, 3.86it/s]
74%|ββββββββ | 2232/3000 [12:52<03:17, 3.89it/s]
74%|ββββββββ | 2233/3000 [12:52<03:18, 3.86it/s]
74%|ββββββββ | 2234/3000 [12:52<03:19, 3.84it/s]
74%|ββββββββ | 2235/3000 [12:52<03:18, 3.85it/s]
75%|ββββββββ | 2236/3000 [12:53<03:18, 3.85it/s]
75%|ββββββββ | 2237/3000 [12:53<03:16, 3.88it/s]
75%|ββββββββ | 2238/3000 [12:53<03:19, 3.83it/s]
75%|ββββββββ | 2239/3000 [12:53<03:21, 3.77it/s]
75%|ββββββββ | 2240/3000 [12:54<03:26, 3.68it/s]
{'loss': 0.0675, 'grad_norm': 0.4962397813796997, 'learning_rate': 1.658444891815152e-05} |
|
75%|ββββββββ | 2240/3000 [12:54<03:26, 3.68it/s]
75%|ββββββββ | 2241/3000 [12:54<03:27, 3.66it/s] |
| Rank 0, Worker 3: Wait for shard 47 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
75%|ββββββββ | 2242/3000 [12:54<03:26, 3.68it/s]
75%|ββββββββ | 2243/3000 [12:55<03:32, 3.57it/s]
75%|ββββββββ | 2244/3000 [12:55<03:27, 3.64it/s]
75%|ββββββββ | 2245/3000 [12:55<03:23, 3.72it/s] |
| Rank 0, Worker 1: Wait for shard 22 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
75%|ββββββββ | 2246/3000 [12:55<03:28, 3.61it/s]
75%|ββββββββ | 2247/3000 [12:56<03:28, 3.60it/s]
75%|ββββββββ | 2248/3000 [12:56<03:23, 3.69it/s]
75%|ββββββββ | 2249/3000 [12:56<03:25, 3.66it/s]
75%|ββββββββ | 2250/3000 [12:56<03:28, 3.60it/s]
{'loss': 0.0786, 'grad_norm': 0.5862926840782166, 'learning_rate': 1.617649189387337e-05} |
|
75%|ββββββββ | 2250/3000 [12:56<03:28, 3.60it/s]
75%|ββββββββ | 2251/3000 [12:57<03:26, 3.62it/s]
75%|ββββββββ | 2252/3000 [12:57<03:22, 3.70it/s]
75%|ββββββββ | 2253/3000 [12:57<03:19, 3.75it/s]
75%|ββββββββ | 2254/3000 [12:58<03:25, 3.63it/s]
75%|ββββββββ | 2255/3000 [12:58<03:27, 3.58it/s]
75%|ββββββββ | 2256/3000 [12:58<03:34, 3.46it/s]
75%|ββββββββ | 2257/3000 [12:58<03:34, 3.47it/s]
75%|ββββββββ | 2258/3000 [12:59<03:29, 3.55it/s]
75%|ββββββββ | 2259/3000 [12:59<03:39, 3.38it/s]
75%|ββββββββ | 2260/3000 [12:59<03:33, 3.47it/s]
{'loss': 0.0796, 'grad_norm': 0.5360605120658875, 'learning_rate': 1.5772644703565565e-05} |
|
75%|ββββββββ | 2260/3000 [12:59<03:33, 3.47it/s]
75%|ββββββββ | 2261/3000 [13:00<03:31, 3.50it/s]
75%|ββββββββ | 2262/3000 [13:00<03:28, 3.55it/s] |
| Rank 0, Worker 0: Wait for shard 55 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
75%|ββββββββ | 2263/3000 [13:00<03:32, 3.47it/s]
75%|ββββββββ | 2264/3000 [13:00<03:24, 3.59it/s] |
| Rank 0, Worker 2: Wait for shard 58 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
76%|ββββββββ | 2265/3000 [13:01<03:22, 3.63it/s]
76%|ββββββββ | 2266/3000 [13:01<03:22, 3.62it/s]
76%|ββββββββ | 2267/3000 [13:01<03:19, 3.68it/s]
76%|ββββββββ | 2268/3000 [13:01<03:16, 3.73it/s]
76%|ββββββββ | 2269/3000 [13:02<03:16, 3.71it/s]
76%|ββββββββ | 2270/3000 [13:02<03:34, 3.41it/s]
{'loss': 0.0694, 'grad_norm': 0.5301634669303894, 'learning_rate': 1.537295641797785e-05} |
|
76%|ββββββββ | 2270/3000 [13:02<03:34, 3.41it/s]
76%|ββββββββ | 2271/3000 [13:02<03:44, 3.25it/s]
76%|ββββββββ | 2272/3000 [13:03<03:40, 3.30it/s]
76%|ββββββββ | 2273/3000 [13:03<03:36, 3.36it/s]
76%|ββββββββ | 2274/3000 [13:03<03:30, 3.46it/s]
76%|ββββββββ | 2275/3000 [13:04<03:22, 3.58it/s]
76%|ββββββββ | 2276/3000 [13:04<03:17, 3.66it/s]
76%|ββββββββ | 2277/3000 [13:04<03:15, 3.70it/s]
76%|ββββββββ | 2278/3000 [13:04<03:13, 3.72it/s] |
| Rank 0, Worker 4: Wait for shard 38 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
76%|ββββββββ | 2279/3000 [13:05<03:14, 3.71it/s] |
| Rank 0, Worker 5: Wait for shard 23 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
76%|ββββββββ | 2280/3000 [13:05<03:22, 3.56it/s]
{'loss': 0.0799, 'grad_norm': 0.4486077129840851, 'learning_rate': 1.4977475602518876e-05} |
|
76%|ββββββββ | 2280/3000 [13:05<03:22, 3.56it/s]
76%|ββββββββ | 2281/3000 [13:05<03:31, 3.41it/s]
76%|ββββββββ | 2282/3000 [13:05<03:24, 3.51it/s]
76%|ββββββββ | 2283/3000 [13:06<03:19, 3.60it/s]
76%|ββββββββ | 2284/3000 [13:06<03:29, 3.43it/s]
76%|ββββββββ | 2285/3000 [13:06<03:23, 3.51it/s]
76%|ββββββββ | 2286/3000 [13:07<03:22, 3.53it/s]
76%|ββββββββ | 2287/3000 [13:07<03:19, 3.58it/s]
76%|ββββββββ | 2288/3000 [13:07<03:18, 3.59it/s]
76%|ββββββββ | 2289/3000 [13:07<03:21, 3.53it/s]
76%|ββββββββ | 2290/3000 [13:08<03:15, 3.63it/s]
{'loss': 0.0783, 'grad_norm': 0.5895755887031555, 'learning_rate': 1.4586250311355132e-05} |
|
76%|ββββββββ | 2290/3000 [13:08<03:15, 3.63it/s]
76%|ββββββββ | 2291/3000 [13:08<03:14, 3.65it/s]
76%|ββββββββ | 2292/3000 [13:08<03:11, 3.70it/s]
76%|ββββββββ | 2293/3000 [13:09<03:11, 3.69it/s]
76%|ββββββββ | 2294/3000 [13:09<03:11, 3.69it/s]
76%|ββββββββ | 2295/3000 [13:09<03:18, 3.55it/s]
77%|ββββββββ | 2296/3000 [13:09<03:19, 3.53it/s]
77%|ββββββββ | 2297/3000 [13:10<03:21, 3.48it/s]
77%|ββββββββ | 2298/3000 [13:10<03:17, 3.55it/s]
77%|ββββββββ | 2299/3000 [13:10<03:15, 3.59it/s]
77%|ββββββββ | 2300/3000 [13:10<03:14, 3.60it/s]
{'loss': 0.0684, 'grad_norm': 0.5945717692375183, 'learning_rate': 1.4199328081572e-05} |
|
77%|ββββββββ | 2300/3000 [13:11<03:14, 3.60it/s]
77%|ββββββββ | 2301/3000 [13:11<03:26, 3.38it/s]
77%|ββββββββ | 2302/3000 [13:11<03:20, 3.48it/s]
77%|ββββββββ | 2303/3000 [13:11<03:28, 3.34it/s]
77%|ββββββββ | 2304/3000 [13:12<03:44, 3.10it/s]
77%|ββββββββ | 2305/3000 [13:12<03:31, 3.28it/s]
77%|ββββββββ | 2306/3000 [13:12<03:24, 3.40it/s]
77%|ββββββββ | 2307/3000 [13:13<03:16, 3.52it/s]
77%|ββββββββ | 2308/3000 [13:13<03:13, 3.58it/s]
77%|ββββββββ | 2309/3000 [13:13<03:09, 3.64it/s]
77%|ββββββββ | 2310/3000 [13:13<03:04, 3.74it/s]
{'loss': 0.0836, 'grad_norm': 0.6563544869422913, 'learning_rate': 1.3816755927397502e-05} |
|
77%|ββββββββ | 2310/3000 [13:13<03:04, 3.74it/s]
77%|ββββββββ | 2311/3000 [13:14<03:07, 3.68it/s]
77%|ββββββββ | 2312/3000 [13:14<03:07, 3.68it/s]
77%|ββββββββ | 2313/3000 [13:14<03:07, 3.66it/s]
77%|ββββββββ | 2314/3000 [13:14<03:06, 3.67it/s]
77%|ββββββββ | 2315/3000 [13:15<03:20, 3.42it/s]
77%|ββββββββ | 2316/3000 [13:15<03:13, 3.53it/s]
77%|ββββββββ | 2317/3000 [13:15<03:08, 3.62it/s]
77%|ββββββββ | 2318/3000 [13:16<03:12, 3.55it/s]
77%|ββββββββ | 2319/3000 [13:16<03:24, 3.33it/s]
77%|ββββββββ | 2320/3000 [13:16<03:20, 3.39it/s]
{'loss': 0.0685, 'grad_norm': 0.4608897268772125, 'learning_rate': 1.343858033448982e-05} |
|
77%|ββββββββ | 2320/3000 [13:16<03:20, 3.39it/s]
77%|ββββββββ | 2321/3000 [13:17<03:15, 3.47it/s]
77%|ββββββββ | 2322/3000 [13:17<03:10, 3.57it/s]
77%|ββββββββ | 2323/3000 [13:17<03:05, 3.66it/s]
77%|ββββββββ | 2324/3000 [13:17<03:02, 3.70it/s]
78%|ββββββββ | 2325/3000 [13:18<02:59, 3.75it/s]
78%|ββββββββ | 2326/3000 [13:18<02:56, 3.81it/s]
78%|ββββββββ | 2327/3000 [13:18<02:54, 3.87it/s]
78%|ββββββββ | 2328/3000 [13:18<02:54, 3.86it/s]
78%|ββββββββ | 2329/3000 [13:19<02:55, 3.82it/s]
78%|ββββββββ | 2330/3000 [13:19<02:53, 3.87it/s]
{'loss': 0.0816, 'grad_norm': 0.5362879633903503, 'learning_rate': 1.3064847254288797e-05} |
|
78%|ββββββββ | 2330/3000 [13:19<02:53, 3.87it/s]
78%|ββββββββ | 2331/3000 [13:19<02:56, 3.80it/s]
78%|ββββββββ | 2332/3000 [13:19<02:57, 3.76it/s]
78%|ββββββββ | 2333/3000 [13:20<02:54, 3.81it/s]
78%|ββββββββ | 2334/3000 [13:20<02:53, 3.83it/s]
78%|ββββββββ | 2335/3000 [13:20<02:59, 3.71it/s]
78%|ββββββββ | 2336/3000 [13:20<02:58, 3.71it/s]
78%|ββββββββ | 2337/3000 [13:21<03:01, 3.64it/s]
78%|ββββββββ | 2338/3000 [13:21<03:06, 3.55it/s]
78%|ββββββββ | 2339/3000 [13:21<03:01, 3.64it/s]
78%|ββββββββ | 2340/3000 [13:22<02:56, 3.74it/s]
{'loss': 0.0717, 'grad_norm': 0.46307265758514404, 'learning_rate': 1.2695602098432502e-05} |
|
78%|ββββββββ | 2340/3000 [13:22<02:56, 3.74it/s]
78%|ββββββββ | 2341/3000 [13:22<03:00, 3.65it/s]
78%|ββββββββ | 2342/3000 [13:22<03:02, 3.60it/s]
78%|ββββββββ | 2343/3000 [13:22<03:02, 3.60it/s]
78%|ββββββββ | 2344/3000 [13:23<03:11, 3.43it/s]
78%|ββββββββ | 2345/3000 [13:23<03:06, 3.52it/s]
78%|ββββββββ | 2346/3000 [13:23<03:00, 3.61it/s]
78%|ββββββββ | 2347/3000 [13:24<03:02, 3.59it/s]
78%|ββββββββ | 2348/3000 [13:24<03:02, 3.57it/s]
78%|ββββββββ | 2349/3000 [13:24<02:59, 3.62it/s]
78%|ββββββββ | 2350/3000 [13:24<02:57, 3.67it/s]
{'loss': 0.0787, 'grad_norm': 0.4995485842227936, 'learning_rate': 1.233088973323937e-05} |
|
78%|ββββββββ | 2350/3000 [13:24<02:57, 3.67it/s]
78%|ββββββββ | 2351/3000 [13:25<02:55, 3.69it/s]
78%|ββββββββ | 2352/3000 [13:25<02:53, 3.74it/s]
78%|ββββββββ | 2353/3000 [13:25<02:52, 3.76it/s]
78%|ββββββββ | 2354/3000 [13:25<02:49, 3.80it/s]
78%|ββββββββ | 2355/3000 [13:26<02:46, 3.87it/s]
79%|ββββββββ | 2356/3000 [13:26<02:43, 3.93it/s]
79%|ββββββββ | 2357/3000 [13:26<02:48, 3.83it/s]
79%|ββββββββ | 2358/3000 [13:26<02:50, 3.76it/s]
79%|ββββββββ | 2359/3000 [13:27<02:52, 3.73it/s]
79%|ββββββββ | 2360/3000 [13:27<02:50, 3.76it/s]
{'loss': 0.0714, 'grad_norm': 0.6543102860450745, 'learning_rate': 1.1970754474256563e-05} |
|
79%|ββββββββ | 2360/3000 [13:27<02:50, 3.76it/s]
79%|ββββββββ | 2361/3000 [13:27<02:52, 3.70it/s]
79%|ββββββββ | 2362/3000 [13:28<03:00, 3.54it/s]
79%|ββββββββ | 2363/3000 [13:28<02:56, 3.60it/s]
79%|ββββββββ | 2364/3000 [13:28<02:52, 3.68it/s]
79%|ββββββββ | 2365/3000 [13:28<02:50, 3.73it/s]
79%|ββββββββ | 2366/3000 [13:29<02:48, 3.76it/s]
79%|ββββββββ | 2367/3000 [13:29<02:47, 3.77it/s]
79%|ββββββββ | 2368/3000 [13:29<02:47, 3.77it/s]
79%|ββββββββ | 2369/3000 [13:29<02:52, 3.65it/s]
79%|ββββββββ | 2370/3000 [13:30<02:56, 3.56it/s]
{'loss': 0.0703, 'grad_norm': 0.4924975633621216, 'learning_rate': 1.16152400808752e-05} |
|
79%|ββββββββ | 2370/3000 [13:30<02:56, 3.56it/s]
79%|ββββββββ | 2371/3000 [13:30<03:00, 3.49it/s]
79%|ββββββββ | 2372/3000 [13:30<02:56, 3.56it/s]
79%|ββββββββ | 2373/3000 [13:31<02:59, 3.50it/s]
79%|ββββββββ | 2374/3000 [13:31<03:08, 3.32it/s]
79%|ββββββββ | 2375/3000 [13:31<03:12, 3.24it/s]
79%|ββββββββ | 2376/3000 [13:32<03:04, 3.39it/s]
79%|ββββββββ | 2377/3000 [13:32<02:57, 3.50it/s]
79%|ββββββββ | 2378/3000 [13:32<02:51, 3.63it/s]
79%|ββββββββ | 2379/3000 [13:32<02:47, 3.70it/s]
79%|ββββββββ | 2380/3000 [13:33<02:47, 3.69it/s]
{'loss': 0.0683, 'grad_norm': 0.46566370129585266, 'learning_rate': 1.1264389751013326e-05} |
|
79%|ββββββββ | 2380/3000 [13:33<02:47, 3.69it/s]
79%|ββββββββ | 2381/3000 [13:33<02:48, 3.67it/s]
79%|ββββββββ | 2382/3000 [13:33<02:51, 3.61it/s]
79%|ββββββββ | 2383/3000 [13:33<02:46, 3.70it/s]
79%|ββββββββ | 2384/3000 [13:34<02:45, 3.73it/s]
80%|ββββββββ | 2385/3000 [13:34<02:45, 3.73it/s]
80%|ββββββββ | 2386/3000 [13:34<02:47, 3.67it/s]
80%|ββββββββ | 2387/3000 [13:34<02:44, 3.73it/s]
80%|ββββββββ | 2388/3000 [13:35<02:45, 3.70it/s]
80%|ββββββββ | 2389/3000 [13:35<02:53, 3.52it/s]
80%|ββββββββ | 2390/3000 [13:35<02:52, 3.54it/s]
{'loss': 0.0793, 'grad_norm': 0.4437257945537567, 'learning_rate': 1.0918246115866964e-05} |
|
80%|ββββββββ | 2390/3000 [13:35<02:52, 3.54it/s]
80%|ββββββββ | 2391/3000 [13:36<02:45, 3.67it/s]
80%|ββββββββ | 2392/3000 [13:36<02:48, 3.62it/s]
80%|ββββββββ | 2393/3000 [13:36<02:49, 3.58it/s]
80%|ββββββββ | 2394/3000 [13:36<02:44, 3.68it/s]
80%|ββββββββ | 2395/3000 [13:37<02:41, 3.75it/s]
80%|ββββββββ | 2396/3000 [13:37<02:39, 3.78it/s]
80%|ββββββββ | 2397/3000 [13:37<02:39, 3.79it/s]
80%|ββββββββ | 2398/3000 [13:37<02:40, 3.76it/s]
80%|ββββββββ | 2399/3000 [13:38<02:45, 3.64it/s]
80%|ββββββββ | 2400/3000 [13:38<02:53, 3.47it/s]
{'loss': 0.0782, 'grad_norm': 0.5867979526519775, 'learning_rate': 1.0576851234730095e-05} |
|
80%|ββββββββ | 2400/3000 [13:38<02:53, 3.47it/s]
80%|ββββββββ | 2401/3000 [13:38<03:03, 3.26it/s]
80%|ββββββββ | 2402/3000 [13:39<03:03, 3.26it/s]
80%|ββββββββ | 2403/3000 [13:39<03:02, 3.28it/s]
80%|ββββββββ | 2404/3000 [13:39<02:57, 3.36it/s]
80%|ββββββββ | 2405/3000 [13:40<02:54, 3.42it/s]
80%|ββββββββ | 2406/3000 [13:40<02:53, 3.42it/s]
80%|ββββββββ | 2407/3000 [13:40<03:01, 3.26it/s]
80%|ββββββββ | 2408/3000 [13:41<02:57, 3.34it/s]
80%|ββββββββ | 2409/3000 [13:41<02:49, 3.48it/s]
80%|ββββββββ | 2410/3000 [13:41<02:47, 3.52it/s]
{'loss': 0.0667, 'grad_norm': 0.5302107334136963, 'learning_rate': 1.0240246589884044e-05} |
|
80%|ββββββββ | 2410/3000 [13:41<02:47, 3.52it/s]
80%|ββββββββ | 2411/3000 [13:41<02:58, 3.30it/s]
80%|ββββββββ | 2412/3000 [13:42<02:47, 3.50it/s]
80%|ββββββββ | 2413/3000 [13:42<02:43, 3.59it/s]
80%|ββββββββ | 2414/3000 [13:42<02:39, 3.67it/s]
80%|ββββββββ | 2415/3000 [13:42<02:37, 3.71it/s]
81%|ββββββββ | 2416/3000 [13:43<02:34, 3.77it/s]
81%|ββββββββ | 2417/3000 [13:43<02:33, 3.81it/s]
81%|ββββββββ | 2418/3000 [13:43<02:30, 3.86it/s]
81%|ββββββββ | 2419/3000 [13:43<02:29, 3.90it/s]
81%|ββββββββ | 2420/3000 [13:44<02:28, 3.91it/s]
{'loss': 0.072, 'grad_norm': 0.5226858854293823, 'learning_rate': 9.908473081557151e-06} |
|
81%|ββββββββ | 2420/3000 [13:44<02:28, 3.91it/s]
81%|ββββββββ | 2421/3000 [13:44<02:28, 3.91it/s]
81%|ββββββββ | 2422/3000 [13:44<02:25, 3.97it/s]
81%|ββββββββ | 2423/3000 [13:44<02:30, 3.83it/s]
81%|ββββββββ | 2424/3000 [13:45<02:29, 3.85it/s]
81%|ββββββββ | 2425/3000 [13:45<02:26, 3.91it/s]
81%|ββββββββ | 2426/3000 [13:45<02:25, 3.96it/s]
81%|ββββββββ | 2427/3000 [13:46<02:27, 3.88it/s]
81%|ββββββββ | 2428/3000 [13:46<02:28, 3.86it/s]
81%|ββββββββ | 2429/3000 [13:46<02:28, 3.84it/s]
81%|ββββββββ | 2430/3000 [13:46<02:26, 3.90it/s]
{'loss': 0.0677, 'grad_norm': 0.4361122250556946, 'learning_rate': 9.581571022954988e-06} |
|
81%|ββββββββ | 2430/3000 [13:46<02:26, 3.90it/s]
81%|ββββββββ | 2431/3000 [13:47<02:33, 3.71it/s]
81%|ββββββββ | 2432/3000 [13:47<02:35, 3.65it/s]
81%|ββββββββ | 2433/3000 [13:47<02:34, 3.68it/s] |
| Rank 0, Worker 3: Wait for shard 19 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
81%|ββββββββ | 2434/3000 [13:47<02:31, 3.73it/s]
81%|ββββββββ | 2435/3000 [13:48<02:35, 3.64it/s]
81%|ββββββββ | 2436/3000 [13:48<02:31, 3.72it/s]
81%|ββββββββ | 2437/3000 [13:48<02:30, 3.75it/s] |
| Rank 0, Worker 1: Wait for shard 18 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
81%|βββββββββ | 2438/3000 [13:48<02:34, 3.65it/s]
81%|βββββββββ | 2439/3000 [13:49<02:32, 3.68it/s]
81%|βββββββββ | 2440/3000 [13:49<02:31, 3.70it/s]
{'loss': 0.077, 'grad_norm': 0.5139945149421692, 'learning_rate': 9.259580135361929e-06} |
|
81%|βββββββββ | 2440/3000 [13:49<02:31, 3.70it/s]
81%|βββββββββ | 2441/3000 [13:49<02:31, 3.70it/s]
81%|βββββββββ | 2442/3000 [13:50<02:29, 3.73it/s]
81%|βββββββββ | 2443/3000 [13:50<02:28, 3.75it/s]
81%|βββββββββ | 2444/3000 [13:50<02:28, 3.75it/s]
82%|βββββββββ | 2445/3000 [13:50<02:27, 3.76it/s]
82%|βββββββββ | 2446/3000 [13:51<02:24, 3.83it/s]
82%|βββββββββ | 2447/3000 [13:51<02:23, 3.84it/s]
82%|βββββββββ | 2448/3000 [13:51<02:25, 3.79it/s]
82%|βββββββββ | 2449/3000 [13:51<02:26, 3.77it/s]
82%|βββββββββ | 2450/3000 [13:52<02:24, 3.81it/s]
{'loss': 0.0721, 'grad_norm': 0.4558435380458832, 'learning_rate': 8.9425395433148e-06} |
|
82%|βββββββββ | 2450/3000 [13:52<02:24, 3.81it/s]
Rank 0, Worker 2: Wait for shard 20 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
82%|βββββββββ | 2451/3000 [13:52<02:24, 3.80it/s]
82%|βββββββββ | 2452/3000 [13:52<02:29, 3.67it/s]
82%|βββββββββ | 2453/3000 [13:52<02:25, 3.75it/s]
82%|βββββββββ | 2454/3000 [13:53<02:27, 3.70it/s]
Rank 0, Worker 0: Wait for shard 34 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
82%|βββββββββ | 2455/3000 [13:53<02:26, 3.72it/s]
82%|βββββββββ | 2456/3000 [13:53<02:25, 3.74it/s]
82%|βββββββββ | 2457/3000 [13:54<02:26, 3.70it/s]
82%|βββββββββ | 2458/3000 [13:54<02:36, 3.47it/s]
82%|βββββββββ | 2459/3000 [13:54<02:33, 3.53it/s]
82%|βββββββββ | 2460/3000 [13:54<02:27, 3.67it/s]
{'loss': 0.0799, 'grad_norm': 0.5343697667121887, 'learning_rate': 8.630487769848877e-06} |
|
82%|βββββββββ | 2460/3000 [13:54<02:27, 3.67it/s]
82%|βββββββββ | 2461/3000 [13:55<02:24, 3.72it/s]
82%|βββββββββ | 2462/3000 [13:55<02:23, 3.74it/s]
82%|βββββββββ | 2463/3000 [13:55<02:20, 3.82it/s]
82%|βββββββββ | 2464/3000 [13:55<02:18, 3.88it/s]
82%|βββββββββ | 2465/3000 [13:56<02:17, 3.90it/s]
82%|βββββββββ | 2466/3000 [13:56<02:21, 3.78it/s]
82%|βββββββββ | 2467/3000 [13:56<02:17, 3.87it/s]
82%|βββββββββ | 2468/3000 [13:56<02:16, 3.89it/s]
82%|βββββββββ | 2469/3000 [13:57<02:18, 3.83it/s]
82%|βββββββββ | 2470/3000 [13:57<02:20, 3.77it/s]
{'loss': 0.0742, 'grad_norm': 0.4634743332862854, 'learning_rate': 8.323462731816961e-06} |
|
82%|βββββββββ | 2470/3000 [13:57<02:20, 3.77it/s]
Rank 0, Worker 4: Wait for shard 43 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
82%|βββββββββ | 2471/3000 [13:57<02:18, 3.82it/s]
Rank 0, Worker 5: Wait for shard 28 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
82%|βββββββββ | 2472/3000 [13:58<02:16, 3.86it/s]
82%|βββββββββ | 2473/3000 [13:58<02:16, 3.87it/s]
82%|βββββββββ | 2474/3000 [13:58<02:17, 3.83it/s]
82%|βββββββββ | 2475/3000 [13:58<02:17, 3.83it/s]
83%|βββββββββ | 2476/3000 [13:59<02:17, 3.82it/s]
83%|βββββββββ | 2477/3000 [13:59<02:16, 3.83it/s]
83%|βββββββββ | 2478/3000 [13:59<02:16, 3.84it/s]
83%|βββββββββ | 2479/3000 [13:59<02:15, 3.85it/s]
83%|βββββββββ | 2480/3000 [14:00<02:14, 3.86it/s]
{'loss': 0.0766, 'grad_norm': 0.43630945682525635, 'learning_rate': 8.021501735282266e-06} |
|
83%|βββββββββ | 2480/3000 [14:00<02:14, 3.86it/s]
83%|βββββββββ | 2481/3000 [14:00<02:15, 3.82it/s]
83%|βββββββββ | 2482/3000 [14:00<02:16, 3.80it/s]
83%|βββββββββ | 2483/3000 [14:00<02:15, 3.82it/s]
83%|βββββββββ | 2484/3000 [14:01<02:16, 3.77it/s]
83%|βββββββββ | 2485/3000 [14:01<02:17, 3.74it/s]
83%|βββββββββ | 2486/3000 [14:01<02:19, 3.69it/s]
83%|βββββββββ | 2487/3000 [14:01<02:20, 3.65it/s]
83%|βββββββββ | 2488/3000 [14:02<02:21, 3.63it/s]
83%|βββββββββ | 2489/3000 [14:02<02:18, 3.68it/s]
83%|βββββββββ | 2490/3000 [14:02<02:19, 3.66it/s]
{'loss': 0.0625, 'grad_norm': 0.5257459282875061, 'learning_rate': 7.724641470985378e-06} |
|
83%|βββββββββ | 2490/3000 [14:02<02:19, 3.66it/s]
83%|βββββββββ | 2491/3000 [14:03<02:19, 3.64it/s]
83%|βββββββββ | 2492/3000 [14:03<02:21, 3.60it/s]
83%|βββββββββ | 2493/3000 [14:03<02:22, 3.57it/s]
83%|βββββββββ | 2494/3000 [14:03<02:20, 3.61it/s]
83%|βββββββββ | 2495/3000 [14:04<02:17, 3.67it/s]
83%|βββββββββ | 2496/3000 [14:04<02:15, 3.72it/s]
83%|βββββββββ | 2497/3000 [14:04<02:15, 3.70it/s]
83%|βββββββββ | 2498/3000 [14:04<02:15, 3.70it/s]
83%|βββββββββ | 2499/3000 [14:05<02:15, 3.71it/s]
83%|βββββββββ | 2500/3000 [14:05<02:16, 3.67it/s]
{'loss': 0.0756, 'grad_norm': 0.4771284759044647, 'learning_rate': 7.432918009885997e-06} |
|
83%|βββββββββ | 2500/3000 [14:05<02:16, 3.67it/s]
Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2500/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2500 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-2500/wandb_config.json |
|
83%|βββββββββ | 2501/3000 [14:42<1:34:54, 11.41s/it]
83%|βββββββββ | 2502/3000 [14:43<1:06:57, 8.07s/it]
83%|βββββββββ | 2503/3000 [14:43<47:25, 5.73s/it]
83%|βββββββββ | 2504/3000 [14:43<33:48, 4.09s/it]
84%|βββββββββ | 2505/3000 [14:44<24:18, 2.95s/it]
84%|βββββββββ | 2506/3000 [14:44<17:39, 2.14s/it]
84%|βββββββββ | 2507/3000 [14:44<12:59, 1.58s/it]
84%|βββββββββ | 2508/3000 [14:44<09:44, 1.19s/it]
84%|βββββββββ | 2509/3000 [14:45<07:26, 1.10it/s]
84%|βββββββββ | 2510/3000 [14:45<05:51, 1.39it/s]
{'loss': 0.0769, 'grad_norm': 0.47868987917900085, 'learning_rate': 7.146366798780096e-06} |
|
84%|βββββββββ | 2510/3000 [14:45<05:51, 1.39it/s]
84%|βββββββββ | 2511/3000 [14:45<04:46, 1.71it/s]
84%|βββββββββ | 2512/3000 [14:45<03:58, 2.04it/s]
84%|βββββββββ | 2513/3000 [14:46<03:29, 2.33it/s]
84%|βββββββββ | 2514/3000 [14:46<03:06, 2.61it/s]
84%|βββββββββ | 2515/3000 [14:46<02:50, 2.85it/s]
84%|βββββββββ | 2516/3000 [14:47<02:37, 3.07it/s]
84%|βββββββββ | 2517/3000 [14:47<02:29, 3.24it/s]
84%|βββββββββ | 2518/3000 [14:47<02:21, 3.40it/s]
84%|βββββββββ | 2519/3000 [14:47<02:19, 3.46it/s]
84%|βββββββββ | 2520/3000 [14:48<02:15, 3.55it/s]
{'loss': 0.0682, 'grad_norm': 0.5330846309661865, 'learning_rate': 6.865022655992798e-06} |
|
84%|βββββββββ | 2520/3000 [14:48<02:15, 3.55it/s]
84%|βββββββββ | 2521/3000 [14:48<02:12, 3.61it/s]
84%|βββββββββ | 2522/3000 [14:48<02:10, 3.65it/s]
84%|βββββββββ | 2523/3000 [14:48<02:14, 3.53it/s]
84%|βββββββββ | 2524/3000 [14:49<02:15, 3.51it/s]
84%|βββββββββ | 2525/3000 [14:49<02:15, 3.51it/s]
84%|βββββββββ | 2526/3000 [14:49<02:15, 3.49it/s]
84%|βββββββββ | 2527/3000 [14:50<02:11, 3.60it/s]
84%|βββββββββ | 2528/3000 [14:50<02:07, 3.71it/s]
84%|βββββββββ | 2529/3000 [14:50<02:07, 3.69it/s]
84%|βββββββββ | 2530/3000 [14:50<02:07, 3.67it/s]
{'loss': 0.0685, 'grad_norm': 0.4761175215244293, 'learning_rate': 6.588919767147639e-06} |
|
84%|βββββββββ | 2530/3000 [14:50<02:07, 3.67it/s]
84%|βββββββββ | 2531/3000 [14:51<02:10, 3.60it/s]
84%|βββββββββ | 2532/3000 [14:51<02:07, 3.68it/s]
84%|βββββββββ | 2533/3000 [14:51<02:04, 3.74it/s]
84%|βββββββββ | 2534/3000 [14:51<02:00, 3.86it/s]
84%|βββββββββ | 2535/3000 [14:52<01:59, 3.88it/s]
85%|βββββββββ | 2536/3000 [14:52<02:01, 3.81it/s]
85%|βββββββββ | 2537/3000 [14:52<01:59, 3.88it/s]
85%|βββββββββ | 2538/3000 [14:52<01:56, 3.98it/s]
85%|βββββββββ | 2539/3000 [14:53<01:55, 3.99it/s]
85%|βββββββββ | 2540/3000 [14:53<01:55, 4.00it/s]
{'loss': 0.075, 'grad_norm': 0.4890459477901459, 'learning_rate': 6.318091681012772e-06} |
|
85%|βββββββββ | 2540/3000 [14:53<01:55, 4.00it/s]
85%|βββββββββ | 2541/3000 [14:53<01:53, 4.03it/s]
85%|βββββββββ | 2542/3000 [14:53<01:55, 3.98it/s]
85%|βββββββββ | 2543/3000 [14:54<02:02, 3.73it/s]
85%|βββββββββ | 2544/3000 [14:54<02:01, 3.75it/s]
85%|βββββββββ | 2545/3000 [14:54<01:57, 3.86it/s]
85%|βββββββββ | 2546/3000 [14:54<01:55, 3.94it/s]
85%|βββββββββ | 2547/3000 [14:55<01:54, 3.96it/s]
85%|βββββββββ | 2548/3000 [14:55<01:51, 4.05it/s]
85%|βββββββββ | 2549/3000 [14:55<01:49, 4.11it/s]
85%|βββββββββ | 2550/3000 [14:55<01:48, 4.14it/s]
{'loss': 0.0825, 'grad_norm': 0.4035462737083435, 'learning_rate': 6.052571305424531e-06} |
|
85%|βββββββββ | 2550/3000 [14:55<01:48, 4.14it/s]
85%|βββββββββ | 2551/3000 [14:56<01:49, 4.09it/s]
85%|βββββββββ | 2552/3000 [14:56<01:47, 4.16it/s]
85%|βββββββββ | 2553/3000 [14:56<01:46, 4.20it/s]
85%|βββββββββ | 2554/3000 [14:56<01:47, 4.16it/s]
85%|βββββββββ | 2555/3000 [14:57<01:47, 4.14it/s]
85%|βββββββββ | 2556/3000 [14:57<01:46, 4.16it/s]
85%|βββββββββ | 2557/3000 [14:57<01:46, 4.17it/s]
85%|βββββββββ | 2558/3000 [14:57<01:46, 4.15it/s]
85%|βββββββββ | 2559/3000 [14:58<01:49, 4.04it/s]
85%|βββββββββ | 2560/3000 [14:58<01:48, 4.05it/s]
{'loss': 0.0747, 'grad_norm': 0.3605692982673645, 'learning_rate': 5.79239090328883e-06} |
|
85%|βββββββββ | 2560/3000 [14:58<01:48, 4.05it/s]
85%|βββββββββ | 2561/3000 [14:58<01:50, 3.96it/s]
85%|βββββββββ | 2562/3000 [14:58<01:50, 3.97it/s]
85%|βββββββββ | 2563/3000 [14:59<01:48, 4.04it/s]
85%|βββββββββ | 2564/3000 [14:59<01:46, 4.09it/s]
86%|βββββββββ | 2565/3000 [14:59<01:47, 4.05it/s]
86%|βββββββββ | 2566/3000 [14:59<01:48, 4.00it/s]
86%|βββββββββ | 2567/3000 [15:00<01:49, 3.95it/s]
86%|βββββββββ | 2568/3000 [15:00<01:50, 3.90it/s]
86%|βββββββββ | 2569/3000 [15:00<01:53, 3.80it/s]
86%|βββββββββ | 2570/3000 [15:00<01:56, 3.69it/s]
{'loss': 0.0707, 'grad_norm': 0.3974292576313019, 'learning_rate': 5.537582088660937e-06} |
|
86%|βββββββββ | 2570/3000 [15:00<01:56, 3.69it/s]
86%|βββββββββ | 2571/3000 [15:01<01:58, 3.61it/s]
86%|βββββββββ | 2572/3000 [15:01<01:59, 3.59it/s]
86%|βββββββββ | 2573/3000 [15:01<02:01, 3.52it/s]
86%|βββββββββ | 2574/3000 [15:02<02:00, 3.54it/s]
86%|βββββββββ | 2575/3000 [15:02<01:57, 3.63it/s]
86%|βββββββββ | 2576/3000 [15:02<01:56, 3.64it/s]
86%|βββββββββ | 2577/3000 [15:02<01:54, 3.68it/s]
86%|βββββββββ | 2578/3000 [15:03<01:53, 3.71it/s]
86%|βββββββββ | 2579/3000 [15:03<01:52, 3.75it/s]
86%|βββββββββ | 2580/3000 [15:03<01:50, 3.81it/s]
{'loss': 0.0709, 'grad_norm': 0.4098531901836395, 'learning_rate': 5.28817582290414e-06} |
|
86%|βββββββββ | 2580/3000 [15:03<01:50, 3.81it/s]
86%|βββββββββ | 2581/3000 [15:03<01:49, 3.82it/s]
86%|βββββββββ | 2582/3000 [15:04<01:48, 3.84it/s]
86%|βββββββββ | 2583/3000 [15:04<01:47, 3.87it/s]
86%|βββββββββ | 2584/3000 [15:04<01:47, 3.86it/s]
86%|βββββββββ | 2585/3000 [15:04<01:47, 3.87it/s]
86%|βββββββββ | 2586/3000 [15:05<01:48, 3.80it/s]
86%|βββββββββ | 2587/3000 [15:05<01:48, 3.82it/s]
86%|βββββββββ | 2588/3000 [15:05<01:50, 3.74it/s]
86%|βββββββββ | 2589/3000 [15:06<01:50, 3.71it/s]
86%|βββββββββ | 2590/3000 [15:06<01:50, 3.70it/s]
{'loss': 0.0766, 'grad_norm': 0.4192793667316437, 'learning_rate': 5.044202410927706e-06} |
|
86%|βββββββββ | 2590/3000 [15:06<01:50, 3.70it/s]
86%|βββββββββ | 2591/3000 [15:06<01:52, 3.63it/s]
86%|βββββββββ | 2592/3000 [15:06<01:50, 3.68it/s]
86%|βββββββββ | 2593/3000 [15:07<01:49, 3.70it/s]
86%|βββββββββ | 2594/3000 [15:07<01:50, 3.69it/s]
86%|βββββββββ | 2595/3000 [15:07<01:48, 3.73it/s]
87%|βββββββββ | 2596/3000 [15:07<01:47, 3.77it/s]
87%|βββββββββ | 2597/3000 [15:08<01:47, 3.76it/s]
87%|βββββββββ | 2598/3000 [15:08<01:46, 3.77it/s]
87%|βββββββββ | 2599/3000 [15:08<01:45, 3.81it/s]
87%|βββββββββ | 2600/3000 [15:08<01:44, 3.83it/s]
{'loss': 0.068, 'grad_norm': 0.39969003200531006, 'learning_rate': 4.805691497504505e-06} |
|
87%|βββββββββ | 2600/3000 [15:08<01:44, 3.83it/s]
87%|βββββββββ | 2601/3000 [15:09<01:45, 3.79it/s]
87%|βββββββββ | 2602/3000 [15:09<01:44, 3.81it/s]
87%|βββββββββ | 2603/3000 [15:09<01:43, 3.84it/s]
87%|βββββββββ | 2604/3000 [15:09<01:42, 3.87it/s]
87%|βββββββββ | 2605/3000 [15:10<01:41, 3.89it/s]
87%|βββββββββ | 2606/3000 [15:10<01:42, 3.85it/s]
87%|βββββββββ | 2607/3000 [15:10<01:43, 3.80it/s]
87%|βββββββββ | 2608/3000 [15:11<01:44, 3.76it/s]
87%|βββββββββ | 2609/3000 [15:11<01:43, 3.76it/s]
87%|βββββββββ | 2610/3000 [15:11<01:45, 3.69it/s]
{'loss': 0.0677, 'grad_norm': 0.4076022803783417, 'learning_rate': 4.57267206366902e-06} |
|
87%|βββββββββ | 2610/3000 [15:11<01:45, 3.69it/s]
87%|βββββββββ | 2611/3000 [15:11<01:47, 3.61it/s]
87%|βββββββββ | 2612/3000 [15:12<01:46, 3.64it/s]
87%|βββββββββ | 2613/3000 [15:12<01:44, 3.69it/s]
Rank 0, Worker 3: Wait for shard 8 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
87%|βββββββββ | 2614/3000 [15:12<01:43, 3.72it/s]
87%|βββββββββ | 2615/3000 [15:12<01:43, 3.73it/s]
87%|βββββββββ | 2616/3000 [15:13<01:42, 3.74it/s]
87%|βββββββββ | 2617/3000 [15:13<01:42, 3.74it/s]
87%|βββββββββ | 2618/3000 [15:13<01:41, 3.77it/s]
87%|βββββββββ | 2619/3000 [15:14<01:40, 3.79it/s]
87%|βββββββββ | 2620/3000 [15:14<01:39, 3.83it/s]
{'loss': 0.0698, 'grad_norm': 0.4183996617794037, 'learning_rate': 4.3451724231958644e-06} |
|
87%|βββββββββ | 2620/3000 [15:14<01:39, 3.83it/s]
87%|βββββββββ | 2621/3000 [15:14<01:39, 3.80it/s]
87%|βββββββββ | 2622/3000 [15:14<01:38, 3.84it/s]
87%|βββββββββ | 2623/3000 [15:15<01:37, 3.86it/s]
87%|βββββββββ | 2624/3000 [15:15<01:38, 3.83it/s]
88%|βββββββββ | 2625/3000 [15:15<01:43, 3.63it/s]
88%|βββββββββ | 2626/3000 [15:15<01:40, 3.74it/s]
88%|βββββββββ | 2627/3000 [15:16<01:39, 3.75it/s]
88%|βββββββββ | 2628/3000 [15:16<01:39, 3.74it/s]
88%|βββββββββ | 2629/3000 [15:16<01:37, 3.80it/s]
Rank 0, Worker 1: Wait for shard 25 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
88%|βββββββββ | 2630/3000 [15:16<01:36, 3.84it/s]
{'loss': 0.0639, 'grad_norm': 0.4020463228225708, 'learning_rate': 4.123220219159418e-06} |
|
88%|βββββββββ | 2630/3000 [15:16<01:36, 3.84it/s]
88%|βββββββββ | 2631/3000 [15:17<01:37, 3.79it/s]
88%|βββββββββ | 2632/3000 [15:17<01:36, 3.82it/s]
88%|βββββββββ | 2633/3000 [15:17<01:36, 3.80it/s]
88%|βββββββββ | 2634/3000 [15:17<01:35, 3.82it/s]
88%|βββββββββ | 2635/3000 [15:18<01:36, 3.77it/s]
88%|βββββββββ | 2636/3000 [15:18<01:38, 3.70it/s]
Rank 0, Worker 2: Wait for shard 50 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
88%|βββββββββ | 2637/3000 [15:18<01:41, 3.59it/s]
88%|βββββββββ | 2638/3000 [15:19<01:38, 3.66it/s]
88%|βββββββββ | 2639/3000 [15:19<01:36, 3.74it/s]
88%|βββββββββ | 2640/3000 [15:19<01:35, 3.78it/s]
{'loss': 0.0599, 'grad_norm': 0.4873168468475342, 'learning_rate': 3.90684242057498e-06} |
|
88%|βββββββββ | 2640/3000 [15:19<01:35, 3.78it/s]
88%|βββββββββ | 2641/3000 [15:19<01:34, 3.78it/s]
88%|βββββββββ | 2642/3000 [15:20<01:33, 3.82it/s]
88%|βββββββββ | 2643/3000 [15:20<01:33, 3.80it/s]
88%|βββββββββ | 2644/3000 [15:20<01:33, 3.81it/s]
88%|βββββββββ | 2645/3000 [15:20<01:35, 3.71it/s]
88%|βββββββββ | 2646/3000 [15:21<01:33, 3.79it/s]
Rank 0, Worker 0: Wait for shard 51 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
88%|βββββββββ | 2647/3000 [15:21<01:32, 3.83it/s]
88%|βββββββββ | 2648/3000 [15:21<01:34, 3.71it/s]
88%|βββββββββ | 2649/3000 [15:21<01:33, 3.75it/s]
88%|βββββββββ | 2650/3000 [15:22<01:32, 3.80it/s]
{'loss': 0.0734, 'grad_norm': 0.4486771821975708, 'learning_rate': 3.696065319121833e-06} |
|
88%|βββββββββ | 2650/3000 [15:22<01:32, 3.80it/s]
88%|βββββββββ | 2651/3000 [15:22<01:32, 3.78it/s]
88%|βββββββββ | 2652/3000 [15:22<01:32, 3.76it/s]
88%|βββββββββ | 2653/3000 [15:23<01:31, 3.78it/s]
88%|βββββββββ | 2654/3000 [15:23<01:31, 3.77it/s]
88%|βββββββββ | 2655/3000 [15:23<01:31, 3.78it/s]
89%|βββββββββ | 2656/3000 [15:23<01:30, 3.80it/s]
Rank 0, Worker 4: Wait for shard 39 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
89%|βββββββββ | 2657/3000 [15:24<01:30, 3.79it/s]
89%|βββββββββ | 2658/3000 [15:24<01:30, 3.78it/s]
89%|βββββββββ | 2659/3000 [15:24<01:29, 3.80it/s]
89%|βββββββββ | 2660/3000 [15:24<01:28, 3.84it/s]
{'loss': 0.0799, 'grad_norm': 0.4313182532787323, 'learning_rate': 3.4909145259485744e-06} |
|
89%|βββββββββ | 2660/3000 [15:24<01:28, 3.84it/s]
89%|βββββββββ | 2661/3000 [15:25<01:29, 3.81it/s]
89%|βββββββββ | 2662/3000 [15:25<01:27, 3.85it/s]
89%|βββββββββ | 2663/3000 [15:25<01:27, 3.86it/s]
Rank 0, Worker 5: Wait for shard 54 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
89%|βββββββββ | 2664/3000 [15:25<01:27, 3.85it/s]
89%|βββββββββ | 2665/3000 [15:26<01:27, 3.82it/s]
89%|βββββββββ | 2666/3000 [15:26<01:28, 3.75it/s]
89%|βββββββββ | 2667/3000 [15:26<01:29, 3.71it/s]
89%|βββββββββ | 2668/3000 [15:27<01:37, 3.40it/s]
89%|βββββββββ | 2669/3000 [15:27<01:33, 3.53it/s]
89%|βββββββββ | 2670/3000 [15:27<01:32, 3.58it/s]
{'loss': 0.0728, 'grad_norm': 0.4147796034812927, 'learning_rate': 3.2914149685611073e-06} |
|
89%|βββββββββ | 2670/3000 [15:27<01:32, 3.58it/s]
89%|βββββββββ | 2671/3000 [15:27<01:30, 3.64it/s]
89%|βββββββββ | 2672/3000 [15:28<01:28, 3.70it/s]
89%|βββββββββ | 2673/3000 [15:28<01:27, 3.75it/s]
89%|βββββββββ | 2674/3000 [15:28<01:26, 3.76it/s]
89%|βββββββββ | 2675/3000 [15:28<01:27, 3.72it/s]
89%|βββββββββ | 2676/3000 [15:29<01:27, 3.68it/s]
89%|βββββββββ | 2677/3000 [15:29<01:35, 3.39it/s]
89%|βββββββββ | 2678/3000 [15:29<01:36, 3.32it/s]
89%|βββββββββ | 2679/3000 [15:30<01:33, 3.41it/s]
89%|βββββββββ | 2680/3000 [15:30<01:34, 3.40it/s]
{'loss': 0.0725, 'grad_norm': 0.4604618549346924, 'learning_rate': 3.0975908877938277e-06} |
|
89%|βββββββββ | 2680/3000 [15:30<01:34, 3.40it/s]
89%|βββββββββ | 2681/3000 [15:30<01:34, 3.38it/s]
89%|βββββββββ | 2682/3000 [15:31<01:31, 3.46it/s]
89%|βββββββββ | 2683/3000 [15:31<01:35, 3.31it/s]
89%|βββββββββ | 2684/3000 [15:31<01:32, 3.41it/s]
90%|βββββββββ | 2685/3000 [15:31<01:32, 3.42it/s]
90%|βββββββββ | 2686/3000 [15:32<01:31, 3.43it/s]
90%|βββββββββ | 2687/3000 [15:32<01:30, 3.44it/s]
90%|βββββββββ | 2688/3000 [15:32<01:29, 3.47it/s]
90%|βββββββββ | 2689/3000 [15:33<01:27, 3.55it/s]
90%|βββββββββ | 2690/3000 [15:33<01:26, 3.60it/s]
{'loss': 0.0697, 'grad_norm': 0.4549005627632141, 'learning_rate': 2.9094658348640945e-06} |
|
90%|βββββββββ | 2690/3000 [15:33<01:26, 3.60it/s]
90%|βββββββββ | 2691/3000 [15:33<01:26, 3.57it/s]
90%|βββββββββ | 2692/3000 [15:33<01:28, 3.47it/s]
90%|βββββββββ | 2693/3000 [15:34<01:29, 3.42it/s]
90%|βββββββββ | 2694/3000 [15:34<01:31, 3.33it/s]
90%|βββββββββ | 2695/3000 [15:34<01:35, 3.20it/s]
90%|βββββββββ | 2696/3000 [15:35<01:33, 3.26it/s]
90%|βββββββββ | 2697/3000 [15:35<01:33, 3.23it/s]
90%|βββββββββ | 2698/3000 [15:35<01:30, 3.34it/s]
90%|βββββββββ | 2699/3000 [15:36<01:30, 3.32it/s]
90%|βββββββββ | 2700/3000 [15:36<01:28, 3.39it/s]
{'loss': 0.0785, 'grad_norm': 0.3670569360256195, 'learning_rate': 2.7270626685105828e-06} |
|
90%|βββββββββ | 2700/3000 [15:36<01:28, 3.39it/s]
90%|βββββββββ | 2701/3000 [15:36<01:33, 3.20it/s]
90%|βββββββββ | 2702/3000 [15:36<01:29, 3.34it/s]
90%|βββββββββ | 2703/3000 [15:37<01:30, 3.29it/s]
90%|βββββββββ | 2704/3000 [15:37<01:26, 3.43it/s]
90%|βββββββββ | 2705/3000 [15:37<01:24, 3.48it/s]
90%|βββββββββ | 2706/3000 [15:38<01:22, 3.57it/s]
90%|βββββββββ | 2707/3000 [15:38<01:19, 3.68it/s]
90%|βββββββββ | 2708/3000 [15:38<01:19, 3.68it/s]
90%|βββββββββ | 2709/3000 [15:38<01:19, 3.66it/s]
90%|βββββββββ | 2710/3000 [15:39<01:19, 3.67it/s]
{'loss': 0.0715, 'grad_norm': 0.3622991442680359, 'learning_rate': 2.5504035522157854e-06} |
|
90%|βββββββββ | 2710/3000 [15:39<01:19, 3.67it/s]
90%|βββββββββ | 2711/3000 [15:39<01:18, 3.67it/s]
90%|βββββββββ | 2712/3000 [15:39<01:17, 3.71it/s]
90%|βββββββββ | 2713/3000 [15:39<01:19, 3.60it/s]
90%|βββββββββ | 2714/3000 [15:40<01:22, 3.47it/s]
90%|βββββββββ | 2715/3000 [15:40<01:19, 3.58it/s]
91%|βββββββββ | 2716/3000 [15:40<01:17, 3.64it/s]
91%|βββββββββ | 2717/3000 [15:41<01:16, 3.70it/s]
91%|βββββββββ | 2718/3000 [15:41<01:15, 3.73it/s]
91%|βββββββββ | 2719/3000 [15:41<01:16, 3.68it/s]
91%|βββββββββ | 2720/3000 [15:41<01:15, 3.71it/s]
{'loss': 0.0682, 'grad_norm': 0.42583581805229187, 'learning_rate': 2.379509951512937e-06} |
|
91%|βββββββββ | 2720/3000 [15:41<01:15, 3.71it/s]
91%|βββββββββ | 2721/3000 [15:42<01:16, 3.66it/s]
91%|βββββββββ | 2722/3000 [15:42<01:15, 3.70it/s]
91%|βββββββββ | 2723/3000 [15:42<01:14, 3.74it/s]
91%|βββββββββ | 2724/3000 [15:42<01:13, 3.78it/s]
91%|βββββββββ | 2725/3000 [15:43<01:12, 3.79it/s]
91%|βββββββββ | 2726/3000 [15:43<01:11, 3.84it/s]
91%|βββββββββ | 2727/3000 [15:43<01:12, 3.76it/s]
91%|βββββββββ | 2728/3000 [15:44<01:12, 3.75it/s]
91%|βββββββββ | 2729/3000 [15:44<01:16, 3.53it/s]
91%|βββββββββ | 2730/3000 [15:44<01:18, 3.43it/s]
{'loss': 0.0689, 'grad_norm': 0.3878172039985657, 'learning_rate': 2.214402631377782e-06} |
|
91%|βββββββββ | 2730/3000 [15:44<01:18, 3.43it/s]
91%|βββββββββ | 2731/3000 [15:44<01:15, 3.56it/s]
91%|βββββββββ | 2732/3000 [15:45<01:12, 3.67it/s]
91%|βββββββββ | 2733/3000 [15:45<01:11, 3.74it/s]
91%|βββββββββ | 2734/3000 [15:45<01:09, 3.81it/s]
91%|βββββββββ | 2735/3000 [15:45<01:11, 3.73it/s]
91%|βββββββββ | 2736/3000 [15:46<01:11, 3.70it/s]
91%|βββββββββ | 2737/3000 [15:46<01:10, 3.75it/s]
91%|ββββββββββ| 2738/3000 [15:46<01:09, 3.80it/s]
91%|ββββββββββ| 2739/3000 [15:46<01:08, 3.81it/s]
91%|ββββββββββ| 2740/3000 [15:47<01:07, 3.84it/s]
{'loss': 0.0714, 'grad_norm': 0.43555888533592224, 'learning_rate': 2.0551016537054493e-06} |
|
91%|ββββββββββ| 2740/3000 [15:47<01:07, 3.84it/s]
91%|ββββββββββ| 2741/3000 [15:47<01:07, 3.82it/s]
91%|ββββββββββ| 2742/3000 [15:47<01:06, 3.87it/s]
91%|ββββββββββ| 2743/3000 [15:48<01:12, 3.56it/s]
91%|ββββββββββ| 2744/3000 [15:48<01:11, 3.59it/s]
92%|ββββββββββ| 2745/3000 [15:48<01:10, 3.61it/s]
92%|ββββββββββ| 2746/3000 [15:48<01:08, 3.68it/s]
92%|ββββββββββ| 2747/3000 [15:49<01:09, 3.65it/s]
92%|ββββββββββ| 2748/3000 [15:49<01:08, 3.69it/s]
92%|ββββββββββ| 2749/3000 [15:49<01:08, 3.67it/s]
92%|ββββββββββ| 2750/3000 [15:49<01:07, 3.70it/s]
{'loss': 0.064, 'grad_norm': 0.41738301515579224, 'learning_rate': 1.9016263748728114e-06} |
|
92%|ββββββββββ| 2750/3000 [15:50<01:07, 3.70it/s]
92%|ββββββββββ| 2751/3000 [15:50<01:08, 3.65it/s]
92%|ββββββββββ| 2752/3000 [15:50<01:08, 3.64it/s]
92%|ββββββββββ| 2753/3000 [15:50<01:06, 3.70it/s]
92%|ββββββββββ| 2754/3000 [15:51<01:05, 3.76it/s]
92%|ββββββββββ| 2755/3000 [15:51<01:06, 3.67it/s]
92%|ββββββββββ| 2756/3000 [15:51<01:07, 3.63it/s]
92%|ββββββββββ| 2757/3000 [15:51<01:06, 3.64it/s]
92%|ββββββββββ| 2758/3000 [15:52<01:05, 3.72it/s]
92%|ββββββββββ| 2759/3000 [15:52<01:06, 3.63it/s]
92%|ββββββββββ| 2760/3000 [15:52<01:06, 3.59it/s]
{'loss': 0.066, 'grad_norm': 0.39067214727401733, 'learning_rate': 1.7539954433864858e-06} |
|
92%|ββββββββββ| 2760/3000 [15:52<01:06, 3.59it/s]
92%|ββββββββββ| 2761/3000 [15:52<01:05, 3.64it/s]
92%|ββββββββββ| 2762/3000 [15:53<01:05, 3.64it/s]
92%|ββββββββββ| 2763/3000 [15:53<01:05, 3.61it/s]
92%|ββββββββββ| 2764/3000 [15:53<01:03, 3.70it/s]
92%|ββββββββββ| 2765/3000 [15:54<01:03, 3.71it/s]
92%|ββββββββββ| 2766/3000 [15:54<01:03, 3.68it/s]
92%|ββββββββββ| 2767/3000 [15:54<01:03, 3.70it/s]
92%|ββββββββββ| 2768/3000 [15:54<01:01, 3.78it/s]
92%|ββββββββββ| 2769/3000 [15:55<01:02, 3.72it/s]
92%|ββββββββββ| 2770/3000 [15:55<01:06, 3.48it/s]
{'loss': 0.0643, 'grad_norm': 0.5217285752296448, 'learning_rate': 1.6122267976168781e-06} |
|
92%|ββββββββββ| 2770/3000 [15:55<01:06, 3.48it/s]
92%|ββββββββββ| 2771/3000 [15:55<01:06, 3.44it/s]
92%|ββββββββββ| 2772/3000 [15:56<01:04, 3.53it/s]
92%|ββββββββββ| 2773/3000 [15:56<01:03, 3.56it/s]
92%|ββββββββββ| 2774/3000 [15:56<01:04, 3.52it/s]
92%|ββββββββββ| 2775/3000 [15:56<01:04, 3.49it/s]
93%|ββββββββββ| 2776/3000 [15:57<01:05, 3.44it/s]
93%|ββββββββββ| 2777/3000 [15:57<01:05, 3.40it/s]
93%|ββββββββββ| 2778/3000 [15:57<01:04, 3.46it/s]
93%|ββββββββββ| 2779/3000 [15:58<01:03, 3.50it/s]
93%|ββββββββββ| 2780/3000 [15:58<01:08, 3.22it/s]
{'loss': 0.0701, 'grad_norm': 0.3757125437259674, 'learning_rate': 1.4763376636185599e-06} |
|
93%|ββββββββββ| 2780/3000 [15:58<01:08, 3.22it/s]
93%|ββββββββββ| 2781/3000 [15:58<01:09, 3.15it/s]
93%|ββββββββββ| 2782/3000 [15:59<01:05, 3.34it/s]
93%|ββββββββββ| 2783/3000 [15:59<01:02, 3.48it/s]
93%|ββββββββββ| 2784/3000 [15:59<01:00, 3.55it/s]
93%|ββββββββββ| 2785/3000 [15:59<01:01, 3.48it/s]
93%|ββββββββββ| 2786/3000 [16:00<01:03, 3.35it/s]
93%|ββββββββββ| 2787/3000 [16:00<01:01, 3.45it/s]
93%|ββββββββββ| 2788/3000 [16:00<01:00, 3.51it/s]
93%|ββββββββββ| 2789/3000 [16:01<01:01, 3.42it/s]
93%|ββββββββββ| 2790/3000 [16:01<00:59, 3.50it/s]
{'loss': 0.062, 'grad_norm': 0.4145689606666565, 'learning_rate': 1.3463445530371488e-06} |
|
93%|ββββββββββ| 2790/3000 [16:01<00:59, 3.50it/s]
93%|ββββββββββ| 2791/3000 [16:01<01:00, 3.47it/s]
93%|ββββββββββ| 2792/3000 [16:01<00:59, 3.51it/s]
93%|ββββββββββ| 2793/3000 [16:02<00:58, 3.52it/s]
93%|ββββββββββ| 2794/3000 [16:02<00:56, 3.62it/s]
93%|ββββββββββ| 2795/3000 [16:02<00:55, 3.70it/s]
93%|ββββββββββ| 2796/3000 [16:02<00:54, 3.73it/s]
93%|ββββββββββ| 2797/3000 [16:03<00:53, 3.76it/s]
93%|ββββββββββ| 2798/3000 [16:03<00:53, 3.81it/s]
93%|ββββββββββ| 2799/3000 [16:03<00:52, 3.86it/s]
93%|ββββββββββ| 2800/3000 [16:03<00:51, 3.85it/s]
{'loss': 0.0713, 'grad_norm': 0.4163986146450043, 'learning_rate': 1.222263261102985e-06} |
|
93%|ββββββββββ| 2800/3000 [16:03<00:51, 3.85it/s]
93%|ββββββββββ| 2801/3000 [16:04<00:51, 3.83it/s]
93%|ββββββββββ| 2802/3000 [16:04<00:51, 3.86it/s]
93%|ββββββββββ| 2803/3000 [16:04<00:50, 3.89it/s]
93%|ββββββββββ| 2804/3000 [16:04<00:50, 3.86it/s]
94%|ββββββββββ| 2805/3000 [16:05<00:50, 3.85it/s]
Rank 0, Worker 3: Wait for shard 41 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
94%|ββββββββββ| 2806/3000 [16:05<00:50, 3.86it/s]
94%|ββββββββββ| 2807/3000 [16:05<00:50, 3.85it/s]
94%|ββββββββββ| 2808/3000 [16:06<00:49, 3.88it/s]
94%|ββββββββββ| 2809/3000 [16:06<00:48, 3.91it/s]
94%|ββββββββββ| 2810/3000 [16:06<00:48, 3.93it/s]
{'loss': 0.0699, 'grad_norm': 0.3938809037208557, 'learning_rate': 1.1041088647119114e-06} |
|
94%|ββββββββββ| 2810/3000 [16:06<00:48, 3.93it/s]
94%|ββββββββββ| 2811/3000 [16:06<00:48, 3.89it/s]
94%|ββββββββββ| 2812/3000 [16:07<00:48, 3.86it/s]
94%|ββββββββββ| 2813/3000 [16:07<00:48, 3.88it/s]
94%|ββββββββββ| 2814/3000 [16:07<00:47, 3.90it/s]
94%|ββββββββββ| 2815/3000 [16:07<00:47, 3.90it/s]
Rank 0, Worker 1: Wait for shard 1 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 1: Caching shard... |
|
94%|ββββββββββ| 2816/3000 [16:08<00:47, 3.87it/s]
94%|ββββββββββ| 2817/3000 [16:08<00:46, 3.90it/s]
94%|ββββββββββ| 2818/3000 [16:08<00:46, 3.92it/s]
94%|ββββββββββ| 2819/3000 [16:08<00:46, 3.93it/s]
94%|ββββββββββ| 2820/3000 [16:09<00:47, 3.82it/s]
{'loss': 0.0608, 'grad_norm': 0.39454385638237, 'learning_rate': 9.918957205933e-07} |
|
94%|ββββββββββ| 2820/3000 [16:09<00:47, 3.82it/s]
94%|ββββββββββ| 2821/3000 [16:09<00:46, 3.82it/s]
94%|ββββββββββ| 2822/3000 [16:09<00:46, 3.81it/s]
Rank 0, Worker 2: Wait for shard 11 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 2: Caching shard... |
|
94%|ββββββββββ| 2823/3000 [16:09<00:46, 3.80it/s]
94%|ββββββββββ| 2824/3000 [16:10<00:46, 3.80it/s]
94%|ββββββββββ| 2825/3000 [16:10<00:45, 3.84it/s]
94%|ββββββββββ| 2826/3000 [16:10<00:45, 3.80it/s]
94%|ββββββββββ| 2827/3000 [16:10<00:45, 3.81it/s]
94%|ββββββββββ| 2828/3000 [16:11<00:44, 3.85it/s]
94%|ββββββββββ| 2829/3000 [16:11<00:44, 3.88it/s]
94%|ββββββββββ| 2830/3000 [16:11<00:42, 3.96it/s]
{'loss': 0.0678, 'grad_norm': 0.4002632200717926, 'learning_rate': 8.856374635655695e-07} |
|
94%|ββββββββββ| 2830/3000 [16:11<00:42, 3.96it/s]
94%|ββββββββββ| 2831/3000 [16:11<00:42, 3.93it/s]
94%|ββββββββββ| 2832/3000 [16:12<00:42, 3.91it/s]
94%|ββββββββββ| 2833/3000 [16:12<00:42, 3.93it/s]
94%|ββββββββββ| 2834/3000 [16:12<00:42, 3.95it/s]
94%|ββββββββββ| 2835/3000 [16:12<00:41, 3.98it/s]
95%|ββββββββββ| 2836/3000 [16:13<00:42, 3.84it/s]
95%|ββββββββββ| 2837/3000 [16:13<00:45, 3.57it/s]
95%|ββββββββββ| 2838/3000 [16:13<00:43, 3.68it/s]
Rank 0, Worker 0: Wait for shard 46 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 0: Caching shard... |
|
95%|ββββββββββ| 2839/3000 [16:14<00:42, 3.77it/s]
95%|ββββββββββ| 2840/3000 [16:14<00:42, 3.75it/s]
{'loss': 0.0689, 'grad_norm': 0.37098678946495056, 'learning_rate': 7.853470048794664e-07} |
|
95%|ββββββββββ| 2840/3000 [16:14<00:42, 3.75it/s]
95%|ββββββββββ| 2841/3000 [16:14<00:43, 3.65it/s]
95%|ββββββββββ| 2842/3000 [16:14<00:42, 3.75it/s]
Rank 0, Worker 4: Wait for shard 35 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 4: Caching shard... |
|
95%|ββββββββββ| 2843/3000 [16:15<00:41, 3.79it/s]
95%|ββββββββββ| 2844/3000 [16:15<00:41, 3.79it/s]
95%|ββββββββββ| 2845/3000 [16:15<00:40, 3.83it/s]
95%|ββββββββββ| 2846/3000 [16:15<00:39, 3.87it/s]
95%|ββββββββββ| 2847/3000 [16:16<00:39, 3.92it/s]
95%|ββββββββββ| 2848/3000 [16:16<00:40, 3.72it/s]
95%|ββββββββββ| 2849/3000 [16:16<00:39, 3.80it/s]
95%|ββββββββββ| 2850/3000 [16:16<00:39, 3.79it/s]
{'loss': 0.072, 'grad_norm': 0.3920189142227173, 'learning_rate': 6.910365306492416e-07} |
|
95%|ββββββββββ| 2850/3000 [16:17<00:39, 3.79it/s]
95%|ββββββββββ| 2851/3000 [16:17<00:39, 3.77it/s]
95%|ββββββββββ| 2852/3000 [16:17<00:38, 3.80it/s]
95%|ββββββββββ| 2853/3000 [16:17<00:38, 3.80it/s]
95%|ββββββββββ| 2854/3000 [16:18<00:38, 3.80it/s]
95%|ββββββββββ| 2855/3000 [16:18<00:38, 3.77it/s]
Rank 0, Worker 5: Wait for shard 57 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 5: Caching shard... |
|
95%|ββββββββββ| 2856/3000 [16:18<00:38, 3.74it/s]
95%|ββββββββββ| 2857/3000 [16:18<00:37, 3.76it/s]
95%|ββββββββββ| 2858/3000 [16:19<00:37, 3.74it/s]
95%|ββββββββββ| 2859/3000 [16:19<00:37, 3.72it/s]
95%|ββββββββββ| 2860/3000 [16:19<00:37, 3.76it/s]
{'loss': 0.0653, 'grad_norm': 0.43323296308517456, 'learning_rate': 6.027175003719354e-07} |
|
95%|ββββββββββ| 2860/3000 [16:19<00:37, 3.76it/s]
95%|ββββββββββ| 2861/3000 [16:19<00:36, 3.77it/s]
95%|ββββββββββ| 2862/3000 [16:20<00:37, 3.67it/s]
95%|ββββββββββ| 2863/3000 [16:20<00:37, 3.66it/s]
95%|ββββββββββ| 2864/3000 [16:20<00:36, 3.69it/s]
96%|ββββββββββ| 2865/3000 [16:20<00:36, 3.74it/s]
96%|ββββββββββ| 2866/3000 [16:21<00:35, 3.72it/s]
96%|ββββββββββ| 2867/3000 [16:21<00:35, 3.72it/s]
96%|ββββββββββ| 2868/3000 [16:21<00:35, 3.67it/s]
96%|ββββββββββ| 2869/3000 [16:22<00:38, 3.45it/s]
96%|ββββββββββ| 2870/3000 [16:22<00:36, 3.53it/s]
{'loss': 0.0652, 'grad_norm': 0.4373188316822052, 'learning_rate': 5.204006455349297e-07} |
|
96%|ββββββββββ| 2870/3000 [16:22<00:36, 3.53it/s]
96%|ββββββββββ| 2871/3000 [16:22<00:36, 3.52it/s]
96%|ββββββββββ| 2872/3000 [16:22<00:36, 3.55it/s]
96%|ββββββββββ| 2873/3000 [16:23<00:38, 3.30it/s]
96%|ββββββββββ| 2874/3000 [16:23<00:37, 3.36it/s]
96%|ββββββββββ| 2875/3000 [16:23<00:36, 3.44it/s]
96%|ββββββββββ| 2876/3000 [16:24<00:35, 3.51it/s]
96%|ββββββββββ| 2877/3000 [16:24<00:34, 3.54it/s]
96%|ββββββββββ| 2878/3000 [16:24<00:33, 3.62it/s]
96%|ββββββββββ| 2879/3000 [16:24<00:32, 3.69it/s]
96%|ββββββββββ| 2880/3000 [16:25<00:33, 3.60it/s]
{'loss': 0.0659, 'grad_norm': 0.3182922601699829, 'learning_rate': 4.440959683120194e-07} |
|
96%|ββββββββββ| 2880/3000 [16:25<00:33, 3.60it/s]
96%|ββββββββββ| 2881/3000 [16:25<00:33, 3.56it/s]
96%|ββββββββββ| 2882/3000 [16:25<00:32, 3.64it/s]
96%|ββββββββββ| 2883/3000 [16:26<00:31, 3.68it/s]
96%|ββββββββββ| 2884/3000 [16:26<00:32, 3.59it/s]
96%|ββββββββββ| 2885/3000 [16:26<00:31, 3.60it/s]
96%|ββββββββββ| 2886/3000 [16:26<00:31, 3.63it/s]
96%|ββββββββββ| 2887/3000 [16:27<00:30, 3.68it/s]
96%|ββββββββββ| 2888/3000 [16:27<00:30, 3.66it/s]
96%|ββββββββββ| 2889/3000 [16:27<00:31, 3.55it/s]
96%|ββββββββββ| 2890/3000 [16:28<00:30, 3.62it/s]
{'loss': 0.0674, 'grad_norm': 0.3474968373775482, 'learning_rate': 3.738127403480507e-07} |
|
96%|ββββββββββ| 2890/3000 [16:28<00:30, 3.62it/s]
96%|ββββββββββ| 2891/3000 [16:28<00:29, 3.65it/s]
96%|ββββββββββ| 2892/3000 [16:28<00:30, 3.58it/s]
96%|ββββββββββ| 2893/3000 [16:28<00:31, 3.42it/s]
96%|ββββββββββ| 2894/3000 [16:29<00:32, 3.30it/s]
96%|ββββββββββ| 2895/3000 [16:29<00:30, 3.45it/s]
97%|ββββββββββ| 2896/3000 [16:29<00:29, 3.51it/s]
97%|ββββββββββ| 2897/3000 [16:30<00:28, 3.57it/s]
97%|ββββββββββ| 2898/3000 [16:30<00:28, 3.62it/s]
97%|ββββββββββ| 2899/3000 [16:30<00:27, 3.65it/s]
97%|ββββββββββ| 2900/3000 [16:30<00:28, 3.54it/s]
{'loss': 0.0713, 'grad_norm': 0.4238067865371704, 'learning_rate': 3.095595016323394e-07} |
|
97%|ββββββββββ| 2900/3000 [16:30<00:28, 3.54it/s]
97%|ββββββββββ| 2901/3000 [16:31<00:28, 3.49it/s]
97%|ββββββββββ| 2902/3000 [16:31<00:27, 3.50it/s]
97%|ββββββββββ| 2903/3000 [16:31<00:28, 3.42it/s]
97%|ββββββββββ| 2904/3000 [16:32<00:27, 3.44it/s]
97%|ββββββββββ| 2905/3000 [16:32<00:27, 3.47it/s]
97%|ββββββββββ| 2906/3000 [16:32<00:26, 3.58it/s]
97%|ββββββββββ| 2907/3000 [16:32<00:26, 3.55it/s]
97%|ββββββββββ| 2908/3000 [16:33<00:25, 3.60it/s]
97%|ββββββββββ| 2909/3000 [16:33<00:25, 3.55it/s]
97%|ββββββββββ| 2910/3000 [16:33<00:26, 3.42it/s]
{'loss': 0.0911, 'grad_norm': 0.48984238505363464, 'learning_rate': 2.51344059460995e-07} |
|
97%|ββββββββββ| 2911/3000 [16:34<00:25, 3.46it/s]
97%|ββββββββββ| 2912/3000 [16:34<00:24, 3.55it/s]
97%|ββββββββββ| 2913/3000 [16:34<00:23, 3.66it/s]
97%|ββββββββββ| 2914/3000 [16:34<00:22, 3.74it/s]
97%|ββββββββββ| 2915/3000 [16:35<00:22, 3.75it/s]
97%|ββββββββββ| 2916/3000 [16:35<00:22, 3.76it/s]
97%|ββββββββββ| 2917/3000 [16:35<00:21, 3.82it/s]
97%|ββββββββββ| 2918/3000 [16:35<00:21, 3.85it/s]
97%|ββββββββββ| 2919/3000 [16:36<00:21, 3.82it/s]
97%|ββββββββββ| 2920/3000 [16:36<00:21, 3.80it/s]
{'loss': 0.0674, 'grad_norm': 0.44400227069854736, 'learning_rate': 1.9917348748826335e-07} |
|
97%|ββββββββββ| 2921/3000 [16:36<00:21, 3.76it/s]
97%|ββββββββββ| 2922/3000 [16:36<00:20, 3.77it/s]
97%|ββββββββββ| 2923/3000 [16:37<00:20, 3.78it/s]
97%|ββββββββββ| 2924/3000 [16:37<00:20, 3.78it/s]
98%|ββββββββββ| 2925/3000 [16:37<00:19, 3.85it/s]
98%|ββββββββββ| 2926/3000 [16:37<00:19, 3.87it/s]
98%|ββββββββββ| 2927/3000 [16:38<00:18, 3.85it/s]
98%|ββββββββββ| 2928/3000 [16:38<00:18, 3.89it/s]
98%|ββββββββββ| 2929/3000 [16:38<00:18, 3.93it/s]
98%|ββββββββββ| 2930/3000 [16:38<00:17, 3.92it/s]
{'loss': 0.0701, 'grad_norm': 0.3224441409111023, 'learning_rate': 1.5305412486702474e-07} |
|
98%|ββββββββββ| 2931/3000 [16:39<00:17, 3.88it/s]
98%|ββββββββββ| 2932/3000 [16:39<00:17, 3.84it/s]
98%|ββββββββββ| 2933/3000 [16:39<00:17, 3.90it/s]
98%|ββββββββββ| 2934/3000 [16:39<00:17, 3.87it/s]
98%|ββββββββββ| 2935/3000 [16:40<00:16, 3.83it/s]
98%|ββββββββββ| 2936/3000 [16:40<00:16, 3.86it/s]
98%|ββββββββββ| 2937/3000 [16:40<00:16, 3.85it/s]
98%|ββββββββββ| 2938/3000 [16:41<00:15, 3.89it/s]
98%|ββββββββββ| 2939/3000 [16:41<00:15, 3.94it/s]
98%|ββββββββββ| 2940/3000 [16:41<00:15, 3.95it/s]
{'loss': 0.0673, 'grad_norm': 0.32900160551071167, 'learning_rate': 1.1299157547854377e-07} |
|
98%|ββββββββββ| 2941/3000 [16:41<00:15, 3.92it/s]
98%|ββββββββββ| 2942/3000 [16:42<00:14, 3.93it/s]
98%|ββββββββββ| 2943/3000 [16:42<00:14, 3.95it/s]
98%|ββββββββββ| 2944/3000 [16:42<00:14, 3.91it/s]
98%|ββββββββββ| 2945/3000 [16:42<00:13, 3.96it/s]
98%|ββββββββββ| 2946/3000 [16:43<00:13, 4.02it/s]
98%|ββββββββββ| 2947/3000 [16:43<00:13, 3.97it/s]
98%|ββββββββββ| 2948/3000 [16:43<00:13, 3.95it/s]
98%|ββββββββββ| 2949/3000 [16:43<00:13, 3.91it/s]
98%|ββββββββββ| 2950/3000 [16:44<00:13, 3.82it/s]
{'loss': 0.0652, 'grad_norm': 0.3166285455226898, 'learning_rate': 7.899070725153613e-08} |
|
98%|ββββββββββ| 2951/3000 [16:44<00:14, 3.35it/s]
98%|ββββββββββ| 2952/3000 [16:44<00:13, 3.51it/s]
98%|ββββββββββ| 2953/3000 [16:44<00:13, 3.58it/s]
98%|ββββββββββ| 2954/3000 [16:45<00:12, 3.71it/s]
98%|ββββββββββ| 2955/3000 [16:45<00:11, 3.76it/s]
99%|ββββββββββ| 2956/3000 [16:45<00:11, 3.86it/s]
99%|ββββββββββ| 2957/3000 [16:45<00:11, 3.89it/s]
99%|ββββββββββ| 2958/3000 [16:46<00:10, 3.87it/s]
99%|ββββββββββ| 2959/3000 [16:46<00:10, 3.87it/s]
99%|ββββββββββ| 2960/3000 [16:46<00:10, 3.86it/s]
{'loss': 0.0817, 'grad_norm': 0.35446301102638245, 'learning_rate': 5.105565157068615e-08} |
|
99%|ββββββββββ| 2961/3000 [16:47<00:10, 3.83it/s]
99%|ββββββββββ| 2962/3000 [16:47<00:09, 3.85it/s]
99%|ββββββββββ| 2963/3000 [16:47<00:09, 3.90it/s]
99%|ββββββββββ| 2964/3000 [16:47<00:09, 3.89it/s]
99%|ββββββββββ| 2965/3000 [16:48<00:09, 3.79it/s]
99%|ββββββββββ| 2966/3000 [16:48<00:08, 3.84it/s]
99%|ββββββββββ| 2967/3000 [16:48<00:08, 3.86it/s]
99%|ββββββββββ| 2968/3000 [16:48<00:08, 3.88it/s]
99%|ββββββββββ| 2969/3000 [16:49<00:07, 3.88it/s]
99%|ββββββββββ| 2970/3000 [16:49<00:07, 3.92it/s]
{'loss': 0.0724, 'grad_norm': 0.31089484691619873, 'learning_rate': 2.9189802774631792e-08} |
|
99%|ββββββββββ| 2971/3000 [16:49<00:07, 3.90it/s]
99%|ββββββββββ| 2972/3000 [16:49<00:07, 3.91it/s]
99%|ββββββββββ| 2973/3000 [16:50<00:07, 3.78it/s]
99%|ββββββββββ| 2974/3000 [16:50<00:07, 3.67it/s]
99%|ββββββββββ| 2975/3000 [16:50<00:06, 3.75it/s]
99%|ββββββββββ| 2976/3000 [16:50<00:06, 3.70it/s]
99%|ββββββββββ| 2977/3000 [16:51<00:06, 3.79it/s]
99%|ββββββββββ| 2978/3000 [16:51<00:05, 3.83it/s]
99%|ββββββββββ| 2979/3000 [16:51<00:05, 3.82it/s]
99%|ββββββββββ| 2980/3000 [16:51<00:05, 3.82it/s]
{'loss': 0.0671, 'grad_norm': 0.36768633127212524, 'learning_rate': 1.3395817743561134e-08} |
|
99%|ββββββββββ| 2980/3000 [16:52<00:05, 3.82it/s]
99%|ββββββββββ| 2981/3000 [16:52<00:04, 3.84it/s]
99%|ββββββββββ| 2982/3000 [16:52<00:04, 3.80it/s]
99%|ββββββββββ| 2983/3000 [16:52<00:04, 3.69it/s]
99%|ββββββββββ| 2984/3000 [16:53<00:04, 3.64it/s]
100%|ββββββββββ| 2985/3000 [16:53<00:04, 3.59it/s]
100%|ββββββββββ| 2986/3000 [16:53<00:03, 3.68it/s]
100%|ββββββββββ| 2987/3000 [16:53<00:03, 3.50it/s]
100%|ββββββββββ| 2988/3000 [16:54<00:03, 3.54it/s]
100%|ββββββββββ| 2989/3000 [16:54<00:03, 3.63it/s]
100%|ββββββββββ| 2990/3000 [16:54<00:02, 3.68it/s]
{'loss': 0.0633, 'grad_norm': 0.3786405026912689, 'learning_rate': 3.6756155763373323e-09} |
|
100%|ββββββββββ| 2991/3000 [16:55<00:02, 3.69it/s]
100%|ββββββββββ| 2992/3000 [16:55<00:02, 3.61it/s]
100%|ββββββββββ| 2993/3000 [16:55<00:01, 3.66it/s]
100%|ββββββββββ| 2994/3000 [16:55<00:01, 3.69it/s]
100%|ββββββββββ| 2995/3000 [16:56<00:01, 3.75it/s]
100%|ββββββββββ| 2996/3000 [16:56<00:01, 3.81it/s]
100%|ββββββββββ| 2997/3000 [16:56<00:00, 3.81it/s]
| Rank 0, Worker 3: Wait for shard 0 in dataset 0 in 0.00 seconds |
| Rank 0, Worker 3: Caching shard... |
|
100%|ββββββββββ| 2998/3000 [16:56<00:00, 3.81it/s]
100%|ββββββββββ| 2999/3000 [16:57<00:00, 3.81it/s]
100%|ββββββββββ| 3000/3000 [16:57<00:00, 3.84it/s]
{'loss': 0.0776, 'grad_norm': 0.4386350214481354, 'learning_rate': 3.037735734623404e-11} |
|
100%|ββββββββββ| 3000/3000 [16:57<00:00, 3.84it/s]
| Copying experiment config directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/experiment_cfg to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-3000/experiment_cfg |
| Copying processor directory /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/processor to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-3000 |
| Copying wandb_config.json from /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/wandb_config.json to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0/checkpoint-3000/wandb_config.json |
|
{'train_runtime': 1054.5211, 'train_samples_per_second': 91.037, 'train_steps_per_second': 2.845, 'train_loss': 0.2204543206691742} |
|
100%|ββββββββββ| 3000/3000 [17:34<00:00, 3.84it/s]
100%|ββββββββββ| 3000/3000 [17:34<00:00, 2.84it/s] |
| 05/26/2026 21:52:08 - INFO - Model saved to /home/ubuntu/groot-files/checkpoints/g1_finetune-20260526-213350-gpu0 |
| 05/26/2026 21:52:08 - INFO - Training completed! |
| wandb: |
| wandb: 🚀 View run g1_finetune-20260526-213350-gpu0 at: |
| wandb: Find logs at: wandb/run-20260526_213400-zw5ihkoo/logs |
|
|