Upload pretraining_20260507_063420.log
GPU_Run_Checkpoints/final_c2_muon_bs512_lr12_seed3_mix3to1/pretraining_20260507_063420.log
ADDED
@@ -0,0 +1,547 @@
2026-05-07 06:34:20,302 | INFO | === Raw Model ===
GPT(
  (transformer): ModuleDict(
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-17): 18 x Block(
        (ln_1): RMSNorm()
        (attn): CausalSelfAttention(
          (rotary): RotaryEmbedding()
          (q_proj): Linear(in_features=320, out_features=320, bias=False)
          (k_proj): Linear(in_features=320, out_features=64, bias=False)
          (v_proj): Linear(in_features=320, out_features=64, bias=False)
          (c_proj): Linear(in_features=320, out_features=320, bias=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=320, out_features=2048, bias=False)
          (c_proj): Linear(in_features=1024, out_features=320, bias=False)
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (wte): Embedding(50304, 320)
  )
  (lm_head): Linear(in_features=320, out_features=50304, bias=False)
)
|
| 29 |
+
|
| 30 |
+
=== Forward Summary (torchinfo, uncompiled model) ===
|
| 31 |
+
====================================================================================================
|
| 32 |
+
Layer (type:depth-idx) Output Shape Param #
|
| 33 |
+
====================================================================================================
|
| 34 |
+
GPT [1, 1, 50304] --
|
| 35 |
+
├─ModuleDict: 1-1 -- --
|
| 36 |
+
│ └─Embedding: 2-1 [1, 1024, 320] 16,097,280
|
| 37 |
+
│ └─Dropout: 2-2 [1, 1024, 320] --
|
| 38 |
+
│ └─ModuleList: 2-3 -- --
|
| 39 |
+
│ │ └─Block: 3-1 [1, 1024, 320] --
|
| 40 |
+
│ │ │ └─RMSNorm: 4-1 [1, 1024, 320] 320
|
| 41 |
+
│ │ │ └─CausalSelfAttention: 4-2 [1, 1024, 320] --
|
| 42 |
+
│ │ │ │ └─Linear: 5-1 [1, 1024, 320] 102,400
|
| 43 |
+
│ │ │ │ └─Linear: 5-2 [1, 1024, 64] 20,480
|
| 44 |
+
│ │ │ │ └─Linear: 5-3 [1, 1024, 64] 20,480
|
| 45 |
+
│ │ │ │ └─RotaryEmbedding: 5-4 [1, 1, 1024, 64] --
|
| 46 |
+
│ │ │ │ └─Linear: 5-5 [1, 1024, 320] 102,400
|
| 47 |
+
│ │ │ │ └─Dropout: 5-6 [1, 1024, 320] --
|
| 48 |
+
│ │ │ └─RMSNorm: 4-3 [1, 1024, 320] 320
|
| 49 |
+
│ │ │ └─MLP: 4-4 [1, 1024, 320] --
|
| 50 |
+
│ │ │ │ └─Linear: 5-7 [1, 1024, 2048] 655,360
|
| 51 |
+
│ │ │ │ └─Linear: 5-8 [1, 1024, 320] 327,680
|
| 52 |
+
│ │ │ │ └─Dropout: 5-9 [1, 1024, 320] --
|
| 53 |
+
│ │ └─Block: 3-2 [1, 1024, 320] --
|
| 54 |
+
│ │ │ └─RMSNorm: 4-5 [1, 1024, 320] 320
|
| 55 |
+
│ │ │ └─CausalSelfAttention: 4-6 [1, 1024, 320] --
|
| 56 |
+
│ │ │ │ └─Linear: 5-10 [1, 1024, 320] 102,400
|
| 57 |
+
│ │ │ │ └─Linear: 5-11 [1, 1024, 64] 20,480
|
| 58 |
+
│ │ │ │ └─Linear: 5-12 [1, 1024, 64] 20,480
|
| 59 |
+
│ │ │ │ └─RotaryEmbedding: 5-13 [1, 1, 1024, 64] --
|
| 60 |
+
│ │ │ │ └─Linear: 5-14 [1, 1024, 320] 102,400
|
| 61 |
+
│ │ │ │ └─Dropout: 5-15 [1, 1024, 320] --
|
| 62 |
+
│ │ │ └─RMSNorm: 4-7 [1, 1024, 320] 320
|
| 63 |
+
│ │ │ └─MLP: 4-8 [1, 1024, 320] --
|
| 64 |
+
│ │ │ │ └─Linear: 5-16 [1, 1024, 2048] 655,360
|
| 65 |
+
│ │ │ │ └─Linear: 5-17 [1, 1024, 320] 327,680
|
| 66 |
+
│ │ │ │ └─Dropout: 5-18 [1, 1024, 320] --
|
| 67 |
+
│ │ └─Block: 3-3 [1, 1024, 320] --
|
| 68 |
+
│ │ │ └─RMSNorm: 4-9 [1, 1024, 320] 320
|
| 69 |
+
│ │ │ └─CausalSelfAttention: 4-10 [1, 1024, 320] --
|
| 70 |
+
│ │ │ │ └─Linear: 5-19 [1, 1024, 320] 102,400
|
| 71 |
+
│ │ │ │ └─Linear: 5-20 [1, 1024, 64] 20,480
|
| 72 |
+
│ │ │ │ └─Linear: 5-21 [1, 1024, 64] 20,480
|
| 73 |
+
│ │ │ │ └─RotaryEmbedding: 5-22 [1, 1, 1024, 64] --
|
| 74 |
+
│ │ │ │ └─Linear: 5-23 [1, 1024, 320] 102,400
|
| 75 |
+
│ │ │ │ └─Dropout: 5-24 [1, 1024, 320] --
|
| 76 |
+
│ │ │ └─RMSNorm: 4-11 [1, 1024, 320] 320
|
| 77 |
+
│ │ │ └─MLP: 4-12 [1, 1024, 320] --
|
| 78 |
+
│ │ │ │ └─Linear: 5-25 [1, 1024, 2048] 655,360
|
| 79 |
+
│ │ │ │ └─Linear: 5-26 [1, 1024, 320] 327,680
|
| 80 |
+
│ │ │ │ └─Dropout: 5-27 [1, 1024, 320] --
|
| 81 |
+
│ │ └─Block: 3-4 [1, 1024, 320] --
|
| 82 |
+
│ │ │ └─RMSNorm: 4-13 [1, 1024, 320] 320
|
| 83 |
+
│ │ │ └─CausalSelfAttention: 4-14 [1, 1024, 320] --
|
| 84 |
+
│ │ │ │ └─Linear: 5-28 [1, 1024, 320] 102,400
|
| 85 |
+
│ │ │ │ └─Linear: 5-29 [1, 1024, 64] 20,480
|
| 86 |
+
│ │ │ │ └─Linear: 5-30 [1, 1024, 64] 20,480
|
| 87 |
+
│ │ │ │ └─RotaryEmbedding: 5-31 [1, 1, 1024, 64] --
|
| 88 |
+
│ │ │ │ └─Linear: 5-32 [1, 1024, 320] 102,400
|
| 89 |
+
│ │ │ │ └─Dropout: 5-33 [1, 1024, 320] --
|
| 90 |
+
│ │ │ └─RMSNorm: 4-15 [1, 1024, 320] 320
|
| 91 |
+
│ │ │ └─MLP: 4-16 [1, 1024, 320] --
|
| 92 |
+
│ │ │ │ └─Linear: 5-34 [1, 1024, 2048] 655,360
|
| 93 |
+
│ │ │ │ └─Linear: 5-35 [1, 1024, 320] 327,680
|
| 94 |
+
│ │ │ │ └─Dropout: 5-36 [1, 1024, 320] --
|
| 95 |
+
│ │ └─Block: 3-5 [1, 1024, 320] --
|
| 96 |
+
│ │ │ └─RMSNorm: 4-17 [1, 1024, 320] 320
|
| 97 |
+
│ │ │ └─CausalSelfAttention: 4-18 [1, 1024, 320] --
|
| 98 |
+
│ │ │ │ └─Linear: 5-37 [1, 1024, 320] 102,400
|
| 99 |
+
│ │ │ │ └─Linear: 5-38 [1, 1024, 64] 20,480
|
| 100 |
+
│ │ │ │ └─Linear: 5-39 [1, 1024, 64] 20,480
|
| 101 |
+
│ │ │ │ └─RotaryEmbedding: 5-40 [1, 1, 1024, 64] --
|
| 102 |
+
│ │ │ │ └─Linear: 5-41 [1, 1024, 320] 102,400
|
| 103 |
+
│ │ │ │ └─Dropout: 5-42 [1, 1024, 320] --
|
| 104 |
+
│ │ │ └─RMSNorm: 4-19 [1, 1024, 320] 320
|
| 105 |
+
│ │ │ └─MLP: 4-20 [1, 1024, 320] --
|
| 106 |
+
│ │ │ │ └─Linear: 5-43 [1, 1024, 2048] 655,360
|
| 107 |
+
│ │ │ │ └─Linear: 5-44 [1, 1024, 320] 327,680
|
| 108 |
+
│ │ │ │ └─Dropout: 5-45 [1, 1024, 320] --
|
| 109 |
+
│ │ └─Block: 3-6 [1, 1024, 320] --
|
| 110 |
+
│ │ │ └─RMSNorm: 4-21 [1, 1024, 320] 320
|
| 111 |
+
│ │ │ └─CausalSelfAttention: 4-22 [1, 1024, 320] --
|
| 112 |
+
│ │ │ │ └─Linear: 5-46 [1, 1024, 320] 102,400
|
| 113 |
+
│ │ │ │ └─Linear: 5-47 [1, 1024, 64] 20,480
|
| 114 |
+
│ │ │ │ └─Linear: 5-48 [1, 1024, 64] 20,480
|
| 115 |
+
│ │ │ │ └─RotaryEmbedding: 5-49 [1, 1, 1024, 64] --
|
| 116 |
+
│ │ │ │ └─Linear: 5-50 [1, 1024, 320] 102,400
|
| 117 |
+
│ │ │ │ └─Dropout: 5-51 [1, 1024, 320] --
|
| 118 |
+
│ │ │ └─RMSNorm: 4-23 [1, 1024, 320] 320
|
| 119 |
+
│ │ │ └─MLP: 4-24 [1, 1024, 320] --
|
| 120 |
+
│ │ │ │ └─Linear: 5-52 [1, 1024, 2048] 655,360
|
| 121 |
+
│ │ │ │ └─Linear: 5-53 [1, 1024, 320] 327,680
|
| 122 |
+
│ │ │ │ └─Dropout: 5-54 [1, 1024, 320] --
|
| 123 |
+
│ │ └─Block: 3-7 [1, 1024, 320] --
|
| 124 |
+
│ │ │ └─RMSNorm: 4-25 [1, 1024, 320] 320
|
| 125 |
+
│ │ │ └─CausalSelfAttention: 4-26 [1, 1024, 320] --
|
| 126 |
+
│ │ │ │ └─Linear: 5-55 [1, 1024, 320] 102,400
|
| 127 |
+
│ │ │ │ └─Linear: 5-56 [1, 1024, 64] 20,480
|
| 128 |
+
│ │ │ │ └─Linear: 5-57 [1, 1024, 64] 20,480
|
| 129 |
+
│ │ │ │ └─RotaryEmbedding: 5-58 [1, 1, 1024, 64] --
|
| 130 |
+
│ │ │ │ └─Linear: 5-59 [1, 1024, 320] 102,400
|
| 131 |
+
│ │ │ │ └─Dropout: 5-60 [1, 1024, 320] --
|
| 132 |
+
│ │ │ └─RMSNorm: 4-27 [1, 1024, 320] 320
|
| 133 |
+
│ │ │ └─MLP: 4-28 [1, 1024, 320] --
|
| 134 |
+
│ │ │ │ └─Linear: 5-61 [1, 1024, 2048] 655,360
|
| 135 |
+
│ │ │ │ └─Linear: 5-62 [1, 1024, 320] 327,680
|
| 136 |
+
│ │ │ │ └─Dropout: 5-63 [1, 1024, 320] --
|
| 137 |
+
│ │ └─Block: 3-8 [1, 1024, 320] --
|
| 138 |
+
│ │ │ └─RMSNorm: 4-29 [1, 1024, 320] 320
|
| 139 |
+
│ │ │ └─CausalSelfAttention: 4-30 [1, 1024, 320] --
|
| 140 |
+
│ │ │ │ └─Linear: 5-64 [1, 1024, 320] 102,400
|
| 141 |
+
│ │ │ │ └─Linear: 5-65 [1, 1024, 64] 20,480
|
| 142 |
+
│ │ │ │ └─Linear: 5-66 [1, 1024, 64] 20,480
|
| 143 |
+
│ │ │ │ └─RotaryEmbedding: 5-67 [1, 1, 1024, 64] --
|
| 144 |
+
│ │ │ │ └─Linear: 5-68 [1, 1024, 320] 102,400
|
| 145 |
+
│ │ │ │ └─Dropout: 5-69 [1, 1024, 320] --
|
| 146 |
+
│ │ │ └─RMSNorm: 4-31 [1, 1024, 320] 320
|
| 147 |
+
│ │ │ └─MLP: 4-32 [1, 1024, 320] --
|
| 148 |
+
│ │ │ │ └─Linear: 5-70 [1, 1024, 2048] 655,360
|
| 149 |
+
│ │ │ │ └─Linear: 5-71 [1, 1024, 320] 327,680
|
| 150 |
+
│ │ │ │ └─Dropout: 5-72 [1, 1024, 320] --
|
| 151 |
+
│ │ └─Block: 3-9 [1, 1024, 320] --
|
| 152 |
+
│ │ │ └─RMSNorm: 4-33 [1, 1024, 320] 320
|
| 153 |
+
│ │ │ └─CausalSelfAttention: 4-34 [1, 1024, 320] --
|
| 154 |
+
│ │ │ │ └─Linear: 5-73 [1, 1024, 320] 102,400
|
| 155 |
+
│ │ │ │ └─Linear: 5-74 [1, 1024, 64] 20,480
|
| 156 |
+
│ │ │ │ └─Linear: 5-75 [1, 1024, 64] 20,480
|
| 157 |
+
│ │ │ │ └─RotaryEmbedding: 5-76 [1, 1, 1024, 64] --
|
| 158 |
+
│ │ │ │ └─Linear: 5-77 [1, 1024, 320] 102,400
|
| 159 |
+
│ │ │ │ └─Dropout: 5-78 [1, 1024, 320] --
|
| 160 |
+
│ │ │ └─RMSNorm: 4-35 [1, 1024, 320] 320
|
| 161 |
+
│ │ │ └─MLP: 4-36 [1, 1024, 320] --
|
| 162 |
+
│ │ │ │ └─Linear: 5-79 [1, 1024, 2048] 655,360
|
| 163 |
+
│ │ │ │ └─Linear: 5-80 [1, 1024, 320] 327,680
|
| 164 |
+
│ │ │ │ └─Dropout: 5-81 [1, 1024, 320] --
|
| 165 |
+
│ │ └─Block: 3-10 [1, 1024, 320] --
|
| 166 |
+
│ │ │ └─RMSNorm: 4-37 [1, 1024, 320] 320
|
| 167 |
+
│ │ │ └─CausalSelfAttention: 4-38 [1, 1024, 320] --
|
| 168 |
+
│ │ │ │ └─Linear: 5-82 [1, 1024, 320] 102,400
|
| 169 |
+
│ │ │ │ └─Linear: 5-83 [1, 1024, 64] 20,480
|
| 170 |
+
│ │ │ │ └─Linear: 5-84 [1, 1024, 64] 20,480
|
| 171 |
+
│ │ │ │ └─RotaryEmbedding: 5-85 [1, 1, 1024, 64] --
|
| 172 |
+
│ │ │ │ └─Linear: 5-86 [1, 1024, 320] 102,400
|
| 173 |
+
│ │ │ │ └─Dropout: 5-87 [1, 1024, 320] --
|
| 174 |
+
│ │ │ └─RMSNorm: 4-39 [1, 1024, 320] 320
|
| 175 |
+
│ │ │ └─MLP: 4-40 [1, 1024, 320] --
|
| 176 |
+
│ │ │ │ └─Linear: 5-88 [1, 1024, 2048] 655,360
|
| 177 |
+
│ │ │ │ └─Linear: 5-89 [1, 1024, 320] 327,680
|
| 178 |
+
│ │ │ │ └─Dropout: 5-90 [1, 1024, 320] --
|
| 179 |
+
│ │ └─Block: 3-11 [1, 1024, 320] --
|
| 180 |
+
│ │ │ └─RMSNorm: 4-41 [1, 1024, 320] 320
|
| 181 |
+
│ │ │ └─CausalSelfAttention: 4-42 [1, 1024, 320] --
|
| 182 |
+
│ │ │ │ └─Linear: 5-91 [1, 1024, 320] 102,400
|
| 183 |
+
│ │ │ │ └─Linear: 5-92 [1, 1024, 64] 20,480
|
| 184 |
+
│ │ │ │ └─Linear: 5-93 [1, 1024, 64] 20,480
|
| 185 |
+
│ │ │ │ └─RotaryEmbedding: 5-94 [1, 1, 1024, 64] --
|
| 186 |
+
│ │ │ │ └─Linear: 5-95 [1, 1024, 320] 102,400
|
| 187 |
+
│ │ │ │ └─Dropout: 5-96 [1, 1024, 320] --
|
| 188 |
+
│ │ │ └─RMSNorm: 4-43 [1, 1024, 320] 320
|
| 189 |
+
│ │ │ └─MLP: 4-44 [1, 1024, 320] --
|
| 190 |
+
│ │ │ │ └─Linear: 5-97 [1, 1024, 2048] 655,360
|
| 191 |
+
│ │ │ │ └─Linear: 5-98 [1, 1024, 320] 327,680
|
| 192 |
+
│ │ │ │ └─Dropout: 5-99 [1, 1024, 320] --
|
| 193 |
+
│ │ └─Block: 3-12 [1, 1024, 320] --
|
| 194 |
+
│ │ │ └─RMSNorm: 4-45 [1, 1024, 320] 320
|
| 195 |
+
│ │ │ └─CausalSelfAttention: 4-46 [1, 1024, 320] --
|
| 196 |
+
│ │ │ │ └─Linear: 5-100 [1, 1024, 320] 102,400
|
| 197 |
+
│ │ │ │ └─Linear: 5-101 [1, 1024, 64] 20,480
|
| 198 |
+
│ │ │ │ └─Linear: 5-102 [1, 1024, 64] 20,480
|
| 199 |
+
│ │ │ │ └─RotaryEmbedding: 5-103 [1, 1, 1024, 64] --
|
| 200 |
+
│ │ │ │ └─Linear: 5-104 [1, 1024, 320] 102,400
|
| 201 |
+
│ │ │ │ └─Dropout: 5-105 [1, 1024, 320] --
|
| 202 |
+
│ │ │ └─RMSNorm: 4-47 [1, 1024, 320] 320
|
| 203 |
+
│ │ │ └─MLP: 4-48 [1, 1024, 320] --
|
| 204 |
+
│ │ │ │ └─Linear: 5-106 [1, 1024, 2048] 655,360
|
| 205 |
+
│ │ │ │ └─Linear: 5-107 [1, 1024, 320] 327,680
|
| 206 |
+
│ │ │ │ └─Dropout: 5-108 [1, 1024, 320] --
|
| 207 |
+
│ │ └─Block: 3-13 [1, 1024, 320] --
|
| 208 |
+
│ │ │ └─RMSNorm: 4-49 [1, 1024, 320] 320
|
| 209 |
+
│ │ │ └─CausalSelfAttention: 4-50 [1, 1024, 320] --
|
| 210 |
+
│ │ │ │ └─Linear: 5-109 [1, 1024, 320] 102,400
|
| 211 |
+
│ │ │ │ └─Linear: 5-110 [1, 1024, 64] 20,480
|
| 212 |
+
│ │ │ │ └─Linear: 5-111 [1, 1024, 64] 20,480
|
| 213 |
+
│ │ │ │ └─RotaryEmbedding: 5-112 [1, 1, 1024, 64] --
|
| 214 |
+
│ │ │ │ └─Linear: 5-113 [1, 1024, 320] 102,400
|
| 215 |
+
│ │ │ │ └─Dropout: 5-114 [1, 1024, 320] --
|
| 216 |
+
│ │ │ └─RMSNorm: 4-51 [1, 1024, 320] 320
|
| 217 |
+
│ │ │ └─MLP: 4-52 [1, 1024, 320] --
|
| 218 |
+
│ │ │ │ └─Linear: 5-115 [1, 1024, 2048] 655,360
|
| 219 |
+
│ │ │ │ └─Linear: 5-116 [1, 1024, 320] 327,680
|
| 220 |
+
│ │ │ │ └─Dropout: 5-117 [1, 1024, 320] --
|
| 221 |
+
│ │ └─Block: 3-14 [1, 1024, 320] --
|
| 222 |
+
│ │ │ └─RMSNorm: 4-53 [1, 1024, 320] 320
|
| 223 |
+
│ │ │ └─CausalSelfAttention: 4-54 [1, 1024, 320] --
|
| 224 |
+
│ │ │ │ └─Linear: 5-118 [1, 1024, 320] 102,400
|
| 225 |
+
│ │ │ │ └─Linear: 5-119 [1, 1024, 64] 20,480
|
| 226 |
+
│ │ │ │ └─Linear: 5-120 [1, 1024, 64] 20,480
|
| 227 |
+
│ │ │ │ └─RotaryEmbedding: 5-121 [1, 1, 1024, 64] --
|
| 228 |
+
│ │ │ │ └─Linear: 5-122 [1, 1024, 320] 102,400
|
| 229 |
+
│ │ │ │ └─Dropout: 5-123 [1, 1024, 320] --
|
| 230 |
+
│ │ │ └─RMSNorm: 4-55 [1, 1024, 320] 320
|
| 231 |
+
│ │ │ └─MLP: 4-56 [1, 1024, 320] --
|
| 232 |
+
│ │ │ │ └─Linear: 5-124 [1, 1024, 2048] 655,360
|
| 233 |
+
│ │ │ │ └─Linear: 5-125 [1, 1024, 320] 327,680
|
| 234 |
+
│ │ │ │ └─Dropout: 5-126 [1, 1024, 320] --
|
| 235 |
+
│ │ └─Block: 3-15 [1, 1024, 320] --
|
| 236 |
+
│ │ │ └─RMSNorm: 4-57 [1, 1024, 320] 320
|
| 237 |
+
│ │ │ └─CausalSelfAttention: 4-58 [1, 1024, 320] --
|
| 238 |
+
│ │ │ │ └─Linear: 5-127 [1, 1024, 320] 102,400
|
| 239 |
+
│ │ │ │ └─Linear: 5-128 [1, 1024, 64] 20,480
|
| 240 |
+
│ │ │ │ └─Linear: 5-129 [1, 1024, 64] 20,480
|
| 241 |
+
│ │ │ │ └─RotaryEmbedding: 5-130 [1, 1, 1024, 64] --
|
| 242 |
+
│ │ │ │ └─Linear: 5-131 [1, 1024, 320] 102,400
|
| 243 |
+
│ │ │ │ └─Dropout: 5-132 [1, 1024, 320] --
|
| 244 |
+
│ │ │ └─RMSNorm: 4-59 [1, 1024, 320] 320
|
| 245 |
+
│ │ │ └─MLP: 4-60 [1, 1024, 320] --
|
| 246 |
+
│ │ │ │ └─Linear: 5-133 [1, 1024, 2048] 655,360
|
| 247 |
+
│ │ │ │ └─Linear: 5-134 [1, 1024, 320] 327,680
|
| 248 |
+
│ │ │ │ └─Dropout: 5-135 [1, 1024, 320] --
|
| 249 |
+
│ │ └─Block: 3-16 [1, 1024, 320] --
|
| 250 |
+
│ │ │ └─RMSNorm: 4-61 [1, 1024, 320] 320
|
| 251 |
+
│ │ │ └─CausalSelfAttention: 4-62 [1, 1024, 320] --
|
| 252 |
+
│ │ │ │ └─Linear: 5-136 [1, 1024, 320] 102,400
|
| 253 |
+
│ │ │ │ └─Linear: 5-137 [1, 1024, 64] 20,480
|
| 254 |
+
│ │ │ │ └─Linear: 5-138 [1, 1024, 64] 20,480
|
| 255 |
+
│ │ │ │ └─RotaryEmbedding: 5-139 [1, 1, 1024, 64] --
|
| 256 |
+
│ │ │ │ └─Linear: 5-140 [1, 1024, 320] 102,400
|
| 257 |
+
│ │ │ │ └─Dropout: 5-141 [1, 1024, 320] --
|
| 258 |
+
│ │ │ └─RMSNorm: 4-63 [1, 1024, 320] 320
|
| 259 |
+
│ │ │ └─MLP: 4-64 [1, 1024, 320] --
|
| 260 |
+
│ │ │ │ └─Linear: 5-142 [1, 1024, 2048] 655,360
|
| 261 |
+
│ │ │ │ └─Linear: 5-143 [1, 1024, 320] 327,680
|
| 262 |
+
│ │ │ │ └─Dropout: 5-144 [1, 1024, 320] --
|
| 263 |
+
│ │ └─Block: 3-17 [1, 1024, 320] --
|
| 264 |
+
│ │ │ └─RMSNorm: 4-65 [1, 1024, 320] 320
|
| 265 |
+
│ │ │ └─CausalSelfAttention: 4-66 [1, 1024, 320] --
|
| 266 |
+
│ │ │ │ └─Linear: 5-145 [1, 1024, 320] 102,400
|
| 267 |
+
│ │ │ │ └─Linear: 5-146 [1, 1024, 64] 20,480
|
| 268 |
+
│ │ │ │ └─Linear: 5-147 [1, 1024, 64] 20,480
|
| 269 |
+
│ │ │ │ └─RotaryEmbedding: 5-148 [1, 1, 1024, 64] --
|
| 270 |
+
│ │ │ │ └─Linear: 5-149 [1, 1024, 320] 102,400
|
| 271 |
+
│ │ │ │ └─Dropout: 5-150 [1, 1024, 320] --
|
| 272 |
+
│ │ │ └─RMSNorm: 4-67 [1, 1024, 320] 320
|
| 273 |
+
│ │ │ └─MLP: 4-68 [1, 1024, 320] --
|
| 274 |
+
│ │ │ │ └─Linear: 5-151 [1, 1024, 2048] 655,360
|
| 275 |
+
│ │ │ │ └─Linear: 5-152 [1, 1024, 320] 327,680
|
| 276 |
+
│ │ │ │ └─Dropout: 5-153 [1, 1024, 320] --
|
| 277 |
+
│ │ └─Block: 3-18 [1, 1024, 320] --
|
| 278 |
+
│ │ │ └─RMSNorm: 4-69 [1, 1024, 320] 320
|
| 279 |
+
│ │ │ └─CausalSelfAttention: 4-70 [1, 1024, 320] --
|
| 280 |
+
│ │ │ │ └─Linear: 5-154 [1, 1024, 320] 102,400
|
| 281 |
+
│ │ │ │ └─Linear: 5-155 [1, 1024, 64] 20,480
|
| 282 |
+
│ │ │ │ └─Linear: 5-156 [1, 1024, 64] 20,480
|
| 283 |
+
│ │ │ │ └─RotaryEmbedding: 5-157 [1, 1, 1024, 64] --
|
| 284 |
+
│ │ │ │ └─Linear: 5-158 [1, 1024, 320] 102,400
|
| 285 |
+
│ │ │ │ └─Dropout: 5-159 [1, 1024, 320] --
|
| 286 |
+
│ │ │ └─RMSNorm: 4-71 [1, 1024, 320] 320
|
| 287 |
+
│ │ │ └─MLP: 4-72 [1, 1024, 320] --
|
| 288 |
+
│ │ │ │ └─Linear: 5-160 [1, 1024, 2048] 655,360
|
| 289 |
+
│ │ │ │ └─Linear: 5-161 [1, 1024, 320] 327,680
|
| 290 |
+
│ │ │ │ └─Dropout: 5-162 [1, 1024, 320] --
|
| 291 |
+
│ └─RMSNorm: 2-4 [1, 1024, 320] 320
|
| 292 |
+
├─Linear: 1-2 [1, 1, 50304] 16,097,280
|
| 293 |
+
====================================================================================================
|
| 294 |
+
|
| 295 |
+
=== Parameter Counts (unique tensors) ===
|
| 296 |
+
Total params: 38,227,520
|
| 297 |
+
Trainable params: 38,227,520
|
| 298 |
+
Weight tying (wte = lm_head): True
|
| 299 |
+
Embedding mode: standard tied token embedding
|
| 300 |
+
Note: module-level torchinfo totals may double-count the tied LM head; use the unique counts above.
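The unique parameter total reported above can be reproduced from the layer shapes in the model printout. The following is a minimal sketch, not the training script's own counting code; the constant and function names are hypothetical, and the numbers are copied from the log. The MLP widths (2048 out of c_fc, 1024 into c_proj) are taken as printed; they would be consistent with a gated activation that halves the width, but the log does not say so.

```python
# Shapes copied from the model printout in this log (names are illustrative).
N_LAYERS = 18
D_MODEL = 320
KV_DIM = 64          # out_features of k_proj and v_proj
MLP_FC_OUT = 2048    # out_features of mlp.c_fc
MLP_PROJ_IN = 1024   # in_features of mlp.c_proj
VOCAB = 50304


def block_params() -> int:
    """Parameters in one Block: attention, MLP, and two RMSNorm weights."""
    attn = (
        D_MODEL * D_MODEL        # q_proj
        + 2 * D_MODEL * KV_DIM   # k_proj and v_proj
        + D_MODEL * D_MODEL      # attention c_proj
    )
    mlp = D_MODEL * MLP_FC_OUT + MLP_PROJ_IN * D_MODEL
    norms = 2 * D_MODEL          # ln_1 and ln_2
    return attn + mlp + norms


def total_params(tied: bool = True) -> int:
    """Total parameters; with weight tying the lm_head is not counted again."""
    embedding = VOCAB * D_MODEL
    head = 0 if tied else VOCAB * D_MODEL
    final_norm = D_MODEL         # ln_f
    return embedding + N_LAYERS * block_params() + final_norm + head


if __name__ == "__main__":
    print(f"per block: {block_params():,}")
    print(f"tied total: {total_params(tied=True):,}")
    print(f"untied total: {total_params(tied=False):,}")
```

With tying enabled the sketch gives 38,227,520, matching the logged total. The untied figure is what a module-level sum that counts the LM head separately would report.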
|
| 301 |
+
2026-05-07 06:34:20,360 | INFO | === Pretraining Started ===
|
| 302 |
+
2026-05-07 06:34:20,360 | INFO | Device: cuda | dtype: bfloat16 | distributed: False (world_size=1)
|
| 303 |
+
2026-05-07 06:34:20,360 | INFO | Model: 18 layers, 5 heads, 320 embd, context_len=1024
|
| 304 |
+
2026-05-07 06:34:20,360 | INFO | Training: max_iters=15259, batch_size=4, grad_accum=128, lr=1.11e-02, warmup=153 steps
|
| 305 |
+
2026-05-07 06:34:20,360 | INFO | Data: 7062542559 train tokens | tokens/step=524288
|
| 306 |
+
2026-05-07 06:34:20,360 | INFO | Data mix: data/processed_owt/train=75.0%, data/processed_nonwiki_2b/train=25.0%
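The logged tokens/step value follows from the batch settings in the header lines above. The sketch below redoes that arithmetic and estimates where the token budget first exceeds the logged train-token count. It assumes tokens/step = batch_size * grad_accum * context_len and that an epoch ends once that many train tokens have been consumed; both are assumptions about the training script, and the variable names are hypothetical.

```python
# Values copied from the "Training", "Model" and "Data" header lines of this log.
BATCH_SIZE = 4
GRAD_ACCUM = 128
CONTEXT_LEN = 1024
TRAIN_TOKENS = 7_062_542_559
MAX_ITERS = 15_259

# Assumed relation: one optimizer step consumes every micro-batch in full.
tokens_per_step = BATCH_SIZE * GRAD_ACCUM * CONTEXT_LEN

# Steps needed to consume the logged train-token count once (assumed epoch length).
steps_per_epoch = TRAIN_TOKENS / tokens_per_step

# Token budget of the full run.
total_tokens = MAX_ITERS * tokens_per_step

if __name__ == "__main__":
    print(f"tokens/step: {tokens_per_step:,}")
    print(f"steps per epoch: {steps_per_epoch:.1f}")
    print(f"total tokens over max_iters: {total_tokens:,}")
```

Under these assumptions one pass over the data takes roughly 13,470.7 steps, which agrees with the epoch counter in the step lines changing from 0 to 1 between steps 13470 and 13480.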
|
| 307 |
+
2026-05-07 06:34:26,584 | INFO | Resumed from checkpoint artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/latest_ckpt.pt at step 13200
|
| 308 |
+
2026-05-07 06:35:49,994 | INFO | step 13210/15259 | epoch 0 | loss 3.5143 | ppl 33.59 | lr 1.00e-02 | grad_norm 0.234 | 62857 tok/s | dt 83.41s | ETA 4:44:50
|
| 309 |
+
2026-05-07 06:36:24,112 | INFO | step 13220/15259 | epoch 0 | loss 3.4970 | ppl 33.02 | lr 1.00e-02 | grad_norm 0.214 | 153670 tok/s | dt 34.12s | ETA 3:19:41
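The step lines have a regular layout, so they can be parsed and cross-checked. The sketch below parses the line above with a regular expression, then checks that ppl matches exp(loss) and that tok/s matches ten steps of 524288 tokens divided by dt. The ten-step logging interval is inferred from the step numbers, not stated in the log, and the pattern and function names are hypothetical.

```python
import math
import re

# One step line copied verbatim from this log.
LINE = (
    "2026-05-07 06:36:24,112 | INFO | step 13220/15259 | epoch 0 | loss 3.4970 "
    "| ppl 33.02 | lr 1.00e-02 | grad_norm 0.214 | 153670 tok/s | dt 34.12s "
    "| ETA 3:19:41"
)

PATTERN = re.compile(
    r"step (?P<step>\d+)/(?P<max_iters>\d+) \| epoch (?P<epoch>\d+) \| "
    r"loss (?P<loss>[\d.]+) \| ppl (?P<ppl>[\d.]+) \| lr (?P<lr>[\d.e+-]+) \| "
    r"grad_norm (?P<grad_norm>[\d.]+) \| (?P<tok_s>\d+) tok/s \| dt (?P<dt>[\d.]+)s"
)

INT_FIELDS = ("step", "max_iters", "epoch", "tok_s")


def parse_step_line(line: str):
    """Return the numeric fields of a training step line, or None if it is not one."""
    match = PATTERN.search(line)
    if match is None:
        return None
    record = {key: float(value) for key, value in match.groupdict().items()}
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


if __name__ == "__main__":
    rec = parse_step_line(LINE)
    # Perplexity should be exp(loss), up to the rounding of the printed values.
    print(math.exp(rec["loss"]), rec["ppl"])
    # Assumed: each logged dt covers 10 optimizer steps of 524288 tokens.
    print(10 * 524288 / rec["dt"], rec["tok_s"])
```

Validation lines and warnings do not match the pattern, so the function returns None for them and they can be filtered out before plotting.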
|
| 310 |
+
2026-05-07 06:36:58,549 | INFO | step 13230/15259 | epoch 0 | loss 3.5149 | ppl 33.61 | lr 9.95e-03 | grad_norm 0.191 | 152248 tok/s | dt 34.44s | ETA 2:51:17
|
| 311 |
+
2026-05-07 06:37:32,417 | INFO | step 13240/15259 | epoch 0 | loss 3.5049 | ppl 33.28 | lr 9.91e-03 | grad_norm 0.172 | 154801 tok/s | dt 33.87s | ETA 2:36:19
|
| 312 |
+
2026-05-07 06:38:07,047 | INFO | step 13250/15259 | epoch 0 | loss 3.5104 | ppl 33.46 | lr 9.87e-03 | grad_norm 0.209 | 151398 tok/s | dt 34.63s | ETA 2:27:37
|
| 313 |
+
2026-05-07 06:38:41,493 | INFO | step 13260/15259 | epoch 0 | loss 3.4893 | ppl 32.76 | lr 9.82e-03 | grad_norm 0.254 | 152207 tok/s | dt 34.45s | ETA 1:54:16
|
| 314 |
+
2026-05-07 06:39:15,212 | INFO | step 13270/15259 | epoch 0 | loss 3.5077 | ppl 33.37 | lr 9.78e-03 | grad_norm 0.221 | 155484 tok/s | dt 33.72s | ETA 1:53:26
|
| 315 |
+
2026-05-07 06:39:49,389 | INFO | step 13280/15259 | epoch 0 | loss 3.5351 | ppl 34.30 | lr 9.74e-03 | grad_norm 0.201 | 153405 tok/s | dt 34.18s | ETA 1:52:41
|
| 316 |
+
2026-05-07 06:40:23,601 | INFO | step 13290/15259 | epoch 0 | loss 3.5499 | ppl 34.81 | lr 9.69e-03 | grad_norm 0.204 | 153248 tok/s | dt 34.21s | ETA 1:52:21
|
| 317 |
+
2026-05-07 06:40:58,004 | INFO | step 13300/15259 | epoch 0 | loss 3.5136 | ppl 33.57 | lr 9.65e-03 | grad_norm 0.173 | 152395 tok/s | dt 34.40s | ETA 1:51:37
|
| 318 |
+
2026-05-07 06:41:19,954 | INFO | step 13300 | val_loss 3.6002 | val_ppl 36.61
|
| 319 |
+
2026-05-07 06:42:04,308 | INFO | step 13310/15259 | epoch 0 | loss 3.5649 | ppl 35.34 | lr 9.61e-03 | grad_norm 0.193 | 79074 tok/s | dt 66.30s | ETA 1:51:04
|
| 320 |
+
2026-05-07 06:42:38,896 | INFO | step 13320/15259 | epoch 0 | loss 3.5135 | ppl 33.56 | lr 9.56e-03 | grad_norm 0.256 | 151578 tok/s | dt 34.59s | ETA 1:51:03
|
| 321 |
+
2026-05-07 06:43:13,449 | INFO | step 13330/15259 | epoch 0 | loss 3.5171 | ppl 33.69 | lr 9.52e-03 | grad_norm 0.206 | 151737 tok/s | dt 34.55s | ETA 1:50:43
|
| 322 |
+
2026-05-07 06:43:47,812 | INFO | step 13340/15259 | epoch 0 | loss 3.5745 | ppl 35.68 | lr 9.47e-03 | grad_norm 0.241 | 152574 tok/s | dt 34.36s | ETA 1:50:15
|
| 323 |
+
2026-05-07 06:44:22,826 | INFO | step 13350/15259 | epoch 0 | loss 3.4603 | ppl 31.83 | lr 9.43e-03 | grad_norm 0.207 | 149734 tok/s | dt 35.01s | ETA 1:50:04
|
| 324 |
+
2026-05-07 06:44:57,160 | INFO | step 13360/15259 | epoch 0 | loss 3.4766 | ppl 32.35 | lr 9.39e-03 | grad_norm 0.147 | 152701 tok/s | dt 34.33s | ETA 1:49:24
|
| 325 |
+
2026-05-07 06:45:31,551 | INFO | step 13370/15259 | epoch 0 | loss 3.4897 | ppl 32.78 | lr 9.34e-03 | grad_norm 0.201 | 152448 tok/s | dt 34.39s | ETA 1:48:42
|
| 326 |
+
2026-05-07 06:46:05,911 | INFO | step 13380/15259 | epoch 0 | loss 3.5253 | ppl 33.96 | lr 9.30e-03 | grad_norm 0.196 | 152591 tok/s | dt 34.36s | ETA 1:48:00
|
| 327 |
+
2026-05-07 06:46:40,259 | INFO | step 13390/15259 | epoch 0 | loss 3.5304 | ppl 34.14 | lr 9.26e-03 | grad_norm 0.162 | 152636 tok/s | dt 34.35s | ETA 1:47:25
|
| 328 |
+
2026-05-07 06:47:14,770 | INFO | step 13400/15259 | epoch 0 | loss 3.5191 | ppl 33.75 | lr 9.21e-03 | grad_norm 0.162 | 151920 tok/s | dt 34.51s | ETA 1:46:32
|
| 329 |
+
2026-05-07 06:47:14,957 | INFO | step 13400 | val_loss 3.3677 | val_ppl 29.01 ** New best validation loss! **
|
| 330 |
+
2026-05-07 06:47:30,555 | WARNING | New best checkpoint at step 13400 | val_loss=3.3677 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
|
| 331 |
+
2026-05-07 06:48:05,293 | INFO | step 13410/15259 | epoch 0 | loss 3.4872 | ppl 32.69 | lr 9.17e-03 | grad_norm 0.137 | 103772 tok/s | dt 50.52s | ETA 1:46:13
|
| 332 |
+
2026-05-07 06:48:39,903 | INFO | step 13420/15259 | epoch 0 | loss 3.5143 | ppl 33.59 | lr 9.13e-03 | grad_norm 0.200 | 151484 tok/s | dt 34.61s | ETA 1:45:46
|
| 333 |
+
2026-05-07 06:49:14,544 | INFO | step 13430/15259 | epoch 0 | loss 3.4930 | ppl 32.89 | lr 9.08e-03 | grad_norm 0.153 | 151349 tok/s | dt 34.64s | ETA 1:45:22
|
| 334 |
+
2026-05-07 06:49:49,662 | INFO | step 13440/15259 | epoch 0 | loss 3.5093 | ppl 33.43 | lr 9.04e-03 | grad_norm 0.182 | 149296 tok/s | dt 35.12s | ETA 1:45:16
|
| 335 |
+
2026-05-07 06:50:24,600 | INFO | step 13450/15259 | epoch 0 | loss 3.5218 | ppl 33.85 | lr 9.00e-03 | grad_norm 0.174 | 150061 tok/s | dt 34.94s | ETA 1:44:56
|
| 336 |
+
2026-05-07 06:50:59,541 | INFO | step 13460/15259 | epoch 0 | loss 3.5333 | ppl 34.24 | lr 8.95e-03 | grad_norm 0.161 | 150049 tok/s | dt 34.94s | ETA 1:44:29
|
| 337 |
+
2026-05-07 06:51:34,376 | INFO | step 13470/15259 | epoch 0 | loss 3.4957 | ppl 32.97 | lr 8.91e-03 | grad_norm 0.175 | 150506 tok/s | dt 34.84s | ETA 1:44:02
|
| 338 |
+
2026-05-07 06:52:09,097 | INFO | step 13480/15259 | epoch 1 | loss 3.5247 | ppl 33.94 | lr 8.86e-03 | grad_norm 0.153 | 151003 tok/s | dt 34.72s | ETA 1:43:30
|
| 339 |
+
2026-05-07 06:52:43,742 | INFO | step 13490/15259 | epoch 1 | loss 3.5071 | ppl 33.35 | lr 8.82e-03 | grad_norm 0.267 | 151329 tok/s | dt 34.65s | ETA 1:42:38
|
| 340 |
+
2026-05-07 06:53:18,381 | INFO | step 13500/15259 | epoch 1 | loss 3.7590 | ppl 42.90 | lr 8.78e-03 | grad_norm 0.192 | 151359 tok/s | dt 34.64s | ETA 1:41:53
|
| 341 |
+
2026-05-07 06:53:18,567 | INFO | step 13500 | val_loss 3.4374 | val_ppl 31.11
|
| 342 |
+
2026-05-07 06:54:02,683 | INFO | step 13510/15259 | epoch 1 | loss 3.5038 | ppl 33.24 | lr 8.73e-03 | grad_norm 0.221 | 118344 tok/s | dt 44.30s | ETA 1:41:32
|
| 343 |
+
2026-05-07 06:54:37,248 | INFO | step 13520/15259 | epoch 1 | loss 3.5296 | ppl 34.11 | lr 8.69e-03 | grad_norm 0.171 | 151680 tok/s | dt 34.57s | ETA 1:40:48
|
| 344 |
+
2026-05-07 06:55:11,594 | INFO | step 13530/15259 | epoch 1 | loss 3.5474 | ppl 34.72 | lr 8.65e-03 | grad_norm 0.150 | 152653 tok/s | dt 34.35s | ETA 1:40:00
|
| 345 |
+
2026-05-07 06:55:45,716 | INFO | step 13540/15259 | epoch 1 | loss 3.5175 | ppl 33.70 | lr 8.60e-03 | grad_norm 0.156 | 153649 tok/s | dt 34.12s | ETA 1:39:08
|
| 346 |
+
2026-05-07 06:56:19,861 | INFO | step 13550/15259 | epoch 1 | loss 3.5147 | ppl 33.60 | lr 8.56e-03 | grad_norm 0.156 | 153547 tok/s | dt 34.15s | ETA 1:38:16
|
| 347 |
+
2026-05-07 06:56:54,398 | INFO | step 13560/15259 | epoch 1 | loss 3.4620 | ppl 31.88 | lr 8.52e-03 | grad_norm 0.189 | 151807 tok/s | dt 34.54s | ETA 1:37:14
|
| 348 |
+
2026-05-07 06:57:29,031 | INFO | step 13570/15259 | epoch 1 | loss 3.5114 | ppl 33.50 | lr 8.47e-03 | grad_norm 0.157 | 151382 tok/s | dt 34.63s | ETA 1:36:42
|
| 349 |
+
2026-05-07 06:58:03,664 | INFO | step 13580/15259 | epoch 1 | loss 3.5009 | ppl 33.14 | lr 8.43e-03 | grad_norm 0.176 | 151383 tok/s | dt 34.63s | ETA 1:36:17
|
| 350 |
+
2026-05-07 06:58:38,286 | INFO | step 13590/15259 | epoch 1 | loss 3.4892 | ppl 32.76 | lr 8.39e-03 | grad_norm 0.141 | 151435 tok/s | dt 34.62s | ETA 1:36:00
|
| 351 |
+
2026-05-07 06:59:12,820 | INFO | step 13600/15259 | epoch 1 | loss 3.5015 | ppl 33.16 | lr 8.34e-03 | grad_norm 0.182 | 151818 tok/s | dt 34.53s | ETA 1:35:38
|
| 352 |
+
2026-05-07 06:59:13,002 | INFO | step 13600 | val_loss 3.4226 | val_ppl 30.65
|
| 353 |
+
2026-05-07 06:59:59,485 | INFO | step 13610/15259 | epoch 1 | loss 3.4709 | ppl 32.16 | lr 8.30e-03 | grad_norm 0.180 | 112351 tok/s | dt 46.67s | ETA 1:35:03
|
| 354 |
+
2026-05-07 07:00:34,150 | INFO | step 13620/15259 | epoch 1 | loss 3.4918 | ppl 32.84 | lr 8.25e-03 | grad_norm 0.202 | 151243 tok/s | dt 34.67s | ETA 1:34:29
|
| 355 |
+
2026-05-07 07:01:08,673 | INFO | step 13630/15259 | epoch 1 | loss 3.4872 | ppl 32.69 | lr 8.21e-03 | grad_norm 0.150 | 151868 tok/s | dt 34.52s | ETA 1:33:51
|
| 356 |
+
2026-05-07 07:01:43,133 | INFO | step 13640/15259 | epoch 1 | loss 3.4941 | ppl 32.92 | lr 8.17e-03 | grad_norm 0.162 | 152140 tok/s | dt 34.46s | ETA 1:33:11
|
| 357 |
+
2026-05-07 07:02:17,770 | INFO | step 13650/15259 | epoch 1 | loss 3.5046 | ppl 33.27 | lr 8.12e-03 | grad_norm 0.166 | 151366 tok/s | dt 34.64s | ETA 1:32:40
|
| 358 |
+
2026-05-07 07:02:52,733 | INFO | step 13660/15259 | epoch 1 | loss 3.4701 | ppl 32.14 | lr 8.08e-03 | grad_norm 0.162 | 149956 tok/s | dt 34.96s | ETA 1:32:20
|
| 359 |
+
2026-05-07 07:03:27,186 | INFO | step 13670/15259 | epoch 1 | loss 3.4949 | ppl 32.95 | lr 8.04e-03 | grad_norm 0.180 | 152174 tok/s | dt 34.45s | ETA 1:31:38
|
| 360 |
+
2026-05-07 07:04:01,790 | INFO | step 13680/15259 | epoch 1 | loss 3.4496 | ppl 31.49 | lr 7.99e-03 | grad_norm 0.210 | 151514 tok/s | dt 34.60s | ETA 1:31:06
|
| 361 |
+
2026-05-07 07:04:36,144 | INFO | step 13690/15259 | epoch 1 | loss 3.4082 | ppl 30.21 | lr 7.95e-03 | grad_norm 0.179 | 152613 tok/s | dt 34.35s | ETA 1:30:28
|
| 362 |
+
2026-05-07 07:05:10,705 | INFO | step 13700/15259 | epoch 1 | loss 3.4926 | ppl 32.87 | lr 7.91e-03 | grad_norm 0.137 | 151697 tok/s | dt 34.56s | ETA 1:29:51
|
| 363 |
+
2026-05-07 07:05:10,895 | INFO | step 13700 | val_loss 3.4340 | val_ppl 31.00
|
| 364 |
+
2026-05-07 07:05:55,925 | INFO | step 13710/15259 | epoch 1 | loss 3.4980 | ppl 33.05 | lr 7.86e-03 | grad_norm 0.124 | 115943 tok/s | dt 45.22s | ETA 1:29:08
|
| 365 |
+
2026-05-07 07:06:30,118 | INFO | step 13720/15259 | epoch 1 | loss 3.4797 | ppl 32.45 | lr 7.82e-03 | grad_norm 0.131 | 153333 tok/s | dt 34.19s | ETA 1:28:25
|
| 366 |
+
2026-05-07 07:07:04,421 | INFO | step 13730/15259 | epoch 1 | loss 3.4600 | ppl 31.82 | lr 7.77e-03 | grad_norm 0.185 | 152836 tok/s | dt 34.30s | ETA 1:27:41
|
| 367 |
+
2026-05-07 07:07:38,398 | INFO | step 13740/15259 | epoch 1 | loss 3.5195 | ppl 33.77 | lr 7.73e-03 | grad_norm 0.193 | 154307 tok/s | dt 33.98s | ETA 1:26:56
|
| 368 |
+
2026-05-07 07:08:12,346 | INFO | step 13750/15259 | epoch 1 | loss 3.5011 | ppl 33.15 | lr 7.69e-03 | grad_norm 0.153 | 154438 tok/s | dt 33.95s | ETA 1:26:03
|
| 369 |
+
2026-05-07 07:08:46,773 | INFO | step 13760/15259 | epoch 1 | loss 3.5008 | ppl 33.14 | lr 7.64e-03 | grad_norm 0.178 | 152290 tok/s | dt 34.43s | ETA 1:25:21
|
| 370 |
+
2026-05-07 07:09:21,595 | INFO | step 13770/15259 | epoch 1 | loss 3.5206 | ppl 33.80 | lr 7.60e-03 | grad_norm 0.153 | 150564 tok/s | dt 34.82s | ETA 1:25:06
|
| 371 |
+
2026-05-07 07:09:56,166 | INFO | step 13780/15259 | epoch 1 | loss 3.4685 | ppl 32.09 | lr 7.56e-03 | grad_norm 0.170 | 151656 tok/s | dt 34.57s | ETA 1:24:40
|
| 372 |
+
2026-05-07 07:10:30,407 | INFO | step 13790/15259 | epoch 1 | loss 3.4459 | ppl 31.37 | lr 7.51e-03 | grad_norm 0.143 | 153118 tok/s | dt 34.24s | ETA 1:24:13
|
| 373 |
+
2026-05-07 07:11:04,411 | INFO | step 13800/15259 | epoch 1 | loss 3.4981 | ppl 33.05 | lr 7.47e-03 | grad_norm 0.199 | 154181 tok/s | dt 34.00s | ETA 1:23:40
|
| 374 |
+
2026-05-07 07:11:04,596 | INFO | step 13800 | val_loss 3.5191 | val_ppl 33.75
|
| 375 |
+
2026-05-07 07:11:48,981 | INFO | step 13810/15259 | epoch 1 | loss 3.4680 | ppl 32.07 | lr 7.43e-03 | grad_norm 0.156 | 117634 tok/s | dt 44.57s | ETA 1:23:07
|
| 376 |
+
2026-05-07 07:12:23,670 | INFO | step 13820/15259 | epoch 1 | loss 3.5082 | ppl 33.39 | lr 7.38e-03 | grad_norm 0.194 | 151137 tok/s | dt 34.69s | ETA 1:22:29
2026-05-07 07:12:58,506 | INFO | step 13830/15259 | epoch 1 | loss 3.4734 | ppl 32.25 | lr 7.34e-03 | grad_norm 0.164 | 150502 tok/s | dt 34.84s | ETA 1:22:02
2026-05-07 07:13:32,857 | INFO | step 13840/15259 | epoch 1 | loss 3.4963 | ppl 32.99 | lr 7.30e-03 | grad_norm 0.171 | 152626 tok/s | dt 34.35s | ETA 1:21:31
2026-05-07 07:14:07,342 | INFO | step 13850/15259 | epoch 1 | loss 3.5225 | ppl 33.87 | lr 7.25e-03 | grad_norm 0.201 | 152034 tok/s | dt 34.48s | ETA 1:21:10
2026-05-07 07:14:41,507 | INFO | step 13860/15259 | epoch 1 | loss 3.4580 | ppl 31.75 | lr 7.21e-03 | grad_norm 0.155 | 153459 tok/s | dt 34.16s | ETA 1:20:27
2026-05-07 07:15:15,643 | INFO | step 13870/15259 | epoch 1 | loss 3.4459 | ppl 31.37 | lr 7.16e-03 | grad_norm 0.153 | 153586 tok/s | dt 34.14s | ETA 1:19:37
2026-05-07 07:15:49,903 | INFO | step 13880/15259 | epoch 1 | loss 3.4870 | ppl 32.69 | lr 7.12e-03 | grad_norm 0.139 | 153034 tok/s | dt 34.26s | ETA 1:18:46
2026-05-07 07:16:24,175 | INFO | step 13890/15259 | epoch 1 | loss 3.4775 | ppl 32.38 | lr 7.08e-03 | grad_norm 0.185 | 152978 tok/s | dt 34.27s | ETA 1:18:10
2026-05-07 07:16:58,430 | INFO | step 13900/15259 | epoch 1 | loss 3.4699 | ppl 32.13 | lr 7.03e-03 | grad_norm 0.138 | 153054 tok/s | dt 34.26s | ETA 1:17:30
2026-05-07 07:16:58,612 | WARNING | Step 13900: val_loss has not improved for 5 consecutive evals (best=3.3677, current=3.4687).
2026-05-07 07:16:58,613 | INFO | step 13900 | val_loss 3.4687 | val_ppl 32.10
2026-05-07 07:17:42,975 | INFO | step 13910/15259 | epoch 1 | loss 3.4850 | ppl 32.62 | lr 6.99e-03 | grad_norm 0.144 | 117699 tok/s | dt 44.54s | ETA 1:17:05
2026-05-07 07:18:17,412 | INFO | step 13920/15259 | epoch 1 | loss 3.4360 | ppl 31.06 | lr 6.95e-03 | grad_norm 0.139 | 152245 tok/s | dt 34.44s | ETA 1:16:38
2026-05-07 07:18:51,847 | INFO | step 13930/15259 | epoch 1 | loss 3.4891 | ppl 32.76 | lr 6.90e-03 | grad_norm 0.137 | 152256 tok/s | dt 34.43s | ETA 1:16:09
2026-05-07 07:19:26,193 | INFO | step 13940/15259 | epoch 1 | loss 3.4434 | ppl 31.29 | lr 6.86e-03 | grad_norm 0.149 | 152648 tok/s | dt 34.35s | ETA 1:15:36
2026-05-07 07:20:00,474 | INFO | step 13950/15259 | epoch 1 | loss 3.4170 | ppl 30.48 | lr 6.82e-03 | grad_norm 0.198 | 152938 tok/s | dt 34.28s | ETA 1:15:03
2026-05-07 07:20:34,950 | INFO | step 13960/15259 | epoch 1 | loss 3.4158 | ppl 30.44 | lr 6.77e-03 | grad_norm 0.165 | 152071 tok/s | dt 34.48s | ETA 1:14:27
2026-05-07 07:21:09,574 | INFO | step 13970/15259 | epoch 1 | loss 3.4779 | ppl 32.39 | lr 6.73e-03 | grad_norm 0.124 | 151425 tok/s | dt 34.62s | ETA 1:13:58
2026-05-07 07:21:44,417 | INFO | step 13980/15259 | epoch 1 | loss 3.4564 | ppl 31.70 | lr 6.68e-03 | grad_norm 0.132 | 150473 tok/s | dt 34.84s | ETA 1:13:34
2026-05-07 07:22:18,789 | INFO | step 13990/15259 | epoch 1 | loss 3.4518 | ppl 31.56 | lr 6.64e-03 | grad_norm 0.133 | 152531 tok/s | dt 34.37s | ETA 1:13:00
2026-05-07 07:22:53,228 | INFO | step 14000/15259 | epoch 1 | loss 3.4841 | ppl 32.59 | lr 6.60e-03 | grad_norm 0.174 | 152238 tok/s | dt 34.44s | ETA 1:12:29
2026-05-07 07:22:53,412 | WARNING | Step 14000: val_loss has not improved for 6 consecutive evals (best=3.3677, current=3.4683).
2026-05-07 07:22:53,412 | INFO | step 14000 | val_loss 3.4683 | val_ppl 32.08
2026-05-07 07:23:37,500 | INFO | step 14010/15259 | epoch 1 | loss 3.4702 | ppl 32.14 | lr 6.55e-03 | grad_norm 0.132 | 118424 tok/s | dt 44.27s | ETA 1:11:57
2026-05-07 07:24:12,037 | INFO | step 14020/15259 | epoch 1 | loss 3.4517 | ppl 31.55 | lr 6.51e-03 | grad_norm 0.143 | 151804 tok/s | dt 34.54s | ETA 1:11:20
2026-05-07 07:24:46,343 | INFO | step 14030/15259 | epoch 1 | loss 3.4394 | ppl 31.17 | lr 6.47e-03 | grad_norm 0.164 | 152830 tok/s | dt 34.31s | ETA 1:10:33
2026-05-07 07:25:20,692 | INFO | step 14040/15259 | epoch 1 | loss 3.4303 | ppl 30.89 | lr 6.42e-03 | grad_norm 0.152 | 152635 tok/s | dt 34.35s | ETA 1:09:58
2026-05-07 07:25:55,235 | INFO | step 14050/15259 | epoch 1 | loss 3.4534 | ppl 31.61 | lr 6.38e-03 | grad_norm 0.142 | 151776 tok/s | dt 34.54s | ETA 1:09:26
2026-05-07 07:26:29,511 | INFO | step 14060/15259 | epoch 1 | loss 3.4545 | ppl 31.64 | lr 6.34e-03 | grad_norm 0.174 | 152962 tok/s | dt 34.28s | ETA 1:08:44
2026-05-07 07:27:03,852 | INFO | step 14070/15259 | epoch 1 | loss 3.4559 | ppl 31.69 | lr 6.29e-03 | grad_norm 0.140 | 152673 tok/s | dt 34.34s | ETA 1:08:05
2026-05-07 07:27:38,106 | INFO | step 14080/15259 | epoch 1 | loss 3.4684 | ppl 32.09 | lr 6.25e-03 | grad_norm 0.123 | 153056 tok/s | dt 34.25s | ETA 1:07:30
2026-05-07 07:28:12,503 | INFO | step 14090/15259 | epoch 1 | loss 3.4567 | ppl 31.71 | lr 6.21e-03 | grad_norm 0.137 | 152422 tok/s | dt 34.40s | ETA 1:06:56
2026-05-07 07:28:46,898 | INFO | step 14100/15259 | epoch 1 | loss 3.5002 | ppl 33.12 | lr 6.16e-03 | grad_norm 0.174 | 152435 tok/s | dt 34.39s | ETA 1:06:19
2026-05-07 07:28:47,081 | WARNING | Step 14100: val_loss has not improved for 7 consecutive evals (best=3.3677, current=3.5897).
2026-05-07 07:28:47,081 | INFO | step 14100 | val_loss 3.5897 | val_ppl 36.22
2026-05-07 07:29:36,279 | INFO | step 14110/15259 | epoch 1 | loss 3.4514 | ppl 31.54 | lr 6.12e-03 | grad_norm 0.161 | 106171 tok/s | dt 49.38s | ETA 1:05:51
2026-05-07 07:30:10,712 | INFO | step 14120/15259 | epoch 1 | loss 3.4740 | ppl 32.26 | lr 6.07e-03 | grad_norm 0.158 | 152263 tok/s | dt 34.43s | ETA 1:05:19
2026-05-07 07:30:45,091 | INFO | step 14130/15259 | epoch 1 | loss 3.4807 | ppl 32.48 | lr 6.03e-03 | grad_norm 0.141 | 152506 tok/s | dt 34.38s | ETA 1:04:47
2026-05-07 07:31:19,576 | INFO | step 14140/15259 | epoch 1 | loss 3.4537 | ppl 31.62 | lr 5.99e-03 | grad_norm 0.140 | 152032 tok/s | dt 34.49s | ETA 1:04:15
2026-05-07 07:31:53,858 | INFO | step 14150/15259 | epoch 1 | loss 3.4652 | ppl 31.98 | lr 5.94e-03 | grad_norm 0.131 | 152932 tok/s | dt 34.28s | ETA 1:03:38
2026-05-07 07:32:28,206 | INFO | step 14160/15259 | epoch 1 | loss 3.4323 | ppl 30.95 | lr 5.90e-03 | grad_norm 0.176 | 152641 tok/s | dt 34.35s | ETA 1:02:58
2026-05-07 07:33:02,704 | INFO | step 14170/15259 | epoch 1 | loss 3.4628 | ppl 31.90 | lr 5.86e-03 | grad_norm 0.132 | 151978 tok/s | dt 34.50s | ETA 1:02:25
2026-05-07 07:33:36,941 | INFO | step 14180/15259 | epoch 1 | loss 3.4168 | ppl 30.47 | lr 5.81e-03 | grad_norm 0.153 | 153132 tok/s | dt 34.24s | ETA 1:01:48
2026-05-07 07:34:11,761 | INFO | step 14190/15259 | epoch 1 | loss 3.4523 | ppl 31.57 | lr 5.77e-03 | grad_norm 0.128 | 150572 tok/s | dt 34.82s | ETA 1:01:21
2026-05-07 07:34:46,331 | INFO | step 14200/15259 | epoch 1 | loss 3.4197 | ppl 30.56 | lr 5.73e-03 | grad_norm 0.125 | 151661 tok/s | dt 34.57s | ETA 1:00:52
2026-05-07 07:34:46,517 | WARNING | Step 14200: val_loss has not improved for 8 consecutive evals (best=3.3677, current=3.3797).
2026-05-07 07:34:46,517 | INFO | step 14200 | val_loss 3.3797 | val_ppl 29.36
2026-05-07 07:35:30,460 | INFO | step 14210/15259 | epoch 1 | loss 3.4574 | ppl 31.74 | lr 5.68e-03 | grad_norm 0.128 | 118807 tok/s | dt 44.13s | ETA 1:00:27
2026-05-07 07:36:04,895 | INFO | step 14220/15259 | epoch 1 | loss 3.4256 | ppl 30.74 | lr 5.64e-03 | grad_norm 0.122 | 152252 tok/s | dt 34.44s | ETA 0:59:51
2026-05-07 07:36:39,059 | INFO | step 14230/15259 | epoch 1 | loss 3.4247 | ppl 30.71 | lr 5.59e-03 | grad_norm 0.132 | 153466 tok/s | dt 34.16s | ETA 0:59:15
2026-05-07 07:37:13,612 | INFO | step 14240/15259 | epoch 1 | loss 3.4440 | ppl 31.31 | lr 5.55e-03 | grad_norm 0.133 | 151733 tok/s | dt 34.55s | ETA 0:58:35
2026-05-07 07:37:47,959 | INFO | step 14250/15259 | epoch 1 | loss 3.4323 | ppl 30.95 | lr 5.51e-03 | grad_norm 0.159 | 152645 tok/s | dt 34.35s | ETA 0:57:56
2026-05-07 07:38:21,933 | INFO | step 14260/15259 | epoch 1 | loss 3.4846 | ppl 32.61 | lr 5.46e-03 | grad_norm 0.144 | 154318 tok/s | dt 33.97s | ETA 0:57:05
2026-05-07 07:38:56,413 | INFO | step 14270/15259 | epoch 1 | loss 3.4778 | ppl 32.39 | lr 5.42e-03 | grad_norm 0.156 | 152058 tok/s | dt 34.48s | ETA 0:56:32
2026-05-07 07:39:31,031 | INFO | step 14280/15259 | epoch 1 | loss 3.4491 | ppl 31.47 | lr 5.38e-03 | grad_norm 0.163 | 151448 tok/s | dt 34.62s | ETA 0:56:07
2026-05-07 07:40:05,142 | INFO | step 14290/15259 | epoch 1 | loss 3.4691 | ppl 32.11 | lr 5.33e-03 | grad_norm 0.132 | 153703 tok/s | dt 34.11s | ETA 0:55:24
2026-05-07 07:40:39,244 | INFO | step 14300/15259 | epoch 1 | loss 3.4538 | ppl 31.62 | lr 5.29e-03 | grad_norm 0.126 | 153741 tok/s | dt 34.10s | ETA 0:54:45
2026-05-07 07:40:39,426 | WARNING | Step 14300: val_loss has not improved for 9 consecutive evals (best=3.3677, current=3.4289).
2026-05-07 07:40:39,426 | INFO | step 14300 | val_loss 3.4289 | val_ppl 30.84
2026-05-07 07:41:23,185 | INFO | step 14310/15259 | epoch 1 | loss 3.4575 | ppl 31.74 | lr 5.25e-03 | grad_norm 0.138 | 119316 tok/s | dt 43.94s | ETA 0:54:12
2026-05-07 07:41:57,567 | INFO | step 14320/15259 | epoch 1 | loss 3.4281 | ppl 30.82 | lr 5.20e-03 | grad_norm 0.121 | 152489 tok/s | dt 34.38s | ETA 0:53:36
2026-05-07 07:42:32,070 | INFO | step 14330/15259 | epoch 1 | loss 3.4154 | ppl 30.43 | lr 5.16e-03 | grad_norm 0.123 | 151956 tok/s | dt 34.50s | ETA 0:53:00
2026-05-07 07:43:06,532 | INFO | step 14340/15259 | epoch 1 | loss 3.4669 | ppl 32.04 | lr 5.12e-03 | grad_norm 0.153 | 152135 tok/s | dt 34.46s | ETA 0:52:32
2026-05-07 07:43:40,864 | INFO | step 14350/15259 | epoch 1 | loss 3.4400 | ppl 31.19 | lr 5.07e-03 | grad_norm 0.127 | 152711 tok/s | dt 34.33s | ETA 0:52:02
2026-05-07 07:44:15,377 | INFO | step 14360/15259 | epoch 1 | loss 3.4511 | ppl 31.54 | lr 5.03e-03 | grad_norm 0.155 | 151908 tok/s | dt 34.51s | ETA 0:51:35
2026-05-07 07:44:50,032 | INFO | step 14370/15259 | epoch 1 | loss 3.4377 | ppl 31.12 | lr 4.98e-03 | grad_norm 0.125 | 151289 tok/s | dt 34.65s | ETA 0:51:06
2026-05-07 07:45:24,605 | INFO | step 14380/15259 | epoch 1 | loss 3.4346 | ppl 31.02 | lr 4.94e-03 | grad_norm 0.122 | 151648 tok/s | dt 34.57s | ETA 0:50:33
2026-05-07 07:45:58,983 | INFO | step 14390/15259 | epoch 1 | loss 3.4117 | ppl 30.32 | lr 4.90e-03 | grad_norm 0.127 | 152506 tok/s | dt 34.38s | ETA 0:49:57
2026-05-07 07:46:33,522 | INFO | step 14400/15259 | epoch 1 | loss 3.3932 | ppl 29.76 | lr 4.85e-03 | grad_norm 0.139 | 151797 tok/s | dt 34.54s | ETA 0:49:26
2026-05-07 07:46:33,703 | WARNING | Step 14400: val_loss has not improved for 10 consecutive evals (best=3.3677, current=3.6085).
2026-05-07 07:46:33,703 | INFO | step 14400 | val_loss 3.6085 | val_ppl 36.91
2026-05-07 07:47:22,529 | INFO | step 14410/15259 | epoch 1 | loss 3.4224 | ppl 30.64 | lr 4.81e-03 | grad_norm 0.120 | 106982 tok/s | dt 49.01s | ETA 0:48:52
2026-05-07 07:47:56,912 | INFO | step 14420/15259 | epoch 1 | loss 3.4418 | ppl 31.24 | lr 4.77e-03 | grad_norm 0.167 | 152486 tok/s | dt 34.38s | ETA 0:48:13
2026-05-07 07:48:31,387 | INFO | step 14430/15259 | epoch 1 | loss 3.4356 | ppl 31.05 | lr 4.72e-03 | grad_norm 0.118 | 152077 tok/s | dt 34.48s | ETA 0:47:37
2026-05-07 07:49:05,885 | INFO | step 14440/15259 | epoch 1 | loss 3.4387 | ppl 31.15 | lr 4.68e-03 | grad_norm 0.123 | 151973 tok/s | dt 34.50s | ETA 0:47:05
2026-05-07 07:49:40,222 | INFO | step 14450/15259 | epoch 1 | loss 3.3903 | ppl 29.67 | lr 4.64e-03 | grad_norm 0.118 | 152690 tok/s | dt 34.34s | ETA 0:46:27
2026-05-07 07:50:14,730 | INFO | step 14460/15259 | epoch 1 | loss 3.4178 | ppl 30.50 | lr 4.59e-03 | grad_norm 0.116 | 151932 tok/s | dt 34.51s | ETA 0:45:51
2026-05-07 07:50:49,033 | INFO | step 14470/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 4.55e-03 | grad_norm 0.124 | 152842 tok/s | dt 34.30s | ETA 0:45:15
2026-05-07 07:51:23,539 | INFO | step 14480/15259 | epoch 1 | loss 3.4575 | ppl 31.74 | lr 4.50e-03 | grad_norm 0.115 | 151941 tok/s | dt 34.51s | ETA 0:44:42
2026-05-07 07:51:58,262 | INFO | step 14490/15259 | epoch 1 | loss 3.4221 | ppl 30.63 | lr 4.46e-03 | grad_norm 0.115 | 150989 tok/s | dt 34.72s | ETA 0:44:11
2026-05-07 07:52:32,586 | INFO | step 14500/15259 | epoch 1 | loss 3.4096 | ppl 30.25 | lr 4.42e-03 | grad_norm 0.127 | 152749 tok/s | dt 34.32s | ETA 0:43:36
2026-05-07 07:52:32,770 | WARNING | Step 14500: val_loss has not improved for 11 consecutive evals (best=3.3677, current=3.5552).
2026-05-07 07:52:32,770 | INFO | step 14500 | val_loss 3.5552 | val_ppl 35.00
2026-05-07 07:53:18,260 | INFO | step 14510/15259 | epoch 1 | loss 3.4169 | ppl 30.48 | lr 4.37e-03 | grad_norm 0.118 | 114789 tok/s | dt 45.67s | ETA 0:43:06
2026-05-07 07:53:52,847 | INFO | step 14520/15259 | epoch 1 | loss 3.4140 | ppl 30.39 | lr 4.33e-03 | grad_norm 0.138 | 151585 tok/s | dt 34.59s | ETA 0:42:36
2026-05-07 07:54:27,631 | INFO | step 14530/15259 | epoch 1 | loss 3.4367 | ppl 31.08 | lr 4.29e-03 | grad_norm 0.138 | 150728 tok/s | dt 34.78s | ETA 0:42:05
2026-05-07 07:55:02,315 | INFO | step 14540/15259 | epoch 1 | loss 3.4139 | ppl 30.38 | lr 4.24e-03 | grad_norm 0.125 | 151163 tok/s | dt 34.68s | ETA 0:41:30
2026-05-07 07:55:36,839 | INFO | step 14550/15259 | epoch 1 | loss 3.3966 | ppl 29.86 | lr 4.20e-03 | grad_norm 0.118 | 151862 tok/s | dt 34.52s | ETA 0:40:58
2026-05-07 07:56:11,199 | INFO | step 14560/15259 | epoch 1 | loss 3.4175 | ppl 30.49 | lr 4.16e-03 | grad_norm 0.124 | 152585 tok/s | dt 34.36s | ETA 0:40:17
2026-05-07 07:56:45,540 | INFO | step 14570/15259 | epoch 1 | loss 3.4059 | ppl 30.14 | lr 4.11e-03 | grad_norm 0.118 | 152674 tok/s | dt 34.34s | ETA 0:39:39
2026-05-07 07:57:19,886 | INFO | step 14580/15259 | epoch 1 | loss 3.3942 | ppl 29.79 | lr 4.07e-03 | grad_norm 0.151 | 152645 tok/s | dt 34.35s | ETA 0:38:59
2026-05-07 07:57:54,285 | INFO | step 14590/15259 | epoch 1 | loss 3.3677 | ppl 29.01 | lr 4.03e-03 | grad_norm 0.122 | 152415 tok/s | dt 34.40s | ETA 0:38:20
2026-05-07 07:58:28,590 | INFO | step 14600/15259 | epoch 1 | loss 3.4266 | ppl 30.77 | lr 3.98e-03 | grad_norm 0.120 | 152832 tok/s | dt 34.30s | ETA 0:37:43
2026-05-07 07:58:28,772 | INFO | step 14600 | val_loss 3.3107 | val_ppl 27.41 ** New best validation loss! **
2026-05-07 07:58:43,858 | WARNING | New best checkpoint at step 14600 | val_loss=3.3107 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
2026-05-07 07:59:18,224 | INFO | step 14610/15259 | epoch 1 | loss 3.4512 | ppl 31.54 | lr 3.94e-03 | grad_norm 0.112 | 105630 tok/s | dt 49.63s | ETA 0:37:09
2026-05-07 07:59:52,454 | INFO | step 14620/15259 | epoch 1 | loss 3.4053 | ppl 30.12 | lr 3.89e-03 | grad_norm 0.112 | 153166 tok/s | dt 34.23s | ETA 0:36:33
2026-05-07 08:00:26,899 | INFO | step 14630/15259 | epoch 1 | loss 3.4238 | ppl 30.68 | lr 3.85e-03 | grad_norm 0.125 | 152211 tok/s | dt 34.44s | ETA 0:36:00
2026-05-07 08:01:01,101 | INFO | step 14640/15259 | epoch 1 | loss 3.4174 | ppl 30.49 | lr 3.81e-03 | grad_norm 0.134 | 153292 tok/s | dt 34.20s | ETA 0:35:23
2026-05-07 08:01:35,540 | INFO | step 14650/15259 | epoch 1 | loss 3.3915 | ppl 29.71 | lr 3.76e-03 | grad_norm 0.113 | 152236 tok/s | dt 34.44s | ETA 0:34:51
2026-05-07 08:02:10,008 | INFO | step 14660/15259 | epoch 1 | loss 3.3931 | ppl 29.76 | lr 3.72e-03 | grad_norm 0.113 | 152106 tok/s | dt 34.47s | ETA 0:34:17
2026-05-07 08:02:44,033 | INFO | step 14670/15259 | epoch 1 | loss 3.4063 | ppl 30.15 | lr 3.68e-03 | grad_norm 0.113 | 154091 tok/s | dt 34.02s | ETA 0:33:41
2026-05-07 08:03:18,437 | INFO | step 14680/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 3.63e-03 | grad_norm 0.115 | 152393 tok/s | dt 34.40s | ETA 0:33:06
2026-05-07 08:03:53,233 | INFO | step 14690/15259 | epoch 1 | loss 3.3709 | ppl 29.10 | lr 3.59e-03 | grad_norm 0.112 | 150673 tok/s | dt 34.80s | ETA 0:32:38
2026-05-07 08:04:27,923 | INFO | step 14700/15259 | epoch 1 | loss 3.4137 | ppl 30.38 | lr 3.55e-03 | grad_norm 0.119 | 151135 tok/s | dt 34.69s | ETA 0:32:07
2026-05-07 08:04:28,107 | INFO | step 14700 | val_loss 3.3193 | val_ppl 27.64
2026-05-07 08:05:13,037 | INFO | step 14710/15259 | epoch 1 | loss 3.4029 | ppl 30.05 | lr 3.50e-03 | grad_norm 0.108 | 116214 tok/s | dt 45.11s | ETA 0:31:33
2026-05-07 08:05:47,407 | INFO | step 14720/15259 | epoch 1 | loss 3.4334 | ppl 30.98 | lr 3.46e-03 | grad_norm 0.110 | 152544 tok/s | dt 34.37s | ETA 0:31:02
2026-05-07 08:06:21,500 | INFO | step 14730/15259 | epoch 1 | loss 3.4365 | ppl 31.08 | lr 3.42e-03 | grad_norm 0.111 | 153781 tok/s | dt 34.09s | ETA 0:30:24
2026-05-07 08:06:55,496 | INFO | step 14740/15259 | epoch 1 | loss 3.4652 | ppl 31.98 | lr 3.37e-03 | grad_norm 0.101 | 154222 tok/s | dt 34.00s | ETA 0:29:41
2026-05-07 08:07:29,765 | INFO | step 14750/15259 | epoch 1 | loss 3.4456 | ppl 31.36 | lr 3.33e-03 | grad_norm 0.121 | 152991 tok/s | dt 34.27s | ETA 0:29:03
2026-05-07 08:08:03,717 | INFO | step 14760/15259 | epoch 1 | loss 3.4012 | ppl 30.00 | lr 3.28e-03 | grad_norm 0.112 | 154420 tok/s | dt 33.95s | ETA 0:28:23
2026-05-07 08:08:37,815 | INFO | step 14770/15259 | epoch 1 | loss 3.3806 | ppl 29.39 | lr 3.24e-03 | grad_norm 0.114 | 153758 tok/s | dt 34.10s | ETA 0:27:46
2026-05-07 08:09:12,192 | INFO | step 14780/15259 | epoch 1 | loss 3.4052 | ppl 30.12 | lr 3.20e-03 | grad_norm 0.102 | 152513 tok/s | dt 34.38s | ETA 0:27:15
2026-05-07 08:09:46,153 | INFO | step 14790/15259 | epoch 1 | loss 3.4183 | ppl 30.52 | lr 3.15e-03 | grad_norm 0.107 | 154377 tok/s | dt 33.96s | ETA 0:26:40
2026-05-07 08:10:20,319 | INFO | step 14800/15259 | epoch 1 | loss 3.4210 | ppl 30.60 | lr 3.11e-03 | grad_norm 0.110 | 153453 tok/s | dt 34.17s | ETA 0:26:05
2026-05-07 08:10:20,501 | INFO | step 14800 | val_loss 3.3248 | val_ppl 27.79
2026-05-07 08:11:04,731 | INFO | step 14810/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 3.07e-03 | grad_norm 0.123 | 118052 tok/s | dt 44.41s | ETA 0:25:33
2026-05-07 08:11:38,799 | INFO | step 14820/15259 | epoch 1 | loss 3.4084 | ppl 30.22 | lr 3.02e-03 | grad_norm 0.109 | 153894 tok/s | dt 34.07s | ETA 0:24:59
2026-05-07 08:12:12,904 | INFO | step 14830/15259 | epoch 1 | loss 3.4229 | ppl 30.66 | lr 2.98e-03 | grad_norm 0.111 | 153730 tok/s | dt 34.10s | ETA 0:24:22
2026-05-07 08:12:46,708 | INFO | step 14840/15259 | epoch 1 | loss 3.4145 | ppl 30.40 | lr 2.94e-03 | grad_norm 0.106 | 155096 tok/s | dt 33.80s | ETA 0:23:47
2026-05-07 08:13:20,613 | INFO | step 14850/15259 | epoch 1 | loss 3.3855 | ppl 29.53 | lr 2.89e-03 | grad_norm 0.112 | 154632 tok/s | dt 33.91s | ETA 0:23:11
2026-05-07 08:13:54,721 | INFO | step 14860/15259 | epoch 1 | loss 3.3912 | ppl 29.70 | lr 2.85e-03 | grad_norm 0.116 | 153715 tok/s | dt 34.11s | ETA 0:22:36
2026-05-07 08:14:29,145 | INFO | step 14870/15259 | epoch 1 | loss 3.4288 | ppl 30.84 | lr 2.80e-03 | grad_norm 0.106 | 152306 tok/s | dt 34.42s | ETA 0:22:05
2026-05-07 08:15:03,049 | INFO | step 14880/15259 | epoch 1 | loss 3.4119 | ppl 30.32 | lr 2.76e-03 | grad_norm 0.111 | 154637 tok/s | dt 33.90s | ETA 0:21:29
2026-05-07 08:15:37,097 | INFO | step 14890/15259 | epoch 1 | loss 3.4453 | ppl 31.35 | lr 2.72e-03 | grad_norm 0.111 | 153984 tok/s | dt 34.05s | ETA 0:20:57
2026-05-07 08:16:11,114 | INFO | step 14900/15259 | epoch 1 | loss 3.3889 | ppl 29.63 | lr 2.67e-03 | grad_norm 0.104 | 154124 tok/s | dt 34.02s | ETA 0:20:24
2026-05-07 08:16:11,295 | INFO | step 14900 | val_loss 3.4178 | val_ppl 30.50
2026-05-07 08:16:54,196 | INFO | step 14910/15259 | epoch 1 | loss 3.3938 | ppl 29.78 | lr 2.63e-03 | grad_norm 0.097 | 121696 tok/s | dt 43.08s | ETA 0:19:50
2026-05-07 08:17:28,091 | INFO | step 14920/15259 | epoch 1 | loss 3.3948 | ppl 29.81 | lr 2.59e-03 | grad_norm 0.109 | 154682 tok/s | dt 33.89s | ETA 0:19:13
2026-05-07 08:18:01,918 | INFO | step 14930/15259 | epoch 1 | loss 3.3871 | ppl 29.58 | lr 2.54e-03 | grad_norm 0.106 | 154991 tok/s | dt 33.83s | ETA 0:18:38
2026-05-07 08:18:35,817 | INFO | step 14940/15259 | epoch 1 | loss 3.4039 | ppl 30.08 | lr 2.50e-03 | grad_norm 0.101 | 154661 tok/s | dt 33.90s | ETA 0:18:03
2026-05-07 08:19:09,692 | INFO | step 14950/15259 | epoch 1 | loss 3.3922 | ppl 29.73 | lr 2.46e-03 | grad_norm 0.110 | 154772 tok/s | dt 33.87s | ETA 0:17:28
2026-05-07 08:19:43,480 | INFO | step 14960/15259 | epoch 1 | loss 3.3891 | ppl 29.64 | lr 2.41e-03 | grad_norm 0.095 | 155168 tok/s | dt 33.79s | ETA 0:16:52
2026-05-07 08:20:17,607 | INFO | step 14970/15259 | epoch 1 | loss 3.3477 | ppl 28.44 | lr 2.37e-03 | grad_norm 0.111 | 153630 tok/s | dt 34.13s | ETA 0:16:19
2026-05-07 08:20:51,571 | INFO | step 14980/15259 | epoch 1 | loss 3.3929 | ppl 29.75 | lr 2.33e-03 | grad_norm 0.106 | 154363 tok/s | dt 33.96s | ETA 0:15:46
2026-05-07 08:21:25,579 | INFO | step 14990/15259 | epoch 1 | loss 3.3760 | ppl 29.25 | lr 2.28e-03 | grad_norm 0.097 | 154167 tok/s | dt 34.01s | ETA 0:15:13
2026-05-07 08:21:59,464 | INFO | step 15000/15259 | epoch 1 | loss 3.4321 | ppl 30.94 | lr 2.24e-03 | grad_norm 0.105 | 154725 tok/s | dt 33.89s | ETA 0:14:39
2026-05-07 08:21:59,647 | INFO | step 15000 | val_loss 3.4002 | val_ppl 29.97
2026-05-07 08:22:44,068 | INFO | step 15010/15259 | epoch 1 | loss 3.4614 | ppl 31.86 | lr 2.19e-03 | grad_norm 0.109 | 117543 tok/s | dt 44.60s | ETA 0:14:08
2026-05-07 08:23:18,058 | INFO | step 15020/15259 | epoch 1 | loss 3.4082 | ppl 30.21 | lr 2.15e-03 | grad_norm 0.109 | 154250 tok/s | dt 33.99s | ETA 0:13:33
2026-05-07 08:23:52,139 | INFO | step 15030/15259 | epoch 1 | loss 3.4596 | ppl 31.81 | lr 2.11e-03 | grad_norm 0.105 | 153836 tok/s | dt 34.08s | ETA 0:12:59
2026-05-07 08:24:26,353 | INFO | step 15040/15259 | epoch 1 | loss 3.3540 | ppl 28.62 | lr 2.06e-03 | grad_norm 0.101 | 153237 tok/s | dt 34.21s | ETA 0:12:26
2026-05-07 08:25:00,515 | INFO | step 15050/15259 | epoch 1 | loss 3.3611 | ppl 28.82 | lr 2.02e-03 | grad_norm 0.099 | 153471 tok/s | dt 34.16s | ETA 0:11:53
2026-05-07 08:25:34,697 | INFO | step 15060/15259 | epoch 1 | loss 3.3791 | ppl 29.34 | lr 1.98e-03 | grad_norm 0.102 | 153378 tok/s | dt 34.18s | ETA 0:11:19
2026-05-07 08:26:08,880 | INFO | step 15070/15259 | epoch 1 | loss 3.3400 | ppl 28.22 | lr 1.93e-03 | grad_norm 0.101 | 153378 tok/s | dt 34.18s | ETA 0:10:45
2026-05-07 08:26:43,065 | INFO | step 15080/15259 | epoch 1 | loss 3.4014 | ppl 30.01 | lr 1.89e-03 | grad_norm 0.107 | 153371 tok/s | dt 34.18s | ETA 0:10:11
2026-05-07 08:27:17,167 | INFO | step 15090/15259 | epoch 1 | loss 3.3326 | ppl 28.01 | lr 1.85e-03 | grad_norm 0.113 | 153738 tok/s | dt 34.10s | ETA 0:09:37
2026-05-07 08:27:51,205 | INFO | step 15100/15259 | epoch 1 | loss 3.3760 | ppl 29.25 | lr 1.80e-03 | grad_norm 0.104 | 154034 tok/s | dt 34.04s | ETA 0:09:02
2026-05-07 08:27:51,392 | WARNING | Step 15100: val_loss has not improved for 5 consecutive evals (best=3.3107, current=3.3793).
2026-05-07 08:27:51,392 | INFO | step 15100 | val_loss 3.3793 | val_ppl 29.35
2026-05-07 08:28:39,758 | INFO | step 15110/15259 | epoch 1 | loss 3.4342 | ppl 31.01 | lr 1.76e-03 | grad_norm 0.100 | 107982 tok/s | dt 48.55s | ETA 0:08:28
2026-05-07 08:29:13,780 | INFO | step 15120/15259 | epoch 1 | loss 3.4107 | ppl 30.29 | lr 1.71e-03 | grad_norm 0.104 | 154103 tok/s | dt 34.02s | ETA 0:07:54
2026-05-07 08:29:47,896 | INFO | step 15130/15259 | epoch 1 | loss 3.3634 | ppl 28.89 | lr 1.67e-03 | grad_norm 0.103 | 153676 tok/s | dt 34.12s | ETA 0:07:19
2026-05-07 08:30:22,102 | INFO | step 15140/15259 | epoch 1 | loss 3.3809 | ppl 29.40 | lr 1.63e-03 | grad_norm 0.102 | 153277 tok/s | dt 34.21s | ETA 0:06:45
2026-05-07 08:30:56,108 | INFO | step 15150/15259 | epoch 1 | loss 3.3896 | ppl 29.65 | lr 1.58e-03 | grad_norm 0.100 | 154173 tok/s | dt 34.01s | ETA 0:06:11
2026-05-07 08:31:30,244 | INFO | step 15160/15259 | epoch 1 | loss 3.3846 | ppl 29.51 | lr 1.54e-03 | grad_norm 0.110 | 153587 tok/s | dt 34.14s | ETA 0:05:37
2026-05-07 08:32:04,154 | INFO | step 15170/15259 | epoch 1 | loss 3.4470 | ppl 31.41 | lr 1.50e-03 | grad_norm 0.101 | 154615 tok/s | dt 33.91s | ETA 0:05:03
2026-05-07 08:32:38,060 | INFO | step 15180/15259 | epoch 1 | loss 3.3958 | ppl 29.84 | lr 1.45e-03 | grad_norm 0.101 | 154628 tok/s | dt 33.91s | ETA 0:04:28
2026-05-07 08:33:11,940 | INFO | step 15190/15259 | epoch 1 | loss 3.3498 | ppl 28.50 | lr 1.41e-03 | grad_norm 0.100 | 154749 tok/s | dt 33.88s | ETA 0:03:54
2026-05-07 08:33:45,916 | INFO | step 15200/15259 | epoch 1 | loss 3.4027 | ppl 30.05 | lr 1.37e-03 | grad_norm 0.105 | 154312 tok/s | dt 33.98s | ETA 0:03:20
2026-05-07 08:33:46,115 | INFO | step 15200 | val_loss 3.2405 | val_ppl 25.55 ** New best validation loss! **
2026-05-07 08:34:01,475 | WARNING | New best checkpoint at step 15200 | val_loss=3.2405 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
2026-05-07 08:34:35,761 | INFO | step 15210/15259 | epoch 1 | loss 3.3548 | ppl 28.64 | lr 1.32e-03 | grad_norm 0.106 | 105182 tok/s | dt 49.85s | ETA 0:02:46
2026-05-07 08:35:09,998 | INFO | step 15220/15259 | epoch 1 | loss 3.3820 | ppl 29.43 | lr 1.28e-03 | grad_norm 0.100 | 153139 tok/s | dt 34.24s | ETA 0:02:12
2026-05-07 08:35:44,072 | INFO | step 15230/15259 | epoch 1 | loss 3.3916 | ppl 29.71 | lr 1.24e-03 | grad_norm 0.102 | 153865 tok/s | dt 34.07s | ETA 0:01:38
2026-05-07 08:36:17,977 | INFO | step 15240/15259 | epoch 1 | loss 3.3928 | ppl 29.75 | lr 1.19e-03 | grad_norm 0.101 | 154637 tok/s | dt 33.90s | ETA 0:01:04
2026-05-07 08:36:51,958 | INFO | step 15250/15259 | epoch 1 | loss 3.3689 | ppl 29.05 | lr 1.15e-03 | grad_norm 0.095 | 154289 tok/s | dt 33.98s | ETA 0:00:30
2026-05-07 08:37:32,797 | CRITICAL | Pretraining complete -- run: final_c2_muon_bs512_lr12_seed3_mix3to1 | best val loss: 3.2405 | total time: 2:03:06 | avg 153871 tok/s
2026-05-07 08:37:32,797 | INFO | Pretraining complete. Best val loss: 3.2405
2026-05-07 08:37:34,088 | INFO | Saved metrics plot to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/metrics.png
2026-05-07 08:37:34,088 | INFO | Saved results doc to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/results.md