huiting tang committed on
Add files using upload-large-folder tool
- .gitattributes +1 -0
- architecture.txt +334 -0
- config_snapshot.json +116 -0
- eval_metrics.jsonl +23 -0
- events.jsonl +42 -0
- logs/pretraining_20260504_172141.log +0 -0
- metrics.png +3 -0
- results.md +24 -0
- train_metrics.jsonl +0 -0
.gitattributes
CHANGED
|
@@ -37,3 +37,4 @@ final_baseline_modern_6l448/metrics.png filter=lfs diff=lfs merge=lfs -text
| 37 |
final_c1_14l320_standard/metrics.png filter=lfs diff=lfs merge=lfs -text
|
| 38 |
final_baseline_modern_6l448/.ipynb_checkpoints/metrics-checkpoint.png filter=lfs diff=lfs merge=lfs -text
|
| 39 |
final/final_c2_18l320_standard/metrics.png filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
metrics.png filter=lfs diff=lfs merge=lfs -text
|
architecture.txt
ADDED
|
@@ -0,0 +1,334 @@
| 1 |
+
=== Raw Model ===
|
| 2 |
+
GPT(
|
| 3 |
+
(token_embedding): FactorizedTokenEmbedding(
|
| 4 |
+
(E): Embedding(50304, 128)
|
| 5 |
+
(P_in): Linear(in_features=128, out_features=384, bias=False)
|
| 6 |
+
(P_out): Linear(in_features=384, out_features=128, bias=False)
|
| 7 |
+
)
|
| 8 |
+
(transformer): ModuleDict(
|
| 9 |
+
(drop): Dropout(p=0.0, inplace=False)
|
| 10 |
+
(h): ModuleList(
|
| 11 |
+
(0-19): 20 x Block(
|
| 12 |
+
(ln_1): RMSNorm()
|
| 13 |
+
(attn): CausalSelfAttention(
|
| 14 |
+
(rotary): RotaryEmbedding()
|
| 15 |
+
(q_proj): Linear(in_features=384, out_features=384, bias=False)
|
| 16 |
+
(k_proj): Linear(in_features=384, out_features=128, bias=False)
|
| 17 |
+
(v_proj): Linear(in_features=384, out_features=128, bias=False)
|
| 18 |
+
(c_proj): Linear(in_features=384, out_features=384, bias=False)
|
| 19 |
+
(resid_dropout): Dropout(p=0.0, inplace=False)
|
| 20 |
+
)
|
| 21 |
+
(ln_2): RMSNorm()
|
| 22 |
+
(mlp): MLP(
|
| 23 |
+
(c_fc): Linear(in_features=384, out_features=2048, bias=False)
|
| 24 |
+
(c_proj): Linear(in_features=1024, out_features=384, bias=False)
|
| 25 |
+
(dropout): Dropout(p=0.0, inplace=False)
|
| 26 |
+
)
|
| 27 |
+
)
|
| 28 |
+
)
|
| 29 |
+
(ln_f): RMSNorm()
|
| 30 |
+
)
|
| 31 |
+
)
|
| 32 |
+
|
| 33 |
+
=== Forward Summary (torchinfo, uncompiled model) ===
|
| 34 |
+
====================================================================================================
|
| 35 |
+
Layer (type:depth-idx) Output Shape Param #
|
| 36 |
+
====================================================================================================
|
| 37 |
+
GPT [1, 1, 50304] --
|
| 38 |
+
├─FactorizedTokenEmbedding: 1-3 -- (recursive)
|
| 39 |
+
│ └─Embedding: 2-1 [1, 1024, 128] 6,438,912
|
| 40 |
+
│ └─Linear: 2-2 [1, 1024, 384] 49,152
|
| 41 |
+
├─ModuleDict: 1-2 -- --
|
| 42 |
+
│ └─Dropout: 2-3 [1, 1024, 384] --
|
| 43 |
+
│ └─ModuleList: 2-4 -- --
|
| 44 |
+
│ │ └─Block: 3-1 [1, 1024, 384] --
|
| 45 |
+
│ │ │ └─RMSNorm: 4-1 [1, 1024, 384] 384
|
| 46 |
+
│ │ │ └─CausalSelfAttention: 4-2 [1, 1024, 384] --
|
| 47 |
+
│ │ │ │ └─Linear: 5-1 [1, 1024, 384] 147,456
|
| 48 |
+
│ │ │ │ └─Linear: 5-2 [1, 1024, 128] 49,152
|
| 49 |
+
│ │ │ │ └─Linear: 5-3 [1, 1024, 128] 49,152
|
| 50 |
+
│ │ │ │ └─RotaryEmbedding: 5-4 [1, 1, 1024, 64] --
|
| 51 |
+
│ │ │ │ └─Linear: 5-5 [1, 1024, 384] 147,456
|
| 52 |
+
│ │ │ │ └─Dropout: 5-6 [1, 1024, 384] --
|
| 53 |
+
│ │ │ └─RMSNorm: 4-3 [1, 1024, 384] 384
|
| 54 |
+
│ │ │ └─MLP: 4-4 [1, 1024, 384] --
|
| 55 |
+
│ │ │ │ └─Linear: 5-7 [1, 1024, 2048] 786,432
|
| 56 |
+
│ │ │ │ └─Linear: 5-8 [1, 1024, 384] 393,216
|
| 57 |
+
│ │ │ │ └─Dropout: 5-9 [1, 1024, 384] --
|
| 58 |
+
│ │ └─Block: 3-2 [1, 1024, 384] --
|
| 59 |
+
│ │ │ └─RMSNorm: 4-5 [1, 1024, 384] 384
|
| 60 |
+
│ │ │ └─CausalSelfAttention: 4-6 [1, 1024, 384] --
|
| 61 |
+
│ │ │ │ └─Linear: 5-10 [1, 1024, 384] 147,456
|
| 62 |
+
│ │ │ │ └─Linear: 5-11 [1, 1024, 128] 49,152
|
| 63 |
+
│ │ │ │ └─Linear: 5-12 [1, 1024, 128] 49,152
|
| 64 |
+
│ │ │ │ └─RotaryEmbedding: 5-13 [1, 1, 1024, 64] --
|
| 65 |
+
│ │ │ │ └─Linear: 5-14 [1, 1024, 384] 147,456
|
| 66 |
+
│ │ │ │ └─Dropout: 5-15 [1, 1024, 384] --
|
| 67 |
+
│ │ │ └─RMSNorm: 4-7 [1, 1024, 384] 384
|
| 68 |
+
│ │ │ └─MLP: 4-8 [1, 1024, 384] --
|
| 69 |
+
│ │ │ │ └─Linear: 5-16 [1, 1024, 2048] 786,432
|
| 70 |
+
│ │ │ │ └─Linear: 5-17 [1, 1024, 384] 393,216
|
| 71 |
+
│ │ │ │ └─Dropout: 5-18 [1, 1024, 384] --
|
| 72 |
+
│ │ └─Block: 3-3 [1, 1024, 384] --
|
| 73 |
+
│ │ │ └─RMSNorm: 4-9 [1, 1024, 384] 384
|
| 74 |
+
│ │ │ └─CausalSelfAttention: 4-10 [1, 1024, 384] --
|
| 75 |
+
│ │ │ │ └─Linear: 5-19 [1, 1024, 384] 147,456
|
| 76 |
+
│ │ │ │ └─Linear: 5-20 [1, 1024, 128] 49,152
|
| 77 |
+
│ │ │ │ └─Linear: 5-21 [1, 1024, 128] 49,152
|
| 78 |
+
│ │ │ │ └─RotaryEmbedding: 5-22 [1, 1, 1024, 64] --
|
| 79 |
+
│ │ │ │ └─Linear: 5-23 [1, 1024, 384] 147,456
|
| 80 |
+
│ │ │ │ └─Dropout: 5-24 [1, 1024, 384] --
|
| 81 |
+
│ │ │ └─RMSNorm: 4-11 [1, 1024, 384] 384
|
| 82 |
+
│ │ │ └─MLP: 4-12 [1, 1024, 384] --
|
| 83 |
+
│ │ │ │ └─Linear: 5-25 [1, 1024, 2048] 786,432
|
| 84 |
+
│ │ │ │ └─Linear: 5-26 [1, 1024, 384] 393,216
|
| 85 |
+
│ │ │ │ └─Dropout: 5-27 [1, 1024, 384] --
|
| 86 |
+
│ │ └─Block: 3-4 [1, 1024, 384] --
|
| 87 |
+
│ │ │ └─RMSNorm: 4-13 [1, 1024, 384] 384
|
| 88 |
+
│ │ │ └─CausalSelfAttention: 4-14 [1, 1024, 384] --
|
| 89 |
+
│ │ │ │ └─Linear: 5-28 [1, 1024, 384] 147,456
|
| 90 |
+
│ │ │ │ └─Linear: 5-29 [1, 1024, 128] 49,152
|
| 91 |
+
│ │ │ │ └─Linear: 5-30 [1, 1024, 128] 49,152
|
| 92 |
+
│ │ │ │ └─RotaryEmbedding: 5-31 [1, 1, 1024, 64] --
|
| 93 |
+
│ │ │ │ └─Linear: 5-32 [1, 1024, 384] 147,456
|
| 94 |
+
│ │ │ │ └─Dropout: 5-33 [1, 1024, 384] --
|
| 95 |
+
│ │ │ └─RMSNorm: 4-15 [1, 1024, 384] 384
|
| 96 |
+
│ │ │ └─MLP: 4-16 [1, 1024, 384] --
|
| 97 |
+
│ │ │ │ └─Linear: 5-34 [1, 1024, 2048] 786,432
|
| 98 |
+
│ │ │ │ └─Linear: 5-35 [1, 1024, 384] 393,216
|
| 99 |
+
│ │ │ │ └─Dropout: 5-36 [1, 1024, 384] --
|
| 100 |
+
│ │ └─Block: 3-5 [1, 1024, 384] --
|
| 101 |
+
│ │ │ └─RMSNorm: 4-17 [1, 1024, 384] 384
|
| 102 |
+
│ │ │ └─CausalSelfAttention: 4-18 [1, 1024, 384] --
|
| 103 |
+
│ │ │ │ └─Linear: 5-37 [1, 1024, 384] 147,456
|
| 104 |
+
│ │ │ │ └─Linear: 5-38 [1, 1024, 128] 49,152
|
| 105 |
+
│ │ │ │ └─Linear: 5-39 [1, 1024, 128] 49,152
|
| 106 |
+
│ │ │ │ └─RotaryEmbedding: 5-40 [1, 1, 1024, 64] --
|
| 107 |
+
│ │ │ │ └─Linear: 5-41 [1, 1024, 384] 147,456
|
| 108 |
+
│ │ │ │ └─Dropout: 5-42 [1, 1024, 384] --
|
| 109 |
+
│ │ │ └─RMSNorm: 4-19 [1, 1024, 384] 384
|
| 110 |
+
│ │ │ └─MLP: 4-20 [1, 1024, 384] --
|
| 111 |
+
│ │ │ │ └─Linear: 5-43 [1, 1024, 2048] 786,432
|
| 112 |
+
│ │ │ │ └─Linear: 5-44 [1, 1024, 384] 393,216
|
| 113 |
+
│ │ │ │ └─Dropout: 5-45 [1, 1024, 384] --
|
| 114 |
+
│ │ └─Block: 3-6 [1, 1024, 384] --
|
| 115 |
+
│ │ │ └─RMSNorm: 4-21 [1, 1024, 384] 384
|
| 116 |
+
│ │ │ └─CausalSelfAttention: 4-22 [1, 1024, 384] --
|
| 117 |
+
│ │ │ │ └─Linear: 5-46 [1, 1024, 384] 147,456
|
| 118 |
+
│ │ │ │ └─Linear: 5-47 [1, 1024, 128] 49,152
|
| 119 |
+
│ │ │ │ └─Linear: 5-48 [1, 1024, 128] 49,152
|
| 120 |
+
│ │ │ │ └─RotaryEmbedding: 5-49 [1, 1, 1024, 64] --
|
| 121 |
+
│ │ │ │ └─Linear: 5-50 [1, 1024, 384] 147,456
|
| 122 |
+
│ │ │ │ └─Dropout: 5-51 [1, 1024, 384] --
|
| 123 |
+
│ │ │ └─RMSNorm: 4-23 [1, 1024, 384] 384
|
| 124 |
+
│ │ │ └─MLP: 4-24 [1, 1024, 384] --
|
| 125 |
+
│ │ │ │ └─Linear: 5-52 [1, 1024, 2048] 786,432
|
| 126 |
+
│ │ │ │ └─Linear: 5-53 [1, 1024, 384] 393,216
|
| 127 |
+
│ │ │ │ └─Dropout: 5-54 [1, 1024, 384] --
|
| 128 |
+
│ │ └─Block: 3-7 [1, 1024, 384] --
|
| 129 |
+
│ │ │ └─RMSNorm: 4-25 [1, 1024, 384] 384
|
| 130 |
+
│ │ │ └─CausalSelfAttention: 4-26 [1, 1024, 384] --
|
| 131 |
+
│ │ │ │ └─Linear: 5-55 [1, 1024, 384] 147,456
|
| 132 |
+
│ │ │ │ └─Linear: 5-56 [1, 1024, 128] 49,152
|
| 133 |
+
│ │ │ │ └─Linear: 5-57 [1, 1024, 128] 49,152
|
| 134 |
+
│ │ │ │ └─RotaryEmbedding: 5-58 [1, 1, 1024, 64] --
|
| 135 |
+
│ │ │ │ └─Linear: 5-59 [1, 1024, 384] 147,456
|
| 136 |
+
│ │ │ │ └─Dropout: 5-60 [1, 1024, 384] --
|
| 137 |
+
│ │ │ └─RMSNorm: 4-27 [1, 1024, 384] 384
|
| 138 |
+
│ │ │ └─MLP: 4-28 [1, 1024, 384] --
|
| 139 |
+
│ │ │ │ └─Linear: 5-61 [1, 1024, 2048] 786,432
|
| 140 |
+
│ │ │ │ └─Linear: 5-62 [1, 1024, 384] 393,216
|
| 141 |
+
│ │ │ │ └─Dropout: 5-63 [1, 1024, 384] --
|
| 142 |
+
│ │ └─Block: 3-8 [1, 1024, 384] --
|
| 143 |
+
│ │ │ └─RMSNorm: 4-29 [1, 1024, 384] 384
|
| 144 |
+
│ │ │ └─CausalSelfAttention: 4-30 [1, 1024, 384] --
|
| 145 |
+
│ │ │ │ └─Linear: 5-64 [1, 1024, 384] 147,456
|
| 146 |
+
│ │ │ │ └─Linear: 5-65 [1, 1024, 128] 49,152
|
| 147 |
+
│ │ │ │ └─Linear: 5-66 [1, 1024, 128] 49,152
|
| 148 |
+
│ │ │ │ └─RotaryEmbedding: 5-67 [1, 1, 1024, 64] --
|
| 149 |
+
│ │ │ │ └─Linear: 5-68 [1, 1024, 384] 147,456
|
| 150 |
+
│ │ │ │ └─Dropout: 5-69 [1, 1024, 384] --
|
| 151 |
+
│ │ │ └─RMSNorm: 4-31 [1, 1024, 384] 384
|
| 152 |
+
│ │ │ └─MLP: 4-32 [1, 1024, 384] --
|
| 153 |
+
│ │ │ │ └─Linear: 5-70 [1, 1024, 2048] 786,432
|
| 154 |
+
│ │ │ │ └─Linear: 5-71 [1, 1024, 384] 393,216
|
| 155 |
+
│ │ │ │ └─Dropout: 5-72 [1, 1024, 384] --
|
| 156 |
+
│ │ └─Block: 3-9 [1, 1024, 384] --
|
| 157 |
+
│ │ │ └─RMSNorm: 4-33 [1, 1024, 384] 384
|
| 158 |
+
│ │ │ └─CausalSelfAttention: 4-34 [1, 1024, 384] --
|
| 159 |
+
│ │ │ │ └─Linear: 5-73 [1, 1024, 384] 147,456
|
| 160 |
+
│ │ │ │ └─Linear: 5-74 [1, 1024, 128] 49,152
|
| 161 |
+
│ │ │ │ └─Linear: 5-75 [1, 1024, 128] 49,152
|
| 162 |
+
│ │ │ │ └─RotaryEmbedding: 5-76 [1, 1, 1024, 64] --
|
| 163 |
+
│ │ │ │ └─Linear: 5-77 [1, 1024, 384] 147,456
|
| 164 |
+
│ │ │ │ └─Dropout: 5-78 [1, 1024, 384] --
|
| 165 |
+
│ │ │ └─RMSNorm: 4-35 [1, 1024, 384] 384
|
| 166 |
+
│ │ │ └─MLP: 4-36 [1, 1024, 384] --
|
| 167 |
+
│ │ │ │ └─Linear: 5-79 [1, 1024, 2048] 786,432
|
| 168 |
+
│ │ │ │ └─Linear: 5-80 [1, 1024, 384] 393,216
|
| 169 |
+
│ │ │ │ └─Dropout: 5-81 [1, 1024, 384] --
|
| 170 |
+
│ │ └─Block: 3-10 [1, 1024, 384] --
|
| 171 |
+
│ │ │ └─RMSNorm: 4-37 [1, 1024, 384] 384
|
| 172 |
+
│ │ │ └─CausalSelfAttention: 4-38 [1, 1024, 384] --
|
| 173 |
+
│ │ │ │ └─Linear: 5-82 [1, 1024, 384] 147,456
|
| 174 |
+
│ │ │ │ └─Linear: 5-83 [1, 1024, 128] 49,152
|
| 175 |
+
│ │ │ │ └─Linear: 5-84 [1, 1024, 128] 49,152
|
| 176 |
+
│ │ │ │ └─RotaryEmbedding: 5-85 [1, 1, 1024, 64] --
|
| 177 |
+
│ │ │ │ └─Linear: 5-86 [1, 1024, 384] 147,456
|
| 178 |
+
│ │ │ │ └─Dropout: 5-87 [1, 1024, 384] --
|
| 179 |
+
│ │ │ └─RMSNorm: 4-39 [1, 1024, 384] 384
|
| 180 |
+
│ │ │ └─MLP: 4-40 [1, 1024, 384] --
|
| 181 |
+
│ │ │ │ └─Linear: 5-88 [1, 1024, 2048] 786,432
|
| 182 |
+
│ │ │ │ └─Linear: 5-89 [1, 1024, 384] 393,216
|
| 183 |
+
│ │ │ │ └─Dropout: 5-90 [1, 1024, 384] --
|
| 184 |
+
│ │ └─Block: 3-11 [1, 1024, 384] --
|
| 185 |
+
│ │ │ └─RMSNorm: 4-41 [1, 1024, 384] 384
|
| 186 |
+
│ │ │ └─CausalSelfAttention: 4-42 [1, 1024, 384] --
|
| 187 |
+
│ │ │ │ └─Linear: 5-91 [1, 1024, 384] 147,456
|
| 188 |
+
│ │ │ │ └─Linear: 5-92 [1, 1024, 128] 49,152
|
| 189 |
+
│ │ │ │ └─Linear: 5-93 [1, 1024, 128] 49,152
|
| 190 |
+
│ │ │ │ └─RotaryEmbedding: 5-94 [1, 1, 1024, 64] --
|
| 191 |
+
│ │ │ │ └─Linear: 5-95 [1, 1024, 384] 147,456
|
| 192 |
+
│ │ │ │ └─Dropout: 5-96 [1, 1024, 384] --
|
| 193 |
+
│ │ │ └─RMSNorm: 4-43 [1, 1024, 384] 384
|
| 194 |
+
│ │ │ └─MLP: 4-44 [1, 1024, 384] --
|
| 195 |
+
│ │ │ │ └─Linear: 5-97 [1, 1024, 2048] 786,432
|
| 196 |
+
│ │ │ │ └─Linear: 5-98 [1, 1024, 384] 393,216
|
| 197 |
+
│ │ │ │ └─Dropout: 5-99 [1, 1024, 384] --
|
| 198 |
+
│ │ └─Block: 3-12 [1, 1024, 384] --
|
| 199 |
+
│ │ │ └─RMSNorm: 4-45 [1, 1024, 384] 384
|
| 200 |
+
│ │ │ └─CausalSelfAttention: 4-46 [1, 1024, 384] --
|
| 201 |
+
│ │ │ │ └─Linear: 5-100 [1, 1024, 384] 147,456
|
| 202 |
+
│ │ │ │ └─Linear: 5-101 [1, 1024, 128] 49,152
|
| 203 |
+
│ │ │ │ └─Linear: 5-102 [1, 1024, 128] 49,152
|
| 204 |
+
│ │ │ │ └─RotaryEmbedding: 5-103 [1, 1, 1024, 64] --
|
| 205 |
+
│ │ │ │ └─Linear: 5-104 [1, 1024, 384] 147,456
|
| 206 |
+
│ │ │ │ └─Dropout: 5-105 [1, 1024, 384] --
|
| 207 |
+
│ │ │ └─RMSNorm: 4-47 [1, 1024, 384] 384
|
| 208 |
+
│ │ │ └─MLP: 4-48 [1, 1024, 384] --
|
| 209 |
+
│ │ │ │ └─Linear: 5-106 [1, 1024, 2048] 786,432
|
| 210 |
+
│ │ │ │ └─Linear: 5-107 [1, 1024, 384] 393,216
|
| 211 |
+
│ │ │ │ └─Dropout: 5-108 [1, 1024, 384] --
|
| 212 |
+
│ │ └─Block: 3-13 [1, 1024, 384] --
|
| 213 |
+
│ │ │ └─RMSNorm: 4-49 [1, 1024, 384] 384
|
| 214 |
+
│ │ │ └─CausalSelfAttention: 4-50 [1, 1024, 384] --
|
| 215 |
+
│ │ │ │ └─Linear: 5-109 [1, 1024, 384] 147,456
|
| 216 |
+
│ │ │ │ └─Linear: 5-110 [1, 1024, 128] 49,152
|
| 217 |
+
│ │ │ │ └─Linear: 5-111 [1, 1024, 128] 49,152
|
| 218 |
+
│ │ │ │ └─RotaryEmbedding: 5-112 [1, 1, 1024, 64] --
|
| 219 |
+
│ │ │ │ └─Linear: 5-113 [1, 1024, 384] 147,456
|
| 220 |
+
│ │ │ │ └─Dropout: 5-114 [1, 1024, 384] --
|
| 221 |
+
│ │ │ └─RMSNorm: 4-51 [1, 1024, 384] 384
|
| 222 |
+
│ │ │ └─MLP: 4-52 [1, 1024, 384] --
|
| 223 |
+
│ │ │ │ └─Linear: 5-115 [1, 1024, 2048] 786,432
|
| 224 |
+
│ │ │ │ └─Linear: 5-116 [1, 1024, 384] 393,216
|
| 225 |
+
│ │ │ │ └─Dropout: 5-117 [1, 1024, 384] --
|
| 226 |
+
│ │ └─Block: 3-14 [1, 1024, 384] --
|
| 227 |
+
│ │ │ └─RMSNorm: 4-53 [1, 1024, 384] 384
|
| 228 |
+
│ │ │ └─CausalSelfAttention: 4-54 [1, 1024, 384] --
|
| 229 |
+
│ │ │ │ └─Linear: 5-118 [1, 1024, 384] 147,456
|
| 230 |
+
│ │ │ │ └─Linear: 5-119 [1, 1024, 128] 49,152
|
| 231 |
+
│ │ │ │ └─Linear: 5-120 [1, 1024, 128] 49,152
|
| 232 |
+
│ │ │ │ └─RotaryEmbedding: 5-121 [1, 1, 1024, 64] --
|
| 233 |
+
│ │ │ │ └─Linear: 5-122 [1, 1024, 384] 147,456
|
| 234 |
+
│ │ │ │ └─Dropout: 5-123 [1, 1024, 384] --
|
| 235 |
+
│ │ │ └─RMSNorm: 4-55 [1, 1024, 384] 384
|
| 236 |
+
│ │ │ └─MLP: 4-56 [1, 1024, 384] --
|
| 237 |
+
│ │ │ │ └─Linear: 5-124 [1, 1024, 2048] 786,432
|
| 238 |
+
│ │ │ │ └─Linear: 5-125 [1, 1024, 384] 393,216
|
| 239 |
+
│ │ │ │ └─Dropout: 5-126 [1, 1024, 384] --
|
| 240 |
+
│ │ └─Block: 3-15 [1, 1024, 384] --
|
| 241 |
+
│ │ │ └─RMSNorm: 4-57 [1, 1024, 384] 384
|
| 242 |
+
│ │ │ └─CausalSelfAttention: 4-58 [1, 1024, 384] --
|
| 243 |
+
│ │ │ │ └─Linear: 5-127 [1, 1024, 384] 147,456
|
| 244 |
+
│ │ │ │ └─Linear: 5-128 [1, 1024, 128] 49,152
|
| 245 |
+
│ │ │ │ └─Linear: 5-129 [1, 1024, 128] 49,152
|
| 246 |
+
│ │ │ │ └─RotaryEmbedding: 5-130 [1, 1, 1024, 64] --
|
| 247 |
+
│ │ │ │ └─Linear: 5-131 [1, 1024, 384] 147,456
|
| 248 |
+
│ │ │ │ └─Dropout: 5-132 [1, 1024, 384] --
|
| 249 |
+
│ │ │ └─RMSNorm: 4-59 [1, 1024, 384] 384
|
| 250 |
+
│ │ │ └─MLP: 4-60 [1, 1024, 384] --
|
| 251 |
+
│ │ │ │ └─Linear: 5-133 [1, 1024, 2048] 786,432
|
| 252 |
+
│ │ │ │ └─Linear: 5-134 [1, 1024, 384] 393,216
|
| 253 |
+
│ │ │ │ └─Dropout: 5-135 [1, 1024, 384] --
|
| 254 |
+
│ │ └─Block: 3-16 [1, 1024, 384] --
|
| 255 |
+
│ │ │ └─RMSNorm: 4-61 [1, 1024, 384] 384
|
| 256 |
+
│ │ │ └─CausalSelfAttention: 4-62 [1, 1024, 384] --
|
| 257 |
+
│ │ │ │ └─Linear: 5-136 [1, 1024, 384] 147,456
|
| 258 |
+
│ │ │ │ └─Linear: 5-137 [1, 1024, 128] 49,152
|
| 259 |
+
│ │ │ │ └─Linear: 5-138 [1, 1024, 128] 49,152
|
| 260 |
+
│ │ │ │ └─RotaryEmbedding: 5-139 [1, 1, 1024, 64] --
|
| 261 |
+
│ │ │ │ └─Linear: 5-140 [1, 1024, 384] 147,456
|
| 262 |
+
│ │ │ │ └─Dropout: 5-141 [1, 1024, 384] --
|
| 263 |
+
│ │ │ └─RMSNorm: 4-63 [1, 1024, 384] 384
|
| 264 |
+
│ │ │ └─MLP: 4-64 [1, 1024, 384] --
|
| 265 |
+
│ │ │ │ └─Linear: 5-142 [1, 1024, 2048] 786,432
|
| 266 |
+
│ │ │ │ └─Linear: 5-143 [1, 1024, 384] 393,216
|
| 267 |
+
│ │ │ │ └─Dropout: 5-144 [1, 1024, 384] --
|
| 268 |
+
│ │ └─Block: 3-17 [1, 1024, 384] --
|
| 269 |
+
│ │ │ └─RMSNorm: 4-65 [1, 1024, 384] 384
|
| 270 |
+
│ │ │ └─CausalSelfAttention: 4-66 [1, 1024, 384] --
|
| 271 |
+
│ │ │ │ └─Linear: 5-145 [1, 1024, 384] 147,456
|
| 272 |
+
│ │ │ │ └─Linear: 5-146 [1, 1024, 128] 49,152
|
| 273 |
+
│ │ │ │ └─Linear: 5-147 [1, 1024, 128] 49,152
|
| 274 |
+
│ │ │ │ └─RotaryEmbedding: 5-148 [1, 1, 1024, 64] --
|
| 275 |
+
│ │ │ │ └─Linear: 5-149 [1, 1024, 384] 147,456
|
| 276 |
+
│ │ │ │ └─Dropout: 5-150 [1, 1024, 384] --
|
| 277 |
+
│ │ │ └─RMSNorm: 4-67 [1, 1024, 384] 384
|
| 278 |
+
│ │ │ └─MLP: 4-68 [1, 1024, 384] --
|
| 279 |
+
│ │ │ │ └─Linear: 5-151 [1, 1024, 2048] 786,432
|
| 280 |
+
│ │ │ │ └─Linear: 5-152 [1, 1024, 384] 393,216
|
| 281 |
+
│ │ │ │ └─Dropout: 5-153 [1, 1024, 384] --
|
| 282 |
+
│ │ └─Block: 3-18 [1, 1024, 384] --
|
| 283 |
+
│ │ │ └─RMSNorm: 4-69 [1, 1024, 384] 384
|
| 284 |
+
│ │ │ └─CausalSelfAttention: 4-70 [1, 1024, 384] --
|
| 285 |
+
│ │ │ │ └─Linear: 5-154 [1, 1024, 384] 147,456
|
| 286 |
+
│ │ │ │ └─Linear: 5-155 [1, 1024, 128] 49,152
|
| 287 |
+
│ │ │ │ └─Linear: 5-156 [1, 1024, 128] 49,152
|
| 288 |
+
│ │ │ │ └─RotaryEmbedding: 5-157 [1, 1, 1024, 64] --
|
| 289 |
+
│ │ │ │ └─Linear: 5-158 [1, 1024, 384] 147,456
|
| 290 |
+
│ │ │ │ └─Dropout: 5-159 [1, 1024, 384] --
|
| 291 |
+
│ │ │ └─RMSNorm: 4-71 [1, 1024, 384] 384
|
| 292 |
+
│ │ │ └─MLP: 4-72 [1, 1024, 384] --
|
| 293 |
+
│ │ │ │ └─Linear: 5-160 [1, 1024, 2048] 786,432
|
| 294 |
+
│ │ │ │ └─Linear: 5-161 [1, 1024, 384] 393,216
|
| 295 |
+
│ │ │ │ └─Dropout: 5-162 [1, 1024, 384] --
|
| 296 |
+
│ │ └─Block: 3-19 [1, 1024, 384] --
|
| 297 |
+
│ │ │ └─RMSNorm: 4-73 [1, 1024, 384] 384
|
| 298 |
+
│ │ │ └─CausalSelfAttention: 4-74 [1, 1024, 384] --
|
| 299 |
+
│ │ │ │ └─Linear: 5-163 [1, 1024, 384] 147,456
|
| 300 |
+
│ │ │ │ └─Linear: 5-164 [1, 1024, 128] 49,152
|
| 301 |
+
│ │ │ │ └─Linear: 5-165 [1, 1024, 128] 49,152
|
| 302 |
+
│ │ │ │ └─RotaryEmbedding: 5-166 [1, 1, 1024, 64] --
|
| 303 |
+
│ │ │ │ └─Linear: 5-167 [1, 1024, 384] 147,456
|
| 304 |
+
│ │ │ │ └─Dropout: 5-168 [1, 1024, 384] --
|
| 305 |
+
│ │ │ └─RMSNorm: 4-75 [1, 1024, 384] 384
|
| 306 |
+
│ │ │ └─MLP: 4-76 [1, 1024, 384] --
|
| 307 |
+
│ │ │ │ └─Linear: 5-169 [1, 1024, 2048] 786,432
|
| 308 |
+
│ │ │ │ └─Linear: 5-170 [1, 1024, 384] 393,216
|
| 309 |
+
│ │ │ │ └─Dropout: 5-171 [1, 1024, 384] --
|
| 310 |
+
│ │ └─Block: 3-20 [1, 1024, 384] --
|
| 311 |
+
│ │ │ └─RMSNorm: 4-77 [1, 1024, 384] 384
|
| 312 |
+
│ │ │ └─CausalSelfAttention: 4-78 [1, 1024, 384] --
|
| 313 |
+
│ │ │ │ └─Linear: 5-172 [1, 1024, 384] 147,456
|
| 314 |
+
│ │ │ │ └─Linear: 5-173 [1, 1024, 128] 49,152
|
| 315 |
+
│ │ │ │ └─Linear: 5-174 [1, 1024, 128] 49,152
|
| 316 |
+
│ │ │ │ └─RotaryEmbedding: 5-175 [1, 1, 1024, 64] --
|
| 317 |
+
│ │ │ │ └─Linear: 5-176 [1, 1024, 384] 147,456
|
| 318 |
+
│ │ │ │ └─Dropout: 5-177 [1, 1024, 384] --
|
| 319 |
+
│ │ │ └─RMSNorm: 4-79 [1, 1024, 384] 384
|
| 320 |
+
│ │ │ └─MLP: 4-80 [1, 1024, 384] --
|
| 321 |
+
│ │ │ │ └─Linear: 5-178 [1, 1024, 2048] 786,432
|
| 322 |
+
│ │ │ │ └─Linear: 5-179 [1, 1024, 384] 393,216
|
| 323 |
+
│ │ │ │ └─Dropout: 5-180 [1, 1024, 384] --
|
| 324 |
+
│ └─RMSNorm: 2-5 [1, 1024, 384] 384
|
| 325 |
+
├─FactorizedTokenEmbedding: 1-3 -- (recursive)
|
| 326 |
+
│ └─Linear: 2-6 [1, 1, 128] 49,152
|
| 327 |
+
====================================================================================================
|
| 328 |
+
|
| 329 |
+
=== Parameter Counts (unique tensors) ===
|
| 330 |
+
Total params: 38,010,240
|
| 331 |
+
Trainable params: 38,010,240
|
| 332 |
+
Weight tying (wte = lm_head): True
|
| 333 |
+
Embedding mode: factorized tied token embedding
|
| 334 |
+
Note: module-level torchinfo totals may double-count the tied LM head; use the unique counts above.
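The unique parameter total above can be reproduced by hand from the layer shapes in the module printout. The following is a minimal sketch, not the repository's code: the function name `count_params` and its arguments are hypothetical, and it assumes bias-free linear layers, one weight vector per RMSNorm, key/value projections of width `n_kv_heads * head_dim`, and a fused gate/up `c_fc` of width `2 * mlp_hidden_dim`, all of which are consistent with the shapes shown.

```python
def count_params(vocab_size=50304, embedding_dim=128, n_embd=384,
                 n_layers=20, n_heads=6, n_kv_heads=2, mlp_hidden_dim=1024):
    """Count unique parameters from the printed layer shapes (hypothetical helper)."""
    head_dim = n_embd // n_heads          # 384 // 6 = 64
    kv_dim = n_kv_heads * head_dim        # 2 * 64 = 128, matches k_proj/v_proj

    # Factorized token embedding: E (vocab x 128), P_in (128 -> 384), P_out (384 -> 128).
    embedding = vocab_size * embedding_dim + 2 * embedding_dim * n_embd

    # Attention: q_proj and c_proj are n_embd x n_embd; k_proj and v_proj are n_embd x kv_dim.
    attn = 2 * n_embd * n_embd + 2 * n_embd * kv_dim

    # MLP: c_fc maps n_embd -> 2 * hidden (assumed gate + up), c_proj maps hidden -> n_embd.
    mlp = n_embd * 2 * mlp_hidden_dim + mlp_hidden_dim * n_embd

    # Two RMSNorm weight vectors per block.
    block = attn + mlp + 2 * n_embd

    # Final RMSNorm (ln_f) adds one more n_embd-sized vector; the tied LM head adds nothing.
    return embedding + n_layers * block + n_embd


print(count_params())  # 38010240
```

Each block contributes 1,573,632 parameters under these assumptions, so the 20 blocks account for 31,472,640 of the 38,010,240 total.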
|
config_snapshot.json
ADDED
|
@@ -0,0 +1,116 @@
| 1 |
+
{
|
| 2 |
+
"run": {
|
| 3 |
+
"name": "final_c4_20l384_factorized",
|
| 4 |
+
"artifacts_root": "artifacts/final_c4",
|
| 5 |
+
"resume": true,
|
| 6 |
+
"deterministic": false
|
| 7 |
+
},
|
| 8 |
+
"distributed": {
|
| 9 |
+
"enabled": true,
|
| 10 |
+
"backend": "nccl"
|
| 11 |
+
},
|
| 12 |
+
"preprocessing": {
|
| 13 |
+
"data_dir": "data",
|
| 14 |
+
"processed_dir": "data/processed_OWT",
|
| 15 |
+
"log_dir": "logs/preprocessing",
|
| 16 |
+
"train_split": 0.9,
|
| 17 |
+
"dataset_name": "openwebtext",
|
| 18 |
+
"dataset_config_name": null,
|
| 19 |
+
"dataset_split": "train",
|
| 20 |
+
"dataset_text_column": "text",
|
| 21 |
+
"dataset_repo_id": "huiting123/processedOWT",
|
| 22 |
+
"num_proc": 4,
|
| 23 |
+
"tokenization_num_proc": 0,
|
| 24 |
+
"tokenization_batch_size": 1000,
|
| 25 |
+
"tokenization_chunk_size": 100000,
|
| 26 |
+
"shard_write_batch_size": 5000,
|
| 27 |
+
"seed": 42,
|
| 28 |
+
"subset_size": 0,
|
| 29 |
+
"raw_data_path": null,
|
| 30 |
+
"test_data_path": null,
|
| 31 |
+
"skip_language_filter": false,
|
| 32 |
+
"skip_repetition_filter": false,
|
| 33 |
+
"skip_quality_filter": false,
|
| 34 |
+
"min_words": 100,
|
| 35 |
+
"max_words": 10000,
|
| 36 |
+
"max_non_ascii": 0.3,
|
| 37 |
+
"min_line_uniqueness": 0.7,
|
| 38 |
+
"min_sentence_uniqueness": 0.8,
|
| 39 |
+
"max_train_tokens": 0
|
| 40 |
+
},
|
| 41 |
+
"model": {
|
| 42 |
+
"vocab_size": 50304,
|
| 43 |
+
"n_layers": 20,
|
| 44 |
+
"n_heads": 6,
|
| 45 |
+
"n_kv_heads": 2,
|
| 46 |
+
"n_embd": 384,
|
| 47 |
+
"embedding_dim": 128,
|
| 48 |
+
"tie_embeddings": true,
|
| 49 |
+
"context_len": 1024,
|
| 50 |
+
"dropout": 0.0,
|
| 51 |
+
"bias": false,
|
| 52 |
+
"norm_type": "rmsnorm",
|
| 53 |
+
"norm_eps": 1e-05,
|
| 54 |
+
"positional_embedding": "rope",
|
| 55 |
+
"rope_theta": 10000.0,
|
| 56 |
+
"rope_fraction": 1.0,
|
| 57 |
+
"mlp_type": "swiglu",
|
| 58 |
+
"mlp_hidden_mult": 4.0,
|
| 59 |
+
"mlp_hidden_dim": 1024,
|
| 60 |
+
"qk_norm": false,
|
| 61 |
+
"block_style": "sequential"
|
| 62 |
+
},
|
| 63 |
+
"training": {
|
| 64 |
+
"seed": 0,
|
| 65 |
+
"learning_rate": 0.0012,
|
| 66 |
+
"min_lr": 0.00012,
|
| 67 |
+
"weight_decay": 0.03,
|
| 68 |
+
"beta1": 0.9,
|
| 69 |
+
"beta2": 0.95,
|
| 70 |
+
"grad_clip": 1.0,
|
| 71 |
+
"max_iters": 11586,
|
| 72 |
+
"warmup_steps": 116,
|
| 73 |
+
"lr_schedule": "wsd",
|
| 74 |
+
"wsd_stable_frac": 0.85,
|
| 75 |
+
"batch_size": 4,
|
| 76 |
+
"gradient_accumulation_steps": 16,
|
| 77 |
+
"dtype": "float16",
|
| 78 |
+
"device": "cuda",
|
| 79 |
+
"eval_step_interval": 500,
|
| 80 |
+
"eval_batches": 20,
|
| 81 |
+
"log_interval": 10,
|
| 82 |
+
"max_checkpoints": 5
|
| 83 |
+
},
|
| 84 |
+
"inference": {
|
| 85 |
+
"checkpoint": null,
|
| 86 |
+
"prompt": "",
|
| 87 |
+
"max_tokens": 100,
|
| 88 |
+
"temperature": 1.0,
|
| 89 |
+
"seed": 0,
|
| 90 |
+
"device": "auto",
|
| 91 |
+
"leaderboard": false
|
| 92 |
+
},
|
| 93 |
+
"post_training": {
|
| 94 |
+
"base_checkpoint": null,
|
| 95 |
+
"learning_rate": 1e-05,
|
| 96 |
+
"max_iters": 1000,
|
| 97 |
+
"checkpoint_dir": "checkpoints/post",
|
| 98 |
+
"log_dir": "logs/post"
|
| 99 |
+
},
|
| 100 |
+
"evaluation": {
|
| 101 |
+
"checkpoint": null,
|
| 102 |
+
"batch_size": 4,
|
| 103 |
+
"device": "auto",
|
| 104 |
+
"log_dir": "logs/evaluation"
|
| 105 |
+
},
|
| 106 |
+
"notifications": {
|
| 107 |
+
"enabled": false,
|
| 108 |
+
"smtp_host": "smtp.gmail.com",
|
| 109 |
+
"smtp_port": 587,
|
| 110 |
+
"smtp_user": "",
|
| 111 |
+
"to_addresses": [],
|
| 112 |
+
"cooldown_minutes": 5,
|
| 113 |
+
"periodic_status_hours": 4.0,
|
| 114 |
+
"disk_min_gb": 5.0
|
| 115 |
+
}
|
| 116 |
+
}
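The `training` section names a `wsd` (warmup, stable, decay) learning-rate schedule but does not spell out its shape. The sketch below is one plausible reading, not the project's implementation: the function `wsd_lr` is hypothetical, and it assumes a linear warmup to the peak rate, a constant phase lasting until `wsd_stable_frac * max_iters`, and a linear decay to `min_lr` at `max_iters`. It also works out the tokens consumed per optimizer step on a single process, assuming `batch_size` counts sequences of `context_len` tokens.

```python
def wsd_lr(step, peak_lr=0.0012, min_lr=0.00012, warmup_steps=116,
           max_iters=11586, stable_frac=0.85):
    """Learning rate at a given step under an ASSUMED linear WSD shape."""
    stable_end = int(stable_frac * max_iters)  # assumption: fraction of total steps
    if step < warmup_steps:
        # Linear warmup that reaches the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    if step >= max_iters:
        return min_lr
    # Linear interpolation from peak_lr down to min_lr over the decay phase.
    progress = (step - stable_end) / (max_iters - stable_end)
    return peak_lr + (min_lr - peak_lr) * progress


# Tokens per optimizer step on one process (world size is not given in the config).
tokens_per_step = 4 * 16 * 1024  # batch_size * gradient_accumulation_steps * context_len

for s in (0, 116, 5000, 10000, 11586):
    print(s, wsd_lr(s))
print("tokens per step per process:", tokens_per_step)
```

Because `distributed.enabled` is true, the global batch would be this per-process figure multiplied by the number of processes, which the snapshot does not record.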
|
eval_metrics.jsonl
ADDED
|
@@ -0,0 +1,23 @@
| 1 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 500, "epoch": 0, "val_loss": 6.271504926681518, "val_ppl": 529.273296333478, "is_best": true, "timestamp": "2026-05-04T17:27:01.886823"}
|
| 2 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 1000, "epoch": 0, "val_loss": 5.651372718811035, "val_ppl": 284.68198604277603, "is_best": true, "timestamp": "2026-05-04T17:31:09.125462"}
|
| 3 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 1500, "epoch": 0, "val_loss": 4.958203363418579, "val_ppl": 142.33783666870428, "is_best": true, "timestamp": "2026-05-04T17:35:16.000868"}
|
| 4 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 2000, "epoch": 0, "val_loss": 4.731179404258728, "val_ppl": 113.42926244512242, "is_best": true, "timestamp": "2026-05-04T17:39:23.507721"}
|
| 5 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 2500, "epoch": 0, "val_loss": 4.649809455871582, "val_ppl": 104.5650594206581, "is_best": true, "timestamp": "2026-05-04T17:43:30.119838"}
|
| 6 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 3000, "epoch": 0, "val_loss": 4.432038378715515, "val_ppl": 84.10267540960363, "is_best": true, "timestamp": "2026-05-04T17:47:36.176655"}
|
| 7 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 3500, "epoch": 0, "val_loss": 4.374273097515106, "val_ppl": 79.38211551790492, "is_best": true, "timestamp": "2026-05-04T17:51:41.894337"}
|
| 8 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 4000, "epoch": 0, "val_loss": 4.343928050994873, "val_ppl": 77.00944302285464, "is_best": true, "timestamp": "2026-05-04T17:55:48.866524"}
|
| 9 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 4500, "epoch": 0, "val_loss": 4.245097875595093, "val_ppl": 69.76258786456656, "is_best": true, "timestamp": "2026-05-04T17:59:54.744666"}
|
| 10 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 5000, "epoch": 0, "val_loss": 4.193927860260009, "val_ppl": 66.28262922741254, "is_best": true, "timestamp": "2026-05-04T18:04:01.685011"}
|
| 11 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 5500, "epoch": 0, "val_loss": 4.268809962272644, "val_ppl": 71.4365727985318, "is_best": false, "timestamp": "2026-05-04T18:08:10.634131"}
|
| 12 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 6000, "epoch": 0, "val_loss": 4.158704721927643, "val_ppl": 63.988585886299255, "is_best": true, "timestamp": "2026-05-04T18:12:17.883599"}
|
| 13 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 6500, "epoch": 0, "val_loss": 4.159478271007538, "val_ppl": 64.03810334765954, "is_best": false, "timestamp": "2026-05-04T18:16:24.868373"}
|
| 14 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 7000, "epoch": 0, "val_loss": 4.037256824970245, "val_ppl": 56.670671814858714, "is_best": true, "timestamp": "2026-05-04T18:20:30.545643"}
|
| 15 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 7500, "epoch": 0, "val_loss": 4.1698023796081545, "val_ppl": 64.70266427800867, "is_best": false, "timestamp": "2026-05-04T18:24:38.429921"}
|
| 16 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 8000, "epoch": 0, "val_loss": 4.195084941387177, "val_ppl": 66.35936799467855, "is_best": false, "timestamp": "2026-05-04T18:28:44.671449"}
|
| 17 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 8500, "epoch": 0, "val_loss": 3.9605602622032166, "val_ppl": 52.486724040660924, "is_best": true, "timestamp": "2026-05-04T18:32:49.603235"}
|
| 18 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 9000, "epoch": 0, "val_loss": 3.9616266012191774, "val_ppl": 52.542722533708215, "is_best": false, "timestamp": "2026-05-04T18:36:56.436454"}
|
| 19 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 9500, "epoch": 0, "val_loss": 3.972209358215332, "val_ppl": 53.10172205919401, "is_best": false, "timestamp": "2026-05-04T18:41:02.290341"}
|
| 20 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 10000, "epoch": 0, "val_loss": 3.9732977986335754, "val_ppl": 53.15955158604953, "is_best": false, "timestamp": "2026-05-04T18:45:08.582269"}
|
| 21 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 10500, "epoch": 0, "val_loss": 3.9948221683502196, "val_ppl": 54.316180628900916, "is_best": false, "timestamp": "2026-05-04T18:49:14.546826"}
|
| 22 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 11000, "epoch": 0, "val_loss": 3.9498892426490784, "val_ppl": 51.92961492972043, "is_best": true, "timestamp": "2026-05-04T18:53:19.469593"}
|
| 23 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 11500, "epoch": 0, "val_loss": 4.01113086938858, "val_ppl": 55.209269747273446, "is_best": false, "timestamp": "2026-05-04T18:57:26.063063"}
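The records above can be checked mechanically. The sketch below is illustrative only: it embeds four of the logged rows, trimmed to the fields it uses, rather than reading the real file, and the helper names are hypothetical. It tests two things that appear to hold in the log, namely that `val_ppl` equals `exp(val_loss)` and that `is_best` marks a new running minimum of `val_loss`.

```python
import json
import math

# Four rows copied from the log above, trimmed to the fields used here.
SAMPLE = """
{"step": 500, "val_loss": 6.271504926681518, "val_ppl": 529.273296333478, "is_best": true}
{"step": 8500, "val_loss": 3.9605602622032166, "val_ppl": 52.486724040660924, "is_best": true}
{"step": 11000, "val_loss": 3.9498892426490784, "val_ppl": 51.92961492972043, "is_best": true}
{"step": 11500, "val_loss": 4.01113086938858, "val_ppl": 55.209269747273446, "is_best": false}
"""


def load_records(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def ppl_matches_loss(records, rel_tol=1e-6):
    """True if every val_ppl is exp(val_loss) within a relative tolerance."""
    return all(math.isclose(math.exp(r["val_loss"]), r["val_ppl"], rel_tol=rel_tol)
               for r in records)


def is_best_matches_running_min(records):
    """True if is_best is set exactly when val_loss beats all earlier rows."""
    best = math.inf
    for r in records:
        improved = r["val_loss"] < best
        if improved != r["is_best"]:
            return False
        best = min(best, r["val_loss"])
    return True


records = load_records(SAMPLE)
best = min(records, key=lambda r: r["val_loss"])
print("ppl consistent:", ppl_matches_loss(records))
print("is_best consistent:", is_best_matches_running_min(records))
print("best step:", best["step"])
```

On the full log the lowest validation loss is at step 11000 (3.9499), with the final evaluation at step 11500 slightly worse at 4.0111.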
|
events.jsonl
ADDED
|
@@ -0,0 +1,42 @@
| 1 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "model_summary", "total_params": 38010240, "trainable_params": 38010240, "weight_tied_lm_head": true, "timestamp": "2026-05-04T17:21:42.251782"}
|
| 2 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "config", "model": {"vocab_size": 50304, "n_layers": 20, "n_heads": 6, "n_kv_heads": 2, "n_embd": 384, "embedding_dim": 128, "tie_embeddings": true, "context_len": 1024, "dropout": 0.0, "bias": false, "norm_type": "rmsnorm", "norm_eps": 1e-05, "positional_embedding": "rope", "rope_theta": 10000.0, "rope_fraction": 1.0, "mlp_type": "swiglu", "mlp_hidden_mult": 4.0, "mlp_hidden_dim": 1024, "qk_norm": false, "block_style": "sequential"}, "training": {"seed": 0, "learning_rate": 0.0012, "min_lr": 0.00012, "weight_decay": 0.03, "beta1": 0.9, "beta2": 0.95, "grad_clip": 1.0, "max_iters": 11586, "warmup_steps": 116, "lr_schedule": "wsd", "wsd_stable_frac": 0.85, "batch_size": 4, "gradient_accumulation_steps": 16, "dtype": "float16", "device": "cuda", "eval_step_interval": 500, "eval_batches": 20, "log_interval": 10, "max_checkpoints": 5}, "distributed": {"enabled": true, "backend": "nccl"}, "timestamp": "2026-05-04T17:21:42.252083"}
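As a sanity check on the `model_summary` event, here is one parameter layout that reproduces the logged `total_params` of 38010240 from the config fields. The layout is inferred from the field names (`embedding_dim`, `n_kv_heads`, `mlp_type`, `bias`, `tie_embeddings`) and is an assumption. The log itself does not confirm where each matrix lives.

```python
# Values copied from the "config" event above.
vocab_size, embedding_dim, n_embd = 50304, 128, 384
n_layers, n_heads, n_kv_heads = 20, 6, 2
mlp_hidden_dim = 1024

head_dim = n_embd // n_heads      # 64
kv_dim = n_kv_heads * head_dim    # 128, shared K/V width under grouped-query attention

# ASSUMPTION: factorized embedding = vocab x 128 table, a 128 -> 384 up-projection,
# and a 384 -> 128 down-projection in front of the weight-tied LM head.
embedding = vocab_size * embedding_dim
projections = 2 * embedding_dim * n_embd

# ASSUMPTION: bias-free Q/K/V/O, three SwiGLU matrices, two RMSNorm gains per block.
attention = 2 * n_embd * n_embd + 2 * n_embd * kv_dim
mlp = 3 * n_embd * mlp_hidden_dim
norms = 2 * n_embd
per_layer = attention + mlp + norms

final_norm = n_embd
total = embedding + projections + n_layers * per_layer + final_norm
print(per_layer, total)
```

Under these assumptions the 20 blocks hold about 31.5M of the parameters and the factorized embedding about 6.5M.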
|
| 3 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0000500.pt", "timestamp": "2026-05-04T17:27:02.419684"}
|
| 4 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:27:02.935937"}
|
| 5 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 1000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0001000.pt", "timestamp": "2026-05-04T17:31:09.635408"}
|
| 6 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 1000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:31:10.482963"}
|
| 7 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 1500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0001500.pt", "timestamp": "2026-05-04T17:35:16.506092"}
|
| 8 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 1500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:35:17.472426"}
|
| 9 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 2000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0002000.pt", "timestamp": "2026-05-04T17:39:24.020989"}
|
| 10 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 2000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:39:24.865058"}
|
| 11 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 2500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0002500.pt", "timestamp": "2026-05-04T17:43:30.630334"}
|
| 12 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 2500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:43:31.448643"}
|
| 13 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 3000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0003000.pt", "timestamp": "2026-05-04T17:47:36.737006"}
|
| 14 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 3000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:47:37.549993"}
|
| 15 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 3500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0003500.pt", "timestamp": "2026-05-04T17:51:42.450427"}
|
| 16 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 3500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:51:43.224403"}
|
| 17 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 4000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0004000.pt", "timestamp": "2026-05-04T17:55:49.429802"}
|
| 18 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 4000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:55:50.291191"}
|
| 19 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 4500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0004500.pt", "timestamp": "2026-05-04T17:59:55.308516"}
|
| 20 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 4500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:59:56.152649"}
|
| 21 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 5000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0005000.pt", "timestamp": "2026-05-04T18:04:02.252289"}
|
| 22 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 5000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:04:03.107403"}
|
| 23 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 5500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0005500.pt", "timestamp": "2026-05-04T18:08:11.201926"}
|
| 24 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 6000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0006000.pt", "timestamp": "2026-05-04T18:12:18.445731"}
|
| 25 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 6000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:12:19.286583"}
|
| 26 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 6500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0006500.pt", "timestamp": "2026-05-04T18:16:25.432096"}
|
| 27 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 7000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0007000.pt", "timestamp": "2026-05-04T18:20:31.109071"}
|
| 28 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 7000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:20:31.952552"}
|
| 29 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 7500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0007500.pt", "timestamp": "2026-05-04T18:24:38.993199"}
|
| 30 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 8000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0008000.pt", "timestamp": "2026-05-04T18:28:45.232019"}
|
| 31 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 8500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0008500.pt", "timestamp": "2026-05-04T18:32:50.175086"}
|
| 32 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 8500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:32:51.037758"}
|
| 33 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 9000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0009000.pt", "timestamp": "2026-05-04T18:36:56.998151"}
|
| 34 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 9500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0009500.pt", "timestamp": "2026-05-04T18:41:02.851932"}
|
| 35 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 10000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0010000.pt", "timestamp": "2026-05-04T18:45:09.144053"}
|
| 36 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 10500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0010500.pt", "timestamp": "2026-05-04T18:49:15.107050"}
|
| 37 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011000.pt", "timestamp": "2026-05-04T18:53:20.032228"}
|
| 38 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:53:20.850126"}
|
| 39 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011500.pt", "timestamp": "2026-05-04T18:57:26.629012"}
|
| 40 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "final_checkpoint_saved", "step": 11586, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011586.pt", "best_val_loss_so_far": 3.9498892426490784, "timestamp": "2026-05-04T18:58:09.448498"}
|
| 41 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "metrics_plot_saved", "path": "artifacts/final_c4/final_c4_20l384_factorized/metrics.png", "timestamp": "2026-05-04T18:58:10.725749"}
|
| 42 |
+
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "results_doc_saved", "path": "artifacts/final_c4/final_c4_20l384_factorized/results.md", "timestamp": "2026-05-04T18:58:10.725896"}
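A minimal sketch of how the event log above could be summarized, assuming one JSON object per line as shown. The three sample records are trimmed copies of logged events, and `steps_by_event` is a hypothetical helper, not something the training code is known to provide.

```python
import json
from collections import defaultdict

# Three records trimmed from the events above; a real run would read events.jsonl.
sample = """\
{"event": "checkpoint_saved", "step": 10500}
{"event": "checkpoint_saved", "step": 11000}
{"event": "best_checkpoint_saved", "step": 11000}
"""

def steps_by_event(lines):
    """Group the `step` field of each JSONL record by its `event` name."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "step" in record:  # e.g. model_summary and config carry no step
            grouped[record["event"]].append(record["step"])
    return dict(grouped)

grouped = steps_by_event(sample.splitlines())
last_best = max(grouped["best_checkpoint_saved"])
print(grouped, last_best)
```

Every `best_checkpoint_saved` event logs the same `best_ckpt.pt` path, so the file presumably holds the weights from the last such event, here step 11000.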
|
logs/pretraining_20260504_172141.log
ADDED
|
The diff for this file is too large to render.
|
|
|
metrics.png
ADDED
|
Git LFS Details
|
results.md
ADDED
|
@@ -0,0 +1,24 @@
| 1 |
+
# Results: final_c4_20l384_factorized
|
| 2 |
+
|
| 3 |
+
Automatically generated after pretraining.
|
| 4 |
+
|
| 5 |
+
## Summary
|
| 6 |
+
- Model: `20L / 6H / 384d`
|
| 7 |
+
- Total parameters: `38010240`
|
| 8 |
+
- Last logged train step: `11580`
|
| 9 |
+
- Best validation loss: `3.9499`
|
| 10 |
+
- Best validation perplexity: `51.93`
|
| 11 |
+
- Last validation step: `11500`
|
| 12 |
+
- Learning rate: `0.0012`
|
| 13 |
+
- Effective tokens/update: `65536`
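The `Effective tokens/update` figure in the summary above is consistent with the product of three config values. The sketch below reproduces it, assuming a single process. `distributed.enabled` is true in the config, but the world size is not logged, so whether the report scales by it is unknown.

```python
# Values copied from the "training" and "model" sections of the config event.
batch_size = 4
gradient_accumulation_steps = 16
context_len = 1024
max_iters = 11586

# ASSUMPTION: one process; the log does not record the world size.
tokens_per_update = batch_size * gradient_accumulation_steps * context_len

# Upper bound on tokens seen over the whole run under the same assumption.
total_tokens = tokens_per_update * max_iters
print(tokens_per_update, total_tokens)
```

Under that assumption the run processes roughly 759M tokens over its 11586 updates.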
|
| 14 |
+
|
| 15 |
+
## Files
|
| 16 |
+
- [Config snapshot](config_snapshot.json)
|
| 17 |
+
- [Train metrics](train_metrics.jsonl)
|
| 18 |
+
- [Eval metrics](eval_metrics.jsonl)
|
| 19 |
+
- [Events](events.jsonl)
|
| 20 |
+
- [Metrics plot](metrics.png)
|
| 21 |
+
|
| 22 |
+
## Metrics Plot
|
| 23 |
+
|
| 24 |
+

|
train_metrics.jsonl
ADDED
|
The diff for this file is too large to render.
|
|
|