GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
VRAM: 102.0 GB

Loading google/t5-v1_1-xxl (fp16 → GPU)...
Loading weights: 100% 560/560 [00:25<00:00, 31.48it/s]
(Loader warning, repeated for lm_head.weight, encoder.embed_tokens.weight, and decoder.embed_tokens.weight: the config says to tie each of these to shared.weight, but all are present in the checkpoint, so they are NOT tied; set `tie_word_embeddings=False` to silence.)
Loaded in 31s, 11,398,524,928 params
VRAM used: 26.8 GB
d_model=4096, d_kv=64, d_ff=10240, heads=64, layers=24+24, ff=gated-gelu

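For reproducibility, the load step amounts to a few lines. A minimal sketch, assuming `transformers` with a CUDA device; per the warnings above, passing `tie_word_embeddings=False` would keep the loader quiet:

```python
import torch
from transformers import T5ForConditionalGeneration

# fp16 load straight to GPU, as in the log above
model = T5ForConditionalGeneration.from_pretrained(
    "google/t5-v1_1-xxl",
    torch_dtype=torch.float16,
).to("cuda")

print(f"params: {sum(p.numel() for p in model.parameters()):,}")
cfg = model.config
print(f"d_model={cfg.d_model}, d_kv={cfg.d_kv}, d_ff={cfg.d_ff}, "
      f"heads={cfg.num_heads}, layers={cfg.num_layers}+{cfg.num_decoder_layers}, "
      f"ff={cfg.feed_forward_proj}")
```
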
======================================================================
CATALOG
======================================================================
cross_attn_k  : 24 (E:0  D:24)   402,653,184  (4096, 4096)
cross_attn_o  : 24 (E:0  D:24)   402,653,184  (4096, 4096)
cross_attn_q  : 24 (E:0  D:24)   402,653,184  (4096, 4096)
cross_attn_v  : 24 (E:0  D:24)   402,653,184  (4096, 4096)
embedding     :  3 (E:1  D:1)    394,788,864  (32128, 4096)
mlp_down      : 48 (E:24 D:24) 2,013,265,920  (4096, 10240)
mlp_gate      : 48 (E:24 D:24) 2,013,265,920  (10240, 4096)
mlp_up        : 48 (E:24 D:24) 2,013,265,920  (10240, 4096)
other         :  1 (E:0  D:0)    131,596,288  (32128, 4096)
position_bias :  2 (E:1  D:1)          4,096  (32, 64)
self_attn_k   : 48 (E:24 D:24)   805,306,368  (4096, 4096)
self_attn_o   : 48 (E:24 D:24)   805,306,368  (4096, 4096)
self_attn_q   : 48 (E:24 D:24)   805,306,368  (4096, 4096)
self_attn_v   : 48 (E:24 D:24)   805,306,368  (4096, 4096)

Encoder layers: ALL 24
Decoder layers: ALL 24

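A hedged sketch of the catalog pass: walk the named 2-D weights and bucket them by role. The bucket names mirror the table; the name-matching rules are assumptions based on HF T5 module naming (`wi_0` read as gate, `wi_1` as up in the gated-gelu MLP), and the 1-D filter is why layer norms don't appear and "other" reduces to lm_head:

```python
from collections import defaultdict

buckets = defaultdict(lambda: dict(n=0, e=0, d=0, params=0, shapes=set()))
for name, p in model.named_parameters():
    if p.ndim < 2:                                   # skip layer norms etc.
        continue
    if "relative_attention_bias" in name:
        kind = "position_bias"
    elif "EncDecAttention" in name:
        kind = "cross_attn_" + name.split(".")[-2]   # q / k / v / o
    elif "SelfAttention" in name:
        kind = "self_attn_" + name.split(".")[-2]
    elif "wi_0" in name:
        kind = "mlp_gate"
    elif "wi_1" in name:
        kind = "mlp_up"
    elif ".wo." in name:
        kind = "mlp_down"
    elif "embed_tokens" in name or name.startswith("shared"):
        kind = "embedding"
    else:
        kind = "other"                               # e.g. lm_head
    b = buckets[kind]
    b["n"] += 1
    b["e"] += name.startswith("encoder")
    b["d"] += name.startswith("decoder")
    b["params"] += p.numel()
    b["shapes"].add(tuple(p.shape))

for kind in sorted(buckets):
    b = buckets[kind]
    print(f"{kind:14s}: {b['n']:3d} (E:{b['e']} D:{b['d']}) "
          f"{b['params']:>13,} {b['shapes']}")
```
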
======================================================================
SVD EFFECTIVE RANK
======================================================================
SVD done: 432 matrices in 395s

Type              SR       PR   Act%   R90      Cond
cross_attn_k    50.87  2025.26  0.876  2445  457570.2
cross_attn_o     6.08  2008.82  0.703  2596  910032.9
cross_attn_q    98.80  2104.42  0.910  2418  275050.8
cross_attn_v   230.39  2429.20  0.956  2552  101478.1
mlp_down        25.27  3078.82  0.983  3207     226.2
mlp_gate        67.93  3210.24  1.000  3214      60.7
mlp_up         247.30  3290.06  1.000  3209      35.5
self_attn_k     90.00  2098.88  0.879  2425  281622.0
self_attn_o     16.43  2012.25  0.772  2519  443902.2
self_attn_q     96.81  2087.90  0.891  2424  350901.5
self_attn_v    204.44  2331.76  0.947  2524  139741.5

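The log does not define the columns; one reading consistent with the magnitudes takes SR as stable rank, PR as participation ratio, R90 as the rank capturing 90% of squared-singular-value energy, Cond as σ_max/σ_min, and Act% as the share of singular values above a small relative floor (exact floor unknown). A sketch under those assumptions:

```python
import torch

def spectral_stats(W: torch.Tensor, floor: float = 1e-6):
    s = torch.linalg.svdvals(W.float())              # descending singular values
    s2 = s ** 2
    sr = (s2.sum() / s2[0]).item()                   # stable rank (assumed SR)
    pr = (s2.sum() ** 2 / (s2 ** 2).sum()).item()    # participation ratio (assumed PR)
    cum = torch.cumsum(s2, 0) / s2.sum()
    r90 = int((cum < 0.90).sum().item()) + 1         # rank at 90% energy (assumed R90)
    act = (s > s[0] * floor).float().mean().item()   # "active" fraction (assumed Act%)
    cond = (s[0] / s[-1]).item()                     # condition number
    return sr, pr, act, r90, cond
```
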
======================================================================
SPARSITY
======================================================================
Sparsity done in 1s
Values are the fraction of weight entries with |w| below each threshold.

Type            <1e-4   <1e-3   <0.01    <0.1
cross_attn_k   0.0008  0.0084  0.0828  0.6306
cross_attn_o   0.0007  0.0070  0.0696  0.5626
cross_attn_q   0.0063  0.0629  0.5249  1.0000
cross_attn_v   0.0012  0.0122  0.1196  0.7112
mlp_down       0.0008  0.0081  0.0804  0.6498
mlp_gate       0.0007  0.0072  0.0715  0.5990
mlp_up         0.0006  0.0064  0.0633  0.5192
self_attn_k    0.0009  0.0088  0.0870  0.6553
self_attn_o    0.0009  0.0092  0.0913  0.6542
self_attn_q    0.0071  0.0709  0.5737  1.0000
self_attn_v    0.0014  0.0136  0.1331  0.7307

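Each row reduces to a one-liner per matrix; a minimal sketch with the thresholds from the table header:

```python
import torch

def sparsity_fractions(W: torch.Tensor, thresholds=(1e-4, 1e-3, 1e-2, 1e-1)):
    # Fraction of entries whose magnitude falls below each threshold.
    a = W.detach().abs().float()
    return [(a < t).float().mean().item() for t in thresholds]
```
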
--- ENCODER vs DECODER SPARSITY (<0.1) ---
encoder self_attn_q  : 100.0%
decoder self_attn_q  : 100.0%
encoder self_attn_k  :  71.7%
decoder self_attn_k  :  59.4%
encoder self_attn_v  :  76.0%
decoder self_attn_v  :  70.1%
decoder cross_attn_q : 100.0%
decoder cross_attn_k :  63.1%
decoder cross_attn_v :  71.1%

======================================================================
QK MANIFOLD (eigvalsh on CPU)
======================================================================

--- ENCODER self-attention ---
L 0: SR=4.42, pos=2090(0.510), neg=2006(0.490), sym=1.1843, top=70.33 (1.7s)
L 1: SR=3.64, pos=2079(0.508), neg=2017(0.492), sym=1.2417, top=107.53 (1.6s)
L 2: SR=10.47, pos=2065(0.504), neg=2031(0.496), sym=1.2760, top=62.50 (1.6s)
L 3: SR=12.32, pos=2071(0.506), neg=2025(0.494), sym=1.3047, top=41.21 (1.7s)
L 4: SR=13.72, pos=2052(0.501), neg=2044(0.499), sym=1.2819, top=57.93 (1.7s)
L 5: SR=14.98, pos=2068(0.505), neg=2028(0.495), sym=1.3016, top=53.62 (1.6s)
L 6: SR=13.24, pos=2061(0.503), neg=2035(0.497), sym=1.2758, top=70.88 (1.7s)
L 7: SR=16.00, pos=2064(0.504), neg=2032(0.496), sym=1.2766, top=82.54 (1.7s)
L 8: SR=11.31, pos=2074(0.506), neg=2022(0.494), sym=1.2787, top=85.30 (1.7s)
L 9: SR=11.69, pos=2071(0.506), neg=2025(0.494), sym=1.2342, top=95.72 (1.7s)
L10: SR=12.37, pos=2058(0.502), neg=2038(0.498), sym=1.2403, top=135.32 (1.7s)
L11: SR=8.86, pos=2092(0.511), neg=2004(0.489), sym=1.2171, top=124.68 (1.6s)
L12: SR=11.30, pos=2078(0.507), neg=2018(0.493), sym=1.2221, top=152.47 (1.6s)
L13: SR=10.54, pos=2087(0.510), neg=2009(0.490), sym=1.2069, top=131.38 (1.6s)
L14: SR=7.88, pos=2084(0.509), neg=2012(0.491), sym=1.2023, top=133.98 (1.6s)
L15: SR=13.96, pos=2095(0.511), neg=2001(0.489), sym=1.2026, top=146.83 (1.7s)
L16: SR=17.25, pos=2112(0.516), neg=1984(0.484), sym=1.1775, top=141.57 (1.6s)
L17: SR=19.15, pos=2081(0.508), neg=2015(0.492), sym=1.1713, top=150.69 (1.6s)
L18: SR=21.13, pos=2082(0.508), neg=2014(0.492), sym=1.1845, top=138.35 (1.7s)
L19: SR=22.84, pos=2071(0.506), neg=2025(0.494), sym=1.1861, top=115.63 (1.6s)
L20: SR=25.01, pos=2054(0.501), neg=2042(0.499), sym=1.1386, top=102.76 (1.6s)
L21: SR=22.57, pos=2084(0.509), neg=2012(0.491), sym=1.1301, top=82.99 (1.7s)
L22: SR=16.52, pos=2035(0.497), neg=2061(0.503), sym=1.1544, top=72.17 (1.6s)
L23: SR=15.34, pos=2061(0.503), neg=2035(0.497), sym=1.2299, top=65.78 (1.7s)
Trend (fraction of positive eigenvalues): L0=0.510 → L23=0.503

--- DECODER self-attention ---
L 0: SR=2.74, pos=2052(0.501), neg=2044(0.499), sym=1.3248, top=125.28 (1.7s)
L 1: SR=3.73, pos=2030(0.496), neg=2066(0.504), sym=1.3126, top=64.14 (1.7s)
L 2: SR=2.79, pos=1997(0.488), neg=2099(0.512), sym=1.2139, top=94.72 (1.7s)
L 3: SR=4.70, pos=2016(0.492), neg=2080(0.508), sym=1.2821, top=175.46 (1.7s)
L 4: SR=3.53, pos=2028(0.495), neg=2068(0.505), sym=1.2302, top=222.86 (1.7s)
L 5: SR=3.61, pos=2039(0.498), neg=2057(0.502), sym=1.2552, top=111.65 (1.7s)
L 6: SR=4.88, pos=2061(0.503), neg=2035(0.497), sym=1.2901, top=206.78 (1.8s)
L 7: SR=7.31, pos=2062(0.503), neg=2034(0.497), sym=1.3132, top=161.56 (1.7s)
L 8: SR=5.99, pos=2086(0.509), neg=2010(0.491), sym=1.2770, top=161.19 (1.7s)
L 9: SR=7.92, pos=2075(0.507), neg=2021(0.493), sym=1.3177, top=126.85 (1.7s)
L10: SR=6.57, pos=2071(0.506), neg=2025(0.494), sym=1.2753, top=241.70 (1.7s)
L11: SR=9.67, pos=2058(0.502), neg=2038(0.498), sym=1.3237, top=195.50 (1.7s)
L12: SR=13.29, pos=2102(0.513), neg=1994(0.487), sym=1.3140, top=206.67 (1.7s)
L13: SR=13.37, pos=2096(0.512), neg=2000(0.488), sym=1.3338, top=158.07 (1.7s)
L14: SR=15.72, pos=2113(0.516), neg=1983(0.484), sym=1.3374, top=146.70 (1.6s)
L15: SR=15.90, pos=2122(0.518), neg=1974(0.482), sym=1.3480, top=151.95 (1.6s)
L16: SR=18.25, pos=2139(0.522), neg=1957(0.478), sym=1.3473, top=126.08 (1.6s)
L17: SR=19.31, pos=2143(0.523), neg=1953(0.477), sym=1.3495, top=118.79 (1.6s)
L18: SR=17.63, pos=2171(0.530), neg=1925(0.470), sym=1.3467, top=107.62 (1.6s)
L19: SR=14.06, pos=2186(0.534), neg=1910(0.466), sym=1.3491, top=109.47 (1.6s)
L20: SR=13.42, pos=2217(0.541), neg=1879(0.459), sym=1.3249, top=78.52 (1.7s)
L21: SR=11.14, pos=2276(0.556), neg=1820(0.444), sym=1.3111, top=69.83 (1.6s)
L22: SR=8.89, pos=2283(0.557), neg=1813(0.443), sym=1.2788, top=63.48 (1.7s)
L23: SR=8.88, pos=2246(0.548), neg=1850(0.452), sym=1.3011, top=130.08 (1.7s)
Trend (fraction of positive eigenvalues): L0=0.501 → L23=0.548

--- DECODER cross-attention ---
L 0: pos=2046(0.500), neg=2050(0.500), sym=1.4072, top=10.23 (0.6s)
L 1: pos=2042(0.499), neg=2054(0.501), sym=1.4116, top=19.70 (0.6s)
L 2: pos=2044(0.499), neg=2052(0.501), sym=1.4119, top=21.48 (0.6s)
L 3: pos=2045(0.499), neg=2051(0.501), sym=1.4117, top=18.96 (0.6s)
L 4: pos=2051(0.501), neg=2045(0.499), sym=1.4116, top=27.15 (0.6s)
L 5: pos=2049(0.500), neg=2047(0.500), sym=1.4147, top=24.49 (0.6s)
L 6: pos=2050(0.500), neg=2046(0.500), sym=1.4083, top=24.80 (0.6s)
L 7: pos=2052(0.501), neg=2044(0.499), sym=1.4064, top=18.86 (0.6s)
L 8: pos=2046(0.500), neg=2050(0.500), sym=1.4072, top=28.88 (0.6s)
L 9: pos=2050(0.500), neg=2046(0.500), sym=1.4115, top=32.92 (0.6s)
L10: pos=2051(0.501), neg=2045(0.499), sym=1.4136, top=36.77 (0.6s)
L11: pos=2049(0.500), neg=2047(0.500), sym=1.4128, top=49.21 (0.6s)
L12: pos=2047(0.500), neg=2049(0.500), sym=1.4138, top=64.47 (0.6s)
L13: pos=2051(0.501), neg=2045(0.499), sym=1.4137, top=56.35 (0.6s)
L14: pos=2051(0.501), neg=2045(0.499), sym=1.4130, top=57.55 (0.6s)
L15: pos=2049(0.500), neg=2047(0.500), sym=1.4137, top=54.22 (0.6s)
L16: pos=2050(0.500), neg=2046(0.500), sym=1.4128, top=60.04 (0.6s)
L17: pos=2048(0.500), neg=2048(0.500), sym=1.4146, top=72.07 (0.6s)
L18: pos=2050(0.500), neg=2046(0.500), sym=1.4145, top=70.79 (0.6s)
L19: pos=2049(0.500), neg=2047(0.500), sym=1.4135, top=75.23 (0.6s)
L20: pos=2049(0.500), neg=2047(0.500), sym=1.4132, top=62.64 (0.6s)
L21: pos=2048(0.500), neg=2048(0.500), sym=1.4133, top=75.57 (0.6s)
L22: pos=2047(0.500), neg=2049(0.500), sym=1.4147, top=75.73 (0.6s)
L23: pos=2047(0.500), neg=2049(0.500), sym=1.4132, top=98.13 (0.6s)

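The QK metrics aren't defined in the log either. A reading consistent with the numbers (pos+neg = 4096 = d_model, and sym ≈ 1.414 ≈ √2 for the random-looking cross-attention maps): form the bilinear map M = WqᵀWk, report sym as the Frobenius asymmetry ratio ||M − Mᵀ||_F / ||M||_F, count eigenvalue signs of the symmetric part, and read SR as an eigenvalue participation ratio. A sketch under those assumptions:

```python
import torch

def qk_manifold(Wq: torch.Tensor, Wk: torch.Tensor):
    # Attention logits are x_i^T (Wq^T Wk) x_j, so study M = Wq^T Wk.
    M = (Wq.float().T @ Wk.float()).cpu()
    sym = (torch.linalg.norm(M - M.T) / torch.linalg.norm(M)).item()
    evals = torch.linalg.eigvalsh((M + M.T) / 2)     # symmetric part, on CPU
    pos = int((evals > 0).sum())
    neg = int((evals < 0).sum())
    e2 = evals ** 2
    sr = (e2.sum() ** 2 / (e2 ** 2).sum()).item()    # participation ratio (assumed SR)
    top = evals.abs().max().item()
    return sr, pos, neg, sym, top

# e.g. for encoder layer i:
#   attn = model.encoder.block[i].layer[0].SelfAttention
#   qk_manifold(attn.q.weight, attn.k.weight)
```
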
======================================================================
MLP DEAD NEURONS (GeGLU)
======================================================================

--- ENCODER ---
L0-L23: d_ff=10240, dead=0 (0.0%), weak=0 (0.0%) (identical across all 24 layers)
Total: 0/245760 (0.00%)

--- DECODER ---
L 0: d_ff=10240, dead=0 (0.0%),  weak=192 (1.9%)
L 1: d_ff=10240, dead=2 (0.0%),  weak=93 (0.9%)
L 2: d_ff=10240, dead=12 (0.1%), weak=106 (1.0%)
L 3: d_ff=10240, dead=0 (0.0%),  weak=27 (0.3%)
L 4: d_ff=10240, dead=0 (0.0%),  weak=38 (0.4%)
L 5: d_ff=10240, dead=0 (0.0%),  weak=4 (0.0%)
L 6: d_ff=10240, dead=0 (0.0%),  weak=1 (0.0%)
L 7-L23: dead=0, weak=0 (all remaining layers clean)
Total: 14/245760 (0.01%)

======================================================================
CROSS-LAYER Q CORRELATION
======================================================================
encoder adj Q cos: mean= 0.0001, range=[-0.0009, 0.0011]
decoder adj Q cos: mean=-0.0001, range=[-0.0012, 0.0007]

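A sketch of this probe: cosine similarity between flattened Q weights of adjacent layers. A mean near zero says adjacent layers' Q maps are essentially orthogonal in weight space.

```python
import torch
import torch.nn.functional as F

def adjacent_q_cos(q_mats):
    """q_mats: per-layer Q weight matrices, in layer order."""
    flat = [w.float().flatten() for w in q_mats]
    return [F.cosine_similarity(a, b, dim=0).item()
            for a, b in zip(flat, flat[1:])]

# e.g. encoder:
#   adjacent_q_cos([model.encoder.block[i].layer[0].SelfAttention.q.weight
#                   for i in range(24)])
```
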
======================================================================
POSITION BIAS
======================================================================
encoder : [32×64] (buckets × heads)  Local:24  Global:2   Mixed:38  Range:[-47.2, 11.2]
decoder : [32×64] (buckets × heads)  Local:27  Global:37  Mixed:0   Range:[-28.4, 17.0]

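A hedged sketch of the head classification over the [32×64] relative position bias (32 distance buckets × 64 heads). The local/global/mixed rule below is an assumption: compare mean bias on near vs. far buckets. T5's bucketing is log-spaced (and bidirectional in the encoder), so a faithful version would first map bucket index back to signed distance.

```python
import torch

def classify_heads(bias: torch.Tensor, margin: float = 1.0):
    """bias: (32, 64) relative_attention_bias.weight."""
    near, far = bias[:4].float().mean(0), bias[-4:].float().mean(0)  # (64,) each
    labels = []
    for delta in (near - far).tolist():
        if delta > margin:
            labels.append("local")     # favors nearby positions
        elif delta < -margin:
            labels.append("global")    # favors distant positions
        else:
            labels.append("mixed")
    return labels

# e.g.:
#   classify_heads(
#       model.encoder.block[0].layer[0].SelfAttention.relative_attention_bias.weight)
```
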
======================================================================
SUMMARY — T5-v1.1-XXL (FLUX)
======================================================================
Params: 11,398,524,928
d_model=4096, d_ff=10240, heads=64
Layers: 24 enc + 24 dec
MLP: gated-gelu (GeGLU)
self_attn_q  (<0.1): 100.0%
self_attn_k  (<0.1):  65.5%
self_attn_v  (<0.1):  73.1%
cross_attn_q (<0.1): 100.0%

Ref (Q <0.1): T5-Small 93.7% | T5-Base 99.4% | BERT 99.1% | DINOv2 100%
VRAM at end: 26.9 GB
Done.