Create t5xxl_1_1_analysis.md

t5xxl_1_1_analysis.md (added, +250 lines)

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
VRAM: 102.0 GB

Loading google/t5-v1_1-xxl (fp16 → GPU)...
Loading weights: 100% 560/560 [00:25<00:00, 31.48it/s, Materializing param=shared.weight]
The tied weights mapping and config for this model specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
The tied weights mapping and config for this model specifies to tie shared.weight to encoder.embed_tokens.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
The tied weights mapping and config for this model specifies to tie shared.weight to decoder.embed_tokens.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
Loaded in 31s, 11,398,524,928 params
VRAM used: 26.8 GB
d_model=4096, d_kv=64, d_ff=10240, heads=64, layers=24+24, ff=gated-gelu

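A minimal sketch of the load step above (assumed, not the exact script that produced this log): fp16 weights onto one GPU via `transformers`, then a parameter count. The ~11.4B total, slightly above the nominal ~11B, is consistent with the untied-embedding warnings: `shared`, both `embed_tokens`, and `lm_head` each get counted separately.

```python
# Hypothetical reconstruction of the load step; flags are assumptions.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "google/t5-v1_1-xxl",
    torch_dtype=torch.float16,      # "fp16 -> GPU" above
    # tie_word_embeddings=False,    # would silence the three warnings above
).to("cuda")

print(f"{sum(p.numel() for p in model.parameters()):,} params")
cfg = model.config                  # d_model=4096, d_kv=64, d_ff=10240, ...
print(cfg.d_model, cfg.d_kv, cfg.d_ff, cfg.num_heads, cfg.num_layers)
```
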
======================================================================
CATALOG
======================================================================
type               count (E/D)           params   shapes
cross_attn_k     : 24 (E:0 D:24)     402,653,184  {'(4096, 4096)'}
cross_attn_o     : 24 (E:0 D:24)     402,653,184  {'(4096, 4096)'}
cross_attn_q     : 24 (E:0 D:24)     402,653,184  {'(4096, 4096)'}
cross_attn_v     : 24 (E:0 D:24)     402,653,184  {'(4096, 4096)'}
embedding        :  3 (E:1 D:1)      394,788,864  {'(32128, 4096)'}
mlp_down         : 48 (E:24 D:24)  2,013,265,920  {'(4096, 10240)'}
mlp_gate         : 48 (E:24 D:24)  2,013,265,920  {'(10240, 4096)'}
mlp_up           : 48 (E:24 D:24)  2,013,265,920  {'(10240, 4096)'}
other (lm_head)  :  1 (E:0 D:0)      131,596,288  {'(32128, 4096)'}
position_bias    :  2 (E:1 D:1)            4,096  {'(32, 64)'}
self_attn_k      : 48 (E:24 D:24)    805,306,368  {'(4096, 4096)'}
self_attn_o      : 48 (E:24 D:24)    805,306,368  {'(4096, 4096)'}
self_attn_q      : 48 (E:24 D:24)    805,306,368  {'(4096, 4096)'}
self_attn_v      : 48 (E:24 D:24)    805,306,368  {'(4096, 4096)'}

Encoder layers: ALL 24
Decoder layers: ALL 24

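A catalog like this can be built by walking `named_parameters()` and keying on transformers' T5 module names; a sketch under the assumption that the bucketing went roughly like this (the wi_0 = gate / wi_1 = up split follows transformers' GeGLU layout, and `model` is reused from the load sketch above):

```python
# Hypothetical reconstruction of the CATALOG pass: bucket every 2-D weight
# by role using transformers' T5 parameter names (bucketing rules assumed).
from collections import defaultdict

buckets = defaultdict(lambda: [0, 0, 0, set()])  # count, enc, dec, shapes

def bucket_of(name: str) -> str:
    if "EncDecAttention" in name:
        return "cross_attn_" + name.rsplit(".", 2)[-2]   # q/k/v/o
    if "relative_attention_bias" in name:
        return "position_bias"
    if "SelfAttention" in name:
        return "self_attn_" + name.rsplit(".", 2)[-2]
    if "wi_0" in name: return "mlp_gate"
    if "wi_1" in name: return "mlp_up"
    if ".wo." in name: return "mlp_down"
    if "embed_tokens" in name or name.startswith("shared"):
        return "embedding"
    return "other"  # lm_head lands here

for name, p in model.named_parameters():
    if p.ndim != 2:
        continue  # skip layer-norm vectors etc.
    b = buckets[bucket_of(name)]
    b[0] += 1
    b[1] += name.startswith("encoder")
    b[2] += name.startswith("decoder")
    b[3].add(str(tuple(p.shape)))

for k in sorted(buckets):
    c, e, d, shapes = buckets[k]
    print(f"{k:15s}: {c:3d} (E:{e} D:{d})  {shapes}")
```
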
======================================================================
SVD EFFECTIVE RANK
======================================================================
SVD done: 432 matrices in 395s

Type             SR        PR     Act%   R90       Cond
cross_attn_k    50.87   2025.26   0.876  2445   457570.2
cross_attn_o     6.08   2008.82   0.703  2596   910032.9
cross_attn_q    98.80   2104.42   0.910  2418   275050.8
cross_attn_v   230.39   2429.20   0.956  2552   101478.1
mlp_down        25.27   3078.82   0.983  3207      226.2
mlp_gate        67.93   3210.24   1.000  3214       60.7
mlp_up         247.30   3290.06   1.000  3209       35.5
self_attn_k     90.00   2098.88   0.879  2425   281622.0
self_attn_o     16.43   2012.25   0.772  2519   443902.2
self_attn_q     96.81   2087.90   0.891  2424   350901.5
self_attn_v    204.44   2331.76   0.947  2524   139741.5

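The column definitions aren't printed, so the following are assumed readings: SR as stable rank, PR as participation ratio, Act% as the fraction of singular values above a small tolerance, R90 as the rank carrying 90% of the squared-singular-value mass, Cond as the condition number. A sketch under those assumptions:

```python
# Assumed formulas for the SVD columns above (the script's exact
# definitions are not shown; these are the usual choices).
import torch

def svd_stats(W: torch.Tensor, tol: float = 1e-3):
    s = torch.linalg.svdvals(W.float().cpu())   # descending
    s2 = s ** 2
    sr   = s2.sum() / s2[0]                     # stable rank
    pr   = s2.sum() ** 2 / (s2 ** 2).sum()      # participation ratio
    act  = (s > tol * s[0]).float().mean()      # "active" fraction (assumed)
    r90  = int((s2.cumsum(0) / s2.sum() >= 0.90).nonzero()[0]) + 1
    cond = s[0] / s[-1]                         # condition number
    return sr.item(), pr.item(), act.item(), r90, cond.item()
```
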
======================================================================
SPARSITY
======================================================================
Sparsity done in 1s

Fraction of weights with |w| below each threshold:

Type            <1e-4   <1e-3   <0.01   <0.1
cross_attn_k    0.0008  0.0084  0.0828  0.6306
cross_attn_o    0.0007  0.0070  0.0696  0.5626
cross_attn_q    0.0063  0.0629  0.5249  1.0000
cross_attn_v    0.0012  0.0122  0.1196  0.7112
mlp_down        0.0008  0.0081  0.0804  0.6498
mlp_gate        0.0007  0.0072  0.0715  0.5990
mlp_up          0.0006  0.0064  0.0633  0.5192
self_attn_k     0.0009  0.0088  0.0870  0.6553
self_attn_o     0.0009  0.0092  0.0913  0.6542
self_attn_q     0.0071  0.0709  0.5737  1.0000
self_attn_v     0.0014  0.0136  0.1331  0.7307

--- ENCODER vs DECODER SPARSITY (<0.1) ---
encoder self_attn_q  : 100.0%
decoder self_attn_q  : 100.0%
encoder self_attn_k  :  71.7%
decoder self_attn_k  :  59.4%
encoder self_attn_v  :  76.0%
decoder self_attn_v  :  70.1%
decoder cross_attn_q : 100.0%
decoder cross_attn_k :  63.1%
decoder cross_attn_v :  71.1%

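The saturated q rows (100% of entries below 0.1) are consistent with T5 folding the 1/√d attention scale into the query projection at initialization, which keeps q weights systematically small; they are not literally zero. The measurement itself is presumably as simple as:

```python
# Likely form of the sparsity pass: fraction of entries whose magnitude
# falls below each threshold (thresholds taken from the table header).
import torch

def sparsity_row(W: torch.Tensor, thresholds=(1e-4, 1e-3, 1e-2, 1e-1)):
    a = W.detach().abs()
    return [(a < t).float().mean().item() for t in thresholds]
```
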
======================================================================
QK MANIFOLD (eigvalsh on CPU)
======================================================================

--- ENCODER self-attention ---
L 0: SR=4.42, pos=2090(0.510), neg=2006(0.490), sym=1.1843, top=70.33 (1.7s)
L 1: SR=3.64, pos=2079(0.508), neg=2017(0.492), sym=1.2417, top=107.53 (1.6s)
L 2: SR=10.47, pos=2065(0.504), neg=2031(0.496), sym=1.2760, top=62.50 (1.6s)
L 3: SR=12.32, pos=2071(0.506), neg=2025(0.494), sym=1.3047, top=41.21 (1.7s)
L 4: SR=13.72, pos=2052(0.501), neg=2044(0.499), sym=1.2819, top=57.93 (1.7s)
L 5: SR=14.98, pos=2068(0.505), neg=2028(0.495), sym=1.3016, top=53.62 (1.6s)
L 6: SR=13.24, pos=2061(0.503), neg=2035(0.497), sym=1.2758, top=70.88 (1.7s)
L 7: SR=16.00, pos=2064(0.504), neg=2032(0.496), sym=1.2766, top=82.54 (1.7s)
L 8: SR=11.31, pos=2074(0.506), neg=2022(0.494), sym=1.2787, top=85.30 (1.7s)
L 9: SR=11.69, pos=2071(0.506), neg=2025(0.494), sym=1.2342, top=95.72 (1.7s)
L10: SR=12.37, pos=2058(0.502), neg=2038(0.498), sym=1.2403, top=135.32 (1.7s)
L11: SR=8.86, pos=2092(0.511), neg=2004(0.489), sym=1.2171, top=124.68 (1.6s)
L12: SR=11.30, pos=2078(0.507), neg=2018(0.493), sym=1.2221, top=152.47 (1.6s)
L13: SR=10.54, pos=2087(0.510), neg=2009(0.490), sym=1.2069, top=131.38 (1.6s)
L14: SR=7.88, pos=2084(0.509), neg=2012(0.491), sym=1.2023, top=133.98 (1.6s)
L15: SR=13.96, pos=2095(0.511), neg=2001(0.489), sym=1.2026, top=146.83 (1.7s)
L16: SR=17.25, pos=2112(0.516), neg=1984(0.484), sym=1.1775, top=141.57 (1.6s)
L17: SR=19.15, pos=2081(0.508), neg=2015(0.492), sym=1.1713, top=150.69 (1.6s)
L18: SR=21.13, pos=2082(0.508), neg=2014(0.492), sym=1.1845, top=138.35 (1.7s)
L19: SR=22.84, pos=2071(0.506), neg=2025(0.494), sym=1.1861, top=115.63 (1.6s)
L20: SR=25.01, pos=2054(0.501), neg=2042(0.499), sym=1.1386, top=102.76 (1.6s)
L21: SR=22.57, pos=2084(0.509), neg=2012(0.491), sym=1.1301, top=82.99 (1.7s)
L22: SR=16.52, pos=2035(0.497), neg=2061(0.503), sym=1.1544, top=72.17 (1.6s)
L23: SR=15.34, pos=2061(0.503), neg=2035(0.497), sym=1.2299, top=65.78 (1.7s)
Trend: L0=0.510 → L23=0.503

--- DECODER self-attention ---
L 0: SR=2.74, pos=2052(0.501), neg=2044(0.499), sym=1.3248, top=125.28 (1.7s)
L 1: SR=3.73, pos=2030(0.496), neg=2066(0.504), sym=1.3126, top=64.14 (1.7s)
L 2: SR=2.79, pos=1997(0.488), neg=2099(0.512), sym=1.2139, top=94.72 (1.7s)
L 3: SR=4.70, pos=2016(0.492), neg=2080(0.508), sym=1.2821, top=175.46 (1.7s)
L 4: SR=3.53, pos=2028(0.495), neg=2068(0.505), sym=1.2302, top=222.86 (1.7s)
L 5: SR=3.61, pos=2039(0.498), neg=2057(0.502), sym=1.2552, top=111.65 (1.7s)
L 6: SR=4.88, pos=2061(0.503), neg=2035(0.497), sym=1.2901, top=206.78 (1.8s)
L 7: SR=7.31, pos=2062(0.503), neg=2034(0.497), sym=1.3132, top=161.56 (1.7s)
L 8: SR=5.99, pos=2086(0.509), neg=2010(0.491), sym=1.2770, top=161.19 (1.7s)
L 9: SR=7.92, pos=2075(0.507), neg=2021(0.493), sym=1.3177, top=126.85 (1.7s)
L10: SR=6.57, pos=2071(0.506), neg=2025(0.494), sym=1.2753, top=241.70 (1.7s)
L11: SR=9.67, pos=2058(0.502), neg=2038(0.498), sym=1.3237, top=195.50 (1.7s)
L12: SR=13.29, pos=2102(0.513), neg=1994(0.487), sym=1.3140, top=206.67 (1.7s)
L13: SR=13.37, pos=2096(0.512), neg=2000(0.488), sym=1.3338, top=158.07 (1.7s)
L14: SR=15.72, pos=2113(0.516), neg=1983(0.484), sym=1.3374, top=146.70 (1.6s)
L15: SR=15.90, pos=2122(0.518), neg=1974(0.482), sym=1.3480, top=151.95 (1.6s)
L16: SR=18.25, pos=2139(0.522), neg=1957(0.478), sym=1.3473, top=126.08 (1.6s)
L17: SR=19.31, pos=2143(0.523), neg=1953(0.477), sym=1.3495, top=118.79 (1.6s)
L18: SR=17.63, pos=2171(0.530), neg=1925(0.470), sym=1.3467, top=107.62 (1.6s)
L19: SR=14.06, pos=2186(0.534), neg=1910(0.466), sym=1.3491, top=109.47 (1.6s)
L20: SR=13.42, pos=2217(0.541), neg=1879(0.459), sym=1.3249, top=78.52 (1.7s)
L21: SR=11.14, pos=2276(0.556), neg=1820(0.444), sym=1.3111, top=69.83 (1.6s)
L22: SR=8.89, pos=2283(0.557), neg=1813(0.443), sym=1.2788, top=63.48 (1.7s)
L23: SR=8.88, pos=2246(0.548), neg=1850(0.452), sym=1.3011, top=130.08 (1.7s)
Trend: L0=0.501 → L23=0.548

--- DECODER cross-attention ---
L 0: pos=2046(0.500), neg=2050(0.500), sym=1.4072, top=10.23 (0.6s)
L 1: pos=2042(0.499), neg=2054(0.501), sym=1.4116, top=19.70 (0.6s)
L 2: pos=2044(0.499), neg=2052(0.501), sym=1.4119, top=21.48 (0.6s)
L 3: pos=2045(0.499), neg=2051(0.501), sym=1.4117, top=18.96 (0.6s)
L 4: pos=2051(0.501), neg=2045(0.499), sym=1.4116, top=27.15 (0.6s)
L 5: pos=2049(0.500), neg=2047(0.500), sym=1.4147, top=24.49 (0.6s)
L 6: pos=2050(0.500), neg=2046(0.500), sym=1.4083, top=24.80 (0.6s)
L 7: pos=2052(0.501), neg=2044(0.499), sym=1.4064, top=18.86 (0.6s)
L 8: pos=2046(0.500), neg=2050(0.500), sym=1.4072, top=28.88 (0.6s)
L 9: pos=2050(0.500), neg=2046(0.500), sym=1.4115, top=32.92 (0.6s)
L10: pos=2051(0.501), neg=2045(0.499), sym=1.4136, top=36.77 (0.6s)
L11: pos=2049(0.500), neg=2047(0.500), sym=1.4128, top=49.21 (0.6s)
L12: pos=2047(0.500), neg=2049(0.500), sym=1.4138, top=64.47 (0.6s)
L13: pos=2051(0.501), neg=2045(0.499), sym=1.4137, top=56.35 (0.6s)
L14: pos=2051(0.501), neg=2045(0.499), sym=1.4130, top=57.55 (0.6s)
L15: pos=2049(0.500), neg=2047(0.500), sym=1.4137, top=54.22 (0.6s)
L16: pos=2050(0.500), neg=2046(0.500), sym=1.4128, top=60.04 (0.6s)
L17: pos=2048(0.500), neg=2048(0.500), sym=1.4146, top=72.07 (0.6s)
L18: pos=2050(0.500), neg=2046(0.500), sym=1.4145, top=70.79 (0.6s)
L19: pos=2049(0.500), neg=2047(0.500), sym=1.4135, top=75.23 (0.6s)
L20: pos=2049(0.500), neg=2047(0.500), sym=1.4132, top=62.64 (0.6s)
L21: pos=2048(0.500), neg=2048(0.500), sym=1.4133, top=75.57 (0.6s)
L22: pos=2047(0.500), neg=2049(0.500), sym=1.4147, top=75.73 (0.6s)
L23: pos=2047(0.500), neg=2049(0.500), sym=1.4132, top=98.13 (0.6s)

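A plausible reading of these columns, assuming the script formed the d_model×d_model bilinear form M = WqᵀWk and symmetrized it for `eigvalsh`: pos/neg count the eigenvalue signs, `sym` looks like an asymmetry ratio ‖M−Mᵀ‖_F/‖M‖_F (an unstructured random matrix sits near √2 ≈ 1.414, which is exactly where every cross-attention layer lands, while self-attention layers sit noticeably below it), and `top` the largest eigenvalue. A sketch under those assumptions:

```python
# Speculative reconstruction of the QK pass (exact definitions assumed):
# form M = Wq.T @ Wk in fp32, symmetrize for eigvalsh, then report the
# eigenvalue sign split, an asymmetry score, and the top eigenvalue.
import torch

def qk_stats(Wq: torch.Tensor, Wk: torch.Tensor):
    M = (Wq.float().T @ Wk.float()).cpu()          # (d_model, d_model)
    sym = (M - M.T).norm() / M.norm()              # ~sqrt(2) for random M
    evals = torch.linalg.eigvalsh((M + M.T) / 2)   # eigvalsh needs symmetry
    pos = int((evals > 0).sum()); neg = int((evals < 0).sum())
    e2 = evals ** 2
    sr = (e2.sum() / e2.max()).item()              # stable rank (assumed)
    return sr, pos, neg, sym.item(), evals[-1].item()
```
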
======================================================================
MLP DEAD NEURONS (GeGLU)
======================================================================

--- ENCODER ---
L 0–L23: d_ff=10240, dead=0(0.0%), weak=0(0.0%)   (all 24 layers identical)
Total: 0/245760 (0.00%)

--- DECODER ---
L 0: d_ff=10240, dead=0(0.0%), weak=192(1.9%)
L 1: d_ff=10240, dead=2(0.0%), weak=93(0.9%)
L 2: d_ff=10240, dead=12(0.1%), weak=106(1.0%)
L 3: d_ff=10240, dead=0(0.0%), weak=27(0.3%)
L 4: d_ff=10240, dead=0(0.0%), weak=38(0.4%)
L 5: d_ff=10240, dead=0(0.0%), weak=4(0.0%)
L 6: d_ff=10240, dead=0(0.0%), weak=1(0.0%)
L 7–L23: d_ff=10240, dead=0(0.0%), weak=0(0.0%)   (all zero)
Total: 14/245760 (0.01%)

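The dead/weak criterion isn't shown in the log. One common weight-only heuristic scores each GeGLU neuron by the norms of its gate row (wi_0), up row (wi_1), and down column (wo), with a hard threshold for "dead" and a looser one for "weak"; a sketch under that assumption (both thresholds hypothetical):

```python
# Assumed dead/weak criterion: a GeGLU neuron is row i of wi_0 and wi_1
# plus column i of wo; call it dead when its combined weight norm is
# near zero, weak when it is merely small. Thresholds are hypothetical.
import torch

def dead_neurons(wi_0, wi_1, wo, hard=1e-3, soft=1e-2):
    score = (wi_0.float().norm(dim=1)    # gate row norm per neuron
             * wi_1.float().norm(dim=1)  # up row norm
             * wo.float().norm(dim=0))   # down column norm
    return int((score < hard).sum()), int((score < soft).sum())
```
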
======================================================================
CROSS-LAYER Q CORRELATION
======================================================================
encoder adj Q cos: mean= 0.0001, range=[-0.0009, 0.0011]
decoder adj Q cos: mean=-0.0001, range=[-0.0012, 0.0007]

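Near-zero cosines mean adjacent layers' Q projections are essentially orthogonal when compared as flat vectors; the measurement is presumably as simple as:

```python
# Presumed measurement: cosine similarity between adjacent layers' Q
# weight matrices, flattened to vectors.
import torch.nn.functional as F

def adjacent_q_cos(q_weights):  # list of per-layer (4096, 4096) tensors
    return [F.cosine_similarity(a.flatten().float(),
                                b.flatten().float(), dim=0).item()
            for a, b in zip(q_weights, q_weights[1:])]
```
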
======================================================================
POSITION BIAS
======================================================================
encoder : [32×64]  Local:24  Global: 2  Mixed:38  Range:[-47.2, 11.2]
decoder : [32×64]  Local:27  Global:37  Mixed: 0  Range:[-28.4, 17.0]

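The Local/Global/Mixed rule isn't printed. One assumed heuristic that could produce counts like these from the (32 buckets × 64 heads) relative_attention_bias table: compare each head's mean bias on the near-distance buckets against the far buckets, with an arbitrary margin (both the bucket split and `margin` are hypothetical):

```python
# Assumed head classification: a head is "local" if it biases attention
# toward short relative distances, "global" if toward long ones, else mixed.
import torch

def classify_heads(bias: torch.Tensor, margin: float = 1.0):
    # bias: (num_buckets=32, num_heads=64); low buckets = short distances
    near = bias[:8].float().mean(dim=0)    # per-head score, 64 values
    far  = bias[-8:].float().mean(dim=0)
    local_  = int((near > far + margin).sum())
    global_ = int((far > near + margin).sum())
    mixed   = bias.shape[1] - local_ - global_
    return local_, global_, mixed
```
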
======================================================================
SUMMARY — T5-v1.1-XXL (FLUX)
======================================================================
Params: 11,398,524,928
d_model=4096, d_ff=10240, heads=64
Layers: 24 enc + 24 dec
MLP: gated-gelu (GeGLU)
self_attn_q  (<0.1): 100.0%
self_attn_k  (<0.1):  65.5%
self_attn_v  (<0.1):  73.1%
cross_attn_q (<0.1): 100.0%

Ref: T5-Small Q=93.7% | T5-Base Q=99.4% | BERT=99.1% | DINOv2=100%
VRAM at end: 26.9 GB
Done.