| Remaining attention layers to evaluate: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] |
| Starting Super Weight detection using 64 samples... |
| Detecting Super Weight: 0%|          | 0/32 [00:00<?, ?it/s][WARNING|logging.py:328] 2025-11-19 20:52:29,371 >> The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be removed and `position_embeddings` will be mandatory. |
| Layer 0: Found Super Weight at (291, 491), weight value: 0.2656, activation value: 8.5000 |
| Detecting Super Weight: 3%|▎         | 1/32 [00:00<00:29, 1.04it/s]Layer 1: Found Super Weight at (788, 2427), weight value: -0.6055, activation value: 396.0000 |
| Detecting Super Weight: 44%|████▍     | 14/32 [00:11<00:15, 1.18it/s] |
| Detecting Super Weight: 47%|████▋     | 15/32 [00:12<00:14, 1.18it/s]Layer 15: Found Super Weight at (4055, 8232), weight value: -0.0146, activation value: 3.0156 |
| Detecting Super Weight: 50%|█████     | 16/32 [00:13<00:13, 1.18it/s]Layer 16: Found Super Weight at (4055, 2), weight value: -0.0045, activation value: 3.2031 |
| Detecting Super Weight: 56%|█████▋    | 18/32 [00:15<00:11, 1.18it/s]Layer 18: Found Super Weight at (4055, 298), weight value: -0.0327, activation value: 3.5000 |
| Detecting Super Weight: 66%|██████▌   | 21/32 [00:17<00:09, 1.17it/s]Layer 21: Found Super Weight at (2352, 1562), weight value: -0.0265, activation value: 3.3281 |
| Detecting Super Weight: 69%|██████▉   | 22/32 [00:18<00:08, 1.17it/s]Layer 22: Found Super Weight at (2352, 7684), weight value: 0.0104, activation value: 3.3281 |
| Detecting Super Weight: 72%|███████▏  | 23/32 [00:19<00:07, 1.17it/s]Layer 23: Found Super Weight at (2352, 7604), weight value: -0.0347, activation value: 3.5625 |
| Detecting Super Weight: 75%|███████▌  | 24/32 [00:20<00:06, 1.17it/s]Layer 24: Found Super Weight at (2352, 8981), weight value: 0.0130, activation value: 3.1094 |
| Detecting Super Weight: 78%|███████▊  | 25/32 [00:21<00:05, 1.17it/s]Layer 25: Found Super Weight at (2352, 2900), weight value: -0.0449, activation value: 4.4688 |
| Detecting Super Weight: 81%|████████▏ | 26/32 [00:22<00:05, 1.16it/s]Layer 26: Found Super Weight at (2352, 13121), weight value: 0.0170, activation value: 3.7969 |
| Detecting Super Weight: 84%|████████▍ | 27/32 [00:23<00:04, 1.16it/s]Layer 27: Found Super Weight at (2352, 3919), weight value: -0.0427, activation value: 5.2812 |
| Detecting Super Weight: 88%|████████▊ | 28/32 [00:23<00:03, 1.16it/s]Layer 28: Found Super Weight at (2352, 2221), weight value: 0.0064, activation value: 5.3125 |
| Detecting Super Weight: 91%|█████████ | 29/32 [00:24<00:02, 1.16it/s]Layer 29: Found Super Weight at (2352, 9562), weight value: -0.0928, activation value: 9.6875 |
| Detecting Super Weight: 94%|█████████▍| 30/32 [00:25<00:01, 1.15it/s]Layer 30: Found Super Weight at (2352, 6805), weight value: -0.0019, activation value: 15.6875 |
| Detecting Super Weight: 97%|█████████▋| 31/32 [00:26<00:00, 1.15it/s]Layer 31: Found Super Weight at (788, 12732), weight value: -0.4590, activation value: 390.0000 |
| Detecting Super Weight: 100%|██████████| 32/32 [00:27<00:00, 1.17it/s] |
| Found 16 Super Weights in total |
| [SuperWeightDrop][Step 1] evaluating 32 candidate layers... |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:34<00:00, 1.07s/it] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 0 delta=586.847656 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 1 delta=719.562500 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 2 delta=2.093750 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 3 delta=6.125000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 4 delta=6.015625 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:31<00:00, 1.02it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 5 delta=5.046875 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 6 delta=7.828125 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:35<00:00, 1.12s/it] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 7 delta=4.796875 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 8 delta=5.781250 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:35<00:00, 1.12s/it] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 9 delta=4.546875 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.12it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 10 delta=3.109375 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:35<00:00, 1.12s/it] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 11 delta=1.875000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 12 delta=4.500000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:35<00:00, 1.12s/it] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 13 delta=2.718750 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 14 delta=5.375000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 15 delta=2.125000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 16 delta=4.093750 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 17 delta=6.031250 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 18 delta=2.937500 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 19 delta=5.484375 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 20 delta=3.093750 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 21 delta=3.687500 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 22 delta=2.765625 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 23 delta=3.484375 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 24 delta=0.875000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 25 delta=4.328125 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 26 delta=0.953125 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 27 delta=2.218750 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 28 delta=8.000000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.10it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 29 delta=17.375000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 30 delta=18.562500 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:29<00:00, 1.09it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 31 delta=6.000000 (baseline count=16, candidate count=16) |
| [SuperWeightDrop] Dropped layer 24 at step 1 (delta=0.875000). 31 layers remaining. |
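In step 1 above, the dropped layer (24, delta=0.875000) is the candidate with the smallest reported delta. The following is a minimal sketch of that selection applied to the log text itself, assuming the rule is simply "drop the layer with the minimum delta"; this is inferred from the output and not confirmed by the tool's source. The names `DELTA_RE`, `parse_deltas`, and `pick_layer_to_drop` are hypothetical helpers, not part of the tool that produced this log.

```python
import re

# Matches lines such as:
# "[SuperWeightDrop] Layer 24 delta=0.875000 (baseline count=16, candidate count=16)"
DELTA_RE = re.compile(r"\[SuperWeightDrop\] Layer (\d+) delta=([0-9.]+)")


def parse_deltas(log_text):
    """Return {layer_index: delta} for every per-layer delta line in log_text."""
    return {int(m.group(1)): float(m.group(2)) for m in DELTA_RE.finditer(log_text)}


def pick_layer_to_drop(deltas):
    """Assumed greedy rule: choose the candidate layer with the smallest delta."""
    return min(deltas, key=deltas.get)


# Three lines copied from step 1 of the log above.
sample = """\
[SuperWeightDrop] Layer 0 delta=586.847656 (baseline count=16, candidate count=16)
[SuperWeightDrop] Layer 24 delta=0.875000 (baseline count=16, candidate count=16)
[SuperWeightDrop] Layer 26 delta=0.953125 (baseline count=16, candidate count=16)
"""

deltas = parse_deltas(sample)
layer = pick_layer_to_drop(deltas)
print(layer, deltas[layer])  # prints: 24 0.875
```

Pass the text of one step at a time: layer indices repeat across steps, so parsing several steps together keeps only the last delta seen for each layer.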
| [SuperWeightDrop][Step 2] evaluating 31 candidate layers... |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 0 delta=593.214844 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 1 delta=719.156250 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 2 delta=2.375000 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 100%|██████████| 32/32 [00:28<00:00, 1.11it/s] |
| Measured activations for 16 layers. |
| [SuperWeightDrop] Layer 3 delta=4.312500 (baseline count=16, candidate count=16) |
| Measuring activations for 16 tracked super weights... |
| Measuring super activations: 44%|████▍     | 14/32 [00:12<00:16, 1.09it/s] |