ann-sparseattention / logs /compare_all36_step800.log
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading Qwen/Qwen3-4B-Instruct-2507 ...
Loading weights: 100%|██████████| 398/398 [00:02<00:00, 163.92it/s]
Loaded ckpt step 800 for layers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
batch 1/2 done
========================================================================
mass@K — fraction of teacher attention captured by retrieval set
raw_qk : exact top-K over head-mean-aggregated post-RoPE Q,K
learned: exact top-K over trained search projections (d=128)
========================================================================
  K method    L00   L01   L02   L03   L04   L05   L06   L07   L08   L09   L10   L11   L12   L13   L14   L15   L16   L17   L18   L19   L20   L21   L22   L23   L24   L25   L26   L27   L28   L29   L30   L31   L32   L33   L34   L35   avg
128 raw_qk  0.922 0.918 0.939 0.939 0.944 0.964 0.956 0.982 0.971 0.959 0.974 0.976 0.961 0.971 0.973 0.968 0.956 0.959 0.965 0.961 0.959 0.966 0.963 0.979 0.971 0.986 0.978 0.978 0.979 0.982 0.988 0.984 0.979 0.977 0.976 0.980 0.966
128 learned 0.776 0.853 0.899 0.925 0.936 0.950 0.939 0.983 0.971 0.976 0.971 0.976 0.970 0.972 0.973 0.972 0.962 0.967 0.973 0.968 0.976 0.980 0.970 0.985 0.978 0.989 0.986 0.983 0.985 0.987 0.986 0.984 0.980 0.970 0.960 0.965 0.960
256 raw_qk  0.974 0.983 0.986 0.986 0.986 0.993 0.990 0.996 0.994 0.992 0.995 0.996 0.991 0.995 0.996 0.995 0.992 0.993 0.995 0.993 0.993 0.994 0.993 0.996 0.994 0.997 0.996 0.995 0.995 0.997 0.998 0.997 0.995 0.995 0.995 0.995 0.993
256 learned 0.924 0.961 0.966 0.977 0.982 0.987 0.981 0.996 0.993 0.995 0.992 0.995 0.993 0.993 0.994 0.994 0.992 0.994 0.996 0.993 0.996 0.997 0.994 0.997 0.996 0.998 0.997 0.997 0.997 0.998 0.997 0.997 0.995 0.992 0.990 0.989 0.990
Learned vs raw mass@K=128: 0.960 / 0.966 = 0.99×
Wrote /tmp/checkpoints_all36_d128_block/search_step_800.compare_retrieval.json
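
For readers unfamiliar with the metric, the log header defines mass@K as the fraction of teacher attention captured by the retrieval set. The following is a minimal NumPy sketch of one way such a number could be computed. It is not taken from the script that produced this log: the function name `mass_at_k`, the array shapes, and the choice to average over queries are all assumptions made for illustration, and causal masking and per-head handling are ignored.

```python
import numpy as np

def mass_at_k(teacher_attn, scores, k):
    """Hypothetical helper (not from the original script).

    teacher_attn: (n_queries, n_keys) attention weights, each row summing to 1.
    scores:       (n_queries, n_keys) retrieval scores, e.g. raw Q.K or a
                  learned search projection.
    Returns the mean, over queries, of the teacher attention mass that falls
    on the k highest-scoring keys.
    """
    teacher_attn = np.asarray(teacher_attn, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    k = min(k, scores.shape[-1])
    # Indices of the k largest scores per query; order inside the top-k set
    # does not matter, so argpartition is sufficient.
    topk = np.argpartition(-scores, k - 1, axis=-1)[:, :k]
    captured = np.take_along_axis(teacher_attn, topk, axis=-1).sum(axis=-1)
    return float(captured.mean())

# Toy example: one query, three keys. The scorer ranks key 2 first, then key 1.
toy_attn = np.array([[0.7, 0.2, 0.1]])
toy_scores = np.array([[0.0, 1.0, 2.0]])
toy_k1 = mass_at_k(toy_attn, toy_scores, 1)  # mass on key 2 only
toy_k2 = mass_at_k(toy_attn, toy_scores, 2)  # mass on keys 1 and 2
```

Under this definition a scorer that ranks keys exactly as the teacher does gives the highest achievable value for a given K, and K equal to the number of keys always gives 1.0.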