Commit dafeda3 by m1b · verified · 1 Parent(s): 8e7ba39

Upload NOVEL_APPROACH.md with huggingface_hub

  1. NOVEL_APPROACH.md +174 -0
NOVEL_APPROACH.md ADDED
@@ -0,0 +1,174 @@
# Novel SOTA Optimization for Parameter Golf

## Summary of Novel Techniques

After analyzing the current SOTA (1.0810 BPB) and the related literature, I propose **5 novel optimizations** that target complementary axes of improvement. Each is grounded in published results and has NOT been fully exploited by any submission on the leaderboard.

---

## 1. Multi-Token Prediction (MTP) Auxiliary Training Loss
**Paper**: Better & Faster Large Language Models via Multi-token Prediction (Meta FAIR, arXiv:2404.19737)
**Status**: NOT used in ANY leaderboard submission

### Why it helps
- Trains the model to predict the next 2 tokens simultaneously via 2 output heads on top of the shared trunk
- Forces hidden representations to encode longer-range planning information
- Estimated 20-30% better sample efficiency at no additional inference-time cost
- The auxiliary head is DISCARDED at serialization → zero extra bytes in the 16MB artifact
- With only ~4500 training steps in 10 minutes, every bit of sample efficiency matters

### Implementation
```python
import torch.nn.functional as F

# In GPT.forward():
# Standard NTP loss on head_1 (predicting token t+1)
logits_1 = self.forward_logits(input_ids)  # (batch, seq, vocab)
loss_1 = F.cross_entropy(logits_1.flatten(0, 1), target_ids_1.flatten())

# MTP head_2 (predicting token t+2) - lightweight: shared trunk, reusing the
# SAME tied embedding weights as the unembedding (no extra params stored)
hidden = self.get_hidden(input_ids)  # intermediate hidden states
logits_2 = F.linear(hidden, self.tok_emb.weight)  # predict t+2
# target_ids_2 is target_ids_1 shifted one position further (assumed precomputed)
loss_2 = F.cross_entropy(logits_2.flatten(0, 1), target_ids_2.flatten())

loss = 0.7 * loss_1 + 0.3 * loss_2  # weighted combination
```

**Critical**: The second head uses the SAME embedding weights (tied). No extra parameters. At eval, only head_1 is used.
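
The two heads need targets offset by one and two positions. A minimal sketch of that alignment on a plain token list is shown here; `mtp_targets` and `mtp_loss` are hypothetical helper names, and truncating the last positions (rather than padding them) is an assumption, not something the note specifies.

```python
def mtp_targets(tokens, n_future=2):
    """Return (inputs, targets); targets[k][i] is the token k+1 steps after inputs[i].

    The last `n_future` positions are dropped so every input has all targets.
    """
    n_pos = len(tokens) - n_future
    inputs = tokens[:n_pos]
    targets = [tokens[k + 1:k + 1 + n_pos] for k in range(n_future)]
    return inputs, targets


def mtp_loss(head_losses, weights=(0.7, 0.3)):
    """Weighted sum of per-head losses, using the 0.7 / 0.3 split from the note."""
    return sum(w * l for w, l in zip(weights, head_losses))


inputs, (t_plus_1, t_plus_2) = mtp_targets([10, 11, 12, 13, 14])
print(inputs)    # [10, 11, 12]
print(t_plus_1)  # [11, 12, 13]
print(t_plus_2)  # [12, 13, 14]
```

Dropping the tail keeps both heads on the same set of positions, so the two losses are averaged over identical token counts before being weighted.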

### Expected Gain
- Meta reports +2-3% on downstream tasks at 7B scale with n=4
- At small scale with limited steps, the sample-efficiency gain should translate directly to better BPB
- Conservative estimate: **-0.003 to -0.008 BPB improvement**

---

## 2. SpiralFormer Multi-Resolution Recurrence
**Paper**: SpiralFormer (arXiv:2602.11698, 2025)
**Status**: NO submission uses multi-resolution recurrence (all use flat looping)

### Why it helps
- Current SOTA loops layers 3-5 at full resolution (17 virtual layers, same compute per loop)
- SpiralFormer proposes running early loops at COARSENED resolution (e.g., 50% of tokens)
- Later loops run at full resolution for fine-grained refinement
- Saves ~15-25% FLOPs per recurrence → can afford MORE loops in the same wall-clock time
- SpiralFormer-L at 410M: reduced FLOPs AND improved perplexity vs flat looping
- Induces hierarchical processing: global patterns first, local refinement later
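
As a rough check on the FLOPs figure, the sketch below averages the per-loop cost of a 50% → 75% → 100% resolution schedule relative to flat looping. It assumes cost is linear in the number of tokens processed, which ignores the quadratic attention term and any pooling overhead; `relative_loop_cost` is a hypothetical helper.

```python
def relative_loop_cost(schedule):
    """Average per-loop cost relative to a full-resolution loop.

    Assumes cost scales linearly with the fraction of tokens processed.
    """
    return sum(schedule) / len(schedule)


flat = relative_loop_cost([1.0, 1.0, 1.0])
spiral = relative_loop_cost([0.5, 0.75, 1.0])
saving = 1.0 - spiral / flat
print(f"relative cost: {spiral:.2f}, saving: {saving:.0%}")  # relative cost: 0.75, saving: 25%
```

Under this linear assumption the schedule lands at the upper end of the ~15-25% range quoted above.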

### Implementation
- Loop iteration 0: process at 50% resolution (mean-pool adjacent tokens)
- Loop iteration 1: process at 75% resolution
- Loop iteration 2: process at full resolution
- Use the causal downscaling/upscaling operators from the paper
- Key: the shared block weights don't change, only the resolution schedule does
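
To make the causality constraint concrete, here is a toy pair of pooling operators for the 50% step. This is an illustrative assumption, not the operators defined in the SpiralFormer paper: tokens are mean-pooled in non-overlapping pairs, and on the way back up each token only receives a group that ended at or before its own position.

```python
def causal_downscale(x, factor=2):
    """Mean-pool non-overlapping groups of `factor` adjacent token vectors.

    x is a list of per-token vectors (lists of floats). Trailing tokens that
    do not fill a complete group are dropped.
    """
    n_groups = len(x) // factor
    pooled = []
    for g in range(n_groups):
        group = x[g * factor:(g + 1) * factor]
        pooled.append([sum(col) / factor for col in zip(*group)])
    return pooled


def causal_upscale(pooled, seq_len, factor=2):
    """Give token t the latest group that ends at or before t (zeros if none)."""
    dim = len(pooled[0]) if pooled else 0
    out = []
    for t in range(seq_len):
        g = (t + 1) // factor - 1  # index of the last fully completed group
        out.append(list(pooled[g]) if 0 <= g < len(pooled) else [0.0] * dim)
    return out


x = [[1.0], [3.0], [5.0], [7.0]]
coarse = causal_downscale(x)
print(coarse)                     # [[2.0], [6.0]]
print(causal_upscale(coarse, 4))  # [[0.0], [2.0], [2.0], [6.0]]
```

Because token 0 has no completed group yet, it receives zeros; changing a later token never alters the values handed to earlier positions.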

### Expected Gain
- SpiralFormer reports 3-10% fewer FLOPs at matched quality, or better quality at matched FLOPs
- Translates to either more loop iterations OR better per-iteration quality
- Conservative estimate: **-0.002 to -0.005 BPB improvement**

---

## 3. In-Place TTT During Training (Meta-Learned TTT Initialization)
**Paper**: In-Place TTT (arXiv:2604.06169, ByteDance, 2025)
**Status**: Current TTT (test-time training) is eval-time only; no submission uses TTT during training

### Why it helps
- The model is currently trained WITHOUT knowledge that it will undergo TTT at eval time
- In-Place TTT trains the model to be GOOD at adapting its W_down fast weights
- Eval-time TTT therefore starts from a much better initialization
- The W_down matrices learn to be "easy to fine-tune" during meta-training
- Zero extra stored params: W_down is already part of the model

### Implementation
- During training: split each batch into chunks
- For each chunk: (1) forward pass with current W_down, (2) compute NTP loss on the chunk, (3) update W_down with one GD step using the NTP-aligned objective, (4) continue to the next chunk
- The meta-gradient flows through the TTT update step back to the base model
- At eval: same score-first TTT, but starting from weights that are "pre-adapted" for TTT
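
The per-chunk loop (score first, then update) can be sketched on a toy problem. The example below uses a single scalar fast weight `w` and a squared-error loss as stand-ins for W_down and the NTP loss; both are assumptions for illustration, and the meta-gradient through the update step is deliberately left out.

```python
def score_first_ttt(chunks, w, lr=0.1):
    """Run score-first TTT over chunks for the toy model y_hat = w * x.

    Each chunk is scored with the weight as it stands BEFORE its update,
    then one gradient-descent step on that chunk's loss adapts the weight.
    Returns (per-chunk losses, final weight).
    """
    losses = []
    for chunk in chunks:
        # (1)-(2): forward pass and loss with the current fast weight
        loss = sum((w * x - y) ** 2 for x, y in chunk) / len(chunk)
        losses.append(loss)
        # (3): one GD step on the same chunk
        grad = sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)
        w = w - lr * grad
    return losses, w


chunk = [(1.0, 2.0), (2.0, 4.0)]  # samples from y = 2x
losses, w_final = score_first_ttt([chunk, chunk, chunk], w=0.0)
print(losses)   # [10.0, 2.5, 0.625]
print(w_final)  # 1.75
```

Every chunk is scored before the weight sees it, so the reported losses stay a fair evaluation while later chunks still benefit from earlier updates.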

### Risk Assessment
- Adds ~30-50% training time overhead → fewer total steps
- But: each step may be worth more because the model learns to adapt
- Needs careful chunk-size tuning (512-1024 tokens is optimal per the paper)
- May conflict with GPTQ quantization (W_down is quantized to int6)

### Expected Gain
- In-Place TTT shows consistent improvements on long-context tasks
- For parameter golf: eval-time TTT already gains ~0.002 BPB; meta-learned TTT could double that
- Conservative estimate: **-0.002 to -0.004 BPB improvement**

---

## 4. Adaptive Weight Decay Scheduling (RMS-Driven)
**Source**: Novel technique informed by Kevin Clark's RMS-compression insight (PR #1218)
**Status**: Current WD is fixed at 0.095; no submission uses adaptive WD

### Why it helps
- Kevin Clark discovered an R²=0.99 correlation between weight RMS and compression ratio
- Lower RMS → lower-entropy quantized weights → smaller artifact → more budget for model params
- Currently WD=0.095 is fixed throughout training
- Novel: INCREASE WD progressively during training (WD warmup)
  - Early: WD=0.02 (let weights explore freely)
  - Mid: WD=0.05 (start constraining)
  - Late: WD=0.12 (aggressively compress for serialization)
- This gives the model freedom to learn early, then compresses for storage late

### Implementation
```python
def adaptive_wd(frac: float) -> float:
    """Weight decay for training progress `frac` in [0, 1]."""
    # Linear ramp from 0.02 to 0.12 over training (0.07 at the midpoint)
    frac = min(max(frac, 0.0), 1.0)  # clamp so the ramp never overshoots
    return 0.02 + 0.10 * frac
```
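
The linear ramp gives 0.07 at the midpoint, whereas the bullets list 0.05 for mid-training. If the three listed values are meant as waypoints, a piecewise-linear variant could look like the sketch below; placing "mid" at `frac = 0.5` is an assumption, and `piecewise_wd` is a hypothetical name.

```python
def piecewise_wd(frac, waypoints=((0.0, 0.02), (0.5, 0.05), (1.0, 0.12))):
    """Piecewise-linear weight decay through (progress, wd) waypoints.

    `frac` is training progress in [0, 1]; values outside are clamped.
    """
    frac = min(max(frac, 0.0), 1.0)
    for (f0, w0), (f1, w1) in zip(waypoints, waypoints[1:]):
        if frac <= f1:
            return w0 + (w1 - w0) * (frac - f0) / (f1 - f0)
    return waypoints[-1][1]


for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{frac:.2f} -> {piecewise_wd(frac):.3f}")
```

In a PyTorch training loop, the returned value would typically be written into each optimizer `param_group["weight_decay"]` before the step.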

### Expected Gain
- Better rate-distortion tradeoff: the model learns more freely early, then compresses better late
- Conservative estimate: **-0.001 to -0.003 BPB improvement**

---

## 5. SP16384 Tokenizer
**Status**: Current best uses SP8192; the progression 1024→4096→8192 shows clear BPB wins

### Why it helps
- Larger vocab → more bytes covered per token → fewer predictions per byte, which helps BPB
- SP8192 embedding with int8 GPTQ = 8192 × 512 × 1 byte = 4MB (fits comfortably)
- SP16384 embedding = 16384 × 512 × 1 byte = 8MB
- With int6 for non-embedding params, we can afford the larger embedding
- Each doubling of vocab typically gains ~0.01-0.02 BPB (diminishing returns)
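
The embedding sizes above follow from vocab × dim × bits / 8. The sketch below reproduces them and shows each as a share of the budget; reading "16MB" as 16 MiB is an assumption, and the figures are raw sizes before any entropy coding or compression.

```python
def embedding_bytes(vocab_size, model_dim, bits_per_weight):
    """Uncompressed size of an embedding table in bytes."""
    return vocab_size * model_dim * bits_per_weight // 8


BUDGET_BYTES = 16 * 1024 * 1024  # assumption: the 16MB limit read as 16 MiB

for vocab in (8192, 16384):
    size = embedding_bytes(vocab, 512, 8)
    print(f"SP{vocab}: {size} bytes, {size / BUDGET_BYTES:.0%} of budget")
# SP8192: 4194304 bytes, 25% of budget
# SP16384: 8388608 bytes, 50% of budget
```

At int8 the SP16384 table alone would take half of the budget under this assumption, which is the trade-off behind the size-analysis caveat.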

### Risk
- The embedding becomes a larger fraction of the 16MB budget
- Need to verify that it compresses well with GPTQ
- May need to reduce model_dim or the number of layers to compensate

### Expected Gain
- Conservative estimate: **-0.005 to -0.01 BPB improvement** (based on the 1024→8192 trend)

---

## Combined Strategy

### Phase 1 (Safest, most impactful):
1. Multi-Token Prediction (n=2) ← minimal risk, proven technique, zero artifact cost
2. Adaptive Weight Decay scheduling ← low risk, simple implementation
3. SP16384 tokenizer experiment ← moderate risk, requires size budget analysis

### Phase 2 (Higher potential but more complex):
4. SpiralFormer multi-resolution recurrence ← needs careful implementation
5. In-Place TTT during training ← highest potential but most complex

### Expected Combined Improvement:
- Phase 1 (items 1-3), assuming their gains add: **-0.009 to -0.021 BPB** (conservative -0.009, optimistic -0.021)
- Target: **1.0810 - 0.009 = 1.0720 BPB** (conservative)
- Target: **1.0810 - 0.021 = 1.0600 BPB** (optimistic)
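
The combined range is the sum of the per-technique estimates for MTP, adaptive weight decay, and SP16384. The sketch below redoes that arithmetic; it assumes the gains are independent and additive, which the estimates themselves do not establish.

```python
BASELINE_BPB = 1.0810

# (conservative, optimistic) BPB reductions quoted per technique
gains = {
    "mtp": (0.003, 0.008),
    "adaptive_wd": (0.001, 0.003),
    "sp16384": (0.005, 0.010),
}

conservative = sum(lo for lo, _ in gains.values())
optimistic = sum(hi for _, hi in gains.values())

print(f"combined gain: {conservative:.3f} to {optimistic:.3f} BPB")
print(f"conservative target: {BASELINE_BPB - conservative:.4f}")  # 1.0720
print(f"optimistic target:   {BASELINE_BPB - optimistic:.4f}")    # 1.0600
```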

---

## Architecture for Novel Submission

Base: Current SOTA architecture (11L × 512d × 8H/4KV)
Modifications:
- Add MTP n=2 auxiliary head during training only
- Replace flat recurrence with SpiralFormer schedule
- Progressive WD from 0.02→0.12
- Keep SP8192 (SP16384 as fallback experiment)
- Keep all current techniques: parallel residuals, XSA, skip gates, TTT eval, etc.
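
For reference, the settings named in this note can be collected in one place. The dataclass below is only an illustrative sketch: the field names are hypothetical, "8H/4KV" is read as 8 query heads sharing 4 key/value heads, and the virtual-layer formula is an inference from the 17-layer figure quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmissionConfig:
    n_layers: int = 11
    model_dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4
    vocab_size: int = 8192                        # SP8192; SP16384 is the fallback experiment
    mtp_weights: tuple = (0.7, 0.3)               # loss weights for heads t+1 and t+2
    loop_layers: tuple = (3, 4, 5)                # layers re-run by the recurrence
    loop_resolutions: tuple = (0.5, 0.75, 1.0)    # one entry per loop iteration
    wd_start: float = 0.02
    wd_end: float = 0.12

    @property
    def head_dim(self) -> int:
        return self.model_dim // self.n_heads

    @property
    def queries_per_kv(self) -> int:
        return self.n_heads // self.n_kv_heads

    @property
    def virtual_layers(self) -> int:
        # Each extra loop iteration re-runs the looped layers once more.
        extra_passes = len(self.loop_resolutions) - 1
        return self.n_layers + len(self.loop_layers) * extra_passes


cfg = SubmissionConfig()
print(cfg.head_dim, cfg.queries_per_kv, cfg.virtual_layers)  # 64 2 17
```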