Upload NOVEL_APPROACH.md with huggingface_hub
# Novel SOTA Optimization for Parameter Golf

## Summary of Novel Techniques

After deep analysis of the current SOTA (1.0810 BPB) and the full literature, I propose **5 novel optimizations** that target complementary axes of improvement. Each is grounded in published results and has NOT been fully exploited by any submission on the leaderboard.

---

## 1. Multi-Token Prediction (MTP) Auxiliary Training Loss
**Paper**: Better & Faster Large Language Models via Multi-token Prediction (Meta FAIR, arxiv 2404.19737)
**Status**: NOT used in ANY leaderboard submission

### Why it helps
- Trains the model to predict 2 future tokens simultaneously via 2 output heads on top of the shared trunk
- Forces hidden representations to encode longer-range planning information
- Estimated 20-30% improvement in sample efficiency at no additional inference-time cost
- Heads are DISCARDED at serialization: zero extra bytes in the 16MB artifact
- With only ~4500 training steps in 10 minutes, every bit of sample efficiency matters enormously

### Implementation
```python
import torch.nn.functional as F

# In GPT.forward():
# Standard NTP loss on head_1 (predicting token t+1)
logits_1 = self.forward_logits(input_ids)
# Flatten so cross_entropy sees (N, vocab) logits and (N,) targets
loss_1 = F.cross_entropy(logits_1.reshape(-1, logits_1.size(-1)), target_ids_1.reshape(-1))

# MTP head_2 (predicting token t+2) - lightweight: shared trunk, reuses the tied unembedding
# Use SAME tied embedding weights (no extra params stored)
# head_2 must read different features than head_1 (here: intermediate hidden
# states); with identical features and weights both heads emit the same logits
hidden = self.get_hidden(input_ids)  # intermediate hidden states
logits_2 = F.linear(hidden, self.tok_emb.weight)  # predict t+2
# target_ids_2 is target_ids_1 shifted left by one; positions without a t+2
# token are set to -100, which cross_entropy ignores by default
loss_2 = F.cross_entropy(logits_2.reshape(-1, logits_2.size(-1)), target_ids_2.reshape(-1))

loss = 0.7 * loss_1 + 0.3 * loss_2  # weighted combination
```

**Critical**: The second head uses the SAME embedding weights (tied). No extra parameters. At eval, only head_1 is used.
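
How the two target tensors line up is easy to get wrong, so here is a minimal plain-Python sketch of the shift. `mtp_targets` and `IGNORE` are hypothetical names used only for illustration. The value -100 is chosen on the assumption that the loss is PyTorch's `cross_entropy`, whose default `ignore_index` is -100.

```python
IGNORE = -100  # assumption: matches the default ignore_index of torch's cross_entropy


def mtp_targets(tokens):
    """Return (inputs, targets_1, targets_2) for one token sequence.

    targets_1[i] is the token after inputs[i] (t+1).
    targets_2[i] is the token two steps after inputs[i] (t+2); the last
    position has no such token and is filled with IGNORE.
    """
    if len(tokens) < 2:
        raise ValueError("need at least 2 tokens to form an (input, target) pair")
    inputs = tokens[:-1]
    targets_1 = tokens[1:]
    targets_2 = tokens[2:] + [IGNORE]
    return inputs, targets_1, targets_2


inputs, targets_1, targets_2 = mtp_targets([10, 11, 12, 13, 14])
print(inputs, targets_1, targets_2)
```

All three lists have the same length, so both losses can be computed over the same positions and the padded position simply drops out of `loss_2`.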

### Expected Gain
- Meta reports +2-3% on downstream tasks at 7B scale with n=4
- At small scale with limited steps, the sample-efficiency gain is expected to translate into better BPB
- Conservative estimate: **-0.003 to -0.008 BPB improvement**

---

## 2. SpiralFormer Multi-Resolution Recurrence
**Paper**: SpiralFormer (arxiv 2602.11698, 2025)
**Status**: NO submission uses multi-resolution recurrence (all do flat looping)

### Why it helps
- Current SOTA loops layers 3-5 at full resolution (17 virtual layers, same compute per loop)
- SpiralFormer proposes early loops at COARSENED resolution (e.g., 50% of tokens)
- Later loops at full resolution for fine-grained refinement
- Saves ~15-25% FLOPs per recurrence, so we can afford MORE loops in the same wall-clock time
- SpiralFormer-L at 410M: reduced FLOPs AND improved perplexity vs flat looping
- Induces hierarchical processing: global patterns first, local refinement later
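
As a rough sanity check on the FLOPs claim, the sketch below estimates the relative cost of a looped block under a resolution schedule. Everything here is an assumption for illustration: the schedule (50%, 75%, 100% of tokens), the split between linearly and quadratically scaling FLOPs, the count of 8 non-looped and 9 looped layer applications (read off "layers 3-5 looped, 17 virtual layers"), and the function names. Pooling and upscaling overhead is ignored.

```python
def relative_loop_cost(schedule, quadratic_share=0.0):
    """Cost of the looped block relative to running every iteration at full resolution.

    schedule: fraction of tokens kept in each loop iteration.
    quadratic_share: assumed share of full-resolution FLOPs that scales with
    the square of the sequence length; the remainder scales linearly.
    """
    linear = sum(schedule) / len(schedule)
    quadratic = sum(r * r for r in schedule) / len(schedule)
    return (1.0 - quadratic_share) * linear + quadratic_share * quadratic


def network_saving(schedule, plain_layers=8, looped_layers=3, quadratic_share=0.0):
    """Fraction of whole-network FLOPs saved versus flat looping (hypothetical layer counts)."""
    looped_apps = looped_layers * len(schedule)
    flat = plain_layers + looped_apps
    spiral = plain_layers + looped_apps * relative_loop_cost(schedule, quadratic_share)
    return 1.0 - spiral / flat


schedule = (0.5, 0.75, 1.0)
print(f"looped block cost: {relative_loop_cost(schedule):.3f} of flat")
print(f"whole-network saving: {network_saving(schedule):.1%}")
```

Under these assumptions the looped block costs 75% of flat looping and the whole network saves roughly 13%. Real savings depend on how much of the compute is attention and on the cost of the resampling operators.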

### Implementation
- Loop iteration 0: process at 50% resolution (mean-pool adjacent tokens)
- Loop iteration 1: process at 75% resolution
- Loop iteration 2: process at full resolution
- Use causal downscaling/upscaling operators from the paper
- Key: the shared block weights don't change, just the resolution schedule
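
The 50% step can be illustrated with a plain-Python stand-in for causal pooling. This is not the paper's operator: `downscale` and `upscale_causal` are hypothetical helpers that only show the causality constraint, namely that a fine position may read a pooled vector only once every token in that group is in its past. Integer-factor pooling does not cover the 75% step, which would need a different operator.

```python
def downscale(tokens, factor=2):
    """Mean-pool each complete group of `factor` adjacent token vectors."""
    coarse = []
    for g in range(len(tokens) // factor):
        group = tokens[g * factor:(g + 1) * factor]
        dim = len(group[0])
        coarse.append([sum(vec[d] for vec in group) / factor for d in range(dim)])
    return coarse


def upscale_causal(coarse, length, factor=2):
    """Map coarse vectors back to `length` fine positions without leaking the future.

    Position t receives the latest coarse vector whose whole group ends at or
    before t, and a zero vector if no group has completed yet.
    """
    dim = len(coarse[0]) if coarse else 0
    fine = []
    for t in range(length):
        g = (t + 1) // factor - 1  # index of the last fully observed group
        fine.append(list(coarse[g]) if g >= 0 else [0.0] * dim)
    return fine


tokens = [[1.0], [3.0], [5.0], [7.0], [9.0]]
coarse = downscale(tokens)
print(coarse)
print(upscale_causal(coarse, len(tokens)))
```

In a real model the shared block would run on `coarse` between the two calls, and the upscaled result would be added back to the full-resolution stream.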

### Expected Gain
- SpiralFormer reports 3-10% fewer FLOPs at matched quality, or better quality at matched FLOPs
- Translates to either more loop iterations OR better per-iteration quality
- Conservative estimate: **-0.002 to -0.005 BPB improvement**

---

## 3. In-Place TTT During Training (Meta-Learned TTT Initialization)
**Paper**: In-Place TTT (arxiv 2604.06169, ByteDance, 2025)
**Status**: Current test-time training (TTT) is eval-time only; no submission uses TTT during training

### Why it helps
- The model is currently trained WITHOUT knowledge that it will undergo TTT at eval time
- In-Place TTT trains the model to be GOOD at adapting its W_down fast weights
- This means the eval-time TTT starts from a much better initialization
- The W_down matrices learn to be "easy to fine-tune" during meta-training
- Zero extra stored params: W_down is already part of the model

### Implementation
- During training: split each batch into chunks
- For each chunk: (1) forward pass with current W_down, (2) compute NTP loss on the chunk, (3) update W_down with one GD step using the NTP-aligned objective, (4) continue to the next chunk
- The meta-gradient flows through the TTT update step back to the base model
- At eval: same score-first TTT but starting from weights that are "pre-adapted" for TTT
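
The per-chunk inner loop (score first, then update) can be sketched with a toy model. This is a deliberately simplified stand-in, not the paper's method: a single scalar fast weight `w` replaces W_down, squared error replaces the NTP loss, and `chunked_ttt` is a hypothetical name. The outer meta-gradient, which needs autograd through the update step, is not shown.

```python
def chunked_ttt(xs, ys, w0, lr, chunk_size):
    """Score each chunk with the current fast weight, then take one GD step on it."""
    w = w0
    chunk_losses = []
    for start in range(0, len(xs), chunk_size):
        cx = xs[start:start + chunk_size]
        cy = ys[start:start + chunk_size]
        # (1)+(2) score first: loss of this chunk under the weight BEFORE the update
        errs = [w * x - y for x, y in zip(cx, cy)]
        chunk_losses.append(sum(e * e for e in errs) / len(cx))
        # (3) one gradient-descent step on the chunk's mean squared error
        grad = sum(2.0 * e * x for e, x in zip(errs, cx)) / len(cx)
        w -= lr * grad
        # (4) the next chunk is scored with the updated weight
    return w, chunk_losses


xs = [1.0, 2.0] * 4
ys = [3.0 * x for x in xs]  # toy data generated with a true weight of 3
w, losses = chunked_ttt(xs, ys, w0=0.0, lr=0.1, chunk_size=2)
print(w, losses)
```

Each chunk is scored before the weight sees it, so the reported losses never include information from the chunk being evaluated. That property is what "score-first" refers to.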

### Risk Assessment
- Adds ~30-50% training time overhead, which means fewer total steps
- But: each step is worth more because the model learns to adapt
- Need careful chunk size tuning (512-1024 tokens optimal per the paper)
- May conflict with GPTQ quantization (W_down is quantized to int6)

### Expected Gain
- In-Place TTT shows consistent improvements on long-context tasks
- For parameter golf: the eval-time TTT already gains ~0.002 BPB; meta-learned TTT could double that
- Conservative estimate: **-0.002 to -0.004 BPB improvement**

---

## 4. Adaptive Weight Decay Scheduling (RMS-Driven)
**Paper**: Novel technique informed by Kevin Clark's RMS-compression insight (PR #1218)
**Status**: Current WD is fixed at 0.095; no submission uses adaptive WD

### Why it helps
- Kevin Clark discovered an R²=0.99 correlation between weight RMS and compression ratio
- Lower RMS → lower entropy quantized weights → smaller artifact → more budget for model params
- Currently WD=0.095 is fixed throughout training
- Novel: INCREASE WD progressively during training (WD warmup)
- Early: WD=0.02 (let weights explore freely)
- Mid: WD=0.07 (start constraining)
- Late: WD=0.12 (aggressively compress for serialization)
- This gives the model freedom to learn early, then compresses for storage late

### Implementation
```python
def adaptive_wd(frac):
    # Linear ramp from 0.02 to 0.12 over training;
    # frac is the fraction of training completed, clamped to [0, 1]
    frac = min(max(frac, 0.0), 1.0)
    return 0.02 + 0.10 * frac
```
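
One way the ramp could be wired into a training loop is sketched below. It assumes the optimizer exposes PyTorch-style `param_groups`, a list of dicts with a `"weight_decay"` key; a custom optimizer may store this differently. `wd_at` and `apply_wd` are hypothetical helper names, and plain dicts stand in for the optimizer so the example runs on its own.

```python
def wd_at(step, total_steps, wd_start=0.02, wd_end=0.12):
    """Linearly ramp weight decay from wd_start to wd_end over training."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return wd_start + (wd_end - wd_start) * frac


def apply_wd(param_groups, step, total_steps):
    """Write the scheduled value into every group that already uses weight decay."""
    wd = wd_at(step, total_steps)
    for group in param_groups:
        # Groups configured with zero decay (often norms or biases) are left alone
        if group.get("weight_decay", 0.0) > 0.0:
            group["weight_decay"] = wd
    return wd


# Stand-in for optimizer.param_groups
groups = [{"weight_decay": 0.095}, {"weight_decay": 0.0}]
for step in (0, 2250, 4500):
    apply_wd(groups, step, total_steps=4500)
    print(step, groups[0]["weight_decay"], groups[1]["weight_decay"])
```

Calling `apply_wd(optimizer.param_groups, step, total_steps)` once per step before `optimizer.step()` would be the corresponding use with a real optimizer, under the assumption stated above.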

### Expected Gain
- Better rate-distortion tradeoff: model learns more freely, then compresses better
- Conservative estimate: **-0.001 to -0.003 BPB improvement**

---

## 5. SP16384 Tokenizer
**Status**: Current best uses SP8192; the progression 1024→4096→8192 shows clear BPB wins

### Why it helps
- Larger vocab = more bytes per token = fewer predictions per byte, which lowers BPB as long as per-token loss grows more slowly than bytes per token
- SP8192 embedding with int8 GPTQ = 8192 × 512 × 1 byte = 4MB (fits comfortably)
- SP16384 embedding = 16384 × 512 × 1 byte = 8MB
- With int6 for non-embedding params, we can afford the larger embedding
- Each doubling of vocab typically gains ~0.01-0.02 BPB (diminishing returns)
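
The link between vocabulary size and BPB follows from the metric's definition: bits per byte is the per-token cross-entropy in bits divided by the average number of bytes each token covers. A minimal sketch is given here. The function names are hypothetical and the numbers in the usage lines are made up for illustration, not measured results.

```python
import math


def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert mean cross-entropy in nats per token to bits per byte."""
    return loss_nats_per_token / math.log(2) / bytes_per_token


def larger_vocab_wins(loss_small, bpt_small, loss_large, bpt_large):
    """True if the larger-vocab tokenizer ends up with lower BPB."""
    return bits_per_byte(loss_large, bpt_large) < bits_per_byte(loss_small, bpt_small)


# Illustrative numbers only: per-token loss rises 10% while bytes per token rise 20%
print(larger_vocab_wins(2.0, 3.0, 2.2, 3.6))
# Per-token loss rises 25% while bytes per token rise 20%
print(larger_vocab_wins(2.0, 3.0, 2.5, 3.6))
```

A larger vocabulary raises per-token loss because each prediction is harder, so the gain only materializes when bytes per token grow faster than the loss does.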

### Risk
- Embedding becomes a larger fraction of the 16MB budget
- Need to verify it compresses well with GPTQ
- May need to reduce model_dim or layers to compensate
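
The size arithmetic behind this risk can be checked with a few lines. This sketch counts raw weight bytes before any entropy coding and reads "16MB" as 16 MiB to match the 4MB and 8MB figures above; both the helper names and that reading of the budget are assumptions.

```python
BUDGET_BYTES = 16 * 1024 * 1024  # assumption: "16MB" interpreted as 16 MiB


def embedding_bytes(vocab_size, model_dim, bits_per_weight=8):
    """Raw storage for a vocab_size x model_dim embedding at the given bit width."""
    return vocab_size * model_dim * bits_per_weight // 8


def budget_fraction(vocab_size, model_dim, bits_per_weight=8, budget=BUDGET_BYTES):
    """Share of the artifact budget taken by the embedding alone."""
    return embedding_bytes(vocab_size, model_dim, bits_per_weight) / budget


for vocab in (8192, 16384):
    size = embedding_bytes(vocab, 512)
    print(f"SP{vocab}: {size} bytes, {budget_fraction(vocab, 512):.0%} of budget")
```

At int8 the SP16384 embedding alone takes half of the budget under this reading, which is why the remaining layers may have to shrink unless the embedding compresses well.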

### Expected Gain
- Conservative estimate: **-0.005 to -0.01 BPB improvement** (based on 1024→8192 trend)

---

## Combined Strategy

### Phase 1 (Safest, most impactful):
1. Multi-Token Prediction (n=2): low risk, proven technique, zero artifact cost
2. Adaptive Weight Decay scheduling: low risk, simple implementation
3. SP16384 tokenizer experiment: moderate risk, requires size budget analysis

### Phase 2 (Higher impact but more complex):
4. SpiralFormer multi-resolution recurrence: needs careful implementation
5. In-Place TTT during training: highest potential but most complex

### Expected Combined Improvement:
- Phase 1 techniques (MTP, adaptive WD, SP16384): **-0.009 to -0.021 BPB** (conservative -0.009, optimistic -0.021)
- Target: **1.0810 - 0.009 = 1.0720 BPB** (conservative)
- Target: **1.0810 - 0.021 = 1.0600 BPB** (optimistic)
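
The combined range is the sum of the per-technique estimates for MTP, adaptive WD, and SP16384, which assumes the gains are independent and additive. That assumption is optimistic if the techniques overlap. A short check of the arithmetic, with hypothetical names:

```python
# (conservative, optimistic) BPB deltas, copied from the Expected Gain sections
GAINS = {
    "mtp": (-0.003, -0.008),
    "adaptive_wd": (-0.001, -0.003),
    "sp16384": (-0.005, -0.010),
}
SOTA_BPB = 1.0810


def combined_targets(gains, baseline):
    """Sum the per-technique deltas, assuming they add up without interaction."""
    conservative = sum(lo for lo, _ in gains.values())
    optimistic = sum(hi for _, hi in gains.values())
    return round(baseline + conservative, 4), round(baseline + optimistic, 4)


print(combined_targets(GAINS, SOTA_BPB))
```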

---

## Architecture for Novel Submission

Base: Current SOTA architecture (11L × 512d × 8H/4KV)

Modifications:
- Add MTP n=2 auxiliary head during training only
- Replace flat recurrence with SpiralFormer schedule
- Progressive WD from 0.02→0.12
- Keep SP8192 (SP16384 as fallback experiment)
- Keep all current techniques: parallel residuals, XSA, skip gates, TTT eval, etc.