m1b
/

parameter-golf-novel

m1b committed on Apr 22

Commit

dafeda3

1 Parent(s): 8e7ba39

Upload NOVEL_APPROACH.md with huggingface_hub

NOVEL_APPROACH.md +174 -0

NOVEL_APPROACH.md ADDED

	@@ -0,0 +1,174 @@

+# Novel SOTA Optimization for Parameter Golf
+## Summary of Novel Techniques
+After analyzing the current SOTA (1.0810 BPB) and the relevant literature, I propose **5 optimizations** that target complementary axes of improvement. Each is grounded in published results and has NOT been fully exploited by any submission on the leaderboard.
+---
+## 1. Multi-Token Prediction (MTP) Auxiliary Training Loss
+**Paper**: Better & Faster Large Language Models via Multi-token Prediction (Meta FAIR, arXiv 2404.19737)
+**Status**: NOT used in ANY leaderboard submission
+### Why it helps
+- Trains the model to predict the next 2 tokens at once, using 2 output heads on top of the shared trunk
+- Forces hidden representations to encode longer-range planning information
+- Estimated 20-30% better sample efficiency at no additional inference-time cost
+- The auxiliary head is used only during training and reuses the tied embedding weights → zero extra bytes in the 16MB artifact
+- With only ~4500 training steps in 10 minutes, sample efficiency matters a great deal
+### Implementation
+```python
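+import torch.nn.functional as F
+
+# In GPT.forward():
+# Standard NTP loss on head_1 (predicting token t+1); logits assumed to be (B, T, V)
+logits_1 = self.forward_logits(input_ids)
+loss_1 = F.cross_entropy(logits_1.reshape(-1, logits_1.size(-1)), target_ids_1.reshape(-1))
+# MTP head_2 (predicting token t+2) - lightweight: shared trunk, read out from
+# intermediate hidden states so its logits are not identical to head_1
+# Use SAME tied embedding weights (no extra params stored)
+hidden = self.get_hidden(input_ids)  # intermediate hidden states, (B, T, D)
+logits_2 = F.linear(hidden, self.tok_emb.weight)  # predict t+2
+# target_ids_2 must hold the token two positions ahead; positions without one
+# (the sequence end) should be set to -100, the default ignore_index
+loss_2 = F.cross_entropy(logits_2.reshape(-1, logits_2.size(-1)), target_ids_2.reshape(-1))
+loss = 0.7 * loss_1 + 0.3 * loss_2  # weighted combination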
+**Critical**: The second head uses the SAME embedding weights (tied). No extra parameters. At eval, only head_1 is used.
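The two target streams differ only by a one-token shift. A minimal sketch of how they could be built from one token sequence follows; `make_mtp_batch` is a hypothetical helper, not a function from the training script, and it trims the sequence end rather than padding with an ignore index.

```python
def make_mtp_batch(tokens):
    """Split one token sequence into inputs and the two shifted target streams."""
    if len(tokens) < 3:
        raise ValueError("need at least 3 tokens to form a t+2 target")
    inputs = tokens[:-2]       # token at position t
    targets_1 = tokens[1:-1]   # token at t+1 (standard next-token target)
    targets_2 = tokens[2:]     # token at t+2 (auxiliary MTP target)
    return inputs, targets_1, targets_2


inputs, targets_1, targets_2 = make_mtp_batch([10, 11, 12, 13, 14])
print(inputs, targets_1, targets_2)
```

Trimming costs one next-token position per sequence. Keeping the full length and marking the missing t+2 target as ignored is the other common choice.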
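+### Expected Gain
+- Meta reports +2-3% on downstream tasks at 7B scale with n=4
+- The paper finds that the benefit grows with model size and that its smallest models did not improve, so the gain at this scale is unverified
+- If the sample efficiency gain does carry over to a small model with limited steps, it should translate directly to better BPB
+- Conservative estimate: **-0.003 to -0.008 BPB improvement**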
+---
+## 2. SpiralFormer Multi-Resolution Recurrence
+**Paper**: SpiralFormer (arXiv 2602.11698, 2026)
+**Status**: NO submission uses multi-resolution recurrence (all do flat looping)
+### Why it helps
+- Current SOTA loops layers 3-5 at full resolution (17 virtual layers, same compute per loop)
+- SpiralFormer proposes early loops at COARSENED resolution (e.g., 50% of tokens)
+- Later loops at full resolution for fine-grained refinement
+- Saves ~15-25% FLOPs per recurrence → can afford MORE loops in the same wall-clock time
+- SpiralFormer-L at 410M: reduced FLOPs AND improved perplexity vs flat looping
+- Induces hierarchical processing: global patterns first, local refinement later
+### Implementation
+- Loop iteration 0: process at 50% resolution (mean-pool adjacent tokens)
+- Loop iteration 1: process at 75% resolution
+- Loop iteration 2: process at full resolution
+- Use causal downscaling/upscaling operators from the paper
+- Key: the shared block weights don't change, just the resolution schedule
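To sanity-check the FLOPs claim for the 50% / 75% / 100% schedule, here is a rough cost model. It is a sketch under stated assumptions, not the paper's accounting: token-wise work is taken to scale linearly with the kept fraction, attention-score work quadratically, and the cost of the pooling and upscaling operators is ignored. `relative_loop_cost` and `attn_fraction` are hypothetical names.

```python
def relative_loop_cost(schedule, attn_fraction=0.0):
    """Mean cost per loop relative to a full-resolution loop.

    Assumption: token-wise work (MLP, projections) scales linearly with the
    kept token fraction r, while the attention-score part scales as r**2.
    attn_fraction is the share of full-resolution cost spent on that part.
    """
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    costs = [(1.0 - attn_fraction) * r + attn_fraction * r * r for r in schedule]
    return sum(costs) / len(costs)


schedule = (0.5, 0.75, 1.0)  # resolution of loop iterations 0, 1, 2
linear_only = relative_loop_cost(schedule)
with_attention = relative_loop_cost(schedule, attn_fraction=0.2)
print(f"linear-only cost: {linear_only:.3f}, with 20% attention share: {with_attention:.3f}")
```

Under the linear-only assumption the looped block costs 75% of flat looping, which is the top of the ~15-25% savings range quoted above. Any overhead from the resolution-changing operators would reduce that.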
+### Expected Gain
+- SpiralFormer reports -3 to -10% FLOPs at matched quality, or better quality at matched FLOPs
+- Translates to either more loop iterations OR better per-iteration quality
+- Conservative estimate: **-0.002 to -0.005 BPB improvement**
+---
+## 3. In-Place TTT During Training (Meta-Learned TTT Initialization)
+**Paper**: In-Place TTT (arXiv 2604.06169, ByteDance, 2026)
+**Status**: Current TTT is eval-time only; no submission uses TTT during training
+### Why it helps
+- The model is currently trained WITHOUT knowledge that it will undergo TTT at eval time
+- In-Place TTT trains the model to be GOOD at adapting its W_down fast weights
+- This means the eval-time TTT starts from a much better initialization
+- The W_down matrices learn to be "easy to fine-tune" during meta-training
+- Zero extra stored params: W_down is already part of the model
+### Implementation
+- During training: for each batch, split into chunks
+- For each chunk: (1) forward pass with current W_down, (2) compute NTP loss on chunk, (3) update W_down with one GD step using the NTP-aligned objective, (4) continue to next chunk
+- The meta-gradient flows through the TTT update step back to the base model
+- At eval: same score-first TTT but starting from weights that are "pre-adapted" for TTT
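The ordering of steps (1)-(4) can be illustrated with a toy. This is only a sketch of the score-first, then-update loop: it uses a single scalar weight and squared error as a stand-in for W_down and the NTP loss, and it leaves out the meta-gradient through the update entirely. `chunked_ttt_pass` is a hypothetical name.

```python
def chunked_ttt_pass(xs, ys, w, lr, chunk_size):
    """Score each chunk with the current fast weight, then take one GD step on it."""
    losses = []
    for start in range(0, len(xs), chunk_size):
        cx = xs[start:start + chunk_size]
        cy = ys[start:start + chunk_size]
        # (1) + (2): forward pass and loss with the weight as it was BEFORE this chunk
        errs = [w * x - y for x, y in zip(cx, cy)]
        losses.append(sum(e * e for e in errs) / len(errs))
        # (3): one gradient-descent step on the same chunk, d/dw mean((w*x - y)**2)
        grad = sum(2.0 * e * x for e, x in zip(errs, cx)) / len(errs)
        w = w - lr * grad
        # (4): the updated weight carries over to the next chunk
    return losses, w


losses, final_w = chunked_ttt_pass([1.0] * 8, [2.0] * 8, w=0.0, lr=0.25, chunk_size=4)
print(losses, final_w)
```

Each chunk is scored before the weight has seen it, so the recorded losses stay valid as evaluation numbers. The second chunk's loss is lower only because of the update made on the first.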
+### Risk Assessment
+- Adds ~30-50% training time overhead → fewer total steps
+- But: each step is worth more because the model learns to adapt
+- Need careful chunk size tuning (512-1024 tokens optimal per the paper)
+- May conflict with GPTQ quantization (W_down is quantized to int6)
+### Expected Gain
+- In-Place TTT shows consistent improvements on long-context tasks
+- For parameter golf: eval-time TTT already gains ~0.002 BPB; meta-learned TTT could double that
+- Conservative estimate: **-0.002 to -0.004 BPB improvement**
+---
+## 4. Adaptive Weight Decay Scheduling (RMS-Driven)
+**Paper**: Novel technique informed by Kevin Clark's RMS-compression insight (PR #1218)
+**Status**: Current WD is fixed at 0.095; no submission uses adaptive WD
+### Why it helps
+- Kevin Clark discovered R²=0.99 correlation between weight RMS and compression ratio
+- Lower RMS → lower entropy quantized weights → smaller artifact → more budget for model params
+- Currently WD=0.095 is fixed throughout training
+- Novel: INCREASE WD progressively during training (WD warmup)
+  - Early: WD=0.02 (let weights explore freely)
+  - Mid: WD=0.05 (start constraining)
+  - Late: WD=0.12 (aggressively compress for serialization)
+- This gives the model freedom to learn early, then compresses for storage late
+### Implementation
+```python
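+def adaptive_wd(frac):
+    # frac: fraction of training completed, in [0, 1]
+    # Linear ramp from 0.02 to 0.12 over training (0.07 at the midpoint)
+    frac = min(max(frac, 0.0), 1.0)
+    return 0.02 + 0.10 * frac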
+```
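The linear ramp gives 0.07 at the midpoint, not the 0.05 listed for "Mid". If the three listed values are meant literally, a piecewise-linear schedule through them is one option. The sketch below assumes Early / Mid / Late sit at 0%, 50% and 100% of training. It also assumes the optimizer exposes a `param_groups` list of dicts with a `"weight_decay"` key, as PyTorch optimizers do; a plain list of dicts stands in for it here. `waypoint_wd` and `set_weight_decay` are hypothetical names.

```python
def waypoint_wd(frac, waypoints=((0.0, 0.02), (0.5, 0.05), (1.0, 0.12))):
    """Piecewise-linear weight decay through (training fraction, wd) waypoints."""
    frac = min(max(frac, 0.0), 1.0)
    for (f0, wd0), (f1, wd1) in zip(waypoints, waypoints[1:]):
        if frac <= f1:
            t = (frac - f0) / (f1 - f0)
            return wd0 + t * (wd1 - wd0)
    return waypoints[-1][1]


def set_weight_decay(param_groups, wd):
    """Write the scheduled value into every parameter group (mutates in place)."""
    for group in param_groups:
        group["weight_decay"] = wd


# Stand-in for optimizer.param_groups; in training this would run once per step.
param_groups = [{"weight_decay": 0.095}, {"weight_decay": 0.095}]
total_steps = 4500
for step in (0, 2250, 4500):
    set_weight_decay(param_groups, waypoint_wd(step / total_steps))
    print(step, round(param_groups[0]["weight_decay"], 4))
```

Whether the optimizer in the training script reads weight decay from its parameter groups on every step is an assumption to check before relying on this.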
+### Expected Gain
+- Better rate-distortion tradeoff: model learns more freely, then compresses better
+- Conservative estimate: **-0.001 to -0.003 BPB improvement**
+---
+## 5. SP16384 Tokenizer
+**Status**: Current best uses SP8192; the progression 1024→4096→8192 shows clear BPB wins
+### Why it helps
+- Larger vocab = more bytes covered per token = fewer predictions per byte, which has lowered BPB so far
+- SP8192 embedding with int8 GPTQ = 8192 × 512 × 1 byte = 4MB (fits comfortably)
+- SP16384 embedding = 16384 × 512 × 1 byte = 8MB
+- With int6 for non-embedding params, we can afford the larger embedding
+- Each doubling of vocab typically gains ~0.01-0.02 BPB (diminishing returns)
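The embedding sizes above can be checked against the artifact budget with a few lines of arithmetic. This sketch assumes "16MB" means 16,000,000 bytes and counts raw int8 weights before any compression; `embedding_bytes` and `BUDGET_BYTES` are hypothetical names.

```python
def embedding_bytes(vocab_size, model_dim=512, bytes_per_weight=1):
    """Raw size of the embedding table, before any entropy coding."""
    return vocab_size * model_dim * bytes_per_weight


BUDGET_BYTES = 16_000_000  # assumption: "16MB" read as decimal megabytes

for vocab in (8192, 16384):
    size = embedding_bytes(vocab)
    print(f"SP{vocab}: {size:,} bytes, {size / BUDGET_BYTES:.1%} of budget")
```

Under these assumptions SP8192 takes 4,194,304 bytes (26.2% of the budget) and SP16384 takes 8,388,608 bytes (52.4%), so the larger vocabulary leaves less than half of the raw budget for everything else.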
+### Risk
+- Embedding becomes a larger fraction of the 16MB budget
+- Need to verify it compresses well with GPTQ
+- May need to reduce model_dim or layers to compensate
+### Expected Gain
+- Conservative estimate: **-0.005 to -0.01 BPB improvement** (based on 1024→8192 trend)
+---
+## Combined Strategy
+### Phase 1 (Safest, most impactful):
+1. Multi-Token Prediction (n=2) ← low risk, published technique, zero artifact cost
+2. Adaptive Weight Decay scheduling ← low risk, simple implementation
+3. SP16384 tokenizer experiment ← moderate risk, requires size budget analysis
+### Phase 2 (Higher impact but more complex):
+4. SpiralFormer multi-resolution recurrence ← needs careful implementation
+5. In-Place TTT during training ← highest potential but most complex
+### Expected Combined Improvement:
+- Phase 1 techniques (1-3 above), assuming their individual gains add: **-0.009 to -0.021 BPB**
+- Target: **1.0810 - 0.009 = 1.0720 BPB** (conservative)
+- Target: **1.0810 - 0.021 = 1.0600 BPB** (optimistic)
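The targets follow from summing the per-technique estimates for the three Phase 1 items. The short check below reproduces that arithmetic; it assumes the gains are independent and add linearly, which is the optimistic case if the techniques overlap. `combined_targets` is a hypothetical name.

```python
BASELINE_BPB = 1.0810

# (low, high) estimated BPB reductions, taken from the sections above
phase1_gains = {
    "MTP (n=2)": (0.003, 0.008),
    "Adaptive weight decay": (0.001, 0.003),
    "SP16384 tokenizer": (0.005, 0.010),
}


def combined_targets(baseline, gains):
    """Return (conservative, optimistic) BPB targets if the gains simply add."""
    low = sum(lo for lo, _ in gains.values())
    high = sum(hi for _, hi in gains.values())
    return round(baseline - low, 4), round(baseline - high, 4)


conservative, optimistic = combined_targets(BASELINE_BPB, phase1_gains)
print(conservative, optimistic)
```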
+---
+## Architecture for Novel Submission
+Base: Current SOTA architecture (11L × 512d × 8H/4KV)
+Modifications:
+- Add MTP n=2 auxiliary head during training only
+- Replace flat recurrence with SpiralFormer schedule
+- Progressive WD from 0.02→0.12
+- Keep SP8192 as the default (SP16384 as a separate experiment, pending the size budget analysis)
+- Keep all current techniques: parallel residuals, XSA, skip gates, TTT eval, etc.