Readme.md update
README.md

## TL;DR
| Feature | Description |
|---------|-------------|
| Architecture | 24-layer causal Transformer (RoPE, untied embeddings, 551M params) |
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Context Length | **1024 (0-30k)** → 2048 (30k-60k); *512 causes router collapse on 24L* |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine-tune |

---

```bash
pip install -e .
```

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```

| Source | Share | Link |
|--------|-------|------|
| FinePDFs-1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |

Notes
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion's guidance.

Generation example:
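
The README's full generation example is not included in this excerpt. As a stand-in, here is a minimal hedged sketch that assumes `VeronicaForCausalLM` exposes a standard Hugging Face-style `generate()` API; with the freshly initialized config above the output is untrained noise, and the repository's real example may use a tokenizer and a trained checkpoint instead.

```python
import torch

# Hypothetical prompt given as raw token ids (no tokenizer assumed here).
input_ids = torch.tensor([[1, 17, 42, 7]])

model.eval()
with torch.no_grad():
    # Assumes an HF-compatible generate(); adjust if the project ships a custom sampler.
    out = model.generate(input_ids, max_new_tokens=32, do_sample=True, temperature=0.8)

print(out[0].tolist())
```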

| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max Seq Len | 1024 → 2048 (curriculum) |
| Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled ≈2×) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:

```bash
python scripts/train_veronica.py \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
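
The `--router_tau_*` and `--router_aux_*` flags describe schedules rather than single values. Below is a sketch of one plausible interpretation, assuming τ is held constant during the freeze window and then interpolated linearly to its end value while λ ramps linearly over training; the actual schedule shapes in `scripts/train_veronica.py` may differ.

```python
def router_schedules(step, total_steps=60_000,
                     tau_start=2.2, tau_end=1.4, tau_freeze_steps=6_000,
                     aux_start=0.008, aux_end=0.016):
    """Hypothetical τ/λ schedules implied by the CLI flags above."""
    if step < tau_freeze_steps:
        tau = tau_start                      # frozen: no sharpening yet
    else:
        frac = (step - tau_freeze_steps) / max(1, total_steps - tau_freeze_steps)
        tau = tau_start + (tau_end - tau_start) * min(1.0, frac)
    lam = aux_start + (aux_end - aux_start) * min(1.0, step / total_steps)
    return tau, lam

print(router_schedules(0))       # (2.2, 0.008)
print(router_schedules(6_000))   # τ starts decaying from here
print(router_schedules(60_000))  # (1.4, 0.016)
```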

---

## Critical Discovery: Context Length & Router Stability on Deep Models

### The 512 Token Trap (24L Only)

**Finding**: With 24 layers, starting training at **512 context length causes router collapse** by step 3k:
```
Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
```

**Root Cause** (see the arithmetic sketch below):
- With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
- But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
- **Insufficient signal** for stable gradient descent on router parameters
- Weak branches cannot recover from random initialization noise
- Router collapses toward the dominant branch to minimize aux loss conflict

**Why This Doesn't Happen on 12L**:
- Same 512 tokens → 6,144 routing examples
- Each branch-layer: **~170 samples** (same as 24L)
- But **12 layers = shorter gradient path** → less noise accumulation
- Router can stabilize before collapse
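
A quick back-of-the-envelope helper reproducing the sample counts above (pure arithmetic, no project code assumed):

```python
def routing_samples(seq_len, n_layer, num_funcs=3):
    """Routing decisions per batch and rough gradient samples per branch-layer pair."""
    decisions = seq_len * n_layer                            # one routing decision per token per layer
    per_branch_layer = decisions / (num_funcs * n_layer)     # simplifies to seq_len / num_funcs
    return decisions, per_branch_layer

print(routing_samples(512, 24))   # (12288, ~170)  -> collapse-prone on 24L
print(routing_samples(512, 12))   # (6144, ~170)   -> tolerable on 12L (shorter gradient path)
print(routing_samples(1024, 24))  # (24576, ~341)  -> stable, per the corrected curriculum
```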

### Solution: Start at 1024 for Deep Models

**Corrected curriculum for 24L**:
```
0-20k steps:   1024 tokens → 24,576 routing examples = stable gradients
20k-60k steps: 2048 tokens → 49,152 routing examples = final quality
```

**DO NOT use 512 ctx on 24L**; this is an empirical hard constraint, not a performance optimization.

**For 12L and shallower**: the 512→1024→2048 curriculum works fine.

**Mathematical threshold**: ~15-20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher for ≥24L.

---

## Depth Scaling for 24L (Mathematical Rationale)

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures (a small helper that applies all three rules follows below):

### Temperature Scaling
Softmax sharpness compounds across layers. To preserve exploration:
```
τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
```
For the 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).

### Aux Weight Scaling
The entropy gradient must compete with 24 layers pulling toward specialization:
```
λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
```
For the 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.

### Forcing Probability
Each branch needs more examples across the deeper network:
```
P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
```
For 12L `5%`, we use **`10%`** for 24L during warmup (0-5k steps).
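
A minimal helper expressing the three rules above; the function name and call pattern are illustrative, not part of the `veronica` API, and baseline end-values not stated in the README are back-derived assumptions.

```python
import math

def depth_scaled_router_params(n_layer, base_layers=12,
                               base_tau=(1.6, 1.0), base_aux=(0.005, 0.012),
                               base_force_prob=0.05):
    """Scale 12L router hyperparameters to a deeper model.

    τ and λ grow with sqrt(depth ratio); forcing probability grows linearly.
    """
    ratio = n_layer / base_layers
    s = math.sqrt(ratio)
    tau = tuple(round(t * s, 2) for t in base_tau)
    aux = tuple(round(a * s, 4) for a in base_aux)
    force_prob = base_force_prob * ratio
    return {"tau": tau, "aux": aux, "force_prob": force_prob}

print(depth_scaled_router_params(24))
# ≈ {'tau': (2.26, 1.41), 'aux': (0.0071, 0.017), 'force_prob': 0.1}
# rounded in practice to τ = 2.2 → 1.4, λ = 0.008 → 0.016, forcing = 10%
```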

### Empirical Results (Training Logs)
- **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
- **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
- **Step 7k**: Entropy 0.80-0.93 (exploration phase post tau-freeze)
- **Step 10k**: Loss ~34, no branch collapse
- **Step 11k** (post branch-1 recovery): Entropy 0.84-0.93, distribution `[0.57, 0.15, 0.27]` ✅
- **Step 12k**: Stable soft routing, eval loss 4.07

---

## Router Health Metrics
Monitor log lines:
```
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

### Targets by Training Phase
| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|-------|-------|----------------|------------------|-------|
| Warmup | 0-5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k-10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k-60k | ≥0.65 | ≥0.12 | Acceptable specialization |

### Observed Distribution (24L, Step 12k)
```
alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
```
Ideal soft routing: dominant branch ~55-65%, minorities ~15-25% each.
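
`entropy_norm` is the entropy of the branch weights normalized by its maximum (the log of the branch count). A standalone check, assuming natural-log normalization, reproduces the logged value above:

```python
import math

def entropy_norm(alpha):
    """Normalized entropy of router branch weights: 1.0 = uniform, 0.0 = collapsed."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

alpha = [0.571, 0.153, 0.276]
print(round(entropy_norm(alpha), 3))  # 0.876, matching the log line above
print(min(alpha) >= 0.15)             # True -> meets the "Stable" phase min-share target
```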

---

## Context Length Curriculum

### Architecture-Dependent Strategy

**For 24L (≥20L in general)**:
```
1024 tokens: Steps 0-20k  (NO 512 phase; it causes router collapse)
2048 tokens: Steps 20k-60k
```

**For 12L and shallower**:
```
512 tokens:  Steps 0-10k
1024 tokens: Steps 10k-30k
2048 tokens: Steps 30k-60k
```
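
A tiny schedule helper encoding the two curricula above; purely illustrative (the training script takes `--max_seq_len` per run and is resumed manually, so this is not an actual interface of the project):

```python
def seq_len_for_step(step, n_layer):
    """Context-length curriculum: deep models (>=20 layers) skip the 512 phase entirely."""
    if n_layer >= 20:
        return 1024 if step < 20_000 else 2048
    if step < 10_000:
        return 512
    return 1024 if step < 30_000 else 2048

print(seq_len_for_step(0, 24))       # 1024 (never 512 on deep models)
print(seq_len_for_step(25_000, 24))  # 2048
print(seq_len_for_step(5_000, 12))   # 512
```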

---

### Phase 1 (24L): 1024 Tokens (Steps 0-20k)
- **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
- **VRAM**: ~8-9GB (batch=4, accum=8)
- **Throughput**: ~8-10 sec/step
- **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps

### Phase 2 (24L): 2048 Tokens (Steps 20k-60k)
- **Purpose**: Final capacity, long-document coherence
- **VRAM**: ~12-13GB (batch=4, accum=8)
- **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
- **Expected dip**: Temporary entropy drop of 0.02-0.04, recovers within 500 steps

The VRAM figures above assume BF16.

### Switching Template
```bash
python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048
  # ... keep all other router params unchanged
```

---

## Router Stability (Important)

Dynamic soft-routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

### Known Issues & Solutions
| Issue | Symptom | Solution |
|-------|---------|----------|
| Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |

### Failed Experiment: Adaptive Forcing (DO NOT USE)

**Attempted solution**: Auto-detect weak branches (below a threshold) and dynamically apply forcing windows:
```python
# BROKEN CODE - DO NOT USE (pseudocode of the reverted logic)
if min(alpha) < 0.15 and not in_cooldown:
    weak_idx = argmin(alpha)
    force_branch_idx = weak_idx
    force_until = current_step + 1000
    in_cooldown = True
```

**Why it failed**:
1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
2. **Artificial alpha**: During forced windows, alpha reflects the forcing distribution [0, 0, 1], not learned preferences
3. **Gradient confusion**: The aux loss receives artificial entropy signals and disrupts learning
4. **Manual intervention superior**: Targeted forcing with aux=0 isolates the signal cleanly

**Lesson**: The router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing is for recovery only, not automation.

### Safeguards Implemented (Validated)
1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
3. **Entropy-max loss**: Subtract (not add) the aux loss to maximize branch diversity (see the sketch after this list)
4. **Warmup forcing**: 10% probability during the first 5k steps ensures all branches receive gradients
5. **FP32 LayerNorm**: Prevents BF16 precision drift in the routing logits
6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
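
A minimal sketch of the entropy-maximizing objective described in item 3, assuming the auxiliary term is the normalized branch-weight entropy averaged over layers; variable names are illustrative, not the project's internals:

```python
import torch

def polymorphic_loss(lm_loss, branch_alphas, aux_weight):
    """lm_loss: scalar LM loss; branch_alphas: per-layer branch weights, each of shape (num_funcs,)."""
    entropies = []
    for alpha in branch_alphas:
        h = -(alpha * torch.log(alpha.clamp_min(1e-9))).sum()
        entropies.append(h / torch.log(torch.tensor(float(alpha.numel()))))
    avg_entropy = torch.stack(entropies).mean()
    # Entropy-max: SUBTRACT the aux term so that higher branch diversity lowers the loss.
    return lm_loss - aux_weight * avg_entropy

alphas = [torch.softmax(torch.randn(3), dim=0) for _ in range(24)]
loss = polymorphic_loss(torch.tensor(3.5), alphas, aux_weight=0.012)
print(loss.item())
```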

### Intervention Playbook (Step-by-Step)
**Scenario: Branch drops <10% before 5k steps**
1. Stop training, resume from the last good checkpoint
2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
3. Extend `--router_tau_freeze_steps` by +2000
4. Increase `--router_force_prob` to 0.12-0.15

**Scenario: Branch stuck <12% after 10k steps**
1. Run targeted forcing (see Incremental Expansion section)
2. Force the weak branch for 1k steps with `aux=0`, LR=5e-5
3. Resume normal training with aux restored
4. Expected recovery: +3-8% share within 500 steps

**Scenario: Entropy <0.70 and falling after 15k**
1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening

### Fine-Tuning Note
If using the standard HF Trainer without a custom loss, set `router_aux_weight=0` in the config to avoid an incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
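
For example, a config along these lines when fine-tuning with the vanilla Trainer (a sketch: `router_aux_weight` is the field named in the note above, the other arguments mirror the Quickstart):

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

# Disable the entropy-max aux term when the training loop cannot subtract it correctly.
cfg = VeronicaConfig(n_layer=24, num_funcs=3, router_aux_weight=0.0)
model = VeronicaForCausalLM(cfg)
```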

### Empirical Training Log (24L Complete Journey)

**First attempt (FAILED: 512 ctx)**:
- **Step 0-300**: Perfect init (entropy 1.0) with high tau + forcing
- **Step 3000**: **Router collapse**: alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
- **Diagnosis**: 512 ctx insufficient for 24L depth
- **Action**: Abandoned run, restarted from scratch with 1024 ctx

**Adaptive forcing experiment (FAILED)**:
- **Implementation**: Auto-detect weak branches, dynamic forcing windows
- **Outcome**: Cascade loops, artificial alpha patterns [0, 0, 1], gradient confusion
- **Action**: Reverted the code, relied on tau/aux only

**Final successful run (1024 ctx from step 0)**:
- **Step 0-300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
- **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
- **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
- **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
- **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
- **Step 6000-7000**: Entropy spikes 0.80-0.93 (exploration phase, expected)
- **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
- **Intervention**: Targeted forcing on branch 1 (10k-11k steps)
  - `--force_branch_idx 1 --force_branch_until 11000`
  - `--router_aux_start 0.0` (isolate gradient signal)
  - `--learning_rate 5e-5` (gentle nudge)
- **Step 11000**: Branch 1 recovered to 15%, entropy 0.84-0.93 ✅
- **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
  - Eval loss 4.41→4.07 (intervention improved generalization)
  - Loss trend: 34→33 (continued healthy descent)
  - **All branches active and contributing**

**Key learnings**:
1. ✅ 1024 ctx required from step 0 for 24L
2. ✅ Depth-scaled tau/aux/forcing parameters validated
3. ✅ Targeted forcing (aux=0, short window) effective for recovery
4. ❌ Adaptive forcing causes more problems than it solves
5. ✅ Entropy 0.84-0.93 with min branch 15% = healthy soft routing

**Status**: Methodology validated on 24L/551M through 12k steps. Ready for the 2048 ctx phase (30k+). Core API stable; default schedules proven effective.

---

## Practical Training Tips

### DO
- ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
- ✅ Scale tau/aux with √(depth_ratio) when changing layer count
- ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
- ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
- ✅ Monitor entropy every 100 steps; save checkpoints every 500
- ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
- ✅ Keep the aux weight increasing throughout training (e.g., 0.008→0.016)
- ✅ Trust the depth-scaled parameters; they're empirically validated

### DON'T
- ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps; empirically proven)
- ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
- ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
- ❌ Set aux=0 for normal training (only during targeted forcing windows)
- ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
- ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
- ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
- ❌ Use the 512→1024→2048 curriculum on deep models (≥20L requires a 1024 start)

### VRAM Optimization
If hitting OOM on 2048 ctx:
```bash
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16   # keeps effective batch = 32
```

### Quick Health Check (Per 1k Steps)
```bash
grep "\[router\]" logs/train.log | tail -10
```
Look for (a small parsing sketch follows this list):
- Entropy trend (should be ≥0.70)
- Min branch value (should be ≥0.12)
- Loss trend (should decrease or stabilize)
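
An optional Python equivalent of the grep check, assuming only the log format shown above (`[router] alpha=[...] entropy_norm=E`); the path and thresholds are illustrative:

```python
import re

LINE = re.compile(r"\[router\] alpha=\[(?P<alpha>[^\]]+)\] entropy_norm=(?P<ent>[0-9.]+)")

def check_router_log(path="logs/train.log", min_entropy=0.70, min_share=0.12, last_n=10):
    hits = []
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                alpha = [float(x) for x in m.group("alpha").split(",")]
                hits.append((alpha, float(m.group("ent"))))
    for alpha, ent in hits[-last_n:]:
        healthy = ent >= min_entropy and min(alpha) >= min_share
        print(f"entropy={ent:.3f} min_branch={min(alpha):.3f} {'OK' if healthy else 'CHECK'}")

check_router_log()
```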

---