Readme.md update
README.md

## TL;DR
| Feature | Description |
|---------|-------------|
| Architecture | 24-layer causal Transformer (RoPE, untied embeddings, 551M params) |
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Context Length | **1024 (0-30k)** → 2048 (30k-60k); *512 causes router collapse on 24L* |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine-tune |

---

```bash
pip install -e .
```

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```

| Source | Share | Link |
|--------|-------|------|
| FinePDFs-1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |

Notes
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion's guidance.

Generation example:
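
The README's full generation example is not included in this excerpt. As a stand-in, here is a minimal hedged sketch that assumes `VeronicaForCausalLM` exposes a standard Hugging Face-style `generate()` API; with the freshly initialized config above the output is untrained noise, and the repository's real example may use a tokenizer and a trained checkpoint instead.

```python
import torch

# Hypothetical prompt given as raw token ids (no tokenizer assumed here).
input_ids = torch.tensor([[1, 17, 42, 7]])

model.eval()
with torch.no_grad():
    # Assumes an HF-compatible generate(); adjust if the project ships a custom sampler.
    out = model.generate(input_ids, max_new_tokens=32, do_sample=True, temperature=0.8)

print(out[0].tolist())
```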

| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max Seq Len | 1024 → 2048 (curriculum) |
| Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled ≈2×) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:

```bash
python scripts/train_veronica.py \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
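
The `--router_tau_*` and `--router_aux_*` flags describe schedules rather than single values. Below is a sketch of one plausible interpretation, assuming τ is held constant during the freeze window and then interpolated linearly to its end value while λ ramps linearly over training; the actual schedule shapes in `scripts/train_veronica.py` may differ.

```python
def router_schedules(step, total_steps=60_000,
                     tau_start=2.2, tau_end=1.4, tau_freeze_steps=6_000,
                     aux_start=0.008, aux_end=0.016):
    """Hypothetical τ/λ schedules implied by the CLI flags above."""
    if step < tau_freeze_steps:
        tau = tau_start                      # frozen: no sharpening yet
    else:
        frac = (step - tau_freeze_steps) / max(1, total_steps - tau_freeze_steps)
        tau = tau_start + (tau_end - tau_start) * min(1.0, frac)
    lam = aux_start + (aux_end - aux_start) * min(1.0, step / total_steps)
    return tau, lam

print(router_schedules(0))       # (2.2, 0.008)
print(router_schedules(6_000))   # τ starts decaying from here
print(router_schedules(60_000))  # (1.4, 0.016)
```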

---

## Critical Discovery: Context Length & Router Stability on Deep Models

### The 512 Token Trap (24L Only)

**Finding**: With 24 layers, starting training at **512 context length causes router collapse** by step 3k:
```
Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
```

**Root Cause** (see the arithmetic sketch below):
- With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
- But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
- **Insufficient signal** for stable gradient descent on router parameters
- Weak branches cannot recover from random initialization noise
- Router collapses toward the dominant branch to minimize aux loss conflict

**Why This Doesn't Happen on 12L**:
- Same 512 tokens → 6,144 routing examples
- Each branch-layer: **~170 samples** (same as 24L)
- But **12 layers = shorter gradient path** → less noise accumulation
- Router can stabilize before collapse
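
A quick back-of-the-envelope helper reproducing the sample counts above (pure arithmetic, no project code assumed):

```python
def routing_samples(seq_len, n_layer, num_funcs=3):
    """Routing decisions per batch and rough gradient samples per branch-layer pair."""
    decisions = seq_len * n_layer                            # one routing decision per token per layer
    per_branch_layer = decisions / (num_funcs * n_layer)     # simplifies to seq_len / num_funcs
    return decisions, per_branch_layer

print(routing_samples(512, 24))   # (12288, ~170)  -> collapse-prone on 24L
print(routing_samples(512, 12))   # (6144, ~170)   -> tolerable on 12L (shorter gradient path)
print(routing_samples(1024, 24))  # (24576, ~341)  -> stable, per the corrected curriculum
```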

### Solution: Start at 1024 for Deep Models

**Corrected curriculum for 24L**:
```
0-20k steps:   1024 tokens → 24,576 routing examples = stable gradients
20k-60k steps: 2048 tokens → 49,152 routing examples = final quality
```

**DO NOT use 512 ctx on 24L**; this is an empirical hard constraint, not a performance optimization.

**For 12L and shallower**: the 512→1024→2048 curriculum works fine.

**Mathematical threshold**: ~15-20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher for ≥24L.

---

## Depth Scaling for 24L (Mathematical Rationale)

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures (a small helper that applies all three rules follows below):

### Temperature Scaling
Softmax sharpness compounds across layers. To preserve exploration:
```
τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
```
For the 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).

### Aux Weight Scaling
The entropy gradient must compete with 24 layers pulling toward specialization:
```
λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
```
For the 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.

### Forcing Probability
Each branch needs more examples across the deeper network:
```
P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
```
For 12L `5%`, we use **`10%`** for 24L during warmup (0-5k steps).
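
A minimal helper expressing the three rules above; the function name and call pattern are illustrative, not part of the `veronica` API, and baseline end-values not stated in the README are back-derived assumptions.

```python
import math

def depth_scaled_router_params(n_layer, base_layers=12,
                               base_tau=(1.6, 1.0), base_aux=(0.005, 0.012),
                               base_force_prob=0.05):
    """Scale 12L router hyperparameters to a deeper model.

    τ and λ grow with sqrt(depth ratio); forcing probability grows linearly.
    """
    ratio = n_layer / base_layers
    s = math.sqrt(ratio)
    tau = tuple(round(t * s, 2) for t in base_tau)
    aux = tuple(round(a * s, 4) for a in base_aux)
    force_prob = base_force_prob * ratio
    return {"tau": tau, "aux": aux, "force_prob": force_prob}

print(depth_scaled_router_params(24))
# ≈ {'tau': (2.26, 1.41), 'aux': (0.0071, 0.017), 'force_prob': 0.1}
# rounded in practice to τ = 2.2 → 1.4, λ = 0.008 → 0.016, forcing = 10%
```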

### Empirical Results (Training Logs)
- **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
- **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
- **Step 7k**: Entropy 0.80-0.93 (exploration phase post tau-freeze)
- **Step 10k**: Loss ~34, no branch collapse
- **Step 11k** (post branch-1 recovery): Entropy 0.84-0.93, distribution `[0.57, 0.15, 0.27]` ✅
- **Step 12k**: Stable soft routing, eval loss 4.07

---

## Router Health Metrics
Monitor log lines:
```
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

### Targets by Training Phase
| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|-------|-------|----------------|------------------|-------|
| Warmup | 0-5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k-10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k-60k | ≥0.65 | ≥0.12 | Acceptable specialization |

### Observed Distribution (24L, Step 12k)
```
alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
```
Ideal soft routing: dominant branch ~55-65%, minorities ~15-25% each.
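
`entropy_norm` is the entropy of the branch weights normalized by its maximum (the log of the branch count). A standalone check, assuming natural-log normalization, reproduces the logged value above:

```python
import math

def entropy_norm(alpha):
    """Normalized entropy of router branch weights: 1.0 = uniform, 0.0 = collapsed."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

alpha = [0.571, 0.153, 0.276]
print(round(entropy_norm(alpha), 3))  # 0.876, matching the log line above
print(min(alpha) >= 0.15)             # True -> meets the "Stable" phase min-share target
```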

---

## Context Length Curriculum

### Architecture-Dependent Strategy

**For 24L (≥20L in general)**:
```
1024 tokens: Steps 0-20k  (NO 512 phase; it causes router collapse)
2048 tokens: Steps 20k-60k
```

**For 12L and shallower**:
```
512 tokens:  Steps 0-10k
1024 tokens: Steps 10k-30k
2048 tokens: Steps 30k-60k
```
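
A tiny schedule helper encoding the two curricula above; purely illustrative (the training script takes `--max_seq_len` per run and is resumed manually, so this is not an actual interface of the project):

```python
def seq_len_for_step(step, n_layer):
    """Context-length curriculum: deep models (>=20 layers) skip the 512 phase entirely."""
    if n_layer >= 20:
        return 1024 if step < 20_000 else 2048
    if step < 10_000:
        return 512
    return 1024 if step < 30_000 else 2048

print(seq_len_for_step(0, 24))       # 1024 (never 512 on deep models)
print(seq_len_for_step(25_000, 24))  # 2048
print(seq_len_for_step(5_000, 12))   # 512
```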

---

### Phase 1 (24L): 1024 Tokens (Steps 0-20k)
- **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
- **VRAM**: ~8-9GB (batch=4, accum=8)
- **Throughput**: ~8-10 sec/step
- **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps

### Phase 2 (24L): 2048 Tokens (Steps 20k-60k)
- **Purpose**: Final capacity, long-document coherence
- **VRAM**: ~12-13GB (batch=4, accum=8)
- **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
- **Expected dip**: Temporary entropy drop of 0.02-0.04, recovers within 500 steps

The VRAM figures above assume BF16.

### Switching Template
```bash
python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048
  # ... keep all other router params unchanged
```

---

## Router Stability (Important)

Dynamic soft-routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

### Known Issues & Solutions
| Issue | Symptom | Solution |
|-------|---------|----------|
| Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |

### Failed Experiment: Adaptive Forcing (DO NOT USE)

**Attempted solution**: Auto-detect weak branches (below a threshold) and dynamically apply forcing windows:
```python
# BROKEN CODE - DO NOT USE (pseudocode of the reverted logic)
if min(alpha) < 0.15 and not in_cooldown:
    weak_idx = argmin(alpha)
    force_branch_idx = weak_idx
    force_until = current_step + 1000
    in_cooldown = True
```

**Why it failed**:
1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
2. **Artificial alpha**: During forced windows, alpha reflects the forcing distribution [0, 0, 1], not learned preferences
3. **Gradient confusion**: The aux loss receives artificial entropy signals and disrupts learning
4. **Manual intervention superior**: Targeted forcing with aux=0 isolates the signal cleanly

**Lesson**: The router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing is for recovery only, not automation.

### Safeguards Implemented (Validated)
1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
3. **Entropy-max loss**: Subtract (not add) the aux loss to maximize branch diversity (see the sketch after this list)
4. **Warmup forcing**: 10% probability during the first 5k steps ensures all branches receive gradients
5. **FP32 LayerNorm**: Prevents BF16 precision drift in the routing logits
6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
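
A minimal sketch of the entropy-maximizing objective described in item 3, assuming the auxiliary term is the normalized branch-weight entropy averaged over layers; variable names are illustrative, not the project's internals:

```python
import torch

def polymorphic_loss(lm_loss, branch_alphas, aux_weight):
    """lm_loss: scalar LM loss; branch_alphas: per-layer branch weights, each of shape (num_funcs,)."""
    entropies = []
    for alpha in branch_alphas:
        h = -(alpha * torch.log(alpha.clamp_min(1e-9))).sum()
        entropies.append(h / torch.log(torch.tensor(float(alpha.numel()))))
    avg_entropy = torch.stack(entropies).mean()
    # Entropy-max: SUBTRACT the aux term so that higher branch diversity lowers the loss.
    return lm_loss - aux_weight * avg_entropy

alphas = [torch.softmax(torch.randn(3), dim=0) for _ in range(24)]
loss = polymorphic_loss(torch.tensor(3.5), alphas, aux_weight=0.012)
print(loss.item())
```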

### Intervention Playbook (Step-by-Step)
**Scenario: Branch drops <10% before 5k steps**
1. Stop training, resume from the last good checkpoint
2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
3. Extend `--router_tau_freeze_steps` by +2000
4. Increase `--router_force_prob` to 0.12-0.15

**Scenario: Branch stuck <12% after 10k steps**
1. Run targeted forcing (see Incremental Expansion section)
2. Force the weak branch for 1k steps with `aux=0`, LR=5e-5
3. Resume normal training with aux restored
4. Expected recovery: +3-8% share within 500 steps

**Scenario: Entropy <0.70 and falling after 15k**
1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening

### Fine-Tuning Note
If using the standard HF Trainer without a custom loss, set `router_aux_weight=0` in the config to avoid an incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
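
For example, a config along these lines when fine-tuning with the vanilla Trainer (a sketch: `router_aux_weight` is the field named in the note above, the other arguments mirror the Quickstart):

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

# Disable the entropy-max aux term when the training loop cannot subtract it correctly.
cfg = VeronicaConfig(n_layer=24, num_funcs=3, router_aux_weight=0.0)
model = VeronicaForCausalLM(cfg)
```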

### Empirical Training Log (24L Complete Journey)

**First attempt (FAILED: 512 ctx)**:
- **Step 0-300**: Perfect init (entropy 1.0) with high tau + forcing
- **Step 3000**: **Router collapse**: alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
- **Diagnosis**: 512 ctx insufficient for 24L depth
- **Action**: Abandoned run, restarted from scratch with 1024 ctx

**Adaptive forcing experiment (FAILED)**:
- **Implementation**: Auto-detect weak branches, dynamic forcing windows
- **Outcome**: Cascade loops, artificial alpha patterns [0, 0, 1], gradient confusion
- **Action**: Reverted the code, relied on tau/aux only

**Final successful run (1024 ctx from step 0)**:
- **Step 0-300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
- **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
- **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
- **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
- **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
- **Step 6000-7000**: Entropy spikes 0.80-0.93 (exploration phase, expected)
- **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
- **Intervention**: Targeted forcing on branch 1 (10k-11k steps)
  - `--force_branch_idx 1 --force_branch_until 11000`
  - `--router_aux_start 0.0` (isolate gradient signal)
  - `--learning_rate 5e-5` (gentle nudge)
- **Step 11000**: Branch 1 recovered to 15%, entropy 0.84-0.93 ✅
- **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
  - Eval loss 4.41→4.07 (intervention improved generalization)
  - Loss trend: 34→33 (continued healthy descent)
  - **All branches active and contributing**

**Key learnings**:
1. ✅ 1024 ctx required from step 0 for 24L
2. ✅ Depth-scaled tau/aux/forcing parameters validated
3. ✅ Targeted forcing (aux=0, short window) effective for recovery
4. ❌ Adaptive forcing causes more problems than it solves
5. ✅ Entropy 0.84-0.93 with min branch 15% = healthy soft routing

**Status**: Methodology validated on 24L/551M through 12k steps. Ready for the 2048 ctx phase (30k+). Core API stable; default schedules proven effective.

---

## Practical Training Tips

### DO
- ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
- ✅ Scale tau/aux with √(depth_ratio) when changing layer count
- ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
- ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
- ✅ Monitor entropy every 100 steps; save checkpoints every 500
- ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
- ✅ Keep the aux weight increasing throughout training (e.g., 0.008→0.016)
- ✅ Trust the depth-scaled parameters; they're empirically validated

### DON'T
- ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps; empirically proven)
- ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
- ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
- ❌ Set aux=0 for normal training (only during targeted forcing windows)
- ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
- ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
- ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
- ❌ Use the 512→1024→2048 curriculum on deep models (≥20L requires a 1024 start)

### VRAM Optimization
If hitting OOM on 2048 ctx:
```bash
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16   # keeps effective batch = 32
```

### Quick Health Check (Per 1k Steps)
```bash
grep "\[router\]" logs/train.log | tail -10
```
Look for (a small parsing sketch follows this list):
- Entropy trend (should be ≥0.70)
- Min branch value (should be ≥0.12)
- Loss trend (should decrease or stabilize)
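
An optional Python equivalent of the grep check, assuming only the log format shown above (`[router] alpha=[...] entropy_norm=E`); the path and thresholds are illustrative:

```python
import re

LINE = re.compile(r"\[router\] alpha=\[(?P<alpha>[^\]]+)\] entropy_norm=(?P<ent>[0-9.]+)")

def check_router_log(path="logs/train.log", min_entropy=0.70, min_share=0.12, last_n=10):
    hits = []
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                alpha = [float(x) for x in m.group("alpha").split(",")]
                hits.append((alpha, float(m.group("ent"))))
    for alpha, ent in hits[-last_n:]:
        healthy = ent >= min_entropy and min(alpha) >= min_share
        print(f"entropy={ent:.3f} min_branch={min(alpha):.3f} {'OK' if healthy else 'CHECK'}")

check_router_log()
```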

---