MhaWay committed on
Commit
41a5994
·
1 Parent(s): 342a5c0

Readme.md update

Files changed (1)
  1. README.md +285 -30
README.md CHANGED
@@ -29,12 +29,13 @@ model-index:
29
  ## TL;DR
30
  | Feature | Description |
31
  |---------|-------------|
32
- | Architecture | 24‑layer causal Transformer (RoPE, untied embeddings) |
33
- | Polymorphic MLP | Soft routing over 3 base branches (extensible) |
34
- | Routing Control | Temperature schedule + entropy maximization |
35
  | Precision | BF16 with FP32 LayerNorm for stability |
36
  | Positional Encoding | Rotary (RoPE, θ=10,000) |
37
- | Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • Additional samples 20% (codelion/DataComp) |
 
38
  | Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |
39
 
40
  ---
@@ -43,11 +44,6 @@ model-index:
43
 
44
  ```bash
45
  pip install -e .
46
- from veronica import VeronicaConfig, VeronicaForCausalLM
47
- cfg = VeronicaConfig(n_layer=24, num_funcs=3) # base polymorphic setup
48
- model = VeronicaForCausalLM(cfg)
49
- ```
50
-
51
  | Source | Share | Link |
52
  |--------|-------|------|
53
  | FinePDFs‑1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |
@@ -59,6 +55,10 @@ Notes
59
  - Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.
60
 
61
  Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion's guidance.
62
 
63
  Generation example:
64
  ```python
@@ -161,10 +161,10 @@ Total tokens target (example): ~60B (adjustable). The composition balances seman
161
  | Weight Decay | 0.01 |
162
  | Label Smoothing | 0.01 |
163
  | Precision | bf16 + fp32 LayerNorm |
164
- | Max Seq Len | 512 (curriculum to 2048) |
165
- | Router τ | 1.6 → 1.1 (freeze first 4k steps) |
166
- | Aux weight λ | 0.005 → 0.012 |
167
- | Router forcing | 5% prob for first 3k steps |
168
  | Rep penalty (α) | 0.05 (smoke quality) |
169
 
170
  Launch:
@@ -179,24 +179,151 @@ python scripts/train_veronica.py \
179
  --learning_rate 1.2e-4 \
180
  --warmup_ratio 0.10 \
181
  --weight_decay 0.01 \
182
- --max_seq_len 512 \
183
- --router_tau_start 1.6 --router_tau_end 1.1 --router_tau_freeze_steps 4000 \
184
- --router_aux_start 0.005 --router_aux_end 0.012 \
185
- --router_force_prob 0.05 --router_force_warmup_steps 3000 \
186
- --rep_alpha 0.05
 
187
  ```
188
 
189
  ---
190
 
191
  ## Router Health Metrics
192
  Monitor log lines:
193
  ```
194
  [router] alpha=[a0, a1, a2, ...] entropy_norm=E
195
  ```
196
- Targets:
197
- - `entropy_norm ≥ 0.75` through first 5k steps
198
- - No branch < 15% persistent (healthy diversity)
199
- - `entropy_norm ≥ 0.65` maintained throughout training
200
 
201
  ---
202
 
@@ -339,15 +466,143 @@ if num_funcs >= 4:
339
 
340
  ## Router Stability (Important)
341
 
342
- Dynamic soft‑routing is powerful but sensitive. The training methodology is under active refinement to ensure healthy branch growth without premature specialization.
343
 
344
- - Instability: routing entropy can drop early, leading to branch collapse. This is highly sensitive to temperature (τ) scheduling, entropy auxiliary weight (λ), and any warmup forcing.
345
- - Current safeguards: high initial τ, extended freeze period, entropy‑max regularization, and selective forcing during early steps.
346
- - Expectations: training curves may show transient oscillations in branch usage while the router and branches co‑adapt.
347
- - What to monitor: `entropy_norm ≥ 0.75` in the first 3–5k steps; no branch persistently < 15%.
348
- - Intervention playbook: increase `router_aux_weight`, extend `router_tau_freeze_steps`, temporarily raise `router_tau_start`, or apply targeted forcing to the weakest branch.
349
- - Fine‑tuning note: if using the standard HF Trainer, consider `router_aux_weight=0` (or use `scripts/train_veronica.py`, which handles entropy‑max correctly).
350
- - Status: ongoing refinement. Default τ/λ schedules may evolve; core API will remain stable.
351
 
352
  ---
353
 
 
29
  ## TL;DR
30
  | Feature | Description |
31
  |---------|-------------|
32
+ | Architecture | 24‑layer causal Transformer (RoPE, untied embeddings, 551M params) |
33
+ | Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
34
+ | Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
35
  | Precision | BF16 with FP32 LayerNorm for stability |
36
  | Positional Encoding | Rotary (RoPE, θ=10,000) |
37
+ | Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb-Edu 20% |
38
+ | Context Length | **1024 (0-30k)** → 2048 (30k-60k); *512 causes router collapse on 24L* |
39
  | Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |
40
 
41
  ---
 
44
 
45
  ```bash
46
  pip install -e .
  ```
47
  | Source | Share | Link |
48
  |--------|-------|------|
49
  | FinePDFs‑1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |
 
55
  - Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.
56
 
57
  Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion's guidance.
58
+ ```python
+ from veronica import VeronicaConfig, VeronicaForCausalLM
59
+ cfg = VeronicaConfig(n_layer=24, num_funcs=3) # base polymorphic setup
60
+ model = VeronicaForCausalLM(cfg)
61
+ ```
62
 
63
  Generation example:
64
  ```python
 
161
  | Weight Decay | 0.01 |
162
  | Label Smoothing | 0.01 |
163
  | Precision | bf16 + fp32 LayerNorm |
164
+ | Max Seq Len | 1024→2048 (curriculum) |
165
+ | Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
166
+ | Aux weight λ | 0.008 → 0.016 (depth-scaled √2×) |
167
+ | Router forcing | 10% prob for first 5k steps |
168
  | Rep penalty (α) | 0.05 (smoke quality) |
169
 
170
  Launch:
 
179
  --learning_rate 1.2e-4 \
180
  --warmup_ratio 0.10 \
181
  --weight_decay 0.01 \
182
+ --max_seq_len 1024 \
183
+ --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
184
+ --router_aux_start 0.008 --router_aux_end 0.016 \
185
+ --router_force_prob 0.10 --router_force_warmup_steps 5000 \
186
+ --rep_alpha 0.05 \
187
+ --seed 42
188
  ```
189
 
190
  ---
191
 
192
+ ## Critical Discovery: Context Length & Router Stability on Deep Models
193
+
194
+ ### The 512 Token Trap (24L Only)
195
+
196
+ **Finding**: With 24 layers, starting training at **512 context length causes router collapse** by step 3k:
197
+ ```
198
+ Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
199
+ ```
200
+
201
+ **Root Cause**:
202
+ - With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
203
+ - But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
204
+ - **Insufficient signal** for stable gradient descent on router parameters
205
+ - Weak branches cannot recover from random initialization noise
206
+ - Router collapses toward dominant branch to minimize aux loss conflict
207
+
208
+ **Why This Doesn't Happen on 12L**:
209
+ - Same 512 tokens → 6,144 routing examples
210
+ - Each branch-layer: **~170 samples** (same as 24L)
211
+ - But **12 layers = shorter gradient path** → less noise accumulation
212
+ - Router can stabilize before collapse
213
+
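The arithmetic above can be sketched in a few lines (an illustrative helper, not repo code; the function names are ours):

```python
def routing_examples_per_batch(seq_len: int, n_layers: int) -> int:
    """One routing decision per token per layer."""
    return seq_len * n_layers

def samples_per_branch_layer(seq_len: int, n_layers: int, n_branches: int = 3) -> float:
    """Gradient samples each branch-layer pair sees per batch."""
    return routing_examples_per_batch(seq_len, n_layers) / (n_branches * n_layers)

# 512 ctx on 24L: 12,288 routing examples, ~170 per branch-layer pair
examples_24l = routing_examples_per_batch(512, 24)
per_pair_24l = samples_per_branch_layer(512, 24)
```

Note that the per-pair count reduces to `seq_len / n_branches`, identical for 12L and 24L; what differs, per the explanation above, is the depth of the gradient path.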
214
+ ### Solution: Start at 1024 for Deep Models
215
+
216
+ **Corrected curriculum for 24L**:
217
+ ```
218
+ 0–20k steps: 1024 tokens ✅ 24,576 routing examples = stable gradients
219
+ 20k–60k steps: 2048 tokens 🎯 49,152 examples = final quality
220
+ ```
221
+
222
+ **DO NOT use 512 ctx on 24L**: this is an empirical hard constraint, not a performance optimization.
223
+
224
+ **For 12L and shallower**: 512β†’1024β†’2048 curriculum works fine.
225
+
226
+ **Empirical threshold**: ~15–20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher at 20 layers and beyond.
227
+
228
+ ---
229
+
230
+ ## Depth Scaling for 24L (Mathematical Rationale)
231
+
232
+ With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures:
233
+
234
+ ### Temperature Scaling
235
+ Softmax sharpness compounds across layers. To preserve exploration:
236
+ ```
237
+ τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
238
+ ```
239
+ For 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).
240
+
241
+ ### Aux Weight Scaling
242
+ Entropy gradient must compete with 24 layers pulling toward specialization:
243
+ ```
244
+ λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
245
+ ```
246
+ For 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.
247
+
248
+ ### Forcing Probability
249
+ Each branch needs more examples across deeper network:
250
+ ```
251
+ P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
252
+ ```
253
+ For 12L `5%`, we use **`10%`** for 24L during warmup (0–5k steps).
254
+
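The three rules above can be collected into one helper (a sketch; `scale_router_params` and its defaults are ours, not the repo API):

```python
import math

def scale_router_params(tau: float, aux: float, force_prob: float,
                        base_layers: int = 12, target_layers: int = 24) -> dict:
    """Scale 12L router hyperparameters for a deeper model per the rules above."""
    ratio = target_layers / base_layers
    return {
        "tau": tau * math.sqrt(ratio),      # sqrt scaling: softmax sharpness compounds with depth
        "aux": aux * math.sqrt(ratio),      # sqrt scaling: entropy pressure vs. per-layer pull
        "force_prob": force_prob * ratio,   # linear scaling: per-branch coverage across layers
    }

p = scale_router_params(tau=1.6, aux=0.005, force_prob=0.05)
# p["tau"] ≈ 2.26 (table rounds to 2.2), p["aux"] ≈ 0.0071 (table rounds up to 0.008),
# p["force_prob"] = 0.10
```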
255
+ ### Empirical Results (Training Logs)
256
+ - **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
257
+ - **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
258
+ - **Step 7k**: Entropy 0.80–0.93 (exploration phase post tau-freeze)
259
+ - **Step 10k**: Loss ~34, no branch collapse
260
+ - **Step 11k** (post branch-1 recovery): Entropy 0.84–0.93, distribution `[0.57, 0.15, 0.27]` ✅
261
+ - **Step 12k**: Stable soft routing, eval loss 4.07
262
+
263
+ ---
264
+
265
  ## Router Health Metrics
266
  Monitor log lines:
267
  ```
268
  [router] alpha=[a0, a1, a2, ...] entropy_norm=E
269
  ```
270
+ ### Targets by Training Phase
271
+ | Phase | Steps | Entropy Target | Min Branch Share | Notes |
272
+ |-------|-------|----------------|------------------|-------|
273
+ | Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
274
+ | Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
275
+ | Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
276
+ | Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |
277
+
278
+ ### Observed Distribution (24L, Step 12k)
279
+ ```
280
+ alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
281
+ ```
282
+ Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.
283
+
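These targets can be checked mechanically against the log line format above; a minimal monitor sketch (thresholds copied from the phase table, parser names ours):

```python
import re

# (max_step, min entropy_norm, min branch share) from the phase table above
PHASES = [
    (5_000, 0.90, 0.25),    # Warmup
    (10_000, 0.75, 0.12),   # Post-freeze
    (40_000, 0.70, 0.15),   # Stable
    (60_000, 0.65, 0.12),   # Final
]

def parse_router_line(line: str):
    """Parse '[router] alpha=[...] entropy_norm=E' into (alpha list, entropy)."""
    m = re.search(r"alpha=\[([^\]]+)\].*entropy_norm=([\d.]+)", line)
    alpha = [float(x) for x in m.group(1).split(",")]
    return alpha, float(m.group(2))

def is_healthy(step: int, alpha: list, entropy: float) -> bool:
    for max_step, min_entropy, min_share in PHASES:
        if step <= max_step:
            return entropy >= min_entropy and min(alpha) >= min_share
    return False
```

For the step-12k distribution shown above, `is_healthy(12_000, [0.571, 0.153, 0.276], 0.876)` passes the stable-phase targets.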
284
+ ---
285
+
286
+ ## Context Length Curriculum
287
+
288
+ ### Architecture-Dependent Strategy
289
+
290
+ **For 24L (≥20L in general)**:
291
+ ```
292
+ 1024 tokens: Steps 0–20k (NO 512 phase; causes router collapse)
293
+ 2048 tokens: Steps 20k–60k
294
+ ```
295
+
296
+ **For 12L and shallower**:
297
+ ```
298
+ 512 tokens: Steps 0–10k
299
+ 1024 tokens: Steps 10k–30k
300
+ 2048 tokens: Steps 30k–60k
301
+ ```
302
+
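The two schedules above, as a single lookup (a sketch; `ctx_len` is our name, step boundaries from the text):

```python
def ctx_len(step: int, n_layers: int) -> int:
    """Context length at a given training step under the curricula above."""
    if n_layers >= 20:
        # Deep models: start at 1024, never 512 (router collapse)
        return 1024 if step < 20_000 else 2048
    # 12L and shallower: 512 -> 1024 -> 2048
    if step < 10_000:
        return 512
    if step < 30_000:
        return 1024
    return 2048
```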
303
+ ---
304
+
305
+ ### Phase 1 (24L): 1024 Tokens (Steps 0–20k)
306
+ - **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
307
+ - **VRAM**: ~8–9GB (batch=4, accum=8)
308
+ - **Throughput**: ~8–10 sec/step
309
+ - **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps
310
+
311
+ ### Phase 2 (24L): 2048 Tokens (Steps 20k–60k)
312
+ - **Purpose**: Final capacity, long-document coherence
313
+ - **VRAM**: ~12–13GB (batch=4, accum=8)
314
+ - **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
315
+ - **Expected dip**: Temporary entropy drop of 0.02–0.04; recovers within 500 steps
316
+
317
+ Note: the VRAM figures above assume BF16.
318
+
319
+ ### Switching Template
320
+ ```bash
321
+ python scripts/train_veronica.py \
322
+ --resume_from runs/veronica-24L-1024/checkpoint-12000 \
323
+ --output_dir runs/veronica-24L-2048 \
324
+ --max_seq_len 2048 \
325
+ # ... keep all other router params unchanged
326
+ ```
327
 
328
  ---
329
 
 
466
 
467
  ## Router Stability (Important)
468
 
469
+ Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.
470
+
471
+ ### Known Issues & Solutions
472
+ | Issue | Symptom | Solution |
473
+ |-------|---------|----------|
474
+ | Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
475
+ | Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
476
+ | Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
477
+ | Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |
478
+
479
+ ### Failed Experiment: Adaptive Forcing (DO NOT USE)
480
+
481
+ **Attempted solution**: Auto-detect weak branches (<threshold) and dynamically apply forcing windows
482
+ ```python
483
+ # BROKEN CODE - DO NOT USE
484
+ if min(alpha) < 0.15 and not in_cooldown:
485
+     weak_idx = argmin(alpha)
486
+     force_branch_idx = weak_idx
487
+     force_until = current_step + 1000
488
+     in_cooldown = True
489
+ ```
490
+
491
+ **Why it failed**:
492
+ 1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
493
+ 2. **Artificial alpha**: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
494
+ 3. **Gradient confusion**: Aux loss receives artificial entropy signals, disrupts learning
495
+ 4. **Manual intervention superior**: Targeted forcing with aux=0 isolates signal cleanly
496
+
497
+ **Lesson**: Router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing for recovery only, not automated.
498
+
499
+ ### Safeguards Implemented (Validated)
500
+ 1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
501
+ 2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
502
+ 3. **Entropy-max loss**: Subtract (not add) aux_loss to maximize branch diversity
503
+ 4. **Warmup forcing**: 10% probability during first 5k steps ensures all branches receive gradients
504
+ 5. **FP32 LayerNorm**: Prevents BF16 precision drift in routing logits
505
+ 6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
506
+
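Safeguard 3 (subtract, not add) can be written out explicitly; a minimal sketch in plain Python, not the repo's loss code:

```python
import math

def entropy_max_loss(task_loss: float, alpha: list, aux_weight: float) -> float:
    """Total loss with entropy-max regularization: subtracting the entropy
    term rewards diverse routing instead of penalizing it."""
    entropy = -sum(a * math.log(a + 1e-9) for a in alpha)
    return task_loss - aux_weight * entropy

# Uniform routing scores lower (better) than a collapsed router at equal task loss:
uniform = entropy_max_loss(1.0, [1/3, 1/3, 1/3], aux_weight=0.016)
collapsed = entropy_max_loss(1.0, [0.98, 0.01, 0.01], aux_weight=0.016)
```

A trainer that adds this term instead of subtracting it would push toward collapse, which is the situation the fine-tuning note in this section guards against.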
507
+ ### Intervention Playbook (Step-by-Step)
508
+ **Scenario: Branch drops <10% before 5k steps**
509
+ 1. Stop training, resume from last good checkpoint
510
+ 2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
511
+ 3. Extend `--router_tau_freeze_steps` by +2000
512
+ 4. Increase `--router_force_prob` to 0.12–0.15
513
+
514
+ **Scenario: Branch stuck <12% after 10k steps**
515
+ 1. Run targeted forcing (see Incremental Expansion section)
516
+ 2. Force weak branch for 1k steps with `aux=0`, LR=5e-5
517
+ 3. Resume normal training with aux restored
518
+ 4. Expected recovery: +3–8% share within 500 steps
519
+
520
+ **Scenario: Entropy <0.70 and falling after 15k**
521
+ 1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
522
+ 2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening
523
+
524
+ ### Fine‑Tuning Note
525
+ If using standard HF Trainer without custom loss, set `router_aux_weight=0` in config to avoid incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
526
+
527
+ ### Empirical Training Log (24L Complete Journey)
528
+
529
+ **First attempt (FAILED: 512 ctx)**:
530
+ - **Step 0–300**: Perfect init (entropy 1.0) with high tau + forcing
531
+ - **Step 3000**: **Router collapse**, alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
532
+ - **Diagnosis**: 512 ctx insufficient for 24L depth
533
+ - **Action**: Abandoned run, restarted from scratch with 1024 ctx
534
+
535
+ **Adaptive forcing experiment (FAILED)**:
536
+ - **Implementation**: Auto-detect weak branches, dynamic forcing windows
537
+ - **Outcome**: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
538
+ - **Action**: Reverted code, relied on tau/aux only
539
+
540
+ **Final successful run (1024 ctx from step 0)**:
541
+ - **Step 0–300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
542
+ - **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
543
+ - **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
544
+ - **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
545
+ - **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
546
+ - **Step 6000-7000**: Entropy spikes 0.80→0.93 (exploration phase, expected)
547
+ - **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
548
+ - **Intervention**: Targeted forcing on branch 1 (10k→11k steps)
549
+ - `--force_branch_idx 1 --force_branch_until 11000`
550
+ - `--router_aux_start 0.0` (isolate gradient signal)
551
+ - `--learning_rate 5e-5` (gentle nudge)
552
+ - **Step 11000**: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
553
+ - **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
554
+ - Eval loss 4.41→4.07 (intervention improved generalization)
555
+ - Loss trend: 34→33 (continued healthy descent)
556
+ - **All branches active and contributing**
557
+
558
+ **Key learnings**:
559
+ 1. ✅ 1024 ctx required from step 0 for 24L
560
+ 2. ✅ Depth-scaled tau/aux/forcing parameters validated
561
+ 3. ✅ Targeted forcing (aux=0, short window) effective for recovery
562
+ 4. ❌ Adaptive forcing causes more problems than it solves
563
+ 5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing
564
+
565
+ **Status**: Methodology validated on 24L/551M through 12k steps. Ready for 2048 ctx phase (30k+). Core API stable; default schedules proven effective.
566
 
567
+ ---
568
+
569
+ ## Practical Training Tips
570
+
571
+ ### DO
572
+ - ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
573
+ - ✅ Scale tau/aux with √(depth_ratio) when changing layer count
574
+ - ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
575
+ - ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
576
+ - ✅ Monitor entropy every 100 steps; save checkpoints every 500
577
+ - ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
578
+ - ✅ Keep aux weight increasing throughout training (e.g., 0.008→0.016)
579
+ - ✅ Trust depth-scaled parameters; they're empirically validated
580
+
581
+ ### DON'T
582
+ - ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps; empirically proven)
583
+ - ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
584
+ - ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
585
+ - ❌ Set aux=0 for normal training (only during targeted forcing windows)
586
+ - ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
587
+ - ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
588
+ - ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
589
+ - ❌ Use curriculum 512→1024→2048 on deep models (≥20L requires 1024 start)
590
+
591
+ ### VRAM Optimization
592
+ If hitting OOM on 2048 ctx:
593
+ ```bash
594
+ --per_device_train_batch_size 2 \
595
+ --gradient_accumulation_steps 16 # keeps effective batch = 32
596
+ ```
597
+
598
+ ### Quick Health Check (Per 1k Steps)
599
+ ```bash
600
+ grep "\[router\]" logs/train.log | tail -10
601
+ ```
602
+ Look for:
603
+ - Entropy trend (should be ≥0.70)
604
+ - Min branch value (should be ≥0.12)
605
+ - Loss trend (should decrease or stabilize)
606
 
607
  ---
608