Update README.md

README.md · CHANGED

**Previous README:**
tags:
- causal-lm
- rope
- expandable-architecture
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-24L (551M)
  results: []
---

# Veronica-Polymorphic
| Feature | Description |
|---------|-------------|
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb-Edu 20% |
| Context Length | **1024 (0-30k)** → 2048 (30k-60k) — *512 causes router collapse on 24L* |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |

---

## Quick Start
```bash
pip install -e .
```

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```
Dataset sources:

| Dataset | Share | Link |
|--------|-------|------|
| FinePDFs‑1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |
| DCLM Baseline‑1B | 30% | https://huggingface.co/datasets/codelion/dclm-baseline-1B |
| Additional samples | 20% | https://huggingface.co/collections/codelion/pre-training-dataset-samples |

- The collection link aggregates additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.
Generation example:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer

prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **ids,
    max_new_tokens=64,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Current status: between v0.2 and v0.3.
---

Dataset mixture reference:

```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

Related collection and datasets:
- codelion pre‑training dataset samples: https://huggingface.co/collections/codelion/pre-training-dataset-samples
- codelion/dclm-baseline-1B: https://huggingface.co/datasets/codelion/dclm-baseline-1B
- codelion/finepdfs-1B: https://huggingface.co/datasets/codelion/finepdfs-1B
---

## Architecture

Per token & layer:
```text
router_logits = Router(x)        # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branches[i]
```

Routing is stabilized by:
- **Temperature schedule** (τ high early → softer mixing)
- **Entropy-max aux-loss** (subtract entropy from total loss to maximize it; see the sketch below)
- Optional **forcing** during warmup to guarantee gradient flow to new branches
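A minimal sketch of the entropy-max term (function name and normalization are illustrative, not the repo's exact API):

```python
import math

import torch
import torch.nn.functional as F

def router_entropy_bonus(router_logits: torch.Tensor, tau: float, eps: float = 1e-9) -> torch.Tensor:
    """Normalized routing entropy in [0, 1]; subtract it (scaled by λ) from the LM loss to maximize it."""
    alpha = F.softmax(router_logits / tau, dim=-1)        # (B, T, num_funcs)
    entropy = -(alpha * (alpha + eps).log()).sum(dim=-1)  # per-token entropy, nats
    return entropy.mean() / math.log(alpha.size(-1))      # normalize by log(num_funcs)

# loss = ce_loss - aux_weight * router_entropy_bonus(router_logits, tau)
```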
| 116 |
-
|
| 117 |
-
### Branch Types
|
| 118 |
-
| Branch | Purpose | Structure |
|
| 119 |
-
|--------|---------|-----------|
|
| 120 |
-
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
|
| 121 |
-
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
|
| 122 |
-
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
|
| 123 |
-
|
| 124 |
-
### Positional Encoding
|
| 125 |
-
Rotary embeddings (RoPE) applied to Q/K heads with cached cos/sin; no absolute learned positions.
|
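For reference, a minimal sketch of this rotary scheme (helper names are illustrative, not the repo's API):

```python
import torch

def rope_cos_sin(seq_len: int, head_dim: int, theta: float = 10_000.0):
    """Cache cos/sin tables once per (seq_len, head_dim), as described above."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)  # (T, head_dim/2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of Q or K by position-dependent angles. x: (B, n_heads, T, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
```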
### Stability Choices

| Mechanism | Rationale |
|-----------|-----------|
| FP32 LayerNorm | Prevent BF16 precision drift |
| Entropy-Max Aux | Avoid early router collapse |
| High initial τ | Encourage exploration across branches |
| Gradient Checkpointing | Memory efficiency for depth |
## Dataset Mixture (codelion / DataComp inspired)
Training uses a curated blend guided by open mixture studies (a loading sketch follows the table):

| Source | Share | Notes |
|--------|-------|-------|
| FinePDFs | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu | 20% | Educational domain for structured explanatory patterns |
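One way to materialize this 50/30/20 blend with the `datasets` library (a sketch assuming streaming access to the three referenced datasets; this is not the repo's actual data pipeline):

```python
from datasets import load_dataset, interleave_datasets

# Stream the three sources and sample with the 50/30/20 ratios above.
parts = [
    load_dataset("codelion/finepdfs-1B", split="train", streaming=True),
    load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True),
    load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True),
]
mix = interleave_datasets(parts, probabilities=[0.5, 0.3, 0.2], seed=42)
```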
## Training Setup

| Hyperparameter | Value (example) |
|----------------|-----------------|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad Accumulation | 8 (effective batch 32) |
| LR | 1.2e-4 cosine decay |
| Warmup | 10% steps |
| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max Seq Len | 1024→2048 (curriculum) |
| Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled √2×) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  ...
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

(The `...` marks flags not shown in the source diff.)
## The 512-Token Trap (24L)

Training the 24L model at 512-token context collapsed the router by step 3k:
```text
Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
```

**Root Cause**:
- With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
- But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
- **Insufficient signal** for stable gradient descent on router parameters
- Weak branches cannot recover from random initialization noise
- Router collapses toward dominant branch to minimize aux loss conflict
**Why 12L survived at 512**:
- Same 512 tokens → 6,144 routing examples
- Each branch-layer: **~170 samples** (same as 24L)
- But **12 layers = shorter gradient path** → less noise accumulation
- Router can stabilize before collapse
### Solution: Start at 1024 for Deep Models

```text
0–20k steps:   1024 tokens ✅ 24,576 routing examples = stable gradients
20k–60k steps: 2048 tokens 🎯 49,152 examples = final quality
```
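A quick back-of-the-envelope check of these counts (illustrative arithmetic only):

```python
def routing_stats(tokens: int, n_layers: int = 24, n_branches: int = 3):
    """Routing decisions per batch, and gradient samples per branch-layer combination."""
    decisions = tokens * n_layers                            # one decision per token per layer
    per_branch_layer = decisions / (n_branches * n_layers)   # reduces to tokens / n_branches
    return decisions, per_branch_layer

print(routing_stats(512))   # (12288, ~170.7)  -> too few samples per branch-layer on 24L
print(routing_stats(1024))  # (24576, ~341.3)  -> enough signal for stable routing
```

Note that samples per branch-layer depend only on tokens per branch, which is why 12L and 24L see the same ~170 at 512 tokens; the deeper model fails because of the longer gradient path, not fewer samples.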
## Depth-Scaled Router Parameters

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures:

### Temperature Scaling
Softmax sharpness compounds across layers. To preserve exploration:
```text
τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
```
For 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).
### Aux Weight Scaling
The entropy gradient must compete with 24 layers pulling toward specialization:
```text
λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
```
For 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.

### Forcing Probability
Each branch needs more examples across a deeper network:
```text
P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
```
For 12L `5%`, we use **`10%`** for 24L during warmup (0–5k steps).
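The three scaling rules can be bundled into a small helper (a sketch; the training script exposes these as CLI flags rather than this function):

```python
import math

def scale_for_depth(n_layers: int, base_layers: int = 12,
                    tau: float = 1.6, aux: float = 0.012, p_force: float = 0.05):
    """√(depth_ratio) scaling for tau and aux weight; linear scaling for forcing probability."""
    r = n_layers / base_layers
    return tau * math.sqrt(r), aux * math.sqrt(r), p_force * r

print(scale_for_depth(24))  # (≈2.26, ≈0.017, 0.10) -> rounded to 2.2 / 0.016 / 10% in the configs
```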
### Empirical Results (Training Logs)
- **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
- **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
- **Step 7k**: Entropy 0.80–0.93 (exploration phase post tau-freeze)
- **Step 10k**: Loss ~34, no branch collapse
- **Step 11k** (post branch-1 recovery): Entropy 0.84–0.93, distribution `[0.57, 0.15, 0.27]` ✅
- **Step 12k**: Stable soft routing, eval loss 4.07
## Monitoring Router Health

Monitor log lines:
```text
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

### Targets by Training Phase
| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|-------|-------|----------------|------------------|-------|
| Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |

### Observed Distribution (24L, Step 12k)
```text
alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
```
Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.
### Architecture-Dependent Strategy

**For 24L**:
```text
1024 tokens: Steps 0–20k   (NO 512 phase — causes router collapse)
2048 tokens: Steps 20k–60k
```

**For 12L and shallower**:
```text
512 tokens:  Steps 0–10k
1024 tokens: Steps 10k–30k
2048 tokens: Steps 30k–60k
```
---

### Phase 1: 1024 tokens (steps 0–20k)
- **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
- **VRAM**: ~8–9GB (batch=4, accum=8)
- **Throughput**: ~8–10 sec/step
- **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps

### Phase 2: 2048 tokens (steps 20k–60k)
- **Purpose**: Final capacity, long-document coherence
- **VRAM**: ~12–13GB (batch=4, accum=8)
- **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
- **Expected dip**: Temporary entropy −0.02–0.04, recovers within 500 steps

Resume at 2048:
```bash
python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048
  # ... keep all other router params unchanged
```
---

## Incremental Expansion

Goal: increase capacity or add a specialization (e.g. translation) without a full restart.

### Steps
1. **Load the original checkpoint + config**:
```python
cfg = VeronicaConfig.from_pretrained(old_dir)
old_funcs = cfg.num_funcs
cfg.num_funcs = old_funcs + 1  # adding one branch
model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
```
2. **Implement the new branch class** (see Translation Branch below) and extend `PolymorphicMLP` construction.
3. **Copy existing router weights** and initialize the new column small:
```python
import torch
import torch.nn as nn

for blk in model.blocks:
    lin = blk.mlp.router[-1]  # final Linear of the router
    with torch.no_grad():
        # existing weights remain; only the new slice is initialized
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
            nn.init.zeros_(lin.bias[old_funcs:])
```
4. **Freeze old branches & attention** for warmup:
```python
for name, p in model.named_parameters():
    if f"funcs.{old_funcs}" in name or "router.2" in name:  # new branch + router final layer
        p.requires_grad = True
    else:
        p.requires_grad = False
```
5. **High τ + light forcing** (0–1k steps): `router_tau_start=1.8`, `router_force_prob≈0.15`.
6. **Blend phase** (1–3k steps): unfreeze old branches, lower τ → 1.2, increase aux to mid (e.g. 0.006).
7. **Stabilize**: restore the standard schedule (τ→1.0, aux→0.01), disable forcing.
### Recommended Minimal Fine‑Tune Command
```bash
python scripts/train_veronica.py \
  --config expanded-config.json \
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200
```
(The `--config` file is the updated one with the new `num_funcs`.)
---

## Translation Branch

Add a branch focusing on cross‑lingual adaptation without retraining the entire backbone.

### Design Goals
| Requirement | Implementation Choice |
|-------------|-----------------------|
| Lightweight | Low‑rank adapters + language conditioning |
| Reusable | Shares main hidden size; no separate encoder |
| Controllable | Can be forced via `force_func` for targeted tuning |
### Example Branch Implementation
```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
        super().__init__()
        self.rank = rank
        self.lang_embed = nn.Embedding(num_langs, hidden_size)
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, inner)
        self.down = nn.Linear(inner, hidden_size)
        # Low-rank adapters
        self.A = nn.Linear(hidden_size, rank, bias=False)
        self.B = nn.Linear(rank, hidden_size, bias=False)
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
        if lang_ids is not None:
            if lang_ids.dim() == 1:  # broadcast sentence-level
                lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
            else:
                lang_vec = self.lang_embed(lang_ids)  # (B, T, H)
            x = x + lang_vec
        h = self.up(x)
        h = F.gelu(h)  # functional gelu (there is no torch.gelu)
        h = self.down(h)
        # Adapter residual
        a = self.A(x)
        a = F.gelu(a)
        a = self.B(a)
        g = torch.sigmoid(self.gate(x))  # (B, T, 1)
        return h + g * a
```
### Integrate Into `PolymorphicMLP`
Inside branch construction:
```python
if num_funcs >= 4:
    funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))
```
### Passing Language IDs
- Add `lang_ids` to the model forward signature (optional).
- Modify the `TranslationBranch` call: `func(x, lang_ids=lang_ids)` for branches expecting it; others ignore it (see the dispatch sketch below).
- For multilingual fine‑tune, prepend special language tokens or maintain a side tensor of language indices.
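A sketch of that dispatch (assumes the branch list lives in an `nn.ModuleList`; names are illustrative, not the repo's actual forward code):

```python
from typing import Optional

import torch
import torch.nn as nn

def run_branches(funcs: nn.ModuleList, x: torch.Tensor,
                 lang_ids: Optional[torch.Tensor] = None) -> list:
    # Only language-aware branches receive lang_ids; the others keep the plain signature.
    return [f(x, lang_ids=lang_ids) if isinstance(f, TranslationBranch) else f(x)
            for f in funcs]
```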
### Fine‑Tuning Strategy
1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, an OSCAR subset).
2. Freeze the base transformer + existing branches initially.
3. Force the translation branch (`force_func = translation_index`) for exploratory steps.
4. Gradually unfreeze attention + other branches for joint adaptation.
5. Evaluate on BLEU / COMET vs baseline; adjust `rank` / `mlp_mult` if underfitting.
---

## Evaluation & Monitoring
| Metric | Purpose |
|--------|---------|
| CE / PPL | Language modeling convergence |
| Router Entropy | Diversity of branch usage |
| Alpha Distribution | Detect collapse or dominance |
| Translation BLEU (if added) | Cross-lingual quality |
## Limitations

| Area | Limitation |
|------|------------|
| Alignment | Base LM (no RLHF / instruction tuning) |
| Multilingual | Requires added translation branch + fine‑tune |
| Safety | No filtering; may reproduce dataset biases |
| Interpretability | Router decisions not fully explainable |

---
## Router Stability (Important)

Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

### Known Issues & Solutions
| Issue | Symptom | Solution |
|-------|---------|----------|
| Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |
| 479 |
-
|
| 480 |
-
### Failed Experiment: Adaptive Forcing (DO NOT USE)
|
| 481 |
-
|
| 482 |
-
**Attempted solution**: Auto-detect weak branches (<threshold) and dynamically apply forcing windows
|
| 483 |
-
```python
|
| 484 |
-
# BROKEN CODE — DO NOT USE
|
| 485 |
-
if min(alpha) < 0.15 and not in_cooldown:
|
| 486 |
-
weak_idx = argmin(alpha)
|
| 487 |
-
force_branch_idx = weak_idx
|
| 488 |
-
force_until = current_step + 1000
|
| 489 |
-
in_cooldown = True
|
| 490 |
-
```
|
| 491 |
-
|
| 492 |
-
**Why it failed**:
|
| 493 |
-
1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
|
| 494 |
-
2. **Artificial alpha**: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
|
| 495 |
-
3. **Gradient confusion**: Aux loss receives artificial entropy signals, disrupts learning
|
| 496 |
-
4. **Manual intervention superior**: Targeted forcing with aux=0 isolates signal cleanly
|
| 497 |
-
|
| 498 |
-
**Lesson**: Router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing for recovery only, not automated.
### Safeguards Implemented (Validated)
1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
3. **Entropy-max loss**: Subtract (not add) aux_loss to maximize branch diversity
4. **Warmup forcing**: 10% probability during the first 5k steps ensures all branches receive gradients
5. **FP32 LayerNorm**: Prevents BF16 precision drift in routing logits
6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
### Intervention Playbook (Step-by-Step)
**Scenario: Branch drops <10% before 5k steps**
1. Stop training, resume from the last good checkpoint
2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
3. Extend `--router_tau_freeze_steps` by +2000
4. Increase `--router_force_prob` to 0.12–0.15

**Scenario: Branch stuck <12% after 10k steps**
1. Run targeted forcing (see the Incremental Expansion section)
2. Force the weak branch for 1k steps with `aux=0`, LR=5e-5
3. Resume normal training with aux restored
4. Expected recovery: +3–8% share within 500 steps

**Scenario: Entropy <0.70 and falling after 15k**
1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening
### Fine‑Tuning Note
If using the standard HF Trainer without a custom loss, set `router_aux_weight=0` in the config to avoid an incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
### Empirical Training Log (24L Complete Journey)

**First attempt (FAILED — 512 ctx)**:
- **Step 0–300**: Perfect init (entropy 1.0) with high tau + forcing
- **Step 3000**: **Router collapse** — alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
- **Diagnosis**: 512 ctx insufficient for 24L depth
- **Action**: Abandoned run, restarted from scratch with 1024 ctx

**Adaptive forcing experiment (FAILED)**:
- **Implementation**: Auto-detect weak branches, dynamic forcing windows
- **Outcome**: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
- **Action**: Reverted code, relied on tau/aux only

**Final successful run (1024 ctx from step 0)**:
- **Step 0–300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
- **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
- **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
- **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
- **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
- **Step 6000–7000**: Entropy spikes 0.80→0.93 (exploration phase, expected)
- **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
- **Intervention**: Targeted forcing on branch 1 (10k→11k steps)
  - `--force_branch_idx 1 --force_branch_until 11000`
  - `--router_aux_start 0.0` (isolate gradient signal)
  - `--learning_rate 5e-5` (gentle nudge)
- **Step 11000**: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
- **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
  - Eval loss 4.41→4.07 (intervention improved generalization)
  - Loss trend: 34→33 (continued healthy descent)
  - **All branches active and contributing**

**Key learnings**:
1. ✅ 1024 ctx required from step 0 for 24L
2. ✅ Depth-scaled tau/aux/forcing parameters validated
3. ✅ Targeted forcing (aux=0, short window) effective for recovery
4. ❌ Adaptive forcing causes more problems than it solves
5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing

**Status**: Methodology validated on 24L/551M through 12k steps. Ready for the 2048 ctx phase (30k+). Core API stable; default schedules proven effective.
## Practical Training Tips

### DO
- ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
- ✅ Scale tau/aux with √(depth_ratio) when changing layer count
- ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
- ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
- ✅ Monitor entropy every 100 steps; save checkpoints every 500
- ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
- ✅ Keep the aux weight increasing throughout training (e.g., 0.008→0.016)
- ✅ Trust depth-scaled parameters — they're empirically validated

### DON'T
- ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps — empirically proven)
- ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
- ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
- ❌ Set aux=0 for normal training (only during targeted forcing windows)
- ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
- ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
- ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
- ❌ Use a 512→1024→2048 curriculum on deep models (≥20L requires a 1024 start)

### VRAM Optimization
If hitting OOM on 2048 ctx:
```bash
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16   # keeps effective batch = 32
```

### Quick Health Check (Per 1k Steps)
```bash
grep "\[router\]" logs/train.log | tail -10
```
Look for:
- Entropy trend (should be ≥0.70)
- Min branch value (should be ≥0.12)
- Loss trend (should decrease or stabilize)
---

## FAQ

**Q: Why entropy-max instead of a load-balancing penalty?**
To avoid premature specialization and keep new branches trainable; scaling uses an increasing aux-weight schedule.

**Q: How many branches should be added at once?**
Incremental growth (3→4→5) is recommended to prevent starvation.

**Q: Does expansion erase prior knowledge?**
No; existing branches retain their weights. The router + new branch adapt during a short fine‑tune.
---

**Updated README:**
tags:
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
  results: []
---

# Veronica-Polymorphic 24L (551M)

Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.

The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.

> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.

---
## 1. TL;DR

| Aspect | Value / Description |
|---------------------|----------------------------------------------------------------|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |

---
## 2. Intended use & scope

### Primary intent

This checkpoint is meant for:

- Researchers interested in:
  - **Mixture-of-branches / soft routing** in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula
### Out of scope

This model is **not** designed or evaluated (yet) for:

- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation

---
## 3. Model details

### 3.1 Architecture (high-level)

```text
Input tokens
      ↓
Token & position embeddings (RoPE on Q/K)
      ↓
[ VeronicaBlock × 24 ]
    VeronicaBlock:
      x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
        → Pre-LN → Polymorphic MLP (router + branches) → Residual
      ↓
Untied LM head → logits
```

Key design choices:

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
### 3.2 Polymorphic MLP & routing

Each block's MLP is replaced by a polymorphic MLP:

```text
router_logits = Router(x)   # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```

Branches:

| Branch | Role | Sketch |
|--------|------|--------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |

Routing controls:

- Temperature schedule `tau_start → tau_end` (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters: router temperature and aux-loss weight scaled ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models

The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
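Putting the pieces together, a minimal self-contained sketch of such a soft-routed MLP (illustrative only; the repo's actual `PolymorphicMLP` differs in details such as temperature schedules and forcing):

```python
import torch
import torch.nn as nn

class PolymorphicMLPSketch(nn.Module):
    """Soft-routed mixture of MLP branches, as described above."""
    def __init__(self, hidden: int, branches: list, tau: float = 1.0):
        super().__init__()
        self.branches = nn.ModuleList(branches)          # e.g. [SwiGLU, GLU, DepthwiseConvMLP]
        self.router = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, len(branches))
        )
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.router(x) / self.tau, dim=-1)    # (B, T, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)   # (B, T, H, n_branches)
        return (outs * alpha.unsqueeze(-2)).sum(dim=-1)             # weighted blend, (B, T, H)
```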
---

## 4. Training data
The pre-train data follows the codelion / DataComp LM mixture guidelines:

| Dataset | Share | Description |
|---------|-------|-------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |

Target token budget for this configuration: ~60B tokens (example setting).

For licensing and detailed descriptions, please refer to each dataset on Hugging Face.

If you reuse this mixture, please also cite:
```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

---
## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
### 5.1 Core hyperparameters

| Hyperparameter | Value / Notes |
|----------------|---------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |

Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  ...
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

(The `...` marks flags not shown in the source diff.)
### 5.2 Context-length curriculum & the "512-token trap"

Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

```text
Steps 0–20k  : 1024 tokens
Steps 20k–60k: 2048 tokens
```

For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
### 5.3 Router health during training

Training logs include entries like:

```text
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase | Steps | Entropy (norm) | Min branch share |
|-------|-------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |

Collapsed routing typically shows up as:

- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out-of-the-box.
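A small scanner against the thresholds above (a sketch: the log format and path are assumed from the example entry, this is not a repo utility):

```python
import re

LINE = re.compile(r"\[router\] alpha=\[([^\]]+)\] entropy_norm=([0-9.]+)")

def check_router_log(path: str, min_entropy: float = 0.70, min_share: float = 0.12) -> None:
    """Print any log line whose routing stats violate the health targets above."""
    with open(path) as fh:
        for line in fh:
            m = LINE.search(line)
            if not m:
                continue
            alpha = [float(a) for a in m.group(1).split(",")]
            entropy = float(m.group(2))
            if entropy < min_entropy or min(alpha) < min_share:
                print("UNHEALTHY:", line.strip())

check_router_log("logs/train.log")  # path assumed; adjust to your run
```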
---

## 6. Evaluation
### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.

> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- lm-eval-harness on:
  - `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- Instruction / SFT (if you fine-tune):
  - Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.
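As a starting point, a hypothetical harness invocation (this assumes Veronica has been exported as a `transformers`-loadable checkpoint with `trust_remote_code`, which is not yet the case):

```python
# Sketch only: requires lm-eval-harness (pip install lm-eval) and a
# transformers-compatible export of this checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "hellaswag", "piqa"],
)
print(results["results"])
```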
---
## 7. How to use

### 7.1 Loading from code

If you're using the Veronica codebase directly:

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```

You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
### 7.2 Simple generation example

```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
---

## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is incremental expansion: you can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:

- Expanding `num_funcs`
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup

The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the "Incremental Expansion" and "Translation Branch" sections in the source code and examples.
---

## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do not use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
---

## 10. Roadmap

Planned / desired directions:

| Version | Goal |
|---------|------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |

Community PRs are welcome, especially for:

- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
---

## 11. License

This model and code are released under the Apache-2.0 license.
---

## 12. Citation

If you use Veronica-Polymorphic in your work, please cite:

```bibtex
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
---

## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
- Dataset mixture ratios guided by codelion's DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.