Update README.md

README.md · CHANGED

**Previous README:**
tags:
- causal-lm
- rope
- expandable-architecture
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-24L (551M)
  results: []
---

# Veronica-Polymorphic
| Feature | Description |
|---------|-------------|
| Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
| Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional Encoding | Rotary (RoPE, θ=10,000) |
| Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb-Edu 20% |
| Context Length | **1024 (0-30k)** → 2048 (30k-60k) — *512 causes router collapse on 24L* |
| Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |

---

## Quick Start
```bash
pip install -e .
```

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```
Dataset sources:

| Dataset | Share | Link |
|--------|-------|------|
| FinePDFs‑1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |
| DCLM Baseline‑1B | 30% | https://huggingface.co/datasets/codelion/dclm-baseline-1B |
| Additional samples | 20% | https://huggingface.co/collections/codelion/pre-training-dataset-samples |

- The collection link aggregates additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.
Generation example:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer

prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **ids,
    max_new_tokens=64,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Current status: between v0.2 and v0.3.
---

Dataset mixture reference:

```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

Related collection and datasets:
- codelion pre‑training dataset samples: https://huggingface.co/collections/codelion/pre-training-dataset-samples
- codelion/dclm-baseline-1B: https://huggingface.co/datasets/codelion/dclm-baseline-1B
- codelion/finepdfs-1B: https://huggingface.co/datasets/codelion/finepdfs-1B
---

## Architecture

Per token & layer:
```text
router_logits = Router(x)        # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branches[i]
```

Routing is stabilized by:
- **Temperature schedule** (τ high early → softer mixing)
- **Entropy-max aux-loss** (subtract entropy from total loss to maximize it; see the sketch below)
- Optional **forcing** during warmup to guarantee gradient flow to new branches
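A minimal sketch of the entropy-max term (function name and normalization are illustrative, not the repo's exact API):

```python
import math

import torch
import torch.nn.functional as F

def router_entropy_bonus(router_logits: torch.Tensor, tau: float, eps: float = 1e-9) -> torch.Tensor:
    """Normalized routing entropy in [0, 1]; subtract it (scaled by λ) from the LM loss to maximize it."""
    alpha = F.softmax(router_logits / tau, dim=-1)        # (B, T, num_funcs)
    entropy = -(alpha * (alpha + eps).log()).sum(dim=-1)  # per-token entropy, nats
    return entropy.mean() / math.log(alpha.size(-1))      # normalize by log(num_funcs)

# loss = ce_loss - aux_weight * router_entropy_bonus(router_logits, tau)
```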
| 116 |
-
|
| 117 |
-
### Branch Types
|
| 118 |
-
| Branch | Purpose | Structure |
|
| 119 |
-
|--------|---------|-----------|
|
| 120 |
-
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
|
| 121 |
-
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
|
| 122 |
-
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
|
| 123 |
-
|
| 124 |
-
### Positional Encoding
|
| 125 |
-
Rotary embeddings (RoPE) applied to Q/K heads with cached cos/sin; no absolute learned positions.
|
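For reference, a minimal sketch of this rotary scheme (helper names are illustrative, not the repo's API):

```python
import torch

def rope_cos_sin(seq_len: int, head_dim: int, theta: float = 10_000.0):
    """Cache cos/sin tables once per (seq_len, head_dim), as described above."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)  # (T, head_dim/2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of Q or K by position-dependent angles. x: (B, n_heads, T, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
```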
### Stability Choices

| Mechanism | Rationale |
|-----------|-----------|
| FP32 LayerNorm | Prevent BF16 precision drift |
| Entropy-Max Aux | Avoid early router collapse |
| High initial τ | Encourage exploration across branches |
| Gradient Checkpointing | Memory efficiency for depth |
## Dataset Mixture (codelion / DataComp inspired)
Training uses a curated blend guided by open mixture studies (a loading sketch follows the table):

| Source | Share | Notes |
|--------|-------|-------|
| FinePDFs | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu | 20% | Educational domain for structured explanatory patterns |
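One way to materialize this 50/30/20 blend with the `datasets` library (a sketch assuming streaming access to the three referenced datasets; this is not the repo's actual data pipeline):

```python
from datasets import load_dataset, interleave_datasets

# Stream the three sources and sample with the 50/30/20 ratios above.
parts = [
    load_dataset("codelion/finepdfs-1B", split="train", streaming=True),
    load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True),
    load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True),
]
mix = interleave_datasets(parts, probabilities=[0.5, 0.3, 0.2], seed=42)
```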
## Training Setup

| Hyperparameter | Value (example) |
|----------------|-----------------|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad Accumulation | 8 (effective batch 32) |
| LR | 1.2e-4 cosine decay |
| Warmup | 10% steps |
| Weight Decay | 0.01 |
| Label Smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max Seq Len | 1024→2048 (curriculum) |
| Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
| Aux weight λ | 0.008 → 0.016 (depth-scaled √2×) |
| Router forcing | 10% prob for first 5k steps |
| Rep penalty (α) | 0.05 (smoke quality) |

Launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  ...
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

(The `...` marks flags not shown in the source diff.)
## The 512-Token Trap (24L)

Training the 24L model at 512-token context collapsed the router by step 3k:
```text
Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
```

**Root Cause**:
- With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
- But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
- **Insufficient signal** for stable gradient descent on router parameters
- Weak branches cannot recover from random initialization noise
- Router collapses toward dominant branch to minimize aux loss conflict
**Why 12L survived at 512**:
- Same 512 tokens → 6,144 routing examples
- Each branch-layer: **~170 samples** (same as 24L)
- But **12 layers = shorter gradient path** → less noise accumulation
- Router can stabilize before collapse
### Solution: Start at 1024 for Deep Models

```text
0–20k steps:   1024 tokens ✅ 24,576 routing examples = stable gradients
20k–60k steps: 2048 tokens 🎯 49,152 examples = final quality
```
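A quick back-of-the-envelope check of these counts (illustrative arithmetic only):

```python
def routing_stats(tokens: int, n_layers: int = 24, n_branches: int = 3):
    """Routing decisions per batch, and gradient samples per branch-layer combination."""
    decisions = tokens * n_layers                            # one decision per token per layer
    per_branch_layer = decisions / (n_branches * n_layers)   # reduces to tokens / n_branches
    return decisions, per_branch_layer

print(routing_stats(512))   # (12288, ~170.7)  -> too few samples per branch-layer on 24L
print(routing_stats(1024))  # (24576, ~341.3)  -> enough signal for stable routing
```

Note that samples per branch-layer depend only on tokens per branch, which is why 12L and 24L see the same ~170 at 512 tokens; the deeper model fails because of the longer gradient path, not fewer samples.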
## Depth-Scaled Router Parameters

With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures:

### Temperature Scaling
Softmax sharpness compounds across layers. To preserve exploration:
```text
τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
```
For 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).
### Aux Weight Scaling
The entropy gradient must compete with 24 layers pulling toward specialization:
```text
λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
```
For 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.

### Forcing Probability
Each branch needs more examples across a deeper network:
```text
P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
```
For 12L `5%`, we use **`10%`** for 24L during warmup (0–5k steps).
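The three scaling rules can be bundled into a small helper (a sketch; the training script exposes these as CLI flags rather than this function):

```python
import math

def scale_for_depth(n_layers: int, base_layers: int = 12,
                    tau: float = 1.6, aux: float = 0.012, p_force: float = 0.05):
    """√(depth_ratio) scaling for tau and aux weight; linear scaling for forcing probability."""
    r = n_layers / base_layers
    return tau * math.sqrt(r), aux * math.sqrt(r), p_force * r

print(scale_for_depth(24))  # (≈2.26, ≈0.017, 0.10) -> rounded to 2.2 / 0.016 / 10% in the configs
```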
### Empirical Results (Training Logs)
- **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
- **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
- **Step 7k**: Entropy 0.80–0.93 (exploration phase post tau-freeze)
- **Step 10k**: Loss ~34, no branch collapse
- **Step 11k** (post branch-1 recovery): Entropy 0.84–0.93, distribution `[0.57, 0.15, 0.27]` ✅
- **Step 12k**: Stable soft routing, eval loss 4.07
## Monitoring Router Health

Monitor log lines:
```text
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

### Targets by Training Phase
| Phase | Steps | Entropy Target | Min Branch Share | Notes |
|-------|-------|----------------|------------------|-------|
| Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
| Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
| Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
| Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |

### Observed Distribution (24L, Step 12k)
```text
alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
```
Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.
### Architecture-Dependent Strategy

**For 24L**:
```text
1024 tokens: Steps 0–20k   (NO 512 phase — causes router collapse)
2048 tokens: Steps 20k–60k
```

**For 12L and shallower**:
```text
512 tokens:  Steps 0–10k
1024 tokens: Steps 10k–30k
2048 tokens: Steps 30k–60k
```
---

### Phase 1: 1024 tokens (steps 0–20k)
- **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
- **VRAM**: ~8–9GB (batch=4, accum=8)
- **Throughput**: ~8–10 sec/step
- **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps

### Phase 2: 2048 tokens (steps 20k–60k)
- **Purpose**: Final capacity, long-document coherence
- **VRAM**: ~12–13GB (batch=4, accum=8)
- **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
- **Expected dip**: Temporary entropy −0.02–0.04, recovers within 500 steps

Resume at 2048:
```bash
python scripts/train_veronica.py \
  --resume_from runs/veronica-24L-1024/checkpoint-12000 \
  --output_dir runs/veronica-24L-2048 \
  --max_seq_len 2048
  # ... keep all other router params unchanged
```
---

## Incremental Expansion

Goal: increase capacity or add a specialization (e.g. translation) without a full restart.

### Steps
1. **Load the original checkpoint + config**:
```python
cfg = VeronicaConfig.from_pretrained(old_dir)
old_funcs = cfg.num_funcs
cfg.num_funcs = old_funcs + 1  # adding one branch
model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
```
2. **Implement the new branch class** (see Translation Branch below) and extend `PolymorphicMLP` construction.
3. **Copy existing router weights** and initialize the new column small:
```python
import torch
import torch.nn as nn

for blk in model.blocks:
    lin = blk.mlp.router[-1]  # final Linear of the router
    with torch.no_grad():
        # existing weights remain; only the new slice is initialized
        nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
        if lin.bias is not None:
            nn.init.zeros_(lin.bias[old_funcs:])
```
4. **Freeze old branches & attention** for warmup:
```python
for name, p in model.named_parameters():
    if f"funcs.{old_funcs}" in name or "router.2" in name:  # new branch + router final layer
        p.requires_grad = True
    else:
        p.requires_grad = False
```
5. **High τ + light forcing** (0–1k steps): `router_tau_start=1.8`, `router_force_prob≈0.15`.
6. **Blend phase** (1–3k steps): unfreeze old branches, lower τ → 1.2, increase aux to mid (e.g. 0.006).
7. **Stabilize**: restore the standard schedule (τ→1.0, aux→0.01), disable forcing.
### Recommended Minimal Fine‑Tune Command
```bash
python scripts/train_veronica.py \
  --config expanded-config.json \
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200
```
(The `--config` file is the updated one with the new `num_funcs`.)
---

## Translation Branch

Add a branch focusing on cross‑lingual adaptation without retraining the entire backbone.

### Design Goals
| Requirement | Implementation Choice |
|-------------|-----------------------|
| Lightweight | Low‑rank adapters + language conditioning |
| Reusable | Shares main hidden size; no separate encoder |
| Controllable | Can be forced via `force_func` for targeted tuning |
### Example Branch Implementation
```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
        super().__init__()
        self.rank = rank
        self.lang_embed = nn.Embedding(num_langs, hidden_size)
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, inner)
        self.down = nn.Linear(inner, hidden_size)
        # Low-rank adapters
        self.A = nn.Linear(hidden_size, rank, bias=False)
        self.B = nn.Linear(rank, hidden_size, bias=False)
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
        if lang_ids is not None:
            if lang_ids.dim() == 1:  # broadcast sentence-level
                lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
            else:
                lang_vec = self.lang_embed(lang_ids)  # (B, T, H)
            x = x + lang_vec
        h = self.up(x)
        h = F.gelu(h)  # functional gelu (there is no torch.gelu)
        h = self.down(h)
        # Adapter residual
        a = self.A(x)
        a = F.gelu(a)
        a = self.B(a)
        g = torch.sigmoid(self.gate(x))  # (B, T, 1)
        return h + g * a
```
### Integrate Into `PolymorphicMLP`
Inside branch construction:
```python
if num_funcs >= 4:
    funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))
```
### Passing Language IDs
- Add `lang_ids` to the model forward signature (optional).
- Modify the `TranslationBranch` call: `func(x, lang_ids=lang_ids)` for branches expecting it; others ignore it (see the dispatch sketch below).
- For multilingual fine‑tune, prepend special language tokens or maintain a side tensor of language indices.
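A sketch of that dispatch (assumes the branch list lives in an `nn.ModuleList`; names are illustrative, not the repo's actual forward code):

```python
from typing import Optional

import torch
import torch.nn as nn

def run_branches(funcs: nn.ModuleList, x: torch.Tensor,
                 lang_ids: Optional[torch.Tensor] = None) -> list:
    # Only language-aware branches receive lang_ids; the others keep the plain signature.
    return [f(x, lang_ids=lang_ids) if isinstance(f, TranslationBranch) else f(x)
            for f in funcs]
```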
### Fine‑Tuning Strategy
1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, an OSCAR subset).
2. Freeze the base transformer + existing branches initially.
3. Force the translation branch (`force_func = translation_index`) for exploratory steps.
4. Gradually unfreeze attention + other branches for joint adaptation.
5. Evaluate on BLEU / COMET vs baseline; adjust `rank` / `mlp_mult` if underfitting.
---

## Evaluation & Monitoring
| Metric | Purpose |
|--------|---------|
| CE / PPL | Language modeling convergence |
| Router Entropy | Diversity of branch usage |
| Alpha Distribution | Detect collapse or dominance |
| Translation BLEU (if added) | Cross-lingual quality |
## Limitations

| Area | Limitation |
|------|------------|
| Alignment | Base LM (no RLHF / instruction tuning) |
| Multilingual | Requires added translation branch + fine‑tune |
| Safety | No filtering; may reproduce dataset biases |
| Interpretability | Router decisions not fully explainable |

---
## Router Stability (Important)

Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.

### Known Issues & Solutions
| Issue | Symptom | Solution |
|-------|---------|----------|
| Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
| Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
| Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
| Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |
| 479 |
-
|
| 480 |
-
### Failed Experiment: Adaptive Forcing (DO NOT USE)
|
| 481 |
-
|
| 482 |
-
**Attempted solution**: Auto-detect weak branches (<threshold) and dynamically apply forcing windows
|
| 483 |
-
```python
|
| 484 |
-
# BROKEN CODE — DO NOT USE
|
| 485 |
-
if min(alpha) < 0.15 and not in_cooldown:
|
| 486 |
-
weak_idx = argmin(alpha)
|
| 487 |
-
force_branch_idx = weak_idx
|
| 488 |
-
force_until = current_step + 1000
|
| 489 |
-
in_cooldown = True
|
| 490 |
-
```
|
| 491 |
-
|
| 492 |
-
**Why it failed**:
|
| 493 |
-
1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
|
| 494 |
-
2. **Artificial alpha**: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
|
| 495 |
-
3. **Gradient confusion**: Aux loss receives artificial entropy signals, disrupts learning
|
| 496 |
-
4. **Manual intervention superior**: Targeted forcing with aux=0 isolates signal cleanly
|
| 497 |
-
|
| 498 |
-
**Lesson**: Router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing for recovery only, not automated.
### Safeguards Implemented (Validated)
1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
3. **Entropy-max loss**: Subtract (not add) aux_loss to maximize branch diversity
4. **Warmup forcing**: 10% probability during the first 5k steps ensures all branches receive gradients
5. **FP32 LayerNorm**: Prevents BF16 precision drift in routing logits
6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
### Intervention Playbook (Step-by-Step)
**Scenario: Branch drops <10% before 5k steps**
1. Stop training, resume from the last good checkpoint
2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
3. Extend `--router_tau_freeze_steps` by +2000
4. Increase `--router_force_prob` to 0.12–0.15

**Scenario: Branch stuck <12% after 10k steps**
1. Run targeted forcing (see the Incremental Expansion section)
2. Force the weak branch for 1k steps with `aux=0`, LR=5e-5
3. Resume normal training with aux restored
4. Expected recovery: +3–8% share within 500 steps

**Scenario: Entropy <0.70 and falling after 15k**
1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening
### Fine‑Tuning Note
If using the standard HF Trainer without a custom loss, set `router_aux_weight=0` in the config to avoid an incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
### Empirical Training Log (24L Complete Journey)

**First attempt (FAILED — 512 ctx)**:
- **Step 0–300**: Perfect init (entropy 1.0) with high tau + forcing
- **Step 3000**: **Router collapse** — alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
- **Diagnosis**: 512 ctx insufficient for 24L depth
- **Action**: Abandoned run, restarted from scratch with 1024 ctx

**Adaptive forcing experiment (FAILED)**:
- **Implementation**: Auto-detect weak branches, dynamic forcing windows
- **Outcome**: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
- **Action**: Reverted code, relied on tau/aux only

**Final successful run (1024 ctx from step 0)**:
- **Step 0–300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
- **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
- **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
- **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
- **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
- **Step 6000–7000**: Entropy spikes 0.80→0.93 (exploration phase, expected)
- **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
- **Intervention**: Targeted forcing on branch 1 (10k→11k steps)
  - `--force_branch_idx 1 --force_branch_until 11000`
  - `--router_aux_start 0.0` (isolate gradient signal)
  - `--learning_rate 5e-5` (gentle nudge)
- **Step 11000**: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
- **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
  - Eval loss 4.41→4.07 (intervention improved generalization)
  - Loss trend: 34→33 (continued healthy descent)
  - **All branches active and contributing**

**Key learnings**:
1. ✅ 1024 ctx required from step 0 for 24L
2. ✅ Depth-scaled tau/aux/forcing parameters validated
3. ✅ Targeted forcing (aux=0, short window) effective for recovery
4. ❌ Adaptive forcing causes more problems than it solves
5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing

**Status**: Methodology validated on 24L/551M through 12k steps. Ready for the 2048 ctx phase (30k+). Core API stable; default schedules proven effective.
## Practical Training Tips

### DO
- ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
- ✅ Scale tau/aux with √(depth_ratio) when changing layer count
- ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
- ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
- ✅ Monitor entropy every 100 steps; save checkpoints every 500
- ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
- ✅ Keep the aux weight increasing throughout training (e.g., 0.008→0.016)
- ✅ Trust depth-scaled parameters — they're empirically validated

### DON'T
- ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps — empirically proven)
- ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
- ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
- ❌ Set aux=0 for normal training (only during targeted forcing windows)
- ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
- ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
- ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
- ❌ Use a 512→1024→2048 curriculum on deep models (≥20L requires a 1024 start)

### VRAM Optimization
If hitting OOM on 2048 ctx:
```bash
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16   # keeps effective batch = 32
```

### Quick Health Check (Per 1k Steps)
```bash
grep "\[router\]" logs/train.log | tail -10
```
Look for:
- Entropy trend (should be ≥0.70)
- Min branch value (should be ≥0.12)
- Loss trend (should decrease or stabilize)
---

## FAQ

**Q: Why entropy-max instead of a load-balancing penalty?**
To avoid premature specialization and keep new branches trainable; scaling uses an increasing aux-weight schedule.

**Q: How many branches should be added at once?**
Incremental growth (3→4→5) is recommended to prevent starvation.

**Q: Does expansion erase prior knowledge?**
No; existing branches retain their weights. The router + new branch adapt during a short fine‑tune.
---

**Updated README:**
tags:
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
  results: []
---

# Veronica-Polymorphic 24L (551M)

Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**: each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.

The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.

> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.

---
## 1. TL;DR

| Aspect | Value / Description |
|---------------------|----------------------------------------------------------------|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |

---
## 2. Intended use & scope

### Primary intent

This checkpoint is meant for:

- Researchers interested in:
  - **Mixture-of-branches / soft routing** in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula
### Out of scope

This model is **not** designed or evaluated (yet) for:

- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation

---
## 3. Model details

### 3.1 Architecture (high-level)

```text
Input tokens
      ↓
Token & position embeddings (RoPE on Q/K)
      ↓
[ VeronicaBlock × 24 ]
    VeronicaBlock:
      x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
        → Pre-LN → Polymorphic MLP (router + branches) → Residual
      ↓
Untied LM head → logits
```

Key design choices:

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
### 3.2 Polymorphic MLP & routing

Each block's MLP is replaced by a polymorphic MLP:

```text
router_logits = Router(x)   # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```

Branches:

| Branch | Role | Sketch |
|--------|------|--------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |

Routing controls:

- Temperature schedule `tau_start → tau_end` (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters: router temperature and aux-loss weight scaled ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models

The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
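Putting the pieces together, a minimal self-contained sketch of such a soft-routed MLP (illustrative only; the repo's actual `PolymorphicMLP` differs in details such as temperature schedules and forcing):

```python
import torch
import torch.nn as nn

class PolymorphicMLPSketch(nn.Module):
    """Soft-routed mixture of MLP branches, as described above."""
    def __init__(self, hidden: int, branches: list, tau: float = 1.0):
        super().__init__()
        self.branches = nn.ModuleList(branches)          # e.g. [SwiGLU, GLU, DepthwiseConvMLP]
        self.router = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, len(branches))
        )
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.router(x) / self.tau, dim=-1)    # (B, T, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)   # (B, T, H, n_branches)
        return (outs * alpha.unsqueeze(-2)).sum(dim=-1)             # weighted blend, (B, T, H)
```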
---

## 4. Training data
The pre-train data follows the codelion / DataComp LM mixture guidelines:

| Dataset | Share | Description |
|---------|-------|-------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |

Target token budget for this configuration: ~60B tokens (example setting).

For licensing and detailed descriptions, please refer to each dataset on Hugging Face.

If you reuse this mixture, please also cite:
```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

---
## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
### 5.1 Core hyperparameters

| Hyperparameter | Value / Notes |
|----------------|---------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |

Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  ...
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

(The `...` marks flags not shown in the source diff.)
### 5.2 Context-length curriculum & the "512-token trap"

Empirical findings on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

```text
Steps 0–20k  : 1024 tokens
Steps 20k–60k: 2048 tokens
```

For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
### 5.3 Router health during training

Training logs include entries like:

```text
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase | Steps | Entropy (norm) | Min branch share |
|-------|-------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |

Collapsed routing typically shows up as:

- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out-of-the-box.
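A small scanner against the thresholds above (a sketch: the log format and path are assumed from the example entry, this is not a repo utility):

```python
import re

LINE = re.compile(r"\[router\] alpha=\[([^\]]+)\] entropy_norm=([0-9.]+)")

def check_router_log(path: str, min_entropy: float = 0.70, min_share: float = 0.12) -> None:
    """Print any log line whose routing stats violate the health targets above."""
    with open(path) as fh:
        for line in fh:
            m = LINE.search(line)
            if not m:
                continue
            alpha = [float(a) for a in m.group(1).split(",")]
            entropy = float(m.group(2))
            if entropy < min_entropy or min(alpha) < min_share:
                print("UNHEALTHY:", line.strip())

check_router_log("logs/train.log")  # path assumed; adjust to your run
```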
---

## 6. Evaluation
### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.

> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- lm-eval-harness on:
  - `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- Instruction / SFT (if you fine-tune):
  - Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.
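As a starting point, a hypothetical harness invocation (this assumes Veronica has been exported as a `transformers`-loadable checkpoint with `trust_remote_code`, which is not yet the case):

```python
# Sketch only: requires lm-eval-harness (pip install lm-eval) and a
# transformers-compatible export of this checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "hellaswag", "piqa"],
)
print(results["results"])
```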
---
## 7. How to use

### 7.1 Loading from code

If you're using the Veronica codebase directly:

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```

You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
### 7.2 Simple generation example

```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
---

## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is incremental expansion: you can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:

- Expanding `num_funcs`
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup

The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the "Incremental Expansion" and "Translation Branch" sections in the source code and examples.
---

## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do not use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
---

## 10. Roadmap

Planned / desired directions:

| Version | Goal |
|---------|------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |

Community PRs are welcome, especially for:

- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
---

## 11. License

This model and code are released under the Apache-2.0 license.
---

## 12. Citation

If you use Veronica-Polymorphic in your work, please cite:

```bibtex
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
---

## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
- Dataset mixture ratios guided by codelion's DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.