MhaWay commited on
Commit
84b0dde
·
verified ·
1 Parent(s): 77877d8

Update README.md

Browse files
Files changed (1)
  1. README.md +359 -534
README.md CHANGED
@@ -12,164 +12,199 @@ tags:
12
  - causal-lm
13
  - rope
14
  - expandable-architecture
 
15
  pipeline_tag: text-generation
16
  datasets:
17
  - codelion/finepdfs-1B
18
  - codelion/dclm-baseline-1B
19
  - codelion/fineweb-edu-1B
20
  model-index:
21
- - name: Veronica-24L (551M)
22
  results: []
23
  ---
24
 
25
- # Veronica-Polymorphic
26
 
27
- **Veronica-Polymorphic Soft Mixture-of-Experts (SMoE)** is a decoder-only transformer featuring a **polymorphic MLP layer**: each token is processed by a soft mixture of specialized branches (SwiGLU, GLU, Depthwise Causal Conv) under an entropy‑regularized router. The design enables adaptive capacity, incremental expansion (adding new branches post‑pretrain), and targeted specialization (e.g. translation modules) without full retraining from scratch.
 
28
 
29
- ## TL;DR
30
- | Feature | Description |
31
- |---------|-------------|
32
- | Architecture | 24‑layer causal Transformer (RoPE, untied embeddings, 551M params) |
33
- | Polymorphic MLP | Soft routing over 3 base branches (SwiGLU, GLU, DepthwiseConv) |
34
- | Routing Control | Depth-scaled temperature (√depth) + entropy maximization |
35
- | Precision | BF16 with FP32 LayerNorm for stability |
36
- | Positional Encoding | Rotary (RoPE, θ=10,000) |
37
- | Dataset Mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb-Edu 20% |
38
- | Context Length | **1024 (0-30k)** → 2048 (30k-60k) — *512 causes router collapse on 24L* |
39
- | Expansion | Add new branches (e.g. Translation) via lightweight migration + fine‑tune |
40
 
41
  ---
42
 
43
- ## Installation
 
44
 
45
- ```bash
- pip install -e .
- ```
-
- ```python
- from veronica import VeronicaConfig, VeronicaForCausalLM
-
- cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
- model = VeronicaForCausalLM(cfg)
- ```
51
 
52
- | Source | Share | Link |
53
- |--------|-------|------|
54
- | FinePDFs‑1B | 50% | https://huggingface.co/datasets/codelion/finepdfs-1B |
55
- | DCLM Baseline‑1B | 30% | https://huggingface.co/datasets/codelion/dclm-baseline-1B |
56
- | Additional samples | 20% | https://huggingface.co/collections/codelion/pre-training-dataset-samples |
57
 
58
- Notes
59
- - The collection link aggregates additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
60
- - Please refer to each dataset’s license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.
61
 
62
- Total tokens target (example): ~60B. The composition balances semantic density (FinePDFs) and generality (DCLM) per codelion’s guidance.
63
 
64
- Generation example:
- ```python
- from transformers import AutoTokenizer
- tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer
- prompt = "The theory of relativity states that"
- ids = tok(prompt, return_tensors="pt").to(model.device)
- out = model.generate(**ids, max_new_tokens=64, temperature=0.7, top_p=0.9)
- print(tok.decode(out[0], skip_special_tokens=True))
- ```
-
- Current status: between v0.2 and v0.3.
74
 
75
  ---
76
 
77
- ## Architecture Overview
78
 
79
- ### High Level
80
- ```
81
- Input Embeddings → [Block × N]
82
- Block: Pre-LN → Multi-Head Self-Attention (RoPE) → Pre-LN → Polymorphic MLP (Routing + Branch Fusion) → Residual
83
- Untied LM Head
84
- ```
85
- ## Dataset Citations
86
- If you use these datasets or composition, please cite:
87
 
88
- ```
89
  @article{sharma2025billion,
90
  title = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
91
  author = {Sharma, Asankhaya},
92
  year = {2025},
93
  url = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
94
  }
95
- ```
96
 
97
- Related collection and datasets:
98
- - codelion pre‑training dataset samples: https://huggingface.co/collections/codelion/pre-training-dataset-samples
99
- - codelion/dclm-baseline-1B: https://huggingface.co/datasets/codelion/dclm-baseline-1B
100
- - codelion/finepdfs-1B: https://huggingface.co/datasets/codelion/finepdfs-1B
101
 
102
  ---
103
 
104
- ### Polymorphic MLP
105
- Per token & layer:
106
- ```
107
- router_logits = Router(x) # Linear → GELU → Linear
108
- α = softmax(router_logits / τ)
109
- branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
110
- output = Σ α_i * branch_i(x)
111
- ```
112
- Routing stabilized by:
113
- - **Temperature schedule** (τ high early → softer mixing)
114
- - **Entropy-max aux-loss** (subtract entropy from total loss to maximize it)
115
- - Optional **forcing** during warmup to guarantee gradient flow to new branches
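-
- A minimal sketch of the routing weights and the normalized-entropy auxiliary term (function and variable names are illustrative, not the exact repo code):
-
- ```python
- import math
- import torch
- import torch.nn.functional as F
-
- def soft_route(router_logits: torch.Tensor, tau: float):
-     """Soft routing weights plus the normalized entropy used as an aux objective.
-
-     router_logits: (B, T, num_funcs); tau: temperature (higher = softer mixing).
-     The entropy term is *subtracted* from the total loss (scaled by lambda),
-     so minimizing the loss maximizes branch diversity.
-     """
-     alpha = F.softmax(router_logits / tau, dim=-1)         # (B, T, num_funcs)
-     entropy = -(alpha * (alpha + 1e-9).log()).sum(dim=-1)  # (B, T)
-     entropy_norm = entropy / math.log(alpha.size(-1))      # in [0, 1]
-     return alpha, entropy_norm.mean()
- ```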
116
-
117
- ### Branch Types
118
- | Branch | Purpose | Structure |
119
- |--------|---------|-----------|
120
- | SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
121
- | GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
122
- | DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
123
-
124
- ### Positional Encoding
125
- Rotary embeddings (RoPE) applied to Q/K heads with cached cos/sin; no absolute learned positions.
126
-
127
- ### Stability Choices
128
- | Mechanism | Rationale |
129
- |-----------|-----------|
130
- | FP32 LayerNorm | Prevent BF16 precision drift |
131
- | Entropy-Max Aux | Avoid early router collapse |
132
- | High initial τ | Encourage exploration across branches |
133
- | Gradient Checkpointing | Memory efficiency for depth |
134
 
135
- ---
 
136
 
137
- ## Dataset Mixture (codelion / DataComp inspired)
138
- Training uses a curated blend guided by open mixture studies:
139
 
140
- | Source | Share | Notes |
141
- |--------|-------|-------|
142
- | FinePDFs | 50% | Technical & academic PDFs (higher semantic density) |
143
- | DCLM Baseline | 30% | General web corpus (DataComp LM baseline) |
144
- | FineWeb‑Edu | 20% | Educational domain for structured explanatory patterns |
145
 
146
- Total tokens target (example): ~60B (adjustable). The composition balances semantic density vs generality, echoing codelion’s optimal ratio analyses.
147
 
148
- ---
 
149
 
150
- ## Training Setup
151
-
152
- | Hyperparameter | Value (example) |
153
- |----------------|-----------------|
154
- | Layers | 24 |
155
- | Hidden size | 768 |
156
- | Heads | 12 |
157
- | MLP mult | 4.0 |
158
- | Batch (per device) | 4 |
159
- | Grad Accumulation | 8 (effective batch 32) |
160
- | LR | 1.2e-4 cosine decay |
161
- | Warmup | 10% steps |
162
- | Weight Decay | 0.01 |
163
- | Label Smoothing | 0.01 |
164
- | Precision | bf16 + fp32 LayerNorm |
165
- | Max Seq Len | 1024→2048 (curriculum) |
166
- | Router τ | 2.2 → 1.4 (freeze first 6k steps, depth-scaled) |
167
- | Aux weight λ | 0.008 → 0.016 (depth-scaled √2×) |
168
- | Router forcing | 10% prob for first 5k steps |
169
- | Rep penalty (α) | 0.05 (smoke quality) |
170
-
171
- Launch:
172
- ```bash
173
  python scripts/train_veronica.py \
174
  --config configs/veronica-pretrain-24L.json \
175
  --dataset_paths data/mix_optimal_50_30_20 \
@@ -186,481 +221,271 @@ python scripts/train_veronica.py \
186
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
187
  --rep_alpha 0.05 \
188
  --seed 42
189
- ```
190
 
191
- ---
192
 
193
- ## Critical Discovery: Context Length & Router Stability on Deep Models
194
 
195
- ### The 512 Token Trap (24L Only)
196
 
197
- **Finding**: With 24 layers, starting training at **512 context length causes router collapse** by step 3k:
198
- ```
199
- Step 3000: alpha=[0.73, 0.14, 0.12], entropy=0.70 (UNHEALTHY)
200
- ```
201
 
202
- **Root Cause**:
203
- - With 512 tokens/batch and 24 routing decisions per token → **12,288 routing examples per batch**
204
- - But distributed across 3 branches and 24 layers → each branch-layer combination receives only **~170 gradient samples**
205
- - **Insufficient signal** for stable gradient descent on router parameters
206
- - Weak branches cannot recover from random initialization noise
207
- - Router collapses toward dominant branch to minimize aux loss conflict
208
 
209
- **Why This Doesn't Happen on 12L**:
210
- - Same 512 tokens → 6,144 routing examples
211
- - Each branch-layer: **~170 samples** (same as 24L)
212
- - But **12 layers = shorter gradient path** → less noise accumulation
213
- - Router can stabilize before collapse
214
 
215
- ### Solution: Start at 1024 for Deep Models
216
 
217
- **Corrected curriculum for 24L**:
218
- ```
219
- 0–20k steps: 1024 tokens ✅ 24,576 routing examples = stable gradients
220
- 20k–60k steps:  2048 tokens  🎯 49,152 routing examples = final quality
221
- ```
222
 
223
- **DO NOT use 512 ctx on 24L** — this is an empirical hard constraint, not a performance optimization.
 
224
 
225
- **For 12L and shallower**: 512→1024→2048 curriculum works fine.
226
 
227
- **Mathematical threshold**: ~15–20 layers appears to be the crossover where 512 becomes unstable. Always use 1024 or higher for ≥24L.
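-
- The numbers above follow directly from tokens × layers; a quick check (illustrative arithmetic only):
-
- ```python
- def routing_examples(tokens_per_batch: int, n_layers: int, n_branches: int = 3):
-     """Routing decisions per batch and average gradient samples per branch-layer pair."""
-     total = tokens_per_batch * n_layers
-     per_branch_layer = total / (n_branches * n_layers)  # == tokens_per_batch / n_branches
-     return total, per_branch_layer
-
- print(routing_examples(512, 24))   # (12288, ~170)  -> unstable on 24L
- print(routing_examples(1024, 24))  # (24576, ~341)  -> stable
- print(routing_examples(2048, 24))  # (49152, ~683)
- ```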
228
 
229
- ---
230
 
231
- ## Depth Scaling for 24L (Mathematical Rationale)
232
-
233
- With 24 independent routing decisions per token (one per layer), naive parameters from shallower models (12L) cause amplified specialization and branch collapse. We apply **square-root depth scaling** to maintain equivalent "softness" across architectures:
234
-
235
- ### Temperature Scaling
236
- Softmax sharpness compounds across layers. To preserve exploration:
237
- ```
238
- τ_24L = τ_12L × √(24/12) = τ_12L × √2 ≈ τ_12L × 1.41
239
- ```
240
- For 12L baseline `τ=1.6`, we use **`τ=2.2`** for 24L (start) and **`τ=1.4`** (end).
241
-
242
- ### Aux Weight Scaling
243
- Entropy gradient must compete with 24 layers pulling toward specialization:
244
- ```
245
- λ_24L = λ_12L × √2 ≈ λ_12L × 1.41
246
- ```
247
- For 12L baseline `λ=0.005→0.012`, we use **`λ=0.008→0.016`** for 24L.
248
-
249
- ### Forcing Probability
250
- Each branch needs more examples across deeper network:
251
- ```
252
- P_force_24L ≈ P_force_12L × (24/12) = 2 × P_force_12L
253
- ```
254
- For 12L `5%`, we use **`10%`** for 24L during warmup (0–5k steps).
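-
- The scaling rules above can be wrapped in a small helper; a sketch under the same assumptions (names are illustrative):
-
- ```python
- import math
-
- def depth_scaled_router_params(n_layers: int, base_layers: int = 12,
-                                base_tau: float = 1.6, base_aux: float = 0.005,
-                                base_force_prob: float = 0.05):
-     """Scale tau and aux weight by sqrt(depth ratio); scale forcing prob linearly."""
-     ratio = n_layers / base_layers
-     return {
-         "tau_start": base_tau * math.sqrt(ratio),   # 12L: 1.6 -> 24L: ~2.26 (rounded to 2.2 here)
-         "aux_start": base_aux * math.sqrt(ratio),   # 12L: 0.005 -> 24L: ~0.007 (rounded to 0.008)
-         "force_prob": base_force_prob * ratio,      # 12L: 5% -> 24L: 10%
-     }
-
- print(depth_scaled_router_params(24))
- ```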
255
-
256
- ### Empirical Results (Training Logs)
257
- - **Step 300**: Entropy 1.00, perfect uniform distribution `[0.33, 0.33, 0.33]`
258
- - **Step 5k**: Entropy 0.73, healthy distribution `[0.71, 0.11, 0.18]`
259
- - **Step 7k**: Entropy 0.80–0.93 (exploration phase post tau-freeze)
260
- - **Step 10k**: Loss ~34, no branch collapse
261
- - **Step 11k** (post branch-1 recovery): Entropy 0.84–0.93, distribution `[0.57, 0.15, 0.27]` ✅
262
- - **Step 12k**: Stable soft routing, eval loss 4.07
263
 
264
- ---
265
 
266
- ## Router Health Metrics
267
- Monitor log lines:
268
- ```
269
- [router] alpha=[a0, a1, a2, ...] entropy_norm=E
270
- ```
271
- ### Targets by Training Phase
272
- | Phase | Steps | Entropy Target | Min Branch Share | Notes |
273
- |-------|-------|----------------|------------------|-------|
274
- | Warmup | 0–5k | ≥0.90 | ≥0.25 | Forcing active, near-uniform |
275
- | Post-freeze | 5k–10k | ≥0.75 | ≥0.12 | Specialization begins |
276
- | Stable | 10k+ | ≥0.70 | ≥0.15 | Soft routing converged |
277
- | Final | 40k–60k | ≥0.65 | ≥0.12 | Acceptable specialization |
278
-
279
- ### Observed Distribution (24L, Step 12k)
280
- ```
281
- alpha=[0.571, 0.153, 0.276] entropy_norm=0.876
282
- ```
283
- Ideal soft routing: dominant branch ~55–65%, minorities ~15–25% each.
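-
- The `entropy_norm` value in the logs can be reproduced from the alpha values; a small check in plain Python (helper name is illustrative):
-
- ```python
- import math
-
- def router_health(alpha, entropy_floor=0.70, min_share=0.15):
-     """Check a routing distribution against the stable-phase targets above."""
-     entropy = -sum(a * math.log(a + 1e-9) for a in alpha)
-     entropy_norm = entropy / math.log(len(alpha))
-     return {
-         "entropy_norm": round(entropy_norm, 3),
-         "min_share": min(alpha),
-         "healthy": entropy_norm >= entropy_floor and min(alpha) >= min_share,
-     }
-
- print(router_health([0.571, 0.153, 0.276]))
- # {'entropy_norm': 0.876, 'min_share': 0.153, 'healthy': True}
- ```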
284
 
285
- ---
 
286
 
287
- ## Context Length Curriculum
288
 
289
- ### Architecture-Dependent Strategy
290
 
291
- **For 24L (≥20L in general)**:
292
- ```
293
- 1024 tokens: Steps 0–20k (NO 512 phase — causes router collapse)
294
- 2048 tokens: Steps 20k–60k
295
- ```
296
 
297
- **For 12L and shallower**:
298
- ```
299
- 512 tokens: Steps 0–10k
300
- 1024 tokens: Steps 10k–30k
301
- 2048 tokens: Steps 30k–60k
302
- ```
303
 
304
  ---
305
 
306
- ### Phase 1 (24L): 1024 Tokens (Steps 0–20k)
307
- - **Purpose**: Router stability + pattern learning (REQUIRED for 24L from step 0)
308
- - **VRAM**: ~8–9GB (batch=4, accum=8)
309
- - **Throughput**: ~8–10 sec/step
310
- - **Why not 512**: Insufficient routing examples cause branch collapse by 3k steps
311
 
312
- ### Phase 2 (24L): 2048 Tokens (Steps 20k–60k)
313
- - **Purpose**: Final capacity, long-document coherence
314
- - **VRAM**: ~12–13GB (batch=4, accum=8)
315
- - **Switching criteria**: Stable routing on 1024 (entropy ≥0.75, branches ≥0.15)
316
- - **Expected dip**: Temporary entropy −0.02–0.04, recovers within 500 steps
317
 
318
- Note: the VRAM figures above assume BF16.
319
 
320
- ### Switching Template
321
- ```bash
322
- python scripts/train_veronica.py \
323
- --resume_from runs/veronica-24L-1024/checkpoint-12000 \
324
- --output_dir runs/veronica-24L-2048 \
325
- --max_seq_len 2048 \
326
- # ... keep all other router params unchanged
327
- ```
328
 
329
- ---
330
 
331
- ## Incremental Expansion (Add New Branch Post‑Pretrain)
332
- Goal: Increase capacity or add a specialization (e.g. translation) without full restart.
333
-
334
- ### Steps
335
- 1. **Load original checkpoint + config**:
336
- ```python
337
- cfg = VeronicaConfig.from_pretrained(old_dir)
338
- old_funcs = cfg.num_funcs
339
- cfg.num_funcs = old_funcs + 1 # adding one branch
340
- model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
341
- ```
342
- 2. **Implement new branch class** (see Translation branch below) and extend `PolymorphicMLP` construction.
343
- 3. **Copy existing router weights** and init new column small:
344
- ```python
345
- import torch, torch.nn as nn
346
- for blk in model.blocks:
347
- lin = blk.mlp.router[-1] # final Linear
348
- with torch.no_grad():
349
- # existing weights remain; new slice initialized
350
- nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
351
- if lin.bias is not None:
352
- nn.init.zeros_(lin.bias[old_funcs:])
353
- ```
354
- 4. **Freeze old branches & attention** for warmup:
355
- ```python
356
- for name, p in model.named_parameters():
357
- if "funcs.%d" % (old_funcs) in name or "router.2" in name: # new branch + router final layer
358
- p.requires_grad = True
359
- else:
360
- p.requires_grad = False
361
- ```
362
- 5. **High τ + light forcing** (0–1k steps): `router_tau_start=1.8`, `router_force_prob≈0.15`.
363
- 6. **Blend phase** (1–3k steps): unfreeze old branches, lower τ → 1.2, increase aux to mid (e.g. 0.006).
364
- 7. **Stabilize**: restore standard schedule (τ→1.0, aux→0.01), disable forcing.
365
-
366
- ### Recommended Minimal Fine‑Tune Command
367
- ```bash
368
- python scripts/train_veronica.py \
369
- --config expanded-config.json \ # updated num_funcs
370
- --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
371
- --output_dir runs/veronica-expand-translation \
372
- --max_steps 8000 \
373
- --per_device_train_batch_size 4 \
374
- --gradient_accumulation_steps 8 \
375
- --learning_rate 8e-5 \
376
- --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
377
- --router_aux_start 0.001 --router_aux_end 0.008 \
378
- --router_force_prob 0.15 --router_force_warmup_steps 1200
379
- ```
380
 
381
- ---
382
 
383
- ## Translation Specialization Branch
384
- Add a branch focusing on cross‑lingual adaptation without retraining entire backbone.
385
-
386
- ### Design Goals
387
- | Requirement | Implementation Choice |
388
- |-------------|-----------------------|
389
- | Lightweight | Low‑rank adapters + language conditioning |
390
- | Reusable | Shares main hidden size; no separate encoder |
391
- | Controllable | Can be forced via `force_func` for targeted tuning |
392
-
393
- ### Example Branch Implementation
394
- ```python
- from typing import Optional
-
- import torch
- import torch.nn as nn
- import torch.nn.functional as F
-
- class TranslationBranch(nn.Module):
-     def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
-         super().__init__()
-         self.rank = rank
-         self.lang_embed = nn.Embedding(num_langs, hidden_size)
-         inner = int(hidden_size * mlp_mult)
-         self.up = nn.Linear(hidden_size, inner)
-         self.down = nn.Linear(inner, hidden_size)
-         # Low-rank adapters
-         self.A = nn.Linear(hidden_size, rank, bias=False)
-         self.B = nn.Linear(rank, hidden_size, bias=False)
-         self.gate = nn.Linear(hidden_size, 1)
-
-     def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
-         # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
-         if lang_ids is not None:
-             if lang_ids.dim() == 1:  # broadcast sentence-level language embedding
-                 lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
-             else:
-                 lang_vec = self.lang_embed(lang_ids)  # (B, T, H)
-             x = x + lang_vec
-         h = self.up(x)
-         h = F.gelu(h)
-         h = self.down(h)
-         # Adapter residual path
-         a = self.A(x)
-         a = F.gelu(a)
-         a = self.B(a)
-         g = torch.sigmoid(self.gate(x))  # (B, T, 1)
-         return h + g * a
- ```
426
-
427
- ### Integrate Into `PolymorphicMLP`
428
- Inside branch construction:
429
- ```python
430
- if num_funcs >= 4:
431
- funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))
432
- ```
433
-
434
- ### Passing Language IDs
435
- - Add `lang_ids` to model forward signature (optional).
436
- - Modify TranslationBranch call: `func(x, lang_ids=lang_ids)` for branches expecting it; others ignore.
437
- - For multilingual fine‑tune, prepend special language tokens or maintain a side tensor of language indices.
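-
- A sketch of the dispatch inside the polymorphic MLP forward (assumed structure; only branches that accept `lang_ids` receive it):
-
- ```python
- def apply_branches(funcs, x, lang_ids=None):
-     outputs = []
-     for func in funcs:
-         if isinstance(func, TranslationBranch):
-             outputs.append(func(x, lang_ids=lang_ids))  # language-aware branch
-         else:
-             outputs.append(func(x))  # SwiGLU / GLU / DepthwiseConv ignore lang_ids
-     return outputs
- ```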
438
-
439
- ### Fine‑Tuning Strategy
440
- 1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, OSCAR subset).
441
- 2. Freeze base transformer + existing branches initially.
442
- 3. Force translation branch (`force_func = translation_index`) for exploratory steps.
443
- 4. Gradually unfreeze attention + other branches for joint adaptation.
444
- 5. Evaluate on BLEU / COMET vs baseline; adjust rank / mlp_mult if underfitting.
445
 
446
- ---
447
 
448
- ## Evaluation & Monitoring
449
- | Metric | Purpose |
450
- |--------|---------|
451
- | CE / PPL | Language modeling convergence |
452
- | Router Entropy | Diversity of branch usage |
453
- | Alpha Distribution | Detect collapse or dominance |
454
- | Translation BLEU (if added) | Cross-lingual quality |
455
 
456
- ---
457
 
458
- ## Limitations
459
- | Area | Limitation |
460
- |------|------------|
461
- | Alignment | Base LM (no RLHF / instruction tuning) |
462
- | Multilingual | Requires added translation branch + fine‑tune |
463
- | Safety | No filtering; may reproduce dataset biases |
464
- | Interpretability | Router decisions not fully explainable |
465
 
466
- ---
467
 
468
- ## Router Stability (Important)
469
-
470
- Dynamic soft‑routing is powerful but sensitive. The training methodology has been refined through empirical testing on 24L to ensure healthy branch growth.
471
-
472
- ### Known Issues & Solutions
473
- | Issue | Symptom | Solution |
474
- |-------|---------|----------|
475
- | Early collapse | Branch <10% by 3k steps | Increase `tau_start` (2.2→2.4), extend freeze (6k→8k) |
476
- | Post-freeze oscillation | Entropy spikes 0.75→0.95 | Expected; aux pushes exploration. Monitor 500 steps. |
477
- | Weak branch stagnation | Branch <12% after 10k | Targeted forcing: `--force_branch_idx X --force_branch_until +1000`, aux=0 during window |
478
- | Adaptive forcing loops | Repeated forced windows | **Do not use** adaptive forcing; rely on aux+tau only |
479
-
480
- ### Failed Experiment: Adaptive Forcing (DO NOT USE)
481
-
482
- **Attempted solution**: Auto-detect weak branches (<threshold) and dynamically apply forcing windows
483
- ```python
484
- # BROKEN CODE — DO NOT USE
485
- if min(alpha) < 0.15 and not in_cooldown:
486
- weak_idx = argmin(alpha)
487
- force_branch_idx = weak_idx
488
- force_until = current_step + 1000
489
- in_cooldown = True
490
- ```
491
-
492
- **Why it failed**:
493
- 1. **Cascade loops**: Forcing branch A → weakens branch B → triggers forcing B → weakens A → infinite oscillation
494
- 2. **Artificial alpha**: During forced windows, alpha reflects forcing distribution [0,0,1], not learned preferences
495
- 3. **Gradient confusion**: Aux loss receives artificial entropy signals, disrupts learning
496
- 4. **Manual intervention superior**: Targeted forcing with aux=0 isolates signal cleanly
497
-
498
- **Lesson**: Router needs **consistent pressure** (tau + aux), not **reactive intervention**. Manual forcing for recovery only, not automated.
499
-
500
- ### Safeguards Implemented (Validated)
501
- 1. **Depth-scaled parameters**: τ and λ scaled by √(depth_ratio) to maintain effective softness
502
- 2. **Extended freeze**: Tau held constant for 6k steps (10% of training) to prevent premature specialization
503
- 3. **Entropy-max loss**: Subtract (not add) aux_loss to maximize branch diversity
504
- 4. **Warmup forcing**: 10% probability during first 5k steps ensures all branches receive gradients
505
- 5. **FP32 LayerNorm**: Prevents BF16 precision drift in routing logits
506
- 6. **NO adaptive forcing**: Rely on tau/aux scheduling + manual intervention when needed
507
-
508
- ### Intervention Playbook (Step-by-Step)
509
- **Scenario: Branch drops <10% before 5k steps**
510
- 1. Stop training, resume from last good checkpoint
511
- 2. Increase `--router_tau_start` by +0.2 (e.g., 2.2→2.4)
512
- 3. Extend `--router_tau_freeze_steps` by +2000
513
- 4. Increase `--router_force_prob` to 0.12–0.15
514
-
515
- **Scenario: Branch stuck <12% after 10k steps**
516
- 1. Run targeted forcing (see Incremental Expansion section)
517
- 2. Force weak branch for 1k steps with `aux=0`, LR=5e-5
518
- 3. Resume normal training with aux restored
519
- 4. Expected recovery: +3–8% share within 500 steps
520
-
521
- **Scenario: Entropy <0.70 and falling after 15k**
522
- 1. Increase `--router_aux_end` by +0.002 (e.g., 0.016→0.018)
523
- 2. Consider raising `--router_tau_end` slightly (1.4→1.5) to slow sharpening
524
-
525
- ### Fine‑Tuning Note
526
- If using standard HF Trainer without custom loss, set `router_aux_weight=0` in config to avoid incorrect gradient direction. Use `scripts/train_veronica.py` for full entropy-max support.
527
-
528
- ### Empirical Training Log (24L Complete Journey)
529
-
530
- **First attempt (FAILED — 512 ctx)**:
531
- - **Step 0–300**: Perfect init (entropy 1.0) with high tau + forcing
532
- - **Step 3000**: **Router collapse** — alpha=[0.73, 0.14, 0.12], entropy 0.70 ❌
533
- - **Diagnosis**: 512 ctx insufficient for 24L depth
534
- - **Action**: Abandoned run, restarted from scratch with 1024 ctx
535
-
536
- **Adaptive forcing experiment (FAILED)**:
537
- - **Implementation**: Auto-detect weak branches, dynamic forcing windows
538
- - **Outcome**: Cascade loops, artificial alpha patterns [0,0,1], gradient confusion
539
- - **Action**: Reverted code, relied on tau/aux only
540
-
541
- **Final successful run (1024 ctx from step 0)**:
542
- - **Step 0–300**: Perfect uniformity (entropy 1.0), high tau (2.2) + 10% forcing
543
- - **Step 1000**: Loss 87→52, entropy 0.92, balanced [0.39, 0.32, 0.29]
544
- - **Step 3000**: Loss 41, entropy 0.73, distribution [0.71, 0.13, 0.16] (healthy)
545
- - **Step 5000**: Loss 37, forcing disabled, entropy 0.72 maintained
546
- - **Step 6000**: Tau unfreezes (2.2→1.4 schedule begins)
547
- - **Step 6000-7000**: Entropy spikes 0.80→0.93 (exploration phase, expected)
548
- - **Step 10000**: Loss 34, **branch 1 weakened to ~10%** (concern threshold)
549
- - **Intervention**: Targeted forcing on branch 1 (10k→11k steps)
550
- - `--force_branch_idx 1 --force_branch_until 11000`
551
- - `--router_aux_start 0.0` (isolate gradient signal)
552
- - `--learning_rate 5e-5` (gentle nudge)
553
- - **Step 11000**: Branch 1 recovered to 15%, entropy 0.84–0.93 ✅
554
- - **Step 12000**: Stable soft routing [0.57, 0.15, 0.27], entropy 0.876
555
- - Eval loss 4.41→4.07 (intervention improved generalization)
556
- - Loss trend: 34→33 (continued healthy descent)
557
- - **All branches active and contributing**
558
-
559
- **Key learnings**:
560
- 1. ✅ 1024 ctx required from step 0 for 24L
561
- 2. ✅ Depth-scaled tau/aux/forcing parameters validated
562
- 3. ✅ Targeted forcing (aux=0, short window) effective for recovery
563
- 4. ❌ Adaptive forcing causes more problems than it solves
564
- 5. ✅ Entropy 0.84–0.93 with min branch 15% = healthy soft routing
565
-
566
- **Status**: Methodology validated on 24L/551M through 12k steps. Ready for 2048 ctx phase (30k+). Core API stable; default schedules proven effective.
567
 
568
- ---
569
 
570
- ## Practical Training Tips
571
-
572
- ### DO
573
- - ✅ **Use 1024 ctx from step 0 for 24L models** (512 causes router collapse)
574
- - ✅ Scale tau/aux with √(depth_ratio) when changing layer count
575
- - ✅ Use depth-scaled forcing probability (10% for 24L vs 5% for 12L)
576
- - ✅ Freeze tau for ~10% of total training steps (6k for 60k total)
577
- - ✅ Monitor entropy every 100 steps; save checkpoints every 500
578
- - ✅ Apply targeted forcing (aux=0, short window) for weak branches after 10k
579
- - ✅ Keep aux weight increasing throughout training (e.g., 0.008→0.016)
580
- - ✅ Trust depth-scaled parameters — they're empirically validated
581
-
582
- ### DON'T
583
- - ❌ **Use 512 ctx on 24L** (causes collapse by 3k steps — empirically proven)
584
- - ❌ **Implement adaptive forcing** (causes cascade loops and artificial alpha)
585
- - ❌ Lower tau too aggressively (<1.2 for 24L can cause collapse)
586
- - ❌ Set aux=0 for normal training (only during targeted forcing windows)
587
- - ❌ Switch context length without verifying entropy stability (≥0.72 for 1k steps)
588
- - ❌ Expect perfect uniformity throughout training (soft routing allows specialization)
589
- - ❌ Panic if entropy spikes post tau-freeze (oscillation is expected; monitor 500 steps)
590
- - ❌ Use curriculum 512→1024→2048 on deep models (≥20L requires 1024 start)
591
-
592
- ### VRAM Optimization
593
- If hitting OOM on 2048 ctx:
594
- ```bash
595
- --per_device_train_batch_size 2 \
596
- --gradient_accumulation_steps 16 # keeps effective batch = 32
597
- ```
598
-
599
- ### Quick Health Check (Per 1k Steps)
600
- ```bash
601
- grep "\[router\]" logs/train.log | tail -10
602
- ```
603
- Look for:
604
- - Entropy trend (should be ≥0.70)
605
- - Min branch value (should be ≥0.12)
606
- - Loss trend (should decrease or stabilize)
607
 
608
  ---
609
 
610
- ## Roadmap
611
- | Version | Goal |
612
- |---------|------|
613
- | v0.1 | Core polymorphic MLP + tests |
614
- | v0.2 | Router logging + entropy regularization |
615
- | v0.3 | Channel attention option |
616
- | v0.4 | FlashAttention integration |
617
- | v0.5 | Expansion utilities (branch migration helpers) |
618
- | v0.6 | Translation branch reference implementation |
619
 
620
  ---
621
 
622
- ## Contributing
623
- PRs welcome for: new branch types, expansion helpers, multilingual adapters, evaluation scripts.
624
 
625
  ---
626
 
627
- ## License
628
- Apache-2.0
629
 
630
  ---
631
 
632
- ## Citation
633
- ```bibtex
634
- @misc{veronica-2025,
635
- title={Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
636
- author={Emanuele D'Angelo (GG-Ally)},
637
- year={2025},
638
- howpublished={\url{https://huggingface.co/MhaWay/Veronica}}
639
- }
640
- ```
641
 
642
  ---
643
 
644
- ## Acknowledgments
645
- - Mixture & routing concepts inspired by Switch Transformer, GLaM, MoE literature.
646
- - Dataset composition ratios guided by codelion’s DataComp LM mixture studies.
647
- - RoPE adaptation referencing GPT-NeoX implementation details.
648
 
649
  ---
650
 
651
- ## FAQ
652
- **Q: Why entropy-max instead of load-balancing penalty?**
653
- To avoid premature specialization and keep new branches trainable; scaling uses increasing aux weight schedule.
654
 
655
- **Q: Can I add many branches at once?**
656
- Recommended incremental (3→4→5) to prevent starvation.
657
 
658
- **Q: How to specialize for translation?**
659
- Add `TranslationBranch`, warmup with forced routing, then blended fine-tune with multilingual data.
 
 
 
 
660
 
661
- **Q: Does expansion erase prior knowledge?**
662
- No; existing branches retain weights. Router + new branch adapt during short fine‑tune.
663
 
664
  ---
665
 
666
- Happy branching! 🌿
12
  - causal-lm
13
  - rope
14
  - expandable-architecture
15
+ - research
16
  pipeline_tag: text-generation
17
  datasets:
18
  - codelion/finepdfs-1B
19
  - codelion/dclm-baseline-1B
20
  - codelion/fineweb-edu-1B
21
  model-index:
22
+ - name: Veronica-Polymorphic 24L (551M)
23
  results: []
24
  ---
25
 
26
+ # Veronica-Polymorphic 24L (551M)
27
 
28
+ Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**:
29
+ each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.
30
 
31
+ The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.
32
+
33
+ > ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
34
+ > Do **not** treat this as a production-ready model.
 
 
 
 
 
 
 
35
 
36
  ---
37
 
38
+ ## 1. TL;DR
39
+
40
+ | Aspect | Value / Description |
41
+ |---------------------|----------------------------------------------------------------|
42
+ | Type | Decoder-only causal LM |
43
+ | Params | ~551M |
44
+ | Layers | 24 |
45
+ | Hidden size | 768 |
46
+ | Heads | 12 |
47
+ | Positional encoding | RoPE (rotary) |
48
+ | MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
49
+ | Routing | Entropy-regularized soft routing, depth-scaled temperature |
50
+ | Precision | bf16 weights, fp32 LayerNorm |
51
+ | Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
52
+ | Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
53
+ | Intended use | Research on routing / branch specialization |
54
+ | Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |
55
 
56
+ ---
 
 
 
 
 
57
 
58
+ ## 2. Intended use & scope
 
 
 
 
59
 
60
+ ### Primary intent
 
 
61
 
62
+ This checkpoint is meant for:
63
+
64
+ - Researchers interested in:
+   - **Mixture-of-branches / soft routing** in MLPs
+   - Stability of routers on deeper (24L) architectures
+   - Incremental model growth via **adding branches post-pretrain**
+ - Practitioners who want a **small, hackable codebase** to experiment with:
+   - Polymorphic MLPs
+   - Entropy-regularized routing
+   - Context-length curricula
72
+
73
+ ### Out of scope
74
+
75
+ This model is **not** designed or evaluated (yet) for:
76
+
77
+ - General-purpose assistant use
78
+ - Safety-critical or high-stakes decisions
79
+ - Deployment to end-users without additional filtering, alignment, and evaluation
80
+
81
+ ---
82
+
83
+ ## 3. Model details
84
+
85
+ ### 3.1 Architecture (high-level)
86
+
87
+ ```text
+ Input tokens
+
+ Token & position embeddings (RoPE on Q/K)
+
+ [ VeronicaBlock × 24 ]
+   VeronicaBlock:
+     x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
+       → Pre-LN → Polymorphic MLP (router + branches) → Residual
+
+ Untied LM head → logits
+ ```
+
+ Key design choices:
+
+ - Decoder-only Transformer (causal LM)
+ - Pre-LayerNorm blocks
+ - RoPE positional encoding (no learned absolute positions)
+ - Untied input embeddings / LM head
+ - Gradient checkpointing used in training runs for memory efficiency
112
+ ### 3.2 Polymorphic MLP & routing
+
+ Each block’s MLP is replaced by a polymorphic MLP:
+
+ ```
+ router_logits = Router(x)   # Linear → GELU → Linear
+ alpha = softmax(router_logits / tau)
+
+ branches = [
+     SwiGLU(x),
+     GLU(x),
+     DepthwiseConvMLP(x),
+ ]
+
+ output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
+ ```
+
+ Branches:
+
+ | Branch        | Role                           | Sketch                                            |
+ |---------------|--------------------------------|---------------------------------------------------|
+ | SwiGLU        | Default gated MLP              | Linear(up) → split → SiLU×gate → Linear(down)     |
+ | GLU           | Alternative gating dynamics    | Linear(up) → split → Sigmoid×gate → Linear(down)  |
+ | DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP                 |
+
+ Routing controls:
+
+ - Temperature schedule `tau_start → tau_end` (higher early = softer mixing)
+ - Entropy-max aux-loss: encourages non-collapsed branch usage
+ - Depth-scaled parameters: router temperature and aux-loss weight scaled ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models
145
+
146
+
147
+
148
+ The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
149
 
 
 
 
 
 
 
 
 
 
 
150
 
151
  ---
152
 
153
+ ## 4. Training data
+
+ The pre-train data follows the codelion / DataComp LM mixture guidelines:
+
+ | Dataset                   | Share | Description                                     |
+ |---------------------------|-------|-------------------------------------------------|
+ | codelion/finepdfs-1B      | 50%   | Technical/academic PDFs (high semantic density) |
+ | codelion/dclm-baseline-1B | 30%   | General web corpus baseline                     |
+ | codelion/fineweb-edu-1B   | 20%   | Educational / explanatory web data              |
+
+ Target token budget for this configuration: ~60B tokens (example setting).
+
+ For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
+
+ If you reuse this mixture, please also cite:
170
 
 
171
  @article{sharma2025billion,
172
  title = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
173
  author = {Sharma, Asankhaya},
174
  year = {2025},
175
  url = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
176
  }
 
177
 
 
 
 
 
178
 
179
  ---
180
 
181
+ ## 5. Training procedure
 
182
 
183
+ > Note: numbers below describe the reference run configuration used to train this checkpoint.
+ > You can adapt them for your own experiments.
185
 
 
 
186
 
 
 
 
 
 
187
 
188
+ ### 5.1 Core hyperparameters
+
+ | Hyperparameter          | Value / Notes                  |
+ |-------------------------|--------------------------------|
+ | Layers                  | 24                             |
+ | Hidden size             | 768                            |
+ | Attention heads         | 12                             |
+ | MLP expansion           | 4×                             |
+ | Per-device batch size   | 4                              |
+ | Grad accumulation       | 8 (effective batch 32)         |
+ | Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
+ | Warmup                  | 10% of total steps             |
+ | Weight decay            | 0.01                           |
+ | Label smoothing         | 0.01                           |
+ | Precision               | bf16 + fp32 LayerNorm          |
+ | Max steps               | 60k (example target)           |
+
+ Example launch:
207
208
  python scripts/train_veronica.py \
209
  --config configs/veronica-pretrain-24L.json \
210
  --dataset_paths data/mix_optimal_50_30_20 \
 
221
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
222
  --rep_alpha 0.05 \
223
  --seed 42
 
224
 
225
+ ### 5.2 Context-length curriculum & “512-token trap”
+
+ Empirical finding on 24-layer models:
+
+ - Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, other branches starved.
+ - Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.
+
+ Recommended curriculum for 24L:
+
+ - Steps 0–20k: 1024 tokens
+ - Steps 20k–60k: 2048 tokens
+
+ For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
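+
+ A minimal sketch of how the curriculum can be expressed in a training script (names and structure are illustrative, not the repo API):
+
+ ```python
+ # (start_step, seq_len) pairs for the 24L reference run
+ CURRICULUM_24L = [(0, 1024), (20_000, 2048)]
+
+ def seq_len_for_step(step: int, curriculum=CURRICULUM_24L) -> int:
+     """Return the context length in effect at a given optimizer step."""
+     length = curriculum[0][1]
+     for start, seq_len in curriculum:
+         if step >= start:
+             length = seq_len
+     return length
+
+ assert seq_len_for_step(10_000) == 1024
+ assert seq_len_for_step(30_000) == 2048
+ ```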
243
 
244
+ ### 5.3 Router health during training
+
+ Training logs include entries like:
+
+ `[router] alpha=[a0, a1, a2] entropy_norm=E`
+
+ Healthy targets (rough guideline):
+
+ | Phase       | Steps  | Entropy (norm) | Min branch share |
+ |-------------|--------|----------------|------------------|
+ | Warmup      | 0–5k   | ≥ 0.90         | ≥ 0.25           |
+ | Post-freeze | 5k–10k | ≥ 0.75         | ≥ 0.12           |
+ | Stable      | 10k+   | ≥ 0.70         | ≥ 0.15           |
+
+ Collapsed routing typically shows up as:
+
+ - Entropy < 0.65
+ - One branch > 80% usage for many thousands of steps
+ - Other branches stuck < 5–10%
+
+ The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out-of-the-box.
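+
+ A small helper for pulling the latest router health out of a log file (illustrative only; it assumes the `[router]` line format shown above):
+
+ ```python
+ import re
+
+ ROUTER_RE = re.compile(r"\[router\] alpha=\[([^\]]+)\] entropy_norm=([0-9.]+)")
+
+ def latest_router_health(log_path: str):
+     """Return (alpha_list, entropy_norm) from the last [router] line in a log file."""
+     last = None
+     with open(log_path) as f:
+         for line in f:
+             m = ROUTER_RE.search(line)
+             if m:
+                 last = m
+     if last is None:
+         return None
+     alpha = [float(x) for x in last.group(1).split(",")]
+     return alpha, float(last.group(2))
+
+ # Example: alpha, entropy = latest_router_health("logs/train.log")
+ ```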
 
 
 
 
269
 
 
 
 
 
 
 
270
 
271
  ---
272
 
273
+ ## 6. Evaluation
+
+ ### 6.1 Current evaluation status
+
+ At the time of this release:
+
+ - No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
+ - There are no public numbers for:
+   - MMLU (5-shot / 0-shot)
+   - ARC-e / ARC-c
+   - HellaSwag, PIQA, GSM8K, etc.
+
+ Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.
+
+ > 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
 
 
 
 
 
 
294
 
 
295
 
296
 
297
+ ### 6.2 Planned evaluation (suggested)
+
+ If you adopt or extend Veronica-Polymorphic, consider running:
+
+ - lm-eval-harness on:
+   - mmlu, arc_challenge, arc_easy, hellaswag, piqa
+ - Instruction / SFT (if you fine-tune):
+   - Alpaca-style or OpenAssistant subsets
+ - Ablations:
+   - Polymorphic MLP vs vanilla SwiGLU MLP with same depth/width
+   - With / without entropy-max routing
+
+ Contributions of evaluation scripts and reported metrics are very welcome.
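+
+ As a sketch, a typical lm-evaluation-harness (≥ 0.4) invocation would look like the following; this assumes the checkpoint is loadable through transformers’ `AutoModelForCausalLM` (e.g. with the custom architecture registered / remote code enabled), which has not been verified for this repo:
+
+ ```bash
+ pip install lm-eval
+ lm_eval \
+   --model hf \
+   --model_args pretrained=MhaWay/Veronica,trust_remote_code=True,dtype=bfloat16 \
+   --tasks arc_easy,arc_challenge,hellaswag,piqa \
+   --batch_size 8
+ ```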
320
321
 
322
  ---
323
 
324
+ ## 7. How to use
+
+ ### 7.1 Loading from code
+
+ If you’re using the Veronica codebase directly:
+
+ ```python
+ from veronica import VeronicaConfig, VeronicaForCausalLM
+
+ cfg = VeronicaConfig(
+     n_layer=24,
+     num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
+ )
+ model = VeronicaForCausalLM(cfg)
+ model.eval()
+ ```
+
+ You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
+
+ ### 7.2 Simple generation example
+
+ ```python
+ from transformers import AutoTokenizer
+ from veronica import VeronicaForCausalLM, VeronicaConfig
+
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
+ config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
+ model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)
+
+ prompt = "The theory of relativity states that"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=64,
+     temperature=0.7,
+     top_p=0.9,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ > Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
362
+
363
+
364
+
365
 
366
  ---
367
 
368
+ ## 8. Extensibility: adding new branches
+
+ One motivation for polymorphic MLPs is incremental expansion:
+ you can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:
+
+ - Expanding `num_funcs`
+ - Initializing the new branch + router output slice
+ - Running a short fine-tune with:
+   - Router + new branch trainable
+   - Optionally freezing the rest of the backbone during warmup
+
+ The repository includes utilities and example code for:
+
+ - Adding a new branch type
+ - Copying router weights and initializing the new column
+ - Scheduling a short specialization fine-tune
+
+ For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.
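+
+ A condensed sketch of the expansion flow, mirroring the example code in the repository (the checkpoint path is illustrative, and the `blocks` / `mlp.router` layout and `ignore_mismatched_sizes` usage are assumptions taken from that example):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from veronica import VeronicaConfig, VeronicaForCausalLM
+
+ old_dir = "runs/veronica-pretrain-24L/checkpoint-60000"  # illustrative path
+ cfg = VeronicaConfig.from_pretrained(old_dir)
+ old_funcs = cfg.num_funcs
+ cfg.num_funcs = old_funcs + 1  # add one branch
+
+ model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
+
+ # Initialize only the new router column; existing columns keep their weights.
+ for blk in model.blocks:
+     lin = blk.mlp.router[-1]  # final Linear of the router
+     with torch.no_grad():
+         nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
+         if lin.bias is not None:
+             nn.init.zeros_(lin.bias[old_funcs:])
+
+ # Warmup: train only the new branch and the router's final layer, freeze everything else.
+ for name, p in model.named_parameters():
+     p.requires_grad = (f"funcs.{old_funcs}" in name) or ("router.2" in name)
+ ```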
397
+
398
 
399
  ---
400
 
401
+ ## 9. Limitations & risks
+
+ This model:
+
+ - May generate inaccurate or nonsensical text
+ - May reproduce biases present in the underlying datasets
+ - Is not instruction-tuned:
+   - Does not follow natural-language instructions reliably
+   - Can ignore prompts, hallucinate, or switch topics
+ - Has no safety layer:
+   - No explicit filtering of harmful/toxic content
+   - No RLHF / preference optimization
+
+ Do not use Veronica-Polymorphic for:
+
+ - Safety-critical systems
+ - Medical, legal, or financial advice
+ - Content moderation without extensive additional work
+ - Any setting where unfiltered, biased generations would cause harm
433
+
434
+
435
 
436
  ---
437
 
438
+ ## 10. Roadmap
+
+ Planned / desired directions:
+
+ | Version | Goal                                                |
+ |---------|-----------------------------------------------------|
+ | v0.1    | Core polymorphic MLP + tests                        |
+ | v0.2    | Stable router schedules + logging                   |
+ | v0.3    | Configurable attention variants / FlashAttention    |
+ | v0.4    | Public evaluation scripts (lm-eval-harness)         |
+ | v0.5    | Reference instruction-tuned variant                 |
+ | v0.6    | Example specialization branches (e.g. translation)  |
+
+ Community PRs are welcome, especially for:
+
+ - Evaluation & ablations vs vanilla MLP baselines
+ - New branch types and routing strategies
+ - Practical recipes for SFT / alignment on top of Veronica
459
+
460
+
461
 
462
  ---
463
 
464
+ ## 11. License
465
+
466
+ This model and code are released under the Apache-2.0 license.
467
+
468
 
469
  ---
470
 
471
+ ## 12. Citation
+
+ If you use Veronica-Polymorphic in your work, please cite:
+
+ ```bibtex
+ @misc{veronica-2025,
+   title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
+   author       = {Emanuele D'Angelo},
+   year         = {2025},
+   howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
+ }
+ ```
481
 
 
 
482
 
483
  ---
484
 
485
+ ## 13. Acknowledgments
+
+ - Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
+ - Dataset mixture ratios guided by codelion’s DataComp LM work.
+ - RoPE implementation adapted from GPT-NeoX-style implementations.