---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
  results: []
---

# Veronica-Polymorphic 24L (551M)

Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**:  
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.

The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.

> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.  
> Do **not** treat this as a production-ready model.

---

## 1. TL;DR

| Aspect              | Value / Description                                            |
|---------------------|----------------------------------------------------------------|
| Type                | Decoder-only causal LM                                         |
| Params              | ~551M                                                          |
| Layers              | 24                                                             |
| Hidden size         | 768                                                            |
| Heads               | 12                                                             |
| Positional encoding | RoPE (rotary)                                                 |
| MLP                 | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block          |
| Routing             | Entropy-regularized soft routing, depth-scaled temperature    |
| Precision           | bf16 weights, fp32 LayerNorm                                  |
| Context length      | 1024 → 2048 (curriculum; 512 discouraged on 24L)              |
| Data mix            | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20%      |
| Intended use        | Research on routing / branch specialization                   |
| Not included        | Instruction tuning, RLHF, safety fine-tuning, eval suite      |

---

## 2. Intended use & scope

### Primary intent

This checkpoint is meant for:

- Researchers interested in:
  - **Mixture-of-branches / soft routing** in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula

### Out of scope

This model is **not** designed or evaluated (yet) for:

- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation

---

## 3. Model details

### 3.1 Architecture (high-level)

```
Input tokens
  ↓
Token embeddings (RoPE applied to Q/K in attention)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```

Key design choices (a minimal block sketch follows the list):

- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
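
For orientation, here is a minimal PyTorch sketch of the pre-LN block structure described above. It is a simplified stand-in, not the actual Veronica implementation: the class and argument names are illustrative, and RoPE inside the attention is omitted for brevity.

```
import torch
import torch.nn as nn

class SketchVeronicaBlock(nn.Module):
    """Simplified pre-LN block: attention + polymorphic MLP, each with a residual.
    RoPE on Q/K is omitted here; see the repo for the real attention module."""
    def __init__(self, d_model=768, n_heads=12, polymorphic_mlp=None):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.mlp = polymorphic_mlp  # router + branches (sketched in section 3.2)

    def forward(self, x, attn_mask=None):
        h = self.ln_attn(x)                                   # pre-LN before attention
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                                             # residual 1
        x = x + self.mlp(self.ln_mlp(x))                      # pre-LN MLP + residual 2
        return x
```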


### 3.2 Polymorphic MLP & routing

Each block’s MLP is replaced by a polymorphic MLP:

```
router_logits = Router(x)      # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
  SwiGLU(x),
  GLU(x),
  DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```

Branches:

| Branch | Role | Sketch |
|--------|------|--------|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |


Routing controls:

- Temperature schedule `tau_start → tau_end` (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters: router temperature and aux-loss weight are scaled by ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models



The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
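
To make the routing concrete, below is a self-contained PyTorch sketch of a polymorphic MLP with the three branches and the entropy-max auxiliary loss. The module names, router shape, and dimensions are illustrative assumptions rather than the exact Veronica code; the aux loss is exposed as an attribute so a training loop can add it with its scheduled weight. (Per the depth-scaling rule above, with depth_ratio = 24/12 the 24L run uses roughly √2 × the 12L temperature and aux weight.)

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    def __init__(self, d, mult=4):
        super().__init__()
        self.up, self.down = nn.Linear(d, 2 * mult * d), nn.Linear(mult * d, d)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * g)

class GLUBranch(nn.Module):
    def __init__(self, d, mult=4):
        super().__init__()
        self.up, self.down = nn.Linear(d, 2 * mult * d), nn.Linear(mult * d, d)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(torch.sigmoid(a) * g)

class DepthwiseConvBranch(nn.Module):
    def __init__(self, d, k=3):
        super().__init__()
        self.conv = nn.Conv1d(d, d, k, groups=d)    # depthwise conv over time
        self.pad = k - 1                            # left padding keeps it causal
        self.proj = nn.Linear(d, d)
    def forward(self, x):                           # x: (B, T, d)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))
        return self.proj(F.gelu(self.conv(h).transpose(1, 2)))

class SketchPolymorphicMLP(nn.Module):
    def __init__(self, d=768, tau=1.4):
        super().__init__()
        self.branches = nn.ModuleList([SwiGLUBranch(d), GLUBranch(d), DepthwiseConvBranch(d)])
        self.router = nn.Sequential(nn.Linear(d, d // 4), nn.GELU(),
                                    nn.Linear(d // 4, len(self.branches)))
        self.tau = tau                              # scheduled tau_start -> tau_end in training
        self.aux_loss = torch.tensor(0.0)

    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)        # (B, T, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)   # (B, T, d, n_branches)
        y = (outs * alpha.unsqueeze(-2)).sum(dim=-1)                # soft per-token blend
        # Entropy-max aux-loss: minimizing -H(alpha) pushes routing away from collapse.
        entropy = -(alpha * alpha.clamp_min(1e-9).log()).sum(-1).mean()
        self.aux_loss = -entropy
        return y
```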


---

## 4. Training data

The pre-training data follows the codelion / DataComp-LM mixture guidelines:

| Dataset | Share | Description |
|---------|-------|-------------|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |


Target token budget for this configuration: ~60B tokens (example setting).

For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
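
As a sketch of how the 50/30/20 mixture could be reproduced with the 🤗 `datasets` library (assuming each dataset exposes a `train` split; the actual training repo may assemble the mix differently):

```
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm     = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb  = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],   # FinePDFs 50% • DCLM 30% • FineWeb-Edu 20%
    seed=42,
    stopping_strategy="all_exhausted",
)
```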


If you reuse this mixture, please also cite:

```
@article{sharma2025billion,
  title   = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author  = {Sharma, Asankhaya},
  year    = {2025},
  url     = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```


---

## 5. Training procedure

> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.



### 5.1 Core hyperparameters

| Hyperparameter | Value / Notes |
|----------------|---------------|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |


Example launch:

```
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```

### 5.2 Context-length curriculum & “512-token trap”

Empirical finding on 24-layer models:

- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:

```
Steps 0–20k   : 1024 tokens
Steps 20k–60k : 2048 tokens
```

For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
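
A tiny illustrative helper for the 24L schedule (not part of the training script; the step boundary is the reference value above):

```
def seq_len_for_step(step: int, switch_step: int = 20_000) -> int:
    """24L curriculum: 1024 tokens until `switch_step`, then 2048."""
    return 1024 if step < switch_step else 2048
```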

### 5.3 Router health during training

Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```

Healthy targets (rough guideline):

| Phase | Steps | Entropy (norm) | Min branch share |
|-------|-------|----------------|------------------|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |


Collapsed routing typically shows up as:

- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck below 5–10%


The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out of the box.
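
The `entropy_norm` value can be read as the routing entropy divided by its maximum, so 1.0 means perfectly uniform branch usage. A sketch of that metric (the training script's exact implementation may differ):

```
import math
import torch

def router_entropy_norm(alpha: torch.Tensor) -> float:
    """alpha: (..., n_branches) soft routing weights logged by a block."""
    n = alpha.shape[-1]
    h = -(alpha * alpha.clamp_min(1e-9).log()).sum(-1)   # per-token entropy (nats)
    return (h / math.log(n)).mean().item()               # normalized to [0, 1]
```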


---

## 6. Evaluation

### 6.1 Current evaluation status

At the time of this release:

- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.



Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.

> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.



### 6.2 Planned evaluation (suggested)

If you adopt or extend Veronica-Polymorphic, consider running:

- lm-eval-harness on: `mmlu`, `arc_challenge`, `arc_easy`, `hellaswag`, `piqa`
- Instruction / SFT (if you fine-tune): Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with same depth/width
  - With / without entropy-max routing



Contributions of evaluation scripts and reported metrics are very welcome.
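
For reference, a hedged sketch of what such a run could look like with the lm-eval-harness Python API (v0.4+). Because Veronica is a custom architecture, you may first need to register the config/model with `transformers` (or wrap the model in a custom harness `LM` class) before this works as written:

```
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "hellaswag", "piqa"],
    batch_size=8,
)
print(results["results"])
```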


---

## 7. How to use

### 7.1 Loading from code

If you’re using the Veronica codebase directly:

```
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```

You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.

### 7.2 Simple generation example

```
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.




---

## 8. Extensibility: adding new branches

One motivation for polymorphic MLPs is incremental expansion. You can increase capacity or add a specialized branch (e.g. a translation, code, or domain-specific MLP) by:

1. Expanding `num_funcs`
2. Initializing the new branch and the corresponding router output slice (sketched below)
3. Running a short fine-tune with:
   - the router and the new branch trainable
   - optionally the rest of the backbone frozen during warmup
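
A minimal sketch of steps 1–3, assuming the `SketchPolymorphicMLP` layout from section 3.2; the repository's own utilities may handle this differently:

```
import torch
import torch.nn as nn

def add_branch(mlp, new_branch: nn.Module):
    """Append a branch and grow the router's final Linear by one output."""
    mlp.branches.append(new_branch)                 # e.g. a translation-oriented MLP
    old = mlp.router[-1]                            # final Linear: hidden -> n_branches
    new = nn.Linear(old.in_features, old.out_features + 1)
    with torch.no_grad():
        new.weight[:-1].copy_(old.weight)           # preserve routing for existing branches
        new.bias[:-1].copy_(old.bias)
        new.weight[-1].zero_()
        new.bias[-1].fill_(-2.0)                    # new branch starts with a small share
    mlp.router[-1] = new

    # Step 3 (warmup): train only the router and the new branch at first.
    for p in mlp.parameters():
        p.requires_grad_(False)
    for p in [*mlp.router.parameters(), *mlp.branches[-1].parameters()]:
        p.requires_grad_(True)
```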




The repository includes utilities and example code for:

- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune


For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.


---

## 9. Limitations & risks

This model:

- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - does not follow natural-language instructions reliably
  - can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - no explicit filtering of harmful/toxic content
  - no RLHF / preference optimization



Do not use Veronica-Polymorphic for:

- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm



---

## 10. Roadmap

Planned / desired directions:

| Version | Goal |
|---------|------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |


Community PRs are welcome, especially for:

- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica



---

## 11. License

This model and code are released under the Apache-2.0 license.


---

## 12. Citation

If you use Veronica-Polymorphic in your work, please cite:

```
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```

---

## 13. Acknowledgments

- Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp-LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.