---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
- research
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-Polymorphic 24L (551M)
results: []
---
# Veronica-Polymorphic 24L (551M)
Veronica-Polymorphic is a **decoder-only language model (≈551M params)** with a **polymorphic MLP**:
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a **soft router** that blends them per-token.
The goal is **adaptive capacity** and **incremental expansion** (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.
> ⚠️ **Status:** research preview, **pre-training only**, **no external benchmarks yet**.
> Do **not** treat this as a production-ready model.
---
## 1. TL;DR
| Aspect | Value / Description |
|---------------------|----------------------------------------------------------------|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |
---
## 2. Intended use & scope
### Primary intent
This checkpoint is meant for:
- Researchers interested in:
- **Mixture-of-branches / soft routing** in MLPs
- Stability of routers on deeper (24L) architectures
- Incremental model growth via **adding branches post-pretrain**
- Practitioners who want a **small, hackable codebase** to experiment with:
- Polymorphic MLPs
- Entropy-regularized routing
- Context-length curricula
### Out of scope
This model is **not** designed or evaluated (yet) for:
- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation
---
## 3. Model details
### 3.1 Architecture (high-level)
```
Input tokens
  ↓
Token & position embeddings (RoPE on Q/K)
  ↓
[ VeronicaBlock × 24 ]
    VeronicaBlock:
      x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
        → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```
Key design choices:
- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
### 3.2 Polymorphic MLP & routing
Each block’s MLP is replaced by a polymorphic MLP:
```python
router_logits = Router(x)   # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [
    SwiGLU(x),
    GLU(x),
    DepthwiseConvMLP(x),
]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
Branches:

| Branch        | Role                           | Sketch                                            |
|---------------|--------------------------------|---------------------------------------------------|
| SwiGLU        | Default gated MLP              | Linear(up) → split → SiLU×gate → Linear(down)     |
| GLU           | Alternative gating dynamics    | Linear(up) → split → Sigmoid×gate → Linear(down)  |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP                 |

Routing controls:
- Temperature schedule `tau_start → tau_end` (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters: router temperature and aux-loss weight are scaled by ≈ √(depth_ratio) when going from shallower (12L) to deeper (24L) models
The key property is that routing remains soft: typical healthy distributions have a dominant branch (~55–65%) and minority branches (~15–25%) instead of hard one-hot selection.
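For intuition, here is a minimal PyTorch sketch of the soft router with the entropy-maximizing auxiliary loss. Class names (`PolymorphicMLP`, `TinyBranch`) and the branch internals are illustrative simplifications, not the repository's actual modules:
```python
# Minimal sketch (not the repository implementation): soft routing over three
# MLP branches plus an entropy-maximizing auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBranch(nn.Module):
    """Stand-in branch; the real model uses SwiGLU / GLU / depthwise-conv MLPs."""
    def __init__(self, d_model: int):
        super().__init__()
        self.up = nn.Linear(d_model, 4 * d_model)
        self.down = nn.Linear(4 * d_model, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class PolymorphicMLP(nn.Module):
    def __init__(self, d_model: int = 768, num_funcs: int = 3, tau: float = 2.2):
        super().__init__()
        # Router: Linear → GELU → Linear, as described above.
        self.router = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                    nn.Linear(d_model, num_funcs))
        self.branches = nn.ModuleList(TinyBranch(d_model) for _ in range(num_funcs))
        # tau is annealed tau_start → tau_end during training; on 24L it is also
        # scaled by roughly sqrt(24/12) ≈ 1.41 relative to a 12L run.
        self.tau = tau

    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)        # (B, T, num_funcs)
        outs = torch.stack([b(x) for b in self.branches], dim=-1)   # (B, T, D, num_funcs)
        y = (outs * alpha.unsqueeze(-2)).sum(dim=-1)                # soft blend per token
        # Entropy-max aux loss: negative normalized entropy, added to the LM loss
        # with a small weight so the router is pushed away from one-hot collapse.
        entropy = -(alpha * alpha.clamp_min(1e-9).log()).sum(-1).mean()
        entropy_norm = entropy / torch.log(torch.tensor(float(len(self.branches))))
        aux_loss = -entropy_norm
        return y, aux_loss

x = torch.randn(2, 16, 768)
y, aux = PolymorphicMLP()(x)
print(y.shape, float(aux))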
---
## 4. Training data
The pre-training data follows the codelion / DataComp LM mixture guidelines:

| Dataset                     | Share | Description                                      |
|-----------------------------|-------|--------------------------------------------------|
| codelion/finepdfs-1B        | 50%   | Technical/academic PDFs (high semantic density)  |
| codelion/dclm-baseline-1B   | 30%   | General web corpus baseline                      |
| codelion/fineweb-edu-1B     | 20%   | Educational / explanatory web data               |

Target token budget for this configuration: ~60B tokens (example setting).
For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
If you reuse this mixture, please also cite:
```
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
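For reference, one way to reproduce the 50/30/20 mixture with the Hugging Face `datasets` library; this is a sketch only, and the actual preprocessing/tokenization pipeline used for this checkpoint may differ:
```python
# Sketch: probabilistic 50/30/20 interleaving of the three corpora.
# Streaming keeps memory usage low; column names depend on each dataset.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)

# Peek at a few interleaved examples.
for i, example in enumerate(mix):
    if i >= 3:
        break
    print(list(example.keys()))
```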
---
## 5. Training procedure
> Note: the numbers below describe the reference run configuration used to train this checkpoint.
> You can adapt them for your own experiments.
### 5.1 Core hyperparameters

| Hyperparameter          | Value / Notes                   |
|-------------------------|---------------------------------|
| Layers                  | 24                              |
| Hidden size             | 768                             |
| Attention heads         | 12                              |
| MLP expansion           | 4×                              |
| Per-device batch size   | 4                               |
| Grad accumulation       | 8 (effective batch 32)          |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay  |
| Warmup                  | 10% of total steps              |
| Weight decay            | 0.01                            |
| Label smoothing         | 0.01                            |
| Precision               | bf16 + fp32 LayerNorm           |
| Max steps               | 60k (example target)            |
Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
### 5.2 Context-length curriculum & the “512-token trap”
Empirical finding on 24-layer models:
- Starting at 512 tokens caused router collapse around step ~3k: one branch dominated (>70%), entropy dropped, and the other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.

Recommended curriculum for 24L:
- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens

For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
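A minimal sketch of such a step-based curriculum; the helper name `seq_len_for_step` and the exact thresholds are illustrative, not part of the training script:
```python
# Sketch: step-based context-length curriculum for a 24L run.
def seq_len_for_step(step: int) -> int:
    """Return the training sequence length for a given optimizer step."""
    if step < 20_000:
        return 1024   # start directly at 1024 on deep (>=20L) models
    return 2048       # extend context for the remainder of training

assert seq_len_for_step(0) == 1024
assert seq_len_for_step(45_000) == 2048
```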
### 5.3 Router health during training
Training logs include entries like:
```
[router] alpha=[a0, a1, a2] entropy_norm=E
```
Healthy targets (rough guideline):

| Phase       | Steps  | Entropy (norm) | Min branch share |
|-------------|--------|----------------|------------------|
| Warmup      | 0–5k   | ≥ 0.90         | ≥ 0.25           |
| Post-freeze | 5k–10k | ≥ 0.75         | ≥ 0.12           |
| Stable      | 10k+   | ≥ 0.70         | ≥ 0.15           |

Collapsed routing typically shows up as:
- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%

The provided training script (`scripts/train_veronica.py`) implements the entropy-max aux-loss and router schedules out of the box.
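As an illustration, a small helper for turning averaged branch weights into the normalized entropy and minimum branch share used in the table above (a sketch, not part of the repository):
```python
# Sketch: compute normalized routing entropy and minimum branch share from the
# averaged alpha values that appear in the training logs.
import math

def router_health(alpha: list[float]) -> tuple[float, float]:
    """Return (normalized entropy, min branch share) for a routing distribution."""
    entropy = -sum(a * math.log(a) for a in alpha if a > 0)
    entropy_norm = entropy / math.log(len(alpha))
    return entropy_norm, min(alpha)

# Healthy-looking distribution: dominant branch ~60%, minority branches ~20%.
e, m = router_health([0.60, 0.22, 0.18])
print(f"entropy_norm={e:.2f} min_share={m:.2f}")  # ≈ 0.86, 0.18

# Collapsed distribution: one branch > 80%, entropy well below 0.65.
e, m = router_health([0.86, 0.09, 0.05])
print(f"entropy_norm={e:.2f} min_share={m:.2f}")
```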
---
## 6. Evaluation
### 6.1 Current evaluation status
At the time of this release:
- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
- Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.
> 🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
### 6.2 Planned evaluation (suggested)
If you adopt or extend Veronica-Polymorphic, consider running:
- lm-eval-harness on:
  - mmlu, arc_challenge, arc_easy, hellaswag, piqa
- Instruction / SFT (if you fine-tune):
  - Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with the same depth/width
  - With / without entropy-max routing

Contributions of evaluation scripts and reported metrics are very welcome.
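For example, once the checkpoint loads through the standard `transformers` causal-LM path, a zero-shot run might look like the following sketch (assuming lm-evaluation-harness ≥ 0.4; the exact API surface depends on the version you install):
```python
# Sketch: zero-shot evaluation with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica",
    tasks=["hellaswag", "piqa", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```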
---
## 7. How to use
### 7.1 Loading from code
If you’re using the Veronica codebase directly:
```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
You can also integrate via `transformers` if you register the config/model, or load the checkpoint from this repo if exported.
### 7.2 Simple generation example
```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
> Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
---
## 8. Extensibility: adding new branches
One motivation for polymorphic MLPs is incremental expansion:

You can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:
- Expanding `num_funcs`
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup

The repository includes utilities and example code for:
- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune

For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.
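For intuition, here is a minimal sketch of widening a router from 3 to 4 outputs and freezing everything except the router and the new branch. The function names are illustrative and do not mirror the repository's own expansion utilities; `mlp` is assumed to expose a `router` ending in a `Linear` and a `branches` `ModuleList`, as in the routing sketch above:
```python
# Sketch: add one router output column and freeze the backbone for a short
# specialization fine-tune. Not the repository implementation.
import torch
import torch.nn as nn

def expand_router(router_out: nn.Linear) -> nn.Linear:
    """Add one output column to the router head, preserving existing routing."""
    new = nn.Linear(router_out.in_features, router_out.out_features + 1)
    with torch.no_grad():
        new.weight[:-1].copy_(router_out.weight)
        new.bias[:-1].copy_(router_out.bias)
        new.weight[-1].zero_()
        new.bias[-1].fill_(-2.0)  # new branch starts with a small routing share
    return new

def add_branch(mlp: nn.Module, new_branch: nn.Module) -> nn.Module:
    mlp.branches.append(new_branch)
    mlp.router[-1] = expand_router(mlp.router[-1])
    # Freeze the backbone; train only the router and the new branch during warmup.
    for p in mlp.parameters():
        p.requires_grad = False
    for p in mlp.router.parameters():
        p.requires_grad = True
    for p in new_branch.parameters():
        p.requires_grad = True
    return mlp
```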
---
## 9. Limitations & risks
This model:
- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization

Do not use Veronica-Polymorphic for:
- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
---
## 10. Roadmap
Planned / desired directions:

| Version | Goal                                                |
|---------|-----------------------------------------------------|
| v0.1    | Core polymorphic MLP + tests                        |
| v0.2    | Stable router schedules + logging                   |
| v0.3    | Configurable attention variants / FlashAttention    |
| v0.4    | Public evaluation scripts (lm-eval-harness)         |
| v0.5    | Reference instruction-tuned variant                 |
| v0.6    | Example specialization branches (e.g. translation)  |

Community PRs are welcome, especially for:
- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
---
## 11. License
This model and code are released under the Apache-2.0 license.
---
## 12. Citation
If you use Veronica-Polymorphic in your work, please cite:
```
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
---
## 13. Acknowledgments
- Mixture / routing inspiration from Switch Transformer, GLaM, and the broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.