Commit 064436c (verified) by MhaWay · parent 64c6c91 · Update README.md

Files changed (1): README.md (+373 −26)
---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-24L (551M)
  results: []
---

# Veronica-Polymorphic

**Veronica-Polymorphic** is a decoder-only transformer featuring a **polymorphic MLP layer**: each token is processed by a soft mixture of specialized branches (SwiGLU, GLU, depthwise causal convolution) under an entropy-regularized router. The design enables adaptive capacity, incremental expansion (adding new branches post-pretrain), and targeted specialization (e.g. translation modules) without full retraining from scratch.

## TL;DR

| Feature | Description |
|---------|-------------|
| Architecture | 24-layer causal Transformer (RoPE, untied embeddings) |
| Polymorphic MLP | Soft routing over 3 base branches (extensible) |
| Routing control | Temperature schedule + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional encoding | Rotary (RoPE, θ=10,000) |
| Dataset mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb‑Edu‑1B 20% |
| Expansion | Add new branches (e.g. translation) via lightweight migration + fine-tune |

---

## Installation

```bash
pip install -e .
```

## Quickstart

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```

Generation example:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer
prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Architecture Overview

### High Level

```
Input Embeddings → [Block × N]
Block: Pre-LN → Multi-Head Self-Attention (RoPE) → Pre-LN → Polymorphic MLP (Routing + Branch Fusion) → Residual
Untied LM Head
```
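
The wiring above can be sketched in PyTorch. This is an illustrative stand-in, not the repository's code: attention uses plain `nn.MultiheadAttention` (no RoPE), and the polymorphic MLP is reduced to an ordinary MLP placeholder.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, h, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(h)
        # Stand-in attention; the real model applies RoPE to Q/K.
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(h)
        # Placeholder for the polymorphic MLP described below.
        self.mlp = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):
        B, T, _ = x.shape
        # Boolean causal mask: True = position may NOT attend.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=causal)
        x = x + a                          # residual 1
        return x + self.mlp(self.ln2(x))  # residual 2

blk = Block(h=48, n_heads=4)
print(blk(torch.randn(2, 6, 48)).shape)  # torch.Size([2, 6, 48])
```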

## Dataset Citations

If you use these datasets or this composition, please cite:

```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

Related collection and datasets:
- codelion pre-training dataset samples: https://huggingface.co/collections/codelion/pre-training-dataset-samples
- codelion/dclm-baseline-1B: https://huggingface.co/datasets/codelion/dclm-baseline-1B
- codelion/finepdfs-1B: https://huggingface.co/datasets/codelion/finepdfs-1B

---

### Polymorphic MLP

Per token and layer:

```
router_logits = Router(x)              # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branch_i(x)
```

Routing is stabilized by:
- **Temperature schedule** (high τ early → softer mixing)
- **Entropy-max auxiliary loss** (the router's entropy is subtracted from the total loss, so maximizing it is rewarded)
- Optional **forcing** during warmup to guarantee gradient flow to new branches

### Branch Types

| Branch | Purpose | Structure |
|--------|---------|-----------|
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
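
Together with the routing scheme above, the three branch types can be sketched as follows. Class names and sizes are illustrative, not the repository's actual implementation, and the conv branch omits the expand/contract projections for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    def __init__(self, h, mult=4.0):
        super().__init__()
        inner = int(h * mult)
        self.up = nn.Linear(h, 2 * inner)  # up-project 2x, then split
        self.down = nn.Linear(inner, h)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * g)    # SiLU-gated

class GLUBranch(nn.Module):
    def __init__(self, h, mult=4.0):
        super().__init__()
        inner = int(h * mult)
        self.up = nn.Linear(h, 2 * inner)
        self.down = nn.Linear(inner, h)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(a * torch.sigmoid(g))  # sigmoid-gated

class DepthwiseConvBranch(nn.Module):
    def __init__(self, h, k=3):
        super().__init__()
        self.conv = nn.Conv1d(h, h, k, groups=h)  # depthwise
        self.k = k
    def forward(self, x):                 # x: (B, T, H)
        y = x.transpose(1, 2)             # (B, H, T)
        y = F.pad(y, (self.k - 1, 0))     # left-pad → causal
        return self.conv(y).transpose(1, 2)

class PolymorphicMLP(nn.Module):
    def __init__(self, h, tau=1.6):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(h, h), nn.GELU(), nn.Linear(h, 3))
        self.funcs = nn.ModuleList([SwiGLUBranch(h), GLUBranch(h), DepthwiseConvBranch(h)])
        self.tau = tau
    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)    # (B, T, 3)
        outs = torch.stack([f(x) for f in self.funcs], dim=-1)  # (B, T, H, 3)
        return (outs * alpha.unsqueeze(2)).sum(-1)              # fused output

mlp = PolymorphicMLP(h=64)
print(mlp(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```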

### Positional Encoding

Rotary embeddings (RoPE) are applied to the Q/K heads with cached cos/sin tables; there are no learned absolute positions.

### Stability Choices

| Mechanism | Rationale |
|-----------|-----------|
| FP32 LayerNorm | Prevents BF16 precision drift |
| Entropy-max aux loss | Avoids early router collapse |
| High initial τ | Encourages exploration across branches |
| Gradient checkpointing | Memory efficiency at depth |

---

## Dataset Mixture (codelion / DataComp inspired)

Training uses a curated blend guided by open mixture studies:

| Source | Share | Notes |
|--------|-------|-------|
| FinePDFs‑1B | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline‑1B | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu‑1B | 20% | Educational domain for structured explanatory patterns |

Notes:
- The codelion collection (https://huggingface.co/collections/codelion/pre-training-dataset-samples) aggregates the additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total token target (example): ~60B (adjustable). The composition balances semantic density (FinePDFs) against generality (DCLM), echoing codelion's optimal-ratio analyses.
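
At the document level, the 50/30/20 blend amounts to probability-weighted sampling over the three sources; a minimal sketch (labels are placeholders, and a real pipeline would stream from the actual shards rather than sample labels):

```python
import random

MIX = {"finepdfs": 0.50, "dclm": 0.30, "fineweb_edu": 0.20}

def sample_sources(n_docs: int, seed: int = 0):
    """Draw dataset labels i.i.d. according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIX)
    return rng.choices(names, weights=[MIX[k] for k in names], k=n_docs)

counts = {k: 0 for k in MIX}
for name in sample_sources(10_000):
    counts[name] += 1
print({k: round(v / 10_000, 2) for k, v in counts.items()})
# roughly {'finepdfs': 0.5, 'dclm': 0.3, 'fineweb_edu': 0.2}
```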

---

## Training Setup

| Hyperparameter | Value (example) |
|----------------|-----------------|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad accumulation | 8 (effective batch 32) |
| LR | 1.2e-4, cosine decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max seq len | 512 (curriculum to 2048) |
| Router τ | 1.6 → 1.1 (frozen for first 4k steps) |
| Aux weight λ | 0.005 → 0.012 |
| Router forcing | 5% prob for first 3k steps |
| Rep penalty (α) | 0.05 (smoke-test quality) |
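
The τ and λ schedules in the table can be implemented as simple step-indexed interpolations; a sketch under the assumption that both anneal linearly after the freeze window (the exact schedule shape is not specified here):

```python
def router_tau(step, total, start=1.6, end=1.1, freeze=4000):
    """Hold tau at `start` during the freeze window, then anneal linearly to `end`."""
    if step <= freeze:
        return start
    frac = min(1.0, (step - freeze) / max(1, total - freeze))
    return start + frac * (end - start)

def aux_weight(step, total, start=0.005, end=0.012):
    """Linearly ramp the entropy-aux weight over training."""
    frac = min(1.0, step / max(1, total))
    return start + frac * (end - start)

print(router_tau(0, 60_000), round(router_tau(60_000, 60_000), 3))  # 1.6 1.1
print(round(aux_weight(30_000, 60_000), 4))                         # 0.0085
```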

Launch:

```bash
# Note: configs/veronica-pretrain-12L.json is a legacy name; it configures 24 layers.
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-12L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 512 \
  --router_tau_start 1.6 --router_tau_end 1.1 --router_tau_freeze_steps 4000 \
  --router_aux_start 0.005 --router_aux_end 0.012 \
  --router_force_prob 0.05 --router_force_warmup_steps 3000 \
  --rep_alpha 0.05
```

---

## Router Health Metrics

Monitor log lines of the form:

```
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

Targets:
- `entropy_norm ≥ 0.75` through the first 5k steps
- No branch persistently below 15% usage (healthy diversity)
- `entropy_norm ≥ 0.65` maintained throughout training
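
`entropy_norm` can be computed as the Shannon entropy of the routing weights normalized by `log(num_branches)`, so 1.0 means perfectly uniform branch usage; a dependency-free sketch (the repository's exact definition may differ, e.g. averaging per-token entropies):

```python
import math

def entropy_norm(alpha):
    """Normalized entropy of a routing distribution alpha (entries sum to 1)."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

print(round(entropy_norm([1/3, 1/3, 1/3]), 3))   # 1.0 (uniform → maximal diversity)
print(round(entropy_norm([0.98, 0.01, 0.01]), 3))  # collapsed router → near 0
```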

---

## Incremental Expansion (Add a New Branch Post‑Pretrain)

Goal: increase capacity or add a specialization (e.g. translation) without a full restart.

### Steps

1. **Load the original checkpoint + config**:
   ```python
   cfg = VeronicaConfig.from_pretrained(old_dir)
   old_funcs = cfg.num_funcs
   cfg.num_funcs = old_funcs + 1  # adding one branch
   model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
   ```
2. **Implement the new branch class** (see the Translation branch below) and extend the `PolymorphicMLP` construction.
3. **Copy the existing router weights** and initialize the new column small:
   ```python
   import torch
   import torch.nn as nn

   for blk in model.blocks:
       lin = blk.mlp.router[-1]  # final Linear of the router
       with torch.no_grad():
           # Existing weights remain; only the new slice is initialized.
           nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
           if lin.bias is not None:
               nn.init.zeros_(lin.bias[old_funcs:])
   ```
4. **Freeze old branches & attention** for warmup:
   ```python
   for name, p in model.named_parameters():
       # Train only the new branch and the router's final layer.
       if f"funcs.{old_funcs}" in name or "router.2" in name:
           p.requires_grad = True
       else:
           p.requires_grad = False
   ```
5. **High τ + light forcing** (0–1k steps): `router_tau_start=1.8`, `router_force_prob≈0.15`.
6. **Blend phase** (1–3k steps): unfreeze the old branches, lower τ → 1.2, raise the aux weight to a mid value (e.g. 0.006).
7. **Stabilize**: restore the standard schedule (τ → 1.0, aux → 0.01) and disable forcing.

### Recommended Minimal Fine-Tune Command

```bash
# expanded-config.json is the config with the updated num_funcs.
python scripts/train_veronica.py \
  --config expanded-config.json \
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200
```

---

## Translation Specialization Branch

Add a branch focused on cross-lingual adaptation without retraining the entire backbone.

### Design Goals

| Requirement | Implementation Choice |
|-------------|-----------------------|
| Lightweight | Low-rank adapters + language conditioning |
| Reusable | Shares the main hidden size; no separate encoder |
| Controllable | Can be forced via `force_func` for targeted tuning |

### Example Branch Implementation

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
        super().__init__()
        self.rank = rank
        self.lang_embed = nn.Embedding(num_langs, hidden_size)
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, inner)
        self.down = nn.Linear(inner, hidden_size)
        # Low-rank adapters
        self.A = nn.Linear(hidden_size, rank, bias=False)
        self.B = nn.Linear(rank, hidden_size, bias=False)
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
        if lang_ids is not None:
            if lang_ids.dim() == 1:  # broadcast sentence-level conditioning
                lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
            else:
                lang_vec = self.lang_embed(lang_ids)  # (B, T, H)
            x = x + lang_vec
        h = self.up(x)
        h = F.gelu(h)
        h = self.down(h)
        # Adapter residual
        a = self.A(x)
        a = F.gelu(a)
        a = self.B(a)
        g = torch.sigmoid(self.gate(x))  # (B, T, 1)
        return h + g * a
```

### Integrate Into `PolymorphicMLP`

Inside branch construction:

```python
if num_funcs >= 4:
    funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))
```

### Passing Language IDs

- Add an optional `lang_ids` argument to the model's forward signature.
- Call branches that expect it as `func(x, lang_ids=lang_ids)`; other branches ignore the argument.
- For multilingual fine-tuning, prepend special language tokens or maintain a side tensor of language indices.
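
One dependency-free way to let branches with mixed signatures coexist is to inspect each branch's signature and forward `lang_ids` only where it is accepted; a sketch with toy callables (the repository may instead use `**kwargs` or explicit flags):

```python
import inspect

def call_branch(func, x, lang_ids=None):
    """Forward lang_ids only to branches whose signature accepts it."""
    params = inspect.signature(func).parameters
    if lang_ids is not None and "lang_ids" in params:
        return func(x, lang_ids=lang_ids)
    return func(x)

# Toy branches standing in for nn.Module forwards:
plain = lambda x: x + 1
def translation(x, lang_ids=None):
    return x + (0 if lang_ids is None else lang_ids)

print(call_branch(plain, 10))                    # 11
print(call_branch(translation, 10, lang_ids=5))  # 15
```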

### Fine-Tuning Strategy

1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, an OSCAR subset).
2. Initially freeze the base transformer + existing branches.
3. Force the translation branch (`force_func = translation_index`) for exploratory steps.
4. Gradually unfreeze attention + other branches for joint adaptation.
5. Evaluate BLEU / COMET against the baseline; adjust `rank` / `mlp_mult` if underfitting.
316
+
317
+ ---
318
+
319
+ ## Evaluation & Monitoring
320
+ | Metric | Purpose |
321
+ |--------|---------|
322
+ | CE / PPL | Language modeling convergence |
323
+ | Router Entropy | Diversity of branch usage |
324
+ | Alpha Distribution | Detect collapse or dominance |
325
+ | Translation BLEU (if added) | Cross-lingual quality |
326
+
327
+ ---
328
+
329
+ ## Limitations
330
+ | Area | Limitation |
331
+ |------|------------|
332
+ | Alignment | Base LM (no RLHF / instruction tuning) |
333
+ | Multilingual | Requires added translation branch + fine‑tune |
334
+ | Safety | No filtering; may reproduce dataset biases |
335
+ | Interpretability | Router decisions not fully explainable |

---

## Roadmap

Current status: between v0.2 and v0.3.

| Version | Goal |
|---------|------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Router logging + entropy regularization |
| v0.3 | Channel attention option |
| v0.4 | FlashAttention integration |
| v0.5 | Expansion utilities (branch migration helpers) |
| v0.6 | Translation branch reference implementation |
349
+ ---
350
+
351
+ ## Contributing
352
+ PRs welcome for: new branch types, expansion helpers, multilingual adapters, evaluation scripts.
353
+
354
+ ---
355
+
356
+ ## License
357
+ Apache-2.0
358
+
359
+ ---
360
+
361
+ ## Citation
362
+ ```bibtex
363
+ @misc{veronica-2025,
364
+ title={Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
365
+ author={Emanuele D'Angelo|GG-Ally},
366
+ year={2025},
367
+ howpublished={\url{https://huggingface.co/MhaWay/Veronica}}
368
+ }
369
+ ```

---

## Acknowledgments

- Mixture & routing concepts inspired by Switch Transformer, GLaM, and the MoE literature.
- Dataset composition ratios guided by codelion's DataComp LM mixture studies.
- RoPE adaptation referencing GPT-NeoX implementation details.

---

## FAQ

**Q: Why entropy-max instead of a load-balancing penalty?**
To avoid premature specialization and keep new branches trainable; the aux weight schedule increases over training.

**Q: Can I add many branches at once?**
Incremental growth (3 → 4 → 5) is recommended to prevent branch starvation.

**Q: How do I specialize for translation?**
Add `TranslationBranch`, warm up with forced routing, then run a blended fine-tune on multilingual data.

**Q: Does expansion erase prior knowledge?**
No; existing branches retain their weights. The router + new branch adapt during a short fine-tune.

---

Happy branching! 🌿