Commit 064436c (verified) by MhaWay · parent 64c6c91 · Update README.md

Files changed (1): README.md (+373 −26)
---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- veronica
- polymorphic-mlp
- mixture-of-branches
- entropy-regularized-routing
- decoder-only
- causal-lm
- rope
- expandable-architecture
pipeline_tag: text-generation
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
model-index:
- name: Veronica-24L (551M)
  results: []
---

# Veronica-Polymorphic

**Veronica-Polymorphic** is a decoder-only transformer featuring a **polymorphic MLP layer**: each token is processed by a soft mixture of specialized branches (SwiGLU, GLU, depthwise causal convolution) under an entropy-regularized router. The design enables adaptive capacity, incremental expansion (adding new branches post-pretrain), and targeted specialization (e.g. translation modules) without full retraining from scratch.

## TL;DR

| Feature | Description |
|---------|-------------|
| Architecture | 24-layer causal Transformer (RoPE, untied embeddings) |
| Polymorphic MLP | Soft routing over 3 base branches (extensible) |
| Routing control | Temperature schedule + entropy maximization |
| Precision | BF16 with FP32 LayerNorm for stability |
| Positional encoding | Rotary (RoPE, θ=10,000) |
| Dataset mix | FinePDFs‑1B 50% • DCLM Baseline‑1B 30% • FineWeb‑Edu‑1B 20% |
| Expansion | Add new branches (e.g. translation) via lightweight migration + fine-tune |

---

## Installation

```bash
pip install -e .
```

## Quickstart

```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(n_layer=24, num_funcs=3)  # base polymorphic setup
model = VeronicaForCausalLM(cfg)
```

Generation example:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # or your saved tokenizer
prompt = "The theory of relativity states that"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Architecture Overview

### High Level

```
Input Embeddings → [Block × N]
Block: Pre-LN → Multi-Head Self-Attention (RoPE) → Pre-LN → Polymorphic MLP (Routing + Branch Fusion) → Residual
Untied LM Head
```
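
The wiring above can be sketched in PyTorch. This is an illustrative stand-in, not the repository's code: attention uses plain `nn.MultiheadAttention` (no RoPE), and the polymorphic MLP is reduced to an ordinary MLP placeholder.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, h, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(h)
        # Stand-in attention; the real model applies RoPE to Q/K.
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(h)
        # Placeholder for the polymorphic MLP described below.
        self.mlp = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):
        B, T, _ = x.shape
        # Boolean causal mask: True = position may NOT attend.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=causal)
        x = x + a                          # residual 1
        return x + self.mlp(self.ln2(x))  # residual 2

blk = Block(h=48, n_heads=4)
print(blk(torch.randn(2, 6, 48)).shape)  # torch.Size([2, 6, 48])
```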

## Dataset Citations

If you use these datasets or this composition, please cite:

```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

Related collection and datasets:
- codelion pre-training dataset samples: https://huggingface.co/collections/codelion/pre-training-dataset-samples
- codelion/dclm-baseline-1B: https://huggingface.co/datasets/codelion/dclm-baseline-1B
- codelion/finepdfs-1B: https://huggingface.co/datasets/codelion/finepdfs-1B

---

### Polymorphic MLP

Per token and layer:

```
router_logits = Router(x)              # Linear → GELU → Linear
α = softmax(router_logits / τ)
branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]
output = Σ α_i * branch_i(x)
```

Routing is stabilized by:
- **Temperature schedule** (high τ early → softer mixing)
- **Entropy-max auxiliary loss** (the router's entropy is subtracted from the total loss, so maximizing it is rewarded)
- Optional **forcing** during warmup to guarantee gradient flow to new branches

### Branch Types

| Branch | Purpose | Structure |
|--------|---------|-----------|
| SwiGLU | Smooth gated MLP | Linear(up 2×) → split → SiLU × gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up 2×) → split → Sigmoid × gate → Linear(down) |
| DepthwiseConv | Local token patterns | Depthwise causal conv (k=3) → expand → GELU → contract |
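
Together with the routing scheme above, the three branch types can be sketched as follows. Class names and sizes are illustrative, not the repository's actual implementation, and the conv branch omits the expand/contract projections for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    def __init__(self, h, mult=4.0):
        super().__init__()
        inner = int(h * mult)
        self.up = nn.Linear(h, 2 * inner)  # up-project 2x, then split
        self.down = nn.Linear(inner, h)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(F.silu(a) * g)    # SiLU-gated

class GLUBranch(nn.Module):
    def __init__(self, h, mult=4.0):
        super().__init__()
        inner = int(h * mult)
        self.up = nn.Linear(h, 2 * inner)
        self.down = nn.Linear(inner, h)
    def forward(self, x):
        a, g = self.up(x).chunk(2, dim=-1)
        return self.down(a * torch.sigmoid(g))  # sigmoid-gated

class DepthwiseConvBranch(nn.Module):
    def __init__(self, h, k=3):
        super().__init__()
        self.conv = nn.Conv1d(h, h, k, groups=h)  # depthwise
        self.k = k
    def forward(self, x):                 # x: (B, T, H)
        y = x.transpose(1, 2)             # (B, H, T)
        y = F.pad(y, (self.k - 1, 0))     # left-pad → causal
        return self.conv(y).transpose(1, 2)

class PolymorphicMLP(nn.Module):
    def __init__(self, h, tau=1.6):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(h, h), nn.GELU(), nn.Linear(h, 3))
        self.funcs = nn.ModuleList([SwiGLUBranch(h), GLUBranch(h), DepthwiseConvBranch(h)])
        self.tau = tau
    def forward(self, x):
        alpha = F.softmax(self.router(x) / self.tau, dim=-1)    # (B, T, 3)
        outs = torch.stack([f(x) for f in self.funcs], dim=-1)  # (B, T, H, 3)
        return (outs * alpha.unsqueeze(2)).sum(-1)              # fused output

mlp = PolymorphicMLP(h=64)
print(mlp(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```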

### Positional Encoding

Rotary embeddings (RoPE) are applied to the Q/K heads with cached cos/sin tables; there are no learned absolute positions.

### Stability Choices

| Mechanism | Rationale |
|-----------|-----------|
| FP32 LayerNorm | Prevents BF16 precision drift |
| Entropy-max aux loss | Avoids early router collapse |
| High initial τ | Encourages exploration across branches |
| Gradient checkpointing | Memory efficiency at depth |

---

## Dataset Mixture (codelion / DataComp inspired)

Training uses a curated blend guided by open mixture studies:

| Source | Share | Notes |
|--------|-------|-------|
| FinePDFs‑1B | 50% | Technical & academic PDFs (higher semantic density) |
| DCLM Baseline‑1B | 30% | General web corpus (DataComp LM baseline) |
| FineWeb‑Edu‑1B | 20% | Educational domain for structured explanatory patterns |

Notes:
- The codelion collection (https://huggingface.co/collections/codelion/pre-training-dataset-samples) aggregates the additional samples (e.g., educational/web sources) used to complete the 50/30/20 composition.
- Please refer to each dataset's license/terms; FinePDFs is curated from public PDFs and is referenced, not redistributed here.

Total token target (example): ~60B (adjustable). The composition balances semantic density (FinePDFs) against generality (DCLM), echoing codelion's optimal-ratio analyses.
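
At the document level, the 50/30/20 blend amounts to probability-weighted sampling over the three sources; a minimal sketch (labels are placeholders, and a real pipeline would stream from the actual shards rather than sample labels):

```python
import random

MIX = {"finepdfs": 0.50, "dclm": 0.30, "fineweb_edu": 0.20}

def sample_sources(n_docs: int, seed: int = 0):
    """Draw dataset labels i.i.d. according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIX)
    return rng.choices(names, weights=[MIX[k] for k in names], k=n_docs)

counts = {k: 0 for k in MIX}
for name in sample_sources(10_000):
    counts[name] += 1
print({k: round(v / 10_000, 2) for k, v in counts.items()})
# roughly {'finepdfs': 0.5, 'dclm': 0.3, 'fineweb_edu': 0.2}
```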

---

## Training Setup

| Hyperparameter | Value (example) |
|----------------|-----------------|
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| MLP mult | 4.0 |
| Batch (per device) | 4 |
| Grad accumulation | 8 (effective batch 32) |
| LR | 1.2e-4, cosine decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max seq len | 512 (curriculum to 2048) |
| Router τ | 1.6 → 1.1 (frozen for first 4k steps) |
| Aux weight λ | 0.005 → 0.012 |
| Router forcing | 5% prob for first 3k steps |
| Rep penalty (α) | 0.05 (smoke-test quality) |
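
The τ and λ schedules in the table can be implemented as simple step-indexed interpolations; a sketch under the assumption that both anneal linearly after the freeze window (the exact schedule shape is not specified here):

```python
def router_tau(step, total, start=1.6, end=1.1, freeze=4000):
    """Hold tau at `start` during the freeze window, then anneal linearly to `end`."""
    if step <= freeze:
        return start
    frac = min(1.0, (step - freeze) / max(1, total - freeze))
    return start + frac * (end - start)

def aux_weight(step, total, start=0.005, end=0.012):
    """Linearly ramp the entropy-aux weight over training."""
    frac = min(1.0, step / max(1, total))
    return start + frac * (end - start)

print(router_tau(0, 60_000), round(router_tau(60_000, 60_000), 3))  # 1.6 1.1
print(round(aux_weight(30_000, 60_000), 4))                         # 0.0085
```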

Launch:

```bash
# Note: configs/veronica-pretrain-12L.json is a legacy name; it configures 24 layers.
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-12L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 512 \
  --router_tau_start 1.6 --router_tau_end 1.1 --router_tau_freeze_steps 4000 \
  --router_aux_start 0.005 --router_aux_end 0.012 \
  --router_force_prob 0.05 --router_force_warmup_steps 3000 \
  --rep_alpha 0.05
```

---

## Router Health Metrics

Monitor log lines of the form:

```
[router] alpha=[a0, a1, a2, ...] entropy_norm=E
```

Targets:
- `entropy_norm ≥ 0.75` through the first 5k steps
- No branch persistently below 15% usage (healthy diversity)
- `entropy_norm ≥ 0.65` maintained throughout training
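
`entropy_norm` can be computed as the Shannon entropy of the routing weights normalized by `log(num_branches)`, so 1.0 means perfectly uniform branch usage; a dependency-free sketch (the repository's exact definition may differ, e.g. averaging per-token entropies):

```python
import math

def entropy_norm(alpha):
    """Normalized entropy of a routing distribution alpha (entries sum to 1)."""
    h = -sum(a * math.log(a) for a in alpha if a > 0)
    return h / math.log(len(alpha))

print(round(entropy_norm([1/3, 1/3, 1/3]), 3))   # 1.0 (uniform → maximal diversity)
print(round(entropy_norm([0.98, 0.01, 0.01]), 3))  # collapsed router → near 0
```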

---

## Incremental Expansion (Add a New Branch Post‑Pretrain)

Goal: increase capacity or add a specialization (e.g. translation) without a full restart.

### Steps

1. **Load the original checkpoint + config**:
   ```python
   cfg = VeronicaConfig.from_pretrained(old_dir)
   old_funcs = cfg.num_funcs
   cfg.num_funcs = old_funcs + 1  # adding one branch
   model = VeronicaForCausalLM.from_pretrained(old_dir, config=cfg, ignore_mismatched_sizes=True)
   ```
2. **Implement the new branch class** (see the Translation branch below) and extend the `PolymorphicMLP` construction.
3. **Copy the existing router weights** and initialize the new column small:
   ```python
   import torch
   import torch.nn as nn

   for blk in model.blocks:
       lin = blk.mlp.router[-1]  # final Linear of the router
       with torch.no_grad():
           # Existing weights remain; only the new slice is initialized.
           nn.init.normal_(lin.weight[old_funcs:], mean=0.0, std=0.02)
           if lin.bias is not None:
               nn.init.zeros_(lin.bias[old_funcs:])
   ```
4. **Freeze old branches & attention** for warmup:
   ```python
   for name, p in model.named_parameters():
       # Train only the new branch and the router's final layer.
       if f"funcs.{old_funcs}" in name or "router.2" in name:
           p.requires_grad = True
       else:
           p.requires_grad = False
   ```
5. **High τ + light forcing** (0–1k steps): `router_tau_start=1.8`, `router_force_prob≈0.15`.
6. **Blend phase** (1–3k steps): unfreeze the old branches, lower τ → 1.2, raise the aux weight to a mid value (e.g. 0.006).
7. **Stabilize**: restore the standard schedule (τ → 1.0, aux → 0.01) and disable forcing.

### Recommended Minimal Fine-Tune Command

```bash
# expanded-config.json is the config with the updated num_funcs.
python scripts/train_veronica.py \
  --config expanded-config.json \
  --resume_from runs/veronica-pretrain-24L/checkpoint-60000 \
  --output_dir runs/veronica-expand-translation \
  --max_steps 8000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 8e-5 \
  --router_tau_start 1.8 --router_tau_end 1.2 --router_tau_freeze_steps 1500 \
  --router_aux_start 0.001 --router_aux_end 0.008 \
  --router_force_prob 0.15 --router_force_warmup_steps 1200
```

---

## Translation Specialization Branch

Add a branch focused on cross-lingual adaptation without retraining the entire backbone.

### Design Goals

| Requirement | Implementation Choice |
|-------------|-----------------------|
| Lightweight | Low-rank adapters + language conditioning |
| Reusable | Shares the main hidden size; no separate encoder |
| Controllable | Can be forced via `force_func` for targeted tuning |

### Example Branch Implementation

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationBranch(nn.Module):
    def __init__(self, hidden_size: int, mlp_mult: float = 2.0, rank: int = 64, num_langs: int = 16):
        super().__init__()
        self.rank = rank
        self.lang_embed = nn.Embedding(num_langs, hidden_size)
        inner = int(hidden_size * mlp_mult)
        self.up = nn.Linear(hidden_size, inner)
        self.down = nn.Linear(inner, hidden_size)
        # Low-rank adapters
        self.A = nn.Linear(hidden_size, rank, bias=False)
        self.B = nn.Linear(rank, hidden_size, bias=False)
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor, lang_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (B, T, H); lang_ids: (B,) sentence-level or (B, T) token-level
        if lang_ids is not None:
            if lang_ids.dim() == 1:  # broadcast sentence-level conditioning
                lang_vec = self.lang_embed(lang_ids).unsqueeze(1)  # (B, 1, H)
            else:
                lang_vec = self.lang_embed(lang_ids)  # (B, T, H)
            x = x + lang_vec
        h = self.up(x)
        h = F.gelu(h)
        h = self.down(h)
        # Adapter residual
        a = self.A(x)
        a = F.gelu(a)
        a = self.B(a)
        g = torch.sigmoid(self.gate(x))  # (B, T, 1)
        return h + g * a
```

### Integrate Into `PolymorphicMLP`

Inside branch construction:

```python
if num_funcs >= 4:
    funcs.append(TranslationBranch(hidden_size, mlp_mult=2.0))
```

### Passing Language IDs

- Add an optional `lang_ids` argument to the model's forward signature.
- Call branches that expect it as `func(x, lang_ids=lang_ids)`; other branches ignore the argument.
- For multilingual fine-tuning, prepend special language tokens or maintain a side tensor of language indices.
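
One dependency-free way to let branches with mixed signatures coexist is to inspect each branch's signature and forward `lang_ids` only where it is accepted; a sketch with toy callables (the repository may instead use `**kwargs` or explicit flags):

```python
import inspect

def call_branch(func, x, lang_ids=None):
    """Forward lang_ids only to branches whose signature accepts it."""
    params = inspect.signature(func).parameters
    if lang_ids is not None and "lang_ids" in params:
        return func(x, lang_ids=lang_ids)
    return func(x)

# Toy branches standing in for nn.Module forwards:
plain = lambda x: x + 1
def translation(x, lang_ids=None):
    return x + (0 if lang_ids is None else lang_ids)

print(call_branch(plain, 10))                    # 11
print(call_branch(translation, 10, lang_ids=5))  # 15
```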

### Fine-Tuning Strategy

1. Collect multilingual parallel / monolingual corpora (e.g. FLORES, WikiMatrix, an OSCAR subset).
2. Initially freeze the base transformer + existing branches.
3. Force the translation branch (`force_func = translation_index`) for exploratory steps.
4. Gradually unfreeze attention + other branches for joint adaptation.
5. Evaluate BLEU / COMET against the baseline; adjust `rank` / `mlp_mult` if underfitting.
316
+
317
+ ---
318
+
319
+ ## Evaluation & Monitoring
320
+ | Metric | Purpose |
321
+ |--------|---------|
322
+ | CE / PPL | Language modeling convergence |
323
+ | Router Entropy | Diversity of branch usage |
324
+ | Alpha Distribution | Detect collapse or dominance |
325
+ | Translation BLEU (if added) | Cross-lingual quality |
326
+
327
+ ---
328
+
329
+ ## Limitations
330
+ | Area | Limitation |
331
+ |------|------------|
332
+ | Alignment | Base LM (no RLHF / instruction tuning) |
333
+ | Multilingual | Requires added translation branch + fine‑tune |
334
+ | Safety | No filtering; may reproduce dataset biases |
335
+ | Interpretability | Router decisions not fully explainable |

---

## Roadmap

Current status: between v0.2 and v0.3.

| Version | Goal |
|---------|------|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Router logging + entropy regularization |
| v0.3 | Channel attention option |
| v0.4 | FlashAttention integration |
| v0.5 | Expansion utilities (branch migration helpers) |
| v0.6 | Translation branch reference implementation |
349
+ ---
350
+
351
+ ## Contributing
352
+ PRs welcome for: new branch types, expansion helpers, multilingual adapters, evaluation scripts.
353
+
354
+ ---
355
+
356
+ ## License
357
+ Apache-2.0
358
+
359
+ ---
360
+
361
+ ## Citation
362
+ ```bibtex
363
+ @misc{veronica-2025,
364
+ title={Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
365
+ author={Emanuele D'Angelo|GG-Ally},
366
+ year={2025},
367
+ howpublished={\url{https://huggingface.co/MhaWay/Veronica}}
368
+ }
369
+ ```

---

## Acknowledgments

- Mixture & routing concepts inspired by Switch Transformer, GLaM, and the MoE literature.
- Dataset composition ratios guided by codelion's DataComp LM mixture studies.
- RoPE adaptation referencing GPT-NeoX implementation details.

---

## FAQ

**Q: Why entropy-max instead of a load-balancing penalty?**
To avoid premature specialization and keep new branches trainable; the aux weight schedule increases over training.

**Q: Can I add many branches at once?**
Incremental growth (3 → 4 → 5) is recommended to prevent branch starvation.

**Q: How do I specialize for translation?**
Add `TranslationBranch`, warm up with forced routing, then run a blended fine-tune on multilingual data.

**Q: Does expansion erase prior knowledge?**
No; existing branches retain their weights. The router + new branch adapt during a short fine-tune.

---

Happy branching! 🌿