PolyChromaticLM 1.0 Base (0.6B)
A 597M-parameter transformer with biologically-inspired activation routing
Instead of a fixed activation function, each neuron dynamically selects among ReLU, Tanh, SiLU, and GELU, much as biological neurons select neurotransmitters based on context.
Overview
PolyChromaticLM is a research language model built from scratch in PyTorch whose core innovation is PolyGLU (Polychromatic Gated Linear Unit), a drop-in SwiGLU replacement that implements state-conditional activation routing. Rather than applying a single fixed activation function across all neurons, PolyGLU lets each FFN neuron dynamically choose among K = 4 activation functions via a differentiable Gumbel-Softmax routing mechanism.
This is the base pre-trained checkpoint (no instruction tuning / SFT). It was trained on ~10B tokens with a math-heavy data mix on a single A100 80GB GPU.
Author: Daniel Nobrega (independent research)
Key Results
- Routing converges to near-deterministic selections (entropy = 0.03% of maximum) without any explicit sparsity regularization, an emergent property
- Clear depth-dependent activation specialization: early layers prefer GELU, deep layers strongly prefer Tanh
- Achieves 62–89% of Qwen3-0.6B-Base benchmark performance using 3,600× fewer training tokens
- The routing mechanism adds only 0.23% parameter overhead (~1.4M params)
Architecture
| | |
|---|---|
| Parameters | 597M total (~1.4M routing, 0.23% overhead) |
| Hidden dim | 1,024 |
| FFN dim | 4,096 |
| Layers | 28 |
| Attention | GQA (16 query / 8 KV heads, head dim 64) |
| Context | 4,096 tokens |
| Vocab | 151,669 (Qwen3 tokenizer) |
| Position encoding | RoPE (θ = 10,000) |
| Normalization | RMSNorm (pre-norm) + QK-Norm |
| FFN | PolyGLU (K = 4: ReLU, Tanh, SiLU, GELU) |
| Weight tying | Embedding ↔ output head |
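As a sanity check, the headline parameter count can be reproduced from the dimensions in the table. This is a rough estimate: norm parameters are ignored and the ~1.4M routing overhead is taken from the table rather than derived.

```python
# Back-of-the-envelope parameter count from the architecture table.
d, ffn, layers, vocab = 1024, 4096, 28, 151_669
q_heads, kv_heads, head_dim = 16, 8, 64

emb = vocab * d                              # tied embedding / output head, counted once
attn = (d * q_heads * head_dim               # Q projection
        + 2 * d * kv_heads * head_dim        # K and V projections (GQA)
        + q_heads * head_dim * d)            # output projection
ffn_params = 3 * d * ffn                     # gate, up, and down projections
total = emb + layers * (attn + ffn_params) + 1_400_000  # + routing overhead

print(f"~{total / 1e6:.0f}M parameters")     # ~597M parameters
```

The per-layer split is roughly 3.1M attention + 12.6M FFN parameters, with the tied embedding contributing ~155M on its own.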
How PolyGLU Works
Standard SwiGLU uses a fixed SiLU activation. PolyGLU generalizes this:
PolyGLU(x) = [Σ_k g_k · σ_k(x · W_gate)] ⊙ (x · W_up)
where g_k = GumbelSoftmax(α_k + β_k · f(h̄), τ) and σ_k ∈ {ReLU, Tanh, SiLU, GELU}.
Each neuron has:
- Static preference (α): a learned bias toward specific activations
- Dynamic gating (β · f(h̄)): a lightweight MLP that reads the mean-pooled hidden state and modulates routing based on context
- Temperature (τ): annealed from 1.0 → 0.1 during training, controlling routing sharpness
The biological analogy: just as neurons select specific neurotransmitters (glutamate, GABA, dopamine, acetylcholine) depending on circuit state, PolyGLU neurons select activation functions depending on input context.
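A minimal PyTorch sketch of the mechanism described above, assuming a SwiGLU-style layout. The class name, router shape, and `d_route` bottleneck are illustrative assumptions, not the repo's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyGLU(nn.Module):
    """Sketch of a PolyGLU FFN block (layout and names are assumptions)."""
    ACTS = [F.relu, torch.tanh, F.silu, F.gelu]  # K = 4 candidate activations

    def __init__(self, d_model=1024, d_ffn=4096, d_route=32):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)
        K = len(self.ACTS)
        # Static preference alpha: one logit per (neuron, activation)
        self.alpha = nn.Parameter(torch.zeros(d_ffn, K))
        # Dynamic gating beta * f(h_bar): a tiny MLP over the pooled hidden state
        self.router = nn.Sequential(
            nn.Linear(d_model, d_route), nn.SiLU(), nn.Linear(d_route, d_ffn * K)
        )

    def forward(self, x, tau=1.0):
        # x: (batch, seq, d_model)
        B = x.shape[0]
        h_bar = x.mean(dim=1)  # mean-pooled hidden state h_bar
        logits = self.alpha + self.router(h_bar).view(B, -1, len(self.ACTS))
        g = F.gumbel_softmax(logits, tau=tau, dim=-1)  # (B, d_ffn, K) routing weights
        gate = self.w_gate(x)                          # (B, seq, d_ffn)
        # Weighted mixture of the K activations applied to the gate branch
        mixed = sum(g[:, None, :, k] * act(gate) for k, act in enumerate(self.ACTS))
        return self.w_down(mixed * self.w_up(x))
```

With `tau` annealed toward 0.1, the Gumbel-Softmax weights approach one-hot selections, matching the near-deterministic routing reported below.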
Training
| | |
|---|---|
| Tokens | ~10.24B |
| Steps | 19,531 |
| Hardware | 1× NVIDIA A100 80GB |
| Wall time | |
| Precision | BFloat16 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8) |
| Peak LR | 1e-4 (cosine decay, 2K warmup) |
| Weight decay | 0.1 (weight matrices only) |
| Batch size | ~524K tokens/step |
| Gradient clipping | 1.0 max norm |
| Final loss | 1.31 |
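The schedule in the table (linear warmup for 2K steps to the 1e-4 peak, cosine decay over the remaining steps) can be sketched as a plain function. The floor value `min_ratio` is an assumption; the card does not state a minimum LR.

```python
import math

def lr_at(step, peak=1e-4, warmup=2_000, total=19_531, min_ratio=0.0):
    """Linear warmup to `peak`, then cosine decay to `peak * min_ratio`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (min_ratio + (1.0 - min_ratio) * cosine)

print(lr_at(1_000), lr_at(2_000), lr_at(19_531))
```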
Data Mix
| Domain | Dataset | Share | Tokens |
|---|---|---|---|
| Math | nvidia/Nemotron-CC-Math-v1 (4+ subset) | 70% | ~7.0B |
| STEM | openbmb/Ultra-FineWeb | 25% | ~2.5B |
| Code | lumees/github-code-2025-language-split (Python) | 5% | ~0.5B |
The final 20% of training anneals the mix to 85% math / 10% STEM / 5% code for math-focused refinement.
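The anneal can be sketched as a linear interpolation over the final 20% of steps. The schedule shape is an assumption; the card only states the start and end proportions.

```python
def mix_at(step, total=19_531):
    """Return (math, stem, code) sampling shares at a given step (assumed linear anneal)."""
    start, end = (0.70, 0.25, 0.05), (0.85, 0.10, 0.05)
    anneal_from = 0.8 * total  # anneal begins at 80% of training
    if step < anneal_from:
        return start
    t = (step - anneal_from) / (total - anneal_from)
    return tuple(s + t * (e - s) for s, e in zip(start, end))

print(tuple(round(v, 2) for v in mix_at(19_531)))  # (0.85, 0.1, 0.05)
```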
Training Dynamics
Loss curve detail
| Step | Tokens | Loss |
|---|---|---|
| 0 | 0 | 12.13 |
| 2,000 | 1.05B | 3.50 |
| 10,000 | 5.24B | 2.26 |
| 15,000 | 7.86B | 1.68 |
| 19,531 | 10.24B | 1.31 |
Evaluation
All benchmarks via EleutherAI lm-evaluation-harness v0.4.11, 0-shot unless noted.
Benchmarks
| Benchmark | Metric | PolyChromaticLM | Random | Qwen3-0.6B-Base |
|---|---|---|---|---|
| HellaSwag | acc_norm | 28.51 | 25.00 | 41.10 |
| ARC-Easy | acc_norm | 41.04 | 25.00 | 65.60 |
| ARC-Challenge | acc_norm | 22.27 | 25.00 | 33.90 |
| PIQA | acc_norm | 58.87 | 50.00 | 70.00 |
| WinoGrande | acc | 52.17 | 50.00 | 58.50 |
| BoolQ | acc | 61.13 | 50.00 | 69.70 |
| SciQ | acc_norm | 61.20 | 25.00 | – |
| OpenBookQA | acc_norm | 29.00 | 25.00 | – |
| MMLU-STEM | acc (5-shot) | 25.28 | 25.00 | – |
| LAMBADA | acc | 15.35 | ~0 | – |
Context: Qwen3-0.6B-Base was trained on ~36T tokens, approximately 3,600× our budget. Achieving 62–89% of its scores at 0.028% of the training compute demonstrates strong token efficiency for the PolyGLU architecture.
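The 62–89% range quoted above can be reproduced directly from the six benchmarks in the table that report a Qwen3-0.6B-Base score:

```python
# Ratio of PolyChromaticLM to Qwen3-0.6B-Base on the shared benchmarks.
scores = {  # (PolyChromaticLM, Qwen3-0.6B-Base)
    "HellaSwag": (28.51, 41.10),
    "ARC-Easy": (41.04, 65.60),
    "ARC-Challenge": (22.27, 33.90),
    "PIQA": (58.87, 70.00),
    "WinoGrande": (52.17, 58.50),
    "BoolQ": (61.13, 69.70),
}
ratios = {name: ours / base for name, (ours, base) in scores.items()}
print(f"{min(ratios.values()):.1%} to {max(ratios.values()):.1%}")  # 62.6% to 89.2%
```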
Domain Perplexity
| Domain | Training Share | Perplexity | Bits/Token |
|---|---|---|---|
| Math | 70% → 85% | 3.56 | 1.83 |
| Code | 5% | 7.08 | 2.82 |
| STEM | 25% → 10% | 31.93 | 5.00 |
Code perplexity (7.08) is significantly lower than STEM (31.93) despite receiving 5× less data, evidence that mathematical structure transfers effectively to code patterns.
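The bits/token column is just log₂ of the perplexity, and can be recomputed from the table:

```python
import math

# Bits/token = log2(perplexity), recomputing the table's last column.
ppl = {"Math": 3.56, "Code": 7.08, "STEM": 31.93}
bits = {domain: math.log2(p) for domain, p in ppl.items()}
for domain, b in bits.items():
    print(f"{domain}: {b:.2f} bits/token")
```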
Emergent Routing Behavior
The most striking finding from training: the routing mechanism converges to near-deterministic activation selections without any explicit sparsity loss or entropy regularization.
At convergence, mean dynamic routing entropy is 0.0004 (just 0.03% of the theoretical maximum), meaning the gate network makes near-one-hot activation choices for virtually every neuron.
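The 0.03% figure follows from the theoretical maximum: with K = 4 activation choices, the maximum routing entropy is ln(4) nats.

```python
import math

# Fraction of the maximum entropy represented by the converged value of 0.0004 nats.
max_entropy = math.log(4)          # ~1.386 nats for K = 4 choices
frac = 0.0004 / max_entropy
print(f"{frac:.2%} of maximum")    # 0.03% of maximum
```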
Layer-wise Activation Specialization
The model discovers a clear depth-dependent activation gradient:
- Early layers (0–5): GELU-dominant (~35–40%), favoring smooth, probabilistic activations for initial feature extraction
- Middle layers (6–14): mixed, with a gradual transition toward increasing Tanh representation
- Deep layers (15–27): Tanh-dominant (~50–65%), providing bounded compression for deep representational processing
Three layers (9, 16, 17) maintain elevated routing entropy, suggesting they benefit from activation diversity. Layer 17 notably increases its entropy during the second half of training, counter to the global trend toward determinism.
Usage
This model was trained from scratch in pure PyTorch (no HuggingFace model wrappers). To load and use it:
```python
import torch
from transformers import AutoTokenizer

# Clone the training repo for the model code:
#   git clone https://github.com/danielxmed/PolyGLU.git
from src.model.config import ModelConfig
from src.model.model import load_checkpoint

# Load the model
config = ModelConfig(use_flash_attn=False)
model, step, tau = load_checkpoint("path/to/portable_final.pt", config, device="cuda")
model.eval()

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
input_ids = tokenizer("The derivative of x squared is", return_tensors="pt")["input_ids"].cuda()

# Generate (greedy, no KV cache)
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0]))
```
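Greedy decoding on a raw base model tends to fall into repetition; a small sampling helper often gives more natural continuations. This is a hypothetical add-on, not part of the repo:

```python
import torch

def sample_top_k(logits, k=50, temperature=0.8):
    """Sample one token from the top-k most likely candidates (illustrative helper)."""
    logits = logits / temperature
    topk = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(topk.values, dim=-1)
    idx = torch.multinomial(probs, num_samples=1)   # index into the top-k set
    return topk.indices.gather(-1, idx)             # map back to vocab ids

# Drop-in replacement for the argmax line in the generation loop:
# next_token = sample_top_k(logits[:, -1, :])
```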
Note: This is a base model, so it produces raw continuations rather than instruction-following responses. An SFT version fine-tuned on math problem-solving is forthcoming.
Limitations
- Base model only: no instruction tuning, no chat capability, no RLHF. Outputs are raw text continuations.
- 10B-token training budget: significantly less than comparable-size production models (Qwen3-0.6B: ~36T tokens). General knowledge and factual recall are limited.
- Math-heavy distribution (70% math): strong on mathematical language modeling, weaker on general NLU tasks.
- No KV cache: inference requires the full training codebase; generation is slow without a dedicated inference implementation.
- English only: trained exclusively on English-language data.
Citation
```bibtex
@misc{nobrega2026polychromaticLM,
  title  = {PolychromaticLM: State-Conditional Activation Routing via Neurotransmitter-Inspired Gated Linear Units},
  author = {Daniel Nobrega},
  year   = {2026},
  url    = {https://huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B}
}
```
Links
| | |
|---|---|
| Code | github.com/danielxmed/PolyGLU |
| Model | huggingface.co/tylerxdurden/PolyChromaticLM-1.0-base-0.6B |
| Weights & Biases | polychromatic-lm |