BOREAL-250M
Balanced Orthogonal Recurrent Expert Attention Layers
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture (the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6) and trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family. It exists to validate the architecture at small scale: prove that DeltaNet layers beat pure Transformers at long context, demonstrate TST acceleration, and establish the scaling laws that justify the larger models.
Architecture
| Component | Detail |
|---|---|
| Type | Dense hybrid: Gated DeltaNet + GQA |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |
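As a quick orientation, here is how the table above might map onto a config dict. Field names and the exact 3:1 interleave pattern are illustrative assumptions, not the actual BOREAL config file.

```python
# Illustrative sketch only: field names and the layer interleave are assumptions,
# not the actual BOREAL-250M configuration file.
BOREAL_250M_CONFIG = {
    "hidden_size": 1024,
    "num_layers": 12,
    # 3:1 linear-to-full attention; assumed pattern: every 4th layer is full GQA
    "layer_pattern": (["deltanet"] * 3 + ["full_attention"]) * 3,
    "attn_num_q_heads": 8,
    "attn_num_kv_heads": 2,
    "attn_head_dim": 256,
    "deltanet_num_qk_heads": 8,
    "deltanet_num_v_heads": 16,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,
    "ffn_intermediate_size": 3072,
    "ffn_activation": "swiglu",
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "output_gate": "silu",
    "vocab_size": 151_936,
    "max_position_embeddings": 32_768,
    "num_mtp_heads": 1,
}
```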
Architecture Rationale
Gated DeltaNet over pure attention. 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state S_t = beta_t * S_{t-1} + k_t ⊗ v_t, where beta_t is a learned sigmoid gate controlling information retention. The result: O(n) complexity on 75% of layers, enabling native long-context processing without the quadratic memory blowup of pure attention.
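A minimal PyTorch sketch of that recurrence, for one head and one time step. Function and variable names are illustrative; the production kernel is chunked and fused, and the full delta rule also removes the value previously stored under k_t.

```python
import torch

def gated_linear_attention_step(S, q, k, v, beta):
    """One step of the simplified gated update: S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S:    (d_k, d_v) recurrent state, kept in FP32 (see the Training table)
    q, k: (d_k,) query / key for the current token
    v:    (d_v,) value for the current token
    beta: scalar in (0, 1), the learned sigmoid forgetting gate
    """
    S = beta * S + torch.outer(k, v)  # decay old associations, write the new one
    o = q @ S                         # read-out for the current token: shape (d_v,)
    return S, o
```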
Larger head dims (256). Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (2 KV heads for 8 query heads, a 4:1 ratio).
Partial RoPE (0.25). Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through position-agnostically, creating a natural pathway for the DeltaNet's recurrent state to carry position-free information across the sequence.
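A rough sketch of what a 0.25 partial rotary factor looks like in code; the rotate-half convention and dimension ordering are assumptions and may differ from the actual implementation.

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_factor=0.25):
    """Rotate only the first `rotary_factor` fraction of head dims.

    x:        (..., head_dim) query or key vectors
    cos, sin: (..., rot_dim) rotary tables for the current positions
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_factor)          # e.g. 64 of 256 dims
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # standard rotate-half formulation on the rotary slice
    half = rot_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin

    # the remaining 75% of dims carry no positional encoding
    return torch.cat((x_rot, x_pass), dim=-1)
```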
Output gating. Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x). This prevents attention blowup and provides a gradient highway independent of the attention path.
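A minimal sketch of the gate; `mixer` stands in for either the full-attention or DeltaNet token mixer, and the module name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedOutput(nn.Module):
    """Swish/SiLU output gate: out = mixer(x) * silu(W_gate @ x)."""

    def __init__(self, hidden_size: int, mixer: nn.Module):
        super().__init__()
        self.mixer = mixer                                    # attention or DeltaNet block
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mixer(x) * F.silu(self.w_gate(x))         # gate the mixer output
```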
Training
| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime, matched to validation goal) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| TST | Token Superposition Training, s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy (TST) → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Hardware | 1× DGX Spark (Grace Hopper H200, 480GB unified memory) |
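A hedged reconstruction of the optimizer and schedule from the table; warmup length and total steps are placeholders, not values from the actual training script.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, warmup_steps=1_000,
                                 peak_lr=3e-4, min_lr_ratio=0.10):
    # AdamW with the betas and weight decay listed in the table
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
    )

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup (assumed)
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine  # cosine decay to 10% of peak

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```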
TST (Token Superposition Training)
TST is a drop-in training acceleration method from Nous Research (arXiv:2605.06546, May 2026). During the first 50% of training, token embeddings are grouped into bags of 4 and averaged into single "superposed" embeddings. The model predicts the entire next bag using multi-hot cross-entropy, processing 4x more data tokens per forward pass. The second 50% of training reverts to standard autoregressive next-token prediction, allowing the model to recover fine-grained token-level behavior while retaining the richer representations learned during superposition.
At equal FLOPs, TST provides a 1.5–2.5x reduction in pretraining wall time with no architecture, tokenizer, or data changes.
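The sketch below illustrates the Phase 1 mechanics as described above (bagging embeddings and building multi-hot targets). It is not the Nous reference implementation, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def superpose_and_targets(token_ids, embedding, bag_size=4):
    """Group tokens into bags, average their embeddings, and build multi-hot targets.

    token_ids: (seq_len,) 1-D LongTensor of token ids
    embedding: the model's nn.Embedding
    Returns superposed inputs (n_bags-1, hidden) and multi-hot targets
    (n_bags-1, vocab); bag t is trained to predict every token in bag t+1.
    """
    n_bags = token_ids.shape[0] // bag_size
    ids = token_ids[: n_bags * bag_size].view(n_bags, bag_size)

    inputs = embedding(ids).mean(dim=1)              # superposed bag embeddings

    targets = torch.zeros(n_bags - 1, embedding.num_embeddings)
    targets.scatter_(1, ids[1:], 1.0)                # multi-hot over the next bag
    return inputs[:-1], targets

def multi_hot_cross_entropy(logits, targets):
    # normalize the multi-hot target into a distribution, then soft-target CE
    probs = targets / targets.sum(dim=-1, keepdim=True)
    return -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```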
Scaling Ladder
BOREAL-250M is the architectural proof point. Success here validates:
- DeltaNet convergence: linear attention trains stably with FP32 states
- TST acceleration: measurable throughput gain vs standard training
- KV cache reduction: 4–8x smaller than a pure Transformer at 32K context (worked through in the sketch below)
- Context generalization: loss scales gracefully to 8K, 16K, 32K
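A back-of-the-envelope check on the KV-cache bullet, assuming the pure-Transformer baseline is the same 12-layer stack with the table's GQA attention in every layer; a baseline with more KV heads per layer would push the ratio toward the 8x end.

```python
# Assumptions: BF16 cache, 2 KV heads, head_dim 256, only full-attention layers cache KV.
CONTEXT, KV_HEADS, HEAD_DIM, BYTES_BF16 = 32_768, 2, 256, 2

def kv_cache_bytes(num_full_attn_layers):
    # K and V cached per layer, per token, per KV head
    return num_full_attn_layers * 2 * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_BF16

hybrid = kv_cache_bytes(3)     # BOREAL: only 3 of 12 layers use full attention
baseline = kv_cache_bytes(12)  # pure-Transformer baseline: all 12 layers
print(f"{hybrid / 2**20:.0f} MiB vs {baseline / 2**20:.0f} MiB -> {baseline / hybrid:.1f}x")
# 192 MiB vs 768 MiB -> 4.0x
```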
If these hold, the same architecture scales directly to:
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | In training |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
The 2B is the community release: a model people can download, use, and benchmark. The 10B MoE is the target: 128 experts, 8 active per token, DeepSeek-V4-style hash routing with no auxiliary loss, a shared expert, and 256K native context. The goal is Qwen3.5-9B-class performance with ~2B active parameters.
Expectations
BOREAL-250M is not a competitive standalone model. It's an architecture validation tool. Expect:
- Coherent text generation: readable, makes sense, occasionally factual
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves: log-linear loss vs tokens through 200B+
- Long-context advantage: lower perplexity than pure Transformer baselines at 8K+ context lengths
For a model you'd actually use downstream, see BOREAL-2B.
License
Apache 2.0 (Qwen3 tokenizer lineage).
Author
Developed by DJLougen.
Trained on a DGX Spark in Toronto. Compute self-funded by a visual neuroscience PhD student who spends too much time thinking about attention mechanisms.