BOREAL-250M

Balanced Orthogonal Recurrent Expert Attention Layers

A 250M-parameter dense hybrid language model pretrained from scratch. It uses the Gated DeltaNet architecture (the same hybrid linear-attention design behind Qwen3.5 and Qwen3.6) and is trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.

BOREAL-250M is the smallest member of the BOREAL family. It exists to validate the architecture at small scale: prove that DeltaNet layers beat pure Transformers at long context, demonstrate TST acceleration, and establish the scaling laws that justify the larger models.

Architecture

| Component | Detail |
|---|---|
| Type | Dense hybrid (Gated DeltaNet + GQA) |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |

Architecture Rationale

Gated DeltaNet over pure attention. 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state S_t = beta_t * S_{t-1} + k_t ⊗ v_t, where beta_t is a learned sigmoid gate controlling information retention. The result: O(n) complexity on 75% of layers, enabling native long-context processing without the quadratic compute cost and ever-growing KV cache of pure attention.
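
For reference, a minimal sequential sketch of this recurrence in plain PyTorch. Shapes, names, and the single-head framing are illustrative; the real kernel is chunked and fused, and the full Gated DeltaNet update layers a delta-rule correction on top of this simple decay-and-write form.

```python
import torch

def gated_linear_recurrence(q, k, v, beta):
    """Sequential reference for S_t = beta_t * S_{t-1} + k_t (outer) v_t, o_t = q_t^T S_t.

    q, k:  (T, d_k) queries/keys for a single head
    v:     (T, d_v) values
    beta:  (T,)     sigmoid forgetting gate in (0, 1)
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v, dtype=torch.float32)  # fixed-size state, kept in FP32 (see training table)
    outputs = []
    for t in range(len(k)):
        S = beta[t] * S + torch.outer(k[t].float(), v[t].float())  # decay old memory, write new association
        outputs.append(q[t].float() @ S)                           # read the state with the current query
    return torch.stack(outputs)

# Toy usage: 16 steps, 128-dim QK and V
q, k, v = (torch.randn(16, 128) for _ in range(3))
beta = torch.sigmoid(torch.randn(16))               # data-dependent gate
out = gated_linear_recurrence(q, k, v, beta)        # (16, 128)
```

Because S has a fixed (d_k x d_v) shape, per-layer memory is constant in sequence length; that constant-size state is where the O(n) claim comes from.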

Larger head dims (256). Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (2 KV heads for 8 query heads, a 4:1 ratio).
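
As a shape-level illustration of that GQA configuration (causal masking, projections, and RoPE omitted; this is a sketch of head sharing, not the model's actual attention code):

```python
import torch

# 8 query heads share 2 KV heads (each KV head serves 4 query heads); head_dim = 256.
n_q, n_kv, head_dim, T = 8, 2, 256, 16
q = torch.randn(n_q, T, head_dim)
k = torch.randn(n_kv, T, head_dim)
v = torch.randn(n_kv, T, head_dim)

# KV heads are expanded only at compute time; the cache stores just the 2 KV heads.
k_exp = k.repeat_interleave(n_q // n_kv, dim=0)            # (8, T, head_dim)
v_exp = v.repeat_interleave(n_q // n_kv, dim=0)
scores = (q @ k_exp.transpose(-1, -2)) / head_dim ** 0.5   # (8, T, T), causal mask omitted
out = scores.softmax(dim=-1) @ v_exp                       # (8, T, head_dim)
```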

Partial RoPE (0.25). Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through position-agnostically, creating a natural pathway for the DeltaNet's recurrent state to carry position-free information across the sequence.
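
A minimal sketch of partial rotary embedding, assuming the standard interleaved rotate-by-pairs formulation; only the first int(head_dim * 0.25) channels are rotated and the rest pass through untouched. Function and variable names are illustrative.

```python
import torch

def partial_rope(x, positions, partial_rotary_factor=0.25, theta=10_000_000.0):
    """Apply RoPE to only the leading fraction of each head's channels.

    x:         (T, head_dim) vectors for one head
    positions: (T,) integer token positions
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * partial_rotary_factor)            # e.g. 64 of 256 channels
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]         # pass-through part carries no position signal

    inv_freq = 1.0 / theta ** (torch.arange(0, rot_dim, 2).float() / rot_dim)
    angles = positions.float()[:, None] * inv_freq[None, :]    # (T, rot_dim // 2)
    cos = angles.cos().repeat_interleave(2, dim=-1)
    sin = angles.sin().repeat_interleave(2, dim=-1)

    # rotate-by-pairs: adjacent channels (x1, x2) -> (-x2, x1)
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)

    return torch.cat([x_rot * cos + rotated * sin, x_pass], dim=-1)

q = torch.randn(8, 256)                        # 8 tokens, head_dim 256
q_rope = partial_rope(q, torch.arange(8))      # only 64 of 256 channels carry position information
```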

Output gating. Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x). This prevents attention blowup and provides a gradient highway independent of the attention path.
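
In module form the gate is just one extra linear projection from the layer input; a minimal sketch, with illustrative module and parameter names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedOutput(nn.Module):
    """Wrap a token mixer (full attention or DeltaNet) with a learned SiLU gate:
    y = mixer(x) * silu(W_gate x)."""
    def __init__(self, mixer: nn.Module, hidden_size: int):
        super().__init__()
        self.mixer = mixer
        self.gate = nn.Linear(hidden_size, hidden_size, bias=False)  # W_gate

    def forward(self, x):
        return self.mixer(x) * F.silu(self.gate(x))

# Toy usage with an identity "mixer"
block = GatedOutput(nn.Identity(), hidden_size=1024)
y = block(torch.randn(2, 16, 1024))   # (batch, seq, hidden)
```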

Training

| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime, matched to validation goal) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| TST | Token Superposition Training, s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy (TST) → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% of peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Hardware | 1× DGX Spark (Grace Hopper H200, 480GB unified memory) |

TST (Token Superposition Training)

TST is a drop-in training acceleration method from Nous Research (arXiv:2605.06546, May 2026). During the first 50% of training, token embeddings are grouped into bags of 4 and averaged into single "superposed" embeddings. The model predicts the entire next bag using multi-hot cross-entropy, processing 4x more data tokens per forward pass. The second 50% of training reverts to standard autoregressive next-token prediction, allowing the model to recover fine-grained token-level behavior while retaining the richer representations learned during superposition.
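
The mechanics are easier to see in code. Below is a minimal sketch of the bagging and target construction, assuming bag size s=4, a plain mean over each bag's embeddings, and cross-entropy against a normalized multi-hot target; helper names and the exact loss normalization are illustrative rather than lifted from the TST paper.

```python
import torch
import torch.nn.functional as F

def superpose(token_ids, embedding, bag_size=4):
    """Average every consecutive group of `bag_size` token embeddings into one
    superposed input vector, shortening the input sequence by `bag_size`x."""
    T = token_ids.shape[0] - token_ids.shape[0] % bag_size   # drop the ragged tail
    bags = token_ids[:T].view(-1, bag_size)                  # (n_bags, bag_size)
    return bags, embedding(bags).mean(dim=1)                 # (n_bags, hidden)

def multi_hot_targets(bags, vocab_size):
    """Distribution over the vocab marking every token id that appears in a bag."""
    targets = torch.zeros(bags.shape[0], vocab_size)
    targets.scatter_(1, bags, 1.0)
    return targets / targets.sum(dim=1, keepdim=True)

vocab_size, hidden = 4_096, 64                      # toy sizes; the real model uses 151,936 x 1,024
embedding = torch.nn.Embedding(vocab_size, hidden)
token_ids = torch.randint(0, vocab_size, (64,))

bags, bag_inputs = superpose(token_ids, embedding)  # 16 superposed inputs instead of 64 tokens
targets = multi_hot_targets(bags[1:], vocab_size)   # bag t predicts the contents of bag t+1

logits = torch.randn(targets.shape[0], vocab_size)  # stand-in for the model's output at bags[:-1]
loss = F.cross_entropy(logits, targets)             # multi-hot (soft-target) cross-entropy
```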

The reported gain is reaching a given loss in 1.5–2.5x less pretraining wall time, with no architecture, tokenizer, or data changes.

Scaling Ladder

BOREAL-250M is the architectural proof point. Success here validates:

  1. DeltaNet convergence β€” linear attention trains stably with FP32 states
  2. TST acceleration β€” measurable throughput gain vs standard training
  3. KV cache reduction — 4–8x smaller than a pure Transformer at 32K context (see the back-of-envelope numbers after this list)
  4. Context generalization β€” loss scales gracefully to 8K, 16K, 32K
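
To put numbers on item 3, a back-of-envelope KV-cache comparison under the config above (BF16 cache, 2 KV heads, head_dim 256). The pure-Transformer baselines are hypothetical, and their KV-head counts are assumptions; that is where the 4–8x range comes from.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem   # 2 = keys + values, BF16

ctx = 32_768
hybrid   = kv_bytes_per_token(3,  2, 256) * ctx   # only the 3 full-attention layers cache KV
pure_gqa = kv_bytes_per_token(12, 2, 256) * ctx   # hypothetical all-attention baseline, same GQA
pure_4kv = kv_bytes_per_token(12, 4, 256) * ctx   # hypothetical baseline with milder KV sharing

print(hybrid / 2**20)                          # ~192 MiB per 32K-token sequence
print(pure_gqa / hybrid, pure_4kv / hybrid)    # 4.0x and 8.0x
```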

If these hold, the same architecture scales directly to:

| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | In training |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |

The 2B is the community release, a model people can download, use, and benchmark. The 10B MoE is the target: 128 experts, 8 active per token, DeepSeek-V4-style hash routing with no auxiliary loss, a shared expert, and 256K native context. The goal is Qwen3.5-9B-level performance from ~2B active params.
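
The card doesn't pin down the router beyond "hash routing with no auxiliary loss", but the motivation is easy to illustrate: if expert assignment is a fixed function of the token id, in the spirit of Hash Layers (Roller et al., 2021), there is no learned gate to collapse and nothing for an auxiliary balancing loss to regularize. A minimal sketch with illustrative names, not BOREAL's actual router:

```python
import torch

def build_hash_router(vocab_size, n_experts=128, n_active=8, seed=0):
    """Fixed (non-learned) routing table: each token id is assigned `n_active`
    experts once, up front, independent of any trainable gate."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randint(0, n_experts, (vocab_size, n_active), generator=gen)

routing_table = build_hash_router(vocab_size=151_936)
token_ids = torch.randint(0, 151_936, (1024,))
experts = routing_table[token_ids]   # (1024, 8) expert indices; the shared expert runs for every token
```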

Expectations

BOREAL-250M is not a competitive standalone model. It's an architecture validation tool. Expect:

  • Coherent text generation β€” readable, makes sense, occasionally factual
  • Above-random benchmarks β€” 35–40% HellaSwag, 55–60% ARC-Easy
  • Clean scaling curves β€” log-linear loss vs tokens through 200B+
  • Long-context advantage β€” lower perplexity than pure Transformer baselines at 8K+ context lengths

For a model you'd actually use downstream, see BOREAL-2B.

License

Apache 2.0. The tokenizer is inherited from the Qwen3 lineage (itself Apache 2.0).

Author

Developed by DJLougen.

Trained on a DGX Spark in Toronto. Compute self-funded by a visual neuroscience PhD student who spends too much time thinking about attention mechanisms.

β˜• Support on Ko-fi
