BOREAL-250M
Balanced Orthogonal Recurrent Expert Attention Layers
A 250M-parameter dense hybrid language model pretrained from scratch. Built on the Gated DeltaNet architecture (the same hybrid linear-attention design that powers Qwen3.5 and Qwen3.6) and trained with Token Superposition Training (TST) for maximum data efficiency per GPU-hour.
BOREAL-250M is the smallest member of the BOREAL family. It exists to validate the architecture at small scale: prove that DeltaNet layers beat pure Transformers at long context, demonstrate TST acceleration, and establish the scaling laws that justify the larger models.
Architecture
| Component | Detail |
|---|---|
| Type | Dense hybrid: Gated DeltaNet + GQA |
| Parameters | 250M |
| Hidden size | 1,024 |
| Layers | 12 (9 DeltaNet + 3 full attention) |
| Ratio | 3:1 linear-to-full attention |
| Full attention | GQA: 8 query heads, 2 KV heads, head_dim=256 |
| DeltaNet | Gated linear attention: 8 QK heads, 16 V heads, head_dim=128 |
| Conv kernel | 4 (local context mixing) |
| FFN | SwiGLU, intermediate=3,072 |
| Norm | RMSNorm, eps=1e-6 |
| Position | RoPE, theta=10M, partial_rotary_factor=0.25 |
| Output gate | Swish-gated attention outputs |
| Vocab | 151,936 (Qwen3 tokenizer) |
| Context | 32,768 tokens native |
| MTP | 1 multi-token prediction head |
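As a quick orientation, here is how the table above might map onto a config dict. Field names and the exact 3:1 interleave pattern are illustrative assumptions, not the actual BOREAL config file.

```python
# Illustrative sketch only: field names and the layer interleave are assumptions,
# not the actual BOREAL-250M configuration file.
BOREAL_250M_CONFIG = {
    "hidden_size": 1024,
    "num_layers": 12,
    # 3:1 linear-to-full attention; assumed pattern: every 4th layer is full GQA
    "layer_pattern": (["deltanet"] * 3 + ["full_attention"]) * 3,
    "attn_num_q_heads": 8,
    "attn_num_kv_heads": 2,
    "attn_head_dim": 256,
    "deltanet_num_qk_heads": 8,
    "deltanet_num_v_heads": 16,
    "deltanet_head_dim": 128,
    "conv_kernel_size": 4,
    "ffn_intermediate_size": 3072,
    "ffn_activation": "swiglu",
    "rms_norm_eps": 1e-6,
    "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25,
    "output_gate": "silu",
    "vocab_size": 151_936,
    "max_position_embeddings": 32_768,
    "num_mtp_heads": 1,
}
```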
Architecture Rationale
Gated DeltaNet over pure attention. 75% of layers use linear attention with data-dependent forgetting gates. Each DeltaNet layer maintains a fixed-size recurrent state S_t = beta_t * S_{t-1} + k_t ⊗ v_t, where beta_t is a learned sigmoid gate controlling information retention. The result: O(n) complexity on 75% of layers, enabling native long-context processing without the quadratic memory blowup of pure attention.
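A minimal PyTorch sketch of that recurrence, for one head and one time step. Function and variable names are illustrative; the production kernel is chunked and fused, and the full delta rule also removes the value previously stored under k_t.

```python
import torch

def gated_linear_attention_step(S, q, k, v, beta):
    """One step of the simplified gated update: S_t = beta_t * S_{t-1} + k_t (outer) v_t.

    S:    (d_k, d_v) recurrent state, kept in FP32 (see the Training table)
    q, k: (d_k,) query / key for the current token
    v:    (d_v,) value for the current token
    beta: scalar in (0, 1), the learned sigmoid forgetting gate
    """
    S = beta * S + torch.outer(k, v)  # decay old associations, write the new one
    o = q @ S                         # read-out for the current token: shape (d_v,)
    return S, o
```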
Larger head dims (256). Following Qwen3.5 and DeepSeek-V4, head_dim jumps from the traditional 128 to 256. Fewer heads with more per-head capacity, paired with aggressive GQA (2 KV heads for 8 query heads, a 4:1 ratio).
Partial RoPE (0.25). Only 25% of each head's dimensions receive rotary positional encoding. The remaining 75% pass through position-agnostically, creating a natural pathway for the DeltaNet's recurrent state to carry position-free information across the sequence.
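A rough sketch of what a 0.25 partial rotary factor looks like in code; the rotate-half convention and dimension ordering are assumptions and may differ from the actual implementation.

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_factor=0.25):
    """Rotate only the first `rotary_factor` fraction of head dims.

    x:        (..., head_dim) query or key vectors
    cos, sin: (..., rot_dim) rotary tables for the current positions
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_factor)          # e.g. 64 of 256 dims
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # standard rotate-half formulation on the rotary slice
    half = rot_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin

    # the remaining 75% of dims carry no positional encoding
    return torch.cat((x_rot, x_pass), dim=-1)
```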
Output gating. Every attention and DeltaNet output passes through a learned Swish gate: output = attention(x) * silu(W_gate * x). This prevents attention blowup and provides a gradient highway independent of the attention path.
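A minimal sketch of the gate; `mixer` stands in for either the full-attention or DeltaNet token mixer, and the module name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedOutput(nn.Module):
    """Swish/SiLU output gate: out = mixer(x) * silu(W_gate @ x)."""

    def __init__(self, hidden_size: int, mixer: nn.Module):
        super().__init__()
        self.mixer = mixer                                    # attention or DeltaNet block
        self.w_gate = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mixer(x) * F.silu(self.w_gate(x))         # gate the mixer output
```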
Training
| Parameter | Value |
|---|---|
| Data tokens | 10B–200B (overtrained regime, matched to validation goal) |
| Corpus | FineWeb-Edu + StarCoder2 code |
| TST | Token Superposition Training, s=4 bags, r=0.5 fraction |
| Objective | Phase 1: multi-hot cross-entropy (TST) → Phase 2: standard NTP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Peak LR | 3e-4 (from MuP sweep) |
| Schedule | Cosine decay to 10% peak |
| Weight decay | 0.1 |
| Batch size | ~4M tokens/step |
| Precision | BF16 weights, FP32 DeltaNet states |
| Hardware | 1× DGX Spark (Grace Hopper H200, 480GB unified memory) |
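A hedged reconstruction of the optimizer and schedule from the table; warmup length and total steps are placeholders, not values from the actual training script.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps, warmup_steps=1_000,
                                 peak_lr=3e-4, min_lr_ratio=0.10):
    # AdamW with the betas and weight decay listed in the table
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
    )

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup (assumed)
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine  # cosine decay to 10% of peak

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```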
TST (Token Superposition Training)
TST is a drop-in training acceleration method from Nous Research (arXiv:2605.06546, May 2026). During the first 50% of training, token embeddings are grouped into bags of 4 and averaged into single "superposed" embeddings. The model predicts the entire next bag using multi-hot cross-entropy, processing 4x more data tokens per forward pass. The second 50% of training reverts to standard autoregressive next-token prediction, allowing the model to recover fine-grained token-level behavior while retaining the richer representations learned during superposition.
At equal FLOPs, TST provides a 1.5–2.5x reduction in pretraining wall time with no architecture, tokenizer, or data changes.
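The sketch below illustrates the Phase 1 mechanics as described above (bagging embeddings and building multi-hot targets). It is not the Nous reference implementation, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def superpose_and_targets(token_ids, embedding, bag_size=4):
    """Group tokens into bags, average their embeddings, and build multi-hot targets.

    token_ids: (seq_len,) 1-D LongTensor of token ids
    embedding: the model's nn.Embedding
    Returns superposed inputs (n_bags-1, hidden) and multi-hot targets
    (n_bags-1, vocab); bag t is trained to predict every token in bag t+1.
    """
    n_bags = token_ids.shape[0] // bag_size
    ids = token_ids[: n_bags * bag_size].view(n_bags, bag_size)

    inputs = embedding(ids).mean(dim=1)              # superposed bag embeddings

    targets = torch.zeros(n_bags - 1, embedding.num_embeddings)
    targets.scatter_(1, ids[1:], 1.0)                # multi-hot over the next bag
    return inputs[:-1], targets

def multi_hot_cross_entropy(logits, targets):
    # normalize the multi-hot target into a distribution, then soft-target CE
    probs = targets / targets.sum(dim=-1, keepdim=True)
    return -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```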
Scaling Ladder
BOREAL-250M is the architectural proof point. Success here validates:
- DeltaNet convergence: linear attention trains stably with FP32 states
- TST acceleration: measurable throughput gain vs standard training
- KV cache reduction: 4–8x smaller than a pure Transformer at 32K context (worked through in the sketch below)
- Context generalization: loss scales gracefully to 8K, 16K, 32K
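A back-of-the-envelope check on the KV-cache bullet, assuming the pure-Transformer baseline is the same 12-layer stack with the table's GQA attention in every layer; a baseline with more KV heads per layer would push the ratio toward the 8x end.

```python
# Assumptions: BF16 cache, 2 KV heads, head_dim 256, only full-attention layers cache KV.
CONTEXT, KV_HEADS, HEAD_DIM, BYTES_BF16 = 32_768, 2, 256, 2

def kv_cache_bytes(num_full_attn_layers):
    # K and V cached per layer, per token, per KV head
    return num_full_attn_layers * 2 * KV_HEADS * HEAD_DIM * CONTEXT * BYTES_BF16

hybrid = kv_cache_bytes(3)     # BOREAL: only 3 of 12 layers use full attention
baseline = kv_cache_bytes(12)  # pure-Transformer baseline: all 12 layers
print(f"{hybrid / 2**20:.0f} MiB vs {baseline / 2**20:.0f} MiB -> {baseline / hybrid:.1f}x")
# 192 MiB vs 768 MiB -> 4.0x
```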
If these hold, the same architecture scales directly to:
| Model | Params | Type | Context | Status |
|---|---|---|---|---|
| BOREAL-250M | 250M | Dense | 32K | In training |
| BOREAL-2B | 2B | Dense DeltaNet | 64K | Planned |
| BOREAL-10B-MoE | ~10B / ~2B active | DeltaNet + MoE | 256K | Planned |
The 2B is the community release: a model people can download, use, and benchmark. The 10B MoE is the target: 128 experts, 8 active per token, DeepSeek-V4-style hash routing with no auxiliary loss, a shared expert, and 256K native context. The goal is Qwen3.5-9B-class performance with ~2B active parameters.
Expectations
BOREAL-250M is not a competitive standalone model. It's an architecture validation tool. Expect:
- Coherent text generation: readable, makes sense, occasionally factual
- Above-random benchmarks: 35–40% HellaSwag, 55–60% ARC-Easy
- Clean scaling curves: log-linear loss vs tokens through 200B+
- Long-context advantage: lower perplexity than pure Transformer baselines at 8K+ context lengths
For a model you'd actually use downstream, see BOREAL-2B.
License
Apache 2.0 (Qwen3 tokenizer lineage).
Author
Developed by DJLougen.
Trained on a DGX Spark in Toronto. Compute self-funded by a visual neuroscience PhD student who spends too much time thinking about attention mechanisms.