Hybrid SPLM (Scalar-Potential + Attention Language Model)

The Hybrid SPLM combines an attention front-end with a scalar-potential refinement back-end in a two-stage architecture. The attention blocks gather global context across positions (what attention does best), then the SPLM blocks refine each position deterministically through a learned energy field (what conservative dynamics does best). At decode time, the SPLM tail costs $O(d^2)$ per token, independent of sequence length. This is the FLOP-efficiency hypothesis the hybrid is designed to test.

This model achieves 8.50 PPL on TinyStories, within 0.69 PPL of matched pure attention (7.81), and is the best-performing variant in the Semantic Simulation SPLM family.

Model Details

Model Description

The Hybrid SPLM (Variant A) is a two-stage autoregressive language model:

  1. Attention stage (k=4 blocks): Standard causal multi-head self-attention blocks with residual connections, producing a contextualised representation.
  2. SPLM stage (m=4 steps): Conservative scalar-potential dynamics that refine the attention output through gradient-driven integration steps.

The single causal cumulative-mean context $\xi$ is re-derived from the detached attention output, preserving the causal-honesty invariant. A single shared $V_\theta$ drives all integration steps.
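The context re-derivation described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `causal_cumulative_mean` is taken from the architecture diagram, while the tensor layout `(batch, T, d)` is an assumption.

```python
import torch

def causal_cumulative_mean(h: torch.Tensor) -> torch.Tensor:
    """Running mean over positions: xi[:, t] = mean(h[:, :t+1]).

    h is assumed to have shape (batch, T, d); position t only sees positions <= t.
    """
    T = h.size(1)
    counts = torch.arange(1, T + 1, device=h.device, dtype=h.dtype).view(1, T, 1)
    return h.cumsum(dim=1) / counts

h = torch.randn(2, 5, 8, requires_grad=True)   # stand-in for the attention output
xi = causal_cumulative_mean(h.detach())         # detach: no gradient flows back through xi
```

Because each `xi[:, t]` averages only positions up to `t`, editing a later position leaves earlier context vectors unchanged, which is the property the causal-honesty invariant relies on.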

  • Developed by: Dimitar P. Gueorguiev (Independent Researcher)
  • Model type: Hybrid attention + conservative autoregressive language model
  • Language: English
  • License: CC-BY-4.0

Model Sources

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   === ATTENTION STAGE (k=4 blocks) ===
   For i = 1..4:
       h = AttnBlock_i(h)             [causal multi-head self-attention + FFN]
       |
   LayerNorm(h)
       |
   === SPLM STAGE (m=4 steps) ===
   xi = causal_cumulative_mean(h.detach())   [leak-safe re-derivation]
       |
   For j = 1..4:
       f = -grad_h V_theta(xi, h)     [conservative force from shared potential]
       v = (v + dt*f/m) / (1 + dt*gamma)
       h = h + dt*v
       LayerNorm(h)
       |
   Logits = h @ E^T                   [tied embeddings]

| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Attention blocks (k) | 4 |
| SPLM steps (m) | 4 |
| Attention heads | 4 |
| $V_\theta$ hidden / depth | 1024 / 3 |
| $V_\theta$ input dim | $2d = 512$ |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping $\gamma$ | 0.166 (learned, init 0.15) |
| Total parameters | ~18,960,000 |
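
The SPLM stage of the diagram can be sketched in PyTorch as below. This is a toy illustration under stated assumptions, not the model's code: `ScalarPotential` and `splm_step` are hypothetical names, the potential is a small stand-in MLP, the mass is set to 1 instead of the logfreq lookup, and `dt=1.0` is a placeholder since the step size is not given here.

```python
import torch
import torch.nn as nn

class ScalarPotential(nn.Module):
    """Toy stand-in for V_theta: maps concat(xi, h) of size 2d to one scalar per position."""
    def __init__(self, d: int, hidden: int = 64, depth: int = 3):
        super().__init__()
        layers, in_dim = [], 2 * d
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.GELU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, xi, h):
        return self.net(torch.cat([xi, h], dim=-1)).squeeze(-1)  # (B, T)

def splm_step(V, xi, h, v, mass, dt=1.0, gamma=0.15):
    """One damped step: force from the potential, velocity update, position update."""
    if not h.requires_grad:
        h = h.detach().requires_grad_(True)
    energy = V(xi, h).sum()  # xi is constant, so each position's energy depends only on its own h
    (grad_h,) = torch.autograd.grad(energy, h, create_graph=True)  # keep graph so V can be trained
    f = -grad_h                                   # conservative force
    v = (v + dt * f / mass) / (1.0 + dt * gamma)  # damped velocity update
    return h + dt * v, v

torch.manual_seed(0)
B, T, d = 2, 6, 16
V, norm = ScalarPotential(d), nn.LayerNorm(d)
h = torch.randn(B, T, d, requires_grad=True)              # stand-in for the attention output
xi = h.detach().cumsum(1) / torch.arange(1, T + 1).view(1, T, 1)
v = torch.zeros(B, T, d)
mass = torch.ones(B, T, 1)                                # assumption: unit mass
for _ in range(4):                                        # m = 4 steps, one shared potential
    h, v = splm_step(V, xi, h, v, mass)
    h = norm(h)
```

The same `V` instance is reused in every iteration, mirroring the single shared potential; only `h` and `v` change between steps.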

Key Design Properties

  • Best of both worlds: Attention handles global token routing; conservative dynamics handles local refinement.
  • FLOP-efficient decoding: At long context ($T \gg 1$), the SPLM tail adds $O(d^2)$ per token vs $O(Td)$ for attention with KV-cache. The embed+logits floor limits short-context savings to ~9%, but at $T \geq 4096$ the hybrid achieves -39% decode FLOPs at PPL parity.
  • Helmholtz decomposition: The two-stage architecture realises a learned Helmholtz decomposition -- the attention front-end breaks the conservative gauge, and the SPLM back-end operates in a gauge-fixed space (Section 17b of the paper).

Why Not a Pure Transformer?

The Hybrid SPLM is a two-stage architecture that uses attention only in the front-end (4 blocks) and replaces the remaining computation with scalar-potential gradient dynamics (4 SPLM steps). Unlike a pure Transformer, the SPLM refinement stage has no KV-cache, no FFN towers, and is driven entirely by a single small scalar-potential MLP, $V_\theta$ (3 layers, 1024 hidden units).

Key structural differences from Transformers:

| Property | Transformer (GPT-2 small) | Hybrid SPLM (this model) |
|---|---|---|
| Architecture | 12 self-attention + FFN blocks | 4 attention blocks + 4 SPLM steps |
| Core computation | 50.3M (MLP) + 28.3M (attention) | Attention front-end + 3.4M $V_\theta$ |
| Runtime state per token | $O(T)$, full KV-cache | $O(T)$, KV-cache in attention stage |
| Total parameters | 124M | ~19.0M |

Note: Because this model retains attention blocks in the front-end, its runtime memory is still $O(T)$ in sequence length (due to the KV-cache in the attention stage). However, the SPLM tail contributes only $O(d^2)$ per token, and at long context ($T \geq 4096$) the hybrid achieves -39% decode FLOPs vs matched pure attention. For fully $O(1)$ inference, see the Multi-Xi SPLM, PARFLM, and Fock-PARFLM.
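
To make the $O(T)$ terms concrete, here is a back-of-envelope sketch for the attention stage only. It assumes a standard multi-head KV-cache (full keys and values, each of width d per block), fp32 storage, and a multiply-add counted as 2 FLOPs; the function names are hypothetical, and the numbers are not the paper's FLOP accounting.

```python
def kv_cache_bytes(T, d=256, n_attn=4, bytes_per_value=4):
    """Keys and values: 2 vectors of width d per attention block per cached token."""
    return 2 * n_attn * d * bytes_per_value * T

def attn_context_flops(T, d=256, n_attn=4):
    """Per new token and block: ~2*T*d for query-key scores plus ~2*T*d for the weighted value sum."""
    return 4 * T * d * n_attn

for T in (512, 4096):
    mib = kv_cache_bytes(T) / 2**20
    mflops = attn_context_flops(T) / 1e6
    print(f"T={T}: cache {mib:.1f} MiB, context term {mflops:.2f} MFLOPs per token")
```

Both quantities grow linearly in T, whereas the cost of an SPLM step depends only on the size of $V_\theta$.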

Figure: Runtime information capacity vs sequence length.

Geometric Capabilities

Note: This model includes attention blocks in its front-end, breaking the conservative-by-construction guarantee. The full damped Riemannian geometry — layer-dependent Jacobi metric, damped geodesics with friction, energy dissipation anomaly detection, curvature proxy — is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. A Riemannian Geometry Diagnostic Battery (June 2026) confirmed the metric validity and characterised the damping-dominated dynamics across those models. See the companion note for the full damped framework.

However, the SPLM refinement layers in this hybrid architecture still use the same damped Euler integration with $V_\theta$ as the purely conservative variants. The geometric structure (metric, geodesics, curvature) is well-defined within those layers; the attention front-end only prevents the global conservative guarantee from holding end-to-end.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "hybrid")
sys.path.insert(0, "sarf_mass_variant")
sys.path.insert(0, ".")

from hybrid.model_hybrid import HybridSPLM, HSPLMConfig

config = HSPLMConfig(
    vocab_size=50257,      # GPT-2 BPE
    d=256,
    n_attn=4,
    n_splm=4,
    n_head=4,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma_init=0.15,
)

model = HybridSPLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
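
Once a forward pass works, text can be generated by repeated argmax over the last position. The sketch below is generic and hedged: `greedy_decode` is a hypothetical helper, not part of the repository, and it is demonstrated on a toy logits function so that it runs without the model code.

```python
import torch

@torch.no_grad()
def greedy_decode(logits_fn, ids, n_new, block_size=512):
    """Append n_new tokens to ids (shape (B, T)) by repeated argmax.

    logits_fn is any callable mapping token ids (B, T) to logits (B, T, vocab).
    """
    for _ in range(n_new):
        window = ids[:, -block_size:]            # stay within the context window
        next_id = logits_fn(window)[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# Toy stand-in: always predicts (last token + 1) mod vocab
def toy_logits(ids, vocab=10):
    return torch.nn.functional.one_hot((ids + 1) % vocab, vocab).float()

out = greedy_decode(toy_logits, torch.tensor([[3]]), n_new=4)
print(out.tolist())  # [[3, 4, 5, 6, 7]]
```

With the model above, something like `greedy_decode(lambda ids: model(ids)[0], x, n_new=32)` may work, assuming the forward method accepts a call without `targets` and returns logits first; check the repository for the actual signature.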

Available Checkpoint

A trained checkpoint (PPL 8.01, 16k steps) is included in this repository:

| File | Description |
|---|---|
| checkpoint/model.pt | Full model state dict (73 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| v_theta_hist.png | V_theta histogram |
| landscape_stats.json | Landscape statistics (JSON) |

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-hybrid-splm",
    filename="checkpoint/model.pt",
)

# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories -- tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 8,000 |
| Seeds | 5 (mean PPL 8.503 ± 0.035) |
| Hardware | A100 40GB (Google Colab) |
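
The learning-rate schedule implied by these settings can be sketched as a pure function. Only the peak rate (5e-4), the 400 warmup steps, the 8,000 total steps, and "cosine decay" come from the table; the linear warmup shape and the final rate `min_lr=0.0` are assumptions, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, base_lr=5e-4, warmup=400, total=8000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr at step `total`."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 399, 4200, 8000):
    print(s, lr_at(s))
```

A function like this can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / base_lr` as the multiplier.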

Training Script

notebooks/conservative_arch/scaleup/train_hybrid_scaleup.py

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn (this model) | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM | 11.51 | 16.5M | +3.70 |

The Hybrid SPLM achieves the best PPL in the family (8.50), but uses attention and therefore does not satisfy the conservative-by-construction constraint that the pure SPLM / PARFLM / Fock variants maintain. It serves as the empirical bridge between attention and the fully conservative designs.
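
For readers comparing these numbers with training losses, the sketch below shows the usual relation between perplexity and mean cross-entropy. It assumes PPL is computed as the exponential of the mean token-level loss in nats, which is the common convention but is not confirmed here; `perplexity` is a hypothetical helper.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    """exp of the mean token-level cross-entropy (natural log)."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(nll.item())

# Sanity check: uniform logits over V classes give PPL = V
V = 50
logits = torch.zeros(2, 7, V)
targets = torch.randint(0, V, (2, 7))
ppl = perplexity(logits, targets)

# Under this convention, the table's PPL values map to mean losses in nats
loss_hybrid = math.log(8.50)      # about 2.14
loss_attention = math.log(7.81)   # about 2.06
```

Under this convention the +0.69 PPL gap corresponds to a difference of roughly 0.08 nats per token in mean loss.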

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | $O(1)$ | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | $O(T)$ | this model |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | $O(1)$ | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | $O(1)$ | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | $O(T^2)$ | semsimula-fock-attention |

Bias, Risks, and Limitations

  • Research checkpoint only. Proof-of-concept for the hybrid conservative/attention architecture, not a production system.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
  • English only. No multilingual capability.
  • Small scale. ~19M parameters, 256-dim hidden states.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~4 hours (8,000 steps, 5 seeds)
  • Carbon footprint: Estimated < 5 kg CO2