Hybrid SPLM (Scalar-Potential + Attention Language Model)

The Hybrid SPLM combines an attention front-end with a scalar-potential refinement back-end in a two-stage architecture. The attention blocks gather global context across positions (what attention does best), then the SPLM blocks refine each position deterministically through a learned energy field (what conservative dynamics does best). At decode time, the SPLM tail costs $O (d^{2})$ per token independent of sequence length -- the FLOP-efficiency hypothesis the hybrid is designed to test.

This model achieves 8.50 PPL on TinyStories, within 0.69 PPL of matched pure attention (7.81), and is the best-performing variant in the Semantic Simulation SPLM family.

Model Details
Architecture
Why Not a Pure Transformer?
Geometric Capabilities
How to Get Started
Training Details
Evaluation Results
SPLM Family Overview
Bias, Risks, and Limitations
Citation
Environmental Impact

Model Details

Model Description

The Hybrid SPLM (Variant A) is a two-stage autoregressive language model:

Attention stage (k=4 blocks): Standard causal multi-head self-attention blocks with residual connections, producing a contextualised representation.
SPLM stage (m=4 steps): Conservative scalar-potential dynamics that refine the attention output through gradient-driven integration steps.

The single causal cumulative-mean context $\xi$ is re-derived from the detached attention output, preserving the causal-honesty invariant. A single shared $V_\theta$ drives all integration steps.
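
As a rough illustration of the context re-derivation described above, the sketch below computes a causal cumulative mean over plain Python lists standing in for tensors. It assumes the mean at position t includes position t itself, which the text does not state, and it does not model the detach. The function here is an illustrative stand-in, not the code in model_hybrid.py. In PyTorch the same idea would presumably be a cumulative sum over the sequence axis divided by the position index.

```python
def causal_cumulative_mean(h):
    """Return xi where xi[t] is the per-feature mean of h[0..t], inclusive.

    h is a list of positions, each a list of d floats (a stand-in for a
    (T, d) tensor). Including position t in its own mean is an assumption.
    """
    if not h:
        return []
    d = len(h[0])
    running = [0.0] * d          # running per-feature sum over positions seen so far
    xi = []
    for t, row in enumerate(h, start=1):
        running = [acc + x for acc, x in zip(running, row)]
        xi.append([acc / t for acc in running])
    return xi


h = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
xi = causal_cumulative_mean(h)

# Causality check: changing a later position must not change earlier context rows.
h_future = [h[0], h[1], [100.0, -100.0]]
xi_future = causal_cumulative_mean(h_future)
print(xi)
print(xi_future[:2] == xi[:2])
```

Because each row of xi depends only on positions up to and including t, perturbing a future token leaves earlier rows unchanged, which is the property the leak-safe re-derivation relies on.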

Developed by: Dimitar P. Gueorguiev (Independent Researcher)
Model type: Hybrid attention + conservative autoregressive language model
Language: English
License: CC-BY-4.0

Model Sources

Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
Repository: github.com/dimitarpg13/semsimula-paper
Model source code: notebooks/conservative_arch/hybrid/model_hybrid.py

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   === ATTENTION STAGE (k=4 blocks) ===
   For i = 1..4:
       h = AttnBlock_i(h)             [causal multi-head self-attention + FFN]
       |
   LayerNorm(h)
       |
   === SPLM STAGE (m=4 steps) ===
   xi = causal_cumulative_mean(h.detach())   [leak-safe re-derivation]
       |
   For j = 1..4:
       f = -grad_h V_theta(xi, h)     [conservative force from shared potential]
       v = (v + dt*f/m) / (1 + dt*gamma)   [damped velocity update; m here is the per-token mass, not the step count]
       h = h + dt*v
       LayerNorm(h)
       |
   Logits = h @ E^T                   [tied embeddings]
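
The SPLM-stage update in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the model's implementation: the quadratic toy_potential and its hand-written gradient replace the learned $V_\theta$ and autograd, the context xi and the LayerNorm are omitted, and dt = 0.1 and mass = 1.0 are hypothetical values. Only gamma = 0.166 comes from the parameter table.

```python
def splm_step(h, v, grad_V, dt, mass, gamma):
    """One damped integration step, following the diagram's update order."""
    f = [-g for g in grad_V(h)]                        # conservative force f = -grad V
    v = [(vi + dt * fi / mass) / (1.0 + dt * gamma)    # velocity update with friction
         for vi, fi in zip(v, f)]
    h = [hi + dt * vi for hi, vi in zip(h, v)]         # position update uses the new velocity
    return h, v


def toy_potential(h):
    """Hypothetical stand-in for V_theta: V(h) = 0.5 * |h|^2."""
    return 0.5 * sum(x * x for x in h)


def toy_grad(h):
    """Analytic gradient of toy_potential (autograd would supply this in practice)."""
    return list(h)


def energy(h, v, mass):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * sum(x * x for x in v) + toy_potential(h)


h, v = [1.0, -2.0], [0.0, 0.0]
dt, mass, gamma = 0.1, 1.0, 0.166      # dt and mass are assumed; gamma is from the table
e_start = energy(h, v, mass)
for _ in range(200):
    h, v = splm_step(h, v, toy_grad, dt, mass, gamma)
e_end = energy(h, v, mass)
print(f"energy: {e_start:.3f} -> {e_end:.5f}")
```

On this toy potential the damping term drains energy over many steps, so the state settles toward the minimum of the potential. The real model runs only 4 such steps with a learned potential.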

Parameter	Value
Hidden dim (d)	256
Attention blocks (k)	4
SPLM steps (m)	4
Attention heads	4
$V_\theta$ hidden / depth	1024 / 3
$V_\theta$ input dim	$2 d = 512$
Mass model	logfreq (frozen surprisal lookup)
Damping $\gamma$	0.166 (learned, init 0.15)
Total parameters	~18,960,000

Key Design Properties

Best of both worlds: Attention handles global token routing; conservative dynamics handles local refinement.
FLOP-efficient decoding: At long context ($T \gg 1$), the SPLM tail adds $O (d^{2})$ per token versus $O (T d)$ for attention with a KV-cache. The embed+logits floor limits short-context savings to ~9%, but at $T \geq 4096$ the hybrid uses 39% fewer decode FLOPs at PPL parity.
Helmholtz decomposition: The two-stage architecture realises a learned Helmholtz decomposition -- the attention front-end breaks the conservative gauge, and the SPLM back-end operates in a gauge-fixed space (Section 17b of the paper).
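
The scaling argument behind the decoding claim can be made concrete with an order-of-magnitude sketch. All constant factors are deliberately dropped, so it does not reproduce the ~9% or 39% figures above. It only shows how the $T d$ attention term grows relative to the $d^{2}$ SPLM term and to the vocabulary-sized logits floor. The function names are illustrative, not from the repository.

```python
def attention_context_cost(T, d):
    """Per-token, per-block cost of attending over T cached positions (~ T * d)."""
    return T * d


def splm_tail_cost(d):
    """Per-token, per-step cost of the SPLM tail (~ d * d), independent of T."""
    return d * d


d = 256
vocab = 50257
logits_floor = vocab * d   # tied-embedding logits: one length-d dot product per vocabulary entry

for T in (64, 256, 1024, 4096):
    attn = attention_context_cost(T, d)
    print(
        f"T={T:5d}  attn/splm = {attn / splm_tail_cost(d):6.2f}  "
        f"attn/logits_floor = {attn / logits_floor:.4f}"
    )
```

In this simplified picture the attention term overtakes the SPLM term once $T > d$, while the logits floor is the same for every architecture and dominates at short context. That is consistent with the savings appearing mainly at long $T$.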

Why Not a Pure Transformer?

The Hybrid SPLM uses attention only in the front-end (4 blocks) and replaces the remaining computation with scalar-potential gradient dynamics (4 SPLM steps). Unlike a pure Transformer, the SPLM refinement stage has no KV-cache and no FFN towers; it is driven entirely by a single small scalar-potential MLP, $V_\theta$ (3 layers, 1024 hidden units).

Key structural differences from Transformers:

Property	Transformer (GPT-2 small)	Hybrid SPLM (this model)
Architecture	12 self-attention + FFN blocks	4 attention blocks + 4 SPLM steps
Core computation	56.6M (MLP) + 28.3M (attention)	Attention front-end + 3.4M $V_\theta$
Runtime state per token	$O (T)$ — full KV-cache	$O (T)$ — KV-cache in attention stage
Total parameters	124M	~19.0M

Note: Because this model retains attention blocks in the front-end, its runtime memory is still $O (T)$ in sequence length (due to the KV-cache in the attention stage). However, the SPLM tail contributes only $O (d^{2})$ per token, and at long context ($T \geq 4096$) the hybrid uses 39% fewer decode FLOPs than matched pure attention. For fully $O (1)$ inference, see the Multi-Xi SPLM, PARFLM, and Fock-PARFLM.
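
To give a sense of scale for the $O (T)$ memory term, the sketch below estimates the KV-cache size of the 4 attention blocks. It assumes a standard cache layout (one key and one value vector of width $d$ per block per cached position), 2-byte (fp16/bf16) storage, and batch size 1. None of these is stated in this card, and kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_attn_blocks=4, d=256, bytes_per_value=2):
    """Estimated KV-cache size in bytes for one sequence.

    Assumes keys and values are each stored as a width-d vector per block
    per position; bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * n_attn_blocks * d * bytes_per_value * seq_len


for T in (512, 4096):
    print(f"T={T:5d}  KV-cache ~ {kv_cache_bytes(T) / 2**20:.0f} MiB")
```

The cache grows linearly with context length, whereas the SPLM stage carries only the current hidden state and velocity for the position being decoded.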

Geometric Capabilities

Note: This model includes attention blocks in its front-end, breaking the conservative-by-construction guarantee. The full damped Riemannian geometry — layer-dependent Jacobi metric, damped geodesics with friction, energy dissipation anomaly detection, curvature proxy — is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. A Riemannian Geometry Diagnostic Battery (June 2026) confirmed the metric validity and characterised the damping-dominated dynamics across those models. See the companion note for the full damped framework.

However, the SPLM refinement layers in this hybrid architecture still use the same damped Euler integration with $V_\theta$ as the purely conservative variants. The geometric structure (metric, geodesics, curvature) is well-defined within those layers — the attention front-end only prevents the global conservative guarantee from holding end-to-end.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "hybrid")
sys.path.insert(0, "sarf_mass_variant")
sys.path.insert(0, ".")

from hybrid.model_hybrid import HybridSPLM, HSPLMConfig

config = HSPLMConfig(
    vocab_size=50257,      # GPT-2 BPE
    d=256,
    n_attn=4,
    n_splm=4,
    n_head=4,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma_init=0.15,
)

model = HybridSPLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

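# Forward pass on a random batch (batch size 1, sequence length 64);
# returns the logits and the language-modelling loss against `targets`
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)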

Available Checkpoint

A trained checkpoint (PPL 8.01, 16k steps) is included in this repository:

File	Description
`checkpoint/model.pt`	Full model state dict (73 MB)
`training_log.jsonl`	Per-step training metrics
`loss_curve.png`	Training/validation loss plot
`v_theta_hist.png`	V_theta histogram
`landscape_stats.json`	Landscape statistics (JSON)

To load the checkpoint:

from huggingface_hub import hf_hub_download
import torch

# Download checkpoint
ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-hybrid-splm",
    filename="checkpoint/model.pt",
)

# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories -- tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

Hyperparameter	Value
Optimizer	AdamW
Learning rate	5e-4 (cosine decay)
Warmup steps	400
Weight decay	0.01
Gradient clipping	1.0
Batch size	16
Block size	512
Training steps	8,000
Seeds	5 (validation PPL mean 8.503 +/- 0.035)
Hardware	A100 40GB (Google Colab)
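
The learning-rate settings in the table can be read as the schedule sketched below. It assumes linear warmup from zero and cosine decay to a floor of zero by the final step. The table specifies only the peak rate, the warmup length, and "cosine decay", so the warmup shape and min_lr are assumptions, and lr_at is a hypothetical helper rather than code from train_hybrid_scaleup.py.

```python
import math


def lr_at(step, peak_lr=5e-4, warmup_steps=400, total_steps=8000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Token budget implied by the table: batch size 16, block size 512, 8,000 steps.
tokens_per_step = 16 * 512
total_tokens_seen = tokens_per_step * 8000

for step in (0, 200, 400, 4200, 8000):
    print(f"step {step:5d}  lr = {lr_at(step):.2e}")
print(f"tokens per step: {tokens_per_step:,}; total tokens seen: {total_tokens_seen:,}")
```

Under these settings each run processes about 65.5M tokens, roughly 13 passes over the 5M-token training cap.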

Training Script

notebooks/conservative_arch/scaleup/train_hybrid_scaleup.py

Evaluation Results

TinyStories Validation Perplexity

Model	PPL	Params	Gap vs Attention
Matched Attention (baseline)	7.81	19.5M	--
Hybrid SPLM+Attn (this model)	8.50	~19.0M	+0.69
Fock-PARFLM v2.1	9.30	17.4M	+1.49
Fock Attention	9.42	16.7M	+1.61
Multi-Xi SPLM	11.51	16.5M	+3.70
Multi-Xi PARFLM	12.06	17.6M	+4.25

The Hybrid SPLM achieves the best PPL in the family (8.50), but uses attention and therefore does not satisfy the conservative-by-construction constraint that the pure SPLM / PARFLM / Fock variants maintain. It serves as the empirical bridge between attention and the fully conservative designs.
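
Since perplexity is the exponential of the mean per-token cross-entropy, the gaps in the table can also be expressed in nats per token. The sketch below recomputes the "Gap vs Attention" column from the reported PPL values and adds the equivalent loss gap. It uses only the numbers in the table and assumes the PPL figures are natural-log based.

```python
import math

# Validation perplexities as reported in the table above.
results = {
    "Matched Attention (baseline)": 7.81,
    "Hybrid SPLM+Attn (this model)": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi SPLM": 11.51,
    "Multi-Xi PARFLM": 12.06,
}
baseline = results["Matched Attention (baseline)"]

gaps = {}
for name, ppl in results.items():
    gap_ppl = ppl - baseline                         # the "Gap vs Attention" column
    gap_nats = math.log(ppl) - math.log(baseline)    # the same gap in mean cross-entropy
    gaps[name] = (gap_ppl, gap_nats)
    print(f"{name:32s} PPL {ppl:5.2f}  gap {gap_ppl:+.2f} PPL  {gap_nats:+.4f} nats/token")
```

For the hybrid, the 0.69 PPL gap corresponds to roughly 0.085 nats per token of additional cross-entropy relative to the matched attention baseline.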

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family:

Model	Design	Inference	HuggingFace
Multi-Xi SPLM	Pure scalar potential, K-EMA context	$O (1)$	semsimula-splm-multixi
Hybrid SPLM+Attn	Attention front-end + SPLM refinement	$O (T)$	this model
Multi-Xi PARFLM	Scalar potential + sparse pairwise forces	$O (1)$	semsimula-parflm-multixi
Fock-PARFLM v2.1	PARFLM + Fock register pool (mediated exchange)	$O (1)$	semsimula-fock-parflm
Fock Attention	PARFLM + direct token-to-token exchange	$O (T^{2})$	semsimula-fock-attention

Bias, Risks, and Limitations

Research checkpoint only. Proof-of-concept for the hybrid conservative/attention architecture, not a production system.
TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
English only. No multilingual capability.
Small scale. ~19M parameters, 256-dim hidden states.
No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

Hardware: NVIDIA A100 40GB (Google Colab)
Training time: ~4 hours (8,000 steps, 5 seeds)
Carbon footprint: Estimated < 5 kg CO2
