Hybrid SPLM (Scalar-Potential + Attention Language Model)
The Hybrid SPLM combines an attention front-end with a scalar-potential refinement back-end in a two-stage architecture. The attention blocks gather global context across positions (what attention does best), then the SPLM blocks refine each position deterministically through a learned energy field (what conservative dynamics does best). At decode time, the per-token cost of the SPLM tail is independent of sequence length; this is the FLOP-efficiency hypothesis the hybrid is designed to test.
This model achieves 8.50 PPL on TinyStories, within 0.69 PPL of matched pure attention (7.81), and is the best-performing variant in the Semantic Simulation SPLM family.
Table of Contents
- Model Details
- Architecture
- Why Not a Pure Transformer?
- Geometric Capabilities
- How to Get Started
- Training Details
- Evaluation Results
- SPLM Family Overview
- Bias, Risks, and Limitations
- Citation
- Environmental Impact
Model Details
Model Description
The Hybrid SPLM (Variant A) is a two-stage autoregressive language model:
- Attention stage (k=4 blocks): Standard causal multi-head self-attention blocks with residual connections, producing a contextualised representation.
- SPLM stage (m=4 steps): Conservative scalar-potential dynamics that refine the attention output through gradient-driven integration steps.
The single causal cumulative-mean context xi is re-derived from the detached attention output, preserving the causal-honesty invariant. A single shared potential V_theta drives all integration steps.
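The causal cumulative mean described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function name and the list-of-lists layout are assumptions, and the `detach()` step is an autograd concern that has no counterpart here.

```python
from typing import List


def causal_cumulative_mean(h: List[List[float]]) -> List[List[float]]:
    """Return xi where xi[t] is the mean of h[0..t] (inclusive).

    Position t only ever sees positions <= t, so no future
    information leaks into the context vector.
    """
    if not h:
        return []
    running = [0.0] * len(h[0])  # running sum over positions seen so far
    xi = []
    for t, row in enumerate(h, start=1):
        running = [r + x for r, x in zip(running, row)]
        xi.append([r / t for r in running])
    return xi


# Three positions, hidden dim 2 (toy values, not model activations).
h = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
xi = causal_cumulative_mean(h)
print(xi)  # [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Changing a later position leaves every earlier context vector untouched, which is the property the causal-honesty invariant relies on.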
- Developed by: Dimitar P. Gueorguiev (Independent Researcher)
- Model type: Hybrid attention + conservative autoregressive language model
- Language: English
- License: CC-BY-4.0
Model Sources
- Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
- Repository: github.com/dimitarpg13/semsimula-paper
- Model source code: notebooks/conservative_arch/hybrid/model_hybrid.py
Architecture
Input tokens x_1, ..., x_T
|
Embedding E[x] + positional encoding
|
=== ATTENTION STAGE (k=4 blocks) ===
For i = 1..4:
h = AttnBlock_i(h) [causal multi-head self-attention + FFN]
|
LayerNorm(h)
|
=== SPLM STAGE (m=4 steps) ===
xi = causal_cumulative_mean(h.detach()) [leak-safe re-derivation]
|
For j = 1..4:
f = -grad_h V_theta(xi, h) [conservative force from shared potential]
v = (v + dt*f/m) / (1 + dt*gamma)
h = h + dt*v
LayerNorm(h)
|
Logits = h @ E^T [tied embeddings]
| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Attention blocks (k) | 4 |
| SPLM steps (m) | 4 |
| Attention heads | 4 |
| V_theta hidden / depth | 1024 / 3 |
| V_theta input dim | |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping | 0.166 (learned, init 0.15) |
| Total parameters | ~18,960,000 |
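The SPLM stage update in the diagram above can be sketched for a single position as follows. This is a toy under stated assumptions, not the model code: the quadratic potential `toy_grad_V` stands in for the learned V_theta, the step size `dt=0.5`, the scalar `mass`, and the zero initial velocity are all hypothetical choices, and the LayerNorm is omitted. Only the update rule itself and the damping value 0.166 come from this card.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def splm_refine(
    h: Vector,
    xi: Vector,
    grad_V: Callable[[Vector, Vector], Vector],
    n_steps: int = 4,      # m = 4 SPLM steps
    dt: float = 0.5,       # hypothetical step size
    gamma: float = 0.166,  # damping value reported in the table
    mass: float = 1.0,     # hypothetical scalar mass
) -> Tuple[Vector, Vector]:
    """Apply n_steps of the damped update to one hidden state."""
    v = [0.0] * len(h)  # assumed zero initial velocity
    for _ in range(n_steps):
        # Conservative force: negative gradient of the shared potential.
        f = [-g for g in grad_V(xi, h)]
        # Damping enters through the (1 + dt*gamma) denominator.
        v = [(vi + dt * fi / mass) / (1.0 + dt * gamma) for vi, fi in zip(v, f)]
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h, v


def toy_grad_V(xi: Vector, h: Vector) -> Vector:
    """Gradient w.r.t. h of the stand-in potential V = 0.5 * ||h - xi||^2."""
    return [hi - xii for hi, xii in zip(h, xi)]


h_out, v_out = splm_refine([1.0, -2.0], [0.0, 0.0], toy_grad_V)
print(h_out, v_out)
```

Because the same `grad_V` is called on every step, the sketch mirrors the shared-potential design: depth comes from repeated integration rather than from additional parameters.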
Key Design Properties
- Best of both worlds: Attention handles global token routing; conservative dynamics handles local refinement.
- FLOP-efficient decoding: At long context, the per-token cost added by the SPLM tail does not grow with sequence length, unlike attention with a KV-cache. The embed+logits floor limits short-context savings to ~9%, but at long context the hybrid achieves -39% decode FLOPs at PPL parity.
- Helmholtz decomposition: The two-stage architecture realises a learned Helmholtz decomposition -- the attention front-end breaks the conservative gauge, and the SPLM back-end operates in a gauge-fixed space (Section 17b of the paper).
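The scaling argument behind the FLOP-efficiency claim can be illustrated with a rough cost model. This sketch counts only the context-dependent part of attention decoding (score computation plus the weighted value sum, roughly 4·T·d FLOPs per block per new token) and treats the SPLM tail as a fixed per-step cost. The function names and the `per_step_flops` placeholder are hypothetical, and the sketch is not intended to reproduce the ~9% or -39% figures reported above.

```python
def attention_context_flops(context_len: int, d_model: int, n_blocks: int) -> int:
    """Context-dependent decode FLOPs for one new token.

    Per block: ~2*T*d for the query-key scores and ~2*T*d for the
    attention-weighted sum over cached values.
    """
    return n_blocks * 4 * context_len * d_model


def constant_tail_flops(per_step_flops: int, n_steps: int) -> int:
    """Decode FLOPs for a tail whose cost does not depend on context length."""
    return per_step_flops * n_steps


# Hypothetical placeholder; the real per-step cost depends on V_theta.
PER_STEP = 1_000_000

for T in (512, 2048, 8192):
    attn = attention_context_flops(T, d_model=256, n_blocks=4)
    tail = constant_tail_flops(PER_STEP, n_steps=4)
    print(f"T={T:5d}  attention term={attn:>12,}  tail term={tail:>12,}")
```

The attention term grows linearly with T while the tail term stays fixed, which is why any savings from replacing attention blocks with SPLM steps are expected to widen at long context.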
Why Not a Pure Transformer?
The Hybrid SPLM is a two-stage architecture that uses attention only in the front-end (4 blocks) and replaces the remaining computation with scalar-potential gradient dynamics (4 SPLM steps). Unlike a pure Transformer, the SPLM refinement stage has no KV-cache and no FFN towers, and is driven entirely by a single small scalar-potential MLP, V_theta (3-layer, 1024-hidden).
Key structural differences from Transformers:
| Property | Transformer (GPT-2 small) | Hybrid SPLM (this model) |
|---|---|---|
| Architecture | 12 self-attention + FFN blocks | 4 attention blocks + 4 SPLM steps |
| Core computation | 50.3M (MLP) + 28.3M (attention) | Attention front-end + 3.4M (V_theta) |
| Runtime state per token | Full KV-cache | KV-cache in attention stage only |
| Total parameters | 124M | ~19.0M |
Note: Because this model retains attention blocks in the front-end, its runtime memory still grows with sequence length (due to the KV-cache in the attention stage). However, the SPLM tail adds only a per-token cost that is independent of sequence length, and at long context the hybrid achieves -39% decode FLOPs vs matched pure attention. For the purely conservative variants, see the Multi-Xi SPLM, PARFLM, and Fock-PARFLM.
Geometric Capabilities
Note: This model includes attention blocks in its front-end, breaking the conservative-by-construction guarantee. The full damped Riemannian geometry — layer-dependent Jacobi metric, damped geodesics with friction, energy dissipation anomaly detection, curvature proxy — is available only in the purely conservative variants: Multi-Xi SPLM, Multi-Xi PARFLM, and Fock-PARFLM v2.1. A Riemannian Geometry Diagnostic Battery (June 2026) confirmed the metric validity and characterised the damping-dominated dynamics across those models. See the companion note for the full damped framework.
However, the SPLM refinement layers in this hybrid architecture still use the same damped Euler integration with the shared potential V_theta as the purely conservative variants. The geometric structure (metric, geodesics, curvature) is well-defined within those layers; the attention front-end only prevents the global conservative guarantee from holding end-to-end.
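For reference, the damped update from the Architecture diagram can be written out per step. This restates the diagram and is not an additional result; the mass is written as $\mu$ here (an assumed symbol) to keep it distinct from the step count m, and $\Delta t$ stands for dt.

$$
f_j = -\nabla_h V_\theta(\xi, h_j), \qquad
v_{j+1} = \frac{v_j + \Delta t \, f_j / \mu}{1 + \Delta t \, \gamma}, \qquad
h_{j+1} = h_j + \Delta t \, v_{j+1}
$$

Rearranging the velocity update gives $(v_{j+1} - v_j)/\Delta t = f_j/\mu - \gamma \, v_{j+1}$, which is a semi-implicit discretisation of

$$
\mu \, \ddot{h} + \mu \, \gamma \, \dot{h} = -\nabla_h V_\theta(\xi, h)
$$

The damping term is treated implicitly and the force explicitly, so the friction coefficient $\gamma$ appears only in the denominator of the velocity update.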
How to Get Started
# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch
import torch
import sys
sys.path.insert(0, "hybrid")
sys.path.insert(0, "sarf_mass_variant")
sys.path.insert(0, ".")
from hybrid.model_hybrid import HybridSPLM, HSPLMConfig
config = HSPLMConfig(
vocab_size=50257, # GPT-2 BPE
d=256,
n_attn=4,
n_splm=4,
n_head=4,
v_hidden=1024,
v_depth=3,
max_len=1024,
block_size=512,
gamma_init=0.15,
)
model = HybridSPLM(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)
Available Checkpoint
A trained checkpoint (PPL 8.01, 16k steps) is included in this repository:
| File | Description |
|---|---|
| checkpoint/model.pt | Full model state dict (73 MB) |
| training_log.jsonl | Per-step training metrics |
| loss_curve.png | Training/validation loss plot |
| v_theta_hist.png | V_theta histogram |
| landscape_stats.json | Landscape statistics (JSON) |
To load the checkpoint:
from huggingface_hub import hf_hub_download
import torch
# Download checkpoint
ckpt_path = hf_hub_download(
repo_id="dimitarpg13/semsimula-hybrid-splm",
filename="checkpoint/model.pt",
)
# Load into model (after creating model as above)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()
Training Details
Training Data
TinyStories -- tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.
Training Procedure
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 8,000 |
| Seeds | 5 (mean 8.503 +/- 0.035) |
| Hardware | A100 40GB (Google Colab) |
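The learning-rate settings in the table (5e-4 peak, 400 warmup steps, cosine decay over 8,000 steps) can be sketched as a schedule function. The linear warmup shape and the decay floor `min_lr=0.0` are assumptions; the training script may differ on both, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(
    step: int,
    base_lr: float = 5e-4,
    warmup_steps: int = 400,
    total_steps: int = 8000,
    min_lr: float = 0.0,  # assumed floor; not stated in the table
) -> float:
    """Learning rate at a given optimiser step."""
    if step < warmup_steps:
        # Assumed linear warmup from near zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 200, 400, 4200, 8000):
    print(s, lr_at(s))
```

In a PyTorch training loop a function like this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR`, returning the ratio `lr_at(step) / base_lr` rather than the absolute rate.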
Training Script
notebooks/conservative_arch/scaleup/train_hybrid_scaleup.py
Evaluation Results
TinyStories Validation Perplexity
| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn (this model) | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM | 11.51 | 16.5M | +3.70 |
The Hybrid SPLM achieves the best PPL in the family (8.50), but uses attention and therefore does not satisfy the conservative-by-construction constraint that the pure SPLM / PARFLM / Fock variants maintain. It serves as the empirical bridge between attention and the fully conservative designs.
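For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, and the "Gap vs Attention" column is a plain difference of perplexities. The sketch below assumes losses measured in nats (natural log); the function names are illustrative and the loss values are toy inputs, not measurements from this model.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_gap(model_ppl: float, baseline_ppl: float) -> float:
    """Absolute perplexity gap, as used in the results table."""
    return model_ppl - baseline_ppl


# A model that is uniformly unsure among 4 tokens has perplexity 4.
uniform_nlls = [math.log(4.0)] * 10
print(perplexity(uniform_nlls))

# Gap for the two headline numbers in the table above.
print(round(ppl_gap(8.50, 7.81), 2))  # 0.69
```

Read in the other direction, a perplexity of 8.50 corresponds to a mean cross-entropy of about 2.14 nats per token.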
SPLM Family Overview
This model is part of the Semantic Simulation SPLM family:
| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | | semsimula-splm-multixi |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | | this model |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | | semsimula-fock-attention |
Bias, Risks, and Limitations
- Research checkpoint only. Proof-of-concept for the hybrid conservative/attention architecture, not a production system.
- TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens).
- English only. No multilingual capability.
- Small scale. ~19M parameters, 256-dim hidden states.
- No safety training. No RLHF, DPO, or safety filtering has been applied.
Citation
@misc{Gueorguiev2026SemSim,
author = {Gueorguiev, Dimitar P.},
title = {Semantic Simulation: A Prescriptive Lagrangian Framework
for Efficient Semantic Inference --- A Conservative-by-
Construction Language Model and the Shared-Potential
Separator, with a Correspondence to Joint Embedding
Predictive Architectures},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19712427},
url = {https://doi.org/10.5281/zenodo.19712427},
note = {Version v15 (Jun 7, 2026).
Companion code repository (DOI 10.5281/zenodo.20579561):
\url{https://github.com/dimitarpg13/semsimula-paper}}
}
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab)
- Training time: ~4 hours (8,000 steps, 5 seeds)
- Carbon footprint: Estimated < 5 kg CO2
