The Lie Algebra of Neural Network Training

Published April 22, 2026

Linear neural networks under SGD-with-momentum produce a richer Poisson algebra at gradient-product coupling than any pairwise potential in our 16-potential survey, and split into seven distinct universality classes overall.

Train a 3-layer linear network with SGD and momentum. The training dynamics form a Hamiltonian system — weights are coordinates, velocity buffers are momenta, and the loss landscape defines pairwise interactions between layers. Compute the Lie algebra generated by these interactions via iterated Poisson brackets.

You get [3, 6, 17, 119].

Now replace the network with three gravitating bodies (stars, atoms, black holes) interacting via $1/r$. You get [3, 6, 17, 116].

The neural network produces 3 extra generators at level 3. It breaks a universality that holds across every pairwise potential $V(r)$ in our 16-potential physical survey, from $1/r$ through $r^{10}$, $\log r$, Yukawa $e^{-\mu r}/r$, and the Calogero-Moser $1/r^{2}$, and across all spatial dimensions, mass ratios, and charge configurations tested. (Other neural couplings instead produce smaller algebras; see the seven-class table below.)

This article introduces bshepp/pairwise-poisson-algebras — the first systematic computation of Poisson bracket Lie algebras for neural network training dynamics. The dataset contains 993 rows across 13 Parquet tables, including a dedicated neural_algebras split with 21 configurations sweeping network depth (L=2..5), width (k=1..3), 12 coupling types, loss function, and activation function — revealing seven distinct neural universality classes at L=3, versus the single physical class at [3, 6, 17, 116].

SGD as a Hamiltonian System

Consider a linear network $f (x) = w_{3} w_{2} w_{1} x$ trained on a single data point $(x, t) = (1, 1)$ with MSE loss:

$L = \frac{1}{2}(w_1 w_2 w_3 - 1)^2$

SGD with momentum updates weights $w_{i}$ and velocity buffers $v_{i}$ :

$v_i \leftarrow \mu v_i - \eta \frac{\partial L}{\partial w_i}, \quad w_i \leftarrow w_i + v_i$

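In the continuous-time limit, the undamped case $\mu \to 1$ of this update is Hamilton's equations with $w_{i}$ as positions and $v_{i}$ as conjugate momenta (for $\mu < 1$ the velocity buffer adds a friction term). The phase space is $2L$-dimensional for an $L$-layer network with scalar weights; note that $L$ denotes the layer count here and the loss in the formulas above.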
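As a concrete illustration of the update rule above, here is a minimal sketch in plain Python for the scalar 3-layer network and the single data point $(1, 1)$. The learning rate, momentum value, step count, and initial weights are illustrative assumptions, not values taken from the dataset.

```python
def loss(w):
    """MSE loss L = 0.5 * (w1*w2*w3 - 1)^2 for the data point (x, t) = (1, 1)."""
    w1, w2, w3 = w
    return 0.5 * (w1 * w2 * w3 - 1.0) ** 2


def grad(w):
    """Analytic gradient dL/dw_i = (w1*w2*w3 - 1) * (product of the other two weights)."""
    w1, w2, w3 = w
    r = w1 * w2 * w3 - 1.0
    return [r * w2 * w3, r * w1 * w3, r * w1 * w2]


def sgd_momentum(w, v, lr=0.05, mu=0.9, steps=200):
    """Apply v <- mu*v - lr*grad, then w <- w + v, for a fixed number of steps."""
    w, v = list(w), list(v)
    for _ in range(steps):
        g = grad(w)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w, v


w0 = [0.5, 0.8, 1.2]  # assumed starting weights (illustrative)
w_final, v_final = sgd_momentum(w0, [0.0, 0.0, 0.0])
print(f"loss before: {loss(w0):.4f}, loss after: {loss(w_final):.2e}")
```

The pair `(w, v)` returned by `sgd_momentum` is a point in the 6-dimensional phase space described above: three weights and three velocity buffers.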

The key insight: the loss couples all weights simultaneously, but we can decompose this into pairwise interactions between weight layers — exactly as gravitational N-body dynamics decomposes into pairwise forces. Each pair $(i, j)$ gets a Hamiltonian:

$H_{ij} = \frac{v_i^2}{2} + \frac{v_j^2}{2} + V_{ij}(w_1, \ldots, w_L)$

where $V_{ij}$ captures how layers $i$ and $j$ interact through the loss. The choice of $V_{ij}$ defines the coupling type.

Coupling Types

The dataset sweeps twelve ways to extract pairwise interactions from the loss. Here are the four canonical ones; the full set is summarized in the next section's universality table.

Coupling	Definition	Physical Analog
Gradient-product	$V_{ij} = \frac{1}{2}\frac{\partial L}{\partial w_i}\frac{\partial L}{\partial w_j}$	Force co-alignment
Hessian	$V_{ij} = \frac{1}{2}\frac{\partial^2 L}{\partial w_i \partial w_j}w_i w_j$	Curvature coupling
Symmetric	$V_{ij} = \frac{L}{\binom{L}{2}}$	Democratic loss sharing: the loss (numerator $L$) split evenly over all layer pairs (the $L$ in the binomial is the depth)
Fisher	$V_{ij} \propto g_i g_j (1 - L)$	Information geometry

The remaining eight (gradient_sum, gradient_abs, hessian_plain, hessian_full, directional, natural_gradient, loss_power, gradient_cubic) probe variations in symmetrization, normalization, and degree — and as we will see, every variation that changes the polynomial structure of $V_{ij}$ at high degree produces a new universality class.

The Lie algebra is then generated by all $\lbrace H_{ij}, H_{kl}\rbrace$ under the canonical Poisson bracket, exactly as in the gravitational N-body problem.
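To make the construction concrete, the following sketch builds the three gradient-product Hamiltonians $H_{12}, H_{13}, H_{23}$ for the scalar 3-layer MSE network, takes their pairwise Poisson brackets, and counts linearly independent polynomials with exact rational arithmetic. It is a minimal illustration, not the author's pipeline: the polynomial representation and helper names are hypothetical, and it assumes that "level 0" means the span of the $H_{ij}$ and "level 1" means that span plus the pairwise brackets of the $H_{ij}$.

```python
from fractions import Fraction
from itertools import combinations

N = 3  # number of layers; variables are ordered (w1, w2, w3, v1, v2, v3)

# A polynomial is a dict {exponent tuple: Fraction coefficient}.
def var(i):
    e = [0] * (2 * N)
    e[i] = 1
    return {tuple(e): Fraction(1)}

def const(c):
    return {(0,) * (2 * N): Fraction(c)}

def add(p, q, sign=1):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + sign * c
    return {m: c for m, c in out.items() if c != 0}

def mul(p, q):
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = tuple(a + b for a, b in zip(m1, m2))
            out[m] = out.get(m, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def diff(p, i):
    out = {}
    for m, c in p.items():
        if m[i] > 0:
            e = list(m)
            e[i] -= 1
            out[tuple(e)] = c * m[i]
    return out

def poisson(f, g):
    # {f, g} = sum_i (df/dw_i * dg/dv_i - df/dv_i * dg/dw_i)
    out = {}
    for i in range(N):
        out = add(out, mul(diff(f, i), diff(g, N + i)))
        out = add(out, mul(diff(f, N + i), diff(g, i)), sign=-1)
    return out

def rank(polys):
    # Exact rank of the coefficient matrix via Gaussian elimination over Q.
    monos = sorted({m for p in polys for m in p})
    rows = [[p.get(m, Fraction(0)) for m in monos] for p in polys]
    r = 0
    for col in range(len(monos)):
        piv = next((k for k in range(r, len(rows)) if rows[k][col] != 0), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for k in range(r + 1, len(rows)):
            if rows[k][col] != 0:
                factor = rows[k][col] / rows[r][col]
                rows[k] = [a - factor * b for a, b in zip(rows[k], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

w = [var(i) for i in range(N)]
v = [var(N + i) for i in range(N)]
half = const(Fraction(1, 2))
resid = add(mul(mul(w[0], w[1]), w[2]), const(1), sign=-1)  # w1*w2*w3 - 1
loss = mul(half, mul(resid, resid))
grad = [diff(loss, i) for i in range(N)]

def hamiltonian(i, j):
    # H_ij = v_i^2/2 + v_j^2/2 + (1/2) * dL/dw_i * dL/dw_j  (gradient-product coupling)
    kinetic = mul(half, add(mul(v[i], v[i]), mul(v[j], v[j])))
    return add(kinetic, mul(half, mul(grad[i], grad[j])))

level0 = [hamiltonian(i, j) for i, j in combinations(range(N), 2)]
level1 = level0 + [poisson(a, b) for a, b in combinations(level0, 2)]
dims = [rank(level0), rank(level1)]
print(dims)
```

Under these assumptions the sketch reproduces the first two entries, 3 and 6, of the gradient-product sequence. Going to levels 2 and 3 requires fixing a convention for which brackets are iterated, which this sketch does not attempt.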

Seven Universality Classes in Neural Algebras

The headline discovery: different coupling types produce different algebras. A sweep of 12 pairwise couplings and 3 loss functions reveals at least seven distinct universality classes at L=3.

Class	Dimension Sequence	Couplings/Losses
A	[3, 6, 17, 119]	gradient, fisher, gradient_abs, hessian_plain, gradient+L1
B	[3, 6, 17, 115]	directional (kinetic-gradient)
C	[3, 6, 17, 111]	gradient_sum (diagonal only)
D	[3, 6, 17, 104]	gradient_cubic
E	[3, 6, 17, 87]	natural_gradient
F	[3, 6, 17, 62]	gradient + cross-entropy loss
G	[3, 5, 11, 47]	hessian, symmetric, hessian_full, loss_power

The physical universality class at N=3 is [3, 6, 17, 116] — it sits between classes A (119) and B (115), and above all others in level-3 dimension. Classes A–F all match physics at levels 0–2 but diverge at level 3. Class G diverges already at level 1, producing only 5 independent generators (vs 6 in all other classes).

A key methodological note: the cleaner Fisher formulation $V_{ij} = g_i g_j (1 - L)$ used here places Fisher in Class A; an earlier polynomial-truncated variant landed in its own class. Small choices in how the coupling polynomial is normalized — what gets divided by what, where the loss enters multiplicatively — can move a coupling between classes. The 12-coupling sweep is designed to map this fine-grained landscape rather than treat it as noise.

The structure of the extras

The neural Class A algebra [3, 6, 17, 119] and the physical algebra [3, 6, 17, 116] have identical dimension counts at levels 0, 1, 2 but evaluate to completely different polynomials in phase space. In a shared 6D phase space (3 weights vs 3 positions), the rank of the concatenated 156+156 matrix equals 116 + 119 = 235 — every physical generator and every neural generator contributes an independent direction. The comparison is therefore structural (count vs count) rather than literal subalgebra.

Polynomial-degree analysis reveals the source of the "3 extra":

Physical $(q_{i} - q_{j})^{4}$ algebra: level-3 generators at degrees 6, 8, 10 with ranks 38, 95, 116. The top-degree (10) stratum contributes $116 - 95 = 21$ new dimensions from 21 candidates (no syzygies).
Neural gradient algebra: level-3 generators at degrees 18, 26, 34 with ranks 32, 92, 119. The top-degree (34) stratum contributes $119 - 92 = 27$ new dimensions from 30 candidates (3 syzygies).

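The net 3 extra neural directions at level 3 are accounted for at the highest polynomial-degree stratum. There the neural algebra has 30 candidates with 3 syzygies (27 new dimensions), against 21 candidates with no syzygies for physics (21 new dimensions). This surplus of 6 outweighs the neural deficit of 3 at the lower strata (92 versus 95).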
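The bookkeeping in the two strata listings above can be checked with a few lines of arithmetic. The numbers are copied from the text; the dictionary layout and key names are only an illustrative choice.

```python
# Top-degree stratum of level 3: candidate count, cumulative rank just below
# the stratum, and cumulative rank including it (values quoted in the text).
strata = {
    "physical": {"candidates": 21, "rank_below": 95, "rank_total": 116},
    "neural": {"candidates": 30, "rank_below": 92, "rank_total": 119},
}

summary = {}
for name, s in strata.items():
    new_dims = s["rank_total"] - s["rank_below"]  # dimensions added by the stratum
    summary[name] = {
        "new": new_dims,
        "syzygies": s["candidates"] - new_dims,   # candidates that are linearly dependent
    }

net_extra = strata["neural"]["rank_total"] - strata["physical"]["rank_total"]
print(summary, net_extra)
```

Here a syzygy is counted simply as a candidate that adds no new dimension, which is the sense in which the text uses the word.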

Width Invariance

Does the dimension of weight space matter? We test networks with scalar weights ($k = 1$), 2D weights ($k = 2$), and 3D weights ($k = 3$):

Width $k$	Phase Space	Dims (L0, L1, L2)
1	6D	[3, 6, 17]
2	12D	[3, 6, 17]
3	18D	[3, 6, 17]

The algebra is independent of weight dimension through level 2. This is the neural network analog of a remarkable property in physics: the pairwise Poisson algebra for $N$ gravitating bodies is independent of the spatial dimension $d$. Gravity in 1D, 2D, and 3D all produce [3, 6, 17, 116]. Neural networks with scalar ($k = 1$), 2D ($k = 2$), and 3D ($k = 3$) weights all produce [3, 6, 17, ...].

Loss Function: Partial Invariance

Does the loss function matter? We test gradient-product coupling with three loss functions:

Loss	Definition	Dims (L0, L1, L2, L3)	Class
MSE	$\frac{1}{2}(f - t)^2$	[3, 6, 17, 119]	A
L1 (smooth proxy)†	quadratic surrogate for $\lvert f - t \rvert$	[3, 6, 17, 119]	A
Cross-entropy	$-\log\sigma(f)$ Taylor $z^{2} / 4 + z^{4} / 48$	[3, 6, 17, 62]	F

† The "L1" row uses a smooth quadratic proxy rather than the non-differentiable $\lvert f-t \rvert$ itself; the polynomial structure of the coupling is what enters the algebra computation.

MSE and L1 give the identical algebra; cross-entropy does not. The quartic $z^{4}$ term in the cross-entropy Taylor expansion contributes new monomials that change the level-3 structure dramatically, dropping the dimension from 119 to 62. This is a separate universality class (Class F in the seven-class table).

The MSE/L1 case is the neural analog of potential universality: quadratic losses behave like generic polynomial potentials $r^{n}$ with $n \ge 4$ . Cross-entropy is the neural analog of an exceptional potential like $r^{3}$ — it has enough special polynomial structure to produce its own class.

Activation Invariance

Does nonlinearity matter? We test gradient-product coupling with linear, tanh (Taylor truncation), and ReLU (softplus approximation) activations:

Activation	Dims (L0, L1, L2)
Linear	[3, 6, 17]
tanh (Taylor)	[3, 6, 17]
ReLU (softplus)	[3, 6, 17]

The algebra is independent of activation function through level 2. The Taylor approximations used here capture the leading nonlinear corrections; whether deeper levels also match is an open question requiring higher-order expansions or numerical methods.

Depth Scaling: L=3 Is an Accidental Match

As network depth increases, the neural algebra diverges from the physical N-body algebra at progressively earlier levels:

Depth / N	Neural (gradient)	Physical (1/r in 1D)	Level where they first diverge
L=2 / N=2	[1, 1, 1, 1]	[1, 1, 1, 1]	never (trivial)
L=3 / N=3	[3, 6, 17, 119]	[3, 6, 17, 116]	level 3 (+3 extra)
L=4 / N=4	[6, 20, 164, ...]	[6, 14, 62, 1260]	level 1 (+6 extra)
L=5 / N=5	[10, 45, 210, ...]	[10, 25, 145, ...]	level 1 (+20 extra)

The L=3 match at levels 0-2 is essentially accidental. At L=3 the neural and physical algebras share the same level-0, level-1, and level-2 counts (3, 6, 17), and the divergence appears only at level 3 as the 3 extra generators. For $L \ge 4$ the algebras diverge already at level 1: neural L=4 produces 14 new generators at level 1 (20 minus 6) versus 8 for physical N=4 (14 minus 6).

The extras at level 1 follow an exact pattern:

$\text{extras}_{L,1} = \binom{L}{2}(L - 3)$

This gives 0, 6, 20 for $L = 3, 4, 5$, matching the observed data. The formula vanishes at $L = 3$, which is why level 1 matches physics at that depth but not at $L = 4$ or $L = 5$.
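The match between the formula and the depth-scaling table can be verified in a few lines. The level-1 dimensions are copied from the table above; the function and variable names are illustrative.

```python
from math import comb


def level1_extras(depth):
    """Empirical formula from the text: C(L, 2) * (L - 3)."""
    return comb(depth, 2) * (depth - 3)


# Level-1 entries of the depth-scaling table (neural gradient vs physical 1/r in 1D).
neural_level1 = {3: 6, 4: 20, 5: 45}
physical_level1 = {3: 6, 4: 14, 5: 25}

observed = {L: neural_level1[L] - physical_level1[L] for L in neural_level1}
predicted = {L: level1_extras(L) for L in neural_level1}
print(observed, predicted)
```

The formula is only compared against the three depths in the table; evaluating it outside that range (for example at $L = 2$, where it is negative) is not meaningful.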

For comparison, the physical N-body problem with $N$ bodies in 1D has closed-form scaling formulas:

$d_0(N) = \binom{N}{2}, \quad d_1(N) = \frac{N(3N-5)}{2}, \quad d_2(N) = \frac{N(4N^2 - 9N + 3)}{2}$

Whether the neural depth scaling follows similar closed-form formulas for $L \ge 4$ is an open question. So far only the level-1 extras formula above has been matched against data, at $L = 3, 4, 5$.
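The physical closed forms above can be evaluated and compared with the physical column of the depth-scaling table. This is a plain transcription of the three expressions as printed. The comparison values come from the table, and the helper names are illustrative.

```python
from math import comb


def d0(n):
    return comb(n, 2)


def d1(n):
    return n * (3 * n - 5) // 2          # numerator is always even, so // is exact


def d2(n):
    return n * (4 * n * n - 9 * n + 3) // 2  # numerator is always even, so // is exact


# Physical dimension sequences through level 2, from the depth-scaling table.
table = {3: [3, 6, 17], 4: [6, 14, 62], 5: [10, 25, 145]}
formula = {n: [d0(n), d1(n), d2(n)] for n in table}

# Collect (N, level, formula value, table value) wherever the two disagree.
mismatches = [
    (n, level, formula[n][level], table[n][level])
    for n in table
    for level in range(3)
    if formula[n][level] != table[n][level]
]
print(formula, mismatches)
```

As printed, the expressions reproduce every table entry for $N = 4$ and $N = 5$. At $N = 3$ the $d_2$ expression evaluates to 18, while the table lists 17. This sketch does not settle whether $d_2$ is intended only for $N \ge 4$ or whether the dataset's `scaling_formulas` table states it differently.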

The Physical Benchmark: 116 vs 119

The physical universality class at $N = 3$ gives [3, 6, 17, 116] for:

Newtonian gravity $1 / r$
Coulomb potential $1 / r$ (identical)
Calogero-Moser $1 / r^{2}$
Dipole-dipole $1 / r^{3}$
Logarithmic vortex $\log r$
Yukawa nuclear force $e^{-\mu r}/r$
GUE eigenvalue dynamics (random matrix theory)
All polynomial $r^{n}$ with $n \ge 4$
576 exponent values across the continuous landscape
20 named physical systems (helium to triple black holes)
All mass ratios from 1 to $10^{10}$
All charge configurations tested

The neural gradient-product coupling gives [3, 6, 17, 119]. The structure of the 3-dimensional "gap" has been characterized by polynomial-degree stratification:

Degree stratum	Physical $(q_{i} - q_{j})^{4}$ cumulative rank	Neural gradient cumulative rank
Level 3, lowest	38 (deg 6)	32 (deg 18)
Level 3, middle	95 (deg 8)	92 (deg 26)
Level 3, top	116 (deg 10, +21 new)	119 (deg 34, +27 new)

The 3 extra generators are accounted for at the top-degree stratum of level 3. Physics has 21 independent top-degree generators out of 21 candidates (no syzygies). Neural has 27 independent top-degree generators out of 30 candidates (3 syzygies). Neural therefore adds 6 more dimensions than physics at the polynomial frontier, which turns its deficit of 3 at the middle stratum (92 versus 95) into a net surplus of 3.

The dimension counts are identical at levels 0, 1, 2; the neural and physical generators occupy entirely different polynomial subspaces (the combined rank is $116 + 119 = 235$ ), so this is a structural-count rather than literal-containment relation.

Dataset Overview

The full dataset contains 993 rows across 13 Parquet tables:

Split	Rows	Description
`neural_algebras`	21	Neural network algebra sweep (depth, width, 12 couplings, loss, activation)
`dimension_sequences`	685	Dimension sequences for N=3..50, 16+ potentials, quantum/classical
`structure_constants`	16	Exact rational structure constant tensors
`physical_systems`	17	Named systems from helium to triple black holes
`charge_sensitivity`	38	Charge-independence tests
`mass_invariance`	33	Mass ratio sweep (10 orders of magnitude)
`convergence_trajectories`	77	SVD rank convergence tracking
`tier_decomposition`	40	$S_{3}$ and $S_{4}$ representation decomposition
`level4_convergence`	19	Level-4 lower bounds
`spectral_statistics`	17	Phase-space atlas rank distributions
`contextuality`	16	Kochen-Specker contextuality tests
`bell_test`	9	CHSH Bell inequality tests
`scaling_formulas`	5	Closed-form dimension scaling formulas

Quick Start

import json
from collections import defaultdict

from datasets import load_dataset

REPO = "bshepp/pairwise-poisson-algebras"

# Load neural network algebras
nn = load_dataset(REPO, "neural_algebras")["train"].to_pandas()

# Compare coupling types at depth L=3
for _, row in nn[nn["n_layers"] == 3].iterrows():
    print(f"{row['coupling_type']:12s} {row['loss_function']:15s} "
          f"{row['activation']:12s} -> {row['dimension_sequence']}")

# Compare neural vs physical dimension sequences
gradient_l3 = nn[(nn["n_layers"] == 3) & (nn["coupling_type"] == "gradient")
                 & (nn["activation"] == "linear") & (nn["loss_function"] == "mse")]
nn_dims = json.loads(gradient_l3.iloc[0]["dimension_sequence"])

# Physical systems (helium has the universal physical algebra [3, 6, 17, 116])
phys = load_dataset(REPO, "physical_systems")["train"].to_pandas()
phys_dims = json.loads(phys[phys["system_name"] == "helium"].iloc[0]["dimension_sequence"])

print(f"Neural (gradient):  {nn_dims}")
print(f"Physics (helium):   {phys_dims}")
print(f"Extra generators:   {nn_dims[3] - phys_dims[3]}")

# Discover universality classes: group configurations by dimension sequence
# (the sequence is kept as its stored string form so it can serve as a dict key)
classes = defaultdict(list)
for _, row in nn.iterrows():
    dims = row["dimension_sequence"]
    classes[dims].append(f"{row['coupling_type']} / {row['loss_function']} / {row['activation']}")

for dims, configs in sorted(classes.items()):
    print(f"\n{dims}:")
    for c in configs:
        print(f"  {c}")

Open Problems

Exact neural rank for the remaining classes at level 3: The gradient-product [3, 6, 17, 119] result has been verified exactly over $\mathbb{Q}$ (neural/nn_poisson.py). The other six classes currently rely on numerical SVD over 600+ phase-space samples, with a singular-value gap of $\sim 5 \times 10^8$ at the cutoff, which is strong but not symbolic. Confirming each class with exact rational arithmetic would close the loop.
Does activation break universality at level 3? Linear, tanh (Taylor), and ReLU (softplus) all give [3, 6, 17] through level 2. The linear case extends to 119 at level 3; verifying tanh and ReLU at level 3 is computationally difficult due to polynomial blow-up.
The seven classes: What operation on couplings governs the class? Hessian vs gradient vs gradient_sum are all natural choices, yet they produce 47, 119, 111. Is there a classification theorem for neural universality classes analogous to the physical $r^{n}$ classification?
Depth scaling formulas: The extras-at-level-1 formula $\binom{L}{2}(L-3)$ is empirical and exact for $L = 3, 4, 5$. A closed form for the full dimension sequence as a function of $L$ (and for the extras at levels 2 and 3) would generalize the physical scaling laws.
Nonlinear networks: Full (non-Taylor) nonlinear activations create non-polynomial dynamics. Can the algebra be computed for genuine ReLU or sigmoid networks via asymptotic or numerical methods?
The 3 extra generators at L=3: They appear at the highest polynomial-degree stratum (degree 34), where neural has 30 candidates with 3 syzygies, versus 21 candidates with no syzygies for physics at degree 10. What is the algebraic origin of this difference in relations?
Converse question: Are any of the seven classes (119, 115, 111, 104, 87, 62, 47) producible by some physical potential? Or do they define genuinely new algebraic structures with no physical realization? The log-gas/GUE algebra matches physical 116. Are there other "hidden" physical systems in these classes?
Connection to loss landscape geometry: The coupling type determines the universality class. Gradient-product measures force co-alignment; Hessian measures curvature; directional couples to velocity. How does this connect to saddle-point structure, mode connectivity, and the edge of stability in real networks?
Wider networks (matrix weights): Widths $k = 1, 2, 3$ all give the same algebra through level 2 at L=3. Does this extend to full matrix-valued weight layers in realistic architectures, or does the symmetry break at matrix widths where weight-space ordering becomes non-trivial?

Citation

@dataset{sheppard2026poisson,
  title={Pairwise Poisson Algebras of the N-Body Problem},
  author={Sheppard, Brian},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/bshepp/pairwise-poisson-algebras}
}
