The Lie Algebra of Neural Network Training

Community Article Published April 22, 2026

Linear neural networks under SGD-with-momentum produce a richer Poisson algebra at gradient-product coupling than any pairwise potential in our 16-potential survey, and split into seven distinct universality classes overall.

Train a 3-layer linear network with SGD and momentum. The training dynamics form a Hamiltonian system — weights are coordinates, velocity buffers are momenta, and the loss landscape defines pairwise interactions between layers. Compute the Lie algebra generated by these interactions via iterated Poisson brackets.

You get [3, 6, 17, 119].

Now replace the network with three gravitating bodies (stars, atoms, black holes) interacting via $1/r$. You get [3, 6, 17, 116].

The neural network produces 3 extra generators. It breaks a universality that holds across every pairwise potential $V(r)$ in our 16-potential physical survey (from $1/r$ through $r^{10}$, $\log r$, Yukawa $e^{-\mu r}/r$, and the Calogero-Moser $1/r^2$) and across all spatial dimensions, mass ratios, and charge configurations tested. (Other neural couplings instead produce smaller algebras; see the seven-class table below.)

This article introduces bshepp/pairwise-poisson-algebras — the first systematic computation of Poisson bracket Lie algebras for neural network training dynamics. The dataset contains 993 rows across 13 Parquet tables, including a dedicated neural_algebras split with 21 configurations sweeping network depth (L=2..5), width (k=1..3), 12 coupling types, loss function, and activation function — revealing seven distinct neural universality classes at L=3, versus the single physical class at [3, 6, 17, 116].

SGD as a Hamiltonian System

Consider a linear network $f(x) = w_3 w_2 w_1 x$ trained on a single data point $(x, t) = (1, 1)$ with MSE loss:

$$L = \frac{1}{2}(w_1 w_2 w_3 - 1)^2$$

SGD with momentum updates weights $w_i$ and velocity buffers $v_i$:

$$v_i \leftarrow \mu v_i - \eta \frac{\partial L}{\partial w_i}, \quad w_i \leftarrow w_i + v_i$$

In the continuous-time limit, this is Hamilton's equations with $w_i$ as positions and $v_i$ as conjugate momenta. The phase space is $2L$-dimensional for an $L$-layer network.
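As a minimal sketch of the discrete update rule above, the following plain-Python loop applies it to the three-weight loss $L = \frac{1}{2}(w_1 w_2 w_3 - 1)^2$. The starting weights, the values of `mu` and `eta`, the step count, and the helper names are illustrative assumptions, not settings taken from the dataset.

```python
from math import prod

def loss(w):
    """MSE loss 1/2 (w1*w2*w3 - 1)^2 for the single data point (x, t) = (1, 1)."""
    return 0.5 * (prod(w) - 1.0) ** 2

def grad(w):
    """dL/dw_i = (w1*w2*w3 - 1) * (product of the other weights)."""
    resid = prod(w) - 1.0
    return [resid * prod(w[:i] + w[i + 1:]) for i in range(len(w))]

def sgd_momentum(w, mu=0.9, eta=0.01, steps=2000):
    """Apply v_i <- mu*v_i - eta*dL/dw_i, then w_i <- w_i + v_i, for `steps` steps."""
    w = list(w)
    v = [0.0] * len(w)          # velocity buffers play the role of momenta
    for _ in range(steps):
        g = grad(w)
        v = [mu * vi - eta * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w, v

w0 = [0.5, 0.8, 1.2]            # assumed starting point
w_final, v_final = sgd_momentum(w0)
print(loss(w0), loss(w_final))
```

The pair `(w, v)` returned here is one point of the $2L$-dimensional phase space; the algebra computation itself works with the symbolic Hamiltonians rather than with such trajectories.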

The key insight: the loss couples all weights simultaneously, but we can decompose this into pairwise interactions between weight layers, exactly as gravitational N-body dynamics decomposes into pairwise forces. Each pair $(i, j)$ gets a Hamiltonian:

$$H_{ij} = \frac{v_i^2}{2} + \frac{v_j^2}{2} + V_{ij}(w_1, \ldots, w_L)$$

where $V_{ij}$ captures how layers $i$ and $j$ interact through the loss. The choice of $V_{ij}$ defines the coupling type.

Coupling Types

The dataset sweeps twelve ways to extract pairwise interactions from the loss. Here are the four canonical ones; the full set is summarized in the next section's universality table.

| Coupling | Definition | Physical Analog |
|---|---|---|
| Gradient-product | $V_{ij} = \frac{1}{2}\frac{\partial L}{\partial w_i}\frac{\partial L}{\partial w_j}$ | Force co-alignment |
| Hessian | $V_{ij} = \frac{1}{2}\frac{\partial^2 L}{\partial w_i \partial w_j}w_i w_j$ | Curvature coupling |
| Symmetric | $V_{ij} = \frac{L}{\binom{L}{2}}$ (loss in the numerator, number of layer pairs in the denominator) | Democratic loss sharing |
| Fisher | $V_{ij} \propto g_i g_j (1 - L)$ | Information geometry |

The remaining eight (gradient_sum, gradient_abs, hessian_plain, hessian_full, directional, natural_gradient, loss_power, gradient_cubic) probe variations in symmetrization, normalization, and degree. As we will see, every variation that changes the polynomial structure of $V_{ij}$ at high degree produces a new universality class.
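To make the four canonical definitions concrete, here is a small numerical sketch that evaluates each $V_{ij}$ at one point for the 3-layer scalar network. It assumes that $g_i$ denotes $\partial L/\partial w_i$, takes the Fisher proportionality constant to be 1, and uses a hand-differentiated mixed second derivative; the function names and the evaluation point are hypothetical.

```python
from math import comb, prod

def loss(w):
    return 0.5 * (prod(w) - 1.0) ** 2

def grad(w):
    """g_k = dL/dw_k = (prod(w) - 1) * (product of the other weights)."""
    resid = prod(w) - 1.0
    return [resid * prod(w[:k] + w[k + 1:]) for k in range(len(w))]

def mixed_second(w, i, j):
    """d^2 L / dw_i dw_j for i != j, differentiated by hand for this loss."""
    others_i = prod(w[:i] + w[i + 1:])
    others_j = prod(w[:j] + w[j + 1:])
    rest = prod(x for k, x in enumerate(w) if k not in (i, j))
    return others_i * others_j + (prod(w) - 1.0) * rest

def couplings(w, i, j):
    """Evaluate the four canonical pairwise couplings for layers i and j."""
    g = grad(w)
    return {
        "gradient": 0.5 * g[i] * g[j],
        "hessian": 0.5 * mixed_second(w, i, j) * w[i] * w[j],
        "symmetric": loss(w) / comb(len(w), 2),   # loss shared over layer pairs
        "fisher": g[i] * g[j] * (1.0 - loss(w)),  # proportionality constant assumed 1
    }

w = [1.0, 2.0, 3.0]        # arbitrary evaluation point
V12 = couplings(w, 0, 1)
print(V12)
```

At this point the product of weights is 6 and the loss is 12.5, so the four couplings already differ in sign and magnitude, which gives some intuition for why they can generate different algebras.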

The Lie algebra is then generated by all $\lbrace H_{ij}, H_{kl}\rbrace$ under the canonical Poisson bracket, exactly as in the gravitational N-body problem.
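The first two levels of this construction can be sketched in pure Python with exact rational arithmetic. The block below is an illustration, not the dataset's own pipeline (neural/nn_poisson.py is not reproduced here). It assumes the sign convention $\lbrace f, g\rbrace = \sum_i \partial_{w_i} f\, \partial_{v_i} g - \partial_{v_i} f\, \partial_{w_i} g$, treats level 0 as the three $H_{ij}$ with gradient-product coupling, and treats level 1 as level 0 plus all pairwise brackets of level-0 generators. Polynomials are stored as dictionaries from exponent tuples to `Fraction` coefficients, and all helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

NVAR = 6  # variable order: (w1, w2, w3, v1, v2, v3)

def var(i):
    e = [0] * NVAR
    e[i] = 1
    return {tuple(e): Fraction(1)}

def add(p, q, sign=1):
    out = dict(p)
    for e, c in q.items():
        out[e] = out.get(e, 0) + sign * c
        if out[e] == 0:
            del out[e]
    return out

def scale(p, c):
    return {e: c * k for e, k in p.items()}

def mul(p, q):
    out = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = tuple(a + b for a, b in zip(e1, e2))
            out[e] = out.get(e, 0) + c1 * c2
            if out[e] == 0:
                del out[e]
    return out

def diff(p, i):
    """Partial derivative with respect to variable i."""
    out = {}
    for e, c in p.items():
        if e[i]:
            out[e[:i] + (e[i] - 1,) + e[i + 1:]] = c * e[i]
    return out

def bracket(f, g, n=3):
    """Canonical bracket: sum_i df/dw_i * dg/dv_i - df/dv_i * dg/dw_i."""
    out = {}
    for i in range(n):
        out = add(out, mul(diff(f, i), diff(g, n + i)))
        out = add(out, mul(diff(f, n + i), diff(g, i)), sign=-1)
    return out

def rank(polys):
    """Rank of the polynomials as coefficient vectors, by exact elimination."""
    monos = sorted({e for p in polys for e in p})
    rows = [[p.get(e, Fraction(0)) for e in monos] for p in polys]
    r = 0
    for col in range(len(monos)):
        piv = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                f = rows[i][col] / rows[r][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

w = [var(i) for i in range(3)]
v = [var(3 + i) for i in range(3)]
one = {(0,) * NVAR: Fraction(1)}
resid = add(mul(mul(w[0], w[1]), w[2]), one, sign=-1)   # w1*w2*w3 - 1
loss = scale(mul(resid, resid), Fraction(1, 2))
grad = [diff(loss, i) for i in range(3)]

def H(i, j):
    """H_ij = v_i^2/2 + v_j^2/2 + (1/2) dL/dw_i dL/dw_j (gradient-product)."""
    kinetic = scale(add(mul(v[i], v[i]), mul(v[j], v[j])), Fraction(1, 2))
    return add(kinetic, scale(mul(grad[i], grad[j]), Fraction(1, 2)))

level0 = [H(i, j) for i, j in combinations(range(3), 2)]
level1 = [bracket(a, b) for a, b in combinations(level0, 2)]
print(rank(level0), rank(level0 + level1))   # prints: 3 6
```

Under these assumptions the sketch reproduces the first two entries, 3 and 6, of the reported sequence. Going to levels 2 and 3 means bracketing the new generators against everything found so far and re-running the rank test, which is where the polynomial degrees and the cost grow quickly.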

Seven Universality Classes in Neural Algebras

The headline discovery: different coupling types produce different algebras. A sweep of 12 pairwise couplings and 3 loss functions reveals at least seven distinct universality classes at L=3.

| Class | Dimension Sequence | Couplings/Losses |
|---|---|---|
| A | [3, 6, 17, 119] | gradient, fisher, gradient_abs, hessian_plain, gradient+L1 |
| B | [3, 6, 17, 115] | directional (kinetic-gradient) |
| C | [3, 6, 17, 111] | gradient_sum (diagonal only) |
| D | [3, 6, 17, 104] | gradient_cubic |
| E | [3, 6, 17, 87] | natural_gradient |
| F | [3, 6, 17, 62] | gradient + cross-entropy loss |
| G | [3, 5, 11, 47] | hessian, symmetric, hessian_full, loss_power |

The physical universality class at N=3 is [3, 6, 17, 116] — it sits between classes A (119) and B (115), and above all others in level-3 dimension. Classes A–F all match physics at levels 0–2 but diverge at level 3. Class G diverges already at level 1, producing only 5 independent generators (vs 6 in all other classes).

A key methodological note: the cleaner Fisher formulation $V_{ij} = g_i g_j (1 - L)$ used here places Fisher in Class A; an earlier polynomial-truncated variant landed in its own class. Small choices in how the coupling polynomial is normalized (what gets divided by what, where the loss enters multiplicatively) can move a coupling between classes. The 12-coupling sweep is designed to map this fine-grained landscape rather than treat it as noise.

The structure of the extras

The neural Class A algebra [3, 6, 17, 119] and the physical algebra [3, 6, 17, 116] have identical dimension counts at levels 0, 1, 2 but evaluate to completely different polynomials in phase space. In a shared 6D phase space (3 weights vs 3 positions), the rank of the concatenated 156+156 matrix equals 116 + 119 = 235 — every physical generator and every neural generator contributes an independent direction. The comparison is therefore structural (count vs count) rather than literal subalgebra.

Polynomial-degree analysis reveals the source of the "3 extra":

  • Physical $(q_i-q_j)^4$ algebra: level-3 generators at degrees 6, 8, 10 with cumulative ranks 38, 95, 116. The top-degree (10) stratum contributes $116-95 = 21$ new dimensions from 21 candidates (no syzygies).
  • Neural gradient algebra: level-3 generators at degrees 18, 26, 34 with cumulative ranks 32, 92, 119. The top-degree (34) stratum contributes $119-92 = 27$ new dimensions from 30 candidates (3 syzygies).

The net surplus of 3 at level 3 therefore originates in the highest polynomial-degree stratum: the neural algebra enters that stratum 3 dimensions behind (92 vs 95) but gains 6 more new dimensions there (27 vs 21), even though 3 of its 30 candidates are linearly dependent while none of the 21 physical candidates are.
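The stratum bookkeeping can be checked with a few lines of arithmetic. This sketch only re-derives increments from the cumulative ranks and candidate counts quoted in the two bullets; the dictionary layout and function name are illustrative.

```python
def new_dimensions(cumulative_ranks):
    """Turn cumulative ranks per degree stratum into per-stratum increments."""
    previous = [0] + list(cumulative_ranks[:-1])
    return [cur - prev for cur, prev in zip(cumulative_ranks, previous)]

# Cumulative level-3 ranks by degree stratum, as quoted in the text.
physical = {"degrees": [6, 8, 10], "ranks": [38, 95, 116], "top_candidates": 21}
neural = {"degrees": [18, 26, 34], "ranks": [32, 92, 119], "top_candidates": 30}

for name, data in (("physical", physical), ("neural", neural)):
    inc = new_dimensions(data["ranks"])
    syzygies = data["top_candidates"] - inc[-1]   # dependent top-degree candidates
    print(name, inc, "top-stratum syzygies:", syzygies)

gap = neural["ranks"][-1] - physical["ranks"][-1]
print("level-3 gap:", gap)
```

The increments come out as 38, 57, 21 for the physical algebra and 32, 60, 27 for the neural one, with a final gap of 3.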

Width Invariance

Does the dimension of weight space matter? We test networks with scalar weights $k=1$, 2D weights $k=2$, and 3D weights $k=3$:

| Width $k$ | Phase Space | Dims (L0, L1, L2) |
|---|---|---|
| 1 | 6D | [3, 6, 17] |
| 2 | 12D | [3, 6, 17] |
| 3 | 18D | [3, 6, 17] |

Through level 2, the algebra is independent of weight dimension for every width tested. This is the neural network analog of a remarkable property in physics: the pairwise Poisson algebra for $N$ gravitating bodies is independent of the spatial dimension $d$. Gravity in 1D, 2D, and 3D all produce [3, 6, 17, 116]. Neural networks with 1-, 2-, and 3-dimensional weights all produce [3, 6, 17, ...].

Loss Function: Partial Invariance

Does the loss function matter? We test gradient-product coupling with three loss functions:

| Loss | Definition | Dims (L0, L1, L2, L3) | Class |
|---|---|---|---|
| MSE | $\frac{1}{2}(f - t)^2$ | [3, 6, 17, 119] | A |
| L1 (smooth proxy)† | quadratic surrogate for $\lvert f - t \rvert$ | [3, 6, 17, 119] | A |
| Cross-entropy | $-\log\sigma(f)$, Taylor $z^2/4 + z^4/48$ | [3, 6, 17, 62] | F |

† The "L1" row uses a smooth quadratic proxy rather than the non-differentiable $\lvert f-t \rvert$ itself; the polynomial structure of the coupling is what enters the algebra computation.

MSE and L1 give the identical algebra; cross-entropy does not. The nonlinear $z^4$ term in the cross-entropy Taylor expansion contributes new monomials that change the level-3 structure dramatically, dropping from 119 to 62 independent generators and landing in its own universality class (Class F).

The MSE/L1 case is the neural analog of potential universality: quadratic losses behave like generic polynomial potentials $r^n$ with $n \ge 4$. Cross-entropy is the neural analog of an exceptional potential like $r^3$: it has enough special polynomial structure to produce its own class.

Activation Invariance

Does nonlinearity matter? We test gradient-product coupling with linear, tanh (Taylor truncation), and ReLU (softplus approximation) activations:

| Activation | Dims (L0, L1, L2) |
|---|---|
| Linear | [3, 6, 17] |
| tanh (Taylor) | [3, 6, 17] |
| ReLU (softplus) | [3, 6, 17] |

The algebra is independent of activation function through level 2. The Taylor approximations used here capture the leading nonlinear corrections; whether deeper levels also match is an open question requiring higher-order expansions or numerical methods.

Depth Scaling: L=3 Is an Accidental Match

As network depth increases, the neural algebra diverges from the physical N-body algebra at progressively earlier levels:

| Depth / N | Neural (gradient) | Physical (1/r in 1D) | Level where they first diverge |
|---|---|---|---|
| L=2 / N=2 | [1, 1, 1, 1] | [1, 1, 1, 1] | never (trivial) |
| L=3 / N=3 | [3, 6, 17, 119] | [3, 6, 17, 116] | level 3 (+3 extra) |
| L=4 / N=4 | [6, 20, 164, ...] | [6, 14, 62, 1260] | level 1 (+6 extra) |
| L=5 / N=5 | [10, 45, 210, ...] | [10, 25, 145, ...] | level 1 (+20 extra) |

The L=3 match at levels 0-2 is essentially accidental. At L=3 the neural and physical algebras both have the same level-0, level-1, and level-2 counts (3, 6, 17), and the divergence appears only at level 3 as the 3 extra generators. But for $L \ge 4$ the algebras already diverge at level 1: neural L=4 produces 14 new generators at level 1 versus 8 new for physical N=4.

The extras at level 1 fit an exact pattern across the depths tested:

$$\text{extras}_{L,1} = \binom{L}{2}(L - 3)$$

This gives 0, 6, 20 for $L = 3, 4, 5$, matching the observed data. The formula vanishes at $L=3$, explaining why level 1 matches physics there but not at the greater depths tested.
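The level-1 extras formula is easy to check against the depth table. The snippet below copies the level-1 dimensions for L = 3, 4, 5 from that table and compares the observed difference with $\binom{L}{2}(L-3)$; the names are illustrative.

```python
from math import comb

def level1_extras(depth):
    """Empirical formula from the text: C(L, 2) * (L - 3)."""
    return comb(depth, 2) * (depth - 3)

# Level-1 dimensions (neural gradient, physical) copied from the depth table.
level1_dims = {3: (6, 6), 4: (20, 14), 5: (45, 25)}

for depth, (neural, physical) in level1_dims.items():
    observed = neural - physical
    print(depth, observed, level1_extras(depth))
```

Each printed row shows the depth followed by two equal numbers (0, 6, and 20 respectively), so the formula reproduces all three observed values.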

For comparison, the physical N-body problem with $N$ bodies in 1D has closed-form scaling formulas:

$$d_0(N) = \binom{N}{2}, \quad d_1(N) = \frac{N(3N-5)}{2}, \quad d_2(N) = \frac{N(4N^2 - 9N + 3)}{2}$$

Whether the neural depth scaling follows similar closed-form formulas for $L \ge 4$ is an open question, though the level-1 extras formula above fits every depth tested ($L = 3, 4, 5$).
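A quick sanity check of the physical closed forms is to evaluate them and compare with the physical column of the depth table. The sketch below does this for N = 4 and N = 5; the function names are illustrative. One caveat: the $d_2$ expression as printed evaluates to 18 at N = 3, one more than the 17 reported for N = 3, so the comparison is restricted to the rows where the printed formula and the table agree.

```python
from math import comb

def d0(n):
    return comb(n, 2)

def d1(n):
    return n * (3 * n - 5) // 2              # numerator is always even

def d2(n):
    return n * (4 * n * n - 9 * n + 3) // 2  # numerator is always even

# Physical (1/r in 1D) dimensions for levels 0-2, copied from the depth table.
table = {4: (6, 14, 62), 5: (10, 25, 145)}

for n, expected in table.items():
    print(n, (d0(n), d1(n), d2(n)), expected)
```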

The Physical Benchmark: 116 vs 119

The physical universality class at $N = 3$ gives [3, 6, 17, 116] for:

  • Newtonian gravity $1/r$
  • Coulomb potential $1/r$ (identical)
  • Calogero-Moser $1/r^2$
  • Dipole-dipole $1/r^3$
  • Logarithmic vortex $\log r$
  • Yukawa nuclear force $e^{-\mu r}/r$
  • GUE eigenvalue dynamics (random matrix theory)
  • All polynomial $r^n$ with $n \ge 4$
  • 576 exponent values across the continuous landscape
  • 20 named physical systems (helium to triple black holes)
  • All mass ratios from 1 to $10^{10}$
  • All charge configurations tested

The neural gradient-product coupling gives [3, 6, 17, 119]. The structure of the 3-dimensional "gap" has been characterized by polynomial-degree stratification:

| Degree stratum | Physical $(q_i-q_j)^4$ cumulative rank | Neural gradient cumulative rank |
|---|---|---|
| Level 3, lowest | 38 (deg 6) | 32 (deg 18) |
| Level 3, middle | 95 (deg 8) | 92 (deg 26) |
| Level 3, top | 116 (deg 10, +21 new) | 119 (deg 34, +27 new) |

The 3 extra generators emerge at the top-degree stratum of level 3. Physics has 21 independent top-degree generators out of 21 candidates (no syzygies). Neural has 27 independent top-degree generators out of 30 candidates (3 syzygies). Despite those 3 relations, the neural algebra gains 6 more dimensions than physics at the polynomial frontier, which turns its deficit of 3 at the middle stratum (92 vs 95) into a surplus of 3 overall.

The dimension counts are identical at levels 0, 1, 2; the neural and physical generators occupy entirely different polynomial subspaces (the combined rank is $116+119=235$), so this is a structural-count rather than literal-containment relation.

Dataset Overview

The full dataset contains 993 rows across 13 Parquet tables:

| Split | Rows | Description |
|---|---|---|
| neural_algebras | 21 | Neural network algebra sweep (depth, width, 12 couplings, loss, activation) |
| dimension_sequences | 685 | Dimension sequences for N=3..50, 16+ potentials, quantum/classical |
| structure_constants | 16 | Exact rational structure constant tensors |
| physical_systems | 17 | Named systems from helium to triple black holes |
| charge_sensitivity | 38 | Charge-independence tests |
| mass_invariance | 33 | Mass ratio sweep (10 orders of magnitude) |
| convergence_trajectories | 77 | SVD rank convergence tracking |
| tier_decomposition | 40 | $S_3$ and $S_4$ representation decomposition |
| level4_convergence | 19 | Level-4 lower bounds |
| spectral_statistics | 17 | Phase-space atlas rank distributions |
| contextuality | 16 | Kochen-Specker contextuality tests |
| bell_test | 9 | CHSH Bell inequality tests |
| scaling_formulas | 5 | Closed-form dimension scaling formulas |

Quick Start

import json
from collections import defaultdict

from datasets import load_dataset

# Load neural network algebras
ds = load_dataset("bshepp/pairwise-poisson-algebras", "neural_algebras")
df = ds["train"].to_pandas()

# Compare coupling types
for _, row in df[df["n_layers"] == 3].iterrows():
    print(f"{row['coupling_type']:12s} {row['loss_function']:15s} "
          f"{row['activation']:12s} -> {row['dimension_sequence']}")

# Compare neural vs physical dimension sequences
# Neural algebras
nn = load_dataset("bshepp/pairwise-poisson-algebras", "neural_algebras")["train"].to_pandas()
gradient_l3 = nn[(nn["n_layers"] == 3) & (nn["coupling_type"] == "gradient")
                  & (nn["activation"] == "linear") & (nn["loss_function"] == "mse")]
nn_dims = json.loads(gradient_l3.iloc[0]["dimension_sequence"])

# Physical systems (helium has the universal physical algebra [3, 6, 17, 116])
phys = load_dataset("bshepp/pairwise-poisson-algebras", "physical_systems")["train"].to_pandas()
gravity_dims = json.loads(phys[phys["system_name"] == "helium"].iloc[0]["dimension_sequence"])

print(f"Neural (gradient):  {nn_dims}")
print(f"Physics (helium):   {gravity_dims}")
print(f"Extra generators:   {nn_dims[3] - gravity_dims[3]}")

# Discover universality classes (group configurations by dimension sequence)
classes = defaultdict(list)
for _, row in nn.iterrows():
    dims = row["dimension_sequence"]
    classes[dims].append(f"{row['coupling_type']} / {row['loss_function']} / {row['activation']}")

for dims, configs in sorted(classes.items()):
    print(f"\n{dims}:")
    for c in configs:
        print(f"  {c}")

Open Problems

  1. Exact neural rank for the remaining classes at level 3: The gradient-product [3, 6, 17, 119] result has been verified exactly over $\mathbb{Q}$ (neural/nn_poisson.py). The other six classes currently rely on numerical SVD over 600+ phase-space samples, with a singular-value gap of $\sim 5 \times 10^8$ at the cutoff, which is strong but not symbolic. Confirming each class with exact rational arithmetic would close the loop.

  2. Does activation break universality at level 3? Linear, tanh (Taylor), and ReLU (softplus) all give [3, 6, 17] through level 2. The linear case extends to 119 at level 3; verifying tanh and ReLU at level 3 is computationally difficult due to polynomial blow-up.

  3. The seven classes: What operation on couplings governs the class? Hessian vs gradient vs gradient_sum are all natural choices, yet they produce 47, 119, 111. Is there a classification theorem for neural universality classes analogous to the physical $r^n$ classification?

  4. Depth scaling formulas: The extras-at-level-1 formula $\binom{L}{2}(L-3)$ is empirical and exact for $L=3, 4, 5$. A closed form for the full dimension sequence as a function of $L$ (and for the extras at levels 2 and 3) would generalize the physical scaling laws.

  5. Nonlinear networks: Full (non-Taylor) nonlinear activations create non-polynomial dynamics. Can the algebra be computed for genuine ReLU or sigmoid networks via asymptotic or numerical methods?

  6. The 3 extra generators at L=3: They appear at the highest polynomial-degree stratum (degree 34), where 27 of 30 neural candidates are independent, versus 21 of 21 physical candidates at degree 10. What is the algebraic origin of this larger top stratum and of its 3 relations?

  7. Converse question: Are any of the seven classes (119, 115, 111, 104, 87, 62, 47) producible by some physical potential? Or do they define genuinely new algebraic structures with no physical realization? The log-gas/GUE algebra matches physical 116; are there other "hidden" physical systems in these classes?

  8. Connection to loss landscape geometry: The coupling type determines the universality class. Gradient-product measures force co-alignment; Hessian measures curvature; directional couples to velocity. How does this connect to saddle-point structure, mode connectivity, and the edge of stability in real networks?

  9. Wider networks (matrix weights): Weight dimensions $k = 1, 2, 3$ all give the same algebra at L=3 through level 2. Does this extend to full matrix-valued weight layers in realistic architectures, or does the symmetry break at matrix widths where weight-space ordering becomes non-trivial?

Citation

@dataset{sheppard2026poisson,
  title={Pairwise Poisson Algebras of the N-Body Problem},
  author={Sheppard, Brian},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/bshepp/pairwise-poisson-algebras}
}
