File size: 7,424 Bytes
09e2cfe | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | # EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?
## Overview
This repository contains an empirical study of whether the **EML operator** `eml(x,y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for **symbolic regression** via gradient descent.
### The Theoretical Discovery
The EML paper proved that **every elementary mathematical function** — addition, multiplication, trigonometry, logarithms, π, e, etc. — can be generated from just one binary operator and the constant 1:
```
eml(x, y) = exp(x) − ln(y)
```
This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 | eml(S, S)`.
### The Practical Problem
While mathematically universal, this crashes in code. Stacking exponentials 3-4 levels deep in floating-point arithmetic causes numbers to **explode to infinity** or **collapse to zero**. The paper itself reports:
- **Depth 1-2**: 100% recovery from random initialization
- **Depth 3-4**: ~25% recovery
- **Depth 5+**: <1% recovery
- **Depth 6**: 0% in 448 attempts
Yet paradoxically, when initialized **near the correct solution**, recovery is 100% even at depth 5-6. The basins of attraction exist — they're just needle-in-a-haystack from random init.
## Research Questions
1. **Which numerical stability techniques** most improve deep EML tree training?
2. **What is the maximum recoverable tree depth** with enhanced methods?
3. **Can EML-based SR recover real physics equations** (Feynman benchmark)?
## Methods
### Stability Techniques Tested
| Method | Description | Source |
|--------|-------------|--------|
| **Soft routing** | Standard softmax input selection (baseline) | EML paper §4.3 |
| **Gumbel-hard** | Straight-through Gumbel-softmax — hard selection in forward, soft gradients in backward | Jang et al. 2017 |
| **Bounded** | `tanh(output/R) * R` normalization after each node | Inspired by NALU (Trask 2018) |
| **Combined** | Saturating linear: `x / (1 + |x|/R)` + Gumbel-hard routing | Novel combination |
### Key Innovations
1. **Hard routing prevents intermediate explosion**: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.
2. **Multi-loss training**: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).
3. **Temperature annealing**: Start with high temperature (smooth, exploratory) and anneal to near-zero (hard, discrete) over training.
4. **Multi-restart search**: Since basins are narrow, we run 20-30 random initializations per configuration and report best + success rates.
### Architecture: The Master Formula
Following the paper's §4.3, we implement the EML master formula as a full binary tree:
- **Leaf nodes** select from `{1, x₁, ..., xₖ}` (constant and input variables)
- **Internal nodes** select from `{1, x₁, ..., xₖ, f_left, f_right}` (also including child outputs)
- Each selection is parameterized by learnable logits passed through Gumbel-softmax
- Output affine transform `a * eml(left, right) + b` per node
Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).
## Experimental Design
### Phase 1: Known EML Identities
Test recovery of functions with known EML decompositions:
| Function | EML Depth | EML Expression |
|----------|-----------|----------------|
| `exp(x)` | 1 | `eml(x, 1)` |
| `e` (constant) | 1 | `eml(1, 1)` |
| `ln(x)` | 3 | `eml(1, eml(eml(1,x), 1))` |
| `-x` | 2 | Via composition |
| `1/x` | 3 | Via composition |
| `x + y` | 4 | Via exp/ln identities |
| `x × y` | 4+ | Via exp/ln identities |
| `x²` | 4 | `exp(2·ln(x))` |
| `√x` | 4 | `exp(0.5·ln(x))` |
| `sin(x)` | 5+ | Requires complex intermediates |
### Phase 2: Feynman Physics Equations
A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
- Gaussian distribution: `exp(-θ²/2)/√(2π)`
- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
- Inverse square law: `F = q₁q₂/(4πε₀r²)`
- Relativistic mass: `m₀/√(1-v²/c²)`
- Harmonic oscillator: `E = ½kx²`
- And more...
### Phase 3: Depth Scaling Analysis
Systematic measurement of recovery rate vs. depth using EML-native targets.
## Key Literature References
| Topic | Paper | Key Insight |
|-------|-------|-------------|
| EML operator | [2603.21852](https://arxiv.org/abs/2603.21852) | Universal primitive for elementary functions |
| Gumbel-softmax | Jang et al. 2017 | Differentiable discrete selection |
| NALU | [1808.00508](https://arxiv.org/abs/1808.00508) | Stable exp-log arithmetic cells |
| NAU | [2001.05016](https://arxiv.org/abs/2001.05016) | Fixing NALU's gradient issues |
| Gradient clipping | [1211.5063](https://arxiv.org/abs/1211.5063) | Controlling exploding gradients |
| BFloat16 training | [2010.06192](https://arxiv.org/abs/2010.06192) | Kahan summation for precision |
| AutoNumerics-Zero | [2312.08472](https://arxiv.org/abs/2312.08472) | Range reduction for transcendentals |
| Numerical stability | [2501.04697](https://arxiv.org/abs/2501.04697) | Grokking at the edge of stability |
| Tropical geometry | [2505.17190](https://arxiv.org/abs/2505.17190) | Max-plus limit of log-sum-exp |
| AI Feynman | Udrescu & Tegmark 2020 | Physics equations benchmark |
| SRSD | [2206.10540](https://arxiv.org/abs/2206.10540) | Feynman benchmark with proper data |
| PySR | Cranmer 2023 | Evolutionary symbolic regression |
| TPSR | [2303.06833](https://arxiv.org/abs/2303.06833) | Transformer + MCTS for SR |
## Preliminary Results (CPU validation)
From our CPU sandbox testing:
| Function | Depth | Best R² | Method | Notes |
|----------|-------|---------|--------|-------|
| `exp(x)` | 1 | **0.9999** | Gumbel-hard | ✅ Trivially recovered |
| `e` (const) | 1 | **0.9999** | Gumbel-hard | ✅ Correct: `eml(1,1)` |
| `ln(x)` | 3 | -0.08 | All methods | ❌ All 10 restarts fail |
| `x²` | 4 | TBD | - | Awaiting GPU results |
### Key Observation
**The depth-3 barrier is real and severe.** Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization fails consistently. This aligns with the paper's finding of ~25% success at depth 3-4 and suggests that:
1. The loss landscape at depth 3+ has **exponentially many local minima** relative to the one correct basin
2. Better optimization (second-order methods, population-based search) may help
3. **Informed initialization** (starting near known decompositions) is likely required for practical use
## GPU Experiment Status
🔄 **Running**: Full experiment on T4 GPU with 3 phases and 4 stability methods.
Job: `69e7837acd8c002f31e00d75`
Results will be uploaded to the `results/` folder upon completion.
## How to Reproduce
```python
# Install dependencies
pip install torch numpy huggingface_hub
# Run the full experiment
python code/eml_experiment.py
```
## Citation
If you use this work, please cite the original EML paper:
```
@article{eml2026,
title={All elementary functions from a single operator},
author={...},
journal={arXiv preprint arXiv:2603.21852},
year={2026}
}
```
## License
MIT
|