File size: 7,424 Bytes
09e2cfe
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?

## Overview

This repository contains an empirical study of whether the **EML operator** `eml(x,y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for **symbolic regression** via gradient descent.

### The Theoretical Discovery
The EML paper proved that **every elementary mathematical function** — addition, multiplication, trigonometry, logarithms, π, e, etc. — can be generated from just one binary operator and the constant 1:

```
eml(x, y) = exp(x) − ln(y)
```

This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 | eml(S, S)`.

### The Practical Problem  
While mathematically universal, this crashes in code. Stacking exponentials 3-4 levels deep in floating-point arithmetic causes numbers to **explode to infinity** or **collapse to zero**. The paper itself reports:
- **Depth 1-2**: 100% recovery from random initialization
- **Depth 3-4**: ~25% recovery  
- **Depth 5+**: <1% recovery
- **Depth 6**: 0% in 448 attempts

Yet paradoxically, when initialized **near the correct solution**, recovery is 100% even at depth 5-6. The basins of attraction exist — they're just needle-in-a-haystack from random init.

## Research Questions

1. **Which numerical stability techniques** most improve deep EML tree training?
2. **What is the maximum recoverable tree depth** with enhanced methods?
3. **Can EML-based SR recover real physics equations** (Feynman benchmark)?

## Methods

### Stability Techniques Tested

| Method | Description | Source |
|--------|-------------|--------|
| **Soft routing** | Standard softmax input selection (baseline) | EML paper §4.3 |
| **Gumbel-hard** | Straight-through Gumbel-softmax — hard selection in forward, soft gradients in backward | Jang et al. 2017 |
| **Bounded** | `tanh(output/R) * R` normalization after each node | Inspired by NALU (Trask 2018) |
| **Combined** | Saturating linear: `x / (1 + |x|/R)` + Gumbel-hard routing | Novel combination |

### Key Innovations

1. **Hard routing prevents intermediate explosion**: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.

2. **Multi-loss training**: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).

3. **Temperature annealing**: Start with high temperature (smooth, exploratory) and anneal to near-zero (hard, discrete) over training.

4. **Multi-restart search**: Since basins are narrow, we run 20-30 random initializations per configuration and report best + success rates.

### Architecture: The Master Formula

Following the paper's §4.3, we implement the EML master formula as a full binary tree:
- **Leaf nodes** select from `{1, x₁, ..., xₖ}` (constant and input variables)
- **Internal nodes** select from `{1, x₁, ..., xₖ, f_left, f_right}` (also including child outputs)
- Each selection is parameterized by learnable logits passed through Gumbel-softmax
- Output affine transform `a * eml(left, right) + b` per node

Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).

## Experimental Design

### Phase 1: Known EML Identities
Test recovery of functions with known EML decompositions:

| Function | EML Depth | EML Expression |
|----------|-----------|----------------|
| `exp(x)` | 1 | `eml(x, 1)` |
| `e` (constant) | 1 | `eml(1, 1)` |
| `ln(x)` | 3 | `eml(1, eml(eml(1,x), 1))` |
| `-x` | 2 | Via composition |
| `1/x` | 3 | Via composition |
| `x + y` | 4 | Via exp/ln identities |
| `x × y` | 4+ | Via exp/ln identities |
| `x²` | 4 | `exp(2·ln(x))` |
| `√x` | 4 | `exp(0.5·ln(x))` |
| `sin(x)` | 5+ | Requires complex intermediates |

### Phase 2: Feynman Physics Equations
A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
- Gaussian distribution: `exp(-θ²/2)/√(2π)`
- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
- Inverse square law: `F = q₁q₂/(4πε₀r²)`
- Relativistic mass: `m₀/√(1-v²/c²)`
- Harmonic oscillator: `E = ½kx²`
- And more...

### Phase 3: Depth Scaling Analysis
Systematic measurement of recovery rate vs. depth using EML-native targets.

## Key Literature References

| Topic | Paper | Key Insight |
|-------|-------|-------------|
| EML operator | [2603.21852](https://arxiv.org/abs/2603.21852) | Universal primitive for elementary functions |
| Gumbel-softmax | Jang et al. 2017 | Differentiable discrete selection |
| NALU | [1808.00508](https://arxiv.org/abs/1808.00508) | Stable exp-log arithmetic cells |
| NAU | [2001.05016](https://arxiv.org/abs/2001.05016) | Fixing NALU's gradient issues |
| Gradient clipping | [1211.5063](https://arxiv.org/abs/1211.5063) | Controlling exploding gradients |
| BFloat16 training | [2010.06192](https://arxiv.org/abs/2010.06192) | Kahan summation for precision |
| AutoNumerics-Zero | [2312.08472](https://arxiv.org/abs/2312.08472) | Range reduction for transcendentals |
| Numerical stability | [2501.04697](https://arxiv.org/abs/2501.04697) | Grokking at the edge of stability |
| Tropical geometry | [2505.17190](https://arxiv.org/abs/2505.17190) | Max-plus limit of log-sum-exp |
| AI Feynman | Udrescu & Tegmark 2020 | Physics equations benchmark |
| SRSD | [2206.10540](https://arxiv.org/abs/2206.10540) | Feynman benchmark with proper data |
| PySR | Cranmer 2023 | Evolutionary symbolic regression |
| TPSR | [2303.06833](https://arxiv.org/abs/2303.06833) | Transformer + MCTS for SR |

## Preliminary Results (CPU validation)

From our CPU sandbox testing:

| Function | Depth | Best R² | Method | Notes |
|----------|-------|---------|--------|-------|
| `exp(x)` | 1 | **0.9999** | Gumbel-hard | ✅ Trivially recovered |
| `e` (const) | 1 | **0.9999** | Gumbel-hard | ✅ Correct: `eml(1,1)` |
| `ln(x)` | 3 | -0.08 | All methods | ❌ All 10 restarts fail |
| `x²` | 4 | TBD | - | Awaiting GPU results |

### Key Observation
**The depth-3 barrier is real and severe.** Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization fails consistently. This aligns with the paper's finding of ~25% success at depth 3-4 and suggests that:

1. The loss landscape at depth 3+ has **exponentially many local minima** relative to the one correct basin
2. Better optimization (second-order methods, population-based search) may help
3. **Informed initialization** (starting near known decompositions) is likely required for practical use

## GPU Experiment Status

🔄 **Running**: Full experiment on T4 GPU with 3 phases and 4 stability methods.
Job: `69e7837acd8c002f31e00d75`

Results will be uploaded to the `results/` folder upon completion.

## How to Reproduce

```python
# Install dependencies
pip install torch numpy huggingface_hub

# Run the full experiment
python code/eml_experiment.py
```

## Citation

If you use this work, please cite the original EML paper:
```
@article{eml2026,
  title={All elementary functions from a single operator},
  author={...},
  journal={arXiv preprint arXiv:2603.21852},
  year={2026}
}
```

## License
MIT