# EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?

## Overview

This repository contains an empirical study of whether the **EML operator** `eml(x,y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for **symbolic regression** via gradient descent.

### The Theoretical Discovery
The EML paper proved that **every elementary mathematical function** — addition, multiplication, trigonometry, logarithms, π, e, etc. — can be generated from just one binary operator and the constant 1:

```
eml(x, y) = exp(x) − ln(y)
```

This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 | eml(S, S)`.
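As a small illustration of the grammar, the sketch below defines the operator over the reals and enumerates the values reachable from the constant 1 up to a given nesting depth. The helper name `enumerate_values` and the restriction to real `y > 0` are assumptions made for this example; the paper's construction may rely on complex intermediates.

```python
import math

def eml(x: float, y: float) -> float:
    """eml(x, y) = exp(x) - ln(y), restricted here to real y > 0."""
    return math.exp(x) - math.log(y)

def enumerate_values(depth: int) -> set:
    """Values of all expressions S -> 1 | eml(S, S) up to `depth` nesting levels."""
    values = {1.0}
    for _ in range(depth):
        new_values = set(values)
        for a in values:
            for b in values:
                if b > 0:  # stay inside the real domain of ln
                    new_values.add(eml(a, b))
        values = new_values
    return values

if __name__ == "__main__":
    for d in range(3):
        print(d, sorted(enumerate_values(d)))
```

At depth 1 the only new value is `eml(1, 1) = e`; at depth 2 the set also contains `e - 1` (from `eml(1, e)`) and `e^e` (from `eml(e, 1)`).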

### The Practical Problem
Although the operator is mathematically universal, a naive implementation fails numerically. Stacking exponentials 3-4 levels deep in floating-point arithmetic makes intermediate values **overflow to infinity** or **underflow to zero**. The paper itself reports:
- **Depth 1-2**: 100% recovery from random initialization
- **Depth 3-4**: ~25% recovery  
- **Depth 5+**: <1% recovery
- **Depth 6**: 0% in 448 attempts

Yet paradoxically, when initialized **near the correct solution**, recovery is 100% even at depth 5-6. The basins of attraction exist, but they are too narrow to find reliably from random initialization.
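The overflow behind these failure rates is easy to reproduce. The sketch below iterates `exp` starting from 1 in double precision; `safe_exp` and `exp_tower` are hypothetical helper names, and mapping `OverflowError` to `inf` is a choice made here to mimic what array libraries typically return.

```python
import math

def safe_exp(x: float) -> float:
    """exp(x) that returns inf instead of raising on overflow."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

def exp_tower(depth: int, start: float = 1.0) -> list:
    """Return [start, exp(start), exp(exp(start)), ...] with `depth` applications."""
    values = [start]
    for _ in range(depth):
        values.append(safe_exp(values[-1]))
    return values

if __name__ == "__main__":
    tower = exp_tower(4)
    print(tower)                # the fourth application overflows to inf
    print(math.exp(-tower[3]))  # the mirrored case underflows to 0.0
```

Three nested exponentials already reach roughly 3.8 million, so a fourth exceeds the float64 range, and `exp` of the negated value underflows to exactly zero.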

## Research Questions

1. **Which numerical stability techniques** most improve deep EML tree training?
2. **What is the maximum recoverable tree depth** with enhanced methods?
3. **Can EML-based SR recover real physics equations** (Feynman benchmark)?

## Methods

### Stability Techniques Tested

| Method | Description | Source |
|--------|-------------|--------|
| **Soft routing** | Standard softmax input selection (baseline) | EML paper §4.3 |
| **Gumbel-hard** | Straight-through Gumbel-softmax — hard selection in forward, soft gradients in backward | Jang et al. 2017 |
| **Bounded** | `tanh(output/R) * R` normalization after each node | Inspired by NALU (Trask 2018) |
| **Combined** | Saturating linear: `x / (1 + \|x\|/R)` + Gumbel-hard routing | Novel combination |
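The two bounding functions in the table can be compared directly. This is a minimal sketch, not the repository's code: the function names and the value `R = 10.0` are assumptions chosen for illustration.

```python
import math

def bounded_tanh(x: float, R: float = 10.0) -> float:
    """tanh(x / R) * R: near-identity for |x| << R, saturates at +/-R."""
    return math.tanh(x / R) * R

def saturating_linear(x: float, R: float = 10.0) -> float:
    """x / (1 + |x| / R): also bounded by R, with a slower polynomial approach."""
    return x / (1.0 + abs(x) / R)

if __name__ == "__main__":
    for x in (0.01, 1.0, 10.0, 1e6, math.inf):
        print(x, bounded_tanh(x), saturating_linear(x))
```

One difference worth noting in this sketch: `bounded_tanh(inf)` returns `R`, whereas `saturating_linear(inf)` evaluates `inf / inf` and yields `nan`, so the saturating form only helps if it is applied before a value has overflowed.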

### Key Innovations

1. **Hard routing prevents intermediate explosion**: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.

2. **Multi-loss training**: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).

3. **Temperature annealing**: Start with high temperature (smooth, exploratory) and anneal to near-zero (hard, discrete) over training.

4. **Multi-restart search**: Since basins are narrow, we run 20-30 random initializations per configuration and report best + success rates.
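The multi-loss objective and the annealing schedule from items 2 and 3 can be sketched in plain Python as follows. The weights `w_corr` and `w_ent`, the geometric schedule, and the start and end temperatures are all assumptions for illustration; the source does not specify these values.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def correlation_loss(pred, target, eps=1e-12):
    """1 - Pearson correlation: near zero whenever pred matches the shape of target."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return 1.0 - cov / (math.sqrt(var_p * var_t) + eps)

def routing_entropy(probs):
    """Shannon entropy of one routing distribution; zero for a one-hot choice."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature(step, total_steps, t_start=2.0, t_end=0.05):
    """Geometric interpolation from t_start to t_end (hypothetical schedule)."""
    return t_start * (t_end / t_start) ** (step / total_steps)

def total_loss(pred, target, routing_probs, w_corr=1.0, w_ent=0.01):
    """MSE + weighted correlation loss + weighted routing entropy."""
    return (mse(pred, target)
            + w_corr * correlation_loss(pred, target)
            + w_ent * routing_entropy(routing_probs))
```

The correlation term is insensitive to scale and offset, so a prediction with the right shape but the wrong magnitude still receives a useful signal, while the entropy term penalizes routing distributions that remain spread across several inputs.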

### Architecture: The Master Formula

Following the paper's §4.3, we implement the EML master formula as a full binary tree:
- **Leaf nodes** select from `{1, x₁, ..., xₖ}` (constant and input variables)
- **Internal nodes** select from `{1, x₁, ..., xₖ, f_left, f_right}` (also including child outputs)
- Each selection is parameterized by learnable logits passed through Gumbel-softmax
- Output affine transform `a * eml(left, right) + b` per node

Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).
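A single node's forward pass under hard routing might look like the sketch below. It is a pure-Python illustration under stated assumptions: the function names are hypothetical, only the forward selection is shown, and the straight-through gradient (soft probabilities in the backward pass) needs an autograd framework and is not reproduced here.

```python
import math
import random

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_hard_select(logits, tau, rng):
    """Add Gumbel noise, then return (argmax index, soft probabilities)."""
    noisy = []
    for l in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append(l - math.log(-math.log(u)))
    probs = softmax(noisy, tau)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs

def eml_node(candidates, left_logits, right_logits, a, b, tau, rng):
    """a * eml(left, right) + b, with each input hard-selected from candidates."""
    li, _ = gumbel_hard_select(left_logits, tau, rng)
    ri, _ = gumbel_hard_select(right_logits, tau, rng)
    left, right = candidates[li], candidates[ri]
    return a * (math.exp(left) - math.log(right)) + b

if __name__ == "__main__":
    rng = random.Random(0)
    x = 0.5
    # Logits strongly favour left = x and right = 1, i.e. eml(x, 1) = exp(x).
    print(eml_node([1.0, x], [0.0, 50.0], [50.0, 0.0], 1.0, 0.0, 0.1, rng))
```

Because exactly one candidate is chosen per input, the node never exponentiates a weighted mixture of candidates.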

## Experimental Design

### Phase 1: Known EML Identities
Test recovery of functions with known EML decompositions:

| Function | EML Depth | EML Expression |
|----------|-----------|----------------|
| `exp(x)` | 1 | `eml(x, 1)` |
| `e` (constant) | 1 | `eml(1, 1)` |
| `ln(x)` | 3 | `eml(1, eml(eml(1,x), 1))` |
| `-x` | 2 | Via composition |
| `1/x` | 3 | Via composition |
| `x + y` | 4 | Via exp/ln identities |
| `x × y` | 4+ | Via exp/ln identities |
| `x²` | 4 | `exp(2·ln(x))` |
| `√x` | 4 | `exp(0.5·ln(x))` |
| `sin(x)` | 5+ | Requires complex intermediates |
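The `ln(x)` row can be checked by hand: `eml(1, x) = e - ln(x)`, then `eml(e - ln(x), 1) = e^e / x`, and finally `eml(1, e^e / x) = e - (e - ln(x)) = ln(x)`. The short script below confirms this numerically for real `x > 0`; the helper name `ln_via_eml` is hypothetical.

```python
import math

def eml(x: float, y: float) -> float:
    """eml(x, y) = exp(x) - ln(y), restricted here to real y > 0."""
    return math.exp(x) - math.log(y)

def ln_via_eml(x: float) -> float:
    """Depth-3 decomposition from the table: eml(1, eml(eml(1, x), 1))."""
    inner = eml(1.0, x)       # e - ln(x)
    middle = eml(inner, 1.0)  # exp(e - ln(x)) = e^e / x
    return eml(1.0, middle)   # e - ln(e^e / x) = ln(x)

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 10.0):
        print(x, ln_via_eml(x), math.log(x))
```

The intermediate `e^e / x` grows quickly as `x` approaches zero, which gives a first hint of why even this depth-3 target is numerically delicate.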

### Phase 2: Feynman Physics Equations
A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
- Gaussian distribution: `exp(-θ²/2)/√(2π)`
- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
- Inverse square law: `F = q₁q₂/(4πε₀r²)`
- Relativistic mass: `m₀/√(1-v²/c²)`
- Harmonic oscillator: `E = ½kx²`
- And more...

### Phase 3: Depth Scaling Analysis
Systematic measurement of recovery rate vs. depth using EML-native targets.

## Key Literature References

| Topic | Paper | Key Insight |
|-------|-------|-------------|
| EML operator | [2603.21852](https://arxiv.org/abs/2603.21852) | Universal primitive for elementary functions |
| Gumbel-softmax | Jang et al. 2017 | Differentiable discrete selection |
| NALU | [1808.00508](https://arxiv.org/abs/1808.00508) | Stable exp-log arithmetic cells |
| NAU | [2001.05016](https://arxiv.org/abs/2001.05016) | Fixing NALU's gradient issues |
| Gradient clipping | [1211.5063](https://arxiv.org/abs/1211.5063) | Controlling exploding gradients |
| BFloat16 training | [2010.06192](https://arxiv.org/abs/2010.06192) | Kahan summation for precision |
| AutoNumerics-Zero | [2312.08472](https://arxiv.org/abs/2312.08472) | Range reduction for transcendentals |
| Numerical stability | [2501.04697](https://arxiv.org/abs/2501.04697) | Grokking at the edge of stability |
| Tropical geometry | [2505.17190](https://arxiv.org/abs/2505.17190) | Max-plus limit of log-sum-exp |
| AI Feynman | Udrescu & Tegmark 2020 | Physics equations benchmark |
| SRSD | [2206.10540](https://arxiv.org/abs/2206.10540) | Feynman benchmark with proper data |
| PySR | Cranmer 2023 | Evolutionary symbolic regression |
| TPSR | [2303.06833](https://arxiv.org/abs/2303.06833) | Transformer + MCTS for SR |

## Preliminary Results (CPU validation)

From our CPU sandbox testing:

| Function | Depth | Best R² | Method | Notes |
|----------|-------|---------|--------|-------|
| `exp(x)` | 1 | **0.9999** | Gumbel-hard | ✅ Trivially recovered |
| `e` (const) | 1 | **0.9999** | Gumbel-hard | ✅ Correct: `eml(1,1)` |
| `ln(x)` | 3 | -0.08 | All methods | ❌ All 10 restarts fail |
| `x²` | 4 | TBD | - | Awaiting GPU results |
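For reading the table, it helps to recall how R² behaves. The sketch below assumes the reported metric is the standard coefficient of determination, which the source does not state explicitly; the function name `r2_score` is chosen for this example.

```python
def r2_score(pred, target):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    target = [1.0, 2.0, 3.0]
    print(r2_score([1.0, 2.0, 3.0], target))  # perfect fit
    print(r2_score([2.0, 2.0, 2.0], target))  # constant prediction at the mean
    print(r2_score([3.0, 2.0, 1.0], target))  # worse than the mean
```

Under this definition, a slightly negative score such as the `-0.08` reported for `ln(x)` means the best recovered tree fits the data slightly worse than a constant equal to the target mean.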

### Key Observation
**The depth-3 barrier is real and severe.** Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization failed in all 10 restarts. This is broadly consistent with the paper's finding of ~25% success at depth 3-4 and suggests that:

1. The loss landscape at depth 3+ has **exponentially many local minima** relative to the one correct basin
2. Better optimization (second-order methods, population-based search) may help
3. **Informed initialization** (starting near known decompositions) is likely required for practical use
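As a rough sanity check on how many restarts are needed, the sketch below computes the chance that every restart fails. It assumes restarts are independent and uses the paper's ~25% figure as the per-restart success probability; neither assumption has been verified for this setup.

```python
def prob_all_fail(p_success: float, n_restarts: int) -> float:
    """Probability that n independent restarts all fail."""
    return (1.0 - p_success) ** n_restarts

if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, prob_all_fail(0.25, n))
```

Under these assumptions, ten consecutive failures would occur about 5.6% of the time, so the CPU result is compatible with a 25% success rate but also hints that the effective rate in this setup may be lower.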

## GPU Experiment Status

🔄 **Running**: Full experiment on T4 GPU with 3 phases and 4 stability methods.
Job: `69e7837acd8c002f31e00d75`

Results will be uploaded to the `results/` folder upon completion.

## How to Reproduce

```bash
# Install dependencies
pip install torch numpy huggingface_hub

# Run the full experiment
python code/eml_experiment.py
```

## Citation

If you use this work, please cite the original EML paper:
```
@article{eml2026,
  title={All elementary functions from a single operator},
  author={...},
  journal={arXiv preprint arXiv:2603.21852},
  year={2026}
}
```

## License
MIT