Add comprehensive README with research overview and preliminary results
README.md
ADDED
@@ -0,0 +1,158 @@
# EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?

## Overview

This repository contains an empirical study of whether the **EML operator** `eml(x, y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for **symbolic regression** via gradient descent.

### The Theoretical Discovery

The EML paper proved that **every elementary mathematical function** (addition, multiplication, trigonometric functions, logarithms), along with constants such as π and e, can be generated from just one binary operator and the constant 1:

```
eml(x, y) = exp(x) − ln(y)
```

This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 | eml(S, S)`.
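
To make the grammar concrete, here is a minimal sketch of an evaluator for terms of `S → 1 | eml(S, S)`. It is illustrative only and not the repository's implementation: the tuple encoding of trees and the names `evaluate`, `e_tree`, and `nested` are assumptions, and it works over real numbers, so it requires every second argument to be positive.

```python
import math

def eml(x: float, y: float) -> float:
    """The EML operator: exp(x) - ln(y). Real-valued sketch, so y must be > 0."""
    return math.exp(x) - math.log(y)

def evaluate(tree) -> float:
    """Evaluate a term of the grammar S -> 1 | eml(S, S).

    Assumed encoding: the integer 1 is the constant leaf, and a
    2-tuple (left, right) stands for eml(left, right).
    """
    if tree == 1:
        return 1.0
    left, right = tree
    return eml(evaluate(left), evaluate(right))

e_tree = (1, 1)        # eml(1, 1) = exp(1) - ln(1) = e
nested = ((1, 1), 1)   # eml(eml(1, 1), 1) = exp(e)

print(evaluate(e_tree), evaluate(nested))
```

Every term of the grammar is a full binary tree whose leaves are all the constant 1, which is why the search space grows so quickly with depth.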

### The Practical Problem

Although the construction is mathematically universal, it is numerically fragile in code. Stacking exponentials 3-4 levels deep in floating-point arithmetic causes intermediate values to **overflow to infinity** or **underflow to zero**. The paper itself reports:
- **Depth 1-2**: 100% recovery from random initialization
- **Depth 3-4**: ~25% recovery
- **Depth 5+**: <1% recovery
- **Depth 6**: 0% in 448 attempts

Yet paradoxically, when initialized **near the correct solution**, recovery is 100% even at depth 5-6. The basins of attraction exist, but finding them from a random initialization is a needle-in-a-haystack search.
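
The overflow is easy to reproduce. The sketch below nests `eml(·, 1)`, which reduces to `exp(·)` because `ln(1) = 0`, and shows that a float64 value starting at `x = 2` is already infinite at depth 3 (float64 `exp` overflows for arguments above roughly 709.78). The helper name `exp_tower` is hypothetical.

```python
import math

def exp_tower(x: float, depth: int) -> float:
    """Apply exp `depth` times, i.e. eml(..., 1) nested `depth` levels deep."""
    value = x
    for _ in range(depth):
        try:
            value = math.exp(value)
        except OverflowError:
            # math.exp raises once the result no longer fits in float64
            return math.inf
    return value

for depth in range(1, 5):
    print(depth, exp_tower(2.0, depth))

# The mirror-image failure: exp of a large negative value silently becomes 0.0,
# after which ln(0) is undefined.
print(math.exp(-exp_tower(2.0, 2)))
```

Depth 1 gives about 7.39 and depth 2 about 1618; depth 3 would need `exp(1618)`, which is far beyond the float64 range.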

## Research Questions

1. **Which numerical stability techniques** most improve deep EML tree training?
2. **What is the maximum recoverable tree depth** with enhanced methods?
3. **Can EML-based symbolic regression recover real physics equations** (Feynman benchmark)?

## Methods

### Stability Techniques Tested

| Method | Description | Source |
|--------|-------------|--------|
| **Soft routing** | Standard softmax input selection (baseline) | EML paper §4.3 |
| **Gumbel-hard** | Straight-through Gumbel-softmax: hard selection in the forward pass, soft gradients in the backward pass | Jang et al. 2017 |
| **Bounded** | `tanh(output/R) * R` normalization after each node | Inspired by NALU (Trask et al. 2018) |
| **Combined** | Saturating linear: `x / (1 + \|x\|/R)` + Gumbel-hard routing | Novel combination |
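
The two bounding functions in the table can be compared directly. This is a minimal sketch in plain Python; the function names and the choice `R = 10.0` are assumptions, since the README does not state the value of `R` used in the experiments.

```python
import math

def bound_tanh(x: float, R: float) -> float:
    """'Bounded' row: R * tanh(x / R). Near-identity for |x| << R, saturates at +/-R."""
    return R * math.tanh(x / R)

def bound_saturating(x: float, R: float) -> float:
    """'Combined' row: x / (1 + |x| / R). Also confined to (-R, R), approached more slowly."""
    return x / (1.0 + abs(x) / R)

R = 10.0  # assumed value, for illustration only
for x in (0.1, 5.0, 50.0, 1e6):
    print(x, bound_tanh(x, R), bound_saturating(x, R))
```

Both maps keep a finite node output inside `(-R, R)`. One practical difference: `bound_tanh` maps an already-infinite input to `R`, while `bound_saturating` returns NaN for it (`inf / inf`), so the saturating form only helps if it is applied before an overflow occurs.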

### Key Innovations

1. **Hard routing prevents intermediate explosion**: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.

2. **Multi-loss training**: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).

3. **Temperature annealing**: Start with a high temperature (smooth, exploratory) and anneal to near zero (hard, discrete) over training.

4. **Multi-restart search**: Since basins are narrow, we run 20-30 random initializations per configuration and report the best result and the success rate.
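
As a rough illustration of items 2 and 3, the sketch below combines the three loss terms and defines an exponential temperature schedule. It is written in plain Python for readability and is not the repository's training code: the loss weights `w_corr` and `w_ent`, the schedule endpoints `t_start` and `t_end`, and all function names are assumptions.

```python
import math

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def correlation_loss(pred, target):
    """1 - Pearson r: near zero whenever pred is a positive affine map of target."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return 1.0 - cov / math.sqrt(var_p * var_t + 1e-12)

def routing_entropy(probs):
    """Shannon entropy of one routing distribution; zero for a one-hot choice."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature(step, total_steps, t_start=2.0, t_end=0.05):
    """Exponential anneal from t_start down to t_end (endpoint values are assumed)."""
    frac = step / max(1, total_steps - 1)
    return t_start * (t_end / t_start) ** frac

def total_loss(pred, target, routing_probs, w_corr=0.1, w_ent=0.01):
    """MSE + weighted correlation loss + weighted sum of routing entropies."""
    ent = sum(routing_entropy(p) for p in routing_probs)
    return mse(pred, target) + w_corr * correlation_loss(pred, target) + w_ent * ent
```

The correlation term ignores scale and offset, so it can reward a tree that has found the right shape before the per-node affine parameters have converged, while the entropy term pushes each routing distribution toward a single choice.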

### Architecture: The Master Formula

Following the paper's §4.3, we implement the EML master formula as a full binary tree:
- **Leaf nodes** select from `{1, x₁, ..., xₖ}` (constant and input variables)
- **Internal nodes** select from `{1, x₁, ..., xₖ, f_left, f_right}` (the same candidates plus the two child outputs)
- Each selection is parameterized by learnable logits passed through Gumbel-softmax
- Each node applies an output affine transform `a * eml(left, right) + b`

Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).
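
The forward pass of a single node can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the repository's code: the function names are hypothetical, only the forward (hard) selection is shown, and the straight-through backward pass needs an autodiff framework. PyTorch's `torch.nn.functional.gumbel_softmax` with `hard=True` provides that behaviour, though the README does not say whether it is what the experiment uses.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Soft selection weights: softmax((logits + Gumbel noise) / tau)."""
    # Gumbel(0, 1) noise is -log(-log(u)) for u ~ Uniform(0, 1); guard against u == 0.
    noisy = [(l - math.log(-math.log(rng.random() or 1e-12))) / tau for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def hard_select(weights):
    """Forward pass of the straight-through estimator: one-hot at the argmax."""
    k = max(range(len(weights)), key=weights.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(weights))]

def node_forward(left_logits, right_logits, candidates, a, b, tau, rng):
    """One EML node: route one candidate to each argument, then a * eml(l, r) + b."""
    wl = hard_select(gumbel_softmax(left_logits, tau, rng))
    wr = hard_select(gumbel_softmax(right_logits, tau, rng))
    left = sum(w * c for w, c in zip(wl, candidates))
    right = sum(w * c for w, c in zip(wr, candidates))
    return a * (math.exp(left) - math.log(right)) + b

rng = random.Random(0)
x = 0.5
# Leaf node with candidates {1, x}; logits strongly prefer x on the left and 1 on the right,
# so the node computes eml(x, 1) = exp(x).
print(node_forward([-50.0, 50.0], [50.0, -50.0], [1.0, x], a=1.0, b=0.0, tau=1.0, rng=rng))
```

With hard selection each argument is exactly one candidate, so the node never exponentiates a weighted mixture.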

## Experimental Design

### Phase 1: Known EML Identities
Test recovery of functions with known EML decompositions:

| Function | EML Depth | EML Expression |
|----------|-----------|----------------|
| `exp(x)` | 1 | `eml(x, 1)` |
| `e` (constant) | 1 | `eml(1, 1)` |
| `ln(x)` | 3 | `eml(1, eml(eml(1,x), 1))` |
| `-x` | 2 | Via composition |
| `1/x` | 3 | Via composition |
| `x + y` | 4 | Via exp/ln identities |
| `x × y` | 4+ | Via exp/ln identities |
| `x²` | 4 | `exp(2·ln(x))` |
| `√x` | 4 | `exp(0.5·ln(x))` |
| `sin(x)` | 5+ | Requires complex intermediates |
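
The three explicit decompositions in the table can be checked numerically. The sketch below does so over the reals for a few positive inputs; the helper name `ln_via_eml` is hypothetical, and the comments trace the algebra step by step.

```python
import math

def eml(x: float, y: float) -> float:
    return math.exp(x) - math.log(y)

def ln_via_eml(x: float) -> float:
    """ln(x) as eml(1, eml(eml(1, x), 1)), valid for x > 0."""
    inner = eml(1.0, x)        # e - ln(x)
    middle = eml(inner, 1.0)   # exp(e - ln(x)) = e**e / x
    return eml(1.0, middle)    # e - ln(e**e / x) = ln(x)

for x in (0.5, 1.0, 2.0, 10.0):
    assert math.isclose(eml(x, 1.0), math.exp(x))                    # exp(x) = eml(x, 1)
    assert math.isclose(ln_via_eml(x), math.log(x), abs_tol=1e-12)   # depth-3 ln(x)

assert math.isclose(eml(1.0, 1.0), math.e)                           # e = eml(1, 1)
```

Note that the depth-3 form computes `ln(x)` as a difference of two numbers close to `e`, so even the exact decomposition loses a little absolute precision to cancellation.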

### Phase 2: Feynman Physics Equations
A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
- Gaussian distribution: `exp(-θ²/2)/√(2π)`
- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
- Inverse square law: `F = q₁q₂/(4πε₀r²)`
- Relativistic mass: `m₀/√(1-v²/c²)`
- Harmonic oscillator: `E = ½kx²`
- And more...

### Phase 3: Depth Scaling Analysis
Systematic measurement of recovery rate vs. depth using EML-native targets.

## Key Literature References

| Topic | Paper | Key Insight |
|-------|-------|-------------|
| EML operator | [2603.21852](https://arxiv.org/abs/2603.21852) | Universal primitive for elementary functions |
| Gumbel-softmax | Jang et al. 2017 | Differentiable discrete selection |
| NALU | [1808.00508](https://arxiv.org/abs/1808.00508) | Stable exp-log arithmetic cells |
| NAU | [2001.05016](https://arxiv.org/abs/2001.05016) | Fixing NALU's gradient issues |
| Gradient clipping | [1211.5063](https://arxiv.org/abs/1211.5063) | Controlling exploding gradients |
| BFloat16 training | [2010.06192](https://arxiv.org/abs/2010.06192) | Kahan summation for precision |
| AutoNumerics-Zero | [2312.08472](https://arxiv.org/abs/2312.08472) | Range reduction for transcendentals |
| Numerical stability | [2501.04697](https://arxiv.org/abs/2501.04697) | Grokking at the edge of stability |
| Tropical geometry | [2505.17190](https://arxiv.org/abs/2505.17190) | Max-plus limit of log-sum-exp |
| AI Feynman | Udrescu & Tegmark 2020 | Physics equations benchmark |
| SRSD | [2206.10540](https://arxiv.org/abs/2206.10540) | Feynman benchmark with proper data |
| PySR | Cranmer 2023 | Evolutionary symbolic regression |
| TPSR | [2303.06833](https://arxiv.org/abs/2303.06833) | Transformer + MCTS for SR |

## Preliminary Results (CPU validation)

From our CPU sandbox testing:

| Function | Depth | Best R² | Method | Notes |
|----------|-------|---------|--------|-------|
| `exp(x)` | 1 | **0.9999** | Gumbel-hard | ✅ Trivially recovered |
| `e` (const) | 1 | **0.9999** | Gumbel-hard | ✅ Correct: `eml(1,1)` |
| `ln(x)` | 3 | -0.08 | All methods | ❌ All 10 restarts fail |
| `x²` | 4 | TBD | - | Awaiting GPU results |
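
For reading the R² column, the sketch below uses the standard coefficient of determination. The README does not state how R² is computed in the experiment, so this definition is an assumption. Under it, a negative value such as the -0.08 reported for `ln(x)` means the fitted tree does slightly worse than predicting the mean of the targets.

```python
def r_squared(pred, target):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

target = [1.0, 2.0, 3.0, 4.0]
print(r_squared(target, target))       # perfect fit
print(r_squared([2.5] * 4, target))    # constant prediction at the target mean
print(r_squared([3.0] * 4, target))    # constant prediction away from the mean
```

The three calls give 1, 0, and a negative value respectively, which is the scale against which the table entries should be read.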

### Key Observation
**The depth-3 barrier is real and severe.** Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization fails consistently (0 of 10 restarts on CPU). This is in line with the paper's report of only ~25% success at depth 3-4, and suggests that:

1. The loss landscape at depth 3+ may have **exponentially many local minima** relative to the single correct basin
2. Better optimization (second-order methods, population-based search) may help
3. **Informed initialization** (starting near known decompositions) is likely required for practical use
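
A quick back-of-the-envelope calculation helps put restart counts in context. The sketch below assumes restarts are independent and share a single per-restart success probability, and it plugs in the paper's quoted rates purely for illustration; the function name is hypothetical.

```python
def p_at_least_one(p_single: float, restarts: int) -> float:
    """Probability that at least one of `restarts` independent runs succeeds."""
    return 1.0 - (1.0 - p_single) ** restarts

# Depth 3-4 at the paper's ~25% per-restart rate
for n in (10, 20, 30):
    print(n, round(p_at_least_one(0.25, n), 4))

# Depth 5+ at an assumed 1% per-restart rate (the paper reports <1%)
print(30, round(p_at_least_one(0.01, 30), 4))
```

At a 25% per-restart rate, ten restarts would all fail only about 5.6% of the time, so the 0-of-10 outcome above hints that the per-restart rate in this setup may be lower than the paper's, although ten runs are too few to conclude that. At a 1% rate, even 30 restarts succeed at least once only about a quarter of the time, which is why restarts alone are unlikely to scale to deeper trees.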

## GPU Experiment Status

🔄 **Running**: Full experiment on T4 GPU with 3 phases and 4 stability methods.
Job: `69e7837acd8c002f31e00d75`

Results will be uploaded to the `results/` folder upon completion.

## How to Reproduce

```bash
# Install dependencies
pip install torch numpy huggingface_hub

# Run the full experiment
python code/eml_experiment.py
```

## Citation

If you use this work, please cite the original EML paper:
```
@article{eml2026,
  title={All elementary functions from a single operator},
  author={...},
  journal={arXiv preprint arXiv:2603.21852},
  year={2026}
}
```

## License
MIT