Add comprehensive README with research overview and preliminary results

09e2cfe verified about 1 month ago

7.42 kB

	# EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?

	## Overview

	This repository contains an empirical study of whether the EML operator `eml(x,y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for symbolic regression via gradient descent.

	### The Theoretical Discovery
	The EML paper proved that every elementary mathematical function — addition, multiplication, trigonometry, logarithms, π, e, etc. — can be generated from just one binary operator and the constant 1:

	```
	eml(x, y) = exp(x) − ln(y)
	```

	This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 \| eml(S, S)`.

	### The Practical Problem
	While mathematically universal, this crashes in code. Stacking exponentials 3-4 levels deep in floating-point arithmetic causes numbers to explode to infinity or collapse to zero. The paper itself reports:
	- Depth 1-2: 100% recovery from random initialization
	- Depth 3-4: ~25% recovery
	- Depth 5+: <1% recovery
	- Depth 6: 0% in 448 attempts

	Yet paradoxically, when initialized near the correct solution, recovery is 100% even at depth 5-6. The basins of attraction exist — they're just needle-in-a-haystack from random init.

	## Research Questions

	1. Which numerical stability techniques most improve deep EML tree training?
	2. What is the maximum recoverable tree depth with enhanced methods?
	3. Can EML-based SR recover real physics equations (Feynman benchmark)?

	## Methods

	### Stability Techniques Tested

	\| Method \| Description \| Source \|
	\|--------\|-------------\|--------\|
	\| Soft routing \| Standard softmax input selection (baseline) \| EML paper §4.3 \|
	\| Gumbel-hard \| Straight-through Gumbel-softmax — hard selection in forward, soft gradients in backward \| Jang et al. 2017 \|
	\| Bounded \| `tanh(output/R) * R` normalization after each node \| Inspired by NALU (Trask 2018) \|
	\| Combined \| Saturating linear: `x / (1 + \|x\|/R)` + Gumbel-hard routing \| Novel combination \|

	### Key Innovations

	1. Hard routing prevents intermediate explosion: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.

	2. Multi-loss training: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).

	3. Temperature annealing: Start with high temperature (smooth, exploratory) and anneal to near-zero (hard, discrete) over training.

	4. Multi-restart search: Since basins are narrow, we run 20-30 random initializations per configuration and report best + success rates.

	### Architecture: The Master Formula

	Following the paper's §4.3, we implement the EML master formula as a full binary tree:
	- Leaf nodes select from `{1, x₁, ..., xₖ}` (constant and input variables)
	- Internal nodes select from `{1, x₁, ..., xₖ, f_left, f_right}` (also including child outputs)
	- Each selection is parameterized by learnable logits passed through Gumbel-softmax
	- Output affine transform `a * eml(left, right) + b` per node

	Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).

	## Experimental Design

	### Phase 1: Known EML Identities
	Test recovery of functions with known EML decompositions:

	\| Function \| EML Depth \| EML Expression \|
	\|----------\|-----------\|----------------\|
	\| `exp(x)` \| 1 \| `eml(x, 1)` \|
	\| `e` (constant) \| 1 \| `eml(1, 1)` \|
	\| `ln(x)` \| 3 \| `eml(1, eml(eml(1,x), 1))` \|
	\| `-x` \| 2 \| Via composition \|
	\| `1/x` \| 3 \| Via composition \|
	\| `x + y` \| 4 \| Via exp/ln identities \|
	\| `x × y` \| 4+ \| Via exp/ln identities \|
	\| `x²` \| 4 \| `exp(2·ln(x))` \|
	\| `√x` \| 4 \| `exp(0.5·ln(x))` \|
	\| `sin(x)` \| 5+ \| Requires complex intermediates \|

	### Phase 2: Feynman Physics Equations
	A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
	- Gaussian distribution: `exp(-θ²/2)/√(2π)`
	- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
	- Inverse square law: `F = q₁q₂/(4πε₀r²)`
	- Relativistic mass: `m₀/√(1-v²/c²)`
	- Harmonic oscillator: `E = ½kx²`
	- And more...

	### Phase 3: Depth Scaling Analysis
	Systematic measurement of recovery rate vs. depth using EML-native targets.

	## Key Literature References

	\| Topic \| Paper \| Key Insight \|
	\|-------\|-------\|-------------\|
	\| EML operator \| [2603.21852](https://arxiv.org/abs/2603.21852) \| Universal primitive for elementary functions \|
	\| Gumbel-softmax \| Jang et al. 2017 \| Differentiable discrete selection \|
	\| NALU \| [1808.00508](https://arxiv.org/abs/1808.00508) \| Stable exp-log arithmetic cells \|
	\| NAU \| [2001.05016](https://arxiv.org/abs/2001.05016) \| Fixing NALU's gradient issues \|
	\| Gradient clipping \| [1211.5063](https://arxiv.org/abs/1211.5063) \| Controlling exploding gradients \|
	\| BFloat16 training \| [2010.06192](https://arxiv.org/abs/2010.06192) \| Kahan summation for precision \|
	\| AutoNumerics-Zero \| [2312.08472](https://arxiv.org/abs/2312.08472) \| Range reduction for transcendentals \|
	\| Numerical stability \| [2501.04697](https://arxiv.org/abs/2501.04697) \| Grokking at the edge of stability \|
	\| Tropical geometry \| [2505.17190](https://arxiv.org/abs/2505.17190) \| Max-plus limit of log-sum-exp \|
	\| AI Feynman \| Udrescu & Tegmark 2020 \| Physics equations benchmark \|
	\| SRSD \| [2206.10540](https://arxiv.org/abs/2206.10540) \| Feynman benchmark with proper data \|
	\| PySR \| Cranmer 2023 \| Evolutionary symbolic regression \|
	\| TPSR \| [2303.06833](https://arxiv.org/abs/2303.06833) \| Transformer + MCTS for SR \|

	## Preliminary Results (CPU validation)

	From our CPU sandbox testing:

	\| Function \| Depth \| Best R² \| Method \| Notes \|
	\|----------\|-------\|---------\|--------\|-------\|
	\| `exp(x)` \| 1 \| 0.9999 \| Gumbel-hard \| ✅ Trivially recovered \|
	\| `e` (const) \| 1 \| 0.9999 \| Gumbel-hard \| ✅ Correct: `eml(1,1)` \|
	\| `ln(x)` \| 3 \| -0.08 \| All methods \| ❌ All 10 restarts fail \|
	\| `x²` \| 4 \| TBD \| - \| Awaiting GPU results \|

	### Key Observation
	The depth-3 barrier is real and severe. Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization fails consistently. This aligns with the paper's finding of ~25% success at depth 3-4 and suggests that:

	1. The loss landscape at depth 3+ has exponentially many local minima relative to the one correct basin
	2. Better optimization (second-order methods, population-based search) may help
	3. Informed initialization (starting near known decompositions) is likely required for practical use

	## GPU Experiment Status

	🔄 Running: Full experiment on T4 GPU with 3 phases and 4 stability methods.
	Job: `69e7837acd8c002f31e00d75`

	Results will be uploaded to the `results/` folder upon completion.

	## How to Reproduce

	```python
	# Install dependencies
	pip install torch numpy huggingface_hub

	# Run the full experiment
	python code/eml_experiment.py
	```

	## Citation

	If you use this work, please cite the original EML paper:
	```
	@article{eml2026,
	title={All elementary functions from a single operator},
	author={...},
	journal={arXiv preprint arXiv:2603.21852},
	year={2026}
	}
	```

	## License
	MIT

	# EML Trainability Study: Can We Turn Theoretical Universality Into Practical Training?

	## Overview

	This repository contains an empirical study of whether the EML operator `eml(x,y) = exp(x) - ln(y)` from [arXiv:2603.21852](https://arxiv.org/abs/2603.21852) can be made practically trainable for symbolic regression via gradient descent.

	### The Theoretical Discovery
	The EML paper proved that every elementary mathematical function — addition, multiplication, trigonometry, logarithms, π, e, etc. — can be generated from just one binary operator and the constant 1:

	```
	eml(x, y) = exp(x) − ln(y)
	```

	This is analogous to how the NAND gate generates all Boolean logic. The grammar is trivially simple: `S → 1 \| eml(S, S)`.

	### The Practical Problem
	While mathematically universal, this crashes in code. Stacking exponentials 3-4 levels deep in floating-point arithmetic causes numbers to explode to infinity or collapse to zero. The paper itself reports:
	- Depth 1-2: 100% recovery from random initialization
	- Depth 3-4: ~25% recovery
	- Depth 5+: <1% recovery
	- Depth 6: 0% in 448 attempts

	Yet paradoxically, when initialized near the correct solution, recovery is 100% even at depth 5-6. The basins of attraction exist — they're just needle-in-a-haystack from random init.

	## Research Questions

	1. Which numerical stability techniques most improve deep EML tree training?
	2. What is the maximum recoverable tree depth with enhanced methods?
	3. Can EML-based SR recover real physics equations (Feynman benchmark)?

	## Methods

	### Stability Techniques Tested

	\| Method \| Description \| Source \|
	\|--------\|-------------\|--------\|
	\| Soft routing \| Standard softmax input selection (baseline) \| EML paper §4.3 \|
	\| Gumbel-hard \| Straight-through Gumbel-softmax — hard selection in forward, soft gradients in backward \| Jang et al. 2017 \|
	\| Bounded \| `tanh(output/R) * R` normalization after each node \| Inspired by NALU (Trask 2018) \|
	\| Combined \| Saturating linear: `x / (1 + \|x\|/R)` + Gumbel-hard routing \| Novel combination \|

	### Key Innovations

	1. Hard routing prevents intermediate explosion: Soft routing creates weighted mixtures of {1, x, f} that can produce arbitrary intermediate values. Hard selection ensures only one input is chosen per EML node, preventing the "exp of a mixture" problem.

	2. Multi-loss training: MSE + correlation loss (captures function shape regardless of scale) + entropy regularization (encourages discrete routing decisions).

	3. Temperature annealing: Start with high temperature (smooth, exploratory) and anneal to near-zero (hard, discrete) over training.

	4. Multi-restart search: Since basins are narrow, we run 20-30 random initializations per configuration and report best + success rates.

	### Architecture: The Master Formula

	Following the paper's §4.3, we implement the EML master formula as a full binary tree:
	- Leaf nodes select from `{1, x₁, ..., xₖ}` (constant and input variables)
	- Internal nodes select from `{1, x₁, ..., xₖ, f_left, f_right}` (also including child outputs)
	- Each selection is parameterized by learnable logits passed through Gumbel-softmax
	- Output affine transform `a * eml(left, right) + b` per node

	Total parameters: `O(5 × 2ⁿ)` for depth n (as stated in the paper).

	## Experimental Design

	### Phase 1: Known EML Identities
	Test recovery of functions with known EML decompositions:

	\| Function \| EML Depth \| EML Expression \|
	\|----------\|-----------\|----------------\|
	\| `exp(x)` \| 1 \| `eml(x, 1)` \|
	\| `e` (constant) \| 1 \| `eml(1, 1)` \|
	\| `ln(x)` \| 3 \| `eml(1, eml(eml(1,x), 1))` \|
	\| `-x` \| 2 \| Via composition \|
	\| `1/x` \| 3 \| Via composition \|
	\| `x + y` \| 4 \| Via exp/ln identities \|
	\| `x × y` \| 4+ \| Via exp/ln identities \|
	\| `x²` \| 4 \| `exp(2·ln(x))` \|
	\| `√x` \| 4 \| `exp(0.5·ln(x))` \|
	\| `sin(x)` \| 5+ \| Requires complex intermediates \|

	### Phase 2: Feynman Physics Equations
	A curated set of physics equations from the [SRSD-Feynman benchmark](https://arxiv.org/abs/2206.10540):
	- Gaussian distribution: `exp(-θ²/2)/√(2π)`
	- Euclidean distance: `√((x₂-x₁)² + (y₂-y₁)²)`
	- Inverse square law: `F = q₁q₂/(4πε₀r²)`
	- Relativistic mass: `m₀/√(1-v²/c²)`
	- Harmonic oscillator: `E = ½kx²`
	- And more...

	### Phase 3: Depth Scaling Analysis
	Systematic measurement of recovery rate vs. depth using EML-native targets.

	## Key Literature References

	\| Topic \| Paper \| Key Insight \|
	\|-------\|-------\|-------------\|
	\| EML operator \| [2603.21852](https://arxiv.org/abs/2603.21852) \| Universal primitive for elementary functions \|
	\| Gumbel-softmax \| Jang et al. 2017 \| Differentiable discrete selection \|
	\| NALU \| [1808.00508](https://arxiv.org/abs/1808.00508) \| Stable exp-log arithmetic cells \|
	\| NAU \| [2001.05016](https://arxiv.org/abs/2001.05016) \| Fixing NALU's gradient issues \|
	\| Gradient clipping \| [1211.5063](https://arxiv.org/abs/1211.5063) \| Controlling exploding gradients \|
	\| BFloat16 training \| [2010.06192](https://arxiv.org/abs/2010.06192) \| Kahan summation for precision \|
	\| AutoNumerics-Zero \| [2312.08472](https://arxiv.org/abs/2312.08472) \| Range reduction for transcendentals \|
	\| Numerical stability \| [2501.04697](https://arxiv.org/abs/2501.04697) \| Grokking at the edge of stability \|
	\| Tropical geometry \| [2505.17190](https://arxiv.org/abs/2505.17190) \| Max-plus limit of log-sum-exp \|
	\| AI Feynman \| Udrescu & Tegmark 2020 \| Physics equations benchmark \|
	\| SRSD \| [2206.10540](https://arxiv.org/abs/2206.10540) \| Feynman benchmark with proper data \|
	\| PySR \| Cranmer 2023 \| Evolutionary symbolic regression \|
	\| TPSR \| [2303.06833](https://arxiv.org/abs/2303.06833) \| Transformer + MCTS for SR \|

	## Preliminary Results (CPU validation)

	From our CPU sandbox testing:

	\| Function \| Depth \| Best R² \| Method \| Notes \|
	\|----------\|-------\|---------\|--------\|-------\|
	\| `exp(x)` \| 1 \| 0.9999 \| Gumbel-hard \| ✅ Trivially recovered \|
	\| `e` (const) \| 1 \| 0.9999 \| Gumbel-hard \| ✅ Correct: `eml(1,1)` \|
	\| `ln(x)` \| 3 \| -0.08 \| All methods \| ❌ All 10 restarts fail \|
	\| `x²` \| 4 \| TBD \| - \| Awaiting GPU results \|

	### Key Observation
	The depth-3 barrier is real and severe. Even with hard routing (Gumbel-softmax), bounded normalization, curriculum learning, and multi-loss training, recovering `ln(x)` from random initialization fails consistently. This aligns with the paper's finding of ~25% success at depth 3-4 and suggests that:

	1. The loss landscape at depth 3+ has exponentially many local minima relative to the one correct basin
	2. Better optimization (second-order methods, population-based search) may help
	3. Informed initialization (starting near known decompositions) is likely required for practical use

	## GPU Experiment Status

	🔄 Running: Full experiment on T4 GPU with 3 phases and 4 stability methods.
	Job: `69e7837acd8c002f31e00d75`

	Results will be uploaded to the `results/` folder upon completion.

	## How to Reproduce

	```python
	# Install dependencies
	pip install torch numpy huggingface_hub

	# Run the full experiment
	python code/eml_experiment.py
	```

	## Citation

	If you use this work, please cite the original EML paper:
	```
	@article{eml2026,
	title={All elementary functions from a single operator},
	author={...},
	journal={arXiv preprint arXiv:2603.21852},
	year={2026}
	}
	```

	## License
	MIT