Spaces:
Running
Running
| # Eureqa.jl | |
| **Symbolic regression built on Julia, and interfaced by Python. | |
| Uses regularized evolution and simulated annealing.** | |
| Backstory: we used the original | |
| [eureqa](https://www.creativemachineslab.com/eureqa.html) | |
| in our [paper](https://arxiv.org/abs/2006.11287) to | |
| convert a graph neural network into | |
| an analytic equation describing dark matter overdensity. However, | |
| eureqa is GUI-only, doesn't allow for user-defined | |
| operators, has no distributed capabilities, | |
| and has become proprietary. Thus, the goal | |
| of this package is to have an open-source symbolic regression tool | |
| as efficient as eureqa, while also exposing a configurable | |
| python interface. | |
| The algorithms here implement regularized evolution, as in | |
| [AutoML-Zero](https://arxiv.org/abs/2003.03384), | |
| but with additional algorithmic changes such as simulated | |
| annealing, and classical optimization of constants. | |
| ## Installation | |
| Install [Julia](https://julialang.org/downloads/). Then, at the command line, | |
| install the `Optim` package via: `julia -e 'import Pkg; Pkg.add("Optim")'`. | |
| For python, you need to have Python 3, numpy, and pandas installed. | |
| ## Running: | |
| ### Quickstart | |
| ```python | |
| import numpy as np | |
| from eureqa import eureqa | |
| # Dataset | |
| X = 2*np.random.randn(100, 5) | |
| y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2 | |
| # Learn equations | |
| equations = eureqa(X, y, niterations=5) | |
| ... | |
| print(equations) | |
| ``` | |
| which gives: | |
| ``` | |
| Complexity MSE Equation | |
| 0 5 1.947431 plus(-1.7420927, mult(x0, x0)) | |
| 1 8 0.486858 plus(-1.8710494, plus(cos(x3), mult(x0, x0))) | |
| 2 11 0.000000 plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3))) | |
| ``` | |
| ### API | |
| What follows is the API reference for running the numpy interface. | |
| Note that most parameters here | |
| have been tuned with ~1000 trials over several example | |
| equations, so you likely don't need to tune them yourself. | |
| However, you should adjust `threads`, `niterations`, | |
| `binary_operators`, `unary_operators`, and `maxsize` to your requirements. | |
| The program will output a pandas DataFrame containing the equations, | |
| mean square error, and complexity. It will also dump to a csv | |
| at the end of every iteration, | |
| which is `hall_of_fame.csv` by default. It also prints the | |
| equations to stdout. | |
| You can add more operators in `operators.jl`, or use default | |
| Julia ones. Make sure all operators are defined for scalar `Float32`. | |
| Then just specify the operator names in your call, as above. | |
| You can also change the dataset learned on by passing in `X` and `y` as | |
| numpy arrays to `eureqa(...)`. | |
| ```python | |
| eureqa(X=None, y=None, threads=4, niterations=20, | |
| ncyclesperiteration=int(default_ncyclesperiteration), | |
| binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], | |
| alpha=default_alpha, annealing=True, fractionReplaced=default_fractionReplaced, | |
| fractionReplacedHof=default_fractionReplacedHof, npop=int(default_npop), | |
| parsimony=default_parsimony, migration=True, hofMigration=True | |
| shouldOptimizeConstants=True, topn=int(default_topn), | |
| weightAddNode=default_weightAddNode, weightDeleteNode=default_weightDeleteNode, | |
| weightDoNothing=default_weightDoNothing, | |
| weightMutateConstant=default_weightMutateConstant, | |
| weightMutateOperator=default_weightMutateOperator, | |
| weightRandomize=default_weightRandomize, weightSimplify=default_weightSimplify, | |
| timeout=None, equation_file='hall_of_fame.csv', test='simple1', maxsize=20) | |
| ``` | |
| Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i. | |
| **Arguments**: | |
| - `X`: np.ndarray, 2D array. Rows are examples, columns are features. | |
| - `y`: np.ndarray, 1D array. Rows are examples. | |
| - `threads`: int, Number of threads (=number of populations running). | |
| You can have more threads than cores - it actually makes it more | |
| efficient. | |
| - `niterations`: int, Number of iterations of the algorithm to run. The best | |
| equations are printed, and migrate between populations, at the | |
| end of each. | |
| - `ncyclesperiteration`: int, Number of total mutations to run, per 10 | |
| samples of the population, per iteration. | |
| - `binary_operators`: list, List of strings giving the binary operators | |
| in Julia's Base, or in `operator.jl`. | |
| - `unary_operators`: list, Same but for operators taking a single `Float32`. | |
| - `alpha`: float, Initial temperature. | |
| - `annealing`: bool, Whether to use annealing. You should (and it is default). | |
| - `fractionReplaced`: float, How much of population to replace with migrating | |
| equations from other populations. | |
| - `fractionReplacedHof`: float, How much of population to replace with migrating | |
| equations from hall of fame. | |
| - `npop`: int, Number of individuals in each population | |
| - `parsimony`: float, Multiplicative factor for how much to punish complexity. | |
| - `migration`: bool, Whether to migrate. | |
| - `hofMigration`: bool, Whether to have the hall of fame migrate. | |
| - `shouldOptimizeConstants`: bool, Whether to numerically optimize | |
| constants (Nelder-Mead/Newton) at the end of each iteration. | |
| - `topn`: int, How many top individuals migrate from each population. | |
| - `weightAddNode`: float, Relative likelihood for mutation to add a node | |
| - `weightDeleteNode`: float, Relative likelihood for mutation to delete a node | |
| - `weightDoNothing`: float, Relative likelihood for mutation to leave the individual | |
| - `weightMutateConstant`: float, Relative likelihood for mutation to change | |
| the constant slightly in a random direction. | |
| - `weightMutateOperator`: float, Relative likelihood for mutation to swap | |
| an operator. | |
| - `weightRandomize`: float, Relative likelihood for mutation to completely | |
| delete and then randomly generate the equation | |
| - `weightSimplify`: float, Relative likelihood for mutation to simplify | |
| constant parts by evaluation | |
| - `timeout`: float, Time in seconds to timeout search | |
| - `equation_file`: str, Where to save the files (.csv separated by |) | |
| - `test`: str, What test to run, if X,y not passed. | |
| - `maxsize`: int, Max size of an equation. | |
| **Returns**: | |
| pd.DataFrame, Results dataframe, giving complexity, MSE, and equations | |
| (as strings). | |
| # TODO | |
| - [ ] Make scaling of changes to constant a hyperparameter | |
| - [ ] Update hall of fame every iteration? | |
| - [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features. | |
| - Store feature importances of future, and periodically update it. | |
| - [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html | |
| - [ ] Sympy printing | |
| - [ ] Consider adding mutation for constant<->variable | |
| - [ ] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...)) | |
| - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation? | |
| - [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations | |
| - [ ] Performance: | |
| - [ ] Use an enum for functions instead of storing them? | |
| - Current most expensive operations: | |
| - [ ] Calculating the loss function - there is duplicate calculations happening. | |
| - [x] Declaration of the weights array every iteration | |
| - [x] Add a node at the top of a tree | |
| - [x] Insert a node at the top of a subtree | |
| - [x] Record very best individual in each population, and return at end. | |
| - [x] Write our own tree copy operation; deepcopy() is the slowest operation by far. | |
| - [x] Hyperparameter tune | |
| - [x] Create a benchmark for accuracy | |
| - [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset. | |
| - Could just write out the dataset in julia, or load it. | |
| - [x] Create a Python interface | |
| - [x] Explicit constant optimization on hall-of-fame | |
| - Create method to find and return all constants, from left to right | |
| - Create method to find and set all constants, in same order | |
| - Pull up some optimization algorithm and add it. Keep the package small! | |
| - [x] Create a benchmark for speed | |
| - [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes? | |
| - [x] Record hall of fame | |
| - [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests | |
| - [x] Test performance of reduced precision integers | |
| - No effect | |
| - [x] Create struct to pass through all hyperparameters, instead of treating as constants | |
| - Make sure doesn't affect performance | |