PPO Symbolic Regression Experiment Plan
Objective
Test whether PPO (Proximal Policy Optimization) fine-tuning can help a language model find mathematical expressions that fit given datasets.
Background
Current State
- Base model: GPT-2 + LoRA trained on JSON format data (exp_a_json)
- Validation rate: ~80% of generated expressions are syntactically valid
- Problem: Model generates valid expressions but doesn't know which one fits a specific dataset
The Question
Can we use RL (specifically PPO) to fine-tune the model so it learns to generate expressions that maximize R² score on a given dataset?
Methodology
Simplifications for This Experiment
To isolate whether PPO works at all, we simplify the problem:
- No constants (C): expressions won't contain learnable constants
  - Avoids the complexity of constant optimization during reward computation
  - Ground truth expressions: `x_1 + x_2`, `sin(x_1)`, etc.
- Simple operators: `+`, `-`, `*`, `sin`, `cos`
  - No division (avoids numerical instability)
  - No exponentiation (simplifies the search space)
- Small datasets: 500 samples each, 1-2 variables
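A minimal sketch of how the grammar restriction could be checked on a generated string, using sympy (function and symbol names are illustrative assumptions, not the project's code):

```python
# Sketch: check that a candidate expression stays inside the simplified grammar
# (+, -, *, sin, cos; no constant C, no division, no exponentiation).
import sympy as sp

ALLOWED_FUNCS = {"sin", "cos"}

def in_simplified_grammar(expr_str: str) -> bool:
    try:
        expr = sp.sympify(expr_str)
    except Exception:
        return False
    if expr.has(sp.Symbol("C")):        # no learnable constants
        return False
    if expr.atoms(sp.Pow):              # Pow covers both x**y and x/y (x * y**-1)
        return False
    return all(f.func.__name__ in ALLOWED_FUNCS for f in expr.atoms(sp.Function))

# Example: in_simplified_grammar("x_1 + sin(x_2)") -> True; "x_1 / x_2" -> False
```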
Test Datasets
| Dataset | Ground Truth | Difficulty | Variables |
|---|---|---|---|
| add_x1_x2 | x_1 + x_2 | Easy | 2 |
| mul_x1_x2 | x_1 * x_2 | Easy | 2 |
| sub_x1_x2 | x_1 - x_2 | Easy | 2 |
| sin_x1 | sin(x_1) | Medium | 1 |
| cos_x1 | cos(x_1) | Medium | 1 |
| square_x1 | x_1 * x_1 | Medium | 1 |
| sin_x1_plus_x2 | sin(x_1) + x_2 | Hard | 2 |
| x1_mul_sin_x2 | x_1 * sin(x_2) | Hard | 2 |
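The actual generator is scripts/data/create_ppo_test_datasets.py; below is a minimal sketch of what producing one of these datasets might look like (the sampling range and on-disk format are assumptions):

```python
# Sketch: generate a 500-sample test dataset for one ground-truth expression.
import numpy as np

def make_dataset(ground_truth, n_vars, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, size=(n_samples, n_vars))   # assumed sampling range
    y = ground_truth(*[X[:, i] for i in range(n_vars)])
    return X, y

# Example: the mul_x1_x2 dataset (ground truth x_1 * x_2).
X, y = make_dataset(lambda x1, x2: x1 * x2, n_vars=2)
np.savez("mul_x1_x2.npz", X=X, y=y)   # hypothetical on-disk format
```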
Experiment Design
Phase 1: Baseline Evaluation
- Generate 200 random expressions per dataset
- Compute R² for each valid expression
- Record best R², mean R², valid rate
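The R² used here is the standard coefficient of determination; a minimal sketch of the metric and the baseline summary statistics, with assumed helper names:

```python
# Sketch: R² (coefficient of determination) and baseline summary statistics.
# `predictions` holds one y_pred array per sampled expression, with None
# marking expressions that failed to parse or evaluate.
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R² = 1 - SS_res / SS_tot; 1.0 is a perfect fit, <= 0 is no better than the mean."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

def baseline_stats(y_true: np.ndarray, predictions: list) -> dict:
    scores = [r2_score(y_true, p) for p in predictions if p is not None]
    return {
        "valid_rate": len(scores) / len(predictions) if predictions else 0.0,
        "best_r2": max(scores) if scores else float("nan"),
        "mean_r2": float(np.mean(scores)) if scores else float("nan"),
    }
```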
Phase 2: PPO Training
- 10 epochs of PPO per dataset
- Batch size: 32
- Compare PPO best R² vs baseline best R²
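For reference, the PPO update maximizes the standard clipped surrogate objective; in this setup the episode reward feeding the advantage estimate is the R² of the generated expression (with a penalty for invalid ones):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$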
Phase 3: Deep PPO (single dataset)
- 20 epochs on the `mul_x1_x2` dataset
- Batch size: 64
- Track R² improvement over time
Success Criteria
| Metric | Success Threshold |
|---|---|
| PPO finds exact expression | R² > 0.99 |
| PPO improves over baseline | PPO R² > Baseline R² |
| PPO is viable approach | Improvement on >50% of datasets |
Implementation
Files Created
```text
scripts/
├── ppo_experiment.py                # Main PPO trainer (JSON format)
├── run_ppo_experiments.py           # Multi-dataset experiment runner
└── data/
    └── create_ppo_test_datasets.py  # Test dataset generator
userdata_ppo_experiment.sh           # AWS EC2 launch script
```
Key Changes from Original trainer.py
- JSON format prompts (matches exp_a_json training)
- Max retries (avoids an infinite loop when the model repeatedly generates invalid expressions)
- No constant optimization (C=1 always, or no C in expression)
- Proper logging and checkpointing
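Putting those changes together, the reward computation might look roughly like the sketch below (the JSON field name, penalty value, and checks are assumptions; the real logic lives in scripts/ppo_experiment.py):

```python
# Sketch: reward for one generated completion. Field names, the penalty value,
# and the operator checks are illustrative assumptions.
import json
import numpy as np
import sympy as sp

INVALID_REWARD = -1.0   # assumed penalty for unparsable / out-of-grammar outputs

def compute_reward(completion: str, X: np.ndarray, y: np.ndarray) -> float:
    try:
        expr_str = json.loads(completion)["expression"]    # hypothetical JSON field
        expr = sp.sympify(expr_str)
    except Exception:
        return INVALID_REWARD
    if expr.atoms(sp.Pow) or expr.has(sp.Symbol("C")):     # no division/powers, no constants
        return INVALID_REWARD
    variables = [sp.Symbol(f"x_{i + 1}") for i in range(X.shape[1])]
    try:
        y_pred = sp.lambdify(variables, expr, modules="numpy")(*X.T)
        y_pred = np.broadcast_to(y_pred, y.shape).astype(float)
    except Exception:
        return INVALID_REWARD
    if not np.all(np.isfinite(y_pred)):
        return INVALID_REWARD
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot) if ss_tot > 0 else INVALID_REWARD
```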
Running the Experiment
```bash
# Local test (if GPU available)
python scripts/data/create_ppo_test_datasets.py
python scripts/run_ppo_experiments.py --model_path ./output/exp_a_json --epochs 5

# AWS (recommended)
# 1. Launch g5.xlarge with userdata_ppo_experiment.sh
# 2. Monitor: tail -f /home/ubuntu/ppo_experiment.log
# 3. Download results when complete
```
Expected Outcomes
Optimistic Scenario
- PPO finds exact expressions for easy datasets (R² > 0.99)
- PPO significantly improves R² for medium datasets
- Proves PPO is viable for symbolic regression
Pessimistic Scenario
- PPO shows no improvement over random sampling
- Model can't learn to maximize R² reward
- Need to investigate: reward shaping, model capacity, exploration
Diagnostic Scenario
- PPO improves the valid rate but not R²
- PPO improves R² but cannot find the exact expression
- Either pattern points to specific modifications of the approach
Next Steps (After Experiment)
Based on results:
If PPO works:
- Add constant optimization
- Try harder expressions
- Scale up to real symbolic regression benchmarks
If PPO partially works:
- Try different reward functions
- Adjust PPO hyperparameters
- Consider curriculum learning
If PPO doesn't work:
- Analyze why model can't learn
- Consider alternative approaches (beam search, MCTS)
- May need different base model architecture
Technical Notes
Why JSON Format?
The exp_a_json model was trained on JSON-format data and produces syntactically valid expressions roughly 80% of the time. Using the same format for PPO keeps the prompts in-distribution for the policy.
Why No Division?
Division can cause numerical instability (division by zero, very large values) which corrupts R² computation and gradient updates.
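A toy illustration of the failure mode, assuming a candidate that divides by a variable sampled near zero:

```python
# Toy illustration (not project code): a few near-zero denominators dominate the
# squared error and push R² far below zero, destroying the reward signal.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-5.0, 5.0, 500)
x2 = rng.uniform(-5.0, 5.0, 500)
y_true = x1 * x2            # ground truth for the mul_x1_x2 dataset
y_pred = x1 / x2            # a candidate that sneaks in a division

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1.0 - ss_res / ss_tot)   # strongly negative R² once any x2 falls near zero
```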
Why Max Retries?
The original trainer had a `while reward < 0` loop that could hang indefinitely if the model consistently generated invalid expressions. Capping the number of retries guarantees that every training step completes, as sketched below.
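A minimal sketch of the bounded-retry pattern, with assumed names and penalty value:

```python
# Sketch: bounded retries replacing the original `while reward < 0` loop.
from typing import Callable, Tuple

def sample_with_retries(
    generate: Callable[[], str],      # draws one completion from the model
    score: Callable[[str], float],    # returns R², or a negative penalty when invalid
    max_retries: int = 8,
    fallback_reward: float = -1.0,
) -> Tuple[str, float]:
    completion, reward = "", fallback_reward
    for _ in range(max_retries):
        completion = generate()
        reward = score(completion)
        if reward >= 0:               # same stop condition as before, now bounded
            break
    return completion, reward
```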
Experiment designed: 2026-02-02
Branch: experiment/ppo-symbolic-regression