Experiment: Model Size Impact on Symbolic Regression
Research Project: Seriguela - Language Models for Symbolic Regression
Date: February 2025
Status: ⏳ In Progress
Abstract
This experiment investigates the impact of model size on the ability of GPT-2 language models to generate valid and complex mathematical expressions for symbolic regression. We train three model variants (Base: 124M, Medium: 355M, Large: 774M parameters) on 700K synthetic expressions using LoRA fine-tuning and evaluate them across multiple dimensions: validity, complexity, diversity, and performance on Nguyen benchmarks with reinforcement learning optimization.
Hypothesis: Larger models possess greater capacity to learn compositional patterns, resulting in more complex, valid, and diverse expression generation.
Key Question: Is the increased computational cost of larger models justified by improved expression quality and benchmark performance?
1. Introduction
1.1 Motivation
Prior work (see EXPERIMENT_RESULTS.md) demonstrated that JSON-formatted training (EXP-A) achieves an 80% valid expression rate, compared with 0.5% for the EOS-token approach. However, evaluation on the Nguyen-5 benchmark revealed a critical limitation:
Problem: The base model (GPT-2 124M) generates structurally simple expressions that fail on complex benchmarks.
Evidence (Nguyen-5 analysis):
- Valid expressions: 39.4%
- All valid expressions: R² = -1.0 (far worse than a constant mean predictor)
- Power operations (x²): Only 15.9%
- Nested trigonometric functions: 0%
- Average depth: 1.40 (target requires 2+)
Root Cause: The model learns syntactically valid but structurally trivial expressions. Without sufficient complexity, all rewards are uniformly bad → no gradient signal → no RL learning.
1.2 Research Questions
- RQ1: Do larger models generate more valid expressions?
- RQ2: Do larger models produce more complex expressions (depth, nesting, power operations)?
- RQ3: Do larger models achieve better R² scores on complex benchmarks?
- RQ4: Do larger models generate more diverse expressions?
- RQ5: What is the optimal model size for symbolic regression considering cost-benefit trade-offs?
1.3 Hypotheses
H1 (Validity): Valid expression rate increases with model size
- Base: 80% → Medium: 82-85% → Large: 85-90%
H2 (Complexity): Expression complexity increases with model size
- Power operations: Base 15.9% → Medium 35-45% → Large 50-65%
- Average depth: Base 1.40 → Medium 1.8-2.0 → Large 2.0-2.5
- Nested trig: Base 0% → Medium 5-10% → Large 10-20%
H3 (Performance): Benchmark performance (R²) improves with model size
- Nguyen-5 best R²: Base -1.0 → Medium >-0.5 → Large >0.0
H4 (Diversity): Expression diversity increases with model size
- Larger models explore broader expression space
H5 (Algorithm Interaction): RL algorithms work better with larger models
- PPO and GRPO benefit more from increased capacity
2. Methodology
2.1 Models
| Model | Parameters | LoRA Trainable | Instance Type | Batch Size | Cost (est.) |
|---|---|---|---|---|---|
| Base | 124M | 294K | g5.xlarge | 8 | $2-3 |
| Medium | 355M | ~786K | g5.xlarge | 4 | $3-4 |
| Large | 774M | ~1.47M | g5.2xlarge | 2 | $5-6 |
Key Design Decision: Fix all hyperparameters except batch size to isolate model size effect.
2.2 Training Configuration
Dataset: augustocsc/sintetico_natural (700K subset)
Format: JSON (EXP-A)
{"vars": ["x_1"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "sin(x_1 + C)"}
Hyperparameters (fixed across all models):
- Learning rate: 5e-5
- Epochs: 3 (with early stopping, patience=3)
- Gradient accumulation: 4 steps
- Warmup steps: 500
- Weight decay: 0.01
- FP16: True
- Seed: 42
- LoRA: r=8, alpha=32, target_modules=["c_attn"], dropout=0.05
Training Split: 90% train / 10% validation (automatic)
Infrastructure: AWS g5.xlarge / g5.2xlarge with NVIDIA A10G GPUs
Tracking: Weights & Biases (project: seriguela)
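A minimal PEFT setup matching the fixed LoRA hyperparameters above might look like the following sketch (the wrapper code is an assumption, not the project's actual training script; only the config values come from this document):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical setup mirroring the fixed LoRA hyperparameters above.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("gpt2")  # or "gpt2-medium", "gpt2-large"
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

For GPT-2 base (12 layers, hidden size 768), r=8 on c_attn gives 4 × 768 × 8 × 12 = 294,912 trainable parameters, matching the ~294K in the table; the same arithmetic scales with layer count and hidden size for the larger variants.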
2.3 Evaluation Metrics
2.3.1 Quality Metrics
Validity:
- Valid expression rate (%): Syntactically correct AND semantically evaluable
- Parseable rate (%): Syntactically correct only
Constraint Adherence:
- Uses allowed variables (%): Only uses vars specified in prompt
- Uses allowed operators (%): Only uses ops specified in prompt
- Constraint adherence (%): Both constraints satisfied
Diversity:
- Diversity rate (%): Proportion of unique expressions
- Unique expressions count: Absolute number of different expressions
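The parseable/valid distinction above can be made concrete. A minimal sketch (not the project's evaluator), using Python's `ast` module and a single probe point with hypothetical defaults for `x_1` and `C`:

```python
import ast
import math

# Hypothetical whitelist of callable operators for evaluation.
SAFE_FUNCS = {"sin": math.sin, "cos": math.cos, "exp": math.exp, "log": math.log}

def is_parseable(expr: str) -> bool:
    """Syntactically correct: the parser accepts the expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

def is_valid(expr: str, x: float = 0.5, c: float = 1.0) -> bool:
    """Semantically evaluable: evaluates to a finite number at a probe point."""
    if not is_parseable(expr):
        return False
    env = {**SAFE_FUNCS, "x_1": x, "C": c}
    try:
        value = eval(expr, {"__builtins__": {}}, env)
        return isinstance(value, (int, float)) and math.isfinite(value)
    except Exception:  # domain errors, unknown names, etc.
        return False
```

Real evaluation would sample many points and fit the constant C; a single probe point only illustrates the metric boundary, e.g. `log(x_1 - 1)` is parseable but not valid at x=0.5.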
2.3.2 Complexity Metrics
- Power Operations: Percentage using x², x**n
- Nested Trigonometric Functions: Percentage with sin(cos(x)), etc.
- Expression Depth: Average nesting level
- Operator Distribution: Usage frequencies
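The depth and nesting metrics can be computed over Python's AST. A sketch, under the assumption (one plausible convention, not necessarily the project's) that leaves such as variables and constants count as depth 0:

```python
import ast

TRIG = {"sin", "cos", "tan"}

def expr_depth(expr: str) -> int:
    """Nesting depth of the expression tree (leaves count as depth 0)."""
    def depth(node: ast.AST) -> int:
        if isinstance(node, ast.BinOp):
            return 1 + max(depth(node.left), depth(node.right))
        if isinstance(node, ast.UnaryOp):
            return 1 + depth(node.operand)
        if isinstance(node, ast.Call):
            return 1 + max((depth(a) for a in node.args), default=0)
        return 0  # Name, Constant, anything else
    return depth(ast.parse(expr, mode="eval").body)

def has_nested_trig(expr: str) -> bool:
    """True if a trig call appears anywhere inside another trig call."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and getattr(node.func, "id", None) in TRIG:
            for inner in ast.walk(node):
                if (inner is not node and isinstance(inner, ast.Call)
                        and getattr(inner.func, "id", None) in TRIG):
                    return True
    return False
```

Under this convention, `sin(x_1 + C)` has depth 2, which matches the scale of the figures quoted in Section 1.1 (average 1.40, target 2+).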
2.3.3 Benchmark Performance
Nguyen Suite (1-12): Standard symbolic regression benchmarks
Algorithms:
- Supervised: Direct generation (no optimization)
- REINFORCE: Policy gradient with EMA baseline
- GRPO: Group Relative Policy Optimization
- PPO: Proximal Policy Optimization
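The baseline and advantage computations these algorithms rely on can be sketched in a few lines (illustrative, not the project's implementation):

```python
import statistics

def ema_baseline(baseline: float, reward: float, alpha: float = 0.05) -> float:
    """REINFORCE with EMA baseline: exponential moving average of past rewards,
    subtracted from each reward to reduce gradient variance."""
    return (1.0 - alpha) * baseline + alpha * reward

def grpo_advantages(group_rewards: list) -> list:
    """GRPO: advantages are rewards normalized within a sampled group,
    so no learned value function is needed."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    if sigma == 0.0:
        return [0.0 for _ in group_rewards]  # degenerate group: no signal
    return [(r - mu) / sigma for r in group_rewards]
```

The zero-sigma branch illustrates the failure mode from Section 1.1: if every sampled expression earns the same (bad) reward, all advantages are zero and there is nothing to learn from.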
Metrics:
- Best R²: Highest R² achieved
- Mean R² (valid expressions): Average fit quality
- Convergence rate: Improvement over epochs
- Valid rate during RL: Maintains validity while optimizing
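R² here is the standard coefficient of determination, which is unbounded below; that is why uniformly poor expressions can all score -1.0 or worse. A minimal reference implementation:

```python
def r_squared(y_true: list, y_pred: list) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.
    1.0 is a perfect fit; 0.0 matches a constant mean predictor;
    negative values are worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```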
2.4 Experimental Design
Phase 1: Supervised Training
- Train all 3 models in parallel
- Monitor loss curves, early stopping
- Save checkpoints
Phase 2: Basic Evaluation
- Generate 500 expressions per model
- Compute quality and complexity metrics
- Compare models
Phase 3: Nguyen Suite Evaluation
- 3 models × 12 benchmarks × 4 algorithms = 144 experiments
- 20 epochs, 100 samples per epoch (RL algorithms)
- 200 samples (supervised)
Phase 4: Analysis
- Aggregate results
- Statistical significance testing
- Visualization (heatmaps, bar charts)
- Cost-benefit analysis
3. Results
To be filled after experiments complete
3.1 Training Results
Table 1: Training Metrics
| Model | Final Train Loss | Best Val Loss | Early Stopped | Training Time | Cost |
|---|---|---|---|---|---|
| Base | TBD | TBD | TBD | TBD | TBD |
| Medium | TBD | TBD | TBD | TBD | TBD |
| Large | TBD | TBD | TBD | TBD | TBD |
Expected: Lower loss for larger models.
Actual: TBD
3.2 Quality Metrics
Table 2: Supervised Generation Quality
| Metric | Base | Medium | Large | H1 Confirmed? |
|---|---|---|---|---|
| Valid Expression Rate (%) | TBD | TBD | TBD | ⏳ |
| Parseable Rate (%) | TBD | TBD | TBD | - |
| Constraint Adherence (%) | TBD | TBD | TBD | - |
| Diversity Rate (%) | TBD | TBD | TBD | ⏳ |
| Unique Expressions | TBD | TBD | TBD | - |
3.3 Complexity Metrics
Table 3: Expression Complexity
| Metric | Base | Medium | Large | Improvement (B→L) | H2 Confirmed? |
|---|---|---|---|---|---|
| Power Operations (%) | TBD | TBD | TBD | TBD | ⏳ |
| Nested Trig (%) | TBD | TBD | TBD | TBD | ⏳ |
| Average Depth | TBD | TBD | TBD | TBD | ⏳ |
| Max Depth | TBD | TBD | TBD | TBD | - |
Expected (H2):
- Power ops: Base 15.9% → Large 50-65%
- Depth: Base 1.40 → Large 2.0-2.5
- Nested trig: Base 0% → Large 10-20%
3.4 Nguyen Benchmark Performance
Table 4: Average R² Across All 12 Benchmarks
| Algorithm | Base | Medium | Large | Best Model | H3 Confirmed? |
|---|---|---|---|---|---|
| Supervised | TBD | TBD | TBD | TBD | ⏳ |
| REINFORCE | TBD | TBD | TBD | TBD | ⏳ |
| GRPO | TBD | TBD | TBD | TBD | ⏳ |
| PPO | TBD | TBD | TBD | TBD | ⏳ |
Table 5: Nguyen-5 Specific (Complex Benchmark)
| Algorithm | Base | Medium | Large | Improvement |
|---|---|---|---|---|
| Supervised | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD |
Baseline (from previous work): Base model, supervised, on Nguyen-5: R² = -1.0
Expected: Significant improvement with larger models
4. Visualizations
To be generated after evaluation completes
Figure 1: Model Comparison Overview
- 4 subplots: Valid Rate, R², Power Ops, Depth
- Bar charts comparing Base, Medium, Large
Figure 2: Algorithm Performance Heatmaps
- One heatmap per algorithm
- Rows: Nguyen benchmarks (1-12)
- Columns: Model sizes
- Color: R² scores
Figure 3: Complexity Progression
- Line chart showing how complexity metrics scale with model size
Figure 4: Cost-Benefit Analysis
- Scatter plot: Cost (x-axis) vs Performance (y-axis)
- Shows diminishing returns
5. Statistical Analysis
To be completed after results
5.1 Hypothesis Tests
H1 (Validity):
- Test: Chi-square test for valid rate differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD
H2 (Complexity):
- Test: Mann-Whitney U test for depth differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD
H3 (Performance):
- Test: Kruskal-Wallis test for R² differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD
5.2 Effect Sizes
- Cohen's d for continuous metrics (depth, R²)
- Cramér's V for categorical metrics (valid rate)
Results: TBD
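Cohen's d with a pooled standard deviation can be computed without external dependencies; a sketch (the actual analysis scripts presumably use scipy, which is an assumption):

```python
import math
import statistics

def cohens_d(a: list, b: list) -> float:
    """Cohen's d with pooled standard deviation
    (positive when mean(a) > mean(b))."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)
```

Conventional thresholds (0.2 small, 0.5 medium, 0.8 large) would apply when comparing, e.g., depth distributions between the Base and Large models.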
6. Discussion
To be written after results
6.1 Key Findings
- Finding 1: TBD
- Finding 2: TBD
- Finding 3: TBD
6.2 Interpretation
RQ1 (Validity): TBD
RQ2 (Complexity): TBD
RQ3 (Performance): TBD
RQ4 (Diversity): TBD
RQ5 (Optimal Size): TBD
6.3 Comparison with Hypotheses
| Hypothesis | Expected | Actual | Confirmed? |
|---|---|---|---|
| H1 (Validity increases) | 80% → 90% | TBD | ⏳ |
| H2 (Complexity increases) | 1.4 → 2.5 depth | TBD | ⏳ |
| H3 (R² improves) | -1.0 → >0.0 | TBD | ⏳ |
| H4 (Diversity increases) | Higher unique rate | TBD | ⏳ |
| H5 (RL benefits) | Better convergence | TBD | ⏳ |
6.4 Unexpected Results
Document any surprising findings
- TBD
- TBD
6.5 Limitations
LoRA fixed parameters: Using the same LoRA rank (r=8) for all model sizes may not be optimal
- Larger models might benefit from higher ranks
- Future: Scale LoRA rank with model size
Single dataset: Only tested on sintetico_natural 700K
- Results may not generalize to other expression distributions
- Future: Test on multiple datasets
Nguyen benchmarks only: Limited to 12 standard benchmarks
- May not represent all real-world symbolic regression tasks
- Future: Test on Feynman equations, real scientific datasets
Batch size variation: Per-device batch sizes differ across models (8→4→2)
- With gradient accumulation fixed at 4 steps, effective batch sizes also differ (32/16/8), and gradient noise with them
- May affect convergence dynamics
Early stopping: May have prevented full convergence
- Trade-off between cost and potential performance
- Future: Test with longer training
JSON format dependency: Results specific to JSON-structured prompts
- May not generalize to other formats
- Future: Test with multiple prompt formats
6.6 Implications
For Research:
- TBD
For Practitioners:
- TBD
For Model Selection:
- When to use Base: TBD
- When to use Medium: TBD
- When to use Large: TBD
7. Conclusions
To be written after results
7.1 Summary
This experiment investigated the impact of model size (124M → 355M → 774M) on symbolic regression expression generation across three dimensions: validity, complexity, and benchmark performance.
Main Result: TBD
7.2 Recommendations
- Recommended model size: TBD (based on cost-benefit)
- Best algorithm by model: TBD
- Optimal hyperparameters: TBD
7.3 Future Work
LoRA scaling study: Vary LoRA rank with model size
- Test: Base (r=8), Medium (r=16), Large (r=32)
- Hypothesis: Larger models need higher ranks for full capacity
Dataset scaling: Train on larger datasets (1M, 5M expressions)
- Test if larger models benefit more from more data
Architecture variants: Test other model families
- GPT-Neo, GPT-J, LLaMA
- Encoder-decoder models (T5, BART)
Multi-task learning: Train on multiple benchmarks simultaneously
- May improve generalization
Interpretability study: Analyze attention patterns
- Understand what larger models learn differently
Real-world deployment: Test on actual scientific datasets
- Feynman equations
- Materials science expressions
- Biological models
8. Reproducibility
8.1 Code and Data
Repository: https://github.com/augustocsc/seriguela
Branch: experiment/ppo-symbolic-regression
Commit: TBD (run git rev-parse HEAD)
Models: TBD (HuggingFace links)
- Base: https://huggingface.co/USER/gpt2_base_700K_json
- Medium: https://huggingface.co/USER/gpt2_medium_700K_json
- Large: https://huggingface.co/USER/gpt2_large_700K_json
Dataset: augustocsc/sintetico_natural (700K subset)
8.2 Reproduction Steps
# 1. Clone repository
git clone https://github.com/augustocsc/seriguela.git
cd seriguela
git checkout experiment/ppo-symbolic-regression
# 2. Install dependencies
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# 3. Train models (requires AWS)
bash launch_all_models.sh
# 4. Download trained models
# (see TRAINING_LOG for specific instance IPs)
# 5. Evaluate
bash scripts/run_nguyen_suite.sh
# 6. Aggregate results
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results
8.3 Hardware Requirements
Training:
- 3× AWS instances (2× g5.xlarge + 1× g5.2xlarge)
- Total VRAM: 72GB (one 24GB NVIDIA A10G per instance)
- Training time: ~10 hours total (parallel)
- Cost: ~$10-13 USD
Evaluation (Nguyen suite):
- 1× GPU with 24GB+ VRAM
- Time: ~12-16 hours for full suite (144 experiments)
- Can run on CPU (slower: ~48-72 hours)
8.4 Software Versions
See requirements.txt for exact versions.
Key dependencies:
- Python 3.10+
- PyTorch 2.5.1 (CUDA 12.1)
- Transformers 4.51.3
- PEFT 0.15.1
- Wandb ≥0.24.1
9. Acknowledgments
To be filled
- Dataset: Augusto et al. (sintetico_natural)
- Benchmarks: Nguyen et al.
- Infrastructure: AWS
- Tracking: Weights & Biases
10. References
To be filled
- Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
- Nguyen et al. Symbolic Regression Benchmarks
- Schulman et al. (2017). Proximal Policy Optimization
- Shao et al. (2024). DeepSeekMath (introduces the GRPO algorithm, also used in DeepSeek-R1)
- Previous work: EXPERIMENT_RESULTS.md
Document Version: 1.0
Last Updated: 2025-02-02
Status: ⏳ In Progress (Results pending)
Contact: [Your contact information]