# Experiment: Model Size Impact on Symbolic Regression

**Research Project**: Seriguela - Language Models for Symbolic Regression
**Date**: February 2025
**Status**: ⏳ In Progress

---

## Abstract

This experiment investigates the impact of model size on the ability of GPT-2 language models to generate valid and complex mathematical expressions for symbolic regression. We train three model variants (Base: 124M, Medium: 355M, Large: 774M parameters) on 700K synthetic expressions using LoRA fine-tuning and evaluate them across multiple dimensions: validity, complexity, diversity, and performance on Nguyen benchmarks with reinforcement-learning optimization.

**Hypothesis**: Larger models possess greater capacity to learn compositional patterns, resulting in more complex, valid, and diverse expression generation.

**Key Question**: Is the increased computational cost of larger models justified by improved expression quality and benchmark performance?

---

## 1. Introduction

### 1.1 Motivation

Prior work (see `EXPERIMENT_RESULTS.md`) demonstrated that JSON-formatted training (EXP-A) achieves an 80% valid-expression rate, compared to 0.5% with the EOS-token approach. However, evaluation on the Nguyen-5 benchmark revealed a critical limitation:

**Problem**: The base model (GPT-2 124M) generates structurally simple expressions that fail on complex benchmarks.

**Evidence** (Nguyen-5 analysis):
- Valid expressions: 39.4%
- All valid expressions: R² = -1.0 (terrible fit)
- Power operations (x²): only 15.9%
- Nested trigonometric functions: 0%
- Average depth: 1.40 (target requires 2+)

**Root Cause**: The model learns syntactically valid but structurally trivial expressions. Without sufficient complexity, all rewards are uniformly bad → no gradient signal → no RL learning.

### 1.2 Research Questions

1. **RQ1**: Do larger models generate more valid expressions?
2. **RQ2**: Do larger models produce more complex expressions (depth, nesting, power operations)?
3. **RQ3**: Do larger models achieve better R² scores on complex benchmarks?
4. **RQ4**: Do larger models generate more diverse expressions?
5. **RQ5**: What is the optimal model size for symbolic regression, considering cost-benefit trade-offs?

### 1.3 Hypotheses

**H1** (Validity): Valid expression rate increases with model size
- Base: 80% → Medium: 82-85% → Large: 85-90%

**H2** (Complexity): Expression complexity increases with model size
- Power operations: Base 15.9% → Medium 35-45% → Large 50-65%
- Average depth: Base 1.40 → Medium 1.8-2.0 → Large 2.0-2.5
- Nested trig: Base 0% → Medium 5-10% → Large 10-20%

**H3** (Performance): Benchmark performance (R²) improves with model size
- Nguyen-5 best R²: Base -1.0 → Medium >-0.5 → Large >0.0

**H4** (Diversity): Expression diversity increases with model size
- Larger models explore a broader expression space

**H5** (Algorithm Interaction): RL algorithms work better with larger models
- PPO and GRPO benefit more from increased capacity

---

## 2. Methodology

### 2.1 Models

| Model | Parameters | LoRA Trainable | Instance Type | Batch Size | Cost (est.) |
|-------|-----------|----------------|---------------|-----------|-------------|
| Base | 124M | 294K | g5.xlarge | 8 | $2-3 |
| Medium | 355M | 786K | g5.xlarge | 4 | $3-4 |
| Large | 774M | 1.47M | g5.2xlarge | 2 | $5-6 |

(With a fixed adapter config of r=8 on `c_attn`, the trainable-parameter count grows with depth and width, so the three sizes do not share a single count.)

**Key Design Decision**: Fix all hyperparameters except batch size to isolate the effect of model size.
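The "LoRA Trainable" column can be sanity-checked from the adapter configuration used here (r=8, adapters on `c_attn` only). A minimal sketch, assuming the standard GPT-2 dimensions (12/24/36 layers, hidden size 768/1024/1280) and that `c_attn` is the fused QKV projection (d_model → 3·d_model); `lora_trainable` is an illustrative helper, not part of the project code:

```python
# LoRA adds two low-rank factors per adapted weight: A (d_in x r) and
# B (r x d_out), i.e. r * (d_in + d_out) trainable parameters per layer.
def lora_trainable(n_layers: int, d_model: int, r: int = 8) -> int:
    d_in, d_out = d_model, 3 * d_model  # c_attn fuses Q, K, V projections
    return n_layers * r * (d_in + d_out)

# Standard GPT-2 configurations (layers, hidden size)
for name, n_layers, d_model in [("base", 12, 768),
                                ("medium", 24, 1024),
                                ("large", 36, 1280)]:
    print(f"{name}: {lora_trainable(n_layers, d_model):,}")
# base: 294,912   medium: 786,432   large: 1,474,560
```

Under these assumptions only the 124M model lands at ~294K trainable parameters; with an identical adapter config the medium and large variants train roughly 786K and 1.47M parameters, respectively.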
### 2.2 Training Configuration

**Dataset**: augustocsc/sintetico_natural (700K subset)

**Format**: JSON (EXP-A)

```json
{"vars": ["x_1"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "sin(x_1 + C)"}
```

**Hyperparameters** (fixed across all models):
- Learning rate: 5e-5
- Epochs: 3 (with early stopping, patience=3)
- Gradient accumulation: 4 steps
- Warmup steps: 500
- Weight decay: 0.01
- FP16: True
- Seed: 42
- LoRA: r=8, alpha=32, target_modules=["c_attn"], dropout=0.05

**Training Split**: 90% train / 10% validation (automatic)

**Infrastructure**: AWS g5.xlarge / g5.2xlarge with NVIDIA A10G GPUs

**Tracking**: Weights & Biases (project: seriguela)

### 2.3 Evaluation Metrics

#### 2.3.1 Quality Metrics

1. **Validity**:
   - Valid expression rate (%): syntactically correct AND semantically evaluable
   - Parseable rate (%): syntactically correct only
2. **Constraint Adherence**:
   - Uses allowed variables (%): only uses vars specified in the prompt
   - Uses allowed operators (%): only uses ops specified in the prompt
   - Constraint adherence (%): both constraints satisfied
3. **Diversity**:
   - Diversity rate (%): proportion of unique expressions
   - Unique expressions count: absolute number of different expressions

#### 2.3.2 Complexity Metrics

1. **Power Operations**: percentage using x², x**n
2. **Nested Trigonometric Functions**: percentage with sin(cos(x)), etc.
3. **Expression Depth**: average nesting level
4. **Operator Distribution**: usage frequencies

#### 2.3.3 Benchmark Performance

**Nguyen Suite** (1-12): standard symbolic regression benchmarks

**Algorithms**:
1. **Supervised**: direct generation (no optimization)
2. **REINFORCE**: policy gradient with EMA baseline
3. **GRPO**: Group Relative Policy Optimization
4. **PPO**: Proximal Policy Optimization

**Metrics**:
- Best R²: highest R² achieved
- Mean R² (valid expressions): average fit quality
- Convergence rate: improvement over epochs
- Valid rate during RL: maintains validity while optimizing

### 2.4 Experimental Design

**Phase 1**: Supervised Training
- Train all 3 models in parallel
- Monitor loss curves, early stopping
- Save checkpoints

**Phase 2**: Basic Evaluation
- Generate 500 expressions per model
- Compute quality and complexity metrics
- Compare models

**Phase 3**: Nguyen Suite Evaluation
- 3 models × 12 benchmarks × 4 algorithms = **144 experiments**
- 20 epochs, 100 samples per epoch (RL algorithms)
- 200 samples (supervised)

**Phase 4**: Analysis
- Aggregate results
- Statistical significance testing
- Visualization (heatmaps, bar charts)
- Cost-benefit analysis

---

## 3. Results

*To be filled after experiments complete*

### 3.1 Training Results

#### Table 1: Training Metrics

| Model | Final Train Loss | Best Val Loss | Early Stopped | Training Time | Cost |
|-------|------------------|---------------|---------------|---------------|------|
| Base | TBD | TBD | TBD | TBD | TBD |
| Medium | TBD | TBD | TBD | TBD | TBD |
| Large | TBD | TBD | TBD | TBD | TBD |

**Expected**: lower loss for larger models. **Actual**: TBD

### 3.2 Quality Metrics

#### Table 2: Supervised Generation Quality

| Metric | Base | Medium | Large | H1 Confirmed? |
|--------|------|--------|-------|---------------|
| Valid Expression Rate (%) | TBD | TBD | TBD | ⏳ |
| Parseable Rate (%) | TBD | TBD | TBD | - |
| Constraint Adherence (%) | TBD | TBD | TBD | - |
| Diversity Rate (%) | TBD | TBD | TBD | ⏳ |
| Unique Expressions | TBD | TBD | TBD | - |

### 3.3 Complexity Metrics

#### Table 3: Expression Complexity

| Metric | Base | Medium | Large | Improvement (B→L) | H2 Confirmed? |
|--------|------|--------|-------|-------------------|---------------|
| Power Operations (%) | TBD | TBD | TBD | TBD | ⏳ |
| Nested Trig (%) | TBD | TBD | TBD | TBD | ⏳ |
| Average Depth | TBD | TBD | TBD | TBD | ⏳ |
| Max Depth | TBD | TBD | TBD | TBD | - |

**Expected** (H2):
- Power ops: Base 15.9% → Large 50-65%
- Depth: Base 1.40 → Large 2.0-2.5
- Nested trig: Base 0% → Large 10-20%

### 3.4 Nguyen Benchmark Performance

#### Table 4: Average R² Across All 12 Benchmarks

| Algorithm | Base | Medium | Large | Best Model | H3 Confirmed? |
|-----------|------|--------|-------|-----------|---------------|
| Supervised | TBD | TBD | TBD | TBD | ⏳ |
| REINFORCE | TBD | TBD | TBD | TBD | ⏳ |
| GRPO | TBD | TBD | TBD | TBD | ⏳ |
| PPO | TBD | TBD | TBD | TBD | ⏳ |

#### Table 5: Nguyen-5 Specific (Complex Benchmark)

| Algorithm | Base | Medium | Large | Improvement |
|-----------|------|--------|-------|-------------|
| Supervised | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD |

**Baseline** (from previous work): Base supervised on Nguyen-5 = R² -1.0

**Expected**: significant improvement with larger models

---

## 4. Visualizations

*To be generated after evaluation completes*

### Figure 1: Model Comparison Overview
- 4 subplots: Valid Rate, R², Power Ops, Depth
- Bar charts comparing Base, Medium, Large

### Figure 2: Algorithm Performance Heatmaps
- One heatmap per algorithm
- Rows: Nguyen benchmarks (1-12)
- Columns: model sizes
- Color: R² scores

### Figure 3: Complexity Progression
- Line chart showing how complexity metrics scale with model size

### Figure 4: Cost-Benefit Analysis
- Scatter plot: cost (x-axis) vs performance (y-axis)
- Shows diminishing returns

---
## 5. Statistical Analysis

*To be completed after results*

### 5.1 Hypothesis Tests

**H1 (Validity)**:
- Test: chi-square test for valid-rate differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD

**H2 (Complexity)**:
- Test: Mann-Whitney U test for depth differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD

**H3 (Performance)**:
- Test: Kruskal-Wallis test for R² differences
- Significance level: α = 0.05
- Result: TBD
- Conclusion: TBD

### 5.2 Effect Sizes

- Cohen's d for continuous metrics (depth, R²)
- Cramér's V for categorical metrics (valid rate)

**Results**: TBD

---

## 6. Discussion

*To be written after results*

### 6.1 Key Findings

1. **Finding 1**: TBD
2. **Finding 2**: TBD
3. **Finding 3**: TBD

### 6.2 Interpretation

**RQ1 (Validity)**: TBD
**RQ2 (Complexity)**: TBD
**RQ3 (Performance)**: TBD
**RQ4 (Diversity)**: TBD
**RQ5 (Optimal Size)**: TBD

### 6.3 Comparison with Hypotheses

| Hypothesis | Expected | Actual | Confirmed? |
|-----------|----------|--------|-----------|
| H1 (Validity increases) | 80% → 90% | TBD | ⏳ |
| H2 (Complexity increases) | 1.4 → 2.5 depth | TBD | ⏳ |
| H3 (R² improves) | -1.0 → >0.0 | TBD | ⏳ |
| H4 (Diversity increases) | Higher unique rate | TBD | ⏳ |
| H5 (RL benefits) | Better convergence | TBD | ⏳ |

### 6.4 Unexpected Results

*Document any surprising findings*

1. TBD
2. TBD

### 6.5 Limitations

1. **LoRA fixed parameters**: Using the same LoRA rank (r=8) for all model sizes may not be optimal
   - Larger models might benefit from higher ranks
   - Future: scale LoRA rank with model size
2. **Single dataset**: Only tested on sintetico_natural 700K
   - Results may not generalize to other expression distributions
   - Future: test on multiple datasets
3. **Nguyen benchmarks only**: Limited to 12 standard benchmarks
   - May not represent all real-world symbolic regression tasks
   - Future: test on Feynman equations, real scientific datasets
4. **Batch size variation**: Different batch sizes across models (8→4→2)
   - Effective batch size is the same (×4 accumulation), but gradient noise differs
   - May affect convergence dynamics
5. **Early stopping**: May have prevented full convergence
   - Trade-off between cost and potential performance
   - Future: test with longer training
6. **JSON format dependency**: Results specific to JSON-structured prompts
   - May not generalize to other formats
   - Future: test with multiple prompt formats

### 6.6 Implications

**For Research**:
- TBD

**For Practitioners**:
- TBD

**For Model Selection**:
- When to use Base: TBD
- When to use Medium: TBD
- When to use Large: TBD

---

## 7. Conclusions

*To be written after results*

### 7.1 Summary

This experiment investigated the impact of model size (124M → 355M → 774M) on symbolic regression expression generation across three dimensions: validity, complexity, and benchmark performance.

**Main Result**: TBD

### 7.2 Recommendations

1. **Recommended model size**: TBD (based on cost-benefit)
2. **Best algorithm by model**: TBD
3. **Optimal hyperparameters**: TBD

### 7.3 Future Work

1. **LoRA scaling study**: Vary LoRA rank with model size
   - Test: Base (r=8), Medium (r=16), Large (r=32)
   - Hypothesis: larger models need higher ranks to use their full capacity
2. **Dataset scaling**: Train on larger datasets (1M, 5M expressions)
   - Test whether larger models benefit more from more data
3. **Architecture variants**: Test other model families
   - GPT-Neo, GPT-J, LLaMA
   - Encoder-decoder models (T5, BART)
4. **Multi-task learning**: Train on multiple benchmarks simultaneously
   - May improve generalization
5. **Interpretability study**: Analyze attention patterns
   - Understand what larger models learn differently
6. **Real-world deployment**: Test on actual scientific datasets
   - Feynman equations
   - Materials science expressions
   - Biological models

---
## 8. Reproducibility

### 8.1 Code and Data

**Repository**: https://github.com/augustocsc/seriguela
**Branch**: experiment/ppo-symbolic-regression
**Commit**: TBD (run `git rev-parse HEAD`)

**Models**: TBD (HuggingFace links)
- Base: https://huggingface.co/USER/gpt2_base_700K_json
- Medium: https://huggingface.co/USER/gpt2_medium_700K_json
- Large: https://huggingface.co/USER/gpt2_large_700K_json

**Dataset**: augustocsc/sintetico_natural (700K subset)

### 8.2 Reproduction Steps

```bash
# 1. Clone repository
git clone https://github.com/augustocsc/seriguela.git
cd seriguela
git checkout experiment/ppo-symbolic-regression

# 2. Install dependencies
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# 3. Train models (requires AWS)
bash launch_all_models.sh

# 4. Download trained models
# (see TRAINING_LOG for specific instance IPs)

# 5. Evaluate
bash scripts/run_nguyen_suite.sh

# 6. Aggregate results
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results
```

### 8.3 Hardware Requirements

**Training**:
- 3× AWS instances (g5.xlarge + g5.2xlarge)
- Total VRAM: 96GB
- Training time: ~10 hours total (parallel)
- Cost: ~$10-13 USD

**Evaluation** (Nguyen suite):
- 1× GPU with 24GB+ VRAM
- Time: ~12-16 hours for the full suite (144 experiments)
- Can run on CPU (slower: ~48-72 hours)

### 8.4 Software Versions

See `requirements.txt` for exact versions.

**Key dependencies**:
- Python 3.10+
- PyTorch 2.5.1 (CUDA 12.1)
- Transformers 4.51.3
- PEFT 0.15.1
- Wandb ≥0.24.1

---

## 9. Acknowledgments

*To be filled*

- Dataset: Augusto et al. (sintetico_natural)
- Benchmarks: Nguyen et al.
- Infrastructure: AWS
- Tracking: Weights & Biases

---

## 10. References

*To be filled*

1. Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
2. Nguyen et al. Symbolic Regression Benchmarks
3. Schulman et al. (2017).
Proximal Policy Optimization
4. DeepSeek-R1 paper (GRPO algorithm)
5. Previous work: EXPERIMENT_RESULTS.md

---

**Document Version**: 1.0
**Last Updated**: 2025-02-02
**Status**: ⏳ In Progress (Results pending)
**Contact**: [Your contact information]