
Experiment: Model Size Impact on Symbolic Regression

Research Project: Seriguela - Language Models for Symbolic Regression

Date: February 2025

Status: ⏳ In Progress


Abstract

This experiment investigates the impact of model size on the ability of GPT-2 language models to generate valid and complex mathematical expressions for symbolic regression. We train three model variants (Base: 124M, Medium: 355M, Large: 774M parameters) on 700K synthetic expressions using LoRA fine-tuning and evaluate them across multiple dimensions: validity, complexity, diversity, and performance on Nguyen benchmarks with reinforcement learning optimization.

Hypothesis: Larger models possess greater capacity to learn compositional patterns, resulting in more complex, valid, and diverse expression generation.

Key Question: Is the increased computational cost of larger models justified by improved expression quality and benchmark performance?


1. Introduction

1.1 Motivation

Prior work (see EXPERIMENT_RESULTS.md) demonstrated that JSON-formatted training (EXP-A) achieves an 80% valid-expression rate, compared to 0.5% with the EOS-token approach. However, evaluation on the Nguyen-5 benchmark revealed a critical limitation:

Problem: The base model (GPT-2 124M) generates structurally simple expressions that fail on complex benchmarks.

Evidence (Nguyen-5 analysis):

  • Valid expressions: 39.4%
  • All valid expressions: R² = -1.0 (worse than predicting the mean)
  • Power operations (x²): Only 15.9%
  • Nested trigonometric functions: 0%
  • Average depth: 1.40 (target requires 2+)

Root Cause: The model learns syntactically valid but structurally trivial expressions. Without sufficient structural complexity, all rewards are uniformly poor, so there is no gradient signal and RL cannot learn.

1.2 Research Questions

  1. RQ1: Do larger models generate more valid expressions?
  2. RQ2: Do larger models produce more complex expressions (depth, nesting, power operations)?
  3. RQ3: Do larger models achieve better R² scores on complex benchmarks?
  4. RQ4: Do larger models generate more diverse expressions?
  5. RQ5: What is the optimal model size for symbolic regression considering cost-benefit trade-offs?

1.3 Hypotheses

H1 (Validity): Valid expression rate increases with model size

  • Base: 80% → Medium: 82-85% → Large: 85-90%

H2 (Complexity): Expression complexity increases with model size

  • Power operations: Base 15.9% → Medium 35-45% → Large 50-65%
  • Average depth: Base 1.40 → Medium 1.8-2.0 → Large 2.0-2.5
  • Nested trig: Base 0% → Medium 5-10% → Large 10-20%

H3 (Performance): Benchmark performance (R²) improves with model size

  • Nguyen-5 best R²: Base -1.0 → Medium >-0.5 → Large >0.0

H4 (Diversity): Expression diversity increases with model size

  • Larger models explore broader expression space

H5 (Algorithm Interaction): RL algorithms work better with larger models

  • PPO and GRPO benefit more from increased capacity

2. Methodology

2.1 Models

| Model | Parameters | LoRA Trainable | Instance Type | Batch Size | Cost (est.) |
|--------|------|------|------------|---|------|
| Base | 124M | 294K | g5.xlarge | 8 | $2-3 |
| Medium | 355M | 294K | g5.xlarge | 4 | $3-4 |
| Large | 774M | 294K | g5.2xlarge | 2 | $5-6 |

Key Design Decision: Fix all hyperparameters except batch size to isolate model size effect.

2.2 Training Configuration

Dataset: augustocsc/sintetico_natural (700K subset)

Format: JSON (EXP-A)

{"vars": ["x_1"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "sin(x_1 + C)"}

Hyperparameters (fixed across all models):

  • Learning rate: 5e-5
  • Epochs: 3 (with early stopping, patience=3)
  • Gradient accumulation: 4 steps
  • Warmup steps: 500
  • Weight decay: 0.01
  • FP16: True
  • Seed: 42
  • LoRA: r=8, alpha=32, target_modules=["c_attn"], dropout=0.05
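In PEFT, the adapter settings above correspond to roughly this configuration sketch (only the listed values come from this document; wrapping a loaded GPT-2 model with `get_peft_model(model, lora_config)` then makes only the adapter weights trainable):

```python
from peft import LoraConfig, TaskType

# LoRA configuration matching the fixed hyperparameters above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank, fixed across all three model sizes
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    lora_dropout=0.05,
)
```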

Training Split: 90% train / 10% validation (automatic)

Infrastructure: AWS g5.xlarge / g5.2xlarge with NVIDIA A10G GPUs

Tracking: Weights & Biases (project: seriguela)

2.3 Evaluation Metrics

2.3.1 Quality Metrics

  1. Validity:

    • Valid expression rate (%): Syntactically correct AND semantically evaluable
    • Parseable rate (%): Syntactically correct only
  2. Constraint Adherence:

    • Uses allowed variables (%): Only uses vars specified in prompt
    • Uses allowed operators (%): Only uses ops specified in prompt
    • Constraint adherence (%): Both constraints satisfied
  3. Diversity:

    • Diversity rate (%): Proportion of unique expressions
    • Unique expressions count: Absolute number of different expressions
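The parseability and constraint checks can be sketched with Python's `ast` module. This is an illustrative implementation, not the project's actual evaluator; it treats function names (sin, cos, ...) as operators and ignores the constant token `C`:

```python
import ast

def check_expression(expr, allowed_vars, allowed_ops):
    """Return (parseable, vars_ok, ops_ok) for a generated expression."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False, False, False

    used_vars, used_funcs = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            used_funcs.add(node.func.id)
        elif isinstance(node, ast.Name):
            used_vars.add(node.id)

    used_vars -= used_funcs          # function names are not variables
    used_vars.discard("C")           # constant placeholder token
    vars_ok = used_vars <= set(allowed_vars)
    ops_ok = used_funcs <= {op for op in allowed_ops if op.isalpha()}
    return True, vars_ok, ops_ok
```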

2.3.2 Complexity Metrics

  1. Power Operations: Percentage using x², x**n
  2. Nested Trigonometric Functions: Percentage with sin(cos(x)), etc.
  3. Expression Depth: Average nesting level
  4. Operator Distribution: Usage frequencies
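The depth and power-operation metrics can be computed from the parsed tree. A sketch that counts only function calls and arithmetic operators toward depth (the project's exact depth definition may differ):

```python
import ast

def expression_depth(expr):
    """Nesting depth over function calls and arithmetic operators,
    so x_1 has depth 0, sin(x_1) depth 1, sin(cos(x_1)) depth 2."""
    def depth(node):
        d = max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
        if isinstance(node, (ast.Call, ast.BinOp, ast.UnaryOp)):
            d += 1
        return d
    return depth(ast.parse(expr, mode="eval"))

def uses_power(expr):
    """True if the expression contains a ** operation (e.g. x_1**2)."""
    tree = ast.parse(expr, mode="eval")
    return any(isinstance(n, ast.BinOp) and isinstance(n.op, ast.Pow)
               for n in ast.walk(tree))
```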

2.3.3 Benchmark Performance

Nguyen Suite (1-12): Standard symbolic regression benchmarks

Algorithms:

  1. Supervised: Direct generation (no optimization)
  2. REINFORCE: Policy gradient with EMA baseline
  3. GRPO: Group Relative Policy Optimization
  4. PPO: Proximal Policy Optimization
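As a sketch of the REINFORCE variant above, the EMA baseline subtracts a running average of past rewards from each new reward to form the advantage (the decay value here is an assumption, not the project's actual setting):

```python
class EMABaseline:
    """Exponential moving average of rewards, used to reduce the
    variance of the REINFORCE gradient estimate."""
    def __init__(self, decay=0.9):   # decay is an assumed value
        self.decay = decay
        self.value = None

    def advantage(self, reward):
        if self.value is None:
            self.value = reward      # initialize on first reward
        else:
            self.value = self.decay * self.value + (1 - self.decay) * reward
        return reward - self.value
```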

Metrics:

  • Best R²: Highest R² achieved
  • Mean R² (valid expressions): Average fit quality
  • Convergence rate: Improvement over epochs
  • Valid rate during RL: Maintains validity while optimizing
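The R² metric itself is standard; a dependency-free sketch:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Equals 1.0 for a perfect fit, 0.0 for predicting the mean,
    and goes negative (e.g. the -1.0 seen on Nguyen-5) for worse fits."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```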

2.4 Experimental Design

Phase 1: Supervised Training

  • Train all 3 models in parallel
  • Monitor loss curves, early stopping
  • Save checkpoints

Phase 2: Basic Evaluation

  • Generate 500 expressions per model
  • Compute quality and complexity metrics
  • Compare models

Phase 3: Nguyen Suite Evaluation

  • 3 models × 12 benchmarks × 4 algorithms = 144 experiments
  • 20 epochs, 100 samples per epoch (RL algorithms)
  • 200 samples (supervised)

Phase 4: Analysis

  • Aggregate results
  • Statistical significance testing
  • Visualization (heatmaps, bar charts)
  • Cost-benefit analysis

3. Results

To be filled after experiments complete

3.1 Training Results

Table 1: Training Metrics

| Model | Final Train Loss | Best Val Loss | Early Stopped | Training Time | Cost |
|--------|-----|-----|-----|-----|-----|
| Base | TBD | TBD | TBD | TBD | TBD |
| Medium | TBD | TBD | TBD | TBD | TBD |
| Large | TBD | TBD | TBD | TBD | TBD |

Expected: Lower loss for larger models.

Actual: TBD

3.2 Quality Metrics

Table 2: Supervised Generation Quality

| Metric | Base | Medium | Large | H1 Confirmed? |
|--------|-----|-----|-----|-----|
| Valid Expression Rate (%) | TBD | TBD | TBD | TBD |
| Parseable Rate (%) | TBD | TBD | TBD | - |
| Constraint Adherence (%) | TBD | TBD | TBD | - |
| Diversity Rate (%) | TBD | TBD | TBD | TBD |
| Unique Expressions | TBD | TBD | TBD | - |

3.3 Complexity Metrics

Table 3: Expression Complexity

| Metric | Base | Medium | Large | Improvement (B→L) | H2 Confirmed? |
|--------|-----|-----|-----|-----|-----|
| Power Operations (%) | TBD | TBD | TBD | TBD | TBD |
| Nested Trig (%) | TBD | TBD | TBD | TBD | TBD |
| Average Depth | TBD | TBD | TBD | TBD | TBD |
| Max Depth | TBD | TBD | TBD | TBD | - |

Expected (H2):

  • Power ops: Base 15.9% → Large 50-65%
  • Depth: Base 1.40 → Large 2.0-2.5
  • Nested trig: Base 0% → Large 10-20%

3.4 Nguyen Benchmark Performance

Table 4: Average R² Across All 12 Benchmarks

| Algorithm | Base | Medium | Large | Best Model | H3 Confirmed? |
|-----------|-----|-----|-----|-----|-----|
| Supervised | TBD | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD | TBD |

Table 5: Nguyen-5 Specific (Complex Benchmark)

| Algorithm | Base | Medium | Large | Improvement |
|-----------|-----|-----|-----|-----|
| Supervised | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD |

Baseline (from previous work): Base model, supervised, on Nguyen-5: R² = -1.0

Expected: Significant improvement with larger models


4. Visualizations

To be generated after evaluation completes

Figure 1: Model Comparison Overview

  • 4 subplots: Valid Rate, R², Power Ops, Depth
  • Bar charts comparing Base, Medium, Large

Figure 2: Algorithm Performance Heatmaps

  • One heatmap per algorithm
  • Rows: Nguyen benchmarks (1-12)
  • Columns: Model sizes
  • Color: R² scores

Figure 3: Complexity Progression

  • Line chart showing how complexity metrics scale with model size

Figure 4: Cost-Benefit Analysis

  • Scatter plot: Cost (x-axis) vs Performance (y-axis)
  • Shows diminishing returns

5. Statistical Analysis

To be completed after results

5.1 Hypothesis Tests

H1 (Validity):

  • Test: Chi-square test for valid rate differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD

H2 (Complexity):

  • Test: Mann-Whitney U test for depth differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD

H3 (Performance):

  • Test: Kruskal-Wallis test for R² differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD
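With scipy, the three tests above can be run in a few lines. The counts and samples below are illustrative placeholders, not experimental results:

```python
from scipy import stats

# H1: valid vs. invalid counts per model (hypothetical numbers)
counts = [[400, 100],   # Base:   80% valid out of 500
          [425, 75],    # Medium: 85%
          [450, 50]]    # Large:  90%
chi2, p_valid, dof, _ = stats.chi2_contingency(counts)

# H2: expression-depth samples for two models (hypothetical)
depth_base = [1, 1, 2, 1, 2, 1]
depth_large = [2, 3, 2, 3, 2, 2]
u_stat, p_depth = stats.mannwhitneyu(depth_base, depth_large)

# H3: R² samples across the three models (hypothetical)
h_stat, p_r2 = stats.kruskal([-1.0, -0.9], [-0.5, -0.4], [0.1, 0.2])
```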

5.2 Effect Sizes

  • Cohen's d for continuous metrics (depth, R²)
  • Cramér's V for categorical metrics (valid rate)
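Cohen's d with a pooled standard deviation is a few lines (a sketch):

```python
import math

def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mb - ma) / pooled
```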

Results: TBD


6. Discussion

To be written after results

6.1 Key Findings

  1. Finding 1: TBD
  2. Finding 2: TBD
  3. Finding 3: TBD

6.2 Interpretation

RQ1 (Validity): TBD

RQ2 (Complexity): TBD

RQ3 (Performance): TBD

RQ4 (Diversity): TBD

RQ5 (Optimal Size): TBD

6.3 Comparison with Hypotheses

| Hypothesis | Expected | Actual | Confirmed? |
|------------|----------|--------|------------|
| H1 (Validity increases) | 80% → 90% | TBD | TBD |
| H2 (Complexity increases) | 1.4 → 2.5 depth | TBD | TBD |
| H3 (R² improves) | -1.0 → >0.0 | TBD | TBD |
| H4 (Diversity increases) | Higher unique rate | TBD | TBD |
| H5 (RL benefits) | Better convergence | TBD | TBD |

6.4 Unexpected Results

Document any surprising findings

  1. TBD
  2. TBD

6.5 Limitations

  1. LoRA fixed parameters: Using the same LoRA rank (r=8) for all model sizes may not be optimal

    • Larger models might benefit from higher ranks
    • Future: Scale LoRA rank with model size
  2. Single dataset: Only tested on sintetico_natural 700K

    • Results may not generalize to other expression distributions
    • Future: Test on multiple datasets
  3. Nguyen benchmarks only: Limited to 12 standard benchmarks

    • May not represent all real-world symbolic regression tasks
    • Future: Test on Feynman equations, real scientific datasets
  4. Batch size variation: Different batch sizes across models (8→4→2)

    • With gradient accumulation fixed at 4 steps, effective batch sizes also differ (32 / 16 / 8), and gradient noise differs accordingly
    • May affect convergence dynamics
  5. Early stopping: May have prevented full convergence

    • Trade-off between cost and potential performance
    • Future: Test with longer training
  6. JSON format dependency: Results specific to JSON-structured prompts

    • May not generalize to other formats
    • Future: Test with multiple prompt formats

6.6 Implications

For Research:

  • TBD

For Practitioners:

  • TBD

For Model Selection:

  • When to use Base: TBD
  • When to use Medium: TBD
  • When to use Large: TBD

7. Conclusions

To be written after results

7.1 Summary

This experiment investigated the impact of model size (124M → 355M → 774M) on symbolic regression expression generation across three dimensions: validity, complexity, and benchmark performance.

Main Result: TBD

7.2 Recommendations

  1. Recommended model size: TBD (based on cost-benefit)
  2. Best algorithm by model: TBD
  3. Optimal hyperparameters: TBD

7.3 Future Work

  1. LoRA scaling study: Vary LoRA rank with model size

    • Test: Base (r=8), Medium (r=16), Large (r=32)
    • Hypothesis: Larger models need higher ranks for full capacity
  2. Dataset scaling: Train on larger datasets (1M, 5M expressions)

    • Test if larger models benefit more from more data
  3. Architecture variants: Test other model families

    • GPT-Neo, GPT-J, LLaMA
    • Encoder-decoder models (T5, BART)
  4. Multi-task learning: Train on multiple benchmarks simultaneously

    • May improve generalization
  5. Interpretability study: Analyze attention patterns

    • Understand what larger models learn differently
  6. Real-world deployment: Test on actual scientific datasets

    • Feynman equations
    • Materials science expressions
    • Biological models

8. Reproducibility

8.1 Code and Data

Repository: https://github.com/augustocsc/seriguela

Branch: experiment/ppo-symbolic-regression

Commit: TBD (run git rev-parse HEAD)

Models: TBD (HuggingFace links)

Dataset: augustocsc/sintetico_natural (700K subset)

8.2 Reproduction Steps

# 1. Clone repository
git clone https://github.com/augustocsc/seriguela.git
cd seriguela
git checkout experiment/ppo-symbolic-regression

# 2. Install dependencies
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# 3. Train models (requires AWS)
bash launch_all_models.sh

# 4. Download trained models
# (see TRAINING_LOG for specific instance IPs)

# 5. Evaluate
bash scripts/run_nguyen_suite.sh

# 6. Aggregate results
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results

8.3 Hardware Requirements

Training:

  • 3× AWS instances (2× g5.xlarge, 1× g5.2xlarge)
  • GPU: one 24 GB NVIDIA A10G per instance (72 GB total VRAM)
  • Training time: ~10 hours total (parallel)
  • Cost: ~$10-13 USD

Evaluation (Nguyen suite):

  • 1× GPU with 24GB+ VRAM
  • Time: ~12-16 hours for full suite (144 experiments)
  • Can run on CPU (slower: ~48-72 hours)

8.4 Software Versions

See requirements.txt for exact versions.

Key dependencies:

  • Python 3.10+
  • PyTorch 2.5.1 (CUDA 12.1)
  • Transformers 4.51.3
  • PEFT 0.15.1
  • Wandb ≥0.24.1

9. Acknowledgments

To be filled

  • Dataset: Augusto et al. (sintetico_natural)
  • Benchmarks: Nguyen et al.
  • Infrastructure: AWS
  • Tracking: Weights & Biases

10. References

To be filled

  1. Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
  2. Nguyen et al. Symbolic Regression Benchmarks
  3. Schulman et al. (2017). Proximal Policy Optimization
  4. Shao et al. (2024). DeepSeekMath (introduces the GRPO algorithm)
  5. Previous work: EXPERIMENT_RESULTS.md

Document Version: 1.0

Last Updated: 2025-02-02

Status: ⏳ In Progress (Results pending)

Contact: [Your contact information]