
Experiment: Model Size Impact on Symbolic Regression

Research Project: Seriguela - Language Models for Symbolic Regression

Date: February 2025

Status: ⏳ In Progress


Abstract

This experiment investigates the impact of model size on the ability of GPT-2 language models to generate valid and complex mathematical expressions for symbolic regression. We train three model variants (Base: 124M, Medium: 355M, Large: 774M parameters) on 700K synthetic expressions using LoRA fine-tuning and evaluate them across multiple dimensions: validity, complexity, diversity, and performance on Nguyen benchmarks with reinforcement learning optimization.

Hypothesis: Larger models possess greater capacity to learn compositional patterns, resulting in more complex, valid, and diverse expression generation.

Key Question: Is the increased computational cost of larger models justified by improved expression quality and benchmark performance?


1. Introduction

1.1 Motivation

Prior work (see EXPERIMENT_RESULTS.md) demonstrated that JSON-formatted training (EXP-A) achieves an 80% valid-expression rate, compared to 0.5% with the EOS-token approach. However, evaluation on the Nguyen-5 benchmark revealed a critical limitation:

Problem: The base model (GPT-2 124M) generates structurally simple expressions that fail on complex benchmarks.

Evidence (Nguyen-5 analysis):

  • Valid expressions: 39.4%
  • All valid expressions: R² = -1.0 (worse than predicting the mean)
  • Power operations (x²): Only 15.9%
  • Nested trigonometric functions: 0%
  • Average depth: 1.40 (target requires 2+)

Root Cause: The model learns syntactically valid but structurally trivial expressions. Without sufficient structural complexity, all rewards are uniformly poor, so there is no gradient signal and RL cannot learn.

1.2 Research Questions

  1. RQ1: Do larger models generate more valid expressions?
  2. RQ2: Do larger models produce more complex expressions (depth, nesting, power operations)?
  3. RQ3: Do larger models achieve better R² scores on complex benchmarks?
  4. RQ4: Do larger models generate more diverse expressions?
  5. RQ5: What is the optimal model size for symbolic regression considering cost-benefit trade-offs?

1.3 Hypotheses

H1 (Validity): Valid expression rate increases with model size

  • Base: 80% → Medium: 82-85% → Large: 85-90%

H2 (Complexity): Expression complexity increases with model size

  • Power operations: Base 15.9% → Medium 35-45% → Large 50-65%
  • Average depth: Base 1.40 → Medium 1.8-2.0 → Large 2.0-2.5
  • Nested trig: Base 0% → Medium 5-10% → Large 10-20%

H3 (Performance): Benchmark performance (R²) improves with model size

  • Nguyen-5 best R²: Base -1.0 → Medium >-0.5 → Large >0.0

H4 (Diversity): Expression diversity increases with model size

  • Larger models explore broader expression space

H5 (Algorithm Interaction): RL algorithms work better with larger models

  • PPO and GRPO benefit more from increased capacity

2. Methodology

2.1 Models

| Model | Parameters | LoRA Trainable | Instance Type | Batch Size | Cost (est.) |
|--------|------|------|------------|---|------|
| Base | 124M | 294K | g5.xlarge | 8 | $2-3 |
| Medium | 355M | 294K | g5.xlarge | 4 | $3-4 |
| Large | 774M | 294K | g5.2xlarge | 2 | $5-6 |

Key Design Decision: Fix all hyperparameters except batch size to isolate model size effect.

2.2 Training Configuration

Dataset: augustocsc/sintetico_natural (700K subset)

Format: JSON (EXP-A)

{"vars": ["x_1"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "sin(x_1 + C)"}

Hyperparameters (fixed across all models):

  • Learning rate: 5e-5
  • Epochs: 3 (with early stopping, patience=3)
  • Gradient accumulation: 4 steps
  • Warmup steps: 500
  • Weight decay: 0.01
  • FP16: True
  • Seed: 42
  • LoRA: r=8, alpha=32, target_modules=["c_attn"], dropout=0.05
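In PEFT, the adapter settings above correspond to roughly this configuration sketch (only the listed values come from this document; wrapping a loaded GPT-2 model with `get_peft_model(model, lora_config)` then makes only the adapter weights trainable):

```python
from peft import LoraConfig, TaskType

# LoRA configuration matching the fixed hyperparameters above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank, fixed across all three model sizes
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    lora_dropout=0.05,
)
```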

Training Split: 90% train / 10% validation (automatic)

Infrastructure: AWS g5.xlarge / g5.2xlarge with NVIDIA A10G GPUs

Tracking: Weights & Biases (project: seriguela)

2.3 Evaluation Metrics

2.3.1 Quality Metrics

  1. Validity:

    • Valid expression rate (%): Syntactically correct AND semantically evaluable
    • Parseable rate (%): Syntactically correct only
  2. Constraint Adherence:

    • Uses allowed variables (%): Only uses vars specified in prompt
    • Uses allowed operators (%): Only uses ops specified in prompt
    • Constraint adherence (%): Both constraints satisfied
  3. Diversity:

    • Diversity rate (%): Proportion of unique expressions
    • Unique expressions count: Absolute number of different expressions
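The parseability and constraint checks can be sketched with Python's `ast` module. This is an illustrative implementation, not the project's actual evaluator; it treats function names (sin, cos, ...) as operators and ignores the constant token `C`:

```python
import ast

def check_expression(expr, allowed_vars, allowed_ops):
    """Return (parseable, vars_ok, ops_ok) for a generated expression."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False, False, False

    used_vars, used_funcs = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            used_funcs.add(node.func.id)
        elif isinstance(node, ast.Name):
            used_vars.add(node.id)

    used_vars -= used_funcs          # function names are not variables
    used_vars.discard("C")           # constant placeholder token
    vars_ok = used_vars <= set(allowed_vars)
    ops_ok = used_funcs <= {op for op in allowed_ops if op.isalpha()}
    return True, vars_ok, ops_ok
```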

2.3.2 Complexity Metrics

  1. Power Operations: Percentage using x², x**n
  2. Nested Trigonometric Functions: Percentage with sin(cos(x)), etc.
  3. Expression Depth: Average nesting level
  4. Operator Distribution: Usage frequencies
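The depth and power-operation metrics can be computed from the parsed tree. A sketch that counts only function calls and arithmetic operators toward depth (the project's exact depth definition may differ):

```python
import ast

def expression_depth(expr):
    """Nesting depth over function calls and arithmetic operators,
    so x_1 has depth 0, sin(x_1) depth 1, sin(cos(x_1)) depth 2."""
    def depth(node):
        d = max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
        if isinstance(node, (ast.Call, ast.BinOp, ast.UnaryOp)):
            d += 1
        return d
    return depth(ast.parse(expr, mode="eval"))

def uses_power(expr):
    """True if the expression contains a ** operation (e.g. x_1**2)."""
    tree = ast.parse(expr, mode="eval")
    return any(isinstance(n, ast.BinOp) and isinstance(n.op, ast.Pow)
               for n in ast.walk(tree))
```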

2.3.3 Benchmark Performance

Nguyen Suite (1-12): Standard symbolic regression benchmarks

Algorithms:

  1. Supervised: Direct generation (no optimization)
  2. REINFORCE: Policy gradient with EMA baseline
  3. GRPO: Group Relative Policy Optimization
  4. PPO: Proximal Policy Optimization
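As a sketch of the REINFORCE variant above, the EMA baseline subtracts a running average of past rewards from each new reward to form the advantage (the decay value here is an assumption, not the project's actual setting):

```python
class EMABaseline:
    """Exponential moving average of rewards, used to reduce the
    variance of the REINFORCE gradient estimate."""
    def __init__(self, decay=0.9):   # decay is an assumed value
        self.decay = decay
        self.value = None

    def advantage(self, reward):
        if self.value is None:
            self.value = reward      # initialize on first reward
        else:
            self.value = self.decay * self.value + (1 - self.decay) * reward
        return reward - self.value
```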

Metrics:

  • Best R²: Highest R² achieved
  • Mean R² (valid expressions): Average fit quality
  • Convergence rate: Improvement over epochs
  • Valid rate during RL: Maintains validity while optimizing
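The R² metric itself is standard; a dependency-free sketch:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Equals 1.0 for a perfect fit, 0.0 for predicting the mean,
    and goes negative (e.g. the -1.0 seen on Nguyen-5) for worse fits."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```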

2.4 Experimental Design

Phase 1: Supervised Training

  • Train all 3 models in parallel
  • Monitor loss curves, early stopping
  • Save checkpoints

Phase 2: Basic Evaluation

  • Generate 500 expressions per model
  • Compute quality and complexity metrics
  • Compare models

Phase 3: Nguyen Suite Evaluation

  • 3 models × 12 benchmarks × 4 algorithms = 144 experiments
  • 20 epochs, 100 samples per epoch (RL algorithms)
  • 200 samples (supervised)

Phase 4: Analysis

  • Aggregate results
  • Statistical significance testing
  • Visualization (heatmaps, bar charts)
  • Cost-benefit analysis

3. Results

To be filled after experiments complete

3.1 Training Results

Table 1: Training Metrics

| Model | Final Train Loss | Best Val Loss | Early Stopped | Training Time | Cost |
|--------|-----|-----|-----|-----|-----|
| Base | TBD | TBD | TBD | TBD | TBD |
| Medium | TBD | TBD | TBD | TBD | TBD |
| Large | TBD | TBD | TBD | TBD | TBD |

Expected: Lower loss for larger models.

Actual: TBD

3.2 Quality Metrics

Table 2: Supervised Generation Quality

| Metric | Base | Medium | Large | H1 Confirmed? |
|--------|-----|-----|-----|-----|
| Valid Expression Rate (%) | TBD | TBD | TBD | TBD |
| Parseable Rate (%) | TBD | TBD | TBD | - |
| Constraint Adherence (%) | TBD | TBD | TBD | - |
| Diversity Rate (%) | TBD | TBD | TBD | TBD |
| Unique Expressions | TBD | TBD | TBD | - |

3.3 Complexity Metrics

Table 3: Expression Complexity

| Metric | Base | Medium | Large | Improvement (B→L) | H2 Confirmed? |
|--------|-----|-----|-----|-----|-----|
| Power Operations (%) | TBD | TBD | TBD | TBD | TBD |
| Nested Trig (%) | TBD | TBD | TBD | TBD | TBD |
| Average Depth | TBD | TBD | TBD | TBD | TBD |
| Max Depth | TBD | TBD | TBD | TBD | - |

Expected (H2):

  • Power ops: Base 15.9% → Large 50-65%
  • Depth: Base 1.40 → Large 2.0-2.5
  • Nested trig: Base 0% → Large 10-20%

3.4 Nguyen Benchmark Performance

Table 4: Average R² Across All 12 Benchmarks

| Algorithm | Base | Medium | Large | Best Model | H3 Confirmed? |
|-----------|-----|-----|-----|-----|-----|
| Supervised | TBD | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD | TBD |

Table 5: Nguyen-5 Specific (Complex Benchmark)

| Algorithm | Base | Medium | Large | Improvement |
|-----------|-----|-----|-----|-----|
| Supervised | TBD | TBD | TBD | TBD |
| REINFORCE | TBD | TBD | TBD | TBD |
| GRPO | TBD | TBD | TBD | TBD |
| PPO | TBD | TBD | TBD | TBD |

Baseline (from previous work): Base model, supervised, on Nguyen-5: R² = -1.0

Expected: Significant improvement with larger models


4. Visualizations

To be generated after evaluation completes

Figure 1: Model Comparison Overview

  • 4 subplots: Valid Rate, R², Power Ops, Depth
  • Bar charts comparing Base, Medium, Large

Figure 2: Algorithm Performance Heatmaps

  • One heatmap per algorithm
  • Rows: Nguyen benchmarks (1-12)
  • Columns: Model sizes
  • Color: R² scores

Figure 3: Complexity Progression

  • Line chart showing how complexity metrics scale with model size

Figure 4: Cost-Benefit Analysis

  • Scatter plot: Cost (x-axis) vs Performance (y-axis)
  • Shows diminishing returns

5. Statistical Analysis

To be completed after results

5.1 Hypothesis Tests

H1 (Validity):

  • Test: Chi-square test for valid rate differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD

H2 (Complexity):

  • Test: Mann-Whitney U test for depth differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD

H3 (Performance):

  • Test: Kruskal-Wallis test for R² differences
  • Significance level: α = 0.05
  • Result: TBD
  • Conclusion: TBD
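With scipy, the three tests above can be run in a few lines. The counts and samples below are illustrative placeholders, not experimental results:

```python
from scipy import stats

# H1: valid vs. invalid counts per model (hypothetical numbers)
counts = [[400, 100],   # Base:   80% valid out of 500
          [425, 75],    # Medium: 85%
          [450, 50]]    # Large:  90%
chi2, p_valid, dof, _ = stats.chi2_contingency(counts)

# H2: expression-depth samples for two models (hypothetical)
depth_base = [1, 1, 2, 1, 2, 1]
depth_large = [2, 3, 2, 3, 2, 2]
u_stat, p_depth = stats.mannwhitneyu(depth_base, depth_large)

# H3: R² samples across the three models (hypothetical)
h_stat, p_r2 = stats.kruskal([-1.0, -0.9], [-0.5, -0.4], [0.1, 0.2])
```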

5.2 Effect Sizes

  • Cohen's d for continuous metrics (depth, R²)
  • Cramér's V for categorical metrics (valid rate)
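Cohen's d with a pooled standard deviation is a few lines (a sketch):

```python
import math

def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mb - ma) / pooled
```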

Results: TBD


6. Discussion

To be written after results

6.1 Key Findings

  1. Finding 1: TBD
  2. Finding 2: TBD
  3. Finding 3: TBD

6.2 Interpretation

RQ1 (Validity): TBD

RQ2 (Complexity): TBD

RQ3 (Performance): TBD

RQ4 (Diversity): TBD

RQ5 (Optimal Size): TBD

6.3 Comparison with Hypotheses

| Hypothesis | Expected | Actual | Confirmed? |
|------------|----------|--------|------------|
| H1 (Validity increases) | 80% → 90% | TBD | TBD |
| H2 (Complexity increases) | 1.4 → 2.5 depth | TBD | TBD |
| H3 (R² improves) | -1.0 → >0.0 | TBD | TBD |
| H4 (Diversity increases) | Higher unique rate | TBD | TBD |
| H5 (RL benefits) | Better convergence | TBD | TBD |

6.4 Unexpected Results

Document any surprising findings

  1. TBD
  2. TBD

6.5 Limitations

  1. LoRA fixed parameters: Using the same LoRA rank (r=8) for all model sizes may not be optimal

    • Larger models might benefit from higher ranks
    • Future: Scale LoRA rank with model size
  2. Single dataset: Only tested on sintetico_natural 700K

    • Results may not generalize to other expression distributions
    • Future: Test on multiple datasets
  3. Nguyen benchmarks only: Limited to 12 standard benchmarks

    • May not represent all real-world symbolic regression tasks
    • Future: Test on Feynman equations, real scientific datasets
  4. Batch size variation: Different batch sizes across models (8→4→2)

    • With gradient accumulation fixed at 4 steps, effective batch sizes also differ (32 / 16 / 8), and gradient noise differs accordingly
    • May affect convergence dynamics
  5. Early stopping: May have prevented full convergence

    • Trade-off between cost and potential performance
    • Future: Test with longer training
  6. JSON format dependency: Results specific to JSON-structured prompts

    • May not generalize to other formats
    • Future: Test with multiple prompt formats

6.6 Implications

For Research:

  • TBD

For Practitioners:

  • TBD

For Model Selection:

  • When to use Base: TBD
  • When to use Medium: TBD
  • When to use Large: TBD

7. Conclusions

To be written after results

7.1 Summary

This experiment investigated the impact of model size (124M → 355M → 774M) on symbolic regression expression generation across three dimensions: validity, complexity, and benchmark performance.

Main Result: TBD

7.2 Recommendations

  1. Recommended model size: TBD (based on cost-benefit)
  2. Best algorithm by model: TBD
  3. Optimal hyperparameters: TBD

7.3 Future Work

  1. LoRA scaling study: Vary LoRA rank with model size

    • Test: Base (r=8), Medium (r=16), Large (r=32)
    • Hypothesis: Larger models need higher ranks for full capacity
  2. Dataset scaling: Train on larger datasets (1M, 5M expressions)

    • Test if larger models benefit more from more data
  3. Architecture variants: Test other model families

    • GPT-Neo, GPT-J, LLaMA
    • Encoder-decoder models (T5, BART)
  4. Multi-task learning: Train on multiple benchmarks simultaneously

    • May improve generalization
  5. Interpretability study: Analyze attention patterns

    • Understand what larger models learn differently
  6. Real-world deployment: Test on actual scientific datasets

    • Feynman equations
    • Materials science expressions
    • Biological models

8. Reproducibility

8.1 Code and Data

Repository: https://github.com/augustocsc/seriguela

Branch: experiment/ppo-symbolic-regression

Commit: TBD (run git rev-parse HEAD)

Models: TBD (HuggingFace links)

Dataset: augustocsc/sintetico_natural (700K subset)

8.2 Reproduction Steps

# 1. Clone repository
git clone https://github.com/augustocsc/seriguela.git
cd seriguela
git checkout experiment/ppo-symbolic-regression

# 2. Install dependencies
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# 3. Train models (requires AWS)
bash launch_all_models.sh

# 4. Download trained models
# (see TRAINING_LOG for specific instance IPs)

# 5. Evaluate
bash scripts/run_nguyen_suite.sh

# 6. Aggregate results
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results

8.3 Hardware Requirements

Training:

  • 3× AWS instances (2× g5.xlarge, 1× g5.2xlarge)
  • GPU: one 24 GB NVIDIA A10G per instance (72 GB total VRAM)
  • Training time: ~10 hours total (parallel)
  • Cost: ~$10-13 USD

Evaluation (Nguyen suite):

  • 1× GPU with 24GB+ VRAM
  • Time: ~12-16 hours for full suite (144 experiments)
  • Can run on CPU (slower: ~48-72 hours)

8.4 Software Versions

See requirements.txt for exact versions.

Key dependencies:

  • Python 3.10+
  • PyTorch 2.5.1 (CUDA 12.1)
  • Transformers 4.51.3
  • PEFT 0.15.1
  • Wandb ≥0.24.1

9. Acknowledgments

To be filled

  • Dataset: Augusto et al. (sintetico_natural)
  • Benchmarks: Nguyen et al.
  • Infrastructure: AWS
  • Tracking: Weights & Biases

10. References

To be filled

  1. Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
  2. Nguyen et al. Symbolic Regression Benchmarks
  3. Schulman et al. (2017). Proximal Policy Optimization
  4. Shao et al. (2024). DeepSeekMath (introduces the GRPO algorithm)
  5. Previous work: EXPERIMENT_RESULTS.md

Document Version: 1.0

Last Updated: 2025-02-02

Status: ⏳ In Progress (Results pending)

Contact: [Your contact information]