# Model Scaling Experiment - Executive Summary

Status: TRAINING IN PROGRESS
Started: 2026-02-02 23:41:37
## Current State

### Training Running (AWS)
| Model | Size | Instance | IP | Expected Done |
|---|---|---|---|---|
| Base | 124M | i-0855711efcac25a9c | 18.206.190.126 | ~01:42 |
| Medium | 355M | i-0eea77c3bbf1ea976 | 13.220.236.233 | ~02:43 |
| Large | 774M | i-04dc6f51534d8185d | 52.55.119.255 | ~03:43 |
Git Commit: e3e2787f1444f3690cd5d3c3300e0bb445c77216
Estimated Total Cost: $10-13 USD
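The cost estimate can be sanity-checked from the schedule above: the 23:41 start and the "Expected Done" column imply runs of roughly 2h (Base), 3h (Medium), and 4h (Large). The hourly rate below is an assumption (on-demand g5.2xlarge is around $1.21/h in us-east-1; verify against current AWS pricing), not a figure from this project:

```python
# Rough cost sanity check. HOURLY_RATE is an assumed on-demand price
# for g5.2xlarge (~$1.21/h in us-east-1); verify against current AWS pricing.
HOURLY_RATE = 1.21

# Approximate run lengths from the "Expected Done" column:
# 23:41 -> ~01:42 / ~02:43 / ~03:43, i.e. about 2h / 3h / 4h.
run_hours = {"base": 2, "medium": 3, "large": 4}

total = sum(run_hours.values()) * HOURLY_RATE
print(f"~${total:.2f} for {sum(run_hours.values())} instance-hours")  # ~$10.89
```

At the assumed rate this lands near $11, consistent with the $10-13 estimate (the range presumably covers setup time and rate variation).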
## Experiment Overview

### Research Question
Do larger GPT-2 models (355M, 774M) generate more complex mathematical expressions than smaller ones (124M) for symbolic regression?
### Method
- Train 3 models with identical hyperparameters (only varying model size)
- Evaluate on quality, complexity, and performance metrics
- Test on Nguyen benchmarks with multiple RL algorithms
- Compare systematically across 144 experiments
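The 144-experiment grid is not broken down in this summary; one decomposition consistent with the count is 3 model sizes × 12 Nguyen benchmarks × 4 RL algorithms. The algorithm names below are placeholders, not the project's actual configuration (the real list lives in scripts/run_nguyen_suite.sh):

```python
from itertools import product

# Hypothetical decomposition of the 144-run grid; factor names are
# illustrative placeholders, not the project's actual configuration.
models = ["base", "medium", "large"]                   # 124M / 355M / 774M
benchmarks = [f"nguyen-{i}" for i in range(1, 13)]     # Nguyen-1 .. Nguyen-12
algorithms = ["algo_a", "algo_b", "algo_c", "algo_d"]  # placeholder RL algorithms

experiments = list(product(models, benchmarks, algorithms))
print(len(experiments))  # 144
```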
### Hypotheses

- H1: Larger models → higher valid expression rate
- H2: Larger models → more complex expressions (depth, power ops, nesting)
- H3: Larger models → better R² scores on benchmarks
- H4: Larger models → more diverse expressions
- H5: RL algorithms work better with larger models
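H2 names depth, power operations, and nesting as its complexity measures. A minimal sketch of how such metrics could be computed on a generated expression string, assuming Python-style `**` power syntax (this is an illustrative helper, not the project's metric code):

```python
def expression_complexity(expr: str) -> dict:
    """Toy complexity metrics for an infix expression string:
    maximum parenthesis nesting depth and count of power operators."""
    depth = max_depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return {"nesting_depth": max_depth, "power_ops": expr.count("**")}

print(expression_complexity("sin(x**2 + cos(x)) + x**3"))
# {'nesting_depth': 2, 'power_ops': 2}
```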
## Files Created (14 total)
### Training Scripts

- ✅ launch_all_models.sh - Parallel launch orchestrator
- ✅ scripts/aws/launch_base_training.sh - Base (124M) launcher
- ✅ scripts/aws/launch_medium_training.sh - Medium (355M) launcher (fixed)
- ✅ scripts/aws/launch_large_training.sh - Large (774M) launcher (fixed)

### Evaluation Scripts

- ✅ scripts/run_nguyen_suite.sh - 144-experiment automation
- ✅ scripts/aggregate_nguyen_results.py - Results analysis & visualization

### Documentation

- ✅ TRAINING_LOG_MODEL_SCALING_2025.md - Detailed training log
- ✅ EXPERIMENT_MODEL_SCALING.md - Scientific report (awaiting results)
- ✅ TRAINING_STATUS_2026-02-02.md - Current status & monitoring
- ✅ NEXT_STEPS_AFTER_TRAINING.md - Post-training workflow
- ✅ README_EXPERIMENT.md - This file

### Model Cards

- ✅ model_cards/gpt2_base_700K_json_card.md - Base model card
- ✅ model_cards/gpt2_medium_700K_json_card.md - Medium model card
- ✅ model_cards/gpt2_large_700K_json_card.md - Large model card

### Updated

- ✅ CLAUDE.md - Added "Model Scaling Study" section
## Next Steps

### Now → ~04:30 (Training Phase)

- Wait - models are training automatically
- Monitor - check Wandb occasionally
- Rest - no action needed
### When Training Completes

- STOP INSTANCES (critical!)
- Download models (3 models via SCP)
- Update logs (times, costs, losses)
- Quick validation (test that the models work)
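The "quick validation" step could be as simple as sampling a few expressions from each downloaded model and checking that they parse. A hedged sketch of the checking half, using Python's own parser on some made-up sample strings (the sampling step and the project's actual expression format are omitted here):

```python
import ast

def is_valid_expression(expr: str) -> bool:
    """Check that a generated string parses as a Python-style math
    expression (balanced parentheses, well-formed operators)."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

# Made-up samples standing in for model output.
samples = ["x**2 + sin(x)", "x ** + )", "log(x) * 3.5"]
valid_rate = sum(map(is_valid_expression, samples)) / len(samples)
print(f"valid expression rate: {valid_rate:.2f}")  # 0.67
```

The same rate, computed over real samples per model, is exactly the H1 metric.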
### Evaluation Phase (12-16h)

- Run Nguyen suite (144 experiments)
- Aggregate results (visualizations, stats)
- Fill documentation (tables, figures, analysis)
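The aggregation step (scripts/aggregate_nguyen_results.py) presumably reduces per-run metrics to per-model summaries; a stdlib-only sketch of that reduction over hypothetical result records, not the project's actual schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records; real ones would come from the 144 Nguyen runs.
results = [
    {"model": "base",   "r2": 0.81},
    {"model": "base",   "r2": 0.75},
    {"model": "medium", "r2": 0.88},
    {"model": "large",  "r2": 0.90},
]

# Group R² scores by model size, then average each group.
by_model = defaultdict(list)
for r in results:
    by_model[r["model"]].append(r["r2"])

summary = {m: round(mean(v), 3) for m, v in by_model.items()}
print(summary)  # {'base': 0.78, 'medium': 0.88, 'large': 0.9}
```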
### Decision Point

- Analyze results - hypotheses confirmed?
- Decide - publish or iterate?
### Publication (If Ready)

- Upload to HuggingFace (3 models)
- Git commit (final results)
- Create presentation (optional)
## Quick Reference

### Monitor Training

Wandb (easiest): https://wandb.ai/YOUR_USERNAME/seriguela

SSH to the instances:

```bash
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@18.206.190.126   # Base
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@13.220.236.233   # Medium
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@52.55.119.255    # Large
```
### Stop Instances (When Done)

```bash
aws ec2 stop-instances --instance-ids i-0855711efcac25a9c i-0eea77c3bbf1ea976 i-04dc6f51534d8185d
```
### Download Models (When Done)

```bash
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@18.206.190.126:~/seriguela/output/gpt2_base_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@13.220.236.233:~/seriguela/output/gpt2_medium_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@52.55.119.255:~/seriguela/output/gpt2_large_700K_json ./output/
```
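After the SCP step, it is worth confirming each model directory actually contains weights before stopping or terminating anything. A small sketch; the expected filenames are assumptions based on a standard Hugging Face checkpoint layout and may differ for LoRA checkpoints:

```python
from pathlib import Path

# Filenames assumed from a typical Hugging Face checkpoint layout;
# LoRA/PEFT checkpoints may use adapter_config.json / adapter_model.* instead.
REQUIRED = ["config.json"]
WEIGHT_CANDIDATES = ["model.safetensors", "pytorch_model.bin", "adapter_model.safetensors"]

def checkpoint_looks_complete(model_dir: str) -> bool:
    """Return True if the directory has a config plus at least one weight file."""
    d = Path(model_dir)
    has_config = all((d / f).is_file() for f in REQUIRED)
    has_weights = any((d / f).is_file() for f in WEIGHT_CANDIDATES)
    return has_config and has_weights

for name in ["gpt2_base_700K_json", "gpt2_medium_700K_json", "gpt2_large_700K_json"]:
    print(name, checkpoint_looks_complete(f"./output/{name}"))
```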
## Expected Contributions

### If Successful
- Quantify model size impact on symbolic regression quality
- Establish scaling laws for expression generation
- Provide model selection guide for practitioners
- Demonstrate LoRA effectiveness at different scales
- Validate/invalidate RL approaches for this domain
### If Unsuccessful (Null Results)
- Document LoRA limitations for symbolic regression
- Identify dataset size requirements for scaling
- Highlight need for alternative architectures
- Guide future research away from unsuccessful approaches
Both outcomes are scientifically valuable!
## Documentation Hierarchy

```
README_EXPERIMENT.md (this file)        ← Executive summary
├── TRAINING_STATUS_2026-02-02.md       ← Real-time status
├── NEXT_STEPS_AFTER_TRAINING.md        ← Post-training workflow
├── TRAINING_LOG_MODEL_SCALING_2025.md  ← Detailed training log
└── EXPERIMENT_MODEL_SCALING.md         ← Scientific report

Supporting:
├── CLAUDE.md                           ← Project guide
├── model_cards/*.md                    ← Model documentation
└── nguyen_suite_results/               ← Evaluation results (future)
```
## ✅ What's Working

- All 3 instances launched successfully
- Scripts have the deadlock fix applied
- Credentials configured correctly
- Monitoring infrastructure in place
- Comprehensive documentation
- Evaluation pipeline ready
- Git commit recorded
## ⚠️ What to Watch

- Early stopping may trigger (patience=3)
- Large model may OOM (unlikely with g5.2xlarge)
- Instances must be stopped manually (no auto-shutdown)
- Evaluation suite takes 12-16 hours (plan accordingly)
## Pro Tips
- Set alarm for ~04:30 to check if Large model completed
- Check Wandb first - easiest way to monitor progress
- Don't terminate instances until models downloaded
- Test models locally before running full 144-experiment suite
- Document unexpected findings - they're often most valuable
Next Check: ~01:40 (2 hours from now)
Current Phase: ⏳ Waiting for training completion
No immediate action required ✅
For detailed instructions, see NEXT_STEPS_AFTER_TRAINING.md
For real-time status, see TRAINING_STATUS_2026-02-02.md
For scientific context, see EXPERIMENT_MODEL_SCALING.md