GPT-2 Medium trained on prefix dataset (682K)

# Model Scaling Experiment - Executive Summary

**Status:** 🟢 TRAINING IN PROGRESS

**Started:** 2026-02-02 23:41:37


## 📊 Current State

### Training Running (AWS)

| Model | Size | Instance | IP | Expected Done |
|-------|------|----------|----|---------------|
| Base | 124M | i-0855711efcac25a9c | 18.206.190.126 | ~01:42 |
| Medium | 355M | i-0eea77c3bbf1ea976 | 13.220.236.233 | ~02:43 |
| Large | 774M | i-04dc6f51534d8185d | 52.55.119.255 | ~03:43 |

**Git Commit:** e3e2787f1444f3690cd5d3c3300e0bb445c77216

**Estimated Total Cost:** $10-13 USD
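The cost figure is a back-of-envelope estimate; a quick sanity check is below. Both numbers in it are assumptions, not measured values: the g5.2xlarge on-demand rate varies by region, and the per-model runtimes are guesses read off the expected completion times above.

```python
# Rough cost check for the $10-13 estimate. Both inputs are assumptions:
# the on-demand rate varies by region, and runtimes are guesses based on
# the expected completion times in the table above.
HOURLY_RATE_USD = 1.21  # assumed g5.2xlarge on-demand rate
runtime_hours = {"base": 2.0, "medium": 3.0, "large": 4.0}  # rough estimates

total = sum(runtime_hours.values()) * HOURLY_RATE_USD
print(f"~${total:.2f}")  # lands inside the $10-13 range
```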


## 🎯 Experiment Overview

### Research Question

Do larger GPT-2 models (355M, 774M) generate more complex mathematical expressions than smaller ones (124M) for symbolic regression?

### Method

  1. Train 3 models with identical hyperparameters, varying only model size
  2. Evaluate on quality, complexity, and performance metrics
  3. Test on the Nguyen benchmarks with multiple RL algorithms
  4. Compare systematically across 144 experiments
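The 144-run grid presumably comes from crossing the three factors above; a sketch of one way it could be enumerated, assuming 3 model sizes × 12 Nguyen benchmarks × 4 RL algorithms (the benchmark and algorithm names below are illustrative placeholders, not the project's actual identifiers):

```python
# Hypothetical enumeration of the 144-experiment grid:
# 3 models x 12 Nguyen benchmarks x 4 RL algorithms = 144 runs.
from itertools import product

models = ["base_124M", "medium_355M", "large_774M"]
benchmarks = [f"Nguyen-{i}" for i in range(1, 13)]  # Nguyen-1 .. Nguyen-12
algorithms = ["pg", "ppo", "a2c", "reinforce"]      # assumed names

grid = list(product(models, benchmarks, algorithms))
print(len(grid))  # 144
```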

### Hypotheses

- **H1:** Larger models → higher valid expression rate
- **H2:** Larger models → more complex expressions (depth, power ops, nesting)
- **H3:** Larger models → better R² scores on benchmarks
- **H4:** Larger models → more diverse expressions
- **H5:** RL algorithms work better with larger models
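H1 and H2 can be scored directly on sampled generations; a minimal sketch, assuming the models emit prefix-notation token sequences (the operator vocabulary and arities below are assumptions for illustration, not the project's actual token set):

```python
# Illustrative scorer for H1 (validity) and H2 (depth) on prefix-notation
# samples. The operator vocabulary and arities are assumed, not the
# project's actual token set.
ARITY = {"add": 2, "sub": 2, "mul": 2, "div": 2, "pow": 2,
         "sin": 1, "cos": 1, "log": 1, "exp": 1}

def parse_depth(tokens):
    """Return the expression tree depth, or None if the sequence is invalid."""
    def helper(i):
        if i >= len(tokens):
            return None, i
        tok = tokens[i]
        if tok not in ARITY:              # terminal: variable or constant
            return 1, i + 1
        depth, j = 0, i + 1
        for _ in range(ARITY[tok]):
            d, j = helper(j)
            if d is None:
                return None, j
            depth = max(depth, d)
        return depth + 1, j
    d, end = helper(0)
    return d if end == len(tokens) else None   # reject trailing tokens

def valid_rate(samples):
    """H1 metric: fraction of samples that parse as complete expressions."""
    return sum(parse_depth(s.split()) is not None for s in samples) / len(samples)
```

The same parse also yields the depth and operator counts needed for H2, so one pass over the samples covers both hypotheses.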

πŸ“ Files Created (13 total)

Training Scripts

- ✅ `launch_all_models.sh` - Parallel launch orchestrator
- ✅ `scripts/aws/launch_base_training.sh` - Base (124M) launcher
- ✅ `scripts/aws/launch_medium_training.sh` - Medium (355M) launcher (fixed)
- ✅ `scripts/aws/launch_large_training.sh` - Large (774M) launcher (fixed)

### Evaluation Scripts

- ✅ `scripts/run_nguyen_suite.sh` - 144 experiments automation
- ✅ `scripts/aggregate_nguyen_results.py` - Results analysis & visualization

### Documentation

- ✅ `TRAINING_LOG_MODEL_SCALING_2025.md` - Detailed training log
- ✅ `EXPERIMENT_MODEL_SCALING.md` - Scientific report (awaiting results)
- ✅ `TRAINING_STATUS_2026-02-02.md` - Current status & monitoring
- ✅ `NEXT_STEPS_AFTER_TRAINING.md` - Post-training workflow
- ✅ `README_EXPERIMENT.md` - This file

### Model Cards

- ✅ `model_cards/gpt2_base_700K_json_card.md` - Base model card
- ✅ `model_cards/gpt2_medium_700K_json_card.md` - Medium model card
- ✅ `model_cards/gpt2_large_700K_json_card.md` - Large model card

### Updated

- ✅ `CLAUDE.md` - Added "Model Scaling Study" section


## ⏭️ Next Steps

### Now → ~04:30 (Training Phase)

- ⏳ **Wait** - Models training automatically
- 👀 **Monitor** - Check Wandb occasionally
- 💀 **Rest** - No action needed

### When Training Completes

  1. 🚨 **STOP INSTANCES** (critical!)
  2. 💾 **Download models** (3 models via SCP)
  3. 📝 **Update logs** (times, costs, losses)
  4. ✅ **Quick validation** (test that the models work)
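The quick-validation step can start with a structural check before any GPU work; a minimal sketch, assuming the checkpoints use the usual Hugging Face directory layout (the exact file names depend on how the trainer saved them, so treat the lists below as placeholders):

```python
# Structural sanity check for a downloaded checkpoint directory, run before
# any generation. File names assume a standard Hugging Face save; adjust if
# the trainer wrote a different layout (e.g. LoRA adapter files only).
from pathlib import Path

REQUIRED = ["config.json"]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin",
                "adapter_model.safetensors"]  # any one of these suffices

def checkpoint_ok(model_dir):
    d = Path(model_dir)
    if not all((d / f).is_file() for f in REQUIRED):
        return False
    return any((d / f).is_file() for f in WEIGHT_FILES)
```

A follow-up smoke test would then load each model and sample a handful of expressions before committing to the 144-run suite.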

### Evaluation Phase (12-16h)

  1. 🧪 **Run Nguyen suite** (144 experiments)
  2. 📊 **Aggregate results** (visualizations, stats)
  3. 📄 **Fill documentation** (tables, figures, analysis)
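The aggregation step in `scripts/aggregate_nguyen_results.py` presumably reduces per-run records to per-model summaries; a stdlib-only sketch of that reduction (the record fields and values below are invented for illustration):

```python
# Group per-experiment results by model size and average R^2.
# The record structure and values are assumptions for illustration.
from collections import defaultdict
from statistics import mean

runs = [
    {"model": "base_124M",   "benchmark": "Nguyen-1", "r2": 0.91},
    {"model": "base_124M",   "benchmark": "Nguyen-2", "r2": 0.85},
    {"model": "medium_355M", "benchmark": "Nguyen-1", "r2": 0.95},
]

by_model = defaultdict(list)
for run in runs:
    by_model[run["model"]].append(run["r2"])

summary = {m: round(mean(v), 3) for m, v in by_model.items()}
print(summary)
```

The same grouping extends naturally to the other metrics (validity, depth, diversity) needed for the H1-H5 comparisons.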

### Decision Point

  1. ✅ **Analyze results** - Hypotheses confirmed?
  2. 🤔 **Decide** - Publish or iterate?

### Publication (If Ready)

  1. 📤 **Upload to HuggingFace** (3 models)
  2. 🎓 **Git commit** (final results)
  3. 📊 **Create presentation** (optional)

## 📞 Quick Reference

### Monitor Training

```bash
# Wandb (easiest)
# https://wandb.ai/YOUR_USERNAME/seriguela

# SSH to instances
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@18.206.190.126  # Base
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@13.220.236.233  # Medium
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@52.55.119.255   # Large
```

### Stop Instances (When Done)

```bash
aws ec2 stop-instances --instance-ids i-0855711efcac25a9c i-0eea77c3bbf1ea976 i-04dc6f51534d8185d
```

### Download Models (When Done)

```bash
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@18.206.190.126:~/seriguela/output/gpt2_base_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@13.220.236.233:~/seriguela/output/gpt2_medium_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@52.55.119.255:~/seriguela/output/gpt2_large_700K_json ./output/
```

## 🎓 Expected Contributions

### If Successful

  1. Quantify model size impact on symbolic regression quality
  2. Establish scaling laws for expression generation
  3. Provide model selection guide for practitioners
  4. Demonstrate LoRA effectiveness at different scales
  5. Validate/invalidate RL approaches for this domain

### If Unsuccessful (Null Results)

  1. Document LoRA limitations for symbolic regression
  2. Identify dataset size requirements for scaling
  3. Highlight need for alternative architectures
  4. Guide future research away from unsuccessful approaches

Both outcomes are scientifically valuable!


## 📚 Documentation Hierarchy

```
README_EXPERIMENT.md (this file)          ← Executive summary
├── TRAINING_STATUS_2026-02-02.md         ← Real-time status
├── NEXT_STEPS_AFTER_TRAINING.md          ← Post-training workflow
├── TRAINING_LOG_MODEL_SCALING_2025.md    ← Detailed training log
└── EXPERIMENT_MODEL_SCALING.md           ← Scientific report

Supporting:
├── CLAUDE.md                             ← Project guide
├── model_cards/*.md                      ← Model documentation
└── nguyen_suite_results/                 ← Evaluation results (future)
```

## ✅ What's Working

- ✅ All 3 instances launched successfully
- ✅ Scripts have the deadlock fix applied
- ✅ Credentials configured correctly
- ✅ Monitoring infrastructure in place
- ✅ Comprehensive documentation
- ✅ Evaluation pipeline ready
- ✅ Git commit recorded

## ⚠️ What to Watch

- ⚠️ Early stopping may trigger (patience=3)
- ⚠️ Large model may OOM (unlikely on g5.2xlarge)
- ⚠️ Instances must be stopped manually (no auto-shutdown)
- ⚠️ Evaluation suite takes 12-16 hours (plan accordingly)
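For reference, the patience-based early stopping flagged above follows the standard pattern: training halts once the best validation loss is `patience` evaluations old. A minimal sketch with patience=3 as in the config (the evaluation cadence and metric are whatever the actual trainer uses):

```python
def should_stop(val_losses, patience=3):
    """True once the best validation loss is `patience` or more evals old."""
    if not val_losses:
        return False
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_idx >= patience

print(should_stop([3.1, 2.8, 2.9, 2.9, 3.0]))  # True: 3 evals without improvement
print(should_stop([3.1, 2.8, 2.7]))            # False: still improving
```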

## 💡 Pro Tips

  1. Set an alarm for ~04:30 to check whether the Large model completed
  2. Check Wandb first - it's the easiest way to monitor progress
  3. Don't terminate instances until the models are downloaded
  4. Test models locally before running the full 144-experiment suite
  5. Document unexpected findings - they're often the most valuable

**Next Check:** ~01:40 (2 hours from now)

**Current Phase:** ⏳ Waiting for training completion

**No immediate action required** ✅


For detailed instructions, see NEXT_STEPS_AFTER_TRAINING.md

For real-time status, see TRAINING_STATUS_2026-02-02.md

For scientific context, see EXPERIMENT_MODEL_SCALING.md