# Model Scaling Experiment - Executive Summary

Status: TRAINING IN PROGRESS
Started: 2026-02-02 23:41:37
## Current State

### Training Running (AWS)
| Model | Size | Instance | IP | Expected Done |
|---|---|---|---|---|
| Base | 124M | i-0855711efcac25a9c | 18.206.190.126 | ~01:42 |
| Medium | 355M | i-0eea77c3bbf1ea976 | 13.220.236.233 | ~02:43 |
| Large | 774M | i-04dc6f51534d8185d | 52.55.119.255 | ~03:43 |
Git Commit: e3e2787f1444f3690cd5d3c3300e0bb445c77216
Estimated Total Cost: $10-13 USD
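The cost estimate can be sanity-checked from the schedule above: the 23:41 start and the "Expected Done" column imply runs of roughly 2h (Base), 3h (Medium), and 4h (Large). The hourly rate below is an assumption (on-demand g5.2xlarge is around $1.21/h in us-east-1; verify against current AWS pricing), not a figure from this project:

```python
# Rough cost sanity check. HOURLY_RATE is an assumed on-demand price
# for g5.2xlarge (~$1.21/h in us-east-1); verify against current AWS pricing.
HOURLY_RATE = 1.21

# Approximate run lengths from the "Expected Done" column:
# 23:41 -> ~01:42 / ~02:43 / ~03:43, i.e. about 2h / 3h / 4h.
run_hours = {"base": 2, "medium": 3, "large": 4}

total = sum(run_hours.values()) * HOURLY_RATE
print(f"~${total:.2f} for {sum(run_hours.values())} instance-hours")  # ~$10.89
```

At the assumed rate this lands near $11, consistent with the $10-13 estimate (the range presumably covers setup time and rate variation).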
## Experiment Overview

### Research Question
Do larger GPT-2 models (355M, 774M) generate more complex mathematical expressions than smaller ones (124M) for symbolic regression?
### Method
- Train 3 models with identical hyperparameters (only varying model size)
- Evaluate on quality, complexity, and performance metrics
- Test on Nguyen benchmarks with multiple RL algorithms
- Compare systematically across 144 experiments
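The 144-experiment grid is not broken down in this summary; one decomposition consistent with the count is 3 model sizes × 12 Nguyen benchmarks × 4 RL algorithms. The algorithm names below are placeholders, not the project's actual configuration (the real list lives in scripts/run_nguyen_suite.sh):

```python
from itertools import product

# Hypothetical decomposition of the 144-run grid; factor names are
# illustrative placeholders, not the project's actual configuration.
models = ["base", "medium", "large"]                   # 124M / 355M / 774M
benchmarks = [f"nguyen-{i}" for i in range(1, 13)]     # Nguyen-1 .. Nguyen-12
algorithms = ["algo_a", "algo_b", "algo_c", "algo_d"]  # placeholder RL algorithms

experiments = list(product(models, benchmarks, algorithms))
print(len(experiments))  # 144
```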
### Hypotheses

- H1: Larger models → higher valid expression rate
- H2: Larger models → more complex expressions (depth, power ops, nesting)
- H3: Larger models → better R² scores on benchmarks
- H4: Larger models → more diverse expressions
- H5: RL algorithms work better with larger models
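H2 names depth, power operations, and nesting as its complexity measures. A minimal sketch of how such metrics could be computed on a generated expression string, assuming Python-style `**` power syntax (this is an illustrative helper, not the project's metric code):

```python
def expression_complexity(expr: str) -> dict:
    """Toy complexity metrics for an infix expression string:
    maximum parenthesis nesting depth and count of power operators."""
    depth = max_depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return {"nesting_depth": max_depth, "power_ops": expr.count("**")}

print(expression_complexity("sin(x**2 + cos(x)) + x**3"))
# {'nesting_depth': 2, 'power_ops': 2}
```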
## Files Created (14 total)
### Training Scripts

- ✅ launch_all_models.sh - Parallel launch orchestrator
- ✅ scripts/aws/launch_base_training.sh - Base (124M) launcher
- ✅ scripts/aws/launch_medium_training.sh - Medium (355M) launcher (fixed)
- ✅ scripts/aws/launch_large_training.sh - Large (774M) launcher (fixed)

### Evaluation Scripts

- ✅ scripts/run_nguyen_suite.sh - 144-experiment automation
- ✅ scripts/aggregate_nguyen_results.py - Results analysis & visualization

### Documentation

- ✅ TRAINING_LOG_MODEL_SCALING_2025.md - Detailed training log
- ✅ EXPERIMENT_MODEL_SCALING.md - Scientific report (awaiting results)
- ✅ TRAINING_STATUS_2026-02-02.md - Current status & monitoring
- ✅ NEXT_STEPS_AFTER_TRAINING.md - Post-training workflow
- ✅ README_EXPERIMENT.md - This file

### Model Cards

- ✅ model_cards/gpt2_base_700K_json_card.md - Base model card
- ✅ model_cards/gpt2_medium_700K_json_card.md - Medium model card
- ✅ model_cards/gpt2_large_700K_json_card.md - Large model card

### Updated

- ✅ CLAUDE.md - Added "Model Scaling Study" section
## Next Steps

### Now → ~04:30 (Training Phase)

- Wait - models are training automatically
- Monitor - check Wandb occasionally
- Rest - no action needed
### When Training Completes

- STOP INSTANCES (critical!)
- Download models (3 models via SCP)
- Update logs (times, costs, losses)
- Quick validation (test that the models work)
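The "quick validation" step could be as simple as sampling a few expressions from each downloaded model and checking that they parse. A hedged sketch of the checking half, using Python's own parser on some made-up sample strings (the sampling step and the project's actual expression format are omitted here):

```python
import ast

def is_valid_expression(expr: str) -> bool:
    """Check that a generated string parses as a Python-style math
    expression (balanced parentheses, well-formed operators)."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

# Made-up samples standing in for model output.
samples = ["x**2 + sin(x)", "x ** + )", "log(x) * 3.5"]
valid_rate = sum(map(is_valid_expression, samples)) / len(samples)
print(f"valid expression rate: {valid_rate:.2f}")  # 0.67
```

The same rate, computed over real samples per model, is exactly the H1 metric.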
### Evaluation Phase (12-16h)

- Run Nguyen suite (144 experiments)
- Aggregate results (visualizations, stats)
- Fill documentation (tables, figures, analysis)
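The aggregation step (scripts/aggregate_nguyen_results.py) presumably reduces per-run metrics to per-model summaries; a stdlib-only sketch of that reduction over hypothetical result records, not the project's actual schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records; real ones would come from the 144 Nguyen runs.
results = [
    {"model": "base",   "r2": 0.81},
    {"model": "base",   "r2": 0.75},
    {"model": "medium", "r2": 0.88},
    {"model": "large",  "r2": 0.90},
]

# Group R² scores by model size, then average each group.
by_model = defaultdict(list)
for r in results:
    by_model[r["model"]].append(r["r2"])

summary = {m: round(mean(v), 3) for m, v in by_model.items()}
print(summary)  # {'base': 0.78, 'medium': 0.88, 'large': 0.9}
```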
### Decision Point

- Analyze results - hypotheses confirmed?
- Decide - publish or iterate?
### Publication (If Ready)

- Upload to HuggingFace (3 models)
- Git commit (final results)
- Create presentation (optional)
## Quick Reference

### Monitor Training

Wandb (easiest): https://wandb.ai/YOUR_USERNAME/seriguela

SSH to the instances:

```bash
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@18.206.190.126   # Base
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@13.220.236.233   # Medium
ssh -i C:\Users\madeinweb\chave-gpu.pem ubuntu@52.55.119.255    # Large
```
### Stop Instances (When Done)

```bash
aws ec2 stop-instances --instance-ids i-0855711efcac25a9c i-0eea77c3bbf1ea976 i-04dc6f51534d8185d
```
### Download Models (When Done)

```bash
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@18.206.190.126:~/seriguela/output/gpt2_base_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@13.220.236.233:~/seriguela/output/gpt2_medium_700K_json ./output/
scp -i C:\Users\madeinweb\chave-gpu.pem -r ubuntu@52.55.119.255:~/seriguela/output/gpt2_large_700K_json ./output/
```
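After the SCP step, it is worth confirming each model directory actually contains weights before stopping or terminating anything. A small sketch; the expected filenames are assumptions based on a standard Hugging Face checkpoint layout and may differ for LoRA checkpoints:

```python
from pathlib import Path

# Filenames assumed from a typical Hugging Face checkpoint layout;
# LoRA/PEFT checkpoints may use adapter_config.json / adapter_model.* instead.
REQUIRED = ["config.json"]
WEIGHT_CANDIDATES = ["model.safetensors", "pytorch_model.bin", "adapter_model.safetensors"]

def checkpoint_looks_complete(model_dir: str) -> bool:
    """Return True if the directory has a config plus at least one weight file."""
    d = Path(model_dir)
    has_config = all((d / f).is_file() for f in REQUIRED)
    has_weights = any((d / f).is_file() for f in WEIGHT_CANDIDATES)
    return has_config and has_weights

for name in ["gpt2_base_700K_json", "gpt2_medium_700K_json", "gpt2_large_700K_json"]:
    print(name, checkpoint_looks_complete(f"./output/{name}"))
```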
## Expected Contributions

### If Successful
- Quantify model size impact on symbolic regression quality
- Establish scaling laws for expression generation
- Provide model selection guide for practitioners
- Demonstrate LoRA effectiveness at different scales
- Validate/invalidate RL approaches for this domain
### If Unsuccessful (Null Results)
- Document LoRA limitations for symbolic regression
- Identify dataset size requirements for scaling
- Highlight need for alternative architectures
- Guide future research away from unsuccessful approaches
Both outcomes are scientifically valuable!
## Documentation Hierarchy

```
README_EXPERIMENT.md (this file)        ← Executive summary
├── TRAINING_STATUS_2026-02-02.md       ← Real-time status
├── NEXT_STEPS_AFTER_TRAINING.md        ← Post-training workflow
├── TRAINING_LOG_MODEL_SCALING_2025.md  ← Detailed training log
└── EXPERIMENT_MODEL_SCALING.md         ← Scientific report

Supporting:
├── CLAUDE.md                           ← Project guide
├── model_cards/*.md                    ← Model documentation
└── nguyen_suite_results/               ← Evaluation results (future)
```
## ✅ What's Working

- All 3 instances launched successfully
- Scripts have the deadlock fix applied
- Credentials configured correctly
- Monitoring infrastructure in place
- Comprehensive documentation
- Evaluation pipeline ready
- Git commit recorded
## ⚠️ What to Watch

- Early stopping may trigger (patience=3)
- Large model may OOM (unlikely with g5.2xlarge)
- Instances must be stopped manually (no auto-shutdown)
- Evaluation suite takes 12-16 hours (plan accordingly)
## Pro Tips
- Set alarm for ~04:30 to check if Large model completed
- Check Wandb first - easiest way to monitor progress
- Don't terminate instances until models downloaded
- Test models locally before running full 144-experiment suite
- Document unexpected findings - they're often most valuable
Next Check: ~01:40 (2 hours from now)
Current Phase: ⏳ Waiting for training completion
No immediate action required ✅
For detailed instructions, see NEXT_STEPS_AFTER_TRAINING.md
For real-time status, see TRAINING_STATUS_2026-02-02.md
For scientific context, see EXPERIMENT_MODEL_SCALING.md