CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Seriguela is an academic research project focused on training language models for symbolic regression through fine-tuning and reinforcement learning. The project trains GPT-2 models to generate valid mathematical expressions using LoRA (parameter-efficient fine-tuning) and explores RL algorithms (PPO, GRPO, REINFORCE) for optimizing expression quality.
Research Context: This is a graduate-level research project exploring the application of large language models to symbolic regression problems, with focus on:
- Parameter-efficient fine-tuning techniques (LoRA)
- Reinforcement learning for expression optimization
- Model scaling effects on compositional complexity
- Benchmark evaluation (Nguyen benchmarks)
Current Status: The JSON structured format (EXP-A) is the recommended approach, achieving 80% valid expressions vs 0.5% with EOS token approach. See EXPERIMENT_RESULTS.md for details.
Core Commands
Environment Setup
# Create and activate virtual environment
python -m venv .seriguela
source .seriguela/bin/activate # Linux/macOS
.seriguela\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Install PyTorch with CUDA 12.1 (required for GPU)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# Login to experiment tracking (optional)
wandb login
Data Preparation
# Prepare experiment data (JSON + EOS formats)
python scripts/data/prepare_experiment_data.py \
--dataset_repo_id augustocsc/sintetico_natural \
--data_dir 700K \
--data_column i_prompt_n \
--output_base_dir ./data/experiments
# Legacy: Add end-of-expression markers (simple format)
python scripts/data/prepare_training_data_fixed.py \
--dataset_repo_id augustocsc/sintetico_natural \
--data_dir 700K \
--data_column i_prompt_n \
--output_dir ./data/processed/700K_fixed
Training
# RECOMMENDED: Train with JSON format (EXP-A) - 80% valid expressions
python scripts/train_experiment.py \
--experiment_name exp_a_json \
--train_file ./data/experiments/exp_a_json/train.csv \
--output_dir ./output/exp_a_json \
--num_train_epochs 3
# Alternative: Train with EOS format (EXP-B) - not recommended, 0.5% valid
python scripts/train_experiment.py \
--experiment_name exp_b_eos \
--train_file ./data/experiments/exp_b_eos/train.csv \
--output_dir ./output/exp_b_eos \
--use_native_eos
# Legacy: Basic training with original format
python scripts/train.py \
--model_name_or_path gpt2 \
--dataset_repo_id augustocsc/sintetico_natural \
--data_dir 700K \
--data_column i_prompt_n \
--output_dir ./output/Se124M_700K_infix \
--wandb_project seriguela \
--num_train_epochs 3
# Training with config file
python scripts/train.py --config configs/training_medium.json
Generation & Evaluation
# Evaluate experiment results
python scripts/evaluate_experiments.py \
--model_path ./output/exp_a_json \
--num_samples 200
# Generate expressions
python scripts/generate.py \
--model_path ./output/exp_a_json \
--num_generations 50 \
--validate
# Evaluate model
python scripts/evaluate.py \
--model_path ./output/exp_a_json \
--num_samples 100
# Compare two models
python scripts/compare_models.py \
--model1 ./output/model_v1 \
--model2 ./output/model_v2
AWS Infrastructure
# Launch instance with training
./scripts/aws/launch_instance_fixed.sh \
--instance-type g5.xlarge \
--hf-token <token> \
--wandb-key <key>
# Monitor training remotely
./scripts/aws/monitor_training_auto.sh
# Check AWS instances
aws ec2 describe-instances --query "Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name,PublicIpAddress,Tags[?Key=='Name'].Value|[0]]" --output table
# Stop all running instances
aws ec2 stop-instances --instance-ids <id1> <id2>
Architecture
Key Components
Data Pipeline (scripts/data/)
- prepare_experiment_data.py: Prepares data in JSON (EXP-A) and EOS (EXP-B) formats
- prepare_training_data_fixed.py: Legacy script that adds <|endofex|> markers
- data_processing.py: Cleaning, augmentation, format conversion
- Supports both HuggingFace Hub and local CSV files
Training (scripts/train_experiment.py, scripts/train.py)
- Uses GPT-2 (124M parameters) with LoRA adapters (only 294K trainable)
- train_experiment.py: Supports JSON format (recommended) and EOS format
- Integrates with Weights & Biases for experiment tracking
- Validates dataset format before training
Generation (scripts/generate.py)
- ExpressionStoppingCriteria: Custom stopping logic for clean expression boundaries (minimal sketch below)
- Multi-strategy extraction: marker-based → boundary-based → fallback
- Validates expressions using SymPy
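A minimal sketch of such a stopping criterion on top of transformers' StoppingCriteria API (the class name and fields here are illustrative; the real implementation lives in scripts/generate.py):
from transformers import StoppingCriteria, StoppingCriteriaList

class JsonCloseStoppingCriteria(StoppingCriteria):
    """Stop generation once the completion contains the closing '"}' of the JSON record."""

    def __init__(self, tokenizer, prompt_len, stop_str='"}'):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding
        self.stop_str = stop_str

    def __call__(self, input_ids, scores, **kwargs):
        completion = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return self.stop_str in completion

# Usage: pass StoppingCriteriaList([JsonCloseStoppingCriteria(tokenizer, prompt_len)])
# as stopping_criteria= to model.generate().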
Expression Parsing (classes/expression.py)
- parse_prefix(): Stack-based prefix notation parser
- parse_infix(): Infix notation parser
- validate(): Syntax and semantic validation
- OPERATOR_ARITY: Maps operators to argument counts
Data Flow
HuggingFace Hub / Local CSV
↓ prepare_experiment_data.py (converts to JSON or EOS format)
Train/Val/Test CSV files (in data/experiments/)
↓ train_experiment.py (tokenize + LoRA)
output/{experiment_name}/ (checkpoints + final model)
↓ evaluate_experiments.py & generate.py (inference + validation)
LoRA Configuration
- r=8: LoRA rank
- alpha=32: LoRA alpha scaling
- target_modules: c_attn (attention layers only)
- dropout=0.05: LoRA dropout rate
- Results in only 294K trainable parameters vs 124M total
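A sketch of how this configuration maps onto peft's LoraConfig (standard peft API; the project's training scripts may wire it differently):
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=8,                        # LoRA rank
    lora_alpha=32,              # alpha scaling
    target_modules=["c_attn"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~294K trainable vs 124M total for gpt2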
Important Patterns
Data Format (JSON - Recommended)
The JSON format (EXP-A) is the recommended approach, providing 80% valid expressions:
Training data format:
{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "sin(x_1 + C*x_2)"}
Inference prompt:
{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "
The model completes the expression and closes with "}.
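Illustrative helpers for building the inference prompt and recovering the completed expression (make_prompt and extract_expr are hypothetical names, not repo functions):
import json

def make_prompt(variables, ops, cons="C"):
    """Serialize a record with an empty expr, then drop the trailing '"}' so the model completes it."""
    return json.dumps({"vars": variables, "ops": ops, "cons": cons, "expr": ""})[:-2]

def extract_expr(generated_text):
    """Parse the full model output back as JSON; returns None if the record never closed."""
    try:
        return json.loads(generated_text)["expr"]
    except json.JSONDecodeError:
        return None

# make_prompt(["x_1", "x_2"], ["*", "+", "sin"])
# → '{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "'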
Why JSON Works Better Than EOS Token
- Clear boundaries: JSON has explicit { start and } end markers
- Structured containment: The expression is contained within the "expr": "..." field
- Lower loss: Easier for the model to learn (0.343 vs 0.415)
- No repetition: Unlike the EOS approach, the model doesn't fall into repetitive patterns
Expression Validation
from classes.expression import Expression
# Create and validate
expr = Expression.parse_prefix("+", "x", "y") # Prefix notation
expr = Expression.parse_infix("x + y") # Infix notation
is_valid = expr.validate()
# Add new operators
# Edit OPERATOR_ARITY and OPERATOR_FUNCS in classes/expression.py
Configuration Files
Training parameters can be set via:
- CLI arguments (highest priority)
- JSON config files in configs/ (e.g., training_medium.json)
- Default values in code
Example config structure:
{
"model_name_or_path": "gpt2",
"output_dir": "./output/experiment",
"num_train_epochs": 3,
"per_device_train_batch_size": 8,
"learning_rate": 5e-5,
"lora_r": 8,
"lora_alpha": 32
}
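One plausible implementation of that precedence (a sketch with a hypothetical DEFAULTS dict and a trimmed argument list, not the repo's actual loader):
import argparse
import json

DEFAULTS = {"model_name_or_path": "gpt2", "num_train_epochs": 3, "learning_rate": 5e-5}

def resolve_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    parser.add_argument("--model_name_or_path")
    parser.add_argument("--num_train_epochs", type=int)
    parser.add_argument("--learning_rate", type=float)
    args = parser.parse_args(argv)

    cfg = dict(DEFAULTS)                                # lowest priority: code defaults
    if args.config:
        with open(args.config) as f:
            cfg.update(json.load(f))                    # middle priority: JSON config file
    cfg.update({k: v for k, v in vars(args).items()
                if k != "config" and v is not None})    # highest priority: CLI arguments
    return cfg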
Experiment Results Summary
| Metric | EXP-A (JSON) | EXP-B (EOS) |
|---|---|---|
| Valid Expressions | 80% | 0.5% |
| Parseable | 81% | 4.5% |
| Correct Symbols | 76.5% | 11% |
| Train Loss | 0.343 | 0.415 |
Full results in EXPERIMENT_RESULTS.md.
Known Issues & Fixes
Solved: Expression Generation Stopping
The original problem (models not stopping at expression boundaries) has been solved by using the JSON structured format. The EOS token approach (<|endoftext|>) proved ineffective (0.5% valid).
Historical Bug Fixes
- Missing parse_prefix() Method: Added stack-based implementation in classes/expression.py:712.
- Wandb Version: Old version (0.19.9) doesn't support the new API key format. Update to wandb>=0.24.1 in requirements.txt.
GPU Environment
- Local Windows: May have PyTorch DLL issues. Recommend testing on AWS.
- AWS: Use g5.xlarge instances with NVIDIA A10G GPU (CUDA 12.1).
- Verify GPU: nvidia-smi and python -c "import torch; print(torch.cuda.is_available())"
AWS Security Group
Current security group (sg-0deaa73e23482e3f6) restricts SSH access to specific IPs:
- 143.106.58.120/32
- 179.160.37.193/32
To add your IP:
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
--group-id sg-0deaa73e23482e3f6 \
--protocol tcp --port 22 --cidr $MY_IP/32
File Locations
Critical Files
- scripts/train_experiment.py - Recommended training script for experiments
- scripts/data/prepare_experiment_data.py - Prepares JSON/EOS format data
- scripts/evaluate_experiments.py - Evaluates experiment results
- classes/expression.py:712+ - Expression parsing and validation core
- scripts/train.py - Legacy training entry point
- scripts/generate.py - Generation with stopping criteria
Output Directories
- output/{experiment_name}/ - Model checkpoints and final models
- data/experiments/exp_a_json/ - JSON format training data
- data/experiments/exp_b_eos/ - EOS format training data (not recommended)
- data/processed/ - Legacy prepared training data
- wandb/ - Weights & Biases experiment metadata (gitignored)
- aws/keys/ - SSH keys for EC2 instances (gitignored)
Configuration
- configs/training_*.json - Training hyperparameters for different scales
- configs/wandb_config.py - Wandb naming standards and utilities
- aws/config.json - AWS instance configurations
- CREDENTIALS_SETUP.md - Guide for API tokens and SSH keys setup
- WANDB_NAMING.md - Wandb experiment naming conventions
- .gitignore - Excludes *.pem, *.key, aws/.env, output/*, data/*
Dependencies
Core ML Stack
- transformers==4.51.3 - Model loading, tokenizer, Trainer
- torch==2.5.1+cu121 - Deep learning with CUDA 12.1
- peft==0.15.1 - LoRA parameter-efficient fine-tuning
- datasets==3.5.0 - HuggingFace dataset loading
- accelerate==1.6.0 - Multi-GPU training
Experiment Tracking
- wandb>=0.24.1 - Experiment tracking (older versions don't support the new API key format)
- tensorboard==2.16.2 - Alternative visualization
- trl==0.16.1 - Advanced training techniques
Validation & Analysis
- sympy==1.13.1 - Symbolic math for expression validation
- pandas==2.2.1 - Data manipulation
- scikit-learn==1.6.1 - Metrics and evaluation
Extensibility Points
Adding New Operators
Edit classes/expression.py:
OPERATOR_ARITY = {
    # Existing operators...
    'new_op': 2,  # binary operator (takes two arguments)
}
OPERATOR_FUNCS = {
    # Any NumPy-compatible callable with matching arity works;
    # np.hypot here is only an illustrative stand-in.
    'new_op': np.hypot,
}
Custom Metrics
Extend scripts/evaluate.py with new validation functions:
def custom_metric(expression):
    """Example: favor compact expressions (illustrative placeholder logic)."""
    # Replace with your validation logic; must return a numeric score.
    return 1.0 / (1 + len(str(expression)))
Different Model Base
Pass --model_name_or_path to train.py:
python scripts/train.py --model_name_or_path gpt2-medium # 355M params
python scripts/train.py --model_name_or_path gpt2-large # 774M params
New Data Format
Add converter in scripts/data/data_processing.py for custom notation or expression formats.
Best Practices
- Always validate data: Ensure <|endofex|> markers exist before training
- Use config files: Store hyperparameters in configs/ for reproducibility
- Track experiments: Enable W&B logging with standardized naming (see WANDB_NAMING.md)
- Use standard Wandb names: Import from configs/wandb_config.py for consistent naming
- Test locally first: Run small experiments before AWS deployment
- Monitor GPU usage: Check nvidia-smi and training logs for memory issues
- Stop AWS instances: Always stop instances when not in use to avoid charges
- Version control: Commit config files but never commit model weights or keys
Quick Debugging
# Check if GPU is available
python -c "import torch; print(torch.cuda.is_available())"
# Verify JSON format data
python -c "import pandas as pd; df = pd.read_csv('./data/experiments/exp_a_json/train.csv'); print(df['text'].iloc[0])"
# Test expression parsing
python -c "from classes.expression import Expression; expr = Expression.parse_infix('x + y'); print(expr.validate())"
# Check model output structure
python scripts/generate.py --model_path ./output/exp_a_json --num_generations 1 --validate
# Monitor training on AWS
ssh -i ~/chave-gpu.pem ubuntu@<ip>
tail -f ~/training_exp_a.log
Wandb Naming Standards
Complete guide: See WANDB_NAMING.md for detailed naming conventions.
Standard Format
All Wandb runs follow the pattern: seriguela-{type}-{model}-{dataset}-{timestamp}
Quick Usage
from configs.wandb_config import generate_run_name, get_wandb_project_name
# Generate standardized run name
run_name = generate_run_name("ppo", "medium", "nguyen5")
# → seriguela-ppo-medium-nguyen5-20260203-143022
# Initialize wandb
wandb.init(
project=get_wandb_project_name(), # → "seriguela"
name=run_name
)
Common Examples
# Supervised training
generate_run_name("supervised", "base", "700K")
# → seriguela-supervised-base-700k-20260203-143022
# PPO reinforcement learning
generate_run_name("ppo", "medium", "nguyen5")
# → seriguela-ppo-medium-nguyen5-20260203-143022
# GRPO with extra info
generate_run_name("grpo", "large", "nguyen7", "lr5e5")
# → seriguela-grpo-large-nguyen7-lr5e5-20260203-143022
Project name: Always use "seriguela" for production experiments.
Reinforcement Learning for Expression Generation (Feb 2025)
Problem Discovery: Complexity Gap
After implementing supervised fine-tuning with JSON format (80% valid expressions), RL experiments revealed a critical limitation:
The base model (GPT-2 124M) generates structurally simple expressions that fail on complex benchmarks.
Evidence from Nguyen-5 Benchmark
Target: sin(x_1**2)*cos(x_1) - 1
Analyzing 160 expressions from REINFORCE training:
- Valid expressions: 39.4%
- All valid expressions had R² = -1.0 (terrible fit)
- Only 15.9% used power operations (x²)
- 0% had nested trigonometric functions (sin(x²))
- Average depth: 1.40 (target requires 2+)
- No examples of function multiplication (sin()*cos())
Root Cause: Model learns to generate syntactically valid but structurally trivial expressions. Without proper complexity, all rewards are uniformly bad → no gradient signal → no learning.
RL Algorithms Implemented
Three algorithms implemented for symbolic regression fine-tuning:
1. REINFORCE (scripts/reinforce_symbolic.py)
- Policy gradient with EMA baseline
- Baseline: exponential moving average across epochs
- Advantage: reward - baseline
- Simple but effective for easy problems (Nguyen-1: R²=0.95)
2. GRPO (scripts/grpo_symbolic.py)
- Group Relative Policy Optimization (DeepSeek-R1 approach)
- Within-group advantage normalization
- More stable than REINFORCE
- Better for multi-modal reward landscapes (advantage computation sketched after this list)
3. PPO (scripts/ppo_symbolic.py)
- Proximal Policy Optimization with clipped objective
- Multiple optimization epochs per batch
- KL divergence monitoring with early stopping
- Problem: Too conservative when all samples have poor fit
- Failed completely on Nguyen-5 (R²=-1.0)
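A minimal sketch of the within-group advantage normalization that distinguishes GRPO from plain REINFORCE (assumed group structure; not the project's exact implementation):
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# This also illustrates the "no gradient signal" failure mode above: if every sample
# in a group is equally bad (e.g. all R² = -1.0 on Nguyen-5), std ≈ 0 and every
# advantage collapses to ~0, so the policy receives no learning signal.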
Debugging Tools
scripts/debug_reinforce.py
Captures ALL expressions (valid and invalid) during training for analysis:
- Tracks: expression, R², validity, error_type, error_message
- Saves to debug_expressions.json
- Essential for understanding model behavior
Example usage:
python scripts/debug_reinforce.py \
--model_path augustocsc/Se124M_700K_infix_v3_json \
--dataset data/benchmarks/nguyen/nguyen_5.csv \
--epochs 10
scripts/analyze_complexity.py
Analyzes expression complexity patterns:
- Power operation usage (x²)
- Nested trigonometric functions
- Expression depth (nesting level)
- Operator distribution
Results show base model generates shallow expressions (depth 1.4) with no nesting.
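A sketch of one way to measure expression depth with SymPy (illustrative; the metric in scripts/analyze_complexity.py may count differently):
import sympy as sp

def expr_depth(expr):
    """Depth of the SymPy expression tree; leaves (symbols, numbers) count as 0."""
    if not expr.args:
        return 0
    return 1 + max(expr_depth(arg) for arg in expr.args)

# The Nguyen-5 target requires nesting the base model rarely produces:
# expr_depth(sp.sympify("sin(x_1**2)*cos(x_1) - 1"))  # → 4 under this convention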
Solution: Scale Up Model Size
Hypothesis: Larger models have more capacity to learn complex compositional patterns.
Implemented training for three model sizes:
- GPT-2 Base: 124M parameters (baseline)
- GPT-2 Medium: 355M parameters (3x larger)
- GPT-2 Large: 774M parameters (6x larger)
Critical Discovery: Data Format Issue (Feb 2025)
PROBLEM FOUND: The HuggingFace dataset (augustocsc/sintetico_natural) column i_prompt_n is NOT in JSON format!
Actual format:
vars: x_1, x_2, x_3, x_4, x_5
oper: *, abs, asin, cos, log, sin, tan
cons: C
expr: log(cos(x_4))
Required JSON format (80% valid):
{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin"], "cons": "C", "expr": "log(cos(x_4))"}
This was causing models to learn wrong patterns!
Solution: scripts/train_with_json.py
New training script with critical fixes:
- Automatic format conversion to JSON (minimal sketch; assumes the i_prompt_n text column shown above):
import json

def convert_to_json_format(example):
    """Convert the plain-text record (vars/oper/cons/expr lines) to the JSON training string."""
    fields = dict(line.split(":", 1) for line in example["i_prompt_n"].strip().splitlines())
    return json.dumps({"vars": [v.strip() for v in fields["vars"].split(",")],
                       "ops": [o.strip() for o in fields["oper"].split(",")],
                       "cons": fields["cons"].strip(), "expr": fields["expr"].strip()})
Early stopping (Trainer wiring sketched after this list):
- Patience: 3 epochs
- Monitors validation loss
- Stops if no improvement → saves time and cost
- Load best model at end
Train/validation split:
- 90% train / 10% validation
- Evaluation every 500 steps
- Prevents overfitting
Wandb integration for monitoring
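A sketch of how these settings map onto standard transformers Trainer arguments (assuming model, train_ds, and val_ds are already built; the exact arguments in train_with_json.py may differ):
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./output/gpt2_medium_700K_json",
    num_train_epochs=3,
    eval_strategy="steps",           # evaluate every eval_steps
    eval_steps=500,
    save_strategy="steps",           # must match eval_strategy for best-model loading
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # 90% split
    eval_dataset=val_ds,             # 10% split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience in eval rounds
)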
Training Larger Models on AWS
Launch Scripts
Medium (355M) - g5.xlarge:
bash scripts/aws/launch_medium_training.sh \
--wandb-key YOUR_KEY \
--hf-token YOUR_TOKEN
- GPU: NVIDIA A10G (24GB VRAM)
- Time: ~2-3 hours
- Cost: ~$2-3 USD
Large (774M) - g5.2xlarge:
bash scripts/aws/launch_large_training.sh \
--wandb-key YOUR_KEY \
--hf-token YOUR_TOKEN
- GPU: NVIDIA A10G (24GB VRAM; g5.2xlarge adds more vCPUs and system RAM)
- Time: ~4-5 hours
- Cost: ~$8-10 USD
Monitoring
# Check training progress
ssh -i ~/.ssh/KEY.pem ubuntu@IP
tail -f /home/ubuntu/training_medium.log # or training_large.log
# Check completion
ssh ubuntu@IP 'test -f ~/.training_complete && echo "DONE" || echo "Running"'
Download Trained Models
# Medium
scp -i ~/.ssh/KEY.pem -r ubuntu@IP:~/seriguela/output/gpt2_medium_700K_json ./
# Large
scp -i ~/.ssh/KEY.pem -r ubuntu@IP:~/seriguela/output/gpt2_large_700K_json ./
Comparing Model Sizes
scripts/compare_trained_models.py: Compares models on expression complexity
python scripts/compare_trained_models.py \
--model_base augustocsc/Se124M_700K_infix_v3_json \
--model_medium ./gpt2_medium_700K_json \
--model_large ./gpt2_large_700K_json \
--dataset data/benchmarks/nguyen/nguyen_5.csv \
--epochs 10
Metrics compared:
- Valid expression %: Syntactic correctness
- Power operation %: Use of x² (essential for Nguyen-5)
- Nested trig %: sin(x²), cos(x²), etc.
- Average depth: Compositional complexity
- Best R²: Fit quality
Expected improvements with larger models:
- Medium (355M): +20-30% power usage, depth 1.8-2.0
- Large (774M): +40-50% power usage, depth 2.0-2.5, possible nested trig
Key Insights
- JSON format is critical: 80% valid vs 0.5% with the EOS format
- Model size matters for complexity: Base (124M) cannot generate nested compositions
- RL needs variance in rewards: PPO fails when all samples are equally bad
- Early stopping essential: Saves compute on AWS
- Validation split necessary: Prevents overfitting on large datasets
Files Added (Feb 2025)
Training:
- scripts/train_with_json.py - Correct training with JSON + early stopping
- scripts/train_medium.py - Legacy (DO NOT USE - wrong format)
RL Algorithms:
- scripts/reinforce_symbolic.py - REINFORCE implementation
- scripts/grpo_symbolic.py - GRPO implementation
- scripts/ppo_symbolic.py - PPO implementation
- scripts/debug_reinforce.py - Debug version capturing all expressions
Analysis:
- scripts/analyze_complexity.py - Expression complexity analysis
- scripts/compare_trained_models.py - Multi-model comparison
- scripts/show_expressions.py - Display valid/invalid expressions
AWS Deployment:
- scripts/aws/launch_medium_training.sh - Launch medium training
- scripts/aws/launch_large_training.sh - Launch large training
- scripts/aws/monitor_medium_training.sh - Monitor progress
Documentation:
- TRAIN_MEDIUM_AWS.md - Quick guide for AWS training
- WANDB_NAMING.md - Wandb experiment naming standards
- CREDENTIALS_SETUP.md - API keys and SSH setup guide
Credentials Location
Complete setup guide: See CREDENTIALS_SETUP.md for detailed configuration.
API Tokens: C:\Users\madeinweb\.tokens.txt (gitignored)
huggingface = hf_...
wandb = wandb_v1_...
SSH Key (AWS): C:\Users\madeinweb\chave-gpu.pem
- Used for all AWS EC2 instance access
- Usage: ssh -i ~/chave-gpu.pem ubuntu@<IP>
Scripts automatically read tokens from ~/.tokens.txt when available.
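A sketch of how a script might look these tokens up (hypothetical helper; the file format follows the example above):
from pathlib import Path

def read_token(name, path=Path.home() / ".tokens.txt"):
    """Return the token for `name` from the 'key = value' lines, or None if absent."""
    if not path.exists():
        return None
    for line in path.read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip()
    return None

# read_token("huggingface") → "hf_...", read_token("wandb") → "wandb_v1_..."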
Next Steps
- Wait for training completion (~5 hours for both)
- Compare all three models on Nguyen benchmarks
- Test larger models with RL: REINFORCE/GRPO on Nguyen-5
- Expected result: Medium/Large should generate complex expressions with proper nesting
- If successful: Deploy to production, create HuggingFace model cards
Model Scaling Study (Feb 2025)
Overview
Comprehensive experiment training GPT-2 Base (124M), Medium (355M), and Large (774M) on 700K JSON dataset to investigate the impact of model size on symbolic regression capability.
Status: ⏳ In Progress
Complete documentation: See EXPERIMENT_MODEL_SCALING.md for full research report.
Research Question
Do larger models generate more complex, valid, and diverse mathematical expressions for symbolic regression?
Hypotheses
- H1 (Validity): Valid expression rate increases with model size (80% → 90%)
- H2 (Complexity): Expression depth increases (1.4 → 2.5), power operations increase (16% → 50%+)
- H3 (Performance): R² scores improve on complex benchmarks (Nguyen-5: -1.0 → >0.0)
- H4 (Diversity): Larger models generate more unique expressions
- H5 (RL Interaction): RL algorithms benefit more from larger models
Models Trained
| Model | Parameters | Trainable (LoRA) | Instance | Batch Size | Training Time | Cost |
|---|---|---|---|---|---|---|
| Base | 124M | 294K | g5.xlarge | 8 | ~2-3h | ~$2-3 |
| Medium | 355M | ~786K | g5.xlarge | 4 | ~3-4h | ~$3-4 |
| Large | 774M | ~1.5M | g5.2xlarge | 2 | ~4-5h | ~$5-6 |
All hyperparameters identical except batch size (to isolate model size effect).
Training Scripts
Parallel training (all 3 models simultaneously):
# Launch all models in parallel
bash launch_all_models.sh
Individual training (if needed):
# Base (124M)
bash scripts/aws/launch_base_training.sh --wandb-key KEY --hf-token TOKEN
# Medium (355M)
bash scripts/aws/launch_medium_training.sh --wandb-key KEY --hf-token TOKEN
# Large (774M)
bash scripts/aws/launch_large_training.sh --wandb-key KEY --hf-token TOKEN
Critical fix applied: Removed cloud-init status --wait deadlock from all launch scripts.
Evaluation Pipeline
Complete Nguyen suite evaluation (144 experiments):
# Run full suite: 3 models × 12 benchmarks × 4 algorithms
bash scripts/run_nguyen_suite.sh
# Aggregate results and generate report
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results
Algorithms evaluated:
- Supervised: Direct generation (no RL)
- REINFORCE: Policy gradient with EMA baseline
- GRPO: Group Relative Policy Optimization
- PPO: Proximal Policy Optimization
Metrics Tracked
Quality Metrics:
- Valid expression rate (%)
- Constraint adherence (uses only allowed vars/ops)
- Diversity rate (unique expressions)
Complexity Metrics:
- Power operations usage (x², x**n)
- Nested trigonometric functions (sin(cos(x)))
- Average expression depth
- Operator distribution
Performance Metrics (R² scoring sketched below):
- Best R² achieved per benchmark
- Mean R² (valid expressions only)
- Convergence rate during RL
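A sketch of how per-expression R² can be computed with sklearn (hypothetical helper; assumes single-variable benchmarks like Nguyen-5 and may differ from the suite's own scoring):
import numpy as np
import sympy as sp
from sklearn.metrics import r2_score

def score_expression(expr_str, x, y_true):
    """Evaluate a generated expression on benchmark data; return R² or None if invalid."""
    try:
        f = sp.lambdify(sp.Symbol("x_1"), sp.sympify(expr_str), "numpy")
        y_pred = np.broadcast_to(f(x), y_true.shape)  # handles constant expressions
        return r2_score(y_true, y_pred)
    except Exception:
        return None  # unparseable or numerically invalid expression

# Best R² per benchmark = max over all valid samples; mean R² averages valid samples only.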
Key Results
To be filled after evaluation completes
| Metric | Base (124M) | Medium (355M) | Large (774M) |
|---|---|---|---|
| Valid Rate (%) | TBD | TBD | TBD |
| Power Ops (%) | TBD | TBD | TBD |
| Avg Depth | TBD | TBD | TBD |
| Nested Trig (%) | TBD | TBD | TBD |
| Best R² (Nguyen-5) | TBD | TBD | TBD |
Baseline (from previous work):
- Base model: 39.4% valid on Nguyen-5, R²=-1.0
- Power operations: 15.9%
- Nested trig: 0%
- Average depth: 1.40
Files Created
Training scripts:
- scripts/aws/launch_base_training.sh - Launch Base training
- scripts/aws/launch_medium_training.sh - Launch Medium training (fixed)
- scripts/aws/launch_large_training.sh - Launch Large training (fixed)
- launch_all_models.sh - Parallel launch of all 3 models
Evaluation scripts:
- scripts/run_nguyen_suite.sh - Run complete Nguyen 1-12 suite
- scripts/aggregate_nguyen_results.py - Aggregate and visualize results
Documentation:
- EXPERIMENT_MODEL_SCALING.md - Full research report
- TRAINING_LOG_MODEL_SCALING_2025.md - Training execution log
- model_cards/gpt2_base_700K_json_card.md - Base model card
- model_cards/gpt2_medium_700K_json_card.md - Medium model card
- model_cards/gpt2_large_700K_json_card.md - Large model card
Model Locations
Local paths (after training):
output/
├── gpt2_base_700K_json/ # Base (124M)
├── gpt2_medium_700K_json/ # Medium (355M)
└── gpt2_large_700K_json/ # Large (774M)
HuggingFace (after publication):
- Base: TBD (to be uploaded)
- Medium: TBD (to be uploaded)
- Large: TBD (to be uploaded)
Reproduction
Quick start:
# 1. Train all models (requires AWS, ~10h total parallel)
bash launch_all_models.sh
# 2. Stop instances immediately after completion
aws ec2 stop-instances --instance-ids $(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=seriguela-*-training" \
"Name=instance-state-name,Values=running" \
--query "Reservations[*].Instances[*].InstanceId" --output text)
# 3. Download models
scp -i ~/.ssh/KEY.pem -r ubuntu@BASE_IP:~/seriguela/output/gpt2_base_700K_json ./output/
scp -i ~/.ssh/KEY.pem -r ubuntu@MEDIUM_IP:~/seriguela/output/gpt2_medium_700K_json ./output/
scp -i ~/.ssh/KEY.pem -r ubuntu@LARGE_IP:~/seriguela/output/gpt2_large_700K_json ./output/
# 4. Evaluate (can run locally or on single AWS instance)
bash scripts/run_nguyen_suite.sh
# 5. Analyze results
python scripts/aggregate_nguyen_results.py --input_dir nguyen_suite_results
Total cost: ~$10-13 USD for training + optional $8-12 for full suite evaluation (if run on AWS)
Expected Findings
If hypotheses confirmed:
- Larger models produce significantly more complex expressions
- Depth and power operation usage scale with model size
- R² scores improve on complex benchmarks
- Optimal model choice depends on task complexity and budget
If hypotheses rejected:
- LoRA may be the limiting factor (fixed rank=8 for all sizes)
- Dataset (700K) may not be large enough to show scaling benefits
- Alternative architectures may be needed
Implications
For model selection:
- Use Base (124M) if: Fast inference, simple benchmarks (Nguyen 1-2), limited budget
- Use Medium (355M) if: Balanced performance/cost, moderate complexity
- Use Large (774M) if: Maximum quality needed, complex benchmarks (Nguyen 5+), budget available
For future research:
- Test LoRA rank scaling with model size (r=8/16/32)
- Train on larger datasets (1M, 5M expressions)
- Test other architectures (GPT-Neo, LLaMA)
References
- Training log: TRAINING_LOG_MODEL_SCALING_2025.md
- Research report: EXPERIMENT_MODEL_SCALING.md
- Model cards: model_cards/gpt2_*_700K_json_card.md
- Previous RL work: See "Reinforcement Learning for Expression Generation" section above