---
language: en
license: mit
tags:
  - symbolic-regression
  - gpt2-medium
  - lora
  - expression-generation
  - mathematics
datasets:
  - augustocsc/sintetico_natural
metrics:
  - accuracy
  - r2_score
model-index:
  - name: GPT-2 Medium for Symbolic Regression
    results:
      - task:
          type: symbolic-regression
          name: Expression Generation Quality
        dataset:
          name: Sintetico Natural 700K
          type: augustocsc/sintetico_natural
        metrics:
          - name: Valid Expression Rate
            type: accuracy
            value: 99.2
          - name: Diversity Rate
            type: diversity
            value: 98.8
      - task:
          type: symbolic-regression
          name: Nguyen Benchmark Suite
        dataset:
          name: Nguyen Benchmarks 1-12
          type: nguyen-symbolic-regression
        metrics:
          - name: Average Valid Rate
            type: accuracy
            value: 75.2
          - name: Average Best R²
            type: r2_score
            value: 0.9812
          - name: Maximum R²
            type: r2_score
            value: 0.9999
---

# GPT-2 Medium for Symbolic Regression (JSON Format)

## Model Description

This model is a GPT-2 Medium (355M parameters) fine-tuned with LoRA for symbolic regression expression generation. It was trained on 700K synthetic mathematical expressions in JSON format, achieving a 99.2% valid expression rate, 98.8% diversity, and R² = 0.9812 on the Nguyen benchmarks.

Part of the research study *"Impact of Model Size on Symbolic Regression Capability in Large Language Models"*.

This is the balanced performance/cost model in a comprehensive scaling study: it offers significant improvements over Base (124M) while remaining cost-effective.

## Model Details

### Architecture

- **Base Model:** gpt2-medium (355M parameters)
- **Trainable Parameters:** ~294K (LoRA adapters only, 0.08% of total)
- **Training Method:** LoRA (Low-Rank Adaptation)
- **Training Data:** 700K expressions from augustocsc/sintetico_natural
- **Data Format:** JSON structured format (EXP-A), illustrated below
- **Framework:** HuggingFace Transformers + PEFT
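
For illustration, a single training sample in this JSON format might look like the following (the field layout is inferred from the prompts in the Usage section; the concrete expression is a made-up example):

```json
{"vars": ["x_1"], "ops": ["*", "+", "sin", "cos", "log"], "cons": "C", "expr": "sin(x_1) + C*x_1"}
```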

### LoRA Configuration

```json
{
  "r": 8,
  "lora_alpha": 32,
  "target_modules": ["c_attn"],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "CAUSAL_LM"
}
```
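
A minimal sketch of applying this configuration with PEFT (standard `LoraConfig`/`get_peft_model` API; not the study's exact training script):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2-medium")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```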

### Training Hyperparameters

```json
{
  "learning_rate": 5e-5,
  "num_train_epochs": 3,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "effective_batch_size": 16,
  "warmup_steps": 500,
  "weight_decay": 0.01,
  "fp16": true,
  "early_stopping_patience": 3,
  "seed": 42
}
```
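
A sketch of how these values map onto HuggingFace `TrainingArguments` (illustrative; `output_dir` and the per-epoch evaluation/save cadence are assumptions, since the card only lists the hyperparameters above):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./gpt2-medium-symreg",  # hypothetical path
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    seed=42,
    eval_strategy="epoch",          # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
# Pass to Trainer together with:
#   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
```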

### Training Details

- **Training Duration:** ~3–4 hours on an NVIDIA A10G (24 GB)
- **Instance Type:** AWS g5.xlarge
- **Early Stopping:** enabled (patience = 3, monitoring validation loss)
- **Final Training Loss:** [Value from training logs]
- **Dataset Split:** 90% train / 10% validation (see the sketch below)
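
A minimal sketch of the 90/10 split using the `datasets` library (the split name and seed are assumptions):

```python
from datasets import load_dataset

ds = load_dataset("augustocsc/sintetico_natural", split="train")
splits = ds.train_test_split(test_size=0.1, seed=42)  # 90% train / 10% validation
train_ds, val_ds = splits["train"], splits["test"]
```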

## Performance

### Expression Generation Quality (500 samples)

| Metric | Score | vs Base |
|---|---|---|
| Valid Expression Rate | 99.2% | -0.2% |
| Diversity Rate | 98.8% 🏆 | +1.0% |
| Unique Expressions | 494 / 500 🏆 | +5 |
| Errors | 4 / 500 (0.8%) | +1 |

**Key Strengths:**

- Near-perfect valid expression generation (99.2%; see the validity-check sketch below)
- Highest diversity among all model sizes (98.8%)
- Best unique expression count (494/500)
- Excellent balance of quality and efficiency
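
The card does not specify how validity is checked; a plausible sympy-based check might look like this (the allowed-symbol set is an assumption):

```python
import sympy

def is_valid_expression(expr_str, variables=("x_1", "x_2")):
    """Return True if the string parses into a sympy expression
    that uses only the allowed variables and the constant placeholder C."""
    try:
        expr = sympy.sympify(expr_str)
    except (sympy.SympifyError, SyntaxError, TypeError):
        return False
    allowed = {sympy.Symbol(v) for v in variables} | {sympy.Symbol("C")}
    return expr.free_symbols <= allowed
```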

### Nguyen Benchmark Performance (12 benchmarks, 1,200 expressions)

| Metric | Score | vs Base |
|---|---|---|
| Average Valid Rate | 75.2% | +12.7% 🎯 |
| Valid Rate Range | 64.0%–94.0% | Improved consistency |
| Average Best R² | 0.9812 | +6.8% 🎯 |
| R² Range | 0.9288–0.9999 | Much tighter range |
| Benchmarks with R² > 0.99 | 5 / 12 | +1 |
| Average Execution Time | 162.3 s per benchmark | +71% |
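
Here R² is presumably the standard coefficient of determination between targets $y_i$ and predictions $\hat{y}_i$:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$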

**Major Improvements Over Base:**

- +12.7 percentage points valid rate (62.5% → 75.2%)
- +6.8% average R² improvement (0.919 → 0.981)
- All benchmarks reach R² > 0.93 (Base had a 0.67 minimum)
- Near-perfect fit on Nguyen-7 (R² = 0.9999)

**Per-Benchmark Results (Best R²):**

| Benchmark | Formula | Valid Rate | Best R² | vs Base |
|---|---|---|---|---|
| Nguyen-1 | x³ + x² + x | 64% | 0.9889 🏆 | +0.0172 |
| Nguyen-2 | x⁴ + x³ + x² + x | 67% | 0.9804 | -0.0171 |
| Nguyen-3 | x⁵ + x⁴ + x³ + x² + x | 71% | 0.9591 | -0.0187 |
| Nguyen-4 | x⁶ + x⁵ + x⁴ + x³ + x² + x | 71% | 0.9288 | +0.1495 🎯 |
| Nguyen-5 | sin(x²)·cos(x) - 1 | 64% | 0.9993 🏆 | +0.0671 🎯 |
| Nguyen-6 | sin(x) + sin(x + x²) | 69% | 0.9985 | +0.0003 |
| Nguyen-7 | log(x + 1) + log(x² + 1) | 81% | 0.9999 | +0.0016 |
| Nguyen-8 | √x | 79% | 0.9985 | +0.0224 |
| Nguyen-9 | sin(x) + sin(y²) | 77% | 0.9875 | +0.1837 🎯 |
| Nguyen-10 | 2·sin(x)·cos(y) | 75% | 0.9980 | -0.0014 |
| Nguyen-11 | x^y | 91% | 0.9600 🏆 | +0.0401 |
| Nguyen-12 | x⁴ - x³ + y²/2 - y | 94% | 0.9751 🏆 | +0.3016 🎯 |

**Best Results:**

- Nguyen-7: R² = 0.9999 (near-perfect, within 0.01% of a perfect fit)
- Nguyen-5: R² = 0.9993 (complex nested operations)
- Nguyen-12: massive +0.30 R² improvement over Base

**Observations:**

- Wins on complex benchmarks: Nguyen 4, 5, 9, 11, and 12 all improved significantly
- Consistent R² > 0.93 across all benchmarks
- Valid rate of 64–94%, much more stable than Base (46–93%)
- Best diversity (98.8%): generates the most varied expressions

## Usage

### Installation

```bash
pip install transformers peft torch
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

# GPT-2 has no padding token by default; reuse EOS
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapter (replace with the actual HuggingFace repo)
model = PeftModel.from_pretrained(base_model, "USER/gpt2_medium_700K_json")
model.eval()
```
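
Optionally, the adapter can be merged into the base weights for slightly faster inference (standard PEFT call):

```python
# Optional: fold the LoRA weights into the base model.
# After this, `model` behaves like a plain fine-tuned GPT-2 Medium.
model = model.merge_and_unload()
```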

### Generating Expressions

```python
import torch

# Prompt in the JSON training format; generation continues the "expr" field
prompt = '{"vars": ["x_1", "x_2"], "ops": ["*", "+", "sin", "cos", "log"], "cons": "C", "expr": "'

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with higher-quality sampling settings
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        num_return_sequences=5,  # generate multiple candidates
        pad_token_id=tokenizer.eos_token_id,
        # Stop at the closing quote of the "expr" field
        eos_token_id=tokenizer.encode('"', add_special_tokens=False)[0]
    )

# Decode all candidates, skipping any malformed outputs
for i, output in enumerate(outputs):
    generated_text = tokenizer.decode(output, skip_special_tokens=True)
    parts = generated_text.split('"expr": "')
    if len(parts) > 1:
        expression = parts[1].split('"')[0]
        print(f"Candidate {i+1}: {expression}")
```

### Batch Generation for Symbolic Regression

```python
def generate_candidate_expressions(model, tokenizer, variables, operators, num_candidates=10):
    """Generate multiple candidate expressions for symbolic regression."""
    vars_str = ', '.join(f'"{v}"' for v in variables)
    ops_str = ', '.join(f'"{o}"' for o in operators)
    prompt = f'{{"vars": [{vars_str}], "ops": [{ops_str}], "cons": "C", "expr": "'

    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.8,
            top_p=0.95,
            do_sample=True,
            num_return_sequences=num_candidates,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.encode('"', add_special_tokens=False)[0]
        )

    expressions = []
    for output in outputs:
        try:
            text = tokenizer.decode(output, skip_special_tokens=True)
            expr = text.split('"expr": "')[1].split('"')[0]
            expressions.append(expr)
        except IndexError:  # malformed output without a closed "expr" field
            continue

    return expressions

# Example usage
candidates = generate_candidate_expressions(
    model, tokenizer,
    variables=["x_1", "x_2"],
    operators=["*", "+", "sin", "cos"],
    num_candidates=20
)

print(f"Generated {len(candidates)} valid expressions")
for i, expr in enumerate(candidates[:5]):
    print(f"{i+1}. {expr}")
```

## Intended Use

### Primary Use Cases

- **Production symbolic regression:** balanced quality and speed
- **Complex benchmark problems:** Nguyen 4–12, nested operations
- **Research applications:** state-of-the-art baseline
- **Expression diversity:** highest diversity among all model sizes

### Recommended For

- Moderate to complex benchmarks (all of Nguyen 1–12)
- Applications requiring high diversity (98.8% unique)
- Production systems with quality requirements above 98%
- Use cases needing the best balance of performance and cost

### Optimal Choice When

- You need consistent R² > 0.93 across all problems
- Speed matters but quality cannot be compromised
- The budget allows ~70% more compute than Base
- Diversity is critical (exploration phase)

## Comparison with Other Sizes

### vs Base (124M)

**When to choose Medium:**

- You need the +12.7 percentage-point valid rate gain on benchmarks
- Complex problems requiring nested operations
- The +6.8% R² improvement is worth 70% slower inference
- You need the highest diversity

**When to choose Base:**

- Speed is critical
- Simple benchmarks only
- Budget is very limited

### vs Large (774M)

**When to choose Medium:**

- Best performance/cost ratio
- Already excellent R² (0.981 vs 0.985)
- 42% faster than Large
- Smaller memory footprint

**When to choose Large:**

- You need the maximum possible quality (100% valid rate on the generation-quality test, 89% on benchmarks)
- Budget allows the extra compute
- Perfect R² = 1.0 on some benchmarks

## Limitations

### Known Issues

1. **Slower than Base:** 70% longer inference time (162 s vs 95 s per benchmark)
2. **Not perfect:** 99.2% valid rate (vs 100% for Large)
3. **Memory requirements:** 355M parameters need more VRAM than Base
4. **Some failures remain:** valid rate of 64–94% on benchmarks (not 100%)

### Model Scaling Position

**Sweet-spot model.** Medium offers the best balance:

- 94% of Large's performance at 42% faster speed
- +6.8% improvement over Base in average R² (0.919 → 0.981)
- Highest diversity (98.8%) across all sizes

### General Limitations

- Trained only on infix notation
- May generate expressions with division by zero or other domain errors (see the guard sketch below)
- LoRA fine-tuning limits adaptation compared with full fine-tuning
- No reinforcement-learning optimization
- Requires the JSON prompt format
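
When evaluating generated expressions numerically, it is worth guarding against these domain errors. A minimal numpy sketch (the `safe_eval` helper is illustrative, not part of the released code):

```python
import numpy as np

def safe_eval(f, X):
    """Evaluate a compiled expression while masking division by zero,
    overflow, and domain errors (e.g. log of a negative number)."""
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        y = np.asarray(f(X), dtype=float)
    return np.where(np.isfinite(y), y, np.nan)  # NaN marks invalid points
```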

## Ethical Considerations

- **Bias:** inherits GPT-2 pretraining biases
- **Validation Required:** always validate generated expressions before use
- **Environmental:** moderate carbon footprint (~3–4 GPU hours)
- **Transparency:** all metrics and training details are disclosed

## Model Card Authors

**Research Team:** [Your Name/Institution]

**Contact:** [Email or GitHub]

**Date:** February 2025

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{gpt2_medium_symbolic_regression_2025,
  title={GPT-2 Medium Model for Symbolic Regression: Optimal Performance-Cost Balance},
  author={[Your Name]},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/USER/gpt2_medium_700K_json}},
  note={355M parameters, 99.2\% valid rate, R²=0.9812 on Nguyen benchmarks}
}
```

## Acknowledgments

- **Dataset:** augustocsc/sintetico_natural (HuggingFace Hub)
- **Framework:** HuggingFace Transformers, PEFT (LoRA)
- **Compute:** AWS g5.xlarge with NVIDIA A10G GPU
- **Experiment Tracking:** Weights & Biases

## Additional Resources

- **Research Report:** see SCIENTIFIC_REPORT_MODEL_SCALING.md
- **Training Details:** see TRAINING_LOG_MODEL_SCALING_2025.md
- **Benchmark Results:** see NGUYEN_RESULTS_FINAL.md
- **Visualizations:** see the visualizations/ directory
- **Model Comparison:** see FINAL_STATUS.md

**Model Version:** 1.0 · **Last Updated:** 2025-02-04 · **Status:** Production-ready · **Recommended Use:** default choice for most applications