---
language: en
license: mit
tags:
  - symbolic-regression
  - gpt2
  - lora
  - expression-generation
  - mathematics
datasets:
  - augustocsc/sintetico_natural
metrics:
  - accuracy
  - r2_score
model-index:
  - name: GPT-2 Base for Symbolic Regression
    results:
      - task:
          type: symbolic-regression
          name: Expression Generation Quality
        dataset:
          name: Sintetico Natural 700K
          type: augustocsc/sintetico_natural
        metrics:
          - name: Valid Expression Rate
            type: accuracy
            value: 99.4
          - name: Diversity Rate
            type: diversity
            value: 97.8
      - task:
          type: symbolic-regression
          name: Nguyen Benchmark Suite
        dataset:
          name: Nguyen Benchmarks 1-12
          type: nguyen-symbolic-regression
        metrics:
          - name: Average Valid Rate
            type: accuracy
            value: 62.5
          - name: Average Best R²
            type: r2_score
            value: 0.919
          - name: Maximum R²
            type: r2_score
            value: 0.9994
---

GPT-2 Base for Symbolic Regression (JSON Format)

Model Description

This model is a GPT-2 Base (124M parameters) fine-tuned using LoRA for symbolic regression expression generation. It was trained on 700K synthetic mathematical expressions in JSON format, achieving 99.4% valid expression rate and solid performance on standard symbolic regression benchmarks.

Part of research study: "Impact of Model Size on Symbolic Regression Capability in Large Language Models"

This is the baseline model in a comprehensive model scaling study comparing Base (124M), Medium (355M), and Large (774M) parameter models.

Model Details

Architecture

  • Base Model: gpt2 (124M parameters)
  • Trainable Parameters: ~294K (LoRA adapters only - 0.24% of total)
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 700K expressions from augustocsc/sintetico_natural
  • Data Format: JSON structured format (EXP-A)
  • Framework: HuggingFace Transformers + PEFT

LoRA Configuration

{
  "r": 8,
  "lora_alpha": 32,
  "target_modules": ["c_attn"],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "CAUSAL_LM"
}

Training Hyperparameters

{
  "learning_rate": 5e-5,
  "num_train_epochs": 3,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "effective_batch_size": 32,
  "warmup_steps": 500,
  "weight_decay": 0.01,
  "fp16": true,
  "early_stopping_patience": 3,
  "seed": 42
}
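Expressed as a HuggingFace `TrainingArguments` object, the setup looks roughly like this (a sketch: `output_dir` is a placeholder, and the early-stopping wiring via `EarlyStoppingCallback` is noted rather than shown):

```python
import torch
from transformers import TrainingArguments

# Sketch of the equivalent Trainer configuration; output_dir is a placeholder.
# Early stopping (patience=3 on validation loss) is added separately via
# transformers' EarlyStoppingCallback.
training_args = TrainingArguments(
    output_dir="gpt2-base-symbolic-regression",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 * 4 = 32
    warmup_steps=500,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # fp16 was enabled on the A10G run
    seed=42,
)
```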

Training Details

  • Training Duration: ~2-3 hours on NVIDIA A10G (24GB)
  • Instance Type: AWS g5.xlarge
  • Early Stopping: Enabled (patience=3, monitored validation loss)
  • Final Training Loss: [Value from training logs]
  • Dataset Split: 90% train / 10% validation

Performance

Expression Generation Quality (500 samples)

| Metric | Score |
|--------|-------|
| Valid Expression Rate | 99.4% |
| Diversity Rate | 97.8% |
| Unique Expressions | 489 / 500 |
| Errors | 3 / 500 (0.6%) |
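These two rates reduce to a simple tally over a sample of generations (a sketch; `is_valid` is a placeholder predicate, while the actual check uses SymPy parsing as shown under Usage):

```python
def generation_metrics(expressions):
    """Compute valid-expression and diversity rates (%) for sampled generations."""
    def is_valid(expr):
        # Placeholder predicate: in practice, attempt sympy.sympify(expr).
        return bool(expr.strip())

    valid = [e for e in expressions if is_valid(e)]
    valid_rate = 100.0 * len(valid) / len(expressions)
    # Diversity counts distinct valid expressions over the whole sample:
    # 489 unique out of 500 generations gives the 97.8% reported above.
    diversity_rate = 100.0 * len(set(valid)) / len(expressions)
    return valid_rate, diversity_rate
```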

Key Strengths:

  • Near-perfect valid expression generation (99.4%)
  • High diversity (97.8% unique expressions)
  • Very few errors (only 3 in 500 samples)
  • Fast inference (smallest model)
  • Most economical option

Nguyen Benchmark Performance (12 benchmarks, 1,200 expressions)

| Metric | Score |
|--------|-------|
| Average Valid Rate | 62.5% |
| Valid Rate Range | 46.0% - 93.0% |
| Average Best R² | 0.9190 |
| R² Range | 0.6735 - 0.9994 |
| Benchmarks with R² > 0.99 | 4 / 12 |
| Average Execution Time | 95.1 seconds per benchmark |

Per-Benchmark Results (Best R²):

| Benchmark | Formula | Valid Rate | Best R² |
|-----------|---------|------------|---------|
| Nguyen-1 | x³ + x² + x | 49% | 0.9717 |
| Nguyen-2 | x⁴ + x³ + x² + x | 52% | 0.9975 |
| Nguyen-3 | x⁵ + x⁴ + x³ + x² + x | 46% | 0.9778 |
| Nguyen-4 | x⁶ + x⁵ + x⁴ + x³ + x² + x | 46% | 0.7793 |
| Nguyen-5 | sin(x²)·cos(x) - 1 | 56% | 0.9322 |
| Nguyen-6 | sin(x) + sin(x + x²) | 53% | 0.9982 |
| Nguyen-7 | log(x + 1) + log(x² + 1) | 84% | 0.9983 |
| Nguyen-8 | √x | 82% | 0.9761 |
| Nguyen-9 | sin(x) + sin(y²) | 56% | 0.8038 |
| Nguyen-10 | 2·sin(x)·cos(y) | 50% | 0.9994 |
| Nguyen-11 | x^y | 93% | 0.9199 |
| Nguyen-12 | x⁴ - x³ + y²/2 - y | 83% | 0.6735 |

Best Result: Nguyen-10 with R² = 0.9994 (near-perfect fit)

Observations:

  • Excels on simpler benchmarks (Nguyen 7, 10)
  • Struggles with complex benchmarks requiring nested operations (Nguyen 4, 12)
  • Valid rate varies significantly (46-93%) depending on benchmark complexity
  • Fast execution (~95s per benchmark) - fastest among all model sizes

Usage

Installation

pip install transformers peft torch

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add padding token
tokenizer.pad_token = tokenizer.eos_token

# Load LoRA adapter (replace with actual HuggingFace repo)
model = PeftModel.from_pretrained(base_model, "USER/gpt2_base_700K_json")
model.eval()

Generating Expressions

import torch

# Define prompt in JSON format
prompt = '{"vars": ["x_1"], "ops": ["*", "+", "sin", "cos"], "cons": "C", "expr": "'

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.encode('"', add_special_tokens=False)[0]  # Stop at closing quote
    )

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract expression (after "expr": ")
expression = generated_text.split('"expr": "')[1].split('"')[0]
print(f"Generated expression: {expression}")

Validation with SymPy

import sympy as sp

def validate_expression(expr_str, variables):
    """Validate expression using SymPy."""
    try:
        # Define symbols
        symbols = {var: sp.Symbol(var) for var in variables}

        # Parse expression
        expr = sp.sympify(expr_str, locals=symbols)

        # Check if valid
        return True, expr
    except Exception as e:
        return False, str(e)

# Example
is_valid, result = validate_expression("sin(x_1**2) + cos(x_1)", ["x_1"])
print(f"Valid: {is_valid}, Expression: {result}")
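The benchmark R² scores reported above come from numerically evaluating a generated expression against the target function. A minimal sketch of that step (the evaluation grid and range here are illustrative, not the exact Nguyen protocol):

```python
import numpy as np
import sympy as sp

def r2_score(expr_str, target_fn, var="x_1", lo=-1.0, hi=1.0, n=100):
    """Evaluate a generated expression on a grid and compute R² against a target."""
    x = sp.Symbol(var)
    f = sp.lambdify(x, sp.sympify(expr_str, locals={var: x}), "numpy")
    xs = np.linspace(lo, hi, n)
    y_true = target_fn(xs)
    # broadcast_to handles constant expressions that evaluate to a scalar
    y_pred = np.broadcast_to(f(xs), xs.shape).astype(float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A perfect recovery of Nguyen-1 gives R² = 1.0
print(r2_score("x_1**3 + x_1**2 + x_1", lambda x: x**3 + x**2 + x))  # prints 1.0
```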

Intended Use

Primary Use Cases

  • Symbolic regression research: Baseline model for symbolic regression experiments
  • Fast prototyping: Quick expression generation with reasonable quality
  • Educational purposes: Teaching symbolic regression with LLMs
  • Cost-sensitive applications: Best performance/cost ratio

Recommended For

  • Simple to moderate complexity benchmarks (Nguyen 1-3, 7-8, 10-11)
  • Applications where speed is critical
  • Scenarios with limited computational resources
  • Baseline comparisons in research

Not Recommended For

  • Complex nested expressions (use Medium or Large models)
  • Production systems requiring >95% valid rate on complex benchmarks
  • Tasks requiring maximum possible accuracy (consider Large model)

Limitations

Known Issues

  1. Complex Expression Generation: Struggles with deeply nested expressions (avg depth 1.40)
  2. Power Operations: Limited use of power operations (15.9% on Nguyen-5)
  3. Variable Valid Rates: Valid rate drops to 46% on complex benchmarks (Nguyen-3, 4)
  4. Benchmark Performance: Average R² of 0.919 - good but not state-of-the-art
  5. No Nested Trigonometry: 0% nested trig functions (e.g., sin(cos(x)))

Model Scaling Insights

Compared to larger models in the study:

  • Medium (355M): +12.7% valid rate, +6.8% R² improvement
  • Large (774M): +26.5% valid rate, +7.2% R² improvement

Conclusion: For maximum quality, consider the Medium or Large models. For speed and cost-effectiveness, this Base model is excellent.

General Limitations

  • Trained only on infix notation
  • May generate expressions with division by zero or undefined operations
  • LoRA fine-tuning provides less adaptation than full fine-tuning
  • No reinforcement learning optimization (supervised learning only)
  • Performance depends heavily on prompt format (JSON required)

Ethical Considerations

  • Bias: Model inherits biases from GPT-2 pretraining, though the narrow symbolic-math focus limits their practical impact
  • Misuse: Could generate incorrect formulas - always validate outputs
  • Environmental: Small carbon footprint (~2-3 hours GPU training)
  • Transparency: All training details, hyperparameters, and evaluation metrics disclosed

Model Card Authors

Research Team: [Your Name/Institution]

Contact: [Email or GitHub]

Date: February 2025

Citation

If you use this model in your research, please cite:

@misc{gpt2_base_symbolic_regression_2025,
  title={GPT-2 Base Model for Symbolic Regression: A Model Scaling Study},
  author={[Your Name]},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/USER/gpt2_base_700K_json}},
  note={Trained on 700K expressions with LoRA fine-tuning}
}

Acknowledgments

  • Dataset: augustocsc/sintetico_natural (HuggingFace Hub)
  • Framework: HuggingFace Transformers, PEFT (LoRA)
  • Compute: AWS g5.xlarge with NVIDIA A10G GPU
  • Experiment Tracking: Weights & Biases

Additional Resources

  • Research Report: See SCIENTIFIC_REPORT_MODEL_SCALING.md
  • Training Details: See TRAINING_LOG_MODEL_SCALING_2025.md
  • Benchmark Results: See NGUYEN_RESULTS_FINAL.md
  • Visualizations: See visualizations/ directory
  • Comparison with Medium/Large: See FINAL_STATUS.md

Model Version: 1.0
Last Updated: 2026-02-04
Status: Production-Ready