cpugym-v5-gcc-optimizer

A reinforcement learning agent trained with PPO (Stable-Baselines3) to select GCC optimization flags for C programs. This is a research checkpoint from the V5 convergence-first training run — the model learns sequential flag composition via a curriculum but has not yet surpassed -O2 on PolyBench/C. It serves as a baseline for the V6 architecture.

Model Description

CPUGym V5 uses a convergence-first design with:

  • Reduced action space: 17 actions (12 individual flags + 4 base optimization levels + STOP)
  • Potential-based reward shaping (Ng et al., 1999) for stable intermediate rewards
  • Curriculum learning: O0 → O1 → O2 → best-known progressive difficulty
  • Behavioral cloning cold-start from 8 expert optimization strategies

Architecture

Component          Details
Algorithm          PPO (Proximal Policy Optimization)
Policy             MlpPolicy (64×64 hidden layers)
Observation        Box(24): [program_features(8) + flag_state(12) + base_onehot(4)]
Action             Discrete(17): STOP(0) | toggle_flag(1-12) | set_base(13-16)
Max steps/episode  5
Framework          Stable-Baselines3
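
The observation and action layout in the table can be sketched in plain Python. This is a minimal illustration, not the environment's actual code: the helper name `apply_action` and the ordering of entries inside each segment are assumptions.

```python
# Segment sizes from the Architecture table; offsets inside the vector are assumed.
N_FEATURES, N_FLAGS, N_BASES = 8, 12, 4
FLAG_OFFSET = N_FEATURES              # flag_state occupies obs[8:20]
BASE_OFFSET = N_FEATURES + N_FLAGS    # base_onehot occupies obs[20:24]
OBS_DIM = BASE_OFFSET + N_BASES       # 24
STOP = 0


def apply_action(obs, action):
    """Return (new_obs, done) after one discrete action (hypothetical helper).

    0 = STOP, 1-12 = toggle a flag, 13-16 = set the base optimization level.
    """
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM} values, got {len(obs)}")
    if not 0 <= action <= N_FLAGS + N_BASES:
        raise ValueError(f"action must be in 0..16, got {action}")
    new_obs = list(obs)
    if action == STOP:
        return new_obs, True
    if action <= N_FLAGS:
        i = FLAG_OFFSET + (action - 1)
        new_obs[i] = 1.0 - new_obs[i]      # toggle between 0.0 and 1.0
    else:
        for j in range(N_BASES):           # clear the one-hot segment
            new_obs[BASE_OFFSET + j] = 0.0
        new_obs[BASE_OFFSET + (action - N_FLAGS - 1)] = 1.0
    return new_obs, False
```

With at most 5 steps per episode, a policy can set one base level and toggle a handful of flags before it must stop.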

Optimization Flags (12)

Category              Flags
Vectorization & SIMD  -march=native, -ftree-vectorize
Math                  -ffast-math
Loop optimizations    -funroll-loops, -fpeel-loops, -ftree-loop-distribution
Inlining              -finline-functions
IPO                   -flto (auto-adds -fwhole-program)
Scheduling & codegen  -fschedule-insns2, -fomit-frame-pointer
Memory                -fstrict-aliasing
Loop vectorization    -ftree-loop-vectorize

Base Optimization Levels (4)

-O1, -O2, -O3, -Ofast
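
A chosen base level and flag state translate into a compiler invocation roughly as follows. This is a sketch under assumptions: the flag ordering, the function name `build_gcc_command`, and the file names `prog.c` and `prog` are illustrative, and the `-flto` handling simply mirrors the "auto-adds -fwhole-program" note in the flag table.

```python
# The 12 flags and 4 base levels listed above; the index order is assumed.
FLAGS = [
    "-march=native", "-ftree-vectorize", "-ffast-math", "-funroll-loops",
    "-fpeel-loops", "-ftree-loop-distribution", "-finline-functions", "-flto",
    "-fschedule-insns2", "-fomit-frame-pointer", "-fstrict-aliasing",
    "-ftree-loop-vectorize",
]
BASES = ["-O1", "-O2", "-O3", "-Ofast"]


def build_gcc_command(base_index, flag_state, source="prog.c", output="prog"):
    """Build a gcc argument list from a base level and 12 on/off flag values."""
    if len(flag_state) != len(FLAGS):
        raise ValueError(f"expected {len(FLAGS)} flag values")
    cmd = ["gcc", BASES[base_index]]
    for flag, enabled in zip(FLAGS, flag_state):
        if enabled:
            cmd.append(flag)
            if flag == "-flto":
                cmd.append("-fwhole-program")  # added together with -flto
    cmd += [source, "-o", output]
    return cmd
```

The returned list can be passed directly to `subprocess.run` without going through a shell.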

Training Details

Hyperparameters

Parameter              Value
Learning rate          1e-3
Discount (γ)           0.95
GAE (λ)                0.9
Clip range             0.1
Entropy coefficient    0.1
Batch size             128
N-steps                128
N-epochs               10
Parallel environments  16
Total timesteps        500000
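
The table maps onto Stable-Baselines3 `PPO` constructor arguments as sketched below. The keyword names are the library's own; the helper `rollout_stats` and the variable `vec_env` mentioned in the comments are hypothetical, and the actual training script may be organized differently.

```python
# Hyperparameters from the table, keyed by Stable-Baselines3 PPO argument names.
PPO_KWARGS = dict(
    learning_rate=1e-3,
    gamma=0.95,
    gae_lambda=0.9,
    clip_range=0.1,
    ent_coef=0.1,
    batch_size=128,
    n_steps=128,
    n_epochs=10,
    policy_kwargs=dict(net_arch=[64, 64]),  # 64x64 hidden layers
)
N_ENVS = 16
TOTAL_TIMESTEPS = 500_000


def rollout_stats(n_steps, n_envs, batch_size, total_timesteps):
    """Return (rollout size, minibatches per epoch, rollouts needed)."""
    rollout = n_steps * n_envs
    minibatches = rollout // batch_size
    rollouts_needed = -(-total_timesteps // rollout)  # ceiling division
    return rollout, minibatches, rollouts_needed


# With SB3 installed and a vectorized environment `vec_env` (not shown here):
#   from stable_baselines3 import PPO
#   model = PPO("MlpPolicy", vec_env, **PPO_KWARGS)
#   model.learn(total_timesteps=TOTAL_TIMESTEPS)
```

For these values each rollout holds 2,048 transitions, split into 16 minibatches per epoch, and 245 rollouts are needed to cover 500,000 timesteps.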

Curriculum Schedule

Phase               Timesteps  Baseline    Target
1. Beat -O0         0–20k      -O0         Trivial warm-up
2. Beat -O1         20k–80k    -O1         Learn specific flags
3. Beat -O2         80k–500k   -O2         Core optimization target
4. Beat best-known  500k+      Best found  Research frontier
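
The schedule amounts to a lookup from the global timestep to the active baseline. The sketch below assumes half-open phase boundaries and uses illustrative phase names; only `beat_O2` appears elsewhere in this card.

```python
# (start timestep, phase name, baseline) taken from the schedule above.
# Phase names other than "beat_O2" are assumed for illustration.
CURRICULUM = [
    (0, "beat_O0", "-O0"),
    (20_000, "beat_O1", "-O1"),
    (80_000, "beat_O2", "-O2"),
    (500_000, "beat_best_known", "best-known"),
]


def curriculum_phase(timestep):
    """Return (phase name, baseline) active at the given global timestep."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    name, baseline = CURRICULUM[0][1], CURRICULUM[0][2]
    for start, phase_name, phase_baseline in CURRICULUM:
        if timestep >= start:          # keep the latest phase already reached
            name, baseline = phase_name, phase_baseline
    return name, baseline
```

Because training stops at 500,000 timesteps, phase 4 is effectively never trained in this run.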

Cold Start

Pre-trained with behavioral cloning from 8 expert strategies:

  • Vectorization-focused (-O3 -march=native -ftree-vectorize)
  • Math-aggressive (-Ofast -ffast-math)
  • Loop-focused (-O3 -funroll-loops -ftree-loop-distribution)
  • Full pipeline (-O3 -march=native -flto -funroll-loops)
  • And 4 more domain-specific combinations

Reward Design

  • Terminal reward: log(t_baseline / t_agent) — positive when agent beats baseline
  • Intermediate reward: Potential-based shaping (Φ = flag coverage ratio)
  • Conflict penalty: -0.1 for selecting flags already implied by the base level
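
The three reward terms can be written out as below. This is a sketch rather than the environment's code: the function names are hypothetical, the potential Φ is taken as an input because the exact definition of "flag coverage ratio" is not given here, and adding the conflict penalty to the shaping term on the same step is an assumption.

```python
import math

GAMMA = 0.95             # discount from the hyperparameter table
CONFLICT_PENALTY = -0.1  # applied when a flag is already implied by the base level


def terminal_reward(t_baseline, t_agent):
    """log(t_baseline / t_agent): positive when the agent's binary is faster."""
    return math.log(t_baseline / t_agent)


def shaping_reward(phi_prev, phi_next, gamma=GAMMA):
    """Potential-based shaping term F = gamma * phi(s') - phi(s) (Ng et al., 1999)."""
    return gamma * phi_next - phi_prev


def step_reward(phi_prev, phi_next, conflict=False, gamma=GAMMA):
    """Intermediate reward for one non-terminal step."""
    reward = shaping_reward(phi_prev, phi_next, gamma)
    if conflict:
        reward += CONFLICT_PENALTY
    return reward
```

Because the shaping term is a discounted difference of potentials, it changes intermediate rewards without changing which policies are optimal, which is the property the Ng et al. result guarantees.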

Usage

from stable_baselines3 import PPO
import numpy as np

# Load model
model = PPO.load("path/to/model.zip")

# Create observation (24-dim)
# [program_features(8) + flag_state(12) + base_onehot(4)]
obs = np.zeros(24, dtype=np.float32)
# ... set program features from extract_program_features()

# Get action (predict returns a NumPy array; convert it to a plain int)
action, _ = model.predict(obs, deterministic=True)
action = int(action)
# action 0 = STOP, 1-12 = toggle flag, 13-16 = set base level

Intended Use

This model is designed for compiler optimization research. It shows how an RL agent can be trained to select GCC optimization flags via curriculum learning and sequential flag composition; this checkpoint does not yet beat -O2 (see Evaluation Results).

Not intended for: Production compiler toolchains without thorough validation.

Evaluation Results (Azure linux/amd64, GCC 10)

Phase 1: Naive PolyBench Evaluation (30 programs × 7 baselines × 7 runs)

Metric                    Value
Beat -O2                  0/30 (0%)
Beat best baseline        0/30 (0%)
Avg speedup vs -O2        -265.6% (3.3× slower)
Geomean time ratio vs O2  3.32×

The evaluated checkpoint (448k steps, curriculum phase 3: beat_O2) selects flags that produce slower code than -O2 and beats it on none of the 30 programs. It tends to choose -O1 plus individual flags, or bare -O2 without useful additions.
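
Summary metrics of this kind can be computed from per-program timings along the following lines. The function names are hypothetical, and the definition of average speedup as the mean of (1 - time ratio) × 100 is an assumption; the evaluation script may define it differently.

```python
import math


def time_ratios(agent_times, o2_times):
    """Per-program ratio t_agent / t_O2; values above 1 mean the agent is slower."""
    if len(agent_times) != len(o2_times) or not agent_times:
        raise ValueError("need two non-empty lists of equal length")
    return [a / b for a, b in zip(agent_times, o2_times)]


def geomean(values):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def avg_speedup_percent(ratios):
    """Mean of (1 - ratio) * 100 (assumed definition); negative means slower."""
    return 100.0 * sum(1.0 - r for r in ratios) / len(ratios)


def beat_count(ratios):
    """Number of programs on which the agent's binary is strictly faster."""
    return sum(1 for r in ratios if r < 1.0)
```

The geometric mean is the usual choice for aggregating time ratios because it is not dominated by a few very slow programs, which is one reason an average-based figure and a geomean-based figure can differ.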

Phase 2: O2-vs-O3 Classification Test (13 synthetic benchmarks)

11/13 passed (85%) — validates that the test infrastructure correctly differentiates O2-favorable vs O3-favorable programs on the target hardware.

Program             O2 time  O3 time  Speedup  Category
dense_matmul        0.416s   0.224s   1.86×    O3-favorable
simd_vectorize      0.321s   0.182s   1.76×    O3-favorable
stencil_2d          0.182s   0.143s   1.27×    O3-favorable
loop_unroll_target  0.098s   0.083s   1.19×    O3-favorable
branch_heavy        0.808s   0.810s   1.00×    O2-favorable
linked_list_walk    4.871s   4.891s   1.00×    O2-favorable
icache_pressure     0.065s   0.065s   1.00×    O2-favorable
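
The categories in the table follow from the ratio of O2 time to O3 time. A minimal sketch of that labeling is shown below; the 1.05 threshold and the function name `classify` are assumptions, since the cutoff used by the test infrastructure is not stated here.

```python
# (O2 time, O3 time) in seconds, copied from the table above.
TIMINGS = {
    "dense_matmul": (0.416, 0.224),
    "simd_vectorize": (0.321, 0.182),
    "stencil_2d": (0.182, 0.143),
    "loop_unroll_target": (0.098, 0.083),
    "branch_heavy": (0.808, 0.810),
    "linked_list_walk": (4.871, 4.891),
    "icache_pressure": (0.065, 0.065),
}


def classify(o2_time, o3_time, threshold=1.05):
    """Return (speedup, label), where speedup = O2 time / O3 time.

    The threshold is a hypothetical cutoff, not the one used in the evaluation.
    """
    speedup = o2_time / o3_time
    label = "O3-favorable" if speedup >= threshold else "O2-favorable"
    return speedup, label


for name, (t_o2, t_o3) in TIMINGS.items():
    speedup, label = classify(t_o2, t_o3)
    print(f"{name}: {speedup:.2f}x {label}")
```

Speedups recomputed from the rounded times can differ in the last digit from the table, which was presumably derived from unrounded measurements.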

Phase 3: Agent O2/O3 Flag Selection (7 known-outcome programs)

4/7 correct (57%) — the agent always defaults to O2 as base level (correct for O2-favorable programs, wrong for O3-favorable ones like dense_matmul, stencil_2d, vector_reduction). This is expected: the model was still in the beat_O2 curriculum phase and hadn't learned when to escalate to O3.

Program           Expected  Agent chose  Result
branch_heavy      O2        O2           CORRECT
icache_pressure   O2        O2           CORRECT
linked_list_walk  O2        O2           CORRECT
sort_and_search   O2        O2           CORRECT
dense_matmul      O3        O2           WRONG
stencil_2d        O3        O2           WRONG
vector_reduction  O3        O2           WRONG

Interpretation

This checkpoint is a curriculum-in-progress model: it learned "O2 is safe" but hasn't discovered when O3/Ofast provides measurable benefit. The V6 architecture addresses this with synthetic data augmentation, LLM-generated training programs with known-optimal flags, and extended training (1M+ steps).

Training Infrastructure

  • Azure Container Apps (D16 workload profile, 16 vCPU, linux/amd64)
  • Training cost: ~$65
  • Training time: ~13 hours (500k timesteps)

Citation

@software{cpugym_v5,
  title={CPUGym V5: Convergence-First GCC Optimization via Reinforcement Learning},
  year={2026},
  url={https://github.com/pznachab_amadeus/CPUGym}
}

License

MIT

Evaluation results (self-reported)

  • Mean Episode Reward (best eval) on PolyBench/C: -1.770
  • Programs Beating -O2 (%) on PolyBench/C: 0.000
  • O2/O3 Classification Accuracy (%) on PolyBench/C: 57.000
  • O2-vs-O3 Ground Truth Validation (%) on PolyBench/C: 85.000