CPUGym GCC V5: Reinforcement Learning for Compiler Optimization

Model Description

A PPO (Proximal Policy Optimization) agent trained to select GCC compilation flags for C programs. Given program features (LOC, loop counts, function counts, etc.), the agent iteratively builds a flag combination to minimize execution time.

This model was trained as part of the CPUGym project, which applies reinforcement learning to automate compiler optimization, a traditionally manual and expertise-intensive process.

Key Innovation

Unlike static flag selection (e.g., always using -O2), this model:

  • Adapts flags per program: different programs receive different optimization strategies
  • Composes flags incrementally: the agent toggles flags one at a time, learning which combinations are synergistic
  • Is trained with a curriculum: the target progresses from beating -O0 to -O1 to -O2 to the best known configuration

Architecture

Component           Details
Algorithm           PPO (Proximal Policy Optimization)
Policy              MlpPolicy (ActorCriticPolicy), 2-layer MLP [64, 64]
Framework           Stable-Baselines3 v2.x
Observation         Box(24) = [prog_features(8) | flag_state(12) | base_onehot(4)]
Action              Discrete(17) = STOP(0) | toggle_flag(1..12) | set_base(13..16)
Max Episode Length  5 steps

Observation Space (24 dimensions)

Dims   Name              Description
0-7    Program Features  size_kb, loc, function_count, avg_function_loc, for_loops, while_loops, do_loops, include_count
8-19   Flag State        Binary vector of 12 individual GCC flags
20-23  Base Level        One-hot encoding of base optimization level (-O1, -O2, -O3, -Ofast)
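
As a minimal sketch of the layout above, the helper below assembles a 24-element observation from its three blocks and validates their sizes. The function name `build_observation` and the choice to pass raw, unnormalized feature values are assumptions made for illustration; the actual environment may preprocess features differently.

```python
from typing import List, Sequence

N_PROG_FEATURES = 8   # dims 0-7
N_FLAGS = 12          # dims 8-19
N_BASE_LEVELS = 4     # dims 20-23: -O1, -O2, -O3, -Ofast


def build_observation(prog_features: Sequence[float],
                      flag_state: Sequence[int],
                      base_index: int) -> List[float]:
    """Concatenate program features, flag bits and base one-hot (hypothetical helper)."""
    if len(prog_features) != N_PROG_FEATURES:
        raise ValueError("expected 8 program features")
    if len(flag_state) != N_FLAGS or any(f not in (0, 1) for f in flag_state):
        raise ValueError("expected 12 binary flag values")
    if not 0 <= base_index < N_BASE_LEVELS:
        raise ValueError("base_index must be in 0..3")
    base_onehot = [1.0 if i == base_index else 0.0 for i in range(N_BASE_LEVELS)]
    return [float(x) for x in prog_features] + [float(f) for f in flag_state] + base_onehot


# Example: no flags set, base level -O1 (index 0)
obs = build_observation([5.0, 100.0, 4.0, 25.0, 12.0, 0.0, 0.0, 3.0], [0] * 12, base_index=0)
print(len(obs))
```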

Action Space (17 discrete actions)

Action  Description
0       STOP: finalize current flag combination
1-12    Toggle individual flag (march=native, ftree-vectorize, ffast-math, funroll-loops, fpeel-loops, ftree-loop-distribution, finline-functions, flto, fschedule-insns2, fomit-frame-pointer, fstrict-aliasing, ftree-loop-vectorize)
13-16   Set base optimization level (-O1, -O2, -O3, -Ofast)
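
The action encoding can be illustrated with a small state-transition function. This is a sketch under the index ranges listed above; `apply_action` is a hypothetical name, not the environment's own API, and it returns new values instead of mutating its inputs.

```python
from typing import List, Sequence, Tuple

STOP = 0
N_FLAGS = 12
N_BASE_LEVELS = 4


def apply_action(action: int,
                 flag_state: Sequence[int],
                 base_index: int) -> Tuple[List[int], int, bool]:
    """Return (new_flag_state, new_base_index, done) for one discrete action."""
    flags = list(flag_state)
    if action == STOP:
        return flags, base_index, True
    if 1 <= action <= N_FLAGS:
        flags[action - 1] ^= 1          # toggle: 0 -> 1 or 1 -> 0
        return flags, base_index, False
    if N_FLAGS + 1 <= action <= N_FLAGS + N_BASE_LEVELS:
        return flags, action - (N_FLAGS + 1), False
    raise ValueError(f"action {action} is outside Discrete(17)")


# Action 7 toggles the seventh flag (finline-functions); action 15 selects -O3
flags, base, done = apply_action(7, [0] * 12, 0)
flags, base, done = apply_action(15, flags, base)
print(flags, base, done)
```

Because flags are toggled, selecting the same flag action twice returns the flag to its previous value.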

Training Details

Hyperparameters

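from stable_baselines3 import PPO

# env: the vectorized CPUGym training environment (construction not shown here)
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    gamma=0.95,
    clip_range=0.1,
    ent_coef=0.1,
    batch_size=128,
    n_steps=128,
    n_epochs=10,
    gae_lambda=0.95,
    vf_coef=0.5,
    max_grad_norm=0.5,
)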
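
To make the interaction between `n_steps`, `batch_size` and `n_epochs` concrete, the arithmetic below works out the rollout size per policy update. It assumes the 12 parallel environments listed under Training Configuration and the usual Stable-Baselines3 PPO behaviour, where each update collects `n_steps` transitions per environment and then makes `n_epochs` passes over the buffer in minibatches.

```python
n_steps = 128      # transitions collected per environment before each update
n_envs = 12        # parallel environments (from Training Configuration)
batch_size = 128   # minibatch size
n_epochs = 10      # passes over the rollout buffer per update

rollout_size = n_steps * n_envs
# The buffer should divide evenly into minibatches to avoid a truncated last batch
assert rollout_size % batch_size == 0

minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_update = minibatches_per_epoch * n_epochs

print(rollout_size, minibatches_per_epoch, gradient_steps_per_update)  # 1536 12 120
```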

Training Configuration

  • Total timesteps: 500,000 per run × 5 completed runs
  • Environments: 12 parallel SubprocVecEnv
  • Measurement runs: 7 per compilation (min aggregation, SPEC CPU methodology)
  • Benchmark suite: PolyBench/C (30 programs across linear algebra, stencils, datamining, medley)
  • Hardware: Azure Container Apps, D16 CPU profile (16 vCPU, 64 GB RAM)
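
The min-of-7 measurement listed above can be sketched as follows. The helper name `min_runtime` and the use of a Python callable are assumptions for illustration; in the real pipeline the callable would presumably launch the compiled benchmark binary.

```python
import time
from typing import Callable


def min_runtime(run_once: Callable[[], None], repeats: int = 7) -> float:
    """Time `run_once` `repeats` times and return the smallest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; a real measurement would execute the compiled program instead
best = min_runtime(lambda: sum(range(10_000)))
print(f"best of 7: {best:.6f} s")
```

Taking the minimum filters out slow runs caused by scheduling noise, at the cost of describing best-case rather than typical behaviour.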

Curriculum Learning

The agent progresses through 4 phases of increasing difficulty:

Phase            Timestep Range  Baseline           Goal
beat_O0          0-20k           -O0                Learn that optimization helps
beat_O1          20k-80k         -O1                Surpass conservative optimization
beat_O2          80k-500k        -O2                Beat the standard default
beat_best_known  500k+           Best found so far  Continue improving
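
The phase schedule can be expressed as a simple lookup from timestep to phase. This is a sketch only: the names `Phase` and `phase_at` are hypothetical, and treating each boundary timestep (20k, 80k, 500k) as the start of the next phase is an assumption, since the table does not say which side the boundary belongs to.

```python
from typing import List, NamedTuple


class Phase(NamedTuple):
    name: str
    start: int      # first timestep of the phase (assumed inclusive)
    baseline: str   # configuration the agent must beat


PHASES: List[Phase] = [
    Phase("beat_O0", 0, "-O0"),
    Phase("beat_O1", 20_000, "-O1"),
    Phase("beat_O2", 80_000, "-O2"),
    Phase("beat_best_known", 500_000, "best found so far"),
]


def phase_at(timestep: int) -> Phase:
    """Return the curriculum phase active at a given training timestep."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    current = PHASES[0]
    for phase in PHASES:
        if timestep >= phase.start:
            current = phase
    return current


print(phase_at(50_000).name, phase_at(50_000).baseline)  # beat_O1 -O1
```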

Convergence Metrics (5 completed runs)

Run  Final Reward  Best Reward  Explained Variance  Loss
1    -3.277        -1.537       0.945               0.195
2    -3.377        -1.292       0.956               0.124
3    -1.585        -1.585       0.975               -0.189
4    -2.841        -1.532       0.965               -0.093
5    -2.797        -1.682       0.988               -0.226

Cross-run trend: improving. The average final reward rose from -3.327 (runs 1-2) to -2.408 (runs 3-5).
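
The two averages in the trend line can be reproduced from the Final Reward column. The split into runs 1-2 and runs 3-5 is inferred here because it matches the reported figures; it is not stated explicitly in the table.

```python
from statistics import mean

# Final Reward column, runs 1 through 5
final_rewards = [-3.277, -3.377, -1.585, -2.841, -2.797]

early = mean(final_rewards[:2])   # runs 1-2
late = mean(final_rewards[2:])    # runs 3-5

print(round(early, 3), round(late, 3))  # -3.327 -2.408
```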

Usage

Quick Start

import numpy as np
from stable_baselines3 import PPO

# Load model
model = PPO.load("best_model")

# Program features: [size_kb, loc, func_count, avg_func_loc, for_loops, while_loops, do_loops, includes]
prog_features = np.array([5.0, 100.0, 4.0, 25.0, 12.0, 0.0, 0.0, 3.0], dtype=np.float32)

# Initial state: no flags selected, base = -O1
flag_state = np.zeros(12, dtype=np.float32)
base_onehot = np.array([1, 0, 0, 0], dtype=np.float32)  # -O1

obs = np.concatenate([prog_features, flag_state, base_onehot])

# Run agent
for step in range(5):
    action, _ = model.predict(obs, deterministic=True)
    action = int(action)
    
    if action == 0:  # STOP
        break
    elif 1 <= action <= 12:  # Toggle flag
        flag_state[action - 1] = 1 - flag_state[action - 1]
    elif 13 <= action <= 16:  # Set base
        base_onehot = np.zeros(4, dtype=np.float32)
        base_onehot[action - 13] = 1
    
    obs = np.concatenate([prog_features, flag_state, base_onehot])

# Decode final flags
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]
# Names are spelled so that prefixing "-" yields a valid GCC option (e.g. -march=native)
FLAGS = ["march=native", "ftree-vectorize", "ffast-math", "funroll-loops",
         "fpeel-loops", "ftree-loop-distribution", "finline-functions", "flto",
         "fschedule-insns2", "fomit-frame-pointer", "fstrict-aliasing", "ftree-loop-vectorize"]

base = BASE_LEVELS[np.argmax(base_onehot)]
active = [f"-{FLAGS[i]}" for i in range(12) if flag_state[i] > 0.5]
print(f"Predicted: {base} {' '.join(active)}")
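
Once the flags are decoded, they can be turned into a compiler invocation. The sketch below builds an argument list suitable for `subprocess.run`; the helper name `gcc_command` and the file names `program.c` and `program` are placeholders. Note that GCC spells the architecture option `-march=native`.

```python
from typing import List, Sequence

BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]
FLAGS = ["march=native", "ftree-vectorize", "ffast-math", "funroll-loops",
         "fpeel-loops", "ftree-loop-distribution", "finline-functions", "flto",
         "fschedule-insns2", "fomit-frame-pointer", "fstrict-aliasing",
         "ftree-loop-vectorize"]


def gcc_command(base_index: int,
                flag_state: Sequence[float],
                source: str = "program.c",
                output: str = "program") -> List[str]:
    """Build a gcc argument list from a base level index and a 12-element flag vector."""
    if len(flag_state) != len(FLAGS):
        raise ValueError("flag_state must have 12 entries")
    active = [f"-{name}" for name, on in zip(FLAGS, flag_state) if on > 0.5]
    return ["gcc", BASE_LEVELS[base_index], *active, source, "-o", output]


# Example: -O1 with march=native and finline-functions enabled
state = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
cmd = gcc_command(0, state)
print(" ".join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting issues.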

Full Evaluation Pipeline

See evaluate_v5_naive.py in this repository for the complete evaluation framework that runs the model on PolyBench/C benchmarks and LLM-generated code.

Evaluation Results

Naive Evaluation (34 programs: 30 PolyBench + 4 LLM-generated)

Base Level Distribution: -O1: 91.2%, -O3: 8.8%

Top Flag Activations:

  • finline-functions: 35.3%
  • march=native: 20.6%
  • ffast-math: 5.9%
  • fpeel-loops: 5.9%
  • flto: 5.9%

Key Finding: The model is program-adaptive, producing 10 different flag combinations across 34 programs, with an action entropy of 2.94 bits out of a maximum of 4.09 bits (72%).
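
The entropy figure can be put in context with a short calculation. This sketch assumes the 4.09 denominator is the base-2 entropy of a uniform choice over the 17 actions, which matches log2(17); the helper `entropy_bits` and the toy action lists are illustrative and are not the evaluation data.

```python
import math
from collections import Counter
from typing import Iterable


def entropy_bits(actions: Iterable[int]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of the given actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


max_entropy = math.log2(17)              # uniform over 17 discrete actions
print(round(max_entropy, 2))             # 4.09
print(round(2.94 / max_entropy, 2))      # 0.72

# Toy check: four equally likely actions carry exactly 2 bits
print(entropy_bits([0, 1, 2, 3]))        # 2.0
```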

Interpretation

The model's preference for -O1 + individual flags over -O2/-O3 is noteworthy. This suggests that for the PolyBench benchmarks:

  1. -O1 -finline-functions (35% of programs): The model learned that aggressive inlining on top of conservative optimization is often the best strategy, avoiding the overhead of O2/O3's full optimization pipeline.

  2. -O1 -march=native (21% of programs): For SIMD-amenable code, native architecture flags on minimal optimization outperform generic O2.

  3. -O3 without extras (9%): Only for large, complex programs (gemver, lu, gaussian_blur) does the model select the full O3 pipeline. Because compilation time is not part of the reward, this choice reflects execution time alone.

Limitations

  • Trained only on PolyBench/C benchmarks (30 programs), so it may not generalize to all C programs
  • Does not consider compilation time in the reward signal
  • Flag space is limited to 12 hand-selected GCC flags
  • Evaluation is based on execution time only (no memory or code size optimization)
  • The model uses min-time aggregation (SPEC CPU methodology) which may not reflect average-case performance

Citation

@software{cpugym_gcc_v5,
  title={CPUGym: Reinforcement Learning for GCC Compiler Optimization},
  author={CompilOpt Team},
  year={2026},
  url={https://huggingface.co/compilopt/cpugym-gcc-v5}
}

License

MIT License
