CPUGym GCC V5 — Reinforcement Learning for Compiler Optimization

Model Description

A PPO (Proximal Policy Optimization) agent trained to select GCC compilation flags for C programs. Given eight static program features (source size, LOC, function and loop counts, include count), the agent builds a flag combination over at most 5 steps, with the goal of minimizing execution time.

This model was trained as part of the CPUGym project, which applies reinforcement learning to automate compiler optimization — a traditionally manual and expertise-intensive process.

Key Innovation

Unlike static flag selection (e.g., always using -O2), this model:

Adapts flags per program: Different programs receive different optimization strategies
Composes flags incrementally: The agent toggles one flag or sets the base level at each step, learning which combinations are synergistic
Is trained with a curriculum: The baseline to beat progresses from -O0 to -O1 to -O2 to the best known configuration

Architecture

Component	Details
Algorithm	PPO (Proximal Policy Optimization)
Policy	MlpPolicy (ActorCriticPolicy) — 2-layer MLP [64, 64]
Framework	Stable-Baselines3 v2.x
Observation	Box(24) = [prog_features(8) | flag_state(12) | base_onehot(4)]
Action	Discrete(17) = STOP(0) | toggle_flag(1..12) | set_base(13..16)
Max Episode Length	5 steps

Observation Space (24 dimensions)

Dims	Name	Description
0-7	Program Features	size_kb, loc, function_count, avg_function_loc, for_loops, while_loops, do_loops, include_count
8-19	Flag State	Binary vector of 12 individual GCC flags
20-23	Base Level	One-hot encoding of base optimization level (-O1, -O2, -O3, -Ofast)
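
The layout in the table above can be sketched as a small helper that assembles and checks the 24-dimensional vector. This is a minimal sketch in plain Python: `build_observation` and the constant names are hypothetical, and because this card does not say whether the training environment normalizes the program features, raw values are assumed.

```python
# Feature order follows the "Program Features" row of the table.
PROGRAM_FEATURES = [
    "size_kb", "loc", "function_count", "avg_function_loc",
    "for_loops", "while_loops", "do_loops", "include_count",
]
NUM_FLAGS = 12
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]


def build_observation(features, flag_state, base_level):
    """Concatenate [prog_features(8) | flag_state(12) | base_onehot(4)]."""
    if len(flag_state) != NUM_FLAGS:
        raise ValueError(f"expected {NUM_FLAGS} flag bits, got {len(flag_state)}")
    if base_level not in BASE_LEVELS:
        raise ValueError(f"unknown base level: {base_level}")
    prog = [float(features[name]) for name in PROGRAM_FEATURES]
    flags = [1.0 if bit else 0.0 for bit in flag_state]
    onehot = [1.0 if level == base_level else 0.0 for level in BASE_LEVELS]
    return prog + flags + onehot


# Hypothetical program: 5 KB, 100 lines, 4 functions, 12 for-loops, 3 includes.
example = build_observation(
    {"size_kb": 5, "loc": 100, "function_count": 4, "avg_function_loc": 25,
     "for_loops": 12, "while_loops": 0, "do_loops": 0, "include_count": 3},
    flag_state=[0] * NUM_FLAGS,
    base_level="-O1",
)
print(len(example))
```

Keeping the feature order in one list makes it harder to feed the policy a vector whose dimensions are silently permuted.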

Action Space (17 discrete actions)

Action	Description
0	STOP — finalize current flag combination
1-12	Toggle individual flag (march=native, ftree-vectorize, ffast-math, funroll-loops, fpeel-loops, ftree-loop-distribution, finline-functions, flto, fschedule-insns2, fomit-frame-pointer, fstrict-aliasing, ftree-loop-vectorize)
13-16	Set base optimization level (-O1, -O2, -O3, -Ofast)
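
The action encoding above can be illustrated with a pure state-transition function. This is a sketch under the assumption that the environment applies actions exactly as the table describes; `apply_action` is a hypothetical name, not a function from the CPUGym code base.

```python
FLAGS = [
    "march=native", "ftree-vectorize", "ffast-math", "funroll-loops",
    "fpeel-loops", "ftree-loop-distribution", "finline-functions", "flto",
    "fschedule-insns2", "fomit-frame-pointer", "fstrict-aliasing",
    "ftree-loop-vectorize",
]
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]


def apply_action(action, flag_state, base_index):
    """Return (new_flag_state, new_base_index, done) for one discrete action."""
    if action == 0:                       # STOP: keep the state, end the episode
        return list(flag_state), base_index, True
    if 1 <= action <= 12:                 # toggle one of the 12 flags
        new_flags = list(flag_state)
        new_flags[action - 1] = 1 - new_flags[action - 1]
        return new_flags, base_index, False
    if 13 <= action <= 16:                # select one of the 4 base levels
        return list(flag_state), action - 13, False
    raise ValueError(f"action must be in 0..16, got {action}")


# Action 4 toggles the fourth flag; action 15 selects the third base level.
flags, base, done = apply_action(4, [0] * 12, 0)
flags, base, done = apply_action(15, flags, base)
print(BASE_LEVELS[base], [FLAGS[i] for i, bit in enumerate(flags) if bit])
```

Because flags are toggled rather than set, repeating the same flag action twice returns the state to where it started, which costs the agent two of its five steps.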

Training Details

Hyperparameters

PPO(
    learning_rate=1e-3,
    gamma=0.95,
    clip_range=0.1,
    ent_coef=0.1,
    batch_size=128,
    n_steps=128,
    n_epochs=10,
    gae_lambda=0.95,
    vf_coef=0.5,
    max_grad_norm=0.5,
)
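
Combined with the 12 parallel environments and the 500,000-timestep budget listed under Training Configuration, these settings determine the size of each PPO update. The arithmetic below is a back-of-the-envelope sketch, assuming the usual Stable-Baselines3 behavior of collecting `n_steps` transitions per environment before every update; the variable names are illustrative.

```python
import math

n_steps, n_envs, batch_size, n_epochs = 128, 12, 128, 10
total_timesteps = 500_000

# Transitions collected before each policy update.
rollout_size = n_steps * n_envs
# Minibatches drawn from one rollout per epoch.
minibatches_per_epoch = rollout_size // batch_size
# Gradient steps taken on each rollout.
gradient_steps_per_update = minibatches_per_epoch * n_epochs
# Rollouts needed to reach the timestep budget (the last one overshoots slightly).
updates = math.ceil(total_timesteps / rollout_size)

print(rollout_size, minibatches_per_epoch, gradient_steps_per_update, updates)
```

The rollout size is an exact multiple of the batch size, so no truncated minibatch is produced.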

Training Configuration

Total timesteps: 500,000 per run × 5 completed runs
Environments: 12 parallel SubprocVecEnv
Measurement runs: 7 per compilation (min aggregation, SPEC CPU methodology)
Benchmark suite: PolyBench/C (30 programs across linear algebra, stencils, datamining, medley)
Hardware: Azure Container Apps, D16 CPU profile (16 vCPU, 64 GB RAM)
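
The "7 runs, min aggregation" measurement can be sketched as follows. This is an illustrative harness, not the project's own: `min_wall_time` is a hypothetical helper, it measures wall-clock time including process start-up, and the command being timed here is a placeholder Python invocation rather than a compiled benchmark.

```python
import subprocess
import sys
import time


def min_wall_time(cmd, runs=7):
    """Run cmd `runs` times and return (fastest_time_in_seconds, all_timings)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings), timings


# Placeholder workload: a Python process that exits immediately.
best, all_runs = min_wall_time([sys.executable, "-c", "pass"])
print(f"best of {len(all_runs)} runs: {best:.4f} s")
```

Taking the minimum filters out runs slowed by scheduling noise, at the cost of hiding run-to-run variance.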

Curriculum Learning

The agent progresses through 4 phases of increasing difficulty:

Phase	Timestep Range	Baseline	Goal
beat_O0	0-20k	-O0	Learn that optimization helps
beat_O1	20k-80k	-O1	Surpass conservative optimization
beat_O2	80k-500k	-O2	Beat the standard default
beat_best_known	500k+	Best found so far	Continue improving
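
The schedule above amounts to a lookup from timestep to phase. The sketch below assumes half-open ranges (a boundary timestep belongs to the later phase) and leaves open whether the timestep counter is per run or cumulative across runs, since neither is stated here; `phase_for` is a hypothetical name.

```python
# (first timestep of the phase, phase name, baseline to beat)
CURRICULUM = [
    (0, "beat_O0", "-O0"),
    (20_000, "beat_O1", "-O1"),
    (80_000, "beat_O2", "-O2"),
    (500_000, "beat_best_known", None),  # None: best configuration found so far
]


def phase_for(timestep):
    """Return (phase_name, baseline) for a non-negative timestep."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    name, baseline = CURRICULUM[0][1], CURRICULUM[0][2]
    for start, phase_name, phase_baseline in CURRICULUM:
        if timestep >= start:
            name, baseline = phase_name, phase_baseline
    return name, baseline


for t in (0, 19_999, 20_000, 250_000, 500_000):
    print(t, phase_for(t))
```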

Convergence Metrics (5 completed runs)

Run	Final Reward	Best Reward	Explained Variance	Loss
1	-3.277	-1.537	0.945	0.195
2	-3.377	-1.292	0.956	0.124
3	-1.585	-1.585	0.975	-0.189
4	-2.841	-1.532	0.965	-0.093
5	-2.797	-1.682	0.988	-0.226

Cross-run trend: IMPROVING (mean final reward rose from -3.327 over runs 1-2 to -2.408 over runs 3-5)
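
The two averages in the trend line match the mean final reward of runs 1-2 and of runs 3-5 in the table. A short check, using only the values reported above:

```python
from statistics import mean

# Final rewards for runs 1-5, copied from the convergence table.
final_rewards = [-3.277, -3.377, -1.585, -2.841, -2.797]

early = mean(final_rewards[:2])  # runs 1-2
late = mean(final_rewards[2:])   # runs 3-5

print(f"runs 1-2: {early:.3f}  runs 3-5: {late:.3f}")
```

With only five runs, split two against three, the trend is suggestive rather than statistically established.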

Usage

Quick Start

import numpy as np
from stable_baselines3 import PPO

# Load model
model = PPO.load("best_model")

# Program features: [size_kb, loc, func_count, avg_func_loc, for_loops, while_loops, do_loops, includes]
prog_features = np.array([5.0, 100.0, 4.0, 25.0, 12.0, 0.0, 0.0, 3.0], dtype=np.float32)

# Initial state: no flags selected, base = -O1
flag_state = np.zeros(12, dtype=np.float32)
base_onehot = np.array([1, 0, 0, 0], dtype=np.float32)  # -O1

obs = np.concatenate([prog_features, flag_state, base_onehot])

# Run agent
for step in range(5):
    action, _ = model.predict(obs, deterministic=True)
    action = int(action)
    
    if action == 0:  # STOP
        break
    elif 1 <= action <= 12:  # Toggle flag
        flag_state[action - 1] = 1 - flag_state[action - 1]
    elif 13 <= action <= 16:  # Set base
        base_onehot = np.zeros(4, dtype=np.float32)
        base_onehot[action - 13] = 1
    
    obs = np.concatenate([prog_features, flag_state, base_onehot])

# Decode final flags
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]
# GCC spells the architecture flag -march=native, so the name uses "=" rather than "-"
FLAGS = ["march=native", "ftree-vectorize", "ffast-math", "funroll-loops",
         "fpeel-loops", "ftree-loop-distribution", "finline-functions", "flto",
         "fschedule-insns2", "fomit-frame-pointer", "fstrict-aliasing", "ftree-loop-vectorize"]

base = BASE_LEVELS[np.argmax(base_onehot)]
active = [f"-{FLAGS[i]}" for i in range(12) if flag_state[i] > 0.5]
print(f"Predicted: {base} {' '.join(active)}")
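
To try the prediction, the decoded base level and flag names can be turned into a compiler invocation. This is a sketch only: `gcc_command`, `kernel.c`, and the output name are hypothetical, and a `march-native` entry, if present, is rewritten to GCC's `-march=native` spelling. Building the command as a list avoids shell quoting issues.

```python
import shlex


def gcc_command(base, flag_names, source, output="a.out", compiler="gcc"):
    """Build an argument list such as ['gcc', '-O1', '-flto', 'x.c', '-o', 'x']."""
    args = [compiler, base]
    for name in flag_names:
        if name == "march-native":  # not a valid GCC spelling
            name = "march=native"
        args.append(f"-{name}")
    return args + [source, "-o", output]


cmd = gcc_command("-O1", ["finline-functions", "march-native"], "kernel.c", "kernel")
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` on a machine with GCC installed.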

Full Evaluation Pipeline

See evaluate_v5_naive.py in this repository for the complete evaluation framework that runs the model on PolyBench/C benchmarks and LLM-generated code.

Evaluation Results

Naive Evaluation (34 programs: 30 PolyBench + 4 LLM-generated)

Base Level Distribution: -O1: 91.2%, -O3: 8.8%

Top Flag Activations:

finline-functions: 35.3%
march=native: 20.6%
ffast-math: 5.9%
fpeel-loops: 5.9%
flto: 5.9%
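
With 34 programs, each reported percentage corresponds to a whole number of programs. The check below recovers those counts from the figures above; it is a consistency sketch, and the counts are implied by the percentages rather than stated explicitly in this card.

```python
TOTAL_PROGRAMS = 34

# Percentages as reported for base levels and flag activations.
reported = {
    "-O1": 91.2, "-O3": 8.8,
    "finline-functions": 35.3, "march=native": 20.6,
    "ffast-math": 5.9, "fpeel-loops": 5.9, "flto": 5.9,
}

# Nearest whole number of programs for each percentage.
implied = {name: round(pct / 100 * TOTAL_PROGRAMS) for name, pct in reported.items()}

for name, count in implied.items():
    # Recompute the percentage from the count to confirm it round-trips.
    print(f"{name}: {count}/{TOTAL_PROGRAMS} = {100 * count / TOTAL_PROGRAMS:.1f}%")
```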

Key Finding: The model is program-adaptive. It produces 10 different flag combinations across 34 programs, with an action entropy of 2.94 bits out of a maximum of log2(17) ≈ 4.09 bits (72%).
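
The 4.09 figure matches log2(17), the entropy in bits of a uniform distribution over the 17 actions, which is presumably the normalizer used. A minimal sketch of the calculation, with `shannon_entropy_bits` as a hypothetical helper and no claim about how the evaluation script computes it:

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


max_entropy = math.log2(17)      # uniform over 17 actions
normalized = 2.94 / max_entropy  # reported entropy as a fraction of the maximum

print(f"max = {max_entropy:.2f} bits, normalized = {normalized:.0%}")
```

A policy that always emitted the same action would score 0 bits, and one that chose uniformly among all 17 actions would score the maximum.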

Interpretation

The model's preference for -O1 plus individual flags over -O2/-O3 is noteworthy. For the evaluated programs, this suggests:

-O1 -finline-functions (35% of programs): The model learned to favor aggressive inlining on top of conservative optimization, suggesting that the additional -O2/-O3 passes did not pay off in execution time for these programs.
-O1 -march=native (21% of programs): For SIMD-amenable code, the model favors native architecture flags at -O1 over generic -O2.
-O3 without extras (9%): The model selects the full -O3 pipeline only for a few programs (gemver, lu, gaussian_blur). Because compilation time is not part of the reward, this choice reflects execution time alone.

Limitations

Trained only on PolyBench/C benchmarks (30 programs) — may not generalize to all C programs
Does not consider compilation time in the reward signal
Flag space is limited to 12 hand-selected GCC flags
Evaluation is based on execution time only (no memory or code size optimization)
Timing uses min aggregation over the 7 measurement runs (SPEC CPU methodology), which may not reflect average-case performance

Citation

@software{cpugym_gcc_v5,
  title={CPUGym: Reinforcement Learning for GCC Compiler Optimization},
  author={CompilOpt Team},
  year={2026},
  url={https://huggingface.co/compilopt/cpugym-gcc-v5}
}

License

MIT License

