CPUGym GCC V5: Reinforcement Learning for Compiler Optimization
Model Description
A PPO (Proximal Policy Optimization) agent trained to select GCC compilation flags for C programs. Given program features (LOC, loop counts, function counts, etc.), the agent iteratively builds a flag combination to minimize execution time.
This model was trained as part of the CPUGym project, which applies reinforcement learning to automate compiler optimization, a traditionally manual and expertise-intensive process.
Key Innovation
Unlike static flag selection (e.g., always using -O2), this model learns to:
- Adapt flags per-program: Different programs receive different optimization strategies
- Compose flags incrementally: The agent toggles flags one at a time, learning which combinations are synergistic
- Use curriculum learning: Training progresses from beating -O0 to -O1 to -O2 to the best known configuration
Architecture
| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Policy | MlpPolicy (ActorCriticPolicy), 2-layer MLP [64, 64] |
| Framework | Stable Baselines 3 v2.x |
| Observation | Box(24) = [prog_features(8), flag_state(12), base_onehot(4)] |
| Action | Discrete(17) = STOP(0), toggle_flag(1..12), set_base(13..16) |
| Max Episode Length | 5 steps |
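As a rough size check, the parameter count of this architecture can be worked out by hand. The sketch below assumes separate actor and critic networks with no shared layers, each a fully connected 24 → 64 → 64 → output stack with biases. That layout is an assumption about the policy, not something the table states, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumption: actor maps 24 observations to 17 action logits,
# critic maps the same 24 observations to a single value estimate.
actor_params = mlp_param_count([24, 64, 64, 17])
critic_params = mlp_param_count([24, 64, 64, 1])

print(actor_params)                  # 6865
print(critic_params)                 # 5825
print(actor_params + critic_params)  # 12690
```

Under these assumptions the whole policy is on the order of 10^4 parameters, which is small enough to run inference on CPU.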
Observation Space (24 dimensions)
| Dims | Name | Description |
|---|---|---|
| 0-7 | Program Features | size_kb, loc, function_count, avg_function_loc, for_loops, while_loops, do_loops, include_count |
| 8-19 | Flag State | Binary vector of 12 individual GCC flags |
| 20-23 | Base Level | One-hot encoding of base optimization level (-O1, -O2, -O3, -Ofast) |
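The layout above can be sketched as a small helper that validates each segment and concatenates them in the documented order. `build_observation` and the constant names are hypothetical and not part of the repository; plain lists are used here, and the result can be wrapped in a float32 NumPy array before being passed to the model.

```python
from typing import Sequence

N_PROG_FEATURES = 8  # size_kb, loc, function_count, avg_function_loc, for/while/do loops, include_count
N_FLAGS = 12         # one binary entry per individual GCC flag
N_BASE_LEVELS = 4    # -O1, -O2, -O3, -Ofast


def build_observation(prog_features: Sequence[float],
                      flag_state: Sequence[int],
                      base_index: int) -> list:
    """Concatenate program features, flag bits and base one-hot into 24 floats."""
    if len(prog_features) != N_PROG_FEATURES:
        raise ValueError("expected 8 program features")
    if len(flag_state) != N_FLAGS or any(f not in (0, 1) for f in flag_state):
        raise ValueError("expected 12 binary flag entries")
    if not 0 <= base_index < N_BASE_LEVELS:
        raise ValueError("base_index must be in 0..3")
    base_onehot = [1.0 if i == base_index else 0.0 for i in range(N_BASE_LEVELS)]
    return [float(x) for x in prog_features] + [float(f) for f in flag_state] + base_onehot


obs = build_observation([5.0, 100.0, 4.0, 25.0, 12.0, 0.0, 0.0, 3.0],
                        [0] * N_FLAGS, base_index=0)
print(len(obs))    # 24
print(obs[20:24])  # [1.0, 0.0, 0.0, 0.0]
```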
Action Space (17 discrete actions)
| Action | Description |
|---|---|
| 0 | STOP: finalize current flag combination |
| 1-12 | Toggle individual flag (march-native, ftree-vectorize, ffast-math, funroll-loops, fpeel-loops, ftree-loop-distribution, finline-functions, flto, fschedule-insns2, fomit-frame-pointer, fstrict-aliasing, ftree-loop-vectorize) |
| 13-16 | Set base optimization level (-O1, -O2, -O3, -Ofast) |
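The mapping from action index to effect can be sketched as follows. `decode_action` and `apply_action` are hypothetical names used for illustration; only the index ranges come from the table above.

```python
def decode_action(action: int):
    """Map a Discrete(17) action index to (kind, zero-based index)."""
    if action == 0:
        return ("stop", -1)
    if 1 <= action <= 12:
        return ("toggle_flag", action - 1)
    if 13 <= action <= 16:
        return ("set_base", action - 13)
    raise ValueError(f"action {action} is outside 0..16")


def apply_action(action: int, flag_state, base_index: int):
    """Return (new_flag_state, new_base_index, done) without mutating the inputs."""
    kind, idx = decode_action(action)
    flags = list(flag_state)
    if kind == "stop":
        return flags, base_index, True
    if kind == "toggle_flag":
        flags[idx] = 1 - flags[idx]  # toggling twice restores the original bit
    else:
        base_index = idx
    return flags, base_index, False


# Action 7 toggles the 7th flag in the table (finline-functions, index 6).
flags, base, done = apply_action(7, [0] * 12, base_index=0)
print(flags[6], base, done)  # 1 0 False
```

Because individual flags are toggled rather than set, choosing the same flag action twice within an episode cancels out.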
Training Details
Hyperparameters
PPO(
    learning_rate=1e-3,
    gamma=0.95,
    clip_range=0.1,
    ent_coef=0.1,
    batch_size=128,
    n_steps=128,
    n_epochs=10,
    gae_lambda=0.95,
    vf_coef=0.5,
    max_grad_norm=0.5,
)
Training Configuration
- Total timesteps: 500,000 per run × 5 completed runs
- Environments: 12 parallel SubprocVecEnv
- Measurement runs: 7 per compilation (min aggregation, SPEC CPU methodology)
- Benchmark suite: PolyBench/C (30 programs across linear algebra, stencils, datamining, medley)
- Hardware: Azure Container Apps, D16 CPU profile (16 vCPU, 64 GB RAM)
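Taken together, the hyperparameters and the 12 parallel environments fix how much data each PPO update sees. The arithmetic below is a sketch that relies on Stable Baselines 3 collecting `n_steps` transitions from every environment per rollout; the update count assumes only full rollouts are collected, so it is approximate.

```python
import math

n_steps = 128              # from the hyperparameters
n_envs = 12                # parallel SubprocVecEnv workers
batch_size = 128
n_epochs = 10
total_timesteps = 500_000

rollout_size = n_steps * n_envs                                 # transitions per update
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_update = minibatches_per_epoch * n_epochs
policy_updates = math.ceil(total_timesteps / rollout_size)      # assumes full rollouts only

print(rollout_size)               # 1536
print(minibatches_per_epoch)      # 12
print(gradient_steps_per_update)  # 120
print(policy_updates)             # 326
```

With at most 5 steps per episode, each rollout of 1,536 transitions covers at least 307 episodes.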
Curriculum Learning
The agent progresses through 4 phases of increasing difficulty:
| Phase | Timestep Range | Baseline | Goal |
|---|---|---|---|
| beat_O0 | 0 → 20k | -O0 | Learn that optimization helps |
| beat_O1 | 20k → 80k | -O1 | Surpass conservative optimization |
| beat_O2 | 80k → 500k | -O2 | Beat the standard default |
| beat_best_known | 500k+ | Best found so far | Continue improving |
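The schedule can be read as a lookup from the current timestep to a phase and its baseline. The sketch below is illustrative only: `CURRICULUM` and `phase_for_timestep` are hypothetical names, and treating each range as half-open (start included, end excluded) is an assumption, since the table does not say which phase owns a boundary step.

```python
# (phase name, first timestep, end timestep or None for open-ended, baseline)
CURRICULUM = [
    ("beat_O0", 0, 20_000, "-O0"),
    ("beat_O1", 20_000, 80_000, "-O1"),
    ("beat_O2", 80_000, 500_000, "-O2"),
    ("beat_best_known", 500_000, None, "best found so far"),
]


def phase_for_timestep(t: int):
    """Return (phase, baseline) for a global timestep, using half-open ranges."""
    if t < 0:
        raise ValueError("timestep must be non-negative")
    for name, start, end, baseline in CURRICULUM:
        if t >= start and (end is None or t < end):
            return name, baseline
    raise AssertionError("unreachable: the last phase is open-ended")


print(phase_for_timestep(10_000))   # ('beat_O0', '-O0')
print(phase_for_timestep(250_000))  # ('beat_O2', '-O2')
```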
Convergence Metrics (5 completed runs)
| Run | Final Reward | Best Reward | Explained Variance | Loss |
|---|---|---|---|---|
| 1 | -3.277 | -1.537 | 0.945 | 0.195 |
| 2 | -3.377 | -1.292 | 0.956 | 0.124 |
| 3 | -1.585 | -1.585 | 0.975 | -0.189 |
| 4 | -2.841 | -1.532 | 0.965 | -0.093 |
| 5 | -2.797 | -1.682 | 0.988 | -0.226 |
Cross-run trend: IMPROVING (average final reward -3.327 over runs 1-2 → -2.408 over runs 3-5)
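The two averages in the trend line can be reproduced from the table. Splitting the runs into 1-2 and 3-5 is inferred from the fact that it reproduces the reported figures exactly; the card does not spell the split out.

```python
# Final rewards per run, copied from the convergence table.
final_reward = {1: -3.277, 2: -3.377, 3: -1.585, 4: -2.841, 5: -2.797}


def mean_final(runs):
    """Average final reward over the given run numbers."""
    return sum(final_reward[r] for r in runs) / len(runs)


early = mean_final([1, 2])
late = mean_final([3, 4, 5])

print(round(early, 3))  # -3.327
print(round(late, 3))   # -2.408
```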
Usage
Quick Start
import numpy as np
from stable_baselines3 import PPO

# Load model
model = PPO.load("best_model")

# Program features: [size_kb, loc, func_count, avg_func_loc, for_loops, while_loops, do_loops, includes]
prog_features = np.array([5.0, 100.0, 4.0, 25.0, 12.0, 0.0, 0.0, 3.0], dtype=np.float32)

# Initial state: no flags selected, base = -O1
flag_state = np.zeros(12, dtype=np.float32)
base_onehot = np.array([1, 0, 0, 0], dtype=np.float32)  # -O1
obs = np.concatenate([prog_features, flag_state, base_onehot])

# Run agent for at most 5 steps (the maximum episode length)
for step in range(5):
    action, _ = model.predict(obs, deterministic=True)
    action = int(action)
    if action == 0:  # STOP
        break
    elif 1 <= action <= 12:  # Toggle flag
        flag_state[action - 1] = 1 - flag_state[action - 1]
    elif 13 <= action <= 16:  # Set base
        base_onehot = np.zeros(4, dtype=np.float32)
        base_onehot[action - 13] = 1
    # Rebuild the observation so the next prediction sees the updated state
    obs = np.concatenate([prog_features, flag_state, base_onehot])

# Decode final flags
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]
# First entry uses GCC's spelling (-march=native); the tables list it as march-native
FLAGS = ["march=native", "ftree-vectorize", "ffast-math", "funroll-loops",
         "fpeel-loops", "ftree-loop-distribution", "finline-functions", "flto",
         "fschedule-insns2", "fomit-frame-pointer", "fstrict-aliasing", "ftree-loop-vectorize"]
base = BASE_LEVELS[np.argmax(base_onehot)]
active = [f"-{FLAGS[i]}" for i in range(12) if flag_state[i] > 0.5]
print(f"Predicted: {base} {' '.join(active)}")
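To actually compile with a predicted configuration, the decoded names have to be turned into a GCC argument list. The helper below is a hypothetical sketch: `to_gcc_args`, the default file names, and the `GCC_SPELLING` table are illustrative and not part of the repository. It maps the card's flag name `march-native` to GCC's real spelling `-march=native` and prefixes every other name with a dash.

```python
# Flag names whose GCC spelling differs from "-" + name.
GCC_SPELLING = {"march-native": "-march=native"}


def to_gcc_args(base, active_flags, source="prog.c", output="prog"):
    """Build an argv list such as ['gcc', '-O1', '-finline-functions', 'prog.c', '-o', 'prog']."""
    args = ["gcc", base]
    for name in active_flags:
        args.append(GCC_SPELLING.get(name, f"-{name}"))
    args += [source, "-o", output]
    return args


argv = to_gcc_args("-O1", ["march-native", "finline-functions"])
print(" ".join(argv))  # gcc -O1 -march=native -finline-functions prog.c -o prog
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine with GCC installed.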
Full Evaluation Pipeline
See evaluate_v5_naive.py in this repository for the complete evaluation
framework that runs the model on PolyBench/C benchmarks and LLM-generated code.
Evaluation Results
Naive Evaluation (34 programs: 30 PolyBench + 4 LLM-generated)
Base Level Distribution: -O1: 91.2%, -O3: 8.8%
Top Flag Activations:
- finline-functions: 35.3%
- march-native: 20.6%
- ffast-math: 5.9%
- fpeel-loops: 5.9%
- flto: 5.9%
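With only 34 programs, each percentage corresponds to a small whole number of programs. The sketch below converts the reported percentages back to counts and checks that the counts round-trip to the same one-decimal percentages. The counts are inferred from the percentages, not taken from the evaluation logs.

```python
N_PROGRAMS = 34

# Percentages as reported above (base levels and top flag activations).
reported_pct = {
    "-O1 base": 91.2,
    "-O3 base": 8.8,
    "finline-functions": 35.3,
    "march-native": 20.6,
    "ffast-math": 5.9,
    "fpeel-loops": 5.9,
    "flto": 5.9,
}

counts = {name: round(pct / 100 * N_PROGRAMS) for name, pct in reported_pct.items()}

for name, n in counts.items():
    # Each inferred count reproduces the reported percentage to one decimal place.
    assert round(100 * n / N_PROGRAMS, 1) == reported_pct[name]
    print(f"{name}: {n}/{N_PROGRAMS}")
```

Under this reading, 31 programs keep the -O1 base and 3 switch to -O3, while the three least frequent flags are each active for just 2 programs.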
Key Finding: The model is program-adaptive. It produces 10 different flag combinations across 34 programs, with an action entropy of 2.94 out of a maximum of 4.09 bits (log2 of 17 actions), i.e. 72%.
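The 4.09 figure matches the maximum Shannon entropy, in bits, of a uniform distribution over the 17 actions. The sketch below shows that bound and a generic entropy function. `shannon_entropy_bits` is a hypothetical helper, and reading the reported value as entropy in bits over the action distribution is an inference from the numbers rather than a documented definition.

```python
import math
from collections import Counter


def shannon_entropy_bits(actions):
    """Shannon entropy (base 2) of the empirical distribution of a sequence of actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


max_entropy = math.log2(17)              # uniform over all 17 actions
print(round(max_entropy, 2))             # 4.09
print(round(2.94 / max_entropy, 2))      # 0.72

# Sanity check on a toy sequence: four equally likely actions carry 2 bits.
print(shannon_entropy_bits([0, 1, 2, 3]))  # 2.0
```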
Interpretation
The model's preference for -O1 + individual flags over -O2/-O3 is noteworthy.
This suggests that for the PolyBench benchmarks:
- -O1 -finline-functions (35% of programs): The model learned that aggressive inlining on top of conservative optimization is often the best strategy, avoiding the overhead of O2/O3's full optimization pipeline.
- -O1 -march=native (21% of programs): For SIMD-amenable code, native architecture flags on minimal optimization outperform generic O2.
- -O3 without extras (9%): Only for large, complex programs (gemver, lu, gaussian_blur) does the full O3 pipeline justify its compile-time cost.
Limitations
- Trained only on PolyBench/C benchmarks (30 programs), so it may not generalize to all C programs
- Does not consider compilation time in the reward signal
- Flag space is limited to 12 hand-selected GCC flags
- Evaluation is based on execution time only (no memory or code size optimization)
- The model uses min-time aggregation (SPEC CPU methodology) which may not reflect average-case performance
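The last limitation can be illustrated with a toy comparison of the two aggregations over 7 measurement runs. The timings are made-up values chosen to include one noisy outlier; they are not measurements from this project.

```python
import statistics

# Hypothetical wall-clock times (seconds) for 7 runs of the same binary.
timings = [1.02, 1.00, 1.31, 1.01, 1.00, 1.05, 1.00]

best_case = min(timings)             # min aggregation, as used for the reward
average = statistics.mean(timings)   # what a typical run would look like

print(best_case)                     # 1.0
print(average > best_case)           # True
```

Taking the minimum filters out interference from other processes, which is useful for a stable reward signal, but it hides run-to-run variance that matters when average latency is the real target.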
Citation
@software{cpugym_gcc_v5,
title={CPUGym: Reinforcement Learning for GCC Compiler Optimization},
author={CompilOpt Team},
year={2026},
url={https://huggingface.co/compilopt/cpugym-gcc-v5}
}
License
MIT License
Evaluation results
- mean_reward (self-reported): -1.585
- explained_variance (self-reported): 0.988