cpugym-v5-gcc-optimizer

A reinforcement learning agent trained with PPO (Stable-Baselines3) to select GCC optimization flags for C programs. This is a research checkpoint from the V5 convergence-first training run — the model learns sequential flag composition via a curriculum but has not yet surpassed -O2 on PolyBench/C. It serves as a baseline for the V6 architecture.

Model Description

CPUGym V5 uses a convergence-first design with:

Reduced action space: 17 actions (12 individual flags + 4 base optimization levels + STOP)
Potential-based reward shaping (Ng et al., 1999) for stable intermediate rewards
Curriculum learning with progressive difficulty: O0 → O1 → O2 → best-known
Behavioral cloning cold-start from 8 expert optimization strategies

Architecture

Component	Details
Algorithm	PPO (Proximal Policy Optimization)
Policy	MlpPolicy (64×64 hidden layers)
Observation	Box(24): [program_features(8) + flag_state(12) + base_onehot(4)]
Action	Discrete(17): STOP(0) | toggle_flag(1-12) | set_base(13-16)
Max steps/episode	5
Framework	Stable-Baselines3

Optimization Flags (12)

Category	Flags
Vectorization & SIMD	`-march=native`, `-ftree-vectorize`
Math	`-ffast-math`
Loop optimizations	`-funroll-loops`, `-fpeel-loops`, `-ftree-loop-distribution`
Inlining	`-finline-functions`
IPO	`-flto` (auto-adds `-fwhole-program`)
Scheduling & codegen	`-fschedule-insns2`, `-fomit-frame-pointer`
Memory	`-fstrict-aliasing`
Loop vectorization	`-ftree-loop-vectorize`

Base Optimization Levels (4)

-O1, -O2, -O3, -Ofast
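
The action layout and flag tables above can be tied together with a small decoder. This is a minimal sketch, not the environment's actual code: the card does not say which action index corresponds to which flag or base level, so the orderings in `FLAGS` and `BASES` are assumptions, and `decode_action` and `build_gcc_args` are hypothetical helper names.

```python
from typing import Iterable, List, Optional, Tuple

# ASSUMPTION: the card lists the 12 flags but not their action indices;
# this order simply follows the table, top to bottom.
FLAGS = [
    "-march=native", "-ftree-vectorize", "-ffast-math", "-funroll-loops",
    "-fpeel-loops", "-ftree-loop-distribution", "-finline-functions", "-flto",
    "-fschedule-insns2", "-fomit-frame-pointer", "-fstrict-aliasing",
    "-ftree-loop-vectorize",
]
# ASSUMPTION: actions 13..16 map to the base levels in this order.
BASES = ["-O1", "-O2", "-O3", "-Ofast"]


def decode_action(action: int) -> Tuple[str, Optional[str]]:
    """Map a Discrete(17) action to (kind, flag) following the card's layout."""
    if action == 0:
        return ("stop", None)
    if 1 <= action <= 12:
        return ("toggle", FLAGS[action - 1])
    if 13 <= action <= 16:
        return ("base", BASES[action - 13])
    raise ValueError(f"action out of range: {action}")


def build_gcc_args(base: str, enabled: Iterable[str]) -> List[str]:
    """Assemble GCC arguments from a base level and the enabled flags."""
    enabled = set(enabled)
    args = [base] + [flag for flag in FLAGS if flag in enabled]
    if "-flto" in enabled:
        # The card notes that -flto auto-adds -fwhole-program.
        args.append("-fwhole-program")
    return args
```

For example, `build_gcc_args("-O3", {"-flto"})` yields `["-O3", "-flto", "-fwhole-program"]`.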

Training Details

Hyperparameters

Parameter	Value
Learning rate	1e-3
Discount (γ)	0.95
GAE (λ)	0.9
Clip range	0.1
Entropy coefficient	0.1
Batch size	128
N-steps	128
N-epochs	10
Parallel environments	16
Total timesteps	500000
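
As a rough sketch, the table maps onto Stable-Baselines3 `PPO` keyword arguments as shown below. The pairing of table labels with keyword names is an interpretation of the card, not a copy of the training script. Because `n_steps` in Stable-Baselines3 counts steps per environment, the arithmetic also shows how large each rollout is and how many rollouts the run needs.

```python
import math

# Keyword names are those of stable_baselines3.PPO; values come from the table.
ppo_kwargs = {
    "learning_rate": 1e-3,
    "gamma": 0.95,
    "gae_lambda": 0.9,
    "clip_range": 0.1,
    "ent_coef": 0.1,
    "batch_size": 128,
    "n_steps": 128,   # per environment
    "n_epochs": 10,
}
N_ENVS = 16
TOTAL_TIMESTEPS = 500_000

# One rollout collects n_steps transitions from every parallel environment.
rollout_size = ppo_kwargs["n_steps"] * N_ENVS                              # 2048
minibatches_per_epoch = rollout_size // ppo_kwargs["batch_size"]           # 16
gradient_steps_per_rollout = minibatches_per_epoch * ppo_kwargs["n_epochs"]  # 160
rollouts_needed = math.ceil(TOTAL_TIMESTEPS / rollout_size)                # 245

# Hypothetical usage, assuming `env` is a vectorized CPUGym environment:
# model = PPO("MlpPolicy", env, policy_kwargs={"net_arch": [64, 64]}, **ppo_kwargs)
```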

Curriculum Schedule

Phase	Timesteps	Baseline	Target
1. Beat -O0	0–20k	`-O0`	Trivial warm-up
2. Beat -O1	20k–80k	`-O1`	Learn specific flags
3. Beat -O2	80k–500k	`-O2`	Core optimization target
4. Beat best-known	500k+	Best found	Research frontier
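
The schedule can be read as a lookup from the global timestep to the active phase. The sketch below assumes half-open intervals (a phase starts exactly at its lower bound). Only the name `beat_O2` appears in this card; the other phase names follow the same pattern and are illustrative.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    name: str
    start: int      # first timestep at which the phase is active
    baseline: str   # baseline the agent must beat


# Boundaries taken from the curriculum table; names other than
# "beat_O2" are hypothetical.
PHASES = [
    Phase("beat_O0", 0, "-O0"),
    Phase("beat_O1", 20_000, "-O1"),
    Phase("beat_O2", 80_000, "-O2"),
    Phase("beat_best_known", 500_000, "best-known"),
]


def phase_at(timestep: int) -> Phase:
    """Return the curriculum phase active at a given training timestep."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    current = PHASES[0]
    for phase in PHASES:
        if timestep >= phase.start:
            current = phase
    return current
```

With these boundaries, a run capped at 500,000 timesteps ends just as the last phase would begin, and a checkpoint taken at 448k steps falls in `beat_O2`.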

Cold Start

Pre-trained with behavioral cloning from 8 expert strategies:

Vectorization-focused (-O3 -march=native -ftree-vectorize)
Math-aggressive (-Ofast -ffast-math)
Loop-focused (-O3 -funroll-loops -ftree-loop-distribution)
Full pipeline (-O3 -march=native -flto -funroll-loops)
And 4 more domain-specific combinations
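
One way such strategies could be turned into behavioral-cloning demonstrations is to encode each flag string as an action sequence. The card does not describe the actual encoding, so this is only a sketch: the flag-to-index order, the choice to set the base level first, and the assumption that STOP counts toward the 5-step limit are all illustrative.

```python
from typing import List

# ASSUMPTION: index order follows the flag table; the card does not specify it.
FLAGS = [
    "-march=native", "-ftree-vectorize", "-ffast-math", "-funroll-loops",
    "-fpeel-loops", "-ftree-loop-distribution", "-finline-functions", "-flto",
    "-fschedule-insns2", "-fomit-frame-pointer", "-fstrict-aliasing",
    "-ftree-loop-vectorize",
]
BASES = ["-O1", "-O2", "-O3", "-Ofast"]  # assumed to map to actions 13..16
STOP = 0
MAX_STEPS = 5  # from the architecture table


def strategy_to_actions(flag_string: str) -> List[int]:
    """Encode an expert flag string as a sequence of discrete actions."""
    actions = []
    for token in flag_string.split():
        if token in BASES:
            actions.append(13 + BASES.index(token))
        elif token in FLAGS:
            actions.append(1 + FLAGS.index(token))
        else:
            raise ValueError(f"flag not in the action space: {token}")
    actions.append(STOP)
    if len(actions) > MAX_STEPS:
        raise ValueError("strategy does not fit in one episode")
    return actions
```

Under these assumptions the "Full pipeline" strategy uses the whole episode budget: one base-level action, three flag toggles, and STOP.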

Reward Design

Terminal reward: log(t_baseline / t_agent) — positive when agent beats baseline
Intermediate reward: Potential-based shaping (Φ = flag coverage ratio)
Conflict penalty: -0.1 for selecting flags already implied by the base level
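
The three reward terms can be sketched as follows. The shaping term uses the standard potential-based form F(s, s') = γΦ(s') − Φ(s) from Ng et al. (1999). The card does not define how the flag coverage ratio Φ is computed, so it is taken here as an input. The natural logarithm and the function names are assumptions.

```python
import math

GAMMA = 0.95             # discount from the hyperparameter table
CONFLICT_PENALTY = -0.1  # from the reward design list


def shaping_reward(phi_prev: float, phi_next: float, gamma: float = GAMMA) -> float:
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_prev


def step_reward(phi_prev: float, phi_next: float, conflict: bool) -> float:
    """Intermediate reward: shaping term plus the penalty for a redundant flag."""
    reward = shaping_reward(phi_prev, phi_next)
    if conflict:
        reward += CONFLICT_PENALTY
    return reward


def terminal_reward(t_baseline: float, t_agent: float) -> float:
    """log(t_baseline / t_agent): positive when the agent's build runs faster."""
    if t_baseline <= 0 or t_agent <= 0:
        raise ValueError("run times must be positive")
    return math.log(t_baseline / t_agent)
```

Under the natural-log assumption, a build that runs 3.32× slower than its baseline would receive a terminal reward of about -1.2.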

Usage

from stable_baselines3 import PPO
import numpy as np

# Load model
model = PPO.load("path/to/model.zip")

# Create observation (24-dim)
# [program_features(8) + flag_state(12) + base_onehot(4)]
obs = np.zeros(24, dtype=np.float32)
# ... set program features from extract_program_features()

# Get action
action, _ = model.predict(obs, deterministic=True)
# action 0 = STOP, 1-12 = toggle flag, 13-16 = set base level
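
A single `predict` call returns only one action, while an episode composes up to 5 of them. The sketch below shows one way to drive that loop outside the environment. It assumes the observation layout from the architecture table (flag state at indices 8-19, base one-hot at 20-23) and that toggling a flag flips its bit. The real environment's transition logic may differ, and `apply_action` and `rollout` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

FLAG_OFFSET = 8    # flag_state occupies obs[8:20]
BASE_OFFSET = 20   # base_onehot occupies obs[20:24]
MAX_STEPS = 5


def apply_action(obs: List[float], action: int) -> None:
    """Update the flag/base portion of the observation in place (assumed semantics)."""
    if 1 <= action <= 12:
        index = FLAG_OFFSET + action - 1
        obs[index] = 1.0 - obs[index]          # toggle the flag bit
    elif 13 <= action <= 16:
        for j in range(4):
            obs[BASE_OFFSET + j] = 0.0         # clear the previous base level
        obs[BASE_OFFSET + action - 13] = 1.0
    else:
        raise ValueError(f"not a state-changing action: {action}")


def rollout(policy: Callable[[Sequence[float]], int],
            obs: Sequence[float]) -> Tuple[List[int], List[float]]:
    """Query the policy until it emits STOP (0) or the step limit is reached."""
    state = list(obs)
    taken = []
    for _ in range(MAX_STEPS):
        action = int(policy(state))
        taken.append(action)
        if action == 0:
            break
        apply_action(state, action)
    return taken, state
```

With the loaded model, the policy could be wrapped as `lambda o: model.predict(np.asarray(o, dtype=np.float32), deterministic=True)[0]`.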

Intended Use

This model is designed for compiler optimization research. It demonstrates a training setup in which an RL agent composes GCC optimization flags sequentially under a curriculum; this checkpoint does not yet beat -O2 on PolyBench/C.

Not intended for: Production compiler toolchains without thorough validation.

Evaluation Results (Azure linux/amd64, GCC 10)

Phase 1: Naive PolyBench Evaluation (30 programs × 7 baselines × 7 runs)

Metric	Value
Beat -O2	0/30 (0%)
Beat best baseline	0/30 (0%)
Avg speedup vs -O2	-265.6% (about 3.3× slower by geomean, see next row)
Geomean time ratio vs O2	3.32×

The agent at 448k steps (curriculum phase 3: beat_O2) selects flags that produce slower code than -O2. It tends to choose -O1 + individual flags or bare -O2 without useful additions.
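
For reference, a geomean time ratio of this kind can be computed as below. This is a sketch under stated assumptions: the card does not say how the 7 runs per configuration were aggregated, so the median is used here, and both function names are illustrative.

```python
from statistics import geometric_mean, median
from typing import Sequence


def time_ratio(agent_runs: Sequence[float], o2_runs: Sequence[float]) -> float:
    """Per-program ratio of agent time to -O2 time (values above 1 mean slower)."""
    # ASSUMPTION: repeated runs are summarized by their median.
    return median(agent_runs) / median(o2_runs)


def geomean_ratio(ratios: Sequence[float]) -> float:
    """Geometric mean of per-program time ratios."""
    return geometric_mean(ratios)
```

An arithmetic mean over the same ratios is never smaller than the geomean, which may be why the two rows of the table do not coincide; the card does not state how the average speedup was computed.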

Phase 2: O2-vs-O3 Classification Test (13 synthetic benchmarks)

11/13 passed (85%) — validates that the test infrastructure correctly differentiates O2-favorable vs O3-favorable programs on the target hardware.

Program	O2 time	O3 time	Speedup (O2 time / O3 time)	Category
dense_matmul	0.416s	0.224s	1.86×	O3-favorable
simd_vectorize	0.321s	0.182s	1.76×	O3-favorable
stencil_2d	0.182s	0.143s	1.27×	O3-favorable
loop_unroll_target	0.098s	0.083s	1.19×	O3-favorable
branch_heavy	0.808s	0.810s	1.00×	O2-favorable
linked_list_walk	4.871s	4.891s	1.00×	O2-favorable
icache_pressure	0.065s	0.065s	1.00×	O2-favorable
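
The categories in the table are consistent with a simple threshold on the O2/O3 time ratio. The card does not state the rule that was used, so the 1.05 cutoff below is an assumption chosen only because it separates the rows shown.

```python
from typing import Tuple

# ASSUMPTION: the card does not give the cutoff; 1.05 is illustrative.
O3_THRESHOLD = 1.05


def classify(o2_time: float, o3_time: float,
             threshold: float = O3_THRESHOLD) -> Tuple[float, str]:
    """Return (speedup of -O3 over -O2, category label)."""
    speedup = o2_time / o3_time
    label = "O3-favorable" if speedup >= threshold else "O2-favorable"
    return speedup, label
```

Ratios recomputed from the rounded times in the table can differ from the reported speedups in the last digit, since the reported values were presumably computed from unrounded measurements.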

Phase 3: Agent O2/O3 Flag Selection (7 known-outcome programs)

4/7 correct (57%) — the agent always defaults to O2 as base level (correct for O2-favorable programs, wrong for O3-favorable ones like dense_matmul, stencil_2d, vector_reduction). This is expected: the model was still in the beat_O2 curriculum phase and hadn't learned when to escalate to O3.

Program	Expected	Agent chose	Result
branch_heavy	O2	O2	CORRECT
icache_pressure	O2	O2	CORRECT
linked_list_walk	O2	O2	CORRECT
sort_and_search	O2	O2	CORRECT
dense_matmul	O3	O2	WRONG
stencil_2d	O3	O2	WRONG
vector_reduction	O3	O2	WRONG

Interpretation

This checkpoint is a curriculum-in-progress model: it learned "O2 is safe" but hasn't discovered when O3/Ofast provides measurable benefit. The V6 architecture addresses this with synthetic data augmentation, LLM-generated training programs with known-optimal flags, and extended training (1M+ steps).

Training Infrastructure

Azure Container Apps (D16 workload profile, 16 vCPU, linux/amd64)
Training cost: ~$65
Training time: ~13 hours (500k timesteps)

Citation

@software{cpugym_v5,
  title={CPUGym V5: Convergence-First GCC Optimization via Reinforcement Learning},
  year={2026},
  url={https://github.com/pznachab_amadeus/CPUGym}
}

License

MIT


Evaluation results (self-reported)

Metric	Dataset	Value
Mean Episode Reward (best eval)	PolyBench/C	-1.770
Programs Beating -O2 (%)	PolyBench/C	0.000
O2/O3 Classification Accuracy (%)	PolyBench/C	57.000
O2-vs-O3 Ground Truth Validation (%)	PolyBench/C	85.000