cpugym-v5-gcc-optimizer
A reinforcement learning agent trained with PPO (Stable-Baselines3) to select GCC optimization flags for C programs. This is a research checkpoint from the V5 convergence-first training run — the model learns sequential flag composition via a curriculum but has not yet surpassed -O2 on PolyBench/C. It serves as a baseline for the V6 architecture.
Model Description
CPUGym V5 uses a convergence-first design with:
- Reduced action space: 17 actions (12 individual flags + 4 base optimization levels + STOP)
- Potential-based reward shaping (Ng et al., 1999) for stable intermediate rewards
- Curriculum learning: O0 → O1 → O2 → best-known progressive difficulty
- Behavioral cloning cold-start from 8 expert optimization strategies
Architecture
| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Policy | MlpPolicy (64×64 hidden layers) |
| Observation | Box(24): [program_features(8) + flag_state(12) + base_onehot(4)] |
| Action | Discrete(17): STOP (0), toggle_flag (1-12), set_base (13-16) |
| Max steps/episode | 5 |
| Framework | Stable-Baselines3 |
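The observation and action layout in the table can be sketched as plain index arithmetic. This is a minimal illustration, not the project's code: the helper name `decode_action`, the `OBS_SLICES` mapping, and the ordering of the base levels are assumptions made for this example.

```python
from typing import Tuple

N_FLAGS = 12
# Order of the base levels within actions 13-16 is assumed, not documented.
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]

# Slices into the 24-dim observation, following the table:
# program_features(8) + flag_state(12) + base_onehot(4).
OBS_SLICES = {
    "program_features": slice(0, 8),
    "flag_state": slice(8, 20),
    "base_onehot": slice(20, 24),
}


def decode_action(action: int) -> Tuple[str, int]:
    """Map a Discrete(17) action id to (kind, zero-based index)."""
    if action == 0:
        return ("stop", -1)
    if 1 <= action <= N_FLAGS:
        return ("toggle_flag", action - 1)
    if N_FLAGS < action <= N_FLAGS + len(BASE_LEVELS):
        return ("set_base", action - N_FLAGS - 1)
    raise ValueError(f"action {action} is outside Discrete(17)")


if __name__ == "__main__":
    for a in (0, 1, 12, 13, 16):
        print(a, decode_action(a))
```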
Optimization Flags (12)
| Category | Flags |
|---|---|
| Vectorization & SIMD | -march=native, -ftree-vectorize |
| Math | -ffast-math |
| Loop optimizations | -funroll-loops, -fpeel-loops, -ftree-loop-distribution |
| Inlining | -finline-functions |
| IPO | -flto (auto-adds -fwhole-program) |
| Scheduling & codegen | -fschedule-insns2, -fomit-frame-pointer |
| Memory | -fstrict-aliasing |
| Loop vectorization | -ftree-loop-vectorize |
Base Optimization Levels (4)
-O1, -O2, -O3, -Ofast
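As a rough sketch of how a base level and a 12-bit flag state could be turned into GCC arguments: the function name `build_gcc_args` and the order of `FLAGS` (taken from the table above) are assumptions for illustration. The only behavior taken from this card is that -flto also adds -fwhole-program.

```python
from typing import List, Sequence

# Flag order follows the table above; the real index order is not documented.
FLAGS = [
    "-march=native", "-ftree-vectorize", "-ffast-math",
    "-funroll-loops", "-fpeel-loops", "-ftree-loop-distribution",
    "-finline-functions", "-flto", "-fschedule-insns2",
    "-fomit-frame-pointer", "-fstrict-aliasing", "-ftree-loop-vectorize",
]


def build_gcc_args(base: str, flag_state: Sequence[int]) -> List[str]:
    """Return the GCC argument list for a base level plus enabled flags."""
    if len(flag_state) != len(FLAGS):
        raise ValueError("flag_state must have 12 entries")
    args = [base]
    for flag, enabled in zip(FLAGS, flag_state):
        if enabled:
            args.append(flag)
            if flag == "-flto":
                # Per the flag table: -flto auto-adds -fwhole-program.
                args.append("-fwhole-program")
    return args


if __name__ == "__main__":
    state = [0] * 12
    state[FLAGS.index("-flto")] = 1
    state[FLAGS.index("-funroll-loops")] = 1
    print(" ".join(["gcc"] + build_gcc_args("-O3", state) + ["prog.c"]))
```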
Training Details
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-3 |
| Discount (γ) | 0.95 |
| GAE (λ) | 0.9 |
| Clip range | 0.1 |
| Entropy coefficient | 0.1 |
| Batch size | 128 |
| N-steps | 128 |
| N-epochs | 10 |
| Parallel environments | 16 |
| Total timesteps | 500000 |
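To make the table concrete, the values can be written as a dictionary using the Stable-Baselines3 PPO keyword names, and the rollout arithmetic follows from them. The variable names are illustrative, and the update count assumes whole rollouts are collected until the timestep budget is reached.

```python
import math

# Values from the table, keyed by Stable-Baselines3 PPO argument names.
PPO_KWARGS = {
    "learning_rate": 1e-3,
    "gamma": 0.95,
    "gae_lambda": 0.9,
    "clip_range": 0.1,
    "ent_coef": 0.1,
    "batch_size": 128,
    "n_steps": 128,
    "n_epochs": 10,
}
N_ENVS = 16
TOTAL_TIMESTEPS = 500_000

# Transitions gathered per policy update: n_steps from each parallel env.
rollout_size = PPO_KWARGS["n_steps"] * N_ENVS
# Minibatches per epoch, and gradient steps per update over all epochs.
minibatches = rollout_size // PPO_KWARGS["batch_size"]
grad_steps_per_update = minibatches * PPO_KWARGS["n_epochs"]
# Number of rollouts needed to reach the budget (assumes whole rollouts).
n_updates = math.ceil(TOTAL_TIMESTEPS / rollout_size)

if __name__ == "__main__":
    print(rollout_size, minibatches, grad_steps_per_update, n_updates)
```

With a 5-step episode cap, each 2048-transition rollout covers at least a few hundred episodes, so every update sees many complete flag sequences.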
Curriculum Schedule
| Phase | Timesteps | Baseline | Target |
|---|---|---|---|
| 1. Beat -O0 | 0–20k | -O0 | Trivial warm-up |
| 2. Beat -O1 | 20k–80k | -O1 | Learn specific flags |
| 3. Beat -O2 | 80k–500k | -O2 | Core optimization target |
| 4. Beat best-known | 500k+ | Best found | Research frontier |
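The schedule above amounts to a lookup from the global timestep to the active baseline. A minimal sketch follows; the function name, the phase labels other than `beat_O2`, and the treatment of the boundaries (each phase starts at its lower bound, inclusive) are assumptions.

```python
from typing import Tuple

# (first timestep of the phase, phase label, baseline to beat)
PHASES = [
    (0, "beat_O0", "-O0"),
    (20_000, "beat_O1", "-O1"),
    (80_000, "beat_O2", "-O2"),
    (500_000, "beat_best_known", "best found"),
]


def curriculum_phase(timestep: int) -> Tuple[str, str]:
    """Return (phase label, baseline) active at the given timestep."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    name, baseline = PHASES[0][1], PHASES[0][2]
    for start, phase_name, phase_baseline in PHASES:
        if timestep >= start:
            name, baseline = phase_name, phase_baseline
    return name, baseline


if __name__ == "__main__":
    for t in (0, 19_999, 20_000, 448_000, 500_000):
        print(t, curriculum_phase(t))
```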
Cold Start
Pre-trained with behavioral cloning from 8 expert strategies:
- Vectorization-focused (-O3 -march=native -ftree-vectorize)
- Math-aggressive (-Ofast -ffast-math)
- Loop-focused (-O3 -funroll-loops -ftree-loop-distribution)
- Full pipeline (-O3 -march=native -flto -funroll-loops)
- And 4 more domain-specific combinations
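For behavioral cloning, each expert strategy has to be expressed as a sequence of actions in the 17-action space. One way this could look is sketched below. The helper `strategy_to_actions`, the flag index order (taken from the flag table), the base-level order, and the convention of setting the base first and ending with STOP are all assumptions, not the project's documented encoding.

```python
from typing import List

# Assumed index order; the card does not document it.
FLAGS = [
    "-march=native", "-ftree-vectorize", "-ffast-math",
    "-funroll-loops", "-fpeel-loops", "-ftree-loop-distribution",
    "-finline-functions", "-flto", "-fschedule-insns2",
    "-fomit-frame-pointer", "-fstrict-aliasing", "-ftree-loop-vectorize",
]
BASE_LEVELS = ["-O1", "-O2", "-O3", "-Ofast"]
STOP = 0


def strategy_to_actions(strategy: str) -> List[int]:
    """Translate a flag string into action ids, terminated by STOP."""
    actions = []
    for token in strategy.split():
        if token in BASE_LEVELS:
            actions.append(13 + BASE_LEVELS.index(token))  # set_base: 13-16
        elif token in FLAGS:
            actions.append(1 + FLAGS.index(token))  # toggle_flag: 1-12
        else:
            raise ValueError(f"unknown token: {token}")
    actions.append(STOP)
    return actions


if __name__ == "__main__":
    print(strategy_to_actions("-O3 -march=native -flto -funroll-loops"))
```

Under these assumptions the longest listed strategy (one base level, three flags, then STOP) uses exactly the 5-step episode budget.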
Reward Design
- Terminal reward: log(t_baseline / t_agent), positive when the agent beats the baseline
- Intermediate reward: Potential-based shaping (Φ = flag coverage ratio)
- Conflict penalty: -0.1 for selecting flags already implied by the base level
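The three reward terms can be sketched as small functions. The shaping term uses the standard potential-based form F = γΦ(s') − Φ(s) from Ng et al. (1999). The definition of Φ as the fraction of enabled flags is only a guess at what "flag coverage ratio" means, and the set of flags implied by a base level is left to the caller because the card does not list it.

```python
import math
from typing import Collection, Sequence

GAMMA = 0.95  # discount from the hyperparameter table
CONFLICT_PENALTY = -0.1


def terminal_reward(t_baseline: float, t_agent: float) -> float:
    """log(t_baseline / t_agent): positive when the agent's binary is faster."""
    return math.log(t_baseline / t_agent)


def potential(flag_state: Sequence[int]) -> float:
    """Assumed potential: fraction of the 12 flags currently enabled."""
    return sum(flag_state) / len(flag_state)


def shaping_reward(prev_flags: Sequence[int], next_flags: Sequence[int],
                   gamma: float = GAMMA) -> float:
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * potential(next_flags) - potential(prev_flags)


def conflict_penalty(flag: str, implied_by_base: Collection[str]) -> float:
    """-0.1 if the chosen flag is already implied by the base level."""
    return CONFLICT_PENALTY if flag in implied_by_base else 0.0


if __name__ == "__main__":
    before = [0] * 12
    after = [1] + [0] * 11
    print(shaping_reward(before, after))
    print(terminal_reward(t_baseline=1.0, t_agent=0.8))
```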
Usage
from stable_baselines3 import PPO
import numpy as np
# Load model
model = PPO.load("path/to/model.zip")
# Create observation (24-dim)
# [program_features(8) + flag_state(12) + base_onehot(4)]
obs = np.zeros(24, dtype=np.float32)
# ... set program features from extract_program_features()
# Get action
action, _ = model.predict(obs, deterministic=True)
# action 0 = STOP, 1-12 = toggle flag, 13-16 = set base level
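A single `predict` call yields one action, whereas a full flag selection is a sequence of up to 5 actions. The loop below sketches how the observation might be updated between calls. It is an assumption about the environment's transition rules (toggle flips a flag bit, set_base overwrites the one-hot, the episode starts with no flags and no base set); `run_episode` and `apply_action` are hypothetical helpers, and `predict` is any callable from observation to action id.

```python
from typing import Callable, List, Sequence, Tuple

MAX_STEPS = 5  # from the architecture table


def apply_action(obs: Sequence[float], action: int) -> List[float]:
    """Return a new 24-dim observation after a non-STOP action."""
    new_obs = list(obs)
    if 1 <= action <= 12:
        i = 8 + (action - 1)           # flag_state occupies indices 8-19
        new_obs[i] = 1.0 - new_obs[i]  # toggle the flag bit
    elif 13 <= action <= 16:
        new_obs[20:24] = [0.0] * 4     # base_onehot occupies indices 20-23
        new_obs[20 + (action - 13)] = 1.0
    return new_obs


def run_episode(predict: Callable[[List[float]], int],
                program_features: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Roll the policy forward until STOP or the step limit."""
    obs = list(program_features) + [0.0] * 16
    taken = []
    for _ in range(MAX_STEPS):
        action = int(predict(obs))
        taken.append(action)
        if action == 0:  # STOP
            break
        obs = apply_action(obs, action)
    return obs, taken


if __name__ == "__main__":
    scripted = iter([14, 1, 0])  # stand-in for a trained policy
    print(run_episode(lambda o: next(scripted), [0.0] * 8))
```

With the loaded model, `predict` could wrap the call shown above, for example `lambda o: int(model.predict(np.asarray(o, dtype=np.float32), deterministic=True)[0])`.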
Intended Use
This model is designed for compiler optimization research. It demonstrates that RL agents can learn to select GCC optimization flags via curriculum learning and sequential flag composition.
Not intended for: Production compiler toolchains without thorough validation.
Evaluation Results (Azure linux/amd64, GCC 10)
Phase 1: Naive PolyBench Evaluation (30 programs × 7 baselines × 7 runs)
| Metric | Value |
|---|---|
| Beat -O2 | 0/30 (0%) |
| Beat best baseline | 0/30 (0%) |
| Avg speedup vs -O2 | -265.6% (3.3× slower) |
| Geomean time ratio vs O2 | 3.32× |
The agent at 448k steps (curriculum phase 3: beat_O2) selects flags that produce
slower code than -O2. It tends to choose -O1 + individual flags or bare -O2
without useful additions.
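The geomean time ratio in the table is the geometric mean of per-program ratios t_agent / t_O2, where values above 1 mean slower than -O2. A minimal sketch of that computation is shown below. The function names and the sample timings are illustrative, and this does not reproduce how the "Avg speedup" row was derived.

```python
import math
from typing import Sequence


def geomean_time_ratio(agent_times: Sequence[float],
                       o2_times: Sequence[float]) -> float:
    """Geometric mean of t_agent / t_O2 across programs."""
    if len(agent_times) != len(o2_times) or not agent_times:
        raise ValueError("need two non-empty sequences of equal length")
    logs = [math.log(a / b) for a, b in zip(agent_times, o2_times)]
    return math.exp(sum(logs) / len(logs))


def count_beating(agent_times: Sequence[float],
                  o2_times: Sequence[float]) -> int:
    """Number of programs where the agent's binary is strictly faster."""
    return sum(a < b for a, b in zip(agent_times, o2_times))


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    agent = [2.0, 8.0]
    o2 = [1.0, 1.0]
    print(geomean_time_ratio(agent, o2), count_beating(agent, o2))
```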
Phase 2: O2-vs-O3 Classification Test (13 synthetic benchmarks)
11/13 passed (85%) — validates that the test infrastructure correctly differentiates O2-favorable vs O3-favorable programs on the target hardware.
| Program | O2 time | O3 time | Speedup | Category |
|---|---|---|---|---|
| dense_matmul | 0.416s | 0.224s | 1.86× | O3-favorable |
| simd_vectorize | 0.321s | 0.182s | 1.76× | O3-favorable |
| stencil_2d | 0.182s | 0.143s | 1.27× | O3-favorable |
| loop_unroll_target | 0.098s | 0.083s | 1.19× | O3-favorable |
| branch_heavy | 0.808s | 0.810s | 1.00× | O2-favorable |
| linked_list_walk | 4.871s | 4.891s | 1.00× | O2-favorable |
| icache_pressure | 0.065s | 0.065s | 1.00× | O2-favorable |
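The categories in the table can be reproduced by thresholding the O2/O3 time ratio. The 1.05 cutoff below is an assumption chosen for illustration, since the card does not state the rule actually used. The timings are copied from the table.

```python
SPEEDUP_THRESHOLD = 1.05  # assumed cutoff, not documented in the card


def o3_speedup(o2_time: float, o3_time: float) -> float:
    """Speedup of -O3 over -O2 (values above 1 mean -O3 is faster)."""
    return o2_time / o3_time


def classify(o2_time: float, o3_time: float,
             threshold: float = SPEEDUP_THRESHOLD) -> str:
    """Label a program by whether -O3 gives a measurable gain over -O2."""
    if o3_speedup(o2_time, o3_time) >= threshold:
        return "O3-favorable"
    return "O2-favorable"


if __name__ == "__main__":
    timings = {  # (O2 seconds, O3 seconds) from the table above
        "dense_matmul": (0.416, 0.224),
        "stencil_2d": (0.182, 0.143),
        "branch_heavy": (0.808, 0.810),
        "linked_list_walk": (4.871, 4.891),
    }
    for name, (t_o2, t_o3) in timings.items():
        print(name, round(o3_speedup(t_o2, t_o3), 2), classify(t_o2, t_o3))
```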
Phase 3: Agent O2/O3 Flag Selection (7 known-outcome programs)
4/7 correct (57%) — the agent always defaults to O2 as base level (correct
for O2-favorable programs, wrong for O3-favorable ones like dense_matmul,
stencil_2d, vector_reduction). This is expected: the model was still in the
beat_O2 curriculum phase and hadn't learned when to escalate to O3.
| Program | Expected | Agent chose | Result |
|---|---|---|---|
| branch_heavy | O2 | O2 | CORRECT |
| icache_pressure | O2 | O2 | CORRECT |
| linked_list_walk | O2 | O2 | CORRECT |
| sort_and_search | O2 | O2 | CORRECT |
| dense_matmul | O3 | O2 | WRONG |
| stencil_2d | O3 | O2 | WRONG |
| vector_reduction | O3 | O2 | WRONG |
Interpretation
This checkpoint is a curriculum-in-progress model: it learned "O2 is safe" but hasn't discovered when O3/Ofast provides measurable benefit. The V6 architecture addresses this with synthetic data augmentation, LLM-generated training programs with known-optimal flags, and extended training (1M+ steps).
Training Infrastructure
- Azure Container Apps (D16 workload profile, 16 vCPU, linux/amd64)
- Training cost: ~$65
- Training time: ~13 hours (500k timesteps)
Citation
@software{cpugym_v5,
title={CPUGym V5: Convergence-First GCC Optimization via Reinforcement Learning},
year={2026},
url={https://github.com/pznachab_amadeus/CPUGym}
}
License
MIT
Evaluation results
- Mean Episode Reward (best eval) on PolyBench/C (self-reported): -1.770
- Programs Beating -O2 (%) on PolyBench/C (self-reported): 0.000
- O2/O3 Classification Accuracy (%) on PolyBench/C (self-reported): 57.000
- O2-vs-O3 Ground Truth Validation (%) on PolyBench/C (self-reported): 85.000