# Polyglot-Optima Beginner + Technical Explanation

This document explains the project from zero, then gradually adds technical depth.

---

## 1) One-line idea

`Polyglot-Optima` is a training environment where an AI learns to convert Python functions into fast C++ **without breaking correctness**.

---

## 2) Why this project exists

Most code models can produce "fast-looking" code, but in real systems that is not enough.

Common failure modes:

- code compiles but gives wrong outputs,
- code is fast only on one machine but fails elsewhere,
- reward is easy to game (model hacks scoring instead of solving task),
- model does not improve over multiple refinement rounds.

This project is built to fix those problems using:

- strict compile checks,
- fuzz-based correctness verification,
- cross-hardware portability checks,
- anti-gaming trap tasks,
- curriculum learning (easy -> hard),
- structured continuous reward.

---

## 3) Mental model (simple)

Think of this project as a game with rules:

- **Input:** a Python function + a hardware profile.
- **Player (AI):** can call tools to analyze and optimize.
- **Goal:** submit C++ that is fast *and* correct.
- **Score (reward):** combines speed, correctness, reasoning quality, and portability.

The AI plays this game many times and learns better strategies.

---

## 4) Core architecture

Main files and folders:

- `models.py`
  Defines typed data objects for actions, observations, and state.
- `server/environment.py`
  The main OpenEnv environment implementation (`reset`, `step`, `state`, `close`).
- `server/tools/`
  Actual capability tools (compiler, verifier, profiling, portability, submit).
- `server/rewards/`
  Reward rubrics and reward composition logic.
- `server/scenarios/`
  Task generators, hardware profiles, trap library, and adaptive curriculum.
- `tests/`
  Unit + integration tests validating behavior and quality.

---

## 5) Episode lifecycle (what happens in one training sample)

Each episode has 3 rounds.

### Round flow

1.
   Environment samples:
   - Python code task
   - hardware profile
   - hidden bottleneck labels (for diagnosis scoring)
2. Model calls tools (analyze, compile, verify, etc.).
3. Model eventually calls `submit_optimization`.
4. Environment computes round reward.
5. Repeat for rounds 2 and 3.
6. Final episode reward is computed from round rewards.

### Important implementation details

- `max_calls_per_round` is enforced.
- If call budget is exhausted, environment forces submit for that round.
- Adaptive curriculum can update global difficulty after batch outcomes.

---

## 6) The 9 tools (what the model can do)

The AI does not directly "guess" everything. It uses tools:

1. `get_hardware_profile`
2. `profile_python_hotspots`
3. `analyze_complexity`
4. `check_memory_access`
5. `compile_and_benchmark`
6. `verify_equivalence`
7. `check_portability`
8. `get_bottleneck_report`
9. `submit_optimization` (round-closing action)

The most important tools for trustworthiness are:

- `compile_and_benchmark` (real compile/runtime behavior),
- `verify_equivalence` (catches wrong-but-fast code),
- `check_portability` (checks behavior across profiles).

---

## 7) Reward system explained simply

Reward is **continuous**, not just pass/fail.

That means:

- weak solutions get small score,
- better solutions get higher score,
- fully good solutions get top score.

This is important for RL because the model needs gradient/signal to improve.

### Reward components

- **SpeedupRubric:** how much faster C++ is vs Python baseline
- **CorrectnessRubric:** fuzz pass-rate quality
- **CompilationRubric:** compile quality/status
- **DiagnosisRubric:** quality/coherence of bottleneck reasoning
- **PortabilityRubric:** cross-profile robustness
- **SelfCorrectionRubric:** improvement from earlier rounds

### Composition

Reward is composed using rubric operators (`Sequential`, `Gate`, `WeightedSum`), so it is easier to reason about and tune than one large monolithic score function.
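
To make the composition idea concrete, here is a minimal sketch of how gating and weighted blending could fit together. It does not reproduce the project's actual `Gate` or `WeightedSum` classes: the helper functions `gate` and `weighted_sum`, the score keys, the weights, and the thresholds are all hypothetical and chosen only for illustration. `Sequential` is not modeled separately; nesting two gates stands in for checks applied in order.

```python
from typing import Callable, Dict

Scores = Dict[str, float]           # rubric name -> score in [0, 1]
Rubric = Callable[[Scores], float]  # a rubric maps scores to one reward value


def weighted_sum(weights: Dict[str, float]) -> Rubric:
    """Blend named scores; dividing by the weight total keeps the result in [0, 1]."""
    total = sum(weights.values())

    def rubric(scores: Scores) -> float:
        return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total

    return rubric


def gate(name: str, threshold: float, inner: Rubric) -> Rubric:
    """Return 0.0 unless scores[name] reaches the threshold; otherwise defer to inner."""

    def rubric(scores: Scores) -> float:
        if scores.get(name, 0.0) < threshold:
            return 0.0
        return inner(scores)

    return rubric


# Hypothetical weights and thresholds (assumptions, not the project's values).
blend = weighted_sum({
    "speedup": 0.4,
    "correctness": 0.3,
    "diagnosis": 0.1,
    "portability": 0.1,
    "self_correction": 0.1,
})
# Must compile, then must pass at least half of the fuzz cases, then blend.
round_reward = gate("compilation", 1.0, gate("correctness", 0.5, blend))

fast_but_wrong = {"compilation": 1.0, "correctness": 0.2, "speedup": 1.0}
solid = {
    "compilation": 1.0, "correctness": 1.0, "speedup": 0.6,
    "diagnosis": 0.5, "portability": 0.5, "self_correction": 0.5,
}
print(round_reward(fast_but_wrong), round_reward(solid))
```

In this sketch the gates zero out a submission that fails a hard requirement, while the blend keeps the reward continuous for everything that passes them. Whether the real gates zero the reward or only scale it down is not stated in this document.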

---

## 8) Anti-gaming design

This project assumes the model will try shortcuts. So it includes defenses:

- Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases)
- Adversarial fuzzing
- Correctness + adversarial pass-rate signals
- Portability checks across hardware profiles
- Reasoning/diagnosis quality signal

Net effect: "fast but wrong" should score poorly.

---

## 9) Curriculum learning (easy -> hard)

Difficulty axes include:

- function complexity tier,
- hardware difficulty class,
- verifier strictness,
- portability requirement.

Curriculum controller monitors success in batches and adjusts:

- high success -> increase difficulty,
- low success -> reduce difficulty,
- middle zone -> hold.

This stabilizes learning and prevents early collapse.

---

## 10) Adaptive traps (what was improved)

Adaptive traps now do two things:

- prioritize categories where the model recently failed,
- create semantic-preserving trap variants (not only naive renaming).

Why this matters:

- reduces memorization,
- improves robustness,
- increases novelty/innovation signal for judges.

---

## 11) What "good performance" means here

Not just one high speedup number.

A good policy should show:

- increasing reward trend,
- high correctness/adversarial pass-rate,
- high compile success,
- better portability over time,
- stable behavior on held-out/edge-case tasks.

---

## 12) How to run and verify locally

From `polyglot_optima/`:

```bash
python -m ruff check .
python -m pytest -q
```

Smoke test (LLM-in-the-loop):

```bash
python tests/smoke_llm_hf.py
```

Cursor/OpenAI-compatible mode:

```bash
export LLM_PROVIDER=cursor
export CURSOR_API_KEY=...
export CURSOR_MODEL=gpt-4.1-nano
python tests/smoke_llm_hf.py
```

On Windows `cmd.exe`, use `set NAME=value` in place of `export NAME=value`.

---

## 13) Training workflow for beginners

Use `training/openenv_hackathon_training.ipynb`:

1. Configure model + episodes + logging.
2. Run baseline eval first (fixed seeds).
3. Run RL training (TRL scaffold cell).
4. Run post-training eval with same seed protocol.
5.
   Export plots to `docs/plots`.
6. Add results to `README.md`.

Track at least:

- reward,
- correctness pass rate,
- compile success rate,
- portability metrics.

---

## 14) How this maps to hackathon judging

The project can score well if you clearly show:

- **Innovation:** adaptive curriculum + anti-gaming traps + structured reward
- **Storytelling:** clear problem -> method -> before/after outcome
- **Improvement evidence:** baseline vs trained plots
- **Pipeline quality:** reproducible notebook/script + OpenEnv-compliant deployment

---

## 15) Most important files to read next

Recommended reading order:

1. `README.md`
2. `models.py`
3. `server/environment.py`
4. `server/tools/submit.py`
5. `server/tools/cpp_compiler.py`
6. `server/tools/verifier.py`
7. `server/rewards/__init__.py`
8. `server/scenarios/dataset_loader.py`
9. `tests/test_skeleton.py`

---

## 16) Beginner takeaway

If you remember one thing:

This is not just "code generation." It is a full RL environment that teaches an AI to do **correct, robust, hardware-aware optimization** under realistic constraints.
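
As a concrete footnote to the curriculum rule in section 9 (high success -> increase, low success -> reduce, middle zone -> hold), the sketch below shows one way such a controller step could look. It is not the project's controller: the function name `next_difficulty`, the single integer difficulty level, and the 0.8 / 0.3 thresholds are all assumptions made for illustration.

```python
from typing import Sequence


def next_difficulty(
    level: int,
    batch_successes: Sequence[bool],
    raise_above: float = 0.8,   # hypothetical "high success" threshold
    lower_below: float = 0.3,   # hypothetical "low success" threshold
    min_level: int = 0,
    max_level: int = 5,
) -> int:
    """Return the difficulty level to use for the next batch."""
    if not batch_successes:
        return level  # no evidence yet, so hold
    rate = sum(batch_successes) / len(batch_successes)
    if rate > raise_above:
        return min(level + 1, max_level)  # high success -> harder, capped at max
    if rate < lower_below:
        return max(level - 1, min_level)  # low success -> easier, floored at min
    return level  # middle zone -> hold


level = 2
level = next_difficulty(level, [True] * 9 + [False])      # 90% success
level = next_difficulty(level, [True] * 5 + [False] * 5)  # 50% success
print(level)
```

The hold zone between the two thresholds is what keeps the level from oscillating when the success rate hovers near a single cutoff.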