| # Polyglot-Optima Beginner + Technical Explanation | |
| This document explains the project from zero, then gradually adds technical depth. | |
| --- | |
| ## 1) One-line idea | |
| `Polyglot-Optima` is a training environment where an AI learns to convert Python functions into fast C++ **without breaking correctness**. | |
| --- | |
| ## 2) Why this project exists | |
| Most code models can produce "fast-looking" code, but in real systems that is not enough. | |
| Common failure modes: | |
| - code compiles but gives wrong outputs, | |
| - code is fast only on one machine but fails elsewhere, | |
| - reward is easy to game (model hacks scoring instead of solving task), | |
| - model does not improve over multiple refinement rounds. | |
| This project is built to fix those problems using: | |
| - strict compile checks, | |
| - fuzz-based correctness verification, | |
| - cross-hardware portability checks, | |
| - anti-gaming trap tasks, | |
| - curriculum learning (easy -> hard), | |
| - structured continuous reward. | |
| --- | |
| ## 3) Mental model (simple) | |
| Think of this project as a game with rules: | |
| - **Input:** a Python function + a hardware profile. | |
| - **Player (AI):** can call tools to analyze and optimize. | |
| - **Goal:** submit C++ that is fast *and* correct. | |
| - **Score (reward):** combines speed, correctness, reasoning quality, and portability. | |
| The AI plays this game many times and learns better strategies. | |
| --- | |
| ## 4) Core architecture | |
| Main files and folders: | |
| - `models.py` | |
| Defines typed data objects for actions, observations, and state. | |
| - `server/environment.py` | |
| The main OpenEnv environment implementation (`reset`, `step`, `state`, `close`). | |
| - `server/tools/` | |
| Actual capability tools (compiler, verifier, profiling, portability, submit). | |
| - `server/rewards/` | |
| Reward rubrics and reward composition logic. | |
| - `server/scenarios/` | |
| Task generators, hardware profiles, trap library, and adaptive curriculum. | |
| - `tests/` | |
| Unit + integration tests validating behavior and quality. | |
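To make "typed data objects" more concrete, here is a minimal sketch using dataclasses. The class and field names (`ToolAction`, `Observation`, `EpisodeState` and their attributes) are assumptions chosen for illustration; they are not taken from the actual `models.py`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolAction:
    """One tool call issued by the model (hypothetical shape)."""
    tool_name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Observation:
    """What the environment returns after a step (hypothetical shape)."""
    tool_output: str
    reward: float = 0.0
    done: bool = False


@dataclass
class EpisodeState:
    """Bookkeeping kept between steps (hypothetical shape)."""
    round_index: int = 1
    calls_this_round: int = 0
    round_rewards: list[float] = field(default_factory=list)
```

Keeping actions, observations, and state as separate typed objects makes it easier to validate what crosses the boundary between the model and the environment.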
| --- | |
| ## 5) Episode lifecycle (what happens in one training sample) | |
| Each episode has 3 rounds. | |
| ### Round flow | |
| 1. Environment samples: | |
| - Python code task | |
| - hardware profile | |
| - hidden bottleneck labels (for diagnosis scoring) | |
| 2. Model calls tools (analyze, compile, verify, etc.). | |
| 3. Model eventually calls `submit_optimization`. | |
| 4. Environment computes round reward. | |
| 5. Repeat for rounds 2 and 3. | |
| 6. Final episode reward is computed from round rewards. | |
| ### Important implementation details | |
| - `max_calls_per_round` is enforced. | |
| - If the call budget is exhausted, the environment forces a submit for that round. | |
| - The adaptive curriculum can update the global difficulty after each batch of outcomes. | |
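The round flow and the call budget can be sketched as a small loop. This is a simplified illustration, not the code in `server/environment.py`: `run_episode`, `choose_tool`, `score_round`, and the default budget of 5 are hypothetical, and the sketch stops at per-round rewards because the text does not specify how the final episode reward is derived from them.

```python
from typing import Callable

ROUNDS_PER_EPISODE = 3  # stated in the text above


def run_episode(
    choose_tool: Callable[[int, int], str],
    score_round: Callable[[int], float],
    max_calls_per_round: int = 5,  # assumed value, for illustration only
) -> list[tuple[float, bool]]:
    """Return one (round_reward, was_forced) pair per round."""
    results = []
    for round_index in range(1, ROUNDS_PER_EPISODE + 1):
        forced = True  # stays True if the budget runs out before a submit
        for call_index in range(max_calls_per_round):
            tool = choose_tool(round_index, call_index)
            if tool == "submit_optimization":
                forced = False
                break
        # Either way the round is closed and scored.
        results.append((score_round(round_index), forced))
    return results
```

A policy that never calls `submit_optimization` still finishes all three rounds here, with each round marked as forced.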
| --- | |
| ## 6) The 9 tools (what the model can do) | |
| The AI does not directly "guess" everything. It uses tools: | |
| 1. `get_hardware_profile` | |
| 2. `profile_python_hotspots` | |
| 3. `analyze_complexity` | |
| 4. `check_memory_access` | |
| 5. `compile_and_benchmark` | |
| 6. `verify_equivalence` | |
| 7. `check_portability` | |
| 8. `get_bottleneck_report` | |
| 9. `submit_optimization` (round-closing action) | |
| The most important tools for trustworthiness are: | |
| - `compile_and_benchmark` (real compile/runtime behavior), | |
| - `verify_equivalence` (catches wrong-but-fast code), | |
| - `check_portability` (checks behavior across profiles). | |
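To illustrate how a fuzz-based check catches wrong-but-fast code, the sketch below compares a reference function against a candidate on random inputs and reports the pass rate. It is an assumption-laden stand-in: the real `verify_equivalence` tool compares Python against compiled C++, whereas here both sides are plain Python callables, and `fuzz_pass_rate` with its parameters is a hypothetical name.

```python
import math
import random
from typing import Callable, Sequence

FloatFn = Callable[[Sequence[float]], float]


def fuzz_pass_rate(
    reference: FloatFn,
    candidate: FloatFn,
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of random inputs on which candidate matches reference."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    passed = 0
    for _ in range(trials):
        length = rng.randint(0, 16)  # include the empty-input edge case
        xs = [rng.uniform(-1e3, 1e3) for _ in range(length)]
        if math.isclose(reference(xs), candidate(xs), rel_tol=1e-9, abs_tol=1e-9):
            passed += 1
    return passed / trials


def sum_of_squares(xs: Sequence[float]) -> float:
    return sum(x * x for x in xs)


def off_by_one(xs: Sequence[float]) -> float:
    # Deliberately buggy "optimized" version: skips the last element.
    return sum(x * x for x in xs[:-1])
```

Here `fuzz_pass_rate(sum_of_squares, off_by_one)` comes out well below 1.0, because the bug only stays hidden on empty inputs. Returning a rate rather than a boolean fits the continuous reward described in the next section.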
| --- | |
| ## 7) Reward system explained simply | |
| Reward is **continuous**, not just pass/fail. | |
| That means: | |
| - weak solutions get a small score, | |
| - better solutions get a higher score, | |
| - fully good solutions get the top score. | |
| This matters for RL because a pass/fail reward is sparse: the policy needs a graded signal that separates a slightly better attempt from a slightly worse one. | |
| ### Reward components | |
| - **SpeedupRubric:** how much faster C++ is vs Python baseline | |
| - **CorrectnessRubric:** fuzz pass-rate quality | |
| - **CompilationRubric:** compile quality/status | |
| - **DiagnosisRubric:** quality/coherence of bottleneck reasoning | |
| - **PortabilityRubric:** cross-profile robustness | |
| - **SelfCorrectionRubric:** improvement from earlier rounds | |
| ### Composition | |
| Reward is composed using rubric operators (`Sequential`, `Gate`, `WeightedSum`), so it is easier to reason about and tune than one large monolithic score function. | |
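A minimal sketch of how such operators might compose is shown below. Everything in it is an assumption made for illustration: the function names, the metric keys, the equal weights, and the log-scaled speedup with a 16x cap are not taken from `server/rewards/`. `Sequential` is left out because the text does not describe its semantics.

```python
import math
from typing import Callable, Mapping

Metrics = Mapping[str, float]
Rubric = Callable[[Metrics], float]


def weighted_sum(parts: list[tuple[float, Rubric]]) -> Rubric:
    """Combine rubrics into one score, normalized by the total weight."""
    total = sum(weight for weight, _ in parts)
    return lambda m: sum(weight * rubric(m) for weight, rubric in parts) / total


def gate(condition: Rubric, threshold: float, inner: Rubric) -> Rubric:
    """Score with `inner` only if `condition` reaches `threshold`, else 0."""
    return lambda m: inner(m) if condition(m) >= threshold else 0.0


def speedup_score(m: Metrics) -> float:
    # Log-scaled and clipped to [0, 1]; the 16x cap is an arbitrary assumption.
    speedup = max(m["speedup"], 1.0)
    return min(math.log2(speedup) / math.log2(16.0), 1.0)


def correctness_score(m: Metrics) -> float:
    return m["fuzz_pass_rate"]


def compiled(m: Metrics) -> float:
    return m["compiled"]  # 1.0 if the C++ compiled, else 0.0


# Code that does not compile earns nothing; otherwise blend the two signals.
reward = gate(
    compiled,
    1.0,
    weighted_sum([(0.5, correctness_score), (0.5, speedup_score)]),
)
```

The gate keeps a non-compiling submission at zero no matter what the other rubrics would say, while the weighted sum keeps the remaining signal continuous.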
| --- | |
| ## 8) Anti-gaming design | |
| This project assumes the model will try shortcuts. So it includes defenses: | |
| - Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases) | |
| - Adversarial fuzzing | |
| - Correctness + adversarial pass-rate signals | |
| - Portability checks across hardware profiles | |
| - Reasoning/diagnosis quality signal | |
| Net effect: "fast but wrong" should score poorly. | |
| --- | |
| ## 9) Curriculum learning (easy -> hard) | |
| Difficulty axes include: | |
| - function complexity tier, | |
| - hardware difficulty class, | |
| - verifier strictness, | |
| - portability requirement. | |
| Curriculum controller monitors success in batches and adjusts: | |
| - high success -> increase difficulty, | |
| - low success -> reduce difficulty, | |
| - middle zone -> hold. | |
| This is meant to stabilize learning and avoid early collapse, where tasks are too hard for the model to earn any useful reward. | |
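The three-zone rule above can be sketched as a single function. The thresholds (0.7 and 0.3), the level range, and the name `next_difficulty` are assumptions for illustration; the actual controller may use different values and track several difficulty axes at once.

```python
def next_difficulty(
    level: int,
    success_rate: float,
    raise_above: float = 0.7,  # assumed threshold
    lower_below: float = 0.3,  # assumed threshold
    min_level: int = 0,
    max_level: int = 4,
) -> int:
    """Return the difficulty level to use for the next batch."""
    if success_rate > raise_above:
        return min(level + 1, max_level)  # high success: harder tasks
    if success_rate < lower_below:
        return max(level - 1, min_level)  # low success: easier tasks
    return level  # middle zone: hold
```

The hold zone between the two thresholds keeps the level from oscillating when the success rate hovers near a single cutoff.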
| --- | |
| ## 10) Adaptive traps (what was improved) | |
| Adaptive traps now do two things: | |
| - prioritize categories where the model recently failed, | |
| - create semantic-preserving trap variants (not only naive renaming). | |
| Why this matters: | |
| - reduces memorization, | |
| - improves robustness, | |
| - strengthens the novelty/innovation case for judges. | |
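Prioritizing recently failed categories can be sketched as weighted sampling. The category identifiers, the `+1` smoothing, and both function names are assumptions for illustration; the sketch covers only the prioritization step, not the generation of semantic-preserving variants.

```python
import random
from collections import Counter

# Hypothetical identifiers for the trap families listed in section 8.
TRAP_CATEGORIES = ["overflow", "nan_inf", "aliasing", "semantic_edge_case"]


def trap_weights(recent_failures: list[str]) -> dict[str, float]:
    """Weight each category by how often the model recently failed it."""
    counts = Counter(recent_failures)
    # The +1 keeps every category reachable even with zero recent failures.
    return {category: 1.0 + counts[category] for category in TRAP_CATEGORIES}


def sample_trap_category(recent_failures: list[str], rng: random.Random) -> str:
    """Draw one category, favoring those with more recent failures."""
    weights = trap_weights(recent_failures)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

Passing in a seeded `random.Random` keeps trap selection reproducible across evaluation runs.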
| --- | |
| ## 11) What "good performance" means here | |
| Not just one high speedup number. | |
| A good policy should show: | |
| - increasing reward trend, | |
| - high correctness/adversarial pass-rate, | |
| - high compile success, | |
| - better portability over time, | |
| - stable behavior on held-out/edge-case tasks. | |
| --- | |
| ## 12) How to run and verify locally | |
| From `polyglot_optima/`: | |
| ```bash | |
| python -m ruff check . | |
| python -m pytest -q | |
| ``` | |
| Smoke test (LLM-in-the-loop): | |
| ```bash | |
| python tests/smoke_llm_hf.py | |
| ``` | |
| Cursor/OpenAI-compatible mode: | |
| ```bash | |
| # bash/zsh syntax; in Windows cmd.exe use `set NAME=value` instead of `export` | |
| export LLM_PROVIDER=cursor | |
| export CURSOR_API_KEY=... | |
| export CURSOR_MODEL=gpt-4.1-nano | |
| python tests/smoke_llm_hf.py | |
| ``` | |
| --- | |
| ## 13) Training workflow for beginners | |
| Use `training/openenv_hackathon_training.ipynb`: | |
| 1. Configure model + episodes + logging. | |
| 2. Run baseline eval first (fixed seeds). | |
| 3. Run RL training (TRL scaffold cell). | |
| 4. Run post-training eval with same seed protocol. | |
| 5. Export plots to `docs/plots`. | |
| 6. Add results to `README.md`. | |
| Track at least: | |
| - reward, | |
| - correctness pass rate, | |
| - compile success rate, | |
| - portability metrics. | |
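For the baseline versus post-training comparison, a small helper like the one below could average the tracked metrics over the same set of evaluation episodes. The record keys and function names are hypothetical; adapt them to whatever the notebook actually logs.

```python
from statistics import mean

# Hypothetical keys, one per metric listed above.
TRACKED = ["reward", "correctness_pass_rate", "compile_success", "portability"]


def summarize(episodes: list[dict[str, float]]) -> dict[str, float]:
    """Average each tracked metric over a list of per-episode records."""
    return {key: mean(episode[key] for episode in episodes) for key in TRACKED}


def improvement(
    baseline: list[dict[str, float]],
    trained: list[dict[str, float]],
) -> dict[str, float]:
    """Per-metric difference: trained mean minus baseline mean."""
    before, after = summarize(baseline), summarize(trained)
    return {key: after[key] - before[key] for key in TRACKED}
```

The difference is only meaningful when both runs use the same fixed seeds, as in steps 2 and 4.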
| --- | |
| ## 14) How this maps to hackathon judging | |
| The project can score well if you clearly show: | |
| - **Innovation:** adaptive curriculum + anti-gaming traps + structured reward | |
| - **Storytelling:** clear problem -> method -> before/after outcome | |
| - **Improvement evidence:** baseline vs trained plots | |
| - **Pipeline quality:** reproducible notebook/script + OpenEnv-compliant deployment | |
| --- | |
| ## 15) Most important files to read next | |
| Recommended reading order: | |
| 1. `README.md` | |
| 2. `models.py` | |
| 3. `server/environment.py` | |
| 4. `server/tools/submit.py` | |
| 5. `server/tools/cpp_compiler.py` | |
| 6. `server/tools/verifier.py` | |
| 7. `server/rewards/__init__.py` | |
| 8. `server/scenarios/dataset_loader.py` | |
| 9. `tests/test_skeleton.py` | |
| --- | |
| ## 16) Beginner takeaway | |
| If you remember one thing: | |
| This is not just "code generation." | |
| It is a full RL environment that teaches an AI to do **correct, robust, hardware-aware optimization** under realistic constraints. | |