# Polyglot-Optima Beginner + Technical Explanation
This document explains the project from zero, then gradually adds technical depth.
---
## 1) One-line idea
`Polyglot-Optima` is a training environment where an AI learns to convert Python functions into fast C++ **without breaking correctness**.
---
## 2) Why this project exists
Most code models can produce "fast-looking" code, but in real systems that is not enough.
Common failure modes:
- code compiles but gives wrong outputs,
- code is fast only on one machine but fails elsewhere,
- the reward is easy to game (the model exploits the scoring instead of solving the task),
- model does not improve over multiple refinement rounds.
This project is built to fix those problems using:
- strict compile checks,
- fuzz-based correctness verification,
- cross-hardware portability checks,
- anti-gaming trap tasks,
- curriculum learning (easy -> hard),
- structured continuous reward.
---
## 3) Mental model (simple)
Think of this project as a game with rules:
- **Input:** a Python function + a hardware profile.
- **Player (AI):** can call tools to analyze and optimize.
- **Goal:** submit C++ that is fast *and* correct.
- **Score (reward):** combines speed, correctness, reasoning quality, and portability.
The AI plays this game many times and learns better strategies.
---
## 4) Core architecture
Main files and folders:
- `models.py`: defines typed data objects for actions, observations, and state.
- `server/environment.py`: the main OpenEnv environment implementation (`reset`, `step`, `state`, `close`).
- `server/tools/`: the capability tools themselves (compiler, verifier, profiling, portability, submit).
- `server/rewards/`: reward rubrics and reward composition logic.
- `server/scenarios/`: task generators, hardware profiles, trap library, and adaptive curriculum.
- `tests/`: unit and integration tests validating behavior and quality.
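To make "typed data objects" concrete, here is a minimal sketch of what action, observation, and state types could look like. The class names, field names, and the `cpp_source` argument are all hypothetical and chosen for illustration; the real definitions live in `models.py` and may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolAction:
    """One tool call issued by the model (hypothetical shape)."""
    tool_name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolObservation:
    """What the environment returns after a tool call (hypothetical shape)."""
    output: dict[str, Any]
    reward: float = 0.0
    done: bool = False


@dataclass
class EpisodeState:
    """Bookkeeping kept between calls (hypothetical shape)."""
    round_index: int = 0
    calls_this_round: int = 0
    round_rewards: list[float] = field(default_factory=list)


# Example: the model asks to compile a candidate. "cpp_source" is an assumed argument name.
action = ToolAction("compile_and_benchmark", {"cpp_source": "int main() { return 0; }"})
state = EpisodeState()
```

Typed objects like these let the environment reject malformed tool calls early instead of failing deep inside a tool.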
---
## 5) Episode lifecycle (what happens in one training sample)
Each episode has 3 rounds.
### Round flow
1. Environment samples:
   - Python code task
   - hardware profile
   - hidden bottleneck labels (for diagnosis scoring)
2. Model calls tools (analyze, compile, verify, etc.).
3. Model eventually calls `submit_optimization`.
4. Environment computes round reward.
5. Repeat for rounds 2 and 3.
6. Final episode reward is computed from round rewards.
### Important implementation details
- `max_calls_per_round` is enforced as a per-round tool-call budget.
- If the call budget is exhausted, the environment forces a submission for that round.
- The adaptive curriculum can update the global difficulty after batch outcomes.
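The round flow and the forced-submit rule can be sketched as a small loop. This is a simplified illustration, not the code in `server/environment.py`: the `policy` and `score_round` callables are hypothetical stand-ins, and it assumes the last call allowed by the budget is turned into a submission.

```python
from typing import Callable

ROUNDS_PER_EPISODE = 3  # stated in this document


def run_episode(
    policy: Callable[[int, int], str],
    score_round: Callable[[int], float],
    max_calls_per_round: int,
) -> list[float]:
    """Play three rounds; force a submit when the call budget runs out."""
    round_rewards: list[float] = []
    for round_index in range(ROUNDS_PER_EPISODE):
        calls = 0
        while True:
            calls += 1
            tool = policy(round_index, calls)  # the model picks a tool name
            # Budget exhausted: the environment closes the round itself.
            if calls >= max_calls_per_round:
                tool = "submit_optimization"
            if tool == "submit_optimization":
                break
        round_rewards.append(score_round(round_index))
    return round_rewards


# A policy that never submits on its own still ends every round.
rewards = run_episode(lambda r, c: "analyze_complexity", lambda r: 0.1 * (r + 1), 4)
```

How the per-round rewards are combined into the final episode reward is left out here, because this document does not specify it.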
---
## 6) The 9 tools (what the model can do)
The AI does not directly "guess" everything. It uses tools:
1. `get_hardware_profile`
2. `profile_python_hotspots`
3. `analyze_complexity`
4. `check_memory_access`
5. `compile_and_benchmark`
6. `verify_equivalence`
7. `check_portability`
8. `get_bottleneck_report`
9. `submit_optimization` (round-closing action)
The most important tools for trustworthiness are:
- `compile_and_benchmark` (real compile/runtime behavior),
- `verify_equivalence` (catches wrong-but-fast code),
- `check_portability` (checks behavior across profiles).
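One way to picture the tool list is as a fixed registry that the environment checks every call against. The tool names below come from the list above; the `validate_tool_call` helper is hypothetical and only illustrates the idea that a single tool closes the round.

```python
TOOL_NAMES = (
    "get_hardware_profile",
    "profile_python_hotspots",
    "analyze_complexity",
    "check_memory_access",
    "compile_and_benchmark",
    "verify_equivalence",
    "check_portability",
    "get_bottleneck_report",
    "submit_optimization",
)
ROUND_CLOSING_TOOL = "submit_optimization"


def validate_tool_call(tool_name: str) -> bool:
    """Reject unknown tools; return True if this call closes the round."""
    if tool_name not in TOOL_NAMES:
        raise ValueError(f"unknown tool: {tool_name}")
    return tool_name == ROUND_CLOSING_TOOL
```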
---
## 7) Reward system explained simply
Reward is **continuous**, not just pass/fail.
That means:
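- weak solutions get a small score,
- better solutions get a higher score,
- fully good solutions get the top score.
This is important for RL because a pass/fail reward gives the model almost no signal until it fully succeeds, while a graded reward tells it which attempts were closer.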
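As an illustration of what "continuous" can mean for speed, the sketch below maps a speedup ratio onto a 0 to 1 scale using a logarithm. The formula and the `cap` value are assumptions made for this example; they are not the project's actual `SpeedupRubric`.

```python
import math


def speedup_reward(python_seconds: float, cpp_seconds: float, cap: float = 100.0) -> float:
    """Map a speedup ratio to [0, 1] on a log scale (illustrative formula)."""
    if python_seconds <= 0 or cpp_seconds <= 0:
        return 0.0  # invalid timings earn nothing
    speedup = python_seconds / cpp_seconds
    if speedup <= 1.0:
        return 0.0  # no faster than the Python baseline
    # log scale: each 10x gain adds the same amount, saturating at `cap`
    return min(math.log(speedup) / math.log(cap), 1.0)
```

With these assumed settings a 10x speedup lands near the middle of the scale and anything at or beyond 100x saturates at 1.0, so the model is rewarded for every step toward faster code rather than only for hitting a single target.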
### Reward components
- **SpeedupRubric:** how much faster C++ is vs Python baseline
- **CorrectnessRubric:** fuzz pass-rate quality
- **CompilationRubric:** compile quality/status
- **DiagnosisRubric:** quality/coherence of bottleneck reasoning
- **PortabilityRubric:** cross-profile robustness
- **SelfCorrectionRubric:** improvement from earlier rounds
### Composition
Reward is composed using rubric operators (`Sequential`, `Gate`, `WeightedSum`), so it is easier to reason about and tune than one large monolithic score function.
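The idea behind gating and weighted blending can be sketched with plain functions. These lowercase helpers are stand-ins written for this example and are not the real `Gate` or `WeightedSum` operators; the weights and the correctness threshold are hypothetical too.

```python
from typing import Callable, Mapping

Scores = Mapping[str, float]
Rubric = Callable[[Scores], float]


def weighted_sum(weights: Mapping[str, float]) -> Rubric:
    """Blend named component scores using normalized weights."""
    total = sum(weights.values())
    return lambda scores: sum(w * scores[name] for name, w in weights.items()) / total


def gate(condition: Callable[[Scores], bool], inner: Rubric) -> Rubric:
    """Return the inner score only if the condition holds, else 0."""
    return lambda scores: inner(scores) if condition(scores) else 0.0


# Hypothetical weights and threshold, chosen only for illustration.
blend = weighted_sum({"speedup": 0.5, "correctness": 0.3, "portability": 0.2})
reward = gate(lambda s: s["compilation"] > 0.0 and s["correctness"] >= 0.5, blend)

fast_but_wrong = {"compilation": 1.0, "speedup": 1.0, "correctness": 0.1, "portability": 1.0}
slow_but_right = {"compilation": 1.0, "speedup": 0.2, "correctness": 1.0, "portability": 1.0}
```

Here `reward(fast_but_wrong)` is zero because the gate blocks it, while `reward(slow_but_right)` still earns a moderate score. Small composable pieces like these make it easy to see which rule produced a given reward.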
---
## 8) Anti-gaming design
This project assumes the model will try shortcuts. So it includes defenses:
- Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases)
- Adversarial fuzzing
- Correctness + adversarial pass-rate signals
- Portability checks across hardware profiles
- Reasoning/diagnosis quality signal
Net effect: "fast but wrong" should score poorly.
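A minimal sketch of why adversarial inputs matter: the fuzzer below mixes random values with hand-picked edge cases (zero, huge magnitudes, infinities, NaN). It compares two Python callables for simplicity, whereas the real verifier compares Python against compiled C++; the function names and tolerances are assumptions.

```python
import math
import random
from typing import Callable


def outputs_match(expected: float, actual: float, rel_tol: float = 1e-9) -> bool:
    """Compare floats, treating NaN as equal to NaN and infinities by sign."""
    if math.isnan(expected) or math.isnan(actual):
        return math.isnan(expected) and math.isnan(actual)
    if math.isinf(expected) or math.isinf(actual):
        return expected == actual
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-12)


def fuzz_pass_rate(
    reference: Callable[[float], float],
    candidate: Callable[[float], float],
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of inputs (edge cases plus random values) where outputs agree."""
    rng = random.Random(seed)
    edge_cases = [0.0, -0.0, 1e308, -1e308, math.inf, -math.inf, math.nan]
    inputs = edge_cases + [rng.uniform(-1e6, 1e6) for _ in range(trials)]
    passed = sum(outputs_match(reference(x), candidate(x)) for x in inputs)
    return passed / len(inputs)


def reference(x: float) -> float:
    return x * 2.0


def drops_nan(x: float) -> float:
    # A shortcut that silently turns NaN into 0 instead of propagating it.
    return 0.0 if math.isnan(x) else x * 2.0
```

Random values alone would never expose `drops_nan`, because a uniform sampler does not produce NaN. The edge-case list is what pushes its pass rate below 1.0.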
---
## 9) Curriculum learning (easy -> hard)
Difficulty axes include:
- function complexity tier,
- hardware difficulty class,
- verifier strictness,
- portability requirement.
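The curriculum controller monitors success rates over batches and adjusts difficulty:
- high success -> increase difficulty,
- low success -> reduce difficulty,
- middle zone -> hold.
This is meant to stabilize learning and prevent early collapse, where the model fails on tasks that are too hard before it has learned anything.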
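The raise, lower, or hold rule can be sketched as a small controller. The thresholds, the number of levels, and the class name are hypothetical; the real controller in `server/scenarios/` also tracks several difficulty axes rather than one integer.

```python
from dataclasses import dataclass


@dataclass
class CurriculumController:
    """Raise, lower, or hold difficulty based on batch success rate."""
    level: int = 0
    max_level: int = 4          # hypothetical number of tiers
    promote_above: float = 0.8  # hypothetical threshold
    demote_below: float = 0.3   # hypothetical threshold

    def update(self, batch_successes: list[bool]) -> int:
        if not batch_successes:
            return self.level  # nothing observed, so hold
        rate = sum(batch_successes) / len(batch_successes)
        if rate > self.promote_above:
            self.level = min(self.level + 1, self.max_level)
        elif rate < self.demote_below:
            self.level = max(self.level - 1, 0)
        # otherwise: middle zone, difficulty stays where it is
        return self.level
```

Keeping a gap between the two thresholds avoids flip-flopping between levels when the success rate hovers near a single cutoff.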
---
## 10) Adaptive traps (what was improved)
Adaptive traps now do two things:
- prioritize categories where the model recently failed,
- create semantic-preserving trap variants (not only naive renaming).
Why this matters:
- reduces memorization,
- improves robustness,
- increases novelty/innovation signal for judges.
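Prioritizing recently failed categories can be sketched as weighted sampling. The category identifiers below are hypothetical names for the trap types listed in section 8, and the `floor` parameter is an assumption that keeps every category reachable.

```python
import random
from collections import Counter

# Hypothetical identifiers for the trap types named in section 8.
TRAP_CATEGORIES = ("overflow", "nan_inf", "aliasing", "semantic_edge")


def sample_trap_category(
    recent_failures: list[str],
    rng: random.Random,
    floor: float = 1.0,
) -> str:
    """Sample a category with probability proportional to floor + failure count."""
    counts = Counter(recent_failures)
    weights = [floor + counts[name] for name in TRAP_CATEGORIES]
    return rng.choices(TRAP_CATEGORIES, weights=weights, k=1)[0]


# Aliasing failed three times recently, so it is drawn more often than the rest.
choice = sample_trap_category(["aliasing", "aliasing", "aliasing", "overflow"], random.Random(0))
```

A positive `floor` means categories the model currently handles well are still sampled occasionally, which guards against forgetting them.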
---
## 11) What "good performance" means here
Good performance is more than a single high speedup number.
A good policy should show:
- increasing reward trend,
- high correctness/adversarial pass-rate,
- high compile success,
- better portability over time,
- stable behavior on held-out/edge-case tasks.
---
## 12) How to run and verify locally
From `polyglot_optima/`:
```bash
python -m ruff check .
python -m pytest -q
```
Smoke test (LLM-in-the-loop):
```bash
python tests/smoke_llm_hf.py
```
Cursor/OpenAI-compatible mode (in Windows `cmd`, use `set` in place of `export`):
```bash
export LLM_PROVIDER=cursor
export CURSOR_API_KEY=...
export CURSOR_MODEL=gpt-4.1-nano
python tests/smoke_llm_hf.py
```
---
## 13) Training workflow for beginners
Use `training/openenv_hackathon_training.ipynb`:
1. Configure model + episodes + logging.
2. Run baseline eval first (fixed seeds).
3. Run RL training (TRL scaffold cell).
4. Run post-training eval with same seed protocol.
5. Export plots to `docs/plots`.
6. Add results to `README.md`.
Track at least:
- reward,
- correctness pass rate,
- compile success rate,
- portability metrics.
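The tracked metrics can be averaged per evaluation run with a few lines of Python. The record keys below are hypothetical and the two records are placeholders, not real results; adapt the names to whatever the notebook actually logs.

```python
from statistics import mean


def summarize(episodes: list[dict]) -> dict[str, float]:
    """Average the tracked metrics over a list of per-episode records."""
    keys = ("reward", "correctness_pass_rate", "compiled", "portability")
    return {key: mean(float(ep[key]) for ep in episodes) for key in keys}


# Placeholder records that only show the expected shape.
example_run = [
    {"reward": 0.2, "correctness_pass_rate": 0.5, "compiled": True, "portability": 0.0},
    {"reward": 0.4, "correctness_pass_rate": 1.0, "compiled": False, "portability": 1.0},
]
summary = summarize(example_run)
```

Running the same function on baseline and post-training logs, with the same seeds, gives directly comparable numbers for the before/after plots.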
---
## 14) How this maps to hackathon judging
The project can score well if you clearly show:
- **Innovation:** adaptive curriculum + anti-gaming traps + structured reward
- **Storytelling:** clear problem -> method -> before/after outcome
- **Improvement evidence:** baseline vs trained plots
- **Pipeline quality:** reproducible notebook/script + OpenEnv-compliant deployment
---
## 15) Most important files to read next
Recommended reading order:
1. `README.md`
2. `models.py`
3. `server/environment.py`
4. `server/tools/submit.py`
5. `server/tools/cpp_compiler.py`
6. `server/tools/verifier.py`
7. `server/rewards/__init__.py`
8. `server/scenarios/dataset_loader.py`
9. `tests/test_skeleton.py`
---
## 16) Beginner takeaway
If you remember one thing:
This is not just "code generation."
It is a full RL environment that teaches an AI to do **correct, robust, hardware-aware optimization** under realistic constraints.