# Polyglot-Optima Beginner + Technical Explanation

This document explains the project from zero, then gradually adds technical depth.

---

## 1) One-line idea

`Polyglot-Optima` is a training environment where an AI learns to convert Python functions into fast C++ **without breaking correctness**.

---

## 2) Why this project exists

Most code models can produce "fast-looking" code, but in real systems that is not enough.

Common failure modes:
- code compiles but gives wrong outputs,
- code is fast only on one machine but fails elsewhere,
- the reward is easy to game (the model hacks the scoring instead of solving the task),
- model does not improve over multiple refinement rounds.

This project is built to fix those problems using:
- strict compile checks,
- fuzz-based correctness verification,
- cross-hardware portability checks,
- anti-gaming trap tasks,
- curriculum learning (easy -> hard),
- structured continuous reward.

---

## 3) Mental model (simple)

Think of this project as a game with rules:

- **Input:** a Python function + a hardware profile.
- **Player (AI):** can call tools to analyze and optimize.
- **Goal:** submit C++ that is fast *and* correct.
- **Score (reward):** combines speed, correctness, reasoning quality, and portability.

The AI plays this game many times and learns better strategies.

---

## 4) Core architecture

Main files and folders:

- `models.py`  
  Defines typed data objects for actions, observations, and state.

- `server/environment.py`  
  The main OpenEnv environment implementation (`reset`, `step`, `state`, `close`).

- `server/tools/`  
  Actual capability tools (compiler, verifier, profiling, portability, submit).

- `server/rewards/`  
  Reward rubrics and reward composition logic.

- `server/scenarios/`  
  Task generators, hardware profiles, trap library, and adaptive curriculum.

- `tests/`  
  Unit + integration tests validating behavior and quality.
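
As a rough illustration of what "typed data objects for actions, observations, and state" can look like, here is a minimal sketch using standard-library dataclasses. The class and field names (`ToolAction`, `ToolObservation`, `EpisodeState`) are hypothetical and are not taken from `models.py`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolAction:
    """Hypothetical action: which tool to call and with what arguments."""
    tool_name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolObservation:
    """Hypothetical observation returned to the model after one step."""
    tool_name: str
    result: dict[str, Any]
    reward: float = 0.0
    done: bool = False


@dataclass
class EpisodeState:
    """Hypothetical bookkeeping the environment keeps per episode."""
    round_index: int = 1
    calls_this_round: int = 0
    round_rewards: list[float] = field(default_factory=list)


action = ToolAction("verify_equivalence", {"cpp_code": "int f() { return 1; }"})
state = EpisodeState()
```

Typed objects like these make it harder for a malformed tool call to slip through unnoticed, which is presumably why the project defines them in one place.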

---

## 5) Episode lifecycle (what happens in one training sample)

Each episode has 3 rounds.

### Round flow
1. Environment samples:
   - Python code task
   - hardware profile
   - hidden bottleneck labels (for diagnosis scoring)
2. Model calls tools (analyze, compile, verify, etc.).
3. Model eventually calls `submit_optimization`.
4. Environment computes round reward.
5. Repeat for rounds 2 and 3.
6. Final episode reward is computed from round rewards.

### Important implementation details
- `max_calls_per_round` is enforced.
- If the call budget is exhausted, the environment forces a submit for that round.
- The adaptive curriculum can update the global difficulty after batch outcomes.
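
The round flow and the call budget can be sketched as a toy loop. This is not the real `server/environment.py`: the policy is random, the round reward is a placeholder, the budget value is invented, and averaging the round rewards into the episode reward is an assumption.

```python
import random

NUM_ROUNDS = 3            # stated in the text above
MAX_CALLS_PER_ROUND = 4   # illustrative value, not the project's setting

TOOL_CHOICES = ["analyze_complexity", "compile_and_benchmark",
                "verify_equivalence", "submit_optimization"]


def toy_policy(rng: random.Random) -> str:
    """Stand-in for the model: picks a tool name at random."""
    return rng.choice(TOOL_CHOICES)


def run_episode(seed: int) -> dict:
    rng = random.Random(seed)
    round_rewards, forced_submits = [], []
    for _ in range(NUM_ROUNDS):
        calls, submitted = 0, False
        while calls < MAX_CALLS_PER_ROUND and not submitted:
            calls += 1
            submitted = toy_policy(rng) == "submit_optimization"
        # Budget used up without a submit: the environment closes the round.
        forced_submits.append(not submitted)
        round_rewards.append(rng.random())  # placeholder for the rubric score
    # Assumption: the episode reward is the mean of the round rewards.
    episode_reward = sum(round_rewards) / len(round_rewards)
    return {"round_rewards": round_rewards,
            "forced_submits": forced_submits,
            "episode_reward": episode_reward}


result = run_episode(seed=0)
```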

---

## 6) The 9 tools (what the model can do)

The AI does not have to guess everything. It gathers evidence by calling tools:

1. `get_hardware_profile`
2. `profile_python_hotspots`
3. `analyze_complexity`
4. `check_memory_access`
5. `compile_and_benchmark`
6. `verify_equivalence`
7. `check_portability`
8. `get_bottleneck_report`
9. `submit_optimization` (round-closing action)

The most important tools for trustworthiness are:
- `compile_and_benchmark` (real compile/runtime behavior),
- `verify_equivalence` (catches wrong-but-fast code),
- `check_portability` (checks behavior across profiles).
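
One common way to expose a fixed set of named tools is a registry that maps tool names to functions. The sketch below is a hypothetical illustration of that pattern only; the decorator, the `call_tool` helper, and the returned values are invented and do not describe the code in `server/tools/`.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Register a function under a tool name (hypothetical helper)."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        TOOLS[name] = fn
        return fn
    return register


@tool("get_hardware_profile")
def get_hardware_profile() -> dict:
    # Made-up profile values, for illustration only.
    return {"cores": 8, "simd": "avx2"}


@tool("submit_optimization")
def submit_optimization(cpp_code: str) -> dict:
    # The round-closing action: always ends the round.
    return {"accepted": bool(cpp_code.strip()), "round_closed": True}


def call_tool(name: str, **kwargs) -> dict:
    """Dispatch a tool call by name; unknown names return an error dict."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**kwargs)
```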

---

## 7) Reward system explained simply

Reward is **continuous**, not just pass/fail.

That means:
- weak solutions get small score,
- better solutions get higher score,
- fully good solutions get top score.

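This is important for RL because the model needs a graded learning signal: a pass/fail reward gives the same score to a near miss and a complete failure, so there is nothing to climb.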
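
To make the difference concrete, the sketch below compares a pass/fail score with a continuous one for the same speedups. The log scaling, the 2x threshold, and the 16x cap are assumptions chosen for illustration, not the project's actual `SpeedupRubric`.

```python
import math


def binary_reward(speedup: float, threshold: float = 2.0) -> float:
    """Pass/fail: everything below the threshold looks identical."""
    return 1.0 if speedup >= threshold else 0.0


def continuous_reward(speedup: float, max_speedup: float = 16.0) -> float:
    """Log-scaled score in [0, 1]: 1x maps to 0, max_speedup maps to 1."""
    if speedup <= 1.0:
        return 0.0
    return min(1.0, math.log(speedup) / math.log(max_speedup))


for s in (1.2, 1.9, 4.0, 16.0):
    print(f"speedup {s:>4}: binary={binary_reward(s)}, "
          f"continuous={continuous_reward(s):.2f}")
```

With the binary score, 1.2x and 1.9x both earn 0, so the model gets no credit for moving in the right direction; the continuous score separates them.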

### Reward components
- **SpeedupRubric:** how much faster the C++ is than the Python baseline
- **CorrectnessRubric:** fuzz pass-rate quality
- **CompilationRubric:** compile quality/status
- **DiagnosisRubric:** quality/coherence of bottleneck reasoning
- **PortabilityRubric:** cross-profile robustness
- **SelfCorrectionRubric:** improvement from earlier rounds

### Composition
Reward is composed using rubric operators (`Sequential`, `Gate`, `WeightedSum`), so it is easier to reason about and tune than one large monolithic score function.
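
A minimal sketch of how such operators might compose is shown below. The three function names mirror the operator names above, but their signatures and semantics here are assumptions, as are the weights and the 0.5 threshold; the real implementation lives in `server/rewards/`.

```python
from typing import Callable

# A rubric maps one submission result to a score in [0, 1].
Rubric = Callable[[dict], float]


def weighted_sum(parts: list[tuple[float, Rubric]]) -> Rubric:
    """Weighted average of several rubrics (weights are normalized)."""
    total = sum(weight for weight, _ in parts)
    return lambda r: sum(weight * rubric(r) for weight, rubric in parts) / total


def gate(condition: Rubric, inner: Rubric, threshold: float) -> Rubric:
    """Score `inner` only when `condition` reaches `threshold`, else 0."""
    return lambda r: inner(r) if condition(r) >= threshold else 0.0


def sequential(stages: list[Rubric]) -> Rubric:
    """Run stages in order; a zero from any stage ends scoring with 0."""
    def score(r: dict) -> float:
        value = 0.0
        for stage in stages:
            value = stage(r)
            if value == 0.0:
                return 0.0
        return value
    return score


def compiles(r: dict) -> float:
    return 1.0 if r["compiled"] else 0.0


def correctness(r: dict) -> float:
    return r["fuzz_pass_rate"]


def speed(r: dict) -> float:
    return r["speed_score"]


# Illustrative wiring: compile first, then require a 0.5 fuzz pass rate
# before blending correctness and speed. Weights and threshold are made up.
reward = sequential([
    compiles,
    gate(correctness, weighted_sum([(0.6, correctness), (0.4, speed)]), threshold=0.5),
])
```

Under this wiring, a submission that compiles but fails most fuzz cases scores 0 no matter how fast it is, which is the kind of property that is easier to see in a composed rubric than in one large score function.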

---

## 8) Anti-gaming design

This project assumes the model will try shortcuts. So it includes defenses:

- Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases)
- Adversarial fuzzing
- Correctness + adversarial pass-rate signals
- Portability checks across hardware profiles
- Reasoning/diagnosis quality signal

Net effect: "fast but wrong" should score poorly.
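
The sketch below shows why fuzzing with adversarial inputs matters, using an overflow trap. The "candidate" is a Python stand-in for a C++ port that accumulates in a 32-bit integer; all function names are hypothetical, and the real verifier in `server/tools/verifier.py` may work quite differently.

```python
import random


def reference_sum_squares(xs: list[int]) -> int:
    """Python reference: integers never overflow."""
    return sum(x * x for x in xs)


def wrap_int32(v: int) -> int:
    """Simulate 32-bit two's-complement wraparound.

    Signed overflow is undefined behavior in C++; wraparound is only the
    commonly observed outcome, used here as a stand-in.
    """
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v


def candidate_sum_squares(xs: list[int]) -> int:
    """Stand-in for a fast C++ port that accumulates in a 32-bit int."""
    acc = 0
    for x in xs:
        acc = wrap_int32(acc + x * x)
    return acc


def fuzz_pass_rate(reference, candidate, make_input, trials=200, seed=0) -> float:
    """Fraction of random inputs on which candidate matches reference."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        xs = make_input(rng)
        passed += reference(xs) == candidate(xs)
    return passed / trials


def small_inputs(rng: random.Random) -> list[int]:
    return [rng.randint(-100, 100) for _ in range(8)]


def adversarial_inputs(rng: random.Random) -> list[int]:
    # Values whose squares overflow a 32-bit accumulator within two additions.
    return [rng.randint(40_000, 50_000) for _ in range(8)]


easy_rate = fuzz_pass_rate(reference_sum_squares, candidate_sum_squares, small_inputs)
hard_rate = fuzz_pass_rate(reference_sum_squares, candidate_sum_squares, adversarial_inputs)
```

The candidate matches the reference on every small input and on none of the adversarial ones, so a verifier that only sampled small values would wrongly accept it.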

---

## 9) Curriculum learning (easy -> hard)

Difficulty axes include:
- function complexity tier,
- hardware difficulty class,
- verifier strictness,
- portability requirement.

The curriculum controller monitors the success rate of each batch and adjusts:
- high success -> increase difficulty,
- low success -> reduce difficulty,
- middle zone -> hold.

This stabilizes learning and prevents early collapse.
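
The promote/hold/demote rule can be sketched as follows. The thresholds (0.8 and 0.3), the single integer difficulty level, and the class name are assumptions for illustration; the project's controller tracks several difficulty axes.

```python
from dataclasses import dataclass


@dataclass
class CurriculumController:
    """Hypothetical controller with one difficulty level."""
    level: int = 0
    max_level: int = 5
    promote_above: float = 0.8   # illustrative threshold
    demote_below: float = 0.3    # illustrative threshold

    def update(self, batch_successes: list[bool]) -> int:
        """Adjust the level from one batch of outcomes and return it."""
        if not batch_successes:
            return self.level
        rate = sum(batch_successes) / len(batch_successes)
        if rate > self.promote_above:
            self.level = min(self.max_level, self.level + 1)
        elif rate < self.demote_below:
            self.level = max(0, self.level - 1)
        # Otherwise the rate is in the middle zone: hold the current level.
        return self.level


controller = CurriculumController()
history = [
    controller.update([True] * 9 + [False]),   # 0.9 success: promote
    controller.update([True, False]),          # 0.5 success: hold
    controller.update([False] * 4),            # 0.0 success: demote
]
```

The hold zone between the two thresholds keeps the difficulty from oscillating when the success rate hovers near a single cutoff.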

---

## 10) Adaptive traps (what was improved)

Adaptive traps now do two things:
- prioritize categories where the model recently failed,
- create semantic-preserving trap variants (not only naive renaming).

Why this matters:
- reduces memorization,
- improves robustness,
- increases novelty/innovation signal for judges.
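
Prioritizing recently failed categories can be done with weighted sampling. The sketch below assumes a simple scheme in which each category's weight is a base value plus its recent failure count; the category names follow the trap list in section 8, but the weighting rule and function names are hypothetical. Generating semantic-preserving variants is not covered here.

```python
import random
from collections import Counter

TRAP_CATEGORIES = ["overflow", "nan_inf", "aliasing", "semantic_edge"]


def trap_weights(recent_failures: Counter, base: float = 1.0) -> dict[str, float]:
    """Every category keeps a base weight so none is starved entirely."""
    return {c: base + recent_failures.get(c, 0) for c in TRAP_CATEGORIES}


def sample_trap(recent_failures: Counter, rng: random.Random) -> str:
    """Draw one trap category, favoring those with recent failures."""
    weights = trap_weights(recent_failures)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


failures = Counter({"overflow": 9})   # the model recently failed overflow traps
rng = random.Random(0)
draws = Counter(sample_trap(failures, rng) for _ in range(1000))
```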

---

## 11) What "good performance" means here

Good performance is more than one high speedup number.

A good policy should show:
- increasing reward trend,
- high correctness/adversarial pass-rate,
- high compile success,
- better portability over time,
- stable behavior on held-out/edge-case tasks.

---

## 12) How to run and verify locally

From `polyglot_optima/`:

```bash
python -m ruff check .
python -m pytest -q
```

Smoke test (LLM-in-the-loop):

```bash
python tests/smoke_llm_hf.py
```

Cursor/OpenAI-compatible mode:

```bash
# bash/zsh syntax; in Windows cmd.exe use `set NAME=value` instead of `export`
export LLM_PROVIDER=cursor
export CURSOR_API_KEY=...
export CURSOR_MODEL=gpt-4.1-nano
python tests/smoke_llm_hf.py
```

---

## 13) Training workflow for beginners

Use `training/openenv_hackathon_training.ipynb`:

1. Configure model + episodes + logging.
2. Run baseline eval first (fixed seeds).
3. Run RL training (TRL scaffold cell).
4. Run post-training eval with same seed protocol.
5. Export plots to `docs/plots`.
6. Add results to `README.md`.

Track at least:
- reward,
- correctness pass rate,
- compile success rate,
- portability metrics.
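
The "same seed protocol" in steps 2 and 4 means both evaluations see identical tasks, so any difference comes from the policy rather than from sampling luck. A minimal sketch, assuming a hypothetical policy callable that takes a seeded generator and returns one row of metrics (the dummy policy below returns random placeholders, not real results):

```python
import random
from statistics import mean

EVAL_SEEDS = [0, 1, 2, 3, 4]   # fixed list reused for baseline and post-training


def dummy_policy(rng: random.Random) -> dict[str, float]:
    """Placeholder for running one episode; values are random, not measured."""
    return {
        "reward": rng.random(),
        "correctness_pass_rate": rng.random(),
        "compile_success": float(rng.random() < 0.9),
        "portability": rng.random(),
    }


def evaluate(policy, seeds: list[int]) -> dict[str, float]:
    """Run one episode per seed and average each tracked metric."""
    rows = [policy(random.Random(seed)) for seed in seeds]
    return {key: mean(row[key] for row in rows) for key in rows[0]}


first = evaluate(dummy_policy, EVAL_SEEDS)
second = evaluate(dummy_policy, EVAL_SEEDS)
```

Because each episode gets its own seeded generator, repeating the evaluation with the same policy reproduces the same averages.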

---

## 14) How this maps to hackathon judging

The project can score well if you clearly show:

- **Innovation:** adaptive curriculum + anti-gaming traps + structured reward
- **Storytelling:** clear problem -> method -> before/after outcome
- **Improvement evidence:** baseline vs trained plots
- **Pipeline quality:** reproducible notebook/script + OpenEnv-compliant deployment

---

## 15) Most important files to read next

Recommended reading order:

1. `README.md`
2. `models.py`
3. `server/environment.py`
4. `server/tools/submit.py`
5. `server/tools/cpp_compiler.py`
6. `server/tools/verifier.py`
7. `server/rewards/__init__.py`
8. `server/scenarios/dataset_loader.py`
9. `tests/test_skeleton.py`

---

## 16) Beginner takeaway

If you remember one thing:

This is not just "code generation."  
It is a full RL environment that teaches an AI to do **correct, robust, hardware-aware optimization** under realistic constraints.