
	# Polyglot-Optima Beginner + Technical Explanation

	This document explains the project from zero, then gradually adds technical depth.

	---

	## 1) One-line idea

	`Polyglot-Optima` is a training environment where an AI learns to convert Python functions into fast C++ without breaking correctness.

	---

	## 2) Why this project exists

	Most code models can produce "fast-looking" code, but in real systems that is not enough.

	Common failure modes:
	- the code compiles but gives wrong outputs,
	- the code is fast on one machine but fails on others,
	- the reward is easy to game (the model exploits the scoring instead of solving the task),
	- the model does not improve over multiple refinement rounds.

	This project is built to fix those problems using:
	- strict compile checks,
	- fuzz-based correctness verification,
	- cross-hardware portability checks,
	- anti-gaming trap tasks,
	- curriculum learning (easy -> hard),
	- structured continuous reward.

	---

	## 3) Mental model (simple)

	Think of this project as a game with rules:

	- Input: a Python function + a hardware profile.
	- Player (AI): can call tools to analyze and optimize.
	- Goal: submit C++ that is fast and correct.
	- Score (reward): combines speed, correctness, reasoning quality, and portability.

	The AI plays this game many times and learns better strategies.

	---

	## 4) Core architecture

	Main files and folders:

	- `models.py`
	Defines typed data objects for actions, observations, and state.

	- `server/environment.py`
	The main OpenEnv environment implementation (`reset`, `step`, `state`, `close`).

	- `server/tools/`
	Actual capability tools (compiler, verifier, profiling, portability, submit).

	- `server/rewards/`
	Reward rubrics and reward composition logic.

	- `server/scenarios/`
	Task generators, hardware profiles, trap library, and adaptive curriculum.

	- `tests/`
	Unit + integration tests validating behavior and quality.
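
To make the role of `models.py` more concrete, here is a minimal sketch of what typed action, observation, and state objects might look like. The class and field names (`ToolAction`, `ToolObservation`, `EpisodeState`) are hypothetical and chosen only for illustration; the real definitions live in `models.py` and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolAction:
    """One tool call requested by the model (hypothetical shape)."""
    tool_name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ToolObservation:
    """What the environment returns after a tool call (hypothetical shape)."""
    tool_name: str
    result: dict[str, Any]
    reward: float = 0.0
    done: bool = False


@dataclass
class EpisodeState:
    """Mutable bookkeeping for one episode (hypothetical shape)."""
    round_index: int = 1
    calls_this_round: int = 0
    round_rewards: list[float] = field(default_factory=list)


# Build an action for one of the tools named in this document.
action = ToolAction("compile_and_benchmark", {"cpp_source": "int main() { return 0; }"})

# The environment would update its state after handling the call.
state = EpisodeState()
state.calls_this_round += 1
```

Typed objects like these let the environment reject malformed tool calls early instead of failing deep inside a tool.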

	---

	## 5) Episode lifecycle (what happens in one training sample)

	Each episode has 3 rounds.

	### Round flow
	1. Environment samples:
	   - Python code task
	   - hardware profile
	   - hidden bottleneck labels (for diagnosis scoring)
	2. Model calls tools (analyze, compile, verify, etc.).
	3. Model eventually calls `submit_optimization`.
	4. Environment computes round reward.
	5. Repeat for rounds 2 and 3.
	6. Final episode reward is computed from round rewards.

	### Important implementation details
	- `max_calls_per_round` is enforced.
	- If call budget is exhausted, environment forces submit for that round.
	- Adaptive curriculum can update global difficulty after batch outcomes.
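
The round flow and the call budget can be sketched as a small loop. This is a simplified illustration, not the code in `server/environment.py`: the budget value, the `policy` and `score_round` callables, and the function name `run_episode` are all assumptions made for the example.

```python
from typing import Callable

MAX_CALLS_PER_ROUND = 4  # assumed value; the real budget is set by the environment
NUM_ROUNDS = 3           # from the text: each episode has 3 rounds


def run_episode(policy: Callable[[int, int], str],
                score_round: Callable[[int], float]) -> list[float]:
    """Run one episode and return the per-round rewards.

    policy(round_index, calls_used) returns the name of the next tool to call.
    score_round(round_index) stands in for the environment's reward computation.
    """
    round_rewards = []
    for round_index in range(1, NUM_ROUNDS + 1):
        calls_used = 0
        while True:
            if calls_used >= MAX_CALLS_PER_ROUND:
                # Budget exhausted: the environment forces a submit for this round.
                tool = "submit_optimization"
            else:
                tool = policy(round_index, calls_used)
            calls_used += 1
            if tool == "submit_optimization":
                round_rewards.append(score_round(round_index))
                break
    return round_rewards


def never_submits(round_index: int, calls_used: int) -> str:
    """A policy that keeps analyzing and never submits on its own."""
    return "analyze_complexity"


# Even this policy finishes all three rounds, because of the forced submit.
rewards = run_episode(never_submits, score_round=lambda r: float(r))
```

The forced submit guarantees that every round ends, so a model cannot stall an episode by calling analysis tools forever.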

	---

	## 6) The 9 tools (what the model can do)

	The AI does not directly "guess" everything. It uses tools:

	1. `get_hardware_profile`
	2. `profile_python_hotspots`
	3. `analyze_complexity`
	4. `check_memory_access`
	5. `compile_and_benchmark`
	6. `verify_equivalence`
	7. `check_portability`
	8. `get_bottleneck_report`
	9. `submit_optimization` (round-closing action)

	The most important tools for trustworthiness are:
	- `compile_and_benchmark` (real compile/runtime behavior),
	- `verify_equivalence` (catches wrong-but-fast code),
	- `check_portability` (checks behavior across profiles).
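
To show the idea behind fuzz-based equivalence checking, here is a small sketch. It is an assumption-heavy stand-in for `verify_equivalence`: both the reference and the candidate are plain Python callables here, whereas the real tool compares compiled C++ against the Python original. The function name `fuzz_pass_rate` and the input ranges are made up for the example.

```python
import random
from typing import Callable, Sequence


def fuzz_pass_rate(reference: Callable[[Sequence[int]], int],
                   candidate: Callable[[Sequence[int]], int],
                   trials: int = 200,
                   seed: int = 0) -> float:
    """Return the fraction of random inputs on which candidate matches reference."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    passed = 0
    for _ in range(trials):
        size = rng.randint(0, 8)
        data = [rng.randint(-100, 100) for _ in range(size)]
        try:
            if candidate(data) == reference(data):
                passed += 1
        except Exception:
            # A crash counts as a failed trial.
            pass
    return passed / trials


def reference_sum_sq(xs):
    """The original Python function: sum of squares."""
    return sum(x * x for x in xs)


def correct_rewrite(xs):
    """An equivalent rewrite using an explicit loop."""
    total = 0
    for x in xs:
        total += x * x
    return total


def buggy_rewrite(xs):
    """An off-by-one rewrite that silently skips the last element."""
    return sum(x * x for x in xs[:-1])


good_rate = fuzz_pass_rate(reference_sum_sq, correct_rewrite)
bad_rate = fuzz_pass_rate(reference_sum_sq, buggy_rewrite)
```

The buggy rewrite still passes on some inputs (for example, empty lists), which is why a pass rate is more informative than a single test case.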

	---

	## 7) Reward system explained simply

	Reward is continuous, not just pass/fail.

	That means:
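	- weak solutions get a small score,
	- better solutions get a higher score,
	- solutions that are both correct and fast get the top score.

	This matters for RL because a graded score tells the model whether a change moved it in the right direction, while a pass/fail score gives almost no signal until the model already succeeds.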
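
The difference between a pass/fail score and a continuous score can be sketched with a speedup example. The log-scaled formula and the target of 8x below are assumptions picked for illustration; they are not the project's actual `SpeedupRubric`.

```python
import math


def binary_score(speedup: float, target: float = 8.0) -> float:
    """Pass/fail: full credit only when the target speedup is reached."""
    return 1.0 if speedup >= target else 0.0


def continuous_score(speedup: float, target: float = 8.0) -> float:
    """Log-scaled score in [0, 1].

    A speedup of 1x or less scores 0, the target or more scores 1,
    and everything in between earns partial credit.
    """
    if speedup <= 1.0:
        return 0.0
    return min(1.0, math.log(speedup) / math.log(target))


# Compare the two scores on a few speedups.
for s in (1.5, 2.0, 4.0, 8.0):
    print(f"speedup={s}: binary={binary_score(s)}, continuous={continuous_score(s):.3f}")
```

With the binary score, a model that goes from 2x to 4x sees no change at all. With the continuous score, the same step is rewarded, so the model gets feedback long before it reaches the target.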

	### Reward components
	- SpeedupRubric: how much faster C++ is vs Python baseline
	- CorrectnessRubric: fuzz pass-rate quality
	- CompilationRubric: compile quality/status
	- DiagnosisRubric: quality/coherence of bottleneck reasoning
	- PortabilityRubric: cross-profile robustness
	- SelfCorrectionRubric: improvement from earlier rounds

	### Composition
	Reward is composed using rubric operators (`Sequential`, `Gate`, `WeightedSum`), so it is easier to reason about and tune than one large monolithic score function.
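
Here is one way such operators could fit together. The operator names come from the project, but everything else is an assumption: the semantics given to each operator, the lowercase function names, the metric keys, the weights, and the 0.9 threshold are all invented for this sketch and may not match `server/rewards/`.

```python
from typing import Callable, Mapping

Metrics = Mapping[str, float]
Rubric = Callable[[Metrics], float]


def weighted_sum(parts: list[tuple[float, Rubric]]) -> Rubric:
    """Combine rubrics as a weighted average (weights are normalized)."""
    total = sum(w for w, _ in parts)
    return lambda m: sum(w * r(m) for w, r in parts) / total


def gate(condition: Rubric, inner: Rubric, threshold: float) -> Rubric:
    """Return inner's score only if condition reaches threshold, else 0 (assumed semantics)."""
    return lambda m: inner(m) if condition(m) >= threshold else 0.0


def sequential(stages: list[Rubric]) -> Rubric:
    """Run stages in order; stop with 0 at the first stage that scores 0,
    otherwise return the last stage's score (assumed semantics)."""
    def run(m: Metrics) -> float:
        score = 0.0
        for stage in stages:
            score = stage(m)
            if score == 0.0:
                return 0.0
        return score
    return run


# Leaf rubrics just read a precomputed metric (hypothetical keys).
compiled = lambda m: m["compiled"]
correctness = lambda m: m["fuzz_pass_rate"]
speed = lambda m: m["speed_score"]
portability = lambda m: m["portability"]

# Must compile first; speed and portability only count if correctness is high.
reward = sequential([
    compiled,
    gate(correctness, weighted_sum([(0.6, speed), (0.4, portability)]), threshold=0.9),
])

good = {"compiled": 1.0, "fuzz_pass_rate": 1.0, "speed_score": 0.5, "portability": 1.0}
fast_but_wrong = {"compiled": 1.0, "fuzz_pass_rate": 0.5, "speed_score": 1.0, "portability": 1.0}

good_reward = reward(good)
wrong_reward = reward(fast_but_wrong)
```

Because each piece is a small function, you can change one weight or one gate without touching the rest of the reward.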

	---

	## 8) Anti-gaming design

	This project assumes the model will try shortcuts. So it includes defenses:

	- Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases)
	- Adversarial fuzzing
	- Correctness + adversarial pass-rate signals
	- Portability checks across hardware profiles
	- Reasoning/diagnosis quality signal

	Net effect: "fast but wrong" should score poorly.
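
An overflow trap is a good example of how "fast but wrong" code gets caught. Python integers never overflow, but a careless C++ translation might accumulate into a 32-bit `int`. The sketch below emulates that in Python using two's-complement wraparound. Note that signed overflow is undefined behavior in C++, so wraparound is only a commonly observed outcome, used here for illustration. The function names are hypothetical.

```python
def python_sum(values):
    """Reference behavior: Python ints have arbitrary precision."""
    return sum(values)


def wrap_int32(x: int) -> int:
    """Reduce x to the signed 32-bit range using two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31


def naive_int32_sum(values):
    """Emulates a translation that accumulates into a 32-bit signed integer."""
    total = 0
    for v in values:
        total = wrap_int32(total + v)
    return total


small = [1, 2, 3]        # ordinary input: both versions agree
trap = [2**31 - 1, 1]    # trap input: the true sum no longer fits in 32 bits

agrees_on_small = naive_int32_sum(small) == python_sum(small)
agrees_on_trap = naive_int32_sum(trap) == python_sum(trap)
```

A test suite with only small inputs would accept the naive translation. The trap input exposes it, which is why trap tasks are part of the environment.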

	---

	## 9) Curriculum learning (easy -> hard)

	Difficulty axes include:
	- function complexity tier,
	- hardware difficulty class,
	- verifier strictness,
	- portability requirement.

	The curriculum controller monitors the success rate over each batch and adjusts difficulty:
	- high success -> increase difficulty,
	- low success -> reduce difficulty,
	- middle zone -> hold.

	This stabilizes learning and prevents early collapse.
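
The controller's three-zone rule can be sketched in a few lines. The thresholds (0.8 and 0.3), the level range, and the function name `next_difficulty` are assumptions for illustration; the project's actual values may differ.

```python
def next_difficulty(level: int,
                    success_rate: float,
                    raise_above: float = 0.8,
                    lower_below: float = 0.3,
                    min_level: int = 0,
                    max_level: int = 5) -> int:
    """Return the difficulty level to use for the next batch."""
    if success_rate > raise_above:
        # High success: make tasks harder, but never past the top level.
        return min(level + 1, max_level)
    if success_rate < lower_below:
        # Low success: make tasks easier, but never below the bottom level.
        return max(level - 1, min_level)
    # Middle zone: hold the current level.
    return level


# Simulate a few batches with made-up success rates.
level = 2
for rate in (0.9, 0.85, 0.5, 0.1):
    level = next_difficulty(level, rate)
```

The middle "hold" zone acts as a buffer, so the difficulty does not flip up and down on every batch.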

	---

	## 10) Adaptive traps (what was improved)

	Adaptive traps now do two things:
	- prioritize categories where the model recently failed,
	- create semantic-preserving trap variants (not only naive renaming).

	Why this matters:
	- reduces memorization,
	- improves robustness,
	- increases novelty/innovation signal for judges.
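
The first behavior, prioritizing categories where the model recently failed, can be sketched as weighted random sampling. The category identifiers, the weighting rule (base weight plus failure count), and the function name are assumptions for this example, not the project's trap library.

```python
import random
from collections import Counter

# Categories taken from the trap list in this document; identifiers are hypothetical.
TRAP_CATEGORIES = ["overflow", "nan_inf", "aliasing", "semantic_edge"]


def sample_trap_category(recent_failures: Counter,
                         rng: random.Random,
                         base_weight: float = 1.0) -> str:
    """Pick a trap category, favoring those with more recent failures.

    The base weight keeps every category reachable, even with zero failures.
    """
    weights = [base_weight + recent_failures[c] for c in TRAP_CATEGORIES]
    return rng.choices(TRAP_CATEGORIES, weights=weights, k=1)[0]


rng = random.Random(0)
failures = Counter({"overflow": 9})  # the model recently failed 9 overflow traps

# Draw many categories and count how often each one is chosen.
draws = Counter(sample_trap_category(failures, rng) for _ in range(1000))
```

With these weights, overflow traps are drawn about ten times as often as any other single category, while the other categories still appear occasionally so earlier skills are not forgotten.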

	---

	## 11) What "good performance" means here

	Good performance is more than a single high speedup number.

	A good policy should show:
	- increasing reward trend,
	- high correctness/adversarial pass-rate,
	- high compile success,
	- better portability over time,
	- stable behavior on held-out/edge-case tasks.

	---

	## 12) How to run and verify locally

	From `polyglot_optima/`:

	```bash
	python -m ruff check .
	python -m pytest -q
	```

	Smoke test (LLM-in-the-loop):

	```bash
	python tests/smoke_llm_hf.py
	```

	Cursor/OpenAI-compatible mode:

	```bash
	# In Windows cmd, use `set` instead of `export`.
	export LLM_PROVIDER=cursor
	export CURSOR_API_KEY=...
	export CURSOR_MODEL=gpt-4.1-nano
	python tests/smoke_llm_hf.py
	```

	---

	## 13) Training workflow for beginners

	Use `training/openenv_hackathon_training.ipynb`:

	1. Configure model + episodes + logging.
	2. Run baseline eval first (fixed seeds).
	3. Run RL training (TRL scaffold cell).
	4. Run post-training eval with same seed protocol.
	5. Export plots to `docs/plots`.
	6. Add results to `README.md`.

	Track at least:
	- reward,
	- correctness pass rate,
	- compile success rate,
	- portability metrics.
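
A small helper can turn per-episode logs into the tracked metrics, so baseline and post-training runs are summarized the same way. The record keys and the `summarize` function are hypothetical, and the numbers below are placeholder values for the example, not real results from this project.

```python
from statistics import mean


def summarize(episodes: list[dict]) -> dict:
    """Aggregate per-episode records into the four tracked metrics."""
    return {
        "mean_reward": mean(e["reward"] for e in episodes),
        "correctness_pass_rate": mean(e["fuzz_pass_rate"] for e in episodes),
        "compile_success_rate": mean(1.0 if e["compiled"] else 0.0 for e in episodes),
        "mean_portability": mean(e["portability"] for e in episodes),
    }


# Placeholder episode records (made-up numbers, one dict per episode).
baseline_episodes = [
    {"reward": 0.25, "fuzz_pass_rate": 0.5, "compiled": True, "portability": 0.5},
    {"reward": 0.25, "fuzz_pass_rate": 0.75, "compiled": False, "portability": 0.25},
]

baseline_summary = summarize(baseline_episodes)
```

Run the same function on the post-training episodes, collected with the same seeds, and compare the two summaries side by side.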

	---

	## 14) How this maps to hackathon judging

	The project can score well if you clearly show:

	- Innovation: adaptive curriculum + anti-gaming traps + structured reward
	- Storytelling: clear problem -> method -> before/after outcome
	- Improvement evidence: baseline vs trained plots
	- Pipeline quality: reproducible notebook/script + OpenEnv-compliant deployment

	---

	## 15) Most important files to read next

	Recommended reading order:

	1. `README.md`
	2. `models.py`
	3. `server/environment.py`
	4. `server/tools/submit.py`
	5. `server/tools/cpp_compiler.py`
	6. `server/tools/verifier.py`
	7. `server/rewards/__init__.py`
	8. `server/scenarios/dataset_loader.py`
	9. `tests/test_skeleton.py`

	---

	## 16) Beginner takeaway

	If you remember one thing:

	This is not just "code generation."
	It is a full RL environment that teaches an AI to do correct, robust, hardware-aware optimization under realistic constraints.