Input: a Python function + a hardware profile.
Player (AI): can call tools to analyze and optimize.
Goal: submit C++ that is fast and correct.
Score (reward): combines speed, correctness, reasoning quality, and portability.

The AI plays this game many times and learns better strategies.

4) Core architecture

Main files and folders:

models.py: defines typed data objects for actions, observations, and state.
server/environment.py: the main OpenEnv environment implementation (reset, step, state, close).
server/tools/: the actual capability tools (compiler, verifier, profiling, portability, submit).
server/rewards/: reward rubrics and reward composition logic.
server/scenarios/: task generators, hardware profiles, trap library, and adaptive curriculum.
tests/: unit and integration tests validating behavior and quality.

5) Episode lifecycle (what happens in one training sample)

Each episode has 3 rounds.

Round flow

Environment samples:
- Python code task
- hardware profile
- hidden bottleneck labels (for diagnosis scoring)
Model calls tools (analyze, compile, verify, etc.).
Model eventually calls submit_optimization.
Environment computes round reward.
Repeat for rounds 2 and 3.
Final episode reward is computed from round rewards.

Important implementation details

max_calls_per_round caps the number of tool calls in a round and is enforced by the environment.
If the call budget is exhausted, the environment forces a submit for that round.
The adaptive curriculum can update global difficulty after batch outcomes.
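
The round flow and the call budget can be sketched in a few lines of Python. This is a minimal illustration, not the code in server/environment.py: the budget value, the RoundLog class, and the policy callback are hypothetical, and it assumes the forced submit is appended after the budget is used up.

```python
from dataclasses import dataclass, field

MAX_CALLS_PER_ROUND = 4  # hypothetical value; the real budget is configured by the environment
NUM_ROUNDS = 3           # from the text: each episode has 3 rounds


@dataclass
class RoundLog:
    calls: list = field(default_factory=list)
    forced_submit: bool = False


def run_round(policy, max_calls=MAX_CALLS_PER_ROUND):
    """Run one round. `policy(step)` is a stand-in for the model and returns a tool name."""
    log = RoundLog()
    for step in range(max_calls):
        tool = policy(step)
        log.calls.append(tool)
        if tool == "submit_optimization":
            return log  # the model closed the round itself
    # Budget exhausted without a submit: the environment closes the round.
    log.calls.append("submit_optimization")
    log.forced_submit = True
    return log


def run_episode(policy):
    """Run all rounds of one episode and return their logs."""
    return [run_round(policy) for _ in range(NUM_ROUNDS)]
```

A policy that never submits ends every round with forced_submit set to True, which is the behavior the budget rule describes.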

6) The 9 tools (what the model can do)

The AI does not directly "guess" everything. It uses tools:

get_hardware_profile
profile_python_hotspots
analyze_complexity
check_memory_access
compile_and_benchmark
verify_equivalence
check_portability
get_bottleneck_report
submit_optimization (round-closing action)

The most important tools for trustworthiness are:

compile_and_benchmark (real compile/runtime behavior),
verify_equivalence (catches wrong-but-fast code),
check_portability (checks behavior across profiles).

7) Reward system explained simply

Reward is continuous, not just pass/fail.

That means:

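weak solutions get a small score,
better solutions get a higher score,
fast, fully correct solutions get the top score.

This is important for RL because a pass/fail reward is sparse: the model needs a graded signal that separates a slightly better attempt from a slightly worse one.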

Reward components

SpeedupRubric: how much faster the C++ is than the Python baseline
CorrectnessRubric: pass rate under fuzz testing
CompilationRubric: compile quality/status
DiagnosisRubric: quality/coherence of bottleneck reasoning
PortabilityRubric: cross-profile robustness
SelfCorrectionRubric: improvement from earlier rounds

Composition

Reward is composed using rubric operators (Sequential, Gate, WeightedSum), so it is easier to reason about and tune than one large monolithic score function.
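
One plausible reading of this composition, sketched with plain functions: the helpers below are hypothetical stand-ins for the Gate, WeightedSum, and Sequential operators, not the project's actual rubric classes, and the weights and threshold are made up.

```python
# Hypothetical weights; the real values live in server/rewards/.
WEIGHTS = {
    "speedup": 0.5,
    "diagnosis": 0.2,
    "portability": 0.2,
    "self_correction": 0.1,
}


def weighted_sum(scores, weights):
    """WeightedSum stand-in: normalized weighted average of rubric scores in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(w * scores[name] for name, w in weights.items()) / total_weight


def gate(gate_score, threshold, value):
    """Gate stand-in: pass `value` through only if `gate_score` reaches the threshold."""
    return value if gate_score >= threshold else 0.0


def compose_reward(scores):
    """Sequential stand-in: compile gate first, then scale quality by correctness."""
    quality = weighted_sum(scores, WEIGHTS)
    compiled = gate(scores["compilation"], 1.0, quality)
    return compiled * scores["correctness"]
```

With this shape, a submission that does not compile scores 0.0, and a fast submission with a low fuzz pass rate has its reward scaled down in proportion, so each stage can be tuned separately.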

8) Anti-gaming design

This project assumes the model will try shortcuts. So it includes defenses:

Trap functions (overflow, NaN/Inf, aliasing, semantic edge cases)
Adversarial fuzzing
Correctness + adversarial pass-rate signals
Portability checks across hardware profiles
Reasoning/diagnosis quality signal

Net effect: "fast but wrong" should score poorly.

9) Curriculum learning (easy -> hard)

Difficulty axes include:

function complexity tier,
hardware difficulty class,
verifier strictness,
portability requirement.

Curriculum controller monitors success in batches and adjusts:

high success -> increase difficulty,
low success -> reduce difficulty,
middle zone -> hold.

This is meant to stabilize learning and prevent early collapse, where the model is given tasks too hard to earn any useful reward.
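
The three-zone rule above can be sketched as a small function. The thresholds, level range, and function name are assumptions for illustration; the real controller in server/scenarios/ may track several difficulty axes rather than a single level.

```python
def update_difficulty(level, batch_successes, promote_at=0.8, demote_at=0.3,
                      min_level=0, max_level=5):
    """Return the next difficulty level given a batch of success flags.

    All thresholds and bounds here are hypothetical defaults.
    """
    success_rate = sum(batch_successes) / len(batch_successes)
    if success_rate >= promote_at:
        return min(level + 1, max_level)  # high success -> increase difficulty
    if success_rate <= demote_at:
        return max(level - 1, min_level)  # low success -> reduce difficulty
    return level                          # middle zone -> hold
```

Clamping at min_level and max_level keeps the controller from drifting outside the range of tasks the scenario generators can produce.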

10) Adaptive traps (what was improved)

Adaptive traps now do two things:

prioritize categories where the model recently failed,
create semantic-preserving trap variants (not only naive renaming).

Why this matters:

reduces memorization,
improves robustness,
increases novelty/innovation signal for judges.
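
The first behavior, prioritizing categories where the model recently failed, can be sketched as weighted sampling. The category names follow the trap list in section 8, but the identifiers, weighting formula, and constants are hypothetical.

```python
import random
from collections import Counter

# Category names loosely follow the trap list in section 8; identifiers are hypothetical.
TRAP_CATEGORIES = ["overflow", "nan_inf", "aliasing", "semantic_edge"]


def trap_weights(recent_failures, base=1.0, boost=2.0):
    """Give every category a base weight, plus a boost per recent failure."""
    counts = Counter(recent_failures)
    return {category: base + boost * counts[category] for category in TRAP_CATEGORIES}


def sample_trap(recent_failures, rng):
    """Pick the next trap category, favoring recently failed ones."""
    weights = trap_weights(recent_failures)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

The nonzero base weight keeps every category in rotation, so the model still sees traps it has already learned to handle.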

11) What "good performance" means here

Not just one high speedup number.

A good policy should show:

increasing reward trend,
high correctness/adversarial pass-rate,
high compile success,
better portability over time,
stable behavior on held-out/edge-case tasks.

12) How to run and verify locally

From polyglot_optima/:

python -m ruff check .
python -m pytest -q

Smoke test (LLM-in-the-loop):

python tests/smoke_llm_hf.py

Cursor/OpenAI-compatible mode (Windows cmd syntax; in bash or zsh use export instead of set):

set LLM_PROVIDER=cursor
set CURSOR_API_KEY=...
set CURSOR_MODEL=gpt-4.1-nano
python tests/smoke_llm_hf.py

13) Training workflow for beginners

Use training/openenv_hackathon_training.ipynb:

Configure model + episodes + logging.
Run baseline eval first (fixed seeds).
Run RL training (TRL scaffold cell).
Run post-training eval with same seed protocol.
Export plots to docs/plots.
Add results to README.md.

Track at least:

reward,
correctness pass rate,
compile success rate,
portability metrics.
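
A minimal sketch of how these metrics could be aggregated so that baseline and post-training runs are compared on the same footing. The record fields and function name are assumptions, not the notebook's actual logging schema.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate per-episode records into the tracked metrics.

    Each record is assumed to be a dict with the hypothetical keys
    'reward', 'correctness', 'compiled', and 'portability'.
    """
    return {
        "mean_reward": mean(e["reward"] for e in episodes),
        "correctness_pass_rate": mean(e["correctness"] for e in episodes),
        "compile_success_rate": mean(1.0 if e["compiled"] else 0.0 for e in episodes),
        "portability": mean(e["portability"] for e in episodes),
    }


def improvement(baseline, trained):
    """Per-metric difference between two summaries computed on the same seeds."""
    return {name: trained[name] - baseline[name] for name in baseline}
```

Running summarize on the baseline and post-training episodes, both generated with the same fixed seeds, and passing the results to improvement gives the before/after deltas to plot.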

14) How this maps to hackathon judging

The project can score well if you clearly show:

Innovation: adaptive curriculum + anti-gaming traps + structured reward
Storytelling: clear problem -> method -> before/after outcome
Improvement evidence: baseline vs trained plots
Pipeline quality: reproducible notebook/script + OpenEnv-compliant deployment

15) Most important files to read next

16) Beginner takeaway

If you remember one thing:

This is not just "code generation."
It is a full RL environment that teaches an AI to do correct, robust, hardware-aware optimization under realistic constraints.