# Rewards

Multi-component reward system for the explore -> generate -> repair episode.

## Episode Flow

```
reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
```

Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.

Every action reward and `*_total` component is clamped to the `0-1` range.

## Exploration Rewards (`exploration.py`)

Per-step reward for each `explore` action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.

| Component | Weight | Range | Description |
|---|---|---|---|
| `query_quality` | 0.20 | 0-1 | Query relevance plus tool fit |
| `evidence_quality` | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
| `information_gain` | 0.40 | 0-1 | Newly covered concepts plus result novelty |
| `efficiency` | 0.15 | 0-1 | Action novelty scaled by remaining information need |
| `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |

**Gating mechanism**: `info_need = 1 - sufficiency`. Raw reward is scaled by `0.3 + 0.7 * info_need`, so high sufficiency -> low reward for more exploration. This teaches the agent to stop when it has enough.

## Generation Rewards (`generation.py`)

Reward on `generate` and `repair` actions. Uses **multiplicative gates** instead of additive weights for code validity.

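
The contrast between the two schemes introduced so far can be made concrete with a short sketch: the exploration reward is a weighted sum that is then scaled by information need, while generation multiplies a quality score by a validity gate. The weights and the `0.3 + 0.7 * info_need` scaling come from the Exploration Rewards section, and the gate factors come from the Gates table in this section. Everything else is an assumption for illustration: the function names, the point at which `step_cost` is subtracted (after gating here), and the use of 0.15 as a stand-in for the documented 0.12-0.18 static-check range.

```python
# Weights from the Exploration Rewards table.
EXPLORATION_WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty from the same table

# Gate factors from the Gates table. The static-check factor is documented
# as a range (0.12-0.18); the midpoint used here is an assumption.
GATE_FACTORS = {
    "parse_failed": 0.0,
    "static_check_failed": 0.15,
    "execution_failed": 0.30,
    "executed": 1.0,
}


def clamp01(value: float) -> float:
    """Clamp a value to the 0-1 range used for all rewards."""
    return max(0.0, min(1.0, value))


def exploration_reward(components: dict, sufficiency: float) -> float:
    """Additive weights first, then scaling by remaining information need."""
    raw = sum(
        weight * clamp01(components.get(name, 0.0))
        for name, weight in EXPLORATION_WEIGHTS.items()
    )
    info_need = 1.0 - clamp01(sufficiency)
    gated = raw * (0.3 + 0.7 * info_need)
    # Assumption: the flat step cost is applied after gating.
    return clamp01(gated - STEP_COST)


def gated_generation_total(quality: float, outcome: str) -> float:
    """Multiplicative gate: validity scales the whole quality score."""
    return clamp01(clamp01(quality) * GATE_FACTORS[outcome])
```

Under these assumptions, an explore step with all four components at 1.0 scores about 0.95 when `sufficiency` is 0 and about 0.25 when `sufficiency` is 1, which is the diminishing return the gating is meant to produce. On the generation side, a quality of 0.8 drops to 0 if the code fails to parse, no matter how well it matches the task.
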
### Gates (multiplicative)

| Condition | Effect |
|---|---|
| Code doesn't parse (AST fails) | total = 0 |
| Static check fails | total = quality * (0.12 to 0.18) |
| Code doesn't execute | total = quality * 0.30 |
| Code executes successfully | total = quality * 1.0 |

### Quality components

| Component | Weight | Range | Description |
|---|---|---|---|
| `validity` | 0.15 | 0-1 | Parse/static-check/execution validity |
| `task_alignment` | 0.30 | 0-1 | Keyword coverage plus preferred format match |
| `structure` | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
| `research_usage` | 0.25 | 0-1 | Code references terms from exploration research |

For manim, `structure` includes scene structure plus narration quality.

### Marimo structure scoring

Additive scoring for good patterns:

- `import marimo` / `marimo.App()` / `@app.cell` count
- UI elements (`mo.ui.*`, `mo.md(`, etc.)
- Visualization libraries (`matplotlib`, `plotly`, etc.)
- Tier-appropriate cell count

Then the `marimo check` CLI validates against five breaking rules (MB001-MB005). Per-violation penalties:

| Rule | Penalty | What it catches |
|---|---|---|
| MB001 | -0.30 | Unparsable cells |
| MB002 | -0.35 | Duplicate variable definitions across cells |
| MB003 | -0.40 | Cycle dependencies between cells |
| MB004 | -0.20 | Invalid setup cell dependencies |
| MB005 | -0.25 | Syntax errors within cells |

Clean code (no violations) gets a +0.1 bonus.

**Skip penalty**: Generating without any exploration incurs a -0.1 penalty.

### Repair scoring

If generation fails lint/build validation, the observation enters `repair` and exposes structured errors.
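
The repair rules tabulated in this section can be sketched as a single function. The 0.6 and 0.25 factors are the documented ones. The bonus and penalty magnitudes are not specified beyond "small bonus" and "additional penalty", so the three constants below are hypothetical placeholders, as are the `RepairOutcome` and `repair_reward` names.

```python
from dataclasses import dataclass

REPAIR_SUCCESS_FACTOR = 0.6   # documented: repair succeeds
REPAIR_FAILURE_FACTOR = 0.25  # documented: repair fails

# Hypothetical magnitudes; the real values are not given in this document.
FIXED_ERRORS_BONUS = 0.05
CHANGED_CODE_BONUS = 0.05
UNCHANGED_PENALTY = 0.10


@dataclass
class RepairOutcome:
    repair_success: bool      # repaired artifact executes successfully
    fixed_prior_errors: bool  # previous error codes are gone
    changed_code: bool        # submitted code differs from the last attempt


def repair_reward(base_generation_reward: float, outcome: RepairOutcome) -> float:
    """Scale the base generation reward, then apply bonuses and penalties."""
    if outcome.repair_success:
        reward = base_generation_reward * REPAIR_SUCCESS_FACTOR
    else:
        reward = base_generation_reward * REPAIR_FAILURE_FACTOR
    if outcome.fixed_prior_errors:
        reward += FIXED_ERRORS_BONUS
    # The change bonus is only listed for successful repairs.
    if outcome.repair_success and outcome.changed_code:
        reward += CHANGED_CODE_BONUS
    if not outcome.changed_code:
        reward -= UNCHANGED_PENALTY
    return max(0.0, min(1.0, reward))  # clamp to the 0-1 range
```

With these placeholder values, a successful repair on a base reward of 0.8 that both fixes the prior errors and changes the code would score about 0.58, while resubmitting unchanged failing code would score about 0.10.
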
Up to three repair attempts are allowed:

| Condition | Effect |
|---|---|
| First generation succeeds | Full eligible generation reward; episode ends |
| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
| Code repeated unchanged | Additional penalty |

Repair reward components are:

| Component | Range | Description |
|---|---|---|
| `repair_success` | 0/1 | Whether the repaired artifact executes successfully |
| `fixed_prior_errors` | 0/1 | Whether previous error codes are gone |
| `changed_code` | 0/1 | Whether the repair changed the submitted code |

## Search Sources (`sources.py`)

All search calls are **async** (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using **BM25** to surface the most relevant parts.

| Source | Library | Use Case | Retrieval |
|---|---|---|---|
| Wikipedia | `wikipediaapi.AsyncWikipedia` | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
| HuggingFace Papers | httpx -> `huggingface.co/api/papers/search` | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
| HF Hub | `huggingface_hub.HfApi` | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |

Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when `EMBEDDINGS_ENABLED=1`.

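
To illustrate what "global BM25 ranking" means here, the sketch below pools chunks from several candidates and scores them against the query with a plain Okapi BM25 formula. It is a self-contained approximation and not the code in `sources.py`: the tokenizer, the `k1` and `b` defaults, and the `bm25_rank` name are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, chunks: list, top_k: int = 3,
              k1: float = 1.5, b: float = 0.75) -> list:
    """Return the top_k (score, chunk) pairs across all pooled chunks."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scored = []
    for chunk, tokens in zip(chunks, tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Chunks pooled from different (made-up) pages, ranked in one pass.
chunks = [
    "Gradient descent updates parameters along the negative gradient.",
    "The French Revolution began in 1789.",
    "Stochastic gradient descent uses mini-batches to estimate the gradient.",
]
top = bm25_rank("gradient descent", chunks, top_k=2)
```

Ranking the pooled chunks in one pass, instead of per page, lets a strong section from a lower-ranked page outrank a weak section from the top search hit.
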
## Sandbox (`sandbox.py`)

Validation follows a fast-to-slow pipeline. Each stage gates the next.

| Check | Tool | Timeout | Purpose |
|---|---|---|---|
| `ast_parses` | Python `ast.parse` | ~0ms | Catch syntax errors |
| `check_marimo` | `marimo check --format json --select MB` | 8s | Catch structural violations |
| `run_marimo` | `marimo export html` | 7s | Full execution test |
| `run_manim` | `manim render -ql` | 30s | Full render test |

`check_marimo` runs in ~100-200ms and catches MB001-MB005. If it fails, `run_marimo` is skipped (saves ~15s per broken submission).
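
The fast-to-slow gating above can be sketched as a list of stages where the first failure stops the pipeline and later stages are reported as skipped. This is a minimal illustration and not the contents of `sandbox.py`: `run_pipeline`, `run_cli`, and `marimo_stages` are hypothetical names, and treating a non-zero exit status as failure is an assumption about how the CLI results are interpreted.

```python
import ast
import subprocess
from typing import Callable, Optional


def ast_parses(code: str) -> bool:
    """Cheapest check: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def run_cli(args: list, timeout_s: float) -> bool:
    """Assumption: exit status 0 within the timeout counts as a pass."""
    try:
        result = subprocess.run(args, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def run_pipeline(code: str, stages: list) -> dict:
    """Run (name, check) stages in order; None marks a skipped stage."""
    results: dict = {name: None for name, _ in stages}
    for name, check in stages:
        passed: Optional[bool] = check(code)
        results[name] = passed
        if not passed:
            break  # each stage gates the next
    return results


def marimo_stages(notebook_path: str) -> list:
    """Hypothetical wiring of the documented commands and timeouts."""
    check_cmd = ["marimo", "check", "--format", "json", "--select", "MB", notebook_path]
    export_cmd = ["marimo", "export", "html", notebook_path]
    stages: list = [
        ("ast_parses", ast_parses),
        ("check_marimo", lambda _code: run_cli(check_cmd, 8)),
        ("run_marimo", lambda _code: run_cli(export_cmd, 7)),
    ]
    return stages
```

Because the stages are plain callables, the ordering logic can be exercised without marimo installed by passing stub checks, for example `run_pipeline("x = 1", [("ast_parses", ast_parses), ("check", lambda c: False), ("run", lambda c: True)])`, which leaves `run` as `None`.
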