# Rewards

Multi-component reward system for the explore -> generate -> repair episode.

## Episode Flow

```
reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
```

Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.

Every action reward and `*_total` component is clamped to the `0-1` range.
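
The episode flow and the clamping rule can be illustrated with a minimal sketch. The helper names `is_valid_episode` and `clamp01` are hypothetical and not taken from the repository; only the step limits (0 to 6 explores, exactly one generate, 0 to 3 repairs) and the `0-1` range come from the text above.

```python
from typing import Sequence

MAX_EXPLORE = 6  # from the flow diagram: explore x 0..6
MAX_REPAIR = 3   # from the flow diagram: repair x 0..3


def is_valid_episode(actions: Sequence[str]) -> bool:
    """Return True if the action names follow explore* -> generate -> repair*."""
    actions = list(actions)
    if "generate" not in actions:
        return False
    split = actions.index("generate")
    explores, repairs = actions[:split], actions[split + 1:]
    return (
        len(explores) <= MAX_EXPLORE
        and all(a == "explore" for a in explores)
        and len(repairs) <= MAX_REPAIR
        and all(a == "repair" for a in repairs)  # a second "generate" fails here
    )


def clamp01(value: float) -> float:
    """Clamp a reward to the 0-1 range, as stated for action rewards."""
    return max(0.0, min(1.0, value))
```

For example, `is_valid_episode(["explore", "generate", "repair"])` is accepted, while a sequence with seven `explore` actions or two `generate` actions is rejected.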
## Exploration Rewards (`exploration.py`)

Per-step reward for each `explore` action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.

| Component | Weight | Range | Description |
|---|---|---|---|
| `query_quality` | 0.20 | 0-1 | Query relevance plus tool fit |
| `evidence_quality` | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
| `information_gain` | 0.40 | 0-1 | Newly covered concepts plus result novelty |
| `efficiency` | 0.15 | 0-1 | Action novelty scaled by remaining information need |
| `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |

**Gating mechanism**: `info_need = 1 - sufficiency`. Raw reward is scaled by `0.3 + 0.7 * info_need`, so high sufficiency -> low reward for more exploration. This teaches the agent to stop when it has enough.
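
A minimal sketch of how the weighted components and the gate might combine. The weights, step cost, and gate formula come from the table above; the function name, the dict layout, and the choice to subtract the step cost after gating are assumptions, since the text does not specify the order of operations.

```python
from typing import Mapping

# Weights from the component table (they sum to 1.0).
WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty


def exploration_reward(components: Mapping[str, float], sufficiency: float) -> float:
    """Gated exploration reward for one `explore` step (hypothetical helper)."""
    raw = sum(weight * components[name] for name, weight in WEIGHTS.items())
    info_need = 1.0 - sufficiency
    gate = 0.3 + 0.7 * info_need
    # Assumption: the step cost is applied after gating; the source does not say.
    reward = raw * gate - STEP_COST
    return max(0.0, min(1.0, reward))  # action rewards are clamped to 0-1
```

With perfect component scores, the reward falls from about 0.95 at `sufficiency = 0` to about 0.25 at `sufficiency = 1`, which is the diminishing-returns behavior the gate is meant to produce.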
## Generation Rewards (`generation.py`)

Reward on `generate` and `repair` actions. Uses **multiplicative gates** instead of additive weights for code validity.

### Gates (multiplicative)

| Condition | Effect |
|---|---|
| Code doesn't parse (AST fails) | total = 0 |
| Static check fails | total = quality * (0.12 to 0.18) |
| Code doesn't execute | total = quality * 0.30 |
| Code executes successfully | total = quality * 1.0 |
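
The gate table can be read as a cascade of checks, sketched below. The function name and boolean parameters are hypothetical. The table gives only a 0.12-0.18 range for a failed static check, so the default of 0.15 is an assumed midpoint, not a documented value.

```python
def gate_multiplier(
    parses: bool,
    static_ok: bool,
    executes: bool,
    static_fail_multiplier: float = 0.15,  # assumption: midpoint of 0.12-0.18
) -> float:
    """Return the multiplier applied to the quality score (hypothetical helper)."""
    if not parses:
        return 0.0  # AST failure zeroes the total
    if not static_ok:
        if not 0.12 <= static_fail_multiplier <= 0.18:
            raise ValueError("static-check multiplier must lie in 0.12-0.18")
        return static_fail_multiplier
    if not executes:
        return 0.30
    return 1.0
```

Checking the conditions in this order mirrors the fast-to-slow validation order: a later check only matters once every earlier one has passed.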
### Quality components

| Component | Weight | Range | Description |
|---|---|---|---|
| `validity` | 0.15 | 0-1 | Parse/static-check/execution validity |
| `task_alignment` | 0.30 | 0-1 | Keyword coverage plus preferred format match |
| `structure` | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
| `research_usage` | 0.25 | 0-1 | Code references terms from exploration research |

For manim, `structure` includes scene structure plus narration quality.
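
As a rough illustration, the quality score is a weighted sum of the four components, which is then multiplied by a gate value. The weights come from the table; the function names and the input validation are assumptions made for this sketch.

```python
from typing import Mapping

# Weights from the quality component table (they sum to 1.0).
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def quality_score(components: Mapping[str, float]) -> float:
    """Weighted sum of the four quality components, each expected in 0-1."""
    for name in QUALITY_WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in the 0-1 range")
    return sum(weight * components[name] for name, weight in QUALITY_WEIGHTS.items())


def generation_total(components: Mapping[str, float], gate: float) -> float:
    """Apply a multiplicative gate (0, 0.12-0.18, 0.30, or 1.0) to the quality score."""
    return max(0.0, min(1.0, quality_score(components) * gate))
```

Under this reading, code that scores perfectly on every component but fails to execute would receive about 0.30 rather than 1.0.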
### Marimo structure scoring

Additive scoring for good patterns:

- `import marimo` / `marimo.App()` / `@app.cell` count
- UI elements (`mo.ui.*`, `mo.md(`, etc.)
- Visualization libraries (`matplotlib`, `plotly`, etc.)
- Tier-appropriate cell count

The `marimo check` CLI then validates the code against 5 breaking rules (MB001-MB005). Per-violation penalties:

| Rule | Penalty | What it catches |
|---|---|---|
| MB001 | -0.30 | Unparsable cells |
| MB002 | -0.35 | Duplicate variable definitions across cells |
| MB003 | -0.40 | Cyclic dependencies between cells |
| MB004 | -0.20 | Invalid setup cell dependencies |
| MB005 | -0.25 | Syntax errors within cells |

Clean code (no violations) gets a +0.1 bonus.

**Skip penalty**: Generating without any exploration incurs a -0.1 penalty.
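
The `marimo check` adjustment can be sketched as follows. The penalty values and the clean bonus come from the table above; the function name, the idea of a pre-computed `base_score` from the additive patterns, ignoring unknown rule codes, and the final clamp are assumptions. The skip penalty is left out because the text does not say which total it applies to.

```python
from typing import Iterable

# Per-violation penalties from the rule table.
MB_PENALTIES = {
    "MB001": 0.30,
    "MB002": 0.35,
    "MB003": 0.40,
    "MB004": 0.20,
    "MB005": 0.25,
}
CLEAN_BONUS = 0.1


def apply_marimo_check(base_score: float, violations: Iterable[str]) -> float:
    """Adjust an additive structure score by `marimo check` results (hypothetical)."""
    codes = list(violations)
    if not codes:
        adjusted = base_score + CLEAN_BONUS
    else:
        # Each occurrence is penalized; unknown codes are ignored (assumption).
        adjusted = base_score - sum(MB_PENALTIES.get(code, 0.0) for code in codes)
    return max(0.0, min(1.0, adjusted))
```

For instance, a base score of 0.9 with a single MB002 violation would drop to about 0.55, and two severe violations can push the structure score to 0.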
### Repair scoring

If generation fails lint/build validation, the observation enters `repair` and exposes structured errors. Up to three repair attempts are allowed:

| Condition | Effect |
|---|---|
| First generation succeeds | Full eligible generation reward; episode ends |
| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
| Code repeated unchanged | Additional penalty |

Repair reward components are:

| Component | Range | Description |
|---|---|---|
| `repair_success` | 0/1 | Whether the repaired artifact executes successfully |
| `fixed_prior_errors` | 0/1 | Whether previous error codes are gone |
| `changed_code` | 0/1 | Whether the repair changed the submitted code |
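
A hedged sketch of the repair table as a function. The 0.6 and 0.25 scale factors come from the table. The sizes of the "small bonuses" and the "additional penalty" are not given in this document, so the three constants marked hypothetical below are placeholders chosen only to make the example concrete.

```python
REPAIR_SUCCESS_SCALE = 0.6   # from the table
REPAIR_FAIL_SCALE = 0.25     # from the table

# Hypothetical magnitudes: the document only says "small bonus" / "additional penalty".
FIXED_ERRORS_BONUS = 0.05
CHANGED_CODE_BONUS = 0.05
UNCHANGED_PENALTY = 0.10


def repair_reward(
    base_reward: float,
    repair_success: bool,
    fixed_prior_errors: bool,
    changed_code: bool,
) -> float:
    """Reward for one repair attempt, following the repair table (illustrative)."""
    if repair_success:
        reward = base_reward * REPAIR_SUCCESS_SCALE
        reward += FIXED_ERRORS_BONUS if fixed_prior_errors else 0.0
        reward += CHANGED_CODE_BONUS if changed_code else 0.0
    else:
        reward = base_reward * REPAIR_FAIL_SCALE
        reward += FIXED_ERRORS_BONUS if fixed_prior_errors else 0.0
    if not changed_code:
        reward -= UNCHANGED_PENALTY  # code repeated unchanged
    return max(0.0, min(1.0, reward))
```

Whatever the real magnitudes are, the ordering the table implies is preserved here: a successful repair earns less than a clean first generation, and a failed repair earns less than a successful one.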
## Search Sources (`sources.py`)

All search calls are **async** (`httpx` + `wikipediaapi.AsyncWikipedia`). Content is retrieved at section/chunk level and ranked using **BM25** to surface the most relevant parts.

| Source | Library | Use Case | Retrieval |
|---|---|---|---|
| Wikipedia | `wikipediaapi.AsyncWikipedia` | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
| HuggingFace Papers | httpx -> `huggingface.co/api/papers/search` | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
| HF Hub | `huggingface_hub.HfApi` | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |

Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when `EMBEDDINGS_ENABLED=1`.
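
To show what "global BM25 ranking" over chunks means, here is a small standard-library sketch of Okapi BM25. It is not the code in `sources.py`, which may well use a library instead; the tokenizer, the `k1`/`b` defaults, and the function names are assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (simplistic)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(
    query: str, chunks: List[str], top_k: int = 5, k1: float = 1.5, b: float = 0.75
) -> List[Tuple[str, float]]:
    """Return the top_k (chunk, score) pairs ranked by Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(doc) for doc in docs) / n) or 1.0
    # Number of chunks containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scored = []
    for index, doc in enumerate(docs):
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((score, index))

    # Highest score first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunks[index], score) for score, index in scored[:top_k]]
```

Because chunks from all fetched candidates go into one list before ranking, a strong section from the third search hit can outrank a weak section from the first, which is the point of ranking globally rather than per document.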
## Sandbox (`sandbox.py`)

Validation follows a fast-to-slow pipeline. Each stage gates the next.

| Check | Tool | Timeout | Purpose |
|---|---|---|---|
| `ast_parses` | Python `ast.parse` | ~0ms | Catch syntax errors |
| `check_marimo` | `marimo check --format json --select MB` | 8s | Catch structural violations |
| `run_marimo` | `marimo export html` | 7s | Full execution test |
| `run_manim` | `manim render -ql` | 30s | Full render test |

`check_marimo` runs in ~100-200ms and catches MB001-MB005. If it fails, `run_marimo` is skipped (saves ~15s per broken submission).
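
The gating behavior can be sketched without invoking any CLI. In the example below only `ast_parses` is concrete; the `check` and `run` stages are passed in as callables (hypothetical stand-ins for the `marimo check` and `marimo export html` subprocess calls) so that the short-circuit logic is visible on its own.

```python
import ast
from typing import Callable, Dict, Optional


def ast_parses(code: str) -> bool:
    """Cheapest stage: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def validate(
    code: str,
    check: Callable[[str], bool],  # stand-in for the static check stage
    run: Callable[[str], bool],    # stand-in for the full execution stage
) -> Dict[str, Optional[bool]]:
    """Run stages fast-to-slow; a failed stage leaves later stages as None."""
    results: Dict[str, Optional[bool]] = {"ast_parses": None, "check": None, "run": None}
    results["ast_parses"] = ast_parses(code)
    if not results["ast_parses"]:
        return results
    results["check"] = check(code)
    if not results["check"]:
        return results  # the expensive run stage is skipped
    results["run"] = run(code)
    return results
```

A `None` entry means the stage never ran, which keeps "skipped" distinguishable from "failed" when the results are later turned into gate multipliers.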