Rewards
Multi-component reward system for the explore -> generate -> repair episode.
Episode Flow
reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
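The flow above can be read as a small state machine. The sketch below is illustrative only: `EpisodeState`, `allowed_actions`, and `step` are hypothetical names, and it assumes a failed repair leaves the episode open until the third attempt is used up.

```python
from dataclasses import dataclass

MAX_EXPLORE = 6  # from the flow: explore x 0..6
MAX_REPAIR = 3   # from the flow: repair x 0..3


@dataclass
class EpisodeState:
    explores: int = 0
    generated: bool = False
    repairs: int = 0
    done: bool = False


def allowed_actions(s: EpisodeState) -> set[str]:
    """Return the actions that are legal in the current phase."""
    if s.done:
        return set()
    if not s.generated:
        actions = {"generate"}
        if s.explores < MAX_EXPLORE:
            actions.add("explore")
        return actions
    # After generate, only repair remains (reached when validation failed).
    return {"repair"} if s.repairs < MAX_REPAIR else set()


def step(s: EpisodeState, action: str, validation_ok: bool = True) -> EpisodeState:
    """Advance the episode; validation_ok only matters for generate/repair."""
    if action not in allowed_actions(s):
        raise ValueError(f"action {action!r} not allowed in this phase")
    if action == "explore":
        s.explores += 1
    elif action == "generate":
        s.generated = True
        s.done = validation_ok
    else:  # repair
        s.repairs += 1
        # Assumption: the episode ends on success or when attempts run out.
        s.done = validation_ok or s.repairs >= MAX_REPAIR
    return s
```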
Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.
Every action reward and *_total component is clamped to the 0-1 range.
Exploration Rewards (exploration.py)
Per-step reward for each explore action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.
| Component | Weight | Range | Description |
|---|---|---|---|
| query_quality | 0.20 | 0-1 | Query relevance plus tool fit |
| evidence_quality | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
| information_gain | 0.40 | 0-1 | Newly covered concepts plus result novelty |
| efficiency | 0.15 | 0-1 | Action novelty scaled by remaining information need |
| step_cost | -0.05 | flat | Per-step penalty -- exploration must justify itself |
Gating mechanism: info_need = 1 - sufficiency. Raw reward is scaled by 0.3 + 0.7 * info_need, so high sufficiency -> low reward for more exploration. This teaches the agent to stop when it has enough.
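As a rough sketch of how the weights and the gate combine, the function below computes one explore-step reward. The weights, step cost, and gate formula come from this section; the function name and the order of operations (step cost subtracted after gating, then clamped) are assumptions, not confirmed details of exploration.py.

```python
WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    """Weighted sum of components, scaled by the information-need gate."""
    raw = sum(w * clamp01(components[name]) for name, w in WEIGHTS.items())
    info_need = 1.0 - clamp01(sufficiency)
    gate = 0.3 + 0.7 * info_need
    # Assumption: the flat step cost is applied after gating.
    return clamp01(raw * gate - STEP_COST)
```

With every component at 1.0, this gives roughly 0.95 when sufficiency is 0 and roughly 0.25 when sufficiency is 1, which is the intended pressure to stop exploring.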
Generation Rewards (generation.py)
Reward for generate and repair actions. Code validity is enforced with multiplicative gates rather than additive weights.
Gates (multiplicative)
| Condition | Effect |
|---|---|
| Code doesn't parse (AST fails) | total = 0 |
| Static check fails | total = quality * m, with m between 0.12 and 0.18 |
| Code doesn't execute | total = quality * 0.30 |
| Code executes successfully | total = quality * 1.0 |
Quality components
| Component | Weight | Range | Description |
|---|---|---|---|
| validity | 0.15 | 0-1 | Parse/static-check/execution validity |
| task_alignment | 0.30 | 0-1 | Keyword coverage plus preferred format match |
| structure | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, marimo check) |
| research_usage | 0.25 | 0-1 | Code references terms from exploration research |
For manim, structure includes scene structure plus narration quality.
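The gates and quality weights can be combined as in the following sketch. The weights and gate multipliers are taken from the tables above; the function name is hypothetical, and because the section gives only a 0.12-0.18 range for a failed static check, the default of 0.15 is an assumed midpoint.

```python
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def generation_reward(
    components: dict[str, float],
    parses: bool,
    static_ok: bool,
    executes: bool,
    static_fail_multiplier: float = 0.15,  # assumption: midpoint of 0.12-0.18
) -> float:
    """Weighted quality score multiplied by the strictest failing gate."""
    if not parses:
        return 0.0  # AST failure zeroes the reward
    quality = sum(w * clamp01(components[k]) for k, w in QUALITY_WEIGHTS.items())
    if not static_ok:
        gate = static_fail_multiplier
    elif not executes:
        gate = 0.30
    else:
        gate = 1.0
    return clamp01(quality * gate)
```

Because the gate multiplies the whole quality score, well-aligned code that does not run can never outscore mediocre code that does.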
Marimo structure scoring
Additive scoring for good patterns:
- import marimo / marimo.App() / @app.cell count
- UI elements (mo.ui.*, mo.md(, etc.)
- Visualization libraries (matplotlib, plotly, etc.)
- Tier-appropriate cell count
The marimo check CLI then validates the notebook against five breaking rules (MB001-MB005). Per-violation penalties:
| Rule | Penalty | What it catches |
|---|---|---|
| MB001 | -0.30 | Unparsable cells |
| MB002 | -0.35 | Duplicate variable definitions across cells |
| MB003 | -0.40 | Cycle dependencies between cells |
| MB004 | -0.20 | Invalid setup cell dependencies |
| MB005 | -0.25 | Syntax errors within cells |
Clean code (no violations) gets a +0.1 bonus.
Skip penalty: Generating without any exploration incurs a -0.1 penalty.
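A minimal sketch of how the marimo check penalties and the clean-code bonus might adjust a structure score. The penalty values come from the rule table above; the function name, the stacking of repeated violations, and the final clamp to 0-1 are assumptions.

```python
PENALTIES = {
    "MB001": 0.30,
    "MB002": 0.35,
    "MB003": 0.40,
    "MB004": 0.20,
    "MB005": 0.25,
}
CLEAN_BONUS = 0.1


def apply_marimo_check(base_score: float, violations: list[str]) -> float:
    """Adjust an additive structure score by marimo check results."""
    if not violations:
        adjusted = base_score + CLEAN_BONUS
    else:
        # Assumption: each reported violation subtracts its own penalty.
        adjusted = base_score - sum(PENALTIES.get(code, 0.0) for code in violations)
    return max(0.0, min(1.0, adjusted))
```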
Repair scoring
If generation fails lint/build validation, the observation enters repair and exposes structured errors. Up to three repair attempts are allowed:
| Condition | Effect |
|---|---|
| First generation succeeds | Full eligible generation reward; episode ends |
| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
| Code repeated unchanged | Additional penalty |
Repair reward components are:
| Component | Range | Description |
|---|---|---|
| repair_success | 0/1 | Whether the repaired artifact executes successfully |
| fixed_prior_errors | 0/1 | Whether previous error codes are gone |
| changed_code | 0/1 | Whether the repair changed the submitted code |
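The repair rules above could be combined roughly as follows. Only the 0.6 and 0.25 multipliers come from this section; the bonus and penalty sizes are described only as "small" or "additional", so `fix_bonus`, `change_bonus`, and `unchanged_penalty` are hypothetical placeholder values.

```python
def repair_reward(
    base_reward: float,
    repair_success: bool,
    fixed_prior_errors: bool,
    changed_code: bool,
    fix_bonus: float = 0.05,         # hypothetical value
    change_bonus: float = 0.05,      # hypothetical value
    unchanged_penalty: float = 0.1,  # hypothetical value
) -> float:
    """Scale the base generation reward by repair outcome, then adjust."""
    if repair_success:
        reward = base_reward * 0.6
        reward += fix_bonus * fixed_prior_errors
        reward += change_bonus * changed_code
    else:
        reward = base_reward * 0.25
        reward += fix_bonus * fixed_prior_errors
    if not changed_code:
        reward -= unchanged_penalty  # resubmitting identical code is penalized
    return max(0.0, min(1.0, reward))
```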
Search Sources (sources.py)
All search calls are async (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using BM25 to surface the most relevant parts.
| Source | Library | Use Case | Retrieval |
|---|---|---|---|
| Wikipedia | wikipediaapi.AsyncWikipedia | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
| HuggingFace Papers | httpx -> huggingface.co/api/papers/search | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
| HF Hub | huggingface_hub.HfApi | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |
Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when EMBEDDINGS_ENABLED=1.
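To illustrate what "global" ranking means here, the sketch below pools chunks from all fetched candidates and scores them together with a plain Okapi BM25 written from scratch. It is not the code in sources.py: the tokenizer, the `k1` and `b` defaults, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_chunks(query: str, chunks: list[str], top_k: int = 5,
                    k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the best top_k."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties keep original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunks[i], s) for s, i in scored[:top_k]]
```

Ranking the pooled chunks rather than ranking per page means a single highly relevant section can outrank every chunk from a weaker candidate.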
Sandbox (sandbox.py)
Validation follows a fast-to-slow pipeline. Each stage gates the next.
| Check | Tool | Timeout | Purpose |
|---|---|---|---|
| ast_parses | Python ast.parse | ~0ms | Catch syntax errors |
| check_marimo | marimo check --format json --select MB | 8s | Catch structural violations |
| run_marimo | marimo export html | 7s | Full execution test |
| run_manim | manim render -ql | 30s | Full render test |
check_marimo runs in ~100-200ms and catches MB001-MB005. If it fails, run_marimo is skipped (saves ~15s per broken submission).
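The fast-to-slow gating can be sketched as a loop that stops at the first failing stage. Only `ast_parses` is implemented here; the later stages are passed in as callables because the real ones shell out to the commands in the table. The `validate` function and its return shape are assumptions, not the actual interface of sandbox.py.

```python
import ast
from typing import Callable

Stage = tuple[str, Callable[[str], bool]]


def ast_parses(code: str) -> bool:
    """Cheapest check: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def validate(code: str, stages: list[Stage]) -> dict[str, bool]:
    """Run stages in order; skip everything after the first failure."""
    results: dict[str, bool] = {}
    for name, check in stages:
        ok = check(code)
        results[name] = ok
        if not ok:
            break  # later, slower stages never run
    return results
```

A stage missing from the returned dict was skipped, which lets a caller distinguish "failed" from "never attempted" when building repair observations.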