Rewards

Multi-component reward system for the explore -> generate -> repair episode.

Episode Flow

reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done

Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts. Every action reward and *_total component is clamped to the 0-1 range.
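The episode flow above can be read as a small grammar over action names. The sketch below checks a finished action sequence against the stated limits (0 to 6 explore steps, exactly one generate, 0 to 3 repairs). The function name `is_valid_episode` and the plain-string action encoding are illustrative assumptions, not part of the environment's API.

```python
MAX_EXPLORE = 6  # explore x 0..6, from the flow above
MAX_REPAIR = 3   # repair x 0..3, from the flow above


def is_valid_episode(actions: list[str]) -> bool:
    """Return True if the sequence matches explore* -> generate -> repair*."""
    i = 0
    # Consume the leading run of explore actions.
    while i < len(actions) and actions[i] == "explore":
        i += 1
    if i > MAX_EXPLORE:
        return False
    # Exactly one generate action must follow.
    if i >= len(actions) or actions[i] != "generate":
        return False
    # Everything after generate must be a bounded run of repairs.
    repairs = actions[i + 1:]
    return len(repairs) <= MAX_REPAIR and all(a == "repair" for a in repairs)
```

For example, `["explore", "explore", "generate", "repair"]` is accepted, while a sequence that explores after generating is rejected.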

Exploration Rewards (exploration.py)

Per-step reward for each explore action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.

| Component | Weight | Range | Description |
|---|---|---|---|
| query_quality | 0.20 | 0-1 | Query relevance plus tool fit |
| evidence_quality | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
| information_gain | 0.40 | 0-1 | Newly covered concepts plus result novelty |
| efficiency | 0.15 | 0-1 | Action novelty scaled by remaining information need |
| step_cost | -0.05 | flat | Per-step penalty -- exploration must justify itself |

Gating mechanism: info_need = 1 - sufficiency. The raw reward is scaled by 0.3 + 0.7 * info_need, so the multiplier falls from 1.0 at zero sufficiency to 0.3 at full sufficiency. High sufficiency therefore means low reward for more exploration, which teaches the agent to stop when it has enough.
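A minimal sketch of how the weighted components and the gate might combine. The weights, step cost, and gate formula come from this section. The order of operations (gate first, then subtract the step cost, then clamp) and the function name `exploration_reward` are assumptions made for illustration; exploration.py may combine them differently.

```python
# Weights from the component table in this section.
WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty


def clamp01(x: float) -> float:
    """Clamp a value to the 0-1 range."""
    return max(0.0, min(1.0, x))


def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    # Weighted sum of the four 0-1 components.
    raw = sum(w * clamp01(components[name]) for name, w in WEIGHTS.items())
    # Gate: the less information is still needed, the smaller the multiplier.
    info_need = 1.0 - clamp01(sufficiency)
    gate = 0.3 + 0.7 * info_need
    # Assumption: the step cost is subtracted after gating, then clamped.
    return clamp01(raw * gate - STEP_COST)
```

Under these assumptions, a perfect explore step scores about 0.95 when nothing is known yet and about 0.25 once sufficiency reaches 1.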

Generation Rewards (generation.py)

Reward for generate and repair actions. Code validity acts through multiplicative gates on the total instead of additive weights.

Gates (multiplicative)

| Condition | Effect |
|---|---|
| Code doesn't parse (AST fails) | total = 0 |
| Static check fails | total = quality * m, with m between 0.12 and 0.18 |
| Code doesn't execute | total = quality * 0.30 |
| Code executes successfully | total = quality * 1.0 |

Quality components

| Component | Weight | Range | Description |
|---|---|---|---|
| validity | 0.15 | 0-1 | Parse/static-check/execution validity |
| task_alignment | 0.30 | 0-1 | Keyword coverage plus preferred format match |
| structure | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, marimo check) |
| research_usage | 0.25 | 0-1 | Code references terms from exploration research |

For manim, structure includes scene structure plus narration quality.
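The gates and quality weights above could be combined roughly as follows. This is a sketch under stated assumptions: the function name `generation_total` is hypothetical, and because this README gives only a 0.12-0.18 range for the static-check gate without saying how the value is chosen, the default of 0.15 is an assumed midpoint.

```python
# Weights from the quality components table.
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def generation_total(
    components: dict[str, float],
    parses: bool,
    static_ok: bool,
    executes: bool,
    static_fail_multiplier: float = 0.15,  # assumption: midpoint of 0.12-0.18
) -> float:
    # Hard gate: unparsable code earns nothing.
    if not parses:
        return 0.0
    quality = sum(w * clamp01(components[k]) for k, w in QUALITY_WEIGHTS.items())
    if not static_ok:
        # Keep the multiplier inside the documented 0.12-0.18 range.
        gate = max(0.12, min(0.18, static_fail_multiplier))
    elif not executes:
        gate = 0.30
    else:
        gate = 1.0
    return clamp01(quality * gate)
```

The multiplicative form means a well-aligned, well-structured artifact still scores low if it cannot run, which an additive validity weight of 0.15 alone would not guarantee.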

Marimo structure scoring

Additive scoring for good patterns:

  • import marimo / marimo.App() / @app.cell count
  • UI elements (mo.ui.*, mo.md(, etc.)
  • Visualization libraries (matplotlib, plotly, etc.)
  • Tier-appropriate cell count

Then the marimo check CLI validates the code against five breaking rules (MB001-MB005). Per-violation penalties:

| Rule | Penalty | What it catches |
|---|---|---|
| MB001 | -0.30 | Unparsable cells |
| MB002 | -0.35 | Duplicate variable definitions across cells |
| MB003 | -0.40 | Cycle dependencies between cells |
| MB004 | -0.20 | Invalid setup cell dependencies |
| MB005 | -0.25 | Syntax errors within cells |

Clean code (no violations) gets +0.1 bonus.
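As a rough illustration of the penalty table and clean bonus, the sketch below adjusts an already computed additive pattern score. The function name `apply_marimo_check`, the input shape (a flat list of rule codes, one entry per violation), and the choice to ignore unknown codes are assumptions; the real code would first have to extract the codes from the marimo check output.

```python
# Per-violation penalties from the table above.
MB_PENALTIES = {
    "MB001": 0.30,
    "MB002": 0.35,
    "MB003": 0.40,
    "MB004": 0.20,
    "MB005": 0.25,
}
CLEAN_BONUS = 0.1


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def apply_marimo_check(pattern_score: float, violations: list[str]) -> float:
    """Adjust the additive pattern score using marimo check results."""
    if not violations:
        # Clean code gets the bonus.
        return clamp01(pattern_score + CLEAN_BONUS)
    # Each violation occurrence is penalized; unknown codes are ignored here.
    penalty = sum(MB_PENALTIES.get(code, 0.0) for code in violations)
    return clamp01(pattern_score - penalty)
```

Because the result is clamped, two or three breaking violations are usually enough to drive the structure score to 0.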

Skip penalty: Generating without any exploration incurs -0.1 penalty.

Repair scoring

If generation fails lint/build validation, the episode enters the repair phase and the observation exposes structured errors. Up to three repair attempts are allowed:

| Condition | Effect |
|---|---|
| First generation succeeds | Full eligible generation reward; episode ends |
| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
| Code repeated unchanged | Additional penalty |

Repair reward components are:

| Component | Range | Description |
|---|---|---|
| repair_success | 0/1 | Whether the repaired artifact executes successfully |
| fixed_prior_errors | 0/1 | Whether previous error codes are gone |
| changed_code | 0/1 | Whether the repair changed the submitted code |
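The repair rules can be sketched as a function of the base generation reward and the three binary components. The 0.6 and 0.25 multipliers come from the table above. The bonus and penalty sizes are not given in this README, so `fix_bonus`, `change_bonus`, and `unchanged_penalty` below are placeholder values chosen only for illustration, and the function name `repair_reward` is hypothetical.

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def repair_reward(
    base: float,
    repair_success: bool,
    fixed_prior_errors: bool,
    changed_code: bool,
    fix_bonus: float = 0.05,          # placeholder: "small bonus" in the README
    change_bonus: float = 0.05,       # placeholder: "small bonus" in the README
    unchanged_penalty: float = 0.10,  # placeholder: "additional penalty"
) -> float:
    if repair_success:
        reward = base * 0.6
        if fixed_prior_errors:
            reward += fix_bonus
        if changed_code:
            reward += change_bonus
    else:
        reward = base * 0.25
        if fixed_prior_errors:
            reward += fix_bonus
    # Resubmitting identical code is penalized in either case.
    if not changed_code:
        reward -= unchanged_penalty
    return clamp01(reward)
```

With these placeholders, a successful repair of a full-quality artifact scores about 0.7, still below the 1.0 available for a first generation that succeeds, which keeps first-try correctness the preferred strategy.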

Search Sources (sources.py)

All search calls are async (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using BM25 to surface the most relevant parts.

| Source | Library | Use Case | Retrieval |
|---|---|---|---|
| Wikipedia | wikipediaapi.AsyncWikipedia | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
| HuggingFace Papers | httpx -> huggingface.co/api/papers/search | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
| HF Hub | huggingface_hub.HfApi | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |

Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when EMBEDDINGS_ENABLED=1.
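To make "global BM25 ranking" concrete, here is a small pure-Python Okapi BM25 sketch that scores a flat list of chunks (pooled from all fetched candidates) against one query and returns the top k. It is an illustration only: sources.py may use a library or different parameters, and the tokenizer, the `k1` and `b` defaults, and the function name `bm25_top_chunks` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_chunks(
    query: str, chunks: list[str], top_k: int = 5, k1: float = 1.5, b: float = 0.75
) -> list[tuple[str, float]]:
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: number of chunks containing each term.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties keep the original chunk order.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(chunks[i], s) for s, i in scored[:top_k]]
```

Ranking the pooled chunks together, rather than per page, lets one highly relevant section from a lower-ranked page beat weak sections from the top search hit.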

Sandbox (sandbox.py)

Validation follows a fast-to-slow pipeline. Each stage gates the next.

| Check | Tool | Timeout | Purpose |
|---|---|---|---|
| ast_parses | Python ast.parse | ~0ms | Catch syntax errors |
| check_marimo | marimo check --format json --select MB | 8s | Catch structural violations |
| run_marimo | marimo export html | 7s | Full execution test |
| run_manim | manim render -ql | 30s | Full render test |

check_marimo runs in ~100-200ms and catches MB001-MB005. If it fails, run_marimo is skipped (saves ~15s per broken submission).
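The fast-to-slow gating can be sketched as an ordered list of checks that stops at the first failure. Only the `ast.parse` stage is implemented for real here. The later stages are passed in as callables because how sandbox.py invokes and parses the CLI tools is not shown in this README, and the function name `validate` is hypothetical.

```python
import ast
from typing import Callable


def ast_parses(code: str) -> bool:
    """Cheapest stage: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def validate(
    code: str, stages: list[tuple[str, Callable[[str], bool]]]
) -> dict[str, bool]:
    """Run stages in order; a failed stage prevents all later ones from running."""
    results: dict[str, bool] = {}
    for name, check in stages:
        ok = check(code)
        results[name] = ok
        if not ok:
            break  # skip the slower stages
    return results
```

A caller would list the stages from cheapest to most expensive, for example `[("ast_parses", ast_parses), ("check_marimo", ...), ("run_marimo", ...)]`, so a stage missing from the result means it was skipped.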