Rewards

Multi-component reward system for the explore -> generate -> repair episode.

Episode Flow

reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done

Each step returns a per-step reward. The agent learns which tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts. Every action reward and `*_total` component is clamped to the 0-1 range.
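
The flow above can be read as a small phase machine. The following is a minimal sketch, assuming the action names `explore`, `generate`, and `repair` and the limits shown in the diagram; `is_valid_episode` is a hypothetical helper for illustration, not part of the environment's API.

```python
MAX_EXPLORE = 6  # from the flow diagram: explore x 0..6
MAX_REPAIR = 3   # from the flow diagram: repair x 0..3


def is_valid_episode(actions: list[str]) -> bool:
    """Return True if actions follow explore* -> generate -> repair*."""
    if "generate" not in actions:
        return False
    gen_idx = actions.index("generate")
    explores = actions[:gen_idx]
    repairs = actions[gen_idx + 1:]
    return (
        len(explores) <= MAX_EXPLORE
        and all(a == "explore" for a in explores)
        and len(repairs) <= MAX_REPAIR
        and all(a == "repair" for a in repairs)
    )
```

A second `generate` after the first is rejected because everything after the single generate step must be a repair.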

Exploration Rewards (`exploration.py`)

Per-step reward for each explore action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.

Component	Weight	Range	Description
`query_quality`	0.20	0-1	Query relevance plus tool fit
`evidence_quality`	0.25	0-1	Retrieved chunk quality plus useful source diversity
`information_gain`	0.40	0-1	Newly covered concepts plus result novelty
`efficiency`	0.15	0-1	Action novelty scaled by remaining information need
`step_cost`	-0.05	flat	Per-step penalty -- exploration must justify itself

Gating mechanism: `info_need = 1 - sufficiency`. The raw reward is scaled by `0.3 + 0.7 * info_need`, so high sufficiency yields a low reward for further exploration (the scale factor bottoms out at 0.3 when sufficiency is 1). This teaches the agent to stop once it has enough.
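
The weights and gate above might combine roughly as follows. This is a sketch, not the code in `exploration.py`: the function name, the order of operations (step cost subtracted before gating), and the final clamp placement are assumptions.

```python
# Weights from the component table above.
WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty


def clamp01(x: float) -> float:
    """Clamp a value to the 0-1 range."""
    return max(0.0, min(1.0, x))


def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    """Weighted sum of components, minus step cost, scaled by information need."""
    raw = sum(w * clamp01(components[name]) for name, w in WEIGHTS.items())
    raw -= STEP_COST  # assumption: cost is applied before the gate
    info_need = 1.0 - clamp01(sufficiency)
    gate = 0.3 + 0.7 * info_need
    return clamp01(raw * gate)
```

With all components at 1.0, the same explore step scores higher at low sufficiency than at high sufficiency, which is the intended pressure to stop exploring.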

Generation Rewards (`generation.py`)

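Reward for generate and repair actions. Code validity is enforced through multiplicative gates rather than additive weights, so code that fails a gate has its total capped regardless of its other quality scores.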

Gates (multiplicative)

Condition	Effect
Code doesn't parse (AST fails)	total = 0
Static check fails	total = quality * (0.12 to 0.18)
Code doesn't execute	total = quality * 0.30
Code executes successfully	total = quality * 1.0

Quality components

Component	Weight	Range	Description
`validity`	0.15	0-1	Parse/static-check/execution validity
`task_alignment`	0.30	0-1	Keyword coverage plus preferred format match
`structure`	0.30	0-1	Structural quality (cells/scenes, UI, viz, `marimo check`)
`research_usage`	0.25	0-1	Code references terms from exploration research

For manim, structure includes scene structure plus narration quality.
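
The gates and quality weights above could be combined along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and because the source gives only a 0.12-0.18 range for the static-check gate, the default of 0.15 is an arbitrary midpoint rather than the value `generation.py` uses.

```python
# Weights from the quality component table above.
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def clamp01(x: float) -> float:
    """Clamp a value to the 0-1 range."""
    return max(0.0, min(1.0, x))


def quality_score(components: dict[str, float]) -> float:
    """Additive weighted sum of the four quality components."""
    return sum(w * clamp01(components[name]) for name, w in QUALITY_WEIGHTS.items())


def gated_total(
    quality: float,
    parses: bool,
    static_ok: bool,
    executes: bool,
    static_fail_factor: float = 0.15,  # assumption: any value in 0.12-0.18
) -> float:
    """Apply the multiplicative validity gates to a quality score."""
    if not 0.12 <= static_fail_factor <= 0.18:
        raise ValueError("static_fail_factor must be within 0.12-0.18")
    if not parses:
        return 0.0
    if not static_ok:
        return clamp01(quality * static_fail_factor)
    if not executes:
        return clamp01(quality * 0.30)
    return clamp01(quality)
```

The gates are checked in pipeline order, so a submission that fails to parse scores zero even if its other components would have been high.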

Marimo structure scoring

Additive scoring for good patterns:

- `import marimo` / `marimo.App()` / `@app.cell` count
- UI elements (`mo.ui.*`, `mo.md(`, etc.)
- Visualization libraries (matplotlib, plotly, etc.)
- Tier-appropriate cell count

Then the `marimo check` CLI validates the notebook against 5 breaking rules (MB001-MB005). Per-violation penalties:

Rule	Penalty	What it catches
MB001	-0.30	Unparsable cells
MB002	-0.35	Duplicate variable definitions across cells
MB003	-0.40	Cycle dependencies between cells
MB004	-0.20	Invalid setup cell dependencies
MB005	-0.25	Syntax errors within cells

Clean code (no violations) gets a +0.1 bonus.

Skip penalty: Generating without any exploration incurs a -0.1 penalty.
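
The per-violation penalties and clean bonus could be applied to the additive structure score roughly like this. It is a sketch: `apply_marimo_check` is a hypothetical name, and summing penalties across violations, ignoring unknown rule codes, and clamping to 0-1 are assumptions. The skip penalty is left out because the source does not say which total it is subtracted from.

```python
# Penalties from the rule table above.
MB_PENALTIES = {
    "MB001": 0.30,
    "MB002": 0.35,
    "MB003": 0.40,
    "MB004": 0.20,
    "MB005": 0.25,
}
CLEAN_BONUS = 0.10


def clamp01(x: float) -> float:
    """Clamp a value to the 0-1 range."""
    return max(0.0, min(1.0, x))


def apply_marimo_check(base_structure: float, violations: list[str]) -> float:
    """Adjust the additive structure score using `marimo check` rule codes."""
    if not violations:
        return clamp01(base_structure + CLEAN_BONUS)
    # Assumption: each reported violation subtracts its own penalty.
    penalty = sum(MB_PENALTIES.get(code, 0.0) for code in violations)
    return clamp01(base_structure - penalty)
```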

Repair scoring

If generation fails lint/build validation, the observation enters repair and exposes structured errors. Up to three repair attempts are allowed:

Condition	Effect
First generation succeeds	Full eligible generation reward; episode ends
Repair succeeds	Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code
Repair fails	Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends
Code repeated unchanged	Additional penalty

Repair reward components are:

Component	Range	Description
`repair_success`	0/1	Whether the repaired artifact executes successfully
`fixed_prior_errors`	0/1	Whether previous error codes are gone
`changed_code`	0/1	Whether the repair changed the submitted code
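
The repair rules above might translate into something like the following. The 0.6 and 0.25 factors come from the table; the bonus and penalty sizes are hypothetical placeholders, since the source only calls them "small" and "additional", and `repair_reward` is an illustrative name.

```python
REPAIR_SUCCESS_FACTOR = 0.60  # from the table above
REPAIR_FAILURE_FACTOR = 0.25  # from the table above

# Hypothetical magnitudes: the source does not give these values.
FIXED_ERRORS_BONUS = 0.05
CHANGED_CODE_BONUS = 0.05
UNCHANGED_PENALTY = 0.10


def clamp01(x: float) -> float:
    """Clamp a value to the 0-1 range."""
    return max(0.0, min(1.0, x))


def repair_reward(
    base_generation_reward: float,
    repair_success: bool,
    fixed_prior_errors: bool,
    changed_code: bool,
) -> float:
    """Scale the base generation reward for a repair step and add bonuses."""
    if repair_success:
        reward = base_generation_reward * REPAIR_SUCCESS_FACTOR
        if fixed_prior_errors:
            reward += FIXED_ERRORS_BONUS
        if changed_code:
            reward += CHANGED_CODE_BONUS
    else:
        reward = base_generation_reward * REPAIR_FAILURE_FACTOR
        if fixed_prior_errors:
            reward += FIXED_ERRORS_BONUS
    if not changed_code:
        reward -= UNCHANGED_PENALTY  # resubmitting identical code is penalized
    return clamp01(reward)
```

Whatever the real bonus sizes are, the structure keeps a successful repair below a clean first generation while still rewarding it above a failed repair.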

Search Sources (`sources.py`)

All search calls are async (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using BM25 to surface the most relevant parts.

Source	Library	Use Case	Retrieval
Wikipedia	`wikipediaapi.AsyncWikipedia`	Fundamentals	Search 3-5 pages -> section tree -> global BM25 ranking
HuggingFace Papers	httpx -> `huggingface.co/api/papers/search`	ML/AI research	Search 3-5 papers -> markdown chunks -> global BM25 ranking
arXiv	httpx -> arXiv Atom API	Math, algorithms, ML, statistics papers	Search 3-5 papers -> abstracts -> global BM25 ranking
Semantic Scholar	httpx -> Graph API	Scholarly metadata and abstracts	Search 3-5 papers -> abstracts -> global BM25 ranking
Docs	httpx + trafilatura	API/library/code details	Fetch allowlisted docs -> chunk -> global BM25 ranking
HF Hub	`huggingface_hub.HfApi`	Model cards, datasets, Spaces	Search Hub entities -> metadata/card snippets

Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when `EMBEDDINGS_ENABLED=1`.
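
To illustrate the "global BM25 ranking" step, here is a dependency-free sketch that pools chunks from all candidates and scores them together. It is not the code in `sources.py`: the tokenizer, the `k1` and `b` values, and the function names are assumptions, and the real implementation may use a BM25 library instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(
    query: str, chunks: list[str], k: int = 3, k1: float = 1.5, b: float = 0.75
) -> list[tuple[float, str]]:
    """Score every chunk against the query with Okapi BM25; return the top k."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because all chunks share one document-frequency table, a strong section from one page can outrank every section of another, which is what distinguishes global ranking from per-page ranking.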

Sandbox (`sandbox.py`)

Validation follows a fast-to-slow pipeline. Each stage gates the next.

Check	Tool	Timeout	Purpose
`ast_parses`	Python `ast.parse`	~0ms	Catch syntax errors
`check_marimo`	`marimo check --format json --select MB`	8s	Catch structural violations
`run_marimo`	`marimo export html`	7s	Full execution test
`run_manim`	`manim render -ql`	30s	Full render test

`check_marimo` runs in ~100-200ms and catches MB001-MB005. If it fails, `run_marimo` is skipped, which avoids paying for a full execution (up to the 7s timeout listed above) on a broken submission.
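
The fast-to-slow gating can be sketched as a list of checks that stops at the first failure. Only `ast_parses` is implemented for real here; in the actual sandbox the later stages would shell out to the commands in the table with their timeouts. `run_pipeline` and the stage-callable signature are assumptions made for illustration.

```python
import ast
from typing import Callable, Optional


def ast_parses(code: str) -> bool:
    """Cheapest stage: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def run_pipeline(
    code: str, stages: list[tuple[str, Callable[[str], bool]]]
) -> dict[str, Optional[bool]]:
    """Run stages in order; stages after the first failure stay None (skipped)."""
    results: dict[str, Optional[bool]] = {name: None for name, _ in stages}
    for name, check in stages:
        results[name] = check(code)
        if not results[name]:
            break  # each stage gates the next
    return results
```

A skipped stage is reported as `None` rather than `False`, so the reward code can tell "did not run" apart from "ran and failed".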