	# Rewards

	Multi-component reward system for the explore -> generate -> repair episode.

	## Episode Flow

	```
	reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
	```

	Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.
	Every action reward and `*_total` component is clamped to the `0-1` range.
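
The flow above can be sketched as a small state tracker. This is only an illustration: `EpisodeState`, `allowed_actions`, `step`, and `clamp01` are hypothetical names, and the sketch assumes a failed repair leaves the episode open until the third attempt, which is one reading of the diagram.

```python
from dataclasses import dataclass

MAX_EXPLORE = 6  # from the flow diagram: explore x 0..6
MAX_REPAIR = 3   # from the flow diagram: repair x 0..3


@dataclass
class EpisodeState:
    """Hypothetical tracker for which actions the episode flow allows next."""
    explores: int = 0
    repairs: int = 0
    generated: bool = False
    done: bool = False

    def allowed_actions(self) -> list:
        if self.done:
            return []
        if not self.generated:
            # Exploration is optional; generation is always available once.
            actions = ["generate"]
            if self.explores < MAX_EXPLORE:
                actions.insert(0, "explore")
            return actions
        return ["repair"] if self.repairs < MAX_REPAIR else []

    def step(self, action: str, artifact_ok: bool = False) -> None:
        if action not in self.allowed_actions():
            raise ValueError(f"{action!r} is not allowed in this state")
        if action == "explore":
            self.explores += 1
        elif action == "generate":
            self.generated = True
            self.done = artifact_ok
        else:  # repair
            self.repairs += 1
            # Assumption: the episode ends on success or when attempts run out.
            self.done = artifact_ok or self.repairs >= MAX_REPAIR


def clamp01(value: float) -> float:
    """Clamp a reward to the 0-1 range, as stated for action rewards."""
    return max(0.0, min(1.0, value))
```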

	## Exploration Rewards (`exploration.py`)

	Per-step reward for each `explore` action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.

	| Component | Weight | Range | Description |
	|---|---|---|---|
	| `query_quality` | 0.20 | 0-1 | Query relevance plus tool fit |
	| `evidence_quality` | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
	| `information_gain` | 0.40 | 0-1 | Newly covered concepts plus result novelty |
	| `efficiency` | 0.15 | 0-1 | Action novelty scaled by remaining information need |
	| `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |

	Gating mechanism: `info_need = 1 - sufficiency`. Raw reward is scaled by `0.3 + 0.7 * info_need`, so the multiplier falls from 1.0 when sufficiency is 0 to 0.3 when sufficiency is 1. High sufficiency therefore means low reward for more exploration, which teaches the agent to stop once it has enough.
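
A minimal sketch of how the weights, the gate, and the step cost might combine. The function name `exploration_reward` is hypothetical, and the order of operations (gate the weighted sum first, then subtract the step cost, then clamp) is an assumption; `exploration.py` may order these differently.

```python
# Weights copied from the component table above.
WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty from the table


def exploration_reward(components: dict, sufficiency: float) -> float:
    """Hypothetical combination of the documented pieces for one explore step."""
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    info_need = 1.0 - sufficiency
    gate = 0.3 + 0.7 * info_need
    # Assumption: the step cost is applied after gating, then the result is clamped.
    total = raw * gate - STEP_COST
    return max(0.0, min(1.0, total))
```

Under these assumptions, a perfect step scores about 0.95 when nothing is known yet and about 0.25 once sufficiency reaches 1, so the same query is worth much less late in exploration.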

	## Generation Rewards (`generation.py`)

	Reward for `generate` and `repair` actions. Code validity is applied as a multiplicative gate on the quality score rather than as one more additive weight.

	### Gates (multiplicative)

	| Condition | Effect |
	|---|---|
	| Code doesn't parse (AST fails) | total = 0 |
	| Static check fails | total = quality * m, with m between 0.12 and 0.18 |
	| Code doesn't execute | total = quality * 0.30 |
	| Code executes successfully | total = quality * 1.0 |
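
The gate table can be read as a function from validation outcomes to a multiplier. The sketch below is illustrative only: `gate_multiplier` is a hypothetical name, and because the source gives a range rather than a rule for the static-check case, the default of 0.15 is an assumed midpoint.

```python
def gate_multiplier(parses: bool, static_ok: bool, executes: bool,
                    static_fail_multiplier: float = 0.15) -> float:
    """Return the factor applied to the quality score (hypothetical helper)."""
    if not parses:
        return 0.0
    if not static_ok:
        # The table gives 0.12-0.18; keep any caller-supplied value in that range.
        return min(0.18, max(0.12, static_fail_multiplier))
    if not executes:
        return 0.30
    return 1.0
```

Each check is evaluated in the same fast-to-slow order as the table, so a parse failure zeroes the total regardless of the other flags.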

	### Quality components

	| Component | Weight | Range | Description |
	|---|---|---|---|
	| `validity` | 0.15 | 0-1 | Parse/static-check/execution validity |
	| `task_alignment` | 0.30 | 0-1 | Keyword coverage plus preferred format match |
	| `structure` | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
	| `research_usage` | 0.25 | 0-1 | Code references terms from exploration research |

	For manim, `structure` includes scene structure plus narration quality.
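
As a worked example of the weights in the quality table, the sketch below computes a weighted sum. `quality_score` is a hypothetical name, and clamping each component before weighting is an assumption based on the stated 0-1 ranges.

```python
# Weights copied from the quality components table.
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def quality_score(components: dict) -> float:
    """Weighted sum of the four quality components (illustrative only)."""
    missing = QUALITY_WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(
        weight * max(0.0, min(1.0, components[name]))
        for name, weight in QUALITY_WEIGHTS.items()
    )
```

For instance, `validity=1.0`, `task_alignment=0.5`, `structure=0.5`, and `research_usage=0.0` gives 0.15 + 0.15 + 0.15 = 0.45; the gated total would then be this score times the multiplier from the Gates table.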

	### Marimo structure scoring

	Additive scoring for good patterns:
	- `import marimo` / `marimo.App()` / `@app.cell` count
	- UI elements (`mo.ui.*`, `mo.md(`, etc.)
	- Visualization libraries (`matplotlib`, `plotly`, etc.)
	- Tier-appropriate cell count

	Then the `marimo check` CLI validates the notebook against five breaking rules (MB001-MB005). Per-violation penalties:

	| Rule | Penalty | What it catches |
	|---|---|---|
	| MB001 | -0.30 | Unparsable cells |
	| MB002 | -0.35 | Duplicate variable definitions across cells |
	| MB003 | -0.40 | Cyclic dependencies between cells |
	| MB004 | -0.20 | Invalid setup cell dependencies |
	| MB005 | -0.25 | Syntax errors within cells |

	Clean code (no violations) gets a +0.1 bonus.

	Skip penalty: generating without any exploration incurs a -0.1 penalty.
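
The penalty table, the clean bonus, and the skip penalty can be sketched together as follows. `adjust_structure` is a hypothetical name, and two things are assumptions: that the skip penalty is applied to the same structure score rather than to the final total, and that unknown rule codes cost nothing.

```python
# Penalty magnitudes copied from the rule table.
MB_PENALTIES = {
    "MB001": 0.30,
    "MB002": 0.35,
    "MB003": 0.40,
    "MB004": 0.20,
    "MB005": 0.25,
}
CLEAN_BONUS = 0.1
SKIP_PENALTY = 0.1


def adjust_structure(base_score: float, violations: list, explored: bool) -> float:
    """Apply per-violation penalties or the clean bonus to an additive score."""
    score = base_score
    if violations:
        # One penalty per reported violation, so repeats of a rule add up.
        score -= sum(MB_PENALTIES.get(code, 0.0) for code in violations)
    else:
        score += CLEAN_BONUS
    if not explored:
        # Assumption: the skip penalty is subtracted here, not from the total.
        score -= SKIP_PENALTY
    return max(0.0, min(1.0, score))
```

With a base score of 0.6, a single MB002 violation leaves 0.25, while MB002 plus MB003 together drive the score to the floor of 0.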

	### Repair scoring

	If generation fails lint/build validation, the observation enters `repair` and exposes structured errors. Up to three repair attempts are allowed:

	| Condition | Effect |
	|---|---|
	| First generation succeeds | Full eligible generation reward; episode ends |
	| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
	| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
	| Code repeated unchanged | Additional penalty |

	Repair reward components are:

	| Component | Range | Description |
	|---|---|---|
	| `repair_success` | 0/1 | Whether the repaired artifact executes successfully |
	| `fixed_prior_errors` | 0/1 | Whether previous error codes are gone |
	| `changed_code` | 0/1 | Whether the repair changed the submitted code |
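
A rough sketch of how the three flags could feed the repair table. Only the 0.6 and 0.25 multipliers come from the table; `repair_reward` is a hypothetical name, and the bonus and penalty sizes (`fix_bonus`, `change_bonus`, `unchanged_penalty`) are placeholder values, since the source only calls them "small" and "additional".

```python
def repair_reward(base_reward: float, repair_success: bool,
                  fixed_prior_errors: bool, changed_code: bool,
                  fix_bonus: float = 0.05, change_bonus: float = 0.05,
                  unchanged_penalty: float = 0.1) -> float:
    """Illustrative repair scoring; bonus and penalty sizes are assumptions."""
    if repair_success:
        reward = base_reward * 0.6
        if fixed_prior_errors:
            reward += fix_bonus
        if changed_code:
            reward += change_bonus
    else:
        reward = base_reward * 0.25
        if fixed_prior_errors:
            reward += fix_bonus
    if not changed_code:
        # Resubmitting identical code is penalized on top of the scaled base.
        reward -= unchanged_penalty
    return max(0.0, min(1.0, reward))
```

The discount on a successful repair keeps a first-try success strictly more valuable than a repaired one for the same artifact quality.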

	## Search Sources (`sources.py`)

	All search calls are async (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using BM25 to surface the most relevant parts.

	| Source | Library | Use Case | Retrieval |
	|---|---|---|---|
	| Wikipedia | `wikipediaapi.AsyncWikipedia` | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
	| HuggingFace Papers | httpx -> `huggingface.co/api/papers/search` | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
	| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
	| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
	| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
	| HF Hub | `huggingface_hub.HfApi` | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |

	Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when `EMBEDDINGS_ENABLED=1`.
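
To illustrate the "global BM25 ranking" step, here is a dependency-free sketch of Okapi BM25 over a pooled list of chunks. It is not the code in `sources.py`: `tokenize` and `bm25_top_chunks` are hypothetical names, the tokenizer is a plain regex, and `k1=1.5`, `b=0.75` are common defaults rather than values taken from the source.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_chunks(query: str, chunks: list, top_k: int = 5,
                    k1: float = 1.5, b: float = 0.75) -> list:
    """Score every chunk against the query and return the top_k (score, chunk) pairs."""
    docs = [tokenize(chunk) for chunk in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n or 1.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Ranking globally means chunks from all fetched candidates compete in one pool, so a single highly relevant page can contribute several of the returned chunks.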

	## Sandbox (`sandbox.py`)

	Validation follows a fast-to-slow pipeline. Each stage gates the next.

	| Check | Tool | Timeout | Purpose |
	|---|---|---|---|
	| `ast_parses` | Python `ast.parse` | ~0ms | Catch syntax errors |
	| `check_marimo` | `marimo check --format json --select MB` | 8s | Catch structural violations |
	| `run_marimo` | `marimo export html` | 7s | Full execution test |
	| `run_manim` | `manim render -ql` | 30s | Full render test |

	`check_marimo` runs in ~100-200ms and catches MB001-MB005. If it fails, `run_marimo` is skipped (saves ~15s per broken submission).
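
The stage gating described above can be sketched with the later stages passed in as callables, which keeps the example runnable without marimo or manim installed. `validate`, `static_check`, and `run` are hypothetical names; in the real pipeline the callables would presumably wrap the CLI commands from the table with their timeouts.

```python
import ast
from typing import Callable, Dict, Optional

Check = Callable[[str], bool]


def validate(code: str, static_check: Check, run: Check) -> Dict[str, Optional[bool]]:
    """Run checks fast-to-slow; a stage that fails leaves later stages as None."""
    results: Dict[str, Optional[bool]] = {
        "ast_parses": False,
        "static_check": None,
        "run": None,
    }
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return results  # nothing else is attempted on unparsable code
    results["ast_parses"] = True

    results["static_check"] = static_check(code)
    if not results["static_check"]:
        return results  # skip the expensive execution stage

    results["run"] = run(code)
    return results
```

A `None` entry distinguishes "never ran" from "ran and failed", which matters when mapping results onto the reward gates.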
