# Rewards
Multi-component reward system for the explore -> generate -> repair episode.
## Episode Flow
```
reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
```
Each step returns a per-step reward. The agent learns which tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.
Every action reward and `*_total` component is clamped to the `0-1` range.
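The flow above can be read as a small state machine with step budgets. The sketch below is illustrative only: `EpisodeState`, `allowed_actions`, and `advance` are hypothetical names, and it assumes a failed repair keeps the episode in the repair phase until the three-attempt budget is used up.

```python
from dataclasses import dataclass, replace

MAX_EXPLORE_STEPS = 6  # from the flow diagram: explore x 0..6
MAX_REPAIR_STEPS = 3   # from the flow diagram: repair x 0..3


@dataclass(frozen=True)
class EpisodeState:
    """Hypothetical episode state; not the environment's actual class."""
    phase: str = "explore"  # "explore", "repair", or "done"
    explore_steps: int = 0
    repair_steps: int = 0


def allowed_actions(state: EpisodeState) -> list[str]:
    """Return the actions that fit the flow diagram in the current phase."""
    if state.phase == "explore":
        actions = ["generate"]
        if state.explore_steps < MAX_EXPLORE_STEPS:
            actions.insert(0, "explore")
        return actions
    if state.phase == "repair":
        return ["repair"]
    return []


def advance(state: EpisodeState, action: str, artifact_ok: bool = False) -> EpisodeState:
    """Apply one action; artifact_ok says whether generate/repair validated."""
    if action not in allowed_actions(state):
        raise ValueError(f"{action!r} not allowed in phase {state.phase!r}")
    if action == "explore":
        return replace(state, explore_steps=state.explore_steps + 1)
    if action == "generate":
        return replace(state, phase="done" if artifact_ok else "repair")
    # Repair: assumed to end on success or when the budget is exhausted.
    used = state.repair_steps + 1
    finished = artifact_ok or used >= MAX_REPAIR_STEPS
    return replace(state, phase="done" if finished else "repair", repair_steps=used)
```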
## Exploration Rewards (`exploration.py`)
Per-step reward for each `explore` action. Gated by information need -- once the agent has enough info, further exploration yields diminishing returns.
| Component | Weight | Range | Description |
|---|---|---|---|
| `query_quality` | 0.20 | 0-1 | Query relevance plus tool fit |
| `evidence_quality` | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
| `information_gain` | 0.40 | 0-1 | Newly covered concepts plus result novelty |
| `efficiency` | 0.15 | 0-1 | Action novelty scaled by remaining information need |
| `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |
**Gating mechanism**: `info_need = 1 - sufficiency`. The raw reward is scaled by `0.3 + 0.7 * info_need`, so the multiplier falls from 1.0 at zero sufficiency to 0.3 at full sufficiency. This teaches the agent to stop exploring once it has enough information.
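As a minimal sketch of how the weights and the gate might combine, the function below computes a weighted sum, applies the information-need gate, subtracts the step cost, and clamps. The function name and the order of operations (step cost applied after gating) are assumptions, not confirmed by `exploration.py`.

```python
EXPLORATION_WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # flat per-step penalty from the table


def clamp01(value: float) -> float:
    """Clamp a reward into the 0-1 range."""
    return max(0.0, min(1.0, value))


def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    """Hypothetical combination of the exploration components.

    Assumption: the step cost is subtracted after the gate is applied.
    """
    raw = sum(weight * components[name] for name, weight in EXPLORATION_WEIGHTS.items())
    info_need = 1.0 - sufficiency
    gated = raw * (0.3 + 0.7 * info_need)
    return clamp01(gated - STEP_COST)
```

Under these assumptions, a perfect exploration step is worth about 0.95 when nothing is known yet and about 0.25 when sufficiency is already 1.0.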
## Generation Rewards (`generation.py`)
Reward on `generate` and `repair` actions. Uses **multiplicative gates** instead of additive weights for code validity.
### Gates (multiplicative)
| Condition | Effect |
|---|---|
| Code doesn't parse (AST fails) | total = 0 |
| Static check fails | total = quality * (0.12 to 0.18) |
| Code doesn't execute | total = quality * 0.30 |
| Code executes successfully | total = quality * 1.0 |
### Quality components
| Component | Weight | Range | Description |
|---|---|---|---|
| `validity` | 0.15 | 0-1 | Parse/static-check/execution validity |
| `task_alignment` | 0.30 | 0-1 | Keyword coverage plus preferred format match |
| `structure` | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
| `research_usage` | 0.25 | 0-1 | Code references terms from exploration research |
For manim, `structure` includes scene structure plus narration quality.
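The gates and quality weights above could be combined roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the static-check multiplier is fixed at 0.15, the midpoint of the documented 0.12-0.18 range, because the source does not say how the value within that range is chosen.

```python
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def clamp01(value: float) -> float:
    """Clamp a reward into the 0-1 range."""
    return max(0.0, min(1.0, value))


def gate_multiplier(parses: bool, static_ok: bool, executes: bool,
                    static_gate: float = 0.15) -> float:
    """Return the multiplicative gate for the first failing check.

    static_gate is an assumed midpoint of the documented 0.12-0.18 range.
    """
    if not parses:
        return 0.0
    if not static_ok:
        return static_gate
    if not executes:
        return 0.30
    return 1.0


def generation_reward(components: dict[str, float], parses: bool,
                      static_ok: bool, executes: bool) -> float:
    """Weighted quality score scaled by the validity gate, clamped to 0-1."""
    quality = sum(weight * components[name] for name, weight in QUALITY_WEIGHTS.items())
    return clamp01(quality * gate_multiplier(parses, static_ok, executes))
```

The point of the multiplicative form is that a high quality score cannot compensate for code that fails to parse or run.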
### Marimo structure scoring
Additive scoring for good patterns:
- `import marimo` / `marimo.App()` / `@app.cell` count
- UI elements (`mo.ui.*`, `mo.md(`, etc.)
- Visualization libraries (`matplotlib`, `plotly`, etc.)
- Tier-appropriate cell count
The `marimo check` CLI then validates the notebook against the five breaking rules (MB001-MB005). Each violation incurs a penalty:
| Rule | Penalty | What it catches |
|---|---|---|
| MB001 | -0.30 | Unparsable cells |
| MB002 | -0.35 | Duplicate variable definitions across cells |
| MB003 | -0.40 | Cycle dependencies between cells |
| MB004 | -0.20 | Invalid setup cell dependencies |
| MB005 | -0.25 | Syntax errors within cells |
Clean code (no violations) gets a +0.1 bonus.
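A minimal sketch of how the per-rule penalties and the clean bonus might be applied to the additive structure score. The function takes a plain list of rule codes rather than parsing `marimo check` output, whose exact format is not described here; the name `apply_marimo_check` and the choice to count repeated violations of the same rule separately are assumptions.

```python
MB_PENALTIES = {
    "MB001": 0.30,  # unparsable cells
    "MB002": 0.35,  # duplicate variable definitions across cells
    "MB003": 0.40,  # cycle dependencies between cells
    "MB004": 0.20,  # invalid setup cell dependencies
    "MB005": 0.25,  # syntax errors within cells
}
CLEAN_BONUS = 0.1


def clamp01(value: float) -> float:
    """Clamp a score into the 0-1 range."""
    return max(0.0, min(1.0, value))


def apply_marimo_check(additive_score: float, violations: list[str]) -> float:
    """Adjust the structure score using a list of violated rule codes.

    Assumption: every listed violation is penalized, including repeats.
    Unknown rule codes are ignored.
    """
    if not violations:
        return clamp01(additive_score + CLEAN_BONUS)
    penalty = sum(MB_PENALTIES.get(code, 0.0) for code in violations)
    return clamp01(additive_score - penalty)
```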
**Skip penalty**: Generating without any exploration incurs a -0.1 penalty.
### Repair scoring
If generation fails lint/build validation, the episode enters the `repair` phase and the observation exposes structured errors. Up to three repair attempts are allowed:
| Condition | Effect |
|---|---|
| First generation succeeds | Full eligible generation reward; episode ends |
| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
| Code repeated unchanged | Additional penalty |
Repair reward components are:
| Component | Range | Description |
|---|---|---|
| `repair_success` | 0/1 | Whether the repaired artifact executes successfully |
| `fixed_prior_errors` | 0/1 | Whether previous error codes are gone |
| `changed_code` | 0/1 | Whether the repair changed the submitted code |
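The repair rules can be sketched as below. The 0.6 and 0.25 multipliers come from the table above; the bonus and penalty sizes (`fix_bonus`, `change_bonus`, `unchanged_penalty`) are placeholder values chosen for illustration, since the source only calls them "small" or "additional".

```python
def clamp01(value: float) -> float:
    """Clamp a reward into the 0-1 range."""
    return max(0.0, min(1.0, value))


def repair_reward(base_reward: float, repair_success: bool,
                  fixed_prior_errors: bool, changed_code: bool,
                  fix_bonus: float = 0.05, change_bonus: float = 0.05,
                  unchanged_penalty: float = 0.10) -> float:
    """Hypothetical repair reward; bonus and penalty sizes are assumed."""
    if repair_success:
        reward = base_reward * 0.6
        if fixed_prior_errors:
            reward += fix_bonus
        if changed_code:
            reward += change_bonus
    else:
        reward = base_reward * 0.25
        if fixed_prior_errors:
            reward += fix_bonus
    if not changed_code:
        # Resubmitting identical code is penalized on top of the base outcome.
        reward -= unchanged_penalty
    return clamp01(reward)
```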
## Search Sources (`sources.py`)
All search calls are **async** (`httpx` + `wikipediaapi.AsyncWikipedia`). Content is retrieved at section/chunk level and ranked using **BM25** to surface the most relevant parts.
| Source | Library | Use Case | Retrieval |
|---|---|---|---|
| Wikipedia | `wikipediaapi.AsyncWikipedia` | Fundamentals | Search 3-5 pages -> section tree -> global BM25 ranking |
| HuggingFace Papers | httpx -> `huggingface.co/api/papers/search` | ML/AI research | Search 3-5 papers -> markdown chunks -> global BM25 ranking |
| arXiv | httpx -> arXiv Atom API | Math, algorithms, ML, statistics papers | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Semantic Scholar | httpx -> Graph API | Scholarly metadata and abstracts | Search 3-5 papers -> abstracts -> global BM25 ranking |
| Docs | httpx + trafilatura | API/library/code details | Fetch allowlisted docs -> chunk -> global BM25 ranking |
| HF Hub | `huggingface_hub.HfApi` | Model cards, datasets, Spaces | Search Hub entities -> metadata/card snippets |
Agents choose tools explicitly. Each tool fetches multiple candidates, chunks them, and returns the top 3-5 chunks globally. Optional small local embeddings can rerank the BM25 shortlist when `EMBEDDINGS_ENABLED=1`.
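To illustrate what "global BM25 ranking" means, the sketch below pools chunks from all fetched candidates and scores them together with a plain Okapi BM25 formula. It is a self-contained stand-in, not the code in `sources.py`: the tokenizer, the `k1` and `b` defaults, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, chunks: list[str], top_k: int = 5,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the top_k overall."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs or 1.0

    # Document frequency: in how many chunks each term appears.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    query_terms = tokenize(query)
    scored = []
    for index, doc in enumerate(docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scored.append((score, index))

    # Highest score first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunks[index], score) for score, index in scored[:top_k]]
```

Because all chunks share one ranking, a single highly relevant page can contribute several of the returned chunks while a weak page contributes none.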
## Sandbox (`sandbox.py`)
Validation follows a fast-to-slow pipeline. Each stage gates the next.
| Check | Tool | Timeout | Purpose |
|---|---|---|---|
| `ast_parses` | Python `ast.parse` | ~0ms | Catch syntax errors |
| `check_marimo` | `marimo check --format json --select MB` | 8s | Catch structural violations |
| `run_marimo` | `marimo export html` | 7s | Full execution test |
| `run_manim` | `manim render -ql` | 30s | Full render test |
`check_marimo` runs in ~100-200ms and catches MB001-MB005. If it fails, `run_marimo` is skipped (saves ~15s per broken submission).
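The gating behavior can be sketched as a loop that stops running checks after the first failure. Only `ast_parses` is implemented for real here; the later stages are passed in as callables, and `run_pipeline` is a hypothetical helper rather than the function in `sandbox.py`.

```python
import ast
from typing import Callable


def ast_parses(code: str) -> bool:
    """Cheapest check: does the source parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def run_pipeline(code: str,
                 stages: list[tuple[str, Callable[[str], bool]]]) -> dict[str, str]:
    """Run stages in order; once one fails, mark the rest as skipped."""
    results: dict[str, str] = {}
    failed = False
    for name, check in stages:
        if failed:
            results[name] = "skipped"
            continue
        ok = check(code)
        results[name] = "passed" if ok else "failed"
        failed = not ok
    return results
```

In a real pipeline the later stages would wrap the CLI calls listed in the table, each with its own timeout; ordering them from cheap to expensive means a broken submission never pays for the slow execution step.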