Spaces: kgdrathan/explainer-env

kgdrathan committed about 1 month ago

Commit 5869d56 (verified) · 1 parent: f3394fa

Upload folder using huggingface_hub

Files changed (6)

constants.py +7 -2
rewards/README.md +24 -17
rewards/exploration.py +3 -3
rewards/generation.py +3 -3
server/explainer_env_environment.py +4 -3
tests/test_rewards.py +4 -3

constants.py CHANGED

@@ -12,12 +12,17 @@ AVAILABLE_TOOLS = (
     "search_hf_hub",
 )
-MAX_EXPLORE_REWARD = 0.8
 MAX_GENERATE_REWARD = 1.0
-MAX_REPAIR_REWARD = 0.7
 SUCCESS_SCORE_THRESHOLD = 0.3
 def normalized_episode_score(total_reward: float) -> float:
     """Normalize an episode's accumulated reward to the required [0, 1] range.

     "search_hf_hub",
 )
+MAX_EXPLORE_REWARD = 1.0
 MAX_GENERATE_REWARD = 1.0
+MAX_REPAIR_REWARD = 1.0
 SUCCESS_SCORE_THRESHOLD = 0.3
+def clamp_action_reward(value: float) -> float:
+    """Clamp any single action reward to the required [0, 1] range."""
+    return min(max(value, 0.0), 1.0)
 def normalized_episode_score(total_reward: float) -> float:
     """Normalize an episode's accumulated reward to the required [0, 1] range.

rewards/README.md CHANGED

@@ -5,10 +5,11 @@ Multi-component reward system for the explore -> generate -> repair episode.
 ## Episode Flow
 ```
-reset() --> [explore x 0..3] --> generate x 1 --> [repair x 0..1] --> done
 ```
 Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.
 ## Exploration Rewards (`exploration.py`)
@@ -16,12 +17,10 @@ Per-step reward for each `explore` action. Gated by information need -- once the
 | Component | Weight | Range | Description |
 |---|---|---|---|
-| `tool_choice` | 0.15 | 0-1 | Tool fits task difficulty and query intent |
-| `query_relevance` | 0.20 | 0-1 | Topic, keyword, and intent overlap |
-| `source_quality` | 0.20 | 0-1 | Retrieved chunks have useful metadata, content, and relevance |
-| `coverage_delta` | 0.20 | 0-1 | Newly covered missing concepts/keywords |
-| `result_novelty` | 0.15 | 0-1 | New normalized terms vs. previous context |
-| `diversity` | 0.10 | 0-1 | Useful new source/tool diversity |
 | `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |
 **Gating mechanism**: `info_need = 1 - sufficiency`. Raw reward is scaled by `0.3 + 0.7 * info_need`, so high sufficiency -> low reward for more exploration. This teaches the agent to stop when it has enough.
@@ -35,20 +34,20 @@ Reward on `generate` and `repair` actions. Uses **multiplicative gates** instead
 | Condition | Effect |
 |---|---|
 | Code doesn't parse (AST fails) | total = 0 |
-| Code doesn't execute | total = quality * 0.4 |
 | Code executes successfully | total = quality * 1.0 |
 ### Quality components
 | Component | Weight | Range | Description |
 |---|---|---|---|
-| `coverage` | 0.20 | 0-1 | Fraction of task keywords in generated code |
-| `format_match` | 0.10 | 0.3/1.0 | Chosen format matches task's preferred format |
-| `structure` | 0.20* | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
-| `narration` | 0.15* | 0-1 | Narration quality (manim only; words, scene markers) |
-| `context_usage` | 0.35 | 0-1 | Code references terms from exploration research |
-*For marimo format, narration weight (0.15) is redistributed to structure (-> 0.35 total).
 ### Marimo structure scoring
@@ -74,15 +73,23 @@ Clean code (no violations) gets +0.1 bonus.
 ### Repair scoring
-If generation fails lint/build validation, the observation enters `repair` and exposes structured errors. One repair attempt is allowed:
 | Condition | Effect |
 |---|---|
 | First generation succeeds | Full eligible generation reward; episode ends |
-| Repair succeeds | Base generation reward * 0.8, plus a small bonus for fixing prior error codes |
-| Repair fails | Base generation reward * 0.3; episode ends |
 | Code repeated unchanged | Additional penalty |
 ## Search Sources (`sources.py`)
 All search calls are **async** (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using **BM25** to surface the most relevant parts.

 ## Episode Flow
 ```
+reset() --> [explore x 0..6] --> generate x 1 --> [repair x 0..3] --> done
 ```
 Each step returns a per-step reward. The agent learns what tool to use, what to retrieve, when to stop exploring, and how to repair broken artifacts.
+Every action reward and `*_total` component is clamped to the `0-1` range.
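A minimal sketch of what that clamp does in practice. `clamp_action_reward` and `MAX_EXPLORE_REWARD = 1.0` are copied from this commit's `constants.py`; `capped_explore_reward` is a hypothetical wrapper that only mirrors the `min(MAX_..., clamp_action_reward(...))` pattern seen in `exploration.py` and `generation.py`.

```python
MAX_EXPLORE_REWARD = 1.0  # value set in constants.py by this commit


def clamp_action_reward(value: float) -> float:
    """Clamp any single action reward to the [0, 1] range."""
    return min(max(value, 0.0), 1.0)


def capped_explore_reward(total: float) -> float:
    # Hypothetical wrapper: same shape as the call sites in this commit.
    # With the cap at 1.0 the outer min() is a no-op, but it keeps working
    # if the cap is lowered again later.
    return min(MAX_EXPLORE_REWARD, clamp_action_reward(total))


if __name__ == "__main__":
    for raw in (-0.2, 0.0, 0.45, 1.03):
        print(raw, "->", capped_explore_reward(raw))
```

Negative totals (for example after a step cost or a penalty) floor at 0, and totals above 1 are cut to 1, so downstream code can rely on every per-action value being in range.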
 ## Exploration Rewards (`exploration.py`)
 | Component | Weight | Range | Description |
 |---|---|---|---|
+| `query_quality` | 0.20 | 0-1 | Query relevance plus tool fit |
+| `evidence_quality` | 0.25 | 0-1 | Retrieved chunk quality plus useful source diversity |
+| `information_gain` | 0.40 | 0-1 | Newly covered concepts plus result novelty |
+| `efficiency` | 0.15 | 0-1 | Action novelty scaled by remaining information need |
 | `step_cost` | -0.05 | flat | Per-step penalty -- exploration must justify itself |
 **Gating mechanism**: `info_need = 1 - sufficiency`. Raw reward is scaled by `0.3 + 0.7 * info_need`, so high sufficiency -> low reward for more exploration. This teaches the agent to stop when it has enough.
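The table and gate above can be sketched as follows. This is an illustration, not the code in `exploration.py`: the weights and the `0.3 + 0.7 * info_need` gate come from this README, the `raw * gate + 0.08 * info_need - STEP_COST` line comes from the diff of `compute_explore_reward`, and everything else (the function names, treating `raw` as a plain weighted sum, `STEP_COST = 0.05`) is an assumption.

```python
from typing import Mapping

# Weights from the README table.
EXPLORE_WEIGHTS = {
    "query_quality": 0.20,
    "evidence_quality": 0.25,
    "information_gain": 0.40,
    "efficiency": 0.15,
}
STEP_COST = 0.05  # assumed to match the README's flat -0.05 step cost


def exploration_gate(sufficiency: float) -> float:
    """Scale factor from the README: 0.3 + 0.7 * info_need."""
    info_need = 1.0 - sufficiency
    return 0.3 + 0.7 * info_need


def sketch_explore_reward(
    components: Mapping[str, float],
    sufficiency: float,
    result_ok: bool = True,
) -> float:
    """Hypothetical stand-in for compute_explore_reward."""
    info_need = 1.0 - sufficiency
    # Assumption: the raw reward is a plain weighted sum of the components.
    raw = sum(w * components.get(name, 0.0) for name, w in EXPLORE_WEIGHTS.items())
    # A failed tool call gets no gated reward at all.
    gate = exploration_gate(sufficiency) if result_ok else 0.0
    total = raw * gate + 0.08 * info_need - STEP_COST
    return min(max(total, 0.0), 1.0)  # same effect as clamp_action_reward


if __name__ == "__main__":
    perfect = {name: 1.0 for name in EXPLORE_WEIGHTS}
    for s in (0.0, 0.5, 1.0):
        print(f"sufficiency={s}: {sketch_explore_reward(perfect, s):.3f}")
```

Under these assumptions, even a perfect exploration step is worth much less once sufficiency is high, which is the pressure to stop exploring that the gate is meant to create.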
 | Condition | Effect |
 |---|---|
 | Code doesn't parse (AST fails) | total = 0 |
+| Static check fails | total = quality * 0.12-0.18 |
+| Code doesn't execute | total = quality * 0.30 |
 | Code executes successfully | total = quality * 1.0 |
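One way to read the gate table is as a single multiplier chosen by the first failing check. The sketch below assumes the checks are ordered parse, then static check, then execution. How the real code picks a value inside the 0.12-0.18 band is not shown in this diff, so `static_fail_multiplier` is a hypothetical parameter that is simply kept inside that band.

```python
def generation_gate(
    parses: bool,
    static_check_passed: bool,
    executes: bool,
    static_fail_multiplier: float = 0.15,  # hypothetical default inside 0.12-0.18
) -> float:
    """Return the multiplier applied to the quality score."""
    if not parses:
        return 0.0  # AST failure zeroes the reward
    if not static_check_passed:
        # Keep the assumed multiplier inside the documented band.
        return min(max(static_fail_multiplier, 0.12), 0.18)
    if not executes:
        return 0.30
    return 1.0


def gated_total(quality: float, **checks: bool) -> float:
    """total = quality * gate, as in the table."""
    return quality * generation_gate(**checks)


if __name__ == "__main__":
    print(gated_total(0.8, parses=True, static_check_passed=True, executes=False))
```

Because the gate multiplies rather than adds, high quality scores cannot compensate for code that fails to parse or run.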
 ### Quality components
 | Component | Weight | Range | Description |
 |---|---|---|---|
+| `validity` | 0.15 | 0-1 | Parse/static-check/execution validity |
+| `task_alignment` | 0.30 | 0-1 | Keyword coverage plus preferred format match |
+| `structure` | 0.30 | 0-1 | Structural quality (cells/scenes, UI, viz, `marimo check`) |
+| `research_usage` | 0.25 | 0-1 | Code references terms from exploration research |
+For manim, `structure` includes scene structure plus narration quality.
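A sketch of how the four quality components might combine, assuming a plain weighted sum with the weights from the table. The function name and the per-component clamping are illustrative, not taken from `generation.py`.

```python
from typing import Mapping

# Weights from the README table; they sum to 1.0.
QUALITY_WEIGHTS = {
    "validity": 0.15,
    "task_alignment": 0.30,
    "structure": 0.30,
    "research_usage": 0.25,
}


def sketch_quality(components: Mapping[str, float]) -> float:
    """Weighted sum of component scores, each limited to [0, 1] first."""
    total = 0.0
    for name, weight in QUALITY_WEIGHTS.items():
        score = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return total


if __name__ == "__main__":
    example = {
        "validity": 1.0,
        "task_alignment": 0.5,
        "structure": 0.5,
        "research_usage": 0.0,
    }
    print(round(sketch_quality(example), 3))
```

Since the weights sum to 1.0, the quality score stays in the 0-1 range before any gate is applied.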
 ### Marimo structure scoring
 ### Repair scoring
+If generation fails lint/build validation, the observation enters `repair` and exposes structured errors. Up to three repair attempts are allowed:
 | Condition | Effect |
 |---|---|
 | First generation succeeds | Full eligible generation reward; episode ends |
+| Repair succeeds | Base generation reward * 0.6, plus small bonuses for fixing prior error codes and changing code |
+| Repair fails | Base generation reward * 0.25, plus a small bonus if prior error codes are fixed; episode ends |
 | Code repeated unchanged | Additional penalty |
+Repair reward components are:
+| Component | Range | Description |
+|---|---|---|
+| `repair_success` | 0/1 | Whether the repaired artifact executes successfully |
+| `fixed_prior_errors` | 0/1 | Whether previous error codes are gone |
+| `changed_code` | 0/1 | Whether the repair changed the submitted code |
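The repair rules can be sketched as below. The 0.6 and 0.25 multipliers come from the table and the 0.15 unchanged-code penalty from the `adjust_repair_reward` diff. The bonus sizes are not shown anywhere in this commit: `fixed_bonus=0.08` and `changed_bonus=0.04` are hypothetical values chosen only so that a fully successful repair of a 1.0 base reward lands on the 0.72 asserted in `tests/test_rewards.py`. The real split may differ.

```python
def sketch_repair_reward(
    base_reward: float,
    *,
    repair_success: bool,
    fixed_prior: bool,
    changed: bool,
    fixed_bonus: float = 0.08,    # hypothetical; real value not in the diff
    changed_bonus: float = 0.04,  # hypothetical; real value not in the diff
) -> tuple:
    """Illustrative stand-in for adjust_repair_reward."""
    if repair_success:
        reward = base_reward * 0.6
        if fixed_prior:
            reward += fixed_bonus
        if changed:
            reward += changed_bonus
    else:
        reward = base_reward * 0.25
        if fixed_prior:
            reward += fixed_bonus
    if not changed:
        reward -= 0.15  # penalty for resubmitting identical code
    reward = min(max(reward, 0.0), 1.0)  # clamp to [0, 1]
    return reward, {
        "repair_success": 1.0 if repair_success else 0.0,
        "fixed_prior_errors": 1.0 if fixed_prior else 0.0,
        "changed_code": 1.0 if changed else 0.0,
    }


if __name__ == "__main__":
    reward, comp = sketch_repair_reward(
        1.0, repair_success=True, fixed_prior=True, changed=True
    )
    print(round(reward, 2), comp)
```

A repaired artifact therefore always earns less than a first-try success with the same base reward, which keeps the incentive on getting generation right the first time.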
 ## Search Sources (`sources.py`)
 All search calls are **async** (httpx + wikipediaapi.AsyncWikipedia). Content is retrieved at section/chunk level and ranked using **BM25** to surface the most relevant parts.

rewards/exploration.py CHANGED

@@ -3,11 +3,11 @@
 from __future__ import annotations
 try:
-    from ..constants import MAX_EXPLORE_REWARD
     from ..research.retrieval import tokenize
     from ..research.types import ResearchResult
 except ImportError:  # pragma: no cover - supports direct test execution
-    from constants import MAX_EXPLORE_REWARD
     from research.retrieval import tokenize
     from research.types import ResearchResult
@@ -238,7 +238,7 @@ def compute_explore_reward(
     )
     gate = _exploration_gate(sufficiency_after) if result_ok else 0.0
     total = raw * gate + 0.08 * info_need - STEP_COST
-    total = max(0.0, min(MAX_EXPLORE_REWARD, total))
     components = {
         "query_quality": round(query_quality, 3),

 from __future__ import annotations
 try:
+    from ..constants import MAX_EXPLORE_REWARD, clamp_action_reward
     from ..research.retrieval import tokenize
     from ..research.types import ResearchResult
 except ImportError:  # pragma: no cover - supports direct test execution
+    from constants import MAX_EXPLORE_REWARD, clamp_action_reward
     from research.retrieval import tokenize
     from research.types import ResearchResult
     )
     gate = _exploration_gate(sufficiency_after) if result_ok else 0.0
     total = raw * gate + 0.08 * info_need - STEP_COST
+    total = min(MAX_EXPLORE_REWARD, clamp_action_reward(total))
     components = {
         "query_quality": round(query_quality, 3),

rewards/generation.py CHANGED

@@ -22,9 +22,9 @@ from typing import TYPE_CHECKING
 from .sandbox import ast_parses, check_marimo, extract_scene_class
 try:
-    from ..constants import MAX_REPAIR_REWARD
 except ImportError:  # pragma: no cover - supports direct test execution
-    from constants import MAX_REPAIR_REWARD
 if TYPE_CHECKING:
     from ..task_bank import Task
@@ -369,7 +369,7 @@ def adjust_repair_reward(
     if not changed:
         reward -= 0.15
-    reward = max(0.0, min(MAX_REPAIR_REWARD, reward))
     return reward, {
         "repair_success": 1.0 if repair_success else 0.0,
         "fixed_prior_errors": 1.0 if fixed_prior else 0.0,

 from .sandbox import ast_parses, check_marimo, extract_scene_class
 try:
+    from ..constants import MAX_REPAIR_REWARD, clamp_action_reward
 except ImportError:  # pragma: no cover - supports direct test execution
+    from constants import MAX_REPAIR_REWARD, clamp_action_reward
 if TYPE_CHECKING:
     from ..task_bank import Task
     if not changed:
         reward -= 0.15
+    reward = min(MAX_REPAIR_REWARD, clamp_action_reward(reward))
     return reward, {
         "repair_success": 1.0 if repair_success else 0.0,
         "fixed_prior_errors": 1.0 if fixed_prior else 0.0,

server/explainer_env_environment.py CHANGED

@@ -21,7 +21,7 @@ from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import State
 try:
-    from ..constants import MAX_EXPLORE_STEPS, MAX_REPAIR_STEPS
     from ..models import ExplainerAction, ExplainerObservation
     from ..research import AVAILABLE_TOOLS, run_research_tool
     from ..rewards.exploration import compute_explore_reward
@@ -29,7 +29,7 @@ try:
     from ..rewards.sandbox import validate_code
     from ..task_bank import ALL_TASKS, EASY_TASKS, HARD_TASKS, MEDIUM_TASKS, Task
 except ImportError:
-    from constants import MAX_EXPLORE_STEPS, MAX_REPAIR_STEPS
     from models import ExplainerAction, ExplainerObservation
     from research import AVAILABLE_TOOLS, run_research_tool
     from rewards.exploration import compute_explore_reward
@@ -385,7 +385,8 @@ class ExplainerEnvironment(Environment):
             static_check_passed=sandbox.check_passed,
             error_codes=sandbox.error_codes,
         )
-        reward = max(0.0, reward + skip_penalty)
         self._last_code = code
         self._last_format = fmt

 from openenv.core.env_server.types import State
 try:
+    from ..constants import MAX_EXPLORE_STEPS, MAX_REPAIR_STEPS, clamp_action_reward
     from ..models import ExplainerAction, ExplainerObservation
     from ..research import AVAILABLE_TOOLS, run_research_tool
     from ..rewards.exploration import compute_explore_reward
     from ..rewards.sandbox import validate_code
     from ..task_bank import ALL_TASKS, EASY_TASKS, HARD_TASKS, MEDIUM_TASKS, Task
 except ImportError:
+    from constants import MAX_EXPLORE_STEPS, MAX_REPAIR_STEPS, clamp_action_reward
     from models import ExplainerAction, ExplainerObservation
     from research import AVAILABLE_TOOLS, run_research_tool
     from rewards.exploration import compute_explore_reward
             static_check_passed=sandbox.check_passed,
             error_codes=sandbox.error_codes,
         )
+        reward = clamp_action_reward(reward + skip_penalty)
+        components["generate_total"] = round(reward, 4)
         self._last_code = code
         self._last_format = fmt

tests/test_rewards.py CHANGED

@@ -412,7 +412,7 @@ def test_reward_spread():
     assert len(unique) >= 3
-def test_repair_reward_success_is_capped_and_changed():
     reward, comp = adjust_repair_reward(
         1.0,
         repair_success=True,
@@ -421,7 +421,8 @@ def test_repair_reward_success_is_capped_and_changed():
         previous_code="x =",
         repaired_code="x = 1",
     )
-    assert reward == MAX_REPAIR_REWARD
     assert comp["repair_success"] == 1.0
     assert comp["fixed_prior_errors"] == 1.0
     assert comp["changed_code"] == 1.0
@@ -494,7 +495,7 @@ if __name__ == "__main__":
         test_marimo_static_failure_is_not_code_valid,
         test_generate_reward_wrong_format,
         test_reward_spread,
-        test_repair_reward_success_is_capped_and_changed,
         test_repair_reward_penalizes_repeated_code,
         test_repair_reward_failed_fix_stays_discounted,
         test_normalized_episode_score_bounds,

     assert len(unique) >= 3
+def test_repair_reward_success_is_discounted_and_changed():
     reward, comp = adjust_repair_reward(
         1.0,
         repair_success=True,
         previous_code="x =",
         repaired_code="x = 1",
     )
+    assert reward == 0.72
+    assert 0.0 <= reward <= MAX_REPAIR_REWARD
     assert comp["repair_success"] == 1.0
     assert comp["fixed_prior_errors"] == 1.0
     assert comp["changed_code"] == 1.0
         test_marimo_static_failure_is_not_code_valid,
         test_generate_reward_wrong_format,
         test_reward_spread,
+        test_repair_reward_success_is_discounted_and_changed,
         test_repair_reward_penalizes_repeated_code,
         test_repair_reward_failed_fix_stays_discounted,
         test_normalized_episode_score_bounds,