Detail grading rewards and penalties in README
README.md
CHANGED
|
@@ -13,13 +13,24 @@ base_path: /dashboard/
|
|
| 13 |
---
|
| 14 |
|
| 15 |
### Advanced Reward Shaping (RLVR Integration)
|
| 16 |
-
Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic grading engine designed to harden LLM strategic reasoning:
|
| 17 |
|
| 18 |
-
|
| 19 |
-
* **
|
| 20 |
-
* **
|
| 21 |
-
|
| 22 |
-
| 23 |
|
| 24 |
---
|
| 25 |
|
|
|
|
| 13 |
---
|
| 14 |
|
| 15 |
### Advanced Reward Shaping (RLVR Integration)
|
| 16 |
+
Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a deterministic Python grading engine (`server/grader.py`) that hardens LLM strategic reasoning through explicit rewards and penalties:
|
| 17 |
|
| 18 |
+
#### 1. Task Easy (Clarification Policies)
|
| 19 |
+
* **The Penalty:** If the agent's proposed rule contains hedging words such as `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader flags a **Clarity Coherence** failure and drops the score to the baseline of `0.12`.
|
| 20 |
+
* **The Reward:** To score above `0.90`, the agent must provide a definition that is entirely free of these vague terms, include valid, active `affected_policy_ids`, and supply a substantive, detailed `justification` string.
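
The vague-term check described above can be sketched as follows. This is a minimal illustration, not the code in `server/grader.py`: the function names, the whole-word matching, and the assumption that the four quoted words are the full list are all hypothetical. Only the `0.12` baseline comes from the text above.

```python
import re

# Hypothetical term list: the README quotes these four words as examples,
# so the real grader may check more.
VAGUE_TERMS = ("generally", "sometimes", "often", "maybe")
BASELINE_SCORE = 0.12  # baseline quoted in the README


def find_vague_terms(rule_text: str) -> list[str]:
    """Return the vague terms that occur as whole words in rule_text."""
    words = set(re.findall(r"[a-z]+", rule_text.lower()))
    return [term for term in VAGUE_TERMS if term in words]


def clarity_gate(rule_text: str, candidate_score: float) -> float:
    """Return the baseline if any vague term is present (assumed behavior)."""
    if find_vague_terms(rule_text):
        return BASELINE_SCORE
    return candidate_score
```

Under these assumptions, a rule such as "Refunds are generally allowed" falls to the baseline, while "Refunds are allowed within 30 days" keeps whatever score the other checks produce.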
|
| 21 |
+
|
| 22 |
+
#### 2. Task Medium (New Rule Generation Policies)
|
| 23 |
+
* **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
|
| 24 |
+
* **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
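
As a rough sketch of the Task Medium checks, the snippet below reports which conditions an action fails instead of computing a score, because the exact point values are not stated here. The `MIN_RULE_CHARS` threshold, the issue labels, and the dictionary layout of the action are assumptions made for illustration. The integration-point check is omitted because its field name is not given.

```python
MIN_RULE_CHARS = 40  # hypothetical threshold; the README only says "too short"


def medium_task_issues(action: dict, expected_domain: str) -> list[str]:
    """List the Task Medium checks that an action fails (illustrative only)."""
    issues = []
    # An empty or missing `scope` array is penalized.
    if not action.get("scope"):
        issues.append("empty_scope")
    # A very short `new_rule` is treated as an incomplete thought.
    if len(action.get("new_rule", "").strip()) < MIN_RULE_CHARS:
        issues.append("rule_too_short")
    # The agent must target the missing `rule_domain`.
    if action.get("rule_domain") != expected_domain:
        issues.append("wrong_rule_domain")
    return issues
```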
|
| 25 |
+
|
| 26 |
+
#### 3. Task Hard (Evolve Policy Framework Policies)
|
| 27 |
+
* **The Penalty:** If the `expected_outcomes` predict that *all* metrics will improve to near-perfect values (e.g., both revenue and fraud-blocking hit `0.99`), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent. Real-world policies always have friction.
|
| 28 |
+
* **The Reward:** The grader requires at least **two** distinct `policy_modifications` (e.g., one to tighten a rule, one for an exception). It also verifies the mathematical variance in the outcome projections, forcing the agent to demonstrate complex, balanced reasoning.
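
One way to picture the Task Hard checks is the sketch below. It assumes `expected_outcomes` is a mapping from metric name to a value in `[0, 1]` and that "near-perfect" means at or above a hypothetical `NEAR_PERFECT` cutoff. Neither detail is specified here, and the real grader may measure variance differently.

```python
from statistics import pvariance

NEAR_PERFECT = 0.95    # hypothetical cutoff for an "everything improves" claim
MIN_MODIFICATIONS = 2  # the README requires at least two policy_modifications


def outcome_variance(outcomes: dict) -> float:
    """Population variance of the projected metric values (0.0 if empty)."""
    values = list(outcomes.values())
    return pvariance(values) if values else 0.0


def hard_task_issues(action: dict) -> list[str]:
    """List the Task Hard checks that an action fails (illustrative only)."""
    issues = []
    values = list(action.get("expected_outcomes", {}).values())
    # Every metric near-perfect at once is treated as an unrealistic tradeoff.
    if values and all(v >= NEAR_PERFECT for v in values):
        issues.append("unrealistic_tradeoff")
    if len(action.get("policy_modifications", [])) < MIN_MODIFICATIONS:
        issues.append("too_few_modifications")
    return issues
```

With these assumptions, projections of `0.99` for every metric have zero variance and trigger the tradeoff failure, while a projection that trades some revenue for better fraud-blocking passes.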
|
| 29 |
+
|
| 30 |
+
#### Global Bonuses & Penalties
|
| 31 |
+
* **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent uses analytical keywords (such as `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a strategic bonus, directly rewarding explicit meta-reasoning.
|
| 32 |
+
* **Step-Delta Shaping:** Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's previous best score.
|
| 33 |
+
* **Anti-Repetition Penalty (-0.30):** Applies a severe penalty for exactly repeated actions across steps, forcing the agent toward continuous exploration and evolution.
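
The global adjustments can be sketched as a small post-processing step. The bonus range and the `-0.30` penalty come from the list above; everything else is an assumption for illustration: the mapping from keyword count to bonus size, the JSON-based action fingerprint, and the clamp to `[0, 1]`. The `improvement_bonus` is left out because its size is not stated.

```python
import json

COT_KEYWORDS = ("tradeoff", "precision", "recall", "threshold")
REPEAT_PENALTY = 0.30  # from the README


def cot_bonus(think: str) -> float:
    """Hypothetical mapping: 0.10 for any keyword, 0.20 for three or more."""
    text = think.lower()
    hits = sum(1 for keyword in COT_KEYWORDS if keyword in text)
    if hits == 0:
        return 0.0
    return 0.20 if hits >= 3 else 0.10


def shaped_score(base: float, action: dict, think: str, seen: set) -> float:
    """Apply the CoT bonus and the anti-repetition penalty to a base score."""
    score = base + cot_bonus(think)
    # Fingerprint the action so exact repeats across steps can be detected.
    key = json.dumps(action, sort_keys=True)
    if key in seen:
        score -= REPEAT_PENALTY
    seen.add(key)
    # Clamping to [0, 1] is an assumption, not documented behavior.
    return max(0.0, min(1.0, score))
```

Passing the same `seen` set across the steps of one episode is what lets the second submission of an identical action incur the penalty.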
|
| 34 |
|
| 35 |
---
|
| 36 |
|