Update docs and reward progression plot
- README.md +5 -4
- openenv.yaml +1 -0
README.md
CHANGED
|
@@ -16,16 +16,17 @@ base_path: /dashboard/
|
|
| 16 |
Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
|
| 17 |
|
| 18 |
#### 1. Task Easy (Clarification Policies)
|
| 19 |
-
* **The Penalty:** If the agent's proposed rule contains words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader explicitly detects this **Clarity Coherence** failure
|
| 20 |
-
* **The Reward:** To get a high score (>`0.
|
| 21 |
|
| 22 |
#### 2. Task Medium (New Rule Generation Policies)
|
| 23 |
* **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
|
| 24 |
* **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
|
| 25 |
|
| 26 |
#### 3. Task Hard (Evolve Policy Framework Policies)
|
| 27 |
-
* **The Penalty:** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.
|
| 28 |
-
* **The
|
|
|
|
| 29 |
|
| 30 |
#### Global Bonuses & Penalties
|
| 31 |
* **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
|
|
|
|
| 16 |
Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
|
| 17 |
|
| 18 |
#### 1. Task Easy (Clarification Policies)
|
| 19 |
+
* **The Penalty:** If the agent's proposed rule contains vague words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader flags a **Clarity Coherence** failure. Furthermore, if the definition contains **ZERO measurable keywords** (e.g., `"threshold"`, `"verify"`, `"%"`), a hard penalty caps the base score below `0.30`, so the agent cannot succeed without numbers or strict conditionals.
|
| 20 |
+
* **The Reward:** To get a high score (>`0.85`), the agent must provide a definition that is entirely free of these vague terms, reference valid, active `affected_policy_ids`, include **measurable keywords**, and supply a substantive `justification` string.
|
| 21 |
|
| 22 |
#### 2. Task Medium (New Rule Generation Policies)
|
| 23 |
* **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
|
| 24 |
* **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
|
| 25 |
|
| 26 |
#### 3. Task Hard (Evolve Policy Framework Policies)
|
| 27 |
+
* **Penalty 1 (Hallucinations):** If the `expected_outcomes` predict that *all* metrics will improve perfectly (e.g., both revenue and fraud-blocking hit `0.95` with no downside variance), the grader flags an **Unrealistic Tradeoff** and fails the agent, capping the maximum score strictly below `0.30`.
|
| 28 |
+
* **Penalty 2 (Cross-Domain Mismatch):** Proposing an HR or AI policy for an e-commerce fraud scenario violates domain relevance. The grader uses targeted regex checks and deducts `0.30` from the score if the text contains no marketplace-relevant context.
|
| 29 |
+
* **The Reward:** The grader verifies variance across the predicted outcomes. Agents must write realistic tradeoffs and use standardized impact metric keys. Aliases are supported: for example, `"fraud"` or `"fraud_detection"` for `"fraud_rate"`, and `"queue_overload"` for `"revenue_velocity"`.
|
| 30 |
|
| 31 |
#### Global Bonuses & Penalties
|
| 32 |
* **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
|
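The Task Easy rules described in the README hunk above could be sketched roughly as follows. This is an illustration only, not the code in `server/grader.py`: the function names (`find_vague_words`, `has_measurable_keyword`, `cap_easy_score`) are hypothetical, the word lists contain only the examples quoted in the README, and the capped value `0.29` is an assumed stand-in for "below `0.30`".

```python
import re

# Vague words quoted in the README; the real grader's list may be longer.
VAGUE_WORDS = ("generally", "sometimes", "often", "maybe")
# Measurable keywords quoted in the README; the real list may be longer.
MEASURABLE_KEYWORDS = ("threshold", "verify", "%")


def find_vague_words(text: str) -> list[str]:
    """Return the vague words that occur as whole words in the text."""
    lowered = text.lower()
    return [w for w in VAGUE_WORDS if re.search(rf"\b{re.escape(w)}\b", lowered)]


def has_measurable_keyword(text: str) -> bool:
    """True if at least one measurable keyword appears anywhere in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in MEASURABLE_KEYWORDS)


def cap_easy_score(base_score: float, definition: str) -> float:
    """Keep the score below 0.30 when the definition has no measurable keyword."""
    if not has_measurable_keyword(definition):
        return min(base_score, 0.29)  # 0.29 is an assumed value under the cap
    return base_score
```

Under these assumptions, a definition such as "Refunds are generally approved" would be flagged for `"generally"` and capped, while "Verify orders above a 5% threshold" would pass both checks.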
openenv.yaml
CHANGED
|
@@ -38,6 +38,7 @@ environment:
|
|
| 38 |
schema: "ProposeNewRuleAction"
|
| 39 |
- action_type: "evolve_policy"
|
| 40 |
schema: "EvolveProcessAction"
|
|
|
|
| 41 |
|
| 42 |
reward_range: [0.0, 1.0]
|
| 43 |
|
|
|
|
| 38 |
schema: "ProposeNewRuleAction"
|
| 39 |
- action_type: "evolve_policy"
|
| 40 |
schema: "EvolveProcessAction"
|
| 41 |
+
description: "Hard task metric keys: fraud_rate (aliases: fraud_detection, fraud), revenue_velocity (aliases: queue_overload, revenue), seller_trust (aliases: seller_confidence, trust)."
|
| 42 |
|
| 43 |
reward_range: [0.0, 1.0]
|
| 44 |
|
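The alias list added to `openenv.yaml` suggests that the grader maps alternative metric keys onto three canonical ones before scoring. A minimal sketch of that mapping, together with a simple "everything improves" check for the Task Hard penalty, might look like the following. The names `normalize_outcomes` and `looks_unrealistic`, the choice to drop unknown keys, and the `0.95` ceiling used as the cutoff are all assumptions for illustration; the actual logic in `server/grader.py` may differ.

```python
# Canonical keys and aliases, copied from the openenv.yaml description.
ALIASES = {
    "fraud_rate": ("fraud_detection", "fraud"),
    "revenue_velocity": ("queue_overload", "revenue"),
    "seller_trust": ("seller_confidence", "trust"),
}

# Reverse lookup: every alias and every canonical key maps to its canonical key.
_CANONICAL = {alias: key for key, aliases in ALIASES.items() for alias in aliases}
_CANONICAL.update({key: key for key in ALIASES})


def normalize_outcomes(expected_outcomes: dict[str, float]) -> dict[str, float]:
    """Rewrite alias keys to canonical keys; unknown keys are dropped (assumption)."""
    normalized = {}
    for key, value in expected_outcomes.items():
        canonical = _CANONICAL.get(key.strip().lower())
        if canonical is not None:
            normalized[canonical] = value
    return normalized


def looks_unrealistic(outcomes: dict[str, float], ceiling: float = 0.95) -> bool:
    """True when every predicted metric reaches the ceiling, i.e. no tradeoff."""
    if not outcomes:
        return False
    return all(value >= ceiling for value in outcomes.values())
```

For example, `normalize_outcomes({"fraud": 0.8, "queue_overload": 0.4})` would yield keys `fraud_rate` and `revenue_velocity`, and a prediction where both sit at `0.95` would be flagged by `looks_unrealistic`.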