Somuai12 committed on
Commit 28e7c64 · 1 Parent(s): 147cdc4

Update docs and reward progression plot

Files changed (2)
  1. README.md +5 -4
  2. openenv.yaml +1 -0
README.md CHANGED
@@ -16,16 +16,17 @@ base_path: /dashboard/
  Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
 
  #### 1. Task Easy (Clarification Policies)
- * **The Penalty:** If the agent's proposed rule contains words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader explicitly detects this **Clarity Coherence** failure and heavily penalizes the score (drops it to baseline `0.12`).
- * **The Reward:** To get a high score (>`0.90`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, and provide a substantive, verbose `justification` string.
+ * **The Penalty:** If the agent's proposed rule contains vague words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader explicitly detects this **Clarity Coherence** failure. Furthermore, if the definition contains **ZERO measurable keywords** (e.g. `"threshold"`, `"verify"`, `"%"`), a strict hard penalty is triggered, capping the base score below `0.30`, making it impossible to succeed without numbers or strict conditionals.
+ * **The Reward:** To get a high score (>`0.85`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, include robust **measurable keywords**, and provide a substantive `justification` string.
 
  #### 2. Task Medium (New Rule Generation Policies)
  * **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
  * **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
 
  #### 3. Task Hard (Evolve Policy Framework Policies)
- * **The Penalty:** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.99`), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent. Real-world policies always have friction.
- * **The Reward:** The grader requires at least **two** distinct `policy_modifications` (e.g., one to tighten a rule, one for an exception). It also verifies the mathematical variance in the outcome projections, forcing the agent to demonstrate complex, balanced reasoning.
+ * **The Penalty 1 (Hallucinations):** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.95` without any downside variance), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent natively (maximum score capped strictly below `0.30`).
+ * **The Penalty 2 (Cross-Domain Mismatch):** Proposing an HR or AI policy for an e-commerce fraud scenario violates domain relevance. By using targeted Regex logic, a `-0.30` penalty is immediately stripped from the score if the text does not contain marketplace-relevant context.
+ * **The Reward:** The grader verifies mathematical outcome variance. Agents must write realistic tradeoffs and utilize standardized impact metric keys (aliases are robustly supported, e.g., you can use `"fraud_rate"`, `"fraud"`, or `"fraud_detection"`; or `"queue_overload"` for `"revenue_velocity"`).
 
  #### Global Bonuses & Penalties
  * **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
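For readers who want to see how the updated Task Easy rules could fit together, here is a minimal sketch. It is not the contents of `server/grader.py`, which this commit does not show. The function names, the starting score of `0.90`, the `VAGUE_PENALTY` deduction, and the exact cap value of `0.29` are all assumptions chosen for illustration; only the quoted word lists and the "capped below `0.30`" rule come from the README text. The sketch also ignores the `affected_policy_ids` and `justification` checks.

```python
import re

# Vague terms quoted in the README; the real grader may use a longer list.
VAGUE_WORDS = ("generally", "sometimes", "often", "maybe")
# Measurable keywords quoted in the README as examples (assumed incomplete).
MEASURABLE_KEYWORDS = ("threshold", "verify", "%")

VAGUE_PENALTY = 0.40       # hypothetical deduction, not stated in the README
NO_MEASURABLE_CAP = 0.29   # hypothetical value satisfying "below 0.30"


def find_vague_words(definition):
    """Return the vague words that occur as whole words in the definition."""
    lowered = definition.lower()
    return [w for w in VAGUE_WORDS if re.search(rf"\b{re.escape(w)}\b", lowered)]


def has_measurable_keyword(definition):
    """True if any measurable keyword appears as a substring."""
    lowered = definition.lower()
    return any(k in lowered for k in MEASURABLE_KEYWORDS)


def score_easy_definition(definition, base_score=0.90):
    """Apply the vague-word deduction, then the measurable-keyword cap."""
    score = base_score
    if find_vague_words(definition):
        score -= VAGUE_PENALTY
    if not has_measurable_keyword(definition):
        # Hard cap: without numbers or strict conditionals the score stays low.
        score = min(score, NO_MEASURABLE_CAP)
    return max(0.0, round(score, 2))
```

Applying the cap after the deduction means a definition with no measurable keywords can never score `0.30` or higher, whatever else it gets right, which matches the "impossible to succeed" wording in the README.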
openenv.yaml CHANGED
@@ -38,6 +38,7 @@ environment:
  schema: "ProposeNewRuleAction"
  - action_type: "evolve_policy"
  schema: "EvolveProcessAction"
+ description: "Hard task metric keys: fraud_rate (aliases: fraud_detection, fraud), revenue_velocity (aliases: queue_overload, revenue), seller_trust (aliases: seller_confidence, trust)."
 
  reward_range: [0.0, 1.0]
 
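The alias list in the new `description` field suggests a normalization step before the Hard task outcomes are graded. The sketch below shows one way that could look, assuming `expected_outcomes` is a flat mapping of metric name to number. The helper names `normalize_outcomes` and `looks_unrealistic`, the choice to ignore unknown keys, and the use of `0.95` as a ceiling are hypothetical; only the alias table is taken from `openenv.yaml`.

```python
# Alias table transcribed from the `description` field in openenv.yaml.
METRIC_ALIASES = {
    "fraud_rate": ("fraud_detection", "fraud"),
    "revenue_velocity": ("queue_overload", "revenue"),
    "seller_trust": ("seller_confidence", "trust"),
}

# Reverse lookup: every canonical key and alias maps to its canonical key.
_CANONICAL = {
    name: canon
    for canon, aliases in METRIC_ALIASES.items()
    for name in (canon, *aliases)
}


def normalize_outcomes(expected_outcomes):
    """Rewrite alias keys to canonical keys; unknown keys are dropped (assumption)."""
    normalized = {}
    for key, value in expected_outcomes.items():
        canon = _CANONICAL.get(key.strip().lower())
        if canon is None:
            continue
        # If two aliases of the same metric are present, the later one wins.
        normalized[canon] = float(value)
    return normalized


def looks_unrealistic(expected_outcomes, ceiling=0.95):
    """True when at least two metrics are reported and all reach the ceiling.

    Assumes every value is an improvement score where higher is better.
    """
    values = list(normalize_outcomes(expected_outcomes).values())
    return len(values) >= 2 and all(v >= ceiling for v in values)
```

For example, `normalize_outcomes({"fraud": 0.4, "queue_overload": 0.7})` yields `{"fraud_rate": 0.4, "revenue_velocity": 0.7}`, so downstream checks only need to know the three canonical keys.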