Somuai12 committed on
Commit
74e5e1d
·
1 Parent(s): 82f6517

Detail grading rewards and penalties in README

Files changed (1)
  1. README.md +17 -6
README.md CHANGED
@@ -13,13 +13,24 @@ base_path: /dashboard/
13
  ---
14
 
15
  ### Advanced Reward Shaping (RLVR Integration)
16
- Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic grading engine designed to harden LLM strategic reasoning:
17
 
18
- * **Tiered CoT Bonus**: Rewards analytical reasoning (up to +0.20) based on keyword density and length.
19
- * **Clarity Coherence**: Penalizes "vague" or "subjective" (e.g., *maybe, perhaps*) policy definitions in Easy tasks.
20
- * **Tradeoff Realism**: Detects and caps "hallucinated" outcomes in Hard tasks (e.g., claiming to simultaneously maximize fraud-prevention and revenue).
21
- * **Step-Delta Shaping**: Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's best score.
22
- * **Anti-Repetition Penalty**: Applies a -0.30 penalty to exact repeated actions, forcing the agent toward continuous evolution.
23
 
24
  ---
25
 
 
13
  ---
14
 
15
  ### Advanced Reward Shaping (RLVR Integration)
16
+ Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
17
 
18
+ #### 1. Task Easy (Clarification Policies)
19
+ * **The Penalty:** If the agent's proposed rule contains words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader explicitly detects this **Clarity Coherence** failure and heavily penalizes the score (drops it to baseline `0.12`).
20
+ * **The Reward:** To get a high score (>`0.90`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, and provide a substantive, verbose `justification` string.
21
+
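The Clarity Coherence check described for Task Easy could look roughly like the sketch below. It is not the code in `server/grader.py`: the function names are hypothetical, the term list contains only the four examples the README quotes, and whole-word matching is an assumption.

```python
import re

# Only the four examples quoted in the README; the real list may be longer.
VAGUE_TERMS = ("generally", "sometimes", "often", "maybe")
BASELINE_SCORE = 0.12  # baseline value stated in the README


def find_vague_terms(definition: str) -> list[str]:
    """Return the vague terms that occur as whole words (case-insensitive)."""
    words = set(re.findall(r"[a-z']+", definition.lower()))
    return [term for term in VAGUE_TERMS if term in words]


def apply_clarity_penalty(raw_score: float, definition: str) -> float:
    """Cap the score at the baseline when any vague term is present (assumed behavior)."""
    if find_vague_terms(definition):
        return min(raw_score, BASELINE_SCORE)
    return raw_score
```

Matching on whole words keeps a term such as "oftentimes" from triggering the "often" rule; whether the actual grader is that strict is not stated.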
22
+ #### 2. Task Medium (New Rule Generation Policies)
23
+ * **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
24
+ * **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
25
+
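As a rough illustration of the Task Medium checks, the sketch below collects the problems the README lists. The helper name, the dictionary layout of the action, and the `MIN_RULE_CHARS` threshold are all assumptions; the README does not say how short is "too short".

```python
MIN_RULE_CHARS = 40  # hypothetical threshold; the README gives no number


def medium_task_issues(action: dict, expected_domain: str) -> list[str]:
    """List the Task Medium problems found in an action (illustrative only)."""
    issues = []
    if not action.get("scope"):
        issues.append("empty scope")
    if len(action.get("new_rule", "").strip()) < MIN_RULE_CHARS:
        issues.append("new_rule too short")
    if action.get("rule_domain") != expected_domain:
        issues.append("wrong rule_domain")
    return issues
```

An empty result means none of the listed penalties would apply; how each issue maps to a score reduction is left open here because the README does not quantify it.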
26
+ #### 3. Task Hard (Evolve Policy Framework Policies)
27
+ * **The Penalty:** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.99`), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent. Real-world policies always have friction.
28
+ * **The Reward:** The grader requires at least **two** distinct `policy_modifications` (e.g., one to tighten a rule, one for an exception). It also verifies the mathematical variance in the outcome projections, forcing the agent to demonstrate complex, balanced reasoning.
29
+
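The Task Hard rules can be sketched as follows, under several assumptions: every metric in `expected_outcomes` is a higher-is-better value in the range 0 to 1, "perfectly improve" means every metric reaches a hypothetical `NEAR_PERFECT` cutoff, and the variance check uses a hypothetical `MIN_VARIANCE` floor. None of these constants or function names come from `server/grader.py`.

```python
from statistics import pvariance

NEAR_PERFECT = 0.95   # hypothetical cutoff for "perfectly improved"
MIN_VARIANCE = 1e-3   # hypothetical floor on the spread between outcomes


def is_unrealistic_tradeoff(expected_outcomes: dict[str, float]) -> bool:
    """True when every projected metric is near-perfect at the same time."""
    values = list(expected_outcomes.values())
    return bool(values) and all(v >= NEAR_PERFECT for v in values)


def hard_task_passes(action: dict) -> bool:
    """Apply the three Task Hard checks described in the README."""
    outcomes = action.get("expected_outcomes", {})
    modifications = action.get("policy_modifications", [])
    if is_unrealistic_tradeoff(outcomes):
        return False  # Unrealistic Tradeoff: the agent fails outright
    if len(modifications) < 2:
        return False  # at least two distinct modifications are required
    values = list(outcomes.values())
    return len(values) >= 2 and pvariance(values) >= MIN_VARIANCE
```

For example, projecting both revenue and fraud-blocking at `0.99` fails the first check, while a projection that trades some revenue for better fraud-blocking passes as long as two modifications are supplied.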
30
+ #### Global Bonuses & Penalties
31
+ * **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
32
+ * **Step-Delta Shaping:** Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's previous best score.
33
+ * **Anti-Repetition Penalty (-0.30):** Applies a severe penalty to exact repeated actions across steps, forcing the agent toward continuous exploration and evolution.
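The three global adjustments above might combine along these lines. This is a sketch, not the grader itself: the keyword tiers (two hits for +0.10, all four for +0.20), the `margin` and `bonus` values for step-delta shaping, and every function name are assumptions. Only the keyword examples, the bonus range, and the -0.30 penalty come from the README.

```python
COT_KEYWORDS = ("tradeoff", "precision", "recall", "threshold")  # README examples
REPEAT_PENALTY = -0.30  # value stated in the README


def cot_bonus(think: str) -> float:
    """Tiered bonus based on how many analytical keywords appear (assumed tiers)."""
    text = think.lower()
    hits = sum(1 for keyword in COT_KEYWORDS if keyword in text)
    if hits >= 4:
        return 0.20
    if hits >= 2:
        return 0.10
    return 0.0


def repetition_penalty(action: dict, history: list[dict]) -> float:
    """Penalize an action that exactly matches any earlier action in the episode."""
    return REPEAT_PENALTY if action in history else 0.0


def improvement_bonus(score: float, best_so_far: float,
                      margin: float = 0.05, bonus: float = 0.05) -> float:
    """Reward a step that beats the previous best by more than `margin` (hypothetical values)."""
    return bonus if score - best_so_far > margin else 0.0
```

Comparing whole action dictionaries treats "exact repeat" literally, so changing any field avoids the penalty; the real grader may normalize actions before comparing them.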
34
 
35
  ---
36