Space: luciferai-devil/devil-policyevolverenv

Somuai12 committed on Apr 6

Commit 74e5e1d (1 parent: 82f6517)

Detail grading rewards and penalties in README

Files changed (1): README.md +17 -6

@@ -13,13 +13,24 @@ base_path: /dashboard/
 ---
 ### Advanced Reward Shaping (RLVR Integration)
-Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic grading engine designed to harden LLM strategic reasoning:
-*   **Tiered CoT Bonus**: Rewards analytical reasoning (up to +0.20) based on keyword density and length.
-*   **Clarity Coherence**: Penalizes "vague" or "subjective" (e.g., *maybe, perhaps*) policy definitions in Easy tasks.
-*   **Tradeoff Realism**: Detects and caps "hallucinated" outcomes in Hard tasks (e.g., claiming to simultaneously maximize fraud-prevention and revenue).
-*   **Step-Delta Shaping**: Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's best score.
-*   **Anti-Repetition Penalty**: Encounters a -0.30 penalty for exact repeated actions, forcing the agent toward continuous evolution.
 ---

 ---
 ### Advanced Reward Shaping (RLVR Integration)
+Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
+#### 1. Task Easy (Clarification Policies)
+*   **The Penalty:** If the agent's proposed rule contains words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader flags a **Clarity Coherence** failure and drops the score to the baseline of `0.12`.
+*   **The Reward:** To get a high score (>`0.90`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, and provide a substantive, verbose `justification` string.
+#### 2. Task Medium (New Rule Generation Policies)
+*   **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
+*   **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
+#### 3. Task Hard (Evolve Policy Framework Policies)
+*   **The Penalty:** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.99`), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent. Real-world policies always have friction.
+*   **The Reward:** The grader requires at least **two** distinct `policy_modifications` (e.g., one to tighten a rule, one for an exception). It also verifies the mathematical variance in the outcome projections, forcing the agent to demonstrate complex, balanced reasoning.
+#### Global Bonuses & Penalties
+*   **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
+*   **Step-Delta Shaping:** Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's previous best score.
+*   **Anti-Repetition Penalty (-0.30):** Applies a severe penalty to exact repeated actions across steps, forcing the agent toward continuous exploration and evolution.
 ---
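
For readers who want a concrete picture of the checks this README change describes, here is a minimal Python sketch. It is not the contents of `server/grader.py`, which this diff does not show. The function names, the bonus tiers (two keywords for +0.10, all four for +0.20), the `0.99` ceiling used to detect an unrealistic tradeoff, and the clamping to `[0, 1]` are all assumptions made for illustration. Only the word lists, the `0.12` baseline, and the `-0.30` penalty come from the README text.

```python
import re

# Values quoted in the README; everything else below is a hypothetical sketch.
VAGUE_TERMS = {"generally", "sometimes", "often", "maybe"}
COT_KEYWORDS = {"tradeoff", "precision", "recall", "threshold"}
EASY_BASELINE = 0.12
REPEAT_PENALTY = -0.30


def _words(text):
    """Lowercase alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def has_vague_terms(rule_text):
    """Easy task: True if the rule contains any listed vague term."""
    return bool(_words(rule_text) & VAGUE_TERMS)


def is_unrealistic_tradeoff(expected_outcomes, ceiling=0.99):
    """Hard task: True if every projected metric reaches the assumed ceiling."""
    values = list(expected_outcomes.values())
    return bool(values) and all(v >= ceiling for v in values)


def cot_bonus(think):
    """Global bonus from the `think` field (tier thresholds are assumed)."""
    hits = len(_words(think) & COT_KEYWORDS)
    if hits >= 4:
        return 0.20
    if hits >= 2:
        return 0.10
    return 0.0


def shaped_score(base, action, previous_actions):
    """Apply the CoT bonus and the exact-repeat penalty, then clamp to [0, 1]."""
    score = base + cot_bonus(action.get("think", ""))
    if action in previous_actions:  # exact repeat of an earlier action
        score += REPEAT_PENALTY
    return round(max(0.0, min(1.0, score)), 2)
```

Under these assumptions, an Easy-task grader would return `EASY_BASELINE` whenever `has_vague_terms` is true, and a Hard-task grader would fail the action whenever `is_unrealistic_tradeoff` is true, before any global shaping is applied.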