Somuai12 committed on
Commit
f2195b2
·
1 Parent(s): 95a7dc0

Restructure README to required format: overview, spaces, tasks, setup, baseline

Files changed (1)
  1. README.md +206 -126
README.md CHANGED
@@ -6,189 +6,269 @@ sdk: docker
6
  app_port: 8000
7
  base_path: /dashboard/
8
  ---
9
- # PolicyEvolverEnv: Multi-Modal Strategic Governance Sandbox
10
 
11
- **PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment designed for the **Meta × PyTorch × Scaler Hackathon**. It serves as a production-grade benchmark for demonstrating in-context policy improvement using RLVR signals. No weight updates are required, which makes the environment compute-efficient and immediately deployable.
12
 
13
- > **Why this matters:** Meta's Oversight Board reviewed 300K+ content appeals in 2024 due to vague community standards. Amazon's marketplace loses an estimated $700M/year to false-positive seller suspensions. PolicyEvolverEnv directly addresses this gap by training agents to replace subjective governance terms with measurable, enforceable thresholds, turning policy ambiguity into a solvable optimization problem.
14
 
15
- ---
16
-
17
- ### Advanced Reward Shaping (RLVR Integration)
18
- Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
19
 
20
- #### 1. Task Easy (Clarification Policies)
21
- * **The Penalty:** If the agent's proposed rule contains vague words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader explicitly detects this **Clarity Coherence** failure. Furthermore, if the definition contains **ZERO measurable keywords** (e.g. `"threshold"`, `"verify"`, `"%"`), a strict hard penalty is triggered, capping the base score below `0.30`, making it impossible to succeed without numbers or strict conditionals.
22
- * **The Reward:** To get a high score (>`0.85`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, include robust **measurable keywords**, and provide a substantive `justification` string.
23
 
24
- #### 2. Task Medium (New Rule Generation Policies)
25
- * **The Penalty:** If the `scope` array is empty, or if the `new_rule` text is too short (indicating an incomplete thought), the score drops significantly.
26
- * **The Reward:** The grader enforces that the agent accurately targets the correct missing `rule_domain`. Maximum points are awarded only when the agent provides legitimate integration points showing how the new policy connects to existing framework policies.
 
27
 
28
- #### 3. Task Hard (Evolve Policy Framework Policies)
29
- * **The Penalty 1 (Hallucinations):** If the `expected_outcomes` predict that *all* metrics will perfectly improve (e.g., both revenue and fraud-blocking hit `0.95` without any downside variance), the grader recognizes this as an **Unrealistic Tradeoff** and explicitly fails the agent natively (maximum score capped strictly below `0.30`).
30
- * **The Penalty 2 (Cross-Domain Mismatch):** Proposing an HR or AI policy for an e-commerce fraud scenario violates domain relevance. By using targeted Regex logic, a `-0.30` penalty is immediately stripped from the score if the text does not contain marketplace-relevant context.
31
- * **The Reward:** The grader verifies mathematical outcome variance. Agents must write realistic tradeoffs and utilize standardized impact metric keys (aliases are robustly supported, e.g., you can use `"fraud_rate"`, `"fraud"`, or `"fraud_detection"`; or `"queue_overload"` for `"revenue_velocity"`).
32
 
33
- #### Global Bonuses & Penalties
34
- * **Chain of Thought (CoT) Bonus (+0.10 to +0.20):** Across all three tasks, the grader evaluates the `think` field. If the agent includes comprehensive analytical keywords (like `"tradeoff"`, `"precision"`, `"recall"`, `"threshold"`), the grader awards a flat strategic bonus, mathematically incentivizing deep meta-reasoning.
35
- * **Step-Delta Shaping:** Provides an `improvement_bonus` for iterative actions that significantly outperform the episode's previous best score.
36
- * **Anti-Repetition Penalty (-0.30):** Encounters a severe penalty for exact repeated actions across steps, forcing the agent toward continuous exploration and evolution.
37
 
38
- ---
39
 
40
- ## Environment Description & Motivation
41
- PolicyEvolverEnv is a real-world governance sandbox where an AI agent improves its in-context policy to **design and evolve governance policies** through meta-reasoning over real-world operational data. In modern platforms (social media, enterprise HR, e-commerce), static policies quickly become outdated or vaguely applied, leading to inconsistent enforcement, false-positive moderation, and unrecognized fraud.
42
 
43
- This environment simulates this challenge by presenting the agent with a corpus of operational data alongside an existing policy framework. The agent's goal is to analyze the outcomes, identify systemic flaws or ambiguities, and act directly on the policies to optimize governance outcomes. This directly tackles live production problems faced by platforms like Meta.
44
 
45
- ## The Strategic Concept
46
 
47
- ### 1. The Core Idea: What is PolicyEvolverEnv?
48
- Most AI environments are games (like Chess or Atari). **PolicyEvolverEnv** is different: it is a **Strategic Governance Sandbox**.
49
 
50
- The environment represents the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of inference-time adaptation. It gives an agent a score (Reward) based on how well it identifies a flaw in a policy and "evolves" it to be more precise.
51
 
52
- * **The Problem**: Human moderators or automated systems make mistakes because the "Rules of the Game" are broken.
53
- * **The Solution**: An AI agent that doesn't just follow rules, but **designs** them.
54
 
55
- ### 2. The Gradio "Judge Console": How it Works
56
- The dashboard we built (`server/app.py`) is the human-readable window into this environment. It's designed as a **Command & Control** center for a "Policy Judge."
 
 
 
 
 
 
57
 
58
- #### The Left Panel: Scenario Metrics
59
- * **Environment Best Score**: This tracks the highest score achieved in this session. It represents the "Gold Standard" the agent is aiming for.
60
- * **Remaining Execution Steps**: Each "Episode" has a limit (5 steps). The agent must improve the policy within this budget. This forces **Strategic Efficiency**.
61
- * **Latest Strategic Reward**: Every time you click "Execute," the Grader (`server/grader.py`) analyzes your proposal. If it's vague, you get a low reward (0.1–0.3). If it's specific and measurable, you get a high reward (0.8–0.9).
62
 
63
- #### The Right Panel: Observations
64
- * **Data Corpus (Tabular View)**: These are the "Facts on the Ground." These are real-world incidents (e.g., a post flagged for 'harassment' vs one that wasn't).
65
- * **Active Framework**: This shows the current "Code of Law."
66
- * **The Workflow**: Your goal is to find an incident in the Corpus that doesn't fit correctly into the Framework, then use the bottom console to fix it.
 
 
 
 
 
67
 
68
- ### 3. The Power Buttons: Action Space
69
- At the bottom, you have the **Action Console**. This is where the "Evolution" happens:
70
 
71
- * **Initialize Scenario**: This "boots" a specific challenge.
72
- * **Easy**: Fixing vague words.
73
- * **Medium**: Finding a completely missing category.
74
- * **Hard**: Balancing complex trade-offs (like reducing fraud without hurting good sellers).
75
- * **Load Expert Suggestion**: This populates the form with a "Perfect" answer. It shows the Judge exactly what a high-performing agent looks like.
76
- * **Execute Strategic Step**: This is the most important button. It takes everything you typed, packages it into a Pydantic Model (`models.py`), and sends it to the environment. It triggers the **Refinement Loop**: The agent sees its score, reads the feedback, and tries again in the next step to get a higher reward.
 
 
77
 
78
- ### 4. The Final Result: Strategic Convergence
79
- The goal of the whole idea is **Strategic Convergence**. When the "Current Project Score" hits **0.85 or higher**, it means the Agent has successfully evolved the policy framework to a point where it is:
80
 
81
- * **Objective**: No more biased "gut-feel" moderation.
82
- * **Measurable**: Success is defined by numbers (Precision/Recall).
83
- * **Future-Proof**: The agent has filled gaps (like AI-generated content) that didn't exist when the original rules were written.
84
 
85
- ## Observation Space
86
- The `Observation` received by the agent at every step describes the current operational context:
87
- - `task_id` (str): Identifier for the active scenario.
88
- - `episode_id` (str): Unique session tracker.
89
- - `step_count` (int): Active step number (Max 5 per episode).
90
- - `data_corpus` (List[Dict]): Represents operational examples like social media posts, HR incidents, or seller accounts along with the action taken or outcome.
91
- - `current_policies` (List[Dict]): The list of current active policies the system follows.
92
- - `system_metrics` & `policy_outcomes`: Operational statistics reflecting precision/recall or false-positive rates.
93
- - `identified_issues`: Current known flaws in the governance pipeline.
94
 
95
- ## Action Space
96
- The Action space utilizes a highly structured Discriminated Union model to represent multi-faceted policy adjustments:
 
 
 
 
 
 
 
 
 
 
 
 
 
97
 
98
- **1. ProposeClarificationAction (`propose_clarification`)**
99
- - Targets an `ambiguous_term` in an existing policy.
100
- - Requires a specific, measurable `suggested_definition` and `justification`.
101
- **2. ProposeNewRuleAction (`propose_new_rule`)**
102
- - Addresses an unhandled domain (`rule_domain`).
103
- - Requires `new_rule` text, application `scope`, and `integration_points` connecting to older policies.
104
- **3. EvolveProcessAction (`evolve_policy`)**
105
- - The hardest action; holistically modifies existing rules.
106
- - Requires a list of `policy_modifications`, realistic `expected_outcomes` deltas, and multi-metric `rollback_conditions`.
107
 
108
- *(Each action also supports an optional `think` property allowing Chain-of-Thought meta-reasoning for a reward score bonus).*
109
 
110
- ## Tasks
111
- The environment provides three procedural tasks designed to ramp up in cognitive reasoning difficulty:
 
 
 
112
 
113
- ### 1. Task Easy (Social Media Community Guidelines)
114
- * **The Scenario:** Refining a social media platform's initial content moderation rules.
115
- * **The Problem:** The existing rule simply stated that "offensive or inappropriate content" was prohibited. This was far too subjective, leading to inconsistent moderation.
116
- * **The Policy Applied (Action taken by Agent):** The agent was required to use the `propose_clarification` action. It took the vague term (like "offensive") and redefined it using strict, measurable thresholds (e.g., "specific threats of physical violence" or "explicit slurs targeting protected identity characteristics"). By removing subjectivity, the policy became actionable and deterministic.
117
 
118
- ### 2. Task Medium (Corporate HR Data Privacy)
119
- * **The Scenario:** Updating a company's internal confidentiality framework.
120
- * **The Problem:** The existing HR policy covered generic data protection but had a massive gap regarding the use of modern Generative AI tools (like employees pasting proprietary code into ChatGPT).
121
- * **The Policy Applied (Action taken by Agent):** The agent was required to use the `propose_new_rule` action. It drafted an entirely new policy targeting the specific gap: "Employees must explicitly disclose and gain approval for any use of Generative AI tools when handling proprietary code or client proposals." This successfully bridged the gap between basic confidentiality and modern AI risks.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
122
 
123
- ### 3. Task Hard (E-Commerce Trust & Safety Framework)
124
- * **The Scenario:** Managing an e-commerce platform facing a complex fraud problem, where current rules were causing too many "false positives" (locking out legitimate, high-volume sellers).
125
- * **The Problem:** The platform needed to catch rapid-velocity fraud without ruining the experience for trusted legacy merchants.
126
- * **The Policy Applied (Action taken by Agent):** The agent used the `evolve_policy` action for a holistic system update. It had to apply at least two complex modifications to balance Precision and Recall:
127
- * **Tightening Rule:** Added a strict identity-verification trigger for new sellers showing extreme sales velocity (e.g., >20 sales/day in first 30 days).
128
- * **Exemption Rule:** Rolled back the manual review thresholds for trusted legacy sellers to reduce false positives and preserve revenue.
129
 
130
- *In short: Easy focused on removing vagueness, Medium focused on patching a missing risk gap (GenAI), and Hard focused on balancing complex system trade-offs (Fraud vs. Revenue).*
131
 
132
- ## Setup & Usage
133
 
134
- ### 1. Local Installation
135
  ```bash
- git clone <repository_url>
- cd policy_evolver_env
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r server/requirements.txt
  ```
142
 
143
- ### 2. Run the Environment API
144
- Start the FastAPI environment server locally:
145
  ```bash
  uvicorn server.app:app --port 8000
  ```
148
- This boots all core endpoint paths (`/reset`, `/step`, `/state`, `/tasks`, `/grader`, `/health`).
149
 
150
- ### 3. Run the Inference Baseline (Hackathon Entry)
151
- The primary entry point for evaluation is **`inference.py`** in the root directory. This script strictly follows the Meta Hackathon `[START]`, `[STEP]`, `[END]` logging format.
 
152
 
153
- Export your environment variables:
154
  ```bash
- export API_BASE_URL="https://api.groq.com/openai/v1"
- export MODEL_NAME="llama-3.3-70b-versatile"
- export HF_TOKEN="your_token_here"
  ```
159
 
160
- Execute the baseline evaluation:
 
 
 
161
  ```bash
  python3 inference.py
  ```
164
- *(Optionally, you can run a specific task: `python3 inference.py task_easy`)*.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
165
 
166
  ---
167
 
168
- *(Note: The legacy baseline at `baseline/run_baseline.py` is still available for detailed JSON analytical reports but does not follow the hackathon logging format).*
169
 
170
- ## Baseline Performance: In-Context Policy Improvement
171
 
172
- The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and failure diagnosis.
173
 
174
- | Task | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Converged |
- |------|--------|--------|--------|--------|--------|-----------|
- | task_easy | 0.94 | N/A | N/A | N/A | N/A | Yes |
- | task_medium | 1.00 | N/A | N/A | N/A | N/A | Yes |
- | task_hard | 0.90 | N/A | N/A | N/A | N/A | Yes |
179
 
180
- **Model:** llama-3.1-8b-instant (via Groq)
181
- **Reproducible:** temperature=0.0, seed=42 (**Bit-for-bit identical results verified**)
182
- **No fine-tuning required.** The environment provides the learning signal; the model adapts its in-context policy each step.
183
 
184
- ## Strategic Reward Evolution & RLVR
185
- PolicyEvolverEnv serves as the **Strategic Sandbox** for the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of the modern LLM inference pipeline. Unlike static evaluation, this environment enables agents to refine their strategies iteratively based on high-quality, verifiable feedback.
 
 
 
186
 
187
- ![Reward Progression](https://raw.githubusercontent.com/Luciferai04/PolicyEvolverEnv/master/reward_progression.png)
188
 
189
- ### How It Works: The Iterative Refinement Process
190
- 1. **Refinement Hub**: The baseline agent tracks its previous rewards and actions through the observation's metadata (`info`).
191
- 2. **Strategic pivoting**: If a policy proposal receives low rewards (due to lack of specificity or missing justifications), the agent identifies the failure points and pivots its strategy in subsequent steps.
192
- 3. **Measurable Improvement**: As shown in the progression chart, iterative refinement leads to **Strategic Convergence**, where the policy quality reaches institutional standards (Score ≥ 0.85).
193
 
194
- For a detailed technical dive into how our project maps to RLHF/RLVR training architectures, see **[STRATEGIC_LEARNING.md](STRATEGIC_LEARNING.md)**.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
  app_port: 8000
7
  base_path: /dashboard/
8
  ---
 
9
 
10
+ # PolicyEvolverEnv
11
 
12
+ ## 1. Environment Overview and Motivation
13
 
14
+ **PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data.
 
 
 
15
 
16
+ ### The Problem
 
 
17
 
18
+ In modern platforms (social media, enterprise HR, and e-commerce), static policies quickly become outdated or vaguely worded, leading to:
19
+ - **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
20
+ - **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
21
+ - **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.
22
 
23
+ ### The Solution
 
 
 
24
 
25
+ PolicyEvolverEnv simulates these challenges by presenting the agent with:
26
+ 1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions).
27
+ 2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds).
 
28
 
29
+ The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications**: not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination.
30
 
31
+ ### Why This Matters for RLVR
 
32
 
33
+ This environment operates at the **Reinforcement Learning from Verifiable Rewards (RLVR)** layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback, which demonstrates in-context policy learning.
34
 
35
+ ---
36
 
37
+ ## 2. Action Space
 
38
 
39
+ The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types:
40
 
41
+ ### `propose_clarification`: Easy Task Action
 
42
 
43
+ | Field | Type | Description |
+ |:------|:-----|:------------|
+ | `action_type` | `Literal["propose_clarification"]` | Discriminator tag |
+ | `ambiguous_term` | `str` | The exact vague term found in existing policies |
+ | `suggested_definition` | `str` | A specific, measurable replacement definition |
+ | `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects |
+ | `justification` | `str` | Why this term is ambiguous and why the fix works |
+ | `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
51
 
52
+ ### `propose_new_rule`: Medium Task Action
 
 
 
53
 
54
+ | Field | Type | Description |
+ |:------|:-----|:------------|
+ | `action_type` | `Literal["propose_new_rule"]` | Discriminator tag |
+ | `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) |
+ | `new_rule` | `str` | The complete new rule text |
+ | `scope` | `List[str]` | Scenario types this rule applies to |
+ | `integration_points` | `List[str]` | How it connects to existing policy IDs |
+ | `justification` | `str` | Why a gap exists and how this rule fills it |
+ | `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
63
 
64
+ ### `evolve_policy`: Hard Task Action
 
65
 
66
+ | Field | Type | Description |
+ |:------|:-----|:------------|
+ | `action_type` | `Literal["evolve_policy"]` | Discriminator tag |
+ | `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` |
+ | `expected_outcomes` | `Dict[str, float]` | Metric name → expected value (must show realistic tradeoffs) |
+ | `rollback_conditions` | `List[str]` | When to revert changes |
+ | `justification` | `str` | Comprehensive reasoning for the evolution |
+ | `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
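
As a rough illustration of how the `action_type` tag selects which fields an action must carry, the sketch below checks a payload against the field lists in the three tables above. It uses only the standard library and is not the Pydantic model from `models.py`; `REQUIRED_FIELDS`, `validate_action`, and the policy ID `"P1"` are hypothetical names introduced for this example.

```python
from typing import Any, Dict, List

# Required fields per action type, copied from the tables above.
# `think` is optional for every action, so it is not listed here.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "propose_clarification": [
        "ambiguous_term", "suggested_definition",
        "affected_policy_ids", "justification",
    ],
    "propose_new_rule": [
        "rule_domain", "new_rule", "scope",
        "integration_points", "justification",
    ],
    "evolve_policy": [
        "policy_modifications", "expected_outcomes",
        "rollback_conditions", "justification",
    ],
}


def validate_action(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks well-formed."""
    action_type = payload.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[action_type]
        if name not in payload
    ]


# Illustrative payload; "P1" is a made-up policy ID, not one from the task data.
example = {
    "action_type": "propose_clarification",
    "ambiguous_term": "offensive",
    "suggested_definition": "Content containing explicit slurs, verified within 24 hours of a report.",
    "affected_policy_ids": ["P1"],
    "justification": "The term has no measurable threshold.",
}
print(validate_action(example))
```

A real client would more likely rely on the Pydantic models for this check; the point here is only that the discriminator decides which schema applies.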
74
 
75
+ ---
 
76
 
77
+ ## 3. Observation Space
 
 
78
 
79
+ The `Observation` returned by `reset()` and `step()` contains:
 
 
 
 
 
 
 
 
80
 
81
+ | Field | Type | Description |
+ |:------|:-----|:------------|
+ | `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) |
+ | `episode_id` | `str` | Unique episode session tracker |
+ | `step_count` | `int` | Current step number (max 5 per episode) |
+ | `corpus_size` | `int` | Total incidents in the full data corpus |
+ | `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) |
+ | `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` |
+ | `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) |
+ | `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) |
+ | `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) |
+ | `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline |
+ | `reward` | `float` | Score from the grader for the last action, in (0, 1) |
+ | `done` | `bool` | Whether the episode has ended |
+ | `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` |
96
 
97
+ ### Staff Feedback (in `info`)
 
 
 
 
 
 
 
 
98
 
99
+ After each step, the observation includes structured staff feedback to guide the agent's next action:
100
 
101
+ | Field | Example Values | Purpose |
+ |:------|:---------------|:--------|
+ | `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward |
+ | `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly |
+ | `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step |
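
One way an agent might fold this feedback into its next prompt is sketched below. The keys read from `info` are the ones listed above; the `summarize_feedback` helper and the sample values are hypothetical and are not taken from `inference.py` or from a real run.

```python
from typing import Any, Dict


def summarize_feedback(info: Dict[str, Any]) -> str:
    """Condense the `info` dict into a short text block for the next prompt."""
    feedback = info.get("staff_feedback", {})
    history = ", ".join(f"{r:.2f}" for r in info.get("rewards_history", []))
    lines = [
        f"Best score so far: {info.get('best_score', 0.0):.2f}",
        f"Reward history: [{history}]",
        f"Steps remaining: {info.get('steps_remaining', 0)}",
        f"Rating: {feedback.get('strategic_rating', 'n/a')}",
        f"Focus: {feedback.get('focus', 'n/a')}",
        f"Recommendation: {feedback.get('recommendation', 'n/a')}",
    ]
    return "\n".join(lines)


# Illustrative values only; the numbers are not taken from a real episode.
sample_info = {
    "best_score": 0.4,
    "rewards_history": [0.4],
    "steps_remaining": 4,
    "staff_feedback": {
        "strategic_rating": "Junior Associate",
        "focus": "Burying the lede or distracted by noise",
        "recommendation": "Maintain high signal-to-noise ratio and lead with the fix.",
    },
}
print(summarize_feedback(sample_info))
```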
106
 
107
+ ---
 
 
 
108
 
109
+ ## 4. Task Descriptions
110
+
111
+ The environment provides three tasks with escalating cognitive difficulty:
112
+
113
+ ### Task Easy: Ambiguity Clarification (Difficulty: `easy`)
+ - **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
+ - **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition.
+ - **Expected Action**: `propose_clarification`
+ - **Expected Min Score**: 0.70
+ - **Key Grading Criteria**:
+   - Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`)
+   - Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped < 0.30)
+   - Valid `affected_policy_ids` boost score
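
The keyword criteria above can be pictured with a small substring check. This is a sketch under stated assumptions, not the logic in `server/grader.py`: the word lists are the ones quoted in this section, while `clarity_signals` and its return keys are hypothetical.

```python
from typing import Any, Dict

# Word lists quoted from the grading criteria above.
VAGUE_WORDS = ("generally", "sometimes", "maybe")
MEASURABLE_KEYWORDS = ("threshold", "verify", "%", "within")


def clarity_signals(definition: str) -> Dict[str, Any]:
    """Report which vague and measurable terms appear in a proposed definition."""
    text = definition.lower()
    vague = [w for w in VAGUE_WORDS if w in text]
    measurable = [k for k in MEASURABLE_KEYWORDS if k in text]
    return {
        "vague_hits": vague,
        "measurable_hits": measurable,
        # Documented rule: vague wording triggers the hard penalty (score capped < 0.30).
        "hard_penalty": bool(vague),
    }


weak = clarity_signals("Posts are generally removed if offensive.")
strong = clarity_signals(
    "Remove posts with explicit slurs within 24 hours once 2 moderators verify the report."
)
print(weak)
print(strong)
```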
122
+
123
+ ### Task Medium: Gap Detection & New Rule (Difficulty: `medium`)
+ - **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
+ - **Objective**: Detect the missing policy domain and draft a new rule to fill the gap.
+ - **Expected Action**: `propose_new_rule`
+ - **Expected Min Score**: 0.55
+ - **Key Grading Criteria**:
+   - Must target the correct `rule_domain` (e.g., `"AI_use"`)
+   - Empty `scope` array severely penalized
+   - `integration_points` linking to existing policy IDs boost score
+   - Rule text must be substantive (short rules penalized)
133
+
134
+ ### Task Hard: Holistic Policy Evolution (Difficulty: `hard`)
+ - **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
+ - **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
+ - **Expected Action**: `evolve_policy`
+ - **Expected Min Score**: 0.40
+ - **Key Grading Criteria**:
+   - **Hallucination Guard**: Setting all metrics at 0.95+ triggers the "Unrealistic Tradeoff" penalty (score capped < 0.15)
+   - **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur a -0.30 penalty
+   - **Realistic Tradeoffs**: `expected_outcomes` must show mathematical variance (improving fraud detection should decrease revenue velocity)
+   - **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
+   - Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue`
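
To make the alias handling and the Hallucination Guard concrete, here is a minimal sketch. The alias groups and the 0.95 ceiling come from the criteria above; treating the first name in each group as canonical, and the helper names `canonicalize` and `is_unrealistic`, are assumptions made for illustration rather than the grader's actual behavior.

```python
from typing import Dict

# Alias groups from the list above. ASSUMPTION: the first name in each group
# is treated as the canonical key; the grader may normalize differently.
METRIC_ALIASES = {
    "fraud_rate": ("fraud_rate", "fraud", "fraud_detection"),
    "revenue_velocity": ("revenue_velocity", "queue_overload", "revenue"),
}


def canonicalize(expected_outcomes: Dict[str, float]) -> Dict[str, float]:
    """Map alias keys onto one canonical name per metric; unknown keys pass through."""
    lookup = {
        alias: canon
        for canon, aliases in METRIC_ALIASES.items()
        for alias in aliases
    }
    return {lookup.get(key, key): value for key, value in expected_outcomes.items()}


def is_unrealistic(expected_outcomes: Dict[str, float], ceiling: float = 0.95) -> bool:
    """True when every predicted metric sits at or above the ceiling (no tradeoff shown)."""
    values = list(expected_outcomes.values())
    return bool(values) and all(v >= ceiling for v in values)


print(canonicalize({"fraud": 0.90, "revenue": 0.70}))
print(is_unrealistic({"fraud_detection": 0.97, "revenue": 0.96}))
print(is_unrealistic({"fraud": 0.90, "revenue": 0.70}))
```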
145
+
146
+ ### Global Grading Mechanics (All Tasks)
147
+
148
+ | Mechanic | Effect |
+ |:---------|:-------|
+ | **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` → +0.10 to +0.20 |
+ | **Step-Delta Bonus** | Significant improvement over previous best → +0.02 to +0.05 |
+ | **Anti-Repetition Penalty** | Exact repeated action → -0.30 |
+ | **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` → score zeroed |
+ | **Semantic Density Guard** | Word-stuffing with >200 words and low content density → score zeroed |
+ | **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) → up to -0.75 |
+ | **Segmented Prioritization** | Core fix in first 25% of response → bonus; buried at bottom → penalty |
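
Two of these mechanics, the Prompt Injection Guard and the Anti-Repetition Penalty, are simple enough to sketch. The marker strings and the -0.30 value are taken from the table; everything else (the `apply_guards` helper, comparing actions by their serialized JSON, and flooring the result at 0.0) is an assumption for illustration and may differ from `server/grader.py`.

```python
import json
from typing import Any, Dict, List

INJECTION_MARKERS = ("ignore previous", "system_prompt", "override")
REPETITION_PENALTY = 0.30


def fingerprint(action: Dict[str, Any]) -> str:
    """Serialize an action deterministically so exact repeats compare equal."""
    return json.dumps(action, sort_keys=True)


def apply_guards(
    base_score: float,
    action: Dict[str, Any],
    previous_actions: List[Dict[str, Any]],
) -> float:
    """Zero the score on injection markers, then subtract the repetition penalty."""
    text = json.dumps(action).lower()
    if any(marker in text for marker in INJECTION_MARKERS):
        return 0.0
    score = base_score
    if fingerprint(action) in {fingerprint(p) for p in previous_actions}:
        score -= REPETITION_PENALTY
    # ASSUMPTION: the result is floored at 0.0; the real clamp is not documented here.
    return max(score, 0.0)


action = {"action_type": "propose_new_rule", "rule_domain": "AI_use"}
print(apply_guards(0.8, action, []))
print(apply_guards(0.8, action, [action]))
```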
157
 
158
+ ---
 
 
 
 
 
159
 
160
+ ## 5. Setup and Usage
161
 
162
+ ### Local Installation
163
 
 
164
  ```bash
+ git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
+ cd PolicyEvolverEnv
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r server/requirements.txt
  ```
171
 
172
+ ### Run the Environment Server
173
+
174
  ```bash
  uvicorn server.app:app --port 8000
  ```
 
177
 
178
+ This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET).
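
For a quick manual check against a local server, a request can be assembled with the standard library as sketched below. The host and port match the `uvicorn` command above and the `/reset` path and POST method come from the endpoint list; the JSON body shape (`{"task_id": ...}`) is an assumption, since the request schema is not documented in this README, and `build_post` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_post(path: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one of the endpoints."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# ASSUMPTION: this body shape is a guess, not the documented /reset schema.
reset_request = build_post("/reset", {"task_id": "task_easy"})
print(reset_request.get_method(), reset_request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(reset_request) as resp:
#     observation = json.load(resp)
```

The `client.py` module listed in the project structure is the intended way to talk to the server; this snippet is only for poking at the raw HTTP interface.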
179
+
180
+ ### Run with Docker
181
 
 
182
  ```bash
+ docker build -t policy-evolver .
+ docker run -p 8000:8000 policy-evolver
  ```
186
 
187
+ ### Run the Inference Agent
188
+
189
+ The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format.
190
+
191
  ```bash
+ export API_BASE_URL="https://api.groq.com/openai/v1"
+ export MODEL_NAME="llama-3.1-8b-instant"
+ export HF_TOKEN="your_groq_api_key"
+
  python3 inference.py
  ```
198
+
199
+ To run a specific task: `python3 inference.py task_easy`
200
+
201
+ ### Required Environment Variables
202
+
203
+ | Variable | Description | Example |
+ |:---------|:------------|:--------|
+ | `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` |
+ | `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` |
+ | `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` |
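
A script that depends on these variables can fail fast when one is missing. The sketch below shows one way to do that; the variable names are the three from the table, while `load_config` is a hypothetical helper and is not claimed to match what `inference.py` does.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME")


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required variables, raising a clear error if any are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Passing a plain dict keeps the example independent of the real environment.
config = load_config({
    "HF_TOKEN": "gsk_...",
    "API_BASE_URL": "https://api.groq.com/openai/v1",
    "MODEL_NAME": "llama-3.1-8b-instant",
})
print(config["MODEL_NAME"])
```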
208
+
209
+ ### Run Tests
210
+
211
+ ```bash
+ PYTHONPATH=. python tests/test_smoke_exploits.py # 27 smoke & exploit checks
+ PYTHONPATH=. python tests/test_icl.py # ICL verification (3 tasks)
+ PYTHONPATH=. python tests/test_multi_episode.py # Multi-episode progression
+ PYTHONPATH=. python server/grader.py # 8-phase grader test suite
+ ```
217
 
218
  ---
219
 
220
+ ## 6. Baseline Performance Scores
221
 
222
+ The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.
223
 
224
+ ### Single-Step Convergence (Best Case)
225
 
226
+ | Task | Score | Converged | Expected Min |
+ |:-----|:------|:----------|:-------------|
+ | `task_easy` | 0.94 | ✓ Step 1 | 0.70 |
+ | `task_medium` | 0.999 | ✓ Step 1 | 0.55 |
+ | `task_hard` | 0.90 | ✓ Step 1 | 0.40 |
231
 
232
+ ### Multi-Step ICL Progression (Naive → Optimized)
 
 
233
 
234
+ | Task | Naive (Step 0) | Optimized (Step 1) | Improvement |
+ |:-----|:---------------|:-------------------|:------------|
+ | `task_easy` | 0.400 | 0.999 | +0.600 |
+ | `task_medium` | 0.001 | 0.999 | +0.998 |
+ | `task_hard` | 0.088 | 0.999 | +0.912 |
239
 
240
+ **Average ICL Improvement: +0.837**
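
The average is the plain mean of the Improvement column, which can be checked with a few lines of Python. The values are copied from the table above; the variable names are only for this example.

```python
# Improvement column from the table above.
improvements = {"task_easy": 0.600, "task_medium": 0.998, "task_hard": 0.912}

average = sum(improvements.values()) / len(improvements)
print(f"{average:+.3f}")  # prints +0.837
```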
241
 
242
+ ### Configuration
 
 
 
243
 
244
+ | Setting | Value |
+ |:--------|:------|
+ | **Model** | `llama-3.1-8b-instant` (via Groq) |
+ | **Temperature** | `0.0` |
+ | **Seed** | `42` |
+ | **Determinism** | 5 identical runs → identical scores ✓ |
+ | **Fine-tuning** | None required |
251
+
252
+ ---
253
+
254
+ ## Project Structure
255
+
256
+ ```
+ policy_evolver_env/
+ ├── inference.py                  # Hackathon entry point ([START]/[STEP]/[END] format)
+ ├── client.py                     # EnvClient for HTTP interaction
+ ├── models.py                     # Pydantic models (Action, Observation, State)
+ ├── openenv.yaml                  # OpenEnv specification
+ ├── Dockerfile                    # Docker deployment with HEALTHCHECK
+ ├── server/
+ │   ├── app.py                    # FastAPI + Gradio dashboard
+ │   ├── environment.py            # Environment logic (reset, step, state)
+ │   ├── grader.py                 # Deterministic grading engine (8-phase test suite)
+ │   ├── requirements.txt          # Dependencies
+ │   └── tasks/                    # Task definitions (easy, medium, hard)
+ ├── tests/
+ │   ├── test_smoke_exploits.py    # 27 smoke & exploit checks
+ │   ├── test_icl.py               # ICL loop verification
+ │   └── test_multi_episode.py     # Multi-episode progression
+ └── STRATEGIC_LEARNING.md         # RLVR architecture documentation
+ ```