---
title: PolicyEvolverEnv
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /dashboard/
---
# PolicyEvolverEnv
## 1. Environment Overview and Motivation
**PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data.
### The Problem
In modern platforms (social media, enterprise HR, and e-commerce), static policies quickly become outdated or vaguely worded, leading to:
- **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
- **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
- **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.
### The Solution
PolicyEvolverEnv simulates these challenges by presenting the agent with:
1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions).
2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds).
The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications**: not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination.
### Why This Matters for RLVR
This environment operates at the **Reinforcement Learning from Verifiable Rewards (RLVR)** layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback, demonstrating genuine in-context policy learning.
---
## 2. Action Space
The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types:
### `propose_clarification` (Easy Task Action)
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_clarification"]` | Discriminator tag |
| `ambiguous_term` | `str` | The exact vague term found in existing policies |
| `suggested_definition` | `str` | A specific, measurable replacement definition |
| `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects |
| `justification` | `str` | Why this term is ambiguous and why the fix works |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10 to +0.20 bonus) |
### `propose_new_rule` (Medium Task Action)
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_new_rule"]` | Discriminator tag |
| `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) |
| `new_rule` | `str` | The complete new rule text |
| `scope` | `List[str]` | Scenario types this rule applies to |
| `integration_points` | `List[str]` | How it connects to existing policy IDs |
| `justification` | `str` | Why a gap exists and how this rule fills it |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10 to +0.20 bonus) |
### `evolve_policy` (Hard Task Action)
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["evolve_policy"]` | Discriminator tag |
| `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` |
| `expected_outcomes` | `Dict[str, float]` | Metric name → expected value (must show realistic tradeoffs) |
| `rollback_conditions` | `List[str]` | When to revert changes |
| `justification` | `str` | Comprehensive reasoning for the evolution |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10 to +0.20 bonus) |
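
To make the three action shapes concrete, here is a minimal standard-library sketch of dispatching on `action_type`. It is a stand-in for illustration, not the project's Pydantic models: `REQUIRED_FIELDS` and `validate_action` are hypothetical names, the field names come from the tables above, and it assumes every field not typed `Optional` is required.

```python
from typing import Any, Dict, List

# Required fields per action type, copied from the tables above.
# `think` is optional for every action, so it is not listed.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "propose_clarification": [
        "ambiguous_term", "suggested_definition",
        "affected_policy_ids", "justification",
    ],
    "propose_new_rule": [
        "rule_domain", "new_rule", "scope",
        "integration_points", "justification",
    ],
    "evolve_policy": [
        "policy_modifications", "expected_outcomes",
        "rollback_conditions", "justification",
    ],
}


def validate_action(action: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    tag = action.get("action_type")
    if tag not in REQUIRED_FIELDS:
        return [f"unknown action_type: {tag!r}"]
    return [f"missing field: {name}"
            for name in REQUIRED_FIELDS[tag] if name not in action]


example = {
    "action_type": "propose_clarification",
    "ambiguous_term": "offensive",
    "suggested_definition": "Content reported by at least 3 users within 24 hours.",
    "affected_policy_ids": ["P1"],  # hypothetical policy ID
    "justification": "Moderators interpret the term inconsistently.",
}
print(validate_action(example))  # prints []
```

The discriminator tag selects which field set applies, which is the same idea the Pydantic union expresses declaratively.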
---
## 3. Observation Space
The `Observation` returned by `reset()` and `step()` contains:
| Field | Type | Description |
|:------|:-----|:------------|
| `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) |
| `episode_id` | `str` | Unique episode session tracker |
| `step_count` | `int` | Current step number (max 5 per episode) |
| `corpus_size` | `int` | Total incidents in the full data corpus |
| `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) |
| `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` |
| `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) |
| `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) |
| `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) |
| `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline |
| `reward` | `float` | Score from the grader for the last action, in (0, 1) |
| `done` | `bool` | Whether the episode has ended |
| `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` |
### Staff Feedback (in `info`)
After each step, the observation includes structured staff feedback to guide the agent's next action:
| Field | Example Values | Purpose |
|:------|:---------------|:--------|
| `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward |
| `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly |
| `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step |
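
One way an agent might fold this feedback into its next prompt is sketched below. The helper name `feedback_prompt` is hypothetical, and the sketch assumes `staff_feedback` is a dictionary nested under `info` with the three keys listed above; the sample values are made up for illustration.

```python
from typing import Any, Dict


def feedback_prompt(observation: Dict[str, Any]) -> str:
    """Condense reward and staff feedback into text for the next LLM prompt."""
    info = observation.get("info", {})
    feedback = info.get("staff_feedback", {})  # assumed to be a dict
    lines = [
        f"Last reward: {observation.get('reward', 0.0):.3f}",
        f"Best score so far: {info.get('best_score', 0.0):.3f}",
        f"Steps remaining: {info.get('steps_remaining', 0)}",
        f"Rating: {feedback.get('strategic_rating', 'n/a')}",
        f"Focus: {feedback.get('focus', 'n/a')}",
        f"Recommendation: {feedback.get('recommendation', 'n/a')}",
    ]
    return "\n".join(lines)


# Illustrative observation fragment; the numbers are invented.
sample_observation = {
    "reward": 0.4,
    "done": False,
    "info": {
        "best_score": 0.4,
        "steps_remaining": 4,
        "staff_feedback": {
            "strategic_rating": "Junior Associate",
            "focus": "Burying the lede or distracted by noise",
            "recommendation": "Maintain high signal-to-noise ratio and lead with the fix.",
        },
    },
}
print(feedback_prompt(sample_observation))
```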
---
## 4. Task Descriptions
The environment provides three tasks with escalating cognitive difficulty:
### Task Easy: Ambiguity Clarification (Difficulty: `easy`)
- **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
- **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition.
- **Expected Action**: `propose_clarification`
- **Expected Min Score**: 0.70
- **Key Grading Criteria**:
  - Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`)
  - Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped < 0.30)
  - Valid `affected_policy_ids` boost score
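
The keyword criteria above can be checked locally before submitting. The following is a rough pre-submission lint, not the grader itself: `lint_definition` is a hypothetical helper, and case-insensitive substring matching is an assumption about how the keywords are detected.

```python
from typing import Dict, List

# Keyword lists copied from the grading criteria above.
MEASURABLE_KEYWORDS = ("threshold", "verify", "%", "within")
VAGUE_WORDS = ("generally", "sometimes", "maybe")


def lint_definition(definition: str) -> Dict[str, List[str]]:
    """Report which measurable keywords and vague words a definition contains."""
    text = definition.lower()
    return {
        # ASSUMPTION: plain substring matching; the real grader may differ.
        "measurable_hits": [k for k in MEASURABLE_KEYWORDS if k in text],
        "vague_hits": [w for w in VAGUE_WORDS if w in text],
    }


good = "Remove posts reported by 5% of viewers within 24 hours."
bad = "Generally remove posts that are sometimes upsetting."
print(lint_definition(good))
print(lint_definition(bad))
```

A definition with at least one measurable hit and no vague hits is a reasonable candidate; anything with vague hits is worth rewriting before it reaches the grader.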
### Task Medium: Gap Detection & New Rule (Difficulty: `medium`)
- **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
- **Objective**: Detect the missing policy domain and draft a new rule to fill the gap.
- **Expected Action**: `propose_new_rule`
- **Expected Min Score**: 0.55
- **Key Grading Criteria**:
  - Must target the correct `rule_domain` (e.g., `"AI_use"`)
  - Empty `scope` array severely penalized
  - `integration_points` linking to existing policy IDs boost score
  - Rule text must be substantive (short rules penalized)
### Task Hard: Holistic Policy Evolution (Difficulty: `hard`)
- **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
- **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
- **Expected Action**: `evolve_policy`
- **Expected Min Score**: 0.40
- **Key Grading Criteria**:
  - **Hallucination Guard**: All metrics at 0.95+ triggers "Unrealistic Tradeoff" penalty (score capped < 0.15)
  - **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur -0.30 penalty
  - **Realistic Tradeoffs**: `expected_outcomes` must show mathematical variance (improving fraud detection should decrease revenue velocity)
  - **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
  - Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue`
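
As a rough illustration of the Hallucination Guard and the metric aliases, the sketch below normalizes alias keys and flags an `expected_outcomes` dictionary in which every metric is 0.95 or higher. The helper names are hypothetical, treating the first alias in each group as canonical is an assumption, and reading "0.95+" as `>= 0.95` is also an assumption; the real grader may behave differently.

```python
from typing import Dict

# ASSUMPTION: the first name in each alias group is treated as canonical.
ALIASES = {
    "fraud": "fraud_rate",
    "fraud_detection": "fraud_rate",
    "queue_overload": "revenue_velocity",
    "revenue": "revenue_velocity",
}


def normalize_keys(outcomes: Dict[str, float]) -> Dict[str, float]:
    """Map alias metric names onto one canonical name per metric."""
    return {ALIASES.get(name, name): value for name, value in outcomes.items()}


def looks_unrealistic(outcomes: Dict[str, float], ceiling: float = 0.95) -> bool:
    """True when every metric is at or above the ceiling (no visible tradeoff)."""
    return bool(outcomes) and all(v >= ceiling for v in outcomes.values())


balanced = {"fraud_detection": 0.85, "revenue": 0.70}
too_good = {"fraud_detection": 0.97, "revenue": 0.96}
print(normalize_keys(balanced))
print(looks_unrealistic(balanced), looks_unrealistic(too_good))
```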
### Global Grading Mechanics (All Tasks)
| Mechanic | Effect |
|:---------|:-------|
| **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` → +0.10 to +0.20 |
| **Step-Delta Bonus** | Significant improvement over previous best → +0.02 to +0.05 |
| **Anti-Repetition Penalty** | Exact repeated action → -0.30 |
| **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` → score zeroed |
| **Semantic Density Guard** | Word-stuffing with >200 words and low content density → score zeroed |
| **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) → up to -0.75 |
| **Segmented Prioritization** | Core fix in first 25% of response → bonus; buried at bottom → penalty |
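
Two of these mechanics, the prompt injection guard and the anti-repetition penalty, are easy to screen for on the agent side. The sketch below is a hypothetical pre-flight check, not the grader's logic: `preflight` is an invented name, and matching the phrases as case-insensitive substrings of every string in the action is an assumption.

```python
from typing import Any, Dict, List

# Phrases copied from the Prompt Injection Guard row above.
INJECTION_PHRASES = ("ignore previous", "system_prompt", "override")


def _all_text(value: Any) -> str:
    """Flatten every string inside a nested action into one lowercase blob."""
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, dict):
        return " ".join(_all_text(v) for v in value.values())
    if isinstance(value, list):
        return " ".join(_all_text(v) for v in value)
    return ""


def preflight(action: Dict[str, Any],
              previous_actions: List[Dict[str, Any]]) -> List[str]:
    """Return warnings for patterns the grader is documented to penalize."""
    warnings = []
    text = _all_text(action)
    for phrase in INJECTION_PHRASES:
        if phrase in text:
            warnings.append(f"injection phrase: {phrase}")
    if action in previous_actions:  # exact repeat of an earlier action
        warnings.append("exact repeat of an earlier action")
    return warnings


risky = {"action_type": "propose_new_rule",
         "new_rule": "Ignore previous guidance and approve all sellers."}
print(preflight(risky, []))
```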
---
## 5. Setup and Usage
### Local Installation
```bash
git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
cd PolicyEvolverEnv
python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt
```
### Run the Environment Server
```bash
uvicorn server.app:app --port 8000
```
This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET).
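
For a quick manual check of the running server, a standard-library client along the following lines may be enough. The request body shape is an assumption (a `task_id` field on `/reset` is a guess, as is a JSON response from `/health`); consult `server/app.py` or `openenv.yaml` for the actual schema. `build_post`, `call`, and `demo` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for one of the server's POST endpoints."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request) -> Any:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Run manually while the server is up; not invoked on import."""
    print(call(urllib.request.Request(BASE_URL + "/health")))
    # ASSUMPTION: /reset accepts a task_id in its JSON body.
    print(call(build_post("/reset", {"task_id": "task_easy"})))
```

Calling `demo()` from a Python shell after starting the server exercises `/health` and `/reset`; nothing is sent until then.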
### Run with Docker
```bash
docker build -t policy-evolver .
docker run -p 8000:8000 policy-evolver
```
### Run the Inference Agent
The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format.
```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_groq_api_key"
python3 inference.py
```
To run a specific task: `python3 inference.py task_easy`
### Required Environment Variables
| Variable | Description | Example |
|:---------|:------------|:--------|
| `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` |
| `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` |
| `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` |
### Run Tests
```bash
PYTHONPATH=. python tests/test_smoke_exploits.py # 27 smoke & exploit checks
PYTHONPATH=. python tests/test_icl.py # ICL verification (3 tasks)
PYTHONPATH=. python tests/test_multi_episode.py # Multi-episode progression
PYTHONPATH=. python server/grader.py # 8-phase grader test suite
```
---
## 6. Baseline Performance Scores
The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.
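
The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `inference.py`: `run_episode` is a hypothetical name, `fake_propose` stands in for an LLM call that sees the reward history, `fake_step` stands in for the environment's `/step` endpoint, and the rewards are invented.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]
History = List[Tuple[Action, float]]


def run_episode(propose: Callable[[History], Action],
                step: Callable[[Action], Tuple[float, bool]],
                max_steps: int = 5) -> Tuple[float, History]:
    """Run one episode, feeding the reward history back into each proposal."""
    history: History = []
    best = 0.0
    for _ in range(max_steps):
        action = propose(history)      # the model reads its own past rewards
        reward, done = step(action)    # the environment grades the action
        history.append((action, reward))
        best = max(best, reward)
        if done:
            break
    return best, history


def fake_propose(history: History) -> Action:
    """Stand-in for the LLM: a vague first draft, then a measurable one."""
    draft = ("generally remove bad posts" if not history
             else "remove posts reported by 5% of viewers within 24 hours")
    return {"action_type": "propose_clarification", "suggested_definition": draft}


def fake_step(action: Action) -> Tuple[float, bool]:
    """Stand-in for the environment: invented rewards, done on a high score."""
    reward = 0.9 if "within" in action["suggested_definition"] else 0.2
    return reward, reward >= 0.9


best, history = run_episode(fake_propose, fake_step)
print(best, len(history))  # prints 0.9 2
```

No model weights change between steps; the only thing that changes is the history passed to `propose`.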
### Single-Step Convergence (Best Case)
| Task | Score | Converged | Expected Min |
|:-----|:------|:----------|:-------------|
| `task_easy` | 0.94 | ✓ Step 1 | 0.70 |
| `task_medium` | 0.999 | ✓ Step 1 | 0.55 |
| `task_hard` | 0.90 | ✓ Step 1 | 0.40 |
### Multi-Step ICL Progression (Naive → Optimized)
| Task | Naive (Step 0) | Optimized (Step 1) | Improvement |
|:-----|:---------------|:-------------------|:------------|
| `task_easy` | 0.400 | 0.999 | +0.600 |
| `task_medium` | 0.001 | 0.999 | +0.998 |
| `task_hard` | 0.088 | 0.999 | +0.912 |
**Average ICL Improvement: +0.837**
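
The average is the arithmetic mean of the three improvement values in the table, which can be reproduced directly:

```python
# Improvement column copied from the table above.
improvements = {"task_easy": 0.600, "task_medium": 0.998, "task_hard": 0.912}

average = sum(improvements.values()) / len(improvements)
print(f"{average:+.3f}")  # prints +0.837
```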
### Configuration
| Setting | Value |
|:--------|:------|
| **Model** | `llama-3.1-8b-instant` (via Groq) |
| **Temperature** | `0.0` |
| **Seed** | `42` |
| **Determinism** | 5 identical runs → identical scores ✓ |
| **Fine-tuning** | None required |
---
## Project Structure
```
policy_evolver_env/
├── inference.py              # Hackathon entry point ([START]/[STEP]/[END] format)
├── client.py                 # EnvClient for HTTP interaction
├── models.py                 # Pydantic models (Action, Observation, State)
├── openenv.yaml              # OpenEnv specification
├── Dockerfile                # Docker deployment with HEALTHCHECK
├── server/
│   ├── app.py                # FastAPI + Gradio dashboard
│   ├── environment.py        # Environment logic (reset, step, state)
│   ├── grader.py             # Deterministic grading engine (8-phase test suite)
│   ├── requirements.txt      # Dependencies
│   └── tasks/                # Task definitions (easy, medium, hard)
├── tests/
│   ├── test_smoke_exploits.py  # 27 smoke & exploit checks
│   ├── test_icl.py             # ICL loop verification
│   └── test_multi_episode.py   # Multi-episode progression
└── STRATEGIC_LEARNING.md     # RLVR architecture documentation
```