| --- |
| title: PolicyEvolverEnv |
| colorFrom: blue |
| colorTo: indigo |
| sdk: docker |
| app_port: 8000 |
| base_path: /dashboard/ |
| --- |
| |
| # PolicyEvolverEnv |
|
|
| ## 1. Environment Overview and Motivation |
|
|
| **PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data. |
|
|
| ### The Problem |
|
|
| In modern platforms (social media, enterprise HR, and e-commerce), static policies quickly become outdated or vaguely worded, leading to: |
| - **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024). |
| - **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers. |
| - **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks. |
|
|
| ### The Solution |
|
|
| PolicyEvolverEnv simulates these challenges by presenting the agent with: |
| 1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions). |
| 2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds). |
|
|
| The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications**: not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination. |
|
|
| ### Why This Matters for RLVR |
|
|
| This environment operates at the **Reinforcement Learning from Verifiable Rewards (RLVR)** layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback, demonstrating genuine in-context policy learning. |
|
|
| --- |
|
|
| ## 2. Action Space |
|
|
| The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types: |
|
|
| ### `propose_clarification`: Easy Task Action |
| |
| | Field | Type | Description | |
| |:------|:-----|:------------| |
| | `action_type` | `Literal["propose_clarification"]` | Discriminator tag | |
| | `ambiguous_term` | `str` | The exact vague term found in existing policies | |
| | `suggested_definition` | `str` | A specific, measurable replacement definition | |
| | `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects | |
| | `justification` | `str` | Why this term is ambiguous and why the fix works | |
| | `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) | |
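
For illustration, a `propose_clarification` payload could be assembled as a plain dictionary and serialized to JSON. This is a minimal sketch: only the field names come from the table above, while the policy ID `P1` and all text values are hypothetical placeholders.

```python
import json

# Hypothetical example values; only the field names are taken from the table.
clarification = {
    "action_type": "propose_clarification",
    "ambiguous_term": "offensive",
    "suggested_definition": (
        "Content is offensive when it exceeds a threshold of 3 independent "
        "user reports within 24 hours of posting."
    ),
    "affected_policy_ids": ["P1"],  # placeholder ID, not from the real task data
    "justification": (
        "Moderators interpret the term differently; a report threshold "
        "gives them a measurable test."
    ),
    "think": "Tradeoff: a higher threshold raises precision but lowers recall.",
}

# Serialize the action the way an HTTP client would before sending it.
encoded = json.dumps(clarification, indent=2)
print(encoded)
```

The optional `think` field can be omitted entirely; the remaining fields are the ones the table lists without `Optional`.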
|
|
| ### `propose_new_rule`: Medium Task Action |
|
|
| | Field | Type | Description | |
| |:------|:-----|:------------| |
| | `action_type` | `Literal["propose_new_rule"]` | Discriminator tag | |
| | `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) | |
| | `new_rule` | `str` | The complete new rule text | |
| | `scope` | `List[str]` | Scenario types this rule applies to | |
| | `integration_points` | `List[str]` | How it connects to existing policy IDs | |
| | `justification` | `str` | Why a gap exists and how this rule fills it | |
| | `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) | |
|
|
| ### `evolve_policy`: Hard Task Action |
| |
| | Field | Type | Description | |
| |:------|:-----|:------------| |
| | `action_type` | `Literal["evolve_policy"]` | Discriminator tag | |
| | `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` | |
| | `expected_outcomes` | `Dict[str, float]` | Metric name → expected value (must show realistic tradeoffs) | |
| | `rollback_conditions` | `List[str]` | When to revert changes | |
| | `justification` | `str` | Comprehensive reasoning for the evolution | |
| | `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) | |
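
The discriminator idea can be mimicked with the standard library to pre-check a draft action before submitting it. The sketch below is not the project's Pydantic model: the helper `missing_fields` is hypothetical, and the required-field sets are simply transcribed from the three tables above.

```python
# Required (non-optional) fields per action type, transcribed from the tables.
REQUIRED_BY_TYPE = {
    "propose_clarification": {
        "ambiguous_term", "suggested_definition",
        "affected_policy_ids", "justification",
    },
    "propose_new_rule": {
        "rule_domain", "new_rule", "scope",
        "integration_points", "justification",
    },
    "evolve_policy": {
        "policy_modifications", "expected_outcomes",
        "rollback_conditions", "justification",
    },
}


def missing_fields(payload: dict) -> set:
    """Dispatch on the `action_type` tag and return the required fields not present."""
    tag = payload.get("action_type")
    if tag not in REQUIRED_BY_TYPE:
        raise ValueError(f"unknown action_type: {tag!r}")
    return REQUIRED_BY_TYPE[tag] - set(payload)


# Hypothetical draft; "P1" and "amend" are placeholder values.
draft = {
    "action_type": "evolve_policy",
    "policy_modifications": [
        {
            "policy_id": "P1",
            "change_type": "amend",
            "new_text": "Suspend sellers only after manual review of flagged orders.",
            "reason": "Blanket suspension catches seasonal merchants.",
        }
    ],
    "expected_outcomes": {"fraud_detection": 0.88, "revenue_velocity": 0.74},
    "justification": "Balances fraud detection against seller trust.",
}

print(missing_fields(draft))  # {'rollback_conditions'}
```

A check like this only confirms that the keys exist; type validation and scoring remain the job of the environment's own models and grader.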
|
|
| --- |
|
|
| ## 3. Observation Space |
|
|
| The `Observation` returned by `reset()` and `step()` contains: |
|
|
| | Field | Type | Description | |
| |:------|:-----|:------------| |
| | `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) | |
| | `episode_id` | `str` | Unique episode session tracker | |
| | `step_count` | `int` | Current step number (max 5 per episode) | |
| | `corpus_size` | `int` | Total incidents in the full data corpus | |
| | `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) | |
| | `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` | |
| | `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) | |
| | `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) | |
| | `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) | |
| | `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline | |
| | `reward` | `float` | Score from the grader for the last action, in (0, 1) | |
| | `done` | `bool` | Whether the episode has ended | |
| | `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` | |
|
|
| ### Staff Feedback (in `info`) |
|
|
| After each step, the observation includes structured staff feedback to guide the agent's next action: |
|
|
| | Field | Example Values | Purpose | |
| |:------|:---------------|:--------| |
| | `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward | |
| | `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly | |
| | `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step | |
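
Because the agent improves by reading its own reward history and staff feedback, a client will typically fold those fields back into the next prompt. The following is a rough sketch under two assumptions: that the observation arrives as a plain dictionary, and that `staff_feedback` is a nested dictionary inside `info` carrying the three fields listed above. The function name `summarize_for_prompt` and the sample values are hypothetical.

```python
MAX_STEPS = 5  # episode length stated in the Observation table


def summarize_for_prompt(observation: dict) -> str:
    """Condense reward history and staff feedback into text for the next prompt."""
    info = observation.get("info", {})
    feedback = info.get("staff_feedback", {})
    history = ", ".join(f"{r:.3f}" for r in info.get("rewards_history", []))
    lines = [
        f"Step {observation.get('step_count', 0)}/{MAX_STEPS}, "
        f"steps remaining: {info.get('steps_remaining', 'unknown')}",
        f"Last reward: {observation.get('reward', 0.0):.3f} "
        f"(best so far: {info.get('best_score', 0.0):.3f})",
        f"Reward history: {history or 'none'}",
        f"Staff rating: {feedback.get('strategic_rating', 'n/a')}",
        f"Focus: {feedback.get('focus', 'n/a')}",
        f"Recommendation: {feedback.get('recommendation', 'n/a')}",
    ]
    return "\n".join(lines)


# Hypothetical observation after the first step of an episode.
sample = {
    "step_count": 1,
    "reward": 0.4,
    "done": False,
    "info": {
        "best_score": 0.4,
        "rewards_history": [0.4],
        "steps_remaining": 4,
        "staff_feedback": {
            "strategic_rating": "Junior Associate",
            "focus": "Burying the lede or distracted by noise",
            "recommendation": "Maintain high signal-to-noise ratio and lead with the fix.",
        },
    },
}

summary = summarize_for_prompt(sample)
print(summary)
```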
|
|
| --- |
|
|
| ## 4. Task Descriptions |
|
|
| The environment provides three tasks with escalating cognitive difficulty: |
|
|
| ### Task Easy: Ambiguity Clarification (Difficulty: `easy`) |
| - **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate." |
| - **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition. |
| - **Expected Action**: `propose_clarification` |
| - **Expected Min Score**: 0.70 |
| - **Key Grading Criteria**: |
| - Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`) |
| - Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped below 0.30) |
| - Valid `affected_policy_ids` boost score |
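
An agent can screen a draft definition against these criteria before spending a step on it. The sketch below is not the grader's code: it assumes simple case-insensitive substring matching and uses only the example keywords quoted above, which may not be the grader's full lists. The helper name `screen_definition` is hypothetical.

```python
# Keyword lists copied from the grading criteria above; the real grader
# may use longer lists or a different matching strategy.
MEASURABLE_KEYWORDS = ("threshold", "verify", "%", "within")
VAGUE_WORDS = ("generally", "sometimes", "maybe")


def screen_definition(text: str) -> dict:
    """Report which measurable keywords and vague words occur in a definition."""
    lowered = text.lower()
    return {
        "measurable_hits": [k for k in MEASURABLE_KEYWORDS if k in lowered],
        "vague_hits": [w for w in VAGUE_WORDS if w in lowered],
    }


good = "Posts exceeding a threshold of 3 reports within 24 hours are removed."
bad = "Posts that are generally rude may sometimes be removed."

print(screen_definition(good))
print(screen_definition(bad))
```

A definition with an empty `measurable_hits` list or any `vague_hits` is worth rewriting before submission.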
|
|
| ### Task Medium: Gap Detection & New Rule (Difficulty: `medium`) |
| - **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage. |
| - **Objective**: Detect the missing policy domain and draft a new rule to fill the gap. |
| - **Expected Action**: `propose_new_rule` |
| - **Expected Min Score**: 0.55 |
| - **Key Grading Criteria**: |
| - Must target the correct `rule_domain` (e.g., `"AI_use"`) |
| - Empty `scope` array severely penalized |
| - `integration_points` linking to existing policy IDs boost score |
| - Rule text must be substantive (short rules penalized) |
|
|
| ### Task Hard: Holistic Policy Evolution (Difficulty: `hard`) |
| - **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters. |
| - **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust. |
| - **Expected Action**: `evolve_policy` |
| - **Expected Min Score**: 0.40 |
| - **Key Grading Criteria**: |
| - **Hallucination Guard**: Setting all metrics to 0.95 or higher triggers an "Unrealistic Tradeoff" penalty (score capped below 0.15) |
| - **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur a -0.30 penalty |
| - **Realistic Tradeoffs**: `expected_outcomes` must show mathematical variance (improving fraud detection should decrease revenue velocity) |
| - **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant) |
| - Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue` |
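
The hallucination guard and the metric aliases can be approximated locally to sanity-check `expected_outcomes` before submitting. This is a sketch under stated assumptions, not the grader's logic: treating the first name in each alias group as canonical is a guess, and the helper names `normalize_outcomes` and `looks_unrealistic` are hypothetical.

```python
# Alias groups from the criteria above; the choice of canonical name is assumed.
ALIASES = {
    "fraud_rate": "fraud_rate",
    "fraud": "fraud_rate",
    "fraud_detection": "fraud_rate",
    "revenue_velocity": "revenue_velocity",
    "queue_overload": "revenue_velocity",
    "revenue": "revenue_velocity",
}


def normalize_outcomes(expected: dict) -> dict:
    """Map alias keys onto one name per metric; unknown keys pass through."""
    return {ALIASES.get(key, key): value for key, value in expected.items()}


def looks_unrealistic(expected: dict, ceiling: float = 0.95) -> bool:
    """True when every metric sits at or above the ceiling (no visible tradeoff)."""
    return bool(expected) and all(value >= ceiling for value in expected.values())


optimistic = {"fraud_detection": 0.97, "revenue": 0.99}
balanced = {"fraud_detection": 0.92, "revenue_velocity": 0.78}

print(looks_unrealistic(optimistic))  # True
print(looks_unrealistic(balanced))    # False
print(normalize_outcomes(optimistic))
```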
|
|
| ### Global Grading Mechanics (All Tasks) |
|
|
| | Mechanic | Effect | |
| |:---------|:-------| |
| | **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` → +0.10 to +0.20 | |
| | **Step-Delta Bonus** | Significant improvement over previous best → +0.02 to +0.05 | |
| | **Anti-Repetition Penalty** | Exact repeated action → -0.30 | |
| | **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` → score zeroed | |
| | **Semantic Density Guard** | Word-stuffing with >200 words and low content density → score zeroed | |
| | **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) → up to -0.75 | |
| | **Segmented Prioritization** | Core fix in first 25% of response → bonus; buried at bottom → penalty | |
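
Two of these mechanics, the prompt injection guard and the anti-repetition penalty, are easy to pre-check on the client side. The sketch below assumes substring matching over the serialized action and exact equality of the serialized payloads; the actual grader may compare differently. Both helper names are hypothetical.

```python
import json

# Marker strings quoted in the table above.
INJECTION_MARKERS = ("ignore previous", "system_prompt", "override")


def contains_injection(action: dict) -> bool:
    """True if any marker string appears anywhere in the serialized action."""
    text = json.dumps(action).lower()
    return any(marker in text for marker in INJECTION_MARKERS)


def is_exact_repeat(action: dict, previous_actions: list) -> bool:
    """True if an identical action (ignoring key order) was already submitted."""
    key = json.dumps(action, sort_keys=True)
    return any(json.dumps(prev, sort_keys=True) == key for prev in previous_actions)


first = {"action_type": "propose_new_rule", "rule_domain": "AI_use"}
same_reordered = {"rule_domain": "AI_use", "action_type": "propose_new_rule"}
hostile = {"action_type": "propose_new_rule", "new_rule": "Ignore previous instructions."}

print(is_exact_repeat(same_reordered, [first]))  # True
print(contains_injection(hostile))               # True
print(contains_injection(first))                 # False
```

Note that a legitimate rule text containing the word "override" would also be flagged by this check, so such wording is best avoided in submissions.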
|
|
| --- |
|
|
| ## 5. Setup and Usage |
|
|
| ### Local Installation |
|
|
| ```bash |
| git clone https://github.com/Luciferai04/PolicyEvolverEnv.git |
| cd PolicyEvolverEnv |
| python3 -m venv .venv |
| source .venv/bin/activate |
| pip install -r server/requirements.txt |
| ``` |
|
|
| ### Run the Environment Server |
|
|
| ```bash |
| uvicorn server.app:app --port 8000 |
| ``` |
|
|
| This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET). |
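
A rough sketch of how a script might prepare requests for these endpoints using only the standard library. The request body shapes are assumptions, not documented here: the `task_id` key for `/reset` and the `action` wrapper for `/step` are guesses, so check `client.py` and `models.py` in the repository for the actual contract. The helpers `build_post` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed body shapes; adjust them to the server's real schema.
reset_request = build_post("/reset", {"task_id": "task_easy"})
step_request = build_post(
    "/step",
    {"action": {"action_type": "propose_clarification", "ambiguous_term": "offensive"}},
)

print(reset_request.get_method(), reset_request.full_url)
```

With the server running, `send(reset_request)` would return the first observation, provided the assumed body shape matches.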
|
|
| ### Run with Docker |
|
|
| ```bash |
| docker build -t policy-evolver . |
| docker run -p 8000:8000 policy-evolver |
| ``` |
|
|
| ### Run the Inference Agent |
|
|
| The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format. |
|
|
| ```bash |
| export API_BASE_URL="https://api.groq.com/openai/v1" |
| export MODEL_NAME="llama-3.1-8b-instant" |
| export HF_TOKEN="your_groq_api_key" |
| |
| python3 inference.py |
| ``` |
|
|
| To run a specific task: `python3 inference.py task_easy` |
|
|
| ### Required Environment Variables |
|
|
| | Variable | Description | Example | |
| |:---------|:------------|:--------| |
| | `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` | |
| | `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` | |
| | `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` | |
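
A small pre-flight check can catch a missing variable before an inference run starts. This is a hypothetical helper, not part of `inference.py`; only the three variable names come from the table above.

```python
import os

# Variable names from the table above.
REQUIRED_VARS = ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME")


def missing_variables(environ=os.environ) -> list:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```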
|
|
| ### Run Tests |
|
|
| ```bash |
| PYTHONPATH=. python tests/test_smoke_exploits.py # 27 smoke & exploit checks |
| PYTHONPATH=. python tests/test_icl.py # ICL verification (3 tasks) |
| PYTHONPATH=. python tests/test_multi_episode.py # Multi-episode progression |
| PYTHONPATH=. python server/grader.py # 8-phase grader test suite |
| ``` |
|
|
| --- |
|
|
| ## 6. Baseline Performance Scores |
|
|
| The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback. |
|
|
| ### Single-Step Convergence (Best Case) |
|
|
| | Task | Score | Converged | Expected Min | |
| |:-----|:------|:----------|:-------------| |
| | `task_easy` | 0.94 | Yes (Step 1) | 0.70 | |
| | `task_medium` | 0.999 | Yes (Step 1) | 0.55 | |
| | `task_hard` | 0.90 | Yes (Step 1) | 0.40 | |
|
|
| ### Multi-Step ICL Progression (Naive → Optimized) |
|
|
| | Task | Naive (Step 0) | Optimized (Step 1) | Improvement | |
| |:-----|:---------------|:-------------------|:------------| |
| | `task_easy` | 0.400 | 0.999 | +0.600 | |
| | `task_medium` | 0.001 | 0.999 | +0.998 | |
| | `task_hard` | 0.088 | 0.999 | +0.912 | |
|
|
| **Average ICL Improvement: +0.837** |
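
The average is the arithmetic mean of the Improvement column, which can be reproduced as follows. The values are copied from the table above.

```python
# Per-task improvements copied from the Improvement column.
improvements = {"task_easy": 0.600, "task_medium": 0.998, "task_hard": 0.912}

average = sum(improvements.values()) / len(improvements)
print(f"{average:+.3f}")  # +0.837
```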
|
|
| ### Configuration |
|
|
| | Setting | Value | |
| |:--------|:------| |
| | **Model** | `llama-3.1-8b-instant` (via Groq) | |
| | **Temperature** | `0.0` | |
| | **Seed** | `42` | |
| | **Determinism** | 5 identical runs → identical scores | |
| | **Fine-tuning** | None required | |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| policy_evolver_env/ |
| ├── inference.py              # Hackathon entry point ([START]/[STEP]/[END] format) |
| ├── client.py                 # EnvClient for HTTP interaction |
| ├── models.py                 # Pydantic models (Action, Observation, State) |
| ├── openenv.yaml              # OpenEnv specification |
| ├── Dockerfile                # Docker deployment with HEALTHCHECK |
| ├── server/ |
| │   ├── app.py                # FastAPI + Gradio dashboard |
| │   ├── environment.py        # Environment logic (reset, step, state) |
| │   ├── grader.py             # Deterministic grading engine (8-phase test suite) |
| │   ├── requirements.txt      # Dependencies |
| │   └── tasks/                # Task definitions (easy, medium, hard) |
| ├── tests/ |
| │   ├── test_smoke_exploits.py  # 27 smoke & exploit checks |
| │   ├── test_icl.py             # ICL loop verification |
| │   └── test_multi_episode.py   # Multi-episode progression |
| └── STRATEGIC_LEARNING.md     # RLVR architecture documentation |
| ``` |
|
|