Somuai12's picture
Restructure README to required format: overview, spaces, tasks, setup, baseline
f2195b2
---
title: PolicyEvolverEnv
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /dashboard/
---
# PolicyEvolverEnv
## 1. Environment Overview and Motivation
**PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data.
### The Problem
In modern platforms β€” social media, enterprise HR, and e-commerce β€” static policies quickly become outdated or vaguely worded, leading to:
- **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
- **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
- **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.
### The Solution
PolicyEvolverEnv simulates these challenges by presenting the agent with:
1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions).
2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds).
The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications** β€” not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination.
### Why This Matters for RLVR
This environment operates at the **Reinforcement Learning from Verifiable Rewards (RLVR)** layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback β€” demonstrating genuine in-context policy learning.
---
## 2. Action Space
The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types:
### `propose_clarification` β€” Easy Task Action
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_clarification"]` | Discriminator tag |
| `ambiguous_term` | `str` | The exact vague term found in existing policies |
| `suggested_definition` | `str` | A specific, measurable replacement definition |
| `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects |
| `justification` | `str` | Why this term is ambiguous and why the fix works |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
### `propose_new_rule` β€” Medium Task Action
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_new_rule"]` | Discriminator tag |
| `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) |
| `new_rule` | `str` | The complete new rule text |
| `scope` | `List[str]` | Scenario types this rule applies to |
| `integration_points` | `List[str]` | How it connects to existing policy IDs |
| `justification` | `str` | Why a gap exists and how this rule fills it |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
### `evolve_policy` β€” Hard Task Action
| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["evolve_policy"]` | Discriminator tag |
| `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` |
| `expected_outcomes` | `Dict[str, float]` | Metric name β†’ expected value (must show realistic tradeoffs) |
| `rollback_conditions` | `List[str]` | When to revert changes |
| `justification` | `str` | Comprehensive reasoning for the evolution |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
---
## 3. Observation Space
The `Observation` returned by `reset()` and `step()` contains:
| Field | Type | Description |
|:------|:-----|:------------|
| `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) |
| `episode_id` | `str` | Unique episode session tracker |
| `step_count` | `int` | Current step number (max 5 per episode) |
| `corpus_size` | `int` | Total incidents in the full data corpus |
| `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) |
| `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` |
| `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) |
| `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) |
| `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) |
| `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline |
| `reward` | `float` | Score from the grader for the last action, in (0, 1) |
| `done` | `bool` | Whether the episode has ended |
| `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` |
### Staff Feedback (in `info`)
After each step, the observation includes structured staff feedback to guide the agent's next action:
| Field | Example Values | Purpose |
|:------|:---------------|:--------|
| `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward |
| `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly |
| `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step |
---
## 4. Task Descriptions
The environment provides three tasks with escalating cognitive difficulty:
### Task Easy β€” Ambiguity Clarification (Difficulty: `easy`)
- **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
- **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition.
- **Expected Action**: `propose_clarification`
- **Expected Min Score**: 0.70
- **Key Grading Criteria**:
- Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`)
- Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped < 0.30)
- Valid `affected_policy_ids` boost score
### Task Medium β€” Gap Detection & New Rule (Difficulty: `medium`)
- **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
- **Objective**: Detect the missing policy domain and draft a new rule to fill the gap.
- **Expected Action**: `propose_new_rule`
- **Expected Min Score**: 0.55
- **Key Grading Criteria**:
- Must target the correct `rule_domain` (e.g., `"AI_use"`)
- Empty `scope` array severely penalized
- `integration_points` linking to existing policy IDs boost score
- Rule text must be substantive (short rules penalized)
### Task Hard β€” Holistic Policy Evolution (Difficulty: `hard`)
- **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
- **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
- **Expected Action**: `evolve_policy`
- **Expected Min Score**: 0.40
- **Key Grading Criteria**:
- **Hallucination Guard**: All metrics at 0.95+ triggers "Unrealistic Tradeoff" penalty (score capped < 0.15)
- **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur -0.30 penalty
- **Realistic Tradeoffs**: `expected_outcomes` must show mathematical variance (improving fraud detection should decrease revenue velocity)
- **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
- Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue`
### Global Grading Mechanics (All Tasks)
| Mechanic | Effect |
|:---------|:-------|
| **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` β†’ +0.10 to +0.20 |
| **Step-Delta Bonus** | Significant improvement over previous best β†’ +0.02 to +0.05 |
| **Anti-Repetition Penalty** | Exact repeated action β†’ -0.30 |
| **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` β†’ score zeroed |
| **Semantic Density Guard** | Word-stuffing with >200 words and low content density β†’ score zeroed |
| **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) β†’ up to -0.75 |
| **Segmented Prioritization** | Core fix in first 25% of response β†’ bonus; buried at bottom β†’ penalty |
---
## 5. Setup and Usage
### Local Installation
```bash
git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
cd PolicyEvolverEnv
python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt
```
### Run the Environment Server
```bash
uvicorn server.app:app --port 8000
```
This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET).
### Run with Docker
```bash
docker build -t policy-evolver .
docker run -p 8000:8000 policy-evolver
```
### Run the Inference Agent
The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format.
```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_groq_api_key"
python3 inference.py
```
To run a specific task: `python3 inference.py task_easy`
### Required Environment Variables
| Variable | Description | Example |
|:---------|:------------|:--------|
| `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` |
| `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` |
| `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` |
### Run Tests
```bash
PYTHONPATH=. python tests/test_smoke_exploits.py # 27 smoke & exploit checks
PYTHONPATH=. python tests/test_icl.py # ICL verification (3 tasks)
PYTHONPATH=. python tests/test_multi_episode.py # Multi-episode progression
PYTHONPATH=. python server/grader.py # 8-phase grader test suite
```
---
## 6. Baseline Performance Scores
The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.
### Single-Step Convergence (Best Case)
| Task | Score | Converged | Expected Min |
|:-----|:------|:----------|:-------------|
| `task_easy` | 0.94 | βœ“ Step 1 | 0.70 |
| `task_medium` | 0.999 | βœ“ Step 1 | 0.55 |
| `task_hard` | 0.90 | βœ“ Step 1 | 0.40 |
### Multi-Step ICL Progression (Naive β†’ Optimized)
| Task | Naive (Step 0) | Optimized (Step 1) | Improvement |
|:-----|:---------------|:-------------------|:------------|
| `task_easy` | 0.400 | 0.999 | +0.600 |
| `task_medium` | 0.001 | 0.999 | +0.998 |
| `task_hard` | 0.088 | 0.999 | +0.912 |
**Average ICL Improvement: +0.837**
### Configuration
| Setting | Value |
|:--------|:------|
| **Model** | `llama-3.1-8b-instant` (via Groq) |
| **Temperature** | `0.0` |
| **Seed** | `42` |
| **Determinism** | 5 identical runs β†’ identical scores βœ“ |
| **Fine-tuning** | None required |
---
## Project Structure
```
policy_evolver_env/
β”œβ”€β”€ inference.py # Hackathon entry point ([START]/[STEP]/[END] format)
β”œβ”€β”€ client.py # EnvClient for HTTP interaction
β”œβ”€β”€ models.py # Pydantic models (Action, Observation, State)
β”œβ”€β”€ openenv.yaml # OpenEnv specification
β”œβ”€β”€ Dockerfile # Docker deployment with HEALTHCHECK
β”œβ”€β”€ server/
β”‚ β”œβ”€β”€ app.py # FastAPI + Gradio dashboard
β”‚ β”œβ”€β”€ environment.py # Environment logic (reset, step, state)
β”‚ β”œβ”€β”€ grader.py # Deterministic grading engine (8-phase test suite)
β”‚ β”œβ”€β”€ requirements.txt # Dependencies
β”‚ └── tasks/ # Task definitions (easy, medium, hard)
β”œβ”€β”€ tests/
β”‚ β”œβ”€β”€ test_smoke_exploits.py # 27 smoke & exploit checks
β”‚ β”œβ”€β”€ test_icl.py # ICL loop verification
β”‚ └── test_multi_episode.py # Multi-episode progression
└── STRATEGIC_LEARNING.md # RLVR architecture documentation
```