---
title: PolicyEvolverEnv
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /dashboard/
---

# PolicyEvolverEnv

## 1. Environment Overview and Motivation

**PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data.

### The Problem

On modern platforms such as social media, enterprise HR, and e-commerce, static policies quickly become outdated or remain vaguely worded, leading to:
- **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
- **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
- **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.

### The Solution

PolicyEvolverEnv simulates these challenges by presenting the agent with:
1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions).
2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds).

The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications**: actionable governance changes rather than free-form answers. The grader evaluates whether each proposed fix is specific, measurable, domain-relevant, and free of hallucination.

### Why This Matters for RLVR

This environment applies **Reinforcement Learning from Verifiable Rewards (RLVR)** at inference time. No weight updates are performed: the LLM improves within a single 5-step episode by reading its own reward history and staff feedback, so any gain comes from in-context policy learning.

---

## 2. Action Space

The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types:

### `propose_clarification` — Easy Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_clarification"]` | Discriminator tag |
| `ambiguous_term` | `str` | The exact vague term found in existing policies |
| `suggested_definition` | `str` | A specific, measurable replacement definition |
| `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects |
| `justification` | `str` | Why this term is ambiguous and why the fix works |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |

### `propose_new_rule` — Medium Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_new_rule"]` | Discriminator tag |
| `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) |
| `new_rule` | `str` | The complete new rule text |
| `scope` | `List[str]` | Scenario types this rule applies to |
| `integration_points` | `List[str]` | How it connects to existing policy IDs |
| `justification` | `str` | Why a gap exists and how this rule fills it |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |

### `evolve_policy` — Hard Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["evolve_policy"]` | Discriminator tag |
| `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` |
| `expected_outcomes` | `Dict[str, float]` | Metric name → expected value (must show realistic tradeoffs) |
| `rollback_conditions` | `List[str]` | When to revert changes |
| `justification` | `str` | Comprehensive reasoning for the evolution |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns +0.10–0.20 bonus) |
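
As a rough illustration of the three shapes above, a payload can be written as a plain dictionary and routed by its `action_type` tag. This is a standard-library sketch, not the project's Pydantic models: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, the required/optional split is inferred from the tables, and the policy ID `"P1"` is a made-up placeholder.

```python
from typing import Any, Dict, List

# Required fields per action type, inferred from the tables above (assumption).
# `think` is Optional in every action, so it is not listed.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "propose_clarification": [
        "ambiguous_term", "suggested_definition",
        "affected_policy_ids", "justification",
    ],
    "propose_new_rule": [
        "rule_domain", "new_rule", "scope",
        "integration_points", "justification",
    ],
    "evolve_policy": [
        "policy_modifications", "expected_outcomes",
        "rollback_conditions", "justification",
    ],
}


def missing_fields(action: Dict[str, Any]) -> List[str]:
    """Return required fields absent from `action`, dispatching on its tag."""
    tag = action.get("action_type")
    if tag not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {tag!r}")
    return [name for name in REQUIRED_FIELDS[tag] if name not in action]


clarification = {
    "action_type": "propose_clarification",
    "ambiguous_term": "offensive",
    "suggested_definition": "Content reported by at least 3 users within 24 hours.",
    "affected_policy_ids": ["P1"],  # placeholder ID, not taken from the task data
    "justification": "Moderators interpret the term inconsistently.",
    "think": "A measurable threshold eases the precision/recall tradeoff.",
}

print(missing_fields(clarification))  # an empty list means nothing is missing
```

The real environment validates actions with Pydantic, so a check like this is only useful as a quick client-side guard before sending a request.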

---

## 3. Observation Space

The `Observation` returned by `reset()` and `step()` contains:

| Field | Type | Description |
|:------|:-----|:------------|
| `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) |
| `episode_id` | `str` | Unique episode session tracker |
| `step_count` | `int` | Current step number (max 5 per episode) |
| `corpus_size` | `int` | Total incidents in the full data corpus |
| `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) |
| `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` |
| `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) |
| `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) |
| `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) |
| `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline |
| `reward` | `float` | Score from the grader for the last action, in the open interval (0, 1) |
| `done` | `bool` | Whether the episode has ended |
| `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` |

### Staff Feedback (in `info`)

After each step, the observation includes structured staff feedback to guide the agent's next action:

| Field | Example Values | Purpose |
|:------|:---------------|:--------|
| `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward |
| `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly |
| `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step |
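
One way an agent might fold this feedback into its next prompt is sketched below. The nesting (`staff_feedback` as a dictionary inside `info`) is an assumption based on the tables above, and `summarize_feedback` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict


def summarize_feedback(info: Dict[str, Any]) -> str:
    """Turn the `info` dict into a short text hint for the next prompt."""
    feedback = info.get("staff_feedback") or {}
    lines = [
        f"Rewards so far: {info.get('rewards_history') or []}",
        f"Best score: {info.get('best_score')}",
        f"Steps remaining: {info.get('steps_remaining')}",
    ]
    # Append only the feedback fields that are actually present.
    for key in ("strategic_rating", "focus", "recommendation"):
        if key in feedback:
            lines.append(f"{key}: {feedback[key]}")
    return "\n".join(lines)


# Illustrative values only; the numbers are not real grader output.
example_info = {
    "best_score": 0.4,
    "rewards_history": [0.4],
    "steps_remaining": 4,
    "staff_feedback": {
        "strategic_rating": "Junior Associate",
        "focus": "Burying the lede or distracted by noise",
        "recommendation": "Maintain high signal-to-noise ratio and lead with the fix.",
    },
}

print(summarize_feedback(example_info))
```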

---

## 4. Task Descriptions

The environment provides three tasks with escalating cognitive difficulty:

### Task Easy — Ambiguity Clarification (Difficulty: `easy`)
- **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
- **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition.
- **Expected Action**: `propose_clarification`
- **Expected Min Score**: 0.70
- **Key Grading Criteria**:
  - Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`)
  - Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped below 0.30)
  - Valid `affected_policy_ids` boost score
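
The keyword criteria above can be approximated with a small pre-submission check. This is a hedged sketch, not the grader's implementation: the word lists contain only the examples quoted above (the real lists may be longer), and `precheck_definition` is a hypothetical helper.

```python
import re
from typing import Dict, List, Union

# Only the examples quoted in this README; the grader may use longer lists.
MEASURABLE_KEYWORDS = ("threshold", "verify", "%", "within")
VAGUE_WORDS = ("generally", "sometimes", "maybe")


def precheck_definition(definition: str) -> Dict[str, Union[List[str], bool]]:
    """Report measurable keywords and vague words found in a definition."""
    text = definition.lower()
    words = set(re.findall(r"[a-z]+", text))
    measurable = [k for k in MEASURABLE_KEYWORDS if k in text]
    vague = [w for w in VAGUE_WORDS if w in words]
    return {
        "measurable": measurable,
        "vague": vague,
        "ok": bool(measurable) and not vague,
    }


print(precheck_definition(
    "Remove posts within 24 hours if reports exceed a threshold of 5%."
))
print(precheck_definition("Content that is generally rude."))
```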

### Task Medium — Gap Detection & New Rule (Difficulty: `medium`)
- **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
- **Objective**: Detect the missing policy domain and draft a new rule to fill the gap.
- **Expected Action**: `propose_new_rule`
- **Expected Min Score**: 0.55
- **Key Grading Criteria**:
  - Must target the correct `rule_domain` (e.g., `"AI_use"`)
  - Empty `scope` array severely penalized
  - `integration_points` linking to existing policy IDs boost score
  - Rule text must be substantive (short rules penalized)
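
A similar client-side check can catch the structural problems listed above before a `propose_new_rule` action is submitted. The sketch below makes several assumptions: `MIN_RULE_WORDS` is an invented cutoff (the grader's actual length threshold is not documented here), integration points are assumed to mention policy IDs as substrings, and the ID `"HR-1"` is a placeholder.

```python
from typing import Any, Dict, List

MIN_RULE_WORDS = 20  # assumed cutoff; the real grader threshold is unknown


def precheck_new_rule(
    action: Dict[str, Any],
    current_policies: List[Dict[str, str]],
) -> List[str]:
    """Return human-readable problems found in a propose_new_rule action."""
    known_ids = {policy["id"] for policy in current_policies}
    problems: List[str] = []
    if not action.get("scope"):
        problems.append("scope is empty")
    points = action.get("integration_points", [])
    if not any(pid in point for point in points for pid in known_ids):
        problems.append("integration_points mention no existing policy ID")
    if len(action.get("new_rule", "").split()) < MIN_RULE_WORDS:
        problems.append("new_rule is short")
    return problems


policies = [{"id": "HR-1", "text": "Employees must protect customer data."}]
draft = {
    "action_type": "propose_new_rule",
    "rule_domain": "AI_use",
    "new_rule": "Use approved AI tools only.",
    "scope": [],
    "integration_points": ["Extends HR-1"],
    "justification": "No existing policy covers Generative AI tools.",
}

print(precheck_new_rule(draft, policies))
```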

### Task Hard — Holistic Policy Evolution (Difficulty: `hard`)
- **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
- **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
- **Expected Action**: `evolve_policy`
- **Expected Min Score**: 0.40
- **Key Grading Criteria**:
  - **Hallucination Guard**: Setting every metric to 0.95 or higher triggers the "Unrealistic Tradeoff" penalty (score capped below 0.15)
  - **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur a -0.30 penalty
  - **Realistic Tradeoffs**: `expected_outcomes` must reflect a tradeoff between metrics (for example, improving fraud detection should reduce revenue velocity)
  - **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
  - Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue`
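
The hallucination guard and the alias handling can be pictured with the following sketch. It is an approximation under stated assumptions: the choice of the first alias in each group as the canonical key is arbitrary, and `normalize_outcomes` and `looks_unrealistic` are hypothetical helpers, not the grader's code.

```python
from typing import Dict

# Alias groups from the list above; picking the first name in each group
# as the canonical key is an assumption made for this sketch.
ALIASES: Dict[str, str] = {
    "fraud_rate": "fraud_rate",
    "fraud": "fraud_rate",
    "fraud_detection": "fraud_rate",
    "revenue_velocity": "revenue_velocity",
    "queue_overload": "revenue_velocity",
    "revenue": "revenue_velocity",
}
UNREALISTIC_FLOOR = 0.95


def normalize_outcomes(outcomes: Dict[str, float]) -> Dict[str, float]:
    """Map known aliases to one key per metric; unknown keys pass through."""
    return {ALIASES.get(name, name): value for name, value in outcomes.items()}


def looks_unrealistic(outcomes: Dict[str, float]) -> bool:
    """True when every expected metric sits at or above the 0.95 floor."""
    return bool(outcomes) and all(v >= UNREALISTIC_FLOOR for v in outcomes.values())


print(looks_unrealistic({"fraud": 0.97, "revenue": 0.99}))
print(normalize_outcomes({"fraud_detection": 0.85, "revenue": 0.70}))
```

An agent can run a check like this on its own `expected_outcomes` and lower the less plausible metric before submitting.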

### Global Grading Mechanics (All Tasks)

| Mechanic | Effect |
|:---------|:-------|
| **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` → +0.10 to +0.20 |
| **Step-Delta Bonus** | Significant improvement over previous best → +0.02 to +0.05 |
| **Anti-Repetition Penalty** | Exact repeated action → -0.30 |
| **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` → score zeroed |
| **Semantic Density Guard** | Word-stuffing with >200 words and low content density → score zeroed |
| **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) → up to -0.75 |
| **Segmented Prioritization** | Core fix in first 25% of response → bonus; buried at bottom → penalty |
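
Two of these guards, prompt injection and exact repetition, are simple enough to mirror on the agent side. The sketch below is illustrative only: the phrase list contains just the three examples from the table, the substring matching is deliberately conservative, and both helper names are hypothetical.

```python
import json
from typing import Any, Dict, List

# Only the phrases quoted in the table; the grader may check more.
INJECTION_PHRASES = ("ignore previous", "system_prompt", "override")


def contains_injection(action: Dict[str, Any]) -> bool:
    """True if any listed phrase appears anywhere in the serialized action."""
    blob = json.dumps(action).lower()
    return any(phrase in blob for phrase in INJECTION_PHRASES)


def is_exact_repeat(action: Dict[str, Any], previous: List[Dict[str, Any]]) -> bool:
    """True if `action` equals an earlier action, ignoring key order."""
    canonical = json.dumps(action, sort_keys=True)
    return any(canonical == json.dumps(old, sort_keys=True) for old in previous)


first = {"action_type": "propose_new_rule", "new_rule": "Log all AI tool usage."}
print(contains_injection(first))
print(is_exact_repeat(dict(first), [first]))
```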

---

## 5. Setup and Usage

### Local Installation

```bash
git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
cd PolicyEvolverEnv
python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt
```

### Run the Environment Server

```bash
uvicorn server.app:app --port 8000
```

This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET).
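
A minimal standard-library client for these endpoints might look like the sketch below. The paths and HTTP methods come from the list above, but the request body shapes (for example `{"task_id": "task_easy"}` for `/reset`) are assumptions, and `build_request` and `call` are hypothetical helpers; the repository's own `client.py` is the authoritative client.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_request(
    path: str,
    payload: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a GET request when there is no payload, otherwise a JSON POST."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET" if payload is None else "POST",
    )


def call(path: str, payload: Optional[Dict[str, Any]] = None) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `call("/health")` issues the GET request, and `call("/reset", {"task_id": "task_easy"})` would start an episode if the server accepts that body shape.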

### Run with Docker

```bash
docker build -t policy-evolver .
docker run -p 8000:8000 policy-evolver
```

### Run the Inference Agent

The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format.

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_groq_api_key"

python3 inference.py
```

To run a specific task: `python3 inference.py task_easy`

### Required Environment Variables

| Variable | Description | Example |
|:---------|:------------|:--------|
| `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` |
| `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` |
| `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` |

### Run Tests

```bash
PYTHONPATH=. python tests/test_smoke_exploits.py   # 27 smoke & exploit checks
PYTHONPATH=. python tests/test_icl.py              # ICL verification (3 tasks)
PYTHONPATH=. python tests/test_multi_episode.py    # Multi-episode progression
PYTHONPATH=. python server/grader.py               # 8-phase grader test suite
```

---

## 6. Baseline Performance Scores

The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.
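
The loop described above can be sketched as follows. The 5-step limit comes from this README; everything else is an assumption for illustration. In particular, `stub_propose` and `stub_step` are stand-ins for an LLM call and for `POST /step`, and their reward formula is invented so the sketch runs without a server.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 5  # episode length stated in this README

Action = Dict[str, object]
StepResult = Tuple[float, bool, Dict[str, object]]  # (reward, done, info)
History = List[Tuple[Action, float, Dict[str, object]]]


def run_episode(
    propose: Callable[[History], Action],
    step: Callable[[Action], StepResult],
) -> Tuple[float, History]:
    """Run one episode, passing all earlier rewards and feedback to `propose`."""
    history: History = []
    best = 0.0
    for _ in range(MAX_STEPS):
        action = propose(history)          # in practice: LLM call with history in the prompt
        reward, done, info = step(action)  # in practice: POST /step
        history.append((action, reward, info))
        best = max(best, reward)
        if done:
            break
    return best, history


# Stand-ins so the sketch runs offline (purely illustrative behavior).
def stub_propose(history: History) -> Action:
    return {"attempt": len(history)}


def stub_step(action: Action) -> StepResult:
    reward = min(0.999, 0.4 + 0.3 * int(action["attempt"]))
    return reward, reward >= 0.9, {"staff_feedback": {}}


best_score, episode_history = run_episode(stub_propose, stub_step)
print(best_score, len(episode_history))
```

No model weights change anywhere in this loop; the only thing that evolves between steps is the history passed back into `propose`.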

### Single-Step Convergence (Best Case)

| Task | Score | Converged | Expected Min |
|:-----|:------|:----------|:-------------|
| `task_easy` | 0.94 | ✓ Step 1 | 0.70 |
| `task_medium` | 0.999 | ✓ Step 1 | 0.55 |
| `task_hard` | 0.90 | ✓ Step 1 | 0.40 |

### Multi-Step ICL Progression (Naive → Optimized)

| Task | Naive (Step 0) | Optimized (Step 1) | Improvement |
|:-----|:---------------|:-------------------|:------------|
| `task_easy` | 0.400 | 0.999 | +0.600 |
| `task_medium` | 0.001 | 0.999 | +0.998 |
| `task_hard` | 0.088 | 0.999 | +0.912 |

**Average ICL Improvement: +0.837**

### Configuration

| Setting | Value |
|:--------|:------|
| **Model** | `llama-3.1-8b-instant` (via Groq) |
| **Temperature** | `0.0` |
| **Seed** | `42` |
| **Determinism** | 5 identical runs → identical scores ✓ |
| **Fine-tuning** | None required |

---

## 7. Project Structure

```
policy_evolver_env/
├── inference.py            # Hackathon entry point ([START]/[STEP]/[END] format)
├── client.py               # EnvClient for HTTP interaction
├── models.py               # Pydantic models (Action, Observation, State)
├── openenv.yaml            # OpenEnv specification
├── Dockerfile              # Docker deployment with HEALTHCHECK
├── server/
│   ├── app.py              # FastAPI + Gradio dashboard
│   ├── environment.py      # Environment logic (reset, step, state)
│   ├── grader.py           # Deterministic grading engine (8-phase test suite)
│   ├── requirements.txt    # Dependencies
│   └── tasks/              # Task definitions (easy, medium, hard)
├── tests/
│   ├── test_smoke_exploits.py  # 27 smoke & exploit checks
│   ├── test_icl.py             # ICL loop verification
│   └── test_multi_episode.py   # Multi-episode progression
└── STRATEGIC_LEARNING.md       # RLVR architecture documentation
```