Somuai12's picture
Restructure README to required format: overview, spaces, tasks, setup, baseline
f2195b2
metadata
title: PolicyEvolverEnv
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /dashboard/

PolicyEvolverEnv

1. Environment Overview and Motivation

PolicyEvolverEnv is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to design, refine, and evolve governance policies through meta-reasoning over real-world operational data.

The Problem

In modern platforms β€” social media, enterprise HR, and e-commerce β€” static policies quickly become outdated or vaguely worded, leading to:

  • Inconsistent enforcement: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
  • False-positive actions: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
  • Unaddressed gaps: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.

The Solution

PolicyEvolverEnv simulates these challenges by presenting the agent with:

  1. A corpus of operational incidents (flagged posts, HR violations, seller transactions).
  2. An existing policy framework with known flaws (vague terms, missing rules, conflicting thresholds).

The agent must analyze the data, identify systemic flaws, and submit structured policy modifications β€” not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination.

Why This Matters for RLVR

This environment operates at the Reinforcement Learning from Verifiable Rewards (RLVR) layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback β€” demonstrating genuine in-context policy learning.


2. Action Space

The action space uses a Discriminated Union (Pydantic RootModel with Discriminator("action_type")) supporting three structured action types:

propose_clarification β€” Easy Task Action

Field Type Description
action_type Literal["propose_clarification"] Discriminator tag
ambiguous_term str The exact vague term found in existing policies
suggested_definition str A specific, measurable replacement definition
affected_policy_ids List[str] Which policy IDs this clarification affects
justification str Why this term is ambiguous and why the fix works
think Optional[str] Chain-of-thought reasoning (earns +0.10–0.20 bonus)

propose_new_rule β€” Medium Task Action

Field Type Description
action_type Literal["propose_new_rule"] Discriminator tag
rule_domain str Domain the new rule covers (e.g., "AI_use")
new_rule str The complete new rule text
scope List[str] Scenario types this rule applies to
integration_points List[str] How it connects to existing policy IDs
justification str Why a gap exists and how this rule fills it
think Optional[str] Chain-of-thought reasoning (earns +0.10–0.20 bonus)

evolve_policy β€” Hard Task Action

Field Type Description
action_type Literal["evolve_policy"] Discriminator tag
policy_modifications List[PolicyModification] Specific changes: policy_id, change_type, new_text, reason
expected_outcomes Dict[str, float] Metric name β†’ expected value (must show realistic tradeoffs)
rollback_conditions List[str] When to revert changes
justification str Comprehensive reasoning for the evolution
think Optional[str] Chain-of-thought reasoning (earns +0.10–0.20 bonus)

3. Observation Space

The Observation returned by reset() and step() contains:

Field Type Description
task_id str Active scenario identifier (task_easy, task_medium, task_hard)
episode_id str Unique episode session tracker
step_count int Current step number (max 5 per episode)
corpus_size int Total incidents in the full data corpus
corpus_shown int Number of incidents displayed (reactive to agent's domain)
data_corpus List[CorpusIncident] Operational incidents with id, content, system_action, and type
current_policies List[Dict] The existing policy framework (id + text)
policy_outcomes Optional[List[Dict]] Historical outcome data (hard task only)
system_metrics Dict[str, float] Operational statistics (precision, recall, false-positive rates)
identified_issues List[Dict] Known flaws in the governance pipeline
reward float Score from the grader for the last action, in (0, 1)
done bool Whether the episode has ended
info Dict Contains best_score, rewards_history, steps_remaining, and staff_feedback

Staff Feedback (in info)

After each step, the observation includes structured staff feedback to guide the agent's next action:

Field Example Values Purpose
strategic_rating "Junior Associate", "Staff Specialist", "Senior Architect" Performance tier based on reward
focus "Signal detected" or "Burying the lede or distracted by noise" Whether the agent prioritized correctly
recommendation "Maintain high signal-to-noise ratio and lead with the fix." Actionable guidance for next step

4. Task Descriptions

The environment provides three tasks with escalating cognitive difficulty:

Task Easy β€” Ambiguity Clarification (Difficulty: easy)

  • Scenario: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
  • Objective: Identify an ambiguous term and replace it with a specific, measurable definition.
  • Expected Action: propose_clarification
  • Expected Min Score: 0.70
  • Key Grading Criteria:
    • Definition must contain measurable keywords ("threshold", "verify", "%", "within")
    • Vague words ("generally", "sometimes", "maybe") trigger a hard penalty (score capped < 0.30)
    • Valid affected_policy_ids boost score

Task Medium β€” Gap Detection & New Rule (Difficulty: medium)

  • Scenario: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
  • Objective: Detect the missing policy domain and draft a new rule to fill the gap.
  • Expected Action: propose_new_rule
  • Expected Min Score: 0.55
  • Key Grading Criteria:
    • Must target the correct rule_domain (e.g., "AI_use")
    • Empty scope array severely penalized
    • integration_points linking to existing policy IDs boost score
    • Rule text must be substantive (short rules penalized)

Task Hard β€” Holistic Policy Evolution (Difficulty: hard)

  • Scenario: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
  • Objective: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
  • Expected Action: evolve_policy
  • Expected Min Score: 0.40
  • Key Grading Criteria:
    • Hallucination Guard: All metrics at 0.95+ triggers "Unrealistic Tradeoff" penalty (score capped < 0.15)
    • Cross-Domain Guard: HR/AI proposals for an e-commerce task incur -0.30 penalty
    • Realistic Tradeoffs: expected_outcomes must show mathematical variance (improving fraud detection should decrease revenue velocity)
    • Domain Relevance: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
    • Metric key aliases supported: fraud_rate/fraud/fraud_detection, revenue_velocity/queue_overload/revenue

Global Grading Mechanics (All Tasks)

Mechanic Effect
Chain-of-Thought Bonus think field with keywords like "tradeoff", "precision", "recall" β†’ +0.10 to +0.20
Step-Delta Bonus Significant improvement over previous best β†’ +0.02 to +0.05
Anti-Repetition Penalty Exact repeated action β†’ -0.30
Prompt Injection Guard "ignore previous", "system_prompt", "override" β†’ score zeroed
Semantic Density Guard Word-stuffing with >200 words and low content density β†’ score zeroed
Red Herring Penalty Referencing injected noise topics (office logistics, mascot) β†’ up to -0.75
Segmented Prioritization Core fix in first 25% of response β†’ bonus; buried at bottom β†’ penalty

5. Setup and Usage

Local Installation

git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
cd PolicyEvolverEnv
python3 -m venv .venv
source .venv/bin/activate
pip install -r server/requirements.txt

Run the Environment Server

uvicorn server.app:app --port 8000

This starts all endpoints: /reset (POST), /step (POST), /state (GET), /tasks (GET), /grader (POST), /health (GET), /baseline (GET).

Run with Docker

docker build -t policy-evolver .
docker run -p 8000:8000 policy-evolver

Run the Inference Agent

The primary evaluation entry point is inference.py, which follows the hackathon [START], [STEP], [END] logging format.

export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_groq_api_key"

python3 inference.py

To run a specific task: python3 inference.py task_easy

Required Environment Variables

Variable Description Example
HF_TOKEN API key for LLM inference (Groq) gsk_...
API_BASE_URL OpenAI-compatible endpoint https://api.groq.com/openai/v1
MODEL_NAME Model identifier llama-3.1-8b-instant

Run Tests

PYTHONPATH=. python tests/test_smoke_exploits.py   # 27 smoke & exploit checks
PYTHONPATH=. python tests/test_icl.py              # ICL verification (3 tasks)
PYTHONPATH=. python tests/test_multi_episode.py    # Multi-episode progression
PYTHONPATH=. python server/grader.py               # 8-phase grader test suite

6. Baseline Performance Scores

The agent uses In-Context Reinforcement Learning (ICL-RL): no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.

Single-Step Convergence (Best Case)

Task Score Converged Expected Min
task_easy 0.94 βœ“ Step 1 0.70
task_medium 0.999 βœ“ Step 1 0.55
task_hard 0.90 βœ“ Step 1 0.40

Multi-Step ICL Progression (Naive β†’ Optimized)

Task Naive (Step 0) Optimized (Step 1) Improvement
task_easy 0.400 0.999 +0.600
task_medium 0.001 0.999 +0.998
task_hard 0.088 0.999 +0.912

Average ICL Improvement: +0.837

Configuration

Setting Value
Model llama-3.1-8b-instant (via Groq)
Temperature 0.0
Seed 42
Determinism 5 identical runs β†’ identical scores βœ“
Fine-tuning None required

Project Structure

policy_evolver_env/
β”œβ”€β”€ inference.py            # Hackathon entry point ([START]/[STEP]/[END] format)
β”œβ”€β”€ client.py               # EnvClient for HTTP interaction
β”œβ”€β”€ models.py               # Pydantic models (Action, Observation, State)
β”œβ”€β”€ openenv.yaml            # OpenEnv specification
β”œβ”€β”€ Dockerfile              # Docker deployment with HEALTHCHECK
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py              # FastAPI + Gradio dashboard
β”‚   β”œβ”€β”€ environment.py      # Environment logic (reset, step, state)
β”‚   β”œβ”€β”€ grader.py           # Deterministic grading engine (8-phase test suite)
β”‚   β”œβ”€β”€ requirements.txt    # Dependencies
β”‚   └── tasks/              # Task definitions (easy, medium, hard)
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_smoke_exploits.py  # 27 smoke & exploit checks
β”‚   β”œβ”€β”€ test_icl.py             # ICL loop verification
β”‚   └── test_multi_episode.py   # Multi-episode progression
└── STRATEGIC_LEARNING.md       # RLVR architecture documentation