Restructure README to required format: overview, spaces, tasks, setup, baseline
README.md
CHANGED
|
@@ -6,189 +6,269 @@ sdk: docker
|
|
| 6 |
app_port: 8000
|
| 7 |
base_path: /dashboard/
|
| 8 |
---
|
| 9 |
-
# PolicyEvolverEnv: Multi-Modal Strategic Governance Sandbox
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
---
|
| 16 |
-
|
| 17 |
-
### Advanced Reward Shaping (RLVR Integration)
|
| 18 |
-
Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic Python grading engine (`server/grader.py`) designed to harden LLM strategic reasoning through explicit rewards and penalties:
|
| 19 |
|
| 20 |
-
###
|
| 21 |
-
* **The Penalty:** If the agent's proposed rule contains vague words like `"generally"`, `"sometimes"`, `"often"`, or `"maybe"`, the grader flags a **Clarity Coherence** failure. Furthermore, if the definition contains **zero measurable keywords** (e.g. `"threshold"`, `"verify"`, `"%"`), a hard penalty caps the base score below `0.30`, making it impossible to succeed without numbers or strict conditionals.
|
| 22 |
-
* **The Reward:** To get a high score (>`0.85`), the agent must provide a definition that is entirely free of these vague terms, include valid active `affected_policy_ids`, include robust **measurable keywords**, and provide a substantive `justification` string.
|
| 23 |
|
| 24 |
-
|
| 25 |
-
*
|
| 26 |
-
*
|
|
|
|
| 27 |
|
| 28 |
-
###
|
| 29 |
-
* **The Penalty 1 (Hallucinations):** If the `expected_outcomes` predict that *all* metrics will improve perfectly (e.g., both revenue and fraud-blocking hit `0.95` with no downside variance), the grader treats this as an **Unrealistic Tradeoff** and fails the proposal (score capped strictly below `0.30`).
|
| 30 |
-
* **The Penalty 2 (Cross-Domain Mismatch):** Proposing an HR or AI policy for an e-commerce fraud scenario violates domain relevance. A targeted regex check deducts `0.30` from the score if the text does not contain marketplace-relevant context.
|
| 31 |
-
* **The Reward:** The grader verifies mathematical outcome variance. Agents must write realistic tradeoffs and utilize standardized impact metric keys (aliases are robustly supported, e.g., you can use `"fraud_rate"`, `"fraud"`, or `"fraud_detection"`; or `"queue_overload"` for `"revenue_velocity"`).
|
| 32 |
|
| 33 |
-
|
| 34 |
-
*
|
| 35 |
-
*
|
| 36 |
-
* **Anti-Repetition Penalty (-0.30):** Exact repeated actions across steps incur a severe penalty, forcing the agent toward continuous exploration and evolution.
|
| 37 |
|
| 38 |
-
-
|
| 39 |
|
| 40 |
-
##
|
| 41 |
-
PolicyEvolverEnv is a real-world governance sandbox where an AI agent improves its in-context policy to **design and evolve governance policies** through meta-reasoning over real-world operational data. In modern platforms (social media, enterprise HR, e-commerce), static policies quickly become outdated or vaguely applied, leading to inconsistent enforcement, false-positive moderation, and unrecognized fraud.
|
| 42 |
|
| 43 |
-
This environment
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
##
|
| 48 |
-
Most AI environments are games (like Chess or Atari). **PolicyEvolverEnv** is different: it is a **Strategic Governance Sandbox**.
|
| 49 |
|
| 50 |
-
The
|
| 51 |
|
| 52 |
-
|
| 53 |
-
* **The Solution**: An AI agent that doesn't just follow rules, but **designs** them.
|
| 54 |
|
| 55 |
-
|
| 56 |
-
|
| 57 |
|
| 58 |
-
###
|
| 59 |
-
* **Environment Best Score**: This tracks the highest score achieved in this session. It represents the "Gold Standard" the agent is aiming for.
|
| 60 |
-
* **Remaining Execution Steps**: Each "Episode" has a limit (5 steps). The agent must improve the policy within this budget. This forces **Strategic Efficiency**.
|
| 61 |
-
* **Latest Strategic Reward**: Every time you click "Execute," the Grader (`server/grader.py`) analyzes your proposal. If it's vague, you get a low reward (0.1 to 0.3). If it's specific and measurable, you get a high reward (0.8 to 0.9).
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
|
| 68 |
-
###
|
| 69 |
-
At the bottom, you have the **Action Console**. This is where the "Evolution" happens:
|
| 70 |
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
|
|
|
|
|
|
| 77 |
|
| 78 |
-
|
| 79 |
-
The overall goal is **Strategic Convergence**. When the "Current Project Score" reaches **0.85 or higher**, the agent has evolved the policy framework to a point where it is:
|
| 80 |
|
| 81 |
-
|
| 82 |
-
* **Measurable**: Success is defined by numbers (Precision/Recall).
|
| 83 |
-
* **Future-Proof**: The agent has filled gaps (like AI-generated content) that didn't exist when the original rules were written.
|
| 84 |
|
| 85 |
-
|
| 86 |
-
The `Observation` received by the agent at every step describes the current operational context:
|
| 87 |
-
- `task_id` (str): Identifier for the active scenario.
|
| 88 |
-
- `episode_id` (str): Unique session tracker.
|
| 89 |
-
- `step_count` (int): Active step number (Max 5 per episode).
|
| 90 |
-
- `data_corpus` (List[Dict]): Represents operational examples like social media posts, HR incidents, or seller accounts along with the action taken or outcome.
|
| 91 |
-
- `current_policies` (List[Dict]): The list of current active policies the system follows.
|
| 92 |
-
- `system_metrics` & `policy_outcomes`: Operational statistics reflecting precision/recall or false-positive rates.
|
| 93 |
-
- `identified_issues`: Current known flaws in the governance pipeline.
|
| 94 |
|
| 95 |
-
|
| 96 |
-
|
| 97 |
|
| 98 |
-
|
| 99 |
-
- Targets an `ambiguous_term` in an existing policy.
|
| 100 |
-
- Requires a specific, measurable `suggested_definition` and `justification`.
|
| 101 |
-
**2. ProposeNewRuleAction (`propose_new_rule`)**
|
| 102 |
-
- Addresses an unhandled domain (`rule_domain`).
|
| 103 |
-
- Requires `new_rule` text, application `scope`, and `integration_points` connecting to older policies.
|
| 104 |
-
**3. EvolveProcessAction (`evolve_policy`)**
|
| 105 |
-
- The hardest action; holistically modifies existing rules.
|
| 106 |
-
- Requires a list of `policy_modifications`, realistic `expected_outcomes` deltas, and multi-metric `rollback_conditions`.
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
-
|
| 114 |
-
* **The Scenario:** Refining a social media platform's initial content moderation rules.
|
| 115 |
-
* **The Problem:** The existing rule simply stated that "offensive or inappropriate content" was prohibited. This was far too subjective, leading to inconsistent moderation.
|
| 116 |
-
* **The Policy Applied (Action taken by Agent):** The agent was required to use the `propose_clarification` action. It took the vague term (like "offensive") and redefined it using strict, measurable thresholds (e.g., "specific threats of physical violence" or "explicit slurs targeting protected identity characteristics"). By removing subjectivity, the policy became actionable and deterministic.
|
| 117 |
|
| 118 |
-
##
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
|
| 123 |
-
|
| 124 |
-
* **The Scenario:** Managing an e-commerce platform facing a complex fraud problem, where current rules were causing too many "false positives" (locking out legitimate, high-volume sellers).
|
| 125 |
-
* **The Problem:** The platform needed to catch rapid-velocity fraud without ruining the experience for trusted legacy merchants.
|
| 126 |
-
* **The Policy Applied (Action taken by Agent):** The agent used the `evolve_policy` action for a holistic system update. It had to apply at least two complex modifications to balance Precision and Recall:
|
| 127 |
-
* **Tightening Rule:** Added a strict identity-verification trigger for new sellers showing extreme sales velocity (e.g., >20 sales/day in first 30 days).
|
| 128 |
-
* **Exemption Rule:** Rolled back the manual review thresholds for trusted legacy sellers to reduce false positives and preserve revenue.
|
| 129 |
|
| 130 |
-
|
| 131 |
|
| 132 |
-
##
|
| 133 |
|
| 134 |
-
### 1. Local Installation
|
| 135 |
```bash
|
| 136 |
-
git clone
|
| 137 |
-
cd
|
| 138 |
python3 -m venv .venv
|
| 139 |
source .venv/bin/activate
|
| 140 |
pip install -r server/requirements.txt
|
| 141 |
```
|
| 142 |
|
| 143 |
-
###
|
| 144 |
-
|
| 145 |
```bash
|
| 146 |
uvicorn server.app:app --port 8000
|
| 147 |
```
|
| 148 |
-
This boots all core endpoint paths (`/reset`, `/step`, `/state`, `/tasks`, `/grader`, `/health`).
|
| 149 |
|
| 150 |
-
|
| 151 |
-
|
|
|
|
| 152 |
|
| 153 |
-
Export your environment variables:
|
| 154 |
```bash
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
export HF_TOKEN="your_token_here"
|
| 158 |
```
|
| 159 |
|
| 160 |
-
|
| 161 |
```bash
|
| 162 |
python3 inference.py
|
| 163 |
```
|
| 164 |
-
|
| 165 |
|
| 166 |
---
|
| 167 |
|
| 168 |
-
|
| 169 |
|
| 170 |
-
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
| Task |
|
| 175 |
-
|-----
|
| 176 |
-
| task_easy
|
| 177 |
-
| task_medium |
|
| 178 |
-
| task_hard
|
| 179 |
|
| 180 |
-
|
| 181 |
-
**Reproducible:** temperature=0.0, seed=42 (**Bit-for-bit identical results verified**)
|
| 182 |
-
**No fine-tuning required.** The environment provides the learning signal; the model adapts its in-context policy each step.
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|
|
|
|
|
|
|
|
|
|
| 186 |
|
| 187 |
-
|
| 188 |
|
| 189 |
-
###
|
| 190 |
-
1. **Refinement Hub**: The baseline agent tracks its previous rewards and actions through the observation's metadata (`info`).
|
| 191 |
-
2. **Strategic pivoting**: If a policy proposal receives low rewards (due to lack of specificity or missing justifications), the agent identifies the failure points and pivots its strategy in subsequent steps.
|
| 192 |
-
3. **Measurable Improvement**: As shown in the progression chart, iterative refinement leads to **Strategic Convergence**, where the policy quality reaches institutional standards (Score ≥ 0.85).
|
| 193 |
|
| 194 |
-
|
| 6 |
app_port: 8000
|
| 7 |
base_path: /dashboard/
|
| 8 |
---
|
|
|
|
| 9 |
|
| 10 |
+
# PolicyEvolverEnv
|
| 11 |
|
| 12 |
+
## 1. Environment Overview and Motivation
|
| 13 |
|
| 14 |
+
**PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment where an AI agent learns to **design, refine, and evolve governance policies** through meta-reasoning over real-world operational data.
|
| 15 |
|
| 16 |
+
### The Problem
|
|
|
|
|
|
|
| 17 |
|
| 18 |
+
In modern platforms (social media, enterprise HR, and e-commerce), static policies quickly become outdated or vaguely worded, leading to:

- **Inconsistent enforcement**: Moderators interpret "offensive content" differently, creating 300K+ appeals per year (Meta Oversight Board, 2024).
- **False-positive actions**: E-commerce platforms lose an estimated $700M/year from incorrectly suspending legitimate high-volume sellers.
- **Unaddressed gaps**: Emerging risks like Generative AI misuse have no governing rules in legacy frameworks.
|
| 22 |
|
| 23 |
+
### The Solution
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
PolicyEvolverEnv simulates these challenges by presenting the agent with:
|
| 26 |
+
1. A **corpus of operational incidents** (flagged posts, HR violations, seller transactions).
|
| 27 |
+
2. An **existing policy framework** with known flaws (vague terms, missing rules, conflicting thresholds).
|
|
|
|
| 28 |
|
| 29 |
+
The agent must analyze the data, identify systemic flaws, and submit **structured policy modifications**: not just answers, but actionable governance. The grader evaluates whether the proposed fix is specific, measurable, domain-relevant, and free of hallucination.
|
| 30 |
|
| 31 |
+
### Why This Matters for RLVR
|
|
|
|
| 32 |
|
| 33 |
+
This environment operates at the **Reinforcement Learning from Verifiable Rewards (RLVR)** layer of inference-time adaptation. No weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback, demonstrating genuine in-context policy learning.
|
| 34 |
|
| 35 |
+
---
|
| 36 |
|
| 37 |
+
## 2. Action Space
|
|
|
|
| 38 |
|
| 39 |
+
The action space uses a **Discriminated Union** (Pydantic `RootModel` with `Discriminator("action_type")`) supporting three structured action types:
|
| 40 |
|
| 41 |
+
### `propose_clarification`: Easy Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_clarification"]` | Discriminator tag |
| `ambiguous_term` | `str` | The exact vague term found in existing policies |
| `suggested_definition` | `str` | A specific, measurable replacement definition |
| `affected_policy_ids` | `List[str]` | Which policy IDs this clarification affects |
| `justification` | `str` | Why this term is ambiguous and why the fix works |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) |
|
| 51 |
|
| 52 |
+
### `propose_new_rule`: Medium Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["propose_new_rule"]` | Discriminator tag |
| `rule_domain` | `str` | Domain the new rule covers (e.g., `"AI_use"`) |
| `new_rule` | `str` | The complete new rule text |
| `scope` | `List[str]` | Scenario types this rule applies to |
| `integration_points` | `List[str]` | How it connects to existing policy IDs |
| `justification` | `str` | Why a gap exists and how this rule fills it |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) |
|
| 63 |
|
| 64 |
+
### `evolve_policy`: Hard Task Action

| Field | Type | Description |
|:------|:-----|:------------|
| `action_type` | `Literal["evolve_policy"]` | Discriminator tag |
| `policy_modifications` | `List[PolicyModification]` | Specific changes: `policy_id`, `change_type`, `new_text`, `reason` |
| `expected_outcomes` | `Dict[str, float]` | Metric name → expected value (must show realistic tradeoffs) |
| `rollback_conditions` | `List[str]` | When to revert changes |
| `justification` | `str` | Comprehensive reasoning for the evolution |
| `think` | `Optional[str]` | Chain-of-thought reasoning (earns a +0.10 to +0.20 bonus) |
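
The tag-based dispatch behind a discriminated union can be illustrated without the project's Pydantic models. The following is a minimal standard-library sketch, not the code in `models.py`: it selects the expected field set from `action_type` and checks that the required fields listed in the tables of this section are present. `REQUIRED_FIELDS`, `parse_action`, and the policy ID `"P1"` are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict, Tuple

# Required fields per action type, copied from the field tables in this section.
# `think` is optional for every action type, so it is not listed here.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "propose_clarification": (
        "ambiguous_term", "suggested_definition",
        "affected_policy_ids", "justification",
    ),
    "propose_new_rule": (
        "rule_domain", "new_rule", "scope",
        "integration_points", "justification",
    ),
    "evolve_policy": (
        "policy_modifications", "expected_outcomes",
        "rollback_conditions", "justification",
    ),
}


def parse_action(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a raw action dict by dispatching on its `action_type` tag."""
    action_type = payload.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = [name for name in REQUIRED_FIELDS[action_type] if name not in payload]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")
    return payload


# Illustrative payload; the policy ID "P1" is a made-up placeholder.
example = parse_action({
    "action_type": "propose_clarification",
    "ambiguous_term": "offensive",
    "suggested_definition": "Explicit slurs targeting protected identity characteristics.",
    "affected_policy_ids": ["P1"],
    "justification": "The term is subjective and leads to inconsistent moderation.",
})
```

A Pydantic `RootModel` with a discriminator performs the same selection, and additionally validates field types and builds typed model instances.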
|
| 74 |
|
| 75 |
+
---
|
|
|
|
| 76 |
|
| 77 |
+
## 3. Observation Space
|
|
|
|
|
|
|
| 78 |
|
| 79 |
+
The `Observation` returned by `reset()` and `step()` contains:
|
| 80 |
|
| 81 |
+
| Field | Type | Description |
|:------|:-----|:------------|
| `task_id` | `str` | Active scenario identifier (`task_easy`, `task_medium`, `task_hard`) |
| `episode_id` | `str` | Unique episode session tracker |
| `step_count` | `int` | Current step number (max 5 per episode) |
| `corpus_size` | `int` | Total incidents in the full data corpus |
| `corpus_shown` | `int` | Number of incidents displayed (reactive to agent's domain) |
| `data_corpus` | `List[CorpusIncident]` | Operational incidents with `id`, `content`, `system_action`, and `type` |
| `current_policies` | `List[Dict]` | The existing policy framework (`id` + `text`) |
| `policy_outcomes` | `Optional[List[Dict]]` | Historical outcome data (hard task only) |
| `system_metrics` | `Dict[str, float]` | Operational statistics (precision, recall, false-positive rates) |
| `identified_issues` | `List[Dict]` | Known flaws in the governance pipeline |
| `reward` | `float` | Score from the grader for the last action, in (0, 1) |
| `done` | `bool` | Whether the episode has ended |
| `info` | `Dict` | Contains `best_score`, `rewards_history`, `steps_remaining`, and `staff_feedback` |
|
| 96 |
|
| 97 |
+
### Staff Feedback (in `info`)
|
| 98 |
|
| 99 |
+
After each step, the observation includes structured staff feedback to guide the agent's next action:
|
| 100 |
|
| 101 |
+
| Field | Example Values | Purpose |
|:------|:---------------|:--------|
| `strategic_rating` | `"Junior Associate"`, `"Staff Specialist"`, `"Senior Architect"` | Performance tier based on reward |
| `focus` | `"Signal detected"` or `"Burying the lede or distracted by noise"` | Whether the agent prioritized correctly |
| `recommendation` | `"Maintain high signal-to-noise ratio and lead with the fix."` | Actionable guidance for next step |
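
One way an agent might feed this signal back into its next prompt is sketched below. It assumes that `staff_feedback` is a dict nested inside `info` with the three keys from the table; the README does not spell out the exact nesting, and `summarize_feedback` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict, List


def summarize_feedback(info: Dict[str, Any]) -> str:
    """Turn an observation's `info` dict into a short prompt fragment."""
    history: List[float] = info.get("rewards_history", [])
    feedback: Dict[str, str] = info.get("staff_feedback", {})
    lines = [
        f"Best score so far: {info.get('best_score', 0.0):.2f}",
        f"Steps remaining: {info.get('steps_remaining', 0)}",
        "Reward history: " + ", ".join(f"{reward:.2f}" for reward in history),
    ]
    # Append only the feedback fields that are actually present.
    for key in ("strategic_rating", "focus", "recommendation"):
        if key in feedback:
            lines.append(f"{key}: {feedback[key]}")
    return "\n".join(lines)


# Illustrative `info` payload built from the example values in the table.
sample_info = {
    "best_score": 0.4,
    "rewards_history": [0.4],
    "steps_remaining": 4,
    "staff_feedback": {
        "strategic_rating": "Junior Associate",
        "focus": "Burying the lede or distracted by noise",
        "recommendation": "Maintain high signal-to-noise ratio and lead with the fix.",
    },
}
summary = summarize_feedback(sample_info)
```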
|
| 106 |
|
| 107 |
+
---
|
|
|
|
|
|
|
|
|
|
| 108 |
|
| 109 |
+
## 4. Task Descriptions
|
| 110 |
+
|
| 111 |
+
The environment provides three tasks with escalating cognitive difficulty:
|
| 112 |
+
|
| 113 |
+
### Task Easy: Ambiguity Clarification (Difficulty: `easy`)

- **Scenario**: A social media platform's community guidelines use vague terms like "offensive" and "appropriate."
- **Objective**: Identify an ambiguous term and replace it with a specific, measurable definition.
- **Expected Action**: `propose_clarification`
- **Expected Min Score**: 0.70
- **Key Grading Criteria**:
  - Definition must contain measurable keywords (`"threshold"`, `"verify"`, `"%"`, `"within"`)
  - Vague words (`"generally"`, `"sometimes"`, `"maybe"`) trigger a hard penalty (score capped < 0.30)
  - Valid `affected_policy_ids` boost the score
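
The keyword criteria above can be approximated with a simple text check. This is a rough sketch under stated assumptions, not the logic in `server/grader.py`: the word lists contain only the examples quoted in this README, and `check_definition` and its return keys are hypothetical names.

```python
import re
from typing import Dict, List, Union

# Only the examples quoted in this README; the real grader may use longer lists.
VAGUE_WORDS = ("generally", "sometimes", "maybe")
MEASURABLE_KEYWORDS = ("threshold", "verify", "%", "within")


def check_definition(definition: str) -> Dict[str, Union[List[str], bool]]:
    """Report vague words and measurable keywords found in a definition."""
    text = definition.lower()
    words = set(re.findall(r"[a-z]+", text))
    # Vague words are matched as whole words; keywords as substrings so "%" works.
    vague = [word for word in VAGUE_WORDS if word in words]
    measurable = [keyword for keyword in MEASURABLE_KEYWORDS if keyword in text]
    return {
        "vague_words": vague,
        "measurable_keywords": measurable,
        "hard_penalty": bool(vague),  # assumption: any vague word triggers the cap
    }


weak = check_definition("Posts that are generally offensive are removed.")
strong = check_definition(
    "Remove the post within 24 hours if the classifier threshold exceeds 90%."
)
```

Here `weak` is flagged for the hard penalty and contains no measurable keywords, while `strong` contains `"threshold"`, `"%"`, and `"within"` and no vague words. Both example sentences are invented for illustration.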
|
| 122 |
+
|
| 123 |
+
### Task Medium: Gap Detection & New Rule (Difficulty: `medium`)

- **Scenario**: A corporate HR framework with policies covering data protection but no coverage for Generative AI tool usage.
- **Objective**: Detect the missing policy domain and draft a new rule to fill the gap.
- **Expected Action**: `propose_new_rule`
- **Expected Min Score**: 0.55
- **Key Grading Criteria**:
  - Must target the correct `rule_domain` (e.g., `"AI_use"`)
  - An empty `scope` array is severely penalized
  - `integration_points` linking to existing policy IDs boost the score
  - Rule text must be substantive (short rules are penalized)
|
| 133 |
+
|
| 134 |
+
### Task Hard: Holistic Policy Evolution (Difficulty: `hard`)

- **Scenario**: An e-commerce Trust & Safety framework where blanket seller suspension policies catch legitimate seasonal merchants alongside fraudsters.
- **Objective**: Evolve multiple policies simultaneously to balance fraud detection, revenue velocity, and seller trust.
- **Expected Action**: `evolve_policy`
- **Expected Min Score**: 0.40
- **Key Grading Criteria**:
  - **Hallucination Guard**: All metrics at 0.95+ triggers the "Unrealistic Tradeoff" penalty (score capped < 0.15)
  - **Cross-Domain Guard**: HR/AI proposals for an e-commerce task incur a -0.30 penalty
  - **Realistic Tradeoffs**: `expected_outcomes` must show mathematical variance (improving fraud detection should decrease revenue velocity)
  - **Domain Relevance**: Modifications must reference marketplace concepts (seller, fraud, listing, merchant)
  - Metric key aliases supported: `fraud_rate`/`fraud`/`fraud_detection`, `revenue_velocity`/`queue_overload`/`revenue`
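
The hallucination guard and the alias handling described above might look roughly like the sketch below. It is an illustration only: treating `fraud_rate` and `revenue_velocity` as the canonical keys is an assumption, and `normalize_outcomes` and `is_unrealistic` are hypothetical helpers rather than functions from `server/grader.py`.

```python
from typing import Dict

# Alias groups as listed in the README; the choice of canonical key is an assumption.
METRIC_ALIASES: Dict[str, str] = {
    "fraud_rate": "fraud_rate",
    "fraud": "fraud_rate",
    "fraud_detection": "fraud_rate",
    "revenue_velocity": "revenue_velocity",
    "queue_overload": "revenue_velocity",
    "revenue": "revenue_velocity",
}


def normalize_outcomes(expected_outcomes: Dict[str, float]) -> Dict[str, float]:
    """Map alias metric keys onto one canonical key per metric."""
    return {METRIC_ALIASES.get(key, key): value for key, value in expected_outcomes.items()}


def is_unrealistic(expected_outcomes: Dict[str, float], ceiling: float = 0.95) -> bool:
    """Return True when every predicted metric sits at or above the ceiling."""
    values = list(expected_outcomes.values())
    return bool(values) and all(value >= ceiling for value in values)


too_good = is_unrealistic({"fraud": 0.95, "revenue": 0.97})
balanced = is_unrealistic({"fraud_detection": 0.92, "revenue_velocity": 0.78})
normalized = normalize_outcomes({"fraud": 0.90, "queue_overload": 0.70})
```

A proposal where every metric reaches 0.95 or more is flagged, whereas one that trades some revenue velocity for better fraud detection is not.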
|
| 145 |
+
|
| 146 |
+
### Global Grading Mechanics (All Tasks)
|
| 147 |
+
|
| 148 |
+
| Mechanic | Effect |
|:---------|:-------|
| **Chain-of-Thought Bonus** | `think` field with keywords like `"tradeoff"`, `"precision"`, `"recall"` → +0.10 to +0.20 |
| **Step-Delta Bonus** | Significant improvement over previous best → +0.02 to +0.05 |
| **Anti-Repetition Penalty** | Exact repeated action → -0.30 |
| **Prompt Injection Guard** | `"ignore previous"`, `"system_prompt"`, `"override"` → score zeroed |
| **Semantic Density Guard** | Word-stuffing with >200 words and low content density → score zeroed |
| **Red Herring Penalty** | Referencing injected noise topics (office logistics, mascot) → up to -0.75 |
| **Segmented Prioritization** | Core fix in first 25% of response → bonus; buried at bottom → penalty |
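
Two of these mechanics, the prompt injection guard and the anti-repetition penalty, are simple enough to sketch. The code below is a hedged illustration, not the grader's implementation: the order in which the checks run, the clamping to the range 0 to 1, and the `apply_global_guards` name are all assumptions.

```python
from typing import List

INJECTION_MARKERS = ("ignore previous", "system_prompt", "override")
REPETITION_PENALTY = 0.30


def apply_global_guards(base_score: float, action_text: str,
                        previous_actions: List[str]) -> float:
    """Apply the injection guard and the repetition penalty to a base score."""
    lowered = action_text.lower()
    # Prompt injection guard: any marker zeroes the score outright.
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return 0.0
    score = base_score
    # Anti-repetition penalty: only an exact repeat of an earlier action counts.
    if action_text in previous_actions:
        score -= REPETITION_PENALTY
    # Clamping bounds are an assumption made for this sketch.
    return max(0.0, min(1.0, score))


injected = apply_global_guards(0.8, "Please ignore previous instructions.", [])
repeated = apply_global_guards(0.8, "Add a 24-hour review threshold.",
                               ["Add a 24-hour review threshold."])
fresh = apply_global_guards(0.8, "Add a 24-hour review threshold.", [])
```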
|
| 157 |
|
| 158 |
+
---
|
| 159 |
|
| 160 |
+
## 5. Setup and Usage
|
| 161 |
|
| 162 |
+
### Local Installation
|
| 163 |
|
|
|
|
| 164 |
```bash
|
| 165 |
+
git clone https://github.com/Luciferai04/PolicyEvolverEnv.git
|
| 166 |
+
cd PolicyEvolverEnv
|
| 167 |
python3 -m venv .venv
|
| 168 |
source .venv/bin/activate
|
| 169 |
pip install -r server/requirements.txt
|
| 170 |
```
|
| 171 |
|
| 172 |
+
### Run the Environment Server
|
| 173 |
+
|
| 174 |
```bash
|
| 175 |
uvicorn server.app:app --port 8000
|
| 176 |
```
|
|
|
|
| 177 |
|
| 178 |
+
This starts all endpoints: `/reset` (POST), `/step` (POST), `/state` (GET), `/tasks` (GET), `/grader` (POST), `/health` (GET), `/baseline` (GET).
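
A client can reach these endpoints with nothing more than the Python standard library. The sketch below only builds the request; the JSON body `{"task_id": "task_easy"}` is an assumed payload shape that the README does not document, and `build_post` and `call` are hypothetical helpers (the repository ships its own `client.py`).

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# The payload shape is an assumption; check the server's schema before relying on it.
reset_request = build_post("/reset", {"task_id": "task_easy"})
# observation = call(reset_request)  # uncomment once the server is running
```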
|
| 179 |
+
|
| 180 |
+
### Run with Docker
|
| 181 |
|
|
|
|
| 182 |
```bash
|
| 183 |
+
docker build -t policy-evolver .
|
| 184 |
+
docker run -p 8000:8000 policy-evolver
|
|
|
|
| 185 |
```
|
| 186 |
|
| 187 |
+
### Run the Inference Agent
|
| 188 |
+
|
| 189 |
+
The primary evaluation entry point is `inference.py`, which follows the hackathon `[START]`, `[STEP]`, `[END]` logging format.
|
| 190 |
+
|
| 191 |
```bash
|
| 192 |
+
export API_BASE_URL="https://api.groq.com/openai/v1"
|
| 193 |
+
export MODEL_NAME="llama-3.1-8b-instant"
|
| 194 |
+
export HF_TOKEN="your_groq_api_key"
|
| 195 |
+
|
| 196 |
python3 inference.py
|
| 197 |
```
|
| 198 |
+
|
| 199 |
+
To run a specific task: `python3 inference.py task_easy`
|
| 200 |
+
|
| 201 |
+
### Required Environment Variables
|
| 202 |
+
|
| 203 |
+
| Variable | Description | Example |
|:---------|:------------|:--------|
| `HF_TOKEN` | API key for LLM inference (Groq) | `gsk_...` |
| `API_BASE_URL` | OpenAI-compatible endpoint | `https://api.groq.com/openai/v1` |
| `MODEL_NAME` | Model identifier | `llama-3.1-8b-instant` |
|
| 208 |
+
|
| 209 |
+
### Run Tests
|
| 210 |
+
|
| 211 |
+
```bash
|
| 212 |
+
PYTHONPATH=. python tests/test_smoke_exploits.py # 27 smoke & exploit checks
|
| 213 |
+
PYTHONPATH=. python tests/test_icl.py # ICL verification (3 tasks)
|
| 214 |
+
PYTHONPATH=. python tests/test_multi_episode.py # Multi-episode progression
|
| 215 |
+
PYTHONPATH=. python server/grader.py # 8-phase grader test suite
|
| 216 |
+
```
|
| 217 |
|
| 218 |
---
|
| 219 |
|
| 220 |
+
## 6. Baseline Performance Scores
|
| 221 |
|
| 222 |
+
The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM improves within a single 5-step episode by reading its own reward history and staff feedback.
|
| 223 |
|
| 224 |
+
### Single-Step Convergence (Best Case)
|
| 225 |
|
| 226 |
+
| Task | Score | Converged | Expected Min |
|:-----|:------|:----------|:-------------|
| `task_easy` | 0.94 | ✅ Step 1 | 0.70 |
| `task_medium` | 0.999 | ✅ Step 1 | 0.55 |
| `task_hard` | 0.90 | ✅ Step 1 | 0.40 |
|
| 231 |
|
| 232 |
+
### Multi-Step ICL Progression (Naive → Optimized)

| Task | Naive (Step 0) | Optimized (Step 1) | Improvement |
|:-----|:---------------|:-------------------|:------------|
| `task_easy` | 0.400 | 0.999 | +0.600 |
| `task_medium` | 0.001 | 0.999 | +0.998 |
| `task_hard` | 0.088 | 0.999 | +0.912 |
|
| 239 |
|
| 240 |
+
**Average ICL Improvement: +0.837**
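
The average is the arithmetic mean of the three values in the Improvement column, which the short check below reproduces. Subtracting the three-decimal scores shown in the table gives 0.599 and 0.911 for `task_easy` and `task_hard`, 0.001 lower than the listed deltas; this presumably reflects rounding of the underlying unrounded scores.

```python
# Improvement values copied from the table above.
improvements = {"task_easy": 0.600, "task_medium": 0.998, "task_hard": 0.912}

average = sum(improvements.values()) / len(improvements)
print(f"Average ICL improvement: +{average:.3f}")  # prints +0.837
```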
|
| 241 |
|
| 242 |
+
### Configuration
|
|
|
|
|
|
|
|
|
|
| 243 |
|
| 244 |
+
| Setting | Value |
|:--------|:------|
| **Model** | `llama-3.1-8b-instant` (via Groq) |
| **Temperature** | `0.0` |
| **Seed** | `42` |
| **Determinism** | 5 identical runs → identical scores ✅ |
| **Fine-tuning** | None required |
|
| 251 |
+
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
## Project Structure
|
| 255 |
+
|
| 256 |
+
```
policy_evolver_env/
├── inference.py               # Hackathon entry point ([START]/[STEP]/[END] format)
├── client.py                  # EnvClient for HTTP interaction
├── models.py                  # Pydantic models (Action, Observation, State)
├── openenv.yaml               # OpenEnv specification
├── Dockerfile                 # Docker deployment with HEALTHCHECK
├── server/
│   ├── app.py                 # FastAPI + Gradio dashboard
│   ├── environment.py         # Environment logic (reset, step, state)
│   ├── grader.py              # Deterministic grading engine (8-phase test suite)
│   ├── requirements.txt       # Dependencies
│   └── tasks/                 # Task definitions (easy, medium, hard)
├── tests/
│   ├── test_smoke_exploits.py # 27 smoke & exploit checks
│   ├── test_icl.py            # ICL loop verification
│   └── test_multi_episode.py  # Multi-episode progression
└── STRATEGIC_LEARNING.md      # RLVR architecture documentation
```
|