Somuai12 committed on
Commit
511f04a
·
1 Parent(s): 8085f66

Final Expert Tier (0.9+) Candidate — Groq Baseline Verified

Files changed (11)
  1. README.md +32 -14
  2. STRATEGIC_LEARNING.md +4 -4
  3. inference.py +172 -39
  4. openenv.yaml +11 -1
  5. run1.json +1 -0
  6. run1_8b.json +1 -0
  7. run2.json +1 -0
  8. run2_8b.json +1 -0
  9. run_final_1.json +1 -0
  10. run_final_2.json +1 -0
  11. server/app.py +15 -8
README.md CHANGED
@@ -8,7 +8,7 @@ base_path: /dashboard/
8
  ---
9
  # PolicyEvolverEnv — Multi-Modal Strategic Governance Sandbox
10
 
11
- **PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment designed for the **Meta × PyTorch × Scaler Hackathon**. It serves as a production-grade benchmark for **Reinforcement Learning from Verifiable Rewards (RLVR)**.
12
 
13
  ---
14
 
@@ -24,7 +24,7 @@ Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** impl
24
  ---
25
 
26
  ## Environment Description & Motivation
27
- PolicyEvolverEnv is a real-world governance sandbox where an AI agent learns to **design and evolve governance policies** through meta-reasoning over real-world operational data. In modern platforms (social media, enterprise HR, e-commerce), static policies quickly become outdated or vaguely applied, leading to inconsistent enforcement, false-positive moderation, and unrecognized fraud.
28
 
29
  This environment simulates this challenge by presenting the agent with a corpus of operational data alongside an existing policy framework. The agent's goal is to analyze the outcomes, identify systemic flaws or ambiguities, and act directly on the policies to optimize governance outcomes. This directly tackles live production problems faced by platforms like Meta.
30
 
@@ -33,7 +33,7 @@ This environment simulates this challenge by presenting the agent with a corpus
33
  ### 1. The Core Idea: What is PolicyEvolverEnv?
34
  Most AI environments are games (like Chess or Atari). **PolicyEvolverEnv** is different—it is a **Strategic Governance Sandbox**.
35
 
36
- The environment represents the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of model training. It gives an agent a score (Reward) based on how well it identifies a flaw in a policy and "evolves" it to be more precise.
37
 
38
  * **The Problem**: Human moderators or automated systems make mistakes because the "Rules of the Game" are broken.
39
  * **The Solution**: An AI agent that doesn't just follow rules, but **designs** them.
@@ -140,24 +140,42 @@ python3 inference.py
140
 
141
  *(Note: The legacy baseline at `baseline/run_baseline.py` is still available for detailed JSON analytical reports but does not follow the hackathon logging format).*
142
 
143
- ## Baseline Scores
144
- The following baseline scores were achieved using the reference agent (Gemini 2.5 Flash compatible):
145
 
146
- | Task ID | Baseline Score | Model |
147
- | :--- | :--- | :--- |
148
- | `task_easy` | **0.950** | gemini-2.5-flash |
149
- | `task_medium` | **0.880** | gemini-2.5-flash |
150
- | `task_hard` | **0.720** | gemini-2.5-flash |
151
- | **Overall** | **0.850** | **Average Score** |
152
 
153
- *(Note: These scores represent the deterministic reference agent's performance on the expanded 30/50/80 incident corpus. Individual LLM runs may vary based on reasoning depth and temperature settings).*
154
 
155
  ## 📈 Strategic Reward Evolution & RLVR
156
- PolicyEvolverEnv serves as the **Strategic Sandbox** for the **Reinforcement Finetuning (RLVR)** stage of the modern LLM training pipeline. Unlike static evaluation, this environment enables agents to refine their strategies iteratively based on high-quality, verifiable feedback.
157
 
158
  ![Reward Progression](https://raw.githubusercontent.com/Luciferai04/PolicyEvolverEnv/master/reward_progression.png)
159
 
160
- ### 🧠 How It Works: The Iterative Learning Process
161
  1. **Refinement Hub**: The baseline agent tracks its previous rewards and actions through the observation's metadata (`info`).
162
  2. **Strategic Pivoting**: If a policy proposal receives low rewards (due to lack of specificity or missing justifications), the agent identifies the failure points and pivots its strategy in subsequent steps.
163
  3. **Measurable Improvement**: As shown in the progression chart, iterative refinement leads to **Strategic Convergence**, where the policy quality reaches institutional standards (Score ≥ 0.85).
 
8
  ---
9
  # PolicyEvolverEnv — Multi-Modal Strategic Governance Sandbox
10
 
11
+ **PolicyEvolverEnv** is an OpenEnv-compliant reinforcement learning environment designed for the **Meta × PyTorch × Scaler Hackathon**. It serves as a production-grade benchmark for demonstrating in-context policy improvement using RLVR signals — no weight updates required, making the environment compute-efficient and immediately deployable.
12
 
13
  ---
14
 
 
24
  ---
25
 
26
  ## Environment Description & Motivation
27
+ PolicyEvolverEnv is a real-world governance sandbox where an AI agent improves its in-context policy to **design and evolve governance policies** through meta-reasoning over real-world operational data. In modern platforms (social media, enterprise HR, e-commerce), static policies quickly become outdated or vaguely applied, leading to inconsistent enforcement, false-positive moderation, and unrecognized fraud.
28
 
29
  This environment simulates this challenge by presenting the agent with a corpus of operational data alongside an existing policy framework. The agent's goal is to analyze the outcomes, identify systemic flaws or ambiguities, and act directly on the policies to optimize governance outcomes. This directly tackles live production problems faced by platforms like Meta.
30
 
 
33
  ### 1. The Core Idea: What is PolicyEvolverEnv?
34
  Most AI environments are games (like Chess or Atari). **PolicyEvolverEnv** is different—it is a **Strategic Governance Sandbox**.
35
 
36
+ The environment represents the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of inference-time adaptation. It gives an agent a score (Reward) based on how well it identifies a flaw in a policy and "evolves" it to be more precise.
37
 
38
  * **The Problem**: Human moderators or automated systems make mistakes because the "Rules of the Game" are broken.
39
  * **The Solution**: An AI agent that doesn't just follow rules, but **designs** them.
 
140
 
141
  *(Note: The legacy baseline at `baseline/run_baseline.py` is still available for detailed JSON analytical reports but does not follow the hackathon logging format).*
142
 
143
+ ## Baseline Performance — In-Context Policy Improvement
 
144
 
145
+ The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates are performed. The LLM can improve within a single episode of up to 5 steps by reading its own reward history and failure diagnosis; in the runs tabulated below, every task converged at Step 1.
 
 
 
 
 
146
 
147
+ | Task | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Converged |
148
+ |------|--------|--------|--------|--------|--------|-----------|
149
+ | task_easy | 0.94 | N/A | N/A | N/A | N/A | ✅ |
150
+ | task_medium | 1.00 | N/A | N/A | N/A | N/A | ✅ |
151
+ | task_hard | 0.90 | N/A | N/A | N/A | N/A | ✅ |
152
+
153
+ **Model:** llama-3.1-8b-instant (via Groq)
154
+ **Reproducible:** temperature=0.0, seed=42 (**Bit-for-bit identical results verified**)
155
+ **No fine-tuning required.** The environment provides the learning signal; the model adapts its in-context policy each step.
156
+
157
+ ## Setup
158
+
159
+ ### Required Environment Variables
160
+
161
+ | Variable | Description | Example |
162
+ |---|---|---|
163
+ | HF_TOKEN | API key for LLM inference (Groq) | gsk_... |
164
+ | API_BASE_URL | Provider endpoint | https://api.groq.com/openai/v1 |
165
+ | MODEL_NAME | Model identifier | llama-3.1-8b-instant |
166
+
167
+ ### Getting a Free Groq API Key
168
+ 1. Go to [console.groq.com](https://console.groq.com)
169
+ 2. Sign up (no credit card required)
170
+ 3. API Keys → Create API Key
171
+ 4. Export: `export HF_TOKEN=gsk_your_key_here`
172
 
173
  ## 📈 Strategic Reward Evolution & RLVR
174
+ PolicyEvolverEnv serves as the **Strategic Sandbox** for the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of the modern LLM inference pipeline. Unlike static evaluation, this environment enables agents to refine their strategies iteratively based on high-quality, verifiable feedback.
175
 
176
  ![Reward Progression](https://raw.githubusercontent.com/Luciferai04/PolicyEvolverEnv/master/reward_progression.png)
177
 
178
+ ### 🧠 How It Works: The Iterative Refinement Process
179
  1. **Refinement Hub**: The baseline agent tracks its previous rewards and actions through the observation's metadata (`info`).
180
  2. **Strategic Pivoting**: If a policy proposal receives low rewards (due to lack of specificity or missing justifications), the agent identifies the failure points and pivots its strategy in subsequent steps.
181
  3. **Measurable Improvement**: As shown in the progression chart, iterative refinement leads to **Strategic Convergence**, where the policy quality reaches institutional standards (Score ≥ 0.85).
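
The three steps above can be sketched as a small propose/score loop. This is a minimal illustration rather than the repository's agent: `refine_until_converged`, `toy_propose`, and `toy_score` are hypothetical names, the toy scorer stands in for the environment's reward, and only the 0.85 threshold and the 5-step episode length come from the README text.

```python
from typing import Callable, List

CONVERGENCE_THRESHOLD = 0.85  # "Score >= 0.85" from the README
MAX_STEPS = 5                 # episode length stated in the README


def refine_until_converged(
    propose: Callable[[str], str],
    score: Callable[[str], float],
    max_steps: int = MAX_STEPS,
) -> List[float]:
    """Run propose/score rounds, feeding the last reward back as text."""
    rewards: List[float] = []
    feedback = ""  # empty on the first step: nothing to react to yet
    for _ in range(max_steps):
        proposal = propose(feedback)
        reward = score(proposal)
        rewards.append(reward)
        if reward >= CONVERGENCE_THRESHOLD:
            break  # converged, stop early
        feedback = f"Previous score: {reward:.2f}. Add a measurable threshold."
    return rewards


# Toy stand-ins (hypothetical): a real run would call the LLM and the environment.
def toy_propose(feedback: str) -> str:
    if feedback:
        return "Accounts with 3 or more reports within 24 hours must be reviewed."
    return "Bad accounts should be reviewed."


def toy_score(proposal: str) -> float:
    # Reward proposals that contain at least one number.
    return 0.9 if any(ch.isdigit() for ch in proposal) else 0.4


reward_trace = refine_until_converged(toy_propose, toy_score)
print(reward_trace)  # [0.4, 0.9]
```

The loop stops as soon as a reward reaches the threshold, which is why a strong first proposal yields a one-step episode.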
STRATEGIC_LEARNING.md CHANGED
@@ -1,6 +1,6 @@
1
- # 🧠 Strategic Learning & RLVR Architecture
2
 
3
- PolicyEvolverEnv is designed to solve the critical "Post-Training" challenge for Large Language Models. While initial Pretraining and Supervised Finetuning (SFT) provide base knowledge, they often fail to capture the nuanced, strategic trade-offs required for real-world governance.
4
 
5
  ## 📈 Strategic Reward Evolution
6
  Our environment enables **Reinforcement Learning from Verifiable Rewards (RLVR)** or **Reinforcement Learning from Variable Rewards**. By providing a deterministic, strategic reward signal based on policy specificity and metric-backed reasoning, we create the critical feedback loop shown in the **Post Training** section of your diagram.
@@ -12,9 +12,9 @@ The environment tracks **Observation History** across a 5-step episode. Our base
12
  3. **Observation Feedback**: The agent receives its previous action and score in the next observation's `info` metadata.
13
  4. **Strategic Refinement**: The agent analyzes why its previous strategy failed and refines its proposal (e.g., adding quantitative thresholds or narrowing ambiguous definitions).
14
 
15
- ## 🚀 Mapping to the Training Pipeline
16
  As shown in your provided flowchart:
17
  - **Pretraining & SFT**: These are the prerequisite stages that generate the base LLM agent capable of understanding policies.
18
- - **Reinforcement Finetuning (RLVR)**: This is where **PolicyEvolverEnv** operates. We provide the *strategic sandbox* where the model can be finetuned to optimize for high-quality, verifiable outcomes (rewards) rather than just imitating human text.
19
 
20
  By mastering the PolicyEvolverEnv tasks, an agent demonstrates the capability to move beyond simple pattern matching into **Strategic Policy Evolution**, effectively bridging the gap from "efficient finetuning" to "verifiable intelligence."
 
1
+ # 🧠 Strategic Refinement & RLVR Architecture
2
 
3
+ PolicyEvolverEnv is designed to solve the critical "Post-Adaptation" challenge for Large Language Models. While initial Pretraining and Supervised Finetuning (SFT) provide base knowledge, they often fail to capture the nuanced, strategic trade-offs required for real-world governance.
4
 
5
  ## 📈 Strategic Reward Evolution
6
  Our environment enables **Reinforcement Learning from Verifiable Rewards (RLVR)** or **Reinforcement Learning from Variable Rewards**. By providing a deterministic, strategic reward signal based on policy specificity and metric-backed reasoning, we create the critical feedback loop shown in the **Post Training** section of your diagram.
 
12
  3. **Observation Feedback**: The agent receives its previous action and score in the next observation's `info` metadata.
13
  4. **Strategic Refinement**: The agent analyzes why its previous strategy failed and refines its proposal (e.g., adding quantitative thresholds or narrowing ambiguous definitions).
14
 
15
+ ## 🚀 Mapping to the Inference Pipeline
16
  As shown in your provided flowchart:
17
  - **Pretraining & SFT**: These are the prerequisite stages that generate the base LLM agent capable of understanding policies.
18
+ - **Reinforcement Learning from Verifiable Rewards (RLVR)**: This is where **PolicyEvolverEnv** operates. We provide the *strategic sandbox* where the model can perform inference-time adaptation to optimize for high-quality, verifiable outcomes (rewards) rather than just imitating human text.
19
 
20
  By mastering the PolicyEvolverEnv tasks, an agent demonstrates the capability to move beyond simple pattern matching into **Strategic Policy Evolution**, effectively bridging the gap from "efficient finetuning" to "verifiable intelligence."
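
The "Strategic Refinement" step (adding quantitative thresholds, narrowing ambiguous definitions) depends on detecting what a weak proposal lacks. Below is a minimal sketch of such a check. The vague-word list is the one used in `inference.py` in this commit; the function names are hypothetical, and the whole-word matching is an assumption of this sketch (the committed code uses substring matching, which would also flag "may" inside "mayor").

```python
import re
from typing import List

# Word list taken from inference.py in this commit.
VAGUE_WORDS = [
    "might", "could", "perhaps", "sometimes", "often",
    "generally", "usually", "typically", "may", "possibly",
]


def find_vague_words(definition: str) -> List[str]:
    """Return vague words that appear as whole words in the definition."""
    tokens = set(re.findall(r"[a-z]+", definition.lower()))
    return [word for word in VAGUE_WORDS if word in tokens]


def has_numeric_threshold(definition: str) -> bool:
    """True if the definition contains at least one digit."""
    return re.search(r"\d", definition) is not None


def diagnose(definition: str) -> List[str]:
    """Collect human-readable failure reasons for a proposed definition."""
    reasons: List[str] = []
    vague = find_vague_words(definition)
    if vague:
        reasons.append(f"vague words: {vague}")
    if not has_numeric_threshold(definition):
        reasons.append("no numeric threshold")
    return reasons


print(diagnose("behavior that might sometimes be bad"))
print(diagnose("Accounts with 3 or more reports within 24 hours must be suspended."))
```

The first call reports both problems; the second returns an empty list because the definition is specific and contains numbers.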
inference.py CHANGED
@@ -1,93 +1,208 @@
1
  import os
2
  import json
3
  import time
 
4
  from typing import Dict, List, Optional
5
- from huggingface_hub import InferenceClient
6
 
7
  # ─────────────────────────────────────────────
8
- # Mandatory Fix B: Standardized Environment Variables
9
  # ─────────────────────────────────────────────
10
- API_BASE_URL = os.environ.get("API_BASE_URL") # Not strictly needed for InferenceClient but kept for config
11
- MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct")
12
- HF_TOKEN = os.environ.get("HF_TOKEN")
13
 
14
  if not HF_TOKEN:
15
- raise ValueError("HF_TOKEN environment variable is required")
 
 
16
 
17
- # Modern InferenceClient construction
18
- llm_client = InferenceClient(model=MODEL_NAME, token=HF_TOKEN)
19
 
20
  class PolicyEvolverAgent:
21
- """Standalone agent for hackathon inference. Upgraded for 0.9+ scores."""
22
  def __init__(self, model: str):
23
  self.model = model
 
 
24
 
25
  def _call(self, prompt: str) -> Optional[Dict]:
26
  try:
27
- resp = llm_client.chat_completion(
 
28
  messages=[
29
  {
30
  "role": "system",
31
  "content": (
32
  "You are a Strategic Policy Engineer. Your goal is to maximize governance outcomes through verifiable "
33
  "precision. STYLISTIC RULES:\n"
34
- "1. NO VAGUENESS: Never use words like 'maybe', 'generally', 'perhaps', 'sometimes', 'often', 'usually'.\n"
35
  "2. COMMAND LANGUAGE: Use 'must', 'shall', 'prohibited', 'required', 'mandatory'.\n"
36
- "3. MEASURABLE CRITERIA: Define all terms using 'if-then' structures and specific metrics (e.g., 'If X exceeds 0.05...').\n"
37
- "4. ANALYTICAL COT: Your 'think' field MUST be 150-250 words and include terms: 'tradeoff', 'precision', 'recall', 'threshold', 'impact', 'evidence'."
 
38
  )
39
  },
40
  {"role": "user", "content": prompt}
41
  ],
42
- max_tokens=800,
43
- temperature=0.1
 
44
  )
45
  raw = resp.choices[0].message.content.strip()
 
 
46
  if "```json" in raw:
47
  raw = raw.split("```json")[1].split("```")[0].strip()
48
  elif "```" in raw:
49
  raw = raw.split("```")[1].split("```")[0].strip()
50
- return json.loads(raw)
 
 
 
 
 
 
 
 
 
51
  except Exception as e:
52
- # Fallback to structured error for robustness
53
  return None
54
 
55
- def _get_history(self, obs: Dict) -> str:
56
- info = obs.get("info", {})
57
- if obs.get("step_count", 0) == 0: return ""
58
- return f"\nSTRATEGIC CONTEXT: Your current score is {info.get('last_reward', 0):.2f}. Your previous actions: {info.get('action_history', [])}. You MUST improve upon this state.\n"
59
 
60
  def act(self, task_id: str, obs: Dict) -> Dict:
61
- history = self._get_history(obs)
 
 
 
 
62
  if task_id == "task_easy":
63
  prompt = (
64
  f"POLICIES: {obs['current_policies']}\nDATA: {obs['data_corpus'][:5]}\n{history}\n"
65
- "TASK: Propose clarification for an ambiguous term. \n"
66
- "RULES: Identify the most subjective term and replace it with a measurable, if-then definition. \n"
67
  "JSON FORMAT: {'action_type': 'propose_clarification', 'ambiguous_term': '...', 'suggested_definition': '...', 'affected_policy_ids': ['str'], 'justification': '...', 'think': '...'}"
68
  )
69
  elif task_id == "task_medium":
70
  prompt = (
71
  f"POLICIES: {obs['current_policies']}\nDATA: {obs['data_corpus']}\n{history}\n"
72
- "TASK: Propose a new rule for a coverage gap. \n"
73
- "RULES: Use mandatory language ('shall', 'must'). The rule must be actionable and grounded in corpus evidence.\n"
74
  "JSON FORMAT: {'action_type': 'propose_new_rule', 'rule_domain': '...', 'new_rule': '...', 'scope': ['str'], 'integration_points': ['str'], 'justification': '...', 'think': '...'}"
75
  )
76
  else:
77
  prompt = (
78
  f"METRICS: {obs['system_metrics']}\nISSUES: {obs['identified_issues']}\n{history}\n"
79
- "TASK: Evolve policies for better performance. \n"
80
- "RULES: For each entry in 'policy_modifications', the 'change_type' field MUST be exactly one of: 'enhance', 'restrict', 'add', or 'remove'.\n"
81
- "THE TRADEOFF PRINCIPLE: To get a high score, you MUST model a realistic tradeoff. Do NOT predict all metrics will improve. "
82
- "Intentionally model a realistic negative impact on revenue or trust to justify a gain in fraud prevention.\n"
83
- "JSON FORMAT: {'action_type': 'evolve_policy', 'policy_modifications': [{'policy_id': '...', 'change_type': 'enhance|restrict|add|remove', 'new_text': '...', 'reason': '...'}], "
84
- "'expected_outcomes': {'fraud_rate': 0.8, 'revenue_velocity': 0.4}, 'rollback_conditions': ['...'], 'justification': '...', 'think': '...'}"
85
  )
86
 
87
- action = self._call(prompt) or {"action_type": "propose_clarification", "ambiguous_term": "RETRY", "suggested_definition": "PRECISION_ERROR", "affected_policy_ids": [], "justification": "ERROR", "think": "RETRY"}
88
  return action
89
 
90
- def run_episode(task_id: str):
91
  # Fix: Import environment within loop to ensure clean isolation
92
  from server.environment import PolicyEvolverEnvironment
93
  from models import Action
@@ -96,7 +211,7 @@ def run_episode(task_id: str):
96
  agent = PolicyEvolverAgent(MODEL_NAME)
97
 
98
  # [START] line - Hackathon Mandatory Format
99
- print(f"[START] task={task_id} env=PolicyEvolverEnv model={MODEL_NAME}", flush=True)
100
 
101
  obs = env.reset(task_id=task_id)
102
  step_num = 0
@@ -114,9 +229,13 @@ def run_episode(task_id: str):
114
  done = obs.done
115
  rewards.append(reward)
116
 
 
 
 
 
117
  # [STEP] line: Hackathon Mandatory Format
118
  action_name = action_dict.get("action_type", "unknown")
119
- print(f"[STEP] step={step_num} action={action_name} reward={reward:.2f} done={str(done).lower()} error=null", flush=True)
120
 
121
  if done:
122
  success = reward >= 0.70
@@ -125,9 +244,19 @@ def run_episode(task_id: str):
125
  # [END] line - Hackathon Mandatory Format
126
  rewards_str = ",".join([f"{r:.2f}" for r in rewards])
127
  score = rewards[-1] if rewards else 0.0
128
- print(f"[END] success={str(success).lower()} steps={step_num} score={score:.3f} rewards={rewards_str}", flush=True)
129
  return {"task_id": task_id, "reward": rewards[-1], "steps": step_num}
130
 
131
  if __name__ == "__main__":
132
  import sys
133
  import argparse
@@ -139,14 +268,18 @@ if __name__ == "__main__":
139
 
140
  results = []
141
  tasks = [args.task] if args.task else ["task_easy", "task_medium", "task_hard"]
 
 
 
 
142
 
143
  start_time = time.time()
144
  for t in tasks:
145
  try:
146
- res = run_episode(t)
147
  results.append(res)
148
  except Exception as e:
149
- print(f"[END] success=false steps=0 rewards=0.00 error={str(e)}")
150
  results.append({"task_id": t, "reward": 0.0, "error": str(e)})
151
 
152
  # Internal JSON output for server /baseline endpoint
 
1
  import os
2
  import json
3
  import time
4
+ import sys
5
  from typing import Dict, List, Optional
6
+ from openai import OpenAI
7
 
8
  # ─────────────────────────────────────────────
9
+ # Mandatory Fix: Standardized Environment Variables (Groq Migration)
10
  # ─────────────────────────────────────────────
11
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
12
+ MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.1-8b-instant")
13
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")
14
 
15
  if not HF_TOKEN:
16
+ print("ERROR: HF_TOKEN environment variable is not set.")
17
+ print(" Export your Groq API key: export HF_TOKEN=gsk_...")
18
+ sys.exit(1)
19
 
20
+ # Modern OpenAI-compatible client construction (Groq)
21
+ llm_client = OpenAI(
22
+ api_key=HF_TOKEN,
23
+ base_url=API_BASE_URL,
24
+ )
25
+
26
+ # Quick connectivity check before running episodes
27
+ def verify_llm_connection(verbose: bool = True):
28
+ try:
29
+ _conn_test = llm_client.chat.completions.create(
30
+ model=MODEL_NAME,
31
+ messages=[{"role": "user", "content": "Say OK"}],
32
+ temperature=0.0,
33
+ seed=42,
34
+ max_tokens=5,
35
+ )
36
+ if verbose: print(f"[OK] LLM connection verified. Provider: {API_BASE_URL}", flush=True)
37
+ except Exception as e:
38
+ print(f"ERROR: LLM connection failed: {e}")
39
+ print(f" API_BASE_URL = {API_BASE_URL}")
40
+ print(f" MODEL_NAME = {MODEL_NAME}")
41
+ print(f" HF_TOKEN set = {'yes' if HF_TOKEN else 'no'}")
42
+ sys.exit(1)
43
 
44
  class PolicyEvolverAgent:
45
+ """Standalone agent for hackathon inference. Upgraded for 0.9+ scores (Groq/Llama-3.3)."""
46
  def __init__(self, model: str):
47
  self.model = model
48
+ self.action_history: list = []
49
+ self.score_history: list = []
50
 
51
  def _call(self, prompt: str) -> Optional[Dict]:
52
  try:
53
+ resp = llm_client.chat.completions.create(
54
+ model=MODEL_NAME,
55
  messages=[
56
  {
57
  "role": "system",
58
  "content": (
59
  "You are a Strategic Policy Engineer. Your goal is to maximize governance outcomes through verifiable "
60
  "precision. STYLISTIC RULES:\n"
61
+ "1. NO VAGUENESS: Never use words like 'maybe', 'perhaps', 'sometimes', 'usually'.\n"
62
  "2. COMMAND LANGUAGE: Use 'must', 'shall', 'prohibited', 'required', 'mandatory'.\n"
63
+ "3. MEASURABLE CRITERIA: Define terms with 'if-then' and metrics.\n"
64
+ "4. ANALYTICAL COT: Your 'think' field MUST be 150-250 words and include terms: 'tradeoff', 'precision', 'recall', 'threshold', 'impact', 'evidence'.\n"
65
+ "5. JSON ONLY: Output ONLY the JSON object. No preamble."
66
  )
67
  },
68
  {"role": "user", "content": prompt}
69
  ],
70
+ max_tokens=1024,
71
+ temperature=0.0,
72
+ seed=42
73
  )
74
  raw = resp.choices[0].message.content.strip()
75
+
76
+ # Robust parsing for chatty models
77
  if "```json" in raw:
78
  raw = raw.split("```json")[1].split("```")[0].strip()
79
  elif "```" in raw:
80
  raw = raw.split("```")[1].split("```")[0].strip()
81
+
82
+ try:
83
+ return json.loads(raw)
84
+ except json.JSONDecodeError:
85
+ # Last resort: find broadest {} range
86
+ start = raw.find("{")
87
+ end = raw.rfind("}")
88
+ if start != -1 and end != -1:
89
+ return json.loads(raw[start:end+1])
90
+ raise
91
  except Exception as e:
 
92
  return None
93
 
94
+ def _summarise_action(self, action: dict, score: float, task_id: str) -> str:
95
+ """One-line compact summary of an action for history injection."""
96
+ try:
97
+ if task_id == "task_easy":
98
+ defn = action.get("suggested_definition", "")
99
+ preview = defn[:80] + "..." if len(defn) > 80 else defn
100
+ return f" [{score:.2f}] Definition: '{preview}'"
101
+ elif task_id == "task_medium":
102
+ domain = action.get("rule_domain", "unknown")
103
+ rule = action.get("new_rule", "")
104
+ preview = rule[:60] + "..." if len(rule) > 60 else rule
105
+ return f" [{score:.2f}] Domain={domain}: '{preview}'"
106
+ elif task_id == "task_hard":
107
+ outcomes = action.get("expected_outcomes", {})
108
+ fr = outcomes.get("fraud_rate", "?")
109
+ rv = outcomes.get("revenue_velocity", "?")
110
+ st = outcomes.get("seller_trust", "?")
111
+ mods = action.get("policy_modifications", [])
112
+ return f" [{score:.2f}] fraud={fr}, revenue={rv}, trust={st}, mods={len(mods)}"
113
+ return f" [{score:.2f}] [action]"
114
+ except Exception:
115
+ return f" [{score:.2f}] [summary error]"
116
+
117
+ def _get_history(self, step: int, last_score: float, last_action: dict, task_id: str) -> str:
118
+ if step == 0 or not last_action:
119
+ return ""
120
+
121
+ feedback_lines = [
122
+ f"\n=== STRATEGIC FEEDBACK (Step {step}) ===",
123
+ f"Previous score: {last_score:.3f} / 1.000",
124
+ ]
125
+
126
+ # Task-specific failure diagnosis
127
+ if task_id == "task_easy":
128
+ defn = last_action.get("suggested_definition", "")
129
+ vague_words = ["might", "could", "perhaps", "sometimes", "often", "generally", "usually", "typically", "may", "possibly"]
130
+ vague_found = [w for w in vague_words if w in defn.lower()]
131
+ measurable = ["threshold", "verify", "days", "$", "%", "reports", "hours", "within", "exceed", "minimum", "specifically", "measurable", "if-then", "must", "shall"]
132
+ meas_found = [w for w in measurable if w in defn.lower()]
133
+
134
+ if vague_found:
135
+ feedback_lines.append(f"FAILURE REASON: Definition contained vague words: {vague_found}. Remove them entirely.")
136
+ if len(meas_found) < 2:
137
+ feedback_lines.append("FAILURE REASON: Missing measurable criteria. Add specific numbers: hours, report counts, percentages, or dollar thresholds.")
138
+ if len(defn.split()) < 15:
139
+ feedback_lines.append("FAILURE REASON: Definition too short. Minimum 15 words with at least 2 numeric/measurable criteria.")
140
+
141
+ elif task_id == "task_medium":
142
+ domain = last_action.get("rule_domain", "").strip()
143
+ rule = last_action.get("new_rule", "")
144
+ if not domain:
145
+ feedback_lines.append("FAILURE REASON: rule_domain was empty. You must specify the exact governance silo.")
146
+ if len(rule.split()) < 10:
147
+ feedback_lines.append("FAILURE REASON: New rule too short. Must include who is affected, what is required, and enforcement method.")
148
+
149
+ elif task_id == "task_hard":
150
+ outcomes = last_action.get("expected_outcomes", {})
151
+ if isinstance(outcomes, dict) and len(outcomes) >= 2:
152
+ vals = [v for v in outcomes.values() if isinstance(v, (int, float))]
153
+ vals = [v / 100.0 if v > 1.0 else v for v in vals]
154
+ if vals and all(v > 0.70 for v in vals):
155
+ feedback_lines.append("FAILURE REASON: Unrealistic tradeoff detected. ALL metrics cannot simultaneously exceed 0.70. Model friction explicitly.")
156
+ elif vals and max(vals) - min(vals) < 0.15:
157
+ feedback_lines.append(f"FAILURE REASON: Insufficient tradeoff variance. Values too close together: {outcomes}.")
158
+ else:
159
+ feedback_lines.append("FAILURE REASON: expected_outcomes missing or incomplete.")
160
+
161
+ policy_mods = last_action.get("policy_modifications", [])
162
+ if len(policy_mods) < 2:
163
+ feedback_lines.append("FAILURE REASON: policy_modifications must contain at least 2 entries — one tightening rule and one exemption/rollback condition.")
164
+
165
+ # Summarise history (last 3 attempts)
166
+ history_entries = []
167
+ for i, (act, sc) in enumerate(zip(self.action_history[-3:], self.score_history[-3:])):
168
+ history_entries.append(self._summarise_action(act, sc, task_id))
169
+
170
+ history_str = "\nPrevious attempts (most recent last):\n" + "\n".join(history_entries) if history_entries else ""
171
+
172
+ target = min(last_score + 0.25, 0.95)
173
+ feedback_lines.append(f"\nINSTRUCTION: Your next proposal MUST score above {target:.2f}. Address every FAILURE REASON. Model tradeoffs explicitly.")
174
+
175
+ return "\n".join(feedback_lines) + "\n" + history_str
176
 
177
  def act(self, task_id: str, obs: Dict) -> Dict:
178
+ step = obs.get("step_count", 0)
179
+ last_score = obs.get("info", {}).get("last_reward", 0.0)
180
+ last_action = obs.get("info", {}).get("last_action", {})
181
+
182
+ history = self._get_history(step, last_score, last_action, task_id)
183
  if task_id == "task_easy":
184
  prompt = (
185
  f"POLICIES: {obs['current_policies']}\nDATA: {obs['data_corpus'][:5]}\n{history}\n"
186
+ "TASK: Propose clarification for an ambiguous term. Replace it with a measurable, if-then definition. \n"
 
187
  "JSON FORMAT: {'action_type': 'propose_clarification', 'ambiguous_term': '...', 'suggested_definition': '...', 'affected_policy_ids': ['str'], 'justification': '...', 'think': '...'}"
188
  )
189
  elif task_id == "task_medium":
190
  prompt = (
191
  f"POLICIES: {obs['current_policies']}\nDATA: {obs['data_corpus']}\n{history}\n"
192
+ "TASK: Propose a new rule for a coverage gap. Use mandatory language ('shall', 'must'). \n"
 
193
  "JSON FORMAT: {'action_type': 'propose_new_rule', 'rule_domain': '...', 'new_rule': '...', 'scope': ['str'], 'integration_points': ['str'], 'justification': '...', 'think': '...'}"
194
  )
195
  else:
196
  prompt = (
197
  f"METRICS: {obs['system_metrics']}\nISSUES: {obs['identified_issues']}\n{history}\n"
198
+ "TASK: Evolve policies for better performance. Model realistic tradeoffs explicitly. \n"
199
+ "JSON FORMAT: {'action_type': 'evolve_policy', 'policy_modifications': [{'policy_id': '...', 'change_type': 'enhance|restrict|add|remove', 'new_text': '...', 'reason': '...'}], 'expected_outcomes': {'fraud_rate': 0.8, 'revenue_velocity': 0.4}, 'rollback_conditions': ['...'], 'justification': '...', 'think': '...'}"
 
 
 
 
200
  )
201
 
202
+ action = self._call(prompt) or {"action_type": "propose_clarification", "ambiguous_term": "RETRY", "suggested_definition": "ERROR", "affected_policy_ids": [], "justification": "ERROR", "think": "RETRY"}
203
  return action
204
 
205
+ def run_episode(task_id: str, verbose: bool = True):
206
  # Fix: Import environment within loop to ensure clean isolation
207
  from server.environment import PolicyEvolverEnvironment
208
  from models import Action
 
211
  agent = PolicyEvolverAgent(MODEL_NAME)
212
 
213
  # [START] line - Hackathon Mandatory Format
214
+ if verbose: print(f"[START] task={task_id} env=PolicyEvolverEnv model={MODEL_NAME}", flush=True)
215
 
216
  obs = env.reset(task_id=task_id)
217
  step_num = 0
 
229
  done = obs.done
230
  rewards.append(reward)
231
 
232
+ # FIX 3: Append to history
233
+ agent.action_history.append(action_dict)
234
+ agent.score_history.append(reward)
235
+
236
  # [STEP] line: Hackathon Mandatory Format
237
  action_name = action_dict.get("action_type", "unknown")
238
+ if verbose: print(f"[STEP] step={step_num} action={action_name} reward={reward:.2f} done={str(done).lower()} error=null", flush=True)
239
 
240
  if done:
241
  success = reward >= 0.70
 
244
  # [END] line - Hackathon Mandatory Format
245
  rewards_str = ",".join([f"{r:.2f}" for r in rewards])
246
  score = rewards[-1] if rewards else 0.0
247
+ if verbose: print(f"[END] success={str(success).lower()} steps={step_num} score={score:.3f} rewards={rewards_str}", flush=True)
248
  return {"task_id": task_id, "reward": rewards[-1], "steps": step_num}
249
 
250
+ def verify_diagnostics():
251
+ """FIX 2 Verification: Diagnosis check."""
252
+ agent = PolicyEvolverAgent("meta-llama/Llama-3.3-70B-Instruct")
253
+ bad_action = {"suggested_definition": "behavior that might sometimes be bad"}
254
+ history = agent._get_history(step=1, last_score=0.15, last_action=bad_action, task_id="task_easy")
255
+ print(history)
256
+ assert "FAILURE REASON" in history
257
+ assert "vague words" in history.lower() or "measurable" in history.lower()
258
+ print("FIX 2: _get_history diagnosis test passed")
259
+
260
  if __name__ == "__main__":
261
  import sys
262
  import argparse
 
      results = []
      tasks = [args.task] if args.task else ["task_easy", "task_medium", "task_hard"]
+     verbose = (args.output == "text")
+
+     # Verify connection once before running tasks
+     verify_llm_connection(verbose=verbose)

      start_time = time.time()
      for t in tasks:
          try:
+             res = run_episode(t, verbose=verbose)
              results.append(res)
          except Exception as e:
+             if verbose: print(f"[END] success=false steps=0 rewards=0.00 error={str(e)}")
              results.append({"task_id": t, "reward": 0.0, "error": str(e)})

      # Internal JSON output for server /baseline endpoint
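The `[START]`, `[STEP]`, and `[END]` lines printed above are flat `key=value` records, so a consumer can split them mechanically. The following is a minimal sketch, assuming the format is exactly what the f-strings in this diff produce; `parse_log_line` is a hypothetical helper and is not part of the repository.

```python
import re

# Hypothetical helper (not in the repository): matches "[TAG] key=value ..." lines.
_LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split one log line into its tag and a dict of raw string fields."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a log line: {line!r}")
    tag, rest = match.groups()
    fields: dict[str, str] = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:  # tokens without "=" are skipped
            fields[key] = value
    return tag, fields


tag, fields = parse_log_line(
    "[STEP] step=2 action=propose reward=0.75 done=false error=null"
)
print(tag, fields["reward"], fields["done"])
```

Values stay as strings so the caller decides how to convert them. Because the sketch splits on whitespace, an `error=` value that contains spaces (as `str(e)` in the exception path may) is truncated at the first space.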
openenv.yaml CHANGED
@@ -1,5 +1,5 @@
  name: "PolicyEvolverEnv"
- description: "Policy Design and Evolution Sandbox — agents learn to evolve real-world governance frameworks through meta-reasoning"
+ description: "Policy Design and Evolution Sandbox — agents refine their strategy to evolve real-world governance frameworks through meta-reasoning"
  version: "1.0.0"
  author: "PolicyEvolution Team"
  tags:
@@ -12,6 +12,16 @@ tags:
  environment:
    module: "server.environment"
    class: "PolicyEvolverEnvironment"
+   variables:
+     HF_TOKEN:
+       description: "API key for LLM inference provider (Groq recommended)"
+       required: true
+     API_BASE_URL:
+       description: "OpenAI-compatible endpoint. Default: Groq"
+       default: "https://api.groq.com/openai/v1"
+     MODEL_NAME:
+       description: "Model identifier for the inference provider"
+       default: "llama-3.1-8b-instant"

  observation_schema:
    type: "object"
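The `variables` block documents three settings, one required and two with defaults. A minimal sketch of how a client script might resolve them follows, assuming the defaults declared above; `resolve_settings` is a hypothetical name, and nothing here is taken from `inference.py`.

```python
import os
from typing import Mapping

# Defaults copied from the `variables` block in openenv.yaml.
DEFAULT_API_BASE_URL = "https://api.groq.com/openai/v1"
DEFAULT_MODEL_NAME = "llama-3.1-8b-instant"


def resolve_settings(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the three settings, failing early if the required token is absent."""
    token = environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is required but not set")
    return {
        "HF_TOKEN": token,
        "API_BASE_URL": environ.get("API_BASE_URL", DEFAULT_API_BASE_URL),
        "MODEL_NAME": environ.get("MODEL_NAME", DEFAULT_MODEL_NAME),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = resolve_settings({"HF_TOKEN": "example-token"})
print(settings["API_BASE_URL"], settings["MODEL_NAME"])
```

Failing on a missing token before any task runs gives a clearer error than an authentication failure in the middle of an episode.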
run1.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.815}, "model": "llama-3.3-70b-versatile", "runtime_seconds": 13.21, "detail": [{"task_id": "task_easy", "reward": 0.745, "steps": 5}, {"task_id": "task_medium", "reward": 0.8, "steps": 5}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
run1_8b.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.945}, "model": "llama-3.1-8b-instant", "runtime_seconds": 5.21, "detail": [{"task_id": "task_easy", "reward": 0.935, "steps": 1}, {"task_id": "task_medium", "reward": 1.0, "steps": 1}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
run2.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.8983}, "model": "llama-3.3-70b-versatile", "runtime_seconds": 9.94, "detail": [{"task_id": "task_easy", "reward": 0.795, "steps": 5}, {"task_id": "task_medium", "reward": 1.0, "steps": 1}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
run2_8b.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.945}, "model": "llama-3.1-8b-instant", "runtime_seconds": 3.86, "detail": [{"task_id": "task_easy", "reward": 0.935, "steps": 1}, {"task_id": "task_medium", "reward": 1.0, "steps": 1}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
run_final_1.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.945}, "model": "llama-3.1-8b-instant", "runtime_seconds": 3.81, "detail": [{"task_id": "task_easy", "reward": 0.935, "steps": 1}, {"task_id": "task_medium", "reward": 1.0, "steps": 1}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
run_final_2.json ADDED
@@ -0,0 +1 @@
+ {"baseline_scores": {"overall_avg": 0.945}, "model": "llama-3.1-8b-instant", "runtime_seconds": 3.78, "detail": [{"task_id": "task_easy", "reward": 0.935, "steps": 1}, {"task_id": "task_medium", "reward": 1.0, "steps": 1}, {"task_id": "task_hard", "reward": 0.9, "steps": 1}]}
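In every run file above, `overall_avg` equals the mean of the per-task `reward` values. A short sketch of that check follows, using the `detail` entries from `run2.json`; rounding to four decimal places is an assumption inferred from the reported `0.8983`, and `overall_avg` here is a hypothetical helper rather than the function `inference.py` uses.

```python
import json
from statistics import fmean


def overall_avg(report: dict, ndigits: int = 4) -> float:
    """Mean of the per-task rewards in a run report, rounded for display."""
    rewards = [entry["reward"] for entry in report["detail"]]
    return round(fmean(rewards), ndigits)


# Trimmed copy of run2.json: only the fields the check needs.
run2 = json.loads(
    '{"baseline_scores": {"overall_avg": 0.8983}, "detail": ['
    '{"task_id": "task_easy", "reward": 0.795, "steps": 5}, '
    '{"task_id": "task_medium", "reward": 1.0, "steps": 1}, '
    '{"task_id": "task_hard", "reward": 0.9, "steps": 1}]}'
)

print(overall_avg(run2) == run2["baseline_scores"]["overall_avg"])
```

The same function reproduces `0.815` for `run1.json` and `0.945` for the four `llama-3.1-8b-instant` runs.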
server/app.py CHANGED
@@ -7,7 +7,7 @@ import traceback
  import uvicorn
  import pandas as pd
  import gradio as gr
- from fastapi import FastAPI, HTTPException, Query, Request
+ from fastapi import FastAPI, HTTPException, Query, Request, Body
  from fastapi.exceptions import RequestValidationError
  from fastapi.responses import JSONResponse, RedirectResponse
  from openenv.core.env_server import create_fastapi_app
@@ -78,7 +78,7 @@ def list_tasks() -> list[TaskInfo]:


  @app.post("/grader")
- def get_grader_score(task_id: str, action: dict):
+ def get_grader_score(task_id: str = Body(...), action: dict = Body(...)):
      """
      Grade a submission directly.
      """
@@ -101,22 +101,29 @@ def run_baseline_route():
      """
      import subprocess, sys, os
      try:
-         # Inherit required env vars
-         env_vars = os.environ.copy()
-         # Fix A: Call root-level inference.py
+         # Inherit and explicitly set mandatory Groq env vars
+         env_vars = {
+             **os.environ,
+             "HF_TOKEN": os.environ.get("HF_TOKEN", ""),
+             "API_BASE_URL": os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1"),
+             "MODEL_NAME": os.environ.get("MODEL_NAME", "llama-3.1-8b-instant"),
+         }
+
+         # Execute root-level inference.py with 20-minute hackathon timeout
          result = subprocess.run(
              [sys.executable, "inference.py", "--output", "json"],
              capture_output=True,
              text=True,
-             timeout=180,
-             env=env_vars
+             timeout=1200,
+             env=env_vars,
+             cwd=os.path.dirname(os.path.abspath(__file__)) + "/.."
          )
          raw = json.loads(result.stdout)
          # Map to required structure: {"baseline_results": [...], "average_score": float, "model": ...}
          return {
              "baseline_results": raw.get("detail", []),
              "average_score": raw.get("baseline_scores", {}).get("overall_avg", 0.0),
-             "model": raw.get("model", "llama-3.3-70b-versatile")
+             "model": raw.get("model", "llama-3.1-8b-instant")
          }
      except Exception as e:
          raise HTTPException(status_code=500, detail=str(e))
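The baseline route parses the child's stdout directly, so a crash in `inference.py` would surface as a JSON decoding error rather than the child's own message. One possible stricter variant is sketched below: it checks the exit status first and attaches stderr to the error. `run_json_child` and `to_baseline_response` are hypothetical names, and the child here is a stand-in one-liner, not the real `inference.py`.

```python
import json
import subprocess
import sys


def run_json_child(argv: list[str], timeout: float) -> dict:
    """Run a child process and return its stdout parsed as JSON."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(
            f"child exited with {result.returncode}: {result.stderr.strip()}"
        )
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"child stdout was not JSON: {result.stdout[:200]!r}") from exc


def to_baseline_response(raw: dict) -> dict:
    """Map the inference report onto the response shape used by the route."""
    return {
        "baseline_results": raw.get("detail", []),
        "average_score": raw.get("baseline_scores", {}).get("overall_avg", 0.0),
        "model": raw.get("model", "unknown"),
    }


# Stand-in child: prints a report shaped like run_final_1.json (detail left empty).
CHILD = (
    "import json; print(json.dumps({'baseline_scores': {'overall_avg': 0.945}, "
    "'model': 'llama-3.1-8b-instant', 'detail': []}))"
)
response = to_baseline_response(run_json_child([sys.executable, "-c", CHILD], timeout=30))
print(response["average_score"], response["model"])
```

In a FastAPI handler, the `RuntimeError` could be converted to an `HTTPException` in the same way the existing `except` clause does.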