Spaces: parth-1/MetaGuard


Kartik Goyal committed on Apr 25

Commit

47fa380

1 Parent(s): daa0358

improved logic


Files changed (9)

.gitignore +35 -0
README.md +175 -74
apps/start.sh +24 -3
dockerfile +9 -11
grpo_train.py +116 -38
inference.py +28 -7
pyproject.toml +19 -2
requirements.txt +2 -1
server/app.py +5 -9

.gitignore ADDED

	@@ -0,0 +1,35 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+*.egg
+build/
+dist/
+.eggs/
+# Virtual environments
+.venv/
+venv/
+env/
+.env
+# Editor / OS
+.vscode/
+.idea/
+.DS_Store
+Thumbs.db
+# Project-specific
+outputs/
+AD_sandbox.zip
+*.log
+debug-*.log
+checkpoint-*/
+# Notebooks
+.ipynb_checkpoints/
+# Cached models / datasets
+.cache/
+hf_cache/

README.md CHANGED

@@ -1,135 +1,236 @@
-# 🚀 MetaGuard: Procedural RL for Automated Ad Moderation
-> **Transforming "Black Box" AI into auditable, multi-step regulatory workflows.**
 ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)
 ![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)
-![RL-Framework: GRPO](https://img.shields.io/badge/Framework-GRPO-success.svg)
 ---
-## ⚠️ The Problem: "Single-Shot" Failures
-Traditional AI moderation models treat policy enforcement as a simple classification task (Approve/Reject). This approach fails in enterprise environments because it lacks:
-* ❌ **Traceability:** No explanation for *why* a decision was made.
-* ❌ **Contextual Awareness:** Decisions are made without checking advertiser history or regional regulations.
-* ❌ **Risk Management:** Approving high-risk content blindly without a verified audit trail.
-## ✅ The MetaGuard Solution
-MetaGuard redefines moderation as a **step-by-step investigative process** powered by Reinforcement Learning. The agent is trained not just to provide the right answer, but to follow the **correct investigative procedure** required by global compliance standards.
 ---
-## 🏗️ System Architecture
-MetaGuard operates as a microservice ecosystem to simulate real-world API latency, data silos, and procedural constraints.
-### 🔄 Interaction Flow
 ```mermaid
 graph LR
-    subgraph "Intelligent Agent"
-        A[RL Policy Agent]
     end
-    subgraph "MetaGuard Core"
         B(Environment Hub :8000)
     end
-    subgraph "External Policy APIs"
         C[[Regulatory API :8001]]
         D[[CRM API :8002]]
         E[[Audit API :8003]]
     end
-    A -- "1. Action Selection" --> B
-    B -- "2. API Request" --> C
-    B -- "2. API Request" --> D
-    C -- "3. Policy Signal" --> B
-    D -- "3. Trust Score" --> B
-    B -- "4. State Update + Reward" --> A
-    A -- "5. Final Decision" --> B
-    B -- "6. Immutable Log" --> E
 ```
-### 🗂️ Microservice Responsibility Map
-| Service | Endpoint | Responsibility |
-| :--- | :--- | :--- |
-| **Core Env** | `:8000` | State orchestration & Reward calculation |
-| **Regulatory API**| `:8001` | Dynamic policy lookup & legal constraints |
-| **CRM API** | `:8002` | Advertiser historical risk & trust scoring |
-| **Audit API** | `:8003` | Immutable logging for decision accountability |
 ---
-## 🧠 Methodology: GRPO & Procedural RL
-We utilize **Group Relative Policy Optimization (GRPO)** to train the agent. Unlike standard LLMs, our agent learns an optimal **Action Sequence**:
-1. 📥 **Ingest:** Fetch policy constraints via `query_regulations`.
-2. 🔍 **Inspect:** Scan creative assets via `analyze_image`.
-3. 🛡️ **Validate:** Cross-reference advertiser reliability via `check_advertiser_history`.
-4. 📝 **Certify:** Generate an immutable record via `submit_audit`.
-5. ⚖️ **Decide:** Execute final `approve` or `reject` action.
 ---
-## 🎬 Evaluation Trace
-We compare a baseline "Naive" agent against the MetaGuard trained agent to demonstrate procedural intelligence via our `demo.py` execution.
-### 📉 Scenario 1: Naive Agent
-* **Behavior:** Attempts to approve content without performing due diligence.
-* **Outcome:** Procedural penalties triggered; audit trail missing.
-* **Final Compliance Rating:** `0/10` 🚨
-### 📈 Scenario 2: MetaGuard Agent
-* **Behavior:** Systematically investigates all signals before acting.
-* **Trace:** `REGULATIONS` ➔ `IMAGE_SCAN` ➔ `CRM_CHECK` ➔ `AUDIT_LOG` ➔ `REJECT`.
-* **Final Compliance Rating:** `9/10` 🌟
 ---
-## 📊 Performance Metrics
-| Metric | Pre-Training (Naive) | Post-Training (MetaGuard) |
-| :--- | :--- | :--- |
-| **Success Rate** | 43% | **77%** |
-| **Procedural Compliance** | 12% | **94%** |
-| **Avg. Reward Score** | -2.1 | **+1.35** |
 ---
-## 🚀 Getting Started
-### 1. Environment Setup
 ```bash
-git clone [https://github.com/Parth380/meta-ad-policy-sandbox.git](https://github.com/Parth380/meta-ad-policy-sandbox.git)
 cd meta-ad-policy-sandbox
-pip install -r requirements.txt
 ```
-### 2. Launch Microservices
-Open three separate terminal windows and start the mock API infrastructure:
 ```bash
-python apps/regulatory_api.py  # Port 8001
-python apps/crm_api.py         # Port 8002
-python apps/audit_api.py       # Port 8003
 ```
-### 3. Run the Evaluation Demo
 ```bash
 python demo.py
 ```
 ---
-## 🏆 Hackathon Submission Details
-* **Theme:** 3.1 Multi-Step Reasoning & Policy Compliance
-* **Bonus Track:** AI Scaler Lab
-* **Team Members:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
 ---
-### 📜 License
-This project is licensed under the MIT License.

+---
+title: MetaGuard Ad Policy Sandbox
+emoji: 🛡
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+app_port: 8000
+pinned: false
+license: mit
+---
+# MetaGuard: A Multi-App RL Environment for Enterprise Ad Policy Compliance
+> An OpenEnv-compatible reinforcement learning environment that forces an LLM agent
+> to do **real investigative work** across multiple enterprise APIs — not pattern-match.
 ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)
 ![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)
+![Framework: OpenEnv + GRPO](https://img.shields.io/badge/Framework-OpenEnv%20%2B%20GRPO-success.svg)
+---
+## TL;DR for Judges
+MetaGuard is a **partially observable, multi-application RL environment** modelled
+after a real enterprise ad-moderation workflow. The agent (LLM) must orchestrate
+calls across 4 microservices (Regulatory, CRM, Audit, Core), update its internal
+beliefs based on each tool result, and produce a defensible decision in the
+correct procedural order — or get penalised.
+| Theme 3.1 requirement | How MetaGuard satisfies it |
+| --- | --- |
+| Real interaction with tools / APIs / dynamic systems | 4 independent FastAPI microservices on ports 8000-8003 |
+| "Real hard work, not shortcuts" | Procedural penalties + ambiguity tasks force investigation |
+| Maintain consistent internal state | Env tracks `actions_taken`, `signals`, `api_failed`, `trace` |
+| Update beliefs based on outcomes | `signals` dict (`risk_score`, `policy_confidence`, `image_flag`, `landing_flag`) is populated only as the agent acts |
+| Orchestrate multi-step workflows | `REQUIRED_BEFORE_TERMINAL` enforces `query_regulations` → `submit_audit` → decide |
+| Partially observable world | Agent sees only what its actions reveal; no global view |
+| **Scaler AI Labs bonus** — Multi-App RL for Enterprise Workflows | 4-app architecture mirrors a real compliance stack with business-rule nuance |
 ---
+## The Problem
+Single-shot LLM moderation is brittle in enterprise settings:
+- **No traceability** — no record of *why* a decision was made.
+- **No context** — no advertiser history, no jurisdiction-specific rules.
+- **No risk gating** — high-risk content can be approved without an audit trail.
+Real compliance teams follow a **procedure**: check policy → inspect creative →
+verify the advertiser → log the audit → only then decide. MetaGuard makes the
+agent learn that procedure end-to-end.
 ---
+## Architecture
+A 4-service ecosystem that mirrors a real enterprise compliance stack.
 ```mermaid
 graph LR
+    subgraph "Agent"
+        A[LLM Policy Agent]
     end
+    subgraph "MetaGuard Core (OpenEnv)"
         B(Environment Hub :8000)
     end
+    subgraph "Enterprise APIs"
         C[[Regulatory API :8001]]
         D[[CRM API :8002]]
         E[[Audit API :8003]]
     end
+    A -- "action_type, reasoning" --> B
+    B -- "GET /regulations/{cat}" --> C
+    B -- "GET /advertiser/{id}" --> D
+    B -- "POST /log" --> E
+    C -- "policy + violations" --> B
+    D -- "risk_score + history" --> B
+    E -- "audit_id" --> B
+    B -- "obs + reward + signals" --> A
 ```
+| Service | Port | Responsibility | Real-world analog |
+| :--- | :--- | :--- | :--- |
+| Core Env | `:8000` | State orchestration, reward shaping | Compliance workflow engine |
+| Regulatory API | `:8001` | Category-specific policy lookup with random outages | Legal / policy database |
+| CRM API | `:8002` | Advertiser trust score and prior-violation history | Salesforce / advertiser CRM |
+| Audit API | `:8003` | Immutable audit-log writes | SOX-compliant audit ledger |
+Each external API has a **10% random failure rate** to simulate real network
+unreliability — the agent must learn to retry.
 ---
+## Action Space
+8 actions span the full investigative procedure:
+| Action | Calls service | Purpose |
+| --- | --- | --- |
+| `query_regulations` | Regulatory API | Look up category-specific policy |
+| `analyze_image` | (internal VLM stub) | Inspect creative for visual violations |
+| `check_advertiser_history` | CRM API | Pull advertiser trust score |
+| `request_landing_page` | (internal) | Check landing-page domain age + risk keywords |
+| `request_id_verification` | (internal) | Targeting / age-gate check |
+| `submit_audit` | Audit API | Write immutable audit record |
+| `approve` | terminal | Final approval decision |
+| `reject` | terminal | Final rejection decision |
 ---
+## Business-Rule Nuances (the "hard work" criteria)
+The env penalises shortcuts and rewards real reasoning. Specifically:
+1. **Phase ordering.** `query_regulations` MUST come first. Any other action
+   first returns `-0.2` reward and is **not registered** as taken.
+2. **Audit gate.** `submit_audit` is required before any `approve` / `reject`.
+   Skipping it costs `-0.2` from the terminal reward.
+3. **API-failure recovery.** External services fail 10% of the time. Recovering
+   (retrying after a failure) earns `+0.3`; ignoring the failure costs `-0.3`.
+4. **Risk-aware approvals.** Approving high-risk content (`risk_score > 0.7`
+   AND `policy_confidence > 0.6`) costs `-0.5`.
+5. **Ambiguity enforcement.** When `policy_confidence < 0.6`, the agent MUST
+   gather more signals (CRM or landing-page) or take a `-0.4` penalty.
+6. **Step penalty.** Every action costs `-0.05` to discourage padding.
+7. **Terminal correctness.** `+1.0` for the right decision, `-1.0` for wrong.
+8. **Step cap.** Hard cap at 8 steps; exceeding it costs `-0.5`.
+These rules together form a POMDP in which greedy or
+single-shot strategies are penalised relative to a procedural agent.
 ---
+## Task Suite
+The following task families are exposed via `task_id`:
+| ID | Family | What it tests |
+| --- | --- | --- |
+| `task_1_healthcare` | Unverified medical claims, prescription bypass | Domain knowledge + policy lookup |
+| `task_2_financial` | Predatory lending, guaranteed-returns scams | High-stakes risk gating |
+| `task_3_multimodal` | Violation hidden in image, clean text | Forces `analyze_image` |
+| `task_4_targeting` | Adult financial product targeting minors | Forces `request_id_verification` |
+| `task_6_conflict` | Clean text + risky advertiser | Conflict resolution |
+| `task_7_ambiguous` | Low policy confidence | Forces extra signal gathering |
+| `task_8_adversarial` | Fine-print loophole | Adversarial robustness |
+| `task_9_dependency_trap` | Mismatch between text and image | Multi-source verification |
+| `task_10_failure` | Deterministic API failure on step 1 | Recovery behavior |
 ---
+## Quick Start
+### 1. Install
 ```bash
+git clone https://github.com/Parth380/meta-ad-policy-sandbox.git
 cd meta-ad-policy-sandbox
+pip install -e .
 ```
+### 2. Launch the 4-service stack
+Four terminals (or use `apps/start_all.bat` on Windows):
 ```bash
+python apps/regulatory_api.py                                # :8001
+python apps/crm_api.py                                       # :8002
+python apps/audit_api.py                                     # :8003
+python -m uvicorn server.app:app --host 0.0.0.0 --port 8000  # :8000
 ```
+### 3. Run the inference benchmark
+Uses an LLM through the HF Router and emits the official `[START]/[STEP]/[END]`
+grading log lines.
+```bash
+export HF_TOKEN=hf_xxxxxxxx                # your Hugging Face token
+export MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct
+python inference.py
+```
+### 4. Run the local naive-vs-procedural demo
 ```bash
 python demo.py
 ```
+### 5. (Optional) Train an agent with GRPO
+Requires a CUDA GPU. Trains a LoRA on top of `unsloth/Llama-3.1-8B-Instruct`
+using the env itself as the reward function.
+```bash
+python grpo_train.py
+```
+---
+## Repository Layout
+```
+meta-ad-policy-sandbox/
+├── apps/
+│   ├── regulatory_api.py    # FastAPI :8001 — policy DB
+│   ├── crm_api.py           # FastAPI :8002 — advertiser CRM
+│   ├── audit_api.py         # FastAPI :8003 — audit log
+│   └── start_all.bat        # Windows: launch all 4 at once
+├── server/
+│   └── app.py               # OpenEnv FastAPI server :8000
+├── src/
+│   ├── environment.py       # AdPolicyEnvironment — core RL logic
+│   ├── models.py            # Pydantic schemas (AdAction, AdObservation, AdState)
+│   └── generator.py         # AdGenerator — task-aware ad sampling
+├── inference.py             # LLM-via-HF-Router benchmark with grading logs
+├── demo.py                  # Local naive-vs-procedural demo
+├── grpo_train.py            # GRPO + LoRA training script
+├── test_env.py              # Smoke test of env logic
+├── openenv.yaml             # OpenEnv manifest
+├── dockerfile               # Container build for HF Spaces deployment
+└── validate.sh              # Validator for HF Space + openenv submission
+```
 ---
+## Hackathon Submission
+- **Theme:** 3.1 Professional Tasks — Multi-Step Reasoning & Policy Compliance
+- **Bonus Track:** Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows
+- **Team:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
 ---
+## License
+MIT.
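
Review note on the README's "Business-Rule Nuances" list: the rules read as additive terms on the terminal reward. The sketch below is one possible reading of rules 2, 4, 5, 6 and 7, not the environment's actual code (`src/environment.py` is not part of this commit). The function name `terminal_reward`, the `Signals` dataclass, and the assumption that the penalties simply sum are all hypothetical.

```python
from dataclasses import dataclass

# Constants copied from the README rules; how they combine is an assumption.
STEP_PENALTY = -0.05            # rule 6: every action costs -0.05
MISSING_AUDIT_PENALTY = -0.2    # rule 2: deciding without submit_audit
RISKY_APPROVAL_PENALTY = -0.5   # rule 4: approving high-risk content
AMBIGUITY_PENALTY = -0.4        # rule 5: low confidence, no extra signals


@dataclass
class Signals:
    risk_score: float
    policy_confidence: float


def terminal_reward(decision, correct_decision, actions_taken, signals):
    """Hypothetical additive reading of the README rules for a final step."""
    reward = STEP_PENALTY
    # Rule 7: +1.0 for the right decision, -1.0 for the wrong one.
    reward += 1.0 if decision == correct_decision else -1.0
    if "submit_audit" not in actions_taken:
        reward += MISSING_AUDIT_PENALTY
    if (decision == "approve"
            and signals.risk_score > 0.7
            and signals.policy_confidence > 0.6):
        reward += RISKY_APPROVAL_PENALTY
    extra = {"check_advertiser_history", "request_landing_page"} & set(actions_taken)
    if signals.policy_confidence < 0.6 and not extra:
        reward += AMBIGUITY_PENALTY
    return round(reward, 2)


procedural = terminal_reward(
    "reject", "reject",
    ["query_regulations", "check_advertiser_history", "submit_audit"],
    Signals(risk_score=0.9, policy_confidence=0.8),
)
naive = terminal_reward(
    "approve", "reject",
    ["query_regulations"],
    Signals(risk_score=0.9, policy_confidence=0.8),
)
print(procedural, naive)
```

Under these assumptions the procedural trajectory ends on a positive terminal reward while the shortcut trajectory stacks the wrong-decision, missing-audit, and risky-approval penalties.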

apps/start.sh CHANGED

@@ -1,9 +1,30 @@
 #!/bin/bash
-# Start the background microservices
 python apps/regulatory_api.py &
 python apps/crm_api.py &
 python apps/audit_api.py &
-# Start the main environment server in the foreground
-uvicorn server.app:app --host 0.0.0.0 --port 8000

 #!/bin/bash
+set -e
 python apps/regulatory_api.py &
+REG_PID=$!
 python apps/crm_api.py &
+CRM_PID=$!
 python apps/audit_api.py &
+AUD_PID=$!
+wait_for_service() {
+  local url=$1
+  local name=$2
+  for i in $(seq 1 30); do
+    if curl -sf "$url" > /dev/null 2>&1; then
+      echo "[start.sh] $name ready"
+      return 0
+    fi
+    sleep 1
+  done
+  echo "[start.sh] ERROR: $name did not become ready within 30s; aborting (set -e)" >&2
+  return 1
+}
+wait_for_service "http://localhost:8001/health" "regulatory_api"
+wait_for_service "http://localhost:8002/health" "crm_api"
+wait_for_service "http://localhost:8003/health" "audit_api"
+echo "[start.sh] All microservices up. Launching environment server on :8000"
+exec uvicorn server.app:app --host 0.0.0.0 --port 8000
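
Review note on the readiness loop: the same polling pattern can be useful from Python, for example in a test harness that starts the three APIs before the environment server. The sketch below mirrors the shell function under stated assumptions. The Python helper and its `probe` and `sleep` parameters are hypothetical and not part of this commit; the `/health` paths are taken from the script above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True when a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_service(url, name, attempts=30, delay=1.0,
                     probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for _ in range(attempts):
        if probe(url):
            print(f"[start] {name} ready")
            return True
        sleep(delay)
    print(f"[start] {name} did not become ready after {attempts} attempts")
    return False
```

Passing `probe` and `sleep` as parameters keeps the helper testable without a running server; a launcher would call it once per service and stop when any call returns `False`.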

dockerfile CHANGED

@@ -1,21 +1,19 @@
-# 1. Use a lightweight Python image
 FROM python:3.11-slim
-# 2. Set the working directory inside the container
 WORKDIR /app
-# 3. Copy all your project files into the container
-COPY . .
-# 4. Install dependencies
-RUN pip install --no-cache-dir .
-RUN pip install -r requirements.txt
-# 5. Make the startup script executable (Bypasses Windows permission errors)
 RUN chmod +x apps/start.sh
-# 6. Expose the port the main server uses
 EXPOSE 8000
-# 7. Start all services using the bash script
-CMD ["./apps/start.sh"]

 FROM python:3.11-slim
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends curl \
+ && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
+COPY pyproject.toml requirements.txt ./
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+RUN pip install --no-cache-dir --no-deps .
 RUN chmod +x apps/start.sh
 EXPOSE 8000
+CMD ["./apps/start.sh"]

grpo_train.py CHANGED

@@ -98,95 +98,167 @@ def extract_json(text):
 # =========================
 BASE_SCENARIOS = [
-    # 🔹 Fresh state
     {
         "task_id": "task_1_healthcare",
-        "text": "Ad: miracle supplement cures disease. Initial review.",
-        "setup_actions": []
     },
-    # 🔹 Mid state
     {
         "task_id": "task_1_healthcare",
-        "text": "Ad: pharma product. Policy already checked. Next step?",
         "setup_actions": [
-            {"action_type": "query_regulations", "reasoning": "step1"}
-        ]
     },
-    # 🔹 Late state
     {
         "task_id": "task_2_financial",
-        "text": "Ad: investment scheme. Policy + history checked. Final decision?",
         "setup_actions": [
-            {"action_type": "query_regulations", "reasoning": "step1"},
-            {"action_type": "check_advertiser_history", "reasoning": "step2"}
-        ]
-    }
 ]
-def build_dataset():
-    rows = []
-    for s in BASE_SCENARIOS:
-        prompt = f"""
-You are an Ad Policy Agent.
-Respond ONLY JSON:
-{{"action_type": "...", "reasoning": "..."}}
-{s['text']}
-Next action?
-"""
         rows.append({
             "prompt": prompt,
             "task_id": s["task_id"],
-            "setup_actions": s["setup_actions"]
         })
-    return Dataset.from_list(rows * 20)  # small repeat
 # =========================
 # REWARD FUNCTION (FIXED)
 # =========================
 def reward_environment(prompts, completions, task_id=None, setup_actions=None, **kwargs):
     client = EnvClient(ENV_URL)
     rewards = []
     for completion, t_id, setup in zip(completions, task_id, setup_actions):
         parsed = extract_json(completion)
         if not parsed:
             rewards.append(-1.0)
             continue
         action_type = parsed.get("action_type")
         if action_type not in ALLOWED_ACTIONS:
             rewards.append(-1.0)
             continue
         action = {
             "action_type": action_type,
-            "reasoning": parsed.get("reasoning", "")
         }
         try:
             client.reset(t_id)
-            # 🔥 FAST-FORWARD STATE
             for s in setup:
                 safe_step(client, s)
             result = safe_step(client, action)
-            reward = float(result.get("reward", -0.2))
-            rewards.append(reward)
-        except:
             rewards.append(-0.3)
     return rewards
@@ -204,9 +276,15 @@ model, tokenizer = FastLanguageModel.from_pretrained(
 model = FastLanguageModel.get_peft_model(
     model,
     r=16,
-    target_modules=["q_proj", "v_proj"],
-    lora_alpha=16,
     lora_dropout=0,
 )
 # =========================

 # =========================
 BASE_SCENARIOS = [
+    # Phase 1 — Fresh state, expected: query_regulations
     {
         "task_id": "task_1_healthcare",
+        "text": "Healthcare ad: 'miracle supplement cures disease'. No actions taken yet.",
+        "actions_already_taken": [],
+        "setup_actions": [],
+    },
+    {
+        "task_id": "task_2_financial",
+        "text": "Financial ad: 'guaranteed 500% returns, zero risk'. No actions taken yet.",
+        "actions_already_taken": [],
+        "setup_actions": [],
+    },
+    {
+        "task_id": "task_3_multimodal",
+        "text": "Multimodal ad: image may contain hidden violation. No actions taken yet.",
+        "actions_already_taken": [],
+        "setup_actions": [],
     },
+    # Phase 2 — Policy checked, expected: analyze_image OR check_advertiser_history
     {
         "task_id": "task_1_healthcare",
+        "text": "Healthcare ad: pharma product. Policy already queried.",
+        "actions_already_taken": ["query_regulations"],
         "setup_actions": [
+            {"action_type": "query_regulations", "reasoning": "policy lookup"},
+        ],
+    },
+    {
+        "task_id": "task_3_multimodal",
+        "text": "Multimodal ad: image not yet inspected. Policy already queried.",
+        "actions_already_taken": ["query_regulations"],
+        "setup_actions": [
+            {"action_type": "query_regulations", "reasoning": "policy lookup"},
+        ],
     },
+    # Phase 3 — Policy + history checked, expected: submit_audit
     {
         "task_id": "task_2_financial",
+        "text": "Financial ad: investment scheme. Policy and advertiser history both checked.",
+        "actions_already_taken": ["query_regulations", "check_advertiser_history"],
         "setup_actions": [
+            {"action_type": "query_regulations", "reasoning": "policy lookup"},
+            {"action_type": "check_advertiser_history", "reasoning": "trust score"},
+        ],
+    },
+    # Phase 4 — Audit complete, expected: reject (high-risk) or approve (clean)
+    {
+        "task_id": "task_2_financial",
+        "text": "Financial ad: investment scheme. Policy, history, and audit all complete. Make final decision.",
+        "actions_already_taken": ["query_regulations", "check_advertiser_history", "submit_audit"],
+        "setup_actions": [
+            {"action_type": "query_regulations", "reasoning": "policy lookup"},
+            {"action_type": "check_advertiser_history", "reasoning": "trust score"},
+            {"action_type": "submit_audit", "reasoning": "audit log"},
+        ],
+    },
 ]
+PROMPT_TEMPLATE = """You are an enterprise Ad Policy Compliance Agent.
+You MUST choose exactly ONE action_type from this list (any other value is invalid):
+- query_regulations
+- analyze_image
+- check_advertiser_history
+- submit_audit
+- approve
+- reject
+REQUIRED PHASE ORDER:
+1. query_regulations  -> always first
+2. analyze_image / check_advertiser_history  -> gather signals
+3. submit_audit  -> always before final decision
+4. approve OR reject  -> only after audit
+HARD RULES:
+- NEVER repeat an action listed in `actions_already_taken`.
+- Respond with ONLY a valid JSON object. No markdown, no prose.
+Required format:
+{{"action_type": "<one_of_the_actions_above>", "reasoning": "<short reason>"}}
+Scenario: {text}
+actions_already_taken: {actions_already_taken}
+Your next action?"""
+def build_dataset():
+    rows = []
+    for s in BASE_SCENARIOS:
+        prompt = PROMPT_TEMPLATE.format(
+            text=s["text"],
+            actions_already_taken=json.dumps(s["actions_already_taken"]),
+        )
         rows.append({
             "prompt": prompt,
             "task_id": s["task_id"],
+            "setup_actions": s["setup_actions"],
         })
+    return Dataset.from_list(rows * 10)  # 7 scenarios x 10 = 70 examples
 # =========================
 # REWARD FUNCTION (FIXED)
 # =========================
 def reward_environment(prompts, completions, task_id=None, setup_actions=None, **kwargs):
+    """Shaped reward for GRPO.
+    Pure env reward is too sparse (mostly -0.05) to give clear gradients.
+    We add explicit shaping:
+      - invalid JSON / invalid action_type -> -1.0  (strong negative signal)
+      - valid action env REJECTS (wrong phase / API failure) -> -0.5
+      - valid action env ACCEPTS (advances state) -> +0.5 + env_reward
+      - terminal correct decision -> env_reward already contains +1.0 bonus
+    """
     client = EnvClient(ENV_URL)
     rewards = []
     for completion, t_id, setup in zip(completions, task_id, setup_actions):
         parsed = extract_json(completion)
         if not parsed:
             rewards.append(-1.0)
             continue
         action_type = parsed.get("action_type")
         if action_type not in ALLOWED_ACTIONS:
             rewards.append(-1.0)
             continue
         action = {
             "action_type": action_type,
+            "reasoning": parsed.get("reasoning", "format-compliant"),
         }
         try:
             client.reset(t_id)
             for s in setup:
                 safe_step(client, s)
             result = safe_step(client, action)
+            env_reward = float(result.get("reward", -0.2))
+            status_msg = (result.get("status_message") or "").lower()
+            rejected = (
+                "api failure" in status_msg
+                or "invalid action" in status_msg
+                or "must call" in status_msg
+            )
+            if rejected:
+                shaped = -0.5
+            else:
+                shaped = 0.5 + env_reward
+            rewards.append(shaped)
+        except Exception:
             rewards.append(-0.3)
     return rewards
 model = FastLanguageModel.get_peft_model(
     model,
     r=16,
+    target_modules=[
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha=32,
     lora_dropout=0,
+    bias="none",
+    use_gradient_checkpointing="unsloth",
+    random_state=3407,
 )
 # =========================
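
Review note on the shaped reward: the arithmetic in the new `reward_environment` docstring is easier to check when isolated from the HTTP client. The sketch below restates the three branches as a pure function. The name `shape_reward` and the example status messages are hypothetical; the real messages come from the environment, which this commit does not show.

```python
# Substrings the training script treats as "the env rejected this action".
REJECTION_MARKERS = ("api failure", "invalid action", "must call")


def shape_reward(parsed_ok, status_message, env_reward):
    """Restate the shaping branches from reward_environment as a pure function."""
    if not parsed_ok:
        return -1.0                      # invalid JSON or unknown action_type
    status = (status_message or "").lower()
    if any(marker in status for marker in REJECTION_MARKERS):
        return -0.5                      # valid action, rejected by the env
    return 0.5 + env_reward              # valid action, accepted by the env


# Hypothetical status messages, used only to exercise each branch.
accepted = shape_reward(True, "Regulations retrieved", -0.05)
rejected = shape_reward(True, "You must call query_regulations first", -0.2)
malformed = shape_reward(False, None, 0.0)
print(accepted, rejected, malformed)
```

One thing worth noting: the matching relies on substrings of `status_message`, so the trainer is coupled to the environment's wording. `inference.py` in this same commit also checks for `"retryable"`, which the training script does not.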

inference.py CHANGED

@@ -8,7 +8,7 @@ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "dummy_local_token")
 MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
-ENV_URL = "http://localhost:8000"
 MAX_STEPS = 10
 # 2. MANDATORY: Use OpenAI Client pointed at the HF Router
@@ -59,6 +59,12 @@ def get_llm_action(observation_data):
     - approve
     - reject
     Response format:
     {"action_type": "<action>", "reasoning": "<brief reason>"}
     """
@@ -99,7 +105,8 @@ def main() -> None:
         rewards = []
         steps_taken = 0
         success = False
         try:
             # 1. Reset the environment
             res = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id})
@@ -122,16 +129,17 @@ def main() -> None:
                     "task_id": task_id,
                     "last_feedback": step_data.get("status_message", "No feedback yet."),
                     "step_count": steps_taken,
-                    "ad_details": observation
                 }
                 # Get action from LLM
                 action_payload = get_llm_action(llm_observation)
                 action_str = action_payload["action_type"]
                 if "Error code: 402" in action_payload.get("reasoning", ""):
-                 done = True
-                 log_step(step=steps_taken, action=action_str, reward=0.0, done=True, error="API credits depleted")
-                 break
                 # Execute action in environment
                 step_res = requests.post(f"{ENV_URL}/step", json={"action": action_payload})
                 step_data = step_res.json()
@@ -140,8 +148,21 @@ def main() -> None:
                 observation = step_data.get("observation", {})
                 done = step_data.get("done", False)
                 reward = step_data.get("reward", 0.0)
                 rewards.append(reward)
                 log_step(step=steps_taken, action=action_str, reward=reward, done=done, error=None)
             # 4. Final Scoring (Single Log)

 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "dummy_local_token")
 MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
+ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
 MAX_STEPS = 10
 # 2. MANDATORY: Use OpenAI Client pointed at the HF Router
     - approve
     - reject
+    HARD RULES:
+    - NEVER repeat an action listed in `actions_already_taken`.
+    - You MUST progress through the phase order. Do NOT call submit_audit or approve/reject
+      before the prerequisite phases are complete.
+    - Choose your action_type ONLY from the AVAILABLE ACTIONS list above. Any other value is invalid.
     Response format:
     {"action_type": "<action>", "reasoning": "<brief reason>"}
     """
         rewards = []
         steps_taken = 0
         success = False
+        actions_taken_list: list = []
         try:
             # 1. Reset the environment
             res = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id})
                     "task_id": task_id,
                     "last_feedback": step_data.get("status_message", "No feedback yet."),
                     "step_count": steps_taken,
+                    "actions_already_taken": actions_taken_list,
+                    "ad_details": observation
                 }
                 # Get action from LLM
                 action_payload = get_llm_action(llm_observation)
                 action_str = action_payload["action_type"]
                 if "Error code: 402" in action_payload.get("reasoning", ""):
+                    done = True
+                    log_step(step=steps_taken, action=action_str, reward=0.0, done=True, error="API credits depleted")
+                    break
                 # Execute action in environment
                 step_res = requests.post(f"{ENV_URL}/step", json={"action": action_payload})
                 step_data = step_res.json()
                 observation = step_data.get("observation", {})
                 done = step_data.get("done", False)
                 reward = step_data.get("reward", 0.0)
                 rewards.append(reward)
+                # Track only actions that actually advanced state. Skip API-failure
+                # / invalid-action / wrong-order cases so the agent is free to retry.
+                status_msg = (step_data.get("status_message") or "").lower()
+                action_failed = (
+                    "api failure" in status_msg
+                    or "retryable" in status_msg
+                    or "invalid action" in status_msg
+                    or "must call" in status_msg
+                )
+                if not action_failed and action_str not in actions_taken_list:
+                    actions_taken_list.append(action_str)
                 log_step(step=steps_taken, action=action_str, reward=reward, done=done, error=None)
             # 4. Final Scoring (Single Log)
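
Review note on the action tracking: the point of skipping failed steps is that a retried action must not look like a repeat to the LLM. A minimal sketch of that behaviour, separated from the request loop, follows. The helper names and the status messages in `steps` are hypothetical; only the marker substrings are taken from the hunk above.

```python
# Substrings inference.py treats as "this step did not advance state".
FAILURE_MARKERS = ("api failure", "retryable", "invalid action", "must call")


def action_failed(status_message):
    """True when the env's status message signals a failed or rejected step."""
    status = (status_message or "").lower()
    return any(marker in status for marker in FAILURE_MARKERS)


def record_action(actions_taken, action, status_message):
    """Append `action` only if the step succeeded and is not yet recorded."""
    if not action_failed(status_message) and action not in actions_taken:
        actions_taken.append(action)
    return actions_taken


# Hypothetical episode: the first regulatory lookup fails, the retry succeeds.
history = []
steps = [
    ("query_regulations", "API failure (retryable): regulatory service unavailable"),
    ("query_regulations", "Regulations retrieved"),
    ("check_advertiser_history", "Advertiser history retrieved"),
]
for action, status in steps:
    record_action(history, action, status)
print(history)  # ['query_regulations', 'check_advertiser_history']
```

Because the failed first call is never recorded, the prompt's "NEVER repeat an action" rule does not block the retry.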

pyproject.toml CHANGED

@@ -6,14 +6,31 @@ build-backend = "setuptools.build_meta"
 name = "meta-ad-policy-sandbox"
 version = "0.2.3"
 description = "Meta Ad-Policy RL Sandbox"
 dependencies = [
     "fastapi",
     "uvicorn",
     "pydantic",
     "requests",
     "openai",
-    "openenv-core>=0.2.0"
 ]
 [project.scripts]
-server = "server.app:main"

 name = "meta-ad-policy-sandbox"
 version = "0.2.3"
 description = "Meta Ad-Policy RL Sandbox"
+requires-python = ">=3.9"
 dependencies = [
     "fastapi",
     "uvicorn",
     "pydantic",
     "requests",
     "openai",
+    "openenv-core>=0.2.0",
+]
+[project.optional-dependencies]
+train = [
+    "torch",
+    "datasets",
+    "trl",
+    "unsloth",
+    "accelerate",
+    "bitsandbytes",
+    "peft",
 ]
 [project.scripts]
+server = "server.app:main"
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["server*", "src*"]
+exclude = ["apps*", "tests*"]

requirements.txt CHANGED

@@ -2,4 +2,5 @@ openenv-core>=0.2.1
 fastapi
 uvicorn
 pydantic
-requests

 fastapi
 uvicorn
 pydantic
+requests
+openai

server/app.py CHANGED

@@ -3,21 +3,17 @@ from openenv.core.env_server import create_fastapi_app
 from src.environment import AdPolicyEnvironment
 from src.models import AdAction, AdObservation
-# 1. Create the App
-# NOTICE: We pass the CLASS NAME (AdPolicyEnvironment), not 'env' or 'AdPolicyEnvironment()'
 app = create_fastapi_app(
-    AdPolicyEnvironment,
-    AdAction,
-    AdObservation
 )
-if __name__ == "__main__":
-    print("🚀 Starting Meta Ad-Policy Sandbox on http://localhost:8000")
-    uvicorn.run(app, host="0.0.0.0", port=8000)
 def main():
     uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
 if __name__ == "__main__":
     main()

 from src.environment import AdPolicyEnvironment
 from src.models import AdAction, AdObservation
 app = create_fastapi_app(
+    AdPolicyEnvironment,
+    AdAction,
+    AdObservation,
 )
 def main():
+    print("Starting Meta Ad-Policy Sandbox on http://localhost:8000")
     uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
 if __name__ == "__main__":
     main()