minimal changes
README.md
CHANGED
|
@@ -1,5 +1,4 @@
|
|
| 1 |
---
|
| 2 |
-
---
|
| 3 |
title: MetaGuard Ad Policy Sandbox
|
| 4 |
emoji: 🛡
|
| 5 |
colorFrom: blue
|
|
@@ -9,7 +8,6 @@ app_port: 8000
|
|
| 9 |
pinned: false
|
| 10 |
license: mit
|
| 11 |
---
|
| 12 |
-
---
|
| 13 |
# MetaGuard: A Multi-App RL Environment for Enterprise Ad Policy Compliance
|
| 14 |
|
| 15 |
> An OpenEnv-compatible reinforcement learning environment that forces an LLM agent
|
|
@@ -19,6 +17,19 @@ license: mit
|
|
| 19 |

|
| 20 |

|
| 21 |
|
| 22 |
---
|
| 23 |
|
| 24 |
## TL;DR for Judges
|
|
@@ -53,6 +64,10 @@ Real compliance teams follow a **procedure**: check policy → inspect creative
|
|
| 53 |
verify the advertiser → log the audit → only then decide. MetaGuard makes the
|
| 54 |
agent learn that procedure end-to-end.
|
| 55 |
|
| 56 |
---
|
| 57 |
|
| 58 |
## Architecture
|
|
@@ -155,6 +170,39 @@ single-shot strategies provably under-perform a procedural agent.
|
|
| 155 |
| `task_9_dependency_trap` | Mismatch between text and image | Multi-source verification |
|
| 156 |
| `task_10_failure` | Deterministic API failure on step 1 | Recovery behavior |
|
| 157 |
|
| 158 |
---
|
| 159 |
|
| 160 |
## Quick Start
|
|
@@ -189,9 +237,9 @@ python inference.py
|
|
| 189 |
python demo.py
|
| 190 |
```
|
| 191 |
|
| 192 |
-
### 5.
|
| 193 |
-
Requires a CUDA GPU.
|
| 194 |
-
|
| 195 |
```bash
|
| 196 |
python grpo_train.py
|
| 197 |
```
|
|
@@ -229,6 +277,7 @@ meta-ad-policy-sandbox/
|
|
| 229 |
- **Theme:** 3.1 Professional Tasks — Multi-Step Reasoning & Policy Compliance
|
| 230 |
- **Bonus Track:** Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows
|
| 231 |
- **Team:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
|
|
|
|
| 232 |
|
| 233 |
---
|
| 234 |
|
|
|
|
| 1 |
---
|
|
|
|
| 2 |
title: MetaGuard Ad Policy Sandbox
|
| 3 |
emoji: 🛡
|
| 4 |
colorFrom: blue
|
|
|
|
| 8 |
pinned: false
|
| 9 |
license: mit
|
| 10 |
---
|
|
|
|
| 11 |
# MetaGuard: A Multi-App RL Environment for Enterprise Ad Policy Compliance
|
| 12 |
|
| 13 |
> An OpenEnv-compatible reinforcement learning environment that forces an LLM agent
|
|
|
|
| 17 |

|
| 18 |

|
| 19 |
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## Quick links
|
| 23 |
+
|
| 24 |
+
| Asset | URL |
|
| 25 |
+
| --- | --- |
|
| 26 |
+
| **Hugging Face Space** (environment — MetaGuard) | [huggingface.co/spaces/parth-1/MetaGuard](https://huggingface.co/spaces/parth-1/MetaGuard) |
|
| 27 |
+
| **Hugging Face Space** (GRPO training — MetaGuard-Train) | [huggingface.co/spaces/parth-1/MetaGuard-Train](https://huggingface.co/spaces/parth-1/MetaGuard-Train) |
|
| 28 |
+
| **Fine-tuned model** (GRPO checkpoint on Hub) | [parth-1/metaguard-policy-agent-v1](https://huggingface.co/parth-1/metaguard-policy-agent-v1) |
|
| 29 |
+
| **Training notebook** (Colab — opens `grpo_train.ipynb` from this repo) | [Open in Colab](https://colab.research.google.com/github/Parth380/meta-ad-policy-sandbox/blob/main/grpo_train.ipynb) |
|
| 30 |
+
| **Blog post** (narrative write-up — publish on HF, then swap URL) | [Draft in repo (`docs/metaguard-blog.md`)](https://github.com/Parth380/meta-ad-policy-sandbox/blob/main/docs/metaguard-blog.md) |
|
| 31 |
+
|
| 32 |
+
|
| 33 |
---
|
| 34 |
|
| 35 |
## TL;DR for Judges
|
|
|
|
| 64 |
verify the advertiser → log the audit → only then decide. MetaGuard makes the
|
| 65 |
agent learn that procedure end-to-end.
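
The procedure above can be sketched as an ordering check. This is a minimal illustration, not the environment's actual reward logic: the phase names are hypothetical labels for the steps named in the text, and `first_out_of_order` is an invented helper.

```python
from typing import Optional, Sequence

# Hypothetical phase labels for the procedure described above; the real
# environment's action names may differ.
REQUIRED_ORDER = (
    "check_policy",
    "inspect_creative",
    "verify_advertiser",
    "log_audit",
    "decide",
)


def first_out_of_order(actions: Sequence[str]) -> Optional[str]:
    """Return the first action taken before its prerequisites, else None."""
    done = set()
    for action in actions:
        idx = REQUIRED_ORDER.index(action)  # raises ValueError on unknown action
        # Every earlier phase must already have been completed.
        if any(prev not in done for prev in REQUIRED_ORDER[:idx]):
            return action
        done.add(action)
    return None


procedural = ["check_policy", "inspect_creative", "verify_advertiser", "log_audit", "decide"]
single_shot = ["decide"]
audit_first = ["log_audit", "check_policy"]

print(first_out_of_order(procedural))   # None
print(first_out_of_order(single_shot))  # decide
print(first_out_of_order(audit_first))  # log_audit
```

A single-shot agent that jumps straight to a decision is flagged on its first action, which is the behavior the procedure is meant to discourage.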
|
| 66 |
|
| 67 |
+
### Why it matters
|
| 68 |
+
|
| 69 |
+
**Trust & safety and ads integrity teams**, **enterprise compliance**, and **anyone shipping LLM agents against real APIs** care because mistakes are costly: bad approvals create legal and user-harm risk; over-blocking hurts revenue. A benchmark that enforces **procedure, traceability, and recovery from flaky tools** is a better proxy for production than single-shot classification.
|
| 70 |
+
|
| 71 |
---
|
| 72 |
|
| 73 |
## Architecture
|
|
|
|
| 170 |
| `task_9_dependency_trap` | Mismatch between text and image | Multi-source verification |
|
| 171 |
| `task_10_failure` | Deterministic API failure on step 1 | Recovery behavior |
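
One plausible reading of "recovery behavior" for a deterministic failure on step 1 is retrying the same call before moving on. The sketch below assumes that reading; `ToolError`, `make_flaky_tool`, and `call_with_retry` are hypothetical names, not part of this repo.

```python
from typing import Callable


class ToolError(RuntimeError):
    """Hypothetical error type standing in for a failed tool/API call."""


def make_flaky_tool() -> Callable[[str], str]:
    """Build a stand-in tool that fails on its first call, then succeeds."""
    calls = {"n": 0}

    def tool(query: str) -> str:
        calls["n"] += 1
        if calls["n"] == 1:
            raise ToolError("simulated failure on step 1")
        return f"ok:{query}"

    return tool


def call_with_retry(tool: Callable[[str], str], query: str, max_attempts: int = 3) -> str:
    """Retry the same call instead of skipping ahead or deciding early."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(query)
        except ToolError as exc:
            last_error = exc
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error


result = call_with_retry(make_flaky_tool(), "policy_lookup")
print(result)  # ok:policy_lookup
```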
|
| 172 |
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
+
## Training & results
|
| 176 |
+
|
| 177 |
+
Training uses **[OpenEnv](https://github.com/openenv-ai/openenv)** as the environment (this repo’s FastAPI hub + reward logic), **[Unsloth](https://github.com/unslothai/unsloth)** for fast LoRA fine-tuning, and **[Hugging Face TRL](https://github.com/huggingface/trl)** (`GRPOTrainer` / `GRPOConfig`) for GRPO. Entry point: `grpo_train.py`. The trained weights are on the Hub as **[parth-1/metaguard-policy-agent-v1](https://huggingface.co/parth-1/metaguard-policy-agent-v1)** (fine-tuned from `unsloth/Llama-3.1-8B-Instruct`).
|
| 178 |
+
|
| 179 |
+
### What changed after training?
|
| 180 |
+
|
| 181 |
+
After you complete a real run, summarize **before → after** in a few bullets and paste evidence below.
|
| 182 |
+
|
| 183 |
+
- **Baseline (optional):**
|
| 184 |
+
Before GRPO fine-tuning, the base `meta-llama/Meta-Llama-3.1-8B-Instruct` model could rarely satisfy the strict constraints of the compliance environment. It recorded a mean initial reward of -0.30, driven mainly by consistent API rejections. Typical failures were malformed JSON (for example, unclosed brackets) and actions taken out of the required phase order, such as submitting an audit before querying regulations. Failing to follow the required schema also caused frequent timeout errors during execution.
|
| 185 |
+
- **After GRPO:**
|
| 186 |
+
Reward stabilization: mean reward rose from about -0.5 at the start of training to a consistently positive value of about 0.45.
|
| 187 |
+
Comparison: initial runs became unstable around step 80, with a 4× increase in GRPO loss and a matching crash in mean return. Later tuning flattened these spikes, giving a steadier rise in reward and fewer relapses into low-reward behavior.
|
| 188 |
+
Success rate: stabilized after step 40, with the model consistently hitting the reward ceiling despite injected variability.
|
| 189 |
+
- **Run metadata (optional):**
|
| 190 |
+
Training ran on an NVIDIA L4 (24 GB VRAM) via Hugging Face Spaces, using the `unsloth/Llama-3.1-8B-Instruct` base model quantized to 4-bit and trained in bfloat16. The LoRA configuration used rank (r) 32 and alpha 64, targeting all linear layers. Key `GRPOConfig` hyperparameters were a learning rate of `2e-5` over 3 epochs, an effective batch size of 8 (per-device batch size of 1 with 8 gradient accumulation steps), and a max completion length of 128.
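
The batch-size arithmetic in the run metadata can be checked with a short sketch. The values are copied from the paragraph above; the key names mirror fields of PEFT's `LoraConfig` and TRL's `GRPOConfig`, but this is a plain-dict illustration rather than the contents of `grpo_train.py`, and the single-GPU count is an assumption.

```python
# Values copied from the run metadata above. Key names mirror LoraConfig /
# GRPOConfig fields, but this is a plain dict sketch, not grpo_train.py.
lora = {"r": 32, "lora_alpha": 64}
grpo = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "max_completion_length": 128,
}
num_devices = 1  # assumption: a single L4 GPU

# Effective batch = per-device batch x accumulation steps x device count.
effective_batch = (
    grpo["per_device_train_batch_size"]
    * grpo["gradient_accumulation_steps"]
    * num_devices
)
lora_scale = lora["lora_alpha"] / lora["r"]  # standard LoRA scaling, alpha / r

print(effective_batch)  # 8
print(lora_scale)       # 2.0
```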
|
| 191 |
+
|
| 192 |
+

|
| 193 |
+
|
| 194 |
+

|
| 195 |
+
### Loss and reward
|
| 196 |
+
|
| 197 |
+
|
| 198 |
+

|
| 199 |
+
|
| 200 |
+
|
| 201 |
+

|
| 202 |
+
|
| 203 |
+
|
| 204 |
+
|
| 205 |
+
|
| 206 |
---
|
| 207 |
|
| 208 |
## Quick Start
|
|
|
|
| 237 |
python demo.py
|
| 238 |
```
|
| 239 |
|
| 240 |
+
### 5. Train an agent (Unsloth + TRL GRPO)
|
| 241 |
+
Requires a CUDA GPU. `grpo_train.py` fine-tunes a LoRA on `unsloth/Llama-3.1-8B-Instruct` with **Unsloth** and **TRL’s `GRPOTrainer`**, using this environment as the reward source. See [Training & results](#training--results) for plots and narrative after you train.
|
| 242 |
+
|
| 243 |
```bash
|
| 244 |
python grpo_train.py
|
| 245 |
```
|
|
|
|
| 277 |
- **Theme:** 3.1 Professional Tasks — Multi-Step Reasoning & Policy Compliance
|
| 278 |
- **Bonus Track:** Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows
|
| 279 |
- **Team:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
|
| 280 |
+
- **Checklist:** OpenEnv-based env, **Unsloth + TRL** training (`grpo_train.py`), **loss/reward plots** in [Training & results](#training--results), **HF Space + collateral** in [Quick links](#quick-links).
|
| 281 |
|
| 282 |
---
|
| 283 |
|