Commit 6f613ca (verified) by parth-1 · 1 parent: ce665a2

minimal changes

Files changed (1)
  1. README.md +54 -5
README.md CHANGED
@@ -1,5 +1,4 @@
1
  ---
2
- ---
3
  title: MetaGuard Ad Policy Sandbox
4
  emoji: 🛡
5
  colorFrom: blue
@@ -9,7 +8,6 @@ app_port: 8000
9
  pinned: false
10
  license: mit
11
  ---
12
- ---
13
  # MetaGuard: A Multi-App RL Environment for Enterprise Ad Policy Compliance
14
 
15
  > An OpenEnv-compatible reinforcement learning environment that forces an LLM agent
@@ -19,6 +17,19 @@ license: mit
19
  ![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)
20
  ![Framework: OpenEnv + GRPO](https://img.shields.io/badge/Framework-OpenEnv%20%2B%20GRPO-success.svg)
21
 
22
  ---
23
 
24
  ## TL;DR for Judges
@@ -53,6 +64,10 @@ Real compliance teams follow a **procedure**: check policy → inspect creative
53
  verify the advertiser → log the audit → only then decide. MetaGuard makes the
54
  agent learn that procedure end-to-end.
55
 
56
  ---
57
 
58
  ## Architecture
@@ -155,6 +170,39 @@ single-shot strategies provably under-perform a procedural agent.
155
  | `task_9_dependency_trap` | Mismatch between text and image | Multi-source verification |
156
  | `task_10_failure` | Deterministic API failure on step 1 | Recovery behavior |
157
 
158
  ---
159
 
160
  ## Quick Start
@@ -189,9 +237,9 @@ python inference.py
189
  python demo.py
190
  ```
191
 
192
- ### 5. (Optional) Train an agent with GRPO
193
- Requires a CUDA GPU. Trains a LoRA on top of `unsloth/Llama-3.1-8B-Instruct`
194
- using the env itself as the reward function.
195
  ```bash
196
  python grpo_train.py
197
  ```
@@ -229,6 +277,7 @@ meta-ad-policy-sandbox/
229
  - **Theme:** 3.1 Professional Tasks — Multi-Step Reasoning & Policy Compliance
230
  - **Bonus Track:** Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows
231
  - **Team:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
232
 
233
  ---
234
 
 
1
  ---
 
2
  title: MetaGuard Ad Policy Sandbox
3
  emoji: 🛡
4
  colorFrom: blue
 
8
  pinned: false
9
  license: mit
10
  ---
 
11
  # MetaGuard: A Multi-App RL Environment for Enterprise Ad Policy Compliance
12
 
13
  > An OpenEnv-compatible reinforcement learning environment that forces an LLM agent
 
17
  ![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)
18
  ![Framework: OpenEnv + GRPO](https://img.shields.io/badge/Framework-OpenEnv%20%2B%20GRPO-success.svg)
19
 
20
+ ---
21
+
22
+ ## Quick links
23
+
24
+ | Asset | URL |
25
+ | --- | --- |
26
+ | **Hugging Face Space** (environment — MetaGuard) | [huggingface.co/spaces/parth-1/MetaGuard](https://huggingface.co/spaces/parth-1/MetaGuard) |
27
+ | **Hugging Face Space** (GRPO training — MetaGuard-Train) | [huggingface.co/spaces/parth-1/MetaGuard-Train](https://huggingface.co/spaces/parth-1/MetaGuard-Train) |
28
+ | **Fine-tuned model** (GRPO checkpoint on Hub) | [parth-1/metaguard-policy-agent-v1](https://huggingface.co/parth-1/metaguard-policy-agent-v1) |
29
+ | **Training notebook** (Colab — opens `grpo_train.ipynb` from this repo) | [Open in Colab](https://colab.research.google.com/github/Parth380/meta-ad-policy-sandbox/blob/main/grpo_train.ipynb) |
30
+ | **Blog post** (narrative write-up; draft, not yet published on HF) | [Draft in repo (`docs/metaguard-blog.md`)](https://github.com/Parth380/meta-ad-policy-sandbox/blob/main/docs/metaguard-blog.md) |
31
+
32
+
33
  ---
34
 
35
  ## TL;DR for Judges
 
64
  verify the advertiser → log the audit → only then decide. MetaGuard makes the
65
  agent learn that procedure end-to-end.
66
 
67
+ ### Why it matters
68
+
69
+ **Trust & safety and ads integrity teams**, **enterprise compliance**, and **anyone shipping LLM agents against real APIs** care because mistakes are costly: bad approvals create legal and user-harm risk; over-blocking hurts revenue. A benchmark that enforces **procedure, traceability, and recovery from flaky tools** is a better proxy for production than single-shot classification.
70
+
71
  ---
72
 
73
  ## Architecture
 
170
  | `task_9_dependency_trap` | Mismatch between text and image | Multi-source verification |
171
  | `task_10_failure` | Deterministic API failure on step 1 | Recovery behavior |
172
 
173
+ ---
174
+
175
+ ## Training & results
176
+
177
+ Training uses **[OpenEnv](https://github.com/openenv-ai/openenv)** as the environment (this repo’s FastAPI hub + reward logic), **[Unsloth](https://github.com/unslothai/unsloth)** for fast LoRA fine-tuning, and **[Hugging Face TRL](https://github.com/huggingface/trl)** (`GRPOTrainer` / `GRPOConfig`) for GRPO. Entry point: `grpo_train.py`. The trained weights are on the Hub as **[parth-1/metaguard-policy-agent-v1](https://huggingface.co/parth-1/metaguard-policy-agent-v1)** (fine-tuned from `unsloth/Llama-3.1-8B-Instruct`).
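For readers new to GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt. The following is a minimal sketch of that group-relative advantage, not the code in `grpo_train.py` or TRL. The function name, the `eps` value, and the reward numbers are illustrative assumptions, and real implementations may use a different standard-deviation estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions.

    Hypothetical helper for illustration: advantage_i = (r_i - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up environment rewards for four completions of the same prompt.
rewards = [-1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:+.2f}  advantage={a:+.3f}")
```

Completions that beat their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged, so no separate value model is needed.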
178
+
179
+ ### What changed after training?
180
+
181
+ The bullets below summarize the run **before → after** training; the supporting plots follow.
182
+
183
+ - **Baseline:**
184
+   Before GRPO fine-tuning, the base `meta-llama/Meta-Llama-3.1-8B-Instruct` model could not satisfy the environment's strict constraints. It recorded a mean initial reward of -0.30, driven mainly by consistent API rejections. Typical failure modes were malformed JSON (for example, unclosed brackets) and phase-order violations, such as submitting an audit before querying regulations. Schema violations also caused frequent timeout errors during execution.
185
+ - **After GRPO:**
186
+   - **Reward stabilization:** mean reward rose from negative at the start of training (~ -0.5) to consistently positive (~ 0.45).
187
+   - **Comparison across runs:** initial runs showed significant policy instability around step 80, visible as a 4× increase in GRPO loss and a corresponding crash in mean return. Subsequent tuning flattened these spikes, giving a more monotonic reward improvement and fewer qualitative "relapses" into low-reward behavior.
188
+   - **Success rate:** stabilized after step 40, with the model consistently hitting the reward ceiling despite injected variability.
189
+ - **Run metadata:**
190
+   Training ran on an NVIDIA L4 (24 GB VRAM) via Hugging Face Spaces, using the `unsloth/Llama-3.1-8B-Instruct` base model quantized to 4-bit and trained in bfloat16. The LoRA configuration used rank `r = 32` and `alpha = 64`, targeting all linear layers. Key `GRPOConfig` hyperparameters: learning rate `2e-5`, 3 epochs, effective batch size 8 (per-device batch size 1 × 8 gradient accumulation steps), and max completion length 128.
191
+
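The run metadata can be written down as a small settings object to make the batch-size arithmetic explicit. This is a sketch under stated assumptions: the `RunSettings` class and its field names are hypothetical and are not read from `grpo_train.py`, a single GPU is assumed, and the `alpha / r` scaling is the standard LoRA convention rather than something the README states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    """Hyperparameters quoted in the run metadata (illustrative container only)."""

    base_model: str = "unsloth/Llama-3.1-8B-Instruct"
    load_in_4bit: bool = True
    lora_rank: int = 32
    lora_alpha: int = 64
    learning_rate: float = 2e-5
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    max_completion_length: int = 128
    num_devices: int = 1  # assumption: one NVIDIA L4

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step.
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank


settings = RunSettings()
print("effective batch size:", settings.effective_batch_size)
print("LoRA scaling factor:", settings.lora_scaling)
```

With a per-device batch size of 1 and 8 accumulation steps on one device, each optimizer step sees 8 samples, which matches the effective batch size stated above.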
192
+ ![training_first](https://cdn-uploads.huggingface.co/production/uploads/6985e6933894062444aae20e/SCJveHcoSKgW9Rk56D3BU.png)
193
+
194
+ ![Training_third](https://cdn-uploads.huggingface.co/production/uploads/6985e6933894062444aae20e/fLG0zQ4iK37fHaX9YnRVO.png)
195
+ ### Loss and reward
196
+
197
+
198
+ ![training_loss](https://cdn-uploads.huggingface.co/production/uploads/6985e6933894062444aae20e/sCeY-w9COj_95NVNXo7K7.png)
199
+
200
+
201
+ ![reward_curve](https://cdn-uploads.huggingface.co/production/uploads/6985e6933894062444aae20e/3X0Y91km7iXlM_fWdnHdK.png)
202
+
203
+
204
+
205
+
206
  ---
207
 
208
  ## Quick Start
 
237
  python demo.py
238
  ```
239
 
240
+ ### 5. Train an agent (Unsloth + TRL GRPO)
241
+ Requires a CUDA GPU. `grpo_train.py` fine-tunes a LoRA on `unsloth/Llama-3.1-8B-Instruct` with **Unsloth** and **TRL’s `GRPOTrainer`**, using this environment as the reward source. See [Training & results](#training--results) for plots and narrative after you train.
242
+
243
  ```bash
244
  python grpo_train.py
245
  ```
 
277
  - **Theme:** 3.1 Professional Tasks — Multi-Step Reasoning & Policy Compliance
278
  - **Bonus Track:** Scaler AI Labs — Multi-App RL Environment for Enterprise Workflows
279
  - **Team:** Parth Singhal, Mehakveer Kaur, Kartik Goyal
280
+ - **Checklist:** OpenEnv-based env, **Unsloth + TRL** training (`grpo_train.py`), **loss/reward plots** in [Training & results](#training--results), **HF Space + collateral** in [Quick links](#quick-links).
281
 
282
  ---
283