3v324v23 committed · Commit 45f5063 · 1 Parent(s): 685b77a

Update readme

Files changed (1)
  1. README.md +50 -199
README.md CHANGED
@@ -1,245 +1,96 @@
- ````markdown
  # MetaGuard: Enterprise Ad-Policy RL Sandbox
 
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](#)
- [![Python](https://img.shields.io/badge/Python-3.11%2B-green)](#)
- [![FastAPI](https://img.shields.io/badge/FastAPI-microservices-009688)](#)
- [![RL](https://img.shields.io/badge/Training-GRPO-orange)](#)
 
- MetaGuard is an OpenEnv-compatible reinforcement learning environment built for enterprise policy decision-making. It simulates a realistic ad-review workflow where an agent must gather context, inspect policy constraints, validate advertiser history, log its decision trail, and take a final moderation action.
-
- The goal is not simple classification. The goal is procedural compliance under uncertainty.
 
  ---
 
- ## Why this project exists
-
- Most moderation demos stop at “approve” or “reject.” Real systems do not work that way.
-
- A production moderation workflow usually needs:
- - policy lookup before judgment
- - account and advertiser risk context
- - audit logging for traceability
- - support for multimodal and adversarial inputs
- - stepwise compliance with a strict operating procedure
-
- MetaGuard models that workflow as a reinforcement learning environment, so an agent is rewarded not just for the final answer, but for following the correct enterprise process.
-
- ---
-
- ## Core idea
-
- The environment forces the agent to behave like a policy operator inside a controlled moderation stack:
-
- 1. retrieve policy constraints
- 2. inspect the content
- 3. check advertiser history
- 4. write an audit log
- 5. take a terminal decision
 
- Skipping steps, violating the sequence, or ignoring context results in penalties.
-
- ---
-
- ## System architecture
 
  ```mermaid
  flowchart LR
- A[Agent / Policy Model] -->|reset / step| B[Environment Hub]
- B --> C[Regulatory Service]
- B --> D[Advertiser CRM Service]
- B --> E[Audit Service]
- B --> F[Scenario Generator]
  B -->|observation + reward| A
- ````
-
- ### Services
-
- **Environment Hub**
- Coordinates the episode lifecycle, enforces step order, applies rewards, and exposes the OpenEnv-style interface.
-
- **Regulatory Service**
- Returns policy constraints, sensitive categories, and risk rules for a given task.
-
- **Advertiser CRM Service**
- Stores advertiser history, trust level, and prior violations.
-
- **Audit Service**
- Persists the moderation trace and final decision record.
-
- **Scenario Generator**
- Creates varied tasks and adversarial edge cases so the policy does not overfit to a narrow pattern.
-
- ---
-
- ## Action space
-
- The environment uses a structured action space designed around real moderation work.
-
- ### Required workflow actions
-
- * `query_regulations` — fetch policy constraints
- * `analyze_image` — inspect visual content when the task includes media
- * `check_advertiser_history` — retrieve account risk context
- * `submit_audit` — store the decision trail before final action
-
- ### Terminal actions
-
- * `approve`
- * `reject`
-
- The environment penalizes invalid ordering, skipped steps, premature terminal actions, and unsupported decisions.
-
- ---
-
- ## Reward design
-
- Rewards reflect enterprise correctness, not just outcome guessing:
-
- * positive reward for correct terminal decision
- * positive reward for following required procedural steps
- * bonus for complete audit logging
- * penalty for skipping mandatory steps
- * penalty for invalid actions
- * penalty for inconsistent decisions
-
- ---
-
- ## Training with GRPO
-
- MetaGuard supports policy optimization using **GRPO (Group Relative Policy Optimization)**.
-
- ### Why GRPO
-
- * no separate critic model required
- * works well with relative reward comparisons
- * suited for structured decision tasks
- * integrates cleanly with environment-driven feedback
-
- ### Why Unsloth
-
- * reduced VRAM usage
- * faster fine-tuning cycles
- * practical for 7B–8B models on limited hardware
-
- ### Training loop
 
- 1. sample tasks
- 2. run policy in environment
- 3. compute reward from compliance + outcome
- 4. update policy with GRPO
- 5. repeat across task families
 
133
  ---
134
 
135
- ## Task families
136
 
137
- * **Healthcare claims** unapproved medical claims, pharma violations
138
- * **Financial claims** — predatory offers, misleading returns
139
- * **Multimodal traps** — violations hidden in images
140
- * **Targeting violations** — illegal demographic targeting
141
 
142
- These scenarios test both policy understanding and procedural discipline.
 
 
143
 
144
  ---
145
 
146
- ## What makes this different
147
 
148
- MetaGuard is not a classifier.
149
 
150
- It simulates a real moderation workflow with:
151
-
152
- * tool usage
153
- * stateful decision making
154
- * policy retrieval
155
- * advertiser context
156
- * auditability
157
- * adversarial task generation
158
- * RL-based optimization
159
 
160
  ---
161
 
162
- ## Local setup
163
 
164
- ### Install
 
165
 
166
  ```bash
 
167
  pip install -e .
168
  pip install -r requirements.txt
169
- ```
170
 
171
- ### Run services
172
-
173
- ```bash
174
  python apps/regulatory_api.py
175
  python apps/crm_api.py
176
  python apps/audit_api.py
 
 
177
  uvicorn server.app:app --host 0.0.0.0 --port 8000
178
  ```
179
 
180
- ### Train / Inference
181
-
182
  ```bash
183
- python grpo_train.py
184
  python inference.py
185
  ```
186
 
187
- ### Validate
188
-
189
- ```bash
190
- ./validate.sh <YOUR_HF_SPACE_URL> .
191
- ```
192
-
193
  ---
194
 
195
- ## Repository structure
196
-
197
- ```text
198
- .
199
- ├── apps/ # microservices
200
- ├── server/ # environment hub
201
- ├── src/ # environment + logic
202
- ├── grpo_train.py # training
203
- ├── inference.py # evaluation
204
- ├── validate.sh # validation script
205
- └── README.md
206
- ```
207
-
208
- ---
209
-
210
- ## Implementation notes
211
-
212
- * strict step sequence enforced
213
- * terminal actions gated by compliance steps
214
- * audit logs must be structured
215
- * reproducibility from clean setup is required
216
- * Docker build must be standard and functional
217
-
218
- ---
219
-
220
- ## Suggested demo flow
221
-
222
- 1. show a complex policy case
223
- 2. agent calls services in correct order
224
- 3. audit log is generated
225
- 4. final decision is made
226
- 5. reward explains correctness
227
 
228
  ---
229
 
230
- ## Future improvements
231
-
232
- * stronger multimodal reasoning
233
- * richer policy graphs
234
- * improved adversarial generation
235
- * better evaluation metrics
236
- * expanded agent compatibility
237
-
238
- ---
239
-
240
- ## License
241
-
242
- Add your license here.
243
-
244
- ```
245
- ```
 
+ ```markdown
  # MetaGuard: Enterprise Ad-Policy RL Sandbox
 
+ ![OpenEnv](https://img.shields.io/badge/OpenEnv-0.2.3-blue)
+ ![Python](https://img.shields.io/badge/Python-3.11%2B-green)
+ ![FastAPI](https://img.shields.io/badge/FastAPI-Multi--Service-009688)
+ ![Algorithm](https://img.shields.io/badge/RL-GRPO-orange)
 
+ MetaGuard is a high-fidelity Reinforcement Learning (RL) environment designed for ad-policy moderation. It simulates a production-grade enterprise ecosystem where AI agents must navigate multi-step compliance workflows, coordinate across distributed microservices, and overcome adversarial multimodal "traps."
 
  ---
 
+ ## 🏗️ System Architecture
 
+ MetaGuard utilizes a distributed microservice architecture to mimic a production moderation stack.
 
  ```mermaid
  flowchart LR
+ A[Agent / LLM Policy] -->|/reset, /step| B[OpenEnv Environment Server]
+ B -->|query_regulations| C[Regulatory API :8001]
+ B -->|check_history| D[CRM API :8002]
+ B -->|submit_audit| E[Audit API :8003]
  B -->|observation + reward| A
+ ```
 
+ ### Integrated Services
+ * **Environment Hub (`:8000`)**: Orchestrates the episode lifecycle and enforces procedural phase gates.
+ * **Regulatory API (`:8001`)**: Provides category-specific policy constraints and risk levels.
+ * **Advertiser CRM (`:8002`)**: Manages advertiser trust scores and historical violation records.
+ * **Audit API (`:8003`)**: Persists the "Chain of Thought" and decision logs for full traceability.
 
  ---
 
+ ## 🧠 Methodology: GRPO + Unsloth
 
+ To advance beyond simple instruction following, the system implements **Group Relative Policy Optimization (GRPO)** for fine-tuning.
 
+ * **Efficiency:** Optimized via **Unsloth** to enable 8B model training on consumer-grade GPUs with significantly reduced VRAM footprint.
+ * **Critic-less RL:** GRPO computes each completion's advantage relative to the other completions sampled in its group, eliminating the need for a separate critic (value model).
+ * **Dynamic Training:** The training loop interacts with the **live environment** directly, allowing the model to learn from real-time API feedback.
 
  ---
 
+ ## 🚦 Procedural Action Space
 
+ The environment enforces a strict Standard Operating Procedure (SOP). Failure to follow this sequence results in negative rewards and blocked terminal actions.
 
+ 1. **`query_regulations`**: Fetch policy constraints (Mandatory initial step).
+ 2. **`analyze_image`**: Inspect visual assets for policy "dog whistles" (Required for multimodal tasks).
+ 3. **`check_advertiser_history`**: Consult the CRM for risk context and recidivism.
+ 4. **`submit_audit`**: Log reasoning to the Audit API (Required before final decision).
+ 5. **`approve` / `reject`**: Terminal actions.
 
  ---
 
+ ## 🚀 Deployment Guide
 
+ ### Local Microservice Setup
+ To initialize the full enterprise stack locally:
 
  ```bash
+ # 1. Install local project in editable mode
  pip install -e .
  pip install -r requirements.txt
 
+ # 2. Launch background microservices
  python apps/regulatory_api.py
  python apps/crm_api.py
  python apps/audit_api.py
+
+ # 3. Start the Environment Hub
  uvicorn server.app:app --host 0.0.0.0 --port 8000
  ```
 
+ ### Running Inference
+ Evaluate agent compliance across adversarial task families:
  ```bash
+ export HF_TOKEN="your_token"
  python inference.py
  ```
 
  ---
 
+ ## 📊 Adversarial Task Families
+ The system evaluates agents on four distinct challenge categories:
+ * **`task_1_healthcare`**: Detection of unapproved medical claims and pharmaceutical violations.
+ * **`task_2_financial`**: Identification of predatory services and high-pressure financial tactics.
+ * **`task_3_multimodal`**: Policy violations hidden within imagery that bypass standard text filters.
+ * **`task_4_targeting`**: Illegal demographic targeting and age-restricted policy violations.
 
  ---
 
+ ## 🛠️ Technical Design Decisions
+ * **Synthetic Scenario Generation:** Utilizes a dynamic `AdGenerator` to produce unique training scenarios, ensuring generalization across diverse policy edge cases.
+ * **Inference Rerouting:** The stack supports instant toggling to high-speed providers to manage API rate limits during large-scale evaluation.
+ ```
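
The "Procedural Action Space" section of the new README describes ordered steps that gate the terminal actions, but the diff contains no code for it. The following is a minimal sketch of how such a phase gate could behave. The action names come from the README; the `SopGate` class, its fields, and every reward value are hypothetical and are not taken from the repository. Scoring whether `approve` or `reject` was the correct decision is deliberately left out.

```python
from dataclasses import dataclass, field

# Action names are from the README; everything else here is an assumption.
STEP_ORDER = (
    "query_regulations",
    "analyze_image",
    "check_advertiser_history",
    "submit_audit",
)
TERMINAL_ACTIONS = ("approve", "reject")


@dataclass
class SopGate:
    """Hypothetical phase gate: tracks completed steps for one episode."""

    multimodal: bool = False          # analyze_image is required only for multimodal tasks
    completed: list = field(default_factory=list)
    done: bool = False

    def required(self):
        """Steps that must be completed, in order, before a terminal action."""
        return [s for s in STEP_ORDER if s != "analyze_image" or self.multimodal]

    def step(self, action):
        """Return (reward, done, info) for one action. Reward values are placeholders."""
        if self.done:
            return -1.0, True, "episode already finished"

        required = self.required()

        if action in TERMINAL_ACTIONS:
            if self.completed != required:
                # Terminal action is blocked until the full procedure is done.
                return -1.0, False, "terminal action blocked: SOP incomplete"
            self.done = True
            return 1.0, True, f"{action} accepted"

        # The only valid non-terminal action is the next step in the sequence.
        expected = required[len(self.completed)] if len(self.completed) < len(required) else None
        if action != expected:
            return -0.5, False, f"out of order: expected {expected}"

        self.completed.append(action)
        return 0.1, False, f"{action} ok"


if __name__ == "__main__":
    gate = SopGate(multimodal=False)
    for action in ("approve", "query_regulations", "check_advertiser_history",
                   "submit_audit", "reject"):
        print(action, gate.step(action))
```

In this sketch a premature `approve` is penalized without ending the episode, which matches the README's wording that terminal actions are "blocked" rather than accepted with a penalty; the real environment may handle this differently.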