3v324v23 committed on
Commit 685b77a · 1 Parent(s): 574b833

Fix task_id kwarg in reward function

Files changed (1)
  1. README.md +245 -28
README.md CHANGED
@@ -1,28 +1,245 @@
- 🛑 Meta Ad-Policy RL Sandbox
- A custom, bleeding-edge Reinforcement Learning environment built for the Meta Ad-Policy Hackathon. This sandbox evaluates the ability of Vision-Language Models (VLMs) and LLMs to act as autonomous ad moderators, navigating complex policy violations, multimodal traps, and illegal targeting.
-
- 🚀 Core Features
- OpenEnv 0.2.3 Compliant: Fully implements the latest Meta OpenEnv specifications, including Pydantic StepResult state serialization and /step & /reset API endpoints.
- Reward Shaping: Implements a strict -0.05 step penalty to force the AI agent to optimize tool usage and prevent infinite analysis loops.
- Multimodal Traps: Tests the limits of VLMs by presenting ads where the text is benign, but the visual elements contain severe policy violations.
- Containerized Infrastructure: Fully Dockerized and highly lightweight, easily running under the 2 vCPU / 8GB RAM hackathon constraints.
- 📋 Evaluation Tasks
- The environment natively supports 4 distinct adversarial tasks, loadable via the task_id parameter:
-
- task_1_healthcare: Evaluates ads for unapproved medical claims, pharmaceuticals, and subtle dog whistles.
- task_2_financial: Evaluates ads for predatory financial services, scams, and high-pressure tactics.
- task_3_multimodal: Detects policy violations hidden entirely within visual elements that bypass standard NLP text filters.
- task_4_targeting: Identifies illegal demographic targeting (e.g., adult financial services targeting minors).
- 🛠️ Available Agent Tools
- The environment exposes the following action space to the evaluating LLM:
-
- analyze_image: Request VLM context for visual elements.
- request_landing_page: Extract simulated URL endpoints.
- request_id_verification: Check advertiser trust scores.
- approve / reject: Terminal actions.
- 🚦 Quick Start (Local)
- 1. Build the Docker Image docker build -t meta-ad-sandbox .
-
- 2. Run the Environment Container docker run -p 8000:8000 meta-ad-sandbox
-
- 3. Run the Automated Inference Agent Make sure your Hugging Face credentials are set, then run the evaluation script to test the agent against all 4 tasks: export HF_TOKEN="your_hugging_face_token" python inference.py
+ ````markdown
+ # MetaGuard: Enterprise Ad-Policy RL Sandbox
+
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](#)
+ [![Python](https://img.shields.io/badge/Python-3.11%2B-green)](#)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-microservices-009688)](#)
+ [![RL](https://img.shields.io/badge/Training-GRPO-orange)](#)
+
+ MetaGuard is an OpenEnv-compatible reinforcement learning environment built for enterprise policy decision-making. It simulates a realistic ad-review workflow where an agent must gather context, inspect policy constraints, validate advertiser history, log its decision trail, and take a final moderation action.
+
+ The goal is not simple classification. The goal is procedural compliance under uncertainty.
+
+ ---
+
+ ## Why this project exists
+
+ Most moderation demos stop at “approve” or “reject.” Real systems do not work that way.
+
+ A production moderation workflow usually needs:
+ - policy lookup before judgment
+ - account and advertiser risk context
+ - audit logging for traceability
+ - support for multimodal and adversarial inputs
+ - stepwise compliance with a strict operating procedure
+
+ MetaGuard models that workflow as a reinforcement learning environment, so an agent is rewarded not just for the final answer, but for following the correct enterprise process.
+
+ ---
+
+ ## Core idea
+
+ The environment forces the agent to behave like a policy operator inside a controlled moderation stack:
+
+ 1. retrieve policy constraints
+ 2. inspect the content
+ 3. check advertiser history
+ 4. write an audit log
+ 5. take a terminal decision
+
+ Skipping steps, violating the sequence, or ignoring context results in penalties.
+
+ ---
+
+ ## System architecture
+
+ ```mermaid
+ flowchart LR
+ A[Agent / Policy Model] -->|reset / step| B[Environment Hub]
+ B --> C[Regulatory Service]
+ B --> D[Advertiser CRM Service]
+ B --> E[Audit Service]
+ B --> F[Scenario Generator]
+ B -->|observation + reward| A
+ ````
+
+ ### Services
+
+ **Environment Hub**
+ Coordinates the episode lifecycle, enforces step order, applies rewards, and exposes the OpenEnv-style interface.
+
+ **Regulatory Service**
+ Returns policy constraints, sensitive categories, and risk rules for a given task.
+
+ **Advertiser CRM Service**
+ Stores advertiser history, trust level, and prior violations.
+
+ **Audit Service**
+ Persists the moderation trace and final decision record.
+
+ **Scenario Generator**
+ Creates varied tasks and adversarial edge cases so the policy does not overfit to a narrow pattern.
+
+ ---
+
+ ## Action space
+
+ The environment uses a structured action space designed around real moderation work.
+
+ ### Required workflow actions
+
+ * `query_regulations` — fetch policy constraints
+ * `analyze_image` — inspect visual content when the task includes media
+ * `check_advertiser_history` — retrieve account risk context
+ * `submit_audit` — store the decision trail before final action
+
+ ### Terminal actions
+
+ * `approve`
+ * `reject`
+
+ The environment penalizes invalid ordering, skipped steps, premature terminal actions, and unsupported decisions.
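The ordering rules in the action-space section can be illustrated with a small validator. This is a minimal sketch, not the repository's implementation: the action names come from the README, but `REQUIRED_ORDER`, `required_steps`, `validate_action`, the returned labels, and the rule that `analyze_image` is required only when the task includes media are all assumptions made for illustration.

```python
from typing import List, Tuple

# Action names are taken from the README; the strict ordering
# below is an assumption made for this sketch.
REQUIRED_ORDER: Tuple[str, ...] = (
    "query_regulations",
    "analyze_image",
    "check_advertiser_history",
    "submit_audit",
)
TERMINAL_ACTIONS = {"approve", "reject"}


def required_steps(has_media: bool) -> List[str]:
    """Return the workflow steps a task requires (hypothetical rule)."""
    return [a for a in REQUIRED_ORDER if has_media or a != "analyze_image"]


def validate_action(history: List[str], action: str, has_media: bool = True) -> str:
    """Classify `action` given the actions already taken in this episode.

    Returns one of: "ok", "invalid_action", "out_of_order",
    "premature_terminal".
    """
    steps = required_steps(has_media)
    if action in TERMINAL_ACTIONS:
        # Terminal actions are gated on every required step being done.
        return "ok" if history == steps else "premature_terminal"
    if action not in steps:
        return "invalid_action"
    # The next expected step is the first one not yet completed.
    expected = steps[len(history)] if len(history) < len(steps) else None
    return "ok" if action == expected else "out_of_order"
```

An environment built this way could map each non-`"ok"` label to its own penalty, which keeps the ordering logic separate from the reward magnitudes.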
+
+ ---
+
+ ## Reward design
+
+ Rewards reflect enterprise correctness, not just outcome guessing:
+
+ * positive reward for correct terminal decision
+ * positive reward for following required procedural steps
+ * bonus for complete audit logging
+ * penalty for skipping mandatory steps
+ * penalty for invalid actions
+ * penalty for inconsistent decisions
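One way the reward terms listed above could combine is sketched here. The README names the terms but gives no magnitudes, so every weight below is hypothetical, as are the function name `compute_reward` and its parameters. Treating an "inconsistent decision" as a terminal action that contradicts the recommendation stored in the audit log is also an assumption.

```python
from typing import Iterable, Optional

# All weights are hypothetical; the README does not specify them.
W_CORRECT = 1.0        # correct terminal decision
W_STEP = 0.1           # per required step that was followed
W_AUDIT = 0.2          # bonus for a complete audit log
P_SKIPPED = 0.25       # per mandatory step that was skipped
P_INVALID = 0.1        # per invalid action
P_INCONSISTENT = 0.5   # decision contradicts the audit recommendation


def compute_reward(
    decision: str,
    expected: str,
    steps_done: Iterable[str],
    steps_required: Iterable[str],
    invalid_actions: int = 0,
    audit_complete: bool = False,
    audit_recommendation: Optional[str] = None,
) -> float:
    """Combine outcome and compliance terms into one scalar reward."""
    done = set(steps_done)
    required = list(steps_required)
    followed = sum(1 for step in required if step in done)
    skipped = len(required) - followed

    reward = 0.0
    if decision == expected:
        reward += W_CORRECT
    reward += W_STEP * followed
    if audit_complete:
        reward += W_AUDIT
    reward -= P_SKIPPED * skipped
    reward -= P_INVALID * invalid_actions
    # Assumed meaning of "inconsistent": the terminal action differs
    # from what the agent itself recorded in the audit log.
    if audit_recommendation is not None and audit_recommendation != decision:
        reward -= P_INCONSISTENT
    return reward
```

Keeping the compliance terms additive means an agent that guesses the right label while skipping every step still scores below one that follows the procedure and decides correctly.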
+
+ ---
+
+ ## Training with GRPO
+
+ MetaGuard supports policy optimization using **GRPO (Group Relative Policy Optimization)**.
+
+ ### Why GRPO
+
+ * no separate critic model required
+ * works well with relative reward comparisons
+ * suited for structured decision tasks
+ * integrates cleanly with environment-driven feedback
+
+ ### Why Unsloth
+
+ * reduced VRAM usage
+ * faster fine-tuning cycles
+ * practical for 7B–8B models on limited hardware
+
+ ### Training loop
+
+ 1. sample tasks
+ 2. run policy in environment
+ 3. compute reward from compliance + outcome
+ 4. update policy with GRPO
+ 5. repeat across task families
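The "no separate critic" point in the GRPO section follows from how advantages are computed: rewards for a group of rollouts on the same task are normalized against that group's own mean and standard deviation. The sketch below shows only that normalization step with the standard library. The function name, the `eps` value, and the use of the population standard deviation are assumptions, and real trainers may differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group of rollouts.

    `rewards` holds one scalar per rollout sampled for the same task.
    `eps` (an assumed value) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the first and third rollouts and negative ones for the others, while a group where every rollout scores the same yields all zeros and therefore no learning signal.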
+
+ ---
+
+ ## Task families
+
+ * **Healthcare claims** — unapproved medical claims, pharma violations
+ * **Financial claims** — predatory offers, misleading returns
+ * **Multimodal traps** — violations hidden in images
+ * **Targeting violations** — illegal demographic targeting
+
+ These scenarios test both policy understanding and procedural discipline.
+
+ ---
+
+ ## What makes this different
+
+ MetaGuard is not a classifier.
+
+ It simulates a real moderation workflow with:
+
+ * tool usage
+ * stateful decision making
+ * policy retrieval
+ * advertiser context
+ * auditability
+ * adversarial task generation
+ * RL-based optimization
+
+ ---
+
+ ## Local setup
+
+ ### Install
+
+ ```bash
+ pip install -e .
+ pip install -r requirements.txt
+ ```
+
+ ### Run services
+
+ ```bash
+ python apps/regulatory_api.py
+ python apps/crm_api.py
+ python apps/audit_api.py
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Train / Inference
+
+ ```bash
+ python grpo_train.py
+ python inference.py
+ ```
+
+ ### Validate
+
+ ```bash
+ ./validate.sh <YOUR_HF_SPACE_URL> .
+ ```
+
+ ---
+
+ ## Repository structure
+
+ ```text
+ .
+ ├── apps/ # microservices
+ ├── server/ # environment hub
+ ├── src/ # environment + logic
+ ├── grpo_train.py # training
+ ├── inference.py # evaluation
+ ├── validate.sh # validation script
+ └── README.md
+ ```
+
+ ---
+
+ ## Implementation notes
+
+ * strict step sequence enforced
+ * terminal actions gated by compliance steps
+ * audit logs must be structured
+ * reproducibility from clean setup is required
+ * Docker build must be standard and functional
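The note that audit logs must be structured could be met with a typed record that serializes to JSON. The README does not define a schema, so the sketch below is only one possible shape: the class name `AuditRecord` and every field name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AuditRecord:
    """Hypothetical audit entry; field names are illustrative only."""

    task_id: str
    actions: List[str] = field(default_factory=list)
    decision: str = ""
    rationale: str = ""

    def is_complete(self) -> bool:
        # A record counts as complete only when every field is filled in.
        return bool(self.task_id and self.actions and self.decision and self.rationale)

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

A completeness check like `is_complete()` gives the environment a concrete condition to test before granting an audit-logging bonus.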
+
+ ---
+
+ ## Suggested demo flow
+
+ 1. show a complex policy case
+ 2. agent calls services in correct order
+ 3. audit log is generated
+ 4. final decision is made
+ 5. reward explains correctness
+
+ ---
+
+ ## Future improvements
+
+ * stronger multimodal reasoning
+ * richer policy graphs
+ * improved adversarial generation
+ * better evaluation metrics
+ * expanded agent compatibility
+
+ ---
+
+ ## License
+
+ Add your license here.
+
+ ```
+ ```