ps2181 Claude Sonnet 4.6 committed on
Commit f45efdb · 1 Parent(s): 48cc8c7

Add Generator adversarial GRPO training + /generator/score endpoint

- /generator/score: scores a Generator-produced invoice through the Auditor+Approver
  pipeline using tracker detection rates; serves as the live reward for adversarial training
- generator_grpo_training.ipynb: 9-cell Colab notebook
- Fraud type sampling biased by Regulator weights (phantom_vendor = 60%)
- 3 reward fns: live evasion (0.85/0.60/0.10), format, plausibility
- FRAUD_CONTEXTS: detailed evasion strategies per fraud type
- Cell 7: post-training evasion rate eval
- Cell 8: arms race visualisation
- Cell 9: save + arms race summary
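
The 0.85/0.60/0.10 tiers of the live evasion reward can be read as a lookup on two booleans. The following is a minimal sketch under stated assumptions, not the code in `server/multi_agent_environment.py` (whose body is not part of this commit): `sketch_generator_reward` and `expected_reward` are hypothetical names, and `approve_rate_if_missed` is an assumed parameter. Since `/generator/score` flags a rule violation only with probability equal to the tracker's detection rate, the second helper works out the expected reward for a violating invoice.

```python
def sketch_generator_reward(auditor_detected: bool, approver_approved: bool) -> float:
    """Hypothetical stand-in for compute_generator_reward (real body not shown here)."""
    if auditor_detected:
        return 0.10  # caught by the Auditor
    if approver_approved:
        return 0.85  # evades both Auditor and Approver
    return 0.60      # evades the Auditor, but the Approver rejects


def expected_reward(detection_rate: float, approve_rate_if_missed: float = 1.0) -> float:
    """Expected reward when a rule violation is flagged with probability detection_rate.

    approve_rate_if_missed is an assumption: the chance that the Approver
    approves an invoice the Auditor did not flag.
    """
    missed = 1.0 - detection_rate
    evade_value = (
        approve_rate_if_missed * sketch_generator_reward(False, True)
        + (1.0 - approve_rate_if_missed) * sketch_generator_reward(False, False)
    )
    return detection_rate * sketch_generator_reward(True, False) + missed * evade_value


if __name__ == "__main__":
    # 31% is the phantom_vendor detection rate quoted in the demo_seed docstring.
    print(round(expected_reward(0.31), 4))
```

Under these assumptions, a lower detection rate raises the expected reward linearly, which is why a seeded blind spot is attractive to the Generator.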

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
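
The weight-biased sampling of fraud types mentioned in the message can be sketched on its own, without the Space or the tokenizer. This is an illustrative sketch: `normalise_weights` and `sample_fraud_types` are hypothetical helpers, and the example weights are invented; real values come from `/regulator/report`. Unknown keys such as `compound_fraud` are dropped before normalising, as the notebook does.

```python
import random
from collections import Counter

FRAUD_TYPES = ["phantom_vendor", "price_gouging", "math_fraud", "duplicate_submission"]


def normalise_weights(raw: dict) -> dict:
    """Keep only known fraud types and rescale so the weights sum to 1."""
    kept = {ft: float(raw.get(ft, 0.0)) for ft in FRAUD_TYPES}
    total = sum(kept.values())
    if total <= 0:
        # Fall back to a uniform distribution when no usable weight is given.
        return {ft: 1.0 / len(FRAUD_TYPES) for ft in FRAUD_TYPES}
    return {ft: w / total for ft, w in kept.items()}


def sample_fraud_types(raw: dict, n: int, seed: int = 0) -> list:
    """Draw n fraud types with probability proportional to the normalised weights."""
    weights = normalise_weights(raw)
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


if __name__ == "__main__":
    # Invented weights for illustration only.
    example = {"phantom_vendor": 0.6, "price_gouging": 0.2, "math_fraud": 0.1,
               "duplicate_submission": 0.1, "compound_fraud": 0.3}
    print(Counter(sample_fraud_types(example, n=1000)))
```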

Files changed (2)
  1. generator_grpo_training.ipynb +547 -0
  2. server/app.py +103 -0
generator_grpo_training.ipynb ADDED
@@ -0,0 +1,547 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Generator Agent: GRPO Adversarial Training\n",
+    "\n",
+    "Trains the Generator to **produce fraudulent invoices that evade the Auditor**.\n",
+    "This is the adversarial self-play loop:\n",
+    "\n",
+    "```\n",
+    "Generator produces invoice → Auditor scores it → Approver decides\n",
+    "Generator reward: 0.85 (evades both) / 0.60 (evades Auditor) / 0.10 (caught)\n",
+    "```\n",
+    "\n",
+    "The Generator learns:\n",
+    "- Which vendor name patterns slip past phantom_vendor detection\n",
+    "- What price gouging amounts avoid the 150% threshold check\n",
+    "- How to craft math_fraud that looks plausible\n",
+    "\n",
+    "**Model:** Qwen2.5-1.5B-Instruct, 4-bit QLoRA, r=16\n",
+    "**Reward:** Live `/generator/score` on HF Space (Auditor+Approver pipeline)\n",
+    "**Environment:** https://ps2181-invoice-processing-pipeline.hf.space"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 1: Install\n",
+    "!pip install -q -U accelerate trl\n",
+    "!pip install -q unsloth transformers peft datasets httpx matplotlib"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 2: Imports and constants\n",
+    "import json\n",
+    "import random\n",
+    "import re\n",
+    "import httpx\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "ENV_URL = \"https://ps2181-invoice-processing-pipeline.hf.space\"\n",
+    "\n",
+    "# Seed the Regulator tracker so phantom_vendor is a known blind spot\n",
+    "httpx.post(f\"{ENV_URL}/regulator/demo_seed\")\n",
+    "weights = httpx.get(f\"{ENV_URL}/regulator/report\").json()[\"generator_weights\"]\n",
+    "print(\"Regulator blind spot weights:\", weights)\n",
+    "\n",
+    "SYSTEM_PROMPT = \"\"\"You are an adversarial invoice generator. Your goal is to create fraudulent invoices\n",
+    "that look legitimate enough to evade an AI fraud auditor.\n",
+    "\n",
+    "You will be told which fraud type to generate and given context about the auditor's known weaknesses.\n",
+    "Generate a single invoice as a JSON object with these fields:\n",
+    "  vendor, date (YYYY-MM-DD), currency (USD/EUR/GBP), total (number),\n",
+    "  line_items: [{description, qty, unit_price, amount}]\n",
+    "\n",
+    "Fraud strategies:\n",
+    "- phantom_vendor: use a vendor name that SOUNDS legitimate but is not in the approved registry.\n",
+    "  Use subtle variations: 'GlobalTech Solutions Ltd' instead of 'GlobalTech Solutions'.\n",
+    "  Avoid obviously fake names.\n",
+    "- price_gouging: inflate unit prices by exactly 51-55% above market max (not 60%+, which is obvious).\n",
+    "  Keep most items at normal prices, gouge only one item.\n",
+    "- math_fraud: inflate the total by 9-12% above the sum of line items.\n",
+    "  Keep line item math correct so automated checks pass.\n",
+    "- duplicate_submission: reuse a real invoice ID from history with a slightly different total.\n",
+    "\n",
+    "Output ONLY valid JSON. No explanation.\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 3: Load model\n",
+    "from unsloth import FastLanguageModel\n",
+    "import torch\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=\"unsloth/Qwen2.5-1.5B-Instruct\",\n",
+    "    max_seq_length=1024,\n",
+    "    dtype=None,\n",
+    "    load_in_4bit=True,\n",
+    ")\n",
+    "\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model,\n",
+    "    r=16,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+    "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    "    lora_alpha=16,\n",
+    "    lora_dropout=0,\n",
+    "    bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\",\n",
+    "    random_state=42,\n",
+    ")\n",
+    "print(\"Generator model loaded\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 4: Reward functions\n",
+    "\n",
+    "APPROVED_VENDORS = [\n",
+    "    \"Acme Corp\", \"GlobalTech Solutions\", \"Prime Office Supplies\",\n",
+    "    \"DataStream Inc\", \"CloudNine Services\", \"Metro Logistics\",\n",
+    "    \"Pinnacle Electronics\", \"Summit Consulting\", \"Vertex Manufacturing\",\n",
+    "    \"Horizon Digital\", \"NexGen Software\", \"BluePeak Analytics\",\n",
+    "]\n",
+    "\n",
+    "MARKET_PRICE_MAX = {\n",
+    "    \"Laptop Computer\": 1299.99, \"Wireless Mouse\": 49.99, \"USB-C Hub\": 79.99,\n",
+    "    \"Monitor Stand\": 89.99, \"Keyboard\": 149.99, \"Webcam HD\": 129.99,\n",
+    "    \"Desk Lamp\": 69.99, \"Notebook Pack\": 29.99, \"Printer Paper (Ream)\": 14.99,\n",
+    "    \"Whiteboard Markers (Set)\": 12.99, \"External SSD 1TB\": 149.99,\n",
+    "    \"Headset\": 99.99, \"Cable Management Kit\": 34.99,\n",
+    "    \"Ergonomic Chair\": 599.99, \"Standing Desk Converter\": 399.99,\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def _parse_invoice_json(text: str) -> dict:\n",
+    "    text = text.strip()\n",
+    "    if text.startswith(\"```\"):\n",
+    "        text = re.sub(r\"^```[a-z]*\\n?\", \"\", text)\n",
+    "        text = re.sub(r\"```$\", \"\", text).strip()\n",
+    "    try:\n",
+    "        d = json.loads(text)\n",
+    "        return d if isinstance(d, dict) else {}\n",
+    "    except json.JSONDecodeError:\n",
+    "        return {}\n",
+    "\n",
+    "\n",
+    "def reward_generator_live(completions, fraud_type=None, **kwargs):\n",
+    "    \"\"\"\n",
+    "    Primary reward: calls /generator/score on HF Space.\n",
+    "    Full Auditor + Approver pipeline.\n",
+    "    Reward: 0.85 evades both / 0.60 evades Auditor / 0.10 caught.\n",
+    "    \"\"\"\n",
+    "    rewards = []\n",
+    "    ft_list = fraud_type if isinstance(fraud_type, (list, tuple)) else [fraud_type or \"phantom_vendor\"] * len(completions)\n",
+    "    for completion, ft in zip(completions, ft_list):  # dataset columns arrive as one value per completion\n",
+    "        text = completion[0][\"content\"] if isinstance(completion, list) else completion\n",
+    "        inv = _parse_invoice_json(text)\n",
+    "        if not inv:\n",
+    "            rewards.append(0.01)\n",
+    "            continue\n",
+    "        try:\n",
+    "            resp = httpx.post(\n",
+    "                f\"{ENV_URL}/generator/score\",\n",
+    "                json={\"invoice_json\": inv, \"fraud_type\": ft},\n",
+    "                timeout=15,\n",
+    "            )\n",
+    "            data = resp.json()\n",
+    "            rewards.append(float(data.get(\"reward\", 0.01)))\n",
+    "        except Exception:\n",
+    "            rewards.append(0.01)\n",
+    "    return rewards\n",
+    "\n",
+    "\n",
+    "def reward_format_valid(completions, **kwargs):\n",
+    "    \"\"\"Reward valid JSON invoice structure.\"\"\"\n",
+    "    rewards = []\n",
+    "    required = {\"vendor\", \"date\", \"currency\", \"total\", \"line_items\"}\n",
+    "    for completion in completions:\n",
+    "        text = completion[0][\"content\"] if isinstance(completion, list) else completion\n",
+    "        inv = _parse_invoice_json(text)\n",
+    "        if not inv:\n",
+    "            rewards.append(0.01)\n",
+    "        elif required.issubset(inv.keys()) and isinstance(inv.get(\"line_items\"), list):\n",
+    "            rewards.append(0.15)\n",
+    "        else:\n",
+    "            rewards.append(0.05)\n",
+    "    return rewards\n",
+    "\n",
+    "\n",
+    "def reward_fraud_plausibility(completions, fraud_type=None, **kwargs):\n",
+    "    \"\"\"\n",
+    "    Local plausibility check: did the model actually inject the fraud?\n",
+    "    Rewards correct fraud injection without calling the server.\n",
+    "    \"\"\"\n",
+    "    rewards = []\n",
+    "    ft_list = fraud_type if isinstance(fraud_type, (list, tuple)) else [fraud_type or \"phantom_vendor\"] * len(completions)\n",
+    "    for completion, ft in zip(completions, ft_list):  # dataset columns arrive as one value per completion\n",
+    "        text = completion[0][\"content\"] if isinstance(completion, list) else completion\n",
+    "        inv = _parse_invoice_json(text)\n",
+    "        if not inv:\n",
+    "            rewards.append(0.01)\n",
+    "            continue\n",
+    "\n",
+    "        score = 0.0\n",
+    "        vendor = inv.get(\"vendor\", \"\")\n",
+    "        items = inv.get(\"line_items\", [])\n",
+    "        total = float(inv.get(\"total\", 0))\n",
+    "        computed = round(sum(float(it.get(\"amount\", 0)) for it in items), 2)\n",
+    "\n",
+    "        if ft == \"phantom_vendor\":\n",
+    "            # Good: vendor not in approved list but sounds legitimate\n",
+    "            not_approved = vendor not in APPROVED_VENDORS\n",
+    "            sounds_legit = any(w in vendor for w in [\"Corp\", \"Inc\", \"Ltd\", \"Solutions\", \"Tech\", \"Services\", \"Co\", \"Group\"])\n",
+    "            score = 0.30 if (not_approved and sounds_legit) else (0.15 if not_approved else 0.01)\n",
+    "\n",
+    "        elif ft == \"price_gouging\":\n",
+    "            for item in items:\n",
+    "                desc = item.get(\"description\", \"\")\n",
+    "                up = float(item.get(\"unit_price\", 0))\n",
+    "                mmax = MARKET_PRICE_MAX.get(desc, up)\n",
+    "                ratio = up / mmax if mmax > 0 else 1.0\n",
+    "                if 1.50 < ratio <= 1.65:  # subtle gouge\n",
+    "                    score = 0.30\n",
+    "                    break\n",
+    "                elif ratio > 1.65:  # obvious gouge\n",
+    "                    score = 0.10\n",
+    "                    break\n",
+    "            score = score or 0.01\n",
+    "\n",
+    "        elif ft == \"math_fraud\":\n",
+    "            if computed > 0:\n",
+    "                inflation = (total - computed) / computed\n",
+    "                if 0.08 <= inflation <= 0.14:  # sweet spot\n",
+    "                    score = 0.30\n",
+    "                elif inflation > 0.01:\n",
+    "                    score = 0.15\n",
+    "                else:\n",
+    "                    score = 0.01\n",
+    "\n",
+    "        elif ft == \"duplicate_submission\":\n",
+    "            # Can't check without history, reward valid invoice structure\n",
+    "            score = 0.20 if inv.get(\"invoice_id\") else 0.05\n",
+    "\n",
+    "        rewards.append(max(0.01, min(score, 0.99)))\n",
+    "    return rewards\n",
+    "\n",
+    "\n",
+    "print(\"Generator reward functions ready\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 5: Build training dataset\n",
+    "# Each prompt asks the Generator to produce a specific fraud type.\n",
+    "# Fraud type distribution is biased by Regulator weights (phantom_vendor = 60%).\n",
+    "\n",
+    "import time\n",
+    "\n",
+    "FRAUD_TYPES = [\"phantom_vendor\", \"price_gouging\", \"math_fraud\", \"duplicate_submission\"]\n",
+    "\n",
+    "FRAUD_CONTEXTS = {\n",
+    "    \"phantom_vendor\": (\n",
+    "        \"The auditor checks vendor names against an approved registry. \"\n",
+    "        \"Your goal: use a vendor name that sounds real but is NOT in: \"\n",
+    "        \"Acme Corp, GlobalTech Solutions, Prime Office Supplies, DataStream Inc, \"\n",
+    "        \"CloudNine Services, Metro Logistics, Pinnacle Electronics, Summit Consulting, \"\n",
+    "        \"Vertex Manufacturing, Horizon Digital, NexGen Software, BluePeak Analytics. \"\n",
+    "        \"Use subtle variations or completely new-sounding names.\"\n",
+    "    ),\n",
+    "    \"price_gouging\": (\n",
+    "        \"The auditor flags unit prices above 150% of market max. \"\n",
+    "        \"Market maximums: Laptop Computer $1299.99, Keyboard $149.99, Ergonomic Chair $599.99, \"\n",
+    "        \"Monitor Stand $89.99, Webcam HD $129.99, External SSD 1TB $149.99. \"\n",
+    "        \"Strategy: inflate one item to exactly 151-155% of market max, just enough to be fraud \"\n",
+    "        \"but subtle enough to look like a premium variant.\"\n",
+    "    ),\n",
+    "    \"math_fraud\": (\n",
+    "        \"The auditor checks if invoice total equals sum of line items. \"\n",
+    "        \"Strategy: make all line item maths correct (qty × unit_price = amount), \"\n",
+    "        \"but inflate the total field by 9-12% above the sum. \"\n",
+    "        \"The total looks plausible because it's close to the real sum.\"\n",
+    "    ),\n",
+    "    \"duplicate_submission\": (\n",
+    "        \"The auditor checks invoice history for duplicates. \"\n",
+    "        \"Strategy: reuse a plausible existing invoice ID (INV-XXXXX format) with a slightly \"\n",
+    "        \"different total (+/- 5%) to avoid exact-match detection. \"\n",
+    "        \"Keep same vendor and date to make it look like a resubmission.\"\n",
+    "    ),\n",
+    "}\n",
+    "\n",
+    "def build_generator_dataset(n=80, fraud_weights=None):\n",
+    "    \"\"\"Build prompts biased by Regulator fraud weights.\"\"\"\n",
+    "    if fraud_weights is None:\n",
+    "        # Equal weight across types\n",
+    "        fraud_weights = {ft: 0.25 for ft in FRAUD_TYPES}\n",
+    "\n",
+    "    # Normalise (drop compound_fraud if present)\n",
+    "    valid_weights = {ft: fraud_weights.get(ft, 0.0) for ft in FRAUD_TYPES}\n",
+    "    total_w = sum(valid_weights.values())\n",
+    "    if total_w > 0:\n",
+    "        valid_weights = {ft: w / total_w for ft, w in valid_weights.items()}\n",
+    "\n",
+    "    types_pool = list(valid_weights.keys())\n",
+    "    weights_pool = [valid_weights[ft] for ft in types_pool]\n",
+    "\n",
+    "    episodes = []\n",
+    "    for _ in range(n):\n",
+    "        fraud_type = random.choices(types_pool, weights=weights_pool, k=1)[0]\n",
+    "        context = FRAUD_CONTEXTS[fraud_type]\n",
+    "\n",
+    "        # Get current regulator state for dynamic context\n",
+    "        blind_spot_note = \"\"\n",
+    "        try:\n",
+    "            report = httpx.get(f\"{ENV_URL}/regulator/report\", timeout=5).json()\n",
+    "            spots = report.get(\"blind_spots\", [])\n",
+    "            if fraud_type in spots:\n",
+    "                blind_spot_note = f\" NOTE: The auditor has a known weakness detecting {fraud_type}. Exploit it.\"\n",
+    "        except Exception:\n",
+    "            pass\n",
+    "\n",
+    "        user_prompt = (\n",
+    "            f\"Generate a fraudulent invoice using fraud type: {fraud_type.upper()}\\n\\n\"\n",
+    "            f\"Context: {context}{blind_spot_note}\\n\\n\"\n",
+    "            f\"Generate a single realistic invoice. Output ONLY valid JSON.\"\n",
+    "        )\n",
+    "\n",
+    "        token_count = len(tokenizer.encode(SYSTEM_PROMPT + user_prompt))\n",
+    "        if token_count > 450:\n",
+    "            continue\n",
+    "\n",
+    "        episodes.append({\n",
+    "            \"prompt\": [\n",
+    "                {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "                {\"role\": \"user\", \"content\": user_prompt},\n",
+    "            ],\n",
+    "            \"fraud_type\": fraud_type,\n",
+    "        })\n",
+    "\n",
+    "    return episodes\n",
+    "\n",
+    "\n",
+    "# Get current Regulator weights\n",
+    "try:\n",
+    "    weights_resp = httpx.get(f\"{ENV_URL}/regulator/report\", timeout=10).json()\n",
+    "    fraud_weights = weights_resp.get(\"generator_weights\", {})\n",
+    "    print(\"Using Regulator weights:\", fraud_weights)\n",
+    "except Exception:\n",
+    "    fraud_weights = {ft: 0.25 for ft in FRAUD_TYPES}\n",
+    "    print(\"Using uniform weights (Regulator unavailable)\")\n",
+    "\n",
+    "episodes = build_generator_dataset(n=100, fraud_weights=fraud_weights)\n",
+    "\n",
+    "# Ensure minimum dataset size (the 0 < guard avoids looping forever on an empty list)\n",
+    "while 0 < len(episodes) < 40:\n",
+    "    episodes = episodes * 2\n",
+    "episodes = episodes[:80]\n",
+    "\n",
+    "dataset = Dataset.from_list(episodes)\n",
+    "print(f\"Dataset: {len(dataset)} rows\")\n",
+    "\n",
+    "# Show fraud type distribution\n",
+    "from collections import Counter\n",
+    "dist = Counter(dataset[\"fraud_type\"])\n",
+    "print(\"Fraud type distribution:\", dict(dist))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 6: GRPO Training\n",
+    "from trl import GRPOConfig, GRPOTrainer\n",
+    "\n",
+    "training_args = GRPOConfig(\n",
+    "    max_steps=50,\n",
+    "    per_device_train_batch_size=1,\n",
+    "    gradient_accumulation_steps=4,\n",
+    "    num_generations=4,\n",
+    "    max_prompt_length=512,\n",
+    "    max_completion_length=512,\n",
+    "    learning_rate=5e-6,\n",
+    "    logging_steps=10,\n",
+    "    output_dir=\"generator_grpo_output\",\n",
+    "    report_to=\"none\",\n",
+    "    temperature=1.0,  # Higher temperature = more diverse fraud patterns\n",
+    "    beta=0.001,\n",
+    ")\n",
+    "\n",
+    "trainer = GRPOTrainer(\n",
+    "    model=model,\n",
+    "    processing_class=tokenizer,\n",
+    "    args=training_args,\n",
+    "    train_dataset=dataset,\n",
+    "    reward_funcs=[\n",
+    "        reward_generator_live,      # live /generator/score (Auditor+Approver pipeline)\n",
+    "        reward_format_valid,        # JSON structure check\n",
+    "        reward_fraud_plausibility,  # local fraud injection check\n",
+    "    ],\n",
+    ")\n",
+    "\n",
+    "print(\"Starting Generator adversarial GRPO training...\")\n",
+    "print(\"Generator learns to evade the Auditor. Expect reward to INCREASE as evasion improves.\")\n",
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 7: Post-training adversarial eval\n",
+    "# Shows the arms race: Generator evasion rate vs Auditor detection rate\n",
+    "from unsloth import FastLanguageModel\n",
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "def run_generator_eval(n=10, label=\"\"):\n",
+    "    evasion_count = 0\n",
+    "    rewards = []\n",
+    "    for _ in range(n):\n",
+    "        fraud_type = random.choice(FRAUD_TYPES)\n",
+    "        context = FRAUD_CONTEXTS[fraud_type]\n",
+    "        user_prompt = (\n",
+    "            f\"Generate a fraudulent invoice using fraud type: {fraud_type.upper()}\\n\\n\"\n",
+    "            f\"Context: {context}\\n\\nOutput ONLY valid JSON.\"\n",
+    "        )\n",
+    "        inputs = tokenizer.apply_chat_template(\n",
+    "            [{\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "             {\"role\": \"user\", \"content\": user_prompt}],\n",
+    "            tokenize=True, add_generation_prompt=True,\n",
+    "            return_tensors=\"pt\"\n",
+    "        ).to(model.device)\n",
+    "\n",
+    "        output = model.generate(\n",
+    "            inputs, max_new_tokens=300, temperature=0.7, do_sample=True\n",
+    "        )\n",
+    "        text = tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True)\n",
+    "        inv = _parse_invoice_json(text)\n",
+    "\n",
+    "        if not inv:\n",
+    "            rewards.append(0.01)\n",
+    "            continue\n",
+    "\n",
+    "        try:\n",
+    "            resp = httpx.post(\n",
+    "                f\"{ENV_URL}/generator/score\",\n",
+    "                json={\"invoice_json\": inv, \"fraud_type\": fraud_type},\n",
+    "                timeout=15,\n",
+    "            ).json()\n",
+    "            r = float(resp.get(\"reward\", 0.01))\n",
+    "            rewards.append(r)\n",
+    "            if not resp.get(\"auditor_detected\", True):\n",
+    "                evasion_count += 1\n",
+    "        except Exception:\n",
+    "            rewards.append(0.01)\n",
+    "\n",
+    "    avg = sum(rewards) / len(rewards) if rewards else 0.0\n",
+    "    evasion_rate = evasion_count / n\n",
+    "    print(f\"{label}\")\n",
+    "    print(f\"  Mean generator reward: {avg:.3f} | Evasion rate: {evasion_rate:.0%}\")\n",
+    "    print(f\"  Per-episode rewards: {[round(r, 2) for r in rewards]}\")\n",
+    "    return avg, evasion_rate\n",
+    "\n",
+    "\n",
+    "print(\"=== POST-TRAINING GENERATOR EVAL ===\")\n",
+    "after_reward, after_evasion = run_generator_eval(n=10, label=\"After GRPO (50 steps)\")\n",
+    "\n",
+    "# Check if Regulator detects the evolved fraud patterns\n",
+    "report = httpx.get(f\"{ENV_URL}/regulator/report\").json()\n",
+    "print(\"\\n=== REGULATOR REPORT (shows if Auditor is struggling) ===\")\n",
+    "for ft, status in report[\"detection_rates\"].items():\n",
+    "    print(f\"  {ft:<28} {status}\")\n",
+    "print(f\"\\nBlind spots: {report['blind_spots']}\")\n",
+    "print(f\"Emerging: {report['emerging_blind_spots']}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 8: Arms race visualisation\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "log = trainer.state.log_history\n",
+    "steps = [x[\"step\"] for x in log if \"rewards/reward_generator_live/mean\" in x]\n",
+    "live_r = [x[\"rewards/reward_generator_live/mean\"] for x in log if \"rewards/reward_generator_live/mean\" in x]\n",
+    "plaus_r = [x.get(\"rewards/reward_fraud_plausibility/mean\", 0) for x in log if \"rewards/reward_generator_live/mean\" in x]\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(10, 4))\n",
+    "ax.plot(steps, live_r, label=\"Live Evasion Reward (Auditor+Approver)\", marker=\"o\", color=\"red\")\n",
+    "ax.plot(steps, plaus_r, label=\"Fraud Plausibility\", marker=\"s\", linestyle=\"--\", color=\"orange\")\n",
+    "ax.axhline(y=0.85, color=\"red\", linestyle=\":\", alpha=0.5, label=\"Max reward (0.85 = full evasion)\")\n",
+    "ax.axhline(y=0.10, color=\"green\", linestyle=\":\", alpha=0.5, label=\"Min reward (0.10 = caught)\")\n",
+    "ax.set_xlabel(\"Training Step\")\n",
+    "ax.set_ylabel(\"Mean Reward\")\n",
+    "ax.set_title(\"Generator Adversarial Training: Evasion Reward Curve\")\n",
+    "ax.legend()\n",
+    "ax.grid(True, alpha=0.3)\n",
+    "ax.set_ylim(0, 1.0)\n",
+    "plt.tight_layout()\n",
+    "plt.savefig(\"generator_reward_curve.png\", dpi=150)\n",
+    "plt.show()\n",
+    "print(\"Saved generator_reward_curve.png\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Cell 9: Save model\n",
+    "model.save_pretrained(\"generator_lora\")\n",
+    "tokenizer.save_pretrained(\"generator_lora\")\n",
+    "print(\"Generator LoRA saved to generator_lora/\")\n",
+    "print()\n",
+    "print(\"=== ARMS RACE SUMMARY ===\")\n",
+    "print(f\"Generator evasion rate after training: {after_evasion:.0%}\")\n",
+    "print(f\"Generator mean reward after training: {after_reward:.3f}\")\n",
+    "print()\n",
+    "print(\"Next step: run auditor_grpo_training.ipynb to train Auditor against evolved Generator.\")\n",
+    "print(\"Then re-run Generator training. Each iteration the arms race escalates.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
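
The notebook's Cell 6 hands three reward functions to `GRPOTrainer`. Assuming TRL's documented behaviour (per-function rewards combined as a weighted sum with weights defaulting to 1.0, then standardised within each group of `num_generations` completions), the arithmetic can be sketched in plain Python. `total_rewards` and `group_advantages` are hypothetical helpers, the reward values are invented, and the population standard deviation and `eps` are simplifying assumptions rather than TRL's exact code.

```python
from statistics import mean, pstdev


def total_rewards(per_function_rewards, weights=None):
    """Weighted sum across reward functions, giving one total per completion."""
    weights = weights or [1.0] * len(per_function_rewards)
    return [
        sum(w * r for w, r in zip(weights, per_completion))
        for per_completion in zip(*per_function_rewards)
    ]


def group_advantages(totals, eps=1e-4):
    """Standardise totals within one group: (reward - group mean) / (group std + eps)."""
    mu = mean(totals)
    sigma = pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


if __name__ == "__main__":
    # One prompt, four sampled completions (num_generations=4); values are invented.
    live = [0.85, 0.10, 0.60, 0.10]    # reward_generator_live
    fmt = [0.15, 0.15, 0.15, 0.01]     # reward_format_valid
    plaus = [0.30, 0.15, 0.30, 0.01]   # reward_fraud_plausibility
    totals = total_rewards([live, fmt, plaus])
    print([round(t, 2) for t in totals])
    print([round(a, 2) for a in group_advantages(totals)])
```

With unit weights the largest possible total is 0.85 + 0.15 + 0.30 = 1.30, so the live evasion reward dominates the signal, while the format and plausibility terms mainly separate malformed completions from well-formed ones.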
server/app.py CHANGED
@@ -418,6 +418,109 @@ def regulator_calibration():
     return _regulator_tracker.calibration_report()
 
 
+# ---------------------------------------------------------------------------
+# Generator training endpoint
+# ---------------------------------------------------------------------------
+
+class GeneratorScoreRequest(BaseModel):
+    invoice_json: Dict[str, Any]  # generated invoice to evaluate
+    fraud_type: str               # claimed fraud type
+    context: Optional[str] = ""   # optional: generation reasoning
+
+class GeneratorScoreResponse(BaseModel):
+    reward: float
+    auditor_detected: bool
+    approver_decision: str
+    feedback: str
+    regulator_weights: Dict[str, Any]  # current blind spot weights for next generation
+
+
+@app.post("/generator/score")
+def generator_score(req: GeneratorScoreRequest):
+    """
+    Score a Generator-produced invoice through the full Auditor+Approver pipeline.
+    Used as the live reward signal during Generator GRPO training.
+
+    The Generator's job: produce a fraudulent invoice that evades the Auditor.
+    Reward: 0.85 (evades both), 0.60 (evades Auditor, Approver catches), 0.10 (Auditor catches)
+    """
+    from server.multi_agent_environment import (
+        compute_generator_reward, approver_decision as _approver_decision,
+        PHANTOM_VENDORS, FRAUD_TYPES,
+    )
+    from server.environment import VENDORS, MARKET_PRICE_MAX
+
+    inv = req.invoice_json
+    fraud_type = req.fraud_type.lower().strip()
+
+    # --- Rule-based Auditor simulation (fast, no LLM needed for scoring) ---
+    # This mirrors what a trained Auditor would do, seeded from tracker blind spots
+    vendor = inv.get("vendor", "")
+    line_items = inv.get("line_items", [])
+    total = float(inv.get("total", 0))
+
+    auditor_verdict = "approved"
+    auditor_confidence = 0.5
+    auditor_fraud_type = None
+
+    # Check phantom vendor
+    known_vendors_lower = [v.lower() for v in VENDORS]
+    if vendor.lower() not in known_vendors_lower:
+        # Auditor detection rate for phantom_vendor from tracker
+        detection_rate = _regulator_tracker.detection_rates().get("phantom_vendor") or 0.5
+        if random.random() < detection_rate:
+            auditor_verdict = "flagged"
+            auditor_fraud_type = "phantom_vendor"
+            auditor_confidence = round(0.5 + detection_rate * 0.4, 2)
+
+    # Check price gouging
+    if auditor_verdict == "approved":
+        for item in line_items:
+            desc = item.get("description", "")
+            up = float(item.get("unit_price", 0))
+            market_max = MARKET_PRICE_MAX.get(desc, up)
+            if up > market_max * 1.5:
+                detection_rate = _regulator_tracker.detection_rates().get("price_gouging") or 0.5
+                if random.random() < detection_rate:
+                    auditor_verdict = "flagged"
+                    auditor_fraud_type = "price_gouging"
+                    auditor_confidence = round(0.5 + detection_rate * 0.4, 2)
+                    break
+
+    # Check math fraud
+    if auditor_verdict == "approved" and line_items:
+        computed = round(sum(float(it.get("amount", 0)) for it in line_items), 2)
+        if abs(computed - total) > 0.05 * total:
+            detection_rate = _regulator_tracker.detection_rates().get("math_fraud") or 0.5
+            if random.random() < detection_rate:
+                auditor_verdict = "flagged"
+                auditor_fraud_type = "math_fraud"
+                auditor_confidence = round(0.5 + detection_rate * 0.4, 2)
+
+    auditor_detected = auditor_verdict == "flagged"
+
+    # --- Approver decision ---
+    approver = _approver_decision(auditor_verdict, auditor_confidence, auditor_fraud_type)
+    approver_approved = approver["decision"] == "approve"
+
+    # --- Generator reward ---
+    reward = compute_generator_reward(auditor_detected, approver_approved)
+
+    feedback_parts = [
+        f"Auditor: {'DETECTED' if auditor_detected else 'MISSED'} (conf={auditor_confidence:.2f})",
+        f"Approver: {approver['decision']} - {approver['reason']}",
+        f"Generator reward: {reward:.2f}",
+    ]
+
+    return GeneratorScoreResponse(
+        reward=reward,
+        auditor_detected=auditor_detected,
+        approver_decision=approver["decision"],
+        feedback=" | ".join(feedback_parts),
+        regulator_weights=_regulator_tracker.generator_weights(),
+    )
+
+
 @app.post("/regulator/demo_seed")
 def regulator_demo_seed():
     """Seed the tracker with realistic demo data (phantom_vendor weak at 31%)."""