---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---
# SEC545 Lab 13: Model Inversion via Agent API
**ML Security: Training Data Extraction & Membership Inference**
Hands-on lab demonstrating how attackers extract sensitive data that was
carelessly included in a fine-tuning corpus, using nothing but the model's
public completion API. Covers two distinct statistical attacks (membership
inference and training data extraction), canary-based forensic detection,
and five layered defenses including differential privacy.
## What Students Will Do
| Step | Topic |
|------|-------|
| 0 | Examine the training dataset β€” HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: model answers support questions without leaking training data |
| 2 | **Attack A**: membership inference. Use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | **Attack B**: training data extraction. A prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | **Attack C**: canary record extraction. Synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |
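The prefix-completion oracle behind Attack B can be sketched as follows. The `complete()` stub and the sample record (name, SSN, salary) are invented for illustration; the real lab queries a `gpt-4o-mini` endpoint simulating the fine-tuned model instead.

```python
import re

# Hypothetical completion oracle standing in for the lab's model API.
# It "memorised" one fabricated training record, to show the attack shape.
_MEMORISED = "Employee: Jane Roe, SSN: 078-05-1120, Salary: $182,000"

def complete(prefix: str) -> str:
    """Return the model's most likely continuation of `prefix`."""
    if _MEMORISED.startswith(prefix):
        return _MEMORISED[len(prefix):]
    return " [no confident continuation]"

def extract(prefix: str) -> str:
    """Attack B: feed a known record prefix, harvest the verbatim suffix."""
    return prefix + complete(prefix)

# Attacker knows only the record's framing, not the sensitive values.
record = extract("Employee: Jane Roe, SSN:")
ssn = re.search(r"\d{3}-\d{2}-\d{4}", record)
```

The attacker never sees the training set; the public API alone reconstructs the record verbatim.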
## Secrets Required
| Secret Name | Where to Get It |
|-------------|----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |
Only one secret is needed.
## Architecture
- Training dataset of 10 records shown in full: 2 knowledge base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use real `gpt-4o-mini` with a system prompt simulating a fine-tuned model
- Differential privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- Output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- Rate limiter tracks per-session query count; blocks and flags the session at 15 queries
- Canary monitoring checks all completions for synthetic sentinel IDs
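The output-filter stage might look like the following sketch; the regex patterns for SSN, credit card, and salary are illustrative, not the lab's exact rule set.

```python
import re

# Sketch of the output-filter defense: regex redaction of PII patterns
# before a completion is returned to the caller.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CC": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3}(?:,\d{3})+"),
}

def filter_output(text: str) -> str:
    """Replace each PII match with a labelled redaction marker."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

raw = "Jane Roe, SSN 078-05-1120, card 4111 1111 1111 1111, salary $182,000"
safe = filter_output(raw)
```

The lab shows raw and filtered completions side by side; this sketch produces the filtered half.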
## Five Defenses Demonstrated
1. **PII scrubbing before training**: automated removal of sensitive records from the corpus before fine-tuning; 8 of 10 records dropped
2. **Differential privacy (DP-SGD)**: ε slider shows the membership-inference gap collapsing as noise increases; Opacus code sample
3. **Output filtering**: regex + pattern matching redacts SSN/CC/salary/PHI in completions; side-by-side raw vs. filtered
4. **Rate limiting**: 15-query session limit with prefix-variation pattern detection
5. **Canary monitoring**: synthetic sentinel IDs watched across all outputs; any hit triggers P0 incident response
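The rate-limiting defense can be sketched with a plain per-session counter. The 15-query threshold comes from the lab; the `RateLimiter` class and dict-based session store are illustrative stand-ins for Streamlit session state.

```python
# Sketch of the per-session rate limiter: block and flag after 15 queries.
QUERY_LIMIT = 15

class RateLimiter:
    def __init__(self, limit: int = QUERY_LIMIT):
        self.limit = limit
        self.counts: dict[str, int] = {}   # session_id -> queries seen
        self.flagged: set[str] = set()     # sessions flagged for review

    def allow(self, session_id: str) -> bool:
        """Count the query; return False (and flag) once over the limit."""
        self.counts[session_id] = self.counts.get(session_id, 0) + 1
        if self.counts[session_id] > self.limit:
            self.flagged.add(session_id)
            return False
        return True

rl = RateLimiter()
results = [rl.allow("s1") for _ in range(16)]  # 16th query is blocked
```

Prefix-variation detection (many near-identical prefixes from one session) would layer on top of this counter; it is omitted here for brevity.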
## Key Concepts
**Membership inference**: even if the model refuses to complete a sensitive
prefix, the perplexity differential between training members and non-members
is a statistical signal that leaks membership. Confirming that a specific
person's medical record was in the training data is itself a HIPAA breach.
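A toy simulation of that signal, in the spirit of the lab's simulated differentials: the perplexity means (4.0 for members, 9.0 for non-members) and noise level are invented, and a simple threshold recovers membership with high accuracy.

```python
import random
import statistics

# Members were seen in training, so the model assigns them systematically
# lower perplexity; Gaussian noise models measurement scatter.
random.seed(42)

def perplexity(is_member: bool) -> float:
    base = 4.0 if is_member else 9.0
    return base + random.gauss(0, 1.0)

members = [perplexity(True) for _ in range(200)]
nonmembers = [perplexity(False) for _ in range(200)]

# Threshold at the midpoint between the two population means.
threshold = 6.5
accuracy = (
    sum(p < threshold for p in members)
    + sum(p >= threshold for p in nonmembers)
) / 400
```

With well-separated means the attacker's simple threshold classifier is near-perfect; DP-SGD's job (defense #2) is to shrink that gap until the classifier does no better than chance.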
**The canary principle**: a unique synthetic record with no existence outside
the training corpus. If it appears in any completion, there is no alternative
explanation; forensic attribution is exact.
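A minimal canary monitor follows directly from that principle. The sentinel ID format below is invented for illustration; any non-empty result would trigger the P0 incident response.

```python
# Synthetic sentinel IDs planted in the training corpus; they exist
# nowhere else, so any appearance in output proves memorisation.
CANARY_IDS = {"CANARY-7F3A9B", "CANARY-2D81C4"}

def check_completion(text: str) -> set[str]:
    """Return any canary IDs leaked in a model completion."""
    return {cid for cid in CANARY_IDS if cid in text}

hits = check_completion("Per record CANARY-7F3A9B, the patient was seen...")
```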
## Based On
- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project, Top 10 for Agentic Applications 2026