---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---

# SEC545 Lab 13 – Model Inversion via Agent API

**ML Security: Training Data Extraction & Membership Inference**

A hands-on lab demonstrating how attackers extract sensitive data that was carelessly included in a fine-tuning corpus, using nothing but the model's public completion API. It covers two distinct statistical attacks (membership inference and training data extraction), canary-based forensic detection, and five layered defenses, including differential privacy.

## What Students Will Do

| Step | Topic |
|------|-------|
| 0 | Examine the training dataset: HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: the model answers support questions without leaking training data |
| 2 | Attack A (membership inference): use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | Attack B (training data extraction): a prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | Attack C (canary record extraction): synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |
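The intuition behind Attack B can be sketched in a few lines. This is a toy stand-in, not the lab's code: the "model" is a dictionary lookup over a memorised corpus, where the real lab queries gpt-4o-mini's completion API. All record values are synthetic and invented for illustration.

```python
# Toy illustration of training data extraction: a model that has memorised
# records will complete a sensitive prefix verbatim. The attacker never
# needs model weights, only the public completion endpoint.
MEMORISED_CORPUS = [
    "Employee Jane Doe, SSN 078-05-1120, salary $142,000",
    "Customer John Roe, card 4111-1111-1111-1111, plan Gold",
]

def completion_oracle(prefix: str) -> str:
    """Stand-in for the completion API: returns the memorised continuation
    if the prefix matches the start of a training record."""
    for record in MEMORISED_CORPUS:
        if record.startswith(prefix):
            return record[len(prefix):]
    return "[no memorised continuation]"

# The attacker supplies only a plausible prefix and reads back the secret.
print(completion_oracle("Employee Jane Doe, SSN "))
# -> 078-05-1120, salary $142,000
```

The lab's hardened model in step 6 is meant to make exactly this prefix trick fail.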

## Secrets Required

| Secret Name | Where to Get It |
|-------------|-----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |

Only one secret is needed.

## Architecture

- Training dataset of 10 records shown in full: 2 knowledge base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference is simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use real gpt-4o-mini with a system prompt simulating a fine-tuned model
- The differential privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- The output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- The rate limiter tracks a per-session query count; at 15 queries the session is blocked and flagged
- Canary monitoring checks all completions for synthetic sentinel IDs
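The output-filter component can be sketched as a regex redaction pass. The patterns and the `[REDACTED-*]` labels below are illustrative assumptions; the lab's actual patterns may differ (and PHI redaction needs more than regex in practice).

```python
import re

# Hypothetical output filter: redact PII patterns in a completion before it
# is returned to the caller. Patterns are simplified for illustration.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # 078-05-1120
    "CC": re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3},\d{3}\b"),            # $142,000
}

def filter_output(completion: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        completion = pattern.sub(f"[REDACTED-{label}]", completion)
    return completion

raw = "Jane Doe, SSN 078-05-1120, card 4111-1111-1111-1111, salary $142,000"
print(filter_output(raw))
# -> Jane Doe, SSN [REDACTED-SSN], card [REDACTED-CC], salary [REDACTED-SALARY]
```

Running the filter on every completion gives the side-by-side raw vs. filtered view the lab demonstrates.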

## Five Defenses Demonstrated

1. PII scrubbing before training: automated removal of sensitive records from the corpus before fine-tuning; 8 of 10 records are dropped
2. Differential privacy (DP-SGD): an ε slider shows the membership inference gap collapsing as noise increases; includes an Opacus code sample
3. Output filtering: regex pattern matching redacts SSN/CC/salary/PHI in completions, shown side-by-side as raw vs. filtered
4. Rate limiting: 15-query session limit with prefix-variation pattern detection
5. Canary monitoring: synthetic sentinel IDs are watched across all outputs; any hit triggers P0 incident response
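The ε/utility tradeoff behind defense 2 can be seen with a toy simulation rather than real DP-SGD: members get lower loss than non-members, and Laplace noise of scale sensitivity/ε drowns out that gap as ε shrinks. The loss values and sensitivity here are invented for illustration; this is not Opacus.

```python
import numpy as np

rng = np.random.default_rng(0)

def attack_success_rate(epsilon: float, trials: int = 5000) -> float:
    """Fraction of trials where a noisy loss gap still reveals membership."""
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon (sensitivity assumed 1)
    member = 1.0 + rng.normal(0, 0.1, trials)      # members: low loss
    non_member = 3.0 + rng.normal(0, 0.1, trials)  # non-members: high loss
    # The attacker only observes privatised (noised) losses.
    noisy_gap = (non_member + rng.laplace(0, scale, trials)) \
              - (member + rng.laplace(0, scale, trials))
    return float((noisy_gap > 0).mean())  # guess "member" when the gap is positive

for eps in (10.0, 1.0, 0.1):
    print(f"epsilon={eps:>4}: attack success rate {attack_success_rate(eps):.2f}")
```

At large ε the attacker wins almost every time; at small ε the success rate collapses toward 0.5, a coin flip, which is the effect the lab's ε slider visualises.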

## Key Concepts

**Membership inference**: even if the model refuses to complete a sensitive prefix, the perplexity differential between training members and non-members is a statistical signal that leaks membership. Confirming that a specific person's medical record was in the training data is itself a HIPAA breach.
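A minimal sketch of that signal, with fabricated per-token log-probabilities standing in for values the lab derives from the completion API (the threshold is likewise invented; real attacks calibrate it on shadow models):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a record's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Memorised records get near-zero log-probs; unseen records do not.
member_lps = [-0.05, -0.10, -0.02, -0.08]
non_member_lps = [-2.1, -1.8, -2.5, -1.9]

THRESHOLD = 2.0  # illustrative; tuned on shadow data in a real attack
for name, lps in [("member", member_lps), ("non-member", non_member_lps)]:
    ppl = perplexity(lps)
    verdict = "likely IN training set" if ppl < THRESHOLD else "likely NOT in training set"
    print(f"{name}: perplexity={ppl:.2f} -> {verdict}")
```

Note that the model never emits the record itself; the membership leak comes entirely from how confidently it scores the text.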

**The canary principle**: a unique synthetic record with no existence outside the training corpus. If it appears in any completion, there is no alternative explanation; forensic attribution is exact.
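Because canaries are exact strings, monitoring them is a simple scan over every completion. The sentinel ID format below is invented for illustration; only the principle (any hit is conclusive and escalates immediately) comes from the lab.

```python
# Hypothetical canary monitor: sentinel IDs exist only in the training
# corpus, so any completion containing one proves memorisation.
CANARY_IDS = {"CANARY-7F3A-9921", "CANARY-2B8C-4410"}

def check_completion(text: str) -> dict:
    hits = sorted(c for c in CANARY_IDS if c in text)
    if hits:
        # A single hit is exact forensic evidence: escalate as P0.
        return {"leak": True, "canaries": hits, "action": "P0 incident"}
    return {"leak": False, "canaries": [], "action": None}

print(check_completion("Record for user CANARY-7F3A-9921 shows ..."))
print(check_completion("Your order has shipped."))
```

Unlike perplexity thresholds, this check has no false positives by construction, which is what makes the attribution exact.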

## Based On

- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project – Top 10 for Agentic Applications 2026