---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---
# SEC545 Lab 13: Model Inversion via Agent API
**ML Security: Training Data Extraction & Membership Inference**
Hands-on lab demonstrating how attackers extract sensitive data that was
carelessly included in a fine-tuning corpus, using nothing but the model's
public completion API. Covers two distinct statistical attacks (membership
inference and training data extraction), canary-based forensic detection,
and five layered defenses including differential privacy.
## What Students Will Do
| Step | Topic |
|------|-------|
| 0 | Examine the training dataset β€” HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: model answers support questions without leaking training data |
| 2 | **Attack A**: membership inference. Use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | **Attack B**: training data extraction. A prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | **Attack C**: canary record extraction. Synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |
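The prefix-completion oracle behind Attack B can be sketched as follows. The `complete()` stub and the sample record (name, SSN, salary) are invented for illustration; the real lab queries a `gpt-4o-mini` endpoint simulating the fine-tuned model instead.

```python
import re

# Hypothetical completion oracle standing in for the lab's model API.
# It "memorised" one fabricated training record, to show the attack shape.
_MEMORISED = "Employee: Jane Roe, SSN: 078-05-1120, Salary: $182,000"

def complete(prefix: str) -> str:
    """Return the model's most likely continuation of `prefix`."""
    if _MEMORISED.startswith(prefix):
        return _MEMORISED[len(prefix):]
    return " [no confident continuation]"

def extract(prefix: str) -> str:
    """Attack B: feed a known record prefix, harvest the verbatim suffix."""
    return prefix + complete(prefix)

# Attacker knows only the record's framing, not the sensitive values.
record = extract("Employee: Jane Roe, SSN:")
ssn = re.search(r"\d{3}-\d{2}-\d{4}", record)
```

The attacker never sees the training set; the public API alone reconstructs the record verbatim.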
## Secrets Required
| Secret Name | Where to Get It |
|-------------|----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |
Only one secret is needed.
## Architecture
- Training dataset of 10 records shown in full: 2 knowledge base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use real `gpt-4o-mini` with a system prompt simulating a fine-tuned model
- Differential privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- Output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- Rate limiter tracks per-session query count; blocks and flags the session at 15 queries
- Canary monitoring checks all completions for synthetic sentinel IDs
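The output-filter stage might look like the following sketch; the regex patterns for SSN, credit card, and salary are illustrative, not the lab's exact rule set.

```python
import re

# Sketch of the output-filter defense: regex redaction of PII patterns
# before a completion is returned to the caller.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CC": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3}(?:,\d{3})+"),
}

def filter_output(text: str) -> str:
    """Replace each PII match with a labelled redaction marker."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

raw = "Jane Roe, SSN 078-05-1120, card 4111 1111 1111 1111, salary $182,000"
safe = filter_output(raw)
```

The lab shows raw and filtered completions side by side; this sketch produces the filtered half.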
## Five Defenses Demonstrated
1. **PII scrubbing before training**: automated removal of sensitive records from the corpus before fine-tuning; 8 of 10 records dropped
2. **Differential privacy (DP-SGD)**: ε slider shows the membership-inference gap collapsing as noise increases; Opacus code sample
3. **Output filtering**: regex + pattern matching redacts SSN/CC/salary/PHI in completions; side-by-side raw vs. filtered
4. **Rate limiting**: 15-query session limit with prefix-variation pattern detection
5. **Canary monitoring**: synthetic sentinel IDs watched across all outputs; any hit triggers P0 incident response
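The rate-limiting defense can be sketched with a plain per-session counter. The 15-query threshold comes from the lab; the `RateLimiter` class and dict-based session store are illustrative stand-ins for Streamlit session state.

```python
# Sketch of the per-session rate limiter: block and flag after 15 queries.
QUERY_LIMIT = 15

class RateLimiter:
    def __init__(self, limit: int = QUERY_LIMIT):
        self.limit = limit
        self.counts: dict[str, int] = {}   # session_id -> queries seen
        self.flagged: set[str] = set()     # sessions flagged for review

    def allow(self, session_id: str) -> bool:
        """Count the query; return False (and flag) once over the limit."""
        self.counts[session_id] = self.counts.get(session_id, 0) + 1
        if self.counts[session_id] > self.limit:
            self.flagged.add(session_id)
            return False
        return True

rl = RateLimiter()
results = [rl.allow("s1") for _ in range(16)]  # 16th query is blocked
```

Prefix-variation detection (many near-identical prefixes from one session) would layer on top of this counter; it is omitted here for brevity.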
## Key Concepts
**Membership inference**: even if the model refuses to complete a sensitive
prefix, the perplexity differential between training members and non-members
is a statistical signal that leaks membership. Confirming that a specific
person's medical record was in the training data is itself a HIPAA breach.
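A toy simulation of that signal, in the spirit of the lab's simulated differentials: the perplexity means (4.0 for members, 9.0 for non-members) and noise level are invented, and a simple threshold recovers membership with high accuracy.

```python
import random
import statistics

# Members were seen in training, so the model assigns them systematically
# lower perplexity; Gaussian noise models measurement scatter.
random.seed(42)

def perplexity(is_member: bool) -> float:
    base = 4.0 if is_member else 9.0
    return base + random.gauss(0, 1.0)

members = [perplexity(True) for _ in range(200)]
nonmembers = [perplexity(False) for _ in range(200)]

# Threshold at the midpoint between the two population means.
threshold = 6.5
accuracy = (
    sum(p < threshold for p in members)
    + sum(p >= threshold for p in nonmembers)
) / 400
```

With well-separated means the attacker's simple threshold classifier is near-perfect; DP-SGD's job (defense #2) is to shrink that gap until the classifier does no better than chance.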
**The canary principle**: a unique synthetic record with no existence outside
the training corpus. If it appears in any completion, there is no alternative
explanation; forensic attribution is exact.
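A minimal canary monitor follows directly from that principle. The sentinel ID format below is invented for illustration; any non-empty result would trigger the P0 incident response.

```python
# Synthetic sentinel IDs planted in the training corpus; they exist
# nowhere else, so any appearance in output proves memorisation.
CANARY_IDS = {"CANARY-7F3A9B", "CANARY-2D81C4"}

def check_completion(text: str) -> set[str]:
    """Return any canary IDs leaked in a model completion."""
    return {cid for cid in CANARY_IDS if cid in text}

hits = check_completion("Per record CANARY-7F3A9B, the patient was seen...")
```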
## Based On
- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project, Top 10 for Agentic Applications 2026