---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---

# SEC545 Lab 13 — Model Inversion via Agent API

**ML Security — Training Data Extraction & Membership Inference**

Hands-on lab demonstrating how attackers extract sensitive data that was carelessly included in a fine-tuning corpus, using nothing but the model's public completion API. Covers two distinct statistical attacks (membership inference and training data extraction), canary-based forensic detection, and five layered defenses including differential privacy.

## What Students Will Do

| Step | Topic |
|------|-------|
| 0 | Examine the training dataset — HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: the model answers support questions without leaking training data |
| 2 | **Attack A** — Membership inference: use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | **Attack B** — Training data extraction: a prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | **Attack C** — Canary record extraction: synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |

## Secrets Required

| Secret Name | Where to Get It |
|-------------|-----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |

Only one secret is needed.
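The membership-inference step (Attack A) can be sketched in a few lines. As the Architecture notes below explain, the lab simulates the signal with perplexity differentials plus Gaussian noise rather than real model log-probs; the baseline perplexities (4.0 for members, 18.0 for non-members) and the threshold of 10.0 are illustrative assumptions, not the lab's actual values:

```python
import random


def simulated_perplexity(in_training: bool, rng: random.Random) -> float:
    """Simulate the model's perplexity on a candidate record.

    Training members score markedly lower perplexity because the model
    has partially memorised them; Gaussian noise keeps the signal
    statistical rather than deterministic.
    """
    base = 4.0 if in_training else 18.0  # assumed baselines
    return max(1.0, rng.gauss(base, 1.5))


def infer_membership(perplexity: float, threshold: float = 10.0) -> bool:
    """Classify a record as a likely training member when its
    perplexity falls below the decision threshold."""
    return perplexity < threshold


rng = random.Random(13)
member_ppl = simulated_perplexity(True, rng)    # low: model has seen it
outsider_ppl = simulated_perplexity(False, rng) # high: model has not
```

The point of the attack is that the model never has to *emit* the record; the gap between the two perplexity distributions alone confirms membership.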
## Architecture

- Training dataset of 10 records, shown in full: 2 knowledge-base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference is simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use the real `gpt-4o-mini` with a system prompt simulating a fine-tuned model
- The differential-privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- The output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- The rate limiter tracks per-session query count; it blocks and flags at 15 queries
- Canary monitoring checks all completions for synthetic sentinel IDs

## Five Defenses Demonstrated

1. **PII scrubbing before training** — automated removal of sensitive records from the corpus before fine-tuning; 8 of the 10 records are dropped
2. **Differential privacy (DP-SGD)** — an ε slider shows the membership-inference gap collapsing as noise increases; includes an Opacus code sample
3. **Output filtering** — regex and pattern matching redact SSN/CC/salary/PHI in completions; raw vs. filtered output shown side by side
4. **Rate limiting** — 15-query session limit with prefix-variation pattern detection
5. **Canary monitoring** — synthetic sentinel IDs are watched across all outputs; any hit triggers P0 incident response

## Key Concepts

**Membership inference** — even if the model refuses to complete a sensitive prefix, the perplexity differential between training members and non-members is a statistical signal that leaks membership. Confirming that a specific person's medical record was in the training data is itself a HIPAA breach.

**The canary principle** — a unique synthetic record has no existence outside the training corpus. If it appears in any completion, there is no alternative explanation; forensic attribution is exact.
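The output-filtering defense (defense 3) amounts to redacting known PII shapes before a completion leaves the API boundary. A minimal sketch: the patterns below (SSN, credit card, salary) are illustrative, not the lab's exact filter set:

```python
import re

# Illustrative PII patterns; the lab's real filter may differ.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CC": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?"),
}


def filter_completion(text: str) -> str:
    """Redact PII matches before the completion is delivered."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text


raw = "Employee SSN is 123-45-6789, salary $145,000."
print(filter_completion(raw))
# → Employee SSN is [REDACTED-SSN], salary [REDACTED-SALARY].
```

Regex filters are a last line of defense: they catch verbatim leaks but not paraphrased ones, which is why the lab layers them with scrubbing and DP rather than relying on them alone.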
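The rate-limiting defense (defense 4) can be sketched as a per-session counter. Here the 15th query itself is blocked and the session flagged; the lab may instead allow 15 and block the 16th, and its prefix-variation detection is not shown:

```python
from collections import defaultdict

QUERY_LIMIT = 15  # per the lab: block and flag at 15 queries per session


class SessionRateLimiter:
    """Count completion queries per session; block once the limit is hit."""

    def __init__(self, limit: int = QUERY_LIMIT):
        self.limit = limit
        self.counts = defaultdict(int)
        self.flagged = set()

    def allow(self, session_id: str) -> bool:
        self.counts[session_id] += 1
        if self.counts[session_id] >= self.limit:
            self.flagged.add(session_id)  # flag the session for review
            return False
        return True
```

Extraction attacks need many near-identical prefix queries, so even a crude per-session cap sharply raises the attacker's cost.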
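The canary-monitoring defense (defense 5) is a substring scan over every completion. A minimal sketch, with made-up sentinel IDs standing in for the lab's actual canaries:

```python
# Assumed sentinel IDs; the lab's canaries are synthetic records
# that exist nowhere outside the training corpus.
CANARY_IDS = {"CANARY-7F3A-0001", "CANARY-7F3A-0002"}


def check_canaries(completion: str) -> set[str]:
    """Return any canary IDs found in a completion.

    A non-empty result is conclusive proof of training-data
    memorisation and, per the lab, triggers a P0 incident.
    """
    return {cid for cid in CANARY_IDS if cid in completion}
```

Because the canary cannot appear by coincidence, a single hit is enough; there is no false-positive rate to argue about.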
## Based On

- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project — Top 10 for Agentic Applications 2026