---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---

# SEC545 Lab 13 — Model Inversion via Agent API

**ML Security — Training Data Extraction & Membership Inference**

Hands-on lab demonstrating how attackers extract sensitive data that was carelessly included in a fine-tuning corpus, using nothing but the model's public completion API. Covers two distinct statistical attacks (membership inference and training data extraction), canary-based forensic detection, and five layered defenses including differential privacy.

## What Students Will Do

| Step | Topic |
|------|-------|
| 0 | Examine the training dataset — HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: the model answers support questions without leaking training data |
| 2 | **Attack A** — Membership inference: use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | **Attack B** — Training data extraction: a prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | **Attack C** — Canary record extraction: synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |

## Secrets Required

| Secret Name | Where to Get It |
|-------------|-----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |

Only one secret is needed.
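The membership-inference step (Attack A) can be sketched in a few lines. As the Architecture notes below explain, the lab simulates the signal with perplexity differentials plus Gaussian noise rather than real model log-probs; the baseline perplexities (4.0 for members, 18.0 for non-members) and the threshold of 10.0 are illustrative assumptions, not the lab's actual values:

```python
import random


def simulated_perplexity(in_training: bool, rng: random.Random) -> float:
    """Simulate the model's perplexity on a candidate record.

    Training members score markedly lower perplexity because the model
    has partially memorised them; Gaussian noise keeps the signal
    statistical rather than deterministic.
    """
    base = 4.0 if in_training else 18.0  # assumed baselines
    return max(1.0, rng.gauss(base, 1.5))


def infer_membership(perplexity: float, threshold: float = 10.0) -> bool:
    """Classify a record as a likely training member when its
    perplexity falls below the decision threshold."""
    return perplexity < threshold


rng = random.Random(13)
member_ppl = simulated_perplexity(True, rng)    # low: model has seen it
outsider_ppl = simulated_perplexity(False, rng) # high: model has not
```

The point of the attack is that the model never has to *emit* the record; the gap between the two perplexity distributions alone confirms membership.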
## Architecture

- Training dataset of 10 records, shown in full: 2 knowledge-base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference is simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use the real `gpt-4o-mini` with a system prompt simulating a fine-tuned model
- The differential-privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- The output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- The rate limiter tracks per-session query count; it blocks and flags at 15 queries
- Canary monitoring checks all completions for synthetic sentinel IDs

## Five Defenses Demonstrated

1. **PII scrubbing before training** — automated removal of sensitive records from the corpus before fine-tuning; 8 of the 10 records are dropped
2. **Differential privacy (DP-SGD)** — an ε slider shows the membership-inference gap collapsing as noise increases; includes an Opacus code sample
3. **Output filtering** — regex and pattern matching redact SSN/CC/salary/PHI in completions; raw vs. filtered output shown side by side
4. **Rate limiting** — 15-query session limit with prefix-variation pattern detection
5. **Canary monitoring** — synthetic sentinel IDs are watched across all outputs; any hit triggers P0 incident response

## Key Concepts

**Membership inference** — even if the model refuses to complete a sensitive prefix, the perplexity differential between training members and non-members is a statistical signal that leaks membership. Confirming that a specific person's medical record was in the training data is itself a HIPAA breach.

**The canary principle** — a unique synthetic record has no existence outside the training corpus. If it appears in any completion, there is no alternative explanation; forensic attribution is exact.
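The output-filtering defense (defense 3) amounts to redacting known PII shapes before a completion leaves the API boundary. A minimal sketch: the patterns below (SSN, credit card, salary) are illustrative, not the lab's exact filter set:

```python
import re

# Illustrative PII patterns; the lab's real filter may differ.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CC": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?"),
}


def filter_completion(text: str) -> str:
    """Redact PII matches before the completion is delivered."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text


raw = "Employee SSN is 123-45-6789, salary $145,000."
print(filter_completion(raw))
# → Employee SSN is [REDACTED-SSN], salary [REDACTED-SALARY].
```

Regex filters are a last line of defense: they catch verbatim leaks but not paraphrased ones, which is why the lab layers them with scrubbing and DP rather than relying on them alone.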
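The rate-limiting defense (defense 4) can be sketched as a per-session counter. Here the 15th query itself is blocked and the session flagged; the lab may instead allow 15 and block the 16th, and its prefix-variation detection is not shown:

```python
from collections import defaultdict

QUERY_LIMIT = 15  # per the lab: block and flag at 15 queries per session


class SessionRateLimiter:
    """Count completion queries per session; block once the limit is hit."""

    def __init__(self, limit: int = QUERY_LIMIT):
        self.limit = limit
        self.counts = defaultdict(int)
        self.flagged = set()

    def allow(self, session_id: str) -> bool:
        self.counts[session_id] += 1
        if self.counts[session_id] >= self.limit:
            self.flagged.add(session_id)  # flag the session for review
            return False
        return True
```

Extraction attacks need many near-identical prefix queries, so even a crude per-session cap sharply raises the attacker's cost.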
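The canary-monitoring defense (defense 5) is a substring scan over every completion. A minimal sketch, with made-up sentinel IDs standing in for the lab's actual canaries:

```python
# Assumed sentinel IDs; the lab's canaries are synthetic records
# that exist nowhere outside the training corpus.
CANARY_IDS = {"CANARY-7F3A-0001", "CANARY-7F3A-0002"}


def check_canaries(completion: str) -> set[str]:
    """Return any canary IDs found in a completion.

    A non-empty result is conclusive proof of training-data
    memorisation and, per the lab, triggers a P0 incident.
    """
    return {cid for cid in CANARY_IDS if cid in completion}
```

Because the canary cannot appear by coincidence, a single hit is enough; there is no false-positive rate to argue about.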
## Based On

- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project — Top 10 for Agentic Applications 2026