---
title: SANS Workshop Lab 13
emoji: 🧠
colorFrom: purple
colorTo: red
sdk: streamlit
sdk_version: 1.42.0
app_file: app.py
pinned: false
---

# SEC545 Lab 13 – Model Inversion via Agent API

**ML Security: Training Data Extraction & Membership Inference**

A hands-on lab demonstrating how attackers extract sensitive data that was carelessly included in a fine-tuning corpus, using nothing but the model's public completion API. It covers two distinct statistical attacks (membership inference and training data extraction), canary-based forensic detection, and five layered defenses, including differential privacy.

## What Students Will Do

| Step | Topic |
|------|-------|
| 0 | Examine the training dataset: HR records, customer PII, medical data, and canary records that should never have been included |
| 1 | Safe baseline: the model answers support questions without leaking training data |
| 2 | Attack A (membership inference): use log-probability differentials to statistically confirm which records were in the training corpus |
| 3 | Attack B (training data extraction): a prefix-completion oracle recovers SSNs, credit card numbers, salaries, and diagnoses verbatim |
| 4 | Attack C (canary record extraction): synthetic sentinel records prove memorisation conclusively |
| 5 | Apply five defenses: PII scrubbing before training, differential privacy (DP-SGD), output filtering, rate limiting, canary monitoring |
| 6 | Run all three attack types against the fully hardened model |
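The intuition behind Attack B can be sketched in a few lines. This is a toy stand-in, not the lab's code: the "model" is a dictionary lookup over a memorised corpus, where the real lab queries gpt-4o-mini's completion API. All record values are synthetic and invented for illustration.

```python
# Toy illustration of training data extraction: a model that has memorised
# records will complete a sensitive prefix verbatim. The attacker never
# needs model weights, only the public completion endpoint.
MEMORISED_CORPUS = [
    "Employee Jane Doe, SSN 078-05-1120, salary $142,000",
    "Customer John Roe, card 4111-1111-1111-1111, plan Gold",
]

def completion_oracle(prefix: str) -> str:
    """Stand-in for the completion API: returns the memorised continuation
    if the prefix matches the start of a training record."""
    for record in MEMORISED_CORPUS:
        if record.startswith(prefix):
            return record[len(prefix):]
    return "[no memorised continuation]"

# The attacker supplies only a plausible prefix and reads back the secret.
print(completion_oracle("Employee Jane Doe, SSN "))
# -> 078-05-1120, salary $142,000
```

The lab's hardened model in step 6 is meant to make exactly this prefix trick fail.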

## Secrets Required

| Secret Name | Where to Get It |
|-------------|-----------------|
| `OPENAI_API_KEY` | https://platform.openai.com/api-keys |

Only one secret is needed.

## Architecture

- Training dataset of 10 records shown in full: 2 knowledge base (safe), 3 HR, 2 customer PII, 1 medical, 2 canaries
- Membership inference is simulated with realistic perplexity differentials and Gaussian noise
- Extraction attacks use real gpt-4o-mini with a system prompt simulating a fine-tuned model
- The differential privacy demo uses a configurable ε slider to show the noise/utility tradeoff
- The output filter applies regex PII patterns (SSN, CC, salary, PHI) before delivery
- The rate limiter tracks a per-session query count; at 15 queries the session is blocked and flagged
- Canary monitoring checks all completions for synthetic sentinel IDs
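The output-filter component can be sketched as a regex redaction pass. The patterns and the `[REDACTED-*]` labels below are illustrative assumptions; the lab's actual patterns may differ (and PHI redaction needs more than regex in practice).

```python
import re

# Hypothetical output filter: redact PII patterns in a completion before it
# is returned to the caller. Patterns are simplified for illustration.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # 078-05-1120
    "CC": re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),
    "SALARY": re.compile(r"\$\d{1,3},\d{3}\b"),            # $142,000
}

def filter_output(completion: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        completion = pattern.sub(f"[REDACTED-{label}]", completion)
    return completion

raw = "Jane Doe, SSN 078-05-1120, card 4111-1111-1111-1111, salary $142,000"
print(filter_output(raw))
# -> Jane Doe, SSN [REDACTED-SSN], card [REDACTED-CC], salary [REDACTED-SALARY]
```

Running the filter on every completion gives the side-by-side raw vs. filtered view the lab demonstrates.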

## Five Defenses Demonstrated

1. PII scrubbing before training: automated removal of sensitive records from the corpus before fine-tuning; 8 of 10 records are dropped
2. Differential privacy (DP-SGD): an ε slider shows the membership inference gap collapsing as noise increases; includes an Opacus code sample
3. Output filtering: regex pattern matching redacts SSN/CC/salary/PHI in completions, shown side-by-side as raw vs. filtered
4. Rate limiting: 15-query session limit with prefix-variation pattern detection
5. Canary monitoring: synthetic sentinel IDs are watched across all outputs; any hit triggers P0 incident response
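The ε/utility tradeoff behind defense 2 can be seen with a toy simulation rather than real DP-SGD: members get lower loss than non-members, and Laplace noise of scale sensitivity/ε drowns out that gap as ε shrinks. The loss values and sensitivity here are invented for illustration; this is not Opacus.

```python
import numpy as np

rng = np.random.default_rng(0)

def attack_success_rate(epsilon: float, trials: int = 5000) -> float:
    """Fraction of trials where a noisy loss gap still reveals membership."""
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon (sensitivity assumed 1)
    member = 1.0 + rng.normal(0, 0.1, trials)      # members: low loss
    non_member = 3.0 + rng.normal(0, 0.1, trials)  # non-members: high loss
    # The attacker only observes privatised (noised) losses.
    noisy_gap = (non_member + rng.laplace(0, scale, trials)) \
              - (member + rng.laplace(0, scale, trials))
    return float((noisy_gap > 0).mean())  # guess "member" when the gap is positive

for eps in (10.0, 1.0, 0.1):
    print(f"epsilon={eps:>4}: attack success rate {attack_success_rate(eps):.2f}")
```

At large ε the attacker wins almost every time; at small ε the success rate collapses toward 0.5, a coin flip, which is the effect the lab's ε slider visualises.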

## Key Concepts

**Membership inference**: even if the model refuses to complete a sensitive prefix, the perplexity differential between training members and non-members is a statistical signal that leaks membership. Confirming that a specific person's medical record was in the training data is itself a HIPAA breach.
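A minimal sketch of that signal, with fabricated per-token log-probabilities standing in for values the lab derives from the completion API (the threshold is likewise invented; real attacks calibrate it on shadow models):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a record's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Memorised records get near-zero log-probs; unseen records do not.
member_lps = [-0.05, -0.10, -0.02, -0.08]
non_member_lps = [-2.1, -1.8, -2.5, -1.9]

THRESHOLD = 2.0  # illustrative; tuned on shadow data in a real attack
for name, lps in [("member", member_lps), ("non-member", non_member_lps)]:
    ppl = perplexity(lps)
    verdict = "likely IN training set" if ppl < THRESHOLD else "likely NOT in training set"
    print(f"{name}: perplexity={ppl:.2f} -> {verdict}")
```

Note that the model never emits the record itself; the membership leak comes entirely from how confidently it scores the text.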

**The canary principle**: a unique synthetic record with no existence outside the training corpus. If it appears in any completion, there is no alternative explanation; forensic attribution is exact.
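Because canaries are exact strings, monitoring them is a simple scan over every completion. The sentinel ID format below is invented for illustration; only the principle (any hit is conclusive and escalates immediately) comes from the lab.

```python
# Hypothetical canary monitor: sentinel IDs exist only in the training
# corpus, so any completion containing one proves memorisation.
CANARY_IDS = {"CANARY-7F3A-9921", "CANARY-2B8C-4410"}

def check_completion(text: str) -> dict:
    hits = sorted(c for c in CANARY_IDS if c in text)
    if hits:
        # A single hit is exact forensic evidence: escalate as P0.
        return {"leak": True, "canaries": hits, "action": "P0 incident"}
    return {"leak": False, "canaries": [], "action": None}

print(check_completion("Record for user CANARY-7F3A-9921 shows ..."))
print(check_completion("Your order has shipped."))
```

Unlike perplexity thresholds, this check has no false positives by construction, which is what makes the attribution exact.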

## Based On

- Extracting Training Data from ChatGPT (Nasr et al., 2023)
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
- OWASP GenAI Security Project – Top 10 for Agentic Applications 2026