Quick Start - Security Features

⚑ 30-Second Setup for PII Protection

Step 1: Enable Redaction in UI

☑ Enable PII Redaction
○ Redaction Level: moderate

Step 2: Configure Environment

# Edit .env file
DEBUG_MODE=False
SANITIZE_LOGS=True

Step 3: Use Safe Data

  • ✅ Synthetic data (create_sample_transcripts.py)
  • ✅ De-identified data (all 18 HIPAA identifiers removed)
  • ❌ Real PHI on HuggingFace Spaces

That's it! 🎉


🚨 Critical Decision Tree

Do you have real patient/healthcare data?
├── YES → Contains ANY of these?
│   ├── Names, dates, SSN, MRN, emails, phones, addresses?
│   │   ├── YES → ⚠️ STOP! Cannot use HF Spaces!
│   │   │   └── Options:
│   │   │       1. Remove ALL 18 HIPAA identifiers (de-identify)
│   │   │       2. Deploy on AWS/Azure/GCP with BAA
│   │   │       3. Use synthetic data instead
│   │   └── NO → Proceed with redaction enabled
│   └── NO → Safe to use HF Spaces
└── NO → ✅ Safe to proceed
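The tree above can also be read as a small function. The sketch below is one interpretation, not code from the project: the `DataProfile` class, the `hosting_decision` function, and the returned strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Answers to the two questions in the decision tree (hypothetical structure)."""
    is_real_patient_data: bool
    has_direct_identifiers: bool  # names, dates, SSN, MRN, emails, phones, addresses


def hosting_decision(profile: DataProfile) -> str:
    """Return the tree's recommendation for a given data profile."""
    if not profile.is_real_patient_data:
        return "proceed"
    if profile.has_direct_identifiers:
        # Options from the tree: de-identify, deploy on BAA-covered
        # infrastructure, or switch to synthetic data.
        return "stop"
    return "proceed_with_redaction"


if __name__ == "__main__":
    print(hosting_decision(DataProfile(is_real_patient_data=True,
                                       has_direct_identifiers=True)))
```

Encoding the tree this way makes the "stop" branch hard to skip: any real data with direct identifiers is rejected before anything else is considered.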

📋 Quick Redaction Levels Guide

| Level | What's Redacted | Use When |
|---|---|---|
| Minimal | SSN, MRN, Account # | Testing, low-risk data |
| Moderate | + Emails, Phones, Dates | Recommended - balanced protection |
| Strict | + Names, Addresses | Maximum protection, compliance testing |
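The levels are cumulative: each one redacts everything the previous level does, plus more. A minimal sketch of that layering follows. The patterns, the `PATTERNS` table, and the `redact` function are illustrative assumptions and are not taken from the project's redaction.py. Names and addresses usually need more than regular expressions (for example named-entity recognition), so the strict level is left empty here.

```python
import re

# Illustrative patterns only; the real redaction module may differ.
PATTERNS = {
    "minimal": {
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "MRN": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
    },
    "moderate": {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
        "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    },
    # Names and addresses are not reliably matched by regex; omitted in this sketch.
    "strict": {},
}
LEVEL_ORDER = ["minimal", "moderate", "strict"]


def redact(text: str, level: str = "moderate") -> str:
    """Apply every pattern up to and including the requested level."""
    if level not in LEVEL_ORDER:
        raise ValueError(f"unknown redaction level: {level!r}")
    for name in LEVEL_ORDER[: LEVEL_ORDER.index(level) + 1]:
        for label, pattern in PATTERNS[name].items():
            text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "SSN 123-45-6789, call 555-123-4567 or jane@example.com"
    print(redact(sample, "minimal"))
    print(redact(sample, "moderate"))
```

Pattern-based redaction misses anything that does not fit the expected shape, which is why outputs still need a manual review.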

🔐 The 18 HIPAA Identifiers (Must Remove ALL for De-identification)

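  1. Names
  2. Geographic subdivisions smaller than a state
  3. Dates (except year)
  4. Phone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers (SSN)
  8. Medical record numbers (MRN)
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plates
  13. Device identifiers and serial numbers
  14. URLs
  15. IP addresses
  16. Biometric identifiers (e.g. fingerprints, voiceprints)
  17. Full-face photographs and comparable images
  18. Any other unique identifying number, characteristic, or code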

The redaction module helps with these, but it is not a substitute for manual verification!


⚙️ Environment Variables Cheat Sheet

# Security (ALWAYS set these in production)
DEBUG_MODE=False              # No debug output
SANITIZE_LOGS=True           # Redact PII from logs

# Logging
LOG_TO_FILE=True             # Create audit trail

# LLM Backend (for HIPAA: use local)
USE_LMSTUDIO=True            # ✅ Keeps data local
USE_HF_API=False             # ❌ Sends to HF servers

# LM Studio
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
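A startup check can catch a misconfigured environment before any transcript is processed. The sketch below assumes the values from .env have already been loaded into the process environment. The helpers `env_flag` and `check_security_settings` are hypothetical, and treating an unset variable as unsafe is a deliberate assumption of this example rather than documented application behavior.

```python
import os
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Parse a boolean environment variable such as DEBUG_MODE=False."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def check_security_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    # Unset variables fall back to the unsafe value so that they get reported.
    if env_flag("DEBUG_MODE", True, environ):
        problems.append("DEBUG_MODE should be False")
    if not env_flag("SANITIZE_LOGS", False, environ):
        problems.append("SANITIZE_LOGS should be True")
    if not env_flag("USE_LMSTUDIO", False, environ):
        problems.append("USE_LMSTUDIO should be True to keep data local")
    if env_flag("USE_HF_API", True, environ):
        problems.append("USE_HF_API should be False")
    return problems


if __name__ == "__main__":
    for problem in check_security_settings():
        print(f"WARNING: {problem}")
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary.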

🎯 Common Scenarios

Scenario 1: Testing with Fake Data

1. python create_sample_transcripts.py --count 5 --synthetic
2. Upload to TranscriptorAI
3. Optional: Enable redaction for testing
4. ✅ Safe - no real data

Scenario 2: De-identified Research Data

1. Remove all 18 HIPAA identifiers manually
2. Enable redaction (moderate or strict)
3. Upload to TranscriptorAI
4. Review outputs - verify no PII leaked
5. ✅ Safe if properly de-identified
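The output review in step 4 can be partly automated with a scan that flags anything that still looks like an identifier. This is a sketch under stated assumptions: `find_possible_pii` and its patterns are hypothetical and cover only a few identifier shapes, so an empty result does not prove the output is clean.

```python
import re

# A few identifier shapes to flag for human review (illustrative, not exhaustive).
CHECKS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_possible_pii(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, matched_text) for every suspicious match."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                findings.append((line_number, label, match.group(0)))
    return findings


if __name__ == "__main__":
    report = "Summary for [NAME]\nCallback number 555-123-4567\n"
    for line_number, label, value in find_possible_pii(report):
        print(f"line {line_number}: possible {label}: {value}")
```

Reporting line numbers lets a reviewer jump straight to the suspicious passage instead of rereading the whole output.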

Scenario 3: Real Patient Data (HIPAA)

1. ⚠️ DO NOT use HuggingFace Spaces
2. Deploy on HIPAA-eligible cloud infrastructure (AWS, Azure, or GCP)
3. Sign a BAA (Business Associate Agreement) with the cloud provider
4. Configure encryption, MFA, and audit logs
5. Enable PII redaction (strict mode)
6. ✅ Safe with proper infrastructure

🆘 Troubleshooting

Problem: "Redaction not working"

  • ✅ Check HAS_REDACTION is True in logs
  • ✅ Verify redaction.py exists
  • ✅ Check "Enable PII Redaction" is checked
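This guide does not show how HAS_REDACTION is set. A common pattern for such flags is an availability check on an optional module, sketched below. The `module_available` helper is hypothetical, and the application may derive the flag differently.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` is on the import path."""
    return importlib.util.find_spec(name) is not None


# Hypothetical reconstruction: the flag is False when redaction.py cannot be found,
# for example when the app is started from a different working directory.
HAS_REDACTION = module_available("redaction")
logger.info("HAS_REDACTION=%s", HAS_REDACTION)
```

If a flag like this is False even though redaction.py exists, the file is probably outside the import path of the running process.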

Problem: "Too much debug output"

  • ✅ Set DEBUG_MODE=False in .env
  • ✅ Restart application

Problem: "PII showing in logs"

  • ✅ Set SANITIZE_LOGS=True in .env
  • ✅ Check logger.py is imported
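For context, log sanitization of this kind is often implemented as a `logging.Filter` that rewrites each record before it reaches a handler. The sketch below shows that general technique with the standard library. It is not the contents of the project's logger.py, and the `SanitizingFilter` class, its patterns, and the "transcriptor" logger name are assumptions.

```python
import logging
import re

# Illustrative patterns; a real sanitizer would cover more identifier types.
LOG_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


class SanitizingFilter(logging.Filter):
    """Replace identifier-like substrings in a log record with placeholders."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges record.args into the message, so values passed
        # as %s arguments are sanitized too.
        message = record.getMessage()
        for label, pattern in LOG_PATTERNS:
            message = pattern.sub(f"[{label}]", message)
        record.msg = message
        record.args = ()
        return True  # keep the record, now sanitized


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(SanitizingFilter())
    log = logging.getLogger("transcriptor")  # hypothetical logger name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("Contact %s about SSN %s", "jane@example.com", "123-45-6789")
```

A filter only sees records that pass through the handler or logger it is attached to, so PII can still appear if a module logs through a handler without the filter.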

Problem: "Need to use real PHI"

  • ✅ Read SECURITY_AND_COMPLIANCE.md
  • ✅ Deploy on compliant infrastructure
  • ✅ Never use HF Spaces for real PHI

📞 Quick Links

  • Full Security Guide: SECURITY_AND_COMPLIANCE.md
  • What Changed: IMPROVEMENTS_SUMMARY.md
  • General Docs: README.md
  • HIPAA Guidance: https://www.hhs.gov/hipaa

✅ Pre-Flight Checklist

Before uploading sensitive data:

  • Read SECURITY_AND_COMPLIANCE.md
  • Data is de-identified OR synthetic
  • PII redaction enabled in UI
  • DEBUG_MODE=False
  • SANITIZE_LOGS=True
  • Using local LLM (not HF API)
  • Tested with fake data first
  • Will manually review outputs

If using real PHI:

  • Deployed on HIPAA infrastructure (NOT HF Spaces)
  • BAA signed with cloud provider
  • Compliance review completed

Remember: When in doubt, use synthetic data!