# Quick Start - Security Features

## ⚡ 30-Second Setup for PII Protection

### Step 1: Enable Redaction in UI

- ✅ Enable PII Redaction
- ✅ Redaction Level: moderate
### Step 2: Configure Environment

```
# Edit .env file
DEBUG_MODE=False
SANITIZE_LOGS=True
```
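As a sketch of how an app might read these flags, here is a minimal parser for boolean `.env`-style values. The `env_flag` helper is an assumption for illustration, not the project's actual loader:

```python
# Hypothetical sketch: parse the security flags the quick start sets in .env.
# The variable names match the guide; the env_flag helper itself is assumed.
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ("True", "1", "yes") as booleans."""
    return os.getenv(name, str(default)).strip().lower() in {"true", "1", "yes"}

DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
```

Parsing through a single helper avoids the classic bug where `bool("False")` evaluates to `True` in Python.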
### Step 3: Use Safe Data

- ✅ Synthetic data (create_sample_transcripts.py)
- ✅ De-identified data (all 18 HIPAA identifiers removed)
- ❌ Real PHI on HuggingFace Spaces

That's it! 🎉
## 🚨 Critical Decision Tree

```
Do you have real patient/healthcare data?
├── YES → Contains ANY of these?
│   ├── Names, dates, SSN, MRN, emails, phones, addresses?
│   │   ├── YES → ⚠️ STOP! Cannot use HF Spaces!
│   │   │   └── Options:
│   │   │       1. Remove ALL 18 HIPAA identifiers (de-identify)
│   │   │       2. Deploy on AWS/Azure/GCP with BAA
│   │   │       3. Use synthetic data instead
│   │   └── NO → Proceed with redaction enabled
│   └── NO → Safe to use HF Spaces
└── NO → ✅ Safe to proceed
```
## 📊 Quick Redaction Levels Guide
| Level | What's Redacted | Use When |
|---|---|---|
| Minimal | SSN, MRN, Account # | Testing, low-risk data |
| Moderate | + Emails, Phones, Dates | Recommended - balanced protection |
| Strict | + Names, Addresses | Maximum protection, compliance testing |
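The cumulative levels in the table could be sketched as regex passes applied in order. This is an illustration only: the project's `redaction.py` is not shown here, so the patterns and level names below are assumptions based on the table:

```python
# Illustrative sketch of cumulative redaction levels; not the shipped
# redaction.py. Patterns are simplified examples, not exhaustive.
import re

PATTERNS = {
    "minimal": {
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    },
    "moderate": {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
        "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    },
    # Names and addresses (strict) are free text; reliable detection needs
    # NER, not regex, so no pattern is shown for that tier.
    "strict": {},
}
LEVELS = ["minimal", "moderate", "strict"]

def redact(text: str, level: str = "moderate") -> str:
    """Apply each tier's patterns cumulatively up to the chosen level."""
    for lvl in LEVELS[: LEVELS.index(level) + 1]:
        for label, pattern in PATTERNS[lvl].items():
            text = pattern.sub(f"[{label}]", text)
    return text
```

Note how `moderate` runs the `minimal` patterns first, matching the table's "+" notation for each tier building on the previous one.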
## 📋 The 18 HIPAA Identifiers (Must Remove ALL for De-identification)
- Names
- Geographic subdivisions smaller than a state
- Dates (except year)
- Phone numbers
- Fax numbers
- Email addresses
- SSN
- MRN
- Health plan #
- Account #
- License #
- Vehicle IDs
- Device serial #
- URLs
- IP addresses
- Biometrics
- Photos
- Other unique IDs
The redaction module helps with these, but verify manually!
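For the manual-verification pass, a hedged sketch of a residual-identifier scan is shown below. It covers only machine-detectable formats from the list above; names, locations, photos, and biometrics still need human review. The `CHECKS` patterns are illustrative assumptions:

```python
# Hypothetical verification helper: scan supposedly de-identified text for
# leftover machine-detectable identifiers. Not a substitute for human review.
import re

CHECKS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "Email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "Phone": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "IP address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "URL": r"\bhttps?://\S+",
}

def find_identifiers(text: str) -> list[str]:
    """Return the identifier types that still appear in the text."""
    return [name for name, pat in CHECKS.items() if re.search(pat, text)]
```

An empty result means none of these patterns matched, not that the text is clean.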
## ⚙️ Environment Variables Cheat Sheet

```
# Security (ALWAYS set these in production)
DEBUG_MODE=False       # No debug output
SANITIZE_LOGS=True     # Redact PII from logs

# Logging
LOG_TO_FILE=True       # Create audit trail

# LLM Backend (for HIPAA: use local)
USE_LMSTUDIO=True      # ✅ Keeps data local
USE_HF_API=False       # ❌ Sends to HF servers

# LM Studio
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
```
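A startup guard could enforce the "ALWAYS set these in production" rule by refusing to run with an unsafe combination of the flags above. This `check_production_config` function is an assumed addition, not part of the shipped app:

```python
# Assumed startup guard (not part of the shipped app): flag any
# production-unsafe combination of the cheat-sheet variables.
import os

def check_production_config() -> list[str]:
    """Return a list of misconfiguration warnings; empty means safe."""
    problems = []
    if os.getenv("DEBUG_MODE", "False").lower() == "true":
        problems.append("DEBUG_MODE must be False in production")
    if os.getenv("SANITIZE_LOGS", "True").lower() != "true":
        problems.append("SANITIZE_LOGS must be True to redact PII from logs")
    if os.getenv("USE_HF_API", "False").lower() == "true":
        problems.append("USE_HF_API sends data to HF servers; use a local LLM")
    return problems
```

Calling this before launching the UI turns a silent misconfiguration into a visible startup warning.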
## 🎯 Common Scenarios

### Scenario 1: Testing with Fake Data

1. python create_sample_transcripts.py --count 5 --synthetic
2. Upload to TranscriptorAI
3. Optional: Enable redaction for testing
4. ✅ Safe - no real data
### Scenario 2: De-identified Research Data

1. Remove all 18 HIPAA identifiers manually
2. Enable redaction (moderate or strict)
3. Upload to TranscriptorAI
4. Review outputs - verify no PII leaked
5. ✅ Safe if properly de-identified
### Scenario 3: Real Patient Data (HIPAA)

1. ⚠️ DO NOT use HuggingFace Spaces
2. Deploy on AWS HealthLake / Azure Health / GCP
3. Sign a BAA with the cloud provider
4. Configure encryption, MFA, audit logs
5. Enable PII redaction (strict mode)
6. ✅ Safe with proper infrastructure
## 🔧 Troubleshooting

### Problem: "Redaction not working"

- ✅ Check HAS_REDACTION is True in logs
- ✅ Verify redaction.py exists
- ✅ Check "Enable PII Redaction" is checked

### Problem: "Too much debug output"

- ✅ Set DEBUG_MODE=False in .env
- ✅ Restart the application

### Problem: "PII showing in logs"

- ✅ Set SANITIZE_LOGS=True in .env
- ✅ Check logger.py is imported

### Problem: "Need to use real PHI"

- ✅ Read SECURITY_AND_COMPLIANCE.md
- ✅ Deploy on compliant infrastructure
- ✅ Never use HF Spaces for real PHI
## 🔗 Quick Links

- Full Security Guide: SECURITY_AND_COMPLIANCE.md
- What Changed: IMPROVEMENTS_SUMMARY.md
- General Docs: README.md
- HIPAA Guidance: https://www.hhs.gov/hipaa
## ✅ Pre-Flight Checklist
Before uploading sensitive data:
- Read SECURITY_AND_COMPLIANCE.md
- Data is de-identified OR synthetic
- PII redaction enabled in UI
- DEBUG_MODE=False
- SANITIZE_LOGS=True
- Using local LLM (not HF API)
- Tested with fake data first
- Will manually review outputs
If using real PHI:
- Deployed on HIPAA infrastructure (NOT HF Spaces)
- BAA signed with cloud provider
- Compliance review completed
Remember: When in doubt, use synthetic data!