# Quick Start - Security Features

## ⚡ 30-Second Setup for PII Protection

### Step 1: Enable Redaction in UI

- ✅ Enable PII Redaction
- ✅ Redaction Level: moderate
### Step 2: Configure Environment

```
# Edit .env file
DEBUG_MODE=False
SANITIZE_LOGS=True
```
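As a sketch of how an app might read these flags, here is a minimal parser for boolean `.env`-style values. The `env_flag` helper is an assumption for illustration, not the project's actual loader:

```python
# Hypothetical sketch: parse the security flags the quick start sets in .env.
# The variable names match the guide; the env_flag helper itself is assumed.
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ("True", "1", "yes") as booleans."""
    return os.getenv(name, str(default)).strip().lower() in {"true", "1", "yes"}

DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
```

Parsing through a single helper avoids the classic bug where `bool("False")` evaluates to `True` in Python.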
### Step 3: Use Safe Data

- ✅ Synthetic data (create_sample_transcripts.py)
- ✅ De-identified data (all 18 HIPAA identifiers removed)
- ❌ Real PHI on HuggingFace Spaces

That's it! 🎉
## 🚨 Critical Decision Tree

```
Do you have real patient/healthcare data?
├── YES → Contains ANY of these?
│   ├── Names, dates, SSN, MRN, emails, phones, addresses?
│   │   ├── YES → ⚠️ STOP! Cannot use HF Spaces!
│   │   │   └── Options:
│   │   │       1. Remove ALL 18 HIPAA identifiers (de-identify)
│   │   │       2. Deploy on AWS/Azure/GCP with BAA
│   │   │       3. Use synthetic data instead
│   │   └── NO → Proceed with redaction enabled
│   └── NO → Safe to use HF Spaces
└── NO → ✅ Safe to proceed
```
## 📊 Quick Redaction Levels Guide
| Level | What's Redacted | Use When |
|---|---|---|
| Minimal | SSN, MRN, Account # | Testing, low-risk data |
| Moderate | + Emails, Phones, Dates | Recommended - balanced protection |
| Strict | + Names, Addresses | Maximum protection, compliance testing |
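The cumulative levels in the table could be sketched as regex passes applied in order. This is an illustration only: the project's `redaction.py` is not shown here, so the patterns and level names below are assumptions based on the table:

```python
# Illustrative sketch of cumulative redaction levels; not the shipped
# redaction.py. Patterns are simplified examples, not exhaustive.
import re

PATTERNS = {
    "minimal": {
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    },
    "moderate": {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
        "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    },
    # Names and addresses (strict) are free text; reliable detection needs
    # NER, not regex, so no pattern is shown for that tier.
    "strict": {},
}
LEVELS = ["minimal", "moderate", "strict"]

def redact(text: str, level: str = "moderate") -> str:
    """Apply each tier's patterns cumulatively up to the chosen level."""
    for lvl in LEVELS[: LEVELS.index(level) + 1]:
        for label, pattern in PATTERNS[lvl].items():
            text = pattern.sub(f"[{label}]", text)
    return text
```

Note how `moderate` runs the `minimal` patterns first, matching the table's "+" notation for each tier building on the previous one.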
## 📋 The 18 HIPAA Identifiers (Must Remove ALL for De-identification)
- Names
- Geographic subdivisions smaller than a state
- Dates (except year)
- Phone numbers
- Fax numbers
- Email addresses
- SSN
- MRN
- Health plan #
- Account #
- License #
- Vehicle IDs
- Device serial #
- URLs
- IP addresses
- Biometrics
- Photos
- Other unique IDs
The redaction module helps with these, but verify manually!
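For the manual-verification pass, a hedged sketch of a residual-identifier scan is shown below. It covers only machine-detectable formats from the list above; names, locations, photos, and biometrics still need human review. The `CHECKS` patterns are illustrative assumptions:

```python
# Hypothetical verification helper: scan supposedly de-identified text for
# leftover machine-detectable identifiers. Not a substitute for human review.
import re

CHECKS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "Email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "Phone": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "IP address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "URL": r"\bhttps?://\S+",
}

def find_identifiers(text: str) -> list[str]:
    """Return the identifier types that still appear in the text."""
    return [name for name, pat in CHECKS.items() if re.search(pat, text)]
```

An empty result means none of these patterns matched, not that the text is clean.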
## ⚙️ Environment Variables Cheat Sheet

```
# Security (ALWAYS set these in production)
DEBUG_MODE=False       # No debug output
SANITIZE_LOGS=True     # Redact PII from logs

# Logging
LOG_TO_FILE=True       # Create audit trail

# LLM Backend (for HIPAA: use local)
USE_LMSTUDIO=True      # ✅ Keeps data local
USE_HF_API=False       # ❌ Sends to HF servers

# LM Studio
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
```
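A startup guard could enforce the "ALWAYS set these in production" rule by refusing to run with an unsafe combination of the flags above. This `check_production_config` function is an assumed addition, not part of the shipped app:

```python
# Assumed startup guard (not part of the shipped app): flag any
# production-unsafe combination of the cheat-sheet variables.
import os

def check_production_config() -> list[str]:
    """Return a list of misconfiguration warnings; empty means safe."""
    problems = []
    if os.getenv("DEBUG_MODE", "False").lower() == "true":
        problems.append("DEBUG_MODE must be False in production")
    if os.getenv("SANITIZE_LOGS", "True").lower() != "true":
        problems.append("SANITIZE_LOGS must be True to redact PII from logs")
    if os.getenv("USE_HF_API", "False").lower() == "true":
        problems.append("USE_HF_API sends data to HF servers; use a local LLM")
    return problems
```

Calling this before launching the UI turns a silent misconfiguration into a visible startup warning.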
## 🎯 Common Scenarios

### Scenario 1: Testing with Fake Data

1. python create_sample_transcripts.py --count 5 --synthetic
2. Upload to TranscriptorAI
3. Optional: Enable redaction for testing
4. ✅ Safe - no real data
### Scenario 2: De-identified Research Data

1. Remove all 18 HIPAA identifiers manually
2. Enable redaction (moderate or strict)
3. Upload to TranscriptorAI
4. Review outputs - verify no PII leaked
5. ✅ Safe if properly de-identified
### Scenario 3: Real Patient Data (HIPAA)

1. ⚠️ DO NOT use HuggingFace Spaces
2. Deploy on AWS HealthLake / Azure Health / GCP
3. Sign a BAA with the cloud provider
4. Configure encryption, MFA, audit logs
5. Enable PII redaction (strict mode)
6. ✅ Safe with proper infrastructure
## 🔧 Troubleshooting

### Problem: "Redaction not working"

- ✅ Check HAS_REDACTION is True in logs
- ✅ Verify redaction.py exists
- ✅ Check "Enable PII Redaction" is checked

### Problem: "Too much debug output"

- ✅ Set DEBUG_MODE=False in .env
- ✅ Restart the application

### Problem: "PII showing in logs"

- ✅ Set SANITIZE_LOGS=True in .env
- ✅ Check logger.py is imported

### Problem: "Need to use real PHI"

- ✅ Read SECURITY_AND_COMPLIANCE.md
- ✅ Deploy on compliant infrastructure
- ✅ Never use HF Spaces for real PHI
## 🔗 Quick Links

- Full Security Guide: SECURITY_AND_COMPLIANCE.md
- What Changed: IMPROVEMENTS_SUMMARY.md
- General Docs: README.md
- HIPAA Guidance: https://www.hhs.gov/hipaa
## ✅ Pre-Flight Checklist
Before uploading sensitive data:
- Read SECURITY_AND_COMPLIANCE.md
- Data is de-identified OR synthetic
- PII redaction enabled in UI
- DEBUG_MODE=False
- SANITIZE_LOGS=True
- Using local LLM (not HF API)
- Tested with fake data first
- Will manually review outputs
If using real PHI:
- Deployed on HIPAA infrastructure (NOT HF Spaces)
- BAA signed with cloud provider
- Compliance review completed
Remember: When in doubt, use synthetic data!