# Quick Start - Security Features

## ⚡ 30-Second Setup for PII Protection

### Step 1: Enable Redaction in UI

```
☑ Enable PII Redaction
○ Redaction Level: moderate
```

### Step 2: Configure Environment

```bash
# Edit .env file
DEBUG_MODE=False
SANITIZE_LOGS=True
```

### Step 3: Use Safe Data

- ✅ Synthetic data (`create_sample_transcripts.py`)
- ✅ De-identified data (all 18 HIPAA identifiers removed)
- ❌ Real PHI on HuggingFace Spaces

That's it! 🎉

---

## 🚨 Critical Decision Tree

```
Do you have real patient/healthcare data?
├── YES → Contains ANY of these?
│   ├── Names, dates, SSN, MRN, emails, phones, addresses?
│   │   ├── YES → ⚠️ STOP! Cannot use HF Spaces!
│   │   │   └── Options:
│   │   │       1. Remove ALL 18 HIPAA identifiers (de-identify)
│   │   │       2. Deploy on AWS/Azure/GCP with BAA
│   │   │       3. Use synthetic data instead
│   │   └── NO → Proceed with redaction enabled
│   └── NO → Safe to use HF Spaces
└── NO → ✅ Safe to proceed
```

---

## 📋 Quick Redaction Levels Guide

| Level | What's Redacted | Use When |
|-------|----------------|----------|
| **Minimal** | SSN, MRN, Account # | Testing, low-risk data |
| **Moderate** | Minimal + Emails, Phones, Dates | **Recommended** - balanced protection |
| **Strict** | Moderate + Names, Addresses | Maximum protection, compliance testing |

---

## 🔐 The 18 HIPAA Identifiers (Must Remove ALL for De-identification)

1. Names
2. Geographic locations smaller than a state
3. Dates (except year)
4. Phone numbers
5. Fax numbers
6. Email addresses
7. Social Security numbers (SSN)
8. Medical record numbers (MRN)
9. Health plan numbers
10. Account numbers
11. License numbers
12. Vehicle IDs
13. Device serial numbers
14. URLs
15. IP addresses
16. Biometrics
17. Photos
18. Other unique IDs

**The redaction module helps with these, but verify manually!**

---

## ⚙️ Environment Variables Cheat Sheet

```bash
# Security (ALWAYS set these in production)
DEBUG_MODE=False        # No debug output
SANITIZE_LOGS=True      # Redact PII from logs

# Logging
LOG_TO_FILE=True        # Create audit trail

# LLM Backend (for HIPAA: use local)
USE_LMSTUDIO=True       # ✅ Keeps data local
USE_HF_API=False        # ❌ Sends to HF servers

# LM Studio
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
```

---

## 🎯 Common Scenarios

### Scenario 1: Testing with Fake Data

```
1. python create_sample_transcripts.py --count 5 --synthetic
2. Upload to TranscriptorAI
3. Optional: Enable redaction for testing
4. ✅ Safe - no real data
```

### Scenario 2: De-identified Research Data

```
1. Remove all 18 HIPAA identifiers manually
2. Enable redaction (moderate or strict)
3. Upload to TranscriptorAI
4. Review outputs - verify no PII leaked
5. ✅ Safe if properly de-identified
```

### Scenario 3: Real Patient Data (HIPAA)

```
1. ⚠️ DO NOT use HuggingFace Spaces
2. Deploy on AWS HealthLake / Azure Health / GCP
3. Sign BAA with cloud provider
4. Configure encryption, MFA, audit logs
5. Enable PII redaction (strict mode)
6. ✅ Safe with proper infrastructure
```

---

## 🆘 Troubleshooting

**Problem:** "Redaction not working"
- ✅ Check `HAS_REDACTION` is True in logs
- ✅ Verify `redaction.py` exists
- ✅ Check "Enable PII Redaction" is checked

**Problem:** "Too much debug output"
- ✅ Set `DEBUG_MODE=False` in `.env`
- ✅ Restart application

**Problem:** "PII showing in logs"
- ✅ Set `SANITIZE_LOGS=True` in `.env`
- ✅ Check `logger.py` is imported

**Problem:** "Need to use real PHI"
- ✅ Read `SECURITY_AND_COMPLIANCE.md`
- ✅ Deploy on compliant infrastructure
- ✅ Never use HF Spaces for real PHI

---

## 📞 Quick Links

- **Full Security Guide:** `SECURITY_AND_COMPLIANCE.md`
- **What Changed:** `IMPROVEMENTS_SUMMARY.md`
- **General Docs:** `README.md`
- **HIPAA Guidance:** https://www.hhs.gov/hipaa

---

## ✅ Pre-Flight Checklist

Before uploading sensitive data:

- [ ] Read `SECURITY_AND_COMPLIANCE.md`
- [ ] Data is de-identified OR synthetic
- [ ] PII redaction enabled in UI
- [ ] `DEBUG_MODE=False`
- [ ] `SANITIZE_LOGS=True`
- [ ] Using local LLM (not HF API)
- [ ] Tested with fake data first
- [ ] Will manually review outputs

**If using real PHI:**

- [ ] Deployed on HIPAA infrastructure (NOT HF Spaces)
- [ ] BAA signed with cloud provider
- [ ] Compliance review completed

---

**Remember: When in doubt, use synthetic data!**
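
---

## 🧪 Optional: Scripted Environment Check

The environment-related items of the pre-flight checklist can also be checked with a short script. The sketch below is not part of TranscriptorAI: `check_env` and `EXPECTED` are hypothetical names, and it assumes the settings are compared case-insensitively against the values from the cheat sheet. It reads the process environment, so values that exist only in `.env` must be loaded first.

```python
import os
from typing import Mapping

# Expected values taken from the environment variables cheat sheet.
# ASSUMPTION: values are compared case-insensitively; how the application
# itself parses these variables is not documented here.
EXPECTED = {
    "DEBUG_MODE": "false",
    "SANITIZE_LOGS": "true",
    "USE_LMSTUDIO": "true",
    "USE_HF_API": "false",
}


def check_env(env: Mapping[str, str]) -> list[str]:
    """Return one message per variable that is unset or has an unexpected value."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected.capitalize()})")
        elif actual.strip().lower() != expected:
            problems.append(f"{name}={actual} (expected {expected.capitalize()})")
    return problems


if __name__ == "__main__":
    issues = check_env(os.environ)
    for issue in issues:
        print(f"WARNING: {issue}")
    if not issues:
        print("Environment settings match the checklist.")
```

A clean run only confirms the four settings. It does not verify that data is de-identified or that redaction is enabled in the UI, so the manual checklist items still apply.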