# Security and Compliance Guide for TranscriptorAI

**Last Updated:** 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.

---

## ⚠️ CRITICAL SECURITY NOTICE

### HuggingFace Spaces and HIPAA Compliance

**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**

#### Why HuggingFace Spaces Cannot Support HIPAA Data:

1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, and HIPAA requires a BAA before a vendor handles PHI on your behalf
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No Healthcare Attestations** - HF Spaces lacks the attestations commonly expected for PHI workloads (HITRUST, SOC 2 Type II for healthcare). Note that HIPAA itself has no official certification
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Access logging and audit trails are insufficient for HIPAA compliance
7. **Security History** - A 2024 security incident involved unauthorized access to Spaces secrets

### What Data Can Be Used on HF Spaces?

✅ **SAFE TO USE:**

- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)

❌ **NEVER USE:**

- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)

---

## HIPAA Safe Harbor De-Identification

If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:

1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes; the first three ZIP digits may be kept only for areas with more than 20,000 people)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (year alone is OK; ages over 89 must be grouped as 90+)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photographs and comparable images**
18. **Other unique identifying numbers/codes**

### Using the Built-in Redaction Feature

TranscriptorAI now includes a PII redaction module:

1. Check the **Enable PII Redaction** checkbox in the UI
2. **Choose Redaction Level:**
   - **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
   - **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
   - **Strict**: Redacts all PII including names and addresses

⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:

- It is not guaranteed to catch all PII
- You are still responsible for verifying that data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data

---

## HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

### Option 1: AWS (Recommended for Healthcare)

- **AWS HealthLake** - HIPAA-eligible service purpose-built for FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage

### Option 2: Microsoft Azure

- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS

### Option 3: Google Cloud Platform

- **Cloud Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure

### Option 4: On-Premises

- Deploy on your own servers that meet HIPAA security requirements
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff

### Deployment Checklist for HIPAA Compliance

- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies

---

## Security Features in TranscriptorAI

### Built-in Security Controls

1. **PII Redaction Module** (`redaction.py`)
   - Detects and masks 10+ types of PII
   - Configurable redaction levels
   - Redaction reporting for audit trails

2. **Secure Logging** (`logger.py`)
   - Automatic PII sanitization in logs
   - Token masking (shows only first/last 4 chars)
   - Configurable log levels
   - Prevents sensitive data leakage

3. **Type Safety**
   - Standardized LLM response handling
   - Prevents data corruption/leakage through type errors
   - Defensive type checking

4. **Environment Variable Protection**
   - API keys stored in environment variables (not code)
   - Never logged in full
   - Masked in debug output

### Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=
```

---

## Data Flow and Storage

### Where Data Goes

1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session

### Data Retention

| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |

### Minimizing Data Exposure

**Best Practices:**

1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - It sends data to HuggingFace servers
4. **Clear session data** - Restart the app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching

---

## LLM Backend Security Considerations

### HuggingFace Inference API

❌ **NOT recommended for PHI:**

- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)

### LM Studio (Local)

✅ **Recommended for PHI:**

- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure

### OpenAI/Anthropic APIs

⚠️ **Use with caution:**

- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA

---

## Compliance Certifications Required

For healthcare use, your deployment should have:

- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider

For European data (GDPR):

- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Data Protection Impact Assessment (DPIA)** completed

---

## Incident Response

If you suspect a data breach:

1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
6. **Notify affected individuals** - Without unreasonable delay and no later than 60 days after discovery, per HIPAA
7. **File breach report** - Report to HHS within 60 days if 500 or more individuals are affected; smaller breaches are reported to HHS annually

---

## Testing with Sensitive Data

### Safe Testing Workflow

1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data

### Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.

---

## Security Checklist for Production Deployment

### Pre-Deployment

- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only

### Deployment

- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram

### Post-Deployment

- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)

---

## Frequently Asked Questions

**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and infrastructure that meets HIPAA requirements.

**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.


**Q: What if I only have de-identified data?**
A: Data de-identified under Safe Harbor (all 18 HIPAA identifiers removed, and no actual knowledge that the remaining information could identify an individual) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.

**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).

---

## Additional Resources

- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/

---

## Support and Questions

For security questions or to report vulnerabilities:

- **Security Issues:** Report privately via GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md

---

**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**
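
---

## Appendix: Illustrative Settings Check

The pre-deployment checklist and the token masking described under Secure Logging can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code in `logger.py`: the function names `mask_token` and `check_security_settings` are hypothetical, and the way `True`/`False` values are parsed is an assumption about how the `.env` flags are read. Only the setting names (`DEBUG_MODE`, `SANITIZE_LOGS`, `USE_HF_API`, `USE_LMSTUDIO`) come from this guide.

```python
"""Illustrative only: hypothetical helpers, not TranscriptorAI's actual code."""
from typing import List, Mapping


def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a secret."""
    if len(token) <= 2 * visible:
        # Too short to reveal anything safely: mask it entirely.
        return "*" * len(token)
    hidden = len(token) - 2 * visible
    return token[:visible] + "*" * hidden + token[-visible:]


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Parse a True/False style setting (assumed format); missing means False."""
    return env.get(name, "False").strip().lower() in {"1", "true", "yes"}


def check_security_settings(env: Mapping[str, str]) -> List[str]:
    """Return checklist violations; an empty list means none were found."""
    problems: List[str] = []
    if _flag(env, "DEBUG_MODE"):
        problems.append("DEBUG_MODE should be False in production")
    if not _flag(env, "SANITIZE_LOGS"):
        problems.append("SANITIZE_LOGS should be True")
    if _flag(env, "USE_HF_API"):
        problems.append("USE_HF_API sends data to HuggingFace servers")
    if not _flag(env, "USE_LMSTUDIO"):
        problems.append("USE_LMSTUDIO should be True for local processing")
    return problems
```

One way to use this would be to call `check_security_settings(os.environ)` at startup and refuse to launch, or at least log a warning, when the returned list is not empty. Passing the settings in as a mapping rather than reading `os.environ` inside the function keeps the check easy to test.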