# Security and Compliance Guide for TranscriptorAI

**Last Updated:** 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.

---

## ⚠️ CRITICAL SECURITY NOTICE

### HuggingFace Spaces and HIPAA Compliance

**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**

#### Why HuggingFace Spaces Cannot Support HIPAA Data:

1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, and a signed BAA is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No HIPAA Certification** - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - A 2024 security incident exposed potential vulnerabilities in Spaces secrets
### What Data Can Be Used on HF Spaces?

✅ **SAFE TO USE:**

- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)

❌ **NEVER USE:**

- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)
---

## HIPAA Safe Harbor De-Identification

If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:

1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (year alone is OK)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photos**
18. **Other unique identifying numbers/codes**
### Using the Built-in Redaction Feature

TranscriptorAI now includes a PII redaction module:

1. **Enable PII Redaction** checkbox in the UI
2. **Choose Redaction Level:**
   - **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
   - **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
   - **Strict**: Redacts all PII including names and addresses

⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:

- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
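The levels above can be approximated with simple regex passes. The sketch below is illustrative only: the pattern set, level names, and placeholder format in the shipped `redaction.py` may differ, and `strict` name/address detection (which typically needs NER rather than regex) is omitted here.

```python
import re

# Illustrative patterns per level; the shipped redaction.py may use different ones.
PATTERNS = {
    "minimal": {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "MRN": r"\bMRN[:\s]*\d{6,10}\b",
    },
    "moderate": {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "MRN": r"\bMRN[:\s]*\d{6,10}\b",
        "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
        "PHONE": r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "DATE": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    },
}

def redact(text: str, level: str = "moderate") -> str:
    """Replace each matched identifier with a [TYPE] placeholder."""
    for label, pattern in PATTERNS[level].items():
        text = re.sub(pattern, f"[{label}]", text)
    return text
```

Note how the level only controls which patterns run; regex alone cannot reliably catch free-text names, which is exactly why the manual-review caveats above still apply.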
---

## HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

### Option 1: AWS (Recommended for Healthcare)

- **AWS HealthLake** - Purpose-built for HIPAA/FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in transit, audit logging
- **Cost:** ~$50-500/month depending on usage

### Option 2: Microsoft Azure

- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS

### Option 3: Google Cloud Platform

- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure

### Option 4: On-Premises

- Deploy on your own HIPAA-certified servers
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff
### Deployment Checklist for HIPAA Compliance

- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies

---
## Security Features in TranscriptorAI

### Built-in Security Controls

1. **PII Redaction Module** (`redaction.py`)
   - Detects and masks 10+ types of PII
   - Configurable redaction levels
   - Redaction reporting for audit trails
2. **Secure Logging** (`logger.py`)
   - Automatic PII sanitization in logs
   - Token masking (shows only the first/last 4 characters)
   - Configurable log levels
   - Prevents sensitive data leakage
3. **Type Safety**
   - Standardized LLM response handling
   - Prevents data corruption/leakage through type errors
   - Defensive type checking
4. **Environment Variable Protection**
   - API keys stored in environment variables (not code)
   - Never logged in full
   - Masked in debug output
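The token-masking rule above (only the first and last 4 characters visible) can be sketched as follows; this illustrates the idea rather than reproducing the actual `logger.py` code:

```python
def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a secret.

    Tokens too short to partially reveal are fully masked, so a short
    key is never leaked in logs.
    """
    if len(token) <= 2 * visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - 2 * visible) + token[-visible:]
```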
### Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
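Application code would typically read these flags back via `os.getenv`; a minimal sketch (the variable names come from the config above, but the `env_flag` helper is a hypothetical illustration, not the app's actual loader):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an env var like SANITIZE_LOGS=True as a boolean."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

# Fail safe: sanitization on, debug off, external API off by default.
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
USE_HF_API = env_flag("USE_HF_API", default=False)
LMSTUDIO_URL = os.getenv("LMSTUDIO_URL", "http://localhost:1234/v1/chat/completions")
```

Choosing safe defaults matters here: if the `.env` file is missing, the app should fall back to the most private configuration, not the most convenient one.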
---

## Data Flow and Storage

### Where Data Goes

1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session
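The cleanup step can be made robust with a `try/finally` around a temporary file, so uploads never outlive processing even when an error occurs. This is a sketch of the pattern, not TranscriptorAI's actual cleanup code:

```python
import os
import tempfile

def process_upload(data: bytes) -> str:
    """Write an upload to a temp file, process it, and always delete it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # ... text extraction / LLM analysis would happen here ...
        return f"processed {len(data)} bytes"
    finally:
        os.remove(path)  # runs even on exceptions, so no PHI lingers on disk
```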
### Data Retention

| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |
### Minimizing Data Exposure

**Best Practices:**

1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data is sent to HuggingFace servers
4. **Clear session data** - Restart the app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching

---
## LLM Backend Security Considerations

### HuggingFace Inference API

❌ **NOT recommended for PHI:**

- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)

### LM Studio (Local)

✅ **Recommended for PHI:**

- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
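LM Studio serves an OpenAI-compatible chat endpoint (the `LMSTUDIO_URL` shown in the configuration section). A minimal sketch of querying it locally with only the standard library — the model name is a placeholder, and the exact response shape should be verified against your LM Studio version:

```python
import json
from urllib import request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # assumed local endpoint

def build_payload(transcript: str, model: str = "local-model") -> dict:
    """OpenAI-style chat payload; nothing leaves the machine when sent locally."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the interview transcript."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.2,
    }

def query_local_llm(transcript: str) -> str:
    """POST the payload to the local server and return the first completion."""
    req = request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(transcript)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the URL points at `localhost`, the transcript never crosses a network boundary, which is the whole security argument for this backend.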
### OpenAI/Anthropic APIs

⚠️ **Use with caution:**

- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero-data-retention policies are available
- Requires an Enterprise plan + signed BAA

---
## Compliance Certifications Required

For healthcare use, your deployment should have:

- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider

For European data (GDPR):

- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Privacy Impact Assessment (PIA)** completed

---
## Incident Response

If you suspect a data breach:

1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify your cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
6. **Notify affected individuals** - Within 60 days per HIPAA
7. **File a breach report** - With HHS, if more than 500 individuals are affected

---
## Testing with Sensitive Data

### Safe Testing Workflow

1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data

### Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.
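To see roughly what such a generator does, here is a minimal hedged sketch — the names, topics, and phrasing below are fabricated placeholders, and the real `create_sample_transcripts.py` likely produces richer output:

```python
import random

# Deliberately implausible placeholder names: no resemblance to real people.
FAKE_NAMES = ["Alex Test", "Jordan Sample", "Casey Placeholder"]
TOPICS = ["medication adherence", "side effects", "dosing schedule"]

def make_transcript(seed: int = 0) -> str:
    """Return one synthetic patient-interview transcript, deterministic per seed."""
    rng = random.Random(seed)
    name, topic = rng.choice(FAKE_NAMES), rng.choice(TOPICS)
    return (
        f"Interviewer: Thanks for joining, {name}. Today we discuss {topic}.\n"
        f"Patient ({name}): Sure, happy to share my experience.\n"
    )

# Ten fake transcripts, reproducible because each is seeded.
transcripts = [make_transcript(i) for i in range(10)]
```

Seeding each transcript keeps test runs reproducible, which makes it easier to verify redaction behavior against a fixed input set.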
---
## Security Checklist for Production Deployment

### Pre-Deployment

- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only

### Deployment

- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram

### Post-Deployment

- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)

---
## Frequently Asked Questions

**Q: Can I use private HF Spaces for HIPAA data?**

A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.

**Q: Is the PII redaction module HIPAA-compliant?**

A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

**Q: Can I get a BAA from HuggingFace?**

A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.

**Q: What if I only have de-identified data?**

A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.

**Q: Can I use this for research?**

A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

**Q: What about GDPR compliance?**

A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).

---
## Additional Resources

- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/

---
## Support and Questions

For security questions or to report vulnerabilities:

- **Security Issues:** Create a private issue in GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md

---

**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**