# Security and Compliance Guide for TranscriptorAI
**Last Updated:** 2025-10-29
This document provides critical security information for using TranscriptorAI with sensitive healthcare data.
---
## ⚠️ CRITICAL SECURITY NOTICE
### HuggingFace Spaces and HIPAA Compliance
**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**
#### Why HuggingFace Spaces Cannot Support HIPAA Data:
1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No HIPAA Certification** - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - 2024 security incident exposed potential vulnerabilities in Spaces secrets
### What Data Can Be Used on HF Spaces?
**SAFE TO USE:**
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)
**NEVER USE:**
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)
---
## HIPAA Safe Harbor De-Identification
If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:
1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (the year alone may be retained; ages over 89 must be aggregated into a single "90+" category)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photos**
18. **Other unique identifying numbers/codes**
### Using the Built-in Redaction Feature
TranscriptorAI now includes a PII redaction module:
1. **Enable PII Redaction** - Check the checkbox in the UI
2. **Choose Redaction Level:**
- **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
- **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
- **Strict**: Redacts all PII including names and addresses
⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
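As an illustration of how the three levels might differ, here is a minimal regex-based sketch. The function name `redact` and the exact pattern set are hypothetical; the real behavior lives in `redaction.py` and may differ:

```python
import re

# Illustrative patterns only -- the shipped redaction.py module may differ.
PATTERNS = {
    "minimal": {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "MRN": r"\bMRN[:#]?\s*\d{6,10}\b",
    },
    "moderate": {
        "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
        "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    },
}
# Each level is a superset of the one below it.
PATTERNS["moderate"].update(PATTERNS["minimal"])

def redact(text: str, level: str = "moderate") -> str:
    """Replace each match with a [REDACTED:<type>] placeholder."""
    for label, pattern in PATTERNS[level].items():
        text = re.sub(pattern, f"[REDACTED:{label}]", text)
    return text
```

Note that pattern-based redaction is exactly why manual review remains necessary: a regex cannot catch free-text names or unusual identifier formats.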
---
## HIPAA-Compliant Deployment Options
For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:
### Option 1: AWS (Recommended for Healthcare)
- **AWS HealthLake** - Purpose-built for HIPAA/FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage
### Option 2: Microsoft Azure
- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS
### Option 3: Google Cloud Platform
- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure
### Option 4: On-Premises
- Deploy on your own HIPAA-certified servers
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff
### Deployment Checklist for HIPAA Compliance
- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies
---
## Security Features in TranscriptorAI
### Built-in Security Controls
1. **PII Redaction Module** (`redaction.py`)
- Detects and masks 10+ types of PII
- Configurable redaction levels
- Redaction reporting for audit trails
2. **Secure Logging** (`logger.py`)
- Automatic PII sanitization in logs
- Token masking (shows only first/last 4 chars)
- Configurable log levels
- Prevents sensitive data leakage
3. **Type Safety**
- Standardized LLM response handling
- Prevents data corruption/leakage through type errors
- Defensive type checking
4. **Environment Variable Protection**
- API keys stored in environment variables (not code)
- Never logged in full
- Masked in debug output
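The token-masking behavior described above (first/last 4 characters visible) can be sketched as follows; the helper name `mask_token` is illustrative, not the actual `logger.py` API:

```python
def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a secret.

    Tokens too short to mask safely are hidden entirely.
    """
    if len(token) <= visible * 2:
        return "*" * len(token)
    hidden = len(token) - visible * 2
    return token[:visible] + "*" * hidden + token[-visible:]
```

Masking preserves just enough of the token to correlate log entries without ever exposing a usable credential.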
### Configuring Security Settings
```bash
# .env file (NEVER commit this to version control!)
# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True
# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False
# Enable file logging for audit trails
LOG_TO_FILE=True
# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
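Environment variables are always strings, so boolean flags like `SANITIZE_LOGS` need explicit parsing on the application side. A minimal sketch (the `env_flag` helper is illustrative, not part of the codebase):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable ('True', '1', 'yes' -> True)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("true", "1", "yes")

# Fail safe: sanitize by default, stay quiet by default.
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
```

Choosing safe defaults matters here: a missing variable should leave sanitization on and debug output off.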
---
## Data Flow and Storage
### Where Data Goes
1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session
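The cleanup step can be made robust by scoping temporary files to a context manager, so they are removed even if processing raises an exception. A minimal sketch (the function name and the processing placeholder are illustrative):

```python
import tempfile
from pathlib import Path

def process_upload(data: bytes) -> str:
    """Write an upload to a scoped temp dir, process it, and guarantee cleanup."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "transcript.txt"
        path.write_bytes(data)
        # ... text extraction / LLM analysis would happen here ...
        text = path.read_text()
    # tmpdir and everything in it is deleted here, even on exceptions
    return text
```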
### Data Retention
| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |
### Minimizing Data Exposure
**Best Practices:**
1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data sent to HuggingFace servers
4. **Clear session data** - Restart app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching
---
## LLM Backend Security Considerations
### HuggingFace Inference API
**NOT recommended for PHI:**
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)
### LM Studio (Local)
**Recommended for PHI:**
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
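Calling a local LM Studio instance goes through the OpenAI-compatible chat endpoint shown in the `.env` example earlier. A minimal stdlib-only sketch; the model name, prompt, and temperature are placeholders:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_request(prompt: str, model: str = "local-model") -> dict:
    """Build an OpenAI-compatible chat payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local server and return the reply text.

    Nothing leaves the machine: the only network hop is to localhost.
    """
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        LMSTUDIO_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```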
### OpenAI/Anthropic APIs
⚠️ **Use with caution:**
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA
---
## Compliance Certifications Required
For healthcare use, your deployment should have:
- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider
For European data (GDPR):
- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Privacy Impact Assessment (PIA)** completed
---
## Incident Response
If you suspect a data breach:
1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
6. **Notify affected individuals** - Within 60 days per HIPAA
7. **File breach report with HHS** - Within 60 days if 500 or more individuals are affected; smaller breaches must be reported annually
---
## Testing with Sensitive Data
### Safe Testing Workflow
1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data
### Creating Synthetic Test Data
Use the included script:
```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```
This generates realistic but completely fabricated patient/HCP interviews.
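If you prefer not to use the script, a few lines of stdlib Python can generate obviously fake transcript stubs of your own. Everything below (the names, topics, and interview template) is fabricated for illustration and is not part of the shipped script:

```python
import random

FAKE_NAMES = ["Test Patient A", "Test Patient B", "Test HCP C"]
TOPICS = ["medication adherence", "side effects", "treatment access"]

def make_transcript(seed: int) -> str:
    """Return a clearly synthetic interview stub with no real identifiers."""
    rng = random.Random(seed)  # seeded so each stub is reproducible
    speaker = rng.choice(FAKE_NAMES)
    topic = rng.choice(TOPICS)
    return (
        "[SYNTHETIC DATA - NOT A REAL INTERVIEW]\n"
        f"Interviewer: Can you tell me about {topic}?\n"
        f"{speaker}: (fabricated response about {topic})\n"
    )

transcripts = [make_transcript(i) for i in range(10)]
```

Labeling every stub with a `[SYNTHETIC DATA]` banner makes it impossible to later mistake test fixtures for real interviews.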
---
## Security Checklist for Production Deployment
### Pre-Deployment
- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only
### Deployment
- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram
### Post-Deployment
- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)
---
## Frequently Asked Questions
**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.
**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.
**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.
**Q: What if I only have de-identified data?**
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.
**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.
**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).
---
## Additional Resources
- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/
---
## Support and Questions
For security questions or to report vulnerabilities:
- **Security Issues:** Create a private issue in GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md
---
**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.
**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**