# Security and Compliance Guide for TranscriptorAI
**Last Updated:** 2025-10-29
This document provides critical security information for using TranscriptorAI with sensitive healthcare data.
---
## ⚠️ CRITICAL SECURITY NOTICE
### HuggingFace Spaces and HIPAA Compliance
**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**
#### Why HuggingFace Spaces Cannot Support HIPAA Data:
1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No HIPAA Certification** - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - 2024 security incident exposed potential vulnerabilities in Spaces secrets
### What Data Can Be Used on HF Spaces?
**SAFE TO USE:**
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)
**NEVER USE:**
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)
---
## HIPAA Safe Harbor De-Identification
If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:
1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (the year alone may be retained; ages over 89 must be aggregated into a single "90+" category)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photos**
18. **Other unique identifying numbers/codes**
### Using the Built-in Redaction Feature
TranscriptorAI now includes a PII redaction module:
1. **Enable PII Redaction** - Check the checkbox in the UI
2. **Choose Redaction Level:**
- **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
- **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
- **Strict**: Redacts all PII including names and addresses
⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
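As an illustration of how the three levels might differ, here is a minimal regex-based sketch. The function name `redact` and the exact pattern set are hypothetical; the real behavior lives in `redaction.py` and may differ:

```python
import re

# Illustrative patterns only -- the shipped redaction.py module may differ.
PATTERNS = {
    "minimal": {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "MRN": r"\bMRN[:#]?\s*\d{6,10}\b",
    },
    "moderate": {
        "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
        "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    },
}
# Each level is a superset of the one below it.
PATTERNS["moderate"].update(PATTERNS["minimal"])

def redact(text: str, level: str = "moderate") -> str:
    """Replace each match with a [REDACTED:<type>] placeholder."""
    for label, pattern in PATTERNS[level].items():
        text = re.sub(pattern, f"[REDACTED:{label}]", text)
    return text
```

Note that pattern-based redaction is exactly why manual review remains necessary: a regex cannot catch free-text names or unusual identifier formats.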
---
## HIPAA-Compliant Deployment Options
For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:
### Option 1: AWS (Recommended for Healthcare)
- **AWS HealthLake** - Purpose-built for HIPAA/FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage
### Option 2: Microsoft Azure
- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS
### Option 3: Google Cloud Platform
- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure
### Option 4: On-Premises
- Deploy on your own HIPAA-certified servers
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff
### Deployment Checklist for HIPAA Compliance
- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies
---
## Security Features in TranscriptorAI
### Built-in Security Controls
1. **PII Redaction Module** (`redaction.py`)
- Detects and masks 10+ types of PII
- Configurable redaction levels
- Redaction reporting for audit trails
2. **Secure Logging** (`logger.py`)
- Automatic PII sanitization in logs
- Token masking (shows only first/last 4 chars)
- Configurable log levels
- Prevents sensitive data leakage
3. **Type Safety**
- Standardized LLM response handling
- Prevents data corruption/leakage through type errors
- Defensive type checking
4. **Environment Variable Protection**
- API keys stored in environment variables (not code)
- Never logged in full
- Masked in debug output
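The token-masking behavior described above (first/last 4 characters visible) can be sketched as follows; the helper name `mask_token` is illustrative, not the actual `logger.py` API:

```python
def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a secret.

    Tokens too short to mask safely are hidden entirely.
    """
    if len(token) <= visible * 2:
        return "*" * len(token)
    hidden = len(token) - visible * 2
    return token[:visible] + "*" * hidden + token[-visible:]
```

Masking preserves just enough of the token to correlate log entries without ever exposing a usable credential.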
### Configuring Security Settings
```bash
# .env file (NEVER commit this to version control!)
# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True
# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False
# Enable file logging for audit trails
LOG_TO_FILE=True
# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
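Environment variables are always strings, so boolean flags like `SANITIZE_LOGS` need explicit parsing on the application side. A minimal sketch (the `env_flag` helper is illustrative, not part of the codebase):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable ('True', '1', 'yes' -> True)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("true", "1", "yes")

# Fail safe: sanitize by default, stay quiet by default.
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
```

Choosing safe defaults matters here: a missing variable should leave sanitization on and debug output off.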
---
## Data Flow and Storage
### Where Data Goes
1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session
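The cleanup step can be made robust by scoping temporary files to a context manager, so they are removed even if processing raises an exception. A minimal sketch (the function name and the processing placeholder are illustrative):

```python
import tempfile
from pathlib import Path

def process_upload(data: bytes) -> str:
    """Write an upload to a scoped temp dir, process it, and guarantee cleanup."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "transcript.txt"
        path.write_bytes(data)
        # ... text extraction / LLM analysis would happen here ...
        text = path.read_text()
    # tmpdir and everything in it is deleted here, even on exceptions
    return text
```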
### Data Retention
| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |
### Minimizing Data Exposure
**Best Practices:**
1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data sent to HuggingFace servers
4. **Clear session data** - Restart app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching
---
## LLM Backend Security Considerations
### HuggingFace Inference API
**NOT recommended for PHI:**
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)
### LM Studio (Local)
**Recommended for PHI:**
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
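Calling a local LM Studio instance goes through the OpenAI-compatible chat endpoint shown in the `.env` example earlier. A minimal stdlib-only sketch; the model name, prompt, and temperature are placeholders:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_request(prompt: str, model: str = "local-model") -> dict:
    """Build an OpenAI-compatible chat payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local server and return the reply text.

    Nothing leaves the machine: the only network hop is to localhost.
    """
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        LMSTUDIO_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```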
### OpenAI/Anthropic APIs
⚠️ **Use with caution:**
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA
---
## Compliance Certifications Required
For healthcare use, your deployment should have:
- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider
For European data (GDPR):
- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Privacy Impact Assessment (PIA)** completed
---
## Incident Response
If you suspect a data breach:
1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
6. **Notify affected individuals** - Within 60 days per HIPAA
7. **File breach report with HHS** - Within 60 days if 500 or more individuals are affected; smaller breaches must be reported annually
---
## Testing with Sensitive Data
### Safe Testing Workflow
1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data
### Creating Synthetic Test Data
Use the included script:
```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```
This generates realistic but completely fabricated patient/HCP interviews.
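If you prefer not to use the script, a few lines of stdlib Python can generate obviously fake transcript stubs of your own. Everything below (the names, topics, and interview template) is fabricated for illustration and is not part of the shipped script:

```python
import random

FAKE_NAMES = ["Test Patient A", "Test Patient B", "Test HCP C"]
TOPICS = ["medication adherence", "side effects", "treatment access"]

def make_transcript(seed: int) -> str:
    """Return a clearly synthetic interview stub with no real identifiers."""
    rng = random.Random(seed)  # seeded so each stub is reproducible
    speaker = rng.choice(FAKE_NAMES)
    topic = rng.choice(TOPICS)
    return (
        "[SYNTHETIC DATA - NOT A REAL INTERVIEW]\n"
        f"Interviewer: Can you tell me about {topic}?\n"
        f"{speaker}: (fabricated response about {topic})\n"
    )

transcripts = [make_transcript(i) for i in range(10)]
```

Labeling every stub with a `[SYNTHETIC DATA]` banner makes it impossible to later mistake test fixtures for real interviews.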
---
## Security Checklist for Production Deployment
### Pre-Deployment
- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only
### Deployment
- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram
### Post-Deployment
- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)
---
## Frequently Asked Questions
**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.
**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.
**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.
**Q: What if I only have de-identified data?**
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.
**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.
**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).
---
## Additional Resources
- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/
---
## Support and Questions
For security questions or to report vulnerabilities:
- **Security Issues:** Create a private issue in GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md
---
**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.
**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**