Security and Compliance Guide for TranscriptorAI
Last Updated: 2025-10-29
This document provides critical security information for using TranscriptorAI with sensitive healthcare data.
⚠️ CRITICAL SECURITY NOTICE
HuggingFace Spaces and HIPAA Compliance
TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).
Why HuggingFace Spaces Cannot Support HIPAA Data:
- No Business Associate Agreement (BAA) - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
- Shared Infrastructure - Spaces run on multi-tenant infrastructure not certified for PHI
- No HIPAA Certification - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
- Platform Access - HF staff may have technical access to private Spaces for maintenance/debugging
- Log Retention - Logs are kept for 30 days and may inadvertently contain PHI fragments
- No Audit Controls - Insufficient access logging and audit trails for HIPAA compliance
- Security History - 2024 security incident exposed potential vulnerabilities in Spaces secrets
What Data Can Be Used on HF Spaces?
✅ SAFE TO USE:
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)
❌ NEVER USE:
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)
HIPAA Safe Harbor De-Identification
If you must use real healthcare data, you MUST remove all 18 HIPAA identifiers before uploading to HF Spaces:
- Names - Patient, relatives, employers
- Geographic subdivisions - Smaller than state (addresses, cities, ZIP codes)
- Dates - Birth dates, admission dates, discharge dates, death dates (year is OK)
- Telephone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers - License plates, VINs
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers - Fingerprints, voice prints
- Full-face photos
- Other unique identifying numbers/codes
Using the Built-in Redaction Feature
TranscriptorAI now includes a PII redaction module:
- Enable PII Redaction checkbox in the UI
- Choose Redaction Level:
- Minimal: Only redacts obvious identifiers (SSN, MRN, account numbers)
- Moderate: Redacts common PII (emails, phones, dates, SSN, MRN) - RECOMMENDED
- Strict: Redacts all PII including names and addresses
⚠️ Important: The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
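To make the redaction levels concrete, here is a minimal, illustrative sketch of pattern-based redaction. The patterns, level names, and `redact` function below are hypothetical and not the actual `redaction.py` API; they only show the general idea of masking matched identifiers with labeled placeholders.

```python
import re

# Illustrative identifier patterns (NOT the real redaction.py implementation).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

# Which patterns each hypothetical level applies.
LEVELS = {
    "minimal": ["ssn", "mrn"],
    "moderate": ["ssn", "mrn", "email", "phone"],
}

def redact(text: str, level: str = "moderate") -> str:
    """Mask every matched identifier with a labeled placeholder."""
    for name in LEVELS[level]:
        text = PATTERNS[name].sub(f"[REDACTED-{name.upper()}]", text)
    return text
```

Note that regex patterns like these cannot reliably catch free-text names or addresses (the "strict" level), which is one reason manual review remains necessary.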
HIPAA-Compliant Deployment Options
For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:
Option 1: AWS (Recommended for Healthcare)
- AWS HealthLake - Purpose-built for HIPAA/FHIR data
- EC2 + S3 with BAA - Self-managed on AWS infrastructure
- Requires: Signed AWS BAA, encryption at rest/in-transit, audit logging
- Cost: ~$50-500/month depending on usage
Option 2: Microsoft Azure
- Azure Health Data Services - HIPAA-compliant platform
- Azure VM + Blob Storage - Self-hosted with BAA
- Requires: Signed Azure BAA, compliance certifications enabled
- Cost: Similar to AWS
Option 3: Google Cloud Platform
- Healthcare API - HIPAA-compliant
- Compute Engine + Cloud Storage with BAA
- Requires: Signed GCP BAA
- Cost: Similar to AWS/Azure
Option 4: On-Premises
- Deploy on your own HIPAA-certified servers
- Full control over data and access
- Requires: Your own HIPAA compliance program, security controls, auditing
- Cost: Infrastructure + IT staff
Deployment Checklist for HIPAA Compliance
- Signed Business Associate Agreement with cloud provider
- Encryption at rest (AES-256)
- Encryption in transit (TLS 1.2+)
- Multi-factor authentication (MFA) enabled
- Role-based access control (RBAC)
- Audit logging enabled and retained (6 years)
- Regular security assessments
- Incident response plan documented
- Breach notification procedures in place
- Regular backups with encryption
- Staff HIPAA training completed
- Data retention and destruction policies
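For the encryption-in-transit item above, a Python deployment terminating TLS itself can refuse anything older than TLS 1.2 via the standard library. A minimal sketch (certificate loading omitted; a real server must also call `load_cert_chain`):

```python
import ssl

def make_tls_context() -> ssl.SSLContext:
    # Server-side context that rejects protocol versions below TLS 1.2,
    # matching the checklist requirement above.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

In practice, TLS is often terminated at a load balancer or reverse proxy instead; the same minimum-version policy applies there.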
Security Features in TranscriptorAI
Built-in Security Controls
PII Redaction Module (redaction.py)
- Detects and masks 10+ types of PII
- Configurable redaction levels
- Redaction reporting for audit trails
Secure Logging (logger.py)
- Automatic PII sanitization in logs
- Token masking (shows only first/last 4 chars)
- Configurable log levels
- Prevents sensitive data leakage
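The first/last-4-characters token masking described above can be sketched as follows. This is an illustrative reimplementation, not the actual logger.py code:

```python
def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last few characters of a secret.
    Tokens too short to mask safely are fully starred out."""
    if len(token) <= visible * 2:
        return "*" * len(token)
    return f"{token[:visible]}...{token[-visible:]}"
```

A masked token like `hf_a...mnop` is enough to identify which credential a log line refers to without exposing the secret itself.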
Type Safety
- Standardized LLM response handling
- Prevents data corruption/leakage through type errors
- Defensive type checking
Environment Variable Protection
- API keys stored in environment variables (not code)
- Never logged in full
- Masked in debug output
Configuring Security Settings
# .env file (NEVER commit this to version control!)
# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True
# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False
# Enable file logging for audit trails
LOG_TO_FILE=True
# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
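Boolean settings like `SANITIZE_LOGS` and `DEBUG_MODE` arrive from the environment as strings, so they need explicit parsing. A minimal sketch (the app's actual config loading, e.g. via python-dotenv, may differ):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean setting from the environment, treating
    'true'/'1'/'yes' (any case) as True and everything else as False."""
    return os.environ.get(name, str(default)).strip().lower() in {"true", "1", "yes"}

SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
```

Parsing this way avoids the common bug where `bool("False")` evaluates to `True` because any non-empty string is truthy.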
Data Flow and Storage
Where Data Goes
- Upload: Files uploaded through Gradio UI → server memory (temporary)
- Processing: Text extraction → LLM analysis → report generation
- Output: CSV/PDF reports generated → available for download
- Cleanup: Temporary files deleted after session
Data Retention
| Location | What's Stored | Retention |
|---|---|---|
| HF Spaces (if used) | Logs, temporary files | 30 days (platform logs) |
| Local Deployment | Only what you configure | You control |
| LLM API (HF/OpenAI) | Prompts/responses | Varies by provider |
| Local LM Studio | Nothing (all local) | You control |
Minimizing Data Exposure
Best Practices:
- Use local LLM (LM Studio) - Keeps all data on your servers
- Enable PII redaction - Remove identifiers before processing
- Don't use HF Inference API - Data sent to HuggingFace servers
- Clear session data - Restart app between sessions with sensitive data
- Use incognito/private browsing - Prevents browser caching
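The "clear session data" practice above amounts to never letting uploads persist on disk longer than one processing pass. An illustrative sketch using a throwaway directory (the app's actual cleanup mechanism may differ):

```python
import tempfile
from pathlib import Path

def process_in_scratch(upload_bytes: bytes) -> str:
    """Write an upload to a temporary directory that is deleted as soon
    as processing finishes, so no transcript lingers between sessions."""
    with tempfile.TemporaryDirectory() as scratch:
        path = Path(scratch) / "transcript.txt"
        path.write_bytes(upload_bytes)
        text = path.read_text()   # stand-in for real text extraction
        return text.upper()       # scratch directory removed on exit
```

Because `TemporaryDirectory` is a context manager, cleanup happens even if processing raises an exception.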
LLM Backend Security Considerations
HuggingFace Inference API
❌ NOT recommended for PHI:
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)
LM Studio (Local)
✅ Recommended for PHI:
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
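LM Studio exposes an OpenAI-compatible endpoint at the `LMSTUDIO_URL` configured earlier, so a request can be built with only the standard library. A sketch that constructs (but does not send) such a request; the `local-model` name is a placeholder, since LM Studio serves whichever model is loaded:

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       url: str = "http://localhost:1234/v1/chat/completions",
                       model: str = "local-model") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request against a local LM Studio
    server, so transcript text never leaves the machine."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

# To actually send it (requires a running LM Studio server):
# response = urllib.request.urlopen(build_chat_request("Summarize: ..."))
```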
OpenAI/Anthropic APIs
⚠️ Use with caution:
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA
Compliance Certifications Required
For healthcare use, your deployment should have:
- SOC 2 Type II - Security and availability controls
- HITRUST CSF - Healthcare industry framework
- ISO 27001 - Information security management
- HIPAA Compliance - Via BAA with cloud provider
For European data (GDPR):
- GDPR Article 9 - Special category data (health)
- Data Processing Agreement (DPA) with providers
- Privacy Impact Assessment (PIA) completed
Incident Response
If you suspect a data breach:
- Immediately stop processing - Shut down the application
- Preserve logs - Don't delete anything
- Notify your security team - Escalate within 1 hour
- Notify cloud provider (if applicable)
- Document the incident - Who, what, when, where, how
- Notify affected individuals - Within 60 days per HIPAA
- File breach report with HHS - within 60 days if 500+ individuals affected (smaller breaches are reported annually)
Testing with Sensitive Data
Safe Testing Workflow
- Start with synthetic data - Generate realistic but fake transcripts
- Test with de-identified data - Remove all 18 HIPAA identifiers
- Enable PII redaction - Use "strict" mode
- Review outputs manually - Check for leaked PII
- Deploy to compliant infrastructure - Only then use real data
Creating Synthetic Test Data
Use the included script:
python create_sample_transcripts.py --count 10 --type patient --synthetic
This generates realistic but completely fabricated patient/HCP interviews.
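If the bundled script is unavailable, fully fabricated snippets can be generated with a few lines of standard-library Python. Everything below (names, topics, phrasing) is invented; it does not reflect how create_sample_transcripts.py actually works:

```python
import random

# Entirely fictional names and topics; no real data involved.
FAKE_NAMES = ["Alex Rivera", "Sam Chen", "Jordan Patel"]
TOPICS = ["medication adherence", "side effects", "follow-up scheduling"]

def synthetic_transcript(seed: int = 0) -> str:
    """Return a short fabricated interview snippet.
    Seeding makes the test data reproducible across runs."""
    rng = random.Random(seed)
    name = rng.choice(FAKE_NAMES)
    topic = rng.choice(TOPICS)
    return (f"Interviewer: Thanks for joining, {name}. "
            f"Can you tell me about {topic}?\n"
            f"{name}: Sure. It has been manageable overall.")
```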
Security Checklist for Production Deployment
Pre-Deployment
- De-identify all test data
- Enable PII redaction in UI
- Set DEBUG_MODE=False
- Set SANITIZE_LOGS=True
- Remove any hardcoded API keys
- Use environment variables for secrets
- Configure LM Studio (not HF API)
- Test on synthetic data only
Deployment
- Deploy on HIPAA-compliant infrastructure
- Sign BAA with cloud provider
- Enable encryption at rest
- Enable encryption in transit (HTTPS/TLS 1.2+)
- Configure MFA for all users
- Set up RBAC (role-based access control)
- Enable audit logging
- Configure log retention (6+ years)
- Set up automated backups
- Document data flow diagram
Post-Deployment
- Conduct security assessment
- Penetration testing completed
- Staff training on HIPAA completed
- Incident response plan in place
- Breach notification procedures documented
- Regular vulnerability scanning (monthly)
- Access reviews (quarterly)
- Compliance audit (annual)
Frequently Asked Questions
Q: Can I use private HF Spaces for HIPAA data? A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.
Q: Is the PII redaction module HIPAA-compliant? A: The redaction module is a tool to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.
Q: Can I get a BAA from HuggingFace? A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.
Q: What if I only have de-identified data? A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.
Q: Can I use this for research? A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.
Q: What about GDPR compliance? A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).
Additional Resources
- HIPAA Guidance: https://www.hhs.gov/hipaa
- HIPAA Safe Harbor Method: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- HuggingFace Security: https://huggingface.co/docs/hub/security
- AWS HIPAA Compliance: https://aws.amazon.com/compliance/hipaa-compliance/
- HITRUST Alliance: https://hitrustalliance.net/
Support and Questions
For security questions or to report vulnerabilities:
- Security Issues: Create a private issue in GitHub (do not disclose publicly)
- Compliance Questions: Consult with your organization's compliance officer
- General Support: See README.md
Remember: When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.
This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.