Security and Compliance Guide for TranscriptorAI

Last Updated: 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.


⚠️ CRITICAL SECURITY NOTICE

HuggingFace Spaces and HIPAA Compliance

TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).

Why HuggingFace Spaces Cannot Support HIPAA Data:

  1. No Business Associate Agreement (BAA) - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
  2. Shared Infrastructure - Spaces run on multi-tenant infrastructure not certified for PHI
  3. No HIPAA Certification - HF Spaces lacks the certifications commonly required for healthcare workloads (HITRUST CSF, SOC 2 Type II)
  4. Platform Access - HF staff may have technical access to private Spaces for maintenance/debugging
  5. Log Retention - Logs are kept for 30 days and may inadvertently contain PHI fragments
  6. No Audit Controls - Insufficient access logging and audit trails for HIPAA compliance
  7. Security History - 2024 security incident exposed potential vulnerabilities in Spaces secrets

What Data Can Be Used on HF Spaces?

✅ SAFE TO USE:

  • Fully de-identified data (all 18 HIPAA identifiers removed)
  • Synthetic/test data (completely fabricated)
  • Anonymized market research data
  • General business-confidential data (non-healthcare)

❌ NEVER USE:

  • Real patient data with any identifiers
  • Healthcare provider information with identifying details
  • Data subject to HIPAA, GDPR Article 9, or similar regulations
  • Any data containing the 18 HIPAA identifiers (see below)

HIPAA Safe Harbor De-Identification

If you must use real healthcare data, you MUST remove all 18 HIPAA identifiers before uploading to HF Spaces:

  1. Names - Patient, relatives, employers
  2. Geographic subdivisions - Smaller than state (addresses, cities, ZIP codes)
  3. Dates - All elements of dates (except year) directly related to the individual: birth, admission, discharge, and death dates; ages over 89 must also be aggregated into a single 90+ category
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers - License plates, VINs
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers - Fingerprints, voice prints
  17. Full-face photos
  18. Other unique identifying numbers/codes
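
Before uploading anything, an automated pre-scan can flag some of these identifiers. Below is a minimal stdlib sketch (the patterns and function name are illustrative, not TranscriptorAI's actual code) that covers only a handful of the 18 categories; it is a first-pass filter, not a substitute for full de-identification:

```python
import re

# Illustrative patterns for a few Safe Harbor identifiers; real
# de-identification needs far broader coverage plus manual review.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "ip": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
}

def scan_for_phi(text: str) -> dict:
    """Return identifier type -> list of matches found in text."""
    findings = {}
    for name, pattern in PHI_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            findings[name] = hits
    return findings
```

A transcript that produces any findings should be treated as containing PHI until proven otherwise.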

Using the Built-in Redaction Feature

TranscriptorAI now includes a PII redaction module:

  1. Enable PII Redaction checkbox in the UI
  2. Choose Redaction Level:
    • Minimal: Only redacts obvious identifiers (SSN, MRN, account numbers)
    • Moderate: Redacts common PII (emails, phones, dates, SSN, MRN) - RECOMMENDED
    • Strict: Redacts all PII including names and addresses

⚠️ Important: The redaction module is a tool to ASSIST with de-identification, but:

  • It is not 100% guaranteed to catch all PII
  • You are still responsible for verifying data is properly de-identified
  • Manual review is recommended for regulated data
  • Consider using professional de-identification services for high-risk data
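
The tiered behavior described above (minimal/moderate/strict) could be approximated as follows. This is a hedged sketch only; the actual redaction.py module may implement levels and patterns differently:

```python
import re

# Hypothetical level-based redaction; illustrative, not the real module.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.I),
    "account": re.compile(r"\bACCT[:\s]*\d+\b", re.I),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "name": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+\b"),
}

# Each level redacts a superset of the level below it.
RULES = {
    "minimal":  ["ssn", "mrn", "account"],
    "moderate": ["ssn", "mrn", "account", "email", "phone", "date"],
    "strict":   ["ssn", "mrn", "account", "email", "phone", "date", "name"],
}

def redact(text: str, level: str = "moderate") -> str:
    """Replace each matched identifier with a [TYPE] placeholder."""
    for kind in RULES[level]:
        text = PATTERNS[kind].sub(f"[{kind.upper()}]", text)
    return text
```

Note how pattern-based redaction inherently misses free-text identifiers (nicknames, rare conditions, small geographic areas), which is exactly why manual review remains necessary.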

HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

Option 1: AWS (Recommended for Healthcare)

  • AWS HealthLake - Purpose-built for HIPAA/FHIR data
  • EC2 + S3 with BAA - Self-managed on AWS infrastructure
  • Requires: Signed AWS BAA, encryption at rest/in-transit, audit logging
  • Cost: ~$50-500/month depending on usage

Option 2: Microsoft Azure

  • Azure Health Data Services - HIPAA-compliant platform
  • Azure VM + Blob Storage - Self-hosted with BAA
  • Requires: Signed Azure BAA, compliance certifications enabled
  • Cost: Similar to AWS

Option 3: Google Cloud Platform

  • Healthcare API - HIPAA-compliant
  • Compute Engine + Cloud Storage with BAA
  • Requires: Signed GCP BAA
  • Cost: Similar to AWS/Azure

Option 4: On-Premises

  • Deploy on your own HIPAA-certified servers
  • Full control over data and access
  • Requires: Your own HIPAA compliance program, security controls, auditing
  • Cost: Infrastructure + IT staff

Deployment Checklist for HIPAA Compliance

  • Signed Business Associate Agreement with cloud provider
  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.2+)
  • Multi-factor authentication (MFA) enabled
  • Role-based access control (RBAC)
  • Audit logging enabled and retained (6 years)
  • Regular security assessments
  • Incident response plan documented
  • Breach notification procedures in place
  • Regular backups with encryption
  • Staff HIPAA training completed
  • Data retention and destruction policies

Security Features in TranscriptorAI

Built-in Security Controls

  1. PII Redaction Module (redaction.py)

    • Detects and masks 10+ types of PII
    • Configurable redaction levels
    • Redaction reporting for audit trails
  2. Secure Logging (logger.py)

    • Automatic PII sanitization in logs
    • Token masking (shows only first/last 4 chars)
    • Configurable log levels
    • Prevents sensitive data leakage
  3. Type Safety

    • Standardized LLM response handling
    • Prevents data corruption/leakage through type errors
    • Defensive type checking
  4. Environment Variable Protection

    • API keys stored in environment variables (not code)
    • Never logged in full
    • Masked in debug output
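
The token-masking behavior described above (only the first/last 4 characters visible) might look like this sketch; the function name is illustrative, not necessarily what logger.py uses:

```python
def mask_secret(token: str, visible: int = 4) -> str:
    """Show only the first/last `visible` chars of a secret for safe logging."""
    if len(token) <= visible * 2:
        return "*" * len(token)  # too short to reveal anything safely
    return token[:visible] + "..." + token[-visible:]

print(mask_secret("hf_abcdefghijklmnop1234"))  # hf_a...1234
```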

Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
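
Boolean settings like SANITIZE_LOGS and DEBUG_MODE are typically parsed at startup. A sketch of fail-safe parsing (the helper name is illustrative; the key point is that the secure value is the default when a variable is unset or malformed):

```python
import os

def env_flag(name: str, default: bool, environ=os.environ) -> bool:
    """Parse True/False-style environment variables tolerantly."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

# Fail safe: sanitization defaults ON, debug defaults OFF.
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
```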

Data Flow and Storage

Where Data Goes

  1. Upload: Files uploaded through Gradio UI → Server memory (temporary)
  2. Processing: Text extraction → LLM analysis → Report generation
  3. Output: CSV/PDF reports generated → Downloads
  4. Cleanup: Temporary files deleted after session
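
Step 4 (cleanup) is most reliable when deletion is guaranteed even if processing raises. A stdlib sketch of that pattern (not the app's actual code):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def scratch_file(suffix=".txt"):
    """Temporary file that is always deleted, even if processing fails."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        os.remove(path)

with scratch_file() as path:
    with open(path, "w") as f:
        f.write("transient transcript text")
# The file is already gone here, regardless of errors above.
```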

Data Retention

| Location | What's Stored | Retention |
| --- | --- | --- |
| HF Spaces (if used) | Logs, temporary files | 30 days (platform logs) |
| Local Deployment | Only what you configure | You control |
| LLM API (HF/OpenAI) | Prompts/responses | Varies by provider |
| Local LM Studio | Nothing (all local) | You control |

Minimizing Data Exposure

Best Practices:

  1. Use local LLM (LM Studio) - Keeps all data on your servers
  2. Enable PII redaction - Remove identifiers before processing
  3. Don't use HF Inference API - Data sent to HuggingFace servers
  4. Clear session data - Restart app between sessions with sensitive data
  5. Use incognito/private browsing - Prevents browser caching

LLM Backend Security Considerations

HuggingFace Inference API

❌ NOT recommended for PHI:

  • Data sent to HuggingFace servers for processing
  • Logs kept for 30 days
  • No BAA available for API usage (as of 2025-01)
  • May be used for model improvement (check ToS)

LM Studio (Local)

✅ Recommended for PHI:

  • All processing happens on your server
  • No data sent externally
  • Full control over model and data
  • Can run on HIPAA-compliant infrastructure
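
LM Studio serves an OpenAI-compatible chat completions endpoint on localhost (the LMSTUDIO_URL configured earlier). A sketch of calling it with only the standard library — function and model names are illustrative, and the request stays entirely on your machine:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(transcript: str, model: str = "local-model") -> dict:
    """Build an OpenAI-style chat payload for the local LM Studio server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the interview transcript."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.2,
    }

def call_lmstudio(transcript: str) -> str:
    """POST to the local server; requires LM Studio running with a model loaded."""
    payload = json.dumps(build_chat_request(transcript)).encode()
    req = urllib.request.Request(
        LMSTUDIO_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```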

OpenAI/Anthropic APIs

⚠️ Use with caution:

  • OpenAI offers BAAs for Enterprise customers
  • Anthropic offers BAAs for Enterprise
  • Zero data retention policies available
  • Requires Enterprise plan + signed BAA

Compliance Certifications Required

For healthcare use, your deployment should have:

  • SOC 2 Type II - Security and availability controls
  • HITRUST CSF - Healthcare industry framework
  • ISO 27001 - Information security management
  • HIPAA Compliance - Via BAA with cloud provider

For European data (GDPR):

  • GDPR Article 9 - Special category data (health)
  • Data Processing Agreement (DPA) with providers
  • Privacy Impact Assessment (PIA) completed

Incident Response

If you suspect a data breach:

  1. Immediately stop processing - Shut down the application
  2. Preserve logs - Don't delete anything
  3. Notify your security team - Escalate within 1 hour
  4. Notify cloud provider (if applicable)
  5. Document the incident - Who, what, when, where, how
  6. Notify affected individuals - Within 60 days per HIPAA
  7. File breach report - Notify HHS within 60 days of discovery if 500 or more individuals are affected; smaller breaches can be logged and reported to HHS annually

Testing with Sensitive Data

Safe Testing Workflow

  1. Start with synthetic data - Generate realistic but fake transcripts
  2. Test with de-identified data - Remove all 18 HIPAA identifiers
  3. Enable PII redaction - Use "strict" mode
  4. Review outputs manually - Check for leaked PII
  5. Deploy to compliant infrastructure - Only then use real data

Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.
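
If the script is unavailable in your environment, synthetic transcripts can be fabricated with a few lines of stdlib Python. This sketch is safe by construction because every value is drawn from a fabricated vocabulary (speakers, topics, and responses below are invented for illustration):

```python
import random

# Fabricated vocabulary - every value here is synthetic by construction.
TOPICS = ["medication adherence", "side effects", "appointment scheduling"]
RESPONSES = [
    "I usually remember my doses in the morning.",
    "The new regimen has been easier to follow.",
    "I sometimes feel tired in the afternoons.",
]

def make_transcript(turns=6, seed=None):
    """Generate an alternating Interviewer/Patient transcript of fake content."""
    rng = random.Random(seed)
    lines = []
    for i in range(turns):
        if i % 2 == 0:
            lines.append(f"Interviewer: Can you tell me about {rng.choice(TOPICS)}?")
        else:
            lines.append(f"Patient: {rng.choice(RESPONSES)}")
    return "\n".join(lines)
```

Pass a seed for reproducible test fixtures.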


Security Checklist for Production Deployment

Pre-Deployment

  • De-identify all test data
  • Enable PII redaction in UI
  • Set DEBUG_MODE=False
  • Set SANITIZE_LOGS=True
  • Remove any hardcoded API keys
  • Use environment variables for secrets
  • Configure LM Studio (not HF API)
  • Test on synthetic data only

Deployment

  • Deploy on HIPAA-compliant infrastructure
  • Sign BAA with cloud provider
  • Enable encryption at rest
  • Enable encryption in transit (HTTPS/TLS 1.2+)
  • Configure MFA for all users
  • Set up RBAC (role-based access control)
  • Enable audit logging
  • Configure log retention (6+ years)
  • Set up automated backups
  • Document data flow diagram

Post-Deployment

  • Conduct security assessment
  • Penetration testing completed
  • Staff training on HIPAA completed
  • Incident response plan in place
  • Breach notification procedures documented
  • Regular vulnerability scanning (monthly)
  • Access reviews (quarterly)
  • Compliance audit (annual)

Frequently Asked Questions

Q: Can I use private HF Spaces for HIPAA data?
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.

Q: Is the PII redaction module HIPAA-compliant?
A: The redaction module is a tool to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

Q: Can I get a BAA from HuggingFace?
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.

Q: What if I only have de-identified data?
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.

Q: Can I use this for research?
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

Q: What about GDPR compliance?
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).



Support and Questions

For security questions or to report vulnerabilities:

  • Security Issues: Create a private issue in GitHub (do not disclose publicly)
  • Compliance Questions: Consult with your organization's compliance officer
  • General Support: See README.md

Remember: When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.