Security and Compliance Guide for TranscriptorAI

Last Updated: 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.


⚠️ CRITICAL SECURITY NOTICE

HuggingFace Spaces and HIPAA Compliance

TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).

Why HuggingFace Spaces Cannot Support HIPAA Data:

  1. No Business Associate Agreement (BAA) - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
  2. Shared Infrastructure - Spaces run on multi-tenant infrastructure not certified for PHI
  3. No HIPAA Certification - HF Spaces lacks the certifications commonly required for healthcare workloads (HITRUST CSF, SOC 2 Type II)
  4. Platform Access - HF staff may have technical access to private Spaces for maintenance/debugging
  5. Log Retention - Logs are kept for 30 days and may inadvertently contain PHI fragments
  6. No Audit Controls - Insufficient access logging and audit trails for HIPAA compliance
  7. Security History - 2024 security incident exposed potential vulnerabilities in Spaces secrets

What Data Can Be Used on HF Spaces?

✅ SAFE TO USE:

  • Fully de-identified data (all 18 HIPAA identifiers removed)
  • Synthetic/test data (completely fabricated)
  • Anonymized market research data
  • General business-confidential data (non-healthcare)

❌ NEVER USE:

  • Real patient data with any identifiers
  • Healthcare provider information with identifying details
  • Data subject to HIPAA, GDPR Article 9, or similar regulations
  • Any data containing the 18 HIPAA identifiers (see below)

HIPAA Safe Harbor De-Identification

If you must use real healthcare data, you MUST remove all 18 HIPAA identifiers before uploading to HF Spaces:

  1. Names - Patient, relatives, employers
  2. Geographic subdivisions - Smaller than state (addresses, cities, ZIP codes)
  3. Dates - All elements of dates (except year) directly related to the individual: birth, admission, discharge, and death dates; ages over 89 must also be aggregated into a single 90+ category
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers - License plates, VINs
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers - Fingerprints, voice prints
  17. Full-face photos
  18. Other unique identifying numbers/codes
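
Before uploading anything, an automated pre-scan can flag some of these identifiers. Below is a minimal stdlib sketch (the patterns and function name are illustrative, not TranscriptorAI's actual code) that covers only a handful of the 18 categories; it is a first-pass filter, not a substitute for full de-identification:

```python
import re

# Illustrative patterns for a few Safe Harbor identifiers; real
# de-identification needs far broader coverage plus manual review.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "ip": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
}

def scan_for_phi(text: str) -> dict:
    """Return identifier type -> list of matches found in text."""
    findings = {}
    for name, pattern in PHI_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            findings[name] = hits
    return findings
```

A transcript that produces any findings should be treated as containing PHI until proven otherwise.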

Using the Built-in Redaction Feature

TranscriptorAI now includes a PII redaction module:

  1. Enable PII Redaction checkbox in the UI
  2. Choose Redaction Level:
    • Minimal: Only redacts obvious identifiers (SSN, MRN, account numbers)
    • Moderate: Redacts common PII (emails, phones, dates, SSN, MRN) - RECOMMENDED
    • Strict: Redacts all PII including names and addresses

⚠️ Important: The redaction module is a tool to ASSIST with de-identification, but:

  • It is not 100% guaranteed to catch all PII
  • You are still responsible for verifying data is properly de-identified
  • Manual review is recommended for regulated data
  • Consider using professional de-identification services for high-risk data
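
The tiered behavior described above (minimal/moderate/strict) could be approximated as follows. This is a hedged sketch only; the actual redaction.py module may implement levels and patterns differently:

```python
import re

# Hypothetical level-based redaction; illustrative, not the real module.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.I),
    "account": re.compile(r"\bACCT[:\s]*\d+\b", re.I),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "name": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+\b"),
}

# Each level redacts a superset of the level below it.
RULES = {
    "minimal":  ["ssn", "mrn", "account"],
    "moderate": ["ssn", "mrn", "account", "email", "phone", "date"],
    "strict":   ["ssn", "mrn", "account", "email", "phone", "date", "name"],
}

def redact(text: str, level: str = "moderate") -> str:
    """Replace each matched identifier with a [TYPE] placeholder."""
    for kind in RULES[level]:
        text = PATTERNS[kind].sub(f"[{kind.upper()}]", text)
    return text
```

Note how pattern-based redaction inherently misses free-text identifiers (nicknames, rare conditions, small geographic areas), which is exactly why manual review remains necessary.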

HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

Option 1: AWS (Recommended for Healthcare)

  • AWS HealthLake - Purpose-built for HIPAA/FHIR data
  • EC2 + S3 with BAA - Self-managed on AWS infrastructure
  • Requires: Signed AWS BAA, encryption at rest/in-transit, audit logging
  • Cost: ~$50-500/month depending on usage

Option 2: Microsoft Azure

  • Azure Health Data Services - HIPAA-compliant platform
  • Azure VM + Blob Storage - Self-hosted with BAA
  • Requires: Signed Azure BAA, compliance certifications enabled
  • Cost: Similar to AWS

Option 3: Google Cloud Platform

  • Healthcare API - HIPAA-compliant
  • Compute Engine + Cloud Storage with BAA
  • Requires: Signed GCP BAA
  • Cost: Similar to AWS/Azure

Option 4: On-Premises

  • Deploy on your own HIPAA-certified servers
  • Full control over data and access
  • Requires: Your own HIPAA compliance program, security controls, auditing
  • Cost: Infrastructure + IT staff

Deployment Checklist for HIPAA Compliance

  • Signed Business Associate Agreement with cloud provider
  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.2+)
  • Multi-factor authentication (MFA) enabled
  • Role-based access control (RBAC)
  • Audit logging enabled and retained (6 years)
  • Regular security assessments
  • Incident response plan documented
  • Breach notification procedures in place
  • Regular backups with encryption
  • Staff HIPAA training completed
  • Data retention and destruction policies

Security Features in TranscriptorAI

Built-in Security Controls

  1. PII Redaction Module (redaction.py)

    • Detects and masks 10+ types of PII
    • Configurable redaction levels
    • Redaction reporting for audit trails
  2. Secure Logging (logger.py)

    • Automatic PII sanitization in logs
    • Token masking (shows only first/last 4 chars)
    • Configurable log levels
    • Prevents sensitive data leakage
  3. Type Safety

    • Standardized LLM response handling
    • Prevents data corruption/leakage through type errors
    • Defensive type checking
  4. Environment Variable Protection

    • API keys stored in environment variables (not code)
    • Never logged in full
    • Masked in debug output
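
The token-masking behavior described above (only the first/last 4 characters visible) might look like this sketch; the function name is illustrative, not necessarily what logger.py uses:

```python
def mask_secret(token: str, visible: int = 4) -> str:
    """Show only the first/last `visible` chars of a secret for safe logging."""
    if len(token) <= visible * 2:
        return "*" * len(token)  # too short to reveal anything safely
    return token[:visible] + "..." + token[-visible:]

print(mask_secret("hf_abcdefghijklmnop1234"))  # hf_a...1234
```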

Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
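
Boolean settings like SANITIZE_LOGS and DEBUG_MODE are typically parsed at startup. A sketch of fail-safe parsing (the helper name is illustrative; the key point is that the secure value is the default when a variable is unset or malformed):

```python
import os

def env_flag(name: str, default: bool, environ=os.environ) -> bool:
    """Parse True/False-style environment variables tolerantly."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

# Fail safe: sanitization defaults ON, debug defaults OFF.
SANITIZE_LOGS = env_flag("SANITIZE_LOGS", default=True)
DEBUG_MODE = env_flag("DEBUG_MODE", default=False)
```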

Data Flow and Storage

Where Data Goes

  1. Upload: Files uploaded through Gradio UI → Server memory (temporary)
  2. Processing: Text extraction → LLM analysis → Report generation
  3. Output: CSV/PDF reports generated → Downloads
  4. Cleanup: Temporary files deleted after session
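
Step 4 (cleanup) is most reliable when deletion is guaranteed even if processing raises. A stdlib sketch of that pattern (not the app's actual code):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def scratch_file(suffix=".txt"):
    """Temporary file that is always deleted, even if processing fails."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        os.remove(path)

with scratch_file() as path:
    with open(path, "w") as f:
        f.write("transient transcript text")
# The file is already gone here, regardless of errors above.
```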

Data Retention

| Location | What's Stored | Retention |
| --- | --- | --- |
| HF Spaces (if used) | Logs, temporary files | 30 days (platform logs) |
| Local Deployment | Only what you configure | You control |
| LLM API (HF/OpenAI) | Prompts/responses | Varies by provider |
| Local LM Studio | Nothing (all local) | You control |

Minimizing Data Exposure

Best Practices:

  1. Use local LLM (LM Studio) - Keeps all data on your servers
  2. Enable PII redaction - Remove identifiers before processing
  3. Don't use HF Inference API - Data sent to HuggingFace servers
  4. Clear session data - Restart app between sessions with sensitive data
  5. Use incognito/private browsing - Prevents browser caching

LLM Backend Security Considerations

HuggingFace Inference API

❌ NOT recommended for PHI:

  • Data sent to HuggingFace servers for processing
  • Logs kept for 30 days
  • No BAA available for API usage (as of 2025-01)
  • May be used for model improvement (check ToS)

LM Studio (Local)

✅ Recommended for PHI:

  • All processing happens on your server
  • No data sent externally
  • Full control over model and data
  • Can run on HIPAA-compliant infrastructure
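
LM Studio serves an OpenAI-compatible chat completions endpoint on localhost (the LMSTUDIO_URL configured earlier). A sketch of calling it with only the standard library — function and model names are illustrative, and the request stays entirely on your machine:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(transcript: str, model: str = "local-model") -> dict:
    """Build an OpenAI-style chat payload for the local LM Studio server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the interview transcript."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.2,
    }

def call_lmstudio(transcript: str) -> str:
    """POST to the local server; requires LM Studio running with a model loaded."""
    payload = json.dumps(build_chat_request(transcript)).encode()
    req = urllib.request.Request(
        LMSTUDIO_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```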

OpenAI/Anthropic APIs

⚠️ Use with caution:

  • OpenAI offers BAAs for Enterprise customers
  • Anthropic offers BAAs for Enterprise
  • Zero data retention policies available
  • Requires Enterprise plan + signed BAA

Compliance Certifications Required

For healthcare use, your deployment should have:

  • SOC 2 Type II - Security and availability controls
  • HITRUST CSF - Healthcare industry framework
  • ISO 27001 - Information security management
  • HIPAA Compliance - Via BAA with cloud provider

For European data (GDPR):

  • GDPR Article 9 - Special category data (health)
  • Data Processing Agreement (DPA) with providers
  • Privacy Impact Assessment (PIA) completed

Incident Response

If you suspect a data breach:

  1. Immediately stop processing - Shut down the application
  2. Preserve logs - Don't delete anything
  3. Notify your security team - Escalate within 1 hour
  4. Notify cloud provider (if applicable)
  5. Document the incident - Who, what, when, where, how
  6. Notify affected individuals - Within 60 days per HIPAA
  7. File breach report - Notify HHS within 60 days of discovery if 500 or more individuals are affected; smaller breaches can be logged and reported to HHS annually

Testing with Sensitive Data

Safe Testing Workflow

  1. Start with synthetic data - Generate realistic but fake transcripts
  2. Test with de-identified data - Remove all 18 HIPAA identifiers
  3. Enable PII redaction - Use "strict" mode
  4. Review outputs manually - Check for leaked PII
  5. Deploy to compliant infrastructure - Only then use real data

Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.
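
If the script is unavailable in your environment, synthetic transcripts can be fabricated with a few lines of stdlib Python. This sketch is safe by construction because every value is drawn from a fabricated vocabulary (speakers, topics, and responses below are invented for illustration):

```python
import random

# Fabricated vocabulary - every value here is synthetic by construction.
TOPICS = ["medication adherence", "side effects", "appointment scheduling"]
RESPONSES = [
    "I usually remember my doses in the morning.",
    "The new regimen has been easier to follow.",
    "I sometimes feel tired in the afternoons.",
]

def make_transcript(turns=6, seed=None):
    """Generate an alternating Interviewer/Patient transcript of fake content."""
    rng = random.Random(seed)
    lines = []
    for i in range(turns):
        if i % 2 == 0:
            lines.append(f"Interviewer: Can you tell me about {rng.choice(TOPICS)}?")
        else:
            lines.append(f"Patient: {rng.choice(RESPONSES)}")
    return "\n".join(lines)
```

Pass a seed for reproducible test fixtures.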


Security Checklist for Production Deployment

Pre-Deployment

  • De-identify all test data
  • Enable PII redaction in UI
  • Set DEBUG_MODE=False
  • Set SANITIZE_LOGS=True
  • Remove any hardcoded API keys
  • Use environment variables for secrets
  • Configure LM Studio (not HF API)
  • Test on synthetic data only

Deployment

  • Deploy on HIPAA-compliant infrastructure
  • Sign BAA with cloud provider
  • Enable encryption at rest
  • Enable encryption in transit (HTTPS/TLS 1.2+)
  • Configure MFA for all users
  • Set up RBAC (role-based access control)
  • Enable audit logging
  • Configure log retention (6+ years)
  • Set up automated backups
  • Document data flow diagram

Post-Deployment

  • Conduct security assessment
  • Penetration testing completed
  • Staff training on HIPAA completed
  • Incident response plan in place
  • Breach notification procedures documented
  • Regular vulnerability scanning (monthly)
  • Access reviews (quarterly)
  • Compliance audit (annual)

Frequently Asked Questions

Q: Can I use private HF Spaces for HIPAA data?
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.

Q: Is the PII redaction module HIPAA-compliant?
A: The redaction module is a tool to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

Q: Can I get a BAA from HuggingFace?
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.

Q: What if I only have de-identified data?
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.

Q: Can I use this for research?
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

Q: What about GDPR compliance?
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).



Support and Questions

For security questions or to report vulnerabilities:

  • Security Issues: Create a private issue in GitHub (do not disclose publicly)
  • Compliance Questions: Consult with your organization's compliance officer
  • General Support: See README.md

Remember: When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.