# Security and Compliance Guide for TranscriptorAI

**Last Updated:** 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.

---

## ⚠️ CRITICAL SECURITY NOTICE

### HuggingFace Spaces and HIPAA Compliance

**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**

#### Why HuggingFace Spaces Cannot Support HIPAA Data:

1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No Healthcare Attestations** - HF Spaces lacks the attestations healthcare customers commonly expect (HITRUST, SOC 2 Type II scoped for healthcare); HIPAA itself defines no official certification
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - 2024 security incident exposed potential vulnerabilities in Spaces secrets

### What Data Can Be Used on HF Spaces?

✅ **SAFE TO USE:**
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)

❌ **NEVER USE:**
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)

---

## HIPAA Safe Harbor De-Identification

If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:

1. **Names** - Patient, relatives, employers
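2. **Geographic subdivisions** - Anything smaller than a state (street addresses, cities, counties, ZIP codes); the first three ZIP digits may be kept only if that area holds more than 20,000 people
3. **Dates** - All elements except the year for dates tied to an individual (birth, admission, discharge, death); ages over 89 must be grouped as "90 or older"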
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photographs** and comparable images
18. **Other unique identifying numbers, characteristics, or codes**
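
The date, ZIP code, and age rules in the list above can be sketched in a few lines. This is a minimal illustration, not the logic in `redaction.py`: the function names are hypothetical, only `MM/DD/YYYY` dates are handled, and the set of low-population ZIP prefixes must be supplied by you from current census data.

```python
import re

# Assumption: dates appear as MM/DD/YYYY; other formats need their own patterns.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def generalize_dates(text: str) -> str:
    """Replace each MM/DD/YYYY date with its year only."""
    return DATE_RE.sub(lambda m: m.group(3), text)


def generalize_zip(zip_code: str, low_population_prefixes: set) -> str:
    """Keep the first three ZIP digits, or '000' for low-population areas."""
    prefix = zip_code[:3]
    return "000" if prefix in low_population_prefixes else prefix


def generalize_age(age: int) -> str:
    """Group ages over 89 into a single '90+' bucket."""
    return "90+" if age >= 90 else str(age)
```

For example, `generalize_dates("Admitted 03/14/2021")` yields `"Admitted 2021"`. Treat the output as a starting point for manual review, not as proof of de-identification.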

### Using the Built-in Redaction Feature

TranscriptorAI includes a PII redaction module. To use it:

1. Tick the **Enable PII Redaction** checkbox in the UI
2. **Choose Redaction Level:**
   - **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
   - **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
   - **Strict**: Redacts all PII including names and addresses

⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
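
To make the redaction levels concrete, here is a minimal regex-based sketch. It is not the actual `redaction.py`: the patterns, the `redact` function, and the level-to-pattern mapping are assumptions for illustration. The strict level is omitted because names and addresses generally need more than regular expressions (for example, a named-entity recognizer).

```python
import re

# Hypothetical patterns; the real module may detect more types and formats.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

# Assumed mapping of levels to pattern groups, following the list above.
LEVELS = {
    "minimal": ["SSN", "MRN"],
    "moderate": ["SSN", "MRN", "EMAIL", "PHONE", "DATE"],
}


def redact(text: str, level: str = "moderate"):
    """Return (redacted_text, counts), where counts maps PII type to hits."""
    counts = {}
    for label in LEVELS[level]:
        text, hits = PATTERNS[label].subn(f"[{label}]", text)
        if hits:
            counts[label] = hits
    return text, counts
```

The returned counts can feed a redaction report, but a count of zero only means no pattern matched, not that the text is free of PII.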

---

## HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

### Option 1: AWS (Recommended for Healthcare)
- **AWS HealthLake** - HIPAA-eligible managed service for FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage

### Option 2: Microsoft Azure
- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS

### Option 3: Google Cloud Platform
- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure

### Option 4: On-Premises
- Deploy on servers you operate under your own HIPAA compliance program (HIPAA has no official server certification)
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff

### Deployment Checklist for HIPAA Compliance

- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies
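
As one concrete example from the checklist, the "TLS 1.2+" item can be enforced in Python's standard `ssl` module. This is a sketch under assumptions: the function name is hypothetical, and in many deployments TLS is terminated by a reverse proxy or load balancer instead, in which case the equivalent setting belongs there.

```python
import ssl


def make_server_tls_context(certfile=None, keyfile=None) -> ssl.SSLContext:
    """Build a server-side TLS context that refuses anything below TLS 1.2."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if certfile:
        # Paths are supplied by the caller; no certificate ships with this sketch.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context
```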

---

## Security Features in TranscriptorAI

### Built-in Security Controls

1. **PII Redaction Module** (`redaction.py`)
   - Detects and masks 10+ types of PII
   - Configurable redaction levels
   - Redaction reporting for audit trails

2. **Secure Logging** (`logger.py`)
   - Automatic PII sanitization in logs
   - Token masking (shows only first/last 4 chars)
   - Configurable log levels
   - Prevents sensitive data leakage

3. **Type Safety**
   - Standardized LLM response handling
   - Prevents data corruption/leakage through type errors
   - Defensive type checking

4. **Environment Variable Protection**
   - API keys stored in environment variables (not code)
   - Never logged in full
   - Masked in debug output
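
The token masking described under Secure Logging (first and last 4 characters only) might look like the sketch below. It is not the code in `logger.py`: `mask_token`, `SanitizingFilter`, and the assumed token shape (`hf_` followed by at least 20 alphanumeric characters) are illustrative.

```python
import logging
import re


def mask_token(token: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a secret."""
    if len(token) <= 2 * visible:
        return "*" * len(token)  # too short to reveal anything safely
    return f"{token[:visible]}...{token[-visible:]}"


# Assumed token shape; adjust for the secrets your deployment actually uses.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")


class SanitizingFilter(logging.Filter):
    """Mask anything that looks like a token before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge msg and args first
        record.msg = TOKEN_RE.sub(lambda m: mask_token(m.group()), message)
        record.args = ()
        return True
```

Attach the filter with `logging.getLogger().addFilter(SanitizingFilter())`, or add it to each handler so records from child loggers are covered as well.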

### Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
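
Environment variables are always strings, so a value such as `DEBUG_MODE=False` is truthy if tested directly in Python. The sketch below shows one way to parse the flags above into real booleans. The helper names and the defaults are assumptions, not the application's actual configuration loader.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment, case-insensitively."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in TRUE_VALUES


def load_security_settings() -> dict:
    """Collect the security flags, defaulting to the safer choice (assumed)."""
    return {
        "sanitize_logs": env_flag("SANITIZE_LOGS", default=True),
        "debug_mode": env_flag("DEBUG_MODE", default=False),
        "log_to_file": env_flag("LOG_TO_FILE", default=False),
        "use_hf_api": env_flag("USE_HF_API", default=False),
        "use_lmstudio": env_flag("USE_LMSTUDIO", default=False),
    }
```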

---

## Data Flow and Storage

### Where Data Goes

1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session
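
The cleanup step can be made hard to forget by scoping uploads to a temporary directory that is removed when processing ends, even on error. This is a sketch of the idea, not the application's upload handler: `process_upload` and its `extract` callback are hypothetical.

```python
import tempfile
from pathlib import Path


def process_upload(data: bytes, filename: str, extract):
    """Write an upload to a scratch directory, run `extract`, then clean up."""
    with tempfile.TemporaryDirectory(prefix="transcriptor_") as workdir:
        # Keep only the final path component so "../x" cannot escape workdir.
        path = Path(workdir) / Path(filename).name
        path.write_bytes(data)
        return extract(path)
    # The directory and everything in it are deleted when the block exits.
```

Note that deleting a file does not scrub its contents from disk; encryption at rest still matters for the volume that backs the temporary directory.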

### Data Retention

| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |

### Minimizing Data Exposure

**Best Practices:**

1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data sent to HuggingFace servers
4. **Clear session data** - Restart app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching
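
A startup check can catch a misconfigured LLM endpoint before any transcript leaves the machine. The sketch below accepts only loopback addresses, which is stricter than "on your servers": a model host elsewhere on your private network would be rejected and would need its own allowlist. The function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True only if the URL points at this machine (loopback)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-local hostname, not an IP literal
```

For example, the `LMSTUDIO_URL` value `http://localhost:1234/v1/chat/completions` passes, while any public API hostname fails.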

---

## LLM Backend Security Considerations

### HuggingFace Inference API

❌ **NOT recommended for PHI:**
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)

### LM Studio (Local)

✅ **Recommended for PHI:**
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure

### OpenAI/Anthropic APIs

⚠️ **Use with caution:**
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA

---

## Compliance Certifications Required

For healthcare use, your deployment should have:

- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
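- **HIPAA Compliance** - A signed BAA with each cloud provider, plus your own safeguards; the BAA alone is not sufficient

For European data (GDPR):
- **Article 9 condition** - A valid legal basis for processing special category (health) data
- **Data Processing Agreement (DPA)** with providers
- **Data Protection Impact Assessment (DPIA)** completed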

---

## Incident Response

If you suspect a data breach:

1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
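6. **Notify affected individuals** - Without unreasonable delay, and no later than 60 days after discovery, per HIPAA
7. **File breach report** - Notify HHS within the same 60 days if 500 or more individuals are affected; report smaller breaches to HHS within 60 days after the end of the calendar year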
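
The notification deadlines are simple date arithmetic, sketched below. Under the HIPAA Breach Notification Rule, individuals must be notified no later than 60 days after discovery; HHS must be notified on the same timeline when 500 or more individuals are affected, and otherwise within 60 days after the end of the calendar year of discovery. The function names are hypothetical, and the results are outer limits: notice is also required "without unreasonable delay", and state laws may be stricter.

```python
from datetime import date, timedelta

NOTICE_WINDOW = timedelta(days=60)


def individual_notice_deadline(discovered: date) -> date:
    """Latest date to notify affected individuals."""
    return discovered + NOTICE_WINDOW


def hhs_report_deadline(discovered: date, affected: int) -> date:
    """Latest date to report the breach to HHS."""
    if affected >= 500:
        return discovered + NOTICE_WINDOW
    # Smaller breaches: 60 days after the end of the discovery year.
    return date(discovered.year, 12, 31) + NOTICE_WINDOW
```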

---

## Testing with Sensitive Data

### Safe Testing Workflow

1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data

### Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.

---

## Security Checklist for Production Deployment

### Pre-Deployment

- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only

### Deployment

- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram

### Post-Deployment

- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)

---

## Frequently Asked Questions

**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and infrastructure that the BAA covers.

**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.

**Q: What if I only have de-identified data?**
A: Data de-identified under Safe Harbor (all 18 HIPAA identifiers removed, and no actual knowledge that what remains could identify an individual) is not PHI and is not subject to HIPAA. However, ensure de-identification is done correctly.

**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).

---

## Additional Resources

- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/

---

## Support and Questions

For security questions or to report vulnerabilities:

- **Security Issues:** Report privately through GitHub (for example, via a private security advisory); do not disclose publicly
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md

---

**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**