# Security and Compliance Guide for TranscriptorAI

**Last Updated:** 2025-10-29

This document provides critical security information for using TranscriptorAI with sensitive healthcare data.

---

## ⚠️ CRITICAL SECURITY NOTICE

### HuggingFace Spaces and HIPAA Compliance

**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**

#### Why HuggingFace Spaces Cannot Support HIPAA Data:

1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No HIPAA Certification** - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - 2024 security incident exposed potential vulnerabilities in Spaces secrets

### What Data Can Be Used on HF Spaces?

✅ **SAFE TO USE:**
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)

❌ **NEVER USE:**
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)

---

## HIPAA Safe Harbor De-Identification

If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:

1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (year is OK)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photos**
18. **Other unique identifying numbers/codes**

### Using the Built-in Redaction Feature

TranscriptorAI includes a PII redaction module. To use it:

1. Tick the **Enable PII Redaction** checkbox in the UI.
2. **Choose a redaction level:**
   - **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
   - **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
   - **Strict**: Redacts all PII including names and addresses

⚠️ **Important:** The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
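
As a rough illustration of how level-based redaction can work, the sketch below applies a set of regular expressions per level. It is not the code in `redaction.py`: the `PATTERNS`, `LEVELS`, and `redact` names and every regex here are assumptions made for this example. The strict level is left out because names and addresses generally need more than pattern matching.

```python
import re

# Hypothetical patterns for illustration only; the real module may differ.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

# Which pattern labels each (assumed) level applies.
LEVELS = {
    "minimal": ["SSN", "MRN"],
    "moderate": ["SSN", "MRN", "EMAIL", "PHONE", "DATE"],
}


def redact(text, level="moderate"):
    """Replace matches with [LABEL] and return (redacted_text, counts)."""
    counts = {}
    for label in LEVELS[level]:
        text, n = PATTERNS[label].subn(f"[{label}]", text)
        if n:
            counts[label] = n  # per-label tally, usable as a redaction report
    return text, counts
```

Patterns like these only catch well-formatted identifiers; free-text mentions slip through, which is why manual review remains necessary.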

---

## HIPAA-Compliant Deployment Options

For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:

### Option 1: AWS (Recommended for Healthcare)
- **AWS HealthLake** - Purpose-built for HIPAA/FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage

### Option 2: Microsoft Azure
- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS

### Option 3: Google Cloud Platform
- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure

### Option 4: On-Premises
- Deploy on your own HIPAA-compliant servers (HIPAA itself has no official certification)
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff

### Deployment Checklist for HIPAA Compliance

- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies
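
Some checklist items can be enforced in code. As one example, the sketch below pins a TLS 1.2 floor for outbound connections using Python's standard `ssl` module. The helper name `make_client_context` is hypothetical and not part of TranscriptorAI.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Client-side TLS context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A context like this can be passed to `urllib.request.urlopen(url, context=ctx)` or to any HTTP client that accepts an `ssl.SSLContext`.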

---

## Security Features in TranscriptorAI

### Built-in Security Controls

1. **PII Redaction Module** (`redaction.py`)
   - Detects and masks 10+ types of PII
   - Configurable redaction levels
   - Redaction reporting for audit trails

2. **Secure Logging** (`logger.py`)
   - Automatic PII sanitization in logs
   - Token masking (shows only first/last 4 chars)
   - Configurable log levels
   - Prevents sensitive data leakage

3. **Type Safety**
   - Standardized LLM response handling
   - Prevents data corruption/leakage through type errors
   - Defensive type checking

4. **Environment Variable Protection**
   - API keys stored in environment variables (not code)
   - Never logged in full
   - Masked in debug output
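
The token masking described above can be sketched as follows. This is an assumption-laden illustration, not the contents of `logger.py`: `mask_token`, `TOKEN_RE`, and `MaskingFilter` are hypothetical names, and the token pattern assumes secrets that start with `hf_`.

```python
import logging
import re


def mask_token(token, visible=4):
    """Keep only the first and last `visible` characters of a secret."""
    if len(token) <= visible * 2:
        return "*" * len(token)  # too short to reveal anything safely
    hidden = "*" * (len(token) - 2 * visible)
    return f"{token[:visible]}{hidden}{token[-visible:]}"


# Assumed token shape; adjust to the secrets your deployment uses.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{8,}\b")


class MaskingFilter(logging.Filter):
    """Logging filter that masks token-like strings before they are emitted."""

    def filter(self, record):
        message = record.getMessage()  # merge msg and args first
        record.msg = TOKEN_RE.sub(lambda m: mask_token(m.group()), message)
        record.args = None  # args are already folded into msg
        return True
```

Attaching the filter to a handler with `handler.addFilter(MaskingFilter())` applies it to every record that handler writes.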

### Configuring Security Settings

```bash
# .env file (NEVER commit this to version control!)

# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True

# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False

# Enable file logging for audit trails
LOG_TO_FILE=True

# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions

# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
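
A startup check can catch unsafe combinations of these settings before any data is processed. The sketch below is one possible approach, assuming the variables are read from the process environment. The helpers `env_flag` and `check_phi_settings` are hypothetical and not part of TranscriptorAI.

```python
import os


def env_flag(name, default=False, env=os.environ):
    """Read a boolean flag such as SANITIZE_LOGS=True."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def check_phi_settings(env=os.environ):
    """Return a list of problems with settings that are unsafe for PHI."""
    problems = []
    if not env_flag("SANITIZE_LOGS", env=env):
        problems.append("SANITIZE_LOGS should be True")
    if env_flag("DEBUG_MODE", env=env):
        problems.append("DEBUG_MODE should be False")
    if env_flag("USE_HF_API", env=env):
        problems.append("USE_HF_API sends data to an external service")
    if not env_flag("USE_LMSTUDIO", env=env):
        problems.append("USE_LMSTUDIO should be True so inference stays local")
    return problems
```

Calling `check_phi_settings()` at startup and refusing to run when the list is non-empty is one way to make the recommended configuration hard to bypass by accident.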

---

## Data Flow and Storage

### Where Data Goes

1. **Upload**: Files uploaded through Gradio UI → Server memory (temporary)
2. **Processing**: Text extraction → LLM analysis → Report generation
3. **Output**: CSV/PDF reports generated → Downloads
4. **Cleanup**: Temporary files deleted after session
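
The upload-process-cleanup pattern in this flow can be sketched with Python's `tempfile` module. This is an illustration only, not how the application itself is implemented: `process_upload` and its `handler` callback are hypothetical names.

```python
import tempfile
from pathlib import Path


def process_upload(data, filename, handler):
    """Write an upload to a private temp dir, run handler on it, then clean up."""
    with tempfile.TemporaryDirectory(prefix="transcriptor_") as workdir:
        # Keep only the base name so a crafted filename cannot escape workdir.
        path = Path(workdir) / Path(filename).name
        path.write_bytes(data)
        result = handler(path)  # e.g. extraction -> analysis -> report
    # The directory and its contents are removed here, even if handler raised.
    return result
```

Because the context manager owns the directory, the file cannot outlive the call, which keeps cleanup from depending on the session ending cleanly.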

### Data Retention

| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |

### Minimizing Data Exposure

**Best Practices:**

1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data sent to HuggingFace servers
4. **Clear session data** - Restart app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching

---

## LLM Backend Security Considerations

### HuggingFace Inference API

❌ **NOT recommended for PHI:**
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)

### LM Studio (Local)

✅ **Recommended for PHI:**
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
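
To guard against a misconfigured endpoint quietly sending data off the machine, the LLM URL can be checked before use. A minimal sketch, assuming that "local" means a loopback address. `is_local_endpoint` is a hypothetical helper; a server elsewhere on your own network would need a different rule.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as external
```

For example, `is_local_endpoint("http://localhost:1234/v1/chat/completions")` returns `True`, while a public hostname returns `False`.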

### OpenAI/Anthropic APIs

⚠️ **Use with caution:**
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA

---

## Compliance Certifications Required

For healthcare use, your deployment should have:

- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider

For European data (GDPR):
- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Data Protection Impact Assessment (DPIA)** completed

---

## Incident Response

If you suspect a data breach:

1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
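6. **Notify affected individuals** - Without unreasonable delay and no later than 60 days after discovery, per HIPAA
7. **File breach report** - Report to HHS within 60 days of discovery if 500 or more individuals are affected; smaller breaches are reported to HHS annually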

---

## Testing with Sensitive Data

### Safe Testing Workflow

1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data

### Creating Synthetic Test Data

Use the included script:

```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```

This generates realistic but completely fabricated patient/HCP interviews.
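
For a sense of what fabricated data can look like, here is a small, self-contained sketch. It does not reproduce `create_sample_transcripts.py`: the name lists, the template, and the functions `make_transcript` and `make_transcripts` are invented for this example.

```python
import random

# Invented placeholder values; nothing here is derived from real people.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
CONDITIONS = ["type 2 diabetes", "asthma", "hypertension"]


def make_transcript(rng):
    """Build one short, fully fabricated interview exchange."""
    name = rng.choice(FIRST_NAMES)
    condition = rng.choice(CONDITIONS)
    years = rng.randint(1, 20)
    return (
        f"Interviewer: How long have you managed {condition}?\n"
        f"{name} (synthetic patient): About {years} years."
    )


def make_transcripts(count, seed=0):
    """Generate `count` transcripts; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [make_transcript(rng) for _ in range(count)]
```

Seeding the generator keeps test fixtures stable across runs, which makes regressions in downstream processing easier to spot.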

---

## Security Checklist for Production Deployment

### Pre-Deployment

- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only

### Deployment

- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram

### Post-Deployment

- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)

---

## Frequently Asked Questions

**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.

**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.

**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.

**Q: What if I only have de-identified data?**
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.

**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.

**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).

---

## Additional Resources

- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/

---

## Support and Questions

For security questions or to report vulnerabilities:

- **Security Issues:** Create a private issue in GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md

---

**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.

**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**