Spaces:
Sleeping
Sleeping
File size: 13,211 Bytes
52d0298 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 |
# Security and Compliance Guide for TranscriptorAI
**Last Updated:** 2025-10-29
This document provides critical security information for using TranscriptorAI with sensitive healthcare data.
---
## β οΈ CRITICAL SECURITY NOTICE
### HuggingFace Spaces and HIPAA Compliance
**TranscriptorAI deployed on HuggingFace Spaces is NOT HIPAA-compliant and should NOT be used with real Protected Health Information (PHI).**
#### Why HuggingFace Spaces Cannot Support HIPAA Data:
1. **No Business Associate Agreement (BAA)** - HuggingFace does not offer BAAs for Spaces, which is legally required under HIPAA
2. **Shared Infrastructure** - Spaces run on multi-tenant infrastructure not certified for PHI
3. **No HIPAA Certification** - HF Spaces lacks required certifications (HITRUST, SOC 2 Type II for healthcare)
4. **Platform Access** - HF staff may have technical access to private Spaces for maintenance/debugging
5. **Log Retention** - Logs are kept for 30 days and may inadvertently contain PHI fragments
6. **No Audit Controls** - Insufficient access logging and audit trails for HIPAA compliance
7. **Security History** - 2024 security incident exposed potential vulnerabilities in Spaces secrets
### What Data Can Be Used on HF Spaces?
β
**SAFE TO USE:**
- Fully de-identified data (all 18 HIPAA identifiers removed)
- Synthetic/test data (completely fabricated)
- Anonymized market research data
- General business-confidential data (non-healthcare)
β **NEVER USE:**
- Real patient data with any identifiers
- Healthcare provider information with identifying details
- Data subject to HIPAA, GDPR Article 9, or similar regulations
- Any data containing the 18 HIPAA identifiers (see below)
---
## HIPAA Safe Harbor De-Identification
If you must use real healthcare data, **you MUST remove all 18 HIPAA identifiers** before uploading to HF Spaces:
1. **Names** - Patient, relatives, employers
2. **Geographic subdivisions** - Smaller than state (addresses, cities, ZIP codes)
3. **Dates** - Birth dates, admission dates, discharge dates, death dates (year is OK)
4. **Telephone numbers**
5. **Fax numbers**
6. **Email addresses**
7. **Social Security numbers**
8. **Medical record numbers**
9. **Health plan beneficiary numbers**
10. **Account numbers**
11. **Certificate/license numbers**
12. **Vehicle identifiers** - License plates, VINs
13. **Device identifiers and serial numbers**
14. **Web URLs**
15. **IP addresses**
16. **Biometric identifiers** - Fingerprints, voice prints
17. **Full-face photos**
18. **Other unique identifying numbers/codes**
### Using the Built-in Redaction Feature
TranscriptorAI now includes a PII redaction module:
1. **Enable PII Redaction** checkbox in the UI
2. **Choose Redaction Level:**
- **Minimal**: Only redacts obvious identifiers (SSN, MRN, account numbers)
- **Moderate**: Redacts common PII (emails, phones, dates, SSN, MRN) - **RECOMMENDED**
- **Strict**: Redacts all PII including names and addresses
β οΈ **Important:** The redaction module is a tool to ASSIST with de-identification, but:
- It is not 100% guaranteed to catch all PII
- You are still responsible for verifying data is properly de-identified
- Manual review is recommended for regulated data
- Consider using professional de-identification services for high-risk data
---
## HIPAA-Compliant Deployment Options
For production use with real PHI, deploy TranscriptorAI on HIPAA-compliant infrastructure:
### Option 1: AWS (Recommended for Healthcare)
- **AWS HealthLake** - Purpose-built for HIPAA/FHIR data
- **EC2 + S3 with BAA** - Self-managed on AWS infrastructure
- **Requires:** Signed AWS BAA, encryption at rest/in-transit, audit logging
- **Cost:** ~$50-500/month depending on usage
### Option 2: Microsoft Azure
- **Azure Health Data Services** - HIPAA-compliant platform
- **Azure VM + Blob Storage** - Self-hosted with BAA
- **Requires:** Signed Azure BAA, compliance certifications enabled
- **Cost:** Similar to AWS
### Option 3: Google Cloud Platform
- **Healthcare API** - HIPAA-compliant
- **Compute Engine + Cloud Storage with BAA**
- **Requires:** Signed GCP BAA
- **Cost:** Similar to AWS/Azure
### Option 4: On-Premises
- Deploy on your own HIPAA-certified servers
- Full control over data and access
- **Requires:** Your own HIPAA compliance program, security controls, auditing
- **Cost:** Infrastructure + IT staff
### Deployment Checklist for HIPAA Compliance
- [ ] Signed Business Associate Agreement with cloud provider
- [ ] Encryption at rest (AES-256)
- [ ] Encryption in transit (TLS 1.2+)
- [ ] Multi-factor authentication (MFA) enabled
- [ ] Role-based access control (RBAC)
- [ ] Audit logging enabled and retained (6 years)
- [ ] Regular security assessments
- [ ] Incident response plan documented
- [ ] Breach notification procedures in place
- [ ] Regular backups with encryption
- [ ] Staff HIPAA training completed
- [ ] Data retention and destruction policies
---
## Security Features in TranscriptorAI
### Built-in Security Controls
1. **PII Redaction Module** (`redaction.py`)
- Detects and masks 10+ types of PII
- Configurable redaction levels
- Redaction reporting for audit trails
2. **Secure Logging** (`logger.py`)
- Automatic PII sanitization in logs
- Token masking (shows only first/last 4 chars)
- Configurable log levels
- Prevents sensitive data leakage
3. **Type Safety**
- Standardized LLM response handling
- Prevents data corruption/leakage through type errors
- Defensive type checking
4. **Environment Variable Protection**
- API keys stored in environment variables (not code)
- Never logged in full
- Masked in debug output
### Configuring Security Settings
```bash
# .env file (NEVER commit this to version control!)
# Enable PII sanitization in logs (RECOMMENDED)
SANITIZE_LOGS=True
# Disable debug mode in production (no sensitive data in logs)
DEBUG_MODE=False
# Enable file logging for audit trails
LOG_TO_FILE=True
# For HIPAA: Use local models (data stays on your server)
USE_HF_API=False
USE_LMSTUDIO=True
LMSTUDIO_URL=http://localhost:1234/v1/chat/completions
# Or use HF API only after signing BAA (Enterprise plan)
# USE_HF_API=True
# HUGGINGFACE_TOKEN=<your_token>
```
---
## Data Flow and Storage
### Where Data Goes
1. **Upload**: Files uploaded through Gradio UI β Server memory (temporary)
2. **Processing**: Text extraction β LLM analysis β Report generation
3. **Output**: CSV/PDF reports generated β Downloads
4. **Cleanup**: Temporary files deleted after session
### Data Retention
| Location | What's Stored | Retention |
|----------|---------------|-----------|
| **HF Spaces (if used)** | Logs, temporary files | 30 days (platform logs) |
| **Local Deployment** | Only what you configure | You control |
| **LLM API (HF/OpenAI)** | Prompts/responses | Varies by provider |
| **Local LM Studio** | Nothing (all local) | You control |
### Minimizing Data Exposure
**Best Practices:**
1. **Use local LLM (LM Studio)** - Keeps all data on your servers
2. **Enable PII redaction** - Remove identifiers before processing
3. **Don't use HF Inference API** - Data sent to HuggingFace servers
4. **Clear session data** - Restart app between sessions with sensitive data
5. **Use incognito/private browsing** - Prevents browser caching
---
## LLM Backend Security Considerations
### HuggingFace Inference API
β **NOT recommended for PHI:**
- Data sent to HuggingFace servers for processing
- Logs kept for 30 days
- No BAA available for API usage (as of 2025-01)
- May be used for model improvement (check ToS)
### LM Studio (Local)
β
**Recommended for PHI:**
- All processing happens on your server
- No data sent externally
- Full control over model and data
- Can run on HIPAA-compliant infrastructure
### OpenAI/Anthropic APIs
β οΈ **Use with caution:**
- OpenAI offers BAAs for Enterprise customers
- Anthropic offers BAAs for Enterprise
- Zero data retention policies available
- Requires Enterprise plan + signed BAA
---
## Compliance Certifications Required
For healthcare use, your deployment should have:
- **SOC 2 Type II** - Security and availability controls
- **HITRUST CSF** - Healthcare industry framework
- **ISO 27001** - Information security management
- **HIPAA Compliance** - Via BAA with cloud provider
For European data (GDPR):
- **GDPR Article 9** - Special category data (health)
- **Data Processing Agreement (DPA)** with providers
- **Privacy Impact Assessment (PIA)** completed
---
## Incident Response
If you suspect a data breach:
1. **Immediately stop processing** - Shut down the application
2. **Preserve logs** - Don't delete anything
3. **Notify your security team** - Escalate within 1 hour
4. **Notify cloud provider** (if applicable)
5. **Document the incident** - Who, what, when, where, how
6. **Notify affected individuals** - Within 60 days per HIPAA
7. **File breach report** - HHS if >500 individuals affected
---
## Testing with Sensitive Data
### Safe Testing Workflow
1. **Start with synthetic data** - Generate realistic but fake transcripts
2. **Test with de-identified data** - Remove all 18 HIPAA identifiers
3. **Enable PII redaction** - Use "strict" mode
4. **Review outputs manually** - Check for leaked PII
5. **Deploy to compliant infrastructure** - Only then use real data
### Creating Synthetic Test Data
Use the included script:
```bash
python create_sample_transcripts.py --count 10 --type patient --synthetic
```
This generates realistic but completely fabricated patient/HCP interviews.
---
## Security Checklist for Production Deployment
### Pre-Deployment
- [ ] De-identify all test data
- [ ] Enable PII redaction in UI
- [ ] Set `DEBUG_MODE=False`
- [ ] Set `SANITIZE_LOGS=True`
- [ ] Remove any hardcoded API keys
- [ ] Use environment variables for secrets
- [ ] Configure LM Studio (not HF API)
- [ ] Test on synthetic data only
### Deployment
- [ ] Deploy on HIPAA-compliant infrastructure
- [ ] Sign BAA with cloud provider
- [ ] Enable encryption at rest
- [ ] Enable encryption in transit (HTTPS/TLS 1.2+)
- [ ] Configure MFA for all users
- [ ] Set up RBAC (role-based access control)
- [ ] Enable audit logging
- [ ] Configure log retention (6+ years)
- [ ] Set up automated backups
- [ ] Document data flow diagram
### Post-Deployment
- [ ] Conduct security assessment
- [ ] Penetration testing completed
- [ ] Staff training on HIPAA completed
- [ ] Incident response plan in place
- [ ] Breach notification procedures documented
- [ ] Regular vulnerability scanning (monthly)
- [ ] Access reviews (quarterly)
- [ ] Compliance audit (annual)
---
## Frequently Asked Questions
**Q: Can I use private HF Spaces for HIPAA data?**
A: No. Even private Spaces are not HIPAA-compliant. You need a signed BAA and certified infrastructure.
**Q: Is the PII redaction module HIPAA-compliant?**
A: The redaction module is a *tool* to assist with de-identification, but it alone doesn't make your deployment HIPAA-compliant. You still need proper infrastructure, BAAs, and compliance programs.
**Q: Can I get a BAA from HuggingFace?**
A: As of January 2025, HuggingFace does not offer BAAs for Spaces. Enterprise customers should contact HF directly for API-level BAAs.
**Q: What if I only have de-identified data?**
A: De-identified data (all 18 HIPAA identifiers removed) is not PHI and doesn't require HIPAA compliance. However, ensure de-identification is done correctly.
**Q: Can I use this for research?**
A: Yes, if data is properly de-identified or you have IRB approval and appropriate consent. Check with your institution's compliance office.
**Q: What about GDPR compliance?**
A: GDPR Article 9 covers health data. Use similar protections: de-identification, data processing agreements, and compliant infrastructure (preferably EU-based servers).
---
## Additional Resources
- **HIPAA Guidance:** https://www.hhs.gov/hipaa
- **HIPAA Safe Harbor Method:** https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification
- **HuggingFace Security:** https://huggingface.co/docs/hub/security
- **AWS HIPAA Compliance:** https://aws.amazon.com/compliance/hipaa-compliance/
- **HITRUST Alliance:** https://hitrustalliance.net/
---
## Support and Questions
For security questions or to report vulnerabilities:
- **Security Issues:** Create a private issue in GitHub (do not disclose publicly)
- **Compliance Questions:** Consult with your organization's compliance officer
- **General Support:** See README.md
---
**Remember:** When in doubt, DON'T USE REAL PHI. Use synthetic or de-identified data until you have proper HIPAA-compliant infrastructure in place.
**This software is provided AS-IS with no warranties. You are responsible for ensuring compliance with applicable regulations.**
|