---
license: mit
tags:
- privacy
- phi-detection
- medical-documents
- vision-language-models
- negative-result
- deepseek-ocr
- hipaa
pipeline_tag: image-to-text
---

# Vision Token Masking for PHI Protection: A Negative Result

**Research Code & Evaluation Framework**

🚨 **Key Finding**: Vision-level token masking achieves only **42.9% PHI reduction** - insufficient for HIPAA compliance

## Overview

This repository contains the code for the systematic evaluation reported in our paper **"Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR"**.

**Author**: Richard J. Young
**Affiliation**: DeepNeuro.AI | University of Nevada, Las Vegas
**Status**: Paper Under Review

## The Negative Result

We evaluated **seven masking strategies** (V3-V9) across different architectural layers of DeepSeek-OCR using 100 synthetic medical billing statements from a corpus of 38,517 annotated documents.

### What We Found

| PHI Type | Reduction Rate |
|----------|----------------|
| **Patient Names** | ✅ 100% |
| **Dates of Birth** | ✅ 100% |
| **Physical Addresses** | ✅ 100% |
| **SSN** | ❌ 0% |
| **Medical Record Numbers** | ❌ 0% |
| **Email Addresses** | ❌ 0% |
| **Account Numbers** | ❌ 0% |
| **Overall** | ⚠️ 42.9% |

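The overall figure is consistent with an unweighted mean over the seven PHI types (three at 100%, four at 0%). The short check below assumes that aggregation; the README does not state how the overall rate is computed, so treat it as one plausible reading rather than the paper's definition.

```python
# Per-category reduction rates (percent) as reported in the table above.
reduction_by_type = {
    "Patient Names": 100.0,
    "Dates of Birth": 100.0,
    "Physical Addresses": 100.0,
    "SSN": 0.0,
    "Medical Record Numbers": 0.0,
    "Email Addresses": 0.0,
    "Account Numbers": 0.0,
}

# Assumption: "Overall" is the unweighted mean across the seven PHI types.
overall = sum(reduction_by_type.values()) / len(reduction_by_type)
print(f"{overall:.1f}%")  # prints 42.9%
```
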
### Why It Matters

- **All strategies converged** to 42.9% regardless of architectural layer (V3-V9)
- **Spatial expansion didn't help** - mask radius r=1,2,3 showed no improvement
- **Root cause identified** - language model contextual inference reconstructs masked short identifiers from document context
- **Hybrid approach needed** - simulation shows 88.6% reduction when combining vision masking + NLP post-processing

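To make the spatial expansion concrete, here is a minimal sketch of growing a token mask by radius `r` on a 2D grid of vision tokens. The function name `expand_mask` and the square (Chebyshev-distance) neighborhood are assumptions for illustration; the repository may define the radius differently.

```python
def expand_mask(mask, r):
    """Return a copy of a 2D boolean grid in which every True cell is grown
    by r cells in each direction (square neighborhood), clipped at the edges."""
    rows, cols = len(mask), len(mask[0])
    out = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if mask[i][j]:
                # Mark the (2r+1) x (2r+1) window around the masked token.
                for y in range(max(0, i - r), min(rows, i + r + 1)):
                    for x in range(max(0, j - r), min(cols, j + r + 1)):
                        out[y][x] = True
    return out


# Hypothetical 5x5 token grid with a single PHI token in the center.
grid = [[False] * 5 for _ in range(5)]
grid[2][2] = True

expanded = expand_mask(grid, 1)
masked_count = sum(cell for row in expanded for cell in row)
print(masked_count)  # prints 9
```

A larger radius hides more of the surrounding pixels, but it does nothing about identifiers the decoder infers from unmasked context, which matches the finding that r=1,2,3 gave no improvement.
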
## What's Included

✅ **Synthetic Data Pipeline**: Generates 38,517+ annotated medical PDFs using Synthea
✅ **PHI Annotation Tools**: Ground-truth labeling for all 18 HIPAA categories
✅ **Seven Masking Strategies**: V3-V9 implementations targeting different DeepSeek-OCR layers
✅ **Evaluation Framework**: Code for measuring PHI reduction by category
✅ **Configuration Files**: DeepSeek-OCR integration settings

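As a rough illustration of what measuring PHI reduction by category can look like, the sketch below compares OCR output with and without masking against ground-truth annotations. The function name, the `(category, value)` annotation format, and the exact-substring leak test are all assumptions made for this example, not the repository's actual interface or metric.

```python
from collections import defaultdict


def reduction_by_category(annotations, baseline_text, masked_text):
    """Percent of PHI values per category that appear in the unmasked OCR
    output but no longer appear once masking is applied.

    annotations: list of (category, value) pairs (hypothetical format).
    """
    exposed = defaultdict(int)  # values present in the unmasked output
    leaked = defaultdict(int)   # of those, values still present after masking
    for category, value in annotations:
        if value in baseline_text:
            exposed[category] += 1
            if value in masked_text:
                leaked[category] += 1
    return {c: 100.0 * (1 - leaked[c] / exposed[c]) for c in exposed}


annotations = [("name", "Jane Doe"), ("ssn", "123-45-6789")]
baseline = "Patient: Jane Doe SSN: 123-45-6789"
masked = "Patient: [REDACTED] SSN: 123-45-6789"

rates = reduction_by_category(annotations, baseline, masked)
print(rates)  # prints {'name': 100.0, 'ssn': 0.0}
```

Exact substring matching is the simplest possible leak test; a real evaluation would likely need to tolerate OCR noise and formatting differences.
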
## Quick Start

### Installation

```bash
git clone https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR.git
cd Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt
```

### Generate Synthetic Data

```bash
# Setup Synthea
bash scripts/setup_synthea.sh

# Generate patient data
bash scripts/generate_synthea_data.sh

# Create annotated PDFs
python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
```

### Explore the Code

```bash
# View masking strategies
cat src/training/lora_phi_detector.py

# Check PHI annotation pipeline
cat src/preprocessing/phi_annotator.py
```

## Research Contributions

1. **First systematic evaluation** of vision-level token masking for PHI protection in VLMs
2. **Establishes boundaries** - identifies which PHI types vision masking can protect and which require language-level redaction
3. **Negative result** - shows that the vision-only approaches evaluated are insufficient for HIPAA compliance
4. **Redirects future work** - toward hybrid architectures and decoder-level fine-tuning

## Architecture

```
Input PDF → Vision Encoder → PHI Detection → Vision Token Masking → DeepSeek Decoder → Text Output
            (SAM + CLIP)     (Ground Truth)  (V3-V9 Strategies)     (3B-MoE)
                                                       ↓
                                                42.9% Reduction
                                                (Insufficient!)
```

### Seven Masking Strategies

- **V3-V5**: SAM encoder blocks at different depths
- **V6**: Compression layer (4096→1024 tokens)
- **V7**: Dual vision encoders (SAM + CLIP)
- **V8**: Post-compression stage
- **V9**: Projector fusion layer

**Result**: All strategies converged to an identical 42.9% reduction

## Implications

⚠️ **Vision-only masking is insufficient for HIPAA compliance** (requires 99%+ PHI reduction)

✅ **Hybrid architectures are necessary** - combine vision masking with NLP post-processing

🔮 **Future directions** - decoder-level fine-tuning, defense-in-depth approaches

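The NLP post-processing half of a hybrid pipeline can be as simple as pattern-based redaction of the OCR text, aimed at the structured identifiers that vision masking missed. The sketch below is illustrative only: the pattern set, the `MRN` label format, and the `redact` helper are assumptions, not the pipeline simulated in the paper.

```python
import re

# Illustrative patterns for structured identifiers; the MRN format in
# particular is hypothetical and would need to match the real documents.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "MRN": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
}


def redact(text):
    """Replace every match of each pattern with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "SSN 123-45-6789, MRN: 00482913, contact jane.doe@example.org"
redacted = redact(sample)
print(redacted)  # prints: SSN [SSN], [MRN], contact [EMAIL]
```

Regular expressions suit fixed-format identifiers such as SSNs and email addresses; free-text PHI such as names would typically need a named-entity recognizer instead.
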
## Use Cases

This code is useful for:

- Researchers exploring privacy-preserving VLMs
- Healthcare AI teams evaluating PHI protection strategies
- Benchmarking alternative redaction approaches
- Understanding VLM architectural limitations for sensitive data

## Citation

```bibtex
@article{young2025visionmasking,
  title={Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation},
  author={Young, Richard J.},
  institution={DeepNeuro.AI; University of Nevada, Las Vegas},
  journal={Under Review},
  year={2025},
  url={https://huggingface.co/richardyoung/vision-token-masking-phi}
}
```

## License

MIT License - See [LICENSE](LICENSE) for details

## Contact

**Richard J. Young**
🌐 [deepneuro.ai/richard](https://deepneuro.ai/richard)
🤗 [@richardyoung](https://huggingface.co/richardyoung)
💻 [GitHub](https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR)

## Disclaimer

⚠️ Research project only - **NOT for production use with real PHI**. Always consult legal and compliance teams before deploying PHI-related systems.

---

**Note**: This negative result establishes important boundaries for vision-level privacy interventions in VLMs and redirects the field toward more effective hybrid approaches.