---
license: mit
tags:
- privacy
- phi-detection
- medical-documents
- vision-language-models
- negative-result
- deepseek-ocr
- hipaa
pipeline_tag: image-to-text
---
# Vision Token Masking for PHI Protection: A Negative Result
**Research Code & Evaluation Framework**
🚨 **Key Finding**: Vision-level token masking achieves only **42.9% PHI reduction** - insufficient for HIPAA compliance
## Overview
This repository contains the systematic evaluation code from our paper **"Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR"**.
**Author**: Richard J. Young
**Affiliation**: DeepNeuro.AI | University of Nevada, Las Vegas
**Status**: Paper Under Review
## The Negative Result
We evaluated **seven masking strategies** (V3-V9) across different architectural layers of DeepSeek-OCR, using 100 synthetic medical billing statements drawn from a corpus of 38,517 annotated documents.
### What We Found
| PHI Type | Reduction Rate |
|----------|----------------|
| **Patient Names** | ✅ 100% |
| **Dates of Birth** | ✅ 100% |
| **Physical Addresses** | ✅ 100% |
| **SSN** | ❌ 0% |
| **Medical Record Numbers** | ❌ 0% |
| **Email Addresses** | ❌ 0% |
| **Account Numbers** | ❌ 0% |
| **Overall** | ⚠️ 42.9% |
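
The overall figure is consistent with an unweighted mean over the seven categories in the table: three at 100% and four at 0% gives 3/7, or about 42.9%. The sketch below reproduces that arithmetic. Treating the overall rate as an unweighted per-category mean is an assumption made here for illustration; the evaluation code may aggregate differently.

```python
# Per-category reduction rates copied from the table above (percent).
reduction_by_type = {
    "Patient Names": 100.0,
    "Dates of Birth": 100.0,
    "Physical Addresses": 100.0,
    "SSN": 0.0,
    "Medical Record Numbers": 0.0,
    "Email Addresses": 0.0,
    "Account Numbers": 0.0,
}


def overall_reduction(rates):
    """Unweighted mean of per-category rates (assumed aggregation)."""
    return sum(rates.values()) / len(rates)


overall = overall_reduction(reduction_by_type)
print(f"{overall:.1f}%")  # 42.9%
```
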
### Why It Matters
- **All strategies converged** to 42.9% regardless of architectural layer (V3-V9)
- **Spatial expansion didn't help** - mask radius r=1,2,3 showed no improvement
- **Root cause identified** - language model contextual inference reconstructs masked short identifiers from document context
- **Hybrid approach needed** - simulation shows 88.6% reduction when combining vision masking + NLP post-processing
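
The hybrid approach in the last point can be pictured as a text-level pass over the decoder output. This is a minimal sketch, assuming simple regular expressions for some of the identifier types that vision masking missed. The patterns, the `redact_text` name, and the placeholder format are illustrative assumptions, not the pipeline simulated in the paper.

```python
import re

# Hypothetical patterns for short structured identifiers. Real documents
# would need broader, validated patterns (or an NER model).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}


def redact_text(text):
    """Replace every pattern match with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


ocr_output = "Patient SSN 123-45-6789, MRN: 00481234, contact jane.doe@example.com"
print(redact_text(ocr_output))
# Patient SSN [SSN], [MRN], contact [EMAIL]
```

A pass like this runs after OCR, so it does not depend on which vision layer was masked.
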
## What's Included
✅ **Synthetic Data Pipeline**: Generates 38,517+ annotated medical PDFs using Synthea
✅ **PHI Annotation Tools**: Ground-truth labeling for all 18 HIPAA categories
✅ **Seven Masking Strategies**: V3-V9 implementations targeting different DeepSeek-OCR layers
✅ **Evaluation Framework**: Code for measuring PHI reduction by category
✅ **Configuration Files**: DeepSeek-OCR integration settings
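
As a rough picture of what measuring PHI reduction by category involves, the sketch below counts how many ground-truth PHI strings still appear in the OCR output before and after masking. Exact substring matching, the record layout, and the `reduction_by_category` name are assumptions for illustration; the repository's evaluation code may match and aggregate differently.

```python
from collections import defaultdict


def reduction_by_category(annotations, baseline_text, masked_text):
    """annotations: list of (category, value) ground-truth PHI pairs.

    Returns {category: percent of baseline leaks removed by masking}.
    Only values that leak in the unmasked baseline are counted.
    """
    leaked_before = defaultdict(int)
    leaked_after = defaultdict(int)
    for category, value in annotations:
        if value in baseline_text:
            leaked_before[category] += 1
            if value in masked_text:
                leaked_after[category] += 1
    return {
        cat: 100.0 * (n - leaked_after[cat]) / n
        for cat, n in leaked_before.items()
    }


# Toy example with made-up values (not from the dataset).
annotations = [("name", "Jane Doe"), ("ssn", "123-45-6789")]
baseline = "Patient: Jane Doe  SSN: 123-45-6789"
masked = "Patient:   SSN: 123-45-6789"
print(reduction_by_category(annotations, baseline, masked))
# {'name': 100.0, 'ssn': 0.0}
```
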
## Quick Start
### Installation
```bash
git clone https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR.git
cd Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### Generate Synthetic Data
```bash
# Setup Synthea
bash scripts/setup_synthea.sh
# Generate patient data
bash scripts/generate_synthea_data.sh
# Create annotated PDFs
python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
```
### Explore the Code
```bash
# View masking strategies
cat src/training/lora_phi_detector.py
# Check PHI annotation pipeline
cat src/preprocessing/phi_annotator.py
```
## Research Contributions
1. **First systematic evaluation** of vision-level token masking for PHI protection in VLMs
2. **Establishes boundaries** - identifies which PHI types work with vision masking vs requiring language-level redaction
3. **Negative result** - shows that the vision-only approaches tested are insufficient for HIPAA compliance
4. **Redirects future work** - toward hybrid architectures and decoder-level fine-tuning
## Architecture
```
Input PDF → Vision Encoder → PHI Detection → Vision Token Masking → DeepSeek Decoder → Text Output
            (SAM + CLIP)     (Ground Truth)  (V3-V9 Strategies)     (3B-MoE)
                                                       ↓
                                                42.9% Reduction
                                                (Insufficient!)
```
### Seven Masking Strategies
- **V3-V5**: SAM encoder blocks at different depths
- **V6**: Compression layer (4096→1024 tokens)
- **V7**: Dual vision encoders (SAM + CLIP)
- **V8**: Post-compression stage
- **V9**: Projector fusion layer
**Result**: All strategies converged to identical 42.9% reduction
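
To make the masking and spatial-expansion idea concrete, here is a minimal sketch that maps a PHI bounding box to vision-token indices on a patch grid and widens the mask by a radius r. The 16-pixel patch size, the 8x8 grid, and the function name `phi_token_mask` are assumptions for illustration; they are not taken from the repository or from DeepSeek-OCR's actual tokenization.

```python
def phi_token_mask(bbox, grid_size, patch_px=16, radius=0):
    """Return the set of (row, col) patch indices covered by a pixel bbox,
    expanded by `radius` patches in every direction and clipped to the grid."""
    x0, y0, x1, y1 = bbox  # pixel coordinates; x1 and y1 are exclusive
    rows, cols = grid_size
    # Patch columns and rows that the box overlaps.
    c0, c1 = x0 // patch_px, (x1 - 1) // patch_px
    r0, r1 = y0 // patch_px, (y1 - 1) // patch_px
    masked = set()
    for r in range(max(r0 - radius, 0), min(r1 + radius, rows - 1) + 1):
        for c in range(max(c0 - radius, 0), min(c1 + radius, cols - 1) + 1):
            masked.add((r, c))
    return masked


# A 40x16-pixel box starting at (32, 16) on an 8x8 patch grid.
box = (32, 16, 72, 32)
for r in (0, 1, 2, 3):
    print(r, len(phi_token_mask(box, (8, 8), radius=r)))
# 0 3
# 1 15
# 2 28
# 3 40
```

In a masking strategy, the token embeddings at these indices would be zeroed or replaced at the chosen layer. Widening the mask covers more tokens, yet the result above is that it did not change the reduction rate.
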
## Implications
⚠️ **Vision-only masking is insufficient for HIPAA compliance** (requires 99%+ PHI reduction)
✅ **Hybrid architectures are necessary** - combine vision masking with NLP post-processing
🔮 **Future directions** - decoder-level fine-tuning, defense-in-depth approaches
## Use Cases
This code is useful for:
- Researchers exploring privacy-preserving VLMs
- Healthcare AI teams evaluating PHI protection strategies
- Benchmarking alternative redaction approaches
- Understanding VLM architectural limitations for sensitive data
## Citation
```bibtex
@article{young2025visionmasking,
title={Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation},
author={Young, Richard J.},
institution={DeepNeuro.AI; University of Nevada, Las Vegas},
journal={Under Review},
year={2025},
url={https://huggingface.co/richardyoung/vision-token-masking-phi}
}
```
## License
MIT License - See [LICENSE](LICENSE) for details
## Contact
**Richard J. Young**
🌐 [deepneuro.ai/richard](https://deepneuro.ai/richard)
🤗 [@richardyoung](https://huggingface.co/richardyoung)
💻 [GitHub](https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR)
## Disclaimer
⚠️ Research project only - **NOT for production use with real PHI**. Always consult legal and compliance teams before deploying PHI-related systems.
---
**Note**: This negative result establishes important boundaries for vision-level privacy interventions in VLMs and redirects the field toward more effective hybrid approaches.