Twin
committed on
Commit
·
59cab28
1
Parent(s):
0c2f645
final space commit
- .gitignore +1 -1
- DEPLOYMENT.md +0 -170
- README.md +7 -33
- inference/bert_classif.py +38 -8
- inference/ocr_service.py +11 -11
- requirements.txt +0 -24
- test_ocr.py +0 -133
- test_reconstruction.py +0 -77
- uv.lock +0 -0
.gitignore
CHANGED
|
@@ -45,7 +45,7 @@ venv.bak/
|
|
| 45 |
ehthumbs.db
|
| 46 |
Thumbs.db
|
| 47 |
|
| 48 |
-
# Large model files (use HuggingFace Hub
|
| 49 |
models/bert_pii/
|
| 50 |
*.safetensors
|
| 51 |
*.bin
|
|
|
|
| 45 |
ehthumbs.db
|
| 46 |
Thumbs.db
|
| 47 |
|
| 48 |
+
# Large model files (use HuggingFace Hub)
|
| 49 |
models/bert_pii/
|
| 50 |
*.safetensors
|
| 51 |
*.bin
|
DEPLOYMENT.md
DELETED
|
@@ -1,170 +0,0 @@
|
|
| 1 |
-
# 🚀 PII Masking Space - Deployment Guide
|
| 2 |
-
|
| 3 |
-
## 📋 Structure Created
|
| 4 |
-
|
| 5 |
-
```
|
| 6 |
-
space/
|
| 7 |
-
├── README.md # ✅ HF Space configuration
|
| 8 |
-
├── Dockerfile # ✅ CPU-optimized container
|
| 9 |
-
├── requirements.txt # ✅ Python dependencies
|
| 10 |
-
├── app.py # ✅ FastAPI application
|
| 11 |
-
├── static/
|
| 12 |
-
│ └── index.html # ✅ Modern web interface
|
| 13 |
-
└── inference/
|
| 14 |
-
├── __init__.py # ✅ Module init
|
| 15 |
-
└── mistral_prompting.py # ✅ Mistral service
|
| 16 |
-
```
|
| 17 |
-
|
| 18 |
-
## 🎯 Implemented Features
|
| 19 |
-
|
| 20 |
-
### ✅ **Mistral Prompting Service**
|
| 21 |
-
- Integrated Mistral API
|
| 22 |
-
- PII detection via structured JSON
|
| 23 |
-
- Robust error handling
|
| 24 |
-
- Built-in rate limiting
|
| 25 |
-
- Full async support
|
| 26 |
-
|
| 27 |
-
### ✅ **FastAPI Application**
|
| 28 |
-
- `/predict` and `/health` endpoints
|
| 29 |
-
- Built-in HTML interface (fallback)
|
| 30 |
-
- HTTP error handling
|
| 31 |
-
- CORS configured
|
| 32 |
-
- Detailed logging
|
| 33 |
-
|
| 34 |
-
### ✅ **User Interface**
|
| 35 |
-
- Modern, responsive design
|
| 36 |
-
- Loading states with spinner
|
| 37 |
-
- Metrics display
|
| 38 |
-
- User-facing error handling
|
| 39 |
-
- Mobile-friendly
|
| 40 |
-
|
| 41 |
-
### ✅ **Docker Configuration**
|
| 42 |
-
- Python 3.9-slim base image (CPU)
|
| 43 |
-
- HF Spaces-compliant user
|
| 44 |
-
- Built-in health check
|
| 45 |
-
- Optimized for production
|
| 46 |
-
|
| 47 |
-
## 🔧 Required Configuration
|
| 48 |
-
|
| 49 |
-
### **Environment Variables**
|
| 50 |
-
In the HF Space settings, add:
|
| 51 |
-
|
| 52 |
-
```
|
| 53 |
-
MISTRAL_API_KEY=your_mistral_api_key_here
|
| 54 |
-
```
|
| 55 |
-
**Type:** Secret (not Variable)
|
| 56 |
-
|
| 57 |
-
### **Space Configuration**
|
| 58 |
-
The `README.md` file contains:
|
| 59 |
-
```yaml
|
| 60 |
-
sdk: docker
|
| 61 |
-
app_port: 7860
|
| 62 |
-
```
|
| 63 |
-
|
| 64 |
-
## 🚀 HuggingFace Deployment
|
| 65 |
-
|
| 66 |
-
### **Step 1: Create the Space**
|
| 67 |
-
1. Go to https://huggingface.co/spaces
|
| 68 |
-
2. Click "Create new Space"
|
| 69 |
-
3. Choose "Docker" as the SDK
|
| 70 |
-
4. Name your Space
|
| 71 |
-
|
| 72 |
-
### **Step 2: Upload the Files**
|
| 73 |
-
```bash
|
| 74 |
-
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
|
| 75 |
-
cd YOUR_SPACE_NAME
|
| 76 |
-
|
| 77 |
-
# Copy the Space files
|
| 78 |
-
cp -r /path/to/pii-masking-200k/space/* .
|
| 79 |
-
cp -r /path/to/pii-masking-200k/src .
|
| 80 |
-
|
| 81 |
-
# Commit and push
|
| 82 |
-
git add .
|
| 83 |
-
git commit -m "Initial PII masking demo"
|
| 84 |
-
git push
|
| 85 |
-
```
|
| 86 |
-
|
| 87 |
-
### **Step 3: Configure Secrets**
|
| 88 |
-
1. Go to the Space's Settings
|
| 89 |
-
2. Add `MISTRAL_API_KEY` as a Secret
|
| 90 |
-
3. Value = your Mistral API key
|
| 91 |
-
|
| 92 |
-
### **Step 4: Deployment**
|
| 93 |
-
The Space builds automatically after the push.
|
| 94 |
-
|
| 95 |
-
## 🧪 Local Tests
|
| 96 |
-
|
| 97 |
-
### **Import Test**
|
| 98 |
-
```bash
|
| 99 |
-
cd space
|
| 100 |
-
uv run python -c "from inference.mistral_prompting import MistralPromptingService; print('✅ OK')"
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
### **Application Test**
|
| 104 |
-
```bash
|
| 105 |
-
cd space
|
| 106 |
-
# With a real API key in .env
|
| 107 |
-
echo "MISTRAL_API_KEY=your_key" > .env
|
| 108 |
-
uv run app.py
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
Then open http://localhost:7860
|
| 112 |
-
|
| 113 |
-
## 📊 Expected Performance
|
| 114 |
-
|
| 115 |
-
### **Free HF Hardware**
|
| 116 |
-
- **CPU:** 2 vCPU, 16GB RAM
|
| 117 |
-
- **Mistral latency:** ~1-3s per request
|
| 118 |
-
- **Concurrent users:** ~5-10
|
| 119 |
-
|
| 120 |
-
### **Possible Optimizations**
|
| 121 |
-
- Redis cache for frequent requests
|
| 122 |
-
- Batch processing for multiple texts
|
| 123 |
-
- GPU upgrade for local BERT
|
| 124 |
-
|
| 125 |
-
## 🔍 Monitoring
|
| 126 |
-
|
| 127 |
-
### **Health Check**
|
| 128 |
-
```bash
|
| 129 |
-
curl https://YOUR_SPACE.hf.space/health
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
### **Logs**
|
| 133 |
-
Visible in the HF Spaces interface
|
| 134 |
-
|
| 135 |
-
### **Metrics**
|
| 136 |
-
- Processing time shown in the UI
|
| 137 |
-
- Number of entities detected
|
| 138 |
-
- Method used
|
| 139 |
-
|
| 140 |
-
## 🛠️ Future Development
|
| 141 |
-
|
| 142 |
-
### **Planned Additions**
|
| 143 |
-
1. **BERT service** (`bert_inference.py`)
|
| 144 |
-
2. **Fine-tuned Mistral service** (`mistral_finetuned.py`)
|
| 145 |
-
3. **Method comparison**
|
| 146 |
-
4. **Results export**
|
| 147 |
-
|
| 148 |
-
### **UX Improvements**
|
| 149 |
-
1. **Pre-filled examples**
|
| 150 |
-
2. **Request history**
|
| 151 |
-
3. **Usage statistics**
|
| 152 |
-
4. **Batch upload mode**
|
| 153 |
-
|
| 154 |
-
## ⚠️ Current Limitations
|
| 155 |
-
|
| 156 |
-
- **Only Mistral** implemented for now
|
| 157 |
-
- **No cache** (every request = API call)
|
| 158 |
-
- **Basic rate limiting**
|
| 159 |
-
- **No persistence** of results
|
| 160 |
-
|
| 161 |
-
## 🎉 Final Result
|
| 162 |
-
|
| 163 |
-
A working PII masking demo with:
|
| 164 |
-
- ✅ Modern, intuitive interface
|
| 165 |
-
- ✅ Integrated Mistral API
|
| 166 |
-
- ✅ Deployable on free HF Spaces
|
| 167 |
-
- ✅ Code production-ready
|
| 168 |
-
- ✅ Extensible to other methods
|
| 169 |
-
|
| 170 |
-
**Ready to deploy! 🚀**
|
README.md
CHANGED
|
@@ -13,43 +13,17 @@ license: mit
|
|
| 13 |
|
| 14 |
A comprehensive demo for Personally Identifiable Information (PII) masking using multiple approaches:
|
| 15 |
|
| 16 |
-
-
|
| 17 |
-
-
|
| 18 |
-
-
|
| 19 |
|
| 20 |
## Features
|
| 21 |
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
|
| 27 |
-
## Usage
|
| 28 |
-
|
| 29 |
-
1. Paste your text containing PII information
|
| 30 |
-
2. Select your preferred masking method
|
| 31 |
-
3. Click "Process Text" to get the masked version
|
| 32 |
-
4. Compare results across different methods
|
| 33 |
-
|
| 34 |
-
## Methods Comparison
|
| 35 |
-
|
| 36 |
-
| Method | Speed | Accuracy | Cost |
|
| 37 |
-
|--------|-------|----------|------|
|
| 38 |
-
| BERT | ⚡ Fast | ⭐⭐⭐⭐ | Free |
|
| 39 |
-
| Mistral Prompt | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
|
| 40 |
-
| Mistral Fine-tuned | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
|
| 41 |
-
|
| 42 |
-
## Example
|
| 43 |
-
|
| 44 |
-
**Input:**
|
| 45 |
-
```
|
| 46 |
-
Hi, my name is John Smith and my email is john.smith@company.com. Call me at 555-1234.
|
| 47 |
-
```
|
| 48 |
-
|
| 49 |
-
**Output:**
|
| 50 |
-
```
|
| 51 |
-
Hi, my name is [FIRSTNAME_1] [LASTNAME_1] and my email is [EMAIL_1]. Call me at [PHONENUMBER_1].
|
| 52 |
-
```
|
| 53 |
|
| 54 |
## Technology Stack
|
| 55 |
|
|
|
|
| 13 |
|
| 14 |
A comprehensive demo for Personally Identifiable Information (PII) masking using multiple approaches:
|
| 15 |
|
| 16 |
+
- **BERT Token Classification** - Fast local inference
|
| 17 |
+
- **Mistral Prompting** - High accuracy via API
|
| 18 |
+
- **Mistral Fine-tuned** - Best performance via API
|
| 19 |
|
| 20 |
## Features
|
| 21 |
|
| 22 |
+
- **Multiple Methods** - Compare different PII masking approaches
|
| 23 |
+
- **Real-time Processing** - Instant text masking
|
| 24 |
+
- **Performance Metrics** - Processing time and entity counts
|
| 25 |
+
- **User-friendly Interface** - Simple copy-paste workflow
|
| 26 |
|
| 27 |
|
| 28 |
## Technology Stack
|
| 29 |
|
inference/bert_classif.py
CHANGED
|
@@ -271,14 +271,44 @@ class BERTInferenceService:
|
|
| 271 |
return True # Adjacent or overlapping
|
| 272 |
|
| 273 |
between_text = text[between_start:between_end]
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
|
| 282 |
|
| 283 |
return False
|
| 284 |
|
|
|
|
| 271 |
return True # Adjacent or overlapping
|
| 272 |
|
| 273 |
between_text = text[between_start:between_end]
|
| 274 |
+
entity_type = span1['entity_type']
|
| 275 |
+
|
| 276 |
+
# More aggressive merging for specific entity types
|
| 277 |
+
if entity_type in ['PHONENUMBER', 'SSN', 'ACCOUNTNAME']:
|
| 278 |
+
# For phone numbers, SSNs, and account numbers, merge if separated by:
|
| 279 |
+
# - No gap (adjacent tokens)
|
| 280 |
+
# - Common phone/ID separators: spaces, dashes, parentheses, dots
|
| 281 |
+
if len(between_text) <= 5 and all(c in ' \t\n()-.' for c in between_text):
|
| 282 |
+
return True
|
| 283 |
+
|
| 284 |
+
elif entity_type in ['FIRSTNAME', 'LASTNAME', 'MIDDLENAME']:
|
| 285 |
+
# For names, be more conservative - only merge with single space or initials
|
| 286 |
+
if len(between_text) <= 2 and between_text.strip() in ['', '.']:
|
| 287 |
+
return True
|
| 288 |
+
|
| 289 |
+
elif entity_type in ['STREET', 'SECONDARYADDRESS', 'CITY']:
|
| 290 |
+
# For addresses, merge with spaces and common separators
|
| 291 |
+
if len(between_text) <= 3 and all(c in ' \t\n,-' for c in between_text):
|
| 292 |
+
return True
|
| 293 |
+
|
| 294 |
+
elif entity_type in ['DATE', 'TIME']:
|
| 295 |
+
# For dates and times, merge with spaces, commas, and common separators
|
| 296 |
+
if len(between_text) <= 4 and all(c in ' \t\n,/-:' for c in between_text):
|
| 297 |
+
return True
|
| 298 |
+
|
| 299 |
+
elif entity_type in ['EMAIL', 'URL']:
|
| 300 |
+
# For emails and URLs, only merge if directly adjacent (no gaps allowed)
|
| 301 |
+
if len(between_text) == 0:
|
| 302 |
+
return True
|
| 303 |
+
|
| 304 |
+
else:
|
| 305 |
+
# Default behavior: merge if only whitespace and simple punctuation
|
| 306 |
+
if len(between_text) <= 3 and between_text.strip() in ['', ',', '.', '-', '/', ':', ';']:
|
| 307 |
+
return True
|
| 308 |
+
|
| 309 |
+
# Also merge if it's just whitespace
|
| 310 |
+
if between_text.isspace() and len(between_text) <= 2:
|
| 311 |
+
return True
|
| 312 |
|
| 313 |
return False
|
| 314 |
|
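The per-type gap rules added in this hunk can be exercised in isolation. The following is a minimal sketch, not the method from `bert_classif.py`: `gap_allows_merge` and the rule table are hypothetical names, the thresholds and separator sets are copied from the added lines, and the trailing whitespace-only check is left out because its indentation level cannot be recovered from this view of the diff.

```python
from typing import Dict, Tuple

# (entity types, max gap length, characters allowed inside the gap),
# copied from the lines added in the diff above.
_RULE_GROUPS = [
    (("PHONENUMBER", "SSN", "ACCOUNTNAME"), 5, " \t\n()-."),
    (("STREET", "SECONDARYADDRESS", "CITY"), 3, " \t\n,-"),
    (("DATE", "TIME"), 4, " \t\n,/-:"),
]
CHARSET_RULES: Dict[str, Tuple[int, str]] = {
    t: (max_len, chars) for types, max_len, chars in _RULE_GROUPS for t in types
}

NAME_TYPES = {"FIRSTNAME", "LASTNAME", "MIDDLENAME"}
ADJACENT_ONLY_TYPES = {"EMAIL", "URL"}
DEFAULT_SEPARATORS = {"", ",", ".", "-", "/", ":", ";"}


def gap_allows_merge(entity_type: str, between_text: str) -> bool:
    """Hypothetical helper: should two same-type spans separated by
    `between_text` be merged into one span?"""
    if entity_type in CHARSET_RULES:
        max_len, allowed = CHARSET_RULES[entity_type]
        return len(between_text) <= max_len and all(c in allowed for c in between_text)
    if entity_type in NAME_TYPES:
        # Names: only whitespace or a single dot (initials) may sit in the gap.
        return len(between_text) <= 2 and between_text.strip() in ("", ".")
    if entity_type in ADJACENT_ONLY_TYPES:
        # Emails and URLs: spans must touch.
        return between_text == ""
    # Any other type: short gap made of whitespace plus at most one simple separator.
    return len(between_text) <= 3 and between_text.strip() in DEFAULT_SEPARATORS


if __name__ == "__main__":
    print(gap_allows_merge("PHONENUMBER", ") "))   # separators only
    print(gap_allows_merge("PHONENUMBER", " x "))  # a letter in the gap
    print(gap_allows_merge("EMAIL", " "))          # any gap blocks the merge here
```

If the whitespace-only check in the diff sits at function level rather than inside the `else` branch, a one- or two-character whitespace gap would also merge `EMAIL` and `URL` spans, which this sketch does not model.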
inference/ocr_service.py
CHANGED
|
@@ -38,7 +38,7 @@ class OCRService:
|
|
| 38 |
self.client = Mistral(api_key=self.api_key)
|
| 39 |
self.is_initialized = True
|
| 40 |
|
| 41 |
-
logger.info("
|
| 42 |
|
| 43 |
async def extract_text_from_pdf(self, pdf_content: bytes) -> str:
|
| 44 |
"""
|
|
@@ -54,7 +54,7 @@ class OCRService:
|
|
| 54 |
# Encode PDF content to base64
|
| 55 |
base64_pdf = base64.b64encode(pdf_content).decode('utf-8')
|
| 56 |
|
| 57 |
-
logger.info(f"
|
| 58 |
|
| 59 |
# Process the PDF with OCR
|
| 60 |
ocr_response = self.client.ocr.process(
|
|
@@ -66,34 +66,34 @@ class OCRService:
|
|
| 66 |
include_image_base64=False # Don't include images to save bandwidth
|
| 67 |
)
|
| 68 |
|
| 69 |
-
logger.info("
|
| 70 |
|
| 71 |
# Extract text from all pages
|
| 72 |
extracted_text = ""
|
| 73 |
|
| 74 |
if hasattr(ocr_response, 'pages') and ocr_response.pages:
|
| 75 |
-
logger.info(f"
|
| 76 |
|
| 77 |
for i, page in enumerate(ocr_response.pages):
|
| 78 |
if hasattr(page, 'markdown') and page.markdown:
|
| 79 |
page_text = page.markdown
|
| 80 |
extracted_text += page_text + "\n\n"
|
| 81 |
-
logger.debug(f"
|
| 82 |
|
| 83 |
-
logger.info(f"
|
| 84 |
|
| 85 |
if not extracted_text.strip():
|
| 86 |
-
logger.warning("
|
| 87 |
return "No text could be extracted from this PDF."
|
| 88 |
|
| 89 |
return extracted_text.strip()
|
| 90 |
|
| 91 |
else:
|
| 92 |
-
logger.warning("
|
| 93 |
return "No text could be extracted from this PDF."
|
| 94 |
|
| 95 |
except Exception as e:
|
| 96 |
-
logger.error(f"
|
| 97 |
raise RuntimeError(f"Failed to extract text from PDF: {str(e)}")
|
| 98 |
|
| 99 |
def get_service_info(self) -> Dict[str, Any]:
|
|
@@ -119,8 +119,8 @@ async def create_ocr_service(api_key: Optional[str] = None) -> OCRService:
|
|
| 119 |
"""
|
| 120 |
try:
|
| 121 |
service = OCRService(api_key)
|
| 122 |
-
logger.info("
|
| 123 |
return service
|
| 124 |
except Exception as e:
|
| 125 |
-
logger.error(f"
|
| 126 |
raise
|
|
|
|
| 38 |
self.client = Mistral(api_key=self.api_key)
|
| 39 |
self.is_initialized = True
|
| 40 |
|
| 41 |
+
logger.info("OCR service initialized with Mistral API")
|
| 42 |
|
| 43 |
async def extract_text_from_pdf(self, pdf_content: bytes) -> str:
|
| 44 |
"""
|
|
|
|
| 54 |
# Encode PDF content to base64
|
| 55 |
base64_pdf = base64.b64encode(pdf_content).decode('utf-8')
|
| 56 |
|
| 57 |
+
logger.info(f"Processing PDF ({len(pdf_content)} bytes) with Mistral OCR...")
|
| 58 |
|
| 59 |
# Process the PDF with OCR
|
| 60 |
ocr_response = self.client.ocr.process(
|
|
|
|
| 66 |
include_image_base64=False # Don't include images to save bandwidth
|
| 67 |
)
|
| 68 |
|
| 69 |
+
logger.info("OCR processing completed")
|
| 70 |
|
| 71 |
# Extract text from all pages
|
| 72 |
extracted_text = ""
|
| 73 |
|
| 74 |
if hasattr(ocr_response, 'pages') and ocr_response.pages:
|
| 75 |
+
logger.info(f"Found {len(ocr_response.pages)} pages")
|
| 76 |
|
| 77 |
for i, page in enumerate(ocr_response.pages):
|
| 78 |
if hasattr(page, 'markdown') and page.markdown:
|
| 79 |
page_text = page.markdown
|
| 80 |
extracted_text += page_text + "\n\n"
|
| 81 |
+
logger.debug(f"Page {i+1}: {len(page_text)} characters")
|
| 82 |
|
| 83 |
+
logger.info(f"Total extracted text: {len(extracted_text)} characters")
|
| 84 |
|
| 85 |
if not extracted_text.strip():
|
| 86 |
+
logger.warning("No text extracted from PDF")
|
| 87 |
return "No text could be extracted from this PDF."
|
| 88 |
|
| 89 |
return extracted_text.strip()
|
| 90 |
|
| 91 |
else:
|
| 92 |
+
logger.warning("No pages found in OCR response")
|
| 93 |
return "No text could be extracted from this PDF."
|
| 94 |
|
| 95 |
except Exception as e:
|
| 96 |
+
logger.error(f"OCR processing failed: {e}")
|
| 97 |
raise RuntimeError(f"Failed to extract text from PDF: {str(e)}")
|
| 98 |
|
| 99 |
def get_service_info(self) -> Dict[str, Any]:
|
|
|
|
| 119 |
"""
|
| 120 |
try:
|
| 121 |
service = OCRService(api_key)
|
| 122 |
+
logger.info("OCR service created successfully")
|
| 123 |
return service
|
| 124 |
except Exception as e:
|
| 125 |
+
logger.error(f"Failed to create OCR service: {e}")
|
| 126 |
raise
|
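The page-joining step in `extract_text_from_pdf` can be tried without calling the Mistral API. The sketch below is an illustration under assumptions: `FakePage` is a hypothetical stand-in for a page object exposing a `markdown` attribute, and `join_page_markdown` is an invented helper name. It joins non-empty pages with a blank line and falls back to the same message the service returns when nothing was extracted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

NO_TEXT_MESSAGE = "No text could be extracted from this PDF."


@dataclass
class FakePage:
    """Hypothetical stand-in for one page of an OCR response."""
    markdown: Optional[str] = None


def join_page_markdown(pages: Iterable[object]) -> str:
    """Concatenate the markdown of every page that has some, separated by blank lines."""
    parts: List[str] = []
    for page in pages:
        # Mirrors the hasattr(page, 'markdown') and page.markdown check in the service.
        text = getattr(page, "markdown", None)
        if text:
            parts.append(text)
    joined = "\n\n".join(parts).strip()
    return joined if joined else NO_TEXT_MESSAGE


if __name__ == "__main__":
    pages = [FakePage("# Title"), FakePage(None), FakePage("Body text")]
    print(join_page_markdown(pages))
```

Returning a sentinel string keeps the caller simple, but callers that need to distinguish "no text" from real content may prefer an empty string or an exception.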
requirements.txt
DELETED
|
@@ -1,24 +0,0 @@
|
|
| 1 |
-
# FastAPI and web server
|
| 2 |
-
fastapi==0.104.1
|
| 3 |
-
uvicorn[standard]==0.24.0
|
| 4 |
-
pydantic==2.5.0
|
| 5 |
-
python-multipart==0.0.6
|
| 6 |
-
|
| 7 |
-
# HTTP Client for API calls
|
| 8 |
-
httpx>=0.27.0
|
| 9 |
-
aiohttp>=3.9.0
|
| 10 |
-
|
| 11 |
-
# Mistral AI client
|
| 12 |
-
mistralai>=1.0.0
|
| 13 |
-
|
| 14 |
-
# Utilities
|
| 15 |
-
python-dotenv==1.0.0
|
| 16 |
-
loguru==0.7.2
|
| 17 |
-
|
| 18 |
-
# Basic data processing
|
| 19 |
-
numpy>=1.24.0
|
| 20 |
-
|
| 21 |
-
# BERT and ML dependencies
|
| 22 |
-
torch>=2.0.0,<2.3.0
|
| 23 |
-
transformers>=4.35.0
|
| 24 |
-
tokenizers>=0.15.0
|
test_ocr.py
DELETED
|
@@ -1,133 +0,0 @@
|
|
| 1 |
-
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
-
Test script for Mistral OCR functionality.
|
| 4 |
-
"""
|
| 5 |
-
|
| 6 |
-
import os
|
| 7 |
-
import sys
|
| 8 |
-
import asyncio
|
| 9 |
-
import logging
|
| 10 |
-
from pathlib import Path
|
| 11 |
-
from dotenv import load_dotenv
|
| 12 |
-
|
| 13 |
-
# Load environment variables from .env file
|
| 14 |
-
load_dotenv()
|
| 15 |
-
|
| 16 |
-
# Add the inference directory to Python path
|
| 17 |
-
sys.path.insert(0, str(Path(__file__).parent / "inference"))
|
| 18 |
-
|
| 19 |
-
from mistralai import Mistral
|
| 20 |
-
|
| 21 |
-
# Setup logging
|
| 22 |
-
logging.basicConfig(level=logging.INFO)
|
| 23 |
-
logger = logging.getLogger(__name__)
|
| 24 |
-
|
| 25 |
-
async def test_mistral_ocr():
|
| 26 |
-
"""Test Mistral OCR with a sample PDF."""
|
| 27 |
-
|
| 28 |
-
# Check if API key is available
|
| 29 |
-
api_key = os.environ.get("MISTRAL_API_KEY")
|
| 30 |
-
if not api_key:
|
| 31 |
-
logger.error("❌ MISTRAL_API_KEY not found in environment variables")
|
| 32 |
-
return
|
| 33 |
-
|
| 34 |
-
try:
|
| 35 |
-
# Initialize Mistral client
|
| 36 |
-
client = Mistral(api_key=api_key)
|
| 37 |
-
logger.info("✅ Mistral client initialized")
|
| 38 |
-
|
| 39 |
-
# Test with a sample PDF from arXiv
|
| 40 |
-
test_pdf_url = "https://arxiv.org/pdf/2201.04234"
|
| 41 |
-
|
| 42 |
-
logger.info(f"🔍 Testing OCR with PDF: {test_pdf_url}")
|
| 43 |
-
|
| 44 |
-
# Process the PDF with OCR
|
| 45 |
-
ocr_response = client.ocr.process(
|
| 46 |
-
model="mistral-ocr-latest",
|
| 47 |
-
document={
|
| 48 |
-
"type": "document_url",
|
| 49 |
-
"document_url": test_pdf_url
|
| 50 |
-
},
|
| 51 |
-
include_image_base64=False # Don't include images for testing
|
| 52 |
-
)
|
| 53 |
-
|
| 54 |
-
logger.info("✅ OCR processing completed")
|
| 55 |
-
|
| 56 |
-
# Extract the text content
|
| 57 |
-
logger.info(f"📊 OCR Response structure:")
|
| 58 |
-
logger.info(f" - Type: {type(ocr_response)}")
|
| 59 |
-
logger.info(f" - Has pages: {hasattr(ocr_response, 'pages')}")
|
| 60 |
-
logger.info(f" - Has content: {hasattr(ocr_response, 'content')}")
|
| 61 |
-
|
| 62 |
-
extracted_text = ""
|
| 63 |
-
|
| 64 |
-
if hasattr(ocr_response, 'pages') and ocr_response.pages:
|
| 65 |
-
logger.info(f"📄 Found {len(ocr_response.pages)} pages")
|
| 66 |
-
|
| 67 |
-
# Extract text from all pages
|
| 68 |
-
for i, page in enumerate(ocr_response.pages):
|
| 69 |
-
logger.info(f"📃 Page {i+1} structure: {dir(page)}")
|
| 70 |
-
|
| 71 |
-
if hasattr(page, 'markdown') and page.markdown:
|
| 72 |
-
page_text = page.markdown
|
| 73 |
-
extracted_text += page_text + "\n\n"
|
| 74 |
-
logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
|
| 75 |
-
logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
|
| 76 |
-
elif hasattr(page, 'content'):
|
| 77 |
-
page_text = page.content
|
| 78 |
-
extracted_text += page_text + "\n\n"
|
| 79 |
-
logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
|
| 80 |
-
logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
|
| 81 |
-
elif hasattr(page, 'text'):
|
| 82 |
-
page_text = page.text
|
| 83 |
-
extracted_text += page_text + "\n\n"
|
| 84 |
-
logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
|
| 85 |
-
logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
|
| 86 |
-
else:
|
| 87 |
-
logger.info(f"⚠️ Page {i+1} attributes: {[attr for attr in dir(page) if not attr.startswith('_')]}")
|
| 88 |
-
|
| 89 |
-
if extracted_text:
|
| 90 |
-
logger.info(f"📄 Total extracted text length: {len(extracted_text)} characters")
|
| 91 |
-
|
| 92 |
-
# Test if we can use this text for PII detection
|
| 93 |
-
if len(extracted_text) > 50:
|
| 94 |
-
logger.info("✅ OCR extraction successful - text is suitable for PII detection")
|
| 95 |
-
|
| 96 |
-
# Try to import and test PII detection
|
| 97 |
-
try:
|
| 98 |
-
from mistral_prompting import create_mistral_service
|
| 99 |
-
|
| 100 |
-
logger.info("🔍 Testing PII detection on OCR text...")
|
| 101 |
-
service = await create_mistral_service()
|
| 102 |
-
|
| 103 |
-
# Use a small sample of the text for testing
|
| 104 |
-
sample_text = extracted_text[:500] # First 500 characters
|
| 105 |
-
prediction = await service.predict(sample_text)
|
| 106 |
-
|
| 107 |
-
logger.info(f"📊 PII detection results:")
|
| 108 |
-
logger.info(f" - Entities found: {len(prediction.entities)}")
|
| 109 |
-
logger.info(f" - Spans detected: {len(prediction.spans)}")
|
| 110 |
-
logger.info(f" - Masked text preview: {prediction.masked_text[:100]}...")
|
| 111 |
-
|
| 112 |
-
except Exception as e:
|
| 113 |
-
logger.warning(f"⚠️ PII detection test failed: {e}")
|
| 114 |
-
logger.info("💡 OCR works, but PII detection needs API key setup")
|
| 115 |
-
else:
|
| 116 |
-
logger.warning("⚠️ Extracted text too short")
|
| 117 |
-
else:
|
| 118 |
-
logger.warning("⚠️ No text extracted from pages")
|
| 119 |
-
|
| 120 |
-
elif hasattr(ocr_response, 'content'):
|
| 121 |
-
extracted_text = ocr_response.content
|
| 122 |
-
logger.info(f"📄 Extracted text length: {len(extracted_text)} characters")
|
| 123 |
-
logger.info(f"📝 First 200 characters: {extracted_text[:200]}...")
|
| 124 |
-
else:
|
| 125 |
-
logger.error("❌ No content found in OCR response")
|
| 126 |
-
logger.info(f"Available attributes: {[attr for attr in dir(ocr_response) if not attr.startswith('_')]}")
|
| 127 |
-
|
| 128 |
-
except Exception as e:
|
| 129 |
-
logger.error(f"❌ OCR test failed: {e}")
|
| 130 |
-
logger.info("💡 Make sure MISTRAL_API_KEY is set correctly")
|
| 131 |
-
|
| 132 |
-
if __name__ == "__main__":
|
| 133 |
-
asyncio.run(test_mistral_ocr())
|
test_reconstruction.py
DELETED
|
@@ -1,77 +0,0 @@
|
|
| 1 |
-
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
-
Test script to verify that PII reconstruction works with correct numbering.
|
| 4 |
-
"""
|
| 5 |
-
|
| 6 |
-
import sys
|
| 7 |
-
from pathlib import Path
|
| 8 |
-
|
| 9 |
-
# Add the src directory to Python path
|
| 10 |
-
sys.path.insert(0, str(Path(__file__).parent / "src"))
|
| 11 |
-
|
| 12 |
-
from pii_masking.text_processing import reconstruct_masked_text
|
| 13 |
-
|
| 14 |
-
def test_reconstruction():
|
| 15 |
-
"""Test the reconstruction function with multiple entities of the same type."""
|
| 16 |
-
|
| 17 |
-
# Test case with multiple entities of the same type
|
| 18 |
-
text = "John T. Smith lives at 4872 Willow Creek Drive Apartment 14B Springfield, IL 62701. Phone: 555-1234. Email: john@email.com. On January 15, 2024 Ms. Alice A. Johnson from Records Office at 9135 Westfield Parkway, Suite 320 Hartford, CT 06101. Dear Ms. Johnson, I am writing regarding my claim from December 2024. As of February 1, 2024, my address is 4872 Willow Creek Drive, Apartment 14B, Springfield, IL 62701."
|
| 19 |
-
|
| 20 |
-
# Simulate PII entities found in the text
|
| 21 |
-
pii_dict = {
|
| 22 |
-
"FIRSTNAME": ["John", "Alice"],
|
| 23 |
-
"LASTNAME": ["Smith", "Johnson"],
|
| 24 |
-
"CITY": ["Springfield", "Hartford"],
|
| 25 |
-
"STATE": ["IL", "CT"],
|
| 26 |
-
"ZIPCODE": ["62701", "06101"],
|
| 27 |
-
"PHONENUMBER": ["555-1234"],
|
| 28 |
-
"EMAIL": ["john@email.com"],
|
| 29 |
-
"DATE": ["January 15, 2024", "December 2024", "February 1, 2024"]
|
| 30 |
-
}
|
| 31 |
-
|
| 32 |
-
print("🧪 Testing PII reconstruction with correct numbering")
|
| 33 |
-
print("=" * 60)
|
| 34 |
-
print(f"Original text: {text[:100]}...")
|
| 35 |
-
print()
|
| 36 |
-
|
| 37 |
-
masked_text = reconstruct_masked_text(text, pii_dict)
|
| 38 |
-
|
| 39 |
-
print("🎭 Masked text:")
|
| 40 |
-
print(masked_text)
|
| 41 |
-
print()
|
| 42 |
-
|
| 43 |
-
# Check if numbering is correct (first occurrence should be _1, second _2, etc.)
|
| 44 |
-
print("🔍 Checking numbering order:")
|
| 45 |
-
|
| 46 |
-
# Check FIRSTNAMEs
|
| 47 |
-
john_pos = masked_text.find("[FIRSTNAME_1]")
|
| 48 |
-
alice_pos = masked_text.find("[FIRSTNAME_2]")
|
| 49 |
-
print(f" FIRSTNAME_1 position: {john_pos}")
|
| 50 |
-
print(f" FIRSTNAME_2 position: {alice_pos}")
|
| 51 |
-
print(f" ✅ Correct order: {john_pos < alice_pos}")
|
| 52 |
-
|
| 53 |
-
# Check LASTNAMEs
|
| 54 |
-
smith_pos = masked_text.find("[LASTNAME_1]")
|
| 55 |
-
johnson_pos = masked_text.find("[LASTNAME_2]")
|
| 56 |
-
print(f" LASTNAME_1 position: {smith_pos}")
|
| 57 |
-
print(f" LASTNAME_2 position: {johnson_pos}")
|
| 58 |
-
print(f" ✅ Correct order: {smith_pos < johnson_pos}")
|
| 59 |
-
|
| 60 |
-
# Check CITYs
|
| 61 |
-
springfield_pos = masked_text.find("[CITY_1]")
|
| 62 |
-
hartford_pos = masked_text.find("[CITY_2]")
|
| 63 |
-
print(f" CITY_1 position: {springfield_pos}")
|
| 64 |
-
print(f" CITY_2 position: {hartford_pos}")
|
| 65 |
-
print(f" ✅ Correct order: {springfield_pos < hartford_pos}")
|
| 66 |
-
|
| 67 |
-
# Check DATEs
|
| 68 |
-
date1_pos = masked_text.find("[DATE_1]")
|
| 69 |
-
date2_pos = masked_text.find("[DATE_2]")
|
| 70 |
-
date3_pos = masked_text.find("[DATE_3]")
|
| 71 |
-
print(f" DATE_1 position: {date1_pos}")
|
| 72 |
-
print(f" DATE_2 position: {date2_pos}")
|
| 73 |
-
print(f" DATE_3 position: {date3_pos}")
|
| 74 |
-
print(f" ✅ Correct order: {date1_pos < date2_pos < date3_pos}")
|
| 75 |
-
|
| 76 |
-
if __name__ == "__main__":
|
| 77 |
-
test_reconstruction()
|
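The numbering property this deleted test checked (the first entity of a type in reading order gets `_1`, the next `_2`, and so on) can be illustrated with a small stand-alone function. This is a hypothetical sketch, not the `reconstruct_masked_text` from `pii_masking.text_processing`, whose implementation is not shown in this commit. It assumes plain substring matching and that a repeated value reuses its first number.

```python
import re
from typing import Dict, List, Tuple


def mask_in_order(text: str, pii_dict: Dict[str, List[str]]) -> str:
    """Hypothetical helper: replace PII values with [TYPE_n] placeholders,
    numbering each type by order of first appearance in the text."""
    # Collect every occurrence of every value as (start, end, type, value).
    matches: List[Tuple[int, int, str, str]] = []
    for entity_type, values in pii_dict.items():
        for value in values:
            for m in re.finditer(re.escape(value), text):
                matches.append((m.start(), m.end(), entity_type, value))

    # Leftmost first; at the same start, prefer the longer match.
    matches.sort(key=lambda t: (t[0], -(t[1] - t[0])))

    counters: Dict[str, int] = {}
    assigned: Dict[Tuple[str, str], int] = {}
    out: List[str] = []
    cursor = 0
    for start, end, entity_type, value in matches:
        if start < cursor:
            continue  # overlaps a span that was already masked
        key = (entity_type, value)
        if key not in assigned:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            assigned[key] = counters[entity_type]
        out.append(text[cursor:start])
        out.append(f"[{entity_type}_{assigned[key]}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    sample = "John met Alice. Alice called John at 555-1234."
    entities = {"FIRSTNAME": ["Alice", "John"], "PHONENUMBER": ["555-1234"]}
    print(mask_in_order(sample, entities))
```

Because the matching is plain substring search, short values such as a two-letter state code can also match inside longer words. A real implementation would likely work from character spans rather than from values alone.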
uv.lock
DELETED
|
The diff for this file is too large to render.
|