Twin committed
Commit 59cab28 · 1 Parent(s): 0c2f645

final space commit

.gitignore CHANGED
@@ -45,7 +45,7 @@ venv.bak/
  ehthumbs.db
  Thumbs.db

- # Large model files (use HuggingFace Hub instead)
+ # Large model files (use HuggingFace Hub)
  models/bert_pii/
  *.safetensors
  *.bin
DEPLOYMENT.md DELETED
@@ -1,170 +0,0 @@
- # 🚀 PII Masking Space - Deployment Guide
-
- ## 📋 Structure Created
-
- ```
- space/
- ├── README.md                 # ✅ HF Space configuration
- ├── Dockerfile                # ✅ CPU-optimized container
- ├── requirements.txt          # ✅ Python dependencies
- ├── app.py                    # ✅ FastAPI application
- ├── static/
- │   └── index.html            # ✅ Modern web interface
- └── inference/
-     ├── __init__.py           # ✅ Module init
-     └── mistral_prompting.py  # ✅ Mistral service
- ```
-
- ## 🎯 Implemented Features
-
- ### ✅ **Mistral Prompting Service**
- - Integrated Mistral API
- - PII detection via structured JSON
- - Robust error handling
- - Built-in rate limiting
- - Full async support
-
- ### ✅ **FastAPI Application**
- - `/predict` and `/health` endpoints
- - Built-in HTML interface (fallback)
- - HTTP error handling
- - CORS configured
- - Detailed logging
-
- ### ✅ **User Interface**
- - Modern, responsive design
- - Loading states with spinner
- - Metrics display
- - User-facing error handling
- - Mobile-friendly
-
- ### ✅ **Docker Configuration**
- - Python 3.9-slim base image (CPU)
- - HF Spaces-compliant user
- - Built-in health check
- - Optimized for production
-
- ## 🔧 Required Configuration
-
- ### **Environment Variables**
- In the HF Space settings, add:
-
- ```
- MISTRAL_API_KEY=your_mistral_api_key_here
- ```
- **Type:** Secret (not Variable)
-
- ### **Space Configuration**
- The `README.md` file contains:
- ```yaml
- sdk: docker
- app_port: 7860
- ```
-
- ## 🚀 HuggingFace Deployment
-
- ### **Step 1: Create the Space**
- 1. Go to https://huggingface.co/spaces
- 2. Click "Create new Space"
- 3. Choose "Docker" as the SDK
- 4. Name your Space
-
- ### **Step 2: Upload the Files**
- ```bash
- git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
- cd YOUR_SPACE_NAME
-
- # Copy the Space files
- cp -r /path/to/pii-masking-200k/space/* .
- cp -r /path/to/pii-masking-200k/src .
-
- # Commit and push
- git add .
- git commit -m "Initial PII masking demo"
- git push
- ```
-
- ### **Step 3: Configure the Secrets**
- 1. Open the Space's Settings
- 2. Add `MISTRAL_API_KEY` as a Secret
- 3. Value = your Mistral API key
-
- ### **Step 4: Deployment**
- The Space builds automatically after the push.
-
- ## 🧪 Local Tests
-
- ### **Import Test**
- ```bash
- cd space
- uv run python -c "from inference.mistral_prompting import MistralPromptingService; print('✅ OK')"
- ```
-
- ### **Application Test**
- ```bash
- cd space
- # With a real API key in .env
- echo "MISTRAL_API_KEY=your_key" > .env
- uv run app.py
- ```
-
- Then open http://localhost:7860
-
- ## 📊 Expected Performance
-
- ### **Free HF Hardware**
- - **CPU:** 2 vCPU, 16GB RAM
- - **Mistral latency:** ~1-3s per request
- - **Concurrent users:** ~5-10
-
- ### **Possible Optimizations**
- - Redis cache for frequent requests
- - Batch processing for multiple texts
- - GPU upgrade for local BERT
-
- ## 🔍 Monitoring
-
- ### **Health Check**
- ```bash
- curl https://YOUR_SPACE.hf.space/health
- ```
-
- ### **Logs**
- Visible in the HF Spaces interface
-
- ### **Metrics**
- - Processing time shown in the UI
- - Number of entities detected
- - Method used
-
- ## 🛠️ Future Development
-
- ### **Planned Additions**
- 1. **BERT service** (`bert_inference.py`)
- 2. **Fine-tuned Mistral service** (`mistral_finetuned.py`)
- 3. **Method comparison**
- 4. **Results export**
-
- ### **UX Improvements**
- 1. **Pre-filled examples**
- 2. **Request history**
- 3. **Usage statistics**
- 4. **Batch upload mode**
-
- ## ⚠️ Current Limitations
-
- - **Only Mistral** implemented for now
- - **No cache** (each request = one API call)
- - **Basic rate limiting**
- - **No persistence** of results
-
- ## 🎉 Final Result
-
- A working PII masking demo with:
- - ✅ Modern, intuitive interface
- - ✅ Integrated Mistral API
- - ✅ Deployable on free HF Spaces
- - ✅ Production-ready code
- - ✅ Extensible to other methods
-
- **Ready to deploy! 🚀**
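
Reviewer note: the removed guide stores `MISTRAL_API_KEY` as a Space Secret, which the container sees as an environment variable. Below is a minimal sketch of a fail-fast lookup at startup. The helper name `load_mistral_api_key` is hypothetical and does not appear in the removed files; only the variable name comes from the guide.

```python
import os
from typing import Mapping, Optional


def load_mistral_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Mistral API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict keeps the helper testable.
    (Hypothetical helper, not part of the original repository.)
    """
    source = os.environ if env is None else env
    key = source.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; add it as a Secret in the Space settings"
        )
    return key
```

Calling this once when the application starts would surface a missing Secret in the Space logs instead of on the first `/predict` request.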
README.md CHANGED
@@ -13,43 +13,17 @@ license: mit
13
 
14
  A comprehensive demo for Personal Identifiable Information (PII) masking using multiple approaches:
15
 
16
- - 🤖 **BERT Token Classification** - Fast local inference
17
- - 🧠 **Mistral Prompting** - High accuracy via API
18
- - 🎯 **Mistral Fine-tuned** - Best performance via API
19
 
20
  ## Features
21
 
22
- **Multiple Methods** - Compare different PII masking approaches
23
- **Real-time Processing** - Instant text masking
24
- **Performance Metrics** - Processing time and entity counts
25
- **User-friendly Interface** - Simple copy-paste workflow
26
 
27
- ## Usage
28
-
29
- 1. Paste your text containing PII information
30
- 2. Select your preferred masking method
31
- 3. Click "Process Text" to get the masked version
32
- 4. Compare results across different methods
33
-
34
- ## Methods Comparison
35
-
36
- | Method | Speed | Accuracy | Cost |
37
- |--------|-------|----------|------|
38
- | BERT | ⚡ Fast | ⭐⭐⭐⭐ | Free |
39
- | Mistral Prompt | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
40
- | Mistral Fine-tuned | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
41
-
42
- ## Example
43
-
44
- **Input:**
45
- ```
46
- Hi, my name is John Smith and my email is john.smith@company.com. Call me at 555-1234.
47
- ```
48
-
49
- **Output:**
50
- ```
51
- Hi, my name is [FIRSTNAME_1] [LASTNAME_1] and my email is [EMAIL_1]. Call me at [PHONENUMBER_1].
52
- ```
53
 
54
  ## Technology Stack
55
 
 
13
 
14
  A comprehensive demo for Personal Identifiable Information (PII) masking using multiple approaches:
15
 
16
+ - **BERT Token Classification** - Fast local inference
17
+ - **Mistral Prompting** - High accuracy via API
18
+ - **Mistral Fine-tuned** - Best performance via API
19
 
20
  ## Features
21
 
22
+ - **Multiple Methods** - Compare different PII masking approaches
23
+ - **Real-time Processing** - Instant text masking
24
+ - **Performance Metrics** - Processing time and entity counts
25
+ - **User-friendly Interface** - Simple copy-paste workflow
26
27
 
28
  ## Technology Stack
29
 
inference/bert_classif.py CHANGED
@@ -271,14 +271,44 @@ class BERTInferenceService:
              return True # Adjacent or overlapping

          between_text = text[between_start:between_end]
-
-         # Merge if only whitespace and simple punctuation, and not too long
-         if len(between_text) <= 3 and between_text.strip() in ['', ',', '.', '-', '/', ':', ';']:
-             return True
-
-         # Also merge if it's just whitespace
-         if between_text.isspace() and len(between_text) <= 2:
-             return True
+         entity_type = span1['entity_type']
+
+         # More aggressive merging for specific entity types
+         if entity_type in ['PHONENUMBER', 'SSN', 'ACCOUNTNAME']:
+             # For phone numbers, SSNs, and account numbers, merge if separated by:
+             # - No gap (adjacent tokens)
+             # - Common phone/ID separators: spaces, dashes, parentheses, dots
+             if len(between_text) <= 5 and all(c in ' \t\n()-.' for c in between_text):
+                 return True
+
+         elif entity_type in ['FIRSTNAME', 'LASTNAME', 'MIDDLENAME']:
+             # For names, be more conservative - only merge with single space or initials
+             if len(between_text) <= 2 and between_text.strip() in ['', '.']:
+                 return True
+
+         elif entity_type in ['STREET', 'SECONDARYADDRESS', 'CITY']:
+             # For addresses, merge with spaces and common separators
+             if len(between_text) <= 3 and all(c in ' \t\n,-' for c in between_text):
+                 return True
+
+         elif entity_type in ['DATE', 'TIME']:
+             # For dates and times, merge with spaces, commas, and common separators
+             if len(between_text) <= 4 and all(c in ' \t\n,/-:' for c in between_text):
+                 return True
+
+         elif entity_type in ['EMAIL', 'URL']:
+             # For emails and URLs, only merge if directly adjacent (no gaps allowed)
+             if len(between_text) == 0:
+                 return True
+
+         else:
+             # Default behavior: merge if only whitespace and simple punctuation
+             if len(between_text) <= 3 and between_text.strip() in ['', ',', '.', '-', '/', ':', ';']:
+                 return True
+
+             # Also merge if it's just whitespace
+             if between_text.isspace() and len(between_text) <= 2:
+                 return True

          return False
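
Reviewer note: the per-type branches in this hunk can also be read as a lookup table of "maximum gap length" and "allowed gap characters". The sketch below restates only the gap check under that reading. `gap_allows_merge` and the table names are hypothetical, and the early return for adjacent or overlapping spans is left out. It keeps `ACCOUNTNAME` exactly as the diff lists it, although the accompanying comment speaks of account numbers, so the intended label may differ.

```python
from typing import Dict, Tuple

# (max gap length, allowed gap characters) per entity type, as listed in the hunk.
_MERGE_RULES: Dict[str, Tuple[int, str]] = {
    **dict.fromkeys(["PHONENUMBER", "SSN", "ACCOUNTNAME"], (5, " \t\n()-.")),
    **dict.fromkeys(["STREET", "SECONDARYADDRESS", "CITY"], (3, " \t\n,-")),
    **dict.fromkeys(["DATE", "TIME"], (4, " \t\n,/-:")),
    **dict.fromkeys(["EMAIL", "URL"], (0, "")),  # no gap allowed
}
_NAME_TYPES = {"FIRSTNAME", "LASTNAME", "MIDDLENAME"}
_DEFAULT_SEPARATORS = {"", ",", ".", "-", "/", ":", ";"}


def gap_allows_merge(entity_type: str, between_text: str) -> bool:
    """Return True if two same-type spans separated by `between_text` may merge."""
    if entity_type in _MERGE_RULES:
        max_len, allowed = _MERGE_RULES[entity_type]
        return len(between_text) <= max_len and all(c in allowed for c in between_text)
    if entity_type in _NAME_TYPES:
        # Names: at most a space or an initial's dot between the two spans.
        return len(between_text) <= 2 and between_text.strip() in ("", ".")
    # Default rule; the separate "whitespace only, length <= 2" check in the
    # hunk is already covered here because such gaps strip to "".
    return len(between_text) <= 3 and between_text.strip() in _DEFAULT_SEPARATORS


print(gap_allows_merge("PHONENUMBER", ") "))  # True
print(gap_allows_merge("EMAIL", " "))         # False
print(gap_allows_merge("FIRSTNAME", ", "))    # False
```

A table keeps the separators for each entity type in one place, which makes it easier to spot a type that falls through to the default rule unintentionally.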
inference/ocr_service.py CHANGED
@@ -38,7 +38,7 @@ class OCRService:
          self.client = Mistral(api_key=self.api_key)
          self.is_initialized = True

-         logger.info("🔧 OCR service initialized with Mistral API")
+         logger.info("OCR service initialized with Mistral API")

      async def extract_text_from_pdf(self, pdf_content: bytes) -> str:
          """
@@ -54,7 +54,7 @@ class OCRService:
              # Encode PDF content to base64
              base64_pdf = base64.b64encode(pdf_content).decode('utf-8')

-             logger.info(f"📄 Processing PDF ({len(pdf_content)} bytes) with Mistral OCR...")
+             logger.info(f"Processing PDF ({len(pdf_content)} bytes) with Mistral OCR...")

              # Process the PDF with OCR
              ocr_response = self.client.ocr.process(
@@ -66,34 +66,34 @@ class OCRService:
                  include_image_base64=False # Don't include images to save bandwidth
              )

-             logger.info("OCR processing completed")
+             logger.info("OCR processing completed")

              # Extract text from all pages
              extracted_text = ""

              if hasattr(ocr_response, 'pages') and ocr_response.pages:
-                 logger.info(f"📄 Found {len(ocr_response.pages)} pages")
+                 logger.info(f"Found {len(ocr_response.pages)} pages")

                  for i, page in enumerate(ocr_response.pages):
                      if hasattr(page, 'markdown') and page.markdown:
                          page_text = page.markdown
                          extracted_text += page_text + "\n\n"
-                         logger.debug(f"📝 Page {i+1}: {len(page_text)} characters")
+                         logger.debug(f"Page {i+1}: {len(page_text)} characters")

-                 logger.info(f"📄 Total extracted text: {len(extracted_text)} characters")
+                 logger.info(f"Total extracted text: {len(extracted_text)} characters")

                  if not extracted_text.strip():
-                     logger.warning("⚠️ No text extracted from PDF")
+                     logger.warning("No text extracted from PDF")
                      return "No text could be extracted from this PDF."

                  return extracted_text.strip()

              else:
-                 logger.warning("⚠️ No pages found in OCR response")
+                 logger.warning("No pages found in OCR response")
                  return "No text could be extracted from this PDF."

          except Exception as e:
-             logger.error(f"OCR processing failed: {e}")
+             logger.error(f"OCR processing failed: {e}")
              raise RuntimeError(f"Failed to extract text from PDF: {str(e)}")

      def get_service_info(self) -> Dict[str, Any]:
@@ -119,8 +119,8 @@ async def create_ocr_service(api_key: Optional[str] = None) -> OCRService:
      """
      try:
          service = OCRService(api_key)
-         logger.info("OCR service created successfully")
+         logger.info("OCR service created successfully")
          return service
      except Exception as e:
-         logger.error(f"Failed to create OCR service: {e}")
+         logger.error(f"Failed to create OCR service: {e}")
          raise
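
Reviewer note: the page loop in `extract_text_from_pdf` concatenates each page's `markdown` and falls back to a fixed message when nothing is found. The sketch below shows that step in isolation, with no API call. `join_page_markdown` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the OCR response pages, which are assumed to expose a `markdown` attribute as in the hunk.

```python
from types import SimpleNamespace
from typing import Any, Iterable

NO_TEXT_MESSAGE = "No text could be extracted from this PDF."


def join_page_markdown(pages: Iterable[Any]) -> str:
    """Join the non-empty `markdown` of each page with blank lines between pages."""
    parts = [page.markdown for page in pages if getattr(page, "markdown", None)]
    text = "\n\n".join(parts).strip()
    # Same fallback string the service returns for empty or missing pages.
    return text or NO_TEXT_MESSAGE


# Stand-in pages; a real response would come from the OCR client.
pages = [
    SimpleNamespace(markdown="# Title"),
    SimpleNamespace(markdown=""),  # skipped, like an empty page in the service
    SimpleNamespace(markdown="Body text"),
]
print(join_page_markdown(pages))
```

Building a list and joining once avoids the repeated string concatenation in the loop and makes the trailing-separator `strip()` a formality rather than a requirement.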
requirements.txt DELETED
@@ -1,24 +0,0 @@
- # FastAPI and web server
- fastapi==0.104.1
- uvicorn[standard]==0.24.0
- pydantic==2.5.0
- python-multipart==0.0.6
-
- # HTTP Client for API calls
- httpx>=0.27.0
- aiohttp>=3.9.0
-
- # Mistral AI client
- mistralai>=1.0.0
-
- # Utilities
- python-dotenv==1.0.0
- loguru==0.7.2
-
- # Basic data processing
- numpy>=1.24.0
-
- # BERT and ML dependencies
- torch>=2.0.0,<2.3.0
- transformers>=4.35.0
- tokenizers>=0.15.0
test_ocr.py DELETED
@@ -1,133 +0,0 @@
- #!/usr/bin/env python3
- """
- Test script for Mistral OCR functionality.
- """
-
- import os
- import sys
- import asyncio
- import logging
- from pathlib import Path
- from dotenv import load_dotenv
-
- # Load environment variables from .env file
- load_dotenv()
-
- # Add the inference directory to Python path
- sys.path.insert(0, str(Path(__file__).parent / "inference"))
-
- from mistralai import Mistral
-
- # Setup logging
- logging.basicConfig(level=logging.INFO)
- logger = logging.getLogger(__name__)
-
- async def test_mistral_ocr():
-     """Test Mistral OCR with a sample PDF."""
-
-     # Check if API key is available
-     api_key = os.environ.get("MISTRAL_API_KEY")
-     if not api_key:
-         logger.error("❌ MISTRAL_API_KEY not found in environment variables")
-         return
-
-     try:
-         # Initialize Mistral client
-         client = Mistral(api_key=api_key)
-         logger.info("✅ Mistral client initialized")
-
-         # Test with a sample PDF from arXiv
-         test_pdf_url = "https://arxiv.org/pdf/2201.04234"
-
-         logger.info(f"🔍 Testing OCR with PDF: {test_pdf_url}")
-
-         # Process the PDF with OCR
-         ocr_response = client.ocr.process(
-             model="mistral-ocr-latest",
-             document={
-                 "type": "document_url",
-                 "document_url": test_pdf_url
-             },
-             include_image_base64=False # Don't include images for testing
-         )
-
-         logger.info("✅ OCR processing completed")
-
-         # Extract the text content
-         logger.info(f"📊 OCR Response structure:")
-         logger.info(f" - Type: {type(ocr_response)}")
-         logger.info(f" - Has pages: {hasattr(ocr_response, 'pages')}")
-         logger.info(f" - Has content: {hasattr(ocr_response, 'content')}")
-
-         extracted_text = ""
-
-         if hasattr(ocr_response, 'pages') and ocr_response.pages:
-             logger.info(f"📄 Found {len(ocr_response.pages)} pages")
-
-             # Extract text from all pages
-             for i, page in enumerate(ocr_response.pages):
-                 logger.info(f"📃 Page {i+1} structure: {dir(page)}")
-
-                 if hasattr(page, 'markdown') and page.markdown:
-                     page_text = page.markdown
-                     extracted_text += page_text + "\n\n"
-                     logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                     logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                 elif hasattr(page, 'content'):
-                     page_text = page.content
-                     extracted_text += page_text + "\n\n"
-                     logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                     logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                 elif hasattr(page, 'text'):
-                     page_text = page.text
-                     extracted_text += page_text + "\n\n"
-                     logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                     logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                 else:
-                     logger.info(f"⚠️ Page {i+1} attributes: {[attr for attr in dir(page) if not attr.startswith('_')]}")
-
-             if extracted_text:
-                 logger.info(f"📄 Total extracted text length: {len(extracted_text)} characters")
-
-                 # Test if we can use this text for PII detection
-                 if len(extracted_text) > 50:
-                     logger.info("✅ OCR extraction successful - text is suitable for PII detection")
-
-                     # Try to import and test PII detection
-                     try:
-                         from mistral_prompting import create_mistral_service
-
-                         logger.info("🔍 Testing PII detection on OCR text...")
-                         service = await create_mistral_service()
-
-                         # Use a small sample of the text for testing
-                         sample_text = extracted_text[:500] # First 500 characters
-                         prediction = await service.predict(sample_text)
-
-                         logger.info(f"📊 PII detection results:")
-                         logger.info(f" - Entities found: {len(prediction.entities)}")
-                         logger.info(f" - Spans detected: {len(prediction.spans)}")
-                         logger.info(f" - Masked text preview: {prediction.masked_text[:100]}...")
-
-                     except Exception as e:
-                         logger.warning(f"⚠️ PII detection test failed: {e}")
-                         logger.info("💡 OCR works, but PII detection needs API key setup")
-                 else:
-                     logger.warning("⚠️ Extracted text too short")
-             else:
-                 logger.warning("⚠️ No text extracted from pages")
-
-         elif hasattr(ocr_response, 'content'):
-             extracted_text = ocr_response.content
-             logger.info(f"📄 Extracted text length: {len(extracted_text)} characters")
-             logger.info(f"📝 First 200 characters: {extracted_text[:200]}...")
-         else:
-             logger.error("❌ No content found in OCR response")
-             logger.info(f"Available attributes: {[attr for attr in dir(ocr_response) if not attr.startswith('_')]}")
-
-     except Exception as e:
-         logger.error(f"❌ OCR test failed: {e}")
-         logger.info("💡 Make sure MISTRAL_API_KEY is set correctly")
-
- if __name__ == "__main__":
-     asyncio.run(test_mistral_ocr())
test_reconstruction.py DELETED
@@ -1,77 +0,0 @@
- #!/usr/bin/env python3
- """
- Test script to verify that PII reconstruction works with correct numbering.
- """
-
- import sys
- from pathlib import Path
-
- # Add the src directory to Python path
- sys.path.insert(0, str(Path(__file__).parent / "src"))
-
- from pii_masking.text_processing import reconstruct_masked_text
-
- def test_reconstruction():
-     """Test the reconstruction function with multiple entities of the same type."""
-
-     # Test case with multiple entities of the same type
-     text = "John T. Smith lives at 4872 Willow Creek Drive Apartment 14B Springfield, IL 62701. Phone: 555-1234. Email: john@email.com. On January 15, 2024 Ms. Alice A. Johnson from Records Office at 9135 Westfield Parkway, Suite 320 Hartford, CT 06101. Dear Ms. Johnson, I am writing regarding my claim from December 2024. As of February 1, 2024, my address is 4872 Willow Creek Drive, Apartment 14B, Springfield, IL 62701."
-
-     # Simulate PII entities found in the text
-     pii_dict = {
-         "FIRSTNAME": ["John", "Alice"],
-         "LASTNAME": ["Smith", "Johnson"],
-         "CITY": ["Springfield", "Hartford"],
-         "STATE": ["IL", "CT"],
-         "ZIPCODE": ["62701", "06101"],
-         "PHONENUMBER": ["555-1234"],
-         "EMAIL": ["john@email.com"],
-         "DATE": ["January 15, 2024", "December 2024", "February 1, 2024"]
-     }
-
-     print("🧪 Testing PII reconstruction with correct numbering")
-     print("=" * 60)
-     print(f"Original text: {text[:100]}...")
-     print()
-
-     masked_text = reconstruct_masked_text(text, pii_dict)
-
-     print("🎭 Masked text:")
-     print(masked_text)
-     print()
-
-     # Check if numbering is correct (first occurrence should be _1, second _2, etc.)
-     print("🔍 Checking numbering order:")
-
-     # Check FIRSTNAMEs
-     john_pos = masked_text.find("[FIRSTNAME_1]")
-     alice_pos = masked_text.find("[FIRSTNAME_2]")
-     print(f" FIRSTNAME_1 position: {john_pos}")
-     print(f" FIRSTNAME_2 position: {alice_pos}")
-     print(f" ✅ Correct order: {john_pos < alice_pos}")
-
-     # Check LASTNAMEs
-     smith_pos = masked_text.find("[LASTNAME_1]")
-     johnson_pos = masked_text.find("[LASTNAME_2]")
-     print(f" LASTNAME_1 position: {smith_pos}")
-     print(f" LASTNAME_2 position: {johnson_pos}")
-     print(f" ✅ Correct order: {smith_pos < johnson_pos}")
-
-     # Check CITYs
-     springfield_pos = masked_text.find("[CITY_1]")
-     hartford_pos = masked_text.find("[CITY_2]")
-     print(f" CITY_1 position: {springfield_pos}")
-     print(f" CITY_2 position: {hartford_pos}")
-     print(f" ✅ Correct order: {springfield_pos < hartford_pos}")
-
-     # Check DATEs
-     date1_pos = masked_text.find("[DATE_1]")
-     date2_pos = masked_text.find("[DATE_2]")
-     date3_pos = masked_text.find("[DATE_3]")
-     print(f" DATE_1 position: {date1_pos}")
-     print(f" DATE_2 position: {date2_pos}")
-     print(f" DATE_3 position: {date3_pos}")
-     print(f" ✅ Correct order: {date1_pos < date2_pos < date3_pos}")
-
- if __name__ == "__main__":
-     test_reconstruction()
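
Reviewer note: the removed test checks that placeholders are numbered by order of first appearance in the text (`[FIRSTNAME_1]` before `[FIRSTNAME_2]`). The implementation of `reconstruct_masked_text` is not part of this commit, so the sketch below is only one possible way to get that behavior. The function name `mask_with_ordered_placeholders` is hypothetical, and it uses plain case-sensitive substring matching, which is an assumption.

```python
import re
from typing import Dict, List, Tuple


def mask_with_ordered_placeholders(text: str, pii: Dict[str, List[str]]) -> str:
    """Replace PII values with [TYPE_n], numbering each type by first appearance."""
    # Collect every occurrence as (start, end, entity_type, value).
    matches: List[Tuple[int, int, str, str]] = []
    for entity_type, values in pii.items():
        for value in dict.fromkeys(values):  # de-duplicate, keep input order
            for m in re.finditer(re.escape(value), text):
                matches.append((m.start(), m.end(), entity_type, value))
    # Left to right; at the same start, prefer the longest match.
    matches.sort(key=lambda t: (t[0], -(t[1] - t[0])))

    pieces: List[str] = []
    cursor = 0
    numbering: Dict[Tuple[str, str], int] = {}  # (type, value) -> assigned index
    counters: Dict[str, int] = {}               # type -> last index used
    for start, end, entity_type, value in matches:
        if start < cursor:
            continue  # overlaps a span that was already masked
        key = (entity_type, value)
        if key not in numbering:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            numbering[key] = counters[entity_type]
        pieces.append(text[cursor:start])
        pieces.append(f"[{entity_type}_{numbering[key]}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


example = "Alice met Bob. Bob called Alice."
print(mask_with_ordered_placeholders(example, {"FIRSTNAME": ["Bob", "Alice"]}))
# [FIRSTNAME_1] met [FIRSTNAME_2]. [FIRSTNAME_2] called [FIRSTNAME_1].
```

Because matching is by substring, short values such as a two-letter state code can also match inside longer words. A span-based variant that uses character offsets from the detector would avoid that.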
uv.lock DELETED
The diff for this file is too large to render. See raw diff