Space: SoelMgd/pii_masking


Twin committed on Sep 3, 2025

Commit 59cab28 (1 parent: 0c2f645)

final space commit


Files changed (9)

.gitignore +1 -1
DEPLOYMENT.md +0 -170
README.md +7 -33
inference/bert_classif.py +38 -8
inference/ocr_service.py +11 -11
requirements.txt +0 -24
test_ocr.py +0 -133
test_reconstruction.py +0 -77
uv.lock +0 -0

.gitignore CHANGED

@@ -45,7 +45,7 @@ venv.bak/
 ehthumbs.db
 Thumbs.db
-# Large model files (use HuggingFace Hub instead)
+# Large model files (use HuggingFace Hub)
 models/bert_pii/
 *.safetensors
 *.bin

DEPLOYMENT.md DELETED

@@ -1,170 +0,0 @@
-# 🚀 PII Masking Space - Deployment Guide
-## 📋 Structure Created
-```
-space/
-├── README.md                    # ✅ HF Space configuration
-├── Dockerfile                   # ✅ CPU-optimized container
-├── requirements.txt             # ✅ Python dependencies
-├── app.py                       # ✅ FastAPI application
-├── static/
-│   └── index.html              # ✅ Modern web interface
-└── inference/
-    ├── __init__.py             # ✅ Module init
-    └── mistral_prompting.py    # ✅ Mistral service
-```
-## 🎯 Implemented Features
-### ✅ **Mistral Prompting Service**
-- Integrated Mistral API
-- PII detection via structured JSON
-- Robust error handling
-- Built-in rate limiting
-- Full async support
-### ✅ **FastAPI Application**
-- `/predict` and `/health` endpoints
-- Embedded HTML interface (fallback)
-- HTTP error handling
-- CORS configured
-- Detailed logging
-### ✅ **User Interface**
-- Modern, responsive design
-- Loading states with spinner
-- Metrics display
-- User-facing error handling
-- Mobile-friendly
-### ✅ **Docker Configuration**
-- Python 3.9-slim base image (CPU)
-- HF Spaces-compliant user
-- Built-in health check
-- Optimized for production
-## 🔧 Required Configuration
-### **Environment Variables**
-In the HF Space settings, add:
-```
-MISTRAL_API_KEY=your_mistral_api_key_here
-```
-**Type:** Secret (not Variable)
-### **Space Configuration**
-The `README.md` file contains:
-```yaml
-sdk: docker
-app_port: 7860
-```
-## 🚀 HuggingFace Deployment
-### **Step 1: Create the Space**
-1. Go to https://huggingface.co/spaces
-2. Click "Create new Space"
-3. Choose "Docker" as the SDK
-4. Name your Space
-### **Step 2: Upload the Files**
-```bash
-git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-cd YOUR_SPACE_NAME
-# Copy the Space files
-cp -r /path/to/pii-masking-200k/space/* .
-cp -r /path/to/pii-masking-200k/src .
-# Commit and push
-git add .
-git commit -m "Initial PII masking demo"
-git push
-```
-### **Step 3: Configure Secrets**
-1. Go to the Space's Settings
-2. Add `MISTRAL_API_KEY` as a Secret
-3. Value = your Mistral API key
-### **Step 4: Deployment**
-The Space builds automatically after the push.
-## 🧪 Local Tests
-### **Import Test**
-```bash
-cd space
-uv run python -c "from inference.mistral_prompting import MistralPromptingService; print('✅ OK')"
-```
-### **Application Test**
-```bash
-cd space
-# With a real API key in .env
-echo "MISTRAL_API_KEY=your_key" > .env
-uv run app.py
-```
-Then open http://localhost:7860
-## 📊 Expected Performance
-### **Free HF Hardware**
-- **CPU:** 2 vCPU, 16GB RAM
-- **Mistral latency:** ~1-3s per request
-- **Concurrent users:** ~5-10
-### **Possible Optimizations**
-- Redis cache for frequent requests
-- Batch processing for multiple texts
-- GPU upgrade for local BERT
-## 🔍 Monitoring
-### **Health Check**
-```bash
-curl https://YOUR_SPACE.hf.space/health
-```
-### **Logs**
-Visible in the HF Spaces interface
-### **Metrics**
-- Processing time shown in the UI
-- Number of entities detected
-- Method used
-## 🛠️ Future Development
-### **Planned Additions**
-1. **BERT Service** (`bert_inference.py`)
-2. **Fine-tuned Mistral Service** (`mistral_finetuned.py`)
-3. **Method comparison**
-4. **Results export**
-### **UX Improvements**
-1. **Pre-filled examples**
-2. **Request history**
-3. **Usage statistics**
-4. **Batch upload mode**
-## ⚠️ Current Limitations
-- **Only Mistral** implemented for now
-- **No cache** (each request = one API call)
-- **Basic rate limiting**
-- **No persistence** of results
-## 🎉 Final Result
-A working PII masking demo with:
-- ✅ Modern, intuitive interface
-- ✅ Integrated Mistral API
-- ✅ Deployable on free HF Spaces
-- ✅ Production-ready code
-- ✅ Extensible to other methods
-**Ready to deploy! 🚀**

README.md CHANGED

@@ -13,43 +13,17 @@ license: mit
 A comprehensive demo for Personal Identifiable Information (PII) masking using multiple approaches:
-- 🤖 **BERT Token Classification** - Fast local inference
-- 🧠 **Mistral Prompting** - High accuracy via API
-- 🎯 **Mistral Fine-tuned** - Best performance via API
+- **BERT Token Classification** - Fast local inference
+- **Mistral Prompting** - High accuracy via API
+- **Mistral Fine-tuned** - Best performance via API
 ## Features
-✅ **Multiple Methods** - Compare different PII masking approaches
-✅ **Real-time Processing** - Instant text masking
-✅ **Performance Metrics** - Processing time and entity counts
-✅ **User-friendly Interface** - Simple copy-paste workflow
-## Usage
-1. Paste your text containing PII information
-2. Select your preferred masking method
-3. Click "Process Text" to get the masked version
-4. Compare results across different methods
-## Methods Comparison
-| Method | Speed | Accuracy | Cost |
-|--------|-------|----------|------|
-| BERT | ⚡ Fast | ⭐⭐⭐⭐ | Free |
-| Mistral Prompt | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
-| Mistral Fine-tuned | 🐌 Slow | ⭐⭐⭐⭐⭐ | API |
-## Example
-**Input:**
-```
-Hi, my name is John Smith and my email is john.smith@company.com. Call me at 555-1234.
-```
-**Output:**
-```
-Hi, my name is [FIRSTNAME_1] [LASTNAME_1] and my email is [EMAIL_1]. Call me at [PHONENUMBER_1].
-```
+- **Multiple Methods** - Compare different PII masking approaches
+- **Real-time Processing** - Instant text masking
+- **Performance Metrics** - Processing time and entity counts
+- **User-friendly Interface** - Simple copy-paste workflow
 ## Technology Stack

inference/bert_classif.py CHANGED

@@ -271,14 +271,44 @@ class BERTInferenceService:
             return True  # Adjacent or overlapping
         between_text = text[between_start:between_end]
-        # Merge if only whitespace and simple punctuation, and not too long
-        if len(between_text) <= 3 and between_text.strip() in ['', ',', '.', '-', '/', ':', ';']:
-            return True
-        # Also merge if it's just whitespace
-        if between_text.isspace() and len(between_text) <= 2:
-            return True
+        entity_type = span1['entity_type']
+        # More aggressive merging for specific entity types
+        if entity_type in ['PHONENUMBER', 'SSN', 'ACCOUNTNAME']:
+            # For phone numbers, SSNs, and account numbers, merge if separated by:
+            # - No gap (adjacent tokens)
+            # - Common phone/ID separators: spaces, dashes, parentheses, dots
+            if len(between_text) <= 5 and all(c in ' \t\n()-.' for c in between_text):
+                return True
+        elif entity_type in ['FIRSTNAME', 'LASTNAME', 'MIDDLENAME']:
+            # For names, be more conservative - only merge with single space or initials
+            if len(between_text) <= 2 and between_text.strip() in ['', '.']:
+                return True
+        elif entity_type in ['STREET', 'SECONDARYADDRESS', 'CITY']:
+            # For addresses, merge with spaces and common separators
+            if len(between_text) <= 3 and all(c in ' \t\n,-' for c in between_text):
+                return True
+        elif entity_type in ['DATE', 'TIME']:
+            # For dates and times, merge with spaces, commas, and common separators
+            if len(between_text) <= 4 and all(c in ' \t\n,/-:' for c in between_text):
+                return True
+        elif entity_type in ['EMAIL', 'URL']:
+            # For emails and URLs, only merge if directly adjacent (no gaps allowed)
+            if len(between_text) == 0:
+                return True
+        else:
+            # Default behavior: merge if only whitespace and simple punctuation
+            if len(between_text) <= 3 and between_text.strip() in ['', ',', '.', '-', '/', ':', ';']:
+                return True
+            # Also merge if it's just whitespace
+            if between_text.isspace() and len(between_text) <= 2:
+                return True
         return False
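
Most branches of the new merge check share one shape: a maximum gap length plus a set of characters allowed in the gap. The sketch below restates that idea as a lookup table. It is a standalone illustration, not the Space's code: `MERGE_RULES`, `DEFAULT_RULE`, and `gap_allows_merge` are hypothetical names, only some entity types from the hunk are listed, the name branch is left out because it tests the stripped gap against a fixed list instead of a character set, and the default rule is a simplification of the committed fallback.

```python
# Hypothetical table: entity type -> (max gap length, characters allowed in the gap).
# Values are copied from the hunk above; the table itself is not in the commit.
MERGE_RULES = {
    "PHONENUMBER": (5, " \t\n()-."),
    "SSN": (5, " \t\n()-."),
    "STREET": (3, " \t\n,-"),
    "CITY": (3, " \t\n,-"),
    "DATE": (4, " \t\n,/-:"),
    "TIME": (4, " \t\n,/-:"),
    "EMAIL": (0, ""),  # only directly adjacent spans merge
    "URL": (0, ""),
}

# Simplified stand-in for the default branch (assumption, not the committed logic).
DEFAULT_RULE = (3, " \t\n,.-/:;")


def gap_allows_merge(entity_type: str, between_text: str) -> bool:
    """Return True if the text between two same-type spans permits merging."""
    max_len, allowed = MERGE_RULES.get(entity_type, DEFAULT_RULE)
    return len(between_text) <= max_len and all(c in allowed for c in between_text)


if __name__ == "__main__":
    print(gap_allows_merge("PHONENUMBER", ") "))  # gap inside "(555) 123"
    print(gap_allows_merge("EMAIL", " "))         # any gap blocks the merge
    print(gap_allows_merge("CITY", " and "))      # too long and contains letters
```

A table like this keeps the per-type thresholds in one place, at the cost of handling rules that do not fit the "length plus character set" shape separately.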

inference/ocr_service.py CHANGED

@@ -38,7 +38,7 @@ class OCRService:
         self.client = Mistral(api_key=self.api_key)
         self.is_initialized = True
-        logger.info("🔧 OCR service initialized with Mistral API")
+        logger.info("OCR service initialized with Mistral API")
     async def extract_text_from_pdf(self, pdf_content: bytes) -> str:
         """
@@ -54,7 +54,7 @@ class OCRService:
             # Encode PDF content to base64
             base64_pdf = base64.b64encode(pdf_content).decode('utf-8')
-            logger.info(f"📄 Processing PDF ({len(pdf_content)} bytes) with Mistral OCR...")
+            logger.info(f"Processing PDF ({len(pdf_content)} bytes) with Mistral OCR...")
             # Process the PDF with OCR
             ocr_response = self.client.ocr.process(
@@ -66,34 +66,34 @@ class OCRService:
                 include_image_base64=False  # Don't include images to save bandwidth
             )
-            logger.info("✅ OCR processing completed")
+            logger.info("OCR processing completed")
             # Extract text from all pages
             extracted_text = ""
             if hasattr(ocr_response, 'pages') and ocr_response.pages:
-                logger.info(f"📄 Found {len(ocr_response.pages)} pages")
+                logger.info(f"Found {len(ocr_response.pages)} pages")
                 for i, page in enumerate(ocr_response.pages):
                     if hasattr(page, 'markdown') and page.markdown:
                         page_text = page.markdown
                         extracted_text += page_text + "\n\n"
-                        logger.debug(f"📝 Page {i+1}: {len(page_text)} characters")
-                logger.info(f"📄 Total extracted text: {len(extracted_text)} characters")
+                        logger.debug(f"Page {i+1}: {len(page_text)} characters")
+                logger.info(f"Total extracted text: {len(extracted_text)} characters")
                 if not extracted_text.strip():
-                    logger.warning("⚠️  No text extracted from PDF")
+                    logger.warning("No text extracted from PDF")
                     return "No text could be extracted from this PDF."
                 return extracted_text.strip()
             else:
-                logger.warning("⚠️  No pages found in OCR response")
+                logger.warning("No pages found in OCR response")
                 return "No text could be extracted from this PDF."
         except Exception as e:
-            logger.error(f"❌ OCR processing failed: {e}")
+            logger.error(f"OCR processing failed: {e}")
             raise RuntimeError(f"Failed to extract text from PDF: {str(e)}")
     def get_service_info(self) -> Dict[str, Any]:
@@ -119,8 +119,8 @@ async def create_ocr_service(api_key: Optional[str] = None) -> OCRService:
     """
     try:
         service = OCRService(api_key)
-        logger.info("✅ OCR service created successfully")
+        logger.info("OCR service created successfully")
         return service
     except Exception as e:
-        logger.error(f"❌ Failed to create OCR service: {e}")
+        logger.error(f"Failed to create OCR service: {e}")
         raise
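
The page-handling part of `extract_text_from_pdf` can be exercised without calling the Mistral API. The sketch below isolates that step under stated assumptions: `join_page_markdown` is a hypothetical helper (the commit keeps this logic inline), and `SimpleNamespace` objects stand in for the OCR response, whose only attributes used here are the `pages` and `markdown` fields that the diff itself reads.

```python
from types import SimpleNamespace

# Fallback message copied from the diff above.
NO_TEXT_MESSAGE = "No text could be extracted from this PDF."


def join_page_markdown(ocr_response) -> str:
    """Concatenate the markdown of every non-empty page, separated by blank lines."""
    pages = getattr(ocr_response, "pages", None) or []
    parts = [page.markdown for page in pages if getattr(page, "markdown", None)]
    text = "\n\n".join(parts).strip()
    return text or NO_TEXT_MESSAGE


if __name__ == "__main__":
    # Stand-in response objects; a real response comes from the Mistral client.
    fake_response = SimpleNamespace(
        pages=[
            SimpleNamespace(markdown="# Page one"),
            SimpleNamespace(markdown=""),  # empty page is skipped
            SimpleNamespace(markdown="Page two"),
        ]
    )
    print(join_page_markdown(fake_response))
    print(join_page_markdown(SimpleNamespace()))  # no pages attribute at all
```

Pulling the step into a pure function like this would make the fallback paths testable with plain objects, which the deleted `test_ocr.py` could only check against the live API.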

requirements.txt DELETED

@@ -1,24 +0,0 @@
-# FastAPI and web server
-fastapi==0.104.1
-uvicorn[standard]==0.24.0
-pydantic==2.5.0
-python-multipart==0.0.6
-# HTTP Client for API calls
-httpx>=0.27.0
-aiohttp>=3.9.0
-# Mistral AI client
-mistralai>=1.0.0
-# Utilities
-python-dotenv==1.0.0
-loguru==0.7.2
-# Basic data processing
-numpy>=1.24.0
-# BERT and ML dependencies
-torch>=2.0.0,<2.3.0
-transformers>=4.35.0
-tokenizers>=0.15.0

test_ocr.py DELETED

@@ -1,133 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script for Mistral OCR functionality.
-"""
-import os
-import sys
-import asyncio
-import logging
-from pathlib import Path
-from dotenv import load_dotenv
-# Load environment variables from .env file
-load_dotenv()
-# Add the inference directory to Python path
-sys.path.insert(0, str(Path(__file__).parent / "inference"))
-from mistralai import Mistral
-# Setup logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-async def test_mistral_ocr():
-    """Test Mistral OCR with a sample PDF."""
-    # Check if API key is available
-    api_key = os.environ.get("MISTRAL_API_KEY")
-    if not api_key:
-        logger.error("❌ MISTRAL_API_KEY not found in environment variables")
-        return
-    try:
-        # Initialize Mistral client
-        client = Mistral(api_key=api_key)
-        logger.info("✅ Mistral client initialized")
-        # Test with a sample PDF from arXiv
-        test_pdf_url = "https://arxiv.org/pdf/2201.04234"
-        logger.info(f"🔍 Testing OCR with PDF: {test_pdf_url}")
-        # Process the PDF with OCR
-        ocr_response = client.ocr.process(
-            model="mistral-ocr-latest",
-            document={
-                "type": "document_url",
-                "document_url": test_pdf_url
-            },
-            include_image_base64=False  # Don't include images for testing
-        )
-        logger.info("✅ OCR processing completed")
-        # Extract the text content
-        logger.info(f"📊 OCR Response structure:")
-        logger.info(f"   - Type: {type(ocr_response)}")
-        logger.info(f"   - Has pages: {hasattr(ocr_response, 'pages')}")
-        logger.info(f"   - Has content: {hasattr(ocr_response, 'content')}")
-        extracted_text = ""
-        if hasattr(ocr_response, 'pages') and ocr_response.pages:
-            logger.info(f"📄 Found {len(ocr_response.pages)} pages")
-            # Extract text from all pages
-            for i, page in enumerate(ocr_response.pages):
-                logger.info(f"📃 Page {i+1} structure: {dir(page)}")
-                if hasattr(page, 'markdown') and page.markdown:
-                    page_text = page.markdown
-                    extracted_text += page_text + "\n\n"
-                    logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                    logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                elif hasattr(page, 'content'):
-                    page_text = page.content
-                    extracted_text += page_text + "\n\n"
-                    logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                    logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                elif hasattr(page, 'text'):
-                    page_text = page.text
-                    extracted_text += page_text + "\n\n"
-                    logger.info(f"📝 Page {i+1} text length: {len(page_text)} characters")
-                    logger.info(f"📝 Page {i+1} preview: {page_text[:200]}...")
-                else:
-                    logger.info(f"⚠️  Page {i+1} attributes: {[attr for attr in dir(page) if not attr.startswith('_')]}")
-            if extracted_text:
-                logger.info(f"📄 Total extracted text length: {len(extracted_text)} characters")
-                # Test if we can use this text for PII detection
-                if len(extracted_text) > 50:
-                    logger.info("✅ OCR extraction successful - text is suitable for PII detection")
-                    # Try to import and test PII detection
-                    try:
-                        from mistral_prompting import create_mistral_service
-                        logger.info("🔍 Testing PII detection on OCR text...")
-                        service = await create_mistral_service()
-                        # Use a small sample of the text for testing
-                        sample_text = extracted_text[:500]  # First 500 characters
-                        prediction = await service.predict(sample_text)
-                        logger.info(f"📊 PII detection results:")
-                        logger.info(f"   - Entities found: {len(prediction.entities)}")
-                        logger.info(f"   - Spans detected: {len(prediction.spans)}")
-                        logger.info(f"   - Masked text preview: {prediction.masked_text[:100]}...")
-                    except Exception as e:
-                        logger.warning(f"⚠️  PII detection test failed: {e}")
-                        logger.info("💡 OCR works, but PII detection needs API key setup")
-                else:
-                    logger.warning("⚠️  Extracted text too short")
-            else:
-                logger.warning("⚠️  No text extracted from pages")
-        elif hasattr(ocr_response, 'content'):
-            extracted_text = ocr_response.content
-            logger.info(f"📄 Extracted text length: {len(extracted_text)} characters")
-            logger.info(f"📝 First 200 characters: {extracted_text[:200]}...")
-        else:
-            logger.error("❌ No content found in OCR response")
-            logger.info(f"Available attributes: {[attr for attr in dir(ocr_response) if not attr.startswith('_')]}")
-    except Exception as e:
-        logger.error(f"❌ OCR test failed: {e}")
-        logger.info("💡 Make sure MISTRAL_API_KEY is set correctly")
-if __name__ == "__main__":
-    asyncio.run(test_mistral_ocr())

test_reconstruction.py DELETED

@@ -1,77 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify that PII reconstruction works with correct numbering.
-"""
-import sys
-from pathlib import Path
-# Add the src directory to Python path
-sys.path.insert(0, str(Path(__file__).parent / "src"))
-from pii_masking.text_processing import reconstruct_masked_text
-def test_reconstruction():
-    """Test the reconstruction function with multiple entities of the same type."""
-    # Test case with multiple entities of the same type
-    text = "John T. Smith lives at 4872 Willow Creek Drive Apartment 14B Springfield, IL 62701. Phone: 555-1234. Email: john@email.com. On January 15, 2024 Ms. Alice A. Johnson from Records Office at 9135 Westfield Parkway, Suite 320 Hartford, CT 06101. Dear Ms. Johnson, I am writing regarding my claim from December 2024. As of February 1, 2024, my address is 4872 Willow Creek Drive, Apartment 14B, Springfield, IL 62701."
-    # Simulate PII entities found in the text
-    pii_dict = {
-        "FIRSTNAME": ["John", "Alice"],
-        "LASTNAME": ["Smith", "Johnson"],
-        "CITY": ["Springfield", "Hartford"],
-        "STATE": ["IL", "CT"],
-        "ZIPCODE": ["62701", "06101"],
-        "PHONENUMBER": ["555-1234"],
-        "EMAIL": ["john@email.com"],
-        "DATE": ["January 15, 2024", "December 2024", "February 1, 2024"]
-    }
-    print("🧪 Testing PII reconstruction with correct numbering")
-    print("=" * 60)
-    print(f"Original text: {text[:100]}...")
-    print()
-    masked_text = reconstruct_masked_text(text, pii_dict)
-    print("🎭 Masked text:")
-    print(masked_text)
-    print()
-    # Check if numbering is correct (first occurrence should be _1, second _2, etc.)
-    print("🔍 Checking numbering order:")
-    # Check FIRSTNAMEs
-    john_pos = masked_text.find("[FIRSTNAME_1]")
-    alice_pos = masked_text.find("[FIRSTNAME_2]")
-    print(f"  FIRSTNAME_1 position: {john_pos}")
-    print(f"  FIRSTNAME_2 position: {alice_pos}")
-    print(f"  ✅ Correct order: {john_pos < alice_pos}")
-    # Check LASTNAMEs
-    smith_pos = masked_text.find("[LASTNAME_1]")
-    johnson_pos = masked_text.find("[LASTNAME_2]")
-    print(f"  LASTNAME_1 position: {smith_pos}")
-    print(f"  LASTNAME_2 position: {johnson_pos}")
-    print(f"  ✅ Correct order: {smith_pos < johnson_pos}")
-    # Check CITYs
-    springfield_pos = masked_text.find("[CITY_1]")
-    hartford_pos = masked_text.find("[CITY_2]")
-    print(f"  CITY_1 position: {springfield_pos}")
-    print(f"  CITY_2 position: {hartford_pos}")
-    print(f"  ✅ Correct order: {springfield_pos < hartford_pos}")
-    # Check DATEs
-    date1_pos = masked_text.find("[DATE_1]")
-    date2_pos = masked_text.find("[DATE_2]")
-    date3_pos = masked_text.find("[DATE_3]")
-    print(f"  DATE_1 position: {date1_pos}")
-    print(f"  DATE_2 position: {date2_pos}")
-    print(f"  DATE_3 position: {date3_pos}")
-    print(f"  ✅ Correct order: {date1_pos < date2_pos < date3_pos}")
-if __name__ == "__main__":
-    test_reconstruction()
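
The deleted script checks that placeholders are numbered by order of appearance in the text, not by order in `pii_dict`. The diff does not show `reconstruct_masked_text` itself, so the sketch below is only one way such numbering could work. `mask_in_reading_order` is a hypothetical function, and it uses plain substring matching, which a short value such as a state code could trigger inside unrelated words.

```python
import re


def mask_in_reading_order(text: str, pii_dict: dict[str, list[str]]) -> str:
    """Replace PII values with [TYPE_n], numbering each type by first appearance."""
    # Collect every occurrence as (start, end, entity_type, value).
    matches = []
    for entity_type, values in pii_dict.items():
        for value in dict.fromkeys(values):  # de-duplicate, keep order
            for m in re.finditer(re.escape(value), text):
                matches.append((m.start(), m.end(), entity_type, value))
    # Left to right; at the same start position prefer the longest match.
    matches.sort(key=lambda t: (t[0], -(t[1] - t[0])))

    out, cursor = [], 0
    numbers = {}   # (entity_type, value) -> assigned index
    counters = {}  # entity_type -> how many distinct values seen so far
    for start, end, entity_type, value in matches:
        if start < cursor:
            continue  # overlaps a span that was already masked
        key = (entity_type, value)
        if key not in numbers:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            numbers[key] = counters[entity_type]
        out.append(text[cursor:start])
        out.append(f"[{entity_type}_{numbers[key]}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    sample = "John met Alice. Later Alice called John on 555-1234."
    pii = {"FIRSTNAME": ["Alice", "John"], "PHONENUMBER": ["555-1234"]}
    print(mask_in_reading_order(sample, pii))
```

Here "John" receives `_1` even though it is listed second in the dictionary, because it appears first in the text; repeated mentions of the same value reuse the same number.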

uv.lock DELETED

The diff for this file is too large to render. See raw diff