Sparkonix committed on
Commit b7c31ab · 0 Parent(s)

add remote to repo

Files changed (9)
  1. .gitignore +47 -0
  2. Dockerfile +23 -0
  3. README.md +182 -0
  4. docker-compose.yml +12 -0
  5. main.py +71 -0
  6. models.py +81 -0
  7. requirements.txt +40 -0
  8. upload_model.py +93 -0
  9. utils.py +331 -0
.gitignore ADDED
@@ -0,0 +1,47 @@
+ # Ignore the classification model folder with large files
+ classification_model/
+
+ # Python artifacts
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ ENV/
+ env/
+
+ # IDE files
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # OS specific files
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
Dockerfile ADDED
@@ -0,0 +1,23 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the application
+ COPY . .
+
+ # Set environment variables
+ ENV PORT=7860
+ ENV MODEL_PATH="Sparkonix/email-classifier-model"
+ # Replace with your own "<username>/email-classifier-model" path after uploading the model
+
+ # Expose the port
+ EXPOSE 7860
+
+ # Command to run the application
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,182 @@
+ # Email Classification for Support Team
+
+ ## Project Overview
+
+ This project implements an email classification system that categorizes support emails into predefined categories while ensuring that personally identifiable information (PII) is masked before processing. The system combines regex patterns and Named Entity Recognition (NER) for PII masking with a pre-trained XLM-RoBERTa model for email classification.
+
+ ## Key Features
+
+ 1. **Email Classification**: Classifies support emails into four categories:
+    - Incident
+    - Request
+    - Change
+    - Problem
+
+ 2. **Personal Information Masking**: Detects and masks the following types of PII:
+    - Full name ("full_name")
+    - Email address ("email")
+    - Phone number ("phone_number")
+    - Date of birth ("dob")
+    - Aadhaar card number ("aadhar_num")
+    - Credit/debit card number ("credit_debit_no")
+    - CVV number ("cvv_no")
+    - Card expiry number ("expiry_no")
+
+ 3. **API Interface**: Exposes the solution as a RESTful API endpoint.
+
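The masking step in `utils.py` replaces each detected entity with a bracketed placeholder derived from its type (e.g. `[EMAIL]`, `[FULL_NAME]`). A minimal, self-contained sketch of that behavior for the email pattern only — the real `PIIMasker` additionally runs NER for names and contextual checks for card numbers and CVVs:

```python
import re

# Email pattern matching the one in utils.py (with the stray "|" removed from the TLD class)
EMAIL_RE = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

def mask_emails(text: str) -> str:
    """Replace every email address with the [EMAIL] placeholder."""
    return re.sub(EMAIL_RE, "[EMAIL]", text)

print(mask_emails("Please reach me at john.doe@example.com today."))
# Please reach me at [EMAIL] today.
```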
+ ## Project Structure
+
+ ```
+ .
+ ├── classification_model/   # Local model files (not used in deployment)
+ ├── docker-compose.yml      # Docker Compose configuration
+ ├── Dockerfile              # Docker configuration
+ ├── main.py                 # Main FastAPI application
+ ├── models.py               # Email classifier model implementation
+ ├── README.md               # Project documentation
+ ├── requirements.txt        # Python dependencies
+ ├── upload_model.py         # Script to upload the model to Hugging Face Hub
+ └── utils.py                # PII masker implementation
+ ```
+
+ ## Installation
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - [Docker](https://www.docker.com/) (optional)
+ - Hugging Face account for model hosting
+
+ ### Setup
+
+ 1. Clone the repository:
+    ```
+    git clone <repository-url>
+    cd email_classifier_project
+    ```
+
+ 2. Install dependencies:
+    ```
+    pip install -r requirements.txt
+    ```
+
+ 3. Run the application:
+    ```
+    python main.py
+    ```
+
+ ### Using Docker
+
+ 1. Build and run with Docker Compose:
+    ```
+    docker-compose up
+    ```
+
+ ## Uploading the Model to Hugging Face Hub
+
+ Before deploying the application to Hugging Face Spaces, you need to upload the model to the Hugging Face Model Hub:
+
+ 1. Install the Hugging Face CLI if you haven't already:
+    ```
+    pip install huggingface_hub
+    ```
+
+ 2. Log in to Hugging Face:
+    ```
+    huggingface-cli login
+    ```
+
+ 3. Create a new model repository on Hugging Face:
+    ```
+    huggingface-cli repo create email-classifier-model
+    ```
+
+ 4. Upload the model using Python:
+    ```python
+    from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
+
+    # Load the local model
+    model = XLMRobertaForSequenceClassification.from_pretrained("classification_model")
+    tokenizer = XLMRobertaTokenizer.from_pretrained("classification_model")
+
+    # Push to Hugging Face Hub
+    model.push_to_hub("YourUsername/email-classifier-model")
+    tokenizer.push_to_hub("YourUsername/email-classifier-model")
+    ```
+
+ 5. Update the `MODEL_PATH` environment variable in the Dockerfile with your Hugging Face model path:
+    ```
+    ENV MODEL_PATH="YourUsername/email-classifier-model"
+    ```
+
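At runtime, `models.py` resolves the model location from this environment variable, falling back to a default Hub id when the variable is unset. The lookup amounts to (the default id mirrors the one hard-coded in `models.py`; `"YourUsername/..."` is a placeholder):

```python
import os

def resolve_model_path(default: str = "Sparkonix11/email-classifier-model") -> str:
    # MODEL_PATH from the environment takes precedence, as in models.py
    return os.environ.get("MODEL_PATH", default)

os.environ["MODEL_PATH"] = "YourUsername/email-classifier-model"
print(resolve_model_path())
# YourUsername/email-classifier-model
```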
+ ## API Usage
+
+ The API exposes a single endpoint for email classification:
+
+ - **Endpoint**: `/classify`
+ - **Method**: POST
+ - **Input Format**:
+   ```json
+   {
+     "input_email_body": "string containing the email"
+   }
+   ```
+ - **Output Format**:
+   ```json
+   {
+     "input_email_body": "string containing the email",
+     "list_of_masked_entities": [
+       {
+         "position": [start_index, end_index],
+         "classification": "entity_type",
+         "entity": "original_entity_value"
+       }
+     ],
+     "masked_email": "string containing the masked email",
+     "category_of_the_email": "string containing the class"
+   }
+   ```
+
+ ## Example
+
+ ```python
+ import requests
+
+ url = "https://username-space-name.hf.space/classify"
+ data = {
+     "input_email_body": "Hello, my name is John Doe, and I'm having issues with my account."
+ }
+
+ response = requests.post(url, json=data)
+ print(response.json())
+ ```
+
+ ## Deployment to Hugging Face Spaces
+
+ 1. Create a new Space on Hugging Face:
+    - Go to https://huggingface.co/spaces
+    - Click "Create new Space"
+    - Choose a name for your Space
+    - Select "Docker" as the Space SDK
+
+ 2. Connect your GitHub repository to the Space:
+    - In the Space settings, go to "Repository"
+    - Enter your GitHub repository URL
+    - Authenticate with GitHub if prompted
+
+ 3. Ensure your Hugging Face Space has access to the model:
+    - Go to your model on Hugging Face Hub
+    - Go to "Settings" > "Collaborators"
+    - Add your Space as a collaborator with "Read" access
+
+ 4. Your API will be available at:
+    ```
+    https://username-space-name.hf.space/classify
+    ```
+
+ ## Technologies Used
+
+ - **FastAPI**: Web framework for building the API
+ - **SpaCy**: NLP library for PII detection and masking
+ - **Transformers**: Hugging Face library for the email classification model
+ - **PyTorch**: Deep learning framework
+ - **Docker**: Containerization for deployment
docker-compose.yml ADDED
@@ -0,0 +1,12 @@
+ version: '3'
+
+ services:
+   api:
+     build: .
+     ports:
+       - "8000:7860"
+     volumes:
+       - .:/app
+     environment:
+       - PORT=7860
+     restart: unless-stopped
main.py ADDED
@@ -0,0 +1,71 @@
+ import os
+ from fastapi import FastAPI, HTTPException
+ from pydantic import BaseModel
+ from typing import Dict, Any, List, Tuple
+ import uvicorn
+
+ from utils import PIIMasker
+ from models import EmailClassifier
+
+ # Initialize the FastAPI application
+ app = FastAPI(title="Email Classification API",
+               description="API for classifying support emails and masking PII",
+               version="1.0.0")
+
+ # Initialize the PII masker and email classifier
+ pii_masker = PIIMasker()
+ email_classifier = EmailClassifier()
+
+ class EmailInput(BaseModel):
+     """Input model for the email classification endpoint"""
+     input_email_body: str
+
+ class EntityInfo(BaseModel):
+     """Model for entity information"""
+     position: Tuple[int, int]
+     classification: str
+     entity: str
+
+ class EmailOutput(BaseModel):
+     """Output model for the email classification endpoint"""
+     input_email_body: str
+     list_of_masked_entities: List[EntityInfo]
+     masked_email: str
+     category_of_the_email: str
+
+ @app.post("/classify", response_model=EmailOutput)
+ async def classify_email(email_input: EmailInput) -> Dict[str, Any]:
+     """
+     Classify an email into a support category while masking PII
+
+     Args:
+         email_input: The input email data
+
+     Returns:
+         The classified email data with masked PII
+     """
+     try:
+         # Process the email to mask PII
+         processed_data = pii_masker.process_email(email_input.input_email_body)
+
+         # Classify the masked email
+         classified_data = email_classifier.process_email(processed_data)
+
+         return classified_data
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error processing email: {str(e)}")
+
+ @app.get("/health")
+ async def health_check():
+     """
+     Health check endpoint
+
+     Returns:
+         Status message indicating the API is running
+     """
+     return {"status": "healthy", "message": "Email classification API is running"}
+
+ # For local development and testing
+ if __name__ == "__main__":
+     port = int(os.environ.get("PORT", 8000))
+     uvicorn.run("main:app", host="0.0.0.0", port=port, reload=True)
models.py ADDED
@@ -0,0 +1,81 @@
+ import os
+ import torch
+ from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
+ from typing import Dict, Any, Optional
+
+ class EmailClassifier:
+     """
+     Email classification model to categorize emails into different support categories
+     """
+
+     CATEGORIES = ["Incident", "Request", "Change", "Problem"]
+
+     def __init__(self, model_path: Optional[str] = None):
+         """
+         Initialize the email classifier with a pre-trained model
+
+         Args:
+             model_path: Path or Hugging Face Hub model ID
+         """
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+         # Use environment variable for model path or fall back to Hugging Face Hub model
+         # This allows for flexibility in deployment
+         model_path = model_path or os.environ.get("MODEL_PATH", "Sparkonix11/email-classifier-model")
+
+         # Load the tokenizer and model from Hugging Face Hub or local path
+         self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
+         self.model = XLMRobertaForSequenceClassification.from_pretrained(model_path)
+         self.model.to(self.device)
+         self.model.eval()
+
+     def classify(self, masked_email: str) -> str:
+         """
+         Classify a masked email into one of the predefined categories
+
+         Args:
+             masked_email: The email content with PII masked
+
+         Returns:
+             The predicted category as a string
+         """
+         # Tokenize the masked email
+         inputs = self.tokenizer(
+             masked_email,
+             return_tensors="pt",
+             padding="max_length",
+             truncation=True,
+             max_length=512
+         )
+
+         inputs = {key: val.to(self.device) for key, val in inputs.items()}
+
+         # Perform inference
+         with torch.no_grad():
+             outputs = self.model(**inputs)
+             logits = outputs.logits
+             predicted_class_idx = torch.argmax(logits, dim=1).item()
+
+         # Map the predicted class index to the category
+         return self.CATEGORIES[predicted_class_idx]
+
+     def process_email(self, masked_email_data: Dict[str, Any]) -> Dict[str, Any]:
+         """
+         Process an email by classifying it into a category
+
+         Args:
+             masked_email_data: Dictionary containing the masked email and other data
+
+         Returns:
+             The input dictionary with the classification added
+         """
+         # Extract masked email content
+         masked_email = masked_email_data["masked_email"]
+
+         # Classify the masked email
+         category = self.classify(masked_email)
+
+         # Add the classification to the data
+         masked_email_data["category_of_the_email"] = category
+
+         return masked_email_data
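The `classify` method above picks the category by taking the argmax over the model's output logits and indexing into the fixed `CATEGORIES` list. The mapping itself is plain index arithmetic; a torch-free sketch with illustrative (made-up) logits:

```python
CATEGORIES = ["Incident", "Request", "Change", "Problem"]

def category_from_logits(logits):
    """Pick the category whose logit is largest (argmax), as in EmailClassifier.classify."""
    predicted_idx = max(range(len(logits)), key=lambda i: logits[i])
    return CATEGORIES[predicted_idx]

print(category_from_logits([0.2, 3.1, -0.5, 1.4]))
# Request
```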
requirements.txt ADDED
@@ -0,0 +1,40 @@
+ # FastAPI for the API
+ fastapi>=0.95.1
+ uvicorn>=0.22.0
+
+ # Pydantic (FastAPI dependency; this range assumes a Pydantic v2-compatible FastAPI)
+ pydantic>=2.0.0
+
+ # Transformers for the classification model
+ # The model's config.json specifies "transformers_version": "4.51.3"
+ transformers>=4.30.0
+
+ # PyTorch (CPU build by default for Docker on Hugging Face Spaces);
+ # let pip pick a version compatible with transformers, PyTorch 2.x expected
+ torch>=2.0.0
+
+ # SpaCy for PII masking; the models (xx_ent_wiki_sm, en_core_web_sm)
+ # are installed below via direct wheel URLs
+ spacy>=3.5.0
+
+ # For loading model weights stored as .safetensors
+ safetensors
+
+ # Python's built-in 're' module needs no entry here; add the third-party
+ # 'regex' package or numpy only if your code imports them directly
+
+ # Additional dependencies
+ python-multipart>=0.0.6
+ huggingface-hub>=0.15.1
+ spacy-transformers>=1.2.5
+ xx-ent-wiki-sm @ https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-3.5.0/xx_ent_wiki_sm-3.5.0-py3-none-any.whl
+ en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl
upload_model.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Script to upload the email classification model to Hugging Face Hub
+ """
+
+ import sys
+ import argparse
+ import subprocess
+ import pkg_resources
+
+ def check_and_install_dependencies():
+     """Check for required libraries and install if missing"""
+     required_packages = ['torch', 'transformers', 'sentencepiece']
+     installed_packages = {pkg.key for pkg in pkg_resources.working_set}
+
+     missing_packages = [pkg for pkg in required_packages if pkg not in installed_packages]
+
+     if missing_packages:
+         print(f"Installing missing dependencies: {', '.join(missing_packages)}")
+         subprocess.check_call([sys.executable, "-m", "pip", "install"] + missing_packages)
+         print("Dependencies installed. You may need to restart the script.")
+         return False
+
+     return True
+
+ def get_huggingface_username(token=None):
+     """Get the username for the authenticated user"""
+     try:
+         from huggingface_hub import HfApi
+         api = HfApi(token=token)
+         user_info = api.whoami()
+         return user_info.get('name')
+     except Exception as e:
+         print(f"Error getting Hugging Face username: {e}")
+         return None
+
+ def main():
+     """Upload model to Hugging Face Hub"""
+     # Check dependencies first
+     if not check_and_install_dependencies():
+         return
+
+     # Import dependencies after installation check
+     from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
+     from huggingface_hub import login
+
+     parser = argparse.ArgumentParser(description="Upload email classification model to Hugging Face Hub")
+     parser.add_argument("--model_path", type=str, default="classification_model",
+                         help="Local path to the model files")
+     parser.add_argument("--hub_model_id", type=str,
+                         help="Hugging Face Hub model ID (e.g., 'username/email-classifier-model')")
+     parser.add_argument("--model_name", type=str, default="email-classifier-model",
+                         help="Name for the model repository (default: email-classifier-model)")
+     parser.add_argument("--token", type=str,
+                         help="Hugging Face API token (optional; can use environment variable or huggingface-cli login)")
+
+     args = parser.parse_args()
+
+     # Log in if a token is provided
+     if args.token:
+         login(token=args.token)
+
+     # If hub_model_id is not provided, try to get the username and construct it
+     if not args.hub_model_id:
+         username = get_huggingface_username(args.token)
+         if not username:
+             print("Could not determine Hugging Face username. Please provide --hub_model_id explicitly.")
+             return
+         args.hub_model_id = f"{username}/{args.model_name}"
+
+     print(f"Loading model from {args.model_path}...")
+     # Load the local model and tokenizer
+     model = XLMRobertaForSequenceClassification.from_pretrained(args.model_path)
+     tokenizer = XLMRobertaTokenizer.from_pretrained(args.model_path)
+
+     print(f"Uploading model to {args.hub_model_id}...")
+     try:
+         # Push to Hugging Face Hub
+         model.push_to_hub(args.hub_model_id)
+         tokenizer.push_to_hub(args.hub_model_id)
+
+         print("Model successfully uploaded to Hugging Face Hub!")
+         print(f"You can now use the model with the ID: {args.hub_model_id}")
+         print(f"Update the MODEL_PATH in Dockerfile to: {args.hub_model_id}")
+     except Exception as e:
+         print(f"Error uploading model: {e}")
+         print("\nPossible solutions:")
+         print("1. Make sure you're logged in with 'huggingface-cli login'")
+         print("2. Check that you have permission to create repos in the specified namespace")
+         print("3. Try using your own username: --hub_model_id yourusername/email-classifier-model")
+
+ if __name__ == "__main__":
+     main()
utils.py ADDED
@@ -0,0 +1,331 @@
+ import re
+ import spacy
+ from typing import List, Dict, Tuple, Any
+
+ class Entity:
+     def __init__(self, start: int, end: int, entity_type: str, value: str):
+         self.start = start
+         self.end = end
+         self.entity_type = entity_type
+         self.value = value
+
+     def to_dict(self):
+         return {
+             "position": [self.start, self.end],
+             "classification": self.entity_type,
+             "entity": self.value
+         }
+
+     def __repr__(self):  # Added for easier debugging
+         return f"Entity(type='{self.entity_type}', value='{self.value}', start={self.start}, end={self.end})"
+
+ class PIIMasker:
+     def __init__(self, spacy_model_name: str = "xx_ent_wiki_sm"):  # Allow model choice
+         # Load SpaCy model
+         try:
+             self.nlp = spacy.load(spacy_model_name)
+         except OSError:
+             print(f"SpaCy model '{spacy_model_name}' not found. Downloading...")
+             try:
+                 spacy.cli.download(spacy_model_name)
+                 self.nlp = spacy.load(spacy_model_name)
+             except Exception as e:
+                 print(f"Failed to download or load {spacy_model_name}. Error: {e}")
+                 print("Attempting to load 'en_core_web_sm' as a fallback for English.")
+                 try:
+                     self.nlp = spacy.load("en_core_web_sm")
+                 except OSError:
+                     print("Downloading 'en_core_web_sm'...")
+                     spacy.cli.download("en_core_web_sm")
+                     self.nlp = spacy.load("en_core_web_sm")
+
+         # Initialize regex patterns
+         self._initialize_patterns()
+
+     def _initialize_patterns(self):
+         # Define regex patterns for different entity types
+         self.patterns = {
+             "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
+             "phone_number": r'\b(\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
+             # Card number regex: common formats, allows optional spaces/hyphens
+             "credit_debit_no": r'\b(?:(?:\d{4}[\s-]?){3}\d{4}|\d{13,19})\b',
+             # CVV: 3 or 4 digits, ensuring it's a standalone number (word boundary)
+             "cvv_no": r'\b\d{3,4}\b',
+             # Expiry: MM/YY or MM/YYYY, common separators
+             "expiry_no": r'\b(0[1-9]|1[0-2])[/\s-]([0-9]{2}|20[0-9]{2})\b',
+             "aadhar_num": r'\b\d{4}\s?\d{4}\s?\d{4}\b',
+             # DOB: DD/MM/YYYY or DD-MM-YYYY etc.
+             "dob": r'\b(0[1-9]|[12][0-9]|3[01])[/\s-](0[1-9]|1[0-2])[/\s-](?:19|20)\d\d\b'
+         }
+
+     def detect_regex_entities(self, text: str) -> List[Entity]:
+         """Detect entities using regex patterns"""
+         entities = []
+
+         for entity_type, pattern in self.patterns.items():
+             for match in re.finditer(pattern, text):
+                 start, end = match.span()
+                 value = match.group()
+
+                 # Specific verifications
+                 if entity_type == "credit_debit_no":
+                     if not self.verify_credit_card(text, match):
+                         continue
+                 elif entity_type == "cvv_no":
+                     if not self.verify_cvv(text, match):
+                         continue
+                 elif entity_type == "dob":  # Using the generic context verifier for DOB
+                     if not self._verify_with_context(text, start, end, ["birth", "dob", "born"]):
+                         continue
+
+                 # Avoid detecting parts of already matched longer entities (e.g. year within a DOB)
+                 # This is a simple check; more robust overlap handling is done later
+                 is_substring_of_existing = False
+                 for existing_entity in entities:
+                     if existing_entity.start <= start and existing_entity.end >= end and existing_entity.value != value:
+                         is_substring_of_existing = True
+                         break
+                 if is_substring_of_existing:
+                     continue
+
+                 entities.append(Entity(start, end, entity_type, value))
+         return entities
+
+     def _verify_with_context(self, text: str, start: int, end: int, keywords: List[str], window: int = 50) -> bool:
+         """Verify an entity match using surrounding context"""
+         context_before = text[max(0, start - window):start].lower()
+         context_after = text[end:min(len(text), end + window)].lower()
+
+         for keyword in keywords:
+             if keyword in context_before or keyword in context_after:
+                 return True
+         return False
+
+     def verify_credit_card(self, text: str, match: re.Match) -> bool:
+         """Verify if a match is actually a credit card number using contextual clues"""
+         context_window = 50
+         start, end = match.span()
+
+         context_before = text[max(0, start - context_window):start].lower()
+         context_after = text[end:min(len(text), end + context_window)].lower()
+
+         card_keywords = ["card", "credit", "debit", "visa", "mastercard", "payment", "amex", "account no", "card no"]
+         for keyword in card_keywords:
+             if keyword in context_before or keyword in context_after:
+                 return True
+         # A Luhn checksum would be a stronger test; for simplicity we rely on context here
+         return False
+
+     def verify_cvv(self, text: str, match: re.Match) -> bool:
+         """Verify if a 3-4 digit number is actually a CVV using contextual clues"""
+         context_window = 30
+         start, end = match.span()
+         value = match.group()
+
+         # If it's part of a longer digit sequence (like a phone number or ID), it's likely not a CVV
+         char_before = text[start-1:start] if start > 0 else ""
+         char_after = text[end:end+1] if end < len(text) else ""
+         if char_before.isdigit() or char_after.isdigit():
+             return False  # Part of a larger number
+
+         context_before = text[max(0, start - context_window):start].lower()
+         context_after = text[end:min(len(text), end + context_window)].lower()
+
+         cvv_keywords = ["cvv", "cvc", "csc", "security code", "card verification", "verification no"]
+         date_keywords = ["date", "year", "/", "-", "born", "age", "since", "established", "version", "model", "grade"]
+
+         is_cvv_context = any(keyword in context_before or keyword in context_after for keyword in cvv_keywords)
+
+         # If it looks like a year in common contexts (e.g. "since 2023", "born 1990"), it's probably not a CVV
+         if value.isdigit() and (1900 <= int(value) <= 2100 if len(value) == 4 else False):
+             year_context_keywords = ["year", "born", "fiscal", "established", "since", "class of", "ended", "began", "joined"]
+             if any(kw in context_before for kw in year_context_keywords):
+                 return False  # Likely a year
+         # If it follows an MM/ prefix, it's part of an expiry date, not a CVV
+         if re.search(r'\b(0[1-9]|1[0-2])[/\s-]$', context_before.strip()):
+             return False  # Part of an expiry date
+
+         is_date_context = any(keyword in context_before or keyword in context_after for keyword in date_keywords)
+
+         # Check whether the number itself looks like a year at typical CVV lengths
+         looks_like_year = False
+         if len(value) == 2 and value.isdigit():  # e.g. "23" as the YY part of an expiry
+             if any(k in context_before for k in ["expiry", "exp", "valid thru", "good thru"]) or \
+                re.search(r'\b(0[1-9]|1[0-2])[/\s-]$', context_before.strip()):
+                 looks_like_year = True
+         elif len(value) == 4 and value.isdigit() and (1900 <= int(value) <= 2100):
+             if any(k in (context_before + context_after) for k in ["year", "born", "fiscal"]):
+                 looks_like_year = True
+
+         return is_cvv_context and not (is_date_context and looks_like_year)
+
+     def detect_name_entities(self, text: str) -> List[Entity]:
+         """Detect name entities using SpaCy NER"""
+         entities = []
+         doc = self.nlp(text)
+
+         for ent in doc.ents:
+             # "PER" is the person label in models like xx_ent_wiki_sm;
+             # "PERSON" is used by others such as en_core_web_sm
+             if ent.label_ in ["PER", "PERSON"]:
+                 entities.append(Entity(ent.start_char, ent.end_char, "full_name", ent.text))
+         return entities
+
+     def detect_all_entities(self, text: str) -> List[Entity]:
+         """Detect all types of entities in the text"""
+         # Get regex-based entities first
+         entities = self.detect_regex_entities(text)
+
+         # Add SpaCy-based name entities second and let overlap resolution handle
+         # conflicts, since NER is more reliable for names than a generic regex
+         name_entities = self.detect_name_entities(text)
+         entities.extend(name_entities)
+
+         # Sort entities by their starting position
+         entities.sort(key=lambda x: x.start)
+
+         # Resolve overlaps: prioritize NER entities (like names) or longer regex matches
+         entities = self._resolve_overlaps(entities)
+         return entities
+
+     def _resolve_overlaps(self, entities: List[Entity]) -> List[Entity]:
+         """Resolve overlapping entities.
+         Prioritize:
+         1. NER entities (e.g., "full_name") if they overlap with regex matches.
+         2. Longer entities over shorter ones.
+         3. If same length, keep the first one encountered.
+         """
+         if not entities:
+             return []
+
+         # A simple greedy approach: iterate and remove/adjust overlaps
+         resolved_entities: List[Entity] = []
+         for current_entity in sorted(entities, key=lambda e: (e.start, -(e.end - e.start))):  # By start, then longest
+             is_overlapped_or_contained = False
+             temp_resolved = []
+             for res_entity in resolved_entities:
+                 overlap = max(0, min(current_entity.end, res_entity.end) - max(current_entity.start, res_entity.start))
+
+                 if overlap > 0:
+                     is_overlapped_or_contained = True
+                     current_len = current_entity.end - current_entity.start
+                     res_len = res_entity.end - res_entity.start
+
+                     # If current is a name and the existing entity is not, prefer current
+                     # unless current is fully contained by the existing entity
+                     if current_entity.entity_type == "full_name" and res_entity.entity_type != "full_name":
+                         if not (res_entity.start <= current_entity.start and res_entity.end >= current_entity.end):
+                             continue  # Drop res_entity; current will be added later
+                     elif res_entity.entity_type == "full_name" and current_entity.entity_type != "full_name":
+                         # The existing entity is a name; keep it unless it is fully contained by current
+                         if not (current_entity.start <= res_entity.start and current_entity.end >= res_entity.end):
+                             temp_resolved.append(res_entity)
+                             break  # Current is dominated
+
+                     # General case: the longer entity wins
+                     if current_len > res_len:
+                         pass  # res_entity is dropped in favor of current
+                     elif res_len > current_len:
+                         temp_resolved.append(res_entity)
+                         break  # Current is dominated
+                     else:  # Same length: keep the existing one
+                         temp_resolved.append(res_entity)
+                         break
+                 else:  # No overlap
+                     temp_resolved.append(res_entity)
+
+             if not is_overlapped_or_contained:
+                 temp_resolved.append(current_entity)
+
+             resolved_entities = sorted(temp_resolved, key=lambda e: (e.start, -(e.end - e.start)))
+
+         # Final pass to remove fully contained entities if a larger one exists
+         final_entities = []
+         if not resolved_entities:
+             return []
+
+         for i, entity in enumerate(resolved_entities):
+             is_contained = False
+             for j, other_entity in enumerate(resolved_entities):
+                 if i == j:
+                     continue
+                 # If 'entity' is strictly contained within 'other_entity'
+                 if other_entity.start <= entity.start and other_entity.end >= entity.end and \
+                    (other_entity.end - other_entity.start > entity.end - entity.start):
+                     is_contained = True
+                     break
+             if not is_contained:
+                 final_entities.append(entity)
+
+         return final_entities
+
+     def mask_text(self, text: str) -> Tuple[str, List[Dict[str, Any]]]:
+         """
+         Mask PII entities in the text and return masked text and entity information
+         """
+         entities = self.detect_all_entities(text)
+         entity_info = [entity.to_dict() for entity in entities]
+
+         # Sort entities by start position to ensure correct masking,
+         # longest first at same start to prevent partial masking by shorter entities
+         entities.sort(key=lambda x: (x.start, -(x.end - x.start)))
+
+         new_text_parts = []
+         current_pos = 0
+
+         for entity in entities:
+             # Add text before the entity
+             if entity.start > current_pos:
+                 new_text_parts.append(text[current_pos:entity.start])
+
+             # Add the mask, e.g. [FULL_NAME]
+             mask = f"[{entity.entity_type.upper()}]"
+             new_text_parts.append(mask)
+
+             current_pos = entity.end
+
+         # Add any remaining text after the last entity
+         if current_pos < len(text):
+             new_text_parts.append(text[current_pos:])
+
+         return "".join(new_text_parts), entity_info
+
+     def process_email(self, email_text: str) -> Dict[str, Any]:
+         """
+         Process an email by detecting and masking PII entities
+         """
+         masked_email, entity_info = self.mask_text(email_text)
+         return {
+             "input_email_body": email_text,
+             "list_of_masked_entities": entity_info,
+             "masked_email": masked_email,
+             "category_of_the_email": ""
+         }
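The contextual verification used throughout `PIIMasker` (for card numbers, CVVs, and DOBs) boils down to scanning a window of text on either side of a match for trigger keywords. A standalone sketch of the `_verify_with_context` logic, with a made-up example string:

```python
def verify_with_context(text: str, start: int, end: int, keywords, window: int = 50) -> bool:
    """Return True if any keyword appears within `window` chars before or after the match."""
    before = text[max(0, start - window):start].lower()
    after = text[end:min(len(text), end + window)].lower()
    return any(kw in before or kw in after for kw in keywords)

text = "My card number is 4111 1111 1111 1111, thanks."
# The digit run spans indices 18..37; "card" appears in the window before it
print(verify_with_context(text, 18, 37, ["card", "credit", "debit"]))
# True
```

Without such a check, the bare `\b\d{3,4}\b` CVV pattern would flag every short number in an email, which is why `verify_cvv` layers several of these context tests.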