Spaces:
Sleeping
Sleeping
Aman Garg
commited on
Email Classification API
Browse files- Dockerfile +36 -0
- README.md +127 -6
- label_encoder.pkl +3 -0
- main.py +93 -0
- mlp_model.pth +3 -0
- models.py +77 -0
- pca.pkl +3 -0
- requirements.txt +10 -0
- test.py +61 -0
- utils.py +141 -0
Dockerfile
ADDED
|
@@ -0,0 +1,36 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Prevent Python from writing .pyc files to disc and enable unbuffered output.
# key=value form: the space-separated ENV syntax is legacy/deprecated.
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Set the working directory
WORKDIR /app

# Install git (required by some HF models) and basic system tools.
# Remove the apt package lists afterwards so the layer stays small.
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements first so dependency installation is cached
# independently of application-source changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download NER model so the container starts without network access
RUN python -c "from transformers import AutoTokenizer, AutoModelForTokenClassification; \
model = AutoModelForTokenClassification.from_pretrained('Davlan/bert-base-multilingual-cased-ner-hrl'); \
tokenizer = AutoTokenizer.from_pretrained('Davlan/bert-base-multilingual-cased-ner-hrl'); \
model.save_pretrained('./model'); tokenizer.save_pretrained('./model')"

# Pre-download SentenceTransformer model
RUN python -c "from sentence_transformers import SentenceTransformer; \
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2'); \
model.save('./sbert_model')"

# Copy app code into container
COPY . .

# Expose port
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
README.md
CHANGED
|
@@ -1,10 +1,131 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
-
sdk:
|
|
|
|
|
|
|
| 7 |
pinned: false
|
|
|
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Email Classification and PII Masking API
|
| 3 |
+
emoji: 📧
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: docker
|
| 7 |
+
sdk_version: "latest"
|
| 8 |
+
app_file: main.py
|
| 9 |
pinned: false
|
| 10 |
+
models:
|
| 11 |
+
- Davlan/bert-base-multilingual-cased-ner-hrl
|
| 12 |
+
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
|
| 13 |
---
|
| 14 |
|
| 15 |
+
# Email Classification and PII Masking API
|
| 16 |
+
|
| 17 |
+
This FastAPI application provides an API for classifying emails and masking Personally Identifiable Information (PII) in text.
|
| 18 |
+
|
| 19 |
+
## Features
|
| 20 |
+
|
| 21 |
+
- PII Detection and Masking
|
| 22 |
+
- Full names
|
| 23 |
+
- Email addresses
|
| 24 |
+
- Phone numbers
|
| 25 |
+
- Dates of birth
|
| 26 |
+
- Aadhar numbers
|
| 27 |
+
- Credit/Debit card numbers
|
| 28 |
+
- CVV numbers
|
| 29 |
+
- Card expiry dates
|
| 30 |
+
- Email Classification using MLP model
|
| 31 |
+
- Multilingual support using BERT-based models
|
| 32 |
+
|
| 33 |
+
## Setup
|
| 34 |
+
|
| 35 |
+
1. Create a virtual environment:
|
| 36 |
+
```bash
|
| 37 |
+
python -m venv venv
|
| 38 |
+
source venv/bin/activate # On Windows: venv\Scripts\activate
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
2. Install dependencies:
|
| 42 |
+
|
| 43 |
+
```bash
|
| 44 |
+
pip install -r requirements.txt
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
3. Download required model files:
|
| 48 |
+
|
| 49 |
+
- `label_encoder.pkl`
|
| 50 |
+
- `pca.pkl`
|
| 51 |
+
- `mlp_model.pth`
|
| 52 |
+
Place these files in the same directory as `main.py`.
|
| 53 |
+
|
| 54 |
+
## Usage
|
| 55 |
+
|
| 56 |
+
1. Start the FastAPI server:
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
uvicorn main:app --reload --host 0.0.0.0 --port 8000
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
**Note for Hugging Face Spaces:** We explicitly bind to `0.0.0.0` and port `8000`, matching the port exposed by the Dockerfile and used in `main.py`.
|
| 63 |
+
|
| 64 |
+
2. The API will be available at the Space's URL (e.g., `https://your-username-your-space-name.hf.space`).
|
| 65 |
+
|
| 66 |
+
3. API Endpoints:
|
| 67 |
+
|
| 68 |
+
- **POST `/classify`**: Classify and mask PII in email text
|
| 69 |
+
- **Input:** JSON with `input_email_body` field
|
| 70 |
+
```json
|
| 71 |
+
{
|
| 72 |
+
"input_email_body": "Hello, my name is John Doe and my email is john.doe@example.com. Please help with my billing issue."
|
| 73 |
+
}
|
| 74 |
+
```
|
| 75 |
+
- **Output:** JSON with masked text, detected entities, and classification
|
| 76 |
+
```json
|
| 77 |
+
{
|
| 78 |
+
"input_email_body": "Hello, my name is John Doe and my email is john.doe@example.com. Please help with my billing issue.",
|
| 79 |
+
"list_of_masked_entities": [
|
| 80 |
+
{
|
| 81 |
+
"position": [
|
| 82 |
+
16,
|
| 83 |
+
24
|
| 84 |
+
],
|
| 85 |
+
"classification": "full_name",
|
| 86 |
+
"entity": "John Doe"
|
| 87 |
+
},
|
| 88 |
+
{
|
| 89 |
+
"position": [
|
| 90 |
+
39,
|
| 91 |
+
60
|
| 92 |
+
],
|
| 93 |
+
"classification": "email",
|
| 94 |
+
"entity": "john.doe@example.com"
|
| 95 |
+
}
|
| 96 |
+
],
|
| 97 |
+
"masked_email": "Hello, my name is [full_name] and my email is [email]. I'm having trouble with my account.",
|
| 98 |
+
"category_of_the_email": "Billing Issues"
|
| 99 |
+
}
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
## API Documentation
|
| 103 |
+
|
| 104 |
+
Once the server is running, FastAPI's interactive documentation is available at `/docs` (Swagger UI) and `/redoc` (ReDoc) on the Space's URL; you can also send POST requests to `/classify` directly.
|
| 105 |
+
|
| 106 |
+
## Project Structure
|
| 107 |
+
|
| 108 |
+
```
|
| 109 |
+
.
|
| 110 |
+
├── README.md
|
| 111 |
+
├── requirements.txt
|
| 112 |
+
├── main.py # FastAPI application entry point
|
| 113 |
+
├── models.py # ML model definitions and training logic
|
| 114 |
+
├── utils.py # Utility functions for text processing
|
| 115 |
+
├── label_encoder.pkl # Label encoder for classification
|
| 116 |
+
├── pca.pkl # PCA model for dimensionality reduction
|
| 117 |
+
└── mlp_model.pth # Trained MLP model weights
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
## Deployment on Hugging Face Spaces
|
| 121 |
+
|
| 122 |
+
To deploy this application on Hugging Face Spaces:
|
| 123 |
+
|
| 124 |
+
1. **Create a new Space** on [https://huggingface.co/spaces](https://huggingface.co/spaces).
|
| 125 |
+
2. Choose a **Space name**, select a **license**, and for **Space Hardware**, the "Free" tier should be sufficient for this type of API.
|
| 126 |
+
3. Crucially, under **SDK**, select **"Docker"** — a FastAPI server needs a running container; the "Static" SDK only serves static files and cannot run uvicorn.
|
| 127 |
+
4. In your Space's settings, link your **GitHub repository** containing these files.
|
| 128 |
+
5. Hugging Face Spaces will automatically detect the `requirements.txt` and install the dependencies.
|
| 129 |
+
6. The Space then builds the included `Dockerfile`, which installs the dependencies, pre-downloads the NER and SentenceTransformer models, and starts the server with `uvicorn main:app --host 0.0.0.0 --port 8000`.
|
| 130 |
+
|
| 131 |
+
Ensure all your model files (`label_encoder.pkl`, `pca.pkl`, `mlp_model.pth`) are present in your repository at the root level or in the same directory as `main.py`.
|
label_encoder.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1e8103306a7eee71cc11a4a770b52d57c8e033925b07f81f7c3b85c26e2c4d6a
|
| 3 |
+
size 283
|
main.py
ADDED
|
@@ -0,0 +1,93 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import json
import re
from typing import Any, Dict, List

from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
from pydantic import BaseModel

from models import ModelManager
from utils import mask_pii

app = FastAPI()

# Initialize model manager
# Models are loaded eagerly at import time so the first request does not pay
# the load cost; any failure aborts startup with an explicit RuntimeError
# instead of surfacing later as a confusing per-request error.
model_manager = ModelManager()
try:
    model_manager.load_models()
except Exception as e:
    raise RuntimeError(f"Error loading models: {e}")
|
| 20 |
+
|
| 21 |
+
# Helper class for marking lists that need compact JSON representation
|
| 22 |
+
# Helper class for marking lists that need compact JSON representation
class CompactListWrapper:
    """Tags a list so the custom JSON encoder renders it on a single line."""

    def __init__(self, data_list):
        # Hold a reference to the wrapped list; the encoder reads .data
        # and serializes it without any whitespace.
        self.data = data_list
|
| 25 |
+
|
| 26 |
+
# Custom JSON Encoder (used by CustomFormattedJSONResponse)
|
| 27 |
+
# Custom JSON Encoder (used by CustomFormattedJSONResponse)
class CustomJsonEncoder(json.JSONEncoder):
    """Encoder that serializes CompactListWrapper values as quoted
    placeholder strings; the response class later strips the quotes and
    markers, leaving the compact (single-line) list inline."""

    def default(self, o):
        if not isinstance(o, CompactListWrapper):
            # Defer to the base class for anything we don't handle.
            return super().default(o)
        compact = json.dumps(o.data, separators=(',',':'))
        return f"__COMPACT_LIST_PLACEHOLDER__{compact}__END_PLACEHOLDER__"
|
| 32 |
+
|
| 33 |
+
# Custom JSONResponse class for specific formatting
|
| 34 |
+
# Custom JSONResponse class for specific formatting
class CustomFormattedJSONResponse(JSONResponse):
    """JSONResponse that pretty-prints the payload (indent=2) while keeping
    CompactListWrapper values on a single line.

    Two-pass scheme: CustomJsonEncoder serializes each wrapped list as a
    quoted placeholder string, then a regex removes the surrounding quotes
    and placeholder markers so the compact list appears inline.
    """

    def render(self, content: Any) -> bytes:
        # content is the dictionary passed to the response instance
        json_string_with_placeholders = json.dumps(
            content,
            indent=2,
            cls=CustomJsonEncoder  # Our encoder that inserts placeholders
        )

        # Replace the quoted placeholders with their unquoted compact list content
        # NOTE(review): this round-trip assumes the wrapped lists serialize to
        # JSON with no characters that json.dumps would escape (plain ints are
        # fine; strings containing quotes/backslashes would corrupt the output).
        # Confirm callers only wrap numeric 'position' lists.
        final_json_string = re.sub(
            r'"__COMPACT_LIST_PLACEHOLDER__(.*?)__END_PLACEHOLDER__"',
            r'\1',
            json_string_with_placeholders
        )

        return final_json_string.encode("utf-8")
|
| 51 |
+
|
| 52 |
+
class EmailInput(BaseModel):
    """Request schema for /classify: the raw email text to be masked and
    classified."""
    input_email_body: str  # full plain-text body of the email
|
| 54 |
+
|
| 55 |
+
@app.post("/classify")
|
| 56 |
+
async def classify_email(email_input: EmailInput):
|
| 57 |
+
try:
|
| 58 |
+
# Mask PII in the email
|
| 59 |
+
masked_email_str, masked_entities_list_of_dicts = mask_pii(
|
| 60 |
+
email_input.input_email_body,
|
| 61 |
+
model_manager.ner_pipeline
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
# Classify the masked email
|
| 65 |
+
predicted_category_str = model_manager.predict(masked_email_str)
|
| 66 |
+
|
| 67 |
+
# Prepare data, wrapping 'position' lists in CompactListWrapper
|
| 68 |
+
processed_masked_entities = []
|
| 69 |
+
for entity_dict in masked_entities_list_of_dicts:
|
| 70 |
+
# Create a new dict to avoid modifying original from mask_pii if it's reused
|
| 71 |
+
processed_entity = entity_dict.copy()
|
| 72 |
+
if "position" in processed_entity and isinstance(processed_entity["position"], list):
|
| 73 |
+
processed_entity["position"] = CompactListWrapper(processed_entity["position"])
|
| 74 |
+
processed_masked_entities.append(processed_entity)
|
| 75 |
+
|
| 76 |
+
response_data = {
|
| 77 |
+
"input_email_body": email_input.input_email_body,
|
| 78 |
+
"list_of_masked_entities": processed_masked_entities,
|
| 79 |
+
"masked_email": masked_email_str,
|
| 80 |
+
"category_of_the_email": predicted_category_str
|
| 81 |
+
}
|
| 82 |
+
|
| 83 |
+
# Use the custom response class
|
| 84 |
+
return CustomFormattedJSONResponse(content=response_data)
|
| 85 |
+
except Exception as e:
|
| 86 |
+
# It's good practice to log the actual exception for debugging on the server
|
| 87 |
+
# import traceback
|
| 88 |
+
# print(f"Error in classify_email: {str(e)}\n{traceback.format_exc()}")
|
| 89 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 90 |
+
|
| 91 |
+
if __name__ == "__main__":
|
| 92 |
+
import uvicorn
|
| 93 |
+
uvicorn.run(app, host="0.0.0.0", port=8000)
|
mlp_model.pth
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:294544a99fec249f58b8258a5868d941dfb9e47d7ff44700965f0b1ce107b22d
|
| 3 |
+
size 923932
|
models.py
ADDED
|
@@ -0,0 +1,77 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pickle
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn as nn
|
| 5 |
+
from sentence_transformers import SentenceTransformer
|
| 6 |
+
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
|
| 7 |
+
TokenClassificationPipeline)
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
class MLPClassifier(nn.Module):
    """Feed-forward classifier over reduced sentence embeddings.

    Architecture: input -> 256 -> 128 -> num_classes, with ReLU activations
    and 0.3 dropout after each hidden layer. Outputs raw logits.
    """

    def __init__(self, input_dim, num_classes):
        super(MLPClassifier, self).__init__()
        hidden = [
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        ]
        self.model = nn.Sequential(*hidden)

    def forward(self, x):
        # Returns logits; apply argmax/softmax downstream.
        return self.model(x)
|
| 25 |
+
|
| 26 |
+
class ModelManager:
    """Owns every ML artifact used by the API: the NER pipeline for PII
    detection and the embedding + PCA + MLP stack for classification."""

    def __init__(self):
        self.ner_model = None
        self.ner_tokenizer = None
        self.ner_pipeline = None
        self.classification_model = None
        self.label_encoder = None
        self.pca_model = None
        self.mlp_model = None

    def load_models(self):
        """Load all models from local paths.

        Expects ./model (NER weights for
        Davlan/bert-base-multilingual-cased-ner-hrl), ./sbert_model, and
        label_encoder.pkl / pca.pkl / mlp_model.pth in the working directory
        (the Dockerfile pre-downloads the HF models at build time).
        """
        # NER model: loaded from the locally saved directory, so no network
        # access is needed at startup. (Fix: removed the unused
        # ner_model_name local — the hub name is only relevant at build time.)
        self.ner_tokenizer = AutoTokenizer.from_pretrained("./model")
        self.ner_model = AutoModelForTokenClassification.from_pretrained("./model")
        self.ner_pipeline = TokenClassificationPipeline(
            model=self.ner_model.to('cpu'),
            tokenizer=self.ner_tokenizer,
            device=-1,  # force CPU inference
            aggregation_strategy="simple"  # merge word pieces into whole entities
        )

        # Sentence-embedding model used as the classification feature extractor.
        self.classification_model = SentenceTransformer('./sbert_model')

        # NOTE: unpickling assumes these are trusted artifacts shipped with
        # the repository — never load untrusted pickles.
        with open("label_encoder.pkl", "rb") as f:
            self.label_encoder = pickle.load(f)

        with open("pca.pkl", "rb") as f:
            self.pca_model = pickle.load(f)

        # Rebuild the MLP with dimensions derived from the fitted
        # encoder/PCA, then restore the trained weights.
        model_state_dict = torch.load("mlp_model.pth", map_location=torch.device('cpu'))
        num_classes = len(self.label_encoder.classes_)
        input_dim = self.pca_model.n_components_

        self.mlp_model = MLPClassifier(input_dim, num_classes)
        self.mlp_model.load_state_dict(model_state_dict)
        self.mlp_model.eval()  # disable dropout for deterministic inference

    def predict(self, text):
        """Return the predicted category label for *text*.

        Pipeline: sentence embedding -> PCA reduction -> MLP logits ->
        argmax -> inverse label-encoder lookup.
        """
        # Get embeddings and reduce dimensions
        email_embedding = self.classification_model.encode([text])
        email_reduced = self.pca_model.transform(email_embedding)
        email_tensor = torch.tensor(email_reduced, dtype=torch.float32)

        # Make prediction
        with torch.no_grad():
            output = self.mlp_model(email_tensor)
            predicted_class_index = torch.argmax(output, dim=1).item()
            predicted_category = self.label_encoder.inverse_transform([predicted_class_index])[0]

        return predicted_category
|
pca.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8dd0cc734e8c0ef88a277f3f948c687afae9184bef195adbe608d1bb553a2ed3
|
| 3 |
+
size 2372321
|
requirements.txt
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
fastapi
|
| 2 |
+
uvicorn
|
| 3 |
+
transformers==4.51.3
|
| 4 |
+
torch==2.5.0
|
| 5 |
+
pandas==2.2.2
|
| 6 |
+
scikit-learn==1.6.1
|
| 7 |
+
sentence-transformers==4.1.0
|
| 8 |
+
pydantic==2.11.4
|
| 9 |
+
tqdm==4.67.1
|
| 10 |
+
numpy==2.0.2
|
test.py
ADDED
|
@@ -0,0 +1,61 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import json
|
| 2 |
+
|
| 3 |
+
import requests
|
| 4 |
+
|
| 5 |
+
# Replace with the actual URL where your FastAPI app is running locally
|
| 6 |
+
LOCAL_API_URL = "http://127.0.0.1:8000/classify"
|
| 7 |
+
|
| 8 |
+
def test_classify_endpoint(email_body):
    """
    Sends a POST request to the /classify endpoint of the local FastAPI app.

    Args:
        email_body (str): The email content to be classified.

    Returns:
        dict: The JSON response from the API, or None if an error occurred.
    """
    headers = {"Content-Type": "application/json"}
    payload = {"input_email_body": email_body}

    try:
        response = requests.post(LOCAL_API_URL, headers=headers, json=payload)
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.json()
    # Fix: HTTPError (and requests' JSONDecodeError) subclass
    # RequestException, so the original ordering — RequestException first —
    # made the specific handlers unreachable. Narrowest clauses come first.
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e}")
        return None
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Error connecting to the API: {e}")
        return None
|
| 34 |
+
|
| 35 |
+
if __name__ == "__main__":
|
| 36 |
+
# Example email bodies to test
|
| 37 |
+
test_emails = [
|
| 38 |
+
"Hello, my name is Alice Smith and my email is alice.smith@example.com. I'm having trouble with my account.",
|
| 39 |
+
"Urgent: My credit card number is 1234-5678-9012-3456 and the expiry is 03/27. I was overcharged.",
|
| 40 |
+
"Subject: Network down - Office B1 floor. Please investigate.",
|
| 41 |
+
"Request for new software installation on my laptop.",
|
| 42 |
+
"Regarding invoice INV-2023-10-01. The total seems incorrect.",
|
| 43 |
+
"My date of birth is 01/15/1990 and my phone number is 987-654-3210.",
|
| 44 |
+
"Is there a problem with the server?",
|
| 45 |
+
"I need to change my registered address.",
|
| 46 |
+
"Subject: Unplanned system outage affecting database access.",
|
| 47 |
+
"Can I get access to the premium features?"
|
| 48 |
+
]
|
| 49 |
+
|
| 50 |
+
print("Testing the /classify endpoint on localhost:")
|
| 51 |
+
for i, email in enumerate(test_emails):
|
| 52 |
+
print(f"\n--- Test Email {i+1} ---")
|
| 53 |
+
print(f"Input Email Body: {email}")
|
| 54 |
+
response_data = test_classify_endpoint(email)
|
| 55 |
+
if response_data:
|
| 56 |
+
print("API Response:")
|
| 57 |
+
print(json.dumps(response_data, indent=4))
|
| 58 |
+
else:
|
| 59 |
+
print("Failed to get a valid response from the API.")
|
| 60 |
+
|
| 61 |
+
print("\nTesting complete.")
|
utils.py
ADDED
|
@@ -0,0 +1,141 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
from typing import Dict, List, Tuple
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
def mask_full_name(text: str, ner_pipeline) -> Tuple[str, List[Dict]]:
    """
    Mask full names in text using NER model.

    Args:
        text (str): Input text
        ner_pipeline: NER pipeline for name detection

    Returns:
        Tuple[str, List[Dict]]: Masked text and list of masked entities
    """
    detections = ner_pipeline(text)
    found = []
    # Process matches right-to-left so earlier character offsets stay valid
    # while the text is being rewritten.
    for detection in sorted(detections, key=lambda d: d['start'], reverse=True):
        if detection['entity_group'] not in ('PER', 'Person', 'full_name'):
            continue
        start, end = detection['start'], detection['end']
        found.append({
            "position": [start, end],
            "classification": "full_name",
            "entity": text[start:end],
        })
        text = f"{text[:start]}[full_name]{text[end:]}"
    return text, found
|
| 29 |
+
|
| 30 |
+
def mask_with_regex(text: str) -> Tuple[str, List[Dict]]:
    """
    Mask PII using regex patterns.

    Args:
        text (str): Input text

    Returns:
        Tuple[str, List[Dict]]: Masked text and list of masked entities
    """
    masked_entities: List[Dict] = []

    # (label, pattern) pairs applied in this order. Order matters: longer
    # digit runs (cards, Aadhar) are masked before the generic 3-digit CVV
    # pattern can falsely match inside them.
    patterns = [
        ("email", r'\b[\w.-]+?@\w+?\.\w+?\b'),
        ("phone_number", r'\b(?:(?:\+|0)91[\s.-]?)?\d{10}(?!\d)\b'),
        ("dob", r'\b\d{2}[-/]\d{2}[-/]\d{4}\b|\b\d{4}[-/]\d{2}[-/]\d{2}\b'),
        ("credit_debit_no", r'\b(?:\d[ -]*?){13,19}\b'),
        ("aadhar_num", r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'),
        ("cvv_no", r'\b\d{3}\b'),
        ("expiry_no", r'\b(0[1-9]|1[0-2])\/?([0-9]{2}|[0-9]{4})\b'),
    ]
    for label, pattern in patterns:
        text = _mask_matches(text, pattern, label, masked_entities)
    return text, masked_entities


def _mask_matches(text: str, pattern: str, label: str,
                  masked_entities: List[Dict]) -> str:
    """Replace every match of *pattern* in *text* with '[label]', appending
    each original value and its span to *masked_entities* (mutated in place).

    Matches are applied right-to-left so earlier spans remain valid while
    the surrounding text changes length. Returns the rewritten text.
    """
    for match in reversed(list(re.finditer(pattern, text))):
        start, end = match.span()
        masked_entities.append({
            "position": [start, end],
            "classification": label,
            "entity": text[start:end],
        })
        text = text[:start] + f'[{label}]' + text[end:]
    return text
|
| 127 |
+
|
| 128 |
+
def mask_pii(text: str, ner_pipeline) -> Tuple[str, List[Dict]]:
    """
    Mask all PII in text using both NER and regex patterns.

    Args:
        text (str): Input text
        ner_pipeline: NER pipeline for name detection

    Returns:
        Tuple[str, List[Dict]]: Masked text and list of all masked entities
    """
    # Names first (NER), then pattern-based PII on the already-masked text;
    # NER-detected entities precede regex-detected ones in the result.
    masked, name_entities = mask_full_name(text, ner_pipeline)
    masked, pattern_entities = mask_with_regex(masked)
    return masked, name_entities + pattern_entities
|