NLP Applications (S1-25_AIMLCZG519)

Assignment 2 – Problem Statement – 25

Submitted by: Group 108

  • ADAPALA MANI KUMAR
  • BHAT MITALI MAHENDRA
  • CHELLAPPAN C
  • ELLURU SAI GAGAN
  • MD FAREED FAROOQUI

Fine-tuned Model & Project URL: https://drive.google.com/drive/folders/19ZP1RG_9Ms_kzsiLYBMrtrUbETej5dLq?usp=sharing

PII Detection and Masking System

This project implements a PII (Personally Identifiable Information) detection and masking system based on DistilBERT fine-tuned on the ai4privacy/pii-masking-200k dataset. A Flask API accepts raw text or uploaded .txt files and returns the masked output together with highlighted HTML.
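
As a rough illustration of the masking step, the sketch below replaces detected character spans with bracketed labels. The span format (dicts with `start`, `end`, and `entity_group`) mirrors what a Hugging Face token-classification pipeline returns when aggregation is enabled. The function name `mask_entities` and the `[LABEL]` placeholder style are assumptions for illustration and are not taken from the project code.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a bracketed label such as [EMAIL]."""
    masked = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"] :]
    return masked


# Hypothetical spans, hand-written here instead of produced by the model.
sample = "Contact jane@example.com or call 555-0100."
spans = [
    {"entity_group": "EMAIL", "start": 8, "end": 24},
    {"entity_group": "PHONE", "start": 33, "end": 41},
]
print(mask_entities(sample, spans))
```

The sketch assumes the spans do not overlap; overlapping predictions would need to be merged first.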

🚀 Features

  • Transformer-based Named Entity Recognition (NER)
  • Detects multiple PII categories (Email, Phone, SSN, IP, etc.)
  • Batch processing (5 lines at a time)
  • REST API using Flask
  • Optimized with FP16 training
  • Entity-level F1 Score: 92.16%

📂 Project Structure

.
├── app.py
├── design_document.docx
├── distilbert-ner
│   └── checkpoint-10880
│       ├── config.json
│       ├── model.safetensors
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       ├── trainer_state.json
│       └── training_args.bin
├── NER_Masking.pdf
├── NER_Masking.ipynb
├── readme.md
├── sample.txt
└── templates
    └── index.html

⚙️ Installation

  1. Download the repository:
https://drive.google.com/drive/folders/19ZP1RG_9Ms_kzsiLYBMrtrUbETej5dLq?usp=sharing
  2. Create and activate the environment:
conda create -n pii python=3.10
conda activate pii
  3. Install dependencies:
pip install gdown flask torch matplotlib transformers datasets seqeval scikit-learn seaborn numpy ipywidgets

Train the Model

To train the model, run every cell in NER_Masking.ipynb.

▶️ Run the Application

To run the Flask API:

python app.py

The server will start locally (default: http://127.0.0.1:5000).

📤 API Usage

The application exposes two endpoints backed by the same PII inference function: one for raw text and one for uploaded files.

🌐 API Endpoints

1️⃣ /predict

Method: POST

Description: Performs PII detection on raw text input.

Request Body (JSON):

{
  "text": "Your input text here"
}

Response:

{
  "masked": "Masked text output",
  "highlighted": "<html with highlighted entities>"
}
  • Calls pii_inference(text)
  • Returns both masked text and dynamically highlighted HTML output
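
A client call to /predict could look like the sketch below, which uses only the Python standard library. The base URL is the default local address mentioned earlier. The helper names `build_predict_request` and `predict` are hypothetical, and the sketch assumes the server accepts a standard `application/json` body.

```python
import json
import urllib.request


def build_predict_request(text: str, base_url: str = "http://127.0.0.1:5000") -> urllib.request.Request:
    """Build a POST request for /predict with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://127.0.0.1:5000") -> dict:
    """Send the request and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not contact the server, so this runs offline.
request = build_predict_request("My email is jane@example.com")
print(request.get_method(), request.full_url)
```

With the Flask app running, `predict("...")["masked"]` would return the masked text and `predict("...")["highlighted"]` the HTML version.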

2️⃣ /upload

Method: POST

Description: Uploads a .txt file and processes it in batches of 5 lines.

Form-Data Key:

file

Processing Logic:

  • Reads file
  • Splits into lines
  • Processes 5 lines at a time
  • Runs pii_inference on each batch
  • Merges results into final masked output
  • Generates highlighted HTML
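
The batching steps above can be sketched as follows. This is a minimal illustration and not the code in app.py: `iter_batches`, `mask_file_text`, and `fake_inference` are hypothetical names, and the stand-in inference function simply uppercases its input and returns a single string, whereas the real `pii_inference` may return more than that.

```python
from typing import Callable, Iterator, List


def iter_batches(lines: List[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` lines."""
    for start in range(0, len(lines), batch_size):
        yield lines[start : start + batch_size]


def mask_file_text(content: str, infer: Callable[[str], str], batch_size: int = 5) -> str:
    """Run `infer` on each batch of lines and join the masked batches."""
    lines = content.splitlines()
    masked_batches = [infer("\n".join(batch)) for batch in iter_batches(lines, batch_size)]
    return "\n".join(masked_batches)


# Stand-in for pii_inference: uppercases text so the batching is visible.
def fake_inference(text: str) -> str:
    return text.upper()


content = "\n".join(f"line {i}" for i in range(1, 13))
print([len(b) for b in iter_batches(content.splitlines())])  # [5, 5, 2]
print(mask_file_text(content, fake_inference))
```

A 12-line file is split into batches of 5, 5, and 2 lines, so the last batch may be shorter than the batch size.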

Response:

{
  "masked": "Final masked text",
  "highlighted": "<html with highlighted entities>"
}
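
Both endpoints return a `highlighted` field. One plausible way to build such HTML from detected spans is sketched below. The `<mark>` tag, the `title` attribute, and the function name `highlight_entities` are assumptions; the project's actual markup is not shown in this README.

```python
import html
from typing import Dict, List


def highlight_entities(text: str, entities: List[Dict]) -> str:
    """Wrap each detected span in a <mark> tag and escape everything else."""
    parts = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Escape the plain text between the previous span and this one.
        parts.append(html.escape(text[cursor : ent["start"]]))
        span_text = html.escape(text[ent["start"] : ent["end"]])
        label = html.escape(ent["entity_group"])
        parts.append(f'<mark title="{label}">{span_text}</mark>')
        cursor = ent["end"]
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


# Hypothetical span, hand-written here instead of produced by the model.
example = highlight_entities(
    "Mail <jane@example.com> today",
    [{"entity_group": "EMAIL", "start": 6, "end": 22}],
)
print(example)
```

Escaping the surrounding text matters because uploaded files can contain characters such as `<` that would otherwise be interpreted as markup. The sketch assumes non-overlapping spans.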

🔍 Design Insight

  • /predict → Low latency, single inference call
  • /upload → Memory-efficient batch processing
  • Batch size (5 lines) keeps each inference input short, which reduces the risk of exceeding the model's maximum sequence length

🧠 Model Details

  • Base Model: DistilBERT
  • Dataset: ai4privacy/pii-masking-200k
  • Training Framework: Hugging Face Trainer
  • Batch Size: 32
  • Epochs: 10
  • Mixed Precision (FP16): Enabled
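
For orientation, the hyperparameters listed above map onto Hugging Face `TrainingArguments` roughly as in the sketch below. Only the batch size, epoch count, and FP16 flag come from this README. The output directory is inferred from the project tree, treating the batch size as per-device is an assumption, and every other setting used in NER_Masking.ipynb is omitted.

```python
# Hyperparameters stated in the Model Details list; everything else is assumed.
TRAINING_CONFIG = {
    "output_dir": "distilbert-ner",      # inferred from the checkpoint folder name
    "per_device_train_batch_size": 32,   # assumes "Batch Size" means per device
    "num_train_epochs": 10,
    "fp16": True,                        # mixed precision; typically requires a GPU
}


def build_training_args():
    """Create TrainingArguments lazily so the config can be inspected without transformers."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_CONFIG)


print(TRAINING_CONFIG)
```

Calling `build_training_args()` requires the `transformers` package and, because of `fp16`, hardware that supports half-precision training.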

📊 Performance

  • Precision: 90.86%
  • Recall: 93.50%
  • F1 Score: 92.16%

Strong performance on structured PII types such as Email, URL, SSN, and Username.
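
As a quick consistency check, the reported F1 score matches the harmonic mean of the reported precision and recall. The helper below is a generic formula and is not taken from the project notebook, where the scores are presumably computed with seqeval.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.9086, 0.9350)
print(f"{f1 * 100:.2f}%")  # 92.16%
```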

🔮 Future Improvements

  • Add CRF layer for structured decoding
  • Improve low-performing entity categories
  • Model quantization for faster inference
  • Hybrid LLM-based validation layer

Model: ManiKumarAdapala/distilbert-pii-ner (Safetensors, 65.3M params, tensor type F32)