NLP Applications (S1-25_AIMLCZG519)
Assignment 2 - Problem Statement - 25
Submitted by: Group 108
- ADAPALA MANI KUMAR
- BHAT MITALI MAHENDRA
- CHELLAPPAN C
- ELLURU SAI GAGAN
- MD FAREED FAROOQUI
Finetuned Model & Project URL : https://drive.google.com/drive/folders/19ZP1RG_9Ms_kzsiLYBMrtrUbETej5dLq?usp=sharing
PII Detection and Masking System
This project implements a PII (Personally Identifiable Information) detection and masking system using DistilBERT fine-tuned on the ai4privacy/pii-masking-200k dataset. The system exposes a Flask API for uploading text files and returning masked outputs.
Features
- Transformer-based Named Entity Recognition (NER)
- Detects multiple PII categories (Email, Phone, SSN, IP, etc.)
- Batch processing (5 lines at a time)
- REST API using Flask
- Optimized with FP16 training
- Entity-level F1 Score: 92.16%
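As a rough illustration of the masking step, the sketch below replaces detected character spans with bracketed labels. It assumes detections arrive as dictionaries with `entity_group`, `start`, and `end` keys (the shape produced by an aggregated Hugging Face token-classification pipeline); the function name `mask_entities`, the label names, and the sample text are hypothetical and are not taken from app.py.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hypothetical detections; the label names are illustrative only.
sample = "Contact jane@example.com or call 555-0100."
detections = [
    {"entity_group": "EMAIL", "start": 8, "end": 24},
    {"entity_group": "PHONE", "start": 33, "end": 41},
]
print(mask_entities(sample, detections))
```

Replacing spans from right to left avoids recomputing offsets after each substitution, since a placeholder is rarely the same length as the text it replaces.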
Project Structure
.
├── app.py
├── design_document.docx
├── distilbert-ner
│   ├── checkpoint-10880
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   ├── trainer_state.json
│   └── training_args.bin
├── NER_Masking.pdf
├── NER_Masking.ipynb
├── readme.md
├── sample.txt
└── templates
    └── index.html
Installation
- Download the repository
https://drive.google.com/drive/folders/19ZP1RG_9Ms_kzsiLYBMrtrUbETej5dLq?usp=sharing
- Create and activate environment:
conda create -n pii python=3.10
conda activate pii
- Install dependencies:
pip install flask gdown torch matplotlib transformers datasets seqeval scikit-learn seaborn numpy ipywidgets
Train the Model
To train the model, run every cell in NER_Masking.ipynb.
Run the Application
To run the Flask API:
python app.py
The server will start locally (default: http://127.0.0.1:5000).
API Usage
The API exposes two endpoints to the same PII engine: one for raw text and one for uploaded files.
API Endpoints
1. /predict
Method: POST
Description: Performs PII detection on raw text input.
Request Body (JSON):
{
"text": "Your input text here"
}
Response:
{
"masked": "Masked text output",
"highlighted": "<html with highlighted entities>"
}
- Calls pii_inference(text)
- Returns both masked text and dynamically highlighted HTML output
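A client call to /predict might look like the following sketch, which uses only the Python standard library. The base URL is the default local address given in the run step; the helper names `build_predict_request` and `predict` are hypothetical, and the response keys are the ones documented above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # default local address of the Flask server


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body that /predict expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server started via python app.py, calling predict("My email is jane@example.com") should return a dictionary with the "masked" and "highlighted" keys.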
2. /upload
Method: POST
Description: Uploads a .txt file and processes it in batches of 5 lines.
Form-Data Key:
file
Processing Logic:
- Reads file
- Splits into lines
- Processes 5 lines at a time
- Runs pii_inference on each batch
- Merges results into final masked output
- Generates highlighted HTML
Response:
{
"masked": "Final masked text",
"highlighted": "<html with highlighted entities>"
}
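The batching logic described above can be sketched as follows. This is a minimal illustration, not the code in app.py: it assumes the inference function takes a string and returns the masked string, and it uses a stand-in `fake_infer` in place of the real model call. The names `batch_lines` and `mask_file_text` are hypothetical.

```python
from typing import Callable, Iterator, List


def batch_lines(lines: List[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size lines."""
    for start in range(0, len(lines), batch_size):
        yield lines[start : start + batch_size]


def mask_file_text(content: str, infer: Callable[[str], str], batch_size: int = 5) -> str:
    """Run inference batch by batch and merge the masked results."""
    masked_batches = []
    for batch in batch_lines(content.splitlines(), batch_size):
        # Each batch is joined into one short input for the model.
        masked_batches.append(infer("\n".join(batch)))
    return "\n".join(masked_batches)


def fake_infer(text: str) -> str:
    """Stand-in for the real model call; it only replaces a fixed word."""
    return text.replace("secret", "[MASKED]")


demo = "\n".join(f"line {i} secret" for i in range(12))
print(mask_file_text(demo, fake_infer))
```

A 12-line file is processed as three batches of 5, 5, and 2 lines, so no single inference call sees more than five lines of text.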
Design Insight
- /predict: Low latency, single inference call
- /upload: Memory-efficient batch processing
- Batch size (5 lines) prevents long-sequence instability in transformer inference
Model Details
- Base Model: DistilBERT
- Dataset: ai4privacy/pii-masking-200k
- Training Framework: Hugging Face Trainer
- Batch Size: 32
- Epochs: 10
- Mixed Precision (FP16): Enabled
Performance
- Precision: 90.86%
- Recall: 93.50%
- F1 Score: 92.16%
Strong performance on structured PII types such as Email, URL, SSN, and Username.
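The reported F1 score is the harmonic mean of the precision and recall above. The short check below reproduces it from those two figures; the function name `f1_score` is illustrative and is not the seqeval implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in this section.
f1 = f1_score(0.9086, 0.9350)
print(f"{f1:.2%}")  # prints 92.16%
```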
Future Improvements
- Add CRF layer for structured decoding
- Improve low-performing entity categories
- Model quantization for faster inference
- Hybrid LLM-based validation layer
Model tree for ManiKumarAdapala/distilbert-pii-ner: base model distilbert/distilbert-base-cased