---
license: openrail
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- it
- en
- de
- fr
- es
- nl
- hi
- te
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
tags:
- PII
- NER
- Privacy
- NLP
---
|
|
# NerGuard-0.3B: High-Performance NER for PII Detection

**Model:** `exdsgift/NerGuard-0.3B`

**Base Architecture:** `DeBERTa-v3-base` (435M parameters)

**Context:** Master's Thesis, University of Verona (Department of Computer Science)

**License:** OpenRAIL (academic/research use)

## Abstract

NerGuard-0.3B is a Named Entity Recognition (NER) model specialized in detecting Personally Identifiable Information (PII). Fine-tuned from a `DeBERTa-v3-base` backbone on the `ai4privacy/open-pii-masking-500k-ai4privacy` dataset, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted `F1`-score of **0.9929** on the validation set and **0.9529** on an out-of-domain benchmark (`nvidia/Nemotron-PII`), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.
|
|
|
|
|
## Technical Specifications

* **Architecture:** `DeBERTa-v3-base` (Decoding-enhanced BERT with disentangled attention).
* **Tokenization:** `DeBERTa-v3` fast tokenizer (max sequence length: 512 tokens).
* **Tagging Scheme:** `IOB2` (Inside-Outside-Beginning).
* **Inference Latency:** ~25.21 ms (average per request on CUDA).
* **Training Strategy:** Full fine-tuning (3 epochs, AdamW, learning rate `2e-5`) on AI4Privacy-v2.
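Under the IOB2 scheme mentioned above, the first token of an entity is tagged `B-<TYPE>`, continuation tokens `I-<TYPE>`, and non-entity tokens `O`. A minimal self-contained sketch of decoding such tags back into spans (tokens and labels here are illustrative, not model output):

```python
# Illustrative IOB2-tagged sentence: B- opens an entity, I- continues it, O is outside.
tokens = ["Contact", "John", "Smith", "at", "j.smith@company.com", "."]
labels = ["O", "B-GIVENNAME", "B-SURNAME", "O", "B-EMAIL", "O"]

def decode_iob2(tokens, labels):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):           # a new entity starts here
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)         # continuation of the open entity
        else:                              # O tag (or inconsistent I-) closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

print(decode_iob2(tokens, labels))
# → [('GIVENNAME', 'John'), ('SURNAME', 'Smith'), ('EMAIL', 'j.smith@company.com')]
```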
|
|
|
|
|
## Supported Entity Types (21 Classes)

The model detects the following PII categories:

* **Identity:** `GIVENNAME`, `SURNAME`, `TITLE`, `AGE`, `SEX`, `GENDER`
* **Government/ID:** `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM` (SSN), `TAXNUM`
* **Financial:** `CREDITCARDNUMBER`
* **Contact:** `EMAIL`, `TELEPHONENUM`
* **Location:** `STREET`, `BUILDINGNUM`, `CITY`, `ZIPCODE`
* **Temporal:** `DATE`, `TIME`
|
|
|
|
|
## Performance Evaluation

### Global Metrics

Evaluation performed on the in-domain validation set and the out-of-domain `nvidia/Nemotron-PII` dataset.

| Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
| :--- | :--- | :--- |
| **Accuracy** | **99.29%** | **93.42%** |
| **Weighted Precision** | 0.9930 | 0.9755 |
| **Weighted Recall** | 0.9929 | 0.9342 |
| **Weighted `F1`** | **0.9929** | **0.9529** |
| **Macro `F1`** | 0.9499 | 0.3491* |

*\*Note: The lower macro `F1` on the NVIDIA dataset reflects class imbalance and the absence of certain rare entity types (e.g., building numbers) from the test set.*
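Weighted `F1` averages per-class `F1` by support, while macro `F1` weights every class equally, so rare or absent classes drag the macro score down without much affecting the weighted one. A toy illustration (class names, scores, and supports are hypothetical, not taken from the evaluation):

```python
# Hypothetical per-class F1 scores and supports: two frequent, well-detected
# classes and one rare, poorly-detected class.
per_class_f1 = {"EMAIL": 0.98, "DATE": 0.97, "GENDER": 0.10}
support      = {"EMAIL": 900,  "DATE": 950,  "GENDER": 5}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: mean over classes weighted by support.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"macro F1:    {macro_f1:.4f}")   # the rare class drags the mean down
print(f"weighted F1: {weighted_f1:.4f}")  # dominated by high-support classes
```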
|
|
|
|
|
### Benchmark Comparison

NerGuard-0.3B establishes a new baseline against existing PII solutions.

| Model Framework | `F1`-Score | Latency (ms) | Relative `F1` vs Baseline |
| :--- | :--- | :--- | :--- |
| **`NerGuard-0.3B`** | **0.9037** | **25.21** | **Baseline** |
| `GLiNER` | 0.4463 | 24.68 | -50.6% |
| `Microsoft Presidio` | 0.3158 | 13.53 | -65.1% |
| `spaCy (en_core_web_trf)` | 0.1423 | 9.35 | -84.2% |
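The last column is the percentage change of each framework's `F1` against the NerGuard baseline, i.e. `(f1 - baseline_f1) / baseline_f1`; recomputing it from the table's `F1` scores reproduces the column up to rounding in the final digit:

```python
# Relative F1 vs the NerGuard-0.3B baseline, from the F1 scores in the table.
baseline_f1 = 0.9037
scores = {"GLiNER": 0.4463, "Microsoft Presidio": 0.3158, "spaCy": 0.1423}

for name, f1 in scores.items():
    rel = (f1 - baseline_f1) / baseline_f1 * 100
    print(f"{name}: {rel:+.1f}%")
```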
|
|
|
|
|
### Granular Analysis Summary

* **High Performance (`F1` > `0.95`):** Structured entities (`Email`, `Phone`, `Date`, `Time`) and name components.
* **Moderate Performance (`0.85` < `F1` < `0.95`):** Government IDs (`Passport`, `SSN`) and addresses.
* **Challenges:** Context-heavy entities (street addresses without numbers) and rare classes (gender, tax IDs) exhibit lower recall in out-of-domain settings.
|
|
|
|
|
## Quick Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from pprint import pprint

# Load model & tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Inference on multilingual samples
multilingual_cases = [
    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl.",
]

for text in multilingual_cases:
    results = nlp(text)
    print(f"\n--- Sample: {text} ---")
    pprint(results)
```
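With `aggregation_strategy="simple"`, each result carries character offsets (`start`/`end`) and an `entity_group` label, which makes redaction straightforward. A sketch below uses a mocked result list in the shape the pipeline returns (offsets hand-computed for the first sample sentence, for illustration only):

```python
# Redact detected PII by replacing each span with its entity label.
# `entities` mimics the structure returned by the token-classification
# pipeline with aggregation_strategy="simple" (mocked here for illustration).
text = "Please send the report to Mr. John Smith at j.smith@company.com immediately."
entities = [
    {"entity_group": "GIVENNAME", "start": 30, "end": 34},
    {"entity_group": "SURNAME",   "start": 35, "end": 40},
    {"entity_group": "EMAIL",     "start": 44, "end": 63},
]

def redact(text, entities):
    """Replace each detected span with [LABEL], working right-to-left
    so that earlier character offsets remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(redact(text, entities))
# → Please send the report to Mr. [GIVENNAME] [SURNAME] at [EMAIL] immediately.
```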
|
|
|
|
|
## Limitations

- **Domain Specificity**: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
- **Context Sensitivity**: High recall on numeric identifiers (e.g., `SSN`) may result in false positives when context is ambiguous.
|
|
|
|
|
## Citations

```bibtex
@mastersthesis{nerguard2025,
  title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
  author={[Author Name]},
  year={2025},
  school={University of Verona, Department of Computer Science},
  type={Master's Thesis},
  url={https://huggingface.co/exdsgift/NerGuard-0.3B}
}
```