NerGuard-0.3B: High-Performance NER for PII Detection
Model: exdsgift/NerGuard-0.3B
Base Architecture: DeBERTa-v3-base (435M parameters)
Context: Master's Thesis, University of Verona (Department of Computer Science)
License: Academic/Research Use
Abstract
NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in detecting Personally Identifiable Information (PII). Fine-tuned on the ai4privacy/open-pii-masking-500k-ai4privacy dataset with a DeBERTa-v3-base backbone, the model classifies 21 distinct entity types. It reaches a weighted F1-score of 0.9929 on the in-domain validation set and 0.9529 on an out-of-domain benchmark (nvidia/Nemotron-PII), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.
Technical Specifications
- Architecture: DeBERTa-v3-base (Decoding-enhanced BERT with disentangled attention).
- Tokenization: DeBERTa-v3 Fast Tokenizer (max sequence: 512 tokens).
- Tagging Scheme: IOB2 (Inside-Outside-Beginning).
- Inference Latency: ~25.21 ms (average per request on CUDA).
- Training Strategy: Full fine-tuning (3 epochs, AdamW, learning rate 2e-5) on AI4Privacy-v2.
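To illustrate the IOB2 tagging scheme listed above, the following minimal sketch groups per-token tags into entity spans. The function name `iob2_to_spans` and the sample tokens and tags are hypothetical and chosen only for illustration; the model's real predictions and its internal aggregation may differ.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the currently open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated like B- (a lenient choice).
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tokens and tags, written by hand for illustration.
tokens = ["Mr.", "John", "Smith", "lives", "in", "Via", "Roma"]
tags = ["B-TITLE", "B-GIVENNAME", "B-SURNAME", "O", "O", "B-STREET", "I-STREET"]
spans = iob2_to_spans(tags)
for label, start, end in spans:
    print(label, tokens[start:end])
```

In IOB2 every entity starts with a `B-` tag, so two adjacent entities of the same type remain distinguishable.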
Supported Entity Types (21 Classes)
The model detects the following PII categories:
- Identity: GIVENNAME, SURNAME, TITLE, AGE, SEX, GENDER
- Government/ID: IDCARDNUM, PASSPORTNUM, DRIVERLICENSENUM, SOCIALNUM (SSN), TAXNUM
- Financial: CREDITCARDNUMBER
- Contact: EMAIL, TELEPHONENUM
- Location: STREET, BUILDINGNUM, CITY, ZIPCODE
- Temporal: DATE, TIME
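The grouping above can be written as a small lookup table, which is convenient when a downstream policy treats whole categories differently. This is a sketch only: `PII_CATEGORIES` and `category_of` are hypothetical names, and the label strings are copied from the list above, so they should be checked against the model's own label mapping before use.

```python
# Category -> labels, copied from the list in this section.
PII_CATEGORIES = {
    "Identity": ["GIVENNAME", "SURNAME", "TITLE", "AGE", "SEX", "GENDER"],
    "Government/ID": ["IDCARDNUM", "PASSPORTNUM", "DRIVERLICENSENUM",
                      "SOCIALNUM", "TAXNUM"],
    "Financial": ["CREDITCARDNUMBER"],
    "Contact": ["EMAIL", "TELEPHONENUM"],
    "Location": ["STREET", "BUILDINGNUM", "CITY", "ZIPCODE"],
    "Temporal": ["DATE", "TIME"],
}

# Inverted index: label -> category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(label: str) -> str:
    """Return the category of a label, accepting an optional IOB2 prefix."""
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return LABEL_TO_CATEGORY.get(label, "Unknown")


print(category_of("B-EMAIL"))
print(category_of("ZIPCODE"))
```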
Performance Evaluation
Global Metrics
Evaluation was performed on the in-domain validation set and on the out-of-domain nvidia/Nemotron-PII dataset.
| Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
|---|---|---|
| Accuracy | 99.29% | 93.42% |
| Weighted Precision | 0.9930 | 0.9755 |
| Weighted Recall | 0.9929 | 0.9342 |
| Weighted F1 | 0.9929 | 0.9529 |
| Macro F1 | 0.9499 | 0.3491* |
*Note: Lower Macro F1 on the NVIDIA dataset reflects class imbalance and the absence of specific rare entity types (e.g., Building Numbers) in the test set.
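The gap between weighted and macro F1 described in the note can be reproduced with a small worked example. The per-class scores and supports below are invented for illustration and are not the model's actual results; the point is only that classes with little or no support barely move the weighted average but count fully in the macro average.

```python
def weighted_f1(per_class):
    """Average per-class F1, weighting each class by its support."""
    total = sum(support for _, support in per_class.values())
    return sum(f1 * support for f1, support in per_class.values()) / total


def macro_f1(per_class):
    """Average per-class F1 with equal weight for every class."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)


# Invented (F1, support) pairs, NOT taken from the evaluation above.
per_class = {
    "O": (0.98, 9000),
    "EMAIL": (0.95, 900),
    "TAXNUM": (0.10, 100),       # rare class with poor F1
    "BUILDINGNUM": (0.00, 0),    # absent from the test set, scored as 0
}

w = weighted_f1(per_class)
m = macro_f1(per_class)
print(f"weighted F1 = {w:.4f}")  # 0.9685
print(f"macro F1    = {m:.4f}")  # 0.5075
```

How an absent class is scored depends on the evaluation tool's settings, so the exact macro value can vary between setups.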
Benchmark Comparison
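NerGuard-0.3B achieves the highest F1-score among the compared PII solutions and serves as the baseline for the relative figures below. Its latency is comparable to GLiNER and higher than Microsoft Presidio and spaCy.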
| Model Framework | F1-Score | Latency (ms) | Relative F1 vs Baseline |
|---|---|---|---|
| NerGuard-0.3B | 0.9037 | 25.21 | Baseline |
| GLiNER | 0.4463 | 24.68 | -50.6% |
| Microsoft Presidio | 0.3158 | 13.53 | -65.1% |
| spaCy (en_core_web_trf) | 0.1423 | 9.35 | -84.2% |
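The last column appears to be the relative change in F1 with respect to NerGuard-0.3B, that is, (F1 - baseline) / baseline. The sketch below recomputes it from the rounded F1 values in the table, under that assumption; because the inputs are rounded, the last digit can differ slightly from the published figures.

```python
BASELINE_F1 = 0.9037  # NerGuard-0.3B, from the table above

other_f1 = {
    "GLiNER": 0.4463,
    "Microsoft Presidio": 0.3158,
    "spaCy (en_core_web_trf)": 0.1423,
}


def relative_change_pct(f1: float, baseline: float) -> float:
    """Relative change of f1 versus baseline, in percent."""
    return (f1 - baseline) / baseline * 100.0


relative = {
    name: relative_change_pct(f1, BASELINE_F1)
    for name, f1 in other_f1.items()
}
for name, pct in relative.items():
    print(f"{name}: {pct:.2f}%")
```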
Granular Analysis Summary
- High Performance (F1 > 0.95): Structured entities (Email, Phone, Date, Time) and Name components.
- Moderate Performance (0.85 < F1 < 0.95): Government IDs (Passport, SSN) and Addresses.
- Challenges: Context-heavy entities (Street addresses without numbers) and rare classes (Gender, Tax IDs) exhibit lower recall in out-of-domain settings.
Quick Usage
```python
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load Model & Tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize Pipeline (aggregation merges sub-word tokens into entity spans)
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Inference
multilingual_cases = [
    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl.",
]

for text in multilingual_cases:
    results = nlp(text)
    print(f"\n--- Sample: {text} ---")
    pprint(results)
```
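A common next step is to mask the detected spans. The sketch below assumes the pipeline output format produced with `aggregation_strategy="simple"`, where each result is a dict with `entity_group`, `start`, and `end` keys. The `redact` helper and the hand-written `mock_entities` are hypothetical; they stand in for real model output so the example runs without downloading the model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Process spans right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


sample = "Contact John at j.smith@company.com."

# Hand-written stand-in for pipeline output (not real model predictions).
mock_entities = []
for label, surface in [("GIVENNAME", "John"), ("EMAIL", "j.smith@company.com")]:
    start = sample.index(surface)
    mock_entities.append(
        {"entity_group": label, "start": start, "end": start + len(surface), "score": 0.99}
    )

redacted = redact(sample, mock_entities)
print(redacted)  # Contact [GIVENNAME] at [EMAIL].
```

With the real model, `mock_entities` would be replaced by `nlp(sample)`.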
Limitations
- Domain Specificity: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
- Context Sensitivity: High recall on numeric identifiers (e.g., SSN) may result in false positives if context is ambiguous.
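One possible way to reduce such false positives is to require a higher confidence score for labels that tend to over-trigger. The sketch below assumes each pipeline result carries a `score` field, as token-classification pipeline outputs do. The `filter_by_score` helper, the thresholds, and the mock entities are hypothetical and not tuned on any data.

```python
from typing import Dict, List, Optional


def filter_by_score(
    entities: List[Dict],
    default_threshold: float = 0.5,
    per_label: Optional[Dict[str, float]] = None,
) -> List[Dict]:
    """Keep entities whose score meets the threshold for their label."""
    per_label = per_label or {}
    return [
        ent for ent in entities
        if ent["score"] >= per_label.get(ent["entity_group"], default_threshold)
    ]


# Hand-written stand-in for pipeline output (not real model predictions).
mock_entities = [
    {"entity_group": "SOCIALNUM", "score": 0.62, "start": 10, "end": 21},
    {"entity_group": "EMAIL", "score": 0.97, "start": 30, "end": 49},
]

# Hypothetical policy: stricter threshold for numeric identifiers.
kept = filter_by_score(mock_entities, per_label={"SOCIALNUM": 0.90})
print([ent["entity_group"] for ent in kept])  # ['EMAIL']
```

Raising a threshold trades recall for precision, so any value should be validated on representative data before use.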
Citations
@mastersthesis{nerguard2025,
title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
author={[Author Name]},
year={2025},
school={University of Verona, Department of Computer Science},
type={Master's Thesis},
url={https://huggingface.co/exdsgift/NerGuard-0.3B},
note={Source repository: https://github.com/exdsgift/NerGuard}
}