NerGuard-0.3B: High-Performance NER for PII Detection

Model: exdsgift/NerGuard-0.3B
Base Architecture: DeBERTa-v3-base (435M parameters)
Context: Master's Thesis, University of Verona (Department of Computer Science)
License: Academic/Research Use

Abstract

NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in the detection of Personally Identifiable Information (PII). Fine-tuned from a DeBERTa-v3-base backbone on the ai4privacy/open-pii-masking-500k-ai4privacy dataset, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted F1-score of 0.9929 on the in-domain validation set and 0.9529 on an out-of-domain benchmark (nvidia/Nemotron-PII), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.

Technical Specifications

  • Architecture: DeBERTa-v3-base (Decoding-enhanced BERT with disentangled attention).
  • Tokenization: DeBERTa-v3 Fast Tokenizer (Max sequence: 512 tokens).
  • Tagging Scheme: IOB2 (Inside-Outside-Beginning).
  • Inference Latency: ~25.21 ms average per request (CUDA GPU).
  • Training Strategy: Full fine-tuning (3 epochs, AdamW, learning rate 2e-5) on AI4Privacy-v2.
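
A preprocessing step implied by the IOB2 scheme and subword tokenization above is aligning word-level tags to DeBERTa tokens before fine-tuning. A minimal sketch under the standard transformers convention (the `word_ids` list stands in for `tokenizer(...).word_ids()`; `-100` masks positions from the loss — this illustrates the common recipe, not necessarily the exact training code):

```python
def align_labels(word_labels, word_ids):
    """Align word-level IOB2 tags to subword tokens.

    word_labels: one IOB2 tag per word, e.g. ["B-GIVENNAME", "O"].
    word_ids:    per-token word index as from tokenizer(...).word_ids(),
                 with None for special tokens ([CLS], [SEP]).
    Returns one label per token; -100 marks positions ignored by the loss.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token
            aligned.append(-100)
        elif wid != previous:      # first subword of a word keeps its tag
            aligned.append(word_labels[wid])
        else:                      # continuation subword: masked from loss
            aligned.append(-100)
        previous = wid
    return aligned

# "John Smith emailed" with "emailed" split into two subword pieces:
tags = ["B-GIVENNAME", "B-SURNAME", "O"]
word_ids = [None, 0, 1, 2, 2, None]   # [CLS] John Smith email ##ed [SEP]
print(align_labels(tags, word_ids))
# [-100, 'B-GIVENNAME', 'B-SURNAME', 'O', -100, -100]
```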

Supported Entity Types (21 Classes)

The model detects the following PII categories:

  • Identity: GIVENNAME, SURNAME, TITLE, AGE, SEX, GENDER
  • Government/ID: IDCARDNUM, PASSPORTNUM, DRIVERLICENSENUM, SOCIALNUM (SSN), TAXNUM
  • Financial: CREDITCARDNUMBER
  • Contact: EMAIL, TELEPHONENUM
  • Location: STREET, BUILDINGNUM, CITY, ZIPCODE
  • Temporal: DATE, TIME
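
Under IOB2, each category above contributes a B- and an I- tag on top of a single O tag for non-PII tokens. A minimal sketch building that token-level label space from the listed categories (the bullets enumerate 20 names, so the 21-class figure presumably also counts the O tag; the ordering here is illustrative, not necessarily the model's id2label order):

```python
# Category names copied verbatim from the list above.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "TITLE", "AGE", "SEX", "GENDER",      # identity
    "IDCARDNUM", "PASSPORTNUM", "DRIVERLICENSENUM", "SOCIALNUM",
    "TAXNUM",                                                     # government/ID
    "CREDITCARDNUMBER",                                           # financial
    "EMAIL", "TELEPHONENUM",                                      # contact
    "STREET", "BUILDINGNUM", "CITY", "ZIPCODE",                   # location
    "DATE", "TIME",                                               # temporal
]

# IOB2: one O tag plus a B-/I- pair per category.
labels = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
print(len(labels))  # 41 token-level labels for the 20 categories listed
```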

Performance Evaluation

Global Metrics

Evaluation was performed on the in-domain validation split and on the out-of-domain nvidia/Nemotron-PII dataset.

Metric              Validation Set (In-Domain)   NVIDIA Nemotron (Out-of-Domain)
Accuracy            99.29%                       93.42%
Weighted Precision  0.9930                       0.9755
Weighted Recall     0.9929                       0.9342
Weighted F1         0.9929                       0.9529
Macro F1            0.9499                       0.3491*

*Note: Lower Macro F1 on the NVIDIA dataset reflects class imbalance and the absence of specific rare entity types (e.g., Building Numbers) in the test set.
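
The weighted-vs-macro gap noted above follows directly from how the two averages treat rare classes: macro F1 weights every class equally regardless of support, so a single absent or poorly detected rare class drags it down. A toy illustration (pure Python; the per-class numbers are made up to mirror the imbalance, not taken from the evaluation):

```python
# Hypothetical per-class F1 and support counts: two frequent classes
# score well, one rare class (absent out-of-domain) scores 0.
f1_scores = [0.98, 0.95, 0.0]
supports  = [10_000, 5_000, 10]

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 4))     # 0.6433 -- dragged down by the rare class
print(round(weighted_f1, 4))  # 0.9694 -- barely affected
```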

Benchmark Comparison

NerGuard-0.3B establishes a new baseline compared to existing PII solutions.

Model / Framework         F1-Score   Latency (ms)   Relative F1 vs Baseline
NerGuard-0.3B             0.9037     25.21          Baseline
GLiNER                    0.4463     24.68          -50.6%
Microsoft Presidio        0.3158     13.53          -65.1%
spaCy (en_core_web_trf)   0.1423     9.35           -84.2%
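
The "Relative F1 vs Baseline" column is each framework's F1 expressed as a percentage change against NerGuard-0.3B. A quick check with the table's values (reproduces the reported figures to within a rounding step):

```python
baseline_f1 = 0.9037  # NerGuard-0.3B
others = {
    "GLiNER": 0.4463,
    "Microsoft Presidio": 0.3158,
    "spaCy (en_core_web_trf)": 0.1423,
}

for name, f1 in others.items():
    relative = (f1 / baseline_f1 - 1) * 100  # percent change vs baseline
    print(f"{name}: {relative:.1f}%")
```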

Granular Analysis Summary

  • High Performance (F1 > 0.95): Structured entities (Email, Phone, Date, Time) and Name components.
  • Moderate Performance (0.85 < F1 < 0.95): Government IDs (Passport, SSN) and Addresses.
  • Challenges: Context-heavy entities (Street addresses without numbers) and rare classes (Gender, Tax IDs) exhibit lower recall in out-of-domain settings.

Quick Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from pprint import pprint

# Load Model & Tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize Pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Inference
multilingual_cases = [
    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl."
]

for text in multilingual_cases:
    results = nlp(text)
    print(f"\n--- Sample: {text} ---")
    pprint(results)
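
A common next step after the pipeline call above is redacting the detected spans. A minimal sketch that works directly on the character offsets the aggregated pipeline returns (`mask_pii` is a hypothetical helper, not part of the model release; spans are replaced right-to-left so earlier offsets stay valid):

```python
def mask_pii(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    entities: dicts with 'start', 'end', 'entity_group' keys, as produced
    by pipeline(..., aggregation_strategy="simple").
    """
    # Replace from the end of the string so offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

sample = "Please send the report to Mr. John Smith at j.smith@company.com immediately."
detected = [  # shape of nlp(sample) output; offsets worked out by hand
    {"entity_group": "GIVENNAME", "start": 30, "end": 34},
    {"entity_group": "SURNAME",   "start": 35, "end": 40},
    {"entity_group": "EMAIL",     "start": 44, "end": 63},
]
print(mask_pii(sample, detected))
# Please send the report to Mr. [GIVENNAME] [SURNAME] at [EMAIL] immediately.
```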

Limitations

  • Domain Specificity: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
  • Context Sensitivity: High recall on numeric identifiers (e.g., SSN) may result in false positives if context is ambiguous.
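
One practical mitigation for the numeric-identifier false positives noted above is to gate high-risk labels on the pipeline's confidence score. A minimal sketch (the threshold values are illustrative assumptions, not tuned recommendations; `score` and `entity_group` are the fields the aggregated pipeline emits):

```python
# Per-label minimum confidence; stricter for context-ambiguous numeric IDs.
# These cutoffs are illustrative, not tuned values.
THRESHOLDS = {"SOCIALNUM": 0.90, "CREDITCARDNUMBER": 0.90, "TAXNUM": 0.85}
DEFAULT_THRESHOLD = 0.50

def filter_entities(entities):
    """Keep only detections whose score clears their label's threshold."""
    return [
        e for e in entities
        if e["score"] >= THRESHOLDS.get(e["entity_group"], DEFAULT_THRESHOLD)
    ]

# Example output shape from nlp(text):
raw = [
    {"entity_group": "EMAIL",     "score": 0.99, "word": "a@b.com"},
    {"entity_group": "SOCIALNUM", "score": 0.62, "word": "123456789"},  # dropped
]
kept = filter_entities(raw)
print([e["entity_group"] for e in kept])  # ['EMAIL']
```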

Citations

@mastersthesis{nerguard2025,
  title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
  author={[Author Name]},
  year={2025},
  school={University of Verona, Department of Computer Science},
  type={Master's Thesis},
  url={https://huggingface.co/exdsgift/NerGuard-0.3B}
}