NerGuard-0.3B: High-Performance NER for PII Detection

Model: exdsgift/NerGuard-0.3B
Base Architecture: DeBERTa-v3-base (435M parameters)
Context: Master's Thesis, University of Verona (Department of Computer Science)
License: Academic/Research Use

Abstract

NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in the detection of Personally Identifiable Information (PII). Fine-tuned from a DeBERTa-v3-base backbone on the ai4privacy/open-pii-masking-500k-ai4privacy dataset, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted F1-score of 0.9929 on the in-domain validation set and 0.9529 on an out-of-domain benchmark (nvidia/Nemotron-PII), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.

Technical Specifications

  • Architecture: DeBERTa-v3-base (Decoding-enhanced BERT with disentangled attention).
  • Tokenization: DeBERTa-v3 Fast Tokenizer (Max sequence: 512 tokens).
  • Tagging Scheme: IOB2 (Inside-Outside-Beginning).
  • Inference Latency: ~25.21 ms average per request (CUDA GPU).
  • Training Strategy: Full fine-tuning (3 epochs, AdamW, learning rate 2e-5) on AI4Privacy-v2.
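
A preprocessing step implied by the IOB2 scheme and subword tokenization above is aligning word-level tags to DeBERTa tokens before fine-tuning. A minimal sketch under the standard transformers convention (the `word_ids` list stands in for `tokenizer(...).word_ids()`; `-100` masks positions from the loss — this illustrates the common recipe, not necessarily the exact training code):

```python
def align_labels(word_labels, word_ids):
    """Align word-level IOB2 tags to subword tokens.

    word_labels: one IOB2 tag per word, e.g. ["B-GIVENNAME", "O"].
    word_ids:    per-token word index as from tokenizer(...).word_ids(),
                 with None for special tokens ([CLS], [SEP]).
    Returns one label per token; -100 marks positions ignored by the loss.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token
            aligned.append(-100)
        elif wid != previous:      # first subword of a word keeps its tag
            aligned.append(word_labels[wid])
        else:                      # continuation subword: masked from loss
            aligned.append(-100)
        previous = wid
    return aligned

# "John Smith emailed" with "emailed" split into two subword pieces:
tags = ["B-GIVENNAME", "B-SURNAME", "O"]
word_ids = [None, 0, 1, 2, 2, None]   # [CLS] John Smith email ##ed [SEP]
print(align_labels(tags, word_ids))
# [-100, 'B-GIVENNAME', 'B-SURNAME', 'O', -100, -100]
```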

Supported Entity Types (21 Classes)

The model detects the following PII categories:

  • Identity: GIVENNAME, SURNAME, TITLE, AGE, SEX, GENDER
  • Government/ID: IDCARDNUM, PASSPORTNUM, DRIVERLICENSENUM, SOCIALNUM (SSN), TAXNUM
  • Financial: CREDITCARDNUMBER
  • Contact: EMAIL, TELEPHONENUM
  • Location: STREET, BUILDINGNUM, CITY, ZIPCODE
  • Temporal: DATE, TIME
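
Under IOB2, each category above contributes a B- and an I- tag on top of a single O tag for non-PII tokens. A minimal sketch building that token-level label space from the listed categories (the bullets enumerate 20 names, so the 21-class figure presumably also counts the O tag; the ordering here is illustrative, not necessarily the model's id2label order):

```python
# Category names copied verbatim from the list above.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "TITLE", "AGE", "SEX", "GENDER",      # identity
    "IDCARDNUM", "PASSPORTNUM", "DRIVERLICENSENUM", "SOCIALNUM",
    "TAXNUM",                                                     # government/ID
    "CREDITCARDNUMBER",                                           # financial
    "EMAIL", "TELEPHONENUM",                                      # contact
    "STREET", "BUILDINGNUM", "CITY", "ZIPCODE",                   # location
    "DATE", "TIME",                                               # temporal
]

# IOB2: one O tag plus a B-/I- pair per category.
labels = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
print(len(labels))  # 41 token-level labels for the 20 categories listed
```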

Performance Evaluation

Global Metrics

Evaluation was performed on the in-domain validation split and on the out-of-domain nvidia/Nemotron-PII dataset.

Metric              Validation Set (In-Domain)   NVIDIA Nemotron (Out-of-Domain)
Accuracy            99.29%                       93.42%
Weighted Precision  0.9930                       0.9755
Weighted Recall     0.9929                       0.9342
Weighted F1         0.9929                       0.9529
Macro F1            0.9499                       0.3491*

*Note: Lower Macro F1 on the NVIDIA dataset reflects class imbalance and the absence of specific rare entity types (e.g., Building Numbers) in the test set.
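
The weighted-vs-macro gap noted above follows directly from how the two averages treat rare classes: macro F1 weights every class equally regardless of support, so a single absent or poorly detected rare class drags it down. A toy illustration (pure Python; the per-class numbers are made up to mirror the imbalance, not taken from the evaluation):

```python
# Hypothetical per-class F1 and support counts: two frequent classes
# score well, one rare class (absent out-of-domain) scores 0.
f1_scores = [0.98, 0.95, 0.0]
supports  = [10_000, 5_000, 10]

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 4))     # 0.6433 -- dragged down by the rare class
print(round(weighted_f1, 4))  # 0.9694 -- barely affected
```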

Benchmark Comparison

NerGuard-0.3B establishes a new baseline compared to existing PII solutions.

Model / Framework         F1-Score   Latency (ms)   Relative F1 vs Baseline
NerGuard-0.3B             0.9037     25.21          Baseline
GLiNER                    0.4463     24.68          -50.6%
Microsoft Presidio        0.3158     13.53          -65.1%
spaCy (en_core_web_trf)   0.1423     9.35           -84.2%
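
The "Relative F1 vs Baseline" column is each framework's F1 expressed as a percentage change against NerGuard-0.3B. A quick check with the table's values (reproduces the reported figures to within a rounding step):

```python
baseline_f1 = 0.9037  # NerGuard-0.3B
others = {
    "GLiNER": 0.4463,
    "Microsoft Presidio": 0.3158,
    "spaCy (en_core_web_trf)": 0.1423,
}

for name, f1 in others.items():
    relative = (f1 / baseline_f1 - 1) * 100  # percent change vs baseline
    print(f"{name}: {relative:.1f}%")
```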

Granular Analysis Summary

  • High Performance (F1 > 0.95): Structured entities (Email, Phone, Date, Time) and Name components.
  • Moderate Performance (0.85 < F1 < 0.95): Government IDs (Passport, SSN) and Addresses.
  • Challenges: Context-heavy entities (Street addresses without numbers) and rare classes (Gender, Tax IDs) exhibit lower recall in out-of-domain settings.

Quick Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from pprint import pprint

# Load Model & Tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize Pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Inference
multilingual_cases = [
    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl."
]

for text in multilingual_cases:
    results = nlp(text)
    print(f"\n--- Sample: {text} ---")
    pprint(results)
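
A common next step after the pipeline call above is redacting the detected spans. A minimal sketch that works directly on the character offsets the aggregated pipeline returns (`mask_pii` is a hypothetical helper, not part of the model release; spans are replaced right-to-left so earlier offsets stay valid):

```python
def mask_pii(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    entities: dicts with 'start', 'end', 'entity_group' keys, as produced
    by pipeline(..., aggregation_strategy="simple").
    """
    # Replace from the end of the string so offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

sample = "Please send the report to Mr. John Smith at j.smith@company.com immediately."
detected = [  # shape of nlp(sample) output; offsets worked out by hand
    {"entity_group": "GIVENNAME", "start": 30, "end": 34},
    {"entity_group": "SURNAME",   "start": 35, "end": 40},
    {"entity_group": "EMAIL",     "start": 44, "end": 63},
]
print(mask_pii(sample, detected))
# Please send the report to Mr. [GIVENNAME] [SURNAME] at [EMAIL] immediately.
```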

Limitations

  • Domain Specificity: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
  • Context Sensitivity: High recall on numeric identifiers (e.g., SSN) may result in false positives if context is ambiguous.
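
One practical mitigation for the numeric-identifier false positives noted above is to gate high-risk labels on the pipeline's confidence score. A minimal sketch (the threshold values are illustrative assumptions, not tuned recommendations; `score` and `entity_group` are the fields the aggregated pipeline emits):

```python
# Per-label minimum confidence; stricter for context-ambiguous numeric IDs.
# These cutoffs are illustrative, not tuned values.
THRESHOLDS = {"SOCIALNUM": 0.90, "CREDITCARDNUMBER": 0.90, "TAXNUM": 0.85}
DEFAULT_THRESHOLD = 0.50

def filter_entities(entities):
    """Keep only detections whose score clears their label's threshold."""
    return [
        e for e in entities
        if e["score"] >= THRESHOLDS.get(e["entity_group"], DEFAULT_THRESHOLD)
    ]

# Example output shape from nlp(text):
raw = [
    {"entity_group": "EMAIL",     "score": 0.99, "word": "a@b.com"},
    {"entity_group": "SOCIALNUM", "score": 0.62, "word": "123456789"},  # dropped
]
kept = filter_entities(raw)
print([e["entity_group"] for e in kept])  # ['EMAIL']
```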

Citations

@mastersthesis{nerguard2025,
  title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
  author={[Author Name]},
  year={2025},
  school={University of Verona, Department of Computer Science},
  type={Master's Thesis},
  url={https://huggingface.co/exdsgift/NerGuard-0.3B}
}