---
license: openrail
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- it
- en
- de
- fr
- es
- nl
- hi
- te
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
tags:
- PII
- Ner
- Privacy
- NLP
---
# NerGuard-0.3B: High-Performance NER for PII Detection
**Model:** `exdsgift/NerGuard-0.3B`
**Base Architecture:** `DeBERTa-v3-base` (435M parameters)
**Context:** Master's Thesis, University of Verona (Department of Computer Science)
**License:** OpenRAIL (academic/research use)
## Abstract
NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in the detection of Personally Identifiable Information (PII). Fine-tuned from a `DeBERTa-v3-base` backbone on the `ai4privacy/open-pii-masking-500k-ai4privacy` dataset, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted `F1`-score of **0.9929** on the validation set and **0.9529** on an out-of-domain benchmark (`nvidia/Nemotron-PII`), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.
## Technical Specifications
* **Architecture:** `DeBERTa-v3-base` (Decoding-enhanced BERT with disentangled attention).
* **Tokenization:** `DeBERTa-v3 Fast Tokenizer` (Max sequence: 512 tokens).
* **Tagging Scheme:** `IOB2` (Inside-Outside-Beginning).
* **Inference Latency:** `~25.21 ms` (Average per request on CUDA).
* **Training Strategy:** Full fine-tuning (3 epochs, AdamW, `2e-5` learning rate) on AI4Privacy-v2.
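The training setup above could be reproduced with `transformers.TrainingArguments` along these lines. This is a minimal sketch: the epochs, optimizer, and learning rate come from this card, while the batch size and weight decay are illustrative assumptions not stated here.

```python
from transformers import TrainingArguments

# Stated in this card: 3 epochs, AdamW, 2e-5 LR, full fine-tuning
# of DeBERTa-v3-base. Batch size and weight decay are assumptions.
training_args = TrainingArguments(
    output_dir="nerguard-deberta-v3-base",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # assumption
    weight_decay=0.01,               # assumption
)
```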
## Supported Entity Types (21 Classes)
The model detects the following PII categories:
* **Identity:** `GIVENNAME`, `SURNAME`, `TITLE`, `AGE`, `SEX`, `GENDER`
* **Government/ID:** `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM` (SSN), `TAXNUM`
* **Financial:** `CREDITCARDNUMBER`
* **Contact:** `EMAIL`, `TELEPHONENUM`
* **Location:** `STREET`, `BUILDINGNUM`, `CITY`, `ZIPCODE`
* **Temporal:** `DATE`, `TIME`
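Under the IOB2 scheme, each entity type expands into a `B-` (beginning) and an `I-` (inside) tag, plus a single `O` tag for non-PII tokens. A minimal sketch of building the label maps a token-classification head expects (the four types shown are a subset of the list above, chosen for brevity):

```python
def iob2_labels(entity_types):
    """Expand entity type names into an IOB2 tag set: O plus B-/I- per type."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")
        labels.append(f"I-{ent}")
    return labels

# id2label / label2id mappings as used by token-classification models.
ENTITY_TYPES = ["GIVENNAME", "SURNAME", "EMAIL", "TELEPHONENUM"]  # illustrative subset
labels = iob2_labels(ENTITY_TYPES)
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}
```

Applied to all entity types, the same expansion yields the full tag set used by the model.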
## Performance Evaluation
### Global Metrics
Evaluation performed across In-Domain (Validation) and Out-of-Domain `nvidia/Nemotron-PII` datasets.
| Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
| :--- | :--- | :--- |
| **Accuracy** | **99.29%** | **93.42%** |
| **Weighted Precision** | 0.9930 | 0.9755 |
| **Weighted Recall** | 0.9929 | 0.9342 |
| **Weighted `F1`** | **0.9929** | **0.9529** |
| **Macro `F1`** | 0.9499 | 0.3491* |
*\*Note: Lower Macro `F1` on the NVIDIA dataset reflects class imbalance and the absence of specific rare entity types (e.g., Building Numbers) in the test set.*
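The gap between weighted and macro `F1` follows directly from how each average treats class support: macro averages per-class scores equally, while weighted scales them by support. A toy illustration with made-up numbers (not values from the evaluation):

```python
# Per-class F1 and support (token count) -- illustrative values only.
per_class = {
    "EMAIL":  (0.99, 900),  # frequent, well-detected class
    "GENDER": (0.20, 10),   # rare class with poor out-of-domain recall
}

# Macro: unweighted mean of per-class F1 -- the rare class drags it down.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted: support-weighted mean -- dominated by the frequent class.
total_support = sum(sup for _, sup in per_class.values())
weighted_f1 = sum(f1 * sup for f1, sup in per_class.values()) / total_support

print(round(macro_f1, 4))     # 0.595
print(round(weighted_f1, 4))  # 0.9813
```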
### Benchmark Comparison
NerGuard-0.3B establishes a new baseline compared to existing PII solutions.
| Model Framework | `F1`-Score | Latency (ms) | Relative `F1` vs Baseline |
| :--- | :--- | :--- | :--- |
| **`NerGuard-0.3B`** | **0.9037** | **25.21** | **Baseline** |
| `Gliner` | 0.4463 | 24.68 | -50.6% |
| `Microsoft Presidio` | 0.3158 | 13.53 | -65.1% |
| `spaCy (en_core_web_trf)` | 0.1423 | 9.35 | -84.2% |
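Per-request latency figures like those above are typically averaged over repeated single-request calls after a few warmup runs. A minimal timing sketch (the warmup count and use of `time.perf_counter` are choices of this sketch, not details from the thesis):

```python
import time

def mean_latency_ms(fn, inputs, warmup=2):
    """Average wall-clock time per call of fn over inputs, in milliseconds."""
    for text in inputs[:warmup]:  # discard warmup calls (model/cache init)
        fn(text)
    start = time.perf_counter()
    for text in inputs:
        fn(text)
    return (time.perf_counter() - start) * 1000 / len(inputs)

# Usage with the NER pipeline from the Quick Usage section below:
#   mean_latency_ms(nlp, multilingual_cases)
```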
### Granular Analysis Summary
* **High Performance (`F1` > `0.95`):** Structured entities (`Email`, `Phone`, `Date`, `Time`) and Name components.
* **Moderate Performance (`0.85` < `F1` < `0.95`):** Government IDs (`Passport`, `SSN`) and Addresses.
* **Challenges:** Context-heavy entities (Street addresses without numbers) and rare classes (Gender, Tax IDs) exhibit lower recall in out-of-domain settings.
## Quick Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from pprint import pprint
# Load Model & Tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Initialize Pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Inference
multilingual_cases = [
"Please send the report to Mr. John Smith at j.smith@company.com immediately.",
"J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
"Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
"La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
"Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
"Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl."
]
for text in multilingual_cases:
results = nlp(text)
print(f"\n--- Sample: {text} ---")
pprint(results)
```
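The aggregated pipeline output (spans with `entity_group`, `start`, and `end` keys under `aggregation_strategy="simple"`) can be turned directly into a masked string. A small helper sketch; the `[TYPE]` placeholder format is a choice of this sketch, not part of the model:

```python
def redact(text, entities):
    """Replace each detected PII span with a [ENTITY_GROUP] placeholder.

    `entities` follows the pipeline's aggregation_strategy="simple" format:
    dicts with "entity_group", "start", and "end" keys.
    """
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Example with mock pipeline output:
print(redact("Please email j.smith@company.com",
             [{"entity_group": "EMAIL", "start": 13, "end": 32}]))
# -> "Please email [EMAIL]"
```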
## Limitations
- **Domain Specificity**: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
- **Context Sensitivity**: High recall on numeric identifiers (e.g., `SSN`) may result in false positives if context is ambiguous.
## Citations
```bibtex
@mastersthesis{nerguard2025,
title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
author={[Author Name]},
year={2025},
school={University of Verona, Department of Computer Science},
type={Master's Thesis},
url={https://huggingface.co/exdsgift/NerGuard-0.3B}
}
```