---
license: openrail
library_name: transformers
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- en
- de
- fr
- it
- es
- pt
- nl
- pl
metrics:
- f1
- precision
- recall
base_model:
- microsoft/mdeberta-v3-base
pipeline_tag: token-classification
tags:
- ner
- pii
- token-classification
- privacy
- gdpr
- mdeberta
- multilingual
model-index:
- name: NerGuard-0.3B
results:
- task:
type: token-classification
name: PII Detection
dataset:
name: AI4Privacy Open PII Masking 500K (validation)
type: ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- type: f1
value: 0.9963
name: F1 (macro)
- type: f1
value: 0.9933
name: F1 (weighted)
- type: accuracy
value: 0.9926
name: Accuracy
- task:
type: token-classification
name: PII Detection
dataset:
name: NVIDIA Nemotron-PII (1000 samples, Tier 2 eval — 16 aligned entity types)
type: nvidia/Nemotron-PII-200k
metrics:
- type: f1
value: 0.4175
name: F1 (macro)
- type: f1
value: 0.6105
name: F1 (micro)
- type: f1
value: 0.6076
name: Entity F1 (span-level)
- type: precision
value: 0.5616
name: Precision
- type: recall
value: 0.6619
name: Recall
---
# NerGuard-0.3B

[Model on Hugging Face](https://huggingface.co/exdsgift/NerGuard-0.3B) · [GitHub](https://github.com/exdsgift/NerGuard) · [License: OpenRAIL](https://huggingface.co/spaces/CompVis/stable-diffusion-license)
**NerGuard-0.3B** is a multilingual transformer model for Personally Identifiable Information (PII) detection, built on [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It performs token-level classification across **20 PII entity types** using BIO tagging, covering names, addresses, government IDs, financial data, and contact information across **8 European languages**.
Trained on 500K+ samples from [AI4Privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy), it achieves **F1-macro 99.63%** on in-distribution validation. On the out-of-distribution NVIDIA Nemotron-PII benchmark (1,000 samples, 7-system comparison), the base model ranks **4th out of 7 systems** on F1-macro and **3rd on Entity-F1** — without any LLM augmentation. For the full hybrid system with entropy-based LLM routing (which ranks **1st on both F1-macro and F1-micro**), see the [NerGuard GitHub repository](https://github.com/exdsgift/NerGuard).
> **Note on labels**: The model outputs its native AI4Privacy label space (e.g., `GIVENNAME`, `SURNAME`, `SOCIALNUM`). The NerGuard pipeline includes a semantic alignment layer that maps these to benchmark-specific label spaces (e.g., NVIDIA Nemotron-PII uses `first_name`, `ssn`).
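
The alignment layer can be pictured as a simple lookup from the native label space to the benchmark's. The mapping below is a hypothetical sketch for illustration, not the exact table NerGuard ships:

```python
# Illustrative subset of an AI4Privacy -> Nemotron-PII label mapping.
# The real alignment table lives in the NerGuard repository.
ALIGNMENT = {
    "GIVENNAME": "first_name",
    "SURNAME": "last_name",
    "SOCIALNUM": "ssn",
    "EMAIL": "email",
    "TELEPHONENUM": "phone_number",
}

def align_label(native_label: str) -> str:
    """Map a native label to the benchmark label space; pass through
    labels that have no aligned counterpart."""
    return ALIGNMENT.get(native_label, native_label)
```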
## Supported Entity Types
| Category | Entity Types |
| --- | --- |
| **Person** | `GIVENNAME`, `SURNAME`, `TITLE` |
| **Location** | `CITY`, `STREET`, `BUILDINGNUM`, `ZIPCODE` |
| **Government ID** | `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM`, `TAXNUM` |
| **Financial** | `CREDITCARDNUMBER` |
| **Contact** | `EMAIL`, `TELEPHONENUM` |
| **Temporal** | `DATE`, `TIME` |
| **Demographic** | `AGE`, `SEX`, `GENDER` |
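
Under BIO tagging, the 20 entity types above expand to a 41-way tag set (one `B-` and one `I-` tag per type, plus `O` for non-PII tokens). A minimal sketch of that label space:

```python
# The 20 AI4Privacy entity types listed in the table above.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "TITLE",
    "CITY", "STREET", "BUILDINGNUM", "ZIPCODE",
    "IDCARDNUM", "PASSPORTNUM", "DRIVERLICENSENUM", "SOCIALNUM", "TAXNUM",
    "CREDITCARDNUMBER",
    "EMAIL", "TELEPHONENUM",
    "DATE", "TIME",
    "AGE", "SEX", "GENDER",
]

# BIO expansion: B-/I- per type, plus the outside tag O -> 41 labels.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(LABELS))  # 41
```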
## Evaluation Results
### In-Distribution: AI4Privacy (validation split)
| Metric | Value |
| --- | --- |
| F1 (macro) | **99.63%** |
| F1 (weighted) | 99.33% |
| Accuracy | 99.26% |
### Out-of-Distribution: NVIDIA Nemotron-PII (1,000 samples)
Tier 2 evaluation: semantic alignment over 16 comparable entity types. Seven systems compared.
| System | F1-macro | F1-micro | Entity-F1 | Latency (ms) |
| --- | --- | --- | --- | --- |
| **NerGuard Hybrid V2** (base + LLM) | **0.5069** | **0.7015** | 0.6634 | 41 |
| NerGuard Hybrid V1 | 0.4943 | 0.6862 | 0.6475 | **31** |
| Presidio | 0.4933 | 0.5493 | **0.6680** | 86 |
| Piiranha | 0.4731 | 0.6501 | 0.6195 | **31** |
| **NerGuard Base (this model)** | 0.4175 | 0.6105 | 0.6076 | 33 |
| spaCy (en_core_web_trf) | 0.3607 | 0.4175 | 0.5527 | 144 |
| dslim/bert-base-NER | 0.3331 | 0.4821 | 0.6225 | 38 |
The base model (no LLM) achieves 33 ms median latency. The entropy-gated hybrid adds +8.94 pt F1-macro by routing only uncertain spans (~3% of tokens) to an LLM for disambiguation.
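
The routing decision described above can be sketched as a Shannon-entropy gate over each token's class distribution. The threshold below is illustrative, not NerGuard's tuned value (see the GitHub repository for the real implementation):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's softmax distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_to_llm(probs, threshold=0.5):
    """Escalate a token to the LLM only when the base model is uncertain.
    The 0.5-nat threshold is a hypothetical placeholder."""
    return token_entropy(probs) > threshold

# A confident prediction stays with the fast base model...
assert not route_to_llm([0.98, 0.01, 0.01])
# ...while an ambiguous one is routed to the LLM for disambiguation.
assert route_to_llm([0.4, 0.35, 0.25])
```

Gating on entropy rather than top-1 probability also catches cases where two labels are nearly tied, which is exactly where span disambiguation helps most.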
## Usage
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="exdsgift/NerGuard-0.3B",
    aggregation_strategy="simple",
)

results = ner("My name is John Smith and my email is john@acme.com")
for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2%})")

# John -> GIVENNAME (99.82%)
# Smith -> SURNAME (99.71%)
# john@acme.com -> EMAIL (99.54%)
```
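
A common downstream step is redacting the detected spans. The sketch below consumes entity dicts shaped like the pipeline output above (with `start`/`end` character offsets, which `aggregation_strategy="simple"` provides):

```python
def redact(text, entities):
    """Replace each detected span with its entity label, working right
    to left so earlier character offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Entity dicts shaped like the pipeline output above.
sample = "My name is John Smith"
ents = [
    {"entity_group": "GIVENNAME", "start": 11, "end": 15},
    {"entity_group": "SURNAME", "start": 16, "end": 21},
]
print(redact(sample, ents))  # My name is [GIVENNAME] [SURNAME]
```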
For the full hybrid pipeline with LLM routing and regex validation:
```python
from src.inference.tester import PIITester
tester = PIITester(model_path="exdsgift/NerGuard-0.3B")
entities = tester.get_entities("John Smith, SSN: 078-05-1120, email: john@acme.com")
```
## Training Details
| Parameter | Value |
| --- | --- |
| Base model | `microsoft/mdeberta-v3-base` |
| Dataset | AI4Privacy Open PII Masking 500K |
| Training samples | ~450K |
| Max sequence length | 512 (stride 382) |
| Learning rate | 2e-5 |
| Batch size | 32 |
| Epochs | 3 |
| Hardware | 2× NVIDIA A100 |
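
The 512-token window with stride 382 implies overlapping chunks for long inputs. Interpreting `stride` the way `transformers` tokenizers do (the number of tokens shared between consecutive windows, so each window advances by 512 − 382 = 130 tokens), the chunking can be sketched as:

```python
def window_spans(n_tokens, max_len=512, stride=382):
    """Overlapping token windows, with `stride` read as the overlap
    between consecutive windows (the `transformers` convention)."""
    step = max_len - stride  # each window advances by 130 tokens
    spans, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans

print(window_spans(700))  # [(0, 512), (130, 642), (260, 700)]
```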
## Citation
```bibtex
@mastersthesis{durante2026nerguard,
title = {Engineering a Scalable Multilingual PII Detection System
with mDeBERTa-v3 and LLM-Based Validation},
author = {Durante, Gabriele},
year = {2026},
school = {University of Verona},
department = {Department of Computer Science}
}
```