---
license: openrail
library_name: transformers
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- en
- de
- fr
- it
- es
- pt
- nl
- pl
metrics:
- f1
- precision
- recall
base_model:
- microsoft/mdeberta-v3-base
pipeline_tag: token-classification
tags:
- ner
- pii
- token-classification
- privacy
- gdpr
- mdeberta
- multilingual
model-index:
  - name: NerGuard-0.3B
    results:
      - task:
          type: token-classification
          name: PII Detection
        dataset:
          name: AI4Privacy Open PII Masking 500K (validation)
          type: ai4privacy/open-pii-masking-500k-ai4privacy
        metrics:
          - type: f1
            value: 0.9963
            name: F1 (macro)
          - type: f1
            value: 0.9933
            name: F1 (weighted)
          - type: accuracy
            value: 0.9926
            name: Accuracy
      - task:
          type: token-classification
          name: PII Detection
        dataset:
          name: NVIDIA Nemotron-PII (1000 samples, Tier 2 eval, 16 aligned entity types)
          type: nvidia/Nemotron-PII-200k
        metrics:
          - type: f1
            value: 0.4175
            name: F1 (macro)
          - type: f1
            value: 0.6105
            name: F1 (micro)
          - type: f1
            value: 0.6076
            name: Entity F1 (span-level)
          - type: precision
            value: 0.5616
            name: Precision
          - type: recall
            value: 0.6619
            name: Recall
---

[![Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.downloads&label=%F0%9F%A4%97%20Downloads&color=blue)](https://huggingface.co/exdsgift/NerGuard-0.3B)
[![GitHub](https://img.shields.io/github/stars/exdsgift/NerGuard?style=social)](https://github.com/exdsgift/NerGuard)
[![Likes](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.likes&label=%E2%9D%A4%20Likes&color=red)](https://huggingface.co/exdsgift/NerGuard-0.3B)
[![License: OpenRAIL](https://img.shields.io/badge/License-OpenRAIL-green.svg)](https://huggingface.co/spaces/CompVis/stable-diffusion-license)
[![Model Size](https://img.shields.io/badge/Parameters-279M-orange)](https://huggingface.co/exdsgift/NerGuard-0.3B)

**NerGuard-0.3B** is a multilingual transformer model for Personally Identifiable Information (PII) detection, built on [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It performs token-level classification across **20 PII entity types** using BIO tagging, covering names, addresses, government IDs, financial data, and contact information across **8 European languages**.

Trained on 500K+ samples from [AI4Privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy), it achieves **F1-macro 99.63%** on in-distribution validation. On the out-of-distribution NVIDIA Nemotron-PII benchmark (1,000 samples, 7-system comparison), the base model outperforms the general-purpose NER baselines (spaCy, dslim/bert-base-NER) on both F1-macro and F1-micro without any LLM augmentation. For the full hybrid system with entropy-based LLM routing (which ranks **1st on both F1-macro and F1-micro**), see the [NerGuard GitHub repository](https://github.com/exdsgift/NerGuard).

> **Note on labels**: The model outputs its native AI4Privacy label space (e.g., `GIVENNAME`, `SURNAME`, `SOCIALNUM`). The NerGuard pipeline includes a semantic alignment layer that maps these to benchmark-specific label spaces (e.g., NVIDIA Nemotron-PII uses `first_name`, `ssn`).
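
A minimal sketch of such an alignment layer, assuming a hand-written lookup table (only the `GIVENNAME` → `first_name` and `SOCIALNUM` → `ssn` pairs come from this card; the remaining entries are illustrative):

```python
# Hypothetical mapping from native AI4Privacy labels to a benchmark's
# label space. Only GIVENNAME -> first_name and SOCIALNUM -> ssn are
# documented above; the other pairs are illustrative assumptions.
AI4PRIVACY_TO_NEMOTRON = {
    "GIVENNAME": "first_name",
    "SURNAME": "last_name",          # assumed
    "SOCIALNUM": "ssn",
    "EMAIL": "email",                # assumed
    "TELEPHONENUM": "phone_number",  # assumed
}

def align_label(native_label: str):
    """Return the benchmark label, or None when no counterpart exists."""
    # Strip a BIO prefix ("B-" / "I-") if present.
    base = native_label[2:] if native_label[:2] in ("B-", "I-") else native_label
    return AI4PRIVACY_TO_NEMOTRON.get(base)
```

Types without a counterpart are dropped rather than force-mapped, which is what restricts the Tier 2 comparison to the aligned entity types.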

## Supported Entity Types

| Category | Entity Types |
| --- | --- |
| **Person** | `GIVENNAME`, `SURNAME`, `TITLE` |
| **Location** | `CITY`, `STREET`, `BUILDINGNUM`, `ZIPCODE` |
| **Government ID** | `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM`, `TAXNUM` |
| **Financial** | `CREDITCARDNUMBER` |
| **Contact** | `EMAIL`, `TELEPHONENUM` |
| **Temporal** | `DATE`, `TIME` |
| **Demographic** | `AGE`, `SEX`, `GENDER` |
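
With BIO tagging, each type in the table contributes a `B-` (begin) and an `I-` (inside) tag, plus the shared `O` (outside) tag; a quick sketch of how the label set is derived:

```python
def bio_labels(entity_types):
    """Build a BIO tag set: the shared O tag plus B-/I- per entity type."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels

# Over the 20 entity types above this yields 2 * 20 + 1 = 41 tags.
```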

## Evaluation Results

### In-Distribution: AI4Privacy (validation split)

| Metric | Value |
| --- | --- |
| F1 (macro) | **99.63%** |
| F1 (weighted) | 99.33% |
| Accuracy | 99.26% |

### Out-of-Distribution: NVIDIA Nemotron-PII (1,000 samples)

Tier 2 evaluation: semantic alignment over 16 comparable entity types. Seven systems compared.

| System | F1-macro | F1-micro | Entity-F1 | Latency (ms) |
| --- | --- | --- | --- | --- |
| **NerGuard Hybrid V2** (base + LLM) | **0.5069** | **0.7015** | 0.6634 | 41 |
| NerGuard Hybrid V1 | 0.4943 | 0.6862 | 0.6475 | **31** |
| Presidio | 0.4933 | 0.5493 | **0.6680** | 86 |
| Piiranha | 0.4731 | 0.6501 | 0.6195 | **31** |
| **NerGuard Base (this model)** | 0.4175 | 0.6105 | 0.6076 | 33 |
| spaCy (en_core_web_trf) | 0.3607 | 0.4175 | 0.5527 | 144 |
| dslim/bert-base-NER | 0.3331 | 0.4821 | 0.6225 | 38 |

The base model (no LLM) achieves 33 ms median latency. The entropy-gated hybrid adds +8.94 pt F1-macro by routing only uncertain spans (~3% of tokens) to an LLM for disambiguation.
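
The entropy gate can be sketched as follows; this is a simplified illustration rather than the repository's implementation, and the 0.5-bit threshold is an assumed value:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of one token's label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route_uncertain(token_probs, threshold=0.5):
    """Indices of tokens whose predictive entropy exceeds the threshold;
    only these spans would be sent to the LLM for disambiguation."""
    return [i for i, probs in enumerate(token_probs)
            if token_entropy(probs) > threshold]

# A confident token (near-zero entropy) stays local; an ambiguous one
# (probability mass spread over several labels) gets routed.
uncertain = route_uncertain([[0.98, 0.01, 0.01],   # confident
                             [0.40, 0.35, 0.25]])  # ambiguous
```

Raising the threshold lowers LLM traffic at the cost of leaving harder spans to the base model; the ~3% routing rate quoted above reflects whatever threshold the pipeline tuned.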

## Usage

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="exdsgift/NerGuard-0.3B",
    aggregation_strategy="simple"
)

results = ner("My name is John Smith and my email is john@acme.com")
for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2%})")
# John  -> GIVENNAME (99.82%)
# Smith -> SURNAME   (99.71%)
# john@acme.com -> EMAIL (99.54%)
```

For the full hybrid pipeline with LLM routing and regex validation:

```python
# Requires a local clone of the NerGuard GitHub repository (so `src/` is importable)
from src.inference.tester import PIITester

tester = PIITester(model_path="exdsgift/NerGuard-0.3B")
entities = tester.get_entities("John Smith, SSN: 078-05-1120, email: john@acme.com")
```
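
As an illustration of the regex-validation idea, a span the model tags as `CREDITCARDNUMBER` can be double-checked with a Luhn checksum before being accepted (the Luhn check itself is standard; whether the pipeline uses exactly this rule is an assumption):

```python
import re

def luhn_valid(candidate: str) -> bool:
    """Accept a card-number candidate only if it is 13-19 digits
    (ignoring spaces/dashes) and passes the Luhn checksum."""
    digits = re.sub(r"[ -]", "", candidate)
    if not re.fullmatch(r"\d{13,19}", digits):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Validators like this cheaply veto false positives that a purely neural tagger may produce on number-like strings.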

## Training Details

| Parameter | Value |
| --- | --- |
| Base model | `microsoft/mdeberta-v3-base` |
| Dataset | AI4Privacy Open PII Masking 500K |
| Training samples | ~450K |
| Max sequence length | 512 (stride 382) |
| Learning rate | 2e-5 |
| Batch size | 32 |
| Epochs | 3 |
| Hardware | 2× NVIDIA A100 |
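
Documents longer than the 512-token window are split into overlapping chunks. A minimal sketch of the windowing (not the actual tokenizer call), assuming the Hugging Face convention where `stride` denotes the overlap between consecutive windows:

```python
def sliding_windows(tokens, window=512, stride=382):
    """Split a token sequence into overlapping windows. `stride` is the
    overlap between consecutive windows (Hugging Face convention), so
    window starts advance by window - stride tokens."""
    step = window - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks
```

With the card's 512/382 setting, window starts advance by 130 tokens, so most tokens appear in several windows; predictions on the overlap are then typically reconciled, for example by preferring the window where the token sits farthest from an edge.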

## Citation

```bibtex
@mastersthesis{durante2026nerguard,
  title  = {Engineering a Scalable Multilingual PII Detection System
            with mDeBERTa-v3 and LLM-Based Validation},
  author = {Durante, Gabriele},
  year   = {2026},
  school = {University of Verona, Department of Computer Science}
}
```