exdsgift committed · commit da5ae57 · verified · 1 parent: af04b2c

Update README.md

Files changed (1): README.md (+85 -91)
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
  license: openrail
  datasets:
  - ai4privacy/open-pii-masking-500k-ai4privacy
  language:
@@ -17,114 +18,107 @@ base_model:
  - microsoft/deberta-v3-base
  pipeline_tag: token-classification
  tags:
- - PII
- - Ner
- - Privacy
- - NLP
  ---
- # NerGuard-0.3B: High-Performance NER for PII Detection
-
- **Model:** `exdsgift/NerGuard-0.3B`
- **Base Architecture:** `DeBERTa-v3-base` (435M parameters)
- **Context:** Master's Thesis, University of Verona (Department of Computer Science)
- **License:** Academic/Research Use
-
- ## Abstract
-
- NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in detecting Personally Identifiable Information (PII). Fine-tuned on the `ai4privacy/open-pii-masking-500k-ai4privacy` dataset with a `DeBERTa-v3-base` backbone, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted `F1`-score of **0.9929** on the validation set and **0.9529** on the out-of-domain benchmark (`nvidia/Nemotron-PII`), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.
-
- ## Technical Specifications
-
- * **Architecture:** `DeBERTa-v3-base` (Decoding-enhanced BERT with disentangled attention).
- * **Tokenization:** `DeBERTa-v3` fast tokenizer (max sequence length: 512 tokens).
- * **Tagging Scheme:** `IOB2` (Inside-Outside-Beginning).
- * **Inference Latency:** ~25.21 ms average per request on CUDA.
- * **Training Strategy:** Full fine-tuning (3 epochs, AdamW, `2e-5` learning rate) on AI4Privacy-v2.
-
- ## Supported Entity Types (21 Classes)
-
- The model detects the following PII categories:
-
- * **Identity:** `GIVENNAME`, `SURNAME`, `TITLE`, `AGE`, `SEX`, `GENDER`
- * **Government/ID:** `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM` (SSN), `TAXNUM`
- * **Financial:** `CREDITCARDNUMBER`
- * **Contact:** `EMAIL`, `TELEPHONENUM`
- * **Location:** `STREET`, `BUILDINGNUM`, `CITY`, `ZIPCODE`
- * **Temporal:** `DATE`, `TIME`
-
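The `IOB2` scheme named under Technical Specifications can be illustrated with a small sketch; the whitespace tokens and span indices below are illustrative, not the model tokenizer's actual output:

```python
def to_iob2(tokens, spans):
    """Convert (start, end, label) token spans to IOB2 tags:
    B- marks an entity's first token, I- its continuation, O everything else."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["15", "Rue", "de", "la", "Paix", ",", "Paris"]
spans = [(0, 1, "BUILDINGNUM"), (1, 5, "STREET"), (6, 7, "CITY")]
print(to_iob2(tokens, spans))
# ['B-BUILDINGNUM', 'B-STREET', 'I-STREET', 'I-STREET', 'I-STREET', 'O', 'B-CITY']
```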
- ## Performance Evaluation
-
- ### Global Metrics
- Evaluation performed on the in-domain validation set and the out-of-domain `nvidia/Nemotron-PII` dataset.
-
- | Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
- | :--- | :--- | :--- |
- | **Accuracy** | **99.29%** | **93.42%** |
- | **Weighted Precision** | 0.9930 | 0.9755 |
- | **Weighted Recall** | 0.9929 | 0.9342 |
- | **Weighted `F1`** | **0.9929** | **0.9529** |
- | **Macro `F1`** | 0.9499 | 0.3491* |
-
- *\*Note: The lower macro `F1` on the NVIDIA dataset reflects class imbalance and the absence of certain rare entity types (e.g., building numbers) in the test set.*
-
- ### Benchmark Comparison
- NerGuard-0.3B establishes a new baseline compared to existing PII solutions.
-
- | Model Framework | `F1`-Score | Latency (ms) | Relative `F1` vs Baseline |
- | :--- | :--- | :--- | :--- |
- | **`NerGuard-0.3B`** | **0.9037** | **25.21** | **Baseline** |
- | `GLiNER` | 0.4463 | 24.68 | -50.6% |
- | `Microsoft Presidio` | 0.3158 | 13.53 | -65.1% |
- | `spaCy (en_core_web_trf)` | 0.1423 | 9.35 | -84.2% |
-
- ### Granular Analysis Summary
- * **High performance (`F1` > 0.95):** Structured entities (`EMAIL`, `TELEPHONENUM`, `DATE`, `TIME`) and name components.
- * **Moderate performance (0.85 < `F1` < 0.95):** Government IDs (`PASSPORTNUM`, `SOCIALNUM`) and addresses.
- * **Challenges:** Context-heavy entities (street addresses without numbers) and rare classes (`GENDER`, `TAXNUM`) exhibit lower recall in out-of-domain settings.
-
- ## Quick Usage
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
- from pprint import pprint
-
- # Load model & tokenizer
- model_name = "exdsgift/NerGuard-0.3B"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForTokenClassification.from_pretrained(model_name)
-
- # Initialize pipeline (aggregate sub-word tokens into whole entities)
- nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
-
- # Inference on multilingual samples
- multilingual_cases = [
-     "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
-     "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
-     "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
-     "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
-     "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
-     "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl."
- ]
-
- for text in multilingual_cases:
-     results = nlp(text)
-     print(f"\n--- Sample: {text} ---")
-     pprint(results)
- ```
-
- ## Limitations
- - **Domain Specificity**: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
- - **Context Sensitivity**: High recall on numeric identifiers (e.g., `SSN`) may produce false positives when context is ambiguous.
-
- ## Citations
  ```bibtex
- @mastersthesis{nerguard2025,
-   title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
-   author={[Author Name]},
-   year={2025},
-   school={University of Verona, Department of Computer Science},
-   type={Master's Thesis},
-   url={https://github.com/exdsgift/NerGuard}
  }
  ```
 
  ---
  license: openrail
+ library_name: transformers
  datasets:
  - ai4privacy/open-pii-masking-500k-ai4privacy
  language:
  - microsoft/deberta-v3-base
  pipeline_tag: token-classification
  tags:
+ - ner
+ - pii
+ - token-classification
+ - privacy
+ - mdeberta
+ model-index:
+ - name: NerGuard-0.3B
+   results:
+   - task:
+       type: token-classification
+       name: PII Detection
+     dataset:
+       name: AI4Privacy (validation)
+       type: ai4privacy/open-pii-masking-500k-ai4privacy
+     metrics:
+     - type: f1
+       value: 0.9597
+       name: F1 (macro)
+     - type: f1
+       value: 0.9926
+       name: F1 (weighted)
+     - type: accuracy
+       value: 0.9926
+       name: Accuracy
+   - task:
+       type: token-classification
+       name: PII Detection
+     dataset:
+       name: NVIDIA Nemotron-PII
+       type: nvidia/Nemotron-PII
+     metrics:
+     - type: f1
+       value: 0.9543
+       name: F1 (weighted)
+     - type: accuracy
+       value: 0.9350
+       name: Accuracy
  ---
+ [![Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.downloads&label=%F0%9F%A4%97%20Downloads&color=blue)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+ [![GitHub](https://img.shields.io/github/stars/exdsgift/NerGuard?style=social)](https://github.com/exdsgift/NerGuard)
+ [![Likes](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.likes&label=%E2%9D%A4%20Likes&color=red)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
+ [![Model Size](https://img.shields.io/badge/Parameters-278M-orange)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+
+ # NerGuard-0.3B
+
+ **NerGuard-0.3B** is a multilingual transformer model for Personally Identifiable Information (PII) detection, built on [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It performs token-level classification across **21 PII entity types** using BIO tagging, covering names, addresses, government IDs, financial data, and contact information.
+
+ Trained on 500K+ samples from [AI4Privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy), the model achieves a **macro F1 of 95.97%** on validation and roughly **twice the F1** of the best open-source alternative tested (GLiNER, Presidio, spaCy) on a 3,000-sample benchmark. It supports cross-lingual transfer to 8 European languages without additional fine-tuning.
+
+ This is the standalone NER model. For the full hybrid system with entropy-based LLM routing, see the [NerGuard GitHub repository](https://github.com/exdsgift/NerGuard).
+
+ ## Supported Entities
+
+ | Category | Entity Types |
+ |---|---|
+ | **Person** | `GIVENNAME`, `SURNAME`, `TITLE` |
+ | **Location** | `CITY`, `STREET`, `BUILDINGNUM`, `ZIPCODE` |
+ | **Government ID** | `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM`, `TAXNUM` |
+ | **Financial** | `CREDITCARDNUMBER` |
+ | **Contact** | `EMAIL`, `TELEPHONENUM` |
+ | **Temporal** | `DATE`, `TIME` |
+ | **Demographic** | `AGE`, `SEX`, `GENDER` |
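As a sketch of what the table implies for the model's label space, BIO tagging yields one `B-`/`I-` pair per type plus `O`. Note the table enumerates 20 type names while the card counts 21; the `id2label` mapping in the model's `config.json` is authoritative:

```python
# Entity names as enumerated in the table above; the model's config.json
# id2label mapping is authoritative and may differ.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "TITLE",
    "CITY", "STREET", "BUILDINGNUM", "ZIPCODE",
    "IDCARDNUM", "PASSPORTNUM", "DRIVERLICENSENUM", "SOCIALNUM", "TAXNUM",
    "CREDITCARDNUMBER", "EMAIL", "TELEPHONENUM",
    "DATE", "TIME", "AGE", "SEX", "GENDER",
]

# BIO scheme: "O" plus a B-/I- pair for each entity type
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
print(len(LABELS))  # 41
```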
+
+ ## Evaluation Results
+
+ | Dataset | Accuracy | F1 (macro) | F1 (weighted) |
+ |---|---|---|---|
+ | AI4Privacy (validation) | 99.26% | 95.97% | 99.26% |
+ | NVIDIA Nemotron-PII | 93.50% | — | 95.43% |
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ ner = pipeline("token-classification", model="exdsgift/NerGuard-0.3B", aggregation_strategy="simple")
+ results = ner("My name is John Smith and my email is john@gmail.com")
+
+ for entity in results:
+     print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
+ ```
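A common next step is masking the detected spans; the `start`/`end` character offsets in the pipeline output make this a pure string operation. A minimal sketch, where the `redact` helper and the hard-coded entity dicts are illustrative (in practice they come from `ner(...)`):

```python
def redact(text, entities):
    # Replace spans right-to-left so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Entity dicts in the shape the pipeline returns (scores omitted)
sample = "My name is John Smith and my email is john@gmail.com"
found = [
    {"entity_group": "GIVENNAME", "start": 11, "end": 15},
    {"entity_group": "SURNAME", "start": 16, "end": 21},
    {"entity_group": "EMAIL", "start": 38, "end": 52},
]
print(redact(sample, found))
# My name is [GIVENNAME] [SURNAME] and my email is [EMAIL]
```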
+
+ ## Training
+
+ | Parameter | Value |
+ |---|---|
+ | Base model | `microsoft/mdeberta-v3-base` |
+ | Dataset | AI4Privacy Open PII Masking 500K |
+ | Max sequence length | 512 (stride 382) |
+ | Learning rate | 2e-5 |
+ | Batch size | 32 |
+ | Epochs | 3 |
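The hyperparameters above can be read as a standard `transformers` fine-tuning setup. A minimal sketch, not the authors' actual script: the dataset preprocessing and label alignment are omitted, `num_labels=41` is an assumption (O plus a B-/I- pair per type), and note that in HF tokenizers `stride` means the token overlap between windows:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

# Windowed tokenization for long inputs (max length 512, stride 382)
def encode(texts):
    return tokenizer(texts, truncation=True, max_length=512, stride=382,
                     return_overflowing_tokens=True)

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/mdeberta-v3-base",
    num_labels=41,  # assumption: O + B-/I- per entity type
)

args = TrainingArguments(
    output_dir="nerguard-0.3b",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
```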
+
+ ## Citation
+
  ```bibtex
+ @mastersthesis{nerguard2026,
+   title={NerGuard: Hybrid PII Detection with Entropy-Based LLM Routing},
+   author={Exdsgift},
+   school={University of Verona},
+   year={2026}
  }
  ```