@@ -1,5 +1,6 @@
 ---
 license: openrail
 datasets:
 - ai4privacy/open-pii-masking-500k-ai4privacy
 language:
@@ -17,114 +18,107 @@ base_model:
 - microsoft/deberta-v3-base
 pipeline_tag: token-classification
 tags:
-- PII
-- Ner
-- Privacy
-- NLP
 ---
-# NerGuard-0.3B: High-Performance NER for PII Detection
-**Model:** `exdsgift/NerGuard-0.3B`
-**Base Architecture:** `DeBERTa-v3-base` (435M parameters)
-**Context:** Master's Thesis, University of Verona (Department of Computer Science)
-**License:** Academic/Research Use
-## Abstract
-NerGuard-0.3B is a state-of-the-art Named Entity Recognition (NER) model specialized in the detection of Personally Identifiable Information (PII). Fine-tuned on `ai4privacy/open-pii-masking-500k-ai4privacy` dataset using a `DeBERTa-v3-base` backbone, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance with a weighted `F1`-score of **0.9929** on validation sets and **0.9529** on out-of-domain benchmarks (`nvidia/Nemotron-PII`), significantly outperforming traditional frameworks like Spacy and Microsoft Presidio in both accuracy and recall.
-## Technical Specifications
-* **Architecture:** `DeBERTa-v3-base` (Decoding-enhanced BERT with disentangled attention).
-* **Tokenization:** `DeBERTa-v3 Fast Tokenizer` (Max sequence: 512 tokens).
-* **Tagging Scheme:** `IOB2` (Inside-Outside-Beginning).
-* **Inference Latency:** `~25.21 ms` (Average per request on CUDA).
-* **Training Strategy:** Full fine-tuning (3 epochs, AdamW, `2e^-5` LR) on AI4Privacy-v2.
-## Supported Entity Types (21 Classes)
-The model detects the following PII categories:
-* **Identity:** `GIVENNAME`, `SURNAME`, `TITLE`, `AGE`, `SEX`, `GENDER`
-* **Government/ID:** `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM` (SSN), `TAXNUM`
-* **Financial:** `CREDITCARDNUMBER`
-* **Contact:** `EMAIL`, `TELEPHONENUM`
-* **Location:** `STREET`, `BUILDINGNUM`, `CITY`, `ZIPCODE`
-* **Temporal:** `DATE`, `TIME`
-## Performance Evaluation
-### Global Metrics
-Evaluation performed across In-Domain (Validation) and Out-of-Domain `nvidia/Nemotron-PII` datasets.
-| Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
-| :--- | :--- | :--- |
-| **Accuracy** | **99.29%** | **93.42%** |
-| **Weighted Precision** | 0.9930 | 0.9755 |
-| **Weighted Recall** | 0.9929 | 0.9342 |
-| **Weighted `F1`** | **0.9929** | **0.9529** |
-| **Macro `F1`** | 0.9499 | 0.3491* |
-*\*Note: Lower Macro `F1` on the NVIDIA dataset reflects class imbalance and the absence of specific rare entity types (e.g., Building Numbers) in the test set.*
-### Benchmark Comparison
-NerGuard-0.3B establishes a new baseline compared to existing PII solutions.
-| Model Framework | `F1`-Score | Latency (ms) | Relative `F1` vs Baseline |
-| :--- | :--- | :--- | :--- |
-| **`NerGuard-0.3B`**  | **0.9037** | **25.21** | **Baseline** |
-| `Gliner` | 0.4463 | 24.68 | -50.6% |
-| `Microsoft Presidio` | 0.3158 | 13.53 | -65.1% |
-| `Spacy (en_core_web_trf)` | 0.1423 | 9.35 | -84.2% |
-### Granular Analysis Summary
-* **High Performance (`F1` > `0.95`):** Structured entities (`Email`, `Phone`, `Date`, `Time`) and Name components.
-* **Moderate Performance (`0.85` < `F1` < `0.95`):** Government IDs (`Passport`, `SSN`) and Addresses.
-* **Challenges:** Context-heavy entities (Street addresses without numbers) and rare classes (Gender, Tax IDs) exhibit lower recall in out-of-domain settings.
-## Quick Usage
-```python
-from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-from pprint import pprint
-# Load Model & Tokenizer
-model_name = "exdsgift/NerGuard-0.3B"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForTokenClassification.from_pretrained(model_name)
-# Initialize Pipeline
-nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
-# Inference
-multilingual_cases = [
-    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
-    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
-    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
-    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
-    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
-    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl."
-]
-for text in multilingual_cases:
-    results = nlp(text)
-    print(f"\n--- Sample: {text} ---")
-    pprint(results)
-```
-## Limitations
-- **Domain Specificity**: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
-- **Context Sensitivity**: High recall on numeric identifiers (e.g., `SSN`) may result in false positives if context is ambiguous.
-## Citations
 ```bibtex
-@mastersthesis{nerguard2025,
-  title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
-  author={[Author Name]},
-  year={2025},
-  school={University of Verona, Department of Computer Science},
-  type={Master's Thesis},
-  url={[https://huggingface.co/exdsgift/NerGuard-0.3B](https://github.com/exdsgift/NerGuard)}
 }
 ```

 ---
 license: openrail
+library_name: transformers
 datasets:
 - ai4privacy/open-pii-masking-500k-ai4privacy
 language:
 - microsoft/deberta-v3-base
 pipeline_tag: token-classification
 tags:
+- ner
+- pii
+- token-classification
+- privacy
+- mdeberta
+model-index:
+  - name: NerGuard-0.3B
+    results:
+      - task:
+          type: token-classification
+          name: PII Detection
+        dataset:
+          name: AI4Privacy (validation)
+          type: ai4privacy/open-pii-masking-500k-ai4privacy
+        metrics:
+          - type: f1
+            value: 0.9597
+            name: F1 (macro)
+          - type: f1
+            value: 0.9926
+            name: F1 (weighted)
+          - type: accuracy
+            value: 0.9926
+            name: Accuracy
+      - task:
+          type: token-classification
+          name: PII Detection
+        dataset:
+          name: NVIDIA Nemotron-PII
+          type: nvidia/Nemotron-PII
+        metrics:
+          - type: f1
+            value: 0.9543
+            name: F1 (weighted)
+          - type: accuracy
+            value: 0.9350
+            name: Accuracy
 ---
+[![Downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.downloads&label=%F0%9F%A4%97%20Downloads&color=blue)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+[![GitHub](https://img.shields.io/github/stars/exdsgift/NerGuard?style=social)](https://github.com/exdsgift/NerGuard)
+[![Likes](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fexdsgift%2FNerGuard-0.3B&query=%24.likes&label=%E2%9D%A4%20Likes&color=red)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
+[![Model Size](https://img.shields.io/badge/Parameters-278M-orange)](https://huggingface.co/exdsgift/NerGuard-0.3B)
+# NerGuard-0.3B
+**NerGuard-0.3B** is a multilingual transformer model for Personally Identifiable Information (PII) detection, built on [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It performs token-level classification across **21 PII entity types** using BIO tagging, covering names, addresses, government IDs, financial data, and contact information.
+Trained on 500K+ samples from [AI4Privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy), the model achieves a **macro F1 of 95.97%** on validation and roughly **2x the F1** of the strongest open-source alternative tested (among GLiNER, Presidio, and spaCy) on a 3,000-sample benchmark. It supports cross-lingual transfer to 8 European languages without additional fine-tuning.
+This is the standalone NER model. For the full hybrid system with entropy-based LLM routing, see the [NerGuard GitHub repository](https://github.com/exdsgift/NerGuard).
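To make the BIO tagging scheme concrete, here is a minimal sketch of how per-token tags can be collapsed into entity spans. The `bio_to_spans` helper and the toy tags are illustrative assumptions, not code from the NerGuard repository; treating a stray `I-` tag as the start of a span is one lenient choice among several.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open span only if the label matches.
        if prefix == "I" and name == label:
            continue
        # Anything else closes the span that was open, if any.
        if label is not None:
            spans.append((label, start, i))
            label = None
        # B- opens a new span; a stray I- is treated like B- (lenient choice).
        if prefix in ("B", "I"):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hand-written toy example, not real model output.
tokens = ["My", "name", "is", "John", "Smith", "from", "New", "York"]
tags = ["O", "O", "O", "B-GIVENNAME", "B-SURNAME", "O", "B-CITY", "I-CITY"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

In practice the `pipeline` call with `aggregation_strategy="simple"` performs a similar grouping internally, so this helper is only needed when working with raw logits.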
+## Supported Entities
+| Category | Entity Types |
+|---|---|
+| **Person** | `GIVENNAME`, `SURNAME`, `TITLE` |
+| **Location** | `CITY`, `STREET`, `BUILDINGNUM`, `ZIPCODE` |
+| **Government ID** | `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM`, `TAXNUM` |
+| **Financial** | `CREDITCARDNUMBER` |
+| **Contact** | `EMAIL`, `TELEPHONENUM` |
+| **Temporal** | `DATE`, `TIME` |
+| **Demographic** | `AGE`, `SEX`, `GENDER` |
+## Evaluation Results
+| Dataset | Accuracy | F1 (macro) | F1 (weighted) |
+|---|---|---|---|
+| AI4Privacy (validation) | 99.26% | 95.97% | 99.26% |
+| NVIDIA Nemotron-PII | 93.50% | — | 95.43% |
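The gap between macro and weighted F1 in the table is easier to read with a small worked example. The sketch below assumes token-level scoring (the card does not state the evaluation granularity) and uses made-up labels; `per_class_f1`, `macro_f1`, and `weighted_f1` are illustrative helpers, not the project's evaluation code.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for every label that appears in the gold or predicted tags."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean: every class counts equally, however rare."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Mean weighted by gold support: frequent classes dominate."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data: one frequent class and one rare class with a single miss.
y_true = ["O"] * 8 + ["TAXNUM"] * 2
y_pred = ["O"] * 8 + ["TAXNUM", "O"]
print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

A single error on the rare class pulls the macro score down much more than the weighted one, which is consistent with macro F1 sitting below weighted F1 on the validation set.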
+## Usage
+```python
+from transformers import pipeline
+ner = pipeline("token-classification", model="exdsgift/NerGuard-0.3B", aggregation_strategy="simple")
+results = ner("My name is John Smith and my email is john@gmail.com")
+for entity in results:
+    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
+```
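A common next step is to redact the detected spans. The sketch below relies on the `start`, `end`, and `entity_group` keys that the aggregated pipeline output carries when a fast tokenizer is used. `mask_pii` and `stand_in_entity` are hypothetical helpers, and the entity list is hand-written to mimic the output shape so the example runs without downloading the model.

```python
from typing import Dict, List


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text


def stand_in_entity(text: str, word: str, label: str) -> Dict:
    """Build a dict shaped like one aggregated pipeline result (illustrative)."""
    start = text.index(word)
    return {"entity_group": label, "word": word, "start": start, "end": start + len(word)}


text = "My name is John Smith and my email is john@gmail.com"
# Hand-written stand-in for `ner(text)`; real predictions may differ.
entities = [
    stand_in_entity(text, "John", "GIVENNAME"),
    stand_in_entity(text, "Smith", "SURNAME"),
    stand_in_entity(text, "john@gmail.com", "EMAIL"),
]
print(mask_pii(text, entities))
```

With the real model, `entities` would simply be the list returned by `ner(text)`; filtering on `entity['score']` before masking is a reasonable extension.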
+## Training
+| Parameter | Value |
+|---|---|
+| Base model | `microsoft/mdeberta-v3-base` |
+| Dataset | AI4Privacy Open PII Masking 500K |
+| Max sequence length | 512 (stride 382) |
+| Learning rate | 2e-5 |
+| Batch size | 32 |
+| Epochs | 3 |
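The "512 (stride 382)" entry implies that long inputs are split into overlapping windows. The card does not say whether 382 is the step between window starts or the number of overlapping tokens (the `stride` argument of Hugging Face tokenizers means overlap), so the sketch below is only one reading: it assumes a step of 382, which gives an overlap of 130 tokens, and it ignores special tokens. `window_bounds` is a hypothetical helper.

```python
from typing import List, Tuple


def window_bounds(n_tokens: int, max_len: int = 512, step: int = 382) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges covering a sequence; end is exclusive.

    Assumes `step` is the distance between consecutive window starts,
    so neighbouring windows share max_len - step tokens.
    """
    if n_tokens <= max_len:
        return [(0, n_tokens)]
    bounds = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break
        start += step
    return bounds


# A 1,000-token document needs three windows under this reading.
for start, end in window_bounds(1000):
    print(start, end)
```

If 382 is instead the overlap, the same function applies with `step=512 - 382`, which produces many more windows per document.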
+## Citation
 ```bibtex
+@mastersthesis{nerguard2026,
+  title={NerGuard: Hybrid PII Detection with Entropy-Based LLM Routing},
+  author={Exdsgift},
+  school={University of Verona},
+  year={2026}
 }
 ```