---
license: openrail
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- it
- en
- de
- fr
- es
- nl
- hi
- te
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
tags:
- PII
- NER
- Privacy
- NLP
---
|
|
# NerGuard-0.3B: High-Performance NER for PII Detection

**Model:** `exdsgift/NerGuard-0.3B`

**Base Architecture:** `DeBERTa-v3-base` (435M parameters)

**Context:** Master's Thesis, University of Verona (Department of Computer Science)

**License:** OpenRAIL (academic/research use)

## Abstract

NerGuard-0.3B is a Named Entity Recognition (NER) model specialized in detecting Personally Identifiable Information (PII). Fine-tuned from a `DeBERTa-v3-base` backbone on the `ai4privacy/open-pii-masking-500k-ai4privacy` dataset, the model classifies 21 distinct entity types. Evaluation demonstrates robust performance, with a weighted `F1`-score of **0.9929** on the validation set and **0.9529** on an out-of-domain benchmark (`nvidia/Nemotron-PII`), significantly outperforming traditional frameworks such as spaCy and Microsoft Presidio in both accuracy and recall.
|
|
|
|
|
## Technical Specifications

* **Architecture:** `DeBERTa-v3-base` (Decoding-enhanced BERT with disentangled attention).
* **Tokenization:** `DeBERTa-v3` fast tokenizer (max sequence length: 512 tokens).
* **Tagging Scheme:** `IOB2` (Inside-Outside-Beginning).
* **Inference Latency:** ~25.21 ms (average per request on CUDA).
* **Training Strategy:** Full fine-tuning (3 epochs, AdamW, learning rate `2e-5`) on AI4Privacy-v2.
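Under the IOB2 scheme mentioned above, the first token of an entity is tagged `B-<TYPE>`, continuation tokens `I-<TYPE>`, and non-entity tokens `O`. A minimal self-contained sketch of decoding such tags back into spans (tokens and labels here are illustrative, not model output):

```python
# Illustrative IOB2-tagged sentence: B- opens an entity, I- continues it, O is outside.
tokens = ["Contact", "John", "Smith", "at", "j.smith@company.com", "."]
labels = ["O", "B-GIVENNAME", "B-SURNAME", "O", "B-EMAIL", "O"]

def decode_iob2(tokens, labels):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):           # a new entity starts here
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)         # continuation of the open entity
        else:                              # O tag (or inconsistent I-) closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

print(decode_iob2(tokens, labels))
# → [('GIVENNAME', 'John'), ('SURNAME', 'Smith'), ('EMAIL', 'j.smith@company.com')]
```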
|
|
|
|
|
## Supported Entity Types (21 Classes)

The model detects the following PII categories:

* **Identity:** `GIVENNAME`, `SURNAME`, `TITLE`, `AGE`, `SEX`, `GENDER`
* **Government/ID:** `IDCARDNUM`, `PASSPORTNUM`, `DRIVERLICENSENUM`, `SOCIALNUM` (SSN), `TAXNUM`
* **Financial:** `CREDITCARDNUMBER`
* **Contact:** `EMAIL`, `TELEPHONENUM`
* **Location:** `STREET`, `BUILDINGNUM`, `CITY`, `ZIPCODE`
* **Temporal:** `DATE`, `TIME`
|
|
|
|
|
## Performance Evaluation

### Global Metrics

Evaluation performed on the in-domain validation set and the out-of-domain `nvidia/Nemotron-PII` dataset.

| Metric | Validation Set (In-Domain) | NVIDIA Nemotron (Out-of-Domain) |
| :--- | :--- | :--- |
| **Accuracy** | **99.29%** | **93.42%** |
| **Weighted Precision** | 0.9930 | 0.9755 |
| **Weighted Recall** | 0.9929 | 0.9342 |
| **Weighted `F1`** | **0.9929** | **0.9529** |
| **Macro `F1`** | 0.9499 | 0.3491* |

*\*Note: The lower macro `F1` on the NVIDIA dataset reflects class imbalance and the absence of certain rare entity types (e.g., building numbers) from the test set.*
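Weighted `F1` averages per-class `F1` by support, while macro `F1` weights every class equally, so rare or absent classes drag the macro score down without much affecting the weighted one. A toy illustration (class names, scores, and supports are hypothetical, not taken from the evaluation):

```python
# Hypothetical per-class F1 scores and supports: two frequent, well-detected
# classes and one rare, poorly-detected class.
per_class_f1 = {"EMAIL": 0.98, "DATE": 0.97, "GENDER": 0.10}
support      = {"EMAIL": 900,  "DATE": 950,  "GENDER": 5}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: mean over classes weighted by support.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"macro F1:    {macro_f1:.4f}")   # the rare class drags the mean down
print(f"weighted F1: {weighted_f1:.4f}")  # dominated by high-support classes
```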
|
|
|
|
|
### Benchmark Comparison

NerGuard-0.3B establishes a new baseline against existing PII solutions.

| Model Framework | `F1`-Score | Latency (ms) | Relative `F1` vs Baseline |
| :--- | :--- | :--- | :--- |
| **`NerGuard-0.3B`** | **0.9037** | **25.21** | **Baseline** |
| `GLiNER` | 0.4463 | 24.68 | -50.6% |
| `Microsoft Presidio` | 0.3158 | 13.53 | -65.1% |
| `spaCy (en_core_web_trf)` | 0.1423 | 9.35 | -84.2% |
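The last column is the percentage change of each framework's `F1` against the NerGuard baseline, i.e. `(f1 - baseline_f1) / baseline_f1`; recomputing it from the table's `F1` scores reproduces the column up to rounding in the final digit:

```python
# Relative F1 vs the NerGuard-0.3B baseline, from the F1 scores in the table.
baseline_f1 = 0.9037
scores = {"GLiNER": 0.4463, "Microsoft Presidio": 0.3158, "spaCy": 0.1423}

for name, f1 in scores.items():
    rel = (f1 - baseline_f1) / baseline_f1 * 100
    print(f"{name}: {rel:+.1f}%")
```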
|
|
|
|
|
### Granular Analysis Summary

* **High Performance (`F1` > `0.95`):** Structured entities (`Email`, `Phone`, `Date`, `Time`) and name components.
* **Moderate Performance (`0.85` < `F1` < `0.95`):** Government IDs (`Passport`, `SSN`) and addresses.
* **Challenges:** Context-heavy entities (street addresses without numbers) and rare classes (gender, tax IDs) exhibit lower recall in out-of-domain settings.
|
|
|
|
|
## Quick Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from pprint import pprint

# Load model & tokenizer
model_name = "exdsgift/NerGuard-0.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Inference on multilingual samples
multilingual_cases = [
    "Please send the report to Mr. John Smith at j.smith@company.com immediately.",
    "J'habite au 15 Rue de la Paix, Paris. Mon nom est Pierre Martin.",
    "Mein Name ist Thomas Müller und ich lebe in der Berliner Straße 5, München.",
    "La doctora Ana María González López trabaja en el Hospital Central de Madrid.",
    "Il codice fiscale di Mario Rossi è RSSMRA80A01H501U.",
    "Ik ben Sven van der Berg en mijn e-mailadres is sven.berg@example.nl.",
]

for text in multilingual_cases:
    results = nlp(text)
    print(f"\n--- Sample: {text} ---")
    pprint(results)
```
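With `aggregation_strategy="simple"`, each result carries character offsets (`start`/`end`) and an `entity_group` label, which makes redaction straightforward. A sketch below uses a mocked result list in the shape the pipeline returns (offsets hand-computed for the first sample sentence, for illustration only):

```python
# Redact detected PII by replacing each span with its entity label.
# `entities` mimics the structure returned by the token-classification
# pipeline with aggregation_strategy="simple" (mocked here for illustration).
text = "Please send the report to Mr. John Smith at j.smith@company.com immediately."
entities = [
    {"entity_group": "GIVENNAME", "start": 30, "end": 34},
    {"entity_group": "SURNAME",   "start": 35, "end": 40},
    {"entity_group": "EMAIL",     "start": 44, "end": 63},
]

def redact(text, entities):
    """Replace each detected span with [LABEL], working right-to-left
    so that earlier character offsets remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(redact(text, entities))
# → Please send the report to Mr. [GIVENNAME] [SURNAME] at [EMAIL] immediately.
```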
|
|
|
|
|
## Limitations

- **Domain Specificity**: Optimized for general prose; may require fine-tuning for specialized medical or legal jargon.
- **Context Sensitivity**: High recall on numeric identifiers (e.g., `SSN`) may result in false positives when context is ambiguous.
|
|
|
|
|
## Citations

```bibtex
@mastersthesis{nerguard2025,
  title={NerGuard-0.3B: High-Performance Named Entity Recognition for PII Detection},
  author={[Author Name]},
  year={2025},
  school={University of Verona, Department of Computer Science},
  type={Master's Thesis},
  url={https://huggingface.co/exdsgift/NerGuard-0.3B}
}
```