How to use seongyeon1/pii-deberta-v3-base-multi with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="seongyeon1/pii-deberta-v3-base-multi")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("seongyeon1/pii-deberta-v3-base-multi")
model = AutoModelForTokenClassification.from_pretrained("seongyeon1/pii-deberta-v3-base-multi")
```

A token classification (NER) model for Personally Identifiable Information (PII) detection, fine-tuned on a combination of the ai4privacy/pii-masking-200k and nvidia/Nemotron-PII datasets.

| Entity | Description | F1 |
|---|---|---|
| NAME_STUDENT | Person names | 0.979 |
| EMAIL | Email addresses | 0.992 |
| USERNAME | Usernames | 0.980 |
| ID_NUM | ID numbers (SSN, credit card, passport, etc.) | 0.980 |
| PHONE_NUM | Phone numbers | 0.992 |
| URL_PERSONAL | Personal URLs | 0.992 |
| STREET_ADDRESS | Street addresses | 0.958 |

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "seongyeon1/pii-deberta-v3-base-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"

# Offsets map each token back to its character span in `text`.
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    outputs = model(**inputs)

# Most likely label id for each token of the single input sequence.
predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()

for pred, (start, end) in zip(predictions, offset_mapping):
    label = model.config.id2label[pred]
    # Special tokens have an empty (0, 0) span; skip them and non-PII tokens.
    # Checking `end > start` keeps real tokens that begin at character 0.
    if label != "O" and end > start:
        print(f"{text[start:end]} -> {label}")
```

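The loop above prints one line per flagged token, so a multi-token entity such as a full name or an email address appears in pieces. The following is a minimal sketch of how those token-level predictions could be merged into entity spans and used to redact the text. The helpers `merge_spans` and `redact` are hypothetical names, not part of the model or of Transformers, and the sketch assumes the labels may carry BIO prefixes (`B-`, `I-`), which the card does not state. The `token_preds` list is hand-written illustrative input, not real model output.

```python
def merge_spans(token_preds):
    """Merge (start, end, label) token predictions into entity spans.

    `token_preds` holds only non-"O" tokens, in text order.
    Assumption: labels may use BIO prefixes ("B-", "I-"); they are stripped.
    """
    spans = []
    for start, end, label in token_preds:
        entity = label[2:] if label[:2] in ("B-", "I-") else label
        begins_new = label.startswith("B-")
        # Extend the previous span when the entity type matches and the
        # tokens are adjacent (at most one character, e.g. a space, apart).
        if (spans and not begins_new and spans[-1][2] == entity
                and start - spans[-1][1] <= 1):
            spans[-1] = (spans[-1][0], end, entity)
        else:
            spans.append((start, end, entity))
    return spans


def redact(text, spans):
    """Replace each (start, end, entity) span in `text` with "[ENTITY]"."""
    pieces, cursor = [], 0
    for start, end, entity in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{entity}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "My name is John Smith and my email is john@example.com"

# Illustrative token-level predictions (hand-written, not model output).
token_preds = [
    (11, 15, "B-NAME_STUDENT"), (16, 21, "I-NAME_STUDENT"),
    (38, 42, "B-EMAIL"), (42, 43, "I-EMAIL"), (43, 50, "I-EMAIL"),
    (50, 51, "I-EMAIL"), (51, 54, "I-EMAIL"),
]

spans = merge_spans(token_preds)
print(spans)                # [(11, 21, 'NAME_STUDENT'), (38, 54, 'EMAIL')]
print(redact(text, spans))  # My name is [NAME_STUDENT] and my email is [EMAIL]
```

With real model output, the `(start, end, label)` tuples would come from the offsets and labels produced in the inference loop.
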
| Setting | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 8 |
| Max Length | 256 |
| Warmup Steps | 200 |
| Optimizer | AdamW |
| Training Data | ~9,000 samples (ai4privacy 5K + Nemotron 5K) |
| Training Time | ~29 minutes (Apple MPS) |
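To make the relationship between the warmup and learning-rate settings concrete, here is a rough sketch of a linear warmup followed by linear decay. The card does not state which scheduler was used, so the schedule shape is an assumption, as is treating all ~9,000 samples as training data with no gradient accumulation. The function name `lr_at` is hypothetical.

```python
import math

# Values taken from the settings table.
LEARNING_RATE = 2e-5
WARMUP_STEPS = 200
EPOCHS = 3
BATCH_SIZE = 8

# Assumption: all ~9,000 samples are used for training (approximate figure).
NUM_SAMPLES = 9_000

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Learning rate at an optimizer step under an assumed linear schedule."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    # Linear decay from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - WARMUP_STEPS)


print(total_steps)  # 3375
for step in (0, 100, 200, total_steps):
    print(step, lr_at(step))
```

Under these assumptions, warmup covers 200 of 3,375 steps, or roughly the first 6% of training.
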
54+ entity types from source datasets are merged into 7 standard Kaggle PII types:
- FIRSTNAME, LASTNAME, GIVENNAME, SURNAME → NAME_STUDENT
- SSN, CREDITCARDNUMBER, IBAN, PASSPORT, ... → ID_NUM
- PHONENUMBER, TEL, TELEPHONENUM → PHONE_NUM
- CITY, STATE, ZIPCODE, STREET → STREET_ADDRESS

| Metric | Score |
|---|---|
| F1 (entity-level) | 0.9766 |
| Precision | 0.9703 |
| Recall | 0.9830 |
| F5 (recall-weighted) | 0.9825 |
| Eval Loss | 0.0014 |
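The merging of source-dataset entity types into the seven target types, described before the metrics table, can be sketched as a simple lookup. The dictionary below contains only the mappings listed in this card; the full mapping covers more source types that are not enumerated here. Sending unlisted types to `O` and preserving BIO prefixes are assumptions of this sketch, and `remap_label` is a hypothetical helper name.

```python
# Only the source labels explicitly listed in the card; the real mapping
# covers more source types that are not enumerated there.
SOURCE_TO_TARGET = {
    "FIRSTNAME": "NAME_STUDENT",
    "LASTNAME": "NAME_STUDENT",
    "GIVENNAME": "NAME_STUDENT",
    "SURNAME": "NAME_STUDENT",
    "SSN": "ID_NUM",
    "CREDITCARDNUMBER": "ID_NUM",
    "IBAN": "ID_NUM",
    "PASSPORT": "ID_NUM",
    "PHONENUMBER": "PHONE_NUM",
    "TEL": "PHONE_NUM",
    "TELEPHONENUM": "PHONE_NUM",
    "CITY": "STREET_ADDRESS",
    "STATE": "STREET_ADDRESS",
    "ZIPCODE": "STREET_ADDRESS",
    "STREET": "STREET_ADDRESS",
}


def remap_label(label):
    """Map a source label to its merged target label.

    Assumptions: a "B-"/"I-" prefix, if present, is preserved, and any
    label without a listed mapping becomes "O".
    """
    prefix, entity = "", label
    if label[:2] in ("B-", "I-"):
        prefix, entity = label[:2], label[2:]
    target = SOURCE_TO_TARGET.get(entity)
    return prefix + target if target else "O"


source_tags = ["O", "B-FIRSTNAME", "B-LASTNAME", "O", "B-ZIPCODE"]
print([remap_label(tag) for tag in source_tags])
# ['O', 'B-NAME_STUDENT', 'B-NAME_STUDENT', 'O', 'B-STREET_ADDRESS']
```
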
| Epoch | Train Loss | Eval F1 | Eval F5 |
|---|---|---|---|
| 1 | 0.0039 | 0.9434 | 0.9560 |
| 2 | 0.0023 | 0.9738 | 0.9777 |
| 3 | 0.0012 | 0.9766 | 0.9825 |
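The F5 scores in the tables above weight recall more heavily than precision. Assuming "F5" refers to the standard F-beta score with β = 5 (the card does not give the formula), it can be computed from precision and recall as in the sketch below. The function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the metrics table.
precision, recall = 0.9703, 0.9830

print(round(f_beta(precision, recall, beta=1), 4))  # 0.9766
print(round(f_beta(precision, recall, beta=5), 4))  # 0.9825
```

Both values match the reported entity-level F1 and F5, which is consistent with this definition.
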
Built with PolyBed-Pipeline PII Channel.