GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction
Paper • 2605.09973 • Published
GLiNER2-PII is a fine-tune of the GLiNER2 model (205M parameters) for detecting and masking personally identifiable information across 42 entity types and 7 languages.
Trained entirely on a constraint-driven synthetic corpus of 4,910 annotated texts, it achieves the highest average span-level F1 (0.477) on the SPY benchmark among the four systems compared; the other three are OpenAI Privacy Filter, NVIDIA GLiNER-PII, and urchade/gliner_multi_pii-v1.
📄 Technical Report
🔗 GitHub
pip install gliner2
from gliner2 import GLiNER2
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")
text = "Email john.smith@acme.com or call +1 415 555 0199."
labels = ["email", "phone_number", "person"]
result = model.extract_entities(
    text,
    labels,
    threshold=0.5,
    include_confidence=True,
    include_spans=True,
)
print(result)
You can pass any subset of the 42 supported labels — the model conditions on the labels you provide at inference time.
| Group | Labels |
|---|---|
| Person / names | person, full_name, first_name, middle_name, last_name, date_of_birth |
| Contact / address | email, phone_number, address, street_address, city, state_or_region, postal_code, country |
| Government / tax IDs | government_id, national_id_number, passport_number, drivers_license_number, license_number, tax_id, tax_number |
| Banking / payment | bank_account, account_number, routing_number, iban, payment_card, card_number, card_expiry, card_cvv |
| Digital identity | username, ip_address, account_id, sensitive_account_id |
| Secrets / credentials | password, secret, api_key, access_token, recovery_code |
| Sensitive dates | sensitive_date, document_date, expiration_date, transaction_date |
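Because labels are chosen at inference time, it can help to keep the groups from the table as a plain mapping and build label lists from it. The following is a minimal sketch: `LABEL_GROUPS`, its short group keys, and `labels_for` are illustrative names, not part of the `gliner2` package.

```python
# Label groups copied from the table above. The dictionary, its keys, and the
# helper below are illustrative and not provided by gliner2.
LABEL_GROUPS = {
    "person": ["person", "full_name", "first_name", "middle_name",
               "last_name", "date_of_birth"],
    "contact": ["email", "phone_number", "address", "street_address", "city",
                "state_or_region", "postal_code", "country"],
    "government": ["government_id", "national_id_number", "passport_number",
                   "drivers_license_number", "license_number", "tax_id",
                   "tax_number"],
    "banking": ["bank_account", "account_number", "routing_number", "iban",
                "payment_card", "card_number", "card_expiry", "card_cvv"],
    "digital": ["username", "ip_address", "account_id", "sensitive_account_id"],
    "secrets": ["password", "secret", "api_key", "access_token",
                "recovery_code"],
    "dates": ["sensitive_date", "document_date", "expiration_date",
              "transaction_date"],
}


def labels_for(*groups):
    """Return the de-duplicated labels of the named groups, in table order."""
    seen, out = set(), []
    for group in groups:
        for label in LABEL_GROUPS[group]:
            if label not in seen:
                seen.add(label)
                out.append(label)
    return out


# Iterating a dict yields its keys, so this selects every group.
ALL_LABELS = labels_for(*LABEL_GROUPS)
print(len(ALL_LABELS))               # 42
print(labels_for("secrets", "digital"))
```

The resulting list can be passed as the `labels` argument of `extract_entities`, for example `labels_for("contact", "banking")` for a payments-focused pipeline.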
Evaluated on the SPY benchmark (Savkin et al., 2025) with exact-match span-level metrics:
| Model | Legal P | Legal R | Legal F1 | Medical P | Medical R | Medical F1 | Avg F1 |
|---|---|---|---|---|---|---|---|
| fastino/gliner2-pii-v1 | .346 | .750 | .473 | .369 | .686 | .480 | .477 |
| nvidia/gliner-PII | .343 | .452 | .390 | .368 | .465 | .411 | .400 |
| urchade/gliner_multi_pii-v1 | .467 | .317 | .377 | .518 | .351 | .419 | .398 |
| openai/privacy-filter | .242 | .656 | .354 | .287 | .692 | .406 | .380 |
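To make the metric concrete, here is a small sketch of exact-match span scoring. It assumes a prediction counts as correct only when its start offset, end offset, and label all match a gold span; the benchmark's own scorer may differ in details such as label mapping or averaging. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def span_prf(gold, predicted):
    """Exact-match scoring over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: one exact hit, one boundary error, one spurious span.
gold = [(15, 27, "person"), (31, 54, "email")]
predicted = [(15, 27, "person"), (31, 53, "email"), (58, 72, "phone_number")]
p, r, f1 = span_prf(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4

# Recomputing F1 from the rounded P and R in the first table row gives a value
# within about 0.001 of the reported .473 and .480.
print(f1_score(0.346, 0.750), f1_score(0.369, 0.686))
```

Note that a prediction off by a single character (the email span above) earns no credit under exact matching, which is one reason span-level scores are lower than token-level ones. The Avg F1 column is consistent with the unweighted mean of the legal and medical F1 values.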
| Use case | Why GLiNER2-PII |
|---|---|
| PII redaction / masking | High recall minimises missed sensitive spans |
| Data governance & GDPR/CCPA compliance | 42 fine-grained types enable policy-specific routing |
| Training-data hygiene | Exact character spans for precise masking before model training |
| Multi-language pipelines | Trained on EN, FR, ES, DE, IT, PT, NL formats |
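As an illustration of policy-specific routing, the sketch below maps each label to an action and applies it to a detected value. The policy table, the action names, and `SECRET_KEY` are hypothetical placeholders chosen for this example; a real deployment would define its own policy and key management.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real key would come from a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical policy: which action to apply to each label.
POLICY = {
    "email": "pseudonymise",   # stable keyed token, so records stay joinable
    "password": "drop",        # remove the value entirely
    "api_key": "drop",
}
DEFAULT_ACTION = "mask"        # every other label becomes a [LABEL] tag


def apply_action(label, value):
    """Return the replacement text for one detected value."""
    action = POLICY.get(label, DEFAULT_ACTION)
    if action == "drop":
        return ""
    if action == "pseudonymise":
        digest = hmac.new(SECRET_KEY, value.encode("utf-8"),
                          hashlib.sha256).hexdigest()[:12]
        return f"[{label.upper()}:{digest}]"
    return f"[{label.upper()}]"


print(apply_action("person", "Maria Jensen"))   # [PERSON]
print(apply_action("password", "hunter2"))      # prints an empty line
print(apply_action("email", "maria.jensen@example.dk"))
```

A keyed hash is used rather than a bare SHA-256 because plain hashes of low-entropy values such as email addresses can be reversed by dictionary lookup.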
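from gliner2 import GLiNER2

# Load once at import time; loading inside redact() would repeat the cost on every call.
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")

def redact(text, labels, threshold=0.5):
    """Replace each detected PII span with a [LABEL] placeholder."""
    result = model.extract_entities(
        text, labels, threshold=threshold,
        include_spans=True,
    )
    entities = result.get("entities", {})
    spans = []
    for label, values in entities.items():
        for value in values:
            if isinstance(value, dict):
                # With include_spans=True, entries are expected to carry character offsets.
                spans.append((value["start"], value["end"], label))
            else:
                # Fallback for plain-string entries: finds the first occurrence only.
                start = text.find(value)
                if start != -1:
                    spans.append((start, start + len(value), label))
    # Replace right to left so earlier offsets stay valid.
    spans.sort(key=lambda s: s[0], reverse=True)
    redacted = text
    for start, end, label in spans:
        redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
    return redacted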
text = "Please contact Maria Jensen at maria.jensen@example.dk or +45 20 12 34 56."
labels = ["person", "email", "phone_number"]
print(redact(text, labels))
# "Please contact [PERSON] at [EMAIL] or [PHONE_NUMBER]."
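When overlapping labels are requested together (for example `full_name` and `first_name`), the model may return spans that overlap, and right-to-left replacement then corrupts the output. One possible way to handle this, sketched below, is a greedy pass in which longer spans win. The helper names are illustrative and the offsets in the example are written by hand rather than produced by the model.

```python
def resolve_overlaps(spans):
    """Greedily keep non-overlapping spans, preferring longer ones.

    spans: iterable of (start, end, label) with end exclusive.
    """
    # Longest first; ties broken by earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it does not intersect anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def redact_spans(text, spans):
    """Replace resolved spans right to left so offsets stay valid."""
    out = text
    for start, end, label in sorted(resolve_overlaps(spans), reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out


text = "Please contact Maria Jensen at maria.jensen@example.dk."
name_start = text.index("Maria Jensen")
mail_start = text.index("maria.jensen@example.dk")
spans = [
    (name_start, name_start + len("Maria Jensen"), "full_name"),
    (name_start, name_start + len("Maria"), "first_name"),   # nested in full_name
    (mail_start, mail_start + len("maria.jensen@example.dk"), "email"),
]
print(redact_spans(text, spans))
# Please contact [FULL_NAME] at [EMAIL].
```

Preferring the longest span errs on the side of masking more text, which suits redaction; a pipeline that needs the most specific label could instead prefer the highest-confidence span.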
| Detail | Value |
|---|---|
| Base model | GLiNER2 (205M parameters) |
| Training data | 4,910 synthetic annotated texts |
| PII mentions | 129,951 total (mean 26.5 per example) |
| Generator | GPT-5.4 (temperature 0.01) |
| Data framework | Constraint-driven generation (same framework as Pioneer Agent) |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch |
| Label types | 42 PII entity types across 7 semantic groups |
name entities, sometimes confusing common nouns, organisation names, and product names with personal names.
For production use, consider:
person / full_name)
@misc{fastino2026gliner2pii,
title = {GLiNER2-PII: Multilingual PII Extraction via Synthetic Fine-Tuning},
author = {{Fastino AI Team}},
year = {2026},
url = {https://huggingface.co/fastino/gliner2-pii-v1}
}
@inproceedings{zaratiana-etal-2025-gliner2,
title = {GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction},
author = {Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash},
booktitle = {Proceedings of EMNLP 2025: System Demonstrations},
year = {2025}
}
@inproceedings{zaratiana-etal-2024-gliner,
title = {GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
author = {Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
booktitle = {Proceedings of NAACL 2024},
year = {2024}
}
@misc{atreja2026pioneeragent,
title = {Pioneer Agent: Continual Improvement of Small Language Models in Production},
author = {Atreja, Dhruv and White, Julia and Nayak, Nikhil and Zhang, Kelton and Princis, Henrijs and Hurn-Maloney, George and Lewis, Ash and Zaratiana, Urchade},
year = {2026},
url = {https://arxiv.org/abs/2604.09791}
}
Apache 2.0