# PII NER Azerbaijani v3
A high-accuracy Named Entity Recognition model for detecting Personally Identifiable Information (PII) in Azerbaijani text. Built on LocalDoc/mmBERT-small-en-az (ModernBERT architecture), this model is 4x smaller and faster than XLM-RoBERTa while achieving higher accuracy.
## Key Features
- F1 = 0.9974 — all 15 entity types above 0.99
- 69M parameters — 4x smaller than XLM-RoBERTa (278M)
- 3-4x faster inference — ModernBERT architecture with Flash Attention 2
- Transliteration-robust — works with both Şərifova and Sherifova
- Hard negative trained — distinguishes "bakı küləyi" (weather) from "bakıda yaşayır" (address)
- Lowercase input — model is trained on lowercased text for case-insensitive detection
## Model Details
| Metric | Value |
|---|---|
| Base Model | LocalDoc/mmBERT-small-en-az |
| Architecture | ModernBERT (22 layers, hidden=384) |
| Parameters | 69M |
| Model Size (fp32) | 0.26 GB |
| Max Sequence Length | 8,192 tokens |
| Training Data | LocalDoc/pii_ner_azerbaijani_extended (530K rows) |
| Training Epochs | 5 (best at epoch 5) |
| License | MIT |
## Performance
### Overall Metrics
| Metric | This Model (69M) | XLM-RoBERTa v2 (278M) |
|---|---|---|
| F1 | 0.9974 | 0.9746 |
| Precision | 0.9967 | 0.9760 |
| Recall | 0.9982 | 0.9732 |
| False Positives (hard neg) | 1 | 4 |
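The scores above are strict entity-level metrics, where a predicted span counts as correct only if its boundaries and label both match. As a reference, here is a minimal sketch of how precision, recall, and F1 derive from span counts (the TP/FP/FN numbers in the example are illustrative, not the model's actual counts):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute entity-level precision, recall, and F1 from span counts.

    A predicted span counts as a true positive only if both its boundaries
    and its label exactly match a gold span (strict seqeval-style matching).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts: 997 correct spans, 3 spurious, 2 missed
p, r, f = prf1(997, 3, 2)
```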
### Per-Entity F1 Scores
| Entity | F1 | Entity | F1 |
|---|---|---|---|
| GIVENNAME | 0.9974 | PASSPORTNUM | 0.9996 |
| SURNAME | 0.9980 | TAXNUM | 0.9994 |
| EMAIL | 0.9978 | TELEPHONENUM | 0.9993 |
| DATE | 0.9936 | TIME | 0.9993 |
| AGE | 0.9965 | CREDITCARDNUMBER | 0.9948 |
| CITY | 0.9967 | STREET | 0.9926 |
| IDCARDNUM | 0.9985 | BUILDINGNUM | 0.9976 |
| ZIPCODE | 0.9978 | | |
### Training Progress
| Epoch | Loss | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | 0.0159 | 0.9839 | 0.9794 | 0.9889 |
| 2 | 0.0099 | 0.9877 | 0.9848 | 0.9908 |
| 3 | 0.0053 | 0.9949 | 0.9931 | 0.9967 |
| 4 | 0.0038 | 0.9972 | 0.9964 | 0.9980 |
| 5 | 0.0041 | 0.9974 | 0.9967 | 0.9982 |
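The best checkpoint was selected on validation F1 (epoch 5). A minimal sketch of patience-based checkpoint selection, consistent with the patience=3 setting listed under Configuration (the exact trainer logic is an assumption):

```python
# Validation F1 per epoch, taken from the training-progress table above
history = [(1, 0.9839), (2, 0.9877), (3, 0.9949), (4, 0.9972), (5, 0.9974)]


def best_epoch(history: list[tuple[int, float]], patience: int = 3) -> tuple[int, float]:
    """Return (epoch, f1) of the best checkpoint; stop scanning once F1
    has not improved for `patience` consecutive epochs."""
    best = history[0]
    stale = 0
    for epoch, f1 in history[1:]:
        if f1 > best[1]:
            best, stale = (epoch, f1), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```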
## Recognized Entities
- GIVENNAME — First name (e.g., "Əli", "Aysel")
- SURNAME — Last name (e.g., "Həsənov", "Məmmədova")
- EMAIL — Email address
- TELEPHONENUM — Phone number
- DATE — Date in various formats
- TIME — Time
- AGE — Age
- IDCARDNUM — ID card / FIN number
- PASSPORTNUM — Passport number
- TAXNUM — Tax identification number
- CREDITCARDNUMBER — Credit card number
- CITY — City name (as address, not adjective)
- STREET — Street name
- BUILDINGNUM — Building number
- ZIPCODE — ZIP/postal code
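With 15 entity types, a BIO tagging scheme yields 31 token labels: O plus a B-/I- pair per type. A sketch of how such a label inventory can be built (the label ordering in the released model config may differ):

```python
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "EMAIL", "TELEPHONENUM", "DATE", "TIME",
    "AGE", "IDCARDNUM", "PASSPORTNUM", "TAXNUM", "CREDITCARDNUMBER",
    "CITY", "STREET", "BUILDINGNUM", "ZIPCODE",
]

# BIO tagging: one O label plus a B-/I- pair for each entity type
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
```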
## Usage
### Quick Start
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class AzerbaijaniPiiNer:
    def __init__(self, model_name="LocalDoc/pii-ner-azerbaijani-v3"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device).eval()
        self.id2label = self.model.config.id2label

    def predict(self, text: str) -> list[dict]:
        """
        Detect PII entities in text.
        Input is lowercased for the model, but original casing is preserved in output.
        """
        original_text = text
        # Map dotted capital İ to plain "i" before lowercasing: Python's
        # "İ".lower() produces two characters ("i̇"), which would shift the
        # character offsets used to map predictions back to the original text.
        text_lower = text.replace("İ", "i").lower()
        inputs = self.tokenizer(
            text_lower,
            return_tensors="pt",
            return_offsets_mapping=True,
            return_special_tokens_mask=True,
            truncation=True,
            max_length=512,
        )
        offsets = inputs.pop("offset_mapping")[0]
        special_mask = inputs.pop("special_tokens_mask")[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.model(**inputs).logits
        predictions = torch.argmax(logits, dim=-1)[0].cpu()

        # Merge B-/I- token labels into character-level entity spans
        entities = []
        current = None
        for pred_id, offset, is_special in zip(predictions, offsets, special_mask):
            if is_special:
                if current:
                    entities.append(current)
                    current = None
                continue
            label = self.id2label[pred_id.item()]
            cs, ce = offset[0].item(), offset[1].item()
            if label.startswith("B-"):
                if current:
                    entities.append(current)
                current = {"label": label[2:], "start": cs, "end": ce}
            elif label.startswith("I-") and current and label[2:] == current["label"]:
                current["end"] = ce
            else:
                if current:
                    entities.append(current)
                current = None
        if current:
            entities.append(current)

        # Map back to the ORIGINAL text (preserve original casing)
        for ent in entities:
            raw = original_text[ent["start"]:ent["end"]]
            ent["value"] = raw.strip()
            if raw != raw.strip():
                offset = len(raw) - len(raw.lstrip())
                ent["start"] += offset
                ent["end"] = ent["start"] + len(ent["value"])
        return entities

    def anonymize(self, text: str, replacement: str = "***") -> str:
        """Replace all PII entities with a placeholder."""
        entities = self.predict(text)
        # Replace right-to-left so earlier offsets stay valid
        entities.sort(key=lambda x: x["start"], reverse=True)
        result = text
        for ent in entities:
            result = result[:ent["start"]] + replacement + result[ent["end"]:]
        return result

    def highlight(self, text: str) -> str:
        """Return text with entities marked: [LABEL: value]."""
        entities = self.predict(text)
        entities.sort(key=lambda x: x["start"], reverse=True)
        result = text
        for ent in entities:
            result = (
                result[:ent["start"]]
                + f"[{ent['label']}: {ent['value']}]"
                + result[ent["end"]:]
            )
        return result


# --- Example ---
if __name__ == "__main__":
    ner = AzerbaijaniPiiNer()
    examples = [
        # Original Azerbaijani
        "Hörmətli Əhməd Süleymanlı, 05.03.1987 tarixli müraciətiniz qəbul edildi. Əlaqə: 055-234-67-89.",
        # Transliterated (informal)
        "Hormetli Ehmed Suleymanlı, 05.03.1987 tarixli muracietiniz qebul edildi. Elaqe: 055-234-67-89.",
        # Mixed context with hard negatives
        "Bakı küləyi güclüdür, amma Əli Bakıda Nizami küçəsi 42-də yaşayır.",
        # Complex document
        "Müştəri: Gülarə Məmmədli, 67 yaş. Pasport: AZE 1234567. Email: gulare@mail.az. Tel: 012-456-78-90.",
        # English-Azerbaijani mix
        "Dear customer Əli Həsənli, your order shipped to Bakı, 28 May küçəsi 12. Contact: ali@company.com.",
    ]
    for text in examples:
        print(f"\nInput: {text}")
        print(f"Highlight: {ner.highlight(text)}")
        print(f"Anonymize: {ner.anonymize(text)}")
        for ent in ner.predict(text):
            print(f"  {ent['label']:20s} → \"{ent['value']}\" ({ent['start']}:{ent['end']})")
```
### Expected Output

```text
Input: Hörmətli Əhməd Süleymanlı, 05.03.1987 tarixli müraciətiniz qəbul edildi. Əlaqə: 055-234-67-89.
Highlight: Hörmətli [GIVENNAME: Əhməd] [SURNAME: Süleymanlı], [DATE: 05.03.1987] tarixli müraciətiniz qəbul edildi. Əlaqə: [TELEPHONENUM: 055-234-67-89].
Anonymize: Hörmətli *** ***, *** tarixli müraciətiniz qəbul edildi. Əlaqə: ***.
  GIVENNAME            → "Əhməd" (9:14)
  SURNAME              → "Süleymanlı" (15:25)
  DATE                 → "05.03.1987" (27:37)
  TELEPHONENUM         → "055-234-67-89" (82:95)
```
### Pipeline Usage

```python
from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model="LocalDoc/pii-ner-azerbaijani-v3",
    aggregation_strategy="simple",
)

# Important: lowercase the input
text = "Əhməd Həsənov Bakıda yaşayır, telefonu 055-123-45-67."
results = ner_pipeline(text.lower())

for entity in results:
    print(f"{entity['entity_group']:20s} → \"{entity['word']}\" (score: {entity['score']:.4f})")
```
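Because the pipeline sees lowercased input, `entity['word']` comes back lowercased. Since the results carry character offsets, the original casing can be restored afterwards. A sketch (`restore_casing` is a hypothetical helper; it assumes lowercasing did not change the string length, which holds for most Azerbaijani text, with dotted capital İ as the notable exception):

```python
def restore_casing(text: str, results: list[dict]) -> list[dict]:
    """Replace the lowercased `word` from the pipeline output with the
    original-cased span, using the character offsets the pipeline returns."""
    return [{**ent, "word": text[ent["start"]:ent["end"]]} for ent in results]


# Mock pipeline output for the example sentence above
text = "Əhməd Həsənov Bakıda yaşayır, telefonu 055-123-45-67."
mock = [{"entity_group": "GIVENNAME", "start": 0, "end": 5,
         "word": "əhməd", "score": 0.99}]
restored = restore_casing(text, mock)  # "word" becomes "Əhməd"
```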
## Training Details
### Dataset
Trained on LocalDoc/pii_ner_azerbaijani_extended (530K rows):
- Template-based data (~481K) — original + 3 transliteration strategies
- LLM-generated PII (~25K) — natural sentences in diverse contexts
- LLM-generated hard negatives (~15K) — trap words that look like PII
- LLM-generated mixed (~10K) — real PII + traps in the same sentence
### Why Hard Negatives Matter
Without hard negatives, the model marks every city name as PII:
- ❌ "bakı küləyi güclüdür" → CITY: bakı (wrong — it's about weather)
- ❌ "nərgiz çiçəkləri açılır" → GIVENNAME: nərgiz (wrong — it's a flower)
With hard negatives, the model learns context:
- ✅ "bakı küləyi güclüdür" → no PII (weather context)
- ✅ "əhməd bakıda yaşayır" → GIVENNAME: əhməd, CITY: bakı (address context)
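In BIO terms, a hard-negative row labels every token O, while a mixed row tags only the true PII and leaves the traps as O. A whitespace-token sketch (the actual dataset uses subword tokenization, and `bio_tags` is a hypothetical helper for illustration):

```python
def bio_tags(tokens: list[str], spans: list[tuple[int, int, str]]) -> list[str]:
    """Assign BIO tags to tokens given (start_tok, end_tok, label) entity
    spans; everything else, including trap words, stays O."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Mixed example: real PII plus a trap in one sentence
tokens = "bakı küləyi güclüdür amma əhməd bakıda yaşayır".split()
tags = bio_tags(tokens, [(4, 5, "GIVENNAME"), (5, 6, "CITY")])
# "bakı" in "bakı küləyi" stays O; "əhməd" and "bakıda" get entity tags
```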
### Configuration
- Optimizer: AdamW
- Learning Rate: 3e-5 with cosine schedule
- Warmup: 10%
- Batch Size: 64
- Weight Decay: 0.01
- Max Length: 512
- Early Stopping: patience=3 on F1
- Preprocessing: all text lowercased before tokenization
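The cosine schedule with 10% warmup can be sketched as follows. This is a minimal reimplementation for illustration; the actual training most likely used a built-in scheduler such as transformers' `get_cosine_schedule_with_warmup`, whose internals may differ slightly:

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-5,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr over the first 10% of steps,
    then cosine decay to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```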
## Limitations
- Lowercase input required — always call `.lower()` before inference
- Synthetic training data — may not cover all real-world PII patterns
- Phone numbers with dashes — in chat-style text, numbers like `055-987-65-43` may split into multiple spans. This is a known tokenizer limitation.
- Azerbaijani and English only — other languages will produce poor results
- District names — names like "Nəsimi" may be misidentified as personal names
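The dash-splitting limitation can often be patched downstream by merging adjacent spans of the same type. A sketch (`merge_split_numbers` is a hypothetical post-processing helper, not part of the released model):

```python
def merge_split_numbers(text: str, entities: list[dict]) -> list[dict]:
    """Merge consecutive TELEPHONENUM spans that are separated only by
    dashes or spaces in the original text."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if (merged and ent["label"] == "TELEPHONENUM"
                and merged[-1]["label"] == "TELEPHONENUM"
                and set(text[merged[-1]["end"]:ent["start"]]) <= set("- ")):
            merged[-1]["end"] = ent["end"]
            merged[-1]["value"] = text[merged[-1]["start"]:ent["end"]]
        else:
            merged.append(dict(ent))
    return merged


# A number split into two spans at a dash
text = "zəng et: 055-987-65-43"
parts = [
    {"label": "TELEPHONENUM", "start": 9, "end": 16, "value": "055-987"},
    {"label": "TELEPHONENUM", "start": 17, "end": 22, "value": "65-43"},
]
fixed = merge_split_numbers(text, parts)  # one span: "055-987-65-43"
```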
## Comparison with Previous Versions
| | v3 (this) | v2 (XLM-RoBERTa) | v1 (XLM-RoBERTa) |
|---|---|---|---|
| Base | mmBERT-small | XLM-RoBERTa | XLM-RoBERTa |
| Parameters | 69M | 278M | 278M |
| F1 | 0.9974 | 0.9746 | 0.9629 |
| Hard neg FP | 1 | 4 | not tested |
| Transliteration | yes | no | no |
| Speed | 3-4x faster | 1x | 1x |
## Citation

```bibtex
@misc{pii-ner-azerbaijani-v3,
  title={PII NER Azerbaijani v3},
  author={LocalDoc},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/LocalDoc/pii-ner-azerbaijani-v3}
}
```
## CC BY 4.0 License — What It Allows
The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows:
### ✅ You Can:
- Use the model for any purpose, including commercial use.
- Share it — copy and redistribute in any medium or format.
- Adapt it — remix, transform, and build upon it for any purpose, even commercially.
### 📝 You Must:
- Give appropriate credit — attribute the original creator.
- Not imply endorsement — do not suggest the original author endorses your use.
### ❌ You Cannot:
- Apply legal terms or technological measures that restrict others from doing anything the license permits.
For more information, refer to the CC BY 4.0 license.
## Contact
For questions or issues, contact LocalDoc at [v.resad.89@gmail.com](mailto:v.resad.89@gmail.com).