PII NER Azerbaijani v3

A high-accuracy Named Entity Recognition model for detecting Personally Identifiable Information (PII) in Azerbaijani text. Built on LocalDoc/mmBERT-small-en-az (ModernBERT architecture), this model is 4x smaller and 3-4x faster than XLM-RoBERTa while achieving higher accuracy.

Key Features

  • F1 = 0.9974 — all 15 entity types above 0.99
  • 69M parameters — 4x smaller than XLM-RoBERTa (278M)
  • 3-4x faster inference — ModernBERT architecture with Flash Attention 2
  • Transliteration-robust — works with both Şərifova and Sherifova
  • Hard negative trained — distinguishes "bakı küləyi" (weather) from "bakıda yaşayır" (address)
  • Lowercase input — model is trained on lowercased text for case-insensitive detection

Model Details

Metric                Value
Base Model            LocalDoc/mmBERT-small-en-az
Architecture          ModernBERT (22 layers, hidden=384)
Parameters            69M
Model Size (fp32)     0.26 GB
Max Sequence Length   8,192 tokens
Training Data         LocalDoc/pii_ner_azerbaijani_extended (530K rows)
Training Epochs       5 (best at epoch 5)
License               MIT

Performance

Overall Metrics

Metric                       This Model (69M)   XLM-RoBERTa v2 (278M)
F1                           0.9974             0.9746
Precision                    0.9967             0.9760
Recall                       0.9982             0.9732
False Positives (hard neg)   1                  4

Per-Entity F1 Scores

Entity      F1       Entity             F1
GIVENNAME   0.9974   PASSPORTNUM        0.9996
SURNAME     0.9980   TAXNUM             0.9994
EMAIL       0.9978   TELEPHONENUM       0.9993
DATE        0.9936   TIME               0.9993
AGE         0.9965   CREDITCARDNUMBER   0.9948
CITY        0.9967   STREET             0.9926
IDCARDNUM   0.9985   BUILDINGNUM        0.9976
ZIPCODE     0.9978

Training Progress

Epoch   Loss     F1       Precision   Recall
1       0.0159   0.9839   0.9794      0.9889
2       0.0099   0.9877   0.9848      0.9908
3       0.0053   0.9949   0.9931      0.9967
4       0.0038   0.9972   0.9964      0.9980
5       0.0041   0.9974   0.9967      0.9982

Recognized Entities

GIVENNAME          — First name (e.g., "Əli", "Aysel")
SURNAME            — Last name (e.g., "Həsənov", "Məmmədova")
EMAIL              — Email address
TELEPHONENUM       — Phone number
DATE               — Date in various formats
TIME               — Time
AGE                — Age
IDCARDNUM          — ID card / FIN number
PASSPORTNUM        — Passport number
TAXNUM             — Tax identification number
CREDITCARDNUMBER   — Credit card number
CITY               — City name (as address, not adjective)
STREET             — Street name
BUILDINGNUM        — Building number
ZIPCODE            — ZIP/postal code
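Internally the model tags tokens with a BIO scheme over these 15 entity types. The sketch below shows how that label space is laid out; the label order is illustrative (the authoritative mapping is the model's `config.id2label`):

```python
# Sketch of the BIO label space implied by the 15 entity types above.
# The exact id order comes from model.config.id2label; this order is illustrative.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "EMAIL", "TELEPHONENUM", "DATE", "TIME", "AGE",
    "IDCARDNUM", "PASSPORTNUM", "TAXNUM", "CREDITCARDNUMBER",
    "CITY", "STREET", "BUILDINGNUM", "ZIPCODE",
]

# "O" for non-PII tokens, plus B-/I- tags for the start/continuation of each entity
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

print(len(labels))  # 31 labels: 1 + 2 * 15
```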

Usage

Quick Start

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class AzerbaijaniPiiNer:
    def __init__(self, model_name="LocalDoc/pii-ner-azerbaijani-v3"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device).eval()

        self.id2label = self.model.config.id2label

    def predict(self, text: str) -> list[dict]:
        """
        Detect PII entities in text.
        Input is lowercased for the model, but original casing is preserved in output.
        """
        original_text = text
        # Note: offset mapping assumes len(text) == len(text.lower());
        # rare characters like 'İ' lowercase to two code points and can shift offsets.
        text_lower = text.lower()

        inputs = self.tokenizer(
            text_lower,
            return_tensors="pt",
            return_offsets_mapping=True,
            return_special_tokens_mask=True,
            truncation=True,
            max_length=512,
        )

        offsets = inputs.pop("offset_mapping")[0]
        special_mask = inputs.pop("special_tokens_mask")[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.model(**inputs).logits

        predictions = torch.argmax(logits, dim=-1)[0].cpu()

        # Extract entities
        entities = []
        current = None

        for pred_id, offset, is_special in zip(predictions, offsets, special_mask):
            if is_special:
                if current:
                    entities.append(current)
                    current = None
                continue

            label = self.id2label[pred_id.item()]
            cs, ce = offset[0].item(), offset[1].item()

            if label.startswith("B-"):
                if current:
                    entities.append(current)
                current = {"label": label[2:], "start": cs, "end": ce}
            elif label.startswith("I-") and current and label[2:] == current["label"]:
                current["end"] = ce
            else:
                if current:
                    entities.append(current)
                    current = None

        if current:
            entities.append(current)

        # Map back to ORIGINAL text (preserve original casing)
        for ent in entities:
            raw = original_text[ent["start"]:ent["end"]]
            ent["value"] = raw.strip()
            if raw != raw.strip():
                offset = len(raw) - len(raw.lstrip())
                ent["start"] += offset
                ent["end"] = ent["start"] + len(ent["value"])

        return entities

    def anonymize(self, text: str, replacement: str = "***") -> str:
        """Replace all PII entities with a placeholder."""
        entities = self.predict(text)
        entities.sort(key=lambda x: x["start"], reverse=True)

        result = text
        for ent in entities:
            result = result[:ent["start"]] + replacement + result[ent["end"]:]

        return result

    def highlight(self, text: str) -> str:
        """Return text with entities marked: [LABEL: value]."""
        entities = self.predict(text)
        entities.sort(key=lambda x: x["start"], reverse=True)

        result = text
        for ent in entities:
            result = (
                result[:ent["start"]]
                + f"[{ent['label']}: {ent['value']}]"
                + result[ent["end"]:]
            )

        return result


# --- Example ---
if __name__ == "__main__":
    ner = AzerbaijaniPiiNer()

    examples = [
        # Original Azerbaijani
        "Hörmətli Əhməd Süleymanlı, 05.03.1987 tarixli müraciətiniz qəbul edildi. Əlaqə: 055-234-67-89.",

        # Transliterated (informal)
        "Hormetli Ehmed Suleymanlı, 05.03.1987 tarixli muracietiniz qebul edildi. Elaqe: 055-234-67-89.",

        # Mixed context with hard negatives
        "Bakı küləyi güclüdür, amma Əli Bakıda Nizami küçəsi 42-də yaşayır.",

        # Complex document
        "Müştəri: Gülarə Məmmədli, 67 yaş. Pasport: AZE 1234567. Email: gulare@mail.az. Tel: 012-456-78-90.",

        # English-Azerbaijani mix
        "Dear customer Əli Həsənli, your order shipped to Bakı, 28 May küçəsi 12. Contact: ali@company.com.",
    ]

    for text in examples:
        print(f"\nInput:     {text}")
        print(f"Highlight: {ner.highlight(text)}")
        print(f"Anonymize: {ner.anonymize(text)}")
        for ent in ner.predict(text):
            print(f"  {ent['label']:20s} → \"{ent['value']}\" ({ent['start']}:{ent['end']})")

Expected Output

Input:     Hörmətli Əhməd Süleymanlı, 05.03.1987 tarixli müraciətiniz qəbul edildi. Əlaqə: 055-234-67-89.
Highlight: Hörmətli [GIVENNAME: Əhməd] [SURNAME: Süleymanlı], [DATE: 05.03.1987] tarixli müraciətiniz qəbul edildi. Əlaqə: [TELEPHONENUM: 055-234-67-89].
Anonymize: Hörmətli *** ***, *** tarixli müraciətiniz qəbul edildi. Əlaqə: ***.
  GIVENNAME            → "Əhməd" (9:14)
  SURNAME              → "Süleymanlı" (15:25)
  DATE                 → "05.03.1987" (27:37)
  TELEPHONENUM         → "055-234-67-89" (82:95)

Pipeline Usage

from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model="LocalDoc/pii-ner-azerbaijani-v3",
    aggregation_strategy="simple",
)

# Important: lowercase the input
text = "Əhməd Həsənov Bakıda yaşayır, telefonu 055-123-45-67."
results = ner_pipeline(text.lower())

for entity in results:
    print(f"{entity['entity_group']:20s} → \"{entity['word']}\" (score: {entity['score']:.4f})")
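Because the pipeline sees only the lowercased string, its start/end offsets index into that string. To recover the original casing, slice the original text with those offsets; a minimal helper, assuming lowercasing did not change string length (true for most Azerbaijani text, but not for characters like 'İ'):

```python
def restore_casing(original, entities):
    """Replace each entity's word with the same span from the original text.

    Assumes len(original) == len(original.lower()), which can fail for
    characters such as 'İ' whose lowercase form is two code points.
    """
    for ent in entities:
        ent["word"] = original[ent["start"]:ent["end"]]
    return entities

# Toy stand-in for pipeline output on text.lower()
text = "Əhməd Həsənov Bakıda yaşayır."
fake_results = [{"entity_group": "GIVENNAME", "start": 0, "end": 5, "word": "əhməd"}]
print(restore_casing(text, fake_results)[0]["word"])  # Əhməd
```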

Training Details

Dataset

Trained on LocalDoc/pii_ner_azerbaijani_extended (530K rows):

  • Template-based data (~481K) — original + 3 transliteration strategies
  • LLM-generated PII (~25K) — natural sentences in diverse contexts
  • LLM-generated hard negatives (~15K) — trap words that look like PII
  • LLM-generated mixed (~10K) — real PII + traps in the same sentence

Why Hard Negatives Matter

Without hard negatives, the model flags city and person names as PII regardless of context:

  • ❌ "bakı küləyi güclüdür" → CITY: bakı (wrong — it's about weather)
  • ❌ "nərgiz çiçəkləri açılır" → GIVENNAME: nərgiz (wrong — it's a flower)

With hard negatives, the model learns context:

  • ✅ "bakı küləyi güclüdür" → no PII (weather context)
  • ✅ "əhməd bakıda yaşayır" → GIVENNAME: əhməd, CITY: bakı (address context)
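The "false positives (hard neg)" number reported above can be measured with a tiny helper that counts any entity predicted on sentences known to contain no PII; a sketch, where `predict_fn` stands in for `AzerbaijaniPiiNer.predict`:

```python
def count_hard_negative_fps(predict_fn, sentences):
    """Count entities predicted on sentences known to contain no PII.

    Every detection on these sentences is a false positive, so the total
    directly measures how well the model resists trap words.
    """
    return sum(len(predict_fn(s)) for s in sentences)

# Toy stand-in predictor: flags "nərgiz" as a name regardless of context
def naive_predict(text):
    return [{"label": "GIVENNAME", "value": "nərgiz"}] if "nərgiz" in text else []

traps = ["bakı küləyi güclüdür", "nərgiz çiçəkləri açılır"]
print(count_hard_negative_fps(naive_predict, traps))  # 1
```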

Configuration

  • Optimizer: AdamW
  • Learning Rate: 3e-5 with cosine schedule
  • Warmup: 10%
  • Batch Size: 64
  • Weight Decay: 0.01
  • Max Length: 512
  • Early Stopping: patience=3 on F1
  • Preprocessing: all text lowercased before tokenization
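For reference, the hyperparameters above can be collected into a single config object; the key names below are illustrative, not the exact keys of the original training script:

```python
# Training hyperparameters from the list above, gathered in one place.
# Key names are illustrative; the original training script may differ.
train_config = {
    "optimizer": "adamw",
    "learning_rate": 3e-5,
    "lr_schedule": "cosine",
    "warmup_ratio": 0.10,
    "batch_size": 64,
    "weight_decay": 0.01,
    "max_length": 512,
    "num_epochs": 5,
    "early_stopping_patience": 3,   # monitored on F1
    "lowercase_input": True,        # all text lowercased before tokenization
}
```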

Limitations

  • Lowercase input required — always call .lower() before inference
  • Synthetic training data — may not cover all real-world PII patterns
  • Phone numbers with dashes — in chat-style text, a number like 055-987-65-43 may be split into several entity spans. This is a known tokenizer limitation.
  • Azerbaijani and English only — other languages will produce poor results
  • District names — names like "Nəsimi" may be misidentified as personal names
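The dash-splitting issue can often be patched in post-processing by merging adjacent TELEPHONENUM spans separated only by dashes or spaces. A sketch, operating on entity dicts of the shape returned by `predict` above:

```python
def merge_split_phone_numbers(text, entities):
    """Merge adjacent TELEPHONENUM entities separated only by '-' or spaces.

    Works around the tokenizer occasionally splitting numbers like
    055-987-65-43 into several spans.
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        gap = text[prev["end"]:ent["start"]] if prev else ""
        if (
            prev
            and prev["label"] == ent["label"] == "TELEPHONENUM"
            and gap.strip("- ") == ""
        ):
            prev["end"] = ent["end"]
            prev["value"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))
    return merged

text = "zəng edin: 055-987-65-43"
split = [
    {"label": "TELEPHONENUM", "start": 11, "end": 18, "value": "055-987"},
    {"label": "TELEPHONENUM", "start": 19, "end": 24, "value": "65-43"},
]
print(merge_split_phone_numbers(text, split)[0]["value"])  # 055-987-65-43
```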

Comparison with Previous Versions

                  v3 (this)      v2 (XLM-RoBERTa)   v1 (XLM-RoBERTa)
Base              mmBERT-small   XLM-RoBERTa        XLM-RoBERTa
Parameters        69M            278M               278M
F1                0.9974         0.9746             0.9629
Hard neg FP       1              4                  not tested
Transliteration   yes            no                 no
Speed             3-4x faster    1x                 1x

Citation

@misc{pii-ner-azerbaijani-v3,
  title={PII NER Azerbaijani v3},
  author={LocalDoc},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/LocalDoc/pii-ner-azerbaijani-v3}
}

CC BY 4.0 License — What It Allows

The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows:

✅ You Can:

  • Use the model for any purpose, including commercial use.
  • Share it — copy and redistribute in any medium or format.
  • Adapt it — remix, transform, and build upon it for any purpose, even commercially.

📝 You Must:

  • Give appropriate credit — Attribute the original creator.
  • Not imply endorsement — Do not suggest the original author endorses your use.

❌ You Cannot:

  • Apply legal terms or technological measures that restrict others from doing anything the license permits.

For more information, refer to the CC BY 4.0 license.

Contact

For questions or issues, contact LocalDoc at v.resad.89@gmail.com.
