cee-pii — multilingual CEE-first PII & sensitive-entity span detector

An open, small (~300M), multilingual span-level PII / sensitive-entity detector, weighted toward Central & Eastern European languages and financial identifiers that English-centric models miss — with first-class US/UK identifier coverage.

Given a piece of text, cee-pii returns character-offset spans, each tagged with a type from a fixed 34-type taxonomy across three tiers (checksum-validated national IDs, structured identifiers, and contextual entities). It is a GLiNER fine-tune, so it extracts entities against natural-language type prompts rather than a baked-in label head — which is exactly what lets the same weights serve two different consumers.

The product insight that shapes everything: checksum identifiers are easy mode; contextual entities are where a privacy guard actually earns its keep. A regex can find a CNP or an IBAN. What it cannot do is find Str. Aviatorilor nr. 12, ap. 4 next to Ștefan Popescu next to an employer mention, without diacritics, in a noisy OCR fragment, in six languages. That contextual span extraction is the hard part, it is where the error mass lives, and it is the reason this model exists rather than a rulebook.

Two consumers of the same weights:

Enterprise masking layer — redact PII before text leaves the perimeter or reaches an LLM. Fixed taxonomy, exact character spans for in-place masking.
Consumer privacy guard — warn a user before they paste PII into a chatbot. Flexible entity prompts, and a first-class low-false-positive posture (a guard that cries wolf gets disabled within a week).

Fine-tuned from the Apache-2.0 urchade/gliner_multi-v2.1 (mDeBERTa-v3 backbone). Runs on CPU; deployment target is consumer laptops, not a data center.

flowxai/cee-pii (this card): GLiNER v1 fine-tune, ~300M params.
flowxai/cee-pii-bench (companion dataset): the held-out synthetic benchmark this card is measured on.

This model does NOT guarantee GDPR / regulatory compliance. It is a detection aid, not a legal control, and its recall is not 100%. See Intended use & limitations before deploying.

How do I use it?

cee-pii is a GLiNER model, so it runs through the gliner package. You pass the text and the list of entity types you want (as natural-language prompts); it returns spans with start / end character offsets, a label, and a score.

pip install gliner

Real Romanian example — a support-ticket line carrying a CNP, an IBAN, and a person name, typed without diacritics (the common real-world case):

from gliner import GLiNER

model = GLiNER.from_pretrained("flowxai/cee-pii")

text = (
    "Buna ziua, sunt Stefan Popescu, CNP 1960714125089, "
    "va rog transferati suma pe contul RO49AAAA1B31007593840000."
)

# GLiNER takes natural-language type prompts (the same phrasings the model was
# trained against). Use the short type id or a descriptive phrasing — both work.
labels = [
    "person name",
    "Romanian personal numeric code (CNP)",
    "IBAN",
]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['label']:40s} [{ent['start']:>3}:{ent['end']:>3}]  {ent['text']}")

person name                              [ 16: 30]  Stefan Popescu
Romanian personal numeric code (CNP)     [ 36: 49]  1960714125089
IBAN                                     [ 85:109]  RO49AAAA1B31007593840000

The (start, end) offsets are exact character ranges, so masking is a straight slice replacement:

ents = model.predict_entities(text, labels, threshold=0.5)
masked = text
for ent in sorted(ents, key=lambda e: e["start"], reverse=True):
    masked = masked[: ent["start"]] + f"[{ent['label']}]" + masked[ent["end"] :]
# -> "Buna ziua, sunt [person name], CNP [Romanian personal numeric code (CNP)], ..."
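For a masking layer, short type ids usually make tidier placeholders than full prompt phrasings. A minimal sketch, assuming a caller-maintained `PROMPT_TO_ID` mapping and a `mask_spans` helper (both hypothetical names, not part of cee-pii or gliner) and entities shaped like the `predict_entities` output above:

```python
# Hypothetical prompt -> short-id mapping; extend it with the phrasings you actually pass.
PROMPT_TO_ID = {
    "person name": "person_name",
    "Romanian personal numeric code (CNP)": "cnp",
    "IBAN": "iban",
}


def mask_spans(text, ents, prompt_to_id=PROMPT_TO_ID):
    """Replace each span with [SHORT_ID], working right to left so offsets stay valid."""
    masked = text
    for ent in sorted(ents, key=lambda e: e["start"], reverse=True):
        # Fall back to the raw label if the prompt is not in the mapping.
        tag = prompt_to_id.get(ent["label"], ent["label"]).upper()
        masked = masked[: ent["start"]] + f"[{tag}]" + masked[ent["end"]:]
    return masked


# Entities written by hand in the shape predict_entities returns (no model needed here).
text = "sunt Stefan Popescu, CNP 1960714125089"
ents = [
    {"start": 5, "end": 19, "label": "person name", "text": "Stefan Popescu"},
    {"start": 25, "end": 38, "label": "Romanian personal numeric code (CNP)",
     "text": "1960714125089"},
]
print(mask_spans(text, ents))
```

Replacing from the last span backwards is what keeps the earlier offsets valid, since each replacement changes the length of the text to its right only.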

Notes for callers.

Type prompts are open-vocabulary. During training each type was shown 2–3 natural-language phrasings (e.g. cnp → "Romanian personal numeric code (CNP)"), so the model responds to descriptive prompts. At evaluation we use one fixed canonical phrasing per type and map it back to the short id — the full canonical list is the taxonomy below.
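Threshold trades recall for precision. 0.5 is the evaluated operating point (0.911 exact micro-precision, 0.758 exact micro-recall on the bench). Lower it when a missed span is the costlier error (higher recall, more false positives); raise it when a false alarm is the costlier error (higher precision, lower recall).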
max_types was 25 at training time; pass entity lists in batches if you need all 34.
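One way to stay under the max_types limit is to call the model once per chunk of labels and merge the results. The sketch below assumes only the `predict_entities(text, labels, threshold=...)` call shown earlier; `predict_in_batches` is a hypothetical helper, and the rule for resolving the same span claimed by two chunks (keep the higher score) is an assumption, not documented cee-pii behavior:

```python
def predict_in_batches(model, text, labels, max_types=25, threshold=0.5):
    """Run predict_entities over label chunks of at most max_types and merge the results."""
    merged = {}
    for i in range(0, len(labels), max_types):
        chunk = labels[i : i + max_types]
        for ent in model.predict_entities(text, chunk, threshold=threshold):
            key = (ent["start"], ent["end"])
            # Assumption: if two chunks claim the identical span, keep the higher score.
            if key not in merged or ent["score"] > merged[key]["score"]:
                merged[key] = ent
    # Return spans in reading order.
    return sorted(merged.values(), key=lambda e: e["start"])
```

Only identical spans are deduplicated here. Spans from different chunks that partially overlap are both kept, so a masking caller may want an extra overlap-resolution pass. The sketch also treats scores from separate calls as comparable, which is worth checking on your own data.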

How it works

text  -->  cee-pii (GLiNER, mDeBERTa-v3 encoder + span head)  -->  [ (type, start, end, score), ... ]
              ^
              |  entity-type prompts (natural-language phrasings, open-vocabulary)

Under the hood: whitespace word-splitter → mDeBERTa-v3 encoder → span scorer against each type prompt → threshold → char-offset spans. GLiNER scores candidate word-token spans (max_width 12) against each supplied type prompt; spans above threshold are emitted with their character offsets recovered from the splitter. There is no fixed classification head, which is why the taxonomy can grow without retraining the output layer.
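To make the candidate-span step concrete, here is a simplified sketch of how word-level spans up to max_width can be enumerated while keeping character offsets. It assumes plain whitespace splitting; the splitter shipped with GLiNER may treat punctuation differently, and `candidate_spans` is a hypothetical name rather than a library function:

```python
import re


def candidate_spans(text, max_width=12):
    """Yield (start, end) character offsets for every run of 1..max_width consecutive words."""
    # Simplification: whitespace-delimited words, each remembered by its character offsets.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    for i in range(len(words)):
        for j in range(i, min(i + max_width, len(words))):
            # Span covers words i..j inclusive; offsets come straight from the splitter.
            yield words[i][0], words[j][1]


text = "Str. Aviatorilor nr. 12, ap. 4"
spans = list(candidate_spans(text, max_width=3))
print(len(spans), repr(text[spans[2][0]:spans[2][1]]))
```

Because every candidate is built from word boundaries, an entity longer than max_width words cannot be emitted as a single span, which is worth keeping in mind for long addresses.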

Entity taxonomy (34 types)

The canonical type list is the single source of truth shared by the generators, validators, corpus, bench, and eval adapter — it cannot drift. Grouped by the three design tiers.

Tier 1 — checksum-validated identifiers (easy mode, near-zero ambiguity)

Type	Country	Notes
`cnp`	RO	13-digit personal numeric code; control digit; encodes DOB + county
`ci_ro`	RO	ID card series (2 letters, valid county) + 6 digits
`pesel`	PL	11 digits, positional checksum, encodes DOB
`nip`	PL	10-digit tax ID, weighted checksum
`taj`	HU	9-digit social-security number, checksum on first 8
`szemelyi`	HU	11-digit personal ID (személyi szám)
`pinfl`	UZ	14-digit personal ID (official lex.uz checksum)
`iban`	RO/PL/HU/GB + generic	ISO 13616 mod-97, correct country lengths
`card`	intl	Luhn + realistic scheme shapes
`nhs`	UK	10 digits, mod-11 checksum
`aba`	US	9-digit bank routing number, 3-7-1 weighted checksum
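The Tier 1 checksums are public algorithms, so a caller can re-validate a detected span independently of the model. A minimal sketch of three of them (IBAN mod-97, Luhn for cards, the ABA 3-7-1 weighting); the function names are illustrative and are not part of cee-pii or the gliner package, and IBAN country-length rules are left out:

```python
def iban_ok(iban: str) -> bool:
    """ISO 13616 mod-97 check (country-specific lengths are not checked here)."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code + check digits to the end
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, which is the IBAN letter encoding.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn check for card numbers; spaces and dashes are ignored."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_ok(routing: str) -> bool:
    """ABA routing number: weights 3-7-1 repeated, weighted sum divisible by 10."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    weights = (3, 7, 1) * 3
    return sum(w * int(c) for w, c in zip(weights, routing)) % 10 == 0


print(iban_ok("RO49AAAA1B31007593840000"), luhn_ok("4111 1111 1111 1111"), aba_ok("021000021"))
```

One possible use is as a post-filter: a span the model labels `iban` that fails `iban_ok` is a candidate for review rather than automatic masking. Whether to drop or keep such spans is a deployment choice, not something this card prescribes.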

Tier 2 — structured (format / range rules, some without a public checksum)

Type	Country	Notes
`ssn`	US	9 digits; invalid-range rejection (area 000/666/900+, group 00, serial 0000)
`itin`	US	9xx- with valid IRS group ranges
`ein`	US	valid prefix list
`nino`	UK	National Insurance number (prefix + suffix rules)
`utr`	UK	10-digit Unique Taxpayer Reference (HMRC check digit)
`company_number_uk`	UK	8-char Companies House number (incl. SC/NI)
`uk_sort_code`	UK	`xx-xx-xx`
`uk_account_number`	UK	8 digits
`uz_account`	UZ	20-digit domestic account number (structural)
`phone`	multi	E.164 + local formats per country
`email`	—	email address
`plate`	RO/PL/HU/UK	vehicle registration
`postal`	multi	postal / ZIP (incl. UK postcode grammar, US ZIP+4)
`dob`	multi	date of birth, many formats

Tier 3 — contextual entities (where GLiNER earns its keep)

Type	Notes
`person_name`	full name (with AND without diacritics)
`first_name`	given name only
`surname`	family name only
`address`	street address (incl. RO `Str./nr./bl./sc./et./ap.` shape)
`policy_ref`	insurance policy number / reference
`contract_ref`	contract number / reference
`account_ref`	internal account number / reference (not a bank IBAN)
`employer`	employer / company mention
`health_condition`	coarse flag that a health condition is mentioned — not classified

Evaluation

All numbers below are on the synthetic CEE-PII-Bench v0.2 — read the honesty caveat first, then the tables.

Honesty caveat: home-field advantage (read before citing these numbers). CEE-PII-Bench v0.2 is held-out and contamination-verified (bench families ∩ train = ∅ and bench values ∩ train = ∅, both tested), but it is drawn from the same synthetic generator distribution as the training corpus: the same output format, phrasing, and entity style the fine-tune learned. The numbers here are a valid relative signal (fine-tune vs zero-shot vs frontier on identical inputs) but are likely optimistic in absolute terms versus real documents. This is the same caveat as the sibling scam-guard project. The honest real-world number will come from the Phase-6 human-labeled real-structure eval set (official form specimens, published template contracts, sample bank statements), which is still to be built and will be reported separately.

Scoring is model-agnostic (eval/harness.py): each entity is a (type, start, end) triple, matched greedily one-to-one per type. Exact requires the span boundaries and type to match; relaxed requires the type to match and the character ranges to overlap. The frontier column runs the same task through the same scorer (LLM offsets recovered by verbatim-substring search, since raw LLM char offsets are unreliable).
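The matching rule described above can be sketched in a few lines. This is an illustrative reimplementation under the stated definitions (greedy, one-to-one, per type), not the contents of eval/harness.py; `match_counts` and `prf` are hypothetical names:

```python
from collections import defaultdict


def match_counts(gold, pred, relaxed=False):
    """Greedy one-to-one matching per type over (type, start, end) triples -> (tp, fp, fn)."""
    gold_by_type = defaultdict(list)
    for typ, start, end in gold:
        gold_by_type[typ].append((start, end))
    tp = 0
    for typ, start, end in pred:
        candidates = gold_by_type[typ]
        for k, (g_start, g_end) in enumerate(candidates):
            if relaxed:
                hit = start < g_end and g_start < end  # character ranges overlap
            else:
                hit = (start, end) == (g_start, g_end)  # boundaries must be identical
            if hit:
                candidates.pop(k)  # each gold span can be matched only once
                tp += 1
                break
    return tp, len(pred) - tp, len(gold) - tp


def prf(tp, fp, fn):
    """Precision, recall, F1 with the 0.0 convention for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [("cnp", 36, 49), ("person_name", 16, 30)]
pred = [("cnp", 36, 49), ("person_name", 16, 22)]  # name span cut short
print(prf(*match_counts(gold, pred)))                # exact: the truncated name is a miss
print(prf(*match_counts(gold, pred, relaxed=True)))  # relaxed: overlap is enough
```

In the toy example the truncated name counts as both a false positive and a false negative under exact scoring, which is why exact F1 is the stricter number for a masking layer that needs full spans.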

Headline — fine-tuned vs zero-shot vs Claude frontier

The zero-shot and fine-tuned columns are on the full 2,000-doc v0.2 standard bench (reports/eval_gliner.md). The Claude reference is measured on a 100-doc seeded slice (the cached frontier reference, reports/eval_frontier.md), so it is not a column in this table; the fine-tuned model's score on that same slice is reported next to Claude's in the frontier-gap paragraph below, for a fair comparison.

Metric	zero-shot `gliner_multi-v2.1`	fine-tuned `cee-pii` v1	lift vs zero-shot
micro-F1 (exact)	0.177	0.827	+0.650 (4.7×)
micro-F1 (relaxed)	0.246	0.833	+0.587 (3.4×)
macro-F1 (exact)	0.114	0.598	+0.484 (5.2×)
micro-precision (exact)	0.661	0.911	+0.250
micro-recall (exact)	0.102	0.758	+0.656
FP-rate (274 no-entity docs)	0.000	0.000	matched

The headline is the false-positive rate: 0.000 on the 274 no-entity documents. A privacy guard that fires on clean text gets disabled within a week; this one fired on none of the bench's no-entity documents while still reaching 0.758 recall at 0.911 precision. Zero-shot matches the 0.000, but at 0.102 recall.

Acceptance gate (Phase 4): MET — the fine-tune beats zero-shot GLiNER by a wide margin (+0.65 exact micro-F1, a 4.7× relative gain). Zero-shot's failure is almost entirely recall (0.10): the base model fires on only a handful of universal types (postal, dob) and ignores the CEE-specific taxonomy. Fine-tuning is what teaches the taxonomy.

Gap to the Claude frontier (reference, not a competitor). On the same 100-doc slice as the cached Claude Opus 4.8 column, fine-tuned cee-pii scores 0.833 / 0.841 (exact / relaxed micro-F1) vs Claude's 0.936 / 0.963, a 0.10 exact-F1 gap (reports/eval_gliner_slice100.md). Expected and honest: a ~300M open-weights, CPU-deployable, Apache-2.0 model reaches 89% of a frontier API's exact-F1 on this bench while running fully offline. Cost and latency should favor the local model, but CPU latency has not been measured yet (see pending eval work). Not an apples-to-apples comparison: the value proposition is on-device masking, not beating a frontier LLM.
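The Claude figures depend on turning the strings an LLM returns into character offsets by verbatim-substring search, as noted under scoring. A minimal sketch of that recovery step, assuming the LLM output has already been parsed into (type, surface string) pairs; `recover_offsets` is a hypothetical name, and the handling of repeated strings (each repeat maps to the next occurrence) is an assumption, since the card does not describe it:

```python
def recover_offsets(text, mentions):
    """Map (type, surface_string) pairs to (type, start, end) by verbatim search."""
    spans = []
    cursors = {}  # per surface string: where to resume, so repeats map to later occurrences
    for typ, surface in mentions:
        if not surface:
            continue  # an empty string would match everywhere
        start = text.find(surface, cursors.get(surface, 0))
        if start == -1:
            continue  # not present verbatim (paraphrased or invented): drop it
        end = start + len(surface)
        cursors[surface] = end
        spans.append((typ, start, end))
    return spans


text = "Ana a scris. Ana a plecat."
mentions = [("first_name", "Ana"), ("first_name", "Ana"), ("employer", "ACME")]
print(recover_offsets(text, mentions))
```

Mentions that the LLM normalizes (for example by restoring diacritics or reformatting a number) will not be found verbatim and are dropped in this sketch, which would count against the LLM's recall.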

Per-language (exact micro-F1, full 2,000-doc bench)

Language	exact micro-F1
en_uk	0.95
pl	0.94
en_us	0.89
uz	0.74
ro	0.66
hu	0.63

RO and HU trail — and honestly so. Their weakest types (ci_ro, taj, szemelyi) concentrate in those languages, so the per-type errors below drag the per-language number down. Closing the RO/HU gap is the explicit target of the planned 3rd-epoch follow-up.

Per-type honesty (the 2-epoch budget shows)

Strong (checksum / high-signal types): cnp 1.00, phone/email/postal ~0.99, person_name 0.98, dob 0.98, card 0.97, employer/plate 0.96, utr 0.94.
Real weaknesses (types WITH gold that the model gets wrong):
- ci_ro (RO ID card) F1 0.00 — the model finds the span but mislabels it as nino (UK NI number); both are "2 letters + digits", a learnable confusion.
- uz_account 0.00 — a genuine recall miss on a rare, under-represented long-digit type.
- taj 0.17 — confused with generic numeric references (policy_ref).
- first_name 0.57 / surname 0.62 — boundary/role confusion with person_name.
Not real failures (bench-coverage artifact): ein, itin, nino, pesel, ssn, szemelyi, uk_account_number show F1 0.00 but have zero gold occurrences in v0.2 standard. The 0.00 is a scoring convention (P=R=0 when no gold exists), not a model failure. v0.2 standard exercises 27 of the 34 taxonomy types; the remaining seven need bench coverage before their F1 is meaningful.
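The P=R=0 convention matters mostly for macro-F1, because every type counts equally there regardless of support. The sketch below uses toy counts invented for illustration (they are not bench results) to show how zero-gold types pull a macro average down while leaving micro-F1 untouched. The card does not state which set of types the reported macro-F1 averages over, so treat this as a possible explanation of the micro/macro gap rather than a confirmed one:

```python
def f1(tp, fp, fn):
    """F1 from counts, using the 0.0 convention when a denominator is empty."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy per-type counts (tp, fp, fn), invented for illustration; not bench results.
counts = {
    "cnp": (50, 0, 0),
    "person_name": (90, 5, 5),
    "pesel": (0, 0, 0),  # zero gold, zero predictions -> scored as F1 0.00
    "ssn": (0, 0, 0),
}

# Micro: pool the counts over all types, then compute one F1.
micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
# Macro over every type, including the zero-gold ones.
macro_all = sum(f1(*c) for c in counts.values()) / len(counts)
# Macro over types that actually have gold spans (tp + fn > 0).
supported = [c for c in counts.values() if c[0] + c[2] > 0]
macro_supported = sum(f1(*c) for c in supported) / len(supported)

print(round(micro, 3), round(macro_all, 3), round(macro_supported, 3))
```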

The full per-type / per-language tables (fine-tuned + zero-shot + heuristic floor) live in reports/eval_gliner.md; the 100-doc frontier slice is in reports/eval_gliner_slice100.md and reports/eval_frontier.md.

Pending eval work (not blocking this card)

Hard-subset (960-doc noised) eval, XLM-R BIO baseline + Presidio through the same scorer, CPU latency + peak memory on M3-class hardware, bootstrap confidence intervals over documents, the Phase-6 human-labeled real-structure eval set, and OpenAI/Gemini frontier columns once keys are configured.

Intended use & limitations

Intended use. A detection aid for (1) redacting PII before text leaves a perimeter or reaches an LLM, and (2) warning a user before they paste PII into a chatbot. Languages: Romanian, Polish, Hungarian, Uzbek, UK English, US English. Deployment target is consumer CPU.

Out of scope & limitations.

NOT a compliance guarantee. This model does not guarantee GDPR or regulatory compliance. It is a detection aid, not a legal control. Do not represent its output as a compliance certification.
Recall is not 100%. Some PII will be missed (bench recall is 0.758 exact at threshold 0.5). Do not rely on it as a sole barrier; pair it with other controls.
Checksum entities are easy mode; contextual entities are where errors live. Checksum-validated identifiers (CNP, PESEL, IBAN, …) are highly detectable; contextual entities (person names, addresses, employer/health mentions) carry most of the error mass. Budget your review accordingly.
Health conditions are coarse-flagged, not classified. The model flags that a health condition is mentioned; it does not categorize which one.
RO and HU trail the other languages (exact micro-F1 0.66 / 0.63) because their confusable CEE types (ci_ro↔nino, taj↔policy_ref, szemelyi) concentrate there.
2-epoch reduced budget. This is a 2-epoch run, below the brief's conservative 3-epoch start — chosen to land a clean, memory-safe end-to-end result first. A 3rd-epoch / longer schedule is the documented Phase-4 follow-up, targeting exactly the confusable CEE types.
Synthetic training + synthetic bench. Trained purely on synthetic data and evaluated on a same-distribution synthetic bench (home-field advantage, see Evaluation). Real-world register drift will be measured by the Phase-6 real-structure eval, which is still to be built and will be reported separately.

Training data

All training data is synthetic. Zero client data, zero scraped personal data, zero real PII anywhere in the repo (including tests).

Synthetic pipeline (6 languages: RO / PL / HU / UZ / EN-UK / EN-US). 437 template families (380 slotted + 57 entity-free for false-positive pressure) across 7 registers (chat, email, support ticket, contract clause, bank statement line, form field, OCR-ish fragment), expanded by LLM paraphrase (claude-haiku-4-5, slot placeholders preserved and validated, all outputs cached for reproducibility).
Corpus v0.2: 20,000 documents (generator cee-pii-phase3-v0.2, seed 20260702), balanced per language (each language 16.3–17.2%), ~14% zero-entity docs, hard-negative injection in ~50% of docs, and noise applied at assembly (diacritic stripping — critical for RO — OCR confusions, random casing, whitespace/punctuation damage) with character-level span tracking through every transformation.
Privacy guarantee. Person names are formed by independent sampling of census/statistical first-name and surname frequency lists; full-name pairs are never copied from any source. Checksum identifiers are internally valid but correspond to no real person.
Split discipline — family- AND value-disjoint, zero straddlers. Families are partitioned into train/val/test first (stratified by (language, register), seed 20260703, 80/10/10), then each split is generated only from its own family pool → straddler_count = 0 by construction. Value-disjointness enforced by resample. Split sizes: train 16,000 / val 2,000 / test 2,000.
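The family-first split can be sketched as below. This is an illustration of the idea under stated assumptions, not the project's build script: `split_families` is a hypothetical name, families are assumed to arrive as (family_id, language, register) tuples, and rounding within small strata is handled naively:

```python
import random
from collections import defaultdict


def split_families(families, seed=20260703, ratios=(0.8, 0.1, 0.1)):
    """Assign each template family to exactly one split, stratified by (language, register)."""
    strata = defaultdict(list)
    for fam_id, language, register in families:
        strata[(language, register)].append(fam_id)
    rng = random.Random(seed)
    assignment = {}
    for key in sorted(strata):  # fixed iteration order keeps the split reproducible
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        for k, fam_id in enumerate(ids):
            if k < n_train:
                assignment[fam_id] = "train"
            elif k < n_train + n_val:
                assignment[fam_id] = "val"
            else:
                assignment[fam_id] = "test"
    return assignment


families = [
    (f"{lang}-{reg}-{i}", lang, reg)
    for lang in ("ro", "pl")
    for reg in ("chat", "email")
    for i in range(10)
]
assignment = split_families(families)
print({s: sum(v == s for v in assignment.values()) for s in ("train", "val", "test")})
```

Because each family id maps to exactly one split and documents are then generated only from their split's family pool, no family can straddle two splits. Value-disjointness is a separate check on the generated identifiers and is not covered by this sketch.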

Fine-tuning

Full fine-tune of urchade/gliner_multi-v2.1 (mDeBERTa-v3 backbone, ~300M params) on corpus v0.2 train (config training/config/gliner_v1_2ep_memsafe.yaml, run training/runs/gliner_v1/).

Objective / label form. Each span is trained against a natural-language type phrasing (e.g. cnp → "Romanian personal numeric code (CNP)"), 2–3 phrasings per type sampled per document to preserve the open-vocabulary property.
Hyperparameters. 2 epochs; encoder LR 1e-5, head LR 5e-5; weight decay 0.01; warmup ratio 0.1; max-grad-norm 1.0; max_width 12, max_types 25; seed 20260704. Effective batch 2 × 16 = 32 (small physical batch is a deliberate MPS memory-safety choice, not a quality preference).
Data seen. 13,654 train / 441 val examples (from 16,000 / 500 docs; ~15% no-entity docs are dropped from TRAINING but remain in the bench for FP measurement). Char→token span misalignments dropped and counted (train 602 / 58,899 ≈ 1.0%).
Outcome. ~1.5 h (5,320 s) on an M3 Max via MPS (no CUDA anywhere), 2.0 epochs, eval_loss 0.448. Promoted checkpoint: training/runs/gliner_v1/final (end of epoch 2). PYTORCH_ENABLE_MPS_FALLBACK=1 set so any unsupported op degrades to CPU rather than crashing a multi-hour run.
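The schedule implied by these numbers can be worked out directly. The arithmetic below reads "2 × 16" as a physical batch of 2 with 16 gradient-accumulation steps and assumes the last partial batch of each epoch is kept; both are assumptions, and the trainer's actual step count may differ:

```python
import math

train_examples = 13_654  # train examples seen, from the card
physical_batch = 2       # assumed reading of "2 x 16"
grad_accum = 16          # assumed reading of "2 x 16"
epochs = 2
warmup_ratio = 0.1

effective_batch = physical_batch * grad_accum
# Optimizer steps per epoch, keeping the final partial batch (assumption).
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run is under a thousand optimizer steps in total, which fits the card's description of a reduced 2-epoch budget.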

Base-model license. urchade/gliner_multi-v2.1 is Apache-2.0 (verified on its HF model card, 2026-07-05), compatible with this Apache-2.0 release.

Model artifacts

The model repo carries the promoted GLiNER checkpoint from training/runs/gliner_v1/final:

pytorch_model.bin — the fine-tuned weights (~1.15 GB).
gliner_config.json — GLiNER config (mDeBERTa-v3 encoder, max_width 12, span mode markerV0, etc.).
tokenizer.json + tokenizer_config.json — the mDeBERTa-v3 tokenizer.

GLiNER.from_pretrained("flowxai/cee-pii") loads these directly.

Optional / pending (Phase-7 items, not blocking this card): an ONNX export and an int8-quantized variant for lower-footprint CPU inference are listed in the brief but are not yet built; a guard.py stdin→spans consumer demo and measured CPU latency for a 512-token message are the remaining Phase-7 deliverables. These are follow-ups; the model above ships and runs today via the gliner package.

Reproduction

# environment (uv-managed, Python 3.12)
uv sync

# corpus v0.2 (split-aware generation, family- + value-disjoint, seed 20260702)
uv run python scripts/build_corpus_v02.py

# freeze CEE-PII-Bench v0.2 from the test split (refuses silent overwrite)
uv run python scripts/build_bench.py

# training: GLiNER fine-tune (Phase 4, ~1.5h on M3 Max MPS)
uv run python training/train_gliner.py --config training/config/gliner_v1_2ep_memsafe.yaml

# eval: fine-tuned vs zero-shot GLiNER on the full 2,000-doc bench
uv run python scripts/run_eval_gliner.py --full   # -> reports/eval_gliner.md

# frontier reference on a 100-doc slice (Claude; openai/gemini skip cleanly w/o keys)
env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN \
  uv run python scripts/run_eval_frontier.py --n 100 --seed 20260703 \
    --specs claude:claude-opus-4-8

License

Apache-2.0 (weights, code, and data pipeline). Base model urchade/gliner_multi-v2.1 is Apache-2.0.
