cee-pii — multilingual CEE-first PII & sensitive-entity span detector
An open, small (~300M), multilingual span-level PII / sensitive-entity detector, weighted toward Central & Eastern European languages and financial identifiers that English-centric models miss — with first-class US/UK identifier coverage.
Given a piece of text, cee-pii returns character-offset spans, each tagged with a
type from a fixed 34-type taxonomy across three tiers (checksum-validated national
IDs, structured identifiers, and contextual entities). It is a GLiNER fine-tune, so it
extracts entities against natural-language type prompts rather than a baked-in label
head — which is exactly what lets the same weights serve two different consumers.
The product insight that shapes everything: checksum identifiers are easy mode;
contextual entities are where a privacy guard actually earns its keep. A regex can find
a CNP or an IBAN. What it cannot do is find Str. Aviatorilor nr. 12, ap. 4 next to
Ștefan Popescu next to an employer mention, without diacritics, in a noisy OCR fragment,
in six languages. That contextual span extraction is the hard part, it is where the error
mass lives, and it is the reason this model exists rather than a rulebook.
Two consumers of the same weights:
- Enterprise masking layer — redact PII before text leaves the perimeter or reaches an LLM. Fixed taxonomy, exact character spans for in-place masking.
- Consumer privacy guard — warn a user before they paste PII into a chatbot. Flexible entity prompts, and a first-class low-false-positive posture (a guard that cries wolf gets disabled within a week).
Fine-tuned from the Apache-2.0 urchade/gliner_multi-v2.1 (mDeBERTa-v3 backbone). Runs on
CPU; deployment target is consumer laptops, not a data center.
- flowxai/cee-pii (this card): GLiNER v1 fine-tune, ~300M params.
- flowxai/cee-pii-bench (companion dataset): the held-out synthetic benchmark this card is measured on. See flowxai/cee-pii-bench.
This model does NOT guarantee GDPR / regulatory compliance. It is a detection aid, not a legal control, and its recall is not 100%. See Intended use & limitations before deploying.
How do I use it?
cee-pii is a GLiNER model, so it runs through the gliner
package. You pass the text and the list of entity types you want (as natural-language
prompts); it returns spans with start / end character offsets, a label, and a score.
pip install gliner
Real Romanian example — a support-ticket line carrying a CNP, an IBAN, and a person name, typed without diacritics (the common real-world case):
from gliner import GLiNER
model = GLiNER.from_pretrained("flowxai/cee-pii")
text = (
    "Buna ziua, sunt Stefan Popescu, CNP 1960714125089, "
    "va rog transferati suma pe contul RO49AAAA1B31007593840000."
)
# GLiNER takes natural-language type prompts (the same phrasings the model was
# trained against). Use the short type id or a descriptive phrasing; both work.
labels = [
    "person name",
    "Romanian personal numeric code (CNP)",
    "IBAN",
]
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['label']:40s} [{ent['start']:>3}:{ent['end']:>3}] {ent['text']}")
person name [ 16: 30] Stefan Popescu
Romanian personal numeric code (CNP) [ 36: 49] 1960714125089
IBAN [ 85:109] RO49AAAA1B31007593840000
The (start, end) offsets are exact character ranges, so masking is a straight slice
replacement:
ents = model.predict_entities(text, labels, threshold=0.5)
masked = text
for ent in sorted(ents, key=lambda e: e["start"], reverse=True):
    masked = masked[: ent["start"]] + f"[{ent['label']}]" + masked[ent["end"] :]
# -> "Buna ziua, sunt [person name], CNP [Romanian personal numeric code (CNP)], ..."
Notes for callers.
- Type prompts are open-vocabulary. During training each type was shown 2–3 natural-language phrasings (e.g. cnp → "Romanian personal numeric code (CNP)"), so the model responds to descriptive prompts. At evaluation we use one fixed canonical phrasing per type and map it back to the short id; the full canonical list is the taxonomy below.
- Threshold trades recall for precision. 0.5 is the evaluated operating point (0.911 exact micro-precision, 0.758 recall on the bench). Lower it for a higher-recall guard; raise it for a stricter masking layer.
- max_types was 25 at training time; pass entity lists in batches if you need all 34.
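The last note suggests splitting a long label list into batches. A minimal sketch of that batching follows, assuming a hypothetical helper `predict_in_batches` that wraps any callable with the `predict_entities(text, labels, threshold=...)` shape. The chunk size of 25 mirrors the training-time max_types; the de-duplication rule is an assumption, not something the card specifies.

```python
from typing import Callable, Dict, List

Entity = Dict[str, object]
Predictor = Callable[..., List[Entity]]


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def predict_in_batches(
    predict: Predictor,
    text: str,
    labels: List[str],
    threshold: float = 0.5,
    max_types: int = 25,
) -> List[Entity]:
    """Call `predict` once per label chunk and merge the results.

    Assumption: identical (start, end, label) triples are duplicates.
    Spans that overlap across chunks are all kept for the caller to resolve.
    """
    seen = set()
    merged: List[Entity] = []
    for chunk in chunked(labels, max_types):
        for ent in predict(text, chunk, threshold=threshold):
            key = (ent["start"], ent["end"], ent["label"])
            if key not in seen:
                seen.add(key)
                merged.append(ent)
    # Return spans in reading order so downstream masking stays simple.
    return sorted(merged, key=lambda e: (e["start"], e["end"]))
```

With the real model this would be called as `predict_in_batches(model.predict_entities, text, all_labels)`. Two spans with different labels over the same characters can both survive the merge, so a masking layer may want to keep only the higher-scoring one.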
How it works
text --> cee-pii (GLiNER, mDeBERTa-v3 encoder + span head) --> [ (type, start, end, score), ... ]
^
| entity-type prompts (natural-language phrasings, open-vocabulary)
Under the hood: whitespace word-splitter → mDeBERTa-v3 encoder → span scorer against each type prompt → threshold → char-offset spans. GLiNER scores candidate word-token spans
(max_width 12) against each supplied type prompt; spans above threshold are emitted
with their character offsets recovered from the splitter. There is no fixed classification
head, which is why the taxonomy can grow without retraining the output layer.
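The first and last stages of that pipeline can be illustrated without the encoder. The sketch below shows a whitespace word splitter that records character offsets, then enumerates every candidate span of up to max_width consecutive words and recovers its character range. The regex splitter and both function names are assumptions for illustration; GLiNER's own splitter and span scorer are not reproduced here.

```python
import re
from typing import List, Tuple

Word = Tuple[str, int, int]  # (token, char_start, char_end)


def split_words(text: str) -> List[Word]:
    """Whitespace splitter that keeps each word's character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def candidate_spans(text: str, max_width: int = 12) -> List[Tuple[int, int, str]]:
    """Enumerate every run of 1..max_width consecutive words as (start, end, text)."""
    words = split_words(text)
    spans = []
    for i in range(len(words)):
        for j in range(i, min(i + max_width, len(words))):
            # Character range = start of the first word to end of the last word.
            start, end = words[i][1], words[j][2]
            spans.append((start, end, text[start:end]))
    return spans


spans = candidate_spans("sunt Stefan Popescu", max_width=2)
print(spans)
```

In the real model each candidate is scored against every type prompt and only spans above the threshold are returned. A pure whitespace split also leaves punctuation attached to words, which is a simplification in this sketch.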
Entity taxonomy (34 types)
The canonical type list is the single source of truth shared by the generators, validators, corpus, bench, and eval adapter — it cannot drift. Grouped by the three design tiers.
Tier 1 — checksum-validated identifiers (easy mode, near-zero ambiguity)
| Type | Country | Notes |
|---|---|---|
| cnp | RO | 13-digit personal numeric code; control digit; encodes DOB + county |
| ci_ro | RO | ID card series (2 letters, valid county) + 6 digits |
| pesel | PL | 11 digits, positional checksum, encodes DOB |
| nip | PL | 10-digit tax ID, weighted checksum |
| taj | HU | 9-digit social-security number, checksum on first 8 |
| szemelyi | HU | 11-digit personal ID (személyi szám) |
| pinfl | UZ | 14-digit personal ID (official lex.uz checksum) |
| iban | RO/PL/HU/GB + generic | ISO 13616 mod-97, correct country lengths |
| card | intl | Luhn + realistic scheme shapes |
| nhs | UK | 10 digits, mod-11 checksum |
| aba | US | 9-digit bank routing number, 3-7-1 weighted checksum |
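To make "checksum-validated" concrete, here is a minimal sketch of three of the checks named in the Notes column: ISO 13616 mod-97 for iban, Luhn for card, and the 3-7-1 weighted sum for aba. The function names are illustrative, the IBAN check deliberately skips the per-country length table, and none of this is taken from the project's own validators.

```python
def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616: move the first 4 chars to the end, map A..Z to 10..35, test mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'..'9' to 0..9 and 'A'..'Z' to 10..35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn: double every second digit from the right, sum, test mod 10 == 0."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_ok(routing: str) -> bool:
    """ABA routing number: 3-7-1 weights repeated over 9 digits, sum mod 10 == 0."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    d = [int(c) for c in routing]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0


print(iban_mod97_ok("RO49AAAA1B31007593840000"))  # True (the IBAN from the usage example)
```

One possible use, which the card does not describe, is to post-filter Tier 1 predictions: a span labelled iban that fails mod-97 is a likely false positive.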
Tier 2 — structured (format / range rules, some without a public checksum)
| Type | Country | Notes |
|---|---|---|
| ssn | US | 9 digits; invalid-range rejection (area 000/666/900+, group 00, serial 0000) |
| itin | US | 9xx- with valid IRS group ranges |
| ein | US | valid prefix list |
| nino | UK | National Insurance number (prefix + suffix rules) |
| utr | UK | 10-digit Unique Taxpayer Reference (HMRC check digit) |
| company_number_uk | UK | 8-char Companies House number (incl. SC/NI) |
| uk_sort_code | UK | xx-xx-xx |
| uk_account_number | UK | 8 digits |
| uz_account | UZ | 20-digit domestic account number (structural) |
| phone | multi | E.164 + local formats per country |
| email | - | email address |
| plate | RO/PL/HU/UK | vehicle registration |
| postal | multi | postal / ZIP (incl. UK postcode grammar, US ZIP+4) |
| dob | multi | date of birth, many formats |
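Tier 2 types have no public checksum, so validity comes from format and range rules. As an illustration, the ssn row's invalid-range rejection can be sketched as below. The function name and the accepted input shapes (with or without hyphens) are assumptions; the project's own validator may differ.

```python
import re

# Three groups: area (3 digits), group (2 digits), serial (4 digits).
_SSN_SHAPE = re.compile(r"([0-9]{3})-?([0-9]{2})-?([0-9]{4})")


def ssn_structurally_valid(value: str) -> bool:
    """Apply the range rules from the table: area 000/666/900+, group 00, serial 0000 are invalid."""
    m = _SSN_SHAPE.fullmatch(value.strip())
    if m is None:
        return False
    area, group, serial = (int(g) for g in m.groups())
    if area == 0 or area == 666 or area >= 900:
        return False
    if group == 0 or serial == 0:
        return False
    return True
```

Passing this check only means the string has a plausible shape. It says nothing about whether the number was ever issued.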
Tier 3 — contextual entities (where GLiNER earns its keep)
| Type | Notes |
|---|---|
| person_name | full name (with AND without diacritics) |
| first_name | given name only |
| surname | family name only |
| address | street address (incl. RO Str./nr./bl./sc./et./ap. shape) |
| policy_ref | insurance policy number / reference |
| contract_ref | contract number / reference |
| account_ref | internal account number / reference (not a bank IBAN) |
| employer | employer / company mention |
| health_condition | coarse flag that a health condition is mentioned, not classified |
Evaluation
All numbers below are on the synthetic CEE-PII-Bench v0.2 — read the honesty caveat first, then the tables.
Honesty caveat — home-field advantage (read before citing these numbers). CEE-PII-Bench v0.2 is held-out and contamination-verified (bench families ∩ train = ∅ and bench values ∩ train = ∅, both tested), but it is drawn from the same synthetic generator distribution as the training corpus — the same output format, phrasing, and entity style the fine-tune learned. The numbers here are a valid relative signal (fine-tune vs zero-shot vs frontier on identical inputs) but are likely optimistic in absolute terms versus real documents. This is the same caveat as the sibling scam-guard project. The honest real-world number is the Phase-6 human-labeled real-structure eval set (official form specimens, published template contracts, sample bank statements), which is reported separately and is still to be built.
Scoring is model-agnostic (eval/harness.py): each entity is a (type, start, end)
triple, matched greedily one-to-one per type. Exact requires the span boundaries and
type to match; relaxed requires the type to match and the character ranges to overlap.
The frontier column runs the same task through the same scorer (LLM offsets recovered
by verbatim-substring search, since raw LLM char offsets are unreliable).
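The exact and relaxed criteria described above can be sketched for a single document as follows. This is an illustrative reimplementation, not the contents of eval/harness.py: the greedy order (predictions sorted by start offset, first unmatched gold wins) and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (type, start, end)


def _matches(gold: Span, pred: Span, relaxed: bool) -> bool:
    """Exact: same type and same boundaries. Relaxed: same type and overlapping ranges."""
    if gold[0] != pred[0]:
        return False
    if relaxed:
        return pred[1] < gold[2] and gold[1] < pred[2]
    return gold[1:] == pred[1:]


def score(gold: List[Span], pred: List[Span], relaxed: bool = False) -> Dict[str, float]:
    """Greedy one-to-one matching per type, then precision / recall / F1."""
    unmatched = defaultdict(list)
    for g in gold:
        unmatched[g[0]].append(g)
    tp = 0
    for p in sorted(pred, key=lambda s: (s[1], s[2])):
        pool = unmatched[p[0]]
        for i, g in enumerate(pool):
            if _matches(g, p, relaxed):
                pool.pop(i)  # each gold span can be matched at most once
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "precision": precision, "recall": recall, "f1": f1}


gold = [("person_name", 16, 30), ("cnp", 36, 49), ("iban", 85, 109)]
pred = [("person_name", 16, 30), ("cnp", 36, 48), ("iban", 85, 109), ("phone", 0, 4)]
exact = score(gold, pred)
relaxed = score(gold, pred, relaxed=True)
```

Here the cnp prediction is one character short, so it counts only under the relaxed criterion. Micro-averaged figures would sum true positives, predictions and gold spans over all documents before dividing.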
Headline — fine-tuned vs zero-shot vs Claude frontier
The table below covers the full 2,000-doc v0.2 standard bench (reports/eval_gliner.md). The
Claude reference is measured on a 100-doc seeded slice (the cached frontier reference,
reports/eval_frontier.md); the fine-tuned model's score on that same 100-doc slice is given
in the frontier-gap note after the table for a fair comparison.
| Metric | zero-shot gliner_multi-v2.1 | fine-tuned cee-pii v1 | lift vs zero-shot |
|---|---|---|---|
| micro-F1 (exact) | 0.177 | 0.827 | +0.650 (4.7×) |
| micro-F1 (relaxed) | 0.246 | 0.833 | +0.587 (3.4×) |
| macro-F1 (exact) | 0.114 | 0.598 | +0.484 (5.2×) |
| micro-precision (exact) | 0.661 | 0.911 | +0.250 |
| micro-recall (exact) | 0.102 | 0.758 | +0.656 |
| FP-rate (274 no-entity docs) | 0.000 | 0.000 | matched |
The headline is the false-positive rate: 0.000 on the 274 no-entity documents. A privacy guard that fires on clean text gets disabled within a week; this one does not fire on documents that carry no PII, while still reaching 0.758 recall at 0.911 precision.
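For readers who want to reproduce this figure on their own clean text, a document-level reading of the metric can be sketched as below. The card does not spell out the formula, so the definition used here (share of no-entity documents on which at least one span is predicted) and the function name are assumptions.

```python
from typing import List, Sequence


def no_entity_fp_rate(predictions_per_doc: Sequence[List[dict]]) -> float:
    """Fraction of no-entity documents on which the model predicted at least one span."""
    if not predictions_per_doc:
        raise ValueError("need at least one no-entity document")
    fired = sum(1 for preds in predictions_per_doc if preds)
    return fired / len(predictions_per_doc)
```

Each element would be the list returned by `model.predict_entities` for one document known to contain no PII.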
Acceptance gate (Phase 4): MET — the fine-tune beats zero-shot GLiNER by a wide margin (+0.65 exact micro-F1, a 4.7× relative gain). Zero-shot's failure is almost entirely recall (0.10): the base model fires on only a handful of universal types (postal, dob) and ignores the CEE-specific taxonomy. Fine-tuning is what teaches the taxonomy.
Gap to the Claude frontier (reference, not a competitor). On the same 100-doc slice as
the cached Claude Opus 4.8 column, fine-tuned cee-pii scores 0.833 / 0.841
(exact / relaxed micro-F1) vs Claude's 0.936 / 0.963, a 0.10 exact-F1 gap
(reports/eval_gliner_slice100.md). Expected and honest: a ~300M open-weights,
CPU-deployable, Apache-2.0 model reaching 89% of a frontier API's exact-F1 on this
bench, fully offline and at a fraction of the cost and latency. Not an apples-to-apples
comparison: the value proposition is on-device masking, not beating a frontier LLM.
Per-language (exact micro-F1, full 2,000-doc bench)
| Language | exact micro-F1 |
|---|---|
| en_uk | 0.95 |
| pl | 0.94 |
| en_us | 0.89 |
| uz | 0.74 |
| ro | 0.66 |
| hu | 0.63 |
RO and HU trail — and honestly so. Their weakest types (ci_ro, taj, szemelyi)
concentrate in those languages, so the per-type errors below drag the per-language number
down. Closing the RO/HU gap is the explicit target of the planned 3rd-epoch follow-up.
Per-type honesty (the 2-epoch budget shows)
- Strong (checksum / high-signal types): cnp 1.00, phone / email / postal ~0.99, person_name 0.98, dob 0.98, card 0.97, employer / plate 0.96, utr 0.94.
- Real weaknesses (types WITH gold that the model gets wrong):
  - ci_ro (RO ID card) F1 0.00: the model finds the span but mislabels it as nino (UK NI number); both are "2 letters + digits", a learnable confusion.
  - uz_account 0.00: a genuine recall miss on a rare, under-represented long-digit type.
  - taj 0.17: confused with generic numeric references (policy_ref).
  - first_name 0.57 / surname 0.62: boundary/role confusion with person_name.
- Not real failures (bench-coverage artifact): ein, itin, nino, pesel, ssn, szemelyi, uk_account_number show F1 0.00 but have zero gold occurrences in v0.2 standard. The 0.00 is a scoring convention (P=R=0 when no gold exists), not a model failure. v0.2 standard exercises 23 of the 34 taxonomy types; the remaining types need bench coverage before their F1 is meaningful.
The full per-type / per-language tables (fine-tuned + zero-shot + heuristic floor) live in
reports/eval_gliner.md; the 100-doc frontier slice is in
reports/eval_gliner_slice100.md and
reports/eval_frontier.md.
Pending eval work (not blocking this card)
Hard-subset (960-doc noised) eval, XLM-R BIO baseline + Presidio through the same scorer, CPU latency + peak memory on M3-class hardware, bootstrap confidence intervals over documents, the Phase-6 human-labeled real-structure eval set, and OpenAI/Gemini frontier columns once keys are configured.
Intended use & limitations
Intended use. A detection aid for (1) redacting PII before text leaves a perimeter or reaches an LLM, and (2) warning a user before they paste PII into a chatbot. Languages: Romanian, Polish, Hungarian, Uzbek, UK English, US English. Deployment target is consumer CPU.
Out of scope & limitations.
- NOT a compliance guarantee. This model does not guarantee GDPR or regulatory compliance. It is a detection aid, not a legal control. Do not represent its output as a compliance certification.
- Recall is not 100%. Some PII will be missed (bench recall is 0.758 exact at threshold 0.5). Do not rely on it as a sole barrier; pair it with other controls.
- Checksum entities are easy mode; contextual entities are where errors live. Checksum-validated identifiers (CNP, PESEL, IBAN, …) are highly detectable; contextual entities (person names, addresses, employer/health mentions) carry most of the error mass. Budget your review accordingly.
- Health conditions are coarse-flagged, not classified. The model flags that a health condition is mentioned; it does not categorize which one.
- RO and HU trail the other languages (exact micro-F1 0.66 / 0.63) because their confusable CEE types (ci_ro ↔ nino, taj ↔ policy_ref, szemelyi) concentrate there.
- 2-epoch reduced budget. This is a 2-epoch run, below the brief's conservative 3-epoch start, chosen to land a clean, memory-safe end-to-end result first. A 3rd-epoch / longer schedule is the documented Phase-4 follow-up, targeting exactly the confusable CEE types.
- Synthetic training + synthetic bench. Trained purely on synthetic data and evaluated on a same-distribution synthetic bench (home-field advantage, see Evaluation). Real-world register drift is the Phase-6 real-structure eval, reported separately and still to be built.
Training data
All training data is synthetic. Zero client data, zero scraped personal data, zero real PII anywhere in the repo (including tests).
- Synthetic pipeline (6 languages: RO / PL / HU / UZ / EN-UK / EN-US). 437 template families (380 slotted + 57 entity-free for false-positive pressure) across 7 registers (chat, email, support ticket, contract clause, bank statement line, form field, OCR-ish fragment), expanded by LLM paraphrase (claude-haiku-4-5, slot placeholders preserved and validated, all outputs cached for reproducibility).
- Corpus v0.2: 20,000 documents (generator cee-pii-phase3-v0.2, seed 20260702), balanced per language (each language 16.3–17.2%), ~14% zero-entity docs, hard-negative injection in ~50% of docs, and noise applied at assembly (diacritic stripping, which is critical for RO; OCR confusions; random casing; whitespace/punctuation damage) with character-level span tracking through every transformation.
- Privacy guarantee. Person names are formed by independent sampling of census/statistical first-name and surname frequency lists; full-name pairs are never copied from any source. Checksum identifiers are internally valid but correspond to no real person.
- Split discipline: family- AND value-disjoint, zero straddlers. Families are partitioned into train/val/test first (stratified by (language, register), seed 20260703, 80/10/10), then each split is generated only from its own family pool → straddler_count = 0 by construction. Value-disjointness enforced by resample. Split sizes: train 16,000 / val 2,000 / test 2,000.
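The split discipline in the last bullet can be sketched as follows: template families are partitioned first, stratified by (language, register), so no family can end up in two splits. The seed and the 80/10/10 ratios come from the card; the data layout, the rounding rule and the function names are assumptions, and the value-disjoint resampling step is not shown.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Family = Tuple[str, str, str]  # (family_id, language, register)


def split_families(
    families: Iterable[Family],
    seed: int = 20260703,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Dict[str, List[str]]:
    """Partition family ids into train/val/test within each (language, register) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for family_id, language, register in families:
        strata[(language, register)].append(family_id)

    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for key in sorted(strata):  # fixed iteration order keeps the split reproducible
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train : n_train + n_val]
        splits["test"] += ids[n_train + n_val :]
    return splits


def straddler_count(splits: Dict[str, List[str]]) -> int:
    """Number of family ids that appear in more than one split."""
    seen = defaultdict(set)
    for name, ids in splits.items():
        for family_id in ids:
            seen[family_id].add(name)
    return sum(1 for names in seen.values() if len(names) > 1)
```

Because documents are generated only from the families assigned to their split, a zero straddler count follows from the partition itself rather than from a later filtering pass.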
Fine-tuning
Full fine-tune of urchade/gliner_multi-v2.1 (mDeBERTa-v3 backbone, ~300M params) on
corpus v0.2 train (config training/config/gliner_v1_2ep_memsafe.yaml, run
training/runs/gliner_v1/).
- Objective / label form. Each span is trained against a natural-language type phrasing (e.g. cnp → "Romanian personal numeric code (CNP)"), 2–3 phrasings per type sampled per document to preserve the open-vocabulary property.
- Hyperparameters. 2 epochs; encoder LR 1e-5, head LR 5e-5; weight decay 0.01; warmup ratio 0.1; max-grad-norm 1.0; max_width 12, max_types 25; seed 20260704. Effective batch 2 × 16 = 32 (small physical batch is a deliberate MPS memory-safety choice, not a quality preference).
- Data seen. 13,654 train / 441 val examples (from 16,000 / 500 docs; ~15% no-entity docs are dropped from TRAINING but remain in the bench for FP measurement). Char→token span misalignments dropped and counted (train 602 / 58,899 ≈ 1.0%).
- Outcome. ~1.5 h (5,320 s) on an M3 Max via MPS (no CUDA anywhere), 2.0 epochs, eval_loss 0.448. Promoted checkpoint: training/runs/gliner_v1/final (end of epoch 2). PYTORCH_ENABLE_MPS_FALLBACK=1 set so any unsupported op degrades to CPU rather than crashing a multi-hour run.
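The hyperparameters above imply a rough optimizer-step budget, worked through below. The inputs are the card's figures; reading "2 × 16" as physical batch times gradient-accumulation steps, rounding the last partial batch up, and truncating the warmup count are assumptions, so the actual trainer may report slightly different numbers.

```python
import math

# Figures from the card.
train_examples = 13_654
physical_batch = 2
grad_accum_steps = 16   # assumption: the "16" in "2 x 16" is gradient accumulation
epochs = 2
warmup_ratio = 0.1

effective_batch = physical_batch * grad_accum_steps             # 32, as stated in the card
steps_per_epoch = math.ceil(train_examples / effective_batch)   # assumption: last partial batch kept
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)                  # assumption: truncated, not rounded

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```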
Base-model license. urchade/gliner_multi-v2.1 is Apache-2.0 (verified on its HF model card, 2026-07-05), compatible with this Apache-2.0 release.
Model artifacts
The model repo carries the promoted GLiNER checkpoint from training/runs/gliner_v1/final:
- pytorch_model.bin: the fine-tuned weights (~1.15 GB).
- gliner_config.json: GLiNER config (mDeBERTa-v3 encoder, max_width 12, span mode markerV0, etc.).
- tokenizer.json + tokenizer_config.json: the mDeBERTa-v3 tokenizer.
GLiNER.from_pretrained("flowxai/cee-pii") loads these directly.
Optional / pending (Phase-7 items, not blocking this card): an ONNX export and an
int8-quantized variant for lower-footprint CPU inference are listed in the brief but are
not yet built; a guard.py stdin→spans consumer demo and measured CPU latency for a
512-token message are the remaining Phase-7 deliverables. These are follow-ups; the model
above ships and runs today via the gliner package.
Reproduction
# environment (uv-managed, Python 3.12)
uv sync
# corpus v0.2 (split-aware generation, family- + value-disjoint, seed 20260702)
uv run python scripts/build_corpus_v02.py
# freeze CEE-PII-Bench v0.2 from the test split (refuses silent overwrite)
uv run python scripts/build_bench.py
# training: GLiNER fine-tune (Phase 4, ~1.5h on M3 Max MPS)
uv run python training/train_gliner.py --config training/config/gliner_v1_2ep_memsafe.yaml
# eval: fine-tuned vs zero-shot GLiNER on the full 2,000-doc bench
uv run python scripts/run_eval_gliner.py --full # -> reports/eval_gliner.md
# frontier reference on a 100-doc slice (Claude; openai/gemini skip cleanly w/o keys)
env -u ANTHROPIC_BASE_URL -u ANTHROPIC_AUTH_TOKEN \
uv run python scripts/run_eval_frontier.py --n 100 --seed 20260703 \
--specs claude:claude-opus-4-8
Links
- Benchmark / dataset: flowxai/cee-pii-bench.
License
Apache-2.0 (weights, code, and data pipeline). Base model urchade/gliner_multi-v2.1 is
Apache-2.0.