pleno_anonymize_en

Lightweight English PII NER model trained on the English split of ai4privacy/pii-masking-300k. It is the English counterpart of 0xhikae/pleno_anonymize_ja: the recipe intentionally mirrors the JP supervised v2 pipeline, but uses distilbert-base-uncased (~66M params) as the backbone so the artefact stays small enough for CPU inference.

Acceptance tier

The explicit target is the Smoke tier: F1 ≥ 0.50 on the EN validation split. Numbers are reported the same way as on the JP card, as document-level bootstrap confidence intervals over 1000 iterations; this card will be updated once the run completes.
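For readers unfamiliar with document-level bootstrap CIs, here is a minimal sketch of the general technique. It is not taken from the project's evaluation code: the function names, the per-document (tp, fp, fn) representation, the choice of micro-averaged F1, the percentile interval, and the toy counts are all assumptions made for illustration.

```python
import random

def micro_f1(counts):
    """Micro-averaged F1 from a list of per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(doc_counts, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI, resampling whole documents with replacement."""
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = sorted(
        micro_f1([doc_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_iter)
    )
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return micro_f1(doc_counts), lo, hi

# Toy per-document counts, invented for illustration only (not real results).
toy_counts = [(3, 1, 0), (2, 0, 2), (5, 1, 1), (0, 1, 1), (4, 0, 0)]
point, lo, hi = bootstrap_f1_ci(toy_counts)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling at the document level, rather than per token or per entity, keeps entities from the same document together, so the interval reflects document-to-document variation.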

Quick start

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned weights from the Hub.
tok = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
mdl = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")
# "simple" aggregation groups adjacent tokens with the same entity label into one span.
ner = pipeline("token-classification", model=mdl, tokenizer=tok, aggregation_strategy="simple")
ner("Contact: Alice Johnson <alice@example.com>, phone 555-123-4567.")
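To turn the pipeline output into anonymized text, one option is to replace each detected span with a placeholder. The sketch below assumes the standard output of an aggregated token-classification pipeline, a list of dicts with entity_group, start and end keys. The redact helper is hypothetical, and the NAME and EMAIL labels are stand-ins, since the label set this model actually emits is not listed here.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[: ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"] :]
    return out

text = "Contact: Alice Johnson <alice@example.com>"

def span(label, needle):
    """Build a pipeline-style entity dict for the first occurrence of needle."""
    start = text.index(needle)
    return {"entity_group": label, "start": start, "end": start + len(needle)}

# Hand-written stand-in for the pipeline output; labels are hypothetical.
fake_entities = [span("NAME", "Alice Johnson"), span("EMAIL", "alice@example.com")]

masked = redact(text, fake_entities)
print(masked)  # Contact: [NAME] <[EMAIL]>
```

With the real model, the same helper would be called as redact(text, ner(text)).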

Training

Base: distilbert-base-uncased
Dataset: ai4privacy/pii-masking-300k, English slice (~30k train / ~8k val)
Recipe: 2 epochs, batch 16, lr 5e-5, fp16, seed 42 (mirror of JP v2)
Hardware: single RTX 4090 on RunPod
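As a rough sanity check on the recipe, the sketch below writes the hyperparameters out as a dict and estimates the number of optimizer steps. The dict keys follow the keyword names of transformers.TrainingArguments, but whether the training code actually uses Trainer is an assumption. The step count also assumes a single device with no gradient accumulation and takes the approximate 30k training-set size at face value.

```python
import math

# Recipe from this card; key names match transformers.TrainingArguments keywords
# (assumed mapping, the actual training script may be structured differently).
hparams = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
    "fp16": True,
    "seed": 42,
}

approx_train_examples = 30_000  # the card gives "~30k", so this is approximate

# Single GPU, no gradient accumulation assumed: one optimizer step per batch.
steps_per_epoch = math.ceil(approx_train_examples / hparams["per_device_train_batch_size"])
total_steps = steps_per_epoch * hparams["num_train_epochs"]

print(f"~{steps_per_epoch} steps per epoch, ~{total_steps} steps in total")
```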

Reproduce:

make -C packages/training dump-supervised-en
make -C packages/training train-supervised-en
make -C packages/training eval-300k-en

See docs/benchmark-pleno-anonymize-ja.md for the JP methodology this mirrors.

License

MIT (matches the upstream ai4privacy/pii-masking-300k license).


