How to use 0xhikae/pleno_anonymize_en with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="0xhikae/pleno_anonymize_en")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
model = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")

Lightweight English PII NER trained on the English split of
ai4privacy/pii-masking-300k.
Built as the English counterpart of
0xhikae/pleno_anonymize_ja;
the recipe intentionally mirrors the JP supervised v2 pipeline, but
uses distilbert-base-uncased as the backbone (~66M params) so the
artefact stays small for CPU inference.
The explicit target is a smoke-test threshold of ≥0.50 F1 on the EN validation split. Numbers reported alongside the JP card are 1000-iteration document-level bootstrap CIs; this card will be refreshed after the run completes.
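For readers unfamiliar with document-level bootstrap CIs, the following is a minimal sketch of one common way to compute them: resample whole documents with replacement, recompute micro-F1 on each resample, and read off percentiles. It is not the evaluation code behind this card. The function names, the per-document `(tp, fp, fn)` input format, and the percentile method are all assumptions made for illustration.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from an iterable of per-document (tp, fp, fn) tuples."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(doc_counts, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for micro-F1, resampling at the document level.

    doc_counts: list of (tp, fp, fn) tuples, one per document (assumed format).
    Returns (point_estimate, lower, upper).
    """
    if not doc_counts:
        raise ValueError("doc_counts must contain at least one document")
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = []
    for _ in range(n_iter):
        # Draw n documents with replacement; entities stay grouped by document.
        sample = [doc_counts[rng.randrange(n)] for _ in range(n)]
        scores.append(micro_f1(sample))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return micro_f1(doc_counts), lower, upper
```

Resampling documents rather than individual entities keeps correlated predictions within a document together, which tends to give wider and more honest intervals than entity-level resampling.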
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tok = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
mdl = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")
ner = pipeline("token-classification", model=mdl, tokenizer=tok, aggregation_strategy="simple")
ner("Contact: Alice Johnson <alice@example.com>, phone 555-123-4567.")
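With `aggregation_strategy="simple"`, the pipeline returns a list of dicts carrying `entity_group`, `score`, `start`, and `end` keys. The sketch below shows one way such spans could be turned into masked text. `mask_entities` and its `min_score` parameter are hypothetical helpers, not part of this model or of Transformers, and the labels in the usage note are illustrative rather than the model's actual tag set.

```python
def mask_entities(text, entities, min_score=0.0):
    """Replace each detected span with a [LABEL] placeholder.

    entities: list of dicts with "entity_group", "score", "start", "end",
    as produced by a token-classification pipeline with aggregation.
    Assumes spans do not overlap.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    masked = text
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        masked = masked[: e["start"]] + f"[{e['entity_group']}]" + masked[e["end"]:]
    return masked
```

In practice this would be called as `mask_entities(text, ner(text))`. Given hand-written spans labelled `NAME`, `EMAIL`, and `PHONE` for the example sentence above, the helper would return `Contact: [NAME] <[EMAIL]>, phone [PHONE].`; what the model itself detects may differ.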
Backbone: distilbert-base-uncased
Training data: ai4privacy/pii-masking-300k, English slice (~30k train / ~8k val)

Reproduce:
make -C packages/training dump-supervised-en
make -C packages/training train-supervised-en
make -C packages/training eval-300k-en
See docs/benchmark-pleno-anonymize-ja.md for the JP methodology this mirrors.
License: MIT (matches the upstream ai4privacy/pii-masking-300k license).
Base model
distilbert/distilbert-base-uncased