pleno_anonymize_en

Lightweight English PII NER model trained on the English split of ai4privacy/pii-masking-300k. It is the English counterpart of 0xhikae/pleno_anonymize_ja: the recipe intentionally mirrors the JP supervised v2 pipeline, but uses distilbert-base-uncased as the backbone (~66M params) so the artefact stays small enough for CPU inference.

Acceptance tier

The explicit target is the Smoke tier (≥0.50 F1) on the EN validation split. Numbers reported alongside the JP card are document-level bootstrap CIs over 1000 iterations; this card will be refreshed once the run completes.
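A document-level bootstrap CI can be sketched as follows. This is a minimal illustration, not the evaluation script used for this card: it assumes micro-averaged entity F1 and a percentile interval, neither of which the card specifies, and the per-document counts are made up.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(doc_counts, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI, resampling whole documents with replacement."""
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = sorted(
        micro_f1([doc_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_iter)
    )
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return micro_f1(doc_counts), lo, hi


# Hypothetical per-document (tp, fp, fn) entity counts, for illustration only.
docs = [(3, 1, 0), (2, 0, 1), (0, 1, 2), (5, 0, 0), (1, 1, 1)]
point, lo, hi = bootstrap_f1_ci(docs)
print(f"F1 = {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Resampling documents rather than individual entities keeps entities from the same document together, so the interval reflects document-to-document variation.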

Quick start

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub.
tok = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
mdl = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline("token-classification", model=mdl, tokenizer=tok, aggregation_strategy="simple")
print(ner("Contact: Alice Johnson <alice@example.com>, phone 555-123-4567."))
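To turn the pipeline output into masked text, one option is to replace each detected span by its label. The sketch below assumes the standard aggregated pipeline output (dicts with `entity_group`, `start` and `end`); the `redact` helper and the label names `NAME`, `EMAIL` and `PHONE` are hypothetical stand-ins, not this model's actual label set, and the entity list is hand-written so the example runs without downloading the model.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[: ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


text = "Contact: Alice Johnson <alice@example.com>, phone 555-123-4567."

# Hand-written stand-in for the pipeline output; label names are hypothetical.
entities = [
    {"entity_group": "NAME", "start": 9, "end": 22},
    {"entity_group": "EMAIL", "start": 24, "end": 41},
    {"entity_group": "PHONE", "start": 50, "end": 62},
]

print(redact(text, entities))
```

With a loaded model, `entities` would simply be the list returned by `ner(text)`.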

Training

  • Base: distilbert-base-uncased
  • Dataset: ai4privacy/pii-masking-300k, English slice (~30k train / ~8k val)
  • Recipe: 2 epochs, batch 16, lr 5e-5, fp16, seed 42 (mirror of JP v2)
  • Hardware: single RTX 4090 on RunPod
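As a rough sanity check on the size of the run, the recipe above can be written out as a config and the number of optimizer steps estimated from it. This is a sketch under assumptions: the keys follow Hugging Face `TrainingArguments` field names (the card does not say which trainer is used), the train size is the approximate ~30k figure, and no gradient accumulation is assumed.

```python
import math

# Hyperparameters from the recipe, keyed by TrainingArguments field names
# (assumption: the card does not state which training API is used).
hparams = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
    "fp16": True,
    "seed": 42,
}


def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps on a single GPU with no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


n_train = 30_000  # approximate English train size quoted above
steps = total_optimizer_steps(
    n_train,
    hparams["per_device_train_batch_size"],
    hparams["num_train_epochs"],
)
print(steps)  # 3750
```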

Reproduce:

make -C packages/training dump-supervised-en
make -C packages/training train-supervised-en
make -C packages/training eval-300k-en

See docs/benchmark-pleno-anonymize-ja.md for the JP methodology this mirrors.

License

MIT (matches the upstream ai4privacy/pii-masking-300k license).
