pleno_anonymize_ja

Japanese PII NER fine-tuned on 0xhikae/pii-masking-300k-ja train split.

Full methodological accounting in docs/benchmark-pleno-anonymize-ja.md: template-overlap analysis, label-aware span-merge derivation, real-text eval, 3-seed variance, open gaps. This card surfaces headlines + caveats.

3-seed numbers (mean ± std)

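A predicted span counts as a match when its character-level IoU with a gold span is ≥ 0.5, label-agnostic. The 95% CIs are 1000-iteration bootstrap intervals computed on the seed-42 run.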

| Eval set | Mean F1 | Std F1 | 95% CI | Smoke ≥0.50 | Parity ≥0.82 |
|---|---|---|---|---|---|
| In-dist (300k-ja val, 300 docs)* | 0.955 | 0.002 | [0.935, 0.973] | | |
| Real text (stockmark JP Wikipedia, PII subset 147 docs) | 0.467 | 0.010 | [0.393, 0.520] | | |

* Trained on the train split of this same dataset. Treat as upper bound.
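
The matching rule behind these numbers (character-level IoU ≥ 0.5, label-agnostic) can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the greedy one-to-one matching and the function names `char_iou`, `match_counts`, and `f1` are assumptions; the actual procedure is documented in docs/benchmark-pleno-anonymize-ja.md.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def char_iou(a: Span, b: Span) -> float:
    """Character-level intersection over union of two spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(pred: List[Span], gold: List[Span], thr: float = 0.5) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) using greedy one-to-one matching; labels are ignored."""
    unmatched_gold = list(gold)
    tp = 0
    for p in pred:
        # Pick the best still-unmatched gold span for this prediction.
        best = max(unmatched_gold, key=lambda g: char_iou(p, g), default=None)
        if best is not None and char_iou(p, best) >= thr:
            unmatched_gold.remove(best)
            tp += 1
    return tp, len(pred) - tp, len(gold) - tp


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from match counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy spans: one exact match, one partial match (IoU 11/13), one spurious span.
gold = [(0, 4), (12, 25)]
pred = [(0, 4), (14, 25), (30, 33)]
print(f1(*match_counts(pred, gold)))  # 0.8
```

Because the criterion is label-agnostic, a span tagged GIVENNAME1 that overlaps a gold LASTNAME1 span still counts as a hit under this sketch.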

Production expectations should be calibrated near 0.47, not 0.96.
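
The width of the real-text interval is part of that calibration. A percentile bootstrap of the kind named above can be sketched as below. Several details are assumptions not stated in this card: resampling at document level, micro-averaged F1, the percentile method, and the hypothetical helper name `bootstrap_f1_ci`.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_f1_ci(
    doc_counts: Sequence[Tuple[int, int, int]],
    n_iter: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for micro-F1.

    doc_counts holds one (tp, fp, fn) triple per document. Documents are
    resampled with replacement n_iter times (assumed resampling unit).
    """
    rng = random.Random(seed)
    n = len(doc_counts)
    scores: List[float] = []
    for _ in range(n_iter):
        sample = [doc_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    scores.sort()
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi


# Toy per-document counts, for illustration only (not the card's data).
toy = [(3, 1, 0), (2, 0, 1), (0, 1, 1), (4, 0, 0)] * 10
print(bootstrap_f1_ci(toy))
```

With only 147 documents in the real-text subset, an interval of this kind is expected to be wide, which is consistent with the [0.393, 0.520] range reported above.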

Honest comparison

Engine In-dist F1 Real-text F1 (PII subset)
builtin v0.13.0 0.342
spaCy ja_core_news_lg 0.274 0.571
openai-privacy-filter v0.13.0 0.702
pleno_anonymize_ja 0.955 0.467

pleno_anonymize_ja wins in-distribution. spaCy beats it by 0.10 F1 on real Wikipedia text. The domain mismatch matters: this model is trained on form-/record-/chat-style PII text; Wikipedia is narrative prose with corporate / facility / product entities the model was never trained to recognise.

What this model is good for

  • Detecting PII in PII-typical text: forms, records, chat, call-centre transcripts, email bodies — content shaped like the ai4privacy training distribution.
  • Replacing or augmenting the builtin (Presidio + regex) backend in pleno-anonymize for Japanese PII redaction.
  • Production deployments that stack pattern recognisers (Presidio) on top for HEALTH_INSURANCE, MY_NUMBER, URL etc., which lie outside the 28-label ai4privacy schema.

What this model is NOT good for

  • General-purpose Japanese NER on narrative text (Wikipedia, news articles, novels). Use spaCy ja_core_news_lg or GiNZA.
  • Detecting corporate, product, facility, or event names — none of these exist in the label vocabulary.
  • Workloads requiring HEALTH_INSURANCE, MY_NUMBER, RESIDENCE_CARD, URL etc. (out-of-schema; stack Presidio pattern recognisers).
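
Stacking pattern recognisers on top of the model, as recommended above for out-of-schema labels, might look like the sketch below. It uses plain `re` as a stand-in for Presidio so that it stays self-contained. The two regexes, the rule that model spans win on overlap, and the helper names `pattern_entities` and `stack` are illustrative assumptions, not the pleno-anonymize implementation.

```python
import re
from typing import Dict, List

# Illustrative patterns only. Real deployments need stricter validation
# (for example the My Number check digit), which is omitted here.
EXTRA_PATTERNS = {
    "MY_NUMBER": re.compile(r"(?<!\d)\d{4}[- ]?\d{4}[- ]?\d{4}(?!\d)"),
    # Restrict to printable ASCII so trailing Japanese text is not swallowed.
    "URL": re.compile(r"https?://[\x21-\x7E]+"),
}


def pattern_entities(text: str) -> List[Dict]:
    """Find out-of-schema entities with regexes, in the pipeline's dict shape."""
    found = []
    for label, pattern in EXTRA_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append({"entity_group": label, "start": m.start(),
                          "end": m.end(), "word": m.group(), "score": 1.0})
    return found


def stack(model_entities: List[Dict], text: str) -> List[Dict]:
    """Add pattern hits that overlap no model span, then sort by position."""
    merged = list(model_entities)
    for ent in pattern_entities(text):
        overlaps = any(ent["start"] < m["end"] and m["start"] < ent["end"]
                       for m in model_entities)
        if not overlaps:
            merged.append(ent)
    return sorted(merged, key=lambda e: e["start"])


text = "番号は1234-5678-9012、URLはhttps://example.com/aです。"
for ent in stack([], text):
    print(ent["entity_group"], ent["word"])
```

Letting the model's span win on overlap is one possible policy; a deployment that trusts its patterns more than the model could invert it.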

Training

  • Base: FacebookAI/xlm-roberta-base (270M params)
  • Data: 25,082 Japanese rows from 0xhikae/pii-masking-300k-ja train split
  • 2 epochs, batch 16, lr 5e-5, fp16
  • 3 training seeds reported: 42, 7, 1337; standard deviation below 0.015 F1 across all eval sets
  • ~5 min per seed on RTX A6000 (RunPod), ~$0.05 compute per seed
  • Released checkpoint is seed 42
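
For orientation, the hyperparameters above can be collected into a plain config, from which the number of optimizer steps per seed follows. This is a sketch, not the training script: it assumes a single device, no gradient accumulation, and that the last partial batch is kept. The names `CONFIG` and `optimizer_steps` are hypothetical.

```python
import math

# Values copied from the list above.
CONFIG = {
    "base_model": "FacebookAI/xlm-roberta-base",
    "train_rows": 25_082,
    "epochs": 2,
    "batch_size": 16,
    "learning_rate": 5e-5,
    "fp16": True,
    "seeds": [42, 7, 1337],
}


def optimizer_steps(cfg: dict) -> int:
    """Steps per training run, assuming one optimizer update per batch."""
    steps_per_epoch = math.ceil(cfg["train_rows"] / cfg["batch_size"])
    return steps_per_epoch * cfg["epochs"]


print(optimizer_steps(CONFIG))  # 3136
```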

Usage

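from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the released checkpoint (seed 42) and its tokenizer from the Hub.
model = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_ja")
tokenizer = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_ja")

# aggregation_strategy="simple" groups sub-word tokens into entity spans
# with character offsets ("start", "end") and an "entity_group" label.
ner = pipeline(
    "token-classification", model=model, tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner("山田太郎さんの電話番号は090-1234-5678です。"))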
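
To turn the pipeline output into redacted text, one option is to replace each span with its label. The sketch below assumes non-overlapping spans and relies on the `entity_group`, `start`, and `end` keys that the pipeline returns when an aggregation strategy is set. The `redact` helper is hypothetical, and the entities in the example are hand-written for illustration, not actual model output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[: ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


text = "山田太郎さんの電話番号は090-1234-5678です。"
# Hand-written spans in the pipeline's output shape (illustrative only).
entities = [
    {"entity_group": "LASTNAME1", "start": 0, "end": 2},
    {"entity_group": "GIVENNAME1", "start": 2, "end": 4},
    {"entity_group": "TEL", "start": 12, "end": 25},
]
print(redact(text, entities))  # [LASTNAME1][GIVENNAME1]さんの電話番号は[TEL]です。
```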

Label schema (28 ai4privacy labels)

BOD, BUILDING, CARDISSUER, CITY, COUNTRY, DATE, DRIVERLICENSE, EMAIL, GEOCOORD, GIVENNAME1, GIVENNAME2, IDCARD, IP, LASTNAME1, LASTNAME2, LASTNAME3, PASS, PASSPORT, POSTCODE, SECADDRESS, SEX, SOCIALNUMBER, STATE, STREET, TEL, TIME, TITLE, USERNAME.
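
Downstream code often needs only a subset of these labels. The sketch below filters pipeline output by label and score and rejects labels that are not in the schema, which catches requests for out-of-schema types such as MY_NUMBER early. The `filter_entities` helper and the 0.5 default threshold are assumptions for illustration, not recommended settings.

```python
from typing import Dict, Iterable, List

# The 28 labels listed above.
SCHEMA = frozenset(
    "BOD BUILDING CARDISSUER CITY COUNTRY DATE DRIVERLICENSE EMAIL GEOCOORD "
    "GIVENNAME1 GIVENNAME2 IDCARD IP LASTNAME1 LASTNAME2 LASTNAME3 PASS "
    "PASSPORT POSTCODE SECADDRESS SEX SOCIALNUMBER STATE STREET TEL TIME "
    "TITLE USERNAME".split()
)


def filter_entities(entities: List[Dict], keep: Iterable[str],
                    min_score: float = 0.5) -> List[Dict]:
    """Keep entities whose label is in `keep` and whose score passes the threshold."""
    keep = set(keep)
    unknown = keep - SCHEMA
    if unknown:
        raise ValueError(f"labels not in the model schema: {sorted(unknown)}")
    return [e for e in entities
            if e["entity_group"] in keep and e["score"] >= min_score]


# Hand-written entities in the pipeline's output shape (illustrative only).
sample = [
    {"entity_group": "TEL", "score": 0.99, "start": 12, "end": 25},
    {"entity_group": "CITY", "score": 0.40, "start": 30, "end": 32},
]
print(filter_entities(sample, keep={"TEL", "CITY"}))
```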

Honest open gaps

  • No PII-context real-text eval. Wikipedia is real but off-domain. A set of ≥50 hand-annotated JP chat / form / email samples would close this gap. Highest-priority follow-up.
  • Library versions not pinned in pyproject.toml.
  • Only one classic baseline (spaCy). GiNZA / multilingual-BERT-NER not run.
  • AI4Privacy upstream split-protocol not fully documented.
  • Single base-model size (xlm-roberta-base). No -large ablation.

Reproducibility

Citation

@misc{pleno_anonymize_ja,
  title  = {pleno_anonymize_ja: Japanese PII NER on ai4privacy 300k-ja},
  author = {Egashira, Hikaru},
  year   = {2026},
  url    = {https://huggingface.co/0xhikae/pleno_anonymize_ja}
}