pleno_anonymize_ja

A Japanese PII NER model fine-tuned on the train split of 0xhikae/pii-masking-300k-ja.

The full methodological account is in docs/benchmark-pleno-anonymize-ja.md: template-overlap analysis, label-aware span-merge derivation, real-text eval, 3-seed variance, and open gaps. This card gives only the headline numbers and their caveats.

3-seed numbers (mean ± std)

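Matching is label-agnostic: a predicted span counts as correct when its character-level IoU with a gold span is ≥ 0.5. The 95% CIs come from a 1000-iteration bootstrap on the seed-42 run.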

Eval set	Mean F1	Std	F1 95% CI	Smoke ≥0.50	Parity ≥0.82
In-dist (300k-ja val, 300 docs)*	0.955	0.002	[0.935, 0.973]	✅	✅
Real text (stockmark JP Wikipedia, PII subset 147 docs)	0.467	0.010	[0.393, 0.520]	❌	❌

* The model was trained on the train split of this same dataset, so treat this score as an upper bound.

Production expectations should be calibrated near 0.47, not 0.96.
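
The matching rule behind these numbers (char-IoU ≥ 0.5, label-agnostic) can be sketched as follows. This is a minimal illustration, not the code in scripts/eval_on_300k.py: the greedy one-to-one matching and the names `char_iou` and `span_f1` are assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def char_iou(a: Span, b: Span) -> float:
    """Character-level intersection-over-union of two spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def span_f1(gold: List[Span], pred: List[Span], threshold: float = 0.5) -> float:
    """Label-agnostic F1. A prediction is a true positive when its best
    still-unmatched gold span has IoU >= threshold (greedy one-to-one
    matching, an assumption of this sketch)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda g: char_iou(g, p), default=None)
        if best is not None and char_iou(best, p) >= threshold:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because labels are ignored, a phone number tagged with the wrong label still counts as a hit under this rule; a label-aware score would be lower or equal.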

Honest comparison

Engine	In-dist F1	Real-text F1 (PII subset)
`builtin` v0.13.0	0.342	—
spaCy `ja_core_news_lg`	0.274	0.571
`openai-privacy-filter` v0.13.0	0.702	—
`pleno_anonymize_ja`	0.955	0.467

pleno_anonymize_ja wins in-distribution. spaCy beats it by 0.10 F1 on real Wikipedia text (0.571 vs 0.467). The domain mismatch matters: this model is trained on form-, record-, and chat-style PII text, whereas Wikipedia is narrative prose full of corporate, facility, and product entities that the model was never trained to recognise.

What this model is good for

Detecting PII in PII-typical text: forms, records, chat, call-centre transcripts, email bodies — content shaped like the ai4privacy training distribution.
Replacing or augmenting the builtin (Presidio + regex) backend in pleno-anonymize for Japanese PII redaction.
Production deployments that stack pattern recognisers (Presidio) on top for HEALTH_INSURANCE, MY_NUMBER, URL etc., which lie outside the 28-label ai4privacy schema.

What this model is NOT good for

General-purpose Japanese NER on narrative text (Wikipedia, news articles, novels). Use spaCy ja_core_news_lg or GiNZA.
Detecting corporate, product, facility, or event names — none of these exist in the label vocabulary.
Workloads requiring HEALTH_INSURANCE, MY_NUMBER, RESIDENCE_CARD, URL etc. (out-of-schema; stack Presidio pattern recognisers).
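
The stacking suggested above for out-of-schema labels can be sketched with plain regular expressions standing in for Presidio pattern recognisers. The patterns are simplified assumptions (a bare 12-digit run for MY_NUMBER with no check-digit validation, a basic http(s) matcher for URL), and `pattern_spans` and `merge_spans` are hypothetical helpers, not pleno-anonymize APIs.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns; production recognisers need validation.
PATTERNS = {
    "MY_NUMBER": re.compile(r"(?<!\d)\d{12}(?!\d)"),
    "URL": re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+"),
}


def pattern_spans(text: str) -> List[Dict]:
    """Find out-of-schema entities with regexes, using the same dict shape
    as the token-classification pipeline output."""
    return [
        {"entity_group": label, "start": m.start(), "end": m.end(), "score": 1.0}
        for label, rx in PATTERNS.items()
        for m in rx.finditer(text)
    ]


def merge_spans(model_spans: List[Dict], extra_spans: List[Dict]) -> List[Dict]:
    """Keep every model span; add pattern spans that overlap none of them."""
    merged = list(model_spans)
    for s in extra_spans:
        if all(s["end"] <= m["start"] or s["start"] >= m["end"] for m in model_spans):
            merged.append(s)
    return sorted(merged, key=lambda e: e["start"])
```

Giving the model's spans precedence on overlap is one possible policy; a deployment that trusts its patterns more could invert it.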

Training

Base: FacebookAI/xlm-roberta-base (270M params)
Data: 25,082 Japanese rows from 0xhikae/pii-masking-300k-ja train split
2 epochs, batch 16, lr 5e-5, fp16
3 training seeds reported: 42, 7, 1337; standard deviation below 0.015 F1 on every eval set
~5 min per seed on RTX A6000 (RunPod), ~$0.05 compute per seed
Released checkpoint is seed 42

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the released checkpoint and its tokenizer from the Hub.
model = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_ja")
tokenizer = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_ja")
# aggregation_strategy="simple" groups adjacent tokens with the same label into entity spans.
ner = pipeline(
    "token-classification", model=model, tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner("山田太郎さんの電話番号は090-1234-5678です。"))
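
For redaction, the spans returned by the pipeline can be swapped for label placeholders. The sketch below assumes each entity is a dict with `entity_group`, `start`, and `end` keys, which is the shape the token-classification pipeline returns under `aggregation_strategy="simple"`. The `redact` helper, the `[LABEL]` placeholder format, and the hand-written entities are hypothetical and do not represent actual model output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.
    Assumes the spans do not overlap."""
    out = text
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[: ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


# Hand-written entities standing in for real model output.
sample = "山田太郎さんの電話番号は090-1234-5678です。"
fake_entities = [
    {"entity_group": "LASTNAME1", "start": 0, "end": 2},
    {"entity_group": "GIVENNAME1", "start": 2, "end": 4},
    {"entity_group": "TEL", "start": 12, "end": 25},
]
print(redact(sample, fake_entities))
```

In practice `fake_entities` would be replaced by the list returned from `ner(sample)`.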

Label schema (28 ai4privacy labels)

BOD, BUILDING, CARDISSUER, CITY, COUNTRY, DATE, DRIVERLICENSE, EMAIL, GEOCOORD, GIVENNAME1, GIVENNAME2, IDCARD, IP, LASTNAME1, LASTNAME2, LASTNAME3, PASS, PASSPORT, POSTCODE, SECADDRESS, SEX, SOCIALNUMBER, STATE, STREET, TEL, TIME, TITLE, USERNAME.

Honest open gaps

No PII-context real-text eval. Wikipedia is real but off-domain. A set of ≥50 hand-annotated JP chat / form / email samples would close this gap. This is the highest-priority follow-up.
Library versions not pinned in pyproject.toml.
Only one classic baseline (spaCy). GiNZA / multilingual-BERT-NER not run.
ai4privacy upstream split protocol not fully documented.
Single base-model size (xlm-roberta-base). No -large ablation.

Reproducibility

Training: scripts/train_supervised_300k_ja.py
In-dist eval: scripts/eval_on_300k.py
Real-text eval: scripts/eval_stockmark_jp_real.py
Bootstrap CIs: scripts/compute_ci_bootstrap.py
Seeds are pinned across Python, NumPy, PyTorch, and Hugging Face Transformers.
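
The bootstrap CI step can be illustrated with a percentile bootstrap over documents. This is a sketch under assumptions, not the contents of scripts/compute_ci_bootstrap.py: it resamples per-document (tp, fp, fn) counts with replacement and recomputes micro-F1 on each resample, and the names `micro_f1` and `bootstrap_ci` are hypothetical.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # per-document (true pos, false pos, false neg)


def micro_f1(docs: List[Counts]) -> float:
    """Micro-averaged F1 from summed per-document counts."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(docs: List[Counts], n_iter: int = 1000,
                 alpha: float = 0.05, seed: int = 42) -> Tuple[float, float]:
    """Percentile bootstrap CI for micro-F1, resampling whole documents."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    scores = sorted(
        micro_f1([rng.choice(docs) for _ in docs]) for _ in range(n_iter)
    )
    lo_idx = int(n_iter * alpha / 2)
    hi_idx = n_iter - 1 - lo_idx
    return scores[lo_idx], scores[hi_idx]
```

Resampling whole documents rather than individual spans keeps spans from the same document together, which tends to give wider and more honest intervals on small eval sets such as the 147-doc real-text subset.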

Citation

@misc{pleno_anonymize_ja,
  title  = {pleno_anonymize_ja: Japanese PII NER on ai4privacy 300k-ja},
  author = {Egashira, Hikaru},
  year   = {2026},
  url    = {https://huggingface.co/0xhikae/pleno_anonymize_ja}
}
