pleno_anonymize_ja
Japanese PII NER fine-tuned on 0xhikae/pii-masking-300k-ja train split.
Full methodological accounting is in `docs/benchmark-pleno-anonymize-ja.md`: template-overlap analysis, label-aware span-merge derivation, real-text eval, 3-seed variance, and open gaps. This card surfaces the headline numbers and caveats.
3-seed numbers (mean ± std)
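A predicted span counts as a match when its character-level IoU with a gold span is ≥ 0.5, regardless of label (label-agnostic). The 95% CIs come from a 1000-iteration bootstrap on the seed-42 run.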
| Eval set | Mean F1 | Std | F1 95% CI | Smoke ≥0.50 | Parity ≥0.82 |
|---|---|---|---|---|---|
| In-dist (300k-ja val, 300 docs)* | 0.955 | 0.002 | [0.935, 0.973] | ✅ | ✅ |
| Real text (stockmark JP Wikipedia, PII subset 147 docs) | 0.467 | 0.010 | [0.393, 0.520] | ❌ | ❌ |
* Trained on the train split of this same dataset. Treat as upper bound.
Production expectations should be calibrated near 0.47, not 0.96.
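To make the matching rule concrete, here is a minimal sketch of label-agnostic Char-IoU ≥ 0.5 scoring. It is not taken from the repository's eval scripts: the greedy one-to-one matching, the function names, and the example spans are all assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def char_iou(a: Span, b: Span) -> float:
    """Character-level intersection over union of two spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(pred: List[Span], gold: List[Span], threshold: float = 0.5):
    """Greedy one-to-one matching (an assumption). Returns (tp, fp, fn)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda g: char_iou(p, g), default=None)
        if best is not None and char_iou(p, best) >= threshold:
            unmatched.remove(best)  # each gold span can be matched only once
            tp += 1
    return tp, len(pred) - tp, len(unmatched)


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical spans: labels are ignored, only character offsets matter.
gold = [(0, 4), (11, 24)]
pred = [(0, 4), (12, 24), (30, 35)]  # exact hit, near hit, spurious span
counts = match_counts(pred, gold)
print(counts, f1(*counts))  # (2, 1, 0) 0.8
```

Because the score ignores labels, a span tagged with the wrong PII type still counts as a hit, so these F1 values describe span detection rather than type classification.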
Honest comparison
| Engine | In-dist F1 | Real-text F1 (PII subset) |
|---|---|---|
| builtin v0.13.0 | 0.342 | — |
| spaCy ja_core_news_lg | 0.274 | 0.571 |
| openai-privacy-filter v0.13.0 | 0.702 | — |
| pleno_anonymize_ja | 0.955 | 0.467 |
pleno_anonymize_ja wins in-distribution. spaCy beats it by 0.10 F1 on real Wikipedia text. The domain mismatch matters: this model is trained on form-/record-/chat-style PII text; Wikipedia is narrative prose with corporate / facility / product entities the model was never trained to recognise.
What this model is good for
- Detecting PII in PII-typical text: forms, records, chat, call-centre transcripts, email bodies — content shaped like the ai4privacy training distribution.
- Replacing or augmenting the `builtin` (Presidio + regex) backend in `pleno-anonymize` for Japanese PII redaction.
- Production deployments that stack pattern recognisers (Presidio) on top for `HEALTH_INSURANCE`, `MY_NUMBER`, `URL` etc., which lie outside the 28-label ai4privacy schema.
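As a rough illustration of stacking pattern recognisers on top of the model, the sketch below merges regex hits with model spans. It uses plain `re` rather than Presidio, and the two patterns, the helper names, and the "model spans win on overlap" policy are assumptions for illustration, not the `pleno-anonymize` implementation.

```python
import re

# Illustrative patterns only; NOT the Presidio recognisers referred to above.
PATTERNS = {
    # My Number is 12 digits; optional separators are an assumption.
    "MY_NUMBER": re.compile(r"(?<!\d)\d{4}[- ]?\d{4}[- ]?\d{4}(?!\d)"),
    "URL": re.compile(r"https?://[^\s、。]+"),
}


def pattern_spans(text):
    """Return regex hits shaped like aggregated pipeline entities."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append({"entity_group": label, "start": m.start(), "end": m.end()})
    return spans


def stack(model_spans, extra_spans):
    """Keep every model span; add pattern spans that overlap none of them."""
    merged = list(model_spans)
    for s in extra_spans:
        if all(s["end"] <= m["start"] or s["start"] >= m["end"] for m in model_spans):
            merged.append(s)
    return sorted(merged, key=lambda s: s["start"])


text = "マイナンバーは1234-5678-9012、詳細は https://example.com/a を参照。"
for span in stack([], pattern_spans(text)):
    print(span["entity_group"], text[span["start"]:span["end"]])
```

Giving model spans priority is only one possible policy; a deployment that trusts checksum-validated patterns more than the model may prefer the reverse.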
What this model is NOT good for
- General-purpose Japanese NER on narrative text (Wikipedia, news articles, novels). Use spaCy `ja_core_news_lg` or GiNZA.
- Detecting corporate, product, facility, or event names: none of these exist in the label vocabulary.
- Workloads requiring `HEALTH_INSURANCE`, `MY_NUMBER`, `RESIDENCE_CARD`, `URL` etc. (out-of-schema; stack Presidio pattern recognisers).
Training
- Base: `FacebookAI/xlm-roberta-base` (270M params)
- Data: 25,082 Japanese rows from `0xhikae/pii-masking-300k-ja` train split
- 2 epochs, batch 16, lr 5e-5, fp16
- 3 training seeds reported: 42, 7, 1337 — variance below 0.015 F1 across all eval sets
- ~5 min per seed on RTX A6000 (RunPod), ~$0.05 compute per seed
- Released checkpoint is seed 42
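For a sense of scale, the hyperparameters above can be written down and turned into an optimizer step count. This is a back-of-the-envelope sketch, not the contents of the training script: the dictionary keys follow the parameter names of `transformers.TrainingArguments`, and the step count assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

# Values from the list above; key names mirror transformers.TrainingArguments.
hparams = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,  # assumes "batch 16" means per device
    "learning_rate": 5e-5,
    "fp16": True,
    "seed": 42,  # seed of the released checkpoint
}
train_rows = 25_082

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_rows / hparams["per_device_train_batch_size"])
total_steps = steps_per_epoch * hparams["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 1568 3136
```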
Usage
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
model = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_ja")
tokenizer = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_ja")
ner = pipeline(
    "token-classification", model=model, tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner("山田太郎さんの電話番号は090-1234-5678です。"))
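With `aggregation_strategy="simple"`, the pipeline returns a list of dicts carrying `entity_group`, `score`, `word`, `start`, and `end`. The following sketch shows one way to turn such a list into redacted text. The `redact` helper and the `[LABEL]` placeholder format are assumptions, and the entity list is hand-written for illustration, not actual model output.

```python
def redact(text, entities, min_score=0.0):
    """Replace each span with [LABEL], right to left so earlier offsets stay valid.

    Assumes the spans do not overlap.
    """
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent.get("score", 1.0) < min_score:
            continue  # skip low-confidence detections
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


text = "山田太郎さんの電話番号は090-1234-5678です。"
# Hand-written entities in the pipeline's output shape (NOT real model output).
entities = [
    {"entity_group": "LASTNAME1", "score": 0.99, "start": 0, "end": 2},
    {"entity_group": "GIVENNAME1", "score": 0.98, "start": 2, "end": 4},
    {"entity_group": "TEL", "score": 0.99, "start": 12, "end": 25},
]
print(redact(text, entities))
# [LASTNAME1][GIVENNAME1]さんの電話番号は[TEL]です。
```

Raising `min_score` trades recall for precision; for redaction workloads a low threshold is usually the safer default.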
Label schema (28 ai4privacy labels)
BOD, BUILDING, CARDISSUER, CITY, COUNTRY, DATE, DRIVERLICENSE,
EMAIL, GEOCOORD, GIVENNAME1, GIVENNAME2, IDCARD, IP,
LASTNAME1, LASTNAME2, LASTNAME3, PASS, PASSPORT, POSTCODE,
SECADDRESS, SEX, SOCIALNUMBER, STATE, STREET, TEL, TIME,
TITLE, USERNAME.
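The label list can be expanded into a token-level tag set as sketched below. The BIO scheme and the tag ordering are assumptions made for illustration; the mapping the released checkpoint actually uses is whatever `model.config.id2label` reports.

```python
# The 28 labels listed above.
LABELS = [
    "BOD", "BUILDING", "CARDISSUER", "CITY", "COUNTRY", "DATE", "DRIVERLICENSE",
    "EMAIL", "GEOCOORD", "GIVENNAME1", "GIVENNAME2", "IDCARD", "IP",
    "LASTNAME1", "LASTNAME2", "LASTNAME3", "PASS", "PASSPORT", "POSTCODE",
    "SECADDRESS", "SEX", "SOCIALNUMBER", "STATE", "STREET", "TEL", "TIME",
    "TITLE", "USERNAME",
]

# Assumption: BIO tagging, i.e. one "O" tag plus B-/I- variants of every label.
tags = ["O"] + [f"{prefix}-{label}" for label in LABELS for prefix in ("B", "I")]
label2id = {tag: i for i, tag in enumerate(tags)}
id2label = {i: tag for tag, i in label2id.items()}

print(len(LABELS), len(tags))  # 28 57
```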
Honest open gaps
- No PII-context real-text eval. Wikipedia is real but off-domain. ≥50 hand-annotated JP chat / form / email samples would resolve this. Highest-priority follow-up.
- Library versions not pinned in `pyproject.toml`.
- Only one classic baseline (spaCy). GiNZA / multilingual-BERT-NER not run.
- AI4Privacy upstream split-protocol not fully documented.
- Single base-model size (xlm-roberta-base). No -large ablation.
Reproducibility
- Training: `scripts/train_supervised_300k_ja.py`
- In-dist eval: `scripts/eval_on_300k.py`
- Real-text eval: `scripts/eval_stockmark_jp_real.py`
- Bootstrap CIs: `scripts/compute_ci_bootstrap.py`
- Seed pinning across python/numpy/torch/HF.
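For readers who want to see how a bootstrap interval of this kind can be computed, here is a minimal percentile-bootstrap sketch. It is not the code in `scripts/compute_ci_bootstrap.py`: resampling whole documents, pooling counts into a micro-F1, and the function name are all assumptions.

```python
import random


def bootstrap_f1_ci(per_doc_counts, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for micro-F1.

    per_doc_counts: list of (tp, fp, fn) tuples, one per document.
    Documents are resampled with replacement (an assumption).
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(per_doc_counts)
    scores = []
    for _ in range(n_iter):
        sample = [per_doc_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    scores.sort()
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi


# Hypothetical per-document counts, for illustration only.
toy_counts = [(3, 0, 1), (2, 1, 0), (0, 2, 2), (4, 0, 0), (1, 1, 1)]
print(bootstrap_f1_ci(toy_counts))
```

Resampling at the document level keeps spans from the same document together, which matters when errors cluster within documents.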
Citation
@misc{pleno_anonymize_ja,
title = {pleno_anonymize_ja: Japanese PII NER on ai4privacy 300k-ja},
author = {Egashira, Hikaru},
year = {2026},
url = {https://huggingface.co/0xhikae/pleno_anonymize_ja}
}