---
license: mit
language:
- ko
- en
tags:
- pii-detection
- token-classification
- korean
- xlm-roberta
- multilingual-e5
- bioes
base_model: intfloat/multilingual-e5-base
pipeline_tag: token-classification
---

# Korean PII: multilingual-e5-base

Span-level **Korean PII detection**, fine-tuned from
[`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base)
(a multilingual XLM-RoBERTa bidirectional encoder). It detects 9 PII categories as
character-offset spans and is trained for **multi-domain** Korean coverage
(conversational, news, and a range of document domains).

**[Open PII Notebook](https://huggingface.co/FrameByFrame/korean-pii-e5-base/blob/main/pii_demo.ipynb)**: load the model and redact Korean PII interactively.

## Capabilities

| Category | Description | Example |
|---|---|---|
| `private_person` | Personal name (Korean / Western / handles) | 김민수, John Smith |
| `private_address` | Physical / postal address | 서울특별시 강남구 테헤란로 123 |
| `private_phone` | Phone number | 010-1234-5678 |
| `private_email` | Email address | minsu@example.com |
| `private_date` | Birthday / personally-identifying date | 1985년 3월 12일 |
| `private_url` | Personal URL | github.com/minsu |
| `account_number` | Bank, card, RRN, passport, etc. | 110-234-567890 |
| `personal_handle` | Username / handle | rainbow879612 |
| `ip_address` | IP address | 192.168.1.5 |

## Benchmark Results

Scores are exact character-span F1 on three evaluation sets, computed after the
deterministic span normalization applied by `extract_pii` (see below).

| Eval set | What it measures | Overall F1 |
|---|---|---:|
| **KDPII test** (2,252) | conversational Korean (in-domain) | **0.943** |
| **Held-out document domains** (insurance, government) | unseen domains | **0.995** |
| **KLUE-NER `person`** | real Korean **news** text | **0.866** (recall 0.92) |
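
Exact character-span F1, as the term is usually defined, counts a prediction as correct only when its label, start, and end all match a gold span. The evaluation script is not part of this card, so the block below is only a minimal sketch of that metric, assuming micro-averaging over pooled spans. `span_prf` and the `(label, start, end)` tuple layout are illustrative names, not the author's code.

```python
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive

def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where only exact (label, start, end) matches count."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # spans that are identical in both sets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("private_person", 0, 3), ("private_phone", 9, 22)]
pred = [("private_person", 0, 3), ("private_phone", 9, 21)]  # phone span is one character short
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

To score a whole corpus this way, a document id would need to be added to each tuple so that spans from different texts cannot collide.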

### KDPII per-class (conversational, in-domain)

| Label | F1 | | Label | F1 |
|---|---:|---|---|---:|
| `private_email` | 1.000 | | `private_person` | 0.909 |
| `private_url` | 1.000 | | `private_address` | 0.922 |
| `ip_address` | 1.000 | | `account_number` | 0.979 |
| `private_date` | 0.980 | | `personal_handle` | 0.863 |
| `private_phone` | 0.993 | | | |

## Quick Start

### Install

```bash
pip install "transformers>=4.40" torch safetensors
```

### Load

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "FrameByFrame/korean-pii-e5-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# bf16 matches the training precision; omit torch_dtype to load in the default float32.
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()
if torch.cuda.is_available():
    model.cuda()
```

### Inference

The model emits per-token BIOES labels. The helper below decodes them into character-offset
spans and applies light, deterministic **span normalization**: it strips trailing Korean
particles and whitespace from a span (e.g. `민수씨` → `민수`, `송파구에` → `송파구`). The
benchmark numbers above include this normalization.

```python
import re

# Trailing particles and honorifics, longest first so that multi-syllable endings win.
_TRAILING_JOSA = ["이에요", "이라고", "입니다", "이야", "이랑", "한테", "에게", "으로", "이가", "이는",
                  "에서", "이고", "에요", "씨", "님", "이", "가", "은", "는", "을", "를", "야", "의", "에", "와", "도", "께", "고"]
_DATE_END = re.compile(r".*(?:일|[0-9])", re.S)

def _normalize(text, label, s, e):
    """Trim a predicted span [s, e) down to the entity itself."""
    # Drop surrounding whitespace and punctuation.
    while s < e and text[s] in " .,\t\n":
        s += 1
    while e > s and text[e - 1] in " .,\t\n":
        e -= 1
    if label == "private_date":
        # End a date at its last "일" or digit.
        m = _DATE_END.match(text[s:e])
        if m and m.end() > 0:
            e = s + m.end()
    elif label in ("private_person", "personal_handle", "private_address"):
        # Strip up to two stacked endings, always keeping at least two characters.
        for _ in range(2):
            seg = text[s:e]
            for j in _TRAILING_JOSA:
                if seg.endswith(j) and (e - s) - len(j) >= 2:
                    e -= len(j)
                    break
            else:
                break  # nothing was stripped on this pass
    return s, e

def extract_pii(text: str, max_length: int = 256):
    """Return PII spans as dicts with label, start, end, and text."""
    # Input beyond max_length tokens is truncated and therefore not scanned.
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**{k: v.to(model.device) for k, v in enc.items()}).logits
    pred = logits.argmax(-1)[0].tolist()
    id2label = model.config.id2label

    spans, active = [], None  # active = [label, start, end]
    for i, lid in enumerate(pred):
        label = id2label[int(lid)]
        cs, ce = offsets[i]
        if cs == ce or label == "O":  # a special token or a non-PII token closes any open span
            if active:
                spans.append(active)
                active = None
            continue
        prefix, cat = label.split("-", 1)
        if prefix in ("B", "S") or not active or active[0] != cat:
            if active:
                spans.append(active)
            active = [cat, cs, ce]  # start a new span
        else:
            active[2] = ce  # I- or E- tag of the same category extends the open span
    if active:
        spans.append(active)

    out = []
    for cat, s, e in spans:
        s, e = _normalize(text, cat, s, e)
        if text[s:e].strip():
            out.append({"label": cat, "start": s, "end": e, "text": text[s:e]})
    return out
```
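
`extract_pii` truncates its input at `max_length` tokens, so text past that point is never scanned. One way to cover longer documents is to run the helper over overlapping character windows and shift the offsets back to document coordinates. The sketch below assumes that approach: `extract_long`, `window`, and `overlap` are hypothetical names and values, and the extractor is passed in as a callable so the sketch does not depend on the model being loaded.

```python
from typing import Callable

def extract_long(text: str, extract_fn: Callable[[str], list],
                 window: int = 200, overlap: int = 50) -> list:
    """Run extract_fn over overlapping character windows and merge the spans."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    found = {}  # (label, start, end) -> span dict; the key removes exact duplicates
    start = 0
    while True:
        chunk = text[start:start + window]
        for span in extract_fn(chunk):
            s, e = span["start"] + start, span["end"] + start  # window -> document offsets
            found[(span["label"], s, e)] = {"label": span["label"], "start": s,
                                            "end": e, "text": text[s:e]}
        if start + window >= len(text):
            break
        start += window - overlap

    merged = []
    # Sort by start, longest first, so fragments cut off at a window edge lose.
    for span in sorted(found.values(), key=lambda d: (d["start"], d["start"] - d["end"])):
        if merged and span["start"] < merged[-1]["end"]:
            if span["end"] - span["start"] > merged[-1]["end"] - merged[-1]["start"]:
                merged[-1] = span  # overlapping and longer: replace the previous span
            continue
        merged.append(span)
    return merged
```

With the model loaded, a call would look like `extract_long(document, extract_pii)`. The window is measured in characters, which only approximates the token budget, so the default of 200 should be checked against real token counts, and `overlap` should be at least as long as the longest expected span.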

### Redaction

```python
def redact(text: str) -> str:
    # Replace from the end of the string so that earlier offsets stay valid.
    spans = sorted(extract_pii(text), key=lambda s: s["start"], reverse=True)
    for s in spans:
        text = text[:s["start"]] + f"[{s['label'].upper()}]" + text[s["end"]:]
    return text

print(redact("김민수님의 번호는 010-1234-5678입니다."))
# [PRIVATE_PERSON]님의 번호는 [PRIVATE_PHONE]입니다.
```

## Output Schema

`extract_pii` returns a list of dicts, one per detected span:

| Field | Description |
|---|---|
| `label` | one of the 9 categories above |
| `start` | character offset start (inclusive) |
| `end` | character offset end (exclusive) |
| `text` | the matched substring, equal to `text[start:end]` of the input |

## Training Details

| Setting | Value |
|---|---|
| **Base model** | [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (XLM-RoBERTa, ~278M parameters) |
| **Task** | token classification, BIOES (9 PII classes → 37 labels) |
| **Method** | full fine-tune (token head randomly initialized; encoder fully trained) |
| **Datasets** | **multi-domain Korean mix**: KDPII (conversational, CC BY 4.0) + KLUE-NER person spans (news) + LLM-generated multi-domain documents (medical, legal, finance, e-commerce, HR, real-estate, social, gaming, IT, telecom, education, travel, delivery, email) with placeholder-filled PII + distribution-matched synthetic PII. All PII is synthetic/generated, never real. |
| **Split** | KDPII test held out (seed 42); 2 document domains (insurance, government) fully held out for unseen-domain eval; KLUE-val held out |
| **Optimizer** | AdamW, lr 3e-5, linear schedule, warmup 0.05 |
| **Batch / seq** | 32 per device, max_length 256 |
| **Epochs** | 3, best checkpoint by `eval_span_f1` |
| **Precision** | bf16 |
| **Hardware** | 1× NVIDIA RTX A5000 |
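
The label count follows from the BIOES scheme: each of the 9 categories gets four tags (Begin, Inside, End, Single), and one shared `O` tag covers everything else, giving 9 × 4 + 1 = 37. A sketch of building such a label set is shown below. The ordering here is an assumption made for illustration; the mapping the model actually uses should be read from `model.config.id2label`.

```python
CATEGORIES = [
    "private_person", "private_address", "private_phone", "private_email", "private_date",
    "private_url", "account_number", "personal_handle", "ip_address",
]

# Assumed ordering: "O" first, then B/I/E/S for each category in turn.
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in "BIES"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 37
```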

## Known Limitations

- **`personal_handle` (~0.86 in-domain)** is the weakest class. Handles are open-vocabulary
  (arbitrary usernames) and overlap with names, so this score is likely near its practical ceiling.
- **Held-out document-domain F1 (0.995) is optimistic.** Those domains are unseen, but they share
  the *generator/entity distribution* of the synthetic training data. The score shows domain-content
  transfer, not guaranteed robustness on real-world text. Treat real-world performance as bounded
  by the KDPII (0.94, real conversational) and KLUE news (0.87, real news) numbers.
- **Evaluate on your own domain before high-stakes use.** Coverage is broad but not exhaustive,
  and Korean PII annotation conventions vary by source.
- **Structured PII** (phone/email/url/ip/account/RRN) is best paired with a regex/checksum
  validator in production, so that malformed matches are rejected.
- The `extract_pii` helper applies span normalization. If you decode logits yourself, apply
  equivalent trimming to reproduce the reported numbers.
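
The structured-PII point above can be sketched as a post-filter on the spans returned by `extract_pii`. This is a minimal illustration using only the standard library. `is_plausible`, `luhn_ok`, and the email pattern are hypothetical helpers with deliberately loose rules, not part of the released model, and labels without a check are passed through unchanged.

```python
import ipaddress
import re

_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose shape check, not full RFC validation

def luhn_ok(number: str) -> bool:
    """Luhn checksum; meaningful for payment card numbers only."""
    digits = [int(c) for c in number if c.isdecimal()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

def is_plausible(span: dict) -> bool:
    """Format check for structured labels; other labels are accepted as-is."""
    label, value = span["label"], span["text"]
    if label == "ip_address":
        try:
            ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
            return True
        except ValueError:
            return False
    if label == "private_email":
        return _EMAIL.fullmatch(value) is not None
    return True
```

A typical use would be `[s for s in extract_pii(text) if is_plausible(s)]`. `luhn_ok` is kept separate because `account_number` also covers bank accounts, RRNs, and passports, which do not follow the Luhn scheme, so it should only be applied once a span is known to be a card number.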

## License

MIT, inherited from the base [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (MIT). Training data includes KDPII (CC BY 4.0).

## Citation

```bibtex
@misc{framebyframe-korean-pii-e5-base-2026,
  title  = {Korean PII (multilingual-e5-base): token classification for Korean PII},
  author = {Mariappan, Vijayachandran},
  year   = {2026},
  url    = {https://huggingface.co/FrameByFrame/korean-pii-e5-base}
}
```

## Contact

For inquiries, please contact vijay@artelligence.ai.