	---
	license: mit
	language:
	- ko
	- en
	tags:
	- pii-detection
	- token-classification
	- korean
	- xlm-roberta
	- multilingual-e5
	- bioes
	base_model: intfloat/multilingual-e5-base
	pipeline_tag: token-classification
	---

	# Korean PII — multilingual-e5-base

	Span-level Korean PII detection, fine-tuned from
	[`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base)
	(a multilingual XLM-RoBERTa bidirectional encoder). It detects 9 PII categories as
	character-offset spans and is trained for multi-domain Korean coverage
	(conversational, news, and a range of document domains).


	[Open PII Notebook](https://huggingface.co/FrameByFrame/korean-pii-e5-base/blob/main/pii_demo.ipynb) — load the model and redact Korean PII interactively.

	## Capabilities

	\| Category \| Description \| Example \|
	\|---\|---\|---\|
	\| `private_person` \| Personal name (Korean / Western / handles) \| 김민수, John Smith \|
	\| `private_address` \| Physical / postal address \| 서울특별시 강남구 테헤란로 123 \|
	\| `private_phone` \| Phone number \| 010-1234-5678 \|
	\| `private_email` \| Email address \| minsu@example.com \|
	\| `private_date` \| Birthday / personally-identifying date \| 1985년 3월 12일 \|
	\| `private_url` \| Personal URL \| github.com/minsu \|
	\| `account_number` \| Bank, card, RRN, passport, etc. \| 110-234-567890 \|
	\| `personal_handle` \| Username / handle \| rainbow879612 \|
	\| `ip_address` \| IP address \| 192.168.1.5 \|

	## Benchmark Results

	Evaluated on three evaluation sets using exact character-span F1, with deterministic span
	normalization applied (see `extract_pii` below).

	\| eval set \| what it measures \| Overall F1 \|
	\|---\|---\|---:\|
	\| KDPII test (2,252) \| conversational Korean (in-domain) \| 0.943 \|
	\| Held-out document domains (insurance, government) \| unseen domains \| 0.995 \|
	\| KLUE-NER `person` \| real Korean news text \| 0.866 (recall 0.92) \|
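
As a rough illustration of the metric, exact character-span F1 counts a prediction as correct only when its label, start, and end all match a gold span. The sketch below assumes micro-averaging over the spans of a single text. The function name `span_f1` and the averaging choice are assumptions for illustration, not the evaluation code behind the table.

```python
def span_f1(gold, pred):
    """Exact-match span F1 over (label, start, end) triples.

    Assumption: micro-averaged over all spans; the model card does not
    state which averaging its evaluation uses.
    """
    gold_set = {(g["label"], g["start"], g["end"]) for g in gold}
    pred_set = {(p["label"], p["start"], p["end"]) for p in pred}
    tp = len(gold_set & pred_set)  # spans that match exactly
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hand-written spans for illustration only (not model output).
gold = [
    {"label": "private_person", "start": 0, "end": 3},
    {"label": "private_phone", "start": 10, "end": 23},
]
pred = [
    {"label": "private_person", "start": 0, "end": 4},   # one character too long: a miss
    {"label": "private_phone", "start": 10, "end": 23},  # exact match
]
print(span_f1(gold, pred))  # 0.5
```

Because a span that is off by a single character counts as a miss, trailing-particle trimming has a direct effect on this metric.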

	### KDPII per-class (conversational, in-domain)
	\| label \| F1 \| \| label \| F1 \|
	\|---\|---:\|---\|---\|---:\|
	\| `private_email` \| 1.000 \| \| `private_person` \| 0.909 \|
	\| `private_url` \| 1.000 \| \| `private_address` \| 0.922 \|
	\| `ip_address` \| 1.000 \| \| `account_number` \| 0.979 \|
	\| `private_date` \| 0.980 \| \| `personal_handle` \| 0.863 \|
	\| `private_phone` \| 0.993 \| \| \| \|


	## Quick Start

	### Install

	```bash
	pip install "transformers>=4.40" torch safetensors
	```

	### Load

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	MODEL_ID = "FrameByFrame/korean-pii-e5-base"
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
	model.eval()
	if torch.cuda.is_available():
	    model.cuda()
	```

	### Inference

	The model emits per-token BIOES labels. The helper decodes them into character-offset
	spans and applies light, deterministic span normalization (strips trailing Korean
	particles / whitespace from a span, e.g. `민수씨` → `민수`, `송파구에` → `송파구`). The
	benchmark numbers above include this normalization.

	```python
	import re

	_TRAILING_JOSA = ["이에요","이라고","입니다","이야","이랑","한테","에게","으로","이가","이는",
	"에서","이고","예요","씨","님","이","가","은","는","을","를","야","아","에","의","랑","께","고"]
	# Greedy match up to the last "일" or digit, so anything trailing a date is cut off.
	_DATE_END = re.compile(r".*(?:일|[0-9])", re.S)

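	def _normalize(text, label, s, e):
	    # Trim surrounding whitespace and punctuation from the span.
	    while s < e and text[s] in " .,\t\n":
	        s += 1
	    while e > s and text[e-1] in " .,\t\n":
	        e -= 1
	    if label == "private_date":
	        # Keep the span only up to its last "일" or digit.
	        m = _DATE_END.match(text[s:e])
	        if m and m.end() > 0:
	            e = s + m.end()
	    elif label in ("private_person", "personal_handle", "private_address"):
	        # Strip up to two trailing particles/honorifics, keeping at least 2 characters.
	        for _ in range(2):
	            seg = text[s:e]
	            for j in _TRAILING_JOSA:
	                if seg.endswith(j) and (e - s) - len(j) >= 2:
	                    e -= len(j)
	                    break
	            else:
	                break  # nothing was stripped on this pass, so stop
	    return s, e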

	def extract_pii(text: str, max_length: int = 256):
	    # Tokenize with character offsets so token labels map back to the input string.
	    enc = tokenizer(text, truncation=True, max_length=max_length,
	                    return_offsets_mapping=True, return_tensors="pt")
	    offsets = enc.pop("offset_mapping")[0].tolist()
	    with torch.no_grad():
	        logits = model(**{k: v.to(model.device) for k, v in enc.items()}).logits
	    pred = logits.argmax(-1)[0].tolist()
	    id2label = model.config.id2label

	    # Decode per-token BIOES tags into [label, start, end] spans.
	    spans, active = [], None  # active = [label, start, end]
	    for i, lid in enumerate(pred):
	        label = id2label[int(lid)]
	        cs, ce = offsets[i]
	        if cs == ce:  # special token: close any open span
	            if active:
	                spans.append(active)
	                active = None
	            continue
	        if label == "O":  # outside any entity: close any open span
	            if active:
	                spans.append(active)
	                active = None
	            continue
	        prefix, cat = label.split("-", 1)
	        if prefix in ("B", "S") or not active or active[0] != cat:
	            # Start a new span (also on a category change without a B/S tag).
	            if active:
	                spans.append(active)
	            active = [cat, cs, ce]
	        else:
	            active[2] = ce  # I/E tag of the same category: extend the span
	    if active:
	        spans.append(active)

	    # Normalize span boundaries and drop spans that end up empty.
	    out = []
	    for cat, s, e in spans:
	        s, e = _normalize(text, cat, s, e)
	        if text[s:e].strip():
	            out.append({"label": cat, "start": s, "end": e, "text": text[s:e]})
	    return out
	```

	### Redaction

	```python
	def redact(text: str) -> str:
	    # Replace from the end of the string so earlier offsets stay valid.
	    spans = sorted(extract_pii(text), key=lambda s: s["start"], reverse=True)
	    for s in spans:
	        text = text[:s["start"]] + f"[{s['label'].upper()}]" + text[s["end"]:]
	    return text

	# Example:
	# >>> redact("김민수님의 번호는 010-1234-5678입니다.")
	# "[PRIVATE_PERSON]님의 번호는 [PRIVATE_PHONE]입니다."
	```

	## Output Schema

	\| field \| description \|
	\|---\|---\|
	\| `label` \| one of the 9 categories above \|
	\| `start` \| character offset start (inclusive) \|
	\| `end` \| character offset end (exclusive) \|
	\| `text` \| the matched substring \|
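
To make the schema concrete, here is a small sketch of how downstream code might consume these dictionaries. The `PIISpan` type and `group_by_label` helper are hypothetical names, and the spans are hand-written for illustration rather than actual model output.

```python
from collections import defaultdict
from typing import TypedDict


class PIISpan(TypedDict):
    label: str   # one of the 9 categories
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    text: str    # matched substring


def group_by_label(text: str, spans: list[PIISpan]) -> dict[str, list[str]]:
    """Group matched substrings by category, checking offsets against `text`."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for span in spans:
        # Because `end` is exclusive, plain slicing recovers the match.
        if text[span["start"]:span["end"]] != span["text"]:
            raise ValueError(f"offsets do not match text for span {span}")
        grouped[span["label"]].append(span["text"])
    return dict(grouped)


text = "연락처: minsu@example.com, 010-1234-5678"
spans: list[PIISpan] = [
    {"label": "private_email", "start": 5, "end": 22, "text": "minsu@example.com"},
    {"label": "private_phone", "start": 24, "end": 37, "text": "010-1234-5678"},
]
print(group_by_label(text, spans))
```

Offsets index Python string characters (code points), so Hangul syllables count as one position each.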

	## Training Details

	\| \| \|
	\|---\|---\|
	\| Base model \| [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (XLM-RoBERTa, ~278M) \|
	\| Task \| token classification, BIOES (9 PII classes → 37 labels) \|
	\| Method \| full fine-tune (token head randomly initialized; encoder fully trained) \|
	\| Datasets \| multi-domain Korean mix — KDPII (conversational, CC BY 4.0) + KLUE-NER person spans (news) + LLM-generated multi-domain documents (medical, legal, finance, e-commerce, HR, real-estate, social, gaming, IT, telecom, education, travel, delivery, email) with placeholder-filled PII + distribution-matched synthetic PII. All PII is synthetic/generated, never real. \|
	\| Split \| KDPII test held out (seed 42); 2 document domains (insurance, government) fully held out for unseen-domain eval; KLUE-val held out \|
	\| Optimizer \| AdamW, lr 3e-5, linear schedule, warmup 0.05 \|
	\| Batch / seq \| 32 per device, max_length 256 \|
	\| Epochs \| 3, best checkpoint by `eval_span_f1` \|
	\| Precision \| bf16 \|
	\| Hardware \| 1× NVIDIA RTX A5000 \|
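
The label count in the Task row follows from the BIOES scheme: four position tags (B, I, E, S) per category plus a single shared `O`. A minimal sketch of how such a label set could be built is shown below; the ordering is an assumption, so rely on `model.config.id2label` for the mapping the released checkpoint actually uses.

```python
CATEGORIES = [
    "private_person", "private_address", "private_phone", "private_email",
    "private_date", "private_url", "account_number", "personal_handle",
    "ip_address",
]

# BIOES: Begin, Inside, End, Single for each category, plus one shared "O".
# Assumption: this ordering is illustrative and may differ from the checkpoint.
labels = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in "BIES"]
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))  # 9 categories * 4 prefixes + 1 = 37
```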

	## Known Limitations

	- `personal_handle` (~0.86 in-domain) is the weakest class: handles are open-vocabulary
	(arbitrary usernames) and overlap with names, so this class is near its practical ceiling.
	- Held-out document-domain F1 (0.995) is optimistic — those domains are unseen, but share
	the generator/entity distribution of the synthetic training data. It shows domain-content
	transfer, not guaranteed real-world-text robustness. Treat real-world performance as bounded
	by the KDPII (0.94, real conversational) and KLUE-news (0.87, real news) numbers.
	- Evaluate on your own domain before high-stakes use. Coverage is broad but not exhaustive;
	Korean PII annotation conventions vary by source.
	- Structured PII (phone/email/url/ip/account/RRN) is best paired with a regex/checksum
	validator in production for guaranteed precision.
	- The `extract_pii` helper applies span normalization; if you decode logits yourself, apply
	equivalent trimming to reproduce the reported numbers.
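
As a sketch of the validator pairing suggested above, the snippet below post-filters predicted spans with format checks. It is a starting point under stated assumptions, not a production validator: `is_valid` and `luhn_ok` are hypothetical helper names, the email pattern is deliberately loose, and the Luhn check only applies when a span is known to be a payment card number (bank accounts, RRNs, and passports use other formats).

```python
import ipaddress
import re

_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose


def luhn_ok(number: str) -> bool:
    """Luhn checksum, as used by payment card numbers."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_valid(span: dict) -> bool:
    """Return False only when a structured span clearly fails its format check."""
    label, value = span["label"], span["text"]
    if label == "ip_address":
        try:
            ipaddress.ip_address(value)
            return True
        except ValueError:
            return False
    if label == "private_email":
        return bool(_EMAIL_RE.match(value))
    return True  # no validator for this label: keep the model's prediction


print(is_valid({"label": "ip_address", "text": "192.168.1.5"}))  # True
print(is_valid({"label": "ip_address", "text": "999.1.1.1"}))    # False
print(luhn_ok("4111 1111 1111 1111"))                            # True
```

Filtering this way trades a little recall for precision, since a span that fails validation is dropped even if the surrounding text suggests it was meant as PII.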

	## License

	MIT — inherited from the base [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (MIT). Training data includes KDPII (CC BY 4.0).

	## Citation

	```bibtex
	@misc{framebyframe-korean-pii-e5-base-2026,
	  title = {Korean PII (multilingual-e5-base): token classification for Korean PII},
	  author = {Mariappan, Vijayachandran},
	  year = {2026},
	  url = {https://huggingface.co/FrameByFrame/korean-pii-e5-base}
	}
	```

	## Contact

	For inquiries, please contact vijay@artelligence.ai