---
license: mit
language:
  - ko
  - en
tags:
  - pii-detection
  - token-classification
  - korean
  - xlm-roberta
  - multilingual-e5
  - bioes
base_model: intfloat/multilingual-e5-base
pipeline_tag: token-classification
---

# Korean PII: multilingual-e5-base

Span-level **Korean PII detection**, fine-tuned from
[`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base)
(a multilingual XLM-RoBERTa bidirectional encoder). It detects 9 PII categories as
character-offset spans and is trained for **multi-domain** Korean coverage
(conversational, news, and a range of document domains).


**[Open PII Notebook](https://huggingface.co/FrameByFrame/korean-pii-e5-base/blob/main/pii_demo.ipynb)**: load the model and redact Korean PII interactively.

## Capabilities

| Category | Description | Example |
|---|---|---|
| `private_person` | Personal name (Korean / Western / handles) | 김민수, John Smith |
| `private_address` | Physical / postal address | 서울특별시 강남구 테헤란로 123 |
| `private_phone` | Phone number | 010-1234-5678 |
| `private_email` | Email address | minsu@example.com |
| `private_date` | Birthday / personally-identifying date | 1985년 3월 12일 |
| `private_url` | Personal URL | github.com/minsu |
| `account_number` | Bank, card, RRN, passport, etc. | 110-234-567890 |
| `personal_handle` | Username / handle | rainbow879612 |
| `ip_address` | IP address | 192.168.1.5 |

## Benchmark Results

Scores are exact character-span F1 on three evaluation sets, computed after deterministic
span normalization (see `extract_pii` below).

| eval set | what it measures | Overall F1 |
|---|---|---:|
| **KDPII test** (2,252) | conversational Korean (in-domain) | **0.943** |
| **Held-out document domains** (insurance, government) | unseen domains | **0.995** |
| **KLUE-NER `person`** | real Korean **news** text | **0.866** (recall 0.92) |

### KDPII per-class (conversational, in-domain)
| label | F1 | | label | F1 |
|---|---:|---|---|---:|
| `private_email` | 1.000 | | `private_person` | 0.909 |
| `private_url` | 1.000 | | `private_address` | 0.922 |
| `ip_address` | 1.000 | | `account_number` | 0.979 |
| `private_date` | 0.980 | | `personal_handle` | 0.863 |
| `private_phone` | 0.993 | | | |
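
As a rough illustration of the metric, exact character-span F1 can be computed by treating gold and predicted spans as sets of `(label, start, end)` triples, where only identical triples count as matches. The sketch below assumes micro-averaging over all spans. The function name `span_f1` is hypothetical, and this is not the evaluation script behind the tables above.

```python
from typing import Iterable, Tuple

SpanKey = Tuple[str, int, int]  # (label, start, end), end exclusive


def span_f1(gold: Iterable[SpanKey], pred: Iterable[SpanKey]):
    """Return (precision, recall, f1) under exact span matching."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # a hit requires label AND both offsets to match
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up spans: the phone prediction is one character short, so it is a miss.
gold = [("private_person", 0, 3), ("private_phone", 10, 23)]
pred = [("private_person", 0, 3), ("private_phone", 10, 22)]
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Under exact matching, a boundary error of a single character costs both a false positive and a false negative, which is why trailing-particle trimming affects the reported scores.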


## Quick Start

### Install

```bash
pip install "transformers>=4.40" torch safetensors
```

### Load

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "FrameByFrame/korean-pii-e5-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()
if torch.cuda.is_available():
    model.cuda()
```

### Inference

The model emits per-token BIOES labels. The helper decodes them into character-offset
spans and applies light, deterministic **span normalization**: it trims surrounding
whitespace and punctuation, then strips trailing Korean particles and honorific suffixes
from a span (e.g. `민수씨` → `민수`, `송파구에` → `송파구`). The benchmark numbers above
include this normalization.

```python
import re

# Trailing particles, copula endings, and honorific suffixes; longer forms come first
_TRAILING_JOSA = ["이에요","이라고","입니다","이야","이랑","한테","에게","으로","이가","이는",
                  "에서","이고","예요","씨","님","이","가","은","는","을","를","야","아","에","의","랑","께","고"]
_DATE_END = re.compile(r".*(?:일|[0-9])", re.S)

def _normalize(text, label, s, e):
    while s < e and text[s] in " .,\t\n": s += 1
    while e > s and text[e-1] in " .,\t\n": e -= 1
    if label == "private_date":
        m = _DATE_END.match(text[s:e])
        if m and m.end() > 0: e = s + m.end()
    elif label in ("private_person", "personal_handle", "private_address"):
        for _ in range(2):
            seg = text[s:e]
            for j in _TRAILING_JOSA:
                if seg.endswith(j) and (e - s) - len(j) >= 2:
                    e -= len(j); break
            else:
                break
    return s, e

def extract_pii(text: str, max_length: int = 256):
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**{k: v.to(model.device) for k, v in enc.items()}).logits
    pred = logits.argmax(-1)[0].tolist()
    id2label = model.config.id2label

    spans, active = [], None  # active = [label, start, end]
    for i, lid in enumerate(pred):
        label = id2label[int(lid)]
        cs, ce = offsets[i]
        if cs == ce:  # special token
            if active: spans.append(active); active = None
            continue
        if label == "O":
            if active: spans.append(active); active = None
            continue
        prefix, cat = label.split("-", 1)
        if prefix in ("B", "S") or not active or active[0] != cat:
            if active: spans.append(active)
            active = [cat, cs, ce]
        else:
            active[2] = ce
    if active: spans.append(active)

    out = []
    for cat, s, e in spans:
        s, e = _normalize(text, cat, s, e)
        if text[s:e].strip():
            out.append({"label": cat, "start": s, "end": e, "text": text[s:e]})
    return out
```
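
To see the decoding step in isolation, the following sketch applies the same merge rule as `extract_pii` to hand-written labels and offsets, with no model involved. The tokens, offsets, and exact label strings (`B-private_person` and so on) are assumptions made for illustration. The real label names come from `model.config.id2label`.

```python
def decode_bioes(labels, offsets):
    """Merge per-token BIOES labels into (category, start, end) spans."""
    spans, active = [], None  # active = [category, start, end]
    for label, (cs, ce) in zip(labels, offsets):
        if cs == ce or label == "O":  # special token or non-PII token closes a span
            if active:
                spans.append(tuple(active))
                active = None
            continue
        prefix, cat = label.split("-", 1)
        if prefix in ("B", "S") or not active or active[0] != cat:
            if active:
                spans.append(tuple(active))
            active = [cat, cs, ce]  # start a new span
        else:
            active[2] = ce  # I- or E- token extends the open span
    if active:
        spans.append(tuple(active))
    return spans


# Hand-made tokenization of a toy sentence (not real tokenizer output).
text = "Call John Smith at 010-1234-5678"
offsets = [(0, 0), (0, 4), (5, 9), (10, 15), (16, 18), (19, 32), (0, 0)]
labels = ["O", "O", "B-private_person", "E-private_person",
          "O", "S-private_phone", "O"]

spans = decode_bioes(labels, offsets)
print([(cat, text[s:e]) for cat, s, e in spans])
# [('private_person', 'John Smith'), ('private_phone', '010-1234-5678')]
```

Note that the rule is lenient: an `I-` or `E-` token with no open span simply starts a new one, so malformed tag sequences still yield spans rather than being dropped.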

### Redaction

```python
def redact(text: str) -> str:
    spans = sorted(extract_pii(text), key=lambda s: s["start"], reverse=True)
    for s in spans:
        text = text[:s["start"]] + f"[{s['label'].upper()}]" + text[s["end"]:]
    return text

print(redact("김민수님의 번호는 010-1234-5678입니다."))
# [PRIVATE_PERSON]님의 번호는 [PRIVATE_PHONE]입니다.
```

## Output Schema

| field | description |
|---|---|
| `label` | one of the 9 categories above |
| `start` | character offset start (inclusive) |
| `end` | character offset end (exclusive) |
| `text` | the matched substring |
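
Because `start` and `end` are offsets into the string that was passed in, spans found in one piece of a longer document can be mapped back by adding that piece's starting offset. This matters because `extract_pii` truncates its input at `max_length` tokens. The sketch below splits on line breaks and shifts the offsets. `extract_by_line` is a hypothetical helper, and it assumes that each line fits within the token limit and that no span crosses a line break.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]


def extract_by_line(text: str, extract: Callable[[str], List[Span]]) -> List[Span]:
    """Run `extract` on each line and re-base spans onto the full text."""
    out: List[Span] = []
    base = 0  # character offset of the current line within `text`
    for line in text.splitlines(keepends=True):
        for sp in extract(line):
            start, end = base + sp["start"], base + sp["end"]
            out.append({"label": sp["label"], "start": start,
                        "end": end, "text": text[start:end]})
        base += len(line)  # keepends=True keeps the offsets aligned
    return out
```

With the helper from the Inference section, usage would look like `extract_by_line(document, extract_pii)`. Every returned span satisfies `document[span["start"]:span["end"]] == span["text"]`.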

## Training Details

| | |
|---|---|
| **Base model** | [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (XLM-RoBERTa, ~278M) |
| **Task** | token classification, BIOES (9 PII classes → 37 labels) |
| **Method** | full fine-tune (token head randomly initialized; encoder fully trained) |
| **Datasets** | **Multi-domain Korean mix**: KDPII (conversational, CC BY 4.0) + KLUE-NER person spans (news) + LLM-generated multi-domain documents (medical, legal, finance, e-commerce, HR, real-estate, social, gaming, IT, telecom, education, travel, delivery, email) with placeholder-filled PII + distribution-matched synthetic PII. All PII is synthetic/generated, never real. |
| **Split** | KDPII test held out (seed 42); 2 document domains (insurance, government) fully held out for unseen-domain eval; KLUE-val held out |
| **Optimizer** | AdamW, lr 3e-5, linear schedule, warmup 0.05 |
| **Batch / seq** | 32 per device, max_length 256 |
| **Epochs** | 3, best checkpoint by `eval_span_f1` |
| **Precision** | bf16 |
| **Hardware** | 1× NVIDIA RTX A5000 |
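
The label count follows from BIOES tagging: four positional prefixes per class plus a single `O` label gives 9 × 4 + 1 = 37. The sketch below builds such a label set and restates the table's hyperparameters as a plain dictionary. The key names mirror Hugging Face `TrainingArguments` fields, but whether the model was trained with `Trainer`, the reading of "warmup 0.05" as a warmup ratio, and the label ordering are all assumptions. The authoritative mapping is `model.config.id2label`.

```python
CATEGORIES = [
    "private_person", "private_address", "private_phone", "private_email",
    "private_date", "private_url", "account_number", "personal_handle",
    "ip_address",
]

# B(egin), I(nside), E(nd), S(ingle) for every category, plus "O" for non-PII.
# The ordering here is illustrative only.
labels = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in "BIES"]
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # 37

# Values copied from the table above; key names are an assumed mapping.
hyperparams = {
    "learning_rate": 3e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.05,               # assumption: "warmup 0.05" is a ratio
    "per_device_train_batch_size": 32,
    "num_train_epochs": 3,
    "bf16": True,
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_span_f1",
}
```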

## Known Limitations

- **`personal_handle` (~0.86 in-domain)** is the weakest class: handles are open-vocabulary
  (arbitrary usernames) and overlap with names, so this score is likely near its practical ceiling.
- **Held-out document-domain F1 (0.995) is optimistic.** Those domains are unseen, but they share
  the *generator/entity distribution* of the synthetic training data. The score shows domain-content
  transfer, not guaranteed robustness on real-world text. Treat real-world performance as bounded
  by the KDPII (0.94, real conversational) and KLUE-news (0.87, real news) numbers.
- **Evaluate on your own domain before high-stakes use.** Coverage is broad but not exhaustive;
  Korean PII annotation conventions vary by source.
- **Structured PII** (phone/email/url/ip/account/RRN) is best paired with a regex/checksum
  validator in production for guaranteed precision.
- The `extract_pii` helper applies span normalization; if you decode logits yourself, apply
  equivalent trimming to reproduce the reported numbers.
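
As one way to pair the model with a validator for structured PII, the sketch below checks predicted spans against simple format rules. All names and patterns here are hypothetical and deliberately narrow: the phone pattern only covers mobile numbers shaped like `010-1234-5678`, the email pattern is a loose shape check, and the Luhn checksum applies only to payment-card numbers, not to bank accounts or other `account_number` values.

```python
import ipaddress
import re

# Assumed patterns for illustration; real deployments need broader coverage.
_PHONE_RE = re.compile(r"^01[016789][-\s]?\d{3,4}[-\s]?\d{4}$")
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def luhn_ok(value: str) -> bool:
    """Luhn checksum, used by payment-card numbers."""
    digits = [int(c) for c in value if c.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def is_plausible(span: dict) -> bool:
    """Return False when a structured span fails its format check."""
    label, value = span["label"], str(span["text"]).strip()
    if label == "private_phone":
        return bool(_PHONE_RE.match(value))
    if label == "private_email":
        return bool(_EMAIL_RE.match(value))
    if label == "ip_address":
        try:
            ipaddress.ip_address(value)
            return True
        except ValueError:
            return False
    return True  # no validator for this label; keep the model's decision
```

A span that fails its check could be dropped or routed to review, depending on whether a missed redaction or a spurious one is more costly in the target application.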

## License

MIT, inherited from the base [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) (MIT). Training data includes KDPII (CC BY 4.0).

## Citation

```bibtex
@misc{framebyframe-korean-pii-e5-base-2026,
  title  = {Korean PII (multilingual-e5-base): token classification for Korean PII},
  author = {Mariappan, Vijayachandran},
  year   = {2026},
  url    = {https://huggingface.co/FrameByFrame/korean-pii-e5-base}
}
```

## Contact

For inquiries, please contact vijay@artelligence.ai