Korean PII NER v3 (klue/roberta-large fine-tuned)
An NER model for Korean PII (Personally Identifiable Information) guardrails. It tags three entity types (NAME, ADDRESS, ORG) using a 7-label BIO scheme.
Quick Start
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("vmaca123/korean-pii-ner-v3")
model = AutoModelForTokenClassification.from_pretrained("vmaca123/korean-pii-ner-v3")
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Input (English gloss): "I am Hong Gil-dong and I live at 152 Teheran-ro, Gangnam-gu, Seoul. I work for Samsung Electronics."
print(ner("저는 홍길동이고 서울시 강남구 테헤란로 152에 거주합니다. 삼성전자 소속입니다."))
# [{'entity_group': 'NAME', 'start': 6, 'end': 9, 'word': '홍길동', ...},
# {'entity_group': 'ADDRESS', 'start': 13, 'end': 29, 'word': '서울시 강남구 테헤란로 152', ...},
# {'entity_group': 'ORG', 'start': 35, 'end': 39, 'word': '삼성전자', ...}]
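For guardrail use, the spans returned by the pipeline can be turned into a masked string. The following is a minimal sketch, not code from this model card: `mask_entities` is a hypothetical helper, and it assumes only that each result is a dict with `entity_group`, `start`, and `end` keys, as in the output above. The spans in the example are written by hand so the snippet runs without downloading the model.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Hypothetical helper: assumes pipeline-style dicts with
    'entity_group', 'start', and 'end' (character offsets into text).
    """
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written spans for illustration; real offsets would come from ner(text).
text = "저는 홍길동이고 삼성전자 소속입니다."
spans = [
    {"entity_group": "NAME", "start": 3, "end": 6},
    {"entity_group": "ORG", "start": 9, "end": 13},
]
print(mask_entities(text, spans))  # 저는 [NAME]이고 [ORG] 소속입니다.
```

Replacing from the rightmost span first avoids having to recompute offsets after each substitution.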
Labels (BIO, 7 classes)
| ID | Label | Maps to v0.2 EntityType |
|---|---|---|
| 0 | O | (non-entity) |
| 1 | B-NAME | PERSON_NAME |
| 2 | I-NAME | PERSON_NAME |
| 3 | B-ADDRESS | ADDRESS_FULL |
| 4 | I-ADDRESS | ADDRESS_FULL |
| 5 | B-ORG | ORGANIZATION |
| 6 | I-ORG | ORGANIZATION |
Structured PII such as PHONE, EMAIL, RRN, and CREDIT_CARD is outside this model's scope; regex and dictionary components are responsible for it.
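The label table can be written down as two small lookup tables. This is a sketch that copies the IDs and type names from the table above; the dictionary and function names (`ID2LABEL`, `ENTITY_TYPE`, `to_entity_type`) are illustrative and are not taken from the project's code.

```python
from typing import Optional

# Label IDs copied from the table above.
ID2LABEL = {
    0: "O",
    1: "B-NAME",
    2: "I-NAME",
    3: "B-ADDRESS",
    4: "I-ADDRESS",
    5: "B-ORG",
    6: "I-ORG",
}

# Names from the "Maps to v0.2 EntityType" column.
ENTITY_TYPE = {
    "NAME": "PERSON_NAME",
    "ADDRESS": "ADDRESS_FULL",
    "ORG": "ORGANIZATION",
}


def to_entity_type(label: str) -> Optional[str]:
    """Map a BIO label ('B-NAME') or a bare group ('NAME') to its EntityType."""
    if label == "O":
        return None
    group = label.split("-", 1)[-1]  # strips a leading 'B-' or 'I-' if present
    return ENTITY_TYPE[group]


print(to_entity_type(ID2LABEL[3]))  # ADDRESS_FULL
```

Accepting both prefixed labels and bare groups lets the same function handle raw token predictions and aggregated pipeline output.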
Training data
| Source | Count | License |
|---|---|---|
| KLUE-NER train | 21,008 | CC-BY-SA |
| Faker-ko baseline (real admin divisions) | 10,000 | self-generated |
| Faker conjunctive composite | 2,000 | self-generated |
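| Hard negatives (common words such as 하늘 "sky" and 사랑 "love"; main-line numbers; example numbers) | 1,000 | self-generated |
| Total train pool | 34,008 | |
Data split: 8:1:1 (train 27,206 / val 3,401 / test 3,401). The 5,000 KLUE-NER validation sentences are excluded from training and used only for external evaluation.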
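The split sizes follow directly from the 34,008-sentence pool. The sketch below reproduces the arithmetic with a plain shuffled split; the card does not say how the split was actually drawn, so the seed and the use of simple random shuffling are assumptions, and `split_811` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def split_811(items: Sequence, seed: int = 42) -> Tuple[List, List, List]:
    """Shuffle and split into roughly 80% / 10% / 10% (assumed procedure)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_val = round(n * 0.1)
    n_test = round(n * 0.1)
    n_train = n - n_val - n_test  # train takes whatever remains
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_811(range(34_008))
print(len(train), len(val), len(test))  # 27206 3401 3401
```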
Training details
- Base: `klue/roberta-large` (335M params)
- Phase 1: encoder freeze, classifier head 1 epoch, LR 5e-4
- Phase 2: full unfreeze, 5 epochs, LR 2e-5, warmup ratio 0.1, weight decay 0.01
- Batch: 16, max_length 128, fp16
- Hardware: RTX 3090 24GB (Vast.ai)
- Wall clock: ~30 minutes
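To make the Phase 2 hyperparameters concrete, the step counts and learning-rate curve can be worked out from the figures above. This is a sketch under stated assumptions: it uses the 27,206-sentence train split, assumes no gradient accumulation and that the last partial batch is kept, and assumes linear decay after warmup, which the card does not specify.

```python
import math

# Values quoted from the card.
TRAIN_SIZE = 27_206
BATCH_SIZE = 16
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_RATIO = 0.1

# Assumptions: no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)  # 1701 8505 851
```

Under these assumptions the learning rate peaks about half an epoch into Phase 2 and then falls steadily for the remaining steps.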
Evaluation results
| Eval set | macro-F1 | micro-F1 | size |
|---|---|---|---|
| Internal val | 0.872 | 0.880 | 3,401 |
| Internal test | 0.878 | 0.887 | 3,401 |
| KLUE-NER val (external) | 0.766 | 0.792 | 5,000 |
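The gap between macro-F1 and micro-F1 comes from how per-type results are combined: macro averages the per-type F1 scores, while micro pools the counts first. The sketch below illustrates this with made-up counts; the numbers are hypothetical and are not the model's actual confusion counts, and the card does not state whether its scores are entity-level or token-level.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts per entity type, for illustration only.
counts: Dict[str, Tuple[int, int, int]] = {
    "NAME": (90, 10, 10),
    "ADDRESS": (40, 10, 30),
    "ORG": (70, 20, 10),
}

# Macro: average the per-type F1 scores, so each type weighs the same.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, so frequent types dominate.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(round(macro_f1, 3), round(micro_f1, 3))  # 0.797 0.816
```

In this toy case the weaker, less frequent ADDRESS type pulls macro-F1 below micro-F1, the same direction as in the table above.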
Iteration comparison (6 training runs)
| Run | Base | Data | Internal test | KLUE val |
|---|---|---|---|---|
| 1 | bert-base, 2ep | 31k | 0.776 | 0.630 |
| 2 | bert-base, 5ep | 31k | 0.798 | 0.669 |
| 3 | roberta-base, 5ep | 31k | 0.830 | 0.697 |
| 4 (v1) | roberta-large, 5ep | 31k | 0.865 | 0.764 |
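| 5 (v2) | roberta-large + Naver/WikiAnn | 139k | 0.708 | 0.664 |
| 6 (v3) | roberta-large + augment | 34k | 0.878 | 0.766 |
Lesson from the v2 failure: merging Naver NER (90k) and WikiAnn (20k) lowered KLUE val by 10 percentage points relative to v1 (0.764 to 0.664). The causes were noise from converting eojeol-level (space-delimited word) labels to character-level labels, plus multi-source distribution shift. v3 takes a conservative approach: it keeps the v1 setup unchanged and adds only the composite and hard-negative augmentation.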
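One plausible way such conversion noise arises is sketched below. It is an illustration, not the author's analysis of the actual datasets: when a corpus carries one BIO tag per eojeol and that tag is spread naively over every character, particles attached to a name inherit the entity label. The function name `eojeol_to_char_labels` is hypothetical.

```python
from typing import List


def eojeol_to_char_labels(eojeols: List[str], tags: List[str]) -> List[str]:
    """Naively spread one BIO tag per eojeol over all of its characters."""
    labels: List[str] = []
    for i, (word, tag) in enumerate(zip(eojeols, tags)):
        if i > 0:
            # The separating space continues an entity only before an I- tag.
            labels.append(tag if tag.startswith("I-") else "O")
        for j in range(len(word)):
            if tag == "O":
                labels.append("O")
            elif j == 0 and tag.startswith("B-"):
                labels.append(tag)
            else:
                labels.append("I-" + tag[2:])
    return labels


# "저는 홍길동이고": the name is 홍길동, and 이고 is an attached connective.
labels = eojeol_to_char_labels(["저는", "홍길동이고"], ["O", "B-NAME"])
print(labels)
# ['O', 'O', 'O', 'B-NAME', 'I-NAME', 'I-NAME', 'I-NAME', 'I-NAME']
```

The last two `I-NAME` labels cover 이고, which is not part of the name; a character-level gold annotation would end the span after 홍길동.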
Sample outputs
| Input | NER output |
|---|---|
| "์๋ ํ์ธ์, ์ ๋ ํ๊ธธ๋์ด๊ณ ์์ธ์ ๊ฐ๋จ๊ตฌ ํ ํค๋๋ก 152์ ๊ฑฐ์ฃผํฉ๋๋ค. ์ผ์ฑ์ ์ ์์์ ๋๋ค." | NAME=ํ๊ธธ๋, ADDR=์์ธ์ ๊ฐ๋จ๊ตฌ ํ ํค๋๋ก 152, ORG=์ผ์ฑ์ ์ |
| "์ ๋ ๋ฐ์ ํฌ์ด๊ณ ๋ถ์ฐ๊ด์ญ์ ํด์ด๋๊ตฌ์ ๊ฑฐ์ฃผํ๋ฉฐ LG์ ์ ์์์ ๋๋ค." | NAME + ADDR + ORG |
| "์ค๋ ํ๋์ด ๋ง๋ค์." | (no spans) โ hard negative |
| "์ฌ๋์ ์ค์ํ ๊ฐ์น์ ๋๋ค." | (no spans) โ hard negative |
| "์์ ์ ํ๋ฒํธ๋ 010-0000-0000์ ๋๋ค." | (no spans) โ NER scope ๋ฐ |
Limitations & known issues
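- Conjunctive patterns: cases such as `{name}이고 {address}` ("{name} and {address}") were reinforced with training data in v3, but boundary errors remain possible in some variants (for example, a job title adjacent to an honorific).
- ADDRESS_UNIT not supported: units such as "101동 1203호" (building 101, unit 1203) are split off in dictionary post-processing; the NER model emits only ADDRESS_FULL.
- No SCHOOL/HOSPITAL subdivision: both are emitted under the single ORG label and reclassified as SCHOOL or HOSPITAL by dictionary.
- Weak transfer to external domains: macro-F1 is 0.766 on KLUE val vs 0.878 on the internal test set. Domain-specific fine-tuning is recommended (medical, financial, legal, etc.).
- PHONE/EMAIL/RRN not supported: structured PII is the responsibility of regex.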
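The ADDRESS_UNIT limitation can be illustrated with a small post-processing step that separates a trailing unit from an ADDRESS_FULL span. This sketch substitutes a regular expression for the project's dictionary post-processing, whose implementation is not shown in this card; `split_address_unit` and the pattern are assumptions and cover only the `N동 N호` shape.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: an optional "<digits>동" followed by "<digits>호" at the end.
UNIT_RE = re.compile(r"\s*((?:\d+동\s*)?\d+호)$")


def split_address_unit(address: str) -> Tuple[str, Optional[str]]:
    """Split a trailing unit such as '101동 1203호' from an address string."""
    match = UNIT_RE.search(address)
    if match is None:
        return address, None
    return address[: match.start()], match.group(1)


print(split_address_unit("서울시 강남구 테헤란로 152 101동 1203호"))
# ('서울시 강남구 테헤란로 152', '101동 1203호')
```

Requiring digits before 동 keeps neighborhood names such as 역삼동 from being treated as a building number, though numbered neighborhoods would still need separate handling.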
Intended use
- Detecting PII candidates (PERSON_NAME / ADDRESS_FULL / ORGANIZATION) in Korean text
- NER detector module of the v0.2 guardrail pipeline (Korean PII Guardrail v0.2)
- Production: use together with regex, dictionary, context scorer, and boundary corrector components (NER alone falls short of the 99% target)
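A minimal sketch of how regex detectors might sit alongside the NER output is shown below. The patterns, the names `STRUCTURED_PATTERNS`, `detect_structured`, and `merge_spans`, and the rule that regex wins on overlap are all assumptions for illustration; they are not the Korean PII Guardrail implementation, and the context scorer and boundary corrector are not modeled.

```python
import re
from typing import Dict, List

# Illustrative patterns only; production validators would be stricter.
STRUCTURED_PATTERNS = {
    "PHONE": re.compile(r"01[016789]-\d{3,4}-\d{4}"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "RRN": re.compile(r"\d{6}-\d{7}"),
}


def detect_structured(text: str) -> List[Dict]:
    """Return pipeline-style span dicts for structured PII found by regex."""
    spans = []
    for label, pattern in STRUCTURED_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append({"entity_group": label, "start": m.start(),
                          "end": m.end(), "word": m.group()})
    return spans


def merge_spans(ner_spans: List[Dict], regex_spans: List[Dict]) -> List[Dict]:
    """Combine both sources; an NER span overlapping a regex span is dropped."""
    def overlaps(a: Dict, b: Dict) -> bool:
        return a["start"] < b["end"] and b["start"] < a["end"]

    kept = [s for s in ner_spans
            if not any(overlaps(s, r) for r in regex_spans)]
    return sorted(kept + regex_spans, key=lambda s: s["start"])


text = "예시 전화번호는 010-0000-0000입니다."
found = merge_spans([], detect_structured(text))
print([(s["entity_group"], s["word"]) for s in found])
# [('PHONE', '010-0000-0000')]
```

Giving regex matches priority reflects the idea that format-validated patterns are more reliable than model predictions for structured identifiers; a real pipeline might score the two sources instead.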
Out of scope
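- Structured PII (phone numbers, email addresses, resident registration numbers, etc.): use regex/validators instead
- Context tracking in multi-turn or RAG settings: outside the single-turn scope of v0.2
- Domain-specific text such as medical charts and legal documents (performance degradation expected)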
License
CC-BY-SA-4.0 (inherited from the KLUE base model license).
Citation
@misc{kimminwoo2026koreanpiinerv3,
title={Korean PII NER v3: klue/roberta-large fine-tuned for PII guardrails},
author={Kim, Minwoo},
year={2026},
url={https://huggingface.co/vmaca123/korean-pii-ner-v3}
}
Companion project
This model is the NER detector module of the Korean PII Guardrail v0.2 project.
- Wrapper code: `PII/ner/ner_wrapper.py` (conforms to the v0.2 `BaseNERDetector` Protocol)
- Design doc: `korean_pii_guardrail_v0_2/docs/14_NER_DESIGN_v1.md`
- Training results: `PII/ner/TRAINING_RESULTS_v3.md`