metadata
license: mit
task_categories:
  - token-classification
  - named-entity-recognition
tags:
  - korean
  - pii
  - privacy
  - masking
  - bert
language:
  - ko
pipeline_tag: token-classification

Korean PII Masking BERT

A BERT-based token classification model for masking personally identifiable information (PII) in Korean text.

Model Description

This model automatically detects and masks personal information in Korean text. It uses a BERT-based architecture to identify 15 types of Korean PII.

Model Details

  • Architecture: BertForTokenClassification
  • Base model: BERT (Korean)
  • Hidden Size: 1024
  • Num Hidden Layers: 24
  • Num Attention Heads: 16
  • Max Position Embeddings: 300
  • Vocab Size: 30,000

์ง€์›ํ•˜๋Š” PII ์œ ํ˜•

๋ชจ๋ธ์€ ๋‹ค์Œ 14๊ฐ€์ง€ PII ์œ ํ˜•์„ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค:

  1. ๊ฐ€๋งน์ ๋ช… (Business Name)
  2. ๊ฒฐ์ œ๊ธˆ์•ก (Payment Amount)
  3. ๊ณ„์ขŒ๋ฒˆํ˜ธ (Account Number)
  4. ๋กœ๊ทธ์ธID (Login ID)
  5. ์ƒ์„ธ์ฃผ์†Œ (Detailed Address)
  6. ์‹ ์šฉ์ ์ˆ˜ (Credit Score)
  7. ์—ฌ๊ถŒ๋ฒˆํ˜ธ (Passport Number)
  8. ์šฐํŽธ๋ฒˆํ˜ธ (Postal Code)
  9. ์šด์ „๋ฉดํ—ˆ๋ฒˆํ˜ธ (Driver's License Number)
  10. ์ด๋ฆ„ (Name)
  11. ์ „์ž๋ฉ”์ผ (Email)
  12. ์ „ํ™”๋ฒˆํ˜ธ (Phone Number)
  13. ์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ (Resident Registration Number)
  14. ์นด๋“œ๋ฒˆํ˜ธ (Card Number)
  15. ํœด๋Œ€์ „ํ™”๋ฒˆํ˜ธ (Mobile Phone Number)

๊ฐ PII๋Š” BIO ํƒœ๊น… ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (B-, I-, O).
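To make the BIO scheme concrete, the sketch below groups a per-token tag sequence into entity spans. The tag strings (such as `B-이름`) and the helper name `bio_to_spans` are illustrative assumptions; the actual label names are whatever the model's configuration defines.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, label) spans over token indices."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An "I-" tag extends the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the currently open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # "B-" opens a new span; a stray "I-" is treated as a span start too.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tag sequence for six tokens: a name followed by a phone number.
tags = ["O", "B-이름", "I-이름", "O", "B-전화번호", "I-전화번호"]
print(bio_to_spans(tags))
```

Treating a stray `I-` tag as the start of a span is one lenient design choice; a stricter decoder could discard such tokens instead.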

Usage

Basic Usage

from transformers import BertForTokenClassification, BertTokenizer
import torch

# Load the model and tokenizer (replace the placeholder repository ID with the real one)
model = BertForTokenClassification.from_pretrained("your-username/korean-pii-masking-bert")
tokenizer = BertTokenizer.from_pretrained("your-username/korean-pii-masking-bert")

# Tokenize the text, truncating to the model's 300-token limit
# Sample text: "Hello, my name is Kim Min-su and my phone number is 010-1234-5678."
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=300)

# Predict one label ID per token
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_labels = torch.argmax(predictions, dim=-1)[0]
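The snippet above stops at numeric label IDs. A minimal sketch of turning them into readable token/label pairs follows; it uses toy stand-ins so it runs without downloading the model. The function name `label_tokens`, the sample tokens, and the `id2label` mapping are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Special tokens that BERT tokenizers insert and that carry no PII label.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def label_tokens(
    tokens: List[str], label_ids: List[int], id2label: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Pair each non-special token with its predicted label name."""
    return [
        (token, id2label[label_id])
        for token, label_id in zip(tokens, label_ids)
        if token not in SPECIAL_TOKENS
    ]


# Toy stand-ins (hypothetical values, not real model output).
tokens = ["[CLS]", "김", "##민", "##수", "[SEP]"]
label_ids = [0, 1, 2, 2, 0]
id2label = {0: "O", 1: "B-이름", 2: "I-이름"}

for token, label in label_tokens(tokens, label_ids, id2label):
    print(token, label)
```

With the real objects, the three inputs would come from `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())`, `predicted_labels.tolist()`, and `model.config.id2label`.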

Usage via the Pipeline

The inference_pipeline.py script from the original repository offers a simpler interface:

from inference_pipeline import PIIInferencePipeline

# Initialize the pipeline
pipeline = PIIInferencePipeline()

# Run prediction on a text
# Sample text: "Hello, my name is Kim Min-su and my phone number is 010-1234-5678."
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
result = pipeline.predict(text)

print(f"Original text: {result.original_text}")
print(f"Masked text: {result.masked_text}")
print(f"PII entities found: {len(result.entities)}")

Example

Input: "8월 10일 14:32에 백다방 코엑스점에서 9,910원 승인 내역 확인됩니다."
("An approved charge of 9,910 won at Baekdabang COEX branch at 14:32 on August 10 has been confirmed.")

Output:
- Detected PII:
  - 백다방 코엑스점 -> [가맹점명]
  - 9,910원 -> [결제금액]
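The masking step in this example can be sketched as a plain string replacement over character spans. This is an illustration under assumptions, not the original pipeline's code: the helper name `mask_text` and the `[label]` placeholder format are inferred from the output shown above, and the spans are located with `str.index` rather than taken from model predictions.

```python
from typing import List, Tuple


def mask_text(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end_exclusive, label) character span with "[label]"."""
    masked = text
    # Apply replacements from right to left so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "8월 10일 14:32에 백다방 코엑스점에서 9,910원 승인 내역 확인됩니다."

# Build character spans for the two entities from the example.
entities = []
for surface, label in [("백다방 코엑스점", "가맹점명"), ("9,910원", "결제금액")]:
    start = text.index(surface)
    entities.append((start, start + len(surface), label))

print(mask_text(text, entities))
# 8월 10일 14:32에 [가맹점명]에서 [결제금액] 승인 내역 확인됩니다.
```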

Data Preprocessing

The model takes Korean text as input, with a maximum length of 300 tokens.
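Inputs longer than 300 tokens have to be truncated or split. One possible way to split them, sketched below, is an overlapping sliding window over token IDs. The function name `chunk_token_ids`, the overlap of 50 tokens, and the reservation of two positions for `[CLS]` and `[SEP]` are assumptions; the original repository may handle long inputs differently.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int],
    max_length: int = 300,
    stride: int = 50,
    num_special: int = 2,
) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit within max_length
    once num_special special tokens are added to each window."""
    window = max_length - num_special
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window size")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        # Step forward, keeping `stride` tokens of overlap with the previous window.
        start += window - stride
    return chunks


# 700 dummy token IDs standing in for a long tokenized document.
ids = list(range(700))
chunks = chunk_token_ids(ids)
print([len(chunk) for chunk in chunks])
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in the next window; predictions in the overlapping region then need to be merged, for example by keeping the prediction from whichever window has more surrounding context.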

Limitations

  • Maximum input length: 300 tokens
  • Optimized for Korean text
  • Specialized for PII recognition in text (images and audio are not supported)

References

This model was trained for masking personal information in Korean text.

License

MIT License

Author

Korean PII Masking Project