---
license: mit
task_categories:
- token-classification
- named-entity-recognition
tags:
- korean
- pii
- privacy
- masking
- bert
language:
- ko
pipeline_tag: token-classification
---
# Korean PII Masking BERT
A BERT-based token classification model for masking personally identifiable information (PII) in Korean text.
## Model Description
์ด ๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ์—์„œ ๊ฐœ์ธ์ •๋ณด๋ฅผ ์ž๋™์œผ๋กœ ๊ฐ์ง€ํ•˜๊ณ  ๋งˆ์Šคํ‚นํ•˜๋Š” ์šฉ๋„๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. BERT ๊ธฐ๋ฐ˜ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 14๊ฐ€์ง€ ์œ ํ˜•์˜ ํ•œ๊ตญ์–ด PII๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.
## ๋ชจ๋ธ ์„ธ๋ถ€ ์ •๋ณด
- **์•„ํ‚คํ…์ฒ˜**: BertForTokenClassification
- **๊ธฐ๋ณธ ๋ชจ๋ธ**: BERT (Korean)
- **Hidden Size**: 1024
- **Num Hidden Layers**: 24
- **Num Attention Heads**: 16
- **Max Position Embeddings**: 300
- **Vocab Size**: 30,000
## Supported PII Types
๋ชจ๋ธ์€ ๋‹ค์Œ 14๊ฐ€์ง€ PII ์œ ํ˜•์„ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค:
1. **๊ฐ€๋งน์ ๋ช…** (Business Name)
2. **๊ฒฐ์ œ๊ธˆ์•ก** (Payment Amount)
3. **๊ณ„์ขŒ๋ฒˆํ˜ธ** (Account Number)
4. **๋กœ๊ทธ์ธID** (Login ID)
5. **์ƒ์„ธ์ฃผ์†Œ** (Detailed Address)
6. **์‹ ์šฉ์ ์ˆ˜** (Credit Score)
7. **์—ฌ๊ถŒ๋ฒˆํ˜ธ** (Passport Number)
8. **์šฐํŽธ๋ฒˆํ˜ธ** (Postal Code)
9. **์šด์ „๋ฉดํ—ˆ๋ฒˆํ˜ธ** (Driver's License Number)
10. **์ด๋ฆ„** (Name)
11. **์ „์ž๋ฉ”์ผ** (Email)
12. **์ „ํ™”๋ฒˆํ˜ธ** (Phone Number)
13. **์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ** (Resident Registration Number)
14. **์นด๋“œ๋ฒˆํ˜ธ** (Card Number)
15. **ํœด๋Œ€์ „ํ™”๋ฒˆํ˜ธ** (Mobile Phone Number)
๊ฐ PII๋Š” BIO ํƒœ๊น… ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (B-, I-, O).
## Usage
### Basic Usage
```python
from transformers import BertForTokenClassification, BertTokenizer
import torch

# Load the model and tokenizer
model = BertForTokenClassification.from_pretrained("your-username/korean-pii-masking-bert")
tokenizer = BertTokenizer.from_pretrained("your-username/korean-pii-masking-bert")

# Tokenize the text (the model accepts at most 300 tokens)
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=300)

# Predict one label ID per token
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]
```
### ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•œ ์‚ฌ์šฉ
์›๋ณธ ์ €์žฅ์†Œ์˜ `inference_pipeline.py`๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋” ๊ฐ„ํŽธํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
```python
from inference_pipeline import PIIInferencePipeline

# Initialize the pipeline
pipeline = PIIInferencePipeline()

# Run prediction on the text
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
result = pipeline.predict(text)
print(f"Original text: {result.original_text}")
print(f"Masked text: {result.masked_text}")
print(f"PII entities found: {len(result.entities)}")
```
## Example
```
Input: "8월 10일 14:32에 백다방 코엑스점에서 9,910원 승인 내역 확인됩니다."
Output:
- Detected PII:
  - 백다방 코엑스점 -> [가맹점명]
  - 9,910원 -> [결제금액]
```
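As a rough illustration of the masking step, the sketch below replaces each detected entity with its bracketed label, using the two entities from the example above. The `mask_text` helper and the `(start, end, label)` entity format are assumptions made for this sketch; the repository's `PIIInferencePipeline` may represent entities and build the masked text differently.

```python
from typing import List, Tuple

# Hypothetical entity format: (start_char, end_char_exclusive, label)
Entity = Tuple[int, int, str]


def mask_text(text: str, entities: List[Entity]) -> str:
    """Replace each entity's character span with "[label]".

    Assumes the spans do not overlap.
    """
    masked = text
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "8월 10일 14:32에 백다방 코엑스점에서 9,910원 승인 내역 확인됩니다."
entities = []
for surface, label in [("백다방 코엑스점", "가맹점명"), ("9,910원", "결제금액")]:
    start = text.index(surface)
    entities.append((start, start + len(surface), label))

print(mask_text(text, entities))
# → 8월 10일 14:32에 [가맹점명]에서 [결제금액] 승인 내역 확인됩니다.
```

Working from the end of the string backwards avoids recomputing offsets after each replacement changes the text length.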
## ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ
๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์œผ๋ฉฐ, ์ตœ๋Œ€ ๊ธธ์ด๋Š” 300 ํ† ํฐ์ž…๋‹ˆ๋‹ค.
## Limitations
- Maximum input length: 300 tokens
- Optimized for Korean text
- Specialized for PII recognition in text (images and audio are not supported)
## References
This model was trained for Korean PII masking.
## License
MIT License
## Author
Korean PII Masking Project