---
license: mit
task_categories:
- token-classification
- named-entity-recognition
tags:
- korean
- pii
- privacy
- masking
- bert
language:
- ko
pipeline_tag: token-classification
---

# Korean PII Masking BERT

A BERT-based token classification model for masking personally identifiable information (PII) in Korean text.

## Model Description

This model automatically detects and masks personal information in Korean text. It uses a BERT-based architecture to identify 15 types of Korean PII.

## Model Details

- **Architecture**: BertForTokenClassification
- **Base Model**: BERT (Korean)
- **Hidden Size**: 1024
- **Num Hidden Layers**: 24
- **Num Attention Heads**: 16
- **Max Position Embeddings**: 300
- **Vocab Size**: 30,000

## Supported PII Types

The model recognizes the following 15 PII types:

1. **๊ฐ€๋งน์ ๋ช…** (Business Name)
2. **๊ฒฐ์ œ๊ธˆ์•ก** (Payment Amount)
3. **๊ณ„์ขŒ๋ฒˆํ˜ธ** (Account Number)
4. **๋กœ๊ทธ์ธID** (Login ID)
5. **์ƒ์„ธ์ฃผ์†Œ** (Detailed Address)
6. **์‹ ์šฉ์ ์ˆ˜** (Credit Score)
7. **์—ฌ๊ถŒ๋ฒˆํ˜ธ** (Passport Number)
8. **์šฐํŽธ๋ฒˆํ˜ธ** (Postal Code)
9. **์šด์ „๋ฉดํ—ˆ๋ฒˆํ˜ธ** (Driver's License Number)
10. **์ด๋ฆ„** (Name)
11. **์ „์ž๋ฉ”์ผ** (Email)
12. **์ „ํ™”๋ฒˆํ˜ธ** (Phone Number)
13. **์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ** (Resident Registration Number)
14. **์นด๋“œ๋ฒˆํ˜ธ** (Card Number)
15. **ํœด๋Œ€์ „ํ™”๋ฒˆํ˜ธ** (Mobile Phone Number)

Each PII type is labeled with the BIO tagging scheme (B-, I-, O).
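To make the BIO scheme concrete, the sketch below builds a label set from entity type names and groups a tag sequence into entity spans. The label strings (for example `B-이름`) and the helper names `build_bio_labels` and `bio_to_spans` are assumptions for illustration; the label names the model actually uses should be read from its `id2label` mapping.

```python
from typing import List, Tuple

def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for name in entity_types:
        labels.extend([f"B-{name}", f"I-{name}"])
    return labels

def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == current_type
        if current_type is not None and not continues:
            # The open span ends at the previous token.
            spans.append((current_type, start, i))
            current_type = None
        if prefix == "B" or (prefix == "I" and current_type is None):
            # A stray I- tag without a preceding B- is treated as a new span.
            current_type, start = name, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans

tags = ["O", "B-이름", "I-이름", "O", "B-전화번호"]
print(bio_to_spans(tags))  # [('이름', 1, 3), ('전화번호', 4, 5)]
```

With N entity types this construction yields 2N + 1 labels: one `B-` and one `I-` tag per type, plus `O`.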

## Usage

### Basic Usage

```python
from transformers import BertForTokenClassification, BertTokenizer
import torch

# Load the model and tokenizer
model = BertForTokenClassification.from_pretrained("your-username/korean-pii-masking-bert")
tokenizer = BertTokenizer.from_pretrained("your-username/korean-pii-masking-bert")

# Tokenize the text (the model accepts at most 300 positions)
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=300)

# Predict a label id for every token
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_labels = torch.argmax(predictions, dim=-1)[0]
```

### Using the Pipeline

For simpler usage, use `inference_pipeline.py` from the original repository:

```python
from inference_pipeline import PIIInferencePipeline

# Initialize the pipeline
pipeline = PIIInferencePipeline()

# Run prediction on the text
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
result = pipeline.predict(text)

print(f"Original text: {result.original_text}")
print(f"Masked text: {result.masked_text}")
print(f"PII found: {len(result.entities)}")
```

## Example

```
Input: "8월 10일 14:32에 백다방 코엑스점에서 9,910원 승인 내역 확인됩니다."

Output:
- PII found:
  - 백다방 코엑스점 -> [가맹점명]
  - 9,910원 -> [결제금액]
```
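The masking step behind the example above can be sketched as replacing character spans with bracketed labels. The `mask_text` helper and the `(start, end, label)` span format are assumptions for illustration, not the interface of `inference_pipeline.py`, and the spans are assumed not to overlap.

```python
from typing import List, Tuple

def mask_text(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with "[label]".

    Spans are assumed to be non-overlapping; end is exclusive.
    """
    masked = text
    # Apply replacements right to left so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked

text = "백다방 코엑스점에서 9,910원 승인"
entities = []
for surface, label in [("백다방 코엑스점", "가맹점명"), ("9,910원", "결제금액")]:
    start = text.find(surface)
    entities.append((start, start + len(surface), label))
print(mask_text(text, entities))  # [가맹점명]에서 [결제금액] 승인
```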

## Data Preprocessing

The model takes Korean text as input, with a maximum length of 300 tokens.

## Limitations

- Maximum input length: 300 tokens
- Optimized for Korean text
- Specialized for PII recognition in text (images and audio are not supported)

## References

This model was trained for Korean PII masking.

## License

MIT License

## Author

Korean PII Masking Project