|
|
--- |
|
|
license: mit |
|
|
task_categories: |
|
|
- token-classification

tags:

- korean

- named-entity-recognition

- pii

- privacy

- masking

- bert
|
|
language: |
|
|
- ko |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# Korean PII Masking BERT |
|
|
|
|
|
A BERT-based token classification model for masking Korean personally identifiable information (PII).
|
|
|
|
|
## Model Description
|
|
|
|
|
|
This model automatically detects and masks personally identifiable information in Korean text. It uses a BERT-based architecture to identify 15 types of Korean PII.
|
|
|
|
|
## Model Details
|
|
|
|
|
- **Architecture**: BertForTokenClassification
|
|
- **Base Model**: BERT (Korean)
|
|
- **Hidden Size**: 1024 |
|
|
- **Num Hidden Layers**: 24 |
|
|
- **Num Attention Heads**: 16 |
|
|
- **Max Position Embeddings**: 300 |
|
|
- **Vocab Size**: 30,000 |
|
|
|
|
|
## Supported PII Types
|
|
|
|
|
The model recognizes the following 15 PII types:
|
|
|
|
|
1. **Business Name**

2. **Payment Amount**

3. **Account Number**

4. **Login ID**

5. **Detailed Address**

6. **Credit Score**

7. **Passport Number**

8. **Postal Code**

9. **Driver's License Number**

10. **Name**

11. **Email**

12. **Phone Number**

13. **Resident Registration Number**

14. **Card Number**

15. **Mobile Phone Number**
|
|
|
|
|
Each PII type is tagged using the BIO scheme (B-, I-, O).
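With 15 entity types, the BIO scheme expands to 31 distinct labels: a B- (begin) and I- (inside) tag per type, plus a single O (outside) tag. A minimal sketch, using illustrative English type names (the model's actual label strings live in its config and may differ):

```python
# Illustrative English names for the 15 PII types; the model's real label
# strings (in model.config.id2label) may use the original Korean names.
pii_types = [
    "Business Name", "Payment Amount", "Account Number", "Login ID",
    "Detailed Address", "Credit Score", "Passport Number", "Postal Code",
    "Driver's License Number", "Name", "Email", "Phone Number",
    "Resident Registration Number", "Card Number", "Mobile Phone Number",
]

# BIO expansion: one B- and one I- tag per type, plus the single O tag.
labels = ["O"] + [f"{prefix}-{t}" for t in pii_types for prefix in ("B", "I")]
print(len(labels))  # 31
```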
|
|
|
|
|
## Usage
|
|
|
|
|
### Basic Usage
|
|
|
|
|
```python
from transformers import BertForTokenClassification, BertTokenizer
import torch

# Load the model and tokenizer
model = BertForTokenClassification.from_pretrained("your-username/korean-pii-masking-bert")
tokenizer = BertTokenizer.from_pretrained("your-username/korean-pii-masking-bert")

# Tokenize the input text
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_labels = torch.argmax(predictions, dim=-1)[0]
```
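The `predicted_labels` above are label ids; mapping them back to tag strings goes through `model.config.id2label`. A self-contained sketch of that decoding step with a toy mapping (in practice the mapping and the tokens come from the loaded model and tokenizer):

```python
# Toy id2label mapping for illustration; the real one is model.config.id2label.
id2label = {0: "O", 1: "B-Name", 2: "I-Name", 3: "B-Phone Number"}

# Example WordPiece tokens and the predicted ids that argmax over the
# logits would yield for each token position.
tokens = ["김", "##민수", "의", "번호"]
predicted_ids = [1, 2, 0, 0]

# Keep only tokens tagged as part of a PII entity (anything but "O").
entities = [
    (tok, id2label[i])
    for tok, i in zip(tokens, predicted_ids)
    if id2label[i] != "O"
]
print(entities)  # [('김', 'B-Name'), ('##민수', 'I-Name')]
```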
|
|
|
|
|
### Using the Pipeline
|
|
|
|
|
You can use the model more conveniently through `inference_pipeline.py` from the source repository:
|
|
|
|
|
```python
from inference_pipeline import PIIInferencePipeline

# Initialize the pipeline
pipeline = PIIInferencePipeline()

# Run prediction on the text
text = "안녕하세요, 제 이름은 김민수이고 전화번호는 010-1234-5678입니다."
result = pipeline.predict(text)

print(f"Original text: {result.original_text}")
print(f"Masked text: {result.masked_text}")
print(f"PII entities found: {len(result.entities)}")
```
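The pipeline's `masked_text` replaces each detected span with a placeholder. A minimal sketch of that masking step, assuming entities arrive as `(start, end, type)` character spans (the real pipeline's entity format may differ):

```python
def mask_text(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace each detected entity span with a [TYPE] placeholder.

    Spans are applied right-to-left so earlier replacements do not
    shift the character offsets of later ones.
    """
    for start, end, etype in sorted(entities, reverse=True):
        text = text[:start] + f"[{etype}]" + text[end:]
    return text

# "010-1234-5678" occupies characters 6..19 of the sentence below.
masked = mask_text("제 번호는 010-1234-5678입니다.", [(6, 19, "Phone Number")])
print(masked)  # 제 번호는 [Phone Number]입니다.
```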
|
|
|
|
|
## Example
|
|
|
|
|
```
Input: "Approval of a 9,910 won payment at [merchant name] confirmed on August 10 at 14:32."

Output:
- Detected PII:
  - [merchant name] -> [Business Name]
  - 9,910 won -> [Payment Amount]
```
|
|
|
|
|
## Data Preprocessing
|
|
|
|
|
The model takes Korean text as input, with a maximum length of 300 tokens.
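Text longer than the 300-token limit must be split before inference. One common approach, sketched here with plain integers standing in for tokens, is overlapping windows (the stride value is an assumption, not something the model prescribes):

```python
def chunk_tokens(tokens, max_len=300, stride=50):
    """Split a token sequence into windows of at most max_len tokens,
    overlapping by `stride` so an entity near a window boundary appears
    whole in at least one window."""
    step = max_len - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# A 700-"token" input becomes three overlapping windows.
chunks = chunk_tokens(list(range(700)))
print([len(c) for c in chunks])  # [300, 300, 200]
```

Predictions from overlapping regions then need to be merged, e.g. by preferring the window where the entity sits farther from the edge.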
|
|
|
|
|
## Limitations
|
|
|
|
|
- Maximum input length: 300 tokens

- Optimized for Korean text

- Specialized for PII recognition in text (images and audio are not supported)
|
|
|
|
|
## References
|
|
|
|
|
This model was trained for Korean personal information masking.
|
|
|
|
|
## License
|
|
|
|
|
MIT License |
|
|
|
|
|
## Author
|
|
|
|
|
Korean PII Masking Project |
|
|
|
|
|
|
|
|
|