Update README.md
README.md CHANGED
@@ -4,28 +4,55 @@
**Role**: Project Lead & Primary Researcher

## Overview

A BERT-based token classification model specialized for **Korean Personally Identifiable Information (PII) masking**.
It detects and masks 15 common types of Korean PII entities.

For best accuracy in production, a **hybrid approach** is **strongly recommended**:
pre-processing → model inference → post-processing (regex and rule-based), rather than the model alone.

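The hybrid approach recommended above can be sketched as three stages. This is a minimal illustration, not the project's actual pipeline: the function names (`preprocess`, `regex_spans`, `merge_spans`, `mask_pii`), the two regex rules, and the `fake_model` stand-in for model inference are all assumptions introduced here.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)

# Hypothetical post-processing rules; the patterns are illustrative, not exhaustive.
RULES = [
    ("전자메일", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("휴대전화번호", re.compile(r"01[016789]-?\d{3,4}-?\d{4}")),
]


def preprocess(text: str) -> str:
    """Collapse runs of whitespace so all later offsets refer to one string."""
    return re.sub(r"\s+", " ", text).strip()


def regex_spans(text: str) -> List[Span]:
    """Find spans for strictly formatted PII that rules tend to catch reliably."""
    return [(m.start(), m.end(), label)
            for label, pattern in RULES
            for m in pattern.finditer(text)]


def merge_spans(model_spans: List[Span], rule_spans: List[Span]) -> List[Span]:
    """Keep every rule span, plus model spans that overlap no rule span."""
    merged = list(rule_spans)
    for start, end, label in model_spans:
        if all(end <= r_start or start >= r_end for r_start, r_end, _ in rule_spans):
            merged.append((start, end, label))
    return sorted(merged)


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span with its bracketed label, left to right."""
    parts, pos = [], 0
    for start, end, label in spans:
        if start < pos:  # skip a span that overlaps one already masked
            continue
        parts.append(text[pos:start])
        parts.append(f"[{label}]")
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


def mask_pii(text: str, model_predict: Callable[[str], List[Span]]) -> str:
    """Pre-process, run the model, then apply rule-based post-processing."""
    clean = preprocess(text)
    spans = merge_spans(model_predict(clean), regex_spans(clean))
    return mask(clean, spans)


def fake_model(text: str) -> List[Span]:
    """Stand-in for model inference: pretends the first three characters are a name."""
    return [(0, 3, "이름")]


print(mask_pii("홍길동 고객님  연락처는 010-1234-5678 입니다.", fake_model))
```

Letting rule spans take precedence over overlapping model spans is one possible design choice: strictly formatted identifiers are usually easier to match exactly with a pattern, while free-form entities such as names are left to the model.
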
## Base Model & Architecture

- **Base pretrained model**: KcBERT-Large
- **Model type**: `BertForTokenClassification`
- **Architecture highlights**:
  - Hidden size: 1024
  - Layers: 24
  - Attention heads: 16
  - Intermediate size: 4096
  - Max position embeddings: 300
  - Vocab size: 30,000
  - Activation: GELU
  - Dropout: 0.1 (hidden & attention)

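One practical consequence of the 300-position limit is that longer inputs have to be split before inference. The sketch below shows one way to do that with overlapping windows over already-tokenized ids. The function name `window_token_ids`, the default overlap of 50 tokens, and the reservation of two positions for `[CLS]` and `[SEP]` are assumptions made for illustration.

```python
from typing import List

HIDDEN_SIZE = 1024
NUM_HEADS = 16
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 64 dimensions per attention head

MAX_POSITIONS = 300  # max position embeddings from the list above
SPECIAL_TOKENS = 2   # assumed: [CLS] and [SEP] are added around each window


def window_token_ids(token_ids: List[int], overlap: int = 50) -> List[List[int]]:
    """Split token ids into overlapping windows that fit the position limit."""
    size = MAX_POSITIONS - SPECIAL_TOKENS
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, window size)")
    if len(token_ids) <= size:
        return [list(token_ids)]
    step = size - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + size])
        if start + size >= len(token_ids):  # the last window reached the end
            break
    return windows


windows = window_token_ids(list(range(700)))
print([len(w) for w in windows])
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in at least one window. Predictions in the overlapping region then need to be reconciled, for example by preferring the window in which the token is farther from the edge.
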
| 28 |
+
## Supported PII Types (BIO tagging)
|
| 29 |
+
|
| 30 |
+
1. ๊ฐ๋งน์ ๋ช
(Business Name)
|
| 31 |
+
2. ๊ฒฐ์ ๊ธ์ก (Payment Amount)
|
| 32 |
+
3. ๊ณ์ข๋ฒํธ (Account Number)
|
| 33 |
+
4. ๋ก๊ทธ์ธID (Login ID)
|
| 34 |
+
5. ์์ธ์ฃผ์ (Detailed Address)
|
| 35 |
+
6. ์ ์ฉ์ ์ (Credit Score)
|
| 36 |
+
7. ์ฌ๊ถ๋ฒํธ (Passport Number)
|
| 37 |
+
8. ์ฐํธ๋ฒํธ (Postal Code)
|
| 38 |
+
9. ์ด์ ๋ฉดํ๋ฒํธ (Driver's License Number)
|
| 39 |
+
10. ์ด๋ฆ (Name)
|
| 40 |
+
11. ์ ์๋ฉ์ผ (Email)
|
| 41 |
+
12. ์ ํ๋ฒํธ (Phone Number)
|
| 42 |
+
13. ์ฃผ๋ฏผ๋ฑ๋ก๋ฒํธ (Resident Registration Number)
|
| 43 |
+
14. ์นด๋๋ฒํธ (Card Number)
|
| 44 |
+
15. ํด๋์ ํ๋ฒํธ (Mobile Phone Number)
|
| 45 |
+
|
## Example

```
Input: "□철용 고객님, 8월 10일 14:32에 백다방 코엑스점에서 9,910원 결제 내역 확인됩니다."

Output:
- Detected PII:
  - □철용 -> [이름]
  - 백다방 코엑스점 -> [가맹점명]
  - 9,910원 -> [결제금액]
```

This list focuses on the most frequently occurring and sensitive personal data types in Korean text/documents.

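Output in the shape shown in the example can be produced by grouping consecutive BIO tags into entities. The following is a minimal sketch under simplifying assumptions: tokens are whitespace-separated words rather than WordPiece sub-tokens, the tags are hand-written rather than model predictions, and `decode_bio` is a hypothetical helper name.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens tagged B-x / I-x into (text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag, or an I- tag with no matching open entity, starts a new one.
            flush()
            current_tokens, current_type = [token], entity_type
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O" closes any open entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hand-written tags for part of the example sentence (not real model output).
tokens = ["백다방", "코엑스점", "에서", "9,910원", "결제"]
tags = ["B-가맹점명", "I-가맹점명", "O", "B-결제금액", "O"]

for text, entity_type in decode_bio(tokens, tags):
    print(f"- {text} -> [{entity_type}]")
```

With a sub-word tokenizer, the same grouping would typically be done on character offsets instead of joined token strings, so that the masked spans map back onto the original text exactly.
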