Commit 852100b by sungjun12 (verified) · 1 Parent(s): b85158c

Update README.md

Files changed (1)
  1. README.md +50 -23
README.md CHANGED
@@ -4,28 +4,55 @@
4
  **Role**: Project Lead & Primary Researcher
5
 
6
  ## Overview
7
- This is a specialized **Personally Identifiable Information (PII)** masking model optimized specifically for the **Korean language**.
8
- To achieve the highest possible accuracy, the model has been fine-tuned to work in combination with preprocessing scripts and **regular expressions (regex)**.
9
-
10
- **Strong Recommendation**: For production environments, we highly recommend adopting a **hybrid approach** โ€” combining this model with robust rule-based systems (regex + post-processing) โ€” rather than relying on the model alone.
11
-
12
- ### Supported PII Types (Korean Context)
13
- The model is designed to detect and mask the following common Korean PII categories:
14
-
15
- - Resident Registration Number (์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ)
16
- - Credit/Debit Card Number (์นด๋“œ๋ฒˆํ˜ธ)
17
- - Telephone Number (์ „ํ™”๋ฒˆํ˜ธ)
18
- - Mobile Phone Number (ํœด๋Œ€์ „ํ™”๋ฒˆํ˜ธ)
19
- - Postal Code (์šฐํŽธ๋ฒˆํ˜ธ)
20
- - Detailed Address (์ƒ์„ธ์ฃผ์†Œ)
21
- - Bank Account Number (๊ณ„์ขŒ๋ฒˆํ˜ธ)
22
- - Passport Number (์—ฌ๊ถŒ๋ฒˆํ˜ธ)
23
- - Driver's License Number (์šด์ „๋ฉดํ—ˆ๋ฒˆํ˜ธ)
24
- - Email Address (์ „์ž๋ฉ”์ผ)
25
- - Customer Name (๊ณ ๊ฐ๋ช…)
26
- - Login ID / Username (๋กœ๊ทธ์ธID)
27
- - Payment Amount (๊ฒฐ์ œ๊ธˆ์•ก)
28
- - Merchant Name (๊ฐ€๋งน์ ๋ช…)
29
- - Credit Score (์‹ ์šฉ์ ์ˆ˜)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
 
31
  This list focuses on the most frequently occurring and sensitive personal data types in Korean text/documents.
 
  **Role**: Project Lead & Primary Researcher
 
  ## Overview
+
+ A BERT-based token classification model specialized for **Korean Personally Identifiable Information (PII) masking**.
+ It detects and masks 15 common types of Korean PII entities.
+
+ For the best accuracy in production, a **hybrid approach** is **strongly recommended**
+ (pre-processing → model inference → regex- and rule-based post-processing) rather than relying on the model alone.
+
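The hybrid flow named in the overview (pre-processing, then model inference, then regex and rule-based post-processing) might be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual scripts: the regex patterns, the function names, the policy that a regex match overrides an overlapping model span, and the `model_predict` callable are all hypothetical.

```python
import re
import unicodedata

# Hypothetical rule set; these patterns are illustrative, not taken from the model card.
# Digit lookarounds are used instead of \b because Korean particles attach
# directly to numbers (e.g. "5678입니다"), so \b would not match there.
RULES = {
    "주민등록번호": re.compile(r"(?<!\d)\d{6}-[1-4]\d{6}(?!\d)"),
    "휴대전화번호": re.compile(r"(?<!\d)01[016789]-\d{3,4}-\d{4}(?!\d)"),
    "전자메일": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
}


def preprocess(text):
    """Normalize Unicode (e.g. full-width digits) and collapse whitespace."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


def rule_spans(text):
    """Return (start, end, label) spans found by the regex rules."""
    return [(m.start(), m.end(), label)
            for label, pattern in RULES.items()
            for m in pattern.finditer(text)]


def merge_spans(model_spans, regex_spans):
    """Union of both sources; a regex span replaces any model span it overlaps."""
    kept = [s for s in model_spans
            if not any(s[0] < r[1] and r[0] < s[1] for r in regex_spans)]
    return sorted(kept + regex_spans)


def mask(text, spans):
    """Replace each span with its bracketed label, working right to left."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def hybrid_mask(text, model_predict):
    """model_predict is a stand-in callable returning (start, end, label) spans."""
    clean = preprocess(text)
    return mask(clean, merge_spans(model_predict(clean), rule_spans(clean)))
```

Letting the rules win on overlap is only one possible policy; a deployment could equally prefer the model for free-form types such as names and addresses and reserve regex for fixed-format identifiers.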
+ ## Base Model & Architecture
+
+ - **Base pretrained model**: KcBERT-Large
+ - **Model type**: `BertForTokenClassification`
+ - **Architecture highlights**:
+   - Hidden size: 1024
+   - Layers: 24
+   - Attention heads: 16
+   - Intermediate size: 4096
+   - Max position embeddings: 300
+   - Vocab size: 30,000
+   - Activation: GELU
+   - Dropout: 0.1 (hidden & attention)
+
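One practical consequence of the 300 max position embeddings listed above is that a single forward pass cannot cover more than 300 token positions, so longer documents need to be split before inference. The sketch below shows one way to do that with overlapping windows. The function name, the 32-token overlap, and the reservation of two positions for `[CLS]` and `[SEP]` are assumptions for illustration, not settings documented by this repository.

```python
def windows(tokens, max_positions=300, overlap=32):
    """Split a token list into overlapping chunks that fit the position limit.

    Two positions are reserved for the [CLS] and [SEP] special tokens.
    Returns a list of (start_index, chunk) pairs.
    """
    size = max_positions - 2
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the usable window size")
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks
```

The overlap gives entities that straddle a chunk boundary a chance to appear whole in at least one window; predictions from overlapping regions then need to be reconciled, for example by keeping the span seen in the window where it is furthest from an edge.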
28
+ ## Supported PII Types (BIO tagging)
29
+
30
+ 1. ๊ฐ€๋งน์ ๋ช… (Business Name)
31
+ 2. ๊ฒฐ์ œ๊ธˆ์•ก (Payment Amount)
32
+ 3. ๊ณ„์ขŒ๋ฒˆํ˜ธ (Account Number)
33
+ 4. ๋กœ๊ทธ์ธID (Login ID)
34
+ 5. ์ƒ์„ธ์ฃผ์†Œ (Detailed Address)
35
+ 6. ์‹ ์šฉ์ ์ˆ˜ (Credit Score)
36
+ 7. ์—ฌ๊ถŒ๋ฒˆํ˜ธ (Passport Number)
37
+ 8. ์šฐํŽธ๋ฒˆํ˜ธ (Postal Code)
38
+ 9. ์šด์ „๋ฉดํ—ˆ๋ฒˆํ˜ธ (Driver's License Number)
39
+ 10. ์ด๋ฆ„ (Name)
40
+ 11. ์ „์ž๋ฉ”์ผ (Email)
41
+ 12. ์ „ํ™”๋ฒˆํ˜ธ (Phone Number)
42
+ 13. ์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ (Resident Registration Number)
43
+ 14. ์นด๋“œ๋ฒˆํ˜ธ (Card Number)
44
+ 15. ํœด๋Œ€์ „ํ™”๋ฒˆํ˜ธ (Mobile Phone Number)
45
+
+ ## Example
+
+ ```
+ Input: "양철용 고객님, 8월 10일 14:32에 백다방 코엑스점에서 9,910원 결제 내역 확인됩니다."
+
+ Output:
+ - Detected PII:
+   - 양철용 -> [이름]
+   - 백다방 코엑스점 -> [가맹점명]
+   - 9,910원 -> [결제금액]
+ ```
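To show how token-level BIO tags turn into an entity list like the one in the example, here is a small decoding sketch. The tokenization and the tags below are written by hand for illustration and are not actual model output; a real pipeline would work on WordPiece tokens and use character offsets rather than joining tokens with spaces.

```python
def decode_bio(tokens, tags):
    """Group B-/I- tagged tokens into (text, label) entities."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B" or (prefix == "I" and name != label):
            # A new entity starts (a stray I- tag is treated like B-).
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], name
        elif prefix == "I":
            current.append(token)
        else:  # "O" closes any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


# Hand-written tokens and tags for the README example sentence (hypothetical).
tokens = ["양철용", "고객님", ",", "8월", "10일", "14:32", "에", "백다방",
          "코엑스점", "에서", "9,910원", "결제", "내역", "확인됩니다", "."]
tags = ["B-이름", "O", "O", "O", "O", "O", "O", "B-가맹점명",
        "I-가맹점명", "O", "B-결제금액", "O", "O", "O", "O"]

for text, label in decode_bio(tokens, tags):
    print(f"{text} -> [{label}]")
```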
 
  This list focuses on the most frequently occurring and sensitive personal data types in Korean text/documents.