Commit 084d0d9 by Matela7 · verified · 1 Parent(s): 6fc47da

Create README.md

Files changed (1)
  1. README.md +85 -0
README.md ADDED
@@ -0,0 +1,85 @@
+ ---
+ language:
+ - pl
+ license: mit
+ tags:
+ - token-classification
+ - named-entity-recognition
+ - polish
+ - anonymization
+ - privacy
+ - gdpr
+ datasets:
+ - custom
+ metrics:
+ - f1
+ - precision
+ - recall
+ base_model: allegro/herbert-base-cased
+ model-index:
+ - name: AnonBERT-ENR
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     metrics:
+     - type: f1
+       value: 0.84
+       name: F1 Score
+
+ widget:
+ - text: "Nazywam się Jan Kowalski i mieszkam w Warszawie przy ulicy Marszałkowskiej 1. Mój email to jan.kowalski@example.com, a numer telefonu +48 123 456 789."
+   example_title: "Personal Data Example"
+ - text: "PESEL: 12345678901, dowód osobisty: ABC123456. Data urodzenia: 1990-05-20."
+   example_title: "ID Numbers Example"
+ ---
+
+ # AnonBERT-ENR: Polish Personal Data Anonymization Model
+
+ **Fine-tuned HerBERT model for Named Entity Recognition and anonymization of sensitive personal information in Polish text.**
+
+ > Achieves an **84% F1 score** on the test set for identifying and anonymizing 25+ types of personal data entities.
+
+ ## Model Description
+
+ AnonBERT-ENR is a specialized NER model fine-tuned from [allegro/herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased) to detect and anonymize personal data in Polish documents. It is designed to help organizations comply with GDPR and other privacy regulations by automatically identifying sensitive information.
+
+ ### Key Features
+
+ - ✅ **High Accuracy**: 84% F1 score on diverse test data
+ - 🇵🇱 **Polish Language**: Optimized for Polish text and naming conventions
+ - 🔒 **Privacy-Focused**: Detects 25+ types of sensitive personal information
+ - 🚀 **Production Ready**: Includes a complete anonymization pipeline
+ - 📊 **Comprehensive Coverage**: Names, IDs, contact info, health data, political views, and more
+
+ ## Supported Entity Types
+
+ The model can identify the following types of personal information:
+
+ | Entity Type | Description | Example |
+ |------------|-------------|---------|
+ | `NAME` | First name | Jan |
+ | `SURNAME` | Last name | Kowalski |
+ | `EMAIL` | Email address | jan@example.pl |
+ | `PHONE` | Phone number | +48 123 456 789 |
+ | `PESEL` | National ID number | 12345678901 |
+ | `DOCUMENT` | ID document number | ABC123456 |
+ | `ADDRESS` | Street address | ul. Marszałkowska 1 |
+ | `CITY` | City name | Warszawa |
+ | `BANK_ACCOUNT` | Bank account number | PL61109010140000071219812874 |
+ | `CREDIT_CARD` | Credit card number | 1234-5678-9012-3456 |
+ | `DATE_BIRTH` | Date of birth | 1990-05-20 |
+ | `DATE` | General date | 2024-01-15 |
+ | `AGE` | Age | 25 lat |
+ | `SEX` | Gender/sex | Mężczyzna, Kobieta |
+ | `COMPANY` | Company name | Allegro |
+ | `SCHOOL` | School name | Uniwersytet Warszawski |
+ | `JOB` | Job title | Dyrektor |
+ | `USERNAME` | Username/login | jan.kowalski |
+ | `HEALTH` | Health information | cukrzyca |
+ | `RELIGION` | Religious affiliation | katolik |
+ | `POLITICAL` | Political views | liberalny |
+ | `ETHNICITY` | Ethnicity | polska |
+ | `ORIENTATION` | Sexual orientation | heteroseksualny |
+ | `RELATIVE` | Family relation | matka |
+ | `SECRET` | Secret/confidential info | hasło: abc123 |
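
Review note: the README mentions an anonymization pipeline, but the part of the diff shown here does not include it. The following is only a minimal sketch of how detected spans could be turned into placeholders, not the author's implementation. It assumes entities arrive as dictionaries with `entity_group`, `start`, and `end` keys, the shape returned by the `transformers` token-classification pipeline when an aggregation strategy is set. The `anonymize` and `span` helpers are hypothetical names, and the spans are written by hand rather than produced by the model.

```python
from typing import Dict, List


def anonymize(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder (hypothetical helper)."""
    # Work right to left so earlier character offsets stay valid
    # after a replacement changes the length of the text.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


sample = "Nazywam się Jan Kowalski i mieszkam w Warszawie."


def span(label: str, value: str) -> Dict:
    """Build a hand-written entity span by locating `value` in `sample`."""
    start = sample.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


# Hand-written spans for illustration only; this is not real model output.
spans = [span("NAME", "Jan"), span("SURNAME", "Kowalski"), span("CITY", "Warszawie")]

result = anonymize(sample, spans)
print(result)
```

Replacing from the last span backwards avoids recomputing offsets after each substitution. A real pipeline would also have to decide how to handle overlapping or low-confidence spans, which this sketch ignores.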