MutazYoune committed
Commit cdb7180 · verified · 1 Parent(s): 1d8c4eb

Update README.md

Files changed (1)
  1. README.md +154 -41
README.md CHANGED
@@ -10,77 +10,190 @@ tags:
  - token-classification
  - pii
  - privacy
  datasets:
- - custom
  widget:
  - text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
    example_title: Arabic PII Detection
  - text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
    example_title: Email Detection
  pipeline_tag: token-classification
  ---

- # MutazYoune/Arabic-NER-PII

  ## Model Overview

- This state-of-the-art Arabic Named Entity Recognition (NER) model is fine-tuned on top of the powerful `MutazYoune/ARAB_BERT` architecture. Designed specifically for detecting and redacting Personally Identifiable Information (PII) in Arabic text, it excels at recognizing sensitive data embedded within sentences.

- This model was carefully trained to serve Arabic NLP applications requiring privacy and security, making it suitable for tasks such as data anonymization, document redaction, and compliance with data protection laws.

- ## What It Detects

- Our model can identify a wide spectrum of PII categories in Arabic text, including but not limited to:

- - Personal Names (first, middle, family)
- - Phone Numbers
- - Email Addresses
- - Physical Addresses
- - National ID Numbers
- - Bank Account Details
- - Dates of Birth

- ## Model Specifications

- - Architecture: BERT-based Token Classification
- - Base Model: `MutazYoune/ARAB_BERT`
- - Language: Arabic (Modern Standard and Dialects)
- - Task: Named Entity Recognition & PII Redaction

  ## Training Details

- | Parameter | Value |
- |---------------|-------------|
- | Epochs | 12 |
- | Batch Size | 16 |
- | Learning Rate | 3e-5 |

- ## Supported Entity Tags

- | Entity | Description |
- |-------------|-----------------------------------|
- | CONTACT | Emails, phone numbers, addresses |
- | IDENTIFIER | National IDs, bank accounts |
- | NETWORK | IP addresses, online identifiers |
- | NUMERIC_ID | Numeric IDs like passport numbers |
- | PII | Generic personally identifiable info |

- ## How to Use

  ```python
  from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

- # Load tokenizer and model
  tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
  model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

- # Create a NER pipeline with aggregation for cleaner output
- ner_pipeline = pipeline("ner",
-     model=model,
-     tokenizer=tokenizer,
-     aggregation_strategy="simple")

- # Test example
- text = "أحمد محمد يعمل في شركة جوجل في الرياض"
  entities = ner_pipeline(text)
- print(entities)
  - token-classification
  - pii
  - privacy
+ - maqsam-competition
  datasets:
+ - Maqsam/ArabicPIIRedaction
  widget:
  - text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
    example_title: Arabic PII Detection
  - text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
    example_title: Email Detection
+ - text: عنوان المنزل هو شارع الملك فهد، الرياض
+   example_title: Address Detection
  pipeline_tag: token-classification
  ---

+ # Arabic NER PII - Personally Identifiable Information Detection

  ## Model Overview

+ This Arabic Named Entity Recognition (NER) model detects Personally Identifiable Information (PII) in Arabic text. Built on MutazYoune/ARAB_BERT, it targets two difficulties specific to Arabic NLP: morphological complexity and the absence of the capitalization cues that usually help entity recognition.

+ The model was developed for the Maqsam Arabic PII Redaction Challenge, where it placed 16th with a best overall score of 0.5341. It is intended to identify sensitive information across varied Arabic text patterns and dialects.

+ ## Entity Categories

+ The model identifies five main categories of PII in Arabic text:

+ - **CONTACT**: Email addresses, phone numbers, and other contact information
+ - **NETWORK**: IP addresses and network identifiers
+ - **IDENTIFIER**: National IDs, bank accounts, and structured identifiers
+ - **NUMERIC_ID**: Numeric identifiers such as passport numbers and account numbers
+ - **PII**: Generic personally identifiable information (names, personal details)

+ ## Performance Metrics

+ Based on the Maqsam competition evaluation (token-level classification):
+
+ | Metric | Score |
+ |--------|-------|
+ | **Best Overall Score** | 0.5341 |
+ | **Exact F1** | 0.0239 |
+ | **Exact Precision** | 0.0290 |
+ | **Exact Recall** | 0.0200 |
+ | **Partial F1** | 0.5341 |
+ | **Partial Precision** | 0.6470 |
+ | **Partial Recall** | 0.4550 |
+ | **IoU50 F1** | 0.2439 |
+ | **IoU50 Precision** | 0.2950 |
+ | **IoU50 Recall** | 0.2080 |
+
+ *Competition Ranking: 16th place (Prophtech-AI team)*
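
As a sanity check on the table above, the sketch below recomputes each F1 from its precision and recall. It assumes the competition used the standard harmonic-mean definition of F1, which the README does not state; the helper name `f1_score` is illustrative. The recomputed values agree with the reported ones to within 0.001, which would be consistent with rounding of the published precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
REPORTED = {
    "Exact": (0.0290, 0.0200, 0.0239),
    "Partial": (0.6470, 0.4550, 0.5341),
    "IoU50": (0.2950, 0.2080, 0.2439),
}

for name, (p, r, reported) in REPORTED.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```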
+
+ ## Architecture
+
+ - **Base Model**: MutazYoune/ARAB_BERT
+ - **Architecture**: BERT-based Token Classification
+ - **Language**: Arabic (Modern Standard Arabic and regional dialects)
+ - **Task**: Named Entity Recognition for PII Detection
+ - **Labels**: BIO tagging scheme with 11 labels (`O`, plus `B-` and `I-` tags for each of the five entity types)
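
The label count follows from the five entity categories: one `O` tag plus a `B-` and an `I-` tag per category. A minimal sketch of how such a label set could be built is shown below; the ordering and integer ids are assumptions for illustration, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types taken from the "Entity Categories" list; the order is an assumption.
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # continuation token of the same entity
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(LABELS))  # 1 + 2 * 5 labels
print(LABELS)
```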

  ## Training Details

+ ### Dataset
+ - **Primary Dataset**: [Maqsam Arabic PII Redaction Competition Dataset](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction) (10,000 records)
+ - **Augmented Data**: 10,000 additional LLM-generated records
+ - **Total Training Data**: 20,000 annotated Arabic sentences
+ - **Annotation Scheme**: BIO tagging, with regex-based pattern recognition for structured entities

+ ### Training Configuration
+ | Parameter | Value |
+ |-----------|-------|
+ | Epochs | 12 |
+ | Batch Size | 16 |
+ | Learning Rate | 3e-5 |
+ | Base Model | MutazYoune/ARAB_BERT |

+ ### Pattern Recognition Strategy
+ The model combines neural learning with regex-based pattern matching for structured entities:

+ ```python
+ PATTERNS = {
+     # Email addresses and http/https/ftp URLs
+     "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
+     # Four number groups separated by dots or by dashes
+     "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
+     # Underscore- or dot-joined words and mixed letter/digit strings
+     "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+\d+|\d+[a-zA-Z]+\d+[a-zA-Z]+',
+     # Dash-separated number pairs or runs of six or more digits
+     "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
+ }
+ ```
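
The README does not show how these patterns are applied. The sketch below is one possible way to run them over raw text with Python's `re` module. The function name `find_pattern_spans` and the overlap rule (patterns earlier in the dictionary win) are assumptions, not the author's method.

```python
import re

# Patterns copied from the README.
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+\d+|\d+[a-zA-Z]+\d+[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}


def find_pattern_spans(text):
    """Return sorted (start, end, label, matched_text) tuples.

    Assumption: when two patterns overlap, the one listed first in PATTERNS wins.
    """
    accepted = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            start, end = match.span()
            overlaps = any(start < e and s < end for s, e, _, _ in accepted)
            if not overlaps:
                accepted.append((start, end, label, match.group()))
    return sorted(accepted)


for span in find_pattern_spans("email fatima@email.com phone 0501234567 ip 192.168.1.1"):
    print(span)
```

Two behaviours of the patterns as written are worth knowing. The `IDENTIFIER` pattern also matches the `email.com` part of an email address, which is why some overlap rule is needed. A bare phone number such as `0501234567` is caught by `NUMERIC_ID`, not `CONTACT`, because the `CONTACT` regex covers only emails and URLs.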
+
+ ## Usage
+
+ ### Quick Start

  ```python
  from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

+ # Load the model and tokenizer
  tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
  model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

+ # Create NER pipeline; aggregation_strategy="simple" groups tokens into whole entities
+ ner_pipeline = pipeline(
+     "ner",
+     model=model,
+     tokenizer=tokenizer,
+     aggregation_strategy="simple"
+ )

+ # Example usage
+ # English: "Ahmed Mohammed works at Google in Riyadh and his phone number is 0501234567"
+ text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
  entities = ner_pipeline(text)
+
+ for entity in entities:
+     print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Confidence: {entity['score']:.4f}")
+ ```
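
Since the model targets redaction, a natural next step is to replace each detected span with a placeholder. The sketch below works on the list of dictionaries that the aggregated pipeline returns (keys `entity_group`, `score`, `start`, `end`). The `redact` and `fake_entity` helpers, the `min_score` threshold, and the hand-written entities with their scores are hypothetical stand-ins, not real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder."""
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


def fake_entity(text, word, label, score):
    """Build a pipeline-style entity dict for a word that occurs in text."""
    start = text.index(word)
    return {"entity_group": label, "score": score, "word": word,
            "start": start, "end": start + len(word)}


# English: "To contact Sara, use the number 0501234567"
sample_text = "للتواصل مع سارة على الرقم 0501234567"
sample_entities = [
    fake_entity(sample_text, "سارة", "PII", 0.91),              # made-up score
    fake_entity(sample_text, "0501234567", "NUMERIC_ID", 0.88),  # made-up score
]

print(redact(sample_text, sample_entities))
```

In real use, `sample_entities` would be the result of `ner_pipeline(text)`. Raising `min_score` trades recall for precision.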
+
+ ### Advanced Usage with Custom Processing
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
+ model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
+
+ def predict_pii(text):
+     # Tokenize input
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+     # Get predictions (no gradients needed at inference time)
+     with torch.no_grad():
+         outputs = model(**inputs)
+         predictions = torch.argmax(outputs.logits, dim=-1)
+
+     # Decode predictions into (token, label) pairs, special tokens included
+     tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+     labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
+
+     return list(zip(tokens, labels))
+
+ # Example
+ # English: "To contact Sara, use the number 0501234567"
+ text = "للتواصل مع سارة على الرقم 0501234567"
+ results = predict_pii(text)
+ print(results)
+ ```
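
The `(token, label)` pairs returned by `predict_pii` are per sub-word token and include special tokens. The sketch below shows one way to fold them into entity strings. It assumes a WordPiece tokenizer that marks continuation pieces with `##` and uses `[CLS]`, `[SEP]`, and `[PAD]` as special tokens, which is typical for BERT models but is not confirmed by the README. The function name `group_entities` and the example pairs are hypothetical.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # assumed BERT-style special tokens


def group_entities(token_label_pairs):
    """Fold BIO-tagged WordPiece tokens into (entity_type, text) tuples."""
    entities = []
    current = None  # [entity_type, [words]] for the entity being built

    def flush():
        nonlocal current
        if current is not None:
            entities.append((current[0], " ".join(current[1])))
            current = None

    for token, label in token_label_pairs:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word of the open entity.
            if current is not None:
                current[1][-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or current is None or current[0] != entity_type:
            flush()
            current = [entity_type, [token]]
        else:
            current[1].append(token)
    flush()
    return entities


# Hand-written pairs standing in for predict_pii output.
pairs = [("[CLS]", "O"), ("call", "O"), ("sa", "B-PII"), ("##ra", "I-PII"),
         ("on", "O"), ("050", "B-NUMERIC_ID"), ("##1234567", "I-NUMERIC_ID"),
         ("[SEP]", "O")]
print(group_entities(pairs))
```

This sketch follows the label of a word's first piece and ignores the labels of its continuation pieces, a common convention for token classification.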
+
+ ## Competition Context
+
+ This model was developed for the **Maqsam Arabic PII Redaction Challenge**, which targets the need for Arabic PII detection systems. The competition focused on:
+
+ - **Token-level evaluation** with precision, recall, and F1 metrics
+ - **Real-world applicability** for data protection compliance
+ - **Speed optimization** for practical deployment
+ - **Handling Arabic-specific challenges** like morphological complexity and lack of capitalization
+
+ The final competition score combined multiple metrics:
+ ```
+ Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
+ ```
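
The scoring formula above translates directly into code. In the sketch below, the function name `final_score` and the `avg_time` value of 2.0 are hypothetical. The README does not say which precision and recall variant the competition plugged in or what unit `avg_time` uses, so the partial-match figures are used purely for illustration.

```python
def final_score(precision: float, recall: float, avg_time: float) -> float:
    """Final Score = 0.45 * Precision + 0.45 * Recall + 0.1 * (1 / avg_time)."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)


# Partial-match precision and recall from the metrics table; avg_time is made up.
example = final_score(precision=0.6470, recall=0.4550, avg_time=2.0)
print(f"{example:.4f}")
```

As written, the speed term is not capped: an `avg_time` below 1 contributes more than 0.1 to the score.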
+
+ ## Limitations
+
+ 1. **Boundary Detection**: Exact-match F1 (0.0239) is far below partial-match F1 (0.5341), so predicted spans rarely align exactly with the annotated boundaries
+ 2. **Dialectal Coverage**: Primarily trained on Modern Standard Arabic with limited dialectal variations
+ 3. **Context Dependency**: May struggle with context-dependent PII that doesn't follow clear patterns
+ 4. **Missed Detections and False Positives**: Precision exceeds recall on every reported metric (for example, 0.6470 vs. 0.4550 for partial matches), so missed PII is the more common error, although false positives still occur
+
+ ## Citation
+
+ If you use this model in your research or applications, please cite:
+
+ ```bibtex
+ @misc{arabic-ner-pii-2024,
+   author = {MutazYoune},
+   title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
+   year = {2024},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/MutazYoune/Arabic-NER-PII}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Base Model**: [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
+ - **Competition**: [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
+ - **Dataset**: Maqsam/ArabicPIIRedaction
+
+ ## License
+
+ This model is released under the Apache 2.0 License, making it suitable for both research and commercial applications with appropriate attribution.