MutazYoune committed on
Commit
ab05fca
·
verified ·
1 Parent(s): cdb7180

Update README.md

Files changed (1)
  1. README.md +149 -108
README.md CHANGED
@@ -23,177 +23,218 @@ widget:
23
  pipeline_tag: token-classification
24
  ---
25
 
26
- # Arabic NER PII - Personally Identifiable Information Detection
27
 
28
- ## Model Overview
29
 
30
- This Arabic Named Entity Recognition model addresses the critical challenge of detecting Personally Identifiable Information in Arabic text. Built on MutazYoune/ARAB_BERT, the model tackles unique Arabic NLP challenges including morphological complexity and absence of capitalization patterns that typically assist in entity recognition.
31
 
32
- Developed for the Maqsam Arabic PII Redaction Challenge, this model demonstrates competitive performance in identifying sensitive information across various Arabic text patterns and dialectal variations.
33
 
34
- ## Entity Categories
35
 
36
- The model identifies five main categories of PII in Arabic text:
37
 
38
- - **CONTACT**: Email addresses, phone numbers, and contact information
39
- - **NETWORK**: IP addresses and network identifiers
40
- - **IDENTIFIER**: National IDs, bank accounts, and structured identifiers
41
- - **NUMERIC_ID**: Numeric identifiers like passport numbers, account numbers
42
- - **PII**: Generic personally identifiable information (names, personal details)
43
 
44
- ## Performance Metrics
45
 
46
- Based on the Maqsam competition evaluation (token-level classification):
47
 
48
- | Metric | Score |
49
- |--------|-------|
50
- | **Best Overall Score** | 0.5341 |
51
- | **Exact F1** | 0.0239 |
52
- | **Exact Precision** | 0.0290 |
53
- | **Exact Recall** | 0.0200 |
54
- | **Partial F1** | 0.5341 |
55
- | **Partial Precision** | 0.6470 |
56
- | **Partial Recall** | 0.4550 |
57
- | **IoU50 F1** | 0.2439 |
58
- | **IoU50 Precision** | 0.2950 |
59
- | **IoU50 Recall** | 0.2080 |
60
 
61
- *Competition Ranking: 16th place (Prophtech-AI team)*
62
 
63
- ## Architecture
64
 
65
- - **Base Model**: MutazYoune/ARAB_BERT
66
- - **Architecture**: BERT-based Token Classification
67
- - **Language**: Arabic (Modern Standard Arabic and regional dialects)
68
- - **Task**: Named Entity Recognition for PII Detection
69
- - **Labels**: BIO tagging scheme with 11 labels (O, B-/I- for each entity type)
70
 
71
  ## Training Details
72
 
73
- ### Dataset
74
- - **Primary Dataset**: [Maqsam Arabic PII Redaction Competition Dataset](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction) (10,000 records)
75
- - **Augmented Data**: Additional 10,000 LLM-generated records for data augmentation
76
- - **Total Training Data**: 20,000 annotated Arabic sentences
77
- - **Annotation Scheme**: BIO tagging with regex-based pattern recognition for structured entities
78
 
79
- ### Training Configuration
80
- | Parameter | Value |
81
- |-----------|-------|
82
- | Epochs | 12 |
83
- | Batch Size | 16 |
84
- | Learning Rate | 3e-5 |
85
- | Base Model | MutazYoune/ARAB_BERT |
86
 
87
- ### Pattern Recognition Strategy
88
- The model combines neural learning with regex-based pattern matching for improved accuracy:
89
 
90
  ```python
91
  PATTERNS = {
92
  "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
93
  "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
94
- "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+\d+|\d+[a-zA-Z]+\d+[a-zA-Z]+',
95
  "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
96
  }
97
  ```
98
 
99
- ## Usage
100
 
101
- ### Quick Start
102
 
103
- ```python
104
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
105
-
106
- # Load the model and tokenizer
107
- tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
108
- model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
109
-
110
- # Create NER pipeline
111
- ner_pipeline = pipeline(
112
- "ner",
113
- model=model,
114
- tokenizer=tokenizer,
115
- aggregation_strategy="simple"
116
- )
117
-
118
- # Example usage
119
- text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
120
- entities = ner_pipeline(text)
121
-
122
- for entity in entities:
123
-     print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Confidence: {entity['score']:.4f}")
124
- ```
125
-
126
- ### Advanced Usage with Custom Processing
127
 
128
  ```python
129
  import torch
130
  from transformers import AutoTokenizer, AutoModelForTokenClassification
131
 
132
- tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
133
- model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
134
-
135
- def predict_pii(text):
136
-     # Tokenize input
137
-     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
138
 
139
-     # Get predictions
140
      with torch.no_grad():
141
          outputs = model(**inputs)
142
          predictions = torch.argmax(outputs.logits, dim=-1)
143
 
144
-     # Decode predictions
145
      tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
146
      labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
147
 
148
-     return list(zip(tokens, labels))
149
 
150
- # Example
151
- text = "للتواصل مع سارة على الرقم 0501234567"
152
- results = predict_pii(text)
153
- print(results)
154
  ```
155
 
156
  ## Competition Context
157
 
158
- This model was developed for the **Maqsam Arabic PII Redaction Challenge**, which aimed to address the critical need for Arabic PII detection systems. The competition focused on:
159
 
160
- - **Token-level evaluation** with precision, recall, and F1 metrics
161
- - **Real-world applicability** for data protection compliance
162
- - **Speed optimization** for practical deployment
163
- - **Handling Arabic-specific challenges** like morphological complexity and lack of capitalization
164
 
165
- The final competition score combined multiple metrics:
166
  ```
167
  Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
168
  ```
169
 
170
- ## Limitations
171
-
172
- 1. **Performance Variability**: The exact match scores indicate room for improvement in precise boundary detection
173
- 2. **Dialectal Coverage**: Primarily trained on Modern Standard Arabic with limited dialectal variations
174
- 3. **Context Dependency**: May struggle with context-dependent PII that doesn't follow clear patterns
175
- 4. **False Positives**: Higher precision suggests some over-detection of non-PII entities
176
-
177
  ## Citation
178
 
179
- If you use this model in your research or applications, please cite:
180
-
181
  ```bibtex
182
  @misc{arabic-ner-pii-2024,
183
  author = {MutazYoune},
184
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
185
  year = {2024},
186
  publisher = {Hugging Face},
187
- url = {https://huggingface.co/MutazYoune/Arabic-NER-PII}
 
188
  }
189
  ```
190
 
191
- ## Related Resources
192
 
193
- - **Base Model**: [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
194
- - **Competition**: [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
195
- - **Dataset**: Maqsam/ArabicPIIRedaction
196
 
197
- ## License
198
 
199
- This model is released under the Apache 2.0 License, making it suitable for both research and commercial applications with appropriate attribution.
 
23
  pipeline_tag: token-classification
24
  ---
25
 
26
+ <div align="center">
27
 
28
+ # Arabic NER PII
29
 
30
+ **Personally Identifiable Information Detection for Arabic Text**
31
 
32
+ [![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MutazYoune/Arabic-NER-PII)
33
+ [![Competition](https://img.shields.io/badge/Maqsam-Challenge-green)](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
34
+ [![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
35
+ [![Arabic](https://img.shields.io/badge/Language-Arabic-red)]()
36
 
37
+ </div>
38
 
39
+ ## Overview
40
 
41
+ A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges specific to Arabic NLP, including morphological complexity and the absence of capitalization cues.
42
 
43
+ **Base Model:** `MutazYoune/ARAB_BERT` | **Task:** Token Classification | **Language:** Arabic
44
 
45
+ ## Quick Start
46
 
47
+ ```bash
48
+ pip install transformers torch
49
+ ```
50
+
51
+ ```python
52
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
53
+
54
+ # Load model
55
+ tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
56
+ model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
57
+
58
+ # Create pipeline
59
+ ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
60
+
61
+ # Detect PII
62
+ text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
63
+ entities = ner_pipeline(text)
64
+ print(entities)
65
+ ```
66
+
67
+ ## Supported Entities
68
+
69
+ | Entity | Description | Examples |
70
+ |--------|-------------|----------|
71
+ | `CONTACT` | Email addresses, phone numbers | `ahmed@email.com`, `0501234567` |
72
+ | `NETWORK` | IP addresses, network identifiers | `192.168.1.1`, `10-20-30-40` |
73
+ | `IDENTIFIER` | National IDs, structured identifiers | `ID_123456`, `user.name` |
74
+ | `NUMERIC_ID` | Numeric identifiers | `123456789`, `12-34-56` |
75
+ | `PII` | Generic personal information | Names, personal details |
76
 
77
+ ## Performance
78
 
79
+ > **Maqsam Arabic PII Redaction Challenge - Rank #16**
80
 
81
+ | Metric | Exact | Partial | IoU50 |
82
+ |--------|-------|---------|-------|
83
+ | **Precision** | 0.029 | 0.647 | 0.295 |
84
+ | **Recall** | 0.020 | 0.455 | 0.208 |
85
+ | **F1** | 0.024 | 0.534 | 0.244 |
86
+
87
+ **Overall Score:** 0.5341
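
The F1 row of the table can be cross-checked against the precision and recall rows with the standard harmonic-mean formula. The snippet below is a minimal sketch of that arithmetic; the helper name `f1_score` and the `reported` dictionary are illustrative and not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the table above.
reported = {
    "Exact": (0.029, 0.020),
    "Partial": (0.647, 0.455),
    "IoU50": (0.295, 0.208),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

Rounded to three decimals, the computed values agree with the F1 row (0.024, 0.534, 0.244).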
88
 
89
  ## Training Details
90
 
91
+ <details>
92
+ <summary><strong>Dataset</strong></summary>
93
+
94
+ - **Source:** Maqsam Arabic PII Redaction Competition Dataset
95
+ - **Size:** 20,000 sentences (10k original + 10k LLM-augmented)
96
+ - **Annotation:** BIO tagging scheme with regex pattern matching
97
+ - **Labels:** 11 total (O + B-/I- for each entity type)
98
+
99
+ </details>
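
As a rough illustration of how 11 labels arise from five entity types under BIO tagging, the sketch below builds one `O` label plus a `B-`/`I-` pair per type. The ordering of labels here is an assumption; the model's actual mapping lives in `model.config.id2label` and may differ.

```python
# Entity types taken from the "Supported Entities" table.
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
# Hypothetical id mappings; the real ones ship with the model config.
label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}

print(len(LABELS))
```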
100
+
101
+ <details>
102
+ <summary><strong>Training Configuration</strong></summary>
103
+
104
+ ```yaml
105
+ base_model: MutazYoune/ARAB_BERT
106
+ epochs: 12
107
+ batch_size: 16
108
+ learning_rate: 3e-5
109
+ max_length: 512
110
+ optimization: AdamW
111
+ ```
112
 
113
+ </details>
114
 
115
+ <details>
116
+ <summary><strong>Pattern Recognition</strong></summary>
117
 
118
  ```python
119
  PATTERNS = {
120
  "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
121
  "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
122
+ "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
123
  "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
124
  }
125
  ```
126
 
127
+ </details>
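
The README does not say how these patterns are applied or how overlapping matches are resolved. The sketch below is one possible reading, assuming patterns are tried in dictionary order and a later match is dropped when it overlaps an earlier one; `find_pattern_spans` is a hypothetical helper, not the author's code.

```python
import re

# Patterns copied from the updated README.
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}


def find_pattern_spans(text, patterns=PATTERNS):
    """Return sorted (start, end, label, matched_text) tuples.

    Assumption: earlier dictionary entries take priority, so a match
    that overlaps an already accepted span is skipped.
    """
    accepted = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            start, end = match.span()
            overlaps = any(start < e and s < end for s, e, _, _ in accepted)
            if not overlaps:
                accepted.append((start, end, label, match.group()))
    return sorted(accepted)


for span in find_pattern_spans("mail ahmed@email.com ip 192.168.1.1"):
    print(span)
```

Under these patterns a bare phone number such as `0501234567` is matched by `NUMERIC_ID` rather than `CONTACT`, since the `CONTACT` regex only covers email addresses and URLs.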
128
 
129
+ ## Advanced Usage
130
 
131
+ <details>
132
+ <summary><strong>Custom Processing Pipeline</strong></summary>
133
 
134
  ```python
135
  import torch
136
  from transformers import AutoTokenizer, AutoModelForTokenClassification
137
 
138
+ def process_arabic_text(text, model, tokenizer):
139
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
140
 
141
      with torch.no_grad():
142
          outputs = model(**inputs)
143
          predictions = torch.argmax(outputs.logits, dim=-1)
144
 
145
      tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
146
      labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
147
 
148
+     # Filter out special tokens
149
+     results = []
150
+     for token, label in zip(tokens, labels):
151
+         if token not in ['[CLS]', '[SEP]', '[PAD]']:
152
+             results.append((token, label))
153
+
154
+     return results
155
+ ```
156
+
157
+ </details>
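
A list of `(token, label)` pairs like the one returned above usually still needs to be grouped into entity spans. The following is a minimal sketch of BIO merging, assuming the tokenizer marks WordPiece continuations with a `##` prefix (an assumption about `ARAB_BERT`, not confirmed by the README); `merge_bio_tags` is a hypothetical helper.

```python
def merge_bio_tags(pairs):
    """Group (token, label) pairs into (entity_type, text) tuples."""
    entities = []
    current_type = None
    current_text = ""

    for token, label in pairs:
        is_subword = token.startswith("##")
        piece = token[2:] if is_subword else token

        # A continuation piece extends the open entity, whatever its label.
        if is_subword and current_type is not None:
            current_text += piece
            continue

        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = label[2:], piece
        elif label.startswith("I-"):
            # Same entity type continues with a new word.
            current_text += " " + piece
        else:
            # "O" label closes any open entity.
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = None, ""

    if current_type is not None:
        entities.append((current_type, current_text))
    return entities
```

Joining pieces with spaces is a simplification; character offsets from the tokenizer would be needed to recover the exact surface text.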
158
 
159
+ <details>
160
+ <summary><strong>Batch Processing</strong></summary>
161
+
162
+ ```python
163
+ def batch_process_texts(texts, model, tokenizer, batch_size=8):
164
+     results = []
165
+     for i in range(0, len(texts), batch_size):
166
+         batch = texts[i:i+batch_size]
167
+         batch_results = []
168
+
169
+         for text in batch:
170
+             entities = ner_pipeline(text)
171
+             batch_results.append(entities)
172
+
173
+         results.extend(batch_results)
174
+
175
+     return results
176
  ```
177
 
178
+ </details>
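
The function above relies on a global `ner_pipeline` and does not use its `model` and `tokenizer` arguments. A possible variant, sketched here under the assumption that the caller passes in any callable that maps a list of strings to one result per string (a `transformers` pipeline called with a list behaves this way), avoids the hidden global:

```python
from typing import Any, Callable, List, Sequence


def batch_process_texts(
    texts: Sequence[str],
    ner: Callable[[List[str]], List[Any]],
    batch_size: int = 8,
) -> List[Any]:
    """Run `ner` over `texts` in chunks and return one result per text."""
    results: List[Any] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # The callable receives the whole chunk, so it can batch internally.
        results.extend(ner(batch))
    return results
```

With the Quick Start objects this would be called as `batch_process_texts(texts, ner_pipeline)`.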
179
+
180
+ ## Model Architecture
181
+
182
+ ```
183
+ Input: Arabic Text
184
+
185
+ Tokenization (Arabic BERT Tokenizer)
186
+
187
+ ARAB_BERT Encoder (12 layers)
188
+
189
+ Classification Head (11 classes)
190
+
191
+ BIO Tag Predictions
192
+ ```
193
+
194
+ ## Limitations & Considerations
195
+
196
+ - **Exact Boundary Detection:** Lower exact match scores indicate challenges with precise entity boundaries
197
+ - **Dialectal Coverage:** Primarily trained on Modern Standard Arabic
198
+ - **Context Sensitivity:** May struggle with context-dependent PII identification
199
+ - **Performance Trade-offs:** Partial-match F1 (0.534) is far higher than exact-match F1 (0.024), so predicted spans often overlap the true entity without matching it exactly
200
+
201
  ## Competition Context
202
 
203
+ Developed for the **Maqsam Arabic PII Redaction Challenge**, which targets gaps in Arabic PII detection systems. The competition emphasized:
204
 
205
+ - Token-level evaluation methodology
206
+ - Real-world deployment considerations
207
+ - Speed optimization for practical applications
208
+ - Arabic-specific linguistic challenges
209
 
210
+ **Evaluation Formula:**
211
  ```
212
  Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
213
  ```
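
The formula can be written as a small function. This is a sketch only: the README does not state the unit of `avg_time` or which precision and recall variant (exact, partial, or IoU50) enters the score, so the inputs in any call are assumptions.

```python
def final_score(precision: float, recall: float, avg_time: float) -> float:
    """Weighted competition score: 0.45*P + 0.45*R + 0.1*(1/avg_time)."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)


# Illustrative inputs only; avg_time = 1.0 is a made-up value.
print(final_score(precision=0.5, recall=0.5, avg_time=1.0))
```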
214
215
  ## Citation
216
 
217
  ```bibtex
218
  @misc{arabic-ner-pii-2024,
219
  author = {MutazYoune},
220
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
221
  year = {2024},
222
  publisher = {Hugging Face},
223
+ journal = {Hugging Face Model Hub},
224
+ howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
225
  }
226
  ```
227
 
228
+ ## Resources
229
+
230
+ - **Base Model:** [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
231
+ - **Competition:** [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
232
+ - **Dataset:** Maqsam/ArabicPIIRedaction
233
+
234
+ ---
235
 
236
+ <div align="center">
237
 
238
+ **[🤗 Model Hub](https://huggingface.co/MutazYoune/Arabic-NER-PII)** • **[📊 Competition](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)** • **[📖 Documentation](https://docs.anthropic.com)**
239
 
240
+ </div>