---
language: ar
license: apache-2.0
base_model: MutazYoune/ARAB_BERT
tags:
- arabic
- ner
- named-entity-recognition
- bert
- token-classification
- pii
- privacy
- maqsam-competition
datasets:
- Maqsam/ArabicPIIRedaction
widget:
- text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
  example_title: Arabic PII Detection
- text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
  example_title: Email Detection
- text: عنوان المنزل هو شارع الملك فهد، الرياض
  example_title: Address Detection
pipeline_tag: token-classification
---

# Arabic NER PII

**Personally Identifiable Information Detection for Arabic Text**

[![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MutazYoune/Arabic-NER-PII)
[![Competition](https://img.shields.io/badge/Maqsam-Challenge-green)](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
[![Arabic](https://img.shields.io/badge/Language-Arabic-red)]()

<p align="center">
  <img src="pii_model_image.png" alt="PII Model" width="400"/>
</p>

## Overview

A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges unique to Arabic NLP, including rich morphology and the absence of capitalization cues for named entities.

**Base Model:** `MutazYoune/ARAB_BERT` | **Task:** Token Classification | **Language:** Arabic

## Quick Start

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
```
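With `aggregation_strategy="simple"`, each returned entity carries `start`/`end` character offsets, which is what a redaction workflow needs. A minimal sketch of span masking (the `redact` helper is illustrative, not part of this model's API; the entity dict below is hand-written in the pipeline's output shape):

```python
def redact(text, entities, mask="[REDACTED]"):
    """Replace each detected entity span with a mask, working right to
    left so earlier character offsets stay valid after each edit."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + mask + text[ent["end"]:]
    return text

# Hand-written example entity in the pipeline's output shape
sample = "call me on 0501234567"
ents = [{"entity_group": "CONTACT", "word": "0501234567", "start": 11, "end": 21}]
print(redact(sample, ents))  # → call me on [REDACTED]
```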

## Supported Entities

| Entity | Description | Examples |
|--------|-------------|----------|
| `CONTACT` | Email addresses, phone numbers | `ahmed@email.com`, `0501234567` |
| `NETWORK` | IP addresses, network identifiers | `192.168.1.1`, `10-20-30-40` |
| `IDENTIFIER` | National IDs, structured identifiers | `ID_123456`, `user.name` |
| `NUMERIC_ID` | Numeric identifiers | `123456789`, `12-34-56` |
| `PII` | Generic personal information | Names, personal details |

## Performance

> **Maqsam Arabic PII Redaction Challenge - Rank #16**

| Metric | Exact | Partial | IoU50 |
|--------|-------|---------|-------|
| **Precision** | 0.029 | 0.647 | 0.295 |
| **Recall** | 0.020 | 0.455 | 0.208 |
| **F1** | 0.024 | 0.534 | 0.244 |

**Overall Score:** 0.5341
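Each F1 row is the harmonic mean of the precision and recall above it; a quick sanity-check sketch reproduces the table values:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.647, 0.455), 3))  # partial-match F1 → 0.534
print(round(f1(0.295, 0.208), 3))  # IoU50 F1 → 0.244
```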

## Training Details

<details>
<summary><strong>Dataset</strong></summary>

- **Source:** Maqsam Arabic PII Redaction Competition Dataset
- **Size:** 20,000 sentences (10k original + 10k LLM-augmented)
- **Annotation:** BIO tagging scheme with regex pattern matching
- **Labels:** 11 total (O + B-/I- for each entity type)

</details>
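The 11-label count follows directly from applying the BIO scheme to the five entity types: one `O` tag plus `B-`/`I-` pairs. A sketch of how such a label set is typically constructed (the ordering here is an assumption, not necessarily the checkpoint's actual `id2label` mapping):

```python
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]

# "O" plus a B- (begin) and I- (inside) tag per entity type: 1 + 2*5 = 11
LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # → 11
```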

<details>
<summary><strong>Training Configuration</strong></summary>

```yaml
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
```
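These hyperparameters map naturally onto Hugging Face `TrainingArguments`; the sketch below assumes the standard `Trainer` API was used, which the card does not confirm (`output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Illustrative mapping of the YAML config above onto Trainer arguments
training_args = TrainingArguments(
    output_dir="arabic-ner-pii",      # placeholder path
    num_train_epochs=12,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    # AdamW is the Trainer's default optimizer
)
```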

</details>

<details>
<summary><strong>Pattern Recognition</strong></summary>

```python
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
```

</details>
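These patterns can be exercised directly with Python's `re` module; a quick check against sample strings from the entity table (the patterns are repeated so the snippet is self-contained):

```python
import re

PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}

# Each sample should match its corresponding pattern
assert re.search(PATTERNS["CONTACT"], "fatima@email.com")
assert re.search(PATTERNS["NETWORK"], "192.168.1.1")
assert re.search(PATTERNS["NUMERIC_ID"], "0501234567")  # matches \d{6,}
```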

## Advanced Usage

<details>
<summary><strong>Custom Processing Pipeline</strong></summary>

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
    
    # Filter out the tokenizer's special tokens ([CLS], [SEP], [PAD], ...)
    results = []
    for token, label in zip(tokens, labels):
        if token not in tokenizer.all_special_tokens:
            results.append((token, label))
    
    return results
```

</details>

<details>
<summary><strong>Batch Processing</strong></summary>

```python
def batch_process_texts(texts, ner_pipeline, batch_size=8):
    """Run the NER pipeline over a list of texts in batches."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # The pipeline accepts a list of strings and returns
        # one list of entities per input text
        results.extend(ner_pipeline(batch))
    return results
```

</details>

## Model Architecture

```
Input: Arabic Text
        ↓
Tokenization (Arabic BERT Tokenizer)
        ↓
ARAB_BERT Encoder (12 layers)
        ↓
Classification Head (11 classes)
        ↓
BIO Tag Predictions
```

## Limitations & Considerations

- **Exact Boundary Detection:** Low exact-match scores show the model struggles to pin down precise entity boundaries
- **Dialectal Coverage:** Trained primarily on Modern Standard Arabic; dialectal text may degrade performance
- **Context Sensitivity:** May miss PII whose sensitivity depends on surrounding context
- **Performance Trade-offs:** Partial-match scores are far higher than exact-match scores, so downstream redaction should tolerate loose span boundaries

## Competition Context

Developed for the **Maqsam Arabic PII Redaction Challenge** addressing critical gaps in Arabic PII detection systems. The competition emphasized:

- Token-level evaluation methodology
- Real-world deployment considerations  
- Speed optimization for practical applications
- Arabic-specific linguistic challenges

**Evaluation Formula:**
```
Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
```
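Expressed as a function (the timing term rewards faster systems; the actual `avg_time` behind the 0.5341 overall score is not published here, so the numbers below are illustrative only):

```python
def final_score(precision, recall, avg_time_s):
    # 0.45 / 0.45 / 0.1 weighting from the competition formula above
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time_s)

# Illustrative inputs, not the model's actual competition numbers
print(round(final_score(0.6, 0.5, 2.0), 3))  # → 0.545
```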

## Citation

```bibtex
@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
```

## Resources

- **Base Model:** [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
- **Competition:** [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
- **Dataset:** Maqsam/ArabicPIIRedaction

---

<div align="center">

**[🤗 Model Hub](https://huggingface.co/MutazYoune/Arabic-NER-PII)** · **[📊 Competition](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)**

</div>