---
language: ar
license: apache-2.0
base_model: MutazYoune/ARAB_BERT
tags:
- arabic
- ner
- named-entity-recognition
- bert
- token-classification
- pii
- privacy
- maqsam-competition
datasets:
- Maqsam/ArabicPIIRedaction
widget:
- text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
  example_title: Arabic PII Detection
- text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
  example_title: Email Detection
- text: عنوان المنزل هو شارع الملك فهد، الرياض
  example_title: Address Detection
pipeline_tag: token-classification
---
# Arabic NER PII
**Personally Identifiable Information Detection for Arabic Text**
## Overview
A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It targets challenges specific to Arabic NLP, including morphological complexity and the absence of capitalization as a cue for named entities.
**Base Model:** `MutazYoune/ARAB_BERT` | **Task:** Token Classification | **Language:** Arabic
## Quick Start
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
```
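The pipeline output can feed a simple redaction step. The sketch below assumes each entity is a dict with `entity_group`, `start`, and `end` keys, which is the shape the `ner` pipeline returns with `aggregation_strategy="simple"`. The `redact` helper, the placeholder format, and the hand-written sample entities are illustrative and not part of the model.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written entities standing in for real pipeline output (assumption)
sample = "Call 0501234567 or write to ahmed@email.com"
fake_entities = []
for value in ("0501234567", "ahmed@email.com"):
    start = sample.index(value)
    fake_entities.append(
        {"entity_group": "CONTACT", "start": start, "end": start + len(value)}
    )

print(redact(sample, fake_entities))
```

With real input, pass `ner_pipeline(text)` as the `entities` argument. Character offsets work the same way on Arabic text.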
## Supported Entities
| Entity | Description | Examples |
|--------|-------------|----------|
| `CONTACT` | Email addresses, phone numbers | `ahmed@email.com`, `0501234567` |
| `NETWORK` | IP addresses, network identifiers | `192.168.1.1`, `10-20-30-40` |
| `IDENTIFIER` | National IDs, structured identifiers | `ID_123456`, `user.name` |
| `NUMERIC_ID` | Numeric identifiers | `123456789`, `12-34-56` |
| `PII` | Generic personal information | Names, personal details |
## Performance
> **Maqsam Arabic PII Redaction Challenge - Rank #16**
| Metric | Exact | Partial | IoU50 |
|--------|-------|---------|-------|
| **Precision** | 0.029 | 0.647 | 0.295 |
| **Recall** | 0.020 | 0.455 | 0.208 |
| **F1** | 0.024 | 0.534 | 0.244 |
**Overall Score:** 0.5341
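As a sanity check, the F1 row can be recomputed from the precision and recall rows using the standard harmonic-mean definition. This is a minimal sketch; the `f1` helper is illustrative and is not the competition's scoring code.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "Exact": (0.029, 0.020),
    "Partial": (0.647, 0.455),
    "IoU50": (0.295, 0.208),
}

for name, (p, r) in reported.items():
    print(name, round(f1(p, r), 3))
```

The rounded values agree with the F1 row (0.024, 0.534, 0.244).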
## Training Details
### Dataset
- **Source:** Maqsam Arabic PII Redaction Competition Dataset
- **Size:** 20,000 sentences (10k original + 10k LLM-augmented)
- **Annotation:** BIO tagging scheme with regex pattern matching
- **Labels:** 11 total (O + B-/I- for each entity type)
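The count of 11 labels follows from one `O` tag plus a `B-`/`I-` pair for each of the five entity types. A minimal sketch of that label set is shown below; the ordering of the labels here is an assumption, and the authoritative mapping is the one stored in `model.config.id2label`.

```python
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]

# One "outside" tag, then a begin/inside pair per entity type
labels = ["O"]
for entity in ENTITY_TYPES:
    labels += [f"B-{entity}", f"I-{entity}"]

# Hypothetical id mapping; the real model may order labels differently
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))
```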
### Training Configuration
```yaml
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
```
### Pattern Recognition
```python
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
```
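To illustrate how regex matches like these could be turned into BIO tags, the sketch below applies the patterns to whitespace-separated tokens. The `find_spans` and `bio_tags` helpers, the whitespace tokenization, and the rule that the earliest-starting match wins when patterns overlap are all assumptions made for illustration; the actual annotation procedure is not described in detail here.

```python
import re

PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}


def find_spans(text):
    """Return (start, end, label) for every pattern match, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)


def bio_tags(text):
    """Tag whitespace tokens; the earliest-starting overlapping span wins."""
    spans = find_spans(text)
    tags, open_span = [], None
    for m in re.finditer(r"\S+", text):
        hit = next((s for s in spans if s[0] < m.end() and m.start() < s[1]), None)
        if hit is None:
            tags.append((m.group(), "O"))
        else:
            # Continue with I- only if the previous token was in the same span
            prefix = "I" if hit == open_span else "B"
            tags.append((m.group(), f"{prefix}-{hit[2]}"))
        open_span = hit
    return tags


print(bio_tags("البريد ahmed@email.com والعنوان 192.168.1.1"))
```

Note that under these four patterns a bare phone number such as `0501234567` matches `NUMERIC_ID` (six or more digits) rather than `CONTACT`.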
## Advanced Usage
### Custom Processing Pipeline
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification


def process_arabic_text(text, model, tokenizer):
    """Return (token, label) pairs for one text, excluding special tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the highest-scoring label for each token
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

    # Filter out special tokens such as [CLS], [SEP], and [PAD]
    special_tokens = set(tokenizer.all_special_tokens)
    return [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in special_tokens
    ]
```
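The function above returns subword tokens, so one Arabic word may appear as several pieces. The sketch below merges pieces back into words, assuming the tokenizer uses the standard BERT WordPiece convention of a `##` prefix on continuation pieces. The `merge_wordpieces` helper is hypothetical, and keeping the first piece's label is one possible policy among several.

```python
def merge_wordpieces(pairs):
    """Join '##' continuation pieces onto the previous token.

    Each merged word keeps the label of its first piece.
    """
    words = []
    for token, label in pairs:
        if token.startswith("##") and words:
            prev_token, prev_label = words[-1]
            words[-1] = (prev_token + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hand-written (token, label) pairs standing in for process_arabic_text output
pieces = [("مح", "B-PII"), ("##مد", "I-PII"), ("في", "O")]
print(merge_wordpieces(pieces))
```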
### Batch Processing
```python
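from transformers import pipeline


def batch_process_texts(texts, model, tokenizer, batch_size=8):
    """Run NER over a list of texts; returns one entity list per input text."""
    # Build the pipeline from the arguments instead of relying on a global
    ner = pipeline(
        "ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple"
    )
    # Passing a list lets the pipeline batch the forward passes itself
    return list(ner(texts, batch_size=batch_size))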
```
## Model Architecture
```
Input: Arabic Text
↓
Tokenization (Arabic BERT Tokenizer)
↓
ARAB_BERT Encoder (12 layers)
↓
Classification Head (11 classes)
↓
BIO Tag Predictions
```
## Limitations & Considerations
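- **Exact Boundary Detection:** Exact-match F1 is 0.024 versus 0.534 for partial match, so the model usually finds the region of an entity but rarely its exact boundaries.
- **Dialectal Coverage:** Trained primarily on Modern Standard Arabic; performance on dialectal text may be lower.
- **Context Sensitivity:** May miss or mislabel PII whose identification depends on surrounding context.
- **Performance Trade-offs:** Given the gap between partial and exact scores, the model is better suited to flagging text that contains PII than to extracting exact entity spans.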
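The difference between exact, IoU50, and partial matching can be made concrete with span overlap. The sketch below assumes that "exact" means identical boundaries, "IoU50" means intersection-over-union of at least 0.5, and "partial" means any overlap, all measured on character spans. These definitions and the `span_iou` and `match_type` helpers are assumptions for illustration; the competition's own scoring may differ.

```python
def span_iou(pred, gold):
    """IoU of two half-open character spans given as (start, end)."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - overlap
    return overlap / union if union else 0.0


def match_type(pred, gold):
    """Classify a predicted span against a gold span (assumed definitions)."""
    if pred == gold:
        return "exact"
    iou = span_iou(pred, gold)
    if iou >= 0.5:
        return "iou50"
    return "partial" if iou > 0 else "miss"


gold = (10, 20)
for pred in [(10, 20), (12, 20), (18, 30), (30, 40)]:
    print(pred, match_type(pred, gold))
```

A prediction that clips two characters off the start of a ten-character entity still passes the IoU50 threshold but fails exact match, which is the pattern the scores above suggest.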
## Competition Context
This model was developed for the **Maqsam Arabic PII Redaction Challenge**, which addresses gaps in Arabic PII detection systems. The competition emphasized:
- Token-level evaluation methodology
- Real-world deployment considerations
- Speed optimization for practical applications
- Arabic-specific linguistic challenges
**Evaluation Formula:**
```
Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
```
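The formula translates directly into a small function. This is a sketch only: the units of `avg_time` and which precision and recall variant the competition plugs in are not stated here, so the inputs in the example are arbitrary placeholder values rather than this model's results.

```python
def final_score(precision, recall, avg_time):
    """Weighted score from the formula above; avg_time must be positive."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return 0.45 * precision + 0.45 * recall + 0.1 * (1 / avg_time)


# Placeholder inputs chosen only to show the arithmetic
print(round(final_score(precision=0.5, recall=0.5, avg_time=2.0), 4))
```

Because the speed term is `1/avg_time`, halving the average time doubles that term's contribution, while precision and recall are weighted equally.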
## Citation
```bibtex
@misc{arabic-ner-pii-2024,
author = {MutazYoune},
title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
year = {2024},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
```
## Resources
- **Base Model:** [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
- **Competition:** [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
- **Dataset:** Maqsam/ArabicPIIRedaction
---
**[🤗 Model Hub](https://huggingface.co/MutazYoune/Arabic-NER-PII)** • **[📊 Competition](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)**