---
language: ar
license: apache-2.0
base_model: MutazYoune/ARAB_BERT
tags:
- arabic
- ner
- named-entity-recognition
- bert
- token-classification
- pii
- privacy
- maqsam-competition
datasets:
- Maqsam/ArabicPIIRedaction
widget:
- text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
  example_title: Arabic PII Detection
- text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
  example_title: Email Detection
- text: عنوان المنزل هو شارع الملك فهد، الرياض
  example_title: Address Detection
pipeline_tag: token-classification
---

# Arabic NER PII

**Personally Identifiable Information Detection for Arabic Text**

[![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MutazYoune/Arabic-NER-PII)
[![Competition](https://img.shields.io/badge/Maqsam-Challenge-green)](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
[![Arabic](https://img.shields.io/badge/Language-Arabic-red)]()

<p align="center">
  <img src="pii_model_image.png" alt="PII Model" width="400"/>
</p>

## Overview

A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges unique to Arabic NLP, including rich morphology and the absence of capitalization cues for named entities.

**Base Model:** `MutazYoune/ARAB_BERT` | **Task:** Token Classification | **Language:** Arabic

## Quick Start

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
```
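With `aggregation_strategy="simple"`, each returned entity carries `start`/`end` character offsets, which is what a redaction workflow needs. A minimal sketch of span masking (the `redact` helper is illustrative, not part of this model's API; the entity dict below is hand-written in the pipeline's output shape):

```python
def redact(text, entities, mask="[REDACTED]"):
    """Replace each detected entity span with a mask, working right to
    left so earlier character offsets stay valid after each edit."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + mask + text[ent["end"]:]
    return text

# Hand-written example entity in the pipeline's output shape
sample = "call me on 0501234567"
ents = [{"entity_group": "CONTACT", "word": "0501234567", "start": 11, "end": 21}]
print(redact(sample, ents))  # → call me on [REDACTED]
```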

## Supported Entities

| Entity | Description | Examples |
|--------|-------------|----------|
| `CONTACT` | Email addresses, phone numbers | `ahmed@email.com`, `0501234567` |
| `NETWORK` | IP addresses, network identifiers | `192.168.1.1`, `10-20-30-40` |
| `IDENTIFIER` | National IDs, structured identifiers | `ID_123456`, `user.name` |
| `NUMERIC_ID` | Numeric identifiers | `123456789`, `12-34-56` |
| `PII` | Generic personal information | Names, personal details |

## Performance

> **Maqsam Arabic PII Redaction Challenge - Rank #16**

| Metric | Exact | Partial | IoU50 |
|--------|-------|---------|-------|
| **Precision** | 0.029 | 0.647 | 0.295 |
| **Recall** | 0.020 | 0.455 | 0.208 |
| **F1** | 0.024 | 0.534 | 0.244 |

**Overall Score:** 0.5341
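Each F1 row is the harmonic mean of the precision and recall above it; a quick sanity-check sketch reproduces the table values:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.647, 0.455), 3))  # partial-match F1 → 0.534
print(round(f1(0.295, 0.208), 3))  # IoU50 F1 → 0.244
```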

## Training Details

<details>
<summary><strong>Dataset</strong></summary>

- **Source:** Maqsam Arabic PII Redaction Competition Dataset
- **Size:** 20,000 sentences (10k original + 10k LLM-augmented)
- **Annotation:** BIO tagging scheme with regex pattern matching
- **Labels:** 11 total (O + B-/I- for each entity type)

</details>
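The 11-label count follows directly from applying the BIO scheme to the five entity types: one `O` tag plus `B-`/`I-` pairs. A sketch of how such a label set is typically constructed (the ordering here is an assumption, not necessarily the checkpoint's actual `id2label` mapping):

```python
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]

# "O" plus a B- (begin) and I- (inside) tag per entity type: 1 + 2*5 = 11
LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # → 11
```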

<details>
<summary><strong>Training Configuration</strong></summary>

```yaml
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
```
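These hyperparameters map naturally onto Hugging Face `TrainingArguments`; the sketch below assumes the standard `Trainer` API was used, which the card does not confirm (`output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Illustrative mapping of the YAML config above onto Trainer arguments
training_args = TrainingArguments(
    output_dir="arabic-ner-pii",      # placeholder path
    num_train_epochs=12,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    # AdamW is the Trainer's default optimizer
)
```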

</details>

<details>
<summary><strong>Pattern Recognition</strong></summary>

```python
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
```

</details>
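These patterns can be exercised directly with Python's `re` module; a quick check against sample strings from the entity table (the patterns are repeated so the snippet is self-contained):

```python
import re

PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}

# Each sample should match its corresponding pattern
assert re.search(PATTERNS["CONTACT"], "fatima@email.com")
assert re.search(PATTERNS["NETWORK"], "192.168.1.1")
assert re.search(PATTERNS["NUMERIC_ID"], "0501234567")  # matches \d{6,}
```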

## Advanced Usage

<details>
<summary><strong>Custom Processing Pipeline</strong></summary>

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
    
    # Filter out the tokenizer's special tokens ([CLS], [SEP], [PAD], ...)
    results = []
    for token, label in zip(tokens, labels):
        if token not in tokenizer.all_special_tokens:
            results.append((token, label))
    
    return results
```

</details>

<details>
<summary><strong>Batch Processing</strong></summary>

```python
def batch_process_texts(texts, ner_pipeline, batch_size=8):
    """Run the NER pipeline over a list of texts in batches."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # The pipeline accepts a list of strings and returns
        # one list of entities per input text
        results.extend(ner_pipeline(batch))
    return results
```

</details>

## Model Architecture

```
Input: Arabic Text
        ↓
Tokenization (Arabic BERT Tokenizer)
        ↓
ARAB_BERT Encoder (12 layers)
        ↓
Classification Head (11 classes)
        ↓
BIO Tag Predictions
```

## Limitations & Considerations

- **Exact Boundary Detection:** Low exact-match scores show the model struggles to pin down precise entity boundaries
- **Dialectal Coverage:** Trained primarily on Modern Standard Arabic; dialectal text may degrade performance
- **Context Sensitivity:** May miss PII whose sensitivity depends on surrounding context
- **Performance Trade-offs:** Partial-match scores are far higher than exact-match scores, so downstream redaction should tolerate loose span boundaries

## Competition Context

Developed for the **Maqsam Arabic PII Redaction Challenge** addressing critical gaps in Arabic PII detection systems. The competition emphasized:

- Token-level evaluation methodology
- Real-world deployment considerations  
- Speed optimization for practical applications
- Arabic-specific linguistic challenges

**Evaluation Formula:**
```
Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
```
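Expressed as a function (the timing term rewards faster systems; the actual `avg_time` behind the 0.5341 overall score is not published here, so the numbers below are illustrative only):

```python
def final_score(precision, recall, avg_time_s):
    # 0.45 / 0.45 / 0.1 weighting from the competition formula above
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time_s)

# Illustrative inputs, not the model's actual competition numbers
print(round(final_score(0.6, 0.5, 2.0), 3))  # → 0.545
```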

## Citation

```bibtex
@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
```

## Resources

- **Base Model:** [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
- **Competition:** [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
- **Dataset:** Maqsam/ArabicPIIRedaction

---

<div align="center">

**[🤗 Model Hub](https://huggingface.co/MutazYoune/Arabic-NER-PII)** · **[📊 Competition](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)**

</div>