MutazYoune/Arabic-NER-PII

@@ -10,77 +10,190 @@ tags:
 - token-classification
 - pii
 - privacy
 datasets:
-- custom
 widget:
 - text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
   example_title: Arabic PII Detection
 - text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
   example_title: Email Detection
 pipeline_tag: token-classification
 ---
-# MutazYoune/Arabic-NER-PII
 ## Model Overview
-This state-of-the-art Arabic Named Entity Recognition (NER) model is fine-tuned on top of the powerful `MutazYoune/ARAB_BERT` architecture. Designed specifically for detecting and redacting Personally Identifiable Information (PII) in Arabic text, it excels at recognizing sensitive data embedded within sentences.
-This model was carefully trained to serve Arabic NLP applications requiring privacy and security, making it suitable for tasks such as data anonymization, document redaction, and compliance with data protection laws.
-## What It Detects
-Our model can identify a wide spectrum of PII categories in Arabic text, including but not limited to:
-- Personal Names (first, middle, family)
-- Phone Numbers
-- Email Addresses
-- Physical Addresses
-- National ID Numbers
-- Bank Account Details
-- Dates of Birth
-## Model Specifications
-- Architecture: BERT-based Token Classification
-- Base Model: `MutazYoune/ARAB_BERT`
-- Language: Arabic (Modern Standard and Dialects)
-- Task: Named Entity Recognition & PII Redaction
 ## Training Details
-| Parameter     | Value       |
-|---------------|-------------|
-| Epochs        | 12          |
-| Batch Size    | 16          |
-| Learning Rate | 3e-5        |
-## Supported Entity Tags
-| Entity      | Description                         |
-|-------------|-----------------------------------|
-| CONTACT     | Emails, phone numbers, addresses  |
-| IDENTIFIER  | National IDs, bank accounts        |
-| NETWORK     | IP addresses, online identifiers  |
-| NUMERIC_ID  | Numeric IDs like passport numbers |
-| PII         | Generic personally identifiable info|
-## How to Use
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-# Load tokenizer and model
 tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
 model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
-# Create a NER pipeline with aggregation for cleaner output
-ner_pipeline = pipeline("ner",
-                       model=model,
-                       tokenizer=tokenizer,
-                       aggregation_strategy="simple")
-# Test example
-text = "أحمد محمد يعمل في شركة جوجل في الرياض"
 entities = ner_pipeline(text)
-print(entities)

 - token-classification
 - pii
 - privacy
+- maqsam-competition
 datasets:
+- Maqsam/ArabicPIIRedaction
 widget:
 - text: أحمد محمد يعمل في شركة جوجل في الرياض ورقم هاتفه 0501234567
   example_title: Arabic PII Detection
 - text: تواصل مع فاطمة الزهراني على البريد الإلكتروني fatima@email.com
   example_title: Email Detection
+- text: عنوان المنزل هو شارع الملك فهد، الرياض
+  example_title: Address Detection
 pipeline_tag: token-classification
 ---
+# Arabic NER PII - Personally Identifiable Information Detection
 ## Model Overview
+This Arabic Named Entity Recognition (NER) model detects Personally Identifiable Information (PII) in Arabic text. It is built on `MutazYoune/ARAB_BERT` and targets two difficulties specific to Arabic NER: morphological complexity and the absence of capitalization, a cue that often helps entity recognition in other scripts.
+It was developed for the Maqsam Arabic PII Redaction Challenge, where it placed 16th with a partial-match F1 of 0.5341.
+## Entity Categories
+The model identifies five main categories of PII in Arabic text:
+- **CONTACT**: Email addresses, phone numbers, and contact information
+- **NETWORK**: IP addresses and network identifiers
+- **IDENTIFIER**: National IDs, bank accounts, and structured identifiers
+- **NUMERIC_ID**: Numeric identifiers like passport numbers, account numbers
+- **PII**: Generic personally identifiable information (names, personal details)
+## Performance Metrics
+Based on the Maqsam competition evaluation (token-level classification):
+| Metric | Score |
+|--------|-------|
+| **Best Overall Score** | 0.5341 |
+| **Exact F1** | 0.0239 |
+| **Exact Precision** | 0.0290 |
+| **Exact Recall** | 0.0200 |
+| **Partial F1** | 0.5341 |
+| **Partial Precision** | 0.6470 |
+| **Partial Recall** | 0.4550 |
+| **IoU50 F1** | 0.2439 |
+| **IoU50 Precision** | 0.2950 |
+| **IoU50 Recall** | 0.2080 |
+*Competition Ranking: 16th place (Prophtech-AI team)*
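The F1 rows in the table can be sanity-checked against the precision and recall rows, since F1 is the harmonic mean of the two. The sketch below recomputes it from the reported figures; the helper name `f1` is illustrative and not part of the competition tooling.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Exact": (0.0290, 0.0200, 0.0239),
    "Partial": (0.6470, 0.4550, 0.5341),
    "IoU50": (0.2950, 0.2080, 0.2439),
}

for name, (p, r, reported_f1) in reported.items():
    print(f"{name}: recomputed {f1(p, r):.4f}, reported {reported_f1:.4f}")
```

The recomputed values agree with the reported ones to within 0.001. The card does not say where the remaining gaps come from; rounding of the published precision and recall is one possible cause.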
+## Architecture
+- **Base Model**: MutazYoune/ARAB_BERT
+- **Architecture**: BERT-based Token Classification
+- **Language**: Arabic (Modern Standard Arabic and regional dialects)
+- **Task**: Named Entity Recognition for PII Detection
+- **Labels**: BIO tagging scheme with 11 labels (O, B-/I- for each entity type)
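The count of 11 labels follows from the five entity types: one `B-` and one `I-` tag per type, plus `O`. A minimal sketch of that label space is shown below. The ordering and the `label2id` mapping are assumptions; the authoritative mapping is whatever `model.config.id2label` reports for the published checkpoint.

```python
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]

# "O" marks non-entity tokens; each entity type gets a B- (begin) and I- (inside) tag.
labels = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")
]

# Hypothetical id assignment; the real checkpoint may order labels differently.
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # 11
```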
 ## Training Details
+### Dataset
+- **Primary Dataset**: [Maqsam Arabic PII Redaction Competition Dataset](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction) (10,000 records)
+- **Augmented Data**: 10,000 additional LLM-generated records
+- **Total Training Data**: 20,000 annotated Arabic sentences
+- **Annotation Scheme**: BIO tagging with regex-based pattern recognition for structured entities
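As an illustration of how regex matches can be turned into BIO tags, the sketch below tags whitespace-separated tokens that overlap a match. This is an assumed procedure, not the author's annotation script: the function name `regex_bio_tags` and the whitespace tokenization are hypothetical choices.

```python
import re


def regex_bio_tags(text, pattern, label):
    """Tag whitespace-separated tokens with B-/I-<label> where they overlap a regex match."""
    matches = [m.span() for m in re.finditer(pattern, text)]
    tagged = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in matches:
            if tok.start() < end and tok.end() > start:  # token overlaps this match
                # The token covering the start of the match begins the entity.
                tag = ("B-" if tok.start() <= start else "I-") + label
                break
        tagged.append((tok.group(), tag))
    return tagged


print(regex_bio_tags("رقم الحساب 1234-5678 فقط", r"\d+\-\d+|\d{6,}", "NUMERIC_ID"))
```

Only tokens that overlap a match receive an entity tag; everything else stays `O`.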
+### Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Epochs | 12 |
+| Batch Size | 16 |
+| Learning Rate | 3e-5 |
+| Base Model | MutazYoune/ARAB_BERT |
+### Pattern Recognition Strategy
+The neural token classifier is paired with regex-based pattern matching for structured entities such as email addresses, IP addresses, and numeric IDs:
+```python
+PATTERNS = {
+    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
+    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
+    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+\d+|\d+[a-zA-Z]+\d+[a-zA-Z]+',
+    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
+}
+```
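To show what these patterns match in practice, the sketch below runs each one over a string and collects the spans. The helper `find_pattern_spans` is hypothetical; the card does not describe how regex matches are merged with the model's predictions.

```python
import re

# Patterns copied from the block above.
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+\d+|\d+[a-zA-Z]+\d+[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}


def find_pattern_spans(text):
    """Return (start, end, label, matched_text) for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)


for span in find_pattern_spans("mail fatima@email.com ip 192.168.1.1 id 0501234567"):
    print(span)
```

The patterns are not mutually exclusive: the `IDENTIFIER` pattern also matches `email.com` inside the email address. Any real use therefore needs a precedence rule (for example, keeping the longest match), which the card does not specify.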
+## Usage
+### Quick Start
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+# Load the model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
 model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
+# Create NER pipeline
+ner_pipeline = pipeline(
+    "ner",
+    model=model,
+    tokenizer=tokenizer,
+    aggregation_strategy="simple"
+)
+# Example usage
+text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
 entities = ner_pipeline(text)
+for entity in entities:
+    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Confidence: {entity['score']:.4f}")
+```
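With `aggregation_strategy="simple"`, each returned entity also carries `start` and `end` character offsets, which is enough to redact the input. The sketch below is one way to do that; `redact`, `make_entity`, and the `min_score` threshold are hypothetical names, and the sample entities are hand-written stand-ins rather than real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each sufficiently confident entity span with a [LABEL] placeholder."""
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


def make_entity(text, word, label, score):
    """Build a hand-written entity dict in the pipeline's output shape (not model output)."""
    start = text.find(word)
    return {"entity_group": label, "word": word, "score": score,
            "start": start, "end": start + len(word)}


sample_text = "للتواصل مع سارة على الرقم 0501234567"
sample_entities = [
    make_entity(sample_text, "سارة", "PII", 0.91),
    make_entity(sample_text, "0501234567", "CONTACT", 0.88),
]
print(redact(sample_text, sample_entities))
```

This version assumes the entity spans do not overlap; overlapping spans would need to be merged before replacement.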
+### Advanced Usage with Custom Processing
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
+model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")
+def predict_pii(text):
+    """Return (token, label) pairs, including special tokens and any subword pieces."""
+    # Tokenize input
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+    # Get predictions
+    with torch.no_grad():
+        outputs = model(**inputs)
+        predictions = torch.argmax(outputs.logits, dim=-1)
+    # Decode predictions
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
+    return list(zip(tokens, labels))
+# Example
+text = "للتواصل مع سارة على الرقم 0501234567"
+results = predict_pii(text)
+print(results)
+```
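The raw `(token, label)` pairs still include special tokens and subword pieces, so they usually need to be grouped into entity spans. The sketch below does this for BIO labels. It assumes a WordPiece-style tokenizer that marks continuation pieces with `##`, which the card does not confirm, and `bio_to_spans` is a hypothetical helper.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def bio_to_spans(pairs):
    """Group (token, label) pairs into (entity_type, text) spans."""
    spans, current = [], None  # current holds [entity_type, text] for the open span
    for token, label in pairs:
        if token in SPECIAL_TOKENS:
            continue
        is_piece = token.startswith("##")
        piece = token[2:] if is_piece else token
        etype = label[2:] if label != "O" else None
        if current and is_piece:
            # A continuation piece always extends the word it belongs to.
            current[1] += piece
        elif current and label.startswith("I-") and etype == current[0]:
            # A new word that continues the same entity.
            current[1] += " " + piece
        else:
            # Close the open span, then start a new one if this token is an entity.
            if current:
                spans.append(tuple(current))
            current = [etype, piece] if etype else None
    if current:
        spans.append(tuple(current))
    return spans


# Hand-written pairs in the shape predict_pii returns (not real model output).
pairs = [("[CLS]", "O"), ("call", "O"), ("sara", "B-PII"), ("on", "O"),
         ("0501", "B-CONTACT"), ("##234567", "I-CONTACT"), ("[SEP]", "O")]
print(bio_to_spans(pairs))
```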
+## Competition Context
+This model was developed for the **Maqsam Arabic PII Redaction Challenge**, which aimed to address the critical need for Arabic PII detection systems. The competition focused on:
+- **Token-level evaluation** with precision, recall, and F1 metrics
+- **Real-world applicability** for data protection compliance
+- **Speed optimization** for practical deployment
+- **Handling Arabic-specific challenges** like morphological complexity and lack of capitalization
+The final competition score combined multiple metrics:
+```
+Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
+```
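The formula above can be written as a small function. This is a sketch under stated assumptions: the card does not give the unit of `avg_time` or say whether the speed term is capped, so the function applies the formula literally, and the input values in the example are illustrative rather than competition results.

```python
def final_score(precision, recall, avg_time):
    """Weighted score: 0.45 * precision + 0.45 * recall + 0.1 * (1 / avg_time)."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)


# Illustrative inputs only.
print(f"{final_score(precision=0.6, recall=0.5, avg_time=2.0):.3f}")
```

Applied literally, the speed term grows without bound as `avg_time` approaches zero, so very fast systems could gain more than 0.1 from it unless the organizers normalized the timing.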
+## Limitations
+1. **Boundary Detection**: Exact-match F1 (0.0239) is far below partial-match F1 (0.5341), so the model often finds the right region but rarely the exact entity boundaries
+2. **Dialectal Coverage**: Primarily trained on Modern Standard Arabic with limited dialectal variations
+3. **Context Dependency**: May struggle with context-dependent PII that doesn't follow clear patterns
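+4. **Missed Entities and False Positives**: Recall is lower than precision on every metric (partial recall 0.4550 vs. partial precision 0.6470), so missed PII is the larger problem, while the precision scores still leave a substantial share of false positives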
+## Citation
+If you use this model in your research or applications, please cite:
+```bibtex
+@misc{arabic-ner-pii-2024,
+  author = {MutazYoune},
+  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/MutazYoune/Arabic-NER-PII}
+}
+```
+## Related Resources
+- **Base Model**: [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
+- **Competition**: [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
+- **Dataset**: Maqsam/ArabicPIIRedaction
+## License
+This model is released under the Apache 2.0 License, making it suitable for both research and commercial applications with appropriate attribution.