Update README.md

README.md

pipeline_tag: token-classification
---

<div align="center">

# Arabic NER PII

**Personally Identifiable Information Detection for Arabic Text**

[Model on Hugging Face](https://huggingface.co/MutazYoune/Arabic-NER-PII)
[Competition Space](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

</div>

## Overview

A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges unique to Arabic NLP, including morphological complexity and the absence of capitalization cues.

**Base Model:** `MutazYoune/ARAB_BERT` | **Task:** Token Classification | **Language:** Arabic

## Quick Start

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
```
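
With `aggregation_strategy="simple"`, each item in `entities` is a dict carrying `entity_group`, `score`, `word`, and the character offsets `start` and `end`.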

## Supported Entities

| Entity | Description | Examples |
|--------|-------------|----------|
| `CONTACT` | Email addresses, phone numbers | `ahmed@email.com`, `0501234567` |
| `NETWORK` | IP addresses, network identifiers | `192.168.1.1`, `10-20-30-40` |
| `IDENTIFIER` | National IDs, structured identifiers | `ID_123456`, `user.name` |
| `NUMERIC_ID` | Numeric identifiers | `123456789`, `12-34-56` |
| `PII` | Generic personal information | Names, personal details |
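
For redaction, the character offsets let you mask any of these entity types in place. A minimal sketch reusing `text` and `ner_pipeline` from Quick Start (the `redact` helper and `[LABEL]` placeholder format are illustrative, not part of the model):

```python
def redact(text, entities):
    # Replace spans right-to-left so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(redact(text, ner_pipeline(text)))
```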

## Performance

> **Maqsam Arabic PII Redaction Challenge - Rank #16**

| Metric | Exact | Partial | IoU50 |
|--------|-------|---------|-------|
| **Precision** | 0.029 | 0.647 | 0.295 |
| **Recall** | 0.020 | 0.455 | 0.208 |
| **F1** | 0.024 | 0.534 | 0.244 |

**Overall Score:** 0.5341
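
The columns reflect three matching criteria of decreasing strictness on span boundaries: Exact requires identical boundaries, IoU50 accepts at least 50% character overlap, and Partial credits any overlap. A sketch of the IoU criterion, assuming character-level spans (the competition's exact implementation may differ):

```python
def span_iou(pred, gold):
    # pred and gold are (start, end) character spans
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0

print(span_iou((0, 10), (5, 10)) >= 0.5)  # True: counts as an IoU50 match
```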

## Training Details

<details>
<summary><strong>Dataset</strong></summary>

- **Source:** Maqsam Arabic PII Redaction Competition Dataset
- **Size:** 20,000 sentences (10k original + 10k LLM-augmented)
- **Annotation:** BIO tagging scheme with regex pattern matching
- **Labels:** 11 total (O + B-/I- for each entity type)

</details>
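
The 11-label count follows from the five entity types: one `O` tag plus a `B-`/`I-` pair per type. A quick illustration (not the model's actual construction code):

```python
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]

labels = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
print(len(labels))  # 11
```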

<details>
<summary><strong>Training Configuration</strong></summary>

```yaml
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
```

</details>
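
These hyperparameters map onto a standard `transformers` fine-tuning loop. A minimal sketch, assuming `train_dataset` and `eval_dataset` are already tokenized with aligned BIO labels (both names are placeholders):

```python
from transformers import Trainer, TrainingArguments

# Mirrors the configuration above; AdamW is the Trainer default
args = TrainingArguments(
    output_dir="arabic-ner-pii",
    num_train_epochs=12,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```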

<details>
<summary><strong>Pattern Recognition</strong></summary>

```python
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
```

</details>
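
The dataset section notes that annotation paired BIO tagging with these regexes; a sketch of how pattern hits could be turned into labeled character spans (the `pattern_spans` helper is illustrative):

```python
import re

def pattern_spans(text, patterns):
    # Collect (start, end, label) for every regex hit
    spans = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append((m.start(), m.end(), label))
    return spans

print(pattern_spans("Contact: user@mail.com or 192.168.1.1", PATTERNS))
```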

## Advanced Usage

<details>
<summary><strong>Custom Processing Pipeline</strong></summary>

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    # Decode predictions into (token, BIO label) pairs
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

    # Filter out special tokens
    results = []
    for token, label in zip(tokens, labels):
        if token not in ['[CLS]', '[SEP]', '[PAD]']:
            results.append((token, label))

    return results
```

</details>

<details>
<summary><strong>Batch Processing</strong></summary>

```python
def batch_process_texts(texts, ner_pipeline, batch_size=8):
    # Run the NER pipeline over texts in fixed-size chunks
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_results = []

        for text in batch:
            entities = ner_pipeline(text)
            batch_results.append(entities)

        results.extend(batch_results)

    return results
```

</details>
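
Note that `transformers` pipelines also accept a list of texts directly (with an optional `batch_size` argument), which achieves the same chunking with true batched inference.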

## Model Architecture

```
Input: Arabic Text
        ↓
Tokenization (Arabic BERT Tokenizer)
        ↓
ARAB_BERT Encoder (12 layers)
        ↓
Classification Head (11 classes)
        ↓
BIO Tag Predictions
```
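
The label inventory of the classification head is stored in the model config, so the head size can be checked directly once the model from Quick Start is loaded:

```python
print(model.config.num_labels)  # 11
print(model.config.id2label)    # O plus B-/I- tags for each entity type
```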

## Limitations & Considerations

- **Exact Boundary Detection:** Low exact-match scores indicate difficulty pinpointing precise entity boundaries
- **Dialectal Coverage:** Trained primarily on Modern Standard Arabic, with limited dialectal variation
- **Context Sensitivity:** May struggle with context-dependent PII that doesn't follow clear surface patterns
- **Performance Trade-offs:** Partial-match scores far exceed exact-match scores, so predicted spans often overlap gold spans without aligning exactly

## Competition Context

Developed for the **Maqsam Arabic PII Redaction Challenge**, which addresses critical gaps in Arabic PII detection systems. The competition emphasized:

- Token-level evaluation methodology
- Real-world deployment considerations
- Speed optimization for practical applications
- Arabic-specific linguistic challenges

**Evaluation Formula:**

```
Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
```
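
The formula weights precision and recall equally and adds a small bonus for speed. As a function (the `avg_time` value below is a placeholder; the model's measured latency is not published here):

```python
def final_score(precision, recall, avg_time):
    return 0.45 * precision + 0.45 * recall + 0.1 * (1 / avg_time)

# Illustrative numbers only (partial-match metrics, assumed 2 s average)
print(final_score(precision=0.647, recall=0.455, avg_time=2.0))
```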

## Citation

```bibtex
@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
```

## Resources

- **Base Model:** [MutazYoune/ARAB_BERT](https://huggingface.co/MutazYoune/ARAB_BERT)
- **Competition:** [Maqsam Arabic PII Redaction Challenge](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)
- **Dataset:** Maqsam/ArabicPIIRedaction

---

<div align="center">

**[🤗 Model Hub](https://huggingface.co/MutazYoune/Arabic-NER-PII)** • **[📊 Competition](https://huggingface.co/spaces/Maqsam/ArabicPIIRedaction)**

</div>