Luis Kalckstein
committed on
New benchmarking results from improved dataset and contribution notebook
- README.md +0 -113
- pii_leaderboard.py +82 -1
- results/pii_detection_results.csv +4 -4
README.md
DELETED
|
@@ -1,113 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: LLM PII Detection Leaderboard
|
| 3 |
-
emoji: 🥇
|
| 4 |
-
colorFrom: green
|
| 5 |
-
colorTo: indigo
|
| 6 |
-
sdk: gradio
|
| 7 |
-
app_file: app.py
|
| 8 |
-
pinned: true
|
| 9 |
-
license: apache-2.0
|
| 10 |
-
short_description: Duplicate this leaderboard to initialize your own!
|
| 11 |
-
sdk_version: 5.19.0
|
| 12 |
-
---
|
| 13 |
-
|
| 14 |
-
# 🔒 LLM PII Detection Leaderboard
|
| 15 |
-
|
| 16 |
-
A comprehensive benchmark for evaluating language models' performance in detecting and handling personally identifiable information (PII) across various document types and scenarios.
|
| 17 |
-
|
| 18 |
-
## ✨ Features
|
| 19 |
-
|
| 20 |
-
- **Beautiful Modern UI**: Elegant dark theme with gradient styling and smooth animations
|
| 21 |
-
- **Comprehensive Metrics**: Precision, Recall, F1 Score, Over-detection Rate, Processing Time, and Cost
|
| 22 |
-
- **Domain-Specific Analysis**: Specialized evaluation across Healthcare, Financial, Government, Legal, and Personal documents
|
| 23 |
-
- **Performance Cards**: Professional model performance cards perfect for presentations and reports
|
| 24 |
-
- **Interactive Filtering**: Filter by model type, document type, and sort by any metric
|
| 25 |
-
- **Real-time Updates**: Dynamic table updates and score visualizations
|
| 26 |
-
|
| 27 |
-
## 🚀 Quick Start
|
| 28 |
-
|
| 29 |
-
### Installation
|
| 30 |
-
|
| 31 |
-
```bash
|
| 32 |
-
git clone https://github.com/your-username/LLM-PII-Detection-Leaderboard.git
|
| 33 |
-
cd LLM-PII-Detection-Leaderboard
|
| 34 |
-
pip install -r requirements.txt
|
| 35 |
-
```
|
| 36 |
-
|
| 37 |
-
### Run the Application
|
| 38 |
-
|
| 39 |
-
```bash
|
| 40 |
-
python app.py
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
The leaderboard will be available at `http://localhost:7860`
|
| 44 |
-
|
| 45 |
-
## 📊 Key Metrics
|
| 46 |
-
|
| 47 |
-
- **Overall Accuracy**: Percentage of correctly identified and classified PII entities
|
| 48 |
-
- **Precision**: Of all flagged items, how many were actually PII (avoiding false positives)
|
| 49 |
-
- **Recall**: Of all PII present, how many were successfully detected (avoiding false negatives)
|
| 50 |
-
- **F1 Score**: Harmonic mean balancing precision and recall
|
| 51 |
-
- **Over-detection Rate**: Percentage of non-PII incorrectly flagged (lower is better)
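
The metric definitions listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Counts` container, the function names, and the choice of entity-level counting are hypothetical and are not taken from `pii_leaderboard.py`, and the over-detection rate is read here as false positives divided by all non-PII items, which is one plausible interpretation of the README wording.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Entity-level confusion counts (hypothetical container, not from the repo)."""
    tp: int  # PII entities correctly flagged
    fp: int  # non-PII items incorrectly flagged
    fn: int  # PII entities that were missed
    tn: int  # non-PII items correctly left alone


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision(c: Counts) -> float:
    # Of all flagged items, the share that was actually PII.
    return safe_div(c.tp, c.tp + c.fp)


def recall(c: Counts) -> float:
    # Of all PII present, the share that was detected.
    return safe_div(c.tp, c.tp + c.fn)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return safe_div(2 * p * r, p + r)


def over_detection_rate(c: Counts) -> float:
    # Assumed definition: share of non-PII items that were flagged anyway.
    return safe_div(c.fp, c.fp + c.tn)


if __name__ == "__main__":
    counts = Counts(tp=8, fp=2, fn=2, tn=88)
    print(f"precision={precision(counts):.3f}")
    print(f"recall={recall(counts):.3f}")
    print(f"f1={f1_score(counts):.3f}")
    print(f"over_detection={over_detection_rate(counts):.3f}")
```

If the leaderboard averages these metrics per document or per domain rather than pooling counts, the aggregate F1 will generally differ from the harmonic mean of the aggregate precision and recall.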
|
| 52 |
-
|
| 53 |
-
## 🏗️ Project Structure
|
| 54 |
-
|
| 55 |
-
```
|
| 56 |
-
LLM-PII-Detection-Leaderboard/
|
| 57 |
-
├── app.py # Main application entry point
|
| 58 |
-
├── pii_leaderboard.py # Core leaderboard functionality
|
| 59 |
-
├── data_loader.py # Data loading and styling configuration
|
| 60 |
-
├── requirements.txt # Python dependencies
|
| 61 |
-
└── README.md # This file
|
| 62 |
-
```
|
| 63 |
-
|
| 64 |
-
## 🎨 Design Philosophy
|
| 65 |
-
|
| 66 |
-
This leaderboard combines the slim architecture of agent-leaderboard with the beautiful design elements from DocumentProcessing Leaderboard Nutrient, featuring:
|
| 67 |
-
|
| 68 |
-
- **Minimal Dependencies**: Only essential packages (Gradio, Pandas, NumPy)
|
| 69 |
-
- **Clean Architecture**: Simple, maintainable code structure
|
| 70 |
-
- **Professional Styling**: Modern dark theme with custom color palette
|
| 71 |
-
- **Interactive Elements**: Score bars, rank badges, and performance cards
|
| 72 |
-
- **Responsive Design**: Works beautifully on all screen sizes
|
| 73 |
-
|
| 74 |
-
## 🔧 Customization
|
| 75 |
-
|
| 76 |
-
### Adding New Models
|
| 77 |
-
|
| 78 |
-
Update the `sample_data` dictionary in `data_loader.py` with your model's performance metrics.
|
| 79 |
-
|
| 80 |
-
### Changing Colors
|
| 81 |
-
|
| 82 |
-
Modify the `COLORS` dictionary in `data_loader.py` to customize the color scheme.
|
| 83 |
-
|
| 84 |
-
### Adding New Metrics
|
| 85 |
-
|
| 86 |
-
1. Add the metric to your data structure
|
| 87 |
-
2. Update the table generation in `pii_leaderboard.py`
|
| 88 |
-
3. Add appropriate styling and score bars
|
| 89 |
-
|
| 90 |
-
## 📈 Performance
|
| 91 |
-
|
| 92 |
-
The leaderboard currently evaluates 8 leading language models across:
|
| 93 |
-
- **5 Document Types**: Healthcare, Financial, Government, Legal, Personal
|
| 94 |
-
- **6 Key Metrics**: Accuracy, Precision, Recall, F1, Over-detection Rate, Cost & Time
|
| 95 |
-
- **Real-world Scenarios**: Synthetic industry documents with embedded PII
|
| 96 |
-
|
| 97 |
-
## 🤝 Contributing
|
| 98 |
-
|
| 99 |
-
1. Fork the repository
|
| 100 |
-
2. Create a feature branch
|
| 101 |
-
3. Make your changes
|
| 102 |
-
4. Test thoroughly
|
| 103 |
-
5. Submit a pull request
|
| 104 |
-
|
| 105 |
-
## 📄 License
|
| 106 |
-
|
| 107 |
-
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 108 |
-
|
| 109 |
-
## 🙏 Acknowledgments
|
| 110 |
-
|
| 111 |
-
- Inspired by the elegant design of DocumentProcessing Leaderboard Nutrient
|
| 112 |
-
- Built with the slim architecture approach of agent-leaderboard
|
| 113 |
-
- Powered by Gradio for the beautiful web interface
|
pii_leaderboard.py
CHANGED
|
@@ -418,6 +418,17 @@ def create_pii_leaderboard():
|
|
| 418 |
PII Detection Performance Leaderboard
|
| 419 |
</h3>
|
| 420 |
</div>
|
| 421 |
<p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
|
| 422 |
Filter by document type, model access, and sort by any metric to explore performance
|
| 423 |
</p>
|
|
@@ -485,7 +496,77 @@ def create_pii_leaderboard():
|
|
| 485 |
# Methodology section
|
| 486 |
gr.HTML(f"""
|
| 487 |
<div class="dark-container" style="margin-top: 32px;">
|
| 488 |
-
|
| 489 |
</div>
|
| 490 |
""")
|
| 491 |
|
|
|
|
| 418 |
PII Detection Performance Leaderboard
|
| 419 |
</h3>
|
| 420 |
</div>
|
| 421 |
+
|
| 422 |
+
<!-- Dataset Reference -->
|
| 423 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 12px; padding: 16px; margin: 16px 0 24px 0;">
|
| 424 |
+
<p style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
|
| 425 |
+
📊 <strong>Dataset</strong>: <a href="https://huggingface.co/datasets/nutrientdocs/DocPII-redaction-benchmark" style="color: var(--accent-primary); text-decoration: none;" target="_blank">DocPII: Contextual Redaction Benchmark Dataset</a>
|
| 426 |
+
</p>
|
| 427 |
+
<p style="color: var(--text-secondary); margin: 0; font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.4;">
|
| 428 |
+
DocPII contains 1,101 high-quality document samples with embedded PII, designed to evaluate context-aware redaction systems. It provides realistic, full-document contexts across healthcare, finance, and other sectors—a notable advancement over sentence-level datasets.
|
| 429 |
+
</p>
|
| 430 |
+
</div>
|
| 431 |
+
|
| 432 |
<p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
|
| 433 |
Filter by document type, model access, and sort by any metric to explore performance
|
| 434 |
</p>
|
|
|
|
| 496 |
# Methodology section
|
| 497 |
gr.HTML(f"""
|
| 498 |
<div class="dark-container" style="margin-top: 32px;">
|
| 499 |
+
{METHODOLOGY}
|
| 500 |
+
</div>
|
| 501 |
+
""")
|
| 502 |
+
|
| 503 |
+
# Contribution Section
|
| 504 |
+
gr.HTML("""
|
| 505 |
+
<div class="dark-container" style="margin-top: 32px;">
|
| 506 |
+
<div class="section-header">
|
| 507 |
+
<h3 style="margin: 0; color: var(--text-primary); font-size: 1.5rem; font-family: 'Archivo', sans-serif; font-weight: 700;">
|
| 508 |
+
Contribute to the Leaderboard
|
| 509 |
+
</h3>
|
| 510 |
+
</div>
|
| 511 |
+
|
| 512 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 16px; padding: 24px; margin-bottom: 24px;">
|
| 513 |
+
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 2rem; align-items: center;">
|
| 514 |
+
<div>
|
| 515 |
+
<h4 style="color: var(--accent-primary); margin: 0 0 16px 0; font-size: 1.2rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
|
| 516 |
+
Help Improve PII Detection
|
| 517 |
+
</h4>
|
| 518 |
+
<p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.6;">
|
| 519 |
+
Join our community and contribute to advancing PII detection capabilities! We encourage researchers and developers to:
|
| 520 |
+
</p>
|
| 521 |
+
<ul style="color: var(--text-secondary); font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.5; margin: 0; padding-left: 20px;">
|
| 522 |
+
<li style="margin-bottom: 8px;"><strong>Optimize prompts</strong> with existing models for better performance</li>
|
| 523 |
+
<li style="margin-bottom: 8px;"><strong>Test your own models</strong> on the DocPII benchmark dataset</li>
|
| 524 |
+
<li style="margin-bottom: 8px;"><strong>Share novel approaches</strong> and techniques for PII detection</li>
|
| 525 |
+
<li style="margin-bottom: 8px;"><strong>Experiment with fine-tuning</strong> strategies for document-level context</li>
|
| 526 |
+
</ul>
|
| 527 |
+
</div>
|
| 528 |
+
|
| 529 |
+
<div style="text-align: center;">
|
| 530 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 16px; margin-bottom: 16px;">
|
| 531 |
+
<h4 style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
|
| 532 |
+
Example Notebook
|
| 533 |
+
</h4>
|
| 534 |
+
<p style="color: var(--text-secondary); margin: 0; font-size: 0.85rem; font-family: 'Archivo', sans-serif;">
|
| 535 |
+
Ready-to-run evaluation setup
|
| 536 |
+
</p>
|
| 537 |
+
</div>
|
| 538 |
+
<a href="https://colab.research.google.com/drive/1Qs5b85jWzmpFhVO-2mo0BgECCxKAeQIP?usp=sharing"
|
| 539 |
+
target="_blank"
|
| 540 |
+
rel="noopener noreferrer"
|
| 541 |
+
style="display: inline-block; background: var(--bg-secondary); color: var(--text-primary); border: 1px solid var(--accent-primary); padding: 10px 20px; border-radius: 6px; text-decoration: none; font-family: 'Archivo', sans-serif; font-weight: 500; font-size: 0.9rem; transition: all 0.3s ease;">
|
| 542 |
+
Open in Google Colab
|
| 543 |
+
</a>
|
| 544 |
+
</div>
|
| 545 |
+
</div>
|
| 546 |
+
</div>
|
| 547 |
+
|
| 548 |
+
<div style="background: linear-gradient(135deg, rgba(240, 201, 104, 0.1), rgba(239, 235, 231, 0.1)); border: 1px solid var(--accent-primary); border-radius: 16px; padding: 20px; text-align: center;">
|
| 549 |
+
<h4 style="color: var(--accent-primary); margin: 0 0 12px 0; font-size: 1.1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
|
| 550 |
+
How to Submit Your Results
|
| 551 |
+
</h4>
|
| 552 |
+
<p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.5;">
|
| 553 |
+
Share your findings with the community! Submit your results along with a Google Colab notebook demonstrating your approach.
|
| 554 |
+
</p>
|
| 555 |
+
<div style="display: flex; justify-content: center; gap: 16px; flex-wrap: wrap;">
|
| 556 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
|
| 557 |
+
<span style="color: var(--accent-primary); font-weight: 600;">1.</span>
|
| 558 |
+
<span style="color: var(--text-secondary);"> Run evaluation</span>
|
| 559 |
+
</div>
|
| 560 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
|
| 561 |
+
<span style="color: var(--accent-primary); font-weight: 600;">2.</span>
|
| 562 |
+
<span style="color: var(--text-secondary);"> Create Colab notebook</span>
|
| 563 |
+
</div>
|
| 564 |
+
<div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
|
| 565 |
+
<span style="color: var(--accent-primary); font-weight: 600;">3.</span>
|
| 566 |
+
<span style="color: var(--text-secondary);"> Add Discussion in Community</span>
|
| 567 |
+
</div>
|
| 568 |
+
</div>
|
| 569 |
+
</div>
|
| 570 |
</div>
|
| 571 |
""")
|
| 572 |
|
results/pii_detection_results.csv
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
|
| 2 |
-
Nutrient & GPT-5-mini,Proprietary,OpenAI,0.
|
| 3 |
-
Nutrient & GPT-
|
| 4 |
-
Nutrient & GPT-4.1-
|
| 5 |
-
|
|
|
|
| 1 |
Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
|
| 2 |
+
Nutrient & GPT-5-mini,Proprietary,OpenAI,0.972,0.993,0.952,0.972,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
|
| 3 |
+
Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.945,0.993,0.900,0.945,0.065,2.3,0.012,0.960,0.961,0.966,0.895,0.994
|
| 4 |
+
Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.906,0.989,0.830,0.906,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
|
| 5 |
+
GPT-4.1-nano (Example Notebook),LLM,OpenAI,0.749,0.817,0.711,0.749,0.022,1.3,5.7e-05,0.721,0.781,0.726,0.805,0.746
|
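
For readers who want to inspect the updated results outside the Space, the following is a minimal sketch of loading and ranking the new rows with the Python standard library. The header and rows are copied from the added lines of `results/pii_detection_results.csv` above. The helper names `load_results` and `rank_by` are hypothetical and do not come from the repository, which may load the file differently.

```python
import csv
import io

# Header and rows as added in this commit (results/pii_detection_results.csv).
RESULTS_CSV = """\
Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
Nutrient & GPT-5-mini,Proprietary,OpenAI,0.972,0.993,0.952,0.972,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.945,0.993,0.900,0.945,0.065,2.3,0.012,0.960,0.961,0.966,0.895,0.994
Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.906,0.989,0.830,0.906,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
GPT-4.1-nano (Example Notebook),LLM,OpenAI,0.749,0.817,0.711,0.749,0.022,1.3,5.7e-05,0.721,0.781,0.726,0.805,0.746
"""

# Columns that hold labels rather than numbers.
TEXT_COLUMNS = {"Model", "Model Type", "Vendor"}


def load_results(text: str) -> list[dict]:
    """Parse the CSV text and convert every non-label column to float."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        for key, value in row.items():
            if key not in TEXT_COLUMNS:
                row[key] = float(value)
    return rows


def rank_by(rows: list[dict], metric: str, descending: bool = True) -> list[dict]:
    """Return a new list of rows sorted by the given numeric column."""
    return sorted(rows, key=lambda r: r[metric], reverse=descending)


if __name__ == "__main__":
    results = load_results(RESULTS_CSV)
    for row in rank_by(results, "F1 Score"):
        print(f'{row["Model"]}: F1={row["F1 Score"]:.3f}, '
              f'cost=${row["Cost per Document ($)"]:.6f}')
```

Sorting with `descending=False` suits the columns where lower is better, such as `Over-redaction Rate`, `Processing Time (s)`, and `Cost per Document ($)`.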