LLM-PII-Detection-Leaderboard

Luis Kalckstein committed on Aug 20, 2025

Commit ebc9315 · unverified · 1 Parent(s): cfdd1af

New benchmarking results from improved dataset and contribution notebook

Files changed (3)

README.md +0 -113
pii_leaderboard.py +82 -1
results/pii_detection_results.csv +4 -4

README.md DELETED

@@ -1,113 +0,0 @@
----
-title: LLM PII Detection Leaderboard
-emoji: 🥇
-colorFrom: green
-colorTo: indigo
-sdk: gradio
-app_file: app.py
-pinned: true
-license: apache-2.0
-short_description: Duplicate this leaderboard to initialize your own!
-sdk_version: 5.19.0
----
-# 🔒 LLM PII Detection Leaderboard
-A comprehensive benchmark for evaluating language models' performance in detecting and handling personally identifiable information (PII) across various document types and scenarios.
-## ✨ Features
-- **Beautiful Modern UI**: Elegant dark theme with gradient styling and smooth animations
-- **Comprehensive Metrics**: Precision, Recall, F1 Score, Over-detection Rate, Processing Time, and Cost
-- **Domain-Specific Analysis**: Specialized evaluation across Healthcare, Financial, Government, Legal, and Personal documents
-- **Performance Cards**: Professional model performance cards perfect for presentations and reports
-- **Interactive Filtering**: Filter by model type, document type, and sort by any metric
-- **Real-time Updates**: Dynamic table updates and score visualizations
-## 🚀 Quick Start
-### Installation
-```bash
-git clone https://github.com/your-username/LLM-PII-Detection-Leaderboard.git
-cd LLM-PII-Detection-Leaderboard
-pip install -r requirements.txt
-```
-### Run the Application
-```bash
-python app.py
-```
-The leaderboard will be available at `http://localhost:7860`
-## 📊 Key Metrics
-- **Overall Accuracy**: Percentage of correctly identified and classified PII entities
-- **Precision**: Of all flagged items, how many were actually PII (avoiding false positives)
-- **Recall**: Of all PII present, how many were successfully detected (avoiding false negatives)
-- **F1 Score**: Harmonic mean balancing precision and recall
-- **Over-detection Rate**: Percentage of non-PII incorrectly flagged (lower is better)
-## 🏗️ Project Structure
-```
-LLM-PII-Detection-Leaderboard/
-├── app.py                 # Main application entry point
-├── pii_leaderboard.py     # Core leaderboard functionality
-├── data_loader.py         # Data loading and styling configuration
-├── requirements.txt       # Python dependencies
-└── README.md             # This file
-```
-## 🎨 Design Philosophy
-This leaderboard combines the slim architecture of agent-leaderboard with the beautiful design elements from DocumentProcessing Leaderboard Nutrient, featuring:
-- **Minimal Dependencies**: Only essential packages (Gradio, Pandas, NumPy)
-- **Clean Architecture**: Simple, maintainable code structure
-- **Professional Styling**: Modern dark theme with custom color palette
-- **Interactive Elements**: Score bars, rank badges, and performance cards
-- **Responsive Design**: Works beautifully on all screen sizes
-## 🔧 Customization
-### Adding New Models
-Update the `sample_data` dictionary in `data_loader.py` with your model's performance metrics.
-### Changing Colors
-Modify the `COLORS` dictionary in `data_loader.py` to customize the color scheme.
-### Adding New Metrics
-1. Add the metric to your data structure
-2. Update the table generation in `pii_leaderboard.py`
-3. Add appropriate styling and score bars
-## 📈 Performance
-The leaderboard currently evaluates 8 leading language models across:
-- **5 Document Types**: Healthcare, Financial, Government, Legal, Personal
-- **6 Key Metrics**: Accuracy, Precision, Recall, F1, Over-detection Rate, Cost & Time
-- **Real-world Scenarios**: Synthetic industry documents with embedded PII
-## 🤝 Contributing
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Test thoroughly
-5. Submit a pull request
-## 📄 License
-This project is licensed under the MIT License - see the LICENSE file for details.
-## 🙏 Acknowledgments
-- Inspired by the elegant design of DocumentProcessing Leaderboard Nutrient
-- Built with the slim architecture approach of agent-leaderboard
-- Powered by Gradio for the beautiful web interface
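
Review note: the deleted README defined its key metrics only in prose. The sketch below restates those definitions as code, assuming span-level confusion counts (`tp`, `fp`, `fn`, `tn`), which the README does not specify. The `Counts` class and the function names are hypothetical and are not part of `pii_leaderboard.py`. Overall Accuracy is left out because the README does not say how "correctly identified and classified" is counted.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical span-level counts for one model on one dataset."""
    tp: int  # PII spans that were flagged
    fp: int  # non-PII spans that were flagged by mistake
    fn: int  # PII spans that were missed
    tn: int  # non-PII spans that were correctly left alone


def precision(c: Counts) -> float:
    # Of all flagged items, the share that was actually PII.
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Counts) -> float:
    # Of all PII present, the share that was detected.
    present = c.tp + c.fn
    return c.tp / present if present else 0.0


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def over_detection_rate(c: Counts) -> float:
    # Share of non-PII items that were incorrectly flagged (lower is better).
    non_pii = c.fp + c.tn
    return c.fp / non_pii if non_pii else 0.0


if __name__ == "__main__":
    example = Counts(tp=90, fp=10, fn=30, tn=870)  # made-up counts
    print(precision(example), recall(example),
          f1_score(example), over_detection_rate(example))
```

The zero-denominator guards return `0.0` as a convention chosen for this sketch; the leaderboard may handle empty cases differently.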

pii_leaderboard.py CHANGED

@@ -418,6 +418,17 @@ def create_pii_leaderboard():
                 PII Detection Performance Leaderboard
             </h3>
         </div>
         <p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
             Filter by document type, model access, and sort by any metric to explore performance
         </p>
@@ -485,7 +496,77 @@ def create_pii_leaderboard():
     # Methodology section
     gr.HTML(f"""
     <div class="dark-container" style="margin-top: 32px;">
-        {METHODOLOGY}
     </div>
     """)

                 PII Detection Performance Leaderboard
             </h3>
         </div>
+        <!-- Dataset Reference -->
+        <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 12px; padding: 16px; margin: 16px 0 24px 0;">
+            <p style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
+                📊 <strong>Dataset</strong>: <a href="https://huggingface.co/datasets/nutrientdocs/DocPII-redaction-benchmark" style="color: var(--accent-primary); text-decoration: none;" target="_blank">DocPII: Contextual Redaction Benchmark Dataset</a>
+            </p>
+            <p style="color: var(--text-secondary); margin: 0; font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.4;">
+                DocPII contains 1,101 high-quality document samples with embedded PII, designed to evaluate context-aware redaction systems. It provides realistic, full-document contexts across healthcare, finance, and other sectors—a notable advancement over sentence-level datasets.
+            </p>
+        </div>
         <p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
             Filter by document type, model access, and sort by any metric to explore performance
         </p>
     # Methodology section
     gr.HTML(f"""
     <div class="dark-container" style="margin-top: 32px;">
+{METHODOLOGY}
+    </div>
+    """)
+    # Contribution Section
+    gr.HTML("""
+    <div class="dark-container" style="margin-top: 32px;">
+        <div class="section-header">
+            <h3 style="margin: 0; color: var(--text-primary); font-size: 1.5rem; font-family: 'Archivo', sans-serif; font-weight: 700;">
+                Contribute to the Leaderboard
+            </h3>
+        </div>
+        <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 16px; padding: 24px; margin-bottom: 24px;">
+            <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 2rem; align-items: center;">
+                <div>
+                    <h4 style="color: var(--accent-primary); margin: 0 0 16px 0; font-size: 1.2rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
+                        Help Improve PII Detection
+                    </h4>
+                    <p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.6;">
+                        Join our community and contribute to advancing PII detection capabilities! We encourage researchers and developers to:
+                    </p>
+                    <ul style="color: var(--text-secondary); font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.5; margin: 0; padding-left: 20px;">
+                        <li style="margin-bottom: 8px;"><strong>Optimize prompts</strong> with existing models for better performance</li>
+                        <li style="margin-bottom: 8px;"><strong>Test your own models</strong> on the DocPII benchmark dataset</li>
+                        <li style="margin-bottom: 8px;"><strong>Share novel approaches</strong> and techniques for PII detection</li>
+                        <li style="margin-bottom: 8px;"><strong>Experiment with fine-tuning</strong> strategies for document-level context</li>
+                    </ul>
+                </div>
+                <div style="text-align: center;">
+                    <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 16px; margin-bottom: 16px;">
+                        <h4 style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
+                            Example Notebook
+                        </h4>
+                        <p style="color: var(--text-secondary); margin: 0; font-size: 0.85rem; font-family: 'Archivo', sans-serif;">
+                            Ready-to-run evaluation setup
+                        </p>
+                    </div>
+                    <a href="https://colab.research.google.com/drive/1Qs5b85jWzmpFhVO-2mo0BgECCxKAeQIP?usp=sharing"
+                       target="_blank"
+                       rel="noopener noreferrer"
+                       style="display: inline-block; background: var(--bg-secondary); color: var(--text-primary); border: 1px solid var(--accent-primary); padding: 10px 20px; border-radius: 6px; text-decoration: none; font-family: 'Archivo', sans-serif; font-weight: 500; font-size: 0.9rem; transition: all 0.3s ease;">
+                        Open in Google Colab
+                    </a>
+                </div>
+            </div>
+        </div>
+        <div style="background: linear-gradient(135deg, rgba(240, 201, 104, 0.1), rgba(239, 235, 231, 0.1)); border: 1px solid var(--accent-primary); border-radius: 16px; padding: 20px; text-align: center;">
+            <h4 style="color: var(--accent-primary); margin: 0 0 12px 0; font-size: 1.1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
+                How to Submit Your Results
+            </h4>
+            <p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.5;">
+                Share your findings with the community! Submit your results along with a Google Colab notebook demonstrating your approach.
+            </p>
+            <div style="display: flex; justify-content: center; gap: 16px; flex-wrap: wrap;">
+                <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
+                    <span style="color: var(--accent-primary); font-weight: 600;">1.</span>
+                    <span style="color: var(--text-secondary);"> Run evaluation</span>
+                </div>
+                <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
+                    <span style="color: var(--accent-primary); font-weight: 600;">2.</span>
+                    <span style="color: var(--text-secondary);"> Create Colab notebook</span>
+                </div>
+                <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
+                    <span style="color: var(--accent-primary); font-weight: 600;">3.</span>
+                    <span style="color: var(--text-secondary);"> Add Discussion in Community</span>
+                </div>
+            </div>
+        </div>
     </div>
     """)

results/pii_detection_results.csv CHANGED

@@ -1,5 +1,5 @@
 Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
-Nutrient & GPT-5-mini,Proprietary,OpenAI,0.757,0.993,0.972,0.98,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
-Nutrient & GPT-5-nano,Proprietary,OpenAI,0.658,0.988,0.954,0.966,0.066,2.1,0.015,0.963,0.961,0.943,0.946,0.978
-Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.599,0.993,0.945,0.964,0.065,2.3,0.012,0.96,0.961,0.966,0.895,0.994
-Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.419,0.989,0.906,0.936,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
+Nutrient & GPT-5-mini,Proprietary,OpenAI,0.972,0.993,0.952,0.972,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
+Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.945,0.993,0.900,0.945,0.065,2.3,0.012,0.960,0.961,0.966,0.895,0.994
+Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.906,0.989,0.830,0.906,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
+GPT-4.1-nano (Example Notebook),LLM,OpenAI,0.749,0.817,0.711,0.749,0.022,1.3,5.7e-05,0.721,0.781,0.726,0.805,0.746
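
Review note: the README deleted in this commit described F1 as the harmonic mean of precision and recall. As a quick sanity check on the updated rows, the sketch below recomputes that harmonic mean from the Precision and Recall columns and prints it next to the reported F1 Score. Only three columns of the added rows are copied here, and the helper names (`harmonic_f1`, `compare`) are hypothetical. A gap between the two figures is not necessarily an error: if the benchmark averages F1 per document, which is an assumption and not something the commit states, the aggregate value will differ from the harmonic mean of the aggregate precision and recall.

```python
import csv
import io

# Precision, Recall and F1 Score copied from the rows added in this commit.
NEW_ROWS = """Model,Precision,Recall,F1 Score
Nutrient & GPT-5-mini,0.993,0.952,0.972
Nutrient & GPT-4.1-mini,0.993,0.900,0.945
Nutrient & GPT-4.1-nano,0.989,0.830,0.906
GPT-4.1-nano (Example Notebook),0.817,0.711,0.749
"""


def harmonic_f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def compare(csv_text: str) -> list[tuple[str, float, float]]:
    """Return (model, reported F1, recomputed harmonic mean) per row."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        p = float(row["Precision"])
        r = float(row["Recall"])
        out.append((row["Model"], float(row["F1 Score"]), harmonic_f1(p, r)))
    return out


if __name__ == "__main__":
    for model, reported, recomputed in compare(NEW_ROWS):
        print(f"{model}: reported={reported:.3f} recomputed={recomputed:.3f}")
```

To run the same check against the full file, pass the contents of `results/pii_detection_results.csv` to `compare`; the column names used above match its header.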