Luis Kalckstein committed
Commit ebc9315 · unverified · 1 Parent(s): cfdd1af

New benchmarking results from improved dataset and contribution notebook

README.md DELETED
@@ -1,113 +0,0 @@
1
- ---
2
- title: LLM PII Detection Leaderboard
3
- emoji: 🥇
4
- colorFrom: green
5
- colorTo: indigo
6
- sdk: gradio
7
- app_file: app.py
8
- pinned: true
9
- license: apache-2.0
10
- short_description: Duplicate this leaderboard to initialize your own!
11
- sdk_version: 5.19.0
12
- ---
13
-
14
- # 🔒 LLM PII Detection Leaderboard
15
-
16
- A comprehensive benchmark for evaluating language models' performance in detecting and handling personally identifiable information (PII) across various document types and scenarios.
17
-
18
- ## ✨ Features
19
-
20
- - **Beautiful Modern UI**: Elegant dark theme with gradient styling and smooth animations
21
- - **Comprehensive Metrics**: Precision, Recall, F1 Score, Over-detection Rate, Processing Time, and Cost
22
- - **Domain-Specific Analysis**: Specialized evaluation across Healthcare, Financial, Government, Legal, and Personal documents
23
- - **Performance Cards**: Professional model performance cards perfect for presentations and reports
24
- - **Interactive Filtering**: Filter by model type, document type, and sort by any metric
25
- - **Real-time Updates**: Dynamic table updates and score visualizations
26
-
27
- ## 🚀 Quick Start
28
-
29
- ### Installation
30
-
31
- ```bash
32
- git clone https://github.com/your-username/LLM-PII-Detection-Leaderboard.git
33
- cd LLM-PII-Detection-Leaderboard
34
- pip install -r requirements.txt
35
- ```
36
-
37
- ### Run the Application
38
-
39
- ```bash
40
- python app.py
41
- ```
42
-
43
- The leaderboard will be available at `http://localhost:7860`
44
-
45
- ## 📊 Key Metrics
46
-
47
- - **Overall Accuracy**: Percentage of correctly identified and classified PII entities
48
- - **Precision**: Of all flagged items, how many were actually PII (avoiding false positives)
49
- - **Recall**: Of all PII present, how many were successfully detected (avoiding false negatives)
50
- - **F1 Score**: Harmonic mean balancing precision and recall
51
- - **Over-detection Rate**: Percentage of non-PII incorrectly flagged (lower is better)
52
-
53
- ## 🏗️ Project Structure
54
-
55
- ```
56
- LLM-PII-Detection-Leaderboard/
57
- ├── app.py # Main application entry point
58
- ├── pii_leaderboard.py # Core leaderboard functionality
59
- ├── data_loader.py # Data loading and styling configuration
60
- ├── requirements.txt # Python dependencies
61
- └── README.md # This file
62
- ```
63
-
64
- ## 🎨 Design Philosophy
65
-
66
- This leaderboard combines the slim architecture of agent-leaderboard with the beautiful design elements from DocumentProcessing Leaderboard Nutrient, featuring:
67
-
68
- - **Minimal Dependencies**: Only essential packages (Gradio, Pandas, NumPy)
69
- - **Clean Architecture**: Simple, maintainable code structure
70
- - **Professional Styling**: Modern dark theme with custom color palette
71
- - **Interactive Elements**: Score bars, rank badges, and performance cards
72
- - **Responsive Design**: Works beautifully on all screen sizes
73
-
74
- ## 🔧 Customization
75
-
76
- ### Adding New Models
77
-
78
- Update the `sample_data` dictionary in `data_loader.py` with your model's performance metrics.
79
-
80
- ### Changing Colors
81
-
82
- Modify the `COLORS` dictionary in `data_loader.py` to customize the color scheme.
83
-
84
- ### Adding New Metrics
85
-
86
- 1. Add the metric to your data structure
87
- 2. Update the table generation in `pii_leaderboard.py`
88
- 3. Add appropriate styling and score bars
89
-
90
- ## 📈 Performance
91
-
92
- The leaderboard currently evaluates 8 leading language models across:
93
- - **5 Document Types**: Healthcare, Financial, Government, Legal, Personal
94
- - **6 Key Metrics**: Accuracy, Precision, Recall, F1, Over-detection Rate, Cost & Time
95
- - **Real-world Scenarios**: Synthetic industry documents with embedded PII
96
-
97
- ## 🤝 Contributing
98
-
99
- 1. Fork the repository
100
- 2. Create a feature branch
101
- 3. Make your changes
102
- 4. Test thoroughly
103
- 5. Submit a pull request
104
-
105
- ## 📄 License
106
-
107
- This project is licensed under the MIT License - see the LICENSE file for details.
108
-
109
- ## 🙏 Acknowledgments
110
-
111
- - Inspired by the elegant design of DocumentProcessing Leaderboard Nutrient
112
- - Built with the slim architecture approach of agent-leaderboard
113
- - Powered by Gradio for the beautiful web interface
pii_leaderboard.py CHANGED
@@ -418,6 +418,17 @@ def create_pii_leaderboard():
418
  PII Detection Performance Leaderboard
419
  </h3>
420
  </div>
421
  <p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
422
  Filter by document type, model access, and sort by any metric to explore performance
423
  </p>
@@ -485,7 +496,77 @@ def create_pii_leaderboard():
485
  # Methodology section
486
  gr.HTML(f"""
487
  <div class="dark-container" style="margin-top: 32px;">
488
- {METHODOLOGY}
489
  </div>
490
  """)
491
 
 
418
  PII Detection Performance Leaderboard
419
  </h3>
420
  </div>
421
+
422
+ <!-- Dataset Reference -->
423
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 12px; padding: 16px; margin: 16px 0 24px 0;">
424
+ <p style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
425
+ 📊 <strong>Dataset</strong>: <a href="https://huggingface.co/datasets/nutrientdocs/DocPII-redaction-benchmark" style="color: var(--accent-primary); text-decoration: none;" target="_blank">DocPII: Contextual Redaction Benchmark Dataset</a>
426
+ </p>
427
+ <p style="color: var(--text-secondary); margin: 0; font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.4;">
428
+ DocPII contains 1,101 high-quality document samples with embedded PII, designed to evaluate context-aware redaction systems. It provides realistic, full-document contexts across healthcare, finance, and other sectors—a notable advancement over sentence-level datasets.
429
+ </p>
430
+ </div>
431
+
432
  <p style="color: var(--text-secondary); margin-bottom: 20px; font-size: 1.1rem; font-family: 'Archivo', sans-serif;">
433
  Filter by document type, model access, and sort by any metric to explore performance
434
  </p>
 
496
  # Methodology section
497
  gr.HTML(f"""
498
  <div class="dark-container" style="margin-top: 32px;">
499
+ {METHODOLOGY}
500
+ </div>
501
+ """)
502
+
503
+ # Contribution Section
504
+ gr.HTML("""
505
+ <div class="dark-container" style="margin-top: 32px;">
506
+ <div class="section-header">
507
+ <h3 style="margin: 0; color: var(--text-primary); font-size: 1.5rem; font-family: 'Archivo', sans-serif; font-weight: 700;">
508
+ Contribute to the Leaderboard
509
+ </h3>
510
+ </div>
511
+
512
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 16px; padding: 24px; margin-bottom: 24px;">
513
+ <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 2rem; align-items: center;">
514
+ <div>
515
+ <h4 style="color: var(--accent-primary); margin: 0 0 16px 0; font-size: 1.2rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
516
+ Help Improve PII Detection
517
+ </h4>
518
+ <p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.6;">
519
+ Join our community and contribute to advancing PII detection capabilities! We encourage researchers and developers to:
520
+ </p>
521
+ <ul style="color: var(--text-secondary); font-size: 0.95rem; font-family: 'Archivo', sans-serif; line-height: 1.5; margin: 0; padding-left: 20px;">
522
+ <li style="margin-bottom: 8px;"><strong>Optimize prompts</strong> with existing models for better performance</li>
523
+ <li style="margin-bottom: 8px;"><strong>Test your own models</strong> on the DocPII benchmark dataset</li>
524
+ <li style="margin-bottom: 8px;"><strong>Share novel approaches</strong> and techniques for PII detection</li>
525
+ <li style="margin-bottom: 8px;"><strong>Experiment with fine-tuning</strong> strategies for document-level context</li>
526
+ </ul>
527
+ </div>
528
+
529
+ <div style="text-align: center;">
530
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 16px; margin-bottom: 16px;">
531
+ <h4 style="color: var(--text-primary); margin: 0 0 8px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
532
+ Example Notebook
533
+ </h4>
534
+ <p style="color: var(--text-secondary); margin: 0; font-size: 0.85rem; font-family: 'Archivo', sans-serif;">
535
+ Ready-to-run evaluation setup
536
+ </p>
537
+ </div>
538
+ <a href="https://colab.research.google.com/drive/1Qs5b85jWzmpFhVO-2mo0BgECCxKAeQIP?usp=sharing"
539
+ target="_blank"
540
+ rel="noopener noreferrer"
541
+ style="display: inline-block; background: var(--bg-secondary); color: var(--text-primary); border: 1px solid var(--accent-primary); padding: 10px 20px; border-radius: 6px; text-decoration: none; font-family: 'Archivo', sans-serif; font-weight: 500; font-size: 0.9rem; transition: all 0.3s ease;">
542
+ Open in Google Colab
543
+ </a>
544
+ </div>
545
+ </div>
546
+ </div>
547
+
548
+ <div style="background: linear-gradient(135deg, rgba(240, 201, 104, 0.1), rgba(239, 235, 231, 0.1)); border: 1px solid var(--accent-primary); border-radius: 16px; padding: 20px; text-align: center;">
549
+ <h4 style="color: var(--accent-primary); margin: 0 0 12px 0; font-size: 1.1rem; font-family: 'Archivo', sans-serif; font-weight: 600;">
550
+ How to Submit Your Results
551
+ </h4>
552
+ <p style="color: var(--text-primary); margin: 0 0 16px 0; font-size: 1rem; font-family: 'Archivo', sans-serif; line-height: 1.5;">
553
+ Share your findings with the community! Submit your results along with a Google Colab notebook demonstrating your approach.
554
+ </p>
555
+ <div style="display: flex; justify-content: center; gap: 16px; flex-wrap: wrap;">
556
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
557
+ <span style="color: var(--accent-primary); font-weight: 600;">1.</span>
558
+ <span style="color: var(--text-secondary);"> Run evaluation</span>
559
+ </div>
560
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
561
+ <span style="color: var(--accent-primary); font-weight: 600;">2.</span>
562
+ <span style="color: var(--text-secondary);"> Create Colab notebook</span>
563
+ </div>
564
+ <div style="background: var(--bg-secondary); border: 1px solid var(--border-subtle); border-radius: 8px; padding: 12px 16px; font-size: 0.9rem; font-family: 'Archivo', sans-serif;">
565
+ <span style="color: var(--accent-primary); font-weight: 600;">3.</span>
566
+ <span style="color: var(--text-secondary);"> Add Discussion in Community</span>
567
+ </div>
568
+ </div>
569
+ </div>
570
  </div>
571
  """)
572
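An inline `style` attribute cannot express a `:hover` state, so a hover effect on the Colab link needs a stylesheet rule. The sketch below is one possible way to do that. It assumes the hosting component renders embedded `<style>` elements; the class name `colab-link` and the helper `colab_button` are hypothetical and not part of this commit.

```python
import html

# Hypothetical class name; not part of the original commit.
BUTTON_CLASS = "colab-link"

# Stylesheet carrying the base look plus the :hover rule that an
# inline style attribute cannot express.
BUTTON_CSS = f"""
<style>
.{BUTTON_CLASS} {{
    display: inline-block;
    background: var(--bg-secondary);
    color: var(--text-primary);
    border: 1px solid var(--accent-primary);
    padding: 10px 20px;
    border-radius: 6px;
    text-decoration: none;
    transition: all 0.3s ease;
}}
.{BUTTON_CLASS}:hover {{
    background: var(--accent-primary);
}}
</style>
"""


def colab_button(url: str, label: str = "Open in Google Colab") -> str:
    """Return the stylesheet followed by an anchor tag that uses the class."""
    safe_url = html.escape(url, quote=True)
    safe_label = html.escape(label)
    return (
        f"{BUTTON_CSS}"
        f'<a class="{BUTTON_CLASS}" href="{safe_url}" '
        f'target="_blank" rel="noopener noreferrer">{safe_label}</a>'
    )
```

The returned string could then be interpolated into the HTML passed to the contribution section, in place of the inline-styled anchor.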
 
results/pii_detection_results.csv CHANGED
@@ -1,5 +1,5 @@
1
  Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
2
- Nutrient & GPT-5-mini,Proprietary,OpenAI,0.757,0.993,0.972,0.98,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
3
- Nutrient & GPT-5-nano,Proprietary,OpenAI,0.658,0.988,0.954,0.966,0.066,2.1,0.015,0.963,0.961,0.943,0.946,0.978
4
- Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.599,0.993,0.945,0.964,0.065,2.3,0.012,0.96,0.961,0.966,0.895,0.994
5
- Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.419,0.989,0.906,0.936,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
 
1
  Model,Model Type,Vendor,Overall Accuracy,Precision,Recall,F1 Score,Over-redaction Rate,Processing Time (s),Cost per Document ($),Healthcare Accuracy,Financial Accuracy,Government Accuracy,Legal Accuracy,Personal Accuracy
2
+ Nutrient & GPT-5-mini,Proprietary,OpenAI,0.972,0.993,0.952,0.972,0.054,2.7,0.018,0.982,0.974,0.958,0.977,0.989
3
+ Nutrient & GPT-4.1-mini,Proprietary,OpenAI,0.945,0.993,0.900,0.945,0.065,2.3,0.012,0.960,0.961,0.966,0.895,0.994
4
+ Nutrient & GPT-4.1-nano,Proprietary,OpenAI,0.906,0.989,0.830,0.906,0.118,1.8,0.008,0.939,0.939,0.933,0.925,0.974
5
+ GPT-4.1-nano (Example Notebook),LLM,OpenAI,0.749,0.817,0.711,0.749,0.022,1.3,5.7e-05,0.721,0.781,0.726,0.805,0.746
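
The deleted README describes F1 as the harmonic mean of precision and recall. As a rough sanity check on the new rows, the sketch below recomputes F1 from the reported precision and recall and prints the gap to the reported F1. It is an illustration only: the helper names `f1_score` and `f1_gaps` are hypothetical, and the commit does not say how the reported F1 is aggregated. If it is averaged per document rather than derived from the aggregate precision and recall, small gaps would be expected.

```python
import csv
import io

# New rows copied from results/pii_detection_results.csv in this commit
# (only the columns needed for the check are kept).
RESULTS_CSV = """Model,Precision,Recall,F1 Score
Nutrient & GPT-5-mini,0.993,0.952,0.972
Nutrient & GPT-4.1-mini,0.993,0.900,0.945
Nutrient & GPT-4.1-nano,0.989,0.830,0.906
GPT-4.1-nano (Example Notebook),0.817,0.711,0.749
"""


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_gaps(csv_text: str) -> dict[str, float]:
    """Map each model to |reported F1 - F1 recomputed from P and R|."""
    gaps = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        recomputed = f1_score(float(row["Precision"]), float(row["Recall"]))
        gaps[row["Model"]] = abs(float(row["F1 Score"]) - recomputed)
    return gaps


if __name__ == "__main__":
    for model, gap in f1_gaps(RESULTS_CSV).items():
        print(f"{model}: gap = {gap:.3f}")
```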