Luis Kalckstein
V1 including mock results
title: LLM PII Detection Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: Duplicate this leaderboard to initialize your own!
sdk_version: 5.19.0

🔒 LLM PII Detection Leaderboard

A comprehensive benchmark for evaluating language models' performance in detecting and handling personally identifiable information (PII) across various document types and scenarios.

✨ Features

  • Beautiful Modern UI: Elegant dark theme with gradient styling and smooth animations
  • Comprehensive Metrics: Precision, Recall, F1 Score, Over-detection Rate, Processing Time, and Cost
  • Domain-Specific Analysis: Specialized evaluation across Healthcare, Financial, Government, Legal, and Personal documents
  • Performance Cards: Professional model performance cards perfect for presentations and reports
  • Interactive Filtering: Filter by model type, document type, and sort by any metric
  • Real-time Updates: Dynamic table updates and score visualizations

🚀 Quick Start

Installation

git clone https://github.com/your-username/LLM-PII-Detection-Leaderboard.git
cd LLM-PII-Detection-Leaderboard
pip install -r requirements.txt

Run the Application

python app.py

The leaderboard will be available at http://localhost:7860

📊 Key Metrics

  • Overall Accuracy: Percentage of correctly identified and classified PII entities
  • Precision: Of all flagged items, how many were actually PII (avoiding false positives)
  • Recall: Of all PII present, how many were successfully detected (avoiding false negatives)
  • F1 Score: Harmonic mean balancing precision and recall
  • Over-detection Rate: Percentage of non-PII incorrectly flagged (lower is better)
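
The definitions above can be sketched from confusion-matrix counts. This is a minimal illustration, not the leaderboard's actual scoring code: the function name `pii_metrics` and its count-based inputs are assumptions, and Overall Accuracy is left out because it also depends on how entity classification is scored.

```python
def pii_metrics(tp, fp, fn, tn):
    """Compute detection metrics from raw counts (hypothetical helper).

    tp: PII items correctly flagged      fp: non-PII items wrongly flagged
    fn: PII items missed                 tn: non-PII items correctly left alone
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero on empty categories.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    over_detection_rate = ratio(fp, fp + tn)  # share of non-PII that was flagged
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_detection_rate": over_detection_rate,
    }


# Illustrative counts only, not benchmark results.
metrics = pii_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these made-up counts, precision and recall are both 8/10 and the over-detection rate is 2/90.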

πŸ—οΈ Project Structure

LLM-PII-Detection-Leaderboard/
β”œβ”€β”€ app.py                 # Main application entry point
β”œβ”€β”€ pii_leaderboard.py     # Core leaderboard functionality
β”œβ”€β”€ data_loader.py         # Data loading and styling configuration
β”œβ”€β”€ requirements.txt       # Python dependencies
└── README.md             # This file

🎨 Design Philosophy

This leaderboard combines the slim architecture of agent-leaderboard with the beautiful design elements from DocumentProcessing Leaderboard Nutrient, featuring:

  • Minimal Dependencies: Only essential packages (Gradio, Pandas, NumPy)
  • Clean Architecture: Simple, maintainable code structure
  • Professional Styling: Modern dark theme with custom color palette
  • Interactive Elements: Score bars, rank badges, and performance cards
  • Responsive Design: Works beautifully on all screen sizes

🔧 Customization

Adding New Models

Update the sample_data dictionary in data_loader.py with your model's performance metrics.
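
As a rough sketch of what such an update might look like, assuming `sample_data` is a column-oriented dictionary (one list per column). The real layout and key names in data_loader.py may differ, and both the `add_model` helper and the numbers below are hypothetical placeholders.

```python
# Hypothetical layout; check data_loader.py for the real keys.
sample_data = {
    "Model": ["placeholder-model"],
    "Precision": [0.90],
    "Recall": [0.80],
    "F1 Score": [0.85],
}


def add_model(data, name, metrics):
    """Append one model's row, rejecting missing or unknown metric names."""
    expected = set(data) - {"Model"}
    if set(metrics) != expected:
        raise ValueError(f"metrics must have exactly these keys: {sorted(expected)}")
    data["Model"].append(name)
    for key, value in metrics.items():
        data[key].append(value)


# Placeholder values for illustration, not measured results.
add_model(
    sample_data,
    "my-new-model",
    {"Precision": 0.75, "Recall": 0.60, "F1 Score": 0.67},
)
```

Validating the keys up front keeps every column the same length, which matters if the dictionary is later turned into a Pandas DataFrame.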

Changing Colors

Modify the COLORS dictionary in data_loader.py to customize the color scheme.

Adding New Metrics

  1. Add the metric to your data structure
  2. Update the table generation in pii_leaderboard.py
  3. Add appropriate styling and score bars
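
Step 3 could look something like the sketch below, which renders a metric value as an HTML score bar. The `score_bar` function, its parameters, and the two `COLORS` keys are assumptions made for illustration; the actual helpers in pii_leaderboard.py and the palette in data_loader.py may be organized differently.

```python
from html import escape

# Hypothetical palette entries; the real COLORS dictionary may use other keys.
COLORS = {"bar": "#4ade80", "track": "#1f2937"}


def score_bar(value, label, lower_is_better=False):
    """Return an HTML snippet showing a 0..1 metric as a horizontal bar."""
    clamped = max(0.0, min(1.0, value))
    # For metrics such as over-detection rate, a lower value fills more of the bar.
    fill = 1.0 - clamped if lower_is_better else clamped
    return (
        f'<div title="{escape(label)}" style="background:{COLORS["track"]};width:100px">'
        f'<div style="background:{COLORS["bar"]};width:{fill * 100:.0f}%">&nbsp;</div>'
        f'</div> {clamped:.1%}'
    )


recall_cell = score_bar(0.25, "Recall")
over_detection_cell = score_bar(0.25, "Over-detection Rate", lower_is_better=True)
```

The returned string can be placed in a table cell; the `lower_is_better` flag inverts only the bar fill, while the printed percentage stays the raw value.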

📈 Performance

The leaderboard currently evaluates 8 leading language models across:

  • 5 Document Types: Healthcare, Financial, Government, Legal, Personal
  • 6 Key Metrics: Accuracy, Precision, Recall, F1, Over-detection Rate, Cost & Time
  • Real-world Scenarios: Synthetic industry documents with embedded PII

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Inspired by the elegant design of DocumentProcessing Leaderboard Nutrient
  • Built with the slim architecture approach of agent-leaderboard
  • Powered by Gradio for the beautiful web interface