🔒 LLM PII Detection Leaderboard

A comprehensive benchmark of how well language models detect and redact personally identifiable information (PII) across document types and scenarios. "How well do LLMs protect sensitive information?"

Language Models

Cutting-edge Nutrient models

GPT-5-mini, GPT-5-nano, GPT-4.1-mini, GPT-4.1-nano

Document Types

Real-world scenarios

Healthcare, Financial, Government, Legal, Personal

98.0%

Best F1 Score

State-of-the-art performance

Nutrient & GPT-5-mini lead on F1 score

Methodology

Our evaluation measures how well language models detect and handle personally identifiable information (PII) in realistic document scenarios. Each model is tested on synthetic documents containing embedded PII entities, drawn from five document categories: Healthcare, Financial, Government, Legal, and Personal.

Evaluation Process

Model Selection: We evaluate leading language models across proprietary and open-source categories
PII Detection: Each model processes documents with instructions to identify and classify PII entities
Performance Metrics: Precision, Recall, F1 Score, Over-detection Rate, Processing Time, and Cost
Domain Analysis: Specialized evaluation across Healthcare, Financial, Government, Legal, and Personal documents
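
The detection and domain-analysis steps above can be pictured as a per-document comparison between the entities a model flags and the entities known to be embedded in the synthetic document, with counts pooled per domain. The sketch below is illustrative only: it assumes each entity is a (text, type) pair and that a prediction counts as correct only on an exact match of both. The function name, the dictionary keys, and the matching rule are hypothetical and are not taken from the benchmark's actual code.

```python
from collections import defaultdict


def tally_by_domain(documents):
    """Pool true/false positives and false negatives per document domain.

    Assumed (hypothetical) input: an iterable of dicts with keys
      "domain"    - e.g. "Healthcare", "Legal"
      "gold"      - set of (text, type) pairs embedded in the document
      "predicted" - set of (text, type) pairs the model flagged
    """
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for doc in documents:
        gold = set(doc["gold"])
        predicted = set(doc["predicted"])
        counts = totals[doc["domain"]]
        counts["tp"] += len(gold & predicted)   # flagged and really PII
        counts["fp"] += len(predicted - gold)   # flagged but not PII
        counts["fn"] += len(gold - predicted)   # PII the model missed
    return dict(totals)
```

Exact matching is the strictest possible rule. A real evaluation might also credit partial span overlaps or a correct span with the wrong type, which would change the counts.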

Key Metrics Explained

Overall Accuracy: Percentage of correctly identified and classified PII entities
Precision: Of all items the model flagged, the share that were actually PII (higher means fewer false positives)
Recall: Of all PII present, the share the model successfully detected (higher means fewer false negatives)
F1 Score: Harmonic mean of precision and recall, which is high only when both are high
Over-detection Rate: Percentage of non-PII incorrectly flagged (lower is better)
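
As a rough illustration of how these metrics relate, the sketch below computes them from raw counts. It rests on assumptions: the `Counts` class and `compute_metrics` function are hypothetical names, and the over-detection rate is taken to be false positives divided by all non-PII candidate items (false positives plus true negatives), which is one plausible reading of the definition above. Overall Accuracy is left out because it also depends on whether each entity was given the right type.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # PII entities correctly flagged
    fp: int  # non-PII items incorrectly flagged
    fn: int  # PII entities the model missed
    tn: int  # non-PII items correctly left unflagged (assumed to be countable)


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def compute_metrics(c):
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    over_detection = _ratio(c.fp, c.fp + c.tn)  # assumed definition
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_detection_rate": over_detection,
    }
```

For example, 90 correct detections with 10 false positives and 10 misses give precision, recall, and F1 of 0.9 each. If those 10 false positives came from 200 non-PII items, the over-detection rate under this definition would be 0.05.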