SEC Executive Compensation Dataset
🚧 DATASET UNDER CONSTRUCTION 🚧
This dataset is actively being developed and expanded. The current version contains ~12,000 records out of a target of 100,000+ SEC filings (2005-2022).
What to expect:
- Data may contain errors or inconsistencies
- Schema and fields may change
- More records will be added regularly
- Statistics will be updated as processing continues
Use at your own risk for research purposes only.
🔗 Pipeline: github.com/pierpierpy/Execcomp-AI
Structured executive compensation data extracted from SEC DEF 14A proxy statements using AI.
from datasets import load_dataset
# Load all data
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")
# Load specific year only
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")
🎯 SCT Probability & Quality
Why sct_probability?
The main VLM (Qwen3-VL-32B) sometimes produces:
- False positives: non-SCT tables classified as SCT (e.g., Director Compensation tables)
- Duplicates: several tables in one document classified as SCT when only one is the real Summary Compensation Table
A fine-tuned binary classifier (Qwen3-VL-4B) scores each table with a probability (0-1) to help filter these cases.
💡 Recommendation: Filter by `sct_probability >= 0.7` to get high-confidence records only.
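The 0.7 threshold addresses false positives, but duplicates can survive it when two tables from the same filing both score highly. A minimal sketch of one way to handle that, assuming a filing is uniquely identified by `accession_number` and that records are plain dicts with the fields described in this card. The helper name `best_sct_per_filing` is hypothetical and not part of the pipeline.

```python
def best_sct_per_filing(records, threshold=0.7):
    """Keep at most one record per filing: the one with the highest sct_probability."""
    best = {}
    for rec in records:
        prob = rec["sct_probability"]
        # Drop low-confidence tables first (likely false positives)
        if prob is None or prob < threshold:
            continue
        key = rec["accession_number"]  # assumption: one accession number per filing
        if key not in best or prob > best[key]["sct_probability"]:
            best[key] = rec
    return list(best.values())


# Toy records for illustration only (not real dataset rows)
sample = [
    {"accession_number": "A-1", "sct_probability": 0.95},
    {"accession_number": "A-1", "sct_probability": 0.80},
    {"accession_number": "B-2", "sct_probability": 0.40},
    {"accession_number": "C-3", "sct_probability": 0.72},
]
kept = best_sct_per_filing(sample)
```

Rows of a loaded dataset are dicts, so the same function should work when iterating over the dataset directly, although that also decodes every table image along the way.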
💰 Compensation Statistics
Figures (not reproduced here): 🏆 Top 10 Highest Paid Executives, Compensation Distribution, Trends Over Time.
📋 Dataset Description
This dataset contains Summary Compensation Tables extracted from SEC filings, with:
- Original table images
- HTML table structure
- Structured JSON with executive compensation details
- SCT probability score from a fine-tuned binary classifier (to filter false positives)
💡 Tip: Filter by `sct_probability >= 0.7` to get high-confidence SCT records only.
Fields
| Field | Type | Description |
|---|---|---|
| `cik` | string | SEC Central Index Key |
| `company` | string | Company name |
| `year` | int | Filing year |
| `filing_date` | string | SEC filing date |
| `sic` | string | Standard Industrial Classification code |
| `state_of_inc` | string | State of incorporation |
| `filing_html_index` | string | Link to SEC filing |
| `accession_number` | string | SEC accession number |
| `table_image` | image | Extracted table image |
| `table_body` | string | HTML table content |
| `executives` | string | JSON array of executive compensation |
| `sct_probability` | float | Probability (0-1) that this is a real SCT (from fine-tuned classifier) |
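Because `cik` and `accession_number` are stored as strings, it can help to normalize them before joining with other EDGAR data. The sketch below assumes the common EDGAR conventions of a 10-digit zero-padded CIK and a dashed accession number of the form 10-2-6 digits; whether this dataset already stores values in those forms is not stated here, and both helper names are hypothetical.

```python
import re

# Dashed accession number: 10 digits, 2 digits, 6 digits
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")


def normalize_cik(cik):
    """Return the CIK as a 10-digit zero-padded string."""
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


def is_dashed_accession(accession_number):
    """True when the value looks like a dashed EDGAR accession number."""
    return bool(ACCESSION_RE.match(accession_number))


padded = normalize_cik("320193")
```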
Executive Schema
{
  "name": "John Smith",
  "title": "CEO",
  "fiscal_year": 2023,
  "salary": 500000,
  "bonus": 100000,
  "stock_awards": 2000000,
  "option_awards": 500000,
  "non_equity_incentive": 300000,
  "change_in_pension": 50000,
  "other_compensation": 25000,
  "total": 3475000
}
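In the current Summary Compensation Table format, the total column should equal the sum of the other compensation columns, which makes a simple consistency check on extracted values possible. The sketch below assumes that missing components may appear as null (an assumption about the extraction output, and older filings may not follow this layout). The names `components_sum` and `total_is_consistent` are hypothetical.

```python
import json

COMPONENT_FIELDS = (
    "salary", "bonus", "stock_awards", "option_awards",
    "non_equity_incentive", "change_in_pension", "other_compensation",
)


def components_sum(executive):
    """Sum the component fields, treating missing or null values as 0."""
    return sum(executive.get(field) or 0 for field in COMPONENT_FIELDS)


def total_is_consistent(executive, tolerance=1):
    """True when the reported total matches the component sum within `tolerance` dollars."""
    total = executive.get("total")
    if total is None:
        return False
    return abs(components_sum(executive) - total) <= tolerance


# The schema example from this card
example = json.loads("""
{"name": "John Smith", "title": "CEO", "fiscal_year": 2023,
 "salary": 500000, "bonus": 100000, "stock_awards": 2000000,
 "option_awards": 500000, "non_equity_incentive": 300000,
 "change_in_pension": 50000, "other_compensation": 25000,
 "total": 3475000}
""")
ok = total_is_consistent(example)
```

Records that fail the check are not necessarily wrong, but they are reasonable candidates for manual review against `table_image`.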
🚀 Quick Start
from datasets import load_dataset
import json
# Load full dataset
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")
# Or load a specific year
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")
# Filter by SCT probability (recommended to reduce false positives)
ds_filtered = ds.filter(lambda x: x["sct_probability"] >= 0.7)
# View a record
rec = ds_filtered[0]
print(rec["filing_html_index"])
display(rec["table_image"])  # PIL Image; display() is provided by Jupyter/IPython
print(json.dumps(json.loads(rec['executives']), indent=2))
Analyze with Pandas
import pandas as pd
import json
# Convert to DataFrame
df = ds_filtered.to_pandas()
# Parse all executives
all_execs = []
for _, row in df.iterrows():
    for exec_data in json.loads(row['executives']):
        exec_data['company'] = row['company']
        exec_data['year'] = row['year']
        all_execs.append(exec_data)
exec_df = pd.DataFrame(all_execs)
# Average compensation by year
print(exec_df.groupby('year')['total'].mean())
# Top 10 highest paid
print(exec_df.nlargest(10, 'total')[['name', 'company', 'year', 'total']])
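Since the data may contain errors, a more defensive version of the flattening step can be useful. The following is a dependency-free sketch, assuming records are dicts with `company`, `year`, and a JSON string in `executives`. It skips malformed JSON and ignores non-numeric totals. The function names are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


def flatten_executives(records):
    """Yield one dict per executive, tagged with the parent record's company and year."""
    for rec in records:
        try:
            executives = json.loads(rec["executives"])
        except (TypeError, json.JSONDecodeError):
            continue  # skip records whose JSON string is missing or malformed
        if not isinstance(executives, list):
            continue  # the card describes a JSON array; skip anything else
        for executive in executives:
            yield {**executive, "company": rec["company"], "year": rec["year"]}


def mean_total_by_year(records):
    """Average the numeric `total` values per filing year, ignoring missing totals."""
    totals = defaultdict(list)
    for row in flatten_executives(records):
        if isinstance(row.get("total"), (int, float)):
            totals[row["year"]].append(row["total"])
    return {year: mean(values) for year, values in sorted(totals.items())}


# Toy records for illustration only (not real dataset rows)
sample = [
    {"company": "Acme", "year": 2020,
     "executives": json.dumps([{"name": "A", "total": 100}, {"name": "B", "total": None}])},
    {"company": "Beta", "year": 2020,
     "executives": json.dumps([{"name": "C", "total": 300}])},
    {"company": "Gamma", "year": 2021, "executives": "not json"},
]
result = mean_total_by_year(sample)
```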
View Table Image
# Display table image (ds was loaded with split="train", so index it directly)
record = ds[0]
record["table_image"]  # PIL Image object
🔧 Source & Methodology
Data extracted from SEC EDGAR DEF 14A filings using:
- MinerU for PDF table extraction
- Qwen3-VL-32B for classification and extraction
Pipeline Steps
1. Download DEF 14A PDFs from SEC EDGAR
2. Extract tables with MinerU (VLM-based)
3. Classify tables to identify Summary Compensation Tables
4. Merge tables split across pages
5. Extract structured compensation data with VLM
📜 License
MIT License
📖 Citation
If you use this dataset in your research, please cite:
@dataset{execcomp_ai_2026,
  author = {Di Pasquale, Piergiorgio},
  title = {SEC Executive Compensation Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/pierjoe/execcomp-ai-sample},
  note = {AI-extracted executive compensation data from SEC DEF 14A filings (2005-2022)}
}
Or in text format:
Di Pasquale, P. (2026). SEC Executive Compensation Dataset. Hugging Face. https://huggingface.co/datasets/pierjoe/execcomp-ai-sample
🔗 Links
- GitHub: github.com/pierpierpy/Execcomp-AI
- SEC EDGAR: sec.gov/edgar