11.7 GB
60 files
Updated 1 day ago

SEC Executive Compensation Dataset

🚧 DATASET UNDER CONSTRUCTION 🚧

This dataset is actively being developed and expanded. The current version contains ~12,000 records out of a target of 100,000+ SEC filings (2005-2022).

What to expect:

  • Data may contain errors or inconsistencies
  • Schema and fields may change
  • More records will be added regularly
  • Statistics will be updated as processing continues

Use at your own risk for research purposes only.

🔗 Pipeline: github.com/pierpierpy/Execcomp-AI

Structured executive compensation data extracted from SEC DEF 14A proxy statements using AI.

from datasets import load_dataset

# Load all data
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")

# Load specific year only
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")

📊 Dataset Statistics

[Figures: Pipeline Stats, Document Breakdown, Tables by Year]

🎯 SCT Probability & Quality

Why sct_probability?

The main VLM (Qwen3-32B) sometimes makes two kinds of errors:

  1. False positives: it classifies non-SCT tables as SCT (e.g., Director Compensation tables).
  2. Duplicates: it classifies multiple tables in one document as SCT when only one is the real Summary Compensation Table.

A fine-tuned binary classifier (Qwen3-VL-4B) scores each table with a probability (0-1) to help filter these cases.

[Figures: Probability Stats, Probability Distribution]

💡 Recommendation: Filter by sct_probability >= 0.7 to get high-confidence records only.
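
The threshold handles false positives, but duplicates within one filing need a second step. The following is a minimal sketch, assuming that `accession_number` identifies a filing and that the highest-scoring table in a filing is the real SCT. Both are assumptions, not guarantees made by the dataset. The helper name `best_sct_per_filing` and the sample records are hypothetical.

```python
def best_sct_per_filing(records, threshold=0.7):
    """Keep at most one record per filing: the one with the highest sct_probability."""
    best = {}
    for rec in records:
        prob = rec["sct_probability"]
        if prob is None or prob < threshold:
            continue  # drop likely false positives
        key = rec["accession_number"]  # assumption: one accession number per filing
        if key not in best or prob > best[key]["sct_probability"]:
            best[key] = rec
    return list(best.values())


# Hypothetical records for illustration only (not taken from the dataset).
sample = [
    {"accession_number": "A-1", "sct_probability": 0.95},
    {"accession_number": "A-1", "sct_probability": 0.72},  # second table in the same filing
    {"accession_number": "B-2", "sct_probability": 0.30},  # likely a non-SCT table
    {"accession_number": "C-3", "sct_probability": 0.81},
]
kept = best_sct_per_filing(sample)
print([(r["accession_number"], r["sct_probability"]) for r in kept])
```

The function accepts any iterable of dict-like rows, so it should also work on rows obtained by iterating a loaded dataset. Whether one table per filing is the right rule depends on the filing, so treat the result as a heuristic.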


💰 Compensation Statistics

[Figures: Compensation Stats, Compensation Breakdown]

🏆 Top 10 Highest Paid Executives

[Figure: Top 10]

Compensation Distribution

[Figure: Compensation Distribution]

Trends Over Time

[Figure: Compensation Trends]

📋 Dataset Description

This dataset contains Summary Compensation Tables extracted from SEC filings, with:

  • Original table images
  • HTML table structure
  • Structured JSON with executive compensation details
  • SCT probability score from a fine-tuned binary classifier (to filter false positives)

💡 Tip: Filter by sct_probability >= 0.7 to get high-confidence SCT records only.

Fields

| Field | Type | Description |
|---|---|---|
| cik | string | SEC Central Index Key |
| company | string | Company name |
| year | int | Filing year |
| filing_date | string | SEC filing date |
| sic | string | Standard Industrial Classification code |
| state_of_inc | string | State of incorporation |
| filing_html_index | string | Link to SEC filing |
| accession_number | string | SEC accession number |
| table_image | image | Extracted table image |
| table_body | string | HTML table content |
| executives | string | JSON array of executive compensation |
| sct_probability | float | Probability (0-1) that this is a real SCT (from fine-tuned classifier) |

Executive Schema

{
  "name": "John Smith",
  "title": "CEO",
  "fiscal_year": 2023,
  "salary": 500000,
  "bonus": 100000,
  "stock_awards": 2000000,
  "option_awards": 500000,
  "non_equity_incentive": 300000,
  "change_in_pension": 50000,
  "other_compensation": 25000,
  "total": 3475000
}
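
In the example record above, `total` equals the sum of the seven pay components (500000 + 100000 + 2000000 + 500000 + 300000 + 50000 + 25000 = 3475000). Since the extracted data may contain errors, a consistency check along these lines can flag suspicious rows. This is a sketch, assuming all amounts are numeric or null. The helper names `component_sum` and `total_is_consistent` and the `tolerance` parameter are hypothetical.

```python
import json

# The seven component fields from the executive schema.
COMPONENTS = [
    "salary",
    "bonus",
    "stock_awards",
    "option_awards",
    "non_equity_incentive",
    "change_in_pension",
    "other_compensation",
]


def component_sum(executive):
    """Sum the pay components, treating missing or null values as 0."""
    return sum(executive.get(field) or 0 for field in COMPONENTS)


def total_is_consistent(executive, tolerance=1):
    """Return True if the reported total matches the component sum within `tolerance`."""
    total = executive.get("total")
    if total is None:
        return False
    return abs(component_sum(executive) - total) <= tolerance


example = json.loads("""
{
  "name": "John Smith", "title": "CEO", "fiscal_year": 2023,
  "salary": 500000, "bonus": 100000, "stock_awards": 2000000,
  "option_awards": 500000, "non_equity_incentive": 300000,
  "change_in_pension": 50000, "other_compensation": 25000,
  "total": 3475000
}
""")
print(component_sum(example))        # 3475000
print(total_is_consistent(example))  # True
```

A failed check does not necessarily mean the extraction is wrong: the filing itself may round amounts or report a total that differs from the listed columns, so a mismatch is best read as a prompt to inspect the table image.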

🚀 Quick Start

from datasets import load_dataset
import json

# Load full dataset
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")

# Or load a specific year
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")

# Filter by SCT probability (recommended to reduce false positives)
ds_filtered = ds.filter(lambda x: x["sct_probability"] >= 0.7)

# View a record
rec = ds_filtered[0]
print(rec["filing_html_index"])
# display() is only available in Jupyter/IPython; in a plain script use rec["table_image"].show()
display(rec["table_image"])  # PIL Image
print(json.dumps(json.loads(rec["executives"]), indent=2))

Analyze with Pandas

import pandas as pd
import json

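# Convert to DataFrame (ds_filtered comes from the Quick Start snippet above).
# Drop the image column first so the table images are not copied into the frame.
df = ds_filtered.remove_columns(["table_image"]).to_pandas()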

# Parse all executives
all_execs = []
for _, row in df.iterrows():
    for exec_data in json.loads(row['executives']):
        exec_data['company'] = row['company']
        exec_data['year'] = row['year']
        all_execs.append(exec_data)

exec_df = pd.DataFrame(all_execs)

# Average compensation by year
print(exec_df.groupby('year')['total'].mean())

# Top 10 highest paid
print(exec_df.nlargest(10, 'total')[['name', 'company', 'year', 'total']])
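
Because the data may contain errors, the parsing loop above can fail on a malformed `executives` string, and the aggregates can be skewed by a non-numeric `total`. The following is a defensive sketch using only the standard library. The helper names `parse_executives` and `to_number` are hypothetical, and the sample rows are invented for illustration.

```python
import json
from collections import defaultdict
from statistics import mean


def parse_executives(raw):
    """Return the list of executive dicts, or [] if the JSON is missing or malformed."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # json.JSONDecodeError is a ValueError
        return []
    if not isinstance(parsed, list):
        return []
    return [e for e in parsed if isinstance(e, dict)]


def to_number(value):
    """Coerce ints, floats, and strings like '1,250,000' to float; return None otherwise."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.replace(",", "").replace("$", "").strip())
        except ValueError:
            return None
    return None


# Invented rows that mimic the dataset's `year` and `executives` fields.
rows = [
    {"year": 2020, "executives": '[{"name": "A", "total": 1000000}, {"name": "B", "total": "2,000,000"}]'},
    {"year": 2020, "executives": "not valid json"},
    {"year": 2021, "executives": '[{"name": "C", "total": null}, {"name": "D", "total": 500000}]'},
]

totals_by_year = defaultdict(list)
for row in rows:
    for executive in parse_executives(row["executives"]):
        total = to_number(executive.get("total"))
        if total is not None:
            totals_by_year[row["year"]].append(total)

avg = {year: mean(values) for year, values in totals_by_year.items()}
print(avg)  # {2020: 1500000.0, 2021: 500000.0}
```

Skipping bad rows silently is a design choice made here for brevity; counting the skipped rows would give a better sense of how much of the data is affected.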

View Table Image

# Display table image
# ds was loaded with split="train", so it is a Dataset and is indexed directly
record = ds[0]
record["table_image"]  # PIL Image object

🔧 Source & Methodology

Data is extracted from SEC EDGAR DEF 14A filings using a multi-step pipeline:

Pipeline Steps

  1. Download DEF 14A PDFs from SEC EDGAR
  2. Extract tables with MinerU (VLM-based)
  3. Classify tables to identify Summary Compensation Tables
  4. Merge tables split across pages
  5. Extract structured compensation data with VLM

📜 License

MIT License


📖 Citation

If you use this dataset in your research, please cite:

@dataset{execcomp_ai_2026,
  author       = {Di Pasquale, Piergiorgio},
  title        = {SEC Executive Compensation Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/pierjoe/execcomp-ai-sample},
  note         = {AI-extracted executive compensation data from SEC DEF 14A filings (2005-2022)}
}

Or in text format:

Di Pasquale, P. (2026). SEC Executive Compensation Dataset. Hugging Face. https://huggingface.co/datasets/pierjoe/execcomp-ai-sample

🔗 Links

Total size: 11.7 GB
Files: 60
Last updated: Jun 27