Bucket: aigentic/execcomp-ai-bucket

11.7 GB, 60 files, updated 1 day ago

Name	Size	Uploaded	Xet hash
data		1 day ago	41 items
docs		1 day ago	17 items
.gitattributes	2.46 kB xet	1 day ago	19463de8
README.md	8.86 kB xet	1 day ago	f2588440

README.md

SEC Executive Compensation Dataset

🚧 DATASET UNDER CONSTRUCTION 🚧

This dataset is actively being developed and expanded. The current version contains ~12,000 records out of a target of 100,000+ SEC filings (2005-2022).

What to expect:

Data may contain errors or inconsistencies

Schema and fields may change

More records will be added regularly

Statistics will be updated as processing continues

Use at your own risk, and for research purposes only.

🔗 Pipeline: github.com/pierpierpy/Execcomp-AI

Structured executive compensation data extracted from SEC DEF 14A proxy statements using AI.

from datasets import load_dataset

# Load all data
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")

# Load specific year only
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")

📊 Dataset Statistics

🎯 SCT Probability & Quality

Why `sct_probability`?

The main VLM (Qwen3-VL-32B) sometimes makes two kinds of mistakes:

False positives: it classifies non-SCT tables as SCT (e.g., Director Compensation tables)
Duplicates: it classifies several tables in one document as SCT when only one is the real Summary Compensation Table

A fine-tuned binary classifier (Qwen3-VL-4B) scores each table with a probability (0-1) to help filter these cases.

💡 Recommendation: Filter by sct_probability >= 0.7 to get high-confidence records only.
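
The threshold filter handles false positives, but duplicates within one filing can still pass it. The following is a minimal sketch of one way to address both, assuming that `accession_number` identifies a single filing and that each filing should contribute one SCT. The helper name `select_sct_records` and the sample accession values are hypothetical and not part of the dataset or pipeline.

```python
def select_sct_records(records, threshold=0.7):
    """Keep, per filing, the one table with the highest sct_probability.

    Assumption: records sharing an accession_number come from the same
    filing, and only one of them is the real Summary Compensation Table.
    """
    best = {}
    for rec in records:
        # Drop low-confidence tables first (likely false positives).
        if rec["sct_probability"] < threshold:
            continue
        key = rec["accession_number"]
        # Among the remaining tables of a filing, keep the top-scoring one.
        if key not in best or rec["sct_probability"] > best[key]["sct_probability"]:
            best[key] = rec
    return list(best.values())


# Placeholder records; real rows carry the full set of dataset fields.
sample = [
    {"accession_number": "A-1", "sct_probability": 0.95},
    {"accession_number": "A-1", "sct_probability": 0.81},  # duplicate candidate
    {"accession_number": "A-2", "sct_probability": 0.40},  # below threshold
    {"accession_number": "A-3", "sct_probability": 0.70},
]
selected = select_sct_records(sample)
```

Keeping only the top-scoring table is a deliberate simplification. If a filing legitimately yields more than one SCT row, this approach would discard the extras, so it is worth spot-checking a few filings before relying on it.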

💰 Compensation Statistics

🏆 Top 10 Highest Paid Executives

Compensation Distribution

Trends Over Time

📋 Dataset Description

This dataset contains Summary Compensation Tables extracted from SEC filings, with:

Original table images
HTML table structure
Structured JSON with executive compensation details
SCT probability score from a fine-tuned binary classifier (to filter false positives)

💡 Tip: Filter by sct_probability >= 0.7 to get high-confidence SCT records only.

Fields

Field	Type	Description
`cik`	string	SEC Central Index Key
`company`	string	Company name
`year`	int	Filing year
`filing_date`	string	SEC filing date
`sic`	string	Standard Industrial Classification code
`state_of_inc`	string	State of incorporation
`filing_html_index`	string	Link to SEC filing
`accession_number`	string	SEC accession number
`table_image`	image	Extracted table image
`table_body`	string	HTML table content
`executives`	string	JSON array of executive compensation
`sct_probability`	float	Probability (0-1) that this is a real SCT (from fine-tuned classifier)

Executive Schema

{
  "name": "John Smith",
  "title": "CEO",
  "fiscal_year": 2023,
  "salary": 500000,
  "bonus": 100000,
  "stock_awards": 2000000,
  "option_awards": 500000,
  "non_equity_incentive": 300000,
  "change_in_pension": 50000,
  "other_compensation": 25000,
  "total": 3475000
}
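
Since extracted values may contain errors, one inexpensive sanity check is to compare the reported `total` with the sum of the other components. The sketch below uses the example record above. The helper names and the treatment of missing values as zero are assumptions for illustration, not documented behavior of the dataset.

```python
import json

# Compensation components from the executive schema, excluding "total".
COMPONENT_FIELDS = (
    "salary",
    "bonus",
    "stock_awards",
    "option_awards",
    "non_equity_incentive",
    "change_in_pension",
    "other_compensation",
)


def component_sum(executive):
    """Sum the component fields, treating missing or null values as 0 (assumption)."""
    return sum(executive.get(field) or 0 for field in COMPONENT_FIELDS)


def is_consistent(executive, tolerance=1.0):
    """True if the reported total matches the component sum within tolerance."""
    total = executive.get("total")
    if total is None:
        return False
    return abs(component_sum(executive) - total) <= tolerance


# The `executives` field is a JSON string holding an array of such objects.
executives_json = json.dumps([{
    "name": "John Smith", "title": "CEO", "fiscal_year": 2023,
    "salary": 500000, "bonus": 100000, "stock_awards": 2000000,
    "option_awards": 500000, "non_equity_incentive": 300000,
    "change_in_pension": 50000, "other_compensation": 25000,
    "total": 3475000,
}])
checks = [is_consistent(e) for e in json.loads(executives_json)]
```

Records that fail this check are not necessarily wrong (a filing may round values or leave a column blank), but they are reasonable candidates for manual review against `table_image`.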

🚀 Quick Start

from datasets import load_dataset
import json

# Load full dataset
ds = load_dataset("pierjoe/execcomp-ai-sample", split="train")

# Or load a specific year
ds_2020 = load_dataset("pierjoe/execcomp-ai-sample", split="year_2020")

# Filter by SCT probability (recommended to reduce false positives)
ds_filtered = ds.filter(lambda x: x["sct_probability"] >= 0.7)

# View a record
rec = ds_filtered[0]
print(rec["filing_html_index"])
# PIL Image; display() exists only in Jupyter/IPython (use .show() in a plain script)
display(rec["table_image"])
print(json.dumps(json.loads(rec['executives']), indent=2))

Analyze with Pandas

import pandas as pd
import json

# Convert to DataFrame
df = ds_filtered.to_pandas()

# Parse all executives
all_execs = []
for _, row in df.iterrows():
    for exec_data in json.loads(row['executives']):
        exec_data['company'] = row['company']
        exec_data['year'] = row['year']
        all_execs.append(exec_data)

exec_df = pd.DataFrame(all_execs)

# Average compensation by year
print(exec_df.groupby('year')['total'].mean())

# Top 10 highest paid
print(exec_df.nlargest(10, 'total')[['name', 'company', 'year', 'total']])
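
Because the dataset may contain extraction errors, a yearly mean can be dominated by a few misread values. As an alternative to the mean used above, here is a minimal standard-library sketch that computes the median total per year and skips missing totals. The function name `median_total_by_year` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_total_by_year(rows):
    """Group totals by year and return {year: median total}, ignoring non-numeric totals."""
    totals = defaultdict(list)
    for row in rows:
        total = row.get("total")
        if isinstance(total, (int, float)):  # skip None or malformed values
            totals[row["year"]].append(total)
    return {year: median(values) for year, values in sorted(totals.items())}


# Made-up rows shaped like the flattened executive records.
rows = [
    {"year": 2019, "total": 1_000_000},
    {"year": 2019, "total": 1_200_000},
    {"year": 2019, "total": 900_000_000},  # implausible value, e.g. a misread table
    {"year": 2020, "total": 2_000_000},
    {"year": 2020, "total": None},
]
medians = median_total_by_year(rows)
```

With the flattened DataFrame, the equivalent call would be `exec_df.groupby('year')['total'].median()`.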

View Table Image

# Display table image (ds was loaded with split="train", so index it directly)
record = ds[0]
record["table_image"]  # PIL Image object

🔧 Source & Methodology

Data extracted from SEC EDGAR DEF 14A filings using:

MinerU for PDF table extraction
Qwen3-VL-32B for classification and extraction

Pipeline Steps

Download DEF 14A PDFs from SEC EDGAR
Extract tables with MinerU (VLM-based)
Classify tables to identify Summary Compensation Tables
Merge tables split across pages
Extract structured compensation data with VLM

📜 License

MIT License

📖 Citation

If you use this dataset in your research, please cite:

@dataset{execcomp_ai_2026,
  author       = {Di Pasquale, Piergiorgio},
  title        = {SEC Executive Compensation Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/pierjoe/execcomp-ai-sample},
  note         = {AI-extracted executive compensation data from SEC DEF 14A filings (2005-2022)}
}

Or in text format:

Di Pasquale, P. (2026). SEC Executive Compensation Dataset. Hugging Face. https://huggingface.co/datasets/pierjoe/execcomp-ai-sample

🔗 Links

GitHub: github.com/pierpierpy/Execcomp-AI
SEC EDGAR: sec.gov/edgar

Last updated: Jun 27
