IPO-Mine
Collection
All the datasets and models used in the paper titled IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents
A fine-tuned YOLOv8s model trained to classify images extracted from U.S. IPO registration statements (S-1 and F-1 filings) on SEC EDGAR. This model serves as the initial detection stage in the pipeline used to construct the gtfintechlab/ipo-images dataset.
The model classifies images into 5 categories:
| Label | Description |
|---|---|
| `chart` | Bar charts, line charts, pie charts, org charts, flow charts, etc. |
| `logo` | Company logos and branding marks |
| `map` | Geographic maps |
| `infographic` | Composite visuals combining data, icons, and text |
| `other` | Decorative images, photographs, signatures, and other visuals |
pip install ultralytics
from ultralytics import YOLO
model = YOLO("<path/to/model.pt>")
# Single image
results = model("path/to/image.png")
print(results[0].probs.top1) # top class index
print(results[0].names) # class name mapping
# With a confidence threshold
results = model("path/to/image.png", conf=0.5)
# Batch inference
results = model(["image1.png", "image2.png", "image3.png"])
for r in results:
    # probs.top1 is the index of the most likely class; names maps index -> label
    print(r.probs.top1, r.names[r.probs.top1])
result = model("image.png")[0]
label = result.names[result.probs.top1]
print(label) # e.g. "chart"
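The snippets above read only the top class. As a minimal sketch, the helpers below rank all classes and apply a confidence threshold by hand, in case the `conf` argument does not filter classification outputs in your Ultralytics version. The function names `rank_labels` and `label_or_none`, the class-index ordering in `names`, and the probability values are all hypothetical and chosen for illustration; with a real result you would pass `result.probs.data.tolist()` and `result.names` instead.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def rank_labels(
    probs: Sequence[float], names: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(names[i], float(probs[i])) for i in order[:k]]


def label_or_none(
    probs: Sequence[float], names: Dict[int, str], threshold: float = 0.5
) -> Optional[str]:
    """Return the top label, or None if its probability is below threshold."""
    ranked = rank_labels(probs, names, k=1)
    if not ranked:
        return None
    label, score = ranked[0]
    return label if score >= threshold else None


# Hypothetical index-to-label mapping and probabilities (not taken from the model).
# With a real prediction: probs = result.probs.data.tolist(); names = result.names
names = {0: "chart", 1: "infographic", 2: "logo", 3: "map", 4: "other"}
probs = [0.70, 0.05, 0.12, 0.03, 0.10]

print(rank_labels(probs, names, k=2))
print(label_or_none(probs, names, threshold=0.5))
print(label_or_none(probs, names, threshold=0.9))
```

Returning `None` for low-confidence images leaves the decision to the caller, for example routing those images to manual review rather than forcing them into `other`.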
This model is the first stage of the classification pipeline used to build the gtfintechlab/ipo-images dataset, a large-scale collection of 76,000+ labeled images from SEC IPO filings spanning 1994–2026.
The pipeline works as follows:
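1. This model produces an initial prediction (`initial_yolo_prediction`) for each image
2. LLMs then verify that prediction, yielding a verification score (`llm_yolo_verification_score`) and per-model votes (`llm_yolo_verification_votes`)
3. The final `label` in the dataset reflects this verified output

If you use this model in your work, please cite: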
@misc{galarnyk2026ipomine,
title = {IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents},
author = {Galarnyk, Michael and Lohani, Siddharth and Nandi, Sagnik and Patel, Aman and Kannan, Vidhyakshaya and Banerjee, Prasun and Routu, Rutwik and Ye, Liqin and Hiray, Arnav and Somani, Siddhartha and Chava, Sudheer},
year = {2026},
url = {https://huggingface.co/datasets/gtfintechlab/ipo-images},
note = {Preprint/Working Paper}
}
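To make the pipeline fields described earlier more concrete, here is a rough sketch of how dataset records might be filtered by verification score and compared against the model's initial prediction. It assumes each record is a dictionary, that `llm_yolo_verification_score` is numeric with higher meaning stronger agreement, and that `initial_yolo_prediction` and `label` are plain strings; none of these types are confirmed here. The helper names and sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def filter_verified(records: Iterable[Record], min_score: float = 0.5) -> List[Record]:
    """Keep records whose verification score is at least min_score.

    Assumes the score is numeric; records with a missing score are dropped.
    """
    kept = []
    for rec in records:
        score = rec.get("llm_yolo_verification_score")
        if score is not None and float(score) >= min_score:
            kept.append(rec)
    return kept


def yolo_agreement_rate(records: Iterable[Record]) -> float:
    """Fraction of records whose initial YOLO prediction equals the final label."""
    records = list(records)
    if not records:
        return 0.0
    agree = sum(
        1 for r in records if r.get("initial_yolo_prediction") == r.get("label")
    )
    return agree / len(records)


# Hypothetical records for illustration only; not real dataset rows.
sample = [
    {"initial_yolo_prediction": "chart", "label": "chart", "llm_yolo_verification_score": 1.0},
    {"initial_yolo_prediction": "logo", "label": "other", "llm_yolo_verification_score": 0.2},
    {"initial_yolo_prediction": "map", "label": "map", "llm_yolo_verification_score": 0.8},
]

print(len(filter_verified(sample, min_score=0.5)))
print(yolo_agreement_rate(sample))
```

Check the actual column types on the dataset page before relying on a numeric threshold like this.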
Base model
Ultralytics/YOLOv8