📄 Research Papers Dataset

Published May 23, 2026

A curated collection of academic research documents labeled by primary research topic. Built for anyone working on document understanding, classification pipelines, or multimodal ML tasks.

Tegridydev - Research-Papers

I'm currently building out the foundation topics and the raw PDF research paper files, and working through the Markdown text conversions.

The goal I'm working towards is to process, clean, and convert the papers into high-quality training datasets!

Check it out, give it a like, and leave a comment below or join the community discussion to suggest which fields and research topics you want to see included!

Curated by: tegridy
Language: English
Formats: PDF, Markdown

What's in it

Each entry pairs a raw PDF with its topic label. The PDFs span multiple research domains and are sourced to give reasonable coverage across topic categories. If you're training a classifier, fine-tuning a document model, or stress-testing an OCR pipeline, this gives you labeled real-world documents rather than synthetic ones.

Repo Structure

research-papers/
├── pdfs/              # Raw PDF files
├── text-versions/     # PDF → Markdown processed text
└── metadata.csv       # Topic labels per document

The text-versions/ folder contains pre-processed Markdown conversions of each PDF.

Review these before use — conversion quality varies and some may have formatting artifacts or extraction errors.
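One way to triage the conversions before a manual review is a small heuristic check for common PDF-to-text artifacts. The sketch below is only an illustration: the function name, the thresholds, and the specific patterns are assumptions, not part of the dataset or its tooling.

```python
import re


def conversion_warnings(markdown_text: str, min_chars: int = 200) -> list[str]:
    """Return warnings for common PDF-to-text artifacts (heuristic, not exhaustive)."""
    warnings = []
    # Very little text often means the extraction failed or the PDF is scanned.
    if len(markdown_text.strip()) < min_chars:
        warnings.append("very short text (possible failed extraction)")
    # U+FFFD marks bytes that could not be decoded.
    if "\ufffd" in markdown_text:
        warnings.append("contains Unicode replacement characters")
    # Some extractors emit "(cid:NN)" when a glyph has no Unicode mapping.
    if re.search(r"\(cid:\d+\)", markdown_text):
        warnings.append("contains unmapped glyph codes like (cid:12)")
    # Words split by a hyphen at a line break, e.g. "implemen-\ntation".
    if len(re.findall(r"[a-z]-\n[a-z]", markdown_text)) > 5:
        warnings.append("many words hyphenated across line breaks")
    return warnings
```

An empty list does not mean a file is clean, only that none of these particular patterns showed up, so spot-checking a sample by eye is still worthwhile.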

Usage

from datasets import load_dataset
 
ds = load_dataset("tegridy/research-papers")

PDFs are loaded as raw bytes. For vision-based models, convert to images:

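import fitz  # PyMuPDF

# Assumes a "train" split; print(ds) to see the actual split names.
sample = ds["train"][0]
doc = fitz.open(stream=sample["pdf"], filetype="pdf")  # open the PDF from raw bytes
pix = doc[0].get_pixmap()  # render the first page to a pixmap (raster image)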
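PyMuPDF renders at 72 DPI by default, which can be too coarse for small text in a vision model. As a rough sketch, the helpers below estimate the output size for a chosen DPI and render a page to PNG bytes. The function names and the default of 144 DPI are assumptions chosen for illustration.

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / 72  # PDF user space is 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


def render_page_png(pdf_bytes: bytes, page_index: int = 0, dpi: int = 144) -> bytes:
    """Render one page of a PDF (given as raw bytes) to PNG bytes."""
    import fitz  # PyMuPDF; imported lazily so pixel_size works without it

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        pix = doc[page_index].get_pixmap(dpi=dpi)
        return pix.tobytes("png")
    finally:
        doc.close()
```

For example, an A4-sized page of roughly 595 x 842 points comes out at about 1190 x 1684 pixels at 144 DPI, which helps when picking a resolution that fits a model's input size.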

For text-only pipelines, pull the Markdown directly from text-versions/.
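If you work from a local copy of the repo, a sketch like the following pairs each PDF with its Markdown conversion. It assumes the conversions use a .md extension and share the PDF's filename stem, which the repo structure suggests but does not guarantee; the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def pair_documents(root: str) -> list[tuple[Path, Optional[Path]]]:
    """Pair each PDF under pdfs/ with a same-named .md file under text-versions/."""
    root_path = Path(root)
    # Index the Markdown files by filename stem (assumed naming convention).
    md_by_stem = {p.stem: p for p in (root_path / "text-versions").glob("*.md")}
    pairs = []
    for pdf in sorted((root_path / "pdfs").glob("*.pdf")):
        # None marks a PDF whose conversion is missing or named differently.
        pairs.append((pdf, md_by_stem.get(pdf.stem)))
    return pairs
```

PDFs that come back paired with None are worth a look: they either have no conversion yet or follow a different naming scheme than assumed here.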

Intended Use Cases

Document classification — topic label per PDF
Model fine-tuning — domain-specific academic text
OCR benchmarking — compare raw PDF extraction vs. ground truth
Multimodal document understanding — PDF-as-image + text pairs
Easy access to a broad range of research papers for researchers
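For the classification use case, it helps to check how balanced the topic labels are before training. The sketch below counts labels from metadata.csv. The column name "topic" is an assumption, since the actual header is not documented here, so adjust it to match the file.

```python
import csv
from collections import Counter
from typing import Iterable


def topic_counts(csv_lines: Iterable[str], topic_column: str = "topic") -> Counter:
    """Count documents per topic label from CSV lines that include a header row."""
    reader = csv.DictReader(csv_lines)
    # "topic" is a hypothetical column name; pass the real one if it differs.
    return Counter(row[topic_column].strip() for row in reader)
```

Usage with a local copy of the file might look like `with open("metadata.csv", newline="", encoding="utf-8") as f: print(topic_counts(f).most_common())`. A heavily skewed distribution is a hint to stratify the train/test split or reweight classes.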

Notes

Text versions are auto-processed and not guaranteed accurate
Labels reflect the document's primary topic only
Dataset is English-only
