📄 Research Papers Dataset

Community Article Published May 23, 2026

A curated collection of academic research documents labeled by primary research topic. Built for anyone working on document understanding, classification pipelines, or multimodal ML tasks.

Tegridydev - Research-Papers

I'm currently building out the foundation topics and the raw .pdf research paper files, and working through the Markdown text conversions.

The goal is to process, clean, and convert the papers into high-quality training datasets.

Check it out, give it a like, and leave a comment below or join the community discussion to suggest which fields and research topics you want to see included!

  • Curated by: tegridy
  • Language: English
  • Formats: PDF, Markdown


What's in it

Each entry pairs a raw PDF with its topic label. The PDFs span multiple research domains and are sourced to give reasonable coverage across topic categories. If you're training a classifier, fine-tuning a document model, or stress-testing an OCR pipeline, this gives you labeled real-world documents rather than synthetic ones.


Repo Structure

research-papers/
├── pdfs/              # Raw PDF files
├── text-versions/     # PDF → Markdown processed text
└── metadata.csv       # Topic labels per document
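
A minimal sketch of reading the topic labels, assuming metadata.csv has a `file_name` column and a `topic` column. Both column names are hypothetical and the sample rows are made up, so check the real header before relying on this.

```python
import csv
import io
from collections import Counter


def read_labels(csv_text, file_col="file_name", label_col="topic"):
    """Map each document's file name to its topic label.

    The column names are assumptions for illustration, not confirmed
    against the real metadata.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[file_col]: row[label_col] for row in reader}


# Tiny made-up sample standing in for metadata.csv
sample_csv = (
    "file_name,topic\n"
    "pdfs/a.pdf,physics\n"
    "pdfs/b.pdf,biology\n"
    "pdfs/c.pdf,physics\n"
)

labels = read_labels(sample_csv)
topic_counts = Counter(labels.values())  # documents per topic
print(topic_counts.most_common())
```

For the real file, the text could be read with `Path("metadata.csv").read_text(encoding="utf-8")` and passed to the same function. Counting documents per topic first is a quick way to spot class imbalance before training.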

The text-versions/ folder contains pre-processed Markdown conversions of each PDF.

Review these before use: conversion quality varies, and some files may contain formatting artifacts or extraction errors.
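
One way to triage that review is a quick heuristic pass over each Markdown file. The sketch below is only an illustration: the function name, the checks, and the 0.6 threshold are assumptions, not part of the dataset's tooling, and a clean result does not prove a conversion is correct.

```python
import re


def conversion_flags(markdown_text, min_alpha_ratio=0.6):
    """Return a list of heuristic warnings for one converted document.

    The checks and the default threshold are illustrative assumptions.
    """
    if not markdown_text.strip():
        return ["empty"]

    flags = []
    # U+FFFD marks bytes that could not be decoded
    if "\ufffd" in markdown_text:
        flags.append("replacement_chars")
    # Words split by a hyphen at a line break, e.g. "implemen-\ntation"
    if re.search(r"[a-z]-\n[a-z]", markdown_text):
        flags.append("hyphenated_line_breaks")
    # A low share of letters among non-space characters often points
    # to garbled tables, math, or figure residue
    non_space = [c for c in markdown_text if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < min_alpha_ratio:
        flags.append("low_alpha_ratio")
    return flags


print(conversion_flags("implemen-\ntation of the method"))
```

Files that come back with any flag are good candidates for a manual look before they go into a training set.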

Usage

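from datasets import load_dataset

ds = load_dataset("tegridy/research-papers")

  • PDFs are loaded as raw bytes. For vision-based models, convert to images:

import fitz  # PyMuPDF

# Assumes the default "train" split; adjust if the dataset defines others
sample = ds["train"][0]

# Open the PDF from its in-memory bytes and render the first page
doc = fitz.open(stream=sample["pdf"], filetype="pdf")
pix = doc[0].get_pixmap()  # Pixmap of page 1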

For text-only pipelines, pull the Markdown directly from text-versions/.
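
A minimal sketch of pairing each PDF with its Markdown conversion on a local copy of the repo. It assumes every conversion is named after the PDF's stem (paper_a.pdf → paper_a.md); that naming scheme is a guess and is not documented here, so verify it against the actual files. The demo runs on a throwaway directory that mimics the layout above.

```python
from pathlib import Path
import tempfile


def pair_pdfs_with_text(root):
    """Yield (pdf_path, markdown_path_or_None) for every PDF under root/pdfs.

    Assumes each conversion is named <pdf stem>.md inside text-versions/;
    this naming scheme is hypothetical.
    """
    root = Path(root)
    for pdf_path in sorted((root / "pdfs").glob("*.pdf")):
        md_path = root / "text-versions" / f"{pdf_path.stem}.md"
        yield pdf_path, (md_path if md_path.exists() else None)


# Self-contained demo on a temporary directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "pdfs").mkdir()
    (repo / "text-versions").mkdir()
    (repo / "pdfs" / "paper_a.pdf").write_bytes(b"%PDF-1.4")
    (repo / "pdfs" / "paper_b.pdf").write_bytes(b"%PDF-1.4")
    (repo / "text-versions" / "paper_a.md").write_text("# Paper A\n", encoding="utf-8")

    pairs = [
        (pdf.name, md.name if md else None)
        for pdf, md in pair_pdfs_with_text(repo)
    ]

print(pairs)
```

Returning None for a missing conversion, rather than skipping the PDF, makes it easy to list which papers still lack a text version.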

Intended Use Cases

  • Document classification: topic label per PDF
  • Model fine-tuning: domain-specific academic text
  • OCR benchmarking: compare raw PDF extraction vs. ground truth
  • Multimodal document understanding: PDF-as-image + text pairs
  • Research access: easy access to a wide range of research papers
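
For the document classification use case, a held-out test set that keeps every topic represented is usually worth having. The following is a rough sketch of a per-topic (stratified) split in plain Python; the function name, the 20% test fraction, and the document ids are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split document ids into train/test lists, topic by topic.

    `labels` maps a document id to its topic label. Every topic with at
    least two documents contributes at least one document to the test set.
    """
    by_topic = defaultdict(list)
    for doc_id, topic in labels.items():
        by_topic[topic].append(doc_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for topic in sorted(by_topic):
        docs = sorted(by_topic[topic])
        rng.shuffle(docs)
        if len(docs) > 1:
            n_test = max(1, round(len(docs) * test_fraction))
        else:
            n_test = 0  # a single document stays in train
        test.extend(docs[:n_test])
        train.extend(docs[n_test:])
    return train, test


# Made-up ids and topics for illustration
labels = {f"physics_{i}.pdf": "physics" for i in range(10)}
labels.update({f"biology_{i}.pdf": "biology" for i in range(5)})

train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))
```

Splitting within each topic avoids the case where a rare topic lands entirely in one half, which a plain random split can produce on small collections.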

Notes

  • Text versions are auto-processed and not guaranteed to be accurate
  • Labels reflect the document's primary topic only
  • Dataset is English-only
