Research Papers Dataset
I'm currently building out the foundation topics and the raw PDF research paper files, and working through the Markdown text conversions.
The goal is to process, clean, and convert the papers into high-quality training datasets.
Check it out, give it a like, and leave a comment below, or join the community discussion to suggest which fields and research topics you'd like to see included!
- Curated by: tegridy
- Language: English
- Formats: PDF, Markdown
What's in it
Each entry pairs a raw PDF with its topic label. The PDFs span multiple research domains and are sourced to give reasonable coverage across topic categories. If you're training a classifier, fine-tuning a document model, or stress-testing an OCR pipeline, this gives you labeled real-world documents rather than synthetic ones.
Repo Structure
research-papers/
├── pdfs/            # Raw PDF files
├── text-versions/   # PDF → Markdown processed text
└── metadata.csv     # Topic labels per document
The text-versions/ folder contains pre-processed Markdown conversions of each PDF.
Review these before use: conversion quality varies, and some files may have formatting artifacts or extraction errors.
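As a starting point for that review, here is a minimal sketch of a heuristic check for one converted Markdown file. The function name `conversion_warnings`, the 200-character minimum, and the 15% symbol threshold are all illustrative assumptions, not part of this dataset's tooling, and a clean result does not guarantee a correct conversion.

```python
def conversion_warnings(text: str, min_chars: int = 200) -> list[str]:
    """Return heuristic warnings for one converted Markdown document."""
    warnings = []
    stripped = text.strip()

    # Very short output often means the PDF was scanned or extraction failed.
    if len(stripped) < min_chars:
        warnings.append("very short output")

    # U+FFFD is the Unicode replacement character left by failed decoding.
    if "\ufffd" in text:
        warnings.append("contains replacement characters")

    # A high share of characters outside letters, digits, whitespace, and
    # common punctuation can indicate garbled extraction (threshold is a guess).
    if stripped:
        allowed = ".,;:!?'\"()-#*_[]/"
        odd = sum(
            1 for ch in stripped
            if not (ch.isalnum() or ch.isspace() or ch in allowed)
        )
        if odd / len(stripped) > 0.15:
            warnings.append("high share of unusual symbols")

    return warnings
```

Files that come back with one or more warnings are good candidates for manual inspection first. Math-heavy papers may trip the symbol check legitimately, so treat the output as a triage list rather than a verdict.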
Usage
from datasets import load_dataset
ds = load_dataset("tegridy/research-papers")
- PDFs are loaded as raw bytes. For vision-based models, convert to images:
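import fitz  # PyMuPDF
# `sample` is one dataset row, e.g. sample = ds["train"][0] (the split name is an assumption)
doc = fitz.open(stream=sample["pdf"], filetype="pdf")
pix = doc[0].get_pixmap()  # render the first page to a raster image (Pixmap)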
For text-only pipelines, pull the text directly from text-versions/.
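If you are working from a local copy of the repo rather than through `load_dataset`, the sketch below pairs each PDF with its Markdown conversion. It assumes that a file such as `pdfs/foo.pdf` has its conversion at `text-versions/foo.md`; that naming convention is a guess and should be checked against the actual files. The helper name `pair_pdfs_with_text` is hypothetical.

```python
from pathlib import Path


def pair_pdfs_with_text(root: str) -> list[tuple[Path, Path]]:
    """Pair each PDF under pdfs/ with a same-stem .md file under text-versions/."""
    base = Path(root)

    # Assumption: foo.pdf corresponds to foo.md. Index Markdown files by stem.
    markdown_by_stem = {p.stem: p for p in (base / "text-versions").glob("*.md")}

    pairs = []
    for pdf in sorted((base / "pdfs").glob("*.pdf")):
        md = markdown_by_stem.get(pdf.stem)
        if md is not None:  # PDFs without a conversion are skipped
            pairs.append((pdf, md))
    return pairs
```

For a text-only pipeline, you would then read the second element of each pair with `md.read_text(encoding="utf-8")`.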
Intended Use Cases
- Document classification: topic label per PDF
- Model fine-tuning: domain-specific academic text
- OCR benchmarking: compare raw PDF extraction vs. ground truth
- Multimodal document understanding: PDF-as-image + text pairs
- Easy access to a wide range of research papers for researchers
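For the OCR benchmarking case, one simple way to compare an extraction against a reference text is a normalized similarity ratio. The sketch below uses Python's standard `difflib.SequenceMatcher`. This is one possible metric, not the dataset's official one, and the function names are illustrative. Because the text versions are auto-processed, treating them as ground truth only makes sense for files you have reviewed.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not dominate."""
    return " ".join(text.lower().split())


def extraction_similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio in [0.0, 1.0] between two texts."""
    matcher = SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()
```

`SequenceMatcher` can be slow on long inputs, so for full papers it may be more practical to score page by page or on sampled passages.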
Notes
- Text versions are auto-processed and not guaranteed to be accurate
- Labels reflect the document's primary topic only
- Dataset is English-only
