---
title: README
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: We index the internet and publish it as Parquet
---

# OpenIndex

We index the internet and publish it as Parquet. 15 datasets, 199K+ downloads, all queryable with DuckDB or `load_dataset`.
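As a minimal sketch of the two access paths mentioned above, the helpers below build a DuckDB query over a dataset's Parquet files and stream rows with `load_dataset`. The function names are hypothetical. The `**/*.parquet` file layout and the `train` split name are assumptions, not documented facts about these repos, and some datasets may also need a config name. Check each dataset card before relying on them.

```python
from itertools import islice

ORG = "open-index"  # organization name taken from the dataset URLs in this README


def parquet_glob(dataset: str) -> str:
    """Return an hf:// glob matching every Parquet file in the dataset repo.

    Assumption: the Parquet files live somewhere under the repo root.
    """
    return f"hf://datasets/{ORG}/{dataset}/**/*.parquet"


def preview_sql(dataset: str, limit: int = 5) -> str:
    """Build a DuckDB query that reads the first `limit` rows."""
    return f"SELECT * FROM read_parquet('{parquet_glob(dataset)}') LIMIT {int(limit)}"


def preview_with_duckdb(dataset: str, limit: int = 5) -> list:
    """Run the preview query; needs `pip install duckdb` and network access."""
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(preview_sql(dataset, limit)).fetchall()


def preview_with_datasets(dataset: str, limit: int = 5, split: str = "train") -> list:
    """Stream the first rows; needs `pip install datasets` and network access.

    Assumption: the repo exposes a split named "train".
    """
    from datasets import load_dataset  # imported lazily, same reason as above

    rows = load_dataset(f"{ORG}/{dataset}", split=split, streaming=True)
    return list(islice(rows, limit))
```

For example, `preview_with_duckdb("hacker-news")` or `preview_with_datasets("open-arxiv")` would fetch a handful of rows without downloading the full dataset, provided the assumptions above hold for that repo.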

### Web

We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-markdown](https://huggingface.co/datasets/open-index/open-markdown) | Clean Markdown from Common Crawl with URL, language, and content metadata | Billions of pages |

### Social

Hacker News and Reddit, two of the largest community archives on the internet, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [hacker-news](https://huggingface.co/datasets/open-index/hacker-news) | Every HN story, comment, poll, and job post since 2006. Updated live every 5 minutes | 47M+ items |
| [hacker-news-rss](https://huggingface.co/datasets/open-index/hacker-news-rss) | RSS feeds discovered from links posted on HN | 623K feeds |
| [arctic](https://huggingface.co/datasets/open-index/arctic) | The Arctic Shift Reddit archive. 8.3B comments, 2.2B submissions | 10.5B items |

### Code

Full mirrors of the npm and PyPI package registries, plus GitHub's public event stream. If you want to study how open source actually works, start here.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-github](https://huggingface.co/datasets/open-index/open-github) | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
| [open-github-issues](https://huggingface.co/datasets/open-index/open-github-issues) | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
| [open-npm](https://huggingface.co/datasets/open-index/open-npm) | Every npm package with versions, dependencies, maintainers, download stats | 35M+ rows |
| [open-pypi](https://huggingface.co/datasets/open-index/open-pypi) | Every PyPI package with releases, classifiers, dependencies, project URLs | 47M+ rows |

### Academia

Structured dumps of arXiv and OpenAlex, two of the biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv) | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
| [open-alex](https://huggingface.co/datasets/open-index/open-alex) | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |

### Knowledge

Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-wikipedia](https://huggingface.co/datasets/open-index/open-wikipedia) | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
| [open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown) | Same articles, converted to clean Markdown | All articles |
| [open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text) | Same articles, as plain text | All articles |
| [open-library](https://huggingface.co/datasets/open-index/open-library) | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |

### AI

The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-skills](https://huggingface.co/datasets/open-index/open-skills) | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |