---
title: README
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: We index the internet and publish it as Parquet
---

# OpenIndex

We index the internet and publish it as Parquet. 15 datasets, 199K+ downloads, all queryable with DuckDB or `load_dataset`.

### Web

We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-markdown](https://huggingface.co/datasets/open-index/open-markdown) | Clean markdown from Common Crawl with URL, language, content metadata | Billions of pages |

### Social

The two largest community archives on the internet, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [hacker-news](https://huggingface.co/datasets/open-index/hacker-news) | Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min | 47M+ items |
| [hacker-news-rss](https://huggingface.co/datasets/open-index/hacker-news-rss) | RSS feeds discovered from links posted on HN | 623K feeds |
| [arctic](https://huggingface.co/datasets/open-index/arctic) | The Arctic Shift Reddit archive. 8.3B comments, 2.2B submissions | 10.5B items |

### Code

Full mirrors of the major package registries and GitHub's public event stream. If you want to study how open source actually works, start here.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-github](https://huggingface.co/datasets/open-index/open-github) | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
| [open-github-issues](https://huggingface.co/datasets/open-index/open-github-issues) | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
| [open-npm](https://huggingface.co/datasets/open-index/open-npm) | Every npm package with versions, dependencies, maintainers, download stats | 35M+ rows |
| [open-pypi](https://huggingface.co/datasets/open-index/open-pypi) | Every PyPI package with releases, classifiers, dependencies, project URLs | 47M+ rows |

### Academia

Structured dumps of the two biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv) | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
| [open-alex](https://huggingface.co/datasets/open-index/open-alex) | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |

### Knowledge

Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-wikipedia](https://huggingface.co/datasets/open-index/open-wikipedia) | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
| [open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown) | Same articles, converted to clean Markdown | All articles |
| [open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text) | Same articles, as plain text | All articles |
| [open-library](https://huggingface.co/datasets/open-index/open-library) | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |

### AI

The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-skills](https://huggingface.co/datasets/open-index/open-skills) | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |
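
Every dataset listed here is described as queryable with DuckDB or `load_dataset`. The sketch below shows roughly what both routes could look like in Python. Treat it as an illustration under assumptions, not documented usage: the helper names (`parquet_glob`, `duckdb_preview_sql`, `preview_with_duckdb`, `stream_with_datasets`) are hypothetical, the `**/*.parquet` file layout and the `train` split are guesses that may not hold for every repo, and some datasets may need a config name as well. It requires the third-party `duckdb` and `datasets` packages.

```python
"""Sketch: previewing an OpenIndex dataset with DuckDB or `datasets`.

Assumptions (not confirmed by this README): Parquet files match
`**/*.parquet` inside each repo, and a `train` split exists.
"""
from itertools import islice

ORG = "open-index"  # organization name taken from the dataset URLs above


def parquet_glob(dataset: str, pattern: str = "**/*.parquet") -> str:
    """Build an hf:// glob for a dataset repo (file layout is an assumption)."""
    return f"hf://datasets/{ORG}/{dataset}/{pattern}"


def duckdb_preview_sql(dataset: str, limit: int = 5) -> str:
    """Return a SQL string that previews the first `limit` rows."""
    return f"SELECT * FROM read_parquet('{parquet_glob(dataset)}') LIMIT {limit}"


def preview_with_duckdb(dataset: str, limit: int = 5) -> list:
    """Run the preview query; needs network access and DuckDB's httpfs extension."""
    import duckdb  # imported lazily so the pure helpers work without it

    return duckdb.sql(duckdb_preview_sql(dataset, limit)).fetchall()


def stream_with_datasets(dataset: str, split: str = "train", n: int = 5) -> list:
    """Stream the first `n` rows without downloading the whole dataset."""
    from datasets import load_dataset  # imported lazily, same reason as above

    rows = load_dataset(f"{ORG}/{dataset}", split=split, streaming=True)
    return list(islice(rows, n))


if __name__ == "__main__":
    # Prints the query only; call preview_with_duckdb() to actually run it.
    print(duckdb_preview_sql("hacker-news"))
```

Streaming is the safer default for the larger entries such as `arctic` or `open-markdown`, since it avoids pulling the full dataset to disk before the first row is available. If a query fails on the glob, check the repo's file listing and pass a narrower `pattern` to `parquet_glob`.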