tamnd

Update org card

f263283 verified about 1 month ago

	---
	title: README
	emoji: 📊
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	short_description: We index the internet and publish it as Parquet
	---

	# OpenIndex

	We index the internet and publish it as Parquet. 15 datasets, 199K+ downloads, all queryable with DuckDB or `load_dataset`.
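The two access paths mentioned above can be sketched as follows. This is a minimal illustration, not an official client: the helper names (`hf_parquet_glob`, `count_rows_sql`, `count_rows_duckdb`, `first_rows`) are hypothetical, and the `**/*.parquet` file layout and the `train` split name are assumptions that may not hold for every dataset. It relies on DuckDB's `hf://` path support and on the `datasets` library's streaming mode.

```python
from itertools import islice


def hf_parquet_glob(repo_id: str, pattern: str = "**/*.parquet") -> str:
    """Build an hf:// glob that DuckDB can read for a Hub dataset repo.

    The default pattern assumes Parquet files anywhere in the repo.
    """
    return f"hf://datasets/{repo_id}/{pattern}"


def count_rows_sql(repo_id: str) -> str:
    """Return SQL that counts rows across every Parquet file the glob matches."""
    return f"SELECT count(*) AS n FROM read_parquet('{hf_parquet_glob(repo_id)}')"


def count_rows_duckdb(repo_id: str) -> int:
    """Run the count remotely with DuckDB (needs network access)."""
    import duckdb  # pip install duckdb

    return duckdb.sql(count_rows_sql(repo_id)).fetchone()[0]


def first_rows(repo_id: str, n: int = 3, split: str = "train") -> list:
    """Stream the first n rows with load_dataset, without a full download.

    The split name is an assumption; some datasets may also need a config name.
    """
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

For example, `count_rows_duckdb("open-index/hacker-news-rss")` would count rows in one of the smaller datasets, and `first_rows("open-index/hacker-news-rss")` would return a few sample records. Streaming keeps memory use flat, which matters for the multi-billion-row datasets listed below.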

	### Web

	We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [open-markdown](https://huggingface.co/datasets/open-index/open-markdown) | Clean Markdown from Common Crawl with URL, language, and content metadata | Billions of pages |

	### Social

	Hacker News and Reddit, two of the largest community archives on the internet, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [hacker-news](https://huggingface.co/datasets/open-index/hacker-news) | Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min | 47M+ items |
	| [hacker-news-rss](https://huggingface.co/datasets/open-index/hacker-news-rss) | RSS feeds discovered from links posted on HN | 623K feeds |
	| [arctic](https://huggingface.co/datasets/open-index/arctic) | The Arctic Shift Reddit archive. 8.3B comments, 2.2B submissions | 10.5B items |

	### Code

	Full mirrors of the major package registries and GitHub's public event stream. If you want to study how open source actually works, start here.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [open-github](https://huggingface.co/datasets/open-index/open-github) | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
	| [open-github-issues](https://huggingface.co/datasets/open-index/open-github-issues) | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
	| [open-npm](https://huggingface.co/datasets/open-index/open-npm) | Every npm package with versions, dependencies, maintainers, download stats | 35M+ rows |
	| [open-pypi](https://huggingface.co/datasets/open-index/open-pypi) | Every PyPI package with releases, classifiers, dependencies, project URLs | 47M+ rows |

	### Academia

	Structured dumps of arXiv and OpenAlex, two of the biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv) | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
	| [open-alex](https://huggingface.co/datasets/open-index/open-alex) | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |

	### Knowledge

	Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [open-wikipedia](https://huggingface.co/datasets/open-index/open-wikipedia) | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
	| [open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown) | Same articles, converted to clean Markdown | All articles |
	| [open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text) | Same articles, as plain text | All articles |
	| [open-library](https://huggingface.co/datasets/open-index/open-library) | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |

	### AI

	The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

	| Dataset | What's in it | Scale |
	|---------|-------------|-------|
	| [open-skills](https://huggingface.co/datasets/open-index/open-skills) | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |
