---
title: README
emoji: ๐
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: We index the internet and publish it as Parquet
---
# OpenIndex

We index the internet and publish it as Parquet. 15 datasets, 199K+ downloads, all queryable with DuckDB or `load_dataset`.
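
As a rough sketch of what "queryable with DuckDB or `load_dataset`" can look like in practice, the helpers below build a DuckDB query over a dataset's Parquet files and stream a few rows through the `datasets` library. The function names are hypothetical, and two things are assumptions rather than facts about these repos: that the Parquet files are reachable through a recursive `**/*.parquet` glob, and that each dataset exposes a `train` split. Check the individual dataset cards for the actual layout.

```python
from itertools import islice

ORG = "open-index"  # organization name taken from the dataset links in this README


def parquet_glob(dataset: str, org: str = ORG) -> str:
    """Return an hf:// glob that matches every Parquet file in a dataset repo.

    Assumption: the repo stores its data as .parquet files somewhere under the root.
    """
    return f"hf://datasets/{org}/{dataset}/**/*.parquet"


def preview_sql(dataset: str, limit: int = 5) -> str:
    """Build a DuckDB query that previews the first `limit` rows of a dataset."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM read_parquet('{parquet_glob(dataset)}') LIMIT {int(limit)}"


def run_preview(dataset: str, limit: int = 5):
    """Run the preview query with DuckDB (needs `pip install duckdb` and network access)."""
    import duckdb  # imported lazily so the string helpers work without it

    return duckdb.sql(preview_sql(dataset, limit)).fetchall()


def stream_rows(dataset: str, n: int = 5, split: str = "train"):
    """Stream the first `n` rows with `load_dataset` instead of downloading everything.

    Assumption: the dataset has a split named `train`.
    """
    from datasets import load_dataset  # needs `pip install datasets`

    rows = load_dataset(f"{ORG}/{dataset}", split=split, streaming=True)
    return list(islice(rows, n))
```

Calling `run_preview("hacker-news")` or `stream_rows("open-arxiv", n=3)` would then fetch a handful of rows over the network. Streaming and `LIMIT` both matter here because several of these datasets are far too large to download casually.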

### Web

We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-markdown](https://huggingface.co/datasets/open-index/open-markdown) | Clean markdown from Common Crawl with URL, language, content metadata | Billions of pages |

### Social

The two largest community archives on the internet, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [hacker-news](https://huggingface.co/datasets/open-index/hacker-news) | Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min | 47M+ items |
| [hacker-news-rss](https://huggingface.co/datasets/open-index/hacker-news-rss) | RSS feeds discovered from links posted on HN | 623K feeds |
| [arctic](https://huggingface.co/datasets/open-index/arctic) | The Arctic Shift Reddit archive. 8.3B comments, 2.2B submissions | 10.5B items |

### Code

Full mirrors of the major package registries and GitHub's public event stream. If you want to study how open source actually works, start here.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-github](https://huggingface.co/datasets/open-index/open-github) | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
| [open-github-issues](https://huggingface.co/datasets/open-index/open-github-issues) | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
| [open-npm](https://huggingface.co/datasets/open-index/open-npm) | Every npm package with versions, dependencies, maintainers, download stats | 35M+ rows |
| [open-pypi](https://huggingface.co/datasets/open-index/open-pypi) | Every PyPI package with releases, classifiers, dependencies, project URLs | 47M+ rows |

### Academia

Structured dumps of the two biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv) | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
| [open-alex](https://huggingface.co/datasets/open-index/open-alex) | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |

### Knowledge

Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-wikipedia](https://huggingface.co/datasets/open-index/open-wikipedia) | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
| [open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown) | Same articles, converted to clean Markdown | All articles |
| [open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text) | Same articles, as plain text | All articles |
| [open-library](https://huggingface.co/datasets/open-index/open-library) | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |

### AI

The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

| Dataset | What's in it | Scale |
|---------|-------------|-------|
| [open-skills](https://huggingface.co/datasets/open-index/open-skills) | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |