---
title: ArXiv New ML Datasets
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
license: mit
---
ArXiv New ML Datasets
Browse 1.1M+ computer science papers from arXiv, of which 50,000+ are classified as introducing new machine learning datasets.
Features
- Keyword search - Search titles and abstracts
- Semantic search - Find conceptually similar papers using vector embeddings
- Filter by arXiv category (cs.AI, cs.CV, cs.LG, etc.)
- Infinite scroll for smooth browsing
- Links to arXiv, PDF, and HF Papers
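As a rough illustration of the keyword search, category filter, and link features listed above, here is a minimal pure-Python sketch. It is not the app's code (the app uses Polars and Lance); the `Paper` class, the `keyword_search` and `paper_links` helpers, and the sample records with placeholder IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record shape; the real dataset schema may differ.
    arxiv_id: str
    title: str
    abstract: str
    categories: list


def keyword_search(papers, query, category=None):
    """Case-insensitive substring match on title or abstract,
    optionally restricted to one arXiv category."""
    needle = query.lower()
    return [
        p for p in papers
        if (needle in p.title.lower() or needle in p.abstract.lower())
        and (category is None or category in p.categories)
    ]


def paper_links(arxiv_id):
    """Build the arXiv, PDF, and HF Papers URLs for an arXiv ID."""
    return {
        "arxiv": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
        "hf_papers": f"https://huggingface.co/papers/{arxiv_id}",
    }


# Placeholder records for illustration only (not real papers).
PAPERS = [
    Paper("0000.00001", "A New Image Benchmark", "We introduce a dataset.", ["cs.CV"]),
    Paper("0000.00002", "Faster Optimizers", "We evaluate on a benchmark.", ["cs.LG"]),
    Paper("0000.00003", "Parsing Revisited", "A study of parsers.", ["cs.CL"]),
]

hits = keyword_search(PAPERS, "benchmark", category="cs.CV")
links = paper_links(hits[0].arxiv_id)
```

Without the `category` argument, the same query matches both the first and second placeholder records, since the term appears in one title and one abstract.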
Data Source
Papers are classified using ModernBERT. Embeddings are computed with BGE-base-en-v1.5.
Data comes from librarian-bots/arxiv-cs-papers-lance and is updated weekly.
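To show how embeddings support semantic search, the sketch below ranks documents by cosine similarity to a query vector. It assumes toy 3-dimensional vectors with made-up IDs (BGE-base-en-v1.5 produces 768-dimensional embeddings), and the `cosine` and `rank_by_similarity` helpers are hypothetical, not part of the app.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy vectors standing in for real paper embeddings.
DOC_VECS = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.8, 0.6, 0.0],
    "paper-c": [0.0, 0.0, 1.0],
}

top = rank_by_similarity([0.9, 0.1, 0.0], DOC_VECS, k=2)
```

This is an exact, brute-force ranking. At the scale of 1.1M+ papers, an approximate index is the more practical choice.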
Tech Stack
- Backend: FastAPI + Polars + Lance
- Frontend: HTMX + Tailwind CSS
- Vector Search: Lance with IVF_PQ index
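The idea behind an IVF_PQ index can be sketched in plain Python: an inverted file (IVF) assigns each vector to its nearest coarse centroid so a query scans only a few partitions, and product quantization (PQ) stores each vector as short codes, one per sub-vector. The sketch below is a conceptual toy, not Lance's implementation: the centroids and codebooks are hand-picked rather than trained, and every name in it is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


# Assumed toy setup: 4-d vectors, 2 IVF partitions, 2 PQ sub-vectors of 2 dims.
IVF_CENTROIDS = [[0.0] * 4, [10.0] * 4]
SUB_DIM = 2
PQ_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]],  # dims 0-1
    [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]],  # dims 2-3
]


def sub_vector(v, m):
    return v[m * SUB_DIM:(m + 1) * SUB_DIM]


def pq_encode(v):
    """Replace each sub-vector with the index of its nearest codebook entry."""
    return tuple(nearest(cb, sub_vector(v, m)) for m, cb in enumerate(PQ_CODEBOOKS))


def pq_distance(q, code):
    """Approximate distance between a raw query and a PQ-encoded vector."""
    return sum(
        sq_dist(sub_vector(q, m), PQ_CODEBOOKS[m][c]) for m, c in enumerate(code)
    )


def build_index(vectors):
    """Group (row_id, PQ code) pairs by nearest IVF centroid."""
    partitions = {i: [] for i in range(len(IVF_CENTROIDS))}
    for row_id, v in enumerate(vectors):
        partitions[nearest(IVF_CENTROIDS, v)].append((row_id, pq_encode(v)))
    return partitions


def search(index, q, k=2, nprobes=1):
    """Scan only the nprobes closest partitions, ranking by PQ distance."""
    order = sorted(range(len(IVF_CENTROIDS)), key=lambda i: sq_dist(IVF_CENTROIDS[i], q))
    candidates = [
        (pq_distance(q, code), row_id)
        for p in order[:nprobes]
        for row_id, code in index[p]
    ]
    return [row_id for _, row_id in sorted(candidates)[:k]]


vectors = [
    [0.1, 0.2, 0.0, 0.1],
    [1.1, 0.9, 1.0, 1.2],
    [10.2, 9.9, 10.1, 10.0],
    [11.0, 10.8, 11.1, 10.9],
]
index = build_index(vectors)
result = search(index, [10.0, 10.0, 10.0, 10.0], k=1)
```

Probing fewer partitions makes queries faster but can miss neighbors that fall in an unprobed partition. PQ trades some distance accuracy for a much smaller memory footprint.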