metadata
title: ArXiv New ML Datasets
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
license: mit

ArXiv New ML Datasets

Browse 1.1M+ CS papers from arXiv, with 50,000+ classified as introducing new machine learning datasets.

Features

  • Keyword search - Search titles and abstracts
  • Semantic search - Find conceptually similar papers using vector embeddings
  • Filter by arXiv category (cs.AI, cs.CV, cs.LG, etc.)
  • Infinite scroll for smooth browsing
  • Links to arXiv, PDF, and HF Papers
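To make the difference between the two search modes concrete, here is a minimal sketch in plain Python. It is not the Space's actual code. The records, the 3-dimensional vectors, and the function names `keyword_search` and `semantic_search` are all made up for illustration. It also assumes categories are stored as a space-separated string. Real embeddings would come from the embedding model, not be written by hand.

```python
import math

# Toy records: titles, abstracts, and vectors are invented for illustration.
PAPERS = [
    {"title": "A New Benchmark Dataset for Image Captioning",
     "abstract": "We introduce a dataset of captioned images.",
     "categories": "cs.CV cs.CL", "vector": [0.9, 0.1, 0.0]},
    {"title": "Faster Transformers via Sparse Attention",
     "abstract": "We propose a sparse attention mechanism.",
     "categories": "cs.LG", "vector": [0.1, 0.9, 0.1]},
    {"title": "A Corpus of Annotated Legal Contracts",
     "abstract": "A new dataset for legal NLP.",
     "categories": "cs.CL", "vector": [0.7, 0.2, 0.3]},
]


def keyword_search(papers, query, category=None):
    """Case-insensitive substring match on title and abstract,
    optionally restricted to one arXiv category."""
    q = query.lower()
    hits = []
    for paper in papers:
        if category is not None and category not in paper["categories"].split():
            continue
        if q in paper["title"].lower() or q in paper["abstract"].lower():
            hits.append(paper)
    return hits


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(papers, query_vector, k=2):
    """Rank papers by cosine similarity to the query vector, keep the top k."""
    ranked = sorted(papers, key=lambda p: cosine(p["vector"], query_vector),
                    reverse=True)
    return ranked[:k]


keyword_hits = keyword_search(PAPERS, "dataset", category="cs.CL")
semantic_hits = semantic_search(PAPERS, [1.0, 0.0, 0.0], k=2)
```

Keyword search only finds papers that contain the literal query text, while semantic search ranks every paper by vector similarity, so it can surface the legal contracts corpus for a dataset-like query even if the wording differs.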

Data Source

Papers are classified using ModernBERT, and embeddings are generated with BGE-base-en-v1.5.

Data comes from librarian-bots/arxiv-cs-papers-lance and is updated weekly.

Tech Stack

  • Backend: FastAPI + Polars + Lance
  • Frontend: HTMX + Tailwind CSS
  • Vector Search: Lance with IVF_PQ index
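For readers unfamiliar with IVF_PQ, the sketch below illustrates only the IVF (inverted file) half of the idea in plain Python: vectors are grouped by their nearest centroid, and a query scans just the closest partitions instead of the whole collection. This is not Lance's implementation. The PQ (product quantization) compression step is omitted, the centroids are fixed by hand (a real index would typically learn them with k-means), and the names `build_ivf`, `ivf_search`, and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(vid)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k=2, nprobes=1):
    """Scan only the `nprobes` partitions closest to the query,
    then return the ids of the k nearest vectors among those candidates."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobes]
    candidates = [vid for c in probe for vid in partitions[c]]
    candidates.sort(key=lambda vid: sq_dist(query, vectors[vid]))
    return candidates[:k]


# Toy 2-d data with two obvious clusters; centroids are chosen by hand.
VECTORS = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.3]]
CENTROIDS = [[0.0, 0.0], [5.0, 5.0]]
PARTITIONS = build_ivf(VECTORS, CENTROIDS)

nearest_ids = ivf_search([0.1, 0.1], VECTORS, CENTROIDS, PARTITIONS,
                         k=2, nprobes=1)
```

Raising `nprobes` scans more partitions, which trades speed for recall. With `nprobes` equal to the number of partitions, the search degenerates to an exact scan over all vectors.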