---
title: ArXiv New ML Datasets
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
license: mit
---
# ArXiv New ML Datasets
Browse **1.1M+ CS papers** from arXiv, with **50,000+ classified** as introducing new machine learning datasets.
## Features
- **Keyword search** - Search titles and abstracts
- **Semantic search** - Find conceptually similar papers using vector embeddings
- **Filter** by arXiv category (cs.AI, cs.CV, cs.LG, etc.)
- **Infinite scroll** for smooth browsing
- **Links** to arXiv, PDF, and Hugging Face Papers
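
The search features above can be illustrated with a minimal, standard-library sketch. It is not the Space's actual code: the `Paper` record, the function names, and the toy 3-dimensional embeddings are all assumptions made for illustration (real BGE-base-en-v1.5 embeddings have 768 dimensions, and the app queries a Lance table rather than a Python list).

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Paper:
    # Hypothetical record layout, not the real dataset schema.
    title: str
    abstract: str
    categories: list[str]
    embedding: list[float]  # toy 3-d vector standing in for a real embedding


def keyword_search(papers, query, category=None):
    """Case-insensitive substring match on title and abstract,
    optionally restricted to one arXiv category."""
    q = query.lower()
    return [
        p for p in papers
        if (q in p.title.lower() or q in p.abstract.lower())
        and (category is None or category in p.categories)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(papers, query_embedding, k=2):
    """Return the k papers whose embeddings are most similar to the query."""
    ranked = sorted(
        papers,
        key=lambda p: cosine(p.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Made-up sample records for demonstration only.
PAPERS = [
    Paper("A New Benchmark for Image Captioning",
          "We introduce a dataset of captioned images.",
          ["cs.CV"], [0.9, 0.1, 0.0]),
    Paper("Scaling Laws for Language Models",
          "We study loss as a function of compute.",
          ["cs.LG", "cs.CL"], [0.1, 0.9, 0.1]),
    Paper("A Corpus of Annotated Dialogues",
          "A new dataset for dialogue systems.",
          ["cs.CL"], [0.2, 0.3, 0.9]),
]
```

Calling `keyword_search(PAPERS, "dataset", category="cs.CL")` keeps only the dialogue paper, while `semantic_search(PAPERS, [1.0, 0.0, 0.0], k=1)` ranks the image captioning paper first because its toy embedding points in nearly the same direction as the query.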
## Data Source
Papers are classified with a fine-tuned [ModernBERT](https://huggingface.co/davanstrien/ModernBERT-base-is-new-arxiv-dataset) model. Embeddings come from [BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5).
The data comes from [librarian-bots/arxiv-cs-papers-lance](https://huggingface.co/datasets/librarian-bots/arxiv-cs-papers-lance) and is updated weekly.
## Tech Stack
- **Backend**: FastAPI + Polars + Lance
- **Frontend**: HTMX + Tailwind CSS
- **Vector Search**: Lance with IVF_PQ index
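
To give a rough idea of what the IVF part of an IVF_PQ index does, here is a simplified pure-Python sketch. It is not Lance's implementation: vectors are grouped by their nearest centroid, and a query scans only the closest partitions instead of every vector. The product quantization (PQ) step, which compresses the stored vectors, is omitted entirely, and the centroids are fixed by hand rather than learned with k-means. All names, including the `nprobes` parameter, are hypothetical.

```python
from math import dist


def nearest(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: dist(point, candidates[i]))


def build_ivf(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        partitions[nearest(vec, centroids)].append(vec_id)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k=1, nprobes=1):
    """Scan only the nprobes partitions whose centroids are closest
    to the query, then rank those candidates exactly."""
    probe_order = sorted(
        range(len(centroids)), key=lambda i: dist(query, centroids[i])
    )
    candidates = [
        vec_id for p in probe_order[:nprobes] for vec_id in partitions[p]
    ]
    return sorted(candidates, key=lambda vec_id: dist(query, vectors[vec_id]))[:k]


# Toy 2-d data; the centroids are chosen by hand for illustration.
VECTORS = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1)]
CENTROIDS = [(0.1, 0.05), (5.1, 5.0), (9.8, 0.1)]
PARTITIONS = build_ivf(VECTORS, CENTROIDS)
```

With `nprobes=1`, a query near `(5.0, 5.0)` compares against only the two vectors in the middle partition. Raising `nprobes` scans more partitions, which trades speed for recall.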