---
title: ArXiv New ML Datasets
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
license: mit
---

# ArXiv New ML Datasets

Browse **1.1M+ CS papers** from arXiv, with **50,000+ classified** as introducing new machine learning datasets.

## Features

- **Keyword search** - Search titles and abstracts
- **Semantic search** - Find conceptually similar papers using vector embeddings
- **Filter** by arXiv category (cs.AI, cs.CV, cs.LG, etc.)
- **Infinite scroll** for smooth browsing
- Links to arXiv, PDF, and HF Papers

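
The keyword search, category filter, paging, and link features can be sketched in plain Python. This is a minimal illustration, not the app's actual code: the `Paper` record, the `keyword_search` and `paper_links` helpers, and the sample entries (including their arXiv IDs) are all hypothetical, and the offset/limit paging is only an assumption about how infinite scroll might be fed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record shape; the real dataset schema may differ.
    arxiv_id: str
    title: str
    abstract: str
    categories: tuple[str, ...]


def keyword_search(papers, query, category=None, offset=0, limit=20):
    """Case-insensitive substring match on title and abstract.

    An optional arXiv category narrows the results, and offset/limit
    return one page at a time (as an infinite-scroll client might request).
    """
    needle = query.casefold()
    hits = [
        p
        for p in papers
        if (needle in p.title.casefold() or needle in p.abstract.casefold())
        and (category is None or category in p.categories)
    ]
    return hits[offset : offset + limit]


def paper_links(arxiv_id):
    """Build the three outbound links for a paper from its arXiv ID."""
    return {
        "arxiv": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
        "hf_papers": f"https://huggingface.co/papers/{arxiv_id}",
    }


# Made-up sample data, for illustration only.
PAPERS = [
    Paper("0000.00001", "A New Benchmark Dataset for Image Captioning",
          "We introduce a dataset of captioned images.", ("cs.CV", "cs.LG")),
    Paper("0000.00002", "Faster Attention Kernels",
          "We speed up transformer inference.", ("cs.LG",)),
    Paper("0000.00003", "A Question Answering Corpus",
          "This dataset contains annotated questions.", ("cs.CL", "cs.AI")),
]

first_page = keyword_search(PAPERS, "dataset", limit=1)
nlp_only = keyword_search(PAPERS, "DATASET", category="cs.CL")
```

A client would request the next page by increasing `offset` by `limit` each time the user nears the bottom of the list.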
## Data Source

Papers are classified with a fine-tuned [ModernBERT](https://huggingface.co/davanstrien/ModernBERT-base-is-new-arxiv-dataset) model. Embeddings are generated with [BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5).

The data comes from [librarian-bots/arxiv-cs-papers-lance](https://huggingface.co/datasets/librarian-bots/arxiv-cs-papers-lance), which is updated weekly.

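
To show how embeddings drive semantic search, here is a rough sketch that ranks papers by cosine similarity to a query vector. It assumes the query has already been embedded with the same model as the papers. The 3-dimensional vectors and paper IDs are made up for readability (BGE-base-en-v1.5 produces 768-dimensional vectors), and the `cosine_similarity` and `semantic_search` helpers are hypothetical names, not part of the app.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, paper_vecs, k=2):
    """Return the IDs of the k papers most similar to the query (brute force)."""
    ranked = sorted(
        paper_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [paper_id for paper_id, _ in ranked[:k]]


# Toy embeddings, for illustration only.
PAPER_VECS = {
    "paper-a": (1.0, 0.0, 0.0),
    "paper-b": (0.9, 0.1, 0.0),
    "paper-c": (0.0, 0.0, 1.0),
}

top_two = semantic_search((1.0, 0.0, 0.0), PAPER_VECS, k=2)
```

This brute-force scan compares the query against every paper, which is fine for a toy set but is the cost an approximate index is meant to avoid at the scale of a million papers.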
## Tech Stack

- **Backend**: FastAPI + Polars + Lance
- **Frontend**: HTMX + Tailwind CSS
- **Vector Search**: Lance with IVF_PQ index
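
For intuition about the IVF_PQ index, the sketch below illustrates only the IVF (inverted file) half in plain Python: vectors are grouped by their nearest centroid, and a query scans only the closest partitions. It is a toy under stated assumptions, not Lance's implementation. The centroids are hard-coded instead of being learned with k-means, the PQ (product quantization) compression step is omitted entirely, and all names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector ID to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))
        partitions[nearest].append(vec_id)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k=2, nprobes=1):
    """Scan only the nprobes partitions whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: sq_dist(query, centroids[i]))
    candidates = [vec_id for i in probe[:nprobes] for vec_id in partitions[i]]
    candidates.sort(key=lambda vec_id: sq_dist(query, vectors[vec_id]))
    return candidates[:k]


# Toy 2-D data, for illustration only.
VECTORS = {"a": (0.0, 0.1), "b": (0.2, 0.0), "c": (5.0, 5.1), "d": (6.0, 6.0)}
CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]  # assumed fixed; normally learned from the data

PARTITIONS = build_ivf(VECTORS, CENTROIDS)
fast = ivf_search((0.1, 0.0), VECTORS, CENTROIDS, PARTITIONS, k=2, nprobes=1)
wide = ivf_search((0.1, 0.0), VECTORS, CENTROIDS, PARTITIONS, k=3, nprobes=2)
```

Raising `nprobes` scans more partitions, trading speed for recall. With `nprobes=1` the query never looks at `c` or `d`, which is harmless here but can miss true neighbors that sit near a partition boundary.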