---
title: ArXiv New ML Datasets
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
license: mit
---

# ArXiv New ML Datasets

Browse **1.1M+ CS papers** from arXiv, with **50,000+ classified** as introducing new machine learning datasets.

## Features

- **Keyword search** - Search titles and abstracts
- **Semantic search** - Find conceptually similar papers using vector embeddings
- **Filter** by arXiv category (cs.AI, cs.CV, cs.LG, etc.)
- **Infinite scroll** for smooth browsing
- **Links** to arXiv, PDF, and Hugging Face Papers
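
As a rough illustration of how keyword search, category filtering, and infinite scroll fit together, here is a minimal sketch in plain Python. It is not the app's actual code (which runs on Polars and Lance); the `Paper` record shape, the `search` function, and its parameters are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record shape; the real dataset schema may differ.
    title: str
    abstract: str
    categories: list[str]


def search(papers, query="", category=None, offset=0, limit=20):
    """Return one page of papers matching a keyword and an optional category."""
    needle = query.strip().lower()
    matches = [
        p for p in papers
        # Keyword search: case-insensitive substring match on title or abstract.
        if (not needle or needle in p.title.lower() or needle in p.abstract.lower())
        # Category filter: keep papers tagged with the requested arXiv category.
        and (category is None or category in p.categories)
    ]
    # Infinite scroll amounts to requesting successive offset/limit pages.
    return matches[offset:offset + limit]


papers = [
    Paper("A New Benchmark for Visual QA", "We introduce a dataset of images.", ["cs.CV"]),
    Paper("Faster Attention", "An efficient transformer kernel.", ["cs.LG"]),
    Paper("A Corpus for Code Review", "We release a new dataset of reviews.", ["cs.SE", "cs.LG"]),
]

page = search(papers, query="dataset", category="cs.LG")
print([p.title for p in page])  # ['A Corpus for Code Review']
```

A front end that scrolls infinitely would call `search` again with `offset` increased by `limit` each time the user nears the end of the list.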

## Data Source

Papers are classified using a [ModernBERT](https://huggingface.co/davanstrien/ModernBERT-base-is-new-arxiv-dataset) classifier. Embeddings are generated with [BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5).

The data comes from [librarian-bots/arxiv-cs-papers-lance](https://huggingface.co/datasets/librarian-bots/arxiv-cs-papers-lance) and is updated weekly.

## Tech Stack

- **Backend**: FastAPI + Polars + Lance
- **Frontend**: HTMX + Tailwind CSS
- **Vector Search**: Lance with IVF_PQ index
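
To make the vector search idea concrete, the sketch below ranks papers by cosine similarity between a query embedding and each paper embedding. It is an exact, brute-force stand-in written in plain Python, not the app's implementation: an IVF_PQ index is an approximate method that avoids scanning every vector. The `cosine` and `top_k` helpers, the paper IDs, and the 3-dimensional toy vectors are assumptions made for illustration (BGE-base-en-v1.5 embeddings have 768 dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, vectors, k=2):
    """Return the IDs of the k vectors most similar to query_vec."""
    scored = [(cosine(query_vec, v), paper_id) for paper_id, v in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [paper_id for _, paper_id in scored[:k]]


# Toy embeddings keyed by hypothetical paper IDs.
vectors = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
}

print(top_k([1.0, 0.0, 0.0], vectors))  # ['paper-a', 'paper-b']
```

Brute-force scoring like this costs time proportional to the number of papers per query, which is the motivation for using an approximate index at the scale of a million-plus papers.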