metadata title: ResearchIT
emoji: π
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
ResearchIT - Personalized ArXiv Paper Recommender
An "Instagram for research": a multi-interest-aware feed that surfaces relevant papers across a researcher's distinct areas without collapsing toward a dominant interest.
Stack: FastAPI · HTMX · Jinja2 · BGE-M3 (1024-dim) · Qdrant Cloud · Zilliz Cloud · Turso (libSQL) · Groq · LightGBM · HuggingFace Spaces
Live demo: https://siddhm11-researchit.hf.space
Architecture Overview
User → [HTMX Frontend] → [FastAPI Backend]
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
 [Qdrant Cloud]          [Zilliz Cloud]           [Turso Cloud]
 Dense vectors           Sparse vectors           Paper metadata
 1.6M papers             1.6M papers              ~1.6M rows
 BGE-M3 1024d            BGE-M3 lexical           + citations
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                │
                     [Recommendation Engine]
                     ├── EWMA Profiles
                     ├── Ward Clustering
                     ├── Quota Fusion
                     ├── LightGBM Reranker (37 features)
                     ├── MMR Diversity
                     └── Exploration Injection
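The EWMA profile stage can be illustrated with a short sketch. It blends each newly saved paper's embedding into a running profile vector with an exponentially weighted moving average and re-normalizes the result. The decay factor `alpha=0.3` and the L2 normalization are assumptions made for illustration; this README does not state the values the app uses.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def ewma_update(profile, paper_vec, alpha=0.3):
    """Blend a newly saved paper's embedding into the running profile.

    alpha is a hypothetical decay factor: higher values weight recent saves more.
    """
    if profile is None:
        # First save: the profile is just the normalized paper embedding.
        return l2_normalize(paper_vec)
    if len(profile) != len(paper_vec):
        raise ValueError("profile and paper vector must have the same dimension")
    blended = [(1 - alpha) * p + alpha * v for p, v in zip(profile, paper_vec)]
    return l2_normalize(blended)
```

A higher `alpha` would make the profile track short-term interests, while a lower one would approximate the long-term vector used in Tier 2.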
Data Infrastructure & Schemas
Qdrant Cloud - Dense Vector Store
| Property | Value |
| --- | --- |
| Collection | `arxiv_bgem3_dense` |
| Documents | ~1,600,000 arXiv papers |
| Vector dim | 1024 (BGE-M3 dense embeddings, float32) |
| Quantization | Binary Quantization (BQ) enabled |
| HNSW | m=32 |
| Point ID | Integer (auto-generated) |
| Payload | `arxiv_id` (TEXT, keyword-indexed) |
| Region | Qdrant Cloud |
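To give a feel for what Binary Quantization buys, here is a minimal sketch of the general idea: keep one sign bit per dimension and compare vectors by Hamming distance. A 1024-dim float32 vector takes 4096 bytes, while its binary form takes 128 bytes. This illustrates the concept only and is not Qdrant's internal implementation; the helper names are hypothetical.

```python
def binarize(vec):
    """Pack sign bits into an int: bit i is set when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two binarized vectors."""
    return bin(a ^ b).count("1")

# Storage comparison for one 1024-dim vector.
FLOAT32_BYTES = 1024 * 4   # 4096 bytes
BINARY_BYTES = 1024 // 8   # 128 bytes
```

In practice a binary index is usually used to shortlist candidates, which are then rescored against the original float vectors to recover accuracy.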
Zilliz Cloud - Sparse Vector Store
| Property | Value |
| --- | --- |
| Collection | `arxiv_bgem3_sparse` |
| Documents | ~1,600,000 arXiv papers |
| Schema | `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR) |
| Index | SPARSE_INVERTED_INDEX, metric_type=IP |
| Sparse format | Integer token IDs as keys (BGE-M3 tokenizer), e.g. `{29: 0.0427, 6083: 0.1852}` |
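With the IP metric, the score between a query and a document is the inner product over the token IDs they share. The sketch below computes that for two sparse vectors in the dictionary format shown above. The function name and the query weights are made up for illustration; only the document weights come from the example in the table.

```python
def sparse_ip(query, doc):
    """Inner product of two sparse vectors keyed by integer token ID."""
    # Iterate over the smaller dict so the cost scales with the shorter vector.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)

doc_vec = {29: 0.0427, 6083: 0.1852}          # format from the table above
query_vec = {29: 0.5, 6083: 1.0, 7: 0.2}      # hypothetical query weights
score = sparse_ip(query_vec, doc_vec)         # token 7 is absent from the doc and contributes 0
```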
Turso (libSQL) - Paper Metadata DB
| Property | Value |
| --- | --- |
| Database | `arxiv-data` on aws-ap-south-1 |
| URL | https://arxiv-data-siddhm11.aws-ap-south-1.turso.io |
| Rows | ~1,600,000 papers |
| Data sources | Kaggle `siddhm11/arxivdata` + `siddhm11/citation-data-letsgoo` |
Table: papers
CREATE TABLE papers (
    arxiv_id TEXT UNIQUE,
    title TEXT,
    authors TEXT,
    categories TEXT,
    primary_topic TEXT,
    update_date TEXT,
    abstract_preview TEXT,
    citation_count INTEGER DEFAULT 0,
    influential_citations INTEGER DEFAULT 0
);
CREATE UNIQUE INDEX idx_papers_arxiv_id ON papers(arxiv_id);
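Because libSQL is SQLite-compatible, the schema above can be tried locally with Python's built-in `sqlite3` module. The sketch below creates the table in memory and batch-fetches rows by `arxiv_id`, which is the kind of lookup needed after a vector search returns IDs. The `fetch_papers` helper and the placeholder rows are hypothetical, and the app's actual Turso client code may look different.

```python
import sqlite3

# Same columns as the papers table above.
SCHEMA = """
CREATE TABLE papers (
    arxiv_id TEXT UNIQUE,
    title TEXT,
    authors TEXT,
    categories TEXT,
    primary_topic TEXT,
    update_date TEXT,
    abstract_preview TEXT,
    citation_count INTEGER DEFAULT 0,
    influential_citations INTEGER DEFAULT 0
);
"""

def fetch_papers(conn, arxiv_ids):
    """Batch-fetch rows by arxiv_id, returned in the order the IDs were given."""
    if not arxiv_ids:
        return []
    placeholders = ",".join("?" for _ in arxiv_ids)
    rows = conn.execute(
        "SELECT arxiv_id, title, citation_count FROM papers "
        f"WHERE arxiv_id IN ({placeholders})",
        list(arxiv_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Unknown IDs are skipped rather than raising.
    return [by_id[a] for a in arxiv_ids if a in by_id]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows for illustration only, not real papers.
conn.executemany(
    "INSERT INTO papers (arxiv_id, title, citation_count) VALUES (?, ?, ?)",
    [("0000.00001", "Placeholder A", 3), ("0000.00002", "Placeholder B", 0)],
)
results = fetch_papers(conn, ["0000.00002", "missing", "0000.00001"])
```

Re-ordering in Python keeps the ranking produced upstream, since SQL `IN` makes no ordering guarantee.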
SQLite - Local Application DB
File: interactions.db (WAL mode, async via aiosqlite)
CREATE TABLE interactions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    paper_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    source TEXT,
    position INTEGER,
    query_id TEXT,
    ranker_version TEXT,
    candidate_source TEXT,
    cluster_id INTEGER,
    timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE paper_qdrant_map (
    arxiv_id TEXT PRIMARY KEY,
    qdrant_point_id INTEGER NOT NULL,
    mapped_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE paper_metadata (
    arxiv_id TEXT PRIMARY KEY,
    title TEXT,
    abstract TEXT,
    authors TEXT,
    category TEXT,
    published TEXT,
    cached_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE user_profiles (
    user_id TEXT NOT NULL,
    profile_type TEXT NOT NULL,
    vector BLOB NOT NULL,
    interaction_count INTEGER DEFAULT 0,
    updated_at TEXT,
    PRIMARY KEY (user_id, profile_type)
);
CREATE TABLE user_clusters (
    user_id TEXT NOT NULL,
    cluster_idx INTEGER NOT NULL,
    medoid_paper_id TEXT NOT NULL,
    importance REAL NOT NULL,
    paper_ids TEXT NOT NULL,
    computed_at TEXT,
    PRIMARY KEY (user_id, cluster_idx)
);
CREATE TABLE user_onboarding (
    user_id TEXT PRIMARY KEY,
    selected_categories TEXT,
    onboarding_completed INTEGER DEFAULT 0,
    created_at TEXT,
    updated_at TEXT
);
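The `user_profiles.vector` column is a BLOB, so profile embeddings have to be serialized before they are written. The sketch below shows one plausible round trip, assuming little-endian float32 packing. The byte layout is an assumption: this README does not specify how the app encodes vectors, and the helper names are hypothetical.

```python
import struct

def vector_to_blob(vec):
    """Pack a list of floats as little-endian float32 bytes (assumed layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def blob_to_vector(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

blob = vector_to_blob([0.5, -1.0, 0.25])
restored = blob_to_vector(blob)
```

Under this layout a 1024-dim profile occupies 4096 bytes per row.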
LightGBM Reranker - ML Model
| Property | Value |
| --- | --- |
| File | `models/reranker-phase6/production_model/reranker_v1.txt` |
| HuggingFace | `siddhm11/researchit-reranker-phase6` |
| Format | LightGBM v4 text (plain text, no pickle) |
| Objective | LambdaRank (optimizes nDCG) |
| Trees | 141 (early stopped from 500) |
| Features | 37 (see `docs/PHASE6-HANDOFF.md` for full schema) |
| Size | 974 KB |
| Latency | 0.143 ms per 100 candidates |
| Fallback | Heuristic scorer when model unavailable |
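Since the objective optimizes nDCG, it helps to see how the metric is computed. The sketch below implements nDCG@k with the exponential gain `2^rel - 1`, which is LightGBM's default label gain for LambdaRank. How relevance labels are derived from interactions is not covered here and would be defined by the training pipeline.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

`relevances` is the list of graded labels in the order the ranker returned them. A perfect ordering scores 1.0, and a list with no relevant items scores 0.0 by convention.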
Recommendation Pipeline
┌──────────────────────────────────────────────────────────────────┐
│ Tier 1 (≥5 saves): Multi-Interest Clustering + Quota Fusion      │
│   1. Ward clustering → identify distinct interests               │
│   2. Hungarian matching → stabilize cluster IDs                  │
│   3. Quota allocation → per-cluster slot budgets                 │
│   4. Parallel per-cluster ANN searches                           │
│   5. LightGBM reranking (37 features) + heuristic fallback       │
│   6. Category suppression (≥3 dismissals in 14 days)             │
│   7. MMR diversity (λ=0.6)                                       │
│   8. Exploration injection (2 serendipitous papers)              │
├──────────────────────────────────────────────────────────────────┤
│ Tier 2 (≥3 saves): EWMA long-term vector → single ANN search     │
├──────────────────────────────────────────────────────────────────┤
│ Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API                │
├──────────────────────────────────────────────────────────────────┤
│ Tier 0 (onboarded, 0 saves): Trending papers by category         │
└──────────────────────────────────────────────────────────────────┘
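The MMR diversity step in Tier 1 can be sketched as a greedy selection that trades relevance against similarity to papers already chosen. The version below assumes the standard formulation, `λ · relevance − (1 − λ) · max similarity to selected`, with cosine similarity between embeddings. Whether the app weights the terms exactly this way is an assumption, and the function names and tuple layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def mmr_select(candidates, k, lam=0.6):
    """Greedy MMR over (paper_id, relevance, vector) tuples; returns paper IDs."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(cand):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(cand[2], chosen[2]) for chosen in selected), default=0.0
            )
            return lam * cand[1] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [paper_id for paper_id, _, _ in selected]

papers = [
    ("a", 0.90, [1.0, 0.0]),
    ("b", 0.85, [1.0, 0.0]),   # near-duplicate of "a"
    ("c", 0.60, [0.0, 1.0]),   # less relevant but different
]
picked = mmr_select(papers, k=2)
```

With `lam=0.6` the second slot goes to the dissimilar paper rather than the near-duplicate. Setting `lam=1.0` reduces the procedure to plain relevance ranking.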
Quick Start
pip install -r requirements.txt
cp .env.example .env
python run.py
python -m pytest tests/ -v
python tests/test_reranker_integration.py
Phase Completion Status
| Phase | Status | Description |
| --- | --- | --- |
| 1 | Complete | Zero-ML Recommender (Qdrant + HTMX) |
| 2a | Complete | EWMA Profile Embeddings |
| 2b | Complete | Ward Clustering + Multi-Interest |
| 2c | Complete | Heuristic Re-ranking + MMR |
| 3 | Complete | Hybrid Semantic Search (BGE-M3 + Qdrant + Zilliz + RRF) |
| 3.5 | Complete | Turso Metadata DB (2.9x faster search) |
| 4 | Complete | Quota Fusion + Hungarian Matching + Category Suppression |
| 4.5 | Complete | Instrumentation Foundation |
| 5 | Complete | Cold-Start Onboarding + UI Redesign |
| 6 | Complete | LightGBM Reranker (nDCG@10: 0.879, +233%) |
| 7 | Planned | Evaluation Framework |
| 8 | Planned | LLM Summaries + Distilled Reranker |
| 9 | Planned | Exploration + Collaborative Filtering |
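Phase 3 combines dense results from Qdrant and sparse results from Zilliz with Reciprocal Rank Fusion (RRF). A minimal sketch of RRF follows: each list contributes `1 / (k + rank)` per document, and documents are sorted by the summed score. The constant `k=60` is the commonly used default and is an assumption here, as are the function name and the sample IDs.

```python
def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]

dense_hits = ["p1", "p2", "p3"]    # hypothetical dense (Qdrant) ranking
sparse_hits = ["p2", "p4", "p1"]   # hypothetical sparse (Zilliz) ranking
fused = rrf_fuse([dense_hits, sparse_hits])
```

RRF uses only ranks, so the cosine scores from the dense index and the inner-product scores from the sparse index never need to be put on a common scale.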
Key Documentation
| Document | Purpose |
| --- | --- |
| `CLAUDE.md` | Agent rulebook: architectural rules, doc precedence, code conventions |
| `docs/TASK-TRACKER.md` | Master task checklist with all phase details |
| `docs/PHASE6-HANDOFF.md` | LightGBM reranker handoff: model provenance, schema, reproduction |
| `docs/phases/PHASE6-Reranker-Framing.md` | Phase 6.1-6.3 framing: feature wiring, deployment verification, retraining strategy |
| `docs/research/06-Deep-Research-Verdict.md` | Source of truth for architecture decisions |
| `docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md` | Master roadmap (Phases 3–9) |
| `docs/ML Intern docs/` | ML Intern conversation logs for model training |
Health & Monitoring
curl -s https://siddhm11-researchit.hf.space/healthz/reranker | python -m json.tool
Environment Variables
| Variable | Required | Description |
| --- | --- | --- |
| `QDRANT_URL` | Yes | Qdrant Cloud cluster URL |
| `QDRANT_API_KEY` | Yes | Qdrant Cloud API key |
| `ZILLIZ_URI` | Yes | Zilliz Cloud gRPC endpoint |
| `ZILLIZ_TOKEN` | Yes | Zilliz Cloud API token |
| `TURSO_URL` | Yes | Turso database URL |
| `TURSO_DB_TOKEN` | Yes | Turso auth token |
| `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
| `S2_API_KEY` | No | Semantic Scholar API key (offline training scripts only, not used by the app) |
| `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
| `DB_PATH` | No | SQLite path (default: `interactions.db`) |
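A startup check that fails fast on missing configuration can save a confusing runtime error later. The sketch below validates the required variables listed above and applies the documented default for `DB_PATH`. The `load_settings` function is hypothetical and is not the app's actual configuration loader.

```python
import os

REQUIRED = (
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "ZILLIZ_URI",
    "ZILLIZ_TOKEN",
    "TURSO_URL",
    "TURSO_DB_TOKEN",
    "GROQ_API_KEY",
)

def load_settings(env=None):
    """Return a settings dict, raising if any required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED}
    # Optional values: documented default for DB_PATH, None when not provided.
    settings["DB_PATH"] = env.get("DB_PATH", "interactions.db")
    settings["RERANKER_MODEL_PATH"] = env.get("RERANKER_MODEL_PATH")
    settings["S2_API_KEY"] = env.get("S2_API_KEY")
    return settings
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.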
Test Suite
| Test File | Tests | Coverage |
| --- | --- | --- |
| `test_profiles.py` | 11 | EWMA profile computation |
| `test_clustering.py` | 21 | Ward clustering + Hungarian matching |
| `test_reranker_diversity.py` | 13 | Reranker (37-feature) + MMR diversity |
| `test_reranker_integration.py` | 7 | Phase 6 LightGBM integration |
| `test_phase6_feature_wiring.py` | 9 | Phase 6.1+6.2 feature wiring + per-candidate cluster |
| `test_fusion.py` | 20 | Quota allocation |
| `test_db.py` | 19 | SQLite schema + suppression |
| `test_onboarding.py` | 11 | Onboarding wizard |
| `test_hybrid_search.py` | 21 | Hybrid search pipeline |
| `test_search_router.py` | 6 | Search router |
| Others | ~13 | User state, saved, arxiv, qdrant, integration |
| Total | ~203 | |