title: ResearchIT
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false

ResearchIT: Personalized arXiv Paper Recommender

An "Instagram for research": a multi-interest-aware feed that surfaces relevant papers across a researcher's distinct areas without collapsing toward a dominant interest.

Stack: FastAPI · HTMX · Jinja2 · BGE-M3 (1024-dim) · Qdrant Cloud · Zilliz Cloud · Turso (libSQL) · Groq · LightGBM · HuggingFace Spaces

Live demo: https://siddhm11-researchit.hf.space


Architecture Overview

User → [HTMX Frontend] → [FastAPI Backend]
                              │
                ┌─────────────┼─────────────────┐
                │             │                 │
         [Qdrant Cloud]  [Zilliz Cloud]   [Turso Cloud]
         Dense vectors   Sparse vectors   Paper metadata
         1.6M papers     1.6M papers      ~1.6M rows
         BGE-M3 1024d    BGE-M3 lexical   + citations
                │             │                 │
                └─────────────┼─────────────────┘
                              │
                    [Recommendation Engine]
                     ├── EWMA Profiles
                     ├── Ward Clustering
                     ├── Quota Fusion
                     ├── LightGBM Reranker (37 features)
                     ├── MMR Diversity
                     └── Exploration Injection
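
The Quota Fusion stage gives each interest cluster its own slot budget. The README does not spell out the allocation rule, so the following is only a minimal sketch under stated assumptions: slots are split in proportion to a per-cluster importance weight, every cluster gets at least `min_slots`, and rounding uses the largest-remainder method. The names `allocate_quotas` and `min_slots` are hypothetical.

```python
def allocate_quotas(importances: list[float], total_slots: int, min_slots: int = 1) -> list[int]:
    """Split total_slots across clusters in proportion to importance.

    Assumed rule (not taken from the project): guarantee min_slots per
    cluster, then distribute the rest with largest-remainder rounding.
    """
    n = len(importances)
    if n == 0:
        return []
    if total_slots < n * min_slots:
        raise ValueError("not enough slots to give every cluster its minimum")
    total_weight = sum(importances)
    if total_weight <= 0:
        raise ValueError("importances must sum to a positive value")

    spare = total_slots - n * min_slots
    raw = [spare * w / total_weight for w in importances]  # fractional shares
    quotas = [min_slots + int(r) for r in raw]             # floor each share

    # Hand the slots lost to flooring to the largest fractional remainders.
    leftover = total_slots - sum(quotas)
    by_remainder = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas
```

With a floor of one slot per cluster, a small interest cluster is never starved by a dominant one, which matches the stated goal of not collapsing toward a single interest.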

Data Infrastructure & Schemas

Qdrant Cloud: Dense Vector Store

| Property | Value |
| --- | --- |
| Collection | arxiv_bgem3_dense |
| Documents | ~1,600,000 arXiv papers |
| Vector dim | 1024 (BGE-M3 dense embeddings, float32) |
| Quantization | Binary Quantization (BQ) enabled |
| HNSW | m=32 |
| Point ID | Integer (auto-generated) |
| Payload | arxiv_id (TEXT, keyword-indexed) |
| Region | Qdrant Cloud |
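
To give a feel for what binary quantization buys, here is a rough illustration of the general idea: keep one sign bit per dimension and compare vectors by Hamming distance. It is not Qdrant's internal implementation; `binarize` and `hamming` are hypothetical helper names.

```python
def binarize(vector: list[float]) -> bytes:
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vector) + 7) // 8
    return bits.to_bytes(n_bytes, "big")


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equally sized bit strings."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")
```

Under this scheme a 1024-dim float32 vector (4096 bytes) shrinks to 128 bytes, a 32x reduction.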

Zilliz Cloud: Sparse Vector Store

| Property | Value |
| --- | --- |
| Collection | arxiv_bgem3_sparse |
| Documents | ~1,600,000 arXiv papers |
| Schema | id (INT64 auto PK), arxiv_id (VARCHAR), sparse_vector (SPARSE_FLOAT_VECTOR) |
| Index | SPARSE_INVERTED_INDEX, metric_type=IP |
| Sparse format | Integer token IDs as keys (BGE-M3 tokenizer), e.g. {29: 0.0427, 6083: 0.1852} |
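
The sparse index uses the inner-product (IP) metric over token-ID-keyed weights. As a minimal sketch of what that score means for two vectors in the format above (the function name `sparse_ip` is hypothetical, and the actual scoring happens inside Zilliz):

```python
def sparse_ip(query: dict[int, float], doc: dict[int, float]) -> float:
    """Inner product of two sparse vectors keyed by integer token ID."""
    # Iterate over the smaller dict so the cost scales with the shorter vector.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)
```

Only token IDs present in both vectors contribute, which is why this component behaves like a learned lexical match.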

Turso (libSQL): Paper Metadata DB

| Property | Value |
| --- | --- |
| Database | arxiv-data on aws-ap-south-1 |
| URL | https://arxiv-data-siddhm11.aws-ap-south-1.turso.io |
| Rows | ~1,600,000 papers |
| Data sources | Kaggle siddhm11/arxivdata + siddhm11/citation-data-letsgoo |

Table: papers

CREATE TABLE papers (
    arxiv_id              TEXT UNIQUE,    -- e.g. "2401.12345"
    title                 TEXT,
    authors               TEXT,           -- comma-separated
    categories            TEXT,           -- space-separated arXiv categories
    primary_topic         TEXT,           -- e.g. "cs.CL"
    update_date           TEXT,           -- "YYYY-MM-DD"
    abstract_preview      TEXT,           -- truncated to 500 chars
    citation_count        INTEGER DEFAULT 0,
    influential_citations INTEGER DEFAULT 0
);
CREATE UNIQUE INDEX idx_papers_arxiv_id ON papers(arxiv_id);

SQLite: Local Application DB

File: interactions.db (WAL mode, async via aiosqlite)

-- User interactions (saves, dismissals, clicks, views)
CREATE TABLE interactions (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id          TEXT NOT NULL,
    paper_id         TEXT NOT NULL,
    event_type       TEXT NOT NULL,    -- save | not_interested | click | view
    source           TEXT,             -- search | recommendation
    position         INTEGER,
    query_id         TEXT,
    ranker_version   TEXT,             -- Phase 4.5: pipeline version tag
    candidate_source TEXT,             -- Phase 4.5: cluster_0 | exploration | ewma
    cluster_id       INTEGER,          -- Phase 4.5: interest cluster index
    timestamp        TEXT NOT NULL DEFAULT (datetime('now'))
);

-- arXiv ID → Qdrant integer point ID mapping (lazy cache)
CREATE TABLE paper_qdrant_map (
    arxiv_id        TEXT PRIMARY KEY,
    qdrant_point_id INTEGER NOT NULL,
    mapped_at       TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Paper metadata cache (from Turso/arXiv API)
CREATE TABLE paper_metadata (
    arxiv_id  TEXT PRIMARY KEY,
    title     TEXT, abstract TEXT, authors TEXT,
    category  TEXT, published TEXT,
    cached_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- EWMA user profile embeddings (1024-dim float32 blobs)
CREATE TABLE user_profiles (
    user_id           TEXT NOT NULL,
    profile_type      TEXT NOT NULL,   -- long_term | short_term | negative
    vector            BLOB NOT NULL,   -- 4096 bytes (1024 × float32)
    interaction_count INTEGER DEFAULT 0,
    updated_at        TEXT,
    PRIMARY KEY (user_id, profile_type)
);

-- Ward clustering results per user
CREATE TABLE user_clusters (
    user_id         TEXT NOT NULL,
    cluster_idx     INTEGER NOT NULL,
    medoid_paper_id TEXT NOT NULL,
    importance      REAL NOT NULL,
    paper_ids       TEXT NOT NULL,     -- JSON array of arxiv_ids
    computed_at     TEXT,
    PRIMARY KEY (user_id, cluster_idx)
);

-- Onboarding wizard state
CREATE TABLE user_onboarding (
    user_id              TEXT PRIMARY KEY,
    selected_categories  TEXT,          -- JSON array: ["nlp", "cv", "ml"]
    onboarding_completed INTEGER DEFAULT 0,
    created_at           TEXT, updated_at TEXT
);
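
The `user_profiles.vector` column stores a 1024-dim float32 embedding as a 4096-byte blob, updated as an exponentially weighted moving average (EWMA). The sketch below shows one way the packing and the update could look. The little-endian byte order, the smoothing factor `alpha`, and all function names are assumptions for illustration, not values taken from the project.

```python
import struct

DIM = 1024  # BGE-M3 dense dimension, per the schema comment


def pack_vector(vec: list[float]) -> bytes:
    """Serialize floats as little-endian float32 (byte order is an assumption)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> list[float]:
    """Inverse of pack_vector: 4 bytes per float32 component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def ewma_update(profile: list[float], paper_vec: list[float], alpha: float) -> list[float]:
    """Blend a newly saved paper into the profile: (1 - alpha) * old + alpha * new."""
    if len(profile) != len(paper_vec):
        raise ValueError("profile and paper vector must have the same dimension")
    return [(1.0 - alpha) * p + alpha * x for p, x in zip(profile, paper_vec)]
```

A larger `alpha` makes the profile react faster to recent saves, which is presumably what separates a `short_term` profile from a `long_term` one.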

LightGBM Reranker: ML Model

| Property | Value |
| --- | --- |
| File | models/reranker-phase6/production_model/reranker_v1.txt |
| HuggingFace | siddhm11/researchit-reranker-phase6 |
| Format | LightGBM v4 text (plain text, no pickle) |
| Objective | LambdaRank (optimizes nDCG) |
| Trees | 141 (early stopped from 500) |
| Features | 37 (see docs/PHASE6-HANDOFF.md for full schema) |
| Size | 974 KB |
| Latency | 0.143 ms per 100 candidates |
| Fallback | Heuristic scorer when model unavailable |
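
Since the reranker is trained with a LambdaRank objective that targets nDCG, it may help to see how nDCG@k is computed for one ranked list. This is a generic sketch using the common exponential gain `2^rel - 1`. The project's own evaluation code and relevance labels are not shown in this README, so treat the details as assumptions.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and pushing a relevant paper lower in the list reduces the score logarithmically with its rank.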

Recommendation Pipeline

┌─────────────────────────────────────────────────────────────────┐
│ Tier 1 (≥5 saves): Multi-Interest Clustering + Quota Fusion     │
│   1. Ward clustering → identify distinct interests              │
│   2. Hungarian matching → stabilize cluster IDs                 │
│   3. Quota allocation → per-cluster slot budgets                │
│   4. Parallel per-cluster ANN searches                          │
│   5. LightGBM reranking (37 features) + heuristic fallback      │
│   6. Category suppression (≥3 dismissals in 14 days)            │
│   7. MMR diversity (λ=0.6)                                      │
│   8. Exploration injection (2 serendipitous papers)             │
├─────────────────────────────────────────────────────────────────┤
│ Tier 2 (≥3 saves): EWMA long-term vector → single ANN search    │
├─────────────────────────────────────────────────────────────────┤
│ Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API               │
├─────────────────────────────────────────────────────────────────┤
│ Tier 0 (onboarded, 0 saves): Trending papers by category        │
└─────────────────────────────────────────────────────────────────┘
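
Step 7 applies Maximal Marginal Relevance with λ=0.6. A minimal sketch of greedy MMR selection follows, assuming each candidate carries a relevance score and an embedding. The tuple layout and the names `cosine` and `mmr_select` are hypothetical, and the project's actual implementation may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(candidates: list[tuple[str, float, list[float]]], k: int, lam: float = 0.6) -> list[str]:
    """Greedily pick k items, trading relevance against similarity to earlier picks.

    Each candidate is (paper_id, relevance, embedding).
    """
    selected: list[tuple[str, float, list[float]]] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(cand: tuple[str, float, list[float]]) -> float:
            _, relevance, vec = cand
            # Redundancy = highest similarity to anything already selected.
            redundancy = max((cosine(vec, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [paper_id for paper_id, _, _ in selected]
```

With λ=0.6 relevance still dominates, but a near-duplicate of an already selected paper loses up to 0.4 points of score, so a slightly less relevant paper from another area can take its slot.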

Quick Start

# Install dependencies
pip install -r requirements.txt

# Set environment variables (see .env.example)
cp .env.example .env
# Edit .env with your Qdrant, Zilliz, Turso, Groq credentials

# Run dev server
python run.py
# → http://127.0.0.1:7860

# Run tests
python -m pytest tests/ -v

# Run Phase 6 reranker integration tests
python tests/test_reranker_integration.py

Phase Completion Status

| Phase | Status | Description |
| --- | --- | --- |
| 1 | ✅ Complete | Zero-ML Recommender (Qdrant + HTMX) |
| 2a | ✅ Complete | EWMA Profile Embeddings |
| 2b | ✅ Complete | Ward Clustering + Multi-Interest |
| 2c | ✅ Complete | Heuristic Re-ranking + MMR |
| 3 | ✅ Complete | Hybrid Semantic Search (BGE-M3 + Qdrant + Zilliz + RRF) |
| 3.5 | ✅ Complete | Turso Metadata DB (2.9x faster search) |
| 4 | ✅ Complete | Quota Fusion + Hungarian Matching + Category Suppression |
| 4.5 | ✅ Complete | Instrumentation Foundation |
| 5 | ✅ Complete | Cold-Start Onboarding + UI Redesign |
| 6 | ✅ Complete | LightGBM Reranker (nDCG@10: 0.879, +233%) |
| 7 | 📋 Planned | Evaluation Framework |
| 8 | 📋 Planned | LLM Summaries + Distilled Reranker |
| 9 | 📋 Planned | Exploration + Collaborative Filtering |
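
Phase 3's hybrid search combines the dense (Qdrant) and sparse (Zilliz) result lists with Reciprocal Rank Fusion (RRF). The sketch below shows the standard RRF formula. The constant `k=60` is the commonly used default rather than a value confirmed by this README, and `rrf_fuse` is a hypothetical name.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

RRF uses only ranks, so the dense cosine scores and sparse inner-product scores never need to be put on a common scale.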

Key Documentation

| Document | Purpose |
| --- | --- |
| CLAUDE.md | Agent rulebook: architectural rules, doc precedence, code conventions |
| docs/TASK-TRACKER.md | Master task checklist with all phase details |
| docs/PHASE6-HANDOFF.md | LightGBM reranker handoff: model provenance, schema, reproduction |
| docs/phases/PHASE6-Reranker-Framing.md | Phase 6.1-6.3 framing: feature wiring, deployment verification, retraining strategy |
| docs/research/06-Deep-Research-Verdict.md | Source of truth for architecture decisions |
| docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md | Master roadmap (Phases 3–9) |
| docs/ML Intern docs/ | ML Intern conversation logs for model training |

Health & Monitoring

# Phase 6.3: Verify reranker deployment
curl -s https://siddhm11-researchit.hf.space/healthz/reranker | python -m json.tool
# Expected: {"model_loaded": true, "n_trees": 141, "fallback_active": false, ...}
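
For automated checks, the same health payload can be validated in Python. This is a small sketch that inspects only the three fields shown in the expected output above. The function name `reranker_healthy` is hypothetical, and any additional fields in the real response are ignored.

```python
import json


def reranker_healthy(payload: str, expected_trees: int = 141) -> bool:
    """Return True if the health JSON reports a loaded model with no fallback."""
    data = json.loads(payload)
    return (
        data.get("model_loaded") is True
        and data.get("fallback_active") is False
        and data.get("n_trees") == expected_trees
    )
```

The function takes the raw response body as a string, so it can be fed the output of the `curl` command or of any HTTP client.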

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| QDRANT_URL | Yes | Qdrant Cloud cluster URL |
| QDRANT_API_KEY | Yes | Qdrant Cloud API key |
| ZILLIZ_URI | Yes | Zilliz Cloud gRPC endpoint |
| ZILLIZ_TOKEN | Yes | Zilliz Cloud API token |
| TURSO_URL | Yes | Turso database URL |
| TURSO_DB_TOKEN | Yes | Turso auth token |
| GROQ_API_KEY | Yes | Groq API key for query rewriting |
| S2_API_KEY | No | Semantic Scholar API key (offline training scripts only, not used by the app) |
| RERANKER_MODEL_PATH | No | Override LightGBM model file path |
| DB_PATH | No | SQLite path (default: interactions.db) |
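
A startup check along these lines can catch a missing credential before the first request fails. It is a hedged sketch: the variable names come from the table above, while `missing_env` is a hypothetical helper and not necessarily how the app validates its configuration.

```python
import os
from typing import Mapping, Optional

# Variables marked "Yes" in the Required column.
REQUIRED = (
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "ZILLIZ_URI",
    "ZILLIZ_TOKEN",
    "TURSO_URL",
    "TURSO_DB_TOKEN",
    "GROQ_API_KEY",
)


def missing_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_env()` with no argument reads the process environment, and passing a dict makes the check easy to unit test.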

Test Suite

| Test File | Tests | Coverage |
| --- | --- | --- |
| test_profiles.py | 11 | EWMA profile computation |
| test_clustering.py | 21 | Ward clustering + Hungarian matching |
| test_reranker_diversity.py | 13 | Reranker (37-feature) + MMR diversity |
| test_reranker_integration.py | 7 | Phase 6 LightGBM integration |
| test_phase6_feature_wiring.py | 9 | Phase 6.1+6.2 feature wiring + per-candidate cluster |
| test_fusion.py | 20 | Quota allocation |
| test_db.py | 19 | SQLite schema + suppression |
| test_onboarding.py | 11 | Onboarding wizard |
| test_hybrid_search.py | 21 | Hybrid search pipeline |
| test_search_router.py | 6 | Search router |
| Others | ~13 | User state, saved, arxiv, qdrant, integration |
| Total | ~203 | |