ResearchIT: Master Task Tracker

Purpose: Single source of truth for all completed, in-progress, and upcoming work.
Last updated: 2026-05-05
Current phase: Phase 6.5 (Instrumentation), COMPLETE ✔ | Phase 7 next


Legend

  • [x]: Done
  • [/]: In progress
  • [ ]: Not started
  • [~]: Intentionally deferred (blocked by data/users/scale)
  • [!]: Backlog item (documented, not yet coded)

Phase 1: Zero-ML Recommender ✅ COMPLETE

Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.

  • Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
    • Collection: arxiv_bgem3_dense, 1024-dim dense vectors
    • File: app/qdrant_svc.py → _get_client()
  • BEST_SCORE Recommend API (raw paper IDs → Qdrant)
    • File: app/qdrant_svc.py → recommend()
  • arXiv keyword API search (placeholder, replaced in Phase 3)
    • File: app/arxiv_svc.py → search()
  • arXiv metadata fetching + SQLite cache
    • File: app/arxiv_svc.py → fetch_metadata_batch()
  • SQLite database schema (interactions, paper_metadata)
    • File: app/db.py → init_db()
    • WAL mode, async via aiosqlite
  • Cookie-based user identity
    • File: app/config.py → COOKIE_NAME
  • User state management (positive/negative deques)
    • File: app/user_state.py → UserState
  • Save/Dismiss event logging
    • File: app/routers/events.py
  • HTMX + Jinja2 frontend (search, recs, save, dismiss)
    • Files: app/templates/ (base.html, index.html, search.html, saved.html, partials/)
  • Test suite: 55 tests passing

Gaps: None.


Phase 2a: EWMA Profile Embeddings ✅ COMPLETE

Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.

  • Create app/recommend/ module with __init__.py
  • Create app/recommend/profiles.py: EWMA computation + storage
    • Long-term: α=0.03 ✅ (corrected from 0.10 per Doc 06)
    • Short-term: α=0.40
    • Negative: α=0.15
    • All embeddings L2-normalized
  • Modify app/db.py: add user_profiles table + user_clusters table
  • Modify app/qdrant_svc.py: add get_paper_vectors() and search_by_vector()
  • Modify app/routers/events.py: trigger EWMA updates on save/dismiss
  • Modify app/routers/recommendations.py: EWMA vector search with Tier 2 fallback
  • Add numpy + scipy to requirements.txt
  • Tests for profiles module: 11 passed
  • Full test suite: no regressions

Doc 06 correction applied: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recency-biased).
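
The EWMA update described above can be sketched as follows. This is a minimal illustration, not the code in app/recommend/profiles.py: the three decay rates come from the list above, while the function names and the choice to seed an empty profile with the first paper vector are assumptions.

```python
import numpy as np

# Decay rates from the task list; everything else here is illustrative.
ALPHA_LONG, ALPHA_SHORT, ALPHA_NEG = 0.03, 0.40, 0.15


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def ewma_update(profile, paper_vec, alpha: float) -> np.ndarray:
    """Blend a new paper embedding into a profile with decay rate alpha."""
    paper_vec = l2_normalize(np.asarray(paper_vec, dtype=np.float64))
    if profile is None:
        # Assumption: the first interaction seeds the profile directly.
        return paper_vec
    profile = np.asarray(profile, dtype=np.float64)
    return l2_normalize((1.0 - alpha) * profile + alpha * paper_vec)
```

With the same starting profile and the same new paper, the short-term profile (α=0.40) moves much further toward the new paper than the long-term profile (α=0.03), which is the behaviour the two rates are meant to produce.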

Gaps: None.


Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE

Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.

  • Create app/recommend/clustering.py: Ward clustering + medoid extraction
    • L2-normalize embeddings before Ward ✅ (Doc 06 correction)
    • Adaptive gap-based threshold (no fixed K)
    • Medoid representation (real papers, not centroids) ✅
    • Dynamic K (1–7 clusters, auto-determined)
    • Recency-weighted importance scores
  • Modify app/qdrant_svc.py: add multi_interest_search() with prefetch+RRF
  • Modify app/routers/recommendations.py: 3-tier cascading pipeline
    • Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
    • Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
    • Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
  • Tests for clustering module: 10 passed
  • Full test suite: no regressions

Doc 06 corrections applied: L2-normalization before Ward, medoid not centroid.
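
A rough sketch of the clustering step, assuming SciPy's hierarchical clustering. The gap rule shown here (pick the K whose merge-height jump is largest, and fall back to one cluster when no jump exceeds a `min_gap` floor) is one plausible reading of "adaptive gap-based threshold"; `choose_k`, `cluster_interests`, and `min_gap` are hypothetical names and values, not taken from app/recommend/clustering.py.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def choose_k(Z: np.ndarray, max_k: int = 7, min_gap: float = 0.1) -> int:
    """Pick the cluster count with the largest jump in Ward merge height."""
    heights = Z[:, 2]            # merge distances, in ascending order
    n = len(heights) + 1         # number of original points
    best_k, best_gap = 1, min_gap
    for k in range(2, min(max_k, n - 1) + 1):
        # Height of the merge that would go from k to k-1 clusters,
        # minus the height of the merge that produced k clusters.
        gap = heights[n - k] - heights[n - k - 1]
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


def cluster_interests(embeddings, max_k: int = 7, min_gap: float = 0.1):
    """Return (labels, medoids); medoids maps cluster label -> row index."""
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize first
    if len(X) < 2:
        labels = np.ones(len(X), dtype=int)
    else:
        Z = linkage(X, method="ward")
        labels = fcluster(Z, t=choose_k(Z, max_k, min_gap), criterion="maxclust")
    medoids = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        sims = X[idx] @ X[idx].T  # cosine similarity within the cluster
        # The medoid is the real paper most similar to its cluster mates.
        medoids[int(c)] = int(idx[np.argmax(sims.sum(axis=1))])
    return labels, medoids
```

Because each medoid is an index into the input rows, every cluster is represented by an actual saved paper rather than an averaged vector.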

Gaps (deferred to Phase 4):

  • [!] RRF → quota fusion (dominant clusters can swamp minority interests)
  • [!] Hungarian matching for cluster ID stability across reclusterings

Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE

Added scoring and diversity layers on top of retrieval to produce the final feed.

  • Create app/recommend/reranker.py: 5-feature heuristic scorer
    • Feature 1: cosine_sim_longterm (weight 0.40)
    • Feature 2: cosine_sim_shortterm (weight 0.25)
    • Feature 3: paper_age_days / recency (weight 0.15)
    • Feature 4: rrf_position (weight 0.10)
    • Feature 5: cosine_sim_negative (weight -0.15) ✅ (Doc 06 addition)
  • Create app/recommend/diversity.py: MMR + exploration injection
    • MMR with λ=0.6
    • 2 serendipitous exploration papers per feed
  • Modify app/routers/recommendations.py: full 5-step pipeline
    • Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
  • Tests for reranker + diversity: 13 passed
  • Full test suite: 88 passed (86 + 2 pre-existing live Qdrant failures resolved)

Doc 06 correction applied: Negative EWMA profile wired as Feature 5 with 0.15 penalty.
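
For reference, the MMR step with λ=0.6 can be sketched like this. It is a generic greedy MMR over cosine similarity, written under the assumption that relevance scores come from the reranker; the function name and signature are illustrative and may differ from app/recommend/diversity.py.

```python
import numpy as np


def mmr_select(relevance, embeddings, k: int, lam: float = 0.6) -> list:
    """Greedy Maximal Marginal Relevance: returns indices of chosen items."""
    relevance = np.asarray(relevance, dtype=np.float64)
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                      # pairwise cosine similarity
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        if selected:
            # Redundancy = similarity to the closest already-selected item.
            redundancy = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * relevance[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a feed where the two most relevant papers are near-duplicates, the second slot goes to a less relevant but dissimilar paper, since the duplicate is penalized by (1 − λ) times its similarity to the first pick.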

Gaps: None. LightGBM model now integrated (Phase 6 ✅).


Phase 2d: Advanced Models ❌ DEFERRED (Blocked by data/users)

These logically belong to the recommendation engine but cannot be built without real user data or scale.

  • [~] LightGBM lambdarank model, requires ≥500 labeled save/dismiss interactions → Phase 6
  • [~] Collaborative filtering features, requires ≥500 users → Phase 9
  • [~] DPP diversity, explicitly ruled out for v1 by Doc 06 → Phase 9+
  • [~] Two-Tower model, requires GPU + large dataset → Phase 9+

Phase 3: Hybrid Semantic Search ✅ COMPLETE

Replace the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.
Detailed plan: docs/phases/PHASE3-Hybrid-Semantic-Search.md
Prototype reference: docs/phases/PHASE2-Hybrid-Search-Plan.md
Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

New files created

  • app/embed_svc.py: BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570MB, ~15s cold)
    • encode_query(text) → (dense: np.ndarray[1024], sparse: dict)
    • LRU cache for repeat queries
    • Thread-safe, lazy loading with double-check locking
  • app/zilliz_svc.py: Zilliz Cloud sparse search client
    • Collection: arxiv_bgem3_sparse
    • Schema: id (INT64 auto PK), arxiv_id (VARCHAR), sparse_vector (SPARSE_FLOAT_VECTOR)
    • Index: SPARSE_INVERTED_INDEX, metric_type=IP
    • Sparse format: {int_token_id: float_weight} (BGE-M3 lexical weights, NOT string words)
    • search_sparse(sparse_dict, limit) → list[dict] with arxiv_id + score
    • gRPC reconnect handling
  • app/groq_svc.py: LLM query rewriter (Groq / llama-3.3-70b)
    • rewrite(user_query) → academic query string
    • Graceful fallback to original query on error
    • Academic-detection heuristic to skip unnecessary rewrites
    • 2s hard timeout
  • app/hybrid_search_svc.py: search orchestrator
    • Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
    • Each step has independent failure handling
    • Recency reranking: 0.80 RRF + 0.20 recency
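
The fusion and recency steps of the orchestrator can be sketched as below. The 0.80/0.20 blend is from the list above; the RRF constant k=60 (the commonly used default), the max-normalization of RRF scores, and the function names are assumptions, and the recency values are taken as precomputed scores in [0, 1].

```python
def rrf_fuse(dense_ids: list, sparse_ids: list, k: int = 60) -> dict:
    """Reciprocal Rank Fusion: sum 1/(k + rank) over both ranked lists."""
    scores = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, paper_id in enumerate(ranked, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    return scores


def blend_with_recency(rrf_scores: dict, recency: dict,
                       w_rrf: float = 0.80, w_recency: float = 0.20) -> list:
    """Rank paper IDs by w_rrf * normalized RRF + w_recency * recency."""
    if not rrf_scores:
        return []
    top = max(rrf_scores.values())
    final = {
        pid: w_rrf * (score / top) + w_recency * recency.get(pid, 0.0)
        for pid, score in rrf_scores.items()
    }
    return sorted(final, key=final.get, reverse=True)
```

A paper found by both retrievers accumulates two reciprocal-rank terms, so it outranks papers that only one retriever returned; the recency term can then reorder papers whose fused scores are close.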

Files modified

  • app/config.py: added ZILLIZ_URI, ZILLIZ_TOKEN, ZILLIZ_COLLECTION, GROQ_API_KEY, BGE_M3_MODEL, BGE_M3_DEVICE, ENCODE_CACHE_SIZE, search weights, APP_PORT
  • app/qdrant_svc.py: added search_dense(dense_vec, limit) for raw vector search returning scores
  • app/routers/search.py: swapped arxiv_svc.search() → hybrid_search_svc.search() with arXiv fallback
  • app/main.py: added graceful BGE-M3 warm-up to lifespan
  • requirements.txt: added FlagEmbedding, pymilvus, groq
  • run.py: configurable port (7860 default for HF Spaces)

Deployment files created

  • Dockerfile: HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
  • .dockerignore: excludes notebooks, PDFs, databases, caches

Implementation steps completed

  • Step 1: BGE-M3 model service (embed_svc.py) + unit tests
  • Step 2: Zilliz client (zilliz_svc.py)
  • Step 3: Dense search in Qdrant service
  • Step 4: Groq rewriter (groq_svc.py)
  • Step 5: Hybrid search orchestrator (hybrid_search_svc.py)
  • Step 6: Swap search router
  • Step 7: Model warm-up + deployment config
  • Step 8: Tests, 21 new tests passing (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)

Test results

  • 88 original tests: ✅ All pass (zero regressions)
  • 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
  • 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
  • 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
  • Total: 123 tests passing

Latency budget

| Stage | Time |
|---|---|
| LLM rewrite (Groq) | ~300ms (skippable) |
| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
| Qdrant + Zilliz (parallel) | ~300ms |
| RRF + rerank | <5ms |
| Total (warm) | ~600ms |

Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE

Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.
Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).
Integrated into codebase and deployed to HF Spaces.

Infrastructure

  • Turso cloud DB created: arxiv-data on aws-ap-south-1
    • URL: https://arxiv-data-siddhm11.aws-ap-south-1.turso.io
    • Auth: Platform token + DB auth token (minted via CLI)
  • Table: papers with columns:
    • arxiv_id (TEXT, UNIQUE INDEX idx_papers_arxiv_id)
    • title (TEXT)
    • authors (TEXT)
    • categories (TEXT)
    • primary_topic (TEXT)
    • update_date (TEXT)
    • abstract_preview (TEXT, truncated to 500 chars)
    • citation_count (INTEGER, default 0)
    • influential_citations (INTEGER, default 0)
  • Data sources:
    • arxiv_comprehensive_papers.csv (Kaggle: siddhm11/arxivdata)
    • arxiv_citations_summary.csv (Kaggle: siddhm11/citation-data-letsgoo)
    • Joined on id = arxiv_id_clean, deduplicated
  • Row count verified: local ↔ remote match
  • Unique index on arxiv_id for fast lookups

Integration (DONE)

  • Added TURSO_URL and TURSO_DB_TOKEN to config.py / .env / HF Secrets
  • Created app/turso_svc.py: metadata lookup service
    • fetch_metadata_batch(arxiv_ids) → {arxiv_id: paper_dict}
    • Uses Turso HTTP pipeline API (zero new Python deps, just httpx)
    • Includes citation_count + influential_citations
  • app/routers/search.py: Turso primary, arXiv API fallback (only for IDs not in Turso)
  • Created tests/test_turso_timing.py: timing benchmark
  • Verified: 10/10 title match, 6.1x end-to-end speedup on HF Spaces
  • Impact: Avg search time dropped from ~10.7s to ~1.75s on HF Spaces
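
As a rough illustration of the batch lookup, the sketch below builds and parses a pipeline payload without making the network call. It assumes the libSQL HTTP pipeline format (a JSON body of `requests`, posted to a `/v2/pipeline` path with a bearer token, with typed cells in the response); that layout is an assumption to verify against Turso's documentation, and both helper names are hypothetical. The column names are those of the papers table listed above.

```python
from typing import Any

# Column names from the papers table described above.
COLUMNS = [
    "arxiv_id", "title", "authors", "categories", "primary_topic",
    "update_date", "abstract_preview", "citation_count",
    "influential_citations",
]


def build_pipeline_request(arxiv_ids: list) -> dict:
    """Build one parameterized SELECT for a batch of arXiv IDs."""
    placeholders = ", ".join("?" for _ in arxiv_ids)
    sql = (
        f"SELECT {', '.join(COLUMNS)} FROM papers "
        f"WHERE arxiv_id IN ({placeholders})"
    )
    return {
        "requests": [
            {
                "type": "execute",
                "stmt": {
                    "sql": sql,
                    "args": [{"type": "text", "value": i} for i in arxiv_ids],
                },
            },
            {"type": "close"},
        ]
    }


def parse_pipeline_response(payload: dict) -> dict:
    """Turn the first statement's rows into {arxiv_id: paper_dict}."""
    result = payload["results"][0]["response"]["result"]
    names = [col["name"] for col in result["cols"]]
    papers: dict = {}
    for row in result["rows"]:
        record: dict[str, Any] = {}
        for name, cell in zip(names, row):
            value = cell.get("value")
            # Assumption: integers arrive as strings and need converting.
            if cell.get("type") == "integer" and value is not None:
                value = int(value)
            record[name] = value
        papers[record["arxiv_id"]] = record
    return papers
```

Sending the whole batch as one parameterized statement keeps the lookup to a single round trip, which matters more than per-query cost when the database is remote.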

Phase 4: Recommendation Pipeline Fixes ✅ COMPLETE

Fixed the known architectural debt in the recommendation pipeline.
Detailed plan: docs/phases/PHASE4-Recommendation-Pipeline-Fixes.md

4.1 - Replace RRF with Importance-Weighted Quota Fusion

  • Create app/recommend/fusion.py: quota allocation logic
    • w_k = importance_k / sum(importance_k)
    • slot_k = max(floor(F × w_k), F_min=3), so every cluster gets at least 3 slots
    • Distribute remainder by largest fractional part
  • Create tests/test_fusion.py: 20 unit tests for quota allocation
    • Proportionality, floor enforcement, total invariant, edge cases, Doc 06 worked examples
  • Refactor _multi_interest_recommend() in recommendations.py
    • Replace multi_interest_search() with per-cluster separate ANN queries
    • Use asyncio.gather() for concurrent searches (~15ms wall-clock)
    • Allocate feed slots proportionally via allocate_quotas()
    • Deduplicate across clusters (first-occurrence = highest-ranked cluster wins)
    • MMR over merged union (unchanged)
  • Keep qdrant_svc.multi_interest_search() in codebase (no deletion)
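
The quota rules above can be sketched as follows. The proportional weights, the floor of 3, and the largest-fractional-part remainder rule are from the list; how to restore the total when the floor pushes the sum above the feed size is not stated, so taking slots back from the largest cluster is an assumption, and this `allocate_quotas` signature may differ from the one in app/recommend/fusion.py.

```python
import math


def allocate_quotas(importances: list, feed_size: int, f_min: int = 3) -> list:
    """Split feed_size slots across clusters in proportion to importance."""
    if not importances:
        return []
    total = sum(importances)
    raw = [feed_size * imp / total for imp in importances]
    slots = [max(math.floor(r), f_min) for r in raw]

    # Hand out any leftover slots by largest fractional part.
    leftover = feed_size - sum(slots)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True
    )
    i = 0
    while leftover > 0:
        slots[by_fraction[i % len(slots)]] += 1
        leftover -= 1
        i += 1

    # Assumption: if the floor over-allocated, shrink the largest cluster
    # that is still above the floor until the total matches.
    while leftover < 0:
        donors = [j for j in range(len(slots)) if slots[j] > f_min]
        if not donors:
            break
        slots[max(donors, key=lambda j: slots[j])] -= 1
        leftover += 1
    return slots
```

For example, importances of 6, 3, and 1 with a 20-slot feed give raw shares of 12, 6, and 2; the floor lifts the smallest cluster to 3, and one slot is taken back from the largest cluster so the total stays at 20.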

4.2 - Pre-populate Metadata Store ✅ DONE (via Turso)

  • Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
  • 1.23 GB, includes citation counts from Semantic Scholar
  • Wired Turso service into search.py (Turso primary, arXiv API fallback)
  • arXiv API is now fallback only for genuinely new papers
  • Impact: Search time dropped from ~10.7s to ~1.75s on HF Spaces

4.3 - Hungarian Matching for Cluster Stability

  • Add stabilize_cluster_ids() function to clustering.py
    • Uses scipy.optimize.linear_sum_assignment (already a dependency)
    • Cost matrix: 1 - cosine_sim(new_medoid, old_medoid), trivial at K≤7
    • Matched clusters keep old indices; new clusters get next available
    • Min cosine threshold (0.5) rejects unrelated matches
  • Call between compute_clusters() and save_clusters_to_db() in recommendations.py
  • 10 tests in test_clustering.py: perturbed clusters preserve indices, unrelated match rejection, K growth/shrink, custom thresholds
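
A minimal sketch of the matching step, assuming medoids are passed as row-aligned arrays and that "next available" means the first index past the old clusters. The signature and return shape are illustrative and not necessarily those of stabilize_cluster_ids() in clustering.py.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def stabilize_cluster_ids(new_medoids, old_medoids, min_cosine: float = 0.5) -> list:
    """Return a stable ID for each new cluster (position i -> ID)."""
    new = np.asarray(new_medoids, dtype=np.float64)
    old = np.asarray(old_medoids, dtype=np.float64)
    if len(old) == 0:
        return list(range(len(new)))
    new = new / np.linalg.norm(new, axis=1, keepdims=True)
    old = old / np.linalg.norm(old, axis=1, keepdims=True)
    sim = new @ old.T                         # cosine similarity matrix
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize total distance

    ids = [None] * len(new)
    for r, c in zip(rows, cols):
        if sim[r, c] >= min_cosine:           # reject unrelated matches
            ids[r] = int(c)

    # Assumption: unmatched clusters take fresh IDs after the old range.
    next_id = len(old)
    for r in range(len(new)):
        if ids[r] is None:
            ids[r] = next_id
            next_id += 1
    return ids
```

With at most 7 clusters on each side the cost matrix is tiny, so the assignment is effectively free compared with the ANN queries around it.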

4.4 - Category-Level Negative Suppression

  • Add get_suppressed_categories() to db.py
    • Joins interactions + paper_metadata to find categories with ≥3 dismissals
    • Primary category only (decision: avoid over-suppression)
    • 14-day window (standard default, τ_neg = 14 days)
  • Add suppression filter in _multi_interest_recommend() after reranking
  • Cache Turso metadata to paper_metadata via cache_turso_metadata_batch()
  • 8 tests in test_db.py: threshold, partitioning, user isolation, custom threshold
  • [~] Per-item short-term decay → deferred to Phase 6 (LightGBM feature)
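
The suppression query might look roughly like the sketch below. The real schema is not shown in this tracker, so the column names (`user_id`, `paper_id`, `action`, `created_at`, `primary_category`) are hypothetical, and the synchronous `sqlite3` module stands in for aiosqlite to keep the example short.

```python
import sqlite3

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS interactions (
    user_id TEXT, paper_id TEXT, action TEXT, created_at TEXT
);
CREATE TABLE IF NOT EXISTS paper_metadata (
    arxiv_id TEXT PRIMARY KEY, primary_category TEXT
);
"""

SUPPRESSED_SQL = """
SELECT m.primary_category
FROM interactions AS i
JOIN paper_metadata AS m ON m.arxiv_id = i.paper_id
WHERE i.user_id = ?
  AND i.action = 'dismiss'
  AND i.created_at >= datetime('now', ?)
GROUP BY m.primary_category
HAVING COUNT(*) >= ?
"""


def get_suppressed_categories(conn: sqlite3.Connection, user_id: str,
                              threshold: int = 3, window_days: int = 14) -> set:
    """Primary categories the user dismissed at least `threshold` times
    within the last `window_days` days."""
    rows = conn.execute(
        SUPPRESSED_SQL, (user_id, f"-{window_days} days", threshold)
    ).fetchall()
    return {row[0] for row in rows}
```

Grouping on the primary category alone means a cross-listed paper counts against only one category, which matches the stated decision to avoid over-suppression.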

Gaps: None.


Phase 4.5: Instrumentation Foundation ✅ COMPLETE

Added telemetry columns to the interactions table so every saved/dismissed paper can be attributed to its pipeline tier, cluster origin, and ranker version. Doc 07 (ADR A4) identified this as the single most valuable early investment: retrofitting these fields after real user data exists is painful and blocks all later counterfactual evaluation.

Schema changes

  • Add ranker_version TEXT to interactions table: pipeline version tag
  • Add candidate_source TEXT to interactions, e.g. cluster_0, exploration, ewma_longterm, qdrant_recommend, short_term_supplement
  • Add cluster_id INTEGER to interactions: interest cluster index (NULL if N/A)
  • ALTER TABLE migration for existing DBs (safe try/except, idempotent)
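
An idempotent migration of this kind can be sketched as below, using the synchronous `sqlite3` module for brevity. The three column definitions are from the list above; the function name and the choice to swallow only the "duplicate column name" error are assumptions about how the try/except is written.

```python
import sqlite3

# Columns from the schema changes listed above.
NEW_COLUMNS = [
    ("ranker_version", "TEXT"),
    ("candidate_source", "TEXT"),
    ("cluster_id", "INTEGER"),
]


def migrate_interactions(conn: sqlite3.Connection) -> None:
    """Add telemetry columns; safe to run repeatedly on the same database."""
    for name, sql_type in NEW_COLUMNS:
        try:
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {name} {sql_type}")
        except sqlite3.OperationalError as exc:
            # SQLite reports an existing column as "duplicate column name".
            if "duplicate column name" not in str(exc).lower():
                raise
    conn.commit()
```

Re-raising anything other than the duplicate-column error keeps genuine failures, such as a missing table, from being silently ignored.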

Pipeline tagging

  • Add _RANKER_VERSION constant to recommendations.py
  • Tag Tier 1 papers with cluster origin, exploration status, short-term supplement
  • Tag Tier 2 papers as ewma_longterm
  • Tag Tier 3 papers as qdrant_recommend
  • Build paper_cluster_map before quota merge (first-occurrence = cluster attribution)
  • Exploration papers tagged as candidate_source='exploration'

End-to-end flow

  • recommendations.py embeds tags in paper dicts
  • action_buttons.html includes tags in hx-vals JSON
  • events.py accepts ranker_version, candidate_source, cluster_id Form fields
  • db.log_interaction() stores all three new columns

Files modified: app/db.py, app/routers/events.py, app/routers/recommendations.py, app/templates/partials/action_buttons.html

Gaps: None. propensity and policy_id fields deferred until ε-greedy exploration (Phase 9).


Phase 5: Cold-Start Onboarding ✅ COMPLETE

Onboarding wizard for new users: category selection + seed paper search + trending fallback.
Reference: Doc 06, "4-37% lift even once behavioral data exists"

5.1 - arXiv Category Multi-Select ✅

  • UI screen on first visit: select 1-8 arXiv category groups
  • Store selections in SQLite (user_onboarding table)
  • Use as pool filter for recommendations (via get_user_category_filter())
  • Preserve as LightGBM feature permanently (Feature 26: onboarding_category_match)
  • Does NOT create "subject vectors", just filters

5.2 - Seed Paper Import ✅

  • Let users search for and save seed papers during onboarding
  • Immediately create EWMA profiles + Ward clusters on next feed request
  • Uses hybrid search (Phase 3) for discovery

5.3 - ORCID / Semantic Scholar Import ❌ REMOVED

S2 author import was implemented but then removed because it is not the onboarding direction we want. Onboarding focuses on category selection + manual seed paper search.

5.4 - Popularity Fallback ✅

  • Category-filtered trending papers served via turso_svc.fetch_trending_by_categories()
  • 1-hour TTL trending cache for performance
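
A 1-hour TTL cache like the one mentioned above can be as small as the sketch below. The class name, the injectable clock, and keying by a tuple of categories are illustrative assumptions rather than the actual caching code.

```python
import time
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can fake time
        self._entries: dict = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # drop stale entry on read
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

A caller would key the cache on the sorted tuple of selected categories, so users with the same selection share one cached trending list.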

Phase 6: LightGBM Re-ranker ✅ COMPLETE

Replaced heuristic scorer with a trained LightGBM lambdarank model.
Unblocked via citation-graph pseudo-labels from Semantic Scholar.
Handoff doc: docs/PHASE6-HANDOFF.md
Model repo: siddhm11/researchit-reranker-phase6

6.1 - ML Intern: Data Pipeline + Model Training ✅

  • Export 1.6M arXiv IDs from Turso → arxiv_ids.txt (scripts/export_arxiv_ids.py)
  • Fetch 242K citation edges from Semantic Scholar Batch API (01_fetch_citation_edges.py)
  • Generate 98K training triples with pseudo-labels: cited=2, co-cited=1, negative=0 (02_generate_training_triples.py)
  • 37-feature schema (20 content, 11 user behavior, 6 cross-features)
  • Train LightGBM LambdaRank model: 141 trees, 63 leaves, lr=0.05 (03_train_lightgbm.py)
  • nDCG@10 = 0.879 (+233% vs heuristic baseline)
  • All artifacts pushed to HuggingFace

6.2 - Opus: Integration into ResearchIT ✅

  • Rewrite app/recommend/reranker.py: 5 features → 37 features
  • LightGBM model loading at import time with heuristic fallback
  • Multi-path model file search (env var → relative → absolute)
  • Backward-compatible rerank_candidates() signature (old callers unaffected)
  • Add lightgbm>=4.0,<5.0 to requirements.txt
  • Fix CRLF→LF line endings in model file (Windows Git issue)
  • 7 integration tests, all passing (tests/test_reranker_integration.py)
  • Latency verified: 0.223ms per 100 candidates (target: <1ms) ✅

6.3 - Antigravity: Feature Wiring + Deployment Verification ✅

  • Wire all 37 features into recommendations.py caller (was legacy 6-arg signature)
  • Per-candidate cluster_importance (N,) from paper_cluster_map
  • Per-candidate cluster_medoid (N, 1024) per source cluster
  • Pre-computed is_suppressed_category and onboarding_category_match arrays
  • Pass qdrant_scores, user_total_saves, user_total_dismissals
  • reranker.py supports both scalar broadcast and per-candidate arrays
  • Add model accessors: is_model_loaded(), get_num_trees(), get_loaded_model_path()
  • Add per-request feature activation logging
  • Create GET /healthz/reranker endpoint (app/routers/health.py)
  • Bug B fix: persist medoid_embedding_blob BLOB in user_clusters table
  • Bug B fix: fall back to persisted blob instead of zero vector in Hungarian matching
  • DB migration: ALTER TABLE user_clusters ADD COLUMN medoid_embedding_blob BLOB
  • 9 new tests, all passing (tests/test_phase6_feature_wiring.py)
  • Full suite: 203+ tests passing, 0 failures
  • Updated CLAUDE.md, PHASE6-HANDOFF.md, README.md

6.4 - Retraining [~] DEFERRED

Phase 6.4 retraining is deferred. The published model siddhm11/researchit-reranker-phase6 was trained on citation pseudo-labels with features 23–30 set to zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.

  • [~] Synthetic user simulator (scripts/simulate_users.py), target: +30d
  • [~] Real-user retrain at 100-user threshold, target: +90d or threshold
  • [~] HF model card backfill (library_name, pipeline_tag, metrics, schema)

Phase 6.5: Instrumentation ✅ COMPLETE

Purpose: Stabilize the recommendation pipeline and prepare telemetry substrate for Phase 7 evaluation.

A1 - Real Qdrant cosine scores

  • Switch search_by_vector() → search_by_vector_with_scores() in per-cluster + short-term searches
  • Build qdrant_score_map from real cosines (replaces fake 1.0 - rank*0.01 linear decay)
  • Feature 0 (qdrant_cosine_score) now receives actual cosine similarities

A2 - Deployment verification

  • curl /healthz/reranker → model_loaded=true, n_trees=141, fallback_active=false
  • Verification timestamp added to PHASE6-Reranker-Framing.md

B1 - query_id linkage

  • Generate query_id (UUID) once per feed request in get_recommendations()
  • Thread through all 4 tiers: trending, Tier 1, Tier 2, Tier 3
  • Generate query_id in search.py per search request
  • Add query_id + position to action_buttons.html hx-vals

B2 - Propensity logging

  • Add propensity REAL + policy_id TEXT migration to interactions table
  • Extend db.log_interaction() with propensity + policy_id params
  • Compute propensity: 1.0 (deterministic) vs n_explore/pool_size (exploration)
  • Thread through templates + events.py Form params
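
The propensity rule above reduces to a small function. In this sketch the `candidate_source == "exploration"` check reuses the tag from Phase 4.5, while the function name and the guard against an empty pool are assumptions.

```python
def compute_propensity(candidate_source: str, n_explore: int, pool_size: int) -> float:
    """Probability that the logging policy showed this paper.

    Deterministically ranked papers are always shown (1.0); exploration
    papers are drawn from a pool, so each has n_explore / pool_size.
    """
    if candidate_source != "exploration":
        return 1.0
    if pool_size <= 0:
        raise ValueError("pool_size must be positive for exploration papers")
    return min(1.0, n_explore / pool_size)
```

With 2 exploration slots drawn from a pool of 50 candidates, each exploration paper is logged with propensity 0.04, which is the weight an inverse-propensity estimator would later divide by.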

B3 - Cluster snapshot versioning

  • Add cluster_snapshots table (append-only, content-addressed via paper_ids_hash)
  • save_cluster_snapshot() called after each save_clusters_to_db()
  • prune_old_snapshots(30) on startup in main.py lifespan
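
Content addressing via paper_ids_hash could be done as in the sketch below. Using SHA-256 over the sorted, de-duplicated, newline-joined IDs is an assumption about the canonical form; the tracker only states that snapshots are keyed by a hash of the paper IDs.

```python
import hashlib
from typing import Iterable


def paper_ids_hash(paper_ids: Iterable) -> str:
    """Order-independent fingerprint of a cluster's member papers."""
    canonical = "\n".join(sorted(set(paper_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the IDs are sorted first, reclustering that yields the same membership in a different order produces the same hash, so an append-only table can detect that nothing changed.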

B4 - S2 author import ❌ REMOVED

S2 author import was implemented and then removed because it is not the onboarding direction we want. app/s2_svc.py, the /api/onboarding/import-author endpoint, and the quick-import UI have all been deleted. Onboarding uses category selection + manual seed search only.

Documentation

  • CLAUDE.md: Rule 3.11, interaction instrumentation invariants
  • _RANKER_VERSION bumped to v6.5_lightgbm_real_cosines
  • Phase status updated to 6.5 COMPLETE
  • Tests: 203+ passing

Test suite

  • tests/test_reranker_integration.py: 7 tests (smoke, features, heuristic, E2E, latency, backward compat, comparison)
  • tests/test_phase6_feature_wiring.py: 9 tests (per-candidate arrays, broadcast medoid, model accessors, aggregate activation)
  • tests/demo_reranker.py: interactive demo with 20 realistic papers

Phase 7: Evaluation Framework 📋 NOT STARTED

Build offline and online evaluation before scaling users.
Estimated effort: ~1 week

  • Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
  • Time-split evaluation on unarXive 2022 + S2ORC
  • Online metrics (once users exist): CTR, save rate, dwell time, return rate

Phase 8: LLM Interest Summaries + Distilled Re-ranker 📋 NOT STARTED

Estimated effort: ~10-12 weeks (Doc 07)
Detailed research plan: docs/research/07-LLM-Summaries-Reranker-and-Scaling-Research.md
Entry criteria: Phase 7 eval producing stable nDCG@10; cluster stability Jaccard ≥0.7 over 7 days

8a - Claude-generated per-cluster interest summaries (Doc 07 §A)

  • Cluster snapshot versioning (ADR A1)
  • Content-addressed caching: sha256(sorted(paper_ids) + prompt_version + model)
  • Shared summaries (not per-user): Haiku 4.5 + Batch API (~$50-80/month @ 1K users)
  • Nightly regeneration job with 7-day TTL + event-triggered refresh
  • "You're reading about X" UI framing with sub-theme bullets
  • Anthropic Citations API for hallucination prevention

8b - Distilled cross-encoder reranker (Doc 07 §B)

  • Deploy cross-encoder/ms-marco-TinyBERT-L-2-v2 INT8 ONNX as MVP
  • 6ms budget for 20 pairs on CPU (AVX-512 VNNI)
  • TinyBERT score as LightGBM feature (Option C architecture)
  • Custom distillation from BGE-reranker-v2-m3 only if held-out gap >3 nDCG
  • MarginMSE loss + SciNCL citation-graph hard negatives

8c - Use-cases and information-gain design doc (Doc 07 §C)

  • 8 user personas (P1 cold-start through P8 stay-current)
  • Information-gain table (save=3-5×, dismiss-as-label=−3-4×, passive skip=−0.1×)
  • Mode-switching UI: "Stay Current" vs "Lit Review" toggle
  • Failure mode detection rules (feed collapse, stale profile, filter bubble)

Phase 9: Exploration + Collaborative Filtering 📋 NOT STARTED

Blocked by: ≥500 users

  • Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
  • LightFM hybrid CF model with switching strategy
  • Category-level negative suppression
  • Retrain LightGBM with dismissals as negative labels
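
Since Phase 9 is not started, the following is only a sketch of how the planned epsilon-greedy step might work. The two ε values are from the list above; the save-count cutoff separating "new" from "established" users (`established_after`) is a hypothetical parameter, as are the function names.

```python
import random


def epsilon_for(total_saves: int, established_after: int = 20) -> float:
    """Exploration rate: 0.25 for new users, 0.05 once established.

    `established_after` is a hypothetical cutoff, not a documented value.
    """
    return 0.05 if total_saves >= established_after else 0.25


def pick_slot(ranked: list, exploration_pool: list, epsilon: float,
              rng: random.Random) -> tuple:
    """Fill one feed slot, returning (paper_id, candidate_source)."""
    if exploration_pool and rng.random() < epsilon:
        return rng.choice(exploration_pool), "exploration"
    return ranked[0], "exploit"
```

Passing the random generator in explicitly keeps the choice reproducible in tests and makes it straightforward to log which policy produced each slot.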

Appendix: Infrastructure Status

| Component | Status | Details |
|---|---|---|
| Qdrant Cloud | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
| Zilliz Cloud | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection arxiv_bgem3_sparse |
| Turso (libSQL) | ✅ Live | 1.23 GB arXiv metadata + citations, arxiv-data DB, papers table, unique index on arxiv_id |
| SQLite | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters |
| HF Spaces | ✅ Deployed | Docker SDK, free tier, port 7860, https://siddhm11-researchit.hf.space |
| Render | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
| arXiv API | ✅ Fallback only | Keyword search + metadata for papers not in Turso |
| BGE-M3 Model | ✅ Live | Pre-baked in Docker image, warm-up at startup |
| Groq API | ✅ Live + HF Secret | app/groq_svc.py: 2s timeout, academic heuristic skip |
| Notebooks | ✅ Organized | notebooks/: 01-upload, 02-test, 03-search-benchmark |

Credentials Status

| Credential | Status | Env Var | Notes |
|---|---|---|---|
| Qdrant Cloud | ✅ In .env | QDRANT_URL, QDRANT_API_KEY | Already wired |
| Zilliz Cloud | ✅ In .env | ZILLIZ_URI, ZILLIZ_TOKEN | Phase 3, wired |
| Turso (libSQL) | ✅ In .env + HF | TURSO_URL, TURSO_DB_TOKEN | Phase 3.5, wired + deployed |
| Groq | ✅ In .env + HF | GROQ_API_KEY | Phase 3, wired + deployed |
| HF Spaces | ✅ Deployed | Secrets panel | All env vars set ✔ |

Appendix: Test Suite

| Test File | Count | Status |
|---|---|---|
| tests/test_profiles.py | 11 | ✅ Passing |
| tests/test_clustering.py | 21 | ✅ Passing |
| tests/test_reranker_diversity.py | 13 | ✅ Passing |
| tests/test_reranker_integration.py | 7 | ✅ Passing |
| tests/test_phase6_feature_wiring.py | 9 | ✅ Passing |
| tests/test_fusion.py | 20 | ✅ Passing |
| tests/test_db.py | 19 | ✅ Passing |
| tests/test_qdrant_svc.py | - | ✅ Passing |
| tests/test_arxiv_svc.py | - | ✅ Passing |
| tests/test_integration.py | - | ✅ Passing |
| tests/test_user_state.py | - | ✅ Passing |
| tests/test_saved.py | - | ✅ Passing |
| tests/test_hybrid_search.py | 21 | ✅ Passing |
| tests/test_search_router.py | 6 | ✅ Passing |
| tests/test_live_search.py | 8 | ✅ Passing |
| Total | 203+ | ✅ |
| test_e2e_recs.py (standalone) | 1 | ✅ E2E simulation |

Appendix: Doc 06 Corrections Tracking

| Correction | Status | Where |
|---|---|---|
| α_long 0.10 → 0.03 | ✅ Applied | app/recommend/profiles.py:30 |
| L2-normalize before Ward clustering | ✅ Applied | app/recommend/clustering.py |
| Medoid not centroid | ✅ Applied | app/recommend/clustering.py → _find_medoid() |
| Negative EWMA wired into reranking | ✅ Applied | app/recommend/reranker.py → Feature 5 |
| RRF → quota fusion for recommendations | ✅ Applied | app/recommend/fusion.py (Phase 4.1) |
| Hungarian cluster matching | ✅ Applied | app/recommend/clustering.py → stabilize_cluster_ids() (Phase 4.3) |
| Per-item short-term negative decay | [!] Backlog | Phase 6 (LightGBM feature) |
| Category-level suppression | ✅ Applied | app/db.py → get_suppressed_categories() (Phase 4.4) |
| BGE-reranker NEVER in hot path | ✅ Followed | Heuristic scorer used instead |