ResearchIT Documentation

All project documentation, organized by purpose. Each document has a specific role in the project lifecycle.


📁 Folder Structure

docs/
├── README.md                     ← you are here
│
├── TASK-TRACKER.md               ← master checklist (all phases)
│
├── research/                     ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                       ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md     (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md (ACTIVE PHASE 3 PLAN)
│
└── walkthroughs/                 ← detailed implementation records
    ├── 01-Phase1-Code-Tour.md
    ├── 02-Phase2-MultiInterest-Recommender.md
    ├── 03-Code-Summary-and-Test-Plan.md
    └── 04-Next-Steps-and-Phase-Plan.md

notebooks/                        ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb             (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb         (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb    (BQ vs PRM benchmark)

📚 Reading Order

If you're new to this project, read these in order:

1. Understand the Vision

01-Vision-Instagram-for-Research.md The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."

2. Understand the Technical Foundation

02-Recommendation-System-Blueprint.md The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."

3. Understand the Chosen Architecture

03-MultiInterest-Recommender-Architecture.md The definitive architecture RFC. It specifies EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, and MMR diversity. The approach is validated by production systems at Twitter, Pinterest, and Alibaba. This is the blueprint we implemented.
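
For readers new to EWMA temporal decay, here is a minimal sketch of how a profile vector could be updated each time a user saves a paper. The function name `ewma_update` and the plain-list vectors are illustrative assumptions, not the project's actual code; the α values in the usage note are the ones listed under Architecture Evolution.

```python
from typing import List, Optional


def ewma_update(
    profile: Optional[List[float]],
    embedding: List[float],
    alpha: float,
) -> List[float]:
    """Blend a new paper embedding into a profile vector.

    Applies new_profile = (1 - alpha) * old_profile + alpha * embedding.
    A small alpha forgets slowly (long-term interests); a large alpha
    tracks the most recent saves (short-term interests).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if profile is None:
        # First interaction: the profile starts as a copy of the embedding.
        return list(embedding)
    if len(profile) != len(embedding):
        raise ValueError("profile and embedding must have the same length")
    return [(1.0 - alpha) * p + alpha * x for p, x in zip(profile, embedding)]
```

With this sketch, a long-term profile would be updated with `alpha=0.03` and a short-term profile with `alpha=0.40`, so the same save moves the short-term vector much further than the long-term one.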

4. See the Architectural Evolution

05-Evolution-Of-Onboarding-And-Interests.md Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.

06-Deep-Research-Verdict.md ⭐ Latest Research The comprehensive verdict that resolves contradictions across all prior documents. Proposes a three-layer hybrid (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF → quota, α correction). The definitive architectural reference going forward.

5. See What Phase 1 Built

PHASE1-Zero-ML-Recommender.md What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.

01-Phase1-Code-Tour.md A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.

6. See What Phase 2 Built

02-Phase2-MultiInterest-Recommender.md What was just built: PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
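
As a rough illustration of the MMR diversity step mentioned above, the sketch below greedily picks papers that balance relevance against similarity to papers already chosen. The function name `mmr_rerank`, the dictionary inputs, and the cosine helper are hypothetical and not taken from the codebase; `lam=0.6` mirrors the λ listed under Phase 2c in Architecture Evolution.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(
    relevance: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
    k: int,
    lam: float = 0.6,
) -> List[str]:
    """Greedy Maximal Marginal Relevance selection.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to already selected items.
    """
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(pid: str) -> float:
            max_sim = max(
                (cosine(vectors[pid], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[pid] - (1.0 - lam) * max_sim

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a near-duplicate of an already selected paper is pushed down the list even when its raw relevance is high, which is the behavior the diversity step is meant to provide.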

7. Review Core Code & Automation

03-Code-Summary-and-Test-Plan.md Summarizes all backend modules and frontend files, and breaks down our ongoing three-layer testing strategy (Automated, Manual, and Analytic Evaluation).

8. What's Next: The Revised Phase Plan

04-Next-Steps-and-Phase-Plan.md ⭐ Start Here for Next Steps The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.

9. Phase 3 Plan (Current Focus)

PHASE3-Hybrid-Semantic-Search.md ⭐ Active Implementation Plan The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
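
To make the RRF fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of paper IDs, such as one from a dense retriever and one from a sparse retriever. The function name `rrf_fuse` and the sample IDs are hypothetical, and `k=60` is a commonly used default rather than a value taken from the plan.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several rankings with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only rank positions are used, so retrievers
    with incomparable score scales can still be combined.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense_hits = ["p1", "p2", "p3"]   # hypothetical dense retriever output
sparse_hits = ["p2", "p4", "p1"]  # hypothetical sparse retriever output
fused = rrf_fuse([dense_hits, sparse_hits])
```

Papers that appear in both lists (`p1` and `p2` here) rise above papers found by only one retriever, which is why RRF suits the case of one query answered by different retrievers.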

10. Data Preparation Notebooks

notebooks/README.md: Index + extracted schema details.

  • 01-bme-upload.ipynb: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
  • 02-bme-arxiv-test.ipynb: BGE-M3 encoding + search quality prototype
  • 03-check-search-bq-prm.ipynb: BQ vs PRM quantization benchmark

📄 Document Status

| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02 - Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | ✅ Implemented | The RFC we implemented; has 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | ✅ Complete | The definitive architectural reference; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | Master roadmap for Phases 3-9 |
| Phase 2 Hybrid Search Plan | 📋 Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| Phase 3 Hybrid Semantic Search | 📋 Active Plan | The current implementation guide for Phase 3 |
| Task Tracker | ✅ Active | Master checklist for all phases |

🏗️ Architecture Evolution

Phase 1 (completed)
  └── Qdrant BEST_SCORE with raw paper IDs
       ├── Works from 1 save
       └── No temporal awareness, no diversity

Phase 2a (completed)
  └── EWMA profile embeddings
       ├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
       └── Activates at 3+ saves

Phase 2b (completed)
  └── Ward clustering + Qdrant prefetch+RRF
       ├── Auto-detects K interests per user (1-7)
       ├── Single API call, server-side parallel ANN
       └── Activates at 5+ saves

Phase 2c (completed)
  └── Heuristic re-ranking + MMR diversity
       ├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
       ├── MMR diversity (λ=0.6) + exploration injection (2 papers)
       └── Upgrade path: swap heuristic for LightGBM at ≥500 interactions

Phase 3 (NEXT: hybrid semantic search)
  └── Replace arXiv keyword API with vector-based search
       ├── BGE-M3 query encoding (loaded at startup)
       ├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
       ├── RRF fusion (correct for search: same query, different retrievers)
       └── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

Phase 4 (planned: recommendation pipeline fixes)
  └── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
       pre-populate metadata store

Phase 5 (planned: cold-start onboarding)
  └── arXiv category multiselect + seed paper import + ORCID

Phase 6+ (future)
  └── LightGBM lambdarank, evaluation framework, LLM summaries,
       collaborative filtering, exploration
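
The Phase 2c scorer weights listed above can be read as a simple weighted sum. The sketch below shows that reading under two assumptions that this README does not confirm: each feature is already normalized to the range 0 to 1, and the names `CandidateFeatures` and `heuristic_score` are hypothetical rather than the codebase's own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFeatures:
    """Per-paper features, each assumed to be normalized to [0, 1]."""
    relevance: float  # similarity to the user's interest profile
    session: float    # match with the current session
    recency: float    # how recent the paper is
    rank: float       # position-derived score from retrieval
    negative: float   # similarity to the negative profile


# Weights as listed for Phase 2c; the negative feature is a penalty.
WEIGHTS = {
    "relevance": 0.40,
    "session": 0.25,
    "recency": 0.15,
    "rank": 0.10,
    "negative": -0.15,
}


def heuristic_score(features: CandidateFeatures) -> float:
    """Weighted sum of the five features."""
    return sum(w * getattr(features, name) for name, w in WEIGHTS.items())
```

Keeping the weights in one table makes the planned upgrade path easier to picture: a learned ranker such as LightGBM could later replace `heuristic_score` while consuming the same feature record.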