# ResearchIT Documentation

All project documentation organized by purpose. Each document has a specific role in the project lifecycle.

---

## 📁 Folder Structure

```
docs/
├── README.md                          ← you are here
│
├── TASK-TRACKER.md                    ← master checklist (all phases)
│
├── research/                          ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                            ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md         (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md     (ACTIVE PHASE 3 PLAN)
│
├── walkthroughs/                      ← detailed implementation records
│   ├── 01-Phase1-Code-Tour.md
│   ├── 02-Phase2-MultiInterest-Recommender.md
│   ├── 03-Code-Summary-and-Test-Plan.md
│   └── 04-Next-Steps-and-Phase-Plan.md
│
notebooks/                             ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb                (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb            (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb       (BQ vs PRM benchmark)
```

---

## 📚 Reading Order

If you're new to this project, read these in order:

### 1. Understand the Vision

**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**

The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."

### 2. Understand the Technical Foundation

**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**

The initial deep research on recommendation architectures.
Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."

### 3. Understand the Chosen Architecture

**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**

The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**

### 4. See the Architectural Evolution

**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**

Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.

**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** ⭐ *Latest Research*

The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.

### 5. See What Phase 1 Built

**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**

What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.

**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**

A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.

### 6. See What Phase 2 Built

**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**

What was just built: PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity.
88 tests passing.

### 7. Review Core Code & Automation

**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**

Summarizes the backend modules and frontend files, and breaks down the three-layer testing strategy (automated, manual, and analytic evaluation).

### 8. What's Next: The Revised Phase Plan

**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** ⭐ *Start Here for Next Steps*

The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.

### 9. Phase 3 Plan (Current Focus)

**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** ⭐ *Active Implementation Plan*

The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.

### 10. Data Preparation Notebooks

**[notebooks/README.md](../notebooks/README.md)**: Index + extracted schema details.

- `01-bme-upload.ipynb`: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
- `02-bme-arxiv-test.ipynb`: BGE-M3 encoding + search quality prototype
- `03-check-search-bq-prm.ipynb`: BQ vs PRM quantization benchmark

---

## 📄 Document Status

| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02 - Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented**; has 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | ✅ Complete | **The definitive architectural reference**; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
| Phase 2 Hybrid Search Plan | 📋 Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| **Phase 3 Hybrid Semantic Search** | **📋 Active Plan** | **The current implementation guide for Phase 3** |
| Task Tracker | ✅ Active | Master checklist for all phases |

---

## 🏗️ Architecture Evolution

```
Phase 1 (completed)
└── Qdrant BEST_SCORE with raw paper IDs
    ├── Works from 1 save
    └── No temporal awareness, no diversity

Phase 2a (completed)
└── EWMA profile embeddings
    ├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
    └── Activates at 3+ saves

Phase 2b (completed)
└── Ward clustering + Qdrant prefetch+RRF
    ├── Auto-detects K interests per user (1-7)
    ├── Single API call, server-side parallel ANN
    └── Activates at 5+ saves

Phase 2c (completed)
└── Heuristic re-ranking + MMR diversity
    ├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
    ├── MMR diversity (λ=0.6) + exploration injection (2 papers)
    └── Upgrade path: swap heuristic for LightGBM at ≥500 interactions

Phase 3 (NEXT: hybrid semantic search)
└── Replace arXiv keyword API with vector-based search
    ├── BGE-M3 query encoding (loaded at startup)
    ├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
    ├── RRF fusion (correct for search: same query, different retrievers)
    └── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

Phase 4 (planned: recommendation pipeline fixes)
└── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring, pre-populate metadata store

Phase 5 (planned: cold-start onboarding)
└── arXiv category multiselect + seed paper import + ORCID

Phase 6+ (future)
└── LightGBM lambdarank, evaluation framework, LLM summaries, collaborative filtering, exploration
```
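
To make the Phase 3 fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists for the same query. It is an illustration only, not the project's implementation: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the sample paper IDs are all assumptions, and the real pipeline would take its inputs from the Qdrant and Zilliz responses.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project value.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical results for one query from the two retrievers.
dense_hits = ["2401.00001", "2401.00002", "2401.00003"]   # e.g. dense (Qdrant)
sparse_hits = ["2401.00002", "2401.00004", "2401.00001"]  # e.g. sparse (Zilliz)

fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Because RRF only uses rank positions, it needs no score normalization between the dense and sparse retrievers, which is why it suits the "same query, different retrievers" case noted above.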