# ResearchIT Documentation
All project documentation organized by purpose. Each document has a specific role in the project lifecycle.
---
## 📁 Folder Structure
```
docs/
├── README.md                        ← you are here
│
├── TASK-TRACKER.md                  ← master checklist (all phases)
│
├── research/                        ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                          ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md        (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md    (ACTIVE PHASE 3 PLAN)
│
├── walkthroughs/                    ← detailed implementation records
│   ├── 01-Phase1-Code-Tour.md
│   ├── 02-Phase2-MultiInterest-Recommender.md
│   ├── 03-Code-Summary-and-Test-Plan.md
│   └── 04-Next-Steps-and-Phase-Plan.md
│
notebooks/                           ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb              (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb          (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb     (BQ vs PRM benchmark)
```
---
## 📚 Reading Order
If you're new to this project, read these in order:
### 1. Understand the Vision
**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**
The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."
### 2. Understand the Technical Foundation
**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**
The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."
### 3. Understand the Chosen Architecture
**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**
The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**
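As a rough illustration of the EWMA temporal decay named above, the sketch below blends each newly saved paper's embedding into a running profile vector. The helper name `ewma_update`, the toy 3-dimensional vectors, and the re-normalization step are assumptions made for this example, not the project's actual code; the two α values are the long-term and short-term rates listed under Architecture Evolution.

```python
import math

def ewma_update(profile, embedding, alpha):
    """Blend a new embedding into a running profile (hypothetical helper).

    A larger alpha lets recent saves dominate; a smaller alpha keeps history.
    """
    if profile is None:
        # First interaction: the profile starts as the embedding itself.
        blended = list(embedding)
    else:
        blended = [(1.0 - alpha) * p + alpha * e
                   for p, e in zip(profile, embedding)]
    # Re-normalize to unit length (an assumption of this sketch, so that
    # cosine comparisons against paper vectors stay meaningful).
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended] if norm > 0 else blended

# Toy data: two saves on one topic, then one save on a different topic.
saves = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

long_term = None
short_term = None
for vec in saves:
    long_term = ewma_update(long_term, vec, alpha=0.03)
    short_term = ewma_update(short_term, vec, alpha=0.40)

# The short-term profile has moved much further toward the newest topic.
print(long_term, short_term)
```

With these toy inputs the short-term profile shifts noticeably toward the third save while the long-term profile barely moves, which is the behavior the two rates are meant to separate.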
### 4. See the Architectural Evolution
**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**
Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.
**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** ⭐ *Latest Research*
The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.
### 5. See What Phase 1 Built
**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**
What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.
**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**
A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.
### 6. See What Phase 2 Built
**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**
What was just built: PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
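To make the MMR diversity step more concrete, here is a minimal sketch of greedy Maximal Marginal Relevance selection using λ=0.6, the value listed under Architecture Evolution. The function names, the `(paper_id, relevance, vector)` tuple layout, and the toy candidates are all hypothetical; the real engine may compute relevance and similarity differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, k, lam=0.6):
    """Greedy MMR over (paper_id, relevance, vector) tuples (assumed layout).

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * max similarity to already-picked items.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(item):
            _, relevance, vec = item
            max_sim = max((cosine(vec, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [paper_id for paper_id, _, _ in selected]

# Toy candidates: "b" is a near-duplicate of "a"; "c" covers another topic.
candidates = [
    ("a", 0.90, [1.0, 0.0]),
    ("b", 0.85, [0.99, 0.1]),
    ("c", 0.70, [0.0, 1.0]),
]
picked = mmr_select(candidates, k=2)
print(picked)
```

In this toy case the less relevant but dissimilar paper "c" is chosen over the near-duplicate "b", which is the trade-off λ controls.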
### 7. Review Core Code & Automation
**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**
Summarizes the backend modules and frontend files, and breaks down the three-layer testing strategy (Automated, Manual, and Analytic Evaluation).
### 8. What's Next: The Revised Phase Plan
**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** ⭐ *Start Here for Next Steps*
The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.
### 9. Phase 3 Plan (Current Focus)
**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** ⭐ *Active Implementation Plan*
The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
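For readers unfamiliar with RRF fusion, the sketch below shows the general technique: each retriever contributes `1 / (k + rank)` for every document it returns, and the sums decide the fused order. The function name `rrf_fuse`, the paper IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration; the plan document is the authority on what the project actually uses.

```python
def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Reciprocal Rank Fusion over several ranked lists of document IDs.

    ranked_lists: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is an assumed, commonly used default).
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

# Hypothetical results for the same query from two retrievers.
dense_hits = ["p1", "p2", "p3"]    # e.g. dense vector search
sparse_hits = ["p3", "p1", "p4"]   # e.g. sparse vector search

fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score calibration between the dense and sparse retrievers, which is one reason it suits fusing results for a single query.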
### 10. Data Preparation Notebooks
**[notebooks/README.md](../notebooks/README.md)**: Index + extracted schema details.
- `01-bme-upload.ipynb`: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
- `02-bme-arxiv-test.ipynb`: BGE-M3 encoding + search quality prototype
- `03-check-search-bq-prm.ipynb`: BQ vs PRM quantization benchmark
---
## 📄 Document Status
| Document | Status | Notes |
|---|---|---|
| 01: Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02: Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03: Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented**; has 4 known faults identified in Doc 06 |
| 04: Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05: Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06: Deep Research Verdict | ✅ Complete | **The definitive architectural reference**; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
| Phase 2 Hybrid Search Plan | 📋 Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| **Phase 3 Hybrid Semantic Search** | **📋 Active Plan** | **The current implementation guide for Phase 3** |
| Task Tracker | ✅ Active | Master checklist for all phases |
---
## 🏗️ Architecture Evolution
```
Phase 1 (completed)
└── Qdrant BEST_SCORE with raw paper IDs
    ├── Works from 1 save
    └── No temporal awareness, no diversity
Phase 2a (completed)
└── EWMA profile embeddings
    ├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
    └── Activates at 3+ saves
Phase 2b (completed)
└── Ward clustering + Qdrant prefetch+RRF
    ├── Auto-detects K interests per user (1-7)
    ├── Single API call, server-side parallel ANN
    └── Activates at 5+ saves
Phase 2c (completed)
└── Heuristic re-ranking + MMR diversity
    ├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
    ├── MMR diversity (λ=0.6) + exploration injection (2 papers)
    └── Upgrade path: swap heuristic for LightGBM at ≥500 interactions
Phase 3 (NEXT: hybrid semantic search)
└── Replace arXiv keyword API with vector-based search
    ├── BGE-M3 query encoding (loaded at startup)
    ├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
    ├── RRF fusion (correct for search: same query, different retrievers)
    └── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)
Phase 4 (planned: recommendation pipeline fixes)
└── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
    pre-populate metadata store
Phase 5 (planned: cold-start onboarding)
└── arXiv category multiselect + seed paper import + ORCID
Phase 6+ (future)
└── LightGBM lambdarank, evaluation framework, LLM summaries,
    collaborative filtering, exploration
```
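The Phase 2c scorer weights listed above can be read as a simple weighted sum. The sketch below is one possible interpretation, assuming each feature is already scaled to the range 0 to 1; the feature keys, the `heuristic_score` name, and the sample values are hypothetical and not taken from the codebase.

```python
# Weights as listed for the Phase 2c 5-feature scorer.
WEIGHTS = {
    "relevance": 0.40,
    "session": 0.25,
    "recency": 0.15,
    "rank": 0.10,
    "negative": -0.15,  # similarity to the negative profile is a penalty
}

def heuristic_score(features):
    """Weighted sum of features, each assumed to lie in [0, 1].

    Missing features count as 0.0 (an assumption of this sketch).
    """
    return sum(weight * features.get(name, 0.0)
               for name, weight in WEIGHTS.items())

# Hypothetical candidate paper.
candidate = {
    "relevance": 0.8,
    "session": 0.5,
    "recency": 1.0,
    "rank": 0.5,
    "negative": 0.2,
}
score = heuristic_score(candidate)
print(round(score, 3))
```

Keeping the weights in one dictionary makes the planned upgrade path easier to picture: a learned ranker such as LightGBM would replace `heuristic_score` while consuming the same feature dictionary.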