# ResearchIT Documentation

All project documentation organized by purpose. Each document has a specific role in the project lifecycle.

---
## Folder Structure

```
docs/
├── README.md                          ← you are here
│
├── TASK-TRACKER.md                    ← master checklist (all phases)
│
├── research/                          ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                            ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md        (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md    (ACTIVE PHASE 3 PLAN)
│
└── walkthroughs/                      ← detailed implementation records
    ├── 01-Phase1-Code-Tour.md
    ├── 02-Phase2-MultiInterest-Recommender.md
    ├── 03-Code-Summary-and-Test-Plan.md
    └── 04-Next-Steps-and-Phase-Plan.md

notebooks/                             ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb                (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb            (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb       (BQ vs PRM benchmark)
```
---

## Reading Order

If you're new to this project, read these in order:

### 1. Understand the Vision

**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**
The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."

### 2. Understand the Technical Foundation

**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**
The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."

### 3. Understand the Chosen Architecture

**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**
The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**
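
For readers new to EWMA profiles, the update behind the temporal decay can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the function name `ewma_update`, the toy 3-dimensional vectors, and feeding the same save event to both profiles are all hypothetical. The α values are the ones listed for Phase 2a under Architecture Evolution.

```python
from typing import Dict, List, Sequence

# Smoothing factors as listed for Phase 2a under "Architecture Evolution".
ALPHAS: Dict[str, float] = {"long_term": 0.03, "short_term": 0.40, "negative": 0.15}


def ewma_update(profile: Sequence[float], embedding: Sequence[float], alpha: float) -> List[float]:
    """Blend a new embedding into a profile: (1 - alpha) * profile + alpha * embedding."""
    if len(profile) != len(embedding):
        raise ValueError("profile and embedding must have the same dimension")
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(profile, embedding)]


# Hypothetical 3-dimensional vectors; real dense embeddings are much larger.
profile = [1.0, 0.0, 0.0]
saved_paper = [0.0, 1.0, 0.0]

# Assumption: the same save event updates both profiles, each with its own alpha.
long_term = ewma_update(profile, saved_paper, ALPHAS["long_term"])
short_term = ewma_update(profile, saved_paper, ALPHAS["short_term"])
```

With these values, a single save moves the short-term profile 40% of the way toward the new paper and the long-term profile only 3%, which is how the two profiles can track different time scales.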
### 4. See the Architectural Evolution

**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**
Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.

**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** - *Latest Research*
The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.

### 5. See What Phase 1 Built

**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**
What was built first: a zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.

**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**
A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.

### 6. See What Phase 2 Built

**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**
What was just built: a PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
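
The heuristic re-ranking and MMR diversity steps can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the implementation described in the walkthrough: the weights and λ=0.6 are the values listed for Phase 2c under Architecture Evolution, while the feature names, the assumption that every feature is normalized to [0, 1], the cosine similarity measure, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Weights as listed for Phase 2c; the feature names themselves are hypothetical.
WEIGHTS: Dict[str, float] = {
    "relevance": 0.40,
    "session": 0.25,
    "recency": 0.15,
    "rank": 0.10,
    "negative": -0.15,
}


def heuristic_score(features: Dict[str, float]) -> float:
    """Weighted sum of features, each assumed to be pre-normalized to [0, 1]."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(scores: Dict[str, float], vectors: Dict[str, Sequence[float]],
               top_n: int, lam: float = 0.6) -> List[str]:
    """Greedy MMR: maximize lam * score - (1 - lam) * max similarity to items already picked."""
    selected: List[str] = []
    remaining = list(scores)
    while remaining and len(selected) < top_n:
        def mmr_value(candidate: str) -> float:
            redundancy = max(
                (cosine(vectors[candidate], vectors[s]) for s in selected), default=0.0
            )
            return lam * scores[candidate] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy candidates: "a" and "b" are near-duplicates, "c" covers a different topic.
features = {
    "a": {"relevance": 0.90, "session": 0.8, "recency": 0.5, "rank": 1.0, "negative": 0.0},
    "b": {"relevance": 0.85, "session": 0.8, "recency": 0.5, "rank": 0.9, "negative": 0.0},
    "c": {"relevance": 0.60, "session": 0.4, "recency": 0.9, "rank": 0.5, "negative": 0.1},
}
vectors = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}

scores = {paper: heuristic_score(f) for paper, f in features.items()}
by_score = sorted(scores, key=scores.get, reverse=True)[:2]
picked = mmr_select(scores, vectors, top_n=2)
```

In this toy case, ranking by score alone returns the two near-duplicates `a` and `b`, whereas MMR keeps `a` and swaps `b` for the lower-scored but dissimilar `c`.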
### 7. Review Core Code & Automation

**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**
Summarizes all backend modules and frontend files, and breaks down the three-layer testing strategy (Automated, Manual, and Analytic Evaluation).

### 8. What's Next: The Revised Phase Plan

**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** - *Start Here for Next Steps*
The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.

### 9. Phase 3 Plan (Current Focus)

**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** - *Active Implementation Plan*
The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
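
As a quick orientation on the RRF fusion step, Reciprocal Rank Fusion merges ranked lists by summing `1 / (k + rank)` for each document across the lists. The sketch below is illustrative only: the function name `rrf_fuse`, the sample paper IDs, and `k = 60` (the constant conventionally used with RRF) are assumptions, not values taken from the Phase 3 plan.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for one query, best match first.
dense_hits = ["2101.00001", "2101.00002", "2101.00003"]
sparse_hits = ["2101.00002", "2101.00004", "2101.00001"]

fused = rrf_fuse([dense_hits, sparse_hits])
```

Papers returned by both retrievers rise to the top because their reciprocal ranks add up, while papers found by only one retriever keep a single, smaller contribution. RRF uses ranks only, so the dense and sparse scores never need to be put on a common scale.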
### 10. Data Preparation Notebooks

**[notebooks/README.md](../notebooks/README.md)**: Index + extracted schema details.

- `01-bme-upload.ipynb`: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
- `02-bme-arxiv-test.ipynb`: BGE-M3 encoding + search quality prototype
- `03-check-search-bq-prm.ipynb`: BQ vs PRM quantization benchmark

---
## Document Status

| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02 - Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented**; has 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | ✅ Complete | **The definitive architectural reference**; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
| Phase 2 Hybrid Search Plan | Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| **Phase 3 Hybrid Semantic Search** | **Active Plan** | **The current implementation guide for Phase 3** |
| Task Tracker | ✅ Active | Master checklist for all phases |

---
## Architecture Evolution

```
Phase 1 (completed)
├── Qdrant BEST_SCORE with raw paper IDs
├── Works from 1 save
└── No temporal awareness, no diversity

Phase 2a (completed)
├── EWMA profile embeddings
├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
└── Activates at 3+ saves

Phase 2b (completed)
├── Ward clustering + Qdrant prefetch+RRF
├── Auto-detects K interests per user (1-7)
├── Single API call, server-side parallel ANN
└── Activates at 5+ saves

Phase 2c (completed)
├── Heuristic re-ranking + MMR diversity
├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
├── MMR diversity (λ=0.6) + exploration injection (2 papers)
└── Upgrade path: swap heuristic for LightGBM at ≥500 interactions

Phase 3 (NEXT: hybrid semantic search)
├── Replace arXiv keyword API with vector-based search
├── BGE-M3 query encoding (loaded at startup)
├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
├── RRF fusion (correct for search: same query, different retrievers)
└── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

Phase 4 (planned: recommendation pipeline fixes)
└── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
    pre-populate metadata store

Phase 5 (planned: cold-start onboarding)
└── arXiv category multiselect + seed paper import + ORCID

Phase 6+ (future)
└── LightGBM lambdarank, evaluation framework, LLM summaries,
    collaborative filtering, exploration
```