ResearchIT Documentation
All project documentation is organized by purpose; each document has a specific role in the project lifecycle.
Folder Structure
docs/
├── README.md ← you are here
│
├── TASK-TRACKER.md ← master checklist (all phases)
│
├── research/ ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/ ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md (ACTIVE PHASE 3 PLAN)
│
├── walkthroughs/ ← detailed implementation records
│   ├── 01-Phase1-Code-Tour.md
│   ├── 02-Phase2-MultiInterest-Recommender.md
│   ├── 03-Code-Summary-and-Test-Plan.md
│   └── 04-Next-Steps-and-Phase-Plan.md
│
notebooks/ ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb (BQ vs PRM benchmark)
Reading Order
If you're new to this project, read these in order:
1. Understand the Vision
01-Vision-Instagram-for-Research.md: The strategic blueprint. Covers the competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."
2. Understand the Technical Foundation
02-Recommendation-System-Blueprint.md: The initial deep research on recommendation architectures. Covers user modeling, content-based vs. collaborative filtering, cold-start strategies, and evaluation metrics. This is "how recommendation systems work in general."
3. Understand the Chosen Architecture
03-MultiInterest-Recommender-Architecture.md: The definitive architecture RFC. It combines EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, and MMR diversity, an approach validated by production systems at Twitter, Pinterest, and Alibaba. This is the blueprint we implemented.
4. See the Architectural Evolution
05-Evolution-Of-Onboarding-And-Interests.md: Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.
06-Deep-Research-Verdict.md (Latest Research): The comprehensive verdict that resolves contradictions across all prior documents. Proposes a three-layer hybrid (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF → quota, α correction). The definitive architectural reference going forward.
5. See What Phase 1 Built
PHASE1-Zero-ML-Recommender.md: What was built first, a zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.
01-Phase1-Code-Tour.md: A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.
6. See What Phase 2 Built
02-Phase2-MultiInterest-Recommender.md: The most recent build, a PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
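As a rough illustration of the EWMA profiles mentioned above, here is a minimal sketch of a profile update. The α values are the ones listed under Phase 2a in Architecture Evolution; the function name `ewma_update`, the toy 2-d vectors, and the standard update form `(1 - α) * old + α * new` are assumptions for illustration, not the project's actual code.

```python
from typing import List

# Decay rates as listed under Phase 2a in this README.
ALPHA_LONG = 0.03      # long-term profile: moves slowly
ALPHA_SHORT = 0.40     # short-term profile: tracks recent saves
ALPHA_NEGATIVE = 0.15  # negative profile: built from disliked papers


def ewma_update(profile: List[float], embedding: List[float], alpha: float) -> List[float]:
    """Blend a new paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    if len(profile) != len(embedding):
        raise ValueError("profile and embedding must have the same dimension")
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(profile, embedding)]


# Toy 2-d embeddings (hypothetical); real ones would come from the vector store.
long_term = [1.0, 0.0]
short_term = [1.0, 0.0]
new_save = [0.0, 1.0]

long_term = ewma_update(long_term, new_save, ALPHA_LONG)
short_term = ewma_update(short_term, new_save, ALPHA_SHORT)
```

With a single new save, the short-term profile shifts much further toward the new paper than the long-term profile does, which is the intended effect of using two decay rates.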
7. Review Core Code & Automation
03-Code-Summary-and-Test-Plan.md: Summarizes the backend modules and frontend files, and breaks down our three-layer ongoing testing strategy (Automated, Manual, and Analytic Evaluation).
8. What's Next: The Revised Phase Plan
04-Next-Steps-and-Phase-Plan.md (Start Here for Next Steps): The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.
9. Phase 3 Plan (Current Focus)
PHASE3-Hybrid-Semantic-Search.md (Active Implementation Plan): The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
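The RFC fusion step named in the Phase 3 plan can be sketched as Reciprocal Rank Fusion over two ranked lists. This is a minimal illustration only: the function name `rrf_fuse`, the paper ids, and the constant `k = 60` (a commonly used default for RRF) are assumptions, not values taken from the plan.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids from the two retrievers for the same query.
dense_hits = ["p1", "p2", "p3"]   # e.g. dense vector search
sparse_hits = ["p3", "p1", "p4"]  # e.g. sparse vector search

fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Papers returned by both retrievers (`p1`, `p3`) rise above those returned by only one, without needing the two retrievers' raw scores to be on a comparable scale.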
10. Data Preparation Notebooks
notebooks/README.md: Index + extracted schema details.
01-bme-upload.ipynb: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
02-bme-arxiv-test.ipynb: BGE-M3 encoding + search quality prototype
03-check-search-bq-prm.ipynb: BQ vs PRM quantization benchmark
Document Status
| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | Complete | Strategic north star |
| 02 - Recommendation Blueprint | Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | Implemented | The RFC we implemented; has 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | Complete | The definitive architectural reference; resolves all contradictions |
| Phase 1 Walkthrough | Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | Complete | Multi-interest engine |
| Codebase Summary & Test Plan | Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | Complete | Master roadmap for Phases 3-9 |
| Phase 2 Hybrid Search Plan | Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| Phase 3 Hybrid Semantic Search | Active Plan | The current implementation guide for Phase 3 |
| Task Tracker | Active | Master checklist for all phases |
Architecture Evolution
Phase 1 (completed)
├── Qdrant BEST_SCORE with raw paper IDs
├── Works from 1 save
└── No temporal awareness, no diversity
Phase 2a (completed)
├── EWMA profile embeddings
├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
└── Activates at 3+ saves
Phase 2b (completed)
├── Ward clustering + Qdrant prefetch+RRF
├── Auto-detects K interests per user (1-7)
├── Single API call, server-side parallel ANN
└── Activates at 5+ saves
Phase 2c (completed)
├── Heuristic re-ranking + MMR diversity
├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
├── MMR diversity (λ=0.6) + exploration injection (2 papers)
└── Upgrade path: swap heuristic for LightGBM at ≥500 interactions
Phase 3 (NEXT: hybrid semantic search)
├── Replace arXiv keyword API with vector-based search
├── BGE-M3 query encoding (loaded at startup)
├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
├── RRF fusion (correct for search: same query, different retrievers)
└── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)
Phase 4 (planned: recommendation pipeline fixes)
└── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
    pre-populate metadata store
Phase 5 (planned: cold-start onboarding)
└── arXiv category multiselect + seed paper import + ORCID
Phase 6+ (future)
└── LightGBM lambdarank, evaluation framework, LLM summaries,
    collaborative filtering, exploration
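
To make the Phase 2c stage more concrete, the sketch below combines a weighted 5-feature score with greedy MMR selection. The weights and λ=0.6 are the ones listed under Phase 2c; everything else is an assumption for illustration: the function names, the feature keys, the expectation that features are pre-normalized to [0, 1], and the use of cosine similarity between paper embeddings. Exploration injection is not shown.

```python
import math
from typing import Dict, List, Sequence

# Weights as listed under Phase 2c; the feature key names are hypothetical.
WEIGHTS = {"relevance": 0.40, "session": 0.25, "recency": 0.15, "rank": 0.10, "negative": -0.15}
MMR_LAMBDA = 0.6


def heuristic_score(features: Dict[str, float]) -> float:
    """Weighted sum of the five features (assumed normalized to [0, 1])."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(vectors: Dict[str, Sequence[float]], scores: Dict[str, float],
               top_n: int, lam: float = MMR_LAMBDA) -> List[str]:
    """Greedy MMR: pick the item maximizing lam * score - (1 - lam) * max similarity to picks."""
    selected: List[str] = []
    remaining = set(vectors)
    while remaining and len(selected) < top_n:
        def mmr_value(pid: str) -> float:
            max_sim = max((cosine(vectors[pid], vectors[s]) for s in selected), default=0.0)
            return lam * scores[pid] - (1.0 - lam) * max_sim
        best = max(sorted(remaining), key=mmr_value)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): "b" is a near-duplicate of "a", "c" is a different topic.
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
scores = {"a": 0.90, "b": 0.85, "c": 0.50}
picked = mmr_select(vectors, scores, top_n=2)
print(picked)
```

In this toy case the second pick is `c` rather than the higher-scored `b`, because `b` is penalized for being identical to the already selected `a`.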