# ResearchIT Documentation
All project documentation, organized by purpose. Each document has a specific role in the project lifecycle.
---
## Folder Structure
```
docs/
├── README.md                                    ← you are here
│
├── TASK-TRACKER.md                              ← master checklist (all phases)
│
├── research/                                    ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                                      ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md             (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md         (ACTIVE PHASE 3 PLAN)
│
└── walkthroughs/                                ← detailed implementation records
    ├── 01-Phase1-Code-Tour.md
    ├── 02-Phase2-MultiInterest-Recommender.md
    ├── 03-Code-Summary-and-Test-Plan.md
    └── 04-Next-Steps-and-Phase-Plan.md

notebooks/                                       ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb                          (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb                      (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb                 (BQ vs PRM benchmark)
```
---
## Reading Order
If you're new to this project, read these in order:
### 1. Understand the Vision
**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**
The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."
### 2. Understand the Technical Foundation
**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**
The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."
### 3. Understand the Chosen Architecture
**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**
The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**
### 4. See the Architectural Evolution
**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**
Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.
**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** (*Latest Research*)
The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.
### 5. See What Phase 1 Built
**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**
What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.
**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**
A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.
### 6. See What Phase 2 Built
**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**
What was just built: PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
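As a rough illustration of the MMR diversity step mentioned above, the sketch below shows greedy Maximal Marginal Relevance selection. It is not the project's code: the function names, the toy vectors, and the use of cosine similarity are assumptions made for this example, and `lam=0.6` simply mirrors the λ listed under Phase 2c in this README.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(relevance, vectors, n, lam=0.6):
    """Greedily pick n item indices, trading relevance against redundancy.

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items).
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < n:
        def score(i):
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidates: item 1 is a near-duplicate of item 0, item 2 is different.
relevance = [0.90, 0.85, 0.50]
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
picked = mmr_select(relevance, vectors, n=2)
print(picked)
```

In this toy input the near-duplicate is skipped in favor of the less relevant but dissimilar item, which is the behavior a diversity pass is meant to produce.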
### 7. Review Core Code & Automation
**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**
Summarizes the backend modules and frontend files, and breaks down the three-layer testing strategy (Automated, Manual, and Analytic Evaluation).
### 8. What's Next: The Revised Phase Plan
**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** (*Start Here for Next Steps*)
The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.
### 9. Phase 3 Plan (Current Focus)
**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** (*Active Implementation Plan*)
The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
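For readers unfamiliar with RRF fusion, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists, such as one from a dense retriever and one from a sparse retriever. The function name, the paper IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration; the Phase 3 plan itself is the authority on the actual parameters and on where fusion runs.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

dense_hits = ["p3", "p1", "p7"]   # hypothetical dense-retriever ranking
sparse_hits = ["p1", "p9", "p3"]  # hypothetical sparse-retriever ranking
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

RRF uses only rank positions, so the dense and sparse scores never need to be put on a common scale.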
### 10. Data Preparation Notebooks
**[notebooks/README.md](../notebooks/README.md)**: Index + extracted schema details.
- `01-bme-upload.ipynb`: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
- `02-bme-arxiv-test.ipynb`: BGE-M3 encoding + search quality prototype
- `03-check-search-bq-prm.ipynb`: BQ vs PRM quantization benchmark
---
## Document Status
| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02 - Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented**; has 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | ✅ Complete | **The definitive architectural reference**; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
| Phase 2 Hybrid Search Plan | Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| **Phase 3 Hybrid Semantic Search** | **Active Plan** | **The current implementation guide for Phase 3** |
| Task Tracker | ✅ Active | Master checklist for all phases |
---
## Architecture Evolution
```
Phase 1 (completed)
├── Qdrant BEST_SCORE with raw paper IDs
├── Works from 1 save
└── No temporal awareness, no diversity
Phase 2a (completed)
├── EWMA profile embeddings
├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
└── Activates at 3+ saves
Phase 2b (completed)
├── Ward clustering + Qdrant prefetch+RRF
├── Auto-detects K interests per user (1-7)
├── Single API call, server-side parallel ANN
└── Activates at 5+ saves
Phase 2c (completed)
├── Heuristic re-ranking + MMR diversity
├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
├── MMR diversity (λ=0.6) + exploration injection (2 papers)
└── Upgrade path: swap heuristic for LightGBM at ≥500 interactions
Phase 3 (NEXT: hybrid semantic search)
├── Replace arXiv keyword API with vector-based search
├── BGE-M3 query encoding (loaded at startup)
├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
├── RRF fusion (correct for search: same query, different retrievers)
└── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)
Phase 4 (planned: recommendation pipeline fixes)
└── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
    pre-populate metadata store
Phase 5 (planned: cold-start onboarding)
└── arXiv category multiselect + seed paper import + ORCID
Phase 6+ (future)
└── LightGBM lambdarank, evaluation framework, LLM summaries,
    collaborative filtering, exploration
```
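To make the Phase 2a and Phase 2c numbers above more concrete, the following sketch shows an EWMA profile update and a weighted five-feature score. It is an illustration under stated assumptions, not the project's implementation: the function names and dictionary keys are hypothetical, the α values are those listed under Phase 2a, the weights are those listed under Phase 2c, and every feature is assumed to be pre-normalized to the range 0 to 1.

```python
# Alphas as listed under Phase 2a; a larger alpha weights recent papers more heavily.
ALPHAS = {"long_term": 0.03, "short_term": 0.40, "negative": 0.15}

# Weights as listed under Phase 2c; the negative-profile feature is a penalty.
WEIGHTS = {"relevance": 0.40, "session": 0.25, "recency": 0.15, "rank": 0.10, "negative": -0.15}

def ewma_update(profile, embedding, alpha):
    """Blend a new paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    if profile is None:
        # First interaction: the profile starts as the embedding itself.
        return list(embedding)
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(profile, embedding)]

def heuristic_score(features):
    """Weighted sum of the five features; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())

# Hypothetical two-dimensional embeddings, for readability only.
short_term = ewma_update([1.0, 0.0], [0.0, 1.0], ALPHAS["short_term"])
long_term = ewma_update([1.0, 0.0], [0.0, 1.0], ALPHAS["long_term"])
print(short_term, long_term)

best_case = heuristic_score({"relevance": 1.0, "session": 1.0, "recency": 1.0, "rank": 1.0})
print(round(best_case, 2))
```

With the same new paper, the short-term profile moves 40% of the way toward it while the long-term profile moves only 3%, which is what lets the two profiles capture different time scales.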