# ResearchIT Documentation

All project documentation, organized by purpose. Each document has a specific role in the project lifecycle.

---

## 📁 Folder Structure

```
docs/
├── README.md                     ← you are here
│
├── TASK-TRACKER.md               ← master checklist (all phases)
│
├── research/                     ← deep research & strategic thinking
│   ├── 01-Vision-Instagram-for-Research.md
│   ├── 02-Recommendation-System-Blueprint.md
│   ├── 03-MultiInterest-Recommender-Architecture.md
│   ├── 04-Technical-Roadmap-Legacy.md
│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
│   └── 06-Deep-Research-Verdict.md
│
├── phases/                       ← what we built & what we plan to build
│   ├── PHASE1-Zero-ML-Recommender.md
│   ├── PHASE2-Hybrid-Search-Plan.md     (prototype reference)
│   └── PHASE3-Hybrid-Semantic-Search.md (ACTIVE PHASE 3 PLAN)
│
├── walkthroughs/                 ← detailed implementation records
│   ├── 01-Phase1-Code-Tour.md
│   ├── 02-Phase2-MultiInterest-Recommender.md
│   ├── 03-Code-Summary-and-Test-Plan.md
│   └── 04-Next-Steps-and-Phase-Plan.md
│
notebooks/                        ← Kaggle reference notebooks (not in docs/)
├── README.md
├── 01-bme-upload.ipynb             (BGE-M3 encode + upload 1.6M papers)
├── 02-bme-arxiv-test.ipynb         (search quality + encoding tests)
└── 03-check-search-bq-prm.ipynb    (BQ vs PRM benchmark)
```

---

## 📚 Reading Order

If you're new to this project, read these in order:

### 1. Understand the Vision
**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**
The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."

### 2. Understand the Technical Foundation
**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**
The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."

### 3. Understand the Chosen Architecture
**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**
The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**
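
As a rough illustration of the EWMA temporal decay named above, the sketch below blends each newly saved paper into a profile vector. The function name `ewma_update`, the toy 2-D vectors, and the choice to seed the profile with the first paper are assumptions for illustration and are not taken from the RFC. The two α values are the long-term and short-term rates listed under Architecture Evolution.

```python
def ewma_update(profile, paper_vec, alpha):
    """Blend a new paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    if profile is None:
        # Assumption: the first saved paper seeds the profile unchanged.
        return list(paper_vec)
    return [(1.0 - alpha) * p + alpha * x for p, x in zip(profile, paper_vec)]


long_term = None
short_term = None
# Two saves on one topic, then one save on a different topic (toy vectors).
for vec in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    long_term = ewma_update(long_term, vec, alpha=0.03)    # slow-moving profile
    short_term = ewma_update(short_term, vec, alpha=0.40)  # fast-moving profile

print(long_term)   # barely shifts toward the new topic
print(short_term)  # shifts strongly toward the new topic
```

With a small α the profile changes slowly and reflects stable interests; with a large α it tracks the most recent saves.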

### 4. See the Architectural Evolution
**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**
Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.

**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** ⭐ *Latest Research*
The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.

### 5. See What Phase 1 Built
**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**
What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.

**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**
A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.

### 6. See What Phase 2 Built
**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**
What Phase 2 built: a PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
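
For readers unfamiliar with MMR, here is a minimal sketch of greedy MMR selection. It is not the project's implementation: the names `cosine` and `mmr`, the candidate tuples, and the toy relevance scores are all hypothetical. The default λ of 0.6 is the value listed under Architecture Evolution.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(candidates, k, lam=0.6):
    """Greedy MMR over (paper_id, relevance, vector) tuples; returns up to k ids."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            _, relevance, vec = item
            # Penalize similarity to the closest already-selected paper.
            max_sim = max((cosine(vec, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [paper_id for paper_id, _, _ in selected]


papers = [
    ("a", 0.90, [1.0, 0.0]),
    ("b", 0.85, [1.0, 0.0]),  # near-duplicate of "a"
    ("c", 0.60, [0.0, 1.0]),  # different topic
]
print(mmr(papers, k=2))  # → ['a', 'c']
```

The near-duplicate `"b"` loses to the less relevant but more distinct `"c"`, which is the diversity effect MMR is meant to provide.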

### 7. Review Core Code & Automation
**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**
Summarizes the backend modules and frontend files, and breaks down the three-layer testing strategy (Automated, Manual, and Analytic Evaluation).

### 8. What's Next: The Revised Phase Plan
**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** ⭐ *Start Here for Next Steps*
The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.

### 9. Phase 3 Plan (Current Focus)
**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** ⭐ *Active Implementation Plan*
The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
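
The RRF fusion step mentioned above can be sketched as follows. This is a generic reciprocal rank fusion over two ranked ID lists, not code from the plan: the function name `rrf_fuse`, the paper IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["p1", "p2", "p3"]   # hypothetical dense retriever results
sparse_hits = ["p3", "p1", "p4"]  # hypothetical sparse retriever results
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)  # → ['p1', 'p3', 'p2', 'p4']
```

Papers that appear in both lists rise to the top, which suits search because both retrievers answer the same query.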

### 10. Data Preparation Notebooks
**[notebooks/README.md](../notebooks/README.md)**: Index + extracted schema details.
- `01-bme-upload.ipynb`: How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
- `02-bme-arxiv-test.ipynb`: BGE-M3 encoding + search quality prototype
- `03-check-search-bq-prm.ipynb`: BQ vs PRM quantization benchmark

---

## 📄 Document Status

| Document | Status | Notes |
|---|---|---|
| 01 - Vision (Instagram for Research) | ✅ Complete | Strategic north star |
| 02 - Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
| 03 - Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented**, with 4 known faults identified in Doc 06 |
| 04 - Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
| 05 - Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
| 06 - Deep Research Verdict | ✅ Complete | **The definitive architectural reference**; resolves all contradictions |
| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
| Phase 2 Hybrid Search Plan | 📋 Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
| **Phase 3 Hybrid Semantic Search** | **📋 Active Plan** | **The current implementation guide for Phase 3** |
| Task Tracker | ✅ Active | Master checklist for all phases |

---

## 🏗️ Architecture Evolution

```
Phase 1 (completed)
  └── Qdrant BEST_SCORE with raw paper IDs
       ├── Works from 1 save
       └── No temporal awareness, no diversity

Phase 2a (completed)
  └── EWMA profile embeddings
       ├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
       └── Activates at 3+ saves

Phase 2b (completed)
  └── Ward clustering + Qdrant prefetch+RRF
       ├── Auto-detects K interests per user (1-7)
       ├── Single API call, server-side parallel ANN
       └── Activates at 5+ saves

Phase 2c (completed)
  └── Heuristic re-ranking + MMR diversity
       ├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
       ├── MMR diversity (λ=0.6) + exploration injection (2 papers)
       └── Upgrade path: swap heuristic for LightGBM at ≥500 interactions

Phase 3 (NEXT: hybrid semantic search)
  └── Replace arXiv keyword API with vector-based search
       ├── BGE-M3 query encoding (loaded at startup)
       ├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
       ├── RRF fusion (correct for search: same query, different retrievers)
       └── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

Phase 4 (planned: recommendation pipeline fixes)
  └── RRF → quota fusion, α_long 0.10 → 0.03, negative profile wiring,
       pre-populate metadata store

Phase 5 (planned: cold-start onboarding)
  └── arXiv category multiselect + seed paper import + ORCID

Phase 6+ (future)
  └── LightGBM lambdarank, evaluation framework, LLM summaries,
       collaborative filtering, exploration
```
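
To make the Phase 2c weights concrete, the sketch below combines the five listed features into one score as a weighted sum. Only the weights come from the tree above. The feature keys, the assumption that each feature is scaled to the range 0 to 1, and the function name `heuristic_score` are hypothetical and may differ from the actual scorer.

```python
# Weights from the Phase 2c entry; the negative-profile feature is a penalty.
WEIGHTS = {
    "relevance": 0.40,
    "session": 0.25,
    "recency": 0.15,
    "rank": 0.10,
    "negative": -0.15,
}


def heuristic_score(features):
    """Weighted sum of features; missing features count as 0.0 (an assumption)."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


# Hypothetical candidate with every feature already scaled to [0, 1].
candidate = {
    "relevance": 0.8,
    "session": 0.5,
    "recency": 1.0,
    "rank": 0.9,
    "negative": 0.2,
}
score = heuristic_score(candidate)
print(round(score, 3))  # → 0.655
```

A linear scorer like this needs no training data, which fits the stated upgrade path of swapping in LightGBM once enough interactions exist.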