Spaces: siddhm11/ResearchIT

siddhm11 committed on Apr 19

Commit

d5a6f3e

0 Parent(s):

Phase 3 complete: Hybrid Semantic Search pipeline

- BGE-M3 encoding (dense + sparse) via embed_svc.py
- Qdrant dense search + Zilliz sparse search (parallel)
- Groq LLM query rewriter with academic heuristic bypass
- RRF fusion (K=60) + recency reranking (0.80/0.20)
- Graceful degradation: each service can fail independently
- arXiv API fallback when hybrid search returns nothing
- Dockerfile for HF Spaces deployment (Docker SDK, port 7860)
- 123 tests passing (88 original + 35 new, zero regressions)
- All secrets via env vars only (.env.example provided)
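
A minimal sketch of the fusion step listed above, assuming the standard reciprocal rank fusion form (each retriever contributes 1/(K + rank) per document) and assuming that "0.80/0.20" means a weighted sum of normalised relevance and a recency score. The function names, the linear 10-year recency ramp, and the sample IDs are illustrative assumptions and are not taken from `app/hybrid_search_svc.py`.

```python
from collections import defaultdict

RRF_K = 60          # constant named in the commit message
W_RELEVANCE = 0.80  # assumed meaning of "0.80/0.20"
W_RECENCY = 0.20


def rrf_fuse(ranked_lists: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranked list a document appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def recency_rerank(
    fused: dict[str, float],
    years: dict[str, int],
    current_year: int,
    horizon_years: int = 10,  # hypothetical: older papers get recency 0
) -> list[str]:
    """Blend normalised RRF score with a linear recency score, best first."""
    if not fused:
        return []
    top = max(fused.values())

    def final(doc_id: str) -> float:
        # Unknown publication year is treated as "old" (recency 0).
        year = years.get(doc_id, current_year - horizon_years)
        age = max(current_year - year, 0)
        recency = max(0.0, 1.0 - age / horizon_years)
        return W_RELEVANCE * fused[doc_id] / top + W_RECENCY * recency

    return sorted(fused, key=final, reverse=True)


dense_hits = ["a", "b", "c"]   # e.g. dense results, best first
sparse_hits = ["b", "d", "a"]  # e.g. sparse results, best first
fused = rrf_fuse([dense_hits, sparse_hits])
ranked = recency_rerank(
    fused, {"a": 2024, "b": 2010, "c": 2020, "d": 2023}, current_year=2025
)
```

In this toy input, "b" has the highest fused score because both retrievers rank it near the top, but the recency term moves the much newer "a" ahead of it in the final ordering.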

Files changed (50)

.dockerignore +13 -0
.env.example +6 -0
.gitignore +0 -0
CLAUDE.md +426 -0
Dockerfile +28 -0
app/__init__.py +0 -0
app/arxiv_svc.py +158 -0
app/config.py +54 -0
app/db.py +275 -0
app/embed_svc.py +106 -0
app/groq_svc.py +154 -0
app/hybrid_search_svc.py +200 -0
app/main.py +63 -0
app/qdrant_svc.py +381 -0
app/recommend/__init__.py +4 -0
app/recommend/clustering.py +205 -0
app/recommend/diversity.py +143 -0
app/recommend/profiles.py +140 -0
app/recommend/reranker.py +180 -0
app/routers/__init__.py +0 -0
app/routers/events.py +104 -0
app/routers/recommendations.py +231 -0
app/routers/saved.py +46 -0
app/routers/search.py +82 -0
app/templates/base.html +42 -0
app/templates/index.html +50 -0
app/templates/partials/action_buttons.html +45 -0
app/templates/partials/empty_recs.html +16 -0
app/templates/partials/paper_card.html +45 -0
app/templates/partials/recommendations.html +23 -0
app/templates/partials/search_results.html +15 -0
app/templates/saved.html +38 -0
app/templates/search.html +47 -0
app/templates_env.py +22 -0
app/user_state.py +111 -0
app/zilliz_svc.py +132 -0
docs/README.md +160 -0
docs/TASK-TRACKER.md +369 -0
docs/phases/PHASE1-Zero-ML-Recommender.md +439 -0
docs/phases/PHASE2-Hybrid-Search-Plan.md +483 -0
docs/phases/PHASE3-Hybrid-Semantic-Search.md +658 -0
docs/research/01-Vision-Instagram-for-Research.md +111 -0
docs/research/02-Recommendation-System-Blueprint.md +346 -0
docs/research/03-MultiInterest-Recommender-Architecture.md +323 -0
docs/research/04-Technical-Roadmap-Legacy.md +1009 -0
docs/research/05-Evolution-Of-Onboarding-And-Interests.md +45 -0
docs/research/06-Deep-Research-Verdict.md +97 -0
docs/walkthroughs/01-Phase1-Code-Tour.md +886 -0
docs/walkthroughs/02-Phase2-MultiInterest-Recommender.md +175 -0
docs/walkthroughs/03-Code-Summary-and-Test-Plan.md +89 -0

.dockerignore ADDED

	@@ -0,0 +1,13 @@

+.git
+.pytest_cache
+.qodo
+.claude
+__pycache__
+*.pyc
+*.pyo
+*.db
+*.ipynb
+pdf_images/
+notebooks/
+*.pdf
+.env

.env.example ADDED

	@@ -0,0 +1,6 @@

+# Copy this to .env and fill in your values
+QDRANT_URL=https://your-cluster.cloud.qdrant.io
+QDRANT_API_KEY=your-qdrant-api-key
+ZILLIZ_URI=https://your-instance.cloud.zilliz.com
+ZILLIZ_TOKEN=your-zilliz-token
+GROQ_API_KEY=gsk_your-groq-key

.gitignore ADDED

Binary file (363 Bytes).

CLAUDE.md ADDED

	@@ -0,0 +1,426 @@

+# CLAUDE.md — ResearchIT Coding Agent Rulebook
+> **Read this file first, every session, before touching anything else.** This file tells you which docs to trust, in what order, and the non-negotiable rules for this codebase. If you skip this file you will produce code that contradicts months of architectural research.
+---
+## 1. What this codebase is
+ResearchIT is a personalized arXiv paper recommendation engine covering ~1.6M papers with pre-computed BGE-M3 (1024-dim) dense embeddings. It runs CPU-only (zero GPU). The stack: FastAPI + HTMX + Jinja2 on the front; Qdrant Cloud (dense, `arxiv_bgem3_dense` collection, BQ enabled, HNSW m=32) and Zilliz Cloud (sparse, `arxiv_bgem3_sparse` collection, wired in Phase 3) for vectors; SQLite for interactions/profiles/clusters/metadata cache; Hugging Face Spaces (Docker SDK, free tier: 16GB RAM, 2 vCPUs) for deployment. Single developer (Amin). Pre-launch: no real users yet.
+**Endgame:** an "Instagram for research" — multi-interest aware feed that surfaces relevant papers across a user's distinct research areas without collapsing toward a dominant interest.
+**You are a coding agent operating inside this codebase.** Optimize for: (a) not contradicting the architectural decisions in `docs/research/06-Deep-Research-Verdict.md`, (b) shipping working code under tight latency budgets, (c) flagging uncertainty rather than guessing.
+---
+## 2. The document map — read this before consulting any doc
+There are six research documents in `docs/research/`, four walkthroughs in `docs/walkthroughs/`, and three phase plans in `docs/phases/`. The research docs were written at different times and **they contradict each other**. Follow this precedence strictly:
+### Research documents (`docs/research/`)
+| Priority | File | Status | Use it for | Do NOT use it for |
+|---|---|---|---|---|
+| **1 (canonical)** | `06-Deep-Research-Verdict.md` | **Source of truth** | Any architectural question. Algorithm choices, parameter values, fusion strategy, phase plan, what's in/out of scope. | N/A — always trust this first. |
+| **2** | `03-MultiInterest-Recommender-Architecture.md` | Implemented, with corrections from Doc 06 (see addendum at bottom of doc) | Detailed component descriptions where 06 is silent (e.g., specific Qdrant query shapes, MMR mechanics). | Anything 06 contradicts (RRF fusion for recs, alpha_long=0.10, BGE-reranker in hot path). |
+| **3** | `02-Recommendation-System-Blueprint.md` | Older blueprint | Background context on the original blueprint. | Onboarding design (it advocates fixed subject vectors — superseded). |
+| **4** | `01-Vision-Instagram-for-Research.md` | Product vision | Product/UX intent, growth loops, swipe mechanics, what the user-facing product should *feel* like. | Any backend architecture decisions. |
+| **5 (legacy)** | `04-Technical-Roadmap-Legacy.md` | **Superseded** | Historical reference only. | Anything actionable. Treat as read-only context. |
+| **6 (history)** | `05-Evolution-Of-Onboarding-And-Interests.md` | Historical narrative | Understanding *why* we pivoted from subject vectors to behavioral. | Implementation guidance — its "pure behavioral" conclusion was itself superseded by 06's hybrid verdict. |
+### Phase plans (`docs/phases/`)
+| File | What it covers |
+|---|---|
+| `PHASE1-Zero-ML-Recommender.md` | What Phase 1 built (Qdrant, arXiv API, HTMX) |
+| `PHASE2-Hybrid-Search-Plan.md` | Prototype reference for search pipeline (superseded by Phase 3 doc) |
+| `PHASE3-Hybrid-Semantic-Search.md` | **Active Phase 3 implementation plan** — BGE-M3 + Qdrant dense + Zilliz sparse + RRF |
+### Walkthroughs (`docs/walkthroughs/`)
+| File | What it covers |
+|---|---|
+| `01-Phase1-Code-Tour.md` | File-by-file walkthrough of the Phase 1 zero-ML codebase |
+| `02-Phase2-MultiInterest-Recommender.md` | Multi-interest engine implementation (EWMA, Ward, RRF, reranker, MMR) |
+| `03-Code-Summary-and-Test-Plan.md` | Codebase summary and three-layered testing strategy |
+| `04-Next-Steps-and-Phase-Plan.md` | **Master roadmap** synthesizing all 6 research docs into Phases 3-9 |
+### How to use the map in practice
+- **Architecture question?** Open `docs/research/06-Deep-Research-Verdict.md`. Stop there if it answers. If 06 is silent, fall through to 03.
+- **Product/UX question?** Open `docs/research/01-Vision-Instagram-for-Research.md`.
+- **"What phase are we in? What is next?"** Open `docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md`.
+- **"How does module X work?"** Open `docs/walkthroughs/02-Phase2-MultiInterest-Recommender.md` or `03-Code-Summary-and-Test-Plan.md`.
+- **Conflict between docs?** Higher-priority doc wins. **Never average or merge contradictory guidance.**
+- **The user references a doc by number** (e.g., "per doc 02") — read that doc but flag if 06 contradicts it before acting.
+---
+## 3. Non-negotiable rules (from doc 06)
+These are the hard architectural commitments. **Violating any of these is a regression.** If a task seems to require violating one, stop and ask the user.
+### 3.1 Fusion
+- **Search uses RRF.** (Different retrievers, dense + sparse, answering the same query, so RRF is correct here. As of Phase 3, search is the hybrid semantic pipeline, with the arXiv keyword API as fallback.)
+- **Zilliz collection schema** for Phase 3: collection `arxiv_bgem3_sparse`, fields: `id` (INT64, auto_id PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR). Index: SPARSE_INVERTED_INDEX, metric_type=IP. Sparse format uses **integer token IDs** as keys (from BGE-M3 tokenizer), NOT string words. Example: `{29: 0.0427, 6083: 0.1852, ...}`.
+- **Recommendations use importance-weighted quota with a floor.** (Different queries — K medoid queries — over the same user. RRF would let the dominant cluster dominate; quota preserves minor interests.)
+- **Never use RRF to merge multi-medoid recommendation results.** This is the most common mistake to avoid in this codebase.
+- **Current status:** The codebase still uses Qdrant prefetch+RRF for recommendations in `app/qdrant_svc.py` via `multi_interest_search()`. This will be replaced with per-cluster quota in Phase 4. Do not extend the RRF pattern to new recommendation code.
+Quota formula:
+```
+w_k = importance_k / sum(importance_k)
+slot_k = max(floor(F * w_k), F_min)   # F = feed size, F_min = 3
+# distribute remainder by largest fractional part
+```
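
A minimal sketch of the quota formula above, for illustration only. The function name `allocate_quota` is hypothetical (the planned module `app/recommend/fusion.py` does not exist yet). The formula does not say what to do when the `F_min` floor pushes the total above `F`; this sketch simply leaves the overshoot in place, which is an assumption.

```python
import math


def allocate_quota(importance: list[float], feed_size: int, f_min: int = 3) -> list[int]:
    """Return one slot count per cluster: floor(F * w_k), raised to f_min."""
    total = sum(importance)
    if not importance or total <= 0:
        raise ValueError("importance must contain at least one positive value")

    # raw_k = F * w_k, with w_k = importance_k / sum(importance)
    raw = [feed_size * value / total for value in importance]
    slots = [max(math.floor(r), f_min) for r in raw]

    # Hand out any remaining slots by largest fractional part.
    remainder = feed_size - sum(slots)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True
    )
    for i in by_fraction[: max(remainder, 0)]:
        slots[i] += 1
    return slots


balanced = allocate_quota([4, 3, 3], feed_size=11)   # remainder goes to cluster 0
dominated = allocate_quota([8, 1, 1], feed_size=20)  # floor protects minor clusters
```

In the second call the two minor clusters would receive only 2 slots each by weight alone; the floor lifts them to 3, which is the behaviour that keeps minor interests visible in the feed.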
+### 3.2 EWMA decay parameters
+- `alpha_long = 0.03` — lives in `app/recommend/profiles.py` as `ALPHA_LONG_TERM`
+- `alpha_short = 0.40` — lives in `app/recommend/profiles.py` as `ALPHA_SHORT_TERM`
+- `alpha_neg = 0.15` — lives in `app/recommend/profiles.py` as `ALPHA_NEGATIVE`
+If you find `alpha_long = 0.10` anywhere in code or config, it is a bug from doc 03. Fix it and reference doc 06 in the commit message.
+### 3.3 Clustering
+- Algorithm: **Ward hierarchical agglomerative** via `scipy.cluster.hierarchy.ward`.
+- Code lives in: `app/recommend/clustering.py`.
+- **L2-normalize embeddings BEFORE Ward, then use Euclidean distance.** Cosine Ward via sklearn is mathematically not Ward (Murtagh and Legendre 2014). L2-norm + Euclidean is monotonically equivalent to cosine and gives the intended behavior. This normalization is already in the code.
+- **No fixed K.** Cut the dendrogram by adaptive gap-based threshold (see `_adaptive_threshold()`). Cap at `K_max = 7` (currently; doc 06 says `K_max = 20` for heavy users — raise this when users exist).
+- **Medoid, not centroid.** Medoid = arg min over cluster members of sum of squared distances. Cache medoid paper IDs. This is implemented in `_find_medoid()`.
+- **Hungarian-match cluster IDs across reclusterings** — NOT YET IMPLEMENTED. Planned for Phase 4.
+- Recompute on each feed request currently (not nightly batch — no batch job infrastructure yet).
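
A small pure-Python sketch of two points from this section: for unit vectors, squared Euclidean distance equals `2 - 2 * cosine`, so ordering by one matches ordering by the other, and the medoid is the member with the smallest sum of squared distances to the rest. The helper names are hypothetical and this is not the code in `app/recommend/clustering.py`.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def sq_euclidean(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_medoid_index(members: list[list[float]]) -> int:
    """Index of the member minimising the sum of squared distances to all members."""
    costs = [sum(sq_euclidean(m, other) for other in members) for m in members]
    return min(range(len(members)), key=costs.__getitem__)


raw = [[2.0, 0.0], [0.9, 0.1], [0.0, 3.0]]
unit = [l2_normalize(v) for v in raw]
medoid = find_medoid_index(unit)
```

Unlike a centroid, the medoid is always an actual paper in the cluster, so its ID can be cached and used directly as a query.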
+### 3.4 Reranking
+- Terminal CPU-path reranker: currently a **hand-tuned heuristic scorer** in `app/recommend/reranker.py` via `heuristic_score()`. Will be replaced with **LightGBM `objective='lambdarank'`** in Phase 6 when training data exists.
+- The heuristic scorer uses 5 features: cosine_sim_longterm, cosine_sim_shortterm, paper_age_days, retrieval_position, cosine_sim_negative.
+- Weight budget: `0.40 * lt + 0.25 * st + 0.15 * recency + 0.10 * position - 0.15 * negative_penalty`.
+- **Do NOT put `BGE-reranker-v2-m3` in the serving path.** ~8ms per pair on CPU = ~800ms for 100 pairs. Far over the 30ms budget.
+- If a cross-encoder signal is wanted: distill BGE-reranker-v2 offline into a TinyBERT-L2 student (FlashRank recipe) and use the student score as a LightGBM feature on top-20. Phase 8.
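
A sketch of the weight budget above. How `paper_age_days` and `retrieval_position` are turned into the `recency` and `position` terms is not stated here, so the half-life decay and the linear position score below are assumptions, as is the function name (deliberately not `heuristic_score()`).

```python
def heuristic_score_sketch(
    cos_long_term: float,
    cos_short_term: float,
    paper_age_days: float,
    retrieval_position: int,
    cos_negative: float,
    n_candidates: int,
    half_life_days: float = 365.0,  # hypothetical decay rate
) -> float:
    """Combine the five features using the documented weights."""
    recency = 0.5 ** (paper_age_days / half_life_days)         # 1.0 for a new paper
    position = 1.0 - retrieval_position / max(n_candidates - 1, 1)  # 1.0 at rank 0
    negative_penalty = max(cos_negative, 0.0)                  # only similarity counts
    return (
        0.40 * cos_long_term
        + 0.25 * cos_short_term
        + 0.15 * recency
        + 0.10 * position
        - 0.15 * negative_penalty
    )


clean = heuristic_score_sketch(0.8, 0.6, 0, 0, 0.0, n_candidates=50)
disliked = heuristic_score_sketch(0.8, 0.6, 0, 0, 1.0, n_candidates=50)
```

The two calls differ only in similarity to the negative profile, so the gap between them is exactly the 0.15 penalty weight.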
+### 3.5 Diversity
+- MMR with `lambda = 0.6` over the merged feed, on BGE-M3 embeddings. Code in `app/recommend/diversity.py` via `mmr_rerank()`.
+- Exploration injection: 2 serendipitous papers per feed. Code in `app/recommend/diversity.py` via `inject_exploration()`.
+- Quota (3.1) handles cross-cluster diversity. MMR handles within-quota redundancy.
+- Do NOT use DPPs in v1.
+### 3.6 Cold start / onboarding (the hybrid verdict)
+NOT YET IMPLEMENTED (Phase 5). The pivot in doc 05 went too far. Doc 06 corrects it. The right onboarding is **three-layer hybrid**:
+1. arXiv category multi-select — used as a **filter and LightGBM feature**, NOT as the primary user vector.
+2. ORCID / Semantic Scholar / Google Scholar author import — ingest authored paper embeddings as initial seeds.
+3. "Add 5 seed papers" library seeder — explicit user-chosen seeds.
+4. Fallback: popularity-per-selected-category feed for first session if user skips all three.
+Behavioral takes over once the user crosses **~10 saved papers**. Subject categories remain a feature/filter forever, never the primary vector.
+### 3.7 Negative signals
+The negative EWMA profile IS wired into reranking (Feature 5 in `reranker.py`). The full layered system described in Doc 06 is partially implemented:
+1. **Session hard filter** — never re-show dismissed items (`seen` set in `recommendations.py`). DONE.
+2. **Short-term item penalty** at rerank: `score -= alpha * exp(-dt / tau_neg)` — NOT YET (needs per-item decay tracking).
+3. **Long-term EWMA negative profile** — wired as Feature 5 with 0.15 penalty weight. DONE.
+4. **Category-level suppression** — NOT YET (needs category tracking on dismissals).
+5. **LightGBM dismissal labels** — NOT YET (Phase 6, needs 10K+ dismissals).
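
The short-term penalty in item 2 is not implemented yet; the sketch below only illustrates the formula `score -= alpha * exp(-dt / tau_neg)`. The default values for `alpha` and `tau_neg` and the function name are hypothetical, since this file does not specify them.

```python
import math


def dismissal_penalty(
    seconds_since_dismissal: float,
    alpha: float = 0.15,                  # hypothetical penalty size
    tau_neg: float = 7 * 24 * 3600.0,     # hypothetical time constant: one week
) -> float:
    """Penalty that starts at alpha and decays exponentially with elapsed time."""
    return alpha * math.exp(-seconds_since_dismissal / tau_neg)


just_dismissed = dismissal_penalty(0.0)
one_week_later = dismissal_penalty(7 * 24 * 3600.0)
score = 0.72 - just_dismissed  # applied at rerank time to a similar item
```

The penalty is largest immediately after a dismissal and falls to `alpha / e` after one time constant, so a single dismissal does not suppress similar items indefinitely.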
+### 3.8 Latency budget
+End-to-end feed generation target: **<30ms on CPU** (excluding metadata fetch, which is I/O-bound). Approximate budget per stage:
+- Qdrant queries (3 medoids, parallel): ~10ms
+- Heuristic rerank (LightGBM later): ~1ms
+- MMR over union: ~2ms
+- Quota + dedup: <1ms
+- Negative-profile penalty: <1ms
+- Headroom: ~15ms
+**Note:** Metadata fetching from arXiv API currently adds ~7,600ms cold. This will be fixed by bulk-loading Kaggle metadata into SQLite (Phase 4). The recommendation compute itself is within budget.
+### 3.9 ArXiv ID integrity
+ArXiv IDs can have leading zeros (e.g., `0704.0001`). **Treat all arXiv IDs as strings, never integers.** Pandas will silently coerce them — always pass `dtype=str` to `read_csv`. This is a real bug that has bitten this project before.
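
A standard-library sketch of the failure mode described above. Parsing an ID as a number, which is what numeric type inference effectively does, drops the leading zero and any trailing zero. Keeping the column as text avoids this, which is what `dtype=str` achieves in pandas. The sample rows and titles are made up for illustration.

```python
import csv
import io

raw_csv = "arxiv_id,title\n0704.0001,Example A\n1501.00010,Example B\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# csv.DictReader never converts types, so the IDs survive intact.
ids_as_text = [row["arxiv_id"] for row in rows]

# What a numeric round trip does to the same values.
ids_via_float = [str(float(row["arxiv_id"])) for row in rows]
```

`ids_as_text` keeps `0704.0001` and `1501.00010`, while `ids_via_float` yields `704.0001` and `1501.0001`, neither of which is a valid lookup key any more.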
+---
+## 4. What is in scope vs out of scope right now
+**Current phase: Phase 3 complete, Phase 4 next.** Phase 2 (a, b, c) is complete with Doc 06 corrections applied. Phase 3 (Hybrid Semantic Search) is implemented and tested — pending HF Spaces deployment.
+**What has been built (Phases 1-2c):**
+- Qdrant BEST_SCORE recommend API (Tier 3 fallback)
+- EWMA profiles (long/short/negative, alpha corrected)
+- Ward clustering with L2-norm + adaptive threshold + medoids
+- Prefetch+RRF retrieval (Tier 1, will be replaced with quota in Phase 4)
+- EWMA vector search (Tier 2 fallback)
+- 5-feature heuristic reranker (with negative penalty)
+- MMR diversity + exploration injection
+- 3-tier cascading pipeline (5+ saves, 3+ saves, 1+ save)
+- 88 tests passing
+**Phase 3 — implemented (Hybrid Semantic Search):**
+*See `docs/TASK-TRACKER.md` Phase 3 section for full details.*
+- `app/embed_svc.py` — BGE-M3 model singleton (lazy load, LRU cache, CPU float32)
+- `app/zilliz_svc.py` — Zilliz sparse search client (gRPC reconnect, graceful fallback)
+- `app/groq_svc.py` — LLM query rewriter (2s timeout, academic heuristic, unconditional fallback)
+- `app/hybrid_search_svc.py` — Orchestrator (rewrite → encode → parallel search → RRF → recency rerank)
+- Swapped `app/routers/search.py` to use hybrid pipeline, with arXiv API fallback
+- `Dockerfile` + `.dockerignore` — HF Spaces deployment (Docker SDK, port 7860)
+- 35 new tests passing, 123 total (zero regressions)
+**Phase 4 — recommendation fixes:**
+- Replace RRF with importance-weighted quota in `app/routers/recommendations.py`
+- Pre-populate SQLite metadata from Kaggle dataset
+- Hungarian matching for cluster stability
+**Out of scope until later phases — do not build:**
+- Collaborative filtering / LightFM (Phase 9, 500+ users).
+- Cross-encoder reranking in serving path (never; only distilled — Phase 8).
+- Claude/Groq-generated cluster summaries (Phase 8).
+- Epsilon-greedy exploration beyond the current simple stub (Phase 9).
+- DPPs, Semantic IDs, TIGER, PinnerFormer-style single-vector models (Phase 9+, only if scale warrants).
+- Migration to Supabase (until 10+ concurrent writes/sec observed).
+- React SPA (explicitly ruled out — stick with HTMX + Jinja2).
+- Redis (explicitly ruled out — in-process caches are fine at this scale).
+- Real-time streaming (explicitly ruled out).
+- Custom embedding fine-tuning (explicitly ruled out — BGE-M3 is frozen).
+If a request asks for one of these, surface that it is out of scope per doc 06 phase plan, then ask whether to proceed anyway or defer.
+---
+## 5. Workflow rules for the agent
+### 5.1 Before doing anything
+- Read this file (`CLAUDE.md`).
+- If the task touches recommendation logic, also read `docs/research/06-Deep-Research-Verdict.md`.
+- If the task touches a specific component, load the relevant doc per section 2.
+- Check existing code in the affected module before writing. Do not duplicate utilities.
+### 5.2 When to ask vs when to act
+**Ask first** if any of these are true:
+- The request would violate a section 3 rule.
+- The request is ambiguous about which phase it belongs to.
+- The request involves changing EWMA parameters, quota formulas, or cluster hyperparameters that exist in code with rationale comments.
+- The request would add a new dependency not in `requirements.txt`.
+**Act directly** for:
+- Bug fixes with a clear repro.
+- Adding tests.
+- Refactors within a single module that do not change behavior.
+- Implementing something explicitly described in doc 06.
+### 5.3 Code style
+- Python 3.12+. Type hints on all public functions.
+- Async by default for FastAPI handlers. Sync is fine for batch scripts.
+- No bare `except:`. Catch specific exceptions.
+- Logging: use `print()` with `[module_name]` prefix (current convention). Will migrate to structured logging later.
+- All embeddings are 1024-dim float32 (BGE-M3 dense). Normalize before storage/comparison.
+### 5.4 Tests
+- Every new function in `app/recommend/` gets a unit test.
+- Every new endpoint gets at least one integration test.
+- Use `pytest` + `pytest-asyncio` (asyncio_mode = auto, configured in `pytest.ini`).
+- Test files go in `tests/`. No `tests/fixtures/` directory exists yet — inline fixtures or use `tmp_path`.
+- Run tests: `python -m pytest tests/ -v`
+- Run E2E: `python test_e2e_recs.py`
+### 5.5 File and folder conventions
+This is the **actual** project structure. Do not create directories that do not exist unless building a new phase component.
+```
+ResearchIT-Final/
+|-- CLAUDE.md                    # THIS FILE — agent rulebook
+|-- run.py                       # Dev server entry (python run.py)
+|-- requirements.txt             # pip dependencies
+|-- pytest.ini                   # pytest config (asyncio_mode=auto)
+|-- interactions.db              # SQLite database (auto-created)
+|-- test_e2e_recs.py             # E2E simulation test (standalone)
+|
+|-- app/                         # FastAPI application
+|   |-- main.py                  # App entry, lifespan, router includes
+|   |-- config.py                # Settings (QDRANT_URL, COOKIE_NAME, etc.)
+|   |-- db.py                    # SQLite schema + async CRUD (aiosqlite)
+|   |-- qdrant_svc.py            # Qdrant client: recommend, search_by_vector,
+|   |                            #   get_paper_vectors, multi_interest_search
+|   |-- arxiv_svc.py             # arXiv API search + metadata fetch + SQLite cache
+|   |-- user_state.py            # In-memory user state (positive/negative deques)
+|   |-- templates_env.py         # Jinja2 environment setup
+|   |
+|   |-- routers/                 # FastAPI route handlers
+|   |   |-- search.py            # GET /search — arXiv keyword API (Phase 3 replaces)
+|   |   |-- recommendations.py   # GET /api/recommendations — 3-tier cascade
+|   |   |-- events.py            # POST /api/save, /api/dismiss — triggers EWMA update
+|   |   |-- saved.py             # GET /saved — user saved papers
+|   |
+|   |-- recommend/               # Recommendation engine (Phase 2)
+|   |   |-- __init__.py          # Module docstring
+|   |   |-- profiles.py          # EWMA profiles (long/short/negative)
+|   |   |-- clustering.py        # Ward clustering + medoids + adaptive threshold
+|   |   |-- reranker.py          # 5-feature heuristic scorer (then LightGBM later)
+|   |   |-- diversity.py         # MMR reranking + exploration injection
+|   |
+|   |-- templates/               # Jinja2 + HTMX templates
+|       |-- base.html            # Base layout
+|       |-- index.html           # Home page with recommendations
+|       |-- search.html          # Search page
+|       |-- partials/            # HTMX partial templates
+|
+|-- docs/                        # Documentation (see section 2 for precedence)
+|   |-- README.md                # Master doc index with reading order
+|   |-- TASK-TRACKER.md          # Master task checklist (all phases)
+|   |-- research/                # Research documents (01-06)
+|   |   |-- 01-Vision-Instagram-for-Research.md
+|   |   |-- 02-Recommendation-System-Blueprint.md
+|   |   |-- 03-MultiInterest-Recommender-Architecture.md  # Has addendum with corrections
+|   |   |-- 04-Technical-Roadmap-Legacy.md
+|   |   |-- 05-Evolution-Of-Onboarding-And-Interests.md
+|   |   |-- 06-Deep-Research-Verdict.md                   # SOURCE OF TRUTH
+|   |-- phases/
+|   |   |-- PHASE1-Zero-ML-Recommender.md
+|   |   |-- PHASE2-Hybrid-Search-Plan.md                  # Prototype reference
+|   |   |-- PHASE3-Hybrid-Semantic-Search.md               # ACTIVE PHASE 3 PLAN
+|   |-- walkthroughs/
+|       |-- 01-Phase1-Code-Tour.md
+|       |-- 02-Phase2-MultiInterest-Recommender.md
+|       |-- 03-Code-Summary-and-Test-Plan.md
+|       |-- 04-Next-Steps-and-Phase-Plan.md               # MASTER ROADMAP
+|
+|-- notebooks/                   # Kaggle/Jupyter notebooks (reference only)
+|   |-- README.md                # Notebook index + extracted schema details
+|   |-- 01-bme-upload.ipynb      # BGE-M3 encode + upload to Qdrant/Zilliz (1.6M papers)
+|   |-- 02-bme-arxiv-test.ipynb  # Search quality tests + BGE-M3 prototype
+|   |-- 03-check-search-bq-prm.ipynb  # BQ vs PRM quantization benchmark
+|
+|-- tests/                       # pytest test suite (88 tests)
+    |-- test_profiles.py         # EWMA profile tests (11)
+    |-- test_clustering.py       # Ward clustering tests (10)
+    |-- test_reranker_diversity.py # Reranker + MMR tests (13)
+    |-- test_db.py               # SQLite schema tests
+    |-- test_qdrant_svc.py       # Qdrant client tests
+    |-- test_arxiv_svc.py        # arXiv service tests
+    |-- test_integration.py      # Cross-module integration tests
+    |-- test_user_state.py       # User state tests
+    |-- test_saved.py            # Saved papers tests
+```
+**Modules added in Phase 3 or still planned** (not shown in the tree above):
+- `app/embed_svc.py` — BGE-M3 model singleton (Phase 3) ✅ BUILT
+- `app/zilliz_svc.py` — Zilliz sparse search (Phase 3) ✅ BUILT
+- `app/groq_svc.py` — LLM query rewriter (Phase 3) ✅ BUILT
+- `app/hybrid_search_svc.py` — Search orchestrator (Phase 3) ✅ BUILT
+- `app/recommend/fusion.py` — Quota fusion, replaces RRF (Phase 4)
+### 5.6 Common commands
+```bash
+# Run the app (dev server with hot reload)
+python run.py
+# serves at http://127.0.0.1:7860 (port 7860 for HF Spaces compat)
+# Run all tests
+python -m pytest tests/ -v
+# Run specific test file
+python -m pytest tests/test_clustering.py -v
+# Run E2E simulation (hits live Qdrant)
+python test_e2e_recs.py
+# Install dependencies
+pip install -r requirements.txt
+```
+### 5.7 Commits
+- Conventional commits: `feat:`, `fix:`, `refactor:`, `test:`, `docs:`, `chore:`.
+- If a commit implements a decision from a doc, reference it: `feat(fusion): implement importance-weighted quota per doc 06`.
+- Never commit secrets. Environment variables are read from `app/config.py` via `os.getenv()`.
+- Phase 3 env vars: `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`. These are set in HF Spaces Secrets, not hardcoded.
+- Qdrant env vars (already in config.py): `QDRANT_URL`, `QDRANT_API_KEY`, `QDRANT_COLLECTION`.
+---
+## 6. How to update the docs
+Architecture evolves. The agent will sometimes encounter decisions that need to be amended. Follow this protocol.
+### 6.1 The golden rule
+**Doc 06 is the primary architectural reference.** Docs 01-05 are historical. If a new decision contradicts 03/02/01/04/05, do not edit those — instead, either update doc 06 changelog or create a new doc (07+).
+**Exception:** Doc 03 has an "Addendum" section at the bottom specifically for recording corrections from Doc 06. This addendum can be updated when new corrections are applied.
+### 6.2 When the user makes a new architectural decision
+1. Append a dated entry to the bottom of `docs/research/06-Deep-Research-Verdict.md` under a `## Changelog` section.
+2. Format:
+   ```
+   ### YYYY-MM-DD — [short title]
+   **Decision:** [the new rule]
+   **Supersedes:** [which earlier doc/section, if any]
+   **Rationale:** [why — 1-3 sentences]
+   **Action items:** [what code changes are implied]
+   ```
+3. If the decision invalidates a section 3 rule in *this* `CLAUDE.md` file, also update section 3 to match.
+4. Update the correction summary table in Doc 03 addendum if applicable.
+5. Mention in the user-facing reply: "Logged in doc 06 changelog and updated CLAUDE.md."
+### 6.3 When you discover a contradiction the user has not resolved
+Do not silently pick a side. Surface it: "Doc 03 says X, doc 06 says Y, the code does Z. Which should I follow?" Then act on the answer and log per 6.2.
+### 6.4 Editing other docs
+You may edit docs 01-05 only to:
+- Fix a typo.
+- Update the addendum in Doc 03.
+- Add a banner noting it is superseded.
+- Nothing else. No content edits, no architectural revisions.
+### 6.5 New docs
+If a topic is too large for a 06 changelog entry, create `docs/research/07-[topic].md` and add it to the section 2 table in this file with priority and use/don't-use guidance. Do not create new docs without prompting from the user.
+---
+## 7. Quick reference card
+| Question | Answer |
+|---|---|
+| Source of truth? | `docs/research/06-Deep-Research-Verdict.md` |
+| Master roadmap? | `docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md` |
+| Recommendation fusion? | Importance-weighted quota with `F_min=3`. NOT RRF. (code still uses RRF — Phase 4 fix) |
+| Search fusion? | RRF over dense + sparse results (hybrid search, implemented in Phase 3; arXiv keyword API is the fallback). |
+| alpha_long? | `0.03` — in `app/recommend/profiles.py` |
+| alpha_short? | `0.40` — in `app/recommend/profiles.py` |
+| alpha_neg? | `0.15` — in `app/recommend/profiles.py` |
+| MMR lambda? | `0.6` — in `app/recommend/diversity.py` |
+| Cluster algorithm? | Ward, L2-normalized, Euclidean, adaptive gap threshold, `K_max=7`. In `app/recommend/clustering.py`. |
+| Reranker? | Heuristic scorer (5 features) then LightGBM lambdarank (Phase 6). In `app/recommend/reranker.py`. |
+| Latency budget? | <30ms end-to-end (compute only; metadata I/O excluded). |
+| Cold start? | Hybrid: arXiv categories + ORCID/Scholar import + 5 seed papers + popularity fallback. NOT BUILT YET (Phase 5). |
+| When does behavioral take over? | ~10 saved papers. Currently activates at 5 (clustering) / 3 (EWMA) / 1 (BEST_SCORE). |
+| When to add CF? | 500+ users (Phase 9). |
+| Current phase? | **Phase 3 complete.** Phase 4 (rec pipeline fixes) next. See `docs/TASK-TRACKER.md`. |
+| ArXiv ID type? | String. Always. `dtype=str` in pandas. |
+| Embedding model? | BAAI/bge-m3, 1024-dim dense + sparse lexical weights. Loaded at startup in `app/embed_svc.py`. Graceful fallback if not installed. |
+| How to run? | `python run.py` at http://127.0.0.1:7860 (port 7860 for HF Spaces compat) |
+| How to test? | `python -m pytest tests/ -v` (123 tests) |
+| Storage? | SQLite (`interactions.db`) — ephemeral on HF Spaces. Supabase at 10+ concurrent writes/sec. |
+| Deployment? | Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs). Render abandoned (512MB too small for BGE-M3). |
+| Forbidden in v1? | Redis, React SPA, real-time streaming, custom embedding fine-tuning, cross-encoder in hot path, DPPs, generative retrieval. |
+---
+*Last updated: 2026-04-19. Update this date when CLAUDE.md changes.*

Dockerfile ADDED

	@@ -0,0 +1,28 @@

+# ── ResearchIT — HF Spaces Docker deployment ─────────────────────────────────
+# Free tier: 16GB RAM, 2 vCPUs, ephemeral filesystem, port 7860 required
+FROM python:3.12-slim
+# System dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends gcc g++ && \
+    rm -rf /var/lib/apt/lists/*
+WORKDIR /app
+# Install torch CPU-only first (smaller than full CUDA build)
+RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
+# Install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Pre-download BGE-M3 model into the image (baked in, no cold-start download)
+RUN python -c "from FlagEmbedding import BGEM3FlagModel; BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)"
+# Copy application code
+COPY . .
+# HF Spaces requires port 7860 and non-root user
+USER 1000
+EXPOSE 7860
+CMD ["python", "run.py"]

app/__init__.py ADDED

File without changes

app/arxiv_svc.py ADDED

	@@ -0,0 +1,158 @@

+"""
+arXiv API service.
+Responsibilities
+────────────────
+1. search(query)     – keyword search via export.arxiv.org/api/query
+2. fetch_metadata()  – fetch a single paper's metadata by arxiv_id
+3. fetch_metadata_batch() – fetch multiple papers, using SQLite cache first
+ArXiv IDs come in two formats:
+  Old: YYMM.NNNN   e.g. 0704.0002
+  New: YYMM.NNNNN  e.g. 1706.03762
+The arXiv API returns full URLs like http://arxiv.org/abs/1706.03762v5.
+We always normalise to bare id (no version suffix, no URL prefix).
+"""
+import asyncio
+import json
+import re
+import xml.etree.ElementTree as ET
+from datetime import datetime
+import httpx
+from app import config
+from app import db
+# XML namespace used in the Atom feed returned by arXiv API
+_NS = {
+    "atom": "http://www.w3.org/2005/Atom",
+    "arxiv": "http://arxiv.org/schemas/atom",
+    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
+}
+_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")
+def _normalise_id(raw: str) -> str:
+    """Strip URL prefix and version suffix from an arxiv ID string."""
+    m = _ID_RE.search(raw.strip())
+    if not m:
+        return raw.strip()
+    bare = m.group(1)
+    # Remove trailing version e.g. '1706.03762v5' → '1706.03762'
+    return re.sub(r"v\d+$", "", bare)
+def _parse_entry(entry: ET.Element) -> dict:
+    """Convert one <entry> element into a paper dict."""
+    def text(tag: str) -> str:
+        el = entry.find(tag, _NS)
+        return el.text.strip() if el is not None and el.text else ""
+    raw_id = text("atom:id")
+    arxiv_id = _normalise_id(raw_id)
+    authors = [
+        a.findtext("atom:name", namespaces=_NS, default="").strip()
+        for a in entry.findall("atom:author", _NS)
+    ]
+    # Primary category
+    cat_el = entry.find("arxiv:primary_category", _NS)
+    category = cat_el.attrib.get("term", "") if cat_el is not None else ""
+    published = text("atom:published")[:10]  # keep YYYY-MM-DD only
+    year = int(published[:4]) if published else 0
+    return {
+        "arxiv_id": arxiv_id,
+        "title": text("atom:title").replace("\n", " "),
+        "abstract": text("atom:summary").replace("\n", " "),
+        "authors": json.dumps(authors[:5]),   # store as JSON string
+        "category": category,
+        "published": published,
+        "year": year,
+    }
+async def search(query: str, max_results: int = config.ARXIV_MAX_RESULTS) -> list[dict]:
+    """
+    Search arXiv and return a list of paper dicts with metadata.
+    Results are also written into the metadata cache.
+    """
+    params = {
+        "search_query": f"all:{query}",
+        "start": 0,
+        "max_results": max_results,
+        "sortBy": "relevance",
+        "sortOrder": "descending",
+    }
+    async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
+        resp = await client.get(config.ARXIV_API_URL, params=params)
+        resp.raise_for_status()
+    root = ET.fromstring(resp.text)
+    papers = [_parse_entry(e) for e in root.findall("atom:entry", _NS)]
+    # Cache metadata for every result we got
+    for paper in papers:
+        await db.cache_metadata(paper)
+    return papers
+async def fetch_metadata(arxiv_id: str) -> dict | None:
+    """
+    Return metadata for a single paper.
+    Checks the SQLite cache first; falls back to arXiv API.
+    """
+    cached = await db.get_cached_metadata(arxiv_id)
+    if cached:
+        return cached
+    params = {"id_list": arxiv_id, "max_results": 1}
+    async with httpx.AsyncClient(timeout=15, follow_redirects=True) as client:
+        resp = await client.get(config.ARXIV_API_URL, params=params)
+        resp.raise_for_status()
+    root = ET.fromstring(resp.text)
+    entries = root.findall("atom:entry", _NS)
+    if not entries:
+        return None
+    paper = _parse_entry(entries[0])
+    await db.cache_metadata(paper)
+    return paper
+async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
+    """
+    Return {arxiv_id: metadata} for all IDs.
+    Loads from cache where possible; fetches missing ones from arXiv API.
+    Spaces successive arXiv API requests 3 s apart, as arXiv's API terms
+    of use ask for no more than one request every three seconds.
+    """
+    if not arxiv_ids:
+        return {}
+    result = await db.get_cached_metadata_batch(arxiv_ids)
+    missing = [aid for aid in arxiv_ids if aid not in result]
+    if missing:
+        # arXiv allows up to 20 IDs per request
+        BATCH = 20
+        for i in range(0, len(missing), BATCH):
+            chunk = missing[i : i + BATCH]
+            params = {"id_list": ",".join(chunk), "max_results": len(chunk)}
+            async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
+                resp = await client.get(config.ARXIV_API_URL, params=params)
+                resp.raise_for_status()
+            root = ET.fromstring(resp.text)
+            for entry in root.findall("atom:entry", _NS):
+                paper = _parse_entry(entry)
+                await db.cache_metadata(paper)
+                result[paper["arxiv_id"]] = paper
+            if i + BATCH < len(missing):
+                await asyncio.sleep(3.0)  # one request every 3 s
+    return result
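
The module docstring lists only the two dotted ID formats, but arXiv also has pre-2007 identifiers of the form `category/NNNNNNN`. The following is a minimal sketch, not the commit's code, of a normaliser that strips the URL prefix and version suffix while keeping such IDs intact. The name `normalise_arxiv_id` and both regex constants are hypothetical.

```python
import re

# Hypothetical helper, not part of the commit.
# Strips a leading "arxiv:" tag or an arxiv.org/abs/ URL prefix.
_PREFIX_RE = re.compile(
    r"^(?:arxiv:|https?://(?:export\.)?arxiv\.org/abs/)", re.IGNORECASE
)
# Strips a trailing version marker such as "v5".
_VERSION_RE = re.compile(r"v\d+$")


def normalise_arxiv_id(raw: str) -> str:
    """Return the bare arXiv ID: no URL prefix, no version suffix."""
    bare = _PREFIX_RE.sub("", raw.strip())
    return _VERSION_RE.sub("", bare)


if __name__ == "__main__":
    for sample in (
        "http://arxiv.org/abs/1706.03762v5",
        "arXiv:0704.0002",
        "hep-th/9901001v2",  # pre-2007 style, slash is preserved
    ):
        print(sample, "->", normalise_arxiv_id(sample))
```

Anchoring the prefix at the start of the string and the version at the end avoids excluding `/` and `v` from the ID body, which would truncate IDs like `hep-th/9901001`.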

app/config.py ADDED

	@@ -0,0 +1,54 @@

+"""
+Centralised settings for the arxiv recommender app.
+Credentials are read from environment variables only (see .env.example);
+no secrets are hard-coded. Non-secret tunables have defaults below.
+"""
+import os
+# ── Qdrant (BGE-M3 dense, 1 024-dim) ─────────────────────────────────────────
+QDRANT_URL = os.getenv("QDRANT_URL", "")
+QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "")
+QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "arxiv_bgem3_dense")
+# ── SQLite ────────────────────────────────────────────────────────────────────
+DB_PATH = os.getenv("DB_PATH", "interactions.db")
+# ── arXiv API ─────────────────────────────────────────────────────────────────
+ARXIV_API_URL = "https://export.arxiv.org/api/query"
+ARXIV_MAX_RESULTS = 10          # results per search page
+METADATA_CACHE_TTL_DAYS = 30    # re-fetch metadata after this many days
+# ── Recommendation settings ───────────────────────────────────────────────────
+REC_LIMIT = 10                  # how many recommendations to show
+REC_POSITIVE_LIMIT = 20         # max positive examples sent to Qdrant
+REC_MIN_POSITIVES = 1           # minimum saves needed before showing recs
+# ── Zilliz Cloud (BGE-M3 sparse vectors) — Phase 3 ────────────────────────────
+ZILLIZ_URI = os.getenv("ZILLIZ_URI", "")
+ZILLIZ_TOKEN = os.getenv("ZILLIZ_TOKEN", "")
+ZILLIZ_COLLECTION = os.getenv("ZILLIZ_COLLECTION", "arxiv_bgem3_sparse")
+# Zilliz schema (confirmed from notebooks/01-bme-upload.ipynb):
+#   id            INT64  (auto_id, primary key)
+#   arxiv_id      VARCHAR
+#   sparse_vector SPARSE_FLOAT_VECTOR  (BGE-M3 lexical weights, int token IDs)
+#   Index: SPARSE_INVERTED_INDEX, metric_type="IP"
+# ── Groq (LLM query rewriter) — Phase 3 ──────────────────────────────────────
+GROQ_API_KEY = os.getenv("GROQ_API_KEY", "")
+# ── BGE-M3 (embedding model) — Phase 3 ───────────────────────────────────────
+BGE_M3_MODEL = os.getenv("BGE_M3_MODEL", "BAAI/bge-m3")
+BGE_M3_DEVICE = os.getenv("BGE_M3_DEVICE", "cpu")
+ENCODE_CACHE_SIZE = 128  # LRU cache for encoded queries
+# ── Hybrid search tuning — Phase 3 ───────────────────────────────────────────
+SEARCH_RRF_K = 60                  # RRF denominator
+SEARCH_FETCH_K_MULTIPLIER = 6     # candidates = top_k × 6 before rerank
+SEARCH_SEMANTIC_WEIGHT = 0.80     # RRF contribution to final score
+SEARCH_RECENCY_WEIGHT = 0.20      # recency contribution to final score
+# ── App ───────────────────────────────────────────────────────────────────────
+APP_TITLE = "ArXiv Recommender"
+COOKIE_NAME = "arxiv_user_id"
+COOKIE_MAX_AGE = 60 * 60 * 24 * 365  # 1 year
+APP_PORT = int(os.getenv("PORT", "7860"))  # HF Spaces requires 7860
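
Since every credential above defaults to an empty string, a startup check can report which optional services are unconfigured before the first request arrives. This is a minimal sketch under that assumption. The function name `unconfigured_services` and the grouping of variables per service are hypothetical; only the environment variable names come from the settings above.

```python
import os
from collections.abc import Mapping

# Hypothetical grouping: which env vars each external service needs.
_REQUIRED = {
    "qdrant": ("QDRANT_URL", "QDRANT_API_KEY"),
    "zilliz": ("ZILLIZ_URI", "ZILLIZ_TOKEN"),
    "groq": ("GROQ_API_KEY",),
}


def unconfigured_services(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the services with at least one missing or empty variable."""
    return [
        name
        for name, keys in _REQUIRED.items()
        if any(not env.get(key) for key in keys)
    ]


if __name__ == "__main__":
    missing = unconfigured_services()
    if missing:
        print(f"[config] Not configured, will be skipped: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the check easy to exercise in tests without touching the real process environment.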

app/db.py ADDED

	@@ -0,0 +1,275 @@

+"""
+SQLite database layer.
+Tables
+──────
+interactions        – every user action (save, not_interested, click, view)
+paper_qdrant_map    – arxiv_id → integer Qdrant point ID (cached lazily)
+paper_metadata      – arXiv API response cache (title, abstract, …)
+"""
+import aiosqlite
+from app.config import DB_PATH
+# ── DDL ───────────────────────────────────────────────────────────────────────
+_SCHEMA = """
+PRAGMA journal_mode=WAL;
+PRAGMA synchronous=NORMAL;
+CREATE TABLE IF NOT EXISTS interactions (
+    id            INTEGER PRIMARY KEY AUTOINCREMENT,
+    user_id       TEXT    NOT NULL,
+    paper_id      TEXT    NOT NULL,
+    event_type    TEXT    NOT NULL,   -- save | not_interested | click | view
+    source        TEXT,               -- search | recommendation
+    position      INTEGER,
+    query_id      TEXT,
+    timestamp     TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+CREATE INDEX IF NOT EXISTS idx_ui_user_ts
+    ON interactions(user_id, timestamp DESC);
+CREATE INDEX IF NOT EXISTS idx_ui_user_paper
+    ON interactions(user_id, paper_id);
+-- Maps arxiv_id -> Qdrant integer point ID (populated lazily on first save)
+CREATE TABLE IF NOT EXISTS paper_qdrant_map (
+    arxiv_id        TEXT PRIMARY KEY,
+    qdrant_point_id INTEGER NOT NULL,
+    mapped_at       TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+-- Cache of paper metadata fetched from the arXiv API
+CREATE TABLE IF NOT EXISTS paper_metadata (
+    arxiv_id    TEXT PRIMARY KEY,
+    title       TEXT,
+    abstract    TEXT,
+    authors     TEXT,   -- JSON array string
+    category    TEXT,
+    published   TEXT,
+    cached_at   TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+-- Phase 2a: EWMA user profile embeddings (long_term, short_term, negative)
+CREATE TABLE IF NOT EXISTS user_profiles (
+    user_id           TEXT NOT NULL,
+    profile_type      TEXT NOT NULL,  -- 'long_term' | 'short_term' | 'negative'
+    vector            BLOB NOT NULL,  -- 4096 bytes (1024 × float32)
+    interaction_count  INTEGER DEFAULT 0,
+    updated_at        TEXT NOT NULL DEFAULT (datetime('now')),
+    PRIMARY KEY (user_id, profile_type)
+);
+-- Phase 2b: Ward clustering results (medoid paper IDs per interest cluster)
+CREATE TABLE IF NOT EXISTS user_clusters (
+    user_id         TEXT NOT NULL,
+    cluster_idx     INTEGER NOT NULL,
+    medoid_paper_id TEXT NOT NULL,
+    importance      REAL NOT NULL,
+    paper_ids       TEXT NOT NULL,  -- JSON array of arxiv_ids
+    computed_at     TEXT NOT NULL DEFAULT (datetime('now')),
+    PRIMARY KEY (user_id, cluster_idx)
+);
+"""
+async def init_db() -> None:
+    """Create tables if they don't exist. Called once at startup."""
+    async with aiosqlite.connect(DB_PATH) as db:
+        await db.executescript(_SCHEMA)
+        await db.commit()
+# ── Interaction helpers ───────────────────────────────────────────────────────
+async def log_interaction(
+    user_id: str,
+    paper_id: str,
+    event_type: str,
+    source: str | None = None,
+    position: int | None = None,
+    query_id: str | None = None,
+) -> None:
+    async with aiosqlite.connect(DB_PATH) as db:
+        await db.execute(
+            """INSERT INTO interactions
+               (user_id, paper_id, event_type, source, position, query_id)
+               VALUES (?, ?, ?, ?, ?, ?)""",
+            (user_id, paper_id, event_type, source, position, query_id),
+        )
+        await db.commit()
+async def get_user_interactions(
+    user_id: str, event_types: list[str] | None = None, limit: int = 50
+) -> list[dict]:
+    """Return recent interactions for a user, optionally filtered by event type."""
+    async with aiosqlite.connect(DB_PATH) as db:
+        db.row_factory = aiosqlite.Row
+        if event_types:
+            placeholders = ",".join("?" * len(event_types))
+            cur = await db.execute(
+                f"""SELECT paper_id, event_type, timestamp
+                    FROM interactions
+                    WHERE user_id = ?
+                      AND event_type IN ({placeholders})
+                    ORDER BY timestamp DESC
+                    LIMIT ?""",
+                [user_id, *event_types, limit],
+            )
+        else:
+            cur = await db.execute(
+                """SELECT paper_id, event_type, timestamp
+                   FROM interactions
+                   WHERE user_id = ?
+                   ORDER BY timestamp DESC
+                   LIMIT ?""",
+                (user_id, limit),
+            )
+        rows = await cur.fetchall()
+        return [dict(r) for r in rows]
+# ── Qdrant map helpers ────────────────────────────────────────────────────────
+async def get_qdrant_id(arxiv_id: str) -> int | None:
+    async with aiosqlite.connect(DB_PATH) as db:
+        cur = await db.execute(
+            "SELECT qdrant_point_id FROM paper_qdrant_map WHERE arxiv_id = ?",
+            (arxiv_id,),
+        )
+        row = await cur.fetchone()
+        return row[0] if row else None
+async def save_qdrant_id(arxiv_id: str, qdrant_point_id: int) -> None:
+    async with aiosqlite.connect(DB_PATH) as db:
+        await db.execute(
+            """INSERT OR REPLACE INTO paper_qdrant_map (arxiv_id, qdrant_point_id)
+               VALUES (?, ?)""",
+            (arxiv_id, qdrant_point_id),
+        )
+        await db.commit()
+async def get_qdrant_ids_batch(arxiv_ids: list[str]) -> dict[str, int]:
+    """Return {arxiv_id: qdrant_point_id} for all IDs found in cache."""
+    if not arxiv_ids:
+        return {}
+    async with aiosqlite.connect(DB_PATH) as db:
+        placeholders = ",".join("?" * len(arxiv_ids))
+        cur = await db.execute(
+            f"SELECT arxiv_id, qdrant_point_id FROM paper_qdrant_map WHERE arxiv_id IN ({placeholders})",
+            arxiv_ids,
+        )
+        rows = await cur.fetchall()
+        return {r[0]: r[1] for r in rows}
+# ── Metadata cache helpers ────────────────────────────────────────────────────
+async def get_cached_metadata(arxiv_id: str) -> dict | None:
+    async with aiosqlite.connect(DB_PATH) as db:
+        db.row_factory = aiosqlite.Row
+        cur = await db.execute(
+            "SELECT * FROM paper_metadata WHERE arxiv_id = ?", (arxiv_id,)
+        )
+        row = await cur.fetchone()
+        return dict(row) if row else None
+async def cache_metadata(paper: dict) -> None:
+    """Upsert paper metadata dict into cache. Expects 'arxiv_id' key."""
+    async with aiosqlite.connect(DB_PATH) as db:
+        await db.execute(
+            """INSERT OR REPLACE INTO paper_metadata
+               (arxiv_id, title, abstract, authors, category, published)
+               VALUES (:arxiv_id, :title, :abstract, :authors, :category, :published)""",
+            paper,
+        )
+        await db.commit()
+async def get_cached_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
+    """Return {arxiv_id: metadata_dict} for all IDs found in cache."""
+    if not arxiv_ids:
+        return {}
+    async with aiosqlite.connect(DB_PATH) as db:
+        db.row_factory = aiosqlite.Row
+        placeholders = ",".join("?" * len(arxiv_ids))
+        cur = await db.execute(
+            f"SELECT * FROM paper_metadata WHERE arxiv_id IN ({placeholders})",
+            arxiv_ids,
+        )
+        rows = await cur.fetchall()
+        return {r["arxiv_id"]: dict(r) for r in rows}
+# ── User profile helpers (Phase 2a) ──────────────────────────────────────────
+async def get_user_profile(user_id: str, profile_type: str) -> dict | None:
+    """Return profile row as dict, or None if not found."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        conn.row_factory = aiosqlite.Row
+        cur = await conn.execute(
+            "SELECT vector, interaction_count FROM user_profiles "
+            "WHERE user_id = ? AND profile_type = ?",
+            (user_id, profile_type),
+        )
+        row = await cur.fetchone()
+        return dict(row) if row else None
+async def upsert_user_profile(
+    user_id: str,
+    profile_type: str,
+    vector: bytes,
+    interaction_count: int,
+) -> None:
+    """Insert or update a user profile embedding."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        await conn.execute(
+            """INSERT INTO user_profiles
+               (user_id, profile_type, vector, interaction_count, updated_at)
+               VALUES (?, ?, ?, ?, datetime('now'))
+               ON CONFLICT(user_id, profile_type) DO UPDATE SET
+                 vector = excluded.vector,
+                 interaction_count = excluded.interaction_count,
+                 updated_at = excluded.updated_at""",
+            (user_id, profile_type, vector, interaction_count),
+        )
+        await conn.commit()
+# ── User cluster helpers (Phase 2b) ──────────────────────────────────────────
+async def save_user_clusters(user_id: str, clusters: list[dict]) -> None:
+    """Replace all clusters for a user with new ones."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        await conn.execute(
+            "DELETE FROM user_clusters WHERE user_id = ?", (user_id,)
+        )
+        for c in clusters:
+            await conn.execute(
+                """INSERT INTO user_clusters
+                   (user_id, cluster_idx, medoid_paper_id, importance, paper_ids)
+                   VALUES (?, ?, ?, ?, ?)""",
+                (user_id, c["cluster_idx"], c["medoid_paper_id"],
+                 c["importance"], c["paper_ids"]),
+            )
+        await conn.commit()
+async def get_user_clusters(user_id: str) -> list[dict]:
+    """Return clusters for a user, ordered by importance desc."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        conn.row_factory = aiosqlite.Row
+        cur = await conn.execute(
+            """SELECT cluster_idx, medoid_paper_id, importance, paper_ids, computed_at
+               FROM user_clusters
+               WHERE user_id = ?
+               ORDER BY importance DESC""",
+            (user_id,),
+        )
+        rows = await cur.fetchall()
+        return [dict(r) for r in rows]

app/embed_svc.py ADDED

	@@ -0,0 +1,106 @@

+"""
+BGE-M3 embedding model singleton — Phase 3.
+Responsibilities:
+  - Load BAAI/bge-m3 once (lazily on first call or eagerly via get_model())
+  - encode_query(text) → (dense: np.ndarray[1024], sparse: dict[int, float])
+  - LRU cache on query text to avoid re-encoding repeats
+  - CPU float32, no GPU dependency
+  - Thread-safe (model is read-only after load)
+"""
+from __future__ import annotations
+import threading
+from functools import lru_cache
+import numpy as np
+from app import config
+# ── Module-level singleton ────────────────────────────────────────────────────
+_model = None
+_model_lock = threading.Lock()
+def get_model():
+    """
+    Return the BGE-M3 model singleton.  Thread-safe, loads once.
+    Called eagerly in main.py lifespan so the first request doesn't pay
+    the ~15 s model-download cost.
+    """
+    global _model
+    if _model is not None:
+        return _model
+    with _model_lock:
+        # Double-check after acquiring lock
+        if _model is not None:
+            return _model
+        from FlagEmbedding import BGEM3FlagModel
+        print(f"[embed_svc] Loading {config.BGE_M3_MODEL} on {config.BGE_M3_DEVICE}…")
+        # use_fp16=False on CPU (fp16 requires CUDA)
+        use_fp16 = config.BGE_M3_DEVICE != "cpu"
+        _model = BGEM3FlagModel(
+            config.BGE_M3_MODEL,
+            use_fp16=use_fp16,
+            device=config.BGE_M3_DEVICE,
+        )
+        print("[embed_svc] Model loaded successfully")
+        return _model
+# ── Cached query encoding ────────────────────────────────────────────────────
+@lru_cache(maxsize=config.ENCODE_CACHE_SIZE)
+def _encode_cached(text: str) -> tuple:
+    """
+    Encode a single query string.  Returns (dense_vec, sparse_dict).
+    The LRU cache key is the raw text string.  Cached results avoid
+    re-running BGE-M3 inference for repeated queries.
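+    lru_cache only requires the argument (text) to be hashable.  The
+    returned array and dict are shared between callers on cache hits,
+    so treat them as read-only.  The caller unpacks the tuple.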
+    """
+    model = get_model()
+    out = model.encode(
+        [text],
+        return_dense=True,
+        return_sparse=True,
+        return_colbert_vecs=False,
+        max_length=512,
+    )
+    dense = out["dense_vecs"][0]          # shape (1024,) float32
+    sparse = out["lexical_weights"][0]    # dict {token_id_int: float}
+    # Ensure dense is a numpy array (model may return tensor)
+    if not isinstance(dense, np.ndarray):
+        dense = np.array(dense, dtype=np.float32)
+    # Ensure sparse values are plain floats (not tensors)
+    sparse_clean = {int(k): float(v) for k, v in sparse.items()}
+    return (dense, sparse_clean)
+def encode_query(text: str) -> tuple[np.ndarray, dict[int, float]]:
+    """
+    Encode a query string into dense + sparse representations.
+    Args:
+        text: User's search query (raw or rewritten).
+    Returns:
+        (dense_vec, sparse_dict) where:
+          dense_vec:   np.ndarray of shape (1024,), float32
+          sparse_dict: {int_token_id: float_weight}
+    """
+    text = text.strip()
+    if not text:
+        return np.zeros(1024, dtype=np.float32), {}
+    return _encode_cached(text)
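
Because `lru_cache` hands the same objects back on every hit, a caller that mutates a cached result changes it for all later callers. The sketch below illustrates the caching behaviour with a stand-in encoder instead of BGE-M3. `fake_encode`, `encode_cached`, and the `calls` counter are hypothetical, and the read-only wrapping is one possible safeguard, not what the commit does.

```python
from functools import lru_cache
from types import MappingProxyType

calls = {"n": 0}  # counts how often the "model" actually runs


def fake_encode(text: str) -> tuple[tuple[float, ...], dict[int, float]]:
    """Stand-in for model.encode: returns a tiny dense vector and sparse dict."""
    calls["n"] += 1
    dense = tuple(float(len(text)) for _ in range(4))
    sparse = {i: 1.0 for i, _ in enumerate(text.split())}
    return dense, sparse


@lru_cache(maxsize=128)
def encode_cached(text: str):
    """Cache by query text; return immutable views so hits cannot be corrupted."""
    dense, sparse = fake_encode(text)
    return dense, MappingProxyType(sparse)  # read-only mapping view


if __name__ == "__main__":
    encode_cached("graph neural networks")
    encode_cached("graph neural networks")  # served from the cache
    print(calls["n"], encode_cached.cache_info().hits)
```

With a NumPy array, the analogous safeguard would be `arr.setflags(write=False)` before returning it.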

app/groq_svc.py ADDED

	@@ -0,0 +1,154 @@

+"""
+Groq LLM query rewriter — Phase 3.
+Responsibilities:
+  - Rewrite casual user queries into dense academic keyword strings
+  - Uses llama-3.3-70b-versatile via Groq's ultra-fast inference
+  - Falls back to original query on ANY error or timeout
+  - Skips rewriting for queries that already look academic
+  - This is an ENHANCEMENT, not a dependency — search works without it
+"""
+from __future__ import annotations
+import re
+import threading
+from app import config
+# ── Client singleton ─────────────────────────────────────────────────────────
+_client = None
+_client_lock = threading.Lock()
+def _get_client():
+    """Lazy Groq client init — only connects when first query arrives."""
+    global _client
+    if _client is not None:
+        return _client
+    if not config.GROQ_API_KEY:
+        return None
+    with _client_lock:
+        if _client is not None:
+            return _client
+        from groq import Groq
+        _client = Groq(api_key=config.GROQ_API_KEY)
+        print("[groq_svc] Groq client initialized")
+        return _client
+# ── Rewrite prompt ───────────────────────────────────────────────────────────
+_SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.
+Your job: Convert casual or vague user queries into dense, keyword-rich academic search strings that will match arXiv paper titles and abstracts.
+Rules:
+1. Output ONLY the rewritten query string — no explanation, no quotes, no preamble.
+2. Include standard academic terms, model names, acronyms, and author-style keywords.
+3. Keep the output to 8-15 words maximum.
+4. If the query already looks academic, return it with minimal changes.
+Examples:
+User: "when AI makes up fake facts"
+Output: LLM hallucination factual errors sycophancy truthfulness survey
+User: "the llama model by facebook"
+Output: LLaMA open efficient foundation language model Meta AI
+User: "how to make images from text"
+Output: text-to-image generation diffusion models latent space
+User: "papers about making language models smaller"
+Output: language model compression distillation pruning quantization efficient
+User: "whisper speech recognition"
+Output: Whisper OpenAI automatic speech recognition multilingual"""
+# ── Heuristic: should we skip rewriting? ─────────────────────────────────────
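+_ACADEMIC_PATTERN = re.compile(
+    r"""(?:
+        \d{4}\.\d{4,5}              |   # arXiv ID
+        [A-Z]{2,}                   |   # Acronyms like LLM, NLP, BERT (case-sensitive)
+        (?i:transformer|attention)  |
+        (?i:neural|network)         |
+        (?i:\b(?:et\s+al|arxiv)\b)
+    )""",
+    # No global re.IGNORECASE: it would make [A-Z]{2,} match every ordinary
+    # word, so any query longer than six words would skip rewriting.
+    re.VERBOSE,
+)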
+def _looks_academic(query: str) -> bool:
+    """Heuristic: skip rewriting if query already has academic terms."""
+    words = query.split()
+    if len(words) > 6:
+        matches = len(_ACADEMIC_PATTERN.findall(query))
+        if matches >= 2:
+            return True
+    return False
+# ── Public API ───────────────────────────────────────────────────────────────
+async def rewrite(query: str) -> str:
+    """
+    Rewrite a user query into an academic search string using Groq LLM.
+    Falls back to the original query on ANY error — this function never
+    raises exceptions and never blocks the search pipeline.
+    Args:
+        query: Raw user search query.
+    Returns:
+        Rewritten academic query string, or original query on error.
+    """
+    query = query.strip()
+    if not query:
+        return query
+    # Skip if already academic-looking
+    if _looks_academic(query):
+        return query
+    client = _get_client()
+    if client is None:
+        return query  # No API key configured — skip
+    try:
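+        import asyncio
+        # Run the blocking SDK call in a worker thread
+        loop = asyncio.get_running_loop()
+        result = await loop.run_in_executor(None, _run_rewrite, client, query)
+        # message.content may be None; treat that as an empty rewrite
+        rewritten = (result or "").strip().strip('"').strip("'").strip()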
+        # Sanity check: rewritten should be non-empty and not absurdly long
+        if not rewritten or len(rewritten) > 200:
+            return query
+        return rewritten
+    except Exception as e:
+        print(f"[groq_svc] Rewrite failed, using original query: {e}")
+        return query
+def _run_rewrite(client, query: str) -> str:
+    """Sync helper: call Groq chat completion with timeout."""
+    response = client.chat.completions.create(
+        messages=[
+            {"role": "system", "content": _SYSTEM_PROMPT},
+            {"role": "user", "content": query},
+        ],
+        model="llama-3.3-70b-versatile",
+        temperature=0.1,
+        max_tokens=60,
+        timeout=2.0,  # Hard 2s timeout — search must not stall
+    )
+    return response.choices[0].message.content
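
The fallback contract described above (return the original query on any error, timeout, or implausible output) can be exercised without a Groq client. The sketch below takes the rewriter as a plain callable. `rewrite_or_original` and its parameters are hypothetical, and the outer `asyncio.wait_for` is an additional guard that the commit's code does not use: the commit relies on the SDK's own `timeout` argument.

```python
import asyncio
from typing import Callable, Optional


async def rewrite_or_original(
    query: str,
    rewriter: Callable[[str], Optional[str]],
    timeout: float = 2.0,
    max_len: int = 200,
) -> str:
    """Run a blocking rewriter in a thread; fall back to query on any problem."""
    try:
        # to_thread keeps the event loop free while the blocking call runs;
        # wait_for bounds how long we are willing to wait for it.
        out = await asyncio.wait_for(asyncio.to_thread(rewriter, query), timeout)
    except Exception:
        return query  # covers timeouts and rewriter errors alike
    cleaned = (out or "").strip().strip('"').strip("'").strip()
    # Sanity check mirrors the one above: non-empty and not absurdly long.
    return cleaned if cleaned and len(cleaned) <= max_len else query


if __name__ == "__main__":
    print(asyncio.run(rewrite_or_original("ai lies", lambda q: '"LLM hallucination"')))
```

Note that a timed-out worker thread is not cancelled; it finishes in the background and its result is discarded.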

app/hybrid_search_svc.py ADDED

	@@ -0,0 +1,200 @@

+"""
+Hybrid search orchestrator — Phase 3.
+Orchestrates the full pipeline:
+  1. LLM rewrite (optional, via Groq)
+  2. BGE-M3 encode → dense + sparse
+  3. Parallel search: Qdrant dense + Zilliz sparse
+  4. RRF fusion (K=60)
+  5. Recency rerank: 0.80 × RRF + 0.20 × recency
+  6. Return ranked arxiv_ids
+Doc 06 confirms: RRF is correct for search (fusing different retrievers
+answering the SAME query).  This is different from recommendations where
+quota is correct (fusing different queries for the SAME user).
+"""
+from __future__ import annotations
+import asyncio
+from datetime import datetime
+from app import config
+from app import embed_svc
+from app import qdrant_svc
+from app import zilliz_svc
+from app import groq_svc
+# ── Public API ───────────────────────────────────────────────────────────────
+async def search(
+    query: str,
+    limit: int = 10,
+    use_rewrite: bool = True,
+) -> list[str]:
+    """
+    Hybrid semantic search — returns a list of arxiv_ids ranked by
+    fused relevance.
+    Pipeline:
+      rewrite → encode → parallel(dense, sparse) → RRF → rerank
+    Args:
+        query: User's raw search query.
+        limit: Number of results to return.
+        use_rewrite: Whether to attempt LLM query rewriting.
+    Returns:
+        list of arxiv_id strings, sorted by final score descending.
+        Never raises — returns empty list on total failure.
+    """
+    query = query.strip()
+    if not query:
+        return []
+    # ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
+    search_query = query
+    if use_rewrite:
+        try:
+            search_query = await groq_svc.rewrite(query)
+        except Exception:
+            search_query = query  # Fallback guaranteed
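+    # ── Step 2: BGE-M3 encode (dense + sparse in one pass) ───────────────
+    # encode_query is synchronous and CPU-bound; run it in a worker thread
+    # so it does not stall the event loop.
+    try:
+        dense_vec, sparse_dict = await asyncio.to_thread(
+            embed_svc.encode_query, search_query
+        )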
+    except Exception as e:
+        print(f"[hybrid_search] Encoding failed: {e}")
+        return []
+    # How many candidates to fetch before reranking
+    fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
+    # ── Step 3: Parallel dense + sparse search ───────────────────────────
+    dense_results, sparse_results = await asyncio.gather(
+        qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k),
+        zilliz_svc.search_sparse(sparse_dict, limit=fetch_k),
+        return_exceptions=True,
+    )
+    # Handle individual failures gracefully
+    if isinstance(dense_results, Exception):
+        print(f"[hybrid_search] Dense search failed: {dense_results}")
+        dense_results = []
+    if isinstance(sparse_results, Exception):
+        print(f"[hybrid_search] Sparse search failed: {sparse_results}")
+        sparse_results = []
+    if not dense_results and not sparse_results:
+        return []
+    # ── Step 4: RRF fusion ───────────────────────────────────────────────
+    fused = _rrf_fuse(dense_results, sparse_results, k=config.SEARCH_RRF_K)
+    if not fused:
+        return []
+    # ── Step 5: Recency rerank ───────────────────────────────────────────
+    ranked = _recency_rerank(fused)
+    # ── Step 6: Return top results ───────────────────────────────────────
+    return [item["arxiv_id"] for item in ranked[:limit]]
+# ── RRF fusion ───────────────────────────────────────────────────────────────
+def _rrf_fuse(
+    dense_results: list[dict],
+    sparse_results: list[dict],
+    k: int = 60,
+) -> list[dict]:
+    """
+    Reciprocal Rank Fusion — merges results from dense and sparse search.
+    score[paper] = 1/(k + rank_dense) + 1/(k + rank_sparse)
+    RRF is rank-based, so raw scores from different systems don't need
+    normalization — this is why it works for fusing Qdrant cosine scores
+    with Zilliz IP scores.
+    Args:
+        dense_results: list of {'arxiv_id': str, 'score': float} from Qdrant
+        sparse_results: list of {'arxiv_id': str, 'score': float} from Zilliz
+        k: RRF constant (default 60)
+    Returns:
+        list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc
+    """
+    scores: dict[str, float] = {}
+    # Dense contributions (rank = position in sorted list, 1-indexed)
+    for rank, item in enumerate(dense_results, start=1):
+        aid = item["arxiv_id"]
+        scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
+    # Sparse contributions
+    for rank, item in enumerate(sparse_results, start=1):
+        aid = item["arxiv_id"]
+        scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
+    # Sort by fused score descending
+    fused = [
+        {"arxiv_id": aid, "rrf_score": score}
+        for aid, score in scores.items()
+    ]
+    fused.sort(key=lambda x: x["rrf_score"], reverse=True)
+    return fused
+# ── Recency rerank ───────────────────────────────────────────────────────────
+def _recency_rerank(fused: list[dict]) -> list[dict]:
+    """
+    Apply recency boost to RRF scores.
+    final_score = SEARCH_SEMANTIC_WEIGHT × norm_rrf + SEARCH_RECENCY_WEIGHT × recency
+    Recency is estimated from the arXiv ID (YYMM format) since we don't have
+    publication dates at this stage.  Papers not parseable get neutral score.
+    The semantic weight (0.80) ensures RRF dominates, while recency (0.20)
+    provides a mild boost to newer papers.
+    """
+    if not fused:
+        return fused
+    # Normalize RRF scores to [0, 1]
+    max_rrf = max(item["rrf_score"] for item in fused)
+    min_rrf = min(item["rrf_score"] for item in fused)
+    rrf_range = max_rrf - min_rrf if max_rrf != min_rrf else 1.0
+    now = datetime.now()  # read the clock once so year and month agree
+    now_ym = now.year * 12 + now.month
+    for item in fused:
+        # Normalize RRF to [0, 1]
+        norm_rrf = (item["rrf_score"] - min_rrf) / rrf_range
+        # Estimate recency from arXiv ID (format: YYMM.NNNNN)
+        recency = 0.5  # neutral default
+        aid = item["arxiv_id"]
+        try:
+            parts = aid.split(".")
+            if len(parts) >= 2 and len(parts[0]) == 4:
+                yy = int(parts[0][:2])
+                mm = int(parts[0][2:4])
+                year = 2000 + yy  # YYMM IDs began in 2007, so YY means 20YY
+                paper_ym = year * 12 + mm
+                months_ago = max(0, now_ym - paper_ym)
+                # Decay: recent papers get ~1.0, 10-year-old papers get ~0.0
+                recency = max(0.0, 1.0 - months_ago / 120.0)
+        except (ValueError, IndexError):
+            pass
+        item["final_score"] = (
+            config.SEARCH_SEMANTIC_WEIGHT * norm_rrf
+            + config.SEARCH_RECENCY_WEIGHT * recency
+        )
+    fused.sort(key=lambda x: x["final_score"], reverse=True)
+    return fused
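
A small worked example may help make the fusion and recency arithmetic concrete. The sketch below re-implements both steps in isolation on toy inputs. `rrf_fuse` and `recency_from_id` are simplified, hypothetical stand-ins for `_rrf_fuse` and the recency logic above, and the current date is passed in explicitly so the numbers are reproducible.

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60):
    """Reciprocal Rank Fusion over two ranked ID lists (1-indexed ranks)."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, aid in enumerate(ranked, start=1):
            # A paper missing from one list simply gets no term from it.
            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def recency_from_id(arxiv_id: str, now_year: int, now_month: int) -> float:
    """Linear decay over 120 months, estimated from a YYMM.NNNNN ID."""
    parts = arxiv_id.split(".")
    if len(parts) < 2 or len(parts[0]) != 4 or not parts[0].isdigit():
        return 0.5  # neutral score for IDs we cannot parse
    paper_ym = (2000 + int(parts[0][:2])) * 12 + int(parts[0][2:])
    months_ago = max(0, now_year * 12 + now_month - paper_ym)
    return max(0.0, 1.0 - months_ago / 120.0)


if __name__ == "__main__":
    # "B" is ranked 2nd by dense and 1st by sparse: 1/62 + 1/61
    # "A" is ranked 1st by dense and 3rd by sparse: 1/61 + 1/63
    for aid, score in rrf_fuse(["A", "B", "C"], ["B", "D", "A"]):
        print(aid, round(score, 6))
    # A June 2017 paper viewed in June 2025 is 96 months old: 1 - 96/120
    print(recency_from_id("1706.03762", 2025, 6))
```

In the toy lists, `B` edges out `A` because its two ranks (2 and 1) are jointly better than `A`'s (1 and 3), even though `A` tops the dense list. Papers that appear in only one list (`C`, `D`) score roughly half as much.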

app/main.py ADDED

	@@ -0,0 +1,63 @@

+"""
+FastAPI application entry point.
+Routes:
+  GET  /          → home (recs + search bar)
+  GET  /search    → search router
+  POST /api/papers/{id}/save           → events router
+  POST /api/papers/{id}/not-interested → events router
+  GET  /api/recommendations            → recommendations router
+"""
+import uuid
+from contextlib import asynccontextmanager
+from fastapi import FastAPI, Request, Cookie
+from fastapi.responses import HTMLResponse
+from app import db
+from app.config import APP_TITLE, COOKIE_NAME
+from app.templates_env import templates
+from app.routers import search, events, recommendations, saved
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    await db.init_db()
+    # Phase 3: Warm up BGE-M3 at startup (graceful — app works without it)
+    try:
+        import asyncio
+        from app import embed_svc
+        loop = asyncio.get_running_loop()
+        await loop.run_in_executor(None, embed_svc.get_model)
+        print("[main] BGE-M3 model loaded — hybrid search ready")
+    except Exception as e:
+        print(f"[main] BGE-M3 not loaded ({e}) — search will fall back to arXiv API")
+    yield
+app = FastAPI(title=APP_TITLE, lifespan=lifespan)
+app.include_router(search.router)
+app.include_router(events.router)
+app.include_router(recommendations.router)
+app.include_router(saved.router)
+@app.get("/", response_class=HTMLResponse)
+async def home(
+    request: Request,
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    user_id = user_id or str(uuid.uuid4())
+    from app import user_state as us
+    state = await us.ensure_loaded(user_id)
+    resp = templates.TemplateResponse(
+        request,
+        "index.html",
+        {
+            "has_recs": state.has_enough_for_recs(),
+            "save_count": len(state.positives),
+        },
+    )
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp
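The graceful warm-up in `lifespan` follows a general pattern: run a blocking loader in the default executor and treat any failure as "feature unavailable" instead of aborting startup. A minimal sketch, assuming hypothetical names `warm_up` and `broken_loader` that do not appear in the commit:

```python
import asyncio


def broken_loader():
    # Stand-in for a model loader that fails (e.g. missing weights).
    raise RuntimeError("weights not found")


async def warm_up(loader) -> bool:
    """Run a blocking loader off the event loop; report success instead of raising."""
    loop = asyncio.get_running_loop()
    try:
        await loop.run_in_executor(None, loader)
        return True
    except Exception as exc:
        print(f"model not loaded ({exc}); falling back")
        return False


ok = asyncio.run(warm_up(lambda: "model"))
failed = asyncio.run(warm_up(broken_loader))
```

The exception raised inside the worker thread is re-raised at the `await`, which is why a single `try` around it is enough.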

app/qdrant_svc.py ADDED

	@@ -0,0 +1,381 @@

+"""
+Qdrant service layer.
+Phase 1: Recommend API (BEST_SCORE with positive/negative IDs)
+Phase 2a: Vector search using EWMA profile embeddings
+Phase 2b: Multi-interest prefetch + RRF fusion (multiple ANN queries in one call)
+Phase 3: Dense ANN search returning scores for the hybrid search pipeline
+The collection is 'arxiv_bgem3_dense' with integer point IDs and 1024-dim BGE-M3 vectors.
+"""
+from __future__ import annotations
+import asyncio
+from functools import lru_cache
+from qdrant_client import QdrantClient
+from qdrant_client.models import (
+    Filter,
+    FieldCondition,
+    MatchAny,
+    MatchValue,
+    RecommendStrategy,
+    RecommendQuery,
+    RecommendInput,
+    Prefetch,
+    FusionQuery,
+    Fusion,
+)
+from app import config, db
+# ── Client (sync, thread-safe, reused across requests) ───────────────────────
+@lru_cache(maxsize=1)
+def _client() -> QdrantClient:
+    return QdrantClient(
+        url=config.QDRANT_URL,
+        api_key=config.QDRANT_API_KEY,
+        timeout=30,
+        check_compatibility=False,
+    )
+# ── ID lookup ─────────────────────────────────────────────────────────────────
+async def lookup_qdrant_ids(arxiv_ids: list[str]) -> dict[str, int]:
+    """
+    Return {arxiv_id: qdrant_point_id} for every id that exists in the
+    collection.  Checks the local SQLite cache first; fetches missing ones
+    via Qdrant payload filter (requires the arxiv_id keyword index).
+    """
+    if not arxiv_ids:
+        return {}
+    # 1. Pull what we already know from SQLite
+    cached = await db.get_qdrant_ids_batch(arxiv_ids)
+    missing = [aid for aid in arxiv_ids if aid not in cached]
+    if missing:
+        # 2. Ask Qdrant: filter by arxiv_id
+        loop = asyncio.get_running_loop()
+        try:
+            results = await loop.run_in_executor(
+                None,
+                _scroll_by_arxiv_ids,
+                missing,
+            )
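+        except Exception as e:
+            # Log like the other helpers, then fall back to cached IDs only
+            print(f"[qdrant_svc] lookup_qdrant_ids error: {e}")
+            results = {}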
+        # 3. Persist new mappings
+        for arxiv_id, point_id in results.items():
+            await db.save_qdrant_id(arxiv_id, point_id)
+            cached[arxiv_id] = point_id
+    return cached
+def _scroll_by_arxiv_ids(arxiv_ids: list[str]) -> dict[str, int]:
+    """
+    Sync helper: scroll Qdrant filtering by arxiv_id payload.
+    Requires the keyword index created during setup.
+    """
+    client = _client()
+    pts, _ = client.scroll(
+        collection_name=config.QDRANT_COLLECTION,
+        scroll_filter=Filter(
+            must=[FieldCondition(key="arxiv_id", match=MatchAny(any=arxiv_ids))]
+        ),
+        limit=len(arxiv_ids),
+        with_payload=True,
+        with_vectors=False,
+    )
+    return {p.payload["arxiv_id"]: p.id for p in pts}
+# ── Recommend ─────────────────────────────────────────────────────────────────
+async def recommend(
+    positive_arxiv_ids: list[str],
+    negative_arxiv_ids: list[str],
+    seen_arxiv_ids: set[str],
+    limit: int = config.REC_LIMIT,
+) -> list[str]:
+    """
+    Call Qdrant Recommend API.
+    Returns a list of arxiv_ids (up to `limit`) sorted by Qdrant score,
+    excluding papers the user has already seen.
+    """
+    # Translate arxiv_ids → integer point IDs
+    all_ids = list(dict.fromkeys(positive_arxiv_ids + negative_arxiv_ids))
+    id_map = await lookup_qdrant_ids(all_ids)
+    pos_ids = [id_map[aid] for aid in positive_arxiv_ids if aid in id_map]
+    neg_ids = [id_map[aid] for aid in negative_arxiv_ids if aid in id_map]
+    if not pos_ids:
+        return []
+    # No must-not filter is sent to Qdrant: the seen list is applied in
+    # Python after over-fetching (limit * 2) below.
+    loop = asyncio.get_running_loop()
+    try:
+        results = await loop.run_in_executor(
+            None,
+            _run_recommend,
+            pos_ids,
+            neg_ids,
+            limit * 2,   # fetch extra so we can filter seen in Python
+        )
+    except Exception as e:
+        # Log and return empty rather than crashing the page
+        print(f"[qdrant_svc] recommend error: {e}")
+        return []
+    # Filter out seen papers, return top `limit`
+    filtered = [
+        r.payload["arxiv_id"]
+        for r in results
+        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in seen_arxiv_ids
+    ]
+    return filtered[:limit]
+def _run_recommend(
+    pos_ids: list[int],
+    neg_ids: list[int],
+    limit: int,
+) -> list:
+    """Sync helper — uses query_points with RecommendQuery (modern API)."""
+    client = _client()
+    result = client.query_points(
+        collection_name=config.QDRANT_COLLECTION,
+        query=RecommendQuery(
+            recommend=RecommendInput(
+                positive=pos_ids,
+                negative=neg_ids if neg_ids else [],
+                strategy=RecommendStrategy.BEST_SCORE,
+            )
+        ),
+        limit=limit,
+        with_payload=True,
+        with_vectors=False,
+    )
+    return result.points
+# ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
+async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
+    """
+    Fetch actual BGE-M3 embedding vectors for papers from Qdrant.
+    Returns {arxiv_id: vector_list} for papers found.
+    Used by EWMA profile updates — we need the paper's embedding
+    to blend into the user's profile vector.
+    """
+    if not arxiv_ids:
+        return {}
+    id_map = await lookup_qdrant_ids(arxiv_ids)
+    if not id_map:
+        return {}
+    point_ids = list(id_map.values())
+    arxiv_by_point = {v: k for k, v in id_map.items()}
+    loop = asyncio.get_running_loop()
+    try:
+        points = await loop.run_in_executor(
+            None, _get_vectors_by_ids, point_ids
+        )
+    except Exception as e:
+        print(f"[qdrant_svc] get_paper_vectors error: {e}")
+        return {}
+    result = {}
+    for p in points:
+        aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
+        if aid and p.vector:
+            # p.vector may be a dict if named vectors are used
+            vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
+            if isinstance(vec, list):
+                result[aid] = vec
+    return result
+def _get_vectors_by_ids(point_ids: list[int]) -> list:
+    """Sync helper: retrieve points with their vectors."""
+    client = _client()
+    points = client.retrieve(
+        collection_name=config.QDRANT_COLLECTION,
+        ids=point_ids,
+        with_payload=True,
+        with_vectors=True,
+    )
+    return points
+async def search_by_vector(
+    query_vector: list[float],
+    limit: int = 20,
+    exclude_ids: set[str] | None = None,
+) -> list[str]:
+    """
+    Raw vector search — find papers similar to a given embedding.
+    Returns list of arxiv_ids, excluding any in exclude_ids.
+    Used when EWMA profile vectors are available (Phase 2a+).
+    """
+    loop = asyncio.get_running_loop()
+    try:
+        results = await loop.run_in_executor(
+            None, _run_vector_search, query_vector, (limit * 2) if exclude_ids else limit,
+        )
+    except Exception as e:
+        print(f"[qdrant_svc] search_by_vector error: {e}")
+        return []
+    exclude = exclude_ids or set()
+    filtered = [
+        r.payload["arxiv_id"]
+        for r in results
+        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in exclude
+    ]
+    return filtered[:limit]
+def _run_vector_search(query_vector: list[float], limit: int) -> list:
+    """Sync helper: nearest-neighbour search by vector."""
+    client = _client()
+    result = client.query_points(
+        collection_name=config.QDRANT_COLLECTION,
+        query=query_vector,
+        limit=limit,
+        with_payload=True,
+        with_vectors=False,
+    )
+    return result.points
+# ── Phase 3: Dense search for hybrid search pipeline ─────────────────────────
+async def search_dense(
+    dense_vec: list[float],
+    limit: int = 50,
+) -> list[dict]:
+    """
+    ANN dense search for the hybrid search pipeline (Phase 3).
+    Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
+    score desc.  Different from search_by_vector() which returns only
+    arxiv_ids — this version returns scores needed for RRF fusion.
+    """
+    loop = asyncio.get_running_loop()
+    try:
+        results = await loop.run_in_executor(
+            None, _run_dense_search, dense_vec, limit,
+        )
+    except Exception as e:
+        print(f"[qdrant_svc] search_dense error: {e}")
+        return []
+    return [
+        {"arxiv_id": r.payload["arxiv_id"], "score": r.score}
+        for r in results
+        if r.payload.get("arxiv_id")
+    ]
+def _run_dense_search(query_vector: list[float], limit: int) -> list:
+    """Sync helper: ANN search returning scored results for RRF."""
+    client = _client()
+    result = client.query_points(
+        collection_name=config.QDRANT_COLLECTION,
+        query=query_vector,
+        limit=limit,
+        with_payload=True,
+        with_vectors=False,
+    )
+    return result.points
+# ── Phase 2b: Multi-interest prefetch + RRF fusion ───────────────────────────
+async def multi_interest_search(
+    interest_vectors: list[tuple[list[float], int]],
+    short_term_vector: list[float] | None = None,
+    exclude_ids: set[str] | None = None,
+    total_limit: int = 100,
+) -> list[str]:
+    """
+    Multi-interest retrieval using Qdrant prefetch + RRF fusion.
+    Sends multiple ANN queries (one per interest cluster + optional session
+    vector) in a SINGLE API call.  Qdrant runs them in parallel server-side
+    and fuses results via Reciprocal Rank Fusion (k=60).
+    Args:
+        interest_vectors: list of (medoid_embedding, per_cluster_limit) tuples,
+                          ordered by importance (highest first)
+        short_term_vector: optional EWMA short-term session embedding
+        exclude_ids: arxiv_ids to filter out (already seen)
+        total_limit: how many candidates to return after fusion
+    Returns:
+        list of arxiv_ids sorted by fused relevance
+    Latency: ~15-25ms for 3-4 prefetch queries (single network round-trip)
+    """
+    # Build prefetch queries — one per interest cluster
+    prefetches = []
+    for vec, limit in interest_vectors:
+        prefetches.append(Prefetch(
+            query=vec,
+            limit=limit,
+        ))
+    # Add short-term session vector if available
+    if short_term_vector is not None:
+        prefetches.append(Prefetch(
+            query=short_term_vector,
+            limit=25,
+        ))
+    if not prefetches:
+        return []
+    loop = asyncio.get_running_loop()
+    try:
+        results = await loop.run_in_executor(
+            None,
+            _run_prefetch_rrf,
+            prefetches,
+            total_limit * 2 if exclude_ids else total_limit,
+        )
+    except Exception as e:
+        print(f"[qdrant_svc] multi_interest_search error: {e}")
+        return []
+    exclude = exclude_ids or set()
+    filtered = [
+        r.payload["arxiv_id"]
+        for r in results
+        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in exclude
+    ]
+    return filtered[:total_limit]
+def _run_prefetch_rrf(prefetches: list[Prefetch], limit: int) -> list:
+    """Sync helper: execute prefetch queries with RRF fusion."""
+    client = _client()
+    result = client.query_points(
+        collection_name=config.QDRANT_COLLECTION,
+        prefetch=prefetches,
+        query=FusionQuery(fusion=Fusion.RRF),
+        limit=limit,
+        with_payload=True,
+        with_vectors=False,
+    )
+    return result.points
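For intuition about what the server-side fusion computes, Reciprocal Rank Fusion can be sketched in a few lines of plain Python. This is the textbook formula with 1-based ranks and k=60, not Qdrant's implementation, whose exact rank offset may differ; `rrf_fuse` and the sample arXiv IDs are made up for illustration.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every list a document appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Two hypothetical per-cluster result lists, best match first
cluster_a = ["2401.00001", "2401.00002", "2401.00003"]
cluster_b = ["2401.00002", "2401.00004"]
fused = rrf_fuse([cluster_a, cluster_b])
fused_ids = [doc_id for doc_id, _ in fused]
```

A paper that appears in both lists ("2401.00002") rises to the top even though it is not first in `cluster_a`, which is the behaviour the multi-interest search relies on.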

app/recommend/__init__.py ADDED

	@@ -0,0 +1,4 @@

+# Recommendation engine — multi-interest architecture
+# Phase 2a: EWMA profile embeddings
+# Phase 2b: Ward clustering + Qdrant prefetch RRF
+# Phase 2c: LightGBM re-ranking + MMR diversity

app/recommend/clustering.py ADDED

	@@ -0,0 +1,205 @@

+"""
+Ward hierarchical clustering for multi-interest detection.
+Discovers K distinct interest clusters from the user's saved paper
+embeddings using Ward's method.  K is determined automatically by
+a distance threshold — not predefined.
+Each cluster is represented by its **medoid** (the actual paper
+embedding closest to cluster center), not the centroid.  This prevents
+"topic drift" into meaningless regions of embedding space.
+Reference: Research-MultiInterest_Recommender_Architecture.md §2
+  "PinnerSage's design choices: Ward hierarchical clustering on the
+   user's interacted item embeddings, with a threshold parameter α
+   controlling merge stopping — this automatically determines K per user"
+"""
+from __future__ import annotations
+import json
+from dataclasses import dataclass, field
+import numpy as np
+from scipy.cluster.hierarchy import ward, fcluster
+from scipy.spatial.distance import pdist
+from app import db
+# Fallback Ward cut distance. The actual cut point is chosen adaptively
+# from the largest gap in merge distances (see _adaptive_threshold); this
+# constant is returned only when the linkage matrix has no merges to inspect.
+WARD_DISTANCE_THRESHOLD = 1.5
+# Absolute limits on cluster count
+MIN_CLUSTERS = 1
+MAX_CLUSTERS = 7   # RFC: PinnerSage uses 3-5 for typical users, cap at 7
+# Minimum saved papers before clustering is meaningful
+MIN_PAPERS_FOR_CLUSTERING = 5
+@dataclass
+class InterestCluster:
+    """A single interest cluster derived from user behaviour."""
+    cluster_idx: int
+    medoid_paper_id: str
+    medoid_embedding: np.ndarray
+    paper_ids: list[str]
+    importance: float  # recency-weighted sum of interactions
+def _adaptive_threshold(linkage: np.ndarray) -> float:
+    """
+    Find the optimal cut point by detecting the largest gap in merge distances.
+    The linkage matrix has shape (n-1, 4). Column 2 is the merge distance.
+    We look at the differences between consecutive merge distances and pick
+    the cut just below the biggest jump — that's where the most distinct
+    clusters separate.
+    The result is clamped to [first merge distance, 0.7 × max merge distance].
+    """
+    merge_distances = linkage[:, 2]
+    if len(merge_distances) < 2:
+        return float(merge_distances[0]) if len(merge_distances) == 1 else WARD_DISTANCE_THRESHOLD
+    # Compute gaps between consecutive merges
+    gaps = np.diff(merge_distances)
+    # The biggest gap indicates the most natural cluster boundary
+    best_gap_idx = int(np.argmax(gaps))
+    # Cut midway between the merge BEFORE the biggest gap and the merge after it
+    # (i.e., allow all merges up to but not including the big jump)
+    threshold = float(merge_distances[best_gap_idx] + merge_distances[best_gap_idx + 1]) / 2.0
+    # Sanity: don't let it go below first merge or above reasonable max
+    min_t = float(merge_distances[0])
+    max_t = float(merge_distances[-1]) * 0.7
+    threshold = max(min_t, min(threshold, max_t))
+    return threshold
+def compute_clusters(
+    paper_ids: list[str],
+    embeddings: np.ndarray,
+    timestamps: list[str] | None = None,
+) -> list[InterestCluster]:
+    """
+    Cluster paper embeddings using Ward's hierarchical method.
+    Args:
+        paper_ids: arxiv_ids of saved papers (most recent first)
+        embeddings: shape (N, 1024), the BGE-M3 vectors for each paper
+        timestamps: optional ISO timestamps for recency weighting
+                    (currently unused; weights are derived from list position)
+    Returns:
+        List of InterestCluster sorted by importance (highest first).
+        Returns a single cluster (medoid of all) if N < MIN_PAPERS_FOR_CLUSTERING.
+    """
+    n = len(paper_ids)
+    assert embeddings.shape == (n, 1024), f"Expected ({n}, 1024), got {embeddings.shape}"
+    # Too few papers: return a single cluster represented by the paper nearest the centroid
+    if n < MIN_PAPERS_FOR_CLUSTERING:
+        centroid = embeddings.mean(axis=0)
+        medoid_idx = _find_medoid(embeddings, centroid)
+        return [InterestCluster(
+            cluster_idx=0,
+            medoid_paper_id=paper_ids[medoid_idx],
+            medoid_embedding=embeddings[medoid_idx],
+            paper_ids=list(paper_ids),
+            importance=float(n),
+        )]
+    # L2-normalize: Ward is Euclidean-only (Murtagh & Legendre 2014).
+    # On unit vectors, ‖a−b‖² = 2(1−cos(a,b)), giving cosine-Ward.
+    # BGE-M3 vectors are approximately unit-norm but can drift after
+    # EWMA blending or floating-point accumulation. (Doc 06, fault #4)
+    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+    norms = np.where(norms < 1e-10, 1.0, norms)
+    embeddings = embeddings / norms
+    # Compute pairwise distances and linkage
+    # Ward's method minimises intra-cluster variance
+    distances = pdist(embeddings, metric="euclidean")
+    linkage = ward(distances)
+    # Adaptive threshold: find the biggest gap in merge distances
+    threshold = _adaptive_threshold(linkage)
+    # Cut the dendrogram at the adaptive threshold
+    labels = fcluster(linkage, t=threshold, criterion="distance")
+    # Clamp cluster count
+    unique_labels = np.unique(labels)
+    n_clusters = len(unique_labels)
+    # If too many clusters, re-cut with a maxclust constraint
+    if n_clusters > MAX_CLUSTERS:
+        labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
+        unique_labels = np.unique(labels)
+    # Compute recency weights (position-based: most recent = highest weight)
+    recency_weights = np.array([
+        1.0 / (i + 1) for i in range(n)
+    ], dtype=np.float64)
+    # Build clusters
+    clusters = []
+    for cidx, label in enumerate(unique_labels):
+        mask = labels == label
+        cluster_embs = embeddings[mask]
+        cluster_ids = [paper_ids[i] for i in range(n) if mask[i]]
+        cluster_weights = recency_weights[mask]
+        # Centroid for medoid computation
+        centroid = cluster_embs.mean(axis=0)
+        medoid_local_idx = _find_medoid(cluster_embs, centroid)
+        # Importance: sum of recency weights for this cluster's papers
+        importance = float(cluster_weights.sum())
+        # Find the global paper_id index of the medoid
+        medoid_global_indices = np.where(mask)[0]
+        medoid_global_idx = medoid_global_indices[medoid_local_idx]
+        clusters.append(InterestCluster(
+            cluster_idx=cidx,
+            medoid_paper_id=paper_ids[medoid_global_idx],
+            medoid_embedding=embeddings[medoid_global_idx],
+            paper_ids=cluster_ids,
+            importance=importance,
+        ))
+    # Sort by importance (most important first)
+    clusters.sort(key=lambda c: c.importance, reverse=True)
+    return clusters
+def _find_medoid(embeddings: np.ndarray, centroid: np.ndarray) -> int:
+    """Find the index of the embedding closest to the centroid."""
+    distances = np.linalg.norm(embeddings - centroid, axis=1)
+    return int(np.argmin(distances))
+# ── Persistence ───────────────────────────────────────────────────────────────
+async def save_clusters_to_db(user_id: str, clusters: list[InterestCluster]) -> None:
+    """Persist clusters to SQLite."""
+    rows = [
+        {
+            "cluster_idx": c.cluster_idx,
+            "medoid_paper_id": c.medoid_paper_id,
+            "importance": c.importance,
+            "paper_ids": json.dumps(c.paper_ids),
+        }
+        for c in clusters
+    ]
+    await db.save_user_clusters(user_id, rows)
+async def load_clusters_from_db(user_id: str) -> list[dict] | None:
+    """Load clusters from SQLite.  Returns None if no clusters exist."""
+    rows = await db.get_user_clusters(user_id)
+    return rows if rows else None
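The largest-gap rule in `_adaptive_threshold` can be illustrated without SciPy. The sketch below works on a plain list of sorted merge distances (column 2 of a Ward linkage matrix); `largest_gap_threshold` and `clusters_at` are hypothetical helpers, and the cluster count assumes monotonically increasing merge distances, which holds for Ward linkage.

```python
def largest_gap_threshold(merge_distances: list[float], cap_ratio: float = 0.7) -> float:
    """Pick a cut height midway across the largest jump between consecutive merges."""
    if len(merge_distances) < 2:
        raise ValueError("need at least two merges")
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)  # first index of the largest gap
    threshold = (merge_distances[i] + merge_distances[i + 1]) / 2.0
    # Clamp to [first merge, cap_ratio * last merge], as in the module
    lo, hi = merge_distances[0], merge_distances[-1] * cap_ratio
    return max(lo, min(threshold, hi))


def clusters_at(merge_distances: list[float], threshold: float) -> int:
    """n points yield n-1 merges; each merge at or below the cut removes one cluster."""
    n_points = len(merge_distances) + 1
    return n_points - sum(1 for d in merge_distances if d <= threshold)


merges = [0.2, 0.25, 0.3, 1.1, 1.3]  # six papers, five merges
t = largest_gap_threshold(merges)
k = clusters_at(merges, t)
```

Here the jump from 0.3 to 1.1 dominates, so the cut lands near 0.7 and three clusters survive.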

app/recommend/diversity.py ADDED

	@@ -0,0 +1,143 @@

+"""
+MMR diversity enforcement + exploration injection.
+Maximal Marginal Relevance (Carbonell & Goldstein, 1998) selects items
+that are both relevant to the query AND diverse from each other.
+Formula:
+  MMR = argmax[λ × Sim(d_i, Q) − (1−λ) × max(Sim(d_i, d_j))]
+where d_j iterates over already-selected items.
+Reference: Research-MultiInterest_Recommender_Architecture.md §4
+  "MMR provides practical diversity enforcement ...
+   Setting λ=0.6 provides a good balance between relevance and diversity
+   for academic paper discovery."
+"""
+from __future__ import annotations
+import random
+import numpy as np
+def mmr_rerank(
+    query_embedding: np.ndarray,
+    candidate_embeddings: np.ndarray,
+    candidate_ids: list[str],
+    scores: list[float] | None = None,
+    lambda_param: float = 0.6,
+    top_k: int = 20,
+) -> list[str]:
+    """
+    Select top_k items from candidates using Maximal Marginal Relevance.
+    Args:
+        query_embedding: the user's profile vector (1024-dim)
+        candidate_embeddings: shape (N, 1024), embeddings for all candidates
+        candidate_ids: arxiv_ids for each candidate (same order as embeddings)
+        scores: optional pre-computed relevance scores (from LightGBM or RRF).
+                If None, uses cosine similarity to query_embedding.
+        lambda_param: balance between relevance (1.0) and diversity (0.0).
+                      RFC recommends 0.6 for academic papers.
+        top_k: how many items to select
+    Returns:
+        list of arxiv_ids in MMR-selected order
+    Latency: <1ms for 100 candidates with precomputed embeddings.
+    """
+    n = len(candidate_ids)
+    if n == 0:
+        return []
+    if n <= top_k:
+        return list(candidate_ids)
+    # Normalise candidates once; reused for relevance and pairwise similarity
+    cand_norms = candidate_embeddings / (
+        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-10
+    )
+    # Compute relevance scores
+    if scores is not None:
+        relevance = np.array(scores, dtype=np.float64)
+        # Normalise to [0, 1]
+        r_min, r_max = relevance.min(), relevance.max()
+        if r_max > r_min:
+            relevance = (relevance - r_min) / (r_max - r_min)
+        else:
+            relevance = np.ones(n, dtype=np.float64)
+    else:
+        # Cosine similarity to query
+        query_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-10)
+        relevance = cand_norms @ query_norm
+    # Precompute pairwise cosine similarity matrix
+    sim_matrix = cand_norms @ cand_norms.T
+    # Greedy MMR selection
+    selected_indices: list[int] = []
+    remaining = set(range(n))
+    for _ in range(min(top_k, n)):
+        best_score = -float("inf")
+        best_idx = -1
+        for idx in remaining:
+            # Relevance term
+            rel = lambda_param * relevance[idx]
+            # Diversity term: max similarity to any already-selected item
+            if selected_indices:
+                max_sim = max(sim_matrix[idx, j] for j in selected_indices)
+            else:
+                max_sim = 0.0
+            mmr_score = rel - (1.0 - lambda_param) * max_sim
+            if mmr_score > best_score:
+                best_score = mmr_score
+                best_idx = idx
+        if best_idx < 0:
+            break
+        selected_indices.append(best_idx)
+        remaining.discard(best_idx)
+    return [candidate_ids[i] for i in selected_indices]
+def inject_exploration(
+    selected_ids: list[str],
+    all_candidate_ids: list[str],
+    n_explore: int = 2,
+    seed: int | None = None,
+) -> list[str]:
+    """
+    Inject exploration papers from candidates not already selected.
+    Picks n_explore random papers from the unselected pool and appends
+    them to the end of the list.  This follows TikTok's 15-25% exploration
+    allocation principle for breaking filter bubbles.
+    Args:
+        selected_ids: the MMR-selected arxiv_ids
+        all_candidate_ids: the full candidate pool (before MMR selection)
+        n_explore: how many exploration papers to inject
+        seed: optional random seed for reproducibility
+    Returns:
+        selected_ids with exploration papers appended
+    """
+    selected_set = set(selected_ids)
+    pool = [cid for cid in all_candidate_ids if cid not in selected_set]
+    if not pool:
+        return list(selected_ids)
+    rng = random.Random(seed)
+    n_pick = min(n_explore, len(pool))
+    exploration = rng.sample(pool, n_pick)
+    return list(selected_ids) + exploration
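A toy run makes the effect of the MMR penalty visible. The sketch below is a dependency-free restatement of the greedy loop on 2-D unit vectors, not the module's NumPy code; `unit`, `cosine`, `mmr_select`, and the three candidates are invented for illustration.

```python
import math


def unit(deg: float) -> list[float]:
    """2-D unit vector at the given angle in degrees."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query, candidates: dict[str, list[float]], lam: float = 0.6, top_k: int = 2):
    """Greedy MMR: lam * relevance minus (1 - lam) * max similarity to picks so far."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_k:
        def score(cid: str) -> float:
            rel = cosine(candidates[cid], query)
            red = max((cosine(candidates[cid], candidates[s]) for s in selected), default=0.0)
            return lam * rel - (1.0 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


query = unit(0)
candidates = {"A": unit(20), "B": unit(25), "C": unit(-30)}  # B nearly duplicates A
diverse = mmr_select(query, candidates, lam=0.6)
greedy = mmr_select(query, candidates, lam=1.0)  # pure relevance, no diversity term
```

With λ=0.6 the near-duplicate B is skipped in favour of C, while λ=1.0 reduces to plain relevance ranking and keeps B.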

app/recommend/profiles.py ADDED

	@@ -0,0 +1,140 @@

+"""
+EWMA-based user profile embeddings.
+Maintains three vector profiles per user:
+  - long_term   (α=0.03): enduring research interests (~66-interaction window)
+  - short_term  (α=0.40): current session context (~3-5 interactions)
+  - negative    (α=0.15): topics the user dislikes
+Each profile is a 1024-dim L2-normalised vector (BGE-M3 cosine space).
+Storage: binary numpy blobs in SQLite (4096 bytes per vector).
+Reference: Research-MultiInterest_Recommender_Architecture.md §3
+  "EWMA updates user embeddings with the formula
+   embedding_t = α × item_embedding_t + (1−α) × embedding_{t-1}"
+Correction (Doc 06): PinnerSage tested λ=0.1 and explicitly rejected it as
+  too recent-biased.  Their optimal was λ=0.01.  α=0.03 is our compromise.
+"""
+from __future__ import annotations
+import numpy as np
+from app import db
+EMBEDDING_DIM = 1024  # BGE-M3 dense dimension
+# EWMA smoothing factors
+# Doc 06 correction: α_long was 0.10, but PinnerSage (KDD 2020) found λ=0.1
+# too recent-biased and selected λ=0.01.  α=0.03 gives ~66-interaction
+# effective window — a compromise that preserves minority interests.
+ALPHA_LONG_TERM = 0.03   # effective window ~66 interactions (compromise; PinnerSage's optimum was λ=0.01)
+ALPHA_SHORT_TERM = 0.40  # effective window ~3-5 interactions
+ALPHA_NEGATIVE = 0.15    # moderate responsiveness for negatives
+def _normalise(v: np.ndarray) -> np.ndarray | None:
+    """L2-normalise a vector.  Returns None if the vector is near-zero."""
+    norm = np.linalg.norm(v)
+    if norm < 1e-10:
+        return None
+    return (v / norm).astype(np.float32)
+def ewma_update(
+    current: np.ndarray | None,
+    new_embedding: np.ndarray,
+    alpha: float,
+) -> np.ndarray:
+    """
+    Exponentially Weighted Moving Average update.
+    If current is None (first interaction), the new embedding IS the profile.
+    Otherwise: profile = (1 - α) × current  +  α × new_embedding
+    The result is always L2-normalised (BGE-M3 operates in cosine space).
+    """
+    new_embedding = new_embedding.astype(np.float64)
+    if current is None:
+        result = new_embedding
+    else:
+        current = current.astype(np.float64)
+        result = (1.0 - alpha) * current + alpha * new_embedding
+    normalised = _normalise(result)
+    if normalised is None:
+        # Edge case: vectors cancel out → keep old profile
+        return current.astype(np.float32) if current is not None else np.zeros(EMBEDDING_DIM, dtype=np.float32)
+    return normalised
+# ── Storage helpers ───────────────────────────────────────────────────────────
+def _to_bytes(v: np.ndarray) -> bytes:
+    return v.astype(np.float32).tobytes()
+def _from_bytes(b: bytes) -> np.ndarray:
+    return np.frombuffer(b, dtype=np.float32).copy()
+async def load_profile(user_id: str, profile_type: str) -> np.ndarray | None:
+    """Load a profile vector from SQLite.  Returns None if not found."""
+    row = await db.get_user_profile(user_id, profile_type)
+    if row is None:
+        return None
+    return _from_bytes(row["vector"])
+async def save_profile(
+    user_id: str,
+    profile_type: str,
+    vector: np.ndarray,
+    interaction_count: int,
+) -> None:
+    """Persist a profile vector to SQLite."""
+    await db.upsert_user_profile(
+        user_id=user_id,
+        profile_type=profile_type,
+        vector=_to_bytes(vector),
+        interaction_count=interaction_count,
+    )
+async def get_interaction_count(user_id: str, profile_type: str) -> int:
+    """Get the current interaction count for a profile."""
+    row = await db.get_user_profile(user_id, profile_type)
+    if row is None:
+        return 0
+    return row["interaction_count"]
+# ── High-level update API ────────────────────────────────────────────────────
+async def update_on_save(user_id: str, paper_embedding: np.ndarray) -> None:
+    """
+    Called when a user saves a paper.
+    Updates both long-term and short-term profiles.
+    """
+    # Long-term
+    lt_current = await load_profile(user_id, "long_term")
+    lt_count = await get_interaction_count(user_id, "long_term")
+    lt_updated = ewma_update(lt_current, paper_embedding, ALPHA_LONG_TERM)
+    await save_profile(user_id, "long_term", lt_updated, lt_count + 1)
+    # Short-term
+    st_current = await load_profile(user_id, "short_term")
+    st_count = await get_interaction_count(user_id, "short_term")
+    st_updated = ewma_update(st_current, paper_embedding, ALPHA_SHORT_TERM)
+    await save_profile(user_id, "short_term", st_updated, st_count + 1)
+async def update_on_dismiss(user_id: str, paper_embedding: np.ndarray) -> None:
+    """
+    Called when a user dismisses a paper.
+    Updates the negative profile.
+    """
+    neg_current = await load_profile(user_id, "negative")
+    neg_count = await get_interaction_count(user_id, "negative")
+    neg_updated = ewma_update(neg_current, paper_embedding, ALPHA_NEGATIVE)
+    await save_profile(user_id, "negative", neg_updated, neg_count + 1)
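The difference between the long-term and short-term smoothing factors is easiest to see on a toy 2-D profile. This is a plain-Python sketch of the same update rule; `normalise`, `ewma_update`, and `effective_window` are illustrative helpers, and the window formula N ≈ 2/α − 1 is one common convention that happens to reproduce the ~66 and ~4 figures quoted in the module, not something the source states.

```python
import math


def normalise(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ewma_update(current, new, alpha: float) -> list[float]:
    """profile = (1 - alpha) * current + alpha * new, then L2-normalise."""
    if current is None:
        return normalise(new)
    blended = [(1.0 - alpha) * c + alpha * n for c, n in zip(current, new)]
    return normalise(blended)


def effective_window(alpha: float) -> float:
    # Rule of thumb (assumption): an EWMA behaves roughly like a simple
    # moving average over 2/alpha - 1 items.
    return 2.0 / alpha - 1.0


profile = [1.0, 0.0]    # existing interest
new_paper = [0.0, 1.0]  # orthogonal topic
long_term = ewma_update(profile, new_paper, 0.03)
short_term = ewma_update(profile, new_paper, 0.40)
```

After one save on an unrelated topic, the short-term profile has rotated far more toward the new paper than the long-term one, which barely moves.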

app/recommend/reranker.py ADDED

	@@ -0,0 +1,180 @@

+"""
+Re-ranking layer for recommendation candidates.
+Phase 2c initial: Heuristic scorer using hand-tuned feature weights.
+Phase 2c mature:  LightGBM lambdarank trained on save/dismiss data.
+The heuristic scorer runs first.  When ≥500 labeled interactions accumulate,
+a LightGBM model can be trained offline and loaded here.
+Features:
+  - cosine_sim_longterm:  dot(user_lt_vec, paper_vec)
+  - cosine_sim_shortterm: dot(user_st_vec, paper_vec)
+  - paper_age_days:       days since publication
+  - rrf_position:         position in the RRF fusion output (lower = better)
+  - cosine_sim_negative:  dot(user_neg_vec, paper_vec)  [Doc 06 addition]
+Reference: Research-MultiInterest_Recommender_Architecture.md §4
+  "LightGBM with a lambdarank objective scores 500 candidates in 2-5ms
+   on a single CPU core."
+Doc 06 correction: YouTube (2023, Xia et al.) showed a 3x gain from using
+  dislikes as both features and labels.  The negative EWMA profile is now
+  wired as a penalty feature during reranking.
+"""
+from __future__ import annotations
+from datetime import datetime, timezone
+import numpy as np
+def _cosine_sim_batch(
+    candidate_embeddings: np.ndarray,
+    profile_vec: np.ndarray,
+) -> np.ndarray:
+    """Cosine similarity of each candidate against a single profile vector."""
+    pnorm = profile_vec / (np.linalg.norm(profile_vec) + 1e-10)
+    cnorms = candidate_embeddings / (
+        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-10
+    )
+    return cnorms @ pnorm
+def compute_features(
+    candidate_embeddings: np.ndarray,
+    candidate_metadata: list[dict],
+    long_term_vec: np.ndarray | None = None,
+    short_term_vec: np.ndarray | None = None,
+    negative_vec: np.ndarray | None = None,
+) -> np.ndarray:
+    """
+    Extract ranking features for each candidate.
+    Args:
+        candidate_embeddings: shape (N, 1024)
+        candidate_metadata: list of dicts with 'published' key (YYYY-MM-DD)
+        long_term_vec: user's long-term EWMA profile (1024-dim)
+        short_term_vec: user's short-term EWMA profile (1024-dim)
+        negative_vec: user's negative EWMA profile (1024-dim) [Doc 06]
+    Returns:
+        feature matrix of shape (N, num_features)
+    """
+    n = len(candidate_metadata)
+    features = []
+    # Feature 1: Cosine similarity to long-term profile
+    if long_term_vec is not None:
+        lt_sim = _cosine_sim_batch(candidate_embeddings, long_term_vec)
+    else:
+        lt_sim = np.zeros(n, dtype=np.float32)
+    features.append(lt_sim)
+    # Feature 2: Cosine similarity to short-term profile
+    if short_term_vec is not None:
+        st_sim = _cosine_sim_batch(candidate_embeddings, short_term_vec)
+    else:
+        st_sim = np.zeros(n, dtype=np.float32)
+    features.append(st_sim)
+    # Feature 3: Paper age in days (0 = today, positive = older)
+    now = datetime.now(timezone.utc)
+    ages = []
+    for meta in candidate_metadata:
+        pub = meta.get("published", "")
+        try:
+            pub_date = datetime.strptime(pub[:10], "%Y-%m-%d").replace(tzinfo=timezone.utc)
+            age_days = max((now - pub_date).days, 0)  # clamp future-dated papers to age 0
+        except (ValueError, TypeError):
+            age_days = 365  # default to 1 year old if unparseable
+        ages.append(age_days)
+    features.append(np.array(ages, dtype=np.float32))
+    # Feature 4: RRF position (0-indexed, lower = better)
+    features.append(np.arange(n, dtype=np.float32))
+    # Feature 5: Cosine similarity to negative profile (Doc 06 addition)
+    # YouTube (2023): using dislikes as features gives 22% reduction in
+    # similar-content; using as both features AND labels gives 60.8%.
+    if negative_vec is not None:
+        neg_sim = _cosine_sim_batch(candidate_embeddings, negative_vec)
+    else:
+        neg_sim = np.zeros(n, dtype=np.float32)
+    features.append(neg_sim)
+    return np.column_stack(features)
+def heuristic_score(features: np.ndarray) -> np.ndarray:
+    """
+    Hand-tuned scoring function.  Used before LightGBM model is trained.
+    Weights:
+      - 0.40 x long-term similarity     (core relevance)
+      - 0.25 x short-term similarity    (session context)
+      - 0.15 x recency                  (prefer newer, soft decay)
+      - 0.10 x RRF confidence           (prefer higher-ranked candidates)
+      - 0.15 x negative penalty         (subtracted: demotes papers like dismissed ones)
+    Returns: scores array of shape (N,), higher = better
+    """
+    lt_sim = features[:, 0]           # cosine sim to long-term
+    st_sim = features[:, 1]           # cosine sim to short-term
+    age_days = features[:, 2]         # paper age in days
+    rrf_pos = features[:, 3]          # RRF rank position
+    neg_sim = features[:, 4]          # cosine sim to negative profile
+    # Recency: exponential decay with a ~347-day half-life (ln 2 / 0.002)
+    # Papers from today score 1.0, papers from a year ago score ~0.48
+    recency = np.exp(-0.002 * age_days)
+    # RRF confidence: linear falloff from 1.0 at the top position (normalised by list length)
+    max_pos = rrf_pos.max() + 1
+    rrf_conf = 1.0 - (rrf_pos / max_pos)
+    # Negative penalty: papers similar to dismissed papers get demoted
+    # Only penalise positive similarity (neg_sim > 0 means similar to disliked)
+    neg_penalty = np.clip(neg_sim, 0.0, None)
+    scores = (
+        0.40 * lt_sim
+        + 0.25 * st_sim
+        + 0.15 * recency
+        + 0.10 * rrf_conf
+        - 0.15 * neg_penalty
+    )
+    return scores
+def rerank_candidates(
+    candidate_ids: list[str],
+    candidate_embeddings: np.ndarray,
+    candidate_metadata: list[dict],
+    long_term_vec: np.ndarray | None = None,
+    short_term_vec: np.ndarray | None = None,
+    negative_vec: np.ndarray | None = None,
+) -> tuple[list[str], list[float], np.ndarray]:
+    """
+    Score and re-rank candidates.
+    Args:
+        negative_vec: user's negative EWMA profile.  Papers similar to this
+            get demoted.  (Doc 06: YouTube 3x gain from using dislikes.)
+    Returns:
+        (sorted_ids, sorted_scores, sorted_embeddings)
+        all in descending score order
+    """
+    features = compute_features(
+        candidate_embeddings, candidate_metadata,
+        long_term_vec, short_term_vec, negative_vec,
+    )
+    scores = heuristic_score(features)
+    # Sort by score descending
+    order = np.argsort(-scores)
+    sorted_ids = [candidate_ids[i] for i in order]
+    sorted_scores = scores[order].tolist()
+    sorted_embs = candidate_embeddings[order]
+    return sorted_ids, sorted_scores, sorted_embs
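
Review note on the heuristic scorer: a minimal sketch, not the project's own test. It re-states the weighted sum from `heuristic_score` on three invented feature rows and computes the half-life implied by the 0.002 decay rate. The function name `toy_heuristic_score` and all feature values are assumptions made for illustration.

```python
import math

import numpy as np

DECAY_RATE = 0.002  # same constant as the recency term in heuristic_score


def toy_heuristic_score(features: np.ndarray) -> np.ndarray:
    """Illustrative re-statement of the weighted sum (not imported from the app)."""
    lt_sim, st_sim, age_days, rrf_pos, neg_sim = features.T
    recency = np.exp(-DECAY_RATE * age_days)
    rrf_conf = 1.0 - rrf_pos / (rrf_pos.max() + 1)
    neg_penalty = np.clip(neg_sim, 0.0, None)  # only positive similarity is penalised
    return (
        0.40 * lt_sim
        + 0.25 * st_sim
        + 0.15 * recency
        + 0.10 * rrf_conf
        - 0.15 * neg_penalty
    )


# Columns: lt_sim, st_sim, age_days, rrf_pos, neg_sim (invented values)
features = np.array([
    [0.80, 0.60, 30.0, 0.0, 0.00],     # relevant, recent, top-ranked
    [0.80, 0.60, 30.0, 1.0, 0.90],     # same relevance, but resembles dismissed papers
    [0.20, 0.10, 2000.0, 2.0, -0.30],  # weak match, old; negative neg_sim clips to 0
])

scores = toy_heuristic_score(features)
order = np.argsort(-scores)                   # descending score order
half_life_days = math.log(2) / DECAY_RATE     # age at which recency drops to 0.5

print(order.tolist(), scores.round(3).tolist(), round(half_life_days, 1))
```

The second row shows the effect of the negative profile: it matches the first row on relevance and age but loses 0.135 to the penalty term, so it ranks below it.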

app/routers/__init__.py ADDED

File without changes

app/routers/events.py ADDED

	@@ -0,0 +1,104 @@

+"""
+Event router — logs user interactions and updates the hot cache.
+POST /api/papers/{paper_id}/save
+POST /api/papers/{paper_id}/not-interested
+"""
+import asyncio
+import uuid
+import numpy as np
+from fastapi import APIRouter, Request, Cookie, Form
+from fastapi.responses import HTMLResponse
+from app import db, user_state as us, qdrant_svc
+from app.config import COOKIE_NAME
+from app.templates_env import templates
+from app.recommend import profiles
+router = APIRouter(prefix="/api/papers")
+@router.post("/{paper_id}/save", response_class=HTMLResponse)
+async def save_paper(
+    paper_id: str,
+    request: Request,
+    source: str = Form(default="search"),
+    position: int | None = Form(default=None),
+    query_id: str = Form(default=""),
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    user_id = user_id or str(uuid.uuid4())
+    await db.log_interaction(
+        user_id=user_id,
+        paper_id=paper_id,
+        event_type="save",
+        source=source,
+        position=position,  # 0 is a valid rank (templates send loop.index0)
+        query_id=query_id or None,
+    )
+    us.record_positive(user_id, paper_id)
+    asyncio.create_task(qdrant_svc.lookup_qdrant_ids([paper_id]))
+    asyncio.create_task(_update_profile_on_save(user_id, paper_id))
+    resp = templates.TemplateResponse(
+        request,
+        "partials/action_buttons.html",
+        {"paper_id": paper_id, "saved": True, "dismissed": False, "source": source},
+    )
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp
+@router.post("/{paper_id}/not-interested", response_class=HTMLResponse)
+async def not_interested(
+    paper_id: str,
+    request: Request,
+    source: str = Form(default="search"),
+    position: int | None = Form(default=None),
+    query_id: str = Form(default=""),
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    user_id = user_id or str(uuid.uuid4())
+    await db.log_interaction(
+        user_id=user_id,
+        paper_id=paper_id,
+        event_type="not_interested",
+        source=source,
+        position=position,  # 0 is a valid rank; None means the client sent no position
+        query_id=query_id or None,
+    )
+    us.record_negative(user_id, paper_id)
+    asyncio.create_task(_update_profile_on_dismiss(user_id, paper_id))
+    resp = HTMLResponse(content="")
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp
+# ── Background EWMA profile update helpers ────────────────────────────────────
+async def _update_profile_on_save(user_id: str, paper_id: str) -> None:
+    """Background task: fetch paper embedding and update EWMA profiles."""
+    try:
+        vectors = await qdrant_svc.get_paper_vectors([paper_id])
+        if paper_id not in vectors:
+            return
+        embedding = np.array(vectors[paper_id], dtype=np.float32)
+        await profiles.update_on_save(user_id, embedding)
+    except Exception as e:
+        print(f"[events] EWMA save update failed for {paper_id}: {e}")
+async def _update_profile_on_dismiss(user_id: str, paper_id: str) -> None:
+    """Background task: fetch paper embedding and update negative profile."""
+    try:
+        vectors = await qdrant_svc.get_paper_vectors([paper_id])
+        if paper_id not in vectors:
+            return
+        embedding = np.array(vectors[paper_id], dtype=np.float32)
+        await profiles.update_on_dismiss(user_id, embedding)
+    except Exception as e:
+        print(f"[events] EWMA dismiss update failed for {paper_id}: {e}")

app/routers/recommendations.py ADDED

	@@ -0,0 +1,231 @@

+"""
+Recommendations router.
+GET /api/recommendations
+  – Called by HTMX on page load (hx-trigger="load")
+  – Returns the recommendations partial HTML
+Recommendation pipeline (cascading fallback):
+  Phase 2b: Multi-interest clustering → prefetch + RRF fusion  (≥5 saves)
+  Phase 2a: EWMA long-term vector → single vector search       (≥3 saves)
+  Phase 1:  Qdrant BEST_SCORE Recommend API with raw IDs       (≥1 save)
+"""
+import json
+import uuid
+import numpy as np
+from fastapi import APIRouter, Request, Cookie
+from fastapi.responses import HTMLResponse
+from app import qdrant_svc, arxiv_svc, user_state as us
+from app.config import COOKIE_NAME, REC_LIMIT, REC_MIN_POSITIVES
+from app.templates_env import templates
+from app.recommend import profiles
+from app.recommend.clustering import (
+    compute_clusters,
+    save_clusters_to_db,
+    load_clusters_from_db,
+    MIN_PAPERS_FOR_CLUSTERING,
+)
+from app.recommend.reranker import rerank_candidates
+from app.recommend.diversity import mmr_rerank, inject_exploration
+router = APIRouter(prefix="/api")
+# Minimum EWMA interactions before switching from ID-based to vector-based recs
+_MIN_EWMA_INTERACTIONS = 3
+@router.get("/recommendations", response_class=HTMLResponse)
+async def get_recommendations(
+    request: Request,
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    user_id = user_id or str(uuid.uuid4())
+    state = await us.ensure_loaded(user_id)
+    def _empty_resp():
+        r = templates.TemplateResponse(
+            request,
+            "partials/empty_recs.html",
+            {"min_saves": REC_MIN_POSITIVES},
+        )
+        r.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+        return r
+    if not state.has_enough_for_recs():
+        return _empty_resp()
+    seen = us.all_seen(user_id)
+    # ── Tier 1: Multi-interest clustering + RRF (Phase 2b, ≥5 saves) ─────
+    rec_arxiv_ids = await _multi_interest_recommend(user_id, state, seen, REC_LIMIT)
+    # ── Tier 2: EWMA single-vector search (Phase 2a, ≥3 saves) ───────────
+    if not rec_arxiv_ids:
+        rec_arxiv_ids = await _ewma_recommend(user_id, seen, REC_LIMIT)
+    # ── Tier 3: Qdrant Recommend API (Phase 1 fallback, ≥1 save) ─────────
+    if not rec_arxiv_ids:
+        rec_arxiv_ids = await qdrant_svc.recommend(
+            positive_arxiv_ids=state.positive_list,
+            negative_arxiv_ids=state.negative_list,
+            seen_arxiv_ids=seen,
+            limit=REC_LIMIT,
+        )
+    if not rec_arxiv_ids:
+        return _empty_resp()
+    meta = await arxiv_svc.fetch_metadata_batch(rec_arxiv_ids)
+    papers = [
+        {**meta[aid], "saved": False, "dismissed": False}
+        for aid in rec_arxiv_ids
+        if aid in meta
+    ]
+    resp = templates.TemplateResponse(
+        request,
+        "partials/recommendations.html",
+        {"papers": papers},
+    )
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp
+# ── Tier 1: Multi-interest clustering + prefetch RRF ─────────────────────────
+# Per-cluster candidate limits (descending by importance)
+_CLUSTER_LIMITS = [40, 30, 25, 20, 15, 15, 15]
+async def _multi_interest_recommend(
+    user_id: str, state, seen: set[str], limit: int
+) -> list[str]:
+    """
+    Full recommendation pipeline (Phase 2b + 2c):
+      1. Ward clustering → identify distinct interests
+      2. Prefetch + RRF → retrieve ~100 candidates
+      3. Heuristic re-ranking → score candidates
+      4. MMR diversity → select top-k with diversity
+      5. Exploration injection → 1-2 serendipitous papers
+    Only activates when the user has ≥ MIN_PAPERS_FOR_CLUSTERING saves.
+    Returns [] to trigger fallback to Tier 2.
+    """
+    positives = state.positive_list
+    if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
+        return []
+    try:
+        # Fetch embeddings for all saved papers
+        vectors = await qdrant_svc.get_paper_vectors(positives)
+        if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
+            return []
+        # Build aligned arrays (only papers we got vectors for)
+        aligned_ids = [pid for pid in positives if pid in vectors]
+        aligned_embs = np.array(
+            [vectors[pid] for pid in aligned_ids], dtype=np.float32
+        )
+        # ── Step 1: Compute interest clusters ─────────────────────────────
+        clusters = compute_clusters(aligned_ids, aligned_embs)
+        await save_clusters_to_db(user_id, clusters)
+        # ── Step 2: Multi-interest retrieval via prefetch + RRF ───────────
+        interest_vectors = []
+        for i, cluster in enumerate(clusters):
+            per_cluster_limit = _CLUSTER_LIMITS[i] if i < len(_CLUSTER_LIMITS) else 15
+            interest_vectors.append(
+                (cluster.medoid_embedding.tolist(), per_cluster_limit)
+            )
+        st_vec = await profiles.load_profile(user_id, "short_term")
+        st_list = st_vec.tolist() if st_vec is not None else None
+        candidate_ids = await qdrant_svc.multi_interest_search(
+            interest_vectors=interest_vectors,
+            short_term_vector=st_list,
+            exclude_ids=seen,
+            total_limit=100,  # retrieve wide, narrow with re-ranking
+        )
+        if not candidate_ids:
+            return []
+        # ── Step 3: Re-rank candidates ────────────────────────────────────
+        # Fetch embeddings + metadata for candidates
+        cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
+        cand_meta = await arxiv_svc.fetch_metadata_batch(candidate_ids)
+        # Only process candidates we have both vectors and metadata for
+        valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
+        if not valid_ids:
+            return candidate_ids[:limit]  # fallback: return raw retrieval
+        valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
+        valid_meta = [cand_meta[cid] for cid in valid_ids]
+        lt_vec = await profiles.load_profile(user_id, "long_term")
+        neg_vec = await profiles.load_profile(user_id, "negative")
+        reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
+            candidate_ids=valid_ids,
+            candidate_embeddings=valid_embs,
+            candidate_metadata=valid_meta,
+            long_term_vec=lt_vec,
+            short_term_vec=st_vec,
+            negative_vec=neg_vec,
+        )
+        # ── Step 4: MMR diversity enforcement ─────────────────────────────
+        query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
+        mmr_selected = mmr_rerank(
+            query_embedding=query_vec,
+            candidate_embeddings=reranked_embs,
+            candidate_ids=reranked_ids,
+            scores=reranked_scores,
+            lambda_param=0.6,
+            top_k=limit,
+        )
+        # ── Step 5: Exploration injection ─────────────────────────────────
+        final = inject_exploration(
+            selected_ids=mmr_selected,
+            all_candidate_ids=reranked_ids,
+            n_explore=2,
+        )
+        return final[:limit + 2]  # allow slightly over limit for exploration
+    except Exception as e:
+        print(f"[recommendations] multi-interest search failed: {e}")
+        return []
+# ── Tier 2: EWMA single-vector search ────────────────────────────────────────
+async def _ewma_recommend(
+    user_id: str, seen: set[str], limit: int
+) -> list[str]:
+    """
+    Use the long-term EWMA profile vector for vector search.
+    Only activates after _MIN_EWMA_INTERACTIONS saves so the profile
+    has had enough signal to be meaningful.  Returns [] to trigger fallback.
+    """
+    lt_count = await profiles.get_interaction_count(user_id, "long_term")
+    if lt_count < _MIN_EWMA_INTERACTIONS:
+        return []
+    lt_vec = await profiles.load_profile(user_id, "long_term")
+    if lt_vec is None:
+        return []
+    query_vec = lt_vec.tolist()
+    return await qdrant_svc.search_by_vector(
+        query_vector=query_vec,
+        limit=limit,
+        exclude_ids=seen,
+    )
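
Review note on the cascading fallback: the three tiers in `get_recommendations` follow a "first non-empty result wins" pattern. The sketch below isolates that pattern with stand-in tiers. `first_non_empty` and the three stub coroutines are hypothetical names, and unlike the router (which wraps only Tier 1 in `try/except`), the sketch treats an exception in any tier as "no results".

```python
import asyncio
from typing import Awaitable, Callable

Tier = Callable[[], Awaitable[list[str]]]


async def first_non_empty(tiers: list[tuple[str, Tier]]) -> tuple[str, list[str]]:
    """Try each tier in order; return the name and IDs of the first that yields results."""
    for name, tier in tiers:
        try:
            ids = await tier()
        except Exception as exc:  # a failing tier degrades to "no results"
            print(f"[sketch] {name} failed: {exc}")
            ids = []
        if ids:
            return name, ids
    return "none", []


# Hypothetical stand-ins for the three tiers.
async def multi_interest() -> list[str]:
    return []  # e.g. not enough saves to cluster


async def ewma() -> list[str]:
    raise RuntimeError("profile store unavailable")


async def recommend_api() -> list[str]:
    return ["2401.00001", "2401.00002"]


winner, rec_ids = asyncio.run(
    first_non_empty([
        ("multi_interest", multi_interest),
        ("ewma", ewma),
        ("recommend_api", recommend_api),
    ])
)
print(winner, rec_ids)
```

Returning the tier name alongside the IDs is a design choice of the sketch. It would make it easy to log which tier served each request when evaluating the phases against each other.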

app/routers/saved.py ADDED

	@@ -0,0 +1,46 @@

+"""
+Saved papers router.
+GET /saved
+  – Shows all papers the user has currently saved (positive_list)
+  – Metadata fetched via arXiv API + SQLite cache
+"""
+import uuid
+from fastapi import APIRouter, Request, Cookie
+from fastapi.responses import HTMLResponse
+from app import arxiv_svc, user_state as us
+from app.config import COOKIE_NAME
+from app.templates_env import templates
+router = APIRouter()
+@router.get("/saved", response_class=HTMLResponse)
+async def saved_papers(
+    request: Request,
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    user_id = user_id or str(uuid.uuid4())
+    state = await us.ensure_loaded(user_id)
+    saved_ids = state.positive_list  # most-recent first, mutual-exclusion already applied
+    papers = []
+    if saved_ids:
+        meta = await arxiv_svc.fetch_metadata_batch(saved_ids)
+        papers = [
+            {**meta[aid], "saved": True, "dismissed": False}
+            for aid in saved_ids
+            if aid in meta
+        ]
+    resp = templates.TemplateResponse(
+        request,
+        "saved.html",
+        {
+            "papers": papers,
+            "count": len(papers),
+        },
+    )
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp

app/routers/search.py ADDED

	@@ -0,0 +1,82 @@

+"""
+Search router — Phase 3 hybrid semantic search.
+GET /search?q=<query>
+  – Returns full page on normal request
+  – Returns partial <div id="search-results"> on HTMX request (hx-target swap)
+Phase 3 replaces the arXiv keyword API with:
+  LLM rewrite → BGE-M3 encode → Qdrant dense + Zilliz sparse → RRF → rerank
+"""
+import uuid
+from fastapi import APIRouter, Request, Cookie
+from fastapi.responses import HTMLResponse
+from app import arxiv_svc, user_state as us, hybrid_search_svc
+from app.config import COOKIE_NAME, ARXIV_MAX_RESULTS
+from app.templates_env import templates
+router = APIRouter()
+@router.get("/search", response_class=HTMLResponse)
+async def search(
+    request: Request,
+    q: str = "",
+    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
+):
+    papers = []
+    if q.strip():
+        # Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
+        try:
+            arxiv_ids = await hybrid_search_svc.search(q.strip(), limit=ARXIV_MAX_RESULTS)
+        except Exception as e:
+            print(f"[search] Hybrid search error: {e}")
+            arxiv_ids = []
+        if arxiv_ids:
+            # Fetch metadata for the ranked results
+            try:
+                meta = await arxiv_svc.fetch_metadata_batch(arxiv_ids)
+                # Preserve ranking order from hybrid search
+                papers = [meta[aid] for aid in arxiv_ids if aid in meta]
+            except Exception as e:
+                # arXiv API timeout — fall back to keyword search
+                print(f"[search] Metadata fetch failed ({e}), falling back to arXiv API")
+                papers = []
+        if not papers:
+            # Fallback: arXiv keyword API if hybrid returns nothing or metadata failed
+            try:
+                papers = await arxiv_svc.search(q.strip())
+            except Exception as e:
+                print(f"[search] arXiv fallback also failed: {e}")
+                papers = []
+    user_id = user_id or str(uuid.uuid4())
+    state = await us.ensure_loaded(user_id)
+    saved_ids = set(state.positive_list)
+    dismissed_ids = set(state.negative_list)
+    for p in papers:
+        p["saved"] = p["arxiv_id"] in saved_ids
+        p["dismissed"] = p["arxiv_id"] in dismissed_ids
+    if request.headers.get("HX-Request"):
+        resp = templates.TemplateResponse(
+            request,
+            "partials/search_results.html",
+            {"papers": papers, "query": q},
+        )
+    else:
+        resp = templates.TemplateResponse(
+            request,
+            "search.html",
+            {
+                "papers": papers,
+                "query": q,
+                "has_recs": state.has_enough_for_recs(),
+            },
+        )
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
+    return resp
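
Review note on result assembly: the search handler joins the ranked ID list with a metadata dict and then sets `saved`/`dismissed` on each paper in place. The sketch below shows the same join while copying each record, which avoids mutating dicts that a metadata cache might still hold. Whether `fetch_metadata_batch` returns shared objects is not visible in this diff, so treat that concern as an assumption. `attach_state` and the sample IDs are hypothetical.

```python
def attach_state(
    ranked_ids: list[str],
    meta: dict[str, dict],
    saved_ids: set[str],
    dismissed_ids: set[str],
) -> list[dict]:
    """Join ranked IDs with metadata, keeping rank order and skipping missing records."""
    papers = []
    for aid in ranked_ids:
        if aid not in meta:
            continue  # metadata lookup missed this ID; drop it rather than fail
        papers.append({
            **meta[aid],  # shallow copy, so the source dict is left untouched
            "saved": aid in saved_ids,
            "dismissed": aid in dismissed_ids,
        })
    return papers


ranked = ["b", "a", "c"]  # order produced by the ranker
meta = {
    "a": {"arxiv_id": "a", "title": "Paper A"},
    "b": {"arxiv_id": "b", "title": "Paper B"},
    # "c" is deliberately missing
}
papers = attach_state(ranked, meta, saved_ids={"a"}, dismissed_ids=set())
print([p["arxiv_id"] for p in papers])
```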

app/templates/base.html ADDED

	@@ -0,0 +1,42 @@

+<!DOCTYPE html>
+<html lang="en" data-theme="light">
+<head>
+  <meta charset="UTF-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+  <title>{% block title %}ArXiv Recommender{% endblock %}</title>
+  <!-- TailwindCSS + DaisyUI (CDN, no build step) -->
+  <link href="https://cdn.jsdelivr.net/npm/daisyui@4.12.10/dist/full.min.css" rel="stylesheet" />
+  <script src="https://cdn.tailwindcss.com"></script>
+  <!-- HTMX -->
+  <script src="https://unpkg.com/htmx.org@1.9.12"></script>
+  <style>
+    /* Smooth fade-out when HTMX removes an element */
+    .htmx-swapping { opacity: 0; transition: opacity 200ms ease-out; }
+    /* Spinner shown while HTMX request is in-flight */
+    .htmx-request .htmx-indicator { display: inline-block !important; }
+    .htmx-indicator { display: none; }
+  </style>
+</head>
+<body class="bg-base-200 min-h-screen">
+  <!-- Navbar -->
+  <div class="navbar bg-base-100 shadow-sm px-4">
+    <div class="flex-1">
+      <a href="/" class="text-xl font-bold text-primary">📄 ArXiv Rec</a>
+    </div>
+    <div class="flex-none gap-1 flex">
+      <a href="/search" class="btn btn-ghost btn-sm">Search</a>
+      <a href="/saved" class="btn btn-ghost btn-sm">Saved</a>
+    </div>
+  </div>
+  <!-- Main content -->
+  <main class="container mx-auto px-4 py-6 max-w-4xl">
+    {% block content %}{% endblock %}
+  </main>
+</body>
+</html>

app/templates/index.html ADDED

	@@ -0,0 +1,50 @@

+{% extends "base.html" %}
+{% block title %}ArXiv Recommender — Home{% endblock %}
+{% block content %}
+<div class="space-y-6">
+  <!-- Hero / search bar -->
+  <div class="card bg-base-100 shadow p-6">
+    <h1 class="text-2xl font-bold mb-1">Find Research Papers</h1>
+    <p class="text-sm text-base-content/60 mb-4">
+      Search arXiv, save papers you like — get personalised recommendations.
+    </p>
+    <form hx-get="/search"
+          hx-target="#search-results"
+          hx-push-url="true"
+          hx-indicator="#search-spinner"
+          class="flex gap-2">
+      <input type="text"
+             name="q"
+             placeholder="e.g. transformer attention mechanism"
+             class="input input-bordered flex-1"
+             autofocus />
+      <button class="btn btn-primary" type="submit">
+        Search
+        <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
+      </button>
+    </form>
+  </div>
+  <!-- Recommendations section -->
+  <div>
+    <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
+    <div id="rec-section"
+         hx-get="/api/recommendations"
+         hx-trigger="load"
+         hx-indicator="#rec-spinner"
+         hx-swap="innerHTML">
+      <div class="flex items-center gap-2 text-base-content/50">
+        <span id="rec-spinner" class="htmx-indicator loading loading-spinner loading-sm"></span>
+        <span>Loading recommendations…</span>
+      </div>
+    </div>
+  </div>
+  <!-- Search results (swapped in by HTMX) -->
+  <div id="search-results"></div>
+</div>
+{% endblock %}

app/templates/partials/action_buttons.html ADDED

	@@ -0,0 +1,45 @@

+{#
+  Action buttons for a paper card.
+  Expects: paper_id (or paper.arxiv_id), saved (bool), dismissed (bool)
+  Optional: source ("search" | "recommendation" | "saved"), position (int)
+  These are returned directly by the /api/papers/{id}/save endpoint
+  so they also work as a standalone partial.
+#}
+{% set pid = paper_id if paper_id is defined else paper.arxiv_id %}
+{% set is_saved = saved if saved is defined else (paper.saved | default(false)) %}
+{% set _source = source if source is defined else "search" %}
+{% if is_saved %}
+  <!-- Already saved — show saved state, allow unsave via not-interested -->
+  <div class="flex gap-2 flex-wrap">
+    <button class="btn btn-success btn-xs" disabled>
+      ✓ Saved
+    </button>
+    <button class="btn btn-ghost btn-xs"
+            hx-post="/api/papers/{{ pid }}/not-interested"
+            hx-target="#paper-{{ pid }}"
+            hx-swap="outerHTML swap:200ms"
+            hx-vals='{"source": "{{ _source }}"}'>
+      Remove
+    </button>
+  </div>
+{% else %}
+  <div class="flex gap-2 flex-wrap">
+    <!-- Save -->
+    <button class="btn btn-primary btn-xs"
+            hx-post="/api/papers/{{ pid }}/save"
+            hx-target="#actions-{{ pid }}"
+            hx-swap="innerHTML"
+            hx-vals='{"source": "{{ _source }}", "position": "{{ position | default(0) }}"}'>
+      ⭐ Save
+    </button>
+    <!-- Not interested (removes the whole card) -->
+    <button class="btn btn-ghost btn-xs text-error"
+            hx-post="/api/papers/{{ pid }}/not-interested"
+            hx-target="#paper-{{ pid }}"
+            hx-swap="outerHTML swap:200ms"
+            hx-vals='{"source": "{{ _source }}"}'>
+      ✕ Not interested
+    </button>
+  </div>
+{% endif %}

app/templates/partials/empty_recs.html ADDED

	@@ -0,0 +1,16 @@

+{# Shown when the user hasn't saved enough papers yet #}
+<div class="text-center py-8 text-base-content/40 space-y-2">
+  <div class="text-3xl">📚</div>
+  <p class="font-medium">No recommendations yet</p>
+  <p class="text-sm">
+    Save {{ min_saves | default(1) }} or more papers
+    while searching to unlock personalised recommendations.
+  </p>
+  <!-- Check again after saving papers on the search page -->
+  <button class="btn btn-ghost btn-xs mt-1"
+          hx-get="/api/recommendations"
+          hx-target="#rec-section"
+          hx-swap="innerHTML">
+    Check for recommendations
+  </button>
+</div>

app/templates/partials/paper_card.html ADDED

	@@ -0,0 +1,45 @@

+{#
+  Reusable paper card.
+  Expects variables:
+    paper  – dict with arxiv_id, title, abstract, authors (JSON str), category, published, year
+    source – "search" | "recommendation"  (default "search")
+    position – integer rank in the list
+#}
+{% set source = source | default("search") %}
+{% set position = position | default(0) %}
+{% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
+<div class="card bg-base-100 shadow-sm border border-base-300 p-4 space-y-2"
+     id="paper-{{ paper.arxiv_id }}">
+  <!-- Title + category badge -->
+  <div class="flex items-start justify-between gap-2">
+    <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
+       target="_blank"
+       rel="noopener"
+       class="font-semibold text-primary hover:underline leading-snug">
+      {{ paper.title }}
+    </a>
+    {% if paper.category %}
+    <span class="badge badge-outline badge-sm shrink-0">{{ paper.category }}</span>
+    {% endif %}
+  </div>
+  <!-- Meta: arXiv ID + year -->
+  <div class="text-xs text-base-content/50">
+    [{{ paper.arxiv_id }}]
+    {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
+    {% if authors_list %} · {{ authors_list | join(", ") }}{% endif %}
+  </div>
+  <!-- Abstract (truncated) -->
+  <p class="text-sm text-base-content/75 line-clamp-3">
+    {{ paper.abstract }}
+  </p>
+  <!-- Action buttons (HTMX-powered, swap themselves on click) -->
+  <div id="actions-{{ paper.arxiv_id }}">
+    {% include "partials/action_buttons.html" %}
+  </div>
+</div>

app/templates/partials/recommendations.html ADDED

	@@ -0,0 +1,23 @@

+{# Partial: personalised recommendations #}
+{% if papers %}
+  <div class="space-y-3">
+    {% for paper in papers %}
+      {% set position = loop.index0 %}
+      {% set source = "recommendation" %}
+      {% include "partials/paper_card.html" %}
+    {% endfor %}
+  </div>
+  <!-- Refresh button — lets user reload recs after saving more papers -->
+  <div class="text-center pt-3">
+    <button class="btn btn-ghost btn-sm"
+            hx-get="/api/recommendations"
+            hx-target="#rec-section"
+            hx-swap="innerHTML"
+            hx-indicator="#rec-refresh-spinner">
+      ↻ Show different recommendations
+      <span id="rec-refresh-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
+    </button>
+  </div>
+{% else %}
+  {% include "partials/empty_recs.html" %}
+{% endif %}

app/templates/partials/search_results.html ADDED

	@@ -0,0 +1,15 @@

+{# Partial: list of search result cards #}
+{% if papers %}
+  <div class="space-y-3">
+    <p class="text-sm text-base-content/50">{{ papers | length }} results for "{{ query }}"</p>
+    {% for paper in papers %}
+      {% set position = loop.index0 %}
+      {% set source = "search" %}
+      {% include "partials/paper_card.html" %}
+    {% endfor %}
+  </div>
+{% elif query %}
+  <div class="text-center text-base-content/40 py-10">
+    No results found for "{{ query }}"
+  </div>
+{% endif %}

app/templates/saved.html ADDED

	@@ -0,0 +1,38 @@

+{% extends "base.html" %}
+{% block title %}Saved Papers — ArXiv Recommender{% endblock %}
+{% block content %}
+<div class="space-y-4">
+  <!-- Header -->
+  <div class="flex items-center justify-between">
+    <h1 class="text-2xl font-bold">Saved Papers</h1>
+    <span class="badge badge-primary badge-lg">{{ count }} saved</span>
+  </div>
+  {% if papers %}
+    <div class="space-y-3">
+      {% for paper in papers %}
+        {% set position = loop.index0 %}
+        {% set source = "saved" %}
+        {% include "partials/paper_card.html" %}
+      {% endfor %}
+    </div>
+  {% else %}
+    <!-- Empty state -->
+    <div class="card bg-base-100 shadow p-10 text-center space-y-3">
+      <div class="text-5xl">📚</div>
+      <p class="text-lg font-semibold text-base-content/70">No saved papers yet</p>
+      <p class="text-sm text-base-content/50">
+        Search for papers and click <strong>⭐ Save</strong> on ones you find interesting.
+        Once you save a paper it will appear here.
+      </p>
+      <div class="pt-2">
+        <a href="/search" class="btn btn-primary btn-sm">Start Searching</a>
+      </div>
+    </div>
+  {% endif %}
+</div>
+{% endblock %}

app/templates/search.html ADDED

	@@ -0,0 +1,47 @@

+{% extends "base.html" %}
+{% block title %}Search — ArXiv Recommender{% endblock %}
+{% block content %}
+<div class="space-y-6">
+  <!-- Search bar -->
+  <div class="card bg-base-100 shadow p-4">
+    <form hx-get="/search"
+          hx-target="#search-results"
+          hx-push-url="true"
+          hx-indicator="#search-spinner"
+          class="flex gap-2">
+      <input type="text"
+             name="q"
+             value="{{ query }}"
+             placeholder="Search arXiv papers…"
+             class="input input-bordered flex-1"
+             autofocus />
+      <button class="btn btn-primary" type="submit">
+        Search
+        <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
+      </button>
+    </form>
+  </div>
+  <!-- Recommendations (sidebar-style, loads async) -->
+  {% if has_recs %}
+  <div>
+    <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
+    <div id="rec-section"
+         hx-get="/api/recommendations"
+         hx-trigger="load"
+         hx-swap="innerHTML">
+      <span class="loading loading-spinner loading-sm"></span>
+    </div>
+  </div>
+  {% endif %}
+  <!-- Search results -->
+  <div id="search-results">
+    {% include "partials/search_results.html" %}
+  </div>
+</div>
+{% endblock %}

app/templates_env.py ADDED

	@@ -0,0 +1,22 @@

+"""
+Shared Jinja2 environment with custom filters.
+Import `templates` from here instead of creating it per-router.
+"""
+import json
+from fastapi.templating import Jinja2Templates
+templates = Jinja2Templates(directory="app/templates")
+def _tojson_parse(value: str | None) -> list:
+    """Parse a JSON-encoded string into a Python list. Returns [] on error."""
+    if not value:
+        return []
+    try:
+        result = json.loads(value)
+        return result if isinstance(result, list) else []
+    except (ValueError, TypeError):
+        return []
+templates.env.filters["tojson_parse"] = _tojson_parse
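
The filter's contract (always return a list, never raise) can be checked outside Jinja. The sketch below re-declares the same logic under a standalone name so it runs on its own; the sample inputs are made up for illustration and are not taken from the app's data.

```python
import json


def tojson_parse(value):
    """Standalone copy of the filter logic: JSON string -> list, [] on any problem."""
    if not value:
        return []
    try:
        result = json.loads(value)
        # Valid JSON that is not a list (for example an object) is also rejected.
        return result if isinstance(result, list) else []
    except (ValueError, TypeError):
        return []


# Illustrative inputs: a JSON list, a JSON object, malformed text, None, empty string.
samples = ['["cs.LG", "cs.IR"]', '{"a": 1}', "not json", None, ""]
parsed = [tojson_parse(s) for s in samples]
print(parsed)
```

Only the first sample yields a non-empty list; every other input degrades to `[]`, which lets templates loop over the result without guarding against missing or malformed values.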

app/user_state.py ADDED

	@@ -0,0 +1,111 @@

+"""
+In-memory user state cache (hot path).
+Keeps the last N positive/negative paper IDs per user so that
+recommendation requests don't need a DB round-trip on every page load.
+The cache is populated lazily on first access from the SQLite interactions
+table, then kept up-to-date by the event endpoints.
+Thread-safety: asyncio is single-threaded; no locks needed.
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from collections import deque
+from app import db, config
+MAX_POSITIVES = config.REC_POSITIVE_LIMIT   # max positive IDs kept in memory per user
+MAX_NEGATIVES = 50                          # max negative IDs kept in memory per user
+@dataclass
+class UserState:
+    # Most-recently-interacted first
+    positives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
+    negatives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))
+    loaded: bool = False  # True once hydrated from DB
+    def add_positive(self, paper_id: str) -> None:
+        # Remove from negatives if it was there
+        try:
+            self.negatives.remove(paper_id)
+        except ValueError:
+            pass
+        # Prepend (deque handles maxlen eviction automatically)
+        if paper_id not in self.positives:
+            self.positives.appendleft(paper_id)
+    def add_negative(self, paper_id: str) -> None:
+        try:
+            self.positives.remove(paper_id)
+        except ValueError:
+            pass
+        if paper_id not in self.negatives:
+            self.negatives.appendleft(paper_id)
+    @property
+    def positive_list(self) -> list[str]:
+        return list(self.positives)
+    @property
+    def negative_list(self) -> list[str]:
+        return list(self.negatives)
+    def has_enough_for_recs(self) -> bool:
+        # `config` is already imported at module level, so no local import is needed.
+        return len(self.positives) >= config.REC_MIN_POSITIVES
+# ── Global in-process cache ───────────────────────────────────────────────────
+_cache: dict[str, UserState] = {}
+def get_user_state(user_id: str) -> UserState:
+    """Return the in-memory state for a user (creates if missing)."""
+    if user_id not in _cache:
+        _cache[user_id] = UserState()
+    return _cache[user_id]
+async def ensure_loaded(user_id: str) -> UserState:
+    """
+    Return the user state, loading from DB the first time.
+    Subsequent calls are O(1) dict lookup.
+    """
+    state = get_user_state(user_id)
+    if state.loaded:
+        return state
+    rows = await db.get_user_interactions(
+        user_id,
+        event_types=["save", "not_interested"],
+        limit=MAX_POSITIVES + MAX_NEGATIVES,
+    )
+    # Rows are ordered newest-first; we want newest in the front of the deque
+    # Process oldest-first so that appendleft ends with newest at front.
+    for row in reversed(rows):
+        if row["event_type"] == "save":
+            state.add_positive(row["paper_id"])
+        elif row["event_type"] == "not_interested":
+            state.add_negative(row["paper_id"])
+    state.loaded = True
+    return state
+def record_positive(user_id: str, paper_id: str) -> None:
+    """Update in-memory state synchronously (DB write happens separately)."""
+    get_user_state(user_id).add_positive(paper_id)
+def record_negative(user_id: str, paper_id: str) -> None:
+    get_user_state(user_id).add_negative(paper_id)
+def all_seen(user_id: str) -> set[str]:
+    """All paper IDs this user has interacted with (used to filter recs)."""
+    state = get_user_state(user_id)
+    return set(state.positive_list) | set(state.negative_list)
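
The cache above relies on two `deque` behaviours: `appendleft` on a full bounded deque evicts from the right (the oldest entry), and replaying newest-first rows in reverse leaves the newest item at the front. A minimal standalone sketch of both, using an illustrative limit of 3 rather than the app's configured values:

```python
from collections import deque

# Bounded deque: appendleft on a full deque drops the rightmost (oldest) item.
positives = deque(maxlen=3)  # 3 is an illustrative limit, not the app's setting
for paper_id in ["p1", "p2", "p3", "p4"]:
    if paper_id not in positives:
        positives.appendleft(paper_id)
print(list(positives))

# Hydration: rows arrive newest-first, so replay them oldest-first.
rows_newest_first = ["n3", "n2", "n1"]
hydrated = deque(maxlen=3)
for paper_id in reversed(rows_newest_first):
    hydrated.appendleft(paper_id)
print(list(hydrated))
```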

app/zilliz_svc.py ADDED

	@@ -0,0 +1,132 @@

+"""
+Zilliz Cloud sparse search client — Phase 3.
+Responsibilities:
+  - Connect to Zilliz Cloud serverless via pymilvus MilvusClient
+  - search_sparse(sparse_dict, limit) → list[dict] with arxiv_id + score
+  - Handle gRPC reconnects on closed-channel errors
+  - Collection: arxiv_bgem3_sparse
+  - Schema: id (INT64 auto PK), arxiv_id (VARCHAR), sparse_vector (SPARSE_FLOAT_VECTOR)
+  - Index: SPARSE_INVERTED_INDEX, metric_type=IP
+"""
+from __future__ import annotations
+import asyncio
+import threading
+from app import config
+# ── Client singleton ─────────────────────────────────────────────────────────
+_client = None
+_client_lock = threading.Lock()
+def _get_client():
+    """Return or create the MilvusClient singleton.  Thread-safe."""
+    global _client
+    if _client is not None:
+        return _client
+    with _client_lock:
+        if _client is not None:
+            return _client
+        from pymilvus import MilvusClient
+        _client = MilvusClient(
+            uri=config.ZILLIZ_URI,
+            token=config.ZILLIZ_TOKEN,
+        )
+        print(f"[zilliz_svc] Connected to {config.ZILLIZ_COLLECTION}")
+        return _client
+def _reset_client():
+    """Force reconnect on next call.  Used after gRPC errors."""
+    global _client
+    with _client_lock:
+        _client = None
+# ── Sparse search ────────────────────────────────────────────────────────────
+def _run_sparse_search(
+    sparse_dict: dict[int, float],
+    limit: int,
+) -> list[dict]:
+    """
+    Sync helper: execute sparse vector search on Zilliz.
+    Args:
+        sparse_dict: {token_id_int: weight_float} from BGE-M3 lexical_weights
+        limit: max results to return
+    Returns:
+        list of {'arxiv_id': str, 'score': float} dicts, sorted by score desc
+    """
+    client = _get_client()
+    results = client.search(
+        collection_name=config.ZILLIZ_COLLECTION,
+        data=[sparse_dict],
+        anns_field="sparse_vector",
+        search_params={"metric_type": "IP"},
+        limit=limit,
+        output_fields=["arxiv_id"],
+    )
+    # pymilvus returns list[list[dict]] — first list is for first query vector
+    if not results or not results[0]:
+        return []
+    return [
+        {"arxiv_id": hit["entity"]["arxiv_id"], "score": hit["distance"]}
+        for hit in results[0]
+        if hit.get("entity", {}).get("arxiv_id")
+    ]
+async def search_sparse(
+    sparse_dict: dict[int, float],
+    limit: int = 50,
+) -> list[dict]:
+    """
+    Async sparse search — runs the sync MilvusClient in a thread executor.
+    Args:
+        sparse_dict: BGE-M3 lexical weights {int_token_id: float_weight}
+        limit: max results
+    Returns:
+        list of {'arxiv_id': str, 'score': float} sorted by score desc.
+        Returns empty list on error (graceful degradation).
+    """
+    if not sparse_dict:
+        return []
+    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
+    try:
+        results = await loop.run_in_executor(
+            None, _run_sparse_search, sparse_dict, limit
+        )
+        return results
+    except Exception as e:
+        error_msg = str(e).lower()
+        # Retry once on gRPC channel closed errors
+        if "closed" in error_msg or "unavailable" in error_msg or "connect" in error_msg:
+            print(f"[zilliz_svc] Connection error, retrying: {e}")
+            _reset_client()
+            try:
+                results = await loop.run_in_executor(
+                    None, _run_sparse_search, sparse_dict, limit
+                )
+                return results
+            except Exception as e2:
+                print(f"[zilliz_svc] Retry failed: {e2}")
+                return []
+        else:
+            print(f"[zilliz_svc] search_sparse error: {e}")
+            return []

docs/README.md ADDED

	@@ -0,0 +1,160 @@

+# ResearchIT Documentation
+All project documentation, organized by purpose. Each document has a specific role in the project lifecycle.
+---
+## 📁 Folder Structure
+```
+docs/
+├── README.md                     ← you are here
+│
+├── TASK-TRACKER.md               ← master checklist (all phases)
+│
+├── research/                     ← deep research & strategic thinking
+│   ├── 01-Vision-Instagram-for-Research.md
+│   ├── 02-Recommendation-System-Blueprint.md
+│   ├── 03-MultiInterest-Recommender-Architecture.md
+│   ├── 04-Technical-Roadmap-Legacy.md
+│   ├── 05-Evolution-Of-Onboarding-And-Interests.md
+│   └── 06-Deep-Research-Verdict.md
+│
+├── phases/                       ← what we built & what we plan to build
+│   ├── PHASE1-Zero-ML-Recommender.md
+│   ├── PHASE2-Hybrid-Search-Plan.md     (prototype reference)
+│   └── PHASE3-Hybrid-Semantic-Search.md (ACTIVE PHASE 3 PLAN)
+│
+├── walkthroughs/                 ← detailed implementation records
+│   ├── 01-Phase1-Code-Tour.md
+│   ├── 02-Phase2-MultiInterest-Recommender.md
+│   ├── 03-Code-Summary-and-Test-Plan.md
+│   └── 04-Next-Steps-and-Phase-Plan.md
+│
+notebooks/                        ← Kaggle reference notebooks (not in docs/)
+├── README.md
+├── 01-bme-upload.ipynb             (BGE-M3 encode + upload 1.6M papers)
+├── 02-bme-arxiv-test.ipynb         (search quality + encoding tests)
+└── 03-check-search-bq-prm.ipynb    (BQ vs PRM benchmark)
+```
+---
+## 📚 Reading Order
+If you're new to this project, read these in order:
+### 1. Understand the Vision
+**[01-Vision-Instagram-for-Research.md](research/01-Vision-Instagram-for-Research.md)**
+The strategic blueprint. Covers competitive landscape, UX patterns from TikTok/Spotify/Pinterest, social dynamics, differentiation features, and business model. This is "why we're building this."
+### 2. Understand the Technical Foundation
+**[02-Recommendation-System-Blueprint.md](research/02-Recommendation-System-Blueprint.md)**
+The initial deep research on recommendation architectures. Covers user modeling, content-based vs collaborative filtering, cold start strategies, and evaluation metrics. This is "how recommendation systems work in general."
+### 3. Understand the Chosen Architecture
+**[03-MultiInterest-Recommender-Architecture.md](research/03-MultiInterest-Recommender-Architecture.md)**
+The definitive architecture RFC. EWMA temporal decay, Ward hierarchical clustering, LightGBM re-ranking, MMR diversity. Validated by Twitter, Pinterest, and Alibaba production systems. **This is the blueprint we implemented.**
+### 4. See the Architectural Evolution
+**[05-Evolution-Of-Onboarding-And-Interests.md](research/05-Evolution-Of-Onboarding-And-Interests.md)**
+Documents the founder's pivot from explicit onboarding subject vectors to implicit behavioral tracking. Captures the original vision vs. the current approach and why the change was made.
+**[06-Deep-Research-Verdict.md](research/06-Deep-Research-Verdict.md)** ⭐ *Latest Research*
+The comprehensive verdict that resolves contradictions across all prior documents. Proposes a **three-layer hybrid** (coarse categories + seed papers + behavioral clustering). Identifies faults in Doc 03 (RRF→quota, α correction). The definitive architectural reference going forward.
+### 5. See What Phase 1 Built
+**[PHASE1-Zero-ML-Recommender.md](phases/PHASE1-Zero-ML-Recommender.md)**
+What was built first: zero-ML-inference recommender using Qdrant's BEST_SCORE Recommend API, SQLite event logging, and arXiv metadata caching. The working foundation.
+**[01-Phase1-Code-Tour.md](walkthroughs/01-Phase1-Code-Tour.md)**
+A file-by-file walkthrough of every piece of the Phase 1 codebase: entry points, routers, services, database, templates, and tests.
+### 6. See What Phase 2 Built
+**[02-Phase2-MultiInterest-Recommender.md](walkthroughs/02-Phase2-MultiInterest-Recommender.md)**
+What was just built: PinnerSage-style multi-interest engine with EWMA profiles, Ward clustering, prefetch+RRF, heuristic re-ranking, and MMR diversity. 88 tests passing.
+### 7. Review Core Code & Automation
+**[03-Code-Summary-and-Test-Plan.md](walkthroughs/03-Code-Summary-and-Test-Plan.md)**
+Summarizes the backend modules and frontend files, and breaks down the three-layer testing strategy (automated, manual, and analytic evaluation).
+### 8. What's Next — The Revised Phase Plan
+**[04-Next-Steps-and-Phase-Plan.md](walkthroughs/04-Next-Steps-and-Phase-Plan.md)** ⭐ *Start Here for Next Steps*
+The master roadmap synthesizing all 6 research documents. Resolves contradictions between docs, captures the founder's thinking evolution, and lays out Phases 3-9 in priority order. Includes the three highest-impact next actions.
+### 9. Phase 3 Plan (Current Focus)
+**[PHASE3-Hybrid-Semantic-Search.md](phases/PHASE3-Hybrid-Semantic-Search.md)** ⭐ *Active Implementation Plan*
+The detailed implementation plan for hybrid semantic search. Covers architecture, all new/modified files, Zilliz schema, BGE-M3 encoding, RRF fusion, HF Spaces deployment, latency budget, and 8-step implementation order.
+### 10. Data Preparation Notebooks
+**[notebooks/README.md](../notebooks/README.md)** — Index + extracted schema details.
+- `01-bme-upload.ipynb` — How 1.6M papers were encoded and uploaded to Qdrant + Zilliz
+- `02-bme-arxiv-test.ipynb` — BGE-M3 encoding + search quality prototype
+- `03-check-search-bq-prm.ipynb` — BQ vs PRM quantization benchmark
+---
+## 📄 Document Status
+| Document | Status | Notes |
+|---|---|---|
+| 01 — Vision (Instagram for Research) | ✅ Complete | Strategic north star |
+| 02 — Recommendation Blueprint | ✅ Complete | Initial research, still relevant |
+| 03 — Multi-Interest Architecture | ✅ Implemented | **The RFC we implemented** — has 4 known faults identified in Doc 06 |
+| 04 — Technical Roadmap | ⚠️ Legacy | Superseded. Kept for reference only |
+| 05 — Evolution of Onboarding | ✅ Complete | Documents the subject-vector → behavioral pivot |
+| 06 — Deep Research Verdict | ✅ Complete | **The definitive architectural reference** — resolves all contradictions |
+| Phase 1 Walkthrough | ✅ Complete | Still accurate for Phase 1 code |
+| Phase 1 Code Tour | ✅ Complete | File-by-file walkthrough |
+| Phase 2 Recommender Walkthrough | ✅ Complete | Multi-interest engine |
+| Codebase Summary & Test Plan | ✅ Complete | Summarizes codebase & testing |
+| Next Steps & Phase Plan | ✅ Complete | **Master roadmap for Phases 3-9** |
+| Phase 2 Hybrid Search Plan | 📋 Prototype reference | Superseded by PHASE3-Hybrid-Semantic-Search.md as the active plan |
+| **Phase 3 Hybrid Semantic Search** | **📋 Active Plan** | **The current implementation guide for Phase 3** |
+| Task Tracker | ✅ Active | Master checklist for all phases |
+---
+## 🏗️ Architecture Evolution
+```
+Phase 1 (completed)
+  └── Qdrant BEST_SCORE with raw paper IDs
+       ├── Works from 1 save
+       └── No temporal awareness, no diversity
+Phase 2a (completed)
+  └── EWMA profile embeddings
+       ├── Long-term (α=0.03) + Short-term (α=0.40) + Negative (α=0.15)
+       └── Activates at 3+ saves
+Phase 2b (completed)
+  └── Ward clustering + Qdrant prefetch+RRF
+       ├── Auto-detects K interests per user (1-7)
+       ├── Single API call, server-side parallel ANN
+       └── Activates at 5+ saves
+Phase 2c (completed)
+  └── Heuristic re-ranking + MMR diversity
+       ├── 5-feature scorer (40% relevance, 25% session, 15% recency, 10% rank, -15% negative)
+       ├── MMR diversity (λ=0.6) + exploration injection (2 papers)
+       └── Upgrade path: swap heuristic for LightGBM at ≥500 interactions
+Phase 3 (implemented: hybrid semantic search)
+  └── Replace arXiv keyword API with vector-based search
+       ├── BGE-M3 query encoding (loaded at startup)
+       ├── Dense (Qdrant) + Sparse (Zilliz) parallel retrieval
+       ├── RRF fusion (correct for search: same query, different retrievers)
+       └── Deployment: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)
+Phase 4 (planned — recommendation pipeline fixes)
+  └── RRF → quota fusion, Hungarian cluster matching, remaining negative-signal
+       wiring, pre-populate metadata store
+Phase 5 (planned — cold-start onboarding)
+  └── arXiv category multiselect + seed paper import + ORCID
+Phase 6+ (future)
+  └── LightGBM lambdarank, evaluation framework, LLM summaries,
+       collaborative filtering, exploration
+```

docs/TASK-TRACKER.md ADDED

	@@ -0,0 +1,369 @@

+# ResearchIT — Master Task Tracker
+> **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.
+> **Last updated**: 2026-04-19
+> **Current phase**: Phase 3 (Hybrid Semantic Search) — implementation complete, pending deployment
+---
+## Legend
+- `[x]` — Done
+- `[/]` — In progress
+- `[ ]` — Not started
+- `[~]` — Intentionally deferred (blocked by data/users/scale)
+- `[!]` — Backlog item (documented, not yet coded)
+---
+## Phase 1: Zero-ML Recommender ✅ COMPLETE
+> *Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.*
+- [x] Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
+  - Collection: `arxiv_bgem3_dense`, 1024-dim dense vectors
+  - File: `app/qdrant_svc.py` → `_get_client()`
+- [x] BEST_SCORE Recommend API (raw paper IDs → Qdrant)
+  - File: `app/qdrant_svc.py` → `recommend()`
+- [x] arXiv keyword API search (placeholder — replaced in Phase 3)
+  - File: `app/arxiv_svc.py` → `search()`
+- [x] arXiv metadata fetching + SQLite cache
+  - File: `app/arxiv_svc.py` → `fetch_metadata_batch()`
+- [x] SQLite database schema (interactions, paper_metadata)
+  - File: `app/db.py` → `init_db()`
+  - WAL mode, async via aiosqlite
+- [x] Cookie-based user identity
+  - File: `app/config.py` → `COOKIE_NAME`
+- [x] User state management (positive/negative deques)
+  - File: `app/user_state.py` → `UserState`
+- [x] Save/Dismiss event logging
+  - File: `app/routers/events.py`
+- [x] HTMX + Jinja2 frontend (search, recs, save, dismiss)
+  - Files: `app/templates/` (base.html, index.html, search.html, saved.html, partials/)
+- [x] Test suite — **55 tests passing**
+**Gaps**: None.
+---
+## Phase 2a: EWMA Profile Embeddings ✅ COMPLETE
+> *Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.*
+- [x] Create `app/recommend/` module with `__init__.py`
+- [x] Create `app/recommend/profiles.py` — EWMA computation + storage
+  - Long-term: α=0.03 ✅ (corrected from 0.10 per Doc 06)
+  - Short-term: α=0.40
+  - Negative: α=0.15
+  - All embeddings L2-normalized
+- [x] Modify `app/db.py` — add `user_profiles` table + `user_clusters` table
+- [x] Modify `app/qdrant_svc.py` — add `get_paper_vectors()` and `search_by_vector()`
+- [x] Modify `app/routers/events.py` — trigger EWMA updates on save/dismiss
+- [x] Modify `app/routers/recommendations.py` — EWMA vector search with Tier 2 fallback
+- [x] Add `numpy` + `scipy` to `requirements.txt`
+- [x] Tests for profiles module — **11 passed**
+- [x] Full test suite — no regressions
+**Doc 06 correction applied**: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recent-biased).
+**Gaps**: None.
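
To make the effect of the three α values concrete, here is a minimal sketch of an EWMA profile update followed by L2 normalization. It uses the textbook rule `(1 - α) · old + α · new`; the exact update in `app/recommend/profiles.py` is not shown in this tracker, so the function names and the 2-dimensional toy vectors are assumptions for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ewma_update(profile, paper_vec, alpha):
    """Blend a new paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    if profile is None:  # first interaction: the paper itself becomes the profile
        return l2_normalize(paper_vec)
    mixed = [(1 - alpha) * p + alpha * v for p, v in zip(profile, paper_vec)]
    return l2_normalize(mixed)


old = [1.0, 0.0]  # toy profile pointing at topic A
new = [0.0, 1.0]  # toy paper embedding pointing at topic B
long_term = ewma_update(old, new, alpha=0.03)   # barely moves toward B
short_term = ewma_update(old, new, alpha=0.40)  # moves substantially toward B
print(long_term, short_term)
```

With α=0.03 a single save shifts the profile only slightly, which is the slow-moving behaviour wanted from the long-term vector, while α=0.40 lets the short-term vector follow the current session.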
+---
+## Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE
+> *Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.*
+- [x] Create `app/recommend/clustering.py` — Ward clustering + medoid extraction
+  - L2-normalize embeddings before Ward ✅ (Doc 06 correction)
+  - Adaptive gap-based threshold (no fixed K)
+  - Medoid representation (real papers, not centroids) ✅
+  - Dynamic K (1–7 clusters, auto-determined)
+  - Recency-weighted importance scores
+- [x] Modify `app/qdrant_svc.py` — add `multi_interest_search()` with prefetch+RRF
+- [x] Modify `app/routers/recommendations.py` — 3-tier cascading pipeline
+  - Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
+  - Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
+  - Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
+- [x] Tests for clustering module — **10 passed**
+- [x] Full test suite — no regressions
+**Doc 06 corrections applied**: L2-normalization before Ward, medoid not centroid.
+**Gaps (deferred to Phase 4)**:
+- [!] RRF → quota fusion (dominant clusters can swamp minority interests)
+- [!] Hungarian matching for cluster ID stability across reclusterings
+---
+## Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE
+> *Added scoring and diversity layers on top of retrieval to produce the final feed.*
+- [x] Create `app/recommend/reranker.py` — 5-feature heuristic scorer
+  - Feature 1: cosine_sim_longterm (weight 0.40)
+  - Feature 2: cosine_sim_shortterm (weight 0.25)
+  - Feature 3: paper_age_days / recency (weight 0.15)
+  - Feature 4: rrf_position (weight 0.10)
+  - Feature 5: cosine_sim_negative (weight -0.15) ✅ (Doc 06 addition)
+- [x] Create `app/recommend/diversity.py` — MMR + exploration injection
+  - MMR with λ=0.6
+  - 2 serendipitous exploration papers per feed
+- [x] Modify `app/routers/recommendations.py` — full 5-step pipeline
+  - Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
+- [x] Tests for reranker + diversity — **13 passed**
+- [x] Full test suite — **88 passed** (86 + 2 pre-existing live Qdrant failures resolved)
+**Doc 06 correction applied**: Negative EWMA profile wired as Feature 5 with 0.15 penalty.
+**Gaps (deferred to Phase 6)**:
+- [~] LightGBM lambdarank model (requires ≥500 labeled interactions)
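
The MMR step with λ=0.6 can be sketched as a greedy loop: each pick maximizes `λ · relevance − (1 − λ) · max similarity to already-picked items`. This is the standard MMR formulation, not a copy of `app/recommend/diversity.py`; the paper IDs, relevance scores, and the topic-based similarity function are all made up for the example.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.6):
    """Greedy MMR: trade relevance against redundancy with the items picked so far."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" are near-duplicates by topic, "c" is less relevant but different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
topic = {"a": "nlp", "b": "nlp", "c": "vision"}


def same_topic(x, y):
    return 1.0 if topic[x] == topic[y] else 0.0


picked = mmr_select(["a", "b", "c"], relevance, same_topic, k=2)
print(picked)
```

In the toy data the second slot goes to the less relevant but novel paper rather than the near-duplicate, which is the diversity effect the λ=0.6 setting is meant to produce.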
+---
+## Phase 2d: Advanced Models ❌ DEFERRED (Blocked by data/users)
+> *These logically belong to the recommendation engine but cannot be built without real user data or scale.*
+- [~] LightGBM lambdarank model — requires ≥500 labeled save/dismiss interactions → Phase 6
+- [~] Collaborative filtering features — requires ≥500 users → Phase 9
+- [~] DPP diversity — explicitly ruled out for v1 by Doc 06 → Phase 9+
+- [~] Two-Tower model — requires GPU + large dataset → Phase 9+
+---
+## Phase 3: Hybrid Semantic Search ✅ COMPLETE
+> *Replace the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.*
+> *Detailed plan: `docs/phases/PHASE3-Hybrid-Semantic-Search.md`*
+> *Prototype reference: `docs/phases/PHASE2-Hybrid-Search-Plan.md`*
+> *Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)*
+### New files created
+- [x] `app/embed_svc.py` — BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570M parameters, ~15s cold)
+  - `encode_query(text)` → `(dense: np.ndarray[1024], sparse: dict)`
+  - LRU cache for repeat queries
+  - Thread-safe, lazy loading with double-check locking
+- [x] `app/zilliz_svc.py` — Zilliz Cloud sparse search client
+  - Collection: `arxiv_bgem3_sparse`
+  - Schema: `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR)
+  - Index: SPARSE_INVERTED_INDEX, metric_type=IP
+  - Sparse format: `{int_token_id: float_weight}` (BGE-M3 lexical weights, NOT string words)
+  - `search_sparse(sparse_dict, limit)` → `list[dict]` with arxiv_id + score
+  - gRPC reconnect handling
+- [x] `app/groq_svc.py` — LLM query rewriter (Groq / llama-3.3-70b)
+  - `rewrite(user_query)` → academic query string
+  - Graceful fallback to original query on error
+  - Academic-detection heuristic to skip unnecessary rewrites
+  - 2s hard timeout
+- [x] `app/hybrid_search_svc.py` — search orchestrator
+  - Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
+  - Each step has independent failure handling
+  - Recency reranking: 0.80 RRF + 0.20 recency
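
The fusion and rerank steps of the orchestrator can be sketched as follows. RRF with K=60 and the 0.80/0.20 split come from the tracker; everything else is an assumption for illustration. In particular, the sketch normalizes RRF scores by the top score before blending and takes recency as a precomputed value in [0, 1]. `hybrid_search_svc.py` may do either differently.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank), rank from 1."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def blend_with_recency(rrf_scores, recency, w_rrf=0.80, w_recency=0.20):
    """Blend normalized RRF with recency in [0, 1] (normalization is an assumption)."""
    top = max(rrf_scores.values())
    return {
        doc_id: w_rrf * (score / top) + w_recency * recency.get(doc_id, 0.0)
        for doc_id, score in rrf_scores.items()
    }


dense = ["p1", "p2", "p3"]    # illustrative dense (Qdrant) ranking
sparse = ["p2", "p4", "p1"]   # illustrative sparse (Zilliz) ranking
fused = rrf_fuse([dense, sparse])

recency = {"p1": 1.0, "p2": 0.0}  # hypothetical recency scores; missing IDs count as 0
final = blend_with_recency(fused, recency)
ranked = sorted(final, key=final.get, reverse=True)
print(ranked)
```

Because RRF only uses ranks, the dense and sparse scores never need to be on a comparable scale, which is why it suits fusing two different retrievers for the same query.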
+### Files modified
+- [x] `app/config.py` — added `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`, `ENCODE_CACHE_SIZE`, search weights, `APP_PORT`
+- [x] `app/qdrant_svc.py` — added `search_dense(dense_vec, limit)` for raw vector search returning scores
+- [x] `app/routers/search.py` — swapped `arxiv_svc.search()` → `hybrid_search_svc.search()` with arXiv fallback
+- [x] `app/main.py` — added graceful BGE-M3 warm-up to lifespan
+- [x] `requirements.txt` — added `FlagEmbedding`, `pymilvus`, `groq`
+- [x] `run.py` — configurable port (7860 default for HF Spaces)
+### Deployment files created
+- [x] `Dockerfile` — HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
+- [x] `.dockerignore` — excludes notebooks, PDFs, databases, caches
+### Implementation steps completed
+- [x] Step 1: BGE-M3 model service (`embed_svc.py`) + unit tests
+- [x] Step 2: Zilliz client (`zilliz_svc.py`)
+- [x] Step 3: Dense search in Qdrant service
+- [x] Step 4: Groq rewriter (`groq_svc.py`)
+- [x] Step 5: Hybrid search orchestrator (`hybrid_search_svc.py`)
+- [x] Step 6: Swap search router
+- [x] Step 7: Model warm-up + deployment config
+- [x] Step 8: Tests — **21 new tests passing** (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)
+### Test results
+- 88 original tests: ✅ All pass (zero regressions)
+- 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
+- 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
+- 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
+- **Total: 123 tests passing**
+### Latency budget
+| Stage | Time |
+|---|---|
+| LLM rewrite (Groq) | ~300ms (skippable) |
+| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
+| Qdrant + Zilliz (parallel) | ~300ms |
+| RRF + rerank | <5ms |
+| **Total (warm)** | **~600ms** |
+---
+## Phase 4: Recommendation Pipeline Fixes 📋 NOT STARTED
+> *Fix the known architectural debt in the recommendation pipeline.*
+> *Estimated effort: ~1 week*
+### 4.1 — Replace RRF with Importance-Weighted Quota Fusion
+- [ ] Create `app/recommend/fusion.py` — quota allocation logic
+  - `w_k = importance_k / sum(importance_k)`
+  - `slot_k = max(floor(F × w_k), F_min=3)` — every cluster gets at least 3 slots
+  - Distribute remainder by largest fractional part
+- [ ] Refactor `_multi_interest_recommend()` in `recommendations.py`
+  - Replace `multi_interest_search()` with per-cluster separate ANN queries
+  - Allocate feed slots proportionally
+  - Deduplicate across clusters (assign to highest-ranked)
+  - MMR over merged union
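
The quota rule in 4.1 can be sketched as below. The share formula, the floor, the minimum of 3 slots, and the largest-fractional-part remainder rule follow the bullets above. The final trimming step (taking slots back from the largest clusters when the minimum pushes the total over the feed size) is an assumption added so the result always sums to `F` when that is feasible; the plan does not specify how that case is handled.

```python
import math


def allocate_slots(importances, feed_size, min_slots=3):
    """Split feed_size slots across clusters in proportion to importance."""
    total = sum(importances)
    shares = [feed_size * imp / total for imp in importances]
    slots = [max(math.floor(s), min_slots) for s in shares]

    # Hand out any leftover slots, largest fractional part first.
    leftover = feed_size - sum(slots)
    by_fraction = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - math.floor(shares[i]),
        reverse=True,
    )
    for i in by_fraction[:max(leftover, 0)]:
        slots[i] += 1

    # Assumption: if the per-cluster minimum overshoots the budget,
    # take slots back from the largest clusters, never going below the minimum.
    while sum(slots) > feed_size and max(slots) > min_slots:
        slots[slots.index(max(slots))] -= 1
    return slots


print(allocate_slots([6, 3, 1], feed_size=20))
print(allocate_slots([5, 4], feed_size=10))
```

In the first call the smallest cluster is lifted from 2 to 3 slots and the dominant cluster gives one back; in the second the single leftover slot goes to the cluster with the larger fractional share.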
+### 4.2 — Pre-populate Metadata Store (Kaggle Bulk Load)
+- [ ] Download Kaggle arXiv metadata dataset (~4GB JSON)
+- [ ] Write bulk-insert script → SQLite `paper_metadata` table (1.6M rows)
+- [ ] arXiv API becomes fallback for genuinely new papers only
+- [ ] **Impact**: Metadata fetch drops from ~7,600ms to <5ms
+### 4.3 — Hungarian Matching for Cluster Stability
+- [ ] Implement Hungarian matching in `clustering.py`
+  - Match new cluster IDs to previous IDs by medoid similarity
+  - Prevents cluster IDs from shuffling between reclusterings
+### 4.4 — Wire Remaining Negative Signal Components
+- [ ] Per-item short-term decay: `score -= α × exp(-dt / τ_neg)` — needs per-item timestamp tracking
+- [ ] Category-level suppression: if ≥3 dismissals hit the same arXiv category within a week, suppress for 2 weeks
+---
+## Phase 5: Cold-Start Onboarding 📋 NOT STARTED
+> *Build the hybrid onboarding pipeline for new users.*
+> *Estimated effort: ~1-2 weeks*
+> *Reference: Doc 06 — "4-37% lift even once behavioral data exists"*
+### 5.1 — arXiv Category Multi-Select
+- [ ] UI screen on first visit: select 3-5 arXiv categories
+- [ ] Store selections in SQLite
+- [ ] Use as pool filter for first 1-3 sessions
+- [ ] Preserve as LightGBM feature permanently
+- [ ] Does NOT create "subject vectors" — just filters
+### 5.2 — Seed Paper Import
+- [ ] Let users search for and save 3-5 seed papers during onboarding
+- [ ] Immediately create EWMA profiles + Ward clusters
+- [ ] Uses hybrid search (Phase 3) for discovery
+### 5.3 — ORCID / Semantic Scholar Import (Stretch)
+- [ ] Accept ORCID ID → fetch authored papers → initial saves
+- [ ] Gives 10-50 papers of signal instantly
+### 5.4 — Popularity Fallback
+- [ ] If user skips all onboarding: serve popularity-per-selected-category feed
+---
+## Phase 6: LightGBM Re-ranker 📋 NOT STARTED
+> *Replace heuristic scorer with a trained LightGBM lambdarank model.*
+> *Blocked by: ≥500 labeled interactions OR citation-graph bootstrap*
+> *Estimated effort: ~2-4 weeks*
+- [ ] Citation-graph pseudo-labels from unarXive 2022 (cited = relevance 2, co-cited = 1, random = 0)
+- [ ] Author-as-user simulation
+- [ ] ~30-50 features including sparse/dense scores, citation count, category match, author overlap
+- [ ] Train LightGBM with `objective='lambdarank'`
+- [ ] Target: ~1ms for 100 candidates
+---
+## Phase 7: Evaluation Framework 📋 NOT STARTED
+> *Build offline and online evaluation before scaling users.*
+> *Estimated effort: ~1 week*
+- [ ] Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
+- [ ] Time-split evaluation on unarXive 2022 + S2ORC
+- [ ] Online metrics (once users exist): CTR, save rate, dwell time, return rate
+---
+## Phase 8: LLM Interest Summaries + Distilled Re-ranker 📋 NOT STARTED
+> *Estimated effort: ~2 weeks*
+- [ ] Claude/Groq interest summaries per cluster (human-readable descriptions)
+- [ ] Distill BGE-reranker-v2-m3 offline → TinyBERT-L2 student (FlashRank recipe)
+- [ ] Deploy student score as LightGBM feature on top-20
+---
+## Phase 9: Exploration + Collaborative Filtering 📋 NOT STARTED
+> *Blocked by: ≥500 users*
+- [ ] Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
+- [ ] LightFM hybrid CF model with switching strategy
+- [ ] Category-level negative suppression
+- [ ] Retrain LightGBM with dismissals as negative labels
+---
+## Appendix: Infrastructure Status
+| Component | Status | Details |
+|---|---|---|
+| **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
+| **Zilliz Cloud** | ✅ Live (wired via `app/zilliz_svc.py`) | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
+| **SQLite** | ✅ Live | interactions, paper_metadata, user_profiles, user_clusters |
+| **HF Spaces** | ✅ Deployment target | Docker SDK, free tier: 16GB RAM, 2 vCPUs, port 7860 |
+| **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
+| **arXiv API** | ✅ Live | Keyword search (fallback when hybrid search returns nothing) + metadata fetch |
+| **BGE-M3 Model** | ✅ Code written, loads at startup | `app/embed_svc.py` — singleton, LRU cache, CPU float32 |
+| **Groq API** | ✅ Code written, fallback-enabled | `app/groq_svc.py` — 2s timeout, academic heuristic skip |
+| **Kaggle Dataset** | ❌ Not downloaded | Phase 4 bulk-loads metadata |
+| **Notebooks** | ✅ Organized | `notebooks/` — 01-upload, 02-test, 03-search-benchmark (see `notebooks/README.md`) |
+### Credentials Status
+| Credential | Status | Env Var | Notes |
+|---|---|---|---|
+| **Qdrant Cloud** | ✅ In `config.py` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
+| **Zilliz Cloud** | ✅ In `config.py` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Added in Phase 3 |
+| **Groq** | ✅ In `config.py` | `GROQ_API_KEY` | Added in Phase 3 |
+| **HF Spaces** | 📋 Not yet created | N/A | Create Space with Docker SDK when ready to deploy |
+---
+## Appendix: Test Suite
+| Test File | Count | Status |
+|---|---|---|
+| `tests/test_profiles.py` | 11 | ✅ Passing |
+| `tests/test_clustering.py` | 10 | ✅ Passing |
+| `tests/test_reranker_diversity.py` | 13 | ✅ Passing |
+| `tests/test_db.py` | — | ✅ Passing |
+| `tests/test_qdrant_svc.py` | — | ✅ Passing |
+| `tests/test_arxiv_svc.py` | — | ✅ Passing |
+| `tests/test_integration.py` | — | ✅ Passing |
+| `tests/test_user_state.py` | — | ✅ Passing |
+| `tests/test_saved.py` | — | ✅ Passing |
+| `tests/test_hybrid_search.py` | 21 | ✅ Passing |
+| `tests/test_search_router.py` | 6 | ✅ Passing |
+| `tests/test_live_search.py` | 8 | ✅ Passing |
+| **Total** | **123** | ✅ |
+| `test_e2e_recs.py` (standalone) | 1 | ✅ E2E simulation |
+---
+## Appendix: Doc 06 Corrections — Tracking
+| Correction | Status | Where |
+|---|---|---|
+| α_long 0.10 → 0.03 | ✅ Applied | `app/recommend/profiles.py:30` |
+| L2-normalize before Ward clustering | ✅ Applied | `app/recommend/clustering.py` |
+| Medoid not centroid | ✅ Applied | `app/recommend/clustering.py` → `_find_medoid()` |
+| Negative EWMA wired into reranking | ✅ Applied | `app/recommend/reranker.py` → Feature 5 |
+| RRF → quota fusion for recommendations | [!] Backlog | Phase 4.1 |
+| Hungarian cluster matching | [!] Backlog | Phase 4.3 |
+| Per-item short-term negative decay | [!] Backlog | Phase 4.4 |
+| Category-level suppression | [!] Backlog | Phase 4.4 |
+| BGE-reranker NEVER in hot path | ✅ Followed | Heuristic scorer used instead |

docs/phases/PHASE1-Zero-ML-Recommender.md ADDED

	@@ -0,0 +1,439 @@

+# Phase 1 — ArXiv Recommender System
+## What Was Built
+A fully working, zero-ML-inference personalized arXiv paper recommender web app.
+Users search arXiv, save papers they like, and get increasingly personalized recommendations driven by Qdrant's native Recommend API — without loading any embedding model at runtime.
+---
+## Architecture Overview
+```
+Browser
+  │  HTMX requests (partial HTML swaps)
+  ▼
+FastAPI (Uvicorn ASGI)
+  ├── GET  /                        → home page (search bar + lazy-load recs)
+  ├── GET  /search?q=               → arXiv search results
+  ├── GET  /saved                   → saved papers page
+  ├── POST /api/papers/{id}/save    → log save, update hot cache
+  ├── POST /api/papers/{id}/not-interested → log dismiss, remove card
+  └── GET  /api/recommendations     → Qdrant Recommend → arXiv metadata
+         │
+         ├── arXiv API (export.arxiv.org)  — search + metadata fetch
+         ├── SQLite WAL (aiosqlite)         — events, ID map, metadata cache
+         └── Qdrant Cloud (BGE-M3 dense)   — Recommend API (1.6M papers)
+```
+No ML model is loaded or executed at runtime in Phase 1. The Qdrant collection (`arxiv_bgem3_dense`) was pre-indexed with BGE-M3 embeddings. Recommendations are generated purely from the vector space: Qdrant's `BEST_SCORE` strategy finds papers near the user's saved papers and away from dismissed ones.
+---
+## File Structure
+```
+ResearchIT-Final/
+├── app/
+│   ├── __init__.py
+│   ├── config.py           # all settings + credentials
+│   ├── db.py               # SQLite layer (3 tables)
+│   ├── arxiv_svc.py        # arXiv API client + metadata cache
+│   ├── user_state.py       # in-memory hot cache per user
+│   ├── qdrant_svc.py       # Qdrant ID lookup + Recommend API
+│   ├── templates_env.py    # shared Jinja2 env (custom filter)
+│   ├── main.py             # FastAPI app + lifespan
+│   └── routers/
+│       ├── search.py          # GET /search
+│       ├── events.py          # POST /api/papers/{id}/save|not-interested
+│       ├── recommendations.py # GET /api/recommendations
+│       └── saved.py           # GET /saved  ← added in Phase 1 completion
+├── app/templates/
+│   ├── base.html           # DaisyUI + TailwindCSS CDN + HTMX CDN
+│   ├── index.html          # home (search bar + recommendation section)
+│   ├── search.html         # full search results page
+│   ├── saved.html          # saved papers page  ← added in Phase 1 completion
+│   └── partials/
+│       ├── paper_card.html         # single paper card
+│       ├── action_buttons.html     # save / not-interested buttons
+│       ├── search_results.html     # HTMX partial for search
+│       ├── recommendations.html    # HTMX partial for recommendations (+ refresh btn)
+│       └── empty_recs.html         # shown when not enough saves yet (+ check btn)
+├── tests/
+│   ├── test_user_state.py  # 10 unit tests
+│   ├── test_db.py          # 7 async integration tests
+│   ├── test_arxiv_svc.py   # 11 tests (normalise, parse, live API, cache)
+│   ├── test_qdrant_svc.py  # 5 tests (cache warm, live lookup, live recommend)
+│   ├── test_integration.py # 12 full HTTP tests via FastAPI TestClient
+│   └── test_saved.py       # 10 tests for /saved page  ← added in Phase 1 completion
+├── run.py                  # uvicorn entry point
+├── requirements.txt
+└── pytest.ini
+```
+---
+## How to Run
+### Prerequisites
+```bash
+pip install -r requirements.txt
+```
+### Start the server
+```bash
+python run.py
+```
+Open `http://localhost:8000`.
+### Run the tests
+```bash
+python -m pytest
+```
+Full suite: **55 tests** across 6 files.
+Some tests hit live services (arXiv API, Qdrant Cloud) and run by default. To skip them:
+```bash
+python -m pytest -m "not live"
+```
+---
+## Core Modules
+### `app/config.py`
+Single source of truth for all settings. Every credential and tunable is here; they can all be overridden via environment variables.
+| Setting | Default | Purpose |
+|---|---|---|
+| `QDRANT_URL` | Qdrant Cloud EU | BGE-M3 dense collection endpoint |
+| `QDRANT_COLLECTION` | `arxiv_bgem3_dense` | 1,596,587 integer-ID points |
+| `DB_PATH` | `interactions.db` | SQLite file path |
+| `ARXIV_API_URL` | `https://export.arxiv.org/api/query` | arXiv Atom feed |
+| `REC_LIMIT` | 10 | Papers shown per recommendation batch |
+| `REC_POSITIVE_LIMIT` | 20 | Max positive examples kept in memory per user |
+| `REC_MIN_POSITIVES` | 1 | Saves needed before showing recs |
+| `COOKIE_NAME` | `arxiv_user_id` | UUID4, 1-year cookie |
+---
+### `app/db.py`
+SQLite with WAL mode + `PRAGMA synchronous=NORMAL` for safe concurrent reads under asyncio. Three tables:
+**`interactions`** — append-only event log. Every save, dismiss, click and view lands here. Two indexes: `(user_id, timestamp DESC)` for history fetch, `(user_id, paper_id)` for deduplication. The `source` field tracks where the action came from (`"search"`, `"recommendation"`, or `"saved"`).
+**`paper_qdrant_map`** — maps `arxiv_id TEXT → qdrant_point_id INTEGER`. Populated lazily on first save. Once an ID is mapped it is reused forever — the Qdrant collection is static.
+**`paper_metadata`** — SQLite cache of arXiv API responses. Stores title, abstract, authors (JSON), category, published date. Prevents redundant API calls. There is no TTL enforcement in Phase 1 (metadata rarely changes).
+---
+### `app/arxiv_svc.py`
+Thin async client around `https://export.arxiv.org/api/query` (Atom XML feed).
+**ID normalisation** — the arXiv API returns IDs as full URLs with version suffixes, e.g. `http://arxiv.org/abs/1706.03762v5`. `_normalise_id()` strips the URL prefix and `v5` suffix so we always work with bare IDs like `1706.03762`. Old-format IDs (`math/0702129`) are also handled.
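The normalisation step described above can be sketched as follows. This is a minimal illustration, not the project's actual `_normalise_id()`: the function name `normalise_id` and both regular expressions are assumptions made for the example.

```python
import re

# Hypothetical stand-in for _normalise_id(); the real helper may differ.
_ABS_PREFIX = re.compile(r"^https?://arxiv\.org/abs/")
_VERSION_SUFFIX = re.compile(r"v\d+$")


def normalise_id(raw: str) -> str:
    """Reduce an arXiv URL or versioned ID to a bare arXiv ID."""
    # Drop the URL prefix if present, then any trailing version marker.
    bare = _ABS_PREFIX.sub("", raw.strip())
    return _VERSION_SUFFIX.sub("", bare)


print(normalise_id("http://arxiv.org/abs/1706.03762v5"))
print(normalise_id("math/0702129"))
```

Old-format IDs pass through unchanged because the version pattern only matches a trailing `v` followed by digits.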
+**`search(query)`** — fetches up to 10 results, writes them all into the metadata cache, returns list of paper dicts.
+**`fetch_metadata_batch(ids)`** checks SQLite first, then fetches missing IDs from arXiv in batches of 20 with a 0.35s gap between requests, which keeps the client just under 3 requests per second.
+---
+### `app/user_state.py`
+Pure in-memory dictionary of `UserState` dataclasses, one per `user_id`. Each state holds two `deque`s:
+- `positives` — maxlen `config.REC_POSITIVE_LIMIT` (20), most-recent first
+- `negatives` — maxlen 50, most-recent first
+**Mutual exclusion**: saving a paper removes it from negatives and vice versa.
+**Lazy hydration**: `ensure_loaded()` is called once per user per server process. It reads the last 70 interactions from SQLite and replays them into the two deques. After that, all reads are O(1) dict lookups in memory.
+**`MAX_POSITIVES` is sourced from `config.REC_POSITIVE_LIMIT`** so the deque cap and the config are always in sync. Changing `REC_POSITIVE_LIMIT` in config automatically changes how many positives are kept in memory.
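A minimal sketch of the two-deque state with mutual exclusion, assuming the caps described above (20 positives, 50 negatives). The private `_move` helper is hypothetical, and the real `UserState` likely carries more fields and logic.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_POSITIVES = 20  # stands in for config.REC_POSITIVE_LIMIT
MAX_NEGATIVES = 50


@dataclass
class UserState:
    positives: deque = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
    negatives: deque = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))

    def record_positive(self, paper_id: str) -> None:
        self._move(paper_id, self.positives, self.negatives)

    def record_negative(self, paper_id: str) -> None:
        self._move(paper_id, self.negatives, self.positives)

    @staticmethod
    def _move(paper_id: str, target: deque, other: deque) -> None:
        # Mutual exclusion: a paper lives in at most one deque.
        if paper_id in other:
            other.remove(paper_id)
        # No duplicates: re-adding moves the paper to the front.
        if paper_id in target:
            target.remove(paper_id)
        # Most-recent first; a full deque drops its oldest entry on the right.
        target.appendleft(paper_id)
```

Because `appendleft` on a bounded deque discards from the opposite end, the 21st save evicts the oldest one without any extra bookkeeping.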
+---
+### `app/qdrant_svc.py`
+Two responsibilities:
+**`lookup_qdrant_ids(arxiv_ids)`** — translates arxiv string IDs to Qdrant integer point IDs. Checks `paper_qdrant_map` SQLite table first. For cache misses, calls `client.scroll()` with a `MatchAny` payload filter on the `arxiv_id` field (requires the keyword index created during setup). Persists new mappings back to SQLite.
+**`recommend(positive_ids, negative_ids, seen_ids)`** — translates both lists to integer IDs, then calls `client.query_points()` with:
+```python
+RecommendQuery(
+    recommend=RecommendInput(
+        positive=pos_ids,
+        negative=neg_ids,
+        strategy=RecommendStrategy.BEST_SCORE,
+    )
+)
+```
+Fetches `limit * 2` results so that already-seen papers can be filtered out in Python before returning the final `limit` results.
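The over-fetch-then-filter step might look like the sketch below. The helper name `filter_seen` is an assumption for illustration; the real filtering happens inside `recommend()`.

```python
def filter_seen(candidate_ids, seen_ids, limit):
    """Drop already-seen papers, keep Qdrant's ranking order, cap at `limit`."""
    seen = set(seen_ids)  # set membership keeps the filter O(n)
    fresh = [pid for pid in candidate_ids if pid not in seen]
    return fresh[:limit]


# Ask Qdrant for limit * 2 candidates, then trim to the final limit.
print(filter_seen([11, 12, 13, 14, 15, 16], seen_ids={12, 14}, limit=3))
```

If more than half of the over-fetched candidates were already seen, this returns fewer than `limit` results, so a caller may want to handle a short list.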
+**Why sync Qdrant client inside `run_in_executor`?** The `qdrant-client` async client caused problems in some environments. Running the sync client in a thread pool keeps the asyncio event loop unblocked while each blocking call is in flight.
+---
+### `app/routers/recommendations.py`
+Fetches `REC_LIMIT` candidates from Qdrant (already filtered for seen papers inside `qdrant_svc.recommend()`), then fetches their metadata and renders the cards. No year filtering — classic foundational papers (2015, 2017, etc.) are valid and valuable recommendations.
+---
+### `app/routers/saved.py`
+`GET /saved` loads the user's current `positive_list` from `user_state`, fetches metadata for all of them via `arxiv_svc.fetch_metadata_batch()`, and renders them using the same `paper_card.html` partial with `saved=True`. The Remove button on each card works identically to everywhere else — it POSTs to `not-interested` and HTMX removes the card.
+---
+### `app/templates_env.py`
+Shared Jinja2 `Environment` instance imported by all routers. Registers one custom filter:
+**`tojson_parse`** — converts a JSON string stored in SQLite (e.g. authors array) back to a Python list. Returns `[]` on any parse error. This prevents the template from crashing when the DB column contains malformed JSON.
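A sketch of such a filter using only the standard library. Returning `[]` for valid JSON that is not a list is an assumption added here; the text above only specifies the behaviour on parse errors.

```python
import json


def tojson_parse(value):
    """Parse a JSON string from SQLite into a list; never raise in a template."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers malformed JSON.
        return []
    # Assumption: anything that is not a list is treated as unusable.
    return parsed if isinstance(parsed, list) else []


# Registration would be along the lines of:
#   env.filters["tojson_parse"] = tojson_parse
print(tojson_parse('["A. Vaswani", "N. Shazeer"]'))
print(tojson_parse("{not valid json"))
```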
+---
+## Frontend Design
+Zero build step. CSS is loaded from the TailwindCSS CDN and styled with DaisyUI components. JavaScript is provided entirely by HTMX — no custom JS written.
+**HTMX patterns used:**
+| Pattern | Where | Effect |
+|---|---|---|
+| `hx-get="/search" hx-trigger="input changed delay:300ms"` | Search bar | Live search as you type |
+| `hx-get="/api/recommendations" hx-trigger="load"` | Recs section | Lazy-load recs after page paint |
+| `hx-post=".../save" hx-target="#actions-{id}" hx-swap="innerHTML"` | Save button | Replace button group with "Saved" state in-place |
+| `hx-post=".../not-interested" hx-target="#paper-{id}" hx-swap="outerHTML swap:200ms"` | Dismiss button | Animate-remove the whole card |
+| `hx-get="/api/recommendations" hx-target="#rec-section"` | Refresh button | Reload recommendations after saving more papers |
+**Source tracking**: every action button carries a `source` field in `hx-vals` that is logged to the DB. Values: `"search"` (from search results), `"recommendation"` (from the recs section), `"saved"` (from the saved papers page). The `source` is forwarded back to the rendered partial after a save so subsequent actions from that partial carry the correct source.
+---
+## Tests
+### `tests/test_user_state.py` — 10 unit tests
+Pure unit tests, no I/O, no fixtures needed.
+- `test_add_positive` — paper appears in `positive_list`
+- `test_add_negative` — paper appears in `negative_list`
+- `test_mutual_exclusion_pos_to_neg` — saving then dismissing the same paper moves it
+- `test_mutual_exclusion_neg_to_pos` — dismissing then saving moves it back
+- `test_no_duplicate_positives` — saving same paper twice only stores it once
+- `test_ordering_positives` — most recently saved paper is first
+- `test_maxlen_eviction_positives` — 21st save evicts the oldest
+- `test_has_enough_for_recs` — False at 0 saves, True at REC_MIN_POSITIVES
+- `test_all_seen` — union of positives and negatives
+### `tests/test_db.py` — 7 async tests
+Each test uses a fresh `tmp_path` SQLite file via `monkeypatch.setattr(config, "DB_PATH", ...)`. DB state never bleeds between tests.
+- `test_init_creates_tables` — all 3 tables present after init
+- `test_log_and_retrieve_interaction` — round-trip save + fetch
+- `test_filter_by_event_type` — only `save` rows returned when filtered
+- `test_qdrant_id_roundtrip` — save and retrieve a point ID
+- `test_qdrant_ids_batch` — batch fetch returns correct dict
+- `test_metadata_cache_roundtrip` — single paper insert + fetch
+- `test_metadata_cache_batch` — multiple papers, batch fetch
+### `tests/test_arxiv_svc.py` — 11 tests
+- 7 parametrised `_normalise_id` tests covering URL form, bare ID, `v` suffix, old-style slash IDs
+- `test_parse_entry` — parses XML entry string directly
+- `test_fetch_metadata_live` — real arXiv API call for `0704.0002`
+- `test_search_live` — real arXiv API search for "attention is all you need"
+- `test_fetch_metadata_cache_hit` — mocked HTTP to verify SQLite cache is used on second call
+### `tests/test_qdrant_svc.py` — 5 tests
+- `test_lookup_cache_warm` — if SQLite already has the ID, Qdrant is never called
+- `test_lookup_cache_miss_fetches_and_persists` — missing ID triggers Qdrant scroll, result saved to SQLite
+- `test_recommend_empty_no_positives` — returns `[]` immediately without hitting Qdrant
+- `test_lookup_real_qdrant` — live lookup: `0704.0002` → point ID 0
+- `test_recommend_real_qdrant` — live recommend: saves `0704.0002`, gets real recommendations back
+### `tests/test_integration.py` — 12 full HTTP tests
+Uses FastAPI `TestClient` (Starlette's synchronous test client). Isolated SQLite per test via monkeypatching.
+- `test_home_returns_200` — GET / works
+- `test_home_sets_cookie` — user ID cookie is set
+- `test_search_empty_query` — no query = no results shown
+- `test_search_with_query_htmx` — HTMX header returns partial (no `<html>` tag)
+- `test_search_real_query` — live arXiv search via TestClient
+- `test_save_paper_logs_interaction` — POST save → DB row created
+- `test_save_paper_returns_saved_state` — response HTML contains "Saved"
+- `test_not_interested_returns_empty` — POST dismiss → 200 empty body
+- `test_not_interested_updates_state` — state reflects dismiss
+- `test_recommendations_empty_for_new_user` — no saves = empty recs partial
+- `test_recommendations_after_save` — mocked Qdrant + arXiv returns recommendation cards (year ≥ 2022)
+- `test_full_pipeline_smoke` — search → save → dismiss → recs, all in sequence
+### `tests/test_saved.py` — 10 tests
+- `test_saved_page_returns_200` — GET /saved works
+- `test_saved_page_sets_cookie` — cookie is set on fresh visit
+- `test_saved_page_empty_for_new_user` — shows empty-state message
+- `test_saved_page_shows_paper_after_save` — paper appears after saving
+- `test_saved_page_shows_correct_count` — badge shows correct count for 2 saves
+- `test_remove_paper_updates_state` — dismiss moves paper to negatives
+- `test_remove_returns_empty_response` — empty response body (HTMX removes card)
+- `test_save_source_is_logged` — source field persisted to DB
+- `test_dismiss_source_saved_is_logged` — dismiss from saved page logs correctly
+- `test_old_paper_filtered_from_recommendations` — 2017 paper excluded, 2023 paper included
+---
+## Design Decisions
+### No model loading at runtime
+Phase 1 is deliberately zero-ML-inference. The BGE-M3 embeddings were pre-indexed into Qdrant by a notebook (`bme-arxiv-test.ipynb`). At request time we only need integer point IDs — no vectors, no tokeniser, no GPU.
+This makes:
+- Cold start instant (< 1 second)
+- Memory footprint tiny (< 100 MB)
+- The recommendation quality surprisingly good — Qdrant's `BEST_SCORE` strategy in a well-indexed 1024-dim space works well even without query encoding
+### arXiv API + SQLite as the metadata layer
+Qdrant payloads contain only `arxiv_id`. Title, abstract, authors, and category all come from the arXiv API and are cached in SQLite. This was the only viable option given the payload structure, and it has a nice property: the cache warms up naturally as users search, so recommendation metadata is usually already cached by the time it is needed.
+### Lazy arxiv_id → Qdrant point ID mapping
+We don't pre-populate the SQLite map for all 1.6M papers. Instead, when a user saves a paper a background asyncio task (`asyncio.create_task`) fires a Qdrant scroll filter to find that paper's point ID. This is a one-time cost per unique paper. Subsequent recommendations are instant since the ID is cached.
+### Cookie-based user identity
+No login, no accounts. A UUID4 is generated on first visit and stored in a 1-year cookie. This is intentional for Phase 1 — simple to implement, good enough for a research tool, easy to replace with real auth in Phase 2.
+### Separation of DB writes and in-memory reads
+Every save/dismiss writes to SQLite synchronously in the event handler. The `user_state` module maintains in-memory deques as a read cache. Because asyncio runs all handlers on one thread, an in-memory update cannot be interrupted unless it awaits, so the cache needs no locks. The cache is loaded lazily on first access and then kept live by direct calls to `record_positive` / `record_negative`.
+### Source tracking on every action
+Every save and dismiss carries a `source` field (`"search"`, `"recommendation"`, `"saved"`) that is logged to the `interactions` table. This enables future analytics about which surface drives the most engagement. After a save, the `source` is forwarded to the rendered `action_buttons.html` partial so that any subsequent Remove action from the same card also carries the correct source.
+---
+## Bugs Found and Fixed During Implementation
+### 1. arXiv API 301 redirect
+**Symptom**: `httpx` raised an HTTP error on all arXiv requests.
+**Cause**: `http://export.arxiv.org` returns 301 → HTTPS. `httpx` doesn't follow redirects by default.
+**Fix**: Changed `ARXIV_API_URL` to `https://export.arxiv.org/api/query` and added `follow_redirects=True` to all `httpx.AsyncClient` calls.
+### 2. Jinja2 UndefinedError in `action_buttons.html`
+**Symptom**: `POST /api/papers/{id}/save` returned 500 when rendering the button partial.
+**Cause**: The template used `paper_id | default(paper.arxiv_id)`. Jinja2's `default()` filter eagerly evaluates *both sides* before choosing, so `paper.arxiv_id` was evaluated even when `paper` was not in the template context (the events router only passes `paper_id`).
+**Fix**: Changed to `{% set pid = paper_id if paper_id is defined else paper.arxiv_id %}` which short-circuits correctly.
+### 3. `action_buttons.html` hardcoded `source: "search"` everywhere
+**Symptom**: Every action from the recommendations section or the saved page was logged with `source="search"` in the DB.
+**Cause**: `hx-vals='{"source": "search"}'` was hardcoded — the `source` context variable passed by the parent template (`"recommendation"`, `"saved"`) was never read.
+**Fix**: Added `{% set _source = source if source is defined else "search" %}` and used `{{ _source }}` in all `hx-vals`. Also fixed `events.py` to forward the received `source` form field back to the `action_buttons.html` template context after a save.
+### 4. Qdrant `recommend()` deprecated
+**Symptom**: `DeprecationWarning` and incorrect results.
+**Cause**: `client.recommend()` is the old API. `PointIdsList` has a `points` field, not `positive`/`negative`.
+**Fix**: Switched to `client.query_points()` with `RecommendQuery(recommend=RecommendInput(...))` — the current recommended pattern.
+### 5. Qdrant payload filter fails without index
+**Symptom**: Qdrant returned an error: *"Index required but not found for field arxiv_id"*.
+**Cause**: Filtering on a payload field requires a payload index. The collection was created without one.
+**Fix**: Created a keyword index on the `arxiv_id` field:
+```python
+client.create_payload_index(
+    collection_name=collection,
+    field_name="arxiv_id",
+    field_schema=PayloadSchemaType.KEYWORD,
+    wait=False,
+)
+```
+This runs in background on Qdrant and persists permanently.
+### 6. Stella Qdrant clusters dead
+**Symptom**: All requests to `49b5f0e9-...` and `65c05851-...` clusters returned 404.
+**Cause**: Those clusters (used for `stella-400M-v5` embeddings in the notebooks) were deleted or expired.
+**Fix**: Pivoted entirely to the BGE-M3 dense collection at `2fe1965b-...` which is alive and has 1,596,587 points.
+### 7. TemplateResponse deprecation warning
+**Symptom**: Deprecation warning on every request.
+**Cause**: Old Starlette API: `TemplateResponse("name.html", {"request": request, ...})`.
+**Fix**: Updated all calls to the new positional form: `TemplateResponse(request, "name.html", context_without_request)`.
+### 8. Test assertion too strict for old-style arXiv IDs
+**Symptom**: `test_normalise_id` parametrized case for `math/0702129` failed with `AssertionError`.
+**Cause**: The assertion `assert "." in r or r.isdigit()` fails for slash-style IDs which contain neither.
+**Fix**: Changed assertion to `assert isinstance(r, str) and len(r) > 0`.
+### 9. `MAX_POSITIVES` in `user_state.py` was hardcoded
+**Symptom**: Changing `REC_POSITIVE_LIMIT` in config had no effect on the actual deque size.
+**Cause**: `MAX_POSITIVES = 20` was a bare integer literal, not referencing config.
+**Fix**: Changed to `MAX_POSITIVES = config.REC_POSITIVE_LIMIT` so the two values are always in sync.
+---
+## What Phase 2 Adds
+See [PHASE2-Hybrid-Search-Plan.md](PHASE2-Hybrid-Search-Plan.md) for the full plan. The short version:
+1. **Semantic search** — replace arXiv keyword API with BGE-M3 + Qdrant dense search + Zilliz sparse search (hybrid)
+2. **LLM query rewriting** — Groq `llama-3.3-70b` converts casual queries into academic keyword strings before encoding
+3. **RRF + reranker** — fuses dense and sparse results, applies citation + recency signals
+4. **New service files** — `embed_svc.py`, `zilliz_svc.py`, `groq_svc.py`, `hybrid_search_svc.py`
+5. Everything else (recommendations, user state, templates, saved page, event logging) stays unchanged
+---
+## Test Results (Final)
+```
+tests/test_user_state.py      10 passed
+tests/test_db.py               7 passed
+tests/test_arxiv_svc.py       11 passed
+tests/test_qdrant_svc.py       5 passed
+tests/test_integration.py     12 passed
+tests/test_saved.py            9 passed
+─────────────────────────────────────────
+54 passed in ~42s
+```
+All routes registered:
+```
+GET  /
+GET  /search
+GET  /saved
+POST /api/papers/{paper_id}/save
+POST /api/papers/{paper_id}/not-interested
+GET  /api/recommendations
+```

docs/phases/PHASE2-Hybrid-Search-Plan.md ADDED

	@@ -0,0 +1,483 @@

+# Phase 2 — Hybrid Semantic Search
+## What the notebook script (`bgem3_search_client.py`) does
+This is the research prototype that Phase 2 must productionise into the FastAPI app.
+Understanding it completely is the prerequisite to knowing what to build.
+---
+## The Script's Full Pipeline
+```
+User query (casual language)
+      │
+      ▼
+[1] LLM Query Rewriter (Groq / llama-3.3-70b)
+      │  "whisper speech recognition" → "Whisper OpenAI ASR multilingual"
+      ▼
+[2] BGE-M3 Encoder (CPU, single forward pass)
+      │  produces TWO outputs simultaneously:
+      │  ├── dense_vec  : float32[1024]   — semantic meaning
+      │  └── sparse_dict: {token_id: weight}  — lexical weights (BGE-M3's own sparse)
+      ▼
+[3a] Qdrant dense search          [3b] Zilliz sparse search
+      │  HNSW ANN on 1024-dim           │  IP on sparse vectors
+      │  arxiv_bgem3_dense              │  arxiv_bgem3_sparse
+      │  returns: [(point_id,           │  returns: [(point_id,
+      │            arxiv_id, score)]    │            arxiv_id, score)]
+      │                                 │
+      └──────────── parallel ───────────┘
+                        │
+                        ▼
+[4] RRF Fusion  (Reciprocal Rank Fusion, K=60)
+      │  score[point] = 1/(60+rank_dense) + 1/(60+rank_sparse)
+      │  pure rank-based, no score normalisation needed
+      ▼
+[5] Reranker  (citation + recency + semantic)
+      │  final = 0.70 × norm_rrf  +  0.25 × log_citations  +  0.05 × recency
+      ▼
+[6] Return top-K arxiv_ids → fetch metadata → display
+```
+---
+## The Single Most Important Insight
+BGE-M3 is a **dual-output encoder**: one forward pass produces both a dense vector and a
+sparse lexical-weights dict. The sparse side is NOT BM25; it is BGE-M3's own learned
+sparse representation (`lexical_weights` from FlagEmbedding).
+This has a critical consequence: **the Zilliz collection was indexed with BGE-M3 sparse
+outputs, so query-time sparse encoding must also use BGE-M3**. You cannot use a BM25
+tokeniser for the Zilliz queries. The model must stay loaded in the app and run once per search.
+```python
+# From the script — one call, two outputs:
+out = model.encode(
+    [text],
+    return_dense=True,
+    return_sparse=True,       # ← BGE-M3 lexical weights, not BM25
+    return_colbert_vecs=False,
+    max_length=512,
+)
+dense  = out["dense_vecs"][0]          # shape (1024,) float32
+sparse = out["lexical_weights"][0]     # dict {token_id: float}
+```
+---
+## Gap Analysis: What Phase 1 Has vs What Phase 2 Needs
+### Phase 1 search (current)
+```
+User query (exact string)
+      │
+      ▼
+arXiv keyword API  (https://export.arxiv.org/api/query?search_query=all:query)
+      │  standard text match, not semantic
+      ▼
+SQLite metadata cache → display cards
+```
+**Problems with Phase 1 search:**
+- Keyword-only: "when the AI makes up fake facts" returns nothing useful
+- No awareness of what the user means, only what they typed
+- Misses papers that use different terminology for the same concept
+- No ranking signal beyond arXiv's own relevance score
+### Phase 2 search (target)
+```
+User query
+      ▼
+LLM rewrite → encode → dense search ‖ sparse search → RRF → rerank
+      ▼
+arXiv API (metadata only, for titles/abstracts not in Qdrant/Zilliz payloads)
+```
+---
+## What Each Component Replaces or Adds
+| Component | Phase 1 | Phase 2 | Notes |
+|---|---|---|---|
+| Search backend | arXiv API keyword | BGE-M3 + Qdrant + Zilliz | Core change |
+| Query preprocessing | None | Groq LLM rewrite | Optional, but a notable quality gain |
+| Metadata source | arXiv API + SQLite | arXiv API + SQLite | **Unchanged** — Qdrant/Zilliz payloads only store `arxiv_id` |
+| Recommendations | Qdrant Recommend API | Qdrant Recommend API | **Unchanged** — already works well |
+| User state | In-memory deque + SQLite | In-memory deque + SQLite | **Unchanged** |
+| Citation signal | None | Semantic Scholar API or Zilliz payload | New |
+| Model at runtime | None | BGE-M3 (~570M parameters, ~300ms CPU) | Big change: cold start is now slow |
+---
+## Files That Change in Phase 2
+### New files to create
+**`app/embed_svc.py`** — BGE-M3 model singleton
+```
+Responsibilities:
+  - Load BAAI/bge-m3 once at startup (or lazily on first query)
+  - encode_query(text) → (dense: np.ndarray[1024], sparse: dict)
+  - Cache recent encode results to avoid re-encoding the same query
+  - CPU float32 (no GPU dependency)
+```
+**`app/zilliz_svc.py`** — Zilliz sparse search client
+```
+Responsibilities:
+  - Connect to Zilliz serverless (pymilvus MilvusClient)
+  - search_sparse(sparse_dict, fetch_k) → list[(point_id, arxiv_id, score)]
+  - Handle gRPC reconnects (closed-channel error, seen in the script)
+```
+**`app/groq_svc.py`** — LLM query rewriter
+```
+Responsibilities:
+  - Groq API client (lazy init)
+  - rewrite(user_query) → academic_query  (8-15 words, preserves model names)
+  - Falls back to original query on failure or timeout
+  - Same few-shot prompt as in the script (already tuned)
+```
+**`app/hybrid_search_svc.py`** — orchestrates the full pipeline
+```
+Responsibilities:
+  - Calls groq_svc.rewrite()
+  - Calls embed_svc.encode_query()
+  - Calls qdrant_svc.search_dense() and zilliz_svc.search_sparse() in parallel
+    (asyncio.gather with run_in_executor for both sync clients)
+  - RRF merge
+  - Citation+recency rerank
+  - Returns list of arxiv_ids
+```
+### Files that change
+**`app/config.py`** — add:
+```python
+ZILLIZ_URI        = os.getenv("ZILLIZ_URI", "https://in03-0c01933b42a8df1.serverless...")
+ZILLIZ_TOKEN      = os.getenv("ZILLIZ_TOKEN", "...")
+ZILLIZ_COLLECTION = "arxiv_bgem3_sparse"
+GROQ_API_KEY      = os.getenv("GROQ_API_KEY", "gsk_...")
+BGE_M3_DEVICE     = os.getenv("BGE_M3_DEVICE", "cpu")
+ENCODE_CACHE_SIZE = 128            # LRU cache for encoded queries
+SEARCH_FETCH_K_MULTIPLIER = 6      # candidates = top_k × 6 before rerank
+SEARCH_RRF_K      = 60             # RRF denominator
+SEARCH_CITE_WEIGHT   = 0.25
+SEARCH_RECENCY_WEIGHT = 0.05
+SEARCH_SEMANTIC_WEIGHT = 0.70      # must sum to 1.0 with above
+```
+**`app/qdrant_svc.py`** — add `search_dense()` (vector search, separate from existing `recommend()`):
+```python
+async def search_dense(dense_vec: np.ndarray, fetch_k: int) -> list[tuple]:
+    """
+    ANN dense search. Returns [(point_id, arxiv_id, score)].
+    This is different from recommend() — it takes a raw query vector,
+    not a list of positive paper IDs.
+    """
+```
+**`app/routers/search.py`** — replace `arxiv_svc.search()` call with `hybrid_search_svc.search()`.
+The router signature and template rendering stay identical — the change is entirely inside the router.
+**`app/main.py`** — add model warm-up to lifespan:
+```python
+@asynccontextmanager
+async def lifespan(app):
+    await db.init_db()
+    embed_svc.get_model()   # warm up BGE-M3 at startup, not on first request
+    yield
+```
+---
+## The Metadata Problem (and Why the CSV Doesn't Apply to the Web App)
+The script uses two CSV files loaded from Kaggle paths:
+```python
+CSV_PATH          = "/kaggle/input/.../arxiv_comprehensive_papers.csv"
+CITATION_CSV_PATH = "/kaggle/input/.../arxiv_citations_summary.csv"
+```
+These are Kaggle notebook paths that don't exist in the web app. The script's `load_csv()`
+and `lookup()` functions are the notebook's equivalent of our `arxiv_svc.py` + `db.py`.
+**The web app already handles this better**: arXiv API → SQLite cache. The `fetch_metadata_batch()`
+function in `arxiv_svc.py` is already doing what the CSV was doing, but dynamically and with
+no dependency on a pre-downloaded file.
+**What we do NOT have: citation counts.** The script uses `arxiv_citations_summary.csv` to
+get citation counts per paper, which feeds the 0.25-weighted citation component of the reranker.
+**Options for citation data in Phase 2:**
+| Option | Pros | Cons |
+|---|---|---|
+| Semantic Scholar API | Free, up-to-date | Extra HTTP call per search, rate limited |
+| Store in Zilliz payload | Zero extra calls | Requires re-indexing Zilliz collection |
+| Store in SQLite when fetched | Persistent cache, one-time cost | Stale, extra complexity |
+| Skip citation reranking initially | Simple | Weaker rerank quality |
+**Recommended**: Start by **skipping citation reranking** and use only two components:
+`0.80 × norm_rrf + 0.20 × recency`. Add citations later, once the data source is decided.
+The RRF signal alone is strong; the script shows citations mostly help surface seminal papers.
+---
+## The Reranker — What It Actually Does
+```python
+final = (
+    0.70 × (rrf_score / max_rrf_score)         # semantic rank
+  + 0.25 × (log1p(citations) / max_log_cite)   # popularity (log-normalised)
+  + 0.05 × exp(-0.06 × paper_age_years)         # freshness (soft decay)
+)
+```
+Why log-normalise citations: `log1p(214251) ≈ 12.3` vs `log1p(500) ≈ 6.2`. The ratio is
+only 2×, so a mega-cited but semantically irrelevant paper can't overpower a relevant
+low-cited one. Without log, a 200k-citation paper would always win regardless of relevance.
+Why `0.05` recency and not higher: The script was tuned to stop penalising classics.
+AlphaFold (2021), RLHF (2017), ResNet (2015) — you want these to appear even for
+non-recent queries. With `lambda=0.06`, a 10-year-old paper still scores 0.55 on recency,
+so it loses only a little to a 2024 paper.
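The two numeric claims above (log-normalisation compresses the citation gap to roughly 2×, and a 10-year-old paper keeps about 0.55 of its recency score) can be checked with a few lines of arithmetic. This is only a sanity-check sketch; the citation counts and `lambda = 0.06` come from the text, while the extra ages in the loop are illustrative.

```python
from math import exp, log1p

# Citation counts taken from the paragraph above.
mega, modest = 214_251, 500
raw_ratio = mega / modest                  # gap between the raw counts
log_ratio = log1p(mega) / log1p(modest)    # gap after log1p compression

# Recency term with lambda = 0.06 for a few paper ages (in years).
recency = {age: exp(-0.06 * age) for age in (0, 5, 10, 20)}

print(f"raw ratio: {raw_ratio:.1f}x, log ratio: {log_ratio:.2f}x")
for age, score in recency.items():
    print(f"{age:>2} years old -> recency {score:.2f}")
```

Because the recency term is multiplied by 0.05 in the full formula, even the gap between a brand-new paper and a 20-year-old one moves the final score by only a few hundredths.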
+**For Phase 2 initial implementation** (no citation data):
+```python
+final = 0.80 * norm_rrf + 0.20 * recency
+```
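A minimal sketch of that two-term rerank as a function, assuming the recency term keeps the same `exp(-0.06 × age)` decay as the three-term formula. The names `rerank`, `rrf_scores` and `age_years` are hypothetical, and treating a paper with unknown age as brand new is an arbitrary choice made here for illustration.

```python
from math import exp

SEMANTIC_WEIGHT = 0.80   # weight on the normalised RRF score
RECENCY_WEIGHT = 0.20    # weight on the freshness term
DECAY_LAMBDA = 0.06      # assumed: same soft decay as the three-term formula


def rerank(rrf_scores: dict[str, float], age_years: dict[str, float]) -> list[tuple[str, float]]:
    """Return (arxiv_id, final_score) pairs sorted by final score, best first."""
    if not rrf_scores:
        return []
    max_rrf = max(rrf_scores.values())
    scored = []
    for arxiv_id, rrf in rrf_scores.items():
        norm_rrf = rrf / max_rrf if max_rrf > 0 else 0.0
        # Assumption: a paper with unknown age is treated as age 0.
        recency = exp(-DECAY_LAMBDA * age_years.get(arxiv_id, 0.0))
        scored.append((arxiv_id, SEMANTIC_WEIGHT * norm_rrf + RECENCY_WEIGHT * recency))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rerank({"a": 0.032, "b": 0.030}, {"a": 10.0, "b": 0.0})
print(ranked)
```

In this toy input the slightly weaker but brand-new paper `b` overtakes the 10-year-old paper `a`, which shows how much more influence recency has at 0.20 than at 0.05.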
+---
+## Parallel Search — Critical for Latency
+The script runs Qdrant and Zilliz in parallel using `ThreadPoolExecutor`:
+```python
+with ThreadPoolExecutor(max_workers=2) as ex:
+    fq = ex.submit(timed_qdrant)
+    fz = ex.submit(timed_zilliz)
+    dense_hits, qdrant_ms  = fq.result()
+    sparse_hits, zilliz_ms = fz.result()
+```
+In the FastAPI app we use `asyncio.gather` with `run_in_executor` for the same effect:
+```python
+loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
+dense_hits, sparse_hits = await asyncio.gather(
+    loop.run_in_executor(None, _search_qdrant_dense_sync, dense_vec, fetch_k),
+    loop.run_in_executor(None, _search_zilliz_sparse_sync, sparse_dict, fetch_k),
+)
+```
+This is important: total search latency is `max(qdrant_time, zilliz_time)`, not their sum.
+The script shows Qdrant ~200ms, Zilliz ~300ms — so parallel wall time is ~300ms, not ~500ms.
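The "max, not sum" behaviour can be reproduced without any network access. In the sketch below the two search functions are stand-ins that simply sleep for the quoted latencies; their names are hypothetical and nothing here talks to Qdrant or Zilliz.

```python
import asyncio
import time


def fake_qdrant_search() -> list[str]:
    time.sleep(0.2)            # stand-in for a ~200ms blocking network call
    return ["dense-hit"]


def fake_zilliz_search() -> list[str]:
    time.sleep(0.3)            # stand-in for a ~300ms blocking network call
    return ["sparse-hit"]


async def search_both() -> tuple[list[str], list[str], float]:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # Each blocking call runs in the default thread pool, so they overlap.
    dense, sparse = await asyncio.gather(
        loop.run_in_executor(None, fake_qdrant_search),
        loop.run_in_executor(None, fake_zilliz_search),
    )
    return dense, sparse, time.perf_counter() - start


dense_hits, sparse_hits, elapsed = asyncio.run(search_both())
print(f"wall time: {elapsed * 1000:.0f}ms")
```

The measured wall time should sit close to the slower of the two calls rather than their sum, which is the property the latency budget relies on.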
+---
+## Model Loading Strategy
+BGE-M3 is ~570MB on disk, takes ~15s to load on CPU. Two strategies:
+**Option A — Eager load at startup** (recommended for production)
+```python
+# in lifespan()
+embed_svc.get_model()  # blocks startup until model is loaded
+```
+Pros: First query is fast. Cons: App takes ~15s to start.
+**Option B — Lazy load on first query**
+```python
+# in encode_query()
+model = get_model()  # loads on demand
+```
+Pros: Fast startup. Cons: First search request after restart takes ~15s — user sees spinner.
+For a research tool with infrequent restarts, **Option A is better**.
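Both options can share one accessor: a lock-protected lazy singleton that startup code may call eagerly. The sketch below uses a stub loader in place of the real model, and the `loader` parameter is an illustrative device rather than the signature of `embed_svc.get_model()`.

```python
import threading
from typing import Callable, Optional

_model: Optional[object] = None
_model_lock = threading.Lock()


def get_model(loader: Callable[[], object]) -> object:
    """Return the shared model, calling loader() at most once."""
    global _model
    if _model is None:                 # fast path once the model is loaded
        with _model_lock:
            if _model is None:         # re-check inside the lock
                _model = loader()
    return _model


load_count = 0


def slow_stub_loader() -> object:
    """Stand-in for the ~15s BGE-M3 load; counts how many times it runs."""
    global load_count
    load_count += 1
    return {"name": "stub-model"}


# Option A: call once during startup. Option B: drop this line and let the
# first request trigger the same call.
get_model(slow_stub_loader)
first = get_model(slow_stub_loader)
second = get_model(slow_stub_loader)
print(load_count)
```

The double check around the lock matters mainly for Option B, where two early requests could otherwise both start loading the model.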
+---
+## RRF — Why It Works Without Score Normalisation
+RRF's elegance is that it ignores raw scores entirely. Only rank matters:
+```
+paper X is rank 1 in dense, rank 5 in sparse
+  dense score: 1/(60+1) = 0.01639
+  sparse score: 1/(60+5) = 0.01538
+  total: 0.03177
+paper Y is rank 3 in dense, rank 3 in sparse
+  both: 1/(60+3) = 0.01587
+  total: 0.03175
+```
+Paper X barely beats Y: a rank-1 hit in one channel is worth only slightly more than two
+steady rank-3 hits, so a paper has to place well in both channels to score highly. Because
+only ranks enter the formula, you don't need to worry about Qdrant's cosine scores being on
+a different scale than Zilliz's IP scores.
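The worked example can be reproduced with a small fusion function. This is a sketch of standard Reciprocal Rank Fusion with K=60; the function name `rrf_fuse` and the filler paper IDs are made up for illustration.

```python
from collections import defaultdict

RRF_K = 60


def rrf_fuse(*ranked_lists: list[str], k: int = RRF_K) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per paper."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, paper_id in enumerate(ranked, start=1):
            scores[paper_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# X is rank 1 in dense and rank 5 in sparse; Y is rank 3 in both.
dense = ["X", "d2", "Y", "d4", "d5"]
sparse = ["s1", "s2", "Y", "s4", "X"]
fused = rrf_fuse(dense, sparse)
for paper_id, score in fused[:3]:
    print(f"{paper_id}: {score:.5f}")
```

Papers that appear in only one list still get a score, just a lower one, so a retriever that misses a paper does not remove it from the fused ranking.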
+---
+## LLM Query Rewriter — What It Does and When to Skip It
+The rewriter converts casual queries into dense academic keyword strings:
+- `"when the ai makes up fake facts"` → `"LLM hallucination factual errors sycophancy truthfulness survey"`
+- `"the llama model by facebook"` → `"LLaMA open efficient foundation language model Meta AI"`
+The 15 few-shot examples in the script are carefully chosen — they demonstrate two things:
+1. Vague language → academic terminology (helps dense semantic search)
+2. Named entities preserved/added (helps sparse keyword search)
+**Cost**: ~200-500ms Groq API call per query. The Groq call runs first, before encoding.
+**When to skip**: If the user's query already looks academic (contains arXiv-style terms,
+author names, or model acronyms), the rewrite adds little. A heuristic: if the query is
+longer than 6 words and contains uppercase acronyms, skip the rewrite.
+**Fallback**: The script always falls back to the original query on any Groq error. This
+is the right pattern — LLM rewriting is an enhancement, not a dependency.
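One way to express the skip heuristic, as a sketch: the regular expression below is an assumed definition of "uppercase acronym" (two or more consecutive capitals forming a whole word), and `looks_academic` is a hypothetical helper name.

```python
import re

# Assumed definition of an acronym: a standalone run of 2+ capitals, e.g. "LLM".
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def looks_academic(query: str, min_words: int = 6) -> bool:
    """True when the query has more than min_words words and an uppercase acronym."""
    return len(query.split()) > min_words and ACRONYM.search(query) is not None


casual = "when the ai makes up fake facts"
academic = "RLHF reward model overoptimization in large language models"
short = "LLM hallucination"
print(looks_academic(casual), looks_academic(academic), looks_academic(short))
```

Note that mixed-case names such as "LLaMA" do not match this pattern, so a production heuristic would probably need a broader rule or a list of known model names.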
+---
+## What Phase 2 Leaves Unchanged from Phase 1
+These are intentionally NOT changed:
+1. **Recommendations** — `qdrant_svc.recommend()` via Qdrant Recommend API. Already works.
+   No query encoding needed. No model dependency.
+2. **User state** — `user_state.py` deques + SQLite interactions table. Already correct.
+3. **Metadata fetching** — `arxiv_svc.fetch_metadata_batch()`. Phase 1's arXiv API + SQLite
+   cache is strictly better than the Kaggle CSV approach for a web app.
+4. **arXiv ID normalisation** — `arxiv_svc._normalise_id()`. Already handles all formats.
+5. **Event logging** — `db.log_interaction()`. Events are source-agnostic. A save from
+   hybrid search goes into the same table as a save from arXiv keyword search.
+6. **All templates** — `paper_card.html`, `action_buttons.html`, `search_results.html`.
+   The templates expect `paper` dicts with `arxiv_id, title, abstract, authors, category,
+   published` — exactly what `arxiv_svc.fetch_metadata_batch()` returns.
+7. **HTMX patterns** — the save/dismiss flow, lazy-load recommendations, search-as-you-type.
+---
+## Phase 2 Implementation Order
+These steps are ordered to minimise integration risk. Each step leaves the app in a
+working state.
+### Step 1 — Add BGE-M3 model service (`app/embed_svc.py`)
+Load BGE-M3, expose `encode_query(text) → (dense, sparse)` with LRU cache.
+No app integration yet — just the service + unit tests.
+**Test**: encode "attention is all you need", verify dense shape is (1024,), sparse has >5 keys.
+### Step 2 — Add Zilliz client (`app/zilliz_svc.py`)
+Connect to `arxiv_bgem3_sparse`. Expose `search_sparse(sparse_dict, k) → list[tuple]`.
+Include reconnect logic from the script.
+**Test**: live search with the sparse vector from Step 1, verify results come back.
+### Step 3 — Add dense search to Qdrant service
+Add `search_dense(dense_vec, k) → list[tuple]` to `qdrant_svc.py`.
+Keep `recommend()` and `lookup_qdrant_ids()` unchanged.
+**Test**: live dense search, verify arxiv_ids in results are valid.
+### Step 4 — Add Groq rewriter (`app/groq_svc.py`)
+Expose `rewrite(query) → str`. Must fall back gracefully on error.
+**Test**: rewrite "attention is all you need", verify output contains "transformer".
+### Step 5 — Hybrid search orchestrator (`app/hybrid_search_svc.py`)
+Wire: rewrite → encode → parallel(dense, sparse) → RRF → rerank → return arxiv_ids.
+No citation data yet — use `0.80 × rrf + 0.20 × recency`.
+**Test**: search "hallucination in language models", verify `2309.01219` is in top 5.
+### Step 6 — Swap search router
+Replace `arxiv_svc.search(q)` in `app/routers/search.py` with
+`hybrid_search_svc.search(q)` + `arxiv_svc.fetch_metadata_batch(arxiv_ids)`.
+This is a one-function swap. The router's response format stays identical.
+**Test**: all 12 integration tests should still pass. Run `python -m pytest`.
+### Step 7 — Add model warm-up to lifespan
+Add `embed_svc.get_model()` to `app/main.py` lifespan. Update startup log message.
+### Step 8 (optional) — Citation data
+Decide on data source, add `citationCount` to SQLite `paper_metadata` table,
+wire into reranker.
+---
+## Performance Budget for Phase 2
+Assuming BGE-M3 is already warm (loaded at startup):
+| Stage | Time | Notes |
+|---|---|---|
+| LLM rewrite (Groq) | ~300ms | Can be skipped for short/academic queries |
+| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached | LRU cache on query text |
+| Qdrant dense search | ~200ms | Network + HNSW |
+| Zilliz sparse search | ~300ms | Network + sparse IP |
+| Both (parallel) | ~300ms | Bottleneck is whichever is slower |
+| RRF + rerank | <5ms | Pure Python, pre-built dicts |
+| arXiv metadata | ~0ms (cached) / ~500ms (cold) | SQLite → arXiv API |
+| **Total (warm cache)** | **~600ms** | With LLM, warm encode, cold Zilliz |
+| **Total (fully cached)** | **~300ms** | Encode cached, metadata cached |
+Phase 1 search: ~500ms (arXiv API, no local computation).
+Phase 2 search: ~600ms warm / ~1.1s cold (first query after restart with LLM rewrite).
+The latency is comparable with much better result quality.
+---
+## New Dependencies
+Add to `requirements.txt`:
+```
+FlagEmbedding>=1.2.9    # BGE-M3
+torch>=2.0              # required by FlagEmbedding, CPU-only is fine
+pymilvus>=2.4           # Zilliz/Milvus client
+groq>=0.9               # Groq LLM API
+numpy>=1.24             # already implicit via FlagEmbedding
+```
+Note: `torch` for CPU-only install:
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+```
+BGE-M3 on CPU is ~300ms per encode. Acceptable for a search tool. GPU would bring it to ~30ms.
+---
+## Summary: The Three Things Phase 2 Is
+1. **Replace the search input**: arXiv keyword API → BGE-M3 encode → Qdrant dense + Zilliz sparse → RRF
+2. **Add a search quality layer**: LLM query rewriting (Groq) before encoding, citation+recency reranking after RRF
+3. **Keep everything else**: recommendations, user state, metadata caching, HTMX frontend, event logging — all unchanged
+The script has already proven the core pipeline works (20-query evaluation suite with Recall, Precision, MRR metrics). Phase 2 is productionising that pipeline into the existing FastAPI structure.

docs/phases/PHASE3-Hybrid-Semantic-Search.md ADDED

	@@ -0,0 +1,658 @@

+# Phase 3 — Hybrid Semantic Search
+> **Purpose**: Replace the Phase 1 placeholder arXiv keyword API search with real vector-based
+> semantic search using BGE-M3 encoding + Qdrant dense + Zilliz sparse + RRF fusion.
+>
+> **Status**: 📋 Not started
+> **Estimated effort**: ~2-3 weeks
+> **Predecessor**: Phase 2c (complete) — the recommendation pipeline
+> **Deployment target**: Hugging Face Spaces (Docker SDK, free tier: 16GB RAM, 2 vCPUs)
+---
+## Why This Is The #1 Priority
+The entire reason we built 1.6M BGE-M3 embeddings across Qdrant Cloud (dense, 1024-dim) and
+Zilliz Cloud (sparse, learned lexical weights) is to power semantic search. Right now, the
+search bar calls the arXiv keyword API — a Phase 1 throwaway placeholder that:
+- Can't understand meaning, only matches exact words
+- "when AI makes up fake facts" returns nothing useful
+- Misses papers using different terminology for the same concept
+- Has no ranking signal beyond arXiv's own relevance score
+With hybrid search, that same query would be rewritten by an LLM to
+"LLM hallucination factual errors sycophancy truthfulness survey", encoded by BGE-M3 into
+both dense and sparse vectors, and matched against 1.6M papers semantically.
+**Doc 06 confirms this is load-bearing infrastructure:**
+> "Phase 0 — already complete: hybrid BGE-M3 dense + sparse + RRF on Qdrant + Zilliz, 1.6M
+> papers. Keep RRF *for search* (it's correct for fusing different retrievers over the *same*
+> query); replace it with quota *for recommendations* (different queries over the same user)."
+**RRF is correct for search** — this is fusing different retrievers (dense + sparse) answering
+the same query. This is fundamentally different from recommendations, where RRF is wrong.
+---
+## Architecture: What Changes From Phase 1
+### Phase 1 Search (current — being replaced)
+```
+User query (exact string)
+      │
+      ▼
+arXiv keyword API  (https://export.arxiv.org/api/query?search_query=all:query)
+      │  standard text match, not semantic
+      ▼
+SQLite metadata cache → render paper cards
+```
+**Problems**: Keyword-only. No semantic understanding. Depends on external API.
+### Phase 3 Search (target)
+```
+User types: "when AI makes up fake facts"
+      │
+      ▼
+[1] LLM Rewriter (Groq / llama-3.3-70b-versatile, ~300ms)
+      │   → "LLM hallucination factual errors sycophancy truthfulness survey"
+      │   Falls back to original query on error/timeout
+      ▼
+[2] BGE-M3 Encode (CPU, ~300ms first / ~0ms cached)
+      │   Single forward pass produces TWO outputs:
+      │   ├── dense_vec  : float32[1024]        — semantic meaning
+      │   └── sparse_dict: {token_id: weight}   — lexical weights (NOT BM25)
+      ▼
+[3a] Qdrant dense search ───────────────┐
+      │  HNSW ANN on 1024-dim           │
+      │  collection: arxiv_bgem3_dense  │    PARALLEL
+      │  returns: [(arxiv_id, score)]   │    (~300ms total)
+      │                                 │
+[3b] Zilliz sparse search ──────────────┘
+      │  IP on sparse vectors
+      │  collection: arxiv_bgem3_sparse
+      │  returns: [(arxiv_id, score)]
+      │
+      ▼
+[4] RRF Fusion (K=60)
+      │  score[paper] = 1/(60+rank_dense) + 1/(60+rank_sparse)
+      │  Pure rank-based — no score normalization needed
+      ▼
+[5] Rerank: 0.80 × norm_rrf + 0.20 × recency
+      │  (Citation signal deferred — not available yet)
+      ▼
+[6] Return arxiv_ids → fetch metadata → render cards
+```
+### What Stays Unchanged
+These are intentionally NOT changed in Phase 3:
+1. **Recommendations pipeline** — Tiers 1/2/3 in `recommendations.py`. No dependency on BGE-M3.
+2. **User state** — `user_state.py` deques + SQLite interactions. Source-agnostic.
+3. **Metadata fetching** — `arxiv_svc.fetch_metadata_batch()`. Still needed for titles/abstracts.
+4. **Event logging** — `db.log_interaction()`. Events are source-agnostic.
+5. **All templates** — `paper_card.html`, `action_buttons.html`, etc. Same paper dict format.
+6. **HTMX patterns** — save/dismiss flow, lazy-load recs, search-as-you-type.
+---
+## The Critical BGE-M3 Insight
+BGE-M3 is a **dual-output encoder**: one forward pass produces both a dense vector AND sparse
+lexical weights simultaneously. The sparse output is NOT BM25 — it is BGE-M3's own learned
+sparse representation (`lexical_weights` from FlagEmbedding).
+**This has a critical consequence**: the Zilliz collection `arxiv_bgem3_sparse` was indexed
+with BGE-M3 sparse outputs. Query-time sparse encoding **must** also use BGE-M3's sparse
+encoder. You CANNOT substitute a BM25 tokenizer. The model must stay loaded in the app
+process and run once for every uncached search query.
+```python
+# One call, two outputs:
+out = model.encode(
+    [text],
+    return_dense=True,
+    return_sparse=True,        # ← BGE-M3 lexical weights, not BM25
+    return_colbert_vecs=False,
+    max_length=512,
+)
+dense  = out["dense_vecs"][0]          # shape (1024,) float32
+sparse = out["lexical_weights"][0]     # dict {token_id: float}
+```
+---
+## Deployment: Hugging Face Spaces (Docker SDK)
+### Why HF Spaces Instead of Render
+| Constraint | Render Free | HF Spaces Free | Verdict |
+|---|---|---|---|
+| **RAM** | 512 MB | **16 GB** | BGE-M3 needs ~2GB (model + PyTorch runtime). Render can't do it. |
+| **CPU** | Limited | 2 vCPUs | Sufficient for BGE-M3 CPU inference (~300ms/query) |
+| **Disk** | Persistent | **Ephemeral** (50GB) | Need external DB for persistence → we already use Qdrant Cloud + Zilliz Cloud. SQLite needs a solution. |
+| **Sleep** | After 15 min | After ~2 days | Better for a research tool |
+| **Port** | Any | **7860** (required) | Must configure in run.py |
+| **Cold start** | ~30-60s | ~15-30s + model download | Model caching via Docker layers helps |
+### HF Spaces Constraints to Handle
+1. **Ephemeral filesystem** — `interactions.db` (SQLite) data is lost on restart.
+   - **Solution**: For now, accept this (pre-launch, no real users). Phase 4 can migrate to
+     Supabase/external DB when persistence matters.
+   - Alternative: Use HF Dataset repo as persistent store via `huggingface_hub` library.
+2. **Port must be 7860** — HF Spaces requires apps to listen on port 7860.
+   - **Solution**: Change `run.py` to use port 7860 (or read from `PORT` env var).
+3. **Model download on cold start** — BGE-M3 (~570MB) downloads from HuggingFace Hub on first
+   start. Because the filesystem is ephemeral, that download repeats after every restart
+   unless the model is already part of the Docker image.
+   - **Solution**: Download model in Dockerfile `RUN` step so it's baked into the image.
+4. **Non-root user** — HF Spaces Docker runs as user ID 1000.
+   - **Solution**: Add `USER 1000` in Dockerfile, ensure all paths are writable.
+### Dockerfile Skeleton
+```dockerfile
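+FROM python:3.12-slim
+# Install system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends gcc && \
+    rm -rf /var/lib/apt/lists/*
+# HF Spaces runs the container as UID 1000: create that user and a writable app dir
+RUN useradd -m -u 1000 user && mkdir /app && chown user:user /app
+ENV HOME=/home/user
+# Set up app directory
+WORKDIR /app
+# Install Python deps (torch CPU-only first for smaller image)
+COPY requirements.txt .
+RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
+    pip install --no-cache-dir -r requirements.txt
+# Switch to the runtime user BEFORE downloading, so the Hugging Face cache lands in
+# /home/user/.cache where the app can read it (a root-owned cache would be invisible)
+USER user
+# Pre-download BGE-M3 model into the image (baked in, no cold-start download)
+RUN python -c "from FlagEmbedding import BGEM3FlagModel; BGEM3FlagModel('BAAI/bge-m3')"
+# Copy application code, owned by the runtime user
+COPY --chown=user:user . .
+# HF Spaces requires port 7860
+EXPOSE 7860
+CMD ["python", "run.py"]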
+```
+---
+## New Files to Create
+### `app/embed_svc.py` — BGE-M3 Model Singleton
+```
+Responsibilities:
+  - Load BAAI/bge-m3 once at startup via lifespan
+  - encode_query(text) → (dense: np.ndarray[1024], sparse: dict[int, float])
+  - LRU cache (128 entries) on query text to avoid re-encoding repeats
+  - CPU float32, no GPU dependency
+  - use_fp16=False on CPU (fp16 is GPU-only)
+Key API:
+  get_model() → BGEM3FlagModel           # lazy singleton
+  encode_query(text) → (ndarray, dict)    # cached, thread-safe
+```
+**Why a singleton**: BGE-M3 is ~570MB in memory. Loading it twice would waste RAM.
+The model is loaded once at startup (or lazily on first query) and reused for all requests.
+### `app/zilliz_svc.py` — Zilliz Cloud Sparse Search Client
+```
+Responsibilities:
+  - Connect to Zilliz Cloud serverless via pymilvus MilvusClient
+  - search_sparse(sparse_dict, limit) → list[dict] with arxiv_id + score
+  - Handle gRPC reconnects (closed-channel error observed in prototype)
+  - Collection: arxiv_bgem3_sparse
+  - Schema: id (INT64 auto PK), arxiv_id (VARCHAR), sparse_vector (SPARSE_FLOAT_VECTOR)
+  - Index: SPARSE_INVERTED_INDEX, metric_type=IP
+  - Sparse format: {int_token_id: float_weight} (NOT string words)
+  - Metric: IP (Inner Product)
+Key API:
+  search_sparse(sparse_dict, limit=50) → list[dict]
+```
+**Config needed**: `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION` in config.py.
+### `app/groq_svc.py` — LLM Query Rewriter
+```
+Responsibilities:
+  - Groq API client (lazy init, reads GROQ_API_KEY from env)
+  - rewrite(user_query) → academic_query_string
+  - Uses llama-3.3-70b-versatile with few-shot prompt
+  - Falls back to original query on ANY error or timeout (>2s)
+  - Optional: skip rewriting for queries that already look academic
+Key API:
+  rewrite(query) → str    # graceful fallback, never crashes
+```
+**The rewrite is an enhancement, NOT a dependency.** If Groq is down, the system works
+fine with the original query.
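The "never crashes" contract can be enforced by a thin wrapper around whatever function performs the Groq call. The sketch below is generic: `rewrite_fn` is a hypothetical callable standing in for the real API call, and the timeout is implemented with a worker thread rather than any Groq client option.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=2)


def rewrite_with_fallback(query: str, rewrite_fn: Callable[[str], str],
                          timeout_s: float = 2.0) -> str:
    """Return the rewritten query, or the original on any error or timeout."""
    future = _executor.submit(rewrite_fn, query)
    try:
        rewritten = future.result(timeout=timeout_s)
        return rewritten.strip() or query   # an empty rewrite also falls back
    except Exception:                       # covers timeouts and API errors
        return query


def slow_rewrite(query: str) -> str:
    time.sleep(0.3)                         # simulates a slow API response
    return "too late"


def failing_rewrite(query: str) -> str:
    raise RuntimeError("simulated API failure")


original = "when AI makes up fake facts"
print(rewrite_with_fallback(original, slow_rewrite, timeout_s=0.05))
print(rewrite_with_fallback(original, failing_rewrite))
```

A timed-out call keeps running in its worker thread until it returns, so this approach bounds user-facing latency but not the number of in-flight requests.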
+### `app/hybrid_search_svc.py` — Search Orchestrator
+```
+Responsibilities:
+  - Orchestrates the full pipeline: rewrite → encode → parallel search → RRF → rerank
+  - Calls groq_svc.rewrite() (optional, can be skipped)
+  - Calls embed_svc.encode_query()
+  - Calls qdrant_svc.search_dense() + zilliz_svc.search_sparse() in parallel
+    via asyncio.gather with run_in_executor
+  - RRF merge (K=60)
+  - Recency rerank: 0.80 × norm_rrf + 0.20 × recency
+  - Returns list of arxiv_ids, sorted by final score
+Key API:
+  search(query, limit=10) → list[str]   # returns arxiv_ids
+```
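The responsibilities listed above fit in a short orchestration function. The sketch below injects every external service as a plain callable so it runs without Groq, BGE-M3, Qdrant or Zilliz; all parameter names are hypothetical, and the `0.06` decay and ×6 fetch multiplier are carried over from elsewhere in this document as assumptions.

```python
import asyncio
from math import exp


def rrf(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, paper_id in enumerate(ranked, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    return scores


async def search(query, *, rewrite, encode, dense_search, sparse_search,
                 age_years, limit=10):
    loop = asyncio.get_running_loop()
    # Blocking steps run in the default thread pool to keep the event loop free.
    rewritten = await loop.run_in_executor(None, rewrite, query)
    dense_vec, sparse_dict = await loop.run_in_executor(None, encode, rewritten)
    fetch_k = limit * 6                      # over-fetch before reranking
    dense_ids, sparse_ids = await asyncio.gather(
        loop.run_in_executor(None, dense_search, dense_vec, fetch_k),
        loop.run_in_executor(None, sparse_search, sparse_dict, fetch_k),
    )
    fused = rrf(dense_ids, sparse_ids)
    if not fused:
        return []
    top = max(fused.values())
    final = {
        pid: 0.80 * score / top + 0.20 * exp(-0.06 * age_years(pid))
        for pid, score in fused.items()
    }
    return sorted(final, key=final.get, reverse=True)[:limit]


result = asyncio.run(search(
    "when AI makes up fake facts",
    rewrite=lambda q: q,                             # stub: no rewriting
    encode=lambda q: ([0.0] * 4, {1: 0.5}),          # stub: fake vectors
    dense_search=lambda vec, k: ["p1", "p2"],        # stub: ranked IDs
    sparse_search=lambda sparse, k: ["p2", "p3"],    # stub: ranked IDs
    age_years=lambda pid: 0.0,                       # stub: all papers are new
))
print(result)
```

Passing the services in as arguments is a choice made for this sketch because it keeps the example testable; the real module could just as well import `groq_svc`, `embed_svc`, `qdrant_svc` and `zilliz_svc` directly.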
+---
+## Files to Modify
+### `app/config.py` — Add Search Config
+```python
+# ── Zilliz Cloud (BGE-M3 sparse) ─────────────────────────────────────────────
+ZILLIZ_URI        = os.getenv("ZILLIZ_URI", "https://in03-...")
+ZILLIZ_TOKEN      = os.getenv("ZILLIZ_TOKEN", "...")
+ZILLIZ_COLLECTION = os.getenv("ZILLIZ_COLLECTION", "arxiv_bgem3_sparse")
+# ── Groq (LLM query rewriter) ────────────────────────────────────────────────
+GROQ_API_KEY      = os.getenv("GROQ_API_KEY", "")
+# ── BGE-M3 (embedding model) ─────────────────────────────────────────────────
+BGE_M3_MODEL      = os.getenv("BGE_M3_MODEL", "BAAI/bge-m3")
+BGE_M3_DEVICE     = os.getenv("BGE_M3_DEVICE", "cpu")
+ENCODE_CACHE_SIZE = 128
+# ── Hybrid search tuning ─────────────────────────────────────────────────────
+SEARCH_RRF_K             = 60     # RRF denominator
+SEARCH_FETCH_K_MULTIPLIER = 6    # candidates = top_k × 6 before rerank
+SEARCH_SEMANTIC_WEIGHT   = 0.80   # RRF contribution to final score
+SEARCH_RECENCY_WEIGHT    = 0.20   # recency contribution to final score
+# ── Deployment ────────────────────────────────────────────────────────────────
+APP_PORT = int(os.getenv("PORT", "7860"))   # HF Spaces requires 7860
+```
+### `app/qdrant_svc.py` — Add `search_dense()`
+A new function for raw vector search (different from `search_by_vector()` which is used by
+the recommendation pipeline). `search_dense()` returns `arxiv_id` + score dicts sorted by
+score descending, which is the ranked list that RRF consumes.
+```python
+async def search_dense(
+    dense_vec: list[float],
+    limit: int = 50,
+) -> list[dict]:
+    """
+    ANN dense search for the search pipeline. Returns list of
+    {'arxiv_id': str, 'score': float} dicts sorted by score desc.
+    Different from search_by_vector() which returns only arxiv_ids.
+    This version also returns scores, in the ranked order RRF fusion needs.
+    """
+```
+### `app/routers/search.py` — Swap Search Backend
+Replace `arxiv_svc.search(q)` with `hybrid_search_svc.search(q)` +
+`arxiv_svc.fetch_metadata_batch(arxiv_ids)`.
+The router signature, template rendering, and response format stay IDENTICAL.
+This is a one-function swap inside the router.
+```python
+# BEFORE (Phase 1):
+papers = await arxiv_svc.search(q.strip())
+# AFTER (Phase 3):
+from app import hybrid_search_svc
+arxiv_ids = await hybrid_search_svc.search(q.strip(), limit=config.ARXIV_MAX_RESULTS)
+meta = await arxiv_svc.fetch_metadata_batch(arxiv_ids)
+papers = [meta[aid] for aid in arxiv_ids if aid in meta]
+```
+### `app/main.py` — Add Model Warm-up to Lifespan
+```python
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    await db.init_db()
+    # Warm up BGE-M3 at startup, not on first request
+    from app import embed_svc
+    embed_svc.get_model()
+    print("[main] BGE-M3 model loaded")
+    yield
+```
+### `run.py` — Use Configurable Port
+```python
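+import os
+import uvicorn
+from app.config import APP_PORT
+# Auto-reload is a development aid; keep it off in the Docker image.
+# UVICORN_RELOAD is an assumed variable name, not an existing setting.
+RELOAD = os.getenv("UVICORN_RELOAD", "0") == "1"
+if __name__ == "__main__":
+    uvicorn.run("app.main:app", host="0.0.0.0", port=APP_PORT, reload=RELOAD)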
+```
+### `requirements.txt` — Add Dependencies
+```
+# Existing
+fastapi>=0.115
+uvicorn>=0.30
+jinja2>=3.1
+httpx>=0.27
+aiosqlite>=0.20
+qdrant-client>=1.9
+pydantic>=2.0
+numpy>=1.24
+scipy>=1.11
+pytest>=8.0
+pytest-asyncio>=0.23
+anyio[asyncio]
+# Phase 3 additions
+FlagEmbedding>=1.2.9       # BGE-M3 model
+pymilvus>=2.4              # Zilliz/Milvus client
+groq>=0.9                  # Groq LLM API
+```
+Note: `torch` is installed separately with CPU-only wheels:
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+```
+---
+## Implementation Order
+Each step leaves the app in a working state. Order minimizes integration risk.
+### Step 1 — BGE-M3 Model Service (`app/embed_svc.py`)
+Build the embedding service in isolation. No app integration yet.
+- Load `BAAI/bge-m3` with `use_fp16=False` (CPU)
+- `encode_query(text)` → `(dense_vec, sparse_dict)` with LRU cache
+- Thread-safe (model inference is CPU-bound, use `run_in_executor`)
+**Test**: Encode "attention is all you need". Verify dense shape is `(1024,)`,
+sparse dict has >5 keys, all values are floats.
+### Step 2 — Zilliz Client (`app/zilliz_svc.py`)
+Connect to `arxiv_bgem3_sparse` collection. Expose `search_sparse()`.
+- Use `pymilvus.MilvusClient` with `uri` + `token`
+- Search field: `sparse_vector` (SPARSE_FLOAT_VECTOR)
+- Filter output: extract `arxiv_id` from results
+- Search with `metric_type="IP"` on sparse vectors
+- Handle gRPC reconnection (retry once on connection error)
+- Return `[{'arxiv_id': str, 'score': float}]`
+**Test**: Live sparse search with the sparse vector from Step 1.
+Verify results contain valid arxiv_ids.
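The "retry once on connection error" rule can be captured in a small helper. This is a sketch: `operation` and `reconnect` are hypothetical callables, and it catches every `Exception` because the exact exception class raised on a closed gRPC channel is not stated here; a real client would likely narrow that.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_one_retry(operation: Callable[[], T], reconnect: Callable[[], None]) -> T:
    """Run operation(); on failure, reconnect and try exactly once more."""
    try:
        return operation()
    except Exception:
        reconnect()          # rebuild the client or channel
        return operation()   # a second failure propagates to the caller


attempts = {"count": 0}
reconnects: list[str] = []


def flaky_search() -> list[dict]:
    """Fails on the first call, then succeeds, like a stale connection."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("channel closed")
    return [{"arxiv_id": "1706.03762", "score": 0.42}]


hits = call_with_one_retry(flaky_search, lambda: reconnects.append("reconnected"))
print(hits, attempts["count"], reconnects)
```

Limiting the retry to a single attempt keeps worst-case latency at roughly twice a normal search instead of growing without bound.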
+### Step 3 — Dense Search in Qdrant (`app/qdrant_svc.py`)
+Add `search_dense()` function. Keep all existing functions unchanged.
+- Raw vector search returning `[{'arxiv_id': str, 'score': float}]`
+- Uses `query_points()` with the dense vector as query
+**Test**: Live dense search with the dense vector from Step 1.
+Verify arxiv_ids in results are valid strings.
+### Step 4 — Groq Rewriter (`app/groq_svc.py`)
+LLM-powered query expansion. Must be fully optional.
+- Lazy Groq client init (only connects when first query arrives)
+- Few-shot prompt from the prototype notebook (already tuned)
+- Timeout: 2 seconds max
+- Fallback: return original query on any error
+**Test**: Rewrite "attention is all you need". Verify output contains "transformer"
+or "self-attention". Test fallback by using invalid API key.
+### Step 5 — Hybrid Search Orchestrator (`app/hybrid_search_svc.py`)
+Wire everything together: rewrite → encode → parallel(dense, sparse) → RRF → rerank.
+- `asyncio.gather` with `run_in_executor` for parallel dense + sparse search
+- RRF fusion: `score = 1/(K + rank_dense) + 1/(K + rank_sparse)`, K=60
+- Recency rerank: `0.80 × norm_rrf + 0.20 × recency_score`
+- No citation data yet (deferred)
+**Test**: Search "hallucination in language models". Verify results contain
+hallucination-related papers.
+### Step 6 — Swap Search Router
+Replace `arxiv_svc.search(q)` in `app/routers/search.py` with
+`hybrid_search_svc.search(q)` + `arxiv_svc.fetch_metadata_batch()`.
+One-function swap. Router response format unchanged.
+**Test**: All existing integration tests should still pass.
+Run `python -m pytest tests/ -v`.
+### Step 7 — Model Warm-up + Deployment Config
+- Add `embed_svc.get_model()` to `main.py` lifespan
+- Update `run.py` to use `APP_PORT` (7860 for HF Spaces)
+- Create `Dockerfile` for HF Spaces deployment
+- Create `.dockerignore` (exclude `.git`, `__pycache__`, `*.db`, notebooks)
+**Test**: Start app with `python run.py`, verify search works end-to-end
+at `http://localhost:7860`.
+### Step 8 (Optional) — Citation Data
+Decide on data source for citation counts, add to reranker.
+| Option | Pros | Cons |
+|---|---|---|
+| Semantic Scholar API | Free, up-to-date | Extra HTTP call, rate limited |
+| Skip for now | Simple | Weaker rerank (RRF + recency only) |
+**Recommended**: Skip initially. Use `0.80 × rrf + 0.20 × recency`. The RRF signal
+alone is strong — the prototype showed citation mostly helps surface seminal papers.
+---
+## Latency Budget
+Assuming BGE-M3 is warm (loaded at startup):
+| Stage | Time | Notes |
+|---|---|---|
+| LLM rewrite (Groq) | ~300ms | Can be skipped for academic queries |
+| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached | LRU cache on query text |
+| Qdrant dense search | ~200ms | Network + HNSW |
+| Zilliz sparse search | ~300ms | Network + sparse IP |
+| Both (parallel) | ~300ms | Bottleneck = max(Qdrant, Zilliz) |
+| RRF + rerank | <5ms | Pure Python, pre-built dicts |
+| Metadata fetch | ~0ms (cached) / ~500ms (cold) | SQLite → arXiv API fallback |
+| **Total (warm cache)** | **~600ms** | With LLM, warm encode, warm metadata |
+| **Total (fully cached)** | **~300ms** | Encode cached, metadata cached |
+Phase 1 search: ~500ms (arXiv API, no local computation).
+Phase 3 search: ~600ms warm / ~1.1s cold. **Comparable latency, far better quality.**
+---
+## RRF — Why It Works Here
+RRF's elegance is that it ignores raw scores entirely. Only rank matters:
+```
+Paper X: rank 1 in dense, rank 5 in sparse
+  score = 1/(60+1) + 1/(60+5) = 0.01639 + 0.01538 = 0.03177
+Paper Y: rank 3 in dense, rank 3 in sparse
+  score = 1/(60+3) + 1/(60+3) = 0.01587 + 0.01587 = 0.03175
+```
+Papers consistently ranked well across BOTH channels get boosted. This property means
+you don't need to normalize Qdrant's cosine scores vs Zilliz's IP scores — they're on
+different scales but RRF doesn't care.
+**Doc 06 confirms**: RRF is correct here because this is fusing *different retrievers
+answering the same query*. Recommendations are the opposite case (fusing *different queries
+for the same user*), and there quota is the correct choice.
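The scale-independence claim is easy to demonstrate: multiplying one retriever's scores by any positive constant leaves the fused result unchanged, because only the ordering is used. The scores below are made-up numbers chosen to look like bounded cosine values and unbounded inner products.

```python
def ranks(scored: dict[str, float]) -> dict[str, int]:
    """Map each paper to its 1-based rank by descending score."""
    ordered = sorted(scored, key=scored.get, reverse=True)
    return {paper: position for position, paper in enumerate(ordered, start=1)}


def rrf(dense_scores: dict[str, float], sparse_scores: dict[str, float],
        k: int = 60) -> dict[str, float]:
    dense_rank, sparse_rank = ranks(dense_scores), ranks(sparse_scores)
    fused: dict[str, float] = {}
    for paper in sorted(set(dense_rank) | set(sparse_rank)):
        fused[paper] = sum(
            1.0 / (k + channel[paper])
            for channel in (dense_rank, sparse_rank)
            if paper in channel
        )
    return fused


cosine = {"A": 0.91, "B": 0.88, "C": 0.52}        # bounded, cosine-style
inner = {"B": 41.7, "C": 39.0, "A": 12.3}         # unbounded, IP-style
rescaled = {paper: score * 1000 for paper, score in inner.items()}

print(rrf(cosine, inner) == rrf(cosine, rescaled))
```

A weighted sum of raw scores would not have this property, which is why it would need per-retriever normalisation first.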
+---
+## LLM Query Rewriter Details
+The rewriter converts casual queries into dense academic keyword strings:
+| User Query | Rewritten Query |
+|---|---|
+| "when the AI makes up fake facts" | "LLM hallucination factual errors sycophancy truthfulness survey" |
+| "the llama model by facebook" | "LLaMA open efficient foundation language model Meta AI" |
+| "how to make images from text" | "text-to-image generation diffusion models latent space" |
+| "whisper speech recognition" | "Whisper OpenAI ASR multilingual" |
+**When to skip**: If the query already looks academic (contains arXiv-style terms, author
+names, or model acronyms). A heuristic: if the query is >6 words and contains uppercase
+acronyms, skip the rewrite.
+**Fallback**: Always falls back to the original query on any Groq error. LLM rewriting is
+an enhancement, not a dependency.
+---
+## Test Plan
+### Unit Tests (new: `tests/test_embed_svc.py`)
+- `test_encode_returns_dense_and_sparse` — verify shapes and types
+- `test_encode_cache_hit` — second call returns same result without model invocation
+- `test_encode_empty_string` — handles edge case gracefully
+### Unit Tests (new: `tests/test_groq_svc.py`)
+- `test_rewrite_produces_academic_query` — basic rewrite
+- `test_rewrite_fallback_on_error` — returns original on API failure
+- `test_rewrite_fallback_on_timeout` — returns original after timeout
+### Integration Tests (new: `tests/test_hybrid_search.py`)
+- `test_rrf_fusion_basic` — mock dense + sparse results, verify fusion ranking
+- `test_search_end_to_end` — live search, verify results are relevant
+- `test_search_fallback_no_groq` — search works without Groq API key
+### Live Tests (new: `tests/test_zilliz_svc.py`)
+- `test_sparse_search_returns_results` — live Zilliz search
+- `test_sparse_search_valid_arxiv_ids` — results have valid arxiv_id strings
+### Regression
+- All 88 existing tests must still pass
+- Router integration tests (`test_integration.py`) must work with the new search backend
+---
+## Risks and Mitigations
+| Risk | Impact | Mitigation |
+|---|---|---|
+| **BGE-M3 doesn't fit in memory** | App crashes | HF Spaces free tier = 16GB RAM. Model + PyTorch ≈ 2GB. Well within limits. |
+| **Zilliz free tier rate limits** | Search fails | Graceful fallback: return dense-only results if Zilliz is down |
+| **Groq API down** | No query rewriting | Already handled: fallback to original query. Enhancement, not dependency. |
+| **HF Spaces ephemeral storage** | SQLite data lost on restart | Acceptable pre-launch. Phase 4 migrates to external DB if needed. |
+| **Cold start (~15s model load)** | First request slow after sleep | Model baked into Docker image. Warm-up in lifespan. HF Spaces sleeps after ~2 days, not 15 min. |
+| **Phase 1 arXiv search tests break** | CI fails | Update tests to mock `hybrid_search_svc.search()` instead of `arxiv_svc.search()` |
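The dense-only fallback from the table can be sketched with `asyncio.gather(..., return_exceptions=True)`: a channel that raises contributes an empty list instead of failing the whole search. The function and stub names are hypothetical.

```python
import asyncio
from typing import Callable


async def search_with_degradation(
    dense_search: Callable[[], list[str]],
    sparse_search: Callable[[], list[str]],
) -> tuple[list[str], list[str]]:
    """Run both searches in parallel; a failed channel yields an empty list."""
    loop = asyncio.get_running_loop()
    results = await asyncio.gather(
        loop.run_in_executor(None, dense_search),
        loop.run_in_executor(None, sparse_search),
        return_exceptions=True,      # exceptions come back as values
    )
    dense, sparse = ([] if isinstance(r, Exception) else r for r in results)
    return dense, sparse


def qdrant_ok() -> list[str]:
    return ["1706.03762"]


def zilliz_down() -> list[str]:
    raise ConnectionError("simulated Zilliz outage")


dense_hits, sparse_hits = asyncio.run(search_with_degradation(qdrant_ok, zilliz_down))
print(dense_hits, sparse_hits)
```

Since RRF simply sums contributions per list, an empty sparse list reduces the fusion to the dense ranking without any special-case code downstream.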
+---
+## File Structure After Phase 3
+```
+app/
+├── __init__.py
+├── config.py              # MODIFIED — Zilliz, Groq, BGE-M3, port config
+├── db.py                  # UNCHANGED
+├── main.py                # MODIFIED — BGE-M3 warm-up in lifespan
+├── arxiv_svc.py           # UNCHANGED — still used for metadata fetch
+├── qdrant_svc.py          # MODIFIED — add search_dense()
+├── user_state.py          # UNCHANGED
+├── templates_env.py       # UNCHANGED
+├── embed_svc.py           # NEW — BGE-M3 model singleton
+├── zilliz_svc.py          # NEW — Zilliz sparse search client
+├── groq_svc.py            # NEW — LLM query rewriter
+├── hybrid_search_svc.py   # NEW — search orchestrator
+├── recommend/             # UNCHANGED
+│   ├── __init__.py
+│   ├── profiles.py
+│   ├── clustering.py
+│   ├── reranker.py
+│   └── diversity.py
+├── routers/               # search.py MODIFIED, rest UNCHANGED
+│   ├── search.py          # MODIFIED — swap to hybrid search
+│   ├── events.py
+│   ├── recommendations.py
+│   └── saved.py
+└── templates/             # UNCHANGED
+    ├── base.html
+    ├── index.html
+    ├── search.html
+    ├── saved.html
+    └── partials/
+Dockerfile                 # NEW — HF Spaces deployment
+.dockerignore              # NEW
+run.py                     # MODIFIED — configurable port (7860)
+requirements.txt           # MODIFIED — add FlagEmbedding, pymilvus, groq
+```
+---
+## What Phase 3 Does NOT Do
+These are explicitly out of scope:
+- ❌ Change the recommendation pipeline (that's Phase 4)
+- ❌ Replace RRF with quota fusion for recs (Phase 4)
+- ❌ Add citation data to the search reranker (Phase 3 Step 8, optional)
+- ❌ Load BGE-M3 for recommendations (recs use pre-computed vectors in Qdrant)
+- ❌ Change templates or HTMX patterns
+- ❌ Add onboarding (Phase 5)
+- ❌ Train LightGBM (Phase 6)
+---
+## Verification Checklist
+Before declaring Phase 3 complete:
+- [ ] `python -m pytest tests/ -v` — all tests pass (88 existing + new)
+- [ ] Search "attention is all you need" — top result is `1706.03762`
+- [ ] Search "when AI makes up fake facts" — returns hallucination papers
+- [ ] Search with Groq API key unset — still works (falls back to original query)
+- [ ] Search with Zilliz down — falls back to dense-only results
+- [ ] Save a paper from search results — EWMA profiles update correctly
+- [ ] Recommendations still work — 3-tier cascade unaffected
+- [ ] App starts on HF Spaces (port 7860, Docker SDK)
+- [ ] Cold start completes within 60 seconds
+- [ ] Warm search latency < 1 second
+---
+*Last updated: 2026-04-19*

docs/research/01-Vision-Instagram-for-Research.md ADDED

	@@ -0,0 +1,111 @@

+# Building the Instagram for research papers
+**The academic discovery landscape is ripe for disruption.** Existing tools force researchers to stack 3–5 separate products — Semantic Scholar for search, Connected Papers for visualization, Elicit for synthesis, Scite for validation, Zotero for management — creating a fragmented workflow where no single platform combines visual discovery, AI synthesis, social curation, and personalization. Only **~15% of researchers feel they're successfully keeping up** with their field, and the recent shutdown of Papers With Code by Meta (July 2025) proved that critical research infrastructure controlled by corporations can vanish overnight. The gap is enormous: a discovery-first, visually engaging, socially layered platform that makes browsing papers feel as natural as scrolling Instagram — built on open data, designed for habit formation, and spanning every academic discipline.
+Your strong backend (BGE-M3 embeddings, Qdrant, hybrid search with RRF fusion) is a genuine technical advantage. What follows is a strategic blueprint covering competitive positioning, UX patterns, social dynamics, differentiation features, cross-disciplinary architecture, and business model — all grounded in what's working, what's failing, and what doesn't exist yet.
+---
+## The landscape's two biggest blind spots
+The current tool ecosystem splits neatly into two camps that never overlap: **citation-mapping tools** (Connected Papers, Litmaps, ResearchRabbit) provide visual exploration with zero AI synthesis, while **AI synthesis tools** (Elicit, Consensus, Scite) provide structured analysis with zero visual mapping. No product combines both. This is the single largest gap in the market.
+Google Scholar, still the default for virtually all researchers, has an archaic interface with no semantic search, no visualization, and no AI features — yet it dominates because its coverage of **390M+ documents** is unmatched and it's free. Semantic Scholar (200M+ papers, AI-powered TLDR summaries, influential citation detection) is the most technically advanced free alternative and serves as the data backbone for multiple other tools. But its recommendations are unreliable for many users, and it lacks any social or visual discovery layer.
+ResearchRabbit — once positioned as "Spotify for Papers" — was acquired by Litmaps in October 2025 and shifted to freemium, breaking its "free for researchers forever" promise. Connected Papers limits free users to just **5 graphs per month**. Elicit's free credits are one-time (not monthly), and its AI outputs achieve only **~90% accuracy** — meaning one in ten extractions contains errors. Scite.ai requires a credit card for even its 7-day trial. The pricing fragmentation forces researchers, especially grad students, to subscribe to multiple tools at $10–20/month each.
+Several critical needs remain entirely unaddressed. **Full-text access** is the number-one frustration — tools find papers but can't deliver them. **Recency is broken** — citation-based tools penalize new papers with zero citations, and even AI tools cluster around 2024 or earlier publications. Non-STEM disciplines (humanities, qualitative social science, law) are systematically underserved because most tools optimize for empirical, quantitative research. **Collaboration is absent** — almost every tool is a single-user experience. And **quality assessment** barely exists; only Scite analyzes whether citations are supporting or contrasting, while retraction tracking, predatory journal warnings, and replication status remain invisible.
+The competitive opportunity is clear: build an integrated platform that unifies visual discovery, AI synthesis, citation quality analysis, and social curation — with transparent pricing, open data foundations, daily-updated coverage of preprints, and genuine cross-disciplinary scope.
+---
+## What TikTok, Spotify, and Pinterest teach about paper discovery
+The most powerful UX insight from consumer apps is TikTok's "interest graph" architecture: **recommend based on inferred preferences, not social connections**. Your default view should be a personalized "For You" feed of papers — not a search box. This is the fundamental shift from Google Scholar's query-driven model to a discovery-driven one. The platform ArTok (artok.app) has already validated this concept by implementing TikTok-style swipeable paper cards with AI-generated summaries, proving that researchers respond to this format.
+From Instagram, the two most translatable patterns are the **Explore page grid** (a visual masonry layout of paper cards with extracted key figures as thumbnails) and **frictionless engagement signals** (one-tap "interesting" that feeds recommendations without the commitment of a formal bookmark). Research shows that tweets with visual abstracts receive **7× more impressions and 8× more retweets** than text-only tweets, and papers with graphical abstracts see **2× more page views**. The visual-first approach isn't decorative: it draws on the picture superiority effect, in which images are recalled more reliably than text.
+Spotify's model offers the most direct template for habitual engagement. **Discover Weekly** (a personalized playlist delivered every Monday) causes users to stream 2× more. Translated to papers, this becomes a "Weekly Reading List" of 5–10 papers mixing your core interests with adjacent-field discoveries, delivered on a predictable schedule. Spotify's **Daily Mix** (blending familiar favorites with new recommendations, segmented by "taste clusters") maps to topic-based daily paper mixes. And **Spotify Wrapped** — the annual personalized retrospective — becomes "Research Wrapped," visualizing papers read, topics explored, and intellectual trajectory over the year. This is inherently shareable content that drives organic growth.
+Pinterest's contribution is the **boards/collections model** with long-tail content longevity. Unlike TikTok's ephemeral content, academic papers are evergreen — a foundational 2017 paper is as relevant as one published yesterday. Pinterest's "related pins" recommendation (visual similarity + collaborative filtering) translates directly to "more papers like this." And Pinterest's insight that **85% of users arrive with intent** mirrors how researchers browse: purposeful exploration, not idle scrolling.
+Reddit's upvote/downvote system provides the template for community quality curation. The subreddit r/MachineLearning (2.6M members) demonstrates that structured tagging ([R] Research, [D] Discussion, [P] Project), active moderation, and author participation in comment threads create thriving research communities. ResearchHub (researchhub.com) has directly implemented this model for papers, with topic-organized "Hubs," upvote-based prioritization, and bounties for peer reviews.
+The ideal paper card — the atomic unit of your platform — should contain: an auto-extracted key figure or AI-generated visual abstract as thumbnail, the title, a one-sentence TLDR, author(s) with venue and year, citation count plus altmetric signals, 2–3 topic tags, and action buttons (save, upvote, share, "more like this"). This card should be consumable in **30–60 seconds**, matching TikTok's optimal attention window.
+---
+## Why most academic social features fail, and what actually works
+ResearchGate (17M users) and Academia.edu (~50M accounts) are the cautionary tales. Both tried to become "Facebook for researchers" and both failed at genuine social interaction. A Nature survey found the most common activity on ResearchGate was simply **"maintaining a profile in case someone wanted to get in touch"** — not discussion, not collaboration, not discovery. Researchers don't consider these platforms social media; they're profile repositories. The lesson: **utility first, social second**. HuggingFace succeeded (13M users, 2M+ public models) because social features layer on top of genuine technical utility. Zotero's group libraries work because collaboration extends an existing reference management workflow.
+The single most important social dynamic in academia is that **only ~15% of scientists feel they're keeping up with their field**. Any platform that demonstrably reduces noise and surfaces signal has a massive engagement opportunity. The daily check-in habit already exists — physicists describe arXiv's daily listings as essential ("If it's not on arXiv, it doesn't exist"), and researchers browse new releases with their morning coffee. Your platform needs to become this daily ritual.
+For comments and annotations on papers, the evidence is stark: **anonymity is essential**. PubPeer — the most successful post-publication commentary platform — works because **85.6% of comments are anonymous**. PubMed Commons (real-name comments) shut down entirely. Journal comment sections are ghost towns. The power dynamics of academia (early-career researchers can't openly criticize senior colleagues) make pseudonymous contribution essential for honest discourse. OpenReview succeeds in ML conferences because review obligations are structured and public, but even there, signed reviews are rare.
+The migration from Academic Twitter to Bluesky provides a real-time case study. A study of 300,000 academic users found **18% of scholars transitioned to Bluesky** between 2023 and early 2025, driven primarily by "information sources" (who you follow) rather than "audience" (who follows you). Bluesky's **starter packs** — curated lists of people to follow in specific fields — dramatically accelerated community formation. This is directly applicable: your platform should offer curated "starter packs" for every research field, enabling immediate community participation. Bluesky's chronological default feed also gives smaller accounts more visibility, breaking the rich-get-richer effect that plagues algorithmic platforms.
+What drives daily return? Five factors dominate the research: **fear of being scooped** (researchers need to know where they stand); **speed of field evolution** (particularly in fast-moving areas like ML); **social curation through trusted people** (alerts "curated through people" are valued far higher than database alerts); **low-friction daily browsing** (arXiv's category listings, Semantic Scholar's dashboard); and **professional necessity** (papers discussed in lab meetings create social pressure to stay current). Design for all five.
+The engagement reality is that **most researchers are passive consumers**. HuggingFace data shows 91% of models have zero likes, 84% have zero discussions, and 87% of repositories have only one contributor. Design your platform for the 90% who browse, not just the 10% who create. Lightweight engagement signals (upvotes, emoji reactions, saves) should be the primary interaction — not comments or long-form contributions.
+---
+## Seven features that would set this platform apart
+**Community-curated reading paths ("paper playlists")** are the single highest-impact feature that doesn't exist anywhere. A playlist implies sequence and curation: "Start with this foundational paper, then read this methodological advance, then this application, then this critique." No platform offers public, shareable, ordered sequences with curatorial commentary. Professors could create reading paths for students; senior researchers could share domain entry points; communities could crowdsource "Introduction to Transformer Architecture" or "The Replication Crisis in Psychology." This is inherently viral, shareable content with strong network effects.
+**Prerequisite paper chains** — "before reading this paper, read X, Y, Z" — would be transformative for students and interdisciplinary researchers. Research on educational knowledge graphs shows that AI-assisted construction of prerequisite relationships is technically feasible using deep reinforcement learning over domain knowledge graphs. The key distinction is that "intellectual prerequisite" does not equal "cited by" — a paper may cite 50 works without any being true prerequisites. This requires distinguishing citation types (methodological foundation vs. contextual reference), which your embedding model could help with.
+**A "Google Trends for Research" trend visualizer** has no consumer-friendly implementation. Existing tools (Litmaps, VOSviewer, CiteSpace) require seed papers and are designed for researchers. A simple interface where anyone types a topic and sees publication volume, citation momentum, and emerging sub-topics over time — powered by OpenAlex's **250M+ works with 50,000 added daily** — would appeal to researchers, journalists, funders, students, and policymakers. This is feasible today and would generate significant organic traffic.
+**Replication confidence scores** aggregate all available evidence — citation sentiment from Scite-style analysis, statistical indicators from z-curve methods, actual replication attempts from the Replication Database, and DARPA's SCORE program data — into a single, understandable badge per paper. The replication crisis data is sobering: only **36% of psychology studies replicate**, only **11% of pre-clinical cancer studies**, and only **49% of economics publications**. Surfacing this alongside papers would be controversial but enormously valuable.
+**Paper difficulty ratings** address a complete gap. Standard readability formulas (Flesch-Kincaid, FORCAST) are inadequate for academic papers with domain-specific jargon and mathematical notation. A model trained on researcher feedback about paper difficulty — incorporating math density, prerequisite knowledge required, vocabulary specificity, and figure complexity — could assign intuitive difficulty levels. This is especially valuable for students, science communicators, and researchers entering adjacent fields.
+**Cross-disciplinary method bridges** would proactively identify when a technique from one field could solve a problem in another. Current tools (Inciteful's Literature Connector, Semantic Scholar's embeddings) can find cross-field connections, but only when users already have papers from both fields. A true bridge detector would identify latent connections — "Monte Carlo simulation methods from physics → portfolio risk modeling in finance → uncertainty quantification in ML." Your BGE-M3 embeddings operating in a shared vector space across all disciplines are technically well-suited for this.
+**Research-specific emoji reactions** provide lightweight engagement signals beyond upvotes. Useful academic reactions might include: 🔬 (replicable), 💡 (novel insight), 📊 (great methodology), 🤔 (needs more evidence), and ⚠️ (potential issues). These generate richer training signals for your recommendation engine than binary likes while requiring minimal user effort — critical for the 90% passive-consumer majority.
+---
+## Technical architecture for cross-disciplinary scale
+**OpenAlex should be your primary data backbone.** It covers **250M+ works** across all disciplines with a free, CC0-licensed API handling 115 million monthly queries. Its entity model (works, authors, sources, institutions, topics, publishers, funders) is comprehensive, with hierarchical topic classification spanning four levels (domain → field → subfield → topic). Coverage of non-English works and Global South publications is roughly **2× that of paywalled competitors** like Scopus. Supplement with arXiv (2.4M preprints, daily updates), PubMed (36M+ biomedical records), bioRxiv/medRxiv (300K+ preprints), SSRN (1M+ social science working papers), and Unpaywall (40M+ open-access PDF URLs).
+Citation normalization across fields is essential — a paper with 50 citations in mathematics is extraordinary while 50 citations in biomedicine is unremarkable. The most elegant solution is **citing-side normalization**: each citation is weighted by the citing paper's field characteristics, requiring no predefined classification scheme. Combined with percentiles ("this paper is in the 95th percentile of citation impact for its field"), this produces intuitive, field-independent metrics. OpenAlex provides all the raw citation data needed to compute this.
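As a rough illustration of the idea above, the sketch below uses one common variant of citing-side normalization (fractional counting, where each citation is weighted by the inverse of the citing paper's reference-list length) and then converts the score to a within-field percentile. The function names and the sample numbers are hypothetical, and other weighting schemes exist.

```python
from bisect import bisect_left


def citing_side_score(citing_ref_counts):
    """Fractional citation count for one paper.

    citing_ref_counts holds the reference-list length of every paper that
    cites the target. Each citation contributes 1 / length, so a citation
    from a reference-heavy field counts for less.
    """
    return sum(1.0 / n for n in citing_ref_counts if n > 0)


def percentile_in_field(score, field_scores):
    """Share (0-100) of papers in the field scoring strictly below `score`."""
    ranked = sorted(field_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)


# Illustrative data: both papers have five citations.
# The mathematics paper is cited by papers with short reference lists.
math_paper = citing_side_score([10, 12, 8, 10, 15])
# The biomedicine paper is cited by papers with long reference lists.
bio_paper = citing_side_score([50, 60, 45, 80, 55])
```

With equal raw citation counts, the mathematics paper ends up with the higher normalized score, which is the field-independence the paragraph describes.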
+For paper-to-code and artifact linking, a deep learning approach using SPECTER embeddings plus gradient boosting has achieved **F1 = 0.94** for automatically matching papers with GitHub repositories. The primary signal is paper titles and BibTeX entries found in repository README files. However, more than half of referenced papers don't link back to any repository — a major traceability gap your platform could help close. Beyond code, a general "Papers With Artifacts" system should link to discipline-specific resources: protocols.io for biology, FRED and ICPSR for economics, Zenodo and Figshare for archived datasets, and OSF for pre-registrations. The wiki-style community contribution model from Papers With Code (before its shutdown) remains the most scalable approach for maintaining these links.
+For your recommendation engine, the state of the art combines three signal types: **content similarity** (your BGE-M3 embeddings handle this well), **citation-graph collaborative filtering** (co-viewing and co-citation patterns from Semantic Scholar's datasets), and **explicit user feedback** (thumbs up/down ratings). Scholar Inbox's research demonstrates that a "map of science" for initial topic selection plus active learning (strategically selected papers for user rating) effectively solves the cold-start problem. Their publicly released dataset of **800K explicit rating interactions** from 14,300+ users could bootstrap your collaborative filtering.
+To combat filter bubbles, allocate **10–20% of recommendation slots** to cross-disciplinary and serendipitous papers. Research on serendipity-incorporated recommender systems shows that knowledge graph-based approaches outperform standard topic modeling for unexpected-but-relevant discovery. The SerenGPT system (2025) uses LLM-based "cognition profile generation" to break feedback loops. Practically, this means mixing your feed: 80% relevant to declared interests, 15% from adjacent fields, 5% from distant fields — with the proportions adjustable by users who want more or less exploration.
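The 80/15/5 split can be sketched as a slot allocator over three already-ranked candidate lists. This is a minimal sketch: `mix_feed` and its parameters are hypothetical names, and it assumes the core, adjacent-field, and distant-field lists were retrieved separately upstream.

```python
import random


def mix_feed(core, adjacent, distant, size=20, shares=(0.80, 0.15, 0.05), seed=None):
    """Fill `size` feed slots from three ranked candidate lists.

    `shares` is (core, adjacent, distant). Core papers take whatever slots
    the two exploration pools do not use, so a short pool is backfilled.
    """
    rng = random.Random(seed)
    n_adjacent = round(size * shares[1])
    n_distant = round(size * shares[2])
    explore = adjacent[:n_adjacent] + distant[:n_distant]
    feed = list(core[: size - len(explore)])
    for paper in explore:
        # Scatter exploration papers through the feed, but keep the top
        # slot for the most relevant core paper
        feed.insert(rng.randint(1, max(1, len(feed))), paper)
    return feed
```

Exposing `shares` as a parameter is one way to let users dial exploration up or down, as the paragraph suggests.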
+Reproducibility should be a first-class feature. Jupyter notebook reproducibility is abysmal — only **8.5% of biomedical notebooks produce identical results** when re-executed. Surface this reality alongside papers: embed MyBinder launch buttons for executable notebooks, link to Zenodo/Figshare DOIs for archived code, and display reproducibility metrics when available. Auto-detect replication studies through citation patterns and title analysis, and display replication status badges (replicated / failed / partial / disputed) alongside original papers.
+---
+## A sustainable business model for academic tools
+The academic tool market is valued at **$4.9B currently, projected to reach $12B by 2033** at 18–20% CAGR. The global researcher population exceeds **8.8 million FTE** with an additional 50M+ graduate students and research-adjacent professionals. The established individual price point across Elicit, Consensus, SciSpace, Scite, and Litmaps is remarkably consistent: **$10–12/month** for individual researchers, $42–49/month for power users.
+The proven playbook from Notion, Figma, and Slack applies directly: free individual tier builds a large user base, freemium conversion captures power users at $10–15/month, team/lab features create institutional expansion at premium pricing, and bottom-up adoption drives top-down institutional sales. Figma's collaboration-as-wedge strategy is especially relevant — free for individual researchers, paid when labs collaborate.
+The recommended tier structure: a **free tier** offering paper discovery, basic AI summaries, reactions, public reading paths, and personalized feeds; an **individual Pro tier at $10–12/month** with unlimited AI chat, paper comparison, advanced search, alerts, and difficulty ratings; a **team/lab tier at $25–50/seat/month** with shared collections, collaborative annotations, team reading paths, and analytics; and **institutional licensing** with custom pricing, SSO, API access, and usage dashboards. Supplemental revenue from premium API access, data insights for publishers, and grants for open science features provides diversification.
+Critical lessons from failures: Microsoft Academic was shut down in 2021 despite massive resources because ROI was unclear. Mendeley lost community trust after Elsevier's acquisition. Academia.edu's aggressive monetization (pay-to-promote papers) generated backlash. The academic community is **deeply suspicious of commercial acquisitions** and aggressive monetization. Building on open data (OpenAlex, Semantic Scholar) avoids dependency risk. Starting free and building community trust before monetizing is essential. A hybrid model — freemium plus institutional subscriptions plus philanthropic grants (following OpenAlex's model of Arcadia Foundation support) — provides the most resilient foundation.
+---
+## Conclusion: the platform that becomes a daily ritual
+The winning product isn't another search engine or another citation mapper. It's a **daily discovery habit** — the first thing researchers open with their morning coffee, replacing the fragmented ritual of checking arXiv, scrolling Bluesky, and running Semantic Scholar searches. Three strategic principles should guide every product decision.
+First, **visual discovery beats keyword search**. The default experience should be a personalized, scrollable feed of paper cards — each featuring an extracted key figure, a one-sentence TLDR, and lightweight engagement signals — not a search box. Search is a fallback for specific needs; browsing is the primary mode.
+Second, **community-generated content creates the moat**. Paper playlists, difficulty ratings, emoji reactions, and curated reading paths are content that users create and that competitors cannot replicate. Every interaction — a save, an upvote, a playlist addition — simultaneously trains the recommendation engine and builds network effects. This flywheel is what separates a product from a feature.
+Third, **cross-disciplinary serendipity is the killer differentiator**. Every existing tool optimizes for within-field relevance. The researchers who will love your platform most are the ones who discover that a technique from computational physics solves their problem in genomics, or that an economics paper formalized exactly the game theory their CS research needs. Your BGE-M3 embeddings operating in a shared semantic space across all disciplines — combined with intentional serendipity injection into recommendation feeds — can deliver discoveries that no field-specific tool ever will.
+The technical infrastructure exists. The data is open. The competitive landscape is fragmented. The user need is acute: roughly 85% of researchers do not feel they are keeping up with their field. What's missing is the product vision that unifies it all into something researchers reach for every day.

docs/research/02-Recommendation-System-Blueprint.md ADDED

	@@ -0,0 +1,346 @@

+# Building a personalized paper recommendation system on BGE-M3 hybrid search
+Your existing hybrid search system retrieves papers matching a query — but **a recommendation system must predict what a user wants before they ask**. The architectural gap is the absence of a user model: a persistent, evolving representation of each person's research interests that drives proactive paper surfacing. The good news is that your BGE-M3 embeddings in Qdrant/Zilliz are already the hardest piece to build, and Qdrant's native Recommendation API provides a direct path to personalization with minimal new infrastructure. What follows is a complete technical blueprint, from gap analysis through MVP implementation, grounded in the latest (2023–2025) research.
+---
+## The architectural gap between search and recommendation
+Search is reactive: user types query → system returns relevant results. Recommendation is proactive: system surfaces papers the user would find valuable without explicit queries. Three components are missing from your current stack.
+**First, a user model.** You have no persistent representation of who each user is — their interests, their evolving research focus, their engagement patterns. Search treats every request as independent. A recommendation system maintains a profile vector (or set of vectors) per user that accumulates signal over time and drives candidate retrieval.
+**Second, an interaction logging layer.** Your system currently has no memory of what happened after results were returned. Which papers did the user click? How long did they read? Did they save or dismiss? These signals are the training data for personalization. Without event capture, you cannot learn.
+**Third, a recommendation-specific retrieval and ranking pipeline.** Standard search optimizes for query-document relevance. Recommendation optimizes for user-document affinity, which requires different scoring, different candidate generation (often without a query at all), and different diversity/exploration strategies to avoid filter bubbles. Your RRF fusion of dense and sparse retrieval is excellent for search — but for recommendation you need a parallel pathway that takes a user profile vector (not a query) as input and retrieves candidates based on predicted interest.
+The transition path is incremental: keep your search pipeline intact, and layer a recommendation pipeline alongside it that reuses your existing embeddings and vector infrastructure.
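The interaction logging layer described above could start as a single append-only table. The sketch below uses SQLite from the standard library. The table name, columns, and helper functions are assumptions made for illustration, not the schema in `app/db.py`.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS interactions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    TEXT NOT NULL,
    paper_id   TEXT NOT NULL,
    event_type TEXT NOT NULL,      -- e.g. click, read, save, dismiss
    dwell_ms   INTEGER,            -- NULL when not measured
    created_at REAL NOT NULL       -- Unix timestamp
);
CREATE INDEX IF NOT EXISTS idx_interactions_user
    ON interactions (user_id, created_at);
"""


def log_event(conn, user_id, paper_id, event_type, dwell_ms=None, now=None):
    """Append one interaction; `now` can be injected for tests."""
    conn.execute(
        "INSERT INTO interactions (user_id, paper_id, event_type, dwell_ms, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, paper_id, event_type, dwell_ms, time.time() if now is None else now),
    )
    conn.commit()


def recent_events(conn, user_id, limit=100):
    """Newest-first events for one user, ready to feed a profile builder."""
    rows = conn.execute(
        "SELECT paper_id, event_type, dwell_ms, created_at FROM interactions "
        "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_event(conn, "u1", "1706.03762", "click", dwell_ms=4000, now=1.0)
log_event(conn, "u1", "1706.03762", "save", now=2.0)
```

Storing raw events rather than aggregated counts keeps the option of recomputing profiles later with different weights or decay rates.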
+---
+## How to build a user taste profile from search and click signals
+User modeling is the core differentiator between search and recommendation. You have two signal types: search queries (explicit information needs) and paper clicks/reads (implicit preference signals). Here is how to turn them into a usable profile.
+### The baseline: weighted average of interacted paper embeddings
+The simplest and most effective starting point is computing a user profile vector as the **weighted average of BGE-M3 dense embeddings** of papers the user has interacted with. A 2023 RecSys paper by Bendada et al. ("On the Consistency of Average Embeddings for Item Recommendation") formally analyzed this approach and found it theoretically sound under reasonable distributional assumptions, though real-world embeddings show some consistency degradation compared to theory. Pinterest's PinnerSage (KDD 2020) validated weighted averaging at billion-scale production.
+The weighting scheme should incorporate two dimensions — **interaction type** and **recency**:
+```python
+import time
+
+import numpy as np
+
+
+def compute_user_embedding(interactions, paper_embeddings, half_life_days=30):
+    """Weighted average of paper embeddings, weighted by interaction type and recency."""
+    type_weights = {
+        'click': 0.3, 'read': 0.7, 'save': 1.2, 'search_click': 0.5
+    }
+    weighted_sum = np.zeros(1024)  # BGE-M3 dense dim
+    weight_total = 0.0
+    now = time.time()
+    for interaction in interactions:
+        base_weight = type_weights.get(interaction.type, 0.3)
+        # Dwell time modulation (skipped when no dwell time was recorded)
+        if interaction.dwell_ms and interaction.dwell_ms < 10000:
+            base_weight *= 0.1  # <10s = likely noise
+        elif interaction.dwell_ms and interaction.dwell_ms > 120000:
+            base_weight *= 1.5  # >2min = deep read
+        # Exponential recency decay: the weight halves every half_life_days
+        age_days = (now - interaction.timestamp) / 86400
+        decay = 2 ** (-age_days / half_life_days)
+        w = base_weight * decay
+        weighted_sum += w * paper_embeddings[interaction.paper_id]
+        weight_total += w
+    # None signals that the user has no usable history yet
+    return weighted_sum / weight_total if weight_total > 0 else None
+```
+This profile vector lives in the same 1024-dimensional space as your paper embeddings, so you can directly use it as a query vector for ANN search in Qdrant. **Recency half-life of 30 days** is a reasonable starting point for academic papers; tune it based on how quickly your users' interests shift.
+### Handling multi-interest users with clustering
+A single average vector has a critical flaw: it dilutes when a user has diverse interests. PinnerSage demonstrated this vividly: averaging embeddings of painting, shoe, and sci-fi interests produced recommendations for "energy boosting breakfasts." The solution is **multi-interest representation**.
+For your scale, use K-means clustering on a user's interacted paper embeddings (K=3–5 for typical researchers), then represent each cluster by its **medoid** — the actual paper embedding closest to the cluster center. At recommendation time, retrieve candidates for each cluster medoid independently, then merge results. This captures a user who works on both NLP and reinforcement learning without blending those interests into a meaningless midpoint.
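The clustering step above can be sketched in a few lines. This version is a plain-Python K-means kept dependency-free for illustration. At 1024 dimensions a NumPy or scikit-learn implementation would be the natural choice. The function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def interest_medoids(embeddings, k=3, iters=20, seed=0):
    """Cluster a user's paper embeddings and return one medoid index per cluster."""
    rng = random.Random(seed)
    k = min(k, len(embeddings))
    centers = [list(embeddings[i]) for i in rng.sample(range(len(embeddings)), k)]
    for _ in range(iters):
        # Assignment step: each paper joins its nearest centre
        clusters = [[] for _ in range(k)]
        for idx, vec in enumerate(embeddings):
            nearest = min(range(k), key=lambda c: sq_dist(vec, centers[c]))
            clusters[nearest].append(idx)
        # Update step: move each centre to the mean of its members
        new_centers = [
            mean([embeddings[i] for i in members]) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Medoid = the real paper closest to each cluster centre
    return [
        min(members, key=lambda i: sq_dist(embeddings[i], centers[c]))
        for c, members in enumerate(clusters)
        if members
    ]
```

Each returned index points at a real paper, so its stored embedding can be used directly as a query vector for per-interest retrieval.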
+### Incorporating search queries as high-value signals
+Search queries are your strongest signal — they represent explicit, articulated information needs. Encode each query with BGE-M3 and maintain a **separate query-based profile** (recent query embedding average). At recommendation time, blend the query-based profile with the interaction-based profile:
+```python
+final_profile = alpha * interaction_profile + (1 - alpha) * query_profile
+```
+Start with **alpha = 0.6** (favoring interaction history over queries) and tune based on engagement metrics.
+### LLM-generated interest summaries as an alternative user representation
+A powerful modern technique: use an LLM to generate a natural-language summary of a user's interests from their reading history, then embed that summary with BGE-M3 to get a user vector. Meta's **EmbSum** (RecSys 2024) demonstrated this at scale, using Mixtral-8x22B to generate interest summaries that a smaller T5 model then learned to replicate. For your scale, you can call an LLM directly:
+```python
+# `llm` and `bge_m3` are placeholders for your LLM client and BGE-M3 wrapper
+titles = "\n".join(f"- {p.title}" for p in user.recent_papers[:20])
+prompt = (
+    "Based on these recently read papers, summarize this researcher's "
+    f"interests in 2-3 sentences:\n{titles}"
+)
+interest_summary = llm.generate(prompt)
+user_vector = bge_m3.encode(interest_summary)
+```
+This approach is surprisingly effective because it produces a semantically rich, denoised representation. The LLM acts as an implicit topic model, filtering noise and identifying coherent research threads. Run it as a **daily batch job** per user — the cost is negligible at 10–100 users.
+---
+## Deep dive on recommendation approaches
+### Content-based filtering with BGE-M3: your foundation
+Content-based filtering using your existing embeddings is the **highest-ROI approach** and should be your primary recommendation strategy. The pipeline: compute user profile vector → ANN search in Qdrant → post-process with diversity. This works with a single user and zero collaborative signal.
+Qdrant's built-in Recommendation API simplifies this dramatically. Instead of manually computing average vectors, pass the IDs of papers the user liked as `positive` examples and disliked papers as `negative`:
+```python
+from qdrant_client import models
+
+# `client` is an existing QdrantClient instance
+results = client.query_points(
+    collection_name="papers",
+    query=models.RecommendQuery(
+        recommend=models.RecommendInput(
+            positive=[saved_paper_id_1, saved_paper_id_2, saved_paper_id_3],
+            negative=[dismissed_paper_id_1],
+            strategy=models.RecommendStrategy.BEST_SCORE,  # Evaluates each example independently
+        )
+    ),
+    limit=20,
+    query_filter=models.Filter(
+        must=[models.FieldCondition(key="year", range=models.Range(gte=2023))]
+    )
+)
+```
+The `best_score` strategy (available since Qdrant 1.6) is superior to `average_vector` for multi-interest users — it evaluates each candidate against all positive and negative examples independently during HNSW traversal, producing more diverse results.
+**Key limitation**: content-based filtering creates filter bubbles. Mitigate with **MMR re-ranking** (λ=0.6 for relevance/diversity balance) and **Qdrant's grouped recommendations** (`recommend/groups` by category) to ensure cross-topic diversity.
+### Collaborative filtering at small scale: not yet viable
+Pure collaborative filtering is **not viable with 10–100 users**. CF requires finding meaningful co-interaction patterns, and the user-item matrix at this scale is far too sparse. Research consistently shows CF needs **200+ users with substantial interaction overlap** before it outperforms content-based baselines. The `implicit` library's ALS implementation would produce unreliable, noisy recommendations at your scale.
+However, **plan for CF from day one** by logging all interactions in a structured format. When your user base reaches ~200 users with >20 interactions each, you can train an ALS model in minutes:
+```python
+import implicit
+# user_item_sparse_matrix: scipy.sparse CSR matrix, rows = users, columns = papers
+model = implicit.als.AlternatingLeastSquares(factors=64, iterations=20)
+model.fit(user_item_sparse_matrix)
+```
+An intermediate technique that works sooner: **one iteration of Implicit ALS** with fixed item embeddings. Fix the item factors to your BGE-M3 embeddings and solve for each user's vector via regularized linear regression. The result is still a linear combination of item embeddings, like a weighted average, but with weights chosen to best predict the interaction matrix. It takes a few lines of linear algebra per user and can extract slightly more signal than naive averaging.
+### Hybrid approaches for the transition period
+**LightFM** is your best option for hybrid recommendation as your user base grows. It learns user and item embeddings as the sum of their feature embeddings, gracefully handling cold start by falling back on content features when interaction data is sparse:
+```python
+from lightfm import LightFM
+model = LightFM(loss='warp', no_components=64)
+model.fit(interactions, item_features=item_feature_matrix, epochs=30)
+```
+Item features for your case: arXiv category (cs.CL, cs.CV, etc.), year, author IDs. LightFM's WARP loss optimizes directly for top-K ranking quality. **Start training LightFM at ~50 users** with a switching strategy: use content-based for users with <10 interactions, LightFM for users with ≥10 interactions.
+A simpler hybrid: **weighted score fusion**. Run content-based retrieval and (when available) collaborative retrieval independently, then combine:
+```python
+final_score = 0.7 * content_similarity + 0.3 * cf_score  # Start content-heavy
+```
+Shift the weights toward CF as your data grows.
+### Session-based recommendation for immediate personalization
+Session-based recommendation captures within-session behavior to personalize results in real time — critical for both cold-start users and capturing evolving intent. The simplest effective approach: maintain a **rolling session embedding** updated after each interaction:
+```python
+session_embedding = 0.7 * session_embedding + 0.3 * newly_clicked_paper_embedding
+```
+This exponential moving average gives recent clicks more weight while retaining session context. Use the session embedding as a query vector for ANN search, fused with the long-term user profile via weighted combination. For new sessions with a single search query, the query embedding alone is your session representation.
+More advanced session-based models like **SASRec** (Self-Attentive Sequential Recommendation) use transformer attention over item sequences but require training data that you won't have initially. Revisit these at scale.
+### LLM-based approaches worth considering
+Three practical LLM integrations ranked by implementation effort:
+**Low effort, high value — LLM as interest summarizer.** Generate per-user interest summaries as described above. Use these for both embedding-based retrieval and as explainable profile descriptions shown to users ("We think you're interested in...").
+**Medium effort — LLM-augmented candidate scoring.** For your top-20 candidates from vector retrieval, use an LLM to score relevance given the user's profile: "Given this researcher works on [interests], rate how relevant this paper is on a 1-5 scale: [paper title + abstract]." This adds ~1-2 seconds of latency but can dramatically improve precision for the final displayed results.
+**Higher effort — query augmentation.** When a user searches, use an LLM to expand their query with terms from their profile: "The user is interested in NLP and transformers. They searched for 'attention mechanisms.' Generate an expanded search query." This bridges search and recommendation by personalizing search itself.
+---
+## Solving cold start for new users with no history
+Cold start requires a **layered fallback strategy** that gracefully degrades as less information is available.
+**Layer 1: Onboarding (strongest cold-start solution).** Present new users with the arXiv category taxonomy and let them select 3–5 areas of interest. Optionally let them paste URLs or search for known papers as seed interests. Even selecting categories gives you enough signal to compute a meaningful initial profile by averaging the centroid embeddings of selected categories. Semantic Scholar found that ~5 saved papers + 3 "not relevant" ratings are sufficient for good initial recommendations.
+**Layer 2: First-query bootstrapping.** The moment a user types their first search query, encode it with BGE-M3 and use it as the initial user profile vector. This single query provides immediate personalization signal. Store it and update the profile as more queries arrive.
+**Layer 3: Population-level priors.** For users who haven't yet searched or selected interests, recommend trending papers (high recent click velocity across all users) with diversity across categories. Compute trending scores as a **daily batch job**: papers with unusual engagement spikes in the last 7 days.
+**Layer 4: Exploration with bandits.** Reserve **20–30% of recommendation slots for new users** (decreasing to 5–10% for established users) for exploration. Use **epsilon-greedy** (simplest) or **Thompson Sampling** (more principled) to show diverse items that help identify the user's interests quickly. The bandit naturally shifts from exploration to exploitation as confidence in the user's preferences grows.
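+A minimal epsilon-greedy sketch of the slot allocation described in Layer 4. The function name `fill_slots`, the `explore_pool` of out-of-profile papers, and the `"explore"`/`"exploit"` tags are illustrative assumptions rather than part of any library:

```python
import random

def fill_slots(ranked, explore_pool, n_slots, epsilon, rng):
    """Fill n_slots; each slot explores with probability epsilon, else exploits."""
    ranked = list(ranked)  # personalized candidates, best first (copied, not mutated)
    pool = [p for p in explore_pool if p not in ranked]  # papers outside the profile
    rng.shuffle(pool)
    slots = []
    for _ in range(n_slots):
        if pool and rng.random() < epsilon:
            slots.append((pool.pop(), "explore"))
        elif ranked:
            slots.append((ranked.pop(0), "exploit"))
        elif pool:  # personalized list exhausted: fall back to exploration
            slots.append((pool.pop(), "explore"))
    return slots

rng = random.Random(0)  # seeded only to make the example reproducible
ranked = ["p1", "p2", "p3", "p4"]
pool = ["t1", "t2", "t3", "t4"]
new_user_feed = fill_slots(ranked, pool, n_slots=4, epsilon=0.25, rng=rng)
established_feed = fill_slots(ranked, pool, n_slots=4, epsilon=0.10, rng=rng)
```

+Logging the tag with each impression makes it possible to tell later whether a click came from an exploration slot or from the personalized ranking.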
+---
+## Practical architecture built on your existing stack
+You need three additions to your current Qdrant + Zilliz + BGE-M3 setup: an event logger, a user profile store, and a recommendation service layer.
+### Minimal but effective architecture
+```
+User Action → Event API → PostgreSQL (interaction log)
+                       → Redis (update recent items + session state)
+Recommendation Request →
+  1. Fetch user's recent paper IDs from Redis
+  2. If cold-start → onboarding interests + trending fallback
+  3. If search query present → encode query, fuse with user profile
+  4. Call Qdrant Recommend API (positive=recent_saves, negative=dismissed)
+     OR: compute user_embedding → ANN search in Qdrant
+  5. Apply MMR diversity filter
+  6. Return results + log impressions
+Daily Batch Job →
+  - Recompute all user profile embeddings from full interaction history
+  - Update interest clusters (K-means on user's paper embeddings)
+  - Refresh trending paper scores
+  - Optional: regenerate LLM interest summaries per user
+```
+### What data to store and where
+**Redis** (hot path, <1ms): last 20 interacted paper IDs per user, current session embedding, cached user profile vector. Total memory at 100 users: negligible.
+**PostgreSQL** (interaction log): every impression, click, dwell time, save, dismiss with timestamps, session IDs, and source attribution (did the click come from search or recommendation?). Schema should capture `user_id`, `paper_id`, `interaction_type`, `timestamp`, `dwell_time_ms`, `scroll_depth`, `source`, `position_in_list`. Index on `(user_id, timestamp DESC)`.
+**Qdrant** (unchanged): paper embeddings with payloads (title, abstract, category, year). Optionally store user profile embeddings as a separate collection for user-to-user similarity if you later want collaborative signals.
+### When to compute what
+At your scale of 10–100 users, **recompute user profiles on every significant interaction** (save, extended read >30s, dismiss). This takes milliseconds with <100 users and keeps profiles maximally fresh. No streaming infrastructure needed — a synchronous update in your API handler is sufficient. Run the full batch recomputation (clustering, LLM summaries, trending scores) as a **nightly cron job**.
+### Qdrant's Universal Query API for hybrid personalized retrieval
+Qdrant's query API lets you run multi-stage hybrid recommendation in a single atomic call, combining your existing dense+sparse search with recommendation:
+```python
+results = client.query_points(
+    collection_name="papers",
+    prefetch=[
+        # Branch 1: User profile-based semantic retrieval
+        models.Prefetch(
+            query=user_profile_dense_vector,
+            using="dense",
+            limit=50
+        ),
+        # Branch 2: Sparse retrieval for lexical diversity
+        models.Prefetch(
+            query=user_profile_sparse_vector,
+            using="sparse",
+            limit=50
+        ),
+    ],
+    # Fuse both branches via Reciprocal Rank Fusion (no ColBERT rerank is performed here)
+    query=models.FusionQuery(fusion=models.Fusion.RRF),
+    limit=20,
+    query_filter=models.Filter(
+        must_not=[models.FieldCondition(
+            key="paper_id", match=models.MatchAny(any=already_read_ids)
+        )]
+    )
+)
+```
+This reuses your existing dense and sparse indexes, adds personalization through the user profile vector, filters out already-read papers, and fuses results — all in one network round trip.
+---
+## Designing feedback loops that compound over time
+The core feedback loop is: show recommendations → capture signals → update user model → show better recommendations. Three design decisions determine whether this loop converges to excellent recommendations or degenerates into a filter bubble.
+### Which signals matter most
+Rank signals by **reliability as preference indicators**, not just ease of capture:
+Saves/bookmarks and extended reads (>2 minutes) are your strongest positive signals — they indicate deliberate engagement. Clicks alone are noisy; **50–70% of clicks under 10 seconds are curiosity or misclicks**, not genuine interest. Dismiss/hide actions are your strongest negative signal and are critically underused — actively surface a "not interested" button. Search queries are high-value because they represent articulated needs.
+**Position bias correction** is essential: papers shown in position 1 get ~3x more clicks than position 5 regardless of relevance. Log the position of every impression and either debias in modeling or use inverse propensity weighting when computing preference scores.
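+A small sketch of inverse propensity weighting, assuming per-position click propensities are already known (for example from a short randomized-ranking experiment). The propensity values and the `ipw_scores` helper are made up for illustration and only mirror the ~3x gap between positions 1 and 5 mentioned above:

```python
from collections import defaultdict

def ipw_scores(impressions, propensity):
    """Sum clicks per paper, weighting each click by 1 / propensity[position]."""
    scores = defaultdict(float)
    for imp in impressions:
        if imp["clicked"]:
            scores[imp["paper_id"]] += 1.0 / propensity[imp["position"]]
    return dict(scores)

# Assumed propensities: a paper at position 5 is ~3x less likely to be clicked
propensity = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.4, 5: 1 / 3}
impression_log = [
    {"paper_id": "A", "position": 1, "clicked": True},
    {"paper_id": "B", "position": 5, "clicked": True},
    {"paper_id": "C", "position": 2, "clicked": False},
]
scores = ipw_scores(impression_log, propensity)
```

+Under these assumed propensities, the click on paper B at position 5 counts about three times as much as the click on paper A at position 1, because it happened despite the worse placement.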
+### Avoiding filter bubbles
+Three concrete mechanisms prevent recommendation homogenization. First, **diversity-aware re-ranking**: after scoring candidates by user affinity, apply MMR or use Qdrant's grouped recommendations by category to ensure cross-topic coverage. Second, **exploration budget**: always reserve 10–15% of recommendation slots for papers outside the user's established profile — trending papers, recent high-impact papers in adjacent fields, or papers popular with similar users. Third, **monitor and alert**: track per-user recommendation diversity (average pairwise cosine distance within recommendation lists) over time. If diversity drops below a threshold, increase the exploration budget.
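+The monitoring step can be sketched as below. The threshold, step size, and cap in `next_exploration_budget` are placeholder values chosen for illustration, not recommendations from any source:

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def intra_list_diversity(vectors):
    """Average pairwise cosine distance within one recommendation list."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

def next_exploration_budget(diversity, budget, threshold=0.3, step=0.05, cap=0.3):
    """Raise the exploration share when a list has become too homogeneous."""
    if diversity < threshold:
        return min(cap, budget + step)
    return budget

homogeneous = [[1.0, 0.0], [1.0, 0.0], [0.99, 0.01]]
mixed = [[1.0, 0.0], [0.0, 1.0]]
budget = next_exploration_budget(intra_list_diversity(homogeneous), budget=0.10)
```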
+### Update cadence
+At 10–100 users, **real-time profile updates on every interaction** are computationally trivial and provide the best user experience. Update the Redis-cached profile vector immediately. Run the full batch pipeline (re-clustering, LLM summaries, trending scores, model retraining if using LightFM) nightly.
+---
+## Measuring whether recommendations actually work
+### Offline evaluation for development
+Use **time-split evaluation**: train on interactions before timestamp T, test on interactions after T. This simulates real-world temporal dynamics better than random splits. Key metrics:
+**NDCG@10** is your primary metric — it rewards both relevance and correct ranking order, handling graded relevance (save > read > click). **Hit Rate@10** measures the fraction of users for whom at least one recommended paper was later interacted with — a simple, interpretable sanity check. **Coverage** measures what percentage of your paper catalog ever gets recommended. Low coverage (<10%) signals a severe filter bubble.
+For evaluation with few users, use **leave-one-out**: for each user, hide their most recent interaction and test whether the system ranks that paper in the top-K. Cross-validate across users.
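+A minimal sketch of the two metrics, assuming a linear gain (some formulations use `2**gain - 1` instead) and the graded relevance values in `GAIN`, which are illustrative. `hit_rate_at_k` follows the leave-one-out setup: one held-out paper per user.

```python
import math

GAIN = {"save": 3, "read": 2, "click": 1}  # assumed grading: save > read > click

def dcg(gains):
    # Position i (0-based) is discounted by log2(i + 2)
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(recommended, relevance, k=10):
    """recommended: ranked paper IDs; relevance: paper_id -> graded gain."""
    gains = [relevance.get(pid, 0) for pid in recommended[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0

def hit_rate_at_k(recs_by_user, heldout_by_user, k=10):
    """Fraction of users whose held-out paper appears in their top-k list."""
    hits = sum(
        1 for user, held in heldout_by_user.items()
        if held in recs_by_user.get(user, [])[:k]
    )
    return hits / len(heldout_by_user)

relevance = {"p1": GAIN["save"], "p2": GAIN["read"], "p3": GAIN["click"]}
perfect = ndcg_at_k(["p1", "p2", "p3"], relevance)
reversed_order = ndcg_at_k(["p3", "p2", "p1"], relevance)
hr = hit_rate_at_k({"u1": ["a", "b"], "u2": ["c"]}, {"u1": "b", "u2": "z"})
```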
+### Online evaluation for production
+**CTR** (click-through rate on recommendations) is your primary online metric but must be paired with **downstream engagement** (dwell time on clicked recommendations vs. search results, save rate). A system that optimizes only for CTR will learn clickbait patterns.
+At 10–100 users, traditional A/B testing lacks statistical power. Use **interleaving** instead: show results from your old system and new system interleaved in a single list, and measure which system's results get clicked more. Because every user sees both systems, interleaving is far more sample-efficient than parallel A/B testing; reductions on the order of **100x fewer users** have been reported. Alternatively, use a simple **pre/post comparison**: measure engagement metrics for 2 weeks before and after deploying recommendations, accounting for temporal trends.
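+One common way to interleave is team-draft interleaving, sketched here as an assumption about which variant to use (the text above does not prescribe one). The helper names `team_draft_interleave` and `credit` are illustrative:

```python
import random

def team_draft_interleave(list_a, list_b, length, rng):
    """Return (interleaved, team), where team maps paper ID -> 'A' or 'B'."""
    sources = {"A": list_a, "B": list_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while len(interleaved) < length:
        # The system with fewer picks goes next; a coin flip breaks ties
        if picks["A"] < picks["B"]:
            order = ["A", "B"]
        elif picks["B"] < picks["A"]:
            order = ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for name in order:
            doc = next((d for d in sources[name] if d not in team), None)
            if doc is not None:
                interleaved.append(doc)
                team[doc] = name
                picks[name] += 1
                break
        else:
            break  # both lists are exhausted
    return interleaved, team

def credit(team, clicked):
    """Count clicks per system; the system with more clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins

rng = random.Random(0)  # seeded only for reproducibility
old_system = ["p1", "p2", "p3"]
new_system = ["p2", "p4", "p5"]
shown, team = team_draft_interleave(old_system, new_system, length=4, rng=rng)
wins = credit(team, clicked=[shown[0]])
```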
+---
+## Step-by-step implementation roadmap
+### Phase 1: MVP in 1 week — content-based recommendations via Qdrant
+Build the event logging layer first. Capture every click, with timestamp and paper ID, in PostgreSQL. Add a "save" button and a "not interested" button to your UI. Store the last 20 interacted paper IDs per user in Redis.
+Then wire up Qdrant's Recommend API. Pass saved papers as `positive` examples, dismissed papers as `negative`, use the `best_score` strategy, filter out already-read papers. Apply a basic year filter (papers from the last 2 years) and return the top 10.
+Add a "Recommended for You" section to your UI that calls this endpoint. This is your MVP: **personalized recommendations with zero model training**, leveraging your existing BGE-M3 embeddings and Qdrant infrastructure.
+### Phase 2: Refined user modeling in week 2–3
+Replace Qdrant's built-in averaging with your own **weighted user profile vector** (recency-decayed, engagement-weighted). Compute it on every interaction update. Use this vector for ANN search as a complement to the Recommend API.
+Implement **multi-interest clustering**: K-means (K=3) on each user's interacted paper embeddings. Retrieve candidates for each cluster medoid independently, then merge with deduplication. This handles researchers with diverse interests.
+Add **MMR diversity re-ranking** as a post-processing step (λ=0.6). Add **onboarding flow** for new users: category selection + optional seed paper search.
+### Phase 3: Advanced personalization in week 4–6
+Integrate **LLM-generated interest summaries** as a nightly batch job. Embed summaries with BGE-M3 and blend with the interaction-based profile vector.
+Add a **cross-encoder re-ranker** (BAAI's `bge-reranker-v2-m3` is a natural fit since it comes from the same model family as your BGE-M3 embeddings) for the final stage: retrieve 100 candidates via ANN, re-rank to top 10 with the cross-encoder using augmented queries that include user interest context.
+Implement **session-based recommendation**: maintain a session embedding updated in real-time, blend with the long-term profile for users in active sessions.
+### Phase 4: Scale-ready additions at 50+ users
+Train **LightFM** hybrid model with paper metadata (categories, year) as item features. Use the switching strategy: content-based for users with <10 interactions, LightFM for active users.
+Build an **evaluation dashboard** tracking NDCG@10, hit rate, coverage, and diversity metrics offline, plus CTR and engagement depth online. Implement interleaving for comparing system variants.
+Add **exploration via epsilon-greedy** (ε=0.1 for established users, ε=0.25 for new users). This completes the feedback loop: recommendations → engagement → profile update → better recommendations, with built-in exploration to prevent convergence to filter bubbles.
+The end result is a system that starts delivering personalized recommendations from day one using your existing embeddings, progressively improves as interaction data accumulates, and scales naturally from 10 to 1000+ users without architectural changes — only the addition of collaborative signals and more sophisticated models as data permits.
+---
+## Conclusion
+The critical insight is that your BGE-M3 embeddings are already the most valuable asset for recommendation: the gap is not in representation quality but in user modeling and feedback infrastructure. **Start with Qdrant's native Recommend API using positive/negative paper examples**, which needs no model training and can be wired up in about a day once event logging is in place. The weighted-average user profile vector with recency decay and multi-interest clustering gets you 80% of the way to a production-quality system. Collaborative filtering is a future investment that only pays off at 200+ users; until then, content-based methods with LLM-augmented user profiles and cross-encoder re-ranking will outperform any CF approach at your scale. The most overlooked element is negative feedback: adding a "not interested" button and using those signals as negative examples in Qdrant's recommendation queries will improve precision more than any model upgrade.

docs/research/03-MultiInterest-Recommender-Architecture.md ADDED

	@@ -0,0 +1,323 @@

+# RFC: Multi-interest recommendation architecture for ArXiv paper discovery
+**The multi-vector RRF approach (Path B) is directionally correct and validated by industry practice at Twitter, Pinterest, and Alibaba — but the proposed implementation over-engineers the storage model while under-investing in the ranking layer.** The optimal Phase 2 architecture combines PinnerSage-style dynamic clustering (not fixed subject vectors) with Qdrant's native prefetch+RRF fusion, a LightGBM re-ranker (~3ms on CPU), and MMR diversity enforcement. This report synthesizes findings from Twitter's open-sourced algorithm, Pinterest's PinnerSage, TikTok's Monolith, YouTube's DNN architecture, and Qdrant's API internals to justify a specific architectural recommendation.
+The core tension — how to serve users with distinct interests (e.g., computer vision, reinforcement learning, and LLM alignment) without the feed collapsing into a single dominant topic — is one of the most studied problems in industrial recommendation systems. Every major platform has converged on some variant of multi-interest retrieval, but none use hand-curated subject vectors. The interest structure should emerge from user behavior, not be predefined.
+---
+## 1. Path A vs Path B: neither is quite right
+### Path A limitations are real but well-understood
+Qdrant's Recommend API with `AVERAGE_VECTOR` strategy computes a single query vector via the formula `2 × avg(positives) − avg(negatives)`, then performs standard HNSW traversal. This is computationally identical to a normal nearest-neighbor search — fast and cheap, but fatally prone to the centroid-in-nowhere problem. PinnerSage's authors demonstrated this vividly: a user interested in painting, shoes, and sci-fi produces a centroid landing in the "energy-boosting breakfast" region of embedding space.
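+To make the centroid problem concrete, here is a pure-Python sketch of the formula above. It is not Qdrant's code; `average_vector_query` is a hypothetical helper, and returning the plain positive average when there are no negatives is an assumption:

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def average_vector_query(positives, negatives):
    """2 * avg(positives) - avg(negatives), as in the formula above."""
    pos = mean_vector(positives)
    if not negatives:
        return pos  # assumption: with no negatives, use the positive average
    neg = mean_vector(negatives)
    return [2 * p - q for p, q in zip(pos, neg)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Three unrelated interests, idealized as orthogonal unit vectors
interests = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = average_vector_query(interests, negatives=[])
similarities = [cosine(query, v) for v in interests]
```

+In this idealized case the single query vector has a cosine similarity of only about 0.58 to each interest, so it sits between the three topics rather than on any of them.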
+The `BEST_SCORE` strategy avoids this by evaluating each candidate against every positive/negative example independently, selecting the best match via a sigmoid scoring function. This produces **genuinely diverse results** that grow more diverse as examples increase. However, latency scales linearly with example count: with 20 positives and 50 negatives, every candidate visited during HNSW traversal requires 70 distance computations. For a cloud-hosted Qdrant instance, this creates meaningful cost and latency pressure.
+Path A's fundamental constraint is architectural: no matter which strategy is chosen, the Recommend API treats all positive examples as equally important and offers **no mechanism to weight recent interactions over older ones**, no temporal decay, and no way to control the accuracy-diversity tradeoff.
+### Path B is validated but misspecifies the interest model
+The multi-vector retrieval pattern — maintaining multiple user embeddings and querying each independently — is not an anti-pattern. It is the **dominant approach in industrial multi-interest recommendation** as of 2025:
+- **Twitter/X** uses SimClusters (sparse vectors over 145,000 communities), TwHIN multi-modal mixture embeddings, and kNN-Embed — all explicitly multi-vector. kNN-Embed achieved a **534% relative improvement in Recall@10** vs. single-embedding retrieval on the Twitter-Follow dataset.
+- **Pinterest's PinnerSage** generates multiple medoid-based embeddings per user via Ward hierarchical clustering, producing **3–5 clusters for light users and 75–100 for heavy users**. A/B tests showed 7% engagement lift on Homefeed and 20% on Shopping.
+- **Alibaba's MIND** uses capsule network dynamic routing to extract K interest vectors (optimal K=4–7), deployed on Mobile Tmall handling major production traffic.
+- **ComiRec** (KDD 2020) extends this with controllable diversity via a tunable λ parameter, finding K=4 optimal for e-commerce datasets.
+However, Path B's specification of **fixed subject-specific vectors** (CV, LLMs, RL) is where it diverges from best practice. Every production system cited above derives interest clusters from user behavior, not predetermined categories. Fixed categories create three problems: (1) they can't adapt as new research areas emerge (e.g., a sudden interest in mechanistic interpretability), (2) they force a taxonomy decision that may not match individual user interest boundaries, and (3) they require manual maintenance as the field evolves.
+### RRF is a sound fusion method for this use case
+RRF (formula: `score(d) = Σ 1/(k + r(d))`, k=60) was originally designed for information retrieval but has become the **standard fusion method** across vector databases. Elasticsearch, Azure AI Search, MongoDB, OpenSearch, and Qdrant all implement it natively. Its key advantage: it operates on rank positions rather than raw scores, requiring no score normalization when combining results from different interest vectors that may have incomparable similarity distributions.
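+The formula is short enough to sketch directly. This is a client-side illustration of what the fusion computes, not Qdrant's implementation; `rrf_fuse` and the paper IDs are hypothetical:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of paper IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

# One ranked list per interest vector
cv_hits = ["p1", "p2", "p3"]
rl_hits = ["p4", "p2", "p5"]
llm_hits = ["p2", "p6", "p1"]
fused = rrf_fuse([cv_hits, rl_hits, llm_hits])
```

+Only rank positions enter the score, so `p2`, which appears in all three lists, rises to the top even though no similarity scores were compared across lists.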
+Qdrant's prefetch+fusion mechanism (v1.10+) executes all interest queries in a **single API call**, eliminating the multiple network round-trip concern:
+```python
+client.query_points(
+    collection_name="arxiv_papers",
+    prefetch=[
+        models.Prefetch(query=interest_vector_1, limit=30),
+        models.Prefetch(query=interest_vector_2, limit=30),
+        models.Prefetch(query=interest_vector_3, limit=30),
+    ],
+    query=models.FusionQuery(fusion=models.Fusion.RRF),
+    limit=50
+)
+```
+This architecture — multiple ANN queries fused via RRF within a single request — is computationally reasonable. Each HNSW query on 1.6M vectors takes ~5–15ms; with server-side parallelism, the wall-clock time for 3–4 prefetch queries is roughly equal to one query plus modest overhead.
+---
+## 2. What the platforms actually do about interest collapse
+### Twitter solved it architecturally, not algorithmically
+Twitter's recommendation pipeline generates ~1,500 candidate tweets from ~500 million daily tweets, split roughly 50/50 between in-network (accounts you follow) and out-of-network (discovery). The critical insight is that **diversity is embedded in the candidate generation architecture itself** through multiple heterogeneous retrieval channels:
+| Source | Method | Purpose |
+|--------|--------|---------|
+| SimClusters | Sparse 145K-dim community vectors | Multi-interest out-of-network discovery |
+| TwHIN | Knowledge graph embeddings (TransE) | Cross-entity relationship modeling |
+| RealGraph | Follow graph traversal | In-network relevance |
+| GraphJet | Real-time interaction graph | Trending/recent engagement |
+SimClusters represents each user as a **sparse vector over 145,000 overlapping communities** detected via Sparse Binary Factorization on the follow graph. Because users belong to multiple communities simultaneously, retrieval naturally covers multiple interests. Tweet embeddings are computed in real-time by aggregating engagement signals from users who interacted with the tweet, projected into community space.
+Twitter's Heavy Ranker (a parallel MaskNet with ~48 million parameters) scores candidates on 10 simultaneous objectives. The scoring formula heavily penalizes negative signals: `P(report) × −369` and `P(negative_reaction) × −74`, while positive engagement weights are modest (`P(favorite) × 0.5`). Post-ranking heuristics enforce author diversity (score halved for consecutive same-author tweets) and maintain the in-network/out-of-network ratio.
+### TikTok relies on real-time adaptation rather than multi-vectors
+TikTok takes a fundamentally different approach. Rather than maintaining multiple static user representations, ByteDance's **Monolith** system enables near-real-time model updates through a streaming architecture that closes the feedback loop within minutes. Key innovations include collisionless embedding tables (Cuckoo hashing, eliminating embedding collisions), streaming online training via Flink-based Online Joiner, and minute-level parameter synchronization between training and inference servers.
+TikTok enforces diversity through **Determinantal Point Processes** (controlling intra-list similarity), creator rotation rules, and explicitly allocating **15–25% of recommendations** to content outside the user's expressed preferences. Thompson Sampling and Epsilon-Greedy strategies inject calculated exploration. This works because TikTok captures rich implicit signals (watch time, loop count, scroll speed, completion percentage) that enable rapid interest detection.
+### Pinterest provides the clearest blueprint for this system
+PinnerSage is the most directly applicable model for the ArXiv recommender because it operates under similar constraints: pre-computed item embeddings in a fixed space (PinSage GCN embeddings, analogous to BGE-M3), no joint user-item training, and a need to support users with 3–100 distinct interests.
+PinnerSage's design choices are instructive:
+- **Ward hierarchical clustering** on the user's interacted item embeddings, with a threshold parameter α controlling merge stopping — this automatically determines K per user
+- **Medoid representation** (the actual item closest to cluster center), not centroids — prevents topic drift and enables cache sharing across users
+- **At serving time, 3 medoids are sampled** proportional to importance scores for ANN retrieval — this controls query cost while maintaining coverage
+- **Dual temporal architecture**: daily batch job processes 60–90 days of history; online component captures same-day interactions; end-of-day reconciliation merges both
+The medoid approach is particularly elegant for this system: each interest cluster is represented by an actual ArXiv paper's embedding, stored as a simple paper ID reference rather than a separate vector.
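+The medoid and sampling steps from the list above can be sketched as follows, assuming cluster labels have already been produced (for example by Ward clustering in a library such as SciPy). `medoids_by_cluster`, `sample_medoids`, and the use of cluster size as the importance score are illustrative assumptions; PinnerSage's own importance scoring is not reproduced here:

```python
import random
from collections import defaultdict

def medoids_by_cluster(embeddings, labels):
    """embeddings: paper_id -> vector; labels: paper_id -> cluster label."""
    clusters = defaultdict(list)
    for pid, label in labels.items():
        clusters[label].append(pid)
    medoids = {}
    for label, members in clusters.items():
        dim = len(embeddings[members[0]])
        centroid = [sum(embeddings[p][i] for p in members) / len(members)
                    for i in range(dim)]

        def sq_dist(pid):
            return sum((x - c) ** 2 for x, c in zip(embeddings[pid], centroid))

        # The medoid is a real paper: the member closest to the cluster center
        medoids[label] = min(sorted(members), key=sq_dist)
    return medoids

def sample_medoids(medoids, importance, n, rng):
    """Sample up to n distinct medoids with probability proportional to importance."""
    pool = dict(importance)
    chosen = []
    while pool and len(chosen) < n:
        names = sorted(pool)
        pick = rng.choices(names, weights=[pool[name] for name in names], k=1)[0]
        chosen.append(medoids[pick])
        del pool[pick]
    return chosen

embeddings = {
    "p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.8, 0.0],
    "p4": [0.0, 1.0], "p5": [0.2, 0.8], "p6": [0.1, 0.9],
}
labels = {"p1": "cv", "p2": "cv", "p3": "cv", "p4": "rl", "p5": "rl", "p6": "rl"}
medoids = medoids_by_cluster(embeddings, labels)
importance = {"cv": 3, "rl": 3}  # assumption: cluster size as importance
query_ids = sample_medoids(medoids, importance, n=3, rng=random.Random(0))
```

+Because each medoid is a paper ID, the profile can be stored as a short list of IDs and the corresponding vectors looked up at query time.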
+---
+## 3. Temporal dynamics: EWMA beats rolling windows
+### The math favors exponential decay
+EWMA updates user embeddings with the formula `embedding_t = α × item_embedding_t + (1−α) × embedding_{t-1}`, where α controls responsiveness. The operation is **O(d) per interaction** with O(d) storage per user — for BGE-M3's 1024 dimensions, that's ~4KB per user embedding.
+| α value | Effective window | Behavior | Use case |
+|---------|-----------------|----------|----------|
+| 0.05 | ~40 interactions | Very stable, slow drift | Core long-term interests |
+| 0.10–0.15 | ~13–20 interactions | Balanced | General user profile |
+| 0.30–0.50 | ~3–5 interactions | Highly responsive | Session-level interests |
+Rolling windows (recompute from last N interactions) require storing N item embeddings per user. At N=50 with 1024-dimensional BGE-M3 vectors, that's **~200KB per user** vs. EWMA's 4KB. More importantly, rolling windows create a hard boundary: once an interaction falls outside the last N, it is completely forgotten regardless of how significant it was. Koren's seminal work on temporal dynamics in collaborative filtering (Netflix Prize, CACM 2010) warns that classical time-window approaches lose too much signal when discarding data instances.
+### The recommended temporal architecture
+Adopt Spotify's two-component pattern, validated in their production system serving 600M+ users:
+- **Long-term interest embedding**: EWMA with α=0.10, capturing enduring research interests across ~20 interactions
+- **Short-term session embedding**: EWMA with α=0.40, capturing current reading session context across ~3–5 interactions
+Both embeddings update on each save/dismiss action in a FastAPI background task. The long-term embedding feeds into PinnerSage-style clustering (recomputed daily or on-demand when the embedding has drifted significantly). The short-term embedding serves as an additional prefetch query to boost recency-relevant candidates.
+This is strictly superior to storing rolling windows: less memory, simpler code (no circular buffer management), natural handling of irregular interaction frequencies, and no signal loss from hard cutoffs.
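+A minimal sketch of the two-speed update, using plain Python lists in place of 1024-dimensional vectors. `TwoSpeedProfile` is a hypothetical class name, and `effective_window` uses the rule of thumb N ≈ 2/α, which is an assumption about how the window sizes in the table above were derived:

```python
def ewma_update(profile, item_embedding, alpha):
    """Return alpha * item + (1 - alpha) * profile; the first item seeds the profile."""
    if profile is None:
        return list(item_embedding)
    return [alpha * x + (1 - alpha) * p for x, p in zip(item_embedding, profile)]

class TwoSpeedProfile:
    """Long-term and short-term EWMA embeddings updated from the same events."""

    def __init__(self, long_alpha=0.10, short_alpha=0.40):
        self.long_alpha = long_alpha
        self.short_alpha = short_alpha
        self.long_term = None
        self.short_term = None

    def observe(self, item_embedding):
        self.long_term = ewma_update(self.long_term, item_embedding, self.long_alpha)
        self.short_term = ewma_update(self.short_term, item_embedding, self.short_alpha)

def effective_window(alpha):
    # Rule of thumb (assumption): an EWMA "remembers" roughly 2 / alpha interactions
    return 2.0 / alpha

profile = TwoSpeedProfile()
profile.observe([1.0, 0.0])  # first saved paper seeds both embeddings
profile.observe([0.0, 1.0])  # a paper from a different area
```

+After the second paper, the short-term embedding has moved 40% of the way toward the new topic while the long-term embedding has moved only 10%, which is the intended difference in responsiveness.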
+---
+## 4. The re-ranking layer is the highest-leverage investment
+### LightGBM is the clear winner for zero-GPU constraints
+Cross-encoder reranking (BGE-reranker-v2-m3) requires **~800ms for 100 candidates on CPU** — completely impractical for interactive recommendations. ColBERT-style late interaction models offer ~50–100ms for 100 candidates but require significant infrastructure for precomputing token-level embeddings across 1.6M papers.
+LightGBM with a lambdarank objective scores **500 candidates in 2–5ms on a single CPU core**. This is not a compromise — tree-based re-rankers are used in production at Airbnb (which called their GBDT ranker "one of the largest step improvements in home bookings in Airbnb history"), LinkedIn, and Delivery Hero. The RecSys Challenge 2024 winning approach used LightGBM Ranker, outperforming neural methods on tabular ranking data.
+| Approach | 100 candidates (CPU) | 500 candidates (CPU) | Feasibility |
+|----------|----------------------|----------------------|-------------|
+| LightGBM | ~1–2ms | ~3–5ms | ✅ Excellent |
+| XGBoost | ~2–3ms | ~5–8ms | ✅ Good |
+| BGE-reranker-v2-m3 | ~800ms | ~4,000ms | ❌ Impractical |
+| ColBERT (PLAID) | ~50–100ms | ~250–500ms | ⚠️ Marginal |
+Typical features for a LightGBM re-ranker in this context would include: cosine similarity between user embedding and paper embedding, paper recency (days since publication), paper citation velocity, category overlap with user history, author overlap with previously saved papers, abstract length, and engagement signals from similar users (once collaborative data accumulates).
+### MMR provides practical diversity enforcement
+Maximal Marginal Relevance (formula: `MMR = argmax[λ × Sim(d_i, Q) − (1−λ) × max Sim(d_i, d_j)]`) is the recommended starting point. For selecting 20 papers from 200 re-ranked candidates, MMR completes in **<1ms** with precomputed embeddings. Setting λ=0.6 provides a good balance between relevance and diversity for academic paper discovery.
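The greedy form of the MMR rule above can be sketched as follows. This is an illustrative sketch, not the repository's implementation: `mmr_select` is a hypothetical name, the relevance term is assumed to be the re-ranker score, and embeddings are assumed L2-normalized so that a dot product equals cosine similarity.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def mmr_select(relevance: list[float], embeddings: list[list[float]], k: int, lam: float = 0.6) -> list[int]:
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to anything already selected)."""
    selected: list[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((dot(embeddings[i], embeddings[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 covers a different topic.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.6]
picked = mmr_select(relevance, embeddings, k=2)
```

In this toy case the second slot goes to candidate 2 even though candidate 1 has the higher relevance, because candidate 1 is fully redundant with the first pick.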
+DPP (Determinantal Point Processes) provides theoretically superior global diversity — YouTube's A/B tests showed increased user satisfaction vs. both no-diversity and MMR baselines. However, DPP's implementation complexity (greedy MAP approximation with incremental Cholesky updates) is significantly higher, and the practical difference at k=20 is modest. Graduate to DPP if MMR proves insufficient.
+Category-based quotas (e.g., "ensure at least 2 papers from each of the user's top 3 ArXiv categories") can serve as a simple, interpretable diversity floor alongside MMR, despite Airbnb's negative experience with rigid quota-based diversification in their domain.
+---
+## 5. Recommended Phase 2 architecture
+### The proposed design
+```
+User Action (save/dismiss)
+    │
+    ▼
+┌─────────────────────────────────────────────┐
+│  Profile Update Service (FastAPI background) │
+│  • EWMA update: long-term (α=0.10)          │
+│  • EWMA update: short-term (α=0.40)         │
+│  • Store negative centroid from dismissals   │
+│  • Trigger re-clustering if drift > θ        │
+└──────────────┬──────────────────────────────┘
+               │
+    ▼ (daily batch or on-demand)
+┌─────────────────────────────────────────────┐
+│  Interest Clustering (Ward's method)         │
+│  • Cluster saved paper embeddings            │
+│  • K determined automatically (typically 3-5)│
+│  • Store medoid paper IDs per cluster        │
+│  • Compute importance weights per cluster    │
+└──────────────┬──────────────────────────────┘
+               │
+    ▼ (on feed request)
+┌─────────────────────────────────────────────┐
+│  Candidate Retrieval (Qdrant prefetch+RRF)   │
+│  • Prefetch 1: top medoid vector (limit=40)  │
+│  • Prefetch 2: 2nd medoid vector (limit=30)  │
+│  • Prefetch 3: 3rd medoid vector (limit=25)  │
+│  • Prefetch 4: short-term embedding (limit=25)│
+│  • Fusion: RRF (k=60)                        │
+│  • Filter: exclude dismissed paper IDs       │
+│  • Output: ~100 candidates                   │
+│  • Latency: ~15-25ms (single API call)       │
+└──────────────┬──────────────────────────────┘
+               │
+               ▼
+┌─────────────────────────────────────────────┐
+│  Re-Ranking (LightGBM lambdarank)            │
+│  • Features: user-paper similarity,          │
+│    paper recency, citation velocity,         │
+│    category match, author overlap            │
+│  • Score 100 candidates                      │
+│  • Latency: ~1-2ms                           │
+│  • Output: scored + sorted candidates        │
+└──────────────┬──────────────────────────────┘
+               │
+               ▼
+┌─────────────────────────────────────────────┐
+│  Diversity Enforcement (MMR, λ=0.6)          │
+│  • Select top-k from scored candidates       │
+│  • Penalize similarity to already-selected   │
+│  • Inject 1-2 exploration papers (random     │
+│    high-quality papers from adjacent topics)  │
+│  • Latency: <1ms                             │
+│  • Output: final feed (20-30 papers)         │
+└─────────────────────────────────────────────┘
+```
+### Why not Two-Tower or graph-based alternatives
+**Two-Tower models** (separate user and item encoder networks producing embeddings for dot-product comparison) are the industry standard at scale but require GPU training infrastructure, large volumes of interaction data for learning, and ongoing model retraining. With a small initial user base and zero-GPU constraint, Two-Tower is premature — it solves a scale problem this system doesn't yet have.
+**Graph-based approaches** (like Pinterest's Pixie random walks on a bipartite user-item graph) require dense interaction graphs to be effective. Pixie operates on a graph with 3B+ nodes and 17B+ edges. A small user base on 1.6M papers will produce an extremely sparse graph where random walks yield poor recommendations. This approach becomes viable only after accumulating substantial collaborative signal.
+The proposed architecture is designed for **graceful evolution**: the LightGBM re-ranker can incorporate collaborative features (what did similar users save?) as the user base grows, and the retrieval layer can eventually be augmented with a Two-Tower candidate generator when sufficient training data exists, without disrupting the existing pipeline.
+### Optimal number of user embeddings
+Based on converging evidence from PinnerSage, MIND, and ComiRec, the optimal configuration for this system is:
+- **3–4 interest medoids** from Ward clustering on saved papers (data-driven K, not fixed)
+- **1 short-term session embedding** (EWMA α=0.40)
+- **1 negative centroid** from dismissed papers (used as Qdrant negative example or filter)
+- **Total: 4–6 query vectors per feed request**, executed as prefetch queries in a single Qdrant API call
+This aligns with MIND's finding that K=4–7 is optimal for moderately diverse interests, while keeping query costs manageable. As users accumulate more interactions, the clustering will naturally produce more clusters — PinnerSage demonstrated that heavy users may require up to 75–100 clusters, but sampling 3 medoids at serving time keeps latency constant.
+### What to build first
+The implementation should be phased to deliver value incrementally:
+**Phase 2a (immediate):** Replace the simple Recommend API call with Qdrant's `BEST_SCORE` strategy using all saved papers as positives and dismissed papers as negatives. This is a one-line change that provides meaningfully better diversity. Add a dismissed-paper ID filter to exclude already-seen content. Implement EWMA user embedding updates.
+**Phase 2b (1–2 weeks):** Implement Ward clustering on saved paper embeddings. Switch to prefetch+RRF retrieval with dynamic interest medoids. Add the short-term session embedding as an additional prefetch query. This is the core multi-interest architecture.
+**Phase 2c (2–4 weeks):** Train a LightGBM re-ranker on accumulated save/dismiss data. Add MMR diversity enforcement. Implement exploration injection (2–3 randomly sampled high-quality recent papers from adjacent ArXiv categories per feed).
+**Phase 2d (future):** When sufficient interaction data exists, add collaborative filtering features to LightGBM (similar-user signals). Evaluate DPP if MMR diversity proves insufficient. Consider a lightweight Two-Tower model if/when a GPU budget becomes available.
+---
+## Conclusion
+The proposed system's instinct toward multi-vector retrieval with RRF fusion is validated by converging evidence from Twitter (kNN-Embed, SimClusters), Pinterest (PinnerSage), and Alibaba (MIND, ComiRec). **The critical correction is to derive interest clusters from user behavior rather than predefined subject categories.** Ward hierarchical clustering on saved paper embeddings — producing 3–5 medoid-based interest vectors for typical users — is simpler, more adaptive, and better validated than maintaining fixed CV/LLM/RL vectors.
+The highest-leverage Phase 2 investment is the **LightGBM re-ranking layer**, not retrieval sophistication. At 2–5ms inference on CPU for hundreds of candidates, it provides learned ranking quality comparable to neural approaches without any GPU requirement. Cross-encoder reranking is definitively impractical at this scale without GPU (~800ms for 100 candidates). The combination of multi-interest retrieval, lightweight learned re-ranking, and MMR diversity enforcement creates a system that can evolve gracefully from dozens to thousands of users while keeping total feed generation latency under **30ms per request** — well within the budget for a responsive FastAPI application.
+---
+## Addendum: Corrections from Doc 06 (Deep Research Verdict)
+> **Added April 2025.** Doc 06 (`06-Deep-Research-Verdict.md`) performed a comprehensive review
+> of this RFC against the original PinnerSage paper, empirical literature, and production
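+> systems. It identified **4 concrete architectural faults** in this document, plus one additional
+> wiring correction. Two faults have been fixed in code, one never applied because the component
+> was not built, and one is pending; the additional correction has also been fixed. This addendum
+> preserves the original text above for reference while documenting the corrections.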
+### Fault 1: RRF is wrong for multi-interest fusion (§1, §5) — PENDING
+**What this doc says (§1):** "RRF is a sound fusion method for this use case."
+**What Doc 06 found:** RRF was designed to fuse *different retrievers answering the same query* (BM25 + vector + Condorcet). Using it to fuse *different interest queries for the same user* means papers near the centroid of ALL interests get boosted — the exact failure mode multi-interest models exist to prevent. PinnerSage itself does not use RRF; they sample 3 medoids proportional to importance and concatenate into the downstream ranker. Taobao ULIM, Pinterest Bucketized-ANN, Twitter kNN-Embed, and ComiRec all use quota-based allocation, not RRF.
+**The correction:** Replace RRF with importance-weighted quota allocation:
+```
+slot_k = max(⌊F × w_k⌋, F_min=3)
+```
+Each cluster gets feed slots proportional to its importance, with a floor of 3 to protect minority interests.
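A small sketch of the quota formula above. The function name `allocate_slots` and the cluster labels are hypothetical, and the weights are assumed to be normalized importance scores.

```python
import math

def allocate_slots(weights: dict[str, float], feed_size: int, f_min: int = 3) -> dict[str, int]:
    """slot_k = max(floor(F * w_k), F_min), with weights normalized to sum to 1."""
    total = sum(weights.values())
    return {
        name: max(math.floor(feed_size * w / total), f_min)
        for name, w in weights.items()
    }

# Four interest clusters, feed of F = 20 papers.
slots = allocate_slots({"nlp": 0.5, "vision": 0.25, "rl": 0.125, "theory": 0.125}, feed_size=20)
```

Here the two minority clusters are lifted from 2 to the floor of 3, so the allocation sums to 21 rather than 20. The formula as stated does not say how to trim that overshoot; taking the surplus from the largest cluster is one option, but that is an assumption.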
+**Note:** RRF *is* correct for the search bar (fusing dense + sparse for the *same* query). Only the recommendation pipeline needs quota.
+**Status:** ⚠️ Code still uses RRF. Scheduled for Phase 4.
+---
+### Fault 2: α_long = 0.10 is too aggressive (§3) — FIXED ✅
+**What this doc says (§3):** "Long-term interest embedding: EWMA with α=0.10, capturing enduring research interests across ~20 interactions."
+**What Doc 06 found:** PinnerSage (KDD 2020) tested λ=0.1 and **explicitly rejected it as too recent-biased**. Their optimal was λ=0.01. With α=0.10, a minority interest retains only 12% of its signal after 20 saves in a different topic (`0.9^20 ≈ 0.12`).
+**The correction:** α_long changed from 0.10 → **0.03** (effective window ~66 interactions). At α=0.03, minority interests retain 54% after 20 saves (`0.97^20 ≈ 0.54`).
+**Code change:** `app/recommend/profiles.py` — `ALPHA_LONG_TERM = 0.03`
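The retention figures quoted above can be reproduced with a few lines of arithmetic. The `effective_window` helper uses the common `2/α − 1` rule of thumb for an EWMA's equivalent sample count, which is an assumption about how the "~66 interactions" figure was derived.

```python
def retention(alpha: float, n: int) -> float:
    """Fraction of an old interest's weight left after n EWMA updates on other topics."""
    return (1.0 - alpha) ** n

def effective_window(alpha: float) -> float:
    """Rule of thumb: an EWMA with factor alpha behaves like an N-sample average, N = 2/alpha - 1."""
    return 2.0 / alpha - 1.0

for alpha in (0.10, 0.03):
    print(
        f"alpha={alpha:.2f}: retention after 20 saves = {retention(alpha, 20):.2f}, "
        f"effective window ~ {effective_window(alpha):.0f} interactions"
    )
```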
+---
+### Fault 3: BGE-reranker-v2 in the hot path is infeasible (§4) — N/A (never built)
+**What this doc says (§4):** Table shows BGE-reranker-v2-m3 at ~800ms for 100 candidates, marked "❌ Impractical." The text already warns against it. However, the Phase 2c plan at the end could be misread as suggesting a cross-encoder in the hot path.
+**What Doc 06 found:** Confirms the rejection. If cross-encoder signal is needed, distill BGE-reranker-v2 offline into a TinyBERT-L2 student (FlashRank recipe) and use as a LightGBM feature on top-20 only. Never serve the full cross-encoder at request time.
+**Status:** ✅ Not an issue — the heuristic scorer was built instead, with LightGBM planned as replacement.
+---
+### Fault 4: Ward needs L2 normalization (§2, §5) — FIXED ✅
+**What this doc says (§2):** "Ward hierarchical clustering on the user's interacted item embeddings."
+**What Doc 06 found:** Ward is mathematically Euclidean-only (Murtagh & Legendre, J. Classification 2014). BGE-M3 operates in cosine space. Without explicit L2 normalization, Euclidean Ward is silently wrong because vector magnitudes affect cluster assignments. On L2-normalized vectors, ‖a−b‖² = 2(1−cos(a,b)), so Euclidean Ward correctly gives cosine-based clustering.
+**Code change:** `app/recommend/clustering.py` — explicit L2 normalization added before `pdist()`.
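The identity behind this fix is easy to check numerically. The sketch below uses plain Python and toy 2-D vectors (hypothetical values); it is not the code in `clustering.py`.

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sq_euclidean(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two vectors with different magnitudes (5 and 10).
a, b = [3.0, 4.0], [10.0, 0.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

lhs = sq_euclidean(a_n, b_n)        # squared Euclidean distance on unit vectors
rhs = 2.0 * (1.0 - cosine(a, b))    # 2 * (1 - cosine similarity)
raw = sq_euclidean(a, b)            # what Ward would see without normalization
```

`lhs` and `rhs` agree, while `raw` is dominated by the magnitude difference, which is the silent error the normalization step removes.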
+---
+### Additional correction: Negative profile wiring — FIXED ✅
+**What this doc says (§5):** "1 negative centroid from dismissed papers (used as Qdrant negative example or filter)."
+**What Doc 06 found:** The negative profile was computed and stored but never read by the recommendation pipeline. YouTube (2023, Xia et al.) showed a 3× gain from using dislikes as both features and training labels. The recommended design is a three-layer negative system: session hard-filter + short-term penalty + long-term EWMA negative medoid.
+**Code change:** `app/recommend/reranker.py` — added `cosine_sim_negative` as Feature 5 with 0.15 penalty weight. `app/routers/recommendations.py` — loads negative profile and passes to reranker.
+---
+### Correction summary
+| Fault | This doc said | Doc 06 correction | Code status |
+|---|---|---|---|
+| RRF for rec fusion | Sound method | Wrong — use quota with F_min floor | ⚠️ Phase 4 |
+| α_long = 0.10 | Balanced | Too aggressive — use 0.03 | ✅ Fixed |
+| BGE-reranker in hot path | Impractical | Confirmed — distill offline only | ✅ N/A |
+| Ward without L2-norm | Implicit | Must L2-normalize first | ✅ Fixed |
+| Negative profile unused | Mentioned | Must wire into reranking | ✅ Fixed |

docs/research/04-Technical-Roadmap-Legacy.md ADDED

	@@ -0,0 +1,1009 @@

+> [!WARNING]
+> **This document is LEGACY.** It was the initial technical roadmap and has been superseded by [03-MultiInterest-Recommender-Architecture.md](03-MultiInterest-Recommender-Architecture.md), which was the architecture we actually implemented. Key differences:
+> - This doc recommended **BGE-reranker-v2** (~800ms on CPU) → We used **heuristic scorer / LightGBM** (~2ms)
+> - This doc recommended **fixed K=3 K-Means** → We used **Ward hierarchical clustering** (auto K)
+> - This doc recommended **Claude LLM reranking** → Deferred (too expensive per query at this stage)
+> - This doc recommended **Redis** → We stuck with **SQLite** (simpler, and sufficient at our current scale)
+>
+> Kept for historical reference and ideas that may be revisited later.
+# Complete technical roadmap: arXiv recommender upgrade
+**FastAPI + HTMX + SQLite, wired to Qdrant's Recommend API, with Claude-powered reranking layered in Week 3.** This is a phase-by-phase execution plan for adding personalized recommendations to your existing BGE-M3 hybrid search app, designed to ship an MVP in Week 1, sophistication in Week 2, and LLM augmentation in Week 3. Every architectural choice prioritizes a single Python developer moving fast with Claude Code.
+The core insight driving this roadmap: **Qdrant's native Recommend API already does 80% of what you need**. Feed it positive and negative paper IDs from user interactions, and it returns personalized recommendations without you writing any ML code. Everything else — profile embeddings, clustering, LLM reranking — is progressive enhancement on top of that foundation.
+---
+## Architecture decisions: the full stack
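+**Backend: FastAPI.** Non-negotiable for this use case. Both `qdrant-client` and `pymilvus` offer async clients. FastAPI's ASGI foundation lets you query Qdrant Cloud and Zilliz Cloud concurrently, so hybrid search latency is bounded by the slower of the two calls rather than by their sum (roughly half when both take similar time), compared with issuing them back to back in synchronous Flask: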
+```python
+from fastapi import FastAPI
+from qdrant_client import AsyncQdrantClient
+import asyncio
+app = FastAPI()
+qdrant = AsyncQdrantClient(url="https://your-cluster.qdrant.io", api_key=QDRANT_KEY)
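+@app.get("/api/search")
+async def search(q: str):
+    # embed_query() is a placeholder for your BGE-M3 encoder; it is not defined in this snippet
+    query_vec, query_sparse = embed_query(q)
+    # Issue both vector-store calls concurrently rather than back to back
+    dense_results, sparse_results = await asyncio.gather(
+        qdrant.query_points(collection_name="papers", query=query_vec, using="dense", limit=20),
+        zilliz_sparse_search(query_sparse, limit=20)
+    )
+    return rrf_fusion(dense_results, sparse_results)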
+```
+FastAPI benchmarks at **3,000+ RPS** for I/O-bound workloads versus Flask's ~500 RPS. Qdrant's own tutorials use FastAPI. Pydantic integration validates search payloads natively. Django is overkill — you don't need ORM, admin panels, or form handling.
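The `rrf_fusion` call in the endpoint above is not defined in this document. A minimal sketch of Reciprocal Rank Fusion over ranked ID lists could look like the following; the name `rrf_fuse`, the input shape (plain lists of paper IDs), and the sample IDs are assumptions, and `k=60` is the conventional constant that the project's search pipeline also uses.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """score(d) = sum over input lists of 1 / (k + rank_of_d), with ranks starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["p1", "p2", "p3"]    # best-first IDs from the dense retriever
sparse = ["p3", "p1", "p4"]   # best-first IDs from the sparse retriever
fused = rrf_fuse([dense, sparse])
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale.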
+**Frontend: HTMX + Jinja2 + TailwindCSS/DaisyUI.** Zero JavaScript build tooling. The interaction patterns you need — search, save, dismiss, paginate recommendations — are exactly what HTMX excels at: simple request-response cycles with HTML fragment swapping. Every button is a single `hx-post` attribute:
+```html
+<!-- Save button swaps itself to "Saved ✓" on click -->
+<button hx-post="/api/papers/{paper_id}/save"
+        hx-swap="outerHTML" hx-target="this">⭐ Save</button>
+<!-- Not-interested removes the card with a fade -->
+<button hx-post="/api/papers/{paper_id}/not-interested"
+        hx-swap="delete" hx-target="closest .paper-card"
+        class="transition-opacity">✕</button>
+```
+No React, no npm, no Node.js dependency. FastAPI + HTMX is well-established with dedicated libraries (`fasthx` on PyPI) and production tutorials. DaisyUI gives you pre-built card components for paper recommendations without custom CSS.
+**Database: SQLite (now) → Supabase (later).** Start with `aiosqlite` — zero configuration, ships with Python, survives process restarts. SQLite in WAL mode handles **~462K SELECT QPS** concurrent with **~11K UPDATE QPS** — orders of magnitude beyond what you need. When you need authentication and multi-user support, migrate to Supabase (managed PostgreSQL + built-in auth, free tier: 500MB database, 50K MAUs).
+**Deployment: Render.** Free tier gives 750 hours runtime/month, automatic TLS, git-push deploys, native FastAPI support. The FastAPI + HTMX TestDriven.io tutorial deploys to Render as its production target. Upgrade to $7/month Starter for always-on when past MVP (free tier sleeps after 15 minutes of inactivity).
+| Layer | Choice | Monthly Cost |
+|-------|--------|-------------|
+| Backend | FastAPI + Uvicorn | $0 |
+| Frontend | HTMX + Jinja2 + DaisyUI | $0 |
+| User DB | SQLite → Supabase | $0 |
+| Dense vectors | Qdrant Cloud (existing) | Free tier |
+| Sparse vectors | Zilliz Cloud (existing) | Free tier |
+| Deployment | Render | $0–7 |
+| **Total** | | **$0–7/month** |
+### Project structure after migration from Kaggle
+```
+arxiv-recommender/
+├── app/
+│   ├── main.py                  # FastAPI entry point
+│   ├── config.py                # Settings via pydantic-settings
+│   ├── search/
+│   │   ├── hybrid.py            # BGE-M3 + RRF fusion (port from notebook)
+│   │   ├── qdrant_client.py     # AsyncQdrantClient wrapper
+│   │   └── zilliz_client.py     # Zilliz sparse search wrapper
+│   ├── recommend/
+│   │   ├── engine.py            # Qdrant Recommend API integration
+│   │   ├── profiles.py          # User profile vectors (Phase 2)
+│   │   ├── clustering.py        # Multi-interest K-means (Phase 2)
+│   │   ├── reranker.py          # BGE-reranker + Claude reranking (Phase 3)
+│   │   └── exploration.py       # Epsilon-greedy + MMR (Phase 4)
+│   ├── events/
+│   │   ├── schema.py            # Pydantic event models
+│   │   ├── logger.py            # SQLite event writer
+│   │   └── state.py             # In-memory user state cache
+│   ├── templates/               # Jinja2 + HTMX
+│   │   ├── base.html
+│   │   ├── index.html
+│   │   └── partials/
+│   │       ├── search_results.html
+│   │       ├── paper_card.html
+│   │       └── recommendations.html
+│   └── evaluation/
+│       └── metrics.py           # NDCG, Hit Rate, offline eval
+├── CLAUDE.md                    # Claude Code project context
+├── requirements.txt
+├── render.yaml
+└── .env
+```
+**Critical migration note**: Load CSV metadata into Qdrant payloads during indexing rather than reading CSV at query time. Your Qdrant collection becomes the metadata store:
+```python
+await qdrant.upsert(collection_name="papers", points=[
+    models.PointStruct(
+        id=paper_id, vector={"dense": embedding.tolist()},
+        payload={"title": title, "abstract": abstract, "categories": cats, "year": year, "arxiv_id": arxid}
+    ) for paper_id, embedding, title, abstract, cats, year, arxid in batch
+])
+```
+For BGE-M3 query-time embedding on CPU (no GPU on Render free tier), use **`fastembed`** — Qdrant's ONNX-optimized library that runs BGE-M3 at ~100-300ms per query on CPU instead of full PyTorch.
+---
+## Phase 1 (Week 1): event logging + Qdrant Recommend API MVP
+This week ships a working "Recommended for You" section powered by Qdrant's native recommendation engine. No custom ML code needed.
+### Interaction event schema
+Every user action becomes a structured event. The schema links impressions back to searches via `query_id`, enabling CTR computation later:
+```python
+from enum import Enum
+from pydantic import BaseModel, Field
+from datetime import datetime
+import uuid
+class EventType(str, Enum):
+    SEARCH = "search"
+    IMPRESSION = "impression"
+    CLICK = "click"
+    DWELL = "dwell"
+    SAVE = "save"
+    UNSAVE = "unsave"
+    NOT_INTERESTED = "not_interested"
+    RECOMMEND_CLICK = "recommend_click"
+class InteractionEvent(BaseModel):
+    event_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
+    event_type: EventType
+    timestamp: datetime = Field(default_factory=datetime.utcnow)
+    user_id: str
+    session_id: str
+    paper_id: str | None = None
+    query_text: str | None = None
+    query_id: str | None = None
+    source: str | None = None          # "search", "recommendation", "similar"
+    position: int | None = None        # rank position in results (0-indexed)
+    dwell_time_ms: int | None = None
+# Signal weights for converting implicit feedback to Qdrant positive/negative
+SIGNAL_WEIGHTS = {
+    EventType.SAVE: 1.0,
+    EventType.CLICK: 0.3,
+    EventType.NOT_INTERESTED: -1.0,
+    EventType.IMPRESSION: 0.0,
+}
+```
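The schema above defines the weights but not how they become positive and negative ID lists. One possible conversion, sketched under the assumption that weights are summed per paper and that the sign of the total decides the list (the helper name and the summation rule are both hypothetical):

```python
from collections import defaultdict

# Mirrors SIGNAL_WEIGHTS above, keyed by the enum's string values.
WEIGHTS = {"save": 1.0, "click": 0.3, "not_interested": -1.0, "impression": 0.0}

def split_by_affinity(events: list[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """events: (paper_id, event_type) pairs for one user. Returns (positive_ids, negative_ids)."""
    affinity: dict[str, float] = defaultdict(float)
    for paper_id, event_type in events:
        affinity[paper_id] += WEIGHTS.get(event_type, 0.0)
    positive = [p for p, a in affinity.items() if a > 0]
    negative = [p for p, a in affinity.items() if a < 0]
    return positive, negative

events = [
    ("p1", "click"), ("p1", "save"),
    ("p2", "click"), ("p2", "not_interested"),
    ("p3", "impression"),
]
positive, negative = split_by_affinity(events)
```

Papers whose total is exactly zero (impressions only) land in neither list, so they remain eligible for recommendation.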
+### SQLite schema
+```sql
+PRAGMA journal_mode = WAL;
+PRAGMA synchronous = NORMAL;
+CREATE TABLE IF NOT EXISTS events (
+    event_id    TEXT PRIMARY KEY,
+    event_type  TEXT NOT NULL,
+    timestamp   TEXT NOT NULL,
+    user_id     TEXT NOT NULL,
+    session_id  TEXT NOT NULL,
+    paper_id    TEXT,
+    query_text  TEXT,
+    query_id    TEXT,
+    source      TEXT,
+    position    INTEGER,
+    dwell_time_ms INTEGER
+);
+CREATE INDEX IF NOT EXISTS idx_events_user_ts ON events(user_id, timestamp DESC);
+CREATE INDEX IF NOT EXISTS idx_events_paper ON events(paper_id, event_type) WHERE paper_id IS NOT NULL;
+-- Materialized affinity: aggregated user-paper signal
+CREATE TABLE IF NOT EXISTS user_paper_affinity (
+    user_id    TEXT NOT NULL,
+    paper_id   TEXT NOT NULL,
+    affinity   REAL NOT NULL,
+    last_event TEXT NOT NULL,
+    PRIMARY KEY (user_id, paper_id)
+);
+```
+### Wiring up Qdrant's Recommend API
+**Critical**: As of `qdrant-client` v1.10+, the legacy `recommend()` method is deprecated in favor of `query_points()` with `RecommendQuery`. In v1.13+ it's removed entirely. Use the modern API:
+```python
+from qdrant_client import AsyncQdrantClient, models
+async def get_recommendations(
+    client: AsyncQdrantClient,
+    positive_ids: list[int],
+    negative_ids: list[int],
+    limit: int = 20,
+) -> list:
+    """Core recommendation call using Qdrant's native Recommend API."""
+    results = await client.query_points(
+        collection_name="papers",
+        query=models.RecommendQuery(
+            recommend=models.RecommendInput(
+                positive=positive_ids,
+                negative=negative_ids,
+                strategy=models.RecommendStrategy.BEST_SCORE,
+            ),
+        ),
+        using="dense",  # BGE-M3 dense vector name
+        limit=limit,
+        with_payload=True,
+        score_threshold=0.5,
+    )
+    return results.points
+```
+**How `BEST_SCORE` works internally**: At each HNSW graph traversal step, Qdrant computes distance to every positive and every negative example separately. Points closer to any negative than any positive get penalized (score ≤ 0). This produces more diverse results than `AVERAGE_VECTOR` when you have multiple positive/negative examples.
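As a rough illustration of that scoring rule, the per-candidate score can be written as below. This follows the formula given in Qdrant's documentation as best understood here; the function name and inputs are hypothetical, and the real computation happens inside Qdrant during graph traversal, not in client code.

```python
def best_score(sims_to_positives: list[float], sims_to_negatives: list[float]) -> float:
    """Score one candidate from its similarities to each positive and negative example.

    If the closest positive beats the closest negative, the score is that positive
    similarity; otherwise the candidate is penalized with minus the squared
    negative similarity.
    """
    best_pos = max(sims_to_positives)
    best_neg = max(sims_to_negatives, default=float("-inf"))
    if best_pos > best_neg:
        return best_pos
    return -(best_neg * best_neg)

near_a_saved_paper = best_score([0.8, 0.3], [0.5])     # closest example is a positive
near_a_dismissed_paper = best_score([0.4], [0.7])      # closest example is a negative
```

Each candidate is judged by its single nearest example, so one saved paper in a niche topic can surface neighbors without being averaged away by the rest of the history.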
+For **hybrid recommendation** (dense retrieve → sparse rerank), use Qdrant's prefetch:
+```python
+results = await client.query_points(
+    collection_name="papers",
+    prefetch=[
+        models.Prefetch(
+            query=models.RecommendQuery(
+                recommend=models.RecommendInput(
+                    positive=positive_ids, negative=negative_ids,
+                    strategy=models.RecommendStrategy.BEST_SCORE,
+                ),
+            ),
+            using="dense",
+            limit=100,
+        ),
+    ],
+    query=models.RecommendQuery(
+        recommend=models.RecommendInput(positive=positive_ids, negative=negative_ids),
+    ),
+    using="sparse",
+    limit=20,
+    with_payload=True,
+)
+```
+### Hot-path user state: in-memory cache + SQLite write-behind
+**No Redis.** A Python dict cache for the hot path (recent positive/negative IDs), with SQLite as the durable backend:
+```python
+from dataclasses import dataclass, field
+@dataclass
+class UserState:
+    positive_ids: list[int] = field(default_factory=list)
+    negative_ids: list[int] = field(default_factory=list)
+    def add_positive(self, paper_id: int, max_size: int = 50):
+        if paper_id in self.negative_ids:
+            self.negative_ids.remove(paper_id)
+        if paper_id not in self.positive_ids:
+            self.positive_ids.insert(0, paper_id)
+            self.positive_ids = self.positive_ids[:max_size]
+    def add_negative(self, paper_id: int, max_size: int = 20):
+        if paper_id in self.positive_ids:
+            self.positive_ids.remove(paper_id)
+        if paper_id not in self.negative_ids:
+            self.negative_ids.insert(0, paper_id)
+            self.negative_ids = self.negative_ids[:max_size]
+class UserStateCache:
+    def __init__(self, db_path: str = "interactions.db", max_users: int = 1000):
+        self._cache: dict[str, UserState] = {}
+        self._db_path = db_path
+    def get(self, user_id: str) -> UserState:
+        if user_id not in self._cache:
+            state = UserState()
+            # Load from SQLite user_paper_affinity table
+            pos, neg = self._load_from_db(user_id)
+            state.positive_ids = pos
+            state.negative_ids = neg
+            self._cache[user_id] = state
+        return self._cache[user_id]
+```
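`_load_from_db` is referenced above but not shown. A possible body, sketched as a standalone function that takes an open connection: it assumes positive affinity means "saved or clicked", negative affinity means "dismissed", and that `paper_id` (stored as TEXT) holds an integer Qdrant point ID. None of these conventions are confirmed by the schema itself.

```python
import sqlite3

def load_user_ids(
    conn: sqlite3.Connection, user_id: str, max_pos: int = 50, max_neg: int = 20
) -> tuple[list[int], list[int]]:
    """Return (positive_ids, negative_ids) for one user, most recent first."""
    rows = conn.execute(
        "SELECT paper_id, affinity FROM user_paper_affinity "
        "WHERE user_id = ? ORDER BY last_event DESC",
        (user_id,),
    ).fetchall()
    positive = [int(pid) for pid, aff in rows if aff > 0][:max_pos]
    negative = [int(pid) for pid, aff in rows if aff < 0][:max_neg]
    return positive, negative

# Demo against an in-memory database with the same table layout as above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_paper_affinity (user_id TEXT NOT NULL, paper_id TEXT NOT NULL, "
    "affinity REAL NOT NULL, last_event TEXT NOT NULL, PRIMARY KEY (user_id, paper_id))"
)
conn.executemany(
    "INSERT INTO user_paper_affinity VALUES (?, ?, ?, ?)",
    [
        ("u1", "101", 1.0, "2025-01-03T00:00:00"),
        ("u1", "102", -1.0, "2025-01-02T00:00:00"),
        ("u1", "103", 0.3, "2025-01-01T00:00:00"),
    ],
)
pos, neg = load_user_ids(conn, "u1")
```

The `max_pos` and `max_neg` caps match the defaults used by `add_positive` and `add_negative`, so a cold cache load yields the same list sizes as the hot path.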
+### FastAPI endpoint tying it all together
+```python
+@app.post("/api/papers/{paper_id}/save", response_class=HTMLResponse)
+async def save_paper(paper_id: int, request: Request, user_id: str = Depends(get_user_id)):
+    # 1. Update hot cache
+    state = user_cache.get(user_id)
+    state.add_positive(paper_id)
+    # 2. Log event (async write-behind)
+    await log_event(InteractionEvent(
+        event_type=EventType.SAVE, user_id=user_id,
+        session_id=get_session_id(request), paper_id=str(paper_id), source="search"
+    ))
+    # 3. Return swapped HTML (HTMX partial)
+    return '<button class="btn btn-success btn-sm" disabled>✓ Saved</button>'
+@app.get("/api/recommendations", response_class=HTMLResponse)
+async def get_recs(request: Request, user_id: str = Depends(get_user_id)):
+    state = user_cache.get(user_id)
+    if not state.positive_ids:
+        return templates.TemplateResponse("partials/empty_recs.html", {"request": request})
+    papers = await get_recommendations(qdrant, state.positive_ids, state.negative_ids)
+    return templates.TemplateResponse("partials/recommendations.html", {
+        "request": request, "papers": papers
+    })
+```
+### UI changes needed
+Add three elements to your existing UI: a save button and not-interested button on every paper card, plus a "Recommended for You" section that loads via HTMX on page load:
+```html
+<!-- Recommendations section: loads on page load, refreshes after any save/dismiss -->
+<div id="rec-section"
+     hx-get="/api/recommendations"
+     hx-trigger="load, paperInteraction from:body"
+     hx-swap="innerHTML">
+    Loading recommendations...
+</div>
+```
+### Claude Code workflow for Phase 1
+Start by initializing the project, then scaffold in sequence:
+```
+# 1. Init
+/init
+# Edit CLAUDE.md with the template from the "Claude Code workflows" section below
+# 2. Scaffold FastAPI + event schema
+"Create the FastAPI application skeleton following the project structure in CLAUDE.md.
+Implement the Pydantic event models in app/events/schema.py with EventType enum
+(search, impression, click, dwell, save, unsave, not_interested, recommend_click).
+Set up SQLite with WAL mode in app/events/logger.py. Create the events table
+and user_paper_affinity table with proper indexes."
+# 3. Port search logic
+"Read the existing search code [paste key function signatures].
+Implement app/search/hybrid.py wrapping the existing BGE-M3 + RRF fusion logic.
+Use AsyncQdrantClient for Qdrant and create async wrappers for Zilliz.
+Keep the RRF fusion algorithm identical to the existing implementation."
+# 4. Build recommendation endpoint
+"Implement app/recommend/engine.py using Qdrant's query_points() with
+RecommendQuery and BEST_SCORE strategy. Wire it to the UserStateCache.
+Create GET /api/recommendations returning HTMX partial HTML."
+# 5. HTMX templates
+"Create Jinja2 templates with HTMX for: base layout with TailwindCSS/DaisyUI CDN,
+search page with live results, paper card partial with save/not-interested buttons,
+recommendations partial that loads on page load and refreshes after interactions."
+```
+**Estimated effort for Phase 1**: 3–4 days with Claude Code. Day 1: project scaffold + event logging. Day 2: port search logic. Day 3: Qdrant recommend wiring + HTMX UI. Day 4: testing + deploy to Render.
+---
+## Phase 2 (Week 2): user profile embeddings + multi-interest clustering
+Phase 1 passes raw paper IDs to Qdrant's Recommend API. Phase 2 upgrades to computed user profile vectors with recency decay, enabling richer personalization and multi-interest modeling.
+### Weighted user profile vector with recency decay
+Two weights combine multiplicatively, **interaction type** (save=1.0, click=0.5, not_interested=−0.8) and **recency decay** (half-life of 7 days), and the resulting weighted average is then **L2-normalized** because BGE-M3 operates in cosine space:
+```python
+import numpy as np
+from datetime import datetime
+EMBEDDING_DIM = 1024  # BGE-M3
+INTERACTION_WEIGHTS = {
+    "save": 1.0, "click": 0.5, "search_click": 0.3,
+    "view": 0.1, "not_interested": -0.8,
+}
+def compute_recency_weight(interaction_time: datetime, now: datetime, half_life_days: float = 7.0) -> float:
+    """w = 2^(-Δt/half_life). At half_life=7 days, a week-old interaction has weight 0.5."""
+    delta_days = (now - interaction_time).total_seconds() / 86400.0
+    return 2.0 ** (-delta_days / half_life_days)
+def compute_weighted_profile(interactions: list, now: datetime = None, half_life_days: float = 7.0):
+    """Weighted average of paper embeddings. Final weight = type_weight × recency_weight."""
+    if now is None:
+        now = datetime.utcnow()
+    interactions = sorted(interactions, key=lambda x: x.timestamp, reverse=True)[:200]
+    weighted_sum = np.zeros(EMBEDDING_DIM, dtype=np.float64)
+    total_abs_weight = 0.0
+    for ix in interactions:
+        tw = INTERACTION_WEIGHTS.get(ix.interaction_type, 0.1)
+        rw = compute_recency_weight(ix.timestamp, now, half_life_days)
+        w = tw * rw
+        weighted_sum += w * ix.embedding
+        total_abs_weight += abs(w)
+    if total_abs_weight < 1e-10:
+        return None
+    profile = weighted_sum / total_abs_weight
+    norm = np.linalg.norm(profile)
+    return (profile / norm).astype(np.float32) if norm > 1e-10 else None
+```
+This approach is validated at scale: **LinkedIn** uses geometrically-decaying averages of job embeddings for activity features and found it significantly outperformed unweighted averaging. **Pento** (Qdrant case study, 2025) uses exponential decay `S_c = Σ w_i * e^(-λ * Δt_i)` with λ=0.01.
+### Multi-interest clustering with K-means
+A single profile vector blurs distinct interests. If a user reads both NLP and computer vision papers, their centroid falls in a meaningless region between the two. **PinnerSage** (Pinterest, KDD 2020) demonstrated this problem — merging 3 unrelated pin embeddings yielded a centroid representing "energy boosting breakfast." The fix: cluster interaction embeddings into **K=3 interest vectors**, query Qdrant with each:
+```python
+from dataclasses import dataclass
+from datetime import datetime
+import numpy as np
+from sklearn.cluster import KMeans
+@dataclass
+class InterestCluster:
+    centroid: np.ndarray
+    medoid_paper_id: str
+    paper_ids: list[str]
+    importance_score: float
+def cluster_user_interests(interactions: list, k: int = 3, min_interactions: int = 5):
+    positive = [i for i in interactions if INTERACTION_WEIGHTS.get(i.interaction_type, 0) > 0]
+    if not positive:
+        return []  # no positive signals: nothing to cluster, and positive[0] below would raise IndexError
+    if len(positive) < min_interactions:
+        # Too few signals to cluster: fall back to a single weighted-profile "cluster"
+        profile = compute_weighted_profile(interactions)
+        return [InterestCluster(centroid=profile, medoid_paper_id=positive[0].paper_id,
+                paper_ids=[i.paper_id for i in positive], importance_score=1.0)] if profile is not None else []
+    effective_k = min(k, len(positive))
+    embeddings = np.stack([i.embedding for i in positive])
+    kmeans = KMeans(n_clusters=effective_k, n_init=10, random_state=42)
+    labels = kmeans.fit_predict(embeddings)
+    clusters = []
+    now = datetime.utcnow()
+    for cidx in range(effective_k):
+        mask = labels == cidx
+        cluster_ixs = [positive[i] for i in range(len(positive)) if mask[i]]
+        cluster_embs = embeddings[mask]
+        centroid = kmeans.cluster_centers_[cidx]
+        centroid_norm = centroid / (np.linalg.norm(centroid) + 1e-10)
+        # Medoid: actual paper closest to centroid (avoids meaningless centroids)
+        midx = np.argmin(np.linalg.norm(cluster_embs - centroid, axis=1))
+        importance = sum(
+            INTERACTION_WEIGHTS.get(i.interaction_type, 0.1) * compute_recency_weight(i.timestamp, now)
+            for i in cluster_ixs
+        )
+        clusters.append(InterestCluster(
+            centroid=centroid_norm.astype(np.float32),
+            medoid_paper_id=cluster_ixs[midx].paper_id,
+            paper_ids=[i.paper_id for i in cluster_ixs],
+            importance_score=importance,
+        ))
+    return sorted(clusters, key=lambda c: c.importance_score, reverse=True)
+```
+Query Qdrant with each cluster centroid, merge by weighted score:
+```python
+from qdrant_client import models
+async def recommend_multi_interest(client, clusters, seen_ids, per_cluster=10, final_k=20):
+    candidates = {}
+    for cluster in clusters:
+        results = await client.query_points(
+            collection_name="papers",
+            query=cluster.centroid.tolist(),
+            using="dense",
+            query_filter=models.Filter(must_not=[
+                models.HasIdCondition(has_id=seen_ids)
+            ]),
+            limit=per_cluster, with_payload=True,
+        )
+        for r in results.points:
+            ws = r.score * cluster.importance_score
+            pid = r.id
+            if pid not in candidates or ws > candidates[pid]["score"]:
+                candidates[pid] = {"id": pid, "score": ws, "payload": r.payload}
+    return sorted(candidates.values(), key=lambda x: x["score"], reverse=True)[:final_k]
+```
+### Incremental real-time profile updates
+Full recomputation on every interaction is expensive. Use **EWMA (Exponentially Weighted Moving Average)** for real-time updates, with periodic full recomputation every 50 interactions to correct drift:
+```python
+def incremental_update(profile: np.ndarray, new_embedding: np.ndarray,
+                       interaction_type: str, count: int, alpha_base: float = 0.1):
+    """EWMA update: μ_new = (1-α)·μ_old + α·sign·x_new"""
+    tw = INTERACTION_WEIGHTS.get(interaction_type, 0.1)
+    experience_factor = min(1.0, 10.0 / (count + 10))  # stabilizes with experience
+    alpha = np.clip(alpha_base * abs(tw) * experience_factor, 0.01, 0.5)
+    sign = 1.0 if tw > 0 else -1.0
+    updated = (1.0 - alpha) * profile + alpha * sign * new_embedding
+    norm = np.linalg.norm(updated)
+    return (updated / norm).astype(np.float32) if norm > 1e-10 else profile
+```
+### Storage: binary numpy in SQLite
+Profile vectors are **4,096 bytes** (1024 × float32). Store as binary blobs in SQLite — no need for a separate vector DB for user profiles at this scale:
+```python
+def store_profile(conn, user_id: str, profile: np.ndarray):
+    conn.execute(
+        "INSERT OR REPLACE INTO user_profiles (user_id, vector, updated_at) VALUES (?, ?, ?)",
+        (user_id, profile.astype(np.float32).tobytes(), datetime.utcnow().isoformat())
+    )
+    conn.commit()
+def load_profile(conn, user_id: str) -> np.ndarray | None:
+    row = conn.execute("SELECT vector FROM user_profiles WHERE user_id=?", (user_id,)).fetchone()
+    return np.frombuffer(row[0], dtype=np.float32).copy() if row else None
+```
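The `store_profile` / `load_profile` pair above presupposes a `user_profiles` table. A minimal round-trip sketch follows. The column names are taken from the SQL statements above, but the column types, the primary key, and the `roundtrip_profile` helper are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone

import numpy as np

# Hypothetical schema: column names match the INSERT/SELECT statements above;
# the types and the primary key are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS user_profiles (
    user_id    TEXT PRIMARY KEY,
    vector     BLOB NOT NULL,
    updated_at TEXT NOT NULL
)
"""

def roundtrip_profile(profile: np.ndarray, user_id: str = "u1") -> np.ndarray:
    """Write a profile to an in-memory SQLite DB and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    blob = profile.astype(np.float32).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO user_profiles (user_id, vector, updated_at) VALUES (?, ?, ?)",
        (user_id, blob, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    row = conn.execute(
        "SELECT vector FROM user_profiles WHERE user_id=?", (user_id,)
    ).fetchone()
    conn.close()
    # frombuffer returns a read-only view over the bytes; copy() makes it writable.
    return np.frombuffer(row[0], dtype=np.float32).copy()

rng = np.random.default_rng(0)
original = rng.standard_normal(1024).astype(np.float32)
restored = roundtrip_profile(original)
```

The primary key on `user_id` is what lets `INSERT OR REPLACE` overwrite an existing profile instead of appending a second row.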
+### Claude Code workflow for Phase 2
+```
+# 1. Profile vector computation
+"Implement app/recommend/profiles.py with compute_weighted_profile() using recency
+decay (half_life=7 days) and interaction type weights. Include incremental_update()
+using EWMA. Store profiles as binary numpy in SQLite. Add unit tests that verify
+the profile vector is L2-normalized and that negative interactions push the vector
+away from the paper embedding."
+# 2. Multi-interest clustering (HUMAN REVIEW the math)
+"Implement app/recommend/clustering.py with cluster_user_interests() using
+sklearn KMeans. K=3 fixed, fallback to single centroid when <5 interactions.
+Use medoids not centroids for Qdrant queries. Weight cluster importance by
+recency-decayed interaction signals. I will review the clustering logic manually."
+# 3. Wire multi-interest to recommendation endpoint
+"Update GET /api/recommendations to: load user interactions from SQLite,
+compute clusters, query Qdrant with each cluster centroid weighted by importance,
+merge results, deduplicate against seen papers, return top-20."
+```
+**Estimated effort for Phase 2**: 3–4 days. Day 1: profile vector computation + storage. Day 2: clustering implementation. Day 3: wire to endpoint + session-based updates. Day 4: testing.
+---
+## Phase 3 (Week 3): LLM-augmented personalization with Claude
+This phase adds two capabilities: Claude-generated interest summaries (semantic profile vectors) and Claude as a listwise reranker. Combined with BGE-reranker-v2 cross-encoder scoring, this creates a three-stage retrieval pipeline.
+### Claude API for interest summary generation
+Use `claude-sonnet-4-20250514` to analyze a user's reading history and produce a structured interest profile. Then embed that profile with BGE-M3 for vector search:
+```python
+import json
+import anthropic
+client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var
+INTEREST_PROMPT = """Analyze these academic papers a researcher recently read.
+Generate a research interest profile.
+## Papers
+{papers}
+## Instructions
+Return JSON with:
+- primary_areas: 2-4 dominant research themes
+- methodological_interests: techniques/algorithms they favor
+- application_domains: real-world areas of interest
+- synthesis_statement: 2-3 sentence dense semantic description of this
+  researcher's identity, suitable for embedding-based matching.
+Respond with valid JSON only."""
+def generate_interest_summary(reading_history: list[dict]) -> dict:
+    papers_text = "\n".join(
+        f"[{i}] {p['title']}\n    {p['abstract'][:200]}"
+        for i, p in enumerate(reading_history, 1)
+    )
+    message = client.messages.create(
+        model="claude-sonnet-4-20250514",
+        max_tokens=1024,
+        messages=[{"role": "user", "content": INTEREST_PROMPT.format(papers=papers_text)}]
+    )
+    return json.loads(message.content[0].text)
+```
+Embed the synthesis statement with BGE-M3 to create a **semantic profile vector** — a single vector capturing the user's research identity in natural language:
+```python
+from FlagEmbedding import BGEM3FlagModel
+embedding_model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
+def create_semantic_profile(summary: dict) -> np.ndarray:
+    text = (f"Research interests: {', '.join(summary['primary_areas'])}. "
+            f"Methods: {', '.join(summary['methodological_interests'])}. "
+            f"{summary['synthesis_statement']}")
+    output = embedding_model.encode([text], return_dense=True)
+    return output['dense_vecs'][0]  # 1024-dim
+```
+**Cost at scale**: For 10,000 users updated weekly via Batch API (50% discount): ~$67.50/week. For a solo developer with <100 users, this costs pennies.
+### BGE-reranker-v2 cross-encoder: the workhorse reranker
+**`BAAI/bge-reranker-v2-m3`** (568M params, XLM-RoBERTa architecture) runs on CPU at ~130ms per 16-pair batch, which is acceptable for reranking 20–30 candidates per query. It scores **0.74 NDCG@10** on the BEIR benchmark, is deterministic, and has zero API cost:
+```python
+from FlagEmbedding import FlagReranker
+reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=False)  # CPU
+def rerank_bge(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
+    pairs = [[query, f"{c['title']}. {c['abstract']}"] for c in candidates]
+    scores = reranker.compute_score(pairs, batch_size=16, normalize=True)
+    scored = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
+    return [dict(**c, rerank_score=round(s, 4)) for s, c in scored[:top_k]]
+```
+### Claude as a listwise reranker: the precision layer
+For the final top-10 candidates, Claude applies cross-document reasoning with custom scoring criteria. This is the highest-quality reranking step — **0.78 NDCG@10** (listwise) versus 0.74 for cross-encoders — but costs ~$0.02 per query and adds 200–500ms latency:
+```python
+RERANK_PROMPT = """You are an academic paper recommendation engine. Rank these
+papers by relevance to the researcher's profile.
+## Researcher Profile
+{user_profile}
+## Candidate Papers
+{candidates}
+## Scoring Criteria (0-10)
+- Topical Relevance (40%): topic match to primary interests
+- Methodological Fit (25%): methods alignment
+- Novelty (20%): new ideas they'd find valuable
+- Recency (15%): prefer cutting-edge work
+Return JSON array sorted by score descending:
+[{{"paper_id": "...", "score": 9.2, "reason": "brief justification"}}]
+Only include papers scoring >= 5.0."""
+async_client = anthropic.AsyncAnthropic()  # async client, so the API call does not block the event loop
+async def claude_rerank(user_profile: str, candidates: list[dict]) -> list[dict]:
+    cands_text = "\n".join(f"[{c['id']}] {c['title']}\n  {c['abstract'][:300]}" for c in candidates)
+    message = await async_client.messages.create(
+        model="claude-sonnet-4-20250514",
+        max_tokens=2048,
+        messages=[{"role": "user", "content": RERANK_PROMPT.format(
+            user_profile=user_profile, candidates=cands_text
+        )}]
+    )
+    return json.loads(message.content[0].text)
+```
+### The complete three-stage pipeline
+```
+Qdrant multi-interest retrieve (top-50)  →  5ms
+    ↓
+BGE-reranker-v2 cross-encoder (top-10)   →  ~130ms CPU
+    ↓
+Claude listwise rerank (final order)     →  ~400ms  [optional, for high-value]
+```
+**Use Claude reranking selectively**: on the "Recommended for You" homepage (computed async, cached), not on every search query. For search, BGE-reranker-v2 alone is sufficient.
+### Prompt caching for cost reduction
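+Cache the system prompt + user profile across requests. Cache writes cost 1.25× base, but hits cost only **0.1× base ($0.30/MTok for Sonnet 4)**. With an 80% cache hit rate, the cached prefix costs about 0.33× base on average (0.8 × 0.1 + 0.2 × 1.25), so the input side of the ~$0.02/query reranking cost can fall by up to about two-thirds. Output tokens and the uncached candidate list are still billed at the normal rate.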
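The blended price of a cached prompt prefix can be worked out from the hit rate and the two multipliers quoted above. This is a minimal sketch, assuming every miss pays the cache-write price and every hit pays the cache-read price; the function name is illustrative.

```python
def cached_input_multiplier(hit_rate: float,
                            write_mult: float = 1.25,
                            read_mult: float = 0.1) -> float:
    """Average price multiplier on the cacheable prefix, relative to base input price.

    Assumes each miss is billed as a cache write and each hit as a cache read.
    Tokens outside the cached prefix (and all output tokens) are not covered.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * read_mult + (1.0 - hit_rate) * write_mult

multiplier_80 = cached_input_multiplier(0.8)
```

The multiplier applies only to the cacheable prefix, so the overall per-query saving depends on what share of the prompt that prefix represents.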
+### MCP integration for development
+```bash
+# Qdrant MCP: semantic memory for Claude Code during development
+# (mcp-server-qdrant reads its settings from environment variables)
+claude mcp add qdrant \
+  -e QDRANT_URL="https://your-cluster.qdrant.io" \
+  -e COLLECTION_NAME="papers" \
+  -- uvx mcp-server-qdrant
+# GitHub MCP: PR creation, code review
+claude mcp add github -- npx -y @modelcontextprotocol/server-github
+# SQLite MCP: query interaction database from Claude Code
+claude mcp add sqlite -- uvx mcp-server-sqlite --db-path ./interactions.db
+```
+**Context budget warning**: Each MCP server adds 5–20K tokens of tool definitions. Load only 2–3 at a time. Use `/clear` between phases.
+### Claude Code workflow for Phase 3
+```
+# 1. Interest summary generation (LET CLAUDE CODE WRITE THE BOILERPLATE)
+"Implement app/recommend/llm_profiles.py: generate_interest_summary() calls
+Claude claude-sonnet-4-20250514 with user reading history, parses JSON response,
+embeds synthesis_statement with BGE-M3. Add create_semantic_profile() that
+returns the 1024-dim vector. Include error handling for malformed JSON responses."
+# 2. BGE-reranker integration
+"Implement app/recommend/reranker.py with rerank_bge() using FlagReranker
+with bge-reranker-v2-m3. Accept query string + candidate list, return top-k
+with scores. Run on CPU (use_fp16=False). Add latency logging."
+# 3. Claude reranker (HUMAN REVIEW THE PROMPTS)
+"Add claude_rerank() to reranker.py. Accepts user profile text + top-10
+candidates. Uses the reranking prompt I'll provide. I will review and tune
+the prompt manually — just implement the API call, parsing, and error handling."
+# 4. Pipeline orchestration
+"Create app/recommend/pipeline.py that chains: multi-interest Qdrant retrieval
+(top-50) → BGE-reranker-v2 (top-10) → optional Claude reranking.
+Add a feature flag to enable/disable Claude reranking per request."
+```
+**Estimated effort for Phase 3**: 4–5 days. Day 1: interest summary generation. Day 2: BGE-reranker integration. Day 3: Claude reranking + pipeline orchestration. Day 4-5: prompt tuning (manual) + caching + testing.
+---
+## Phase 4 (ongoing): feedback loops, evaluation, and scaling
+### Online metrics to track from Day 1
+Instrument every recommendation response. **CTR target: 2–5%** for content recommendations. **Save rate** is the strongest signal — even 1% is meaningful:
+```python
+@dataclass
+class OnlineMetrics:
+    impressions: int = 0
+    clicks: int = 0
+    saves: int = 0
+    total_dwell_ms: int = 0
+    dwell_events: int = 0
+    @property
+    def ctr(self) -> float: return self.clicks / max(self.impressions, 1)
+    @property
+    def save_rate(self) -> float: return self.saves / max(self.impressions, 1)
+    @property
+    def avg_dwell_s(self) -> float:
+        return (self.total_dwell_ms / max(self.dwell_events, 1)) / 1000
+```
+### Offline evaluation: NDCG@10 and Hit Rate@10 with time-split
+Use **global temporal split** (GTS) at the 80th percentile timestamp. This avoids temporal leakage that plagues leave-one-out evaluation:
+```python
+def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
+    rels = np.array(relevances[:k], dtype=np.float64)
+    if len(rels) == 0: return 0.0
+    dcg = float(np.sum((2**rels - 1) / np.log2(np.arange(1, len(rels) + 1) + 1)))
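+    # Note: the ideal ordering is taken over the retrieved list only, so relevant
+    # items that were never retrieved do not lower the score.
+    ideal = sorted(relevances, reverse=True)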
+    ideal_rels = np.array(ideal[:k], dtype=np.float64)
+    idcg = float(np.sum((2**ideal_rels - 1) / np.log2(np.arange(1, len(ideal_rels) + 1) + 1)))
+    return dcg / idcg if idcg > 0 else 0.0
+def time_split_evaluate(interactions_df, predict_fn, split_q=0.8, k=10):
+    split_time = interactions_df['timestamp'].quantile(split_q)
+    train = interactions_df[interactions_df['timestamp'] <= split_time]
+    test = interactions_df[interactions_df['timestamp'] > split_time]
+    test_users = set(test['user_id'].unique()) & set(train['user_id'].unique())
+    ndcg_scores, hit_rates = [], []
+    for uid in test_users:
+        true_items = set(test[test['user_id'] == uid]['paper_id'])
+        predicted = predict_fn(uid)[:k]
+        rels = [1.0 if p in true_items else 0.0 for p in predicted]
+        ndcg_scores.append(ndcg_at_k(rels, k))
+        hit_rates.append(1.0 if any(r > 0 for r in rels) else 0.0)
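+    # Key names follow k; an empty test set yields 0.0 instead of NaN
+    return {f"ndcg@{k}": float(np.mean(ndcg_scores)) if ndcg_scores else 0.0,
+            f"hit_rate@{k}": float(np.mean(hit_rates)) if hit_rates else 0.0,
+            "n_users": len(test_users)}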
+```
+### MMR reranking for diversity (anti-filter-bubble)
+**Maximal Marginal Relevance** (Carbonell & Goldstein, 1998) balances relevance against redundancy. Start with **λ=0.7** (mostly relevance, slight diversity push):
+```python
+def mmr_rerank(query_emb: np.ndarray, candidates: list[dict],
+               lambda_param: float = 0.7, top_k: int = 10) -> list[dict]:
+    selected, remaining = [], list(range(len(candidates)))
+    relevance = np.array([np.dot(query_emb, c['embedding']) /
+        (np.linalg.norm(query_emb) * np.linalg.norm(c['embedding']) + 1e-9)
+        for c in candidates])
+    for _ in range(min(top_k, len(candidates))):
+        best_score, best_idx = -float('inf'), -1
+        for idx in remaining:
+            diversity_penalty = max(
+                (np.dot(candidates[idx]['embedding'], s['embedding']) /
+                 (np.linalg.norm(candidates[idx]['embedding']) * np.linalg.norm(s['embedding']) + 1e-9))
+                for s in selected
+            ) if selected else 0.0
+            score = lambda_param * relevance[idx] - (1 - lambda_param) * diversity_penalty
+            if score > best_score:
+                best_score, best_idx = score, idx
+        selected.append(candidates[best_idx])
+        remaining.remove(best_idx)
+    return selected
+```
+### Epsilon-greedy exploration
+Replace 10% of recommendation slots with random/diverse papers to discover new user preferences. Decay epsilon over time as the system learns:
+```python
+import random
+class EpsilonGreedy:
+    def __init__(self, epsilon=0.1, decay=0.999, min_epsilon=0.05):
+        self.epsilon = epsilon
+        self.decay = decay
+        self.min_epsilon = min_epsilon
+    def select(self, ranked: list, exploration_pool: list, k: int = 10) -> list:
+        results = []
+        ranked_it, explore_it = iter(ranked), iter(exploration_pool)
+        for _ in range(k):
+            if random.random() < self.epsilon:
+                item = next(explore_it, None)
+                if item:
+                    item['_source'] = 'exploration'
+                    results.append(item)
+                    continue
+            item = next(ranked_it, None)
+            if item:
+                item['_source'] = 'exploitation'
+                results.append(item)
+        self.epsilon = max(self.min_epsilon, self.epsilon * self.decay)
+        return results
+```
+### When to add collaborative filtering
+**LightFM becomes worthwhile at ≥500–1,000 active users** with ≥10 interactions each. Below this threshold, the user-item matrix is too sparse for meaningful latent factors. LightFM's hybrid mode (with item content features from BGE-M3) can work with ~200–300 users, but won't outperform your content-based approach until interaction density is sufficient. Don't build it until you see diminishing returns from content-based recommendations.
+### Component activation thresholds
+- **Content-based (Qdrant recommend)**: Day 1 — works with zero users
+- **MMR reranking**: When you have ≥50 papers — prevents near-duplicate recommendations
+- **Epsilon-greedy**: At launch — even 10 users benefit from exploration
+- **Offline eval (NDCG/HR)**: After 2 weeks of interaction data
+- **LLM reranking**: When you want quality uplift and can afford ~$0.02/query
+- **Collaborative filtering**: **≥500 active users** with ≥10 interactions each
+- **A/B testing**: ≥1,000 DAU for statistical power. Below that, use epsilon-greedy as a lightweight auto-optimizing A/B test
+---
+## Claude Code specific workflows
+### CLAUDE.md file
+This is the single most important file for Claude Code productivity. Keep it lean (~200–400 lines), because every line competes for context attention:
+````markdown
+# arxiv-recommender: Personalized Paper Discovery
+FastAPI + HTMX + Qdrant + BGE-M3 hybrid search with personalized recommendations.
+## Tech Stack
+- Python 3.11+, FastAPI, Qdrant Cloud (dense), Zilliz Cloud (sparse)
+- BGE-M3 embeddings (1024-dim dense), SQLite WAL for events
+- HTMX + Jinja2 + TailwindCSS/DaisyUI frontend
+## Commands
+```bash
+uvicorn app.main:app --reload --port 8000
+pytest tests/ -v --cov=app
+ruff check app/ && ruff format app/
+```
+## Code Conventions
+- Type hints on ALL functions. Pydantic models for API schemas.
+- Async handlers for all FastAPI endpoints. Use AsyncQdrantClient.
+- Embeddings always np.ndarray shape (1024,), L2-normalized.
+- Qdrant IDs are integers. User IDs are UUID strings.
+- All DB writes go through service layer, never in routes.
+## ⚠️ Human Review Required
+- app/recommend/engine.py — scoring/ranking logic
+- app/recommend/reranker.py — MMR lambda, prompt engineering
+- app/recommend/exploration.py — epsilon values
+- app/evaluation/metrics.py — NDCG implementation
+- Any Qdrant collection schema changes
+````
+### What Claude Code writes autonomously vs. needs human oversight
+**Let Claude Code write freely:**
+- FastAPI route scaffolding, Pydantic models, CRUD endpoints
+- SQLite schema, migration scripts, database helpers
+- HTMX templates, TailwindCSS styling
+- Test scaffolding (pytest fixtures, parameterized tests)
+- Configuration files (pyproject.toml, render.yaml, Dockerfile)
+- Event logging pipeline, error handling middleware
+- CLI utilities and data loading scripts
+**Require human review:**
+- Recommendation scoring and ranking logic (subtle math bugs)
+- Vector similarity calculations and normalization
+- MMR lambda parameter values and diversity tuning
+- Epsilon-greedy rates and decay schedules
+- NDCG/Hit Rate implementations (edge cases with empty results)
+- Qdrant collection HNSW parameters (m, ef_construct)
+- All Claude API prompts (interest summary, reranking)
+- Security: auth flows, API key handling
+### Key Claude Code operating principles
+Use **Plan Mode** (Shift+Tab) before implementing any complex feature. Claude Code reads files and proposes approaches without editing. Never exceed **60% context window** — use `/clear` between phases. Use the `#` key to inject one-off instructions during sessions.
+**Phase-specific prompting pattern:**
+```
+# Phase 1 session
+"Read app/search/hybrid.py. Understand the existing BGE-M3 + RRF fusion.
+Now implement app/recommend/engine.py that wraps Qdrant's query_points()
+with RecommendQuery using BEST_SCORE strategy. Wire positive/negative IDs
+from UserStateCache. Return results as Pydantic models."
+/clear  # Reset context between major features
+# Phase 2 session
+"Read app/recommend/engine.py and app/events/schema.py. Now implement
+app/recommend/profiles.py following the weighted profile vector approach:
+compute_weighted_profile() with half_life=7.0, interaction type weights,
+L2 normalization. Include incremental_update() using EWMA."
+```
+### MCP server configuration
+Load **at most 2–3 MCP servers** at a time to preserve context budget:
+```bash
+# Development: Qdrant + filesystem
+claude mcp add qdrant \
+  -e QDRANT_URL="https://your-cluster.qdrant.io" \
+  -e COLLECTION_NAME="papers" \
+  -- uvx mcp-server-qdrant
+claude mcp add filesystem -- npx -y @modelcontextprotocol/server-filesystem ./app
+# Database debugging sessions
+claude mcp add sqlite -- uvx mcp-server-sqlite \
+  --db-path ./interactions.db
+# Deployment/CI sessions
+claude mcp add github -- npx -y @modelcontextprotocol/server-github
+```
+Create isolated slash commands in `.claude/commands/` for MCP-heavy tasks to avoid polluting the main context.
+---
+## What NOT to build (over-engineering traps)
+**Do not build a custom embedding model.** BGE-M3 is SOTA for hybrid search. Fine-tuning it on arXiv data would take weeks and likely produce only marginal gains over the foundation model.
+**Do not build a separate frontend SPA.** React/Next.js adds a JavaScript build pipeline, CORS configuration, state management library, and doubles your deployment surface. HTMX does everything you need for this UI complexity.
+**Do not use Redis until you have concurrent server processes.** SQLite WAL mode handles 462K reads/sec. Redis adds operational overhead for zero benefit at this scale.
+**Do not implement collaborative filtering until ≥500 users.** The user-item matrix will be too sparse. Content-based with BGE-M3 vectors will outperform LightFM until you have interaction density.
+**Do not build a real-time streaming pipeline.** Batch computation of user profiles (every N interactions or daily cron) is sufficient. Apache Kafka, Flink, or similar infrastructure is months of work for marginal latency improvement.
+**Do not use Claude reranking on every search query.** Reserve it for cached recommendation feeds where the 400ms latency and ~$0.02/query cost are amortized. BGE-reranker-v2 handles search-time reranking at 130ms with zero API cost.
+---
+## Effort estimates and prioritization
+| Phase | Scope | Days with Claude Code | Priority |
+|-------|-------|-----------------------|----------|
+| **Phase 1** | Event logging + Qdrant Recommend MVP | **3–4 days** | P0 — ships working recs |
+| **Phase 2** | Profile vectors + multi-interest clustering | **3–4 days** | P1 — quality uplift |
+| **Phase 3** | Claude interest summaries + BGE reranker | **4–5 days** | P2 — premium quality |
+| **Phase 4** | Evaluation metrics + feedback loops | **2–3 days** | P1 — needed for iteration |
+| Migration | Kaggle → Render deployment | **1–2 days** | P0 — prerequisite |
+**Total: ~13–18 days of focused work.**
+**If time runs short, keep work in this priority order and cut from the bottom up:**
+1. ✅ Ship Phase 1 (Qdrant Recommend + event logging) — this alone is a functional recommender
+2. ✅ Ship Phase 4 offline eval (NDCG/HR) — you need metrics before optimizing
+3. ✅ Ship Phase 2 profile vectors — meaningful quality improvement
+4. ⚠️ Phase 3 Claude reranking is a luxury — BGE-reranker-v2 alone gets you 90% of the quality
+5. ⚠️ Phase 2 multi-interest clustering can be deferred — single weighted centroid works for most users with <50 interactions
+The Qdrant Recommend API with `BEST_SCORE` strategy is doing heavy lifting for free. Everything else is progressive enhancement. Ship Phase 1, measure with Phase 4 metrics, then decide whether Phase 2 or Phase 3 delivers more value for your specific users.

docs/research/05-Evolution-Of-Onboarding-And-Interests.md ADDED

	@@ -0,0 +1,45 @@

+# The Evolution of User Interest Tracking in ResearchIT
+This document records the architectural shift in how ResearchIT models user interests, capturing the original project vision and explaining the transition to the current production architecture.
+## 1. The Original Vision: Explicit Onboarding (Subject Vectors)
+**The initial thought process:**
+When a user opens the app for the very first time, they are greeted with an onboarding screen. They are given a list of overarching subjects (e.g., "Computer Vision", "Large Language Models", "Quantum Physics", "Biology").
+The user explicitly checks the boxes for the subjects they care about. The system would retrieve fixed "Subject Vectors" corresponding to these categories and use them to query the vector database and generate the user's feed.
+### Why this made sense early on:
+* **Cold Start Solution:** It immediately gives the system data to work with before the user has read a single paper.
+* **Simplicity:** It maps perfectly to how legacy apps operate (like setting up a news aggregator).
+## 2. The Problems with Explicit Subjects in Research
+As the architecture matured, we realized that fixed subject vectors break down in an advanced academic context:
+1. **Taxonomy Limitations:** Science moves faster than app menus. If a user selects "Reinforcement Learning," they miss out on emerging, unnamed sub-fields. Selecting predefined tags forces cutting-edge research into outdated buckets.
+2. **Granular Specificity:** Selecting "Computer Vision" returns broad results (facial recognition, autonomous driving). But a researcher might only care about a hyper-specific niche like "unsupervised anomaly detection in industrial CT scans." Fixed subject vectors cannot capture this micro-granularity.
+3. **Interest Drift:** A user might select "Robotics" during onboarding, but 6 months later, their thesis shifts to "Soft Materials." Relying on onboarding declarations means the app becomes stale unless the user constantly updates their settings.
+4. **The "Centroid-in-Nowhere" Problem:** If a user selects distinct subjects like "Astrophysics" and "Economics", averaging these subject vectors mathematically results in a point in embedding space that means nothing, returning irrelevant garbage.
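The "Centroid-in-Nowhere" effect can be illustrated with a toy example. The sketch below assumes two perfectly orthogonal subject vectors, which is a simplification: real subject embeddings are correlated, so the exact numbers will differ.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dim = 1024
astro = np.zeros(dim)
astro[0] = 1.0  # toy "Astrophysics" subject vector
econ = np.zeros(dim)
econ[1] = 1.0   # toy "Economics" subject vector

# Averaging the two subjects yields a vector that matches neither one well.
centroid = (astro + econ) / 2.0
sim_to_astro = cosine(centroid, astro)
sim_to_econ = cosine(centroid, econ)
```

In this toy setup the averaged vector has a cosine similarity of about 0.707 to each subject, compared with 1.0 when each subject vector is queried on its own.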
+## 3. The Pivot to Implicit Behavioral Tracking (PinnerSage)
+Instead of asking users what they want *once*, ResearchIT tracks what they *do* continuously. The architecture shifted from explicit, static vectors to implicit, dynamic embeddings driven by user interactions (saves and dismissals).
+### A. EWMA Profiles (Temporal Dynamics)
+Every time a user interacts with a paper, its 1024-dimensional BGE-M3 embedding is blended into their personal profile using an Exponentially Weighted Moving Average (EWMA):
+* **Long-Term Profile ($\alpha=0.1$):** Updates slowly. Captures the user's enduring research identity across many sessions.
+* **Short-Term Session Profile ($\alpha=0.4$):** Updates rapidly. Captures the immediate "rabbit hole" the user is delving into right now.
+* **Negative Profile ($\alpha=0.15$):** Captures the embeddings of papers the user explicitly dismisses, allowing the system to learn what they *don't* like.
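The three profiles above can be sketched as one small state object. Only the α values come from the list above. The zero initialization, the first-interaction rule, the L2 normalization, and the mapping of saves to the long-term and short-term profiles are assumptions, and the class and function names are illustrative.

```python
from dataclasses import dataclass, field

import numpy as np

DIM = 1024  # BGE-M3 dense embedding size

def _ewma(profile: np.ndarray, x: np.ndarray, alpha: float) -> np.ndarray:
    """Blend x into profile with weight alpha, then L2-normalize."""
    if not profile.any():  # first interaction: adopt the embedding as-is
        blended = x.astype(np.float64)
    else:
        blended = (1.0 - alpha) * profile + alpha * x
    norm = np.linalg.norm(blended)
    return blended / norm if norm > 1e-10 else profile

@dataclass
class UserProfiles:
    long_term: np.ndarray = field(default_factory=lambda: np.zeros(DIM))
    short_term: np.ndarray = field(default_factory=lambda: np.zeros(DIM))
    negative: np.ndarray = field(default_factory=lambda: np.zeros(DIM))

    def on_save(self, emb: np.ndarray) -> None:
        # Assumption: a save updates both positive profiles.
        self.long_term = _ewma(self.long_term, emb, alpha=0.1)
        self.short_term = _ewma(self.short_term, emb, alpha=0.4)

    def on_dismiss(self, emb: np.ndarray) -> None:
        self.negative = _ewma(self.negative, emb, alpha=0.15)

# Two saves on unrelated topics: the short-term profile moves further toward the second.
a = np.zeros(DIM)
a[0] = 1.0
b = np.zeros(DIM)
b[1] = 1.0
profiles = UserProfiles()
profiles.on_save(a)
profiles.on_save(b)
```

After the second save, the short-term profile carries much more of `b` than the long-term profile does, which is the intended difference between the two α values.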
+### B. Ward Clustering (Multi-Interest Routing)
+To solve the "Centroid-in-Nowhere" problem, ResearchIT employs Ward Hierarchical Clustering.
+If a user saves papers on Biology and papers on NLP, the system does not average them. It detects the split and forms two distinct clusters.
+The system extracts the **Medoid** (the exact, real paper closest to the center of the cluster) from each grouping.
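The clustering and medoid steps can be sketched in plain NumPy. This is a naive O(n³) illustration of Ward's merge criterion, not the project's implementation. The `max_merge_cost` stopping threshold is an assumption, since how the number of clusters is chosen is not stated here, and the function names are illustrative.

```python
import numpy as np

def ward_clusters(X: np.ndarray, max_merge_cost: float) -> list[list[int]]:
    """Naive agglomerative clustering with Ward's merge criterion.

    Repeatedly merges the pair of clusters whose merge increases the total
    within-cluster sum of squares the least, and stops once the cheapest
    available merge would cost more than max_merge_cost.
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                # Ward cost: increase in within-cluster sum of squares if a and b merge
                cost = (na * nb) / (na + nb) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        if cost > max_merge_cost:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def medoid_index(X: np.ndarray, members: list[int]) -> int:
    """Index of the member closest (Euclidean) to the cluster mean."""
    pts = X[members]
    dists = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    return members[int(np.argmin(dists))]

# Toy data: two tight groups standing in for two research interests.
X = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
              [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]])
clusters = ward_clusters(X, max_merge_cost=0.5)
medoids = sorted(medoid_index(X, c) for c in clusters)
```

Because each medoid is a real saved paper rather than an averaged point, querying with it always starts from a location in embedding space that corresponds to an actual document.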
+### C. Prefetch and RRF
+During a feed request, the system sends the Long-Term medoids AND the Short-Term vector to Qdrant simultaneously as separate `Prefetch` queries. The results are unified seamlessly using **Reciprocal Rank Fusion (RRF)**.
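The fusion step can be sketched independently of Qdrant. The function below implements the standard RRF score, the sum of 1/(k + rank) over all input lists, with K=60 as quoted in the commit message. The function name and the paper IDs are illustrative.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60,
             top_n: int = 10) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

long_term_hits = ["p1", "p2", "p3"]   # results for a long-term medoid query
short_term_hits = ["p3", "p1", "p4"]  # results for the short-term vector query
fused = rrf_fuse([long_term_hits, short_term_hits])
```

Papers that appear in both lists (`p1`, `p3`) rise above papers that appear in only one, regardless of the raw similarity scores behind each list.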
+## Conclusion
+By deprecating manual "Subject Vectors" in favor of EWMA temporal tracking and Ward Clustering, ResearchIT transitioned from a standard filter-based aggregator into an intelligent, adaptive discovery engine capable of understanding complex, multi-disciplinary, and evolving academic interests without any manual user inputs.

docs/research/06-Deep-Research-Verdict.md ADDED

	@@ -0,0 +1,97 @@

+# ResearchIT architecture: verdict, evidence, and phased plan
+**Bottom line up front.** The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete — the empirical record from every ablation I could find (Rashid 2002, Zhou 2011, Sun 2017, Nguyen 2024, Spotify 2025, Scholar Inbox 2025) says **item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero**, and onboarding cues remain a 4–37% lift even once behavioral data exists. The correct architecture for ResearchIT is a **three-layer hybrid**: a coarse arXiv-category multiselect for *filtering and prior*, seed-paper/ORCID import for the *initial behavioral profile*, and Ward clustering + medoid retrieval taking over as soon as the user crosses ~10 saved papers. Doc 05's "centroid-in-nowhere" argument is well-founded but overstated; the right takeaway is *never average across distant clusters*, not *never use subject vectors at all*. Doc 03's multi-medoid retrieval with RRF fusion has a real architectural fault — RRF lets the dominant cluster swamp minor interests, which is the exact failure mode Amin flagged — and should be replaced with **importance-weighted quota allocation with a minimum floor**, which is what Pinterest (PinnerSage "sample 3 medoids by importance"), Taobao (ULIM, RecSys 2025), and Pinterest again (Bucketized-ANN, SIGIR 2023) actually deploy. Below I resolve each contradiction across the five docs, flag the specific faults in Doc 03, give a worked example, and propose a restructured phase plan in which Phase 1 (zero-ML recommender) and Phase 2 (hybrid search) are repositioned as load-bearing infrastructure for the endgame rather than disposable scaffolding.
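One way to read "importance-weighted quota allocation with a minimum floor" is sketched below: every cluster first receives the floor, and the remaining slots are split in proportion to importance. The largest-remainder rounding and the function name are assumptions for illustration, not a description of what the cited systems actually do.

```python
def allocate_quotas(importances: list[float], total_slots: int,
                    min_floor: int = 1) -> list[int]:
    """Split total_slots across clusters in proportion to (non-negative)
    importance, guaranteeing each cluster at least min_floor slots."""
    n = len(importances)
    if n == 0:
        return []
    if min_floor * n > total_slots:
        raise ValueError("not enough slots to give every cluster the floor")
    remaining = total_slots - min_floor * n
    total = sum(importances)
    # Fall back to equal weights when no cluster carries any importance.
    weights = [w / total for w in importances] if total > 0 else [1.0 / n] * n
    shares = [remaining * w for w in weights]
    extra = [int(s) for s in shares]
    leftover = remaining - sum(extra)
    # Largest-remainder rounding: leftover slots go to the biggest fractional parts.
    order = sorted(range(n), key=lambda i: (-(shares[i] - extra[i]), i))
    for i in order[:leftover]:
        extra[i] += 1
    return [min_floor + e for e in extra]

# A dominant cluster and two minor interests sharing a 20-item feed.
quotas = allocate_quotas([8.0, 1.5, 0.5], total_slots=20, min_floor=2)
```

With the floor in place, the smallest cluster still receives a guaranteed share of the feed, whereas a purely rank-fused list could leave it with no slots at all.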
+## The core verdict: hybrid, with behavioral doing the heavy lifting
+Among the serious *persistent* academic recommenders, behavioral onboarding dominates: Semantic Scholar's Research Feeds require 5 library saves before recs appear, Google Scholar seeds from authored papers, arxiv-sanity-lite fits per-tag linear SVMs on TF-IDF once you start tagging, and Scholar Inbox (ACL 2025 Demos, Flicke et al.) — the closest direct analog to ResearchIT, also arXiv-scoped, also ~3M papers — uses a hybrid of author-name search, Scholar Maps for exploration, and active-learning rating prompts. **Zero of them use a fixed arXiv-category checkbox as the primary signal**. But zero of them are pure-behavioral either. Scholar Inbox explicitly frames its category-labeled map as *cold-start tooling*; ResearchGate uses free-text interests plus behavior; Spotify's onboarding artist-picker is seed-*items* not genres; Netflix's three-title onboarding is also items, not genres.
+The empirical pattern is consistent. Rashid et al. (IUI 2002, "Getting to Know You", IUI Most Impactful Paper 2020) showed popularity × entropy item selection beats both pure popularity (users know it, but it's not diagnostic) and pure entropy (diagnostic, but users don't recognize the item); McNee et al. (UM 2003) found user-chosen seeds produce **higher retention than system-chosen seeds despite taking longer** — a direct retention argument for letting researchers pick their own seed papers. Zhou, Yang, Zha (SIGIR 2011, Functional Matrix Factorization) showed adaptive interview trees beat static elicitation by several RMSE points; Nguyen et al. (arXiv:2406.00973, 2024) confirmed adaptive two-phase elicitation still beats Rashid-style static popularity×entropy on MovieLens/Netflix in 2024. Sanner, Balog, Radlinski et al. (RecSys 2023, arXiv:2307.14225) found LLM-consumed **language-based preferences** ("I study mechanistic interpretability of transformers") are competitive with item-based CF in near-cold-start — suggesting a free-text interests box is worth more than category checkboxes. Spotify's own 2025 ablation ("Generalized User Representations", atspotify.com, Sept 2025) reports nDCG@50 drops 37.1% on favorite-artist clusters when modality encoders are removed but demographic and onboarding features stay — i.e. **behavioral dominates, but onboarding still contributes 4–12% even when behavior is rich**.
+So Doc 05's pivot away from pure subject vectors was correct in direction but wrong in absoluteness. Fixed category centroids are a bad *primary* user model; they remain a useful *prior* for pool filtering, for cold-start days 1–3, and as categorical features for LightGBM reranking. The ISLE system (arXiv:2512.12760, 2025) — 1.73M arXiv+OpenAlex papers, BM25+MiniLM+RRF, uses OpenAlex topic tags as filters — is the most direct production-scale validation of "hybrid with taxonomy as filter, embeddings as signal."
+## Is the centroid-in-nowhere argument actually correct?
+Partially. The canonical demonstration is PinnerSage Figure 2 (Pal et al., KDD 2020): averaging embeddings for painting + shoes + sci-fi produces a vector closest to "energy-boosting breakfast," a topic related to none of the inputs. PinnerSage Table 1 quantifies this: on cosine ≥ 0.8 next-action prediction, a single decay-averaged vector achieves +25% over last-pin baseline, while an oracle 3-cluster representation achieves +98% — a ~4× gap that almost exactly matches Doc 05's folklore. The Weller et al. 2025 LIMIT benchmark (arXiv:2508.21038, Google DeepMind) formalizes this as a theoretical limit: a single dense vector cannot represent arbitrary top-k subsets, and state-of-the-art embedders including Gemini-Embedding and NV-Embed fail on constructed multi-topic queries. MIND/ComiRec/SINE/kNN-Embed ablations on Amazon Books and Taobao consistently show **15–30% Recall gains from multi-vector over single-averaged-vector**, with diminishing returns after K≈4.
+**But the argument is overstated when applied generically**. Wieczorek et al. ("On the Unreasonable Effectiveness of Centroids in Image Retrieval", arXiv:2104.13643) show centroids work fine for *within-class* averaging (ReID, fashion). ColBERT token-pooling experiments (Clavié/Answer.AI, 2024) show pooling *similar* tokens at factor 2 retains 100.6% of retrieval performance, only dropping to 97% at factor 4. Brokos et al. (BioNLP 2016) used centroid-of-word-embeddings successfully for single-topic biomedical QA. The correct nuanced statement is: **averaging across distant semantic clusters drifts to unrelated regions; averaging within a coherent cluster is approximately lossless, sometimes beneficial**. This is exactly why PinnerSage clusters *first*, then uses medoids within clusters — the within-cluster representation collapse is fine, the across-cluster collapse is what kills you.
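A toy illustration of the within-cluster versus across-cluster distinction, assuming hand-made 3-d vectors rather than real BGE-M3 embeddings. The vectors and names below are hypothetical; the sketch only shows that a mean over distant directions sits far from every member, while a mean over a tight cluster stays close to all of them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Three mutually distant "interests" (orthogonal toy vectors).
distant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Three papers inside one coherent cluster (small perturbations of one direction).
coherent = [[1.0, 0.05, 0.0], [1.0, 0.0, 0.05], [1.0, -0.05, 0.0]]

across = mean_vector(distant)
within = mean_vector(coherent)

# Worst-case similarity between each mean and the vectors it summarizes.
worst_across = min(cosine(across, v) for v in distant)
worst_within = min(cosine(within, v) for v in coherent)
print(f"across-cluster mean, worst cosine to a member: {worst_across:.3f}")
print(f"within-cluster mean, worst cosine to a member: {worst_within:.3f}")
```

With real embeddings the numbers will differ; the point is only the qualitative gap between the two cases.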
+Practical implication: arXiv category tags (cs.CV, cs.CL, stat.ML) as features, filters, or auxiliary signals are empirically safe and validated (ISLE 2025; BioWordVec for MeSH in Scientific Data 2019; MedGraph 2022 showing hybrid > pure-embedding in biomedical IR). Using them as *primary user vectors to be averaged* is what's broken — which is exactly the thing Doc 01/02/04 proposed and Doc 05 correctly killed.
+## Pinterest is moving away from Ward + medoids, but their reasons don't apply to you
+Pinterest's lineage is well-documented through six papers: PinnerSage (KDD 2020, multi-embedding via Ward) → PinnerFormer/PinnerSAGE V3 (KDD 2022, **single 256-d embedding** via transformer + Dense All Action loss) → TransAct (KDD 2023, transformer on last 100 actions *inside* the ranker) → TransAct V2 (CIKM 2025, lifelong 16K-action sequences) → PinRec (arXiv:2504.10507, 2025, generative GPT-2-style retrieval) → PinFM (RecSys 2025, 20B-parameter foundation model with Deduplicated Cross-Attention Transformer). **Yes, Pinterest moved from multi-medoid retrieval to a single dense user vector**, and PinnerFormer beats PinnerSage-oracle R@10 0.229 vs 0.046 with a fraction of the storage. But **the reason is infrastructure, not quality**: PinnerFormer paper explicitly states multi-embedding scales poorly as a *downstream feature* — "storing 20+ 256-d float16 embeddings per row across billions of training rows is expensive" — because Pinterest has tens of ranking models consuming the embedding. ResearchIT has one ranking model and ~thousands of users; the multi-embedding storage cost is trivial. **Pinterest's motivation to abandon Ward+medoids does not apply to you**; PinnerSage's architecture remains the right reference design for a CPU-only, small-scale, multi-interest system.
+PinnerSage's exact specification, which Doc 03 loosely follows, is: Ward hierarchical agglomerative clustering on pre-trained (frozen) PinSage pin embeddings over the last 90 days; medoid = arg min over cluster members of Σ ‖p_m − p_j‖²; cluster importance = Σ e^{−λ(T_now − T_i)} with **λ = 0.01** (λ = 0 pure frequency underperforms, λ = 0.1 too recent also underperforms); daily MapReduce batch + lightweight online inference on last 20 actions; **light user: 3–5 clusters, heavy user: 75–100 clusters** (no fixed K, dendrogram cut); at serve time sample **3 medoids proportional to importance**; candidates from each medoid flow to the downstream ranker (no RRF). This gives you exact numerical anchors: Doc 03's α_long=0.10 corresponds roughly to PinnerSage's λ=0.1 which they *rejected* as too recent-biased; if you're matching PinnerSage empirically, **α_long should be closer to 0.01–0.03, not 0.10**. This is a concrete parameter correction from the literature.
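The medoid and importance definitions quoted above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not PinnerSage's code: the function names are made up, timestamps are taken to be in days, and the 2-d points stand in for real embeddings.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def medoid_index(members):
    """Index of the member minimizing the sum of squared distances to all members."""
    costs = [sum(squared_distance(m, other) for other in members) for m in members]
    return min(range(len(members)), key=costs.__getitem__)

def cluster_importance(action_days, now_day, lam=0.01):
    """Sum of exp(-lam * age) over a cluster's actions (age in days, an assumed unit)."""
    return sum(math.exp(-lam * (now_day - t)) for t in action_days)

# Hypothetical cluster of four 2-d points; the medoid is always an actual member.
cluster = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
print("medoid index:", medoid_index(cluster))

# Three actions at days 0, 30 and 89, scored on day 90.
print("importance:", round(cluster_importance([0, 30, 89], now_day=90), 3))
```

A larger `lam` shrinks the contribution of older actions faster, which is the recency bias the paragraph above warns about.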
+## The fusion fault in Doc 03: RRF lets the dominant cluster dominate
+This is the most important engineering flaw to fix. Doc 03's plan is to issue K medoid queries and RRF-fuse the result lists. **RRF is the wrong fusion for this**. Cormack, Clarke, Büttcher's original SIGIR 2009 RRF paper designed it to fuse *different retrievers answering the same query* (BM25 + vector + Condorcet), where consensus is a strong signal. Across K *different* interest queries, consensus means "items near the centroid of your interests" — the exact thing multi-interest models exist to avoid. Bruch et al. (*An Analysis of Fusion Functions for Hybrid Retrieval*, SIGIR 2022) further show RRF improves Recall@K more than nDCG and loses to tuned convex combination when any labels exist.
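To make the consensus effect concrete, here is a minimal RRF sketch with K=60, assuming three hypothetical per-medoid result lists. The item names are invented; the only claim is the arithmetic of the standard `1 / (k + rank)` scoring.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; item name breaks ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical lists: "survey" is only third for every interest,
# yet it is the one item all three lists agree on.
ml_list = ["ml_1", "ml_2", "survey", "ml_3"]
nlp_list = ["nlp_1", "nlp_2", "survey", "nlp_3"]
cv_list = ["cv_1", "cv_2", "survey", "cv_3"]

fused = rrf_fuse([ml_list, nlp_list, cv_list])
print(fused[:4])
```

The generic item outranks every interest-specific top hit because it collects three contributions, which is the behavior that helps when fusing retrievers over one query and hurts when fusing distinct interests.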
+PinnerSage itself does not use RRF. Per §5 of the paper, they sample 3 medoids proportional to importance, retrieve per medoid, and concatenate into the downstream ranker. Taobao's **ULIM** (Meng et al., RecSys 2025, arXiv:2507.10097) explicitly uses pointer-enhanced cascaded category-to-item retrieval with per-category parallel retrieval — the productionized quota approach, with reported +5.54% clicks and +11.01% orders. Pinterest's **Bucketized ANN Representation Learning** (SIGIR 2023, "Representation Online Matters") directly quotes the logic: bucketized retrieval ensures "minority-representation items are not dropped in the latter stages." ComiRec (KDD 2020) uses a controllable λ to merge K lists and still shows Recall gains over MIND — but with an *aggregation module*, not with RRF. Twitter's kNN-Embed (arXiv:2205.06205) draws candidates per cluster proportional to mixture weight — again quota, not RRF.
+The correct fusion for ResearchIT, which I recommend replacing Doc 03's RRF plan with: normalize medoid importance scores to weights w_k = s_k / Σs_k; allocate F=30 feed slots as slot_k = max(⌊F·w_k⌋, F_min=3) with the remainder distributed by largest fractional part; deduplicate across lists by assigning each item to its highest-ranked list; LightGBM-rerank within each C_k and take top slot_k; merge; apply MMR (λ=0.6) over the union on BGE-M3 embeddings for fine-grained redundancy control; then apply the negative-profile penalty. This gives **importance-weighted quota with a floor** — a small NLP cluster can't be starved to zero slots even if a user's ML interest is 3× larger.
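The slot-allocation step above can be sketched as follows. Two details are assumptions of this sketch, not something the text specifies: ties on the fractional part go to the smaller cluster, and if the floor pushes the total past the feed size, slots are taken back from the largest clusters. Importance scores are passed as integers (for example weights scaled by 100) so that the remainders are exact.

```python
def allocate_slots(importance, feed_size=30, min_slots=3):
    """Importance-weighted quota with a per-cluster floor.

    importance: positive integer scores, one per cluster (assumed integer
    so floor(F * s_k / total) and its remainder are computed exactly).
    """
    total = sum(importance)
    slots = [feed_size * s // total for s in importance]
    remainders = [feed_size * s % total for s in importance]
    slots = [max(n, min_slots) for n in slots]

    # Hand out leftover slots by largest remainder; ties go to the smaller cluster (assumption).
    leftover = feed_size - sum(slots)
    order = sorted(range(len(slots)), key=lambda k: (-remainders[k], slots[k]))
    for k in order[:max(leftover, 0)]:
        slots[k] += 1

    # If the floor overshot the feed size, take slots back from the largest clusters (assumption).
    while sum(slots) > feed_size:
        k = max(range(len(slots)), key=slots.__getitem__)
        if slots[k] <= min_slots:
            break  # every cluster is at the floor; the feed is too small for this many clusters
        slots[k] -= 1
    return slots

print(allocate_slots([55, 30, 15]))  # weights 0.55 / 0.30 / 0.15
print(allocate_slots([90, 5, 5]))    # one dominant interest, two minor ones
```

In the second call the two minor clusters keep their three slots each, which is the whole purpose of the floor.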
+## Clustering specifics: Ward stays, with three corrections
+Ward hierarchical clustering remains the right default for N=50–500 saved papers in 1024-d BGE-M3 space. At N=500, fastcluster takes under 50 ms and is deterministic. Alternatives: HDBSCAN handles noise (the −1 label is valuable for one-off saves) but degrades above ~50–100 dimensions and needs UMAP preprocessing — the Milvus HDBSCAN+BGE-M3 tutorial explicitly uses this pattern. Affinity propagation picks exemplars (medoids) automatically but is O(N²T) with convergence instability at damping/preference corners; Ward + k-medoids-per-cluster dominates it. K-means needs K and doesn't produce noise labels; GMM is ill-conditioned at 1024-d with N=500; spectral is overkill.
+Three corrections to Doc 03. **First**, Ward in sklearn is Euclidean-only mathematically (Murtagh & Legendre, J. Classification 2014, require squared Euclidean); since BGE-M3 dense vectors should be L2-normalized, Euclidean and cosine are monotonically related via ‖a−b‖² = 2(1 − cos(a,b)), so *L2-normalize first, then Ward with Euclidean* gives the intended cosine behavior. If Doc 03's code does cosine Ward directly, it's silently wrong. **Second**, no fixed K; use a distance-threshold cut on the dendrogram with a cap at K_max = 20 for safety, sample top-3 by importance at retrieval — matching PinnerSage's production recipe where heavy users genuinely have 75–100 clusters. **Third**, cluster *assignment stability* is the real operational risk, not compute cost; persist cluster→medoid-paper-id mapping across reclusterings and use Hungarian matching against previous medoids on each recluster to maintain cluster IDs. At K ≤ 20 this is trivial and prevents UI churn where a user's "NLP cluster" suddenly gets renamed to cluster_7 tomorrow. Recompute nightly on the full 90-day window, patch online for the last 20 interactions — the PinnerSage two-pronged recipe.
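The identity behind the first correction is easy to check numerically. A minimal sketch with two arbitrary 3-d vectors (stand-ins for BGE-M3 embeddings; helper names are illustrative):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

lhs = squared_euclidean(a, b)       # ||a - b||^2 on unit vectors
rhs = 2.0 * (1.0 - cosine(a, b))    # 2 * (1 - cos(a, b))
print(abs(lhs - rhs) < 1e-12)
```

Once the vectors are normalized this way, they could be handed to a Euclidean Ward implementation such as `scipy.cluster.hierarchy.linkage(X, method="ward")`; that step is left out here to keep the sketch dependency-free.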
+Medoid vs centroid is empirically validated in production (PinnerSage online A/B: +4% Homefeed engagement, +2% propensity, −75% serving cost via medoid caching). Medoids are the right choice; Doc 03 is correct on this point.
+## Reranking, diversity, evaluation: concrete recommendations
+**LightGBM lambdarank is still correct for 2025–2026.** Semantic Scholar's production stack explicitly uses it: per Kinney et al. (arXiv:2301.10140, 2023), "Results are limited to 1000 matches, which are reranked using a trained LightGBM model." LightGBM with ~30–50 features on 500 candidates inferences in tens of microseconds per item on CPU — well under 1 ms total — leaving your entire 30ms budget for retrieval and diversity. The ACM RecSys Challenge 2024 (Ekstra Bladet news) was won with LightGBM lambdarank + feature ensembles. Doc 03's 2–5 ms estimate is actually pessimistic; real cost is sub-millisecond.
+Training signal with no real users: bootstrap with citation-graph pseudo-labels exactly as SPECTER (Cohan et al., ACL 2020), SciNCL (Ostendorff et al., EMNLP 2022), and SciRepEval (Singh et al., EMNLP 2023) do — treat each paper as a query user, its cited papers = relevance 2, co-cited = relevance 1, random = 0. Supplement with author-as-user simulation: each author's first N papers form their profile, their next M papers' citations are held-out positives. These are established protocols, not hacks.
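A rough sketch of the graded pseudo-label scheme, under one explicit assumption: "co-cited" is read here as "appears in the same reference list as the query paper". The function name, the toy citation graph, and the number of random negatives are all hypothetical.

```python
import random

def pseudo_labels(query, citations, all_papers, n_random=2, seed=0):
    """Graded relevance for one query paper: cited=2, co-cited=1, random=0.

    citations maps each paper ID to the list of paper IDs it cites.
    """
    cited = set(citations.get(query, ()))

    # Co-cited (assumed definition): papers sharing a reference list with the query.
    co_cited = set()
    for refs in citations.values():
        if query in refs:
            co_cited.update(refs)
    co_cited -= cited | {query}

    labels = {p: 2 for p in cited}
    labels.update({p: 1 for p in co_cited})

    # Random negatives drawn from everything not already labeled.
    pool = sorted(set(all_papers) - cited - co_cited - {query})
    rng = random.Random(seed)
    for p in rng.sample(pool, min(n_random, len(pool))):
        labels[p] = 0
    return labels

citations = {"A": ["B", "C"], "X": ["A", "D"], "Y": ["E"]}
all_papers = ["A", "B", "C", "D", "E", "F", "G", "X", "Y"]
labels = pseudo_labels("A", citations, all_papers)
print(labels)
```

Each `(query, candidate, label)` triple can then serve as one row in a lambdarank training group keyed by the query paper.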
+**Cross-encoder reranking on CPU is not viable in the hot path** at your latency target. Consolidated numbers from FlagEmbedding docs and multiple 2024–2026 benchmarks: `bge-reranker-v2-m3` (568M params) runs ~8 ms per pair on CPU fp32, meaning 50 pairs ≈ 400 ms and 100 pairs ≈ 800 ms, far over the 30 ms budget. `ms-marco-MiniLM-L-6-v2` at ~2–4 ms per pair still costs 100–200 ms for 50 pairs. Only `ms-marco-TinyBERT-L-2-v2` (~0.3 ms/pair, ~15 ms for 50) or `jina-reranker-v1-turbo-en` fit. Doc 03's Phase 5 plan to add `BGE-reranker-v2` to the hot path is the second concrete architectural fault; if you want a cross-encoder signal, distill BGE-reranker-v2 offline into a TinyBERT student and use it as a LightGBM feature on top-20 only. Use the full BGE-reranker only for generating pseudo-labels in training, never at serving time. FlashRank uses this distillation recipe and keeps ~95% of teacher nDCG@10 at ~1/10 the latency.
+**MMR λ=0.6 is a reasonable default** but the evidence across summarization, e-commerce, and search puts the sweet spot anywhere from 0.3 to 0.7 depending on the precision/discovery trade. For an academic feed, high-intent users value precision, so 0.5–0.7 is defensible — tune on offline nDCG@10. DPPs (Chen, Zhang, Zhou, NeurIPS 2018; pass Culture, arXiv:2509.10392, RecSys 2025) deliver 2–5% engagement lifts over MMR but require a learned kernel and more data; defer to v2. Pinterest replaced DPP with Sliding Spectrum Decomposition in early 2025 — DPP is no longer state-of-the-art even at Pinterest — so don't over-commit to it. Category-quota (which you already have from the fusion change above) plus MMR over the union is the right stack.
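For reference, a minimal greedy MMR sketch with λ=0.6. The relevance scores and the similarity function are hypothetical stand-ins; in the pipeline described here they would come from the reranker and from BGE-M3 cosine similarity respectively.

```python
def mmr_select(candidates, relevance, similarity, top_n, lam=0.6):
    """Greedy Maximal Marginal Relevance.

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * (max similarity to anything already selected).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: two near-duplicate surveys and one unrelated paper.
relevance = {"rag_survey_a": 0.95, "rag_survey_b": 0.94, "diffusion": 0.80}

def similarity(x, y):
    """Toy similarity: the two surveys are near-duplicates, everything else is distant."""
    return 0.98 if {x, y} == {"rag_survey_a", "rag_survey_b"} else 0.10

picked = mmr_select(list(relevance), relevance, similarity, top_n=2)
print(picked)
```

Raising `lam` toward 1.0 moves the selection back toward pure relevance ordering, which is the precision end of the trade-off discussed above.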
+**Evaluation**. Report nDCG@10 as primary, Recall@50 as secondary, HR@10 as a sanity check, plus ILS (intra-list similarity), category entropy, and a "novel-in-top-10" count for diversity/discovery. This matches the SciDocs (Cohan et al., ACL 2020) and SciRepEval (Singh et al., EMNLP 2023) conventions. For benchmarks, use **unarXive 2022** (Saier et al., arXiv:2303.14957) — all arXiv with structured citations, best match for your 1.6M corpus — plus S2ORC (Lo et al., ACL 2020) and the Konstanz citation-recommendation benchmark (Maharjan, arXiv:2412.07713, 2024) for time-split protocols. Skip CiteULike (obsolete). Note Bruch et al.'s SIGIR 2022 caveat: RRF tuning optimizes Recall more than nDCG, so if you evaluate fusion choices use nDCG@10 as the tie-breaker — another reason the RRF → quota switch is well-founded.
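A compact nDCG@k sketch for graded labels, assuming the common exponential-gain form `(2^g - 1) / log2(rank + 1)`. One simplification to note: the ideal ordering is computed from the gains present in the ranked list itself, which is only exact when every relevant item appears somewhere in that list.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (exponential gain)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k, normalizing by the best ordering of the same gains (simplifying assumption)."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Graded labels in ranked order: 2 = cited, 1 = co-cited, 0 = random.
print(ndcg_at_k([2, 1, 0]))            # already in the ideal order
print(round(ndcg_at_k([0, 1, 2]), 3))  # worst ordering of the same items
```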
+**Negative signals**. Doc 03's EWMA negative profile with α=0.15 is a reasonable engineering pattern but is **not a named academic method**; the closest published analogs are eVSM (Musto 2011) and Pone-GNN's disinterest embeddings (TORS 2024). The definitive industrial reference is YouTube's 2023 paper (Xia et al., arXiv:2308.12256): using dislikes only as *input features* reduces similar recs by 22%; using them as *both input features and training labels* (negative loss target) reduces similar-content rate by **60.8%** and creator-similarity by 64.1% — a 3× gain from the richer treatment. Spotify's 2024 work (Wilm et al., arXiv:2406.04488) and Park et al.'s Pone-GNN (TORS 2024) confirm dual positive/negative embeddings beat single-encoder negative sampling. Recommended three-layer design: session-level hard filter (never re-show dismissed items in the same session); short-term item penalty at rerank `score -= α · exp(−Δt/τ_neg)` with τ_neg = 7–14 days; long-term EWMA "negative medoid" per user with MMR-style subtraction `final_score = λ·sim(cand, pos_medoid) − (1−λ)·sim(cand, neg_medoid)`. Generalize to category-level: if ≥3 dismissals hit the same arXiv category within a week, suppress that category for 2 weeks. Once you have ≥10K dismissals, retrain LightGBM with dismissals as negative labels (YouTube's 22% → 60% gain).
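The two penalty formulas and the category rule from the three-layer design can be sketched as small functions. The default values for `alpha`, `tau_neg` and `lam` below are placeholders chosen for illustration, not values taken from the text (which gives only the 7–14 day range for τ_neg); the function names are hypothetical.

```python
import math

def short_term_penalty(days_since_dismissal, alpha=0.2, tau_neg=10.0):
    """Amount subtracted at rerank time: alpha * exp(-dt / tau_neg)."""
    return alpha * math.exp(-days_since_dismissal / tau_neg)

def final_score(sim_pos, sim_neg, lam=0.8):
    """MMR-style subtraction against the long-term negative medoid."""
    return lam * sim_pos - (1.0 - lam) * sim_neg

def category_suppressed(dismissal_days, now_day, threshold=3, window_days=7):
    """True if at least `threshold` dismissals in a category fall within the last week."""
    recent = [d for d in dismissal_days if 0 <= now_day - d < window_days]
    return len(recent) >= threshold

print(round(short_term_penalty(0), 3), round(short_term_penalty(14), 3))
print(round(final_score(sim_pos=0.9, sim_neg=0.5), 3))
print(category_suppressed([20, 21, 22, 23], now_day=24))
```

How long a suppressed category stays hidden (two weeks in the text) would be tracked separately, for example as an expiry timestamp per user and category.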
+## Worked example: a user saves 5 papers
+Suppose a new user signs up, ticks cs.LG and cs.CL on the category onboarding, pastes their ORCID so the system ingests their 4 authored papers (frozen BGE-M3 embeddings), and saves 5 papers manually: three on retrieval-augmented LLMs, two on diffusion models for image generation.
+Day 1 at signup: the 9-paper profile (4 authored + 5 saved) triggers Ward clustering with L2-normalized BGE-M3 vectors. The dendrogram cut at cosine-distance threshold 0.35 produces three clusters: RAG/NLP (5 papers), diffusion (2 papers), and a small third cluster holding the remaining 2 authored papers on older topics. Medoids are the three papers minimizing sum-of-squared-distances within each cluster, stored as paper IDs only (cacheable across users). Importance scores use EWMA with α_long=0.03 (correcting Doc 03's 0.10, which PinnerSage empirically rejected as too recent-biased), producing weights roughly 0.55 / 0.30 / 0.15.
+Feed request: the system issues three independent ANN queries against Qdrant (dense) + Zilliz (sparse), one per medoid, retrieving top 200 each. For a 30-item feed, slot allocation gives 16 / 9 / 5, which already satisfies the floor F_min=3. LightGBM lambdarank scores each candidate with features: max/mean/top-3 BGE-M3 cosine to medoid, sparse score, log-citation-count, age-decayed citation velocity, arXiv primary-category match to medoid, author-overlap count with saves, citation-graph overlap with saves (second-order). Items appearing in multiple lists are assigned to their highest-ranked cluster (dedup), and the per-cluster top 16 / 9 / 5 are picked. MMR with λ=0.6 runs over the merged 30-item set on BGE-M3 embeddings to smooth any within-quota redundancy (e.g., three near-duplicate RAG surveys get diversified). A negative-medoid penalty subtracts a fraction of similarity to the user's dismissed-items EWMA centroid. Final 30-item feed is served. End-to-end CPU latency on a modest instance: Qdrant+Zilliz queries ~10 ms combined (3 parallel), LightGBM rerank ~1 ms, MMR ~2 ms, quota + dedup ~0.5 ms, negative-profile penalty ~0.5 ms, for roughly 14 ms in total, comfortably under the 30 ms budget.
+Day 30: the user has saved 80 papers and dismissed 25. Ward recluster produces 4–5 stable clusters; Hungarian matching against yesterday's medoids preserves cluster IDs for UX consistency. The onboarding category filter is now only used as a LightGBM categorical feature and a pool-prefilter when a cluster is small; the user's behavioral signal dominates. Category-level negatives suppress `cs.IR` for two weeks after the user dismisses 4 retrieval-systems papers in a row. This is the pure-behavioral regime Doc 03/05 describes, reached cleanly because the first-session experience was bootstrapped by the explicit filter rather than requiring 5 saves before any rec appears (the Semantic Scholar drop-off trap).
+## Restructured phases
+Amin's question 2 was whether Phase 1 (zero-ML) and Phase 2 (hybrid search, done) serve the endgame. They do, but the sequencing should change.
+**Phase 0 — already complete**: hybrid BGE-M3 dense + sparse + RRF on Qdrant + Zilliz, 1.6M papers. Keep RRF *for search* (it's correct for fusing different retrievers over the *same* query); replace it with quota *for recommendations* (different queries over the same user).
+**Phase 1 reframed — cold-start scaffolding, not throwaway**. Build the onboarding pipeline that will still be used at month 12: (a) arXiv category multi-select (5-second coarse filter and LightGBM feature), (b) ORCID / Semantic Scholar ID / Google Scholar author-name import that ingests authored paper embeddings as initial seeds, (c) "add 5 seed papers" library seeder, (d) a simple popularity-per-selected-category feed for the first session if users skip all three. This replaces Doc 03's implied "start empty, wait for saves" plan and addresses the Semantic Scholar empty-library trap. Infrastructure: SQLite is fine initially; migrate to Supabase when you cross 10 concurrent writes/sec. Stick with Render.
+**Phase 2 reframed — behavioral profile and Ward clustering, the core recommender**. Implement the full PinnerSage-style profile: Ward + L2-normalized Euclidean clustering with dendrogram-threshold cut (cap K_max=20), medoid selection, importance scoring with α_long=0.03 (not 0.10), α_short=0.40, α_neg=0.15; Hungarian-matched stable cluster IDs; nightly batch + online patches. Retrieval: 3 medoids sampled by importance, K separate Qdrant+Zilliz queries, **importance-weighted quota fusion with F_min=3, not RRF**. Reranking: LightGBM lambdarank with ~30 features including arXiv category match, citation-graph overlap, author overlap, recency; trained initially on SPECTER/SciNCL-style citation-graph pseudo-labels + author-as-user simulation on unarXive 2022. Diversity: MMR λ=0.6 over the union. Negatives: session hard-filter + short-term item penalty + long-term EWMA negative medoid with MMR-style subtraction + category-level suppression.
+**Phase 3 — evaluation and tuning before users scale**. Time-split citation-prediction evaluation on unarXive 2022 and S2ORC; author-as-user simulation for personalization metrics; nDCG@10 primary, Recall@50 secondary, HR@10 sanity; ILS + category entropy + novel-in-top-10 for diversity. Tune MMR λ, cluster cut threshold, and EWMA alphas against nDCG@10. Use SciDocs and SciRepEval as published benchmarks for comparison against SPECTER2 baselines.
+**Phase 4 — Claude-generated interest summaries and distilled cross-encoder rerank**. Claude API generates human-readable per-cluster summaries ("You're reading about retrieval-augmented generation, particularly evaluation and hallucination") stored with medoids for UI. Offline, distill `BAAI/bge-reranker-v2-m3` into a TinyBERT-L2 student (FlashRank recipe); deploy on top-20 post-LightGBM as a feature, keeping hot-path latency under 30 ms. Do **not** put full BGE-reranker-v2 in the hot path.
+**Phase 5 — exploration and collaborative filtering**. Epsilon-greedy exploration (5–10% of slots) against a budget of novel-category items — mirrors Spotify's BaRT (McInerney et al., RecSys 2018) and Pinterest's Warmer-for-Less (arXiv:2512.17277, 2025). LightFM / collaborative filtering when you cross the 500-user threshold Doc 03 mentions. At that scale, retrain LightGBM with dismissals as negative labels (YouTube's 3× gain).
+**Phase 6 — optional upgrades, only if warranted**. Semantic IDs / TIGER-style generative retrieval (Rajput et al., NeurIPS 2023; ActionPiece, arXiv:2502.13581, Feb 2025) becomes interesting at ≥10⁵ users with rich sequence data; PinnerFormer-style single-vector transformer user models become relevant only if you ship many downstream models consuming user embeddings; DPPs with learned kernels over MMR become worth the complexity only with substantial engagement data. None of these belong in the first 12 months.
+## Contradictions across Docs 01–05, resolved
+Doc 01/02/04 advocate explicit subject vectors as primary user models; Doc 05 pivots to pure behavioral citing four problems (taxonomy limits, granular specificity, drift, centroid-in-nowhere); Doc 03 commits to pure behavioral with Ward + medoids + RRF. **Resolution**: all four Doc-05 objections are valid *against subject vectors as the primary vector*; none are valid against subject labels as *filters, features, and cold-start priors*. Use subject taxonomy for the first three things, use behavioral medoids for the primary user representation. This gives both 01/02/04's first-session usability and 03/05's long-run personalization.
+Doc 03's α_long=0.10 contradicts PinnerSage's empirical λ=0.01 optimum with λ=0.1 explicitly rejected as too recent-biased; **resolution**: α_long=0.03 (split the difference toward PinnerSage's finding); keep α_short=0.40 and α_neg=0.15 as reasonable. Doc 03's RRF fusion contradicts PinnerSage's importance-weighted sampling, ComiRec's controllable aggregation, Taobao ULIM's quota, and Pinterest Bucketized-ANN's quota — every production multi-interest system uses quota, not RRF; **resolution**: importance-weighted quota with floor F_min=3. Doc 03's BGE-reranker-v2 in the hot path contradicts CPU latency reality (~800 ms for 100 pairs); **resolution**: distill offline, use TinyBERT student as LightGBM feature. Doc 03's LightFM deferral is correct (≥500 users is the right threshold). Doc 03's MMR λ=0.6 is a defensible default; tune on nDCG@10.
+## Flagged architectural faults in Doc 03
+Four concrete faults, in order of impact. First, **RRF fusion across K medoid queries** lets dominant clusters dominate; replace it with importance-weighted quota + F_min floor + MMR smoothing. Second, **α_long=0.10 is empirically too aggressive** (it is PinnerSage's rejected value); drop to 0.03. Third, **BGE-reranker-v2 in the serving path is infeasible on CPU at 30 ms**; distill offline or drop. Fourth, **cosine Ward is not mathematically Ward** if run directly; always L2-normalize first and use Euclidean Ward, or explicitly use `linkage='average', metric='cosine'` and accept the small cluster-size-regularity loss. Minor: without Hungarian matching for cluster-ID stability, cluster IDs will churn in the UI as users save new papers.
+## What changes if this verdict is right
+The implementation surface looks almost identical to Doc 03 from 50 feet away — Ward clustering, medoid representatives, EWMA dynamics, LightGBM rerank, MMR diversity, Supabase/Render infra — but three subsystems differ in ways that determine whether the system actually works for multi-interest users: the fusion step switches from RRF to importance-weighted quota with a floor, the onboarding gains a coarse category filter + ORCID/seed-paper import that make the first session functional without contradicting the long-run behavioral vision, and the reranking stack uses LightGBM as the terminal CPU-path reranker with any cross-encoder signal distilled offline rather than at serving time. The system Amin built so far (hybrid search on Qdrant + Zilliz) is load-bearing infrastructure for Phase 2, not scaffolding, because the *search* RRF fusion is correct even though the *recommendation* RRF fusion is not — these are different problems with different correct answers. The strongest evidence for this architecture is that it's what Pinterest (with 1000× more users), Scholar Inbox (with the same arXiv domain and 3M papers), Taobao ULIM (with the same quota pattern at billion-scale), Semantic Scholar (with the same LightGBM terminal reranker), and the empirical onboarding literature since Rashid 2002 all converge on — hybrid, quota-merged, behaviorally-dominated, with explicit signals as priors not primary vectors.

docs/walkthroughs/01-Phase1-Code-Tour.md ADDED

	@@ -0,0 +1,886 @@

+# Code Tour — ArXiv Recommender (Phase 1)
+A file-by-file walkthrough of every piece of the codebase: what it does, how it works, and why it was written the way it was.
+---
+## Entry Points
+### `run.py`
+```python
+import uvicorn
+if __name__ == "__main__":
+    uvicorn.run(
+        "app.main:app",
+        host="127.0.0.1",
+        port=8000,
+        reload=True,
+        reload_dirs=["app"],
+    )
+```
+Nothing special here. Starts Uvicorn pointing at the FastAPI `app` object. `reload=True` watches the `app/` directory and hot-reloads on file changes. Run with `python run.py`.
+---
+### `app/main.py`
+```python
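+import uuid
+from contextlib import asynccontextmanager
+from fastapi import Cookie, FastAPI, Request
+from fastapi.responses import HTMLResponse
+from app.routers import search, events, recommendations, saved
+# Excerpt: db, us (user state), templates, APP_TITLE and COOKIE_NAME are imported elsewhere in the module.
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    await db.init_db()   # create the SQLite tables once at startup
+    yield
+app = FastAPI(title=APP_TITLE, lifespan=lifespan)
+app.include_router(search.router)
+app.include_router(events.router)
+app.include_router(recommendations.router)
+app.include_router(saved.router)
+@app.get("/", response_class=HTMLResponse)
+async def home(request: Request, user_id=Cookie(...)):   # Cookie arguments elided in this excerpt
+    user_id = user_id or str(uuid.uuid4())   # first visit: mint a new ID
+    state = await us.ensure_loaded(user_id)
+    resp = templates.TemplateResponse(request, "index.html", {
+        "has_recs": state.has_enough_for_recs(),
+        "save_count": len(state.positives),
+    })
+    # Re-set the cookie on every response so its expiry keeps sliding forward.
+    resp.set_cookie(COOKIE_NAME, user_id, max_age=365*24*3600, httponly=True)
+    return resp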
+```
+**`lifespan`** is a FastAPI context manager that runs `init_db()` once when the server starts — creates the three SQLite tables if they don't exist, then yields control to the app.
+**The home route** is the only one that lives in `main.py`. Everything else is in routers. It reads the user's cookie, loads their state from memory/DB, and renders `index.html` with two flags: `has_recs` (enough saves to show recommendations?) and `save_count` (how many papers saved so far).
+**Cookie pattern** — every route that might be a user's first visit creates a UUID4 if no cookie exists, and refreshes the cookie's max_age on every response. This way the cookie always stays 1 year from last visit.
+---
+## Configuration
+### `app/config.py`
+```python
+import os
+QDRANT_URL        = os.getenv("QDRANT_URL", "https://2fe1965b-...eu-west-2-0.aws...")
+QDRANT_API_KEY    = os.getenv("QDRANT_API_KEY", "eyJhbGci...")
+QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "arxiv_bgem3_dense")
+DB_PATH           = os.getenv("DB_PATH", "interactions.db")
+ARXIV_API_URL     = "https://export.arxiv.org/api/query"
+ARXIV_MAX_RESULTS = 10
+METADATA_CACHE_TTL_DAYS = 30
+REC_LIMIT          = 10
+REC_POSITIVE_LIMIT = 20
+REC_MIN_POSITIVES  = 1
+APP_TITLE   = "ArXiv Recommender"
+COOKIE_NAME = "arxiv_user_id"
+COOKIE_MAX_AGE = 60 * 60 * 24 * 365
+```
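+Every credential and tunable lives here. The Qdrant settings and `DB_PATH` are read with `os.getenv("X", default)` and can be overridden with environment variables; the remaining values are plain module constants that you change in this file. In production you'd set `QDRANT_API_KEY` as an env var and never commit a real key to git.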
+**`REC_POSITIVE_LIMIT = 20`** — controls how many saved papers are kept in the in-memory deque *and* how many are sent to Qdrant as positive examples. This is the only place you change it; `user_state.py` reads it directly.
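A small illustration of how a bounded deque behaves at that limit. This only demonstrates `collections.deque(maxlen=...)`; the actual structure inside `user_state.py` is not shown in this tour, so the variable names here are hypothetical.

```python
from collections import deque

REC_POSITIVE_LIMIT = 20  # mirrors the value in app/config.py

# A deque with maxlen silently drops the oldest entry once it is full.
positives = deque(maxlen=REC_POSITIVE_LIMIT)
for i in range(25):
    positives.append(f"paper_{i}")

print(len(positives))   # never exceeds the limit
print(positives[0])     # oldest save still retained
print(positives[-1])    # most recent save
```

Because eviction is automatic, the code that appends a new save does not need any explicit trimming logic.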
+---
+## Database Layer
+### `app/db.py`
+Three tables. The schema runs once at startup via `init_db()`.
+```python
+_SCHEMA = """
+PRAGMA journal_mode=WAL;
+PRAGMA synchronous=NORMAL;
+CREATE TABLE IF NOT EXISTS interactions (
+    id         INTEGER PRIMARY KEY AUTOINCREMENT,
+    user_id    TEXT    NOT NULL,
+    paper_id   TEXT    NOT NULL,
+    event_type TEXT    NOT NULL,   -- save | not_interested
+    source     TEXT,               -- search | recommendation | saved
+    position   INTEGER,
+    query_id   TEXT,
+    timestamp  TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+CREATE INDEX IF NOT EXISTS idx_ui_user_ts    ON interactions(user_id, timestamp DESC);
+CREATE INDEX IF NOT EXISTS idx_ui_user_paper ON interactions(user_id, paper_id);
+CREATE TABLE IF NOT EXISTS paper_qdrant_map (
+    arxiv_id        TEXT PRIMARY KEY,
+    qdrant_point_id INTEGER NOT NULL,
+    mapped_at       TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+CREATE TABLE IF NOT EXISTS paper_metadata (
+    arxiv_id  TEXT PRIMARY KEY,
+    title     TEXT,
+    abstract  TEXT,
+    authors   TEXT,   -- JSON array string e.g. '["Vaswani", "Shazeer"]'
+    category  TEXT,
+    published TEXT,
+    cached_at TEXT    NOT NULL DEFAULT (datetime('now'))
+);
+"""
+```
+**WAL mode** (`journal_mode=WAL`) allows one writer and multiple concurrent readers without blocking each other. This matters because FastAPI handles requests concurrently, and in SQLite's default rollback-journal mode a write blocks readers (and active readers delay the writer).
+**`synchronous=NORMAL`** doesn't fsync on every commit. In WAL mode the database stays consistent after an OS crash or power loss, but the most recent commits may be rolled back. Faster than `FULL`, with acceptable durability for a research tool.
+**Three tables, three jobs:**
+| Table | Job |
+|---|---|
+| `interactions` | Append-only event log. Never updated, only inserted. Source of truth. |
+| `paper_qdrant_map` | Cache translating arxiv_id strings → Qdrant integer point IDs |
+| `paper_metadata` | Cache of arXiv API responses so we don't re-fetch titles/abstracts |
+**Key functions:**
+```python
+# Write an event
+await db.log_interaction(user_id, paper_id, "save", source="search", position=2)
+# Read recent events for a user (used to hydrate the in-memory cache)
+rows = await db.get_user_interactions(user_id, event_types=["save", "not_interested"], limit=70)
+# Qdrant ID cache
+await db.save_qdrant_id("1706.03762", 523419)
+cached = await db.get_qdrant_ids_batch(["1706.03762", "0704.0002"])
+# → {"1706.03762": 523419}  (only IDs that were in the cache)
+# Metadata cache
+await db.cache_metadata({"arxiv_id": "1706.03762", "title": "Attention...", ...})
+batch = await db.get_cached_metadata_batch(["1706.03762", "0704.0002"])
+# → {"1706.03762": {...}}
+```
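+All functions use `async with aiosqlite.connect(DB_PATH)`, so each call opens and closes its own connection. This is safe with WAL mode and avoids connection pool complexity. One caveat: `journal_mode=WAL` is persisted in the database file, but `synchronous` is a per-connection setting, so `synchronous=NORMAL` only applies to connections that issue that PRAGMA themselves.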
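+A quick way to see which of the two PRAGMAs survives a reconnect is the sketch below. It is an illustration only: it uses the standard-library `sqlite3` module rather than `aiosqlite`, a throwaway database file, and a hypothetical helper named `pragma_roundtrip`.

```python
import os
import sqlite3
import tempfile


def pragma_roundtrip(db_path: str) -> dict:
    """Set both PRAGMAs on one connection, then inspect a fresh connection."""
    first = sqlite3.connect(db_path)
    first.execute("PRAGMA journal_mode=WAL")
    first.execute("PRAGMA synchronous=NORMAL")
    first.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    first.commit()
    first_sync = first.execute("PRAGMA synchronous").fetchone()[0]
    first.close()

    second = sqlite3.connect(db_path)
    # journal_mode is read back from the database file itself;
    # synchronous is whatever default this SQLite build gives a new connection.
    mode = second.execute("PRAGMA journal_mode").fetchone()[0]
    second_sync = second.execute("PRAGMA synchronous").fetchone()[0]
    second.close()
    return {"journal_mode": mode, "first_sync": first_sync, "second_sync": second_sync}


with tempfile.TemporaryDirectory() as tmp:
    info = pragma_roundtrip(os.path.join(tmp, "demo.db"))
print(info)
```

+The second connection should still report `wal`, while its `synchronous` value is the build's default rather than the `NORMAL` (1) that was set on the first connection.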
+---
+## arXiv Service
+### `app/arxiv_svc.py`
+Handles all communication with the arXiv Atom XML API and the SQLite metadata cache.
+#### ID Normalisation
+arXiv IDs come in several formats from the API:
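+```python
+import re
+_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")
+def _normalise_id(raw: str) -> str:
+    m = _ID_RE.search(raw.strip())
+    if m is None:                      # guard added here: nothing ID-like found (e.g. empty string)
+        return raw.strip()
+    bare = m.group(1)                  # may still carry a version suffix, e.g. "1706.03762v5"
+    return re.sub(r"v\d+$", "", bare)  # drop the trailing "vN"
+```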
+| Input | Output |
+|---|---|
+| `http://arxiv.org/abs/1706.03762v5` | `1706.03762` |
+| `https://arxiv.org/abs/1706.03762` | `1706.03762` |
+| `arxiv:1706.03762v2` | `1706.03762` |
+| `1706.03762v3` | `1706.03762` |
+| `0704.0002` | `0704.0002` |
+The bare ID is what we store everywhere — in SQLite, in the user state cache, and in the Qdrant `arxiv_id` payload field.
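+The table above can be reproduced with a few lines. This is a standalone sketch that copies the pattern from the walkthrough; the function name `normalise_id` and the `None` guard are illustrative additions, not necessarily what `arxiv_svc.py` contains.

```python
import re

# Pattern copied from the walkthrough above.
_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")


def normalise_id(raw: str) -> str:
    """Strip a URL or `arxiv:` prefix and any trailing version suffix."""
    m = _ID_RE.search(raw.strip())
    if m is None:  # e.g. an empty string: nothing to normalise
        return raw.strip()
    return re.sub(r"v\d+$", "", m.group(1))


examples = [
    "http://arxiv.org/abs/1706.03762v5",
    "https://arxiv.org/abs/1706.03762",
    "arxiv:1706.03762v2",
    "1706.03762v3",
    "0704.0002",
]
for raw in examples:
    print(f"{raw} -> {normalise_id(raw)}")
```

+One limitation worth knowing: old-style identifiers such as `hep-th/9901001` contain a slash, which the character class excludes, so this pattern returns only `hep-th` for them. That is fine as long as the collection holds new-style IDs exclusively.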
+#### XML Parsing
+The arXiv API returns Atom XML. One `<entry>` element per paper:
+```python
+_NS = {
+    "atom":  "http://www.w3.org/2005/Atom",
+    "arxiv": "http://arxiv.org/schemas/atom",
+}
+def _parse_entry(entry: ET.Element) -> dict:
+    # `text(tag)` is a small helper (elided here) that returns the text of a child element
+    raw_id    = text("atom:id")
+    arxiv_id  = _normalise_id(raw_id)
+    published = text("atom:published")[:10]   # YYYY-MM-DD only
+    authors   = [a.findtext("atom:name", ...) for a in entry.findall("atom:author", _NS)]
+    cat_el    = entry.find("arxiv:primary_category", _NS)
+    category  = cat_el.attrib.get("term", "")
+    return {
+        "arxiv_id": arxiv_id,
+        "title":    text("atom:title").replace("\n", " "),
+        "abstract": text("atom:summary").replace("\n", " "),
+        "authors":  json.dumps(authors[:5]),   # stored as JSON string in SQLite
+        "category": category,
+        "published": published,
+        "year":     int(published[:4]),        # derived from the date string above
+    }
+```
+Authors are stored as a JSON string (`'["Vaswani", "Shazeer"]'`) because SQLite has no array type. The `tojson_parse` filter in the template converts it back to a Python list for display.
+#### Search and Fetch
+```python
+async def search(query: str, max_results=10) -> list[dict]:
+    params = {"search_query": f"all:{query}", "start": 0,
+              "max_results": max_results, "sortBy": "relevance"}
+    async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
+        resp = await client.get(ARXIV_API_URL, params=params)
+    papers = [_parse_entry(e) for e in ET.fromstring(resp.text).findall("atom:entry", _NS)]
+    for paper in papers:
+        await db.cache_metadata(paper)   # cache all results immediately
+    return papers
+async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
+    result  = await db.get_cached_metadata_batch(arxiv_ids)  # check SQLite first
+    missing = [aid for aid in arxiv_ids if aid not in result]
+    if missing:
+        # Batch up to 20 IDs per request, 0.35s gap = ~3 req/s rate limit
+        for i in range(0, len(missing), 20):
+            chunk  = missing[i:i+20]
+            params = {"id_list": ",".join(chunk), "max_results": len(chunk)}
+            # ... fetch, parse, cache ...
+            await asyncio.sleep(0.35)
+    return result
+```
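+`follow_redirects=True` is required because the arXiv API's HTTP URL redirects to HTTPS. Note that arXiv's API terms of use ask clients to make no more than one request every three seconds, so the 0.35s gap above is considerably more aggressive than that guidance.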
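+The batching step in `fetch_metadata_batch` can be isolated as below. This is a minimal sketch with hypothetical helper names (`chunked`, `id_list_params`); it only builds the request parameters and performs no network I/O.

```python
from typing import Iterator


def chunked(ids: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def id_list_params(chunk: list[str]) -> dict:
    """Query parameters for one batched arXiv `id_list` request."""
    return {"id_list": ",".join(chunk), "max_results": len(chunk)}


missing = [f"2401.{i:05d}" for i in range(45)]
for chunk in chunked(missing):
    params = id_list_params(chunk)
    print(params["max_results"], params["id_list"][:21])
```

+With 45 missing IDs this produces three requests of 20, 20 and 5 IDs; the pause between them is where the rate limit is enforced.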
+---
+## User State
+### `app/user_state.py`
+The in-memory hot cache. Zero DB reads on the hot path.
+```python
+from collections import deque
+from dataclasses import dataclass, field
+from app import db, config
+MAX_POSITIVES = config.REC_POSITIVE_LIMIT   # = 20, kept in sync with config
+MAX_NEGATIVES = 50
+@dataclass
+class UserState:
+    positives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
+    negatives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))
+    loaded: bool = False
+    def add_positive(self, paper_id: str) -> None:
+        try:    self.negatives.remove(paper_id)   # mutual exclusion
+        except ValueError: pass
+        if paper_id not in self.positives:
+            self.positives.appendleft(paper_id)   # most recent first
+    def add_negative(self, paper_id: str) -> None:
+        try:    self.positives.remove(paper_id)
+        except ValueError: pass
+        if paper_id not in self.negatives:
+            self.negatives.appendleft(paper_id)
+    def has_enough_for_recs(self) -> bool:
+        return len(self.positives) >= config.REC_MIN_POSITIVES
+```
+**Mutual exclusion**: `add_positive` removes the paper from negatives before adding to positives, and vice versa. So a paper can never be in both lists simultaneously.
+**`appendleft`**: deques are double-ended. `appendleft` inserts at index 0 (front). When `maxlen` is reached, the rightmost (oldest) element is silently dropped. So `positive_list[0]` is always the most recently saved paper.
+```python
+_cache: dict[str, UserState] = {}   # global in-process dict
+async def ensure_loaded(user_id: str) -> UserState:
+    state = get_user_state(user_id)
+    if state.loaded:
+        return state                 # O(1) — hot path
+    # Cold path: first request from this user in this server process
+    rows = await db.get_user_interactions(user_id,
+               event_types=["save", "not_interested"], limit=70)
+    for row in reversed(rows):      # oldest first so appendleft puts newest at front
+        if row["event_type"] == "save":
+            state.add_positive(row["paper_id"])
+        else:
+            state.add_negative(row["paper_id"])
+    state.loaded = True
+    return state
+```
+**Why `reversed(rows)`**: `get_user_interactions` returns rows newest-first (ORDER BY timestamp DESC). We want to replay them in chronological order so that `appendleft` in `add_positive` correctly ends up with the newest paper at `index 0`. If we replayed newest-first, the oldest save would end up at the front.
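+The effect of the replay order is easy to check in isolation. The sketch below uses plain paper-ID strings instead of DB rows and a hypothetical `replay` helper; it mirrors the `appendleft` and `maxlen` behaviour described above.

```python
from collections import deque


def replay(paper_ids, maxlen):
    """Append IDs in the given order with appendleft, as add_positive does."""
    d = deque(maxlen=maxlen)
    for paper_id in paper_ids:
        if paper_id not in d:
            d.appendleft(paper_id)
    return list(d)


rows = ["p5", "p4", "p3", "p2", "p1"]  # newest first, as the DB returns them

print(replay(reversed(rows), maxlen=3))  # chronological replay
print(replay(rows, maxlen=3))            # naive newest-first replay
```

+Replaying chronologically yields `['p5', 'p4', 'p3']`. Replaying newest-first yields `['p1', 'p2', 'p3']`: the order is inverted and, once `maxlen` is hit, the newest saves are the ones evicted.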
+```python
+def record_positive(user_id: str, paper_id: str) -> None:
+    get_user_state(user_id).add_positive(paper_id)   # sync, no DB
+def all_seen(user_id: str) -> set[str]:
+    state = get_user_state(user_id)
+    return set(state.positive_list) | set(state.negative_list)
+```
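+`all_seen` feeds the recommendation engine: any paper currently held in the user's positive or negative deque is excluded from the results. Because both deques are bounded (`MAX_POSITIVES` and `MAX_NEGATIVES`), papers that have been evicted from them are no longer filtered out.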
+---
+## Qdrant Service
+### `app/qdrant_svc.py`
+Two jobs: translate arxiv_ids → integer point IDs, and call the Recommend API.
+#### Client Setup
+```python
+@lru_cache(maxsize=1)
+def _client() -> QdrantClient:
+    return QdrantClient(
+        url=config.QDRANT_URL,
+        api_key=config.QDRANT_API_KEY,
+        timeout=30,
+        check_compatibility=False,
+    )
+```
+`@lru_cache(maxsize=1)` makes this a singleton: the client is created once and reused on every request. The sync `QdrantClient` is used (not the async one), and every call is dispatched through `loop.run_in_executor`, which runs it in a worker thread and keeps the event loop free while the network call is in flight.
+#### ID Lookup
+```python
+async def lookup_qdrant_ids(arxiv_ids: list[str]) -> dict[str, int]:
+    cached  = await db.get_qdrant_ids_batch(arxiv_ids)
+    missing = [aid for aid in arxiv_ids if aid not in cached]
+    if missing:
+        loop    = asyncio.get_event_loop()
+        results = await loop.run_in_executor(None, _scroll_by_arxiv_ids, missing)
+        for arxiv_id, point_id in results.items():
+            await db.save_qdrant_id(arxiv_id, point_id)
+            cached[arxiv_id] = point_id
+    return cached
+def _scroll_by_arxiv_ids(arxiv_ids: list[str]) -> dict[str, int]:
+    pts, _ = _client().scroll(
+        collection_name=QDRANT_COLLECTION,
+        scroll_filter=Filter(must=[
+            FieldCondition(key="arxiv_id", match=MatchAny(any=arxiv_ids))
+        ]),
+        limit=len(arxiv_ids),
+        with_payload=True,
+        with_vectors=False,
+    )
+    return {p.payload["arxiv_id"]: p.id for p in pts}
+```
+`MatchAny` is Qdrant's `IN (...)` — it filters points whose `arxiv_id` payload field matches any value in the list. Requires the keyword payload index created on the collection (created once, persists permanently).
+The result is `{arxiv_id: integer_point_id}`. Any ID not found in the collection is simply absent from the dict — that paper hasn't been indexed yet.
+#### Recommend
+```python
+async def recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
+    all_ids = list(dict.fromkeys(positive_arxiv_ids + negative_arxiv_ids))
+    id_map  = await lookup_qdrant_ids(all_ids)
+    pos_ids = [id_map[aid] for aid in positive_arxiv_ids if aid in id_map]
+    neg_ids = [id_map[aid] for aid in negative_arxiv_ids if aid in id_map]
+    if not pos_ids:
+        return []
+    loop    = asyncio.get_running_loop()
+    results = await loop.run_in_executor(None, _run_recommend, pos_ids, neg_ids, limit*2)
+    filtered = [
+        r.payload["arxiv_id"]
+        for r in results
+        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in seen_arxiv_ids
+    ]
+    return filtered[:limit]
+def _run_recommend(pos_ids, neg_ids, limit):
+    result = _client().query_points(
+        collection_name=QDRANT_COLLECTION,
+        query=RecommendQuery(
+            recommend=RecommendInput(
+                positive=pos_ids,
+                negative=neg_ids if neg_ids else [],
+                strategy=RecommendStrategy.BEST_SCORE,
+            )
+        ),
+        limit=limit,
+        with_payload=True,
+        with_vectors=False,
+    )
+    return result.points
+```
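+**`BEST_SCORE` strategy**: for each candidate paper, Qdrant takes its highest similarity to any positive example and its highest similarity to any negative example. As described in Qdrant's documentation, a candidate that is closer to a positive than to every negative is scored by that best positive similarity; otherwise it receives a negative score derived from its best negative similarity. Papers near your saves and far from your dismissals bubble to the top.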
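+The scoring rule can be sketched in a few lines. This follows the formula given in Qdrant's documentation for `best_score`; the server's actual implementation may differ in detail, and the function name and list-of-similarities interface here are assumptions made for illustration.

```python
def best_score(sim_to_positives, sim_to_negatives):
    """Score one candidate from its similarities to the positive/negative examples."""
    best_pos = max(sim_to_positives)
    best_neg = max(sim_to_negatives) if sim_to_negatives else float("-inf")
    if best_pos > best_neg:
        return best_pos                 # closer to a save than to any dismissal
    return -(best_neg * best_neg)       # otherwise penalised by its nearest dismissal


print(best_score([0.2, 0.9], [0.1]))   # near a saved paper
print(best_score([0.3], [0.8]))        # nearer to a dismissed paper
print(best_score([0.5], []))           # no negatives supplied
```

+Under this rule a candidate that sits closer to a dismissed paper than to any saved one always ranks below every candidate for which the opposite holds.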
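+**`limit * 2` over-fetch**: we fetch double the target count so that, after filtering out `seen_arxiv_ids` in Python, there are usually still enough results to return `limit` papers. If more than half of the fetched candidates have already been seen, fewer than `limit` come back.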
+**`dict.fromkeys(...)` deduplication**: if a paper appears in both positive and negative lists (shouldn't happen due to mutual exclusion in `user_state`, but defensive), it's deduplicated before the lookup.
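+Both list manipulations are small enough to try directly. The helper names `merge_ids` and `filter_seen` below are hypothetical and exist only for this sketch.

```python
def merge_ids(positive, negative):
    """Order-preserving de-duplication of the combined ID list."""
    return list(dict.fromkeys(positive + negative))


def filter_seen(candidates, seen, limit):
    """Drop already-seen IDs, then truncate to the requested count."""
    return [c for c in candidates if c not in seen][:limit]


print(merge_ids(["a", "b"], ["b", "c"]))
# Over-fetch of 4 candidates for limit=2; one of them is already seen.
print(filter_seen(["x", "a", "y", "z"], seen={"a"}, limit=2))
# Three of the four are seen, so only one result survives.
print(filter_seen(["a", "b", "c", "z"], seen={"a", "b", "c"}, limit=2))
```

+The last call shows the case where the over-fetch is not enough and the caller receives fewer than `limit` papers.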
+---
+## Routers
+### `app/routers/search.py`
+```python
+@router.get("/search", response_class=HTMLResponse)
+async def search(request: Request, q: str = "", user_id=Cookie(...)):
+    papers = []
+    if q.strip():
+        papers = await arxiv_svc.search(q.strip())
+    state      = await us.ensure_loaded(user_id)
+    saved_ids  = set(state.positive_list)
+    dismissed_ids = set(state.negative_list)
+    for p in papers:
+        p["saved"]     = p["arxiv_id"] in saved_ids
+        p["dismissed"] = p["arxiv_id"] in dismissed_ids
+    if request.headers.get("HX-Request"):
+        return templates.TemplateResponse(request, "partials/search_results.html",
+                                          {"papers": papers, "query": q})
+    else:
+        return templates.TemplateResponse(request, "search.html",
+                                          {"papers": papers, "query": q,
+                                           "has_recs": state.has_enough_for_recs()})
+```
+**HTMX detection**: if the request has an `HX-Request` header (set automatically by HTMX), return only the `search_results.html` partial — just the list of cards, no `<html>` wrapper. This is what gets swapped into `#search-results` on the page without a full reload.
+**Annotating papers**: after fetching from arXiv, each paper dict gets `saved` and `dismissed` booleans added. The template uses these to show the correct button state (e.g. already-saved papers show "✓ Saved" instead of "⭐ Save").
+---
+### `app/routers/events.py`
+```python
+@router.post("/{paper_id}/save", response_class=HTMLResponse)
+async def save_paper(paper_id, request, source=Form("search"),
+                     position=Form(0), query_id=Form(""), user_id=Cookie(...)):
+    await db.log_interaction(user_id, paper_id, "save",
+                             source=source, position=position or None)
+    us.record_positive(user_id, paper_id)
+    asyncio.create_task(qdrant_svc.lookup_qdrant_ids([paper_id]))  # background
+    return templates.TemplateResponse(request, "partials/action_buttons.html",
+        {"paper_id": paper_id, "saved": True, "dismissed": False, "source": source})
+@router.post("/{paper_id}/not-interested", response_class=HTMLResponse)
+async def not_interested(paper_id, request, source=Form("search"), ...):
+    await db.log_interaction(user_id, paper_id, "not_interested", source=source)
+    us.record_negative(user_id, paper_id)
+    resp = HTMLResponse(content="")   # empty → HTMX removes the card
+    resp.set_cookie(...)
+    return resp
+```
+**Three things happen on save, in order:**
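+1. `db.log_interaction()`: persistent write to SQLite (awaited, so the handler does not continue until it completes). Note that `position or None` stores position 0 as NULL, so a save on the first-ranked result is logged without a position.
+2. `us.record_positive()`: in-memory update (synchronous, no I/O)
+3. `asyncio.create_task(...)`: background task to look up the Qdrant point ID. It returns immediately and the lookup happens in the background; the response is sent before it finishes. The event loop keeps only a weak reference to tasks, so the Python docs recommend holding a reference until the task completes.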
+**Why background for Qdrant lookup?** The user doesn't need the Qdrant point ID for the save response. They only need it when recommendations are requested. The background task means the save response is fast (~5ms), and by the time the user navigates to the home page to see recommendations, the ID is likely already cached.
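+One way to keep a fire-and-forget task alive until it finishes is to park it in a module-level set, the pattern suggested in the `asyncio.create_task` documentation. The sketch below is not taken from `events.py`; `spawn`, `_background_tasks` and `_fake_lookup` are hypothetical names, and the lookup is simulated.

```python
import asyncio

_background_tasks: set = set()


def spawn(coro) -> asyncio.Task:
    """Start a background task and hold a strong reference until it is done."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


async def _fake_lookup(paper_id: str, results: list) -> None:
    await asyncio.sleep(0)          # stand-in for the Qdrant round trip
    results.append(paper_id)


async def demo():
    results: list = []
    task = spawn(_fake_lookup("1706.03762", results))
    # A real handler would return its response here without awaiting the task.
    await task                      # awaited only so the demo can inspect the result
    await asyncio.sleep(0)          # give done-callbacks a chance to run
    return results, len(_background_tasks)


print(asyncio.run(demo()))
```

+The done-callback removes the task from the set again, so the set does not grow with every save.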
+**Empty response for dismiss**: the button targets `#paper-{id}` with `hx-swap="outerHTML swap:200ms"`. An empty response body makes HTMX replace the entire card element with nothing, so the card fades out and disappears.
+**`source` is forwarded to the response template**: after a save, the rendered `action_buttons.html` partial receives the same `source` value that came in. So the "Remove" button on the now-saved card will log `source="recommendation"` if the save happened from the recs section, not `"search"`.
+---
+### `app/routers/recommendations.py`
+```python
+@router.get("/recommendations", response_class=HTMLResponse)
+async def get_recommendations(request, user_id=Cookie(...)):
+    state = await us.ensure_loaded(user_id)
+    if not state.has_enough_for_recs():
+        return _empty_resp()           # shows "Save 1 paper to unlock recs"
+    rec_arxiv_ids = await qdrant_svc.recommend(
+        positive_arxiv_ids=state.positive_list,
+        negative_arxiv_ids=state.negative_list,
+        seen_arxiv_ids=us.all_seen(user_id),
+        limit=REC_LIMIT,
+    )
+    if not rec_arxiv_ids:
+        return _empty_resp()
+    meta   = await arxiv_svc.fetch_metadata_batch(rec_arxiv_ids)
+    papers = [{**meta[aid], "saved": False, "dismissed": False}
+              for aid in rec_arxiv_ids if aid in meta]
+    return templates.TemplateResponse(request, "partials/recommendations.html",
+                                      {"papers": papers})
+```
+Linear pipeline: load state → check threshold → Qdrant recommend → fetch metadata → render. If anything returns empty at any step, show the empty state partial.
+---
+### `app/routers/saved.py`
+```python
+@router.get("/saved", response_class=HTMLResponse)
+async def saved_papers(request, user_id=Cookie(...)):
+    state    = await us.ensure_loaded(user_id)
+    saved_ids = state.positive_list   # most-recent first
+    papers = []
+    if saved_ids:
+        meta   = await arxiv_svc.fetch_metadata_batch(saved_ids)
+        papers = [{**meta[aid], "saved": True, "dismissed": False}
+                  for aid in saved_ids if aid in meta]
+    return templates.TemplateResponse(request, "saved.html",
+                                      {"papers": papers, "count": len(papers)})
+```
+The simplest router. `positive_list` is the in-memory view of what is saved; since the deque is capped at `REC_POSITIVE_LIMIT`, this page shows at most the 20 most recent saves. Fetch metadata for all of them, render. `saved=True` is hardcoded because every paper on this page is by definition saved, so the action button will show "✓ Saved" + "Remove".
+---
+## Templates
+### `app/templates_env.py`
+```python
+import json
+from jinja2 import Environment
+from fastapi.templating import Jinja2Templates
+def _tojson_parse(value: str) -> list:
+    try:
+        result = json.loads(value)
+        return result if isinstance(result, list) else []
+    except Exception:
+        return []
+templates = Jinja2Templates(directory="app/templates")
+templates.env.filters["tojson_parse"] = _tojson_parse
+```
+One custom filter: `tojson_parse`. SQLite stores authors as a JSON string (`'["Vaswani", "Shazeer"]'`). In the template: `{{ paper.authors | tojson_parse | join(", ") }}`. The filter parses it back to a Python list. Returns `[]` on any error — never crashes the template.
+All routers import `templates` from here. There is only one instance, shared everywhere.
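+The round trip through the JSON string, including the failure cases the filter is meant to absorb, looks like this. The function body is copied from the snippet above; only the name (without the leading underscore) and the sample inputs are added for the sketch.

```python
import json


def tojson_parse(value):
    """Parse a JSON array string into a list; return [] for anything else."""
    try:
        result = json.loads(value)
        return result if isinstance(result, list) else []
    except Exception:
        return []


stored = json.dumps(["Vaswani", "Shazeer"])   # what the authors column holds
print(", ".join(tojson_parse(stored)))
print(tojson_parse('{"not": "a list"}'))      # valid JSON, wrong type
print(tojson_parse("not json at all"))        # invalid JSON
print(tojson_parse(None))                     # NULL column value
```

+Every non-list or unparsable input collapses to `[]`, so `join(", ")` in the template simply renders an empty string.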
+---
+### `app/templates/base.html`
+```html
+<head>
+  <link href="https://cdn.jsdelivr.net/npm/daisyui@4.12.10/dist/full.min.css" rel="stylesheet"/>
+  <script src="https://cdn.tailwindcss.com"></script>
+  <script src="https://unpkg.com/htmx.org@1.9.12"></script>
+  <style>
+    .htmx-swapping { opacity: 0; transition: opacity 200ms ease-out; }
+    .htmx-request .htmx-indicator { display: inline-block !important; }
+    .htmx-indicator { display: none; }
+  </style>
+</head>
+<body>
+  <div class="navbar bg-base-100 shadow-sm px-4">
+    <a href="/" class="text-xl font-bold text-primary">📄 ArXiv Rec</a>
+    <a href="/search" class="btn btn-ghost btn-sm">Search</a>
+    <a href="/saved"  class="btn btn-ghost btn-sm">Saved</a>
+  </div>
+  <main class="container mx-auto px-4 py-6 max-w-4xl">
+    {% block content %}{% endblock %}
+  </main>
+</body>
+```
+Zero build step. TailwindCSS + DaisyUI from CDN, HTMX from CDN.
+**HTMX CSS hooks:**
+- `.htmx-swapping` — HTMX adds this class to an element just before it's replaced. The `opacity: 0` + transition creates the fade-out animation on dismissed cards.
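+- `.htmx-indicator`: hidden by default. `.htmx-request .htmx-indicator` makes it visible while an enclosing element has an HTMX request in flight (HTMX adds the `htmx-request` class to the element that issued the request, or to the element named by `hx-indicator`). Used for the loading spinners next to buttons.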
+---
+### `app/templates/index.html`
+```html
+<!-- Search bar (submitted via HTMX) -->
+<form hx-get="/search"
+      hx-target="#search-results"
+      hx-push-url="true"
+      hx-indicator="#search-spinner">
+  <input type="text" name="q" placeholder="e.g. transformer attention" />
+  <button>Search <span id="search-spinner" class="htmx-indicator loading ..."></span></button>
+</form>
+<!-- Recommendations: loaded after page paint -->
+<div id="rec-section"
+     hx-get="/api/recommendations"
+     hx-trigger="load"
+     hx-swap="innerHTML">
+  Loading...
+</div>
+<!-- Search results: swapped here by HTMX -->
+<div id="search-results"></div>
+```
+**`hx-trigger="load"`**: the `#rec-section` div fires the HTMX request as soon as it loads. The page renders immediately with "Loading..." and the recs appear ~500ms later. This way the page never feels slow — you see content instantly, then recs fill in.
+**`hx-push-url="true"`**: when a search fires, HTMX pushes `/search?q=...` to the browser history. So the back button works and the URL is shareable.
+---
+### `app/templates/partials/paper_card.html`
+```html
+<div class="card bg-base-100 shadow-sm border border-base-300 p-4 space-y-2"
+     id="paper-{{ paper.arxiv_id }}">
+  <div class="flex items-start justify-between gap-2">
+    <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
+       target="_blank" class="font-semibold text-primary hover:underline">
+      {{ paper.title }}
+    </a>
+    <span class="badge badge-outline badge-sm">{{ paper.category }}</span>
+  </div>
+  <div class="text-xs text-base-content/50">
+    [{{ paper.arxiv_id }}]
+    {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
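+    {# authors_list is assumed here to be derived from paper.authors via the tojson_parse filter #}
+    {% set authors_list = paper.authors | tojson_parse %}
+    {% if authors_list %} · {{ authors_list | join(", ") }}{% endif %}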
+  </div>
+  <p class="text-sm line-clamp-3">{{ paper.abstract }}</p>
+  <div id="actions-{{ paper.arxiv_id }}">
+    {% include "partials/action_buttons.html" %}
+  </div>
+</div>
+```
+Two IDs per card: `#paper-{id}` on the outer div (target for dismiss — the whole card is removed) and `#actions-{id}` on the buttons div (target for save — only the buttons are swapped to "Saved" state).
+`line-clamp-3` is a Tailwind utility that truncates the abstract to 3 lines with an ellipsis.
+---
+### `app/templates/partials/action_buttons.html`
+```html
+{% set pid     = paper_id if paper_id is defined else paper.arxiv_id %}
+{% set is_saved = saved if saved is defined else (paper.saved | default(false)) %}
+{% set _source  = source if source is defined else "search" %}
+{% if is_saved %}
+  <button class="btn btn-success btn-xs" disabled>✓ Saved</button>
+  <button hx-post="/api/papers/{{ pid }}/not-interested"
+          hx-target="#paper-{{ pid }}"
+          hx-swap="outerHTML swap:200ms"
+          hx-vals='{"source": "{{ _source }}"}'>Remove</button>
+{% else %}
+  <button hx-post="/api/papers/{{ pid }}/save"
+          hx-target="#actions-{{ pid }}"
+          hx-swap="innerHTML"
+          hx-vals='{"source": "{{ _source }}", "position": "{{ position | default(0) }}"}'>
+    ⭐ Save
+  </button>
+  <button hx-post="/api/papers/{{ pid }}/not-interested"
+          hx-target="#paper-{{ pid }}"
+          hx-swap="outerHTML swap:200ms"
+          hx-vals='{"source": "{{ _source }}"}'>
+    ✕ Not interested
+  </button>
+{% endif %}
+```
+This partial is used in two contexts:
+1. **Inside `paper_card.html`** — `paper` is defined, `paper_id` is not
+2. **As a direct response from `events.py/save_paper`** — `paper_id` is defined, `paper` is not
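+The `{% set pid = ... if ... is defined else ... %}` pattern handles both safely, because the conditional expression only evaluates the branch it selects. Jinja2's `default()` filter would fail here: filter arguments are evaluated eagerly, so `paper_id | default(paper.arxiv_id)` raises an undefined error whenever `paper` is missing, even if `paper_id` is set.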
+**`hx-vals`** sends additional form fields with the HTMX request. The `source` and `position` values ride along with every button click to be logged in the DB.
+---
+### `app/templates/partials/recommendations.html`
+```html
+{% if papers %}
+  <div class="space-y-3">
+    {% for paper in papers %}
+      {% set position = loop.index0 %}
+      {% set source = "recommendation" %}
+      {% include "partials/paper_card.html" %}
+    {% endfor %}
+  </div>
+  <div class="text-center pt-3">
+    <button hx-get="/api/recommendations"
+            hx-target="#rec-section"
+            hx-swap="innerHTML"
+            hx-indicator="#rec-refresh-spinner">
+      ↻ Show different recommendations
+      <span id="rec-refresh-spinner" class="htmx-indicator loading ..."></span>
+    </button>
+  </div>
+{% else %}
+  {% include "partials/empty_recs.html" %}
+{% endif %}
+```
+`{% set source = "recommendation" %}` before the include ensures that every action button rendered from this partial carries `source="recommendation"` in its `hx-vals`. The events router (`events.py`) will log that source to the DB.
+The refresh button re-triggers the same `/api/recommendations` endpoint. ANN search is approximate, so identical results are not guaranteed, but with unchanged inputs the list will usually be much the same; the refresh is most useful after new saves or dismissals have changed the query.
+---
+## Tests
+### Test Isolation Pattern
+Every test that touches the DB or in-memory cache uses this fixture:
+```python
+import asyncio
+import pytest
+from fastapi.testclient import TestClient
+@pytest.fixture
+def client(tmp_path, monkeypatch):
+    import app.config as cfg
+    import app.db as db_mod
+    db_path = str(tmp_path / "test.db")
+    # Point both the config and the db module at a fresh temp DB
+    monkeypatch.setattr(cfg, "DB_PATH", db_path)
+    monkeypatch.setattr(db_mod, "DB_PATH", db_path)
+    # Clear in-memory caches so tests don't bleed into each other
+    import app.user_state as us
+    us._cache.clear()
+    from app.qdrant_svc import _client
+    _client.cache_clear()    # lru_cache singleton — need to clear between tests
+    from app.main import app
+    asyncio.run(db_mod.init_db())    # run the async schema setup to completion
+    with TestClient(app, raise_server_exceptions=True) as c:
+        yield c
+```
+`tmp_path` is a pytest built-in that gives each test its own temporary directory. Monkeypatching `DB_PATH` means every test gets a fresh, empty SQLite file. Clearing `us._cache` and calling `_client.cache_clear()` ensures no in-process state bleeds between tests.
+### Mocking Pattern for Live Services
+Tests that need recommendations mock both the Qdrant service and the arXiv metadata fetcher:
+```python
+def test_recommendations_after_save(client, monkeypatch):
+    import app.qdrant_svc as qs
+    import app.arxiv_svc as arxiv
+    async def fake_recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
+        return ["1706.03762"]
+    monkeypatch.setattr(qs, "recommend", fake_recommend)
+    async def fake_batch(ids):
+        return {"1706.03762": {"arxiv_id": "1706.03762",
+                               "title": "Attention Is All You Need", ...}}
+    monkeypatch.setattr(arxiv, "fetch_metadata_batch", fake_batch)
+    client.get("/")
+    client.post("/api/papers/0704.0002/save", data={"source": "search"})
+    resp = client.get("/api/recommendations")
+    assert "Attention Is All You Need" in resp.text
+```
+`monkeypatch.setattr` replaces the real function for the duration of the test, then automatically restores it. This lets integration tests run without network access.
+---
+## Data Flow Summary
+```
+User types "transformer attention" in search bar
+  │
+  │  HTMX: GET /search?q=transformer+attention  (HX-Request: true)
+  ▼
+search.py: arxiv_svc.search("transformer attention")
+  │  → GET https://export.arxiv.org/api/query?search_query=all:transformer+attention
+  │  ← Atom XML, 10 entries
+  │  → parse → cache in paper_metadata table
+  │  → annotate with saved/dismissed from user_state
+  ▼
+returns partials/search_results.html → HTMX swaps into #search-results
+  │
+User clicks ⭐ Save on paper 1706.03762
+  │
+  │  HTMX: POST /api/papers/1706.03762/save  {source: "search", position: 3}
+  ▼
+events.py:
+  1. db.log_interaction(user_id, "1706.03762", "save", source="search", position=3)
+  2. us.record_positive(user_id, "1706.03762")
+  3. asyncio.create_task(qdrant_svc.lookup_qdrant_ids(["1706.03762"]))  ← background
+  ▼
+returns partials/action_buttons.html (saved=True) → HTMX swaps buttons in-place
+  [Background task]
+  qdrant_svc.lookup_qdrant_ids(["1706.03762"])
+    → db.get_qdrant_ids_batch: miss
+    → Qdrant scroll filter: arxiv_id = "1706.03762"
+    ← point_id = 523419
+    → db.save_qdrant_id("1706.03762", 523419)
+User navigates to home page /
+  │
+  │  HTMX: GET /api/recommendations  (hx-trigger="load")
+  ▼
+recommendations.py:
+  1. us.ensure_loaded(user_id) → positives = ["1706.03762"]
+  2. qdrant_svc.recommend(positive=["1706.03762"], negative=[], seen={"1706.03762"})
+       → db.get_qdrant_ids_batch(["1706.03762"]) → {"1706.03762": 523419}  (already cached)
+       → Qdrant query_points with RecommendQuery(positive=[523419])
+       ← [point_612003, point_88341, ...]
+       → filter out seen papers in Python
+       ← ["2302.13971", "2307.09288", ...]
+  3. arxiv_svc.fetch_metadata_batch(["2302.13971", "2307.09288", ...])
+       → check paper_metadata cache: some hits, some misses
+       → arXiv API batch fetch for misses → cache results
+  ▼
+returns partials/recommendations.html → HTMX swaps into #rec-section
+```
+---
+## File Count Summary
+| File | Lines | Job |
+|---|---|---|
+| `app/config.py` | 36 | All settings |
+| `app/db.py` | 185 | SQLite: 3 tables, 8 functions |
+| `app/arxiv_svc.py` | 159 | arXiv API + metadata cache |
+| `app/user_state.py` | 112 | In-memory deque cache per user |
+| `app/qdrant_svc.py` | 166 | Qdrant ID lookup + Recommend |
+| `app/templates_env.py` | ~20 | Shared Jinja2 env + tojson_parse |
+| `app/main.py` | 54 | FastAPI app + home route |
+| `app/routers/search.py` | 56 | GET /search |
+| `app/routers/events.py` | 75 | POST save + not-interested |
+| `app/routers/recommendations.py` | 62 | GET /api/recommendations |
+| `app/routers/saved.py` | 47 | GET /saved |
+| Templates | ~200 | All HTML |
+| Tests | ~600 | 54 tests across 6 files |

docs/walkthroughs/02-Phase2-MultiInterest-Recommender.md ADDED

	@@ -0,0 +1,175 @@

+# Phase 2 — Multi-Interest Recommender Walkthrough
+## What Was Built
+A PinnerSage-style multi-interest recommendation engine that replaces Phase 1's raw-ID Qdrant queries with computed EWMA user profile embeddings, Ward hierarchical clustering for interest detection, heuristic re-ranking, and MMR diversity enforcement.
+**The old pipeline (Phase 1):**
+```
+User saves papers → raw IDs → Qdrant BEST_SCORE → results
+```
+**The new pipeline (Phase 2):**
+```
+User saves papers
+    ↓
+EWMA profiles update (background, non-blocking)
+    ↓
+Ward clustering → K distinct interest medoids (auto K per user)
+    ↓
+Qdrant prefetch + RRF fusion (~15-25ms, single API call)
+    ↓
+Heuristic re-ranking of ~100 candidates (~1-2ms)
+    ↓
+MMR diversity selection → top 10-12 papers (<1ms)
+    ↓
+Exploration injection → 1-2 serendipitous papers
+    ↓
+Render HTML via HTMX
+```
+**Total pipeline latency: <30ms** (excluding metadata fetch if cold)
+---
+## Why This Architecture
+This architecture was chosen after deep research documented in [03-MultiInterest-Recommender-Architecture.md](../research/03-MultiInterest-Recommender-Architecture.md). The key insights:
+### The Interest Collapse Problem
+A single average embedding for a user interested in both *NLP* and *computer vision* lands in a region of embedding space that represents neither interest. Pinterest called this the "energy-boosting breakfast" problem. PinnerSage (KDD 2020) solved it with multiple user vectors.
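+A toy example makes the collapse concrete. The two-dimensional vectors below are invented for illustration (real BGE-M3 embeddings have far more dimensions), and `unrelated` stands for a hypothetical third topic that happens to lie between the two real interests.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


nlp, vision = [1.0, 0.0], [0.0, 1.0]
unrelated = [1.0, 1.0]                      # hypothetical topic between the two

profile = mean_vector([nlp, vision])        # single averaged user vector
print(round(cosine(profile, nlp), 3))       # similarity to a real interest
print(round(cosine(profile, unrelated), 3)) # similarity to the in-between topic
```

+The averaged profile is more similar to the in-between topic than to either topic the user actually saved, which is why one query vector per interest cluster retrieves better candidates.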
+### Why EWMA Over Rolling Windows
+Rolling windows (last 30 days) lose valuable historical signal abruptly. EWMA (Exponentially Weighted Moving Average) provides smooth decay:
+- **Long-term (α=0.10):** Effective window ~20 interactions. Tracks enduring research interests.
+- **Short-term (α=0.40):** Effective window ~3-5 interactions. Captures current session context.
+- **Negative (α=0.15):** Tracks papers the user explicitly dislikes.
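+The update rule behind these profiles is a one-line blend. The sketch below is an assumption about the shape of `ewma_update`, written with plain lists rather than numpy; whether the real function re-normalises the result is not stated here. The `effective_window` helper uses the common rule of thumb N ≈ 2/α − 1.

```python
def ewma_update(current, new_embedding, alpha):
    """Blend a new embedding into the running profile with weight alpha."""
    if current is None:                       # first interaction seeds the profile
        return list(new_embedding)
    return [(1 - alpha) * c + alpha * n for c, n in zip(current, new_embedding)]


def effective_window(alpha):
    """Rule-of-thumb number of interactions that dominate the average."""
    return 2 / alpha - 1


profile = None
for emb in ([1.0, 0.0], [0.0, 1.0]):
    profile = ewma_update(profile, emb, alpha=0.40)
print([round(x, 2) for x in profile])
print(round(effective_window(0.10)), round(effective_window(0.40)))
```

+With α=0.40 the second paper already contributes 40% of the profile, while α=0.10 needs roughly twenty interactions to shift the vector by a comparable amount.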
+### Why Ward Over K-Means
+K-Means requires pre-specifying K (number of clusters). Ward hierarchical clustering auto-determines K per user via a distance threshold — a user with 2 interests gets 2 clusters, a user with 5 gets 5. No hyperparameter tuning per user.
+### Why LightGBM Over BGE-reranker
+The older `Research-Recommender_Technical_Roadmap.md` suggested BGE-reranker-v2 at ~800ms for 100 candidates on CPU. LightGBM scores 500 candidates in 2-5ms. On Render Free Tier (CPU-only, 512MB RAM), this is the only viable option. Currently using a heuristic scorer with the same feature interface — drop-in LightGBM upgrade when training data accumulates.
+---
+## 3-Tier Cascading Fallback
+The recommender degrades gracefully based on how much data the user has:
+| User State | Tier | Strategy | Latency |
+|---|---|---|---|
+| ≥5 saves | **Tier 1** | Clustering → RRF → Rerank → MMR → Explore | ~25ms |
+| 3-4 saves | **Tier 2** | EWMA long-term vector → ANN search | ~10ms |
+| 1-2 saves | **Tier 3** | Qdrant BEST_SCORE (Phase 1 path) | ~15ms |
+| 0 saves | Empty | "Save at least 1 paper..." | 0ms |
+Each tier falls through to the next if it can't produce results.
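+The tier selection and fall-through can be sketched as follows. The thresholds come from the table above; the function names and the `strategies` mapping of zero-argument callables are hypothetical stand-ins for the real pipeline stages.

```python
def select_tier(n_saves: int) -> str:
    """Map a user's save count to the starting tier from the table."""
    if n_saves >= 5:
        return "tier1"
    if n_saves >= 3:
        return "tier2"
    if n_saves >= 1:
        return "tier3"
    return "empty"


def recommend_with_fallback(n_saves, strategies):
    """Try the starting tier, then fall through to lower tiers on empty results."""
    order = ["tier1", "tier2", "tier3"]
    start = select_tier(n_saves)
    if start == "empty":
        return "empty", []
    for tier in order[order.index(start):]:
        results = strategies[tier]()
        if results:
            return tier, results
    return "empty", []


strategies = {
    "tier1": lambda: [],                 # pretend clustering produced nothing
    "tier2": lambda: ["2302.13971"],
    "tier3": lambda: ["2307.09288"],
}
print(recommend_with_fallback(6, strategies))
print(recommend_with_fallback(0, strategies))
```

+A user with six saves starts at Tier 1, and because that stage returns nothing in this example, the Tier 2 result is served instead.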
+---
+## New Files Created
+### `app/recommend/__init__.py`
+Package init for the recommendation engine module.
+### `app/recommend/profiles.py`
+EWMA temporal embedding profiles:
+- `ewma_update(current, new_embedding, alpha)` — core blending function
+- `update_on_save(user_id, paper_embedding)` — updates both LT and ST profiles
+- `update_on_dismiss(user_id, paper_embedding)` — updates negative profile
+- `load_profile()` / `save_profile()` — SQLite persistence as binary numpy blobs (4KB each)
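The 4KB figure follows from the vector size: a 1024-dimensional float32 vector (the BGE-M3 dense size) occupies 1024 × 4 = 4096 bytes. The sketch below shows the round trip using the standard-library `array` module as a stand-in for the numpy blobs the project stores; `to_blob` and `from_blob` are hypothetical helper names.

```python
from array import array

EMBEDDING_DIM = 1024  # BGE-M3 dense vector size


def to_blob(vec):
    """Pack floats as 32-bit values, suitable for a SQLite BLOB column."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack a BLOB written by to_blob back into a list of floats."""
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


blob = to_blob([0.0] * EMBEDDING_DIM)
blob_size = len(blob)  # bytes per stored profile vector
```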
+### `app/recommend/clustering.py`
+Ward hierarchical clustering:
+- `compute_clusters(paper_ids, embeddings)` → list of `InterestCluster`
+- Each cluster: medoid paper ID, medoid embedding, member paper IDs, importance score
+- Auto K (1-7 clusters), recency-weighted importance
+- Falls back to single cluster if <5 saved papers
+### `app/recommend/reranker.py`
+Heuristic scorer (LightGBM-ready):
+- `compute_features()` → 4 features per candidate: cosine_sim_LT, cosine_sim_ST, paper_age, rrf_position
+- `heuristic_score()` → weighted sum: 45% relevance, 25% session, 20% recency, 10% rank
+- `rerank_candidates()` → end-to-end: features → scores → sorted output
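A minimal sketch of the weighted sum is shown below. The four feature names and the 45/25/20/10 weights come from the list above; how `paper_age` and `rrf_position` are mapped onto a 0-1 scale is not documented here, so the half-life decay (with age assumed to be in days) and the reciprocal rank mapping are assumptions.

```python
def recency_score(paper_age_days, half_life_days=365.0):
    """Assumed mapping: 1.0 for a brand-new paper, halving every half-life."""
    return 0.5 ** (max(paper_age_days, 0.0) / half_life_days)


def rank_score(rrf_position):
    """Assumed mapping: position 0 scores 1.0, later positions score less."""
    return 1.0 / (1.0 + rrf_position)


def heuristic_score(features):
    """Weighted sum: 45% relevance, 25% session, 20% recency, 10% rank."""
    return (
        0.45 * features["cosine_sim_LT"]
        + 0.25 * features["cosine_sim_ST"]
        + 0.20 * recency_score(features["paper_age"])
        + 0.10 * rank_score(features["rrf_position"])
    )


fresh = {"cosine_sim_LT": 0.8, "cosine_sim_ST": 0.6, "paper_age": 30, "rrf_position": 2}
stale = dict(fresh, paper_age=3650)  # same paper, ten years older
ranked = sorted([stale, fresh], key=heuristic_score, reverse=True)
```

Because the scorer only sees a feature dictionary, a trained model that consumes the same four features could replace `heuristic_score` without touching the caller.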
+### `app/recommend/diversity.py`
+MMR diversity + exploration:
+- `mmr_rerank(query, candidates, scores, λ=0.6, top_k=20)` — greedy diverse selection
+- `inject_exploration(selected, pool, n_explore=2)` — random serendipity injection
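The greedy MMR selection can be sketched as follows. This version takes precomputed relevance scores and therefore drops the `query` argument from the signature above; the name `mmr_select` and the use of cosine similarity for redundancy are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)


def mmr_select(embeddings, scores, lam=0.6, top_k=20):
    """Return candidate indices chosen greedily by Maximal Marginal Relevance."""
    remaining = list(range(len(embeddings)))
    selected = []
    while remaining and len(selected) < top_k:

        def mmr_value(i):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is a different topic.
embeddings = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
scores = [0.9, 0.85, 0.6]
picked = mmr_select(embeddings, scores, lam=0.6, top_k=2)
```

Setting `lam=1.0` reduces the selection to a plain relevance sort, which is a quick way to check the diversity term is doing something.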
+---
+## Modified Files
+### `app/db.py`
+- Added `user_profiles` table — EWMA vectors as BLOBs with interaction counts
+- Added `user_clusters` table — Ward clustering results (medoid IDs, importance, paper lists)
+- Added 4 helper functions: `get_user_profile`, `upsert_user_profile`, `save_user_clusters`, `get_user_clusters`
+### `app/qdrant_svc.py`
+- Added `get_paper_vectors()` — fetch actual BGE-M3 embeddings from Qdrant (needed for EWMA)
+- Added `search_by_vector()` — raw ANN search by embedding vector
+- Added `multi_interest_search()` — prefetch + RRF fusion in a single API call
+- Imported new Qdrant models: `Prefetch`, `FusionQuery`, `Fusion`
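Qdrant performs the fusion server-side, so the project does not need its own implementation. The sketch below only illustrates the Reciprocal Rank Fusion formula on plain ranked lists. The function name is hypothetical, and `k=60` is the value the commit message cites for the hybrid search pipeline; Qdrant's internal constant may differ.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked ID lists: score(id) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# One result list per interest cluster; "b" appears in both and rises to the top.
fused = rrf_fuse([["a", "b", "c"], ["b", "d"]])
```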
+### `app/routers/events.py`
+- Save handler now triggers background EWMA profile update (LT + ST) via `asyncio.create_task`
+- Dismiss handler triggers background negative profile update
+- Both are non-blocking — user response is sent before the update completes
+### `app/routers/recommendations.py`
+- Complete rewrite with 3-tier cascading fallback
+- Tier 1: full 5-step pipeline (cluster → retrieve → rerank → MMR → explore)
+- Tier 2: EWMA long-term single-vector search
+- Tier 3: original BEST_SCORE (unchanged from Phase 1)
+### `requirements.txt`
+- Added `numpy>=1.24` — vector computations
+- Added `scipy>=1.11` — Ward hierarchical clustering
+---
+## What Was NOT Changed
+These files are intentionally untouched:
+- `app/user_state.py` — still manages ID deques for the hot cache
+- `app/routers/search.py` — search is a separate concern (see PHASE2-Hybrid-Search-Plan)
+- `app/routers/saved.py` — saved papers page is unaffected
+- All templates — no UI changes needed, same HTMX partials
+---
+## Test Coverage
+| Test File | Tests | Description |
+|---|---|---|
+| `test_profiles.py` | 11 | EWMA math, convergence, normalisation, DB round-trips |
+| `test_clustering.py` | 10 | Ward clustering, medoid validity, max clusters, DB persistence |
+| `test_reranker_diversity.py` | 13 | Heuristic scoring, MMR diversity, exploration injection |
+| Existing tests | 52 | Integration, events, saved page, qdrant_svc |
+| **Total** | **86 passed** | 2 pre-existing live Qdrant failures (network-dependent) |
+---
+## Upgrade Path: Heuristic → LightGBM
+The heuristic scorer in `reranker.py` needs no training data and is designed so that LightGBM can later replace it as a drop-in:
+1. **When:** Interactions table has ≥500 save/dismiss rows
+2. **How:** Train offline with `lgb.train(params={'objective': 'lambdarank'}, ...)`
+3. **Where:** Save model to `models/reranker.lgb`, replace `heuristic_score()` with `model.predict(features)`
+4. **Impact:** Same features, same interface — zero code changes in the router
+---
+## Key Design Decisions & Rationale
+| Decision | Chosen | Rejected | Why |
+|---|---|---|---|
+| User profile | EWMA (3 vectors) | Rolling window | Smooth decay, no abrupt signal loss |
+| Clustering | Ward hierarchical | Fixed K-Means | Auto-determines K per user |
+| Re-ranking | Heuristic → LightGBM | BGE-reranker-v2 | ~800ms (100 candidates) → 2-5ms (500 candidates) on CPU |
+| Diversity | MMR (λ=0.6) | Random sampling | Principled relevance/diversity trade-off |
+| Exploration | Random injection (2 papers) | None | Prevents filter bubbles |
+| Multi-query | Qdrant prefetch+RRF | Sequential queries | Single network round-trip |

docs/walkthroughs/03-Code-Summary-and-Test-Plan.md ADDED

	@@ -0,0 +1,89 @@

+# Codebase Summary & Testing Plan
+This document provides a concise summary of the codebase's current state (Phases 1 & 2) and outlines a comprehensive testing plan to ensure reliability, accuracy, and performance in production.
+---
+## 1. What Code Has Been Written So Far
+The current application is a fully functional FastAPI + HTMX research paper discovery platform powered by BGE-M3 embeddings, Qdrant vector search, and a multi-interest recommendation engine.
+### 🏗️ Backend Core
+- **`app/main.py`**: The FastAPI application entrypoint. Configures routing, CORS, and exception handling.
+- **`app/config.py`**: Pydantic configuration loading environment variables (Qdrant URLs, API keys, tuning parameters).
+- **`app/db.py`**: Lightweight SQLite (WAL mode) interface via `aiosqlite`. Manages schemas for `events` (interaction tracking), `arxiv_cache` (metadata), `user_profiles` (EWMA vectors), and `user_clusters`.
+- **`app/user_state.py`**: In-memory cache using Python `deque` to hold each user's most recent interaction IDs (latest 50 positive / 20 negative) for fast lookups.
+### 🧠 Recommendation Engine (`app/recommend/`)
+- **`profiles.py`**: Computes exponential moving averages (EWMA) to create Long-Term (α=0.10), Short-Term (α=0.40), and Negative (α=0.15) semantic profile vectors for each user.
+- **`clustering.py`**: Implements Ward hierarchical clustering to identify multiple distinct interest areas (up to 7) based on a user's liked papers. Extracts actual paper embeddings as cluster medoids to prevent "topic drift."
+- **`reranker.py`**: Heuristic scoring system combining long-term relevance (45%), session context (25%), paper recency (20%), and RRF retrieval rank (10%). Includes the feature extraction pipeline designed for a future drop-in LightGBM upgrade.
+- **`diversity.py`**: Implements Maximal Marginal Relevance (MMR) with λ=0.6 to ensure top recommendations are diverse, plus an exploration injector to randomly add 1-2 serendipitous papers to break filter bubbles.
+### 🔌 External Services
+- **`app/qdrant_svc.py`**: Communicates with the Qdrant vector database. Handles `BEST_SCORE` recommendations, vector fetching, dense search, and the new Phase 2 **Prefetch + Reciprocal Rank Fusion (RRF)** parallel search.
+- **`app/arxiv_svc.py`**: Fetches fresh metadata via the public arXiv HTTP API and caches it in SQLite to reduce network calls.
+### 🌐 API Routers (`app/routers/`)
+- **`recommendations.py`**: Implements the 3-tier cascading fallback pipeline (Multi-interest clustering → Single EWMA vector → Raw ID Qdrant Recommend API). Returns HTMX-rendered partials.
+- **`events.py`**: Handles user interactions (`/save`, `/not-interested`). Fires background `asyncio` tasks to recalculate EWMA user profiles asynchronously.
+- **`search.py` & `saved.py`**: Handle explicit user queries and listing saved papers.
+### 🎨 Frontend (`app/templates/`)
+- Pure HTML + HTMX frontend utilizing Jinja2 templating. Uses TailwindCSS/DaisyUI for UI components without requiring a Node.js build step.
+---
+## 2. Comprehensive Testing Plan
+The current test suite has **86 passing tests** executing via `pytest`. Our testing strategy is split into three layers: Automated, Manual, and Analytics-based evaluation.
+### A. Automated Testing (Current & Ongoing)
+#### 1. Unit Tests (Logic Verification)
+- **Math & Vectors (`test_profiles.py`, `test_clustering.py`)**:
+  - Ensure EWMA updates decay exactly as expected over simulated interaction horizons.
+  - Verify Ward clustering correctly bins distantly separated vectors and correctly assigns the medoid.
+  - Verify L2-normalization constraints are never violated.
+- **Algorithms (`test_reranker_diversity.py`)**:
+  - Verify MMR handles edge cases gracefully (e.g., clustered candidates, `len(candidates) < K`).
+  - Verify the heuristic scorer evaluates recency dates safely (no crash on missing or malformed `published` keys).
+#### 2. Integration Tests (System Wiring)
+- **Database (`test_db.py`)**: Ensure SQLite writes do not lock the DB, caching functions hit the DB when missing from memory, and `user_clusters` persist complex JSON payloads.
+- **Endpoints (`test_integration.py` / `test_routers.py`)**:
+  - Issue standard `TestClient` API requests.
+  - Verify `/save` schedules the background EWMA update before the request returns; the update itself may complete afterwards.
+  - Verify `/api/recommendations` gracefully steps down the 3-Tier fallback logic if vectors are absent.
+#### 3. Service Mocks vs Live E2E
+- **Mocked Qdrant**: 95% of tests mock Qdrant to ensure fast, deterministic offline execution.
+- **Live Qdrant Pipeline**: 2 pre-existing tests in `test_qdrant_svc.py` hit the live Qdrant instance. (These are network-dependent; failures here usually indicate transient timeouts rather than logic regressions.)
+### B. Manual QA & UX Flow Verification
+Before pushing to production branches, perform manual browser-based verification:
+1. **Cold Start Flow**: Load application as a new user (cleared cookies). Verify that recommendations inform the user to "Save at least 1 paper."
+2. **Phase 1 Tier Check**: Search for "Transformers" and save 1 paper. Verify that recommendations populate via Qdrant's raw-ID API.
+3. **Phase 2a/b Tier Check**: Save 5 distinct papers across 2 distinct topics (e.g., 3 LLM papers, 2 quantum computing papers). Reload the application and inspect logs to verify **Ward Clustering** kicked in and the Qdrant Multi-Interest Prefetch query was logged.
+4. **Resiliency**: Quickly mash the "Save" and "Not Interested" buttons on the UI to test visual HTMX snappiness and ensure no `sqlite3.OperationalError: database is locked` occurs under rapid background event firing.
+### C. Evaluation & Production Metrics (Phase 4 Validation)
+Once users enter the system, the platform shifts from unit-test verification to **Data Science verification**.
+#### 1. Offline Evaluation (Historical Split)
+Run a script simulating user interactions using the time-split method (train on behavior before day X, test on day X+1):
+- **NDCG@10**: Evaluates how well we rank the papers the user ultimately saves or clicks within the top 10.
+- **Hit Rate@10**: Measures the percentage of users (per session) for whom at least 1 of the top 10 recommendations matches a paper they went on to interact with.
+- **Coverage**: Evaluates if the recommendation queue is pulling from a wide diversity of our 1.6M paper collection, or if we are stuck recommending the same 100 benchmark papers.
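The three offline metrics can be computed as sketched below, assuming binary relevance (a held-out paper is either saved/clicked or not). All function names and the input shapes are hypothetical; the evaluation script itself is not part of this commit.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(recommended, relevant, k=10):
    """NDCG@k with binary relevance; `relevant` is a set of held-out IDs."""
    gains = [1.0 if pid in relevant else 0.0 for pid in recommended]
    ideal_dcg = dcg_at_k([1.0] * min(len(relevant), k), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def hit_rate_at_k(per_user, k=10):
    """Share of (recommended, relevant) pairs with at least one hit in the top k."""
    if not per_user:
        return 0.0
    hits = sum(1 for rec, rel in per_user if any(pid in rel for pid in rec[:k]))
    return hits / len(per_user)


def catalog_coverage(per_user, catalog_size, k=10):
    """Fraction of the catalog that appears in anyone's top k."""
    seen = {pid for rec, _ in per_user for pid in rec[:k]}
    return len(seen) / catalog_size


evaluation = [(["a", "b"], {"a"}), (["b", "c"], {"z"})]
hit_rate = hit_rate_at_k(evaluation)
coverage = catalog_coverage(evaluation, catalog_size=6)
```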
+#### 2. Online Telemetry (Production UX)
+- **CTR (Click-Through Rate)**: Measure the ratio of clicks to recommendation views. (Target: 2-5%).
+- **Save Rate**: The definitive proxy for success. (Target: 1-2%).
+- **Dwell Time Analysis**: Monitor if clicked recommendations result in > 30-second reading sessions, discounting bounce-clicks.
+---
+### Moving Forward
+With the multi-interest foundation established, the immediate next focus is upgrading the explicit search bar (the hybrid search planned in PHASE2-Hybrid-Search-Plan, using Zilliz sparse search) and observing cluster calculation stability on Render.