ResearchIT / docs /TASK-TRACKER.md

siddhm11

Phase 6.5: Pipeline telemetry, search UX fixes, latency profiling

ec67b2f 20 days ago

	# ResearchIT — Master Task Tracker

	> Purpose: Single source of truth for all completed, in-progress, and upcoming work.
	> Last updated: 2026-05-05
	> Current phase: Phase 6.5 (Instrumentation) — COMPLETE ✔ | Phase 7 next

	---

	## Legend

	- `[x]` — Done
	- `[/]` — In progress
	- `[ ]` — Not started
	- `[~]` — Intentionally deferred (blocked by data/users/scale)
	- `[!]` — Backlog item (documented, not yet coded)

	---

	## Phase 1: Zero-ML Recommender ✅ COMPLETE

	> Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.

	- [x] Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
	- Collection: `arxiv_bgem3_dense`, 1024-dim dense vectors
	- File: `app/qdrant_svc.py` → `_get_client()`
	- [x] BEST_SCORE Recommend API (raw paper IDs → Qdrant)
	- File: `app/qdrant_svc.py` → `recommend()`
	- [x] arXiv keyword API search (placeholder — replaced in Phase 3)
	- File: `app/arxiv_svc.py` → `search()`
	- [x] arXiv metadata fetching + SQLite cache
	- File: `app/arxiv_svc.py` → `fetch_metadata_batch()`
	- [x] SQLite database schema (interactions, paper_metadata)
	- File: `app/db.py` → `init_db()`
	- WAL mode, async via aiosqlite
	- [x] Cookie-based user identity
	- File: `app/config.py` → `COOKIE_NAME`
	- [x] User state management (positive/negative deques)
	- File: `app/user_state.py` → `UserState`
	- [x] Save/Dismiss event logging
	- File: `app/routers/events.py`
	- [x] HTMX + Jinja2 frontend (search, recs, save, dismiss)
	- Files: `app/templates/` (base.html, index.html, search.html, saved.html, partials/)
	- [x] Test suite — 55 tests passing

	Gaps: None.

	---

	## Phase 2a: EWMA Profile Embeddings ✅ COMPLETE

	> Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.

	- [x] Create `app/recommend/` module with `__init__.py`
	- [x] Create `app/recommend/profiles.py` — EWMA computation + storage
	- Long-term: α=0.03 ✅ (corrected from 0.10 per Doc 06)
	- Short-term: α=0.40
	- Negative: α=0.15
	- All embeddings L2-normalized
	- [x] Modify `app/db.py` — add `user_profiles` table + `user_clusters` table
	- [x] Modify `app/qdrant_svc.py` — add `get_paper_vectors()` and `search_by_vector()`
	- [x] Modify `app/routers/events.py` — trigger EWMA updates on save/dismiss
	- [x] Modify `app/routers/recommendations.py` — EWMA vector search with Tier 2 fallback
	- [x] Add `numpy` + `scipy` to `requirements.txt`
	- [x] Tests for profiles module — 11 passed
	- [x] Full test suite — no regressions

	Doc 06 correction applied: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recent-biased).
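
The decay rates above can be read as a plain exponentially weighted moving average over paper embeddings. The following is a minimal sketch, assuming each interaction blends one embedding into the matching profile and the result is re-normalized; the function names `l2_normalize` and `ewma_update` are illustrative and are not taken from `app/recommend/profiles.py`.

```python
from typing import Optional

import numpy as np

# Decay rates quoted in the tracker; everything else in this sketch is illustrative.
ALPHA_LONG = 0.03
ALPHA_SHORT = 0.40
ALPHA_NEG = 0.15


def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0.0 else vec / norm


def ewma_update(profile: Optional[np.ndarray], paper_vec: np.ndarray, alpha: float) -> np.ndarray:
    """Blend one paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    paper_vec = l2_normalize(paper_vec)
    if profile is None:
        # Assumption: the first interaction seeds the profile directly.
        return paper_vec
    return l2_normalize((1.0 - alpha) * profile + alpha * paper_vec)
```

With α=0.03 a single new save moves the long-term profile only slightly, while α=0.40 lets the short-term profile follow the latest saves closely, which matches the stated goal of weighting recent interests without discarding older ones.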

	Gaps: None.

	---

	## Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE

	> Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.

	- [x] Create `app/recommend/clustering.py` — Ward clustering + medoid extraction
	- L2-normalize embeddings before Ward ✅ (Doc 06 correction)
	- Adaptive gap-based threshold (no fixed K)
	- Medoid representation (real papers, not centroids) ✅
	- Dynamic K (1–7 clusters, auto-determined)
	- Recency-weighted importance scores
	- [x] Modify `app/qdrant_svc.py` — add `multi_interest_search()` with prefetch+RRF
	- [x] Modify `app/routers/recommendations.py` — 3-tier cascading pipeline
	- Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
	- Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
	- Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
	- [x] Tests for clustering module — 10 passed
	- [x] Full test suite — no regressions
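
The three-tier cascade reduces to a threshold check on the number of saves. A minimal sketch, assuming the tiers are tried from richest to poorest; `select_tier` and the returned labels are hypothetical names, and the zero-save branch assumes the trending fallback described under Phase 5.

```python
def select_tier(n_saves: int) -> str:
    """Pick the richest pipeline tier that the user's save count supports."""
    if n_saves >= 5:
        return "tier1_multi_interest"    # clustering + per-interest retrieval
    if n_saves >= 3:
        return "tier2_ewma_longterm"     # single ANN search on the long-term vector
    if n_saves >= 1:
        return "tier3_qdrant_recommend"  # Qdrant BEST_SCORE Recommend API
    return "trending_fallback"           # assumption: no saves yet, serve trending papers
```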

	Doc 06 corrections applied: L2-normalization before Ward, medoid not centroid.
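
As an illustration of the clustering step, the sketch below L2-normalizes the embeddings, runs Ward linkage with SciPy, cuts the tree at the largest gap between consecutive merge heights, and returns one medoid (a real paper row) per cluster. The `min_gap` parameter and its default are assumptions added so that a single cluster is possible; the actual threshold rule in `clustering.py` is not given in this tracker.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

MAX_CLUSTERS = 7  # tracker: K is chosen automatically in the range 1..7


def cluster_interests(embeddings: np.ndarray, min_gap: float = 0.3):
    """Return (labels, medoid_rows) for an (N, D) matrix of saved-paper embeddings."""
    n = embeddings.shape[0]
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.ones(n, dtype=int)  # default: a single interest
    if n >= 3:
        tree = linkage(unit, method="ward")
        heights = tree[:, 2]                  # merge distances, non-decreasing for Ward
        gaps = np.diff(heights)               # gaps[i] = heights[i + 1] - heights[i]
        first = max(0, n - 1 - MAX_CLUSTERS)  # cutting in gap i leaves n - 1 - i clusters
        i = first + int(np.argmax(gaps[first:]))
        if gaps[i] >= min_gap:                # hypothetical guard that allows K = 1
            threshold = heights[i] + gaps[i] / 2.0
            labels = fcluster(tree, t=threshold, criterion="distance")
    medoids = []
    for label in np.unique(labels):
        rows = np.flatnonzero(labels == label)
        sims = unit[rows] @ unit[rows].T      # cosine similarities within the cluster
        medoids.append(int(rows[np.argmax(sims.sum(axis=1))]))
    return labels, medoids
```

Using the medoid rather than the centroid means every interest is represented by an embedding that actually exists in the collection, so it can be passed to a vector search unchanged.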

	Gaps (deferred to Phase 4):
	- [!] RRF → quota fusion (dominant clusters can swamp minority interests)
	- [!] Hungarian matching for cluster ID stability across reclusterings

	---

	## Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE

	> Added scoring and diversity layers on top of retrieval to produce the final feed.

	- [x] Create `app/recommend/reranker.py` — 5-feature heuristic scorer
	- Feature 1: cosine_sim_longterm (weight 0.40)
	- Feature 2: cosine_sim_shortterm (weight 0.25)
	- Feature 3: paper_age_days / recency (weight 0.15)
	- Feature 4: rrf_position (weight 0.10)
	- Feature 5: cosine_sim_negative (weight -0.15) ✅ (Doc 06 addition)
	- [x] Create `app/recommend/diversity.py` — MMR + exploration injection
	- MMR with λ=0.6
	- 2 serendipitous exploration papers per feed
	- [x] Modify `app/routers/recommendations.py` — full 5-step pipeline
	- Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
	- [x] Tests for reranker + diversity — 13 passed
	- [x] Full test suite — 88 passed (86 plus the 2 previously failing live Qdrant tests, now resolved)

	Doc 06 correction applied: Negative EWMA profile wired as Feature 5 with 0.15 penalty.
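
The scoring and diversity layers can be sketched as a weighted sum followed by greedy MMR. This is a hedged illustration only: it assumes every feature has already been scaled to [0, 1] (the tracker does not say how `paper_age_days` or `rrf_position` are transformed), and `heuristic_score` and `mmr_select` are hypothetical names.

```python
import numpy as np

# Weights as listed in the tracker; the negative profile acts as a penalty.
WEIGHTS = {
    "cosine_sim_longterm": 0.40,
    "cosine_sim_shortterm": 0.25,
    "recency": 0.15,
    "rrf_position": 0.10,
    "cosine_sim_negative": -0.15,
}


def heuristic_score(features: dict) -> float:
    """Weighted sum of the five features (assumed pre-scaled to [0, 1])."""
    return sum(weight * features[name] for name, weight in WEIGHTS.items())


def mmr_select(vectors: np.ndarray, relevance, k: int, lam: float = 0.6) -> list:
    """Greedy MMR: maximize lam * relevance - (1 - lam) * max similarity to picks so far."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((sims[i, j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ=0.6 a near-duplicate of an already selected paper loses up to 0.4 from its score, so a less relevant paper from a different direction can take the slot instead.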

	Gaps: None. LightGBM model now integrated (Phase 6 ✅).

	---

	## Phase 2d: Advanced Models ❌ DEFERRED (Blocked by data/users)

	> These logically belong to the recommendation engine but cannot be built without real user data or scale.

	- [x] LightGBM lambdarank model — originally gated on ≥500 labeled save/dismiss interactions; delivered in Phase 6 using citation-graph pseudo-labels
	- [~] Collaborative filtering features — requires ≥500 users → Phase 9
	- [~] DPP diversity — explicitly ruled out for v1 by Doc 06 → Phase 9+
	- [~] Two-Tower model — requires GPU + large dataset → Phase 9+

	---

	## Phase 3: Hybrid Semantic Search ✅ COMPLETE

	> Replaced the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.
	> Detailed plan: `docs/phases/PHASE3-Hybrid-Semantic-Search.md`
	> Prototype reference: `docs/phases/PHASE2-Hybrid-Search-Plan.md`
	> Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)

	### New files created
	- [x] `app/embed_svc.py` — BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570MB, ~15s cold)
	- `encode_query(text)` → `(dense: np.ndarray[1024], sparse: dict)`
	- LRU cache for repeat queries
	- Thread-safe, lazy loading with double-check locking
	- [x] `app/zilliz_svc.py` — Zilliz Cloud sparse search client
	- Collection: `arxiv_bgem3_sparse`
	- Schema: `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR)
	- Index: SPARSE_INVERTED_INDEX, metric_type=IP
	- Sparse format: `{int_token_id: float_weight}` (BGE-M3 lexical weights, NOT string words)
	- `search_sparse(sparse_dict, limit)` → `list[dict]` with arxiv_id + score
	- gRPC reconnect handling
	- [x] `app/groq_svc.py` — LLM query rewriter (Groq / llama-3.3-70b)
	- `rewrite(user_query)` → academic query string
	- Graceful fallback to original query on error
	- Academic-detection heuristic to skip unnecessary rewrites
	- 2s hard timeout
	- [x] `app/hybrid_search_svc.py` — search orchestrator
	- Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
	- Each step has independent failure handling
	- Recency reranking: 0.80 RRF + 0.20 recency
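
The fusion and recency steps of the orchestrator can be sketched as follows. The constant `k=60` is the commonly used RRF default and is an assumption here, as are the max-normalization of the RRF score and a recency value already scaled to [0, 1]; only the 0.80 / 0.20 split comes from the tracker. Function names are illustrative.

```python
def rrf_fuse(rankings, k: int = 60) -> dict:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per paper."""
    scores = {}
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    return scores


def rerank_with_recency(rrf_scores: dict, recency: dict,
                        w_rrf: float = 0.80, w_recency: float = 0.20) -> list:
    """Blend normalized RRF with a recency score in [0, 1]; return IDs best-first."""
    if not rrf_scores:
        return []
    top = max(rrf_scores.values())
    blended = {
        pid: w_rrf * (score / top) + w_recency * recency.get(pid, 0.0)
        for pid, score in rrf_scores.items()
    }
    return sorted(blended, key=blended.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, the dense cosine scores from Qdrant and the inner-product scores from Zilliz never need to be put on a common scale.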

	### Files modified
	- [x] `app/config.py` — added `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`, `ENCODE_CACHE_SIZE`, search weights, `APP_PORT`
	- [x] `app/qdrant_svc.py` — added `search_dense(dense_vec, limit)` for raw vector search returning scores
	- [x] `app/routers/search.py` — swapped `arxiv_svc.search()` → `hybrid_search_svc.search()` with arXiv fallback
	- [x] `app/main.py` — added graceful BGE-M3 warm-up to lifespan
	- [x] `requirements.txt` — added `FlagEmbedding`, `pymilvus`, `groq`
	- [x] `run.py` — configurable port (7860 default for HF Spaces)

	### Deployment files created
	- [x] `Dockerfile` — HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
	- [x] `.dockerignore` — excludes notebooks, PDFs, databases, caches

	### Implementation steps completed
	- [x] Step 1: BGE-M3 model service (`embed_svc.py`) + unit tests
	- [x] Step 2: Zilliz client (`zilliz_svc.py`)
	- [x] Step 3: Dense search in Qdrant service
	- [x] Step 4: Groq rewriter (`groq_svc.py`)
	- [x] Step 5: Hybrid search orchestrator (`hybrid_search_svc.py`)
	- [x] Step 6: Swap search router
	- [x] Step 7: Model warm-up + deployment config
	- [x] Step 8: Tests — 21 new tests passing (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)

	### Test results
	- 88 original tests: ✅ All pass (zero regressions)
	- 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
	- 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
	- 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
	- Total: 123 tests passing

	### Latency budget
	| Stage | Time |
	|---|---|
	| LLM rewrite (Groq) | ~300ms (skippable) |
	| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
	| Qdrant + Zilliz (parallel) | ~300ms |
	| RRF + rerank | <5ms |
	| Total (warm) | ~600ms |

	---

	## Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE

	> Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.
	> Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).
	> Integrated into codebase and deployed to HF Spaces.

	### Infrastructure
	- [x] Turso cloud DB created: `arxiv-data` on `aws-ap-south-1`
	- URL: `https://arxiv-data-siddhm11.aws-ap-south-1.turso.io`
	- Auth: Platform token + DB auth token (minted via CLI)
	- [x] Table: `papers` with columns:
	- `arxiv_id` (TEXT, UNIQUE INDEX `idx_papers_arxiv_id`)
	- `title` (TEXT)
	- `authors` (TEXT)
	- `categories` (TEXT)
	- `primary_topic` (TEXT)
	- `update_date` (TEXT)
	- `abstract_preview` (TEXT, truncated to 500 chars)
	- `citation_count` (INTEGER, default 0)
	- `influential_citations` (INTEGER, default 0)
	- [x] Data sources:
	- `arxiv_comprehensive_papers.csv` (Kaggle: siddhm11/arxivdata)
	- `arxiv_citations_summary.csv` (Kaggle: siddhm11/citation-data-letsgoo)
	- Joined on `id` = `arxiv_id_clean`, deduplicated
	- [x] Row count verified: local ↔ remote match
	- [x] Unique index on `arxiv_id` for fast lookups

	### Integration (DONE)
	- [x] Added `TURSO_URL` and `TURSO_DB_TOKEN` to `config.py` / `.env` / HF Secrets
	- [x] Created `app/turso_svc.py` — metadata lookup service
	- `fetch_metadata_batch(arxiv_ids)` → `{arxiv_id: paper_dict}`
	- Uses Turso HTTP pipeline API (zero new Python deps — just httpx)
	- Includes citation_count + influential_citations
	- [x] `app/routers/search.py` — Turso primary, arXiv API fallback (only for IDs not in Turso)
	- [x] Created `tests/test_turso_timing.py` — timing benchmark
	- [x] Verified: 10/10 title match, 6.1x end-to-end speedup on HF Spaces
	- [x] Impact: Avg search time dropped from ~10.7s to ~1.75s on HF Spaces
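
For reference, a batched lookup over the Turso HTTP pipeline API could look roughly like the sketch below. The request and response shapes follow my understanding of Turso's `/v2/pipeline` JSON format (typed cells, integers sent as strings) and should be checked against the current Turso documentation; `build_pipeline_request` and `parse_pipeline_response` are hypothetical helpers, not the contents of `app/turso_svc.py`.

```python
def build_pipeline_request(arxiv_ids: list) -> dict:
    """Build one pipeline body that selects all requested papers in a single round trip."""
    if not arxiv_ids:
        raise ValueError("arxiv_ids must not be empty")
    placeholders = ", ".join("?" for _ in arxiv_ids)
    sql = (
        "SELECT arxiv_id, title, citation_count, influential_citations "
        f"FROM papers WHERE arxiv_id IN ({placeholders})"
    )
    args = [{"type": "text", "value": arxiv_id} for arxiv_id in arxiv_ids]
    return {
        "requests": [
            {"type": "execute", "stmt": {"sql": sql, "args": args}},
            {"type": "close"},
        ]
    }


def parse_pipeline_response(payload: dict) -> dict:
    """Turn the first statement result into {arxiv_id: paper_dict}."""
    result = payload["results"][0]["response"]["result"]
    columns = [col["name"] for col in result["cols"]]
    papers = {}
    for row in result["rows"]:
        record = {}
        for name, cell in zip(columns, row):
            value = cell.get("value")  # null cells carry no "value" key
            if cell.get("type") == "integer" and value is not None:
                value = int(value)     # assumption: integers arrive as strings
            record[name] = value
        papers[record["arxiv_id"]] = record
    return papers


# Usage with httpx (not executed here; TURSO_URL and TURSO_DB_TOKEN come from config):
# resp = httpx.post(f"{TURSO_URL}/v2/pipeline", json=build_pipeline_request(ids),
#                   headers={"Authorization": f"Bearer {TURSO_DB_TOKEN}"})
# papers = parse_pipeline_response(resp.json())
```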

	---

	## Phase 4: Recommendation Pipeline Fixes ✅ COMPLETE

	> Fixed the known architectural debt in the recommendation pipeline.
	> Detailed plan: `docs/phases/PHASE4-Recommendation-Pipeline-Fixes.md`

	### 4.1 — Replace RRF with Importance-Weighted Quota Fusion
	- [x] Create `app/recommend/fusion.py` — quota allocation logic
	- `w_k = importance_k / sum(importance_k)`
	- `slot_k = max(floor(F × w_k), F_min=3)` — every cluster gets at least 3 slots
	- Distribute remainder by largest fractional part
	- [x] Create `tests/test_fusion.py` — 20 unit tests for quota allocation
	- Proportionality, floor enforcement, total invariant, edge cases, Doc 06 worked examples
	- [x] Refactor `_multi_interest_recommend()` in `recommendations.py`
	- Replace `multi_interest_search()` with per-cluster separate ANN queries
	- Use `asyncio.gather()` for concurrent searches (~15ms wall-clock)
	- Allocate feed slots proportionally via `allocate_quotas()`
	- Deduplicate across clusters (first-occurrence = highest-ranked cluster wins)
	- MMR over merged union (unchanged)
	- [x] Keep `qdrant_svc.multi_interest_search()` in codebase (no deletion)
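
The quota rule in 4.1 can be sketched as below. The first two steps follow the formulas listed above; the final trimming step is an assumption, added because applying the `F_min` floor can push the total above `F`, and the tracker does not say how `fusion.py` resolves that case.

```python
import math


def allocate_quotas(importances, feed_size: int, f_min: int = 3) -> list:
    """Split feed_size slots across clusters in proportion to importance."""
    total = sum(importances)
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    raw = [feed_size * imp / total for imp in importances]
    slots = [max(math.floor(r), f_min) for r in raw]

    # Hand out any leftover slots by largest fractional part.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True)
    i = 0
    while sum(slots) < feed_size:
        slots[order[i % len(order)]] += 1
        i += 1

    # Assumption: if the floor pushed the total over budget, trim the largest
    # clusters first and never go below f_min.
    while sum(slots) > feed_size:
        trimmable = [j for j in range(len(slots)) if slots[j] > f_min]
        if not trimmable:
            break  # K * f_min exceeds feed_size; the floor wins in this sketch
        slots[max(trimmable, key=lambda j: slots[j])] -= 1
    return slots
```

For example, importances of 6 : 3 : 1 with a 20-slot feed give raw shares of 12, 6, and 2; the floor lifts the smallest cluster to 3 and the largest gives up one slot, so the minority interest stays visible in the feed.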

	### 4.2 — Pre-populate Metadata Store ✅ DONE (via Turso)
	- [x] Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
	- [x] 1.23 GB, includes citation counts from Semantic Scholar
	- [x] Wired Turso service into `search.py` (Turso primary, arXiv API fallback)
	- [x] arXiv API is now fallback only for genuinely new papers
	- [x] Impact: Search time dropped from ~10.7s to ~1.75s on HF Spaces

	### 4.3 — Hungarian Matching for Cluster Stability
	- [x] Add `stabilize_cluster_ids()` function to `clustering.py`
	- Uses `scipy.optimize.linear_sum_assignment` (already a dependency)
	- Cost matrix: `1 - cosine_sim(new_medoid, old_medoid)` — trivial at K≤7
	- Matched clusters keep old indices; new clusters get next available
	- Min cosine threshold (0.5) rejects unrelated matches
	- [x] Call between `compute_clusters()` and `save_clusters_to_db()` in recommendations.py
	- [x] 10 tests in `test_clustering.py` — perturbed clusters preserve indices, unrelated match rejection, K growth/shrink, custom thresholds
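
A minimal sketch of the matching step, using `scipy.optimize.linear_sum_assignment` on the cosine-distance cost matrix described above. The return format (one stable ID per new cluster) and the rule that unmatched clusters receive IDs beyond the old range are assumptions; the real `stabilize_cluster_ids()` signature is not shown in this tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def stabilize_cluster_ids(new_medoids: np.ndarray, old_medoids: np.ndarray,
                          min_cosine: float = 0.5) -> list:
    """Return one stable ID per new cluster (row of new_medoids)."""
    if len(old_medoids) == 0:
        return list(range(len(new_medoids)))
    new_unit = new_medoids / np.linalg.norm(new_medoids, axis=1, keepdims=True)
    old_unit = old_medoids / np.linalg.norm(old_medoids, axis=1, keepdims=True)
    cost = 1.0 - new_unit @ old_unit.T        # shape (K_new, K_old)
    rows, cols = linear_sum_assignment(cost)  # also handles rectangular matrices

    ids = [-1] * len(new_medoids)
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= min_cosine:    # reject unrelated matches
            ids[r] = int(c)                   # matched cluster keeps the old index

    next_id = len(old_medoids)                # assumption: fresh IDs start past the old range
    for r, assigned in enumerate(ids):
        if assigned == -1:
            ids[r] = next_id
            next_id += 1
    return ids
```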

	### 4.4 — Category-Level Negative Suppression
	- [x] Add `get_suppressed_categories()` to `db.py`
	- Joins `interactions` + `paper_metadata` to find categories with ≥3 dismissals
	- Primary category only (decision: avoid over-suppression)
	- 14-day window (standard default, τ_neg = 14 days)
	- [x] Add suppression filter in `_multi_interest_recommend()` after reranking
	- [x] Cache Turso metadata to `paper_metadata` via `cache_turso_metadata_batch()`
	- [x] 8 tests in `test_db.py` — threshold, partitioning, user isolation, custom threshold
	- [~] Per-item short-term decay → deferred to Phase 6 (LightGBM feature)
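
The suppression query can be illustrated with the standard-library `sqlite3` module. This is a synchronous sketch for readability (the project itself uses aiosqlite), and the column names `user_id`, `action`, `created_at`, and `primary_category` are assumptions about the schema rather than confirmed names.

```python
import sqlite3

SUPPRESSION_SQL = """
SELECT m.primary_category
FROM interactions AS i
JOIN paper_metadata AS m ON m.arxiv_id = i.arxiv_id
WHERE i.user_id = ?
  AND i.action = 'dismiss'
  AND i.created_at >= datetime('now', ?)
GROUP BY m.primary_category
HAVING COUNT(*) >= ?
"""


def get_suppressed_categories(conn: sqlite3.Connection, user_id: str,
                              min_dismissals: int = 3, window_days: int = 14) -> set:
    """Primary categories the user dismissed at least min_dismissals times in the window."""
    rows = conn.execute(
        SUPPRESSION_SQL, (user_id, f"-{window_days} days", min_dismissals)
    ).fetchall()
    return {row[0] for row in rows}
```

Grouping on the primary category only, as decided above, keeps a cross-listed paper from suppressing every category it happens to touch.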

	Gaps: None.

	---

	## Phase 4.5: Instrumentation Foundation ✅ COMPLETE

	> Added telemetry columns to the interactions table so every saved/dismissed paper
	> can be attributed to its pipeline tier, cluster origin, and ranker version.
	> Doc 07 (ADR A4) identified this as the single most valuable early investment —
	> retrofitting these fields after real user data exists is painful and blocks all
	> later counterfactual evaluation.

	### Schema changes
	- [x] Add `ranker_version TEXT` to `interactions` table — pipeline version tag
	- [x] Add `candidate_source TEXT` to `interactions` — e.g. `cluster_0`, `exploration`, `ewma_longterm`, `qdrant_recommend`, `short_term_supplement`
	- [x] Add `cluster_id INTEGER` to `interactions` — interest cluster index (NULL if N/A)
	- [x] ALTER TABLE migration for existing DBs (safe try/except, idempotent)
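
An idempotent column migration of this kind can be sketched with `sqlite3` as follows. The helper name `migrate_interactions` is hypothetical, and the sketch is synchronous for brevity; it relies on SQLite reporting an already existing column as an `OperationalError` whose message contains "duplicate column name".

```python
import sqlite3

# Columns added in Phase 4.5, in the order listed above.
NEW_COLUMNS = {
    "ranker_version": "TEXT",
    "candidate_source": "TEXT",
    "cluster_id": "INTEGER",
}


def migrate_interactions(conn: sqlite3.Connection) -> None:
    """Add the telemetry columns; safe to run on every startup."""
    for name, sql_type in NEW_COLUMNS.items():
        try:
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {name} {sql_type}")
        except sqlite3.OperationalError as exc:
            # Swallow only the "column already exists" case; re-raise anything else.
            if "duplicate column name" not in str(exc).lower():
                raise
    conn.commit()
```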

	### Pipeline tagging
	- [x] Add `_RANKER_VERSION` constant to `recommendations.py`
	- [x] Tag Tier 1 papers with cluster origin, exploration status, short-term supplement
	- [x] Tag Tier 2 papers as `ewma_longterm`
	- [x] Tag Tier 3 papers as `qdrant_recommend`
	- [x] Build `paper_cluster_map` before quota merge (first-occurrence = cluster attribution)
	- [x] Exploration papers tagged as `candidate_source='exploration'`

	### End-to-end flow
	- [x] `recommendations.py` embeds tags in paper dicts
	- [x] `action_buttons.html` includes tags in `hx-vals` JSON
	- [x] `events.py` accepts `ranker_version`, `candidate_source`, `cluster_id` Form fields
	- [x] `db.log_interaction()` stores all three new columns

	Files modified: `app/db.py`, `app/routers/events.py`, `app/routers/recommendations.py`, `app/templates/partials/action_buttons.html`

	Gaps: None at the time. The `propensity` and `policy_id` fields were deferred here and later added in Phase 6.5 (B2), ahead of the ε-greedy exploration planned for Phase 9.

	---

	## Phase 5: Cold-Start Onboarding ✅ COMPLETE

	> Onboarding wizard for new users — category selection + seed paper search + trending fallback.
	> Reference: Doc 06 — "4-37% lift even once behavioral data exists"

	### 5.1 — arXiv Category Multi-Select ✅
	- [x] UI screen on first visit: select 1-8 arXiv category groups
	- [x] Store selections in SQLite (`user_onboarding` table)
	- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
	- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
	- [x] Does NOT create "subject vectors" — just filters

	### 5.2 — Seed Paper Import ✅
	- [x] Let users search for and save seed papers during onboarding
	- [x] Immediately create EWMA profiles + Ward clusters on next feed request
	- [x] Uses hybrid search (Phase 3) for discovery

	### ~~5.3 — ORCID / Semantic Scholar Import~~ ❌ REMOVED
	> S2 author import was implemented but removed — not the onboarding direction we want.
	> Onboarding focuses on category selection + manual seed paper search.

	### 5.4 — Popularity Fallback ✅
	- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
	- [x] 1-hour TTL trending cache for performance

	---

	## Phase 6: LightGBM Re-ranker ✅ COMPLETE

	> Replaced heuristic scorer with a trained LightGBM lambdarank model.
	> Unblocked via citation-graph pseudo-labels from Semantic Scholar.
	> Handoff doc: `docs/PHASE6-HANDOFF.md`
	> Model repo: [siddhm11/researchit-reranker-phase6](https://huggingface.co/siddhm11/researchit-reranker-phase6)

	### 6.1 — ML Intern: Data Pipeline + Model Training ✅
	- [x] Export 1.6M arXiv IDs from Turso → `arxiv_ids.txt` (`scripts/export_arxiv_ids.py`)
	- [x] Fetch 242K citation edges from Semantic Scholar Batch API (`01_fetch_citation_edges.py`)
	- [x] Generate 98K training triples with pseudo-labels: cited=2, co-cited=1, negative=0 (`02_generate_training_triples.py`)
	- [x] 37-feature schema (20 content, 11 user behavior, 6 cross-features)
	- [x] Train LightGBM LambdaRank model: 141 trees, 63 leaves, lr=0.05 (`03_train_lightgbm.py`)
	- [x] nDCG@10 = 0.879 (+233% vs heuristic baseline)
	- [x] All artifacts pushed to HuggingFace
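
The three-level pseudo-label scheme can be illustrated with a small function. This is a sketch under stated assumptions: "cited" is taken to mean the anchor paper cites the candidate, and "co-cited" to mean some third paper cites both; the exact definitions used in `02_generate_training_triples.py` are not given here, and `pseudo_label` is a hypothetical name.

```python
def pseudo_label(anchor: str, candidate: str, references: dict) -> int:
    """Relevance grade for (anchor, candidate): cited=2, co-cited=1, negative=0.

    references maps each citing paper ID to the set of paper IDs it cites.
    """
    if candidate in references.get(anchor, set()):
        return 2
    for cited in references.values():
        if anchor in cited and candidate in cited:
            return 1
    return 0
```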

	### 6.2 — Opus: Integration into ResearchIT ✅
	- [x] Rewrite `app/recommend/reranker.py` — 5 features → 37 features
	- [x] LightGBM model loading at import time with heuristic fallback
	- [x] Multi-path model file search (env var → relative → absolute)
	- [x] Backward-compatible `rerank_candidates()` signature (old callers unaffected)
	- [x] Add `lightgbm>=4.0,<5.0` to `requirements.txt`
	- [x] Fix CRLF→LF line endings in model file (Windows Git issue)
	- [x] 7 integration tests — all passing (`tests/test_reranker_integration.py`)
	- [x] Latency verified: 0.223ms per 100 candidates (target: <1ms) ✅

	### 6.3 — Antigravity: Feature Wiring + Deployment Verification ✅
	- [x] Wire all 37 features into `recommendations.py` caller (was legacy 6-arg signature)
	- [x] Per-candidate `cluster_importance` (N,) from `paper_cluster_map`
	- [x] Per-candidate `cluster_medoid` (N, 1024) per source cluster
	- [x] Pre-computed `is_suppressed_category` and `onboarding_category_match` arrays
	- [x] Pass `qdrant_scores`, `user_total_saves`, `user_total_dismissals`
	- [x] `reranker.py` supports both scalar broadcast and per-candidate arrays
	- [x] Add model accessors: `is_model_loaded()`, `get_num_trees()`, `get_loaded_model_path()`
	- [x] Add per-request feature activation logging
	- [x] Create `GET /healthz/reranker` endpoint (`app/routers/health.py`)
	- [x] Bug B fix: persist `medoid_embedding_blob` BLOB in `user_clusters` table
	- [x] Bug B fix: fall back to persisted blob instead of zero vector in Hungarian matching
	- [x] DB migration: `ALTER TABLE user_clusters ADD COLUMN medoid_embedding_blob BLOB`
	- [x] 9 new tests — all passing (`tests/test_phase6_feature_wiring.py`)
	- [x] Full suite: 203+ tests passing, 0 failures
	- [x] Updated `CLAUDE.md`, `PHASE6-HANDOFF.md`, `README.md`

	### 6.4 — Retraining [~] DEFERRED
	> Phase 6.4 retraining is deferred. The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23–30 set to zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) reaching 100 real users with ≥10 saves each. Until then, the Phase 6.1, 6.2, and 6.3 plumbing is the deliverable unit of work.

	- [~] Synthetic user simulator (`scripts/simulate_users.py`) — target: +30d
	- [~] Real-user retrain at 100-user threshold — target: +90d or threshold
	- [~] HF model card backfill (library_name, pipeline_tag, metrics, schema)

	---

	## Phase 6.5: Instrumentation ✅ COMPLETE

	> Purpose: Stabilize the recommendation pipeline and prepare telemetry substrate for Phase 7 evaluation.

	### A1 — Real Qdrant cosine scores
	- [x] Switch `search_by_vector()` → `search_by_vector_with_scores()` in per-cluster + short-term searches
	- [x] Build `qdrant_score_map` from real cosines (replaces fake `1.0 - rank*0.01` linear decay)
	- [x] Feature 0 (`qdrant_cosine_score`) now receives actual cosine similarities
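
The difference between the old and new score sources can be shown in a few lines. In this sketch each search result is assumed to be a list of `(paper_id, cosine)` pairs, and a paper returned by several cluster searches keeps its highest cosine; both the pair format and the max rule are assumptions, not confirmed behavior of `search_by_vector_with_scores()`.

```python
def fake_linear_scores(ranked_ids: list) -> dict:
    """The replaced placeholder: a rank-based decay that ignores real similarity."""
    return {paper_id: 1.0 - rank * 0.01 for rank, paper_id in enumerate(ranked_ids)}


def build_qdrant_score_map(hits_per_query: list) -> dict:
    """Collect real cosine scores from several searches into {paper_id: cosine}."""
    score_map = {}
    for hits in hits_per_query:
        for paper_id, cosine in hits:
            if cosine > score_map.get(paper_id, float("-inf")):
                score_map[paper_id] = cosine
    return score_map
```

Under the placeholder, the third hit always scored 0.98 regardless of how close it really was, so the re-ranker's cosine feature carried rank information only.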

	### A2 — Deployment verification
	- [x] `curl /healthz/reranker` → `model_loaded=true, n_trees=141, fallback_active=false`
	- [x] Verification timestamp added to `PHASE6-Reranker-Framing.md`

	### B1 — query_id linkage
	- [x] Generate `query_id` (UUID) once per feed request in `get_recommendations()`
	- [x] Thread through all 4 feed paths: trending, Tier 1, Tier 2, Tier 3
	- [x] Generate `query_id` in `search.py` per search request
	- [x] Add `query_id` + `position` to `action_buttons.html` hx-vals

	### B2 — Propensity logging
	- [x] Add `propensity REAL` + `policy_id TEXT` migration to `interactions` table
	- [x] Extend `db.log_interaction()` with propensity + policy_id params
	- [x] Compute propensity: 1.0 (deterministic) vs `n_explore/pool_size` (exploration)
	- [x] Thread through templates + `events.py` Form params
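
The propensity rule in B2 amounts to the following. `compute_propensity` is a hypothetical helper name; the sketch assumes exploration papers are identified by `candidate_source == "exploration"`, as in the Phase 4.5 tagging.

```python
def compute_propensity(candidate_source: str, n_explore: int, pool_size: int) -> float:
    """Probability that the logging policy showed this paper."""
    if candidate_source == "exploration":
        if pool_size <= 0:
            raise ValueError("pool_size must be positive for exploration papers")
        return n_explore / pool_size  # n_explore papers drawn from a pool of pool_size
    return 1.0                        # deterministic ranking always shows its top picks
```

For example, 2 exploration slots drawn from a pool of 50 give a propensity of 0.04, which a later counterfactual evaluation can invert to weight each logged interaction.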

	### B3 — Cluster snapshot versioning
	- [x] Add `cluster_snapshots` table (append-only, content-addressed via `paper_ids_hash`)
	- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
	- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan

	### ~~B4 — S2 author import~~ ❌ REMOVED
	> S2 author import was implemented and then removed — not the onboarding direction we want.
	> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
	> have all been deleted. Onboarding uses category selection + manual seed search only.

	### Documentation
	- [x] `CLAUDE.md`: Rule 3.11 — interaction instrumentation invariants
	- [x] `_RANKER_VERSION` bumped to `v6.5_lightgbm_real_cosines`
	- [x] Phase status updated to 6.5 COMPLETE
	- [x] Tests: 203+ passing

	### Test suite
	- `tests/test_reranker_integration.py` — 7 tests (smoke, features, heuristic, E2E, latency, backward compat, comparison)
	- `tests/test_phase6_feature_wiring.py` — 9 tests (per-candidate arrays, broadcast medoid, model accessors, aggregate activation)
	- `tests/demo_reranker.py` — interactive demo with 20 realistic papers

	---

	## Phase 7: Evaluation Framework 📋 NOT STARTED

	> Build offline and online evaluation before scaling users.
	> Estimated effort: ~1 week

	- [ ] Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
	- [ ] Time-split evaluation on unarXive 2022 + S2ORC
	- [ ] Online metrics (once users exist): CTR, save rate, dwell time, return rate
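
As a reference point for the offline metrics, nDCG@10 can be computed as in the sketch below. It assumes graded labels with the exponential gain `2^g - 1` and builds the ideal ordering from the same retrieved list; the evaluation framework may choose linear gain or a different ideal set.

```python
import math


def dcg_at_k(gains: list, k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: list, k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(ranked_gains, k) / ideal
```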

	---

	## Phase 8: LLM Interest Summaries + Distilled Re-ranker 📋 NOT STARTED

	> Estimated effort: ~10-12 weeks (Doc 07)
	> Detailed research plan: `docs/research/07-LLM-Summaries-Reranker-and-Scaling-Research.md`
	> Entry criteria: Phase 7 eval producing stable nDCG@10; cluster stability Jaccard ≥0.7 over 7 days

	### 8a — Claude-generated per-cluster interest summaries (Doc 07 §A)
	- [ ] Cluster snapshot versioning (ADR A1)
	- [ ] Content-addressed caching: `sha256(sorted(paper_ids) + prompt_version + model)`
	- [ ] Shared summaries (not per-user) — Haiku 4.5 + Batch API (~$50-80/month @ 1K users)
	- [ ] Nightly regeneration job with 7-day TTL + event-triggered refresh
	- [ ] "You're reading about X" UI framing with sub-theme bullets
	- [ ] Anthropic Citations API for hallucination prevention
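
The content-addressed cache key from 8a can be sketched as below. The JSON serialization is an assumption chosen to keep the three inputs unambiguous; only the ingredients (sorted paper IDs, prompt version, model) come from the plan, and `summary_cache_key` is a hypothetical name.

```python
import hashlib
import json


def summary_cache_key(paper_ids: list, prompt_version: str, model: str) -> str:
    """sha256 over the sorted paper IDs plus prompt version and model name."""
    payload = json.dumps(
        {"paper_ids": sorted(paper_ids), "prompt_version": prompt_version, "model": model},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting the IDs makes the key independent of cluster member order, so two users whose clusters contain the same papers share one cached summary.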

	### 8b — Distilled cross-encoder reranker (Doc 07 §B)
	- [ ] Deploy `cross-encoder/ms-marco-TinyBERT-L-2-v2` INT8 ONNX as MVP
	- [ ] 6ms budget for 20 pairs on CPU (AVX-512 VNNI)
	- [ ] TinyBERT score as LightGBM feature (Option C architecture)
	- [ ] Custom distillation from BGE-reranker-v2-m3 only if held-out gap >3 nDCG
	- [ ] MarginMSE loss + SciNCL citation-graph hard negatives

	### 8c — Use-cases and information-gain design doc (Doc 07 §C)
	- [ ] 8 user personas (P1 cold-start through P8 stay-current)
	- [ ] Information-gain table (save=3-5×, dismiss-as-label=−3-4×, passive skip=−0.1×)
	- [ ] Mode-switching UI: "Stay Current" vs "Lit Review" toggle
	- [ ] Failure mode detection rules (feed collapse, stale profile, filter bubble)

	---

	## Phase 9: Exploration + Collaborative Filtering 📋 NOT STARTED

	> Blocked by: ≥500 users

	- [ ] Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
	- [ ] LightFM hybrid CF model with switching strategy
	- [x] Category-level negative suppression (already delivered in Phase 4.4 via `get_suppressed_categories()`)
	- [ ] Retrain LightGBM with dismissals as negative labels
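
A possible shape for the ε-greedy step is sketched below. The ε values come from the plan above; the `established_after` cutoff of 20 saves and both function names are hypothetical, since the tracker does not define when a user counts as established.

```python
import random


def epsilon_for(n_saves: int, established_after: int = 20) -> float:
    """Explore more for new users (0.25) than for established ones (0.05)."""
    return 0.05 if n_saves >= established_after else 0.25


def pick_for_slot(ranked: list, explore_pool: list, epsilon: float, rng: random.Random):
    """With probability epsilon take a random exploration paper, else the top-ranked one."""
    if explore_pool and rng.random() < epsilon:
        return rng.choice(explore_pool), "exploration"
    return ranked[0], "ranked"
```

Passing the random generator in explicitly keeps the choice reproducible in tests and makes it easy to log which policy produced each slot.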

	---

	## Appendix: Infrastructure Status

	| Component | Status | Details |
	|---|---|---|
	| Qdrant Cloud | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
	| Zilliz Cloud | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
	| Turso (libSQL) | ✅ Live | 1.23 GB arXiv metadata + citations, `arxiv-data` DB, `papers` table, unique index on `arxiv_id` |
	| SQLite | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters, user_onboarding, cluster_snapshots |
	| HF Spaces | ✅ Deployed | Docker SDK, free tier, port 7860 — https://siddhm11-researchit.hf.space |
	| Render | ⚠️ Previous target | 512MB RAM too small for BGE-M3; may still be used for non-ML services |
	| arXiv API | ✅ Fallback only | Keyword search + metadata for papers not in Turso |
	| BGE-M3 Model | ✅ Live | Pre-baked in Docker image, warm-up at startup |
	| Groq API | ✅ Live + HF Secret | `app/groq_svc.py` — 2s timeout, academic heuristic skip |
	| Notebooks | ✅ Organized | `notebooks/` — 01-upload, 02-test, 03-search-benchmark |

	### Credentials Status

	| Credential | Status | Env Var | Notes |
	|---|---|---|---|
	| Qdrant Cloud | ✅ In `.env` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
	| Zilliz Cloud | ✅ In `.env` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3, wired |
	| Turso (libSQL) | ✅ In `.env` + HF | `TURSO_URL`, `TURSO_DB_TOKEN` | Phase 3.5, wired + deployed |
	| Groq | ✅ In `.env` + HF | `GROQ_API_KEY` | Phase 3, wired + deployed |
	| HF Spaces | ✅ Deployed | Secrets panel | All env vars set ✔ |

	---

	## Appendix: Test Suite

	| Test File | Count | Status | Notes |
	|---|---|---|---|
	| `tests/test_profiles.py` | 11 | ✅ Passing | |
	| `tests/test_clustering.py` | 21 | ✅ Passing | 9 compute + 10 Hungarian + 2 persistence |
	| `tests/test_reranker_diversity.py` | 13 | ✅ Passing | |
	| `tests/test_reranker_integration.py` | 7 | ✅ Passing | Phase 6: smoke, features, E2E, latency |
	| `tests/test_phase6_feature_wiring.py` | 9 | ✅ Passing | Phase 6.3: per-candidate arrays, medoids, accessors |
	| `tests/test_fusion.py` | 20 | ✅ Passing | Phase 4.1 |
	| `tests/test_db.py` | 19 | ✅ Passing | Includes 4 Turso cache + 8 suppression |
	| `tests/test_qdrant_svc.py` | — | ✅ Passing | |
	| `tests/test_arxiv_svc.py` | — | ✅ Passing | |
	| `tests/test_integration.py` | — | ✅ Passing | Includes quota pipeline E2E |
	| `tests/test_user_state.py` | — | ✅ Passing | |
	| `tests/test_saved.py` | — | ✅ Passing | |
	| `tests/test_hybrid_search.py` | 21 | ✅ Passing | |
	| `tests/test_search_router.py` | 6 | ✅ Passing | |
	| `tests/test_live_search.py` | 8 | ✅ Passing | |
	| Total | 203+ | ✅ | |
	| `test_e2e_recs.py` (standalone) | 1 | ✅ E2E simulation | |

	---

	## Appendix: Doc 06 Corrections — Tracking

	| Correction | Status | Where |
	|---|---|---|
	| α_long 0.10 → 0.03 | ✅ Applied | `app/recommend/profiles.py:30` |
	| L2-normalize before Ward clustering | ✅ Applied | `app/recommend/clustering.py` |
	| Medoid not centroid | ✅ Applied | `app/recommend/clustering.py` → `_find_medoid()` |
	| Negative EWMA wired into reranking | ✅ Applied | `app/recommend/reranker.py` → Feature 5 |
	| RRF → quota fusion for recommendations | ✅ Applied | `app/recommend/fusion.py` (Phase 4.1) |
	| Hungarian cluster matching | ✅ Applied | `app/recommend/clustering.py` → `stabilize_cluster_ids()` (Phase 4.3) |
	| Per-item short-term negative decay | [!] Backlog | Phase 6 (LightGBM feature) |
	| Category-level suppression | ✅ Applied | `app/db.py` → `get_suppressed_categories()` (Phase 4.4) |
	| BGE-reranker NEVER in hot path | ✅ Followed | Heuristic scorer (Phase 2c), then LightGBM re-ranker with heuristic fallback (Phase 6), used instead |