siddhm11/ResearchIT

siddhm11 committed 15 days ago

Commit

ec67b2f

1 Parent(s): d2f0bed

Phase 6.5: Pipeline telemetry, search UX fixes, latency profiling

- Instrumented search pipeline: Groq rewrite, BGE-M3 encode, Qdrant+Zilliz retrieval, RRF fusion, title rerank with per-stage timing
- Instrumented recommendation pipeline: clustering, ANN retrieval, metadata fetch, LightGBM rerank, MMR diversity
- Split Title+Citation Rerank into Turso fetch vs compute time (exposed hidden 1.5s network call)
- Added search loading overlay with pipeline stage labels
- Fixed HTMX search: recommendations now hide when search starts
- Fixed paper card: truncate authors (max 3 + et al), hard-truncate abstract to 500 chars
- Show Groq rewrite status (skipped/rewritten/error) in both banner and breakdown
- Added Groq heuristic visibility: shows skip reason (query too short, looks academic)
- Added parallel task count to retrieval breakdown
- New evaluation and diagnostic scripts
- Removed deprecated s2_svc.py
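
The paper-card rule in the list above (at most three authors followed by "et al", abstract hard-truncated to 500 characters) can be illustrated with a small sketch. The function names are hypothetical, and the commit presumably applies the rule in `app/templates/partials/paper_card.html`, whose contents are not shown here, so treat this as an illustration of the rule rather than the actual template logic.

```python
def truncate_authors(authors: list[str], max_authors: int = 3) -> str:
    """Join at most `max_authors` names and append 'et al.' if any were dropped.

    Hypothetical helper: the separator and the exact 'et al.' spelling are assumptions.
    """
    shown = ", ".join(authors[:max_authors])
    if len(authors) > max_authors:
        return f"{shown} et al." if shown else "et al."
    return shown


def truncate_abstract(abstract: str, max_chars: int = 500) -> str:
    """Hard-truncate: cut at exactly `max_chars` characters, ignoring word boundaries."""
    return abstract[:max_chars]


print(truncate_authors(["Vaswani", "Shazeer", "Parmar", "Uszkoreit"]))
print(len(truncate_abstract("x" * 600)))
```

A hard cut keeps card heights predictable at the cost of occasionally splitting a word; a template could append an ellipsis after the cut if that matters.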

Files changed (43)

.github/skills/researchit-codebase-overview/SKILL.md +48 -0
.github/skills/researchit-data-layer/SKILL.md +31 -0
.github/skills/researchit-debug-performance/SKILL.md +31 -0
.github/skills/researchit-recs-analysis/SKILL.md +42 -0
.github/skills/researchit-reranker-explainer/SKILL.md +30 -0
.github/skills/researchit-search-analysis/SKILL.md +34 -0
.github/skills/researchit-testing-eval/SKILL.md +30 -0
CLAUDE.md +2 -0
README.md +1 -1
app/config.py +1 -2
app/groq_svc.py +19 -13
app/hybrid_search_svc.py +316 -94
app/qdrant_svc.py +87 -17
app/recommend/clustering.py +76 -2
app/recommend/reranker.py +1 -1
app/routers/onboarding.py +7 -99
app/routers/recommendations.py +44 -15
app/routers/search.py +13 -2
app/s2_svc.py +0 -111
app/templates/index.html +2 -12
app/templates/partials/paper_card.html +10 -5
app/templates/partials/recommendations.html +34 -0
app/templates/partials/search_results.html +78 -2
app/templates/partials/seed_results.html +41 -0
app/templates/partials/seed_search.html +2 -60
app/templates/search.html +55 -9
app/turso_svc.py +127 -9
docs/TASK-TRACKER.md +22 -22
docs/previous_prompt.txt +0 -0
docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md +21 -20
requirements.txt +1 -1
scripts/browser_test_onboarding.py +75 -0
scripts/browser_test_search.py +77 -0
scripts/diag_mamba.py +69 -0
scripts/diag_search_rank.py +45 -0
scripts/e2e_audit.py +622 -0
scripts/eval_expanded_queries.py +336 -0
scripts/eval_recs_quality.py +547 -0
scripts/eval_search_quality.py +197 -0
scripts/expanded_eval_results.json +0 -0
scripts/profile_pipelines.py +410 -0
scripts/test_citation_boost.py +91 -0
tests/test_hybrid_search.py +76 -32

.github/skills/researchit-codebase-overview/SKILL.md ADDED

	@@ -0,0 +1,48 @@

+---
+name: researchit-codebase-overview
+description: "Explain the ResearchIT codebase architecture and current state. Use for onboarding, project overviews, and accurate summaries of how the system works. Triggers: codebase overview, architecture summary, explain this project, how this works, system map."
+argument-hint: "Specify audience (dev/stakeholder), depth (brief/standard/deep), and focus (search/recs/data)."
+---
+# ResearchIT Codebase Overview
+## When to Use
+- The user asks for a full understanding of the codebase or architecture.
+- You need to produce a top-level system map or explain how components interact.
+- You need a concise but accurate "what is happening here" summary.
+## Inputs to Ask For (if missing)
+- Audience: developer vs stakeholder.
+- Depth: brief, standard, or deep.
+- Focus areas: search, recommendations, data layer, evaluation.
+## Required Sources (read in this order)
+1. CLAUDE.md (rules and source-of-truth doc map).
+2. docs/research/06-Deep-Research-Verdict.md (architecture decisions).
+3. README.md (current system summary).
+4. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md (module map).
+5. docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md (current phase).
+## Procedure
+1. State the product goal in one sentence and the system constraints (CPU-only, latency budget).
+2. Describe the high-level architecture (frontend, backend, vector stores, metadata DB, SQLite).
+3. Summarize the two main pipelines:
+   - Search: rewrite -> encode -> dense+sparse -> RRF -> title/citation boost.
+   - Recommendations: clustering -> quota -> rerank -> MMR -> exploration.
+4. Call out invariants from doc 06 (quota for recs, RRF for search, alpha values, MMR lambda).
+5. Explain data flow and caching (Turso LRU, Qdrant vector cache, SQLite metadata cache).
+6. State current phase status and what is out of scope.
+## Output Format
+- 6 to 10 bullet points, ordered by importance.
+- Short "where to look" section with key files.
+- If stakeholder audience: avoid implementation detail and emphasize outcomes.
+## Key Files to Cite
+- app/main.py
+- app/routers/recommendations.py
+- app/routers/search.py
+- app/hybrid_search_svc.py
+- app/recommend/*
+- app/qdrant_svc.py, app/zilliz_svc.py, app/turso_svc.py
+- app/db.py

.github/skills/researchit-data-layer/SKILL.md ADDED

	@@ -0,0 +1,31 @@

+---
+name: researchit-data-layer
+description: "Explain the data/storage layer (SQLite, Turso metadata, Qdrant dense vectors, Zilliz sparse vectors). Use for data integrity, schema questions, caching behavior, and ID handling. Triggers: database schema, metadata cache, Qdrant mapping, Zilliz schema."
+argument-hint: "Specify the component(s) and whether you want schema details or runtime behavior."
+---
+# Data and Storage Layer Analysis
+## When to Use
+- The user asks about storage, caching, or schemas.
+- You need to validate data integrity or ID handling.
+- You need to explain how metadata or vector mappings work.
+## Required Sources
+1. app/db.py (SQLite schema + migrations)
+2. app/turso_svc.py (metadata + caches)
+3. app/qdrant_svc.py (ID mapping + vector cache)
+4. app/zilliz_svc.py (sparse schema + search)
+5. app/arxiv_svc.py (API fallback + ID normalization)
+## Procedure
+1. Summarize each store and its responsibility (SQLite, Turso, Qdrant, Zilliz).
+2. Explain arXiv ID handling (always string; never integer coercion).
+3. Document caches (vector cache, metadata LRU, trending cache).
+4. Note schema migrations and instrumentation columns.
+5. Identify data consistency boundaries and fallbacks.
+## Output Format
+- Component-by-component description.
+- Tables/fields summary for SQLite.
+- Integrity rules and common pitfalls.
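
The "always string; never integer coercion" rule for arXiv IDs in the procedure above is easy to motivate with a short sketch. `normalize_arxiv_id` is a hypothetical helper (the real normalization lives in `app/arxiv_svc.py`, which is not shown in this diff), and stripping the version suffix is an assumption about what normalization means here.

```python
import re

# Hypothetical: matches a trailing version suffix such as "v2".
_VERSION_SUFFIX = re.compile(r"v\d+$")


def normalize_arxiv_id(raw: str) -> str:
    """Return the arXiv ID as a string, without a trailing version suffix."""
    if not isinstance(raw, str):
        # Refuse numeric input outright: by the time an ID is a float or int,
        # its leading/trailing zeros are already gone.
        raise TypeError(f"arXiv ID must be a str, got {type(raw).__name__}")
    return _VERSION_SUFFIX.sub("", raw.strip())


# Numeric coercion silently changes the ID: zeros on both ends are lost.
assert str(float("0704.0010")) == "704.001"

# String handling keeps the ID intact, including old-style IDs.
assert normalize_arxiv_id("0704.0010v2") == "0704.0010"
assert normalize_arxiv_id("hep-th/9901001v1") == "hep-th/9901001"
```

Raising on non-string input, rather than converting with `str()`, surfaces the bug at the point where the coercion already happened upstream.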

.github/skills/researchit-debug-performance/SKILL.md ADDED

	@@ -0,0 +1,31 @@

+---
+name: researchit-debug-performance
+description: "Debug performance and quality issues in search or recommendations. Use for latency spikes, slow retrievals, or degraded relevance. Triggers: performance issue, slow search, slow recs, latency debug."
+argument-hint: "Specify area (search/recs/data), symptoms, and whether to propose fixes."
+---
+# Debugging and Performance Profiling
+## When to Use
+- Latency regressions or slow responses appear.
+- Search or recommendation quality drops unexpectedly.
+- External services time out or return empty results.
+## Required Sources
+1. app/qdrant_svc.py (vector cache, retrieve latency)
+2. app/turso_svc.py (metadata cache, trending cache)
+3. app/hybrid_search_svc.py (RRF pipeline)
+4. app/routers/recommendations.py (candidate flow + oversample)
+5. app/recommend/reranker.py (model load, feature cost)
+## Procedure
+1. Identify the failing pipeline (search vs recommendations).
+2. Check cache hit rates conceptually (vector and metadata caches).
+3. Inspect candidate fetch sizes and oversampling factors.
+4. Review service fallbacks (Zilliz, Turso, arXiv).
+5. Isolate latency contributors and propose focused fixes.
+## Output Format
+- Symptom -> probable cause mapping.
+- Targeted checks in code.
+- Minimal, low-risk fix options.
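
Step 5 above ("isolate latency contributors") is what the per-stage timing in this commit does for search. A minimal sketch of the pattern follows, assuming a plain dict of millisecond timings like the `search_meta` dict in `app/hybrid_search_svc.py`. `stage_timer` and `split_rerank_time` are hypothetical names; the commit itself inlines `time.perf_counter()` calls rather than using a context manager.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(meta: dict, key: str):
    """Record wall-clock milliseconds spent in the enclosed block under meta[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failed stages still show up.
        meta[key] = int((time.perf_counter() - start) * 1000)


def split_rerank_time(total_ms: int, fetch_ms: int) -> dict:
    """Split a stage total into network fetch time vs. local compute time."""
    return {
        "turso_boost_fetch_ms": fetch_ms,
        "rerank_compute_ms": max(0, total_ms - fetch_ms),  # clamp rounding noise
    }


meta: dict = {}
with stage_timer(meta, "encode_time_ms"):
    sum(range(100_000))  # stand-in for real work
meta.update(split_rerank_time(total_ms=1600, fetch_ms=1500))
print(meta)
```

Splitting the total this way is how a remote fetch hiding inside a "rerank" stage becomes visible: the compute share ends up small while the fetch share carries most of the time.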

.github/skills/researchit-recs-analysis/SKILL.md ADDED

	@@ -0,0 +1,42 @@

+---
+name: researchit-recs-analysis
+description: "Analyze and explain the recommendation pipeline. Use for recs debugging, feature reviews, pipeline changes, or explaining multi-interest behavior. Triggers: recommendation pipeline, recs analysis, multi-interest, quota fusion, reranker."
+argument-hint: "Specify the task (explain/debug/change), expected output (summary/findings), and whether to include tests."
+---
+# Recommendation Pipeline Analysis
+## When to Use
+- The user wants a deep explanation of recommendations or changes.
+- You need to verify rules like quota fusion, EWMA alphas, or MMR usage.
+- You are asked to debug rec quality or performance.
+## Required Sources
+1. CLAUDE.md and docs/research/06-Deep-Research-Verdict.md (non-negotiables).
+2. app/routers/recommendations.py (pipeline and instrumentation).
+3. app/recommend/profiles.py (EWMA parameters).
+4. app/recommend/clustering.py (Ward + medoids + stabilization).
+5. app/recommend/fusion.py (quota logic).
+6. app/recommend/reranker.py (LightGBM + features).
+7. app/recommend/diversity.py (MMR + exploration).
+## Procedure
+1. Identify which tier is active and the fallback sequence.
+2. Validate invariant rules:
+   - Search uses RRF, recommendations do not.
+   - Quota fusion with floor; MMR lambda is 0.6.
+   - alpha_long=0.03, alpha_short=0.40, alpha_neg=0.15.
+3. Trace candidate flow:
+   - Medoids -> per-cluster search -> dedup -> rerank -> MMR -> exploration.
+4. Check instrumentation fields: query_id, propensity, policy_id.
+5. Summarize likely failure modes: missing vectors, empty clusters, cache misses.
+6. Recommend targeted tests or metrics to verify changes.
+## Output Format
+- Pipeline summary with stages and main functions.
+- Invariants checklist (pass/fail).
+- Risks and suggested tests.
+## Notes
+- Never propose RRF for multi-medoid recommendations.
+- Do not introduce cross-encoders into the hot path.
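
For readers checking the "MMR lambda is 0.6" invariant above, here is a minimal sketch of greedy Maximal Marginal Relevance. It is the textbook formulation, not the code in `app/recommend/diversity.py` (not shown in this diff); the function name and the list-based inputs are assumptions made for illustration.

```python
def mmr_select(relevance: list[float], similarity: list[list[float]],
               k: int, lam: float = 0.6) -> list[int]:
    """Greedy MMR: return candidate indices in selection order.

    Each step picks the candidate maximizing
        lam * relevance[i] - (1 - lam) * max(similarity[i][j] for j already selected)
    """
    selected: list[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is 0 for the first pick (nothing selected yet).
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(relevance, similarity, k=2))  # [0, 2]
```

With lambda at 0.6 the second pick is candidate 2 (0.6 × 0.5 − 0.4 × 0.1 = 0.26) rather than the near-duplicate candidate 1 (0.6 × 0.85 − 0.4 × 0.95 = 0.13), which is the diversity effect the invariant protects.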

.github/skills/researchit-reranker-explainer/SKILL.md ADDED

	@@ -0,0 +1,30 @@

+---
+name: researchit-reranker-explainer
+description: "Explain the LightGBM reranker, feature schema, and fallback behavior. Use for model integration checks, feature debugging, or deployment validation. Triggers: reranker, LightGBM, feature schema, model loading."
+argument-hint: "Specify: explain, validate, or troubleshoot."
+---
+# Reranker and Feature Schema Explainer
+## When to Use
+- The user asks how the reranker works or which features are used.
+- You need to validate model loading and fallback behavior.
+- You are reviewing feature wiring or scoring behavior.
+## Required Sources
+1. app/recommend/reranker.py
+2. models/reranker-phase6/production_model/feature_schema.json
+3. app/routers/health.py
+4. app/routers/recommendations.py (feature wiring)
+## Procedure
+1. Confirm model load paths and fallback logic.
+2. Verify the 37-feature ordering matches the schema.
+3. Explain which features are active in recommendations and how they are computed.
+4. Confirm health endpoint expectations (/healthz/reranker).
+5. Provide a concise explanation of latency and why cross-encoders are excluded.
+## Output Format
+- Model load status + fallback behavior.
+- Feature group summary (content, behavior, cross features).
+- Integration checklist.

.github/skills/researchit-search-analysis/SKILL.md ADDED

	@@ -0,0 +1,34 @@

+---
+name: researchit-search-analysis
+description: "Explain or analyze the hybrid semantic search pipeline (rewrite, encode, dense+sparse, RRF, title/citation boost). Use for search quality, latency, and correctness reviews. Triggers: search pipeline, hybrid search, RRF, BGE-M3 search."
+argument-hint: "Specify: explain vs debug, and whether to include latency hotspots."
+---
+# Search Pipeline Analysis
+## When to Use
+- The user wants to understand or debug search results.
+- You need to review hybrid search correctness.
+- You are asked about RRF usage or query rewriting.
+## Required Sources
+1. app/routers/search.py
+2. app/hybrid_search_svc.py
+3. app/embed_svc.py
+4. app/qdrant_svc.py
+5. app/zilliz_svc.py
+6. app/groq_svc.py
+7. app/turso_svc.py and app/arxiv_svc.py
+## Procedure
+1. Trace the full pipeline from query to results.
+2. Call out the dual-encode design (original + rewrite) and why it exists.
+3. Verify RRF usage is limited to search fusion (correct per doc 06).
+4. Explain title/citation boosts and their intended effect.
+5. Document fallback behavior when any component fails.
+6. Summarize latency hotspots and caching layers.
+## Output Format
+- Step-by-step pipeline description.
+- Fallbacks and failure handling.
+- Notes on ranking behavior and edge cases.
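
The RRF step named in this skill can be sketched as below. Each paper's fused score is the sum of 1/(k + rank) over every result list that contains it, with ranks starting at 1 and k = 60. The sketch generalizes the two-list loop that this commit removes from `app/hybrid_search_svc.py`; the body of the new `_rrf_fuse_multi` is not visible in this excerpt, so `rrf_fuse` here is an assumed equivalent, not a copy.

```python
def rrf_fuse(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Reciprocal Rank Fusion across any number of ranked result lists.

    Each list holds dicts with an 'arxiv_id' key, best result first.
    Raw retriever scores are ignored; only rank positions matter.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):  # 1-indexed ranks
            aid = item["arxiv_id"]
            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
    fused = [{"arxiv_id": aid, "rrf_score": s} for aid, s in scores.items()]
    fused.sort(key=lambda x: x["rrf_score"], reverse=True)
    return fused


dense = [{"arxiv_id": "a"}, {"arxiv_id": "b"}]
sparse = [{"arxiv_id": "b"}, {"arxiv_id": "c"}]
print([x["arxiv_id"] for x in rrf_fuse([dense, sparse])])  # ['b', 'a', 'c']
```

Paper "b" wins because it appears in both lists (1/62 + 1/61), even though it is never ranked first in the dense list. Because only ranks are used, cosine scores from one store and inner-product scores from another need no normalization.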

.github/skills/researchit-testing-eval/SKILL.md ADDED

	@@ -0,0 +1,30 @@

+---
+name: researchit-testing-eval
+description: "Guide testing and evaluation for ResearchIT. Use for test planning, running tests, and explaining evaluation metrics. Triggers: testing plan, run tests, evaluation metrics, offline eval."
+argument-hint: "Specify scope (unit/integration/e2e) and whether to include metrics."
+---
+# Testing and Evaluation Guidance
+## When to Use
+- The user wants to run or plan tests.
+- The user asks about evaluation metrics or offline evaluation.
+- You need to explain test coverage or risks.
+## Required Sources
+1. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md
+2. tests/ (overview)
+3. pytest.ini
+4. test_e2e_recs.py
+## Procedure
+1. Identify test scope (unit, integration, live, e2e).
+2. Provide the correct test command(s) and file locations.
+3. Call out live tests that hit external services.
+4. Provide evaluation metrics and how they map to system goals.
+5. Note any missing coverage or potential regressions.
+## Output Format
+- Test scope summary.
+- Commands and expected outputs.
+- Evaluation metric checklist.

CLAUDE.md CHANGED

@@ -205,6 +205,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
 - Onboarding wizard (category multi-select + seed search)
 - Category-filtered trending fallback
 - Dark-mode base UI + updated paper cards
 **Phase 6 — LightGBM reranker (COMPLETE ✅):**
 - LightGBM LambdaRank (141 trees, 37 features) integrated with heuristic fallback
@@ -216,6 +217,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
 - Phase 6.4 (retraining) deferred: gated on 100 users or synthetic simulator
 **Out of scope until later phases — do not build:**
 - Collaborative filtering / LightFM (Phase 9, 500+ users).
 - Cross-encoder reranking in serving path (never; only distilled — Phase 8).
 - Claude/Groq-generated cluster summaries (Phase 8).

 - Onboarding wizard (category multi-select + seed search)
 - Category-filtered trending fallback
 - Dark-mode base UI + updated paper cards
+- S2/ORCID author import was explored and **removed** — not the direction we want
 **Phase 6 — LightGBM reranker (COMPLETE ✅):**
 - LightGBM LambdaRank (141 trees, 37 features) integrated with heuristic fallback
 - Phase 6.4 (retraining) deferred: gated on 100 users or synthetic simulator
 **Out of scope until later phases — do not build:**
+- S2/ORCID author import for onboarding (removed — not the direction we want).
 - Collaborative filtering / LightFM (Phase 9, 500+ users).
 - Cross-encoder reranking in serving path (never; only distilled — Phase 8).
 - Claude/Groq-generated cluster summaries (Phase 8).

README.md CHANGED

@@ -276,7 +276,7 @@ curl -s https://siddhm11-researchit.hf.space/healthz/reranker | python -m json.t
 | `TURSO_URL` | Yes | Turso database URL |
 | `TURSO_DB_TOKEN` | Yes | Turso auth token |
 | `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
-| `S2_API_KEY` | No | Semantic Scholar API key (training only) |
 | `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
 | `DB_PATH` | No | SQLite path (default: `interactions.db`) |

 | `TURSO_URL` | Yes | Turso database URL |
 | `TURSO_DB_TOKEN` | Yes | Turso auth token |
 | `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
+| `S2_API_KEY` | No | Semantic Scholar API key (offline training scripts only, not used by the app) |
 | `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
 | `DB_PATH` | No | SQLite path (default: `interactions.db`) |

app/config.py CHANGED

@@ -24,8 +24,7 @@ METADATA_CACHE_TTL_DAYS = 30    # re-fetch metadata after this many days
 TURSO_URL = os.getenv("TURSO_URL", "")
 TURSO_DB_TOKEN = os.getenv("TURSO_DB_TOKEN", "")
-# ── Semantic Scholar API — Phase 5.1 (author import) ─────────────────────────
-S2_API_KEY = os.getenv("S2_API_KEY", "")
 # ── Recommendation settings ───────────────────────────────────────────────────
 REC_LIMIT = 10                  # how many recommendations to show

 TURSO_URL = os.getenv("TURSO_URL", "")
 TURSO_DB_TOKEN = os.getenv("TURSO_DB_TOKEN", "")
 # ── Recommendation settings ───────────────────────────────────────────────────
 REC_LIMIT = 10                  # how many recommendations to show

app/groq_svc.py CHANGED

@@ -45,29 +45,29 @@ def _get_client():
 _SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.
-Your job: Convert casual or vague user queries into dense, keyword-rich academic search strings that will match arXiv paper titles and abstracts.
 Rules:
 1. Output ONLY the rewritten query string — no explanation, no quotes, no preamble.
-2. Include standard academic terms, model names, acronyms, and author-style keywords.
-3. Keep the output to 8-15 words maximum.
-4. If the query already looks academic, return it with minimal changes.
 Examples:
 User: "when AI makes up fake facts"
-Output: LLM hallucination factual errors sycophancy truthfulness survey
 User: "the llama model by facebook"
-Output: LLaMA open efficient foundation language model Meta AI
-User: "how to make images from text"
-Output: text-to-image generation diffusion models latent space
-User: "papers about making language models smaller"
-Output: language model compression distillation pruning quantization efficient
-User: "whisper speech recognition"
-Output: Whisper OpenAI automatic speech recognition multilingual"""
 # ── Heuristic: should we skip rewriting? ─────────────────────────────────────
@@ -85,8 +85,14 @@ _ACADEMIC_PATTERN = re.compile(
 def _looks_academic(query: str) -> bool:
-    """Heuristic: skip rewriting if query already has academic terms."""
     words = query.split()
     if len(words) > 6:
         matches = len(_ACADEMIC_PATTERN.findall(query))
         if matches >= 2:

 _SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.
+Your job: Convert casual or conversational user queries into academic search strings.
 Rules:
 1. Output ONLY the rewritten query string — no explanation, no quotes, no preamble.
+2. If the user's query is casual or conversational, rewrite it using standard academic terms.
+3. CRITICAL: If the query is ALREADY a precise technical term, a single keyword, an acronym, or a known paper title (e.g., "perplexity", "transformers", "Adam optimizer"), DO NOT expand it. Return it EXACTLY AS IS. Do NOT add random related words.
+4. Never output more than 8 words.
 Examples:
 User: "when AI makes up fake facts"
+Output: LLM hallucination factual errors
 User: "the llama model by facebook"
+Output: LLaMA foundation language model Meta AI
+User: "perplexity"
+Output: perplexity
+User: "attention is all you need"
+Output: attention is all you need
+User: "gradient descent"
+Output: gradient descent"""
 # ── Heuristic: should we skip rewriting? ─────────────────────────────────────
 def _looks_academic(query: str) -> bool:
+    """Heuristic: skip rewriting if query already looks academic or is very short."""
     words = query.split()
+    # 1-2 word queries are usually precise keywords or author names (e.g., "perplexity", "lecun")
+    # Expanding them almost always ruins the precision.
+    if len(words) <= 2:
+        return True
     if len(words) > 6:
         matches = len(_ACADEMIC_PATTERN.findall(query))
         if matches >= 2:
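
The skip heuristic in the hunk above is cut off after the `matches >= 2` check, so the following is only a self-contained sketch of how such a heuristic could behave. The regex is a hypothetical stand-in for `_ACADEMIC_PATTERN` (its real definition is not shown), and returning `True` after the match check and `False` otherwise is an assumption about the elided lines.

```python
import re

# Hypothetical pattern: the real _ACADEMIC_PATTERN is not visible in this diff.
_ACADEMIC_PATTERN = re.compile(
    r"\b(transformer|diffusion|quantization|LLM|survey)\b", re.IGNORECASE
)


def looks_academic(query: str) -> bool:
    """Return True when LLM rewriting should be skipped."""
    words = query.split()
    # 1-2 word queries are usually precise keywords or author names.
    if len(words) <= 2:
        return True
    # Longer queries are skipped only if they already carry academic terms.
    if len(words) > 6:
        if len(_ACADEMIC_PATTERN.findall(query)) >= 2:
            return True  # assumed: the diff is truncated before this line
    return False


print(looks_academic("perplexity"))                    # True: too short to rewrite
print(looks_academic("when AI makes up fake facts"))   # False: casual, gets rewritten
```

Under this sketch, queries of three to six words always go to the rewriter, which matches the visible structure of the hunk but may not match the full function.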

app/hybrid_search_svc.py CHANGED

@@ -6,23 +6,29 @@ Orchestrates the full pipeline:
   2. BGE-M3 encode → dense + sparse
   3. Parallel search: Qdrant dense + Zilliz sparse
   4. RRF fusion (K=60)
-  5. Recency rerank: 0.80 × RRF + 0.20 × recency
   6. Return ranked arxiv_ids
 Doc 06 confirms: RRF is correct for search (fusing different retrievers
 answering the SAME query).  This is different from recommendations where
 quota is correct (fusing different queries for the SAME user).
 """
 from __future__ import annotations
 import asyncio
-from datetime import datetime
 from app import config
 from app import embed_svc
 from app import qdrant_svc
 from app import zilliz_svc
 from app import groq_svc
 # ── Public API ───────────────────────────────────────────────────────────────
@@ -31,18 +37,20 @@ async def search(
     query: str,
     limit: int = 10,
     use_rewrite: bool = True,
-) -> list[str]:
     """
     Hybrid semantic search — returns a list of arxiv_ids ranked by
     fused relevance.
     Pipeline:
-      rewrite → encode → parallel(dense, sparse) → RRF → rerank
     Args:
         query: User's raw search query.
         limit: Number of results to return.
         use_rewrite: Whether to attempt LLM query rewriting.
     Returns:
         list of arxiv_id strings, sorted by final score descending.
@@ -50,55 +58,115 @@ async def search(
     """
     query = query.strip()
     if not query:
-        return []
     # ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
-    search_query = query
     if use_rewrite:
         try:
-            search_query = await groq_svc.rewrite(query)
         except Exception:
-            search_query = query  # Fallback guaranteed
-    # ── Step 2: BGE-M3 encode (dense + sparse in one pass) ───────────────
-    try:
-        dense_vec, sparse_dict = embed_svc.encode_query(search_query)
-    except Exception as e:
-        print(f"[hybrid_search] Encoding failed: {e}")
-        return []
     # How many candidates to fetch before reranking
     fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
-    # ── Step 3: Parallel dense + sparse search ───────────────────────────
-    dense_results, sparse_results = await asyncio.gather(
-        qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k),
-        zilliz_svc.search_sparse(sparse_dict, limit=fetch_k),
-        return_exceptions=True,
-    )
-    # Handle individual failures gracefully
-    if isinstance(dense_results, Exception):
-        print(f"[hybrid_search] Dense search failed: {dense_results}")
-        dense_results = []
-    if isinstance(sparse_results, Exception):
-        print(f"[hybrid_search] Sparse search failed: {sparse_results}")
-        sparse_results = []
-    if not dense_results and not sparse_results:
-        return []
-    # ── Step 4: RRF fusion ───────────────────────────────────────────────
-    fused = _rrf_fuse(dense_results, sparse_results, k=config.SEARCH_RRF_K)
     if not fused:
-        return []
-    # ── Step 5: Recency rerank ───────────────────────────────────────────
-    ranked = _recency_rerank(fused)
     # ── Step 6: Return top results ───────────────────────────────────────
-    return [item["arxiv_id"] for item in ranked[:limit]]
 # ── RRF fusion ───────────────────────────────────────────────────────────────
@@ -109,92 +177,246 @@ def _rrf_fuse(
     k: int = 60,
 ) -> list[dict]:
     """
-    Reciprocal Rank Fusion — merges results from dense and sparse search.
-    score[paper] = 1/(k + rank_dense) + 1/(k + rank_sparse)
     RRF is rank-based, so raw scores from different systems don't need
-    normalization — this is why it works for fusing Qdrant cosine scores
-    with Zilliz IP scores.
     Args:
-        dense_results: list of {'arxiv_id': str, 'score': float} from Qdrant
-        sparse_results: list of {'arxiv_id': str, 'score': float} from Zilliz
-        k: RRF constant (default 60)
     Returns:
-        list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc
     """
     scores: dict[str, float] = {}
-    # Dense contributions (rank = position in sorted list, 1-indexed)
-    for rank, item in enumerate(dense_results, start=1):
-        aid = item["arxiv_id"]
-        scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
-    # Sparse contributions
-    for rank, item in enumerate(sparse_results, start=1):
-        aid = item["arxiv_id"]
-        scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
-    # Sort by fused score descending
     fused = [
         {"arxiv_id": aid, "rrf_score": score}
         for aid, score in scores.items()
     ]
     fused.sort(key=lambda x: x["rrf_score"], reverse=True)
     return fused
-# ── Recency rerank ───────────────────────────────────────────────────────────
-def _recency_rerank(fused: list[dict]) -> list[dict]:
     """
-    Apply recency boost to RRF scores.
-    final_score = SEARCH_SEMANTIC_WEIGHT × norm_rrf + SEARCH_RECENCY_WEIGHT × recency
-    Recency is estimated from the arXiv ID (YYMM format) since we don't have
-    publication dates at this stage.  Papers not parseable get neutral score.
-    The semantic weight (0.80) ensures RRF dominates, while recency (0.20)
-    provides a mild boost to newer papers.
     """
     if not fused:
         return fused
-    # Normalize RRF scores to [0, 1]
-    max_rrf = max(item["rrf_score"] for item in fused)
-    min_rrf = min(item["rrf_score"] for item in fused)
-    rrf_range = max_rrf - min_rrf if max_rrf != min_rrf else 1.0
-    now_ym = datetime.now().year * 12 + datetime.now().month
-    for item in fused:
-        # Normalize RRF to [0, 1]
-        norm_rrf = (item["rrf_score"] - min_rrf) / rrf_range
-        # Estimate recency from arXiv ID (format: YYMM.NNNNN)
-        recency = 0.5  # neutral default
         aid = item["arxiv_id"]
-        try:
-            parts = aid.split(".")
-            if len(parts) >= 2 and len(parts[0]) == 4:
-                yy = int(parts[0][:2])
-                mm = int(parts[0][2:4])
-                year = 2000 + yy if yy < 100 else yy
-                paper_ym = year * 12 + mm
-                months_ago = max(0, now_ym - paper_ym)
-                # Decay: recent papers get ~1.0, 10-year-old papers get ~0.0
-                recency = max(0.0, 1.0 - months_ago / 120.0)
-        except (ValueError, IndexError):
-            pass
-        item["final_score"] = (
-            config.SEARCH_SEMANTIC_WEIGHT * norm_rrf
-            + config.SEARCH_RECENCY_WEIGHT * recency
-        )
     fused.sort(key=lambda x: x["final_score"], reverse=True)
     return fused

   2. BGE-M3 encode → dense + sparse
   3. Parallel search: Qdrant dense + Zilliz sparse
   4. RRF fusion (K=60)
+  5. Title-match boost (exact/substring against Turso titles)
   6. Return ranked arxiv_ids
 Doc 06 confirms: RRF is correct for search (fusing different retrievers
 answering the SAME query).  This is different from recommendations where
 quota is correct (fusing different queries for the SAME user).
+Recency rerank was removed — search relevance should not be biased toward
+newer papers (that is a recommendations concern). For exact-title queries
+like "attention is all you need", the recency overlay was crushing seminal
+older papers below newer "X is all you need" titles.
 """
 from __future__ import annotations
 import asyncio
+import re
 from app import config
 from app import embed_svc
 from app import qdrant_svc
 from app import zilliz_svc
 from app import groq_svc
+from app import turso_svc
 # ── Public API ───────────────────────────────────────────────────────────────
     query: str,
     limit: int = 10,
     use_rewrite: bool = True,
+    return_meta: bool = False,
+) -> list[str] | tuple[list[str], dict]:
     """
     Hybrid semantic search — returns a list of arxiv_ids ranked by
     fused relevance.
     Pipeline:
+      rewrite → encode → parallel(dense, sparse) → RRF → title-boost
     Args:
         query: User's raw search query.
         limit: Number of results to return.
         use_rewrite: Whether to attempt LLM query rewriting.
+        return_meta: If True, returns a tuple of (arxiv_ids, metadata_dict).
     Returns:
         list of arxiv_id strings, sorted by final score descending.
     """
     query = query.strip()
     if not query:
+        return ([], {}) if return_meta else []
+    import time
+    search_meta = {"rewritten_query": None, "groq_time_ms": 0, "groq_status": "off"}
     # ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
+    rewritten_query = query
     if use_rewrite:
+        start_groq = time.perf_counter()
         try:
+            rewritten_query = await groq_svc.rewrite(query)
+            if rewritten_query != query:
+                search_meta["rewritten_query"] = rewritten_query
+                search_meta["groq_status"] = "rewritten"
+            else:
+                # Groq returned same query — either skipped by heuristic or LLM kept it
+                word_count = len(query.strip().split())
+                if word_count <= 2:
+                    search_meta["groq_status"] = f"skipped (query too short: {word_count} words)"
+                elif groq_svc._looks_academic(query):
+                    search_meta["groq_status"] = "skipped (looks academic)"
+                else:
+                    search_meta["groq_status"] = "called, kept original"
         except Exception:
+            rewritten_query = query  # Fallback guaranteed
+            search_meta["groq_status"] = "error (fallback)"
+        search_meta["groq_time_ms"] = int((time.perf_counter() - start_groq) * 1000)
+    # ── Step 2: BGE-M3 encode the original AND rewrite ──────────────────
+    # Why both: The rewriter improves recall on conceptual/casual queries
+    # ("when AI makes up fake facts" -> "LLM hallucination ...") but it
+    # paraphrases away from literal title wording on known-item queries
+    # ("attention is all you need" -> "Transformer self-attention ..."),
+    # which can drop the actual famous paper out of the candidate pool
+    # entirely. Searching both forms and RRF-fusing all result lists
+    # gives us recall on both axes.
+    queries_to_encode: list[str] = [query]
+    if rewritten_query and rewritten_query != query:
+        queries_to_encode.append(rewritten_query)
+    t0_encode = time.perf_counter()
+    encoded: list[tuple] = []
+    for q in queries_to_encode:
+        try:
+            d, s = embed_svc.encode_query(q)
+            encoded.append((d, s))
+        except Exception as e:
+            print(f"[hybrid_search] Encoding failed for {q!r}: {e}")
+    search_meta["encode_time_ms"] = int((time.perf_counter() - t0_encode) * 1000)
+    if not encoded:
+        return ([], search_meta) if return_meta else []
     # How many candidates to fetch before reranking
     fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
+    # ── Step 3: Parallel dense + sparse search for every encoded form ───
+    # Build a flat list of search coroutines: [dense_q1, sparse_q1, dense_q2, sparse_q2, ...]
+    t0_retrieval = time.perf_counter()
+    tasks = []
+    task_labels = []
+    for i, (dense_vec, sparse_dict) in enumerate(encoded):
+        tasks.append(qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k))
+        task_labels.append(f"qdrant_q{i}")
+        tasks.append(zilliz_svc.search_sparse(sparse_dict, limit=fetch_k))
+        task_labels.append(f"zilliz_q{i}")
+    # Run all retrieval tasks concurrently; retrieval_time_ms below is the
+    # wall-clock time for the whole batch, not a per-task measurement.
+    raw_results = await asyncio.gather(*tasks, return_exceptions=True)
+    search_meta["retrieval_time_ms"] = int((time.perf_counter() - t0_retrieval) * 1000)
+    search_meta["n_retrieval_tasks"] = len(tasks)
+    valid_result_lists: list[list[dict]] = []
+    for r in raw_results:
+        if isinstance(r, Exception):
+            print(f"[hybrid_search] search task failed: {r}")
+            continue
+        if r:
+            valid_result_lists.append(r)
+    if not valid_result_lists:
+        return ([], search_meta) if return_meta else []
+    # ── Step 4: RRF fusion across all result lists ──────────────────────
+    t0_rrf = time.perf_counter()
+    fused = _rrf_fuse_multi(valid_result_lists, k=config.SEARCH_RRF_K)
+    search_meta["rrf_time_ms"] = int((time.perf_counter() - t0_rrf) * 1000)
     if not fused:
+        return ([], search_meta) if return_meta else []
+    # ── Step 5: Title-match boost ────────────────────────────────────────
+    # Use the user's ORIGINAL query (not the LLM rewrite) for title matching —
+    # the user's literal text is what should match a paper title.
+    t0_rerank = time.perf_counter()
+    ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)
+    rerank_total = int((time.perf_counter() - t0_rerank) * 1000)
+    search_meta["rerank_time_ms"] = rerank_total
+    # Extract sub-timings stashed by _title_match_rerank
+    if ranked:
+        turso_boost_ms = ranked[0].pop("_turso_boost_fetch_ms", 0)
+        search_meta["turso_boost_fetch_ms"] = turso_boost_ms
+        search_meta["rerank_compute_ms"] = max(0, rerank_total - turso_boost_ms)
     # ── Step 6: Return top results ───────────────────────────────────────
+    final_results = [item["arxiv_id"] for item in ranked[:limit]]
+    return (final_results, search_meta) if return_meta else final_results
 # ── RRF fusion ───────────────────────────────────────────────────────────────
     k: int = 60,
 ) -> list[dict]:
     """
+    Reciprocal Rank Fusion of two result lists (dense + sparse).
+    Kept for callers that pass exactly two lists; new code (and the
+    hybrid pipeline itself) should call _rrf_fuse_multi instead.
+    """
+    return _rrf_fuse_multi([dense_results, sparse_results], k=k)
+def _rrf_fuse_multi(
+    result_lists: list[list[dict]],
+    k: int = 60,
+) -> list[dict]:
+    """
+    Reciprocal Rank Fusion across N result lists.
+    score[paper] = sum over each list of 1/(k + rank_in_that_list)
     RRF is rank-based, so raw scores from different systems don't need
+    normalization. This means we can merge dense, sparse, AND multiple
+    encoded query forms (original + LLM-rewritten) without per-source
+    score calibration.
     Args:
+        result_lists: each list contains {'arxiv_id': str, 'score': ...}
+                      sorted best-first.
+        k: RRF constant (default 60).
     Returns:
+        list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc.
     """
     scores: dict[str, float] = {}
+    for results in result_lists:
+        for rank, item in enumerate(results, start=1):
+            aid = item["arxiv_id"]
+            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
     fused = [
         {"arxiv_id": aid, "rrf_score": score}
         for aid, score in scores.items()
     ]
     fused.sort(key=lambda x: x["rrf_score"], reverse=True)
     return fused
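A minimal standalone sketch of the fusion above, so the arithmetic can be checked by hand. The name `rrf_fuse` and the IDs `A` to `D` are illustrative assumptions, not part of the commit; only the `1 / (k + rank)` scoring rule is taken from `_rrf_fuse_multi`.

```python
def rrf_fuse(result_lists, k=60):
    """Sum 1 / (k + rank) for each ID over every list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            aid = item["arxiv_id"]
            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical result lists, best-first, as two retrievers might return them.
dense = [{"arxiv_id": "A"}, {"arxiv_id": "B"}, {"arxiv_id": "C"}]
sparse = [{"arxiv_id": "B"}, {"arxiv_id": "D"}, {"arxiv_id": "A"}]

fused = rrf_fuse([dense, sparse])
order = [aid for aid, _ in fused]
print(order)  # ['B', 'A', 'D', 'C']
```

`B` (ranks 2 and 1) edges out `A` (ranks 1 and 3) because 1/62 + 1/61 is slightly larger than 1/61 + 1/63. Papers that appear in only one list still survive, just with a lower score.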
+# ── Title-match + citation-popularity rerank ─────────────────────────────────
+# Boost magnitudes are scaled by `max_rrf` so that a meaningful title match
+# outranks the best non-matching candidate:
+#   final = rrf_score + max_rrf * (title_boost + citation_boost)
+# Ignoring the citation term, an exact title match (boost=2.0) beats the best
+# non-match by at least max_rrf, and boost=1.0 beats it by the match's own
+# rrf_score. The citation term (capped at 0.2) can narrow these margins.
+_BOOST_EXACT_TITLE = 2.0          # query == title (after normalize)
+_BOOST_SUBSTRING_TITLE = 1.0      # query is contiguous substring of title
+_BOOST_HIGH_COVERAGE = 1.0        # >= 80% of query words found in title
+_BOOST_MED_COVERAGE = 0.5         # >= 50% of query words found in title
+# Citation-popularity boost — surfaces landmark papers even when keyword
+# overlap is low. Without this, "how do transformers work in NLP" returns
+# niche papers instead of "Attention Is All You Need" because RRF favors
+# papers whose titles contain more query keywords.
+#
+# Uses log10(citations + 1) / 30, capped at 0.2:
+#   0 citations    -> 0.00 boost
+#   10 citations   -> 0.03
+#   100 citations  -> 0.07
+#   1K citations   -> 0.10
+#   10K citations  -> 0.13
+#   100K citations -> 0.17
+#   1M citations   -> 0.20 (cap)
+#
+# Cap is deliberately small (0.2 * max_rrf) so it NUDGES but doesn't
+# override title-match or strong semantic signal. A 100K-citation paper
+# still loses to a perfect title match.
+import math
+_CITATION_BOOST_CAP = 0.2         # max boost from citations alone
+_CITATION_LOG_DIVISOR = 30.0      # boost = log10(c + 1) / 30; hits the 0.2 cap at ~1M citations
+# Drop any token shorter than this from coverage calculation — single-letter
+# tokens ("a", "i") and tiny stop-likes inflate spurious matches.
+_MIN_COVERAGE_TOKEN_LEN = 2
+def _normalize_for_match(text: str) -> str:
+    """Lowercase, collapse non-alnum to single spaces, strip."""
+    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
+def _stem_plural(w: str) -> str:
+    """Trim a single trailing 's' on tokens longer than 3 chars.
+    Crude but cheap. Catches the 'space' vs 'spaces' problem in the
+    Mamba paper title without dragging in a real stemmer dependency.
+    """
+    return w[:-1] if len(w) > 3 and w.endswith("s") else w
+def _word_set(text: str) -> set[str]:
+    return {
+        _stem_plural(w) for w in text.split()
+        if len(w) >= _MIN_COVERAGE_TOKEN_LEN
+    }
+def _compute_title_boost(query_norm: str, title_raw: str) -> float:
+    """Decide how much to boost a candidate based on title overlap.
+    Order of checks (strongest signal first):
+      1. Exact match after normalization                  -> 2.0
+      2. Query is contiguous substring of normalized title -> 1.0
+         (rescues "chain of thought prompting" vs
+          "Chain-of-Thought Prompting Elicits Reasoning..." — punctuation
+          in title was the only thing blocking the old binary substring check)
+      3. Coverage: fraction of query word-stems found in title (or as
+         substring of compact title — catches "multilingual" appearing
+         in "Multi-Lingual" once spaces are stripped).
+            >= 0.8 -> _BOOST_HIGH_COVERAGE * coverage
+            >= 0.5 -> _BOOST_MED_COVERAGE * coverage
+            otherwise -> 0
+    """
+    if not query_norm or not title_raw:
+        return 0.0
+    title_norm = _normalize_for_match(title_raw)
+    if not title_norm:
+        return 0.0
+    if query_norm == title_norm:
+        return _BOOST_EXACT_TITLE
+    if query_norm in title_norm:
+        return _BOOST_SUBSTRING_TITLE
+    q_words = _word_set(query_norm)
+    if not q_words:
+        return 0.0
+    t_words = _word_set(title_norm)
+    title_compact = title_norm.replace(" ", "")
+    matches = 0
+    for w in q_words:
+        if w in t_words:
+            matches += 1
+        elif len(w) >= 4 and w in title_compact:
+            # Catches "multilingual" appearing within "multi lingual"
+            # once whitespace is stripped from the title.
+            matches += 1
+    coverage = matches / len(q_words)
+    if coverage >= 0.8:
+        return _BOOST_HIGH_COVERAGE * coverage
+    if coverage >= 0.5:
+        return _BOOST_MED_COVERAGE * coverage
+    return 0.0
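The two rescue cases named in the docstring can be checked in isolation. The sketch below copies only the normalization regex from `_normalize_for_match`; the function name and the second title ("Multi-Lingual Retrieval") are made up for illustration.

```python
import re


def normalize_for_match(text):
    """Lowercase, collapse runs of non-alphanumerics to one space, strip."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Case 1: hyphens in the title no longer block the substring check.
query = normalize_for_match("chain of thought prompting")
title = normalize_for_match("Chain-of-Thought Prompting Elicits Reasoning")
substring_hit = query in title

# Case 2: a hyphenated title word matches once spaces are stripped.
# "Multi-Lingual Retrieval" is a hypothetical title.
compact = normalize_for_match("Multi-Lingual Retrieval").replace(" ", "")
compact_hit = "multilingual" in compact

print(title)          # chain of thought prompting elicits reasoning
print(substring_hit)  # True
print(compact_hit)    # True
```

The compact check is deliberately guarded by `len(w) >= 4` in the diff, since short tokens would match inside unrelated words once spaces are removed.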
+def _compute_citation_boost(citation_count: int) -> float:
+    """Log-scaled citation boost, capped at _CITATION_BOOST_CAP.
+    The idea: a paper with 100K citations (like "Attention Is All You Need")
+    gets a small but meaningful nudge upward even when it has zero keyword
+    overlap with a beginner's query like "how do transformers work".
+    The boost is small enough that a strong title match always wins, and
+    a strong semantic RRF score always wins. But when two papers have
+    similar RRF scores and neither has a title match, the one with 100K
+    citations beats the one with 3 citations.
+    Scale (log10(citations + 1) / 30):
+      citations=0     -> 0.000
+      citations=10    -> 0.035
+      citations=100   -> 0.067
+      citations=1000  -> 0.100
+      citations=10000 -> 0.133
+      citations=100000-> 0.167
+    """
+    if citation_count <= 0:
+        return 0.0
+    raw = math.log10(citation_count + 1) / _CITATION_LOG_DIVISOR
+    return min(raw, _CITATION_BOOST_CAP)
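To sanity-check the scale, the formula can be evaluated directly. This is a standalone sketch: `CAP` and `DIVISOR` mirror `_CITATION_BOOST_CAP` and `_CITATION_LOG_DIVISOR` from the diff, and `citation_boost` is an illustrative copy rather than the module's function.

```python
import math

CAP = 0.2        # mirrors _CITATION_BOOST_CAP
DIVISOR = 30.0   # mirrors _CITATION_LOG_DIVISOR


def citation_boost(citation_count):
    """Log-scaled boost: log10(c + 1) / DIVISOR, clipped to CAP."""
    if citation_count <= 0:
        return 0.0
    return min(math.log10(citation_count + 1) / DIVISOR, CAP)


for n in (0, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} -> {citation_boost(n):.3f}")
```

With a divisor of 30, the 0.2 cap is only reached at about one million citations (log10 of 6), so a 100K-citation paper sits at roughly 0.167, below the cap.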
+async def _title_match_rerank(
+    fused: list[dict],
+    user_query: str,
+    top_n_for_boost: int = 50,
+) -> list[dict]:
     """
+    Boost candidates by title overlap + citation popularity.
+    Two signals, both based on metadata we already fetch from Turso:
+    1. Title boost (strong): exact/substring/coverage match between the
+       user's ORIGINAL query and paper titles. Rescues known-item queries.
+    2. Citation boost (gentle): log-scaled citation count, capped at 0.2x
+       max_rrf. Rescues landmark papers for beginner queries where keyword
+       overlap is low but the paper is obviously important.
+    The final score is:
+      final = rrf_score + max_rrf * (title_boost + citation_boost)
+    Safe under partial Turso failure: papers with missing metadata get
+    boost=0 and rank by RRF alone.
     """
     if not fused:
         return fused
+    q_norm = _normalize_for_match(user_query)
+    if not q_norm:
+        for item in fused:
+            item["final_score"] = item["rrf_score"]
+        return fused
+    candidate_ids = [item["arxiv_id"] for item in fused[:top_n_for_boost]]
+    titles: dict[str, str] = {}
+    citations: dict[str, int] = {}
+    import time as _time
+    _t0_turso_boost = _time.perf_counter()
+    try:
+        meta = await turso_svc.fetch_metadata_batch(candidate_ids)
+        titles = {aid: (m.get("title") or "") for aid, m in meta.items()}
+        citations = {aid: (m.get("citation_count") or 0) for aid, m in meta.items()}
+    except Exception as e:
+        print(f"[hybrid_search] Metadata fetch for boost failed: {e}")
+        for item in fused:
+            item["final_score"] = item["rrf_score"]
+        return fused
+    _turso_boost_ms = int((_time.perf_counter() - _t0_turso_boost) * 1000)
+    max_rrf = max(item["rrf_score"] for item in fused)
+    for item in fused:
         aid = item["arxiv_id"]
+        t_boost = _compute_title_boost(q_norm, titles.get(aid, ""))
+        c_boost = _compute_citation_boost(citations.get(aid, 0))
+        item["title_boost"] = t_boost
+        item["citation_boost"] = c_boost
+        item["final_score"] = item["rrf_score"] + max_rrf * (t_boost + c_boost)
     fused.sort(key=lambda x: x["final_score"], reverse=True)
+    # Stash the fetch time on the top item AFTER sorting. The caller pops it
+    # from ranked[0], so stashing before the sort loses it whenever the
+    # rerank changes which item comes first.
+    fused[0]["_turso_boost_fetch_ms"] = _turso_boost_ms
     return fused

app/qdrant_svc.py CHANGED

@@ -10,6 +10,7 @@ The collection is 'arxiv_bgem3_dense' with integer point IDs and 1024-dim BGE-M3
 from __future__ import annotations
 import asyncio
 from functools import lru_cache
 from qdrant_client import QdrantClient
@@ -166,21 +167,75 @@ def _run_recommend(
 # ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
 async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
     """
-    Fetch actual BGE-M3 embedding vectors for papers from Qdrant.
     Returns {arxiv_id: vector_list} for papers found.
-    Used by EWMA profile updates — we need the paper's embedding
-    to blend into the user's profile vector.
     """
     if not arxiv_ids:
         return {}
-    id_map = await lookup_qdrant_ids(arxiv_ids)
     if not id_map:
-        return {}
     point_ids = list(id_map.values())
     arxiv_by_point = {v: k for k, v in id_map.items()}
@@ -192,9 +247,8 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
         )
     except Exception as e:
         print(f"[qdrant_svc] get_paper_vectors error: {e}")
-        return {}
-    result = {}
     for p in points:
         aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
         if aid and p.vector:
@@ -202,6 +256,7 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
             vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
             if isinstance(vec, list):
                 result[aid] = vec
     return result
@@ -250,6 +305,7 @@ async def search_by_vector_with_scores(
     query_vector: list[float],
     limit: int = 20,
     exclude_ids: set[str] | None = None,
 ) -> list[dict]:
     """
     Vector search returning both arxiv_ids AND cosine scores.
@@ -257,29 +313,43 @@ async def search_by_vector_with_scores(
     Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
     score desc, excluding any in exclude_ids.
-    Used by the recommendation pipeline (Phase 6.1+) to feed
-    qdrant_cosine_score (feature slot 0) to the LightGBM reranker.
     """
     loop = asyncio.get_event_loop()
     try:
         results = await loop.run_in_executor(
             None, _run_vector_search, query_vector,
             (limit * 2) if exclude_ids else limit,
         )
     except Exception as e:
         print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
         return []
     exclude = exclude_ids or set()
-    filtered = [
-        {"arxiv_id": r.payload["arxiv_id"], "score": float(r.score)}
-        for r in results
-        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in exclude
-    ]
-    return filtered[:limit]
-def _run_vector_search(query_vector: list[float], limit: int) -> list:
     """Sync helper: nearest-neighbour search by vector."""
     client = _client()
     result = client.query_points(
@@ -287,7 +357,7 @@ def _run_vector_search(query_vector: list[float], limit: int) -> list:
         query=query_vector,
         limit=limit,
         with_payload=True,
-        with_vectors=False,
     )
     return result.points

 from __future__ import annotations
 import asyncio
+from collections import OrderedDict
 from functools import lru_cache
 from qdrant_client import QdrantClient
 # ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
+#
+# In-process LRU vector cache.
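+# Profiling showed Qdrant Cloud free tier reads candidate vectors from
+# disk on every retrieve(), which dominated Tier 1 latency (9-18s for
+# 120 vectors). Vectors are 1024 floats = 4KB each as packed float32,
+# so a 25K cap is ~100MB in that form. Stored as Python list[float], as
+# here, each vector costs roughly 32KB (an 8-byte pointer plus a 24-byte
+# float object per element), so budget closer to ~800MB at the cap. The
+# same papers appear across users' candidate sets (Zipf), so steady-state
+# hit rate should be high.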
+#
+# Vectors don't change once uploaded, so no TTL.
+_VECTOR_CACHE: "OrderedDict[str, list[float]]" = OrderedDict()
+_VECTOR_CACHE_MAX = 25_000
+def _vec_cache_get(arxiv_id: str) -> list[float] | None:
+    val = _VECTOR_CACHE.get(arxiv_id)
+    if val is not None:
+        _VECTOR_CACHE.move_to_end(arxiv_id)
+    return val
+def _vec_cache_put(arxiv_id: str, vec: list[float]) -> None:
+    if arxiv_id in _VECTOR_CACHE:
+        _VECTOR_CACHE.move_to_end(arxiv_id)
+        _VECTOR_CACHE[arxiv_id] = vec
+        return
+    _VECTOR_CACHE[arxiv_id] = vec
+    if len(_VECTOR_CACHE) > _VECTOR_CACHE_MAX:
+        _VECTOR_CACHE.popitem(last=False)
+def vector_cache_stats() -> dict:
+    return {"size": len(_VECTOR_CACHE), "max": _VECTOR_CACHE_MAX}
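The eviction behaviour of the `OrderedDict` cache can be seen with a tiny cap. This is a sketch of the same pattern, not the module itself: `cache`, `CACHE_MAX`, `cache_get` and `cache_put` are illustrative names, and the cap of 2 stands in for `_VECTOR_CACHE_MAX = 25_000`.

```python
from collections import OrderedDict

cache = OrderedDict()
CACHE_MAX = 2  # tiny cap for illustration only


def cache_get(key):
    """Return the cached value and mark it most recently used."""
    val = cache.get(key)
    if val is not None:
        cache.move_to_end(key)
    return val


def cache_put(key, vec):
    """Insert or refresh a value, evicting the least recently used entry."""
    cache[key] = vec
    cache.move_to_end(key)
    if len(cache) > CACHE_MAX:
        cache.popitem(last=False)  # drop the oldest entry


cache_put("a", [0.1])
cache_put("b", [0.2])
cache_get("a")         # "a" becomes most recently used
cache_put("c", [0.3])  # over the cap, so "b" is evicted

print(list(cache))  # ['a', 'c']
```

Reads refresh recency as well as writes, which is what keeps frequently requested papers resident under a Zipf-like access pattern.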
 async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
     """
+    Fetch BGE-M3 embedding vectors for papers from Qdrant.
     Returns {arxiv_id: vector_list} for papers found.
+    Cached in-process by arxiv_id; only un-cached IDs hit Qdrant. The
+    Qdrant retrieve() that pulls the actual stored vectors is the
+    single most expensive call in the pipeline (BQ -> disk read), so
+    absorbing repeats here is a big win.
+    Used by:
+      - EWMA profile updates on save (events.py)
+      - Cluster medoid embedding load (recommendations.py)
+      - Tier 1 candidate vector fetch (recommendations.py, ~120 IDs)
     """
     if not arxiv_ids:
         return {}
+    # Cache check first — pull anything we already know.
+    result: dict[str, list[float]] = {}
+    misses: list[str] = []
+    for aid in arxiv_ids:
+        cached = _vec_cache_get(aid)
+        if cached is not None:
+            result[aid] = cached
+        else:
+            misses.append(aid)
+    if not misses:
+        return result
+    id_map = await lookup_qdrant_ids(misses)
     if not id_map:
+        return result
     point_ids = list(id_map.values())
     arxiv_by_point = {v: k for k, v in id_map.items()}
         )
     except Exception as e:
         print(f"[qdrant_svc] get_paper_vectors error: {e}")
+        return result
     for p in points:
         aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
         if aid and p.vector:
             vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
             if isinstance(vec, list):
                 result[aid] = vec
+                _vec_cache_put(aid, vec)
     return result
     query_vector: list[float],
     limit: int = 20,
     exclude_ids: set[str] | None = None,
+    with_vectors: bool = False,
 ) -> list[dict]:
     """
     Vector search returning both arxiv_ids AND cosine scores.
     Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
     score desc, excluding any in exclude_ids.
+    If `with_vectors=True`, each dict also has a 'vector' key holding the
+    1024-dim BGE-M3 embedding. Returning vectors in the search response
+    avoids a separate `client.retrieve()` round-trip later — that retrieve
+    was ~9-18s on cold candidates because BQ rescore reads from disk.
     """
     loop = asyncio.get_event_loop()
     try:
         results = await loop.run_in_executor(
             None, _run_vector_search, query_vector,
             (limit * 2) if exclude_ids else limit,
+            with_vectors,
         )
     except Exception as e:
         print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
         return []
     exclude = exclude_ids or set()
+    out: list[dict] = []
+    for r in results:
+        aid = r.payload.get("arxiv_id")
+        if not aid or aid in exclude:
+            continue
+        item = {"arxiv_id": aid, "score": float(r.score)}
+        if with_vectors and r.vector:
+            # Named vectors return a dict; unnamed returns a list.
+            vec = r.vector if isinstance(r.vector, list) else r.vector.get("dense", r.vector)
+            if isinstance(vec, list):
+                item["vector"] = vec
+        out.append(item)
+        if len(out) >= limit:
+            break
+    return out
+def _run_vector_search(
+    query_vector: list[float], limit: int, with_vectors: bool = False,
+) -> list:
     """Sync helper: nearest-neighbour search by vector."""
     client = _client()
     result = client.query_points(
         query=query_vector,
         limit=limit,
         with_payload=True,
+        with_vectors=with_vectors,
     )
     return result.points

app/recommend/clustering.py CHANGED

@@ -17,6 +17,7 @@ Reference: Research-MultiInterest_Recommender_Architecture.md §2
 from __future__ import annotations
 import json
 from dataclasses import dataclass, field
 import numpy as np
 from scipy.cluster.hierarchy import ward, fcluster
@@ -34,6 +35,14 @@ WARD_DISTANCE_THRESHOLD = 1.5
 MIN_CLUSTERS = 1
 MAX_CLUSTERS = 7   # RFC: PinnerSage uses 3-5 for typical users, cap at 7
 # Minimum saved papers before clustering is meaningful
 MIN_PAPERS_FOR_CLUSTERING = 5
@@ -132,14 +141,36 @@ def compute_clusters(
     # Cut the dendrogram at the adaptive threshold
     labels = fcluster(linkage, t=threshold, criterion="distance")
-    # Clamp cluster count
     unique_labels = np.unique(labels)
     n_clusters = len(unique_labels)
-    # If too many clusters, re-cut with a maxclust constraint
     if n_clusters > MAX_CLUSTERS:
         labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
         unique_labels = np.unique(labels)
     # Compute recency weights (position-based: most recent = highest weight)
     recency_weights = np.array([
@@ -184,6 +215,49 @@ def _find_medoid(embeddings: np.ndarray, centroid: np.ndarray) -> int:
     return int(np.argmin(distances))
 # ── Cluster ID stabilisation (Phase 4.2) ─────────────────────────────────────
 # Hungarian matches below this cosine similarity are rejected as "unrelated".

 from __future__ import annotations
 import json
+import math
 from dataclasses import dataclass, field
 import numpy as np
 from scipy.cluster.hierarchy import ward, fcluster
 MIN_CLUSTERS = 1
 MAX_CLUSTERS = 7   # RFC: PinnerSage uses 3-5 for typical users, cap at 7
+# Average papers per cluster floor, used to derive a soft cap on K from N:
+# K_soft_cap = max(MIN_CLUSTERS, min(MAX_CLUSTERS, ceil(N / AVG_CLUSTER_SIZE_FLOOR))).
+# Set to 4: at N=5 -> K_max=2, at N=10 -> K_max=3, at N=28 -> K_max=7.
+# Without this, gap-based thresholding over-splits at small N: 5 same-domain
+# papers were producing K=4 (3 singletons), which then got over-weighted by
+# the quota floor of 3 slots per cluster.
+AVG_CLUSTER_SIZE_FLOOR = 4
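The soft cap is easy to tabulate. A small sketch follows, assuming the same three constants as the diff; the helper name `soft_cap` is made up here, since the commit computes the value inline inside `compute_clusters`.

```python
import math

MIN_CLUSTERS = 1
MAX_CLUSTERS = 7
AVG_CLUSTER_SIZE_FLOOR = 4


def soft_cap(n):
    """Largest K that keeps the average cluster size >= the floor."""
    return max(
        MIN_CLUSTERS,
        min(MAX_CLUSTERS, math.ceil(n / AVG_CLUSTER_SIZE_FLOOR)),
    )


for n in (5, 10, 28, 100):
    print(n, soft_cap(n))
```

The hard cap takes over from N=25 upward, where `ceil(N / 4)` first reaches 7.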
 # Minimum saved papers before clustering is meaningful
 MIN_PAPERS_FOR_CLUSTERING = 5
     # Cut the dendrogram at the adaptive threshold
     labels = fcluster(linkage, t=threshold, criterion="distance")
+    # Clamp cluster count.
+    # Two layers:
+    #   1. Hard cap: never exceed MAX_CLUSTERS (=7) regardless of N.
+    #   2. Soft cap: keep average cluster size >= AVG_CLUSTER_SIZE_FLOOR.
+    #      This prevents the gap-detection from over-splitting small N
+    #      (e.g. 5 same-domain saves were producing K=4 with 3 singletons,
+    #      which then got over-weighted by the quota floor of 3 slots).
+    soft_cap = max(
+        MIN_CLUSTERS,
+        min(MAX_CLUSTERS, math.ceil(n / AVG_CLUSTER_SIZE_FLOOR)),
+    )
     unique_labels = np.unique(labels)
     n_clusters = len(unique_labels)
     if n_clusters > MAX_CLUSTERS:
         labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
         unique_labels = np.unique(labels)
+        n_clusters = len(unique_labels)
+    if n_clusters > soft_cap:
+        labels = fcluster(linkage, t=soft_cap, criterion="maxclust")
+        unique_labels = np.unique(labels)
+        n_clusters = len(unique_labels)
+    # Final safety net: merge any remaining singleton clusters into their
+    # nearest non-singleton neighbour. The soft cap usually eliminates them,
+    # but a 6-1-1-1 distribution after maxclust=4 would still leave 3.
+    labels = _merge_singletons(labels, embeddings)
+    unique_labels = np.unique(labels)
     # Compute recency weights (position-based: most recent = highest weight)
     recency_weights = np.array([
     return int(np.argmin(distances))
+def _merge_singletons(labels: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
+    """Merge singleton clusters into their nearest non-singleton cluster.
+    Why: Ward's gap-based threshold can over-split at small N, producing
+    1-paper clusters that get over-weighted by the quota floor (3 slots
+    per cluster regardless of importance). Merging singletons into the
+    nearest non-singleton cluster preserves the multi-interest signal
+    where it's real and removes spurious singletons where it's noise.
+    Edge case: if every cluster is a singleton (all papers maximally
+    distant), we leave the labels alone — collapsing them would erase
+    a genuine multi-interest signal.
+    """
+    unique_labels, counts = np.unique(labels, return_counts=True)
+    singleton_labels = unique_labels[counts == 1]
+    non_singleton_labels = unique_labels[counts > 1]
+    if len(singleton_labels) == 0:
+        return labels  # nothing to merge
+    if len(non_singleton_labels) == 0:
+        return labels  # all singletons — keep as is
+    centroids: dict[int, np.ndarray] = {}
+    for ns_label in non_singleton_labels:
+        ns_mask = labels == ns_label
+        centroids[int(ns_label)] = embeddings[ns_mask].mean(axis=0)
+    new_labels = labels.copy()
+    for s_label in singleton_labels:
+        s_idx = int(np.where(labels == s_label)[0][0])
+        s_emb = embeddings[s_idx]
+        best_label = int(s_label)
+        best_dist = float("inf")
+        for ns_label, centroid in centroids.items():
+            d = float(np.linalg.norm(s_emb - centroid))
+            if d < best_dist:
+                best_dist = d
+                best_label = ns_label
+        new_labels[s_idx] = best_label
+    return new_labels
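A dependency-free sketch of the same merge rule, using plain lists and 2-D points instead of NumPy arrays and 1024-dim embeddings. The function name, the points and the labels are all invented for illustration; only the rule (reassign each singleton to the nearest non-singleton centroid, leave all-singleton inputs alone) follows `_merge_singletons`.

```python
import math
from collections import Counter


def merge_singletons(labels, points):
    """Reassign each 1-member cluster to the nearest multi-member centroid."""
    counts = Counter(labels)
    big = [lab for lab, c in counts.items() if c > 1]
    if not big or len(big) == len(counts):
        return list(labels)  # all singletons, or none: nothing to merge
    centroids = {}
    for lab in big:
        members = [p for p, l in zip(points, labels) if l == lab]
        centroids[lab] = [sum(dim) / len(members) for dim in zip(*members)]
    return [
        lab if counts[lab] > 1
        else min(centroids, key=lambda c: math.dist(p, centroids[c]))
        for p, lab in zip(points, labels)
    ]


# Two tight groups plus one stray point that got its own label (3).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (4.8, 5.2)]
labels = [1, 1, 2, 2, 3]

merged = merge_singletons(labels, points)
print(merged)  # [1, 1, 2, 2, 2]
```

Centroids are computed from the original non-singleton members only, so the result does not depend on the order in which singletons are merged.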
 # ── Cluster ID stabilisation (Phase 4.2) ─────────────────────────────────────
 # Hungarian matches below this cosine similarity are rejected as "unrelated".

app/recommend/reranker.py CHANGED

@@ -45,7 +45,7 @@ try:
         if _path and os.path.isfile(_path):
             _lgb_model = lgb.Booster(model_file=_path)
             _USE_LGB = True
-            print(f"[reranker] ✅ LightGBM model loaded from {_path}")
             print(f"[reranker]   trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
             break

         if _path and os.path.isfile(_path):
             _lgb_model = lgb.Booster(model_file=_path)
             _USE_LGB = True
+            print(f"[reranker] SUCCESS: LightGBM model loaded from {_path}")
             print(f"[reranker]   trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
             break

app/routers/onboarding.py CHANGED

@@ -9,7 +9,7 @@ POST /api/onboarding/skip           → mark done (no categories), redirect to /
 """
 import uuid
 import json
-from fastapi import APIRouter, Request, Cookie, Form
 from fastapi.responses import HTMLResponse, RedirectResponse
 from app import db
 from app.config import COOKIE_NAME, CATEGORY_GROUPS
@@ -116,20 +116,14 @@ async def seed_search(
             except Exception:
                 pass
-    # Check current save count
-    from app import user_state as us
-    state = await us.ensure_loaded(user_id)
-    seed_count = len(state.positives)
     resp = templates.TemplateResponse(
         request,
-        "partials/seed_search.html",
-        {
-            "papers": papers,
-            "query": q,
-            "seed_count": seed_count,
-            "seed_target": 5,
-        },
     )
     resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
     return resp
@@ -161,90 +155,4 @@ async def skip_onboarding(
     return resp
-@router.post("/api/onboarding/import-author", response_class=HTMLResponse)
-async def import_author(
-    request: Request,
-    author_url: str = Form(default=""),
-    user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
-):
-    """Phase 5.1: Import papers from a Semantic Scholar author profile.
-    Accepts S2 URL, raw S2 author ID, or ORCID.
-    Auto-saves the author's arXiv papers as seed interests.
-    """
-    user_id = user_id or str(uuid.uuid4())
-    if not author_url.strip():
-        return HTMLResponse(
-            '<div class="alert alert-warning text-sm py-2">'
-            '⚠️ Please paste a Semantic Scholar author URL, ID, or ORCID.</div>'
-        )
-    from app import s2_svc, user_state as us
-    # 1. Parse input
-    parsed_id, input_type = s2_svc.parse_author_input(author_url)
-    if parsed_id is None:
-        return HTMLResponse(
-            '<div class="alert alert-error text-sm py-2">'
-            '❌ Could not recognise input. Paste a Semantic Scholar author URL, '
-            'a numeric author ID, or an ORCID (e.g. 0000-0003-3394-6622).</div>'
-        )
-    # 2. Resolve ORCID → S2 author ID if needed
-    try:
-        if input_type == "orcid":
-            s2_id = await s2_svc.resolve_orcid(parsed_id)
-            if not s2_id:
-                return HTMLResponse(
-                    '<div class="alert alert-warning text-sm py-2">'
-                    f'⚠️ No Semantic Scholar author found for ORCID {parsed_id}.</div>'
-                )
-        else:
-            s2_id = parsed_id
-    except Exception as e:
-        print(f"[onboarding] ORCID resolve failed: {e}")
-        return HTMLResponse(
-            '<div class="alert alert-error text-sm py-2">'
-            '❌ Failed to look up ORCID. Please try pasting the S2 URL directly.</div>'
-        )
-    # 3. Fetch arXiv papers
-    try:
-        arxiv_ids = await s2_svc.fetch_author_arxiv_papers(s2_id, limit=20)
-    except Exception as e:
-        print(f"[onboarding] S2 author paper fetch failed: {e}")
-        return HTMLResponse(
-            '<div class="alert alert-error text-sm py-2">'
-            '❌ Failed to fetch papers from Semantic Scholar. '
-            'The author ID may be invalid, or the API may be down.</div>'
-        )
-    if not arxiv_ids:
-        return HTMLResponse(
-            '<div class="alert alert-warning text-sm py-2">'
-            '⚠️ No arXiv papers found for this author. '
-            'They may publish in venues not indexed on arXiv.</div>'
-        )
-    # 4. Auto-save each paper as a positive interaction
-    for aid in arxiv_ids:
-        us.record_positive(user_id, aid)
-        await db.log_interaction(
-            user_id=user_id,
-            paper_id=aid,
-            event_type="save",
-            source="s2_import",
-        )
-    state = await us.ensure_loaded(user_id)
-    seed_count = len(state.positives)
-    resp = HTMLResponse(
-        f'<div class="alert alert-success text-sm py-2">'
-        f'✅ Imported {len(arxiv_ids)} papers! '
-        f'You now have {seed_count} saved papers. '
-        f'Click <strong>"Done — start exploring →"</strong> to see your recommendations.</div>'
-    )
-    resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
-    return resp

 """
 import uuid
 import json
+from fastapi import APIRouter, Request, Cookie
 from fastapi.responses import HTMLResponse, RedirectResponse
 from app import db
 from app.config import COOKIE_NAME, CATEGORY_GROUPS
             except Exception:
                 pass
+    # HTMX request: return ONLY the results partial (swap target = #seed-results).
+    # The full seed_search.html panel is rendered by save_categories() during the
+    # step 1 → step 2 transition; subsequent searches must not re-render the whole
+    # panel or it nests inside #seed-results and duplicates the wizard.
     resp = templates.TemplateResponse(
         request,
+        "partials/seed_results.html",
+        {"papers": papers, "query": q},
     )
     resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
     return resp
     return resp

app/routers/recommendations.py CHANGED

@@ -16,6 +16,7 @@ Phase 4 changes vs Phase 2b:
   - Category-level suppression filters strongly disliked topics (4.3)
 """
 import asyncio
 import uuid
 import numpy as np
 from fastapi import APIRouter, Request, Cookie
@@ -110,9 +111,11 @@ async def get_recommendations(
     # populated by whichever tier serves the result.
     paper_tags: dict[str, dict] = {}
     rec_arxiv_ids: list[str] = []
     # ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
-    rec_arxiv_ids, paper_tags = await _multi_interest_recommend(
         user_id, state, seen, REC_LIMIT, query_id=query_id,
     )
@@ -151,6 +154,7 @@ async def get_recommendations(
         return _empty_resp()
     # Phase 3.5: Turso primary, arXiv API fallback
     meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
     missing = [aid for aid in rec_arxiv_ids if aid not in meta]
     if missing:
@@ -159,6 +163,8 @@ async def get_recommendations(
             meta.update(arxiv_meta)
         except Exception as e:
             print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
     # Cache to SQLite so category suppression JOINs work (Phase 4.3)
     await db.cache_turso_metadata_batch(list(meta.values()))
@@ -187,7 +193,12 @@ async def get_recommendations(
     resp = templates.TemplateResponse(
         request,
         "partials/recommendations.html",
-        {"papers": papers},
     )
     resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
     return resp
@@ -210,18 +221,20 @@ async def _multi_interest_recommend(
       7. MMR diversity → select top-k with diversity
       8. Exploration injection → serendipitous papers
-    Returns ([], {}) to trigger fallback to Tier 2.
     Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
     """
     positives = state.positive_list
     if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
-        return [], {}
     try:
         # Fetch embeddings for all saved papers
         vectors = await qdrant_svc.get_paper_vectors(positives)
         if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
-            return [], {}
         # Build aligned arrays (only papers we got vectors for)
         aligned_ids = [pid for pid in positives if pid in vectors]
@@ -230,6 +243,7 @@ async def _multi_interest_recommend(
         )
         # ── Step 1: Compute interest clusters ─────────────────────────────
         clusters = compute_clusters(aligned_ids, aligned_embs)
         # ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
@@ -267,6 +281,7 @@ async def _multi_interest_recommend(
                 clusters = stabilize_cluster_ids(clusters, old_clusters)
         await save_clusters_to_db(user_id, clusters)
         # Phase 6.5 B3: append snapshot for cluster history (non-blocking)
         try:
@@ -289,8 +304,15 @@ async def _multi_interest_recommend(
         quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
         # ── Step 3: Parallel per-cluster ANN searches ─────────────────────
         st_vec = await profiles.load_profile(user_id, "short_term")
         search_tasks = [
             qdrant_svc.search_by_vector_with_scores(
                 query_vector=c.medoid_embedding.tolist(),
@@ -301,20 +323,16 @@ async def _multi_interest_recommend(
         ]
         per_cluster_scored = await asyncio.gather(*search_tasks)
-        # Build paper → cluster map AND real qdrant_score_map in one pass.
-        # Phase 6.5 A1: replaces the old rank-based linear decay approximation.
         paper_cluster_map: dict[str, int] = {}
         qdrant_score_map: dict[str, float] = {}
         for cluster, scored_results in zip(clusters, per_cluster_scored):
             for hit in scored_results:
                 aid = hit["arxiv_id"]
-                if aid not in paper_cluster_map:  # first-occurrence wins
                     paper_cluster_map[aid] = cluster.cluster_idx
-                # Keep highest cosine if a paper appears in multiple clusters
                 if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
                     qdrant_score_map[aid] = float(hit["score"])
-        # merge_quota_results expects list[list[str]] — extract IDs
         per_cluster_ids = [
             [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
         ]
@@ -337,9 +355,14 @@ async def _multi_interest_recommend(
                     qdrant_score_map[aid] = float(hit["score"])
         if not candidate_ids:
-            return [], {}
         # ── Step 5: Fetch candidate vectors + metadata ────────────────────
         cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
         cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
         cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
@@ -356,7 +379,8 @@ async def _multi_interest_recommend(
         # Only process candidates with both vectors and metadata
         valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
         if not valid_ids:
-            return candidate_ids[:limit], {}
         valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
         valid_meta = [cand_meta[cid] for cid in valid_ids]
@@ -427,6 +451,7 @@ async def _multi_interest_recommend(
         )
         # ── Step 6: LightGBM re-ranking (37 features) ────────────────────
         reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
             candidate_ids=valid_ids,
             candidate_embeddings=valid_embs,
@@ -443,6 +468,8 @@ async def _multi_interest_recommend(
             user_total_saves=user_total_saves,
             user_total_dismissals=user_total_dismissals,
         )
         # ── Step 4.3: Category suppression (post-rerank safety net) ───────
         # The model now sees feature 25 (is_suppressed_category), but we
@@ -459,6 +486,7 @@ async def _multi_interest_recommend(
                 reranked_embs = reranked_embs[kept]
         # ── Step 7: MMR diversity enforcement ─────────────────────────────
         query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
         mmr_selected = mmr_rerank(
             query_embedding=query_vec,
@@ -468,6 +496,7 @@ async def _multi_interest_recommend(
             lambda_param=0.6,
             top_k=limit,
         )
         # ── Step 8: Exploration injection ─────────────────────────────────
         final = inject_exploration(
@@ -508,11 +537,11 @@ async def _multi_interest_recommend(
                 "policy_id": _RANKER_VERSION,
             }
-        return final, paper_tags
     except Exception as e:
-        print(f"[recommendations] multi-interest search failed: {e}")
-        return [], {}
 # ── Tier 2: EWMA single-vector search ────────────────────────────────────────

   - Category-level suppression filters strongly disliked topics (4.3)
 """
 import asyncio
+import time
 import uuid
 import numpy as np
 from fastapi import APIRouter, Request, Cookie
     # populated by whichever tier serves the result.
     paper_tags: dict[str, dict] = {}
     rec_arxiv_ids: list[str] = []
+    rerank_time_ms = 0
+    timing_breakdown: dict = {}
     # ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
+    rec_arxiv_ids, paper_tags, rerank_time_ms, timing_breakdown = await _multi_interest_recommend(
         user_id, state, seen, REC_LIMIT, query_id=query_id,
     )
         return _empty_resp()
     # Phase 3.5: Turso primary, arXiv API fallback
+    t0_meta = time.time()
     meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
     missing = [aid for aid in rec_arxiv_ids if aid not in meta]
     if missing:
             meta.update(arxiv_meta)
         except Exception as e:
             print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
+    t1_meta = time.time()
+    meta_time_ms = int((t1_meta - t0_meta) * 1000)
     # Cache to SQLite so category suppression JOINs work (Phase 4.3)
     await db.cache_turso_metadata_batch(list(meta.values()))
     resp = templates.TemplateResponse(
         request,
         "partials/recommendations.html",
+        {
+            "papers": papers,
+            "rerank_time_ms": rerank_time_ms,
+            "meta_time_ms": meta_time_ms,
+            "timing": timing_breakdown,
+        },
     )
     resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
     return resp
       7. MMR diversity → select top-k with diversity
       8. Exploration injection → serendipitous papers
+    Returns ([], {}, 0, {}) to trigger fallback to Tier 2.
     Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
     """
     positives = state.positive_list
     if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
+        return [], {}, 0, {}
     try:
         # Fetch embeddings for all saved papers
         vectors = await qdrant_svc.get_paper_vectors(positives)
         if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
+            return [], {}, 0, {}
+        timing = {}  # Collect per-stage timing breakdown
         # Build aligned arrays (only papers we got vectors for)
         aligned_ids = [pid for pid in positives if pid in vectors]
         )
         # ── Step 1: Compute interest clusters ─────────────────────────────
+        t0_cluster = time.time()
         clusters = compute_clusters(aligned_ids, aligned_embs)
         # ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
                 clusters = stabilize_cluster_ids(clusters, old_clusters)
         await save_clusters_to_db(user_id, clusters)
+        timing["clustering_ms"] = int((time.time() - t0_cluster) * 1000)
         # Phase 6.5 B3: append snapshot for cluster history (non-blocking)
         try:
         quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
         # ── Step 3: Parallel per-cluster ANN searches ─────────────────────
+        t0_ann = time.time()
         st_vec = await profiles.load_profile(user_id, "short_term")
+        # NOTE on latency: we previously tried passing with_vectors=True
+        # to fold the candidate-vector fetch into the search call. That
+        # made it *worse* on Qdrant Cloud free tier — search latency
+        # ballooned from ~2s to ~40s because returning vectors triggers
+        # a per-result disk read inside the search path. Keep the search
+        # vector-free; vectors come from a separate cached retrieve.
         search_tasks = [
             qdrant_svc.search_by_vector_with_scores(
                 query_vector=c.medoid_embedding.tolist(),
         ]
         per_cluster_scored = await asyncio.gather(*search_tasks)
         paper_cluster_map: dict[str, int] = {}
         qdrant_score_map: dict[str, float] = {}
         for cluster, scored_results in zip(clusters, per_cluster_scored):
             for hit in scored_results:
                 aid = hit["arxiv_id"]
+                if aid not in paper_cluster_map:
                     paper_cluster_map[aid] = cluster.cluster_idx
                 if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
                     qdrant_score_map[aid] = float(hit["score"])
         per_cluster_ids = [
             [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
         ]
                     qdrant_score_map[aid] = float(hit["score"])
         if not candidate_ids:
+            return [], {}, 0, {}
+        timing["ann_retrieval_ms"] = int((time.time() - t0_ann) * 1000)
         # ── Step 5: Fetch candidate vectors + metadata ────────────────────
+        # get_paper_vectors is now LRU-cached by arxiv_id (qdrant_svc),
+        # so warm cache makes this cheap; only fresh papers pay the
+        # disk-read cost.
+        t0_cand_meta = time.time()
         cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
         cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
         cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
         # Only process candidates with both vectors and metadata
         valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
         if not valid_ids:
+            return candidate_ids[:limit], {}, 0, {}
+        timing["candidate_meta_ms"] = int((time.time() - t0_cand_meta) * 1000)
         valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
         valid_meta = [cand_meta[cid] for cid in valid_ids]
         )
         # ── Step 6: LightGBM re-ranking (37 features) ────────────────────
+        t0_rerank = time.time()
         reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
             candidate_ids=valid_ids,
             candidate_embeddings=valid_embs,
             user_total_saves=user_total_saves,
             user_total_dismissals=user_total_dismissals,
         )
+        t1_rerank = time.time()
+        rerank_time_ms = int((t1_rerank - t0_rerank) * 1000)
         # ── Step 4.3: Category suppression (post-rerank safety net) ───────
         # The model now sees feature 25 (is_suppressed_category), but we
                 reranked_embs = reranked_embs[kept]
         # ── Step 7: MMR diversity enforcement ─────────────────────────────
+        t0_mmr = time.time()
         query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
         mmr_selected = mmr_rerank(
             query_embedding=query_vec,
             lambda_param=0.6,
             top_k=limit,
         )
+        timing["mmr_ms"] = int((time.time() - t0_mmr) * 1000)
         # ── Step 8: Exploration injection ─────────────────────────────────
         final = inject_exploration(
                 "policy_id": _RANKER_VERSION,
             }
+        return final, paper_tags, rerank_time_ms, timing
     except Exception as e:
+        print(f"[recommendations] multi-interest preprocessing failed: {e}")
+        return [], {}, 0, {}
 # ── Tier 2: EWMA single-vector search ────────────────────────────────────────
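The new side of Step 3 drops the inline comments on the merge loop, so the two tie-breaking rules are easy to miss: cluster assignment is first-occurrence-wins, while the stored score is the highest cosine seen across all clusters. A minimal sketch of that loop in isolation, assuming each hit is a dict with `arxiv_id` and `score` keys as in the diff; `merge_cluster_hits` is a hypothetical name used only for illustration, not a function in the repository.

```python
def merge_cluster_hits(cluster_ids, per_cluster_scored):
    """Hypothetical helper mirroring the merge loop in _multi_interest_recommend."""
    paper_cluster_map: dict[str, int] = {}
    qdrant_score_map: dict[str, float] = {}
    for cluster_idx, scored_results in zip(cluster_ids, per_cluster_scored):
        for hit in scored_results:
            aid = hit["arxiv_id"]
            # First occurrence wins: an existing cluster assignment is never overwritten.
            paper_cluster_map.setdefault(aid, cluster_idx)
            # Keep the highest score if the paper shows up in several clusters.
            score = float(hit["score"])
            if score > qdrant_score_map.get(aid, float("-inf")):
                qdrant_score_map[aid] = score
    return paper_cluster_map, qdrant_score_map
```

One consequence worth keeping in mind when reading the telemetry: a paper's recorded cluster and its recorded score can come from different clusters, because the two maps are updated under different rules.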

app/routers/search.py CHANGED

@@ -27,17 +27,23 @@ async def search(
     q: str = "",
     user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
 ):
     papers = []
     if q.strip():
         # Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
         try:
-            arxiv_ids = await hybrid_search_svc.search(q.strip(), limit=ARXIV_MAX_RESULTS)
         except Exception as e:
             print(f"[search] Hybrid search error: {e}")
             arxiv_ids = []
         if arxiv_ids:
             # Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
             try:
                 meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
             except Exception as e:
@@ -52,6 +58,8 @@ async def search(
                     meta.update(arxiv_meta)
                 except Exception as e:
                     print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
             # Phase 4.3: Cache to SQLite so dismissal category JOINs work
             await db.cache_turso_metadata_batch(list(meta.values()))
@@ -66,6 +74,8 @@ async def search(
             except Exception as e:
                 print(f"[search] arXiv fallback also failed: {e}")
                 papers = []
     user_id = user_id or str(uuid.uuid4())
     # Phase 6.5 B1: one query_id per search request for per-feed CTR
@@ -86,7 +96,7 @@ async def search(
         resp = templates.TemplateResponse(
             request,
             "partials/search_results.html",
-            {"papers": papers, "query": q},
         )
     else:
         resp = templates.TemplateResponse(
@@ -96,6 +106,7 @@ async def search(
                 "papers": papers,
                 "query": q,
                 "has_recs": state.has_enough_for_recs(),
             },
         )

     q: str = "",
     user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
 ):
+    import time
+    start_time = time.perf_counter()
+    search_meta = {}
     papers = []
     if q.strip():
         # Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
         try:
+            arxiv_ids, search_meta = await hybrid_search_svc.search(
+                q.strip(), limit=ARXIV_MAX_RESULTS, return_meta=True
+            )
         except Exception as e:
             print(f"[search] Hybrid search error: {e}")
             arxiv_ids = []
         if arxiv_ids:
             # Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
+            t0_meta = time.perf_counter()
             try:
                 meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
             except Exception as e:
                     meta.update(arxiv_meta)
                 except Exception as e:
                     print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
+            search_meta["meta_time_ms"] = int((time.perf_counter() - t0_meta) * 1000)
             # Phase 4.3: Cache to SQLite so dismissal category JOINs work
             await db.cache_turso_metadata_batch(list(meta.values()))
             except Exception as e:
                 print(f"[search] arXiv fallback also failed: {e}")
                 papers = []
+        search_meta["total_time_ms"] = int((time.perf_counter() - start_time) * 1000)
     user_id = user_id or str(uuid.uuid4())
     # Phase 6.5 B1: one query_id per search request for per-feed CTR
         resp = templates.TemplateResponse(
             request,
             "partials/search_results.html",
+            {"papers": papers, "query": q, "search_meta": search_meta},
         )
     else:
         resp = templates.TemplateResponse(
                 "papers": papers,
                 "query": q,
                 "has_recs": state.has_enough_for_recs(),
+                "search_meta": search_meta,
             },
         )
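Both routers repeat the same `t0 = ...` / `int((now - t0) * 1000)` pattern for every stage, with `time.time()` in recommendations.py and `time.perf_counter()` in search.py. One way to consolidate it is a small context manager. This is only a sketch: `stage_timer` is a hypothetical helper, not part of the codebase. It uses `time.perf_counter()`, which is monotonic and therefore not affected by wall-clock adjustments.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timing: dict, key: str):
    """Store the elapsed time of the wrapped block, in whole ms, under timing[key]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed stages still show up.
        timing[key] = int((time.perf_counter() - t0) * 1000)


search_meta = {}
with stage_timer(search_meta, "meta_time_ms"):
    time.sleep(0.01)  # stand-in for the awaited metadata fetch
```

A plain `with` block can wrap `await` calls inside an `async def`, so the same helper would fit both the search and the recommendation handlers.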

app/s2_svc.py DELETED

@@ -1,111 +0,0 @@
-"""
-Semantic Scholar service — Phase 5.1 (author import for onboarding).
-Accepts an S2 author URL, a raw S2 author ID, or an ORCID, then
-fetches that author's papers and returns arXiv IDs for auto-saving.
-API docs: https://api.semanticscholar.org/api-docs/graph
-"""
-from __future__ import annotations
-import re
-import httpx
-from app.config import S2_API_KEY
-_BASE = "https://api.semanticscholar.org/graph/v1"
-_TIMEOUT = 15.0  # seconds
-# ── Patterns ──────────────────────────────────────────────────────────────────
-#   URL:   https://www.semanticscholar.org/author/Yoshua-Bengio/1751762
-#   Raw:   1751762
-#   ORCID: 0000-0003-3394-6622
-_S2_URL_RE = re.compile(
-    r"semanticscholar\.org/author/[^/]+/(\d+)", re.IGNORECASE
-)
-_ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
-_RAW_ID_RE = re.compile(r"^\d{3,}$")  # 3+ digits = plausible S2 author ID
-def _headers() -> dict[str, str]:
-    """Build request headers, including API key if available."""
-    h: dict[str, str] = {"Accept": "application/json"}
-    if S2_API_KEY:
-        h["x-api-key"] = S2_API_KEY
-    return h
-# ── Public API ────────────────────────────────────────────────────────────────
-def parse_author_input(text: str) -> tuple[str | None, str]:
-    """Parse user-provided text into an S2 author ID or ORCID.
-    Returns (s2_author_id | None, input_type) where input_type is one of:
-      "s2_url", "s2_id", "orcid", "unknown"
-    """
-    text = text.strip()
-    if not text:
-        return None, "unknown"
-    # 1. Try S2 URL
-    m = _S2_URL_RE.search(text)
-    if m:
-        return m.group(1), "s2_url"
-    # 2. Try ORCID
-    m = _ORCID_RE.search(text)
-    if m:
-        return m.group(0), "orcid"
-    # 3. Try raw numeric ID
-    if _RAW_ID_RE.match(text):
-        return text, "s2_id"
-    return None, "unknown"
-async def resolve_orcid(orcid: str) -> str | None:
-    """Resolve an ORCID to an S2 author ID via the author search endpoint.
-    Returns the S2 authorId string or None if not found.
-    """
-    url = f"{_BASE}/author/search"
-    params = {"query": orcid, "limit": 1}
-    async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
-        resp = await client.get(url, params=params, headers=_headers())
-        resp.raise_for_status()
-        data = resp.json()
-        authors = data.get("data", [])
-        if authors:
-            return str(authors[0]["authorId"])
-    return None
-async def fetch_author_arxiv_papers(
-    author_id: str, limit: int = 50,
-) -> list[str]:
-    """Fetch an author's papers from S2 and return arXiv IDs.
-    Filters to papers that have an ArXiv external ID.
-    Returns at most `limit` arXiv IDs, ordered by citation count (desc).
-    """
-    url = f"{_BASE}/author/{author_id}/papers"
-    params = {
-        "fields": "externalIds,citationCount",
-        "limit": min(limit * 2, 500),  # over-fetch since not all have arXiv IDs
-    }
-    arxiv_ids: list[tuple[int, str]] = []  # (citation_count, arxiv_id)
-    async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
-        resp = await client.get(url, params=params, headers=_headers())
-        resp.raise_for_status()
-        data = resp.json()
-        for paper in data.get("data", []):
-            ext = paper.get("externalIds") or {}
-            arxiv_id = ext.get("ArXiv")
-            if arxiv_id:
-                cites = paper.get("citationCount") or 0
-                arxiv_ids.append((cites, arxiv_id))
-    # Sort by citation count descending so we import the most impactful first
-    arxiv_ids.sort(key=lambda x: x[0], reverse=True)
-    return [aid for _, aid in arxiv_ids[:limit]]

app/templates/index.html CHANGED

@@ -13,20 +13,13 @@
     <p class="text-sm text-base-content/60 mb-4">
       Search arXiv, save papers you like — get personalised recommendations.
     </p>
-    <form hx-get="/search"
-          hx-target="#search-results"
-          hx-push-url="true"
-          hx-indicator="#search-spinner"
-          class="flex gap-2">
       <input type="text"
              name="q"
              placeholder="e.g. transformer attention mechanism"
              class="input input-bordered flex-1"
              autofocus />
-      <button class="btn btn-primary" type="submit">
-        Search
-        <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
-      </button>
     </form>
   </div>
@@ -57,8 +50,5 @@
     </div>
   </div>
-  <!-- Search results (swapped in by HTMX) -->
-  <div id="search-results"></div>
 </div>
 {% endblock %}

     <p class="text-sm text-base-content/60 mb-4">
       Search arXiv, save papers you like — get personalised recommendations.
     </p>
+    <form action="/search" method="get" class="flex gap-2">
       <input type="text"
              name="q"
              placeholder="e.g. transformer attention mechanism"
              class="input input-bordered flex-1"
              autofocus />
+      <button class="btn btn-primary" type="submit">Search</button>
     </form>
   </div>
     </div>
   </div>
 </div>
 {% endblock %}

app/templates/partials/paper_card.html CHANGED

@@ -9,6 +9,11 @@
 {% set position = position | default(0) %}
 {% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
 {# Category badge colour mapping #}
 {% set cat = paper.category | default("") %}
 {% if cat.startswith("cs.") %}
@@ -43,19 +48,19 @@
     {% endif %}
   </div>
-  <!-- Meta: arXiv ID + year + citations -->
   <div class="text-xs text-base-content/50 mono">
     [{{ paper.arxiv_id }}]
     {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
-    {% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al.{% endif %}</span>{% endif %}
     {% if paper.citation_count %}
     · <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">📊 {{ paper.citation_count }} citations</span>
     {% endif %}
   </div>
-  <!-- Abstract (truncated) -->
-  <p class="text-sm text-base-content/75 line-clamp-3">
-    {{ paper.abstract }}
   </p>
   <!-- Action buttons (HTMX-powered, swap themselves on click) -->

 {% set position = position | default(0) %}
 {% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
+{# Fallback: if tojson_parse returned empty but authors is a non-empty string, split by comma #}
+{% if not authors_list and paper.authors %}
+  {% set authors_list = paper.authors.split(", ") %}
+{% endif %}
 {# Category badge colour mapping #}
 {% set cat = paper.category | default("") %}
 {% if cat.startswith("cs.") %}
     {% endif %}
   </div>
+  <!-- Meta: arXiv ID + year + authors (max 3) + citations -->
   <div class="text-xs text-base-content/50 mono">
     [{{ paper.arxiv_id }}]
     {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
+    {% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al. ({{ authors_list | length }} authors){% endif %}</span>{% endif %}
     {% if paper.citation_count %}
     · <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">📊 {{ paper.citation_count }} citations</span>
     {% endif %}
   </div>
+  <!-- Abstract (truncated to 500 chars + CSS clamp) -->
+  <p class="text-sm text-base-content/75" style="display: -webkit-box; -webkit-line-clamp: 3; -webkit-box-orient: vertical; overflow: hidden;">
+    {{ paper.abstract[:500] }}{% if paper.abstract | length > 500 %}…{% endif %}
   </p>
   <!-- Action buttons (HTMX-powered, swap themselves on click) -->
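The card's truncation rules (at most three authors plus a count, abstract cut at 500 characters) live in the template, where they are hard to unit-test. As a sketch, the same rules can be written as plain Python functions; `format_authors` and `truncate_abstract` are hypothetical names for illustration and do not exist in the repository.

```python
def format_authors(authors: list[str], max_shown: int = 3) -> str:
    """Join the first max_shown authors; append 'et al.' and the total if more exist."""
    shown = ", ".join(authors[:max_shown])
    if len(authors) > max_shown:
        shown += f" et al. ({len(authors)} authors)"
    return shown


def truncate_abstract(abstract: str, max_chars: int = 500) -> str:
    """Hard-truncate to max_chars and append an ellipsis only when text was cut."""
    if len(abstract) > max_chars:
        return abstract[:max_chars] + "…"
    return abstract
```

For example, `format_authors(["A", "B", "C", "D", "E"])` yields `A, B, C et al. (5 authors)`, matching what the template renders for a five-author paper.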

app/templates/partials/recommendations.html CHANGED

@@ -13,6 +13,40 @@
       {% include "partials/paper_card.html" %}
     {% endfor %}
   </div>
   <!-- Refresh button — lets user reload recs after saving more papers -->
   <div class="text-center pt-3">
     <button class="btn btn-ghost btn-sm"

       {% include "partials/paper_card.html" %}
     {% endfor %}
   </div>
+  {# Pipeline timing breakdown #}
+  {% if timing is defined and timing %}
+  <div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
+    <div class="flex items-center gap-2 mb-2">
+      <span class="text-xs font-semibold text-base-content/60">⚡ Recommendation Pipeline Breakdown</span>
+    </div>
+    <div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
+      {% if timing.clustering_ms is defined %}
+        <span>Ward Clustering: <span class="text-primary">{{ timing.clustering_ms }}ms</span></span>
+      {% endif %}
+      {% if timing.ann_retrieval_ms is defined %}
+        <span>ANN Retrieval: <span class="text-primary">{{ timing.ann_retrieval_ms }}ms</span></span>
+      {% endif %}
+      {% if timing.candidate_meta_ms is defined %}
+        <span>Candidate Meta: <span class="text-primary">{{ timing.candidate_meta_ms }}ms</span></span>
+      {% endif %}
+      {% if rerank_time_ms is defined %}
+        <span>LightGBM Rerank: <span class="text-primary">{{ rerank_time_ms }}ms</span></span>
+      {% endif %}
+      {% if timing.mmr_ms is defined %}
+        <span>MMR Diversity: <span class="text-primary">{{ timing.mmr_ms }}ms</span></span>
+      {% endif %}
+      {% if meta_time_ms is defined %}
+        <span>Final Metadata: <span class="text-primary">{{ meta_time_ms }}ms</span></span>
+      {% endif %}
+    </div>
+  </div>
+  {% elif rerank_time_ms is defined and meta_time_ms is defined %}
+  <div class="text-center pt-2 pb-1 text-xs text-base-content/40 font-mono">
+    ⚡ Reranking: {{ rerank_time_ms }}ms | Metadata: {{ meta_time_ms }}ms
+  </div>
+  {% endif %}
   <!-- Refresh button — lets user reload recs after saving more papers -->
   <div class="text-center pt-3">
     <button class="btn btn-ghost btn-sm"

app/templates/partials/search_results.html CHANGED

@@ -1,15 +1,91 @@
 {# Partial: list of search result cards #}
 {% if papers %}
   <div class="space-y-3">
-    <p class="text-sm text-base-content/50">{{ papers | length }} results for "{{ query }}"</p>
     {% for paper in papers %}
       {% set position = loop.index0 %}
       {% set source = "search" %}
       {% include "partials/paper_card.html" %}
     {% endfor %}
   </div>
 {% elif query %}
   <div class="text-center text-base-content/40 py-10">
-    No results found for "{{ query }}"
   </div>
 {% endif %}

 {# Partial: list of search result cards #}
 {% if papers %}
   <div class="space-y-3">
+    <div class="flex flex-col gap-1 mb-4">
+      <div class="flex justify-between items-center text-sm text-base-content/50">
+        <span>{{ papers | length }} results for "{{ query }}"</span>
+        {% if search_meta and search_meta.total_time_ms is defined %}
+          <span>Search completed in {{ search_meta.total_time_ms }}ms</span>
+        {% endif %}
+      </div>
+      {# Groq rewrite result — show both rewritten AND skipped cases #}
+      {% if search_meta %}
+        {% if search_meta.rewritten_query %}
+        <div class="alert bg-base-200 border-l-4 border-primary p-3 text-sm flex gap-2">
+          <svg xmlns="http://www.w3.org/2000/svg" class="stroke-primary shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
+          <div class="flex-1">
+            <span class="font-semibold">Groq expanded query:</span> "{{ search_meta.rewritten_query }}"
+            <span class="text-xs text-base-content/50 ml-2">({{ search_meta.groq_time_ms }}ms)</span>
+          </div>
+        </div>
+        {% elif search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
+        <div class="alert bg-base-200/50 border-l-4 border-base-300 p-3 text-sm flex gap-2">
+          <svg xmlns="http://www.w3.org/2000/svg" class="stroke-base-content/30 shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
+          <div class="flex-1 text-base-content/50">
+            <span class="font-semibold">Groq rewrite:</span> {{ search_meta.groq_status }}
+            — searching with original query as-is
+          </div>
+        </div>
+        {% endif %}
+      {% endif %}
+    </div>
     {% for paper in papers %}
       {% set position = loop.index0 %}
       {% set source = "search" %}
       {% include "partials/paper_card.html" %}
     {% endfor %}
   </div>
+  {# Pipeline timing breakdown #}
+  {% if search_meta %}
+  <div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
+    <div class="flex items-center gap-2 mb-2">
+      <span class="text-xs font-semibold text-base-content/60">⚡ Search Pipeline Breakdown</span>
+      {% if search_meta.total_time_ms is defined %}
+        <span class="text-xs text-base-content/40">({{ search_meta.total_time_ms }}ms total)</span>
+      {% endif %}
+    </div>
+    <div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
+      {% if search_meta.groq_time_ms is defined %}
+        <span>Groq Rewrite: <span class="text-primary">{{ search_meta.groq_time_ms }}ms</span>
+          {% if search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
+            <span class="text-warning/60">({{ search_meta.groq_status }})</span>
+          {% endif %}
+        </span>
+      {% endif %}
+      {% if search_meta.encode_time_ms is defined %}
+        <span>BGE-M3 Encode: <span class="text-primary">{{ search_meta.encode_time_ms }}ms</span></span>
+      {% endif %}
+      {% if search_meta.retrieval_time_ms is defined %}
+        <span>Qdrant+Zilliz Retrieval: <span class="text-primary">{{ search_meta.retrieval_time_ms }}ms</span>
+          {% if search_meta.n_retrieval_tasks is defined %}
+            <span class="text-base-content/30">({{ search_meta.n_retrieval_tasks }} parallel tasks)</span>
+          {% endif %}
+        </span>
+      {% endif %}
+      {% if search_meta.rrf_time_ms is defined %}
+        <span>RRF Fusion: <span class="text-primary">{{ search_meta.rrf_time_ms }}ms</span></span>
+      {% endif %}
+      {% if search_meta.turso_boost_fetch_ms is defined %}
+        <span>Turso Title Fetch: <span class="text-primary">{{ search_meta.turso_boost_fetch_ms }}ms</span></span>
+        <span>Rerank Compute: <span class="text-primary">{{ search_meta.rerank_compute_ms }}ms</span></span>
+      {% elif search_meta.rerank_time_ms is defined %}
+        <span>Title+Citation Rerank: <span class="text-primary">{{ search_meta.rerank_time_ms }}ms</span></span>
+      {% endif %}
+      {% if search_meta.meta_time_ms is defined %}
+        <span>Final Metadata: <span class="text-primary">{{ search_meta.meta_time_ms }}ms</span></span>
+      {% endif %}
+    </div>
+  </div>
+  {% endif %}
 {% elif query %}
   <div class="text-center text-base-content/40 py-10">
+    <p>No results found for "{{ query }}"</p>
+    {% if search_meta and search_meta.total_time_ms is defined %}
+      <p class="text-xs mt-2">Search completed in {{ search_meta.total_time_ms }}ms</p>
+    {% endif %}
   </div>
 {% endif %}

app/templates/partials/seed_results.html ADDED

@@ -0,0 +1,41 @@
+{#
+  Seed search results — inner partial, swapped into #seed-results by HTMX.
+  Expects:
+    papers  – list[dict] (optional)
+    query   – str         (optional)
+#}
+{% if papers is defined and papers %}
+  {% for paper in papers %}
+  <div class="seed-card flex items-start justify-between gap-3"
+       id="seed-paper-{{ paper.arxiv_id }}">
+    <div class="flex-1 min-w-0">
+      <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
+         target="_blank" rel="noopener"
+         class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
+        {{ paper.title }}
+      </a>
+      <div class="text-xs text-base-content/50 mt-0.5">
+        [{{ paper.arxiv_id }}]
+        {% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
+        {% if paper.citation_count %} · 📊 {{ paper.citation_count }}{% endif %}
+      </div>
+    </div>
+    <button class="btn btn-primary btn-xs shrink-0"
+            hx-post="/api/papers/{{ paper.arxiv_id }}/save"
+            hx-target="#seed-paper-{{ paper.arxiv_id }}"
+            hx-swap="outerHTML"
+            hx-vals='{"source": "onboarding"}'
+            onclick="bumpSeedCount()">
+      ⭐ Save
+    </button>
+  </div>
+  {% endfor %}
+{% elif query is defined and query %}
+  <p class="text-center text-base-content/40 py-6 text-sm">
+    No results found for "{{ query }}"
+  </p>
+{% else %}
+  <p class="text-center text-base-content/30 py-6 text-sm">
+    Search above to find papers in your research area
+  </p>
+{% endif %}

app/templates/partials/seed_search.html CHANGED

@@ -15,30 +15,6 @@
     </p>
   </div>
-  {# Phase 5.1: Quick author import #}
-  <div class="mb-4 p-3 bg-base-200/50 rounded-lg">
-    <p class="text-xs font-medium text-base-content/70 mb-2">
-      ⚡ Quick import: Paste your Semantic Scholar profile URL to auto-import papers
-    </p>
-    <form hx-post="/api/onboarding/import-author"
-          hx-target="#import-result"
-          hx-swap="innerHTML"
-          hx-indicator="#import-spinner"
-          class="flex gap-2">
-      <input type="text"
-             name="author_url"
-             placeholder="e.g. https://www.semanticscholar.org/author/…/1234567"
-             class="input input-bordered input-sm flex-1 text-xs" />
-      <button class="btn btn-secondary btn-sm" type="submit">
-        Import
-        <span id="import-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
-      </button>
-    </form>
-    <div id="import-result" class="mt-2"></div>
-  </div>
-  <div class="divider text-xs text-base-content/40">OR search manually</div>
   {# Search bar #}
   <div class="mb-4">
     <form hx-get="/api/onboarding/seed-search"
@@ -68,43 +44,9 @@
     </div>
   </div>
-  {# Search results #}
   <div id="seed-results" class="space-y-2 mb-6">
-    {% if papers is defined and papers %}
-      {% for paper in papers %}
-      <div class="seed-card flex items-start justify-between gap-3"
-           id="seed-paper-{{ paper.arxiv_id }}">
-        <div class="flex-1 min-w-0">
-          <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
-             target="_blank" rel="noopener"
-             class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
-            {{ paper.title }}
-          </a>
-          <div class="text-xs text-base-content/50 mt-0.5">
-            [{{ paper.arxiv_id }}]
-            {% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
-            {% if paper.citation_count %} · 📊 {{ paper.citation_count }}{% endif %}
-          </div>
-        </div>
-        <button class="btn btn-primary btn-xs shrink-0"
-                hx-post="/api/papers/{{ paper.arxiv_id }}/save"
-                hx-target="#seed-paper-{{ paper.arxiv_id }}"
-                hx-swap="outerHTML"
-                hx-vals='{"source": "onboarding"}'
-                onclick="bumpSeedCount()">
-          ⭐ Save
-        </button>
-      </div>
-      {% endfor %}
-    {% elif query is defined and query %}
-      <p class="text-center text-base-content/40 py-6 text-sm">
-        No results found for "{{ query }}"
-      </p>
-    {% else %}
-      <p class="text-center text-base-content/30 py-6 text-sm">
-        Search above to find papers in your research area
-      </p>
-    {% endif %}
   </div>
   {# Done / Skip buttons #}

     </p>
   </div>
   {# Search bar #}
   <div class="mb-4">
     <form hx-get="/api/onboarding/seed-search"
     </div>
   </div>
+  {# Search results — inner div is the HTMX swap target #}
   <div id="seed-results" class="space-y-2 mb-6">
+    {% include "partials/seed_results.html" %}
   </div>
   {# Done / Skip buttons #}

app/templates/search.html CHANGED

@@ -7,10 +7,9 @@
   <!-- Search bar -->
   <div class="card bg-base-100 shadow-md rounded-xl p-4">
-    <form hx-get="/search"
-          hx-target="#search-results"
           hx-push-url="true"
-          hx-indicator="#search-spinner"
           class="flex gap-2">
       <input type="text"
              name="q"
@@ -18,16 +17,38 @@
              placeholder="Search arXiv papers…"
              class="input input-bordered flex-1"
              autofocus />
-      <button class="btn btn-primary" type="submit">
-        Search
-        <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
       </button>
     </form>
   </div>
-  <!-- Recommendations (sidebar-style, loads async) -->
-  {% if has_recs %}
-  <div>
     <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
     <div id="rec-section"
          hx-get="/api/recommendations"
@@ -47,4 +68,29 @@
   </div>
 </div>
 {% endblock %}

   <!-- Search bar -->
   <div class="card bg-base-100 shadow-md rounded-xl p-4">
+    <form hx-get="/search"
+          hx-target="#search-results"
           hx-push-url="true"
           class="flex gap-2">
       <input type="text"
              name="q"
              placeholder="Search arXiv papers…"
              class="input input-bordered flex-1"
              autofocus />
+      <button class="btn btn-primary flex items-center gap-2" type="submit">
+        <span class="search-btn-text">Search</span>
+        <span class="search-btn-loading hidden">
+          <span class="loading loading-spinner loading-sm"></span>
+          Searching…
+        </span>
       </button>
     </form>
   </div>
+  <!-- Loading overlay (outside search-results so it doesn't get swapped away) -->
+  <div id="search-loading" class="hidden">
+    <div class="flex flex-col items-center justify-center py-16 gap-4">
+      <span class="loading loading-ring loading-lg text-primary"></span>
+      <div class="text-sm text-base-content/60 animate-pulse">
+        Searching 1.6M papers across Qdrant + Zilliz…
+      </div>
+      <div class="flex gap-6 text-xs text-base-content/40 font-mono">
+        <span>Groq rewriting</span>
+        <span>→</span>
+        <span>BGE-M3 encoding</span>
+        <span>→</span>
+        <span>Vector retrieval</span>
+        <span>→</span>
+        <span>RRF + reranking</span>
+      </div>
+    </div>
+  </div>
+  <!-- Recommendations — only when not actively searching -->
+  {% if has_recs and not query %}
+  <div id="rec-wrapper">
     <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
     <div id="rec-section"
          hx-get="/api/recommendations"
   </div>
 </div>
+<script>
+  // Show/hide loading overlay + HIDE recommendations when searching
+  document.body.addEventListener('htmx:beforeRequest', function(evt) {
+    if (evt.detail.target && evt.detail.target.id === 'search-results') {
+      document.getElementById('search-loading').classList.remove('hidden');
+      document.getElementById('search-results').classList.add('opacity-30');
+      // Hide recommendations section when a search starts
+      var recWrapper = document.getElementById('rec-wrapper');
+      if (recWrapper) recWrapper.classList.add('hidden');
+      // Swap button text
+      document.querySelectorAll('.search-btn-text').forEach(el => el.classList.add('hidden'));
+      document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.remove('hidden'));
+    }
+  });
+  document.body.addEventListener('htmx:afterRequest', function(evt) {
+    if (evt.detail.target && evt.detail.target.id === 'search-results') {
+      document.getElementById('search-loading').classList.add('hidden');
+      document.getElementById('search-results').classList.remove('opacity-30');
+      // Restore button text
+      document.querySelectorAll('.search-btn-text').forEach(el => el.classList.remove('hidden'));
+      document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.add('hidden'));
+    }
+  });
+</script>
 {% endblock %}

app/turso_svc.py CHANGED

@@ -15,12 +15,65 @@ from __future__ import annotations
 import json
 import time
 import httpx
 from app import config
 # ── Public API ───────────────────────────────────────────────────────────────
 async def fetch_metadata(arxiv_id: str) -> dict | None:
@@ -37,11 +90,31 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
     Paper dict has keys: arxiv_id, title, abstract, authors, category,
     published, year, citation_count, influential_citations.
-    Uses Turso HTTP pipeline API — single HTTP request for all IDs.
     """
     if not arxiv_ids:
         return {}
     url = config.TURSO_URL
     token = config.TURSO_DB_TOKEN
@@ -133,6 +206,7 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
         paper = _to_paper_dict(values)
         if paper:
             output[paper["arxiv_id"]] = paper
     return output
@@ -211,27 +285,52 @@ async def fetch_trending_by_categories(
     Fetch recently published, high-citation papers from Turso DB
     filtered by arXiv categories. Used as Tier 0 popularity fallback
     for onboarded users with zero saves.
     """
     if not categories:
         return []
     url = config.TURSO_URL
     token = config.TURSO_DB_TOKEN
     if not url or not token:
         return []
-    # Build query: papers in selected categories, sorted by citation count
-    placeholders = ", ".join(["?" for _ in categories])
     sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
                      update_date, abstract_preview, citation_count, influential_citations
               FROM papers
-              WHERE primary_topic IN ({placeholders})
                 AND citation_count > 0
               ORDER BY citation_count DESC, update_date DESC
               LIMIT ?"""
-    cat_list = list(categories)
-    args = [{"type": "text", "value": c} for c in cat_list]
     args.append({"type": "integer", "value": str(limit)})
     pipeline_url = url.rstrip("/")
@@ -254,16 +353,29 @@ async def fetch_trending_by_categories(
         "Content-Type": "application/json",
     }
     try:
-        async with httpx.AsyncClient(timeout=10) as client:
             resp = await client.post(
                 f"{pipeline_url}/v2/pipeline",
                 json=payload,
                 headers=headers,
             )
             resp.raise_for_status()
     except Exception as e:
-        print(f"[turso] trending query failed: {e}")
         return []
     try:
@@ -282,7 +394,7 @@ async def fetch_trending_by_categories(
         cols = [c["name"] for c in result_data.get("cols", [])]
         rows = result_data.get("rows", [])
     except (KeyError, IndexError, TypeError) as e:
-        print(f"[turso] trending parse error: {e}")
         return []
     papers = []
@@ -299,4 +411,10 @@ async def fetch_trending_by_categories(
             papers.append(paper)
     print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
     return papers

 import json
 import time
+from collections import OrderedDict
 import httpx
 from app import config
+# ── In-process metadata cache ────────────────────────────────────────────────
+#
+# Recommendations + search both fetch metadata for hundreds of arxiv_ids per
+# request, often the same well-known papers across users. Each round-trip is
+# 1-3s on a 1.6M-row libSQL DB. An in-process LRU absorbs the repeats.
+#
+# Trade-offs:
+#   - The asyncio event loop runs on one thread and the cache helpers
+#     never await, so no lock is needed.
+#   - Paper title/abstract/authors are effectively immutable for our use,
+#     so we don't TTL-expire metadata. citation_count drifts but is only
+#     used for display ranking; staleness is fine.
+#   - 50K capacity at ~1KB per row -> ~50MB RAM ceiling.
+_METADATA_CACHE: "OrderedDict[str, dict]" = OrderedDict()
+_METADATA_CACHE_MAX = 50_000
+def _cache_get(arxiv_id: str) -> dict | None:
+    val = _METADATA_CACHE.get(arxiv_id)
+    if val is not None:
+        # Mark as MRU
+        _METADATA_CACHE.move_to_end(arxiv_id)
+    return val
+def _cache_put(arxiv_id: str, paper: dict) -> None:
+    if arxiv_id in _METADATA_CACHE:
+        _METADATA_CACHE.move_to_end(arxiv_id)
+        _METADATA_CACHE[arxiv_id] = paper
+        return
+    _METADATA_CACHE[arxiv_id] = paper
+    if len(_METADATA_CACHE) > _METADATA_CACHE_MAX:
+        # Evict LRU
+        _METADATA_CACHE.popitem(last=False)
+def metadata_cache_stats() -> dict:
+    """For diagnostics: current cache size and max."""
+    return {"size": len(_METADATA_CACHE), "max": _METADATA_CACHE_MAX}
+# ── In-process trending cache ────────────────────────────────────────────────
+#
+# Trending is filter-by-LIKE on 1.6M rows -> ~15s cold. Onboarding has a
+# small fixed set of category combinations, and citation counts barely
+# change minute-to-minute. A short TTL converts the 15s wait into a
+# one-time hit per category combo.
+_TRENDING_CACHE: dict[tuple, tuple[float, list[dict]]] = {}
+_TRENDING_TTL_SECONDS = 60 * 60  # 1 hour
 # ── Public API ───────────────────────────────────────────────────────────────
 async def fetch_metadata(arxiv_id: str) -> dict | None:
     Paper dict has keys: arxiv_id, title, abstract, authors, category,
     published, year, citation_count, influential_citations.
+    First checks the in-process LRU cache; only un-cached IDs hit the network.
     """
     if not arxiv_ids:
         return {}
+    # Cache check — pull anything already-known up front.
+    output: dict[str, dict] = {}
+    misses: list[str] = []
+    for aid in arxiv_ids:
+        cached = _cache_get(aid)
+        if cached is not None:
+            output[aid] = cached
+        else:
+            misses.append(aid)
+    if not misses:
+        return output
+    fetched = await _fetch_metadata_batch_uncached(misses)
+    output.update(fetched)
+    return output
+async def _fetch_metadata_batch_uncached(arxiv_ids: list[str]) -> dict[str, dict]:
+    """Network fetch for IDs we don't already have cached."""
     url = config.TURSO_URL
     token = config.TURSO_DB_TOKEN
         paper = _to_paper_dict(values)
         if paper:
             output[paper["arxiv_id"]] = paper
+            _cache_put(paper["arxiv_id"], paper)
     return output
     Fetch recently published, high-citation papers from Turso DB
     filtered by arXiv categories. Used as Tier 0 popularity fallback
     for onboarded users with zero saves.
+    Cached in-process (1 hour TTL): citation counts barely change
+    minute-to-minute, and onboarding has a small fixed set of category
+    combinations, so the first cold-start hit pays the ~15s LIKE-scan
+    cost once and subsequent users get an instant hit.
+    Filter strategy:
+      Turso's `primary_topic` column stores friendly labels like
+      "AI/ML" / "Computer Vision" — NOT arxiv codes — and the mapping
+      from arxiv code to friendly label is not 1:1 (e.g. Vaswani's
+      cs.CL paper is labeled "AI/ML" while BERT's cs.CL paper is
+      labeled "NLP/Computational Linguistics"). The `categories`
+      column, however, contains the real space-separated arxiv codes
+      ("cs.CL cs.LG"). So we filter via LIKE on `categories`.
+      Performance: LIKE '%cs.XX%' with leading wildcard skips the index,
+      but Turso's `citation_count > 0` filter + ORDER BY citation_count
+      narrows the scan, and trending is not a hot path.
     """
     if not categories:
         return []
+    cache_key = (tuple(sorted(categories)), limit)
+    cached = _TRENDING_CACHE.get(cache_key)
+    if cached is not None and (time.time() - cached[0]) < _TRENDING_TTL_SECONDS:
+        return cached[1]
     url = config.TURSO_URL
     token = config.TURSO_DB_TOKEN
     if not url or not token:
         return []
+    cat_list = list(categories)
+    # categories column is space-separated arxiv codes. Plain LIKE '%code%'
+    # can over-match: SQLite's LIKE ignores ASCII case by default, so
+    # '%cs.CL%' also matches physics.class-ph, and an archive-level code
+    # such as astro-ph also matches astro-ph.CO.
+    like_clauses = " OR ".join(["categories LIKE ?" for _ in cat_list])
     sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
                      update_date, abstract_preview, citation_count, influential_citations
               FROM papers
+              WHERE ({like_clauses})
                 AND citation_count > 0
               ORDER BY citation_count DESC, update_date DESC
               LIMIT ?"""
+    args = [{"type": "text", "value": f"%{c}%"} for c in cat_list]
     args.append({"type": "integer", "value": str(limit)})
     pipeline_url = url.rstrip("/")
         "Content-Type": "application/json",
     }
+    # Use a longer timeout than metadata fetch — full table scan
+    # for citation-sorted trending against 1.6M rows can spike to
+    # 15-25s on the first cold hit. Once cached, warm reads are 0ms.
     try:
+        async with httpx.AsyncClient(timeout=30) as client:
             resp = await client.post(
                 f"{pipeline_url}/v2/pipeline",
                 json=payload,
                 headers=headers,
             )
             resp.raise_for_status()
+    except httpx.HTTPStatusError as e:
+        # Surface response body on HTTP errors — Turso's empty-string
+        # exceptions were the symptom that hid this bug for months.
+        body = ""
+        try:
+            body = e.response.text[:500]
+        except Exception:
+            pass
+        print(f"[turso] trending HTTP error {e.response.status_code}: {body}")
+        return []
     except Exception as e:
+        print(f"[turso] trending request failed: {type(e).__name__}: {e!r}")
         return []
     try:
         cols = [c["name"] for c in result_data.get("cols", [])]
         rows = result_data.get("rows", [])
     except (KeyError, IndexError, TypeError) as e:
+        print(f"[turso] trending parse error: {type(e).__name__}: {e!r}")
         return []
     papers = []
             papers.append(paper)
     print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
+    if papers:
+        _TRENDING_CACHE[cache_key] = (time.time(), papers)
+        # Also seed metadata cache — these papers are likely to be
+        # fetched again as part of recommendations / display.
+        for p in papers:
+            _cache_put(p["arxiv_id"], p)
     return papers

docs/TASK-TRACKER.md CHANGED

@@ -325,30 +325,30 @@
 ---
-## Phase 5: Cold-Start Onboarding 📋 NOT STARTED
-> *Build the hybrid onboarding pipeline for new users.*
-> *Estimated effort: ~1-2 weeks*
 > *Reference: Doc 06 — "4-37% lift even once behavioral data exists"*
-### 5.1 — arXiv Category Multi-Select
-- [ ] UI screen on first visit: select 3-5 arXiv categories
-- [ ] Store selections in SQLite
-- [ ] Use as pool filter for first 1-3 sessions
-- [ ] Preserve as LightGBM feature permanently
-- [ ] Does NOT create "subject vectors" — just filters
-### 5.2 — Seed Paper Import
-- [ ] Let users search for and save 3-5 seed papers during onboarding
-- [ ] Immediately create EWMA profiles + Ward clusters
-- [ ] Uses hybrid search (Phase 3) for discovery
-### 5.3 — ORCID / Semantic Scholar Import (Stretch)
-- [ ] Accept ORCID ID → fetch authored papers → initial saves
-- [ ] Gives 10-50 papers of signal instantly
-### 5.4 — Popularity Fallback
-- [ ] If user skips all onboarding: serve popularity-per-selected-category feed
 ---
@@ -432,10 +432,10 @@
 - [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
 - [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
-### B4 — S2 author import (Phase 5.1)
-- [x] `app/s2_svc.py`: parse S2 URL / raw ID / ORCID, fetch author papers from S2 API
-- [x] `POST /api/onboarding/import-author` endpoint in `onboarding.py`
-- [x] Quick-import form added to `seed_search.html` template
 ### Documentation
 - [x] `CLAUDE.md`: Rule 3.11 — interaction instrumentation invariants

 ---
+## Phase 5: Cold-Start Onboarding ✅ COMPLETE
+> *Onboarding wizard for new users — category selection + seed paper search + trending fallback.*
 > *Reference: Doc 06 — "4-37% lift even once behavioral data exists"*
+### 5.1 — arXiv Category Multi-Select ✅
+- [x] UI screen on first visit: select 1-8 arXiv category groups
+- [x] Store selections in SQLite (`user_onboarding` table)
+- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
+- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
+- [x] Does NOT create "subject vectors" — just filters
+### 5.2 — Seed Paper Import ✅
+- [x] Let users search for and save seed papers during onboarding
+- [x] Immediately create EWMA profiles + Ward clusters on next feed request
+- [x] Uses hybrid search (Phase 3) for discovery
+### ~~5.3 — ORCID / Semantic Scholar Import~~ ❌ REMOVED
+> S2 author import was implemented but removed — not the onboarding direction we want.
+> Onboarding focuses on category selection + manual seed paper search.
+### 5.4 — Popularity Fallback ✅
+- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
+- [x] 1-hour TTL trending cache for performance
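The category filter behind the trending fallback matches arXiv codes inside a space-separated `categories` column. The sketch below, run against an in-memory SQLite table with invented rows, shows a token-boundary variant of that filter: padding the column and the pattern with spaces makes each code match only as a whole token. The table layout follows the query in `turso_svc.py`; `build_category_filter` and `trending` are hypothetical helpers, and whether the production data needs this stricter form is an assumption.

```python
import sqlite3


def build_category_filter(codes):
    """Return a WHERE fragment and its bound args, matching whole tokens only.

    The caller must pass at least one code.
    """
    clause = " OR ".join(["(' ' || categories || ' ') LIKE ?" for _ in codes])
    args = [f"% {c} %" for c in codes]
    return f"({clause})", args


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (arxiv_id TEXT, categories TEXT, citation_count INTEGER)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("p1", "cs.CL cs.LG", 100),
        ("p2", "astro-ph.CO", 50),
        ("p3", "astro-ph", 10),
        ("p4", "cs.CV", 0),
        ("p5", "physics.class-ph", 70),
    ],
)


def trending(codes, limit=10):
    where, args = build_category_filter(codes)
    sql = (
        f"SELECT arxiv_id FROM papers WHERE {where} AND citation_count > 0 "
        "ORDER BY citation_count DESC LIMIT ?"
    )
    return [row[0] for row in conn.execute(sql, [*args, limit])]


# Unpadded LIKE for comparison: case-insensitive, so it also hits p5.
plain = sorted(
    row[0]
    for row in conn.execute(
        "SELECT arxiv_id FROM papers WHERE categories LIKE ?", ["%cs.CL%"]
    )
)
```

With the padded form, `trending(["cs.CL"])` returns only `p1`, while the unpadded comparison in `plain` also picks up the `physics.class-ph` row.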
 ---
 - [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
 - [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
+### ~~B4 — S2 author import~~ ❌ REMOVED
+> S2 author import was implemented and then removed — not the onboarding direction we want.
+> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
+> have all been deleted. Onboarding uses category selection + manual seed search only.
 ### Documentation
 - [x] `CLAUDE.md`: Rule 3.11 — interaction instrumentation invariants

docs/previous_prompt.txt ADDED

The diff for this file is too large to render. See raw diff

docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md CHANGED

@@ -32,7 +32,7 @@
 | Component | Planned In | Blocked By |
 |---|---|---|
 | Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
-| ORCID / Scholar import (onboarding stretch) | Phase 5 (stretch) | Deferred |
 | LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
 | Exploration + collaborative filtering | Phase 9 | Needs user scale |
@@ -101,12 +101,12 @@ The latest deep research (Doc 06) adds critical nuance that **neither pure-behav
 > "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4–37% lift even once behavioral data exists."
-**The corrected position**: A three-layer hybrid:
 1. **Coarse arXiv-category multiselect** — filter and LightGBM feature (5-second cold-start signal)
-2. **Seed-paper / ORCID import** — initial behavioral profile (strong cold-start signal)
-3. **Ward clustering + medoid retrieval** — takes over at ~10 saves (production-grade personalization)
-This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features.
 ---
@@ -283,29 +283,30 @@ Turso cloud DB with 1.23GB of metadata + citation counts. Search time: ~10.7s
 ---
-### Phase 5: Cold-Start Onboarding (COMPLETE)
-Status: core flow implemented (categories + seed search + trending fallback). ORCID/Scholar import deferred.
 Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
-#### 5.1 arXiv Category Multi-Select
-A simple UI screen on first visit: select 3-5 arXiv categories (cs.CL, cs.CV, stat.ML, etc.).
-- Used as pool filter for first 1-3 sessions
-- Stored as a LightGBM feature permanently
 - Does NOT create "subject vectors" — just filters
-#### 5.2 Seed Paper Import
-Let users search for and save 3-5 seed papers during onboarding.
-- These immediately create EWMA profiles and Ward clusters
 - Bypasses the "save 5 papers before any recs" cold-start trap
-- Scholar Inbox found this sufficient for good initial recommendations
-- **With hybrid search in place (Phase 3), seed paper search will use Qdrant vectors, not the arXiv API**
-#### 5.3 ORCID / Semantic Scholar ID Import (Stretch)
-If the user pastes their ORCID, ingest their authored papers as initial saves.
-- This gives the system 10-50 papers worth of signal instantly
-- Creates highly personalized clusters from Day 1
 ---

 | Component | Planned In | Blocked By |
 |---|---|---|
 | Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
+| ~~ORCID / Scholar import~~ | ~~Phase 5~~ | Removed (not the onboarding direction we want) |
 | LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
 | Exploration + collaborative filtering | Phase 9 | Needs user scale |
 > "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4–37% lift even once behavioral data exists."
+**The corrected position**: A three-layer hybrid:
 1. **Coarse arXiv-category multiselect** — filter and LightGBM feature (5-second cold-start signal)
+2. **Seed paper search + save** — initial behavioral profile via manual discovery
+3. **Ward clustering + medoid retrieval** — takes over at ~5 saves (production-grade personalization)
+This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features. ORCID/S2 author import was explored and removed — manual seed search is the preferred onboarding path.
 ---
 ---
+### Phase 5: Cold-Start Onboarding (COMPLETE ✅)
+Status: fully implemented — categories + seed search + trending fallback.
 Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
+#### 5.1 arXiv Category Multi-Select ✅
+UI screen on first visit: select 1-8 arXiv category groups.
+- Used as pool filter for recommendations
+- Stored as a LightGBM feature permanently (Feature 26: `onboarding_category_match`)
 - Does NOT create "subject vectors" — just filters
+#### 5.2 Seed Paper Import ✅
+Users search for and save seed papers during onboarding.
+- These immediately create EWMA profiles and Ward clusters on next feed request
 - Bypasses the "save 5 papers before any recs" cold-start trap
+- Uses hybrid search (Phase 3) for discovery
+#### ~~5.3 ORCID / Semantic Scholar ID Import~~ ❌ REMOVED
+S2 author import was implemented and then removed — not the onboarding direction we want.
+Onboarding focuses on category selection + manual seed paper search.
+#### 5.4 Popularity Fallback ✅
+Category-filtered trending papers via `turso_svc.fetch_trending_by_categories()` with 1-hour TTL cache.
 ---

requirements.txt CHANGED

@@ -14,7 +14,7 @@ python-multipart>=0.0.9
 FlagEmbedding>=1.2.9
 transformers>=4.44,<5.0
 pymilvus>=2.4
-groq>=0.9
 python-dotenv>=1.0
 # ── Phase 6: LightGBM reranker ───────────────────────────────────────────

 FlagEmbedding>=1.2.9
 transformers>=4.44,<5.0
 pymilvus>=2.4
+groq>=1.0  # 1.0+ drops the `proxies` kwarg internally so httpx>=0.28 works
 python-dotenv>=1.0
 # ── Phase 6: LightGBM reranker ───────────────────────────────────────────

scripts/browser_test_onboarding.py ADDED

	@@ -0,0 +1,75 @@

+"""Verify the onboarding seed-search step does not duplicate the panel."""
+from playwright.sync_api import sync_playwright
+URL = "http://127.0.0.1:7860"
+QUERY = "attention is all you need"
+def run():
+    with sync_playwright() as p:
+        browser = p.chromium.launch(headless=True)
+        ctx = browser.new_context(viewport={"width": 1280, "height": 1800})
+        # Use a fresh, unonboarded user so we land on /onboarding
+        ctx.add_cookies([{
+            "name": "arxiv_user_id",
+            "value": "onboarding-test-user-fresh",
+            "url": URL,
+        }])
+        page = ctx.new_page()
+        page.goto(URL + "/onboarding", wait_until="networkidle")
+        # Step 1: pick a category, click Continue
+        page.click("[data-key='nlp']")
+        page.click("#continue-btn")
+        # Step 2 should appear (rendered by submitCategories() via fetch + innerHTML)
+        page.wait_for_selector("#seed-results", timeout=10_000)
+        # Snapshot before search
+        page.screenshot(path="scripts/screenshot_onboard_step2_before.png", full_page=True)
+        # Now search — this is what triggered the duplication bug
+        page.fill("input[name='q']", QUERY)
+        page.click("button:has-text('Search')")
+        # wait for results to swap in
+        page.wait_for_function(
+            "document.querySelectorAll('.seed-card').length > 0",
+            timeout=15_000,
+        )
+        page.wait_for_load_state("networkidle", timeout=15_000)
+        page.screenshot(path="scripts/screenshot_onboard_step2_after.png", full_page=True)
+        # ── Inspect the DOM
+        save_panels = page.locator("h2:has-text('Save a few papers you like')").count()
+        quick_imports = page.locator("text=Quick import:").count()
+        search_inputs = page.locator("input[name='q']").count()
+        seed_counters = page.locator("#seed-counter").count()
+        done_buttons = page.locator("button:has-text('Done — start exploring')").count()
+        seed_cards = page.locator(".seed-card").count()
+        seed_card_ids = page.locator(".seed-card").evaluate_all("els => els.map(e => e.id)")
+        print(f"'Save a few papers you like' headings: {save_panels} (expected 1)")
+        print(f"'Quick import:' blocks: {quick_imports} (expected 0)")
+        print(f"search inputs: {search_inputs} (expected 1)")
+        print(f"#seed-counter elements: {seed_counters} (expected 1)")
+        print(f"'Done — start exploring' buttons: {done_buttons} (expected 1)")
+        print(f"seed-cards: {seed_cards}, unique ids: {len(set(seed_card_ids))}")
+        ok = (
+            save_panels == 1
+            and quick_imports == 0
+            and search_inputs == 1
+            and seed_counters == 1
+            and done_buttons == 1
+            and seed_cards > 0
+            and seed_cards == len(set(seed_card_ids))
+        )
+        print("\nRESULT:", "PASS" if ok else "FAIL")
+        browser.close()
+if __name__ == "__main__":
+    run()

scripts/browser_test_search.py ADDED

	@@ -0,0 +1,77 @@

+"""Drive a real Chromium browser to verify the search UI shows results once."""
+from playwright.sync_api import sync_playwright
+URL = "http://127.0.0.1:7860"
+QUERY = "attention is all you need"
+def run():
+    with sync_playwright() as p:
+        browser = p.chromium.launch(headless=True)
+        ctx = browser.new_context(
+            viewport={"width": 1280, "height": 1800},
+        )
+        # Pre-seed cookie of a user that has saves so has_recs=True
+        ctx.add_cookies([{
+            "name": "arxiv_user_id",
+            "value": "browser-test-user",
+            "url": URL,
+        }])
+        page = ctx.new_page()
+        # 1) Land on the homepage and search from there.
+        page.goto(URL + "/", wait_until="networkidle")
+        page.fill("input[name='q']", QUERY)
+        page.screenshot(path="scripts/screenshot_before_submit.png", full_page=True)
+        page.click("button[type='submit']")
+        page.wait_for_url("**/search?q=*", timeout=10_000)
+        # search.html does not auto-load anything heavy when q is set, but give it a beat
+        page.wait_for_load_state("networkidle", timeout=15_000)
+        page.screenshot(path="scripts/screenshot_after_search.png", full_page=True)
+        # 2) Inspect the DOM
+        url = page.url
+        paper_cards = page.locator(".paper-card").count()
+        recs_visible = page.locator("#rec-section").count()
+        recs_heading = page.get_by_role("heading", name="Recommended for You").count()
+        results_heading_count = page.locator("text=results for").count()
+        print(f"URL after search: {url}")
+        print(f".paper-card count: {paper_cards}")
+        print(f"#rec-section count: {recs_visible}")
+        print(f"'Recommended for You' heading count: {recs_heading}")
+        print(f"'results for' phrase count: {results_heading_count}")
+        # 3) Check for duplicate paper IDs (the original 'twice' complaint)
+        ids = page.locator("[id^='paper-']").evaluate_all(
+            "els => els.map(e => e.id)"
+        )
+        unique = set(ids)
+        print(f"paper element ids: {len(ids)} total, {len(unique)} unique")
+        if len(ids) != len(unique):
+            from collections import Counter
+            dups = [k for k, v in Counter(ids).items() if v > 1]
+            print(f"DUPLICATE IDS: {dups}")
+        # Phase: title-match boost — Vaswani's "Attention Is All You Need"
+        # (1706.03762) must be the #1 result for this exact-title query.
+        first_paper_id = page.locator("[id^='paper-']").first.get_attribute("id")
+        print(f"first paper id: {first_paper_id}")
+        ok = (
+            recs_visible == 0
+            and recs_heading == 0
+            and results_heading_count == 1
+            and paper_cards == len(unique)
+            and paper_cards > 0
+            and first_paper_id == "paper-1706.03762"
+        )
+        print("\nRESULT:", "PASS" if ok else "FAIL")
+        browser.close()
+if __name__ == "__main__":
+    run()

scripts/diag_mamba.py ADDED

	@@ -0,0 +1,69 @@

+"""Diagnose why the Mamba paper (2312.00752) is missing from search results."""
+import asyncio
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc, turso_svc
+MAMBA_ID = "2312.00752"
+async def main():
+    # Step 1: is the paper in Qdrant at all?
+    vecs = await qdrant_svc.get_paper_vectors([MAMBA_ID])
+    in_qdrant = MAMBA_ID in vecs
+    print(f"Mamba paper {MAMBA_ID} in Qdrant: {in_qdrant}")
+    # Step 2: is it in Turso?
+    meta = await turso_svc.fetch_metadata_batch([MAMBA_ID])
+    if MAMBA_ID in meta:
+        print(f"Mamba paper in Turso: YES — title: {meta[MAMBA_ID].get('title')!r}")
+    else:
+        print("Mamba paper in Turso: NO")
+    if not in_qdrant:
+        print("\n--> Paper missing from Qdrant collection. End of investigation.")
+        return
+    # Step 3: where does it rank in dense, sparse, and fused?
+    q = "Mamba state space model linear time"
+    dense_vec, sparse_dict = embed_svc.encode_query(q)
+    print(f"\nQuery: {q!r}")
+    print(f"Sparse keys: {len(sparse_dict)}")
+    fetch_k = 60
+    dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+    sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+    dense_ids = [r["arxiv_id"] for r in dense]
+    sparse_ids = [r["arxiv_id"] for r in sparse]
+    if MAMBA_ID in dense_ids:
+        print(f"\nDense rank: {dense_ids.index(MAMBA_ID)+1}/{fetch_k}")
+    else:
+        print(f"\nDense top {fetch_k}: NOT present")
+    if MAMBA_ID in sparse_ids:
+        print(f"Sparse rank: {sparse_ids.index(MAMBA_ID)+1}/{fetch_k}")
+    else:
+        print(f"Sparse top {fetch_k}: NOT present")
+    fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+    fused_ids = [item["arxiv_id"] for item in fused]
+    if MAMBA_ID in fused_ids:
+        print(f"RRF fused rank: {fused_ids.index(MAMBA_ID)+1}")
+    else:
+        print(f"RRF fused: NOT present in top {len(fused_ids)}")
+    # Show top 5 of each
+    print(f"\n=== Dense top 5 ===")
+    for r in dense[:5]:
+        print(f"  {r['arxiv_id']}  score={r['score']:.4f}")
+    print(f"\n=== Sparse top 5 ===")
+    for r in sparse[:5]:
+        print(f"  {r['arxiv_id']}  score={r['score']:.4f}")
+asyncio.run(main())
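The script above calls `hybrid_search_svc._rrf_fuse(dense, sparse, k=60)`, whose body is not part of this hunk. As a rough illustration only, standard reciprocal rank fusion can be sketched as below. The name `rrf_fuse_sketch` is hypothetical, and the sketch assumes each hit is a dict with an `arxiv_id` key and that fused items carry an `rrf_score` field (the field `diag_search_rank.py` reads); the project's real function may differ.

```python
def rrf_fuse_sketch(dense, sparse, k=60):
    """Hypothetical stand-in for _rrf_fuse: plain reciprocal rank fusion.

    Each input is a ranked list of dicts with an "arxiv_id" key (assumed shape).
    A paper's fused score is the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores = {}
    for hits in (dense, sparse):
        for rank, hit in enumerate(hits, start=1):
            aid = hit["arxiv_id"]
            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
    fused = [{"arxiv_id": aid, "rrf_score": s} for aid, s in scores.items()]
    # Highest fused score first
    fused.sort(key=lambda item: item["rrf_score"], reverse=True)
    return fused


if __name__ == "__main__":
    dense = [{"arxiv_id": "a"}, {"arxiv_id": "b"}]
    sparse = [{"arxiv_id": "b"}, {"arxiv_id": "c"}]
    for item in rrf_fuse_sketch(dense, sparse):
        print(item["arxiv_id"], round(item["rrf_score"], 5))
```

Under this scheme, a paper that appears in only one of the two lists can still be outranked by papers that appear at moderate ranks in both, which is one reason the diagnostic prints the dense and sparse ranks separately.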

scripts/diag_search_rank.py ADDED

	@@ -0,0 +1,45 @@

+"""Trace where Vaswani's paper falls in the hybrid pipeline."""
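+import asyncio
+import sys
+from pathlib import Path
+# Make the repo root importable when run as `python scripts/diag_search_rank.py`
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc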
+VASWANI = "1706.03762"
+async def main():
+    q = "attention is all you need"
+    dense_vec, sparse_dict = embed_svc.encode_query(q)
+    print(f"sparse keys: {len(sparse_dict)}")
+    fetch_k = 60
+    dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+    sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+    dense_ids = [r["arxiv_id"] for r in dense]
+    sparse_ids = [r["arxiv_id"] for r in sparse]
+    print(f"\nVaswani in dense top {fetch_k}: ", VASWANI in dense_ids,
+          (f"(rank {dense_ids.index(VASWANI)+1})" if VASWANI in dense_ids else ""))
+    print(f"Vaswani in sparse top {fetch_k}: ", VASWANI in sparse_ids,
+          (f"(rank {sparse_ids.index(VASWANI)+1})" if VASWANI in sparse_ids else ""))
+    fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+    fused_ids = [item["arxiv_id"] for item in fused]
+    v_rank_rrf = fused_ids.index(VASWANI) + 1 if VASWANI in fused_ids else None
+    print(f"\nVaswani rank after pure RRF: {v_rank_rrf}")
+    print("\n=== Pure RRF (no recency), top 10 ===")
+    for i, item in enumerate(fused[:10], 1):
+        marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+        print(f"  {i:2d}. {item['arxiv_id']}  rrf={item['rrf_score']:.4f}{marker}")
+    ranked = hybrid_search_svc._recency_rerank([dict(x) for x in fused])
+    ranked_ids = [item["arxiv_id"] for item in ranked]
+    v_rank_recency = ranked_ids.index(VASWANI) + 1 if VASWANI in ranked_ids else None
+    print(f"\nVaswani rank after current 0.80/0.20 recency rerank: {v_rank_recency}")
+    print("\n=== Current rerank (0.80 RRF + 0.20 recency), top 10 ===")
+    for i, item in enumerate(ranked[:10], 1):
+        marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+        print(f"  {i:2d}. {item['arxiv_id']}  final={item['final_score']:.4f}{marker}")
+asyncio.run(main())
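The script above compares pure RRF order against the "0.80 RRF + 0.20 recency" rerank, but `_recency_rerank` itself is not shown in this hunk. The sketch below is only a guess at the general shape of such a blend. Everything beyond the two weights and the `rrf_score` / `final_score` field names is an assumption: the `age_days` input field, the max-normalisation of RRF scores, and the exponential decay with a one-year half-life are all hypothetical.

```python
import math


def recency_rerank_sketch(items, w_rrf=0.80, w_recency=0.20, half_life_days=365.0):
    """Hypothetical blend: final = w_rrf * normalised RRF + w_recency * recency.

    Assumes each item has "rrf_score" and "age_days" (the latter is an
    invented field for this sketch). Recency decays by half every
    `half_life_days`.
    """
    if not items:
        return []
    max_rrf = max(item["rrf_score"] for item in items) or 1.0
    ranked = []
    for item in items:
        recency = math.exp(-math.log(2) * item["age_days"] / half_life_days)
        final = w_rrf * (item["rrf_score"] / max_rrf) + w_recency * recency
        ranked.append({**item, "final_score": final})
    ranked.sort(key=lambda x: x["final_score"], reverse=True)
    return ranked


if __name__ == "__main__":
    items = [
        {"arxiv_id": "old-exact-match", "rrf_score": 0.032, "age_days": 3000},
        {"arxiv_id": "new-near-match", "rrf_score": 0.030, "age_days": 30},
    ]
    for item in recency_rerank_sketch(items):
        print(item["arxiv_id"], round(item["final_score"], 4))
```

With these made-up numbers the newer paper overtakes the older one despite a lower RRF score, which is the kind of rank shift the before/after printout in the script is meant to expose.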

scripts/e2e_audit.py ADDED

	@@ -0,0 +1,622 @@

+"""
+End-to-end audit of the ResearchIT recommendation pipeline.
+Steps:
+  1. Smoke test: hybrid search (10 queries, per-layer scores)
+  2. User profile pipeline: EWMA update + Ward clustering
+  3. Recommendation feed generation with quota fusion
+  4. LightGBM reranker pass
+  5. MMR diversity + exploration injection
+  6. Gap analysis
+Run:  python scripts/e2e_audit.py
+"""
+from __future__ import annotations
+import asyncio
+import sys
+import time
+from pathlib import Path
+import numpy as np
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+# ── Imports ──────────────────────────────────────────────────────────────────
+from app import hybrid_search_svc, turso_svc, embed_svc, qdrant_svc, zilliz_svc, groq_svc, db
+from app.recommend import profiles, clustering
+from app.recommend.reranker import (
+    rerank_candidates, compute_features, heuristic_score,
+    is_model_loaded, get_num_trees, FEATURE_NAMES,
+)
+from app.recommend.diversity import mmr_rerank, inject_exploration
+# ── Globals ──────────────────────────────────────────────────────────────────
+ERRORS: list[str] = []
+WRONG_OUTPUTS: list[str] = []
+MISSING: list[str] = []
+TEST_USER = "e2e_audit_user_001"
+# ── Helpers ──────────────────────────────────────────────────────────────────
+def banner(text: str):
+    print(f"\n{'='*90}")
+    print(f"  {text}")
+    print(f"{'='*90}\n")
+def check(label: str, condition: bool, detail: str = ""):
+    status = "OK" if condition else "FAIL"
+    msg = f"  [{status:>4}] {label}"
+    if detail:
+        msg += f"  --  {detail}"
+    print(msg)
+    if not condition:
+        WRONG_OUTPUTS.append(f"{label}: {detail}")
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 1 — SMOKE TEST: HYBRID SEARCH
+# ═══════════════════════════════════════════════════════════════════════════════
+SEARCH_QUERIES = [
+    "vision transformer image classification",
+    "reinforcement learning reward shaping",
+    "large language model fine-tuning RLHF",
+    "graph neural network drug discovery",
+    "federated learning differential privacy",
+    "attention is all you need",
+    "diffusion models image generation",
+    "knowledge distillation BERT compression",
+    "object detection YOLO real-time",
+    "protein structure prediction deep learning",
+]
+async def step1_search():
+    banner("STEP 1: HYBRID SEARCH SMOKE TEST")
+    print(f"Running {len(SEARCH_QUERIES)} queries...\n")
+    all_latencies = []
+    all_results_count = []
+    for i, q in enumerate(SEARCH_QUERIES, 1):
+        t0 = time.perf_counter()
+        try:
+            results = await hybrid_search_svc.search(q, limit=10)
+            elapsed = (time.perf_counter() - t0) * 1000
+        except Exception as e:
+            ERRORS.append(f"Step 1: Query {q!r} threw {type(e).__name__}: {e}")
+            print(f"  Q{i}: {q!r} -> ERROR: {e}")
+            continue
+        all_latencies.append(elapsed)
+        all_results_count.append(len(results))
+        # Fetch metadata for display
+        meta = {}
+        if results:
+            try:
+                meta = await turso_svc.fetch_metadata_batch(results)
+            except Exception as e:
+                ERRORS.append(f"Step 1: Metadata fetch failed for {q!r}: {e}")
+        print(f"  Q{i}: {q!r}")
+        print(f"       Results: {len(results)}  |  Latency: {elapsed:.0f}ms")
+        for rank, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:65]
+            cites = m.get("citation_count", 0) or 0
+            print(f"       {rank}. [{cites:>6} cites] {aid:14s}  {title}")
+        # Relevance check: a title counts as relevant if it contains at least 2
+        # query words (substring match); pass if at least 2 of the top 5 are relevant.
+        if results and meta:
+            q_words = set(q.lower().split())
+            relevant = 0
+            for aid in results[:5]:
+                t = (meta.get(aid, {}).get("title") or "").lower()
+                matches = sum(1 for w in q_words if w in t)
+                if matches >= 2:
+                    relevant += 1
+            check(f"Q{i} relevance ({relevant}/5 top results overlap query terms)",
+                  relevant >= 2,
+                  f"{q!r}")
+        print()
+    # Summary
+    if all_latencies:
+        print(f"  --- Search Summary ---")
+        print(f"  Queries: {len(all_latencies)}")
+        print(f"  Avg latency: {sum(all_latencies)/len(all_latencies):.0f}ms")
+        print(f"  p50: {sorted(all_latencies)[len(all_latencies)//2]:.0f}ms")
+        print(f"  Max: {max(all_latencies):.0f}ms")
+        zero_results = sum(1 for c in all_results_count if c == 0)
+        print(f"  Zero-result queries: {zero_results}")
+        if zero_results > 0:
+            ERRORS.append(f"Step 1: {zero_results} queries returned 0 results")
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 2 — USER PROFILE PIPELINE
+# ═══════════════════════════════════════════════════════════════════════════════
+# Real paper IDs from known categories:
+# CV papers (computer vision)
+CV_PAPERS = [
+    "1512.03385",   # ResNet
+    "2010.11929",   # ViT
+    "2105.01601",   # MLP-Mixer
+    "2106.08254",   # BEiT
+    "1409.1556",    # VGGNet
+]
+# LLM papers (NLP / language models)
+LLM_PAPERS = [
+    "1706.03762",   # Attention Is All You Need
+    "1810.04805",   # BERT
+    "2005.14165",   # GPT-3
+    "2303.08774",   # GPT-4
+    "2302.13971",   # LLaMA
+]
+ALL_SEED_PAPERS = CV_PAPERS + LLM_PAPERS
+async def step2_profiles():
+    banner("STEP 2: USER PROFILE PIPELINE")
+    # Initialize DB
+    await db.init_db()
+    print(f"  Test user: {TEST_USER}")
+    print(f"  Seed papers: {len(ALL_SEED_PAPERS)} (5 CV + 5 LLM)")
+    # Step 2a: Retrieve embeddings for seed papers from Qdrant (batch)
+    print(f"\n  Fetching embeddings from Qdrant for {len(ALL_SEED_PAPERS)} papers...")
+    embeddings = {}
+    try:
+        vecs = await qdrant_svc.get_paper_vectors(ALL_SEED_PAPERS)
+        for aid, vec in vecs.items():
+            embeddings[aid] = np.array(vec, dtype=np.float32)
+        missing = [a for a in ALL_SEED_PAPERS if a not in embeddings]
+        if missing:
+            print(f"    WARN: No vectors for {len(missing)} papers: {missing[:3]}...")
+    except Exception as e:
+        print(f"    ERROR: get_paper_vectors -> {e}")
+        ERRORS.append(f"Step 2: get_paper_vectors failed: {e}")
+    print(f"  Retrieved {len(embeddings)}/{len(ALL_SEED_PAPERS)} embeddings")
+    if len(embeddings) < 5:
+        ERRORS.append(f"Step 2: Only {len(embeddings)} embeddings retrieved, need >= 5")
+        print("  ABORT: Not enough embeddings to continue Step 2")
+        return None, None
+    # Step 2b: EWMA profile updates
+    print(f"\n  Running EWMA profile updates (alpha_long={profiles.ALPHA_LONG_TERM}, "
+          f"alpha_short={profiles.ALPHA_SHORT_TERM})...")
+    for aid in ALL_SEED_PAPERS:
+        if aid not in embeddings:
+            continue
+        try:
+            await profiles.update_on_save(TEST_USER, embeddings[aid])
+        except Exception as e:
+            ERRORS.append(f"Step 2: EWMA update failed for {aid}: {e}")
+            print(f"    ERROR: update_on_save({aid}) -> {e}")
+    # Load profiles back
+    lt_vec = await profiles.load_profile(TEST_USER, "long_term")
+    st_vec = await profiles.load_profile(TEST_USER, "short_term")
+    lt_count = await profiles.get_interaction_count(TEST_USER, "long_term")
+    st_count = await profiles.get_interaction_count(TEST_USER, "short_term")
+    check("Long-term profile exists", lt_vec is not None)
+    check("Short-term profile exists", st_vec is not None)
+    check(f"Long-term interaction count = {lt_count}", lt_count == len(embeddings),
+          f"expected {len(embeddings)}")
+    check(f"Short-term interaction count = {st_count}", st_count == len(embeddings),
+          f"expected {len(embeddings)}")
+    if lt_vec is not None:
+        lt_norm = float(np.linalg.norm(lt_vec))
+        check(f"Long-term vector L2-norm ~= 1.0 (actual: {lt_norm:.4f})",
+              abs(lt_norm - 1.0) < 0.01)
+    if st_vec is not None:
+        st_norm = float(np.linalg.norm(st_vec))
+        check(f"Short-term vector L2-norm ~= 1.0 (actual: {st_norm:.4f})",
+              abs(st_norm - 1.0) < 0.01)
+    # Step 2c: Ward hierarchical clustering
+    print(f"\n  Running Ward clustering on {len(embeddings)} paper embeddings...")
+    paper_ids = list(embeddings.keys())
+    emb_matrix = np.stack([embeddings[aid] for aid in paper_ids])
+    try:
+        clusters = clustering.compute_clusters(
+            paper_ids=paper_ids,
+            embeddings=emb_matrix,
+        )
+    except Exception as e:
+        ERRORS.append(f"Step 2: compute_clusters failed: {e}")
+        print(f"    ERROR: {e}")
+        return lt_vec, st_vec
+    print(f"  Clusters found: {len(clusters)}")
+    for c in clusters:
+        print(f"    Cluster {c.cluster_idx}: medoid={c.medoid_paper_id}, "
+              f"papers={len(c.paper_ids)}, importance={c.importance:.3f}")
+        for pid in c.paper_ids:
+            label = "CV" if pid in CV_PAPERS else "LLM" if pid in LLM_PAPERS else "?"
+            print(f"      - {pid} [{label}]")
+    check(f"Number of clusters >= 2 (actual: {len(clusters)})",
+          len(clusters) >= 2,
+          "CV and LLM papers should form distinct clusters")
+    # Check cluster purity
+    for c in clusters:
+        cv_count = sum(1 for p in c.paper_ids if p in CV_PAPERS)
+        llm_count = sum(1 for p in c.paper_ids if p in LLM_PAPERS)
+        total = len(c.paper_ids)
+        purity = max(cv_count, llm_count) / total if total > 0 else 0
+        dominant = "CV" if cv_count > llm_count else "LLM"
+        check(f"Cluster {c.cluster_idx} purity ({dominant}: {purity:.0%})",
+              purity >= 0.6,
+              f"{cv_count} CV + {llm_count} LLM papers")
+    # Save clusters for Step 3
+    try:
+        await clustering.save_clusters_to_db(TEST_USER, clusters)
+    except Exception as e:
+        ERRORS.append(f"Step 2: save_clusters_to_db failed: {e}")
+    return lt_vec, st_vec
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 3 — RECOMMENDATION FEED GENERATION
+# ═══════════════════════════════════════════════════════════════════════════════
+async def step3_recommendation_feed(lt_vec, st_vec):
+    banner("STEP 3: RECOMMENDATION FEED GENERATION")
+    if lt_vec is None:
+        ERRORS.append("Step 3: Skipped — no long-term profile from Step 2")
+        print("  SKIPPED: No profile vectors from Step 2")
+        return None, None, None
+    # Load clusters from DB
+    clusters = await clustering.load_clusters_from_db(TEST_USER)
+    if not clusters:
+        ERRORS.append("Step 3: No clusters found in DB")
+        print("  SKIPPED: No clusters in DB")
+        return None, None, None
+    print(f"  Loaded {len(clusters)} clusters from DB")
+    print(f"  Target feed size: 20 papers")
+    # Step 3a: Search for candidates per cluster (using medoid embeddings)
+    all_candidates: dict[str, dict] = {}  # arxiv_id -> metadata
+    all_embeddings: dict[str, np.ndarray] = {}
+    cluster_assignments: dict[str, int] = {}  # arxiv_id -> cluster_idx
+    seen = set(ALL_SEED_PAPERS)
+    t0 = time.perf_counter()
+    # Get medoid vectors in batch
+    medoid_ids = [c["medoid_paper_id"] for c in clusters]
+    medoid_vecs = await qdrant_svc.get_paper_vectors(medoid_ids)
+    for c in clusters:
+        mid = c["medoid_paper_id"]
+        medoid_vec = None
+        # Try stored blob first
+        if c.get("medoid_embedding_blob"):
+            medoid_vec = np.frombuffer(c["medoid_embedding_blob"], dtype=np.float32)
+        # Fallback: batch-fetched vector
+        if medoid_vec is None and mid in medoid_vecs:
+            medoid_vec = np.array(medoid_vecs[mid], dtype=np.float32)
+        if medoid_vec is None:
+            ERRORS.append(f"Step 3: No medoid vector for cluster {c['cluster_idx']}")
+            continue
+        # Search Qdrant for similar papers (with scores + vectors)
+        try:
+            results = await qdrant_svc.search_by_vector_with_scores(
+                medoid_vec.tolist(), limit=30, with_vectors=True
+            )
+        except Exception as e:
+            ERRORS.append(f"Step 3: search failed for cluster {c['cluster_idx']}: {e}")
+            continue
+        # Filter out seen papers
+        for r in results:
+            aid = r["arxiv_id"]
+            if aid in seen:
+                continue
+            all_candidates[aid] = {"score": r["score"]}
+            cluster_assignments[aid] = c["cluster_idx"]
+            if "vector" in r:
+                all_embeddings[aid] = np.array(r["vector"], dtype=np.float32)
+            seen.add(aid)
+            # Cap each cluster at 15 candidates
+            if sum(1 for cidx in cluster_assignments.values() if cidx == c["cluster_idx"]) >= 15:
+                break
+    elapsed_search = (time.perf_counter() - t0) * 1000
+    print(f"  Candidate search: {len(all_candidates)} papers in {elapsed_search:.0f}ms")
+    if not all_candidates:
+        ERRORS.append("Step 3: Zero candidates retrieved")
+        print("  ABORT: No candidates")
+        return None, None, None
+    # Step 3b: Fetch metadata
+    cand_ids = list(all_candidates.keys())
+    try:
+        meta = await turso_svc.fetch_metadata_batch(cand_ids)
+    except Exception as e:
+        ERRORS.append(f"Step 3: metadata fetch failed: {e}")
+        meta = {}
+    # Step 3c: Fetch embeddings for candidates (use what we got from search + batch fetch rest)
+    cand_embeddings = dict(all_embeddings)  # Already have some from with_vectors=True
+    missing_emb = [aid for aid in cand_ids if aid not in cand_embeddings]
+    if missing_emb:
+        print(f"  Fetching {len(missing_emb)} missing embeddings from Qdrant...")
+        try:
+            extra = await qdrant_svc.get_paper_vectors(missing_emb)
+            for aid, vec in extra.items():
+                cand_embeddings[aid] = np.array(vec, dtype=np.float32)
+        except Exception as e:
+            print(f"    WARN: batch vector fetch failed: {e}")
+    print(f"  Got {len(cand_embeddings)}/{len(cand_ids)} embeddings")
+    # Build aligned arrays
+    valid_ids = [aid for aid in cand_ids if aid in cand_embeddings and aid in meta]
+    if len(valid_ids) < 5:
+        ERRORS.append(f"Step 3: Only {len(valid_ids)} valid candidates")
+        print(f"  ABORT: Not enough valid candidates")
+        return None, None, None
+    emb_matrix = np.stack([cand_embeddings[aid] for aid in valid_ids])
+    meta_list = [meta[aid] for aid in valid_ids]
+    # Step 3d: Print the raw candidate feed
+    print(f"\n  Raw candidate feed ({len(valid_ids)} papers):")
+    cluster_counts: dict[int, int] = {}
+    for i, aid in enumerate(valid_ids[:20]):
+        m = meta.get(aid, {})
+        title = (m.get("title") or "?")[:55]
+        cites = m.get("citation_count", 0) or 0
+        cidx = cluster_assignments.get(aid, -1)
+        cluster_counts[cidx] = cluster_counts.get(cidx, 0) + 1
+        print(f"    {i+1:2d}. [C{cidx}] [{cites:>6} cites] {title}")
+    print(f"\n  Cluster distribution in top 20:")
+    for cidx, count in sorted(cluster_counts.items()):
+        print(f"    Cluster {cidx}: {count} papers")
+    total_feed = (time.perf_counter() - t0) * 1000
+    print(f"  Total feed generation: {total_feed:.0f}ms")
+    return valid_ids, emb_matrix, meta_list
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 4 — LIGHTGBM RERANKER
+# ═══════════════════════════════════════════════════════════════════════════════
+async def step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec):
+    banner("STEP 4: LIGHTGBM RERANKER")
+    if valid_ids is None:
+        print("  SKIPPED: No candidates from Step 3")
+        return
+    print(f"  Model loaded: {is_model_loaded()}")
+    if is_model_loaded():
+        print(f"  Trees: {get_num_trees()}")
+    else:
+        MISSING.append("LightGBM model not loaded — using heuristic fallback")
+    n = min(len(valid_ids), 20)
+    ids_subset = valid_ids[:n]
+    emb_subset = emb_matrix[:n]
+    meta_subset = meta_list[:n]
+    print(f"  Running reranker on {n} candidates...")
+    t0 = time.perf_counter()
+    try:
+        sorted_ids, sorted_scores, sorted_embs = rerank_candidates(
+            ids_subset,
+            emb_subset,
+            meta_subset,
+            lt_vec,
+            st_vec,
+            None,  # no negative profile
+        )
+        elapsed = (time.perf_counter() - t0) * 1000
+    except Exception as e:
+        ERRORS.append(f"Step 4: rerank_candidates failed: {e}")
+        print(f"  ERROR: {e}")
+        return
+    print(f"  Reranker latency: {elapsed:.0f}ms")
+    print(f"\n  Reranked order (top 10):")
+    # Fetch metadata for display
+    re_meta = {}
+    try:
+        re_meta = await turso_svc.fetch_metadata_batch(sorted_ids[:10])
+    except Exception:
+        pass
+    for i, (aid, score) in enumerate(zip(sorted_ids[:10], sorted_scores[:10]), 1):
+        m = re_meta.get(aid, {})
+        title = (m.get("title") or "?")[:55]
+        cites = m.get("citation_count", 0) or 0
+        old_rank = ids_subset.index(aid) + 1 if aid in ids_subset else "?"
+        print(f"    {i:2d}. (was #{old_rank:>2}) [{cites:>6} cites] score={score:.4f}  {title}")
+    # Feature analysis for top 3 and bottom 3
+    features = compute_features(emb_subset, meta_subset, lt_vec, st_vec, None)
+    print(f"\n  Feature snapshot (top 3 reranked papers):")
+    for rank_idx in range(min(3, len(sorted_ids))):
+        aid = sorted_ids[rank_idx]
+        orig_idx = ids_subset.index(aid)
+        f = features[orig_idx]
+        print(f"    #{rank_idx+1} {aid}:")
+        print(f"      qdrant_cosine={f[0]:.3f}  lt_sim={f[20]:.3f}  st_sim={f[21]:.3f}  "
+              f"cites={f[2]:.0f}  recency={f[6]:.3f}  age_days={f[5]:.0f}")
+    if len(sorted_ids) >= 3:
+        print(f"\n  Feature snapshot (bottom 3 reranked papers):")
+        for rank_idx in range(max(0, len(sorted_ids)-3), len(sorted_ids)):
+            aid = sorted_ids[rank_idx]
+            orig_idx = ids_subset.index(aid)
+            f = features[orig_idx]
+            print(f"    #{rank_idx+1} {aid}:")
+            print(f"      qdrant_cosine={f[0]:.3f}  lt_sim={f[20]:.3f}  st_sim={f[21]:.3f}  "
+                  f"cites={f[2]:.0f}  recency={f[6]:.3f}  age_days={f[5]:.0f}")
+    # Check: did reranking change anything?
+    moved = sum(1 for i, aid in enumerate(sorted_ids) if aid != ids_subset[i])
+    check(f"Reranker changed {moved}/{n} positions", moved > 0,
+          "Reranker should reorder candidates based on features")
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 5 — MMR DIVERSITY + EXPLORATION
+# ═══════════════════════════════════════════════════════════════════════════════
+async def step5_diversity(valid_ids, emb_matrix, lt_vec):
+    banner("STEP 5: MMR DIVERSITY + EXPLORATION")
+    if valid_ids is None or lt_vec is None:
+        print("  SKIPPED: No data from previous steps")
+        return
+    n = min(len(valid_ids), 30)
+    print(f"  Running MMR (lambda=0.6) on {n} candidates, selecting 15...")
+    t0 = time.perf_counter()
+    try:
+        mmr_ids = mmr_rerank(
+            lt_vec, emb_matrix[:n], valid_ids[:n],
+            lambda_param=0.6, top_k=15,
+        )
+        elapsed = (time.perf_counter() - t0) * 1000
+    except Exception as e:
+        ERRORS.append(f"Step 5: mmr_rerank failed: {e}")
+        print(f"  ERROR: {e}")
+        return
+    print(f"  MMR latency: {elapsed:.0f}ms")
+    print(f"  MMR selected {len(mmr_ids)} papers")
+    # Check rank changes
+    moved = sum(1 for i, aid in enumerate(mmr_ids) if i < len(valid_ids) and aid != valid_ids[i])
+    print(f"  Rank changes vs input: {moved}/{len(mmr_ids)}")
+    # Exploration injection
+    with_explore = inject_exploration(mmr_ids, valid_ids[:n], n_explore=2, seed=42)
+    explore_count = len(with_explore) - len(mmr_ids)
+    print(f"  Exploration injected: {explore_count} papers")
+    check("Exploration added papers", explore_count > 0 or len(valid_ids[:n]) <= len(mmr_ids))
+    # Check diversity: compute avg pairwise cosine among selected
+    selected_embs = []
+    for aid in mmr_ids[:10]:
+        if aid in valid_ids:
+            idx = valid_ids.index(aid)
+            if idx < len(emb_matrix):
+                selected_embs.append(emb_matrix[idx])
+    if len(selected_embs) >= 2:
+        sel_matrix = np.stack(selected_embs)
+        norms = sel_matrix / (np.linalg.norm(sel_matrix, axis=1, keepdims=True) + 1e-10)
+        sim_matrix = norms @ norms.T
+        # Average off-diagonal similarity
+        mask = ~np.eye(len(sel_matrix), dtype=bool)
+        avg_sim = sim_matrix[mask].mean()
+        print(f"  Avg pairwise cosine among top 10 MMR picks: {avg_sim:.3f}")
+        check("MMR diversity (avg pairwise sim < 0.85)", avg_sim < 0.85,
+              f"actual: {avg_sim:.3f}")
+# ═══════════════════════════════════════════════════════════════════════════════
+#  STEP 6 — GAP ANALYSIS
+# ═══════════════════════════════════════════════════════════════════════════════
+def step6_gap_analysis():
+    banner("STEP 6: GAP ANALYSIS")
+    print("  ERRORS (things that threw exceptions or returned empty):")
+    if ERRORS:
+        for e in ERRORS:
+            print(f"    - {e}")
+    else:
+        print("    (none)")
+    print("\n  WRONG OUTPUTS (things that ran but returned bad results):")
+    if WRONG_OUTPUTS:
+        for w in WRONG_OUTPUTS:
+            print(f"    - {w}")
+    else:
+        print("    (none)")
+    print("\n  MISSING PIECES (not implemented or not loaded):")
+    if MISSING:
+        for m in MISSING:
+            print(f"    - {m}")
+    else:
+        print("    (none)")
+    print(f"\n  Totals: {len(ERRORS)} errors, {len(WRONG_OUTPUTS)} wrong outputs, {len(MISSING)} missing")
+    # Verdict
+    total_issues = len(ERRORS) + len(WRONG_OUTPUTS) + len(MISSING)
+    if total_issues == 0:
+        print("\n  VERDICT: ALL CLEAR")
+    else:
+        print(f"\n  VERDICT: {total_issues} issues found")
+# ═══════════════════════════════════════════════════════════════════════════════
+#  MAIN
+# ═══════════════════════════════════════════════════════════════════════════════
+async def main():
+    banner("RESEARCHIT E2E PIPELINE AUDIT")
+    print("  Warming up BGE-M3 + services...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+    print("  Ready.\n")
+    # Step 1: Search
+    await step1_search()
+    # Step 2: Profiles + Clustering
+    lt_vec, st_vec = await step2_profiles()
+    # Step 3: Recommendation feed
+    valid_ids, emb_matrix, meta_list = await step3_recommendation_feed(lt_vec, st_vec)
+    # Step 4: Reranker
+    await step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec)
+    # Step 5: MMR Diversity
+    await step5_diversity(valid_ids, emb_matrix, lt_vec)
+    # Step 6: Gap analysis
+    step6_gap_analysis()
+    banner("AUDIT COMPLETE")
+if __name__ == "__main__":
+    asyncio.run(main())
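Step 5 of the audit calls `mmr_rerank(lt_vec, emb_matrix, ids, lambda_param=0.6, top_k=15)` from `app.recommend.diversity`, which is not included in this diff. For orientation, here is a minimal sketch of textbook Maximal Marginal Relevance with the same argument order. `mmr_sketch` is a hypothetical name, and the use of cosine similarity for both the relevance and redundancy terms is an assumption; the project's implementation may score things differently.

```python
import numpy as np


def mmr_sketch(query_vec, cand_embs, cand_ids, lambda_param=0.6, top_k=15):
    """Greedy MMR: pick the candidate maximising
    lambda * sim(query, c) - (1 - lambda) * max sim(c, already selected).
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-10)
    embs = cand_embs / (np.linalg.norm(cand_embs, axis=1, keepdims=True) + 1e-10)
    relevance = embs @ q  # cosine similarity to the profile vector

    selected: list[int] = []
    remaining = list(range(len(cand_ids)))
    while remaining and len(selected) < top_k:
        if selected:
            # Highest similarity of each remaining candidate to any selected one
            redundancy = (embs[remaining] @ embs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        mmr = lambda_param * relevance[remaining] - (1 - lambda_param) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return [cand_ids[i] for i in selected]


if __name__ == "__main__":
    def unit(deg):
        r = np.deg2rad(deg)
        return np.array([np.cos(r), np.sin(r)])

    query = unit(0)
    embs = np.stack([unit(20), unit(19), unit(-25)])
    print(mmr_sketch(query, embs, ["a", "b", "c"], lambda_param=0.6, top_k=3))
```

In the toy example, `a` and `b` are near-duplicates; after the most relevant of the two is chosen, the redundancy penalty lets the less relevant but more distinct `c` be selected ahead of the other, which is the behaviour the audit's average pairwise cosine check is probing for.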

scripts/eval_expanded_queries.py ADDED

	@@ -0,0 +1,336 @@

+"""
+Expanded search quality evaluation — realistic user queries.
+The original eval_search_quality.py uses 21 queries across 5 bands (A-E).
+This script expands to 8 categories that simulate REAL users of an academic
+paper search engine, not just known-item lookups and adversarial tests.
+Categories:
+  F: Beginner / Newcomer — "explain like I'm starting a research project"
+  G: Research-in-Progress — "I know the field, looking for specific work"
+  H: Implementation-Focused — "I want to BUILD something"
+  I: Comparative / Survey — "compare X vs Y" or "survey of Z"
+  J: Emerging / Cutting-Edge — "what's new in X?"
+  K: Cross-Domain — "applying X from domain A to domain B"
+  L: Vague / Exploratory — underspecified queries that real users actually type
+  M: Follow-up / Refinement — queries that build on prior context
+Run:  python scripts/eval_expanded_queries.py
+"""
+from __future__ import annotations
+import asyncio
+import json
+import sys
+import time
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import hybrid_search_svc
+from app import turso_svc
+from app import embed_svc
+from app import groq_svc
+# ── Query definitions ────────────────────────────────────────────────────────
+# (band, query, expected_arxiv_id_or_None, description)
+QUERIES: list[tuple[str, str, str | None, str]] = [
+    # ── Band A (original): Known-item titles ─────────────────────────────────
+    ("A", "attention is all you need", "1706.03762",
+     "Landmark transformer paper by Vaswani et al."),
+    ("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805",
+     "Full BERT title — should be exact #1"),
+    ("A", "Deep Residual Learning for Image Recognition", "1512.03385",
+     "ResNet — the most-cited CV paper"),
+    # ── Band F: Beginner / Newcomer queries ──────────────────────────────────
+    # These simulate a student or newcomer who doesn't know the jargon.
+    ("F", "how do transformers work in NLP",  None,
+     "Newcomer asking about transformer basics"),
+    ("F", "what is reinforcement learning from human feedback",  None,
+     "Beginner asking about RLHF — should surface Ouyang/InstructGPT/Christiano"),
+    ("F", "explain how neural networks learn",  None,
+     "Very basic — should return foundational/survey papers"),
+    ("F", "what are diffusion models and how do they generate images",  None,
+     "Beginner asking about DDPM/Stable Diffusion family"),
+    ("F", "how does GPT-4 work",  None,
+     "Newcomer asking about GPT-4 — should surface the technical report"),
+    # ── Band G: Research-in-Progress queries ─────────────────────────────────
+    # These simulate a PhD student deep in their research.
+    ("G", "contrastive learning for self-supervised visual representations",  None,
+     "Should return SimCLR, MoCo, BYOL, DINO etc."),
+    ("G", "knowledge distillation from large language models to smaller ones",  None,
+     "Distillation pipeline — DistilBERT, TinyBERT, knowledge distillation surveys"),
+    ("G", "graph neural networks for molecular property prediction",  None,
+     "GNN + chemistry — SchNet, DimeNet, MPNN papers"),
+    ("G", "efficient inference for large language models quantization pruning",  None,
+     "LLM compression — GPTQ, AWQ, SparseGPT, pruning surveys"),
+    ("G", "causal inference in observational studies with machine learning",  None,
+     "Causal ML — double ML, causal forests, CATE estimation"),
+    ("G", "multi-task learning with shared representations",  None,
+     "MTL surveys, hard/soft parameter sharing, task relationships"),
+    # ── Band H: Implementation-Focused queries ───────────────────────────────
+    # These simulate someone who wants to BUILD something.
+    ("H", "how to fine-tune a pre-trained language model for classification",  None,
+     "Practical fine-tuning — ULMFiT, how-to-fine-tune-BERT papers"),
+    ("H", "implementing attention mechanism from scratch",  None,
+     "Implementation-level detail — attention tutorials, scaled dot product"),
+    ("H", "best practices for training stable diffusion models",  None,
+     "Practical SD training — latent diffusion, classifier-free guidance"),
+    ("H", "building a retrieval augmented generation system",  None,
+     "RAG — should surface the Lewis et al. RAG paper, REALM, etc."),
+    ("H", "how to do distributed training with PyTorch across GPUs",  None,
+     "Distributed training — ZeRO, Megatron, FSDP, DeepSpeed papers"),
+    # ── Band I: Comparative / Survey queries ─────────────────────────────────
+    # Users who want to understand the landscape.
+    ("I", "transformer vs CNN for image classification",  None,
+     "ViT vs ResNet/EfficientNet — should surface comparison papers"),
+    ("I", "survey of large language models",  None,
+     "LLM surveys — Zhao et al. survey, Minaee survey"),
+    ("I", "comparison of object detection architectures YOLO vs DETR",  None,
+     "YOLO family vs transformer-based detection"),
+    ("I", "GAN vs diffusion models for image generation",  None,
+     "Generative model comparison — StyleGAN, DDPM, score matching"),
+    ("I", "review of federated learning privacy methods",  None,
+     "FL surveys — McMahan, differential privacy in FL"),
+    # ── Band J: Emerging / Cutting-Edge queries ──────────────────────────────
+    # Users looking for the latest developments.
+    ("J", "mixture of experts models scaling",  None,
+     "MoE — Switch Transformer, Mixtral, GShard"),
+    ("J", "test-time compute scaling for reasoning",  None,
+     "New paradigm — o1-style reasoning, tree search at inference"),
+    ("J", "multimodal large language models vision and text",  None,
+     "GPT-4V, LLaVA, Flamingo, multimodal LLMs"),
+    ("J", "state space models as alternative to transformers",  None,
+     "S4, Mamba, H3 — structured state space models"),
+    ("J", "constitutional AI and AI safety alignment techniques",  None,
+     "Anthropic constitutional AI, RLHF alternatives, safety"),
+    ("J", "sparse attention mechanisms for long context",  None,
+     "Longformer, BigBird, sparse transformers for 100K+ context"),
+    # ── Band K: Cross-Domain queries ─────────────────────────────────────────
+    # Users applying ML to their specific domain.
+    ("K", "deep learning for protein structure prediction",  None,
+     "AlphaFold, ESMFold, protein language models"),
+    ("K", "natural language processing for legal document analysis",  None,
+     "Legal NLP — contract analysis, legal BERT, court opinion mining"),
+    ("K", "machine learning for climate change prediction",  None,
+     "Climate ML — weather forecasting, carbon modeling"),
+    ("K", "using transformers for time series forecasting",  None,
+     "Time series transformers — Informer, Autoformer, PatchTST"),
+    ("K", "reinforcement learning for robotics manipulation",  None,
+     "RL + robotics — sim-to-real transfer, dexterous manipulation"),
+    # ── Band L: Vague / Exploratory queries ──────────────────────────────────
+    # Underspecified queries that real users actually type.
+    ("L", "AI ethics",  None,
+     "Very broad — should return survey-level papers on AI ethics/fairness/bias"),
+    ("L", "embedding",  None,
+     "Single word — highly ambiguous. Word2Vec? Sentence embeddings? Image embeddings?"),
+    ("L", "language model",  None,
+     "Broad — should return influential LM papers or surveys"),
+    ("L", "generate images from text",  None,
+     "Casual — should surface DALL-E, Stable Diffusion, Imagen"),
+    ("L", "make AI more safe",  None,
+     "Very casual — should surface alignment/safety papers"),
+    # ── Band M: Follow-up / Refinement queries ───────────────────────────────
+    # Simulate a user who already found something and wants more.
+    ("M", "improvements to the original transformer architecture",  None,
+     "Post-Vaswani improvements — Reformer, Performer, ALiBi, RoPE"),
+    ("M", "papers that cite ResNet and extend residual connections",  None,
+     "ResNet extensions — DenseNet, ResNeXt, WideResNet, SE-Net"),
+    ("M", "alternatives to RLHF for aligning language models",  None,
+     "DPO, SPIN, KTO — methods that bypass reward modeling"),
+    ("M", "BERT variants for low resource languages",  None,
+     "mBERT, XLM-R, AfriBERTa, AraBERT: multilingual and language-specific BERT variants"),
+]
+# ── Wire rewrite logging ─────────────────────────────────────────────────────
+_rewrite_log: dict[str, str] = {}
+_original_rewrite = groq_svc.rewrite
+async def _logging_rewrite(q: str) -> str:
+    r = await _original_rewrite(q)
+    _rewrite_log[q] = r
+    return r
+groq_svc.rewrite = _logging_rewrite
+# ── Per-query evaluation ─────────────────────────────────────────────────────
+async def eval_query(
+    band: str, query: str, expected_id: str | None, description: str
+) -> dict:
+    """Run one query end-to-end and return structured results."""
+    t0 = time.perf_counter()
+    results = await hybrid_search_svc.search(query, limit=10)
+    elapsed_ms = (time.perf_counter() - t0) * 1000
+    rewrite = _rewrite_log.get(query, query)
+    rewrite_fired = rewrite.strip() != query.strip()
+    titles: dict[str, str] = {}
+    categories: dict[str, str] = {}
+    if results:
+        meta = await turso_svc.fetch_metadata_batch(results)
+        titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}
+        categories = {aid: (m.get("primary_topic") or "?") for aid, m in meta.items()}
+    # Print formatted output
+    print()
+    print(f"[{band}] {query!r}")
+    print(f"      intent: {description}")
+    if rewrite_fired:
+        print(f"      rewrite: {rewrite!r}")
+    else:
+        print("      rewrite: (skipped or no change)")
+    if expected_id is not None:
+        if results and results[0] == expected_id:
+            verdict = f"PASS  -  {expected_id} at #1"
+        elif expected_id in results:
+            rank = results.index(expected_id) + 1
+            verdict = f"PARTIAL  -  {expected_id} at rank #{rank}"
+        else:
+            verdict = f"FAIL  -  {expected_id} NOT in top 10"
+        print(f"      verdict: {verdict}")
+    print(f"      latency: {elapsed_ms:.0f} ms  |  results: {len(results)}")
+    if not results:
+        print("      (no results returned)")
+    else:
+        for i, aid in enumerate(results, 1):
+            title = titles.get(aid, "(title unavailable)")
+            cat = categories.get(aid, "?")
+            if len(title) > 75:
+                title = title[:72] + "..."
+            marker = " *" if expected_id and aid == expected_id else "  "
+            print(f"  {i:2d}.{marker}{aid:14s} [{cat:20s}]  {title}")
+    # Compute topic diversity
+    unique_cats = set(categories.values()) - {"?"}
+    return {
+        "band": band,
+        "query": query,
+        "description": description,
+        "rewrite": rewrite if rewrite_fired else None,
+        "latency_ms": elapsed_ms,
+        "n_results": len(results),
+        "results": [
+            {"rank": i+1, "arxiv_id": aid, "title": titles.get(aid, ""),
+             "category": categories.get(aid, "?")}
+            for i, aid in enumerate(results)
+        ],
+        "expected_id": expected_id,
+        "expected_found": expected_id in results if expected_id else None,
+        "expected_rank": results.index(expected_id) + 1 if expected_id and expected_id in results else None,
+        "topic_diversity": len(unique_cats),
+    }
+async def main():
+    print("=" * 100)
+    print("EXPANDED SEARCH EVALUATION  -  Realistic User Queries")
+    print(f"Total queries: {len(QUERIES)}  |  Bands: {sorted(set(b for b,_,_,_ in QUERIES))}")
+    print("=" * 100)
+    # Warm-up
+    print("\nWarming up BGE-M3 + Turso...")
+    t0 = time.perf_counter()
+    embed_svc.encode_query("warmup query for the eval harness")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+    print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")
+    all_results: list[dict] = []
+    band_results: dict[str, list[dict]] = {}
+    for band, query, expected, description in QUERIES:
+        result = await eval_query(band, query, expected, description)
+        all_results.append(result)
+        band_results.setdefault(band, []).append(result)
+    # ── Summary ──────────────────────────────────────────────────────────────
+    print("\n" + "=" * 100)
+    print("SUMMARY")
+    print("=" * 100)
+    # Band A: known-item hit rate
+    if "A" in band_results:
+        a_rows = band_results["A"]
+        hits = sum(1 for r in a_rows if r["expected_rank"] == 1)
+        total = len(a_rows)
+        print(f"\nBand A (known-item): {hits}/{total} top-1 hits")
+    # Per-band stats
+    print("\nPer-Band Results:")
+    print(f"  {'Band':<6} {'Queries':>7}  {'Avg Latency':>12}  {'Avg Results':>12}  {'Avg Topics':>11}  Description")
+    print(f"  {'-'*6} {'-'*7}  {'-'*12}  {'-'*12}  {'-'*11}  {'-'*40}")
+    band_labels = {
+        "A": "Known-item titles",
+        "F": "Beginner / Newcomer",
+        "G": "Research-in-Progress",
+        "H": "Implementation-Focused",
+        "I": "Comparative / Survey",
+        "J": "Emerging / Cutting-Edge",
+        "K": "Cross-Domain",
+        "L": "Vague / Exploratory",
+        "M": "Follow-up / Refinement",
+    }
+    for band in sorted(band_results.keys()):
+        rows = band_results[band]
+        n = len(rows)
+        avg_lat = sum(r["latency_ms"] for r in rows) / n
+        avg_res = sum(r["n_results"] for r in rows) / n
+        avg_div = sum(r["topic_diversity"] for r in rows) / n
+        label = band_labels.get(band, "")
+        print(f"  {band:<6} {n:>7}  {avg_lat:>10.0f}ms  {avg_res:>12.1f}  {avg_div:>11.1f}  {label}")
+    # Overall latency
+    all_lat = [r["latency_ms"] for r in all_results]
+    all_lat.sort()
+    n = len(all_lat)
+    p50 = all_lat[n // 2]
+    p95 = all_lat[max(0, -(-n * 95 // 100) - 1)]  # nearest-rank: index ceil(0.95 * n) - 1
+    print(f"\nOverall Latency (n={n}): mean {sum(all_lat)/n:.0f} ms  "
+          f"p50 {p50:.0f} ms  p95 {p95:.0f} ms  max {max(all_lat):.0f} ms")
+    # Rewrite analysis
+    rewrites = [(r["query"], r["rewrite"]) for r in all_results if r["rewrite"]]
+    skips = [r["query"] for r in all_results if not r["rewrite"]]
+    print(f"\nGroq Rewriter: {len(rewrites)} fired, {len(skips)} skipped")
+    # Zero-result queries
+    zeros = [r["query"] for r in all_results if r["n_results"] == 0]
+    if zeros:
+        print(f"\nWARNING: ZERO RESULTS ({len(zeros)}):")
+        for q in zeros:
+            print(f"  - {q!r}")
+    else:
+        print("\nOK: All queries returned results")
+    # Save JSON for comparison
+    out_path = Path(__file__).parent / "expanded_eval_results.json"
+    with open(out_path, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to: {out_path}")
+if __name__ == "__main__":
+    asyncio.run(main())
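
Note on the latency summary above: the p50/p95 figures are taken by indexing into the sorted latency list. A minimal sketch of one common convention, the nearest-rank percentile, is shown below for comparison. The helper name `nearest_rank_percentile` and the sample latencies are hypothetical and are not part of the commit; the script's own index arithmetic may follow a slightly different convention (for example, its p50 picks the upper median for even n).

```python
import math


def nearest_rank_percentile(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples, for pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value such that at least pct% of the
    # samples are less than or equal to it (1-based rank = ceil(pct/100 * n)).
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, for illustration only.
latencies_ms = [120.0, 95.0, 310.0, 150.0, 101.0, 99.0, 480.0, 130.0, 140.0, 110.0]
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
print(f"p50 {p50:.0f} ms  p95 {p95:.0f} ms")
```

With only a few dozen queries, p95 is effectively one of the two or three slowest runs, so it is worth reading alongside the max rather than on its own.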

scripts/eval_recs_quality.py ADDED

	@@ -0,0 +1,547 @@

+"""
+Recommendation engine evaluation harness.
+Bypasses HTTP and calls the same pipeline functions the router uses,
+with full DB setup/cleanup per scenario. Each scenario probes a specific
+behavior (which tier fired, how many clusters formed, whether suppression
+removed disliked categories, etc.) rather than just "did we get results."
+Run:  python scripts/eval_recs_quality.py
+"""
+from __future__ import annotations
+import asyncio
+import sys
+import time
+import uuid
+from collections import Counter
+from pathlib import Path
+import numpy as np
+import aiosqlite
+# Force UTF-8 stdout so unicode glyphs (≥, →, etc.) don't crash on Windows cp1252
+if hasattr(sys.stdout, "reconfigure"):
+    sys.stdout.reconfigure(encoding="utf-8")
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import qdrant_svc, db, turso_svc, user_state as us
+from app.config import REC_LIMIT, DB_PATH
+from app.recommend import profiles
+from app.recommend.clustering import (
+    compute_clusters, MIN_PAPERS_FOR_CLUSTERING,
+)
+from app.routers.recommendations import (
+    _multi_interest_recommend, _ewma_recommend,
+)
+# ── Curated paper ids (verified-famous papers in each domain) ────────────────
+NLP_PAPERS = [
+    ("1706.03762", "Attention Is All You Need"),
+    ("1810.04805", "BERT"),
+    ("2005.14165", "GPT-3"),
+    ("1907.11692", "RoBERTa"),
+    ("1910.10683", "T5"),
+    ("2203.02155", "InstructGPT"),
+    ("2201.11903", "CoT Prompting"),
+    ("2307.09288", "Llama 2"),
+]
+CV_PAPERS = [
+    ("1512.03385", "ResNet"),
+    ("2010.11929", "Vision Transformer"),
+    ("1409.1556",  "VGG"),
+    ("1505.04597", "U-Net"),
+    ("2103.14030", "Swin Transformer"),
+    ("2104.14294", "DINO"),
+    ("2112.10752", "Latent Diffusion"),
+    ("1311.2524",  "R-CNN"),
+]
+ML_THEORY_PAPERS = [
+    # cs.LG / stat.ML — used for negative-suppression test
+    ("1607.06450", "Layer Normalization"),
+    ("1502.03167", "Batch Normalization"),
+    ("1412.6980",  "Adam optimizer"),
+    ("1411.1784",  "Conditional GAN"),
+]
+# ── User setup / teardown helpers ────────────────────────────────────────────
+async def setup_user(
+    user_id: str,
+    save_ids: list[str],
+    dismiss_ids: list[str] | None = None,
+    onboarding_categories: list[str] | None = None,
+) -> object:
+    """Build a test user from scratch: saves, dismisses, EWMA, in-memory state."""
+    dismiss_ids = dismiss_ids or []
+    if onboarding_categories:
+        await db.save_onboarding_categories(user_id, onboarding_categories)
+    # Pre-fetch all vectors in one batch
+    all_ids = save_ids + dismiss_ids
+    vecs = await qdrant_svc.get_paper_vectors(all_ids) if all_ids else {}
+    # Cache metadata so category suppression / display work
+    if all_ids:
+        meta = await turso_svc.fetch_metadata_batch(all_ids)
+        if meta:
+            await db.cache_turso_metadata_batch(list(meta.values()))
+    state = await us.ensure_loaded(user_id)
+    for pid in save_ids:
+        if pid not in vecs:
+            print(f"  [setup] WARNING: {pid} not in Qdrant; skipping")
+            continue
+        state.add_positive(pid)
+        emb = np.array(vecs[pid], dtype=np.float32)
+        await profiles.update_on_save(user_id, emb)
+        await db.log_interaction(user_id, pid, "save")
+    for pid in dismiss_ids:
+        if pid not in vecs:
+            continue
+        state.add_negative(pid)
+        emb = np.array(vecs[pid], dtype=np.float32)
+        await profiles.update_on_dismiss(user_id, emb)
+        await db.log_interaction(user_id, pid, "not_interested")
+    return state
+async def cleanup_user(user_id: str) -> None:
+    """Wipe all DB rows + in-memory cache for a test user."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        for sql in [
+            "DELETE FROM interactions WHERE user_id = ?",
+            "DELETE FROM user_profiles WHERE user_id = ?",
+            "DELETE FROM user_clusters WHERE user_id = ?",
+            "DELETE FROM user_onboarding WHERE user_id = ?",
+            "DELETE FROM cluster_snapshots WHERE user_id = ?",
+        ]:
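+            try:
+                await conn.execute(sql, (user_id,))
+            except Exception:
+                # Best-effort cleanup: ignore failures (e.g. a table missing
+                # from this DB) so the remaining deletes still run.
+                pass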
+        await conn.commit()
+    if user_id in us._cache:
+        del us._cache[user_id]
+# ── Pipeline runner (mirrors get_recommendations() cascade) ──────────────────
+async def run_pipeline(user_id: str, state) -> tuple[str, list[str], dict, float]:
+    """Returns (tier_label, rec_ids, paper_tags, latency_ms)."""
+    seen = us.all_seen(user_id)
+    n_saves = len(state.positive_list)
+    t0 = time.perf_counter()
+    # Tier 0: cold-start (no saves) → trending by category
+    if n_saves == 0:
+        cat_filter = await db.get_user_category_filter(user_id)
+        if cat_filter:
+            trending = await turso_svc.fetch_trending_by_categories(
+                cat_filter, limit=REC_LIMIT,
+            )
+            elapsed = (time.perf_counter() - t0) * 1000
+            return ("Tier 0 trending",
+                    [t["arxiv_id"] for t in trending],
+                    {}, elapsed)
+        elapsed = (time.perf_counter() - t0) * 1000
+        return ("EMPTY (no onboarding)", [], {}, elapsed)
+    # Tier 1: ≥5 saves → multi-interest clustering + quota
+    if n_saves >= MIN_PAPERS_FOR_CLUSTERING:
+        rec_ids, paper_tags = await _multi_interest_recommend(
+            user_id, state, seen, REC_LIMIT, query_id="eval-test",
+        )
+        if rec_ids:
+            elapsed = (time.perf_counter() - t0) * 1000
+            return ("Tier 1 multi-interest", rec_ids, paper_tags, elapsed)
+    # Tier 2: ≥3 saves (EWMA threshold internally) → single-vector search
+    rec_ids = await _ewma_recommend(user_id, seen, REC_LIMIT)
+    if rec_ids:
+        elapsed = (time.perf_counter() - t0) * 1000
+        return ("Tier 2 EWMA", rec_ids, {}, elapsed)
+    # Tier 3: ≥1 save → Qdrant Recommend with raw IDs
+    rec_ids = await qdrant_svc.recommend(
+        positive_arxiv_ids=state.positive_list,
+        negative_arxiv_ids=state.negative_list,
+        seen_arxiv_ids=seen,
+        limit=REC_LIMIT,
+    )
+    elapsed = (time.perf_counter() - t0) * 1000
+    if rec_ids:
+        return ("Tier 3 Qdrant Recommend", rec_ids, {}, elapsed)
+    return ("EMPTY (all tiers exhausted)", [], {}, elapsed)
+async def report_results(rec_ids: list[str], paper_tags: dict) -> tuple[Counter, Counter]:
+    """Print top-10 with category and cluster origin. Return (cat_counts, source_counts)."""
+    if not rec_ids:
+        print("    (no results)")
+        return Counter(), Counter()
+    meta = await turso_svc.fetch_metadata_batch(rec_ids)
+    cats: Counter = Counter()
+    sources: Counter = Counter()
+    for i, aid in enumerate(rec_ids, 1):
+        m = meta.get(aid, {})
+        title = m.get("title", "(no title)")
+        if len(title) > 65:
+            title = title[:62] + "..."
+        cat = m.get("category", "?")
+        cats[cat] += 1
+        tag = paper_tags.get(aid, {}) if paper_tags else {}
+        source = tag.get("candidate_source", "")
+        if source:  # skip untagged papers so the counter has no empty-string key
+            sources[source] += 1
+        src_short = f"  [{source}]" if source else ""
+        print(f"    {i:2d}. {aid:13s} {cat:14s}  {title}{src_short}")
+    return cats, sources
+# ── Scenarios ────────────────────────────────────────────────────────────────
+async def scenario_1_cold_with_onboarding():
+    """Tier 0: zero saves, NLP categories selected during onboarding."""
+    user_id = f"eval-recs-1-{uuid.uuid4().hex[:6]}"
+    print("\n" + "=" * 100)
+    print("S1  Cold-start with onboarding categories (NLP)")
+    print("    Expect: Tier 0 trending; results in NLP-adjacent friendly categories")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=[], onboarding_categories=["nlp"])
+        state = await us.ensure_loaded(user_id)
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, _ = await report_results(rec_ids, tags)
+        nlp_count = sum(
+            c for k, c in cats.items()
+            if k in {"AI/ML", "NLP/Computational Linguistics"} or k.startswith("cs.CL")
+        )
+        verdict = "PASS" if tier.startswith("Tier 0") and len(rec_ids) >= 5 else \
+                  "FAIL  (Tier 0 did not fire or returned fewer than 5 results)"
+        print(f"    Categories: {dict(cats)}  -->  NLP count: {nlp_count}/{len(rec_ids)}")
+        print(f"    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_2_single_save():
+    """Tier 3: 1 save, expect Qdrant Recommend nearest-neighbors."""
+    user_id = f"eval-recs-2-{uuid.uuid4().hex[:6]}"
+    print("\n" + "=" * 100)
+    print("S2  Single save (Vaswani Attention)")
+    print("    Expect: Tier 3 Qdrant Recommend; results semantically near saved paper")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=["1706.03762"])
+        state = await us.ensure_loaded(user_id)
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, _ = await report_results(rec_ids, tags)
+        ml_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
+        verdict = "PASS" if tier.startswith("Tier 3") and ml_count >= 6 else "PARTIAL"
+        print(f"    Categories: {dict(cats)}  -->  AI/ML + NLP count: {ml_count}/{len(rec_ids)}")
+        print(f"    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_3_three_nlp_saves():
+    """Tier 2: 3 same-domain saves, expect EWMA single-vector search."""
+    user_id = f"eval-recs-3-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:3]]
+    print("\n" + "=" * 100)
+    print("S3  Three NLP saves")
+    print(f"    Saved: {save_ids}")
+    print("    Expect: Tier 2 EWMA single-vector; results NLP-coherent")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, _ = await report_results(rec_ids, tags)
+        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
+        verdict = "PASS" if tier.startswith("Tier 2") and nlp_count >= 7 else "PARTIAL"
+        print(f"    Categories: {dict(cats)}  -->  AI/ML + NLP count: {nlp_count}/{len(rec_ids)}")
+        print(f"    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_4_five_nlp_saves_single_cluster():
+    """Tier 1, single interest: expect K=1 cluster, NLP-only output."""
+    user_id = f"eval-recs-4-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
+    print("\n" + "=" * 100)
+    print("S4  Five NLP saves (single interest)")
+    print(f"    Saved: {save_ids}")
+    print("    Expect: Tier 1; 1 or few clusters; ML/NLP-coherent output")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        # Inspect clusters explicitly
+        vecs = await qdrant_svc.get_paper_vectors(save_ids)
+        embs = np.array([vecs[p] for p in save_ids if p in vecs], dtype=np.float32)
+        clusters = compute_clusters([p for p in save_ids if p in vecs], embs)
+        print(f"    Clusters formed: K={len(clusters)}")
+        for c in clusters:
+            print(f"      cluster {c.cluster_idx}: medoid={c.medoid_paper_id}  importance={c.importance:.3f}  size={len(c.paper_ids)}")
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, _ = await report_results(rec_ids, tags)
+        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
+        verdict = "PASS" if tier.startswith("Tier 1") and nlp_count >= 7 else "PARTIAL"
+        print(f"    Categories: {dict(cats)}  -->  AI/ML + NLP count: {nlp_count}/{len(rec_ids)}")
+        print(f"    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_5_multi_interest_balanced():
+    """Tier 1, the headline test for quota fusion."""
+    user_id = f"eval-recs-5-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
+    print("\n" + "=" * 100)
+    print("S5  Multi-interest (5 NLP + 5 CV)  -- THE HEADLINE QUOTA TEST")
+    print("    Saved: 5x NLP + 5x CV")
+    print("    Expect: K>=2 clusters, both interests visible, neither cluster swamps")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        # Inspect clusters
+        vecs = await qdrant_svc.get_paper_vectors(save_ids)
+        aligned = [p for p in save_ids if p in vecs]
+        embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
+        clusters = compute_clusters(aligned, embs)
+        print(f"    Clusters formed: K={len(clusters)}")
+        for c in clusters:
+            print(f"      cluster {c.cluster_idx}: medoid={c.medoid_paper_id}  importance={c.importance:.3f}  size={len(c.paper_ids)}")
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, sources = await report_results(rec_ids, tags)
+        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
+        cv_count  = sum(c for k, c in cats.items() if k == "Computer Vision")
+        print(f"    NLP (AI/ML + NLP): {nlp_count}   CV (Computer Vision): {cv_count}")
+        print(f"    Cluster origin counts: {dict(sources)}")
+        smaller = min(nlp_count, cv_count)  # 0 if either interest is missing
+        verdict = "PASS" if len(clusters) >= 2 and smaller >= 3 else "FAIL"
+        print(f"    VERDICT: {verdict}  (floor=3 enforced: {smaller >= 3})")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_6_multi_interest_imbalanced():
+    """Tier 1: imbalanced split — does the floor=3 rescue the minority?"""
+    user_id = f"eval-recs-6-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:8]] + [pid for pid, _ in CV_PAPERS[:2]]
+    print("\n" + "=" * 100)
+    print("S6  Multi-interest imbalanced (8 NLP + 2 CV)  -- FLOOR TEST")
+    print("    Expect: if K>=2, CV gets >=3 slots even though importance is ~80/20")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        vecs = await qdrant_svc.get_paper_vectors(save_ids)
+        aligned = [p for p in save_ids if p in vecs]
+        embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
+        clusters = compute_clusters(aligned, embs)
+        print(f"    Clusters formed: K={len(clusters)}")
+        for c in clusters:
+            print(f"      cluster {c.cluster_idx}: medoid={c.medoid_paper_id}  importance={c.importance:.3f}  size={len(c.paper_ids)}")
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, sources = await report_results(rec_ids, tags)
+        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
+        cv_count  = sum(c for k, c in cats.items() if k == "Computer Vision")
+        print(f"    NLP: {nlp_count}   CV: {cv_count}   Cluster sources: {dict(sources)}")
+        if len(clusters) >= 2:
+            verdict = "PASS" if cv_count >= 3 else "FAIL  (floor not enforced)"
+        else:
+            verdict = "AMBIGUOUS  (only 1 cluster formed - floor doesn't apply)"
+        print(f"    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_7_category_suppression():
+    """Tier 1 with dismissals: 'Computer Vision' should be suppressed."""
+    # Save 5 NLP, dismiss 3 CV — non-overlapping friendly categories
+    user_id = f"eval-recs-7-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
+    dismiss_ids = [pid for pid, _ in CV_PAPERS[:3]]
+    print("\n" + "=" * 100)
+    print("S7  Category suppression (5 NLP saves + 3 CV dismissals)")
+    print("    Expect: 'Computer Vision' suppressed; zero CV papers in output")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids, dismiss_ids=dismiss_ids)
+        state = await us.ensure_loaded(user_id)
+        suppressed = await db.get_suppressed_categories(user_id)
+        print(f"    Suppressed categories detected: {suppressed}")
+        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
+        print(f"    Tier: {tier}  ({lat:.0f} ms)  Returned: {len(rec_ids)}")
+        cats, _ = await report_results(rec_ids, tags)
+        cv_count = cats.get("Computer Vision", 0)
+        verdict = "PASS" if cv_count == 0 and "Computer Vision" in suppressed else \
+                  "FAIL  (CV leaked through)" if cv_count > 0 else \
+                  "PARTIAL  (no CV but suppression set empty)"
+        print(f"    CV count in output: {cv_count}    VERDICT: {verdict}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_8_hungarian_stability():
+    """Cluster IDs should remain stable across reclusterings when one new save is added."""
+    user_id = f"eval-recs-8-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
+    new_save = NLP_PAPERS[5][0]   # 6th NLP paper (added later)
+    print("\n" + "=" * 100)
+    print("S8  Hungarian cluster-ID stability")
+    print("    Run pipeline once -> save 1 more NLP paper -> run again")
+    print("    Expect: same cluster_idx assigned to NLP cluster across runs")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        # First run
+        await run_pipeline(user_id, state)
+        clusters_v1 = await db.get_user_clusters(user_id)
+        v1 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v1}
+        print(f"    After run 1: {sorted(v1)}")
+        # Add one more NLP paper
+        more_vecs = await qdrant_svc.get_paper_vectors([new_save])
+        if new_save in more_vecs:
+            state.add_positive(new_save)
+            await profiles.update_on_save(user_id, np.array(more_vecs[new_save], dtype=np.float32))
+            await db.log_interaction(user_id, new_save, "save")
+        # Second run
+        await run_pipeline(user_id, state)
+        clusters_v2 = await db.get_user_clusters(user_id)
+        v2 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v2}
+        print(f"    After run 2: {sorted(v2)}")
+        # Stability check: every cluster_idx from v1 must still exist in v2 (the medoid may change, the idx must not)
+        idx_v1 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v1}
+        idx_v2 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v2}
+        # All cluster_idx that existed in v1 should still exist in v2
+        stable = all(k in idx_v2 for k in idx_v1)
+        print(f"    Cluster IDs in v1: {sorted(idx_v1.keys())}   v2: {sorted(idx_v2.keys())}")
+        print(f"    VERDICT: {'PASS  (all v1 cluster_idx preserved)' if stable else 'FAIL  (cluster_idx churned)'}")
+    finally:
+        await cleanup_user(user_id)
+async def scenario_9_latency():
+    """Latency sanity: full Tier 1 pipeline on 10 saved papers."""
+    user_id = f"eval-recs-9-{uuid.uuid4().hex[:6]}"
+    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
+    print("\n" + "=" * 100)
+    print("S9  Latency sanity (Tier 1, 10 saved papers)")
+    print("    Expect: <30 ms compute (excluding metadata I/O); end-to-end <2s")
+    print("=" * 100)
+    try:
+        await setup_user(user_id, save_ids=save_ids)
+        state = await us.ensure_loaded(user_id)
+        # Warm: run once to load profiles
+        await run_pipeline(user_id, state)
+        # Time multiple runs
+        runs = []
+        for i in range(3):
+            tier, _, _, lat = await run_pipeline(user_id, state)
+            runs.append(lat)
+            print(f"    Run {i+1}: {tier}  {lat:.0f} ms")
+        print(f"    Mean: {sum(runs)/len(runs):.0f} ms   Min: {min(runs):.0f} ms   Max: {max(runs):.0f} ms")
+        # The 30ms compute target excludes Qdrant + Turso I/O — full e2e includes them
+        e2e_pass = max(runs) < 2000
+        print(f"    VERDICT: {'PASS (e2e <2s)' if e2e_pass else 'PARTIAL  (over 2s e2e — investigate)'}")
+    finally:
+        await cleanup_user(user_id)
+# ── Pre-flight + main ────────────────────────────────────────────────────────
+async def preflight():
+    """Verify all curated paper IDs exist in Qdrant before running."""
+    all_ids = [p[0] for p in NLP_PAPERS + CV_PAPERS + ML_THEORY_PAPERS]
+    vecs = await qdrant_svc.get_paper_vectors(all_ids)
+    missing = [pid for pid in all_ids if pid not in vecs]
+    if missing:
+        print(f"WARNING: {len(missing)} curated IDs not in Qdrant: {missing}")
+        print("Some scenarios may produce skewed results.")
+    else:
+        print(f"Pre-flight: all {len(all_ids)} curated paper IDs present in Qdrant.")
+async def wipe_all_eval_users():
+    """Belt-and-braces cleanup: remove any eval-recs-* users left from crashes."""
+    async with aiosqlite.connect(DB_PATH) as conn:
+        for tbl in ["interactions", "user_profiles", "user_clusters",
+                    "user_onboarding", "cluster_snapshots"]:
+            try:
+                await conn.execute(f"DELETE FROM {tbl} WHERE user_id LIKE ?", ("eval-recs-%",))
+            except Exception:
+                pass
+        await conn.commit()
+async def main():
+    print("=" * 100)
+    print("RECOMMENDATION ENGINE EVALUATION")
+    print("=" * 100)
+    await db.init_db()
+    await wipe_all_eval_users()
+    await preflight()
+    scenarios = [
+        scenario_1_cold_with_onboarding,
+        scenario_2_single_save,
+        scenario_3_three_nlp_saves,
+        scenario_4_five_nlp_saves_single_cluster,
+        scenario_5_multi_interest_balanced,
+        scenario_6_multi_interest_imbalanced,
+        scenario_7_category_suppression,
+        scenario_8_hungarian_stability,
+        scenario_9_latency,
+    ]
+    for s in scenarios:
+        try:
+            await s()
+        except Exception as e:
+            import traceback
+            print(f"  SCENARIO ERROR: {e}")
+            traceback.print_exc()
+    # Final safety wipe in case any cleanup_user calls failed
+    await wipe_all_eval_users()
+    print("\n" + "=" * 100)
+    print("DONE — all eval-recs-* users wiped from DB")
+    print("=" * 100)
+if __name__ == "__main__":
+    asyncio.run(main())

scripts/eval_search_quality.py ADDED

	@@ -0,0 +1,197 @@

+"""
+Search quality evaluation harness.
+For each curated query, runs the hybrid search pipeline end-to-end
+(rewrite -> encode -> dense+sparse -> RRF -> title-boost) and prints the
+top 10 results with titles fetched from Turso. For known-item queries,
+flags whether the expected paper landed at #1.
+This is a HUMAN-JUDGMENT report, not a pass/fail test. The output is
+designed to be read top-to-bottom and rated query by query.
+Run:  python scripts/eval_search_quality.py
+"""
+from __future__ import annotations
+import asyncio
+import sys
+import time
+from pathlib import Path
+# Make the project root importable when run as `python scripts/eval_search_quality.py`
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import hybrid_search_svc
+from app import turso_svc
+from app import embed_svc
+from app import groq_svc
+# (band, query, expected_arxiv_id_or_None)
+QUERIES: list[tuple[str, str, str | None]] = [
+    # ── Band A: known-item title queries ──────────────────────────────────
+    # The right answer is unambiguous. Top-1 hit is the bar.
+    ("A", "attention is all you need", "1706.03762"),
+    ("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
+    ("A", "Adam: A Method for Stochastic Optimization", "1412.6980"),
+    ("A", "Language Models are Few-Shot Learners", "2005.14165"),
+    ("A", "Deep Residual Learning for Image Recognition", "1512.03385"),
+    # ── Band B: conceptual semantic queries ───────────────────────────────
+    # No exact keyword match; tests whether dense retrieval rescues meaning.
+    ("B", "when AI makes up fake facts", None),
+    ("B", "making language models follow human preferences", None),
+    ("B", "why deep networks generalize despite overparameterization", None),
+    ("B", "finding similar papers using vector embeddings", None),
+    ("B", "models that pretend to be aligned but aren't", None),
+    # ── Band C: keyword-academic queries ──────────────────────────────────
+    # Already in academic form; rewriter heuristic should skip these.
+    ("C", "BGE-M3 multilingual dense retrieval", None),
+    ("C", "Mamba state space model linear time", None),
+    ("C", "chain of thought prompting", None),
+    ("C", "FlashAttention IO-aware exact attention", None),
+    # ── Band D: adversarial / edge cases ──────────────────────────────────
+    ("D", "transformr", None),                                          # typo
+    ("D", "GPT", None),                                                 # very short
+    ("D", "bayesian deep learning monte carlo dropout uncertainty estimation", None),  # very long
+    ("D", "applying CV to medical imaging", None),                      # cross-domain (CV->medical)
+    ("D", "attention", None),                                           # single ambiguous word
+    # ── Band E: recency-sensitive queries ─────────────────────────────────
+    # Recency rerank was removed; verify recent work still surfaces.
+    ("E", "Llama 3", None),
+    ("E", "reasoning models 2024", None),
+]
+# ── Wire a thin wrapper around groq_svc.rewrite to capture what fired ────
+_rewrite_log: dict[str, str] = {}
+_original_rewrite = groq_svc.rewrite
+async def _logging_rewrite(q: str) -> str:
+    r = await _original_rewrite(q)
+    _rewrite_log[q] = r
+    return r
+groq_svc.rewrite = _logging_rewrite
+async def eval_query(
+    band: str, query: str, expected_id: str | None
+) -> tuple[list[str], float]:
+    """Run one query end-to-end and print a formatted report."""
+    t0 = time.perf_counter()
+    results = await hybrid_search_svc.search(query, limit=10)
+    elapsed_ms = (time.perf_counter() - t0) * 1000
+    rewrite = _rewrite_log.get(query, query)
+    rewrite_fired = rewrite.strip() != query.strip()
+    titles: dict[str, str] = {}
+    if results:
+        meta = await turso_svc.fetch_metadata_batch(results)
+        titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}
+    # ── Header ──────────────────────────────────────────────────────────────
+    print()
+    print(f"[{band}] {query!r}")
+    if rewrite_fired:
+        print(f"      rewrite: {rewrite!r}")
+    else:
+        print("      rewrite: (heuristic skipped or no change)")
+    if expected_id is not None:
+        if results and results[0] == expected_id:
+            verdict = f"PASS  -  {expected_id} at #1"
+        elif expected_id in results:
+            rank = results.index(expected_id) + 1
+            verdict = f"PARTIAL  -  {expected_id} at rank #{rank}"
+        else:
+            verdict = f"FAIL  -  {expected_id} NOT in top 10"
+        print(f"      verdict: {verdict}")
+    print(f"      latency: {elapsed_ms:.0f} ms  |  results: {len(results)}")
+    if not results:
+        print("      (no results returned)")
+        return results, elapsed_ms
+    for i, aid in enumerate(results, 1):
+        title = titles.get(aid, "(title unavailable)")
+        if len(title) > 88:
+            title = title[:85] + "..."
+        marker = " *" if expected_id and aid == expected_id else "  "
+        print(f"  {i:2d}.{marker}{aid:13s} {title}")
+    return results, elapsed_ms
+async def main():
+    print("=" * 100)
+    print("SEARCH QUALITY EVALUATION  -  ResearchIT hybrid search pipeline")
+    print("=" * 100)
+    # ── Warm-up ─────────────────────────────────────────────────────────────
+    # First BGE-M3 encode is ~10-15s cold. Warm before timing anything.
+    print("\nWarming up BGE-M3 + Turso...")
+    t0 = time.perf_counter()
+    embed_svc.encode_query("warmup query for the eval harness")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+    print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")
+    band_results: dict[str, list[tuple[str, str | None, list[str], float]]] = {}
+    for band, query, expected in QUERIES:
+        results, latency = await eval_query(band, query, expected)
+        band_results.setdefault(band, []).append((query, expected, results, latency))
+    # ── Summary ─────────────────────────────────────────────────────────────
+    print("\n" + "=" * 100)
+    print("SUMMARY")
+    print("=" * 100)
+    # Band A: top-1 hit rate
+    if "A" in band_results:
+        a_rows = band_results["A"]
+        hits = sum(1 for _, exp, res, _ in a_rows if res and res[0] == exp)
+        partial = sum(
+            1 for _, exp, res, _ in a_rows
+            if exp in (res or []) and (not res or res[0] != exp)
+        )
+        misses = len(a_rows) - hits - partial
+        print(f"\nBand A (known-item titles): {hits}/{len(a_rows)} top-1 hits, "
+              f"{partial} partial (in top 10 but not #1), {misses} miss")
+        for q, exp, res, _ in a_rows:
+            if res and res[0] == exp:
+                tag = "PASS"
+            elif exp in (res or []):
+                tag = f"PARTIAL #{res.index(exp)+1}"
+            else:
+                tag = "MISS"
+            qshort = q if len(q) <= 60 else q[:57] + "..."
+            print(f"  [{tag:10s}] {exp:14s} {qshort}")
+    # Latency stats
+    all_lat = [lat for rows in band_results.values() for *_, lat in rows]
+    if all_lat:
+        all_lat.sort()
+        n = len(all_lat)
+        p50 = all_lat[n // 2]
+        # Nearest-rank p95: 1-based rank ceil(0.95 * n), computed with integer math
+        p95 = all_lat[max(0, (n * 95 + 99) // 100 - 1)]
+        print(f"\nLatency (n={n}): mean {sum(all_lat)/n:.0f} ms  "
+              f"p50 {p50:.0f} ms  p95 {p95:.0f} ms  "
+              f"max {max(all_lat):.0f} ms")
+    # Per-band coverage (how often did we get any results?)
+    print("\nResults coverage by band:")
+    for band, rows in sorted(band_results.items()):
+        empty = sum(1 for _, _, res, _ in rows if not res)
+        print(f"  Band {band}: {len(rows) - empty}/{len(rows)} returned results")
+if __name__ == "__main__":
+    asyncio.run(main())
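Note on the RRF stage named in the harness docstring above: the project's own fusion helper is not shown in this file, so the following is only a minimal, generic sketch of Reciprocal Rank Fusion. The function name `rrf_fuse`, the default `k=60`, and the sample ID lists are illustrative assumptions, not the repository's implementation.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several best-first ID lists with Reciprocal Rank Fusion.

    Hypothetical helper: each list contributes 1 / (k + rank) per ID,
    with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative inputs: one dense ranking and one sparse ranking.
dense = ["1706.03762", "1810.04805", "2005.14165"]
sparse = ["1810.04805", "1412.6980", "1706.03762"]
fused = rrf_fuse([dense, sparse])
fused_ids = [doc_id for doc_id, _ in fused]
```

An ID that appears near the top of both lists outranks one that appears high in only a single list, which is why fusion can rescue a known-item query when only one retriever matches it well. A smaller `k` widens the score gap between adjacent ranks.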

scripts/expanded_eval_results.json ADDED

The diff for this file is too large to render. See raw diff

scripts/profile_pipelines.py ADDED

	@@ -0,0 +1,410 @@

+"""
+Stage-by-stage profiler for the search and recommendation pipelines.
+Mirrors the production paths (hybrid_search_svc.search and
+_multi_interest_recommend) with explicit timers between every stage,
+so we can see where the time actually goes.
+Run: python scripts/profile_pipelines.py
+"""
+from __future__ import annotations
+import asyncio
+import sys
+import time
+import uuid
+from contextlib import contextmanager
+from pathlib import Path
+import numpy as np
+if hasattr(sys.stdout, "reconfigure"):
+    sys.stdout.reconfigure(encoding="utf-8")
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import (
+    config, embed_svc, qdrant_svc, zilliz_svc, groq_svc, turso_svc,
+    db, user_state as us,
+)
+from app.recommend import profiles
+from app.recommend.clustering import (
+    compute_clusters, stabilize_cluster_ids, save_clusters_to_db,
+    load_clusters_from_db, MIN_PAPERS_FOR_CLUSTERING, InterestCluster,
+)
+from app.recommend.fusion import allocate_quotas, merge_quota_results
+from app.recommend.reranker import rerank_candidates
+from app.recommend.diversity import mmr_rerank, inject_exploration
+@contextmanager
+def stage(name: str, sink: list):
+    t0 = time.perf_counter()
+    try:
+        yield
+    finally:
+        # Record the timing even if the stage raises, so failed runs still appear
+        sink.append((name, (time.perf_counter() - t0) * 1000))
+def print_breakdown(label: str, timings: list[tuple[str, float]]):
+    total = sum(t for _, t in timings)
+    print(f"\n  --- {label} ---")
+    print(f"  {'Stage':<46s} {'ms':>10s}  {'%':>6s}")
+    print(f"  {'-'*46} {'-'*10}  {'-'*6}")
+    for name, t in timings:
+        pct = (100.0 * t / total) if total > 0 else 0.0
+        print(f"  {name:<46s} {t:>10.0f}  {pct:>5.1f}%")
+    print(f"  {'-'*46} {'-'*10}  {'-'*6}")
+    print(f"  {'TOTAL':<46s} {total:>10.0f}  {100.0:>5.1f}%")
+# ── Search pipeline profiler ─────────────────────────────────────────────────
+async def profile_search(query: str) -> list[tuple[str, float]]:
+    """Mirror hybrid_search_svc.search() with stage timers."""
+    timings: list[tuple[str, float]] = []
+    limit = 10
+    fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
+    # Stage 1: Groq rewrite
+    rewritten = query
+    with stage("1. Groq rewrite (LLM)", timings):
+        try:
+            rewritten = await groq_svc.rewrite(query)
+        except Exception:
+            rewritten = query
+    # Stage 2: BGE-M3 encode (original)
+    with stage("2a. BGE-M3 encode (original)", timings):
+        d_orig, s_orig = embed_svc.encode_query(query)
+    encodings = [(d_orig, s_orig)]
+    # Stage 2b: BGE-M3 encode (rewritten, if different)
+    if rewritten and rewritten != query:
+        with stage("2b. BGE-M3 encode (rewrite)", timings):
+            d_rw, s_rw = embed_svc.encode_query(rewritten)
+        encodings.append((d_rw, s_rw))
+    else:
+        timings.append(("2b. BGE-M3 encode (rewrite skipped)", 0.0))
+    # Stage 3: Parallel Qdrant + Zilliz searches
+    with stage(f"3. Parallel search ({len(encodings)*2} tasks)", timings):
+        tasks = []
+        for d, s in encodings:
+            tasks.append(qdrant_svc.search_dense(d.tolist(), limit=fetch_k))
+            tasks.append(zilliz_svc.search_sparse(s, limit=fetch_k))
+        raw = await asyncio.gather(*tasks, return_exceptions=True)
+    valid_lists = [r for r in raw if not isinstance(r, Exception) and r]
+    # Stage 4: RRF fusion (import first so module load time is not billed to this stage)
+    from app.hybrid_search_svc import _rrf_fuse_multi, _title_match_rerank
+    with stage("4. RRF fusion", timings):
+        fused = _rrf_fuse_multi(valid_lists, k=config.SEARCH_RRF_K)
+    # Stage 5: Title-boost (Turso fetch + scoring)
+    with stage("5. Title-match boost (Turso + score)", timings):
+        ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)
+    return timings
+# ── Recommendations Tier 1 pipeline profiler ─────────────────────────────────
+async def profile_recs_tier1(user_id: str, save_ids: list[str]) -> list[tuple[str, float]]:
+    """Mirror _multi_interest_recommend() with stage timers."""
+    timings: list[tuple[str, float]] = []
+    state = await us.ensure_loaded(user_id)
+    seen = us.all_seen(user_id)
+    REC_LIMIT = config.REC_LIMIT
+    OVERSAMPLE = 3
+    ST_SUPPLEMENT = 20
+    positives = state.positive_list
+    # 1. Fetch saved-paper vectors from Qdrant
+    with stage("1. Fetch saved-paper vectors (Qdrant)", timings):
+        vectors = await qdrant_svc.get_paper_vectors(positives)
+    aligned_ids = [pid for pid in positives if pid in vectors]
+    aligned_embs = np.array([vectors[pid] for pid in aligned_ids], dtype=np.float32)
+    # 2. Ward clustering (CPU)
+    with stage("2. Ward clustering (CPU)", timings):
+        clusters = compute_clusters(aligned_ids, aligned_embs)
+    # 3. Hungarian: load + match
+    with stage("3. Hungarian load+match (SQLite + numpy)", timings):
+        old_clusters_data = await load_clusters_from_db(user_id)
+        if old_clusters_data:
+            old_clusters = []
+            for row in old_clusters_data:
+                mpid = row["medoid_paper_id"]
+                if mpid in vectors:
+                    medoid_emb = np.array(vectors[mpid], dtype=np.float32)
+                elif row.get("medoid_embedding_blob") is not None:
+                    medoid_emb = np.frombuffer(
+                        row["medoid_embedding_blob"], dtype=np.float32
+                    ).copy()
+                else:
+                    continue
+                old_clusters.append(InterestCluster(
+                    cluster_idx=row["cluster_idx"],
+                    medoid_paper_id=mpid,
+                    medoid_embedding=medoid_emb,
+                    paper_ids=[],
+                    importance=row["importance"],
+                ))
+            if old_clusters:
+                clusters = stabilize_cluster_ids(clusters, old_clusters)
+    # 4. Save clusters + snapshot (SQLite writes)
+    with stage("4. Save clusters + snapshot (SQLite)", timings):
+        await save_clusters_to_db(user_id, clusters)
+        await db.save_cluster_snapshot(user_id, [
+            {
+                "cluster_idx": c.cluster_idx,
+                "medoid_paper_id": c.medoid_paper_id,
+                "importance": c.importance,
+                "paper_ids": c.paper_ids,
+                "medoid_embedding_blob": c.medoid_embedding.astype(np.float32).tobytes(),
+            }
+            for c in clusters
+        ])
+    # 5. Quota allocation (CPU)
+    with stage("5. Allocate quotas (CPU)", timings):
+        importances = [c.importance for c in clusters]
+        quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
+    # 6. Load short-term profile
+    with stage("6. Load short-term profile (SQLite)", timings):
+        st_vec = await profiles.load_profile(user_id, "short_term")
+    # 7. Per-cluster parallel ANN searches (no with_vectors — that path
+    # is 10x slower on Qdrant Cloud free tier; we cache vectors instead)
+    with stage(f"7. Per-cluster ANN searches (gather {len(clusters)})", timings):
+        search_tasks = [
+            qdrant_svc.search_by_vector_with_scores(
+                query_vector=c.medoid_embedding.tolist(),
+                limit=quota * OVERSAMPLE,
+                exclude_ids=seen,
+            )
+            for c, quota in zip(clusters, quotas)
+        ]
+        per_cluster_scored = await asyncio.gather(*search_tasks)
+    paper_cluster_map: dict[str, int] = {}
+    qdrant_score_map: dict[str, float] = {}
+    for cluster, scored in zip(clusters, per_cluster_scored):
+        for hit in scored:
+            aid = hit["arxiv_id"]
+            if aid not in paper_cluster_map:
+                paper_cluster_map[aid] = cluster.cluster_idx
+            if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
+                qdrant_score_map[aid] = float(hit["score"])
+    per_cluster_ids = [
+        [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
+    ]
+    candidate_ids = merge_quota_results(per_cluster_ids, quotas)
+    # 8. Short-term supplement search
+    with stage("8. Short-term supplement (Qdrant)", timings):
+        if st_vec is not None:
+            seen_so_far = seen | set(candidate_ids)
+            st_scored = await qdrant_svc.search_by_vector_with_scores(
+                query_vector=st_vec.tolist(),
+                limit=ST_SUPPLEMENT,
+                exclude_ids=seen_so_far,
+            )
+            for hit in st_scored:
+                aid = hit["arxiv_id"]
+                if aid not in set(candidate_ids):
+                    candidate_ids.append(aid)
+                    paper_cluster_map[aid] = -1
+                if aid not in qdrant_score_map:
+                    qdrant_score_map[aid] = float(hit["score"])
+    # 9. Fetch candidate vectors (LRU-cached by arxiv_id in qdrant_svc).
+    with stage(f"9. Fetch {len(candidate_ids)} candidate vectors (Qdrant, cached)", timings):
+        cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
+    # 10. Fetch candidate metadata from Turso (with cache)
+    with stage(f"10. Fetch {len(candidate_ids)} candidate metadata (Turso)", timings):
+        cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
+    # 11. Cache metadata to SQLite
+    with stage("11. Cache Turso metadata to SQLite", timings):
+        await db.cache_turso_metadata_batch(list(cand_meta.values()))
+    valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
+    valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
+    valid_meta = [cand_meta[cid] for cid in valid_ids]
+    # 12. Load profiles (long-term, negative)
+    with stage("12. Load long-term + negative profiles (SQLite)", timings):
+        lt_vec = await profiles.load_profile(user_id, "long_term")
+        neg_vec = await profiles.load_profile(user_id, "negative")
+    # 13. SQLite reads (suppression + onboarding)
+    with stage("13. Suppression + onboarding lookup (SQLite)", timings):
+        suppressed = await db.get_suppressed_categories(user_id)
+        onboarding_categories = await db.get_user_category_filter(user_id)
+    # 14. Build feature arrays (CPU)
+    with stage("14. Build per-candidate feature arrays (CPU)", timings):
+        user_total_saves = len(state.positive_list)
+        user_total_dismissals = len(state.negative_list)
+        qdrant_scores = np.asarray(
+            [qdrant_score_map.get(cid, 0.0) for cid in valid_ids],
+            dtype=np.float32,
+        )
+        per_cand_imp = np.asarray(
+            [
+                clusters[paper_cluster_map[cid]].importance
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else 0.0
+                for cid in valid_ids
+            ],
+            dtype=np.float32,
+        )
+        per_cand_med = np.stack(
+            [
+                np.asarray(clusters[paper_cluster_map[cid]].medoid_embedding, dtype=np.float32)
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else np.zeros(1024, dtype=np.float32)
+                for cid in valid_ids
+            ],
+            axis=0,
+        )
+        is_suppressed_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in suppressed else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+        onb_match_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in onboarding_categories else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+    # 15. LightGBM rerank
+    with stage("15. LightGBM rerank (CPU)", timings):
+        reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
+            candidate_ids=valid_ids,
+            candidate_embeddings=valid_embs,
+            candidate_metadata=valid_meta,
+            long_term_vec=lt_vec,
+            short_term_vec=st_vec,
+            negative_vec=neg_vec,
+            qdrant_scores=qdrant_scores,
+            cluster_importance=per_cand_imp,
+            cluster_medoid=per_cand_med,
+            is_suppressed_category=is_suppressed_arr,
+            onboarding_category_match=onb_match_arr,
+            user_total_saves=user_total_saves,
+            user_total_dismissals=user_total_dismissals,
+        )
+    # 16. MMR
+    with stage("16. MMR diversity (CPU)", timings):
+        query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
+        mmr_selected = mmr_rerank(
+            query_embedding=query_vec,
+            candidate_embeddings=reranked_embs,
+            candidate_ids=reranked_ids,
+            scores=reranked_scores,
+            lambda_param=0.6,
+            top_k=REC_LIMIT,
+        )
+    # 17. Exploration injection
+    with stage("17. Exploration injection (CPU)", timings):
+        final = inject_exploration(
+            selected_ids=mmr_selected,
+            all_candidate_ids=reranked_ids,
+            n_explore=2,
+        )
+    return timings
+# ── Setup helper for recs profile ────────────────────────────────────────────
+async def setup_recs_user(user_id: str, save_ids: list[str]):
+    vecs = await qdrant_svc.get_paper_vectors(save_ids)
+    state = await us.ensure_loaded(user_id)
+    for pid in save_ids:
+        if pid not in vecs:
+            continue
+        state.add_positive(pid)
+        emb = np.array(vecs[pid], dtype=np.float32)
+        await profiles.update_on_save(user_id, emb)
+        await db.log_interaction(user_id, pid, "save")
+async def cleanup_user(user_id: str):
+    import aiosqlite
+    async with aiosqlite.connect(config.DB_PATH) as conn:
+        for tbl in ["interactions", "user_profiles", "user_clusters",
+                    "user_onboarding", "cluster_snapshots"]:
+            try:
+                await conn.execute(f"DELETE FROM {tbl} WHERE user_id = ?", (user_id,))
+            except Exception:
+                pass
+        await conn.commit()
+    if user_id in us._cache:
+        del us._cache[user_id]
+async def main():
+    print("=" * 92)
+    print("PIPELINE PROFILER")
+    print("=" * 92)
+    await db.init_db()
+    # Warm BGE-M3 + Turso connection so first stage isn't a 15s outlier
+    print("\nWarming up BGE-M3 + Turso...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+    # ── Search profiling ────────────────────────────────────────────────────
+    print("\n" + "=" * 92)
+    print("SEARCH PIPELINE — three representative queries")
+    print("=" * 92)
+    queries = [
+        ("known-item title", "attention is all you need"),
+        ("conceptual rewrite", "when AI makes up fake facts"),
+        ("academic, no rewrite", "BGE-M3 multilingual dense retrieval"),
+    ]
+    for label, q in queries:
+        print(f"\n>>> Query [{label}]: {q!r}")
+        # Run twice — first cold, second warm — to show cache effect
+        for run in (1, 2):
+            timings = await profile_search(q)
+            print_breakdown(f"Run {run}", timings)
+    # ── Recs Tier 1 profiling ───────────────────────────────────────────────
+    print("\n\n" + "=" * 92)
+    print("RECS TIER 1 PIPELINE — 10 saved papers (5 NLP + 5 CV)")
+    print("=" * 92)
+    user_id = f"profile-recs-{uuid.uuid4().hex[:6]}"
+    save_ids = [
+        "1706.03762", "1810.04805", "2005.14165", "1907.11692", "1910.10683",
+        "1512.03385", "2010.11929", "1409.1556", "1505.04597", "2103.14030",
+    ]
+    try:
+        await setup_recs_user(user_id, save_ids)
+        for run in (1, 2, 3):
+            timings = await profile_recs_tier1(user_id, save_ids)
+            print_breakdown(f"Run {run}", timings)
+    finally:
+        await cleanup_user(user_id)
+if __name__ == "__main__":
+    asyncio.run(main())
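Note on stage 16 of the recommendation profile above: the body of `mmr_rerank` is not part of this diff, so the following is a hedged, dependency-free sketch of greedy Maximal Marginal Relevance in general. The names `cosine` and `mmr_select`, the toy embeddings, and the scores are assumptions for illustration; only the `lambda_param=0.6` value is taken from the call in the profiler.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(ids, embeddings, scores, lambda_param=0.6, top_k=3):
    """Greedy MMR (hypothetical helper, not the project's mmr_rerank).

    At each step pick the candidate maximising
    lambda * relevance - (1 - lambda) * max similarity to anything already picked.
    """
    remaining = list(range(len(ids)))
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_value(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * scores[i] - (1.0 - lambda_param) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return [ids[i] for i in selected]


# Toy data: "a" and "b" are near-duplicates, "c" is different but scores lower.
ids = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = [0.9, 0.85, 0.5]
picked = mmr_select(ids, embeddings, scores, lambda_param=0.6, top_k=2)
```

In this toy case the second pick is "c" rather than the higher-scored "b", because "b" is identical to the already selected "a" and pays the full redundancy penalty. Raising `lambda_param` toward 1.0 shifts the trade-off back toward pure relevance.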

scripts/test_citation_boost.py ADDED

	@@ -0,0 +1,91 @@

+"""Side-by-side comparison: BEFORE vs AFTER citation boost.
+Shows beginner vs expert results for the same topic.
+Also verifies Band A (known-item) queries aren't broken.
+"""
+import asyncio
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from app import hybrid_search_svc, turso_svc, embed_svc
+# Pairs: (topic, beginner_query, expert_query)
+COMPARISONS = [
+    ("TRANSFORMERS",
+     "how do transformers work in NLP",
+     "attention is all you need"),
+    ("DIFFUSION",
+     "what are diffusion models and how do they generate images",
+     "denoising diffusion probabilistic models"),
+    ("GPT-4",
+     "how does GPT-4 work",
+     "GPT-4 Technical Report"),
+    ("RLHF",
+     "what is reinforcement learning from human feedback",
+     "reinforcement learning from human feedback"),
+]
+BAND_A = [
+    ("attention is all you need", "1706.03762"),
+    ("Deep Residual Learning for Image Recognition", "1512.03385"),
+    ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
+]
+async def run_query(q: str):
+    results = await hybrid_search_svc.search(q, limit=10)
+    meta = {}
+    if results:
+        meta = await turso_svc.fetch_metadata_batch(results)
+    return results, meta
+async def main():
+    print("Warming up BGE-M3...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+    # === Band A verification ===
+    print()
+    print("=" * 90)
+    print("BAND A VERIFICATION - Known-item queries (must still be #1)")
+    print("=" * 90)
+    for q, expected in BAND_A:
+        results, meta = await run_query(q)
+        rank = results.index(expected) + 1 if expected in results else -1
+        status = "PASS" if rank == 1 else f"RANK #{rank}" if rank > 0 else "MISS"
+        cites = meta.get(expected, {}).get("citation_count") or 0
+        print(f"  [{status:>8}] {q[:55]:55s}  ({cites} cites)")
+    # === Side-by-side comparisons ===
+    print()
+    print("=" * 90)
+    print("SIDE-BY-SIDE: Beginner vs Expert queries (same topic)")
+    print("=" * 90)
+    for topic, beginner_q, expert_q in COMPARISONS:
+        print(f"\n--- {topic} ---")
+        # Beginner
+        print(f"\n  BEGINNER: {beginner_q!r}")
+        results, meta = await run_query(beginner_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count") or 0  # None would break the :>6 format below
+            print(f"    {i}. [{cites:>6} cites] {title}")
+        # Expert
+        print(f"\n  EXPERT:   {expert_q!r}")
+        results, meta = await run_query(expert_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count") or 0  # None would break the :>6 format below
+            print(f"    {i}. [{cites:>6} cites] {title}")
+    print()
+    print("=" * 90)
+    print("DONE")
+    print("=" * 90)
+if __name__ == "__main__":
+    asyncio.run(main())

tests/test_hybrid_search.py CHANGED

@@ -102,56 +102,100 @@ class TestRRFFusion:
         assert gap_k10 > gap_k100
-# ── Recency rerank tests ─────────────────────────────────────────────────────
-class TestRecencyRerank:
-    """Test recency boosting in hybrid_search_svc."""
-    def test_recency_boost_newer_papers(self):
-        """Newer papers should get higher recency scores."""
-        from app.hybrid_search_svc import _recency_rerank
-        # Two papers with same RRF score but different ages
         fused = [
-            {"arxiv_id": "2401.00001", "rrf_score": 0.5},  # Jan 2024
-            {"arxiv_id": "1501.00001", "rrf_score": 0.5},  # Jan 2015
         ]
-        ranked = _recency_rerank(fused)
-        # Newer paper (2401) should rank higher
-        assert ranked[0]["arxiv_id"] == "2401.00001"
-    def test_recency_preserves_strong_rrf(self):
-        """A much higher RRF score should still dominate over recency."""
-        from app.hybrid_search_svc import _recency_rerank
         fused = [
-            {"arxiv_id": "1501.00001", "rrf_score": 1.0},   # Old but high RRF
-            {"arxiv_id": "2401.00001", "rrf_score": 0.01},   # New but low RRF
         ]
-        ranked = _recency_rerank(fused)
-        # High RRF should still win (0.80 weight vs 0.20 recency)
-        assert ranked[0]["arxiv_id"] == "1501.00001"
-    def test_recency_empty_input(self):
         """Empty input returns empty output."""
-        from app.hybrid_search_svc import _recency_rerank
-        assert _recency_rerank([]) == []
-    def test_recency_unparseable_id(self):
-        """Papers with unparseable IDs get neutral recency (0.5)."""
-        from app.hybrid_search_svc import _recency_rerank
         fused = [
-            {"arxiv_id": "math/0301001", "rrf_score": 0.5},
         ]
-        ranked = _recency_rerank(fused)
-        assert len(ranked) == 1
-        assert "final_score" in ranked[0]
 # ── Groq rewriter tests ─────────────────────────────────────────────────────

         assert gap_k10 > gap_k100
+# ── Title-match rerank tests ─────────────────────────────────────────────────
+class TestTitleMatchRerank:
+    """Test the title-match boost in hybrid_search_svc.
+    Recency rerank was removed because it pushed seminal older papers such as
+    1706.03762 below newer "X is all you need" titles. It was replaced with a
+    title-match boost that promotes papers whose title matches the query.
+    """
+    @pytest.mark.asyncio
+    async def test_exact_title_match_wins(self, monkeypatch):
+        """Paper with exact-title match should rank #1 even with low RRF."""
+        from app import hybrid_search_svc
+        async def fake_meta(ids):
+            return {
+                "1706.03762": {"title": "Attention Is All You Need"},
+                "2404.01183": {"title": "Positioning Is All You Need"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
         fused = [
+            {"arxiv_id": "2404.01183", "rrf_score": 0.0317},  # higher RRF
+            {"arxiv_id": "1706.03762", "rrf_score": 0.0164},  # lower RRF, exact match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "1706.03762"
+    @pytest.mark.asyncio
+    async def test_substring_match_beats_no_match(self, monkeypatch):
+        """A substring title match outranks no-match candidates."""
+        from app import hybrid_search_svc
+        async def fake_meta(ids):
+            return {
+                "2501.05730": {"title": "Element-wise Attention Is All You Need"},
+                "9999.99999": {"title": "An Unrelated Survey of Graph Theory"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
         fused = [
+            {"arxiv_id": "9999.99999", "rrf_score": 0.05},     # higher RRF, no match
+            {"arxiv_id": "2501.05730", "rrf_score": 0.01},     # lower RRF, substring match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "2501.05730"
+    @pytest.mark.asyncio
+    async def test_no_match_falls_back_to_rrf(self, monkeypatch):
+        """When nothing matches, RRF order is preserved."""
+        from app import hybrid_search_svc
+        async def fake_meta(ids):
+            return {
+                "1234.56789": {"title": "Some Paper"},
+                "9876.54321": {"title": "Another Paper"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
+        fused = [
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
+        ]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "transformer")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]
+    @pytest.mark.asyncio
+    async def test_empty_input(self):
         """Empty input returns empty output."""
+        from app import hybrid_search_svc
+        assert await hybrid_search_svc._title_match_rerank([], "anything") == []
+    @pytest.mark.asyncio
+    async def test_turso_failure_falls_back_to_rrf(self, monkeypatch):
+        """If Turso lookup raises, ranking falls back to pure RRF order."""
+        from app import hybrid_search_svc
+        async def boom(ids):
+            raise RuntimeError("turso down")
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", boom)
         fused = [
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "attention")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]
+        # final_score must be set even on the fallback path
+        assert all("final_score" in r for r in ranked)
 # ── Groq rewriter tests ─────────────────────────────────────────────────────