siddhm11 committed
Commit ec67b2f · 1 Parent(s): d2f0bed

Phase 6.5: Pipeline telemetry, search UX fixes, latency profiling

- Instrumented search pipeline: Groq rewrite, BGE-M3 encode, Qdrant+Zilliz retrieval, RRF fusion, title rerank with per-stage timing
- Instrumented recommendation pipeline: clustering, ANN retrieval, metadata fetch, LightGBM rerank, MMR diversity
- Split Title+Citation Rerank into Turso fetch vs compute time (exposed hidden 1.5s network call)
- Added search loading overlay with pipeline stage labels
- Fixed HTMX search: recommendations now hide when search starts
- Fixed paper card: truncate authors (max 3 + et al), hard-truncate abstract to 500 chars
- Show Groq rewrite status (skipped/rewritten/error) in both banner and breakdown
- Added Groq heuristic visibility: shows skip reason (query too short, looks academic)
- Added parallel task count to retrieval breakdown
- New evaluation and diagnostic scripts
- Removed deprecated s2_svc.py
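
The per-stage timing described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code from this commit: `StageTimer` and the stage names are hypothetical, and the only detail taken from the diff is the use of `time.perf_counter()` with integer milliseconds.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: collects per-stage wall-clock timings in milliseconds."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, int] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.timings_ms[name] = int((time.perf_counter() - start) * 1000)


timer = StageTimer()
with timer.stage("encode"):
    time.sleep(0.01)  # stand-in for the BGE-M3 encode step
with timer.stage("retrieval"):
    time.sleep(0.02)  # stand-in for the Qdrant + Zilliz calls
```

Timing each stage separately is what makes a slow network call visible when it sits inside a larger step, as with the Turso fetch mentioned above.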

Files changed (43)
  1. .github/skills/researchit-codebase-overview/SKILL.md +48 -0
  2. .github/skills/researchit-data-layer/SKILL.md +31 -0
  3. .github/skills/researchit-debug-performance/SKILL.md +31 -0
  4. .github/skills/researchit-recs-analysis/SKILL.md +42 -0
  5. .github/skills/researchit-reranker-explainer/SKILL.md +30 -0
  6. .github/skills/researchit-search-analysis/SKILL.md +34 -0
  7. .github/skills/researchit-testing-eval/SKILL.md +30 -0
  8. CLAUDE.md +2 -0
  9. README.md +1 -1
  10. app/config.py +1 -2
  11. app/groq_svc.py +19 -13
  12. app/hybrid_search_svc.py +316 -94
  13. app/qdrant_svc.py +87 -17
  14. app/recommend/clustering.py +76 -2
  15. app/recommend/reranker.py +1 -1
  16. app/routers/onboarding.py +7 -99
  17. app/routers/recommendations.py +44 -15
  18. app/routers/search.py +13 -2
  19. app/s2_svc.py +0 -111
  20. app/templates/index.html +2 -12
  21. app/templates/partials/paper_card.html +10 -5
  22. app/templates/partials/recommendations.html +34 -0
  23. app/templates/partials/search_results.html +78 -2
  24. app/templates/partials/seed_results.html +41 -0
  25. app/templates/partials/seed_search.html +2 -60
  26. app/templates/search.html +55 -9
  27. app/turso_svc.py +127 -9
  28. docs/TASK-TRACKER.md +22 -22
  29. docs/previous_prompt.txt +0 -0
  30. docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md +21 -20
  31. requirements.txt +1 -1
  32. scripts/browser_test_onboarding.py +75 -0
  33. scripts/browser_test_search.py +77 -0
  34. scripts/diag_mamba.py +69 -0
  35. scripts/diag_search_rank.py +45 -0
  36. scripts/e2e_audit.py +622 -0
  37. scripts/eval_expanded_queries.py +336 -0
  38. scripts/eval_recs_quality.py +547 -0
  39. scripts/eval_search_quality.py +197 -0
  40. scripts/expanded_eval_results.json +0 -0
  41. scripts/profile_pipelines.py +410 -0
  42. scripts/test_citation_boost.py +91 -0
  43. tests/test_hybrid_search.py +76 -32
.github/skills/researchit-codebase-overview/SKILL.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ name: researchit-codebase-overview
+ description: "Explain the ResearchIT codebase architecture and current state. Use for onboarding, project overviews, and accurate summaries of how the system works. Triggers: codebase overview, architecture summary, explain this project, how this works, system map."
+ argument-hint: "Specify audience (dev/stakeholder), depth (brief/standard/deep), and focus (search/recs/data)."
+ ---
+
+ # ResearchIT Codebase Overview
+
+ ## When to Use
+ - The user asks for a full understanding of the codebase or architecture.
+ - You need to produce a top-level system map or explain how components interact.
+ - You need a concise but accurate "what is happening here" summary.
+
+ ## Inputs to Ask For (if missing)
+ - Audience: developer vs stakeholder.
+ - Depth: brief, standard, or deep.
+ - Focus areas: search, recommendations, data layer, evaluation.
+
+ ## Required Sources (read in this order)
+ 1. CLAUDE.md (rules and source-of-truth doc map).
+ 2. docs/research/06-Deep-Research-Verdict.md (architecture decisions).
+ 3. README.md (current system summary).
+ 4. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md (module map).
+ 5. docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md (current phase).
+
+ ## Procedure
+ 1. State the product goal in one sentence and the system constraints (CPU-only, latency budget).
+ 2. Describe the high-level architecture (frontend, backend, vector stores, metadata DB, SQLite).
+ 3. Summarize the two main pipelines:
+    - Search: rewrite -> encode -> dense+sparse -> RRF -> title/citation boost.
+    - Recommendations: clustering -> quota -> rerank -> MMR -> exploration.
+ 4. Call out invariants from doc 06 (quota for recs, RRF for search, alpha values, MMR lambda).
+ 5. Explain data flow and caching (Turso LRU, Qdrant vector cache, SQLite metadata cache).
+ 6. State current phase status and what is out of scope.
+
+ ## Output Format
+ - 6 to 10 bullet points, ordered by importance.
+ - Short "where to look" section with key files.
+ - If stakeholder audience: avoid implementation detail and emphasize outcomes.
+
+ ## Key Files to Cite
+ - app/main.py
+ - app/routers/recommendations.py
+ - app/routers/search.py
+ - app/hybrid_search_svc.py
+ - app/recommend/*
+ - app/qdrant_svc.py, app/zilliz_svc.py, app/turso_svc.py
+ - app/db.py
.github/skills/researchit-data-layer/SKILL.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ name: researchit-data-layer
+ description: "Explain the data/storage layer (SQLite, Turso metadata, Qdrant dense vectors, Zilliz sparse vectors). Use for data integrity, schema questions, caching behavior, and ID handling. Triggers: database schema, metadata cache, Qdrant mapping, Zilliz schema."
+ argument-hint: "Specify the component(s) and whether you want schema details or runtime behavior."
+ ---
+
+ # Data and Storage Layer Analysis
+
+ ## When to Use
+ - The user asks about storage, caching, or schemas.
+ - You need to validate data integrity or ID handling.
+ - You need to explain how metadata or vector mappings work.
+
+ ## Required Sources
+ 1. app/db.py (SQLite schema + migrations)
+ 2. app/turso_svc.py (metadata + caches)
+ 3. app/qdrant_svc.py (ID mapping + vector cache)
+ 4. app/zilliz_svc.py (sparse schema + search)
+ 5. app/arxiv_svc.py (API fallback + ID normalization)
+
+ ## Procedure
+ 1. Summarize each store and its responsibility (SQLite, Turso, Qdrant, Zilliz).
+ 2. Explain arXiv ID handling (always string; never integer coercion).
+ 3. Document caches (vector cache, metadata LRU, trending cache).
+ 4. Note schema migrations and instrumentation columns.
+ 5. Identify data consistency boundaries and fallbacks.
+
+ ## Output Format
+ - Component-by-component description.
+ - Tables/fields summary for SQLite.
+ - Integrity rules and common pitfalls.
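
The "always string; never integer coercion" rule in the data-layer skill above is easiest to see with a short example. The sketch below is an assumption-laden illustration: `normalize_arxiv_id` is a hypothetical helper, not the function in `app/arxiv_svc.py`, whose actual behavior is not shown in this commit.

```python
def normalize_arxiv_id(raw: str) -> str:
    """Hypothetical normalizer: keep the ID as a string, strip prefix and version."""
    if not isinstance(raw, str):
        raise TypeError("arXiv IDs must be strings, got " + type(raw).__name__)
    aid = raw.strip()
    if aid.lower().startswith("arxiv:"):
        aid = aid[len("arxiv:"):]
    # Drop a trailing version suffix such as "v2"; leave everything else untouched.
    base, sep, version = aid.rpartition("v")
    if sep and base and version.isdigit():
        aid = base
    return aid


# Numeric coercion silently corrupts IDs, which is why they stay strings:
lost_trailing_zero = str(float("2301.00010"))  # "2301.0001"
lost_leading_zero = str(int("0704"))           # "704"
```

Both corrupted forms point at a different paper or at nothing, so a lookup keyed on them would miss in every store.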
.github/skills/researchit-debug-performance/SKILL.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ name: researchit-debug-performance
+ description: "Debug performance and quality issues in search or recommendations. Use for latency spikes, slow retrievals, or degraded relevance. Triggers: performance issue, slow search, slow recs, latency debug."
+ argument-hint: "Specify area (search/recs/data), symptoms, and whether to propose fixes."
+ ---
+
+ # Debugging and Performance Profiling
+
+ ## When to Use
+ - Latency regressions or slow responses appear.
+ - Search or recommendation quality drops unexpectedly.
+ - External services time out or return empty results.
+
+ ## Required Sources
+ 1. app/qdrant_svc.py (vector cache, retrieve latency)
+ 2. app/turso_svc.py (metadata cache, trending cache)
+ 3. app/hybrid_search_svc.py (RRF pipeline)
+ 4. app/routers/recommendations.py (candidate flow + oversample)
+ 5. app/recommend/reranker.py (model load, feature cost)
+
+ ## Procedure
+ 1. Identify the failing pipeline (search vs recommendations).
+ 2. Check cache hit rates conceptually (vector and metadata caches).
+ 3. Inspect candidate fetch sizes and oversampling factors.
+ 4. Review service fallbacks (Zilliz, Turso, arXiv).
+ 5. Isolate latency contributors and propose focused fixes.
+
+ ## Output Format
+ - Symptom -> probable cause mapping.
+ - Targeted checks in code.
+ - Minimal, low-risk fix options.
.github/skills/researchit-recs-analysis/SKILL.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ name: researchit-recs-analysis
+ description: "Analyze and explain the recommendation pipeline. Use for recs debugging, feature reviews, pipeline changes, or explaining multi-interest behavior. Triggers: recommendation pipeline, recs analysis, multi-interest, quota fusion, reranker."
+ argument-hint: "Specify the task (explain/debug/change), expected output (summary/findings), and whether to include tests."
+ ---
+
+ # Recommendation Pipeline Analysis
+
+ ## When to Use
+ - The user wants a deep explanation of recommendations or changes.
+ - You need to verify rules like quota fusion, EWMA alphas, or MMR usage.
+ - You are asked to debug rec quality or performance.
+
+ ## Required Sources
+ 1. CLAUDE.md and docs/research/06-Deep-Research-Verdict.md (non-negotiables).
+ 2. app/routers/recommendations.py (pipeline and instrumentation).
+ 3. app/recommend/profiles.py (EWMA parameters).
+ 4. app/recommend/clustering.py (Ward + medoids + stabilization).
+ 5. app/recommend/fusion.py (quota logic).
+ 6. app/recommend/reranker.py (LightGBM + features).
+ 7. app/recommend/diversity.py (MMR + exploration).
+
+ ## Procedure
+ 1. Identify which tier is active and the fallback sequence.
+ 2. Validate invariant rules:
+    - Search uses RRF, recommendations do not.
+    - Quota fusion with floor; MMR lambda is 0.6.
+    - alpha_long=0.03, alpha_short=0.40, alpha_neg=0.15.
+ 3. Trace candidate flow:
+    - Medoids -> per-cluster search -> dedup -> rerank -> MMR -> exploration.
+ 4. Check instrumentation fields: query_id, propensity, policy_id.
+ 5. Summarize likely failure modes: missing vectors, empty clusters, cache misses.
+ 6. Recommend targeted tests or metrics to verify changes.
+
+ ## Output Format
+ - Pipeline summary with stages and main functions.
+ - Invariants checklist (pass/fail).
+ - Risks and suggested tests.
+
+ ## Notes
+ - Never propose RRF for multi-medoid recommendations.
+ - Do not introduce cross-encoders into the hot path.
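
The MMR step named in the recs skill above (lambda 0.6) can be illustrated with a small greedy selection sketch. It assumes the common convention in which lambda weights relevance and (1 - lambda) weights redundancy; `mmr_select` and the toy vectors are hypothetical and are not taken from `app/recommend/diversity.py`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mmr_select(profile_vec, candidates, k, lam=0.6):
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to picked items."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(profile_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


profile = [1.0, 1.0]
candidates = [
    [1.0, 0.80],  # 0: most relevant
    [1.0, 0.75],  # 1: near-duplicate of 0
    [0.7, 1.00],  # 2: slightly less relevant, but different
]
with_diversity = mmr_select(profile, candidates, k=2)          # lam = 0.6
relevance_only = mmr_select(profile, candidates, k=2, lam=1.0)
```

With lambda at 0.6 the second slot goes to candidate 2 rather than the near-duplicate; with lambda at 1.0 the selection collapses to plain relevance ordering.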
.github/skills/researchit-reranker-explainer/SKILL.md ADDED
@@ -0,0 +1,30 @@
+ ---
+ name: researchit-reranker-explainer
+ description: "Explain the LightGBM reranker, feature schema, and fallback behavior. Use for model integration checks, feature debugging, or deployment validation. Triggers: reranker, LightGBM, feature schema, model loading."
+ argument-hint: "Specify: explain, validate, or troubleshoot."
+ ---
+
+ # Reranker and Feature Schema Explainer
+
+ ## When to Use
+ - The user asks how the reranker works or which features are used.
+ - You need to validate model loading and fallback behavior.
+ - You are reviewing feature wiring or scoring behavior.
+
+ ## Required Sources
+ 1. app/recommend/reranker.py
+ 2. models/reranker-phase6/production_model/feature_schema.json
+ 3. app/routers/health.py
+ 4. app/routers/recommendations.py (feature wiring)
+
+ ## Procedure
+ 1. Confirm model load paths and fallback logic.
+ 2. Verify the 37-feature ordering matches the schema.
+ 3. Explain which features are active in recommendations and how they are computed.
+ 4. Confirm health endpoint expectations (/healthz/reranker).
+ 5. Provide a concise explanation of latency and why cross-encoders are excluded.
+
+ ## Output Format
+ - Model load status + fallback behavior.
+ - Feature group summary (content, behavior, cross features).
+ - Integration checklist.
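
Step 2 of the reranker skill above (verifying the 37-feature ordering) could be checked along the following lines. The layout of `feature_schema.json` is not shown in this commit, so the `"features"` key below is an assumption, as are the function name and the placeholder feature names.

```python
import json

EXPECTED_FEATURE_COUNT = 37  # the count stated in the skill text


def check_feature_order(schema_json: str, runtime_features: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the ordering matches."""
    # Assumption: the schema stores an ordered list of names under "features".
    schema_features = json.loads(schema_json)["features"]
    problems = []
    if len(schema_features) != EXPECTED_FEATURE_COUNT:
        problems.append(f"schema has {len(schema_features)} features, expected {EXPECTED_FEATURE_COUNT}")
    if len(runtime_features) != len(schema_features):
        problems.append(f"code builds {len(runtime_features)} features, schema has {len(schema_features)}")
    for idx, (expected, actual) in enumerate(zip(schema_features, runtime_features)):
        if expected != actual:
            # Report only the first mismatch; later ones are usually a knock-on effect.
            problems.append(f"position {idx}: schema has {expected!r}, code has {actual!r}")
            break
    return problems


names = [f"f{i}" for i in range(37)]  # placeholder feature names
schema = json.dumps({"features": names})
ok = check_feature_order(schema, names)
swapped = check_feature_order(schema, [names[1], names[0]] + names[2:])
```

Ordering matters when a model is fed positional arrays: a swapped pair of columns produces scores without raising any error.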
.github/skills/researchit-search-analysis/SKILL.md ADDED
@@ -0,0 +1,34 @@
+ ---
+ name: researchit-search-analysis
+ description: "Explain or analyze the hybrid semantic search pipeline (rewrite, encode, dense+sparse, RRF, title/citation boost). Use for search quality, latency, and correctness reviews. Triggers: search pipeline, hybrid search, RRF, BGE-M3 search."
+ argument-hint: "Specify: explain vs debug, and whether to include latency hotspots."
+ ---
+
+ # Search Pipeline Analysis
+
+ ## When to Use
+ - The user wants to understand or debug search results.
+ - You need to review hybrid search correctness.
+ - You are asked about RRF usage or query rewriting.
+
+ ## Required Sources
+ 1. app/routers/search.py
+ 2. app/hybrid_search_svc.py
+ 3. app/embed_svc.py
+ 4. app/qdrant_svc.py
+ 5. app/zilliz_svc.py
+ 6. app/groq_svc.py
+ 7. app/turso_svc.py and app/arxiv_svc.py
+
+ ## Procedure
+ 1. Trace the full pipeline from query to results.
+ 2. Call out the dual-encode design (original + rewrite) and why it exists.
+ 3. Verify RRF usage is limited to search fusion (correct per doc 06).
+ 4. Explain title/citation boosts and their intended effect.
+ 5. Document fallback behavior when any component fails.
+ 6. Summarize latency hotspots and caching layers.
+
+ ## Output Format
+ - Step-by-step pipeline description.
+ - Fallbacks and failure handling.
+ - Notes on ranking behavior and edge cases.
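
The RRF step that the search skill above refers to follows the usual reciprocal-rank formula, with K=60 as stated in `app/hybrid_search_svc.py`. The sketch below generalizes it to any number of ranked lists, which is one plausible reading of the dual-encode design; the function name and the sample result lists are made up for illustration.

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion over any number of ranked ID lists.

    score[id] = sum over lists of 1 / (k + rank), with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, arxiv_id in enumerate(results, start=1):
            scores[arxiv_id] = scores.get(arxiv_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative inputs only; real lists come from the dense and sparse retrievers.
dense = ["1706.03762", "1810.04805", "2005.14165"]
sparse = ["1810.04805", "1409.0473", "1706.03762"]
fused = rrf_fuse([dense, sparse])
```

Because only ranks enter the formula, cosine scores from one retriever and inner-product scores from the other never need to be normalized against each other.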
.github/skills/researchit-testing-eval/SKILL.md ADDED
@@ -0,0 +1,30 @@
+ ---
+ name: researchit-testing-eval
+ description: "Guide testing and evaluation for ResearchIT. Use for test planning, running tests, and explaining evaluation metrics. Triggers: testing plan, run tests, evaluation metrics, offline eval."
+ argument-hint: "Specify scope (unit/integration/e2e) and whether to include metrics."
+ ---
+
+ # Testing and Evaluation Guidance
+
+ ## When to Use
+ - The user wants to run or plan tests.
+ - The user asks about evaluation metrics or offline evaluation.
+ - You need to explain test coverage or risks.
+
+ ## Required Sources
+ 1. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md
+ 2. tests/ (overview)
+ 3. pytest.ini
+ 4. test_e2e_recs.py
+
+ ## Procedure
+ 1. Identify test scope (unit, integration, live, e2e).
+ 2. Provide the correct test command(s) and file locations.
+ 3. Call out live tests that hit external services.
+ 4. Provide evaluation metrics and how they map to system goals.
+ 5. Note any missing coverage or potential regressions.
+
+ ## Output Format
+ - Test scope summary.
+ - Commands and expected outputs.
+ - Evaluation metric checklist.
CLAUDE.md CHANGED
@@ -205,6 +205,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
  - Onboarding wizard (category multi-select + seed search)
  - Category-filtered trending fallback
  - Dark-mode base UI + updated paper cards
+ - S2/ORCID author import was explored and **removed** — not the direction we want

  **Phase 6 — LightGBM reranker (COMPLETE ✅):**
  - LightGBM LambdaRank (141 trees, 37 features) integrated with heuristic fallback
@@ -216,6 +217,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
  - Phase 6.4 (retraining) deferred: gated on 100 users or synthetic simulator

  **Out of scope until later phases — do not build:**
+ - S2/ORCID author import for onboarding (removed — not the direction we want).
  - Collaborative filtering / LightFM (Phase 9, 500+ users).
  - Cross-encoder reranking in serving path (never; only distilled — Phase 8).
  - Claude/Groq-generated cluster summaries (Phase 8).
README.md CHANGED
@@ -276,7 +276,7 @@ curl -s https://siddhm11-researchit.hf.space/healthz/reranker | python -m json.t
  | `TURSO_URL` | Yes | Turso database URL |
  | `TURSO_DB_TOKEN` | Yes | Turso auth token |
  | `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
- | `S2_API_KEY` | No | Semantic Scholar API key (training only) |
+ | `S2_API_KEY` | No | Semantic Scholar API key (offline training scripts only, not used by the app) |
  | `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
  | `DB_PATH` | No | SQLite path (default: `interactions.db`) |

app/config.py CHANGED
@@ -24,8 +24,7 @@ METADATA_CACHE_TTL_DAYS = 30 # re-fetch metadata after this many days
  TURSO_URL = os.getenv("TURSO_URL", "")
  TURSO_DB_TOKEN = os.getenv("TURSO_DB_TOKEN", "")

- # ── Semantic Scholar API — Phase 5.1 (author import) ─────────────────────────
- S2_API_KEY = os.getenv("S2_API_KEY", "")
+

  # ── Recommendation settings ──────────────────────────────────────────────────
  REC_LIMIT = 10 # how many recommendations to show
app/groq_svc.py CHANGED
@@ -45,29 +45,29 @@ def _get_client():

  _SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.

- Your job: Convert casual or vague user queries into dense, keyword-rich academic search strings that will match arXiv paper titles and abstracts.
+ Your job: Convert casual or conversational user queries into academic search strings.

  Rules:
  1. Output ONLY the rewritten query string — no explanation, no quotes, no preamble.
- 2. Include standard academic terms, model names, acronyms, and author-style keywords.
- 3. Keep the output to 8-15 words maximum.
- 4. If the query already looks academic, return it with minimal changes.
+ 2. If the user's query is casual or conversational, rewrite it using standard academic terms.
+ 3. CRITICAL: If the query is ALREADY a precise technical term, a single keyword, an acronym, or a known paper title (e.g., "perplexity", "transformers", "Adam optimizer"), DO NOT expand it. Return it EXACTLY AS IS. Do NOT add random related words.
+ 4. Never output more than 8 words.

  Examples:
  User: "when AI makes up fake facts"
- Output: LLM hallucination factual errors sycophancy truthfulness survey
+ Output: LLM hallucination factual errors

  User: "the llama model by facebook"
- Output: LLaMA open efficient foundation language model Meta AI
+ Output: LLaMA foundation language model Meta AI

- User: "how to make images from text"
- Output: text-to-image generation diffusion models latent space
+ User: "perplexity"
+ Output: perplexity

- User: "papers about making language models smaller"
- Output: language model compression distillation pruning quantization efficient
+ User: "attention is all you need"
+ Output: attention is all you need

- User: "whisper speech recognition"
- Output: Whisper OpenAI automatic speech recognition multilingual"""
+ User: "gradient descent"
+ Output: gradient descent"""


  # ── Heuristic: should we skip rewriting? ─────────────────────────────────────
@@ -85,8 +85,14 @@ _ACADEMIC_PATTERN = re.compile(


  def _looks_academic(query: str) -> bool:
-     """Heuristic: skip rewriting if query already has academic terms."""
+     """Heuristic: skip rewriting if query already looks academic or is very short."""
      words = query.split()
+
+     # 1-2 word queries are usually precise keywords or author names (e.g., "perplexity", "lecun")
+     # Expanding them almost always ruins the precision.
+     if len(words) <= 2:
+         return True
+
      if len(words) > 6:
          matches = len(_ACADEMIC_PATTERN.findall(query))
          if matches >= 2:
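
A rough, self-contained sketch of how the skip heuristic in this `app/groq_svc.py` hunk behaves. The real `_ACADEMIC_PATTERN` and the tail of `_looks_academic` fall outside the hunk, so the regex below is a hypothetical stand-in and the two final `return` statements are assumptions about the elided code.

```python
import re

# Hypothetical stand-in: the real _ACADEMIC_PATTERN is not shown in this diff.
_ACADEMIC_PATTERN = re.compile(
    r"\b(transformer|attention|diffusion|quantization|LLM)\b",
    re.IGNORECASE,
)


def looks_academic(query: str) -> bool:
    """Return True when the query should be sent to search without rewriting."""
    words = query.split()

    # 1-2 word queries are treated as precise keywords and never rewritten.
    if len(words) <= 2:
        return True

    # Longer queries are skipped only if they already carry academic terms.
    if len(words) > 6:
        matches = len(_ACADEMIC_PATTERN.findall(query))
        if matches >= 2:
            return True  # assumed: the hunk ends before this line
    return False  # assumed default for everything else


short = looks_academic("perplexity")
casual = looks_academic("when AI makes up fake facts")
technical = looks_academic("efficient attention variants for long context transformer models")
```

Under these assumptions a casual six-word query still reaches the rewriter, while single keywords and term-heavy long queries bypass it.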
app/hybrid_search_svc.py CHANGED
@@ -6,23 +6,29 @@ Orchestrates the full pipeline:
  2. BGE-M3 encode → dense + sparse
  3. Parallel search: Qdrant dense + Zilliz sparse
  4. RRF fusion (K=60)
- 5. Recency rerank: 0.80 × RRF + 0.20 × recency
  6. Return ranked arxiv_ids

  Doc 06 confirms: RRF is correct for search (fusing different retrievers
  answering the SAME query). This is different from recommendations where
  quota is correct (fusing different queries for the SAME user).
  """
  from __future__ import annotations

  import asyncio
- from datetime import datetime

  from app import config
  from app import embed_svc
  from app import qdrant_svc
  from app import zilliz_svc
  from app import groq_svc


  # ── Public API ───────────────────────────────────────────────────────────────
@@ -31,18 +37,20 @@ async def search(
      query: str,
      limit: int = 10,
      use_rewrite: bool = True,
- ) -> list[str]:
      """
      Hybrid semantic search — returns a list of arxiv_ids ranked by
      fused relevance.

      Pipeline:
-         rewrite → encode → parallel(dense, sparse) → RRF → rerank

      Args:
          query: User's raw search query.
          limit: Number of results to return.
          use_rewrite: Whether to attempt LLM query rewriting.

      Returns:
          list of arxiv_id strings, sorted by final score descending.
@@ -50,55 +58,115 @@ async def search(
      """
      query = query.strip()
      if not query:
-         return []

      # ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
-     search_query = query
      if use_rewrite:
          try:
-             search_query = await groq_svc.rewrite(query)
          except Exception:
-             search_query = query  # Fallback guaranteed

-     # ── Step 2: BGE-M3 encode (dense + sparse in one pass) ───────────────
-     try:
-         dense_vec, sparse_dict = embed_svc.encode_query(search_query)
-     except Exception as e:
-         print(f"[hybrid_search] Encoding failed: {e}")
-         return []

      # How many candidates to fetch before reranking
      fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER

-     # ── Step 3: Parallel dense + sparse search ───────────────────────────
-     dense_results, sparse_results = await asyncio.gather(
-         qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k),
-         zilliz_svc.search_sparse(sparse_dict, limit=fetch_k),
-         return_exceptions=True,
-     )
-
-     # Handle individual failures gracefully
-     if isinstance(dense_results, Exception):
-         print(f"[hybrid_search] Dense search failed: {dense_results}")
-         dense_results = []
-     if isinstance(sparse_results, Exception):
-         print(f"[hybrid_search] Sparse search failed: {sparse_results}")
-         sparse_results = []
-
-     if not dense_results and not sparse_results:
-         return []
-
-     # ── Step 4: RRF fusion ───────────────────────────────────────────────
-     fused = _rrf_fuse(dense_results, sparse_results, k=config.SEARCH_RRF_K)

      if not fused:
-         return []
-
-     # ── Step 5: Recency rerank ───────────────────────────────────────────
-     ranked = _recency_rerank(fused)

      # ── Step 6: Return top results ───────────────────────────────────────
-     return [item["arxiv_id"] for item in ranked[:limit]]


  # ── RRF fusion ───────────────────────────────────────────────────────────────
@@ -109,92 +177,246 @@ def _rrf_fuse(
      k: int = 60,
  ) -> list[dict]:
      """
-     Reciprocal Rank Fusion — merges results from dense and sparse search.

-     score[paper] = 1/(k + rank_dense) + 1/(k + rank_sparse)

      RRF is rank-based, so raw scores from different systems don't need
-     normalization — this is why it works for fusing Qdrant cosine scores
-     with Zilliz IP scores.

      Args:
-         dense_results: list of {'arxiv_id': str, 'score': float} from Qdrant
-         sparse_results: list of {'arxiv_id': str, 'score': float} from Zilliz
-         k: RRF constant (default 60)

      Returns:
-         list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc
      """
      scores: dict[str, float] = {}

-     # Dense contributions (rank = position in sorted list, 1-indexed)
-     for rank, item in enumerate(dense_results, start=1):
-         aid = item["arxiv_id"]
-         scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
-
-     # Sparse contributions
-     for rank, item in enumerate(sparse_results, start=1):
-         aid = item["arxiv_id"]
-         scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
-
-     # Sort by fused score descending
      fused = [
          {"arxiv_id": aid, "rrf_score": score}
          for aid, score in scores.items()
      ]
      fused.sort(key=lambda x: x["rrf_score"], reverse=True)
-
      return fused


- # ── Recency rerank ───────────────────────────────────────────────────────────

- def _recency_rerank(fused: list[dict]) -> list[dict]:
      """
-     Apply recency boost to RRF scores.

-     final_score = SEARCH_SEMANTIC_WEIGHT × norm_rrf + SEARCH_RECENCY_WEIGHT × recency

-     Recency is estimated from the arXiv ID (YYMM format) since we don't have
-     publication dates at this stage. Papers not parseable get neutral score.

-     The semantic weight (0.80) ensures RRF dominates, while recency (0.20)
-     provides a mild boost to newer papers.
      """
      if not fused:
          return fused

-     # Normalize RRF scores to [0, 1]
-     max_rrf = max(item["rrf_score"] for item in fused)
-     min_rrf = min(item["rrf_score"] for item in fused)
-     rrf_range = max_rrf - min_rrf if max_rrf != min_rrf else 1.0

-     now_ym = datetime.now().year * 12 + datetime.now().month

-     for item in fused:
-         # Normalize RRF to [0, 1]
-         norm_rrf = (item["rrf_score"] - min_rrf) / rrf_range

-         # Estimate recency from arXiv ID (format: YYMM.NNNNN)
-         recency = 0.5  # neutral default
          aid = item["arxiv_id"]
-         try:
-             parts = aid.split(".")
-             if len(parts) >= 2 and len(parts[0]) == 4:
-                 yy = int(parts[0][:2])
-                 mm = int(parts[0][2:4])
-                 year = 2000 + yy if yy < 100 else yy
-                 paper_ym = year * 12 + mm
-                 months_ago = max(0, now_ym - paper_ym)
-                 # Decay: recent papers get ~1.0, 10-year-old papers get ~0.0
-                 recency = max(0.0, 1.0 - months_ago / 120.0)
-         except (ValueError, IndexError):
-             pass
-
-         item["final_score"] = (
-             config.SEARCH_SEMANTIC_WEIGHT * norm_rrf
-             + config.SEARCH_RECENCY_WEIGHT * recency
-         )

      fused.sort(key=lambda x: x["final_score"], reverse=True)
      return fused
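
A worked example of why the removed `_recency_rerank` could push a seminal paper below a newer one. The decay and the 0.80/0.20 weights are taken from the removed code above; the evaluation date (January 2025), the second arXiv ID, and the normalized RRF values are invented for illustration.

```python
def recency_from_arxiv_id(arxiv_id: str, now_year: int, now_month: int) -> float:
    """Linear 10-year decay estimated from a YYMM.NNNNN arXiv ID (0.5 if unparseable)."""
    recency = 0.5  # neutral default
    try:
        parts = arxiv_id.split(".")
        if len(parts) >= 2 and len(parts[0]) == 4:
            paper_ym = (2000 + int(parts[0][:2])) * 12 + int(parts[0][2:4])
            months_ago = max(0, (now_year * 12 + now_month) - paper_ym)
            recency = max(0.0, 1.0 - months_ago / 120.0)
    except ValueError:
        pass
    return recency


def final_score(norm_rrf: float, recency: float, w_sem: float = 0.80, w_rec: float = 0.20) -> float:
    return w_sem * norm_rrf + w_rec * recency


# Assumed scenario: evaluated in January 2025, with the 2017 paper ranked first by RRF.
old_recency = recency_from_arxiv_id("1706.03762", 2025, 1)  # 91 months old
new_recency = recency_from_arxiv_id("2401.00001", 2025, 1)  # 12 months old
old_final = final_score(1.00, old_recency)  # best RRF match
new_final = final_score(0.85, new_recency)  # weaker RRF match, but recent
```

Here the older paper scores about 0.848 and the newer one 0.860, so the weaker match wins on recency alone. This is the failure mode cited as the reason for dropping the recency rerank.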
 
  2. BGE-M3 encode → dense + sparse
  3. Parallel search: Qdrant dense + Zilliz sparse
  4. RRF fusion (K=60)
+ 5. Title-match boost (exact/substring against Turso titles)
  6. Return ranked arxiv_ids

  Doc 06 confirms: RRF is correct for search (fusing different retrievers
  answering the SAME query). This is different from recommendations where
  quota is correct (fusing different queries for the SAME user).
+
+ Recency rerank was removed — search relevance should not be biased toward
+ newer papers (that is a recommendations concern). For exact-title queries
+ like "attention is all you need", the recency overlay was crushing seminal
+ older papers below newer "X is all you need" titles.
  """
  from __future__ import annotations

  import asyncio
+ import re

  from app import config
  from app import embed_svc
  from app import qdrant_svc
  from app import zilliz_svc
  from app import groq_svc
+ from app import turso_svc


  # ── Public API ───────────────────────────────────────────────────────────────
 
      query: str,
      limit: int = 10,
      use_rewrite: bool = True,
+     return_meta: bool = False,
+ ) -> list[str] | tuple[list[str], dict]:
      """
      Hybrid semantic search — returns a list of arxiv_ids ranked by
      fused relevance.

      Pipeline:
+         rewrite → encode → parallel(dense, sparse) → RRF → title-boost

      Args:
          query: User's raw search query.
          limit: Number of results to return.
          use_rewrite: Whether to attempt LLM query rewriting.
+         return_meta: If True, returns a tuple of (arxiv_ids, metadata_dict).

      Returns:
          list of arxiv_id strings, sorted by final score descending.
 
      """
      query = query.strip()
      if not query:
+         return ([], {}) if return_meta else []
+
+     import time
+     search_meta = {"rewritten_query": None, "groq_time_ms": 0, "groq_status": "off"}

      # ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
+     rewritten_query = query
      if use_rewrite:
+         start_groq = time.perf_counter()
          try:
+             rewritten_query = await groq_svc.rewrite(query)
+             if rewritten_query != query:
+                 search_meta["rewritten_query"] = rewritten_query
+                 search_meta["groq_status"] = "rewritten"
+             else:
+                 # Groq returned same query — either skipped by heuristic or LLM kept it
+                 word_count = len(query.strip().split())
+                 if word_count <= 2:
+                     search_meta["groq_status"] = f"skipped (query too short: {word_count} words)"
+                 elif groq_svc._looks_academic(query):
+                     search_meta["groq_status"] = "skipped (looks academic)"
+                 else:
+                     search_meta["groq_status"] = "called, kept original"
          except Exception:
+             rewritten_query = query  # Fallback guaranteed
+             search_meta["groq_status"] = "error (fallback)"
+         search_meta["groq_time_ms"] = int((time.perf_counter() - start_groq) * 1000)
+
+     # ── Step 2: BGE-M3 encode the original AND rewrite ──────────────────
+     # Why both: The rewriter improves recall on conceptual/casual queries
+     # ("when AI makes up fake facts" -> "LLM hallucination ...") but it
+     # paraphrases away from literal title wording on known-item queries
+     # ("attention is all you need" -> "Transformer self-attention ..."),
+     # which can drop the actual famous paper out of the candidate pool
+     # entirely. Searching both forms and RRF-fusing all result lists
+     # gives us recall on both axes.
+     queries_to_encode: list[str] = [query]
+     if rewritten_query and rewritten_query != query:
+         queries_to_encode.append(rewritten_query)
+
+     t0_encode = time.perf_counter()
+     encoded: list[tuple] = []
+     for q in queries_to_encode:
+         try:
+             d, s = embed_svc.encode_query(q)
+             encoded.append((d, s))
+         except Exception as e:
+             print(f"[hybrid_search] Encoding failed for {q!r}: {e}")
+     search_meta["encode_time_ms"] = int((time.perf_counter() - t0_encode) * 1000)

+     if not encoded:
+         return ([], search_meta) if return_meta else []

      # How many candidates to fetch before reranking
      fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER

+     # ── Step 3: Parallel dense + sparse search for every encoded form ───
+     # Build a flat list of search coroutines: [dense_q1, sparse_q1, dense_q2, sparse_q2, ...]
+     t0_retrieval = time.perf_counter()
+     tasks = []
+     task_labels = []
+     for i, (dense_vec, sparse_dict) in enumerate(encoded):
+         tasks.append(qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k))
124
+ task_labels.append(f"qdrant_q{i}")
125
+ tasks.append(zilliz_svc.search_sparse(sparse_dict, limit=fetch_k))
126
+ task_labels.append(f"zilliz_q{i}")
127
+
128
+ # Time the gather as a whole: the tasks run concurrently, so no per-task timing is recorded here.
131
+ raw_results = await asyncio.gather(*tasks, return_exceptions=True)
132
+ search_meta["retrieval_time_ms"] = int((time.perf_counter() - t0_retrieval) * 1000)
133
+ search_meta["n_retrieval_tasks"] = len(tasks)
134
+
135
+ valid_result_lists: list[list[dict]] = []
136
+ for r in raw_results:
137
+ if isinstance(r, Exception):
138
+ print(f"[hybrid_search] search task failed: {r}")
139
+ continue
140
+ if r:
141
+ valid_result_lists.append(r)
142
+
143
+ if not valid_result_lists:
144
+ return ([], search_meta) if return_meta else []
145
+
146
+ # ── Step 4: RRF fusion across all result lists ──────────────────────
147
+ t0_rrf = time.perf_counter()
148
+ fused = _rrf_fuse_multi(valid_result_lists, k=config.SEARCH_RRF_K)
149
+ search_meta["rrf_time_ms"] = int((time.perf_counter() - t0_rrf) * 1000)
150
 
151
  if not fused:
152
+ return ([], search_meta) if return_meta else []
153
+
154
+ # ── Step 5: Title-match boost ────────────────────────────────────────
155
+ # Use the user's ORIGINAL query (not the LLM rewrite) for title matching:
156
+ # the user's literal text is what should match a paper title.
157
+ t0_rerank = time.perf_counter()
158
+ ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)
159
+ rerank_total = int((time.perf_counter() - t0_rerank) * 1000)
160
+ search_meta["rerank_time_ms"] = rerank_total
161
+ # Extract sub-timings stashed by _title_match_rerank
162
+ if ranked:
163
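+ turso_boost_ms = next((it.pop("_turso_boost_fetch_ms") for it in ranked if "_turso_boost_fetch_ms" in it), 0)  # stashed before the sort, so the key may not be on ranked[0]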
164
+ search_meta["turso_boost_fetch_ms"] = turso_boost_ms
165
+ search_meta["rerank_compute_ms"] = max(0, rerank_total - turso_boost_ms)
166
 
167
  # ── Step 6: Return top results ───────────────────────────────────────
168
+ final_results = [item["arxiv_id"] for item in ranked[:limit]]
169
+ return (final_results, search_meta) if return_meta else final_results
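The retrieval step above depends on `asyncio.gather(..., return_exceptions=True)` so that one failing backend does not sink the whole search. Below is a minimal sketch of that pattern. The coroutines `fake_dense` and `fake_sparse` are hypothetical stand-ins, not the real Qdrant or Zilliz clients, and `retrieve` is an illustrative name only.

```python
import asyncio
import time


async def fake_dense(limit: int) -> list[dict]:
    # Hypothetical stand-in for a dense search call that succeeds.
    await asyncio.sleep(0.01)
    return [{"arxiv_id": f"d{i}", "score": 1.0 - i * 0.1} for i in range(limit)]


async def fake_sparse(limit: int) -> list[dict]:
    # Hypothetical stand-in for a sparse search call that fails.
    await asyncio.sleep(0.01)
    raise RuntimeError("sparse backend unavailable")


async def retrieve(limit: int = 3) -> tuple[list[list[dict]], dict]:
    meta: dict = {}
    t0 = time.perf_counter()
    # Exceptions are returned in place of results instead of being raised.
    raw = await asyncio.gather(
        fake_dense(limit), fake_sparse(limit), return_exceptions=True
    )
    meta["retrieval_time_ms"] = int((time.perf_counter() - t0) * 1000)
    meta["n_retrieval_tasks"] = len(raw)
    # Keep only non-empty, non-exception result lists.
    valid = [r for r in raw if not isinstance(r, BaseException) and r]
    return valid, meta


valid_lists, meta = asyncio.run(retrieve())
```

Because the tasks run concurrently, the recorded time is the wall-clock duration of the slowest task, not the sum of all tasks.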
170
 
171
 
172
  # ── RRF fusion ───────────────────────────────────────────────────────────────
 
177
  k: int = 60,
178
  ) -> list[dict]:
179
  """
180
+ Reciprocal Rank Fusion of two result lists (dense + sparse).
181
 
182
+ Kept for callers that pass exactly two lists; new code (and the
183
+ hybrid pipeline itself) should call _rrf_fuse_multi instead.
184
+ """
185
+ return _rrf_fuse_multi([dense_results, sparse_results], k=k)
186
+
187
+
188
+ def _rrf_fuse_multi(
189
+ result_lists: list[list[dict]],
190
+ k: int = 60,
191
+ ) -> list[dict]:
192
+ """
193
+ Reciprocal Rank Fusion across N result lists.
194
+
195
+ score[paper] = sum over each list of 1/(k + rank_in_that_list)
196
 
197
  RRF is rank-based, so raw scores from different systems don't need
198
+ normalization. This means we can merge dense, sparse, AND multiple
199
+ encoded query forms (original + LLM-rewritten) without per-source
200
+ score calibration.
201
 
202
  Args:
203
+ result_lists: each list contains {'arxiv_id': str, 'score': ...}
204
+ sorted best-first.
205
+ k: RRF constant (default 60).
206
 
207
  Returns:
208
+ list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc.
209
  """
210
  scores: dict[str, float] = {}
211
+ for results in result_lists:
212
+ for rank, item in enumerate(results, start=1):
213
+ aid = item["arxiv_id"]
214
+ scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
215
 
 
 
 
 
 
 
 
 
 
 
 
216
  fused = [
217
  {"arxiv_id": aid, "rrf_score": score}
218
  for aid, score in scores.items()
219
  ]
220
  fused.sort(key=lambda x: x["rrf_score"], reverse=True)
 
221
  return fused
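As a worked illustration of the fusion formula, the sketch below restates the same logic as a standalone function and runs it on two toy lists. The IDs `A` to `D` and the name `rrf_fuse` are made up for the example.

```python
def rrf_fuse(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    # score[paper] = sum over lists of 1 / (k + rank), with rank starting at 1.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            aid = item["arxiv_id"]
            scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
    fused = [{"arxiv_id": a, "rrf_score": s} for a, s in scores.items()]
    fused.sort(key=lambda x: x["rrf_score"], reverse=True)
    return fused


# Toy inputs: "A" and "B" appear in both lists, "C" and "D" in one each.
dense = [{"arxiv_id": "A"}, {"arxiv_id": "B"}, {"arxiv_id": "C"}]
sparse = [{"arxiv_id": "B"}, {"arxiv_id": "D"}, {"arxiv_id": "A"}]
fused = rrf_fuse([dense, sparse])
order = [f["arxiv_id"] for f in fused]
```

Here `B` (ranks 2 and 1) edges out `A` (ranks 1 and 3), and both beat the papers that appear in only one list. This is the behaviour that lets extra query forms be added without score calibration.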
222
 
223
 
224
+ # ── Title-match + citation-popularity rerank ─────────────────────────────────
225
+
226
+ # Boost magnitudes are calibrated against `max_rrf` so any meaningful title
227
+ # match outranks the best non-matching candidate:
228
+ # final = rrf_score + max_rrf * (title_boost + citation_boost)
229
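+ # With boost=2.0 (exact title), the worst exact match scores >= 2 * max_rrf while the best non-match
230
+ # scores <= 1.2 * max_rrf (citation cap 0.2), so it always wins. boost=1.0 beats a non-match unless a top-RRF non-match also has a larger citation boost.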
231
+ _BOOST_EXACT_TITLE = 2.0 # query == title (after normalize)
232
+ _BOOST_SUBSTRING_TITLE = 1.0 # query is contiguous substring of title
233
+ _BOOST_HIGH_COVERAGE = 1.0 # >= 80% of query words found in title
234
+ _BOOST_MED_COVERAGE = 0.5 # >= 50% of query words found in title
235
+
236
+ # Citation-popularity boost: surfaces landmark papers even when keyword
237
+ # overlap is low. Without this, "how do transformers work in NLP" returns
238
+ # niche papers instead of "Attention Is All You Need" because RRF favors
239
+ # papers whose titles contain more query keywords.
240
+ #
241
+ # Uses log10(citations) scaled to a cap:
242
+ # 0 citations -> 0.0 boost
243
+ # 10 citations -> 0.03
244
+ # 100 citations -> 0.07
245
+ # 1K citations -> 0.10
246
+ # 10K citations -> 0.13
247
+ # 100K citations-> 0.17 (near cap)
248
+ #
249
+ # Cap is deliberately small (0.2 * max_rrf) so it NUDGES but doesn't
250
+ # override title-match or strong semantic signal. A 100K-citation paper
251
+ # still loses to a perfect title match.
252
+ import math
253
+ _CITATION_BOOST_CAP = 0.2 # max boost from citations alone
254
+ _CITATION_LOG_DIVISOR = 30.0 # boost = log10(citations + 1) / 30; the 0.2 cap is reached at 6 log10 units (~1M citations)
255
+
256
+ # Drop any token shorter than this from coverage calculation: single-letter
257
+ # tokens ("a", "i") and tiny stop-likes inflate spurious matches.
258
+ _MIN_COVERAGE_TOKEN_LEN = 2
259
+
260
+
261
+ def _normalize_for_match(text: str) -> str:
262
+ """Lowercase, collapse non-alnum to single spaces, strip."""
263
+ return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
264
+
265
+
266
+ def _stem_plural(w: str) -> str:
267
+ """Trim a single trailing 's' on tokens longer than 3 chars.
268
+
269
+ Crude but cheap. Catches the 'space' vs 'spaces' problem in the
270
+ Mamba paper title without dragging in a real stemmer dependency.
271
+ """
272
+ return w[:-1] if len(w) > 3 and w.endswith("s") else w
273
+
274
+
275
+ def _word_set(text: str) -> set[str]:
276
+ return {
277
+ _stem_plural(w) for w in text.split()
278
+ if len(w) >= _MIN_COVERAGE_TOKEN_LEN
279
+ }
280
+
281
+
282
+ def _compute_title_boost(query_norm: str, title_raw: str) -> float:
283
+ """Decide how much to boost a candidate based on title overlap.
284
+
285
+ Order of checks (strongest signal first):
286
+ 1. Exact match after normalization -> 2.0
287
+ 2. Query is contiguous substring of normalized title -> 1.0
288
+ (rescues "chain of thought prompting" vs
289
+ "Chain-of-Thought Prompting Elicits Reasoning..."; punctuation
290
+ in title was the only thing blocking the old binary substring check)
291
+ 3. Coverage: fraction of query word-stems found in title (or as
292
+ substring of compact title; catches "multilingual" appearing
293
+ in "Multi-Lingual" once spaces are stripped).
294
+ >= 0.8 -> _BOOST_HIGH_COVERAGE * coverage
295
+ >= 0.5 -> _BOOST_MED_COVERAGE * coverage
296
+ otherwise -> 0
297
+ """
298
+ if not query_norm or not title_raw:
299
+ return 0.0
300
+
301
+ title_norm = _normalize_for_match(title_raw)
302
+ if not title_norm:
303
+ return 0.0
304
+
305
+ if query_norm == title_norm:
306
+ return _BOOST_EXACT_TITLE
307
+ if query_norm in title_norm:
308
+ return _BOOST_SUBSTRING_TITLE
309
+
310
+ q_words = _word_set(query_norm)
311
+ if not q_words:
312
+ return 0.0
313
+
314
+ t_words = _word_set(title_norm)
315
+ title_compact = title_norm.replace(" ", "")
316
+
317
+ matches = 0
318
+ for w in q_words:
319
+ if w in t_words:
320
+ matches += 1
321
+ elif len(w) >= 4 and w in title_compact:
322
+ # Catches "multilingual" appearing within "multi lingual"
323
+ # once whitespace is stripped from the title.
324
+ matches += 1
325
+
326
+ coverage = matches / len(q_words)
327
+ if coverage >= 0.8:
328
+ return _BOOST_HIGH_COVERAGE * coverage
329
+ if coverage >= 0.5:
330
+ return _BOOST_MED_COVERAGE * coverage
331
+ return 0.0
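To make the three tiers concrete, here is a self-contained restatement of the title-boost logic with a few worked inputs. It is a sketch, not the module itself: the helper names (`normalize`, `stem_plural`, `word_set`, `title_boost`) are shortened for the example, the boost constants are inlined, and unlike the original this version normalizes the query inside the function.

```python
import re

MIN_TOKEN_LEN = 2


def normalize(text: str) -> str:
    # Lowercase and collapse every run of non-alphanumerics to one space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def stem_plural(w: str) -> str:
    # Trim one trailing "s" on tokens longer than 3 characters.
    return w[:-1] if len(w) > 3 and w.endswith("s") else w


def word_set(text: str) -> set[str]:
    return {stem_plural(w) for w in text.split() if len(w) >= MIN_TOKEN_LEN}


def title_boost(query: str, title: str) -> float:
    q, t = normalize(query), normalize(title)
    if not q or not t:
        return 0.0
    if q == t:
        return 2.0  # exact title
    if q in t:
        return 1.0  # contiguous substring
    q_words, t_words = word_set(q), word_set(t)
    if not q_words:
        return 0.0
    compact = t.replace(" ", "")
    matches = sum(
        1 for w in q_words if w in t_words or (len(w) >= 4 and w in compact)
    )
    coverage = matches / len(q_words)
    if coverage >= 0.8:
        return 1.0 * coverage
    if coverage >= 0.5:
        return 0.5 * coverage
    return 0.0


exact = title_boost("Attention Is All You Need", "Attention is all you need")
substring = title_boost(
    "chain of thought prompting",
    "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models",
)
compact_hit = title_boost("multilingual embeddings", "Multi-Lingual Text Embeddings")
partial = title_boost("graph neural networks survey", "A Survey of Neural Architectures")
miss = title_boost("quantum error correction", "Attention Is All You Need")
```

The `partial` case matches two of four query stems (coverage 0.5), so it lands in the medium tier and receives 0.5 * 0.5 = 0.25.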
332
+
333
+
334
+ def _compute_citation_boost(citation_count: int) -> float:
335
+ """Log-scaled citation boost, capped at _CITATION_BOOST_CAP.
336
+
337
+ The idea: a paper with 100K citations (like "Attention Is All You Need")
338
+ gets a small but meaningful nudge upward even when it has zero keyword
339
+ overlap with a beginner's query like "how do transformers work".
340
+
341
+ The boost is small enough that a strong title match always wins, and
342
+ a strong semantic RRF score always wins. But when two papers have
343
+ similar RRF scores and neither has a title match, the one with 100K
344
+ citations beats the one with 3 citations.
345
+
346
+ Scale (log10-based):
347
+ citations=0 -> 0.000
348
+ citations=10 -> 0.035
349
+ citations=100 -> 0.067
350
+ citations=1000 -> 0.100
351
+ citations=10000 -> 0.133
352
+ citations=100000-> 0.167
353
+ """
354
+ if citation_count <= 0:
355
+ return 0.0
356
+ raw = math.log10(citation_count + 1) / _CITATION_LOG_DIVISOR
357
+ return min(raw, _CITATION_BOOST_CAP)
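The scale in the docstring can be reproduced with a few lines. The sketch below copies the formula with the two constants inlined (`CAP` and `DIVISOR` are shortened names for this example) and shows where the cap actually takes effect.

```python
import math

CAP = 0.2        # mirrors _CITATION_BOOST_CAP
DIVISOR = 30.0   # mirrors _CITATION_LOG_DIVISOR


def citation_boost(citation_count: int) -> float:
    if citation_count <= 0:
        return 0.0
    return min(math.log10(citation_count + 1) / DIVISOR, CAP)


# Print the boost for each order of magnitude.
for c in (0, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{c:>9} citations -> {citation_boost(c):.3f}")
```

With a divisor of 30, the 0.2 cap corresponds to 6 log10 units, so it only binds at roughly one million citations; a 100K-citation paper sits at about 0.167.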
358
 
359
+
360
+ async def _title_match_rerank(
361
+ fused: list[dict],
362
+ user_query: str,
363
+ top_n_for_boost: int = 50,
364
+ ) -> list[dict]:
365
  """
366
+ Boost candidates by title overlap + citation popularity.
367
+
368
+ Two signals, both based on metadata we already fetch from Turso:
369
+
370
+ 1. Title boost (strong): exact/substring/coverage match between the
371
+ user's ORIGINAL query and paper titles. Rescues known-item queries.
372
 
373
+ 2. Citation boost (gentle): log-scaled citation count, capped at 0.2x
374
+ max_rrf. Rescues landmark papers for beginner queries where keyword
375
+ overlap is low but the paper is obviously important.
376
 
377
+ The final score is:
378
+ final = rrf_score + max_rrf * (title_boost + citation_boost)
379
 
380
+ Safe under partial Turso failure: papers with missing metadata get
381
+ boost=0 and rank by RRF alone.
382
  """
383
  if not fused:
384
  return fused
385
 
386
+ q_norm = _normalize_for_match(user_query)
387
+ if not q_norm:
388
+ for item in fused:
389
+ item["final_score"] = item["rrf_score"]
390
+ return fused
391
 
392
+ candidate_ids = [item["arxiv_id"] for item in fused[:top_n_for_boost]]
393
+ titles: dict[str, str] = {}
394
+ citations: dict[str, int] = {}
395
+ import time as _time
396
+ _t0_turso_boost = _time.perf_counter()
397
+ try:
398
+ meta = await turso_svc.fetch_metadata_batch(candidate_ids)
399
+ titles = {aid: (m.get("title") or "") for aid, m in meta.items()}
400
+ citations = {aid: (m.get("citation_count") or 0) for aid, m in meta.items()}
401
+ except Exception as e:
402
+ print(f"[hybrid_search] Metadata fetch for boost failed: {e}")
403
+ for item in fused:
404
+ item["final_score"] = item["rrf_score"]
405
+ return fused
406
+ _turso_boost_ms = int((_time.perf_counter() - _t0_turso_boost) * 1000)
407
+ # Stash on first item so the caller can extract it
408
+ if fused:
409
+ fused[0]["_turso_boost_fetch_ms"] = _turso_boost_ms
410
 
411
+ max_rrf = max(item["rrf_score"] for item in fused)
 
 
412
 
413
+ for item in fused:
 
414
  aid = item["arxiv_id"]
415
+ t_boost = _compute_title_boost(q_norm, titles.get(aid, ""))
416
+ c_boost = _compute_citation_boost(citations.get(aid, 0))
417
+ item["title_boost"] = t_boost
418
+ item["citation_boost"] = c_boost
419
+ item["final_score"] = item["rrf_score"] + max_rrf * (t_boost + c_boost)
 
 
 
 
 
 
 
 
 
 
 
 
420
 
421
  fused.sort(key=lambda x: x["final_score"], reverse=True)
422
  return fused
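A small numeric example of the final scoring formula, `final = rrf_score + max_rrf * (title_boost + citation_boost)`, may help. All IDs, RRF scores and boost values below are invented for illustration; `rerank` is a hypothetical helper that takes the boosts as precomputed inputs instead of fetching metadata.

```python
def rerank(fused: list[dict]) -> list[dict]:
    # Scale both boosts by the best RRF score so they are comparable to it.
    max_rrf = max(item["rrf_score"] for item in fused)
    for item in fused:
        boost = item.get("title_boost", 0.0) + item.get("citation_boost", 0.0)
        item["final_score"] = item["rrf_score"] + max_rrf * boost
    return sorted(fused, key=lambda x: x["final_score"], reverse=True)


fused = [
    # Best RRF score, heavily cited, no title match.
    {"arxiv_id": "X", "rrf_score": 0.030, "title_boost": 0.0, "citation_boost": 0.167},
    # Mid RRF score, no boosts.
    {"arxiv_id": "Y", "rrf_score": 0.020, "title_boost": 0.0, "citation_boost": 0.0},
    # Worst RRF score, but an exact title match.
    {"arxiv_id": "Z", "rrf_score": 0.010, "title_boost": 2.0, "citation_boost": 0.1},
]
ranked = rerank(fused)
order = [item["arxiv_id"] for item in ranked]
```

The exact-title candidate `Z` jumps from last to first, while the citation boost alone only widens the gap between `X` and `Y` without reordering them.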
app/qdrant_svc.py CHANGED
@@ -10,6 +10,7 @@ The collection is 'arxiv_bgem3_dense' with integer point IDs and 1024-dim BGE-M3
10
  from __future__ import annotations
11
 
12
  import asyncio
 
13
  from functools import lru_cache
14
 
15
  from qdrant_client import QdrantClient
@@ -166,21 +167,75 @@ def _run_recommend(
166
 
167
 
168
  # ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
169
 
170
  async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
171
  """
172
- Fetch actual BGE-M3 embedding vectors for papers from Qdrant.
173
  Returns {arxiv_id: vector_list} for papers found.
174
 
175
- Used by EWMA profile updates: we need the paper's embedding
176
- to blend into the user's profile vector.
 
 
 
 
 
 
 
177
  """
178
  if not arxiv_ids:
179
  return {}
180
 
181
- id_map = await lookup_qdrant_ids(arxiv_ids)
 
 
 
 
 
 
 
 
 
 
 
 
 
182
  if not id_map:
183
- return {}
184
 
185
  point_ids = list(id_map.values())
186
  arxiv_by_point = {v: k for k, v in id_map.items()}
@@ -192,9 +247,8 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
192
  )
193
  except Exception as e:
194
  print(f"[qdrant_svc] get_paper_vectors error: {e}")
195
- return {}
196
 
197
- result = {}
198
  for p in points:
199
  aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
200
  if aid and p.vector:
@@ -202,6 +256,7 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
202
  vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
203
  if isinstance(vec, list):
204
  result[aid] = vec
 
205
  return result
206
 
207
 
@@ -250,6 +305,7 @@ async def search_by_vector_with_scores(
250
  query_vector: list[float],
251
  limit: int = 20,
252
  exclude_ids: set[str] | None = None,
 
253
  ) -> list[dict]:
254
  """
255
  Vector search returning both arxiv_ids AND cosine scores.
@@ -257,29 +313,43 @@ async def search_by_vector_with_scores(
257
  Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
258
  score desc, excluding any in exclude_ids.
259
 
260
- Used by the recommendation pipeline (Phase 6.1+) to feed
261
- qdrant_cosine_score (feature slot 0) to the LightGBM reranker.
 
 
262
  """
263
  loop = asyncio.get_event_loop()
264
  try:
265
  results = await loop.run_in_executor(
266
  None, _run_vector_search, query_vector,
267
  (limit * 2) if exclude_ids else limit,
 
268
  )
269
  except Exception as e:
270
  print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
271
  return []
272
 
273
  exclude = exclude_ids or set()
274
- filtered = [
275
- {"arxiv_id": r.payload["arxiv_id"], "score": float(r.score)}
276
- for r in results
277
- if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in exclude
278
- ]
279
- return filtered[:limit]
 
 
 
 
 
 
 
 
 
280
 
281
 
282
- def _run_vector_search(query_vector: list[float], limit: int) -> list:
 
 
283
  """Sync helper: nearest-neighbour search by vector."""
284
  client = _client()
285
  result = client.query_points(
@@ -287,7 +357,7 @@ def _run_vector_search(query_vector: list[float], limit: int) -> list:
287
  query=query_vector,
288
  limit=limit,
289
  with_payload=True,
290
- with_vectors=False,
291
  )
292
  return result.points
293
 
 
10
  from __future__ import annotations
11
 
12
  import asyncio
13
+ from collections import OrderedDict
14
  from functools import lru_cache
15
 
16
  from qdrant_client import QdrantClient
 
167
 
168
 
169
  # ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
170
+ #
171
+ # In-process LRU vector cache.
172
+ # Profiling showed Qdrant Cloud free tier reads candidate vectors from
173
+ # disk on every retrieve(), which dominated Tier 1 latency (9-18s for
174
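+ # 120 vectors). Vectors are 1024 floats = 4KB each as float32; stored as list[float] (as this cache does) they cost roughly 32KB each on 64-bit CPython. A 25K cap = ~100MB (float32) or ~800MB (list[float])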
175
+ # RAM ceiling. Same papers appear across users' candidate sets (Zipf),
176
+ # so steady-state hit rate is high.
177
+ #
178
+ # Vectors don't change once uploaded, so no TTL.
179
+
180
+ _VECTOR_CACHE: "OrderedDict[str, list[float]]" = OrderedDict()
181
+ _VECTOR_CACHE_MAX = 25_000
182
+
183
+
184
+ def _vec_cache_get(arxiv_id: str) -> list[float] | None:
185
+ val = _VECTOR_CACHE.get(arxiv_id)
186
+ if val is not None:
187
+ _VECTOR_CACHE.move_to_end(arxiv_id)
188
+ return val
189
+
190
+
191
+ def _vec_cache_put(arxiv_id: str, vec: list[float]) -> None:
192
+ if arxiv_id in _VECTOR_CACHE:
193
+ _VECTOR_CACHE.move_to_end(arxiv_id)
194
+ _VECTOR_CACHE[arxiv_id] = vec
195
+ return
196
+ _VECTOR_CACHE[arxiv_id] = vec
197
+ if len(_VECTOR_CACHE) > _VECTOR_CACHE_MAX:
198
+ _VECTOR_CACHE.popitem(last=False)
199
+
200
+
201
+ def vector_cache_stats() -> dict:
202
+ return {"size": len(_VECTOR_CACHE), "max": _VECTOR_CACHE_MAX}
203
+
204
 
205
  async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
206
  """
207
+ Fetch BGE-M3 embedding vectors for papers from Qdrant.
208
  Returns {arxiv_id: vector_list} for papers found.
209
 
210
+ Cached in-process by arxiv_id; only un-cached IDs hit Qdrant. The
211
+ Qdrant retrieve() that pulls the actual stored vectors is the
212
+ single most expensive call in the pipeline (BQ -> disk read), so
213
+ absorbing repeats here is a big win.
214
+
215
+ Used by:
216
+ - EWMA profile updates on save (events.py)
217
+ - Cluster medoid embedding load (recommendations.py)
218
+ - Tier 1 candidate vector fetch (recommendations.py, ~120 IDs)
219
  """
220
  if not arxiv_ids:
221
  return {}
222
 
223
+ # Cache check first: pull anything we already know.
224
+ result: dict[str, list[float]] = {}
225
+ misses: list[str] = []
226
+ for aid in arxiv_ids:
227
+ cached = _vec_cache_get(aid)
228
+ if cached is not None:
229
+ result[aid] = cached
230
+ else:
231
+ misses.append(aid)
232
+
233
+ if not misses:
234
+ return result
235
+
236
+ id_map = await lookup_qdrant_ids(misses)
237
  if not id_map:
238
+ return result
239
 
240
  point_ids = list(id_map.values())
241
  arxiv_by_point = {v: k for k, v in id_map.items()}
 
247
  )
248
  except Exception as e:
249
  print(f"[qdrant_svc] get_paper_vectors error: {e}")
250
+ return result
251
 
 
252
  for p in points:
253
  aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
254
  if aid and p.vector:
 
256
  vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
257
  if isinstance(vec, list):
258
  result[aid] = vec
259
+ _vec_cache_put(aid, vec)
260
  return result
261
 
262
 
 
305
  query_vector: list[float],
306
  limit: int = 20,
307
  exclude_ids: set[str] | None = None,
308
+ with_vectors: bool = False,
309
  ) -> list[dict]:
310
  """
311
  Vector search returning both arxiv_ids AND cosine scores.
 
313
  Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
314
  score desc, excluding any in exclude_ids.
315
 
316
+ If `with_vectors=True`, each dict also has a 'vector' key holding the
317
+ 1024-dim BGE-M3 embedding. Returning vectors in the search response
318
+ avoids a separate `client.retrieve()` round-trip later; that retrieve
319
+ was ~9-18s on cold candidates because BQ rescore reads from disk.
320
  """
321
  loop = asyncio.get_event_loop()
322
  try:
323
  results = await loop.run_in_executor(
324
  None, _run_vector_search, query_vector,
325
  (limit * 2) if exclude_ids else limit,
326
+ with_vectors,
327
  )
328
  except Exception as e:
329
  print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
330
  return []
331
 
332
  exclude = exclude_ids or set()
333
+ out: list[dict] = []
334
+ for r in results:
335
+ aid = r.payload.get("arxiv_id")
336
+ if not aid or aid in exclude:
337
+ continue
338
+ item = {"arxiv_id": aid, "score": float(r.score)}
339
+ if with_vectors and r.vector:
340
+ # Named vectors return a dict; unnamed returns a list.
341
+ vec = r.vector if isinstance(r.vector, list) else r.vector.get("dense", r.vector)
342
+ if isinstance(vec, list):
343
+ item["vector"] = vec
344
+ out.append(item)
345
+ if len(out) >= limit:
346
+ break
347
+ return out
348
 
349
 
350
+ def _run_vector_search(
351
+ query_vector: list[float], limit: int, with_vectors: bool = False,
352
+ ) -> list:
353
  """Sync helper: nearest-neighbour search by vector."""
354
  client = _client()
355
  result = client.query_points(
 
357
  query=query_vector,
358
  limit=limit,
359
  with_payload=True,
360
+ with_vectors=with_vectors,
361
  )
362
  return result.points
363
 
app/recommend/clustering.py CHANGED
@@ -17,6 +17,7 @@ Reference: Research-MultiInterest_Recommender_Architecture.md §2
17
  from __future__ import annotations
18
 
19
  import json
 
20
  from dataclasses import dataclass, field
21
  import numpy as np
22
  from scipy.cluster.hierarchy import ward, fcluster
@@ -34,6 +35,14 @@ WARD_DISTANCE_THRESHOLD = 1.5
34
  MIN_CLUSTERS = 1
35
  MAX_CLUSTERS = 7 # RFC: PinnerSage uses 3-5 for typical users, cap at 7
36
 
 
 
 
 
 
 
 
 
37
  # Minimum saved papers before clustering is meaningful
38
  MIN_PAPERS_FOR_CLUSTERING = 5
39
 
@@ -132,14 +141,36 @@ def compute_clusters(
132
  # Cut the dendrogram at the adaptive threshold
133
  labels = fcluster(linkage, t=threshold, criterion="distance")
134
 
135
- # Clamp cluster count
 
 
 
 
 
 
 
 
 
 
 
136
  unique_labels = np.unique(labels)
137
  n_clusters = len(unique_labels)
138
 
139
- # If too many clusters, re-cut with a maxclust constraint
140
  if n_clusters > MAX_CLUSTERS:
141
  labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
142
  unique_labels = np.unique(labels)
 
 
 
 
 
 
 
 
 
 
 
 
143
 
144
  # Compute recency weights (position-based: most recent = highest weight)
145
  recency_weights = np.array([
@@ -184,6 +215,49 @@ def _find_medoid(embeddings: np.ndarray, centroid: np.ndarray) -> int:
184
  return int(np.argmin(distances))
185
 
186
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
187
  # ── Cluster ID stabilisation (Phase 4.2) ─────────────────────────────────────
188
 
189
  # Hungarian matches below this cosine similarity are rejected as "unrelated".
 
17
  from __future__ import annotations
18
 
19
  import json
20
+ import math
21
  from dataclasses import dataclass, field
22
  import numpy as np
23
  from scipy.cluster.hierarchy import ward, fcluster
 
35
  MIN_CLUSTERS = 1
36
  MAX_CLUSTERS = 7 # RFC: PinnerSage uses 3-5 for typical users, cap at 7
37
 
38
+ # Average papers per cluster floor: used to derive a soft cap on K from N.
39
+ # K_soft_cap = max(MIN_CLUSTERS, ceil(N / AVG_CLUSTER_SIZE_FLOOR)).
40
+ # Set to 4: at N=5 -> K_max=2, at N=10 -> K_max=3, at N=28 -> K_max=7.
41
+ # Without this, gap-based thresholding over-splits at small N: 5 same-domain
42
+ # papers were producing K=4 (3 singletons), which then got over-weighted by
43
+ # the quota floor of 3 slots per cluster.
44
+ AVG_CLUSTER_SIZE_FLOOR = 4
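The soft-cap arithmetic in the comment above can be checked directly. The sketch below restates the formula as it is applied later in `compute_clusters`, including the hard cap; `soft_cap` is a name chosen for this example.

```python
import math

MIN_CLUSTERS = 1
MAX_CLUSTERS = 7
AVG_CLUSTER_SIZE_FLOOR = 4


def soft_cap(n_papers: int) -> int:
    # ceil(N / floor), clamped to [MIN_CLUSTERS, MAX_CLUSTERS].
    return max(
        MIN_CLUSTERS,
        min(MAX_CLUSTERS, math.ceil(n_papers / AVG_CLUSTER_SIZE_FLOOR)),
    )


caps = {n: soft_cap(n) for n in (1, 5, 10, 28, 100)}
```

For N of 28 or more the hard cap of 7 takes over, so the soft cap only changes behaviour for smaller libraries.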
45
+
46
  # Minimum saved papers before clustering is meaningful
47
  MIN_PAPERS_FOR_CLUSTERING = 5
48
 
 
141
  # Cut the dendrogram at the adaptive threshold
142
  labels = fcluster(linkage, t=threshold, criterion="distance")
143
 
144
+ # Clamp cluster count.
145
+ # Two layers:
146
+ # 1. Hard cap: never exceed MAX_CLUSTERS (=7) regardless of N.
147
+ # 2. Soft cap: keep average cluster size >= AVG_CLUSTER_SIZE_FLOOR.
148
+ # This prevents the gap-detection from over-splitting small N
149
+ # (e.g. 5 same-domain saves were producing K=4 with 3 singletons,
150
+ # which then got over-weighted by the quota floor of 3 slots).
151
+ soft_cap = max(
152
+ MIN_CLUSTERS,
153
+ min(MAX_CLUSTERS, math.ceil(n / AVG_CLUSTER_SIZE_FLOOR)),
154
+ )
155
+
156
  unique_labels = np.unique(labels)
157
  n_clusters = len(unique_labels)
158
 
 
159
  if n_clusters > MAX_CLUSTERS:
160
  labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
161
  unique_labels = np.unique(labels)
162
+ n_clusters = len(unique_labels)
163
+
164
+ if n_clusters > soft_cap:
165
+ labels = fcluster(linkage, t=soft_cap, criterion="maxclust")
166
+ unique_labels = np.unique(labels)
167
+ n_clusters = len(unique_labels)
168
+
169
+ # Final safety net: merge any remaining singleton clusters into their
170
+ # nearest non-singleton neighbour. The soft cap usually eliminates them,
171
+ # but a 6-1-1-1 distribution after maxclust=4 would still leave 3.
172
+ labels = _merge_singletons(labels, embeddings)
173
+ unique_labels = np.unique(labels)
174
 
175
  # Compute recency weights (position-based: most recent = highest weight)
176
  recency_weights = np.array([
 
215
  return int(np.argmin(distances))
216
 
217
 
218
+ def _merge_singletons(labels: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
219
+ """Merge singleton clusters into their nearest non-singleton cluster.
220
+
221
+ Why: Ward's gap-based threshold can over-split at small N, producing
222
+ 1-paper clusters that get over-weighted by the quota floor (3 slots
223
+ per cluster regardless of importance). Merging singletons into the
224
+ nearest non-singleton cluster preserves the multi-interest signal
225
+ where it's real and removes spurious singletons where it's noise.
226
+
227
+ Edge case: if every cluster is a singleton (all papers maximally
228
+ distant), we leave the labels alone; collapsing them would erase
229
+ a genuine multi-interest signal.
230
+ """
231
+ unique_labels, counts = np.unique(labels, return_counts=True)
232
+ singleton_labels = unique_labels[counts == 1]
233
+ non_singleton_labels = unique_labels[counts > 1]
234
+
235
+ if len(singleton_labels) == 0:
236
+ return labels # nothing to merge
237
+ if len(non_singleton_labels) == 0:
238
+ return labels # all singletons: keep as is
239
+
240
+ centroids: dict[int, np.ndarray] = {}
241
+ for ns_label in non_singleton_labels:
242
+ ns_mask = labels == ns_label
243
+ centroids[int(ns_label)] = embeddings[ns_mask].mean(axis=0)
244
+
245
+ new_labels = labels.copy()
246
+ for s_label in singleton_labels:
247
+ s_idx = int(np.where(labels == s_label)[0][0])
248
+ s_emb = embeddings[s_idx]
249
+ best_label = int(s_label)
250
+ best_dist = float("inf")
251
+ for ns_label, centroid in centroids.items():
252
+ d = float(np.linalg.norm(s_emb - centroid))
253
+ if d < best_dist:
254
+ best_dist = d
255
+ best_label = ns_label
256
+ new_labels[s_idx] = best_label
257
+
258
+ return new_labels
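For readers who want to trace the merge on a tiny input, here is a dependency-free sketch of the same idea using plain lists and `math.dist` in place of NumPy. The function name `merge_singletons` and the 2-D toy points are assumptions for the example; the real code operates on embedding arrays.

```python
import math
from collections import Counter


def merge_singletons(
    labels: list[int], points: list[tuple[float, ...]]
) -> list[int]:
    counts = Counter(labels)
    multi = [lab for lab, c in counts.items() if c > 1]
    # Nothing to merge, or every cluster is a singleton: leave labels alone.
    if len(multi) == len(counts) or not multi:
        return list(labels)

    # Centroid of each non-singleton cluster (per-axis mean).
    centroids: dict[int, tuple[float, ...]] = {}
    for lab in multi:
        members = [p for p, l in zip(points, labels) if l == lab]
        centroids[lab] = tuple(sum(axis) / len(members) for axis in zip(*members))

    merged = list(labels)
    for i, lab in enumerate(labels):
        if counts[lab] == 1:
            # Reassign the singleton to the nearest non-singleton centroid.
            merged[i] = min(
                centroids, key=lambda c: math.dist(points[i], centroids[c])
            )
    return merged


points = [(0.0, 0.0), (0.0, 1.0), (0.2, 0.5), (5.0, 5.0), (5.0, 6.0)]
merged = merge_singletons([1, 1, 2, 3, 3], points)
untouched = merge_singletons([1, 2, 3], points[:3])
```

The lone point labelled 2 sits next to cluster 1's centroid at (0, 0.5), so it is absorbed there, while the all-singleton input is returned unchanged.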
259
+
260
+
261
  # ── Cluster ID stabilisation (Phase 4.2) ─────────────────────────────────────
262
 
263
  # Hungarian matches below this cosine similarity are rejected as "unrelated".
app/recommend/reranker.py CHANGED
@@ -45,7 +45,7 @@ try:
45
  if _path and os.path.isfile(_path):
46
  _lgb_model = lgb.Booster(model_file=_path)
47
  _USE_LGB = True
48
- print(f"[reranker] ✅ LightGBM model loaded from {_path}")
49
  print(f"[reranker] trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
50
  break
51
 
 
45
  if _path and os.path.isfile(_path):
46
  _lgb_model = lgb.Booster(model_file=_path)
47
  _USE_LGB = True
48
+ print(f"[reranker] SUCCESS: LightGBM model loaded from {_path}")
49
  print(f"[reranker] trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
50
  break
51
 
app/routers/onboarding.py CHANGED
@@ -9,7 +9,7 @@ POST /api/onboarding/skip → mark done (no categories), redirect to /
9
  """
10
  import uuid
11
  import json
12
- from fastapi import APIRouter, Request, Cookie, Form
13
  from fastapi.responses import HTMLResponse, RedirectResponse
14
  from app import db
15
  from app.config import COOKIE_NAME, CATEGORY_GROUPS
@@ -116,20 +116,14 @@ async def seed_search(
116
  except Exception:
117
  pass
118
 
119
- # Check current save count
120
- from app import user_state as us
121
- state = await us.ensure_loaded(user_id)
122
- seed_count = len(state.positives)
123
-
124
  resp = templates.TemplateResponse(
125
  request,
126
- "partials/seed_search.html",
127
- {
128
- "papers": papers,
129
- "query": q,
130
- "seed_count": seed_count,
131
- "seed_target": 5,
132
- },
133
  )
134
  resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
135
  return resp
@@ -161,90 +155,4 @@ async def skip_onboarding(
161
  return resp
162
 
163
 
164
- @router.post("/api/onboarding/import-author", response_class=HTMLResponse)
165
- async def import_author(
166
- request: Request,
167
- author_url: str = Form(default=""),
168
- user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
169
- ):
170
- """Phase 5.1: Import papers from a Semantic Scholar author profile.
171
-
172
- Accepts S2 URL, raw S2 author ID, or ORCID.
173
- Auto-saves the author's arXiv papers as seed interests.
174
- """
175
- user_id = user_id or str(uuid.uuid4())
176
-
177
- if not author_url.strip():
178
- return HTMLResponse(
179
- '<div class="alert alert-warning text-sm py-2">'
180
- '⚠️ Please paste a Semantic Scholar author URL, ID, or ORCID.</div>'
181
- )
182
-
183
- from app import s2_svc, user_state as us
184
-
185
- # 1. Parse input
186
- parsed_id, input_type = s2_svc.parse_author_input(author_url)
187
- if parsed_id is None:
188
- return HTMLResponse(
189
- '<div class="alert alert-error text-sm py-2">'
190
- '❌ Could not recognise input. Paste a Semantic Scholar author URL, '
191
- 'a numeric author ID, or an ORCID (e.g. 0000-0003-3394-6622).</div>'
192
- )
193
-
194
- # 2. Resolve ORCID → S2 author ID if needed
195
- try:
196
- if input_type == "orcid":
197
- s2_id = await s2_svc.resolve_orcid(parsed_id)
198
- if not s2_id:
199
- return HTMLResponse(
200
- '<div class="alert alert-warning text-sm py-2">'
201
- f'⚠️ No Semantic Scholar author found for ORCID {parsed_id}.</div>'
202
- )
203
- else:
204
- s2_id = parsed_id
205
- except Exception as e:
206
- print(f"[onboarding] ORCID resolve failed: {e}")
207
- return HTMLResponse(
208
- '<div class="alert alert-error text-sm py-2">'
209
- '❌ Failed to look up ORCID. Please try pasting the S2 URL directly.</div>'
210
- )
211
-
212
- # 3. Fetch arXiv papers
213
- try:
214
- arxiv_ids = await s2_svc.fetch_author_arxiv_papers(s2_id, limit=20)
215
- except Exception as e:
216
- print(f"[onboarding] S2 author paper fetch failed: {e}")
217
- return HTMLResponse(
218
- '<div class="alert alert-error text-sm py-2">'
219
- '❌ Failed to fetch papers from Semantic Scholar. '
220
- 'The author ID may be invalid, or the API may be down.</div>'
221
- )
222
-
223
- if not arxiv_ids:
224
- return HTMLResponse(
225
- '<div class="alert alert-warning text-sm py-2">'
226
- '⚠️ No arXiv papers found for this author. '
227
- 'They may publish in venues not indexed on arXiv.</div>'
228
- )
229
-
230
- # 4. Auto-save each paper as a positive interaction
231
- for aid in arxiv_ids:
232
- us.record_positive(user_id, aid)
233
- await db.log_interaction(
234
- user_id=user_id,
235
- paper_id=aid,
236
- event_type="save",
237
- source="s2_import",
238
- )
239
-
240
- state = await us.ensure_loaded(user_id)
241
- seed_count = len(state.positives)
242
 
243
- resp = HTMLResponse(
244
- f'<div class="alert alert-success text-sm py-2">'
245
- f'✅ Imported {len(arxiv_ids)} papers! '
246
- f'You now have {seed_count} saved papers. '
247
- f'Click <strong>"Done — start exploring →"</strong> to see your recommendations.</div>'
248
- )
249
- resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
250
- return resp
 
9
  """
10
  import uuid
11
  import json
12
+ from fastapi import APIRouter, Request, Cookie
13
  from fastapi.responses import HTMLResponse, RedirectResponse
14
  from app import db
15
  from app.config import COOKIE_NAME, CATEGORY_GROUPS
 
116
  except Exception:
117
  pass
118
 
119
+ # HTMX request: return ONLY the results partial (swap target = #seed-results).
120
+ # The full seed_search.html panel is rendered by save_categories() during the
121
+ # step 1 → step 2 transition; subsequent searches must not re-render the whole
122
+ # panel or it nests inside #seed-results and duplicates the wizard.
 
123
  resp = templates.TemplateResponse(
124
  request,
125
+ "partials/seed_results.html",
126
+ {"papers": papers, "query": q},
 
 
 
 
 
127
  )
128
  resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
129
  return resp
 
155
  return resp
156
 
157
 
158
 
app/routers/recommendations.py CHANGED
@@ -16,6 +16,7 @@ Phase 4 changes vs Phase 2b:
16
  - Category-level suppression filters strongly disliked topics (4.3)
17
  """
18
  import asyncio
 
19
  import uuid
20
  import numpy as np
21
  from fastapi import APIRouter, Request, Cookie
@@ -110,9 +111,11 @@ async def get_recommendations(
110
  # populated by whichever tier serves the result.
111
  paper_tags: dict[str, dict] = {}
112
  rec_arxiv_ids: list[str] = []
 
 
113
 
114
  # ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
115
- rec_arxiv_ids, paper_tags = await _multi_interest_recommend(
116
  user_id, state, seen, REC_LIMIT, query_id=query_id,
117
  )
118
 
@@ -151,6 +154,7 @@ async def get_recommendations(
151
  return _empty_resp()
152
 
153
  # Phase 3.5: Turso primary, arXiv API fallback
 
154
  meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
155
  missing = [aid for aid in rec_arxiv_ids if aid not in meta]
156
  if missing:
@@ -159,6 +163,8 @@ async def get_recommendations(
159
  meta.update(arxiv_meta)
160
  except Exception as e:
161
  print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
 
 
162
 
163
  # Cache to SQLite so category suppression JOINs work (Phase 4.3)
164
  await db.cache_turso_metadata_batch(list(meta.values()))
@@ -187,7 +193,12 @@ async def get_recommendations(
187
  resp = templates.TemplateResponse(
188
  request,
189
  "partials/recommendations.html",
190
- {"papers": papers},
 
 
 
 
 
191
  )
192
  resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
193
  return resp
@@ -210,18 +221,20 @@ async def _multi_interest_recommend(
210
  7. MMR diversity → select top-k with diversity
211
  8. Exploration injection → serendipitous papers
212
 
213
- Returns ([], {}) to trigger fallback to Tier 2.
214
  Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
215
  """
216
  positives = state.positive_list
217
  if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
218
- return [], {}
219
 
220
  try:
221
  # Fetch embeddings for all saved papers
222
  vectors = await qdrant_svc.get_paper_vectors(positives)
223
  if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
224
- return [], {}
 
 
225
 
226
  # Build aligned arrays (only papers we got vectors for)
227
  aligned_ids = [pid for pid in positives if pid in vectors]
@@ -230,6 +243,7 @@ async def _multi_interest_recommend(
230
  )
231
 
232
  # ── Step 1: Compute interest clusters ─────────────────────────────
 
233
  clusters = compute_clusters(aligned_ids, aligned_embs)
234
 
235
  # ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
@@ -267,6 +281,7 @@ async def _multi_interest_recommend(
267
  clusters = stabilize_cluster_ids(clusters, old_clusters)
268
 
269
  await save_clusters_to_db(user_id, clusters)
 
270
 
271
  # Phase 6.5 B3: append snapshot for cluster history (non-blocking)
272
  try:
@@ -289,8 +304,15 @@ async def _multi_interest_recommend(
289
  quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
290
 
291
  # ── Step 3: Parallel per-cluster ANN searches ─────────────────────
 
292
  st_vec = await profiles.load_profile(user_id, "short_term")
293
 
 
 
 
 
 
 
294
  search_tasks = [
295
  qdrant_svc.search_by_vector_with_scores(
296
  query_vector=c.medoid_embedding.tolist(),
@@ -301,20 +323,16 @@ async def _multi_interest_recommend(
301
  ]
302
  per_cluster_scored = await asyncio.gather(*search_tasks)
303
 
304
- # Build paper → cluster map AND real qdrant_score_map in one pass.
305
- # Phase 6.5 A1: replaces the old rank-based linear decay approximation.
306
  paper_cluster_map: dict[str, int] = {}
307
  qdrant_score_map: dict[str, float] = {}
308
  for cluster, scored_results in zip(clusters, per_cluster_scored):
309
  for hit in scored_results:
310
  aid = hit["arxiv_id"]
311
- if aid not in paper_cluster_map: # first-occurrence wins
312
  paper_cluster_map[aid] = cluster.cluster_idx
313
- # Keep highest cosine if a paper appears in multiple clusters
314
  if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
315
  qdrant_score_map[aid] = float(hit["score"])
316
 
317
- # merge_quota_results expects list[list[str]] — extract IDs
318
  per_cluster_ids = [
319
  [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
320
  ]
@@ -337,9 +355,14 @@ async def _multi_interest_recommend(
337
  qdrant_score_map[aid] = float(hit["score"])
338
 
339
  if not candidate_ids:
340
- return [], {}
 
341
 
342
  # ── Step 5: Fetch candidate vectors + metadata ────────────────────
 
 
 
 
343
  cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
344
  cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
345
  cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
@@ -356,7 +379,8 @@ async def _multi_interest_recommend(
356
  # Only process candidates with both vectors and metadata
357
  valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
358
  if not valid_ids:
359
- return candidate_ids[:limit], {}
 
360
 
361
  valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
362
  valid_meta = [cand_meta[cid] for cid in valid_ids]
@@ -427,6 +451,7 @@ async def _multi_interest_recommend(
427
  )
428
 
429
  # ── Step 6: LightGBM re-ranking (37 features) ────────────────────
 
430
  reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
431
  candidate_ids=valid_ids,
432
  candidate_embeddings=valid_embs,
@@ -443,6 +468,8 @@ async def _multi_interest_recommend(
443
  user_total_saves=user_total_saves,
444
  user_total_dismissals=user_total_dismissals,
445
  )
 
 
446
 
447
  # ── Step 4.3: Category suppression (post-rerank safety net) ───────
448
  # The model now sees feature 25 (is_suppressed_category), but we
@@ -459,6 +486,7 @@ async def _multi_interest_recommend(
459
  reranked_embs = reranked_embs[kept]
460
 
461
  # ── Step 7: MMR diversity enforcement ─────────────────────────────
 
462
  query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
463
  mmr_selected = mmr_rerank(
464
  query_embedding=query_vec,
@@ -468,6 +496,7 @@ async def _multi_interest_recommend(
468
  lambda_param=0.6,
469
  top_k=limit,
470
  )
 
471
 
472
  # ── Step 8: Exploration injection ─────────────────────────────────
473
  final = inject_exploration(
@@ -508,11 +537,11 @@ async def _multi_interest_recommend(
508
  "policy_id": _RANKER_VERSION,
509
  }
510
 
511
- return final, paper_tags
512
 
513
  except Exception as e:
514
- print(f"[recommendations] multi-interest search failed: {e}")
515
- return [], {}
516
 
517
 
518
  # ── Tier 2: EWMA single-vector search ────────────────────────────────────────
 
16
  - Category-level suppression filters strongly disliked topics (4.3)
17
  """
18
  import asyncio
19
+ import time
20
  import uuid
21
  import numpy as np
22
  from fastapi import APIRouter, Request, Cookie
 
111
  # populated by whichever tier serves the result.
112
  paper_tags: dict[str, dict] = {}
113
  rec_arxiv_ids: list[str] = []
114
+ rerank_time_ms = 0
115
+ timing_breakdown: dict = {}
116
 
117
  # ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
118
+ rec_arxiv_ids, paper_tags, rerank_time_ms, timing_breakdown = await _multi_interest_recommend(
119
  user_id, state, seen, REC_LIMIT, query_id=query_id,
120
  )
121
 
 
154
  return _empty_resp()
155
 
156
  # Phase 3.5: Turso primary, arXiv API fallback
157
+ t0_meta = time.time()
158
  meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
159
  missing = [aid for aid in rec_arxiv_ids if aid not in meta]
160
  if missing:
 
163
  meta.update(arxiv_meta)
164
  except Exception as e:
165
  print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
166
+ t1_meta = time.time()
167
+ meta_time_ms = int((t1_meta - t0_meta) * 1000)
168
 
169
  # Cache to SQLite so category suppression JOINs work (Phase 4.3)
170
  await db.cache_turso_metadata_batch(list(meta.values()))
 
193
  resp = templates.TemplateResponse(
194
  request,
195
  "partials/recommendations.html",
196
+ {
197
+ "papers": papers,
198
+ "rerank_time_ms": rerank_time_ms,
199
+ "meta_time_ms": meta_time_ms,
200
+ "timing": timing_breakdown,
201
+ },
202
  )
203
  resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
204
  return resp
 
221
  7. MMR diversity → select top-k with diversity
222
  8. Exploration injection → serendipitous papers
223
 
224
+ Returns (ids, paper_tags, rerank_time_ms, timing); ([], {}, 0, {}) triggers fallback to Tier 2.
225
  Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
226
  """
227
  positives = state.positive_list
228
  if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
229
+ return [], {}, 0, {}
230
 
231
  try:
232
  # Fetch embeddings for all saved papers
233
  vectors = await qdrant_svc.get_paper_vectors(positives)
234
  if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
235
+ return [], {}, 0, {}
236
+
237
+ timing = {} # Collect per-stage timing breakdown
238
 
239
  # Build aligned arrays (only papers we got vectors for)
240
  aligned_ids = [pid for pid in positives if pid in vectors]
 
243
  )
244
 
245
  # ── Step 1: Compute interest clusters ─────────────────────────────
246
+ t0_cluster = time.time()
247
  clusters = compute_clusters(aligned_ids, aligned_embs)
248
 
249
  # ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
 
281
  clusters = stabilize_cluster_ids(clusters, old_clusters)
282
 
283
  await save_clusters_to_db(user_id, clusters)
284
+ timing["clustering_ms"] = int((time.time() - t0_cluster) * 1000)
285
 
286
  # Phase 6.5 B3: append snapshot for cluster history (non-blocking)
287
  try:
 
304
  quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
305
 
306
  # ── Step 3: Parallel per-cluster ANN searches ─────────────────────
307
+ t0_ann = time.time()
308
  st_vec = await profiles.load_profile(user_id, "short_term")
309
 
310
+ # NOTE on latency: we previously tried passing with_vectors=True
311
+ # to fold the candidate-vector fetch into the search call. That
312
+ # made it *worse* on Qdrant Cloud free tier — search latency
313
+ # ballooned from ~2s to ~40s because returning vectors triggers
314
+ # a per-result disk read inside the search path. Keep the search
315
+ # vector-free; vectors come from a separate cached retrieve.
316
  search_tasks = [
317
  qdrant_svc.search_by_vector_with_scores(
318
  query_vector=c.medoid_embedding.tolist(),
 
323
  ]
324
  per_cluster_scored = await asyncio.gather(*search_tasks)
325
 
 
 
326
  paper_cluster_map: dict[str, int] = {}
327
  qdrant_score_map: dict[str, float] = {}
328
  for cluster, scored_results in zip(clusters, per_cluster_scored):
329
  for hit in scored_results:
330
  aid = hit["arxiv_id"]
331
+ if aid not in paper_cluster_map:
332
  paper_cluster_map[aid] = cluster.cluster_idx
 
333
  if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
334
  qdrant_score_map[aid] = float(hit["score"])
335
 
 
336
  per_cluster_ids = [
337
  [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
338
  ]
 
355
  qdrant_score_map[aid] = float(hit["score"])
356
 
357
  if not candidate_ids:
358
+ return [], {}, 0, {}
359
+ timing["ann_retrieval_ms"] = int((time.time() - t0_ann) * 1000)
360
 
361
  # ── Step 5: Fetch candidate vectors + metadata ────────────────────
362
+ # get_paper_vectors is now LRU-cached by arxiv_id (qdrant_svc),
363
+ # so warm cache makes this cheap; only fresh papers pay the
364
+ # disk-read cost.
365
+ t0_cand_meta = time.time()
366
  cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
367
  cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
368
  cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
 
379
  # Only process candidates with both vectors and metadata
380
  valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
381
  if not valid_ids:
382
+ return candidate_ids[:limit], {}, 0, {}
383
+ timing["candidate_meta_ms"] = int((time.time() - t0_cand_meta) * 1000)
384
 
385
  valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
386
  valid_meta = [cand_meta[cid] for cid in valid_ids]
 
451
  )
452
 
453
  # ── Step 6: LightGBM re-ranking (37 features) ────────────────────
454
+ t0_rerank = time.time()
455
  reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
456
  candidate_ids=valid_ids,
457
  candidate_embeddings=valid_embs,
 
468
  user_total_saves=user_total_saves,
469
  user_total_dismissals=user_total_dismissals,
470
  )
471
+ t1_rerank = time.time()
472
+ rerank_time_ms = int((t1_rerank - t0_rerank) * 1000)
473
 
474
  # ── Step 4.3: Category suppression (post-rerank safety net) ───────
475
  # The model now sees feature 25 (is_suppressed_category), but we
 
486
  reranked_embs = reranked_embs[kept]
487
 
488
  # ── Step 7: MMR diversity enforcement ─────────────────────────────
489
+ t0_mmr = time.time()
490
  query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
491
  mmr_selected = mmr_rerank(
492
  query_embedding=query_vec,
 
496
  lambda_param=0.6,
497
  top_k=limit,
498
  )
499
+ timing["mmr_ms"] = int((time.time() - t0_mmr) * 1000)
500
 
501
  # ── Step 8: Exploration injection ─────────────────────────────────
502
  final = inject_exploration(
 
537
  "policy_id": _RANKER_VERSION,
538
  }
539
 
540
+ return final, paper_tags, rerank_time_ms, timing
541
 
542
  except Exception as e:
543
+ print(f"[recommendations] multi-interest preprocessing failed: {e}")
544
+ return [], {}, 0, {}
545
 
546
 
547
  # ── Tier 2: EWMA single-vector search ────────────────────────────────────────
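The Step 3 loop in this file builds two maps from the per-cluster ANN hits: a paper is assigned to the first cluster that returned it, while its score is the highest cosine seen across all clusters. A minimal standalone sketch of that rule follows. The helper name `build_maps` and the sample hits are hypothetical and are not part of the repository.

```python
def build_maps(cluster_ids, per_cluster_hits):
    """Assign each paper to the first cluster that returned it; keep its best score."""
    paper_cluster_map = {}
    score_map = {}
    for cluster_idx, hits in zip(cluster_ids, per_cluster_hits):
        for hit in hits:
            aid = hit["arxiv_id"]
            # First occurrence wins for cluster membership.
            if aid not in paper_cluster_map:
                paper_cluster_map[aid] = cluster_idx
            # Highest score wins when a paper appears in several clusters.
            if aid not in score_map or hit["score"] > score_map[aid]:
                score_map[aid] = float(hit["score"])
    return paper_cluster_map, score_map


# Hypothetical hits: the first paper is returned by both clusters.
hits = [
    [{"arxiv_id": "2401.00001", "score": 0.71}, {"arxiv_id": "2401.00002", "score": 0.65}],
    [{"arxiv_id": "2401.00001", "score": 0.83}],
]
cluster_map, scores = build_maps([0, 1], hits)
```

Under this rule a paper's cluster label and its score can come from different clusters, which is worth keeping in mind when reading the per-cluster tags downstream.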
app/routers/search.py CHANGED
@@ -27,17 +27,23 @@ async def search(
27
  q: str = "",
28
  user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
29
  ):
 
 
 
30
  papers = []
31
  if q.strip():
32
  # Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
33
  try:
34
- arxiv_ids = await hybrid_search_svc.search(q.strip(), limit=ARXIV_MAX_RESULTS)
 
 
35
  except Exception as e:
36
  print(f"[search] Hybrid search error: {e}")
37
  arxiv_ids = []
38
 
39
  if arxiv_ids:
40
  # Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
 
41
  try:
42
  meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
43
  except Exception as e:
@@ -52,6 +58,8 @@ async def search(
52
  meta.update(arxiv_meta)
53
  except Exception as e:
54
  print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
 
 
55
 
56
  # Phase 4.3: Cache to SQLite so dismissal category JOINs work
57
  await db.cache_turso_metadata_batch(list(meta.values()))
@@ -66,6 +74,8 @@ async def search(
66
  except Exception as e:
67
  print(f"[search] arXiv fallback also failed: {e}")
68
  papers = []
 
 
69
 
70
  user_id = user_id or str(uuid.uuid4())
71
  # Phase 6.5 B1: one query_id per search request for per-feed CTR
@@ -86,7 +96,7 @@ async def search(
86
  resp = templates.TemplateResponse(
87
  request,
88
  "partials/search_results.html",
89
- {"papers": papers, "query": q},
90
  )
91
  else:
92
  resp = templates.TemplateResponse(
@@ -96,6 +106,7 @@ async def search(
96
  "papers": papers,
97
  "query": q,
98
  "has_recs": state.has_enough_for_recs(),
 
99
  },
100
  )
101
 
 
27
  q: str = "",
28
  user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
29
  ):
30
+ import time
31
+ start_time = time.perf_counter()
32
+ search_meta = {}
33
  papers = []
34
  if q.strip():
35
  # Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
36
  try:
37
+ arxiv_ids, search_meta = await hybrid_search_svc.search(
38
+ q.strip(), limit=ARXIV_MAX_RESULTS, return_meta=True
39
+ )
40
  except Exception as e:
41
  print(f"[search] Hybrid search error: {e}")
42
  arxiv_ids = []
43
 
44
  if arxiv_ids:
45
  # Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
46
+ t0_meta = time.perf_counter()
47
  try:
48
  meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
49
  except Exception as e:
 
58
  meta.update(arxiv_meta)
59
  except Exception as e:
60
  print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
61
+
62
+ search_meta["meta_time_ms"] = int((time.perf_counter() - t0_meta) * 1000)
63
 
64
  # Phase 4.3: Cache to SQLite so dismissal category JOINs work
65
  await db.cache_turso_metadata_batch(list(meta.values()))
 
74
  except Exception as e:
75
  print(f"[search] arXiv fallback also failed: {e}")
76
  papers = []
77
+
78
+ search_meta["total_time_ms"] = int((time.perf_counter() - start_time) * 1000)
79
 
80
  user_id = user_id or str(uuid.uuid4())
81
  # Phase 6.5 B1: one query_id per search request for per-feed CTR
 
96
  resp = templates.TemplateResponse(
97
  request,
98
  "partials/search_results.html",
99
+ {"papers": papers, "query": q, "search_meta": search_meta},
100
  )
101
  else:
102
  resp = templates.TemplateResponse(
 
106
  "papers": papers,
107
  "query": q,
108
  "has_recs": state.has_enough_for_recs(),
109
+ "search_meta": search_meta,
110
  },
111
  )
112
 
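Both routers in this commit time their stages with paired start/stop calls. One possible way to reduce that repetition is a small context manager that writes elapsed milliseconds into a dict. This is only a sketch under that assumption: `stage_timer` is a hypothetical helper, not something the commit adds.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timing, key):
    """Store the wall-clock duration of the enclosed block in timing[key], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on normal exit, early return, and exceptions alike.
        timing[key] = int((time.perf_counter() - start) * 1000)


timing = {}
with stage_timer(timing, "sleep_ms"):
    time.sleep(0.02)  # stand-in for a pipeline stage
```

Because the duration is recorded in `finally`, a stage that exits early still leaves an entry in the dict, whereas the hand-written pattern only records a stage when execution reaches the line after it.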
app/s2_svc.py DELETED
@@ -1,111 +0,0 @@
1
- """
2
- Semantic Scholar service — Phase 5.1 (author import for onboarding).
3
-
4
- Accepts an S2 author URL, a raw S2 author ID, or an ORCID, then
5
- fetches that author's papers and returns arXiv IDs for auto-saving.
6
-
7
- API docs: https://api.semanticscholar.org/api-docs/graph
8
- """
9
- from __future__ import annotations
10
-
11
- import re
12
- import httpx
13
- from app.config import S2_API_KEY
14
-
15
- _BASE = "https://api.semanticscholar.org/graph/v1"
16
- _TIMEOUT = 15.0 # seconds
17
-
18
- # ── Patterns ──────────────────────────────────────────────────────────────────
19
- # URL: https://www.semanticscholar.org/author/Yoshua-Bengio/1751762
20
- # Raw: 1751762
21
- # ORCID: 0000-0003-3394-6622
22
- _S2_URL_RE = re.compile(
23
- r"semanticscholar\.org/author/[^/]+/(\d+)", re.IGNORECASE
24
- )
25
- _ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
26
- _RAW_ID_RE = re.compile(r"^\d{3,}$") # 3+ digits = plausible S2 author ID
27
-
28
-
29
- def _headers() -> dict[str, str]:
30
- """Build request headers, including API key if available."""
31
- h: dict[str, str] = {"Accept": "application/json"}
32
- if S2_API_KEY:
33
- h["x-api-key"] = S2_API_KEY
34
- return h
35
-
36
-
37
- # ── Public API ────────────────────────────────────────────────────────────────
38
-
39
- def parse_author_input(text: str) -> tuple[str | None, str]:
40
- """Parse user-provided text into an S2 author ID or ORCID.
41
-
42
- Returns (s2_author_id | None, input_type) where input_type is one of:
43
- "s2_url", "s2_id", "orcid", "unknown"
44
- """
45
- text = text.strip()
46
- if not text:
47
- return None, "unknown"
48
-
49
- # 1. Try S2 URL
50
- m = _S2_URL_RE.search(text)
51
- if m:
52
- return m.group(1), "s2_url"
53
-
54
- # 2. Try ORCID
55
- m = _ORCID_RE.search(text)
56
- if m:
57
- return m.group(0), "orcid"
58
-
59
- # 3. Try raw numeric ID
60
- if _RAW_ID_RE.match(text):
61
- return text, "s2_id"
62
-
63
- return None, "unknown"
64
-
65
-
66
- async def resolve_orcid(orcid: str) -> str | None:
67
- """Resolve an ORCID to an S2 author ID via the author search endpoint.
68
-
69
- Returns the S2 authorId string or None if not found.
70
- """
71
- url = f"{_BASE}/author/search"
72
- params = {"query": orcid, "limit": 1}
73
- async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
74
- resp = await client.get(url, params=params, headers=_headers())
75
- resp.raise_for_status()
76
- data = resp.json()
77
- authors = data.get("data", [])
78
- if authors:
79
- return str(authors[0]["authorId"])
80
- return None
81
-
82
-
83
- async def fetch_author_arxiv_papers(
84
- author_id: str, limit: int = 50,
85
- ) -> list[str]:
86
- """Fetch an author's papers from S2 and return arXiv IDs.
87
-
88
- Filters to papers that have an ArXiv external ID.
89
- Returns at most `limit` arXiv IDs, ordered by citation count (desc).
90
- """
91
- url = f"{_BASE}/author/{author_id}/papers"
92
- params = {
93
- "fields": "externalIds,citationCount",
94
- "limit": min(limit * 2, 500), # over-fetch since not all have arXiv IDs
95
- }
96
- arxiv_ids: list[tuple[int, str]] = [] # (citation_count, arxiv_id)
97
-
98
- async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
99
- resp = await client.get(url, params=params, headers=_headers())
100
- resp.raise_for_status()
101
- data = resp.json()
102
- for paper in data.get("data", []):
103
- ext = paper.get("externalIds") or {}
104
- arxiv_id = ext.get("ArXiv")
105
- if arxiv_id:
106
- cites = paper.get("citationCount") or 0
107
- arxiv_ids.append((cites, arxiv_id))
108
-
109
- # Sort by citation count descending so we import the most impactful first
110
- arxiv_ids.sort(key=lambda x: x[0], reverse=True)
111
- return [aid for _, aid in arxiv_ids[:limit]]
app/templates/index.html CHANGED
@@ -13,20 +13,13 @@
13
  <p class="text-sm text-base-content/60 mb-4">
14
  Search arXiv, save papers you like — get personalised recommendations.
15
  </p>
16
- <form hx-get="/search"
17
- hx-target="#search-results"
18
- hx-push-url="true"
19
- hx-indicator="#search-spinner"
20
- class="flex gap-2">
21
  <input type="text"
22
  name="q"
23
  placeholder="e.g. transformer attention mechanism"
24
  class="input input-bordered flex-1"
25
  autofocus />
26
- <button class="btn btn-primary" type="submit">
27
- Search
28
- <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
29
- </button>
30
  </form>
31
  </div>
32
 
@@ -57,8 +50,5 @@
57
  </div>
58
  </div>
59
 
60
- <!-- Search results (swapped in by HTMX) -->
61
- <div id="search-results"></div>
62
-
63
  </div>
64
  {% endblock %}
 
13
  <p class="text-sm text-base-content/60 mb-4">
14
  Search arXiv, save papers you like — get personalised recommendations.
15
  </p>
16
+ <form action="/search" method="get" class="flex gap-2">
 
 
 
 
17
  <input type="text"
18
  name="q"
19
  placeholder="e.g. transformer attention mechanism"
20
  class="input input-bordered flex-1"
21
  autofocus />
22
+ <button class="btn btn-primary" type="submit">Search</button>
 
 
 
23
  </form>
24
  </div>
25
 
 
50
  </div>
51
  </div>
52
 
 
 
 
53
  </div>
54
  {% endblock %}
app/templates/partials/paper_card.html CHANGED
@@ -9,6 +9,11 @@
9
  {% set position = position | default(0) %}
10
  {% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
11
 
 
 
 
 
 
12
  {# Category badge colour mapping #}
13
  {% set cat = paper.category | default("") %}
14
  {% if cat.startswith("cs.") %}
@@ -43,19 +48,19 @@
43
  {% endif %}
44
  </div>
45
 
46
- <!-- Meta: arXiv ID + year + citations -->
47
  <div class="text-xs text-base-content/50 mono">
48
  [{{ paper.arxiv_id }}]
49
  {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
50
- {% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al.{% endif %}</span>{% endif %}
51
  {% if paper.citation_count %}
52
  · <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">📊 {{ paper.citation_count }} citations</span>
53
  {% endif %}
54
  </div>
55
 
56
- <!-- Abstract (truncated) -->
57
- <p class="text-sm text-base-content/75 line-clamp-3">
58
- {{ paper.abstract }}
59
  </p>
60
 
61
  <!-- Action buttons (HTMX-powered, swap themselves on click) -->
 
9
  {% set position = position | default(0) %}
10
  {% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
11
 
12
+ {# Fallback: if tojson_parse returned empty but authors is a non-empty string, split by comma #}
13
+ {% if not authors_list and paper.authors %}
14
+ {% set authors_list = paper.authors.split(", ") %}
15
+ {% endif %}
16
+
17
  {# Category badge colour mapping #}
18
  {% set cat = paper.category | default("") %}
19
  {% if cat.startswith("cs.") %}
 
48
  {% endif %}
49
  </div>
50
 
51
+ <!-- Meta: arXiv ID + year + authors (max 3) + citations -->
52
  <div class="text-xs text-base-content/50 mono">
53
  [{{ paper.arxiv_id }}]
54
  {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
55
+ {% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al. ({{ authors_list | length }} authors){% endif %}</span>{% endif %}
56
  {% if paper.citation_count %}
57
  · <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">📊 {{ paper.citation_count }} citations</span>
58
  {% endif %}
59
  </div>
60
 
61
+ <!-- Abstract (hard-truncated to 500 chars + CSS line clamp) -->
62
+ <p class="text-sm text-base-content/75" style="display: -webkit-box; -webkit-line-clamp: 3; -webkit-box-orient: vertical; overflow: hidden;">
63
+ {{ paper.abstract[:500] }}{% if paper.abstract | length > 500 %}…{% endif %}
64
  </p>
65
 
66
  <!-- Action buttons (HTMX-powered, swap themselves on click) -->
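The card's two truncation rules (at most three authors followed by "et al." with the total count, and a 500-character abstract cut with an ellipsis) can be mirrored in plain Python, for example to unit-test the expected strings outside Jinja. The function names `format_authors` and `truncate_abstract` are hypothetical; the template itself does this inline.

```python
def format_authors(authors, max_shown=3):
    """Join up to max_shown authors; append 'et al. (N authors)' when more exist."""
    shown = ", ".join(authors[:max_shown])
    if len(authors) > max_shown:
        shown += f" et al. ({len(authors)} authors)"
    return shown


def truncate_abstract(abstract, max_chars=500):
    """Cut the abstract at max_chars and mark the cut with an ellipsis."""
    if len(abstract) > max_chars:
        return abstract[:max_chars] + "…"
    return abstract


authors_line = format_authors(["A. One", "B. Two", "C. Three", "D. Four"])
short_line = format_authors(["A. One", "B. Two"])
long_abstract = truncate_abstract("x" * 501)
```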
app/templates/partials/recommendations.html CHANGED
@@ -13,6 +13,40 @@
13
  {% include "partials/paper_card.html" %}
14
  {% endfor %}
15
  </div>
16
  <!-- Refresh button — lets user reload recs after saving more papers -->
17
  <div class="text-center pt-3">
18
  <button class="btn btn-ghost btn-sm"
 
13
  {% include "partials/paper_card.html" %}
14
  {% endfor %}
15
  </div>
16
+
17
+ {# Pipeline timing breakdown #}
18
+ {% if timing is defined and timing %}
19
+ <div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
20
+ <div class="flex items-center gap-2 mb-2">
21
+ <span class="text-xs font-semibold text-base-content/60">⚡ Recommendation Pipeline Breakdown</span>
22
+ </div>
23
+ <div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
24
+ {% if timing.clustering_ms is defined %}
25
+ <span>Ward Clustering: <span class="text-primary">{{ timing.clustering_ms }}ms</span></span>
26
+ {% endif %}
27
+ {% if timing.ann_retrieval_ms is defined %}
28
+ <span>ANN Retrieval: <span class="text-primary">{{ timing.ann_retrieval_ms }}ms</span></span>
29
+ {% endif %}
30
+ {% if timing.candidate_meta_ms is defined %}
31
+ <span>Candidate Meta: <span class="text-primary">{{ timing.candidate_meta_ms }}ms</span></span>
32
+ {% endif %}
33
+ {% if rerank_time_ms is defined %}
34
+ <span>LightGBM Rerank: <span class="text-primary">{{ rerank_time_ms }}ms</span></span>
35
+ {% endif %}
36
+ {% if timing.mmr_ms is defined %}
37
+ <span>MMR Diversity: <span class="text-primary">{{ timing.mmr_ms }}ms</span></span>
38
+ {% endif %}
39
+ {% if meta_time_ms is defined %}
40
+ <span>Final Metadata: <span class="text-primary">{{ meta_time_ms }}ms</span></span>
41
+ {% endif %}
42
+ </div>
43
+ </div>
44
+ {% elif rerank_time_ms is defined and meta_time_ms is defined %}
45
+ <div class="text-center pt-2 pb-1 text-xs text-base-content/40 font-mono">
46
+ ⚡ Reranking: {{ rerank_time_ms }}ms | Metadata: {{ meta_time_ms }}ms
47
+ </div>
48
+ {% endif %}
49
+
50
  <!-- Refresh button — lets user reload recs after saving more papers -->
51
  <div class="text-center pt-3">
52
  <button class="btn btn-ghost btn-sm"
app/templates/partials/search_results.html CHANGED
@@ -1,15 +1,91 @@
1
  {# Partial: list of search result cards #}
2
  {% if papers %}
3
  <div class="space-y-3">
4
- <p class="text-sm text-base-content/50">{{ papers | length }} results for "{{ query }}"</p>
5
  {% for paper in papers %}
6
  {% set position = loop.index0 %}
7
  {% set source = "search" %}
8
  {% include "partials/paper_card.html" %}
9
  {% endfor %}
10
  </div>
11
  {% elif query %}
12
  <div class="text-center text-base-content/40 py-10">
13
- No results found for "{{ query }}"
 
 
 
14
  </div>
15
  {% endif %}
 
1
  {# Partial: list of search result cards #}
2
  {% if papers %}
3
  <div class="space-y-3">
4
+ <div class="flex flex-col gap-1 mb-4">
5
+ <div class="flex justify-between items-center text-sm text-base-content/50">
6
+ <span>{{ papers | length }} results for "{{ query }}"</span>
7
+ {% if search_meta and search_meta.total_time_ms is defined %}
8
+ <span>Search completed in {{ search_meta.total_time_ms }}ms</span>
9
+ {% endif %}
10
+ </div>
11
+
12
+ {# Groq rewrite result — show both rewritten AND skipped cases #}
13
+ {% if search_meta %}
14
+ {% if search_meta.rewritten_query %}
15
+ <div class="alert bg-base-200 border-l-4 border-primary p-3 text-sm flex gap-2">
16
+ <svg xmlns="http://www.w3.org/2000/svg" class="stroke-primary shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
17
+ <div class="flex-1">
18
+ <span class="font-semibold">Groq expanded query:</span> "{{ search_meta.rewritten_query }}"
19
+ <span class="text-xs text-base-content/50 ml-2">({{ search_meta.groq_time_ms }}ms)</span>
20
+ </div>
21
+ </div>
22
+ {% elif search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
23
+ <div class="alert bg-base-200/50 border-l-4 border-base-300 p-3 text-sm flex gap-2">
24
+ <svg xmlns="http://www.w3.org/2000/svg" class="stroke-base-content/30 shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
25
+ <div class="flex-1 text-base-content/50">
26
+ <span class="font-semibold">Groq rewrite:</span> {{ search_meta.groq_status }}
27
+ &mdash; searching with original query as-is
28
+ </div>
29
+ </div>
30
+ {% endif %}
31
+ {% endif %}
32
+ </div>
33
+
34
  {% for paper in papers %}
35
  {% set position = loop.index0 %}
36
  {% set source = "search" %}
37
  {% include "partials/paper_card.html" %}
38
  {% endfor %}
39
  </div>
40
+
41
+ {# Pipeline timing breakdown #}
42
+ {% if search_meta %}
43
+ <div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
44
+ <div class="flex items-center gap-2 mb-2">
45
+ <span class="text-xs font-semibold text-base-content/60">⚡ Search Pipeline Breakdown</span>
46
+ {% if search_meta.total_time_ms is defined %}
47
+ <span class="text-xs text-base-content/40">({{ search_meta.total_time_ms }}ms total)</span>
48
+ {% endif %}
49
+ </div>
50
+ <div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
51
+ {% if search_meta.groq_time_ms is defined %}
52
+ <span>Groq Rewrite: <span class="text-primary">{{ search_meta.groq_time_ms }}ms</span>
53
+ {% if search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
54
+ <span class="text-warning/60">({{ search_meta.groq_status }})</span>
55
+ {% endif %}
56
+ </span>
57
+ {% endif %}
58
+ {% if search_meta.encode_time_ms is defined %}
59
+ <span>BGE-M3 Encode: <span class="text-primary">{{ search_meta.encode_time_ms }}ms</span></span>
60
+ {% endif %}
61
+ {% if search_meta.retrieval_time_ms is defined %}
62
+ <span>Qdrant+Zilliz Retrieval: <span class="text-primary">{{ search_meta.retrieval_time_ms }}ms</span>
63
+ {% if search_meta.n_retrieval_tasks is defined %}
64
+ <span class="text-base-content/30">({{ search_meta.n_retrieval_tasks }} parallel tasks)</span>
65
+ {% endif %}
66
+ </span>
67
+ {% endif %}
68
+ {% if search_meta.rrf_time_ms is defined %}
69
+ <span>RRF Fusion: <span class="text-primary">{{ search_meta.rrf_time_ms }}ms</span></span>
70
+ {% endif %}
71
+ {% if search_meta.turso_boost_fetch_ms is defined %}
72
+ <span>Turso Title Fetch: <span class="text-primary">{{ search_meta.turso_boost_fetch_ms }}ms</span></span>
73
+ <span>Rerank Compute: <span class="text-primary">{{ search_meta.rerank_compute_ms }}ms</span></span>
74
+ {% elif search_meta.rerank_time_ms is defined %}
75
+ <span>Title+Citation Rerank: <span class="text-primary">{{ search_meta.rerank_time_ms }}ms</span></span>
76
+ {% endif %}
77
+ {% if search_meta.meta_time_ms is defined %}
78
+ <span>Final Metadata: <span class="text-primary">{{ search_meta.meta_time_ms }}ms</span></span>
79
+ {% endif %}
80
+ </div>
81
+ </div>
82
+ {% endif %}
83
+
84
  {% elif query %}
85
  <div class="text-center text-base-content/40 py-10">
86
+ <p>No results found for "{{ query }}"</p>
87
+ {% if search_meta and search_meta.total_time_ms is defined %}
88
+ <p class="text-xs mt-2">Search completed in {{ search_meta.total_time_ms }}ms</p>
89
+ {% endif %}
90
  </div>
91
  {% endif %}
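The template above only renders a stage when its `search_meta` key is defined, so the backend can fill the dict incrementally. A minimal sketch of how such per-stage timings might be collected follows; `stage_timer` and `run_search` are hypothetical names for illustration, and the sleeps stand in for the real encode and retrieval calls, which are not shown in this hunk.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(meta: dict, key: str):
    """Store the wall-clock duration of the wrapped block in meta[key], in whole ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so a failed stage still shows its cost.
        meta[key] = round((time.perf_counter() - start) * 1000)


def run_search(query: str) -> dict:
    """Hypothetical pipeline skeleton that returns only the timing metadata."""
    search_meta: dict = {}
    t0 = time.perf_counter()
    with stage_timer(search_meta, "encode_time_ms"):
        time.sleep(0.01)  # placeholder for the embedding call
    with stage_timer(search_meta, "retrieval_time_ms"):
        time.sleep(0.02)  # placeholder for the vector retrieval calls
    search_meta["total_time_ms"] = round((time.perf_counter() - t0) * 1000)
    return search_meta


meta = run_search("attention is all you need")
```

Keys that are never written (for example `rrf_time_ms` here) are simply skipped by the `is defined` checks in the template.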
app/templates/partials/seed_results.html ADDED
@@ -0,0 +1,41 @@
1
+ {#
2
+ Seed search results: inner partial, swapped into #seed-results by HTMX.
3
+ Expects:
4
+ papers - list[dict] (optional)
5
+ query - str (optional)
6
+ #}
7
+ {% if papers is defined and papers %}
8
+ {% for paper in papers %}
9
+ <div class="seed-card flex items-start justify-between gap-3"
10
+ id="seed-paper-{{ paper.arxiv_id }}">
11
+ <div class="flex-1 min-w-0">
12
+ <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
13
+ target="_blank" rel="noopener"
14
+ class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
15
+ {{ paper.title }}
16
+ </a>
17
+ <div class="text-xs text-base-content/50 mt-0.5">
18
+ [{{ paper.arxiv_id }}]
19
+ {% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
20
+ {% if paper.citation_count %} · 📊 {{ paper.citation_count }}{% endif %}
21
+ </div>
22
+ </div>
23
+ <button class="btn btn-primary btn-xs shrink-0"
24
+ hx-post="/api/papers/{{ paper.arxiv_id }}/save"
25
+ hx-target="#seed-paper-{{ paper.arxiv_id }}"
26
+ hx-swap="outerHTML"
27
+ hx-vals='{"source": "onboarding"}'
28
+ onclick="bumpSeedCount()">
29
+ ⭐ Save
30
+ </button>
31
+ </div>
32
+ {% endfor %}
33
+ {% elif query is defined and query %}
34
+ <p class="text-center text-base-content/40 py-6 text-sm">
35
+ No results found for "{{ query }}"
36
+ </p>
37
+ {% else %}
38
+ <p class="text-center text-base-content/30 py-6 text-sm">
39
+ Search above to find papers in your research area
40
+ </p>
41
+ {% endif %}
app/templates/partials/seed_search.html CHANGED
@@ -15,30 +15,6 @@
15
  </p>
16
  </div>
17
 
18
- {# Phase 5.1: Quick author import #}
19
- <div class="mb-4 p-3 bg-base-200/50 rounded-lg">
20
- <p class="text-xs font-medium text-base-content/70 mb-2">
21
- ⚡ Quick import: Paste your Semantic Scholar profile URL to auto-import papers
22
- </p>
23
- <form hx-post="/api/onboarding/import-author"
24
- hx-target="#import-result"
25
- hx-swap="innerHTML"
26
- hx-indicator="#import-spinner"
27
- class="flex gap-2">
28
- <input type="text"
29
- name="author_url"
30
- placeholder="e.g. https://www.semanticscholar.org/author/…/1234567"
31
- class="input input-bordered input-sm flex-1 text-xs" />
32
- <button class="btn btn-secondary btn-sm" type="submit">
33
- Import
34
- <span id="import-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
35
- </button>
36
- </form>
37
- <div id="import-result" class="mt-2"></div>
38
- </div>
39
-
40
- <div class="divider text-xs text-base-content/40">OR search manually</div>
41
-
42
  {# Search bar #}
43
  <div class="mb-4">
44
  <form hx-get="/api/onboarding/seed-search"
@@ -68,43 +44,9 @@
68
  </div>
69
  </div>
70
 
71
- {# Search results #}
72
  <div id="seed-results" class="space-y-2 mb-6">
73
- {% if papers is defined and papers %}
74
- {% for paper in papers %}
75
- <div class="seed-card flex items-start justify-between gap-3"
76
- id="seed-paper-{{ paper.arxiv_id }}">
77
- <div class="flex-1 min-w-0">
78
- <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
79
- target="_blank" rel="noopener"
80
- class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
81
- {{ paper.title }}
82
- </a>
83
- <div class="text-xs text-base-content/50 mt-0.5">
84
- [{{ paper.arxiv_id }}]
85
- {% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
86
- {% if paper.citation_count %} · 📊 {{ paper.citation_count }}{% endif %}
87
- </div>
88
- </div>
89
- <button class="btn btn-primary btn-xs shrink-0"
90
- hx-post="/api/papers/{{ paper.arxiv_id }}/save"
91
- hx-target="#seed-paper-{{ paper.arxiv_id }}"
92
- hx-swap="outerHTML"
93
- hx-vals='{"source": "onboarding"}'
94
- onclick="bumpSeedCount()">
95
- ⭐ Save
96
- </button>
97
- </div>
98
- {% endfor %}
99
- {% elif query is defined and query %}
100
- <p class="text-center text-base-content/40 py-6 text-sm">
101
- No results found for "{{ query }}"
102
- </p>
103
- {% else %}
104
- <p class="text-center text-base-content/30 py-6 text-sm">
105
- Search above to find papers in your research area
106
- </p>
107
- {% endif %}
108
  </div>
109
 
110
  {# Done / Skip buttons #}
 
15
  </p>
16
  </div>
17
 
18
  {# Search bar #}
19
  <div class="mb-4">
20
  <form hx-get="/api/onboarding/seed-search"
 
44
  </div>
45
  </div>
46
 
47
+ {# Search results: inner div is the HTMX swap target #}
48
  <div id="seed-results" class="space-y-2 mb-6">
49
+ {% include "partials/seed_results.html" %}
50
  </div>
51
 
52
  {# Done / Skip buttons #}
app/templates/search.html CHANGED
@@ -7,10 +7,9 @@
7
 
8
  <!-- Search bar -->
9
  <div class="card bg-base-100 shadow-md rounded-xl p-4">
10
- <form hx-get="/search"
11
- hx-target="#search-results"
12
  hx-push-url="true"
13
- hx-indicator="#search-spinner"
14
  class="flex gap-2">
15
  <input type="text"
16
  name="q"
@@ -18,16 +17,38 @@
18
  placeholder="Search arXiv papers…"
19
  class="input input-bordered flex-1"
20
  autofocus />
21
- <button class="btn btn-primary" type="submit">
22
- Search
23
- <span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
 
 
 
24
  </button>
25
  </form>
26
  </div>
27
 
28
- <!-- Recommendations (sidebar-style, loads async) -->
29
- {% if has_recs %}
30
- <div>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
32
  <div id="rec-section"
33
  hx-get="/api/recommendations"
@@ -47,4 +68,29 @@
47
  </div>
48
 
49
  </div>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
50
  {% endblock %}
 
7
 
8
  <!-- Search bar -->
9
  <div class="card bg-base-100 shadow-md rounded-xl p-4">
10
+ <form hx-get="/search"
11
+ hx-target="#search-results"
12
  hx-push-url="true"
 
13
  class="flex gap-2">
14
  <input type="text"
15
  name="q"
 
17
  placeholder="Search arXiv papers…"
18
  class="input input-bordered flex-1"
19
  autofocus />
20
+ <button class="btn btn-primary flex items-center gap-2" type="submit">
21
+ <span class="search-btn-text">Search</span>
22
+ <span class="search-btn-loading hidden">
23
+ <span class="loading loading-spinner loading-sm"></span>
24
+ Searching…
25
+ </span>
26
  </button>
27
  </form>
28
  </div>
29
 
30
+ <!-- Loading overlay (outside search-results so it doesn't get swapped away) -->
31
+ <div id="search-loading" class="hidden">
32
+ <div class="flex flex-col items-center justify-center py-16 gap-4">
33
+ <span class="loading loading-ring loading-lg text-primary"></span>
34
+ <div class="text-sm text-base-content/60 animate-pulse">
35
+ Searching 1.6M papers across Qdrant + Zilliz…
36
+ </div>
37
+ <div class="flex gap-6 text-xs text-base-content/40 font-mono">
38
+ <span>Groq rewriting</span>
39
+ <span>→</span>
40
+ <span>BGE-M3 encoding</span>
41
+ <span>→</span>
42
+ <span>Vector retrieval</span>
43
+ <span>→</span>
44
+ <span>RRF + reranking</span>
45
+ </div>
46
+ </div>
47
+ </div>
48
+
49
+ <!-- Recommendations: only when not actively searching -->
50
+ {% if has_recs and not query %}
51
+ <div id="rec-wrapper">
52
  <h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
53
  <div id="rec-section"
54
  hx-get="/api/recommendations"
 
68
  </div>
69
 
70
  </div>
71
+
72
+ <script>
73
+ // Show/hide loading overlay + HIDE recommendations when searching
74
+ document.body.addEventListener('htmx:beforeRequest', function(evt) {
75
+ if (evt.detail.target && evt.detail.target.id === 'search-results') {
76
+ document.getElementById('search-loading').classList.remove('hidden');
77
+ document.getElementById('search-results').classList.add('opacity-30');
78
+ // Hide recommendations section when a search starts
79
+ var recWrapper = document.getElementById('rec-wrapper');
80
+ if (recWrapper) recWrapper.classList.add('hidden');
81
+ // Swap button text
82
+ document.querySelectorAll('.search-btn-text').forEach(el => el.classList.add('hidden'));
83
+ document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.remove('hidden'));
84
+ }
85
+ });
86
+ document.body.addEventListener('htmx:afterRequest', function(evt) {
87
+ if (evt.detail.target && evt.detail.target.id === 'search-results') {
88
+ document.getElementById('search-loading').classList.add('hidden');
89
+ document.getElementById('search-results').classList.remove('opacity-30');
90
+ // Restore button text
91
+ document.querySelectorAll('.search-btn-text').forEach(el => el.classList.remove('hidden'));
92
+ document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.add('hidden'));
93
+ }
94
+ });
95
+ </script>
96
  {% endblock %}
app/turso_svc.py CHANGED
@@ -15,12 +15,65 @@ from __future__ import annotations
15
 
16
  import json
17
  import time
 
18
 
19
  import httpx
20
 
21
  from app import config
22
 
23
 
24
  # ── Public API ───────────────────────────────────────────────────────────────
25
 
26
  async def fetch_metadata(arxiv_id: str) -> dict | None:
@@ -37,11 +90,31 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
37
  Paper dict has keys: arxiv_id, title, abstract, authors, category,
38
  published, year, citation_count, influential_citations.
39
 
40
- Uses Turso HTTP pipeline API: single HTTP request for all IDs.
41
  """
42
  if not arxiv_ids:
43
  return {}
44
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45
  url = config.TURSO_URL
46
  token = config.TURSO_DB_TOKEN
47
 
@@ -133,6 +206,7 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
133
  paper = _to_paper_dict(values)
134
  if paper:
135
  output[paper["arxiv_id"]] = paper
 
136
 
137
  return output
138
 
@@ -211,27 +285,52 @@ async def fetch_trending_by_categories(
211
  Fetch recently published, high-citation papers from Turso DB
212
  filtered by arXiv categories. Used as Tier 0 popularity fallback
213
  for onboarded users with zero saves.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
214
  """
215
  if not categories:
216
  return []
217
 
 
 
 
 
 
218
  url = config.TURSO_URL
219
  token = config.TURSO_DB_TOKEN
220
  if not url or not token:
221
  return []
222
 
223
- # Build query: papers in selected categories, sorted by citation count
224
- placeholders = ", ".join(["?" for _ in categories])
 
 
 
225
  sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
226
  update_date, abstract_preview, citation_count, influential_citations
227
  FROM papers
228
- WHERE primary_topic IN ({placeholders})
229
  AND citation_count > 0
230
  ORDER BY citation_count DESC, update_date DESC
231
  LIMIT ?"""
232
 
233
- cat_list = list(categories)
234
- args = [{"type": "text", "value": c} for c in cat_list]
235
  args.append({"type": "integer", "value": str(limit)})
236
 
237
  pipeline_url = url.rstrip("/")
@@ -254,16 +353,29 @@ async def fetch_trending_by_categories(
254
  "Content-Type": "application/json",
255
  }
256
 
 
 
 
257
  try:
258
- async with httpx.AsyncClient(timeout=10) as client:
259
  resp = await client.post(
260
  f"{pipeline_url}/v2/pipeline",
261
  json=payload,
262
  headers=headers,
263
  )
264
  resp.raise_for_status()
 
 
 
 
 
 
 
 
 
 
265
  except Exception as e:
266
- print(f"[turso] trending query failed: {e}")
267
  return []
268
 
269
  try:
@@ -282,7 +394,7 @@ async def fetch_trending_by_categories(
282
  cols = [c["name"] for c in result_data.get("cols", [])]
283
  rows = result_data.get("rows", [])
284
  except (KeyError, IndexError, TypeError) as e:
285
- print(f"[turso] trending parse error: {e}")
286
  return []
287
 
288
  papers = []
@@ -299,4 +411,10 @@ async def fetch_trending_by_categories(
299
  papers.append(paper)
300
 
301
  print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
 
 
 
 
 
 
302
  return papers
 
15
 
16
  import json
17
  import time
18
+ from collections import OrderedDict
19
 
20
  import httpx
21
 
22
  from app import config
23
 
24
 
25
+ # ── In-process metadata cache ────────────────────────────────────────────────
26
+ #
27
+ # Recommendations + search both fetch metadata for hundreds of arxiv_ids per
28
+ # request, often the same well-known papers across users. Each round-trip is
29
+ # 1-3s on a 1.6M-row libSQL DB. An in-process LRU absorbs the repeats.
30
+ #
31
+ # Trade-offs:
32
+ # - Asyncio is single-threaded, no lock needed.
33
+ # - Paper title/abstract/authors are effectively immutable for our use,
34
+ # so we don't TTL-expire metadata. citation_count drifts but is only
35
+ # used for display ranking; staleness is fine.
36
+ # - 50K capacity at ~1KB per row -> ~50MB RAM ceiling.
37
+
38
+ _METADATA_CACHE: "OrderedDict[str, dict]" = OrderedDict()
39
+ _METADATA_CACHE_MAX = 50_000
40
+
41
+
42
+ def _cache_get(arxiv_id: str) -> dict | None:
43
+ val = _METADATA_CACHE.get(arxiv_id)
44
+ if val is not None:
45
+ # Mark as MRU
46
+ _METADATA_CACHE.move_to_end(arxiv_id)
47
+ return val
48
+
49
+
50
+ def _cache_put(arxiv_id: str, paper: dict) -> None:
51
+ if arxiv_id in _METADATA_CACHE:
52
+ _METADATA_CACHE.move_to_end(arxiv_id)
53
+ _METADATA_CACHE[arxiv_id] = paper
54
+ return
55
+ _METADATA_CACHE[arxiv_id] = paper
56
+ if len(_METADATA_CACHE) > _METADATA_CACHE_MAX:
57
+ # Evict LRU
58
+ _METADATA_CACHE.popitem(last=False)
59
+
60
+
61
+ def metadata_cache_stats() -> dict:
62
+ """For diagnostics: current cache size and max."""
63
+ return {"size": len(_METADATA_CACHE), "max": _METADATA_CACHE_MAX}
64
+
65
+
66
+ # ── In-process trending cache ────────────────────────────────────────────────
67
+ #
68
+ # Trending is filter-by-LIKE on 1.6M rows -> ~15s cold. Onboarding has a
69
+ # small fixed set of category combinations, and citation counts barely
70
+ # change minute-to-minute. A short TTL converts the 15s wait into a
71
+ # one-time hit per category combo.
72
+
73
+ _TRENDING_CACHE: dict[tuple, tuple[float, list[dict]]] = {}
74
+ _TRENDING_TTL_SECONDS = 60 * 60 # 1 hour
75
+
76
+
77
  # ── Public API ───────────────────────────────────────────────────────────────
78
 
79
  async def fetch_metadata(arxiv_id: str) -> dict | None:
 
90
  Paper dict has keys: arxiv_id, title, abstract, authors, category,
91
  published, year, citation_count, influential_citations.
92
 
93
+ First checks the in-process LRU cache; only un-cached IDs hit the network.
94
  """
95
  if not arxiv_ids:
96
  return {}
97
 
98
+ # Cache check: pull anything already-known up front.
99
+ output: dict[str, dict] = {}
100
+ misses: list[str] = []
101
+ for aid in arxiv_ids:
102
+ cached = _cache_get(aid)
103
+ if cached is not None:
104
+ output[aid] = cached
105
+ else:
106
+ misses.append(aid)
107
+
108
+ if not misses:
109
+ return output
110
+
111
+ fetched = await _fetch_metadata_batch_uncached(misses)
112
+ output.update(fetched)
113
+ return output
114
+
115
+
116
+ async def _fetch_metadata_batch_uncached(arxiv_ids: list[str]) -> dict[str, dict]:
117
+ """Network fetch for IDs we don't already have cached."""
118
  url = config.TURSO_URL
119
  token = config.TURSO_DB_TOKEN
120
 
 
206
  paper = _to_paper_dict(values)
207
  if paper:
208
  output[paper["arxiv_id"]] = paper
209
+ _cache_put(paper["arxiv_id"], paper)
210
 
211
  return output
212
 
 
285
  Fetch recently published, high-citation papers from Turso DB
286
  filtered by arXiv categories. Used as Tier 0 popularity fallback
287
  for onboarded users with zero saves.
288
+
289
+ Cached in-process (1 hour TTL): citation counts barely change
290
+ minute-to-minute, and onboarding has a small fixed set of category
291
+ combinations, so the first cold-start hit pays the ~15s LIKE-scan
292
+ cost once and subsequent users get an instant hit.
293
+
294
+ Filter strategy:
295
+ Turso's `primary_topic` column stores friendly labels like
296
+ "AI/ML" / "Computer Vision" (NOT arxiv codes), and the mapping
297
+ from arxiv code to friendly label is not 1:1 (e.g. Vaswani's
298
+ cs.CL paper is labeled "AI/ML" while BERT's cs.CL paper is
299
+ labeled "NLP/Computational Linguistics"). The `categories`
300
+ column, however, contains the real space-separated arxiv codes
301
+ ("cs.CL cs.LG"). So we filter via LIKE on `categories`.
302
+
303
+ Performance: LIKE '%cs.XX%' with leading wildcard skips the index,
304
+ but Turso's `citation_count > 0` filter + ORDER BY citation_count
305
+ narrows the scan, and trending is not a hot path.
306
  """
307
  if not categories:
308
  return []
309
 
310
+ cache_key = (tuple(sorted(categories)), limit)
311
+ cached = _TRENDING_CACHE.get(cache_key)
312
+ if cached is not None and (time.time() - cached[0]) < _TRENDING_TTL_SECONDS:
313
+ return cached[1]
314
+
315
  url = config.TURSO_URL
316
  token = config.TURSO_DB_TOKEN
317
  if not url or not token:
318
  return []
319
 
320
+ cat_list = list(categories)
321
+ # categories column is space-separated arxiv codes. Dotted codes such as
322
+ # cs.CL and cs.LG do not contain one another, so plain LIKE '%code%' is safe
323
+ # for them; a bare archive code like astro-ph would also match astro-ph.CO.
324
+ like_clauses = " OR ".join(["categories LIKE ?" for _ in cat_list])
325
  sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
326
  update_date, abstract_preview, citation_count, influential_citations
327
  FROM papers
328
+ WHERE ({like_clauses})
329
  AND citation_count > 0
330
  ORDER BY citation_count DESC, update_date DESC
331
  LIMIT ?"""
332
 
333
+ args = [{"type": "text", "value": f"%{c}%"} for c in cat_list]
 
334
  args.append({"type": "integer", "value": str(limit)})
335
 
336
  pipeline_url = url.rstrip("/")
 
353
  "Content-Type": "application/json",
354
  }
355
 
356
+ # Use a longer timeout than metadata fetch: full table scan
357
+ # for citation-sorted trending against 1.6M rows can spike to
358
+ # 15-25s on the first cold hit. Once cached, warm reads are 0ms.
359
  try:
360
+ async with httpx.AsyncClient(timeout=30) as client:
361
  resp = await client.post(
362
  f"{pipeline_url}/v2/pipeline",
363
  json=payload,
364
  headers=headers,
365
  )
366
  resp.raise_for_status()
367
+ except httpx.HTTPStatusError as e:
368
+ # Surface response body on HTTP errors: Turso's empty-string
369
+ # exceptions were the symptom that hid this bug for months.
370
+ body = ""
371
+ try:
372
+ body = e.response.text[:500]
373
+ except Exception:
374
+ pass
375
+ print(f"[turso] trending HTTP error {e.response.status_code}: {body}")
376
+ return []
377
  except Exception as e:
378
+ print(f"[turso] trending request failed: {type(e).__name__}: {e!r}")
379
  return []
380
 
381
  try:
 
394
  cols = [c["name"] for c in result_data.get("cols", [])]
395
  rows = result_data.get("rows", [])
396
  except (KeyError, IndexError, TypeError) as e:
397
+ print(f"[turso] trending parse error: {type(e).__name__}: {e!r}")
398
  return []
399
 
400
  papers = []
 
411
  papers.append(paper)
412
 
413
  print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
414
+ if papers:
415
+ _TRENDING_CACHE[cache_key] = (time.time(), papers)
416
+ # Also seed metadata cache β€” these papers are likely to be
417
+ # fetched again as part of recommendations / display.
418
+ for p in papers:
419
+ _cache_put(p["arxiv_id"], p)
420
  return papers
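The `LIKE '%code%'` filter in `fetch_trending_by_categories` matches substrings, which is fine for dotted codes but would over-match if a bare archive code were ever passed. A minimal sketch of a whole-token variant follows, assuming the `categories` column is space-separated as the docstring says; `build_category_filter` is a hypothetical helper, and the demo runs against an in-memory SQLite table rather than Turso.

```python
import sqlite3


def build_category_filter(codes: list[str]) -> tuple[str, list[str]]:
    """Return a WHERE fragment and bound args that match whole codes only.

    Padding the column with spaces on both sides lets '% code %' match a
    token at the start, middle, or end of the space-separated list.
    """
    clause = " OR ".join("(' ' || categories || ' ') LIKE ?" for _ in codes)
    args = [f"% {code} %" for code in codes]
    return f"({clause})", args


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (arxiv_id TEXT, categories TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("a", "astro-ph"), ("b", "astro-ph.CO cs.LG"), ("c", "cs.CL cs.LG")],
)

# Plain substring match: the bare code also hits the dotted sub-category.
plain = conn.execute(
    "SELECT arxiv_id FROM papers WHERE categories LIKE ? ORDER BY arxiv_id",
    ("%astro-ph%",),
).fetchall()

# Whole-token match: only the row that lists the bare code itself.
where, args = build_category_filter(["astro-ph"])
exact = conn.execute(
    f"SELECT arxiv_id FROM papers WHERE {where} ORDER BY arxiv_id", args
).fetchall()
```

Like the original filter, this still cannot use an index because of the leading wildcard, so it changes correctness only, not scan cost.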
docs/TASK-TRACKER.md CHANGED
@@ -325,30 +325,30 @@
325
 
326
  ---
327
 
328
- ## Phase 5: Cold-Start Onboarding 📋 NOT STARTED
329
 
330
- > *Build the hybrid onboarding pipeline for new users.*
331
- > *Estimated effort: ~1-2 weeks*
332
  > *Reference: Doc 06 - "4-37% lift even once behavioral data exists"*
333
 
334
- ### 5.1 - arXiv Category Multi-Select
335
- - [ ] UI screen on first visit: select 3-5 arXiv categories
336
- - [ ] Store selections in SQLite
337
- - [ ] Use as pool filter for first 1-3 sessions
338
- - [ ] Preserve as LightGBM feature permanently
339
- - [ ] Does NOT create "subject vectors"; just filters
340
 
341
- ### 5.2 - Seed Paper Import
342
- - [ ] Let users search for and save 3-5 seed papers during onboarding
343
- - [ ] Immediately create EWMA profiles + Ward clusters
344
- - [ ] Uses hybrid search (Phase 3) for discovery
345
 
346
- ### 5.3 - ORCID / Semantic Scholar Import (Stretch)
347
- - [ ] Accept ORCID ID → fetch authored papers → initial saves
348
- - [ ] Gives 10-50 papers of signal instantly
349
 
350
- ### 5.4 - Popularity Fallback
351
- - [ ] If user skips all onboarding: serve popularity-per-selected-category feed
 
352
 
353
  ---
354
 
@@ -432,10 +432,10 @@
432
  - [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
433
  - [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
434
 
435
- ### B4 - S2 author import (Phase 5.1)
436
- - [x] `app/s2_svc.py`: parse S2 URL / raw ID / ORCID, fetch author papers from S2 API
437
- - [x] `POST /api/onboarding/import-author` endpoint in `onboarding.py`
438
- - [x] Quick-import form added to `seed_search.html` template
439
 
440
  ### Documentation
441
  - [x] `CLAUDE.md`: Rule 3.11 - interaction instrumentation invariants
 
325
 
326
  ---
327
 
328
+ ## Phase 5: Cold-Start Onboarding ✅ COMPLETE
329
 
330
+ > *Onboarding wizard for new users: category selection + seed paper search + trending fallback.*
 
331
  > *Reference: Doc 06 - "4-37% lift even once behavioral data exists"*
332
 
333
+ ### 5.1 - arXiv Category Multi-Select ✅
334
+ - [x] UI screen on first visit: select 1-8 arXiv category groups
335
+ - [x] Store selections in SQLite (`user_onboarding` table)
336
+ - [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
337
+ - [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
338
+ - [x] Does NOT create "subject vectors"; just filters
339
 
340
+ ### 5.2 - Seed Paper Import ✅
341
+ - [x] Let users search for and save seed papers during onboarding
342
+ - [x] Immediately create EWMA profiles + Ward clusters on next feed request
343
+ - [x] Uses hybrid search (Phase 3) for discovery
344
 
345
+ ### ~~5.3 - ORCID / Semantic Scholar Import~~ ❌ REMOVED
346
+ > S2 author import was implemented but removed; it is not the onboarding direction we want.
347
+ > Onboarding focuses on category selection + manual seed paper search.
348
 
349
+ ### 5.4 - Popularity Fallback ✅
350
+ - [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
351
+ - [x] 1-hour TTL trending cache for performance
352
 
353
  ---
354
 
 
432
  - [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
433
  - [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
434
 
435
+ ### ~~B4 - S2 author import~~ ❌ REMOVED
436
+ > S2 author import was implemented and then removed; it is not the onboarding direction we want.
437
+ > `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
438
+ > have all been deleted. Onboarding uses category selection + manual seed search only.
439
 
440
  ### Documentation
441
  - [x] `CLAUDE.md`: Rule 3.11 - interaction instrumentation invariants
docs/previous_prompt.txt ADDED
The diff for this file is too large to render. See raw diff
 
docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md CHANGED
@@ -32,7 +32,7 @@
32
  | Component | Planned In | Blocked By |
33
  |---|---|---|
34
  | Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
35
- | ORCID / Scholar import (onboarding stretch) | Phase 5 (stretch) | Deferred |
36
  | LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
37
  | Exploration + collaborative filtering | Phase 9 | Needs user scale |
38
 
@@ -101,12 +101,12 @@ The latest deep research (Doc 06) adds critical nuance that **neither pure-behav
101
 
102
  > "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4-37% lift even once behavioral data exists."
103
 
104
- **The corrected position**: A three-layer hybrid:
105
  1. **Coarse arXiv-category multiselect**: filter and LightGBM feature (5-second cold-start signal)
106
- 2. **Seed-paper / ORCID import**: initial behavioral profile (strong cold-start signal)
107
- 3. **Ward clustering + medoid retrieval**: takes over at ~10 saves (production-grade personalization)
108
 
109
- This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features.
110
 
111
  ---
112
 
@@ -283,29 +283,30 @@ Turso cloud DB with 1.23GB of metadata + citation counts. Search time: ~10.7s
283
 
284
  ---
285
 
286
- ### Phase 5: Cold-Start Onboarding (COMPLETE)
287
 
288
- Status: core flow implemented (categories + seed search + trending fallback). ORCID/Scholar import deferred.
289
 
290
  Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
291
 
292
- #### 5.1 arXiv Category Multi-Select
293
- A simple UI screen on first visit: select 3-5 arXiv categories (cs.CL, cs.CV, stat.ML, etc.).
294
- - Used as pool filter for first 1-3 sessions
295
- - Stored as a LightGBM feature permanently
296
  - Does NOT create "subject vectors"; just filters
297
 
298
- #### 5.2 Seed Paper Import
299
- Let users search for and save 3-5 seed papers during onboarding.
300
- - These immediately create EWMA profiles and Ward clusters
301
  - Bypasses the "save 5 papers before any recs" cold-start trap
302
- - Scholar Inbox found this sufficient for good initial recommendations
303
- - **With hybrid search in place (Phase 3), seed paper search will use Qdrant vectors, not the arXiv API**
304
 
305
- #### 5.3 ORCID / Semantic Scholar ID Import (Stretch)
306
- If the user pastes their ORCID, ingest their authored papers as initial saves.
307
- - This gives the system 10-50 papers worth of signal instantly
308
- - Creates highly personalized clusters from Day 1
 
 
309
 
310
  ---
311
 
 
32
  | Component | Planned In | Blocked By |
33
  |---|---|---|
34
  | Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
35
+ | ~~ORCID / Scholar import~~ | ~~Phase 5~~ | Removed (not the onboarding direction we want) |
36
  | LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
37
  | Exploration + collaborative filtering | Phase 9 | Needs user scale |
38
 
 
101
 
102
  > "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4-37% lift even once behavioral data exists."
103
 
104
+ **The corrected position**: A three-layer hybrid:
105
  1. **Coarse arXiv-category multiselect**: filter and LightGBM feature (5-second cold-start signal)
106
+ 2. **Seed paper search + save**: initial behavioral profile via manual discovery
107
+ 3. **Ward clustering + medoid retrieval**: takes over at ~5 saves (production-grade personalization)
108
 
109
+ This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features. ORCID/S2 author import was explored and removed; manual seed search is the preferred onboarding path.
110
 
111
  ---
112
 
 
283
 
284
  ---
285
 
286
+ ### Phase 5: Cold-Start Onboarding (COMPLETE ✅)
287
 
288
+ Status: fully implemented (categories + seed search + trending fallback).
289
 
290
  Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
291
 
292
+ #### 5.1 arXiv Category Multi-Select ✅
293
+ UI screen on first visit: select 1-8 arXiv category groups.
294
+ - Used as pool filter for recommendations
295
+ - Stored as a LightGBM feature permanently (Feature 26: `onboarding_category_match`)
296
  - Does NOT create "subject vectors"; just filters
297
 
298
+ #### 5.2 Seed Paper Import ✅
299
+ Users search for and save seed papers during onboarding.
300
+ - These immediately create EWMA profiles and Ward clusters on next feed request
301
  - Bypasses the "save 5 papers before any recs" cold-start trap
302
+ - Uses hybrid search (Phase 3) for discovery
 
303
 
304
+ #### ~~5.3 ORCID / Semantic Scholar ID Import~~ ❌ REMOVED
305
+ S2 author import was implemented and then removed; it is not the onboarding direction we want.
306
+ Onboarding focuses on category selection + manual seed paper search.
307
+
308
+ #### 5.4 Popularity Fallback ✅
309
+ Category-filtered trending papers via `turso_svc.fetch_trending_by_categories()` with 1-hour TTL cache.
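The 1-hour TTL cache mentioned here can be sketched in a few lines. This is an illustrative version under stated assumptions, not the code in `turso_svc.py`: `get_trending`, the injected `fetch` callable, and the `now` parameter are hypothetical names chosen so the expiry logic can be exercised without a network call.

```python
import time
from typing import Callable

_TTL_SECONDS = 60 * 60  # 1 hour, matching the TTL described above
_cache: dict[tuple, tuple[float, list[dict]]] = {}


def get_trending(
    categories: list[str],
    limit: int,
    fetch: Callable[[list[str], int], list[dict]],
    now: Callable[[], float] = time.time,
) -> list[dict]:
    """Return cached trending papers, refetching once the entry is older than the TTL."""
    # Sort the categories so the same selection in a different order shares one entry.
    key = (tuple(sorted(categories)), limit)
    hit = _cache.get(key)
    if hit is not None and now() - hit[0] < _TTL_SECONDS:
        return hit[1]
    papers = fetch(list(categories), limit)
    if papers:
        # Empty results are not cached, so a transient failure is retried next call.
        _cache[key] = (now(), papers)
    return papers
```

Entries are only replaced on access, never swept, which is acceptable when the set of category combinations is small and fixed.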
310
 
311
  ---
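
The 1-hour TTL cache mentioned in 5.4 could look roughly like the sketch below. This is not the code in `turso_svc`; the `TTLCache` class, its `get_or_compute` method, the `category_key` helper, and the injectable clock are hypothetical names, used only to illustrate caching a category-keyed trending query for a fixed window.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value


def category_key(categories):
    # Order-insensitive key, so the same selection in a different order shares an entry
    return tuple(sorted(categories))
```

Keying on the sorted category tuple means two users who picked the same groups hit the same cached trending list, which is the main reason a category-filtered fallback is cheap to serve.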
requirements.txt CHANGED
@@ -14,7 +14,7 @@ python-multipart>=0.0.9
  FlagEmbedding>=1.2.9
  transformers>=4.44,<5.0
  pymilvus>=2.4
- groq>=0.9
+ groq>=1.0 # 1.0+ drops the `proxies` kwarg internally so httpx>=0.28 works
  python-dotenv>=1.0

  # ── Phase 6: LightGBM reranker ───────────────────────────────────────────
scripts/browser_test_onboarding.py ADDED
@@ -0,0 +1,75 @@
+ """Verify the onboarding seed-search step does not duplicate the panel."""
+ from playwright.sync_api import sync_playwright
+
+ URL = "http://127.0.0.1:7860"
+ QUERY = "attention is all you need"
+
+
+ def run():
+     with sync_playwright() as p:
+         browser = p.chromium.launch(headless=True)
+         ctx = browser.new_context(viewport={"width": 1280, "height": 1800})
+         # Use a fresh, unonboarded user so we land on /onboarding
+         ctx.add_cookies([{
+             "name": "arxiv_user_id",
+             "value": "onboarding-test-user-fresh",
+             "url": URL,
+         }])
+         page = ctx.new_page()
+
+         page.goto(URL + "/onboarding", wait_until="networkidle")
+
+         # Step 1: pick a category, click Continue
+         page.click("[data-key='nlp']")
+         page.click("#continue-btn")
+
+         # Step 2 should appear (rendered by submitCategories() via fetch + innerHTML)
+         page.wait_for_selector("#seed-results", timeout=10_000)
+
+         # Snapshot before search
+         page.screenshot(path="scripts/screenshot_onboard_step2_before.png", full_page=True)
+
+         # Now search: this is what triggered the duplication bug
+         page.fill("input[name='q']", QUERY)
+         page.click("button:has-text('Search')")
+         # wait for results to swap in
+         page.wait_for_function(
+             "document.querySelectorAll('.seed-card').length > 0",
+             timeout=15_000,
+         )
+         page.wait_for_load_state("networkidle", timeout=15_000)
+
+         page.screenshot(path="scripts/screenshot_onboard_step2_after.png", full_page=True)
+
+         # ── Inspect the DOM
+         save_panels = page.locator("h2:has-text('Save a few papers you like')").count()
+         quick_imports = page.locator("text=Quick import:").count()
+         search_inputs = page.locator("input[name='q']").count()
+         seed_counters = page.locator("#seed-counter").count()
+         done_buttons = page.locator("button:has-text('Done — start exploring')").count()
+         seed_cards = page.locator(".seed-card").count()
+         seed_card_ids = page.locator(".seed-card").evaluate_all("els => els.map(e => e.id)")
+
+         print(f"'Save a few papers you like' headings: {save_panels} (expected 1)")
+         print(f"'Quick import:' blocks: {quick_imports} (expected 1)")
+         print(f"search inputs: {search_inputs} (expected 1)")
+         print(f"#seed-counter elements: {seed_counters} (expected 1)")
+         print(f"'Done — start exploring' buttons: {done_buttons} (expected 1)")
+         print(f"seed-cards: {seed_cards}, unique ids: {len(set(seed_card_ids))}")
+
+         ok = (
+             save_panels == 1
+             and quick_imports == 1
+             and search_inputs == 1
+             and seed_counters == 1
+             and done_buttons == 1
+             and seed_cards > 0
+             and seed_cards == len(set(seed_card_ids))
+         )
+         print("\nRESULT:", "PASS" if ok else "FAIL")
+
+         browser.close()
+
+
+ if __name__ == "__main__":
+     run()
scripts/browser_test_search.py ADDED
@@ -0,0 +1,77 @@
+ """Drive a real Chromium browser to verify the search UI shows results once."""
+ from playwright.sync_api import sync_playwright
+
+ URL = "http://127.0.0.1:7860"
+ QUERY = "attention is all you need"
+
+
+ def run():
+     with sync_playwright() as p:
+         browser = p.chromium.launch(headless=True)
+         ctx = browser.new_context(
+             viewport={"width": 1280, "height": 1800},
+         )
+         # Pre-seed cookie of a user that has saves so has_recs=True
+         ctx.add_cookies([{
+             "name": "arxiv_user_id",
+             "value": "browser-test-user",
+             "url": URL,
+         }])
+         page = ctx.new_page()
+
+         # 1) Land on the homepage and search from there.
+         page.goto(URL + "/", wait_until="networkidle")
+         page.fill("input[name='q']", QUERY)
+         page.screenshot(path="scripts/screenshot_before_submit.png", full_page=True)
+
+         page.click("button[type='submit']")
+         page.wait_for_url("**/search?q=*", timeout=10_000)
+         # search.html does not auto-load anything heavy when q is set, but give it a beat
+         page.wait_for_load_state("networkidle", timeout=15_000)
+
+         page.screenshot(path="scripts/screenshot_after_search.png", full_page=True)
+
+         # 2) Inspect the DOM
+         url = page.url
+         paper_cards = page.locator(".paper-card").count()
+         recs_visible = page.locator("#rec-section").count()
+         recs_heading = page.get_by_role("heading", name="Recommended for You").count()
+         results_heading_count = page.locator("text=results for").count()
+
+         print(f"URL after search: {url}")
+         print(f".paper-card count: {paper_cards}")
+         print(f"#rec-section count: {recs_visible}")
+         print(f"'Recommended for You' heading count: {recs_heading}")
+         print(f"'results for' phrase count: {results_heading_count}")
+
+         # 3) Check for duplicate paper IDs (the original 'twice' complaint)
+         ids = page.locator("[id^='paper-']").evaluate_all(
+             "els => els.map(e => e.id)"
+         )
+         unique = set(ids)
+         print(f"paper element ids: {len(ids)} total, {len(unique)} unique")
+         if len(ids) != len(unique):
+             from collections import Counter
+             dups = [k for k, v in Counter(ids).items() if v > 1]
+             print(f"DUPLICATE IDS: {dups}")
+
+         # Title-match boost check: Vaswani's "Attention Is All You Need"
+         # (1706.03762) must be the #1 result for this exact-title query.
+         first_paper_id = page.locator("[id^='paper-']").first.get_attribute("id")
+         print(f"first paper id: {first_paper_id}")
+
+         ok = (
+             recs_visible == 0
+             and recs_heading == 0
+             and results_heading_count == 1
+             and paper_cards == len(unique)
+             and paper_cards > 0
+             and first_paper_id == "paper-1706.03762"
+         )
+         print("\nRESULT:", "PASS" if ok else "FAIL")
+
+         browser.close()
+
+
+ if __name__ == "__main__":
+     run()
scripts/diag_mamba.py ADDED
@@ -0,0 +1,69 @@
+ """Diagnose why the Mamba paper (2312.00752) is missing from search results."""
+ import asyncio
+ import sys
+ from pathlib import Path
+
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+ from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc, turso_svc
+
+ MAMBA_ID = "2312.00752"
+
+
+ async def main():
+     # Step 1: is the paper in Qdrant at all?
+     vecs = await qdrant_svc.get_paper_vectors([MAMBA_ID])
+     in_qdrant = MAMBA_ID in vecs
+     print(f"Mamba paper {MAMBA_ID} in Qdrant: {in_qdrant}")
+
+     # Step 2: is it in Turso?
+     meta = await turso_svc.fetch_metadata_batch([MAMBA_ID])
+     if MAMBA_ID in meta:
+         print(f"Mamba paper in Turso: YES - title: {meta[MAMBA_ID].get('title')!r}")
+     else:
+         print("Mamba paper in Turso: NO")
+
+     if not in_qdrant:
+         print("\n--> Paper missing from Qdrant collection. End of investigation.")
+         return
+
+     # Step 3: where does it rank in dense, sparse, and fused?
+     q = "Mamba state space model linear time"
+     dense_vec, sparse_dict = embed_svc.encode_query(q)
+     print(f"\nQuery: {q!r}")
+     print(f"Sparse keys: {len(sparse_dict)}")
+
+     fetch_k = 60
+     dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+     sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+
+     dense_ids = [r["arxiv_id"] for r in dense]
+     sparse_ids = [r["arxiv_id"] for r in sparse]
+
+     if MAMBA_ID in dense_ids:
+         print(f"\nDense rank: {dense_ids.index(MAMBA_ID)+1}/{fetch_k}")
+     else:
+         print(f"\nDense top {fetch_k}: NOT present")
+
+     if MAMBA_ID in sparse_ids:
+         print(f"Sparse rank: {sparse_ids.index(MAMBA_ID)+1}/{fetch_k}")
+     else:
+         print(f"Sparse top {fetch_k}: NOT present")
+
+     fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+     fused_ids = [item["arxiv_id"] for item in fused]
+     if MAMBA_ID in fused_ids:
+         print(f"RRF fused rank: {fused_ids.index(MAMBA_ID)+1}")
+     else:
+         print(f"RRF fused: NOT present in top {len(fused_ids)}")
+
+     # Show top 5 of each
+     print(f"\n=== Dense top 5 ===")
+     for r in dense[:5]:
+         print(f" {r['arxiv_id']} score={r['score']:.4f}")
+     print(f"\n=== Sparse top 5 ===")
+     for r in sparse[:5]:
+         print(f" {r['arxiv_id']} score={r['score']:.4f}")
+
+
+ asyncio.run(main())
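
The script above calls `hybrid_search_svc._rrf_fuse(dense, sparse, k=60)`, whose body is not part of this hunk. For reference, standard reciprocal rank fusion scores each document as the sum of `1 / (k + rank)` over the lists it appears in. The sketch below assumes hits are dicts with an `arxiv_id` key (as the script indexes them) and emits an `rrf_score` field; `rrf_fuse_sketch` is a hypothetical stand-in, and the real function may differ.

```python
from collections import defaultdict


def rrf_fuse_sketch(*ranked_lists, k=60):
    """Reciprocal rank fusion over any number of ranked hit lists.

    Each hit is a dict with an "arxiv_id" key; rank is its 1-based position.
    """
    scores = defaultdict(float)
    for hits in ranked_lists:
        for rank, hit in enumerate(hits, start=1):
            scores[hit["arxiv_id"]] += 1.0 / (k + rank)
    fused = [{"arxiv_id": aid, "rrf_score": s} for aid, s in scores.items()]
    # Highest fused score first; ties broken by id for a deterministic order
    fused.sort(key=lambda item: (-item["rrf_score"], item["arxiv_id"]))
    return fused
```

With `k=60`, a paper ranked 60th in both lists scores 2/120 ≈ 0.0167, narrowly above a paper ranked first in only one list (1/61 ≈ 0.0164). That is one way a strong single-retriever match can still land below weaker papers that both retrievers agree on.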
scripts/diag_search_rank.py ADDED
@@ -0,0 +1,45 @@
+ """Trace where Vaswani's paper falls in the hybrid pipeline."""
+ import asyncio
+ from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc
+
+ VASWANI = "1706.03762"
+
+
+ async def main():
+     q = "attention is all you need"
+     dense_vec, sparse_dict = embed_svc.encode_query(q)
+     print(f"sparse keys: {len(sparse_dict)}")
+
+     fetch_k = 60
+     dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+     sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+     dense_ids = [r["arxiv_id"] for r in dense]
+     sparse_ids = [r["arxiv_id"] for r in sparse]
+
+     print(f"\nVaswani in dense top {fetch_k}: ", VASWANI in dense_ids,
+           (f"(rank {dense_ids.index(VASWANI)+1})" if VASWANI in dense_ids else ""))
+     print(f"Vaswani in sparse top {fetch_k}: ", VASWANI in sparse_ids,
+           (f"(rank {sparse_ids.index(VASWANI)+1})" if VASWANI in sparse_ids else ""))
+
+     fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+     fused_ids = [item["arxiv_id"] for item in fused]
+     v_rank_rrf = fused_ids.index(VASWANI) + 1 if VASWANI in fused_ids else None
+     print(f"\nVaswani rank after pure RRF: {v_rank_rrf}")
+
+     print("\n=== Pure RRF (no recency), top 10 ===")
+     for i, item in enumerate(fused[:10], 1):
+         marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+         print(f" {i:2d}. {item['arxiv_id']} rrf={item['rrf_score']:.4f}{marker}")
+
+     ranked = hybrid_search_svc._recency_rerank([dict(x) for x in fused])
+     ranked_ids = [item["arxiv_id"] for item in ranked]
+     v_rank_recency = ranked_ids.index(VASWANI) + 1 if VASWANI in ranked_ids else None
+     print(f"\nVaswani rank after current 0.80/0.20 recency rerank: {v_rank_recency}")
+
+     print("\n=== Current rerank (0.80 RRF + 0.20 recency), top 10 ===")
+     for i, item in enumerate(ranked[:10], 1):
+         marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+         print(f" {i:2d}. {item['arxiv_id']} final={item['final_score']:.4f}{marker}")
+
+
+ asyncio.run(main())
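
The script labels the current rerank "0.80 RRF + 0.20 recency", but `_recency_rerank` itself is not shown here. One plausible reading is a linear blend of a normalized RRF score with a recency score in (0, 1]. The sketch below is a guess along those lines, not the project's implementation: deriving age from the arXiv id, the 24-month half-life, max-normalizing RRF, and the names `recency_from_arxiv_id` and `recency_rerank_sketch` are all assumptions made for illustration.

```python
def recency_from_arxiv_id(arxiv_id, now_year, now_month, half_life_months=24.0):
    """Exponential-decay recency in (0, 1] from a new-style arXiv id (YYMM.NNNNN).

    Old-style ids such as "hep-th/9901001" are not handled by this sketch.
    """
    yymm = arxiv_id.split(".")[0]
    year, month = 2000 + int(yymm[:2]), int(yymm[2:4])
    age_months = max(0, (now_year - year) * 12 + (now_month - month))
    return 0.5 ** (age_months / half_life_months)


def recency_rerank_sketch(fused, now_year, now_month, w_rrf=0.80, w_recency=0.20):
    """Blend max-normalized RRF with recency; returns new dicts sorted by final_score."""
    top = max((item["rrf_score"] for item in fused), default=0.0) or 1.0
    ranked = []
    for item in fused:
        recency = recency_from_arxiv_id(item["arxiv_id"], now_year, now_month)
        final = w_rrf * (item["rrf_score"] / top) + w_recency * recency
        ranked.append({**item, "final_score": final})  # copy; input is not mutated
    ranked.sort(key=lambda item: -item["final_score"])
    return ranked
```

Under these assumptions, a 2017 paper that leads on RRF by a small margin can drop below a near-tied recent paper once the 0.20 recency term is added, which is the kind of demotion the trace above is designed to reveal.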
scripts/e2e_audit.py ADDED
@@ -0,0 +1,622 @@
+ """
+ End-to-end audit of the ResearchIT recommendation pipeline.
+
+ Steps:
+   1. Smoke test: hybrid search (10 queries, per-layer scores)
+   2. User profile pipeline: EWMA update + Ward clustering
+   3. Recommendation feed generation with quota fusion
+   4. LightGBM reranker pass
+   5. MMR diversity + exploration, then gap analysis
+
+ Run: python scripts/e2e_audit.py
+ """
+ from __future__ import annotations
+ import asyncio, sys, time, json, struct
+ from pathlib import Path
+ import numpy as np
+
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+ # ── Imports ──────────────────────────────────────────────────────────────────
+
+ from app import hybrid_search_svc, turso_svc, embed_svc, qdrant_svc, zilliz_svc, groq_svc, db
+ from app.recommend import profiles, clustering
+ from app.recommend.reranker import (
+     rerank_candidates, compute_features, heuristic_score,
+     is_model_loaded, get_num_trees, FEATURE_NAMES,
+ )
+ from app.recommend.diversity import mmr_rerank, inject_exploration
+
+ # ── Globals ──────────────────────────────────────────────────────────────────
+
+ ERRORS: list[str] = []
+ WRONG_OUTPUTS: list[str] = []
+ MISSING: list[str] = []
+ TEST_USER = "e2e_audit_user_001"
+
+ # ── Helpers ──────────────────────────────────────────────────────────────────
+
+ def banner(text: str):
+     print(f"\n{'='*90}")
+     print(f" {text}")
+     print(f"{'='*90}\n")
+
+ def check(label: str, condition: bool, detail: str = ""):
+     status = "OK" if condition else "FAIL"
+     msg = f" [{status:>4}] {label}"
+     if detail:
+         msg += f" -- {detail}"
+     print(msg)
+     if not condition:
+         WRONG_OUTPUTS.append(f"{label}: {detail}")
+
+
+ # ═══════════════════════════════════════════════════════════════════════════════
+ # STEP 1: SMOKE TEST: HYBRID SEARCH
+ # ═══════════════════════════════════════════════════════════════════════════════
+
+ SEARCH_QUERIES = [
+     "vision transformer image classification",
+     "reinforcement learning reward shaping",
+     "large language model fine-tuning RLHF",
+     "graph neural network drug discovery",
+     "federated learning differential privacy",
+     "attention is all you need",
+     "diffusion models image generation",
+     "knowledge distillation BERT compression",
+     "object detection YOLO real-time",
+     "protein structure prediction deep learning",
+ ]
+
+
+ async def step1_search():
+     banner("STEP 1: HYBRID SEARCH SMOKE TEST")
+     print(f"Running {len(SEARCH_QUERIES)} queries...\n")
+
+     all_latencies = []
+     all_results_count = []
+
+     for i, q in enumerate(SEARCH_QUERIES, 1):
+         t0 = time.perf_counter()
+         try:
+             results = await hybrid_search_svc.search(q, limit=10)
+             elapsed = (time.perf_counter() - t0) * 1000
+         except Exception as e:
+             ERRORS.append(f"Step 1: Query {q!r} threw {type(e).__name__}: {e}")
+             print(f" Q{i}: {q!r} -> ERROR: {e}")
+             continue
+
+         all_latencies.append(elapsed)
+         all_results_count.append(len(results))
+
+         # Fetch metadata for display
+         meta = {}
+         if results:
+             try:
+                 meta = await turso_svc.fetch_metadata_batch(results)
+             except Exception as e:
+                 ERRORS.append(f"Step 1: Metadata fetch failed for {q!r}: {e}")
+
+         print(f" Q{i}: {q!r}")
+         print(f" Results: {len(results)} | Latency: {elapsed:.0f}ms")
+
+         for rank, aid in enumerate(results[:5], 1):
+             m = meta.get(aid, {})
+             title = (m.get("title") or "?")[:65]
+             cites = m.get("citation_count", 0) or 0
+             print(f" {rank}. [{cites:>6} cites] {aid:14s} {title}")
+
+         # Relevance check: do at least 2 of the top 5 titles contain 2+ query words?
+         if results and meta:
+             q_words = set(q.lower().split())
+             relevant = 0
+             for aid in results[:5]:
+                 t = (meta.get(aid, {}).get("title") or "").lower()
+                 matches = sum(1 for w in q_words if w in t)
+                 if matches >= 2:
+                     relevant += 1
+             check(f"Q{i} relevance ({relevant}/5 top results overlap query terms)",
+                   relevant >= 2,
+                   f"{q!r}")
+
+         print()
+
+     # Summary
+     if all_latencies:
+         print(f" --- Search Summary ---")
+         print(f" Queries: {len(all_latencies)}")
+         print(f" Avg latency: {sum(all_latencies)/len(all_latencies):.0f}ms")
+         print(f" p50: {sorted(all_latencies)[len(all_latencies)//2]:.0f}ms")
+         print(f" Max: {max(all_latencies):.0f}ms")
+         zero_results = sum(1 for c in all_results_count if c == 0)
+         print(f" Zero-result queries: {zero_results}")
+         if zero_results > 0:
+             ERRORS.append(f"Step 1: {zero_results} queries returned 0 results")
+
+
+ # ═══════════════════════════════════════════════════════════════════════════════
+ # STEP 2: USER PROFILE PIPELINE
+ # ═══════════════════════════════════════════════════════════════════════════════
+
+ # Real paper IDs from known categories:
+ # CV papers (computer vision)
+ CV_PAPERS = [
+     "1512.03385",  # ResNet
+     "2010.11929",  # ViT
+     "2105.01601",  # Swin Transformer
+     "2106.08254",  # BEiT
+     "1409.1556",  # VGGNet
+ ]
+ # LLM papers (NLP / language models)
+ LLM_PAPERS = [
+     "1706.03762",  # Attention Is All You Need
+     "1810.04805",  # BERT
+     "2005.14165",  # GPT-3
+     "2303.08774",  # GPT-4
+     "2302.13971",  # LLaMA
+ ]
+
+ ALL_SEED_PAPERS = CV_PAPERS + LLM_PAPERS
+
+
+ async def step2_profiles():
+     banner("STEP 2: USER PROFILE PIPELINE")
+
+     # Initialize DB
+     await db.init_db()
+     print(f" Test user: {TEST_USER}")
+     print(f" Seed papers: {len(ALL_SEED_PAPERS)} (5 CV + 5 LLM)")
+
+     # Step 2a: Retrieve embeddings for seed papers from Qdrant (batch)
+     print(f"\n Fetching embeddings from Qdrant for {len(ALL_SEED_PAPERS)} papers...")
+     embeddings = {}
+     try:
+         vecs = await qdrant_svc.get_paper_vectors(ALL_SEED_PAPERS)
+         for aid, vec in vecs.items():
+             embeddings[aid] = np.array(vec, dtype=np.float32)
+         missing = [a for a in ALL_SEED_PAPERS if a not in embeddings]
+         if missing:
+             print(f" WARN: No vectors for {len(missing)} papers: {missing[:3]}...")
+     except Exception as e:
+         print(f" ERROR: get_paper_vectors -> {e}")
+         ERRORS.append(f"Step 2: get_paper_vectors failed: {e}")
+
+     print(f" Retrieved {len(embeddings)}/{len(ALL_SEED_PAPERS)} embeddings")
+
+     if len(embeddings) < 5:
+         ERRORS.append(f"Step 2: Only {len(embeddings)} embeddings retrieved, need >= 5")
+         print(" ABORT: Not enough embeddings to continue Step 2")
+         return None, None
+
+     # Step 2b: EWMA profile updates
+     print(f"\n Running EWMA profile updates (alpha_long={profiles.ALPHA_LONG_TERM}, "
+           f"alpha_short={profiles.ALPHA_SHORT_TERM})...")
+
+     for aid in ALL_SEED_PAPERS:
+         if aid not in embeddings:
+             continue
+         try:
+             await profiles.update_on_save(TEST_USER, embeddings[aid])
+         except Exception as e:
+             ERRORS.append(f"Step 2: EWMA update failed for {aid}: {e}")
+             print(f" ERROR: update_on_save({aid}) -> {e}")
+
+     # Load profiles back
+     lt_vec = await profiles.load_profile(TEST_USER, "long_term")
+     st_vec = await profiles.load_profile(TEST_USER, "short_term")
+     lt_count = await profiles.get_interaction_count(TEST_USER, "long_term")
+     st_count = await profiles.get_interaction_count(TEST_USER, "short_term")
+
+     check("Long-term profile exists", lt_vec is not None)
+     check("Short-term profile exists", st_vec is not None)
+     check(f"Long-term interaction count = {lt_count}", lt_count == len(embeddings),
+           f"expected {len(embeddings)}")
+     check(f"Short-term interaction count = {st_count}", st_count == len(embeddings),
+           f"expected {len(embeddings)}")
+
+     if lt_vec is not None:
+         lt_norm = float(np.linalg.norm(lt_vec))
+         check(f"Long-term vector L2-norm ~= 1.0 (actual: {lt_norm:.4f})",
+               abs(lt_norm - 1.0) < 0.01)
+
+     if st_vec is not None:
+         st_norm = float(np.linalg.norm(st_vec))
+         check(f"Short-term vector L2-norm ~= 1.0 (actual: {st_norm:.4f})",
+               abs(st_norm - 1.0) < 0.01)
+
+     # Step 2c: Ward hierarchical clustering
+     print(f"\n Running Ward clustering on {len(embeddings)} paper embeddings...")
+
+     paper_ids = list(embeddings.keys())
+     emb_matrix = np.stack([embeddings[aid] for aid in paper_ids])
+
+     try:
+         clusters = clustering.compute_clusters(
+             paper_ids=paper_ids,
+             embeddings=emb_matrix,
+         )
+     except Exception as e:
+         ERRORS.append(f"Step 2: compute_clusters failed: {e}")
+         print(f" ERROR: {e}")
+         return lt_vec, st_vec
+
+     print(f" Clusters found: {len(clusters)}")
+     for c in clusters:
+         print(f" Cluster {c.cluster_idx}: medoid={c.medoid_paper_id}, "
+               f"papers={len(c.paper_ids)}, importance={c.importance:.3f}")
+         for pid in c.paper_ids:
+             label = "CV" if pid in CV_PAPERS else "LLM" if pid in LLM_PAPERS else "?"
+             print(f" - {pid} [{label}]")
+
+     check(f"Number of clusters >= 2 (actual: {len(clusters)})",
+           len(clusters) >= 2,
+           "CV and LLM papers should form distinct clusters")
+
+     # Check cluster purity
+     for c in clusters:
+         cv_count = sum(1 for p in c.paper_ids if p in CV_PAPERS)
+         llm_count = sum(1 for p in c.paper_ids if p in LLM_PAPERS)
+         total = len(c.paper_ids)
+         purity = max(cv_count, llm_count) / total if total > 0 else 0
+         dominant = "CV" if cv_count > llm_count else "LLM"
+         check(f"Cluster {c.cluster_idx} purity ({dominant}: {purity:.0%})",
+               purity >= 0.6,
+               f"{cv_count} CV + {llm_count} LLM papers")
+
+     # Save clusters for Step 3
+     try:
+         await clustering.save_clusters_to_db(TEST_USER, clusters)
+     except Exception as e:
+         ERRORS.append(f"Step 2: save_clusters_to_db failed: {e}")
+
+     return lt_vec, st_vec
+
+
+ # ═══════════════════════════════════════════════════════════════════════════════
+ # STEP 3: RECOMMENDATION FEED GENERATION
+ # ═══════════════════════════════════════════════════════════════════════════════
+
+ async def step3_recommendation_feed(lt_vec, st_vec):
+     banner("STEP 3: RECOMMENDATION FEED GENERATION")
+
+     if lt_vec is None:
+         ERRORS.append("Step 3: Skipped - no long-term profile from Step 2")
+         print(" SKIPPED: No profile vectors from Step 2")
+         return None, None, None
+
+     # Load clusters from DB
+     clusters = await clustering.load_clusters_from_db(TEST_USER)
+     if not clusters:
+         ERRORS.append("Step 3: No clusters found in DB")
+         print(" SKIPPED: No clusters in DB")
+         return None, None, None
+
+     print(f" Loaded {len(clusters)} clusters from DB")
+     print(f" Target feed size: 20 papers")
+
+     # Step 3a: Search for candidates per cluster (using medoid embeddings)
+     all_candidates: dict[str, dict] = {}  # arxiv_id -> metadata
+     all_embeddings: dict[str, np.ndarray] = {}
+     cluster_assignments: dict[str, int] = {}  # arxiv_id -> cluster_idx
+     seen = set(ALL_SEED_PAPERS)
+
+     t0 = time.perf_counter()
+
+     # Get medoid vectors in batch
+     medoid_ids = [c["medoid_paper_id"] for c in clusters]
+     medoid_vecs = await qdrant_svc.get_paper_vectors(medoid_ids)
+
+     for c in clusters:
+         mid = c["medoid_paper_id"]
+         medoid_vec = None
+
+         # Try stored blob first
+         if c.get("medoid_embedding_blob"):
+             medoid_vec = np.frombuffer(c["medoid_embedding_blob"], dtype=np.float32)
+
+         # Fallback: batch-fetched vector
+         if medoid_vec is None and mid in medoid_vecs:
+             medoid_vec = np.array(medoid_vecs[mid], dtype=np.float32)
+
+         if medoid_vec is None:
+             ERRORS.append(f"Step 3: No medoid vector for cluster {c['cluster_idx']}")
+             continue
+
+         # Search Qdrant for similar papers (with scores + vectors)
+         try:
+             results = await qdrant_svc.search_by_vector_with_scores(
+                 medoid_vec.tolist(), limit=30, with_vectors=True
+             )
+         except Exception as e:
+             ERRORS.append(f"Step 3: search failed for cluster {c['cluster_idx']}: {e}")
+             continue
+
+         # Filter out seen papers
+         for r in results:
+             aid = r["arxiv_id"]
+             if aid in seen:
+                 continue
+             all_candidates[aid] = {"score": r["score"]}
+             cluster_assignments[aid] = c["cluster_idx"]
+             if "vector" in r:
+                 all_embeddings[aid] = np.array(r["vector"], dtype=np.float32)
+             seen.add(aid)
+             if len([a for a in cluster_assignments if cluster_assignments[a] == c["cluster_idx"]]) >= 15:
+                 break
+
+     elapsed_search = (time.perf_counter() - t0) * 1000
+     print(f" Candidate search: {len(all_candidates)} papers in {elapsed_search:.0f}ms")
+
+     if not all_candidates:
+         ERRORS.append("Step 3: Zero candidates retrieved")
+         print(" ABORT: No candidates")
+         return None, None, None
+
+     # Step 3b: Fetch metadata
+     cand_ids = list(all_candidates.keys())
+     try:
+         meta = await turso_svc.fetch_metadata_batch(cand_ids)
+     except Exception as e:
+         ERRORS.append(f"Step 3: metadata fetch failed: {e}")
+         meta = {}
+
+     # Step 3c: Fetch embeddings for candidates (use what we got from search + batch fetch rest)
+     cand_embeddings = dict(all_embeddings)  # Already have some from with_vectors=True
+     missing_emb = [aid for aid in cand_ids if aid not in cand_embeddings]
+     if missing_emb:
+         print(f" Fetching {len(missing_emb)} missing embeddings from Qdrant...")
+         try:
+             extra = await qdrant_svc.get_paper_vectors(missing_emb)
+             for aid, vec in extra.items():
+                 cand_embeddings[aid] = np.array(vec, dtype=np.float32)
+         except Exception as e:
+             print(f" WARN: batch vector fetch failed: {e}")
+
+     print(f" Got {len(cand_embeddings)}/{len(cand_ids)} embeddings")
+
+     # Build aligned arrays
+     valid_ids = [aid for aid in cand_ids if aid in cand_embeddings and aid in meta]
+     if len(valid_ids) < 5:
+         ERRORS.append(f"Step 3: Only {len(valid_ids)} valid candidates")
+         print(f" ABORT: Not enough valid candidates")
+         return None, None, None
+
+     emb_matrix = np.stack([cand_embeddings[aid] for aid in valid_ids])
+     meta_list = [meta[aid] for aid in valid_ids]
+
+     # Step 3d: Print the raw candidate feed
+     print(f"\n Raw candidate feed ({len(valid_ids)} papers):")
+     cluster_counts: dict[int, int] = {}
+     for i, aid in enumerate(valid_ids[:20]):
+         m = meta.get(aid, {})
+         title = (m.get("title") or "?")[:55]
+         cites = m.get("citation_count", 0) or 0
+         cidx = cluster_assignments.get(aid, -1)
+         cluster_counts[cidx] = cluster_counts.get(cidx, 0) + 1
+         print(f" {i+1:2d}. [C{cidx}] [{cites:>6} cites] {title}")
+
+     print(f"\n Cluster distribution in top 20:")
+     for cidx, count in sorted(cluster_counts.items()):
+         print(f" Cluster {cidx}: {count} papers")
+
+     total_feed = (time.perf_counter() - t0) * 1000
+     print(f" Total feed generation: {total_feed:.0f}ms")
+
+     return valid_ids, emb_matrix, meta_list
+
+
+ # ═══════════════════════════════════════════════════════════════════════════════
+ # STEP 4: LIGHTGBM RERANKER
+ # ═══════════════════════════════════════════════════════════════════════════════
+
+ async def step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec):
+     banner("STEP 4: LIGHTGBM RERANKER")
+
+     if valid_ids is None:
+         print(" SKIPPED: No candidates from Step 3")
+         return
+
+     print(f" Model loaded: {is_model_loaded()}")
+     if is_model_loaded():
+         print(f" Trees: {get_num_trees()}")
+     else:
+         MISSING.append("LightGBM model not loaded - using heuristic fallback")
+
+     n = min(len(valid_ids), 20)
+     ids_subset = valid_ids[:n]
+     emb_subset = emb_matrix[:n]
+     meta_subset = meta_list[:n]
+
+     print(f" Running reranker on {n} candidates...")
+     t0 = time.perf_counter()
+
+     try:
+         sorted_ids, sorted_scores, sorted_embs = rerank_candidates(
+             ids_subset,
+             emb_subset,
+             meta_subset,
+             lt_vec,
+             st_vec,
+             None,  # no negative profile
+         )
+         elapsed = (time.perf_counter() - t0) * 1000
+     except Exception as e:
+         ERRORS.append(f"Step 4: rerank_candidates failed: {e}")
+         print(f" ERROR: {e}")
+         return
+
+     print(f" Reranker latency: {elapsed:.0f}ms")
+     print(f"\n Reranked order (top 10):")
+
+     # Fetch metadata for display
+     re_meta = {}
+     try:
+         re_meta = await turso_svc.fetch_metadata_batch(sorted_ids[:10])
+     except Exception:
+         pass
+
+     for i, (aid, score) in enumerate(zip(sorted_ids[:10], sorted_scores[:10]), 1):
+         m = re_meta.get(aid, {})
+         title = (m.get("title") or "?")[:55]
+         cites = m.get("citation_count", 0) or 0
+         old_rank = ids_subset.index(aid) + 1 if aid in ids_subset else "?"
+         print(f" {i:2d}. (was #{old_rank:>2}) [{cites:>6} cites] score={score:.4f} {title}")
+
+     # Feature analysis for top 3 and bottom 3
+     features = compute_features(emb_subset, meta_subset, lt_vec, st_vec, None)
+     print(f"\n Feature snapshot (top 3 reranked papers):")
+     for rank_idx in range(min(3, len(sorted_ids))):
+         aid = sorted_ids[rank_idx]
+         orig_idx = ids_subset.index(aid)
+         f = features[orig_idx]
+         print(f" #{rank_idx+1} {aid}:")
+         print(f" qdrant_cosine={f[0]:.3f} lt_sim={f[20]:.3f} st_sim={f[21]:.3f} "
+               f"cites={f[2]:.0f} recency={f[6]:.3f} age_days={f[5]:.0f}")
+
+     if len(sorted_ids) >= 3:
+         print(f"\n Feature snapshot (bottom 3 reranked papers):")
+         for rank_idx in range(max(0, len(sorted_ids)-3), len(sorted_ids)):
+             aid = sorted_ids[rank_idx]
+             orig_idx = ids_subset.index(aid)
+             f = features[orig_idx]
+             print(f" #{rank_idx+1} {aid}:")
+             print(f" qdrant_cosine={f[0]:.3f} lt_sim={f[20]:.3f} st_sim={f[21]:.3f} "
+                   f"cites={f[2]:.0f} recency={f[6]:.3f} age_days={f[5]:.0f}")
+
+     # Check: did reranking change anything?
+     moved = sum(1 for i, aid in enumerate(sorted_ids) if aid != ids_subset[i])
+     check(f"Reranker changed {moved}/{n} positions", moved > 0,
+           "Reranker should reorder candidates based on features")
+
+
+ # ═══════════════════════════════════════════════════════════════════════════════
+ # STEP 5: MMR DIVERSITY + EXPLORATION
+ # ═══════════════════════════════════════════════════════════════════════════════
+
+ async def step5_diversity(valid_ids, emb_matrix, lt_vec):
+     banner("STEP 5: MMR DIVERSITY + EXPLORATION")
+
+     if valid_ids is None or lt_vec is None:
+         print(" SKIPPED: No data from previous steps")
+         return
+
+     n = min(len(valid_ids), 30)
+     print(f" Running MMR (lambda=0.6) on {n} candidates, selecting 15...")
+
+     t0 = time.perf_counter()
+     try:
+         mmr_ids = mmr_rerank(
+             lt_vec, emb_matrix[:n], valid_ids[:n],
+             lambda_param=0.6, top_k=15,
+         )
+         elapsed = (time.perf_counter() - t0) * 1000
+     except Exception as e:
+         ERRORS.append(f"Step 5: mmr_rerank failed: {e}")
+         print(f" ERROR: {e}")
+         return
+
+     print(f" MMR latency: {elapsed:.0f}ms")
+     print(f" MMR selected {len(mmr_ids)} papers")
+
+     # Check rank changes
+     moved = sum(1 for i, aid in enumerate(mmr_ids) if i < len(valid_ids) and aid != valid_ids[i])
+     print(f" Rank changes vs input: {moved}/{len(mmr_ids)}")
+
+     # Exploration injection
+     with_explore = inject_exploration(mmr_ids, valid_ids[:n], n_explore=2, seed=42)
+     explore_count = len(with_explore) - len(mmr_ids)
+     print(f" Exploration injected: {explore_count} papers")
+     check("Exploration added papers", explore_count > 0 or len(valid_ids[:n]) <= len(mmr_ids))
+
+     # Check diversity: compute avg pairwise cosine among selected
+     selected_embs = []
+     for aid in mmr_ids[:10]:
+         if aid in valid_ids:
+             idx = valid_ids.index(aid)
+             if idx < len(emb_matrix):
+                 selected_embs.append(emb_matrix[idx])
+
+     if len(selected_embs) >= 2:
+         sel_matrix = np.stack(selected_embs)
+         norms = sel_matrix / (np.linalg.norm(sel_matrix, axis=1, keepdims=True) + 1e-10)
+         sim_matrix = norms @ norms.T
+         # Average off-diagonal similarity
+         mask = ~np.eye(len(sel_matrix), dtype=bool)
+         avg_sim = sim_matrix[mask].mean()
+         print(f" Avg pairwise cosine among top 10 MMR picks: {avg_sim:.3f}")
+         check("MMR diversity (avg pairwise sim < 0.85)", avg_sim < 0.85,
+               f"actual: {avg_sim:.3f}")
+
550
+
551
+ # ═════════════════════════════════════════════════════════════════════════════
551
+ # STEP 6 - GAP ANALYSIS
552
+ # ═════════════════════════════════════════════════════════════════════════════
554
+
555
+ def step6_gap_analysis():
556
+ banner("STEP 6: GAP ANALYSIS")
557
+
558
+ print(" ERRORS (things that threw exceptions or returned empty):")
559
+ if ERRORS:
560
+ for e in ERRORS:
561
+ print(f" - {e}")
562
+ else:
563
+ print(" (none)")
564
+
565
+ print("\n WRONG OUTPUTS (things that ran but returned bad results):")
566
+ if WRONG_OUTPUTS:
567
+ for w in WRONG_OUTPUTS:
568
+ print(f" - {w}")
569
+ else:
570
+ print(" (none)")
571
+
572
+ print("\n MISSING PIECES (not implemented or not loaded):")
573
+ if MISSING:
574
+ for m in MISSING:
575
+ print(f" - {m}")
576
+ else:
577
+ print(" (none)")
578
+
579
+ print(f"\n Totals: {len(ERRORS)} errors, {len(WRONG_OUTPUTS)} wrong outputs, {len(MISSING)} missing")
580
+
581
+ # Verdict
582
+ total_issues = len(ERRORS) + len(WRONG_OUTPUTS) + len(MISSING)
583
+ if total_issues == 0:
584
+ print("\n VERDICT: ALL CLEAR")
585
+ else:
586
+ print(f"\n VERDICT: {total_issues} issues found")
587
+
588
+
589
+ # ═════════════════════════════════════════════════════════════════════════════
589
+ # MAIN
590
+ # ═════════════════════════════════════════════════════════════════════════════
592
+
593
+ async def main():
594
+ banner("RESEARCHIT E2E PIPELINE AUDIT")
595
+ print(" Warming up BGE-M3 + services...")
596
+ embed_svc.encode_query("warmup")
597
+ await turso_svc.fetch_metadata_batch(["1706.03762"])
598
+ print(" Ready.\n")
599
+
600
+ # Step 1: Search
601
+ await step1_search()
602
+
603
+ # Step 2: Profiles + Clustering
604
+ lt_vec, st_vec = await step2_profiles()
605
+
606
+ # Step 3: Recommendation feed
607
+ valid_ids, emb_matrix, meta_list = await step3_recommendation_feed(lt_vec, st_vec)
608
+
609
+ # Step 4: Reranker
610
+ await step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec)
611
+
612
+ # Step 5: MMR Diversity
613
+ await step5_diversity(valid_ids, emb_matrix, lt_vec)
614
+
615
+ # Step 6: Gap analysis
616
+ step6_gap_analysis()
617
+
618
+ banner("AUDIT COMPLETE")
619
+
620
+
621
+ if __name__ == "__main__":
622
+ asyncio.run(main())
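Note on Step 5: `mmr_rerank` is imported from the app and its body is not part of this file, so the following is only a minimal sketch of greedy MMR selection and of the average pairwise cosine check used in the audit. It assumes plain Python lists instead of NumPy arrays; `cosine`, `mmr_select`, and `avg_pairwise_cosine` are hypothetical names, not the app's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query, embeddings, ids, lambda_param=0.6, top_k=15):
    """Greedy MMR (hypothetical stand-in for the app's mmr_rerank).

    Each round picks the candidate that maximizes
    lambda * sim(query, c) - (1 - lambda) * max sim(c, already selected).
    """
    relevance = [cosine(query, emb) for emb in embeddings]
    remaining = list(range(len(ids)))
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(i):
            # Redundancy is the closest match among papers already picked.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * relevance[i] - (1.0 - lambda_param) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [ids[i] for i in selected]


def avg_pairwise_cosine(vectors):
    """Mean cosine similarity over all unordered pairs (0.0 for fewer than 2 vectors)."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
```

With `lambda_param` close to 1 the ordering follows relevance alone; lower values penalize near-duplicates more strongly, which is the behavior the `avg_sim < 0.85` check is probing.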
scripts/eval_expanded_queries.py ADDED
@@ -0,0 +1,336 @@
1
+ """
2
+ Expanded search quality evaluation - realistic user queries.
3
+
4
+ The original eval_search_quality.py uses 21 queries across 5 bands (A-E).
5
+ This script expands to 8 categories that simulate REAL users of an academic
6
+ paper search engine, not just known-item lookups and adversarial tests.
7
+
8
+ Categories:
9
+ F: Beginner / Newcomer - "explain like I'm starting a research project"
10
+ G: Research-in-Progress - "I know the field, looking for specific work"
11
+ H: Implementation-Focused - "I want to BUILD something"
12
+ I: Comparative / Survey - "compare X vs Y" or "survey of Z"
13
+ J: Emerging / Cutting-Edge - "what's new in X?"
14
+ K: Cross-Domain - "applying X from domain A to domain B"
15
+ L: Vague / Exploratory - underspecified queries that real users actually type
16
+ M: Follow-up / Refinement - queries that build on prior context
17
+
18
+ Run: python scripts/eval_expanded_queries.py
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import asyncio
23
+ import json
24
+ import sys
25
+ import time
26
+ from pathlib import Path
27
+
28
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
29
+
30
+ from app import hybrid_search_svc
31
+ from app import turso_svc
32
+ from app import embed_svc
33
+ from app import groq_svc
34
+
35
+
36
+ # ── Query definitions ────────────────────────────────────────────────────────
37
+
38
+ # (band, query, expected_arxiv_id_or_None, description)
39
+ QUERIES: list[tuple[str, str, str | None, str]] = [
40
+
41
+ # ── Band A (original): Known-item titles ─────────────────────────────────
42
+ ("A", "attention is all you need", "1706.03762",
43
+ "Landmark transformer paper by Vaswani et al."),
44
+ ("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805",
45
+ "Full BERT title - should be exact #1"),
46
+ ("A", "Deep Residual Learning for Image Recognition", "1512.03385",
47
+ "ResNet - one of the most-cited CV papers"),
48
+
49
+ # ── Band F: Beginner / Newcomer queries ──────────────────────────────────
50
+ # These simulate a student or newcomer who doesn't know the jargon.
51
+ ("F", "how do transformers work in NLP", None,
52
+ "Newcomer asking about transformer basics"),
53
+ ("F", "what is reinforcement learning from human feedback", None,
54
+ "Beginner asking about RLHF - should surface Ouyang/InstructGPT/Christiano"),
55
+ ("F", "explain how neural networks learn", None,
56
+ "Very basic - should return foundational/survey papers"),
57
+ ("F", "what are diffusion models and how do they generate images", None,
58
+ "Beginner asking about DDPM/Stable Diffusion family"),
59
+ ("F", "how does GPT-4 work", None,
60
+ "Newcomer asking about GPT-4 - should surface the technical report"),
61
+
62
+ # ── Band G: Research-in-Progress queries ─────────────────────────────────
63
+ # These simulate a PhD student deep in their research.
64
+ ("G", "contrastive learning for self-supervised visual representations", None,
65
+ "Should return SimCLR, MoCo, BYOL, DINO etc."),
66
+ ("G", "knowledge distillation from large language models to smaller ones", None,
67
+ "Distillation pipeline - DistilBERT, TinyBERT, knowledge distillation surveys"),
68
+ ("G", "graph neural networks for molecular property prediction", None,
69
+ "GNN + chemistry - SchNet, DimeNet, MPNN papers"),
70
+ ("G", "efficient inference for large language models quantization pruning", None,
71
+ "LLM compression - GPTQ, AWQ, SparseGPT, pruning surveys"),
72
+ ("G", "causal inference in observational studies with machine learning", None,
73
+ "Causal ML - double ML, causal forests, CATE estimation"),
74
+ ("G", "multi-task learning with shared representations", None,
75
+ "MTL surveys, hard/soft parameter sharing, task relationships"),
76
+
77
+ # ── Band H: Implementation-Focused queries ───────────────────────────────
78
+ # These simulate someone who wants to BUILD something.
79
+ ("H", "how to fine-tune a pre-trained language model for classification", None,
80
+ "Practical fine-tuning - ULMFiT, how-to-fine-tune-BERT papers"),
81
+ ("H", "implementing attention mechanism from scratch", None,
82
+ "Implementation-level detail - attention tutorials, scaled dot product"),
83
+ ("H", "best practices for training stable diffusion models", None,
84
+ "Practical SD training - latent diffusion, classifier-free guidance"),
85
+ ("H", "building a retrieval augmented generation system", None,
86
+ "RAG - should surface the Lewis et al. RAG paper, REALM, etc."),
87
+ ("H", "how to do distributed training with PyTorch across GPUs", None,
88
+ "Distributed training - ZeRO, Megatron, FSDP, DeepSpeed papers"),
89
+
90
+ # ── Band I: Comparative / Survey queries ─────────────────────────────────
91
+ # Users who want to understand the landscape.
92
+ ("I", "transformer vs CNN for image classification", None,
93
+ "ViT vs ResNet/EfficientNet - should surface comparison papers"),
94
+ ("I", "survey of large language models", None,
95
+ "LLM surveys - Zhao et al. survey, Minaee survey"),
96
+ ("I", "comparison of object detection architectures YOLO vs DETR", None,
97
+ "YOLO family vs transformer-based detection"),
98
+ ("I", "GAN vs diffusion models for image generation", None,
99
+ "Generative model comparison - StyleGAN, DDPM, score matching"),
100
+ ("I", "review of federated learning privacy methods", None,
101
+ "FL surveys - McMahan, differential privacy in FL"),
102
+
103
+ # ── Band J: Emerging / Cutting-Edge queries ──────────────────────────────
104
+ # Users looking for the latest developments.
105
+ ("J", "mixture of experts models scaling", None,
106
+ "MoE - Switch Transformer, Mixtral, GShard"),
107
+ ("J", "test-time compute scaling for reasoning", None,
108
+ "New paradigm - o1-style reasoning, tree search at inference"),
109
+ ("J", "multimodal large language models vision and text", None,
110
+ "GPT-4V, LLaVA, Flamingo, multimodal LLMs"),
111
+ ("J", "state space models as alternative to transformers", None,
112
+ "S4, Mamba, H3 - structured state space models"),
113
+ ("J", "constitutional AI and AI safety alignment techniques", None,
114
+ "Anthropic constitutional AI, RLHF alternatives, safety"),
115
+ ("J", "sparse attention mechanisms for long context", None,
116
+ "Longformer, BigBird, sparse transformers for 100K+ context"),
117
+
118
+ # ── Band K: Cross-Domain queries ─────────────────────────────────────────
119
+ # Users applying ML to their specific domain.
120
+ ("K", "deep learning for protein structure prediction", None,
121
+ "AlphaFold, ESMFold, protein language models"),
122
+ ("K", "natural language processing for legal document analysis", None,
123
+ "Legal NLP - contract analysis, legal BERT, court opinion mining"),
124
+ ("K", "machine learning for climate change prediction", None,
125
+ "Climate ML - weather forecasting, carbon modeling"),
126
+ ("K", "using transformers for time series forecasting", None,
127
+ "Time series transformers - Informer, Autoformer, PatchTST"),
128
+ ("K", "reinforcement learning for robotics manipulation", None,
129
+ "RL + robotics - sim-to-real transfer, dexterous manipulation"),
130
+
131
+ # ── Band L: Vague / Exploratory queries ──────────────────────────────────
132
+ # Underspecified queries that real users actually type.
133
+ ("L", "AI ethics", None,
134
+ "Very broad - should return survey-level papers on AI ethics/fairness/bias"),
135
+ ("L", "embedding", None,
136
+ "Single word - highly ambiguous. Word2Vec? Sentence embeddings? Image embeddings?"),
137
+ ("L", "language model", None,
138
+ "Broad - should return influential LM papers or surveys"),
139
+ ("L", "generate images from text", None,
140
+ "Casual - should surface DALL-E, Stable Diffusion, Imagen"),
141
+ ("L", "make AI more safe", None,
142
+ "Very casual - should surface alignment/safety papers"),
143
+
144
+ # ── Band M: Follow-up / Refinement queries ───────────────────────────────
145
+ # Simulate a user who already found something and wants more.
146
+ ("M", "improvements to the original transformer architecture", None,
147
+ "Post-Vaswani improvements - Reformer, Performer, ALiBi, RoPE"),
148
+ ("M", "papers that cite ResNet and extend residual connections", None,
149
+ "ResNet extensions - DenseNet, ResNeXt, WideResNet, SE-Net"),
150
+ ("M", "alternatives to RLHF for aligning language models", None,
151
+ "DPO, SPIN, KTO - methods that bypass reward modeling"),
152
+ ("M", "BERT variants for low resource languages", None,
153
+ "mBERT, XLM-R, AfriBERTa, AraBERT - multilingual and language-specific BERT variants"),
154
+ ]
155
+
156
+
157
+ # ── Wire rewrite logging ─────────────────────────────────────────────────────
158
+
159
+ _rewrite_log: dict[str, str] = {}
160
+ _original_rewrite = groq_svc.rewrite
161
+
162
+
163
+ async def _logging_rewrite(q: str) -> str:
164
+ r = await _original_rewrite(q)
165
+ _rewrite_log[q] = r
166
+ return r
167
+
168
+
169
+ groq_svc.rewrite = _logging_rewrite
170
+
171
+
172
+ # ── Per-query evaluation ─────────────────────────────────────────────────────
173
+
174
+ async def eval_query(
175
+ band: str, query: str, expected_id: str | None, description: str
176
+ ) -> dict:
177
+ """Run one query end-to-end and return structured results."""
178
+ t0 = time.perf_counter()
179
+ results = await hybrid_search_svc.search(query, limit=10)
180
+ elapsed_ms = (time.perf_counter() - t0) * 1000
181
+
182
+ rewrite = _rewrite_log.get(query, query)
183
+ rewrite_fired = rewrite.strip() != query.strip()
184
+
185
+ titles: dict[str, str] = {}
186
+ categories: dict[str, str] = {}
187
+ if results:
188
+ meta = await turso_svc.fetch_metadata_batch(results)
189
+ titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}
190
+ categories = {aid: (m.get("primary_topic") or "?") for aid, m in meta.items()}
191
+
192
+ # Print formatted output
193
+ print()
194
+ print(f"[{band}] {query!r}")
195
+ print(f" intent: {description}")
196
+ if rewrite_fired:
197
+ print(f" rewrite: {rewrite!r}")
198
+ else:
199
+ print(f" rewrite: (skipped or no change)")
200
+
201
+ if expected_id is not None:
202
+ if results and results[0] == expected_id:
203
+ verdict = f"PASS - {expected_id} at #1"
204
+ elif expected_id in results:
205
+ rank = results.index(expected_id) + 1
206
+ verdict = f"PARTIAL - {expected_id} at rank #{rank}"
207
+ else:
208
+ verdict = f"FAIL - {expected_id} NOT in top 10"
209
+ print(f" verdict: {verdict}")
210
+
211
+ print(f" latency: {elapsed_ms:.0f} ms | results: {len(results)}")
212
+
213
+ if not results:
214
+ print(" (no results returned)")
215
+ else:
216
+ for i, aid in enumerate(results, 1):
217
+ title = titles.get(aid, "(title unavailable)")
218
+ cat = categories.get(aid, "?")
219
+ if len(title) > 75:
220
+ title = title[:72] + "..."
221
+ marker = " *" if expected_id and aid == expected_id else " "
222
+ print(f" {i:2d}.{marker}{aid:14s} [{cat:20s}] {title}")
223
+
224
+ # Compute topic diversity
225
+ unique_cats = set(categories.values()) - {"?"}
226
+
227
+ return {
228
+ "band": band,
229
+ "query": query,
230
+ "description": description,
231
+ "rewrite": rewrite if rewrite_fired else None,
232
+ "latency_ms": elapsed_ms,
233
+ "n_results": len(results),
234
+ "results": [
235
+ {"rank": i+1, "arxiv_id": aid, "title": titles.get(aid, ""),
236
+ "category": categories.get(aid, "?")}
237
+ for i, aid in enumerate(results)
238
+ ],
239
+ "expected_id": expected_id,
240
+ "expected_found": expected_id in results if expected_id else None,
241
+ "expected_rank": results.index(expected_id) + 1 if expected_id and expected_id in results else None,
242
+ "topic_diversity": len(unique_cats),
243
+ }
244
+
245
+
246
+ async def main():
247
+ print("=" * 100)
248
+ print("EXPANDED SEARCH EVALUATION - Realistic User Queries")
249
+ print(f"Total queries: {len(QUERIES)} | Bands: {sorted(set(b for b,_,_,_ in QUERIES))}")
250
+ print("=" * 100)
251
+
252
+ # Warm-up
253
+ print("\nWarming up BGE-M3 + Turso...")
254
+ t0 = time.perf_counter()
255
+ embed_svc.encode_query("warmup query for the eval harness")
256
+ await turso_svc.fetch_metadata_batch(["1706.03762"])
257
+ print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")
258
+
259
+ all_results: list[dict] = []
260
+ band_results: dict[str, list[dict]] = {}
261
+
262
+ for band, query, expected, description in QUERIES:
263
+ result = await eval_query(band, query, expected, description)
264
+ all_results.append(result)
265
+ band_results.setdefault(band, []).append(result)
266
+
267
+ # ── Summary ──────────────────────────────────────────────────────────────
268
+ print("\n" + "=" * 100)
269
+ print("SUMMARY")
270
+ print("=" * 100)
271
+
272
+ # Band A: known-item hit rate
273
+ if "A" in band_results:
274
+ a_rows = band_results["A"]
275
+ hits = sum(1 for r in a_rows if r["expected_rank"] == 1)
276
+ total = len(a_rows)
277
+ print(f"\nBand A (known-item): {hits}/{total} top-1 hits")
278
+
279
+ # Per-band stats
280
+ print("\nPer-Band Results:")
281
+ print(f" {'Band':<6} {'Queries':>7} {'Avg Latency':>12} {'Avg Results':>12} {'Avg Topics':>11} Description")
282
+ print(f" {'-'*6} {'-'*7} {'-'*12} {'-'*12} {'-'*11} {'-'*40}")
283
+
284
+ band_labels = {
285
+ "A": "Known-item titles",
286
+ "F": "Beginner / Newcomer",
287
+ "G": "Research-in-Progress",
288
+ "H": "Implementation-Focused",
289
+ "I": "Comparative / Survey",
290
+ "J": "Emerging / Cutting-Edge",
291
+ "K": "Cross-Domain",
292
+ "L": "Vague / Exploratory",
293
+ "M": "Follow-up / Refinement",
294
+ }
295
+
296
+ for band in sorted(band_results.keys()):
297
+ rows = band_results[band]
298
+ n = len(rows)
299
+ avg_lat = sum(r["latency_ms"] for r in rows) / n
300
+ avg_res = sum(r["n_results"] for r in rows) / n
301
+ avg_div = sum(r["topic_diversity"] for r in rows) / n
302
+ label = band_labels.get(band, "")
303
+ print(f" {band:<6} {n:>7} {avg_lat:>10.0f}ms {avg_res:>12.1f} {avg_div:>11.1f} {label}")
304
+
305
+ # Overall latency
306
+ all_lat = [r["latency_ms"] for r in all_results]
307
+ all_lat.sort()
308
+ n = len(all_lat)
309
+ p50 = all_lat[n // 2]
310
+ p95 = all_lat[max(0, -(-n * 95 // 100) - 1)]  # nearest-rank p95: index ceil(0.95 * n) - 1
311
+ print(f"\nOverall Latency (n={n}): mean {sum(all_lat)/n:.0f} ms "
312
+ f"p50 {p50:.0f} ms p95 {p95:.0f} ms max {max(all_lat):.0f} ms")
313
+
314
+ # Rewrite analysis
315
+ rewrites = [(r["query"], r["rewrite"]) for r in all_results if r["rewrite"]]
316
+ skips = [r["query"] for r in all_results if not r["rewrite"]]
317
+ print(f"\nGroq Rewriter: {len(rewrites)} fired, {len(skips)} skipped")
318
+
319
+ # Zero-result queries
320
+ zeros = [r["query"] for r in all_results if r["n_results"] == 0]
321
+ if zeros:
322
+ print(f"\nWARNING: ZERO RESULTS ({len(zeros)}):")
323
+ for q in zeros:
324
+ print(f" - {q!r}")
325
+ else:
326
+ print("\nOK: All queries returned results")
327
+
328
+ # Save JSON for comparison
329
+ out_path = Path(__file__).parent / "expanded_eval_results.json"
330
+ with open(out_path, "w", encoding="utf-8") as f:
331
+ json.dump(all_results, f, indent=2, default=str)
332
+ print(f"\nResults saved to: {out_path}")
333
+
334
+
335
+ if __name__ == "__main__":
336
+ asyncio.run(main())
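Note on the latency summary: the script indexes the sorted latency list directly to get p50 and p95. For comparison, here is a small sketch of the nearest-rank percentile definition; `nearest_rank_percentile` is a hypothetical helper that is not part of the script, and the sample latencies are made up purely to show the indexing.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative data only: 44 fake latencies, 1.0 ms through 44.0 ms.
latencies_ms = [float(v) for v in range(1, 45)]
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
print(f"p50 {p50:.0f} ms  p95 {p95:.0f} ms")
```

With 44 samples, nearest-rank p95 is the 42nd smallest value, whereas an index of `int(n * 0.95) - 1` would select the 41st; on a sample this small the two conventions can differ by one position.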
scripts/eval_recs_quality.py ADDED
@@ -0,0 +1,547 @@
1
+ """
2
+ Recommendation engine evaluation harness.
3
+
4
+ Bypasses HTTP and calls the same pipeline functions the router uses,
5
+ with full DB setup/cleanup per scenario. Each scenario probes a specific
6
+ behavior (which tier fired, how many clusters formed, whether suppression
7
+ removed disliked categories, etc.) rather than just "did we get results."
8
+
9
+ Run: python scripts/eval_recs_quality.py
10
+ """
11
+ from __future__ import annotations
12
+
13
+ import asyncio
14
+ import sys
15
+ import time
16
+ import uuid
17
+ from collections import Counter
18
+ from pathlib import Path
19
+
20
+ import numpy as np
21
+ import aiosqlite
22
+
23
+ # Force UTF-8 stdout so unicode glyphs (>=, ->, etc.) don't crash on Windows cp1252
24
+ if hasattr(sys.stdout, "reconfigure"):
25
+ sys.stdout.reconfigure(encoding="utf-8")
26
+
27
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
28
+
29
+ from app import qdrant_svc, db, turso_svc, user_state as us
30
+ from app.config import REC_LIMIT, DB_PATH
31
+ from app.recommend import profiles
32
+ from app.recommend.clustering import (
33
+ compute_clusters, MIN_PAPERS_FOR_CLUSTERING,
34
+ )
35
+ from app.routers.recommendations import (
36
+ _multi_interest_recommend, _ewma_recommend,
37
+ )
38
+
39
+
40
+ # ── Curated paper ids (verified-famous papers in each domain) ───────────────
41
+
42
+ NLP_PAPERS = [
43
+ ("1706.03762", "Attention Is All You Need"),
44
+ ("1810.04805", "BERT"),
45
+ ("2005.14165", "GPT-3"),
46
+ ("1907.11692", "RoBERTa"),
47
+ ("1910.10683", "T5"),
48
+ ("2203.02155", "InstructGPT"),
49
+ ("2201.11903", "CoT Prompting"),
50
+ ("2307.09288", "Llama 2"),
51
+ ]
52
+
53
+ CV_PAPERS = [
54
+ ("1512.03385", "ResNet"),
55
+ ("2010.11929", "Vision Transformer"),
56
+ ("1409.1556", "VGG"),
57
+ ("1505.04597", "U-Net"),
58
+ ("2103.14030", "Swin Transformer"),
59
+ ("2104.14294", "DINO"),
60
+ ("2112.10752", "Latent Diffusion"),
61
+ ("1311.2524", "R-CNN"),
62
+ ]
63
+
64
+ ML_THEORY_PAPERS = [
65
+ # cs.LG / stat.ML - used for negative-suppression test
66
+ ("1607.06450", "Layer Normalization"),
67
+ ("1502.03167", "Batch Normalization"),
68
+ ("1412.6980", "Adam optimizer"),
69
+ ("1411.1784", "Conditional GAN"),
70
+ ]
71
+
72
+
73
+ # ── User setup / teardown helpers ────────────────────────────────────────────
74
+
75
+ async def setup_user(
76
+ user_id: str,
77
+ save_ids: list[str],
78
+ dismiss_ids: list[str] | None = None,
79
+ onboarding_categories: list[str] | None = None,
80
+ ) -> object:
81
+ """Build a test user from scratch: saves, dismisses, EWMA, in-memory state."""
82
+ dismiss_ids = dismiss_ids or []
83
+
84
+ if onboarding_categories:
85
+ await db.save_onboarding_categories(user_id, onboarding_categories)
86
+
87
+ # Pre-fetch all vectors in one batch
88
+ all_ids = save_ids + dismiss_ids
89
+ vecs = await qdrant_svc.get_paper_vectors(all_ids) if all_ids else {}
90
+
91
+ # Cache metadata so category suppression / display work
92
+ if all_ids:
93
+ meta = await turso_svc.fetch_metadata_batch(all_ids)
94
+ if meta:
95
+ await db.cache_turso_metadata_batch(list(meta.values()))
96
+
97
+ state = await us.ensure_loaded(user_id)
98
+
99
+ for pid in save_ids:
100
+ if pid not in vecs:
101
+ print(f" [setup] WARNING: {pid} not in Qdrant; skipping")
102
+ continue
103
+ state.add_positive(pid)
104
+ emb = np.array(vecs[pid], dtype=np.float32)
105
+ await profiles.update_on_save(user_id, emb)
106
+ await db.log_interaction(user_id, pid, "save")
107
+
108
+ for pid in dismiss_ids:
109
+ if pid not in vecs:
110
+ continue
111
+ state.add_negative(pid)
112
+ emb = np.array(vecs[pid], dtype=np.float32)
113
+ await profiles.update_on_dismiss(user_id, emb)
114
+ await db.log_interaction(user_id, pid, "not_interested")
115
+
116
+ return state
117
+
118
+
119
+ async def cleanup_user(user_id: str) -> None:
120
+ """Wipe all DB rows + in-memory cache for a test user."""
121
+ async with aiosqlite.connect(DB_PATH) as conn:
122
+ for sql in [
123
+ "DELETE FROM interactions WHERE user_id = ?",
124
+ "DELETE FROM user_profiles WHERE user_id = ?",
125
+ "DELETE FROM user_clusters WHERE user_id = ?",
126
+ "DELETE FROM user_onboarding WHERE user_id = ?",
127
+ "DELETE FROM cluster_snapshots WHERE user_id = ?",
128
+ ]:
129
+ try:
130
+ await conn.execute(sql, (user_id,))
131
+ except Exception:
132
+ pass
133
+ await conn.commit()
134
+ if user_id in us._cache:
135
+ del us._cache[user_id]
136
+
137
+
138
+ # ── Pipeline runner (mirrors get_recommendations() cascade) ──────────────────
139
+
140
+ async def run_pipeline(user_id: str, state) -> tuple[str, list[str], dict, float]:
141
+ """Returns (tier_label, rec_ids, paper_tags, latency_ms)."""
142
+ seen = us.all_seen(user_id)
143
+ n_saves = len(state.positive_list)
144
+
145
+ t0 = time.perf_counter()
146
+
147
+ # Tier 0: cold-start (no saves) → trending by category
148
+ if n_saves == 0:
149
+ cat_filter = await db.get_user_category_filter(user_id)
150
+ if cat_filter:
151
+ trending = await turso_svc.fetch_trending_by_categories(
152
+ cat_filter, limit=REC_LIMIT,
153
+ )
154
+ elapsed = (time.perf_counter() - t0) * 1000
155
+ return ("Tier 0 trending",
156
+ [t["arxiv_id"] for t in trending],
157
+ {}, elapsed)
158
+ elapsed = (time.perf_counter() - t0) * 1000
159
+ return ("EMPTY (no onboarding)", [], {}, elapsed)
160
+
161
+ # Tier 1: ≥5 saves → multi-interest clustering + quota
162
+ if n_saves >= MIN_PAPERS_FOR_CLUSTERING:
163
+ rec_ids, paper_tags = await _multi_interest_recommend(
164
+ user_id, state, seen, REC_LIMIT, query_id="eval-test",
165
+ )
166
+ if rec_ids:
167
+ elapsed = (time.perf_counter() - t0) * 1000
168
+ return ("Tier 1 multi-interest", rec_ids, paper_tags, elapsed)
169
+
170
+ # Tier 2: ≥3 saves (EWMA threshold internally) → single-vector search
171
+ rec_ids = await _ewma_recommend(user_id, seen, REC_LIMIT)
172
+ if rec_ids:
173
+ elapsed = (time.perf_counter() - t0) * 1000
174
+ return ("Tier 2 EWMA", rec_ids, {}, elapsed)
175
+
176
+ # Tier 3: ≥1 save → Qdrant Recommend with raw IDs
177
+ rec_ids = await qdrant_svc.recommend(
178
+ positive_arxiv_ids=state.positive_list,
179
+ negative_arxiv_ids=state.negative_list,
180
+ seen_arxiv_ids=seen,
181
+ limit=REC_LIMIT,
182
+ )
183
+ elapsed = (time.perf_counter() - t0) * 1000
184
+ if rec_ids:
185
+ return ("Tier 3 Qdrant Recommend", rec_ids, {}, elapsed)
186
+ return ("EMPTY (all tiers exhausted)", [], {}, elapsed)
187
+
188
+
189
+ async def report_results(rec_ids: list[str], paper_tags: dict) -> tuple[Counter, Counter]:
190
+ """Print top-10 with category and cluster origin. Return (cat_counts, source_counts)."""
191
+ if not rec_ids:
192
+ print(" (no results)")
193
+ return Counter(), Counter()
194
+
195
+ meta = await turso_svc.fetch_metadata_batch(rec_ids)
196
+ cats: Counter = Counter()
197
+ sources: Counter = Counter()
198
+
199
+ for i, aid in enumerate(rec_ids, 1):
200
+ m = meta.get(aid, {})
201
+ title = m.get("title") or "(no title)"  # also covers a stored None title
202
+ if len(title) > 65:
203
+ title = title[:62] + "..."
204
+ cat = m.get("category") or "?"  # None would break the {cat:14s} format below
205
+ cats[cat] += 1
206
+ tag = paper_tags.get(aid, {}) if paper_tags else {}
207
+ source = tag.get("candidate_source", "")
208
+ sources[source] += 1
209
+ src_short = f" [{source}]" if source else ""
210
+ print(f" {i:2d}. {aid:13s} {cat:14s} {title}{src_short}")
211
+
212
+ return cats, sources
213
+
214
+
215
+ # ── Scenarios ────────────────────────────────────────────────────────────────
216
+
217
+ async def scenario_1_cold_with_onboarding():
218
+ """Tier 0: zero saves, NLP categories selected during onboarding."""
219
+ user_id = f"eval-recs-1-{uuid.uuid4().hex[:6]}"
220
+ print("\n" + "=" * 100)
221
+ print("S1 Cold-start with onboarding categories (NLP)")
222
+ print(" Expect: Tier 0 trending; results in NLP-adjacent friendly categories")
223
+ print("=" * 100)
224
+ try:
225
+ await setup_user(user_id, save_ids=[], onboarding_categories=["nlp"])
226
+ state = await us.ensure_loaded(user_id)
227
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
228
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
229
+ cats, _ = await report_results(rec_ids, tags)
230
+ nlp_count = sum(
231
+ c for k, c in cats.items()
232
+ if k in {"AI/ML", "NLP/Computational Linguistics"} or k.startswith("cs.CL")
233
+ )
234
+ verdict = "PASS" if tier.startswith("Tier 0") and len(rec_ids) >= 5 else \
235
+ "FAIL (Tier 0 broken: expected at least 5 trending results)"
236
+ print(f" Categories: {dict(cats)} --> NLP count: {nlp_count}/{len(rec_ids)}")
237
+ print(f" VERDICT: {verdict}")
238
+ finally:
239
+ await cleanup_user(user_id)
240
+
241
+
242
+ async def scenario_2_single_save():
243
+ """Tier 3: 1 save, expect Qdrant Recommend nearest-neighbors."""
244
+ user_id = f"eval-recs-2-{uuid.uuid4().hex[:6]}"
245
+ print("\n" + "=" * 100)
246
+ print("S2 Single save (Vaswani Attention)")
247
+ print(" Expect: Tier 3 Qdrant Recommend; results semantically near saved paper")
248
+ print("=" * 100)
249
+ try:
250
+ await setup_user(user_id, save_ids=["1706.03762"])
251
+ state = await us.ensure_loaded(user_id)
252
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
253
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
254
+ cats, _ = await report_results(rec_ids, tags)
255
+ ml_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
256
+ verdict = "PASS" if tier.startswith("Tier 3") and ml_count >= 6 else "PARTIAL"
257
+ print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {ml_count}/{len(rec_ids)}")
258
+ print(f" VERDICT: {verdict}")
259
+ finally:
260
+ await cleanup_user(user_id)
261
+
262
+
263
+ async def scenario_3_three_nlp_saves():
264
+ """Tier 2: 3 same-domain saves, expect EWMA single-vector search."""
265
+ user_id = f"eval-recs-3-{uuid.uuid4().hex[:6]}"
266
+ save_ids = [pid for pid, _ in NLP_PAPERS[:3]]
267
+ print("\n" + "=" * 100)
268
+ print("S3 Three NLP saves")
269
+ print(f" Saved: {save_ids}")
270
+ print(" Expect: Tier 2 EWMA single-vector; results NLP-coherent")
271
+ print("=" * 100)
272
+ try:
273
+ await setup_user(user_id, save_ids=save_ids)
274
+ state = await us.ensure_loaded(user_id)
275
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
276
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
277
+ cats, _ = await report_results(rec_ids, tags)
278
+ nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
279
+ verdict = "PASS" if tier.startswith("Tier 2") and nlp_count >= 7 else "PARTIAL"
280
+ print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {nlp_count}/{len(rec_ids)}")
281
+ print(f" VERDICT: {verdict}")
282
+ finally:
283
+ await cleanup_user(user_id)
284
+
285
+
286
+ async def scenario_4_five_nlp_saves_single_cluster():
287
+ """Tier 1, single interest: expect K=1 cluster, NLP-only output."""
288
+ user_id = f"eval-recs-4-{uuid.uuid4().hex[:6]}"
289
+ save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
290
+ print("\n" + "=" * 100)
291
+ print("S4 Five NLP saves (single interest)")
292
+ print(f" Saved: {save_ids}")
293
+ print(" Expect: Tier 1; 1 or few clusters; ML/NLP-coherent output")
294
+ print("=" * 100)
295
+ try:
296
+ await setup_user(user_id, save_ids=save_ids)
297
+ state = await us.ensure_loaded(user_id)
298
+ # Inspect clusters explicitly
299
+ vecs = await qdrant_svc.get_paper_vectors(save_ids)
300
+ embs = np.array([vecs[p] for p in save_ids if p in vecs], dtype=np.float32)
301
+ clusters = compute_clusters([p for p in save_ids if p in vecs], embs)
302
+ print(f" Clusters formed: K={len(clusters)}")
303
+ for c in clusters:
304
+ print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")
305
+
306
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
307
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
308
+ cats, _ = await report_results(rec_ids, tags)
309
+ nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
310
+ verdict = "PASS" if tier.startswith("Tier 1") and nlp_count >= 7 else "PARTIAL"
311
+ print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {nlp_count}/10")
312
+ print(f" VERDICT: {verdict}")
313
+ finally:
314
+ await cleanup_user(user_id)
315
+
316
+
317
+ async def scenario_5_multi_interest_balanced():
318
+ """Tier 1, the headline test for quota fusion."""
319
+ user_id = f"eval-recs-5-{uuid.uuid4().hex[:6]}"
320
+ save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
321
+ print("\n" + "=" * 100)
322
+ print("S5 Multi-interest (5 NLP + 5 CV) -- THE HEADLINE QUOTA TEST")
323
+ print(" Saved: 5x NLP + 5x CV")
324
+ print(" Expect: K>=2 clusters, both interests visible, neither cluster swamps")
325
+ print("=" * 100)
326
+ try:
327
+ await setup_user(user_id, save_ids=save_ids)
328
+ state = await us.ensure_loaded(user_id)
329
+ # Inspect clusters
330
+ vecs = await qdrant_svc.get_paper_vectors(save_ids)
331
+ aligned = [p for p in save_ids if p in vecs]
332
+ embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
333
+ clusters = compute_clusters(aligned, embs)
334
+ print(f" Clusters formed: K={len(clusters)}")
335
+ for c in clusters:
336
+ print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")
337
+
338
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
339
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
340
+ cats, sources = await report_results(rec_ids, tags)
341
+ nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
342
+ cv_count = sum(c for k, c in cats.items() if k == "Computer Vision")
343
+ print(f" NLP (AI/ML + NLP): {nlp_count} CV (Computer Vision): {cv_count}")
344
+ print(f" Cluster origin counts: {dict(sources)}")
345
+ smaller = min(nlp_count, cv_count) if (nlp_count and cv_count) else 0
346
+ verdict = "PASS" if len(clusters) >= 2 and smaller >= 3 else "FAIL"
347
+ print(f" VERDICT: {verdict} (floor=3 enforced: {smaller >= 3})")
348
+ finally:
349
+ await cleanup_user(user_id)
350
+
351
+
352
+ async def scenario_6_multi_interest_imbalanced():
353
+ """Tier 1: imbalanced split -- does the floor=3 rescue the minority?"""
354
+ user_id = f"eval-recs-6-{uuid.uuid4().hex[:6]}"
355
+ save_ids = [pid for pid, _ in NLP_PAPERS[:8]] + [pid for pid, _ in CV_PAPERS[:2]]
356
+ print("\n" + "=" * 100)
357
+ print("S6 Multi-interest imbalanced (8 NLP + 2 CV) -- FLOOR TEST")
358
+ print(" Expect: if K>=2, CV gets >=3 slots even though importance is ~80/20")
359
+ print("=" * 100)
360
+ try:
361
+ await setup_user(user_id, save_ids=save_ids)
362
+ state = await us.ensure_loaded(user_id)
363
+ vecs = await qdrant_svc.get_paper_vectors(save_ids)
364
+ aligned = [p for p in save_ids if p in vecs]
365
+ embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
366
+ clusters = compute_clusters(aligned, embs)
367
+ print(f" Clusters formed: K={len(clusters)}")
368
+ for c in clusters:
369
+ print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")
370
+
371
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
372
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
373
+ cats, sources = await report_results(rec_ids, tags)
374
+ nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
375
+ cv_count = sum(c for k, c in cats.items() if k == "Computer Vision")
376
+ print(f" NLP: {nlp_count} CV: {cv_count} Cluster sources: {dict(sources)}")
377
+ if len(clusters) >= 2:
378
+ verdict = "PASS" if cv_count >= 3 else "FAIL (floor not enforced)"
379
+ else:
380
+ verdict = "AMBIGUOUS (only 1 cluster formed - floor doesn't apply)"
381
+ print(f" VERDICT: {verdict}")
382
+ finally:
383
+ await cleanup_user(user_id)
384
+
385
+
386
+ async def scenario_7_category_suppression():
387
+ """Tier 1 with dismissals: 'Computer Vision' should be suppressed."""
388
+ # Save 5 NLP, dismiss 3 CV -- non-overlapping friendly categories
389
+ user_id = f"eval-recs-7-{uuid.uuid4().hex[:6]}"
390
+ save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
391
+ dismiss_ids = [pid for pid, _ in CV_PAPERS[:3]]
392
+ print("\n" + "=" * 100)
393
+ print("S7 Category suppression (5 NLP saves + 3 CV dismissals)")
394
+ print(" Expect: 'Computer Vision' suppressed; zero CV papers in output")
395
+ print("=" * 100)
396
+ try:
397
+ await setup_user(user_id, save_ids=save_ids, dismiss_ids=dismiss_ids)
398
+ state = await us.ensure_loaded(user_id)
399
+ suppressed = await db.get_suppressed_categories(user_id)
400
+ print(f" Suppressed categories detected: {suppressed}")
401
+
402
+ tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
403
+ print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
404
+ cats, _ = await report_results(rec_ids, tags)
405
+ cv_count = cats.get("Computer Vision", 0)
406
+ verdict = "PASS" if cv_count == 0 and "Computer Vision" in suppressed else \
407
+ "FAIL (CV leaked through)" if cv_count > 0 else \
408
+ "PARTIAL (no CV but suppression set empty)"
409
+ print(f" CV count in output: {cv_count} VERDICT: {verdict}")
410
+ finally:
411
+ await cleanup_user(user_id)
412
+
413
+
414
+ async def scenario_8_hungarian_stability():
415
+ """Cluster IDs should remain stable across reclusterings when one new save is added."""
416
+ user_id = f"eval-recs-8-{uuid.uuid4().hex[:6]}"
417
+ save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
418
+ new_save = NLP_PAPERS[5][0] # 6th NLP paper (added later)
419
+ print("\n" + "=" * 100)
420
+ print("S8 Hungarian cluster-ID stability")
421
+ print(" Run pipeline once -> save 1 more NLP paper -> run again")
422
+ print(" Expect: same cluster_idx assigned to NLP cluster across runs")
423
+ print("=" * 100)
424
+ try:
425
+ await setup_user(user_id, save_ids=save_ids)
426
+ state = await us.ensure_loaded(user_id)
427
+
428
+ # First run
429
+ await run_pipeline(user_id, state)
430
+ clusters_v1 = await db.get_user_clusters(user_id)
431
+ v1 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v1}
432
+ print(f" After run 1: {sorted(v1)}")
433
+
434
+ # Add one more NLP paper
435
+ more_vecs = await qdrant_svc.get_paper_vectors([new_save])
436
+ if new_save in more_vecs:
437
+ state.add_positive(new_save)
438
+ await profiles.update_on_save(user_id, np.array(more_vecs[new_save], dtype=np.float32))
439
+ await db.log_interaction(user_id, new_save, "save")
440
+
441
+ # Second run
442
+ await run_pipeline(user_id, state)
443
+ clusters_v2 = await db.get_user_clusters(user_id)
444
+ v2 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v2}
445
+ print(f" After run 2: {sorted(v2)}")
446
+
447
+ # Stability check: every cluster_idx from v1 must still be present in v2 (the medoid may change, the idx must not)
448
+ idx_v1 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v1}
449
+ idx_v2 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v2}
450
+ # All cluster_idx that existed in v1 should still exist in v2
451
+ stable = all(k in idx_v2 for k in idx_v1)
452
+ print(f" Cluster IDs in v1: {sorted(idx_v1.keys())} v2: {sorted(idx_v2.keys())}")
453
+ print(f" VERDICT: {'PASS (all v1 cluster_idx preserved)' if stable else 'FAIL (cluster_idx churned)'}")
454
+ finally:
455
+ await cleanup_user(user_id)
456
+
457
+
458
+ async def scenario_9_latency():
459
+ """Latency sanity: full Tier 1 pipeline on 10 saved papers."""
460
+ user_id = f"eval-recs-9-{uuid.uuid4().hex[:6]}"
461
+ save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
462
+ print("\n" + "=" * 100)
463
+ print("S9 Latency sanity (Tier 1, 10 saved papers)")
464
+ print(" Expect: <30 ms compute (excluding metadata I/O); end-to-end <2s")
465
+ print("=" * 100)
466
+ try:
467
+ await setup_user(user_id, save_ids=save_ids)
468
+ state = await us.ensure_loaded(user_id)
469
+ # Warm: run once to load profiles
470
+ await run_pipeline(user_id, state)
471
+ # Time multiple runs
472
+ runs = []
473
+ for i in range(3):
474
+ tier, _, _, lat = await run_pipeline(user_id, state)
475
+ runs.append(lat)
476
+ print(f" Run {i+1}: {tier} {lat:.0f} ms")
477
+ print(f" Mean: {sum(runs)/len(runs):.0f} ms Min: {min(runs):.0f} ms Max: {max(runs):.0f} ms")
478
+ # The 30 ms compute target excludes Qdrant + Turso I/O; full e2e includes them
479
+ e2e_pass = max(runs) < 2000
480
+ print(f" VERDICT: {'PASS (e2e <2s)' if e2e_pass else 'PARTIAL (over 2s e2e -- investigate)'}")
481
+ finally:
482
+ await cleanup_user(user_id)
483
+
484
+
485
+ # ── Pre-flight + main ────────────────────────────────────────────────────────
486
+
487
+ async def preflight():
488
+ """Verify all curated paper IDs exist in Qdrant before running."""
489
+ all_ids = [p[0] for p in NLP_PAPERS + CV_PAPERS + ML_THEORY_PAPERS]
490
+ vecs = await qdrant_svc.get_paper_vectors(all_ids)
491
+ missing = [pid for pid in all_ids if pid not in vecs]
492
+ if missing:
493
+ print(f"WARNING: {len(missing)} curated IDs not in Qdrant: {missing}")
494
+ print("Some scenarios may produce skewed results.")
495
+ else:
496
+ print(f"Pre-flight: all {len(all_ids)} curated paper IDs present in Qdrant.")
497
+
498
+
499
+ async def wipe_all_eval_users():
500
+ """Belt-and-braces cleanup: remove any eval-recs-* users left from crashes."""
501
+ async with aiosqlite.connect(DB_PATH) as conn:
502
+ for tbl in ["interactions", "user_profiles", "user_clusters",
503
+ "user_onboarding", "cluster_snapshots"]:
504
+ try:
505
+ await conn.execute(f"DELETE FROM {tbl} WHERE user_id LIKE ?", ("eval-recs-%",))
506
+ except Exception:
507
+ pass
508
+ await conn.commit()
509
+
510
+
511
+ async def main():
512
+ print("=" * 100)
513
+ print("RECOMMENDATION ENGINE EVALUATION")
514
+ print("=" * 100)
515
+ await db.init_db()
516
+ await wipe_all_eval_users()
517
+ await preflight()
518
+
519
+ scenarios = [
520
+ scenario_1_cold_with_onboarding,
521
+ scenario_2_single_save,
522
+ scenario_3_three_nlp_saves,
523
+ scenario_4_five_nlp_saves_single_cluster,
524
+ scenario_5_multi_interest_balanced,
525
+ scenario_6_multi_interest_imbalanced,
526
+ scenario_7_category_suppression,
527
+ scenario_8_hungarian_stability,
528
+ scenario_9_latency,
529
+ ]
530
+
531
+ for s in scenarios:
532
+ try:
533
+ await s()
534
+ except Exception as e:
535
+ import traceback
536
+ print(f" SCENARIO ERROR: {e}")
537
+ traceback.print_exc()
538
+
539
+ # Final safety wipe in case any cleanup_user calls failed
540
+ await wipe_all_eval_users()
541
+ print("\n" + "=" * 100)
542
+ print("DONE -- all eval-recs-* users wiped from DB")
543
+ print("=" * 100)
544
+
545
+
546
+ if __name__ == "__main__":
547
+ asyncio.run(main())
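Reviewer note on S5/S6: both scenarios check that a minority cluster still receives at least `floor=3` slots. The real `allocate_quotas` in `app/recommend/fusion.py` is not part of this diff, so the following is only a minimal sketch of one way such a floor could behave. The function name `allocate_quotas_sketch` and the largest-remainder rounding are assumptions, not the project's implementation.

```python
def allocate_quotas_sketch(importances, total_slots, min_slots):
    """Hypothetical stand-in for a floor-respecting quota split.

    Every cluster first receives `min_slots`; the remaining slots are
    divided in proportion to importance, with leftover slots handed out
    by largest fractional remainder.
    """
    k = len(importances)
    if k == 0:
        return []
    if min_slots * k > total_slots:
        raise ValueError("floor does not fit into total_slots")

    spare = total_slots - min_slots * k
    total_imp = sum(importances)
    if total_imp <= 0:
        # No usable importance signal: split the spare slots evenly.
        shares = [spare / k] * k
    else:
        shares = [spare * w / total_imp for w in importances]

    quotas = [min_slots + int(s) for s in shares]

    # Largest-remainder rounding so the quotas sum to total_slots exactly.
    leftover = total_slots - sum(quotas)
    order = sorted(range(k), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


# S6-style split: importance is roughly 80/20, ten slots, floor of three.
s6_quotas = allocate_quotas_sketch([0.8, 0.2], total_slots=10, min_slots=3)
```

Under this scheme the minority cluster can never drop below the floor, which is the property S6 tests; how the production code rounds or caps quotas may differ.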
scripts/eval_search_quality.py ADDED
@@ -0,0 +1,197 @@
1
+ """
2
+ Search quality evaluation harness.
3
+
4
+ For each curated query, runs the hybrid search pipeline end-to-end
5
+ (rewrite -> encode -> dense+sparse -> RRF -> title-boost) and prints the
6
+ top 10 results with titles fetched from Turso. For known-item queries,
7
+ flags whether the expected paper landed at #1.
8
+
9
+ This is a HUMAN-JUDGMENT report, not a pass/fail test. The output is
10
+ designed to be read top-to-bottom and rated query by query.
11
+
12
+ Run: python scripts/eval_search_quality.py
13
+ """
14
+ from __future__ import annotations
15
+
16
+ import asyncio
17
+ import sys
18
+ import time
19
+ from pathlib import Path
20
+
21
+ # Make the project root importable when run as `python scripts/eval_search_quality.py`
22
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
23
+
24
+ from app import hybrid_search_svc
25
+ from app import turso_svc
26
+ from app import embed_svc
27
+ from app import groq_svc
28
+
29
+
30
+ # (band, query, expected_arxiv_id_or_None)
31
+ QUERIES: list[tuple[str, str, str | None]] = [
32
+ # ── Band A: known-item title queries ──────────────────────────────────
33
+ # The right answer is unambiguous. Top-1 hit is the bar.
34
+ ("A", "attention is all you need", "1706.03762"),
35
+ ("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
36
+ ("A", "Adam: A Method for Stochastic Optimization", "1412.6980"),
37
+ ("A", "Language Models are Few-Shot Learners", "2005.14165"),
38
+ ("A", "Deep Residual Learning for Image Recognition", "1512.03385"),
39
+
40
+ # ── Band B: conceptual semantic queries ───────────────────────────────
41
+ # No exact keyword match; tests whether dense retrieval rescues meaning.
42
+ ("B", "when AI makes up fake facts", None),
43
+ ("B", "making language models follow human preferences", None),
44
+ ("B", "why deep networks generalize despite overparameterization", None),
45
+ ("B", "finding similar papers using vector embeddings", None),
46
+ ("B", "models that pretend to be aligned but aren't", None),
47
+
48
+ # ── Band C: keyword-academic queries ──────────────────────────────────
49
+ # Already in academic form; rewriter heuristic should skip these.
50
+ ("C", "BGE-M3 multilingual dense retrieval", None),
51
+ ("C", "Mamba state space model linear time", None),
52
+ ("C", "chain of thought prompting", None),
53
+ ("C", "FlashAttention IO-aware exact attention", None),
54
+
55
+ # ── Band D: adversarial / edge cases ──────────────────────────────────
56
+ ("D", "transformr", None), # typo
57
+ ("D", "GPT", None), # very short
58
+ ("D", "bayesian deep learning monte carlo dropout uncertainty estimation", None), # very long
59
+ ("D", "applying CV to medical imaging", None), # cross-domain (CV->medical)
60
+ ("D", "attention", None), # single ambiguous word
61
+
62
+ # ── Band E: recency-sensitive queries ─────────────────────────────────
63
+ # Recency rerank was removed; verify recent work still surfaces.
64
+ ("E", "Llama 3", None),
65
+ ("E", "reasoning models 2024", None),
66
+ ]
67
+
68
+
69
+ # ── Wire a thin wrapper around groq_svc.rewrite to capture what fired ────
70
+ _rewrite_log: dict[str, str] = {}
71
+ _original_rewrite = groq_svc.rewrite
72
+
73
+
74
+ async def _logging_rewrite(q: str) -> str:
75
+ r = await _original_rewrite(q)
76
+ _rewrite_log[q] = r
77
+ return r
78
+
79
+
80
+ groq_svc.rewrite = _logging_rewrite
81
+
82
+
83
+ async def eval_query(
84
+ band: str, query: str, expected_id: str | None
85
+ ) -> tuple[list[str], float]:
86
+ """Run one query end-to-end and print a formatted report."""
87
+ t0 = time.perf_counter()
88
+ results = await hybrid_search_svc.search(query, limit=10)
89
+ elapsed_ms = (time.perf_counter() - t0) * 1000
90
+
91
+ rewrite = _rewrite_log.get(query, query)
92
+ rewrite_fired = rewrite.strip() != query.strip()
93
+
94
+ titles: dict[str, str] = {}
95
+ if results:
96
+ meta = await turso_svc.fetch_metadata_batch(results)
97
+ titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}
98
+
99
+ # ── Header ──────────────────────────────────────────────────────────────
100
+ print()
101
+ print(f"[{band}] {query!r}")
102
+ if rewrite_fired:
103
+ print(f" rewrite: {rewrite!r}")
104
+ else:
105
+ print(" rewrite: (heuristic skipped or no change)")
106
+
107
+ if expected_id is not None:
108
+ if results and results[0] == expected_id:
109
+ verdict = f"PASS - {expected_id} at #1"
110
+ elif expected_id in results:
111
+ rank = results.index(expected_id) + 1
112
+ verdict = f"PARTIAL - {expected_id} at rank #{rank}"
113
+ else:
114
+ verdict = f"FAIL - {expected_id} NOT in top 10"
115
+ print(f" verdict: {verdict}")
116
+
117
+ print(f" latency: {elapsed_ms:.0f} ms | results: {len(results)}")
118
+
119
+ if not results:
120
+ print(" (no results returned)")
121
+ return results, elapsed_ms
122
+
123
+ for i, aid in enumerate(results, 1):
124
+ title = titles.get(aid, "(title unavailable)")
125
+ if len(title) > 88:
126
+ title = title[:85] + "..."
127
+ marker = " *" if expected_id and aid == expected_id else " "
128
+ print(f" {i:2d}.{marker}{aid:13s} {title}")
129
+
130
+ return results, elapsed_ms
131
+
132
+
133
+ async def main():
134
+ print("=" * 100)
135
+ print("SEARCH QUALITY EVALUATION - ResearchIT hybrid search pipeline")
136
+ print("=" * 100)
137
+
138
+ # ── Warm-up ─────────────────────────────────────────────────────────────
139
+ # First BGE-M3 encode is ~10-15s cold. Warm before timing anything.
140
+ print("\nWarming up BGE-M3 + Turso...")
141
+ t0 = time.perf_counter()
142
+ embed_svc.encode_query("warmup query for the eval harness")
143
+ await turso_svc.fetch_metadata_batch(["1706.03762"])
144
+ print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")
145
+
146
+ band_results: dict[str, list[tuple[str, str | None, list[str], float]]] = {}
147
+
148
+ for band, query, expected in QUERIES:
149
+ results, latency = await eval_query(band, query, expected)
150
+ band_results.setdefault(band, []).append((query, expected, results, latency))
151
+
152
+ # ── Summary ─────────────────────────────────────────────────────────────
153
+ print("\n" + "=" * 100)
154
+ print("SUMMARY")
155
+ print("=" * 100)
156
+
157
+ # Band A: top-1 hit rate
158
+ if "A" in band_results:
159
+ a_rows = band_results["A"]
160
+ hits = sum(1 for _, exp, res, _ in a_rows if res and res[0] == exp)
161
+ partial = sum(
162
+ 1 for _, exp, res, _ in a_rows
163
+ if exp in (res or []) and (not res or res[0] != exp)
164
+ )
165
+ misses = len(a_rows) - hits - partial
166
+ print(f"\nBand A (known-item titles): {hits}/{len(a_rows)} top-1 hits, "
167
+ f"{partial} partial (in top 10 but not #1), {misses} miss")
168
+ for q, exp, res, _ in a_rows:
169
+ if res and res[0] == exp:
170
+ tag = "PASS"
171
+ elif exp in (res or []):
172
+ tag = f"PARTIAL #{res.index(exp)+1}"
173
+ else:
174
+ tag = "MISS"
175
+ qshort = q if len(q) <= 60 else q[:57] + "..."
176
+ print(f" [{tag:10s}] {exp:14s} {qshort}")
177
+
178
+ # Latency stats
179
+ all_lat = [lat for rows in band_results.values() for *_, lat in rows]
180
+ if all_lat:
181
+ all_lat.sort()
182
+ n = len(all_lat)
183
+ p50 = all_lat[n // 2]
184
+ p95 = all_lat[max(0, (95 * n + 99) // 100 - 1)]  # nearest-rank p95: ceil(0.95 * n) - 1
185
+ print(f"\nLatency (n={n}): mean {sum(all_lat)/n:.0f} ms "
186
+ f"p50 {p50:.0f} ms p95 {p95:.0f} ms "
187
+ f"max {max(all_lat):.0f} ms")
188
+
189
+ # Per-band coverage (how often did we get any results?)
190
+ print("\nResults coverage by band:")
191
+ for band, rows in sorted(band_results.items()):
192
+ empty = sum(1 for _, _, res, _ in rows if not res)
193
+ print(f" Band {band}: {len(rows) - empty}/{len(rows)} returned results")
194
+
195
+
196
+ if __name__ == "__main__":
197
+ asyncio.run(main())
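Reviewer note on the RRF stage: the harness above exercises the `rewrite -> encode -> dense+sparse -> RRF -> title-boost` path, but the fusion step itself lives in `hybrid_search_svc._rrf_fuse_multi`, which is not shown here. As a rough illustration of standard reciprocal rank fusion, here is a minimal sketch. It assumes each ranked list is a plain list of arXiv IDs and uses `k=60` as a conventional default; the project's actual input shape and its `SEARCH_RRF_K` value may differ.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion over several ranked ID lists (illustrative only).

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy dense and sparse rankings: "b" is near the top of both, so it wins.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that appear in both lists outrank documents that lead only one of them, which is why a known-item title query can still reach #1 when the dense and sparse retrievers disagree on exact order.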
scripts/expanded_eval_results.json ADDED
The diff for this file is too large to render. See raw diff
 
scripts/profile_pipelines.py ADDED
@@ -0,0 +1,410 @@
1
+ """
2
+ Stage-by-stage profiler for the search and recommendation pipelines.
3
+
4
+ Mirrors the production paths (hybrid_search_svc.search and
5
+ _multi_interest_recommend) with explicit timers between every stage,
6
+ so we can see where the time actually goes.
7
+
8
+ Run: python scripts/profile_pipelines.py
9
+ """
10
+ from __future__ import annotations
11
+
12
+ import asyncio
13
+ import sys
14
+ import time
15
+ import uuid
16
+ from contextlib import contextmanager
17
+ from pathlib import Path
18
+
19
+ import numpy as np
20
+
21
+ if hasattr(sys.stdout, "reconfigure"):
22
+ sys.stdout.reconfigure(encoding="utf-8")
23
+
24
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
25
+
26
+ from app import (
27
+ config, embed_svc, qdrant_svc, zilliz_svc, groq_svc, turso_svc,
28
+ db, user_state as us,
29
+ )
30
+ from app.recommend import profiles
31
+ from app.recommend.clustering import (
32
+ compute_clusters, stabilize_cluster_ids, save_clusters_to_db,
33
+ load_clusters_from_db, MIN_PAPERS_FOR_CLUSTERING, InterestCluster,
34
+ )
35
+ from app.recommend.fusion import allocate_quotas, merge_quota_results
36
+ from app.recommend.reranker import rerank_candidates
37
+ from app.recommend.diversity import mmr_rerank, inject_exploration
38
+
39
+
40
+ @contextmanager
41
+ def stage(name: str, sink: list):
42
+ t0 = time.perf_counter()
43
+ yield
44
+ sink.append((name, (time.perf_counter() - t0) * 1000))
45
+
46
+
47
+ def print_breakdown(label: str, timings: list[tuple[str, float]]):
48
+ total = sum(t for _, t in timings)
49
+ print(f"\n --- {label} ---")
50
+ print(f" {'Stage':<46s} {'ms':>10s} {'%':>6s}")
51
+ print(f" {'-'*46} {'-'*10} {'-'*6}")
52
+ for name, t in timings:
53
+ pct = (100.0 * t / total) if total > 0 else 0.0
54
+ print(f" {name:<46s} {t:>10.0f} {pct:>5.1f}%")
55
+ print(f" {'-'*46} {'-'*10} {'-'*6}")
56
+ print(f" {'TOTAL':<46s} {total:>10.0f} {100.0:>5.1f}%")
57
+
58
+
59
+ # ── Search pipeline profiler ─────────────────────────────────────────────────
60
+
61
+ async def profile_search(query: str) -> list[tuple[str, float]]:
62
+ """Mirror hybrid_search_svc.search() with stage timers."""
63
+ timings: list[tuple[str, float]] = []
64
+ limit = 10
65
+ fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
66
+
67
+ # Stage 1: Groq rewrite
68
+ rewritten = query
69
+ with stage("1. Groq rewrite (LLM)", timings):
70
+ try:
71
+ rewritten = await groq_svc.rewrite(query)
72
+ except Exception:
73
+ rewritten = query
74
+
75
+ # Stage 2: BGE-M3 encode (original)
76
+ with stage("2a. BGE-M3 encode (original)", timings):
77
+ d_orig, s_orig = embed_svc.encode_query(query)
78
+
79
+ encodings = [(d_orig, s_orig)]
80
+
81
+ # Stage 2b: BGE-M3 encode (rewritten, if different)
82
+ if rewritten and rewritten != query:
83
+ with stage("2b. BGE-M3 encode (rewrite)", timings):
84
+ d_rw, s_rw = embed_svc.encode_query(rewritten)
85
+ encodings.append((d_rw, s_rw))
86
+ else:
87
+ timings.append(("2b. BGE-M3 encode (rewrite skipped)", 0.0))
88
+
89
+ # Stage 3: Parallel Qdrant + Zilliz searches
90
+ with stage(f"3. Parallel search ({len(encodings)*2} tasks)", timings):
91
+ tasks = []
92
+ for d, s in encodings:
93
+ tasks.append(qdrant_svc.search_dense(d.tolist(), limit=fetch_k))
94
+ tasks.append(zilliz_svc.search_sparse(s, limit=fetch_k))
95
+ raw = await asyncio.gather(*tasks, return_exceptions=True)
96
+
97
+ valid_lists = [r for r in raw if not isinstance(r, Exception) and r]
98
+
99
+ # Stage 4: RRF fusion
100
+ from app.hybrid_search_svc import _rrf_fuse_multi, _title_match_rerank  # imported before the timer so import cost is not billed to stage 4
101
+ with stage("4. RRF fusion", timings):
102
+ fused = _rrf_fuse_multi(valid_lists, k=config.SEARCH_RRF_K)
103
+
104
+ # Stage 5: Title-boost (Turso fetch + scoring)
105
+ with stage("5. Title-match boost (Turso + score)", timings):
106
+ ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)
107
+
108
+ return timings
109
+
110
+
111
+ # ── Recommendations Tier 1 pipeline profiler ─────────────────────────────────
112
+
113
+ async def profile_recs_tier1(user_id: str, save_ids: list[str]) -> list[tuple[str, float]]:
114
+ """Mirror _multi_interest_recommend() with stage timers."""
115
+ timings: list[tuple[str, float]] = []
116
+
117
+ state = await us.ensure_loaded(user_id)
118
+ seen = us.all_seen(user_id)
119
+ REC_LIMIT = config.REC_LIMIT
120
+ OVERSAMPLE = 3
121
+ ST_SUPPLEMENT = 20
122
+ positives = state.positive_list
123
+
124
+ # 1. Fetch saved-paper vectors from Qdrant
125
+ with stage("1. Fetch saved-paper vectors (Qdrant)", timings):
126
+ vectors = await qdrant_svc.get_paper_vectors(positives)
127
+
128
+ aligned_ids = [pid for pid in positives if pid in vectors]
129
+ aligned_embs = np.array([vectors[pid] for pid in aligned_ids], dtype=np.float32)
130
+
131
+ # 2. Ward clustering (CPU)
132
+ with stage("2. Ward clustering (CPU)", timings):
133
+ clusters = compute_clusters(aligned_ids, aligned_embs)
134
+
135
+ # 3. Hungarian: load + match
136
+ with stage("3. Hungarian load+match (SQLite + numpy)", timings):
137
+ old_clusters_data = await load_clusters_from_db(user_id)
138
+ if old_clusters_data:
139
+ old_clusters = []
140
+ for row in old_clusters_data:
141
+ mpid = row["medoid_paper_id"]
142
+ if mpid in vectors:
143
+ medoid_emb = np.array(vectors[mpid], dtype=np.float32)
144
+ elif row.get("medoid_embedding_blob") is not None:
145
+ medoid_emb = np.frombuffer(
146
+ row["medoid_embedding_blob"], dtype=np.float32
147
+ ).copy()
148
+ else:
149
+ continue
150
+ old_clusters.append(InterestCluster(
151
+ cluster_idx=row["cluster_idx"],
152
+ medoid_paper_id=mpid,
153
+ medoid_embedding=medoid_emb,
154
+ paper_ids=[],
155
+ importance=row["importance"],
156
+ ))
157
+ if old_clusters:
158
+ clusters = stabilize_cluster_ids(clusters, old_clusters)
159
+
160
+ # 4. Save clusters + snapshot (SQLite writes)
161
+ with stage("4. Save clusters + snapshot (SQLite)", timings):
162
+ await save_clusters_to_db(user_id, clusters)
163
+ await db.save_cluster_snapshot(user_id, [
164
+ {
165
+ "cluster_idx": c.cluster_idx,
166
+ "medoid_paper_id": c.medoid_paper_id,
167
+ "importance": c.importance,
168
+ "paper_ids": c.paper_ids,
169
+ "medoid_embedding_blob": c.medoid_embedding.astype(np.float32).tobytes(),
170
+ }
171
+ for c in clusters
172
+ ])
173
+
174
+ # 5. Quota allocation (CPU)
175
+ with stage("5. Allocate quotas (CPU)", timings):
176
+ importances = [c.importance for c in clusters]
177
+ quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
178
+
179
+ # 6. Load short-term profile
180
+ with stage("6. Load short-term profile (SQLite)", timings):
181
+ st_vec = await profiles.load_profile(user_id, "short_term")
182
+
183
+ # 7. Per-cluster parallel ANN searches (no with_vectors: that path
184
+ # is 10x slower on Qdrant Cloud free tier; we cache vectors instead)
185
+ with stage(f"7. Per-cluster ANN searches (gather {len(clusters)})", timings):
186
+ search_tasks = [
187
+ qdrant_svc.search_by_vector_with_scores(
188
+ query_vector=c.medoid_embedding.tolist(),
189
+ limit=quota * OVERSAMPLE,
190
+ exclude_ids=seen,
191
+ )
192
+ for c, quota in zip(clusters, quotas)
193
+ ]
194
+ per_cluster_scored = await asyncio.gather(*search_tasks)
195
+
196
+ paper_cluster_map: dict[str, int] = {}
197
+ qdrant_score_map: dict[str, float] = {}
198
+ for cluster, scored in zip(clusters, per_cluster_scored):
199
+ for hit in scored:
200
+ aid = hit["arxiv_id"]
201
+ if aid not in paper_cluster_map:
202
+ paper_cluster_map[aid] = cluster.cluster_idx
203
+ if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
204
+ qdrant_score_map[aid] = float(hit["score"])
205
+
206
+ per_cluster_ids = [
207
+ [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
208
+ ]
209
+ candidate_ids = merge_quota_results(per_cluster_ids, quotas)
210
+
211
+ # 8. Short-term supplement search
212
+ with stage("8. Short-term supplement (Qdrant)", timings):
213
+ if st_vec is not None:
214
+ seen_so_far = seen | set(candidate_ids)
215
+ st_scored = await qdrant_svc.search_by_vector_with_scores(
216
+ query_vector=st_vec.tolist(),
217
+ limit=ST_SUPPLEMENT,
218
+ exclude_ids=seen_so_far,
219
+ )
220
+ for hit in st_scored:
221
+ aid = hit["arxiv_id"]
222
+ if aid not in set(candidate_ids):
223
+ candidate_ids.append(aid)
224
+ paper_cluster_map[aid] = -1
225
+ if aid not in qdrant_score_map:
226
+ qdrant_score_map[aid] = float(hit["score"])
227
+
228
+ # 9. Fetch candidate vectors (LRU-cached by arxiv_id in qdrant_svc).
229
+ with stage(f"9. Fetch {len(candidate_ids)} candidate vectors (Qdrant, cached)", timings):
230
+ cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
231
+
232
+ # 10. Fetch candidate metadata from Turso (with cache)
233
+ with stage(f"10. Fetch {len(candidate_ids)} candidate metadata (Turso)", timings):
234
+ cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
235
+
236
+    # 11. Cache metadata to SQLite
+    with stage("11. Cache Turso metadata to SQLite", timings):
+        await db.cache_turso_metadata_batch(list(cand_meta.values()))
+
+    valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
+    valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
+    valid_meta = [cand_meta[cid] for cid in valid_ids]
+
+    # 12. Load profiles (long-term, negative)
+    with stage("12. Load long-term + negative profiles (SQLite)", timings):
+        lt_vec = await profiles.load_profile(user_id, "long_term")
+        neg_vec = await profiles.load_profile(user_id, "negative")
+
+    # 13. SQLite reads (suppression + onboarding)
+    with stage("13. Suppression + onboarding lookup (SQLite)", timings):
+        suppressed = await db.get_suppressed_categories(user_id)
+        onboarding_categories = await db.get_user_category_filter(user_id)
+
+    # 14. Build feature arrays (CPU)
+    with stage("14. Build per-candidate feature arrays (CPU)", timings):
+        user_total_saves = len(state.positive_list)
+        user_total_dismissals = len(state.negative_list)
+        qdrant_scores = np.asarray(
+            [qdrant_score_map.get(cid, 0.0) for cid in valid_ids],
+            dtype=np.float32,
+        )
+        per_cand_imp = np.asarray(
+            [
+                clusters[paper_cluster_map[cid]].importance
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else 0.0
+                for cid in valid_ids
+            ],
+            dtype=np.float32,
+        )
+        per_cand_med = np.stack(
+            [
+                np.asarray(clusters[paper_cluster_map[cid]].medoid_embedding, dtype=np.float32)
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else np.zeros(1024, dtype=np.float32)
+                for cid in valid_ids
+            ],
+            axis=0,
+        )
+        is_suppressed_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in suppressed else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+        onb_match_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in onboarding_categories else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+
+    # 15. LightGBM rerank
+    with stage("15. LightGBM rerank (CPU)", timings):
+        reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
+            candidate_ids=valid_ids,
+            candidate_embeddings=valid_embs,
+            candidate_metadata=valid_meta,
+            long_term_vec=lt_vec,
+            short_term_vec=st_vec,
+            negative_vec=neg_vec,
+            qdrant_scores=qdrant_scores,
+            cluster_importance=per_cand_imp,
+            cluster_medoid=per_cand_med,
+            is_suppressed_category=is_suppressed_arr,
+            onboarding_category_match=onb_match_arr,
+            user_total_saves=user_total_saves,
+            user_total_dismissals=user_total_dismissals,
+        )
+
+    # 16. MMR
+    with stage("16. MMR diversity (CPU)", timings):
+        query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
+        mmr_selected = mmr_rerank(
+            query_embedding=query_vec,
+            candidate_embeddings=reranked_embs,
+            candidate_ids=reranked_ids,
+            scores=reranked_scores,
+            lambda_param=0.6,
+            top_k=REC_LIMIT,
+        )
+
+    # 17. Exploration injection
+    with stage("17. Exploration injection (CPU)", timings):
+        final = inject_exploration(
+            selected_ids=mmr_selected,
+            all_candidate_ids=reranked_ids,
+            n_explore=2,
+        )
+
+    return timings
+
+
+# ── Setup helper for recs profile ────────────────────────────────────────────
+
+async def setup_recs_user(user_id: str, save_ids: list[str]):
+    vecs = await qdrant_svc.get_paper_vectors(save_ids)
+    state = await us.ensure_loaded(user_id)
+    for pid in save_ids:
+        if pid not in vecs:
+            continue
+        state.add_positive(pid)
+        emb = np.array(vecs[pid], dtype=np.float32)
+        await profiles.update_on_save(user_id, emb)
+        await db.log_interaction(user_id, pid, "save")
+
+
+async def cleanup_user(user_id: str):
+    import aiosqlite
+    async with aiosqlite.connect(config.DB_PATH) as conn:
+        for tbl in ["interactions", "user_profiles", "user_clusters",
+                    "user_onboarding", "cluster_snapshots"]:
+            try:
+                await conn.execute(f"DELETE FROM {tbl} WHERE user_id = ?", (user_id,))
+            except Exception:
+                pass
+        await conn.commit()
+    if user_id in us._cache:
+        del us._cache[user_id]
+
+
+async def main():
+    print("=" * 92)
+    print("PIPELINE PROFILER")
+    print("=" * 92)
+
+    await db.init_db()
+
+    # Warm BGE-M3 + Turso connection so first stage isn't a 15s outlier
+    print("\nWarming up BGE-M3 + Turso...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+
+    # ── Search profiling ────────────────────────────────────────────────────
+    print("\n" + "=" * 92)
+    print("SEARCH PIPELINE — three representative queries")
+    print("=" * 92)
+
+    queries = [
+        ("known-item title", "attention is all you need"),
+        ("conceptual rewrite", "when AI makes up fake facts"),
+        ("academic, no rewrite", "BGE-M3 multilingual dense retrieval"),
+    ]
+    for label, q in queries:
+        print(f"\n>>> Query [{label}]: {q!r}")
+        # Run twice — first cold, second warm — to show cache effect
+        for run in (1, 2):
+            timings = await profile_search(q)
+            print_breakdown(f"Run {run}", timings)
+
+    # ── Recs Tier 1 profiling ───────────────────────────────────────────────
+    print("\n\n" + "=" * 92)
+    print("RECS TIER 1 PIPELINE — 10 saved papers (5 NLP + 5 CV)")
+    print("=" * 92)
+
+    user_id = f"profile-recs-{uuid.uuid4().hex[:6]}"
+    save_ids = [
+        "1706.03762", "1810.04805", "2005.14165", "1907.11692", "1910.10683",
+        "1512.03385", "2010.11929", "1409.1556", "1505.04597", "2103.14030",
+    ]
+    try:
+        await setup_recs_user(user_id, save_ids)
+
+        for run in (1, 2, 3):
+            timings = await profile_recs_tier1(user_id, save_ids)
+            print_breakdown(f"Run {run}", timings)
+    finally:
+        await cleanup_user(user_id)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
scripts/test_citation_boost.py ADDED
@@ -0,0 +1,91 @@
+"""Side-by-side comparison: BEFORE vs AFTER citation boost.
+
+Shows beginner vs expert results for the same topic.
+Also verifies Band A (known-item) queries aren't broken.
+"""
+import asyncio, sys, time
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from app import hybrid_search_svc, turso_svc, embed_svc
+
+# Pairs: (topic, beginner_query, expert_query)
+COMPARISONS = [
+    ("TRANSFORMERS",
+     "how do transformers work in NLP",
+     "attention is all you need"),
+    ("DIFFUSION",
+     "what are diffusion models and how do they generate images",
+     "denoising diffusion probabilistic models"),
+    ("GPT-4",
+     "how does GPT-4 work",
+     "GPT-4 Technical Report"),
+    ("RLHF",
+     "what is reinforcement learning from human feedback",
+     "reinforcement learning from human feedback"),
+]
+
+BAND_A = [
+    ("attention is all you need", "1706.03762"),
+    ("Deep Residual Learning for Image Recognition", "1512.03385"),
+    ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
+]
+
+async def run_query(q: str):
+    results = await hybrid_search_svc.search(q, limit=10)
+    meta = {}
+    if results:
+        meta = await turso_svc.fetch_metadata_batch(results)
+    return results, meta
+
+async def main():
+    print("Warming up BGE-M3...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+
+    # === Band A verification ===
+    print()
+    print("=" * 90)
+    print("BAND A VERIFICATION - Known-item queries (must still be #1)")
+    print("=" * 90)
+    for q, expected in BAND_A:
+        results, meta = await run_query(q)
+        rank = results.index(expected) + 1 if expected in results else -1
+        status = "PASS" if rank == 1 else f"RANK #{rank}" if rank > 0 else "MISS"
+        cites = meta.get(expected, {}).get("citation_count", 0)
+        print(f" [{status:>8}] {q[:55]:55s} ({cites} cites)")
+
+    # === Side-by-side comparisons ===
+    print()
+    print("=" * 90)
+    print("SIDE-BY-SIDE: Beginner vs Expert queries (same topic)")
+    print("=" * 90)
+
+    for topic, beginner_q, expert_q in COMPARISONS:
+        print(f"\n--- {topic} ---")
+
+        # Beginner
+        print(f"\n BEGINNER: {beginner_q!r}")
+        results, meta = await run_query(beginner_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count", 0)
+            print(f" {i}. [{cites:>6} cites] {title}")
+
+        # Expert
+        print(f"\n EXPERT: {expert_q!r}")
+        results, meta = await run_query(expert_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count", 0)
+            print(f" {i}. [{cites:>6} cites] {title}")
+
+    print()
+    print("=" * 90)
+    print("DONE")
+    print("=" * 90)
+
+if __name__ == "__main__":
+    asyncio.run(main())
tests/test_hybrid_search.py CHANGED
@@ -102,56 +102,100 @@ class TestRRFFusion:
         assert gap_k10 > gap_k100


-# ── Recency rerank tests ─────────────────────────────────────────────────────
+# ── Title-match rerank tests ─────────────────────────────────────────────────

-class TestRecencyRerank:
-    """Test recency boosting in hybrid_search_svc."""
+class TestTitleMatchRerank:
+    """Test the title-match boost in hybrid_search_svc.

-    def test_recency_boost_newer_papers(self):
-        """Newer papers should get higher recency scores."""
-        from app.hybrid_search_svc import _recency_rerank
+    Recency rerank was removed (it crushed seminal old papers like
+    1706.03762 below newer "X is all you need" titles). Replaced with a
+    title-match boost that promotes papers whose title matches the query.
+    """
+
+    @pytest.mark.asyncio
+    async def test_exact_title_match_wins(self, monkeypatch):
+        """Paper with exact-title match should rank #1 even with low RRF."""
+        from app import hybrid_search_svc
+
+        async def fake_meta(ids):
+            return {
+                "1706.03762": {"title": "Attention Is All You Need"},
+                "2404.01183": {"title": "Positioning Is All You Need"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)

-        # Two papers with same RRF score but different ages
         fused = [
-            {"arxiv_id": "2401.00001", "rrf_score": 0.5},  # Jan 2024
-            {"arxiv_id": "1501.00001", "rrf_score": 0.5},  # Jan 2015
+            {"arxiv_id": "2404.01183", "rrf_score": 0.0317},  # higher RRF
+            {"arxiv_id": "1706.03762", "rrf_score": 0.0164},  # lower RRF, exact match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "1706.03762"

-        ranked = _recency_rerank(fused)
-
-        # Newer paper (2401) should rank higher
-        assert ranked[0]["arxiv_id"] == "2401.00001"
+    @pytest.mark.asyncio
+    async def test_substring_match_beats_no_match(self, monkeypatch):
+        """A substring title match outranks no-match candidates."""
+        from app import hybrid_search_svc

-    def test_recency_preserves_strong_rrf(self):
-        """A much higher RRF score should still dominate over recency."""
-        from app.hybrid_search_svc import _recency_rerank
+        async def fake_meta(ids):
+            return {
+                "2501.05730": {"title": "Element-wise Attention Is All You Need"},
+                "9999.99999": {"title": "An Unrelated Survey of Graph Theory"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)

         fused = [
-            {"arxiv_id": "1501.00001", "rrf_score": 1.0},  # Old but high RRF
-            {"arxiv_id": "2401.00001", "rrf_score": 0.01},  # New but low RRF
+            {"arxiv_id": "9999.99999", "rrf_score": 0.05},  # higher RRF, no match
+            {"arxiv_id": "2501.05730", "rrf_score": 0.01},  # lower RRF, substring match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "2501.05730"
+
+    @pytest.mark.asyncio
+    async def test_no_match_falls_back_to_rrf(self, monkeypatch):
+        """When nothing matches, RRF order is preserved."""
+        from app import hybrid_search_svc

-        ranked = _recency_rerank(fused)
+        async def fake_meta(ids):
+            return {
+                "1234.56789": {"title": "Some Paper"},
+                "9876.54321": {"title": "Another Paper"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)

-        # High RRF should still win (0.80 weight vs 0.20 recency)
-        assert ranked[0]["arxiv_id"] == "1501.00001"
+        fused = [
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
+        ]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "transformer")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]

-    def test_recency_empty_input(self):
+    @pytest.mark.asyncio
+    async def test_empty_input(self):
         """Empty input returns empty output."""
-        from app.hybrid_search_svc import _recency_rerank
-        assert _recency_rerank([]) == []
+        from app import hybrid_search_svc
+        assert await hybrid_search_svc._title_match_rerank([], "anything") == []

-    def test_recency_unparseable_id(self):
-        """Papers with unparseable IDs get neutral recency (0.5)."""
-        from app.hybrid_search_svc import _recency_rerank
+    @pytest.mark.asyncio
+    async def test_turso_failure_falls_back_to_rrf(self, monkeypatch):
+        """If Turso lookup raises, ranking falls back to pure RRF order."""
+        from app import hybrid_search_svc
+
+        async def boom(ids):
+            raise RuntimeError("turso down")
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", boom)

         fused = [
-            {"arxiv_id": "math/0301001", "rrf_score": 0.5},
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
         ]
-
-        ranked = _recency_rerank(fused)
-        assert len(ranked) == 1
-        assert "final_score" in ranked[0]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "attention")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]
+        # final_score must be set even on the fallback path
+        assert all("final_score" in r for r in ranked)


 # ── Groq rewriter tests ─────────────────────────────────────────────────────