Phase 6.5: Pipeline telemetry, search UX fixes, latency profiling
- Instrumented search pipeline: Groq rewrite, BGE-M3 encode, Qdrant+Zilliz retrieval, RRF fusion, title rerank with per-stage timing
- Instrumented recommendation pipeline: clustering, ANN retrieval, metadata fetch, LightGBM rerank, MMR diversity
- Split Title+Citation Rerank into Turso fetch vs compute time (exposed hidden 1.5s network call)
- Added search loading overlay with pipeline stage labels
- Fixed HTMX search: recommendations now hide when search starts
- Fixed paper card: truncate authors (max 3 + et al.), hard-truncate abstract to 500 chars
- Show Groq rewrite status (skipped/rewritten/error) in both banner and breakdown
- Added Groq heuristic visibility: shows skip reason (query too short, looks academic)
- Added parallel task count to retrieval breakdown
- New evaluation and diagnostic scripts
- Removed deprecated s2_svc.py
- .github/skills/researchit-codebase-overview/SKILL.md +48 -0
- .github/skills/researchit-data-layer/SKILL.md +31 -0
- .github/skills/researchit-debug-performance/SKILL.md +31 -0
- .github/skills/researchit-recs-analysis/SKILL.md +42 -0
- .github/skills/researchit-reranker-explainer/SKILL.md +30 -0
- .github/skills/researchit-search-analysis/SKILL.md +34 -0
- .github/skills/researchit-testing-eval/SKILL.md +30 -0
- CLAUDE.md +2 -0
- README.md +1 -1
- app/config.py +1 -2
- app/groq_svc.py +19 -13
- app/hybrid_search_svc.py +316 -94
- app/qdrant_svc.py +87 -17
- app/recommend/clustering.py +76 -2
- app/recommend/reranker.py +1 -1
- app/routers/onboarding.py +7 -99
- app/routers/recommendations.py +44 -15
- app/routers/search.py +13 -2
- app/s2_svc.py +0 -111
- app/templates/index.html +2 -12
- app/templates/partials/paper_card.html +10 -5
- app/templates/partials/recommendations.html +34 -0
- app/templates/partials/search_results.html +78 -2
- app/templates/partials/seed_results.html +41 -0
- app/templates/partials/seed_search.html +2 -60
- app/templates/search.html +55 -9
- app/turso_svc.py +127 -9
- docs/TASK-TRACKER.md +22 -22
- docs/previous_prompt.txt +0 -0
- docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md +21 -20
- requirements.txt +1 -1
- scripts/browser_test_onboarding.py +75 -0
- scripts/browser_test_search.py +77 -0
- scripts/diag_mamba.py +69 -0
- scripts/diag_search_rank.py +45 -0
- scripts/e2e_audit.py +622 -0
- scripts/eval_expanded_queries.py +336 -0
- scripts/eval_recs_quality.py +547 -0
- scripts/eval_search_quality.py +197 -0
- scripts/expanded_eval_results.json +0 -0
- scripts/profile_pipelines.py +410 -0
- scripts/test_citation_boost.py +91 -0
- tests/test_hybrid_search.py +76 -32
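The per-stage timing mentioned in the commit message could be collected along these lines. This is a minimal sketch, not the code from this commit: `StageTimer`, `run_demo`, and the stage names are hypothetical, and the `time.sleep` call only stands in for a network fetch such as the Turso call.

```python
from __future__ import annotations

import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock timings (in ms) for named pipeline stages."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, int] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised, so failures stay visible.
            elapsed = int((time.perf_counter() - start) * 1000)
            self.timings_ms[name] = self.timings_ms.get(name, 0) + elapsed


def run_demo() -> dict[str, int]:
    timer = StageTimer()
    with timer.stage("turso_fetch"):
        time.sleep(0.02)  # stand-in for a network round trip
    with timer.stage("rerank_compute"):
        sum(i * i for i in range(1000))  # stand-in for CPU-only scoring
    return timer.timings_ms
```

Timing the fetch and the compute step as separate stages is what makes a hidden network call show up as its own line in the breakdown.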
|
@@ -0,0 +1,48 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-codebase-overview
|
| 3 |
+
description: "Explain the ResearchIT codebase architecture and current state. Use for onboarding, project overviews, and accurate summaries of how the system works. Triggers: codebase overview, architecture summary, explain this project, how this works, system map."
|
| 4 |
+
argument-hint: "Specify audience (dev/stakeholder), depth (brief/standard/deep), and focus (search/recs/data)."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# ResearchIT Codebase Overview
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user asks for a full understanding of the codebase or architecture.
|
| 11 |
+
- You need to produce a top-level system map or explain how components interact.
|
| 12 |
+
- You need a concise but accurate "what is happening here" summary.
|
| 13 |
+
|
| 14 |
+
## Inputs to Ask For (if missing)
|
| 15 |
+
- Audience: developer vs stakeholder.
|
| 16 |
+
- Depth: brief, standard, or deep.
|
| 17 |
+
- Focus areas: search, recommendations, data layer, evaluation.
|
| 18 |
+
|
| 19 |
+
## Required Sources (read in this order)
|
| 20 |
+
1. CLAUDE.md (rules and source-of-truth doc map).
|
| 21 |
+
2. docs/research/06-Deep-Research-Verdict.md (architecture decisions).
|
| 22 |
+
3. README.md (current system summary).
|
| 23 |
+
4. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md (module map).
|
| 24 |
+
5. docs/walkthroughs/04-Next-Steps-and-Phase-Plan.md (current phase).
|
| 25 |
+
|
| 26 |
+
## Procedure
|
| 27 |
+
1. State the product goal in one sentence and the system constraints (CPU-only, latency budget).
|
| 28 |
+
2. Describe the high-level architecture (frontend, backend, vector stores, metadata DB, SQLite).
|
| 29 |
+
3. Summarize the two main pipelines:
|
| 30 |
+
- Search: rewrite -> encode -> dense+sparse -> RRF -> title/citation boost.
|
| 31 |
+
- Recommendations: clustering -> quota -> rerank -> MMR -> exploration.
|
| 32 |
+
4. Call out invariants from doc 06 (quota for recs, RRF for search, alpha values, MMR lambda).
|
| 33 |
+
5. Explain data flow and caching (Turso LRU, Qdrant vector cache, SQLite metadata cache).
|
| 34 |
+
6. State current phase status and what is out of scope.
|
| 35 |
+
|
| 36 |
+
## Output Format
|
| 37 |
+
- 6 to 10 bullet points, ordered by importance.
|
| 38 |
+
- Short "where to look" section with key files.
|
| 39 |
+
- If stakeholder audience: avoid implementation detail and emphasize outcomes.
|
| 40 |
+
|
| 41 |
+
## Key Files to Cite
|
| 42 |
+
- app/main.py
|
| 43 |
+
- app/routers/recommendations.py
|
| 44 |
+
- app/routers/search.py
|
| 45 |
+
- app/hybrid_search_svc.py
|
| 46 |
+
- app/recommend/*
|
| 47 |
+
- app/qdrant_svc.py, app/zilliz_svc.py, app/turso_svc.py
|
| 48 |
+
- app/db.py
|
|
@@ -0,0 +1,31 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-data-layer
|
| 3 |
+
description: "Explain the data/storage layer (SQLite, Turso metadata, Qdrant dense vectors, Zilliz sparse vectors). Use for data integrity, schema questions, caching behavior, and ID handling. Triggers: database schema, metadata cache, Qdrant mapping, Zilliz schema."
|
| 4 |
+
argument-hint: "Specify the component(s) and whether you want schema details or runtime behavior."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Data and Storage Layer Analysis
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user asks about storage, caching, or schemas.
|
| 11 |
+
- You need to validate data integrity or ID handling.
|
| 12 |
+
- You need to explain how metadata or vector mappings work.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. app/db.py (SQLite schema + migrations)
|
| 16 |
+
2. app/turso_svc.py (metadata + caches)
|
| 17 |
+
3. app/qdrant_svc.py (ID mapping + vector cache)
|
| 18 |
+
4. app/zilliz_svc.py (sparse schema + search)
|
| 19 |
+
5. app/arxiv_svc.py (API fallback + ID normalization)
|
| 20 |
+
|
| 21 |
+
## Procedure
|
| 22 |
+
1. Summarize each store and its responsibility (SQLite, Turso, Qdrant, Zilliz).
|
| 23 |
+
2. Explain arXiv ID handling (always string; never integer coercion).
|
| 24 |
+
3. Document caches (vector cache, metadata LRU, trending cache).
|
| 25 |
+
4. Note schema migrations and instrumentation columns.
|
| 26 |
+
5. Identify data consistency boundaries and fallbacks.
|
| 27 |
+
|
| 28 |
+
## Output Format
|
| 29 |
+
- Component-by-component description.
|
| 30 |
+
- Tables/fields summary for SQLite.
|
| 31 |
+
- Integrity rules and common pitfalls.
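The "always string" rule for arXiv IDs can be illustrated with a short sketch. `normalize_arxiv_id` is a hypothetical helper, and stripping the `arXiv:` prefix and the `vN` version suffix is an assumption about what normalization means here; the actual logic lives in `app/arxiv_svc.py`.

```python
import re

# New-style IDs look like "2301.10000" or "0704.0001", optionally followed by "vN".
# Old-style IDs such as "hep-th/9901001" are passed through unchanged in this sketch.
_NEW_STYLE = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def normalize_arxiv_id(raw: str) -> str:
    """Return a canonical string ID; refuse anything that is not already a string."""
    if not isinstance(raw, str):
        raise TypeError(f"arXiv IDs must be strings, got {type(raw).__name__}")
    s = raw.strip()
    if s.lower().startswith("arxiv:"):
        s = s[len("arxiv:"):]
    match = _NEW_STYLE.match(s)
    return match.group(1) if match else s


# Why numeric coercion is unsafe: leading and trailing zeros are lost.
assert str(float("2301.10000")) == "2301.1"
assert str(float("0704.0001")) == "704.0001"
```

Once an ID has passed through a numeric type, it no longer matches the key stored in the vector or metadata stores, so lookups fail silently rather than loudly.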
|
|
@@ -0,0 +1,31 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-debug-performance
|
| 3 |
+
description: "Debug performance and quality issues in search or recommendations. Use for latency spikes, slow retrievals, or degraded relevance. Triggers: performance issue, slow search, slow recs, latency debug."
|
| 4 |
+
argument-hint: "Specify area (search/recs/data), symptoms, and whether to propose fixes."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Debugging and Performance Profiling
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- Latency regressions or slow responses appear.
|
| 11 |
+
- Search or recommendation quality drops unexpectedly.
|
| 12 |
+
- External services time out or return empty results.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. app/qdrant_svc.py (vector cache, retrieve latency)
|
| 16 |
+
2. app/turso_svc.py (metadata cache, trending cache)
|
| 17 |
+
3. app/hybrid_search_svc.py (RRF pipeline)
|
| 18 |
+
4. app/routers/recommendations.py (candidate flow + oversample)
|
| 19 |
+
5. app/recommend/reranker.py (model load, feature cost)
|
| 20 |
+
|
| 21 |
+
## Procedure
|
| 22 |
+
1. Identify the failing pipeline (search vs recommendations).
|
| 23 |
+
2. Check cache hit rates conceptually (vector and metadata caches).
|
| 24 |
+
3. Inspect candidate fetch sizes and oversampling factors.
|
| 25 |
+
4. Review service fallbacks (Zilliz, Turso, arXiv).
|
| 26 |
+
5. Isolate latency contributors and propose focused fixes.
|
| 27 |
+
|
| 28 |
+
## Output Format
|
| 29 |
+
- Symptom -> probable cause mapping.
|
| 30 |
+
- Targeted checks in code.
|
| 31 |
+
- Minimal, low-risk fix options.
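Checking a cache hit rate can be made concrete with the standard library. The sketch below uses `functools.lru_cache` as a stand-in; the project's own caches in `app/turso_svc.py` and `app/qdrant_svc.py` may be implemented differently, and `fetch_metadata`, `hit_rate`, and `demo_hit_rate` are hypothetical names.

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def fetch_metadata(arxiv_id: str) -> tuple:
    # Stand-in for a remote metadata lookup; returns an immutable value
    # so cached results cannot be mutated by callers.
    return (arxiv_id, f"title for {arxiv_id}")


def hit_rate(cached_fn) -> float:
    """Fraction of calls served from cache, based on lru_cache statistics."""
    info = cached_fn.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


def demo_hit_rate() -> float:
    fetch_metadata.cache_clear()  # also resets the hit/miss counters
    for arxiv_id in ["a", "b", "a", "c", "b"]:
        fetch_metadata(arxiv_id)
    return hit_rate(fetch_metadata)
```

With a capacity of 2, the access pattern above yields one hit in five calls: the second lookup of `"b"` misses because `"c"` evicted it. A low hit rate under a realistic access pattern is a sign that the cache is undersized for the working set.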
|
|
@@ -0,0 +1,42 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-recs-analysis
|
| 3 |
+
description: "Analyze and explain the recommendation pipeline. Use for recs debugging, feature reviews, pipeline changes, or explaining multi-interest behavior. Triggers: recommendation pipeline, recs analysis, multi-interest, quota fusion, reranker."
|
| 4 |
+
argument-hint: "Specify the task (explain/debug/change), expected output (summary/findings), and whether to include tests."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Recommendation Pipeline Analysis
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user wants a deep explanation of recommendations or changes.
|
| 11 |
+
- You need to verify rules like quota fusion, EWMA alphas, or MMR usage.
|
| 12 |
+
- You are asked to debug rec quality or performance.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. CLAUDE.md and docs/research/06-Deep-Research-Verdict.md (non-negotiables).
|
| 16 |
+
2. app/routers/recommendations.py (pipeline and instrumentation).
|
| 17 |
+
3. app/recommend/profiles.py (EWMA parameters).
|
| 18 |
+
4. app/recommend/clustering.py (Ward + medoids + stabilization).
|
| 19 |
+
5. app/recommend/fusion.py (quota logic).
|
| 20 |
+
6. app/recommend/reranker.py (LightGBM + features).
|
| 21 |
+
7. app/recommend/diversity.py (MMR + exploration).
|
| 22 |
+
|
| 23 |
+
## Procedure
|
| 24 |
+
1. Identify which tier is active and the fallback sequence.
|
| 25 |
+
2. Validate invariant rules:
|
| 26 |
+
- Search uses RRF, recommendations do not.
|
| 27 |
+
- Quota fusion with floor; MMR lambda is 0.6.
|
| 28 |
+
- alpha_long=0.03, alpha_short=0.40, alpha_neg=0.15.
|
| 29 |
+
3. Trace candidate flow:
|
| 30 |
+
- Medoids -> per-cluster search -> dedup -> rerank -> MMR -> exploration.
|
| 31 |
+
4. Check instrumentation fields: query_id, propensity, policy_id.
|
| 32 |
+
5. Summarize likely failure modes: missing vectors, empty clusters, cache misses.
|
| 33 |
+
6. Recommend targeted tests or metrics to verify changes.
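To make the alpha invariants concrete, here is a minimal sketch of an exponentially weighted moving average update. The update form `(1 - alpha) * profile + alpha * embedding` is the textbook EWMA and is an assumption here; `app/recommend/profiles.py` remains the source of truth, and `ewma_update` is a hypothetical name.

```python
from __future__ import annotations

ALPHA_LONG = 0.03   # slow-moving long-term profile
ALPHA_SHORT = 0.40  # fast-moving short-term profile


def ewma_update(profile: list[float], embedding: list[float], alpha: float) -> list[float]:
    """Blend a new interaction embedding into an existing profile vector."""
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(profile, embedding)]


def after_n_updates(alpha: float, n: int) -> float:
    """First component of a zero profile after n identical unit interactions."""
    profile = [0.0, 0.0]
    for _ in range(n):
        profile = ewma_update(profile, [1.0, 1.0], alpha)
    return profile[0]
```

After three identical interactions the short-term profile has moved `1 - 0.6**3`, about 78%, of the way to the new signal, while the long-term profile has moved `1 - 0.97**3`, about 9%. That gap is the behavior the two alpha values are meant to produce.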
|
| 34 |
+
|
| 35 |
+
## Output Format
|
| 36 |
+
- Pipeline summary with stages and main functions.
|
| 37 |
+
- Invariants checklist (pass/fail).
|
| 38 |
+
- Risks and suggested tests.
|
| 39 |
+
|
| 40 |
+
## Notes
|
| 41 |
+
- Never propose RRF for multi-medoid recommendations.
|
| 42 |
+
- Do not introduce cross-encoders into the hot path.
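For reference when checking the MMR invariant, the following is a minimal sketch of greedy Maximal Marginal Relevance with lambda = 0.6. It is a generic formulation, not the code in `app/recommend/diversity.py`; the function names and the toy vectors are hypothetical.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    relevance: dict[str, float],
    vectors: dict[str, list[float]],
    top_n: int,
    lam: float = 0.6,
) -> list[str]:
    """Greedy MMR: trade relevance against similarity to already-selected items."""
    selected: list[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < top_n:

        def mmr_score(pid: str) -> float:
            max_sim = max(
                (cosine(vectors[pid], vectors[s]) for s in selected), default=0.0
            )
            return lam * relevance[pid] - (1.0 - lam) * max_sim

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.05], "c": [0.0, 1.0]}
```

With these toy inputs, `mmr_select(relevance, vectors, 2)` picks `"a"` and then `"c"`: the near-duplicate `"b"` is penalized for its similarity to `"a"` even though its raw relevance is higher than that of `"c"`.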
|
|
@@ -0,0 +1,30 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-reranker-explainer
|
| 3 |
+
description: "Explain the LightGBM reranker, feature schema, and fallback behavior. Use for model integration checks, feature debugging, or deployment validation. Triggers: reranker, LightGBM, feature schema, model loading."
|
| 4 |
+
argument-hint: "Specify: explain, validate, or troubleshoot."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Reranker and Feature Schema Explainer
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user asks how the reranker works or which features are used.
|
| 11 |
+
- You need to validate model loading and fallback behavior.
|
| 12 |
+
- You are reviewing feature wiring or scoring behavior.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. app/recommend/reranker.py
|
| 16 |
+
2. models/reranker-phase6/production_model/feature_schema.json
|
| 17 |
+
3. app/routers/health.py
|
| 18 |
+
4. app/routers/recommendations.py (feature wiring)
|
| 19 |
+
|
| 20 |
+
## Procedure
|
| 21 |
+
1. Confirm model load paths and fallback logic.
|
| 22 |
+
2. Verify the 37-feature ordering matches the schema.
|
| 23 |
+
3. Explain which features are active in recommendations and how they are computed.
|
| 24 |
+
4. Confirm health endpoint expectations (/healthz/reranker).
|
| 25 |
+
5. Provide a concise explanation of latency and why cross-encoders are excluded.
|
| 26 |
+
|
| 27 |
+
## Output Format
|
| 28 |
+
- Model load status + fallback behavior.
|
| 29 |
+
- Feature group summary (content, behavior, cross features).
|
| 30 |
+
- Integration checklist.
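The feature-ordering check in the procedure can be sketched as a plain comparison. The layout of `feature_schema.json` is not shown here, so the `feature_names` key and the feature names used below are assumptions for illustration only.

```python
from __future__ import annotations

import json


def check_feature_order(schema_json: str, runtime_names: list[str]) -> list[str]:
    """Return human-readable mismatches between schema order and runtime order."""
    expected = json.loads(schema_json)["feature_names"]  # hypothetical key
    problems: list[str] = []
    if len(expected) != len(runtime_names):
        problems.append(
            f"count mismatch: schema={len(expected)} runtime={len(runtime_names)}"
        )
    for index, (want, got) in enumerate(zip(expected, runtime_names)):
        if want != got:
            problems.append(f"index {index}: schema={want!r} runtime={got!r}")
    return problems
```

An empty list means the runtime feature vector lines up with the schema. Gradient-boosted models are commonly fed features by position, so two swapped columns can degrade ranking without raising any error, which is why an explicit check is useful.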
|
|
@@ -0,0 +1,34 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-search-analysis
|
| 3 |
+
description: "Explain or analyze the hybrid semantic search pipeline (rewrite, encode, dense+sparse, RRF, title/citation boost). Use for search quality, latency, and correctness reviews. Triggers: search pipeline, hybrid search, RRF, BGE-M3 search."
|
| 4 |
+
argument-hint: "Specify: explain vs debug, and whether to include latency hotspots."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Search Pipeline Analysis
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user wants to understand or debug search results.
|
| 11 |
+
- You need to review hybrid search correctness.
|
| 12 |
+
- You are asked about RRF usage or query rewriting.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. app/routers/search.py
|
| 16 |
+
2. app/hybrid_search_svc.py
|
| 17 |
+
3. app/embed_svc.py
|
| 18 |
+
4. app/qdrant_svc.py
|
| 19 |
+
5. app/zilliz_svc.py
|
| 20 |
+
6. app/groq_svc.py
|
| 21 |
+
7. app/turso_svc.py and app/arxiv_svc.py
|
| 22 |
+
|
| 23 |
+
## Procedure
|
| 24 |
+
1. Trace the full pipeline from query to results.
|
| 25 |
+
2. Call out the dual-encode design (original + rewrite) and why it exists.
|
| 26 |
+
3. Verify RRF usage is limited to search fusion (correct per doc 06).
|
| 27 |
+
4. Explain title/citation boosts and their intended effect.
|
| 28 |
+
5. Document fallback behavior when any component fails.
|
| 29 |
+
6. Summarize latency hotspots and caching layers.
|
| 30 |
+
|
| 31 |
+
## Output Format
|
| 32 |
+
- Step-by-step pipeline description.
|
| 33 |
+
- Fallbacks and failure handling.
|
| 34 |
+
- Notes on ranking behavior and edge cases.
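When explaining the fusion step, a small worked example of Reciprocal Rank Fusion helps. The sketch below uses the same scoring rule as `_rrf_fuse` in `app/hybrid_search_svc.py` (1-indexed ranks, `1 / (k + rank)`, k = 60), but `rrf_fuse` and the toy result lists are illustrative names, not project code.

```python
from __future__ import annotations


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dense = ["A", "B", "C"]   # toy dense-retriever ranking
sparse = ["B", "D", "A"]  # toy sparse-retriever ranking
fused = rrf_fuse([dense, sparse])
```

Here `"B"` (ranks 2 and 1) scores `1/62 + 1/61` and edges out `"A"` (ranks 1 and 3) at `1/61 + 1/63`; items seen by only one retriever, `"D"` and `"C"`, follow. Only ranks are used, so the raw scores of the two retrievers never need to be put on a common scale.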
|
|
@@ -0,0 +1,30 @@
| 1 |
+
---
|
| 2 |
+
name: researchit-testing-eval
|
| 3 |
+
description: "Guide testing and evaluation for ResearchIT. Use for test planning, running tests, and explaining evaluation metrics. Triggers: testing plan, run tests, evaluation metrics, offline eval."
|
| 4 |
+
argument-hint: "Specify scope (unit/integration/e2e) and whether to include metrics."
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Testing and Evaluation Guidance
|
| 8 |
+
|
| 9 |
+
## When to Use
|
| 10 |
+
- The user wants to run or plan tests.
|
| 11 |
+
- The user asks about evaluation metrics or offline evaluation.
|
| 12 |
+
- You need to explain test coverage or risks.
|
| 13 |
+
|
| 14 |
+
## Required Sources
|
| 15 |
+
1. docs/walkthroughs/03-Code-Summary-and-Test-Plan.md
|
| 16 |
+
2. tests/ (overview)
|
| 17 |
+
3. pytest.ini
|
| 18 |
+
4. test_e2e_recs.py
|
| 19 |
+
|
| 20 |
+
## Procedure
|
| 21 |
+
1. Identify test scope (unit, integration, live, e2e).
|
| 22 |
+
2. Provide the correct test command(s) and file locations.
|
| 23 |
+
3. Call out live tests that hit external services.
|
| 24 |
+
4. Provide evaluation metrics and how they map to system goals.
|
| 25 |
+
5. Note any missing coverage or potential regressions.
|
| 26 |
+
|
| 27 |
+
## Output Format
|
| 28 |
+
- Test scope summary.
|
| 29 |
+
- Commands and expected outputs.
|
| 30 |
+
- Evaluation metric checklist.
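As one example of an offline ranking metric, nDCG@k can be computed as below. This is a generic sketch: whether the project's evaluation scripts use nDCG, and with which gain function, is an assumption. The ideal ordering is derived from the same relevance list, so items that were never retrieved are not counted.

```python
from __future__ import annotations

import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and a list with the most relevant item in second place scores below 1.0, which makes the metric sensitive to the top-of-list ordering that users actually see.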
|
|
@@ -205,6 +205,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
|
|
| 205 |
- Onboarding wizard (category multi-select + seed search)
|
| 206 |
- Category-filtered trending fallback
|
| 207 |
- Dark-mode base UI + updated paper cards
|
|
|
|
| 208 |
|
| 209 |
**Phase 6 - LightGBM reranker (COMPLETE ✅):**
|
| 210 |
- LightGBM LambdaRank (141 trees, 37 features) integrated with heuristic fallback
|
|
@@ -216,6 +217,7 @@ Every interaction logged via `db.log_interaction()` must carry **`query_id`**, *
|
|
| 216 |
- Phase 6.4 (retraining) deferred: gated on 100 users or synthetic simulator
|
| 217 |
|
| 218 |
**Out of scope until later phases - do not build:**
|
|
|
|
| 219 |
- Collaborative filtering / LightFM (Phase 9, 500+ users).
|
| 220 |
- Cross-encoder reranking in serving path (never; only distilled - Phase 8).
|
| 221 |
- Claude/Groq-generated cluster summaries (Phase 8).
|
|
|
|
| 205 |
- Onboarding wizard (category multi-select + seed search)
|
| 206 |
- Category-filtered trending fallback
|
| 207 |
- Dark-mode base UI + updated paper cards
|
| 208 |
+
- S2/ORCID author import was explored and **removed** - not the direction we want
|
| 209 |
|
| 210 |
**Phase 6 - LightGBM reranker (COMPLETE ✅):**
|
| 211 |
- LightGBM LambdaRank (141 trees, 37 features) integrated with heuristic fallback
|
|
|
|
| 217 |
- Phase 6.4 (retraining) deferred: gated on 100 users or synthetic simulator
|
| 218 |
|
| 219 |
**Out of scope until later phases - do not build:**
|
| 220 |
+
- S2/ORCID author import for onboarding (removed - not the direction we want).
|
| 221 |
- Collaborative filtering / LightFM (Phase 9, 500+ users).
|
| 222 |
- Cross-encoder reranking in serving path (never; only distilled - Phase 8).
|
| 223 |
- Claude/Groq-generated cluster summaries (Phase 8).
|
|
@@ -276,7 +276,7 @@ curl -s https://siddhm11-researchit.hf.space/healthz/reranker | python -m json.t
|
|
| 276 |
| `TURSO_URL` | Yes | Turso database URL |
|
| 277 |
| `TURSO_DB_TOKEN` | Yes | Turso auth token |
|
| 278 |
| `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
|
| 279 |
-
| `S2_API_KEY` | No | Semantic Scholar API key (training only) |
|
| 280 |
| `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
|
| 281 |
| `DB_PATH` | No | SQLite path (default: `interactions.db`) |
|
| 282 |
|
|
|
|
| 276 |
| `TURSO_URL` | Yes | Turso database URL |
|
| 277 |
| `TURSO_DB_TOKEN` | Yes | Turso auth token |
|
| 278 |
| `GROQ_API_KEY` | Yes | Groq API key for query rewriting |
|
| 279 |
+
| `S2_API_KEY` | No | Semantic Scholar API key (offline training scripts only, not used by the app) |
|
| 280 |
| `RERANKER_MODEL_PATH` | No | Override LightGBM model file path |
|
| 281 |
| `DB_PATH` | No | SQLite path (default: `interactions.db`) |
|
| 282 |
|
|
@@ -24,8 +24,7 @@ METADATA_CACHE_TTL_DAYS = 30 # re-fetch metadata after this many days
|
|
| 24 |
TURSO_URL = os.getenv("TURSO_URL", "")
|
| 25 |
TURSO_DB_TOKEN = os.getenv("TURSO_DB_TOKEN", "")
|
| 26 |
|
| 27 |
-
|
| 28 |
-
S2_API_KEY = os.getenv("S2_API_KEY", "")
|
| 29 |
|
| 30 |
# ── Recommendation settings ───────────────────────────────────────────────────
|
| 31 |
REC_LIMIT = 10 # how many recommendations to show
|
|
|
|
| 24 |
TURSO_URL = os.getenv("TURSO_URL", "")
|
| 25 |
TURSO_DB_TOKEN = os.getenv("TURSO_DB_TOKEN", "")
|
| 26 |
|
| 27 |
+
|
|
|
|
| 28 |
|
| 29 |
# ── Recommendation settings ───────────────────────────────────────────────────
|
| 30 |
REC_LIMIT = 10 # how many recommendations to show
|
|
@@ -45,29 +45,29 @@ def _get_client():
|
|
| 45 |
|
| 46 |
_SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.
|
| 47 |
|
| 48 |
-
Your job: Convert casual or
|
| 49 |
|
| 50 |
Rules:
|
| 51 |
1. Output ONLY the rewritten query string - no explanation, no quotes, no preamble.
|
| 52 |
-
2.
|
| 53 |
-
3.
|
| 54 |
-
4.
|
| 55 |
|
| 56 |
Examples:
|
| 57 |
User: "when AI makes up fake facts"
|
| 58 |
-
Output: LLM hallucination factual errors
|
| 59 |
|
| 60 |
User: "the llama model by facebook"
|
| 61 |
-
Output: LLaMA
|
| 62 |
|
| 63 |
-
User: "
|
| 64 |
-
Output:
|
| 65 |
|
| 66 |
-
User: "
|
| 67 |
-
Output:
|
| 68 |
|
| 69 |
-
User: "
|
| 70 |
-
Output:
|
| 71 |
|
| 72 |
|
| 73 |
# ── Heuristic: should we skip rewriting? ─────────────────────────────────────
|
|
@@ -85,8 +85,14 @@ _ACADEMIC_PATTERN = re.compile(
|
|
| 85 |
|
| 86 |
|
| 87 |
def _looks_academic(query: str) -> bool:
|
| 88 |
-
"""Heuristic: skip rewriting if query already
|
| 89 |
words = query.split()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
if len(words) > 6:
|
| 91 |
matches = len(_ACADEMIC_PATTERN.findall(query))
|
| 92 |
if matches >= 2:
|
|
|
|
| 45 |
|
| 46 |
_SYSTEM_PROMPT = """You are an academic search query optimizer for arXiv papers.
|
| 47 |
|
| 48 |
+
Your job: Convert casual or conversational user queries into academic search strings.
|
| 49 |
|
| 50 |
Rules:
|
| 51 |
1. Output ONLY the rewritten query string - no explanation, no quotes, no preamble.
|
| 52 |
+
2. If the user's query is casual or conversational, rewrite it using standard academic terms.
|
| 53 |
+
3. CRITICAL: If the query is ALREADY a precise technical term, a single keyword, an acronym, or a known paper title (e.g., "perplexity", "transformers", "Adam optimizer"), DO NOT expand it. Return it EXACTLY AS IS. Do NOT add random related words.
|
| 54 |
+
4. Never output more than 8 words.
|
| 55 |
|
| 56 |
Examples:
|
| 57 |
User: "when AI makes up fake facts"
|
| 58 |
+
Output: LLM hallucination factual errors
|
| 59 |
|
| 60 |
User: "the llama model by facebook"
|
| 61 |
+
Output: LLaMA foundation language model Meta AI
|
| 62 |
|
| 63 |
+
User: "perplexity"
|
| 64 |
+
Output: perplexity
|
| 65 |
|
| 66 |
+
User: "attention is all you need"
|
| 67 |
+
Output: attention is all you need
|
| 68 |
|
| 69 |
+
User: "gradient descent"
|
| 70 |
+
Output: gradient descent"""
|
| 71 |
|
| 72 |
|
| 73 |
# ── Heuristic: should we skip rewriting? ─────────────────────────────────────
|
|
|
|
| 85 |
|
| 86 |
|
| 87 |
def _looks_academic(query: str) -> bool:
|
| 88 |
+
"""Heuristic: skip rewriting if query already looks academic or is very short."""
|
| 89 |
words = query.split()
|
| 90 |
+
|
| 91 |
+
# 1-2 word queries are usually precise keywords or author names (e.g., "perplexity", "lecun")
|
| 92 |
+
# Expanding them almost always ruins the precision.
|
| 93 |
+
if len(words) <= 2:
|
| 94 |
+
return True
|
| 95 |
+
|
| 96 |
if len(words) > 6:
|
| 97 |
matches = len(_ACADEMIC_PATTERN.findall(query))
|
| 98 |
if matches >= 2:
|
|
@@ -6,23 +6,29 @@ Orchestrates the full pipeline:
|
|
| 6 |
2. BGE-M3 encode → dense + sparse
|
| 7 |
3. Parallel search: Qdrant dense + Zilliz sparse
|
| 8 |
4. RRF fusion (K=60)
|
| 9 |
-
5.
|
| 10 |
6. Return ranked arxiv_ids
|
| 11 |
|
| 12 |
Doc 06 confirms: RRF is correct for search (fusing different retrievers
|
| 13 |
answering the SAME query). This is different from recommendations where
|
| 14 |
quota is correct (fusing different queries for the SAME user).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
"""
|
| 16 |
from __future__ import annotations
|
| 17 |
|
| 18 |
import asyncio
|
| 19 |
-
|
| 20 |
|
| 21 |
from app import config
|
| 22 |
from app import embed_svc
|
| 23 |
from app import qdrant_svc
|
| 24 |
from app import zilliz_svc
|
| 25 |
from app import groq_svc
|
|
|
|
| 26 |
|
| 27 |
|
| 28 |
# ── Public API ───────────────────────────────────────────────────────────────
|
|
@@ -31,18 +37,20 @@ async def search(
|
|
| 31 |
query: str,
|
| 32 |
limit: int = 10,
|
| 33 |
use_rewrite: bool = True,
|
| 34 |
-
|
|
|
|
| 35 |
"""
|
| 36 |
Hybrid semantic search - returns a list of arxiv_ids ranked by
|
| 37 |
fused relevance.
|
| 38 |
|
| 39 |
Pipeline:
|
| 40 |
-
rewrite → encode → parallel(dense, sparse) → RRF →
|
| 41 |
|
| 42 |
Args:
|
| 43 |
query: User's raw search query.
|
| 44 |
limit: Number of results to return.
|
| 45 |
use_rewrite: Whether to attempt LLM query rewriting.
|
|
|
|
| 46 |
|
| 47 |
Returns:
|
| 48 |
list of arxiv_id strings, sorted by final score descending.
|
|
@@ -50,55 +58,115 @@ async def search(
|
|
| 50 |
"""
|
| 51 |
query = query.strip()
|
| 52 |
if not query:
|
| 53 |
-
return []
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
# ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
|
| 56 |
-
|
| 57 |
if use_rewrite:
|
|
|
|
| 58 |
try:
|
| 59 |
-
|
| 60 |
except Exception:
|
| 61 |
-
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
dense_vec, sparse_dict = embed_svc.encode_query(search_query)
|
| 66 |
-
except Exception as e:
|
| 67 |
-
print(f"[hybrid_search] Encoding failed: {e}")
|
| 68 |
-
return []
|
| 69 |
|
| 70 |
# How many candidates to fetch before reranking
|
| 71 |
fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
|
| 72 |
|
| 73 |
-
# ── Step 3: Parallel dense + sparse search ───
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
)
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
|
| 94 |
if not fused:
|
| 95 |
-
return []
|
| 96 |
-
|
| 97 |
-
# ── Step 5:
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
# ── Step 6: Return top results ───────────────────────────────────────
|
| 101 |
-
|
|
|
|
| 102 |
|
| 103 |
|
| 104 |
# ── RRF fusion ───────────────────────────────────────────────────────────────
|
|
@@ -109,92 +177,246 @@ def _rrf_fuse(
|
|
| 109 |
k: int = 60,
|
| 110 |
) -> list[dict]:
|
| 111 |
"""
|
| 112 |
-
Reciprocal Rank Fusion
|
| 113 |
|
| 114 |
-
|
| 115 |
|
| 116 |
RRF is rank-based, so raw scores from different systems don't need
|
| 117 |
-
normalization
|
| 118 |
-
|
|
|
|
| 119 |
|
| 120 |
Args:
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
k: RRF constant (default 60)
|
| 124 |
|
| 125 |
Returns:
|
| 126 |
-
list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc
|
| 127 |
"""
|
| 128 |
scores: dict[str, float] = {}
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
-
# Dense contributions (rank = position in sorted list, 1-indexed)
|
| 131 |
-
for rank, item in enumerate(dense_results, start=1):
|
| 132 |
-
aid = item["arxiv_id"]
|
| 133 |
-
scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
|
| 134 |
-
|
| 135 |
-
# Sparse contributions
|
| 136 |
-
for rank, item in enumerate(sparse_results, start=1):
|
| 137 |
-
aid = item["arxiv_id"]
|
| 138 |
-
scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
|
| 139 |
-
|
| 140 |
-
# Sort by fused score descending
|
| 141 |
fused = [
|
| 142 |
{"arxiv_id": aid, "rrf_score": score}
|
| 143 |
for aid, score in scores.items()
|
| 144 |
]
|
| 145 |
fused.sort(key=lambda x: x["rrf_score"], reverse=True)
|
| 146 |
-
|
| 147 |
return fused
|
| 148 |
|
| 149 |
|
| 150 |
-
# ──
|
| 151 |
|
| 152 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
"""
|
| 154 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 155 |
|
| 156 |
-
|
|
|
|
|
|
|
| 157 |
|
| 158 |
-
|
| 159 |
-
|
| 160 |
|
| 161 |
-
|
| 162 |
-
|
| 163 |
"""
|
| 164 |
if not fused:
|
| 165 |
return fused
|
| 166 |
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
|
|
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
for item in fused
|
| 175 |
-
# Normalize RRF to [0, 1]
|
| 176 |
-
norm_rrf = (item["rrf_score"] - min_rrf) / rrf_range
|
| 177 |
|
| 178 |
-
|
| 179 |
-
recency = 0.5 # neutral default
|
| 180 |
aid = item["arxiv_id"]
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
year = 2000 + yy if yy < 100 else yy
|
| 187 |
-
paper_ym = year * 12 + mm
|
| 188 |
-
months_ago = max(0, now_ym - paper_ym)
|
| 189 |
-
# Decay: recent papers get ~1.0, 10-year-old papers get ~0.0
|
| 190 |
-
recency = max(0.0, 1.0 - months_ago / 120.0)
|
| 191 |
-
except (ValueError, IndexError):
|
| 192 |
-
pass
|
| 193 |
-
|
| 194 |
-
item["final_score"] = (
|
| 195 |
-
config.SEARCH_SEMANTIC_WEIGHT * norm_rrf
|
| 196 |
-
+ config.SEARCH_RECENCY_WEIGHT * recency
|
| 197 |
-
)
|
| 198 |
|
| 199 |
fused.sort(key=lambda x: x["final_score"], reverse=True)
|
| 200 |
return fused
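# Illustrative sketch only, not part of this diff. It reworks the removed recency
# overlay to show why it could push a seminal 2017 paper below a newer
# look-alike title. The ID parsing, the 0.7/0.3 weights, the fixed "now" date,
# and the placeholder ID "2405.00001" are all assumptions for illustration.

```python
SEMANTIC_WEIGHT = 0.7  # hypothetical stand-in for config.SEARCH_SEMANTIC_WEIGHT
RECENCY_WEIGHT = 0.3   # hypothetical stand-in for config.SEARCH_RECENCY_WEIGHT


def recency_from_arxiv_id(arxiv_id: str, now_year: int, now_month: int) -> float:
    """Linear decay over 120 months, parsed from a new-style YYMM.NNNNN ID."""
    recency = 0.5  # neutral default when the ID cannot be parsed
    try:
        yymm = arxiv_id.split(".")[0]
        yy, mm = int(yymm[:2]), int(yymm[2:4])
        paper_ym = (2000 + yy) * 12 + mm
        months_ago = max(0, (now_year * 12 + now_month) - paper_ym)
        recency = max(0.0, 1.0 - months_ago / 120.0)
    except (ValueError, IndexError):
        pass
    return recency


def final_score(norm_rrf: float, recency: float) -> float:
    return SEMANTIC_WEIGHT * norm_rrf + RECENCY_WEIGHT * recency


# Seminal paper: best fused score (1.0), published 2017-06.
seminal = final_score(1.0, recency_from_arxiv_id("1706.03762", 2025, 1))
# Newer look-alike: weaker fused score (0.8), published 2024-05.
newer = final_score(0.8, recency_from_arxiv_id("2405.00001", 2025, 1))
```

# Under these assumed weights the newer paper ends up with the higher final
# score even though its fused relevance is lower, which matches the behavior
# that motivated removing the recency overlay from search.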
|
|
|
|
| 6 |
2. BGE-M3 encode → dense + sparse
|
| 7 |
3. Parallel search: Qdrant dense + Zilliz sparse
|
| 8 |
4. RRF fusion (K=60)
|
| 9 |
+
5. Title-match boost (exact/substring against Turso titles)
|
| 10 |
6. Return ranked arxiv_ids
|
| 11 |
|
| 12 |
Doc 06 confirms: RRF is correct for search (fusing different retrievers
|
| 13 |
answering the SAME query). This is different from recommendations where
|
| 14 |
quota is correct (fusing different queries for the SAME user).
|
| 15 |
+
|
| 16 |
+
Recency rerank was removed - search relevance should not be biased toward
|
| 17 |
+
newer papers (that is a recommendations concern). For exact-title queries
|
| 18 |
+
like "attention is all you need", the recency overlay was crushing seminal
|
| 19 |
+
older papers below newer "X is all you need" titles.
|
| 20 |
"""
|
| 21 |
from __future__ import annotations
|
| 22 |
|
| 23 |
import asyncio
|
| 24 |
+
import re
|
| 25 |
|
| 26 |
from app import config
|
| 27 |
from app import embed_svc
|
| 28 |
from app import qdrant_svc
|
| 29 |
from app import zilliz_svc
|
| 30 |
from app import groq_svc
|
| 31 |
+
from app import turso_svc
|
| 32 |
|
| 33 |
|
| 34 |
# ── Public API ───────────────────────────────────────────────────────────────
|
|
|
|
| 37 |
query: str,
|
| 38 |
limit: int = 10,
|
| 39 |
use_rewrite: bool = True,
|
| 40 |
+
return_meta: bool = False,
|
| 41 |
+
) -> list[str] | tuple[list[str], dict]:
|
| 42 |
"""
|
| 43 |
Hybrid semantic search: returns a list of arxiv_ids ranked by
|
| 44 |
fused relevance.
|
| 45 |
|
| 46 |
Pipeline:
|
| 47 |
+
rewrite → encode → parallel(dense, sparse) → RRF → title-boost
|
| 48 |
|
| 49 |
Args:
|
| 50 |
query: User's raw search query.
|
| 51 |
limit: Number of results to return.
|
| 52 |
use_rewrite: Whether to attempt LLM query rewriting.
|
| 53 |
+
return_meta: If True, returns a tuple of (arxiv_ids, metadata_dict).
|
| 54 |
|
| 55 |
Returns:
|
| 56 |
list of arxiv_id strings, sorted by final score descending.
|
|
|
|
| 58 |
"""
|
| 59 |
query = query.strip()
|
| 60 |
if not query:
|
| 61 |
+
return ([], {}) if return_meta else []
|
| 62 |
+
|
| 63 |
+
import time
|
| 64 |
+
search_meta = {"rewritten_query": None, "groq_time_ms": 0, "groq_status": "off"}
|
| 65 |
|
| 66 |
# ── Step 1: LLM rewrite (optional, never blocks) ─────────────────────
|
| 67 |
+
rewritten_query = query
|
| 68 |
if use_rewrite:
|
| 69 |
+
start_groq = time.perf_counter()
|
| 70 |
try:
|
| 71 |
+
rewritten_query = await groq_svc.rewrite(query)
|
| 72 |
+
if rewritten_query != query:
|
| 73 |
+
search_meta["rewritten_query"] = rewritten_query
|
| 74 |
+
search_meta["groq_status"] = "rewritten"
|
| 75 |
+
else:
|
| 76 |
+
# Groq returned same query: either skipped by heuristic or LLM kept it
|
| 77 |
+
word_count = len(query.strip().split())
|
| 78 |
+
if word_count <= 2:
|
| 79 |
+
search_meta["groq_status"] = f"skipped (query too short: {word_count} words)"
|
| 80 |
+
elif groq_svc._looks_academic(query):
|
| 81 |
+
search_meta["groq_status"] = "skipped (looks academic)"
|
| 82 |
+
else:
|
| 83 |
+
search_meta["groq_status"] = "called, kept original"
|
| 84 |
except Exception:
|
| 85 |
+
rewritten_query = query # Fallback guaranteed
|
| 86 |
+
search_meta["groq_status"] = "error (fallback)"
|
| 87 |
+
search_meta["groq_time_ms"] = int((time.perf_counter() - start_groq) * 1000)
|
| 88 |
+
|
| 89 |
+
# ── Step 2: BGE-M3 encode the original AND rewrite ──────────────────
|
| 90 |
+
# Why both: The rewriter improves recall on conceptual/casual queries
|
| 91 |
+
# ("when AI makes up fake facts" -> "LLM hallucination ...") but it
|
| 92 |
+
# paraphrases away from literal title wording on known-item queries
|
| 93 |
+
# ("attention is all you need" -> "Transformer self-attention ..."),
|
| 94 |
+
# which can drop the actual famous paper out of the candidate pool
|
| 95 |
+
# entirely. Searching both forms and RRF-fusing all result lists
|
| 96 |
+
# gives us recall on both axes.
|
| 97 |
+
queries_to_encode: list[str] = [query]
|
| 98 |
+
if rewritten_query and rewritten_query != query:
|
| 99 |
+
queries_to_encode.append(rewritten_query)
|
| 100 |
+
|
| 101 |
+
t0_encode = time.perf_counter()
|
| 102 |
+
encoded: list[tuple] = []
|
| 103 |
+
for q in queries_to_encode:
|
| 104 |
+
try:
|
| 105 |
+
d, s = embed_svc.encode_query(q)
|
| 106 |
+
encoded.append((d, s))
|
| 107 |
+
except Exception as e:
|
| 108 |
+
print(f"[hybrid_search] Encoding failed for {q!r}: {e}")
|
| 109 |
+
search_meta["encode_time_ms"] = int((time.perf_counter() - t0_encode) * 1000)
|
| 110 |
|
| 111 |
+
if not encoded:
|
| 112 |
+
return ([], search_meta) if return_meta else []
|
| 113 |
|
| 114 |
# How many candidates to fetch before reranking
|
| 115 |
fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER
|
| 116 |
|
| 117 |
+
# ── Step 3: Parallel dense + sparse search for every encoded form ───
|
| 118 |
+
# Build a flat list of search coroutines: [dense_q1, sparse_q1, dense_q2, sparse_q2, ...]
|
| 119 |
+
t0_retrieval = time.perf_counter()
|
| 120 |
+
tasks = []
|
| 121 |
+
task_labels = []
|
| 122 |
+
for i, (dense_vec, sparse_dict) in enumerate(encoded):
|
| 123 |
+
tasks.append(qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k))
|
| 124 |
+
task_labels.append(f"qdrant_q{i}")
|
| 125 |
+
tasks.append(zilliz_svc.search_sparse(sparse_dict, limit=fetch_k))
|
| 126 |
+
task_labels.append(f"zilliz_q{i}")
|
| 127 |
+
|
| 128 |
+
# Run all retrieval tasks concurrently; only the batch wall time is recorded.
|
| 131 |
+
raw_results = await asyncio.gather(*tasks, return_exceptions=True)
|
| 132 |
+
search_meta["retrieval_time_ms"] = int((time.perf_counter() - t0_retrieval) * 1000)
|
| 133 |
+
search_meta["n_retrieval_tasks"] = len(tasks)
|
| 134 |
+
|
| 135 |
+
valid_result_lists: list[list[dict]] = []
|
| 136 |
+
for r in raw_results:
|
| 137 |
+
if isinstance(r, Exception):
|
| 138 |
+
print(f"[hybrid_search] search task failed: {r}")
|
| 139 |
+
continue
|
| 140 |
+
if r:
|
| 141 |
+
valid_result_lists.append(r)
|
| 142 |
+
|
| 143 |
+
if not valid_result_lists:
|
| 144 |
+
return ([], search_meta) if return_meta else []
|
| 145 |
+
|
| 146 |
+
# ── Step 4: RRF fusion across all result lists ──────────────────────
|
| 147 |
+
t0_rrf = time.perf_counter()
|
| 148 |
+
fused = _rrf_fuse_multi(valid_result_lists, k=config.SEARCH_RRF_K)
|
| 149 |
+
search_meta["rrf_time_ms"] = int((time.perf_counter() - t0_rrf) * 1000)
|
| 150 |
|
| 151 |
if not fused:
|
| 152 |
+
return ([], search_meta) if return_meta else []
|
| 153 |
+
|
| 154 |
+
# ── Step 5: Title-match boost ────────────────────────────────────────
|
| 155 |
+
# Use the user's ORIGINAL query (not the LLM rewrite) for title matching:
|
| 156 |
+
# the user's literal text is what should match a paper title.
|
| 157 |
+
t0_rerank = time.perf_counter()
|
| 158 |
+
ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)
|
| 159 |
+
rerank_total = int((time.perf_counter() - t0_rerank) * 1000)
|
| 160 |
+
search_meta["rerank_time_ms"] = rerank_total
|
| 161 |
+
# Extract sub-timings stashed by _title_match_rerank
|
| 162 |
+
if ranked:
|
| 163 |
+
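turso_boost_ms = next((it.pop("_turso_boost_fetch_ms") for it in ranked if "_turso_boost_fetch_ms" in it), 0)  # the stash can move off index 0 after the final sort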
|
| 164 |
+
search_meta["turso_boost_fetch_ms"] = turso_boost_ms
|
| 165 |
+
search_meta["rerank_compute_ms"] = max(0, rerank_total - turso_boost_ms)
|
| 166 |
|
| 167 |
# ── Step 6: Return top results ───────────────────────────────────────
|
| 168 |
+
final_results = [item["arxiv_id"] for item in ranked[:limit]]
|
| 169 |
+
return (final_results, search_meta) if return_meta else final_results
|
| 170 |
|
| 171 |
|
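The retrieval step relies on `asyncio.gather(..., return_exceptions=True)` so that one failing backend does not sink the whole search. The following is a minimal sketch of that pattern, not the module's code: `fake_dense` and `fake_sparse_fails` are hypothetical stand-ins for the real Qdrant and Zilliz calls, and the paper ID is a placeholder.

```python
import asyncio
import time


async def fake_dense() -> list[dict]:
    # Hypothetical stand-in for a dense search call.
    await asyncio.sleep(0.01)
    return [{"arxiv_id": "paper-1", "score": 0.9}]


async def fake_sparse_fails() -> list[dict]:
    # Hypothetical stand-in for a backend that errors out.
    await asyncio.sleep(0.01)
    raise RuntimeError("backend unavailable")


async def retrieve() -> tuple[list[list[dict]], dict]:
    meta: dict = {}
    t0 = time.perf_counter()
    raw = await asyncio.gather(
        fake_dense(), fake_sparse_fails(), return_exceptions=True
    )
    meta["retrieval_time_ms"] = int((time.perf_counter() - t0) * 1000)
    meta["n_retrieval_tasks"] = len(raw)
    # Exceptions come back as values; keep only the non-empty result lists.
    valid = [r for r in raw if not isinstance(r, BaseException) and r]
    return valid, meta


valid_lists, meta = asyncio.run(retrieve())
```

Because the tasks run concurrently, the recorded time reflects the slowest task rather than the sum of all of them, which is why a single `retrieval_time_ms` is reported alongside the task count.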
| 172 |
# ── RRF fusion ───────────────────────────────────────────────────────────────
|
|
|
|
| 177 |
k: int = 60,
|
| 178 |
) -> list[dict]:
|
| 179 |
"""
|
| 180 |
+
Reciprocal Rank Fusion of two result lists (dense + sparse).
|
| 181 |
|
| 182 |
+
Kept for callers that pass exactly two lists; new code (and the
|
| 183 |
+
hybrid pipeline itself) should call _rrf_fuse_multi instead.
|
| 184 |
+
"""
|
| 185 |
+
return _rrf_fuse_multi([dense_results, sparse_results], k=k)
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
def _rrf_fuse_multi(
|
| 189 |
+
result_lists: list[list[dict]],
|
| 190 |
+
k: int = 60,
|
| 191 |
+
) -> list[dict]:
|
| 192 |
+
"""
|
| 193 |
+
Reciprocal Rank Fusion across N result lists.
|
| 194 |
+
|
| 195 |
+
score[paper] = sum over each list of 1/(k + rank_in_that_list)
|
| 196 |
|
| 197 |
RRF is rank-based, so raw scores from different systems don't need
|
| 198 |
+
normalization. This means we can merge dense, sparse, AND multiple
|
| 199 |
+
encoded query forms (original + LLM-rewritten) without per-source
|
| 200 |
+
score calibration.
|
| 201 |
|
| 202 |
Args:
|
| 203 |
+
result_lists: each list contains {'arxiv_id': str, 'score': ...}
|
| 204 |
+
sorted best-first.
|
| 205 |
+
k: RRF constant (default 60).
|
| 206 |
|
| 207 |
Returns:
|
| 208 |
+
list of {'arxiv_id': str, 'rrf_score': float} sorted by rrf_score desc.
|
| 209 |
"""
|
| 210 |
scores: dict[str, float] = {}
|
| 211 |
+
for results in result_lists:
|
| 212 |
+
for rank, item in enumerate(results, start=1):
|
| 213 |
+
aid = item["arxiv_id"]
|
| 214 |
+
scores[aid] = scores.get(aid, 0.0) + 1.0 / (k + rank)
|
| 215 |
|
| 216 |
fused = [
|
| 217 |
{"arxiv_id": aid, "rrf_score": score}
|
| 218 |
for aid, score in scores.items()
|
| 219 |
]
|
| 220 |
fused.sort(key=lambda x: x["rrf_score"], reverse=True)
|
|
|
|
| 221 |
return fused
|
| 222 |
|
| 223 |
|
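As a worked illustration of the fusion rule `score[paper] = sum of 1/(k + rank)`, the sketch below fuses three small ranked lists with k=60. The function name `rrf_fuse` and the toy IDs are hypothetical; it assumes each list is already sorted best-first.

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1/(k + rank) over every list an ID appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy ID lists standing in for dense, sparse, and rewritten-query results.
dense = ["A", "B", "C"]
sparse = ["B", "A", "D"]
dense_rewrite = ["B", "E", "A"]

fused = rrf_fuse([dense, sparse, dense_rewrite])
```

Here "B" comes out on top because it is first in two lists and second in the third, while "A" appears in all three lists at slightly lower ranks. No raw scores from the retrievers are involved, which is what lets lists from different systems be merged without calibration.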
| 224 |
+
# ── Title-match + citation-popularity rerank ─────────────────────────────────
|
| 225 |
+
|
| 226 |
+
# Boost magnitudes are calibrated against `max_rrf` so any meaningful title
|
| 227 |
+
# match outranks the best non-matching candidate:
|
| 228 |
+
# final = rrf_score + max_rrf * (title_boost + citation_boost)
|
| 229 |
+
# With boost=2.0 (exact title), the worst exact match (>= 2.0 * max_rrf) beats the best
|
| 230 |
+
# non-match (<= 1.2 * max_rrf with the citation cap). boost=1.0 gives the same guarantee only when the match's citation boost is not lower.
|
| 231 |
+
_BOOST_EXACT_TITLE = 2.0 # query == title (after normalize)
|
| 232 |
+
_BOOST_SUBSTRING_TITLE = 1.0 # query is contiguous substring of title
|
| 233 |
+
_BOOST_HIGH_COVERAGE = 1.0 # >= 80% of query words found in title
|
| 234 |
+
_BOOST_MED_COVERAGE = 0.5 # >= 50% of query words found in title
|
| 235 |
+
|
| 236 |
+
# Citation-popularity boost: surfaces landmark papers even when keyword
|
| 237 |
+
# overlap is low. Without this, "how do transformers work in NLP" returns
|
| 238 |
+
# niche papers instead of "Attention Is All You Need" because RRF favors
|
| 239 |
+
# papers whose titles contain more query keywords.
|
| 240 |
+
#
|
| 241 |
+
# Uses log10(citations) scaled to a cap:
|
| 242 |
+
# 0 citations -> 0.0 boost
|
| 243 |
+
# 10 citations -> 0.03
|
| 244 |
+
# 100 citations -> 0.07
|
| 245 |
+
# 1K citations -> 0.10
|
| 246 |
+
# 10K citations -> 0.13
|
| 247 |
+
# 100K citations-> 0.17 (near cap)
|
| 248 |
+
#
|
| 249 |
+
# Cap is deliberately small (0.2 * max_rrf) so it NUDGES but doesn't
|
| 250 |
+
# override title-match or strong semantic signal. A 100K-citation paper
|
| 251 |
+
# still loses to a perfect title match.
|
| 252 |
+
import math
|
| 253 |
+
_CITATION_BOOST_CAP = 0.2 # max boost from citations alone
|
| 254 |
+
_CITATION_LOG_DIVISOR = 30.0 # boost = log10(citations + 1) / 30; the 0.2 cap is reached near 10^6 citations
|
| 255 |
+
|
| 256 |
+
# Drop any token shorter than this from coverage calculation; single-letter
|
| 257 |
+
# tokens ("a", "i") and tiny stop-likes inflate spurious matches.
|
| 258 |
+
_MIN_COVERAGE_TOKEN_LEN = 2
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
def _normalize_for_match(text: str) -> str:
|
| 262 |
+
"""Lowercase, collapse non-alnum to single spaces, strip."""
|
| 263 |
+
return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
|
| 264 |
+
|
| 265 |
+
|
| 266 |
+
def _stem_plural(w: str) -> str:
|
| 267 |
+
"""Trim a single trailing 's' on tokens longer than 3 chars.
|
| 268 |
+
|
| 269 |
+
Crude but cheap. Catches the 'space' vs 'spaces' problem in the
|
| 270 |
+
Mamba paper title without dragging in a real stemmer dependency.
|
| 271 |
+
"""
|
| 272 |
+
return w[:-1] if len(w) > 3 and w.endswith("s") else w
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
def _word_set(text: str) -> set[str]:
|
| 276 |
+
return {
|
| 277 |
+
_stem_plural(w) for w in text.split()
|
| 278 |
+
if len(w) >= _MIN_COVERAGE_TOKEN_LEN
|
| 279 |
+
}
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
def _compute_title_boost(query_norm: str, title_raw: str) -> float:
|
| 283 |
+
"""Decide how much to boost a candidate based on title overlap.
|
| 284 |
+
|
| 285 |
+
Order of checks (strongest signal first):
|
| 286 |
+
1. Exact match after normalization -> 2.0
|
| 287 |
+
2. Query is contiguous substring of normalized title -> 1.0
|
| 288 |
+
(rescues "chain of thought prompting" vs
|
| 289 |
+
"Chain-of-Thought Prompting Elicits Reasoning..."; punctuation
|
| 290 |
+
in title was the only thing blocking the old binary substring check)
|
| 291 |
+
3. Coverage: fraction of query word-stems found in title (or as
|
| 292 |
+
substring of compact title, which catches "multilingual" appearing
|
| 293 |
+
in "Multi-Lingual" once spaces are stripped).
|
| 294 |
+
>= 0.8 -> _BOOST_HIGH_COVERAGE * coverage
|
| 295 |
+
>= 0.5 -> _BOOST_MED_COVERAGE * coverage
|
| 296 |
+
otherwise -> 0
|
| 297 |
+
"""
|
| 298 |
+
if not query_norm or not title_raw:
|
| 299 |
+
return 0.0
|
| 300 |
+
|
| 301 |
+
title_norm = _normalize_for_match(title_raw)
|
| 302 |
+
if not title_norm:
|
| 303 |
+
return 0.0
|
| 304 |
+
|
| 305 |
+
if query_norm == title_norm:
|
| 306 |
+
return _BOOST_EXACT_TITLE
|
| 307 |
+
if query_norm in title_norm:
|
| 308 |
+
return _BOOST_SUBSTRING_TITLE
|
| 309 |
+
|
| 310 |
+
q_words = _word_set(query_norm)
|
| 311 |
+
if not q_words:
|
| 312 |
+
return 0.0
|
| 313 |
+
|
| 314 |
+
t_words = _word_set(title_norm)
|
| 315 |
+
title_compact = title_norm.replace(" ", "")
|
| 316 |
+
|
| 317 |
+
matches = 0
|
| 318 |
+
for w in q_words:
|
| 319 |
+
if w in t_words:
|
| 320 |
+
matches += 1
|
| 321 |
+
elif len(w) >= 4 and w in title_compact:
|
| 322 |
+
# Catches "multilingual" appearing within "multi lingual"
|
| 323 |
+
# once whitespace is stripped from the title.
|
| 324 |
+
matches += 1
|
| 325 |
+
|
| 326 |
+
coverage = matches / len(q_words)
|
| 327 |
+
if coverage >= 0.8:
|
| 328 |
+
return _BOOST_HIGH_COVERAGE * coverage
|
| 329 |
+
if coverage >= 0.5:
|
| 330 |
+
return _BOOST_MED_COVERAGE * coverage
|
| 331 |
+
return 0.0
|
| 332 |
+
|
| 333 |
+
|
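The normalization and compact-title checks described in the docstring can be tried in isolation. This is a small sketch under the same regex rule; `normalize` is a hypothetical stand-in for `_normalize_for_match`, and "Multi-Lingual Retrieval" is a made-up title used only for illustration.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, collapse every run of non-alphanumerics to one space, strip.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


query = normalize("chain of thought prompting")
title = normalize("Chain-of-Thought Prompting Elicits Reasoning...")

# Hyphens become spaces, so the query is now a contiguous substring.
is_substring = query in title

# Stripping spaces from a title lets a one-word query token match a
# hyphenated spelling of the same word.
compact_title = normalize("Multi-Lingual Retrieval").replace(" ", "")
compact_hit = "multilingual" in compact_title
```

Both checks succeed here, which corresponds to the substring tier and the coverage fallback respectively.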
| 334 |
+
def _compute_citation_boost(citation_count: int) -> float:
|
| 335 |
+
"""Log-scaled citation boost, capped at _CITATION_BOOST_CAP.
|
| 336 |
+
|
| 337 |
+
The idea: a paper with 100K citations (like "Attention Is All You Need")
|
| 338 |
+
gets a small but meaningful nudge upward even when it has zero keyword
|
| 339 |
+
overlap with a beginner's query like "how do transformers work".
|
| 340 |
+
|
| 341 |
+
The boost is small enough that a strong title match always wins, and
|
| 342 |
+
a strong semantic RRF score always wins. But when two papers have
|
| 343 |
+
similar RRF scores and neither has a title match, the one with 100K
|
| 344 |
+
citations beats the one with 3 citations.
|
| 345 |
+
|
| 346 |
+
Scale (log10-based):
|
| 347 |
+
citations=0 -> 0.000
|
| 348 |
+
citations=10 -> 0.035
|
| 349 |
+
citations=100 -> 0.067
|
| 350 |
+
citations=1000 -> 0.100
|
| 351 |
+
citations=10000 -> 0.133
|
| 352 |
+
citations=100000-> 0.167
|
| 353 |
+
"""
|
| 354 |
+
if citation_count <= 0:
|
| 355 |
+
return 0.0
|
| 356 |
+
raw = math.log10(citation_count + 1) / _CITATION_LOG_DIVISOR
|
| 357 |
+
return min(raw, _CITATION_BOOST_CAP)
|
| 358 |
|
| 359 |
+
|
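To make the citation scale concrete, the sketch below re-evaluates `log10(citations + 1) / 30` with a 0.2 cap. The names `CAP`, `DIVISOR`, and `citation_boost` are stand-ins that mirror the constants and helper in the diff; this is an illustration, not the module itself.

```python
import math

CAP = 0.2        # mirrors _CITATION_BOOST_CAP
DIVISOR = 30.0   # mirrors _CITATION_LOG_DIVISOR


def citation_boost(citation_count: int) -> float:
    if citation_count <= 0:
        return 0.0
    return min(math.log10(citation_count + 1) / DIVISOR, CAP)


# Print the boost at each order of magnitude.
for n in (0, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} citations -> {citation_boost(n):.3f}")
```

With a divisor of 30, the cap corresponds to six log10 units, so it only takes effect at roughly one million citations; a 100K-citation paper still sits below it.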
| 360 |
+
async def _title_match_rerank(
|
| 361 |
+
fused: list[dict],
|
| 362 |
+
user_query: str,
|
| 363 |
+
top_n_for_boost: int = 50,
|
| 364 |
+
) -> list[dict]:
|
| 365 |
"""
|
| 366 |
+
Boost candidates by title overlap + citation popularity.
|
| 367 |
+
|
| 368 |
+
Two signals, both based on metadata we already fetch from Turso:
|
| 369 |
+
|
| 370 |
+
1. Title boost (strong): exact/substring/coverage match between the
|
| 371 |
+
user's ORIGINAL query and paper titles. Rescues known-item queries.
|
| 372 |
|
| 373 |
+
2. Citation boost (gentle): log-scaled citation count, capped at 0.2x
|
| 374 |
+
max_rrf. Rescues landmark papers for beginner queries where keyword
|
| 375 |
+
overlap is low but the paper is obviously important.
|
| 376 |
|
| 377 |
+
The final score is:
|
| 378 |
+
final = rrf_score + max_rrf * (title_boost + citation_boost)
|
| 379 |
|
| 380 |
+
Safe under partial Turso failure: papers with missing metadata get
|
| 381 |
+
boost=0 and rank by RRF alone.
|
| 382 |
"""
|
| 383 |
if not fused:
|
| 384 |
return fused
|
| 385 |
|
| 386 |
+
q_norm = _normalize_for_match(user_query)
|
| 387 |
+
if not q_norm:
|
| 388 |
+
for item in fused:
|
| 389 |
+
item["final_score"] = item["rrf_score"]
|
| 390 |
+
return fused
|
| 391 |
|
| 392 |
+
candidate_ids = [item["arxiv_id"] for item in fused[:top_n_for_boost]]
|
| 393 |
+
titles: dict[str, str] = {}
|
| 394 |
+
citations: dict[str, int] = {}
|
| 395 |
+
import time as _time
|
| 396 |
+
_t0_turso_boost = _time.perf_counter()
|
| 397 |
+
try:
|
| 398 |
+
meta = await turso_svc.fetch_metadata_batch(candidate_ids)
|
| 399 |
+
titles = {aid: (m.get("title") or "") for aid, m in meta.items()}
|
| 400 |
+
citations = {aid: (m.get("citation_count") or 0) for aid, m in meta.items()}
|
| 401 |
+
except Exception as e:
|
| 402 |
+
print(f"[hybrid_search] Metadata fetch for boost failed: {e}")
|
| 403 |
+
for item in fused:
|
| 404 |
+
item["final_score"] = item["rrf_score"]
|
| 405 |
+
return fused
|
| 406 |
+
_turso_boost_ms = int((_time.perf_counter() - _t0_turso_boost) * 1000)
|
| 407 |
+
# Stash on first item so the caller can extract it
|
| 408 |
+
if fused:
|
| 409 |
+
fused[0]["_turso_boost_fetch_ms"] = _turso_boost_ms
|
| 410 |
|
| 411 |
+
max_rrf = max(item["rrf_score"] for item in fused)
|
| 412 |
|
| 413 |
+
for item in fused:
|
|
|
|
| 414 |
aid = item["arxiv_id"]
|
| 415 |
+
t_boost = _compute_title_boost(q_norm, titles.get(aid, ""))
|
| 416 |
+
c_boost = _compute_citation_boost(citations.get(aid, 0))
|
| 417 |
+
item["title_boost"] = t_boost
|
| 418 |
+
item["citation_boost"] = c_boost
|
| 419 |
+
item["final_score"] = item["rrf_score"] + max_rrf * (t_boost + c_boost)
|
| 420 |
|
| 421 |
fused.sort(key=lambda x: x["final_score"], reverse=True)
|
| 422 |
return fused
|
|
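A small worked example of the scoring rule `final = rrf_score + max_rrf * (title_boost + citation_boost)` may help when reading the rerank. All RRF scores and boosts below are made-up illustrative values, and `final_score` is a hypothetical helper rather than a function from the module.

```python
def final_score(
    rrf: float, max_rrf: float, title_boost: float, citation_boost: float
) -> float:
    return rrf + max_rrf * (title_boost + citation_boost)


max_rrf = 0.05  # illustrative best RRF score in the candidate pool

# A low-ranked candidate whose title exactly matches the query.
exact_match = final_score(0.01, max_rrf, title_boost=2.0, citation_boost=0.0)

# The top RRF candidate with no title overlap but the maximum citation boost.
popular_non_match = final_score(0.05, max_rrf, title_boost=0.0, citation_boost=0.2)

# A weakly retrieved substring match with no citations.
substring_match = final_score(0.005, max_rrf, title_boost=1.0, citation_boost=0.0)
```

Under these numbers the exact match clearly wins, while the weak substring match lands just below the heavily cited top candidate. That second case is worth keeping in mind when tuning the substring boost against the citation cap.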
@@ -10,6 +10,7 @@ The collection is 'arxiv_bgem3_dense' with integer point IDs and 1024-dim BGE-M3
|
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
import asyncio
|
|
|
|
| 13 |
from functools import lru_cache
|
| 14 |
|
| 15 |
from qdrant_client import QdrantClient
|
|
@@ -166,21 +167,75 @@ def _run_recommend(
|
|
| 166 |
|
| 167 |
|
| 168 |
# ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
|
| 169 |
|
| 170 |
async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
|
| 171 |
"""
|
| 172 |
-
Fetch
|
| 173 |
Returns {arxiv_id: vector_list} for papers found.
|
| 174 |
|
| 175 |
-
|
| 176 |
-
|
| 177 |
"""
|
| 178 |
if not arxiv_ids:
|
| 179 |
return {}
|
| 180 |
|
| 181 |
-
|
| 182 |
if not id_map:
|
| 183 |
-
return
|
| 184 |
|
| 185 |
point_ids = list(id_map.values())
|
| 186 |
arxiv_by_point = {v: k for k, v in id_map.items()}
|
|
@@ -192,9 +247,8 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
|
|
| 192 |
)
|
| 193 |
except Exception as e:
|
| 194 |
print(f"[qdrant_svc] get_paper_vectors error: {e}")
|
| 195 |
-
return
|
| 196 |
|
| 197 |
-
result = {}
|
| 198 |
for p in points:
|
| 199 |
aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
|
| 200 |
if aid and p.vector:
|
|
@@ -202,6 +256,7 @@ async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
|
|
| 202 |
vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
|
| 203 |
if isinstance(vec, list):
|
| 204 |
result[aid] = vec
|
|
|
|
| 205 |
return result
|
| 206 |
|
| 207 |
|
|
@@ -250,6 +305,7 @@ async def search_by_vector_with_scores(
|
|
| 250 |
query_vector: list[float],
|
| 251 |
limit: int = 20,
|
| 252 |
exclude_ids: set[str] | None = None,
|
|
|
|
| 253 |
) -> list[dict]:
|
| 254 |
"""
|
| 255 |
Vector search returning both arxiv_ids AND cosine scores.
|
|
@@ -257,29 +313,43 @@ async def search_by_vector_with_scores(
|
|
| 257 |
Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
|
| 258 |
score desc, excluding any in exclude_ids.
|
| 259 |
|
| 260 |
-
|
| 261 |
-
|
|
|
|
|
|
|
| 262 |
"""
|
| 263 |
loop = asyncio.get_event_loop()
|
| 264 |
try:
|
| 265 |
results = await loop.run_in_executor(
|
| 266 |
None, _run_vector_search, query_vector,
|
| 267 |
(limit * 2) if exclude_ids else limit,
|
|
|
|
| 268 |
)
|
| 269 |
except Exception as e:
|
| 270 |
print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
|
| 271 |
return []
|
| 272 |
|
| 273 |
exclude = exclude_ids or set()
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
if
|
| 278 |
-
|
| 279 |
-
|
| 280 |
|
| 281 |
|
| 282 |
-
def _run_vector_search(
|
|
|
|
|
|
|
| 283 |
"""Sync helper: nearest-neighbour search by vector."""
|
| 284 |
client = _client()
|
| 285 |
result = client.query_points(
|
|
@@ -287,7 +357,7 @@ def _run_vector_search(query_vector: list[float], limit: int) -> list:
|
|
| 287 |
query=query_vector,
|
| 288 |
limit=limit,
|
| 289 |
with_payload=True,
|
| 290 |
-
with_vectors=
|
| 291 |
)
|
| 292 |
return result.points
|
| 293 |
|
|
|
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
import asyncio
|
| 13 |
+
from collections import OrderedDict
|
| 14 |
from functools import lru_cache
|
| 15 |
|
| 16 |
from qdrant_client import QdrantClient
|
|
|
|
| 167 |
|
| 168 |
|
| 169 |
# ── Phase 2a: Vector retrieval + vector search ───────────────────────────────
|
| 170 |
+
#
|
| 171 |
+
# In-process LRU vector cache.
|
| 172 |
+
# Profiling showed Qdrant Cloud free tier reads candidate vectors from
|
| 173 |
+
# disk on every retrieve(), which dominated Tier 1 latency (9-18s for
|
| 174 |
+
# 120 vectors). Vectors are 1024 floats = 4KB each as float32, so a 25K cap is ~100MB
|
| 175 |
+
# of raw data; stored as Python lists of floats the footprint is several times larger. Same papers appear across users' candidate sets (Zipf),
|
| 176 |
+
# so steady-state hit rate is high.
|
| 177 |
+
#
|
| 178 |
+
# Vectors don't change once uploaded, so no TTL.
|
| 179 |
+
|
| 180 |
+
_VECTOR_CACHE: "OrderedDict[str, list[float]]" = OrderedDict()
|
| 181 |
+
_VECTOR_CACHE_MAX = 25_000
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
def _vec_cache_get(arxiv_id: str) -> list[float] | None:
|
| 185 |
+
val = _VECTOR_CACHE.get(arxiv_id)
|
| 186 |
+
if val is not None:
|
| 187 |
+
_VECTOR_CACHE.move_to_end(arxiv_id)
|
| 188 |
+
return val
|
| 189 |
+
|
| 190 |
+
|
| 191 |
+
def _vec_cache_put(arxiv_id: str, vec: list[float]) -> None:
|
| 192 |
+
if arxiv_id in _VECTOR_CACHE:
|
| 193 |
+
_VECTOR_CACHE.move_to_end(arxiv_id)
|
| 194 |
+
_VECTOR_CACHE[arxiv_id] = vec
|
| 195 |
+
return
|
| 196 |
+
_VECTOR_CACHE[arxiv_id] = vec
|
| 197 |
+
if len(_VECTOR_CACHE) > _VECTOR_CACHE_MAX:
|
| 198 |
+
_VECTOR_CACHE.popitem(last=False)
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
def vector_cache_stats() -> dict:
|
| 202 |
+
return {"size": len(_VECTOR_CACHE), "max": _VECTOR_CACHE_MAX}
|
| 203 |
+
|
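The cache helpers follow the usual `OrderedDict` LRU pattern. A tiny self-contained sketch with a cap of 2 shows the eviction order; the class name `TinyLRU` is hypothetical and not part of `qdrant_svc`.

```python
from __future__ import annotations

from collections import OrderedDict


class TinyLRU:
    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: OrderedDict[str, list[float]] = OrderedDict()

    def get(self, key: str) -> list[float] | None:
        val = self._data.get(key)
        if val is not None:
            self._data.move_to_end(key)  # mark as most recently used
        return val

    def put(self, key: str, val: list[float]) -> None:
        self._data[key] = val
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used

    def keys(self) -> list[str]:
        return list(self._data)


cache = TinyLRU(max_size=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")          # "a" becomes most recent, so "b" is now the oldest
cache.put("c", [0.3])   # over capacity: "b" is evicted
```

Reads refresh an entry's position, so frequently requested vectors survive while one-off candidates fall out first.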
| 204 |
|
| 205 |
async def get_paper_vectors(arxiv_ids: list[str]) -> dict[str, list[float]]:
|
| 206 |
"""
|
| 207 |
+
Fetch BGE-M3 embedding vectors for papers from Qdrant.
|
| 208 |
Returns {arxiv_id: vector_list} for papers found.
|
| 209 |
|
| 210 |
+
Cached in-process by arxiv_id; only un-cached IDs hit Qdrant. The
|
| 211 |
+
Qdrant retrieve() that pulls the actual stored vectors is the
|
| 212 |
+
single most expensive call in the pipeline (BQ -> disk read), so
|
| 213 |
+
absorbing repeats here is a big win.
|
| 214 |
+
|
| 215 |
+
Used by:
|
| 216 |
+
- EWMA profile updates on save (events.py)
|
| 217 |
+
- Cluster medoid embedding load (recommendations.py)
|
| 218 |
+
- Tier 1 candidate vector fetch (recommendations.py, ~120 IDs)
|
| 219 |
"""
|
| 220 |
if not arxiv_ids:
|
| 221 |
return {}
|
| 222 |
|
| 223 |
+
# Cache check first: pull anything we already know.
|
| 224 |
+
result: dict[str, list[float]] = {}
|
| 225 |
+
misses: list[str] = []
|
| 226 |
+
for aid in arxiv_ids:
|
| 227 |
+
cached = _vec_cache_get(aid)
|
| 228 |
+
if cached is not None:
|
| 229 |
+
result[aid] = cached
|
| 230 |
+
else:
|
| 231 |
+
misses.append(aid)
|
| 232 |
+
|
| 233 |
+
if not misses:
|
| 234 |
+
return result
|
| 235 |
+
|
| 236 |
+
id_map = await lookup_qdrant_ids(misses)
|
| 237 |
if not id_map:
|
| 238 |
+
return result
|
| 239 |
|
| 240 |
point_ids = list(id_map.values())
|
| 241 |
arxiv_by_point = {v: k for k, v in id_map.items()}
|
|
|
|
| 247 |
)
|
| 248 |
except Exception as e:
|
| 249 |
print(f"[qdrant_svc] get_paper_vectors error: {e}")
|
| 250 |
+
return result
|
| 251 |
|
|
|
|
| 252 |
for p in points:
|
| 253 |
aid = p.payload.get("arxiv_id") or arxiv_by_point.get(p.id)
|
| 254 |
if aid and p.vector:
|
|
|
|
| 256 |
vec = p.vector if isinstance(p.vector, list) else p.vector.get("dense", p.vector)
|
| 257 |
if isinstance(vec, list):
|
| 258 |
result[aid] = vec
|
| 259 |
+
_vec_cache_put(aid, vec)
|
| 260 |
return result
|
| 261 |
|
| 262 |
|
|
|
|
| 305 |
query_vector: list[float],
|
| 306 |
limit: int = 20,
|
| 307 |
exclude_ids: set[str] | None = None,
|
| 308 |
+
with_vectors: bool = False,
|
| 309 |
) -> list[dict]:
|
| 310 |
"""
|
| 311 |
Vector search returning both arxiv_ids AND cosine scores.
|
|
|
|
| 313 |
Returns list of {'arxiv_id': str, 'score': float} dicts sorted by
|
| 314 |
score desc, excluding any in exclude_ids.
|
| 315 |
|
| 316 |
+
If `with_vectors=True`, each dict also has a 'vector' key holding the
|
| 317 |
+
1024-dim BGE-M3 embedding. Returning vectors in the search response
|
| 318 |
+
avoids a separate `client.retrieve()` round-trip later; that retrieve
|
| 319 |
+
was ~9-18s on cold candidates because BQ rescore reads from disk.
|
| 320 |
"""
|
| 321 |
loop = asyncio.get_event_loop()
|
| 322 |
try:
|
| 323 |
results = await loop.run_in_executor(
|
| 324 |
None, _run_vector_search, query_vector,
|
| 325 |
(limit * 2) if exclude_ids else limit,
|
| 326 |
+
with_vectors,
|
| 327 |
)
|
| 328 |
except Exception as e:
|
| 329 |
print(f"[qdrant_svc] search_by_vector_with_scores error: {e}")
|
| 330 |
return []
|
| 331 |
|
| 332 |
exclude = exclude_ids or set()
|
| 333 |
+
out: list[dict] = []
|
| 334 |
+
for r in results:
|
| 335 |
+
aid = r.payload.get("arxiv_id")
|
| 336 |
+
if not aid or aid in exclude:
|
| 337 |
+
continue
|
| 338 |
+
item = {"arxiv_id": aid, "score": float(r.score)}
|
| 339 |
+
if with_vectors and r.vector:
|
| 340 |
+
# Named vectors return a dict; unnamed returns a list.
|
| 341 |
+
vec = r.vector if isinstance(r.vector, list) else r.vector.get("dense", r.vector)
|
| 342 |
+
if isinstance(vec, list):
|
| 343 |
+
item["vector"] = vec
|
| 344 |
+
out.append(item)
|
| 345 |
+
if len(out) >= limit:
|
| 346 |
+
break
|
| 347 |
+
return out
|
| 348 |
|
| 349 |
|
| 350 |
+
def _run_vector_search(
|
| 351 |
+
query_vector: list[float], limit: int, with_vectors: bool = False,
|
| 352 |
+
) -> list:
|
| 353 |
"""Sync helper: nearest-neighbour search by vector."""
|
| 354 |
client = _client()
|
| 355 |
result = client.query_points(
|
|
|
|
| 357 |
query=query_vector,
|
| 358 |
limit=limit,
|
| 359 |
with_payload=True,
|
| 360 |
+
with_vectors=with_vectors,
|
| 361 |
)
|
| 362 |
return result.points
|
| 363 |
|
|
@@ -17,6 +17,7 @@ Reference: Research-MultiInterest_Recommender_Architecture.md §2
|
|
| 17 |
from __future__ import annotations
|
| 18 |
|
| 19 |
import json
|
|
|
|
| 20 |
from dataclasses import dataclass, field
|
| 21 |
import numpy as np
|
| 22 |
from scipy.cluster.hierarchy import ward, fcluster
|
|
@@ -34,6 +35,14 @@ WARD_DISTANCE_THRESHOLD = 1.5
|
|
| 34 |
MIN_CLUSTERS = 1
|
| 35 |
MAX_CLUSTERS = 7 # RFC: PinnerSage uses 3-5 for typical users, cap at 7
|
| 36 |
|
| 37 |
# Minimum saved papers before clustering is meaningful
|
| 38 |
MIN_PAPERS_FOR_CLUSTERING = 5
|
| 39 |
|
|
@@ -132,14 +141,36 @@ def compute_clusters(
|
|
| 132 |
# Cut the dendrogram at the adaptive threshold
|
| 133 |
labels = fcluster(linkage, t=threshold, criterion="distance")
|
| 134 |
|
| 135 |
-
# Clamp cluster count
|
| 136 |
unique_labels = np.unique(labels)
|
| 137 |
n_clusters = len(unique_labels)
|
| 138 |
|
| 139 |
-
# If too many clusters, re-cut with a maxclust constraint
|
| 140 |
if n_clusters > MAX_CLUSTERS:
|
| 141 |
labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
|
| 142 |
unique_labels = np.unique(labels)
|
| 143 |
|
| 144 |
# Compute recency weights (position-based: most recent = highest weight)
|
| 145 |
recency_weights = np.array([
|
|
@@ -184,6 +215,49 @@ def _find_medoid(embeddings: np.ndarray, centroid: np.ndarray) -> int:
|
|
| 184 |
return int(np.argmin(distances))
|
| 185 |
|
| 186 |
|
| 187 |
# ── Cluster ID stabilisation (Phase 4.2) ─────────────────────────────────────
|
| 188 |
|
| 189 |
# Hungarian matches below this cosine similarity are rejected as "unrelated".
|
|
|
|
| 17 |
from __future__ import annotations
|
| 18 |
|
| 19 |
import json
|
| 20 |
+
import math
|
| 21 |
from dataclasses import dataclass, field
|
| 22 |
import numpy as np
|
| 23 |
from scipy.cluster.hierarchy import ward, fcluster
|
|
|
|
| 35 |
MIN_CLUSTERS = 1
|
| 36 |
MAX_CLUSTERS = 7 # RFC: PinnerSage uses 3-5 for typical users, cap at 7
|
| 37 |
|
| 38 |
+
# Average papers per cluster floor, used to derive a soft cap on K from N.
|
| 39 |
+
# K_soft_cap = max(MIN_CLUSTERS, ceil(N / AVG_CLUSTER_SIZE_FLOOR)).
|
| 40 |
+
# Set to 4: at N=5 -> K_max=2, at N=10 -> K_max=3, at N=28 -> K_max=7.
|
| 41 |
+
# Without this, gap-based thresholding over-splits at small N: 5 same-domain
|
| 42 |
+
# papers were producing K=4 (3 singletons), which then got over-weighted by
|
| 43 |
+
# the quota floor of 3 slots per cluster.
|
| 44 |
+
AVG_CLUSTER_SIZE_FLOOR = 4
|
| 45 |
+
|
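The soft-cap arithmetic can be checked directly. The sketch below assumes the clamp `max(MIN_CLUSTERS, min(MAX_CLUSTERS, ceil(N / AVG_CLUSTER_SIZE_FLOOR)))` with the constant values shown in the diff; the helper name `soft_cap` is hypothetical.

```python
import math

MIN_CLUSTERS = 1
MAX_CLUSTERS = 7
AVG_CLUSTER_SIZE_FLOOR = 4


def soft_cap(n_papers: int) -> int:
    """Upper bound on K: about one cluster per AVG_CLUSTER_SIZE_FLOOR papers."""
    per_floor = math.ceil(n_papers / AVG_CLUSTER_SIZE_FLOOR)
    return max(MIN_CLUSTERS, min(MAX_CLUSTERS, per_floor))


caps = {n: soft_cap(n) for n in (5, 10, 28, 100)}
```

Small libraries are held to two or three clusters, and the hard cap of 7 takes over from N=28 upward.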
| 46 |
# Minimum saved papers before clustering is meaningful
|
| 47 |
MIN_PAPERS_FOR_CLUSTERING = 5
|
| 48 |
|
|
|
|
| 141 |
# Cut the dendrogram at the adaptive threshold
|
| 142 |
labels = fcluster(linkage, t=threshold, criterion="distance")
|
| 143 |
|
| 144 |
+
# Clamp cluster count.
|
| 145 |
+
# Two layers:
|
| 146 |
+
# 1. Hard cap: never exceed MAX_CLUSTERS (=7) regardless of N.
|
| 147 |
+
# 2. Soft cap: allow at most ceil(N / AVG_CLUSTER_SIZE_FLOOR) clusters.
|
| 148 |
+
# This prevents the gap-detection from over-splitting small N
|
| 149 |
+
# (e.g. 5 same-domain saves were producing K=4 with 3 singletons,
|
| 150 |
+
# which then got over-weighted by the quota floor of 3 slots).
|
| 151 |
+
soft_cap = max(
|
| 152 |
+
MIN_CLUSTERS,
|
| 153 |
+
min(MAX_CLUSTERS, math.ceil(n / AVG_CLUSTER_SIZE_FLOOR)),
|
| 154 |
+
)
|
| 155 |
+
|
| 156 |
unique_labels = np.unique(labels)
|
| 157 |
n_clusters = len(unique_labels)
|
| 158 |
|
|
|
|
| 159 |
if n_clusters > MAX_CLUSTERS:
|
| 160 |
labels = fcluster(linkage, t=MAX_CLUSTERS, criterion="maxclust")
|
| 161 |
unique_labels = np.unique(labels)
|
| 162 |
+
n_clusters = len(unique_labels)
|
| 163 |
+
|
| 164 |
+
if n_clusters > soft_cap:
|
| 165 |
+
labels = fcluster(linkage, t=soft_cap, criterion="maxclust")
|
| 166 |
+
unique_labels = np.unique(labels)
|
| 167 |
+
n_clusters = len(unique_labels)
|
| 168 |
+
|
| 169 |
+
# Final safety net: merge any remaining singleton clusters into their
|
| 170 |
+
# nearest non-singleton neighbour. The soft cap usually eliminates them,
|
| 171 |
+
# but a 6-1-1-1 distribution after maxclust=4 would still leave 3.
|
| 172 |
+
labels = _merge_singletons(labels, embeddings)
|
| 173 |
+
unique_labels = np.unique(labels)
|
| 174 |
|
| 175 |
# Compute recency weights (position-based: most recent = highest weight)
|
| 176 |
recency_weights = np.array([
|
|
|
|
| 215 |
return int(np.argmin(distances))
|
| 216 |
|
| 217 |
|
| 218 |
+
def _merge_singletons(labels: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
|
| 219 |
+
"""Merge singleton clusters into their nearest non-singleton cluster.
|
| 220 |
+
|
| 221 |
+
Why: Ward's gap-based threshold can over-split at small N, producing
|
| 222 |
+
1-paper clusters that get over-weighted by the quota floor (3 slots
|
| 223 |
+
per cluster regardless of importance). Merging singletons into the
|
| 224 |
+
nearest non-singleton cluster preserves the multi-interest signal
|
| 225 |
+
where it's real and removes spurious singletons where it's noise.
|
| 226 |
+
|
| 227 |
+
Edge case: if every cluster is a singleton (all papers maximally
|
| 228 |
+
distant), we leave the labels alone; collapsing them would erase
|
| 229 |
+
a genuine multi-interest signal.
|
| 230 |
+
"""
|
| 231 |
+
unique_labels, counts = np.unique(labels, return_counts=True)
|
| 232 |
+
singleton_labels = unique_labels[counts == 1]
|
| 233 |
+
non_singleton_labels = unique_labels[counts > 1]
|
| 234 |
+
|
| 235 |
+
if len(singleton_labels) == 0:
|
| 236 |
+
return labels # nothing to merge
|
| 237 |
+
if len(non_singleton_labels) == 0:
|
| 238 |
+
return labels # all singletons: keep as is
|
| 239 |
+
|
| 240 |
+
centroids: dict[int, np.ndarray] = {}
|
| 241 |
+
for ns_label in non_singleton_labels:
|
| 242 |
+
ns_mask = labels == ns_label
|
| 243 |
+
centroids[int(ns_label)] = embeddings[ns_mask].mean(axis=0)
|
| 244 |
+
|
| 245 |
+
new_labels = labels.copy()
|
| 246 |
+
for s_label in singleton_labels:
|
| 247 |
+
s_idx = int(np.where(labels == s_label)[0][0])
|
| 248 |
+
s_emb = embeddings[s_idx]
|
| 249 |
+
best_label = int(s_label)
|
| 250 |
+
best_dist = float("inf")
|
| 251 |
+
for ns_label, centroid in centroids.items():
|
| 252 |
+
d = float(np.linalg.norm(s_emb - centroid))
|
| 253 |
+
if d < best_dist:
|
| 254 |
+
best_dist = d
|
| 255 |
+
best_label = ns_label
|
| 256 |
+
new_labels[s_idx] = best_label
|
| 257 |
+
|
| 258 |
+
return new_labels
|
| 259 |
+
|
| 260 |
+
|
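The singleton merge added above can be restated without NumPy so the behaviour is easy to check by hand. This is a minimal sketch, not the code from this commit: `merge_singletons_demo` and the toy 2-D points are illustrative assumptions, and `math.dist` (plain Euclidean distance) stands in for `np.linalg.norm`.

```python
import math
from collections import Counter


def merge_singletons_demo(labels, points):
    """Reassign each 1-member cluster to the nearest multi-member centroid.

    Hypothetical pure-Python restatement of the logic in the diff above.
    """
    counts = Counter(labels)
    multi = [lab for lab, c in counts.items() if c > 1]
    has_singletons = any(c == 1 for c in counts.values())

    # Nothing to merge, or every cluster is a singleton: leave labels alone.
    if not has_singletons or not multi:
        return list(labels)

    # Centroid (coordinate-wise mean) of every non-singleton cluster.
    centroids = {}
    for lab in multi:
        members = [p for p, l in zip(points, labels) if l == lab]
        dim = len(members[0])
        centroids[lab] = [sum(m[d] for m in members) / len(members) for d in range(dim)]

    merged = list(labels)
    for i, lab in enumerate(labels):
        if counts[lab] == 1:
            # min() keeps the first label on ties, like the strict `<` in the diff.
            merged[i] = min(centroids, key=lambda l: math.dist(points[i], centroids[l]))
    return merged


labels = [1, 1, 2, 3, 3]
points = [(0, 0), (0, 1), (0.2, 0.5), (10, 10), (10, 11)]
print(merge_singletons_demo(labels, points))
```

In this toy input the lone member of cluster 2 sits next to cluster 1, so it is absorbed there, while an all-singleton input is returned unchanged.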
| 261 |
# ── Cluster ID stabilisation (Phase 4.2) ──────────────────────────────────────
|
| 262 |
|
| 263 |
# Hungarian matches below this cosine similarity are rejected as "unrelated".
|
|
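The soft cap introduced at the top of this hunk is a small piece of arithmetic that is worth seeing in isolation. The sketch below is hedged: `MAX_CLUSTERS = 7` comes from the diff comment, but the values used here for `MIN_CLUSTERS` (1) and `AVG_CLUSTER_SIZE_FLOOR` (3) are assumptions, since the diff does not show them, and `soft_cluster_cap` is a hypothetical helper name.

```python
import math

MAX_CLUSTERS = 7            # stated in the diff comment
MIN_CLUSTERS = 1            # assumed value, not shown in the diff
AVG_CLUSTER_SIZE_FLOOR = 3  # assumed value, not shown in the diff


def soft_cluster_cap(n, min_clusters=MIN_CLUSTERS,
                     max_clusters=MAX_CLUSTERS,
                     avg_floor=AVG_CLUSTER_SIZE_FLOOR):
    """Largest cluster count that keeps the average cluster size >= avg_floor."""
    return max(min_clusters, min(max_clusters, math.ceil(n / avg_floor)))


for n in (5, 12, 40):
    print(n, soft_cluster_cap(n))
```

Under these assumed constants, five saves allow at most two clusters, so the K=4 split with three singletons described in the comment would be re-cut by the `maxclust` call before the singleton merge runs.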
@@ -45,7 +45,7 @@ try:
|
|
| 45 |
if _path and os.path.isfile(_path):
|
| 46 |
_lgb_model = lgb.Booster(model_file=_path)
|
| 47 |
_USE_LGB = True
|
| 48 |
-
print(f"[reranker]
|
| 49 |
print(f"[reranker] trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
|
| 50 |
break
|
| 51 |
|
|
|
|
| 45 |
if _path and os.path.isfile(_path):
|
| 46 |
_lgb_model = lgb.Booster(model_file=_path)
|
| 47 |
_USE_LGB = True
|
| 48 |
+
print(f"[reranker] SUCCESS: LightGBM model loaded from {_path}")
|
| 49 |
print(f"[reranker] trees={_lgb_model.num_trees()}, features={_lgb_model.num_feature()}")
|
| 50 |
break
|
| 51 |
|
|
@@ -9,7 +9,7 @@ POST /api/onboarding/skip → mark done (no categories), redirect to /
|
|
| 9 |
"""
|
| 10 |
import uuid
|
| 11 |
import json
|
| 12 |
-
from fastapi import APIRouter, Request, Cookie
|
| 13 |
from fastapi.responses import HTMLResponse, RedirectResponse
|
| 14 |
from app import db
|
| 15 |
from app.config import COOKIE_NAME, CATEGORY_GROUPS
|
|
@@ -116,20 +116,14 @@ async def seed_search(
|
|
| 116 |
except Exception:
|
| 117 |
pass
|
| 118 |
|
| 119 |
-
#
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
resp = templates.TemplateResponse(
|
| 125 |
request,
|
| 126 |
-
"partials/
|
| 127 |
-
{
|
| 128 |
-
"papers": papers,
|
| 129 |
-
"query": q,
|
| 130 |
-
"seed_count": seed_count,
|
| 131 |
-
"seed_target": 5,
|
| 132 |
-
},
|
| 133 |
)
|
| 134 |
resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
|
| 135 |
return resp
|
|
@@ -161,90 +155,4 @@ async def skip_onboarding(
|
|
| 161 |
return resp
|
| 162 |
|
| 163 |
|
| 164 |
-
@router.post("/api/onboarding/import-author", response_class=HTMLResponse)
|
| 165 |
-
async def import_author(
|
| 166 |
-
request: Request,
|
| 167 |
-
author_url: str = Form(default=""),
|
| 168 |
-
user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
|
| 169 |
-
):
|
| 170 |
-
"""Phase 5.1: Import papers from a Semantic Scholar author profile.
|
| 171 |
-
|
| 172 |
-
Accepts S2 URL, raw S2 author ID, or ORCID.
|
| 173 |
-
Auto-saves the author's arXiv papers as seed interests.
|
| 174 |
-
"""
|
| 175 |
-
user_id = user_id or str(uuid.uuid4())
|
| 176 |
-
|
| 177 |
-
if not author_url.strip():
|
| 178 |
-
return HTMLResponse(
|
| 179 |
-
'<div class="alert alert-warning text-sm py-2">'
|
| 180 |
-
'⚠️ Please paste a Semantic Scholar author URL, ID, or ORCID.</div>'
|
| 181 |
-
)
|
| 182 |
-
|
| 183 |
-
from app import s2_svc, user_state as us
|
| 184 |
-
|
| 185 |
-
# 1. Parse input
|
| 186 |
-
parsed_id, input_type = s2_svc.parse_author_input(author_url)
|
| 187 |
-
if parsed_id is None:
|
| 188 |
-
return HTMLResponse(
|
| 189 |
-
'<div class="alert alert-error text-sm py-2">'
|
| 190 |
-
'❌ Could not recognise input. Paste a Semantic Scholar author URL, '
|
| 191 |
-
'a numeric author ID, or an ORCID (e.g. 0000-0003-3394-6622).</div>'
|
| 192 |
-
)
|
| 193 |
-
|
| 194 |
-
# 2. Resolve ORCID → S2 author ID if needed
|
| 195 |
-
try:
|
| 196 |
-
if input_type == "orcid":
|
| 197 |
-
s2_id = await s2_svc.resolve_orcid(parsed_id)
|
| 198 |
-
if not s2_id:
|
| 199 |
-
return HTMLResponse(
|
| 200 |
-
'<div class="alert alert-warning text-sm py-2">'
|
| 201 |
-
f'⚠️ No Semantic Scholar author found for ORCID {parsed_id}.</div>'
|
| 202 |
-
)
|
| 203 |
-
else:
|
| 204 |
-
s2_id = parsed_id
|
| 205 |
-
except Exception as e:
|
| 206 |
-
print(f"[onboarding] ORCID resolve failed: {e}")
|
| 207 |
-
return HTMLResponse(
|
| 208 |
-
'<div class="alert alert-error text-sm py-2">'
|
| 209 |
-
'❌ Failed to look up ORCID. Please try pasting the S2 URL directly.</div>'
|
| 210 |
-
)
|
| 211 |
-
|
| 212 |
-
# 3. Fetch arXiv papers
|
| 213 |
-
try:
|
| 214 |
-
arxiv_ids = await s2_svc.fetch_author_arxiv_papers(s2_id, limit=20)
|
| 215 |
-
except Exception as e:
|
| 216 |
-
print(f"[onboarding] S2 author paper fetch failed: {e}")
|
| 217 |
-
return HTMLResponse(
|
| 218 |
-
'<div class="alert alert-error text-sm py-2">'
|
| 219 |
-
'❌ Failed to fetch papers from Semantic Scholar. '
|
| 220 |
-
'The author ID may be invalid, or the API may be down.</div>'
|
| 221 |
-
)
|
| 222 |
-
|
| 223 |
-
if not arxiv_ids:
|
| 224 |
-
return HTMLResponse(
|
| 225 |
-
'<div class="alert alert-warning text-sm py-2">'
|
| 226 |
-
'⚠️ No arXiv papers found for this author. '
|
| 227 |
-
'They may publish in venues not indexed on arXiv.</div>'
|
| 228 |
-
)
|
| 229 |
-
|
| 230 |
-
# 4. Auto-save each paper as a positive interaction
|
| 231 |
-
for aid in arxiv_ids:
|
| 232 |
-
us.record_positive(user_id, aid)
|
| 233 |
-
await db.log_interaction(
|
| 234 |
-
user_id=user_id,
|
| 235 |
-
paper_id=aid,
|
| 236 |
-
event_type="save",
|
| 237 |
-
source="s2_import",
|
| 238 |
-
)
|
| 239 |
-
|
| 240 |
-
state = await us.ensure_loaded(user_id)
|
| 241 |
-
seed_count = len(state.positives)
|
| 242 |
|
| 243 |
-
resp = HTMLResponse(
|
| 244 |
-
f'<div class="alert alert-success text-sm py-2">'
|
| 245 |
-
f'β
Imported {len(arxiv_ids)} papers! '
|
| 246 |
-
f'You now have {seed_count} saved papers. '
|
| 247 |
-
f'Click <strong>"Done - start exploring →"</strong> to see your recommendations.</div>'
|
| 248 |
-
)
|
| 249 |
-
resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
|
| 250 |
-
return resp
|
|
|
|
| 9 |
"""
|
| 10 |
import uuid
|
| 11 |
import json
|
| 12 |
+
from fastapi import APIRouter, Request, Cookie
|
| 13 |
from fastapi.responses import HTMLResponse, RedirectResponse
|
| 14 |
from app import db
|
| 15 |
from app.config import COOKIE_NAME, CATEGORY_GROUPS
|
|
|
|
| 116 |
except Exception:
|
| 117 |
pass
|
| 118 |
|
| 119 |
+
# HTMX request: return ONLY the results partial (swap target = #seed-results).
|
| 120 |
+
# The full seed_search.html panel is rendered by save_categories() during the
|
| 121 |
+
# step 1 → step 2 transition; subsequent searches must not re-render the whole
|
| 122 |
+
# panel or it nests inside #seed-results and duplicates the wizard.
|
|
|
|
| 123 |
resp = templates.TemplateResponse(
|
| 124 |
request,
|
| 125 |
+
"partials/seed_results.html",
|
| 126 |
+
{"papers": papers, "query": q},
|
| 127 |
)
|
| 128 |
resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
|
| 129 |
return resp
|
|
|
|
| 155 |
return resp
|
| 156 |
|
| 157 |
|
| 158 |
|
|
@@ -16,6 +16,7 @@ Phase 4 changes vs Phase 2b:
|
|
| 16 |
- Category-level suppression filters strongly disliked topics (4.3)
|
| 17 |
"""
|
| 18 |
import asyncio
|
|
|
|
| 19 |
import uuid
|
| 20 |
import numpy as np
|
| 21 |
from fastapi import APIRouter, Request, Cookie
|
|
@@ -110,9 +111,11 @@ async def get_recommendations(
|
|
| 110 |
# populated by whichever tier serves the result.
|
| 111 |
paper_tags: dict[str, dict] = {}
|
| 112 |
rec_arxiv_ids: list[str] = []
|
|
|
|
|
|
|
| 113 |
|
| 114 |
# ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
|
| 115 |
-
rec_arxiv_ids, paper_tags = await _multi_interest_recommend(
|
| 116 |
user_id, state, seen, REC_LIMIT, query_id=query_id,
|
| 117 |
)
|
| 118 |
|
|
@@ -151,6 +154,7 @@ async def get_recommendations(
|
|
| 151 |
return _empty_resp()
|
| 152 |
|
| 153 |
# Phase 3.5: Turso primary, arXiv API fallback
|
|
|
|
| 154 |
meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
|
| 155 |
missing = [aid for aid in rec_arxiv_ids if aid not in meta]
|
| 156 |
if missing:
|
|
@@ -159,6 +163,8 @@ async def get_recommendations(
|
|
| 159 |
meta.update(arxiv_meta)
|
| 160 |
except Exception as e:
|
| 161 |
print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
|
|
|
|
|
|
|
| 162 |
|
| 163 |
# Cache to SQLite so category suppression JOINs work (Phase 4.3)
|
| 164 |
await db.cache_turso_metadata_batch(list(meta.values()))
|
|
@@ -187,7 +193,12 @@ async def get_recommendations(
|
|
| 187 |
resp = templates.TemplateResponse(
|
| 188 |
request,
|
| 189 |
"partials/recommendations.html",
|
| 190 |
-
{
|
| 191 |
)
|
| 192 |
resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
|
| 193 |
return resp
|
|
@@ -210,18 +221,20 @@ async def _multi_interest_recommend(
|
|
| 210 |
7. MMR diversity → select top-k with diversity
|
| 211 |
8. Exploration injection → serendipitous papers
|
| 212 |
|
| 213 |
-
Returns ([], {}) to trigger fallback to Tier 2.
|
| 214 |
Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
|
| 215 |
"""
|
| 216 |
positives = state.positive_list
|
| 217 |
if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
|
| 218 |
-
return [], {}
|
| 219 |
|
| 220 |
try:
|
| 221 |
# Fetch embeddings for all saved papers
|
| 222 |
vectors = await qdrant_svc.get_paper_vectors(positives)
|
| 223 |
if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
|
| 224 |
-
return [], {}
|
|
|
|
|
|
|
| 225 |
|
| 226 |
# Build aligned arrays (only papers we got vectors for)
|
| 227 |
aligned_ids = [pid for pid in positives if pid in vectors]
|
|
@@ -230,6 +243,7 @@ async def _multi_interest_recommend(
|
|
| 230 |
)
|
| 231 |
|
| 232 |
# ── Step 1: Compute interest clusters ─────────────────────────────
|
|
|
|
| 233 |
clusters = compute_clusters(aligned_ids, aligned_embs)
|
| 234 |
|
| 235 |
# ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
|
|
@@ -267,6 +281,7 @@ async def _multi_interest_recommend(
|
|
| 267 |
clusters = stabilize_cluster_ids(clusters, old_clusters)
|
| 268 |
|
| 269 |
await save_clusters_to_db(user_id, clusters)
|
|
|
|
| 270 |
|
| 271 |
# Phase 6.5 B3: append snapshot for cluster history (non-blocking)
|
| 272 |
try:
|
|
@@ -289,8 +304,15 @@ async def _multi_interest_recommend(
|
|
| 289 |
quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
|
| 290 |
|
| 291 |
# ── Step 3: Parallel per-cluster ANN searches ─────────────────────
|
|
|
|
| 292 |
st_vec = await profiles.load_profile(user_id, "short_term")
|
| 293 |
|
| 294 |
search_tasks = [
|
| 295 |
qdrant_svc.search_by_vector_with_scores(
|
| 296 |
query_vector=c.medoid_embedding.tolist(),
|
|
@@ -301,20 +323,16 @@ async def _multi_interest_recommend(
|
|
| 301 |
]
|
| 302 |
per_cluster_scored = await asyncio.gather(*search_tasks)
|
| 303 |
|
| 304 |
-
# Build paper → cluster map AND real qdrant_score_map in one pass.
|
| 305 |
-
# Phase 6.5 A1: replaces the old rank-based linear decay approximation.
|
| 306 |
paper_cluster_map: dict[str, int] = {}
|
| 307 |
qdrant_score_map: dict[str, float] = {}
|
| 308 |
for cluster, scored_results in zip(clusters, per_cluster_scored):
|
| 309 |
for hit in scored_results:
|
| 310 |
aid = hit["arxiv_id"]
|
| 311 |
-
if aid not in paper_cluster_map:
|
| 312 |
paper_cluster_map[aid] = cluster.cluster_idx
|
| 313 |
-
# Keep highest cosine if a paper appears in multiple clusters
|
| 314 |
if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
|
| 315 |
qdrant_score_map[aid] = float(hit["score"])
|
| 316 |
|
| 317 |
-
# merge_quota_results expects list[list[str]], so extract IDs
|
| 318 |
per_cluster_ids = [
|
| 319 |
[h["arxiv_id"] for h in scored] for scored in per_cluster_scored
|
| 320 |
]
|
|
@@ -337,9 +355,14 @@ async def _multi_interest_recommend(
|
|
| 337 |
qdrant_score_map[aid] = float(hit["score"])
|
| 338 |
|
| 339 |
if not candidate_ids:
|
| 340 |
-
return [], {}
|
|
|
|
| 341 |
|
| 342 |
# ── Step 5: Fetch candidate vectors + metadata ────────────────────
|
| 343 |
cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
|
| 344 |
cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
|
| 345 |
cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
|
|
@@ -356,7 +379,8 @@ async def _multi_interest_recommend(
|
|
| 356 |
# Only process candidates with both vectors and metadata
|
| 357 |
valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
|
| 358 |
if not valid_ids:
|
| 359 |
-
return candidate_ids[:limit], {}
|
|
|
|
| 360 |
|
| 361 |
valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
|
| 362 |
valid_meta = [cand_meta[cid] for cid in valid_ids]
|
|
@@ -427,6 +451,7 @@ async def _multi_interest_recommend(
|
|
| 427 |
)
|
| 428 |
|
| 429 |
# ── Step 6: LightGBM re-ranking (37 features) ────────────────────
|
|
|
|
| 430 |
reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
|
| 431 |
candidate_ids=valid_ids,
|
| 432 |
candidate_embeddings=valid_embs,
|
|
@@ -443,6 +468,8 @@ async def _multi_interest_recommend(
|
|
| 443 |
user_total_saves=user_total_saves,
|
| 444 |
user_total_dismissals=user_total_dismissals,
|
| 445 |
)
|
|
|
|
|
|
|
| 446 |
|
| 447 |
# ── Step 4.3: Category suppression (post-rerank safety net) ───────
|
| 448 |
# The model now sees feature 25 (is_suppressed_category), but we
|
|
@@ -459,6 +486,7 @@ async def _multi_interest_recommend(
|
|
| 459 |
reranked_embs = reranked_embs[kept]
|
| 460 |
|
| 461 |
# ── Step 7: MMR diversity enforcement ─────────────────────────────
|
|
|
|
| 462 |
query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
|
| 463 |
mmr_selected = mmr_rerank(
|
| 464 |
query_embedding=query_vec,
|
|
@@ -468,6 +496,7 @@ async def _multi_interest_recommend(
|
|
| 468 |
lambda_param=0.6,
|
| 469 |
top_k=limit,
|
| 470 |
)
|
|
|
|
| 471 |
|
| 472 |
# ── Step 8: Exploration injection ─────────────────────────────────
|
| 473 |
final = inject_exploration(
|
|
@@ -508,11 +537,11 @@ async def _multi_interest_recommend(
|
|
| 508 |
"policy_id": _RANKER_VERSION,
|
| 509 |
}
|
| 510 |
|
| 511 |
-
return final, paper_tags
|
| 512 |
|
| 513 |
except Exception as e:
|
| 514 |
-
print(f"[recommendations] multi-interest
|
| 515 |
-
return [], {}
|
| 516 |
|
| 517 |
|
| 518 |
# ── Tier 2: EWMA single-vector search ────────────────────────────────────────
|
|
|
|
| 16 |
- Category-level suppression filters strongly disliked topics (4.3)
|
| 17 |
"""
|
| 18 |
import asyncio
|
| 19 |
+
import time
|
| 20 |
import uuid
|
| 21 |
import numpy as np
|
| 22 |
from fastapi import APIRouter, Request, Cookie
|
|
|
|
| 111 |
# populated by whichever tier serves the result.
|
| 112 |
paper_tags: dict[str, dict] = {}
|
| 113 |
rec_arxiv_ids: list[str] = []
|
| 114 |
+
rerank_time_ms = 0
|
| 115 |
+
timing_breakdown: dict = {}
|
| 116 |
|
| 117 |
# ── Tier 1: Multi-interest clustering + quota fusion (≥5 saves) ──────
|
| 118 |
+
rec_arxiv_ids, paper_tags, rerank_time_ms, timing_breakdown = await _multi_interest_recommend(
|
| 119 |
user_id, state, seen, REC_LIMIT, query_id=query_id,
|
| 120 |
)
|
| 121 |
|
|
|
|
| 154 |
return _empty_resp()
|
| 155 |
|
| 156 |
# Phase 3.5: Turso primary, arXiv API fallback
|
| 157 |
+
t0_meta = time.time()
|
| 158 |
meta = await turso_svc.fetch_metadata_batch(rec_arxiv_ids)
|
| 159 |
missing = [aid for aid in rec_arxiv_ids if aid not in meta]
|
| 160 |
if missing:
|
|
|
|
| 163 |
meta.update(arxiv_meta)
|
| 164 |
except Exception as e:
|
| 165 |
print(f"[recommendations] arXiv fallback for {len(missing)} IDs failed: {e}")
|
| 166 |
+
t1_meta = time.time()
|
| 167 |
+
meta_time_ms = int((t1_meta - t0_meta) * 1000)
|
| 168 |
|
| 169 |
# Cache to SQLite so category suppression JOINs work (Phase 4.3)
|
| 170 |
await db.cache_turso_metadata_batch(list(meta.values()))
|
|
|
|
| 193 |
resp = templates.TemplateResponse(
|
| 194 |
request,
|
| 195 |
"partials/recommendations.html",
|
| 196 |
+
{
|
| 197 |
+
"papers": papers,
|
| 198 |
+
"rerank_time_ms": rerank_time_ms,
|
| 199 |
+
"meta_time_ms": meta_time_ms,
|
| 200 |
+
"timing": timing_breakdown,
|
| 201 |
+
},
|
| 202 |
)
|
| 203 |
resp.set_cookie(COOKIE_NAME, user_id, max_age=365 * 24 * 3600, httponly=True)
|
| 204 |
return resp
|
|
|
|
| 221 |
7. MMR diversity → select top-k with diversity
|
| 222 |
8. Exploration injection → serendipitous papers
|
| 223 |
|
| 224 |
+
Returns ([], {}, 0, {}) to trigger fallback to Tier 2.
|
| 225 |
Phase 4.5: second element is {arxiv_id: {ranker_version, candidate_source, cluster_id}}.
|
| 226 |
"""
|
| 227 |
positives = state.positive_list
|
| 228 |
if len(positives) < MIN_PAPERS_FOR_CLUSTERING:
|
| 229 |
+
return [], {}, 0, {}
|
| 230 |
|
| 231 |
try:
|
| 232 |
# Fetch embeddings for all saved papers
|
| 233 |
vectors = await qdrant_svc.get_paper_vectors(positives)
|
| 234 |
if len(vectors) < MIN_PAPERS_FOR_CLUSTERING:
|
| 235 |
+
return [], {}, 0, {}
|
| 236 |
+
|
| 237 |
+
timing = {} # Collect per-stage timing breakdown
|
| 238 |
|
| 239 |
# Build aligned arrays (only papers we got vectors for)
|
| 240 |
aligned_ids = [pid for pid in positives if pid in vectors]
|
|
|
|
| 243 |
)
|
| 244 |
|
| 245 |
# ── Step 1: Compute interest clusters ─────────────────────────────
|
| 246 |
+
t0_cluster = time.time()
|
| 247 |
clusters = compute_clusters(aligned_ids, aligned_embs)
|
| 248 |
|
| 249 |
# ── Step 4.2: Stabilise cluster IDs with Hungarian matching ───────
|
|
|
|
| 281 |
clusters = stabilize_cluster_ids(clusters, old_clusters)
|
| 282 |
|
| 283 |
await save_clusters_to_db(user_id, clusters)
|
| 284 |
+
timing["clustering_ms"] = int((time.time() - t0_cluster) * 1000)
|
| 285 |
|
| 286 |
# Phase 6.5 B3: append snapshot for cluster history (non-blocking)
|
| 287 |
try:
|
|
|
|
| 304 |
quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
|
| 305 |
|
| 306 |
# ── Step 3: Parallel per-cluster ANN searches ─────────────────────
|
| 307 |
+
t0_ann = time.time()
|
| 308 |
st_vec = await profiles.load_profile(user_id, "short_term")
|
| 309 |
|
| 310 |
+
# NOTE on latency: we previously tried passing with_vectors=True
|
| 311 |
+
# to fold the candidate-vector fetch into the search call. That
|
| 312 |
+
# made it *worse* on Qdrant Cloud free tier: search latency
|
| 313 |
+
# ballooned from ~2s to ~40s because returning vectors triggers
|
| 314 |
+
# a per-result disk read inside the search path. Keep the search
|
| 315 |
+
# vector-free; vectors come from a separate cached retrieve.
|
| 316 |
search_tasks = [
|
| 317 |
qdrant_svc.search_by_vector_with_scores(
|
| 318 |
query_vector=c.medoid_embedding.tolist(),
|
|
|
|
| 323 |
]
|
| 324 |
per_cluster_scored = await asyncio.gather(*search_tasks)
|
| 325 |
|
|
|
|
|
|
|
| 326 |
paper_cluster_map: dict[str, int] = {}
|
| 327 |
qdrant_score_map: dict[str, float] = {}
|
| 328 |
for cluster, scored_results in zip(clusters, per_cluster_scored):
|
| 329 |
for hit in scored_results:
|
| 330 |
aid = hit["arxiv_id"]
|
| 331 |
+
if aid not in paper_cluster_map:
|
| 332 |
paper_cluster_map[aid] = cluster.cluster_idx
|
|
|
|
| 333 |
if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
|
| 334 |
qdrant_score_map[aid] = float(hit["score"])
|
| 335 |
|
|
|
|
| 336 |
per_cluster_ids = [
|
| 337 |
[h["arxiv_id"] for h in scored] for scored in per_cluster_scored
|
| 338 |
]
|
|
|
|
| 355 |
qdrant_score_map[aid] = float(hit["score"])
|
| 356 |
|
| 357 |
if not candidate_ids:
|
| 358 |
+
return [], {}, 0, {}
|
| 359 |
+
timing["ann_retrieval_ms"] = int((time.time() - t0_ann) * 1000)
|
| 360 |
|
| 361 |
# ── Step 5: Fetch candidate vectors + metadata ────────────────────
|
| 362 |
+
# get_paper_vectors is now LRU-cached by arxiv_id (qdrant_svc),
|
| 363 |
+
# so warm cache makes this cheap; only fresh papers pay the
|
| 364 |
+
# disk-read cost.
|
| 365 |
+
t0_cand_meta = time.time()
|
| 366 |
cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
|
| 367 |
cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
|
| 368 |
cand_missing = [cid for cid in candidate_ids if cid not in cand_meta]
|
|
|
|
| 379 |
# Only process candidates with both vectors and metadata
|
| 380 |
valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
|
| 381 |
if not valid_ids:
|
| 382 |
+
return candidate_ids[:limit], {}, 0, {}
|
| 383 |
+
timing["candidate_meta_ms"] = int((time.time() - t0_cand_meta) * 1000)
|
| 384 |
|
| 385 |
valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
|
| 386 |
valid_meta = [cand_meta[cid] for cid in valid_ids]
|
|
|
|
| 451 |
)
|
| 452 |
|
| 453 |
# ── Step 6: LightGBM re-ranking (37 features) ────────────────────
|
| 454 |
+
t0_rerank = time.time()
|
| 455 |
reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
|
| 456 |
candidate_ids=valid_ids,
|
| 457 |
candidate_embeddings=valid_embs,
|
|
|
|
| 468 |
user_total_saves=user_total_saves,
|
| 469 |
user_total_dismissals=user_total_dismissals,
|
| 470 |
)
|
| 471 |
+
t1_rerank = time.time()
|
| 472 |
+
rerank_time_ms = int((t1_rerank - t0_rerank) * 1000)
|
| 473 |
|
| 474 |
# ── Step 4.3: Category suppression (post-rerank safety net) ───────
|
| 475 |
# The model now sees feature 25 (is_suppressed_category), but we
|
|
|
|
| 486 |
reranked_embs = reranked_embs[kept]
|
| 487 |
|
| 488 |
# ── Step 7: MMR diversity enforcement ─────────────────────────────
|
| 489 |
+
t0_mmr = time.time()
|
| 490 |
query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
|
| 491 |
mmr_selected = mmr_rerank(
|
| 492 |
query_embedding=query_vec,
|
|
|
|
| 496 |
lambda_param=0.6,
|
| 497 |
top_k=limit,
|
| 498 |
)
|
| 499 |
+
timing["mmr_ms"] = int((time.time() - t0_mmr) * 1000)
|
| 500 |
|
| 501 |
# ── Step 8: Exploration injection ─────────────────────────────────
|
| 502 |
final = inject_exploration(
|
|
|
|
| 537 |
"policy_id": _RANKER_VERSION,
|
| 538 |
}
|
| 539 |
|
| 540 |
+
return final, paper_tags, rerank_time_ms, timing
|
| 541 |
|
| 542 |
except Exception as e:
|
| 543 |
+
print(f"[recommendations] multi-interest preprocessing failed: {e}")
|
| 544 |
+
return [], {}, 0, {}
|
| 545 |
|
| 546 |
|
| 547 |
# ── Tier 2: EWMA single-vector search ────────────────────────────────────────
|
|
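After this change `_multi_interest_recommend` returns a four-element tuple, and every early exit has to repeat the same `[], {}, 0, {}` sentinel. One way to keep those exits in sync is a named result type. The sketch below is an illustration under assumptions, not part of the commit: `Tier1Result` and its `empty()` constructor are hypothetical names, and the field names simply mirror the variables in the diff.

```python
from typing import NamedTuple


class Tier1Result(NamedTuple):
    """Hypothetical container for the four values Tier 1 returns."""
    arxiv_ids: list
    paper_tags: dict
    rerank_time_ms: int
    timing: dict

    @classmethod
    def empty(cls):
        # Fresh containers on every call, so callers never share state.
        return cls([], {}, 0, {})


def tier1_stub(n_saves, min_papers=5):
    """Stand-in for the real coroutine: shows only the fallback path."""
    if n_saves < min_papers:
        return Tier1Result.empty()
    return Tier1Result(["2401.00001"], {}, 12, {"clustering_ms": 3})


# A NamedTuple still unpacks positionally, so existing call sites keep working.
ids, tags, rerank_ms, timing = tier1_stub(2)
print(ids, tags, rerank_ms, timing)
```

Because the result is still a tuple, the existing four-way unpacking in `get_recommendations` would not need to change; the arXiv ID in the stub is a placeholder value.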
@@ -27,17 +27,23 @@ async def search(
|
|
| 27 |
q: str = "",
|
| 28 |
user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
|
| 29 |
):
|
|
|
|
|
|
|
|
|
|
| 30 |
papers = []
|
| 31 |
if q.strip():
|
| 32 |
# Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
|
| 33 |
try:
|
| 34 |
-
arxiv_ids = await hybrid_search_svc.search(
|
|
|
|
|
|
|
| 35 |
except Exception as e:
|
| 36 |
print(f"[search] Hybrid search error: {e}")
|
| 37 |
arxiv_ids = []
|
| 38 |
|
| 39 |
if arxiv_ids:
|
| 40 |
# Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
|
|
|
|
| 41 |
try:
|
| 42 |
meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
|
| 43 |
except Exception as e:
|
|
@@ -52,6 +58,8 @@ async def search(
|
|
| 52 |
meta.update(arxiv_meta)
|
| 53 |
except Exception as e:
|
| 54 |
print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
|
|
|
|
|
|
|
| 55 |
|
| 56 |
# Phase 4.3: Cache to SQLite so dismissal category JOINs work
|
| 57 |
await db.cache_turso_metadata_batch(list(meta.values()))
|
|
@@ -66,6 +74,8 @@ async def search(
|
|
| 66 |
except Exception as e:
|
| 67 |
print(f"[search] arXiv fallback also failed: {e}")
|
| 68 |
papers = []
|
|
|
|
|
|
|
| 69 |
|
| 70 |
user_id = user_id or str(uuid.uuid4())
|
| 71 |
# Phase 6.5 B1: one query_id per search request for per-feed CTR
|
|
@@ -86,7 +96,7 @@ async def search(
|
|
| 86 |
resp = templates.TemplateResponse(
|
| 87 |
request,
|
| 88 |
"partials/search_results.html",
|
| 89 |
-
{"papers": papers, "query": q},
|
| 90 |
)
|
| 91 |
else:
|
| 92 |
resp = templates.TemplateResponse(
|
|
@@ -96,6 +106,7 @@ async def search(
|
|
| 96 |
"papers": papers,
|
| 97 |
"query": q,
|
| 98 |
"has_recs": state.has_enough_for_recs(),
|
|
|
|
| 99 |
},
|
| 100 |
)
|
| 101 |
|
|
|
|
| 27 |
q: str = "",
|
| 28 |
user_id: str | None = Cookie(default=None, alias=COOKIE_NAME),
|
| 29 |
):
|
| 30 |
+
import time
|
| 31 |
+
start_time = time.perf_counter()
|
| 32 |
+
search_meta = {}
|
| 33 |
papers = []
|
| 34 |
if q.strip():
|
| 35 |
# Phase 3: Hybrid semantic search (BGE-M3 + Qdrant + Zilliz + RRF)
|
| 36 |
try:
|
| 37 |
+
arxiv_ids, search_meta = await hybrid_search_svc.search(
|
| 38 |
+
q.strip(), limit=ARXIV_MAX_RESULTS, return_meta=True
|
| 39 |
+
)
|
| 40 |
except Exception as e:
|
| 41 |
print(f"[search] Hybrid search error: {e}")
|
| 42 |
arxiv_ids = []
|
| 43 |
|
| 44 |
if arxiv_ids:
|
| 45 |
# Phase 3.5: Fetch metadata from Turso DB first (fast, ~50ms)
|
| 46 |
+
t0_meta = time.perf_counter()
|
| 47 |
try:
|
| 48 |
meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
|
| 49 |
except Exception as e:
|
|
|
|
| 58 |
meta.update(arxiv_meta)
|
| 59 |
except Exception as e:
|
| 60 |
print(f"[search] arXiv fallback for {len(missing)} IDs failed: {e}")
|
| 61 |
+
|
| 62 |
+
search_meta["meta_time_ms"] = int((time.perf_counter() - t0_meta) * 1000)
|
| 63 |
|
| 64 |
# Phase 4.3: Cache to SQLite so dismissal category JOINs work
|
| 65 |
await db.cache_turso_metadata_batch(list(meta.values()))
|
|
|
|
| 74 |
except Exception as e:
|
| 75 |
print(f"[search] arXiv fallback also failed: {e}")
|
| 76 |
papers = []
|
| 77 |
+
|
| 78 |
+
search_meta["total_time_ms"] = int((time.perf_counter() - start_time) * 1000)
|
| 79 |
|
| 80 |
user_id = user_id or str(uuid.uuid4())
|
| 81 |
# Phase 6.5 B1: one query_id per search request for per-feed CTR
|
|
|
|
| 96 |
resp = templates.TemplateResponse(
|
| 97 |
request,
|
| 98 |
"partials/search_results.html",
|
| 99 |
+
{"papers": papers, "query": q, "search_meta": search_meta},
|
| 100 |
)
|
| 101 |
else:
|
| 102 |
resp = templates.TemplateResponse(
|
|
|
|
| 106 |
"papers": papers,
|
| 107 |
"query": q,
|
| 108 |
"has_recs": state.has_enough_for_recs(),
|
| 109 |
+
"search_meta": search_meta,
|
| 110 |
},
|
| 111 |
)
|
| 112 |
|
|
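The per-stage timing in both routers follows one pattern: take a start timestamp, run the stage, store the elapsed milliseconds under a key. A context manager can capture that pattern once. This is a minimal sketch under assumptions: `stage_timer` is a hypothetical helper that does not exist in the codebase, and it uses `time.perf_counter()` (a monotonic clock intended for measuring intervals) as the search handler does, whereas the recommendation code above uses `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timing, key):
    """Record the wall-clock duration of the enclosed block in ms.

    Hypothetical helper: writes an int into `timing[key]`, even if the
    block raises, so partial breakdowns survive an early failure.
    """
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timing[key] = int((time.perf_counter() - t0) * 1000)


search_meta = {}
with stage_timer(search_meta, "total_time_ms"):
    with stage_timer(search_meta, "meta_time_ms"):
        time.sleep(0.01)  # stands in for the metadata fetch

print(search_meta)
```

Nesting the timers reproduces the relationship between `meta_time_ms` and `total_time_ms` in the search handler, with the inner stage always bounded by the outer one.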
@@ -1,111 +0,0 @@
|
|
| 1 |
-
"""
|
| 2 |
-
Semantic Scholar service, Phase 5.1 (author import for onboarding).
|
| 3 |
-
|
| 4 |
-
Accepts an S2 author URL, a raw S2 author ID, or an ORCID, then
|
| 5 |
-
fetches that author's papers and returns arXiv IDs for auto-saving.
|
| 6 |
-
|
| 7 |
-
API docs: https://api.semanticscholar.org/api-docs/graph
|
| 8 |
-
"""
|
| 9 |
-
from __future__ import annotations
|
| 10 |
-
|
| 11 |
-
import re
|
| 12 |
-
import httpx
|
| 13 |
-
from app.config import S2_API_KEY
|
| 14 |
-
|
| 15 |
-
_BASE = "https://api.semanticscholar.org/graph/v1"
|
| 16 |
-
_TIMEOUT = 15.0 # seconds
|
| 17 |
-
|
| 18 |
-
# ── Patterns ──────────────────────────────────────────────────────────────────
|
| 19 |
-
# URL: https://www.semanticscholar.org/author/Yoshua-Bengio/1751762
|
| 20 |
-
# Raw: 1751762
|
| 21 |
-
# ORCID: 0000-0003-3394-6622
|
| 22 |
-
_S2_URL_RE = re.compile(
|
| 23 |
-
r"semanticscholar\.org/author/[^/]+/(\d+)", re.IGNORECASE
|
| 24 |
-
)
|
| 25 |
-
_ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
|
| 26 |
-
_RAW_ID_RE = re.compile(r"^\d{3,}$") # 3+ digits = plausible S2 author ID
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
def _headers() -> dict[str, str]:
|
| 30 |
-
"""Build request headers, including API key if available."""
|
| 31 |
-
h: dict[str, str] = {"Accept": "application/json"}
|
| 32 |
-
if S2_API_KEY:
|
| 33 |
-
h["x-api-key"] = S2_API_KEY
|
| 34 |
-
return h
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
# ── Public API ────────────────────────────────────────────────────────────────
|
| 38 |
-
|
| 39 |
-
def parse_author_input(text: str) -> tuple[str | None, str]:
|
| 40 |
-
"""Parse user-provided text into an S2 author ID or ORCID.
|
| 41 |
-
|
| 42 |
-
Returns (s2_author_id | None, input_type) where input_type is one of:
|
| 43 |
-
"s2_url", "s2_id", "orcid", "unknown"
|
| 44 |
-
"""
|
| 45 |
-
text = text.strip()
|
| 46 |
-
if not text:
|
| 47 |
-
return None, "unknown"
|
| 48 |
-
|
| 49 |
-
# 1. Try S2 URL
|
| 50 |
-
m = _S2_URL_RE.search(text)
|
| 51 |
-
if m:
|
| 52 |
-
return m.group(1), "s2_url"
|
| 53 |
-
|
| 54 |
-
# 2. Try ORCID
|
| 55 |
-
m = _ORCID_RE.search(text)
|
| 56 |
-
if m:
|
| 57 |
-
return m.group(0), "orcid"
|
| 58 |
-
|
| 59 |
-
# 3. Try raw numeric ID
|
| 60 |
-
if _RAW_ID_RE.match(text):
|
| 61 |
-
return text, "s2_id"
|
| 62 |
-
|
| 63 |
-
return None, "unknown"
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
async def resolve_orcid(orcid: str) -> str | None:
|
| 67 |
-
"""Resolve an ORCID to an S2 author ID via the author search endpoint.
|
| 68 |
-
|
| 69 |
-
Returns the S2 authorId string or None if not found.
|
| 70 |
-
"""
|
| 71 |
-
url = f"{_BASE}/author/search"
|
| 72 |
-
params = {"query": orcid, "limit": 1}
|
| 73 |
-
async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
|
| 74 |
-
resp = await client.get(url, params=params, headers=_headers())
|
| 75 |
-
resp.raise_for_status()
|
| 76 |
-
data = resp.json()
|
| 77 |
-
authors = data.get("data", [])
|
| 78 |
-
if authors:
|
| 79 |
-
return str(authors[0]["authorId"])
|
| 80 |
-
return None
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
async def fetch_author_arxiv_papers(
|
| 84 |
-
author_id: str, limit: int = 50,
|
| 85 |
-
) -> list[str]:
|
| 86 |
-
"""Fetch an author's papers from S2 and return arXiv IDs.
|
| 87 |
-
|
| 88 |
-
Filters to papers that have an ArXiv external ID.
|
| 89 |
-
Returns at most `limit` arXiv IDs, ordered by citation count (desc).
|
| 90 |
-
"""
|
| 91 |
-
url = f"{_BASE}/author/{author_id}/papers"
|
| 92 |
-
params = {
|
| 93 |
-
"fields": "externalIds,citationCount",
|
| 94 |
-
"limit": min(limit * 2, 500), # over-fetch since not all have arXiv IDs
|
| 95 |
-
}
|
| 96 |
-
arxiv_ids: list[tuple[int, str]] = [] # (citation_count, arxiv_id)
|
| 97 |
-
|
| 98 |
-
async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
|
| 99 |
-
resp = await client.get(url, params=params, headers=_headers())
|
| 100 |
-
resp.raise_for_status()
|
| 101 |
-
data = resp.json()
|
| 102 |
-
for paper in data.get("data", []):
|
| 103 |
-
ext = paper.get("externalIds") or {}
|
| 104 |
-
arxiv_id = ext.get("ArXiv")
|
| 105 |
-
if arxiv_id:
|
| 106 |
-
cites = paper.get("citationCount") or 0
|
| 107 |
-
arxiv_ids.append((cites, arxiv_id))
|
| 108 |
-
|
| 109 |
-
# Sort by citation count descending so we import the most impactful first
|
| 110 |
-
arxiv_ids.sort(key=lambda x: x[0], reverse=True)
|
| 111 |
-
return [aid for _, aid in arxiv_ids[:limit]]
|
@@ -13,20 +13,13 @@
|
|
| 13 |
<p class="text-sm text-base-content/60 mb-4">
|
| 14 |
Search arXiv, save papers you like - get personalised recommendations.
|
| 15 |
</p>
|
| 16 |
-
<form
|
| 17 |
-
hx-target="#search-results"
|
| 18 |
-
hx-push-url="true"
|
| 19 |
-
hx-indicator="#search-spinner"
|
| 20 |
-
class="flex gap-2">
|
| 21 |
<input type="text"
|
| 22 |
name="q"
|
| 23 |
placeholder="e.g. transformer attention mechanism"
|
| 24 |
class="input input-bordered flex-1"
|
| 25 |
autofocus />
|
| 26 |
-
<button class="btn btn-primary" type="submit">
|
| 27 |
-
Search
|
| 28 |
-
<span id="search-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
|
| 29 |
-
</button>
|
| 30 |
</form>
|
| 31 |
</div>
|
| 32 |
|
|
@@ -57,8 +50,5 @@
|
|
| 57 |
</div>
|
| 58 |
</div>
|
| 59 |
|
| 60 |
-
<!-- Search results (swapped in by HTMX) -->
|
| 61 |
-
<div id="search-results"></div>
|
| 62 |
-
|
| 63 |
</div>
|
| 64 |
{% endblock %}
|
|
|
|
| 13 |
<p class="text-sm text-base-content/60 mb-4">
|
| 14 |
Search arXiv, save papers you like - get personalised recommendations.
|
| 15 |
</p>
|
| 16 |
+
<form action="/search" method="get" class="flex gap-2">
|
| 17 |
<input type="text"
|
| 18 |
name="q"
|
| 19 |
placeholder="e.g. transformer attention mechanism"
|
| 20 |
class="input input-bordered flex-1"
|
| 21 |
autofocus />
|
| 22 |
+
<button class="btn btn-primary" type="submit">Search</button>
|
| 23 |
</form>
|
| 24 |
</div>
|
| 25 |
|
|
|
|
| 50 |
</div>
|
| 51 |
</div>
|
| 52 |
|
| 53 |
</div>
|
| 54 |
{% endblock %}
|
|
@@ -9,6 +9,11 @@
|
|
| 9 |
{% set position = position | default(0) %}
|
| 10 |
{% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
|
| 11 |
|
| 12 |
{# Category badge colour mapping #}
|
| 13 |
{% set cat = paper.category | default("") %}
|
| 14 |
{% if cat.startswith("cs.") %}
|
|
@@ -43,19 +48,19 @@
|
|
| 43 |
{% endif %}
|
| 44 |
</div>
|
| 45 |
|
| 46 |
-
<!-- Meta: arXiv ID + year + citations -->
|
| 47 |
<div class="text-xs text-base-content/50 mono">
|
| 48 |
[{{ paper.arxiv_id }}]
|
| 49 |
{% if paper.published %} · {{ paper.published[:4] }}{% endif %}
|
| 50 |
-
{% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al.{% endif %}</span>{% endif %}
|
| 51 |
{% if paper.citation_count %}
|
| 52 |
· <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">π {{ paper.citation_count }} citations</span>
|
| 53 |
{% endif %}
|
| 54 |
</div>
|
| 55 |
|
| 56 |
-
<!-- Abstract (truncated) -->
|
| 57 |
-
<p class="text-sm text-base-content/75 line-clamp
|
| 58 |
-
{{ paper.abstract }}
|
| 59 |
</p>
|
| 60 |
|
| 61 |
<!-- Action buttons (HTMX-powered, swap themselves on click) -->
|
|
|
|
| 9 |
{% set position = position | default(0) %}
|
| 10 |
{% set authors_list = paper.authors | default("[]") | tojson_parse | default([]) %}
|
| 11 |
|
| 12 |
+
{# Fallback: if tojson_parse returned empty but authors is a non-empty string, split by comma #}
|
| 13 |
+
{% if not authors_list and paper.authors %}
|
| 14 |
+
{% set authors_list = paper.authors.split(", ") %}
|
| 15 |
+
{% endif %}
|
| 16 |
+
|
| 17 |
{# Category badge colour mapping #}
|
| 18 |
{% set cat = paper.category | default("") %}
|
| 19 |
{% if cat.startswith("cs.") %}
|
|
|
|
| 48 |
{% endif %}
|
| 49 |
</div>
|
| 50 |
|
| 51 |
+
<!-- Meta: arXiv ID + year + authors (max 3) + citations -->
|
| 52 |
<div class="text-xs text-base-content/50 mono">
|
| 53 |
[{{ paper.arxiv_id }}]
|
| 54 |
{% if paper.published %} · {{ paper.published[:4] }}{% endif %}
|
| 55 |
+
{% if authors_list %} · <span class="font-sans">{{ authors_list[:3] | join(", ") }}{% if authors_list | length > 3 %} et al. ({{ authors_list | length }} authors){% endif %}</span>{% endif %}
|
| 56 |
{% if paper.citation_count %}
|
| 57 |
· <span class="font-medium text-base-content/70 font-sans" title="{{ paper.influential_citations|default(0) }} influential">π {{ paper.citation_count }} citations</span>
|
| 58 |
{% endif %}
|
| 59 |
</div>
|
| 60 |
|
| 61 |
+
<!-- Abstract (hard-truncated to 500 chars + CSS 3-line clamp) -->
|
| 62 |
+
<p class="text-sm text-base-content/75" style="display: -webkit-box; -webkit-line-clamp: 3; -webkit-box-orient: vertical; overflow: hidden;">
|
| 63 |
+
{{ paper.abstract[:500] }}{% if paper.abstract | length > 500 %}…{% endif %}
|
| 64 |
</p>
|
| 65 |
|
| 66 |
<!-- Action buttons (HTMX-powered, swap themselves on click) -->
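The author and abstract truncation in this card can be checked outside the template. The sketch below mirrors the Jinja logic in plain Python. It assumes the custom `tojson_parse` filter behaves roughly like `json.loads` with an empty-list fallback, which the diff does not confirm; `format_authors` and `truncate_abstract` are hypothetical helper names, not functions from the repository.

```python
import json


def format_authors(raw_authors, max_shown=3):
    """Build the card's author line: at most `max_shown` names, then 'et al.'."""
    # Assumption: tojson_parse ~ json.loads, yielding [] on bad input.
    try:
        authors = json.loads(raw_authors) if raw_authors else []
    except (TypeError, ValueError):
        authors = []
    if not isinstance(authors, list):
        authors = []
    # Fallback from the template: a non-empty plain string is split on ", ".
    if not authors and isinstance(raw_authors, str) and raw_authors:
        authors = raw_authors.split(", ")
    shown = ", ".join(authors[:max_shown])
    if len(authors) > max_shown:
        shown += f" et al. ({len(authors)} authors)"
    return shown


def truncate_abstract(abstract, max_chars=500):
    """Hard-truncate the abstract and append an ellipsis only when text was cut."""
    abstract = abstract or ""
    if len(abstract) > max_chars:
        return abstract[:max_chars] + "…"
    return abstract
```

The hard character limit bounds the markup size even for very long abstracts, while the CSS line clamp in the template still controls how many lines are visible.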
|
|
@@ -13,6 +13,40 @@
|
|
| 13 |
{% include "partials/paper_card.html" %}
|
| 14 |
{% endfor %}
|
| 15 |
</div>
|
| 16 |
<!-- Refresh button - lets user reload recs after saving more papers -->
|
| 17 |
<div class="text-center pt-3">
|
| 18 |
<button class="btn btn-ghost btn-sm"
|
|
|
|
| 13 |
{% include "partials/paper_card.html" %}
|
| 14 |
{% endfor %}
|
| 15 |
</div>
|
| 16 |
+
|
| 17 |
+
{# Pipeline timing breakdown #}
|
| 18 |
+
{% if timing is defined and timing %}
|
| 19 |
+
<div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
|
| 20 |
+
<div class="flex items-center gap-2 mb-2">
|
| 21 |
+
<span class="text-xs font-semibold text-base-content/60">⚡ Recommendation Pipeline Breakdown</span>
|
| 22 |
+
</div>
|
| 23 |
+
<div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
|
| 24 |
+
{% if timing.clustering_ms is defined %}
|
| 25 |
+
<span>Ward Clustering: <span class="text-primary">{{ timing.clustering_ms }}ms</span></span>
|
| 26 |
+
{% endif %}
|
| 27 |
+
{% if timing.ann_retrieval_ms is defined %}
|
| 28 |
+
<span>ANN Retrieval: <span class="text-primary">{{ timing.ann_retrieval_ms }}ms</span></span>
|
| 29 |
+
{% endif %}
|
| 30 |
+
{% if timing.candidate_meta_ms is defined %}
|
| 31 |
+
<span>Candidate Meta: <span class="text-primary">{{ timing.candidate_meta_ms }}ms</span></span>
|
| 32 |
+
{% endif %}
|
| 33 |
+
{% if rerank_time_ms is defined %}
|
| 34 |
+
<span>LightGBM Rerank: <span class="text-primary">{{ rerank_time_ms }}ms</span></span>
|
| 35 |
+
{% endif %}
|
| 36 |
+
{% if timing.mmr_ms is defined %}
|
| 37 |
+
<span>MMR Diversity: <span class="text-primary">{{ timing.mmr_ms }}ms</span></span>
|
| 38 |
+
{% endif %}
|
| 39 |
+
{% if meta_time_ms is defined %}
|
| 40 |
+
<span>Final Metadata: <span class="text-primary">{{ meta_time_ms }}ms</span></span>
|
| 41 |
+
{% endif %}
|
| 42 |
+
</div>
|
| 43 |
+
</div>
|
| 44 |
+
{% elif rerank_time_ms is defined and meta_time_ms is defined %}
|
| 45 |
+
<div class="text-center pt-2 pb-1 text-xs text-base-content/40 font-mono">
|
| 46 |
+
⚡ Reranking: {{ rerank_time_ms }}ms | Metadata: {{ meta_time_ms }}ms
|
| 47 |
+
</div>
|
| 48 |
+
{% endif %}
|
| 49 |
+
|
| 50 |
<!-- Refresh button - lets user reload recs after saving more papers -->
|
| 51 |
<div class="text-center pt-3">
|
| 52 |
<button class="btn btn-ghost btn-sm"
|
|
@@ -1,15 +1,91 @@
|
|
| 1 |
{# Partial: list of search result cards #}
|
| 2 |
{% if papers %}
|
| 3 |
<div class="space-y-3">
|
| 4 |
-
<
|
| 5 |
{% for paper in papers %}
|
| 6 |
{% set position = loop.index0 %}
|
| 7 |
{% set source = "search" %}
|
| 8 |
{% include "partials/paper_card.html" %}
|
| 9 |
{% endfor %}
|
| 10 |
</div>
|
| 11 |
{% elif query %}
|
| 12 |
<div class="text-center text-base-content/40 py-10">
|
| 13 |
-
No results found for "{{ query }}"
|
| 14 |
</div>
|
| 15 |
{% endif %}
|
|
|
|
| 1 |
{# Partial: list of search result cards #}
|
| 2 |
{% if papers %}
|
| 3 |
<div class="space-y-3">
|
| 4 |
+
<div class="flex flex-col gap-1 mb-4">
|
| 5 |
+
<div class="flex justify-between items-center text-sm text-base-content/50">
|
| 6 |
+
<span>{{ papers | length }} results for "{{ query }}"</span>
|
| 7 |
+
{% if search_meta and search_meta.total_time_ms is defined %}
|
| 8 |
+
<span>Search completed in {{ search_meta.total_time_ms }}ms</span>
|
| 9 |
+
{% endif %}
|
| 10 |
+
</div>
|
| 11 |
+
|
| 12 |
+
{# Groq rewrite result - show both rewritten AND skipped cases #}
|
| 13 |
+
{% if search_meta %}
|
| 14 |
+
{% if search_meta.rewritten_query %}
|
| 15 |
+
<div class="alert bg-base-200 border-l-4 border-primary p-3 text-sm flex gap-2">
|
| 16 |
+
<svg xmlns="http://www.w3.org/2000/svg" class="stroke-primary shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
|
| 17 |
+
<div class="flex-1">
|
| 18 |
+
<span class="font-semibold">Groq expanded query:</span> "{{ search_meta.rewritten_query }}"
|
| 19 |
+
<span class="text-xs text-base-content/50 ml-2">({{ search_meta.groq_time_ms }}ms)</span>
|
| 20 |
+
</div>
|
| 21 |
+
</div>
|
| 22 |
+
{% elif search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
|
| 23 |
+
<div class="alert bg-base-200/50 border-l-4 border-base-300 p-3 text-sm flex gap-2">
|
| 24 |
+
<svg xmlns="http://www.w3.org/2000/svg" class="stroke-base-content/30 shrink-0 h-5 w-5" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
|
| 25 |
+
<div class="flex-1 text-base-content/50">
|
| 26 |
+
<span class="font-semibold">Groq rewrite:</span> {{ search_meta.groq_status }}
|
| 27 |
+
- searching with original query as-is
|
| 28 |
+
</div>
|
| 29 |
+
</div>
|
| 30 |
+
{% endif %}
|
| 31 |
+
{% endif %}
|
| 32 |
+
</div>
|
| 33 |
+
|
| 34 |
{% for paper in papers %}
|
| 35 |
{% set position = loop.index0 %}
|
| 36 |
{% set source = "search" %}
|
| 37 |
{% include "partials/paper_card.html" %}
|
| 38 |
{% endfor %}
|
| 39 |
</div>
|
| 40 |
+
|
| 41 |
+
{# Pipeline timing breakdown #}
|
| 42 |
+
{% if search_meta %}
|
| 43 |
+
<div class="mt-4 p-3 rounded-lg bg-base-200/50 border border-base-300/30">
|
| 44 |
+
<div class="flex items-center gap-2 mb-2">
|
| 45 |
+
<span class="text-xs font-semibold text-base-content/60">⚡ Search Pipeline Breakdown</span>
|
| 46 |
+
{% if search_meta.total_time_ms is defined %}
|
| 47 |
+
<span class="text-xs text-base-content/40">({{ search_meta.total_time_ms }}ms total)</span>
|
| 48 |
+
{% endif %}
|
| 49 |
+
</div>
|
| 50 |
+
<div class="flex flex-wrap gap-x-4 gap-y-1 text-xs font-mono text-base-content/50">
|
| 51 |
+
{% if search_meta.groq_time_ms is defined %}
|
| 52 |
+
<span>Groq Rewrite: <span class="text-primary">{{ search_meta.groq_time_ms }}ms</span>
|
| 53 |
+
{% if search_meta.groq_status is defined and search_meta.groq_status != 'rewritten' %}
|
| 54 |
+
<span class="text-warning/60">({{ search_meta.groq_status }})</span>
|
| 55 |
+
{% endif %}
|
| 56 |
+
</span>
|
| 57 |
+
{% endif %}
|
| 58 |
+
{% if search_meta.encode_time_ms is defined %}
|
| 59 |
+
<span>BGE-M3 Encode: <span class="text-primary">{{ search_meta.encode_time_ms }}ms</span></span>
|
| 60 |
+
{% endif %}
|
| 61 |
+
{% if search_meta.retrieval_time_ms is defined %}
|
| 62 |
+
<span>Qdrant+Zilliz Retrieval: <span class="text-primary">{{ search_meta.retrieval_time_ms }}ms</span>
|
| 63 |
+
{% if search_meta.n_retrieval_tasks is defined %}
|
| 64 |
+
<span class="text-base-content/30">({{ search_meta.n_retrieval_tasks }} parallel tasks)</span>
|
| 65 |
+
{% endif %}
|
| 66 |
+
</span>
|
| 67 |
+
{% endif %}
|
| 68 |
+
{% if search_meta.rrf_time_ms is defined %}
|
| 69 |
+
<span>RRF Fusion: <span class="text-primary">{{ search_meta.rrf_time_ms }}ms</span></span>
|
| 70 |
+
{% endif %}
|
| 71 |
+
{% if search_meta.turso_boost_fetch_ms is defined %}
|
| 72 |
+
<span>Turso Title Fetch: <span class="text-primary">{{ search_meta.turso_boost_fetch_ms }}ms</span></span>
|
| 73 |
+
<span>Rerank Compute: <span class="text-primary">{{ search_meta.rerank_compute_ms }}ms</span></span>
|
| 74 |
+
{% elif search_meta.rerank_time_ms is defined %}
|
| 75 |
+
<span>Title+Citation Rerank: <span class="text-primary">{{ search_meta.rerank_time_ms }}ms</span></span>
|
| 76 |
+
{% endif %}
|
| 77 |
+
{% if search_meta.meta_time_ms is defined %}
|
| 78 |
+
<span>Final Metadata: <span class="text-primary">{{ search_meta.meta_time_ms }}ms</span></span>
|
| 79 |
+
{% endif %}
|
| 80 |
+
</div>
|
| 81 |
+
</div>
|
| 82 |
+
{% endif %}
|
| 83 |
+
|
| 84 |
{% elif query %}
|
| 85 |
<div class="text-center text-base-content/40 py-10">
|
| 86 |
+
<p>No results found for "{{ query }}"</p>
|
| 87 |
+
{% if search_meta and search_meta.total_time_ms is defined %}
|
| 88 |
+
<p class="text-xs mt-2">Search completed in {{ search_meta.total_time_ms }}ms</p>
|
| 89 |
+
{% endif %}
|
| 90 |
</div>
|
| 91 |
{% endif %}
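The template above only reads timing keys from `search_meta`; the diff excerpt does not show how they are produced. One way such per-stage timings could be collected is sketched below. `StageTimer` is a hypothetical helper, and the `time.sleep` calls stand in for the real pipeline stages; only the key names (`encode_time_ms`, `retrieval_time_ms`, `total_time_ms`) are taken from the template.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage wall-clock timings in milliseconds (hypothetical helper)."""

    def __init__(self):
        self.meta = {}
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, key):
        # Time the enclosed block and store the result under `key`,
        # even if the block raises.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.meta[key] = round((time.perf_counter() - t0) * 1000)

    def finish(self):
        # Total time since the timer was created.
        self.meta["total_time_ms"] = round((time.perf_counter() - self._start) * 1000)
        return self.meta


timer = StageTimer()
with timer.stage("encode_time_ms"):
    time.sleep(0.01)  # stand-in for the query-encoding step
with timer.stage("retrieval_time_ms"):
    time.sleep(0.01)  # stand-in for the vector retrieval step
search_meta = timer.finish()
```

Because the template guards every field with `is defined`, a stage that is skipped simply leaves its key out of the dict and the corresponding label is not rendered.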
|
|
@@ -0,0 +1,41 @@
|
|
| 1 |
+
{#
|
| 2 |
+
Seed search results - inner partial, swapped into #seed-results by HTMX.
|
| 3 |
+
Expects:
|
| 4 |
+
papers - list[dict] (optional)
|
| 5 |
+
query - str (optional)
|
| 6 |
+
#}
|
| 7 |
+
{% if papers is defined and papers %}
|
| 8 |
+
{% for paper in papers %}
|
| 9 |
+
<div class="seed-card flex items-start justify-between gap-3"
|
| 10 |
+
id="seed-paper-{{ paper.arxiv_id }}">
|
| 11 |
+
<div class="flex-1 min-w-0">
|
| 12 |
+
<a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
|
| 13 |
+
target="_blank" rel="noopener"
|
| 14 |
+
class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
|
| 15 |
+
{{ paper.title }}
|
| 16 |
+
</a>
|
| 17 |
+
<div class="text-xs text-base-content/50 mt-0.5">
|
| 18 |
+
[{{ paper.arxiv_id }}]
|
| 19 |
+
{% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
|
| 20 |
+
{% if paper.citation_count %} · π {{ paper.citation_count }}{% endif %}
|
| 21 |
+
</div>
|
| 22 |
+
</div>
|
| 23 |
+
<button class="btn btn-primary btn-xs shrink-0"
|
| 24 |
+
hx-post="/api/papers/{{ paper.arxiv_id }}/save"
|
| 25 |
+
hx-target="#seed-paper-{{ paper.arxiv_id }}"
|
| 26 |
+
hx-swap="outerHTML"
|
| 27 |
+
hx-vals='{"source": "onboarding"}'
|
| 28 |
+
onclick="bumpSeedCount()">
|
| 29 |
+
β Save
|
| 30 |
+
</button>
|
| 31 |
+
</div>
|
| 32 |
+
{% endfor %}
|
| 33 |
+
{% elif query is defined and query %}
|
| 34 |
+
<p class="text-center text-base-content/40 py-6 text-sm">
|
| 35 |
+
No results found for "{{ query }}"
|
| 36 |
+
</p>
|
| 37 |
+
{% else %}
|
| 38 |
+
<p class="text-center text-base-content/30 py-6 text-sm">
|
| 39 |
+
Search above to find papers in your research area
|
| 40 |
+
</p>
|
| 41 |
+
{% endif %}
|
|
@@ -15,30 +15,6 @@
|
|
| 15 |
</p>
|
| 16 |
</div>
|
| 17 |
|
| 18 |
-
{# Phase 5.1: Quick author import #}
|
| 19 |
-
<div class="mb-4 p-3 bg-base-200/50 rounded-lg">
|
| 20 |
-
<p class="text-xs font-medium text-base-content/70 mb-2">
|
| 21 |
-
⚡ Quick import: Paste your Semantic Scholar profile URL to auto-import papers
|
| 22 |
-
</p>
|
| 23 |
-
<form hx-post="/api/onboarding/import-author"
|
| 24 |
-
hx-target="#import-result"
|
| 25 |
-
hx-swap="innerHTML"
|
| 26 |
-
hx-indicator="#import-spinner"
|
| 27 |
-
class="flex gap-2">
|
| 28 |
-
<input type="text"
|
| 29 |
-
name="author_url"
|
| 30 |
-
placeholder="e.g. https://www.semanticscholar.org/author/…/1234567"
|
| 31 |
-
class="input input-bordered input-sm flex-1 text-xs" />
|
| 32 |
-
<button class="btn btn-secondary btn-sm" type="submit">
|
| 33 |
-
Import
|
| 34 |
-
<span id="import-spinner" class="htmx-indicator loading loading-spinner loading-xs ml-1"></span>
|
| 35 |
-
</button>
|
| 36 |
-
</form>
|
| 37 |
-
<div id="import-result" class="mt-2"></div>
|
| 38 |
-
</div>
|
| 39 |
-
|
| 40 |
-
<div class="divider text-xs text-base-content/40">OR search manually</div>
|
| 41 |
-
|
| 42 |
{# Search bar #}
|
| 43 |
<div class="mb-4">
|
| 44 |
<form hx-get="/api/onboarding/seed-search"
|
|
@@ -68,43 +44,9 @@
|
|
| 68 |
</div>
|
| 69 |
</div>
|
| 70 |
|
| 71 |
-
{# Search results #}
|
| 72 |
<div id="seed-results" class="space-y-2 mb-6">
|
| 73 |
-
{%
|
| 74 |
-
{% for paper in papers %}
|
| 75 |
-
<div class="seed-card flex items-start justify-between gap-3"
|
| 76 |
-
id="seed-paper-{{ paper.arxiv_id }}">
|
| 77 |
-
<div class="flex-1 min-w-0">
|
| 78 |
-
<a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
|
| 79 |
-
target="_blank" rel="noopener"
|
| 80 |
-
class="font-medium text-sm text-primary hover:underline leading-snug line-clamp-1">
|
| 81 |
-
{{ paper.title }}
|
| 82 |
-
</a>
|
| 83 |
-
<div class="text-xs text-base-content/50 mt-0.5">
|
| 84 |
-
[{{ paper.arxiv_id }}]
|
| 85 |
-
{% if paper.category %} · <span class="cat-badge cat-cs">{{ paper.category }}</span>{% endif %}
|
| 86 |
-
{% if paper.citation_count %} · π {{ paper.citation_count }}{% endif %}
|
| 87 |
-
</div>
|
| 88 |
-
</div>
|
| 89 |
-
<button class="btn btn-primary btn-xs shrink-0"
|
| 90 |
-
hx-post="/api/papers/{{ paper.arxiv_id }}/save"
|
| 91 |
-
hx-target="#seed-paper-{{ paper.arxiv_id }}"
|
| 92 |
-
hx-swap="outerHTML"
|
| 93 |
-
hx-vals='{"source": "onboarding"}'
|
| 94 |
-
onclick="bumpSeedCount()">
|
| 95 |
-
β Save
|
| 96 |
-
</button>
|
| 97 |
-
</div>
|
| 98 |
-
{% endfor %}
|
| 99 |
-
{% elif query is defined and query %}
|
| 100 |
-
<p class="text-center text-base-content/40 py-6 text-sm">
|
| 101 |
-
No results found for "{{ query }}"
|
| 102 |
-
</p>
|
| 103 |
-
{% else %}
|
| 104 |
-
<p class="text-center text-base-content/30 py-6 text-sm">
|
| 105 |
-
Search above to find papers in your research area
|
| 106 |
-
</p>
|
| 107 |
-
{% endif %}
|
| 108 |
</div>
|
| 109 |
|
| 110 |
{# Done / Skip buttons #}
|
|
|
|
| 15 |
</p>
|
| 16 |
</div>
|
| 17 |
|
| 18 |
{# Search bar #}
|
| 19 |
<div class="mb-4">
|
| 20 |
<form hx-get="/api/onboarding/seed-search"
|
|
|
|
| 44 |
</div>
|
| 45 |
</div>
|
| 46 |
|
| 47 |
+
{# Search results - inner div is the HTMX swap target #}
|
| 48 |
<div id="seed-results" class="space-y-2 mb-6">
|
| 49 |
+
{% include "partials/seed_results.html" %}
|
| 50 |
</div>
|
| 51 |
|
| 52 |
{# Done / Skip buttons #}
|
|
@@ -7,10 +7,9 @@
|
|
| 7 |
|
| 8 |
<!-- Search bar -->
|
| 9 |
<div class="card bg-base-100 shadow-md rounded-xl p-4">
|
| 10 |
-
<form hx-get="/search"
|
| 11 |
-
hx-target="#search-results"
|
| 12 |
hx-push-url="true"
|
| 13 |
-
hx-indicator="#search-spinner"
|
| 14 |
class="flex gap-2">
|
| 15 |
<input type="text"
|
| 16 |
name="q"
|
|
@@ -18,16 +17,38 @@
|
|
| 18 |
placeholder="Search arXiv papers…"
|
| 19 |
class="input input-bordered flex-1"
|
| 20 |
autofocus />
|
| 21 |
-
<button class="btn btn-primary" type="submit">
|
| 22 |
-
Search
|
| 23 |
-
<span
|
| 24 |
</button>
|
| 25 |
</form>
|
| 26 |
</div>
|
| 27 |
|
| 28 |
-
<!--
|
| 29 |
-
|
| 30 |
-
|
| 31 |
<h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
|
| 32 |
<div id="rec-section"
|
| 33 |
hx-get="/api/recommendations"
|
|
@@ -47,4 +68,29 @@
|
|
| 47 |
</div>
|
| 48 |
|
| 49 |
</div>
|
| 50 |
{% endblock %}
|
|
|
|
| 7 |
|
| 8 |
<!-- Search bar -->
|
| 9 |
<div class="card bg-base-100 shadow-md rounded-xl p-4">
|
| 10 |
+
<form hx-get="/search"
|
| 11 |
+
hx-target="#search-results"
|
| 12 |
hx-push-url="true"
|
|
|
|
| 13 |
class="flex gap-2">
|
| 14 |
<input type="text"
|
| 15 |
name="q"
|
|
|
|
| 17 |
placeholder="Search arXiv papers…"
|
| 18 |
class="input input-bordered flex-1"
|
| 19 |
autofocus />
|
| 20 |
+
<button class="btn btn-primary flex items-center gap-2" type="submit">
|
| 21 |
+
<span class="search-btn-text">Search</span>
|
| 22 |
+
<span class="search-btn-loading hidden">
|
| 23 |
+
<span class="loading loading-spinner loading-sm"></span>
|
| 24 |
+
Searching…
|
| 25 |
+
</span>
|
| 26 |
</button>
|
| 27 |
</form>
|
| 28 |
</div>
|
| 29 |
|
| 30 |
+
<!-- Loading overlay (outside search-results so it doesn't get swapped away) -->
|
| 31 |
+
<div id="search-loading" class="hidden">
|
| 32 |
+
<div class="flex flex-col items-center justify-center py-16 gap-4">
|
| 33 |
+
<span class="loading loading-ring loading-lg text-primary"></span>
|
| 34 |
+
<div class="text-sm text-base-content/60 animate-pulse">
|
| 35 |
+
Searching 1.6M papers across Qdrant + Zilliz…
|
| 36 |
+
</div>
|
| 37 |
+
<div class="flex gap-6 text-xs text-base-content/40 font-mono">
|
| 38 |
+
<span>Groq rewriting</span>
|
| 39 |
+
<span>→</span>
|
| 40 |
+
<span>BGE-M3 encoding</span>
|
| 41 |
+
<span>→</span>
|
| 42 |
+
<span>Vector retrieval</span>
|
| 43 |
+
<span>→</span>
|
| 44 |
+
<span>RRF + reranking</span>
|
| 45 |
+
</div>
|
| 46 |
+
</div>
|
| 47 |
+
</div>
|
| 48 |
+
|
| 49 |
+
<!-- Recommendations - only when not actively searching -->
|
| 50 |
+
{% if has_recs and not query %}
|
| 51 |
+
<div id="rec-wrapper">
|
| 52 |
<h2 class="text-lg font-semibold mb-3">Recommended for You</h2>
|
| 53 |
<div id="rec-section"
|
| 54 |
hx-get="/api/recommendations"
|
|
|
|
| 68 |
</div>
|
| 69 |
|
| 70 |
</div>
|
| 71 |
+
|
| 72 |
+
<script>
|
| 73 |
+
// Show/hide loading overlay + HIDE recommendations when searching
|
| 74 |
+
document.body.addEventListener('htmx:beforeRequest', function(evt) {
|
| 75 |
+
if (evt.detail.target && evt.detail.target.id === 'search-results') {
|
| 76 |
+
document.getElementById('search-loading').classList.remove('hidden');
|
| 77 |
+
document.getElementById('search-results').classList.add('opacity-30');
|
| 78 |
+
// Hide recommendations section when a search starts
|
| 79 |
+
var recWrapper = document.getElementById('rec-wrapper');
|
| 80 |
+
if (recWrapper) recWrapper.classList.add('hidden');
|
| 81 |
+
// Swap button text
|
| 82 |
+
document.querySelectorAll('.search-btn-text').forEach(el => el.classList.add('hidden'));
|
| 83 |
+
document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.remove('hidden'));
|
| 84 |
+
}
|
| 85 |
+
});
|
| 86 |
+
document.body.addEventListener('htmx:afterRequest', function(evt) {
|
| 87 |
+
if (evt.detail.target && evt.detail.target.id === 'search-results') {
|
| 88 |
+
document.getElementById('search-loading').classList.add('hidden');
|
| 89 |
+
document.getElementById('search-results').classList.remove('opacity-30');
|
| 90 |
+
// Restore button text
|
| 91 |
+
document.querySelectorAll('.search-btn-text').forEach(el => el.classList.remove('hidden'));
|
| 92 |
+
document.querySelectorAll('.search-btn-loading').forEach(el => el.classList.add('hidden'));
|
| 93 |
+
}
|
| 94 |
+
});
|
| 95 |
+
</script>
|
| 96 |
{% endblock %}
|
|
@@ -15,12 +15,65 @@ from __future__ import annotations
|
|
| 15 |
|
| 16 |
import json
|
| 17 |
import time
|
|
|
|
| 18 |
|
| 19 |
import httpx
|
| 20 |
|
| 21 |
from app import config
|
| 22 |
|
| 23 |
|
| 24 |
# ── Public API ───────────────────────────────────────────────────────────────
|
| 25 |
|
| 26 |
async def fetch_metadata(arxiv_id: str) -> dict | None:
|
|
@@ -37,11 +90,31 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
|
|
| 37 |
Paper dict has keys: arxiv_id, title, abstract, authors, category,
|
| 38 |
published, year, citation_count, influential_citations.
|
| 39 |
|
| 40 |
-
|
| 41 |
"""
|
| 42 |
if not arxiv_ids:
|
| 43 |
return {}
|
| 44 |
|
| 45 |
url = config.TURSO_URL
|
| 46 |
token = config.TURSO_DB_TOKEN
|
| 47 |
|
|
@@ -133,6 +206,7 @@ async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
|
|
| 133 |
paper = _to_paper_dict(values)
|
| 134 |
if paper:
|
| 135 |
output[paper["arxiv_id"]] = paper
|
|
|
|
| 136 |
|
| 137 |
return output
|
| 138 |
|
|
@@ -211,27 +285,52 @@ async def fetch_trending_by_categories(
|
|
| 211 |
Fetch recently published, high-citation papers from Turso DB
|
| 212 |
filtered by arXiv categories. Used as Tier 0 popularity fallback
|
| 213 |
for onboarded users with zero saves.
|
| 214 |
"""
|
| 215 |
if not categories:
|
| 216 |
return []
|
| 217 |
|
| 218 |
url = config.TURSO_URL
|
| 219 |
token = config.TURSO_DB_TOKEN
|
| 220 |
if not url or not token:
|
| 221 |
return []
|
| 222 |
|
| 223 |
-
|
| 224 |
-
|
| 225 |
sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
|
| 226 |
update_date, abstract_preview, citation_count, influential_citations
|
| 227 |
FROM papers
|
| 228 |
-
WHERE
|
| 229 |
AND citation_count > 0
|
| 230 |
ORDER BY citation_count DESC, update_date DESC
|
| 231 |
LIMIT ?"""
|
| 232 |
|
| 233 |
-
|
| 234 |
-
args = [{"type": "text", "value": c} for c in cat_list]
|
| 235 |
args.append({"type": "integer", "value": str(limit)})
|
| 236 |
|
| 237 |
pipeline_url = url.rstrip("/")
|
|
@@ -254,16 +353,29 @@ async def fetch_trending_by_categories(
|
|
| 254 |
"Content-Type": "application/json",
|
| 255 |
}
|
| 256 |
|
| 257 |
try:
|
| 258 |
-
async with httpx.AsyncClient(timeout=
|
| 259 |
resp = await client.post(
|
| 260 |
f"{pipeline_url}/v2/pipeline",
|
| 261 |
json=payload,
|
| 262 |
headers=headers,
|
| 263 |
)
|
| 264 |
resp.raise_for_status()
|
| 265 |
except Exception as e:
|
| 266 |
-
print(f"[turso] trending
|
| 267 |
return []
|
| 268 |
|
| 269 |
try:
|
|
@@ -282,7 +394,7 @@ async def fetch_trending_by_categories(
|
|
| 282 |
cols = [c["name"] for c in result_data.get("cols", [])]
|
| 283 |
rows = result_data.get("rows", [])
|
| 284 |
except (KeyError, IndexError, TypeError) as e:
|
| 285 |
-
print(f"[turso] trending parse error: {e}")
|
| 286 |
return []
|
| 287 |
|
| 288 |
papers = []
|
|
@@ -299,4 +411,10 @@ async def fetch_trending_by_categories(
|
|
| 299 |
papers.append(paper)
|
| 300 |
|
| 301 |
print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
|
| 302 |
return papers
|
|
|
|
| 15 |
|
| 16 |
import json
|
| 17 |
import time
|
| 18 |
+
from collections import OrderedDict
|
| 19 |
|
| 20 |
import httpx
|
| 21 |
|
| 22 |
from app import config
|
| 23 |
|
| 24 |
|
| 25 |
+
# ── In-process metadata cache ────────────────────────────────────────────────
|
| 26 |
+
#
|
| 27 |
+
# Recommendations + search both fetch metadata for hundreds of arxiv_ids per
|
| 28 |
+
# request, often the same well-known papers across users. Each round-trip is
|
| 29 |
+
# 1-3s on a 1.6M-row libSQL DB. An in-process LRU absorbs the repeats.
|
| 30 |
+
#
|
| 31 |
+
# Trade-offs:
|
| 32 |
+
# - Asyncio is single-threaded, no lock needed.
|
| 33 |
+
# - Paper title/abstract/authors are effectively immutable for our use,
|
| 34 |
+
# so we don't TTL-expire metadata. citation_count drifts but is only
|
| 35 |
+
# used for display ranking; staleness is fine.
|
| 36 |
+
# - 50K capacity at ~1KB per row -> ~50MB RAM ceiling.
|
| 37 |
+
|
| 38 |
+
_METADATA_CACHE: "OrderedDict[str, dict]" = OrderedDict()
|
| 39 |
+
_METADATA_CACHE_MAX = 50_000
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def _cache_get(arxiv_id: str) -> dict | None:
|
| 43 |
+
val = _METADATA_CACHE.get(arxiv_id)
|
| 44 |
+
if val is not None:
|
| 45 |
+
# Mark as MRU
|
| 46 |
+
_METADATA_CACHE.move_to_end(arxiv_id)
|
| 47 |
+
return val
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def _cache_put(arxiv_id: str, paper: dict) -> None:
|
| 51 |
+
if arxiv_id in _METADATA_CACHE:
|
| 52 |
+
_METADATA_CACHE.move_to_end(arxiv_id)
|
| 53 |
+
_METADATA_CACHE[arxiv_id] = paper
|
| 54 |
+
return
|
| 55 |
+
_METADATA_CACHE[arxiv_id] = paper
|
| 56 |
+
if len(_METADATA_CACHE) > _METADATA_CACHE_MAX:
|
| 57 |
+
# Evict LRU
|
| 58 |
+
_METADATA_CACHE.popitem(last=False)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def metadata_cache_stats() -> dict:
|
| 62 |
+
"""For diagnostics: current cache size and max."""
|
| 63 |
+
return {"size": len(_METADATA_CACHE), "max": _METADATA_CACHE_MAX}
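To see the eviction order of an `OrderedDict`-based LRU in isolation, the same idea can be run with a tiny capacity. This is a minimal sketch, not the module's code: `LRUCache` is a hypothetical class wrapping the `move_to_end` / `popitem(last=False)` pattern used above, and the keys are placeholders rather than real arXiv IDs.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU built on OrderedDict; the oldest entry sits at the front."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        val = self._data.get(key)
        if val is not None:
            # A hit marks the key as most recently used.
            self._data.move_to_end(key)
        return val

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            # Evict the least recently used entry (front of the dict).
            self._data.popitem(last=False)


cache = LRUCache(max_size=2)
cache.put("a", {"title": "paper A"})
cache.put("b", {"title": "paper B"})
cache.get("a")                         # touch "a", so "b" is now the LRU entry
cache.put("c", {"title": "paper C"})   # over capacity: "b" is evicted
```

After the last `put`, looking up `"b"` misses while `"a"` and `"c"` remain, which is the behaviour the 50K-entry metadata cache relies on to keep frequently requested papers resident.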
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
# ── In-process trending cache ────────────────────────────────────────────────
|
| 67 |
+
#
|
| 68 |
+
# Trending is filter-by-LIKE on 1.6M rows -> ~15s cold. Onboarding has a
|
| 69 |
+
# small fixed set of category combinations, and citation counts barely
|
| 70 |
+
# change minute-to-minute. A short TTL converts the 15s wait into a
|
| 71 |
+
# one-time hit per category combo.
|
| 72 |
+
|
| 73 |
+
_TRENDING_CACHE: dict[tuple, tuple[float, list[dict]]] = {}
|
| 74 |
+
_TRENDING_TTL_SECONDS = 60 * 60 # 1 hour
|
| 75 |
+
|
| 76 |
+
|
| 77 |
# ── Public API ───────────────────────────────────────────────────────────────
|
| 78 |
|
| 79 |
async def fetch_metadata(arxiv_id: str) -> dict | None:
|
|
|
|
| 90 |
Paper dict has keys: arxiv_id, title, abstract, authors, category,
|
| 91 |
published, year, citation_count, influential_citations.
|
| 92 |
|
| 93 |
+
First checks the in-process LRU cache; only un-cached IDs hit the network.
|
| 94 |
"""
|
| 95 |
if not arxiv_ids:
|
| 96 |
return {}
|
| 97 |
|
| 98 |
+
# Cache check - pull anything already-known up front.
|
| 99 |
+
output: dict[str, dict] = {}
|
| 100 |
+
misses: list[str] = []
|
| 101 |
+
for aid in arxiv_ids:
|
| 102 |
+
cached = _cache_get(aid)
|
| 103 |
+
if cached is not None:
|
| 104 |
+
output[aid] = cached
|
| 105 |
+
else:
|
| 106 |
+
misses.append(aid)
|
| 107 |
+
|
| 108 |
+
if not misses:
|
| 109 |
+
return output
|
| 110 |
+
|
| 111 |
+
fetched = await _fetch_metadata_batch_uncached(misses)
|
| 112 |
+
output.update(fetched)
|
| 113 |
+
return output
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
async def _fetch_metadata_batch_uncached(arxiv_ids: list[str]) -> dict[str, dict]:
|
| 117 |
+
"""Network fetch for IDs we don't already have cached."""
|
| 118 |
url = config.TURSO_URL
|
| 119 |
token = config.TURSO_DB_TOKEN
|
| 120 |
|
|
|
|
| 206 |
paper = _to_paper_dict(values)
|
| 207 |
if paper:
|
| 208 |
output[paper["arxiv_id"]] = paper
|
| 209 |
+
_cache_put(paper["arxiv_id"], paper)
|
| 210 |
|
| 211 |
return output
|
| 212 |
|
|
|
|
| 285 |
Fetch recently published, high-citation papers from Turso DB
|
| 286 |
filtered by arXiv categories. Used as Tier 0 popularity fallback
|
| 287 |
for onboarded users with zero saves.
|
| 288 |
+
|
| 289 |
+
Cached in-process (1 hour TTL): citation counts barely change
|
| 290 |
+
minute-to-minute, and onboarding has a small fixed set of category
|
| 291 |
+
combinations, so the first cold-start hit pays the ~15s LIKE-scan
|
| 292 |
+
cost once and subsequent users get an instant hit.
|
| 293 |
+
|
| 294 |
+
Filter strategy:
|
| 295 |
+
Turso's `primary_topic` column stores friendly labels like
|
| 296 |
+
"AI/ML" / "Computer Vision" (NOT arxiv codes), and the mapping
|
| 297 |
+
from arxiv code to friendly label is not 1:1 (e.g. Vaswani's
|
| 298 |
+
cs.CL paper is labeled "AI/ML" while BERT's cs.CL paper is
|
| 299 |
+
labeled "NLP/Computational Linguistics"). The `categories`
|
| 300 |
+
column, however, contains the real space-separated arxiv codes
|
| 301 |
+
("cs.CL cs.LG"). So we filter via LIKE on `categories`.
|
| 302 |
+
|
| 303 |
+
Performance: LIKE '%cs.XX%' with leading wildcard skips the index,
|
| 304 |
+
but Turso's `citation_count > 0` filter + ORDER BY citation_count
|
| 305 |
+
narrows the scan, and trending is not a hot path.
|
| 306 |
"""
|
| 307 |
if not categories:
|
| 308 |
return []
|
| 309 |
|
| 310 |
+
cache_key = (tuple(sorted(categories)), limit)
|
| 311 |
+
cached = _TRENDING_CACHE.get(cache_key)
|
| 312 |
+
if cached is not None and (time.time() - cached[0]) < _TRENDING_TTL_SECONDS:
|
| 313 |
+
return cached[1]
|
| 314 |
+
|
| 315 |
url = config.TURSO_URL
|
| 316 |
token = config.TURSO_DB_TOKEN
|
| 317 |
if not url or not token:
|
| 318 |
return []
|
| 319 |
|
| 320 |
+
cat_list = list(categories)
|
| 321 |
+
# categories column is space-separated arxiv codes; arxiv codes
|
| 322 |
+
# don't share substrings (no code is a substring of another), so
|
| 323 |
+
# plain LIKE '%code%' is safe.
|
| 324 |
+
like_clauses = " OR ".join(["categories LIKE ?" for _ in cat_list])
|
| 325 |
sql = f"""SELECT arxiv_id, title, authors, categories, primary_topic,
|
| 326 |
update_date, abstract_preview, citation_count, influential_citations
|
| 327 |
FROM papers
|
| 328 |
+
WHERE ({like_clauses})
|
| 329 |
AND citation_count > 0
|
| 330 |
ORDER BY citation_count DESC, update_date DESC
|
| 331 |
LIMIT ?"""
|
| 332 |
|
| 333 |
+
args = [{"type": "text", "value": f"%{c}%"} for c in cat_list]
|
|
|
|
| 334 |
args.append({"type": "integer", "value": str(limit)})
|
| 335 |
|
| 336 |
pipeline_url = url.rstrip("/")
|
|
|
|
| 353 |
"Content-Type": "application/json",
|
| 354 |
}
|
| 355 |
|
| 356 |
+
# Use a longer timeout than metadata fetch: full table scan
|
| 357 |
+
# for citation-sorted trending against 1.6M rows can spike to
|
| 358 |
+
# 15-25s on the first cold hit. Once cached, warm reads are 0ms.
|
| 359 |
try:
|
| 360 |
+
async with httpx.AsyncClient(timeout=30) as client:
|
| 361 |
resp = await client.post(
|
| 362 |
f"{pipeline_url}/v2/pipeline",
|
| 363 |
json=payload,
|
| 364 |
headers=headers,
|
| 365 |
)
|
| 366 |
resp.raise_for_status()
|
| 367 |
+
except httpx.HTTPStatusError as e:
|
| 368 |
+
# Surface response body on HTTP errors: Turso's empty-string
|
| 369 |
+
# exceptions were the symptom that hid this bug for months.
|
| 370 |
+
body = ""
|
| 371 |
+
try:
|
| 372 |
+
body = e.response.text[:500]
|
| 373 |
+
except Exception:
|
| 374 |
+
pass
|
| 375 |
+
print(f"[turso] trending HTTP error {e.response.status_code}: {body}")
|
| 376 |
+
return []
|
| 377 |
except Exception as e:
|
| 378 |
+
print(f"[turso] trending request failed: {type(e).__name__}: {e!r}")
|
| 379 |
return []
|
| 380 |
|
| 381 |
try:
|
|
|
|
| 394 |
cols = [c["name"] for c in result_data.get("cols", [])]
|
| 395 |
rows = result_data.get("rows", [])
|
| 396 |
except (KeyError, IndexError, TypeError) as e:
|
| 397 |
+
print(f"[turso] trending parse error: {type(e).__name__}: {e!r}")
|
| 398 |
return []
|
| 399 |
|
| 400 |
papers = []
|
|
|
|
| 411 |
papers.append(paper)
|
| 412 |
|
| 413 |
print(f"[turso] trending: {len(papers)} papers in {len(categories)} categories")
|
| 414 |
+
if papers:
|
| 415 |
+
_TRENDING_CACHE[cache_key] = (time.time(), papers)
|
| 416 |
+
# Also seed metadata cache β these papers are likely to be
|
| 417 |
+
# fetched again as part of recommendations / display.
|
| 418 |
+
for p in papers:
|
| 419 |
+
_cache_put(p["arxiv_id"], p)
|
| 420 |
return papers
|
|
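The `categories LIKE ?` filter can be tried locally. The sketch below uses an in-memory SQLite database as a stand-in for Turso, with a toy three-row `papers` table; the helper name `build_trending_query` and the `exact_tokens` flag are hypothetical. The flag shows one possible way to match whole space-separated codes, which would only matter if a bare archive-level code such as `astro-ph` (a prefix of `astro-ph.CO`) were ever passed in.

```python
import sqlite3


def build_trending_query(categories, limit, exact_tokens=False):
    """Build a parameterized trending query (hypothetical helper)."""
    cats = list(categories)
    if exact_tokens:
        # Pad both sides with spaces so only whole codes match.
        clause = "(' ' || categories || ' ') LIKE ?"
        params = [f"% {c} %" for c in cats]
    else:
        # Same shape as the diff: plain substring match.
        clause = "categories LIKE ?"
        params = [f"%{c}%" for c in cats]
    where = " OR ".join([clause] * len(cats))
    sql = (
        "SELECT arxiv_id FROM papers "
        f"WHERE ({where}) AND citation_count > 0 "
        "ORDER BY citation_count DESC, update_date DESC LIMIT ?"
    )
    return sql, params + [limit]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (arxiv_id TEXT, categories TEXT, "
    "citation_count INTEGER, update_date TEXT)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [
        ("paper-a", "cs.CL cs.LG", 100, "2017-06-12"),
        ("paper-b", "astro-ph.CO", 50, "2020-01-01"),
        ("paper-c", "cs.CV", 0, "2021-01-01"),
    ],
)


def run_query(categories, **kwargs):
    sql, params = build_trending_query(categories, 10, **kwargs)
    return [row[0] for row in conn.execute(sql, params)]


nlp = run_query(["cs.CL"])
loose = run_query(["astro-ph"])
strict = run_query(["astro-ph"], exact_tokens=True)
```

Only the placeholders are generated from the category list; the values themselves are always bound as parameters, as in the diff.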
@@ -325,30 +325,30 @@
|
|
| 325 |
|
| 326 |
---
|
| 327 |
|
| 328 |
-
## Phase 5: Cold-Start Onboarding
|
| 329 |
|
| 330 |
-
> *
|
| 331 |
-
> *Estimated effort: ~1-2 weeks*
|
| 332 |
> *Reference: Doc 06 - "4-37% lift even once behavioral data exists"*
|
| 333 |
|
| 334 |
-
### 5.1 - arXiv Category Multi-Select
|
| 335 |
-
- [
|
| 336 |
-
- [
|
| 337 |
-
- [
|
| 338 |
-
- [
|
| 339 |
-
- [
|
| 340 |
|
| 341 |
-
### 5.2 - Seed Paper Import
|
| 342 |
-
- [
|
| 343 |
-
- [
|
| 344 |
-
- [
|
| 345 |
|
| 346 |
-
### 5.3 - ORCID / Semantic Scholar Import
|
| 347 |
-
|
| 348 |
-
|
| 349 |
|
| 350 |
-
### 5.4 - Popularity Fallback
|
| 351 |
-
- [
|
|
|
|
| 352 |
|
| 353 |
---
|
| 354 |
|
|
@@ -432,10 +432,10 @@
|
|
| 432 |
- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
|
| 433 |
- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
|
| 434 |
|
| 435 |
-
### B4 - S2 author import
|
| 436 |
-
|
| 437 |
-
|
| 438 |
-
|
| 439 |
|
| 440 |
### Documentation
|
| 441 |
- [x] `CLAUDE.md`: Rule 3.11 - interaction instrumentation invariants
|
|
|
|
| 325 |
|
| 326 |
---
|
| 327 |
|
| 328 |
+
## Phase 5: Cold-Start Onboarding ✅ COMPLETE
|
| 329 |
|
| 330 |
+
> *Onboarding wizard for new users - category selection + seed paper search + trending fallback.*
|
|
|
|
| 331 |
> *Reference: Doc 06 - "4-37% lift even once behavioral data exists"*
|
| 332 |
|
| 333 |
+
### 5.1 - arXiv Category Multi-Select ✅
|
| 334 |
+
- [x] UI screen on first visit: select 1-8 arXiv category groups
|
| 335 |
+
- [x] Store selections in SQLite (`user_onboarding` table)
|
| 336 |
+
- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
|
| 337 |
+
- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
|
| 338 |
+
- [x] Does NOT create "subject vectors" - just filters
|
| 339 |
|
| 340 |
+
### 5.2 - Seed Paper Import ✅
|
| 341 |
+
- [x] Let users search for and save seed papers during onboarding
|
| 342 |
+
- [x] Immediately create EWMA profiles + Ward clusters on next feed request
|
| 343 |
+
- [x] Uses hybrid search (Phase 3) for discovery
|
| 344 |
|
| 345 |
+
### ~~5.3 - ORCID / Semantic Scholar Import~~ ❌ REMOVED
|
| 346 |
+
> S2 author import was implemented but removed - not the onboarding direction we want.
|
| 347 |
+
> Onboarding focuses on category selection + manual seed paper search.
|
| 348 |
|
| 349 |
+
### 5.4 - Popularity Fallback ✅
|
| 350 |
+
- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
|
| 351 |
+
- [x] 1-hour TTL trending cache for performance
|
| 352 |
|
| 353 |
---
|
| 354 |
|
|
|
|
| 432 |
- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
|
| 433 |
- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
|
| 434 |
|
| 435 |
+
### ~~B4 - S2 author import~~ ❌ REMOVED
|
| 436 |
+
> S2 author import was implemented and then removed - not the onboarding direction we want.
|
| 437 |
+
> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
|
| 438 |
+
> have all been deleted. Onboarding uses category selection + manual seed search only.
|
| 439 |
|
| 440 |
### Documentation
|
| 441 |
- [x] `CLAUDE.md`: Rule 3.11 - interaction instrumentation invariants
|
|
The diff for this file is too large to render.
See raw diff
|
|
|
|
@@ -32,7 +32,7 @@
|
|
| 32 |
| Component | Planned In | Blocked By |
|
| 33 |
|---|---|---|
|
| 34 |
| Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
|
| 35 |
-
| ORCID / Scholar import
|
| 36 |
| LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
|
| 37 |
| Exploration + collaborative filtering | Phase 9 | Needs user scale |
|
| 38 |
|
|
@@ -101,12 +101,12 @@ The latest deep research (Doc 06) adds critical nuance that **neither pure-behav
|
|
| 101 |
|
| 102 |
> "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4–37% lift even once behavioral data exists."
|
| 103 |
|
| 104 |
-
**The corrected position**: A
|
| 105 |
1. **Coarse arXiv-category multiselect**: filter and LightGBM feature (5-second cold-start signal)
|
| 106 |
-
2. **Seed
|
| 107 |
-
3. **Ward clustering + medoid retrieval** β takes over at ~
|
| 108 |
|
| 109 |
-
This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features.
|
| 110 |
|
| 111 |
---
|
| 112 |
|
|
@@ -283,29 +283,30 @@ Turso cloud DB with 1.23GB of metadata + citation counts. Search time: ~10.7s
|
|
| 283 |
|
| 284 |
---
|
| 285 |
|
| 286 |
-
### Phase 5: Cold-Start Onboarding (COMPLETE)
|
| 287 |
|
| 288 |
-
Status:
|
| 289 |
|
| 290 |
Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
|
| 291 |
|
| 292 |
-
#### 5.1 arXiv Category Multi-Select
|
| 293 |
-
|
| 294 |
-
- Used as pool filter for
|
| 295 |
-
- Stored as a LightGBM feature permanently
|
| 296 |
- Does NOT create "subject vectors" - just filters
|
| 297 |
|
| 298 |
-
#### 5.2 Seed Paper Import
|
| 299 |
-
|
| 300 |
-
- These immediately create EWMA profiles and Ward clusters
|
| 301 |
- Bypasses the "save 5 papers before any recs" cold-start trap
|
| 302 |
-
-
|
| 303 |
-
- **With hybrid search in place (Phase 3), seed paper search will use Qdrant vectors, not the arXiv API**
|
| 304 |
|
| 305 |
-
#### 5.3 ORCID / Semantic Scholar ID Import
|
| 306 |
-
|
| 307 |
-
|
| 308 |
-
|
|
|
|
|
|
|
| 309 |
|
| 310 |
---
|
| 311 |
|
|
|
|
| 32 |
| Component | Planned In | Blocked By |
|
| 33 |
|---|---|---|
|
| 34 |
| Evaluation framework (offline + online metrics) | Phase 7 | Not yet implemented |
|
| 35 |
+
| ~~ORCID / Scholar import~~ | ~~Phase 5~~ | Removed (not the onboarding direction we want) |
|
| 36 |
| LLM interest summaries per cluster | Phase 8 | Needs Claude/Groq API integration |
|
| 37 |
| Exploration + collaborative filtering | Phase 9 | Needs user scale |
|
| 38 |
|
|
|
|
| 101 |
|
| 102 |
> "The pure-behavioral position in Doc 03/05 is directionally right but structurally incomplete... item-level seeds + adaptive refinement beats both fixed-category questionnaires and pure-behavior-from-zero, and onboarding cues remain a 4–37% lift even once behavioral data exists."
|
| 103 |
|
| 104 |
+
**The corrected position**: A two-layer hybrid:
|
| 105 |
1. **Coarse arXiv-category multiselect**: filter and LightGBM feature (5-second cold-start signal)
|
| 106 |
+
2. **Seed paper search + save**: initial behavioral profile via manual discovery
|
| 107 |
+
3. **Ward clustering + medoid retrieval**: takes over at ~5 saves (production-grade personalization)
|
| 108 |
|
| 109 |
+
This resolves the tension: subject categories aren't the *primary* user model, but they *are* a useful prior for cold-start, filtering, and as re-ranking features. ORCID/S2 author import was explored and removed - manual seed search is the preferred onboarding path.
|
| 110 |
|
| 111 |
---
|
| 112 |
|
|
|
|
| 283 |
|
| 284 |
---
|
| 285 |
|
| 286 |
+
### Phase 5: Cold-Start Onboarding (COMPLETE ✅)
|
| 287 |
|
| 288 |
+
Status: fully implemented - categories + seed search + trending fallback.
|
| 289 |
|
| 290 |
Build the onboarding pipeline that Doc 06 identifies as a 4-37% lift even once behavioral data exists.
|
| 291 |
|
| 292 |
+
#### 5.1 arXiv Category Multi-Select ✅
|
| 293 |
+
UI screen on first visit: select 1-8 arXiv category groups.
|
| 294 |
+
- Used as pool filter for recommendations
|
| 295 |
+
- Stored as a LightGBM feature permanently (Feature 26: `onboarding_category_match`)
|
| 296 |
- Does NOT create "subject vectors" - just filters
|
| 297 |
|
| 298 |
+
#### 5.2 Seed Paper Import ✅
|
| 299 |
+
Users search for and save seed papers during onboarding.
|
| 300 |
+
- These immediately create EWMA profiles and Ward clusters on next feed request
|
| 301 |
- Bypasses the "save 5 papers before any recs" cold-start trap
|
| 302 |
+
- Uses hybrid search (Phase 3) for discovery
|
|
|
|
| 303 |
|
| 304 |
+
#### ~~5.3 ORCID / Semantic Scholar ID Import~~ ❌ REMOVED
|
| 305 |
+
S2 author import was implemented and then removed - not the onboarding direction we want.
|
| 306 |
+
Onboarding focuses on category selection + manual seed paper search.
|
| 307 |
+
|
| 308 |
+
#### 5.4 Popularity Fallback ✅
|
| 309 |
+
Category-filtered trending papers via `turso_svc.fetch_trending_by_categories()` with 1-hour TTL cache.
|
| 310 |
|
| 311 |
---
|
| 312 |
|
|
@@ -14,7 +14,7 @@ python-multipart>=0.0.9
|
|
| 14 |
FlagEmbedding>=1.2.9
|
| 15 |
transformers>=4.44,<5.0
|
| 16 |
pymilvus>=2.4
|
| 17 |
-
groq>=0.
|
| 18 |
python-dotenv>=1.0
|
| 19 |
|
| 20 |
# ── Phase 6: LightGBM reranker ───────────────────────────────────────────
|
|
|
|
| 14 |
FlagEmbedding>=1.2.9
|
| 15 |
transformers>=4.44,<5.0
|
| 16 |
pymilvus>=2.4
|
| 17 |
+
groq>=1.0 # 1.0+ drops the `proxies` kwarg internally so httpx>=0.28 works
|
| 18 |
python-dotenv>=1.0
|
| 19 |
|
| 20 |
# ── Phase 6: LightGBM reranker ───────────────────────────────────────────
|
|
@@ -0,0 +1,75 @@
+"""Verify the onboarding seed-search step does not duplicate the panel."""
+from playwright.sync_api import sync_playwright
+
+URL = "http://127.0.0.1:7860"
+QUERY = "attention is all you need"
+
+
+def run():
+    with sync_playwright() as p:
+        browser = p.chromium.launch(headless=True)
+        ctx = browser.new_context(viewport={"width": 1280, "height": 1800})
+        # Use a fresh, unonboarded user so we land on /onboarding
+        ctx.add_cookies([{
+            "name": "arxiv_user_id",
+            "value": "onboarding-test-user-fresh",
+            "url": URL,
+        }])
+        page = ctx.new_page()
+
+        page.goto(URL + "/onboarding", wait_until="networkidle")
+
+        # Step 1: pick a category, click Continue
+        page.click("[data-key='nlp']")
+        page.click("#continue-btn")
+
+        # Step 2 should appear (rendered by submitCategories() via fetch + innerHTML)
+        page.wait_for_selector("#seed-results", timeout=10_000)
+
+        # Snapshot before search
+        page.screenshot(path="scripts/screenshot_onboard_step2_before.png", full_page=True)
+
+        # Now search: this is what triggered the duplication bug
+        page.fill("input[name='q']", QUERY)
+        page.click("button:has-text('Search')")
+        # wait for results to swap in
+        page.wait_for_function(
+            "document.querySelectorAll('.seed-card').length > 0",
+            timeout=15_000,
+        )
+        page.wait_for_load_state("networkidle", timeout=15_000)
+
+        page.screenshot(path="scripts/screenshot_onboard_step2_after.png", full_page=True)
+
+        # ── Inspect the DOM
+        save_panels = page.locator("h2:has-text('Save a few papers you like')").count()
+        quick_imports = page.locator("text=Quick import:").count()
+        search_inputs = page.locator("input[name='q']").count()
+        seed_counters = page.locator("#seed-counter").count()
+        done_buttons = page.locator("button:has-text('Done β start exploring')").count()
+        seed_cards = page.locator(".seed-card").count()
+        seed_card_ids = page.locator(".seed-card").evaluate_all("els => els.map(e => e.id)")
+
+        print(f"'Save a few papers you like' headings: {save_panels} (expected 1)")
+        print(f"'Quick import:' blocks: {quick_imports} (expected 1)")
+        print(f"search inputs: {search_inputs} (expected 1)")
+        print(f"#seed-counter elements: {seed_counters} (expected 1)")
+        print(f"'Done β start exploring' buttons: {done_buttons} (expected 1)")
+        print(f"seed-cards: {seed_cards}, unique ids: {len(set(seed_card_ids))}")
+
+        ok = (
+            save_panels == 1
+            and quick_imports == 1
+            and search_inputs == 1
+            and seed_counters == 1
+            and done_buttons == 1
+            and seed_cards > 0
+            and seed_cards == len(set(seed_card_ids))
+        )
+        print("\nRESULT:", "PASS" if ok else "FAIL")
+
+        browser.close()
+
+
+if __name__ == "__main__":
+    run()
|
@@ -0,0 +1,77 @@
+"""Drive a real Chromium browser to verify the search UI shows results once."""
+from playwright.sync_api import sync_playwright
+
+URL = "http://127.0.0.1:7860"
+QUERY = "attention is all you need"
+
+
+def run():
+    with sync_playwright() as p:
+        browser = p.chromium.launch(headless=True)
+        ctx = browser.new_context(
+            viewport={"width": 1280, "height": 1800},
+        )
+        # Pre-seed cookie of a user that has saves so has_recs=True
+        ctx.add_cookies([{
+            "name": "arxiv_user_id",
+            "value": "browser-test-user",
+            "url": URL,
+        }])
+        page = ctx.new_page()
+
+        # 1) Land on the homepage and search from there.
+        page.goto(URL + "/", wait_until="networkidle")
+        page.fill("input[name='q']", QUERY)
+        page.screenshot(path="scripts/screenshot_before_submit.png", full_page=True)
+
+        page.click("button[type='submit']")
+        page.wait_for_url("**/search?q=*", timeout=10_000)
+        # search.html does not auto-load anything heavy when q is set, but give it a beat
+        page.wait_for_load_state("networkidle", timeout=15_000)
+
+        page.screenshot(path="scripts/screenshot_after_search.png", full_page=True)
+
+        # 2) Inspect the DOM
+        url = page.url
+        paper_cards = page.locator(".paper-card").count()
+        recs_visible = page.locator("#rec-section").count()
+        recs_heading = page.get_by_role("heading", name="Recommended for You").count()
+        results_heading_count = page.locator("text=results for").count()
+
+        print(f"URL after search: {url}")
+        print(f".paper-card count: {paper_cards}")
+        print(f"#rec-section count: {recs_visible}")
+        print(f"'Recommended for You' heading count: {recs_heading}")
+        print(f"'results for' phrase count: {results_heading_count}")
+
+        # 3) Check for duplicate paper IDs (the original 'twice' complaint)
+        ids = page.locator("[id^='paper-']").evaluate_all(
+            "els => els.map(e => e.id)"
+        )
+        unique = set(ids)
+        print(f"paper element ids: {len(ids)} total, {len(unique)} unique")
+        if len(ids) != len(unique):
+            from collections import Counter
+            dups = [k for k, v in Counter(ids).items() if v > 1]
+            print(f"DUPLICATE IDS: {dups}")
+
+        # Phase: title-match boost. Vaswani's "Attention Is All You Need"
+        # (1706.03762) must be the #1 result for this exact-title query.
+        first_paper_id = page.locator("[id^='paper-']").first.get_attribute("id")
+        print(f"first paper id: {first_paper_id}")
+
+        ok = (
+            recs_visible == 0
+            and recs_heading == 0
+            and results_heading_count == 1
+            and paper_cards == len(unique)
+            and paper_cards > 0
+            and first_paper_id == "paper-1706.03762"
+        )
+        print("\nRESULT:", "PASS" if ok else "FAIL")
+
+        browser.close()
+
+
+if __name__ == "__main__":
+    run()
|
@@ -0,0 +1,69 @@
+"""Diagnose why the Mamba paper (2312.00752) is missing from search results."""
+import asyncio
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc, turso_svc
+
+MAMBA_ID = "2312.00752"
+
+
+async def main():
+    # Step 1: is the paper in Qdrant at all?
+    vecs = await qdrant_svc.get_paper_vectors([MAMBA_ID])
+    in_qdrant = MAMBA_ID in vecs
+    print(f"Mamba paper {MAMBA_ID} in Qdrant: {in_qdrant}")
+
+    # Step 2: is it in Turso?
+    meta = await turso_svc.fetch_metadata_batch([MAMBA_ID])
+    if MAMBA_ID in meta:
+        print(f"Mamba paper in Turso: YES - title: {meta[MAMBA_ID].get('title')!r}")
+    else:
+        print("Mamba paper in Turso: NO")
+
+    if not in_qdrant:
+        print("\n--> Paper missing from Qdrant collection. End of investigation.")
+        return
+
+    # Step 3: where does it rank in dense, sparse, and fused?
+    q = "Mamba state space model linear time"
+    dense_vec, sparse_dict = embed_svc.encode_query(q)
+    print(f"\nQuery: {q!r}")
+    print(f"Sparse keys: {len(sparse_dict)}")
+
+    fetch_k = 60
+    dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+    sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+
+    dense_ids = [r["arxiv_id"] for r in dense]
+    sparse_ids = [r["arxiv_id"] for r in sparse]
+
+    if MAMBA_ID in dense_ids:
+        print(f"\nDense rank: {dense_ids.index(MAMBA_ID)+1}/{fetch_k}")
+    else:
+        print(f"\nDense top {fetch_k}: NOT present")
+
+    if MAMBA_ID in sparse_ids:
+        print(f"Sparse rank: {sparse_ids.index(MAMBA_ID)+1}/{fetch_k}")
+    else:
+        print(f"Sparse top {fetch_k}: NOT present")
+
+    fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+    fused_ids = [item["arxiv_id"] for item in fused]
+    if MAMBA_ID in fused_ids:
+        print(f"RRF fused rank: {fused_ids.index(MAMBA_ID)+1}")
+    else:
+        print(f"RRF fused: NOT present in top {len(fused_ids)}")
+
+    # Show top 5 of each
+    print(f"\n=== Dense top 5 ===")
+    for r in dense[:5]:
+        print(f" {r['arxiv_id']} score={r['score']:.4f}")
+    print(f"\n=== Sparse top 5 ===")
+    for r in sparse[:5]:
+        print(f" {r['arxiv_id']} score={r['score']:.4f}")
+
+
+asyncio.run(main())
|
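The diagnostic above calls `hybrid_search_svc._rrf_fuse(dense, sparse, k=60)`. For readers unfamiliar with reciprocal rank fusion, here is a minimal sketch of the standard formula, score(d) = sum over lists of 1 / (k + rank). It works on plain ID lists and is an illustration only; the project's `_rrf_fuse` takes and returns dicts, and its internals are not shown in this diff.

```python
def rrf_fuse(*rankings, k=60):
    """Fuse ranked ID lists with reciprocal rank fusion (illustrative sketch)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy rankings with placeholder IDs.
dense = ["A", "B", "C"]
sparse = ["B", "D", "A"]
fused = rrf_fuse(dense, sparse)
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (`A` and `B` here) accumulate two terms and so outrank documents found by only one retriever, which is why a paper missing from either the dense or the sparse top 60 can end up low in the fused list.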
@@ -0,0 +1,45 @@
+"""Trace where Vaswani's paper falls in the hybrid pipeline."""
+import asyncio
+from app import qdrant_svc, embed_svc, zilliz_svc, hybrid_search_svc
+
+VASWANI = "1706.03762"
+
+
+async def main():
+    q = "attention is all you need"
+    dense_vec, sparse_dict = embed_svc.encode_query(q)
+    print(f"sparse keys: {len(sparse_dict)}")
+
+    fetch_k = 60
+    dense = await qdrant_svc.search_dense(dense_vec.tolist(), limit=fetch_k)
+    sparse = await zilliz_svc.search_sparse(sparse_dict, limit=fetch_k)
+    dense_ids = [r["arxiv_id"] for r in dense]
+    sparse_ids = [r["arxiv_id"] for r in sparse]
+
+    print(f"\nVaswani in dense top {fetch_k}: ", VASWANI in dense_ids,
+          (f"(rank {dense_ids.index(VASWANI)+1})" if VASWANI in dense_ids else ""))
+    print(f"Vaswani in sparse top {fetch_k}: ", VASWANI in sparse_ids,
+          (f"(rank {sparse_ids.index(VASWANI)+1})" if VASWANI in sparse_ids else ""))
+
+    fused = hybrid_search_svc._rrf_fuse(dense, sparse, k=60)
+    fused_ids = [item["arxiv_id"] for item in fused]
+    v_rank_rrf = fused_ids.index(VASWANI) + 1 if VASWANI in fused_ids else None
+    print(f"\nVaswani rank after pure RRF: {v_rank_rrf}")
+
+    print("\n=== Pure RRF (no recency), top 10 ===")
+    for i, item in enumerate(fused[:10], 1):
+        marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+        print(f" {i:2d}. {item['arxiv_id']} rrf={item['rrf_score']:.4f}{marker}")
+
+    ranked = hybrid_search_svc._recency_rerank([dict(x) for x in fused])
+    ranked_ids = [item["arxiv_id"] for item in ranked]
+    v_rank_recency = ranked_ids.index(VASWANI) + 1 if VASWANI in ranked_ids else None
+    print(f"\nVaswani rank after current 0.80/0.20 recency rerank: {v_rank_recency}")
+
+    print("\n=== Current rerank (0.80 RRF + 0.20 recency), top 10 ===")
+    for i, item in enumerate(ranked[:10], 1):
+        marker = " <-- VASWANI" if item["arxiv_id"] == VASWANI else ""
+        print(f" {i:2d}. {item['arxiv_id']} final={item['final_score']:.4f}{marker}")
+
+
+asyncio.run(main())
|
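The trace above compares pure RRF order with the "0.80 RRF + 0.20 recency" rerank. The sketch below shows how such a blend can push an older exact-title match below a newer paper. Only the 0.80/0.20 weights and the `rrf_score` / `final_score` field names come from the script; the max-normalization, the exponential decay with a 3-year half-life, the `year` field, and the second paper are assumptions made for illustration, not the behaviour of `_recency_rerank`.

```python
def recency_rerank(items, current_year, w_rrf=0.80, w_recency=0.20,
                   half_life_years=3.0):
    """Blend normalized RRF score with a recency term (assumed formula)."""
    if not items:
        return []
    top = max(it["rrf_score"] for it in items)
    ranked = []
    for it in items:
        # Assumption: scale RRF so the best fused hit has relevance 1.0.
        relevance = it["rrf_score"] / top if top > 0 else 0.0
        # Assumption: recency halves every `half_life_years`.
        age = max(0, current_year - it["year"])
        recency = 0.5 ** (age / half_life_years)
        final = w_rrf * relevance + w_recency * recency
        ranked.append({**it, "final_score": final})
    return sorted(ranked, key=lambda it: it["final_score"], reverse=True)


fused = [
    {"arxiv_id": "1706.03762", "year": 2017, "rrf_score": 0.0320},
    {"arxiv_id": "hypothetical-2024-paper", "year": 2024, "rrf_score": 0.0300},
]
blended = recency_rerank(fused, current_year=2024)
pure_rrf = recency_rerank(fused, current_year=2024, w_rrf=1.0, w_recency=0.0)
```

With these toy numbers the 2017 paper leads on RRF alone but drops to second once recency gets a 0.20 weight.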
@@ -0,0 +1,622 @@
| 1 |
+
"""
|
| 2 |
+
End-to-end audit of the ResearchIT recommendation pipeline.
|
| 3 |
+
|
| 4 |
+
Steps:
|
| 5 |
+
1. Smoke test: hybrid search (10 queries, per-layer scores)
|
| 6 |
+
2. User profile pipeline: EWMA update + Ward clustering
|
| 7 |
+
3. Recommendation feed generation with quota fusion
|
| 8 |
+
4. LightGBM reranker pass
|
| 9 |
+
5. Gap analysis
|
| 10 |
+
|
| 11 |
+
Run: python scripts/e2e_audit.py
|
| 12 |
+
"""
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
import asyncio, sys, time, json, struct
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
import numpy as np
|
| 17 |
+
|
| 18 |
+
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
| 19 |
+
|
| 20 |
+
# ── Imports ──────────────────────────────────────────────────────────────────
|
| 21 |
+
|
| 22 |
+
from app import hybrid_search_svc, turso_svc, embed_svc, qdrant_svc, zilliz_svc, groq_svc, db
|
| 23 |
+
from app.recommend import profiles, clustering
|
| 24 |
+
from app.recommend.reranker import (
|
| 25 |
+
rerank_candidates, compute_features, heuristic_score,
|
| 26 |
+
is_model_loaded, get_num_trees, FEATURE_NAMES,
|
| 27 |
+
)
|
| 28 |
+
from app.recommend.diversity import mmr_rerank, inject_exploration
|
| 29 |
+
|
| 30 |
+
# ── Globals ──────────────────────────────────────────────────────────────────
|
| 31 |
+
|
| 32 |
+
ERRORS: list[str] = []
|
| 33 |
+
WRONG_OUTPUTS: list[str] = []
|
| 34 |
+
MISSING: list[str] = []
|
| 35 |
+
TEST_USER = "e2e_audit_user_001"
|
| 36 |
+
|
| 37 |
+
# ── Helpers ──────────────────────────────────────────────────────────────────
|
| 38 |
+
|
| 39 |
+
def banner(text: str):
|
| 40 |
+
print(f"\n{'='*90}")
|
| 41 |
+
print(f" {text}")
|
| 42 |
+
print(f"{'='*90}\n")
|
| 43 |
+
|
| 44 |
+
def check(label: str, condition: bool, detail: str = ""):
|
| 45 |
+
status = "OK" if condition else "FAIL"
|
| 46 |
+
msg = f" [{status:>4}] {label}"
|
| 47 |
+
if detail:
|
| 48 |
+
msg += f" -- {detail}"
|
| 49 |
+
print(msg)
|
| 50 |
+
if not condition:
|
| 51 |
+
WRONG_OUTPUTS.append(f"{label}: {detail}")
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 55 |
+
# STEP 1 - SMOKE TEST: HYBRID SEARCH
|
| 56 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 57 |
+
|
| 58 |
+
SEARCH_QUERIES = [
|
| 59 |
+
"vision transformer image classification",
|
| 60 |
+
"reinforcement learning reward shaping",
|
| 61 |
+
"large language model fine-tuning RLHF",
|
| 62 |
+
"graph neural network drug discovery",
|
| 63 |
+
"federated learning differential privacy",
|
| 64 |
+
"attention is all you need",
|
| 65 |
+
"diffusion models image generation",
|
| 66 |
+
"knowledge distillation BERT compression",
|
| 67 |
+
"object detection YOLO real-time",
|
| 68 |
+
"protein structure prediction deep learning",
|
| 69 |
+
]
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
async def step1_search():
|
| 73 |
+
banner("STEP 1: HYBRID SEARCH SMOKE TEST")
|
| 74 |
+
print(f"Running {len(SEARCH_QUERIES)} queries...\n")
|
| 75 |
+
|
| 76 |
+
all_latencies = []
|
| 77 |
+
all_results_count = []
|
| 78 |
+
|
| 79 |
+
for i, q in enumerate(SEARCH_QUERIES, 1):
|
| 80 |
+
t0 = time.perf_counter()
|
| 81 |
+
try:
|
| 82 |
+
results = await hybrid_search_svc.search(q, limit=10)
|
| 83 |
+
elapsed = (time.perf_counter() - t0) * 1000
|
| 84 |
+
except Exception as e:
|
| 85 |
+
ERRORS.append(f"Step 1: Query {q!r} threw {type(e).__name__}: {e}")
|
| 86 |
+
print(f" Q{i}: {q!r} -> ERROR: {e}")
|
| 87 |
+
continue
|
| 88 |
+
|
| 89 |
+
all_latencies.append(elapsed)
|
| 90 |
+
all_results_count.append(len(results))
|
| 91 |
+
|
| 92 |
+
# Fetch metadata for display
|
| 93 |
+
meta = {}
|
| 94 |
+
if results:
|
| 95 |
+
try:
|
| 96 |
+
meta = await turso_svc.fetch_metadata_batch(results)
|
| 97 |
+
except Exception as e:
|
| 98 |
+
ERRORS.append(f"Step 1: Metadata fetch failed for {q!r}: {e}")
|
| 99 |
+
|
| 100 |
+
print(f" Q{i}: {q!r}")
|
| 101 |
+
print(f" Results: {len(results)} | Latency: {elapsed:.0f}ms")
|
| 102 |
+
|
| 103 |
+
for rank, aid in enumerate(results[:5], 1):
|
| 104 |
+
m = meta.get(aid, {})
|
| 105 |
+
title = (m.get("title") or "?")[:65]
|
| 106 |
+
cites = m.get("citation_count", 0) or 0
|
| 107 |
+
print(f" {rank}. [{cites:>6} cites] {aid:14s} {title}")
|
| 108 |
+
|
| 109 |
+
# Relevance check: do at least 2 of the top 5 titles each contain >= 2 query words?
|
| 110 |
+
if results and meta:
|
| 111 |
+
q_words = set(q.lower().split())
|
| 112 |
+
relevant = 0
|
| 113 |
+
for aid in results[:5]:
|
| 114 |
+
t = (meta.get(aid, {}).get("title") or "").lower()
|
| 115 |
+
matches = sum(1 for w in q_words if w in t)
|
| 116 |
+
if matches >= 2:
|
| 117 |
+
relevant += 1
|
| 118 |
+
check(f"Q{i} relevance ({relevant}/5 top results overlap query terms)",
|
| 119 |
+
relevant >= 2,
|
| 120 |
+
f"{q!r}")
|
| 121 |
+
|
| 122 |
+
print()
|
| 123 |
+
|
| 124 |
+
# Summary
|
| 125 |
+
if all_latencies:
|
| 126 |
+
print(f" --- Search Summary ---")
|
| 127 |
+
print(f" Queries: {len(all_latencies)}")
|
| 128 |
+
print(f" Avg latency: {sum(all_latencies)/len(all_latencies):.0f}ms")
|
| 129 |
+
print(f" p50: {sorted(all_latencies)[len(all_latencies)//2]:.0f}ms")
|
| 130 |
+
print(f" Max: {max(all_latencies):.0f}ms")
|
| 131 |
+
zero_results = sum(1 for c in all_results_count if c == 0)
|
| 132 |
+
print(f" Zero-result queries: {zero_results}")
|
| 133 |
+
if zero_results > 0:
|
| 134 |
+
ERRORS.append(f"Step 1: {zero_results} queries returned 0 results")
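The term-overlap relevance check and the latency summary in step1_search can be sketched on their own, without any of the search services. This is a minimal illustration, not the script's code: the helper names (`overlap_relevant`, `relevance_at_k`, `latency_summary`) are hypothetical. It keeps the script's substring matching, so "transformer" also matches "transformers". One difference: `statistics.median` averages the two middle values for an even-length list, whereas the script's `sorted(x)[len(x)//2]` picks the upper one.

```python
from statistics import median


def overlap_relevant(query: str, title: str, min_terms: int = 2) -> bool:
    """True when at least `min_terms` distinct query words occur in the title."""
    q_words = set(query.lower().split())
    t = title.lower()
    # Substring match, mirroring the audit script's `w in t` test.
    return sum(1 for w in q_words if w in t) >= min_terms


def relevance_at_k(query: str, titles: list[str], k: int = 5) -> int:
    """Count how many of the first k titles pass the overlap test."""
    return sum(overlap_relevant(query, t) for t in titles[:k])


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Average, median and max latency; expects a non-empty list."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": median(latencies_ms),
        "max": max(latencies_ms),
    }
```

A query would then pass the smoke test when `relevance_at_k(q, titles) >= 2`, which is the threshold the script applies.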
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 138 |
+
# STEP 2 - USER PROFILE PIPELINE
|
| 139 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 140 |
+
|
| 141 |
+
# Real paper IDs from known categories:
|
| 142 |
+
# CV papers (computer vision)
|
| 143 |
+
CV_PAPERS = [
|
| 144 |
+
"1512.03385", # ResNet
|
| 145 |
+
"2010.11929", # ViT
|
| 146 |
+
"2105.01601", # MLP-Mixer (Swin Transformer is 2103.14030)
|
| 147 |
+
"2106.08254", # BEiT
|
| 148 |
+
"1409.1556", # VGGNet
|
| 149 |
+
]
|
| 150 |
+
# LLM papers (NLP / language models)
|
| 151 |
+
LLM_PAPERS = [
|
| 152 |
+
"1706.03762", # Attention Is All You Need
|
| 153 |
+
"1810.04805", # BERT
|
| 154 |
+
"2005.14165", # GPT-3
|
| 155 |
+
"2303.08774", # GPT-4
|
| 156 |
+
"2302.13971", # LLaMA
|
| 157 |
+
]
|
| 158 |
+
|
| 159 |
+
ALL_SEED_PAPERS = CV_PAPERS + LLM_PAPERS
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
async def step2_profiles():
|
| 163 |
+
banner("STEP 2: USER PROFILE PIPELINE")
|
| 164 |
+
|
| 165 |
+
# Initialize DB
|
| 166 |
+
await db.init_db()
|
| 167 |
+
print(f" Test user: {TEST_USER}")
|
| 168 |
+
print(f" Seed papers: {len(ALL_SEED_PAPERS)} ({len(CV_PAPERS)} CV + {len(LLM_PAPERS)} LLM)")
|
| 169 |
+
|
| 170 |
+
# Step 2a: Retrieve embeddings for seed papers from Qdrant (batch)
|
| 171 |
+
print(f"\n Fetching embeddings from Qdrant for {len(ALL_SEED_PAPERS)} papers...")
|
| 172 |
+
embeddings = {}
|
| 173 |
+
try:
|
| 174 |
+
vecs = await qdrant_svc.get_paper_vectors(ALL_SEED_PAPERS)
|
| 175 |
+
for aid, vec in vecs.items():
|
| 176 |
+
embeddings[aid] = np.array(vec, dtype=np.float32)
|
| 177 |
+
missing = [a for a in ALL_SEED_PAPERS if a not in embeddings]
|
| 178 |
+
if missing:
|
| 179 |
+
print(f" WARN: No vectors for {len(missing)} papers: {missing[:3]}...")
|
| 180 |
+
except Exception as e:
|
| 181 |
+
print(f" ERROR: get_paper_vectors -> {e}")
|
| 182 |
+
ERRORS.append(f"Step 2: get_paper_vectors failed: {e}")
|
| 183 |
+
|
| 184 |
+
print(f" Retrieved {len(embeddings)}/{len(ALL_SEED_PAPERS)} embeddings")
|
| 185 |
+
|
| 186 |
+
if len(embeddings) < 5:
|
| 187 |
+
ERRORS.append(f"Step 2: Only {len(embeddings)} embeddings retrieved, need >= 5")
|
| 188 |
+
print(" ABORT: Not enough embeddings to continue Step 2")
|
| 189 |
+
return None, None
|
| 190 |
+
|
| 191 |
+
# Step 2b: EWMA profile updates
|
| 192 |
+
print(f"\n Running EWMA profile updates (alpha_long={profiles.ALPHA_LONG_TERM}, "
|
| 193 |
+
f"alpha_short={profiles.ALPHA_SHORT_TERM})...")
|
| 194 |
+
|
| 195 |
+
for aid in ALL_SEED_PAPERS:
|
| 196 |
+
if aid not in embeddings:
|
| 197 |
+
continue
|
| 198 |
+
try:
|
| 199 |
+
await profiles.update_on_save(TEST_USER, embeddings[aid])
|
| 200 |
+
except Exception as e:
|
| 201 |
+
ERRORS.append(f"Step 2: EWMA update failed for {aid}: {e}")
|
| 202 |
+
print(f" ERROR: update_on_save({aid}) -> {e}")
|
| 203 |
+
|
| 204 |
+
# Load profiles back
|
| 205 |
+
lt_vec = await profiles.load_profile(TEST_USER, "long_term")
|
| 206 |
+
st_vec = await profiles.load_profile(TEST_USER, "short_term")
|
| 207 |
+
lt_count = await profiles.get_interaction_count(TEST_USER, "long_term")
|
| 208 |
+
st_count = await profiles.get_interaction_count(TEST_USER, "short_term")
|
| 209 |
+
|
| 210 |
+
check("Long-term profile exists", lt_vec is not None)
|
| 211 |
+
check("Short-term profile exists", st_vec is not None)
|
| 212 |
+
check(f"Long-term interaction count = {lt_count}", lt_count == len(embeddings),
|
| 213 |
+
f"expected {len(embeddings)}")
|
| 214 |
+
check(f"Short-term interaction count = {st_count}", st_count == len(embeddings),
|
| 215 |
+
f"expected {len(embeddings)}")
|
| 216 |
+
|
| 217 |
+
if lt_vec is not None:
|
| 218 |
+
lt_norm = float(np.linalg.norm(lt_vec))
|
| 219 |
+
check(f"Long-term vector L2-norm ~= 1.0 (actual: {lt_norm:.4f})",
|
| 220 |
+
abs(lt_norm - 1.0) < 0.01)
|
| 221 |
+
|
| 222 |
+
if st_vec is not None:
|
| 223 |
+
st_norm = float(np.linalg.norm(st_vec))
|
| 224 |
+
check(f"Short-term vector L2-norm ~= 1.0 (actual: {st_norm:.4f})",
|
| 225 |
+
abs(st_norm - 1.0) < 0.01)
|
| 226 |
+
|
| 227 |
+
# Step 2c: Ward hierarchical clustering
|
| 228 |
+
print(f"\n Running Ward clustering on {len(embeddings)} paper embeddings...")
|
| 229 |
+
|
| 230 |
+
paper_ids = list(embeddings.keys())
|
| 231 |
+
emb_matrix = np.stack([embeddings[aid] for aid in paper_ids])
|
| 232 |
+
|
| 233 |
+
try:
|
| 234 |
+
clusters = clustering.compute_clusters(
|
| 235 |
+
paper_ids=paper_ids,
|
| 236 |
+
embeddings=emb_matrix,
|
| 237 |
+
)
|
| 238 |
+
except Exception as e:
|
| 239 |
+
ERRORS.append(f"Step 2: compute_clusters failed: {e}")
|
| 240 |
+
print(f" ERROR: {e}")
|
| 241 |
+
return lt_vec, st_vec
|
| 242 |
+
|
| 243 |
+
print(f" Clusters found: {len(clusters)}")
|
| 244 |
+
for c in clusters:
|
| 245 |
+
print(f" Cluster {c.cluster_idx}: medoid={c.medoid_paper_id}, "
|
| 246 |
+
f"papers={len(c.paper_ids)}, importance={c.importance:.3f}")
|
| 247 |
+
for pid in c.paper_ids:
|
| 248 |
+
label = "CV" if pid in CV_PAPERS else "LLM" if pid in LLM_PAPERS else "?"
|
| 249 |
+
print(f" - {pid} [{label}]")
|
| 250 |
+
|
| 251 |
+
check(f"Number of clusters >= 2 (actual: {len(clusters)})",
|
| 252 |
+
len(clusters) >= 2,
|
| 253 |
+
"CV and LLM papers should form distinct clusters")
|
| 254 |
+
|
| 255 |
+
# Check cluster purity
|
| 256 |
+
for c in clusters:
|
| 257 |
+
cv_count = sum(1 for p in c.paper_ids if p in CV_PAPERS)
|
| 258 |
+
llm_count = sum(1 for p in c.paper_ids if p in LLM_PAPERS)
|
| 259 |
+
total = len(c.paper_ids)
|
| 260 |
+
purity = max(cv_count, llm_count) / total if total > 0 else 0
|
| 261 |
+
dominant = "CV" if cv_count > llm_count else "LLM"
|
| 262 |
+
check(f"Cluster {c.cluster_idx} purity ({dominant}: {purity:.0%})",
|
| 263 |
+
purity >= 0.6,
|
| 264 |
+
f"{cv_count} CV + {llm_count} LLM papers")
|
| 265 |
+
|
| 266 |
+
# Save clusters for Step 3
|
| 267 |
+
try:
|
| 268 |
+
await clustering.save_clusters_to_db(TEST_USER, clusters)
|
| 269 |
+
except Exception as e:
|
| 270 |
+
ERRORS.append(f"Step 2: save_clusters_to_db failed: {e}")
|
| 271 |
+
|
| 272 |
+
return lt_vec, st_vec
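Step 2 checks that both profile vectors come back with an L2 norm close to 1.0 after a series of EWMA updates. The sketch below shows one way such an update could work. It is an assumption, not the body of `profiles.update_on_save`, which this diff does not show. The function names and the alpha values in the usage note are hypothetical.

```python
from __future__ import annotations

import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ewma_update(profile: list[float] | None,
                paper_vec: list[float],
                alpha: float) -> list[float]:
    """profile <- normalize((1 - alpha) * profile + alpha * paper_vec)."""
    if profile is None:
        # First interaction: the profile starts as the paper itself.
        return l2_normalize(paper_vec)
    mixed = [(1.0 - alpha) * p + alpha * x for p, x in zip(profile, paper_vec)]
    # Re-normalizing after every step keeps the norm at 1.0.
    return l2_normalize(mixed)
```

With a small alpha (say 0.1) the profile drifts slowly, which suits a long-term profile. A larger alpha (say 0.5) tracks recent saves, which suits a short-term one. The real values live in `profiles.ALPHA_LONG_TERM` and `profiles.ALPHA_SHORT_TERM`.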
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 276 |
+
# STEP 3 - RECOMMENDATION FEED GENERATION
|
| 277 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 278 |
+
|
| 279 |
+
async def step3_recommendation_feed(lt_vec, st_vec):
|
| 280 |
+
banner("STEP 3: RECOMMENDATION FEED GENERATION")
|
| 281 |
+
|
| 282 |
+
if lt_vec is None:
|
| 283 |
+
ERRORS.append("Step 3: Skipped - no long-term profile from Step 2")
|
| 284 |
+
print(" SKIPPED: No profile vectors from Step 2")
|
| 285 |
+
return None, None, None
|
| 286 |
+
|
| 287 |
+
# Load clusters from DB
|
| 288 |
+
clusters = await clustering.load_clusters_from_db(TEST_USER)
|
| 289 |
+
if not clusters:
|
| 290 |
+
ERRORS.append("Step 3: No clusters found in DB")
|
| 291 |
+
print(" SKIPPED: No clusters in DB")
|
| 292 |
+
return None, None, None
|
| 293 |
+
|
| 294 |
+
print(f" Loaded {len(clusters)} clusters from DB")
|
| 295 |
+
print(f" Target feed size: 20 papers")
|
| 296 |
+
|
| 297 |
+
# Step 3a: Search for candidates per cluster (using medoid embeddings)
|
| 298 |
+
all_candidates: dict[str, dict] = {} # arxiv_id -> metadata
|
| 299 |
+
all_embeddings: dict[str, np.ndarray] = {}
|
| 300 |
+
cluster_assignments: dict[str, int] = {} # arxiv_id -> cluster_idx
|
| 301 |
+
seen = set(ALL_SEED_PAPERS)
|
| 302 |
+
|
| 303 |
+
t0 = time.perf_counter()
|
| 304 |
+
|
| 305 |
+
# Get medoid vectors in batch
|
| 306 |
+
medoid_ids = [c["medoid_paper_id"] for c in clusters]
|
| 307 |
+
medoid_vecs = await qdrant_svc.get_paper_vectors(medoid_ids)
|
| 308 |
+
|
| 309 |
+
for c in clusters:
|
| 310 |
+
mid = c["medoid_paper_id"]
|
| 311 |
+
medoid_vec = None
|
| 312 |
+
|
| 313 |
+
# Try stored blob first
|
| 314 |
+
if c.get("medoid_embedding_blob"):
|
| 315 |
+
medoid_vec = np.frombuffer(c["medoid_embedding_blob"], dtype=np.float32)
|
| 316 |
+
|
| 317 |
+
# Fallback: batch-fetched vector
|
| 318 |
+
if medoid_vec is None and mid in medoid_vecs:
|
| 319 |
+
medoid_vec = np.array(medoid_vecs[mid], dtype=np.float32)
|
| 320 |
+
|
| 321 |
+
if medoid_vec is None:
|
| 322 |
+
ERRORS.append(f"Step 3: No medoid vector for cluster {c['cluster_idx']}")
|
| 323 |
+
continue
|
| 324 |
+
|
| 325 |
+
# Search Qdrant for similar papers (with scores + vectors)
|
| 326 |
+
try:
|
| 327 |
+
results = await qdrant_svc.search_by_vector_with_scores(
|
| 328 |
+
medoid_vec.tolist(), limit=30, with_vectors=True
|
| 329 |
+
)
|
| 330 |
+
except Exception as e:
|
| 331 |
+
ERRORS.append(f"Step 3: search failed for cluster {c['cluster_idx']}: {e}")
|
| 332 |
+
continue
|
| 333 |
+
|
| 334 |
+
# Filter out seen papers
|
| 335 |
+
for r in results:
|
| 336 |
+
aid = r["arxiv_id"]
|
| 337 |
+
if aid in seen:
|
| 338 |
+
continue
|
| 339 |
+
all_candidates[aid] = {"score": r["score"]}
|
| 340 |
+
cluster_assignments[aid] = c["cluster_idx"]
|
| 341 |
+
if "vector" in r:
|
| 342 |
+
all_embeddings[aid] = np.array(r["vector"], dtype=np.float32)
|
| 343 |
+
seen.add(aid)
|
| 344 |
+
if sum(1 for v in cluster_assignments.values() if v == c["cluster_idx"]) >= 15:
|
| 345 |
+
break
|
| 346 |
+
|
| 347 |
+
elapsed_search = (time.perf_counter() - t0) * 1000
|
| 348 |
+
print(f" Candidate search: {len(all_candidates)} papers in {elapsed_search:.0f}ms")
|
| 349 |
+
|
| 350 |
+
if not all_candidates:
|
| 351 |
+
ERRORS.append("Step 3: Zero candidates retrieved")
|
| 352 |
+
print(" ABORT: No candidates")
|
| 353 |
+
return None, None, None
|
| 354 |
+
|
| 355 |
+
# Step 3b: Fetch metadata
|
| 356 |
+
cand_ids = list(all_candidates.keys())
|
| 357 |
+
try:
|
| 358 |
+
meta = await turso_svc.fetch_metadata_batch(cand_ids)
|
| 359 |
+
except Exception as e:
|
| 360 |
+
ERRORS.append(f"Step 3: metadata fetch failed: {e}")
|
| 361 |
+
meta = {}
|
| 362 |
+
|
| 363 |
+
# Step 3c: Fetch embeddings for candidates (use what we got from search + batch fetch rest)
|
| 364 |
+
cand_embeddings = dict(all_embeddings) # Already have some from with_vectors=True
|
| 365 |
+
missing_emb = [aid for aid in cand_ids if aid not in cand_embeddings]
|
| 366 |
+
if missing_emb:
|
| 367 |
+
print(f" Fetching {len(missing_emb)} missing embeddings from Qdrant...")
|
| 368 |
+
try:
|
| 369 |
+
extra = await qdrant_svc.get_paper_vectors(missing_emb)
|
| 370 |
+
for aid, vec in extra.items():
|
| 371 |
+
cand_embeddings[aid] = np.array(vec, dtype=np.float32)
|
| 372 |
+
except Exception as e:
|
| 373 |
+
print(f" WARN: batch vector fetch failed: {e}")
|
| 374 |
+
|
| 375 |
+
print(f" Got {len(cand_embeddings)}/{len(cand_ids)} embeddings")
|
| 376 |
+
|
| 377 |
+
# Build aligned arrays
|
| 378 |
+
valid_ids = [aid for aid in cand_ids if aid in cand_embeddings and aid in meta]
|
| 379 |
+
if len(valid_ids) < 5:
|
| 380 |
+
ERRORS.append(f"Step 3: Only {len(valid_ids)} valid candidates")
|
| 381 |
+
print(f" ABORT: Not enough valid candidates")
|
| 382 |
+
return None, None, None
|
| 383 |
+
|
| 384 |
+
emb_matrix = np.stack([cand_embeddings[aid] for aid in valid_ids])
|
| 385 |
+
meta_list = [meta[aid] for aid in valid_ids]
|
| 386 |
+
|
| 387 |
+
# Step 3d: Print the raw candidate feed
|
| 388 |
+
print(f"\n Raw candidate feed ({len(valid_ids)} papers):")
|
| 389 |
+
cluster_counts: dict[int, int] = {}
|
| 390 |
+
for i, aid in enumerate(valid_ids[:20]):
|
| 391 |
+
m = meta.get(aid, {})
|
| 392 |
+
title = (m.get("title") or "?")[:55]
|
| 393 |
+
cites = m.get("citation_count", 0) or 0
|
| 394 |
+
cidx = cluster_assignments.get(aid, -1)
|
| 395 |
+
cluster_counts[cidx] = cluster_counts.get(cidx, 0) + 1
|
| 396 |
+
print(f" {i+1:2d}. [C{cidx}] [{cites:>6} cites] {title}")
|
| 397 |
+
|
| 398 |
+
print(f"\n Cluster distribution in top 20:")
|
| 399 |
+
for cidx, count in sorted(cluster_counts.items()):
|
| 400 |
+
print(f" Cluster {cidx}: {count} papers")
|
| 401 |
+
|
| 402 |
+
total_feed = (time.perf_counter() - t0) * 1000
|
| 403 |
+
print(f" Total feed generation: {total_feed:.0f}ms")
|
| 404 |
+
|
| 405 |
+
return valid_ids, emb_matrix, meta_list
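Two mechanics in Step 3 are easy to get wrong: decoding a stored medoid embedding blob into floats, and capping how many unseen candidates each cluster contributes. The sketch below shows both using only the standard library. It assumes the blob is a flat run of little-endian float32 values, which is what `np.frombuffer(..., dtype=np.float32)` reads on little-endian machines. The helper names are hypothetical.

```python
import struct


def pack_vector(vec: list[float]) -> bytes:
    """Serialize floats as little-endian float32 (assumed blob layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> list[float]:
    """Inverse of pack_vector: 4 bytes per float32 value."""
    n = len(blob) // 4
    return list(struct.unpack(f"<{n}f", blob))


def collect_candidates(hits_per_cluster: dict[int, list[str]],
                       seen: set[str],
                       per_cluster_cap: int = 15) -> dict[str, int]:
    """Assign unseen hits to clusters, taking at most `per_cluster_cap` each."""
    assignments: dict[str, int] = {}
    seen = set(seen)  # copy so the caller's set is not mutated
    for cluster_idx, hits in hits_per_cluster.items():
        taken = 0
        for arxiv_id in hits:
            if arxiv_id in seen:
                continue  # seed paper, or already claimed by another cluster
            assignments[arxiv_id] = cluster_idx
            seen.add(arxiv_id)
            taken += 1
            if taken >= per_cluster_cap:
                break
    return assignments
```

Keeping a running `taken` counter avoids rescanning the whole assignment map for every hit.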
|
| 406 |
+
|
| 407 |
+
|
| 408 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 409 |
+
# STEP 4 - LIGHTGBM RERANKER
|
| 410 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 411 |
+
|
| 412 |
+
async def step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec):
|
| 413 |
+
banner("STEP 4: LIGHTGBM RERANKER")
|
| 414 |
+
|
| 415 |
+
if valid_ids is None:
|
| 416 |
+
print(" SKIPPED: No candidates from Step 3")
|
| 417 |
+
return
|
| 418 |
+
|
| 419 |
+
print(f" Model loaded: {is_model_loaded()}")
|
| 420 |
+
if is_model_loaded():
|
| 421 |
+
print(f" Trees: {get_num_trees()}")
|
| 422 |
+
else:
|
| 423 |
+
MISSING.append("LightGBM model not loaded - using heuristic fallback")
|
| 424 |
+
|
| 425 |
+
n = min(len(valid_ids), 20)
|
| 426 |
+
ids_subset = valid_ids[:n]
|
| 427 |
+
emb_subset = emb_matrix[:n]
|
| 428 |
+
meta_subset = meta_list[:n]
|
| 429 |
+
|
| 430 |
+
print(f" Running reranker on {n} candidates...")
|
| 431 |
+
t0 = time.perf_counter()
|
| 432 |
+
|
| 433 |
+
try:
|
| 434 |
+
sorted_ids, sorted_scores, sorted_embs = rerank_candidates(
|
| 435 |
+
ids_subset,
|
| 436 |
+
emb_subset,
|
| 437 |
+
meta_subset,
|
| 438 |
+
lt_vec,
|
| 439 |
+
st_vec,
|
| 440 |
+
None, # no negative profile
|
| 441 |
+
)
|
| 442 |
+
elapsed = (time.perf_counter() - t0) * 1000
|
| 443 |
+
except Exception as e:
|
| 444 |
+
ERRORS.append(f"Step 4: rerank_candidates failed: {e}")
|
| 445 |
+
print(f" ERROR: {e}")
|
| 446 |
+
return
|
| 447 |
+
|
| 448 |
+
print(f" Reranker latency: {elapsed:.0f}ms")
|
| 449 |
+
print(f"\n Reranked order (top 10):")
|
| 450 |
+
|
| 451 |
+
# Fetch metadata for display
|
| 452 |
+
re_meta = {}
|
| 453 |
+
try:
|
| 454 |
+
re_meta = await turso_svc.fetch_metadata_batch(sorted_ids[:10])
|
| 455 |
+
except Exception as e:
|
| 456 |
+
print(f" WARN: metadata fetch for display failed: {e}")
|
| 457 |
+
|
| 458 |
+
for i, (aid, score) in enumerate(zip(sorted_ids[:10], sorted_scores[:10]), 1):
|
| 459 |
+
m = re_meta.get(aid, {})
|
| 460 |
+
title = (m.get("title") or "?")[:55]
|
| 461 |
+
cites = m.get("citation_count", 0) or 0
|
| 462 |
+
old_rank = ids_subset.index(aid) + 1 if aid in ids_subset else "?"
|
| 463 |
+
print(f" {i:2d}. (was #{old_rank:>2}) [{cites:>6} cites] score={score:.4f} {title}")
|
| 464 |
+
|
| 465 |
+
# Feature analysis for top 3 and bottom 3
|
| 466 |
+
features = compute_features(emb_subset, meta_subset, lt_vec, st_vec, None)
|
| 467 |
+
print(f"\n Feature snapshot (top 3 reranked papers):")
|
| 468 |
+
for rank_idx in range(min(3, len(sorted_ids))):
|
| 469 |
+
aid = sorted_ids[rank_idx]
|
| 470 |
+
orig_idx = ids_subset.index(aid)
|
| 471 |
+
f = features[orig_idx]
|
| 472 |
+
print(f" #{rank_idx+1} {aid}:")
|
| 473 |
+
print(f" qdrant_cosine={f[0]:.3f} lt_sim={f[20]:.3f} st_sim={f[21]:.3f} "
|
| 474 |
+
f"cites={f[2]:.0f} recency={f[6]:.3f} age_days={f[5]:.0f}")
|
| 475 |
+
|
| 476 |
+
if len(sorted_ids) >= 3:
|
| 477 |
+
print(f"\n Feature snapshot (bottom 3 reranked papers):")
|
| 478 |
+
for rank_idx in range(max(0, len(sorted_ids)-3), len(sorted_ids)):
|
| 479 |
+
aid = sorted_ids[rank_idx]
|
| 480 |
+
orig_idx = ids_subset.index(aid)
|
| 481 |
+
f = features[orig_idx]
|
| 482 |
+
print(f" #{rank_idx+1} {aid}:")
|
| 483 |
+
print(f" qdrant_cosine={f[0]:.3f} lt_sim={f[20]:.3f} st_sim={f[21]:.3f} "
|
| 484 |
+
f"cites={f[2]:.0f} recency={f[6]:.3f} age_days={f[5]:.0f}")
|
| 485 |
+
|
| 486 |
+
# Check: did reranking change anything?
|
| 487 |
+
moved = sum(1 for i, aid in enumerate(sorted_ids) if aid != ids_subset[i])
|
| 488 |
+
check(f"Reranker changed {moved}/{n} positions", moved > 0,
|
| 489 |
+
"Reranker should reorder candidates based on features")
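The final check in Step 4 only counts how many positions changed. A per-paper displacement map gives a little more signal and avoids repeated `list.index` lookups. This is a hypothetical helper, not part of the script.

```python
def rank_changes(before: list[str], after: list[str]) -> dict[str, int]:
    """Map each id to (old position - new position); positive means it moved up."""
    old_pos = {pid: i for i, pid in enumerate(before)}
    return {
        pid: old_pos[pid] - new_idx
        for new_idx, pid in enumerate(after)
        if pid in old_pos  # ids absent from the input order are skipped
    }


def count_moved(before: list[str], after: list[str]) -> int:
    """Number of ids whose position differs between the two orderings."""
    return sum(1 for delta in rank_changes(before, after).values() if delta != 0)
```

When `after` is a permutation of `before`, `count_moved(ids_subset, sorted_ids)` yields the same number as the script's `moved` variable.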
|
| 490 |
+
|
| 491 |
+
|
| 492 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 493 |
+
# STEP 5 - MMR DIVERSITY + EXPLORATION
|
| 494 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 495 |
+
|
| 496 |
+
async def step5_diversity(valid_ids, emb_matrix, lt_vec):
|
| 497 |
+
banner("STEP 5: MMR DIVERSITY + EXPLORATION")
|
| 498 |
+
|
| 499 |
+
if valid_ids is None or lt_vec is None:
|
| 500 |
+
print(" SKIPPED: No data from previous steps")
|
| 501 |
+
return
|
| 502 |
+
|
| 503 |
+
n = min(len(valid_ids), 30)
|
| 504 |
+
print(f" Running MMR (lambda=0.6) on {n} candidates, selecting 15...")
|
| 505 |
+
|
| 506 |
+
t0 = time.perf_counter()
|
| 507 |
+
try:
|
| 508 |
+
mmr_ids = mmr_rerank(
|
| 509 |
+
lt_vec, emb_matrix[:n], valid_ids[:n],
|
| 510 |
+
lambda_param=0.6, top_k=15,
|
| 511 |
+
)
|
| 512 |
+
elapsed = (time.perf_counter() - t0) * 1000
|
| 513 |
+
except Exception as e:
|
| 514 |
+
ERRORS.append(f"Step 5: mmr_rerank failed: {e}")
|
| 515 |
+
print(f" ERROR: {e}")
|
| 516 |
+
return
|
| 517 |
+
|
| 518 |
+
print(f" MMR latency: {elapsed:.0f}ms")
|
| 519 |
+
print(f" MMR selected {len(mmr_ids)} papers")
|
| 520 |
+
|
| 521 |
+
# Check rank changes
|
| 522 |
+
moved = sum(1 for i, aid in enumerate(mmr_ids) if i < len(valid_ids) and aid != valid_ids[i])
|
| 523 |
+
print(f" Rank changes vs input: {moved}/{len(mmr_ids)}")
|
| 524 |
+
|
| 525 |
+
# Exploration injection
|
| 526 |
+
with_explore = inject_exploration(mmr_ids, valid_ids[:n], n_explore=2, seed=42)
|
| 527 |
+
explore_count = len(with_explore) - len(mmr_ids)
|
| 528 |
+
print(f" Exploration injected: {explore_count} papers")
|
| 529 |
+
check("Exploration added papers", explore_count > 0 or len(valid_ids[:n]) <= len(mmr_ids))
|
| 530 |
+
|
| 531 |
+
# Check diversity: compute avg pairwise cosine among selected
|
| 532 |
+
selected_embs = []
|
| 533 |
+
for aid in mmr_ids[:10]:
|
| 534 |
+
if aid in valid_ids:
|
| 535 |
+
idx = valid_ids.index(aid)
|
| 536 |
+
if idx < len(emb_matrix):
|
| 537 |
+
selected_embs.append(emb_matrix[idx])
|
| 538 |
+
|
| 539 |
+
if len(selected_embs) >= 2:
|
| 540 |
+
sel_matrix = np.stack(selected_embs)
|
| 541 |
+
norms = sel_matrix / (np.linalg.norm(sel_matrix, axis=1, keepdims=True) + 1e-10)
|
| 542 |
+
sim_matrix = norms @ norms.T
|
| 543 |
+
# Average off-diagonal similarity
|
| 544 |
+
mask = ~np.eye(len(sel_matrix), dtype=bool)
|
| 545 |
+
avg_sim = sim_matrix[mask].mean()
|
| 546 |
+
print(f" Avg pairwise cosine among top 10 MMR picks: {avg_sim:.3f}")
|
| 547 |
+
check("MMR diversity (avg pairwise sim < 0.85)", avg_sim < 0.85,
|
| 548 |
+
f"actual: {avg_sim:.3f}")
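Step 5 calls `mmr_rerank` with `lambda_param=0.6`, but the function body is not part of this diff. As a reference point, here is a dependency-free sketch of the standard greedy Maximal Marginal Relevance selection. It is an assumption about the general technique, not the repository's implementation. The name `mmr_select` is hypothetical, and the argument order only mirrors the call in the script.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-10)


def mmr_select(query_vec, cand_vecs, cand_ids, lambda_param=0.6, top_k=15):
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    relevance = [cosine(query_vec, v) for v in cand_vecs]
    remaining = list(range(len(cand_ids)))
    selected: list[int] = []
    while remaining and len(selected) < top_k:
        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to any already-selected item.
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * relevance[i] - (1.0 - lambda_param) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [cand_ids[i] for i in selected]
```

With `lambda_param=1.0` this reduces to a plain relevance sort. Lower values penalize near-duplicates more heavily, which is the behaviour the average pairwise cosine check in Step 5 is probing.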
|
| 549 |
+
|
| 550 |
+
|
| 551 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 552 |
+
# STEP 6 - GAP ANALYSIS
|
| 553 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 554 |
+
|
| 555 |
+
def step6_gap_analysis():
|
| 556 |
+
banner("STEP 6: GAP ANALYSIS")
|
| 557 |
+
|
| 558 |
+
print(" ERRORS (things that threw exceptions or returned empty):")
|
| 559 |
+
if ERRORS:
|
| 560 |
+
for e in ERRORS:
|
| 561 |
+
print(f" - {e}")
|
| 562 |
+
else:
|
| 563 |
+
print(" (none)")
|
| 564 |
+
|
| 565 |
+
print("\n WRONG OUTPUTS (things that ran but returned bad results):")
|
| 566 |
+
if WRONG_OUTPUTS:
|
| 567 |
+
for w in WRONG_OUTPUTS:
|
| 568 |
+
print(f" - {w}")
|
| 569 |
+
else:
|
| 570 |
+
print(" (none)")
|
| 571 |
+
|
| 572 |
+
print("\n MISSING PIECES (not implemented or not loaded):")
|
| 573 |
+
if MISSING:
|
| 574 |
+
for m in MISSING:
|
| 575 |
+
print(f" - {m}")
|
| 576 |
+
else:
|
| 577 |
+
print(" (none)")
|
| 578 |
+
|
| 579 |
+
print(f"\n Totals: {len(ERRORS)} errors, {len(WRONG_OUTPUTS)} wrong outputs, {len(MISSING)} missing")
|
| 580 |
+
|
| 581 |
+
# Verdict
|
| 582 |
+
total_issues = len(ERRORS) + len(WRONG_OUTPUTS) + len(MISSING)
|
| 583 |
+
if total_issues == 0:
|
| 584 |
+
print("\n VERDICT: ALL CLEAR")
|
| 585 |
+
else:
|
| 586 |
+
print(f"\n VERDICT: {total_issues} issues found")
|
| 587 |
+
|
| 588 |
+
|
| 589 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 590 |
+
# MAIN
|
| 591 |
+
# ═════════════════════════════════════════════════════════════════════════════
|
| 592 |
+
|
| 593 |
+
async def main():
|
| 594 |
+
banner("RESEARCHIT E2E PIPELINE AUDIT")
|
| 595 |
+
print(" Warming up BGE-M3 + services...")
|
| 596 |
+
embed_svc.encode_query("warmup")
|
| 597 |
+
await turso_svc.fetch_metadata_batch(["1706.03762"])
|
| 598 |
+
print(" Ready.\n")
|
| 599 |
+
|
| 600 |
+
# Step 1: Search
|
| 601 |
+
await step1_search()
|
| 602 |
+
|
| 603 |
+
# Step 2: Profiles + Clustering
|
| 604 |
+
lt_vec, st_vec = await step2_profiles()
|
| 605 |
+
|
| 606 |
+
# Step 3: Recommendation feed
|
| 607 |
+
valid_ids, emb_matrix, meta_list = await step3_recommendation_feed(lt_vec, st_vec)
|
| 608 |
+
|
| 609 |
+
# Step 4: Reranker
|
| 610 |
+
await step4_reranker(valid_ids, emb_matrix, meta_list, lt_vec, st_vec)
|
| 611 |
+
|
| 612 |
+
# Step 5: MMR Diversity
|
| 613 |
+
await step5_diversity(valid_ids, emb_matrix, lt_vec)
|
| 614 |
+
|
| 615 |
+
# Step 6: Gap analysis
|
| 616 |
+
step6_gap_analysis()
|
| 617 |
+
|
| 618 |
+
banner("AUDIT COMPLETE")
|
| 619 |
+
|
| 620 |
+
|
| 621 |
+
if __name__ == "__main__":
|
| 622 |
+
asyncio.run(main())
|
|
@@ -0,0 +1,336 @@
| 1 |
+
"""
|
| 2 |
+
Expanded search quality evaluation: realistic user queries.
|
| 3 |
+
|
| 4 |
+
The original eval_search_quality.py uses 21 queries across 5 bands (A-E).
|
| 5 |
+
This script expands to 8 categories that simulate REAL users of an academic
|
| 6 |
+
paper search engine, not just known-item lookups and adversarial tests.
|
| 7 |
+
|
| 8 |
+
Categories:
|
| 9 |
+
F: Beginner / Newcomer - "explain like I'm starting a research project"
|
| 10 |
+
G: Research-in-Progress - "I know the field, looking for specific work"
|
| 11 |
+
H: Implementation-Focused - "I want to BUILD something"
|
| 12 |
+
I: Comparative / Survey - "compare X vs Y" or "survey of Z"
|
| 13 |
+
J: Emerging / Cutting-Edge - "what's new in X?"
|
| 14 |
+
K: Cross-Domain - "applying X from domain A to domain B"
|
| 15 |
+
L: Vague / Exploratory - underspecified queries that real users actually type
|
| 16 |
+
M: Follow-up / Refinement - queries that build on prior context
|
| 17 |
+
|
| 18 |
+
Run: python scripts/eval_expanded_queries.py
|
| 19 |
+
"""
|
| 20 |
+
from __future__ import annotations
|
| 21 |
+
|
| 22 |
+
import asyncio
|
| 23 |
+
import json
|
| 24 |
+
import sys
|
| 25 |
+
import time
|
| 26 |
+
from pathlib import Path
|
| 27 |
+
|
| 28 |
+
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
| 29 |
+
|
| 30 |
+
from app import hybrid_search_svc
|
| 31 |
+
from app import turso_svc
|
| 32 |
+
from app import embed_svc
|
| 33 |
+
from app import groq_svc
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
# ── Query definitions ────────────────────────────────────────────────────────
|
| 37 |
+
|
| 38 |
+
# (band, query, expected_arxiv_id_or_None, description)
|
| 39 |
+
QUERIES: list[tuple[str, str, str | None, str]] = [
|
| 40 |
+
|
| 41 |
+
# ── Band A (original): Known-item titles ─────────────────────────────────
|
| 42 |
+
("A", "attention is all you need", "1706.03762",
|
| 43 |
+
"Landmark transformer paper by Vaswani et al."),
|
| 44 |
+
("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805",
|
| 45 |
+
"Full BERT title - should be exact #1"),
|
| 46 |
+
("A", "Deep Residual Learning for Image Recognition", "1512.03385",
|
| 47 |
+
"ResNet - the most-cited CV paper"),
|
| 48 |
+
|
| 49 |
+
# ── Band F: Beginner / Newcomer queries ──────────────────────────────────
|
| 50 |
+
# These simulate a student or newcomer who doesn't know the jargon.
|
| 51 |
+
("F", "how do transformers work in NLP", None,
|
| 52 |
+
"Newcomer asking about transformer basics"),
|
| 53 |
+
("F", "what is reinforcement learning from human feedback", None,
|
| 54 |
+
"Beginner asking about RLHF - should surface Ouyang/InstructGPT/Christiano"),
|
| 55 |
+
("F", "explain how neural networks learn", None,
|
| 56 |
+
"Very basic - should return foundational/survey papers"),
|
| 57 |
+
("F", "what are diffusion models and how do they generate images", None,
|
| 58 |
+
"Beginner asking about DDPM/Stable Diffusion family"),
|
| 59 |
+
("F", "how does GPT-4 work", None,
|
| 60 |
+
"Newcomer asking about GPT-4 - should surface the technical report"),
|
| 61 |
+
|
| 62 |
+
# ── Band G: Research-in-Progress queries ─────────────────────────────────
|
| 63 |
+
# These simulate a PhD student deep in their research.
|
| 64 |
+
("G", "contrastive learning for self-supervised visual representations", None,
|
| 65 |
+
"Should return SimCLR, MoCo, BYOL, DINO etc."),
|
| 66 |
+
("G", "knowledge distillation from large language models to smaller ones", None,
|
| 67 |
+
"Distillation pipeline - DistilBERT, TinyBERT, knowledge distillation surveys"),
|
| 68 |
+
("G", "graph neural networks for molecular property prediction", None,
|
| 69 |
+
"GNN + chemistry β SchNet, DimeNet, MPNN papers"),
|
| 70 |
+
("G", "efficient inference for large language models quantization pruning", None,
|
| 71 |
+
"LLM compression β GPTQ, AWQ, SparseGPT, pruning surveys"),
|
| 72 |
+
("G", "causal inference in observational studies with machine learning", None,
|
| 73 |
+
"Causal ML β double ML, causal forests, CATE estimation"),
|
| 74 |
+
("G", "multi-task learning with shared representations", None,
|
| 75 |
+
"MTL surveys, hard/soft parameter sharing, task relationships"),
|
| 76 |
+
|
| 77 |
+
# ββ Band H: Implementation-Focused queries βββββββββββββββββββββββββββββββ
|
| 78 |
+
# These simulate someone who wants to BUILD something.
|
| 79 |
+
("H", "how to fine-tune a pre-trained language model for classification", None,
|
| 80 |
+
"Practical fine-tuning β ULMFiT, how-to-fine-tune-BERT papers"),
|
| 81 |
+
("H", "implementing attention mechanism from scratch", None,
|
| 82 |
+
"Implementation-level detail β attention tutorials, scaled dot product"),
|
| 83 |
+
("H", "best practices for training stable diffusion models", None,
|
| 84 |
+
"Practical SD training β latent diffusion, classifier-free guidance"),
|
| 85 |
+
("H", "building a retrieval augmented generation system", None,
|
| 86 |
+
"RAG β should surface the Lewis et al. RAG paper, REALM, etc."),
|
| 87 |
+
("H", "how to do distributed training with PyTorch across GPUs", None,
|
| 88 |
+
"Distributed training β ZeRO, Megatron, FSDP, DeepSpeed papers"),
|
| 89 |
+
|
| 90 |
+
# ββ Band I: Comparative / Survey queries βββββββββββββββββββββββββββββββββ
|
| 91 |
+
# Users who want to understand the landscape.
|
| 92 |
+
("I", "transformer vs CNN for image classification", None,
|
| 93 |
+
"ViT vs ResNet/EfficientNet β should surface comparison papers"),
|
| 94 |
+
("I", "survey of large language models", None,
|
| 95 |
+
"LLM surveys β Zhao et al. survey, Minaee survey"),
|
| 96 |
+
("I", "comparison of object detection architectures YOLO vs DETR", None,
|
| 97 |
+
"YOLO family vs transformer-based detection"),
|
| 98 |
+
("I", "GAN vs diffusion models for image generation", None,
|
| 99 |
+
"Generative model comparison β StyleGAN, DDPM, score matching"),
|
| 100 |
+
("I", "review of federated learning privacy methods", None,
|
| 101 |
+
"FL surveys β McMahan, differential privacy in FL"),
|
| 102 |
+
|
| 103 |
+
# ββ Band J: Emerging / Cutting-Edge queries ββββββββββββββββββββββββββββββ
|
| 104 |
+
# Users looking for the latest developments.
|
| 105 |
+
("J", "mixture of experts models scaling", None,
|
| 106 |
+
"MoE β Switch Transformer, Mixtral, GShard"),
|
| 107 |
+
("J", "test-time compute scaling for reasoning", None,
|
| 108 |
+
"New paradigm β o1-style reasoning, tree search at inference"),
|
| 109 |
+
("J", "multimodal large language models vision and text", None,
|
| 110 |
+
"GPT-4V, LLaVA, Flamingo, multimodal LLMs"),
|
| 111 |
+
("J", "state space models as alternative to transformers", None,
|
| 112 |
+
"S4, Mamba, H3 β structured state space models"),
|
| 113 |
+
("J", "constitutional AI and AI safety alignment techniques", None,
|
| 114 |
+
"Anthropic constitutional AI, RLHF alternatives, safety"),
|
| 115 |
+
("J", "sparse attention mechanisms for long context", None,
|
| 116 |
+
"Longformer, BigBird, sparse transformers for 100K+ context"),
|
| 117 |
+
|
| 118 |
+
# ββ Band K: Cross-Domain queries βββββββββββββββββββββββββββββββββββββββββ
|
| 119 |
+
# Users applying ML to their specific domain.
|
| 120 |
+
("K", "deep learning for protein structure prediction", None,
|
| 121 |
+
"AlphaFold, ESMFold, protein language models"),
|
| 122 |
+
("K", "natural language processing for legal document analysis", None,
|
| 123 |
+
"Legal NLP β contract analysis, legal BERT, court opinion mining"),
|
| 124 |
+
("K", "machine learning for climate change prediction", None,
|
| 125 |
+
"Climate ML β weather forecasting, carbon modeling"),
|
| 126 |
+
("K", "using transformers for time series forecasting", None,
|
| 127 |
+
"Time series transformers β Informer, Autoformer, PatchTST"),
|
| 128 |
+
("K", "reinforcement learning for robotics manipulation", None,
|
| 129 |
+
"RL + robotics β sim-to-real transfer, dexterous manipulation"),
|
| 130 |
+
|
| 131 |
+
# ββ Band L: Vague / Exploratory queries ββββββββββββββββββββββββββββββββββ
|
| 132 |
+
# Underspecified queries that real users actually type.
|
| 133 |
+
("L", "AI ethics", None,
|
| 134 |
+
"Very broad β should return survey-level papers on AI ethics/fairness/bias"),
|
| 135 |
+
("L", "embedding", None,
|
| 136 |
+
"Single word β highly ambiguous. Word2Vec? Sentence embeddings? Image embeddings?"),
|
| 137 |
+
("L", "language model", None,
|
| 138 |
+
"Broad β should return influential LM papers or surveys"),
|
| 139 |
+
("L", "generate images from text", None,
|
| 140 |
+
"Casual β should surface DALL-E, Stable Diffusion, Imagen"),
|
| 141 |
+
("L", "make AI more safe", None,
|
| 142 |
+
"Very casual β should surface alignment/safety papers"),
|
| 143 |
+
|
| 144 |
+
# ββ Band M: Follow-up / Refinement queries βββββββββββββββββββββββββββββββ
|
| 145 |
+
# Simulate a user who already found something and wants more.
|
| 146 |
+
("M", "improvements to the original transformer architecture", None,
|
| 147 |
+
"Post-Vaswani improvements β Reformer, Performer, ALiBi, RoPE"),
|
| 148 |
+
("M", "papers that cite ResNet and extend residual connections", None,
|
| 149 |
+
"ResNet extensions β DenseNet, ResNeXt, WideResNet, SE-Net"),
|
| 150 |
+
("M", "alternatives to RLHF for aligning language models", None,
|
| 151 |
+
"DPO, SPIN, KTO β methods that bypass reward modeling"),
|
| 152 |
+
("M", "BERT variants for low resource languages", None,
|
| 153 |
+
"mBERT, XLM-R, AfricanBERT, ArabBERT β multilingual BERT variants"),
|
| 154 |
+
]


# ── Wire rewrite logging ─────────────────────────────────────────────────────

# Monkeypatch groq_svc.rewrite so each query's rewritten form is recorded
# and can be reported alongside the search results.
_rewrite_log: dict[str, str] = {}
_original_rewrite = groq_svc.rewrite


async def _logging_rewrite(q: str) -> str:
    r = await _original_rewrite(q)
    _rewrite_log[q] = r
    return r


groq_svc.rewrite = _logging_rewrite
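
The logging monkeypatch above can be exercised in isolation. This is a minimal sketch, assuming a stand-in `fake_rewrite` coroutine in place of `groq_svc.rewrite`; the stub, `with_logging`, and `rewrite_fired` are hypothetical names, and only the wrapping pattern mirrors the script.

```python
import asyncio

rewrite_log: dict[str, str] = {}


async def fake_rewrite(q: str) -> str:
    """Hypothetical stand-in for groq_svc.rewrite: expands one known query."""
    return "reinforcement learning from human feedback" if q == "RLHF" else q


def with_logging(fn):
    """Wrap an async rewrite function so each (query, rewrite) pair is recorded."""
    async def wrapper(q: str) -> str:
        result = await fn(q)
        rewrite_log[q] = result
        return result
    return wrapper


logged_rewrite = with_logging(fake_rewrite)


def rewrite_fired(query: str) -> bool:
    """True when the logged rewrite differs from the original query."""
    return rewrite_log.get(query, query).strip() != query.strip()


asyncio.run(logged_rewrite("RLHF"))
asyncio.run(logged_rewrite("attention is all you need"))
```

A query that never reached the rewriter falls back to itself in the lookup, so it is reported as "skipped or no change" rather than raising a `KeyError`.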


# ── Per-query evaluation ─────────────────────────────────────────────────────

async def eval_query(
    band: str, query: str, expected_id: str | None, description: str
) -> dict:
    """Run one query end-to-end and return structured results."""
    t0 = time.perf_counter()
    results = await hybrid_search_svc.search(query, limit=10)
    elapsed_ms = (time.perf_counter() - t0) * 1000

    # A query that never reached the rewriter falls back to itself.
    rewrite = _rewrite_log.get(query, query)
    rewrite_fired = rewrite.strip() != query.strip()

    titles: dict[str, str] = {}
    categories: dict[str, str] = {}
    if results:
        meta = await turso_svc.fetch_metadata_batch(results)
        titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}
        categories = {aid: (m.get("primary_topic") or "?") for aid, m in meta.items()}

    # Print formatted output
    print()
    print(f"[{band}] {query!r}")
    print(f" intent: {description}")
    if rewrite_fired:
        print(f" rewrite: {rewrite!r}")
    else:
        print(" rewrite: (skipped or no change)")

    # Known-item queries get a PASS / PARTIAL / FAIL verdict.
    if expected_id is not None:
        if results and results[0] == expected_id:
            verdict = f"PASS - {expected_id} at #1"
        elif expected_id in results:
            rank = results.index(expected_id) + 1
            verdict = f"PARTIAL - {expected_id} at rank #{rank}"
        else:
            verdict = f"FAIL - {expected_id} NOT in top 10"
        print(f" verdict: {verdict}")

    print(f" latency: {elapsed_ms:.0f} ms | results: {len(results)}")

    if not results:
        print(" (no results returned)")
    else:
        for i, aid in enumerate(results, 1):
            title = titles.get(aid, "(title unavailable)")
            cat = categories.get(aid, "?")
            if len(title) > 75:
                title = title[:72] + "..."
            # Two-character marker so starred and unstarred rows stay aligned.
            marker = " *" if expected_id and aid == expected_id else "  "
            print(f" {i:2d}.{marker}{aid:14s} [{cat:20s}] {title}")

    # Compute topic diversity
    unique_cats = set(categories.values()) - {"?"}

    return {
        "band": band,
        "query": query,
        "description": description,
        "rewrite": rewrite if rewrite_fired else None,
        "latency_ms": elapsed_ms,
        "n_results": len(results),
        "results": [
            {"rank": i+1, "arxiv_id": aid, "title": titles.get(aid, ""),
             "category": categories.get(aid, "?")}
            for i, aid in enumerate(results)
        ],
        "expected_id": expected_id,
        "expected_found": expected_id in results if expected_id else None,
        "expected_rank": results.index(expected_id) + 1 if expected_id and expected_id in results else None,
        "topic_diversity": len(unique_cats),
    }
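
The known-item verdict inside `eval_query` can be pulled out as a pure function, which makes the PASS / PARTIAL / FAIL rule easy to check without running a search. A minimal sketch; `known_item_verdict` and its `limit` parameter are hypothetical names that do not appear in the script.

```python
def known_item_verdict(results: list[str], expected_id: str, limit: int = 10) -> str:
    """Classify a ranked list of arXiv ids against one expected id."""
    if results and results[0] == expected_id:
        return f"PASS - {expected_id} at #1"
    if expected_id in results:
        rank = results.index(expected_id) + 1  # 1-based rank
        return f"PARTIAL - {expected_id} at rank #{rank}"
    return f"FAIL - {expected_id} NOT in top {limit}"


ranked = ["1810.04805", "1706.03762", "1512.03385"]
print(known_item_verdict(ranked, "1810.04805"))
print(known_item_verdict(ranked, "1706.03762"))
print(known_item_verdict(ranked, "2005.14165"))
```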


async def main():
    print("=" * 100)
    print("EXPANDED SEARCH EVALUATION - Realistic User Queries")
    print(f"Total queries: {len(QUERIES)} | Bands: {sorted(set(b for b,_,_,_ in QUERIES))}")
    print("=" * 100)

    # Warm-up: load the encoder and open the Turso connection before timing
    print("\nWarming up BGE-M3 + Turso...")
    t0 = time.perf_counter()
    embed_svc.encode_query("warmup query for the eval harness")
    await turso_svc.fetch_metadata_batch(["1706.03762"])
    print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")

    all_results: list[dict] = []
    band_results: dict[str, list[dict]] = {}

    for band, query, expected, description in QUERIES:
        result = await eval_query(band, query, expected, description)
        all_results.append(result)
        band_results.setdefault(band, []).append(result)

    # ── Summary ─────────────────────────────────────────────────────────────
    print("\n" + "=" * 100)
    print("SUMMARY")
    print("=" * 100)

    # Band A: known-item hit rate
    if "A" in band_results:
        a_rows = band_results["A"]
        hits = sum(1 for r in a_rows if r["expected_rank"] == 1)
        total = len(a_rows)
        print(f"\nBand A (known-item): {hits}/{total} top-1 hits")

    # Per-band stats
    print("\nPer-Band Results:")
    print(f" {'Band':<6} {'Queries':>7} {'Avg Latency':>12} {'Avg Results':>12} {'Avg Topics':>11} Description")
    print(f" {'-'*6} {'-'*7} {'-'*12} {'-'*12} {'-'*11} {'-'*40}")

    band_labels = {
        "A": "Known-item titles",
        "F": "Beginner / Newcomer",
        "G": "Research-in-Progress",
        "H": "Implementation-Focused",
        "I": "Comparative / Survey",
        "J": "Emerging / Cutting-Edge",
        "K": "Cross-Domain",
        "L": "Vague / Exploratory",
        "M": "Follow-up / Refinement",
    }

    for band in sorted(band_results.keys()):
        rows = band_results[band]
        n = len(rows)
        avg_lat = sum(r["latency_ms"] for r in rows) / n
        avg_res = sum(r["n_results"] for r in rows) / n
        avg_div = sum(r["topic_diversity"] for r in rows) / n
        label = band_labels.get(band, "")
        print(f" {band:<6} {n:>7} {avg_lat:>10.0f}ms {avg_res:>12.1f} {avg_div:>11.1f} {label}")

    # Overall latency (index-based percentiles over the sorted list)
    all_lat = [r["latency_ms"] for r in all_results]
    all_lat.sort()
    n = len(all_lat)
    p50 = all_lat[n // 2]
    p95 = all_lat[max(0, int(n * 0.95) - 1)]
    print(f"\nOverall Latency (n={n}): mean {sum(all_lat)/n:.0f} ms "
          f"p50 {p50:.0f} ms p95 {p95:.0f} ms max {max(all_lat):.0f} ms")

    # Rewrite analysis
    rewrites = [(r["query"], r["rewrite"]) for r in all_results if r["rewrite"]]
    skips = [r["query"] for r in all_results if not r["rewrite"]]
    print(f"\nGroq Rewriter: {len(rewrites)} fired, {len(skips)} skipped")

    # Zero-result queries
    zeros = [r["query"] for r in all_results if r["n_results"] == 0]
    if zeros:
        print(f"\nWARNING: ZERO RESULTS ({len(zeros)}):")
        for q in zeros:
            print(f" - {q!r}")
    else:
        print("\nOK: All queries returned results")

    # Save JSON for comparison
    out_path = Path(__file__).parent / "expanded_eval_results.json"
    with open(out_path, "w") as f:
        json.dump(all_results, f, indent=2, default=str)
    print(f"\nResults saved to: {out_path}")


if __name__ == "__main__":
    asyncio.run(main())
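
The latency summary in `main()` picks percentiles by index: `n // 2` for p50 and `int(n * 0.95) - 1` for p95. For comparison, here is a hedged sketch of the textbook nearest-rank percentile next to the script's p95 indexing. `nearest_rank` and `script_p95` are hypothetical helper names, and the latency values are synthetic.

```python
import math


def nearest_rank(sorted_vals: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    rank = math.ceil(pct * len(sorted_vals) / 100)  # 1-based rank
    return sorted_vals[max(rank, 1) - 1]


def script_p95(sorted_vals: list[float]) -> float:
    """The indexing used in main(): int(n * 0.95) - 1, clamped at 0."""
    n = len(sorted_vals)
    return sorted_vals[max(0, int(n * 0.95) - 1)]


# Synthetic latencies 1.0 .. 50.0 ms, already sorted.
latencies = [float(ms) for ms in range(1, 51)]
print(nearest_rank(latencies, 95), script_p95(latencies))
```

With 50 samples the two definitions differ by one position, so the script's p95 reads slightly lower than nearest-rank. At this sample size the gap is small, but it is worth knowing when comparing against numbers from other tools.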
@@ -0,0 +1,547 @@
"""
Recommendation engine evaluation harness.

Bypasses HTTP and calls the same pipeline functions the router uses,
with full DB setup/cleanup per scenario. Each scenario probes a specific
behavior (which tier fired, how many clusters formed, whether suppression
removed disliked categories, etc.) rather than just "did we get results."

Run: python scripts/eval_recs_quality.py
"""
from __future__ import annotations

import asyncio
import sys
import time
import uuid
from collections import Counter
from pathlib import Path

import numpy as np
import aiosqlite

# Force UTF-8 stdout so unicode glyphs (>=, ->, etc.) don't crash on Windows cp1252
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from app import qdrant_svc, db, turso_svc, user_state as us
from app.config import REC_LIMIT, DB_PATH
from app.recommend import profiles
from app.recommend.clustering import (
    compute_clusters, MIN_PAPERS_FOR_CLUSTERING,
)
from app.routers.recommendations import (
    _multi_interest_recommend, _ewma_recommend,
)


# ── Curated paper ids (verified-famous papers in each domain) ────────────────

NLP_PAPERS = [
    ("1706.03762", "Attention Is All You Need"),
    ("1810.04805", "BERT"),
    ("2005.14165", "GPT-3"),
    ("1907.11692", "RoBERTa"),
    ("1910.10683", "T5"),
    ("2203.02155", "InstructGPT"),
    ("2201.11903", "CoT Prompting"),
    ("2307.09288", "Llama 2"),
]

CV_PAPERS = [
    ("1512.03385", "ResNet"),
    ("2010.11929", "Vision Transformer"),
    ("1409.1556", "VGG"),
    ("1505.04597", "U-Net"),
    ("2103.14030", "Swin Transformer"),
    ("2104.14294", "DINO"),
    ("2112.10752", "Latent Diffusion"),
    ("1311.2524", "R-CNN"),
]

ML_THEORY_PAPERS = [
    # cs.LG / stat.ML - used for negative-suppression test
    ("1607.06450", "Layer Normalization"),
    ("1502.03167", "Batch Normalization"),
    ("1412.6980", "Adam optimizer"),
    ("1411.1784", "Conditional GAN"),
]


# ── User setup / teardown helpers ────────────────────────────────────────────

async def setup_user(
    user_id: str,
    save_ids: list[str],
    dismiss_ids: list[str] | None = None,
    onboarding_categories: list[str] | None = None,
) -> object:
    """Build a test user from scratch: saves, dismisses, EWMA, in-memory state."""
    dismiss_ids = dismiss_ids or []

    if onboarding_categories:
        await db.save_onboarding_categories(user_id, onboarding_categories)

    # Pre-fetch all vectors in one batch
    all_ids = save_ids + dismiss_ids
    vecs = await qdrant_svc.get_paper_vectors(all_ids) if all_ids else {}

    # Cache metadata so category suppression / display work
    if all_ids:
        meta = await turso_svc.fetch_metadata_batch(all_ids)
        if meta:
            await db.cache_turso_metadata_batch(list(meta.values()))

    state = await us.ensure_loaded(user_id)

    for pid in save_ids:
        if pid not in vecs:
            print(f" [setup] WARNING: {pid} not in Qdrant; skipping")
            continue
        state.add_positive(pid)
        emb = np.array(vecs[pid], dtype=np.float32)
        await profiles.update_on_save(user_id, emb)
        await db.log_interaction(user_id, pid, "save")

    for pid in dismiss_ids:
        if pid not in vecs:
            continue
        state.add_negative(pid)
        emb = np.array(vecs[pid], dtype=np.float32)
        await profiles.update_on_dismiss(user_id, emb)
        await db.log_interaction(user_id, pid, "not_interested")

    return state


async def cleanup_user(user_id: str) -> None:
    """Wipe all DB rows + in-memory cache for a test user."""
    async with aiosqlite.connect(DB_PATH) as conn:
        for sql in [
            "DELETE FROM interactions WHERE user_id = ?",
            "DELETE FROM user_profiles WHERE user_id = ?",
            "DELETE FROM user_clusters WHERE user_id = ?",
            "DELETE FROM user_onboarding WHERE user_id = ?",
            "DELETE FROM cluster_snapshots WHERE user_id = ?",
        ]:
            try:
                await conn.execute(sql, (user_id,))
            except Exception:
                # Best-effort cleanup: a failing DELETE (for example, a table
                # missing from this schema) must not abort the remaining ones.
                pass
        await conn.commit()
    if user_id in us._cache:
        del us._cache[user_id]


# ── Pipeline runner (mirrors get_recommendations() cascade) ──────────────────

async def run_pipeline(user_id: str, state) -> tuple[str, list[str], dict, float]:
    """Returns (tier_label, rec_ids, paper_tags, latency_ms)."""
    seen = us.all_seen(user_id)
    n_saves = len(state.positive_list)

    t0 = time.perf_counter()

    # Tier 0: cold-start (no saves) -> trending by category
    if n_saves == 0:
        cat_filter = await db.get_user_category_filter(user_id)
        if cat_filter:
            trending = await turso_svc.fetch_trending_by_categories(
                cat_filter, limit=REC_LIMIT,
            )
            elapsed = (time.perf_counter() - t0) * 1000
            return ("Tier 0 trending",
                    [t["arxiv_id"] for t in trending],
                    {}, elapsed)
        elapsed = (time.perf_counter() - t0) * 1000
        return ("EMPTY (no onboarding)", [], {}, elapsed)

    # Tier 1: >=5 saves -> multi-interest clustering + quota
    if n_saves >= MIN_PAPERS_FOR_CLUSTERING:
        rec_ids, paper_tags = await _multi_interest_recommend(
            user_id, state, seen, REC_LIMIT, query_id="eval-test",
        )
        if rec_ids:
            elapsed = (time.perf_counter() - t0) * 1000
            return ("Tier 1 multi-interest", rec_ids, paper_tags, elapsed)

    # Tier 2: >=3 saves (EWMA threshold internally) -> single-vector search
    rec_ids = await _ewma_recommend(user_id, seen, REC_LIMIT)
    if rec_ids:
        elapsed = (time.perf_counter() - t0) * 1000
        return ("Tier 2 EWMA", rec_ids, {}, elapsed)

    # Tier 3: >=1 save -> Qdrant Recommend with raw IDs
    rec_ids = await qdrant_svc.recommend(
        positive_arxiv_ids=state.positive_list,
        negative_arxiv_ids=state.negative_list,
        seen_arxiv_ids=seen,
        limit=REC_LIMIT,
    )
    elapsed = (time.perf_counter() - t0) * 1000
    if rec_ids:
        return ("Tier 3 Qdrant Recommend", rec_ids, {}, elapsed)
    return ("EMPTY (all tiers exhausted)", [], {}, elapsed)
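
The cascade in `run_pipeline` tries tiers in a fixed order and returns the first one that yields results. The order alone can be sketched as a pure function of the save count. This is a sketch under stated assumptions: `tier_order` is a hypothetical helper, and the default threshold of 5 is taken from the "Tier 1: >=5 saves" comment, not from the actual value of `MIN_PAPERS_FOR_CLUSTERING`.

```python
def tier_order(n_saves: int, has_onboarding: bool, min_for_clustering: int = 5) -> list[str]:
    """Return the tiers run_pipeline would attempt, in order, for a given user."""
    if n_saves == 0:
        # Cold start: trending only, and only if onboarding categories exist.
        return ["Tier 0 trending"] if has_onboarding else []
    tiers: list[str] = []
    if n_saves >= min_for_clustering:
        tiers.append("Tier 1 multi-interest")
    # EWMA and Qdrant Recommend are always attempted as fallbacks once a save exists.
    tiers += ["Tier 2 EWMA", "Tier 3 Qdrant Recommend"]
    return tiers


for saves in (0, 1, 3, 5):
    print(saves, tier_order(saves, has_onboarding=True))
```

Which tier actually fires still depends on each stage returning a non-empty list. For example, the EWMA path applies its own internal save threshold, so a single-save user falls through to Tier 3.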


async def report_results(rec_ids: list[str], paper_tags: dict) -> tuple[Counter, Counter]:
    """Print top-10 with category and cluster origin. Return (cat_counts, source_counts)."""
    if not rec_ids:
        print(" (no results)")
        return Counter(), Counter()

    meta = await turso_svc.fetch_metadata_batch(rec_ids)
    cats: Counter = Counter()
    sources: Counter = Counter()

    for i, aid in enumerate(rec_ids, 1):
        m = meta.get(aid, {})
        # Use `or` so a stored None does not break len() or the format spec.
        title = m.get("title") or "(no title)"
        if len(title) > 65:
            title = title[:62] + "..."
        cat = m.get("category") or "?"
        cats[cat] += 1
        tag = paper_tags.get(aid, {}) if paper_tags else {}
        source = tag.get("candidate_source", "")
        sources[source] += 1
        src_short = f" [{source}]" if source else ""
        print(f" {i:2d}. {aid:13s} {cat:14s} {title}{src_short}")

    return cats, sources


# ── Scenarios ────────────────────────────────────────────────────────────────

async def scenario_1_cold_with_onboarding():
    """Tier 0: zero saves, NLP categories selected during onboarding."""
    user_id = f"eval-recs-1-{uuid.uuid4().hex[:6]}"
    print("\n" + "=" * 100)
    print("S1 Cold-start with onboarding categories (NLP)")
    print(" Expect: Tier 0 trending; results in NLP-adjacent friendly categories")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=[], onboarding_categories=["nlp"])
        state = await us.ensure_loaded(user_id)
        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, _ = await report_results(rec_ids, tags)
        nlp_count = sum(
            c for k, c in cats.items()
            if k in {"AI/ML", "NLP/Computational Linguistics"} or k.startswith("cs.CL")
        )
        verdict = "PASS" if tier.startswith("Tier 0") and len(rec_ids) >= 5 else \
            "FAIL (Tier 0 broken - fetch_trending_by_categories returned 0)"
        print(f" Categories: {dict(cats)} --> NLP count: {nlp_count}/{len(rec_ids)}")
        print(f" VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_2_single_save():
    """Tier 3: 1 save, expect Qdrant Recommend nearest-neighbors."""
    user_id = f"eval-recs-2-{uuid.uuid4().hex[:6]}"
    print("\n" + "=" * 100)
    print("S2 Single save (Vaswani Attention)")
    print(" Expect: Tier 3 Qdrant Recommend; results semantically near saved paper")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=["1706.03762"])
        state = await us.ensure_loaded(user_id)
        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, _ = await report_results(rec_ids, tags)
        ml_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
        verdict = "PASS" if tier.startswith("Tier 3") and ml_count >= 6 else "PARTIAL"
        print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {ml_count}/10")
        print(f" VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_3_three_nlp_saves():
    """Tier 2: 3 same-domain saves, expect EWMA single-vector search."""
    user_id = f"eval-recs-3-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:3]]
    print("\n" + "=" * 100)
    print("S3 Three NLP saves")
    print(f" Saved: {save_ids}")
    print(" Expect: Tier 2 EWMA single-vector; results NLP-coherent")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)
        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, _ = await report_results(rec_ids, tags)
        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
        verdict = "PASS" if tier.startswith("Tier 2") and nlp_count >= 7 else "PARTIAL"
        print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {nlp_count}/10")
        print(f" VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_4_five_nlp_saves_single_cluster():
    """Tier 1, single interest: expect K=1 cluster, NLP-only output."""
    user_id = f"eval-recs-4-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
    print("\n" + "=" * 100)
    print("S4 Five NLP saves (single interest)")
    print(f" Saved: {save_ids}")
    print(" Expect: Tier 1; 1 or few clusters; ML/NLP-coherent output")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)
        # Inspect clusters explicitly
        vecs = await qdrant_svc.get_paper_vectors(save_ids)
        embs = np.array([vecs[p] for p in save_ids if p in vecs], dtype=np.float32)
        clusters = compute_clusters([p for p in save_ids if p in vecs], embs)
        print(f" Clusters formed: K={len(clusters)}")
        for c in clusters:
            print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")

        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, _ = await report_results(rec_ids, tags)
        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
        verdict = "PASS" if tier.startswith("Tier 1") and nlp_count >= 7 else "PARTIAL"
        print(f" Categories: {dict(cats)} --> AI/ML + NLP count: {nlp_count}/10")
        print(f" VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_5_multi_interest_balanced():
    """Tier 1, the headline test for quota fusion."""
    user_id = f"eval-recs-5-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
    print("\n" + "=" * 100)
    print("S5 Multi-interest (5 NLP + 5 CV) -- THE HEADLINE QUOTA TEST")
    print(" Saved: 5x NLP + 5x CV")
    print(" Expect: K>=2 clusters, both interests visible, neither cluster swamps")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)
        # Inspect clusters
        vecs = await qdrant_svc.get_paper_vectors(save_ids)
        aligned = [p for p in save_ids if p in vecs]
        embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
        clusters = compute_clusters(aligned, embs)
        print(f" Clusters formed: K={len(clusters)}")
        for c in clusters:
            print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")

        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, sources = await report_results(rec_ids, tags)
        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
        cv_count = sum(c for k, c in cats.items() if k == "Computer Vision")
        print(f" NLP (AI/ML + NLP): {nlp_count} CV (Computer Vision): {cv_count}")
        print(f" Cluster origin counts: {dict(sources)}")
        smaller = min(nlp_count, cv_count) if (nlp_count and cv_count) else 0
        verdict = "PASS" if len(clusters) >= 2 and smaller >= 3 else "FAIL"
        print(f" VERDICT: {verdict} (floor=3 enforced: {smaller >= 3})")
    finally:
        await cleanup_user(user_id)


async def scenario_6_multi_interest_imbalanced():
    """Tier 1, imbalanced split: does the floor=3 rescue the minority?"""
    user_id = f"eval-recs-6-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:8]] + [pid for pid, _ in CV_PAPERS[:2]]
    print("\n" + "=" * 100)
    print("S6 Multi-interest imbalanced (8 NLP + 2 CV) -- FLOOR TEST")
    print(" Expect: if K>=2, CV gets >=3 slots even though importance is ~80/20")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)
        vecs = await qdrant_svc.get_paper_vectors(save_ids)
        aligned = [p for p in save_ids if p in vecs]
        embs = np.array([vecs[p] for p in aligned], dtype=np.float32)
        clusters = compute_clusters(aligned, embs)
        print(f" Clusters formed: K={len(clusters)}")
        for c in clusters:
            print(f" cluster {c.cluster_idx}: medoid={c.medoid_paper_id} importance={c.importance:.3f} size={len(c.paper_ids)}")

        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, sources = await report_results(rec_ids, tags)
        nlp_count = sum(c for k, c in cats.items() if k in {"AI/ML", "NLP/Computational Linguistics"})
        cv_count = sum(c for k, c in cats.items() if k == "Computer Vision")
        print(f" NLP: {nlp_count} CV: {cv_count} Cluster sources: {dict(sources)}")
        if len(clusters) >= 2:
            verdict = "PASS" if cv_count >= 3 else "FAIL (floor not enforced)"
        else:
            verdict = "AMBIGUOUS (only 1 cluster formed - floor doesn't apply)"
        print(f" VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_7_category_suppression():
    """Tier 1 with dismissals: 'Computer Vision' should be suppressed."""
    # Save 5 NLP, dismiss 3 CV: non-overlapping friendly categories
    user_id = f"eval-recs-7-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:5]]
    dismiss_ids = [pid for pid, _ in CV_PAPERS[:3]]
    print("\n" + "=" * 100)
    print("S7 Category suppression (5 NLP saves + 3 CV dismissals)")
    print(" Expect: 'Computer Vision' suppressed; zero CV papers in output")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids, dismiss_ids=dismiss_ids)
        state = await us.ensure_loaded(user_id)
        suppressed = await db.get_suppressed_categories(user_id)
        print(f" Suppressed categories detected: {suppressed}")

        tier, rec_ids, tags, lat = await run_pipeline(user_id, state)
        print(f" Tier: {tier} ({lat:.0f} ms) Returned: {len(rec_ids)}")
        cats, _ = await report_results(rec_ids, tags)
        cv_count = cats.get("Computer Vision", 0)
        if cv_count > 0:
            verdict = "FAIL (CV leaked through)"
        elif "Computer Vision" in suppressed:
            verdict = "PASS"
        else:
            verdict = "PARTIAL (no CV in output, but 'Computer Vision' not in suppressed set)"
        print(f" CV count in output: {cv_count} VERDICT: {verdict}")
    finally:
        await cleanup_user(user_id)


async def scenario_8_hungarian_stability():
    """Cluster IDs should remain stable across reclusterings when one new save is added."""
    user_id = f"eval-recs-8-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
    new_save = NLP_PAPERS[5][0]  # 6th NLP paper (added later)
    print("\n" + "=" * 100)
    print("S8 Hungarian cluster-ID stability")
    print(" Run pipeline once -> save 1 more NLP paper -> run again")
    print(" Expect: same cluster_idx assigned to NLP cluster across runs")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)

        # First run
        await run_pipeline(user_id, state)
        clusters_v1 = await db.get_user_clusters(user_id)
        v1 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v1}
        print(f" After run 1: {sorted(v1)}")

        # Add one more NLP paper
        more_vecs = await qdrant_svc.get_paper_vectors([new_save])
        if new_save in more_vecs:
            state.add_positive(new_save)
            await profiles.update_on_save(user_id, np.array(more_vecs[new_save], dtype=np.float32))
            await db.log_interaction(user_id, new_save, "save")

        # Second run
        await run_pipeline(user_id, state)
        clusters_v2 = await db.get_user_clusters(user_id)
        v2 = {(c["cluster_idx"], c["medoid_paper_id"]) for c in clusters_v2}
        print(f" After run 2: {sorted(v2)}")

        # Stability check: every cluster_idx that existed in v1 must still exist
        # in v2. The medoid paper may change, but the index must not churn.
        idx_v1 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v1}
        idx_v2 = {c["cluster_idx"]: c["medoid_paper_id"] for c in clusters_v2}
        stable = all(k in idx_v2 for k in idx_v1)
        print(f" Cluster IDs in v1: {sorted(idx_v1.keys())} v2: {sorted(idx_v2.keys())}")
        print(f" VERDICT: {'PASS (all v1 cluster_idx preserved)' if stable else 'FAIL (cluster_idx churned)'}")
    finally:
        await cleanup_user(user_id)


async def scenario_9_latency():
    """Latency sanity: full Tier 1 pipeline on 10 saved papers."""
    user_id = f"eval-recs-9-{uuid.uuid4().hex[:6]}"
    save_ids = [pid for pid, _ in NLP_PAPERS[:5]] + [pid for pid, _ in CV_PAPERS[:5]]
    print("\n" + "=" * 100)
    print("S9 Latency sanity (Tier 1, 10 saved papers)")
    print(" Expect: <30 ms compute (excluding metadata I/O); end-to-end <2s")
    print("=" * 100)
    try:
        await setup_user(user_id, save_ids=save_ids)
        state = await us.ensure_loaded(user_id)
        # Warm: run once to load profiles
        await run_pipeline(user_id, state)
        # Time multiple runs
        runs = []
        for i in range(3):
            tier, _, _, lat = await run_pipeline(user_id, state)
            runs.append(lat)
            print(f" Run {i+1}: {tier} {lat:.0f} ms")
        print(f" Mean: {sum(runs)/len(runs):.0f} ms Min: {min(runs):.0f} ms Max: {max(runs):.0f} ms")
        # The 30 ms compute target excludes Qdrant + Turso I/O; the full e2e time includes them
        e2e_pass = max(runs) < 2000
        print(f" VERDICT: {'PASS (e2e <2s)' if e2e_pass else 'PARTIAL (over 2s e2e, investigate)'}")
    finally:
        await cleanup_user(user_id)


# ── Pre-flight + main ──

async def preflight():
    """Verify all curated paper IDs exist in Qdrant before running."""
    all_ids = [p[0] for p in NLP_PAPERS + CV_PAPERS + ML_THEORY_PAPERS]
    vecs = await qdrant_svc.get_paper_vectors(all_ids)
    missing = [pid for pid in all_ids if pid not in vecs]
    if missing:
        print(f"WARNING: {len(missing)} curated IDs not in Qdrant: {missing}")
        print("Some scenarios may produce skewed results.")
    else:
        print(f"Pre-flight: all {len(all_ids)} curated paper IDs present in Qdrant.")


async def wipe_all_eval_users():
    """Belt-and-braces cleanup: remove any eval-recs-* users left from crashes."""
    async with aiosqlite.connect(DB_PATH) as conn:
        for tbl in ["interactions", "user_profiles", "user_clusters",
                    "user_onboarding", "cluster_snapshots"]:
            try:
                await conn.execute(f"DELETE FROM {tbl} WHERE user_id LIKE ?", ("eval-recs-%",))
            except Exception:
                # Best-effort: skip any table that cannot be wiped and carry on.
                pass
        await conn.commit()


async def main():
    print("=" * 100)
    print("RECOMMENDATION ENGINE EVALUATION")
    print("=" * 100)
    await db.init_db()
    await wipe_all_eval_users()
    await preflight()

    scenarios = [
        scenario_1_cold_with_onboarding,
        scenario_2_single_save,
        scenario_3_three_nlp_saves,
        scenario_4_five_nlp_saves_single_cluster,
        scenario_5_multi_interest_balanced,
        scenario_6_multi_interest_imbalanced,
        scenario_7_category_suppression,
        scenario_8_hungarian_stability,
        scenario_9_latency,
    ]

    for s in scenarios:
        try:
            await s()
        except Exception as e:
            import traceback
            print(f" SCENARIO ERROR: {e}")
            traceback.print_exc()

    # Final safety wipe in case any cleanup_user calls failed
    await wipe_all_eval_users()
    print("\n" + "=" * 100)
    print("DONE: all eval-recs-* users wiped from DB")
    print("=" * 100)


if __name__ == "__main__":
    asyncio.run(main())
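
Scenarios S5 and S6 above test a per-cluster floor ("floor=3"). The repo's `allocate_quotas` is not shown in this diff, so the following is only a minimal sketch of one way such a floor could work: every cluster first gets `min_slots`, then the remaining slots are split in proportion to importance with largest-remainder rounding. The name `allocate_quotas_sketch` and its rounding rule are assumptions for illustration, not the project's implementation.

```python
import math


def allocate_quotas_sketch(importances, total_slots, min_slots):
    """Hypothetical floor-then-proportional allocation (not the repo's code)."""
    k = len(importances)
    if k == 0:
        return []
    if k * min_slots > total_slots:
        raise ValueError("not enough slots to give every cluster its floor")

    remaining = total_slots - k * min_slots
    total_importance = sum(importances)
    if total_importance <= 0:
        # Degenerate input: treat all clusters as equally important.
        weights = [1.0 / k] * k
    else:
        weights = [w / total_importance for w in importances]

    # Proportional share of the slots left after the floor is handed out.
    shares = [remaining * w for w in weights]
    extra = [math.floor(s) for s in shares]

    # Largest-remainder rounding so the quotas sum exactly to total_slots.
    leftover = remaining - sum(extra)
    by_remainder = sorted(range(k), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in by_remainder[:leftover]:
        extra[i] += 1

    return [min_slots + e for e in extra]


# An 80/20 importance split over 10 slots with a floor of 3 (the S6 shape).
quotas = allocate_quotas_sketch([0.8, 0.2], total_slots=10, min_slots=3)
print(quotas)  # [6, 4]
```

Under this sketch the minority cluster ends up with 4 of 10 slots rather than the 2 a purely proportional split would give, which is the behaviour S6 checks for with `cv_count >= 3`.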
@@ -0,0 +1,197 @@
"""
Search quality evaluation harness.

For each curated query, runs the hybrid search pipeline end-to-end
(rewrite -> encode -> dense+sparse -> RRF -> title-boost) and prints the
top 10 results with titles fetched from Turso. For known-item queries,
flags whether the expected paper landed at #1.

This is a HUMAN-JUDGMENT report, not a pass/fail test. The output is
designed to be read top-to-bottom and rated query by query.

Run: python scripts/eval_search_quality.py
"""
from __future__ import annotations

import asyncio
import sys
import time
from pathlib import Path

# Make the project root importable when run as `python scripts/eval_search_quality.py`
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from app import hybrid_search_svc
from app import turso_svc
from app import embed_svc
from app import groq_svc


# (band, query, expected_arxiv_id_or_None)
QUERIES: list[tuple[str, str, str | None]] = [
    # ── Band A: known-item title queries ──
    # The right answer is unambiguous. Top-1 hit is the bar.
    ("A", "attention is all you need", "1706.03762"),
    ("A", "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
    ("A", "Adam: A Method for Stochastic Optimization", "1412.6980"),
    ("A", "Language Models are Few-Shot Learners", "2005.14165"),
    ("A", "Deep Residual Learning for Image Recognition", "1512.03385"),

    # ── Band B: conceptual semantic queries ──
    # No exact keyword match; tests whether dense retrieval rescues meaning.
    ("B", "when AI makes up fake facts", None),
    ("B", "making language models follow human preferences", None),
    ("B", "why deep networks generalize despite overparameterization", None),
    ("B", "finding similar papers using vector embeddings", None),
    ("B", "models that pretend to be aligned but aren't", None),

    # ── Band C: keyword-academic queries ──
    # Already in academic form; rewriter heuristic should skip these.
    ("C", "BGE-M3 multilingual dense retrieval", None),
    ("C", "Mamba state space model linear time", None),
    ("C", "chain of thought prompting", None),
    ("C", "FlashAttention IO-aware exact attention", None),

    # ── Band D: adversarial / edge cases ──
    ("D", "transformr", None),  # typo
    ("D", "GPT", None),  # very short
    ("D", "bayesian deep learning monte carlo dropout uncertainty estimation", None),  # very long
    ("D", "applying CV to medical imaging", None),  # cross-domain (CV->medical)
    ("D", "attention", None),  # single ambiguous word

    # ── Band E: recency-sensitive queries ──
    # Recency rerank was removed; verify recent work still surfaces.
    ("E", "Llama 3", None),
    ("E", "reasoning models 2024", None),
]


# ── Wire a thin wrapper around groq_svc.rewrite to capture what fired ──
_rewrite_log: dict[str, str] = {}
_original_rewrite = groq_svc.rewrite


async def _logging_rewrite(q: str) -> str:
    r = await _original_rewrite(q)
    _rewrite_log[q] = r
    return r


groq_svc.rewrite = _logging_rewrite


async def eval_query(
    band: str, query: str, expected_id: str | None
) -> tuple[list[str], float]:
    """Run one query end-to-end and print a formatted report."""
    t0 = time.perf_counter()
    results = await hybrid_search_svc.search(query, limit=10)
    elapsed_ms = (time.perf_counter() - t0) * 1000

    rewrite = _rewrite_log.get(query, query)
    rewrite_fired = rewrite.strip() != query.strip()

    titles: dict[str, str] = {}
    if results:
        meta = await turso_svc.fetch_metadata_batch(results)
        titles = {aid: (m.get("title") or "(no title)") for aid, m in meta.items()}

    # ── Header ──
    print()
    print(f"[{band}] {query!r}")
    if rewrite_fired:
        print(f" rewrite: {rewrite!r}")
    else:
        print(" rewrite: (heuristic skipped or no change)")

    if expected_id is not None:
        if results and results[0] == expected_id:
            verdict = f"PASS - {expected_id} at #1"
        elif expected_id in results:
            rank = results.index(expected_id) + 1
            verdict = f"PARTIAL - {expected_id} at rank #{rank}"
        else:
            verdict = f"FAIL - {expected_id} NOT in top 10"
        print(f" verdict: {verdict}")

    print(f" latency: {elapsed_ms:.0f} ms | results: {len(results)}")

    if not results:
        print(" (no results returned)")
        return results, elapsed_ms

    for i, aid in enumerate(results, 1):
        title = titles.get(aid, "(title unavailable)")
        if len(title) > 88:
            title = title[:85] + "..."
        marker = " *" if expected_id and aid == expected_id else " "
        print(f" {i:2d}.{marker}{aid:13s} {title}")

    return results, elapsed_ms


async def main():
    print("=" * 100)
    print("SEARCH QUALITY EVALUATION - ResearchIT hybrid search pipeline")
    print("=" * 100)

    # ── Warm-up ──
    # First BGE-M3 encode is ~10-15s cold. Warm before timing anything.
    print("\nWarming up BGE-M3 + Turso...")
    t0 = time.perf_counter()
    embed_svc.encode_query("warmup query for the eval harness")
    await turso_svc.fetch_metadata_batch(["1706.03762"])
    print(f"Warm-up: {(time.perf_counter()-t0)*1000:.0f} ms\n")

    band_results: dict[str, list[tuple[str, str | None, list[str], float]]] = {}

    for band, query, expected in QUERIES:
        results, latency = await eval_query(band, query, expected)
        band_results.setdefault(band, []).append((query, expected, results, latency))

    # ── Summary ──
    print("\n" + "=" * 100)
    print("SUMMARY")
    print("=" * 100)

    # Band A: top-1 hit rate
    if "A" in band_results:
        a_rows = band_results["A"]
        hits = sum(1 for _, exp, res, _ in a_rows if res and res[0] == exp)
        partial = sum(
            1 for _, exp, res, _ in a_rows
            if exp in (res or []) and (not res or res[0] != exp)
        )
        misses = len(a_rows) - hits - partial
        print(f"\nBand A (known-item titles): {hits}/{len(a_rows)} top-1 hits, "
              f"{partial} partial (in top 10 but not #1), {misses} miss")
        for q, exp, res, _ in a_rows:
            if res and res[0] == exp:
                tag = "PASS"
            elif exp in (res or []):
                tag = f"PARTIAL #{res.index(exp)+1}"
            else:
                tag = "MISS"
            qshort = q if len(q) <= 60 else q[:57] + "..."
            print(f" [{tag:10s}] {exp:14s} {qshort}")

    # Latency stats
    all_lat = [lat for rows in band_results.values() for *_, lat in rows]
    if all_lat:
        all_lat.sort()
        n = len(all_lat)
        p50 = all_lat[n // 2]
        # Nearest-rank p95: the ceil(0.95 * n)-th smallest value (1-based).
        p95 = all_lat[(95 * n + 99) // 100 - 1]
        print(f"\nLatency (n={n}): mean {sum(all_lat)/n:.0f} ms "
              f"p50 {p50:.0f} ms p95 {p95:.0f} ms "
              f"max {max(all_lat):.0f} ms")

    # Per-band coverage (how often did we get any results?)
    print("\nResults coverage by band:")
    for band, rows in sorted(band_results.items()):
        empty = sum(1 for _, _, res, _ in rows if not res)
        print(f" Band {band}: {len(rows) - empty}/{len(rows)} returned results")


if __name__ == "__main__":
    asyncio.run(main())
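
The harness above exercises an RRF stage whose implementation (`_rrf_fuse_multi` in `hybrid_search_svc`) is not part of this file. As a rough illustration of what Reciprocal Rank Fusion does, here is a minimal sketch; the function name `rrf_fuse_sketch` and the default `k=60` are assumptions for the example, and the project's actual constant lives in `config.SEARCH_RRF_K`.

```python
from collections import defaultdict


def rrf_fuse_sketch(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Two hypothetical rankings of the same three documents.
dense = ["a", "b", "c"]
sparse = ["b", "c", "a"]
fused = rrf_fuse_sketch([dense, sparse])
print(fused)  # ['b', 'a', 'c']
```

Because only ranks enter the score, the dense and sparse retrievers do not need comparable raw scores, which is presumably why a rank-based fusion fits a Qdrant plus Zilliz setup like this one.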
The diff for this file is too large to render.
See raw diff
@@ -0,0 +1,410 @@
"""
Stage-by-stage profiler for the search and recommendation pipelines.

Mirrors the production paths (hybrid_search_svc.search and
_multi_interest_recommend) with explicit timers between every stage,
so we can see where the time actually goes.

Run: python scripts/profile_pipelines.py
"""
from __future__ import annotations

import asyncio
import sys
import time
import uuid
from contextlib import contextmanager
from pathlib import Path

import numpy as np

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from app import (
    config, embed_svc, qdrant_svc, zilliz_svc, groq_svc, turso_svc,
    db, user_state as us,
)
from app.recommend import profiles
from app.recommend.clustering import (
    compute_clusters, stabilize_cluster_ids, save_clusters_to_db,
    load_clusters_from_db, MIN_PAPERS_FOR_CLUSTERING, InterestCluster,
)
from app.recommend.fusion import allocate_quotas, merge_quota_results
from app.recommend.reranker import rerank_candidates
from app.recommend.diversity import mmr_rerank, inject_exploration


@contextmanager
def stage(name: str, sink: list):
    """Time the enclosed block and append (name, elapsed_ms) to `sink`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        # Record the timing even if the stage raises.
        sink.append((name, (time.perf_counter() - t0) * 1000))


def print_breakdown(label: str, timings: list[tuple[str, float]]):
    total = sum(t for _, t in timings)
    print(f"\n --- {label} ---")
    print(f" {'Stage':<46s} {'ms':>10s} {'%':>6s}")
    print(f" {'-'*46} {'-'*10} {'-'*6}")
    for name, t in timings:
        pct = (100.0 * t / total) if total > 0 else 0.0
        print(f" {name:<46s} {t:>10.0f} {pct:>5.1f}%")
    print(f" {'-'*46} {'-'*10} {'-'*6}")
    print(f" {'TOTAL':<46s} {total:>10.0f} {100.0:>5.1f}%")


# ── Search pipeline profiler ──

async def profile_search(query: str) -> list[tuple[str, float]]:
    """Mirror hybrid_search_svc.search() with stage timers."""
    timings: list[tuple[str, float]] = []
    limit = 10
    fetch_k = limit * config.SEARCH_FETCH_K_MULTIPLIER

    # Stage 1: Groq rewrite
    rewritten = query
    with stage("1. Groq rewrite (LLM)", timings):
        try:
            rewritten = await groq_svc.rewrite(query)
        except Exception:
            rewritten = query

    # Stage 2: BGE-M3 encode (original)
    with stage("2a. BGE-M3 encode (original)", timings):
        d_orig, s_orig = embed_svc.encode_query(query)

    encodings = [(d_orig, s_orig)]

    # Stage 2b: BGE-M3 encode (rewritten, if different)
    if rewritten and rewritten != query:
        with stage("2b. BGE-M3 encode (rewrite)", timings):
            d_rw, s_rw = embed_svc.encode_query(rewritten)
            encodings.append((d_rw, s_rw))
    else:
        timings.append(("2b. BGE-M3 encode (rewrite skipped)", 0.0))

    # Stage 3: Parallel Qdrant + Zilliz searches
    with stage(f"3. Parallel search ({len(encodings)*2} tasks)", timings):
        tasks = []
        for d, s in encodings:
            tasks.append(qdrant_svc.search_dense(d.tolist(), limit=fetch_k))
            tasks.append(zilliz_svc.search_sparse(s, limit=fetch_k))
        raw = await asyncio.gather(*tasks, return_exceptions=True)

    valid_lists = [r for r in raw if not isinstance(r, Exception) and r]

    # Import outside the timer so module import cost is not billed to RRF.
    from app.hybrid_search_svc import _rrf_fuse_multi, _title_match_rerank

    # Stage 4: RRF fusion
    with stage("4. RRF fusion", timings):
        fused = _rrf_fuse_multi(valid_lists, k=config.SEARCH_RRF_K)

    # Stage 5: Title-boost (Turso fetch + scoring)
    with stage("5. Title-match boost (Turso + score)", timings):
        ranked = await _title_match_rerank(fused, query, top_n_for_boost=50)

    return timings
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
# ββ Recommendations Tier 1 pipeline profiler βββββββββββββββββββββββββββββββββ
|
| 112 |
+
|
| 113 |
+
async def profile_recs_tier1(user_id: str, save_ids: list[str]) -> list[tuple[str, float]]:
|
| 114 |
+
"""Mirror _multi_interest_recommend() with stage timers."""
|
| 115 |
+
timings: list[tuple[str, float]] = []
|
| 116 |
+
|
| 117 |
+
state = await us.ensure_loaded(user_id)
|
| 118 |
+
seen = us.all_seen(user_id)
|
| 119 |
+
REC_LIMIT = config.REC_LIMIT
|
| 120 |
+
OVERSAMPLE = 3
|
| 121 |
+
ST_SUPPLEMENT = 20
|
| 122 |
+
positives = state.positive_list
|
| 123 |
+
|
| 124 |
+
# 1. Fetch saved-paper vectors from Qdrant
|
| 125 |
+
with stage("1. Fetch saved-paper vectors (Qdrant)", timings):
|
| 126 |
+
vectors = await qdrant_svc.get_paper_vectors(positives)
|
| 127 |
+
|
| 128 |
+
aligned_ids = [pid for pid in positives if pid in vectors]
|
| 129 |
+
aligned_embs = np.array([vectors[pid] for pid in aligned_ids], dtype=np.float32)
|
| 130 |
+
|
| 131 |
+
# 2. Ward clustering (CPU)
|
| 132 |
+
with stage("2. Ward clustering (CPU)", timings):
|
| 133 |
+
clusters = compute_clusters(aligned_ids, aligned_embs)
|
| 134 |
+
|
| 135 |
+
# 3. Hungarian: load + match
|
| 136 |
+
with stage("3. Hungarian load+match (SQLite + numpy)", timings):
|
| 137 |
+
old_clusters_data = await load_clusters_from_db(user_id)
|
| 138 |
+
if old_clusters_data:
|
| 139 |
+
old_clusters = []
|
| 140 |
+
for row in old_clusters_data:
|
| 141 |
+
mpid = row["medoid_paper_id"]
|
| 142 |
+
if mpid in vectors:
|
| 143 |
+
medoid_emb = np.array(vectors[mpid], dtype=np.float32)
|
| 144 |
+
elif row.get("medoid_embedding_blob") is not None:
|
| 145 |
+
medoid_emb = np.frombuffer(
|
| 146 |
+
row["medoid_embedding_blob"], dtype=np.float32
|
| 147 |
+
).copy()
|
| 148 |
+
else:
|
| 149 |
+
continue
+            old_clusters.append(InterestCluster(
+                cluster_idx=row["cluster_idx"],
+                medoid_paper_id=mpid,
+                medoid_embedding=medoid_emb,
+                paper_ids=[],
+                importance=row["importance"],
+            ))
+        if old_clusters:
+            clusters = stabilize_cluster_ids(clusters, old_clusters)
+
+    # 4. Save clusters + snapshot (SQLite writes)
+    with stage("4. Save clusters + snapshot (SQLite)", timings):
+        await save_clusters_to_db(user_id, clusters)
+        await db.save_cluster_snapshot(user_id, [
+            {
+                "cluster_idx": c.cluster_idx,
+                "medoid_paper_id": c.medoid_paper_id,
+                "importance": c.importance,
+                "paper_ids": c.paper_ids,
+                "medoid_embedding_blob": c.medoid_embedding.astype(np.float32).tobytes(),
+            }
+            for c in clusters
+        ])
+
+    # 5. Quota allocation (CPU)
+    with stage("5. Allocate quotas (CPU)", timings):
+        importances = [c.importance for c in clusters]
+        quotas = allocate_quotas(importances, total_slots=100, min_slots=3)
+
+    # 6. Load short-term profile
+    with stage("6. Load short-term profile (SQLite)", timings):
+        st_vec = await profiles.load_profile(user_id, "short_term")
+
+    # 7. Per-cluster parallel ANN searches (no with_vectors: that path
+    # is 10x slower on Qdrant Cloud free tier; we cache vectors instead)
+    with stage(f"7. Per-cluster ANN searches (gather {len(clusters)})", timings):
+        search_tasks = [
+            qdrant_svc.search_by_vector_with_scores(
+                query_vector=c.medoid_embedding.tolist(),
+                limit=quota * OVERSAMPLE,
+                exclude_ids=seen,
+            )
+            for c, quota in zip(clusters, quotas)
+        ]
+        per_cluster_scored = await asyncio.gather(*search_tasks)
+
+        paper_cluster_map: dict[str, int] = {}
+        qdrant_score_map: dict[str, float] = {}
+        for cluster, scored in zip(clusters, per_cluster_scored):
+            for hit in scored:
+                aid = hit["arxiv_id"]
+                if aid not in paper_cluster_map:
+                    paper_cluster_map[aid] = cluster.cluster_idx
+                if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
+                    qdrant_score_map[aid] = float(hit["score"])
+
+        per_cluster_ids = [
+            [h["arxiv_id"] for h in scored] for scored in per_cluster_scored
+        ]
+        candidate_ids = merge_quota_results(per_cluster_ids, quotas)
+
+    # 8. Short-term supplement search
+    with stage("8. Short-term supplement (Qdrant)", timings):
+        if st_vec is not None:
+            seen_so_far = seen | set(candidate_ids)
+            st_scored = await qdrant_svc.search_by_vector_with_scores(
+                query_vector=st_vec.tolist(),
+                limit=ST_SUPPLEMENT,
+                exclude_ids=seen_so_far,
+            )
+            for hit in st_scored:
+                aid = hit["arxiv_id"]
+                if aid not in set(candidate_ids):
+                    candidate_ids.append(aid)
+                    paper_cluster_map[aid] = -1
+                    if aid not in qdrant_score_map:
+                        qdrant_score_map[aid] = float(hit["score"])
+
+    # 9. Fetch candidate vectors (LRU-cached by arxiv_id in qdrant_svc).
+    with stage(f"9. Fetch {len(candidate_ids)} candidate vectors (Qdrant, cached)", timings):
+        cand_vectors = await qdrant_svc.get_paper_vectors(candidate_ids)
+
+    # 10. Fetch candidate metadata from Turso (with cache)
+    with stage(f"10. Fetch {len(candidate_ids)} candidate metadata (Turso)", timings):
+        cand_meta = await turso_svc.fetch_metadata_batch(candidate_ids)
+
+    # 11. Cache metadata to SQLite
+    with stage("11. Cache Turso metadata to SQLite", timings):
+        await db.cache_turso_metadata_batch(list(cand_meta.values()))
+
+    valid_ids = [cid for cid in candidate_ids if cid in cand_vectors and cid in cand_meta]
+    valid_embs = np.array([cand_vectors[cid] for cid in valid_ids], dtype=np.float32)
+    valid_meta = [cand_meta[cid] for cid in valid_ids]
+
+    # 12. Load profiles (long-term, negative)
+    with stage("12. Load long-term + negative profiles (SQLite)", timings):
+        lt_vec = await profiles.load_profile(user_id, "long_term")
+        neg_vec = await profiles.load_profile(user_id, "negative")
+
+    # 13. SQLite reads (suppression + onboarding)
+    with stage("13. Suppression + onboarding lookup (SQLite)", timings):
+        suppressed = await db.get_suppressed_categories(user_id)
+        onboarding_categories = await db.get_user_category_filter(user_id)
+
+    # 14. Build feature arrays (CPU)
+    with stage("14. Build per-candidate feature arrays (CPU)", timings):
+        user_total_saves = len(state.positive_list)
+        user_total_dismissals = len(state.negative_list)
+        qdrant_scores = np.asarray(
+            [qdrant_score_map.get(cid, 0.0) for cid in valid_ids],
+            dtype=np.float32,
+        )
+        per_cand_imp = np.asarray(
+            [
+                clusters[paper_cluster_map[cid]].importance
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else 0.0
+                for cid in valid_ids
+            ],
+            dtype=np.float32,
+        )
+        per_cand_med = np.stack(
+            [
+                np.asarray(clusters[paper_cluster_map[cid]].medoid_embedding, dtype=np.float32)
+                if cid in paper_cluster_map and 0 <= paper_cluster_map[cid] < len(clusters)
+                else np.zeros(1024, dtype=np.float32)
+                for cid in valid_ids
+            ],
+            axis=0,
+        )
+        is_suppressed_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in suppressed else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+        onb_match_arr = np.asarray(
+            [1.0 if cand_meta.get(cid, {}).get("category", "") in onboarding_categories else 0.0
+             for cid in valid_ids],
+            dtype=np.float32,
+        )
+
+    # 15. LightGBM rerank
+    with stage("15. LightGBM rerank (CPU)", timings):
+        reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
+            candidate_ids=valid_ids,
+            candidate_embeddings=valid_embs,
+            candidate_metadata=valid_meta,
+            long_term_vec=lt_vec,
+            short_term_vec=st_vec,
+            negative_vec=neg_vec,
+            qdrant_scores=qdrant_scores,
+            cluster_importance=per_cand_imp,
+            cluster_medoid=per_cand_med,
+            is_suppressed_category=is_suppressed_arr,
+            onboarding_category_match=onb_match_arr,
+            user_total_saves=user_total_saves,
+            user_total_dismissals=user_total_dismissals,
+        )
+
+    # 16. MMR
+    with stage("16. MMR diversity (CPU)", timings):
+        query_vec = lt_vec if lt_vec is not None else aligned_embs.mean(axis=0)
+        mmr_selected = mmr_rerank(
+            query_embedding=query_vec,
+            candidate_embeddings=reranked_embs,
+            candidate_ids=reranked_ids,
+            scores=reranked_scores,
+            lambda_param=0.6,
+            top_k=REC_LIMIT,
+        )
+
+    # 17. Exploration injection
+    with stage("17. Exploration injection (CPU)", timings):
+        final = inject_exploration(
+            selected_ids=mmr_selected,
+            all_candidate_ids=reranked_ids,
+            n_explore=2,
+        )
+
+    return timings
+
+
+# ── Setup helper for recs profile ────────────────────────────────────────────
+
+async def setup_recs_user(user_id: str, save_ids: list[str]):
+    vecs = await qdrant_svc.get_paper_vectors(save_ids)
+    state = await us.ensure_loaded(user_id)
+    for pid in save_ids:
+        if pid not in vecs:
+            continue
+        state.add_positive(pid)
+        emb = np.array(vecs[pid], dtype=np.float32)
+        await profiles.update_on_save(user_id, emb)
+        await db.log_interaction(user_id, pid, "save")
+
+
+async def cleanup_user(user_id: str):
+    import aiosqlite
+    async with aiosqlite.connect(config.DB_PATH) as conn:
+        for tbl in ["interactions", "user_profiles", "user_clusters",
+                    "user_onboarding", "cluster_snapshots"]:
+            try:
+                await conn.execute(f"DELETE FROM {tbl} WHERE user_id = ?", (user_id,))
+            except Exception:
+                pass
+        await conn.commit()
+    if user_id in us._cache:
+        del us._cache[user_id]
+
+
+async def main():
+    print("=" * 92)
+    print("PIPELINE PROFILER")
+    print("=" * 92)
+
+    await db.init_db()
+
+    # Warm BGE-M3 + Turso connection so first stage isn't a 15s outlier
+    print("\nWarming up BGE-M3 + Turso...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+
+    # ── Search profiling ─────────────────────────────────────────────────────
+    print("\n" + "=" * 92)
+    print("SEARCH PIPELINE - three representative queries")
+    print("=" * 92)
+
+    queries = [
+        ("known-item title", "attention is all you need"),
+        ("conceptual rewrite", "when AI makes up fake facts"),
+        ("academic, no rewrite", "BGE-M3 multilingual dense retrieval"),
+    ]
+    for label, q in queries:
+        print(f"\n>>> Query [{label}]: {q!r}")
+        # Run twice (first cold, second warm) to show the cache effect
+        for run in (1, 2):
+            timings = await profile_search(q)
+            print_breakdown(f"Run {run}", timings)
+
+    # ── Recs Tier 1 profiling ────────────────────────────────────────────────
+    print("\n\n" + "=" * 92)
+    print("RECS TIER 1 PIPELINE - 10 saved papers (5 NLP + 5 CV)")
+    print("=" * 92)
+
+    user_id = f"profile-recs-{uuid.uuid4().hex[:6]}"
+    save_ids = [
+        "1706.03762", "1810.04805", "2005.14165", "1907.11692", "1910.10683",
+        "1512.03385", "2010.11929", "1409.1556", "1505.04597", "2103.14030",
+    ]
+    try:
+        await setup_recs_user(user_id, save_ids)
+
+        for run in (1, 2, 3):
+            timings = await profile_recs_tier1(user_id, save_ids)
+            print_breakdown(f"Run {run}", timings)
+    finally:
+        await cleanup_user(user_id)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
|
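The profiler above wraps every step in `with stage(label, timings):`, but the definition of `stage` is not part of this excerpt. A minimal sketch of how such a helper could work is shown here, assuming `timings` is a plain list of `(label, milliseconds)` tuples. The names `stage_timer` and `format_breakdown` are hypothetical stand-ins, not the project's actual helpers.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(label, timings):
    """Time the enclosed block and append (label, elapsed_ms) to `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the wrapped block raises.
        timings.append((label, (time.perf_counter() - start) * 1000.0))


def format_breakdown(timings):
    """Render one line per stage plus a total line (hypothetical layout)."""
    total = sum(ms for _, ms in timings)
    lines = [f"{label:<30s} {ms:9.2f} ms" for label, ms in timings]
    lines.append(f"{'TOTAL':<30s} {total:9.2f} ms")
    return "\n".join(lines)


timings = []
with stage_timer("1. Sleep (simulated I/O)", timings):
    time.sleep(0.01)
with stage_timer("2. Sum (CPU)", timings):
    sum(range(10_000))
print(format_breakdown(timings))
```

A synchronous context manager like this can also wrap `await` calls inside an `async def`, because the timer only reads the clock on entry and exit. The measured time then includes any time the coroutine spends suspended.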
|
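Stage 16 of the recommendation profile calls `mmr_rerank(..., lambda_param=0.6, top_k=REC_LIMIT)`, whose body is not shown in this diff. The sketch below illustrates the general Maximal Marginal Relevance idea in pure Python, assuming the reranker scores serve as the relevance term and cosine similarity to already-selected items serves as the redundancy penalty. The function name `mmr_select_sketch` is hypothetical, and the real function's use of `query_embedding` may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select_sketch(ids, embeddings, scores, lambda_param=0.6, top_k=10):
    """Greedy MMR: trade relevance against similarity to items already picked."""
    remaining = list(range(len(ids)))
    selected = []
    while remaining and len(selected) < top_k:
        best_i, best_val = None, -math.inf
        for i in remaining:
            # Redundancy is the highest similarity to anything selected so far.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            val = lambda_param * scores[i] - (1.0 - lambda_param) * redundancy
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
        remaining.remove(best_i)
    return [ids[i] for i in selected]


# "b" is a near-duplicate of "a"; "c" is less relevant but points elsewhere.
ids = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
scores = [0.9, 0.85, 0.6]
print(mmr_select_sketch(ids, embeddings, scores, lambda_param=0.6, top_k=3))
```

With `lambda_param=0.6` the near-duplicate `"b"` is pushed behind `"c"` even though its raw score is higher, which is the diversity effect the stage name refers to.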
@@ -0,0 +1,91 @@
+"""Side-by-side comparison: BEFORE vs AFTER citation boost.
+
+Shows beginner vs expert results for the same topic.
+Also verifies Band A (known-item) queries aren't broken.
+"""
+import asyncio, sys, time
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from app import hybrid_search_svc, turso_svc, embed_svc
+
+# Pairs: (topic, beginner_query, expert_query)
+COMPARISONS = [
+    ("TRANSFORMERS",
+     "how do transformers work in NLP",
+     "attention is all you need"),
+    ("DIFFUSION",
+     "what are diffusion models and how do they generate images",
+     "denoising diffusion probabilistic models"),
+    ("GPT-4",
+     "how does GPT-4 work",
+     "GPT-4 Technical Report"),
+    ("RLHF",
+     "what is reinforcement learning from human feedback",
+     "reinforcement learning from human feedback"),
+]
+
+BAND_A = [
+    ("attention is all you need", "1706.03762"),
+    ("Deep Residual Learning for Image Recognition", "1512.03385"),
+    ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "1810.04805"),
+]
+
+async def run_query(q: str):
+    results = await hybrid_search_svc.search(q, limit=10)
+    meta = {}
+    if results:
+        meta = await turso_svc.fetch_metadata_batch(results)
+    return results, meta
+
+async def main():
+    print("Warming up BGE-M3...")
+    embed_svc.encode_query("warmup")
+    await turso_svc.fetch_metadata_batch(["1706.03762"])
+
+    # === Band A verification ===
+    print()
+    print("=" * 90)
+    print("BAND A VERIFICATION - Known-item queries (must still be #1)")
+    print("=" * 90)
+    for q, expected in BAND_A:
+        results, meta = await run_query(q)
+        rank = results.index(expected) + 1 if expected in results else -1
+        status = "PASS" if rank == 1 else f"RANK #{rank}" if rank > 0 else "MISS"
+        cites = meta.get(expected, {}).get("citation_count", 0)
+        print(f" [{status:>8}] {q[:55]:55s} ({cites} cites)")
+
+    # === Side-by-side comparisons ===
+    print()
+    print("=" * 90)
+    print("SIDE-BY-SIDE: Beginner vs Expert queries (same topic)")
+    print("=" * 90)
+
+    for topic, beginner_q, expert_q in COMPARISONS:
+        print(f"\n--- {topic} ---")
+
+        # Beginner
+        print(f"\n BEGINNER: {beginner_q!r}")
+        results, meta = await run_query(beginner_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count", 0)
+            print(f" {i}. [{cites:>6} cites] {title}")
+
+        # Expert
+        print(f"\n EXPERT: {expert_q!r}")
+        results, meta = await run_query(expert_q)
+        for i, aid in enumerate(results[:5], 1):
+            m = meta.get(aid, {})
+            title = (m.get("title") or "?")[:60]
+            cites = m.get("citation_count", 0)
+            print(f" {i}. [{cites:>6} cites] {title}")
+
+    print()
+    print("=" * 90)
+    print("DONE")
+    print("=" * 90)
+
+if __name__ == "__main__":
+    asyncio.run(main())
|
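The Band A check in the script above packs the rank lookup and the PASS / RANK / MISS decision into two chained conditional expressions. The same logic can be pulled into a small pure function that is easy to unit-test without a live search backend. This is only a sketch, and the helper name `known_item_status` is hypothetical.

```python
def known_item_status(results, expected):
    """Return (rank, status) for a known-item query.

    `rank` is the 1-based position of `expected` in `results`, or -1 if absent.
    """
    rank = results.index(expected) + 1 if expected in results else -1
    if rank == 1:
        status = "PASS"
    elif rank > 0:
        status = f"RANK #{rank}"
    else:
        status = "MISS"
    return rank, status


# Illustrative result lists (made-up orderings, real arXiv IDs from BAND_A).
print(known_item_status(["1706.03762", "1810.04805"], "1706.03762"))
print(known_item_status(["1810.04805", "1706.03762"], "1706.03762"))
print(known_item_status(["1810.04805"], "1706.03762"))
```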
@@ -102,56 +102,100 @@ class TestRRFFusion:
         assert gap_k10 > gap_k100
 
 
+# ── Title-match rerank tests ─────────────────────────────────────────────────
 
+class TestTitleMatchRerank:
+    """Test the title-match boost in hybrid_search_svc.
 
+    Recency rerank was removed (it crushed seminal old papers like
+    1706.03762 below newer "X is all you need" titles). Replaced with a
+    title-match boost that promotes papers whose title matches the query.
+    """
+
+    @pytest.mark.asyncio
+    async def test_exact_title_match_wins(self, monkeypatch):
+        """Paper with exact-title match should rank #1 even with low RRF."""
+        from app import hybrid_search_svc
+
+        async def fake_meta(ids):
+            return {
+                "1706.03762": {"title": "Attention Is All You Need"},
+                "2404.01183": {"title": "Positioning Is All You Need"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
 
         fused = [
+            {"arxiv_id": "2404.01183", "rrf_score": 0.0317},  # higher RRF
+            {"arxiv_id": "1706.03762", "rrf_score": 0.0164},  # lower RRF, exact match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "1706.03762"
 
+    @pytest.mark.asyncio
+    async def test_substring_match_beats_no_match(self, monkeypatch):
+        """A substring title match outranks no-match candidates."""
+        from app import hybrid_search_svc
 
+        async def fake_meta(ids):
+            return {
+                "2501.05730": {"title": "Element-wise Attention Is All You Need"},
+                "9999.99999": {"title": "An Unrelated Survey of Graph Theory"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
 
         fused = [
+            {"arxiv_id": "9999.99999", "rrf_score": 0.05},  # higher RRF, no match
+            {"arxiv_id": "2501.05730", "rrf_score": 0.01},  # lower RRF, substring match
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(
+            fused, "attention is all you need"
+        )
+        assert ranked[0]["arxiv_id"] == "2501.05730"
+
+    @pytest.mark.asyncio
+    async def test_no_match_falls_back_to_rrf(self, monkeypatch):
+        """When nothing matches, RRF order is preserved."""
+        from app import hybrid_search_svc
 
+        async def fake_meta(ids):
+            return {
+                "1234.56789": {"title": "Some Paper"},
+                "9876.54321": {"title": "Another Paper"},
+            }
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", fake_meta)
 
+        fused = [
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
+        ]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "transformer")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]
 
+    @pytest.mark.asyncio
+    async def test_empty_input(self):
         """Empty input returns empty output."""
+        from app import hybrid_search_svc
+        assert await hybrid_search_svc._title_match_rerank([], "anything") == []
 
+    @pytest.mark.asyncio
+    async def test_turso_failure_falls_back_to_rrf(self, monkeypatch):
+        """If Turso lookup raises, ranking falls back to pure RRF order."""
+        from app import hybrid_search_svc
+
+        async def boom(ids):
+            raise RuntimeError("turso down")
+        monkeypatch.setattr(hybrid_search_svc.turso_svc, "fetch_metadata_batch", boom)
 
         fused = [
+            {"arxiv_id": "1234.56789", "rrf_score": 0.05},
+            {"arxiv_id": "9876.54321", "rrf_score": 0.01},
         ]
+        ranked = await hybrid_search_svc._title_match_rerank(fused, "attention")
+        assert [r["arxiv_id"] for r in ranked] == ["1234.56789", "9876.54321"]
+        # final_score must be set even on the fallback path
+        assert all("final_score" in r for r in ranked)
 
|
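The tests above pin down the behaviour expected of `_title_match_rerank` without showing its body. The sketch below is one way a title-match boost could satisfy them, assuming an additive boost on top of the RRF score. The boost constants, the `titles` argument and the name `title_match_rerank_sketch` are all hypothetical. Per the tests, the real function is async and fetches titles through `turso_svc.fetch_metadata_batch`, which this sketch replaces with a plain dict.

```python
EXACT_BOOST = 1.0      # assumed value, chosen to dominate typical RRF scores
SUBSTRING_BOOST = 0.5  # assumed value, smaller than the exact-match boost


def _norm(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join((text or "").lower().split())


def title_match_rerank_sketch(fused, query, titles):
    """Add a title-match boost to each RRF score and sort by the result.

    `titles` maps arxiv_id -> title. A missing title (for example after a
    failed metadata lookup) gets no boost, so RRF order is preserved.
    """
    q = _norm(query)
    ranked = []
    for item in fused:
        title = _norm(titles.get(item["arxiv_id"]))
        if q and title == q:
            boost = EXACT_BOOST
        elif q and q in title:
            boost = SUBSTRING_BOOST
        else:
            boost = 0.0
        ranked.append({**item, "final_score": item["rrf_score"] + boost})
    # sorted() is stable, so ties keep their incoming RRF order.
    return sorted(ranked, key=lambda r: r["final_score"], reverse=True)


fused = [
    {"arxiv_id": "2404.01183", "rrf_score": 0.0317},
    {"arxiv_id": "1706.03762", "rrf_score": 0.0164},
]
titles = {
    "1706.03762": "Attention Is All You Need",
    "2404.01183": "Positioning Is All You Need",
}
ranked = title_match_rerank_sketch(fused, "attention is all you need", titles)
print([r["arxiv_id"] for r in ranked])
```

Passing an empty `titles` dict mimics the Turso-failure path: every item keeps its RRF score as `final_score` and the input order is unchanged.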
 
 # ── Groq rewriter tests ─────────────────────────────────────────────────────