# ResearchIT: Master Task Tracker

> **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.
> **Last updated**: 2026-05-05
> **Current phase**: Phase 6.5 (Instrumentation): COMPLETE ✅ | Phase 7 next

---

## Legend

- `[x]`: Done
- `[/]`: In progress
- `[ ]`: Not started
- `[~]`: Intentionally deferred (blocked by data/users/scale)
- `[!]`: Backlog item (documented, not yet coded)

---

## Phase 1: Zero-ML Recommender ✅ COMPLETE

> *Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.*

- [x] Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
  - Collection: `arxiv_bgem3_dense`, 1024-dim dense vectors
  - File: `app/qdrant_svc.py` → `_get_client()`
- [x] BEST_SCORE Recommend API (raw paper IDs → Qdrant)
  - File: `app/qdrant_svc.py` → `recommend()`
- [x] arXiv keyword API search (placeholder, replaced in Phase 3)
  - File: `app/arxiv_svc.py` → `search()`
- [x] arXiv metadata fetching + SQLite cache
  - File: `app/arxiv_svc.py` → `fetch_metadata_batch()`
- [x] SQLite database schema (interactions, paper_metadata)
  - File: `app/db.py` → `init_db()`
  - WAL mode, async via aiosqlite
- [x] Cookie-based user identity
  - File: `app/config.py` → `COOKIE_NAME`
- [x] User state management (positive/negative deques)
  - File: `app/user_state.py` → `UserState`
- [x] Save/Dismiss event logging
  - File: `app/routers/events.py`
- [x] HTMX + Jinja2 frontend (search, recs, save, dismiss)
  - Files: `app/templates/` (base.html, index.html, search.html, saved.html, partials/)
- [x] Test suite: **55 tests passing**

**Gaps**: None.

---

## Phase 2a: EWMA Profile Embeddings ✅ COMPLETE

> *Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.*

- [x] Create `app/recommend/` module with `__init__.py`
- [x] Create `app/recommend/profiles.py`: EWMA computation + storage
  - Long-term: α=0.03 (corrected from 0.10 per Doc 06)
  - Short-term: α=0.40
  - Negative: α=0.15
  - All embeddings L2-normalized
- [x] Modify `app/db.py`: add `user_profiles` table + `user_clusters` table
- [x] Modify `app/qdrant_svc.py`: add `get_paper_vectors()` and `search_by_vector()`
- [x] Modify `app/routers/events.py`: trigger EWMA updates on save/dismiss
- [x] Modify `app/routers/recommendations.py`: EWMA vector search with Tier 2 fallback
- [x] Add `numpy` + `scipy` to `requirements.txt`
- [x] Tests for profiles module: **11 passed**
- [x] Full test suite: no regressions

**Doc 06 correction applied**: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recent-biased).

**Gaps**: None.
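The decay rates above can be read as a standard exponentially weighted moving average over paper embeddings. The following is a minimal sketch under that assumption, not the contents of `app/recommend/profiles.py`; the function names `l2_normalize` and `ewma_update` are hypothetical, and only the three α values come from this tracker.

```python
from typing import Optional

import numpy as np

# Decay rates quoted in the tracker; everything else here is illustrative.
ALPHA_LONG = 0.03
ALPHA_SHORT = 0.40
ALPHA_NEG = 0.15


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(v)
    return v if norm == 0.0 else v / norm


def ewma_update(profile: Optional[np.ndarray], paper_vec: np.ndarray, alpha: float) -> np.ndarray:
    """Blend one paper embedding into a profile: (1 - alpha) * old + alpha * new."""
    paper_vec = l2_normalize(np.asarray(paper_vec, dtype=np.float64))
    if profile is None:  # first interaction: the profile is just the paper itself
        return paper_vec
    return l2_normalize((1.0 - alpha) * profile + alpha * paper_vec)
```

With the same old profile and the same new paper, the short-term profile (α=0.40) moves much further toward the new paper than the long-term one (α=0.03). This is the intended "recent interests outweigh old ones" behaviour at two time scales.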

---

## Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE

> *Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.*

- [x] Create `app/recommend/clustering.py`: Ward clustering + medoid extraction
  - L2-normalize embeddings before Ward (Doc 06 correction)
  - Adaptive gap-based threshold (no fixed K)
  - Medoid representation (real papers, not centroids)
  - Dynamic K (1-7 clusters, auto-determined)
  - Recency-weighted importance scores
- [x] Modify `app/qdrant_svc.py`: add `multi_interest_search()` with prefetch+RRF
- [x] Modify `app/routers/recommendations.py`: 3-tier cascading pipeline
  - Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
  - Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
  - Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
- [x] Tests for clustering module: **10 passed**
- [x] Full test suite: no regressions

**Doc 06 corrections applied**: L2-normalization before Ward, medoid not centroid.

**Gaps (deferred to Phase 4)**:

- [!] RRF → quota fusion (dominant clusters can swamp minority interests)
- [!] Hungarian matching for cluster ID stability across reclusterings
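One way to combine Ward linkage, a gap-based cut, and medoid extraction is sketched below with SciPy. It is an illustration, not the code in `app/recommend/clustering.py`. The function names, the `MIN_GAP` cutoff for falling back to a single cluster, and the "largest jump between consecutive merge heights" rule are all assumptions; the tracker only states that the threshold is adaptive and that K ranges from 1 to 7.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

MAX_CLUSTERS = 7  # upper bound quoted in the tracker
MIN_GAP = 0.3     # assumed: smaller gaps are treated as a single interest


def cluster_interests(embeddings, max_k=MAX_CLUSTERS, min_gap=MIN_GAP):
    """Return 1-based cluster labels from Ward linkage cut at the largest gap."""
    x = np.asarray(embeddings, dtype=np.float64)
    n = len(x)
    if n < 3 or max_k < 2:  # too few papers (or clusters allowed) to measure a gap
        return np.ones(n, dtype=int)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize before Ward
    tree = linkage(unit, method="ward")
    heights = tree[:, 2]            # n - 1 merge distances, non-decreasing for Ward
    gaps = np.diff(heights)         # gaps[i] separates merge i from merge i + 1
    lowest = max(0, n - 1 - max_k)  # cutting any lower would exceed max_k clusters
    best = lowest + int(np.argmax(gaps[lowest:]))
    if gaps[best] < min_gap:
        return np.ones(n, dtype=int)
    k = n - (best + 1)              # clusters remaining after merges 0..best
    return fcluster(tree, t=k, criterion="maxclust")


def find_medoid(unit_vectors: np.ndarray) -> int:
    """Index of the member with the highest total cosine similarity to its cluster."""
    sims = unit_vectors @ unit_vectors.T
    return int(np.argmax(sims.sum(axis=1)))
```

The medoid is always a real saved paper, so its vector can be sent to the ANN index directly. A centroid may fall between papers and match nothing the user actually read.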

---

## Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE

> *Added scoring and diversity layers on top of retrieval to produce the final feed.*

- [x] Create `app/recommend/reranker.py`: 5-feature heuristic scorer
  - Feature 1: cosine_sim_longterm (weight 0.40)
  - Feature 2: cosine_sim_shortterm (weight 0.25)
  - Feature 3: paper_age_days / recency (weight 0.15)
  - Feature 4: rrf_position (weight 0.10)
  - Feature 5: cosine_sim_negative (weight -0.15) (Doc 06 addition)
- [x] Create `app/recommend/diversity.py`: MMR + exploration injection
  - MMR with λ=0.6
  - 2 serendipitous exploration papers per feed
- [x] Modify `app/routers/recommendations.py`: full 5-step pipeline
  - Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
- [x] Tests for reranker + diversity: **13 passed**
- [x] Full test suite: **88 passed** (86 + 2 pre-existing live Qdrant failures resolved)

**Doc 06 correction applied**: Negative EWMA profile wired as Feature 5 with 0.15 penalty.

**Gaps**: None. LightGBM model now integrated (Phase 6 ✅).
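The scoring and diversity layers can be sketched as a weighted feature sum followed by greedy Maximal Marginal Relevance. The weights and λ below are the ones listed in this section. The function names, and the assumption that every feature has already been scaled to [0, 1], are illustrative and not taken from `reranker.py` or `diversity.py`.

```python
import numpy as np

# Weights in tracker order: long-term sim, short-term sim, recency,
# RRF position, negative-profile sim (penalty).
WEIGHTS = np.array([0.40, 0.25, 0.15, 0.10, -0.15])
MMR_LAMBDA = 0.6


def heuristic_scores(features: np.ndarray) -> np.ndarray:
    """Score N candidates from an (N, 5) feature matrix (features assumed in [0, 1])."""
    return features @ WEIGHTS


def mmr_select(relevance: np.ndarray, unit_vectors: np.ndarray, k: int,
               lam: float = MMR_LAMBDA) -> list:
    """Greedy MMR: trade relevance against similarity to already selected items."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        if selected:
            # Highest cosine similarity of each remaining item to the selected set
            redundancy = (unit_vectors[remaining] @ unit_vectors[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        mmr = lam * relevance[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ=0.6 a near-duplicate of an already selected paper needs a clearly higher relevance score to beat a less relevant but novel paper. This is what keeps a single dominant topic from filling the feed.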

---

## Phase 2d: Advanced Models - DEFERRED (Blocked by data/users)

> *These logically belong to the recommendation engine but cannot be built without real user data or scale.*

- [~] LightGBM lambdarank model: requires ≥500 labeled save/dismiss interactions → Phase 6
- [~] Collaborative filtering features: requires ≥500 users → Phase 9
- [~] DPP diversity: explicitly ruled out for v1 by Doc 06 → Phase 9+
- [~] Two-Tower model: requires GPU + large dataset → Phase 9+

---

## Phase 3: Hybrid Semantic Search ✅ COMPLETE

> *Replace the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.*
> *Detailed plan: `docs/phases/PHASE3-Hybrid-Semantic-Search.md`*
> *Prototype reference: `docs/phases/PHASE2-Hybrid-Search-Plan.md`*
> *Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)*

### New files created

- [x] `app/embed_svc.py`: BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570MB, ~15s cold)
  - `encode_query(text)` → `(dense: np.ndarray[1024], sparse: dict)`
  - LRU cache for repeat queries
  - Thread-safe, lazy loading with double-check locking
- [x] `app/zilliz_svc.py`: Zilliz Cloud sparse search client
  - Collection: `arxiv_bgem3_sparse`
  - Schema: `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR)
  - Index: SPARSE_INVERTED_INDEX, metric_type=IP
  - Sparse format: `{int_token_id: float_weight}` (BGE-M3 lexical weights, NOT string words)
  - `search_sparse(sparse_dict, limit)` → `list[dict]` with arxiv_id + score
  - gRPC reconnect handling
- [x] `app/groq_svc.py`: LLM query rewriter (Groq / llama-3.3-70b)
  - `rewrite(user_query)` → academic query string
  - Graceful fallback to original query on error
  - Academic-detection heuristic to skip unnecessary rewrites
  - 2s hard timeout
- [x] `app/hybrid_search_svc.py`: search orchestrator
  - Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
  - Each step has independent failure handling
  - Recency reranking: 0.80 RRF + 0.20 recency
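The last two orchestrator stages (RRF, then the 0.80/0.20 recency blend) can be sketched as follows. This is a hedged illustration rather than the code in `app/hybrid_search_svc.py`: the function names are hypothetical, `RRF_K = 60` is the commonly used constant and is not stated in this tracker, and max-normalising the RRF scores before blending is an assumption made so both terms share a [0, 1] scale.

```python
from collections import defaultdict

RRF_K = 60                       # assumed constant; not stated in the tracker
W_RRF, W_RECENCY = 0.80, 0.20    # blend weights quoted in the tracker


def rrf_fuse(rankings, k=RRF_K):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list an ID appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] += 1.0 / (k + rank)
    return dict(scores)


def blend_with_recency(rrf_scores, recency, w_rrf=W_RRF, w_rec=W_RECENCY):
    """Return IDs sorted by w_rrf * normalised RRF + w_rec * recency (recency in [0, 1])."""
    top = max(rrf_scores.values(), default=0.0) or 1.0
    combined = {
        pid: w_rrf * score / top + w_rec * recency.get(pid, 0.0)
        for pid, score in rrf_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)
```

For example, `rrf_fuse([dense_ids, sparse_ids])` rewards papers that rank well in both the Qdrant dense list and the Zilliz sparse list, without needing the two score scales to be comparable.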

### Files modified

- [x] `app/config.py`: added `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`, `ENCODE_CACHE_SIZE`, search weights, `APP_PORT`
- [x] `app/qdrant_svc.py`: added `search_dense(dense_vec, limit)` for raw vector search returning scores
- [x] `app/routers/search.py`: swapped `arxiv_svc.search()` → `hybrid_search_svc.search()` with arXiv fallback
- [x] `app/main.py`: added graceful BGE-M3 warm-up to lifespan
- [x] `requirements.txt`: added `FlagEmbedding`, `pymilvus`, `groq`
- [x] `run.py`: configurable port (7860 default for HF Spaces)

### Deployment files created

- [x] `Dockerfile`: HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
- [x] `.dockerignore`: excludes notebooks, PDFs, databases, caches

### Implementation steps completed

- [x] Step 1: BGE-M3 model service (`embed_svc.py`) + unit tests
- [x] Step 2: Zilliz client (`zilliz_svc.py`)
- [x] Step 3: Dense search in Qdrant service
- [x] Step 4: Groq rewriter (`groq_svc.py`)
- [x] Step 5: Hybrid search orchestrator (`hybrid_search_svc.py`)
- [x] Step 6: Swap search router
- [x] Step 7: Model warm-up + deployment config
- [x] Step 8: Tests: **21 new tests passing** (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)

### Test results

- 88 original tests: ✅ All pass (zero regressions)
- 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
- 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
- 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
- **Total: 123 tests passing**

### Latency budget

| Stage | Time |
|---|---|
| LLM rewrite (Groq) | ~300ms (skippable) |
| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
| Qdrant + Zilliz (parallel) | ~300ms |
| RRF + rerank | <5ms |
| **Total (warm)** | **~600ms** |

---

## Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE

> *Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.*
> *Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).*
> *Integrated into codebase and deployed to HF Spaces.*

### Infrastructure

- [x] Turso cloud DB created: `arxiv-data` on `aws-ap-south-1`
  - URL: `https://arxiv-data-siddhm11.aws-ap-south-1.turso.io`
  - Auth: Platform token + DB auth token (minted via CLI)
- [x] Table: `papers` with columns:
  - `arxiv_id` (TEXT, UNIQUE INDEX `idx_papers_arxiv_id`)
  - `title` (TEXT)
  - `authors` (TEXT)
  - `categories` (TEXT)
  - `primary_topic` (TEXT)
  - `update_date` (TEXT)
  - `abstract_preview` (TEXT, truncated to 500 chars)
  - `citation_count` (INTEGER, default 0)
  - `influential_citations` (INTEGER, default 0)
- [x] Data sources:
  - `arxiv_comprehensive_papers.csv` (Kaggle: siddhm11/arxivdata)
  - `arxiv_citations_summary.csv` (Kaggle: siddhm11/citation-data-letsgoo)
  - Joined on `id` = `arxiv_id_clean`, deduplicated
- [x] Row count verified: local and remote match
- [x] Unique index on `arxiv_id` for fast lookups

### Integration (DONE)

- [x] Added `TURSO_URL` and `TURSO_DB_TOKEN` to `config.py` / `.env` / HF Secrets
- [x] Created `app/turso_svc.py`: metadata lookup service
  - `fetch_metadata_batch(arxiv_ids)` → `{arxiv_id: paper_dict}`
  - Uses Turso HTTP pipeline API (zero new Python deps, just httpx)
  - Includes citation_count + influential_citations
- [x] `app/routers/search.py`: Turso primary, arXiv API fallback (only for IDs not in Turso)
- [x] Created `tests/test_turso_timing.py`: timing benchmark
- [x] **Verified**: 10/10 title match, 6.1x end-to-end speedup on HF Spaces
- [x] **Impact**: Avg search time dropped from ~10.7s to ~1.75s on HF Spaces

---

## Phase 4: Recommendation Pipeline Fixes ✅ COMPLETE

> *Fixed the known architectural debt in the recommendation pipeline.*
> *Detailed plan: `docs/phases/PHASE4-Recommendation-Pipeline-Fixes.md`*

### 4.1 - Replace RRF with Importance-Weighted Quota Fusion

- [x] Create `app/recommend/fusion.py`: quota allocation logic
  - `w_k = importance_k / sum(importance_k)`
  - `slot_k = max(floor(F × w_k), F_min=3)`: every cluster gets at least 3 slots
  - Distribute remainder by largest fractional part
- [x] Create `tests/test_fusion.py`: **20 unit tests** for quota allocation
  - Proportionality, floor enforcement, total invariant, edge cases, Doc 06 worked examples
- [x] Refactor `_multi_interest_recommend()` in `recommendations.py`
  - Replace `multi_interest_search()` with per-cluster separate ANN queries
  - Use `asyncio.gather()` for concurrent searches (~15ms wall-clock)
  - Allocate feed slots proportionally via `allocate_quotas()`
  - Deduplicate across clusters (first-occurrence = highest-ranked cluster wins)
  - MMR over merged union (unchanged)
- [x] Keep `qdrant_svc.multi_interest_search()` in codebase (no deletion)
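The three allocation rules above can be sketched as a single function. This is not the `allocate_quotas()` in `app/recommend/fusion.py`; the name `allocate_quotas_sketch` is hypothetical. The tracker does not say what happens when the 3-slot minimum pushes the total above the feed size, so the trimming rule at the end is an assumption.

```python
import math


def allocate_quotas_sketch(importances, feed_size, f_min=3):
    """Split feed_size slots across clusters in proportion to importance."""
    total = sum(importances)
    raw = [feed_size * imp / total for imp in importances]   # F * w_k
    slots = [max(math.floor(r), f_min) for r in raw]         # floor, then minimum

    # Hand out unfilled slots to the clusters with the largest fractional parts.
    remainder = feed_size - sum(slots)
    by_fraction = sorted(range(len(raw)),
                         key=lambda i: raw[i] - math.floor(raw[i]),
                         reverse=True)
    for i in by_fraction[:max(remainder, 0)]:
        slots[i] += 1

    # Assumed overflow rule: if the minimum overshoots the feed size, trim the
    # largest quotas one slot at a time, never going below f_min. When
    # len(importances) * f_min > feed_size the total cannot be met.
    while sum(slots) > feed_size and max(slots) > f_min:
        slots[slots.index(max(slots))] -= 1
    return slots
```

For importances `[5, 3, 2]` and a 20-slot feed this gives `[10, 6, 4]`. For a dominant cluster such as `[8, 1, 1]` the minority interests still receive three slots each and the dominant cluster is trimmed to 14.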

### 4.2 - Pre-populate Metadata Store ✅ DONE (via Turso)

- [x] Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
- [x] 1.23 GB, includes citation counts from Semantic Scholar
- [x] Wired Turso service into `search.py` (Turso primary, arXiv API fallback)
- [x] arXiv API is now fallback only for genuinely new papers
- [x] **Impact**: Search time dropped from ~10.7s to ~1.75s on HF Spaces

### 4.3 - Hungarian Matching for Cluster Stability

- [x] Add `stabilize_cluster_ids()` function to `clustering.py`
  - Uses `scipy.optimize.linear_sum_assignment` (already a dependency)
  - Cost matrix: `1 - cosine_sim(new_medoid, old_medoid)` (trivial at K≤7)
  - Matched clusters keep old indices; new clusters get next available
  - Min cosine threshold (0.5) rejects unrelated matches
- [x] Call between `compute_clusters()` and `save_clusters_to_db()` in recommendations.py
- [x] **10 tests** in `test_clustering.py`: perturbed clusters preserve indices, unrelated match rejection, K growth/shrink, custom thresholds
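The matching step can be sketched with `scipy.optimize.linear_sum_assignment` as below. It follows the cost matrix and threshold described above, but it is not the project's `stabilize_cluster_ids()`. The name `stabilize_ids_sketch` is hypothetical, and "next available" is interpreted here as "one past the highest old index", which is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def stabilize_ids_sketch(new_medoids, old_medoids, min_cosine=0.5):
    """Return, for each new cluster, the stable ID it should carry forward."""
    new_u = new_medoids / np.linalg.norm(new_medoids, axis=1, keepdims=True)
    old_u = old_medoids / np.linalg.norm(old_medoids, axis=1, keepdims=True)
    sim = new_u @ old_u.T                               # cosine similarity matrix
    rows, cols = linear_sum_assignment(1.0 - sim)       # minimise total cosine distance

    ids = [-1] * len(new_u)
    for r, c in zip(rows, cols):
        if sim[r, c] >= min_cosine:                     # reject unrelated matches
            ids[r] = int(c)

    next_id = len(old_u)                                # assumed: fresh IDs start here
    for r in range(len(ids)):
        if ids[r] == -1:
            ids[r] = next_id
            next_id += 1
    return ids
```

`linear_sum_assignment` accepts rectangular cost matrices, so the same call covers K growing or shrinking between reclusterings. Unmatched new clusters simply fall through to the fresh-ID branch.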

### 4.4 - Category-Level Negative Suppression

- [x] Add `get_suppressed_categories()` to `db.py`
  - Joins `interactions` + `paper_metadata` to find categories with ≥3 dismissals
  - **Primary category only** (decision: avoid over-suppression)
  - **14-day window** (standard default, τ_neg = 14 days)
- [x] Add suppression filter in `_multi_interest_recommend()` after reranking
- [x] Cache Turso metadata to `paper_metadata` via `cache_turso_metadata_batch()`
- [x] **8 tests** in `test_db.py`: threshold, partitioning, user isolation, custom threshold
- [~] Per-item short-term decay: **deferred to Phase 6** (LightGBM feature)

**Gaps**: None.
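The suppression rule in 4.4 (primary category, at least 3 dismissals, 14-day window) maps naturally onto one grouped SQL query. The sketch below uses the synchronous `sqlite3` module for brevity, whereas the project uses aiosqlite. The column names `user_id`, `action`, `created_at`, and `primary_category`, and the `'dismiss'` action value, are assumptions about the schema rather than facts from this tracker.

```python
import sqlite3

# Hypothetical schema: interactions(user_id, arxiv_id, action, created_at)
#                      paper_metadata(arxiv_id, primary_category)
SUPPRESSION_SQL = """
SELECT m.primary_category
FROM interactions AS i
JOIN paper_metadata AS m ON m.arxiv_id = i.arxiv_id
WHERE i.user_id = ?
  AND i.action = 'dismiss'
  AND i.created_at >= datetime('now', ?)
GROUP BY m.primary_category
HAVING COUNT(*) >= ?
"""


def suppressed_categories(conn: sqlite3.Connection, user_id: str,
                          threshold: int = 3, window_days: int = 14) -> set:
    """Primary categories the user dismissed at least `threshold` times recently."""
    window = f"-{window_days} days"  # SQLite date modifier, e.g. '-14 days'
    rows = conn.execute(SUPPRESSION_SQL, (user_id, window, threshold)).fetchall()
    return {row[0] for row in rows}
```

The returned set can then be used to drop reranked candidates whose primary category is suppressed. Dismissals older than the window stop counting, so a category returns to the feed without any extra bookkeeping.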

---

## Phase 4.5: Instrumentation Foundation ✅ COMPLETE

> *Added telemetry columns to the interactions table so every saved/dismissed paper*
> *can be attributed to its pipeline tier, cluster origin, and ranker version.*
> *Doc 07 (ADR A4) identified this as the single most valuable early investment:*
> *retrofitting these fields after real user data exists is painful and blocks all*
> *later counterfactual evaluation.*

### Schema changes

- [x] Add `ranker_version TEXT` to `interactions` table: pipeline version tag
- [x] Add `candidate_source TEXT` to `interactions`: e.g. `cluster_0`, `exploration`, `ewma_longterm`, `qdrant_recommend`, `short_term_supplement`
- [x] Add `cluster_id INTEGER` to `interactions`: interest cluster index (NULL if N/A)
- [x] ALTER TABLE migration for existing DBs (safe try/except, idempotent)
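A "safe try/except, idempotent" migration for these three columns can look like the sketch below. It uses synchronous `sqlite3` for brevity and is not the code in `app/db.py`. The function name is hypothetical, and matching on SQLite's "duplicate column name" error text is one possible way to tell "already migrated" apart from a real failure.

```python
import sqlite3

# Column names and types as listed in the tracker.
NEW_COLUMNS = {
    "ranker_version": "TEXT",
    "candidate_source": "TEXT",
    "cluster_id": "INTEGER",
}


def migrate_interactions(conn: sqlite3.Connection) -> list:
    """Add missing telemetry columns; return the names actually added."""
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        try:
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {name} {sql_type}")
            added.append(name)
        except sqlite3.OperationalError as exc:
            # Re-running on an already migrated DB raises "duplicate column name";
            # anything else (for example a missing table) is a genuine error.
            if "duplicate column name" not in str(exc).lower():
                raise
    conn.commit()
    return added
```

Calling the function on every startup is then harmless: the first run adds all three columns and later runs add nothing.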

### Pipeline tagging

- [x] Add `_RANKER_VERSION` constant to `recommendations.py`
- [x] Tag Tier 1 papers with cluster origin, exploration status, short-term supplement
- [x] Tag Tier 2 papers as `ewma_longterm`
- [x] Tag Tier 3 papers as `qdrant_recommend`
- [x] Build `paper_cluster_map` before quota merge (first-occurrence = cluster attribution)
- [x] Exploration papers tagged as `candidate_source='exploration'`

### End-to-end flow

- [x] `recommendations.py` embeds tags in paper dicts
- [x] `action_buttons.html` includes tags in `hx-vals` JSON
- [x] `events.py` accepts `ranker_version`, `candidate_source`, `cluster_id` Form fields
- [x] `db.log_interaction()` stores all three new columns

**Files modified**: `app/db.py`, `app/routers/events.py`, `app/routers/recommendations.py`, `app/templates/partials/action_buttons.html`

**Gaps**: None. `propensity` and `policy_id` fields deferred until ε-greedy exploration (Phase 9).

---

## Phase 5: Cold-Start Onboarding ✅ COMPLETE

> *Onboarding wizard for new users: category selection + seed paper search + trending fallback.*
> *Reference: Doc 06, "4-37% lift even once behavioral data exists"*

### 5.1 - arXiv Category Multi-Select ✅

- [x] UI screen on first visit: select 1-8 arXiv category groups
- [x] Store selections in SQLite (`user_onboarding` table)
- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
- [x] Does NOT create "subject vectors", just filters

### 5.2 - Seed Paper Import ✅

- [x] Let users search for and save seed papers during onboarding
- [x] Immediately create EWMA profiles + Ward clusters on next feed request
- [x] Uses hybrid search (Phase 3) for discovery

### ~~5.3 - ORCID / Semantic Scholar Import~~ - REMOVED

> S2 author import was implemented but later removed; it is not the onboarding direction we want.
> Onboarding focuses on category selection + manual seed paper search.

### 5.4 - Popularity Fallback ✅

- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
- [x] 1-hour TTL trending cache for performance

---

## Phase 6: LightGBM Re-ranker ✅ COMPLETE

> *Replaced heuristic scorer with a trained LightGBM lambdarank model.*
> *Unblocked via citation-graph pseudo-labels from Semantic Scholar.*
> *Handoff doc: `docs/PHASE6-HANDOFF.md`*
> *Model repo: [siddhm11/researchit-reranker-phase6](https://huggingface.co/siddhm11/researchit-reranker-phase6)*

### 6.1 - ML Intern: Data Pipeline + Model Training ✅

- [x] Export 1.6M arXiv IDs from Turso → `arxiv_ids.txt` (`scripts/export_arxiv_ids.py`)
- [x] Fetch 242K citation edges from Semantic Scholar Batch API (`01_fetch_citation_edges.py`)
- [x] Generate 98K training triples with pseudo-labels: cited=2, co-cited=1, negative=0 (`02_generate_training_triples.py`)
- [x] 37-feature schema (20 content, 11 user behavior, 6 cross-features)
- [x] Train LightGBM LambdaRank model: 141 trees, 63 leaves, lr=0.05 (`03_train_lightgbm.py`)
- [x] nDCG@10 = 0.879 (+233% vs heuristic baseline)
- [x] All artifacts pushed to HuggingFace

### 6.2 - Opus: Integration into ResearchIT ✅

- [x] Rewrite `app/recommend/reranker.py`: 5 features → 37 features
- [x] LightGBM model loading at import time with heuristic fallback
- [x] Multi-path model file search (env var → relative → absolute)
- [x] Backward-compatible `rerank_candidates()` signature (old callers unaffected)
- [x] Add `lightgbm>=4.0,<5.0` to `requirements.txt`
- [x] Fix CRLF→LF line endings in model file (Windows Git issue)
- [x] 7 integration tests: **all passing** (`tests/test_reranker_integration.py`)
- [x] Latency verified: **0.223ms per 100 candidates** (target: <1ms) ✅

### 6.3 - Antigravity: Feature Wiring + Deployment Verification ✅

- [x] Wire all 37 features into `recommendations.py` caller (was legacy 6-arg signature)
- [x] Per-candidate `cluster_importance` (N,) from `paper_cluster_map`
- [x] Per-candidate `cluster_medoid` (N, 1024) per source cluster
- [x] Pre-computed `is_suppressed_category` and `onboarding_category_match` arrays
- [x] Pass `qdrant_scores`, `user_total_saves`, `user_total_dismissals`
- [x] `reranker.py` supports both scalar broadcast and per-candidate arrays
- [x] Add model accessors: `is_model_loaded()`, `get_num_trees()`, `get_loaded_model_path()`
- [x] Add per-request feature activation logging
- [x] Create `GET /healthz/reranker` endpoint (`app/routers/health.py`)
- [x] Bug B fix: persist `medoid_embedding_blob` BLOB in `user_clusters` table
- [x] Bug B fix: fall back to persisted blob instead of zero vector in Hungarian matching
- [x] DB migration: `ALTER TABLE user_clusters ADD COLUMN medoid_embedding_blob BLOB`
- [x] 9 new tests: **all passing** (`tests/test_phase6_feature_wiring.py`)
- [x] Full suite: **203+ tests passing, 0 failures**
- [x] Updated `CLAUDE.md`, `PHASE6-HANDOFF.md`, `README.md`
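The "scalar broadcast and per-candidate arrays" support mentioned in 6.3 can be illustrated with a small NumPy helper. This is a sketch of the idea only: `as_per_candidate` is a hypothetical name, and the real `reranker.py` may handle the two input shapes differently.

```python
import numpy as np


def as_per_candidate(value, n_candidates: int, feature_shape=()) -> np.ndarray:
    """Return an array of shape (n_candidates, *feature_shape).

    A single shared value (shape == feature_shape) is repeated for every
    candidate; an array that already has one row per candidate is passed through.
    """
    arr = np.asarray(value, dtype=np.float32)
    target = (n_candidates, *feature_shape)
    if arr.shape == tuple(feature_shape):       # one shared value: broadcast it
        return np.broadcast_to(arr, target).copy()
    if arr.shape == target:                     # already per-candidate
        return arr
    raise ValueError(f"expected shape {tuple(feature_shape)} or {target}, got {arr.shape}")
```

A legacy caller can keep passing one `cluster_importance` float or one medoid vector, while the Tier 1 path passes `(N,)` and `(N, 1024)` arrays built from `paper_cluster_map`. Both end up in the same shape before feature construction.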

### 6.4 - Retraining [~] DEFERRED

> **Phase 6.4 retraining is deferred.** The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23-30 all zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. **Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.**

- [~] Synthetic user simulator (`scripts/simulate_users.py`), target: +30d
- [~] Real-user retrain at 100-user threshold, target: +90d or threshold
- [~] HF model card backfill (library_name, pipeline_tag, metrics, schema)

---

## Phase 6.5: Instrumentation ✅ COMPLETE

> **Purpose**: Stabilize the recommendation pipeline and prepare telemetry substrate for Phase 7 evaluation.

### A1 - Real Qdrant cosine scores

- [x] Switch `search_by_vector()` → `search_by_vector_with_scores()` in per-cluster + short-term searches
- [x] Build `qdrant_score_map` from real cosines (replaces fake `1.0 - rank*0.01` linear decay)
- [x] Feature 0 (`qdrant_cosine_score`) now receives actual cosine similarities

### A2 - Deployment verification

- [x] `curl /healthz/reranker` → `model_loaded=true, n_trees=141, fallback_active=false`
- [x] Verification timestamp added to `PHASE6-Reranker-Framing.md`

### B1 - query_id linkage

- [x] Generate `query_id` (UUID) once per feed request in `get_recommendations()`
- [x] Thread through all 4 tiers: trending, Tier 1, Tier 2, Tier 3
- [x] Generate `query_id` in `search.py` per search request
- [x] Add `query_id` + `position` to `action_buttons.html` hx-vals

### B2 - Propensity logging

- [x] Add `propensity REAL` + `policy_id TEXT` migration to `interactions` table
- [x] Extend `db.log_interaction()` with propensity + policy_id params
- [x] Compute propensity: 1.0 (deterministic) vs `n_explore/pool_size` (exploration)
- [x] Thread through templates + `events.py` Form params
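The two logging fields from B1 and B2 reduce to a few lines each. The sketch below shows the propensity rule exactly as stated above (1.0 for deterministic slots, `n_explore / pool_size` for exploration slots) together with a per-request UUID. The function names are hypothetical and the argument validation is an added assumption.

```python
import uuid


def new_query_id() -> str:
    """One opaque ID per feed or search request, shared by every paper shown."""
    return str(uuid.uuid4())


def slot_propensity(is_exploration: bool, n_explore: int = 0, pool_size: int = 0) -> float:
    """Probability that the logging policy showed this paper in its slot."""
    if not is_exploration:
        return 1.0                      # deterministic ranker output
    if pool_size <= 0 or not 0 < n_explore <= pool_size:
        raise ValueError("exploration slots need 0 < n_explore <= pool_size")
    return n_explore / pool_size        # uniform draw of n_explore from the pool
```

Logging the propensity at impression time is what later allows inverse-propensity weighting: an exploration paper shown with probability 0.04 counts 25 times as much as a deterministic one when estimating how a different policy would have performed.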

### B3 - Cluster snapshot versioning

- [x] Add `cluster_snapshots` table (append-only, content-addressed via `paper_ids_hash`)
- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
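An append-only, content-addressed snapshot table can be sketched as follows. Only the table name, the `paper_ids_hash` column, and the 30-day pruning horizon come from this tracker. The remaining columns, the SHA-256 choice for the hash, and the helper names are assumptions, and synchronous `sqlite3` stands in for aiosqlite.

```python
import hashlib
import json
import sqlite3

# Hypothetical DDL: the real table may carry more columns.
SNAPSHOT_DDL = """
CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id        TEXT NOT NULL,
    cluster_id     INTEGER NOT NULL,
    paper_ids_hash TEXT NOT NULL,
    paper_ids_json TEXT NOT NULL,
    created_at     TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE (user_id, cluster_id, paper_ids_hash)
)
"""


def paper_ids_hash(paper_ids) -> str:
    """Order-insensitive digest of a cluster's membership."""
    canonical = "\n".join(sorted(set(paper_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_snapshot(conn, user_id, cluster_id, paper_ids) -> bool:
    """Append a snapshot unless identical content is already stored."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO cluster_snapshots "
        "(user_id, cluster_id, paper_ids_hash, paper_ids_json) VALUES (?, ?, ?, ?)",
        (user_id, cluster_id, paper_ids_hash(paper_ids), json.dumps(sorted(set(paper_ids)))),
    )
    conn.commit()
    return conn.total_changes > before   # True only when a new row was written


def prune_snapshots(conn, max_age_days: int = 30) -> int:
    """Delete snapshots older than the retention window; return rows removed."""
    cur = conn.execute(
        "DELETE FROM cluster_snapshots WHERE created_at < datetime('now', ?)",
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cur.rowcount
```

Because the hash is computed over the sorted ID set, reclustering that yields the same membership in a different order does not create a new row.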

### ~~B4 - S2 author import~~ - REMOVED

> S2 author import was implemented and then removed; it is not the onboarding direction we want.
> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
> have all been deleted. Onboarding uses category selection + manual seed search only.

### Documentation

- [x] `CLAUDE.md`: Rule 3.11 (interaction instrumentation invariants)
- [x] `_RANKER_VERSION` bumped to `v6.5_lightgbm_real_cosines`
- [x] Phase status updated to 6.5 COMPLETE
- [x] Tests: 203+ passing

### Test suite

- `tests/test_reranker_integration.py`: 7 tests (smoke, features, heuristic, E2E, latency, backward compat, comparison)
- `tests/test_phase6_feature_wiring.py`: 9 tests (per-candidate arrays, broadcast medoid, model accessors, aggregate activation)
- `tests/demo_reranker.py`: interactive demo with 20 realistic papers

---

## Phase 7: Evaluation Framework - NOT STARTED

> *Build offline and online evaluation before scaling users.*
> *Estimated effort: ~1 week*

- [ ] Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
- [ ] Time-split evaluation on unarXive 2022 + S2ORC
- [ ] Online metrics (once users exist): CTR, save rate, dwell time, return rate
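Two of the planned offline metrics are small enough to sketch here. The nDCG variant below uses the exponential gain `2^rel - 1`, which suits graded labels such as the 2/1/0 pseudo-labels from Phase 6.1. Which gain function Phase 7 will adopt is not stated in this tracker, so treat that choice, and the function names, as assumptions.

```python
import math


def dcg_at_k(relevances, k: int) -> float:
    """Discounted cumulative gain with exponential gain 2^rel - 1."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k: int = 10) -> float:
    """DCG of the given order divided by DCG of the ideal order (0.0 if no gain)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_rate_at_k(ranked_ids, relevant_ids, k: int = 10) -> float:
    """1.0 if any relevant paper appears in the top k, else 0.0."""
    return 1.0 if any(pid in relevant_ids for pid in ranked_ids[:k]) else 0.0
```

Here `relevances` lists the graded labels of one user's feed in ranked order. Averaging `ndcg_at_k` and `hit_rate_at_k` over users (or queries) gives the nDCG@10 and HR@10 figures.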

---

## Phase 8: LLM Interest Summaries + Distilled Re-ranker - NOT STARTED

> *Estimated effort: ~10-12 weeks (Doc 07)*
> *Detailed research plan: `docs/research/07-LLM-Summaries-Reranker-and-Scaling-Research.md`*
> *Entry criteria: Phase 7 eval producing stable nDCG@10; cluster stability Jaccard ≥0.7 over 7 days*

### 8a - Claude-generated per-cluster interest summaries (Doc 07 §A)

- [ ] Cluster snapshot versioning (ADR A1)
- [ ] Content-addressed caching: `sha256(sorted(paper_ids) + prompt_version + model)`
- [ ] Shared summaries (not per-user): Haiku 4.5 + Batch API (~$50-80/month @ 1K users)
- [ ] Nightly regeneration job with 7-day TTL + event-triggered refresh
- [ ] "You're reading about X" UI framing with sub-theme bullets
- [ ] Anthropic Citations API for hallucination prevention

### 8b - Distilled cross-encoder reranker (Doc 07 §B)

- [ ] Deploy `cross-encoder/ms-marco-TinyBERT-L-2-v2` INT8 ONNX as MVP
- [ ] 6ms budget for 20 pairs on CPU (AVX-512 VNNI)
- [ ] TinyBERT score as LightGBM feature (Option C architecture)
- [ ] Custom distillation from BGE-reranker-v2-m3 only if held-out gap >3 nDCG
- [ ] MarginMSE loss + SciNCL citation-graph hard negatives

### 8c - Use-cases and information-gain design doc (Doc 07 §C)

- [ ] 8 user personas (P1 cold-start through P8 stay-current)
- [ ] Information-gain table (save=3-5×, dismiss-as-label=−3-4×, passive skip=−0.1×)
- [ ] Mode-switching UI: "Stay Current" vs "Lit Review" toggle
- [ ] Failure mode detection rules (feed collapse, stale profile, filter bubble)

---

## Phase 9: Exploration + Collaborative Filtering - NOT STARTED

> *Blocked by: ≥500 users*

- [ ] Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
- [ ] LightFM hybrid CF model with switching strategy
- [ ] Category-level negative suppression
- [ ] Retrain LightGBM with dismissals as negative labels
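Since this phase is not started, the following is only a sketch of what the planned ε-greedy step might look like. The two ε values come from the list above. The `NEW_USER_MAX_SAVES` cutoff, the function names, and the per-slot decision are hypothetical.

```python
import random

EPS_NEW = 0.25           # from the tracker
EPS_ESTABLISHED = 0.05   # from the tracker
NEW_USER_MAX_SAVES = 10  # hypothetical cutoff between "new" and "established"


def epsilon_for(total_saves: int) -> float:
    """Explore more aggressively while the user profile is still thin."""
    return EPS_NEW if total_saves < NEW_USER_MAX_SAVES else EPS_ESTABLISHED


def pick_for_slot(ranked, explore_pool, epsilon: float, rng: random.Random):
    """Fill one feed slot; return (paper_id, candidate_source)."""
    if explore_pool and rng.random() < epsilon:
        return rng.choice(explore_pool), "exploration"   # uniform random draw
    return ranked[0], "ranker"                           # exploit the top-ranked paper
```

Passing an explicit `random.Random` instance keeps the policy reproducible in tests and makes the exploration probability easy to log alongside each impression.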

---

## Appendix: Infrastructure Status

| Component | Status | Details |
|---|---|---|
| **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
| **Zilliz Cloud** | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
| **Turso (libSQL)** | ✅ Live | 1.23 GB arXiv metadata + citations, `arxiv-data` DB, `papers` table, unique index on `arxiv_id` |
| **SQLite** | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters |
| **HF Spaces** | ✅ Deployed | Docker SDK, free tier, port 7860, https://siddhm11-researchit.hf.space |
| **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
| **arXiv API** | ✅ Fallback only | Keyword search + metadata for papers not in Turso |
| **BGE-M3 Model** | ✅ Live | Pre-baked in Docker image, warm-up at startup |
| **Groq API** | ✅ Live + HF Secret | `app/groq_svc.py`: 2s timeout, academic heuristic skip |
| **Notebooks** | ✅ Organized | `notebooks/`: 01-upload, 02-test, 03-search-benchmark |

### Credentials Status

| Credential | Status | Env Var | Notes |
|---|---|---|---|
| **Qdrant Cloud** | ✅ In `.env` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
| **Zilliz Cloud** | ✅ In `.env` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3, wired |
| **Turso (libSQL)** | ✅ In `.env` + HF | `TURSO_URL`, `TURSO_DB_TOKEN` | Phase 3.5, wired + deployed |
| **Groq** | ✅ In `.env` + HF | `GROQ_API_KEY` | Phase 3, wired + deployed |
| **HF Spaces** | ✅ Deployed | Secrets panel | All env vars set |

---

## Appendix: Test Suite

| Test File | Count | Status | Notes |
|---|---|---|---|
| `tests/test_profiles.py` | 11 | ✅ Passing | |
| `tests/test_clustering.py` | 21 | ✅ Passing | 9 compute + 10 Hungarian + 2 persistence |
| `tests/test_reranker_diversity.py` | 13 | ✅ Passing | |
| `tests/test_reranker_integration.py` | 7 | ✅ Passing | Phase 6: smoke, features, E2E, latency |
| `tests/test_phase6_feature_wiring.py` | 9 | ✅ Passing | Phase 6.3: per-candidate arrays, medoids, accessors |
| `tests/test_fusion.py` | 20 | ✅ Passing | Phase 4.1 |
| `tests/test_db.py` | 19 | ✅ Passing | Includes 4 Turso cache + 8 suppression |
| `tests/test_qdrant_svc.py` | - | ✅ Passing | |
| `tests/test_arxiv_svc.py` | - | ✅ Passing | |
| `tests/test_integration.py` | - | ✅ Passing | Includes quota pipeline E2E |
| `tests/test_user_state.py` | - | ✅ Passing | |
| `tests/test_saved.py` | - | ✅ Passing | |
| `tests/test_hybrid_search.py` | 21 | ✅ Passing | |
| `tests/test_search_router.py` | 6 | ✅ Passing | |
| `tests/test_live_search.py` | 8 | ✅ Passing | |
| **Total** | **203+** | ✅ | |
| `test_e2e_recs.py` (standalone) | 1 | ✅ E2E simulation | |

---

## Appendix: Doc 06 Corrections Tracking

| Correction | Status | Where |
|---|---|---|
| α_long 0.10 → 0.03 | ✅ Applied | `app/recommend/profiles.py:30` |
| L2-normalize before Ward clustering | ✅ Applied | `app/recommend/clustering.py` |
| Medoid not centroid | ✅ Applied | `app/recommend/clustering.py` → `_find_medoid()` |
| Negative EWMA wired into reranking | ✅ Applied | `app/recommend/reranker.py` → Feature 5 |
| RRF → quota fusion for recommendations | ✅ Applied | `app/recommend/fusion.py` (Phase 4.1) |
| Hungarian cluster matching | ✅ Applied | `app/recommend/clustering.py` → `stabilize_cluster_ids()` (Phase 4.3) |
| Per-item short-term negative decay | [!] Backlog | Phase 6 (LightGBM feature) |
| Category-level suppression | ✅ Applied | `app/db.py` → `get_suppressed_categories()` (Phase 4.4) |
| BGE-reranker NEVER in hot path | ✅ Followed | Heuristic scorer used instead |