# ResearchIT - Master Task Tracker
> **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.
> **Last updated**: 2026-05-05
> **Current phase**: Phase 6.5 (Instrumentation) - COMPLETE ✔ | Phase 7 next
---
## Legend
- `[x]` - Done
- `[/]` - In progress
- `[ ]` - Not started
- `[~]` - Intentionally deferred (blocked by data/users/scale)
- `[!]` - Backlog item (documented, not yet coded)
---
## Phase 1: Zero-ML Recommender ✅ COMPLETE
> *Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.*
- [x] Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
- Collection: `arxiv_bgem3_dense`, 1024-dim dense vectors
- File: `app/qdrant_svc.py` → `_get_client()`
- [x] BEST_SCORE Recommend API (raw paper IDs → Qdrant)
- File: `app/qdrant_svc.py` → `recommend()`
- [x] arXiv keyword API search (placeholder - replaced in Phase 3)
- File: `app/arxiv_svc.py` → `search()`
- [x] arXiv metadata fetching + SQLite cache
- File: `app/arxiv_svc.py` → `fetch_metadata_batch()`
- [x] SQLite database schema (interactions, paper_metadata)
- File: `app/db.py` → `init_db()`
- WAL mode, async via aiosqlite
- [x] Cookie-based user identity
- File: `app/config.py` → `COOKIE_NAME`
- [x] User state management (positive/negative deques)
- File: `app/user_state.py` → `UserState`
- [x] Save/Dismiss event logging
- File: `app/routers/events.py`
- [x] HTMX + Jinja2 frontend (search, recs, save, dismiss)
- Files: `app/templates/` (base.html, index.html, search.html, saved.html, partials/)
- [x] Test suite - **55 tests passing**
**Gaps**: None.
---
## Phase 2a: EWMA Profile Embeddings ✅ COMPLETE
> *Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.*
- [x] Create `app/recommend/` module with `__init__.py`
- [x] Create `app/recommend/profiles.py` - EWMA computation + storage
- Long-term: α=0.03 ✅ (corrected from 0.10 per Doc 06)
- Short-term: α=0.40
- Negative: α=0.15
- All embeddings L2-normalized
- [x] Modify `app/db.py` - add `user_profiles` table + `user_clusters` table
- [x] Modify `app/qdrant_svc.py` - add `get_paper_vectors()` and `search_by_vector()`
- [x] Modify `app/routers/events.py` - trigger EWMA updates on save/dismiss
- [x] Modify `app/routers/recommendations.py` - EWMA vector search with Tier 2 fallback
- [x] Add `numpy` + `scipy` to `requirements.txt`
- [x] Tests for profiles module - **11 passed**
- [x] Full test suite - no regressions
**Doc 06 correction applied**: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recent-biased).
**Gaps**: None.
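
The EWMA update described in this phase can be sketched as follows. This is a minimal illustration, not the code in `app/recommend/profiles.py`: the α values come from the list above, while the function names, the plain-list vectors, and the choice to seed a profile with the first paper's embedding are assumptions.

```python
import math

# Decay rates from the tracker; everything else in this sketch is illustrative.
ALPHA_LONG, ALPHA_SHORT, ALPHA_NEG = 0.03, 0.40, 0.15


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def ewma_update(profile, paper_vec, alpha):
    """Blend a new paper embedding into a profile, then re-normalize.

    profile is None before the first interaction (assumption: the first
    paper's embedding seeds the profile).
    """
    paper_vec = l2_normalize(paper_vec)
    if profile is None:
        return paper_vec
    blended = [(1.0 - alpha) * p + alpha * v for p, v in zip(profile, paper_vec)]
    return l2_normalize(blended)


# Two saves on orthogonal topics: the short-term profile moves much further
# toward the second paper than the long-term profile does.
long_term = ewma_update(None, [1.0, 0.0], ALPHA_LONG)
long_term = ewma_update(long_term, [0.0, 1.0], ALPHA_LONG)
short_term = ewma_update(None, [1.0, 0.0], ALPHA_SHORT)
short_term = ewma_update(short_term, [0.0, 1.0], ALPHA_SHORT)
```

A smaller α keeps the profile closer to its history, which is why the long-term rate was lowered from 0.10 to 0.03.
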
---
## Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE
> *Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.*
- [x] Create `app/recommend/clustering.py` - Ward clustering + medoid extraction
- L2-normalize embeddings before Ward ✅ (Doc 06 correction)
- Adaptive gap-based threshold (no fixed K)
- Medoid representation (real papers, not centroids) ✅
- Dynamic K (1–7 clusters, auto-determined)
- Recency-weighted importance scores
- [x] Modify `app/qdrant_svc.py` - add `multi_interest_search()` with prefetch+RRF
- [x] Modify `app/routers/recommendations.py` - 3-tier cascading pipeline
- Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
- Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
- Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
- [x] Tests for clustering module - **10 passed**
- [x] Full test suite - no regressions
- [x] Full test suite β€” no regressions
**Doc 06 corrections applied**: L2-normalization before Ward, medoid not centroid.
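
To illustrate the "medoid, not centroid" correction: a medoid is an actual cluster member, so it always corresponds to a real paper that can be sent to the vector store. The sketch below assumes cosine distance on L2-normalized embeddings; `pick_medoid` and the paper IDs are hypothetical and are not the repository's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def pick_medoid(paper_ids, embeddings):
    """Return the id of the member with the smallest total cosine distance
    to every member of the cluster (hypothetical helper)."""
    unit = [l2_normalize(v) for v in embeddings]
    best_id, best_cost = None, float("inf")
    for i, vi in enumerate(unit):
        cost = sum(1.0 - sum(a * b for a, b in zip(vi, vj)) for vj in unit)
        if cost < best_cost:
            best_id, best_cost = paper_ids[i], cost
    return best_id


# Placeholder IDs and 2-dim vectors; real embeddings are 1024-dim.
ids = ["paper_a", "paper_b", "paper_c"]
vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
medoid = pick_medoid(ids, vecs)
```

Here `paper_b` sits between the other two members, so it is the medoid. A centroid would instead be an averaged vector that matches no real paper.
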
**Gaps (deferred to Phase 4)**:
- [!] RRF → quota fusion (dominant clusters can swamp minority interests)
- [!] Hungarian matching for cluster ID stability across reclusterings
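
The save-count thresholds of the 3-tier cascade listed above can be sketched as a simple dispatch. The function name and the returned labels are hypothetical; only the thresholds (≥5, ≥3, ≥1) come from this tracker, and the zero-save branch is an assumption standing in for whatever the application serves to brand-new users.

```python
def select_tier(num_saves):
    """Pick a retrieval strategy from the user's save count (illustrative)."""
    if num_saves >= 5:
        return "tier1_multi_interest"    # clustering -> per-interest retrieval
    if num_saves >= 3:
        return "tier2_ewma_longterm"     # single ANN search on the EWMA vector
    if num_saves >= 1:
        return "tier3_qdrant_recommend"  # BEST_SCORE Recommend API
    return "no_saves"                    # assumption: handled outside the 3 tiers


tiers = [select_tier(n) for n in (0, 1, 3, 5, 12)]
```
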
---
## Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE
> *Added scoring and diversity layers on top of retrieval to produce the final feed.*
- [x] Create `app/recommend/reranker.py` - 5-feature heuristic scorer
- Feature 1: cosine_sim_longterm (weight 0.40)
- Feature 2: cosine_sim_shortterm (weight 0.25)
- Feature 3: paper_age_days / recency (weight 0.15)
- Feature 4: rrf_position (weight 0.10)
- Feature 5: cosine_sim_negative (weight -0.15) ✅ (Doc 06 addition)
- [x] Create `app/recommend/diversity.py` - MMR + exploration injection
- MMR with λ=0.6
- 2 serendipitous exploration papers per feed
- [x] Modify `app/routers/recommendations.py` - full 5-step pipeline
- Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
- [x] Tests for reranker + diversity - **13 passed**
- [x] Full test suite - **88 passed** (86 + 2 pre-existing live Qdrant failures resolved)
**Doc 06 correction applied**: Negative EWMA profile wired as Feature 5 with 0.15 penalty.
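
The 5-feature heuristic score is a weighted sum, with the negative-profile similarity acting as a penalty. The weights below are the ones listed in this phase; the assumption that every feature has already been scaled to [0, 1] (for example, recency derived from `paper_age_days`, and `rrf_position` mapped so earlier ranks score higher) is mine, as is the function name.

```python
# Weights from the Phase 2c list; feature scaling to [0, 1] is assumed.
WEIGHTS = {
    "cosine_sim_longterm": 0.40,
    "cosine_sim_shortterm": 0.25,
    "recency": 0.15,
    "rrf_position": 0.10,
    "cosine_sim_negative": -0.15,  # penalty for resembling dismissed papers
}


def heuristic_score(features):
    """Weighted sum over the five features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


liked = {"cosine_sim_longterm": 0.8, "cosine_sim_shortterm": 0.6,
         "recency": 0.5, "rrf_position": 0.9, "cosine_sim_negative": 0.1}
# Same paper, but close to the user's dismissed-paper profile.
disliked = dict(liked, cosine_sim_negative=0.9)
scores = (heuristic_score(liked), heuristic_score(disliked))
```
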
**Gaps**: None. LightGBM model now integrated (Phase 6 ✅).
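
For the diversity step of this phase, a greedy MMR pass with λ=0.6 can be sketched as below. This is an illustration under assumptions: relevance scores and pairwise similarities are passed in precomputed, and `mmr_select` is a hypothetical name rather than the function in `app/recommend/diversity.py`.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.6):
    """Greedy Maximal Marginal Relevance.

    Each round picks the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to anything already picked).
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# "a" and "b" are near-duplicates; "c" is less relevant but different.
relevance = {"a": 0.90, "b": 0.85, "c": 0.50}
pair_sim = {frozenset("ab"): 0.95, frozenset("ac"): 0.10, frozenset("bc"): 0.10}
feed = mmr_select(["a", "b", "c"], relevance,
                  lambda x, y: pair_sim[frozenset((x, y))], k=2)
```

After `a` is picked, `b` scores 0.6·0.85 − 0.4·0.95 = 0.13 while `c` scores 0.6·0.50 − 0.4·0.10 = 0.26, so the less relevant but more novel `c` takes the second slot.
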
---
## Phase 2d: Advanced Models ❌ DEFERRED (Blocked by data/users)
> *These logically belong to the recommendation engine but cannot be built without real user data or scale.*
- [~] LightGBM lambdarank model - requires ≥500 labeled save/dismiss interactions → Phase 6
- [~] Collaborative filtering features - requires ≥500 users → Phase 9
- [~] DPP diversity - explicitly ruled out for v1 by Doc 06 → Phase 9+
- [~] Two-Tower model - requires GPU + large dataset → Phase 9+
---
## Phase 3: Hybrid Semantic Search ✅ COMPLETE
> *Replace the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.*
> *Detailed plan: `docs/phases/PHASE3-Hybrid-Semantic-Search.md`*
> *Prototype reference: `docs/phases/PHASE2-Hybrid-Search-Plan.md`*
> *Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)*
### New files created
- [x] `app/embed_svc.py` - BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570MB, ~15s cold)
- `encode_query(text)` → `(dense: np.ndarray[1024], sparse: dict)`
- LRU cache for repeat queries
- Thread-safe, lazy loading with double-check locking
- [x] `app/zilliz_svc.py` - Zilliz Cloud sparse search client
- Collection: `arxiv_bgem3_sparse`
- Schema: `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR)
- Index: SPARSE_INVERTED_INDEX, metric_type=IP
- Sparse format: `{int_token_id: float_weight}` (BGE-M3 lexical weights, NOT string words)
- `search_sparse(sparse_dict, limit)` → `list[dict]` with arxiv_id + score
- gRPC reconnect handling
- [x] `app/groq_svc.py` - LLM query rewriter (Groq / llama-3.3-70b)
- `rewrite(user_query)` → academic query string
- Graceful fallback to original query on error
- Academic-detection heuristic to skip unnecessary rewrites
- 2s hard timeout
- [x] `app/hybrid_search_svc.py` - search orchestrator
- Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
- Each step has independent failure handling
- Recency reranking: 0.80 RRF + 0.20 recency
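
The fusion and recency steps of the orchestrator can be sketched as below. Only the 0.80/0.20 blend comes from this tracker. The RRF constant `k=60`, the normalization of RRF scores by their maximum, and a recency signal already scaled to [0, 1] are assumptions, and both function names are hypothetical.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every input ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def recency_rerank(rrf_scores, recency, w_rrf=0.80, w_recency=0.20):
    """Blend max-normalized RRF scores with a recency signal in [0, 1]."""
    top = max(rrf_scores.values(), default=0.0) or 1.0
    combined = {
        doc_id: w_rrf * (score / top) + w_recency * recency.get(doc_id, 0.0)
        for doc_id, score in rrf_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


dense_hits = ["A", "B", "C"]    # e.g. Qdrant dense ranking
sparse_hits = ["B", "D", "A"]   # e.g. Zilliz sparse ranking
fused = rrf_fuse([dense_hits, sparse_hits])
ranked = recency_rerank(fused, {"A": 1.0})  # only "A" is a recent paper
```

In this example `B` has the highest fused score, but the recency bonus lifts `A` above it in the final ranking.
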
### Files modified
- [x] `app/config.py` - added `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`, `ENCODE_CACHE_SIZE`, search weights, `APP_PORT`
- [x] `app/qdrant_svc.py` - added `search_dense(dense_vec, limit)` for raw vector search returning scores
- [x] `app/routers/search.py` - swapped `arxiv_svc.search()` → `hybrid_search_svc.search()` with arXiv fallback
- [x] `app/main.py` - added graceful BGE-M3 warm-up to lifespan
- [x] `requirements.txt` - added `FlagEmbedding`, `pymilvus`, `groq`
- [x] `run.py` - configurable port (7860 default for HF Spaces)
### Deployment files created
- [x] `Dockerfile` - HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
- [x] `.dockerignore` - excludes notebooks, PDFs, databases, caches
### Implementation steps completed
- [x] Step 1: BGE-M3 model service (`embed_svc.py`) + unit tests
- [x] Step 2: Zilliz client (`zilliz_svc.py`)
- [x] Step 3: Dense search in Qdrant service
- [x] Step 4: Groq rewriter (`groq_svc.py`)
- [x] Step 5: Hybrid search orchestrator (`hybrid_search_svc.py`)
- [x] Step 6: Swap search router
- [x] Step 7: Model warm-up + deployment config
- [x] Step 8: Tests - **21 new tests passing** (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)
### Test results
- 88 original tests: ✅ All pass (zero regressions)
- 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
- 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
- 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
- **Total: 123 tests passing**
### Latency budget
| Stage | Time |
|---|---|
| LLM rewrite (Groq) | ~300ms (skippable) |
| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
| Qdrant + Zilliz (parallel) | ~300ms |
| RRF + rerank | <5ms |
| **Total (warm)** | **~600ms** |
---
## Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE
> *Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.*
> *Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).*
> *Integrated into codebase and deployed to HF Spaces.*
### Infrastructure
- [x] Turso cloud DB created: `arxiv-data` on `aws-ap-south-1`
- URL: `https://arxiv-data-siddhm11.aws-ap-south-1.turso.io`
- Auth: Platform token + DB auth token (minted via CLI)
- [x] Table: `papers` with columns:
- `arxiv_id` (TEXT, UNIQUE INDEX `idx_papers_arxiv_id`)
- `title` (TEXT)
- `authors` (TEXT)
- `categories` (TEXT)
- `primary_topic` (TEXT)
- `update_date` (TEXT)
- `abstract_preview` (TEXT, truncated to 500 chars)
- `citation_count` (INTEGER, default 0)
- `influential_citations` (INTEGER, default 0)
- [x] Data sources:
- `arxiv_comprehensive_papers.csv` (Kaggle: siddhm11/arxivdata)
- `arxiv_citations_summary.csv` (Kaggle: siddhm11/citation-data-letsgoo)
- Joined on `id` = `arxiv_id_clean`, deduplicated
- [x] Row count verified: local ↔ remote match
- [x] Unique index on `arxiv_id` for fast lookups
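
The `papers` table and its unique index can be reproduced locally for experimentation. The sketch below uses an in-memory SQLite database as a stand-in for Turso (libSQL is SQLite-compatible); the column list and index name follow the list above, while `lookup_papers` and the placeholder row are hypothetical and do not reflect how the deployed service talks to Turso.

```python
import sqlite3

# In-memory SQLite stands in for the Turso (libSQL) database in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (
        arxiv_id TEXT,
        title TEXT,
        authors TEXT,
        categories TEXT,
        primary_topic TEXT,
        update_date TEXT,
        abstract_preview TEXT,
        citation_count INTEGER DEFAULT 0,
        influential_citations INTEGER DEFAULT 0
    );
    CREATE UNIQUE INDEX idx_papers_arxiv_id ON papers(arxiv_id);
    """
)


def lookup_papers(conn, arxiv_ids):
    """Hypothetical helper: batch lookup by arxiv_id via the unique index."""
    if not arxiv_ids:
        return {}
    marks = ",".join("?" for _ in arxiv_ids)
    rows = conn.execute(
        "SELECT arxiv_id, title, citation_count, influential_citations "
        f"FROM papers WHERE arxiv_id IN ({marks})",
        list(arxiv_ids),
    ).fetchall()
    return {
        r[0]: {"title": r[1], "citation_count": r[2], "influential_citations": r[3]}
        for r in rows
    }


# Placeholder row; citation columns fall back to their default of 0.
conn.execute(
    "INSERT INTO papers (arxiv_id, title) VALUES (?, ?)",
    ("0000.00001", "Placeholder title"),
)
found = lookup_papers(conn, ["0000.00001", "0000.00002"])
```

IDs that are missing from the table are simply absent from the result, which is the condition under which a caller would fall back to another metadata source.
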
### Integration (DONE)
- [x] Added `TURSO_URL` and `TURSO_DB_TOKEN` to `config.py` / `.env` / HF Secrets
- [x] Created `app/turso_svc.py` - metadata lookup service
- `fetch_metadata_batch(arxiv_ids)` → `{arxiv_id: paper_dict}`
- Uses Turso HTTP pipeline API (zero new Python deps - just httpx)
- Includes citation_count + influential_citations
- [x] `app/routers/search.py` - Turso primary, arXiv API fallback (only for IDs not in Turso)
- [x] Created `tests/test_turso_timing.py` - timing benchmark
- [x] **Verified**: 10/10 title match, 6.1x end-to-end speedup on HF Spaces
- [x] **Impact**: Avg search time dropped from ~10.7s to ~1.75s on HF Spaces
---
## Phase 4: Recommendation Pipeline Fixes ✅ COMPLETE
> *Fixed the known architectural debt in the recommendation pipeline.*
> *Detailed plan: `docs/phases/PHASE4-Recommendation-Pipeline-Fixes.md`*
### 4.1 - Replace RRF with Importance-Weighted Quota Fusion
- [x] Create `app/recommend/fusion.py` - quota allocation logic
- `w_k = importance_k / sum(importance_k)`
- `slot_k = max(floor(F × w_k), F_min=3)` - every cluster gets at least 3 slots
- Distribute remainder by largest fractional part
- [x] Create `tests/test_fusion.py` - **20 unit tests** for quota allocation
- Proportionality, floor enforcement, total invariant, edge cases, Doc 06 worked examples
- [x] Refactor `_multi_interest_recommend()` in `recommendations.py`
- Replace `multi_interest_search()` with per-cluster separate ANN queries
- Use `asyncio.gather()` for concurrent searches (~15ms wall-clock)
- Allocate feed slots proportionally via `allocate_quotas()`
- Deduplicate across clusters (first-occurrence = highest-ranked cluster wins)
- MMR over merged union (unchanged)
- [x] Keep `qdrant_svc.multi_interest_search()` in codebase (no deletion)
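
The quota rule from 4.1 can be sketched as follows. The proportional floor, the 3-slot minimum, and the largest-fractional-part remainder step follow the formulas above. How to restore the total when the minimum pushes the sum above `F` is not stated in this tracker, so the trimming loop (take slots back from the largest cluster) is an assumption, and this function is not the code in `app/recommend/fusion.py`.

```python
import math


def allocate_quotas(importances, feed_size, min_slots=3):
    """Split feed_size slots across clusters in proportion to importance."""
    total = sum(importances)
    raw = [feed_size * imp / total for imp in importances]   # F * w_k
    slots = [max(math.floor(r), min_slots) for r in raw]

    # Hand out any unassigned slots by largest fractional part.
    leftover = feed_size - sum(slots)
    if leftover > 0:
        by_fraction = sorted(
            range(len(raw)), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True
        )
        for i in by_fraction[:leftover]:
            slots[i] += 1

    # Assumption: if the minimum pushed the total above feed_size, take slots
    # back from the largest cluster that is still above the minimum.
    while sum(slots) > feed_size:
        donors = [i for i, s in enumerate(slots) if s > min_slots]
        if not donors:
            break  # cannot honor both the minimum and the total
        slots[max(donors, key=lambda i: slots[i])] -= 1
    return slots


dominant = allocate_quotas([6, 3, 1], feed_size=20)
even = allocate_quotas([1, 1, 1], feed_size=10)
```

With importances 6:3:1 and a 20-slot feed, the smallest cluster would get only 2 slots by proportion; the minimum raises it to 3 and the dominant cluster gives one slot back, so minority interests stay visible.
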
### 4.2 - Pre-populate Metadata Store ✅ DONE (via Turso)
- [x] Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
- [x] 1.23 GB, includes citation counts from Semantic Scholar
- [x] Wired Turso service into `search.py` (Turso primary, arXiv API fallback)
- [x] arXiv API is now fallback only for genuinely new papers
- [x] **Impact**: Search time dropped from ~10.7s to ~1.75s on HF Spaces
### 4.3 - Hungarian Matching for Cluster Stability
- [x] Add `stabilize_cluster_ids()` function to `clustering.py`
- Uses `scipy.optimize.linear_sum_assignment` (already a dependency)
- Cost matrix: `1 - cosine_sim(new_medoid, old_medoid)` - trivial at K≤7
- Matched clusters keep old indices; new clusters get next available
- Min cosine threshold (0.5) rejects unrelated matches
- [x] Call between `compute_clusters()` and `save_clusters_to_db()` in recommendations.py
- [x] **10 tests** in `test_clustering.py` - perturbed clusters preserve indices,
unrelated match rejection, K growth/shrink, custom thresholds
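
The matching idea in 4.3 can be illustrated without SciPy. Note the swap: the project uses `scipy.optimize.linear_sum_assignment`, whereas this dependency-free sketch finds the same minimum-cost assignment by exhaustive search over permutations, which is only practical because K ≤ 7. The function name, the dict-of-medoids input, and the rule that fresh indices start above every old index are assumptions.

```python
import math
from itertools import permutations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def stabilize_ids(new_medoids, old_medoids, min_cosine=0.5):
    """Return one stable cluster index per new cluster.

    new_medoids: list of vectors; old_medoids: {old_index: vector}.
    """
    old_ids = sorted(old_medoids)
    n, m = len(new_medoids), len(old_ids)
    sim = [[cosine(vec, old_medoids[o]) for o in old_ids] for vec in new_medoids]

    # Brute-force stand-in for linear_sum_assignment on cost = 1 - cosine.
    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(max(n, m)), min(n, m)):
        if n <= m:
            pairs = list(enumerate(perm))                 # new i -> old perm[i]
        else:
            pairs = [(i, j) for j, i in enumerate(perm)]  # old j -> new perm[j]
        cost = sum(1.0 - sim[i][j] for i, j in pairs)
        if cost < best_cost:
            best_pairs, best_cost = pairs, cost

    # Matches below the cosine threshold are treated as unrelated.
    kept = {i: old_ids[j] for i, j in best_pairs if sim[i][j] >= min_cosine}
    next_id = max(old_ids, default=-1) + 1  # assumption: never reuse old indices
    result = []
    for i in range(n):
        if i in kept:
            result.append(kept[i])
        else:
            result.append(next_id)
            next_id += 1
    return result


old = {0: [1.0, 0.0], 1: [0.0, 1.0]}
new = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]  # reordered, plus one new interest
stable = stabilize_ids(new, old)
```

The two perturbed clusters keep their previous indices even though their order changed, and the unrelated third cluster receives a fresh index.
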
### 4.4 - Category-Level Negative Suppression
- [x] Add `get_suppressed_categories()` to `db.py`
- Joins `interactions` + `paper_metadata` to find categories with ≥3 dismissals
- **Primary category only** (decision: avoid over-suppression)
- **14-day window** (standard default, τ_neg = 14 days)
- [x] Add suppression filter in `_multi_interest_recommend()` after reranking
- [x] Cache Turso metadata to `paper_metadata` via `cache_turso_metadata_batch()`
- [x] **8 tests** in `test_db.py` - threshold, partitioning, user isolation, custom threshold
- [~] Per-item short-term decay → **deferred to Phase 6** (LightGBM feature)
**Gaps**: None.
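
The suppression query from 4.4 can be sketched with the standard-library `sqlite3` module. The threshold (≥3 dismissals), the primary-category-only rule, and the 14-day window come from the list above. The column names (`user_id`, `action`, `created_at` as a Unix timestamp, `primary_category`) and the function name are assumptions about the schema, not the actual `get_suppressed_categories()` in `app/db.py`.

```python
import sqlite3

DAY = 86400


def suppressed_categories(conn, user_id, now_ts, threshold=3, window_days=14):
    """Primary categories the user dismissed at least `threshold` times
    within the last `window_days` days (illustrative query)."""
    cutoff = now_ts - window_days * DAY
    rows = conn.execute(
        """
        SELECT m.primary_category
        FROM interactions AS i
        JOIN paper_metadata AS m ON m.arxiv_id = i.arxiv_id
        WHERE i.user_id = ? AND i.action = 'dismiss' AND i.created_at >= ?
        GROUP BY m.primary_category
        HAVING COUNT(*) >= ?
        """,
        (user_id, cutoff, threshold),
    ).fetchall()
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE interactions (user_id TEXT, arxiv_id TEXT, action TEXT, created_at INTEGER);
    CREATE TABLE paper_metadata (arxiv_id TEXT PRIMARY KEY, primary_category TEXT);
    """
)
NOW = 1_700_000_000
conn.executemany(
    "INSERT INTO paper_metadata VALUES (?, ?)",
    [("p1", "cs.CV"), ("p2", "cs.CV"), ("p3", "cs.CV"), ("p4", "cs.CL"), ("p5", "cs.CL")],
)
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?, ?)",
    [
        ("u1", "p1", "dismiss", NOW - 1 * DAY),
        ("u1", "p2", "dismiss", NOW - 2 * DAY),
        ("u1", "p3", "dismiss", NOW - 3 * DAY),
        ("u1", "p4", "dismiss", NOW - 1 * DAY),
        ("u1", "p5", "dismiss", NOW - 20 * DAY),  # outside the 14-day window
        ("u2", "p1", "dismiss", NOW - 1 * DAY),   # a different user
    ],
)
result = suppressed_categories(conn, "u1", NOW)
```
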
---
## Phase 4.5: Instrumentation Foundation ✅ COMPLETE
> *Added telemetry columns to the interactions table so every saved/dismissed paper*
> *can be attributed to its pipeline tier, cluster origin, and ranker version.*
> *Doc 07 (ADR A4) identified this as the single most valuable early investment:*
> *retrofitting these fields after real user data exists is painful and blocks all*
> *later counterfactual evaluation.*
### Schema changes
- [x] Add `ranker_version TEXT` to `interactions` table - pipeline version tag
- [x] Add `candidate_source TEXT` to `interactions` - e.g. `cluster_0`, `exploration`, `ewma_longterm`, `qdrant_recommend`, `short_term_supplement`
- [x] Add `cluster_id INTEGER` to `interactions` - interest cluster index (NULL if N/A)
- [x] ALTER TABLE migration for existing DBs (safe try/except, idempotent)
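
An idempotent migration of this kind can be sketched with `sqlite3` as below. The column names and types are the ones listed above; the base `interactions` schema in the example and the function name `migrate` are assumptions. SQLite reports an already-present column as an `OperationalError` whose message contains "duplicate column name", so the sketch swallows only that case.

```python
import sqlite3

NEW_COLUMNS = [
    ("ranker_version", "TEXT"),
    ("candidate_source", "TEXT"),
    ("cluster_id", "INTEGER"),
]


def migrate(conn):
    """Add the telemetry columns; safe to run on every startup."""
    for name, sql_type in NEW_COLUMNS:
        try:
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {name} {sql_type}")
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not "column already exists",
            # e.g. a missing table, so real problems stay visible.
            if "duplicate column name" not in str(exc):
                raise


conn = sqlite3.connect(":memory:")
# Assumed minimal pre-migration schema, for illustration only.
conn.execute(
    "CREATE TABLE interactions (id INTEGER PRIMARY KEY, user_id TEXT, arxiv_id TEXT, action TEXT)"
)
migrate(conn)
migrate(conn)  # second run is a no-op
columns = [row[1] for row in conn.execute("PRAGMA table_info(interactions)")]
```
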
### Pipeline tagging
- [x] Add `_RANKER_VERSION` constant to `recommendations.py`
- [x] Tag Tier 1 papers with cluster origin, exploration status, short-term supplement
- [x] Tag Tier 2 papers as `ewma_longterm`
- [x] Tag Tier 3 papers as `qdrant_recommend`
- [x] Build `paper_cluster_map` before quota merge (first-occurrence = cluster attribution)
- [x] Exploration papers tagged as `candidate_source='exploration'`
### End-to-end flow
- [x] `recommendations.py` embeds tags in paper dicts
- [x] `action_buttons.html` includes tags in `hx-vals` JSON
- [x] `events.py` accepts `ranker_version`, `candidate_source`, `cluster_id` Form fields
- [x] `db.log_interaction()` stores all three new columns
**Files modified**: `app/db.py`, `app/routers/events.py`, `app/routers/recommendations.py`, `app/templates/partials/action_buttons.html`
**Gaps**: None. `propensity` and `policy_id` fields deferred until ε-greedy exploration (Phase 9).
---
## Phase 5: Cold-Start Onboarding ✅ COMPLETE
> *Onboarding wizard for new users - category selection + seed paper search + trending fallback.*
> *Reference: Doc 06 - "4-37% lift even once behavioral data exists"*
### 5.1 - arXiv Category Multi-Select ✅
- [x] UI screen on first visit: select 1-8 arXiv category groups
- [x] Store selections in SQLite (`user_onboarding` table)
- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
- [x] Does NOT create "subject vectors" - just filters
### 5.2 - Seed Paper Import ✅
- [x] Let users search for and save seed papers during onboarding
- [x] Immediately create EWMA profiles + Ward clusters on next feed request
- [x] Uses hybrid search (Phase 3) for discovery
### ~~5.3 - ORCID / Semantic Scholar Import~~ ❌ REMOVED
> S2 author import was implemented but removed - not the onboarding direction we want.
> Onboarding focuses on category selection + manual seed paper search.
### 5.4 - Popularity Fallback ✅
- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
- [x] 1-hour TTL trending cache for performance
---
## Phase 6: LightGBM Re-ranker ✅ COMPLETE
> *Replaced heuristic scorer with a trained LightGBM lambdarank model.*
> *Unblocked via citation-graph pseudo-labels from Semantic Scholar.*
> *Handoff doc: `docs/PHASE6-HANDOFF.md`*
> *Model repo: [siddhm11/researchit-reranker-phase6](https://huggingface.co/siddhm11/researchit-reranker-phase6)*
### 6.1 - ML Intern: Data Pipeline + Model Training ✅
- [x] Export 1.6M arXiv IDs from Turso → `arxiv_ids.txt` (`scripts/export_arxiv_ids.py`)
- [x] Fetch 242K citation edges from Semantic Scholar Batch API (`01_fetch_citation_edges.py`)
- [x] Generate 98K training triples with pseudo-labels: cited=2, co-cited=1, negative=0 (`02_generate_training_triples.py`)
- [x] 37-feature schema (20 content, 11 user behavior, 6 cross-features)
- [x] Train LightGBM LambdaRank model: 141 trees, 63 leaves, lr=0.05 (`03_train_lightgbm.py`)
- [x] nDCG@10 = 0.879 (+233% vs heuristic baseline)
- [x] All artifacts pushed to HuggingFace
### 6.2 - Opus: Integration into ResearchIT ✅
- [x] Rewrite `app/recommend/reranker.py` - 5 features → 37 features
- [x] LightGBM model loading at import time with heuristic fallback
- [x] Multi-path model file search (env var → relative → absolute)
- [x] Backward-compatible `rerank_candidates()` signature (old callers unaffected)
- [x] Add `lightgbm>=4.0,<5.0` to `requirements.txt`
- [x] Fix CRLF→LF line endings in model file (Windows Git issue)
- [x] 7 integration tests - **all passing** (`tests/test_reranker_integration.py`)
- [x] Latency verified: **0.223ms per 100 candidates** (target: <1ms) ✅
### 6.3 - Antigravity: Feature Wiring + Deployment Verification ✅
- [x] Wire all 37 features into `recommendations.py` caller (was legacy 6-arg signature)
- [x] Per-candidate `cluster_importance` (N,) from `paper_cluster_map`
- [x] Per-candidate `cluster_medoid` (N, 1024) per source cluster
- [x] Pre-computed `is_suppressed_category` and `onboarding_category_match` arrays
- [x] Pass `qdrant_scores`, `user_total_saves`, `user_total_dismissals`
- [x] `reranker.py` supports both scalar broadcast and per-candidate arrays
- [x] Add model accessors: `is_model_loaded()`, `get_num_trees()`, `get_loaded_model_path()`
- [x] Add per-request feature activation logging
- [x] Create `GET /healthz/reranker` endpoint (`app/routers/health.py`)
- [x] Bug B fix: persist `medoid_embedding_blob` BLOB in `user_clusters` table
- [x] Bug B fix: fall back to persisted blob instead of zero vector in Hungarian matching
- [x] DB migration: `ALTER TABLE user_clusters ADD COLUMN medoid_embedding_blob BLOB`
- [x] 9 new tests - **all passing** (`tests/test_phase6_feature_wiring.py`)
- [x] Full suite: **203+ tests passing, 0 failures**
- [x] Updated `CLAUDE.md`, `PHASE6-HANDOFF.md`, `README.md`
### 6.4 - Retraining [~] DEFERRED
> **Phase 6.4 retraining is deferred.** The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23–30 set to zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. **Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.**
- [~] Synthetic user simulator (`scripts/simulate_users.py`) - target: +30d
- [~] Real-user retrain at 100-user threshold - target: +90d or threshold
- [~] HF model card backfill (library_name, pipeline_tag, metrics, schema)
---
## Phase 6.5: Instrumentation ✅ COMPLETE
> **Purpose**: Stabilize the recommendation pipeline and prepare telemetry substrate for Phase 7 evaluation.
### A1 - Real Qdrant cosine scores
- [x] Switch `search_by_vector()` → `search_by_vector_with_scores()` in per-cluster + short-term searches
- [x] Build `qdrant_score_map` from real cosines (replaces fake `1.0 - rank*0.01` linear decay)
- [x] Feature 0 (`qdrant_cosine_score`) now receives actual cosine similarities
### A2 β€” Deployment verification
- [x] `curl /healthz/reranker` → `model_loaded=true, n_trees=141, fallback_active=false`
- [x] Verification timestamp added to `PHASE6-Reranker-Framing.md`
### B1 - query_id linkage
- [x] Generate `query_id` (UUID) once per feed request in `get_recommendations()`
- [x] Thread through all 4 tiers: trending, Tier 1, Tier 2, Tier 3
- [x] Generate `query_id` in `search.py` per search request
- [x] Add `query_id` + `position` to `action_buttons.html` hx-vals
### B2 - Propensity logging
- [x] Add `propensity REAL` + `policy_id TEXT` migration to `interactions` table
- [x] Extend `db.log_interaction()` with propensity + policy_id params
- [x] Compute propensity: 1.0 (deterministic) vs `n_explore/pool_size` (exploration)
- [x] Thread through templates + `events.py` Form params
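
The propensity rule in B2 can be sketched as a small function. The two cases (1.0 for deterministic slots, `n_explore/pool_size` for exploration slots) come from the list above; the function name, the use of `candidate_source == "exploration"` as the switch, and the example pool size of 50 are assumptions.

```python
def compute_propensity(candidate_source, n_explore, pool_size):
    """Probability that the logging policy showed this paper (illustrative)."""
    if candidate_source != "exploration":
        return 1.0  # deterministic ranking: the paper was always going to be shown
    if pool_size <= 0:
        raise ValueError("exploration requires a non-empty candidate pool")
    return n_explore / pool_size


# Two exploration slots drawn from a hypothetical pool of 50 candidates.
logged = {
    "cluster_0": compute_propensity("cluster_0", n_explore=2, pool_size=50),
    "exploration": compute_propensity("exploration", n_explore=2, pool_size=50),
}
```

Logging this value with each interaction is what later allows inverse-propensity weighting in counterfactual evaluation.
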
### B3 - Cluster snapshot versioning
- [x] Add `cluster_snapshots` table (append-only, content-addressed via `paper_ids_hash`)
- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
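
A content-addressed `paper_ids_hash` can be sketched as below. The tracker only states that snapshots are content-addressed through this column; the choice of SHA-256, the newline separator, and sorting the IDs so that order does not matter are assumptions for illustration.

```python
import hashlib


def paper_ids_hash(paper_ids):
    """Order-insensitive digest of a cluster's member paper IDs (illustrative)."""
    canonical = "\n".join(sorted(paper_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder IDs: the same members in a different order give the same key.
h1 = paper_ids_hash(["paper_b", "paper_a", "paper_c"])
h2 = paper_ids_hash(["paper_c", "paper_a", "paper_b"])
h3 = paper_ids_hash(["paper_a", "paper_b"])
```

Because the key depends only on membership, an unchanged cluster maps to the same hash across reclusterings, which lets an append-only table detect that nothing changed.
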
### ~~B4 - S2 author import~~ ❌ REMOVED
> S2 author import was implemented and then removed - not the onboarding direction we want.
> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
> have all been deleted. Onboarding uses category selection + manual seed search only.
### Documentation
- [x] `CLAUDE.md`: Rule 3.11 - interaction instrumentation invariants
- [x] `_RANKER_VERSION` bumped to `v6.5_lightgbm_real_cosines`
- [x] Phase status updated to 6.5 COMPLETE
- [x] Tests: 203+ passing
### Test suite
- `tests/test_reranker_integration.py` - 7 tests (smoke, features, heuristic, E2E, latency, backward compat, comparison)
- `tests/test_phase6_feature_wiring.py` - 9 tests (per-candidate arrays, broadcast medoid, model accessors, aggregate activation)
- `tests/demo_reranker.py` - interactive demo with 20 realistic papers
---
## Phase 7: Evaluation Framework 📋 NOT STARTED
> *Build offline and online evaluation before scaling users.*
> *Estimated effort: ~1 week*
- [ ] Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
- [ ] Time-split evaluation on unarXive 2022 + S2ORC
- [ ] Online metrics (once users exist): CTR, save rate, dwell time, return rate
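
As a reference point for the offline metrics listed above, nDCG@10 over graded labels can be sketched as follows. The graded labels mirror the pseudo-label scheme from Phase 6 (cited=2, co-cited=1, negative=0); the exponential gain `2^rel - 1` and the `log2` discount are a common convention and are an assumption here, since the evaluation framework is not yet built.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(relevances, k) / ideal


perfect = ndcg_at_k([2, 1, 0])   # best items first
inverted = ndcg_at_k([0, 1, 2])  # best item ranked last
```
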
---
## Phase 8: LLM Interest Summaries + Distilled Re-ranker 📋 NOT STARTED
> *Estimated effort: ~10-12 weeks (Doc 07)*
> *Detailed research plan: `docs/research/07-LLM-Summaries-Reranker-and-Scaling-Research.md`*
> *Entry criteria: Phase 7 eval producing stable nDCG@10; cluster stability Jaccard ≥0.7 over 7 days*
### 8a - Claude-generated per-cluster interest summaries (Doc 07 §A)
- [ ] Cluster snapshot versioning (ADR A1)
- [ ] Content-addressed caching: `sha256(sorted(paper_ids) + prompt_version + model)`
- [ ] Shared summaries (not per-user) - Haiku 4.5 + Batch API (~$50-80/month @ 1K users)
- [ ] Nightly regeneration job with 7-day TTL + event-triggered refresh
- [ ] "You're reading about X" UI framing with sub-theme bullets
- [ ] Anthropic Citations API for hallucination prevention
### 8b - Distilled cross-encoder reranker (Doc 07 §B)
- [ ] Deploy `cross-encoder/ms-marco-TinyBERT-L-2-v2` INT8 ONNX as MVP
- [ ] 6ms budget for 20 pairs on CPU (AVX-512 VNNI)
- [ ] TinyBERT score as LightGBM feature (Option C architecture)
- [ ] Custom distillation from BGE-reranker-v2-m3 only if held-out gap >3 nDCG
- [ ] MarginMSE loss + SciNCL citation-graph hard negatives
### 8c - Use-cases and information-gain design doc (Doc 07 §C)
- [ ] 8 user personas (P1 cold-start through P8 stay-current)
- [ ] Information-gain table (save=3-5×, dismiss-as-label=−3-4×, passive skip=−0.1×)
- [ ] Mode-switching UI: "Stay Current" vs "Lit Review" toggle
- [ ] Failure mode detection rules (feed collapse, stale profile, filter bubble)
---
## Phase 9: Exploration + Collaborative Filtering 📋 NOT STARTED
> *Blocked by: ≥500 users*
- [ ] Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
- [ ] LightFM hybrid CF model with switching strategy
- [ ] Category-level negative suppression
- [ ] Retrain LightGBM with dismissals as negative labels
---
## Appendix: Infrastructure Status
| Component | Status | Details |
|---|---|---|
| **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
| **Zilliz Cloud** | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
| **Turso (libSQL)** | ✅ Live | 1.23 GB arXiv metadata + citations, `arxiv-data` DB, `papers` table, unique index on `arxiv_id` |
| **SQLite** | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters |
| **HF Spaces** | ✅ Deployed | Docker SDK, free tier, port 7860 - https://siddhm11-researchit.hf.space |
| **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
| **arXiv API** | ✅ Fallback only | Keyword search + metadata for papers not in Turso |
| **BGE-M3 Model** | ✅ Live | Pre-baked in Docker image, warm-up at startup |
| **Groq API** | ✅ Live + HF Secret | `app/groq_svc.py` - 2s timeout, academic heuristic skip |
| **Notebooks** | ✅ Organized | `notebooks/` - 01-upload, 02-test, 03-search-benchmark |
### Credentials Status
| Credential | Status | Env Var | Notes |
|---|---|---|---|
| **Qdrant Cloud** | ✅ In `.env` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
| **Zilliz Cloud** | ✅ In `.env` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3, wired |
| **Turso (libSQL)** | ✅ In `.env` + HF | `TURSO_URL`, `TURSO_DB_TOKEN` | Phase 3.5, wired + deployed |
| **Groq** | ✅ In `.env` + HF | `GROQ_API_KEY` | Phase 3, wired + deployed |
| **HF Spaces** | ✅ Deployed | Secrets panel | All env vars set ✔ |
---
## Appendix: Test Suite
| Test File | Count | Status | Notes |
|---|---|---|---|
| `tests/test_profiles.py` | 11 | ✅ Passing | |
| `tests/test_clustering.py` | 21 | ✅ Passing | 9 compute + 10 Hungarian + 2 persistence |
| `tests/test_reranker_diversity.py` | 13 | ✅ Passing | |
| `tests/test_reranker_integration.py` | 7 | ✅ Passing | Phase 6: smoke, features, E2E, latency |
| `tests/test_phase6_feature_wiring.py` | 9 | ✅ Passing | Phase 6.3: per-candidate arrays, medoids, accessors |
| `tests/test_fusion.py` | 20 | ✅ Passing | Phase 4.1 |
| `tests/test_db.py` | 19 | ✅ Passing | Includes 4 Turso cache + 8 suppression |
| `tests/test_qdrant_svc.py` | - | ✅ Passing | |
| `tests/test_arxiv_svc.py` | - | ✅ Passing | |
| `tests/test_integration.py` | - | ✅ Passing | Includes quota pipeline E2E |
| `tests/test_user_state.py` | - | ✅ Passing | |
| `tests/test_saved.py` | - | ✅ Passing | |
| `tests/test_hybrid_search.py` | 21 | ✅ Passing | |
| `tests/test_search_router.py` | 6 | ✅ Passing | |
| `tests/test_live_search.py` | 8 | ✅ Passing | |
| **Total** | **203+** | ✅ | |
| `test_e2e_recs.py` (standalone) | 1 | ✅ E2E simulation | |
---
## Appendix: Doc 06 Corrections β€” Tracking
| Correction | Status | Where |
|---|---|---|
| α_long 0.10 → 0.03 | ✅ Applied | `app/recommend/profiles.py:30` |
| L2-normalize before Ward clustering | ✅ Applied | `app/recommend/clustering.py` |
| Medoid not centroid | ✅ Applied | `app/recommend/clustering.py` → `_find_medoid()` |
| Negative EWMA wired into reranking | ✅ Applied | `app/recommend/reranker.py` → Feature 5 |
| RRF → quota fusion for recommendations | ✅ Applied | `app/recommend/fusion.py` (Phase 4.1) |
| Hungarian cluster matching | ✅ Applied | `app/recommend/clustering.py` → `stabilize_cluster_ids()` (Phase 4.3) |
| Per-item short-term negative decay | [!] Backlog | Phase 6 (LightGBM feature) |
| Category-level suppression | ✅ Applied | `app/db.py` → `get_suppressed_categories()` (Phase 4.4) |
| BGE-reranker NEVER in hot path | ✅ Followed | Heuristic scorer used instead |