# ResearchIT — Master Task Tracker

> **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.  
> **Last updated**: 2026-05-05  
> **Current phase**: Phase 6.5 (Instrumentation) — COMPLETE ✔ | Phase 7 next  

---

## Legend

- `[x]` — Done
- `[/]` — In progress
- `[ ]` — Not started
- `[~]` — Intentionally deferred (blocked by data/users/scale)
- `[!]` — Backlog item (documented, not yet coded)

---

## Phase 1: Zero-ML Recommender ✅ COMPLETE

> *Built the foundation: Qdrant connection, arXiv search, save/dismiss, cookie identity, HTMX frontend.*

- [x] Qdrant Cloud connection (1.6M BGE-M3 papers, BQ, HNSW m=32)
  - Collection: `arxiv_bgem3_dense`, 1024-dim dense vectors
  - File: `app/qdrant_svc.py` → `_get_client()`
- [x] BEST_SCORE Recommend API (raw paper IDs → Qdrant)
  - File: `app/qdrant_svc.py` → `recommend()`
- [x] arXiv keyword API search (placeholder — replaced in Phase 3)
  - File: `app/arxiv_svc.py` → `search()`
- [x] arXiv metadata fetching + SQLite cache
  - File: `app/arxiv_svc.py` → `fetch_metadata_batch()`
- [x] SQLite database schema (interactions, paper_metadata)
  - File: `app/db.py` → `init_db()`
  - WAL mode, async via aiosqlite
- [x] Cookie-based user identity
  - File: `app/config.py` → `COOKIE_NAME`
- [x] User state management (positive/negative deques)
  - File: `app/user_state.py` → `UserState`
- [x] Save/Dismiss event logging
  - File: `app/routers/events.py`
- [x] HTMX + Jinja2 frontend (search, recs, save, dismiss)
  - Files: `app/templates/` (base.html, index.html, search.html, saved.html, partials/)
- [x] Test suite — **55 tests passing**

**Gaps**: None.

---

## Phase 2a: EWMA Profile Embeddings ✅ COMPLETE

> *Replaced raw ID-list approach with temporal decay vectors so recent interests outweigh old ones.*

- [x] Create `app/recommend/` module with `__init__.py`
- [x] Create `app/recommend/profiles.py` — EWMA computation + storage
  - Long-term: α=0.03 ✅ (corrected from 0.10 per Doc 06)
  - Short-term: α=0.40
  - Negative: α=0.15
  - All embeddings L2-normalized
- [x] Modify `app/db.py` — add `user_profiles` table + `user_clusters` table
- [x] Modify `app/qdrant_svc.py` — add `get_paper_vectors()` and `search_by_vector()`
- [x] Modify `app/routers/events.py` — trigger EWMA updates on save/dismiss
- [x] Modify `app/routers/recommendations.py` — EWMA vector search with Tier 2 fallback
- [x] Add `numpy` + `scipy` to `requirements.txt`
- [x] Tests for profiles module — **11 passed**
- [x] Full test suite — no regressions
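
A minimal sketch of the EWMA update listed above, assuming plain Python lists in place of NumPy arrays. The α values come from the list; the names `l2_normalize` and `ewma_update` are illustrative and are not taken from `app/recommend/profiles.py`.

```python
import math

ALPHA_LONG = 0.03   # long-term profile
ALPHA_SHORT = 0.40  # short-term profile
ALPHA_NEG = 0.15    # negative (dismiss) profile


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def ewma_update(profile, paper_vec, alpha):
    """Blend one paper embedding into a profile, then re-normalize.

    `profile` is None before the first interaction, in which case the
    normalized paper vector becomes the profile.
    """
    paper_vec = l2_normalize(paper_vec)
    if profile is None:
        return paper_vec
    blended = [(1.0 - alpha) * p + alpha * v for p, v in zip(profile, paper_vec)]
    return l2_normalize(blended)
```

With α=0.03 a single save barely moves the long-term profile, while the same save at α=0.40 pulls the short-term profile much further toward the new paper.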

**Doc 06 correction applied**: α_long 0.10 → 0.03 (PinnerSage rejected 0.10 as too recent-biased).

**Gaps**: None.

---

## Phase 2b: Ward Clustering + Multi-Interest Retrieval ✅ COMPLETE

> *Detect distinct user interests via hierarchical clustering, retrieve candidates per interest.*

- [x] Create `app/recommend/clustering.py` — Ward clustering + medoid extraction
  - L2-normalize embeddings before Ward ✅ (Doc 06 correction)
  - Adaptive gap-based threshold (no fixed K)
  - Medoid representation (real papers, not centroids) ✅
  - Dynamic K (1–7 clusters, auto-determined)
  - Recency-weighted importance scores
- [x] Modify `app/qdrant_svc.py` — add `multi_interest_search()` with prefetch+RRF
- [x] Modify `app/routers/recommendations.py` — 3-tier cascading pipeline
  - Tier 1 (≥5 saves): Multi-interest clustering → prefetch + RRF
  - Tier 2 (≥3 saves): EWMA long-term vector → single ANN search
  - Tier 3 (≥1 save): Qdrant BEST_SCORE Recommend API
- [x] Tests for clustering module — **10 passed**
- [x] Full test suite — no regressions
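
The tier thresholds above can be sketched as a simple cascade. This is an illustration only: `run_cascade` and the tier names are hypothetical, and falling through to the next tier on an empty or failed result is an assumption about how the cascade behaves.

```python
def run_cascade(n_saves, tiers):
    """Try eligible tiers from most to least demanding.

    `tiers` is a list of (min_saves, name, fetch) tuples, ordered from
    Tier 1 down to Tier 3. Returns (tier_name, results) for the first
    tier that yields a non-empty result, else (None, []).
    """
    for min_saves, name, fetch in tiers:
        if n_saves < min_saves:
            continue  # user does not have enough saves for this tier
        try:
            results = fetch()
        except Exception:
            results = []  # assumed behaviour: a failing tier falls through
        if results:
            return name, results
    return None, []
```

A caller would pass the three tiers with thresholds 5, 3 and 1, each wrapping the matching retrieval call.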

**Doc 06 corrections applied**: L2-normalization before Ward, medoid not centroid.

**Gaps (deferred to Phase 4)**:
- [!] RRF → quota fusion (dominant clusters can swamp minority interests)
- [!] Hungarian matching for cluster ID stability across reclusterings

---

## Phase 2c: Heuristic Re-ranking + MMR Diversity ✅ COMPLETE

> *Added scoring and diversity layers on top of retrieval to produce the final feed.*

- [x] Create `app/recommend/reranker.py` — 5-feature heuristic scorer
  - Feature 1: cosine_sim_longterm (weight 0.40)
  - Feature 2: cosine_sim_shortterm (weight 0.25)
  - Feature 3: paper_age_days / recency (weight 0.15)
  - Feature 4: rrf_position (weight 0.10)
  - Feature 5: cosine_sim_negative (weight -0.15) ✅ (Doc 06 addition)
- [x] Create `app/recommend/diversity.py` — MMR + exploration injection
  - MMR with λ=0.6
  - 2 serendipitous exploration papers per feed
- [x] Modify `app/routers/recommendations.py` — full 5-step pipeline
  - Step 1: Clustering → Step 2: Retrieval → Step 3: Rerank → Step 4: MMR → Step 5: Exploration
- [x] Tests for reranker + diversity — **13 passed**
- [x] Full test suite — **88 passed** (86 + 2 pre-existing live Qdrant failures resolved)
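
As a rough illustration of the 5-feature scorer, the weights above can be applied as a plain weighted sum. The sketch assumes every feature has already been scaled to [0, 1] (how recency and RRF position are normalized is not stated here), and `heuristic_score` / `rank_candidates` are hypothetical names.

```python
# Weights copied from the feature list above; the negative-profile
# similarity acts as a penalty.
WEIGHTS = {
    "cosine_sim_longterm": 0.40,
    "cosine_sim_shortterm": 0.25,
    "recency": 0.15,
    "rrf_position": 0.10,
    "cosine_sim_negative": -0.15,
}


def heuristic_score(features):
    """Weighted sum over the five features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def rank_candidates(candidates):
    """Sort candidate dicts (each with a 'features' dict) by descending score."""
    return sorted(candidates, key=lambda c: heuristic_score(c["features"]), reverse=True)
```

A paper that closely matches the long-term profile can still drop in rank if it also resembles recently dismissed papers.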

**Doc 06 correction applied**: Negative EWMA profile wired as Feature 5 with 0.15 penalty.

**Gaps**: None. LightGBM model now integrated (Phase 6 ✅).
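
For reference, a minimal sketch of greedy MMR selection with λ=0.6. It assumes each candidate carries a relevance score and an embedding; `mmr_select` is an illustrative name, and the exploration-injection step is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def mmr_select(candidates, k, lam=0.6):
    """Greedy MMR over (paper_id, relevance, vector) tuples.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to already-selected items.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_value(c):
            redundancy = max((cosine(c[2], s[2]) for s in selected), default=0.0)
            return lam * c[1] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return [c[0] for c in selected]
```

With two near-duplicate high-relevance papers and one unrelated paper, the second slot goes to the unrelated paper because the duplicate is penalized for redundancy.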

---

## Phase 2d: Advanced Models ❌ DEFERRED (Blocked by data/users)

> *These logically belong to the recommendation engine but cannot be built without real user data or scale.*

- [x] LightGBM lambdarank model (originally blocked on ≥500 labeled save/dismiss interactions) → delivered in Phase 6 via citation-graph pseudo-labels
- [~] Collaborative filtering features — requires ≥500 users → Phase 9
- [~] DPP diversity — explicitly ruled out for v1 by Doc 06 → Phase 9+
- [~] Two-Tower model — requires GPU + large dataset → Phase 9+

---

## Phase 3: Hybrid Semantic Search ✅ COMPLETE

> *Replace the arXiv keyword API placeholder with real vector-based semantic search using Qdrant dense + Zilliz sparse + RRF.*  
> *Detailed plan: `docs/phases/PHASE3-Hybrid-Semantic-Search.md`*  
> *Prototype reference: `docs/phases/PHASE2-Hybrid-Search-Plan.md`*  
> *Deployment target: Hugging Face Spaces (Docker SDK, 16GB RAM, 2 vCPUs)*

### New files created
- [x] `app/embed_svc.py` — BGE-M3 model singleton (load BAAI/bge-m3 once at startup, ~570MB, ~15s cold)
  - `encode_query(text)` → `(dense: np.ndarray[1024], sparse: dict)`
  - LRU cache for repeat queries
  - Thread-safe, lazy loading with double-check locking
- [x] `app/zilliz_svc.py` — Zilliz Cloud sparse search client
  - Collection: `arxiv_bgem3_sparse`
  - Schema: `id` (INT64 auto PK), `arxiv_id` (VARCHAR), `sparse_vector` (SPARSE_FLOAT_VECTOR)
  - Index: SPARSE_INVERTED_INDEX, metric_type=IP
  - Sparse format: `{int_token_id: float_weight}` (BGE-M3 lexical weights, NOT string words)
  - `search_sparse(sparse_dict, limit)` → `list[dict]` with arxiv_id + score
  - gRPC reconnect handling
- [x] `app/groq_svc.py` — LLM query rewriter (Groq / llama-3.3-70b)
  - `rewrite(user_query)` → academic query string
  - Graceful fallback to original query on error
  - Academic-detection heuristic to skip unnecessary rewrites
  - 2s hard timeout
- [x] `app/hybrid_search_svc.py` — search orchestrator
  - Rewrite → Encode → Parallel (Qdrant dense + Zilliz sparse) → RRF → Rerank
  - Each step has independent failure handling
  - Recency reranking: 0.80 RRF + 0.20 recency
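
The fusion and recency steps of the orchestrator can be sketched as follows. The sketch assumes the conventional RRF constant k=60, assumes RRF scores are scaled by their maximum before blending, and takes recency as a precomputed value in [0, 1]; none of these details are confirmed by `hybrid_search_svc.py`, and both function names are illustrative.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    return scores


def blend_with_recency(rrf_scores, recency, w_rrf=0.80, w_recency=0.20):
    """Blend max-scaled RRF scores with recency and return IDs, best first."""
    top = max(rrf_scores.values(), default=0.0)
    blended = {}
    for paper_id, score in rrf_scores.items():
        scaled = score / top if top > 0 else 0.0
        blended[paper_id] = w_rrf * scaled + w_recency * recency.get(paper_id, 0.0)
    return sorted(blended, key=blended.get, reverse=True)
```

Here `rankings` would hold the Qdrant dense ID list and the Zilliz sparse ID list; a paper ranked highly by both accumulates the largest fused score.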

### Files modified
- [x] `app/config.py` — added `ZILLIZ_URI`, `ZILLIZ_TOKEN`, `ZILLIZ_COLLECTION`, `GROQ_API_KEY`, `BGE_M3_MODEL`, `BGE_M3_DEVICE`, `ENCODE_CACHE_SIZE`, search weights, `APP_PORT`
- [x] `app/qdrant_svc.py` — added `search_dense(dense_vec, limit)` for raw vector search returning scores
- [x] `app/routers/search.py` — swapped `arxiv_svc.search()` → `hybrid_search_svc.search()` with arXiv fallback
- [x] `app/main.py` — added graceful BGE-M3 warm-up to lifespan
- [x] `requirements.txt` — added `FlagEmbedding`, `pymilvus`, `groq`
- [x] `run.py` — configurable port (7860 default for HF Spaces)

### Deployment files created
- [x] `Dockerfile` — HF Spaces Docker SDK, CPU-only PyTorch, pre-baked BGE-M3 model
- [x] `.dockerignore` — excludes notebooks, PDFs, databases, caches

### Implementation steps completed
- [x] Step 1: BGE-M3 model service (`embed_svc.py`) + unit tests
- [x] Step 2: Zilliz client (`zilliz_svc.py`)
- [x] Step 3: Dense search in Qdrant service
- [x] Step 4: Groq rewriter (`groq_svc.py`)
- [x] Step 5: Hybrid search orchestrator (`hybrid_search_svc.py`)
- [x] Step 6: Swap search router
- [x] Step 7: Model warm-up + deployment config
- [x] Step 8: Tests — **21 new tests passing** (RRF, recency, Groq heuristics, embed edge cases, orchestrator mocks)

### Test results
- 88 original tests: ✅ All pass (zero regressions)
- 21 Phase 3 unit tests: ✅ All pass (RRF, recency, Groq, embed, orchestrator mocks)
- 6 search router tests: ✅ All pass (ranking, fallback, HTMX, saved state)
- 8 live service tests: ✅ All pass (Qdrant dense, Zilliz sparse, Groq rewrite, parallel)
- **Total: 123 tests passing**

### Latency budget
| Stage | Time |
|---|---|
| LLM rewrite (Groq) | ~300ms (skippable) |
| BGE-M3 encode (CPU) | ~300ms first, ~0ms cached |
| Qdrant + Zilliz (parallel) | ~300ms |
| RRF + rerank | <5ms |
| **Total (warm)** | **~600ms** |

---

## Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE

> *Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.*  
> *Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).*  
> *Integrated into codebase and deployed to HF Spaces.*

### Infrastructure
- [x] Turso cloud DB created: `arxiv-data` on `aws-ap-south-1`
  - URL: `https://arxiv-data-siddhm11.aws-ap-south-1.turso.io`
  - Auth: Platform token + DB auth token (minted via CLI)
- [x] Table: `papers` with columns:
  - `arxiv_id` (TEXT, UNIQUE INDEX `idx_papers_arxiv_id`)
  - `title` (TEXT)
  - `authors` (TEXT)
  - `categories` (TEXT)
  - `primary_topic` (TEXT)
  - `update_date` (TEXT)
  - `abstract_preview` (TEXT, truncated to 500 chars)
  - `citation_count` (INTEGER, default 0)
  - `influential_citations` (INTEGER, default 0)
- [x] Data sources:
  - `arxiv_comprehensive_papers.csv` (Kaggle: siddhm11/arxivdata)
  - `arxiv_citations_summary.csv` (Kaggle: siddhm11/citation-data-letsgoo)
  - Joined on `id` = `arxiv_id_clean`, deduplicated
- [x] Row count verified: local ↔ remote match
- [x] Unique index on `arxiv_id` for fast lookups

### Integration (DONE)
- [x] Added `TURSO_URL` and `TURSO_DB_TOKEN` to `config.py` / `.env` / HF Secrets
- [x] Created `app/turso_svc.py` — metadata lookup service
  - `fetch_metadata_batch(arxiv_ids)` → `{arxiv_id: paper_dict}`
  - Uses Turso HTTP pipeline API (zero new Python deps — just httpx)
  - Includes citation_count + influential_citations
- [x] `app/routers/search.py` — Turso primary, arXiv API fallback (only for IDs not in Turso)
- [x] Created `tests/test_turso_timing.py` — timing benchmark
- [x] **Verified**: 10/10 title match, 6.1x end-to-end speedup on HF Spaces
- [x] **Impact**: Avg search time dropped from ~10.7s to ~1.75s on HF Spaces

---

## Phase 4: Recommendation Pipeline Fixes ✅ COMPLETE

> *Fixed the known architectural debt in the recommendation pipeline.*  
> *Detailed plan: `docs/phases/PHASE4-Recommendation-Pipeline-Fixes.md`*

### 4.1 — Replace RRF with Importance-Weighted Quota Fusion
- [x] Create `app/recommend/fusion.py` — quota allocation logic
  - `w_k = importance_k / sum_j(importance_j)`
  - `slot_k = max(floor(F × w_k), F_min)` with `F_min = 3`, so every cluster gets at least 3 slots
  - Distribute remainder by largest fractional part
- [x] Create `tests/test_fusion.py` — **20 unit tests** for quota allocation
  - Proportionality, floor enforcement, total invariant, edge cases, Doc 06 worked examples
- [x] Refactor `_multi_interest_recommend()` in `recommendations.py`
  - Replace `multi_interest_search()` with per-cluster separate ANN queries
  - Use `asyncio.gather()` for concurrent searches (~15ms wall-clock)
  - Allocate feed slots proportionally via `allocate_quotas()`
  - Deduplicate across clusters (first-occurrence = highest-ranked cluster wins)
  - MMR over merged union (unchanged)
- [x] Keep `qdrant_svc.multi_interest_search()` in codebase (no deletion)
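
A sketch of the quota rule above, under stated assumptions. The tracker does not say what happens when the per-cluster floor pushes the total past the feed size, so this version takes the excess back from the largest clusters, and it lowers the floor when the feed is too small for every cluster to get 3 slots. The body is illustrative and not the code in `app/recommend/fusion.py`.

```python
import math


def allocate_quotas(importances, feed_size, f_min=3):
    """Split feed_size slots across clusters in proportion to importance."""
    k = len(importances)
    if k == 0:
        return []
    total = sum(importances)
    if total > 0:
        exact = [feed_size * imp / total for imp in importances]
    else:
        exact = [feed_size / k] * k
    f_min = min(f_min, feed_size // k)  # assumed: relax the floor if it cannot fit
    slots = [max(math.floor(e), f_min) for e in exact]

    # Leftover slots go to the largest fractional parts first.
    by_fraction = sorted(range(k), key=lambda i: exact[i] - math.floor(exact[i]), reverse=True)
    i = 0
    while sum(slots) < feed_size:
        slots[by_fraction[i % k]] += 1
        i += 1

    # Assumed handling of overshoot caused by the floor: trim the largest cluster.
    while sum(slots) > feed_size:
        donor = max((j for j in range(k) if slots[j] > f_min), key=lambda j: slots[j])
        slots[donor] -= 1
    return slots
```

For importances `[14, 1, 1]` and a 16-slot feed, the two minority clusters are raised to 3 slots each and the dominant cluster ends up with the remaining 10.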

### 4.2 — Pre-populate Metadata Store ✅ DONE (via Turso)
- [x] Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
- [x] 1.23 GB, includes citation counts from Semantic Scholar
- [x] Wired Turso service into `search.py` (Turso primary, arXiv API fallback)
- [x] arXiv API is now fallback only for genuinely new papers
- [x] **Impact**: Search time dropped from ~10.7s to ~1.75s on HF Spaces

### 4.3 — Hungarian Matching for Cluster Stability
- [x] Add `stabilize_cluster_ids()` function to `clustering.py`
  - Uses `scipy.optimize.linear_sum_assignment` (already a dependency)
  - Cost matrix: `1 - cosine_sim(new_medoid, old_medoid)` — trivial at K≤7
  - Matched clusters keep old indices; new clusters get next available
  - Min cosine threshold (0.5) rejects unrelated matches
- [x] Call between `compute_clusters()` and `save_clusters_to_db()` in recommendations.py
- [x] **10 tests** in `test_clustering.py` — perturbed clusters preserve indices,
  unrelated match rejection, K growth/shrink, custom thresholds
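
To illustrate the matching rule without extra dependencies, the sketch below replaces `scipy.optimize.linear_sum_assignment` with a brute-force search over permutations, which finds the same minimum-cost assignment and is affordable at K≤7. Issuing fresh IDs above the largest old ID, rather than reusing retired ones, is an assumption; the function body is not the project's implementation.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def stabilize_cluster_ids(new_medoids, old_medoids, min_cosine=0.5):
    """Return one cluster ID per new medoid, reusing old IDs where they match.

    `old_medoids` maps old cluster ID -> medoid vector.
    """
    old_ids = sorted(old_medoids)
    n = max(len(new_medoids), len(old_ids))
    # Square cost matrix; padded cells cost 1.0 and never count as a match.
    cost = [[1.0] * n for _ in range(n)]
    for i, new_vec in enumerate(new_medoids):
        for j, old_id in enumerate(old_ids):
            cost[i][j] = 1.0 - cosine(new_vec, old_medoids[old_id])

    # Brute-force stand-in for the Hungarian algorithm (n! permutations).
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )

    next_id = max(old_ids, default=-1) + 1
    result = []
    for i in range(len(new_medoids)):
        j = best[i]
        if j < len(old_ids) and 1.0 - cost[i][j] >= min_cosine:
            result.append(old_ids[j])  # matched: keep the old index
        else:
            result.append(next_id)     # unmatched or below threshold: new index
            next_id += 1
    return result
```

Two slightly perturbed medoids keep their previous IDs even if the clustering returns them in a different order, while an unrelated third cluster receives a new ID.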

### 4.4 — Category-Level Negative Suppression
- [x] Add `get_suppressed_categories()` to `db.py`
  - Joins `interactions` + `paper_metadata` to find categories with ≥3 dismissals
  - **Primary category only** (decision: avoid over-suppression)
  - **14-day window** (standard default, τ_neg = 14 days)
- [x] Add suppression filter in `_multi_interest_recommend()` after reranking
- [x] Cache Turso metadata to `paper_metadata` via `cache_turso_metadata_batch()`
- [x] **8 tests** in `test_db.py` — threshold, partitioning, user isolation, custom threshold
- [~] Per-item short-term decay → **deferred to Phase 6** (LightGBM feature)
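
The suppression query might look roughly like the sketch below. The column names (`user_id`, `action`, `created_at`, `primary_category`) are assumptions about the schema, and the example uses the synchronous `sqlite3` module for brevity even though the app accesses SQLite asynchronously.

```python
import sqlite3


def get_suppressed_categories(conn, user_id, threshold=3, window_days=14):
    """Primary categories the user dismissed at least `threshold` times recently."""
    rows = conn.execute(
        """
        SELECT m.primary_category
        FROM interactions AS i
        JOIN paper_metadata AS m ON m.arxiv_id = i.arxiv_id
        WHERE i.user_id = ?
          AND i.action = 'dismiss'
          AND i.created_at >= datetime('now', ?)
        GROUP BY m.primary_category
        HAVING COUNT(*) >= ?
        """,
        (user_id, f"-{int(window_days)} days", threshold),
    ).fetchall()
    return {row[0] for row in rows}
```

Dismissals older than the window, and dismissals by other users, do not count toward the threshold.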

**Gaps**: None.

---

## Phase 4.5: Instrumentation Foundation ✅ COMPLETE

> *Added telemetry columns to the interactions table so every saved/dismissed paper*
> *can be attributed to its pipeline tier, cluster origin, and ranker version.*
> *Doc 07 (ADR A4) identified this as the single most valuable early investment —*
> *retrofitting these fields after real user data exists is painful and blocks all*
> *later counterfactual evaluation.*

### Schema changes
- [x] Add `ranker_version TEXT` to `interactions` table — pipeline version tag
- [x] Add `candidate_source TEXT` to `interactions` — e.g. `cluster_0`, `exploration`, `ewma_longterm`, `qdrant_recommend`, `short_term_supplement`
- [x] Add `cluster_id INTEGER` to `interactions` — interest cluster index (NULL if N/A)
- [x] ALTER TABLE migration for existing DBs (safe try/except, idempotent)
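
One way to make such a migration idempotent is to attempt each `ALTER TABLE` and ignore only the "duplicate column" error. This is a sketch using the synchronous `sqlite3` module; `migrate_interactions` is a hypothetical name, and the real `init_db()` may be structured differently.

```python
import sqlite3

# Columns added in this phase, as listed above.
NEW_COLUMNS = [
    ("ranker_version", "TEXT"),
    ("candidate_source", "TEXT"),
    ("cluster_id", "INTEGER"),
]


def migrate_interactions(conn):
    """Add the telemetry columns; safe to run on every startup."""
    for name, sql_type in NEW_COLUMNS:
        try:
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {name} {sql_type}")
        except sqlite3.OperationalError as exc:
            # SQLite reports an existing column as "duplicate column name: ...".
            if "duplicate column name" not in str(exc).lower():
                raise
    conn.commit()
```

Re-raising any other `OperationalError` keeps genuine failures, such as a missing table, visible.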

### Pipeline tagging
- [x] Add `_RANKER_VERSION` constant to `recommendations.py`
- [x] Tag Tier 1 papers with cluster origin, exploration status, short-term supplement
- [x] Tag Tier 2 papers as `ewma_longterm`
- [x] Tag Tier 3 papers as `qdrant_recommend`
- [x] Build `paper_cluster_map` before quota merge (first-occurrence = cluster attribution)
- [x] Exploration papers tagged as `candidate_source='exploration'`

### End-to-end flow
- [x] `recommendations.py` embeds tags in paper dicts
- [x] `action_buttons.html` includes tags in `hx-vals` JSON
- [x] `events.py` accepts `ranker_version`, `candidate_source`, `cluster_id` Form fields
- [x] `db.log_interaction()` stores all three new columns

**Files modified**: `app/db.py`, `app/routers/events.py`, `app/routers/recommendations.py`, `app/templates/partials/action_buttons.html`

**Gaps**: None. `propensity` and `policy_id` fields were deferred at this stage and later added in Phase 6.5 (B2).

---

## Phase 5: Cold-Start Onboarding ✅ COMPLETE

> *Onboarding wizard for new users — category selection + seed paper search + trending fallback.*  
> *Reference: Doc 06 — "4-37% lift even once behavioral data exists"*

### 5.1 — arXiv Category Multi-Select ✅
- [x] UI screen on first visit: select 1-8 arXiv category groups
- [x] Store selections in SQLite (`user_onboarding` table)
- [x] Use as pool filter for recommendations (via `get_user_category_filter()`)
- [x] Preserve as LightGBM feature permanently (Feature 26: `onboarding_category_match`)
- [x] Does NOT create "subject vectors" — just filters

### 5.2 — Seed Paper Import ✅
- [x] Let users search for and save seed papers during onboarding
- [x] Immediately create EWMA profiles + Ward clusters on next feed request
- [x] Uses hybrid search (Phase 3) for discovery

### ~~5.3 — ORCID / Semantic Scholar Import~~ ❌ REMOVED
> S2 author import was implemented but removed — not the onboarding direction we want.
> Onboarding focuses on category selection + manual seed paper search.

### 5.4 — Popularity Fallback ✅
- [x] Category-filtered trending papers served via `turso_svc.fetch_trending_by_categories()`
- [x] 1-hour TTL trending cache for performance
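
A 1-hour TTL cache can be as small as the sketch below. `TTLCache` is a hypothetical helper, not the cache used in `turso_svc`; the injectable clock exists only to make expiry testable.

```python
import time


class TTLCache:
    """Cache values per key and recompute them once they are older than the TTL."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value
```

Keying on a sorted tuple of the selected categories lets users with the same selection share one cached trending list.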

---

## Phase 6: LightGBM Re-ranker ✅ COMPLETE

> *Replaced heuristic scorer with a trained LightGBM lambdarank model.*  
> *Unblocked via citation-graph pseudo-labels from Semantic Scholar.*  
> *Handoff doc: `docs/PHASE6-HANDOFF.md`*  
> *Model repo: [siddhm11/researchit-reranker-phase6](https://huggingface.co/siddhm11/researchit-reranker-phase6)*

### 6.1 — ML Intern: Data Pipeline + Model Training ✅
- [x] Export 1.6M arXiv IDs from Turso → `arxiv_ids.txt` (`scripts/export_arxiv_ids.py`)
- [x] Fetch 242K citation edges from Semantic Scholar Batch API (`01_fetch_citation_edges.py`)
- [x] Generate 98K training triples with pseudo-labels: cited=2, co-cited=1, negative=0 (`02_generate_training_triples.py`)
- [x] 37-feature schema (20 content, 11 user behavior, 6 cross-features)
- [x] Train LightGBM LambdaRank model: 141 trees, 63 leaves, lr=0.05 (`03_train_lightgbm.py`)
- [x] nDCG@10 = 0.879 (+233% vs heuristic baseline)
- [x] All artifacts pushed to HuggingFace

### 6.2 — Opus: Integration into ResearchIT ✅
- [x] Rewrite `app/recommend/reranker.py` — 5 features → 37 features
- [x] LightGBM model loading at import time with heuristic fallback
- [x] Multi-path model file search (env var → relative → absolute)
- [x] Backward-compatible `rerank_candidates()` signature (old callers unaffected)
- [x] Add `lightgbm>=4.0,<5.0` to `requirements.txt`
- [x] Fix CRLF→LF line endings in model file (Windows Git issue)
- [x] 7 integration tests — **all passing** (`tests/test_reranker_integration.py`)
- [x] Latency verified: **0.223ms per 100 candidates** (target: <1ms) ✅

### 6.3 — Antigravity: Feature Wiring + Deployment Verification ✅
- [x] Wire all 37 features into `recommendations.py` caller (was legacy 6-arg signature)
- [x] Per-candidate `cluster_importance` (N,) from `paper_cluster_map`
- [x] Per-candidate `cluster_medoid` (N, 1024) per source cluster
- [x] Pre-computed `is_suppressed_category` and `onboarding_category_match` arrays
- [x] Pass `qdrant_scores`, `user_total_saves`, `user_total_dismissals`
- [x] `reranker.py` supports both scalar broadcast and per-candidate arrays
- [x] Add model accessors: `is_model_loaded()`, `get_num_trees()`, `get_loaded_model_path()`
- [x] Add per-request feature activation logging
- [x] Create `GET /healthz/reranker` endpoint (`app/routers/health.py`)
- [x] Bug B fix: persist `medoid_embedding_blob` BLOB in `user_clusters` table
- [x] Bug B fix: fall back to persisted blob instead of zero vector in Hungarian matching
- [x] DB migration: `ALTER TABLE user_clusters ADD COLUMN medoid_embedding_blob BLOB`
- [x] 9 new tests — **all passing** (`tests/test_phase6_feature_wiring.py`)
- [x] Full suite: **203+ tests passing, 0 failures**
- [x] Updated `CLAUDE.md`, `PHASE6-HANDOFF.md`, `README.md`

### 6.4 — Retraining [~] DEFERRED
> **Phase 6.4 retraining is deferred.** The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23–30 set to zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) reaching 100 real users with ≥10 saves each. **Until then, the Phase 6.1 + 6.2 + 6.3 plumbing is the unit of deliverable work.**

- [~] Synthetic user simulator (`scripts/simulate_users.py`) — target: +30d
- [~] Real-user retrain at 100-user threshold — target: +90d or threshold
- [~] HF model card backfill (library_name, pipeline_tag, metrics, schema)

---

## Phase 6.5: Instrumentation ✅ COMPLETE

> **Purpose**: Stabilize the recommendation pipeline and prepare telemetry substrate for Phase 7 evaluation.

### A1 — Real Qdrant cosine scores
- [x] Switch `search_by_vector()` → `search_by_vector_with_scores()` in per-cluster + short-term searches
- [x] Build `qdrant_score_map` from real cosines (replaces fake `1.0 - rank*0.01` linear decay)
- [x] Feature 0 (`qdrant_cosine_score`) now receives actual cosine similarities

### A2 — Deployment verification
- [x] `curl /healthz/reranker` → `model_loaded=true, n_trees=141, fallback_active=false`
- [x] Verification timestamp added to `PHASE6-Reranker-Framing.md`

### B1 — query_id linkage
- [x] Generate `query_id` (UUID) once per feed request in `get_recommendations()`
- [x] Thread through all 4 tiers: trending, Tier 1, Tier 2, Tier 3
- [x] Generate `query_id` in `search.py` per search request
- [x] Add `query_id` + `position` to `action_buttons.html` hx-vals

### B2 — Propensity logging
- [x] Add `propensity REAL` + `policy_id TEXT` migration to `interactions` table
- [x] Extend `db.log_interaction()` with propensity + policy_id params
- [x] Compute propensity: 1.0 (deterministic) vs `n_explore/pool_size` (exploration)
- [x] Thread through templates + `events.py` Form params
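
The propensity rule above reduces to a few lines. In this sketch the `'exploration'` tag is the `candidate_source` value from Phase 4.5, while the function name and the error handling are illustrative assumptions.

```python
def compute_propensity(candidate_source, n_explore, pool_size):
    """Probability that the logging policy showed this paper.

    Deterministically ranked papers get 1.0; exploration papers get the
    chance of being drawn from the exploration pool.
    """
    if candidate_source != "exploration":
        return 1.0
    if pool_size <= 0:
        raise ValueError("pool_size must be positive for exploration papers")
    return n_explore / pool_size
```

Logging this value with each interaction is what later allows inverse-propensity weighting in offline evaluation.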

### B3 — Cluster snapshot versioning
- [x] Add `cluster_snapshots` table (append-only, content-addressed via `paper_ids_hash`)
- [x] `save_cluster_snapshot()` called after each `save_clusters_to_db()`
- [x] `prune_old_snapshots(30)` on startup in `main.py` lifespan
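
Content addressing can be sketched as hashing a canonical form of the cluster's paper IDs. Using SHA-256 over the sorted, de-duplicated IDs is an assumption about how `paper_ids_hash` is derived, and `is_new_snapshot` is a hypothetical helper.

```python
import hashlib


def paper_ids_hash(paper_ids):
    """Order-independent hash of a cluster's membership."""
    canonical = "\n".join(sorted(set(paper_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_new_snapshot(previous_hash, paper_ids):
    """True when membership changed since the last stored snapshot."""
    return paper_ids_hash(paper_ids) != previous_hash
```

Because the hash ignores ordering, reclustering that returns the same papers in a different order maps to the same snapshot key.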

### ~~B4 — S2 author import~~ ❌ REMOVED
> S2 author import was implemented and then removed — not the onboarding direction we want.
> `app/s2_svc.py`, the `/api/onboarding/import-author` endpoint, and the quick-import UI
> have all been deleted. Onboarding uses category selection + manual seed search only.

### Documentation
- [x] `CLAUDE.md`: Rule 3.11 — interaction instrumentation invariants
- [x] `_RANKER_VERSION` bumped to `v6.5_lightgbm_real_cosines`
- [x] Phase status updated to 6.5 COMPLETE
- [x] Tests: 203+ passing

### Test suite
- `tests/test_reranker_integration.py` — 7 tests (smoke, features, heuristic, E2E, latency, backward compat, comparison)
- `tests/test_phase6_feature_wiring.py` — 9 tests (per-candidate arrays, broadcast medoid, model accessors, aggregate activation)
- `tests/demo_reranker.py` — interactive demo with 20 realistic papers

---

## Phase 7: Evaluation Framework 📋 NOT STARTED

> *Build offline and online evaluation before scaling users.*  
> *Estimated effort: ~1 week*

- [ ] Offline metrics: nDCG@10, Recall@50, HR@10, ILS, category entropy
- [ ] Time-split evaluation on unarXive 2022 + S2ORC
- [ ] Online metrics (once users exist): CTR, save rate, dwell time, return rate
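
For orientation, three of the offline metrics can be sketched with the standard library alone. The sketch uses the common exponential-gain form of DCG and computes the ideal ranking from the returned list only, which assumes every relevant paper appears somewhere in that list; the function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain with (2^gain - 1) / log2(position + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k for graded labels listed in ranked order (e.g. 2, 1, 0)."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=50):
    """Fraction of relevant papers found in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def hit_rate_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant paper appears in the top k, else 0.0."""
    return 1.0 if any(pid in relevant_ids for pid in ranked_ids[:k]) else 0.0
```

The graded labels line up with the Phase 6 pseudo-labels (cited=2, co-cited=1, negative=0), so the same function can score the existing training triples.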

---

## Phase 8: LLM Interest Summaries + Distilled Re-ranker 📋 NOT STARTED

> *Estimated effort: ~10-12 weeks (Doc 07)*  
> *Detailed research plan: `docs/research/07-LLM-Summaries-Reranker-and-Scaling-Research.md`*  
> *Entry criteria: Phase 7 eval producing stable nDCG@10; cluster stability Jaccard ≥0.7 over 7 days*

### 8a — Claude-generated per-cluster interest summaries (Doc 07 §A)
- [x] Cluster snapshot versioning (ADR A1), delivered early in Phase 6.5 (B3)
- [ ] Content-addressed caching: `sha256(sorted(paper_ids) + prompt_version + model)`
- [ ] Shared summaries (not per-user) — Haiku 4.5 + Batch API (~$50-80/month @ 1K users)
- [ ] Nightly regeneration job with 7-day TTL + event-triggered refresh
- [ ] "You're reading about X" UI framing with sub-theme bullets
- [ ] Anthropic Citations API for hallucination prevention

### 8b — Distilled cross-encoder reranker (Doc 07 §B)
- [ ] Deploy `cross-encoder/ms-marco-TinyBERT-L-2-v2` INT8 ONNX as MVP
- [ ] 6ms budget for 20 pairs on CPU (AVX-512 VNNI)
- [ ] TinyBERT score as LightGBM feature (Option C architecture)
- [ ] Custom distillation from BGE-reranker-v2-m3 only if held-out gap >3 nDCG
- [ ] MarginMSE loss + SciNCL citation-graph hard negatives

### 8c — Use-cases and information-gain design doc (Doc 07 §C)
- [ ] 8 user personas (P1 cold-start through P8 stay-current)
- [ ] Information-gain table (save=3-5×, dismiss-as-label=−3-4×, passive skip=−0.1×)
- [ ] Mode-switching UI: "Stay Current" vs "Lit Review" toggle
- [ ] Failure mode detection rules (feed collapse, stale profile, filter bubble)

---

## Phase 9: Exploration + Collaborative Filtering 📋 NOT STARTED

> *Blocked by: ≥500 users*

- [ ] Epsilon-greedy exploration (ε=0.25 new users, ε=0.05 established)
- [ ] LightFM hybrid CF model with switching strategy
- [x] Category-level negative suppression (delivered early in Phase 4.4)
- [ ] Retrain LightGBM with dismissals as negative labels
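
Since this phase is not started, the following is only one possible shape for per-slot ε-greedy exploration. The ε values come from the list above; the 20-save cutoff for an "established" user and both function names are hypothetical.

```python
import random


def epsilon_for(n_saves, established_after=20):
    """ε=0.25 for new users, ε=0.05 once established (cutoff is hypothetical)."""
    return 0.05 if n_saves >= established_after else 0.25


def build_feed(ranked, explore_pool, feed_size, epsilon, rng):
    """Fill each slot with an exploration paper with probability epsilon.

    Returns (paper_id, source) pairs so the source can be logged.
    """
    ranked_set = set(ranked)
    exploit = list(ranked)
    explore = [p for p in explore_pool if p not in ranked_set]
    feed = []
    while len(feed) < feed_size and (exploit or explore):
        use_explore = bool(explore) and (not exploit or rng.random() < epsilon)
        if use_explore:
            feed.append((explore.pop(rng.randrange(len(explore))), "exploration"))
        else:
            feed.append((exploit.pop(0), "ranked"))
    return feed
```

Passing a seeded `random.Random` instance keeps feeds reproducible in tests, and the per-slot source tag fits the existing `candidate_source` column.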

---

## Appendix: Infrastructure Status

| Component | Status | Details |
|---|---|---|
| **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
| **Zilliz Cloud** | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
| **Turso (libSQL)** | ✅ Live | 1.23 GB arXiv metadata + citations, `arxiv-data` DB, `papers` table, unique index on `arxiv_id` |
| **SQLite** | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters, user_onboarding, cluster_snapshots |
| **HF Spaces** | ✅ Deployed | Docker SDK, free tier, port 7860 — https://siddhm11-researchit.hf.space |
| **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
| **arXiv API** | ✅ Fallback only | Keyword search + metadata for papers not in Turso |
| **BGE-M3 Model** | ✅ Live | Pre-baked in Docker image, warm-up at startup |
| **Groq API** | ✅ Live + HF Secret | `app/groq_svc.py` — 2s timeout, academic heuristic skip |
| **Notebooks** | ✅ Organized | `notebooks/` — 01-upload, 02-test, 03-search-benchmark |

### Credentials Status

| Credential | Status | Env Var | Notes |
|---|---|---|---|
| **Qdrant Cloud** | ✅ In `.env` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
| **Zilliz Cloud** | ✅ In `.env` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3, wired |
| **Turso (libSQL)** | ✅ In `.env` + HF | `TURSO_URL`, `TURSO_DB_TOKEN` | Phase 3.5, wired + deployed |
| **Groq** | ✅ In `.env` + HF | `GROQ_API_KEY` | Phase 3, wired + deployed |
| **HF Spaces** | ✅ Deployed | Secrets panel | All env vars set ✔ |

---

## Appendix: Test Suite

| Test File | Count | Status | Notes |
|---|---|---|---|
| `tests/test_profiles.py` | 11 | ✅ Passing |
| `tests/test_clustering.py` | 21 | ✅ Passing | (9 compute + 10 Hungarian + 2 persistence) |
| `tests/test_reranker_diversity.py` | 13 | ✅ Passing |
| `tests/test_reranker_integration.py` | 7 | ✅ Passing | (Phase 6: smoke, features, E2E, latency) |
| `tests/test_phase6_feature_wiring.py` | 9 | ✅ Passing | (Phase 6.3: per-candidate arrays, medoids, accessors) |
| `tests/test_fusion.py` | 20 | ✅ Passing | (Phase 4.1) |
| `tests/test_db.py` | 19 | ✅ Passing | (includes 4 Turso cache + 8 suppression) |
| `tests/test_qdrant_svc.py` | — | ✅ Passing |
| `tests/test_arxiv_svc.py` | — | ✅ Passing |
| `tests/test_integration.py` | — | ✅ Passing | (includes quota pipeline E2E) |
| `tests/test_user_state.py` | — | ✅ Passing |
| `tests/test_saved.py` | — | ✅ Passing |
| `tests/test_hybrid_search.py` | 21 | ✅ Passing |
| `tests/test_search_router.py` | 6 | ✅ Passing |
| `tests/test_live_search.py` | 8 | ✅ Passing |
| **Total** | **203+** | ✅ |
| `test_e2e_recs.py` (standalone) | 1 | ✅ E2E simulation |

---

## Appendix: Doc 06 Corrections — Tracking

| Correction | Status | Where |
|---|---|---|
| α_long 0.10 → 0.03 | ✅ Applied | `app/recommend/profiles.py:30` |
| L2-normalize before Ward clustering | ✅ Applied | `app/recommend/clustering.py` |
| Medoid not centroid | ✅ Applied | `app/recommend/clustering.py` → `_find_medoid()` |
| Negative EWMA wired into reranking | ✅ Applied | `app/recommend/reranker.py` → Feature 5 |
| RRF → quota fusion for recommendations | ✅ Applied | `app/recommend/fusion.py` (Phase 4.1) |
| Hungarian cluster matching | ✅ Applied | `app/recommend/clustering.py` → `stabilize_cluster_ids()` (Phase 4.3) |
| Per-item short-term negative decay | [!] Backlog | Phase 6 (LightGBM feature) |
| Category-level suppression | ✅ Applied | `app/db.py` → `get_suppressed_categories()` (Phase 4.4) |
| BGE-reranker NEVER in hot path | ✅ Followed | LightGBM re-ranker (heuristic scorer as fallback) used instead |