# PHASE 6: Reranker Framing Document

> **File:** `docs/phases/PHASE6-Reranker-Framing.md`
> **Project:** ResearchIT (arXiv paper recommendation engine)
> **Author:** Amin (siddhm11)
> **Status:** Phase 6: In Integration (post-audit framing)
> **Supersedes:** the open items from `PHASE6-Audit.md`
> **Live Space:** `https://siddhm11-researchit.hf.space` (HF Spaces, Docker SDK, RUNNING, cpu-basic)
> **Reranker model repo:** `https://huggingface.co/siddhm11/researchit-reranker-phase6`
> **Training data repo:** `https://huggingface.co/datasets/siddhm11/researchit-reranker-data`

---

## TL;DR (for Amin's future self at 1AM)

1. **The HF model exists and is public, but the model card is empty** (no `pipeline_tag`, no `library_name`, no `description`, no metrics block). The reproducibility story currently lives on your laptop, not in the repo. The dataset repo `siddhm11/researchit-reranker-data` does exist as a public Parquet dataset (size_categories: 100K–1M, downloads: 30); that is the only piece of the training pipeline that is *partially* public.
2. **The model was NOT trained on real user signal.** It was trained on **citation pseudo-labels** (Semantic Scholar citation edges → triples → LightGBM LambdaRank). Features 23–30 (cluster importance, suppressed-category, onboarding match, user save/dismiss counts) were **zero during training** because no users existed, so zeroes at serving time are *consistent with training*, not a regression. The 0.879 nDCG@10 number is honest under that distribution; it is not honest as a measure of "what the user feels."
3. **Phase 6 is therefore about plumbing, not retraining.** Land Phase 6.1 (dominant-cluster shortcut, ~1 day), then Phase 6.2 (per-candidate `paper_cluster_map` plumbing, ~3–5 days), then Phase 6.3 (deployment verification + `/healthz/reranker`). Defer retraining (Phase 6.4) until either (a) you ship synthetic-user simulation, or (b) you reach ~100 real users with ≥10 saves each.
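For reference when reading the 0.879 figure, here is a minimal sketch of how nDCG@10 is commonly computed for one ranked list. The function names are illustrative, and the exponential gain `2^rel - 1` is an assumption: this document does not say which gain formulation the training script used.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so +2
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: 1 = citation pseudo-positive, 0 = negative.
ranked_labels = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked_labels, k=10)
```

The metric only says how well the ranking agrees with the labels it is given, which is why a high value on citation pseudo-labels does not transfer automatically to real user preference.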
---

## Part A - Phase 6 Status Verification (HF inspection)

### A.1 What is publicly visible on Hugging Face

I queried the HF Hub API directly (the public web pages were not reachable from this environment, but the structured API endpoints were). Here is the *complete* observable state of your HF account as of **2026-05-02**:

| Repo | Type | Visibility | Created | Last modified | Notable |
|---|---|---|---|---|---|
| `siddhm11/researchit-reranker-phase6` | model | public, not gated | 2026-04-26 | 2026-04-27 21:51 | **README empty / no description, no `pipeline_tag`, no `library_name`, no metrics card, no datasets link** |
| `siddhm11/researchit-reranker-data` | dataset | public, not gated | 2026-04-27 16:41 | 2026-04-27 21:51 | Parquet, `100K= seven_days_ago ) user_dismissals_last_7d = sum( 1 for n in state.negatives if n.timestamp >= seven_days_ago ) # --- END move --- ```

#### C.2.2 Add a helper to align Qdrant scores with `valid_ids`

`per_cluster_results` is currently a `list[list[ScoredPoint]]` (one list per cluster). After dedup + filter into `valid_ids`, we lose the raw retrieval score. We need to project it back.

```python
# app/recommend/fusion.py (or app/routers/recommendations.py if you keep helpers local)
import numpy as np  # needed by align_qdrant_scores below


def build_qdrant_score_map(
    per_cluster_results: list[list["ScoredPoint"]],
) -> dict[str, float]:
    """
    Collapse all per-cluster ScoredPoint lists into a single
    paper_id -> max_score dict.

    If a paper appeared in multiple clusters' top-K, we keep the maximum
    score (most charitable to the candidate; matches the dedup semantics
    in merge_quota_results).
    """
    out: dict[str, float] = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)  # arxiv IDs ALWAYS strings (CLAUDE.md rule 7)
            score = float(hit.score)
            if pid not in out or score > out[pid]:
                out[pid] = score
    return out


def align_qdrant_scores(
    valid_ids: list[str],
    score_map: dict[str, float],
) -> np.ndarray:
    """Return a float32 array of qdrant_scores aligned 1:1 with valid_ids.

    Missing entries default to 0.0 (which matches train-time behavior for
    candidates injected by exploration paths)."""
    return np.asarray(
        [score_map.get(pid, 0.0) for pid in valid_ids],
        dtype=np.float32,
    )
```

#### C.2.3 Compute the dominant-cluster scalars

```python
# Dominant cluster: highest-importance cluster in user state.
# clusters is the list[Cluster] you already have in scope.
if clusters:
    dominant = max(clusters, key=lambda c: c.importance)
    dominant_importance = float(dominant.importance)
    dominant_medoid_vec = np.asarray(
        dominant.medoid_embedding, dtype=np.float32
    )
else:
    # Cold-start path: no clusters -> defensible defaults.
    dominant_importance = 0.0
    dominant_medoid_vec = np.zeros(1024, dtype=np.float32)
```

#### C.2.4 The new `rerank_candidates` call

Replace the existing call (around line 305 of `recommendations.py`) with:

```python
# --- BEGIN Phase 6.1 rerank call ---
qdrant_score_map = build_qdrant_score_map(per_cluster_results)
qdrant_scores = align_qdrant_scores(valid_ids, qdrant_score_map)

# Per-candidate boolean: is this candidate's primary_topic suppressed?
is_suppressed_category = np.asarray(
    [
        1.0 if (m.get("primary_topic") in suppressed) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

# Per-candidate boolean: does this candidate match any onboarding category?
onboarding_category_match = np.asarray(
    [
        1.0 if (m.get("primary_topic") in onboarding_categories) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    # Phase 6 additions (6.1: dominant-cluster shortcut)
    qdrant_scores=qdrant_scores,
    cluster_importance=np.full(
        len(valid_ids), dominant_importance, dtype=np.float32
    ),
    cluster_medoid=dominant_medoid_vec,  # broadcast in reranker
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
# --- END Phase 6.1 rerank call ---
```

#### C.2.5 Reranker signature change

**File:** `app/recommend/reranker.py`

```python
def rerank_candidates(
    *,
    candidate_ids: list[str],
    candidate_embeddings: np.ndarray,  # (N, 1024)
    candidate_metadata: list[dict],
    long_term_vec: np.ndarray | None,
    short_term_vec: np.ndarray | None,
    negative_vec: np.ndarray | None,
    # Phase 6.1 additions: all optional with safe zero-defaults
    # so the legacy callers keep working during the migration window.
    qdrant_scores: np.ndarray | None = None,
    cluster_importance: np.ndarray | None = None,  # (N,) or scalar broadcast
    cluster_medoid: np.ndarray | None = None,  # (1024,) for 6.1, (N, 1024) for 6.2
    is_suppressed_category: np.ndarray | None = None,
    onboarding_category_match: np.ndarray | None = None,
    user_total_saves: int = 0,
    user_total_dismissals: int = 0,
    user_saves_last_7d: int = 0,
    user_dismissals_last_7d: int = 0,
) -> tuple[list[str], list[float], np.ndarray]:
    ...
```

Inside the feature-matrix builder, fill slots 23–30 from these new args, falling back to zero if `None` (preserves backward-compat for any unit tests that still call the legacy path).

### C.3 The integration test (must be added)

**File:** `tests/recommend/test_phase6_feature_matrix.py`

```python
import numpy as np
import pytest

from app.recommend.reranker import _build_feature_matrix  # expose for testing


def test_phase6_feature_matrix_is_not_mostly_zero(fake_user_state, fake_candidates):
    """
    Regression guard for Bug A. After Phase 6.1, features 23-30 must carry
    signal for at least one candidate in a typical request.
    """
    X = _build_feature_matrix(
        candidate_ids=fake_candidates.ids,
        candidate_embeddings=fake_candidates.embs,
        candidate_metadata=fake_candidates.meta,
        long_term_vec=fake_user_state.lt_vec,
        short_term_vec=fake_user_state.st_vec,
        negative_vec=fake_user_state.neg_vec,
        qdrant_scores=fake_candidates.qdrant_scores,  # non-trivial
        cluster_importance=np.full(len(fake_candidates.ids), 0.42),
        cluster_medoid=fake_user_state.dominant_medoid,
        is_suppressed_category=np.array([0, 0, 1, 0, 0], dtype=np.float32),
        onboarding_category_match=np.array([1, 0, 1, 0, 1], dtype=np.float32),
        user_total_saves=12,
        user_total_dismissals=3,
        user_saves_last_7d=2,
        user_dismissals_last_7d=1,
    )

    assert X.shape[1] == 37, f"Feature schema drifted: got {X.shape[1]} cols"

    # Per-feature non-zero rate. Slots 23..30 must each be >0 somewhere.
    nonzero_rate = (X != 0).mean(axis=0)
    for slot in range(23, 31):
        assert nonzero_rate[slot] > 0.0, (
            f"Feature {slot} is all zeros: Phase 6.1 plumbing regression"
        )

    # Aggregate sanity: at least 60% of feature slots should be active.
    assert (nonzero_rate > 0).mean() >= 0.6
```

### C.4 What 6.1 fixes vs leaves open

| Bug | Status after 6.1 |
|---|---|
| A: caller uses legacy 6-arg signature | **Fixed** |
| B: Hungarian zero-vector fallback | Untouched (orthogonal; addressed separately, see Part E.4) |
| C: model deployment unverified | Untouched (Phase 6.3) |
| D: train/serve consistency | Improved on slot 23–30 *non-zeroness*, but **the trained model's slot-23 weight is near-zero**, so 6.1 will not move nDCG by much. That's expected. |

---

## Part D - Phase 6.2: The Full Plumbing (3–5 days)

### D.1 The data-structure change: `paper_cluster_map`

The architectural rule we are honoring: *a paper retrieved for "Cluster A: ML systems" should be scored for its fit to **that** cluster, not to the user's dominant interest.* Phase 6.1 violates this for slots 23 and 24 by broadcasting the dominant cluster's importance/medoid to every candidate. Phase 6.2 fixes it by attaching the **source cluster index** to each retrieved candidate and threading that through to the reranker.

**Add to `app/recommend/types.py` (or wherever `Cluster` lives):**

```python
# Source-of-truth mapping: candidate paper_id -> index into clusters[].
# Built during per-cluster Qdrant retrieval, propagated through merge_quota,
# consumed by the reranker.
PaperClusterMap = dict[str, int]
```

If a candidate appears in *multiple* cluster top-Ks (which can happen with importance-weighted quota and a permissive K_max=7), the convention is **first-write-wins by importance order**: when iterating clusters in descending importance, the first cluster to surface a candidate "owns" it. This matches the heuristic that a paper that the user's *strongest* interest cluster also pulls is more naturally explained by that cluster.

### D.2 `app/recommend/fusion.py`: `merge_quota_results`

Currently `merge_quota_results` takes `per_cluster_results: list[list[ScoredPoint]]` and returns `list[str]` (deduped IDs).
Change it to also return the cluster mapping.

```python
def merge_quota_results(
    per_cluster_results: list[list["ScoredPoint"]],
    clusters: list[Cluster],
    floor_per_cluster: int = 3,  # F_min from CLAUDE.md
) -> tuple[list[str], PaperClusterMap]:
    """
    Importance-weighted quota merge.

    Returns:
      - merged_ids: deduped list of arxiv IDs (str)
      - paper_cluster_map: candidate_id -> source cluster index

    Convention: when a candidate appears in multiple clusters, the cluster
    with HIGHER importance wins. Iterate clusters sorted descending by
    importance to make first-write-wins do the right thing.
    """
    paper_cluster_map: PaperClusterMap = {}
    merged_ids: list[str] = []
    seen: set[str] = set()

    # Stable sort by importance descending; preserve original index for lookup.
    order = sorted(
        range(len(clusters)),
        key=lambda i: clusters[i].importance,
        reverse=True,
    )

    for cluster_idx in order:
        hits = per_cluster_results[cluster_idx]
        # Apply F_min floor: take at least floor_per_cluster (or all if shorter).
        # The full quota math from Phase 5 lives elsewhere; this is the
        # accounting step.
        for hit in hits:
            pid = str(hit.id)
            if pid in seen:
                continue
            seen.add(pid)
            merged_ids.append(pid)
            paper_cluster_map[pid] = cluster_idx

    return merged_ids, paper_cluster_map
```

**Update every caller** of `merge_quota_results` in `recommendations.py` to unpack the second return. Mypy / pyright will flag every site; fix them all.

### D.3 `app/routers/recommendations.py`: propagate the map

```python
merged_ids, paper_cluster_map = merge_quota_results(
    per_cluster_results, clusters
)
# ... after dedup and metadata fetch, valid_ids is a subset of merged_ids ...

# Per-candidate cluster index (aligned with valid_ids).
candidate_cluster_idx = np.asarray(
    [paper_cluster_map[pid] for pid in valid_ids],
    dtype=np.int32,
)

# Per-candidate cluster importance (slot 23, properly per-candidate).
per_candidate_importance = np.asarray(
    [clusters[idx].importance for idx in candidate_cluster_idx],
    dtype=np.float32,
)

# Per-candidate cluster medoid (used to compute slot 24 inside reranker).
# Stack medoids into a (N, 1024) array.
per_candidate_medoids = np.stack(
    [
        np.asarray(clusters[idx].medoid_embedding, dtype=np.float32)
        for idx in candidate_cluster_idx
    ],
    axis=0,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    qdrant_scores=qdrant_scores,
    cluster_importance=per_candidate_importance,  # (N,), was scalar in 6.1
    cluster_medoid=per_candidate_medoids,  # (N, 1024), was (1024,) in 6.1
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
```

### D.4 `app/recommend/reranker.py`: per-candidate slot 24

Inside the feature-matrix builder, slot 24 (`cluster_distance_to_medoid`) becomes a row-wise cosine:

```python
# Slot 24: cosine distance from each candidate to its OWN source-cluster medoid.
# cluster_medoid shape:
#   (1024,)    -> 6.1 broadcast path (legacy)
#   (N, 1024)  -> 6.2 per-candidate path
if cluster_medoid is None:
    feat_24 = np.zeros(N, dtype=np.float32)
elif cluster_medoid.ndim == 1:
    # Broadcast cosine: same medoid for every candidate.
    medoid_norm = cluster_medoid / (np.linalg.norm(cluster_medoid) + 1e-9)
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    sims = cand_norms @ medoid_norm
    feat_24 = (1.0 - sims).astype(np.float32)  # distance, not similarity
else:
    # Per-row cosine: candidate i vs cluster_medoid[i].
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    med_norms = cluster_medoid / (
        np.linalg.norm(cluster_medoid, axis=1, keepdims=True) + 1e-9
    )
    sims = (cand_norms * med_norms).sum(axis=1)
    feat_24 = (1.0 - sims).astype(np.float32)
```

### D.5 Why 6.2 matters

A concrete example: user has clusters `{A: ML systems (importance 0.7), B: protein folding (importance 0.3)}`. A paper about *MLPerf benchmark methodology* gets retrieved by cluster A's Qdrant query; a paper about *AlphaFold 3 architecture* gets retrieved by cluster B's. Under 6.1, both get scored against cluster A's medoid (the dominant one), so the AlphaFold paper looks artificially "off-distribution" and gets ranked down. Under 6.2, the AlphaFold paper is scored against cluster B's medoid (where it belongs), and slot 24 correctly registers as a small distance. The model's learned weight on slot 24 then has the opportunity to *protect* exploration into the user's secondary interests, instead of reinforcing the dominant one.

This also makes Phase 9 (Exploration + CF) much cleaner: the exploration budget is naturally reasoned about in terms of "minority-cluster candidates that survived rerank."

### D.6 Integration test outline

**File:** `tests/recommend/test_phase62_per_candidate_cluster.py`

```python
import numpy as np


def test_paper_cluster_map_threaded_through(
    monkeypatch, client, fake_three_cluster_state
):
    """
    Given three clusters with different medoids, candidates retrieved from
    cluster B must produce slot 24 measured against B's medoid, not against
    the dominant A's medoid.

    `client` is assumed to be a FastAPI TestClient fixture (e.g. in conftest.py).
    """
    captured = {}

    def fake_predict(model, X):
        captured["X"] = X.copy()
        return np.arange(X.shape[0])[::-1].astype(float)

    monkeypatch.setattr("app.recommend.reranker._lgb_predict", fake_predict)

    response = client.get(
        "/recommend",
        headers={"X-User-Id": fake_three_cluster_state.user_id},
    )
    assert response.status_code == 200

    X = captured["X"]
    # Specific assertion: candidate 0 came from cluster A, candidate 5 from B.
    # Their slot-24 values must be different (they are scored vs different medoids).
    assert X[0, 24] != X[5, 24]
    # And slot 23 (cluster_importance) must match the source clusters'
    # importance values, not a single broadcast.
    assert len(set(X[:, 23].tolist())) > 1
```

---

## Part E - Phase 6.3: Deployment Verification + Monitoring (1 day)

### E.1 Verify the LightGBM model loads on HF Spaces

Two acceptable deployment strategies. Pick **one** and document it.

#### Option E.1.a - Commit the 974 KB file to the Space's Git repo (recommended for simplicity)

974 KB is well under HF's 5 MB inline limit and well under any sane Git-LFS threshold. Just commit it.

```bash
# From your local ResearchIT working tree:
cd /path/to/ResearchIT
mkdir -p models/reranker-phase6/production_model
cp /path/to/reranker_v1.txt models/reranker-phase6/production_model/

# Make sure neither .gitignore nor .dockerignore excludes it.
grep -E "^models/?$|^models/reranker" .gitignore .dockerignore || echo "OK, not ignored"

git add models/reranker-phase6/production_model/reranker_v1.txt
git commit -m "Phase 6.3: ship LightGBM reranker artifact to Space"
git push hf main  # where 'hf' is the HF Space remote
```

`reranker.py`'s search paths already include this location, so no code change required.

#### Option E.1.b - Pull from HF Hub at container build time (cleaner separation of concerns)

```dockerfile
# Add to Dockerfile, BEFORE the COPY . . step
# Note: the Python code starts right after the opening quote so that the
# joined one-liner never begins with whitespace (which python -c rejects).
RUN pip install --no-cache-dir huggingface_hub && \
    python -c "from huggingface_hub import snapshot_download; \
        snapshot_download( \
            repo_id='siddhm11/researchit-reranker-phase6', \
            local_dir='/app/models/reranker-phase6/production_model', \
            local_dir_use_symlinks=False, \
            allow_patterns=['*.txt', '*.json'], \
        )"
```

This treats the model repo as the source of truth and the Space as a consumer, which is closer to "real" MLOps. Cost: one extra layer in the build, ~1s at startup since it's cached after first build.
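Whichever option is chosen, a small presence check can confirm that the artifact actually landed where the loader expects it. The following is a minimal sketch using only the standard library; `describe_artifact` is a hypothetical helper (not part of the existing codebase), and the default path is the one used in Option E.1.a.

```python
import hashlib
from pathlib import Path

# Path taken from Option E.1.a; adjust if the artifact lives elsewhere.
DEFAULT_MODEL_PATH = Path("models/reranker-phase6/production_model/reranker_v1.txt")

def describe_artifact(path=DEFAULT_MODEL_PATH):
    """Report presence, size, and SHA-256 of the model file (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        return {"present": False, "path": str(path), "size_bytes": 0, "sha256": None}
    data = path.read_bytes()
    return {
        "present": True,
        "path": str(path),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

# Example: inspect the default location from the repo root.
artifact_info = describe_artifact()
```

Logging this dictionary once at container start would make it easy to compare the deployed file's size and hash against the copy in the model repo.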
**Recommendation:** Option **E.1.a** for now (simpler, the file is tiny, and it removes a network dependency at container build). Move to E.1.b in Phase 7 when you have a model registry story.

### E.2 Add `/healthz/reranker` route

**File:** `app/routers/health.py` (create if not present)

```python
import hashlib
import json

from fastapi import APIRouter

from app.recommend import reranker as _rr

router = APIRouter()

EXPECTED_FEATURE_NAMES = [
    "qdrant_score",                       # 0
    "sim_long_term", "sim_short_term",    # 1, 2
    "sim_cluster_medoid", "sim_extra_4",  # 3, 4
    "sim_negative",                       # 5
    # 6..22 paper-level
    "recency_days_log", "citation_count_log", "influential_citations_log",
    "is_primary_cs_lg", "is_primary_cs_cl", "is_primary_cs_cv", "is_primary_stat_ml",
    "is_primary_cs_ai", "is_primary_cs_ir", "is_primary_other",
    "abstract_len_log", "title_len", "year", "month",
    "venue_is_top", "n_authors_log", "has_code_link",
    # 23..30 cluster + user
    "cluster_importance", "cluster_distance_to_medoid",
    "is_suppressed_category", "onboarding_category_match",
    "user_total_saves", "user_total_dismissals",
    "user_saves_last_7d", "user_dismissals_last_7d",
    # 31..36 remaining metadata
    "primary_topic_freq_in_user", "is_oa", "venue_age",
    "abstract_topic_match", "ref_count_log", "is_arxiv_only",
]
assert len(EXPECTED_FEATURE_NAMES) == 37


@router.get("/healthz/reranker")
async def healthz_reranker():
    schema_hash = hashlib.sha256(
        json.dumps(EXPECTED_FEATURE_NAMES).encode()
    ).hexdigest()[:12]
    return {
        "model_loaded": _rr.is_model_loaded(),  # True iff LightGBM Booster live
        "model_path": _rr.get_loaded_model_path(),  # str or None
        "model_version": "phase6.v1",
        "fallback_active": not _rr.is_model_loaded(),
        "feature_count": 37,
        "feature_schema_hash": schema_hash,
        "n_trees": _rr.get_num_trees(),  # 141 expected
    }
```

Wire `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` into `reranker.py` as small accessors over the module-level Booster handle.
**Then verify on the live Space:**

```
curl -s https://siddhm11-researchit.hf.space/healthz/reranker | jq
```

Expected JSON body: `"model_loaded": true, "n_trees": 141, "fallback_active": false`. If any of those is wrong, **the deployed image is silently running the heuristic.**

### E.3 Logging: feature non-zero rate per request

In `reranker.py`, after building the feature matrix `X`, add:

```python
# Observability: emit a per-request feature-activation histogram so we
# catch silent regressions to zero-filled features 23-30.
if X.shape[0] > 0:
    nonzero_rate = (X != 0).mean(axis=0)  # (37,)
    logger.info(
        "reranker.features",
        extra={
            "feature_nonzero_rate": nonzero_rate.round(3).tolist(),
            "feature_count": X.shape[1],
            "n_candidates": X.shape[0],
            "model_active": _model is not None,
        },
    )
```

Add a Prometheus-style assertion later (Phase 7) but for now structured-log lines are enough; you can grep them on the Space's log stream.

### E.4 Bug B fix (Hungarian zero-vector fallback): bundle into 6.3

While we're in `recommendations.py`, fix the medoid-rebuild bug:

```python
# OLD:
# medoid_embedding=np.array(vectors[row["medoid_paper_id"]], dtype=np.float32)
#   if row["medoid_paper_id"] in vectors else np.zeros(1024, dtype=np.float32)

# NEW: if the medoid paper's embedding isn't in the immediate vectors dict,
# fall back to the previously-stored medoid_embedding from the cluster row
# (which is what we persisted last cycle), NOT a zero vector. Zero-vector
# fallback breaks Hungarian assignment because cosine(*, 0) = 0 fails the
# 0.5 acceptance threshold and orphans the cluster identity.
if row["medoid_paper_id"] in vectors:
    medoid_emb = np.asarray(vectors[row["medoid_paper_id"]], dtype=np.float32)
elif row["medoid_embedding_blob"] is not None:
    medoid_emb = np.frombuffer(row["medoid_embedding_blob"], dtype=np.float32)
else:
    # Last resort: this cluster has no recoverable medoid. Mark for re-seed
    # rather than silently masking with zeros.
    logger.warning(
        "cluster.medoid.unrecoverable", extra={"cluster_id": row["cluster_id"]}
    )
    continue  # skip this cluster row; it'll be rebuilt on next Ward run
```

This requires adding `medoid_embedding_blob BLOB` to the cluster persistence schema if it's not already there; that is a one-line ALTER on Turso.

---

## Part F - Phase 6.4: Retraining Strategy WITHOUT Real User Signal

### F.1 The honest framing

The current model is *not broken*. It is correctly trained on the only labels we had access to (citation pseudo-labels), and on its native distribution it scores nDCG@10 = 0.879. The gap between that number and "what users feel" is a **deployment gap**, not a model gap. Retraining only makes sense once one of the following is true:

- We have a training distribution where features 23–30 carry **non-trivial, behaviourally meaningful** signal, **or**
- We have real user labels (saves/dismisses/dwell-time) at sufficient scale.

Today, neither holds. So Phase 6.4 is a **decision document**, not a build task.

### F.2 Three options, compared

| Option | Feasibility | Fidelity to real users | Engineering cost | Risk |
|---|---|---|---|---|
| **(i) Wait for real users.** Threshold: 100 users with ≥10 saves each → ~1,000 labelled (user, paper, +/-) tuples per week, enough for a weekly retrain on slots 23–30. | High (zero engineering until threshold hit) | Highest possible: actual ground truth | Zero now, ~1 week to build the labelling pipeline once we cross the threshold | **You may never cross the threshold.** Single-developer hobby project. |
| **(ii) Synthetic user simulation.** Generate N=1,000 synthetic users. For each, draw a "true interest profile" as a mixture over arXiv categories. Sample 30–60 "saves" from each profile by drawing papers from citation neighborhoods of seed papers in those categories. Run those saves through the **actual** EWMA + Ward + medoid pipeline to produce real `cluster_importance`, real `cluster_distance_to_medoid`, real `onboarding_category_match` (using the synthetic user's category profile as the onboarding set). Label triples by citation: a citation-neighbor of any of the user's saved papers is positive; a random paper outside the user's neighborhood is negative. | Medium-high: every component already exists; the simulation is glue code | Moderate: captures *structure* of the signal (cluster importance scales correctly, suppressed categories actually suppress) but doesn't capture true user *preference noise* | 5–8 days: write `simulate_users.py`, regenerate triples (~200K with non-zero slots 23-30), retrain LightGBM, re-eval | The model learns the simulator's biases. Fight this by holding out real user data (whenever it arrives) as a clean eval set; if the simulator-trained model dramatically outperforms a heuristic baseline on real-user holdout, you've validated. If not, you've wasted a week. |
| **(iii) Self-distillation (defer to Phase 8).** Use the heuristic scorer + Qdrant retrieval rank as a soft label. Train a fresh LightGBM to mimic those scores, then iterate: ship → log new soft labels from the deployed model itself → retrain on log data. | Medium: needs a logging schema and an offline training loop | Low initially (only as good as the teacher), but improves once real users arrive and label-corruption from the teacher decays | 2–3 weeks: full label-store, replay infra, offline training pipeline | **Cold-start collapse:** if the teacher is worse than the student needs, distillation locks in mediocrity. Only safe once we have *any* real user signal to anchor the distribution. |

### F.3 Recommendation

**Adopt a staged path:**

1. **Now (Phase 6.4a):** Do *not* retrain. Ship 6.1 → 6.2 → 6.3 with the existing model.
   Document that the model's effective coverage is ~15/37 features at training time but improves to ~37/37 at serving time *as soon as 6.1 lands*; the model can only use what it learned, but the framework is now wired correctly for the next training run.
2. **+30 days (Phase 6.4b):** Build option **(ii)**, synthetic user simulation. Spec the simulator in `scripts/simulate_users.py`, produce a new dataset version `siddhm11/researchit-reranker-data-v2` with slots 23–30 populated, retrain to `reranker_v2.txt`. Compare v1 vs v2 on a held-out split *and* on whatever sliver of real-user data has accrued by then.
3. **+90 days or 100-user threshold (whichever first) (Phase 6.4c):** If real-user data exists, do option **(i)**. Train `reranker_v3` on real saves/dismisses with synthetic data as augmentation (50/50 mix or downweight synthetic).
4. **Phase 8+:** Bring in option **(iii)** as a continuous-update mechanism on top of v3.

### F.4 What to write in `docs/phases/PHASE6.md` today

Two lines, verbatim:

> **Phase 6.4 retraining is deferred.** The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23–30 zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each.
> **Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.**

---

## Part G - Phase 6 Closeout Checklist

```markdown
### Phase 6.1 - Simplification Pass (1–2 days)
- [ ] Move `suppressed = await db.get_suppressed_categories(...)` to BEFORE rerank call in app/routers/recommendations.py
- [ ] Add `onboarding_categories = await db.get_user_category_filter(...)` next to it
- [ ] Compute `user_total_saves`, `user_total_dismissals`, `user_saves_last_7d`, `user_dismissals_last_7d`
- [ ] Add helper `build_qdrant_score_map()` in app/recommend/fusion.py
- [ ] Add helper `align_qdrant_scores(valid_ids, score_map)`
- [ ] Compute `dominant_importance` and `dominant_medoid_vec`
- [ ] Replace 6-arg `rerank_candidates(...)` call with full kwargs version (Section C.2.4)
- [ ] Update `rerank_candidates()` signature in app/recommend/reranker.py to accept new optional kwargs (Section C.2.5)
- [ ] Wire slots 23–30 into `_build_feature_matrix` in reranker.py
- [ ] Write test `tests/recommend/test_phase6_feature_matrix.py` asserting slots 23–30 non-zero
- [ ] All existing tests pass (`pytest -q`)
- [ ] Commit: "Phase 6.1: connect 37-feature reranker to live caller (dominant-cluster)"

### Phase 6.2 - Full per-candidate plumbing (3–5 days)
- [ ] Add `PaperClusterMap = dict[str, int]` type alias in app/recommend/types.py
- [ ] Modify `merge_quota_results()` in app/recommend/fusion.py to return `(merged_ids, paper_cluster_map)`
- [ ] Update ALL callers of `merge_quota_results` (grep first; fix every one)
- [ ] In recommendations.py, build `candidate_cluster_idx` aligned with `valid_ids`
- [ ] Build `per_candidate_importance` (N,) and `per_candidate_medoids` (N, 1024)
- [ ] Pass arrays (not scalars) for `cluster_importance` and `cluster_medoid` to reranker
- [ ] Update `_build_feature_matrix` slot-24 logic to handle both (1024,) and (N, 1024) medoid shapes
- [ ] Write test `tests/recommend/test_phase62_per_candidate_cluster.py` (Section D.6)
- [ ] Commit: "Phase 6.2: per-candidate cluster identity through reranker"

### Phase 6.3 - Deployment verification + Bug B
- [x] Decide deployment strategy: E.1.a (commit) vs E.1.b (snapshot_download). Used E.1.a.
- [x] Verify `models/reranker-phase6/production_model/reranker_v1.txt` is in working tree, not gitignored, not dockerignored
- [x] Push to HF Space; wait for build; check build logs for "[reranker] LightGBM model loaded"
- [x] Add `/healthz/reranker` route (Section E.2)
- [x] Add `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` accessors
- [x] `curl https://siddhm11-researchit.hf.space/healthz/reranker` → confirm `model_loaded: true, n_trees: 141`
  > *Verified live at 2026-05-03: `model_loaded=true, n_trees=141, fallback_active=false, feature_count=37, feature_schema_hash=5d0b3de7b0c1`.*
- [x] Add per-request `reranker.features` log line with `feature_nonzero_rate`
- [x] Fix Bug B: medoid_embedding_blob fallback in cluster reload (Section E.4)
- [x] Add `medoid_embedding_blob BLOB` column to clusters table (SQLite ALTER migration)
- [x] Update CLAUDE.md / model card to reflect deployment story

### Phase 6 documentation
- [ ] Write `docs/phases/PHASE6.md` retraining decision (Section F.4)
- [ ] Update `README.md` test count (will increase by ~2 from 6.1 + 6.2 tests)
- [ ] Update `TASK-TRACKER.md`: tick off `[x] [reranker] LightGBM model loaded` (after curl verifies it)
- [ ] Backfill the HF model card with: `library_name: lightgbm`, `pipeline_tag: text-ranking`, `datasets: [siddhm11/researchit-reranker-data]`, training data description, feature schema (37 names), reported metrics (nDCG@10=0.879 on 7,007-triple eval split, ≥2023 papers), and a clear "trained on citation pseudo-labels, NOT real user signal" disclaimer
- [ ] Upload `03_train_lightgbm.py` to a new repo `siddhm11/researchit-reranker-training` (or to the dataset repo as a script) so retraining is reproducible from public artifacts

### Phase 6.4 - Retraining (decision only; no code yet)
- [ ] Document deferral and the 30-day / 100-user trigger in PHASE6.md
- [ ] Open a tracking issue: "Phase 6.4b: synthetic user simulator" (target: +30d)
- [ ] Open a tracking issue: "Phase 6.4c: real-user retrain at 100-user threshold"
```

---

## Part H - What Phase 6 Is NOT (scope boundary)

Phase 6 is **integration of the existing trained reranker, plus the deployment story for it**. Specifically *out of scope* for Phase 6 are the items below; these are their own phases, with their own framing docs to come:

| Out of scope | Belongs to | Why it's separate |
|---|---|---|
| Offline regression harness, time-split eval framework, A/B test infrastructure | **Phase 7: Evaluation Framework** | Requires a held-out user-interaction log and a separate "shadow ranker" infra. Cannot be built before 6.3 ships, because we need stable feature semantics first. |
| LLM-generated paper summaries; cross-encoder distilled into LightGBM features | **Phase 8: LLM Summaries + Distilled Reranker** | Adds new feature slots (37 → 40+) and breaks the schema; must be a model version bump, not a 6.x patch. |
| Exploration bandits (UCB / Thompson over cluster heads), collaborative-filtering co-save signal | **Phase 9: Exploration + CF** | Needs a user population large enough for CF to be non-degenerate, and an exploration budget that Phase 6's MMR=0.6 currently approximates as a stopgap. |
| Migrating to a different reranker family (e.g. cross-encoder, ColBERT, BGE-reranker-v2-m3) | **Phase 10+** | Explicitly forbidden in serving by CLAUDE.md rule 4. LightGBM is the serving model; anything heavier is a teacher in distillation, not a serving model. |
| Replacing citation pseudo-labels with click-through CTR labels from production | **Phase 6.4c** (real-user retrain) | Triggered by traffic threshold, not by code. |

**The tightest possible definition of "Phase 6 done":**

> *Every candidate that arrives at the LightGBM ranker is described by a 37-dimensional feature vector in which slots 23–30 carry per-candidate signal derived from the user's actual cluster state and onboarding/category history; the ranker's inference is verified live on `siddhm11-researchit.hf.space` via `/healthz/reranker`; and a feature non-zero rate is logged per request. Retraining is deferred and documented.*

When that sentence is true, Phase 6 ships and Phase 7 begins.

---

## Caveats and unknowns (full disclosure)

1. **The HF model card itself was not directly readable** from the verification environment used for this framing; only the structured Hub API metadata was (it exposes `tags`, `created_at`, `last_modified`, `pipeline_tag`, `library_name`, `description`, etc., all of which were null/empty for the model repo). The conclusion that "the README is empty" is an inference from the absent metadata fields a populated model card would normally surface, not a direct read of the file. **Action item for Amin:** open `https://huggingface.co/siddhm11/researchit-reranker-phase6` in a browser, look at the README and the Files tab, and confirm: (a) is `reranker_v1.txt` present at exactly 974 KB? (b) is there any README content at all? (c) are training scripts present? Adjust the model-card backfill item in the Phase 6 documentation checklist accordingly.
2. **The 90,993 / 7,007 triple counts and the 141-tree figure come from the audit**, not from a re-derivation against the live HF artifacts. The HF dataset metadata only confirms the order of magnitude (`100K