# PHASE 6 - Reranker Framing Document

> **File:** `docs/phases/PHASE6-Reranker-Framing.md`
> **Project:** ResearchIT (arXiv paper recommendation engine)
> **Author:** Amin (siddhm11)
> **Status:** Phase 6 - In Integration (post-audit framing)
> **Supersedes:** the open items from `PHASE6-Audit.md`
> **Live Space:** `https://siddhm11-researchit.hf.space` (HF Spaces, Docker SDK, RUNNING, cpu-basic)
> **Reranker model repo:** `https://huggingface.co/siddhm11/researchit-reranker-phase6`
> **Training data repo:** `https://huggingface.co/datasets/siddhm11/researchit-reranker-data`

---
## TL;DR (for Amin's future self at 1 AM)

1. **The HF model exists and is public, but the model card is empty** (no `pipeline_tag`, no `library_name`, no `description`, no metrics block). The reproducibility story currently lives on your laptop, not in the repo. The dataset repo `siddhm11/researchit-reranker-data` does exist as a public Parquet dataset (size_categories: 100K–1M, downloads: 30); that is the only piece of the training pipeline that is *partially* public.
2. **The model was NOT trained on real user signal.** It was trained on **citation pseudo-labels** (Semantic Scholar citation edges → triples → LightGBM LambdaRank). Features 23–30 (cluster importance, suppressed-category, onboarding match, user save/dismiss counts) were **zero during training** because no users existed, so zeroes at serving time are *consistent with training*, not a regression. The 0.879 nDCG@10 number is honest under that distribution; it is not honest as a measure of "what the user feels."
3. **Phase 6 is therefore about plumbing, not retraining.** Land Phase 6.1 (dominant-cluster shortcut, 1–2 days), then Phase 6.2 (per-candidate `paper_cluster_map` plumbing, ~3–5 days), then Phase 6.3 (deployment verification + `/healthz/reranker`). Defer retraining (Phase 6.4) until either (a) you ship synthetic-user simulation, or (b) you reach ~100 real users with ≥10 saves each.

---
## Part A - Phase 6 Status Verification (HF inspection)

### A.1 What is publicly visible on Hugging Face

I queried the HF Hub API directly (the public web pages were not reachable from this environment, but the structured API endpoints were). Here is the *complete* observable state of your HF account as of **2026-05-02**:

| Repo | Type | Visibility | Created | Last modified | Notable |
|---|---|---|---|---|---|
| `siddhm11/researchit-reranker-phase6` | model | public, not gated | 2026-04-26 | 2026-04-27 21:51 | **README empty / no description, no `pipeline_tag`, no `library_name`, no metrics card, no datasets link** |
| `siddhm11/researchit-reranker-data` | dataset | public, not gated | 2026-04-27 16:41 | 2026-04-27 21:51 | Parquet, `100K<n<1M`, `modality:tabular,text`, 30 downloads |
| `siddhm11/ResearchIT` | space | public, RUNNING | 2026-04-19 | 2026-05-02 11:29 | Docker SDK, cpu-basic, declares `BAAI/bge-m3` as a referenced model |
| `siddhm11/prompt-engine` | space | public, RUNNING | 2026-02-02 | 2026-03-21 | Docker, unrelated |
| `siddhm11/sandbox-c6a4f7e6` | space | SLEEPING | 2026-04-26 | – | Sandbox |
| `siddhm11/sandbox-9bb83c65` | space | SLEEPING | 2026-04-27 | – | Sandbox |

**Critically absent from the model repo metadata:** `library_name` (should be `lightgbm`), `pipeline_tag` (should be e.g. `text-ranking` or a custom tag), any `model-index` block, any `datasets:` link to `siddhm11/researchit-reranker-data`. The only tag is `region:us`.
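
To make the missing metadata concrete, here is a minimal sketch that renders the YAML front matter a model card `README.md` would open with. The field values are the ones named above; the helper name `render_model_card_front_matter` is hypothetical, and a `model-index` block is left out because its exact contents (metric names, dataset config) are not established in this document.

```python
def render_model_card_front_matter(
    library_name: str,
    pipeline_tag: str,
    datasets: list[str],
) -> str:
    """Render the YAML front matter block that opens a Hub README.md."""
    lines = [
        "---",
        f"library_name: {library_name}",
        f"pipeline_tag: {pipeline_tag}",
        "datasets:",
    ]
    # One list entry per linked dataset repo.
    lines += [f"  - {name}" for name in datasets]
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = render_model_card_front_matter(
    library_name="lightgbm",
    pipeline_tag="text-ranking",  # or a custom tag, as noted above
    datasets=["siddhm11/researchit-reranker-data"],
)
print(front_matter)
```

Pasting the printed block at the top of the model repo's `README.md` should be enough for the Hub to pick up the library, task, and dataset link.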
### A.2 What this means for reproducibility

The audit's working assumption, *"Phase 6 was trained on 242K citation edges from 50K sampled papers, time-split (train <2023, eval ≥2023), 90,993 train + 7,007 eval triples, nDCG@10 = 0.879, 37 features, 141 trees, 974 KB"*, **is internally consistent** but **cannot be re-derived from the public HF artifacts alone today**. The dataset repo exists (good: that is where the triples almost certainly live), but the model repo has no card linking the two, and there is no published training script.

What we can confirm vs what is missing:

| Asset (per audit expectation) | Public on HF? | Notes |
|---|---|---|
| `reranker_v1.txt` (LightGBM dump, ~974 KB) | **Likely yes** (it's the only point of the model repo) but the API does not expose `siblings` to confirm filename. **Action: open the Files tab in a browser and verify the filename + size.** | Without a card, downstream consumers don't know which file to load |
| Raw citation edges (output of `01_fetch_citation_edges.py`) | **Unknown / probably not** | Not exposed by metadata. Likely only the *triples* are uploaded |
| Triples file (output of `02_generate_training_triples.py`) | **Probably yes**: this is what `researchit-reranker-data` is for; the size category and modalities match | Verify columns include `query_paper_id`, `candidate_paper_id`, `relevance`, the 37 features, and a `group`/`qid` column |
| Eval triples (≥2023 split) | **Probably yes** (same dataset, separate split) | Verify a `split` column or `train`/`eval` files |
| `03_train_lightgbm.py` | **Almost certainly NO**: no Space or repo of yours hosts it | This is the single biggest reproducibility gap |
| Feature schema (37 names, in canonical order) | **NO**: it would live in a model card | Without this, even the LightGBM `.txt` dump (which uses `Column_0…Column_36`) is opaque |
| `01_fetch_citation_edges.py` script | **NO** | Required to refresh edges for retraining |
| `pseudo_label_generation.py` / triple builder | **NO** | Required to regenerate the dataset under a new time-split |

**Verdict on reproducibility:** as of today, **Phase 6 retraining is gated on Amin's local files.** A fresh collaborator (or future Amin on a new laptop) cannot retrain the model from public artifacts alone. Fixing this is part of Phase 6.3 (see below).
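
The "verify columns" actions in the table can be turned into a quick check. This is a sketch only: it assumes the dataset uses exactly the column names listed above, that any column which is not an id, label, group, or `split` column is a feature, and the function name `check_triples_schema` is hypothetical. Feed it the column names from whatever Parquet reader you use.

```python
REQUIRED_COLUMNS = ("query_paper_id", "candidate_paper_id", "relevance")
GROUP_COLUMNS = ("group", "qid")  # either one is acceptable
N_FEATURES = 37


def check_triples_schema(columns: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the schema looks fine."""
    problems: list[str] = []
    present = set(columns)
    for name in REQUIRED_COLUMNS:
        if name not in present:
            problems.append(f"missing column: {name}")
    if not present.intersection(GROUP_COLUMNS):
        problems.append("missing grouping column: expected one of group/qid")
    # Assumption: everything that is not bookkeeping is a feature column.
    bookkeeping = set(REQUIRED_COLUMNS) | set(GROUP_COLUMNS) | {"split"}
    n_feature_cols = sum(1 for c in columns if c not in bookkeeping)
    if n_feature_cols != N_FEATURES:
        problems.append(
            f"expected {N_FEATURES} feature columns, found {n_feature_cols}"
        )
    return problems


example_columns = (
    ["query_paper_id", "candidate_paper_id", "relevance", "qid", "split"]
    + [f"f{i}" for i in range(37)]  # placeholder feature names
)
print(check_triples_schema(example_columns))
```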
### A.3 Confirm or refute: was the published model trained on test data or real user signal?

**Refuted on both counts.** Three converging pieces of evidence:

1. **No user data could possibly exist.** The Space `siddhm11/ResearchIT` was created **2026-04-19**, only seven days before the model repo on **2026-04-26**. There has been no production window long enough to accrue meaningful save/dismiss interaction logs, and the model repo has zero downloads and zero likes; this is a personal project with no user funnel.
2. **The dataset repo was created the day *after* the model.** `researchit-reranker-data` is dated **2026-04-27**, one day *after* the model went up, so the model cannot have been trained by pulling from that repo; the dataset was uploaded afterwards as a documentation artifact. That fits the citation-pseudo-label story: the triples were generated locally, the model was trained locally, and both were pushed in close succession.
3. **The dataset metadata explicitly says `modality:tabular, text`** and `format:parquet` with `100K<n<1M` rows. That band sits just above the audit's stated **98,000 triples (90,993 train + 7,007 eval)**, so the order of magnitude matches the audit, but the exact row count is worth confirming when you inspect the Parquet files. Real user signal at this scale is implausible for a one-week-old project.

**Conclusion:** the model is trained on citation pseudo-labels, not on test data, not on user signal. The audit's account is correct. **Features 23–30 were almost certainly zero during training** (no user, no live cluster importance from a real session), which is why serving them as zero today does not break the model's distribution; it merely leaves potential signal on the table.
### A.4 Is `reranker_v1.txt` actually deployed to the live Space?

**Unverified from outside.** The Space metadata shows it is RUNNING on cpu-basic and declares `BAAI/bge-m3` as a model dependency, but does **not** declare `siddhm11/researchit-reranker-phase6` as a dependency. That is a yellow flag: if the LightGBM model were being pulled at runtime via `huggingface_hub.snapshot_download`, the Space metadata would typically list it. The most likely current state is one of:

- (a) the file was committed to the Space repo's working tree under `models/reranker-phase6/production_model/reranker_v1.txt` and is being loaded from disk (works, but means the same artifact is duplicated in two HF repos), **or**
- (b) the file is not in the Space at all and the deployed code is silently falling back to the heuristic scorer.

Either way, the audit's TASK-TRACKER item **`[ ] [reranker] LightGBM model loaded`** is still the right thing to verify, and we wire that verification in via Phase 6.3 below.

---
## Part B - Phase 6 Problem Statement, Properly Framed

**Why Phase 6 exists.** Phases 1–5 built a clean retrieval+ranking stack: BGE-M3 dense + sparse, Qdrant + Zilliz, Ward clustering of user history into long/short/negative interest centroids, importance-weighted per-cluster Qdrant search with floor `F_min=3`, and a heuristic linear scorer over a hand-tuned set of similarities. The heuristic worked, but it left two kinds of signal on the table: (i) *non-linear* interactions between paper-level features (recency × influential-citations × topic match), and (ii) *user-state* features like cluster importance, onboarding category match, and save/dismiss intensity. Phase 6 introduces a LightGBM LambdaRank model with a **37-slot feature vector** designed to absorb both classes of signal and to give us a single optimizable objective (nDCG) instead of a hand-tuned weighted sum.

**What Phase 6 currently delivers vs what it was designed to deliver.** What ships today is a 141-tree LightGBM ranker, trained offline on citation pseudo-labels with a time-split eval, claiming nDCG@10 = 0.879, *with a heuristic fallback when the model file is missing*. What the live serving path actually feeds the model is not the 37-feature vector the model was trained on; it is a vector where features 0, 1–4, 5, 6–22 carry signal and **features 23–30 are zero-filled** because the integration in `app/routers/recommendations.py` still calls `rerank_candidates` with the legacy 6-argument signature from Phase 5. Bugs A, B, and C from the audit all stem from this single gap: *the reranker module was upgraded; the caller was not.*
**The "9 of 37 features" gap, quantified.**

| Feature slots | Count | Source available in caller scope? | Currently passed? |
|---|---|---|---|
| `0` qdrant_score | 1 | Yes, in `per_cluster_results` | **No**, not constructed |
| `1–4` long/short/medoid/extra similarities | 4 | Yes: `lt_vec`, `st_vec`, dominant medoid | Partial: only lt/st/neg are passed positionally |
| `5` neg similarity | 1 | Yes: `neg_vec` | Yes |
| `6–22` paper-level features | 17 | Yes, derivable in `reranker.py` from metadata | Yes (computed inside reranker) |
| `23` cluster_importance | 1 | Yes: `clusters[i].importance` | **No** |
| `24` cluster_distance_to_medoid | 1 | Yes, from `clusters[i].medoid_embedding` | **No** |
| `25` is_suppressed_category | 1 | Yes, but `suppressed` is loaded *after* the rerank call | **No** |
| `26` onboarding_category_match | 1 | Yes: `db.get_user_category_filter()` exists | **No, never called for this** |
| `27` user_total_saves | 1 | Yes: `len(state.positives)` | **No** |
| `28` user_total_dismissals | 1 | Yes: `len(state.negatives)` | **No** |
| `29` saves_last_7d | 1 | Yes, derivable from `state.positives` timestamps | **No** |
| `30` dismissals_last_7d | 1 | Yes, derivable from `state.negatives` timestamps | **No** |
| `31–36` remaining metadata features | 6 | Yes, computed inside reranker | Yes |

**Effective active features at serving:** ~9 + ~6 = ~15 of 37 carry real signal. **Features 23–30 are all zero.** This is Bug A, expressed quantitatively. It is *not* destroying ranking quality (the model has not learned to rely on those slots, because they were zero in training too), but it permanently caps how much Phase 6 can ever improve over the heuristic.

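
One way to measure this gap on live traffic, rather than reading it off the table, is to log the per-slot non-zero rate of the matrix handed to the model. The sketch below uses plain Python lists so it runs anywhere; the function names are hypothetical, and the toy matrix simply mimics the serving state described above (slots 23–30 zero-filled).

```python
def nonzero_rate_per_slot(rows: list[list[float]]) -> list[float]:
    """Fraction of candidates with a non-zero value, for each feature slot."""
    if not rows:
        return []
    n_slots = len(rows[0])
    counts = [0] * n_slots
    for row in rows:
        if len(row) != n_slots:
            raise ValueError("ragged feature matrix")
        for slot, value in enumerate(row):
            if value != 0:
                counts[slot] += 1
    return [count / len(rows) for count in counts]


def dead_slots(rows: list[list[float]]) -> list[int]:
    """Slots that are zero for every candidate in this request."""
    return [
        slot
        for slot, rate in enumerate(nonzero_rate_per_slot(rows))
        if rate == 0.0
    ]


# Toy request: 3 candidates x 37 slots, with slots 23..30 zero-filled.
rows = []
for i in range(3):
    row = [0.1 * (i + 1)] * 37
    for slot in range(23, 31):
        row[slot] = 0.0
    rows.append(row)

print(dead_slots(rows))
```

Logging `dead_slots` once per request would make a plumbing regression visible without waiting for an offline evaluation.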

**Phase 6 is a 3-stage fix:** (6.1) connect the existing data flow with a dominant-cluster shortcut, (6.2) thread per-candidate cluster identity through the pipeline, (6.3) verify deployment + add observability. Retraining (6.4) is a separate decision, gated on whether we can produce a training distribution where features 23–30 are non-zero and behaviourally meaningful.

---

## Part C - Phase 6.1: The Simplification Pass (1–2 days)

### C.1 What 6.1 does

For *each* candidate in the rerank list, we compute features 23 and 24 against the **dominant cluster**, i.e. the cluster with the highest `importance` in the user's current state. Features 25 and 26 remain truly per-candidate (they depend on the candidate's `primary_topic`). Features 27–30 are user-level and constant across candidates in a single rerank call. Phase 6.1 is the minimum-viable fix that ends "9 of 37"; it will be replaced by 6.2's per-candidate cluster lookup.

**Why dominant-cluster is acceptable for 6.1.** The current model was trained with feature 23 = 0 everywhere, so its trees cannot have learned a useful split on that slot. As long as 6.1 produces a *reasonable* non-zero value for 23, the ranking keeps resting on whatever signal the model learned in features 6–22. We are not relying on 23 carrying perfect semantics; we are getting the integration plumbed end-to-end so that the feature non-zero rate jumps from ~40% to ~100%, and so that Phase 6.4 retraining has a target.

### C.2 The exact code patch

**File:** `app/routers/recommendations.py`

#### C.2.1 Move the `suppressed` and `onboarding_categories` loading earlier

Find the existing block where `suppressed` is loaded (currently *after* the `rerank_candidates` call) and move it to immediately after `state` is hydrated and before the Qdrant retrieval loop. Add the onboarding category fetch right next to it.
```python
# Needs at module top: from datetime import datetime, timedelta, timezone

# --- BEGIN Phase 6.1 patch (top of recommendations endpoint, after state hydration) ---
# Suppressed categories: was loaded after rerank; move it here.
suppressed: set[str] = set(await db.get_suppressed_categories(user_id))

# Onboarding categories: previously unused at rerank time.
onboarding_categories: set[str] = set(
    await db.get_user_category_filter(user_id) or []
)

# User-level interaction counts (constant across all candidates this request).
now_utc = datetime.now(timezone.utc)
seven_days_ago = now_utc - timedelta(days=7)
user_total_saves = len(state.positives)
user_total_dismissals = len(state.negatives)
# Note: these comparisons require timezone-aware timestamps on positives/negatives.
user_saves_last_7d = sum(
    1 for p in state.positives if p.timestamp >= seven_days_ago
)
user_dismissals_last_7d = sum(
    1 for n in state.negatives if n.timestamp >= seven_days_ago
)
# --- END Phase 6.1 patch ---
```
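
One thing to watch in the 7-day counts: Python raises `TypeError` when a naive `datetime` is compared with a timezone-aware one, and `seven_days_ago` above is aware. If the stored timestamps might be naive, a small normalizer avoids that. This is a sketch under the assumption that naive timestamps were written in UTC; `as_utc` and `count_since` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC (assumption); convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def count_since(timestamps: list[datetime], now: datetime, days: int = 7) -> int:
    """Count events that happened within the last `days` days of `now`."""
    cutoff = as_utc(now) - timedelta(days=days)
    return sum(1 for ts in timestamps if as_utc(ts) >= cutoff)


now = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
saves = [
    datetime(2026, 5, 1, 9, 0),                        # naive, 1 day old
    datetime(2026, 4, 30, 9, 0, tzinfo=timezone.utc),  # aware, 2 days old
    datetime(2026, 4, 20, 9, 0, tzinfo=timezone.utc),  # aware, 12 days old
]
print(count_since(saves, now))
```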
#### C.2.2 Add a helper to align Qdrant scores with `valid_ids`

`per_cluster_results` is currently a `list[list[ScoredPoint]]` (one list per cluster). After dedup + filter into `valid_ids`, we lose the raw retrieval score. We need to project it back.

```python
# app/recommend/fusion.py (or app/routers/recommendations.py if you keep helpers local)
import numpy as np


def build_qdrant_score_map(
    per_cluster_results: list[list["ScoredPoint"]],
) -> dict[str, float]:
    """
    Collapse all per-cluster ScoredPoint lists into a single
    paper_id -> max_score dict. If a paper appeared in multiple
    clusters' top-K, we keep the maximum score (most charitable to
    the candidate; matches the dedup semantics in merge_quota_results).
    """
    out: dict[str, float] = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)  # arxiv IDs ALWAYS strings (CLAUDE.md rule 7)
            score = float(hit.score)
            if pid not in out or score > out[pid]:
                out[pid] = score
    return out


def align_qdrant_scores(
    valid_ids: list[str],
    score_map: dict[str, float],
) -> np.ndarray:
    """Return a float32 array of qdrant_scores aligned 1:1 with valid_ids.

    Missing entries default to 0.0 (which matches train-time behavior for
    candidates injected by exploration paths)."""
    return np.asarray(
        [score_map.get(pid, 0.0) for pid in valid_ids],
        dtype=np.float32,
    )
```
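
A toy run of the same max-score and alignment logic, written with plain lists so it needs neither Qdrant nor NumPy. `Hit` is a hypothetical stand-in for `ScoredPoint`, `max_score_by_paper` mirrors `build_qdrant_score_map`, and the arXiv IDs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for qdrant's ScoredPoint (only id and score)."""
    id: str
    score: float


def max_score_by_paper(per_cluster_results: list[list[Hit]]) -> dict[str, float]:
    """paper_id -> highest retrieval score seen across all clusters."""
    out: dict[str, float] = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)
            out[pid] = max(out.get(pid, float("-inf")), float(hit.score))
    return out


per_cluster = [
    [Hit("2401.00001", 0.81), Hit("2401.00002", 0.77)],  # cluster 0
    [Hit("2401.00002", 0.90), Hit("2401.00003", 0.65)],  # cluster 1
]
score_map = max_score_by_paper(per_cluster)

# "2312.99999" was never retrieved (e.g. an exploration candidate) -> 0.0.
valid_ids = ["2401.00002", "2401.00003", "2312.99999"]
aligned = [score_map.get(pid, 0.0) for pid in valid_ids]
print(aligned)
```

The paper retrieved by both clusters keeps its higher score, and the unseen ID falls back to the 0.0 default.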
#### C.2.3 Compute the dominant-cluster scalars

```python
# Dominant cluster: highest-importance cluster in user state.
# clusters is the list[Cluster] you already have in scope.
if clusters:
    dominant = max(clusters, key=lambda c: c.importance)
    dominant_importance = float(dominant.importance)
    dominant_medoid_vec = np.asarray(
        dominant.medoid_embedding, dtype=np.float32
    )
else:
    # Cold-start path: no clusters -> defensible defaults.
    dominant_importance = 0.0
    dominant_medoid_vec = np.zeros(1024, dtype=np.float32)
```
#### C.2.4 The new `rerank_candidates` call

Replace the existing call (around line 305 of `recommendations.py`) with:

```python
# --- BEGIN Phase 6.1 rerank call ---
qdrant_score_map = build_qdrant_score_map(per_cluster_results)
qdrant_scores = align_qdrant_scores(valid_ids, qdrant_score_map)

# Per-candidate boolean: is this candidate's primary_topic suppressed?
is_suppressed_category = np.asarray(
    [
        1.0 if (m.get("primary_topic") in suppressed) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

# Per-candidate boolean: does this candidate match any onboarding category?
onboarding_category_match = np.asarray(
    [
        1.0 if (m.get("primary_topic") in onboarding_categories) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    # Phase 6 additions (6.1: dominant-cluster shortcut)
    qdrant_scores=qdrant_scores,
    cluster_importance=np.full(
        len(valid_ids), dominant_importance, dtype=np.float32
    ),
    cluster_medoid=dominant_medoid_vec,  # broadcast in reranker
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
# --- END Phase 6.1 rerank call ---
```
#### C.2.5 Reranker signature change

**File:** `app/recommend/reranker.py`

```python
def rerank_candidates(
    # The six Phase 5 arguments stay positional-or-keyword so that legacy
    # positional callers do not break during the migration window.
    candidate_ids: list[str],
    candidate_embeddings: np.ndarray,  # (N, 1024)
    candidate_metadata: list[dict],
    long_term_vec: np.ndarray | None,
    short_term_vec: np.ndarray | None,
    negative_vec: np.ndarray | None,
    *,
    # Phase 6.1 additions: keyword-only, all optional with safe zero-defaults
    # so the legacy callers keep working during the migration window.
    qdrant_scores: np.ndarray | None = None,
    cluster_importance: np.ndarray | None = None,  # (N,) or scalar broadcast
    cluster_medoid: np.ndarray | None = None,  # (1024,) for 6.1, (N, 1024) for 6.2
    is_suppressed_category: np.ndarray | None = None,
    onboarding_category_match: np.ndarray | None = None,
    user_total_saves: int = 0,
    user_total_dismissals: int = 0,
    user_saves_last_7d: int = 0,
    user_dismissals_last_7d: int = 0,
) -> tuple[list[str], list[float], np.ndarray]:
    ...
```

Inside the feature-matrix builder, fill slots 23–30 from these new args, falling back to zero if `None` (preserves backward-compat for any unit tests that still call the legacy path).
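
The zero-fallback rule for slots 23–30 can be sketched in isolation. This is not the real builder: it uses plain lists instead of NumPy, takes slot 24 as a precomputed distance rather than deriving it from the medoid, and `user_cluster_block` is a hypothetical name. It only illustrates how omitted arguments become zero columns while user-level counts are repeated on every row.

```python
from typing import Optional, Sequence


def user_cluster_block(
    n: int,
    cluster_importance: Optional[Sequence[float]] = None,         # slot 23
    cluster_distance: Optional[Sequence[float]] = None,           # slot 24
    is_suppressed_category: Optional[Sequence[float]] = None,     # slot 25
    onboarding_category_match: Optional[Sequence[float]] = None,  # slot 26
    user_total_saves: int = 0,                                    # slot 27
    user_total_dismissals: int = 0,                               # slot 28
    user_saves_last_7d: int = 0,                                  # slot 29
    user_dismissals_last_7d: int = 0,                             # slot 30
) -> list[list[float]]:
    """Return n rows of 8 floats: the values for slots 23..30 per candidate."""

    def column(values: Optional[Sequence[float]]) -> list[float]:
        if values is None:
            return [0.0] * n  # legacy callers: slot stays zero
        if len(values) != n:
            raise ValueError(f"expected {n} values, got {len(values)}")
        return [float(v) for v in values]

    per_candidate = [
        column(cluster_importance),
        column(cluster_distance),
        column(is_suppressed_category),
        column(onboarding_category_match),
    ]
    # User-level counts are constant across candidates in one request.
    user_level = [
        float(user_total_saves),
        float(user_total_dismissals),
        float(user_saves_last_7d),
        float(user_dismissals_last_7d),
    ]
    return [[col[i] for col in per_candidate] + user_level for i in range(n)]


legacy = user_cluster_block(2)  # nothing passed -> all zeros
plumbed = user_cluster_block(
    2,
    cluster_importance=[0.7, 0.7],
    cluster_distance=[0.1, 0.4],
    is_suppressed_category=[0, 1],
    onboarding_category_match=[1, 0],
    user_total_saves=12,
    user_total_dismissals=3,
    user_saves_last_7d=2,
    user_dismissals_last_7d=1,
)
print(legacy[0])
print(plumbed[1])
```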
### C.3 The integration test (must be added)

**File:** `tests/recommend/test_phase6_feature_matrix.py`

```python
import numpy as np
import pytest

from app.recommend.reranker import _build_feature_matrix  # expose for testing


def test_phase6_feature_matrix_is_not_mostly_zero(fake_user_state, fake_candidates):
    """
    Regression guard for Bug A. After Phase 6.1, features 23-30 must
    carry signal for at least one candidate in a typical request.
    """
    # The hard-coded arrays below assume fake_candidates holds 5 candidates.
    X = _build_feature_matrix(
        candidate_ids=fake_candidates.ids,
        candidate_embeddings=fake_candidates.embs,
        candidate_metadata=fake_candidates.meta,
        long_term_vec=fake_user_state.lt_vec,
        short_term_vec=fake_user_state.st_vec,
        negative_vec=fake_user_state.neg_vec,
        qdrant_scores=fake_candidates.qdrant_scores,  # non-trivial
        cluster_importance=np.full(len(fake_candidates.ids), 0.42),
        cluster_medoid=fake_user_state.dominant_medoid,
        is_suppressed_category=np.array([0, 0, 1, 0, 0], dtype=np.float32),
        onboarding_category_match=np.array([1, 0, 1, 0, 1], dtype=np.float32),
        user_total_saves=12,
        user_total_dismissals=3,
        user_saves_last_7d=2,
        user_dismissals_last_7d=1,
    )
    assert X.shape[1] == 37, f"Feature schema drifted: got {X.shape[1]} cols"

    # Per-feature non-zero rate. Slots 23..30 must each be >0 somewhere.
    nonzero_rate = (X != 0).mean(axis=0)
    for slot in range(23, 31):
        assert nonzero_rate[slot] > 0.0, (
            f"Feature {slot} is all zeros: Phase 6.1 plumbing regression"
        )

    # Aggregate sanity: at least 60% of feature slots should be active.
    assert (nonzero_rate > 0).mean() >= 0.6
```
### C.4 What 6.1 fixes vs leaves open

| Bug | Status after 6.1 |
|---|---|
| A: caller uses legacy 6-arg signature | **Fixed** |
| B: Hungarian zero-vector fallback | Untouched (orthogonal; addressed separately, see Part E.4) |
| C: model deployment unverified | Untouched (Phase 6.3) |
| D: train/serve consistency | Improved on slot 23–30 *non-zeroness*, but **the trained model's slot-23 weight is near-zero**, so 6.1 will not move nDCG by much. That's expected. |

---
## Part D - Phase 6.2: The Full Plumbing (3–5 days)

### D.1 The data-structure change: `paper_cluster_map`

The architectural rule we are honoring: *a paper retrieved for "Cluster A: ML systems" should be scored for its fit to **that** cluster, not to the user's dominant interest.* Phase 6.1 violates this for slots 23 and 24 by broadcasting the dominant cluster's importance/medoid to every candidate. Phase 6.2 fixes it by attaching the **source cluster index** to each retrieved candidate and threading that through to the reranker.

**Add to `app/recommend/types.py` (or wherever `Cluster` lives):**

```python
# Source-of-truth mapping: candidate paper_id -> index into clusters[].
# Built during per-cluster Qdrant retrieval, propagated through merge_quota,
# consumed by the reranker.
PaperClusterMap = dict[str, int]
```

If a candidate appears in *multiple* cluster top-Ks (which can happen with importance-weighted quota and a permissive K_max=7), the convention is **first-write-wins by importance order**, i.e. when iterating clusters in descending importance, the first cluster to surface a candidate "owns" it. This matches the heuristic that a paper that the user's *strongest* interest cluster also pulls is more naturally explained by that cluster.
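
The ownership convention is easy to check on a toy input. The sketch below works on bare ID lists rather than `ScoredPoint` objects; `assign_owner_cluster` is a hypothetical name and the arXiv IDs are invented for illustration.

```python
def assign_owner_cluster(
    per_cluster_ids: list[list[str]],
    importances: list[float],
) -> dict[str, int]:
    """Map each paper_id to the index of the most important cluster that surfaced it."""
    # Visit clusters from most to least important; sorted() is stable on ties.
    order = sorted(
        range(len(importances)),
        key=lambda i: importances[i],
        reverse=True,
    )
    owner: dict[str, int] = {}
    for cluster_idx in order:
        for pid in per_cluster_ids[cluster_idx]:
            owner.setdefault(pid, cluster_idx)  # first write wins
    return owner


per_cluster_ids = [
    ["2401.00002", "2401.00003"],  # cluster 0, importance 0.3
    ["2401.00001", "2401.00002"],  # cluster 1, importance 0.7
]
owner = assign_owner_cluster(per_cluster_ids, importances=[0.3, 0.7])
print(owner)
```

The shared paper `2401.00002` ends up owned by cluster 1, the more important of the two clusters that retrieved it, even though cluster 0 comes first in the list.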
### D.2 `app/recommend/fusion.py` - `merge_quota_results`

Currently `merge_quota_results` takes `per_cluster_results: list[list[ScoredPoint]]` and returns `list[str]` (deduped IDs). Change it to also return the cluster mapping.

```python
def merge_quota_results(
    per_cluster_results: list[list["ScoredPoint"]],
    clusters: list[Cluster],
    floor_per_cluster: int = 3,  # F_min from CLAUDE.md
) -> tuple[list[str], PaperClusterMap]:
    """
    Importance-weighted quota merge. Returns:
      - merged_ids: deduped list of arxiv IDs (str)
      - paper_cluster_map: candidate_id -> source cluster index

    Convention: when a candidate appears in multiple clusters, the
    cluster with HIGHER importance wins. Iterate clusters sorted
    descending by importance to make first-write-wins do the right thing.
    """
    paper_cluster_map: PaperClusterMap = {}
    merged_ids: list[str] = []
    seen: set[str] = set()

    # Stable sort by importance descending; preserve original index for lookup.
    order = sorted(
        range(len(clusters)),
        key=lambda i: clusters[i].importance,
        reverse=True,
    )
    for cluster_idx in order:
        hits = per_cluster_results[cluster_idx]
        # F_min floor: take at least floor_per_cluster (or all if shorter).
        # The full quota math from Phase 5 lives elsewhere and is not shown
        # here; this loop is only the accounting step.
        for hit in hits:
            pid = str(hit.id)
            if pid in seen:
                continue
            seen.add(pid)
            merged_ids.append(pid)
            paper_cluster_map[pid] = cluster_idx
    return merged_ids, paper_cluster_map
```
**Update every caller** of `merge_quota_results` in `recommendations.py` to unpack the second return. Mypy / pyright will flag every site; fix them all.

### D.3 `app/routers/recommendations.py` - propagate the map

```python
merged_ids, paper_cluster_map = merge_quota_results(
    per_cluster_results, clusters
)

# ... after dedup and metadata fetch, valid_ids is a subset of merged_ids ...

# Per-candidate cluster index (aligned with valid_ids).
candidate_cluster_idx = np.asarray(
    [paper_cluster_map[pid] for pid in valid_ids],
    dtype=np.int32,
)

# Per-candidate cluster importance (slot 23, properly per-candidate).
per_candidate_importance = np.asarray(
    [clusters[idx].importance for idx in candidate_cluster_idx],
    dtype=np.float32,
)

# Per-candidate cluster medoid (used to compute slot 24 inside reranker).
# Stack medoids into a (N, 1024) array.
per_candidate_medoids = np.stack(
    [
        np.asarray(clusters[idx].medoid_embedding, dtype=np.float32)
        for idx in candidate_cluster_idx
    ],
    axis=0,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    qdrant_scores=qdrant_scores,
    cluster_importance=per_candidate_importance,  # (N,), was scalar in 6.1
    cluster_medoid=per_candidate_medoids,         # (N, 1024), was (1024,) in 6.1
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
```
### D.4 `app/recommend/reranker.py` - per-candidate slot 24

Inside the feature-matrix builder, slot 24 (`cluster_distance_to_medoid`) becomes a row-wise cosine:

```python
# Slot 24: cosine distance from each candidate to its OWN source-cluster medoid.
# N is the number of candidates (rows of candidate_embeddings).
# cluster_medoid shape:
#   (1024,)    -> 6.1 broadcast path (legacy)
#   (N, 1024)  -> 6.2 per-candidate path
if cluster_medoid is None:
    feat_24 = np.zeros(N, dtype=np.float32)
elif cluster_medoid.ndim == 1:
    # Broadcast cosine: same medoid for every candidate.
    medoid_norm = cluster_medoid / (np.linalg.norm(cluster_medoid) + 1e-9)
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    sims = cand_norms @ medoid_norm
    feat_24 = (1.0 - sims).astype(np.float32)  # distance, not similarity
else:
    # Per-row cosine: candidate i vs cluster_medoid[i].
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    med_norms = cluster_medoid / (
        np.linalg.norm(cluster_medoid, axis=1, keepdims=True) + 1e-9
    )
    sims = (cand_norms * med_norms).sum(axis=1)
    feat_24 = (1.0 - sims).astype(np.float32)
```
### D.5 Why 6.2 matters

A concrete example: user has clusters `{A: ML systems (importance 0.7), B: protein folding (importance 0.3)}`. A paper about *MLPerf benchmark methodology* gets retrieved by cluster A's Qdrant query; a paper about *AlphaFold 3 architecture* gets retrieved by cluster B's. Under 6.1, both get scored against cluster A's medoid (the dominant one), so the AlphaFold paper looks artificially "off-distribution" and gets ranked down. Under 6.2, the AlphaFold paper is scored against cluster B's medoid (where it belongs), and slot 24 correctly registers as a small distance. The model's learned weight on slot 24 then has the opportunity to *protect* exploration into the user's secondary interests, instead of reinforcing the dominant one.

This also makes Phase 9 (Exploration + CF) much cleaner: the exploration budget is naturally reasoned about in terms of "minority-cluster candidates that survived rerank."
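
The two-cluster example can be worked through numerically. The vectors below are made-up 2-D stand-ins for the 1024-dimensional embeddings, chosen only so that the AlphaFold paper sits close to cluster B's medoid; `cosine_distance` is a plain-Python version of the slot-24 formula.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; a zero vector is treated as similarity 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D embeddings (illustrative values, not real BGE-M3 vectors).
medoid_a = [1.0, 0.0]         # cluster A: ML systems (dominant)
medoid_b = [0.0, 1.0]         # cluster B: protein folding
alphafold_paper = [0.1, 0.9]  # retrieved by cluster B

slot24_under_61 = cosine_distance(alphafold_paper, medoid_a)  # vs dominant medoid
slot24_under_62 = cosine_distance(alphafold_paper, medoid_b)  # vs its own medoid
print(round(slot24_under_61, 3), round(slot24_under_62, 3))
```

Under the 6.1 broadcast the paper looks far from "its" cluster; under 6.2 the same paper registers a near-zero distance, which is the behavior this section argues for.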
### D.6 Integration test outline

**File:** `tests/recommend/test_phase62_per_candidate_cluster.py`

```python
import numpy as np

# `client` is assumed to be the FastAPI TestClient the test suite already provides.


def test_paper_cluster_map_threaded_through(monkeypatch, fake_three_cluster_state):
    """
    Given three clusters with different medoids, candidates retrieved
    from cluster B must produce slot 24 measured against B's medoid,
    not against the dominant A's medoid.
    """
    captured = {}

    def fake_predict(model, X):
        captured["X"] = X.copy()
        return np.arange(X.shape[0])[::-1].astype(float)

    monkeypatch.setattr("app.recommend.reranker._lgb_predict", fake_predict)

    response = client.get(
        "/recommend",
        headers={"X-User-Id": fake_three_cluster_state.user_id},
    )
    assert response.status_code == 200
    X = captured["X"]

    # Specific assertion: candidate 0 came from cluster A, candidate 5 from B.
    # Their slot-24 values must be different (they are scored vs different medoids).
    assert X[0, 24] != X[5, 24]

    # And slot 23 (cluster_importance) must match the source clusters'
    # importance values, not a single broadcast.
    assert len(set(X[:, 23].tolist())) > 1
```
---

## Part E - Phase 6.3: Deployment Verification + Monitoring (1 day)

### E.1 Verify the LightGBM model loads on HF Spaces

Two acceptable deployment strategies. Pick **one** and document it.

#### Option E.1.a - Commit the 974 KB file to the Space's Git repo (recommended for simplicity)

974 KB is well under the size at which the Hub requires Git LFS (10 MB, as far as I know), so a plain Git commit is enough. Just commit it.

```bash
# From your local ResearchIT working tree:
cd /path/to/ResearchIT
mkdir -p models/reranker-phase6/production_model
cp /path/to/reranker_v1.txt models/reranker-phase6/production_model/

# Make sure neither .gitignore nor .dockerignore excludes it.
grep -E "^models/?$|^models/reranker" .gitignore .dockerignore || echo "OK, not ignored"

git add models/reranker-phase6/production_model/reranker_v1.txt
git commit -m "Phase 6.3: ship LightGBM reranker artifact to Space"
git push hf main  # where 'hf' is the HF Space remote
```

`reranker.py`'s search paths already include this location, so no code change required.
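
Because a missing file degrades silently to the heuristic scorer, it helps to make the resolution step explicit and loggable. The sketch below is an assumption about how such a lookup could work, not a copy of `reranker.py`: the search list contains only the one path named in this document, and `resolve_model_path` / `scorer_mode` are hypothetical names.

```python
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical search order; the real list lives in reranker.py.
DEFAULT_SEARCH_PATHS = (
    Path("models/reranker-phase6/production_model/reranker_v1.txt"),
)


def resolve_model_path(
    search_paths: Sequence[Path] = DEFAULT_SEARCH_PATHS,
) -> Optional[Path]:
    """Return the first non-empty model file found, or None if none exists."""
    for path in search_paths:
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None


def scorer_mode(search_paths: Sequence[Path] = DEFAULT_SEARCH_PATHS) -> str:
    """Name the scorer that would be used, suitable for a startup log line."""
    if resolve_model_path(search_paths) is not None:
        return "lightgbm"
    return "heuristic_fallback"


print(scorer_mode())
```

Emitting this value once at startup turns "is the model actually loaded?" into something you can read from the Space logs.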
| #### Option E.1.b β Pull from HF Hub at container build time (cleaner separation of concerns) | |
| ```dockerfile | |
| # Add to Dockerfile, BEFORE the COPY . . step. | |
| # `local_dir_use_symlinks` is dropped here: recent huggingface_hub releases | |
| # deprecate it and always write real files into `local_dir`. | |
| RUN pip install --no-cache-dir huggingface_hub && \ | |
| python -c "\ | |
| from huggingface_hub import snapshot_download; \ | |
| snapshot_download( \ | |
| repo_id='siddhm11/researchit-reranker-phase6', \ | |
| local_dir='/app/models/reranker-phase6/production_model', \ | |
| allow_patterns=['*.txt', '*.json'], \ | |
| )" | |
| ``` | |
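| This treats the model repo as the source of truth and the Space as a consumer, which is closer to "real" MLOps. Cost: one extra layer in the build, plus a Hub download whenever that layer is rebuilt. The download happens at build time and the layer is cached after the first build, so the startup overhead stays small (~1s). | |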
| **Recommendation:** Option **E.1.a** for now (simpler, the file is tiny, and it removes a network dependency at container build). Move to E.1.b in Phase 7 when you have a model registry story. | |
| ### E.2 Add `/healthz/reranker` route | |
| **File:** `app/routers/health.py` (create if not present) | |
| ```python | |
| import hashlib | |
| import json | |
| from fastapi import APIRouter | |
| from app.recommend import reranker as _rr | |
| router = APIRouter() | |
| EXPECTED_FEATURE_NAMES = [ | |
|     "qdrant_score",  # 0 | |
|     "sim_long_term", "sim_short_term",  # 1, 2 | |
|     "sim_cluster_medoid", "sim_extra_4",  # 3, 4 | |
|     "sim_negative",  # 5 | |
|     # 6..22 paper-level | |
|     "recency_days_log", "citation_count_log", | |
|     "influential_citations_log", "is_primary_cs_lg", | |
|     "is_primary_cs_cl", "is_primary_cs_cv", | |
|     "is_primary_stat_ml", "is_primary_cs_ai", | |
|     "is_primary_cs_ir", "is_primary_other", | |
|     "abstract_len_log", "title_len", | |
|     "year", "month", "venue_is_top", | |
|     "n_authors_log", "has_code_link", | |
|     # 23..30 cluster + user | |
|     "cluster_importance", "cluster_distance_to_medoid", | |
|     "is_suppressed_category", "onboarding_category_match", | |
|     "user_total_saves", "user_total_dismissals", | |
|     "user_saves_last_7d", "user_dismissals_last_7d", | |
|     # 31..36 remaining metadata | |
|     "primary_topic_freq_in_user", "is_oa", "venue_age", | |
|     "abstract_topic_match", "ref_count_log", "is_arxiv_only", | |
| ] | |
| assert len(EXPECTED_FEATURE_NAMES) == 37 | |
| @router.get("/healthz/reranker") | |
| async def healthz_reranker(): | |
|     # Short fingerprint of the feature order; changes if any name or position changes. | |
|     schema_hash = hashlib.sha256( | |
|         json.dumps(EXPECTED_FEATURE_NAMES).encode() | |
|     ).hexdigest()[:12] | |
|     return { | |
|         "model_loaded": _rr.is_model_loaded(),  # True iff LightGBM Booster live | |
|         "model_path": _rr.get_loaded_model_path(),  # str or None | |
|         "model_version": "phase6.v1", | |
|         "fallback_active": not _rr.is_model_loaded(), | |
|         "feature_count": 37, | |
|         "feature_schema_hash": schema_hash, | |
|         "n_trees": _rr.get_num_trees(),  # 141 expected | |
|     } | |
| ``` | |
| Wire `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` into `reranker.py` as small accessors over the module-level Booster handle. | |
| **Then verify on the live Space:** | |
| ```bash | |
| curl -s https://siddhm11-researchit.hf.space/healthz/reranker | jq | |
| ``` | |
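| Expected JSON body: `"model_loaded": true, "n_trees": 141, "fallback_active": false`. If `model_loaded` is false or `fallback_active` is true, **the deployed image is silently running the heuristic**; if `n_trees` is not 141, something other than the expected `reranker_v1.txt` artifact was loaded. | |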
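The same check can be scripted instead of eyeballed. The sketch below is a hypothetical helper, not part of the codebase: `diagnose_health` and `EXPECTED` are illustrative names, and the expected values are the ones quoted in this document (141 trees, 37 features). It takes the already-parsed JSON body so it can run without network access.

```python
from typing import Any, Dict, List

# Expected values as stated in this document for reranker_v1.
EXPECTED = {"n_trees": 141, "feature_count": 37}


def diagnose_health(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems: List[str] = []
    if payload.get("model_loaded") is not True or payload.get("fallback_active") is not False:
        problems.append("heuristic fallback is active (LightGBM model not loaded)")
    for key, expected in EXPECTED.items():
        if payload.get(key) != expected:
            problems.append(f"{key}={payload.get(key)!r}, expected {expected}")
    return problems
```

Feeding it the body returned by the `curl` command above (for example via `json.loads`) gives a pass/fail signal that could later back a CI smoke test.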
| ### E.3 Logging: feature non-zero rate per request | |
| In `reranker.py`, after building the feature matrix `X`, add: | |
| ```python | |
| # Observability: emit a per-request feature-activation histogram so we | |
| # catch silent regressions to zero-filled features 23-30. | |
| if X.shape[0] > 0: | |
|     nonzero_rate = (X != 0).mean(axis=0)  # (37,) | |
|     logger.info( | |
|         "reranker.features", | |
|         extra={ | |
|             "feature_nonzero_rate": nonzero_rate.round(3).tolist(), | |
|             "feature_count": X.shape[1], | |
|             "n_candidates": X.shape[0], | |
|             "model_active": _model is not None, | |
|         }, | |
|     ) | |
| ``` | |
| Add a Prometheus-style assertion later (Phase 7), but for now structured-log lines are enough: you can grep them on the Space's log stream. | |
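Until that assertion exists, a small helper can turn the logged `feature_nonzero_rate` list into a yes/no signal. This is a sketch under assumptions: `dead_slots` and `USER_CLUSTER_SLOTS` are hypothetical names, and the input is the 37-element list from the log line above.

```python
from typing import Iterable, List, Sequence

# Slots 23..30 inclusive: the cluster + user features wired up in Phase 6.1/6.2.
USER_CLUSTER_SLOTS = range(23, 31)


def dead_slots(
    nonzero_rate: Sequence[float],
    slots: Iterable[int] = USER_CLUSTER_SLOTS,
    min_rate: float = 0.0,
) -> List[int]:
    """Return the slot indices whose non-zero rate is at or below `min_rate`."""
    return [i for i in slots if nonzero_rate[i] <= min_rate]
```

A single request with a dead slot is not necessarily a bug (a brand-new user can legitimately have zero dismissals in the last 7 days), so this is more useful aggregated over many requests than as a per-request alarm.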
| ### E.4 Bug B fix (Hungarian zero-vector fallback): bundle into 6.3 | |
| While we're in `recommendations.py`, fix the medoid-rebuild bug: | |
| ```python | |
| # OLD: | |
| # medoid_embedding=np.array(vectors[row["medoid_paper_id"]], dtype=np.float32) | |
| #     if row["medoid_paper_id"] in vectors else np.zeros(1024, dtype=np.float32) | |
| # NEW: if the medoid paper's embedding isn't in the immediate vectors dict, | |
| # fall back to the previously-stored medoid_embedding from the cluster row | |
| # (which is what we persisted last cycle), NOT a zero vector. A zero-vector | |
| # fallback breaks Hungarian assignment: cosine similarity against a zero | |
| # vector is undefined (0 or NaN in practice), so it can never clear the | |
| # 0.5 acceptance threshold, and the cluster identity is orphaned. | |
| # (This fragment runs inside the per-row loop of the cluster reload.) | |
| if row["medoid_paper_id"] in vectors: | |
|     medoid_emb = np.asarray(vectors[row["medoid_paper_id"]], dtype=np.float32) | |
| elif row["medoid_embedding_blob"] is not None: | |
|     # np.frombuffer returns a read-only view; copy so later code may modify it. | |
|     medoid_emb = np.frombuffer(row["medoid_embedding_blob"], dtype=np.float32).copy() | |
| else: | |
|     # Last resort: this cluster has no recoverable medoid. Mark for re-seed | |
|     # rather than silently masking with zeros. | |
|     logger.warning( | |
|         "cluster.medoid.unrecoverable", extra={"cluster_id": row["cluster_id"]} | |
|     ) | |
|     continue  # skip this cluster row; it'll be rebuilt on next Ward run | |
| ``` | |
| This requires adding `medoid_embedding_blob BLOB` to the cluster persistence schema if it's not already there, which is a one-line ALTER on Turso. | |
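The migration and the byte round-trip can be sketched locally. Assumptions: the standard-library `sqlite3` module stands in for Turso, the two-column `clusters` table is a hypothetical minimal version of the real schema, and `array("f")` stands in for a float32 NumPy vector (`vec.astype(np.float32).tobytes()` on write, `np.frombuffer(blob, dtype=np.float32)` on read).

```python
import sqlite3
from array import array

DIM = 1024  # embedding width used throughout this document

conn = sqlite3.connect(":memory:")
# Hypothetical minimal table; the real `clusters` schema has more columns.
conn.execute("CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY, medoid_paper_id TEXT)")
conn.execute("INSERT INTO clusters VALUES (1, 'paper-a')")

# The one-line migration: rows that already exist get NULL in the new column.
conn.execute("ALTER TABLE clusters ADD COLUMN medoid_embedding_blob BLOB")


def load_blob(cluster_id: int):
    """Fetch the raw blob for one cluster (None if never persisted)."""
    row = conn.execute(
        "SELECT medoid_embedding_blob FROM clusters WHERE cluster_id = ?",
        (cluster_id,),
    ).fetchone()
    return row[0]


before = load_blob(1)  # NULL until the next persistence cycle writes a medoid

# Persist a float32 medoid as raw bytes, then read it back.
medoid = array("f", [0.25] * DIM)
conn.execute(
    "UPDATE clusters SET medoid_embedding_blob = ? WHERE cluster_id = ?",
    (medoid.tobytes(), 1),
)
restored = array("f")
restored.frombytes(load_blob(1))
```

Because pre-existing rows come back as NULL, the first reload after the migration still takes the `else` branch for any cluster whose medoid paper is missing from `vectors`; the blob fallback only helps once a persistence cycle has written it.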
| --- | |
| ## Part F - Phase 6.4: Retraining Strategy WITHOUT Real User Signal | |
| ### F.1 The honest framing | |
| The current model is *not broken*. It is correctly trained on the only labels we had access to (citation pseudo-labels), and on its native distribution it scores nDCG@10 = 0.879. The gap between that number and "what users feel" is a **deployment gap**, not a model gap. | |
| Retraining only makes sense once one of the following is true: | |
| - We have a training distribution where features 23–30 carry **non-trivial, behaviourally meaningful** signal, **or** | |
| - We have real user labels (saves/dismisses/dwell-time) at sufficient scale. | |
| Today, neither holds. So Phase 6.4 is a **decision document**, not a build task. | |
| ### F.2 Three options, compared | |
| | Option | Feasibility | Fidelity to real users | Engineering cost | Risk | | |
| |---|---|---|---|---| | |
| | **(i) Wait for real users.** Threshold: 100 users with ≥10 saves each, which yields ~1,000 labelled (user, paper, +/-) tuples (per week, if that save rate is sustained), enough for a weekly retrain on slots 23–30. | High (zero engineering until threshold hit) | Highest possible: actual ground truth | Zero now, ~1 week to build the labelling pipeline once we cross the threshold | **You may never cross the threshold.** Single-developer hobby project. | |
| | **(ii) Synthetic user simulation.** Generate N=1,000 synthetic users. For each, draw a "true interest profile" as a mixture over arXiv categories. Sample 30–60 "saves" from each profile by drawing papers from citation neighborhoods of seed papers in those categories. Run those saves through the **actual** EWMA + Ward + medoid pipeline to produce real `cluster_importance`, real `cluster_distance_to_medoid`, real `onboarding_category_match` (using the synthetic user's category profile as the onboarding set). Label triples by citation: a citation-neighbor of any of the user's saved papers is positive; a random paper outside the user's neighborhood is negative. | Medium-high: every component already exists; the simulation is glue code | Moderate: captures *structure* of the signal (cluster importance scales correctly, suppressed categories actually suppress) but doesn't capture true user *preference noise* | 5–8 days: write `simulate_users.py`, regenerate triples (~200K with non-zero slots 23-30), retrain LightGBM, re-eval | The model learns the simulator's biases. Fight this by holding out real user data (whenever it arrives) as a clean eval set; if the simulator-trained model dramatically outperforms a heuristic baseline on real-user holdout, you've validated the simulator. If not, you've wasted a week. | |
| | **(iii) Self-distillation (defer to Phase 8).** Use the heuristic scorer + Qdrant retrieval rank as a soft label. Train a fresh LightGBM to mimic those scores, then iterate: ship → log new soft labels from the deployed model itself → retrain on log data. | Medium: needs a logging schema and an offline training loop | Low initially (only as good as the teacher), but improves once real users arrive and label-corruption from the teacher decays | 2–3 weeks: full label-store, replay infra, offline training pipeline | **Cold-start collapse:** if the teacher is worse than the student needs, distillation locks in mediocrity. Only safe once we have *any* real user signal to anchor the distribution. | |
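To make option (ii) concrete, here is a minimal sketch of its first two steps only: drawing an interest profile and sampling saves. Everything in it is an assumption for illustration. `simulate_user` and `pools` are hypothetical names, `pools` stands in for "citation neighborhoods of seed papers" as a plain category-to-paper-id mapping, and the flat Dirichlet prior is one arbitrary choice of mixture. Running the saves through EWMA + Ward + medoid and labelling triples is deliberately left out.

```python
import random
from typing import Dict, List


def simulate_user(
    rng: random.Random,
    pools: Dict[str, List[str]],
    n_interests: int = 3,
) -> Dict[str, object]:
    """Draw one synthetic user: a category mixture plus 30-60 sampled saves."""
    cats = rng.sample(sorted(pools), k=min(n_interests, len(pools)))
    # Normalised Gamma(1, 1) draws are a sample from a flat Dirichlet.
    raw = [rng.gammavariate(1.0, 1.0) for _ in cats]
    total = sum(raw)
    profile = {c: w / total for c, w in zip(cats, raw)}

    saves: List[str] = []
    for _ in range(rng.randint(30, 60)):
        # Pick a category by profile weight, then a paper from its pool.
        cat = rng.choices(cats, weights=[profile[c] for c in cats], k=1)[0]
        saves.append(rng.choice(pools[cat]))  # duplicates are possible
    return {"profile": profile, "saves": saves}
```

Seeding the `random.Random` instance per user keeps the dataset reproducible, which matters if `researchit-reranker-data-v2` is meant to be regenerable from public artifacts.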
| ### F.3 Recommendation | |
| **Adopt a staged path:** | |
| 1. **Now (Phase 6.4a):** Do *not* retrain. Ship 6.1 → 6.2 → 6.3 with the existing model. Document that the model's effective coverage is ~15/37 features at training time but improves to ~37/37 at serving time *as soon as 6.1 lands*. The model can only use what it learned (a tree model has no splits on features that were constant during training, so v1's scores should not react to the newly populated slots), but the framework is now wired correctly for the next training run. | |
| 2. **+30 days (Phase 6.4b):** Build option **(ii)**, synthetic user simulation. Spec the simulator in `scripts/simulate_users.py`, produce a new dataset version `siddhm11/researchit-reranker-data-v2` with slots 23–30 populated, retrain to `reranker_v2.txt`. Compare v1 vs v2 on a held-out split *and* on whatever sliver of real-user data has accrued by then. | |
| 3. **+90 days or 100-user threshold (whichever first) (Phase 6.4c):** If real-user data exists, do option **(i)**. Train `reranker_v3` on real saves/dismisses with synthetic data as augmentation (50/50 mix or downweight synthetic). | |
| 4. **Phase 8+:** Bring in option **(iii)** as a continuous-update mechanism on top of v3. | |
| ### F.4 What to write in `docs/phases/PHASE6.md` today | |
| One paragraph, verbatim: | |
| > **Phase 6.4 retraining is deferred.** The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23–30 set to zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. **Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.** | |
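The gate in that paragraph is simple enough to encode, which makes the trigger auditable rather than a judgment call. A sketch, assuming a hypothetical helper `retraining_unblocked` that receives one save count per real user; the 100-user and 10-save thresholds are the ones stated above.

```python
from typing import Sequence


def retraining_unblocked(
    saves_per_user: Sequence[int],
    simulator_ready: bool,
    min_users: int = 100,
    min_saves: int = 10,
) -> bool:
    """True once either gate from the Phase 6.4 deferral note is satisfied."""
    # Gate (b): count only users who individually reach the save threshold.
    qualified = sum(1 for n in saves_per_user if n >= min_saves)
    # Gate (a): the synthetic-user simulator exists and has produced data.
    return simulator_ready or qualified >= min_users
```

Note that the user gate counts users, not total saves: 500 users with 9 saves each do not unblock retraining, even though they contribute more raw events than 100 users with 10.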
| --- | |
| ## Part G - Phase 6 Closeout Checklist | |
| ```markdown | |
| ### Phase 6.1 - Simplification Pass (1–2 days) | |
| - [ ] Move `suppressed = await db.get_suppressed_categories(...)` to BEFORE rerank call in app/routers/recommendations.py | |
| - [ ] Add `onboarding_categories = await db.get_user_category_filter(...)` next to it | |
| - [ ] Compute `user_total_saves`, `user_total_dismissals`, `user_saves_last_7d`, `user_dismissals_last_7d` | |
| - [ ] Add helper `build_qdrant_score_map()` in app/recommend/fusion.py | |
| - [ ] Add helper `align_qdrant_scores(valid_ids, score_map)` | |
| - [ ] Compute `dominant_importance` and `dominant_medoid_vec` | |
| - [ ] Replace 6-arg `rerank_candidates(...)` call with full kwargs version (Section C.2.4) | |
| - [ ] Update `rerank_candidates()` signature in app/recommend/reranker.py to accept new optional kwargs (Section C.2.5) | |
| - [ ] Wire slots 23–30 into `_build_feature_matrix` in reranker.py | |
| - [ ] Write test `tests/recommend/test_phase6_feature_matrix.py` asserting slots 23–30 non-zero | |
| - [ ] All existing tests pass (`pytest -q`) | |
| - [ ] Commit: "Phase 6.1: connect 37-feature reranker to live caller (dominant-cluster)" | |
| ### Phase 6.2 - Full per-candidate plumbing (3–5 days) | |
| - [ ] Add `PaperClusterMap = dict[str, int]` type alias in app/recommend/types.py | |
| - [ ] Modify `merge_quota_results()` in app/recommend/fusion.py to return `(merged_ids, paper_cluster_map)` | |
| - [ ] Update ALL callers of `merge_quota_results` (grep first; fix every one) | |
| - [ ] In recommendations.py, build `candidate_cluster_idx` aligned with `valid_ids` | |
| - [ ] Build `per_candidate_importance` (N,) and `per_candidate_medoids` (N, 1024) | |
| - [ ] Pass arrays (not scalars) for `cluster_importance` and `cluster_medoid` to reranker | |
| - [ ] Update `_build_feature_matrix` slot-24 logic to handle both (1024,) and (N, 1024) medoid shapes | |
| - [ ] Write test `tests/recommend/test_phase62_per_candidate_cluster.py` (Section D.6) | |
| - [ ] Commit: "Phase 6.2: per-candidate cluster identity through reranker" | |
| ### Phase 6.3 - Deployment verification + Bug B | |
| - [x] Decide deployment strategy: E.1.a (commit) vs E.1.b (snapshot_download). Used E.1.a. | |
| - [x] Verify `models/reranker-phase6/production_model/reranker_v1.txt` is in working tree, not gitignored, not dockerignored | |
| - [x] Push to HF Space; wait for build; check the Space's container logs for "[reranker] LightGBM model loaded" | |
| - [x] Add `/healthz/reranker` route (Section E.2) | |
| - [x] Add `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` accessors | |
| - [x] `curl https://siddhm11-researchit.hf.space/healthz/reranker` → confirm `model_loaded: true, n_trees: 141` | |
| > *Verified live at 2026-05-03: `model_loaded=true, n_trees=141, fallback_active=false, feature_count=37, feature_schema_hash=5d0b3de7b0c1`.* | |
| - [x] Add per-request `reranker.features` log line with `feature_nonzero_rate` | |
| - [x] Fix Bug B: medoid_embedding_blob fallback in cluster reload (Section E.4) | |
| - [x] Add `medoid_embedding_blob BLOB` column to clusters table (SQLite ALTER migration) | |
| - [x] Update CLAUDE.md / model card to reflect deployment story | |
| ### Phase 6 documentation | |
| - [ ] Write `docs/phases/PHASE6.md` retraining decision (Section F.4) | |
| - [ ] Update `README.md` test count (will increase by ~2 from 6.1 + 6.2 tests) | |
| - [ ] Update `TASK-TRACKER.md`: tick off `[x] [reranker] LightGBM model loaded` (after curl verifies it) | |
| - [ ] Backfill the HF model card with: `library_name: lightgbm`, `pipeline_tag: text-ranking`, `datasets: [siddhm11/researchit-reranker-data]`, training data description, feature schema (37 names), reported metrics (nDCG@10=0.879 on the 7,007-triple eval split, papers from ≥2023), and a clear "trained on citation pseudo-labels, NOT real user signal" disclaimer | |
| - [ ] Upload `03_train_lightgbm.py` to a new repo `siddhm11/researchit-reranker-training` (or to the dataset repo as a script) so retraining is reproducible from public artifacts | |
| ### Phase 6.4 - Retraining (decision only; no code yet) | |
| - [ ] Document deferral and the 30-day / 100-user trigger in PHASE6.md | |
| - [ ] Open a tracking issue: "Phase 6.4b: synthetic user simulator" (target: +30d) | |
| - [ ] Open a tracking issue: "Phase 6.4c: real-user retrain at 100-user threshold" | |
| ``` | |
| --- | |
| ## Part H - What Phase 6 Is NOT (scope boundary) | |
| Phase 6 is **integration of the existing trained reranker, plus the deployment story for it**. The items below are specifically *out of scope* for Phase 6; they are their own phases, with their own framing docs to come: | |
| | Out of scope | Belongs to | Why it's separate | | |
| |---|---|---| | |
| | Offline regression harness, time-split eval framework, A/B test infrastructure | **Phase 7: Evaluation Framework** | Requires a held-out user-interaction log and a separate "shadow ranker" infra. Cannot be built before 6.3 ships, because we need stable feature semantics first. | |
| | LLM-generated paper summaries; cross-encoder distilled into LightGBM features | **Phase 8: LLM Summaries + Distilled Reranker** | Adds new feature slots (37 → 40+) and breaks the schema; must be a model version bump, not a 6.x patch. | |
| | Exploration bandits (UCB / Thompson over cluster heads), collaborative-filtering co-save signal | **Phase 9: Exploration + CF** | Needs a user population large enough for CF to be non-degenerate, and an exploration budget that Phase 6's MMR=0.6 currently approximates as a stopgap. | |
| | Migrating to a different reranker family (e.g. cross-encoder, ColBERT, BGE-reranker-v2-m3) | **Phase 10+** | Explicitly forbidden in serving by CLAUDE.md rule 4. LightGBM is the serving model; anything heavier is a teacher in distillation, not a serving model. | | |
| | Replacing citation pseudo-labels with click-through (CTR) labels from production | **Phase 6.4c** (real-user retrain) | Triggered by traffic threshold, not by code. | |
| **The tightest possible definition of "Phase 6 done":** | |
| > *Every candidate that arrives at the LightGBM ranker is described by a 37-dimensional feature vector in which slots 23–30 carry per-candidate signal derived from the user's actual cluster state and onboarding/category history; the ranker's inference is verified live on `siddhm11-researchit.hf.space` via `/healthz/reranker`; and a feature non-zero rate is logged per request. Retraining is deferred and documented.* | |
| When that sentence is true, Phase 6 ships and Phase 7 begins. | |
| --- | |
| ## Caveats and unknowns (full disclosure) | |
| 1. **The HF model card itself was not directly readable** from the verification environment used for this framing; only the structured Hub API metadata was (it exposes `tags`, `created_at`, `last_modified`, `pipeline_tag`, `library_name`, `description`, etc., all of which were null/empty for the model repo). The conclusion that "the README is empty" is an inference from the absent metadata fields a populated model card would normally surface, not a direct read of the file. **Action item for Amin:** open `https://huggingface.co/siddhm11/researchit-reranker-phase6` in a browser, look at the README and the Files tab, and confirm: (a) is `reranker_v1.txt` present at exactly 974 KB? (b) is there any README content at all? (c) are training scripts present? Adjust the model-card and training-script items under "Phase 6 documentation" in the Part G checklist accordingly. | |
| 2. **The 90,993 / 7,007 triple counts and the 141-tree figure come from the audit**, not from a re-derivation against the live HF artifacts. The HF dataset metadata only confirms the order of magnitude (`100K<n<1M` rows, Parquet). If the real numbers differ, the framing logic does not change; only the diagnostic prose in Part B does. | |
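| 3. **The Space's actual loaded model state was unknown when this framing was written**, because it cannot be determined without running the `curl /healthz/reranker` step in Section E.2. At that point the TASK-TRACKER's unchecked box was the only authoritative signal, and it said "not verified." The Phase 6.3 checklist in Part G now records a live verification dated 2026-05-03 (`model_loaded=true, n_trees=141, fallback_active=false`); the TASK-TRACKER box still has to be ticked to match. | |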
| 4. **The synthetic-user simulator (Phase 6.4b) is plausible but unproven.** Its quality depends entirely on whether citation neighborhoods are a good proxy for "papers a user with interest profile X would save." That is an empirical question; the simulator is worth building only because the alternative is "do nothing until users arrive." |