
PHASE 6 - Reranker Framing Document

File: docs/phases/PHASE6-Reranker-Framing.md
Project: ResearchIT (arXiv paper recommendation engine)
Author: Amin (siddhm11)
Status: Phase 6, In Integration (post-audit framing)
Supersedes: the open items from PHASE6-Audit.md
Live Space: https://siddhm11-researchit.hf.space (HF Spaces, Docker SDK, RUNNING, cpu-basic)
Reranker model repo: https://huggingface.co/siddhm11/researchit-reranker-phase6
Training data repo: https://huggingface.co/datasets/siddhm11/researchit-reranker-data


TL;DR (for Amin's future self at 1AM)

  1. The HF model exists and is public, but the model card is empty (no pipeline_tag, no library_name, no description, no metrics block). The reproducibility story currently lives on your laptop, not in the repo. The dataset repo siddhm11/researchit-reranker-data does exist as a public Parquet dataset (size_categories: 100K–1M, downloads: 30); that is the only piece of the training pipeline that is partially public.
  2. The model was NOT trained on real user signal. It was trained on citation pseudo-labels (Semantic Scholar citation edges → triples → LightGBM LambdaRank). Features 23–30 (cluster importance, suppressed-category, onboarding match, user save/dismiss counts) were zero during training because no users existed, so zeros at serving time are consistent with training, not a regression. The 0.879 nDCG@10 number is honest under that distribution; it is not honest as a measure of "what the user feels."
  3. Phase 6 is therefore about plumbing, not retraining. Land Phase 6.1 (dominant-cluster shortcut, 1–2 days), then Phase 6.2 (per-candidate paper_cluster_map plumbing, 3–5 days), then Phase 6.3 (deployment verification + /healthz/reranker, ~1 day). Defer retraining (Phase 6.4) until either (a) you ship synthetic-user simulation, or (b) you reach ~100 real users with ≥10 saves each.
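
The retraining gate in item 3 can be written down as a small predicate so it is checked rather than remembered. This is only a sketch: `ready_to_retrain`, its arguments, and the shape of `saves_per_user` are hypothetical names introduced here, and the thresholds are the approximate ones from the list above.

```python
from collections.abc import Mapping


def ready_to_retrain(
    saves_per_user: Mapping[str, int],
    synthetic_sim_shipped: bool = False,
    min_users: int = 100,   # "~100 real users" from the TL;DR
    min_saves: int = 10,    # ">= 10 saves each" from the TL;DR
) -> bool:
    """Return True when either Phase 6.4 precondition holds.

    saves_per_user maps a user id to that user's total save count
    (hypothetical input; the real source would be the interaction store).
    """
    if synthetic_sim_shipped:
        return True
    qualified = sum(1 for n in saves_per_user.values() if n >= min_saves)
    return qualified >= min_users


# 99 qualifying users is not enough; the 100th flips the gate.
almost = {f"user{i}": 10 for i in range(99)}
enough = {f"user{i}": 10 for i in range(100)}
gate_almost = ready_to_retrain(almost)
gate_enough = ready_to_retrain(enough)
```

Users with fewer than `min_saves` saves do not count toward the total, however many of them there are.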

Part A - Phase 6 Status Verification (HF inspection)

A.1 What is publicly visible on Hugging Face

I queried the HF Hub API directly (the public web pages were not reachable from this environment, but the structured API endpoints were). Here is the complete observable state of your HF account as of 2026-05-02:

| Repo | Type | Visibility | Created | Last modified | Notable |
|---|---|---|---|---|---|
| siddhm11/researchit-reranker-phase6 | model | public, not gated | 2026-04-26 | 2026-04-27 21:51 | README empty / no description, no pipeline_tag, no library_name, no metrics card, no datasets link |
| siddhm11/researchit-reranker-data | dataset | public, not gated | 2026-04-27 16:41 | 2026-04-27 21:51 | Parquet, 100K<n<1M, modality:tabular,text, 30 downloads |
| siddhm11/ResearchIT | space | public, RUNNING | 2026-04-19 | 2026-05-02 11:29 | Docker SDK, cpu-basic, declares BAAI/bge-m3 as a referenced model |
| siddhm11/prompt-engine | space | public, RUNNING | 2026-02-02 | 2026-03-21 | Docker, unrelated |
| siddhm11/sandbox-c6a4f7e6 | space | SLEEPING | 2026-04-26 | n/a | Sandbox |
| siddhm11/sandbox-9bb83c65 | space | SLEEPING | 2026-04-27 | n/a | Sandbox |

Critically absent from the model repo metadata: library_name (should be lightgbm), pipeline_tag (should be e.g. text-ranking or a custom tag), any model-index block, any datasets: link to siddhm11/researchit-reranker-data. Tags = region:us only.
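
As an illustration of what the missing metadata could look like, the sketch below renders a YAML front-matter block for the model repo's README using only the values named in the paragraph above. The helper `render_model_card_front_matter` is hypothetical and deliberately dependency-free; the `model-index` block is left out because its metric entries would need to come from the actual eval run.

```python
def render_model_card_front_matter(metadata: dict) -> str:
    """Render a flat dict (str or list-of-str values) as YAML front matter."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Values taken from the paragraph above; "text-ranking" is the example tag
# it suggests, not a confirmed choice.
CARD_METADATA = {
    "library_name": "lightgbm",
    "pipeline_tag": "text-ranking",
    "datasets": ["siddhm11/researchit-reranker-data"],
}

front_matter = render_model_card_front_matter(CARD_METADATA)
print(front_matter)
```

Pasting the printed block at the top of the model repo's README.md would be one way to link the model to its dataset; the card body (feature schema, training recipe) still has to be written by hand.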

A.2 What this means for reproducibility

The audit's working assumption ("Phase 6 was trained on 242K citation edges from 50K sampled papers, time-split (train <2023, eval ≥2023), 90,993 train + 7,007 eval triples, nDCG@10 = 0.879, 37 features, 141 trees, 974KB") is internally consistent but cannot be re-derived from the public HF artifacts alone today. The dataset repo exists (good: that is where the triples almost certainly live), but the model repo has no card linking the two, and there is no published training script.

What we can confirm vs what is missing:

| Asset (per audit expectation) | Public on HF? | Notes |
|---|---|---|
| reranker_v1.txt (LightGBM dump, ~974 KB) | Likely yes (it's the only point of the model repo) but the API does not expose siblings to confirm filename. | Action: open the Files tab in a browser and verify the filename + size. Without a card, downstream consumers don't know which file to load |
| Raw citation edges (output of 01_fetch_citation_edges.py) | Unknown / probably not | Not exposed by metadata. Likely only the triples are uploaded |
| Triples file (output of 02_generate_training_triples.py) | Probably yes: this is what researchit-reranker-data is for; the size-category and modalities match | Verify columns include query_paper_id, candidate_paper_id, relevance, the 37 features, and a group/qid column |
| Eval triples (≥2023 split) | Probably yes (same dataset, separate split) | Verify a split column or train/eval files |
| 03_train_lightgbm.py | Almost certainly NO: no Space or repo of yours hosts it | This is the single biggest reproducibility gap |
| Feature schema (37 names, in canonical order) | NO (would live in a model card) | Without this, even the LightGBM .txt dump (which uses Column_0…Column_36) is opaque |
| 01_fetch_citation_edges.py script | NO | Required to refresh edges for retraining |
| pseudo_label_generation.py / triple builder | NO | Required to regenerate the dataset under a new time-split |

Verdict on reproducibility: as of today, Phase 6 retraining is gated on Amin's local files. A fresh collaborator (or future Amin on a new laptop) cannot retrain the model from public artifacts alone. Fixing this is part of Phase 6.3 (see below).

A.3 Confirm or refute: was the published model trained on test data or real user signal?

Refuted on both counts. Three converging pieces of evidence:

  1. No user data could possibly exist. The Space siddhm11/ResearchIT was created 2026-04-19, only seven days before the model repo on 2026-04-26. There has been no production window long enough to accrue meaningful save/dismiss interaction logs, and the model repo has zero downloads and zero likes; this is a personal project with no user funnel.
  2. The dataset repo was created the day after the model. researchit-reranker-data is dated 2026-04-27, one day after the model went up, which suggests the dataset was uploaded as a documentation artifact rather than pulled from the Hub for training. That fits the citation-pseudo-label story: the triples were generated locally, the model was trained locally, both were pushed in close succession.
  3. The dataset metadata explicitly says modality:tabular, text and format:parquet with 100K<n<1M rows. That row-count band is roughly in line with the audit's stated counts (90,993 train + 7,007 eval = 98,000 triples), although 98,000 sits just below the band's 100K floor, so check the exact row count in the dataset viewer. Real user signal at this scale is implausible for a one-week-old project.

Conclusion: the model is trained on citation pseudo-labels, not on test data, not on user signal. The audit's account is correct. Features 23–30 were almost certainly zero during training (no user, no live cluster importance from a real session), which is why serving them as zero today does not break the model's distribution; it merely leaves potential signal on the table.

A.4 Is reranker_v1.txt actually deployed to the live Space?

Unverified from outside. The Space metadata shows it is RUNNING on cpu-basic and declares BAAI/bge-m3 as a model dependency, but does not declare siddhm11/researchit-reranker-phase6 as a dependency. That is a yellow flag: if the LightGBM model were being pulled at runtime via huggingface_hub.snapshot_download, the Space metadata would typically list it. The most likely current state is one of:

  • (a) the file was committed to the Space repo's working tree under models/reranker-phase6/production_model/reranker_v1.txt and is being loaded from disk (works, but means the same artifact is duplicated in two HF repos), or
  • (b) the file is not in the Space at all and the deployed code is silently falling back to the heuristic scorer.

Either way, the audit's TASK-TRACKER item [ ] [reranker] LightGBM model loaded is still the right thing to verify, and we wire that verification in via Phase 6.3 below.


Part B - Phase 6 Problem Statement, Properly Framed

Why Phase 6 exists. Phases 1–5 built a clean retrieval+ranking stack: BGE-M3 dense + sparse, Qdrant + Zilliz, Ward clustering of user history into long/short/negative interest centroids, importance-weighted per-cluster Qdrant search with floor F_min=3, and a heuristic linear scorer over a hand-tuned set of similarities. The heuristic worked, but it left two kinds of signal on the table: (i) non-linear interactions between paper-level features (recency × influential-citations × topic match), and (ii) user-state features like cluster importance, onboarding category match, and save/dismiss intensity. Phase 6 introduces a LightGBM LambdaRank model with a 37-slot feature vector designed to absorb both classes of signal and to give us a single optimizable objective (nDCG) instead of a hand-tuned weighted sum.

What Phase 6 currently delivers vs what it was designed to deliver. What ships today is a 141-tree LightGBM ranker, trained offline on citation pseudo-labels with a time-split eval, claiming nDCG@10 = 0.879, with a heuristic fallback when the model file is missing. What the live serving path actually feeds the model is not the full 37-feature vector the model was designed for: it is a vector where features 0, 1–4, 5, 6–22 carry signal and features 23–30 are zero-filled, because the integration in app/routers/recommendations.py still calls rerank_candidates with the legacy 6-argument signature from Phase 5. Bugs A, B, and C from the audit all stem from this single gap: the reranker module was upgraded; the caller was not.

The "9 of 37 features" gap, quantified.

| Feature slots | Count | Source available in caller scope? | Currently passed? |
|---|---|---|---|
| 0 qdrant_score | 1 | Yes, in per_cluster_results | No, not constructed |
| 1–4 long/short/medoid/extra similarities | 4 | Yes: lt_vec, st_vec, dominant medoid | Partial: only lt/st/neg are passed positionally |
| 5 neg similarity | 1 | Yes: neg_vec | Yes |
| 6–22 paper-level features | 17 | Yes, derivable in reranker.py from metadata | Yes (computed inside reranker) |
| 23 cluster_importance | 1 | Yes: clusters[i].importance | No |
| 24 cluster_distance_to_medoid | 1 | Yes, from clusters[i].medoid_embedding | No |
| 25 is_suppressed_category | 1 | Yes, but suppressed is loaded after the rerank call | No |
| 26 onboarding_category_match | 1 | Yes: db.get_user_category_filter() exists | No, never called for this |
| 27 user_total_saves | 1 | Yes: len(state.positives) | No |
| 28 user_total_dismissals | 1 | Yes: len(state.negatives) | No |
| 29 saves_last_7d | 1 | Yes, derivable from state.positives timestamps | No |
| 30 dismissals_last_7d | 1 | Yes, derivable from state.negatives timestamps | No |
| 31–36 remaining metadata features | 6 | Yes, computed inside reranker | Yes |

Effective active features at serving: ~9 + ~6 = ~15 of 37 carry real signal. Features 23–30 are all zero. This is Bug A, expressed quantitatively. It is not destroying ranking quality (the model has not learned to rely on those slots, because they were zero in training too), but it permanently caps how much Phase 6 can ever improve over the heuristic.

Phase 6 is a 3-stage fix: (6.1) connect the existing data flow with a dominant-cluster shortcut, (6.2) thread per-candidate cluster identity through the pipeline, (6.3) verify deployment + add observability. Retraining (6.4) is a separate decision, gated on whether we can produce a training distribution where features 23–30 are non-zero and behaviourally meaningful.


Part C - Phase 6.1: The Simplification Pass (1–2 days)

C.1 What 6.1 does

For each candidate in the rerank list, we compute features 23 and 24 against the dominant cluster, i.e. the cluster with the highest importance in the user's current state. Features 25 and 26 remain truly per-candidate (they depend on the candidate's primary_topic). Features 27–30 are user-level and constant across candidates in a single rerank call. Phase 6.1 is the minimum-viable fix that ends "9 of 37"; it will be replaced by 6.2's per-candidate cluster lookup.

Why dominant-cluster is acceptable for 6.1. The current model was trained with feature 23 = 0 everywhere, so its trees had no variation in that slot to split on, and its predictions are still driven by whatever signal it learned in features 6–22. We are not relying on 23 carrying perfect semantics; we are getting the integration plumbed end-to-end so that the feature non-zero rate jumps from ~40% to ~100%, and so that Phase 6.4 retraining has a target.

C.2 The exact code patch

File: app/routers/recommendations.py

C.2.1 Move the suppressed and onboarding_categories loading earlier

Find the existing block where suppressed is loaded (currently after the rerank_candidates call) and move it to immediately after state is hydrated and before the Qdrant retrieval loop. Add the onboarding category fetch right next to it.

# --- BEGIN Phase 6.1 patch (top of recommendations endpoint, after state hydration) ---
# Requires at module top: from datetime import datetime, timedelta, timezone

# Suppressed categories: was loaded after rerank; move it here.
suppressed: set[str] = set(await db.get_suppressed_categories(user_id))

# Onboarding categories: previously unused at rerank time.
onboarding_categories: set[str] = set(
    await db.get_user_category_filter(user_id) or []
)

# User-level interaction counts (constant across all candidates this request).
now_utc = datetime.now(timezone.utc)
seven_days_ago = now_utc - timedelta(days=7)

user_total_saves = len(state.positives)
user_total_dismissals = len(state.negatives)
user_saves_last_7d = sum(
    1 for p in state.positives if p.timestamp >= seven_days_ago
)
user_dismissals_last_7d = sum(
    1 for n in state.negatives if n.timestamp >= seven_days_ago
)

# --- END Phase 6.1 patch ---
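
To make the 7-day window concrete, here is a minimal self-contained sketch of the same counting logic. `Interaction` is a hypothetical stand-in for whatever objects live in state.positives / state.negatives, and the sketch assumes their timestamps are timezone-aware UTC datetimes; if they are naive, the comparison against `seven_days_ago` raises TypeError.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Interaction:
    """Hypothetical stand-in for one entry of state.positives / state.negatives."""
    paper_id: str
    timestamp: datetime  # assumed timezone-aware (UTC)


def count_since(events, cutoff):
    """Count events whose timestamp is at or after the cutoff."""
    return sum(1 for e in events if e.timestamp >= cutoff)


# Fixed "now" so the example is reproducible.
now_utc = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
seven_days_ago = now_utc - timedelta(days=7)

positives = [
    Interaction("2401.00001", now_utc - timedelta(days=1)),
    Interaction("2401.00002", now_utc - timedelta(days=7)),   # on the boundary: counted
    Interaction("2401.00003", now_utc - timedelta(days=30)),  # too old: not counted
]

user_total_saves = len(positives)                              # 3
user_saves_last_7d = count_since(positives, seven_days_ago)    # 2
```

The `>=` comparison makes the window inclusive at exactly seven days, which matches the patch above.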

C.2.2 Add a helper to align Qdrant scores with valid_ids

per_cluster_results is currently a list[list[ScoredPoint]] (one list per cluster). After dedup + filter into valid_ids, we lose the raw retrieval score. We need to project it back.

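# app/recommend/fusion.py  (or app/routers/recommendations.py if you keep helpers local)
import numpy as np
from qdrant_client.models import ScoredPoint  # used only in the type annotations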

def build_qdrant_score_map(
    per_cluster_results: list[list["ScoredPoint"]],
) -> dict[str, float]:
    """
    Collapse all per-cluster ScoredPoint lists into a single
    paper_id -> max_score dict. If a paper appeared in multiple
    clusters' top-K, we keep the maximum score (most charitable to
    the candidate; matches the dedup semantics in merge_quota_results).
    """
    out: dict[str, float] = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)  # arxiv IDs ALWAYS strings (CLAUDE.md rule 7)
            score = float(hit.score)
            if pid not in out or score > out[pid]:
                out[pid] = score
    return out


def align_qdrant_scores(
    valid_ids: list[str],
    score_map: dict[str, float],
) -> np.ndarray:
    """Return a float32 array of qdrant_scores aligned 1:1 with valid_ids.
    Missing entries default to 0.0 (which matches train-time behavior for
    candidates injected by exploration paths)."""
    return np.asarray(
        [score_map.get(pid, 0.0) for pid in valid_ids],
        dtype=np.float32,
    )
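
A quick usage sketch of the two helpers, restated compactly so the snippet runs on its own. `FakeHit` is a hypothetical stand-in for Qdrant's ScoredPoint (only `id` and `score` are read), and the scores are made-up values chosen to be exact in float32.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FakeHit:
    """Hypothetical stand-in for ScoredPoint; only id and score are used."""
    id: str
    score: float


def build_qdrant_score_map(per_cluster_results):
    """paper_id -> max score across all clusters that surfaced it."""
    out = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)
            out[pid] = max(out.get(pid, float("-inf")), float(hit.score))
    return out


def align_qdrant_scores(valid_ids, score_map):
    """float32 scores aligned 1:1 with valid_ids; missing ids get 0.0."""
    return np.asarray(
        [score_map.get(pid, 0.0) for pid in valid_ids], dtype=np.float32
    )


per_cluster_results = [
    [FakeHit("2401.00001", 0.5), FakeHit("2401.00002", 0.75)],  # cluster 0
    [FakeHit("2401.00001", 0.875)],                             # cluster 1, same paper
]
score_map = build_qdrant_score_map(per_cluster_results)
qdrant_scores = align_qdrant_scores(
    ["2401.00002", "2401.00001", "2401.99999"],  # last id was never retrieved
    score_map,
)
```

Paper 2401.00001 keeps its higher score (0.875), and the id that never came out of retrieval falls back to 0.0.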

C.2.3 Compute the dominant-cluster scalars

# Dominant cluster: highest-importance cluster in user state.
# clusters is the list[Cluster] you already have in scope.
if clusters:
    dominant = max(clusters, key=lambda c: c.importance)
    dominant_importance = float(dominant.importance)
    dominant_medoid_vec = np.asarray(
        dominant.medoid_embedding, dtype=np.float32
    )
else:
    # Cold-start path: no clusters -> defensible defaults.
    dominant_importance = 0.0
    dominant_medoid_vec = np.zeros(1024, dtype=np.float32)

C.2.4 The new rerank_candidates call

Replace the existing call (around line 305 of recommendations.py) with:

# --- BEGIN Phase 6.1 rerank call ---

qdrant_score_map = build_qdrant_score_map(per_cluster_results)
qdrant_scores = align_qdrant_scores(valid_ids, qdrant_score_map)

# Per-candidate boolean: is this candidate's primary_topic suppressed?
is_suppressed_category = np.asarray(
    [
        1.0 if (m.get("primary_topic") in suppressed) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

# Per-candidate boolean: does this candidate match any onboarding category?
onboarding_category_match = np.asarray(
    [
        1.0 if (m.get("primary_topic") in onboarding_categories) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    # Phase 6 additions (6.1: dominant-cluster shortcut)
    qdrant_scores=qdrant_scores,
    cluster_importance=np.full(
        len(valid_ids), dominant_importance, dtype=np.float32
    ),
    cluster_medoid=dominant_medoid_vec,           # broadcast in reranker
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
# --- END Phase 6.1 rerank call ---

C.2.5 Reranker signature change

File: app/recommend/reranker.py

def rerank_candidates(
    *,
    candidate_ids: list[str],
    candidate_embeddings: np.ndarray,           # (N, 1024)
    candidate_metadata: list[dict],
    long_term_vec: np.ndarray | None,
    short_term_vec: np.ndarray | None,
    negative_vec: np.ndarray | None,
    # Phase 6.1 additions: all optional with safe zero-defaults
    # so the legacy callers keep working during the migration window.
    qdrant_scores: np.ndarray | None = None,
    cluster_importance: np.ndarray | None = None,    # (N,) or scalar broadcast
    cluster_medoid: np.ndarray | None = None,        # (1024,) for 6.1, (N, 1024) for 6.2
    is_suppressed_category: np.ndarray | None = None,
    onboarding_category_match: np.ndarray | None = None,
    user_total_saves: int = 0,
    user_total_dismissals: int = 0,
    user_saves_last_7d: int = 0,
    user_dismissals_last_7d: int = 0,
) -> tuple[list[str], list[float], np.ndarray]:
    ...

Inside the feature-matrix builder, fill slots 23–30 from these new args, falling back to zero if None (preserves backward-compat for any unit tests that still call the legacy path).
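
One way the slot-filling step could look, as a sketch rather than the actual builder: `fill_cluster_and_user_slots` is a hypothetical helper, it takes an already-computed distance for slot 24 (the real builder derives it from cluster_medoid), and the slot order follows the 23–30 layout in the feature table above.

```python
import numpy as np

N_FEATURES = 37  # width of the Phase 6 feature vector


def fill_cluster_and_user_slots(
    X,
    *,
    cluster_importance=None,
    cluster_distance=None,
    is_suppressed_category=None,
    onboarding_category_match=None,
    user_total_saves=0,
    user_total_dismissals=0,
    user_saves_last_7d=0,
    user_dismissals_last_7d=0,
):
    """Write slots 23..30 of X in place; None falls back to zeros."""
    n = X.shape[0]

    def column(values):
        # Accept None, a scalar, or a length-n sequence.
        if values is None:
            return np.zeros(n, dtype=np.float32)
        return np.broadcast_to(np.asarray(values, dtype=np.float32), (n,))

    X[:, 23] = column(cluster_importance)
    X[:, 24] = column(cluster_distance)
    X[:, 25] = column(is_suppressed_category)
    X[:, 26] = column(onboarding_category_match)
    # User-level scalars are constant across candidates in one request.
    X[:, 27] = float(user_total_saves)
    X[:, 28] = float(user_total_dismissals)
    X[:, 29] = float(user_saves_last_7d)
    X[:, 30] = float(user_dismissals_last_7d)
    return X


# Legacy call: nothing passed, slots 23..30 stay zero.
X_legacy = fill_cluster_and_user_slots(np.zeros((3, N_FEATURES), dtype=np.float32))

# Phase 6.1 call: scalar importance is broadcast, per-candidate arrays are copied.
X_full = fill_cluster_and_user_slots(
    np.zeros((3, N_FEATURES), dtype=np.float32),
    cluster_importance=0.5,
    cluster_distance=[0.25, 0.5, 0.75],
    is_suppressed_category=[0, 1, 0],
    onboarding_category_match=[1, 0, 1],
    user_total_saves=12,
    user_total_dismissals=3,
    user_saves_last_7d=2,
    user_dismissals_last_7d=1,
)
```

Because every new argument defaults to a zero column, a legacy caller produces exactly the matrix it produced before the patch.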

C.3 The integration test (must be added)

File: tests/recommend/test_phase6_feature_matrix.py

import numpy as np
import pytest
from app.recommend.reranker import _build_feature_matrix  # expose for testing


def test_phase6_feature_matrix_is_not_mostly_zero(fake_user_state, fake_candidates):
    """
    Regression guard for Bug A. After Phase 6.1, features 23-30 must
    carry signal for at least one candidate in a typical request.
    """
    X = _build_feature_matrix(
        candidate_ids=fake_candidates.ids,
        candidate_embeddings=fake_candidates.embs,
        candidate_metadata=fake_candidates.meta,
        long_term_vec=fake_user_state.lt_vec,
        short_term_vec=fake_user_state.st_vec,
        negative_vec=fake_user_state.neg_vec,
        qdrant_scores=fake_candidates.qdrant_scores,   # non-trivial
        cluster_importance=np.full(len(fake_candidates.ids), 0.42),
        cluster_medoid=fake_user_state.dominant_medoid,
        is_suppressed_category=np.array([0, 0, 1, 0, 0], dtype=np.float32),
        onboarding_category_match=np.array([1, 0, 1, 0, 1], dtype=np.float32),
        user_total_saves=12,
        user_total_dismissals=3,
        user_saves_last_7d=2,
        user_dismissals_last_7d=1,
    )

    assert X.shape[1] == 37, f"Feature schema drifted: got {X.shape[1]} cols"

    # Per-feature non-zero rate. Slots 23..30 must each be >0 somewhere.
    nonzero_rate = (X != 0).mean(axis=0)
    for slot in range(23, 31):
        assert nonzero_rate[slot] > 0.0, (
            f"Feature {slot} is all zeros: Phase 6.1 plumbing regression"
        )

    # Aggregate sanity: at least 60% of feature slots should be active.
    assert (nonzero_rate > 0).mean() >= 0.6

C.4 What 6.1 fixes vs leaves open

| Bug | Status after 6.1 |
|---|---|
| A: caller uses legacy 6-arg signature | Fixed |
| B: Hungarian zero-vector fallback | Untouched (orthogonal; addressed separately, see Part E.4) |
| C: model deployment unverified | Untouched (Phase 6.3) |
| D: train/serve consistency | Improved on slot 23–30 non-zeroness, but the trained model's slot-23 weight is near-zero, so 6.1 will not move nDCG by much. That's expected. |

Part D - Phase 6.2: The Full Plumbing (3–5 days)

D.1 The data-structure change: paper_cluster_map

The architectural rule we are honoring: a paper retrieved for "Cluster A: ML systems" should be scored for its fit to that cluster, not to the user's dominant interest. Phase 6.1 violates this for slots 23 and 24 by broadcasting the dominant cluster's importance/medoid to every candidate. Phase 6.2 fixes it by attaching the source cluster index to each retrieved candidate and threading that through to the reranker.

Add to app/recommend/types.py (or wherever Cluster lives):

# Source-of-truth mapping: candidate paper_id -> index into clusters[].
# Built during per-cluster Qdrant retrieval, propagated through merge_quota,
# consumed by the reranker.
PaperClusterMap = dict[str, int]

If a candidate appears in multiple cluster top-Ks (which can happen with importance-weighted quota and a permissive K_max=7), the convention is first-write-wins by importance order, i.e. when iterating clusters in descending importance, the first cluster to surface a candidate "owns" it. This matches the heuristic that a paper that the user's strongest interest cluster also pulls is more naturally explained by that cluster.

D.2 app/recommend/fusion.py β€” merge_quota_results

Currently merge_quota_results takes per_cluster_results: list[list[ScoredPoint]] and returns list[str] (deduped IDs). Change it to also return the cluster mapping.

def merge_quota_results(
    per_cluster_results: list[list["ScoredPoint"]],
    clusters: list[Cluster],
    floor_per_cluster: int = 3,           # F_min from CLAUDE.md
) -> tuple[list[str], PaperClusterMap]:
    """
    Importance-weighted quota merge. Returns:
        - merged_ids: deduped list of arxiv IDs (str)
        - paper_cluster_map: candidate_id -> source cluster index
    Convention: when a candidate appears in multiple clusters, the
    cluster with HIGHER importance wins. Iterate clusters sorted
    descending by importance to make first-write-wins do the right thing.
    """
    paper_cluster_map: PaperClusterMap = {}
    merged_ids: list[str] = []
    seen: set[str] = set()

    # Stable sort by importance descending; preserve original index for lookup.
    order = sorted(
        range(len(clusters)),
        key=lambda i: clusters[i].importance,
        reverse=True,
    )

    for cluster_idx in order:
        hits = per_cluster_results[cluster_idx]
        # Apply F_min floor: take at least floor_per_cluster (or all if shorter).
        # The full quota math from Phase 5 lives elsewhere; this is the
        # accounting step.
        for hit in hits:
            pid = str(hit.id)
            if pid in seen:
                continue
            seen.add(pid)
            merged_ids.append(pid)
            paper_cluster_map[pid] = cluster_idx

    return merged_ids, paper_cluster_map

Update every caller of merge_quota_results in recommendations.py to unpack the second return. Mypy / pyright will flag every site; fix them all.
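
The ownership convention is easiest to see on a toy input. The sketch below re-implements only the accounting step with hypothetical stand-ins (`FakeCluster`, `FakeHit`, `merge_first_write_wins`); it omits the quota math, and the ids and scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeCluster:
    """Hypothetical stand-in for Cluster; only importance is used here."""
    importance: float


@dataclass
class FakeHit:
    """Hypothetical stand-in for ScoredPoint."""
    id: str
    score: float


def merge_first_write_wins(per_cluster_results, clusters):
    """Dedup hits; the most important cluster that surfaced a paper owns it."""
    order = sorted(
        range(len(clusters)),
        key=lambda i: clusters[i].importance,
        reverse=True,
    )
    merged_ids = []
    paper_cluster_map = {}
    for cluster_idx in order:
        for hit in per_cluster_results[cluster_idx]:
            pid = str(hit.id)
            if pid in paper_cluster_map:
                continue  # already owned by a more important cluster
            merged_ids.append(pid)
            paper_cluster_map[pid] = cluster_idx
    return merged_ids, paper_cluster_map


clusters = [FakeCluster(importance=0.3), FakeCluster(importance=0.7)]
per_cluster_results = [
    [FakeHit("2401.00001", 0.9), FakeHit("2401.00002", 0.8)],  # cluster 0 (minor)
    [FakeHit("2401.00002", 0.6), FakeHit("2401.00003", 0.5)],  # cluster 1 (dominant)
]
merged_ids, paper_cluster_map = merge_first_write_wins(per_cluster_results, clusters)
```

Note that 2401.00002 is assigned to cluster 1 even though cluster 0 scored it higher (0.8 vs 0.6): ownership follows cluster importance, not retrieval score. The map stores indices into the original `clusters` list, so it stays valid after the sort.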

D.3 app/routers/recommendations.py β€” propagate the map

merged_ids, paper_cluster_map = merge_quota_results(
    per_cluster_results, clusters
)

# ... after dedup and metadata fetch, valid_ids is a subset of merged_ids ...

# Per-candidate cluster index (aligned with valid_ids).
candidate_cluster_idx = np.asarray(
    [paper_cluster_map[pid] for pid in valid_ids],
    dtype=np.int32,
)

# Per-candidate cluster importance (slot 23, properly per-candidate).
per_candidate_importance = np.asarray(
    [clusters[idx].importance for idx in candidate_cluster_idx],
    dtype=np.float32,
)

# Per-candidate cluster medoid (used to compute slot 24 inside reranker).
# Stack medoids into a (N, 1024) array.
per_candidate_medoids = np.stack(
    [
        np.asarray(clusters[idx].medoid_embedding, dtype=np.float32)
        for idx in candidate_cluster_idx
    ],
    axis=0,
)

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    qdrant_scores=qdrant_scores,
    cluster_importance=per_candidate_importance,        # (N,), was scalar in 6.1
    cluster_medoid=per_candidate_medoids,               # (N, 1024), was (1024,) in 6.1
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)

D.4 app/recommend/reranker.py β€” per-candidate slot 24

Inside the feature-matrix builder, slot 24 (cluster_distance_to_medoid) becomes a row-wise cosine:

# Slot 24: cosine distance from each candidate to its OWN source-cluster medoid.
# cluster_medoid shape:
#   (1024,)         -> 6.1 broadcast path (legacy)
#   (N, 1024)       -> 6.2 per-candidate path
if cluster_medoid is None:
    feat_24 = np.zeros(N, dtype=np.float32)
elif cluster_medoid.ndim == 1:
    # Broadcast cosine: same medoid for every candidate.
    medoid_norm = cluster_medoid / (np.linalg.norm(cluster_medoid) + 1e-9)
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    sims = cand_norms @ medoid_norm
    feat_24 = (1.0 - sims).astype(np.float32)   # distance, not similarity
else:
    # Per-row cosine: candidate i vs cluster_medoid[i].
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    med_norms = cluster_medoid / (
        np.linalg.norm(cluster_medoid, axis=1, keepdims=True) + 1e-9
    )
    sims = (cand_norms * med_norms).sum(axis=1)
    feat_24 = (1.0 - sims).astype(np.float32)

D.5 Why 6.2 matters

A concrete example: user has clusters {A: ML systems (importance 0.7), B: protein folding (importance 0.3)}. A paper about MLPerf benchmark methodology gets retrieved by cluster A's Qdrant query; a paper about AlphaFold 3 architecture gets retrieved by cluster B's. Under 6.1, both get scored against cluster A's medoid (the dominant one), so the AlphaFold paper looks artificially "off-distribution" and gets ranked down. Under 6.2, the AlphaFold paper is scored against cluster B's medoid (where it belongs), and slot 24 correctly registers as a small distance. The model's learned weight on slot 24 then has the opportunity to protect exploration into the user's secondary interests, instead of reinforcing the dominant one.

This also makes Phase 9 (Exploration + CF) much cleaner: the exploration budget is naturally reasoned about in terms of "minority-cluster candidates that survived rerank."
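
The MLPerf / AlphaFold example can be reproduced numerically with toy 2-D vectors. This is a sketch under simplifying assumptions: the embeddings are invented unit vectors (real ones are 1024-d), and `cosine_distance_to_medoid` is a hypothetical helper that folds the broadcast and per-row branches into one function.

```python
import numpy as np


def cosine_distance_to_medoid(candidate_embeddings, cluster_medoid):
    """1 - cosine similarity; cluster_medoid is (D,) broadcast or (N, D) per-row."""
    cands = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    medoids = np.atleast_2d(cluster_medoid)  # (D,) becomes (1, D) and broadcasts
    medoids = medoids / (np.linalg.norm(medoids, axis=1, keepdims=True) + 1e-9)
    return (1.0 - (cands * medoids).sum(axis=1)).astype(np.float32)


# Toy geometry: the two interest clusters are orthogonal.
medoid_a = np.array([1.0, 0.0])   # cluster A: ML systems (dominant)
medoid_b = np.array([0.0, 1.0])   # cluster B: protein folding

candidates = np.array([
    [1.0, 0.0],   # MLPerf paper, retrieved by cluster A
    [0.0, 1.0],   # AlphaFold paper, retrieved by cluster B
])

# Phase 6.1: every candidate is measured against the dominant medoid.
dist_61 = cosine_distance_to_medoid(candidates, medoid_a)

# Phase 6.2: each candidate is measured against its own source medoid.
dist_62 = cosine_distance_to_medoid(candidates, np.stack([medoid_a, medoid_b]))
```

Under 6.1 the AlphaFold row lands at distance ~1.0 (maximally far for orthogonal vectors); under 6.2 both rows land at ~0.0, which is the behaviour described above.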

D.6 Integration test outline

File: tests/recommend/test_phase62_per_candidate_cluster.py

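import numpy as np


# `client` is assumed to be a FastAPI TestClient fixture defined in conftest.py.
def test_paper_cluster_map_threaded_through(monkeypatch, client, fake_three_cluster_state):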
    """
    Given three clusters with different medoids, candidates retrieved
    from cluster B must produce slot 24 measured against B's medoid,
    not against the dominant A's medoid.
    """
    captured = {}

    def fake_predict(model, X):
        captured["X"] = X.copy()
        return np.arange(X.shape[0])[::-1].astype(float)

    monkeypatch.setattr("app.recommend.reranker._lgb_predict", fake_predict)

    response = client.get(
        "/recommend",
        headers={"X-User-Id": fake_three_cluster_state.user_id},
    )

    X = captured["X"]
    # Specific assertion: candidate 0 came from cluster A, candidate 5 from B.
    # Their slot-24 values must be different (they are scored vs different medoids).
    assert X[0, 24] != X[5, 24]
    # And slot 23 (cluster_importance) must match the source clusters'
    # importance values, not a single broadcast.
    assert len(set(X[:, 23].tolist())) > 1

Part E - Phase 6.3: Deployment Verification + Monitoring (1 day)

E.1 Verify the LightGBM model loads on HF Spaces

Two acceptable deployment strategies. Pick one and document it.

Option E.1.a: Commit the 974 KB file to the Space's Git repo (recommended for simplicity)

974 KB is well under HF's 5 MB inline limit and well under any sane Git-LFS threshold. Just commit it.

# From your local ResearchIT working tree:
cd /path/to/ResearchIT
mkdir -p models/reranker-phase6/production_model
cp /path/to/reranker_v1.txt models/reranker-phase6/production_model/

# Make sure neither .gitignore nor .dockerignore excludes it.
grep -E "^models/?$|^models/reranker" .gitignore .dockerignore || echo "OK, not ignored"

git add models/reranker-phase6/production_model/reranker_v1.txt
git commit -m "Phase 6.3: ship LightGBM reranker artifact to Space"
git push hf main   # where 'hf' is the HF Space remote

reranker.py's search paths already include this location, so no code change required.

Option E.1.b: Pull from HF Hub at container build time (cleaner separation of concerns)

# Add to Dockerfile, BEFORE the COPY . . step
RUN pip install --no-cache-dir huggingface_hub && \
    python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download( \
    repo_id='siddhm11/researchit-reranker-phase6', \
    local_dir='/app/models/reranker-phase6/production_model', \
    local_dir_use_symlinks=False, \
    allow_patterns=['*.txt', '*.json'], \
)"

This treats the model repo as the source of truth and the Space as a consumer, which is closer to "real" MLOps. Cost: one extra layer in the build, ~1s at startup since it's cached after first build.

Recommendation: Option E.1.a for now (simpler, the file is tiny, and it removes a network dependency at container build). Move to E.1.b in Phase 7 when you have a model registry story.
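
Whichever option is chosen, the loader ends up walking a list of candidate paths and falling back to the heuristic when none resolves. A minimal sketch of that lookup follows; `resolve_model_path` is a hypothetical helper, not the existing search code in reranker.py, and the temporary directory only simulates the Space's working tree.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def resolve_model_path(search_paths: Iterable[Path]) -> Optional[Path]:
    """Return the first path that is an existing, non-empty file, else None."""
    for path in search_paths:
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    model_file = (
        Path(tmp) / "models" / "reranker-phase6" / "production_model" / "reranker_v1.txt"
    )

    # Before the artifact is shipped: nothing resolves, so the caller
    # would take the heuristic fallback.
    missing = resolve_model_path([model_file])

    # After shipping (simulated here with a placeholder file).
    model_file.parent.mkdir(parents=True)
    model_file.write_text("tree\n")
    found = resolve_model_path([Path(tmp) / "elsewhere.txt", model_file])
    found_name = found.name if found is not None else None
```

Rejecting zero-byte files guards against a partially written artifact, which would otherwise pass an existence check and fail later at load time.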

E.2 Add /healthz/reranker route

File: app/routers/health.py (create if not present)

import hashlib
import json
from fastapi import APIRouter
from app.recommend import reranker as _rr

router = APIRouter()

EXPECTED_FEATURE_NAMES = [
    "qdrant_score",                              # 0
    "sim_long_term", "sim_short_term",           # 1, 2
    "sim_cluster_medoid", "sim_extra_4",         # 3, 4
    "sim_negative",                              # 5
    # 6..22 paper-level
    "recency_days_log", "citation_count_log",
    "influential_citations_log", "is_primary_cs_lg",
    "is_primary_cs_cl", "is_primary_cs_cv",
    "is_primary_stat_ml", "is_primary_cs_ai",
    "is_primary_cs_ir", "is_primary_other",
    "abstract_len_log", "title_len",
    "year", "month", "venue_is_top",
    "n_authors_log", "has_code_link",
    # 23..30 cluster + user
    "cluster_importance", "cluster_distance_to_medoid",
    "is_suppressed_category", "onboarding_category_match",
    "user_total_saves", "user_total_dismissals",
    "user_saves_last_7d", "user_dismissals_last_7d",
    # 31..36 remaining metadata
    "primary_topic_freq_in_user", "is_oa", "venue_age",
    "abstract_topic_match", "ref_count_log", "is_arxiv_only",
]
assert len(EXPECTED_FEATURE_NAMES) == 37


@router.get("/healthz/reranker")
async def healthz_reranker():
    schema_hash = hashlib.sha256(
        json.dumps(EXPECTED_FEATURE_NAMES).encode()
    ).hexdigest()[:12]

    return {
        "model_loaded": _rr.is_model_loaded(),       # True iff LightGBM Booster live
        "model_path": _rr.get_loaded_model_path(),   # str or None
        "model_version": "phase6.v1",
        "fallback_active": not _rr.is_model_loaded(),
        "feature_count": 37,
        "feature_schema_hash": schema_hash,
        "n_trees": _rr.get_num_trees(),              # 141 expected
    }

Wire _rr.is_model_loaded(), _rr.get_loaded_model_path(), _rr.get_num_trees() into reranker.py as small accessors over the module-level Booster handle.

Then verify on the live Space:

curl -s https://siddhm11-researchit.hf.space/healthz/reranker | jq

Expected JSON body: "model_loaded": true, "n_trees": 141, "fallback_active": false. If any of those is wrong, the deployed image is silently running the heuristic.
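
The same check can be scripted so it fails loudly instead of relying on eyeballing the jq output. This is a sketch only: `health_problems` and `EXPECTED` are hypothetical names, and the expected values are the three listed above.

```python
from typing import Any, Dict, List

# Values a healthy deployment should report, per the expectations above.
EXPECTED = {"model_loaded": True, "fallback_active": False, "n_trees": 141}


def health_problems(body: Dict[str, Any]) -> List[str]:
    """Return one message per field that is missing or differs from EXPECTED."""
    problems = []
    for key, want in EXPECTED.items():
        got = body.get(key, "<missing>")
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems
```

Feed it the parsed JSON from the curl call (for example via `json.load(sys.stdin)`) and exit non-zero when the returned list is non-empty.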

E.3 Logging: feature non-zero rate per request

In reranker.py, after building the feature matrix X, add:

# Observability: emit a per-request feature-activation histogram so we
# catch silent regressions to zero-filled features 23-30.
if X.shape[0] > 0:
    nonzero_rate = (X != 0).mean(axis=0)   # (37,)
    logger.info(
        "reranker.features",
        extra={
            "feature_nonzero_rate": nonzero_rate.round(3).tolist(),
            "feature_count": X.shape[1],
            "n_candidates": X.shape[0],
            "model_active": _model is not None,
        },
    )

Add a Prometheus-style assertion later (Phase 7), but for now structured-log lines are enough: you can grep them on the Space's log stream.
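
One caveat on grepping: with Python's standard `logging`, fields passed through `extra=` become attributes on the log record but are not printed unless the formatter emits them. If the app does not already use a structured formatter (an assumption; the source does not say), a stdlib-only sketch could look like this. `JsonExtraFormatter` is a hypothetical name.

```python
import json
import logging

# Attribute names present on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonExtraFormatter(logging.Formatter):
    """Render the event name plus all `extra=` fields as one JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps the formatter from raising on non-JSON types.
        return json.dumps(payload, sort_keys=True, default=str)
```

Attach it with `handler.setFormatter(JsonExtraFormatter())`; each `reranker.features` line then carries `feature_nonzero_rate`, `n_candidates`, and `model_active` as greppable JSON keys.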

E.4 Bug B fix (Hungarian zero-vector fallback): bundle into 6.3

While we're in recommendations.py, fix the medoid-rebuild bug:

# OLD:
# medoid_embedding=np.array(vectors[row["medoid_paper_id"]], dtype=np.float32)
#   if row["medoid_paper_id"] in vectors else np.zeros(1024, dtype=np.float32)

# NEW: if the medoid paper's embedding isn't in the immediate vectors dict,
# fall back to the previously-stored medoid_embedding from the cluster row
# (which is what we persisted last cycle), NOT a zero vector. Zero-vector
# fallback breaks Hungarian assignment because cosine(*, 0) = 0 fails the
# 0.5 acceptance threshold and orphans the cluster identity.

if row["medoid_paper_id"] in vectors:
    medoid_emb = np.asarray(vectors[row["medoid_paper_id"]], dtype=np.float32)
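elif row["medoid_embedding_blob"] is not None:
    # np.frombuffer returns a read-only view over the stored bytes; copy so
    # downstream code can normalise or update the medoid in place.
    medoid_emb = np.frombuffer(row["medoid_embedding_blob"], dtype=np.float32).copy()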
else:
    # Last resort: this cluster has no recoverable medoid. Mark for re-seed
    # rather than silently masking with zeros.
    logger.warning(
        "cluster.medoid.unrecoverable", extra={"cluster_id": row["cluster_id"]}
    )
    continue   # skip this cluster row; it'll be rebuilt on next Ward run

This requires adding a medoid_embedding_blob BLOB column to the cluster persistence schema if it's not already there: a one-line ALTER on Turso.
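
A sketch of that migration made idempotent, plus the float32 round trip the blob relies on. It uses the stdlib `sqlite3` module as a stand-in for the Turso client (an assumption; the project's actual DB layer is not shown here), and `array('f')` instead of NumPy so the example has no dependencies. The helper names are hypothetical.

```python
import sqlite3
from array import array
from typing import Iterable, List, Optional


def ensure_medoid_blob_column(conn: sqlite3.Connection) -> bool:
    """Add medoid_embedding_blob to `clusters` if missing; True if it was added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = {row[1] for row in conn.execute("PRAGMA table_info(clusters)")}
    if "medoid_embedding_blob" in cols:
        return False
    conn.execute("ALTER TABLE clusters ADD COLUMN medoid_embedding_blob BLOB")
    conn.commit()
    return True


def encode_embedding(values: Iterable[float]) -> bytes:
    # array('f') stores C floats (4 bytes on mainstream platforms), the same raw
    # layout that np.frombuffer(..., dtype=np.float32) reads back.
    return array("f", values).tobytes()


def decode_embedding(blob: Optional[bytes]) -> Optional[List[float]]:
    """Inverse of encode_embedding; None stays None (no stored medoid)."""
    if blob is None:
        return None
    out = array("f")
    out.frombytes(blob)
    return out.tolist()
```

Running the column check before the ALTER makes the migration safe to re-run on every startup, which matters when the same image boots against both fresh and already-migrated databases.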


Part F - Phase 6.4: Retraining Strategy WITHOUT Real User Signal

F.1 The honest framing

The current model is not broken. It is correctly trained on the only labels we had access to (citation pseudo-labels), and on its native distribution it scores nDCG@10 = 0.879. The gap between that number and "what users feel" is a deployment gap, not a model gap.

Retraining only makes sense once one of the following is true:

  • We have a training distribution where features 23–30 carry non-trivial, behaviourally meaningful signal, or
  • We have real user labels (saves/dismisses/dwell-time) at sufficient scale.

Today, neither holds. So Phase 6.4 is a decision document, not a build task.

F.2 Three options, compared

| Option | Feasibility | Fidelity to real users | Engineering cost | Risk |
|---|---|---|---|---|
| (i) Wait for real users. Threshold: 100 users with ≥10 saves each → ~1,000 labelled (user, paper, +/-) tuples per week, enough for a weekly retrain on slots 23–30. | High (zero engineering until threshold hit) | Highest possible: actual ground truth | Zero now, ~1 week to build the labelling pipeline once we cross the threshold | You may never cross the threshold. Single-developer hobby project. |
| (ii) Synthetic user simulation. Generate N=1,000 synthetic users. For each, draw a "true interest profile" as a mixture over arXiv categories. Sample 30–60 "saves" from each profile by drawing papers from citation neighborhoods of seed papers in those categories. Run those saves through the actual EWMA + Ward + medoid pipeline to produce real cluster_importance, real cluster_distance_to_medoid, real onboarding_category_match (using the synthetic user's category profile as the onboarding set). Label triples by citation: a citation-neighbor of any of the user's saved papers is positive; a random paper outside the user's neighborhood is negative. | Medium-high: every component already exists; the simulation is glue code | Moderate: captures structure of the signal (cluster importance scales correctly, suppressed categories actually suppress) but doesn't capture true user preference noise | 5–8 days: write simulate_users.py, regenerate triples (~200K with non-zero slots 23–30), retrain LightGBM, re-eval | The model learns the simulator's biases. Fight this by holding out real user data (whenever it arrives) as a clean eval set; if the simulator-trained model dramatically outperforms a heuristic baseline on real-user holdout, you've validated. If not, you've wasted a week. |
| (iii) Self-distillation (defer to Phase 8). Use the heuristic scorer + Qdrant retrieval rank as a soft label. Train a fresh LightGBM to mimic those scores, then iterate: ship → log new soft labels from the deployed model itself → retrain on log data. | Medium: needs a logging schema and an offline training loop | Low initially (only as good as the teacher), but improves once real users arrive and label-corruption from the teacher decays | 2–3 weeks: full label-store, replay infra, offline training pipeline | Cold-start collapse: if the teacher is worse than the student needs, distillation locks in mediocrity. Only safe once we have any real user signal to anchor the distribution. |
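
The sampling and labelling steps of option (ii) can be sketched in a few dozen lines. Everything here is illustrative: the function names, the Dirichlet draw for the interest mixture, the `neighbors` dict standing in for Semantic Scholar citation neighborhoods, and the `(user_id, paper_id, label)` row format are all assumptions, and the EWMA + Ward + medoid stage is left out entirely.

```python
import random
from typing import Dict, List, Sequence, Tuple


def draw_interest_profile(categories: Sequence[str], rng: random.Random,
                          k: int = 3, alpha: float = 1.0) -> Dict[str, float]:
    """Pick k categories and draw Dirichlet(alpha) mixture weights over them."""
    chosen = rng.sample(list(categories), k)
    gammas = [rng.gammavariate(alpha, 1.0) for _ in chosen]
    total = sum(gammas)
    return {c: g / total for c, g in zip(chosen, gammas)}


def sample_saves(profile: Dict[str, float],
                 seeds_by_category: Dict[str, List[str]],
                 neighbors: Dict[str, List[str]],
                 n_saves: int, rng: random.Random) -> List[str]:
    """Draw saves from citation neighborhoods of seed papers in profile categories."""
    cats, weights = zip(*profile.items())
    saves = []
    for _ in range(n_saves):
        cat = rng.choices(cats, weights=weights, k=1)[0]
        seed = rng.choice(seeds_by_category[cat])
        saves.append(rng.choice(neighbors[seed] + [seed]))
    return saves


def label_rows(user_id: str, saves: List[str],
               neighbors: Dict[str, List[str]], all_papers: Sequence[str],
               rng: random.Random) -> List[Tuple[str, str, int]]:
    """Positive: citation neighbor of a save. Negative: random paper outside."""
    saved = set(saves)
    neighborhood = set()
    for paper in saved:
        neighborhood.update(neighbors.get(paper, []))
    positives = sorted(neighborhood - saved)
    outside = sorted(set(all_papers) - neighborhood - saved)
    rows = [(user_id, p, 1) for p in positives]
    if outside:
        # One sampled negative per positive keeps the classes balanced.
        rows += [(user_id, rng.choice(outside), 0) for _ in positives]
    return rows
```

Seeding `random.Random` per synthetic user makes each user's saves reproducible, which helps when comparing two simulator versions against the same population.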

F.3 Recommendation

Adopt a staged path:

  1. Now (Phase 6.4a): Do not retrain. Ship 6.1 → 6.2 → 6.3 with the existing model. Document that the model's effective coverage is ~15/37 features at training time but improves to ~37/37 at serving time as soon as 6.1 lands. The model can only use what it learned, but the framework is now wired correctly for the next training run.
  2. +30 days (Phase 6.4b): Build option (ii), the synthetic user simulation. Spec the simulator in scripts/simulate_users.py, produce a new dataset version siddhm11/researchit-reranker-data-v2 with slots 23–30 populated, retrain to reranker_v2.txt. Compare v1 vs v2 on a held-out split and on whatever sliver of real-user data has accrued by then.
  3. +90 days or 100-user threshold (whichever first) (Phase 6.4c): If real-user data exists, do option (i). Train reranker_v3 on real saves/dismisses with synthetic data as augmentation (50/50 mix or downweight synthetic).
  4. Phase 8+: Bring in option (iii) as a continuous-update mechanism on top of v3.
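
For step 3, "50/50 mix or downweight synthetic" can be made concrete as a per-row weight on the synthetic rows, with real rows kept at weight 1.0. The sketch below is one way to compute it; the function name is hypothetical, and the resulting weights would be passed through the `weight` argument of `lightgbm.Dataset`.

```python
def synthetic_row_weight(n_real: int, n_synth: int, synth_share: float = 0.5) -> float:
    """Weight per synthetic row so synthetic data carries `synth_share` of total weight.

    Real rows keep weight 1.0. Solving
        w * n_synth / (n_real + w * n_synth) = synth_share
    for w gives the expression returned below.
    """
    if n_real <= 0 or n_synth <= 0:
        raise ValueError("both row counts must be positive")
    if not 0.0 < synth_share < 1.0:
        raise ValueError("synth_share must be strictly between 0 and 1")
    return synth_share * n_real / ((1.0 - synth_share) * n_synth)
```

With illustrative counts of 1,000 real rows and 200,000 synthetic rows, a 50/50 mix gives each synthetic row a weight of 0.005, so the large synthetic set cannot drown out the real signal.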

F.4 What to write in docs/phases/PHASE6.md today

One short paragraph, verbatim:

Phase 6.4 retraining is deferred. The published model siddhm11/researchit-reranker-phase6 was trained on citation pseudo-labels with features 23–30 zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.
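
Outside the verbatim text, gate (b) is simple enough to encode as a check that can run against the saves table. This is a sketch under the reading that a user counts only once they have at least 10 saves; the function name and the input mapping are hypothetical.

```python
from typing import Mapping


def real_user_gate_met(saves_per_user: Mapping[str, int],
                       min_users: int = 100, min_saves: int = 10) -> bool:
    """True once at least `min_users` users each have `min_saves` or more saves."""
    qualifying = sum(1 for n in saves_per_user.values() if n >= min_saves)
    return qualifying >= min_users
```

Users below the per-user minimum do not count toward the threshold, however many of them there are.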


Part G - Phase 6 Closeout Checklist

### Phase 6.1 - Simplification Pass (1–2 days)
- [ ] Move `suppressed = await db.get_suppressed_categories(...)` to BEFORE rerank call in app/routers/recommendations.py
- [ ] Add `onboarding_categories = await db.get_user_category_filter(...)` next to it
- [ ] Compute `user_total_saves`, `user_total_dismissals`, `user_saves_last_7d`, `user_dismissals_last_7d`
- [ ] Add helper `build_qdrant_score_map()` in app/recommend/fusion.py
- [ ] Add helper `align_qdrant_scores(valid_ids, score_map)`
- [ ] Compute `dominant_importance` and `dominant_medoid_vec`
- [ ] Replace 6-arg `rerank_candidates(...)` call with full kwargs version (Section C.2.4)
- [ ] Update `rerank_candidates()` signature in app/recommend/reranker.py to accept new optional kwargs (Section C.2.5)
- [ ] Wire slots 23–30 into `_build_feature_matrix` in reranker.py
- [ ] Write test `tests/recommend/test_phase6_feature_matrix.py` asserting slots 23–30 non-zero
- [ ] All existing tests pass (`pytest -q`)
- [ ] Commit: "Phase 6.1: connect 37-feature reranker to live caller (dominant-cluster)"

### Phase 6.2 - Full per-candidate plumbing (3–5 days)
- [ ] Add `PaperClusterMap = dict[str, int]` type alias in app/recommend/types.py
- [ ] Modify `merge_quota_results()` in app/recommend/fusion.py to return `(merged_ids, paper_cluster_map)`
- [ ] Update ALL callers of `merge_quota_results` (grep first; fix every one)
- [ ] In recommendations.py, build `candidate_cluster_idx` aligned with `valid_ids`
- [ ] Build `per_candidate_importance` (N,) and `per_candidate_medoids` (N, 1024)
- [ ] Pass arrays (not scalars) for `cluster_importance` and `cluster_medoid` to reranker
- [ ] Update `_build_feature_matrix` slot-24 logic to handle both (1024,) and (N, 1024) medoid shapes
- [ ] Write test `tests/recommend/test_phase62_per_candidate_cluster.py` (Section D.6)
- [ ] Commit: "Phase 6.2: per-candidate cluster identity through reranker"

### Phase 6.3 - Deployment verification + Bug B
- [x] Decide deployment strategy: E.1.a (commit) vs E.1.b (snapshot_download). Used E.1.a.
- [x] Verify `models/reranker-phase6/production_model/reranker_v1.txt` is in working tree, not gitignored, not dockerignored
- [x] Push to HF Space; wait for build; check build logs for "[reranker] LightGBM model loaded"
- [x] Add `/healthz/reranker` route (Section E.2)
- [x] Add `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` accessors
- [x] `curl https://siddhm11-researchit.hf.space/healthz/reranker` → confirm `model_loaded: true, n_trees: 141`
  > *Verified live on 2026-05-03: `model_loaded=true, n_trees=141, fallback_active=false, feature_count=37, feature_schema_hash=5d0b3de7b0c1`.*
- [x] Add per-request `reranker.features` log line with `feature_nonzero_rate`
- [x] Fix Bug B: medoid_embedding_blob fallback in cluster reload (Section E.4)
- [x] Add `medoid_embedding_blob BLOB` column to clusters table (SQLite ALTER migration)
- [x] Update CLAUDE.md / model card to reflect deployment story

### Phase 6 documentation
- [ ] Write `docs/phases/PHASE6.md` retraining decision (Section F.4)
- [ ] Update `README.md` test count (will increase by ~2 from 6.1 + 6.2 tests)
- [ ] Update `TASK-TRACKER.md`: tick off `[x] [reranker] LightGBM model loaded` (after curl verifies it)
- [ ] Backfill the HF model card with: `library_name: lightgbm`, `pipeline_tag: text-ranking`, `datasets: [siddhm11/researchit-reranker-data]`, training data description, feature schema (37 names), reported metrics (nDCG@10=0.879 on 7,007-triple eval split, ≥2023 papers), and a clear "trained on citation pseudo-labels, NOT real user signal" disclaimer
- [ ] Upload `03_train_lightgbm.py` to a new repo `siddhm11/researchit-reranker-training` (or to the dataset repo as a script) so retraining is reproducible from public artifacts

### Phase 6.4 - Retraining (decision only; no code yet)
- [ ] Document deferral and the 30-day / 100-user trigger in PHASE6.md
- [ ] Open a tracking issue: "Phase 6.4b: synthetic user simulator" (target: +30d)
- [ ] Open a tracking issue: "Phase 6.4c: real-user retrain at 100-user threshold"

Part H - What Phase 6 Is NOT (scope boundary)

Phase 6 is integration of the existing trained reranker, plus the deployment story for it. The following are specifically out of scope for Phase 6; they are their own phases, with their own framing docs to come:

| Out of scope | Belongs to | Why it's separate |
|---|---|---|
| Offline regression harness, time-split eval framework, A/B test infrastructure | Phase 7 - Evaluation Framework | Requires a held-out user-interaction log and a separate "shadow ranker" infra. Cannot be built before 6.3 ships, because we need stable feature semantics first. |
| LLM-generated paper summaries; cross-encoder distilled into LightGBM features | Phase 8 - LLM Summaries + Distilled Reranker | Adds new feature slots (37 → 40+) and breaks the schema; must be a model version bump, not a 6.x patch. |
| Exploration bandits (UCB / Thompson over cluster heads), collaborative-filtering co-save signal | Phase 9 - Exploration + CF | Needs a user population large enough for CF to be non-degenerate, and an exploration budget that Phase 6's MMR=0.6 currently approximates as a stopgap. |
| Migrating to a different reranker family (e.g. cross-encoder, ColBERT, BGE-reranker-v2-m3) | Phase 10+ | Explicitly forbidden in serving by CLAUDE.md rule 4. LightGBM is the serving model; anything heavier is a teacher in distillation, not a serving model. |
| Replacing citation pseudo-labels with click-through CTR labels from production | Phase 6.4c (real-user retrain) | Triggered by traffic threshold, not by code. |

The tightest possible definition of "Phase 6 done":

Every candidate that arrives at the LightGBM ranker is described by a 37-dimensional feature vector in which slots 23–30 carry per-candidate signal derived from the user's actual cluster state and onboarding/category history; the ranker's inference is verified live on siddhm11-researchit.hf.space via /healthz/reranker; and a feature non-zero rate is logged per request. Retraining is deferred and documented.

When that sentence is true, Phase 6 ships and Phase 7 begins.


Caveats and unknowns (full disclosure)

  1. The HF model card itself was not directly readable from the verification environment used for this framing. Only the structured Hub API metadata was available (it exposes tags, created_at, last_modified, pipeline_tag, library_name, description, etc., all of which were null/empty for the model repo). The conclusion that "the README is empty" is an inference from the absent metadata fields a populated model card would normally surface, not a direct read of the file. Action item for Amin: open https://huggingface.co/siddhm11/researchit-reranker-phase6 in a browser, look at the README and the Files tab, and confirm: (a) is reranker_v1.txt present at exactly 974 KB? (b) is there any README content at all? (c) are training scripts present? Adjust the Phase 6.3 doc-backfill checklist items accordingly.
  2. The 90,993 / 7,007 triple counts and the 141-tree figure come from the audit, not from a re-derivation against the live HF artifacts. The HF dataset metadata only confirms the order of magnitude (100K<n<1M rows, Parquet). If the real numbers differ, the framing logic does not change; only the diagnostic prose in Part B does.
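  3. At the time of framing, the Space's actual loaded model state was unknown without running the curl /healthz/reranker step in Section E.2, and the TASK-TRACKER's unchecked box said "not verified." The Phase 6.3 checklist in Part G has since recorded a live verification on 2026-05-03 (model_loaded=true, n_trees=141, fallback_active=false); treat that entry as the current state.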
  4. The synthetic-user simulator (Phase 6.4b) is plausible but unproven. Its quality depends entirely on whether citation neighborhoods are a good proxy for "papers a user with interest profile X would save." That is an empirical question; the simulator is worth building only because the alternative is "do nothing until users arrive."