PHASE 6 - Reranker Framing Document
File: docs/phases/PHASE6-Reranker-Framing.md
Project: ResearchIT (arXiv paper recommendation engine)
Author: Amin (siddhm11)
Status: Phase 6, In Integration (post-audit framing)
Supersedes: the open items from PHASE6-Audit.md
Live Space: https://siddhm11-researchit.hf.space (HF Spaces, Docker SDK, RUNNING, cpu-basic)
Reranker model repo: https://huggingface.co/siddhm11/researchit-reranker-phase6
Training data repo: https://huggingface.co/datasets/siddhm11/researchit-reranker-data
TL;DR (for Amin's future self at 1AM)
- The HF model exists and is public, but the model card is empty (no pipeline_tag, no library_name, no description, no metrics block). The reproducibility story currently lives on your laptop, not in the repo. The dataset repo siddhm11/researchit-reranker-data does exist as a public Parquet dataset (size_categories: 100K–1M, downloads: 30); that is the only piece of the training pipeline that is partially public.
- The model was NOT trained on real user signal. It was trained on citation pseudo-labels (Semantic Scholar citation edges → triples → LightGBM LambdaRank). Features 23–30 (cluster importance, suppressed-category, onboarding match, user save/dismiss counts) were zero during training because no users existed, so zeroes at serving time are consistent with training, not a regression. The 0.879 nDCG@10 number is honest under that distribution; it is not honest as a measure of "what the user feels."
- Phase 6 is therefore about plumbing, not retraining. Land Phase 6.1 (dominant-cluster shortcut, ~1 day), then Phase 6.2 (per-candidate paper_cluster_map plumbing, ~3–5 days), then Phase 6.3 (deployment verification + /healthz/reranker). Defer retraining (Phase 6.4) until either (a) you ship synthetic-user simulation, or (b) you reach ~100 real users with ≥10 saves each.
Part A - Phase 6 Status Verification (HF inspection)
A.1 What is publicly visible on Hugging Face
I queried the HF Hub API directly (the public web pages were not reachable from this environment, but the structured API endpoints were). Here is the complete observable state of your HF account as of 2026-05-02:
| Repo | Type | Visibility | Created | Last modified | Notable |
|---|---|---|---|---|---|
| siddhm11/researchit-reranker-phase6 | model | public, not gated | 2026-04-26 | 2026-04-27 21:51 | README empty / no description, no pipeline_tag, no library_name, no metrics card, no datasets link |
| siddhm11/researchit-reranker-data | dataset | public, not gated | 2026-04-27 16:41 | 2026-04-27 21:51 | Parquet, 100K<n<1M, modality:tabular,text, 30 downloads |
| siddhm11/ResearchIT | space | public, RUNNING | 2026-04-19 | 2026-05-02 11:29 | Docker SDK, cpu-basic, declares BAAI/bge-m3 as a referenced model |
| siddhm11/prompt-engine | space | public, RUNNING | 2026-02-02 | 2026-03-21 | Docker, unrelated |
| siddhm11/sandbox-c6a4f7e6 | space | SLEEPING | 2026-04-26 | n/a | Sandbox |
| siddhm11/sandbox-9bb83c65 | space | SLEEPING | 2026-04-27 | n/a | Sandbox |
Critically absent from the model repo metadata: library_name (should be lightgbm), pipeline_tag (should be e.g. text-ranking or a custom tag), any model-index block, any datasets: link to siddhm11/researchit-reranker-data. Tags = region:us only.
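The missing metadata belongs in the YAML front matter at the top of the model repo's README.md. A minimal sketch of rendering that block with the standard library only: the field values are the ones suggested above, the helper name render_card_front_matter is hypothetical, and it handles only flat keys and string lists, so it is not a general YAML writer.

```python
def render_card_front_matter(fields: dict) -> str:
    """Render a flat dict (str or list-of-str values) as YAML front matter."""
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Values taken from the gaps listed above; adjust before publishing.
CARD_FIELDS = {
    "library_name": "lightgbm",
    "pipeline_tag": "text-ranking",
    "datasets": ["siddhm11/researchit-reranker-data"],
}

front_matter = render_card_front_matter(CARD_FIELDS)
```

Pasting front_matter at the top of README.md, followed by the prose card, would be enough to link the model to its dataset on the Hub.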
A.2 What this means for reproducibility
The audit's working assumption ("Phase 6 was trained on 242K citation edges from 50K sampled papers, time-split (train <2023, eval ≥2023), 90,993 train + 7,007 eval triples, nDCG@10 = 0.879, 37 features, 141 trees, 974KB") is internally consistent but cannot be re-derived from the public HF artifacts alone today. The dataset repo exists (good: that is where the triples almost certainly live), but the model repo has no card linking the two, and there is no published training script.
What we can confirm vs what is missing:
| Asset (per audit expectation) | Public on HF? | Notes |
|---|---|---|
| reranker_v1.txt (LightGBM dump, ~974 KB) | Likely yes (it's the only point of the model repo) but the API does not expose siblings to confirm filename. Action: open the Files tab in a browser and verify the filename + size. | Without a card, downstream consumers don't know which file to load |
| Raw citation edges (output of 01_fetch_citation_edges.py) | Unknown / probably not | Not exposed by metadata. Likely only the triples are uploaded |
| Triples file (output of 02_generate_training_triples.py) | Probably yes: this is what researchit-reranker-data is for; the modalities match and the size category is plausible | Verify columns include query_paper_id, candidate_paper_id, relevance, the 37 features, and a group/qid column |
| Eval triples (≥2023 split) | Probably yes (same dataset, separate split) | Verify a split column or train/eval files |
| 03_train_lightgbm.py | Almost certainly NO: no Space or repo of yours hosts it | This is the single biggest reproducibility gap |
| Feature schema (37 names, in canonical order) | NO: would live in a model card | Without this, even the LightGBM .txt dump (which uses Column_0…Column_36) is opaque |
| 01_fetch_citation_edges.py script | NO | Required to refresh edges for retraining |
| pseudo_label_generation.py / triple builder | NO | Required to regenerate the dataset under a new time-split |
Verdict on reproducibility: as of today, Phase 6 retraining is gated on Amin's local files. A fresh collaborator (or future Amin on a new laptop) cannot retrain the model from public artifacts alone. Fixing this is part of Phase 6.3 (see below).
A.3 Confirm or refute: was the published model trained on test data or real user signal?
Refuted on both counts. Three converging pieces of evidence:
- No user data could possibly exist. The Space siddhm11/ResearchIT was created 2026-04-19, only seven days before the model repo on 2026-04-26. There has been no production window long enough to accrue meaningful save/dismiss interaction logs, and the model repo has zero downloads and zero likes; this is a personal project with no user funnel.
- The dataset repo was created the day after the model. researchit-reranker-data is dated 2026-04-27, one day after the model went up, meaning the dataset was uploaded as a documentation artifact, not consumed from. That fits the citation-pseudo-label story: the triples were generated locally, the model was trained locally, both were pushed in close succession.
- The dataset metadata explicitly says modality:tabular, text and format:parquet with 100K<n<1M rows. That band is the same order of magnitude as the audit's stated counts (90,993 train + 7,007 eval = 98,000 triples), although 98,000 sits just below the 100K lower bound, so the repo probably holds additional rows (for example the 242K citation edges); verify the exact row count. Real user signal at this scale is implausible for a one-week-old project.
Conclusion: the model is trained on citation pseudo-labels, not on test data, not on user signal. The audit's account is correct. Features 23–30 were almost certainly zero during training (no user, no live cluster importance from a real session), which is why serving them as zero today does not break the model's distribution; it merely leaves potential signal on the table.
A.4 Is reranker_v1.txt actually deployed to the live Space?
Unverified from outside. The Space metadata shows it is RUNNING on cpu-basic and declares BAAI/bge-m3 as a model dependency, but does not declare siddhm11/researchit-reranker-phase6 as a dependency. That is a yellow flag: if the LightGBM model were being pulled at runtime via huggingface_hub.snapshot_download, the Space metadata would typically list it. The most likely current state is one of:
- (a) the file was committed to the Space repo's working tree under models/reranker-phase6/production_model/reranker_v1.txt and is being loaded from disk (works, but means the same artifact is duplicated in two HF repos), or
- (b) the file is not in the Space at all and the deployed code is silently falling back to the heuristic scorer.
Either way, the audit's TASK-TRACKER item [ ] [reranker] LightGBM model loaded is still the right thing to verify, and we wire that verification in via Phase 6.3 below.
Part B - Phase 6 Problem Statement, Properly Framed
Why Phase 6 exists. Phases 1–5 built a clean retrieval+ranking stack: BGE-M3 dense + sparse, Qdrant + Zilliz, Ward clustering of user history into long/short/negative interest centroids, importance-weighted per-cluster Qdrant search with floor F_min=3, and a heuristic linear scorer over a hand-tuned set of similarities. The heuristic worked, but it left two kinds of signal on the table: (i) non-linear interactions between paper-level features (recency × influential-citations × topic match), and (ii) user-state features like cluster importance, onboarding category match, and save/dismiss intensity. Phase 6 introduces a LightGBM LambdaRank model with a 37-slot feature vector designed to absorb both classes of signal and to give us a single optimizable objective (nDCG) instead of a hand-tuned weighted sum.
What Phase 6 currently delivers vs what it was designed to deliver. What ships today is a 141-tree LightGBM ranker, trained offline on citation pseudo-labels with a time-split eval, claiming nDCG@10 = 0.879, with a heuristic fallback when the model file is missing. What the live serving path actually feeds the model is not the 37-feature vector the model was designed for; it is a vector where features 0, 1–4, 5, 6–22 carry signal and features 23–30 are zero-filled because the integration in app/routers/recommendations.py still calls rerank_candidates with the legacy 6-argument signature from Phase 5. Bugs A, B, and C from the audit all stem from this single gap: the reranker module was upgraded; the caller was not.
The "9 of 37 features" gap, quantified.
| Feature slots | Count | Source available in caller scope? | Currently passed? |
|---|---|---|---|
| 0 qdrant_score | 1 | Yes, in per_cluster_results | No, not constructed |
| 1–4 long/short/medoid/extra similarities | 4 | Yes: lt_vec, st_vec, dominant medoid | Partial: only lt/st/neg are passed positionally |
| 5 neg similarity | 1 | Yes: neg_vec | Yes |
| 6–22 paper-level features | 17 | Yes, derivable in reranker.py from metadata | Yes (computed inside reranker) |
| 23 cluster_importance | 1 | Yes: clusters[i].importance | No |
| 24 cluster_distance_to_medoid | 1 | Yes, from clusters[i].medoid_embedding | No |
| 25 is_suppressed_category | 1 | Yes, but suppressed is loaded after the rerank call | No |
| 26 onboarding_category_match | 1 | Yes: db.get_user_category_filter() exists | No, never called for this |
| 27 user_total_saves | 1 | Yes: len(state.positives) | No |
| 28 user_total_dismissals | 1 | Yes: len(state.negatives) | No |
| 29 saves_last_7d | 1 | Yes, derivable from state.positives timestamps | No |
| 30 dismissals_last_7d | 1 | Yes, derivable from state.negatives timestamps | No |
| 31–36 remaining metadata features | 6 | Yes, computed inside reranker | Yes |
Effective active features at serving: ~9 + ~6 = ~15 of 37 carry real signal. Features 23–30 are all zero. This is Bug A, expressed quantitatively. It is not destroying ranking quality (the model has not learned to rely on those slots, because they were zero in training too), but it permanently caps how much Phase 6 can ever improve over the heuristic.
Phase 6 is a 3-stage fix: (6.1) connect the existing data flow with a dominant-cluster shortcut, (6.2) thread per-candidate cluster identity through the pipeline, (6.3) verify deployment + add observability. Retraining (6.4) is a separate decision, gated on whether we can produce a training distribution where features 23–30 are non-zero and behaviourally meaningful.
Part C - Phase 6.1: The Simplification Pass (1–2 days)
C.1 What 6.1 does
For each candidate in the rerank list, we compute features 23 and 24 against the dominant cluster, i.e. the cluster with the highest importance in the user's current state. Features 25 and 26 remain truly per-candidate (they depend on the candidate's primary_topic). Features 27–30 are user-level and constant across candidates in a single rerank call. Phase 6.1 is the minimum-viable fix that ends "9 of 37"; it will be replaced by 6.2's per-candidate cluster lookup.
Why dominant-cluster is acceptable for 6.1. The current model was trained with feature 23 = 0 everywhere, so it cannot have learned useful splits on that slot (a constant feature offers no split gain). As long as 6.1 produces a reasonable non-zero value for 23, scores will still be driven by whatever signal the model learned on features 6–22. We are not relying on 23 carrying perfect semantics; we are getting the integration plumbed end-to-end so that the feature non-zero rate jumps from ~40% to ~100%, and so that Phase 6.4 retraining has a target.
C.2 The exact code patch
File: app/routers/recommendations.py
C.2.1 Move the suppressed and onboarding_categories loading earlier
Find the existing block where suppressed is loaded (currently after the rerank_candidates call) and move it to immediately after state is hydrated and before the Qdrant retrieval loop. Add the onboarding category fetch right next to it.
# --- BEGIN Phase 6.1 patch (top of recommendations endpoint, after state hydration) ---
# Needs at module top: from datetime import datetime, timedelta, timezone
# Suppressed categories: was loaded after rerank; move it here.
suppressed: set[str] = set(await db.get_suppressed_categories(user_id))
# Onboarding categories: previously unused at rerank time.
onboarding_categories: set[str] = set(
    await db.get_user_category_filter(user_id) or []
)
# User-level interaction counts (constant across all candidates this request).
now_utc = datetime.now(timezone.utc)
seven_days_ago = now_utc - timedelta(days=7)
user_total_saves = len(state.positives)
user_total_dismissals = len(state.negatives)
user_saves_last_7d = sum(
    1 for p in state.positives if p.timestamp >= seven_days_ago
)
user_dismissals_last_7d = sum(
    1 for n in state.negatives if n.timestamp >= seven_days_ago
)
# --- END move ---
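One detail the 7-day counts depend on: seven_days_ago is timezone-aware, so every timestamp it is compared against must be aware too, otherwise Python raises TypeError on the comparison. A minimal sketch, assuming timestamps may arrive naive and should then be read as UTC (an assumption about the storage layer, not something the audit states). Interaction, as_utc and count_since are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Interaction:
    """Hypothetical stand-in for an entry in state.positives / state.negatives."""
    paper_id: str
    timestamp: datetime


def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC (assumption); convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def count_since(items: list, cutoff: datetime) -> int:
    """Count interactions at or after a timezone-aware cutoff."""
    return sum(1 for item in items if as_utc(item.timestamp) >= cutoff)


# Fixed "now" so the example is deterministic.
now_utc = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
cutoff = now_utc - timedelta(days=7)
saves = [
    Interaction("2401.00001", datetime(2026, 5, 1, 9, 0)),                        # naive, recent
    Interaction("2401.00002", datetime(2026, 4, 20, 9, 0, tzinfo=timezone.utc)),  # aware, too old
]
recent_saves = count_since(saves, cutoff)
```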
C.2.2 Add a helper to align Qdrant scores with valid_ids
per_cluster_results is currently a list[list[ScoredPoint]] (one list per cluster). After dedup + filter into valid_ids, we lose the raw retrieval score. We need to project it back.
# app/recommend/fusion.py (or app/routers/recommendations.py if you keep helpers local)
import numpy as np


def build_qdrant_score_map(
    per_cluster_results: list[list["ScoredPoint"]],
) -> dict[str, float]:
    """
    Collapse all per-cluster ScoredPoint lists into a single
    paper_id -> max_score dict. If a paper appeared in multiple
    clusters' top-K, we keep the maximum score (most charitable to
    the candidate; matches the dedup semantics in merge_quota_results).
    """
    out: dict[str, float] = {}
    for cluster_hits in per_cluster_results:
        for hit in cluster_hits:
            pid = str(hit.id)  # arxiv IDs ALWAYS strings (CLAUDE.md rule 7)
            score = float(hit.score)
            if pid not in out or score > out[pid]:
                out[pid] = score
    return out


def align_qdrant_scores(
    valid_ids: list[str],
    score_map: dict[str, float],
) -> np.ndarray:
    """Return a float32 array of qdrant_scores aligned 1:1 with valid_ids.
    Missing entries default to 0.0 (which matches train-time behavior for
    candidates injected by exploration paths)."""
    return np.asarray(
        [score_map.get(pid, 0.0) for pid in valid_ids],
        dtype=np.float32,
    )
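A small worked example of the max-score collapse and the 0.0 default, written against plain lists so it runs without Qdrant or NumPy. Hit is a hypothetical stand-in for ScoredPoint (only id and score are read), score_map is a condensed form of the helper above, and the arXiv IDs are made up.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for ScoredPoint: only .id and .score are read."""
    id: str
    score: float


def score_map(per_cluster: list) -> dict:
    """paper_id -> maximum retrieval score across all clusters."""
    out = {}
    for hits in per_cluster:
        for hit in hits:
            pid = str(hit.id)
            out[pid] = max(out.get(pid, float("-inf")), float(hit.score))
    return out


per_cluster = [
    [Hit("2401.00001", 0.81), Hit("2401.00002", 0.64)],  # cluster 0
    [Hit("2401.00002", 0.72), Hit("2401.00003", 0.55)],  # cluster 1
]
smap = score_map(per_cluster)

# The last ID was never retrieved (e.g. an exploration injection), so it gets 0.0.
valid_ids = ["2401.00002", "2401.00003", "2401.99999"]
aligned = [smap.get(pid, 0.0) for pid in valid_ids]
```

The shared paper keeps 0.72, the higher of its two scores, and the alignment list stays the same length as valid_ids.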
C.2.3 Compute the dominant-cluster scalars
# Dominant cluster: highest-importance cluster in user state.
# clusters is the list[Cluster] you already have in scope.
if clusters:
    dominant = max(clusters, key=lambda c: c.importance)
    dominant_importance = float(dominant.importance)
    dominant_medoid_vec = np.asarray(
        dominant.medoid_embedding, dtype=np.float32
    )
else:
    # Cold-start path: no clusters -> defensible defaults.
    dominant_importance = 0.0
    dominant_medoid_vec = np.zeros(1024, dtype=np.float32)
C.2.4 The new rerank_candidates call
Replace the existing call (around line 305 of recommendations.py) with:
# --- BEGIN Phase 6.1 rerank call ---
qdrant_score_map = build_qdrant_score_map(per_cluster_results)
qdrant_scores = align_qdrant_scores(valid_ids, qdrant_score_map)
# Per-candidate boolean: is this candidate's primary_topic suppressed?
is_suppressed_category = np.asarray(
    [
        1.0 if (m.get("primary_topic") in suppressed) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)
# Per-candidate boolean: does this candidate match any onboarding category?
onboarding_category_match = np.asarray(
    [
        1.0 if (m.get("primary_topic") in onboarding_categories) else 0.0
        for m in valid_meta
    ],
    dtype=np.float32,
)
reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    # Phase 6 additions (6.1: dominant-cluster shortcut)
    qdrant_scores=qdrant_scores,
    cluster_importance=np.full(
        len(valid_ids), dominant_importance, dtype=np.float32
    ),
    cluster_medoid=dominant_medoid_vec,  # broadcast in reranker
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
# --- END Phase 6.1 rerank call ---
C.2.5 Reranker signature change
File: app/recommend/reranker.py
def rerank_candidates(
    *,
    candidate_ids: list[str],
    candidate_embeddings: np.ndarray,  # (N, 1024)
    candidate_metadata: list[dict],
    long_term_vec: np.ndarray | None,
    short_term_vec: np.ndarray | None,
    negative_vec: np.ndarray | None,
    # Phase 6.1 additions: all optional with safe zero-defaults
    # so the legacy callers keep working during the migration window.
    qdrant_scores: np.ndarray | None = None,
    cluster_importance: np.ndarray | None = None,  # (N,) or scalar broadcast
    cluster_medoid: np.ndarray | None = None,  # (1024,) for 6.1, (N, 1024) for 6.2
    is_suppressed_category: np.ndarray | None = None,
    onboarding_category_match: np.ndarray | None = None,
    user_total_saves: int = 0,
    user_total_dismissals: int = 0,
    user_saves_last_7d: int = 0,
    user_dismissals_last_7d: int = 0,
) -> tuple[list[str], list[float], np.ndarray]:
    ...
Inside the feature-matrix builder, fill slots 23–30 from these new args, falling back to zero if None (preserves backward-compat for any unit tests that still call the legacy path).
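The None-to-zero fallback can be sketched as follows. This is a dependency-free illustration over plain lists, not the actual builder in reranker.py: fill_user_slots and its row layout are hypothetical, and slot 24 is assumed to arrive as a precomputed distance.

```python
from typing import Optional

N_FEATURES = 37


def fill_user_slots(
    rows: list,
    cluster_importance: Optional[list] = None,  # slot 23, one value per candidate
    cluster_distance: Optional[list] = None,    # slot 24, one value per candidate
    is_suppressed: Optional[list] = None,       # slot 25, one value per candidate
    onboarding_match: Optional[list] = None,    # slot 26, one value per candidate
    user_counts: tuple = (0, 0, 0, 0),          # slots 27-30, constant per request
) -> None:
    """Write slots 23-30 in place; a missing per-candidate input stays 0.0."""
    per_candidate = {
        23: cluster_importance,
        24: cluster_distance,
        25: is_suppressed,
        26: onboarding_match,
    }
    for i, row in enumerate(rows):
        for slot, values in per_candidate.items():
            row[slot] = float(values[i]) if values is not None else 0.0
        for offset, count in enumerate(user_counts):
            row[27 + offset] = float(count)


rows = [[0.0] * N_FEATURES for _ in range(2)]
fill_user_slots(
    rows,
    cluster_importance=[0.7, 0.7],  # 6.1: dominant-cluster value broadcast
    is_suppressed=[0.0, 1.0],
    user_counts=(12, 3, 2, 1),
)
```

Slot 24 was not supplied here, so it stays at zero for both rows, which is exactly the legacy behaviour the defaults are meant to preserve.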
C.3 The integration test (must be added)
File: tests/recommend/test_phase6_feature_matrix.py
import numpy as np
import pytest
from app.recommend.reranker import _build_feature_matrix  # expose for testing


def test_phase6_feature_matrix_is_not_mostly_zero(fake_user_state, fake_candidates):
    """
    Regression guard for Bug A. After Phase 6.1, features 23-30 must
    carry signal for at least one candidate in a typical request.
    """
    X = _build_feature_matrix(
        candidate_ids=fake_candidates.ids,
        candidate_embeddings=fake_candidates.embs,
        candidate_metadata=fake_candidates.meta,
        long_term_vec=fake_user_state.lt_vec,
        short_term_vec=fake_user_state.st_vec,
        negative_vec=fake_user_state.neg_vec,
        qdrant_scores=fake_candidates.qdrant_scores,  # non-trivial
        cluster_importance=np.full(len(fake_candidates.ids), 0.42),
        cluster_medoid=fake_user_state.dominant_medoid,
        is_suppressed_category=np.array([0, 0, 1, 0, 0], dtype=np.float32),
        onboarding_category_match=np.array([1, 0, 1, 0, 1], dtype=np.float32),
        user_total_saves=12,
        user_total_dismissals=3,
        user_saves_last_7d=2,
        user_dismissals_last_7d=1,
    )
    assert X.shape[1] == 37, f"Feature schema drifted: got {X.shape[1]} cols"
    # Per-feature non-zero rate. Slots 23..30 must each be >0 somewhere.
    nonzero_rate = (X != 0).mean(axis=0)
    for slot in range(23, 31):
        assert nonzero_rate[slot] > 0.0, (
            f"Feature {slot} is all zeros: Phase 6.1 plumbing regression"
        )
    # Aggregate sanity: at least 60% of feature slots should be active.
    assert (nonzero_rate > 0).mean() >= 0.6
C.4 What 6.1 fixes vs leaves open
| Bug | Status after 6.1 |
|---|---|
| A: caller uses legacy 6-arg signature | Fixed |
| B: Hungarian zero-vector fallback | Untouched (orthogonal; addressed separately, see Part E.4) |
| C: model deployment unverified | Untouched (Phase 6.3) |
| D: train/serve consistency | Improved on slot 23–30 non-zeroness, but the trained model's slot-23 weight is near-zero, so 6.1 will not move nDCG by much. That's expected. |
Part D - Phase 6.2: The Full Plumbing (3–5 days)
D.1 The data-structure change: paper_cluster_map
The architectural rule we are honoring: a paper retrieved for "Cluster A: ML systems" should be scored for its fit to that cluster, not to the user's dominant interest. Phase 6.1 violates this for slots 23 and 24 by broadcasting the dominant cluster's importance/medoid to every candidate. Phase 6.2 fixes it by attaching the source cluster index to each retrieved candidate and threading that through to the reranker.
Add to app/recommend/types.py (or wherever Cluster lives):
# Source-of-truth mapping: candidate paper_id -> index into clusters[].
# Built during per-cluster Qdrant retrieval, propagated through merge_quota,
# consumed by the reranker.
PaperClusterMap = dict[str, int]
If a candidate appears in multiple cluster top-Ks (which can happen with importance-weighted quota and a permissive K_max=7), the convention is first-write-wins by importance order: when iterating clusters in descending importance, the first cluster to surface a candidate "owns" it. This matches the heuristic that a paper that the user's strongest interest cluster also pulls is more naturally explained by that cluster.
D.2 app/recommend/fusion.py: merge_quota_results
Currently merge_quota_results takes per_cluster_results: list[list[ScoredPoint]] and returns list[str] (deduped IDs). Change it to also return the cluster mapping.
def merge_quota_results(
    per_cluster_results: list[list["ScoredPoint"]],
    clusters: list[Cluster],
    floor_per_cluster: int = 3,  # F_min from CLAUDE.md
) -> tuple[list[str], PaperClusterMap]:
    """
    Importance-weighted quota merge. Returns:
      - merged_ids: deduped list of arxiv IDs (str)
      - paper_cluster_map: candidate_id -> source cluster index
    Convention: when a candidate appears in multiple clusters, the
    cluster with HIGHER importance wins. Iterate clusters sorted
    descending by importance to make first-write-wins do the right thing.
    """
    paper_cluster_map: PaperClusterMap = {}
    merged_ids: list[str] = []
    seen: set[str] = set()
    # Stable sort by importance descending; preserve original index for lookup.
    order = sorted(
        range(len(clusters)),
        key=lambda i: clusters[i].importance,
        reverse=True,
    )
    for cluster_idx in order:
        hits = per_cluster_results[cluster_idx]
        # Apply F_min floor: take at least floor_per_cluster (or all if shorter).
        # The full quota math from Phase 5 lives elsewhere; this is the
        # accounting step.
        for hit in hits:
            pid = str(hit.id)
            if pid in seen:
                continue
            seen.add(pid)
            merged_ids.append(pid)
            paper_cluster_map[pid] = cluster_idx
    return merged_ids, paper_cluster_map
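A toy run of the ownership convention, assuming only that each cluster has an importance value. It uses bare paper-ID lists in place of ScoredPoint hits, and owner_map is a hypothetical condensed form of the loop above.

```python
def owner_map(per_cluster_ids: list, importances: list) -> dict:
    """paper_id -> index of the most important cluster that retrieved it."""
    order = sorted(
        range(len(importances)),
        key=lambda i: importances[i],
        reverse=True,
    )
    owners = {}
    for cluster_idx in order:
        for pid in per_cluster_ids[cluster_idx]:
            owners.setdefault(pid, cluster_idx)  # first write wins
    return owners


# Cluster 1 has the higher importance, so it owns the shared paper "p2".
owners = owner_map(
    per_cluster_ids=[["p1", "p2"], ["p2", "p3"]],
    importances=[0.3, 0.7],
)
```

Because sorted is stable, clusters with equal importance keep their original relative order, so a tie goes to the lower cluster index.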
Update every caller of merge_quota_results in recommendations.py to unpack the second return. Mypy / pyright will flag every site; fix them all.
D.3 app/routers/recommendations.py: propagate the map
merged_ids, paper_cluster_map = merge_quota_results(
    per_cluster_results, clusters
)
# ... after dedup and metadata fetch, valid_ids is a subset of merged_ids ...
# Per-candidate cluster index (aligned with valid_ids).
candidate_cluster_idx = np.asarray(
    [paper_cluster_map[pid] for pid in valid_ids],
    dtype=np.int32,
)
# Per-candidate cluster importance (slot 23, properly per-candidate).
per_candidate_importance = np.asarray(
    [clusters[idx].importance for idx in candidate_cluster_idx],
    dtype=np.float32,
)
# Per-candidate cluster medoid (used to compute slot 24 inside reranker).
# Stack medoids into a (N, 1024) array.
per_candidate_medoids = np.stack(
    [
        np.asarray(clusters[idx].medoid_embedding, dtype=np.float32)
        for idx in candidate_cluster_idx
    ],
    axis=0,
)
reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    qdrant_scores=qdrant_scores,
    cluster_importance=per_candidate_importance,  # (N,), was a broadcast scalar in 6.1
    cluster_medoid=per_candidate_medoids,  # (N, 1024), was (1024,) in 6.1
    is_suppressed_category=is_suppressed_category,
    onboarding_category_match=onboarding_category_match,
    user_total_saves=user_total_saves,
    user_total_dismissals=user_total_dismissals,
    user_saves_last_7d=user_saves_last_7d,
    user_dismissals_last_7d=user_dismissals_last_7d,
)
D.4 app/recommend/reranker.py: per-candidate slot 24
Inside the feature-matrix builder, slot 24 (cluster_distance_to_medoid) becomes a row-wise cosine:
# Slot 24: cosine distance from each candidate to its OWN source-cluster medoid.
# cluster_medoid shape:
#   (1024,)    -> 6.1 broadcast path (legacy)
#   (N, 1024)  -> 6.2 per-candidate path
if cluster_medoid is None:
    feat_24 = np.zeros(N, dtype=np.float32)
elif cluster_medoid.ndim == 1:
    # Broadcast cosine: same medoid for every candidate.
    medoid_norm = cluster_medoid / (np.linalg.norm(cluster_medoid) + 1e-9)
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    sims = cand_norms @ medoid_norm
    feat_24 = (1.0 - sims).astype(np.float32)  # distance, not similarity
else:
    # Per-row cosine: candidate i vs cluster_medoid[i].
    cand_norms = candidate_embeddings / (
        np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9
    )
    med_norms = cluster_medoid / (
        np.linalg.norm(cluster_medoid, axis=1, keepdims=True) + 1e-9
    )
    sims = (cand_norms * med_norms).sum(axis=1)
    feat_24 = (1.0 - sims).astype(np.float32)
D.5 Why 6.2 matters
A concrete example: user has clusters {A: ML systems (importance 0.7), B: protein folding (importance 0.3)}. A paper about MLPerf benchmark methodology gets retrieved by cluster A's Qdrant query; a paper about AlphaFold 3 architecture gets retrieved by cluster B's. Under 6.1, both get scored against cluster A's medoid (the dominant one), so the AlphaFold paper looks artificially "off-distribution" and gets ranked down. Under 6.2, the AlphaFold paper is scored against cluster B's medoid (where it belongs), and slot 24 correctly registers as a small distance. The model's learned weight on slot 24 then has the opportunity to protect exploration into the user's secondary interests, instead of reinforcing the dominant one.
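The effect is easy to see with toy 2-D vectors. The numbers below are made up and stand in for the 1024-d BGE-M3 embeddings; cosine_distance mirrors the slot-24 formula (1 minus cosine similarity, with the same 1e-9 guard) in plain Python.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """1 - cosine similarity, guarding against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) + 1e-9
    norm_b = math.sqrt(sum(y * y for y in b)) + 1e-9
    return 1.0 - dot / (norm_a * norm_b)


medoid_a = [1.0, 0.0]         # cluster A: ML systems (dominant, importance 0.7)
medoid_b = [0.0, 1.0]         # cluster B: protein folding (importance 0.3)
alphafold_paper = [0.1, 0.9]  # retrieved by cluster B's query

# 6.1 scores every candidate against the dominant medoid.
slot24_phase61 = cosine_distance(alphafold_paper, medoid_a)
# 6.2 scores the candidate against the medoid of the cluster that retrieved it.
slot24_phase62 = cosine_distance(alphafold_paper, medoid_b)
```

With these toy vectors the 6.1 distance is close to 0.9 while the 6.2 distance is under 0.01, which is the "artificially off-distribution" effect described above.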
This also makes Phase 9 (Exploration + CF) much cleaner: the exploration budget is naturally reasoned about in terms of "minority-cluster candidates that survived rerank."
D.6 Integration test outline
File: tests/recommend/test_phase62_per_candidate_cluster.py
import numpy as np

# `client` is assumed to be a FastAPI TestClient for the app, supplied by the
# test module or a fixture.


def test_paper_cluster_map_threaded_through(monkeypatch, fake_three_cluster_state):
    """
    Given three clusters with different medoids, candidates retrieved
    from cluster B must produce slot 24 measured against B's medoid,
    not against the dominant A's medoid.
    """
    captured = {}

    def fake_predict(model, X):
        captured["X"] = X.copy()
        return np.arange(X.shape[0])[::-1].astype(float)

    monkeypatch.setattr("app.recommend.reranker._lgb_predict", fake_predict)
    response = client.get(
        "/recommend",
        headers={"X-User-Id": fake_three_cluster_state.user_id},
    )
    assert response.status_code == 200
    X = captured["X"]
    # Specific assertion: candidate 0 came from cluster A, candidate 5 from B.
    # Their slot-24 values must be different (they are scored vs different medoids).
    assert X[0, 24] != X[5, 24]
    # And slot 23 (cluster_importance) must match the source clusters'
    # importance values, not a single broadcast.
    assert len(set(X[:, 23].tolist())) > 1
Part E - Phase 6.3: Deployment Verification + Monitoring (1 day)
E.1 Verify the LightGBM model loads on HF Spaces
Two acceptable deployment strategies. Pick one and document it.
Option E.1.a: Commit the 974 KB file to the Space's Git repo (recommended for simplicity)
974 KB is well under HF's 5 MB inline limit and well under any sane Git-LFS threshold. Just commit it.
# From your local ResearchIT working tree:
cd /path/to/ResearchIT
mkdir -p models/reranker-phase6/production_model
cp /path/to/reranker_v1.txt models/reranker-phase6/production_model/
# Make sure neither .gitignore nor .dockerignore excludes it.
grep -E "^models/?$|^models/reranker" .gitignore .dockerignore || echo "OK, not ignored"
git add models/reranker-phase6/production_model/reranker_v1.txt
git commit -m "Phase 6.3: ship LightGBM reranker artifact to Space"
git push hf main # where 'hf' is the HF Space remote
reranker.py's search paths already include this location, so no code change required.
Option E.1.b: Pull from HF Hub at container build time (cleaner separation of concerns)
# Add to Dockerfile, BEFORE the COPY . . step
RUN pip install --no-cache-dir huggingface_hub && \
python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download( \
repo_id='siddhm11/researchit-reranker-phase6', \
local_dir='/app/models/reranker-phase6/production_model', \
local_dir_use_symlinks=False, \
allow_patterns=['*.txt', '*.json'], \
)"
This treats the model repo as the source of truth and the Space as a consumer, which is closer to "real" MLOps. Cost: one extra layer in the build, ~1s at startup since it's cached after first build.
Recommendation: Option E.1.a for now (simpler, the file is tiny, and it removes a network dependency at container build). Move to E.1.b in Phase 7 when you have a model registry story.
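Whichever option ships the file, it can be sanity-checked without importing LightGBM. The sketch below assumes the text dump follows LightGBM's usual layout, with a max_feature_idx= header line and one Tree=<n> line per tree. inspect_model_text is a hypothetical helper, and the sample string is synthetic, not the real artifact.

```python
def inspect_model_text(text: str) -> dict:
    """Count trees and features in a LightGBM text dump (layout assumed, see above)."""
    n_trees = 0
    max_feature_idx = None
    for line in text.splitlines():
        if line.startswith("Tree="):
            n_trees += 1
        elif line.startswith("max_feature_idx="):
            max_feature_idx = int(line.split("=", 1)[1])
    n_features = None if max_feature_idx is None else max_feature_idx + 1
    return {"n_trees": n_trees, "n_features": n_features}


# Synthetic two-tree dump, just enough structure to exercise the parser.
sample = (
    "tree\nversion=v3\nmax_feature_idx=36\n\n"
    "Tree=0\nnum_leaves=2\n\n"
    "Tree=1\nnum_leaves=2\n"
)
info = inspect_model_text(sample)
```

Run against the contents of reranker_v1.txt, the values to look for are the audit's 141 trees and 37 features.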
E.2 Add /healthz/reranker route
File: app/routers/health.py (create if not present)
import hashlib
import json

from fastapi import APIRouter

from app.recommend import reranker as _rr

router = APIRouter()

EXPECTED_FEATURE_NAMES = [
    "qdrant_score",                             # 0
    "sim_long_term", "sim_short_term",          # 1, 2
    "sim_cluster_medoid", "sim_extra_4",        # 3, 4
    "sim_negative",                             # 5
    # 6..22 paper-level
    "recency_days_log", "citation_count_log",
    "influential_citations_log", "is_primary_cs_lg",
    "is_primary_cs_cl", "is_primary_cs_cv",
    "is_primary_stat_ml", "is_primary_cs_ai",
    "is_primary_cs_ir", "is_primary_other",
    "abstract_len_log", "title_len",
    "year", "month", "venue_is_top",
    "n_authors_log", "has_code_link",
    # 23..30 cluster + user
    "cluster_importance", "cluster_distance_to_medoid",
    "is_suppressed_category", "onboarding_category_match",
    "user_total_saves", "user_total_dismissals",
    "user_saves_last_7d", "user_dismissals_last_7d",
    # 31..36 remaining metadata
    "primary_topic_freq_in_user", "is_oa", "venue_age",
    "abstract_topic_match", "ref_count_log", "is_arxiv_only",
]
assert len(EXPECTED_FEATURE_NAMES) == 37


@router.get("/healthz/reranker")
async def healthz_reranker():
    schema_hash = hashlib.sha256(
        json.dumps(EXPECTED_FEATURE_NAMES).encode()
    ).hexdigest()[:12]
    return {
        "model_loaded": _rr.is_model_loaded(),      # True iff LightGBM Booster live
        "model_path": _rr.get_loaded_model_path(),  # str or None
        "model_version": "phase6.v1",
        "fallback_active": not _rr.is_model_loaded(),
        "feature_count": 37,
        "feature_schema_hash": schema_hash,
        "n_trees": _rr.get_num_trees(),             # 141 expected
    }
Wire _rr.is_model_loaded(), _rr.get_loaded_model_path(), _rr.get_num_trees() into reranker.py as small accessors over the module-level Booster handle.
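These accessors can be sketched as thin wrappers over module-level state. The names _model and _model_path are assumptions about reranker.py's internals (the logging snippet in E.3 refers to _model), num_trees() is the LightGBM Booster method, and _FakeBooster exists only so the sketch runs without LightGBM installed.

```python
from typing import Optional

_model = None                      # LightGBM Booster once loaded, else None
_model_path: Optional[str] = None  # path the Booster was loaded from


def is_model_loaded() -> bool:
    """True iff a Booster handle is live in this process."""
    return _model is not None


def get_loaded_model_path() -> Optional[str]:
    """Path of the loaded model file, or None while the fallback is active."""
    return _model_path if is_model_loaded() else None


def get_num_trees() -> int:
    """Tree count of the loaded Booster; 0 while the fallback is active."""
    return int(_model.num_trees()) if is_model_loaded() else 0


class _FakeBooster:
    """Test double that mimics the one Booster method used above."""

    def num_trees(self) -> int:
        return 141


# Before loading: the heuristic fallback would be active.
before = (is_model_loaded(), get_loaded_model_path(), get_num_trees())

# Simulate a successful load.
_model = _FakeBooster()
_model_path = "models/reranker-phase6/production_model/reranker_v1.txt"
after = (is_model_loaded(), get_num_trees())
```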
Then verify on the live Space:
curl -s https://siddhm11-researchit.hf.space/healthz/reranker | jq
Expected JSON body: "model_loaded": true, "n_trees": 141, "fallback_active": false. If any of those is wrong, the deployed image is silently running the heuristic.
E.3 Logging: feature non-zero rate per request
In reranker.py, after building the feature matrix X, add:
# Observability: emit a per-request feature-activation histogram so we
# catch silent regressions to zero-filled features 23-30.
if X.shape[0] > 0:
    nonzero_rate = (X != 0).mean(axis=0)  # (37,)
    logger.info(
        "reranker.features",
        extra={
            "feature_nonzero_rate": nonzero_rate.round(3).tolist(),
            "feature_count": X.shape[1],
            "n_candidates": X.shape[0],
            "model_active": _model is not None,
        },
    )
Add a Prometheus-style assertion later (Phase 7), but for now structured-log lines are enough; you can grep them on the Space's log stream.
E.4 Bug B fix (Hungarian zero-vector fallback): bundle into 6.3
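Until that assertion exists, the same check can be run over the logged rates. A dependency-free sketch of the per-slot non-zero rate plus a guard that names any dead slot in 23–30; dead_user_slots is a hypothetical helper, and X here is a plain list of rows rather than the NumPy matrix.

```python
def nonzero_rate(X: list) -> list:
    """Fraction of candidates with a non-zero value, per feature column."""
    if not X:
        return []
    n = len(X)
    return [sum(1 for row in X if row[j] != 0) / n for j in range(len(X[0]))]


def dead_user_slots(rates: list, first: int = 23, last: int = 30) -> list:
    """Slots in [first, last] that were zero for every candidate."""
    return [j for j in range(first, last + 1) if rates[j] == 0.0]


# Two candidates, 37 features; only slots 23 and 27 are populated.
X = [[0.0] * 37 for _ in range(2)]
X[0][23], X[1][23] = 0.7, 0.3
X[0][27] = 12.0

rates = nonzero_rate(X)
dead = dead_user_slots(rates)
```

Some slots are legitimately zero on a single request (slot 25 when the user has no suppressed categories, for instance), so an alert is better driven by the rate across many requests than by one log line.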
While we're in recommendations.py, fix the medoid-rebuild bug:
# OLD:
# medoid_embedding=np.array(vectors[row["medoid_paper_id"]], dtype=np.float32)
# if row["medoid_paper_id"] in vectors else np.zeros(1024, dtype=np.float32)
# NEW: if the medoid paper's embedding isn't in the immediate vectors dict,
# fall back to the previously-stored medoid_embedding from the cluster row
# (which is what we persisted last cycle), NOT a zero vector. Zero-vector
# fallback breaks Hungarian assignment because cosine(*, 0) = 0 fails the
# 0.5 acceptance threshold and orphans the cluster identity.
if row["medoid_paper_id"] in vectors:
    medoid_emb = np.asarray(vectors[row["medoid_paper_id"]], dtype=np.float32)
elif row["medoid_embedding_blob"] is not None:
    medoid_emb = np.frombuffer(row["medoid_embedding_blob"], dtype=np.float32)
else:
    # Last resort: this cluster has no recoverable medoid. Mark for re-seed
    # rather than silently masking with zeros.
    logger.warning(
        "cluster.medoid.unrecoverable", extra={"cluster_id": row["cluster_id"]}
    )
    continue  # skip this cluster row; it'll be rebuilt on next Ward run
This requires adding medoid_embedding_blob BLOB to the cluster persistence schema if it's not already there β a one-line ALTER on Turso.
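A minimal sketch of that migration, written against the standard-library `sqlite3` module as a stand-in for the Turso client (Turso speaks the SQLite dialect, but the real connection object will differ). The table name `clusters` comes from the closeout checklist; the toy table layout, the sample paper ID, and the helper name `ensure_medoid_blob_column` are illustrative assumptions. The `PRAGMA table_info` check makes the migration safe to re-run, since SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form.

```python
import sqlite3
from array import array


def ensure_medoid_blob_column(conn: sqlite3.Connection) -> bool:
    """Add clusters.medoid_embedding_blob if missing. Returns True if added."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(clusters)")]
    if "medoid_embedding_blob" in cols:
        return False
    conn.execute("ALTER TABLE clusters ADD COLUMN medoid_embedding_blob BLOB")
    conn.commit()
    return True


# Demo on an in-memory database with a toy clusters table (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY, medoid_paper_id TEXT)"
)
added_first = ensure_medoid_blob_column(conn)   # column is created
added_again = ensure_medoid_blob_column(conn)   # no-op on the second run

# Round-trip a float32 vector through the BLOB column. array("f") stores
# native-order 4-byte floats, the same layout np.frombuffer(..., np.float32) reads.
vec = array("f", [0.25, -1.5, 3.0])
conn.execute(
    "INSERT INTO clusters VALUES (?, ?, ?)", (1, "toy-paper-id", vec.tobytes())
)
blob = conn.execute(
    "SELECT medoid_embedding_blob FROM clusters WHERE cluster_id = 1"
).fetchone()[0]
restored = array("f")
restored.frombytes(blob)
print(added_first, added_again, restored.tolist())
```

Existing rows get `NULL` in the new column, which is exactly the case the `is not None` branch in the fix above guards against.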
Part F - Phase 6.4: Retraining Strategy WITHOUT Real User Signal
F.1 The honest framing
The current model is not broken. It is correctly trained on the only labels we had access to (citation pseudo-labels), and on its native distribution it scores nDCG@10 = 0.879. The gap between that number and "what users feel" is a deployment gap, not a model gap.
Retraining only makes sense once one of the following is true:
- We have a training distribution where features 23-30 carry non-trivial, behaviourally meaningful signal, or
- We have real user labels (saves/dismisses/dwell-time) at sufficient scale.
Today, neither holds. So Phase 6.4 is a decision document, not a build task.
F.2 Three options, compared
| Option | Feasibility | Fidelity to real users | Engineering cost | Risk |
|---|---|---|---|---|
| (i) Wait for real users. Threshold: 100 users with ≥10 saves each → ~1,000 labelled (user, paper, +/-) tuples per week, enough for a weekly retrain on slots 23-30. | High (zero engineering until threshold hit) | Highest possible: actual ground truth | Zero now, ~1 week to build the labelling pipeline once we cross the threshold | You may never cross the threshold. Single-developer hobby project. |
| (ii) Synthetic user simulation. Generate N=1,000 synthetic users. For each, draw a "true interest profile" as a mixture over arXiv categories. Sample 30-60 "saves" from each profile by drawing papers from citation neighborhoods of seed papers in those categories. Run those saves through the actual EWMA + Ward + medoid pipeline to produce real `cluster_importance`, real `cluster_distance_to_medoid`, real `onboarding_category_match` (using the synthetic user's category profile as the onboarding set). Label triples by citation: a citation-neighbor of any of the user's saved papers is positive; a random paper outside the user's neighborhood is negative. | Medium-high: every component already exists; the simulation is glue code | Moderate: captures structure of the signal (cluster importance scales correctly, suppressed categories actually suppress) but doesn't capture true user preference noise | 5-8 days: write `simulate_users.py`, regenerate triples (~200K with non-zero slots 23-30), retrain LightGBM, re-eval | The model learns the simulator's biases. Fight this by holding out real user data (whenever it arrives) as a clean eval set; if the simulator-trained model dramatically outperforms a heuristic baseline on real-user holdout, you've validated. If not, you've wasted a week. |
| (iii) Self-distillation (defer to Phase 8). Use the heuristic scorer + Qdrant retrieval rank as a soft label. Train a fresh LightGBM to mimic those scores, then iterate: ship → log new soft labels from the deployed model itself → retrain on log data. | Medium: needs a logging schema and an offline training loop | Low initially (only as good as the teacher), but improves once real users arrive and label-corruption from the teacher decays | 2-3 weeks: full label-store, replay infra, offline training pipeline | Cold-start collapse: if the teacher is worse than the student needs, distillation locks in mediocrity. Only safe once we have any real user signal to anchor the distribution. |
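The core of option (ii) can be sketched in a few functions. This is a toy illustration under stated assumptions, not the planned `simulate_users.py`: every function name, the flat-Dirichlet profile, the `(user, positive, negative)` triple layout, and the ring-shaped toy citation graph are hypothetical, and the step that runs saves through the EWMA + Ward + medoid pipeline is deliberately left out.

```python
import random


def draw_interest_profile(categories, rng, n_topics=3):
    """Draw a sparse mixture over arXiv categories; weights sum to 1."""
    chosen = rng.sample(categories, k=min(n_topics, len(categories)))
    # Normalised Gamma(1, 1) draws form a flat Dirichlet sample.
    raw = [rng.gammavariate(1.0, 1.0) for _ in chosen]
    total = sum(raw)
    return {cat: w / total for cat, w in zip(chosen, raw)}


def sample_saves(profile, papers_by_category, citation_neighbors, rng,
                 n_min=30, n_max=60):
    """Sample n_min..n_max distinct saves from citation neighborhoods of seeds."""
    target = rng.randint(n_min, n_max)
    cats = list(profile)
    weights = [profile[c] for c in cats]
    saves, seen = [], set()
    for _ in range(50 * target):  # bounded retries; tiny corpora may fall short
        if len(saves) == target:
            break
        cat = rng.choices(cats, weights=weights, k=1)[0]
        seed = rng.choice(papers_by_category[cat])
        pool = citation_neighbors.get(seed) or [seed]
        paper = rng.choice(pool)
        if paper not in seen:
            seen.add(paper)
            saves.append(paper)
    return saves


def label_triples(user_id, saves, citation_neighbors, all_papers, rng):
    """Emit (user_id, positive, negative) triples from citation pseudo-labels."""
    saved = set(saves)
    neighborhood = set()
    for paper in saves:
        neighborhood.update(citation_neighbors.get(paper, []))
    positives = sorted(neighborhood - saved)
    outside = [p for p in all_papers if p not in neighborhood and p not in saved]
    if not outside:
        return []
    return [(user_id, pos, rng.choice(outside)) for pos in positives]


# Toy corpus: 4 categories x 100 papers; each paper cites the next 5 in its category.
categories = ["cs.LG", "cs.CL", "stat.ML", "math.OC"]
papers_by_category = {c: [f"{c}/{i}" for i in range(100)] for c in categories}
citation_neighbors = {
    f"{c}/{i}": [f"{c}/{(i + k) % 100}" for k in range(1, 6)]
    for c in categories for i in range(100)
}
all_papers = [p for c in categories for p in papers_by_category[c]]

rng = random.Random(0)
profile = draw_interest_profile(categories, rng)
saves = sample_saves(profile, papers_by_category, citation_neighbors, rng)
triples = label_triples("synthetic-0", saves, citation_neighbors, all_papers, rng)
print(len(saves), len(triples))
```

Seeding the generator per user keeps each synthetic user reproducible, which matters when comparing v1 and v2 on the same simulated population.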
F.3 Recommendation
Adopt a staged path:
- Now (Phase 6.4a): Do not retrain. Ship 6.1 → 6.2 → 6.3 with the existing model. Document that the model's effective coverage is ~15/37 features at training time but improves to ~37/37 at serving time as soon as 6.1 lands: the model can only use what it learned, but the framework is now wired correctly for the next training run.
- +30 days (Phase 6.4b): Build option (ii), synthetic user simulation. Spec the simulator in `scripts/simulate_users.py`, produce a new dataset version `siddhm11/researchit-reranker-data-v2` with slots 23-30 populated, retrain to `reranker_v2.txt`. Compare v1 vs v2 on a held-out split and on whatever sliver of real-user data has accrued by then.
- +90 days or 100-user threshold (whichever first) (Phase 6.4c): If real-user data exists, do option (i). Train `reranker_v3` on real saves/dismisses with synthetic data as augmentation (50/50 mix or downweight synthetic).
- Phase 8+: Bring in option (iii) as a continuous-update mechanism on top of v3.
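For the 6.4c mix, one way to realise "50/50 or downweight synthetic" without discarding rows is per-row sample weights (LightGBM's `Dataset` accepts a `weight` array). A minimal sketch of the arithmetic, assuming real rows keep weight 1.0; `mix_weights` is a hypothetical helper and the row counts in the example are illustrative only.

```python
def mix_weights(n_real, n_synth, synth_share=0.5):
    """Return (w_real, w_synth) so synthetic rows carry `synth_share` of total weight."""
    if not 0.0 <= synth_share < 1.0:
        raise ValueError("synth_share must be in [0, 1)")
    if n_real == 0 or n_synth == 0:
        return 1.0, 1.0  # nothing to balance against
    w_real = 1.0
    # Solve n_synth * w_synth / (n_real * w_real + n_synth * w_synth) = synth_share.
    w_synth = (synth_share / (1.0 - synth_share)) * (n_real / n_synth)
    return w_real, w_synth


# Example: 1,000 real rows against 200,000 synthetic rows at a 50/50 mix.
w_real, w_synth = mix_weights(1_000, 200_000, synth_share=0.5)
print(w_real, w_synth)
```

Lowering `synth_share` as real data accrues gives a gradual hand-off from simulator to ground truth rather than a hard switch.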
F.4 What to write in docs/phases/PHASE6.md today
Two lines, verbatim:
> Phase 6.4 retraining is deferred. The published model `siddhm11/researchit-reranker-phase6` was trained on citation pseudo-labels with features 23-30 zero. Retraining is gated on either (a) the synthetic-user simulator (Phase 6.4b, ~30 days out) or (b) crossing 100 real users with ≥10 saves each. Until then, Phase 6.1+6.2+6.3 plumbing is the unit of deliverable work.
Part G - Phase 6 Closeout Checklist
### Phase 6.1 - Simplification Pass (1-2 days)
- [ ] Move `suppressed = await db.get_suppressed_categories(...)` to BEFORE rerank call in app/routers/recommendations.py
- [ ] Add `onboarding_categories = await db.get_user_category_filter(...)` next to it
- [ ] Compute `user_total_saves`, `user_total_dismissals`, `user_saves_last_7d`, `user_dismissals_last_7d`
- [ ] Add helper `build_qdrant_score_map()` in app/recommend/fusion.py
- [ ] Add helper `align_qdrant_scores(valid_ids, score_map)`
- [ ] Compute `dominant_importance` and `dominant_medoid_vec`
- [ ] Replace 6-arg `rerank_candidates(...)` call with full kwargs version (Section C.2.4)
- [ ] Update `rerank_candidates()` signature in app/recommend/reranker.py to accept new optional kwargs (Section C.2.5)
- [ ] Wire slots 23-30 into `_build_feature_matrix` in reranker.py
- [ ] Write test `tests/recommend/test_phase6_feature_matrix.py` asserting slots 23-30 non-zero
- [ ] All existing tests pass (`pytest -q`)
- [ ] Commit: "Phase 6.1: connect 37-feature reranker to live caller (dominant-cluster)"
### Phase 6.2 - Full per-candidate plumbing (3-5 days)
- [ ] Add `PaperClusterMap = dict[str, int]` type alias in app/recommend/types.py
- [ ] Modify `merge_quota_results()` in app/recommend/fusion.py to return `(merged_ids, paper_cluster_map)`
- [ ] Update ALL callers of `merge_quota_results` (grep first; fix every one)
- [ ] In recommendations.py, build `candidate_cluster_idx` aligned with `valid_ids`
- [ ] Build `per_candidate_importance` (N,) and `per_candidate_medoids` (N, 1024)
- [ ] Pass arrays (not scalars) for `cluster_importance` and `cluster_medoid` to reranker
- [ ] Update `_build_feature_matrix` slot-24 logic to handle both (1024,) and (N, 1024) medoid shapes
- [ ] Write test `tests/recommend/test_phase62_per_candidate_cluster.py` (Section D.6)
- [ ] Commit: "Phase 6.2: per-candidate cluster identity through reranker"
### Phase 6.3 - Deployment verification + Bug B
- [x] Decide deployment strategy: E.1.a (commit) vs E.1.b (snapshot_download). Used E.1.a.
- [x] Verify `models/reranker-phase6/production_model/reranker_v1.txt` is in working tree, not gitignored, not dockerignored
- [x] Push to HF Space; wait for build; check build logs for "[reranker] LightGBM model loaded"
- [x] Add `/healthz/reranker` route (Section E.2)
- [x] Add `_rr.is_model_loaded()`, `_rr.get_loaded_model_path()`, `_rr.get_num_trees()` accessors
- [x] `curl https://siddhm11-researchit.hf.space/healthz/reranker` → confirm `model_loaded: true, n_trees: 141`
> *Verified live at 2026-05-03: `model_loaded=true, n_trees=141, fallback_active=false, feature_count=37, feature_schema_hash=5d0b3de7b0c1`.*
- [x] Add per-request `reranker.features` log line with `feature_nonzero_rate`
- [x] Fix Bug B: medoid_embedding_blob fallback in cluster reload (Section E.4)
- [x] Add `medoid_embedding_blob BLOB` column to clusters table (SQLite ALTER migration)
- [x] Update CLAUDE.md / model card to reflect deployment story
### Phase 6 documentation
- [ ] Write `docs/phases/PHASE6.md` retraining decision (Section F.4)
- [ ] Update `README.md` test count (will increase by ~2 from 6.1 + 6.2 tests)
- [ ] Update `TASK-TRACKER.md`: tick off `[x] [reranker] LightGBM model loaded` (after curl verifies it)
- [ ] Backfill the HF model card with: `library_name: lightgbm`, `pipeline_tag: text-ranking`, `datasets: [siddhm11/researchit-reranker-data]`, training data description, feature schema (37 names), reported metrics (nDCG@10=0.879 on 7,007-triple eval split, ≥2023 papers), and a clear "trained on citation pseudo-labels, NOT real user signal" disclaimer
- [ ] Upload `03_train_lightgbm.py` to a new repo `siddhm11/researchit-reranker-training` (or to the dataset repo as a script) so retraining is reproducible from public artifacts
### Phase 6.4 - Retraining (decision only; no code yet)
- [ ] Document deferral and the 30-day / 100-user trigger in PHASE6.md
- [ ] Open a tracking issue: "Phase 6.4b: synthetic user simulator" (target: +30d)
- [ ] Open a tracking issue: "Phase 6.4c: real-user retrain at 100-user threshold"
Part H - What Phase 6 Is NOT (scope boundary)
Phase 6 is integration of the existing trained reranker, plus the deployment story for it. The following are specifically out of scope for Phase 6; they are their own phases, with their own framing docs to come:
| Out of scope | Belongs to | Why it's separate |
|---|---|---|
| Offline regression harness, time-split eval framework, A/B test infrastructure | Phase 7: Evaluation Framework | Requires a held-out user-interaction log and a separate "shadow ranker" infra. Cannot be built before 6.3 ships, because we need stable feature semantics first. |
| LLM-generated paper summaries; cross-encoder distilled into LightGBM features | Phase 8: LLM Summaries + Distilled Reranker | Adds new feature slots (37 → 40+) and breaks the schema; must be a model version bump, not a 6.x patch. |
| Exploration bandits (UCB / Thompson over cluster heads), collaborative-filtering co-save signal | Phase 9: Exploration + CF | Needs a user population large enough for CF to be non-degenerate, and an exploration budget that Phase 6's MMR=0.6 currently approximates as a stopgap. |
| Migrating to a different reranker family (e.g. cross-encoder, ColBERT, BGE-reranker-v2-m3) | Phase 10+ | Explicitly forbidden in serving by CLAUDE.md rule 4. LightGBM is the serving model; anything heavier is a teacher in distillation, not a serving model. |
| Replacing citation pseudo-labels with click-through CTR labels from production | Phase 6.4c (real-user retrain) | Triggered by traffic threshold, not by code. |
The tightest possible definition of "Phase 6 done":
> Every candidate that arrives at the LightGBM ranker is described by a 37-dimensional feature vector in which slots 23-30 carry per-candidate signal derived from the user's actual cluster state and onboarding/category history; the ranker's inference is verified live on `siddhm11-researchit.hf.space` via `/healthz/reranker`; and a feature non-zero rate is logged per request. Retraining is deferred and documented.
When that sentence is true, Phase 6 ships and Phase 7 begins.
Caveats and unknowns (full disclosure)
- The HF model card itself was not directly readable from the verification environment used for this framing; only the structured Hub API metadata was (it exposes `tags`, `created_at`, `last_modified`, `pipeline_tag`, `library_name`, `description`, etc., all of which were null/empty for the model repo). The conclusion that "the README is empty" is an inference from the absent metadata fields a populated model card would normally surface, not a direct read of the file. Action item for Amin: open `https://huggingface.co/siddhm11/researchit-reranker-phase6` in a browser, look at the README and the Files tab, and confirm: (a) is `reranker_v1.txt` present at exactly 974 KB? (b) is there any README content at all? (c) are training scripts present? Adjust the Phase 6.3 doc-backfill checklist items accordingly.
- The 90,993 / 7,007 triple counts and the 141-tree figure come from the audit, not from a re-derivation against the live HF artifacts. The HF dataset metadata only confirms the order of magnitude (`100K<n<1M` rows, Parquet). If the real numbers differ, the framing logic does not change; only the diagnostic prose in Part B does.
- The Space's actual loaded model state is unknown without running the `curl /healthz/reranker` step in Section E.2. The TASK-TRACKER's unchecked box is the only authoritative signal we have today, and it says "not verified."
- The synthetic-user simulator (Phase 6.4b) is plausible but unproven. Its quality depends entirely on whether citation neighborhoods are a good proxy for "papers a user with interest profile X would save." That is an empirical question; the simulator is worth building only because the alternative is "do nothing until users arrive."