ResearchIT / docs /phases /PHASE6.5-Implementation-Plan.md

siddhm11

Phase 6.5 Day 1: Real Qdrant cosine scores (A1) + verification timestamp (A2)

3f58d41 30 days ago

Phase 6.5 — Implementation Plan

Source: docs/phases/PHASE6.5-Instrumentation-Framing.md
Timeline: 5 days (each day leaves the app in a working state)
Prerequisite for: Phase 7 (Evaluation Framework)

Day 1: Phase 6 Hot-fix (A1 + A2)

A1: Real Qdrant Cosine Scores (Feature 0 fix)

Problem: recommendations.py:329-339 fakes Qdrant scores with a linear rank decay (1.0 - rank * 0.01). Feature 0 is the model's fifth most important feature, so it should carry the real cosine similarities returned by Qdrant rather than a value derived from rank position.

Root cause: The search calls use search_by_vector() (returns list[str]) instead of search_by_vector_with_scores() (returns list[dict] with {"arxiv_id": str, "score": float}).

[MODIFY] recommendations.py

Change 1 — Per-cluster searches (line 258-266): Switch from search_by_vector() to search_by_vector_with_scores():

-        search_tasks = [
-            qdrant_svc.search_by_vector(
-                query_vector=c.medoid_embedding.tolist(),
-                limit=quota * _OVERSAMPLE,
-                exclude_ids=seen,
-            )
-            for c, quota in zip(clusters, quotas)
-        ]
-        per_cluster_results = await asyncio.gather(*search_tasks)
+        search_tasks = [
+            qdrant_svc.search_by_vector_with_scores(
+                query_vector=c.medoid_embedding.tolist(),
+                limit=quota * _OVERSAMPLE,
+                exclude_ids=seen,
+            )
+            for c, quota in zip(clusters, quotas)
+        ]
+        per_cluster_scored = await asyncio.gather(*search_tasks)

Change 2 — Build paper_cluster_map AND qdrant_score_map in one pass (line 268-277):

-        paper_cluster_map: dict[str, int] = {}
-        for cluster, result_ids in zip(clusters, per_cluster_results):
-            for aid in result_ids:
-                if aid not in paper_cluster_map:
-                    paper_cluster_map[aid] = cluster.cluster_idx
-
-        candidate_ids = merge_quota_results(list(per_cluster_results), quotas)
+        paper_cluster_map: dict[str, int] = {}
+        qdrant_score_map: dict[str, float] = {}
+        for cluster, scored_results in zip(clusters, per_cluster_scored):
+            for hit in scored_results:
+                aid = hit["arxiv_id"]
+                if aid not in paper_cluster_map:
+                    paper_cluster_map[aid] = cluster.cluster_idx
+                # Keep highest cosine if paper appears in multiple clusters
+                if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
+                    qdrant_score_map[aid] = float(hit["score"])
+
+        # merge_quota_results expects list[list[str]] — extract IDs
+        per_cluster_ids = [[h["arxiv_id"] for h in scored] for scored in per_cluster_scored]
+        candidate_ids = merge_quota_results(per_cluster_ids, quotas)

Change 3 — Short-term supplement search (line 280-290): Also switch to scored search:

-            st_results = await qdrant_svc.search_by_vector(
+            st_scored = await qdrant_svc.search_by_vector_with_scores(
                 query_vector=st_vec.tolist(),
                 limit=_ST_SUPPLEMENT,
                 exclude_ids=seen_so_far,
             )
-            for aid in st_results:
-                if aid not in set(candidate_ids):
-                    candidate_ids.append(aid)
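+            candidate_set = set(candidate_ids)  # build once instead of on every iteration
+            for hit in st_scored:
+                aid = hit["arxiv_id"]
+                if aid not in candidate_set:
+                    candidate_ids.append(aid)
+                    candidate_set.add(aid)
+                    if aid not in qdrant_score_map:
+                        qdrant_score_map[aid] = float(hit["score"])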
                     paper_cluster_map[aid] = -1  # short-term supplement

Change 4 — Delete fake score block (line 329-339): The entire synthetic-decay block becomes dead code. Delete it:

-        # Build qdrant_score_map from per_cluster_results
-        # per_cluster_results is list[list[str]] — we need scores too.
-        # Use the paper_cluster_map to approximate: score = 1.0 - (rank / total)
-        # for now, as the current retrieval path returns only IDs.
-        # TODO: Phase 6.2+ switch to search_by_vector_with_scores()
-        qdrant_score_map: dict[str, float] = {}
-        for cluster_ids in per_cluster_results:
-            for rank, aid in enumerate(cluster_ids):
-                if aid not in qdrant_score_map:
-                    # Approximate score from rank position (higher rank = higher score)
-                    qdrant_score_map[aid] = max(0.0, 1.0 - rank * 0.01)

The existing qdrant_scores = np.asarray(...) on line 341-344 stays as-is — it reads from qdrant_score_map which now has real cosines.
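As a self-contained illustration of the merge in Change 2, the sketch below folds per-cluster scored hits into a single score map and keeps the highest cosine when a paper is returned for more than one cluster. The helper name build_score_map and the sample hits are hypothetical; only the hit shape ({"arxiv_id": str, "score": float}) comes from the plan above.

```python
from typing import TypedDict


class ScoredHit(TypedDict):
    arxiv_id: str
    score: float


def build_score_map(per_cluster_scored: list[list[ScoredHit]]) -> dict[str, float]:
    """Merge per-cluster hits, keeping the highest cosine per paper (hypothetical helper)."""
    score_map: dict[str, float] = {}
    for scored_results in per_cluster_scored:
        for hit in scored_results:
            aid = hit["arxiv_id"]
            score = float(hit["score"])
            # A paper seen in several clusters keeps its best similarity.
            if score > score_map.get(aid, float("-inf")):
                score_map[aid] = score
    return score_map


# Made-up hits: paper "2401.00001" is returned by both clusters.
example = [
    [{"arxiv_id": "2401.00001", "score": 0.71}, {"arxiv_id": "2401.00002", "score": 0.64}],
    [{"arxiv_id": "2401.00001", "score": 0.83}],
]
score_map = build_score_map(example)
```

Unlike the rank-decay approximation, the values here depend only on what the search returned, so two papers at the same rank in different clusters can carry different scores.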

A2: Verify `/healthz/reranker` live

✅ Already done. Verified 2026-05-03: model_loaded: true, n_trees: 141, fallback_active: false.

Remaining work: add the verification timestamp to PHASE6-Reranker-Framing.md.

Day 2: B1 — `query_id` Linkage

What it enables

Per-feed CTR: "out of 30 papers shown in this request, how many got saved?"
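A minimal sketch of the per-feed query this linkage makes possible, using an in-memory SQLite table. The column name event_type, the 'save' value, and the fixed feed size of 30 are assumptions for illustration; the real interactions table has more columns and may name them differently.

```python
import sqlite3

FEED_SIZE = 30  # assumption: papers shown per request, taken from the example above

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; only query_id is confirmed by the plan.
conn.execute("CREATE TABLE interactions (query_id TEXT, arxiv_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [
        ("q-1", "2401.00001", "save"),
        ("q-1", "2401.00002", "save"),
        ("q-1", "2401.00003", "not_interested"),
        ("q-2", "2401.00004", "save"),
        ("", "2401.00005", "save"),  # row logged without a query_id
    ],
)


def per_feed_save_rate(db: sqlite3.Connection, feed_size: int = FEED_SIZE) -> dict[str, float]:
    """Fraction of each feed's papers that were saved, keyed by query_id."""
    rows = db.execute(
        """
        SELECT query_id, COUNT(DISTINCT arxiv_id)
        FROM interactions
        WHERE event_type = 'save' AND query_id != ''
        GROUP BY query_id
        """
    ).fetchall()
    return {qid: n_saved / feed_size for qid, n_saved in rows}


rates = per_feed_save_rate(conn)
```

Feeds with no saves do not appear in this result because nothing was logged for them; counting those would need a separate record of which feeds were served.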

Current state verified

interactions table already has a query_id TEXT column ✅ (line 31 in DDL)
db.log_interaction() already accepts query_id ✅ (line 135)
events.py already accepts and forwards query_id via Form(default="") ✅ (line 26)
Missing: recommendations.py never generates or passes query_id. Search router never generates one either. Templates don't carry it.

[MODIFY] recommendations.py

1. Generate query_id at the top of get_recommendations() (line 59):

import uuid  # at module top, if not already imported

query_id = str(uuid.uuid4())  # one ID per get_recommendations() call

2. Thread query_id into paper_tags in all three tiers and the trending fallback:

Tier 1: In _multi_interest_recommend() return value, add "query_id": query_id to each tag dict (line 455-458)
Tier 2: EWMA fallback tags (line 116-120) — add "query_id": query_id
Tier 3: Qdrant recommend tags (line 131-135) — add "query_id": query_id
Trending fallback (line 85-87) — add "query_id": query_id

3. Embed query_id + position into paper dicts (line 153-166):

for idx, aid in enumerate(rec_arxiv_ids):
    ...
    papers.append({
        **meta[aid],
        "saved": False,
        "dismissed": False,
        "ranker_version": tags.get("ranker_version", _RANKER_VERSION),
        "candidate_source": tags.get("candidate_source", ""),
        "cluster_id": tags.get("cluster_id", ""),
        "query_id": tags.get("query_id", ""),       # NEW
        "position": idx,                              # NEW
    })

The _multi_interest_recommend signature needs updating to accept query_id as a parameter, since that is where the Tier 1 paper_tags are built. The alternative is to generate query_id inside it and return it alongside the tags. This plan passes it in as a parameter.

[MODIFY] search.py

Generate query_id per search and embed in paper dicts (line 70-77):

import uuid  # at module top, if not already imported

query_id = str(uuid.uuid4())  # generated once per /search request

for idx, p in enumerate(papers):
    p["saved"] = p["arxiv_id"] in saved_ids
    p["dismissed"] = p["arxiv_id"] in dismissed_ids
    p["query_id"] = query_id        # NEW
    p["position"] = idx             # NEW

[MODIFY] action_buttons.html

Add query_id and position to ALL three hx-vals JSON blobs:

Add to template header:

{% set _query_id = paper.query_id | default("") if paper is defined else "" %}
{% set _position = paper.position | default(0) if paper is defined else 0 %}

Add to each hx-vals:

"query_id": "{{ _query_id }}", "position": "{{ _position }}"

The save button (line 37) already has position — update to use _position. The not-interested buttons (line 26, 45) need query_id and position added.

Day 3: B2 — Propensity Logging

What it enables

Counterfactual evaluation (SNIPS estimator) — "what would have happened with ranker B?"
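For reference, a minimal sketch of the self-normalized inverse propensity score (SNIPS) estimator that the logged propensities feed into. The LoggedEvent record and its field names are hypothetical; the formula is the standard one, a weighted mean of rewards with weights equal to the target policy's propensity divided by the logging policy's propensity.

```python
from dataclasses import dataclass


@dataclass
class LoggedEvent:
    reward: float              # e.g. 1.0 if the paper was saved, else 0.0
    logging_propensity: float  # probability the logging policy showed this paper
    target_propensity: float   # probability the candidate policy would show it


def snips_estimate(events: list[LoggedEvent]) -> float:
    """Estimate the target policy's mean reward from logged data."""
    weights = []
    for event in events:
        if event.logging_propensity <= 0.0:
            raise ValueError("logging propensity must be greater than 0")
        weights.append(event.target_propensity / event.logging_propensity)
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # the target policy never agrees with the log
    weighted_reward = sum(w * e.reward for w, e in zip(weights, events))
    return weighted_reward / total_weight


# Made-up log: two deterministic exposures and one exploration exposure.
events = [
    LoggedEvent(reward=1.0, logging_propensity=1.0, target_propensity=1.0),
    LoggedEvent(reward=0.0, logging_propensity=1.0, target_propensity=1.0),
    LoggedEvent(reward=1.0, logging_propensity=0.25, target_propensity=0.5),
]
estimate = snips_estimate(events)  # weights 1, 1, 2 -> (1 + 0 + 2) / 4
```

The division by the logging propensity is why every logged value must be strictly positive: a stored 0.0 makes the weight undefined.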

[MODIFY] db.py

1. Migration (after _MIGRATION_6_3):

_MIGRATION_6_5 = [
    "ALTER TABLE interactions ADD COLUMN propensity REAL",
    "ALTER TABLE interactions ADD COLUMN policy_id TEXT",
]

2. Run in init_db().

3. Extend log_interaction() signature (line 129-149): Add propensity: float | None = None and policy_id: str | None = None kwargs. Extend the INSERT.
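The three steps can be sketched against plain sqlite3 as follows. This is an assumption-laden stand-in: the real db module may be async and talk to Turso, and the simplified log_interaction() signature and column list here are illustrative rather than the project's actual ones. The point shown is that SQLite has no ADD COLUMN IF NOT EXISTS, so a migration that runs on every startup has to tolerate the "duplicate column name" error.

```python
import sqlite3
from typing import Optional

_MIGRATION_6_5 = [
    "ALTER TABLE interactions ADD COLUMN propensity REAL",
    "ALTER TABLE interactions ADD COLUMN policy_id TEXT",
]


def run_migration(conn: sqlite3.Connection, statements: list[str]) -> None:
    """Apply ALTER TABLE statements, skipping columns that already exist."""
    for stmt in statements:
        try:
            conn.execute(stmt)
        except sqlite3.OperationalError as exc:
            # SQLite raises "duplicate column name: <col>" when re-run.
            if "duplicate column name" not in str(exc).lower():
                raise
    conn.commit()


def log_interaction(
    conn: sqlite3.Connection,
    user_id: str,
    arxiv_id: str,
    event_type: str,
    query_id: str = "",
    propensity: Optional[float] = None,  # NEW
    policy_id: Optional[str] = None,     # NEW
) -> None:
    """Simplified stand-in for db.log_interaction() with the two new kwargs."""
    conn.execute(
        "INSERT INTO interactions "
        "(user_id, arxiv_id, event_type, query_id, propensity, policy_id) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, arxiv_id, event_type, query_id, propensity, policy_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Assumed pre-6.5 table shape, reduced to the columns this sketch needs.
conn.execute(
    "CREATE TABLE interactions (user_id TEXT, arxiv_id TEXT, event_type TEXT, query_id TEXT)"
)
run_migration(conn, _MIGRATION_6_5)
run_migration(conn, _MIGRATION_6_5)  # a second run is a no-op
log_interaction(conn, "u1", "2401.00001", "save", query_id="q-1",
                propensity=0.5, policy_id="demo_policy")
log_interaction(conn, "u1", "2401.00002", "save")  # caller that predates 6.5
rows = conn.execute(
    "SELECT arxiv_id, propensity, policy_id FROM interactions ORDER BY arxiv_id"
).fetchall()
```

Because both new kwargs default to None, existing callers keep working and their rows store NULL in the new columns.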

[MODIFY] recommendations.py

Compute propensity after inject_exploration() (line 443):

# Exploration papers: uniformly sampled from pool
# max(1, ...) guards against an empty pool; min(1.0, ...) keeps the result a valid probability
explore_pool_size = max(1, len(reranked_ids) - len(mmr_selected))
explore_propensity = min(1.0, len(exploration_set) / explore_pool_size)

# Exploitation (MMR-selected): deterministic → propensity = 1.0
for aid in final:
    paper_tags[aid]["propensity"] = (
        explore_propensity if aid in exploration_set else 1.0
    )
    paper_tags[aid]["policy_id"] = _RANKER_VERSION

Thread propensity and policy_id into template context the same way as query_id.

[MODIFY] search.py

Search is fully deterministic → propensity = 1.0 for all results.

[MODIFY] action_buttons.html

Add propensity and policy_id to hx-vals.

[MODIFY] events.py

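Add propensity: float | None = Form(default=None) and policy_id: str = Form(default="") to both endpoints. Forward to db.log_interaction(). A default of 0.0 would be stored as if it were a real propensity and would make inverse-propensity weights (1 / propensity) undefined, so a missing value stays NULL, matching the log_interaction() default.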

Day 4: B3 — Cluster Snapshot Versioning

What it enables

Cluster history, debugging "why did recs shift?", content-addressed key for Phase 8a LLM summary cache.

[MODIFY] db.py

1. Add cluster_snapshots DDL to _SCHEMA:

CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id              TEXT NOT NULL,
    snapshot_id          TEXT NOT NULL,
    cluster_idx          INTEGER NOT NULL,
    medoid_paper_id      TEXT NOT NULL,
    importance           REAL NOT NULL,
    paper_ids            TEXT NOT NULL,
    medoid_embedding_blob BLOB,
    snapshot_date        TEXT NOT NULL DEFAULT (datetime('now')),
    paper_ids_hash       TEXT NOT NULL,
    PRIMARY KEY (user_id, snapshot_id, cluster_idx)
);
CREATE INDEX IF NOT EXISTS idx_snap_user_date ON cluster_snapshots(user_id, snapshot_date DESC);
CREATE INDEX IF NOT EXISTS idx_snap_hash ON cluster_snapshots(paper_ids_hash);

2. Add save_cluster_snapshot() and prune_old_snapshots() functions.
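One possible shape for the two functions, sketched against plain sqlite3. Several things here are assumptions: the hashing scheme (SHA-256 over the sorted paper IDs), the use of a UUID as snapshot_id, dicts standing in for the real cluster objects, and the simplified signatures that take a connection. The embedding blob is left NULL to keep the sketch short.

```python
import hashlib
import json
import sqlite3
import uuid

_SNAPSHOT_DDL = """
CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id TEXT NOT NULL, snapshot_id TEXT NOT NULL, cluster_idx INTEGER NOT NULL,
    medoid_paper_id TEXT NOT NULL, importance REAL NOT NULL, paper_ids TEXT NOT NULL,
    medoid_embedding_blob BLOB,
    snapshot_date TEXT NOT NULL DEFAULT (datetime('now')),
    paper_ids_hash TEXT NOT NULL,
    PRIMARY KEY (user_id, snapshot_id, cluster_idx)
)
"""


def paper_ids_hash(paper_ids: list[str]) -> str:
    """Order-independent content hash of a cluster's papers (assumed scheme)."""
    canonical = json.dumps(sorted(paper_ids), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_cluster_snapshot(conn: sqlite3.Connection, user_id: str, clusters: list[dict]) -> str:
    """Append one row per cluster under a fresh snapshot_id and return that ID."""
    snapshot_id = uuid.uuid4().hex
    for c in clusters:
        conn.execute(
            "INSERT INTO cluster_snapshots (user_id, snapshot_id, cluster_idx, "
            "medoid_paper_id, importance, paper_ids, paper_ids_hash) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (user_id, snapshot_id, c["cluster_idx"], c["medoid_paper_id"],
             c["importance"], json.dumps(c["paper_ids"]), paper_ids_hash(c["paper_ids"])),
        )
    conn.commit()
    return snapshot_id


def prune_old_snapshots(conn: sqlite3.Connection, retention_days: int = 30) -> int:
    """Delete rows older than the retention window; returns the number removed."""
    cur = conn.execute(
        "DELETE FROM cluster_snapshots WHERE snapshot_date < datetime('now', ?)",
        (f"-{int(retention_days)} days",),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(_SNAPSHOT_DDL)
cluster = {"cluster_idx": 0, "medoid_paper_id": "2401.00001",
           "importance": 0.6, "paper_ids": ["2401.00002", "2401.00001"]}
snapshot_id = save_cluster_snapshot(conn, "u1", [cluster])
old_id = save_cluster_snapshot(conn, "u1", [cluster])
# Backdate the second snapshot so the pruning step has something to remove.
conn.execute(
    "UPDATE cluster_snapshots SET snapshot_date = datetime('now', '-45 days') "
    "WHERE snapshot_id = ?", (old_id,),
)
pruned = prune_old_snapshots(conn, retention_days=30)
remaining = conn.execute("SELECT COUNT(*) FROM cluster_snapshots").fetchone()[0]
```

Sorting before hashing means two reclusterings that produce the same set of papers in a different order share one paper_ids_hash, which is the property a content-addressed cache key needs.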

[MODIFY] recommendations.py

After save_clusters_to_db(user_id, clusters) (line ~253), call db.save_cluster_snapshot().

[MODIFY] main.py

Call db.prune_old_snapshots(retention_days=30) in the lifespan handler after init_db().

Day 5: B4 — Semantic Scholar Author Import

What it enables

"Paste S2 URL → 20 implicit saves" — replaces manual seed search friction.

[NEW] s2_svc.py

Functions:

parse_author_input(text) → str | None — accepts S2 URL, raw S2 ID, or ORCID
resolve_orcid(orcid) → str | None — resolves ORCID via S2 author search
fetch_author_arxiv_papers(author_id, limit=50) → list[str] — returns arXiv IDs
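A rough sketch of the input parsing and of pulling arXiv IDs out of a paper list, with no network calls. The URL pattern, the regexes, and the helper extract_arxiv_ids are assumptions for illustration; the payload shape assumes each paper carries an externalIds mapping with an "ArXiv" key, which should be checked against the Semantic Scholar API response actually requested.

```python
import re
from typing import Optional

# Assumed URL shape: https://www.semanticscholar.org/author/<name-slug>/<numeric id>
_S2_URL_RE = re.compile(r"semanticscholar\.org/author/(?:[^/?#]+/)?(\d+)")
# ORCID iDs are four hyphen-separated groups of four; the last character may be X.
_ORCID_RE = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])", re.IGNORECASE)
_S2_ID_RE = re.compile(r"\d+")


def parse_author_input(text: str) -> Optional[str]:
    """Return the S2 author ID or ORCID found in the input, or None."""
    text = text.strip()
    match = _S2_URL_RE.search(text)
    if match:
        return match.group(1)
    match = _ORCID_RE.search(text)
    if match:
        return match.group(1).upper()
    if _S2_ID_RE.fullmatch(text):
        return text
    return None


def extract_arxiv_ids(papers: list[dict]) -> list[str]:
    """Collect unique arXiv IDs, in order, from paper records (hypothetical helper)."""
    ids: list[str] = []
    for paper in papers:
        external = paper.get("externalIds") or {}
        arxiv_id = external.get("ArXiv")
        if arxiv_id and arxiv_id not in ids:
            ids.append(arxiv_id)
    return ids


parsed_url = parse_author_input(" https://www.semanticscholar.org/author/Some-Author/1234567 ")
parsed_orcid = parse_author_input("https://orcid.org/0000-0002-1825-0097")
arxiv_ids = extract_arxiv_ids([
    {"externalIds": {"ArXiv": "1706.03762"}},
    {"externalIds": {"DOI": "10.1000/example"}},  # no arXiv version, skipped
    {"externalIds": None},
])
```

Since the return type is a bare string, the caller still has to tell an ORCID from an S2 ID (for example by the hyphens) before deciding whether to call resolve_orcid().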

[MODIFY] config.py

Add S2_API_KEY = os.getenv("S2_API_KEY", "") — key already in .env.

[MODIFY] onboarding.py

Add POST /api/onboarding/import-author endpoint.

[NEW] Template partials for import step

partials/import_author.html — the import form step
partials/import_success.html — success confirmation
partials/import_error.html — error message

Verification Plan

Automated Tests

After each day:

python -m pytest tests/ -v --tb=short

New test files:

Day 1: Add test_qdrant_scores_are_real_cosines to tests/test_phase6_feature_wiring.py
Day 2: Create tests/test_instrumentation.py — test_query_id_round_trips
Day 3: Add test_propensity_sums_correctly to instrumentation tests
Day 4: Add test_snapshot_appended_on_each_recluster, test_prune_respects_retention
Day 5: Add test_s2_import_saves_papers_with_correct_source_tag

Manual Verification

Day 1: curl -s https://siddhm11-researchit.hf.space/healthz/reranker — confirm model still loaded after code change
Day 5: Test author import with real S2 profile URL

Documentation Updates (after all days)

CLAUDE.md: Add Rule 3.11 — "Every interaction must carry query_id, propensity, and policy_id"
TASK-TRACKER.md: Add Phase 6.5 section with checklist
README.md: Update test count
PHASE6-Reranker-Framing.md: Add live verification timestamp

Open Questions

Q1: The framing doc proposes _RANKER_VERSION as the policy_id. Currently it's "v4.1_quota_hungarian_suppression". Should we also bump this to "v6.5_lightgbm_real_cosines" when Day 1 lands? It would make A/B-style log analysis cleaner.

Q2: Day 5 (S2 author import) requires httpx as a dependency. It's already used by turso_svc.py, so no new install needed — just confirming.

Q3: The framing doc suggests cluster snapshot pruning at startup. For a simple MVP this is fine. Phase 7 can upgrade to APScheduler if needed.
