# Phase 6.5: Implementation Plan

> **Source:** `docs/phases/PHASE6.5-Instrumentation-Framing.md`
> **Timeline:** 5 days (each day leaves the app in a working state)
> **Prerequisite for:** Phase 7 (Evaluation Framework)

---

## Day 1: Phase 6 Hot-fix (A1 + A2)

### A1: Real Qdrant Cosine Scores (Feature 0 fix)

**Problem:** `recommendations.py:329-339` fakes Qdrant scores with linear rank decay (`1.0 - rank * 0.01`). Feature 0 is the model's fifth most important feature, so it should carry real cosine similarities from Qdrant.

**Root cause:** The search calls use `search_by_vector()` (returns `list[str]`) instead of `search_by_vector_with_scores()` (returns `list[dict]` with `{"arxiv_id": str, "score": float}`).

---

#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)

**Change 1: Per-cluster searches (lines 258-266):**
Switch from `search_by_vector()` to `search_by_vector_with_scores()`:

```diff
-        search_tasks = [
-            qdrant_svc.search_by_vector(
-                query_vector=c.medoid_embedding.tolist(),
-                limit=quota * _OVERSAMPLE,
-                exclude_ids=seen,
-            )
-            for c, quota in zip(clusters, quotas)
-        ]
-        per_cluster_results = await asyncio.gather(*search_tasks)
+        search_tasks = [
+            qdrant_svc.search_by_vector_with_scores(
+                query_vector=c.medoid_embedding.tolist(),
+                limit=quota * _OVERSAMPLE,
+                exclude_ids=seen,
+            )
+            for c, quota in zip(clusters, quotas)
+        ]
+        per_cluster_scored = await asyncio.gather(*search_tasks)
```

**Change 2: Build `paper_cluster_map` AND `qdrant_score_map` in one pass (lines 268-277):**

```diff
-        paper_cluster_map: dict[str, int] = {}
-        for cluster, result_ids in zip(clusters, per_cluster_results):
-            for aid in result_ids:
-                if aid not in paper_cluster_map:
-                    paper_cluster_map[aid] = cluster.cluster_idx
-
-        candidate_ids = merge_quota_results(list(per_cluster_results), quotas)
+        paper_cluster_map: dict[str, int] = {}
+        qdrant_score_map: dict[str, float] = {}
+        for cluster, scored_results in zip(clusters, per_cluster_scored):
+            for hit in scored_results:
+                aid = hit["arxiv_id"]
+                if aid not in paper_cluster_map:
+                    paper_cluster_map[aid] = cluster.cluster_idx
+                # Keep highest cosine if paper appears in multiple clusters
+                if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
+                    qdrant_score_map[aid] = float(hit["score"])
+
+        # merge_quota_results expects list[list[str]], so extract the IDs
+        per_cluster_ids = [[h["arxiv_id"] for h in scored] for scored in per_cluster_scored]
+        candidate_ids = merge_quota_results(per_cluster_ids, quotas)
```

**Change 3: Short-term supplement search (lines 280-290):**
Also switch to scored search:

```diff
-            st_results = await qdrant_svc.search_by_vector(
+            st_scored = await qdrant_svc.search_by_vector_with_scores(
                 query_vector=st_vec.tolist(),
                 limit=_ST_SUPPLEMENT,
                 exclude_ids=seen_so_far,
             )
-            for aid in st_results:
-                if aid not in set(candidate_ids):
-                    candidate_ids.append(aid)
+            for hit in st_scored:
+                aid = hit["arxiv_id"]
+                if aid not in set(candidate_ids):
+                    candidate_ids.append(aid)
+                    if aid not in qdrant_score_map:
+                        qdrant_score_map[aid] = float(hit["score"])
                     paper_cluster_map[aid] = -1  # short-term supplement
```

**Change 4: Delete fake score block (lines 329-339):**
The entire synthetic-decay block becomes dead code. Delete it:

```diff
-        # Build qdrant_score_map from per_cluster_results
-        # per_cluster_results is list[list[str]] - we need scores too.
-        # Use the paper_cluster_map to approximate: score = 1.0 - (rank / total)
-        # for now, as the current retrieval path returns only IDs.
-        # TODO: Phase 6.2+ switch to search_by_vector_with_scores()
-        qdrant_score_map: dict[str, float] = {}
-        for cluster_ids in per_cluster_results:
-            for rank, aid in enumerate(cluster_ids):
-                if aid not in qdrant_score_map:
-                    # Approximate score from rank position (higher rank = higher score)
-                    qdrant_score_map[aid] = max(0.0, 1.0 - rank * 0.01)
```

The existing `qdrant_scores = np.asarray(...)` on lines 341-344 stays as-is: it reads from `qdrant_score_map`, which now holds real cosines.
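
The merge rule from Change 2 can also be exercised outside the router, for example as the core of `test_qdrant_scores_are_real_cosines`. The following is a minimal sketch on toy data; the helper name `build_maps` and the `ScoredHit` type are hypothetical, and only the hit shape (`arxiv_id`, `score`) comes from the plan above.

```python
from typing import TypedDict


class ScoredHit(TypedDict):
    arxiv_id: str
    score: float


def build_maps(
    cluster_indices: list[int],
    per_cluster_scored: list[list[ScoredHit]],
) -> tuple[dict[str, int], dict[str, float]]:
    """Hypothetical helper: first cluster to return a paper owns it,
    and the highest cosine seen for that paper is the one that is kept."""
    paper_cluster_map: dict[str, int] = {}
    qdrant_score_map: dict[str, float] = {}
    for cluster_idx, hits in zip(cluster_indices, per_cluster_scored):
        for hit in hits:
            aid = hit["arxiv_id"]
            paper_cluster_map.setdefault(aid, cluster_idx)
            score = float(hit["score"])
            if score > qdrant_score_map.get(aid, float("-inf")):
                qdrant_score_map[aid] = score
    return paper_cluster_map, qdrant_score_map


# Toy data: paper "2401.00002" is returned by both clusters.
cluster_map, score_map = build_maps(
    [0, 1],
    [
        [{"arxiv_id": "2401.00001", "score": 0.91},
         {"arxiv_id": "2401.00002", "score": 0.62}],
        [{"arxiv_id": "2401.00002", "score": 0.88},
         {"arxiv_id": "2401.00003", "score": 0.70}],
    ],
)
```

Here the shared paper stays assigned to cluster 0 but carries the higher cosine from cluster 1, which is the behavior the diff in Change 2 describes.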

### A2: Verify `/healthz/reranker` live

> ✅ **Already done.** Verified 2026-05-03: `model_loaded: true, n_trees: 141, fallback_active: false`.

The only remaining step is to add this verification timestamp to `PHASE6-Reranker-Framing.md`.

---

## Day 2: B1 - `query_id` Linkage

### What it enables
Per-feed CTR: "out of 30 papers shown in this request, how many got saved?"
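
Once `query_id` is logged, the per-feed number could be computed with a single grouped query. This is only a sketch against an in-memory SQLite table: the `event_type` column, its `'save'` value, and the fixed feed size of 30 are assumptions for illustration, not the app's confirmed schema.

```python
import sqlite3

FEED_SIZE = 30  # assumed papers shown per request, taken from the example above

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down interactions table; only query_id is confirmed.
conn.execute(
    "CREATE TABLE interactions ("
    "user_id TEXT, arxiv_id TEXT, event_type TEXT, query_id TEXT, position INTEGER)"
)
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "2401.00001", "save", "q-aaa", 0),
        ("u1", "2401.00007", "save", "q-aaa", 4),
        ("u1", "2401.00009", "not_interested", "q-aaa", 6),
        ("u1", "2401.00011", "save", "q-bbb", 1),
    ],
)


def saves_per_feed(db: sqlite3.Connection, feed_size: int = FEED_SIZE) -> dict[str, float]:
    """Return save rate per query_id, ignoring rows logged without a query_id."""
    rows = db.execute(
        "SELECT query_id, COUNT(*) FROM interactions "
        "WHERE event_type = 'save' AND query_id != '' "
        "GROUP BY query_id ORDER BY query_id"
    ).fetchall()
    return {qid: n / feed_size for qid, n in rows}


ctr_by_feed = saves_per_feed(conn)
```

Feeds with zero saves do not appear in this result, since only interactions are logged; counting them would need a separate record of each served `query_id`.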

### Current state verified
- `interactions` table already has a `query_id TEXT` column ✅ (line 31 in DDL)
- `db.log_interaction()` already accepts `query_id` ✅ (line 135)
- `events.py` already accepts and forwards `query_id` via `Form(default="")` ✅ (line 26)
- **Missing:** `recommendations.py` never generates or passes `query_id`. The search router never generates one either, and the templates don't carry it.

---

#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)

**1. Generate `query_id` at the top of `get_recommendations()` (line 59):**

```python
import uuid  # module-level import, if not already present

query_id = str(uuid.uuid4())
```

**2. Thread `query_id` into `paper_tags` in all 3 tiers and the trending fallback:**

- Tier 1: In `_multi_interest_recommend()` return value, add `"query_id": query_id` to each tag dict (lines 455-458)
- Tier 2: EWMA fallback tags (lines 116-120): add `"query_id": query_id`
- Tier 3: Qdrant recommend tags (lines 131-135): add `"query_id": query_id`
- Trending fallback (lines 85-87): add `"query_id": query_id`

**3. Embed `query_id` + `position` into paper dicts (line 153-166):**

```python
for idx, aid in enumerate(rec_arxiv_ids):
    ...
    papers.append({
        **meta[aid],
        "saved": False,
        "dismissed": False,
        "ranker_version": tags.get("ranker_version", _RANKER_VERSION),
        "candidate_source": tags.get("candidate_source", ""),
        "cluster_id": tags.get("cluster_id", ""),
        "query_id": tags.get("query_id", ""),       # NEW
        "position": idx,                              # NEW
    })
```

> [!IMPORTANT]
> The `_multi_interest_recommend` signature needs updating to accept `query_id` as a parameter, since it's where the Tier 1 paper_tags are built. Alternatively, we generate `query_id` inside it and return it alongside the tags. I'll use the approach of passing it as a param.

---

#### [MODIFY] [search.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/search.py)

**Generate `query_id` per search and embed in paper dicts (line 70-77):**

```python
import uuid  # module-level import, if not already present

query_id = str(uuid.uuid4())  # generated once per /search request

for idx, p in enumerate(papers):
    p["saved"] = p["arxiv_id"] in saved_ids
    p["dismissed"] = p["arxiv_id"] in dismissed_ids
    p["query_id"] = query_id        # NEW
    p["position"] = idx             # NEW
```

---

#### [MODIFY] [action_buttons.html](file:///c:/Users/siddh/ResearchIT-Final/app/templates/partials/action_buttons.html)

**Add `query_id` and `position` to ALL three `hx-vals` JSON blobs:**

Add to template header:
```jinja2
{% set _query_id = paper.query_id | default("") if paper is defined else "" %}
{% set _position = paper.position | default(0) if paper is defined else 0 %}
```

Add to each `hx-vals`:
```
"query_id": "{{ _query_id }}", "position": "{{ _position }}"
```

The save button (line 37) already has `position`; update it to use `_position`. The not-interested buttons (lines 26, 45) need `query_id` and `position` added.

---

## Day 3: B2 - Propensity Logging

### What it enables
Counterfactual evaluation (SNIPS estimator): "what would have happened with ranker B?"
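
For context, SNIPS (self-normalized inverse propensity scoring) divides the importance-weighted reward sum by the sum of the importance weights, where each weight is the target policy's propensity over the logged propensity. The sketch below is a plain-Python illustration; the `LoggedEvent` fields and the choice to skip rows with a non-positive logged propensity are assumptions, not part of the framing doc.

```python
from dataclasses import dataclass


@dataclass
class LoggedEvent:
    reward: float               # e.g. 1.0 if the shown paper was saved, else 0.0
    logging_propensity: float   # propensity stored at serving time
    target_propensity: float    # propensity under the candidate ranker


def snips_estimate(events: list[LoggedEvent]) -> float:
    """Self-normalized IPS estimate of the target policy's mean reward."""
    weighted_reward = 0.0
    weight_sum = 0.0
    for e in events:
        if e.logging_propensity <= 0.0:
            continue  # assumption: rows without a usable propensity are skipped
        w = e.target_propensity / e.logging_propensity
        weighted_reward += w * e.reward
        weight_sum += w
    return weighted_reward / weight_sum if weight_sum > 0 else 0.0


# Toy log: importance weights are 1, 1, and 2, so the estimate is 3 / 4.
estimate = snips_estimate([
    LoggedEvent(reward=1.0, logging_propensity=1.0, target_propensity=1.0),
    LoggedEvent(reward=0.0, logging_propensity=1.0, target_propensity=1.0),
    LoggedEvent(reward=1.0, logging_propensity=0.25, target_propensity=0.5),
])
```

The division by `logging_propensity` is why every logged row needs a strictly positive propensity for the estimator to use it.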

---

#### [MODIFY] [db.py](file:///c:/Users/siddh/ResearchIT-Final/app/db.py)

**1. Migration (after `_MIGRATION_6_3`):**
```python
_MIGRATION_6_5 = [
    "ALTER TABLE interactions ADD COLUMN propensity REAL",
    "ALTER TABLE interactions ADD COLUMN policy_id TEXT",
]
```

**2. Run in `init_db()`.**

**3. Extend `log_interaction()` signature (line 129-149):**
Add `propensity: float | None = None` and `policy_id: str | None = None` kwargs. Extend the INSERT.
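
SQLite's `ALTER TABLE ... ADD COLUMN` raises a "duplicate column name" error when the column already exists, so running the migration on every `init_db()` call needs a guard. One possible guard is sketched below with the stdlib `sqlite3` module; the helper name `apply_migration_6_5`, the `(column, ddl)` pairing, and the use of `sqlite3` rather than the app's actual database client are all assumptions.

```python
import sqlite3

# Hypothetical pairing of column name and DDL so existence can be checked first.
_MIGRATION_6_5_COLUMNS = [
    ("propensity", "ALTER TABLE interactions ADD COLUMN propensity REAL"),
    ("policy_id", "ALTER TABLE interactions ADD COLUMN policy_id TEXT"),
]


def apply_migration_6_5(conn: sqlite3.Connection) -> list[str]:
    """Add missing columns only; return the names of the columns that were added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(interactions)")}
    added: list[str] = []
    for column, ddl in _MIGRATION_6_5_COLUMNS:
        if column not in existing:
            conn.execute(ddl)
            added.append(column)
    conn.commit()
    return added
```

Calling it a second time on the same database is a no-op, which keeps startup safe across restarts.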

---

#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)

**Compute propensity after `inject_exploration()` (line 443):**

```python
# Exploration papers: uniformly sampled from pool
explore_pool_size = max(1, len(reranked_ids) - len(mmr_selected))
# Clamp to 1.0 in case the exploration set is larger than the pool estimate
explore_propensity = min(1.0, len(exploration_set) / explore_pool_size)

# Exploitation (MMR-selected): deterministic, so propensity = 1.0
for aid in final:
    paper_tags[aid]["propensity"] = (
        explore_propensity if aid in exploration_set else 1.0
    )
    paper_tags[aid]["policy_id"] = _RANKER_VERSION
```

Thread `propensity` and `policy_id` into template context the same way as `query_id`.

---

#### [MODIFY] [search.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/search.py)

Search is fully deterministic, so `propensity = 1.0` for all results.

---

#### [MODIFY] [action_buttons.html](file:///c:/Users/siddh/ResearchIT-Final/app/templates/partials/action_buttons.html)

Add `propensity` and `policy_id` to `hx-vals`.

---

#### [MODIFY] [events.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/events.py)

Add `propensity: float = Form(default=0.0)` and `policy_id: str = Form(default="")` to both endpoints. Forward to `db.log_interaction()`.

---

## Day 4: B3 - Cluster Snapshot Versioning

### What it enables
Cluster history, debugging "why did recs shift?", content-addressed key for Phase 8a LLM summary cache.

---

#### [MODIFY] [db.py](file:///c:/Users/siddh/ResearchIT-Final/app/db.py)

**1. Add `cluster_snapshots` DDL to `_SCHEMA`:**
```sql
CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id              TEXT NOT NULL,
    snapshot_id          TEXT NOT NULL,
    cluster_idx          INTEGER NOT NULL,
    medoid_paper_id      TEXT NOT NULL,
    importance           REAL NOT NULL,
    paper_ids            TEXT NOT NULL,
    medoid_embedding_blob BLOB,
    snapshot_date        TEXT NOT NULL DEFAULT (datetime('now')),
    paper_ids_hash       TEXT NOT NULL,
    PRIMARY KEY (user_id, snapshot_id, cluster_idx)
);
CREATE INDEX IF NOT EXISTS idx_snap_user_date ON cluster_snapshots(user_id, snapshot_date DESC);
CREATE INDEX IF NOT EXISTS idx_snap_hash ON cluster_snapshots(paper_ids_hash);
```

**2. Add `save_cluster_snapshot()` and `prune_old_snapshots()` functions.**
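
A rough sketch of what these two functions might look like follows, using stdlib `sqlite3`. The cluster dict keys, the SHA-256-over-sorted-IDs hashing scheme, the JSON encoding of `paper_ids`, and the UUID `snapshot_id` are assumptions chosen for illustration; the framing doc may specify them differently.

```python
import hashlib
import json
import sqlite3
import uuid


def paper_ids_hash(paper_ids: list[str]) -> str:
    """Order-independent content hash of a cluster's paper set (assumed scheme)."""
    canonical = "\n".join(sorted(set(paper_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_cluster_snapshot(conn: sqlite3.Connection, user_id: str, clusters: list[dict]) -> str:
    """Append one row per cluster under a fresh snapshot_id and return that id.

    Each cluster is assumed to be a dict with the keys
    cluster_idx, medoid_paper_id, importance, and paper_ids.
    """
    snapshot_id = str(uuid.uuid4())
    for c in clusters:
        conn.execute(
            "INSERT INTO cluster_snapshots "
            "(user_id, snapshot_id, cluster_idx, medoid_paper_id, importance, "
            " paper_ids, paper_ids_hash) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                user_id,
                snapshot_id,
                c["cluster_idx"],
                c["medoid_paper_id"],
                c["importance"],
                json.dumps(c["paper_ids"]),
                paper_ids_hash(c["paper_ids"]),
            ),
        )
    conn.commit()
    return snapshot_id


def prune_old_snapshots(conn: sqlite3.Connection, retention_days: int = 30) -> int:
    """Delete snapshot rows older than the retention window; return rows deleted."""
    cur = conn.execute(
        "DELETE FROM cluster_snapshots WHERE snapshot_date < datetime('now', ?)",
        (f"-{int(retention_days)} days",),
    )
    conn.commit()
    return cur.rowcount
```

Hashing the sorted, de-duplicated IDs means two snapshots with the same paper set share a key regardless of ordering, which is what a content-addressed cache lookup would need.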

---

#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)

After `save_clusters_to_db(user_id, clusters)` (line ~253), call `db.save_cluster_snapshot()`.

---

#### [MODIFY] [main.py](file:///c:/Users/siddh/ResearchIT-Final/app/main.py)

Call `db.prune_old_snapshots(retention_days=30)` in the lifespan handler after `init_db()`.

---

## Day 5: B4 - Semantic Scholar Author Import

### What it enables
"Paste S2 URL → 20 implicit saves": replaces the friction of manual seed search.

---

#### [NEW] [s2_svc.py](file:///c:/Users/siddh/ResearchIT-Final/app/s2_svc.py)

Functions:
- `parse_author_input(text) → str | None`: accepts S2 URL, raw S2 ID, or ORCID
- `resolve_orcid(orcid) → str | None`: resolves ORCID via S2 author search
- `fetch_author_arxiv_papers(author_id, limit=50) → list[str]`: returns arXiv IDs

---

#### [MODIFY] [config.py](file:///c:/Users/siddh/ResearchIT-Final/app/config.py)

Add `S2_API_KEY = os.getenv("S2_API_KEY", "")`. The key is already in `.env`.

---

#### [MODIFY] [onboarding.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/onboarding.py)

Add `POST /api/onboarding/import-author` endpoint.

---

#### [NEW] Template partials for import step

- `partials/import_author.html`: the import form step
- `partials/import_success.html`: success confirmation
- `partials/import_error.html`: error message

---

## Verification Plan

### Automated Tests

After each day:

```bash
python -m pytest tests/ -v --tb=short
```

**New test files:**
- Day 1: Add `test_qdrant_scores_are_real_cosines` to `tests/test_phase6_feature_wiring.py`
- Day 2: Create `tests/test_instrumentation.py` with `test_query_id_round_trips`
- Day 3: Add `test_propensity_sums_correctly` to instrumentation tests
- Day 4: Add `test_snapshot_appended_on_each_recluster`, `test_prune_respects_retention`
- Day 5: Add `test_s2_import_saves_papers_with_correct_source_tag`

### Manual Verification

- Day 1: `curl -s https://siddhm11-researchit.hf.space/healthz/reranker` to confirm the model is still loaded after the code change
- Day 5: Test author import with real S2 profile URL

---

## Documentation Updates (after all days)

- [ ] CLAUDE.md: Add Rule 3.11: "Every interaction must carry `query_id`, `propensity`, and `policy_id`"
- [ ] TASK-TRACKER.md: Add Phase 6.5 section with checklist
- [ ] README.md: Update test count
- [ ] PHASE6-Reranker-Framing.md: Add live verification timestamp

---

## Open Questions

> [!IMPORTANT]
> **Q1:** The framing doc proposes `_RANKER_VERSION` as the `policy_id`. Currently it's `"v4.1_quota_hungarian_suppression"`. Should we also bump this to `"v6.5_lightgbm_real_cosines"` when Day 1 lands? It would make A/B-style log analysis cleaner.

> [!IMPORTANT]
> **Q2:** Day 5 (S2 author import) requires `httpx` as a dependency. It's already used by `turso_svc.py`, so no new install is needed; just confirming.

> [!NOTE]
> **Q3:** The framing doc suggests cluster snapshot pruning at startup. For a simple MVP this is fine. Phase 7 can upgrade to APScheduler if needed.