# Phase 6.5: Implementation Plan
> **Source:** `docs/phases/PHASE6.5-Instrumentation-Framing.md`
> **Timeline:** 5 days (each day leaves the app in a working state)
> **Prerequisite for:** Phase 7 (Evaluation Framework)
---
## Day 1: Phase 6 Hot-fix (A1 + A2)
### A1: Real Qdrant Cosine Scores (Feature 0 fix)
**Problem:** `recommendations.py:329-339` fakes Qdrant scores with linear rank decay (`1.0 - rank * 0.01`). Feature 0 is the model's fifth most important feature, so it should carry real cosine similarities from Qdrant.
**Root cause:** The search calls use `search_by_vector()` (returns `list[str]`) instead of `search_by_vector_with_scores()` (returns `list[dict]` with `{"arxiv_id": str, "score": float}`).
---
#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)
**Change 1: Per-cluster searches (lines 258-266):**
Switch from `search_by_vector()` to `search_by_vector_with_scores()`:
```diff
- search_tasks = [
-     qdrant_svc.search_by_vector(
-         query_vector=c.medoid_embedding.tolist(),
-         limit=quota * _OVERSAMPLE,
-         exclude_ids=seen,
-     )
-     for c, quota in zip(clusters, quotas)
- ]
- per_cluster_results = await asyncio.gather(*search_tasks)
+ search_tasks = [
+     qdrant_svc.search_by_vector_with_scores(
+         query_vector=c.medoid_embedding.tolist(),
+         limit=quota * _OVERSAMPLE,
+         exclude_ids=seen,
+     )
+     for c, quota in zip(clusters, quotas)
+ ]
+ per_cluster_scored = await asyncio.gather(*search_tasks)
```
**Change 2: Build `paper_cluster_map` AND `qdrant_score_map` in one pass (lines 268-277):**
```diff
- paper_cluster_map: dict[str, int] = {}
- for cluster, result_ids in zip(clusters, per_cluster_results):
-     for aid in result_ids:
-         if aid not in paper_cluster_map:
-             paper_cluster_map[aid] = cluster.cluster_idx
-
- candidate_ids = merge_quota_results(list(per_cluster_results), quotas)
+ paper_cluster_map: dict[str, int] = {}
+ qdrant_score_map: dict[str, float] = {}
+ for cluster, scored_results in zip(clusters, per_cluster_scored):
+     for hit in scored_results:
+         aid = hit["arxiv_id"]
+         if aid not in paper_cluster_map:
+             paper_cluster_map[aid] = cluster.cluster_idx
+         # Keep highest cosine if paper appears in multiple clusters
+         if aid not in qdrant_score_map or hit["score"] > qdrant_score_map[aid]:
+             qdrant_score_map[aid] = float(hit["score"])
+
+ # merge_quota_results expects list[list[str]], so extract the IDs
+ per_cluster_ids = [[h["arxiv_id"] for h in scored] for scored in per_cluster_scored]
+ candidate_ids = merge_quota_results(per_cluster_ids, quotas)
```
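To make the one-pass logic in Change 2 easy to unit-test, it can be pulled into a pure function. The following is a minimal sketch, not the actual router code: `build_candidate_maps` and `ScoredHit` are hypothetical names, and the hit shape (`arxiv_id`, `score`) is taken from the return type described for `search_by_vector_with_scores()` above.

```python
from __future__ import annotations

from typing import TypedDict


class ScoredHit(TypedDict):
    """Shape of one scored search hit, as described in the root-cause note."""
    arxiv_id: str
    score: float


def build_candidate_maps(
    cluster_indices: list[int],
    per_cluster_scored: list[list[ScoredHit]],
) -> tuple[dict[str, int], dict[str, float]]:
    """Return (paper_cluster_map, qdrant_score_map) built in a single pass."""
    paper_cluster_map: dict[str, int] = {}
    qdrant_score_map: dict[str, float] = {}
    for cluster_idx, hits in zip(cluster_indices, per_cluster_scored):
        for hit in hits:
            aid = hit["arxiv_id"]
            # The first cluster that surfaces a paper owns it.
            paper_cluster_map.setdefault(aid, cluster_idx)
            # Keep the highest cosine when a paper appears in several clusters.
            score = float(hit["score"])
            if score > qdrant_score_map.get(aid, float("-inf")):
                qdrant_score_map[aid] = score
    return paper_cluster_map, qdrant_score_map


# Placeholder IDs and scores, for illustration only.
hits_a: list[ScoredHit] = [
    {"arxiv_id": "2401.00001", "score": 0.82},
    {"arxiv_id": "2401.00002", "score": 0.75},
]
hits_b: list[ScoredHit] = [{"arxiv_id": "2401.00002", "score": 0.91}]
cluster_map, score_map = build_candidate_maps([0, 1], [hits_a, hits_b])
```

In this example the second paper stays assigned to cluster 0 (first seen) but carries the higher cosine from cluster 1, which matches the behaviour of the diff.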
**Change 3: Short-term supplement search (lines 280-290):**
Also switch to scored search:
```diff
- st_results = await qdrant_svc.search_by_vector(
+ st_scored = await qdrant_svc.search_by_vector_with_scores(
      query_vector=st_vec.tolist(),
      limit=_ST_SUPPLEMENT,
      exclude_ids=seen_so_far,
  )
- for aid in st_results:
-     if aid not in set(candidate_ids):
-         candidate_ids.append(aid)
+ for hit in st_scored:
+     aid = hit["arxiv_id"]
+     if aid not in set(candidate_ids):
+         candidate_ids.append(aid)
+         if aid not in qdrant_score_map:
+             qdrant_score_map[aid] = float(hit["score"])
          paper_cluster_map[aid] = -1  # short-term supplement
```
**Change 4: Delete fake score block (lines 329-339):**
The entire synthetic-decay block becomes dead code. Delete it:
```diff
- # Build qdrant_score_map from per_cluster_results
- # per_cluster_results is list[list[str]]; we need scores too.
- # Use the paper_cluster_map to approximate: score = 1.0 - (rank / total)
- # for now, as the current retrieval path returns only IDs.
- # TODO: Phase 6.2+ switch to search_by_vector_with_scores()
- qdrant_score_map: dict[str, float] = {}
- for cluster_ids in per_cluster_results:
-     for rank, aid in enumerate(cluster_ids):
-         if aid not in qdrant_score_map:
-             # Approximate score from rank position (higher rank = higher score)
-             qdrant_score_map[aid] = max(0.0, 1.0 - rank * 0.01)
```
The existing `qdrant_scores = np.asarray(...)` on lines 341-344 stays as-is: it reads from `qdrant_score_map`, which now holds real cosines.
### A2: Verify `/healthz/reranker` live
> **Already done.** Verified 2026-05-03: `model_loaded: true, n_trees: 141, fallback_active: false`.
> The only remaining step is to add the verification timestamp to `PHASE6-Reranker-Framing.md`.
---
## Day 2: B1 β `query_id` Linkage
### What it enables
Per-feed CTR: "out of 30 papers shown in this request, how many got saved?"
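As a rough illustration of the metric this unlocks, the sketch below computes per-feed CTR from logged interaction rows grouped by `query_id`. The row keys `event_type` and `arxiv_id`, the `"save"` event value, and the function name `per_feed_ctr` are assumptions made for the example; only `query_id` is confirmed by this plan.

```python
from __future__ import annotations

from collections import defaultdict


def per_feed_ctr(interactions: list[dict], feed_size: int = 30) -> dict[str, float]:
    """Map each query_id to (distinct saved papers) / (papers shown in that feed)."""
    saved_by_feed: dict[str, set[str]] = defaultdict(set)
    for row in interactions:
        # Rows logged before query_id linkage have an empty query_id; skip them.
        if row.get("query_id") and row.get("event_type") == "save":
            saved_by_feed[row["query_id"]].add(row["arxiv_id"])
    return {qid: len(ids) / feed_size for qid, ids in saved_by_feed.items()}


# Placeholder rows, for illustration only.
rows = [
    {"query_id": "q1", "event_type": "save", "arxiv_id": "2401.00001"},
    {"query_id": "q1", "event_type": "save", "arxiv_id": "2401.00002"},
    {"query_id": "q1", "event_type": "save", "arxiv_id": "2401.00002"},  # duplicate click
    {"query_id": "q1", "event_type": "not_interested", "arxiv_id": "2401.00003"},
    {"query_id": "", "event_type": "save", "arxiv_id": "2401.00004"},  # pre-linkage row
]
ctr = per_feed_ctr(rows)
```

Feeds with no saves do not appear in the result, because this sketch only sees interaction rows; counting them would need a separate record of which feeds were served.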
### Current state verified
- `interactions` table already has a `query_id TEXT` column (line 31 in DDL)
- `db.log_interaction()` already accepts `query_id` (line 135)
- `events.py` already accepts and forwards `query_id` via `Form(default="")` (line 26)
- **Missing:** `recommendations.py` never generates or passes `query_id`. Search router never generates one either. Templates don't carry it.
---
#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)
**1. Generate `query_id` at the top of `get_recommendations()` (line 59):**
```python
import uuid  # add to the module imports if not already present

query_id = str(uuid.uuid4())
```
**2. Thread `query_id` into `paper_tags` in all three tiers and the trending fallback:**
- Tier 1: In the `_multi_interest_recommend()` return value, add `"query_id": query_id` to each tag dict (lines 455-458)
- Tier 2: EWMA fallback tags (lines 116-120): add `"query_id": query_id`
- Tier 3: Qdrant recommend tags (lines 131-135): add `"query_id": query_id`
- Trending fallback (lines 85-87): add `"query_id": query_id`
**3. Embed `query_id` + `position` into paper dicts (line 153-166):**
```python
for idx, aid in enumerate(rec_arxiv_ids):
    ...
    papers.append({
        **meta[aid],
        "saved": False,
        "dismissed": False,
        "ranker_version": tags.get("ranker_version", _RANKER_VERSION),
        "candidate_source": tags.get("candidate_source", ""),
        "cluster_id": tags.get("cluster_id", ""),
        "query_id": tags.get("query_id", ""),  # NEW
        "position": idx,  # NEW
    })
```
> [!IMPORTANT]
> The `_multi_interest_recommend` signature needs updating to accept `query_id` as a parameter, since that is where the Tier 1 `paper_tags` are built. The alternative is to generate `query_id` inside it and return it alongside the tags. This plan passes it as a parameter.
---
#### [MODIFY] [search.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/search.py)
**Generate `query_id` per search and embed in paper dicts (line 70-77):**
```python
query_id = str(uuid.uuid4()) # generated once per /search request
for idx, p in enumerate(papers):
    p["saved"] = p["arxiv_id"] in saved_ids
    p["dismissed"] = p["arxiv_id"] in dismissed_ids
    p["query_id"] = query_id  # NEW
    p["position"] = idx  # NEW
```
---
#### [MODIFY] [action_buttons.html](file:///c:/Users/siddh/ResearchIT-Final/app/templates/partials/action_buttons.html)
**Add `query_id` and `position` to ALL three `hx-vals` JSON blobs:**
Add to template header:
```jinja2
{% set _query_id = paper.query_id | default("") if paper is defined else "" %}
{% set _position = paper.position | default(0) if paper is defined else 0 %}
```
Add to each `hx-vals`:
```
"query_id": "{{ _query_id }}", "position": "{{ _position }}"
```
The save button (line 37) already has `position`; update it to use `_position`. The not-interested buttons (lines 26 and 45) need both `query_id` and `position` added.
---
## Day 3: B2 β Propensity Logging
### What it enables
Counterfactual evaluation (SNIPS estimator): "what would have happened with ranker B?"
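For reference, the self-normalized inverse propensity scoring (SNIPS) estimate is $\hat{V} = \sum_i w_i r_i / \sum_i w_i$ with $w_i = \pi_{\text{new}}(a_i \mid x_i) / \pi_{\text{log}}(a_i \mid x_i)$, where $\pi_{\text{log}}$ is the propensity logged at serving time. The sketch below is a minimal, dependency-free version; the function name and argument layout are assumptions for illustration, not part of the planned Phase 7 code.

```python
from __future__ import annotations


def snips_estimate(
    rewards: list[float],
    logged_propensities: list[float | None],
    target_propensities: list[float],
) -> float | None:
    """Self-normalized IPS estimate of the target policy's mean reward.

    Rows whose logged propensity is missing or non-positive are skipped,
    because their importance weight is undefined.
    """
    weighted_reward = 0.0
    weight_total = 0.0
    for reward, p_log, p_new in zip(rewards, logged_propensities, target_propensities):
        if p_log is None or p_log <= 0.0:
            continue
        weight = p_new / p_log
        weighted_reward += weight * reward
        weight_total += weight
    if weight_total == 0.0:
        return None  # no usable rows
    return weighted_reward / weight_total


# Toy log: two deterministic (exploitation) rows and two exploration rows.
estimate = snips_estimate(
    rewards=[1.0, 0.0, 1.0, 0.0],
    logged_propensities=[1.0, 1.0, 0.25, 0.25],
    target_propensities=[1.0, 1.0, 0.5, 0.5],
)
```

Here the weights are 1, 1, 2, 2, so the estimate is 3 / 6 = 0.5. Note that any row stored with a propensity of `0.0` or `NULL` is unusable for this estimator, which is worth keeping in mind when choosing defaults for the logging path.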
---
#### [MODIFY] [db.py](file:///c:/Users/siddh/ResearchIT-Final/app/db.py)
**1. Migration (after `_MIGRATION_6_3`):**
```python
_MIGRATION_6_5 = [
    "ALTER TABLE interactions ADD COLUMN propensity REAL",
    "ALTER TABLE interactions ADD COLUMN policy_id TEXT",
]
```
**2. Run in `init_db()`.**
**3. Extend `log_interaction()` signature (line 129-149):**
Add `propensity: float | None = None` and `policy_id: str | None = None` kwargs. Extend the INSERT.
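Because `ALTER TABLE ... ADD COLUMN` fails in SQLite when the column already exists, the migration needs to be safe to re-run on every startup. The sketch below shows one way to do that with the standard `sqlite3` module; it assumes a SQLite-compatible connection, and `apply_migration_6_5` and `_NEW_COLUMNS_6_5` are hypothetical names (the project may already handle this inside `init_db()`).

```python
from __future__ import annotations

import sqlite3

# Hypothetical column spec mirroring the two ALTER statements above.
_NEW_COLUMNS_6_5 = {"propensity": "REAL", "policy_id": "TEXT"}


def apply_migration_6_5(conn: sqlite3.Connection) -> list[str]:
    """Add the Phase 6.5 columns if missing; return the columns actually added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(interactions)")}
    added = []
    for column, sql_type in _NEW_COLUMNS_6_5.items():
        if column not in existing:
            # Identifiers come from the constant above, never from user input.
            conn.execute(f"ALTER TABLE interactions ADD COLUMN {column} {sql_type}")
            added.append(column)
    conn.commit()
    return added


# Demo against an in-memory database with a stand-in interactions table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE interactions (arxiv_id TEXT, query_id TEXT)")
first_run = apply_migration_6_5(demo_conn)
second_run = apply_migration_6_5(demo_conn)  # no-op on the second call
```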
---
#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)
**Compute propensity after `inject_exploration()` (line 443):**
```python
# Exploration papers: uniformly sampled from the pool
explore_pool_size = max(1, len(reranked_ids) - len(mmr_selected))
# max(1, ...) already rules out division by zero; clamp so the value stays a valid probability
explore_propensity = min(1.0, len(exploration_set) / explore_pool_size)
# Exploitation (MMR-selected): deterministic, so propensity = 1.0
for aid in final:
    paper_tags[aid]["propensity"] = (
        explore_propensity if aid in exploration_set else 1.0
    )
    paper_tags[aid]["policy_id"] = _RANKER_VERSION
```
Thread `propensity` and `policy_id` into template context the same way as `query_id`.
---
#### [MODIFY] [search.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/search.py)
Search is fully deterministic, so `propensity = 1.0` for all results.
---
#### [MODIFY] [action_buttons.html](file:///c:/Users/siddh/ResearchIT-Final/app/templates/partials/action_buttons.html)
Add `propensity` and `policy_id` to `hx-vals`.
---
#### [MODIFY] [events.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/events.py)
Add `propensity: float = Form(default=0.0)` and `policy_id: str = Form(default="")` to both endpoints. Forward to `db.log_interaction()`.
---
## Day 4: B3 β Cluster Snapshot Versioning
### What it enables
Cluster history, debugging "why did recs shift?", content-addressed key for Phase 8a LLM summary cache.
---
#### [MODIFY] [db.py](file:///c:/Users/siddh/ResearchIT-Final/app/db.py)
**1. Add `cluster_snapshots` DDL to `_SCHEMA`:**
```sql
CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    cluster_idx INTEGER NOT NULL,
    medoid_paper_id TEXT NOT NULL,
    importance REAL NOT NULL,
    paper_ids TEXT NOT NULL,
    medoid_embedding_blob BLOB,
    snapshot_date TEXT NOT NULL DEFAULT (datetime('now')),
    paper_ids_hash TEXT NOT NULL,
    PRIMARY KEY (user_id, snapshot_id, cluster_idx)
);
CREATE INDEX IF NOT EXISTS idx_snap_user_date ON cluster_snapshots(user_id, snapshot_date DESC);
CREATE INDEX IF NOT EXISTS idx_snap_hash ON cluster_snapshots(paper_ids_hash);
```
**2. Add `save_cluster_snapshot()` and `prune_old_snapshots()` functions.**
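The two functions could look roughly like the sketch below. It is written against the standard `sqlite3` module and makes several assumptions that the plan does not specify: clusters are passed as plain dicts, `paper_ids` is stored as a JSON array, `paper_ids_hash` is a SHA-256 over the sorted IDs (so it is order-independent), and `snapshot_id` is a random UUID. The embedding blob is left out for brevity.

```python
from __future__ import annotations

import hashlib
import json
import sqlite3
import uuid

# Copy of the DDL above so the sketch runs on its own.
_DDL = """
CREATE TABLE IF NOT EXISTS cluster_snapshots (
    user_id TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    cluster_idx INTEGER NOT NULL,
    medoid_paper_id TEXT NOT NULL,
    importance REAL NOT NULL,
    paper_ids TEXT NOT NULL,
    medoid_embedding_blob BLOB,
    snapshot_date TEXT NOT NULL DEFAULT (datetime('now')),
    paper_ids_hash TEXT NOT NULL,
    PRIMARY KEY (user_id, snapshot_id, cluster_idx)
)
"""


def hash_paper_ids(paper_ids: list[str]) -> str:
    """Content-addressed key: SHA-256 over the sorted IDs (assumed scheme)."""
    joined = "\n".join(sorted(paper_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def save_cluster_snapshot(conn: sqlite3.Connection, user_id: str, clusters: list[dict]) -> str:
    """Append one row per cluster under a fresh snapshot_id and return that id."""
    snapshot_id = uuid.uuid4().hex
    for c in clusters:
        conn.execute(
            "INSERT INTO cluster_snapshots "
            "(user_id, snapshot_id, cluster_idx, medoid_paper_id, importance, "
            "paper_ids, paper_ids_hash) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                user_id,
                snapshot_id,
                c["cluster_idx"],
                c["medoid_paper_id"],
                c["importance"],
                json.dumps(c["paper_ids"]),
                hash_paper_ids(c["paper_ids"]),
            ),
        )
    conn.commit()
    return snapshot_id


def prune_old_snapshots(conn: sqlite3.Connection, retention_days: int = 30) -> int:
    """Delete snapshot rows older than the retention window; return rows deleted."""
    cur = conn.execute(
        "DELETE FROM cluster_snapshots WHERE snapshot_date < datetime('now', ?)",
        (f"-{int(retention_days)} days",),
    )
    conn.commit()
    return cur.rowcount


# Demo with placeholder clusters.
conn = sqlite3.connect(":memory:")
conn.execute(_DDL)
demo_clusters = [
    {"cluster_idx": 0, "medoid_paper_id": "2401.00001", "importance": 0.7,
     "paper_ids": ["2401.00001", "2401.00002"]},
    {"cluster_idx": 1, "medoid_paper_id": "2401.00003", "importance": 0.3,
     "paper_ids": ["2401.00003"]},
]
snapshot_id = save_cluster_snapshot(conn, "user-1", demo_clusters)
```

Hashing the sorted IDs means two reclusterings that produce the same membership share a `paper_ids_hash`, which is the property a summary cache keyed on this column would rely on.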
---
#### [MODIFY] [recommendations.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/recommendations.py)
After `save_clusters_to_db(user_id, clusters)` (line ~253), call `db.save_cluster_snapshot()`.
---
#### [MODIFY] [main.py](file:///c:/Users/siddh/ResearchIT-Final/app/main.py)
Call `db.prune_old_snapshots(retention_days=30)` in the lifespan handler after `init_db()`.
---
## Day 5: B4 β Semantic Scholar Author Import
### What it enables
"Paste S2 URL → 20 implicit saves": replaces the friction of manual seed search.
---
#### [NEW] [s2_svc.py](file:///c:/Users/siddh/ResearchIT-Final/app/s2_svc.py)
Functions:
- `parse_author_input(text) -> str | None`: accepts an S2 URL, a raw S2 ID, or an ORCID
- `resolve_orcid(orcid) -> str | None`: resolves an ORCID via S2 author search
- `fetch_author_arxiv_papers(author_id, limit=50) -> list[str]`: returns arXiv IDs
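A possible shape for `parse_author_input` is sketched below. The regular expressions are assumptions about the accepted input formats (an S2 author URL ending in a numeric ID, a bare numeric ID, or a 16-character ORCID with an optional `https://orcid.org/` prefix), and the IDs in the example are placeholders. It makes no network calls.

```python
from __future__ import annotations

import re

# Assumed formats: /author/<optional-slug>/<numeric id>, with the id ending the path segment.
_S2_URL_RE = re.compile(r"semanticscholar\.org/author/(?:[^/?#]+/)?(\d+)(?=[/?#]|$)")
_S2_ID_RE = re.compile(r"\d+")
# ORCID iDs are four hyphen-separated groups of four; the last character may be X.
_ORCID_RE = re.compile(r"(?<![\d-])(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$", re.IGNORECASE)


def parse_author_input(text: str) -> str | None:
    """Normalize user input to an S2 author ID or an ORCID, else None."""
    cleaned = text.strip()
    url_match = _S2_URL_RE.search(cleaned)
    if url_match:
        return url_match.group(1)
    if _S2_ID_RE.fullmatch(cleaned):
        return cleaned
    orcid_match = _ORCID_RE.search(cleaned)
    if orcid_match:
        return orcid_match.group(1).upper()
    return None


# Placeholder inputs, for illustration only.
from_url = parse_author_input("https://www.semanticscholar.org/author/Jane-Doe/1234567")
from_id = parse_author_input("  1234567 ")
from_orcid = parse_author_input("https://orcid.org/0000-0002-1825-0097")
from_junk = parse_author_input("not an author")
```

Under this sketch a caller can tell the two identifier kinds apart by the hyphens: a result containing `-` is an ORCID and would go through `resolve_orcid()` before fetching papers.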
---
#### [MODIFY] [config.py](file:///c:/Users/siddh/ResearchIT-Final/app/config.py)
Add `S2_API_KEY = os.getenv("S2_API_KEY", "")`; the key is already in `.env`.
---
#### [MODIFY] [onboarding.py](file:///c:/Users/siddh/ResearchIT-Final/app/routers/onboarding.py)
Add `POST /api/onboarding/import-author` endpoint.
---
#### [NEW] Template partials for import step
- `partials/import_author.html`: the import form step
- `partials/import_success.html`: success confirmation
- `partials/import_error.html`: error message
---
## Verification Plan
### Automated Tests
After each day:
```bash
python -m pytest tests/ -v --tb=short
```
**New test files:**
- Day 1: Add `test_qdrant_scores_are_real_cosines` to `tests/test_phase6_feature_wiring.py`
- Day 2: Create `tests/test_instrumentation.py` with `test_query_id_round_trips`
- Day 3: Add `test_propensity_sums_correctly` to instrumentation tests
- Day 4: Add `test_snapshot_appended_on_each_recluster`, `test_prune_respects_retention`
- Day 5: Add `test_s2_import_saves_papers_with_correct_source_tag`
### Manual Verification
- Day 1: Run `curl -s https://siddhm11-researchit.hf.space/healthz/reranker` and confirm the model is still loaded after the code change
- Day 5: Test author import with real S2 profile URL
---
## Documentation Updates (after all days)
- [ ] CLAUDE.md: Add Rule 3.11: "Every interaction must carry `query_id`, `propensity`, and `policy_id`"
- [ ] TASK-TRACKER.md: Add Phase 6.5 section with checklist
- [ ] README.md: Update test count
- [ ] PHASE6-Reranker-Framing.md: Add live verification timestamp
---
## Open Questions
> [!IMPORTANT]
> **Q1:** The framing doc proposes `_RANKER_VERSION` as the `policy_id`. Currently it's `"v4.1_quota_hungarian_suppression"`. Should we also bump this to `"v6.5_lightgbm_real_cosines"` when Day 1 lands? It would make A/B-style log analysis cleaner.
> [!IMPORTANT]
> **Q2:** Day 5 (S2 author import) requires `httpx` as a dependency. It is already used by `turso_svc.py`, so no new install should be needed. Can you confirm?
> [!NOTE]
> **Q3:** The framing doc suggests cluster snapshot pruning at startup. For a simple MVP this is fine. Phase 7 can upgrade to APScheduler if needed.