siddhm11 committed on
Commit
d33f7fa
·
1 Parent(s): 19ccf32

fix(onboarding): correct hybrid_search call + add CLAUDE.md Rule 3.10

Onboarding seed search bug (Phase 5):
- onboarding.py: hybrid_search_svc.hybrid_search(top_k=6) does not exist
  Correct call: hybrid_search_svc.search(limit=6)
- search() returns list[str], not list[dict]; remove dict comprehension
- This was silently falling through to the slow arXiv API on every search

CLAUDE.md:
- Add Rule 3.10: per-candidate cluster identity invariant
  paper_cluster_map must flow through to reranker as per-candidate arrays
  Do not re-introduce dominant-cluster shortcuts

Tests: 203 passed, 0 failures
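
The failure mode described in the commit message can be reproduced in isolation. The sketch below is a guess at the mechanism, not the project's code: it assumes the call sat inside a broad `except` that falls back to the arXiv API (the diff shows only the `try:`). `HybridSearchService`, `slow_arxiv_fallback`, `seed_ids_buggy`, and `seed_ids_fixed` are hypothetical stand-ins.

```python
import asyncio


class HybridSearchService:
    """Hypothetical stand-in: exposes search() but no hybrid_search()."""

    async def search(self, query: str, limit: int = 10) -> list[str]:
        # Returns bare arXiv ID strings, not dicts.
        return ["0704.0001", "2101.00001"][:limit]


async def slow_arxiv_fallback(query: str) -> list[str]:
    """Hypothetical stand-in for the slow arXiv API path."""
    return ["fallback:" + query]


async def seed_ids_buggy(svc: HybridSearchService, query: str) -> list[str]:
    try:
        # Attribute lookup fails here: the service has no hybrid_search().
        results = await svc.hybrid_search(query, top_k=6)
        return [r["arxiv_id"] for r in results]
    except Exception:
        # A broad handler hides the AttributeError, so every call
        # takes the slow path without any visible failure.
        return await slow_arxiv_fallback(query)


async def seed_ids_fixed(svc: HybridSearchService, query: str) -> list[str]:
    try:
        # search() already returns list[str]; no comprehension needed.
        return await svc.search(query, limit=6)
    except Exception:
        return await slow_arxiv_fallback(query)
```

Under these assumptions the buggy variant never reaches the index, which matches the "silently falling through" symptom; narrowing the `except` clause or logging the exception would have surfaced the error earlier.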

Files changed (2)
  1. CLAUDE.md +4 -0
  2. app/routers/onboarding.py +2 -2
CLAUDE.md CHANGED
@@ -156,6 +156,10 @@ End-to-end feed generation target: **<30ms on CPU** (excluding metadata fetch, w
 
 ArXiv IDs can have leading zeros (e.g., `0704.0001`). **Treat all arXiv IDs as strings, never integers.** Pandas will silently coerce them — always pass `dtype=str` to `read_csv`. This is a real bug that has bitten this project before.
 
+### 3.10 Per-candidate cluster identity (Phase 6)
+
+The per-cluster origin of each retrieved candidate is preserved end-to-end via `paper_cluster_map: dict[str, int]` (built in `recommendations.py` before `merge_quota_results()`). This mapping flows through to the reranker as per-candidate `cluster_importance` (N,) and `cluster_medoid` (N, 1024) arrays. **Do not re-introduce dominant-cluster shortcuts as "simplifications"** — LightGBM feature slot 24 (`cluster_distance_to_medoid`) depends on per-candidate medoids to correctly score papers from minority-interest clusters.
+
 ---
 
 ## 4. What is in scope vs out of scope right now
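
Rule 3.10 above can be illustrated with a small sketch of how a `paper_cluster_map` might be expanded into per-candidate features. This is an assumption-heavy illustration, not the code in `recommendations.py`: the function name, the dictionary inputs, the use of plain lists in place of arrays, the 2-dimensional toy vectors (the rule specifies 1024), and the Euclidean metric are all hypothetical.

```python
import math


def per_candidate_features(
    candidate_ids: list[str],
    paper_cluster_map: dict[str, int],
    cluster_importance: dict[int, float],
    cluster_medoids: dict[int, list[float]],
    embeddings: dict[str, list[float]],
) -> tuple[list[float], list[list[float]], list[float]]:
    """Look up each candidate's own cluster instead of one dominant cluster."""
    importance, medoids, distances = [], [], []
    for pid in candidate_ids:
        cid = paper_cluster_map[pid]           # cluster that retrieved this paper
        importance.append(cluster_importance[cid])
        medoids.append(cluster_medoids[cid])
        # Assumed metric: Euclidean distance to the candidate's own medoid.
        distances.append(math.dist(embeddings[pid], cluster_medoids[cid]))
    return importance, medoids, distances


# Toy data: cluster 0 is dominant, cluster 1 is a minority interest.
cluster_map = {"0704.0001": 0, "2101.00001": 1}
importances = {0: 0.8, 1: 0.2}
medoid_by_cluster = {0: [0.0, 0.0], 1: [3.0, 4.0]}
vectors = {"0704.0001": [0.0, 0.0], "2101.00001": [3.0, 4.0]}

imp, med, dist = per_candidate_features(
    list(cluster_map), cluster_map, importances, medoid_by_cluster, vectors
)

# The shortcut the rule forbids: measure every paper against the dominant medoid.
shortcut_dist = math.dist(vectors["2101.00001"], medoid_by_cluster[0])
```

In this toy setup the minority-cluster paper sits exactly on its own medoid, while the dominant-cluster shortcut would report it as far away, which is the kind of mis-scoring the rule warns about.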
app/routers/onboarding.py CHANGED
@@ -95,8 +95,8 @@ async def seed_search(
     papers = []
     if q.strip():
         try:
-            results = await hybrid_search_svc.hybrid_search(q.strip(), top_k=6)
-            arxiv_ids = [r["arxiv_id"] for r in results]
+            results = await hybrid_search_svc.search(q.strip(), limit=6)
+            arxiv_ids = results  # search() returns list[str] directly
             if arxiv_ids:
                 meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
                 missing = [aid for aid in arxiv_ids if aid not in meta]
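
Since the fix relies on `search()` returning plain ID strings, one option is to make that contract explicit at the call site. The guard below is a hypothetical helper, not part of this commit; it also shows the `missing` computation from the hunk above on toy data.

```python
def ensure_id_list(results: object) -> list[str]:
    """Hypothetical guard: fail loudly if results are not a list of ID strings."""
    if not isinstance(results, list) or not all(isinstance(r, str) for r in results):
        raise TypeError("expected list[str] of arXiv IDs from search()")
    return results


# Toy data standing in for the search results and the metadata batch.
arxiv_ids = ensure_id_list(["0704.0001", "2101.00001"])
meta = {"0704.0001": {"title": "example"}}

# Same expression as in the diff: IDs with no metadata row.
missing = [aid for aid in arxiv_ids if aid not in meta]
```

A check like this turns a shape mismatch into an immediate `TypeError` rather than a fallback, at the cost of one pass over at most `limit` results.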