siddhm11 committed on
Commit d33f7fa · 1 Parent(s): 19ccf32
fix(onboarding): correct hybrid_search call + add CLAUDE.md Rule 3.10
Onboarding seed search bug (Phase 5):
- onboarding.py: hybrid_search_svc.hybrid_search(top_k=6) does not exist.
  Correct call: hybrid_search_svc.search(limit=6)
- search() returns list[str], not list[dict]; remove dict comprehension
- This was silently falling through to the slow arXiv API on every search

CLAUDE.md:
- Add Rule 3.10: per-candidate cluster identity invariant.
  paper_cluster_map must flow through to reranker as per-candidate arrays.
  Do not re-introduce dominant-cluster shortcuts.

Tests: 203 passed, 0 failures
- CLAUDE.md +4 -0
- app/routers/onboarding.py +2 -2
CLAUDE.md
CHANGED
@@ -156,6 +156,10 @@ End-to-end feed generation target: **<30ms on CPU** (excluding metadata fetch, w
 
 ArXiv IDs can have leading zeros (e.g., `0704.0001`). **Treat all arXiv IDs as strings, never integers.** Pandas will silently coerce them — always pass `dtype=str` to `read_csv`. This is a real bug that has bitten this project before.
 
+### 3.10 Per-candidate cluster identity (Phase 6)
+
+The per-cluster origin of each retrieved candidate is preserved end-to-end via `paper_cluster_map: dict[str, int]` (built in `recommendations.py` before `merge_quota_results()`). This mapping flows through to the reranker as per-candidate `cluster_importance` (N,) and `cluster_medoid` (N, 1024) arrays. **Do not re-introduce dominant-cluster shortcuts as "simplifications"** — LightGBM feature slot 24 (`cluster_distance_to_medoid`) depends on per-candidate medoids to correctly score papers from minority-interest clusters.
+
 ---
 
 ## 4. What is in scope vs out of scope right now
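To make the invariant in Rule 3.10 concrete, here is a minimal sketch of expanding cluster-level values into per-candidate sequences. It is an illustration only, not the project's code: `per_candidate_features`, `importance_by_cluster`, and `medoid_by_cluster` are hypothetical names, plain lists stand in for the `(N,)` and `(N, 1024)` arrays, and the medoids are 2-dimensional toy vectors.

```python
def per_candidate_features(
    candidate_ids: list[str],
    paper_cluster_map: dict[str, int],
    importance_by_cluster: dict[int, float],
    medoid_by_cluster: dict[int, list[float]],
) -> tuple[list[float], list[list[float]]]:
    """Expand cluster-level values into one entry per candidate (hypothetical helper)."""
    cluster_importance: list[float] = []
    cluster_medoid: list[list[float]] = []
    for paper_id in candidate_ids:
        # Look up the cluster this specific candidate was retrieved from,
        # rather than assuming every candidate belongs to the dominant cluster.
        cluster_id = paper_cluster_map[paper_id]
        cluster_importance.append(importance_by_cluster[cluster_id])
        cluster_medoid.append(medoid_by_cluster[cluster_id])
    return cluster_importance, cluster_medoid


# Toy data: two candidates from cluster 0, one from a minority cluster 1.
paper_cluster_map = {"0704.0001": 0, "1706.03762": 0, "2005.14165": 1}
importance_by_cluster = {0: 0.8, 1: 0.2}
medoid_by_cluster = {0: [1.0, 0.0], 1: [0.0, 1.0]}

importance, medoids = per_candidate_features(
    list(paper_cluster_map), paper_cluster_map, importance_by_cluster, medoid_by_cluster
)
```

In this sketch the third candidate keeps the medoid of cluster 1. A dominant-cluster shortcut would hand it the medoid of cluster 0, which is the situation the rule warns against for a distance-to-medoid feature.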
app/routers/onboarding.py
CHANGED
@@ -95,8 +95,8 @@ async def seed_search(
     papers = []
     if q.strip():
         try:
-            results = await hybrid_search_svc.
-            arxiv_ids =
+            results = await hybrid_search_svc.search(q.strip(), limit=6)
+            arxiv_ids = results  # search() returns list[str] directly
             if arxiv_ids:
                 meta = await turso_svc.fetch_metadata_batch(arxiv_ids)
                 missing = [aid for aid in arxiv_ids if aid not in meta]
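The corrected call shape can be exercised in isolation with a small sketch. It assumes only what the commit message states: `search()` takes a query plus `limit` and returns `list[str]`. `StubHybridSearch` and `seed_ids` are hypothetical stand-ins, not the real `hybrid_search_svc` or the router code.

```python
import asyncio


class StubHybridSearch:
    """Hypothetical stand-in for hybrid_search_svc; only the search() shape matters."""

    async def search(self, query: str, limit: int = 10) -> list[str]:
        # Return arXiv IDs as plain strings, matching the list[str] contract
        # described in the commit message.
        corpus = ["0704.0001", "1706.03762", "2005.14165"]
        return corpus[:limit]


async def seed_ids(svc: StubHybridSearch, q: str, limit: int = 6) -> list[str]:
    """Return up to `limit` arXiv IDs for a seed query, or [] for a blank query."""
    q = q.strip()
    if not q:
        return []
    results = await svc.search(q, limit=limit)
    # No dict unpacking needed: the results are already ID strings.
    return list(results)


ids = asyncio.run(seed_ids(StubHybridSearch(), "  attention  "))
blank = asyncio.run(seed_ids(StubHybridSearch(), "   "))
```

Because the original call sat inside a `try:`, a wrong method name would raise `AttributeError` only at request time. If the handler caught that error broadly, it would explain the silent fallback to the arXiv API, though the `except` clause is not shown in this hunk.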